As technology has advanced, people can now capture their favorite moments with mobile phones and other devices. Many have surely wondered: what if some technology could turn the flat 2D photos we take into three-dimensional images?
Facebook considered this question long ago. To improve the user experience, it introduced a 3D photo feature in 2018: a new immersive format for sharing photos with friends and family. However, the feature relied on the dual-lens “portrait mode” found only on high-end smartphones, so it could not be used on ordinary mobile devices.
To let more people experience this new visual format, Facebook developed a machine learning system that can infer the 3D structure of any image, no matter which device took it or when, making 3D photo technology easy for anyone to use.
Not only that, it can also process family photos and other precious images taken decades ago. Anyone with an iPhone 7 or later, or a mid-range or better Android device, can now try the feature in the Facebook app.
Building such enhanced 3D photos required overcoming several technical challenges, such as training a model that correctly infers the 3D position of a wide variety of subjects and optimizing the system to run in under a second on a typical mobile processor. To meet these challenges, Facebook trained convolutional neural networks (CNNs) on millions of public 3D images and their accompanying depth maps, and leveraged mobile architecture-optimization techniques previously developed by Facebook AI, such as FBNet and ChamNet. The team has also recently discussed related research on 3D understanding.
This feature is now available to anyone using Facebook, so how exactly was it built? Let’s take a look at the technical details.
Delivering efficient performance on mobile devices
Given a standard RGB image, the 3D Photos CNN (3D photo convolutional neural network) estimates the distance of each pixel from the camera. The researchers achieved this in four ways:
- Building a network architecture from a set of parameterized, mobile-optimized neural building blocks.
- Automating the architecture search to find effective configurations of these blocks, enabling the system to run on a variety of devices in under a second.
- Using quantization-aware training to exploit high-performance INT8 quantization on mobile devices while minimizing the quality loss that quantization can cause.
- Deriving large amounts of training data from public 3D photos.
Neural building blocks
Facebook’s architecture uses building blocks inspired by FBNet, a framework for optimizing ConvNet architectures for resource-constrained devices such as mobile phones. A building block consists of a pointwise convolution, optional upsampling, a k×k depthwise convolution, and an additional pointwise convolution. Facebook implemented a U-Net-style architecture, modified to place FBNet building blocks along the skip connections. The U-Net encoder and decoder each contain five stages, each corresponding to a different spatial resolution.
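The block structure described above can be sketched in PyTorch. This is only an illustration of the pointwise → upsample → depthwise → pointwise pattern; the channel widths, expansion factor, and normalization choices here are assumptions, not Facebook’s searched configuration.

```python
import torch
import torch.nn as nn

class FBNetStyleBlock(nn.Module):
    """Sketch of an FBNet-style building block: pointwise conv ->
    optional upsample -> k x k depthwise conv -> pointwise conv.
    Expansion factor and channel counts are hypothetical; in the real
    system they are chosen by architecture search."""
    def __init__(self, in_ch, out_ch, kernel_size=3, expansion=4, upsample=False):
        super().__init__()
        mid_ch = in_ch * expansion
        # Pointwise (1x1) convolution expands the channel dimension.
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        # Optional upsampling, used on the decoder side of the U-Net.
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest") if upsample else nn.Identity()
        # k x k depthwise convolution: groups == channels, so each
        # channel is filtered independently (cheap on mobile).
        self.depthwise = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size, padding=kernel_size // 2,
                      groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        # Final pointwise convolution projects to the output width.
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.project(self.depthwise(self.upsample(self.expand(x))))

block = FBNetStyleBlock(16, 32, upsample=True)
out = block(torch.randn(1, 16, 8, 8))
print(out.shape)  # torch.Size([1, 32, 16, 16])
```

Depthwise-separable convolutions like this trade a small accuracy cost for a large reduction in FLOPs, which is why they dominate mobile-oriented architectures.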
Automated architecture search
To find an effective architecture configuration, Facebook AI’s ChamNet algorithm automates the search process. ChamNet iteratively samples points from the search space to train an accuracy predictor, which is then used to accelerate a genetic search for the model that maximizes predicted accuracy while satisfying specific resource constraints.
The setup uses a search space that varies each block’s channel expansion factor and number of output channels, yielding 3.4 × 10²² possible architectures. Facebook then completed the search in approximately three days on 800 Tesla V100 GPUs, setting and adjusting FLOP constraints on the model architecture to reach different operating points.
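To see how a per-block search space reaches that scale, consider a back-of-the-envelope calculation. The numbers below are hypothetical, chosen only to show how per-block choices multiply combinatorially; the real FBNet/ChamNet space differs in its exact options.

```python
# Hypothetical illustration: if each searchable block independently
# picks one of a handful of (expansion factor, output channels)
# combinations, the total number of architectures is the product of
# the per-block choices.
options_per_block = 10   # assumed number of choices per block
num_blocks = 22          # assumed number of searchable blocks

total_architectures = options_per_block ** num_blocks
print(f"{total_architectures:.1e}")  # 1.0e+22, the same order of
                                     # magnitude as the 3.4e22 quoted
```

Spaces this large cannot be enumerated, which is why ChamNet trains a predictor and searches with a genetic algorithm instead of evaluating candidates exhaustively.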
Quantization-aware training
By default, the model is trained with single-precision floating-point weights and activations, but the researchers found that quantizing both to 8 bits brings significant advantages. In particular, int8 weights require only a quarter of the storage of float32 weights, which reduces the amount of data that must be transferred to the device on first use.
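The 4× storage saving follows directly from the byte widths of the two formats. A quick check, using a hypothetical weight count for illustration:

```python
FLOAT32_BYTES = 4  # one float32 weight occupies 4 bytes
INT8_BYTES = 1     # one int8 weight occupies 1 byte

num_weights = 5_000_000  # hypothetical model size

fp32_mib = num_weights * FLOAT32_BYTES / 2**20
int8_mib = num_weights * INT8_BYTES / 2**20
ratio = fp32_mib / int8_mib

print(f"float32: {fp32_mib:.1f} MiB, int8: {int8_mib:.1f} MiB, ratio: {ratio:.0f}x")
```

For a model downloaded on first use over a mobile connection, shrinking the payload by 4× matters as much as the runtime speedup.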
Int8-based operators also have much higher throughput than their float32 counterparts, thanks to optimized libraries such as Facebook AI’s QNNPACK, which has been integrated into PyTorch. To avoid the quality degradation that quantization can cause, the team used quantization-aware training (QAT). QAT, now part of PyTorch, simulates quantization during training and supports back-propagation through it, closing the gap between training and production performance.
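A minimal QAT sketch using PyTorch’s eager-mode quantization API is shown below. This is not Facebook’s actual training code, and the toy two-layer model is an assumption; it only demonstrates the mechanism the paragraph describes: fake-quantization observers inserted during training so the network learns to tolerate int8 precision.

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

# Toy model (hypothetical): QuantStub/DeQuantStub mark where tensors
# enter and leave the quantized region of the graph.
model = nn.Sequential(
    tq.QuantStub(),
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    tq.DeQuantStub(),
)

model.train()
# Use the QNNPACK backend configuration mentioned in the article.
model.qconfig = tq.get_default_qat_qconfig("qnnpack")
tq.prepare_qat(model, inplace=True)  # inserts fake-quantize observers

# A real training loop would go here. During training, fake-quant
# simulates int8 rounding in the forward pass while gradients flow
# through via the straight-through estimator. After training,
# tq.convert(model.eval()) would produce a true int8 model.
x = torch.randn(2, 3, 16, 16)
y = model(x)
print(y.shape)  # torch.Size([2, 8, 16, 16])
```

Because the forward pass already sees quantization error during training, the converted int8 model behaves much closer to the float model than one quantized after the fact.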
Finding new ways to create 3D experiences
In addition to improving the depth estimation algorithm, the researchers are working to bring high-quality depth estimation to videos taken with mobile devices.
Video is challenging because each frame’s depth must be consistent with the next, but it is also an opportunity to improve performance: observing the same object from multiple viewpoints provides additional signal for highly accurate depth estimates. As the performance of its neural networks continues to improve, the team will also explore using techniques such as depth estimation, surface normal estimation, and spatial reasoning in real-time applications like augmented reality.
Beyond these potential new experiences, this work helps researchers better understand the content of 2D images. A better understanding of 3D scenes could also help robots navigate and interact with the physical world. By sharing the details of its 3D photo system, Facebook hopes to help the artificial intelligence community make progress in these areas and create new, advanced 3D experiences.