To coincide with the kick-off of the 2020 TensorFlow Developer Summit, Google today released Objectron, a pipeline that recognizes objects in 2D images and estimates their poses and sizes with an AI model. The company says this has implications for robotics, self-driving vehicles, image retrieval, and augmented reality. For example, it could help a factory robot avoid obstacles in real time.
Tracking objects in 3D is a difficult task, especially on limited computing resources (such as a smartphone system-on-chip). And it gets harder when the only available images (usually videos) are 2D, because depth information is missing and an object's apparent shape changes with viewpoint.
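A toy pinhole-projection sketch makes that ambiguity concrete (the focal length and coordinates are made-up numbers, not from the article): an object twice as large at twice the distance lands on exactly the same pixels, so a single 2D image cannot distinguish the two.

```python
# Why monocular 3D estimation is ill-posed: under a pinhole camera,
# scaling an object's size and distance by the same factor leaves
# its 2D projection unchanged. Focal length f is an arbitrary toy value.

def project(x, y, z, f=500.0):
    """Pinhole projection of a 3D camera-space point to 2D pixel coordinates."""
    return (f * x / z, f * y / z)

near = project(0.2, 0.1, 1.0)  # small object, 1 m away
far = project(0.4, 0.2, 2.0)   # object twice as big, 2 m away
print(near, far)               # identical 2D coordinates
```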
The Google team behind Objectron developed a toolset that annotators used to label 3D bounding boxes (i.e., cuboid borders) for objects, using a split-screen view that shows the 2D video frames alongside a 3D view containing point clouds, camera positions, and detected planes. Annotators drew 3D bounding boxes in the 3D view and verified their placement by checking the projections in the 2D video frames. For static objects, they only had to annotate the target object in a single frame; the tool propagated the object's location to all frames using ground-truth camera pose information from the AR session data.
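As a rough illustration of that verification step, the sketch below projects the eight corners of a hypothetical 3D box into a 2D frame with a simple pinhole model. The box pose, focal length, and principal point here are invented for the example; the real tool works from the AR session's calibrated camera poses.

```python
import itertools

def box_corners(cx, cy, cz, sx, sy, sz):
    """The 8 corners of an axis-aligned 3D bounding box (center + size),
    given here directly in camera coordinates for simplicity."""
    return [(cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
            for dx, dy, dz in itertools.product((-1, 1), repeat=3)]

def project_corners(corners, f=500.0, ppx=320.0, ppy=240.0):
    """Pinhole-project camera-space corners to pixel coordinates,
    as an annotator's 2D overlay check might do."""
    return [(ppx + f * x / z, ppy + f * y / z) for x, y, z in corners]

# A toy 0.5 m cube, 2 m in front of the camera.
corners2d = project_corners(box_corners(0.0, 0.0, 2.0, 0.5, 0.5, 0.5))
```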
To complement the real data and increase the accuracy of the AI model's predictions, the team developed an engine that places virtual objects in scenes captured with AR session data. This allows camera poses, detected flat surfaces, and estimated lighting to be used to create physically plausible placements with lighting that matches the scene. The result is high-quality synthetic data in which the rendered objects respect the scene geometry and blend seamlessly into real backgrounds. In validation tests, training with the synthetic data increased accuracy by about 10%.
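One step such a placement engine needs is orienting a virtual object so it sits flush on a detected surface, which can be done by building a local coordinate frame from the plane's normal. The sketch below is an illustrative guess at that step, not Google's actual implementation.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plane_frame(normal):
    """Build an orthonormal frame whose 'up' axis is the plane normal,
    so a rendered object rests flat on the detected surface."""
    up = normalize(normal)
    # Pick any helper direction not parallel to the normal.
    helper = (1.0, 0.0, 0.0) if abs(up[0]) < 0.9 else (0.0, 1.0, 0.0)
    right = normalize(cross(helper, up))
    forward = cross(up, right)
    return right, up, forward

# A nearly horizontal floor plane, slightly tilted (toy normal vector).
right, up, forward = plane_frame((0.1, 1.0, 0.05))
```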
Even better, the current version of the Objectron model is light enough to run in real time on flagship mobile devices. The Adreno 650 mobile graphics chip, found in phones such as the LG V60 ThinQ, Samsung Galaxy S20+, and Sony Xperia 1 II, can process around 26 frames per second.
Objectron is available in MediaPipe, a framework for building cross-platform AI pipelines that combines fast inference with media processing (such as video decoding). Models for recognizing shoes and chairs, as well as an end-to-end demo app, are available.
The team plans to share additional solutions with the research and development community to stimulate new use cases, applications, and research efforts. It also intends to extend the Objectron model to further object categories and to improve its on-device performance.