Meta has announced a computer vision model that learns to recognize images while also building contextual knowledge that makes artificial intelligence less clunky and costly.
“This model, the Image Joint Embedding Predictive Architecture (I-JEPA), learns by creating an internal model of the outside world, which compares abstract representations of images (rather than comparing the pixels themselves),” the data, VR and AI biz explained in a blog post.
“I-JEPA delivers strong performance on multiple computer vision tasks, and it’s much more computationally efficient than other widely used computer vision models.”
Computational efficiency means less GPU time is required for training – Meta managed to train a 632 million-parameter vision transformer model on 16 Nvidia A100 GPUs in less than 72 hours. The resulting model, the company claims, outperforms other methods like Data2vec, Context Autoencoders, and Masked Autoencoders for low-shot classification on the ImageNet dataset.
Meta claims alternative self-supervised learning methods take two to 10 times more GPU hours – and achieve worse error rates – when trained with the same amount of data.
In a paper titled "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture," Meta boffins, including outspoken AI pioneer Yann LeCun, explain that I-JEPA works by trying to predict missing information in subdivided portions of images.
Where generative methods (such as Data2vec) learn by masking portions of the input and trying to predict the missing pixels, I-JEPA operates on a more substantial scale – blocks large enough to carry semantic meaning rather than mere pixel-level detail.
Because these pieces convey contextual information about their adjoining blocks, the model can use that information to make better predictions.
The result is that I-JEPA is less prone to errors – like creating hands with extra fingers – when generating images. Generative architectures, Meta says, often have trouble with human hands because they try to fill in every bit of information without a conceptual basis for the scene.
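The core trick described above – predicting abstract representations of a masked block rather than its raw pixels – can be illustrated with a toy NumPy sketch. Everything here (the linear "encoders," the weight matrices `W_ctx`, `W_tgt`, `W_pred`, the block sizes) is made up for illustration and is not Meta's actual code or API:

```python
# Toy sketch of the I-JEPA idea: predict the *embedding* of a masked
# image block from the embeddings of visible context blocks, and score
# the prediction in embedding space rather than pixel space.
import numpy as np

rng = np.random.default_rng(0)

def encode(block, W):
    """Stand-in 'encoder': a fixed linear map from flattened pixels to an embedding."""
    return np.tanh(W @ block.ravel())

# Fake 4x4-pixel blocks cut from an image; one block is masked out as the target.
context_blocks = [rng.normal(size=(4, 4)) for _ in range(3)]
target_block = rng.normal(size=(4, 4))

emb_dim = 8
W_ctx = rng.normal(size=(emb_dim, 16)) * 0.1        # context encoder weights
W_tgt = W_ctx.copy()                                # target encoder (a moving-average copy in the paper)
W_pred = rng.normal(size=(emb_dim, emb_dim)) * 0.1  # predictor weights

# Predict the masked block's embedding from the pooled context embeddings...
ctx = np.mean([encode(b, W_ctx) for b in context_blocks], axis=0)
prediction = W_pred @ ctx

# ...and compare against the target encoder's embedding of the real block.
# A generative model would instead have to reconstruct all 16 pixels.
target = encode(target_block, W_tgt)
loss = float(np.mean((prediction - target) ** 2))
print(f"embedding-space loss: {loss:.4f}")
```

Because the loss lives in an abstract embedding space, the model is never asked to reproduce every pixel of, say, a hand – only a summary of what the block contains.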
Also, there’s reportedly no need for additional fine-tuning – a common step with other approaches.
“I-JEPA demonstrates the potential of architectures for learning competitive off-the-shelf image representations without the need for extra knowledge encoded through hand-crafted image transformations,” the boffins claimed.
Meta hopes I-JEPA will lead to self-supervised learning methods that incorporate more common-sense knowledge about the world. I-JEPA has been released as open source code under the Creative Commons Attribution-NonCommercial 4.0 International Public License. ®