On March 14, Chinese artificial intelligence pioneer SenseTime unveiled what it called the "largest multimodal open-source large language model", amid the latest AI wave triggered by ChatGPT. Named Intern 2.5, the model is the largest and most accurate on ImageNet among the world's open-source models, and it is the only model on the COCO object detection benchmark to exceed 65.0 mAP, SenseTime said.
Intern 2.5’s cross-modal task processing capability can provide efficient and accurate perception and understanding support for automated driving and robots. The model was jointly developed by SenseTime, Shanghai Artificial Intelligence Laboratory, Tsinghua University, the Chinese University of Hong Kong and Shanghai Jiao Tong University.
Since its release, Intern 2.5 has been available through OpenGVLab, a general visual open-source platform in which SenseTime participates.
As applications multiply and grow rapidly, traditional computer vision has struggled with many of the specific tasks the real world demands. Intern 2.5, a higher-level visual system with universal scene perception and complex problem-solving capabilities, defines tasks through text, making it possible to flexibly specify the requirements of different scenarios. Given visual images and task prompts, it can produce instructions or answers, giving it advanced perception and problem-solving abilities in general scenarios such as image description, visual question-answering, visual reasoning and text recognition.
In automated driving, for example, it can greatly improve perception and understanding, accurately help judge the status of traffic lights, road signs and other cues, and supply useful information for a vehicle's decision-making and route planning.
It can also generate content: based on requirements put forward by users, a diffusion-model generation algorithm produces high-quality, realistic images.
The technology's outstanding performance in cross-modal image-and-text tasks comes from the effective integration of vision, speech and multitask modeling capabilities.
On ImageNet, a large visual database designed for visual object recognition research, the model achieves 90.1% accuracy using public data only. Apart from models from Google and Microsoft, it is the only one to exceed 90.0% accuracy, SenseTime said.