Contextual AI Introduces LENS: An AI Framework for Augmented Vision Language Models Outperforms Flamingo by 9% (56->65%) on VQAv2

Large language models (LLMs) have transformed natural language understanding in recent years, demonstrating remarkable aptitude in semantic understanding, question answering, and text generation, particularly in zero-shot and few-shot settings. As seen in Fig. 1(a), several methods have been proposed for applying LLMs to vision tasks. One trains a visual encoder to represent each image as a series of continuous embeddings that the LLM can understand. Another uses a contrastively trained frozen vision encoder while adding new layers to the frozen LLM, which are then learned from scratch.

Another method recommends training a lightweight transformer to align a frozen (contrastively pre-trained) visual encoder and a frozen LLM. Although this research has made progress, it is still difficult to justify the computational cost of the additional pre-training stage(s). In addition, huge datasets of text, images, and video are required to align the visual and language modalities with an existing LLM. Flamingo, for example, adds new cross-attention layers to a pre-trained LLM to give it visual capabilities.

Figure 1: Comparing methods for aligning visual and language modalities. (a) Multimodal pre-training using paired or web datasets; (b) LENS, a pre-training-free technique that can be used with any off-the-shelf LLM without the need for additional multimodal datasets. Unlike LENS, the previous approaches require joint alignment pre-training on substantial multimodal datasets to perform visual tasks.

The multimodal pre-training phase requires a staggering 2 billion image-text pairs and 43 million webpages and can take up to 15 days, even when employing a pre-trained image encoder and a pre-trained frozen LLM. Instead, using a variety of “vision modules”, the researchers extract information from visual inputs and produce detailed textual representations (such as tags, attributes, actions, and relationships, among other things), which they then feed directly to the LLM, avoiding the need for further multimodal pre-training, as shown in Fig. 1(b). Researchers from Contextual AI and Stanford University introduce LENS (Large Language Models ENhanced to See), a modular strategy that uses an LLM as a “reasoning module” and operates over separate “vision modules”.

In the LENS approach, they first extract rich textual information using pre-trained vision modules, such as contrastive models and image-captioning models. The text is then sent to the LLM, enabling it to perform tasks including object recognition and vision-and-language (V&L) tasks. LENS bridges the gap between modalities at no additional cost by eliminating the need for extra multimodal pre-training stages or data. Integrating LENS gives them a model that operates across multiple domains out of the box, without additional cross-domain pre-training. This integration also lets them immediately use the latest developments in computer vision and natural language processing, maximizing the benefits of both fields.
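The flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: the names `describe_image`, `build_prompt`, and `answer` are hypothetical, the prompt layout is an assumption, and stub functions stand in for the real pre-trained vision modules and the frozen LLM.

```python
from typing import Callable, Dict, List

# A "vision module" here is any callable that maps an image to a list of strings.
# The image type is left as `object` because the concrete models are not
# specified in this article; everything below is a hypothetical sketch.
VisionModule = Callable[[object], List[str]]


def describe_image(image: object, modules: Dict[str, VisionModule]) -> Dict[str, List[str]]:
    """Run every vision module and collect its textual output under its name."""
    return {name: module(image) for name, module in modules.items()}


def build_prompt(description: Dict[str, List[str]], question: str) -> str:
    """Flatten the textual description into one prompt for a frozen LLM.

    The layout is an illustrative assumption, not the paper's template.
    """
    lines = [f"{name.capitalize()}: {', '.join(items)}" for name, items in description.items()]
    lines.append(f"Question: {question}")
    lines.append("Short answer:")
    return "\n".join(lines)


def answer(image: object, modules: Dict[str, VisionModule],
           llm: Callable[[str], str], question: str) -> str:
    """Describe the image in text, then let the (unmodified) LLM reason over it."""
    return llm(build_prompt(describe_image(image, modules), question))


if __name__ == "__main__":
    # Stub modules and a stub LLM stand in for real pre-trained models.
    stub_modules = {
        "tags": lambda img: ["dog", "frisbee"],
        "attributes": lambda img: ["brown fur", "open mouth"],
        "captions": lambda img: ["a dog jumping to catch a frisbee"],
    }
    echo_llm = lambda prompt: prompt  # returns the prompt so it can be inspected
    print(answer(None, stub_modules, echo_llm, "What is the dog catching?"))
```

Because the only interface between the two sides is plain text, swapping in a newer tagger, captioner, or LLM would only mean passing a different callable; nothing is retrained.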

They provide the following contributions:

They present LENS, a modular method that handles computer vision challenges by using the few-shot, in-context learning capabilities of language models through natural language descriptions of visual inputs.

LENS gives any off-the-shelf LLM the ability to see without additional training or data.

They use frozen LLMs to handle object recognition and visual reasoning tasks without further vision-and-language alignment or multimodal data. Experimental results show that their approach achieves zero-shot performance that is competitive with or better than jointly pre-trained end-to-end models such as Kosmos and Flamingo. A partial implementation of their paper is available on GitHub.

Aneesh Tickoo is a Consulting Intern at MarktechPost. She is currently pursuing her BA in Data Science and Artificial Intelligence from Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects that harness the power of machine learning. Her research interest is image processing and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.
