Researchers at UC Berkeley and Google present an AI framework that formulates visual question answering as modular code generation

https://arxiv.org/abs/2306.05392

The field of Artificial Intelligence (AI) continues to advance with the release of each new model and solution. Large Language Models (LLMs), which have recently gained enormous popularity for their impressive capabilities, are a major driver of this progress, and subdomains of AI such as natural language processing, natural language understanding, and computer vision are advancing alongside them. One area of research that has recently garnered a lot of interest from the AI and deep learning communities is Visual Question Answering (VQA): the task of answering open-ended, text-based questions about an image.

Visual Question Answering systems attempt to correctly answer natural language questions about an input image; they are designed to understand the contents of an image much as a human would and to communicate the result effectively. Recently, a team of researchers from UC Berkeley and Google Research proposed an approach called CodeVQA that tackles visual question answering through modular code generation. CodeVQA formulates VQA as a program synthesis problem and uses code-writing language models that take questions as input and generate programs as output.
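For illustration, the code-writing language model might be prompted with a short description of the available visual primitives plus a handful of question-to-program examples, and asked to continue with a program for the new question. The sketch below shows one plausible way to assemble such a prompt; the primitive names (`query`, `get_pos`), the prompt format, and the `code_llm` hook are assumptions for this sketch, not the authors' exact interface.

```python
# Sketch: few-shot prompting of a code-writing LM to turn a question into a program.
# Primitive names and the code_llm() hook are illustrative placeholders.

PRIMITIVE_DOCS = """\
# query(image, question) -> str        : answer a simple question about one image
# get_pos(image, object_name) -> (x, y): approximate pixel location of an object
"""

FEW_SHOT_EXAMPLES = """\
# Question: "Is the cup to the left of the plate?"
cup_x, _ = get_pos(image, "cup")
plate_x, _ = get_pos(image, "plate")
answer = "yes" if cup_x < plate_x else "no"
"""

def build_prompt(question: str) -> str:
    """Assemble the in-context prompt: API docs, worked examples, then the new question."""
    return f"{PRIMITIVE_DOCS}\n{FEW_SHOT_EXAMPLES}\n# Question: \"{question}\"\n"

def code_llm(prompt: str) -> str:
    """Placeholder for a call to a pre-trained code LM; plug in a model of your choice."""
    raise NotImplementedError

if __name__ == "__main__":
    print(build_prompt("Is the book above the lamp?"))
```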

The main goal of this framework is to generate Python programs that call pre-trained visual models and combine their outputs to produce an answer. The generated programs manipulate the visual model outputs and derive a solution using arithmetic and conditional logic. In contrast to previous approaches, this framework relies only on pre-trained language models, pre-trained visual models trained on image-caption pairs, and a small number of VQA examples used for in-context learning.
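As a rough, hypothetical illustration of what such a generated program could look like, the snippet below answers a two-hop spatial question by calling visual primitives and combining their outputs with conditional logic. The primitive names and their stubbed return values are invented so the sketch runs end to end; they are not taken from the paper's released code.

```python
# Hypothetical example of a program the framework might generate,
# with the visual primitives stubbed out for illustration.

def query(image, question: str) -> str:
    """Stub for a caption/VQA primitive backed by a pre-trained visual model."""
    return "yes"  # dummy value

def get_pos(image, object_name: str) -> tuple[float, float]:
    """Stub for an object-localization primitive returning (x, y) pixel coordinates."""
    return {"book": (120.0, 200.0), "lamp": (340.0, 180.0)}.get(object_name, (0.0, 0.0))

# --- program a code LM might emit for: "Is the book to the left of the lamp?" ---
image = None  # placeholder for a loaded image
book_x, _ = get_pos(image, "book")
lamp_x, _ = get_pos(image, "lamp")
answer = "yes" if book_x < lamp_x else "no"
print(answer)
```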


To extract specific visual information from the image, such as captions, pixel locations of objects, or image-text similarity scores, CodeVQA uses primitive visual APIs wrapped around visual language models. The generated code coordinates these APIs to collect the necessary data and then uses the full expressiveness of Python, including arithmetic, logical structures, loops, and other programming constructs, to reason over that data and arrive at an answer.
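To give a feel for how loops and arithmetic come into play on multi-image questions, here is a hedged sketch in the same spirit: the per-image `query` primitive is again a stand-in stub, and the counting program is an invented example rather than one produced by the paper's system.

```python
# Hypothetical multi-image program: count how many images in a set contain a dog,
# then decide whether "most" of them do. The primitive is stubbed for illustration.

def query(image, question: str) -> str:
    """Stub for a per-image question-answering primitive."""
    return "yes" if image.get("has_dog") else "no"

images = [{"has_dog": True}, {"has_dog": False}, {"has_dog": True}]  # toy image set

# --- program a code LM might emit for: "Do most of the images contain a dog?" ---
dog_count = sum(1 for img in images if query(img, "Is there a dog?") == "yes")
answer = "yes" if dog_count > len(images) / 2 else "no"
print(answer)
```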

For evaluation, the team compared the performance of the new technique against a few-shot baseline that does not use code generation. The evaluation used two benchmark datasets, GQA and COVR: GQA includes multi-hop questions generated from human-annotated scene graphs of individual Visual Genome images, while COVR contains multi-hop questions about sets of images drawn from the Visual Genome and imSitu datasets. The results showed that CodeVQA outperformed the baseline on both datasets, with an accuracy improvement of at least 3% on COVR and roughly 2% on GQA.

The team notes that CodeVQA is simple to deploy because it requires no additional training: it uses pre-trained models and a limited number of VQA examples for in-context learning, which helps tailor the generated programs to particular question-and-answer patterns. In summary, the framework leverages pre-trained LMs and visual models to provide a powerful, code-based, modular approach to VQA.


Check out the Paper and the GitHub link.


Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a major in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.


