ViQAgent: zero-shot video question answering via agent with open-vocabulary grounding validation
Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant impr...
- Autores:
-
Montes Buitrago, Tony Santiago
- Tipo de recurso:
- Trabajo de grado de pregrado
- Fecha de publicación:
- 2024
- Institución:
- Universidad de los Andes
- Repositorio:
- Séneca: repositorio Uniandes
- Idioma:
- eng
- OAI Identifier:
- oai:repositorio.uniandes.edu.co:1992/75454
- Acceso en línea:
- https://hdl.handle.net/1992/75454
- Palabra clave:
- Video question-answering
Video grounding
Multimodal
Large language model
Chain-of-thought
Vision-language models
Open-vocabulary
Ingeniería
- Rights
- embargoedAccess
- License
- Attribution 4.0 International
Summary: | Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant improvements remain in tracking objects for grounding over time and decision-making based on reasoning to better align object references with language model outputs as newer models get better at both tasks. This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA) that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. |
---|