ViQAgent: zero-shot video question answering via agent with open-vocabulary grounding validation

Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, as newer models improve at both tasks, significant gains remain to be made in tracking objects for grounding over time and in reasoning-based decision-making that better aligns object references with language-model outputs. This work presents an LLM-brained agent for zero-shot VideoQA that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state of the art in VideoQA and video understanding, with improved performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains.
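
The abstract outlines the agent's core loop: a chain-of-thought video-language model proposes an answer together with the objects and timeframe its reasoning depends on, and an open-vocabulary detector (YOLO-World in the thesis) cross-checks that those objects actually appear in the claimed window, feeding the result back to the agent. The Python sketch below illustrates that grounding-validation loop under assumed interfaces; GroundedAnswer, answer_with_cot, detect_open_vocab, the hit-ratio threshold, and the retry logic are hypothetical placeholders, not the thesis implementation.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GroundedAnswer:
    """Hypothetical container for one agent proposal."""
    answer: str                     # candidate answer text
    objects: List[str]              # object names the reasoning relies on
    timeframe: Tuple[float, float]  # (start_s, end_s) where they should appear


def validate_grounding(
    frames: List[Tuple[float, object]],                            # (timestamp_s, frame) pairs
    candidate: GroundedAnswer,
    detect_open_vocab: Callable[[object, List[str]], List[str]],   # e.g. a YOLO-World-style wrapper
    min_hit_ratio: float = 0.3,                                    # assumed threshold, not from the thesis
) -> bool:
    """Check that the objects cited by the agent are detected in the
    timeframe it claims, using the sampled frames inside that window."""
    start, end = candidate.timeframe
    window = [frame for t, frame in frames if start <= t <= end]
    if not window or not candidate.objects:
        return False
    hits = sum(
        1
        for frame in window
        if set(detect_open_vocab(frame, candidate.objects)) & set(candidate.objects)
    )
    return hits / len(window) >= min_hit_ratio


def answer_question(
    frames: List[Tuple[float, object]],
    question: str,
    answer_with_cot: Callable[[List[Tuple[float, object]], str, str], GroundedAnswer],
    detect_open_vocab: Callable[[object, List[str]], List[str]],
    max_rounds: int = 2,
) -> str:
    """Agent loop: propose an answer via chain-of-thought, validate its grounding
    with the open-vocabulary detector, and re-prompt with feedback if it fails."""
    feedback = ""
    candidate = answer_with_cot(frames, question, feedback)
    for _ in range(max_rounds):
        if validate_grounding(frames, candidate, detect_open_vocab):
            break
        feedback = (
            f"Objects {candidate.objects} were not confirmed between "
            f"{candidate.timeframe[0]:.1f}s and {candidate.timeframe[1]:.1f}s; "
            "revise the grounding or the answer."
        )
        candidate = answer_with_cot(frames, question, feedback)
    return candidate.answer

In this sketch the detector acts purely as a verifier: it never changes the answer itself, it only tells the language model whether its cited evidence was visible in the claimed window, mirroring the cross-checking of grounding timeframes described in the abstract.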

Full description

Authors:
Montes Buitrago, Tony Santiago
Resource type:
Undergraduate thesis
Publication date:
2024
Institution:
Universidad de los Andes
Repository:
Séneca: repositorio Uniandes
Language:
eng
OAI Identifier:
oai:repositorio.uniandes.edu.co:1992/75454
Online access:
https://hdl.handle.net/1992/75454
Keywords:
Video question-answering
Video grounding
Multimodal
Large language model
Chain-of-thought
Vision-language models
Open-vocabulary
Engineering
Rights
embargoedAccess
License
Attribution 4.0 International
dc.contributor.advisor.none.fl_str_mv Lozano Martínez, Fernando Enrique
dc.contributor.author.none.fl_str_mv Montes Buitrago, Tony Santiago
dc.contributor.jury.none.fl_str_mv Osma Cruz, Johann Faccelo
dc.date.issued.none.fl_str_mv 2024-01-16
dc.date.accessioned.none.fl_str_mv 2025-01-16T21:04:22Z
dc.date.accepted.none.fl_str_mv 2025-01-16
dc.date.available.none.fl_str_mv 2026-01-15
dc.type.none.fl_str_mv Trabajo de grado - Pregrado
dc.type.driver.none.fl_str_mv info:eu-repo/semantics/bachelorThesis
dc.type.version.none.fl_str_mv info:eu-repo/semantics/acceptedVersion
dc.type.coar.none.fl_str_mv http://purl.org/coar/resource_type/c_7a1f
dc.type.content.none.fl_str_mv Text
dc.type.redcol.none.fl_str_mv http://purl.org/redcol/resource_type/TP
dc.identifier.uri.none.fl_str_mv https://hdl.handle.net/1992/75454
dc.identifier.instname.none.fl_str_mv instname:Universidad de los Andes
dc.identifier.reponame.none.fl_str_mv reponame:Repositorio Institucional Séneca
dc.identifier.repourl.none.fl_str_mv repourl:https://repositorio.uniandes.edu.co/
dc.relation.references.none.fl_str_mv Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. Pali-x: On scaling up a multilingual vision and language model, 2023. 1
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. Pali: A jointly-scaled multilingual language-image model, 2023. 1
Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16901-16911, 2024. 2, 3, 5, 8
Rohan Choudhury, Koichiro Niinuma, Kris M. Kitani, and László A. Jeni. Zero-shot video question answering with procedural programs. CoRR, abs/2312.00937, 2023. 1, 3, 8, 12
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 12
Shangzhe Di and Weidi Xie. Grounded question-answering in long egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12934-12943, 2024. 1, 3, 4
Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented multimodal agent for video understanding. In Computer Vision - ECCV 2024, pages 75-92, Cham, 2025. Springer Nature Switzerland. 1, 3, 8, 12
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Forty-first International Conference on Machine Learning, 2024. 1, 3, 8, 12
Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn, 2023. 12
Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, and Libo Zhang. Context-guided spatio-temporal video grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18330-18339, 2024. 3, 4
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024. 1
Zaid Khan and Yun Fu. Consistency and uncertainty: Identifying unreliable responses from black-box vision-language models for selective visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10854-10863, 2024. 1, 3
Dohwan Ko, Ji Lee, Woo-Young Kang, Byungseok Roh, and Hyunwoo Kim. Large language models are temporal and causal reasoners for video question answering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4300-4316, Singapore, 2023. Association for Computational Linguistics. 8, 12
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. 1, 3, 12
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding, 2024. 1, 3
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2024. 12
Zhaohe Liao, Jiangtong Li, Li Niu, and Liqing Zhang. Align and aggregate: Compositional reasoning with video alignment and answer aggregation for video question-answering. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13395-13404, 2024. 1, 3
Christian Limberg, Artur Gonçalves, Bastien Rigault, and Helmut Prendinger. Leveraging YOLO-world and GPT-4v LMMs for zero-shot person detection and action recognition in drone imagery. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024. 2
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2024. 1, 3, 12
Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, and Yi Yang. Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13151-13160, 2024. 1, 3
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models, 2024. 1, 3, 12, 13
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. In Advances in Neural Information Processing Systems, pages 46212-46244. Curran Associates, Inc., 2023. 3, 7, 12
Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. Morevqa: Exploring modular reasoning models for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13235-13245, 2024. 1, 3, 4, 8, 12, 13
Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14420-14431, 2024. 1, 3
David Mogrovejo and Thamar Solorio. Question-instructed visual descriptions for zero-shot video answering. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9329-9339, Bangkok, Thailand, 2024. Association for Computational Linguistics. 1, 3
OpenAI. Gpt-4 technical report, 2024. 1
Peng Qi, Zehong Yan, Wynne Hsu, and Mong Li Lee. Sniffer: Multimodal large language model for explainable out-of-context misinformation detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13052-13062, 2024. 1, 3
Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, and Roei Herzig. TraveLER: A modular multi-LMM agent framework for video question-answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9740-9766, Miami, Florida, USA, 2024. Association for Computational Linguistics. 1, 3, 8, 12
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, and Gaoang Wang. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18221-18232, 2024. 1, 3
Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11888-11898, 2023. 12
Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, and Chenliang Xu. Video understanding with large language models: A survey, 2024. 1, 3
Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. 1, 3, 13
Shijie Wang, Qi Zhao, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, and Chen Sun. Vamos: Versatile action models for video understanding, 2024. 12
Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. In Computer Vision - ECCV 2024, pages 58-76, Cham, 2025. Springer Nature Switzerland. 1, 3, 8, 12
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning, 2022. 12
Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. VideoCoT: A video chain-of-thought dataset with active annotation tool. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 92-101, Bangkok, Thailand, 2024. Association for Computational Linguistics. 3
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions, 2021. 3, 6, 12
Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua, and Angela Yao. Videoqa in the era of llms: An empirical study, 2024. 1, 3
Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can i trust your answer? visually grounded video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13204-13214, 2024. 1, 3, 4
Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, and Yan Lu. Retrieval-based video language model for efficient long video question answering, 2023. 1, 3
Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, and Jiahui Yu. Videococa: Video-text modeling with zero-shot transfer from contrastive captioners, 2023. 12
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-shot video question answering via frozen bidirectional language models, 2022. 12
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Learning to answer visual questions from web videos, 2022. 3, 6, 12, 13
Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, and Fei Huang. Hitea: Hierarchical temporal-aware video-language pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15405-15416, 2023. 8, 12
Shuhong Ye, Weikai Kong, Chenglin Yao, Jianfeng Ren, and Xudong Jiang. Video question answering using clip-guided visual-text attention. In 2023 IEEE International Conference on Image Processing (ICIP), pages 81-85, 2023. 3
Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Self-chained image-language model for video localization and question answering. In Advances in Neural Information Processing Systems, pages 76749-76771. Curran Associates, Inc., 2023. 8, 12
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering, 2019. 3, 6, 12
Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question-answering, 2024. 12
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding, 2023. 1, 3
Shitian Zhao, Zhuowan Li, Yadong Lu, Alan Yuille, and Yan Wang. Causal-cog: A causal-effect look at context generation for boosting multi-modal language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13342-13351, 2024. 1, 3
Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, and Tat-Seng Chua. Video question answering: Datasets, algorithms and challenges. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6439-6455, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. 3
dc.rights.en.fl_str_mv Attribution 4.0 International
dc.rights.uri.none.fl_str_mv http://creativecommons.org/licenses/by/4.0/
dc.rights.accessrights.none.fl_str_mv info:eu-repo/semantics/embargoedAccess
dc.rights.coar.none.fl_str_mv http://purl.org/coar/access_right/c_f1cf
dc.format.extent.none.fl_str_mv 24 páginas
dc.format.mimetype.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidad de los Andes
dc.publisher.program.none.fl_str_mv Ingeniería Electrónica
dc.publisher.faculty.none.fl_str_mv Facultad de Ingeniería
dc.publisher.department.none.fl_str_mv Departamento de Ingeniería Eléctrica y Electrónica