Iaroslav Ponomarenko

yaroslav [dot] ponomarenko [at] stu [dot] pku [dot] edu [dot] cn

I am a second-year master's student at Peking University's Center on Frontiers of Computing Studies, where I am fortunate to work with Professor Hao Dong at the joint Hyperplane/AGIBot PKU Lab on embodied AI, robotics, large foundation models, and computer vision. Additionally, I am a research intern at AGIBot.

Previously, I obtained an Engineering degree in Information Systems and Technologies from Voronezh Institute of High Technologies, as well as a Technician degree in Automated Information Processing and Control Systems from Borisoglebsk College of Informatics and Computer Engineering.

Google Scholar  /  LinkedIn  /  GitHub


Research Interests

My research lies at the intersection of embodied AI, visual perception, reasoning, and robotic control. Specifically, I investigate how to enable embodied agents to acquire environmental awareness through vision, including affordance understanding [1, 2] and spatial reasoning [3], so that they can perform complex manipulation tasks. Currently, I explore these directions using large multimodal foundation models.


News

2024-10-17 Presented ManipVQA [2] at IROS 2024 (Abu Dhabi, United Arab Emirates).
2024-08-12 Presented ManipVQA [2] at Microsoft Research Asia Tech Fest (Beijing, China).
2024-06-30 🎉 Our paper ManipVQA [2] has been accepted for publication at IROS 2024.

Service

Reviewer for the IEEE International Conference on Robotics and Automation (ICRA 2025).

Selected Publications


(*) indicates equal contribution, while (†) denotes the corresponding author

[3]

SpatialBot: Precise Spatial Understanding with Vision Language Models
Wenxiao Cai*, Iaroslav Ponomarenko*, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, Bo Zhao†

In Review, 2024
Paper / GitHub

We introduce SpatialBot, a framework specifically designed to improve the spatial reasoning of Vision Language Models (VLMs) by leveraging both RGB and depth images. To train VLMs for depth perception, we present the SpatialQA and SpatialQA-E datasets, which feature depth-related questions at multiple levels. In addition, we release models fine-tuned on the SpatialQA and SpatialQA-E datasets and present SpatialBench, a comprehensive benchmark for assessing the spatial understanding capabilities of VLMs.

[2]

ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
Siyuan Huang*, Yaroslav Ponomarenko*, Zhengkai Jiang, Xiaoqi Li, Xiaobin Hu, Peng Gao, Hongsheng Li, Hao Dong†

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
[Oral Pitch and Interactive Presentation]
Oral Pitch, Slides, Poster
Paper / arXiv / GitHub

We introduce ManipVQA, a robust framework for integrating physical knowledge and affordance reasoning into Multi-Modal Large Language Models.

[1]

Learning Part-Aware Visual Actionable Affordance for 3D Articulated Object Manipulation
Yuanchen Ju*, Haoran Geng*, Ming Yang*, Yiran Geng, Yaroslav Ponomarenko, Taewhan Kim, He Wang, Hao Dong†

CVPR Workshop on 3D Vision and Robotics (3DVR @ CVPR), 2023
[Spotlight Presentation]
Paper / Workshop

We introduce a part-aware affordance learning method. Our approach first learns a prior over object parts and then generates an affordance map. To further improve precision, we incorporate a part-level scoring system that identifies the most suitable part for manipulation.

Last updated on Tuesday, November 26, 2024, at 05:34:20 AM.

Design and source code from Jon Barron's website.