Publications

Preprints

2025

D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding

Yichi Zhang, Gongwei Chen, Jun Zhu, and Jia Wan

arXiv preprint, 2025

Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

Yuquan Xie, Zaijing Li, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Dongmei Jiang, and Liqiang Nie

arXiv preprint, 2025

Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, and Liqiang Nie

arXiv preprint, 2025

2024

Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, and Liqiang Nie

arXiv preprint, 2024

Enhancing the emotional generation capability of large language models via emotional chain-of-thought

Zaijing Li, Gongwei Chen, Rui Shao, Dongmei Jiang, and Liqiang Nie

arXiv preprint, 2024

2026

HiconAgent: History Context-aware Policy Optimization for GUI Agents

Xurui Zhou, Gongwei Chen, Yuquan Xie, Zaijing Li, Kaiwen Zhou, Shuai Wang, Shuo Yang, Zhuotao Tian, and 1 more author

In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026

HATS: Hardness-Aware Trajectory Synthesis for GUI Agents

Rui Shao, Ruize Gao, Bin Xie, Yixing Li, Kaiwen Zhou, Shuai Wang, Weili Guan, and Gongwei Chen†

In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, and Liqiang Nie

In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2025

Enhancing GUI Agent with Uncertainty-Aware Self-Trained Evaluator

Gongwei Chen, Lirong Jie, Lexiao Zou, Weili Guan, Miao Zhang, and Liqiang Nie

In Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning

Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, and Liqiang Nie

In ACM International Conference on Multimedia (ACM MM), 2025

Less is More: Empowering GUI Agent with Context-Aware Simplification

Gongwei Chen, Xurui Zhou, Rui Shao, Yibo Lyu, Kaiwen Zhou, Shuai Wang, Wentao Li, Yinchuan Li, and 2 more authors

In International Conference on Computer Vision (ICCV), Highlight, 2025

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers

Renshan Zhang, Rui Shao, Gongwei Chen, Kaiwen Zhou, Weili Guan, and Liqiang Nie

In International Conference on Computer Vision (ICCV), 2025

GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent

Bin Xie, Rui Shao†, Gongwei Chen†, Kaiwen Zhou, Yinchuan Li, Jie Liu, Min Zhang, and Liqiang Nie

In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie

In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation

Yanda Chen*, Gongwei Chen*, Miao Zhang, Weili Guan, and Liqiang Nie

In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, and 3 more authors

In The Thirteenth International Conference on Learning Representations (ICLR), Spotlight (5.1%), 2025

2024

Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL

Qi Lv, Xiang Deng, Gongwei Chen, Michael Yu Wang, and Liqiang Nie

In Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Leyang Shen*, Gongwei Chen*, Rui Shao, Weili Guan, and Liqiang Nie

In Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie

In Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

LION: Empowering multimodal large language model with dual-level visual knowledge

Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, and Liqiang Nie

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Object-to-Manipulation Graph for Affordance Navigation

Xinhang Song, Bohan Wang, Liye Dong, Gongwei Chen, Xinyun Hu, and Shuqiang Jiang

CAAI Artificial Intelligence Research, 2024

2023

Composite Object Relation Modeling for Few-Shot Scene Recognition

Xinhang Song, Chenlong Liu, Haitao Zeng, Yaohui Zhu, Gongwei Chen, Xiaorong Qin, and Shuqiang Jiang

IEEE Transactions on Image Processing, 2023

2021

See More for Scene: Pairwise Consistency Learning for Scene Classification

Gongwei Chen, Xinhang Song, Bohan Wang, and Shuqiang Jiang

Advances in Neural Information Processing Systems, 2021

2020

Scene recognition with prototype-agnostic scene layout

Gongwei Chen, Xinhang Song, Haitao Zeng, and Shuqiang Jiang

IEEE Transactions on Image Processing, 2020

Amorphous Region Context Modeling for Scene Recognition

Haitao Zeng, Xinhang Song, Gongwei Chen, and Shuqiang Jiang

IEEE Transactions on Multimedia, 2020

2019

MUCH: Mutual Coupling Enhancement of Scene Recognition and Dense Captioning

Xinhang Song, Bohan Wang, Gongwei Chen, and Shuqiang Jiang

In Proceedings of the 27th ACM International Conference on Multimedia, 2019

Deep patch representations with shared codebook for scene classification

Shuqiang Jiang, Gongwei Chen, Xinhang Song, and Linhu Liu

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2019

Scene Recognition with Comprehensive Regions Graph Modeling

Haitao Zeng and Gongwei Chen

In International Conference on Image and Graphics, 2019

Image representations with spatial object-to-object relations for RGB-D scene recognition

Xinhang Song, Shuqiang Jiang, Bohan Wang, Chengpeng Chen, and Gongwei Chen

IEEE Transactions on Image Processing, 2019

Learning scene attribute for scene recognition

Haitao Zeng, Xinhang Song, Gongwei Chen, and Shuqiang Jiang

IEEE Transactions on Multimedia, 2019