RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
[CVPR 2026]
Linfei Li · Lin Zhang* · Ying Shen
Note: If you encounter version issues during the installation process, please refer to EasyR1 and veRL. For tested environments, please refer to install.sh.
conda create -n realvlgr1 python==3.10
conda activate realvlgr1
git clone git@github.com:lif314/RealVLG-R1.git
cd RealVLG-R1
pip install -e .

You can download the datasets from ModelScope (RealVLG-11B).
Each data sample is annotated as follows:
[
{
"image_name": "",
"image_path": "",
"object_id": "",
"mask_path": "",
"description": "",
"label": "", # short description
"bbox": [x1, y1, x2, y2],
"grasps": [
[x0,y0,x1,y1,x2,y2,x3,y3],
...
],
"contact_points": [
[x1,y1, x2, y2],
...
]
}
]

The definitions of bbox and grasp are illustrated in the figure below:

Download the dataset and extract xxx_VLG.zip. In each xxx_VLG folder, run `python metadata_viewer.py` to inspect the annotations. The left/right keys switch between objects in the same image, and the up/down keys switch between images. Visualizations of the data subsets are shown below:
| Subdata | Cornell_VLG | VMRD_VLG | OCID_VLG | GraspNet_VLG | Jacquard_VLG |
|---|---|---|---|---|---|
| Demo | ![]() | ![]() | ![]() | ![]() | ![]() |
For more detailed data loading, please refer to metadata_viewer.py.
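As a rough illustration of the annotation format above, the sketch below decodes one sample into simple geometric quantities. It is not taken from metadata_viewer.py. The helper names (`load_metadata`, `grasp_to_rect`, `summarize_sample`) and the metadata file path are hypothetical, and the code assumes that the four grasp corners are listed in order around the rectangle. Which edge corresponds to the gripper opening depends on the convention in the definition figure, so the two edge lengths are reported without interpretation.

```python
import json
import math


def load_metadata(path):
    """Load a list of annotation dicts from a JSON file (path is hypothetical)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def grasp_to_rect(grasp):
    """Decode [x0,y0,x1,y1,x2,y2,x3,y3] into center, angle, and edge lengths.

    Assumption: corners are stored in order around the rectangle.
    """
    pts = [(grasp[i], grasp[i + 1]) for i in range(0, 8, 2)]
    cx = sum(p[0] for p in pts) / 4.0
    cy = sum(p[1] for p in pts) / 4.0
    (x0, y0), (x1, y1), (x2, y2) = pts[0], pts[1], pts[2]
    edge_a = math.hypot(x1 - x0, y1 - y0)  # length of corner0 -> corner1
    edge_b = math.hypot(x2 - x1, y2 - y1)  # length of corner1 -> corner2
    angle_deg = math.degrees(math.atan2(y1 - y0, x1 - x0))
    return {
        "center": (cx, cy),
        "angle_deg": angle_deg,
        "edge_a": edge_a,
        "edge_b": edge_b,
    }


def summarize_sample(sample):
    """Collect bbox size, decoded grasps, and contact-point pairs for one sample."""
    x1, y1, x2, y2 = sample["bbox"]
    return {
        "label": sample["label"],
        "bbox_wh": (x2 - x1, y2 - y1),
        "grasps": [grasp_to_rect(g) for g in sample["grasps"]],
        "contact_pairs": [
            ((c[0], c[1]), (c[2], c[3])) for c in sample["contact_points"]
        ],
    }


# Made-up sample that follows the documented fields (values are illustrative).
example = {
    "image_name": "demo.png",
    "image_path": "images/demo.png",
    "object_id": "0",
    "mask_path": "masks/demo_0.png",
    "description": "a red mug on the table",
    "label": "mug",
    "bbox": [5, 5, 35, 25],
    "grasps": [[10, 10, 30, 10, 30, 20, 10, 20]],
    "contact_points": [[10, 15, 30, 15]],
}

summary = summarize_sample(example)
print(summary["bbox_wh"], summary["grasps"][0]["center"])

# With a real subset, something like the following would apply
# (the file name is an assumption, check the extracted folder):
# samples = load_metadata("path/to/GraspNet_VLG/metadata.json")
```

For the authoritative loading logic, including how masks and images are resolved, metadata_viewer.py remains the reference.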
Note: Jacquard_VLG is a simulated dataset not discussed in the paper. Its language annotations are derived from ShapeNetSem category labels.
# Bbox
bash examples/realvlg/bbox/qwen2_5_vl_3b_graspnet10p_bbox_grpo.slurm
# Seg
# First, download the SAM2 model into `./third_party/sam2/checkpoints/sam2.1_hiera_large.pt`.
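mkdir -p ./third_party/sam2/checkpoints/
# Download without changing directory, so the training script below still runs from the repository root.
wget -P ./third_party/sam2/checkpoints/ https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt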
bash examples/realvlg/bbox_sam2/qwen2_5_vl_3b_graspnet10p_SAM_grpo.slurm
# Grasp
bash examples/realvlg/grasp/qwen2_5_vl_3b_graspnet10p_Grasp_grpo.slurm
# Contact
bash examples/realvlg/contact/qwen2_5_vl_3b_graspnet10p_contact_grpo.slurm

Convert to Hugging Face model
python3 scripts/model_merger.py --local_dir path/to/the/model
- Bbox
model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/bbox"
python3 evaluation/eval_bbox.py \
--model_path=$model_path \
--data_root=$data_root \
--output_dir=$output_dir

- Bbox+SAM2
sam_path="./third_party/sam2/checkpoints/sam2.1_hiera_large.pt"
model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/sam"
python3 evaluation/eval_sam.py \
--model_path=$model_path \
--sam_model_path=$sam_path \
--data_root=$data_root \
--output_dir=$output_dir

- Grasp
model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/grasp"
python3 evaluation/eval_grasp.py \
--model_path=$model_path \
--data_root=$data_root \
--output_dir=$output_dir

- Contact
model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/contact"
python3 evaluation/eval_contact.py \
--model_path=$model_path \
--data_root=$data_root \
--output_dir=$output_dir

TODO
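This README does not state which metrics the evaluation scripts above report. Purely as an illustration, and not as a description of eval_bbox.py, the sketch below computes intersection-over-union (IoU) between a predicted and an annotated box in the documented `[x1, y1, x2, y2]` format, plus the fraction of predictions above a threshold. The function names and the 0.5 threshold are assumptions chosen for the example.

```python
def bbox_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the paired annotation meets the threshold."""
    if not gt_boxes:
        return 0.0
    hits = sum(
        1 for pred, gt in zip(pred_boxes, gt_boxes) if bbox_iou(pred, gt) >= threshold
    )
    return hits / len(gt_boxes)


# Illustrative values only: one exact match and one complete miss.
preds = [[0, 0, 10, 10], [0, 0, 10, 10]]
gts = [[0, 0, 10, 10], [20, 20, 30, 30]]
print(accuracy_at_iou(preds, gts))
```

The metrics actually used for each task are defined in the scripts under evaluation/.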
git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GSPO_Bbox_3B.git
python infer_bbox.py

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GSPO_Bbox_3B.git
python infer_sam.py

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GRPO_Grasp_3B.git
python infer_grasp.py

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GRPO_Contact_3B.git
python infer_contact.py

We thank the authors of the following repositories for their open-source code:
If you use this dataset, please cite the relevant work and comply with their licenses.
If you find our paper and code useful for your research, please cite them using the following BibTeX entry.
@inproceedings{li2026realvlgr1,
title = {RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation},
author = {Li, Linfei and Zhang, Lin and Shen, Ying},
booktitle = {CVPR},
year = {2026},
}




