RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
[CVPR 2026]

Linfei Li · Lin Zhang* · Ying Shen

🌐Project page | 📊Dataset | 📝Paper(CVF) | 📝Paper(arXiv)

Table of Contents

Installation
Datasets
Benchmarking
Inference
Acknowledgement
Citation

Installation

Note: If you run into version conflicts during installation, please refer to EasyR1 and veRL. For the tested environment, see install.sh.

conda create -n realvlgr1 python=3.10
conda activate realvlgr1

git clone git@github.com:lif314/RealVLG-R1.git
cd RealVLG-R1

pip install -e .

Datasets

You can download the relevant datasets from ModelScope RealVLG-11B.

Each data sample is annotated as follows:

[
  {
    "image_name": "",
    "image_path": "",
    "object_id": "",
    "mask_path": "",
    "description": "",
    "label": "", # short description
    "bbox": [x1, y1, x2, y2],
    "grasps": [
      [x0,y0,x1,y1,x2,y2,x3,y3],
      ...
    ],
    "contact_points": [
      [x1,y1,x2,y2],
      ...
    ]
  }
]

The figure below illustrates how bbox and grasp are defined:

Usage

Download the dataset and extract xxx_VLG.zip. In each xxx_VLG folder, run python metadata_viewer.py to browse the annotations. The left/right arrow keys switch between objects in the same image, and the up/down arrow keys switch between images. Visualizations of the different data subsets are shown below:

Subsets (one demo each): Cornell_VLG, VMRD_VLG, OCID_VLG, GraspNet_VLG, Jacquard_VLG

For more detailed data loading, please refer to metadata_viewer.py.
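
The viewer navigation described in this section (left/right across the objects of one image, up/down across images) suggests a two-level index over the records. A minimal sketch of such a grouping follows, assuming the metadata is a list of dicts with the `image_name` field from the annotation format. `group_by_image` is a hypothetical helper, not code taken from metadata_viewer.py, and the wrap-around behaviour is only one possible policy.

```python
from collections import OrderedDict
from typing import Dict, List


def group_by_image(records: List[Dict]) -> "OrderedDict[str, List[Dict]]":
    """Group per-object records by image, preserving first-seen image order."""
    groups: "OrderedDict[str, List[Dict]]" = OrderedDict()
    for rec in records:
        groups.setdefault(rec["image_name"], []).append(rec)
    return groups


# Made-up records for illustration only.
records = [
    {"image_name": "0001.png", "object_id": "0", "label": "banana"},
    {"image_name": "0001.png", "object_id": "1", "label": "mug"},
    {"image_name": "0002.png", "object_id": "0", "label": "scissors"},
]
groups = group_by_image(records)
image_names = list(groups)

# Up/down would change image_idx; left/right would change object_idx.
image_idx, object_idx = 0, 1
# "Right" on the last object wraps back to the first one (assumed policy).
object_idx = (object_idx + 1) % len(groups[image_names[image_idx]])
current = groups[image_names[image_idx]][object_idx]
print(current["label"])
```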

Note: Jacquard_VLG is a simulated dataset not discussed in the paper. Its language annotations are derived from ShapeNetSem category labels.

Benchmarking

Training

# Bbox
bash examples/realvlg/bbox/qwen2_5_vl_3b_graspnet10p_bbox_grpo.slurm

# Seg
# First, download the SAM2 checkpoint to `./third_party/sam2/checkpoints/sam2.1_hiera_large.pt`.
# Stay in the repository root so that the training script path below resolves.
mkdir -p ./third_party/sam2/checkpoints/
wget -P ./third_party/sam2/checkpoints/ https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt

bash examples/realvlg/bbox_sam2/qwen2_5_vl_3b_graspnet10p_SAM_grpo.slurm

# Grasp
bash examples/realvlg/grasp/qwen2_5_vl_3b_graspnet10p_Grasp_grpo.slurm

# Contact
bash examples/realvlg/contact/qwen2_5_vl_3b_graspnet10p_contact_grpo.slurm

Convert to Hugging Face model
python3 scripts/model_merger.py --local_dir path/to/the/model

Evaluation

Bbox

model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/bbox"
python3 evaluation/eval_bbox.py \
    --model_path=$model_path \
    --data_root=$data_root \
    --output_dir=$output_dir
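
The metric computed by eval_bbox.py is not described in this README. For orientation only, the sketch below shows intersection-over-union (IoU), a common way to compare a predicted box with a ground-truth box in the `[x1, y1, x2, y2]` format used by the annotations. `bbox_iou` is a hypothetical helper, and using IoU here is an assumption rather than a statement about the evaluation script.

```python
from typing import Sequence


def bbox_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up boxes: two 10x10 squares overlapping on a 5x10 strip.
pred = [0, 0, 10, 10]
gt = [5, 0, 15, 10]
print(bbox_iou(pred, gt))
```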

Bbox+SAM2

sam_path="./third_party/sam2/checkpoints/sam2.1_hiera_large.pt"
model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/sam"
python3 evaluation/eval_sam.py \
    --model_path=$model_path \
    --sam_model_path=$sam_path \
    --data_root=$data_root \
    --output_dir=$output_dir

Grasp

model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/grasp"
python3 evaluation/eval_grasp.py \
    --model_path=$model_path \
    --data_root=$data_root \
    --output_dir=$output_dir
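
When inspecting grasp predictions, it can help to convert the 8-number rectangle from the `grasps` field into a center, angle, and size, a representation often used for planar 4-DoF grasps. The sketch below is illustrative only and does not reproduce eval_grasp.py. It assumes the four corners are listed consecutively around the rectangle and that the edge from the first to the second corner defines the grasp orientation; both are assumptions, as are the helper names `grasp_to_center_angle` and `angle_diff_deg`.

```python
import math
from typing import Sequence, Tuple


def grasp_to_center_angle(g: Sequence[float]) -> Tuple[float, float, float, float, float]:
    """Convert [x0,y0,x1,y1,x2,y2,x3,y3] to (cx, cy, angle_deg, width, height).

    Assumes consecutive corners; the edge corner0 -> corner1 sets the angle.
    Angles are measured in image coordinates (y axis pointing down).
    """
    xs, ys = g[0::2], g[1::2]
    cx, cy = sum(xs) / 4.0, sum(ys) / 4.0
    angle = math.degrees(math.atan2(ys[1] - ys[0], xs[1] - xs[0]))
    angle = (angle + 90.0) % 180.0 - 90.0  # fold into [-90, 90)
    width = math.hypot(xs[1] - xs[0], ys[1] - ys[0])
    height = math.hypot(xs[2] - xs[1], ys[2] - ys[1])
    return cx, cy, angle, width, height


def angle_diff_deg(a: float, b: float) -> float:
    """Smallest difference between two grasp angles, which repeat every 180 degrees."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


# Made-up axis-aligned grasp rectangle, 60 px wide and 20 px high.
grasp = [30, 40, 90, 40, 90, 60, 30, 60]
print(grasp_to_center_angle(grasp))
print(angle_diff_deg(85.0, -85.0))
```

Folding the angle into a 180-degree range reflects the symmetry of a parallel-jaw gripper: rotating the rectangle by half a turn describes the same grasp.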

Contact

model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/contact"
python3 evaluation/eval_contact.py \
    --model_path=$model_path \
    --data_root=$data_root \
    --output_dir=$output_dir
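
For a quick sanity check of contact-point predictions, one simple measure is the mean distance between predicted and annotated points. The sketch below is an assumption-laden illustration, not the metric implemented in eval_contact.py: it treats each `[x1, y1, x2, y2]` entry as an unordered pair of points and takes the better of the two possible matchings. `contact_pair_error` is a hypothetical helper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def _dist(p: Point, q: Point) -> float:
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def contact_pair_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean point distance between two contact pairs given as [x1, y1, x2, y2].

    The pair is treated as unordered (an assumption), so the smaller of the
    direct and swapped matchings is returned.
    """
    p1, p2 = (pred[0], pred[1]), (pred[2], pred[3])
    g1, g2 = (gt[0], gt[1]), (gt[2], gt[3])
    direct = _dist(p1, g1) + _dist(p2, g2)
    swapped = _dist(p1, g2) + _dist(p2, g1)
    return min(direct, swapped) / 2.0


# Made-up example: the prediction lists the two points in the opposite order.
pred = [0, 0, 10, 0]
gt = [10, 0, 0, 3]
print(contact_pair_error(pred, gt))
```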

Inference

TODO

Object Detection

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GSPO_Bbox_3B.git

python infer_bbox.py

Semantic Segmentation

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GSPO_Bbox_3B.git

python infer_sam.py

4-DoF Grasping

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GRPO_Grasp_3B.git

python infer_grasp.py

Contact Point Prediction

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GRPO_Contact_3B.git

python infer_contact.py

Acknowledgement

We thank the authors of the following repositories for their open-source code:

If you use this dataset, please cite the relevant works and comply with their licenses.

Citation

If you find our paper and code useful for your research, please cite them using the following BibTeX entry.

@inproceedings{li2026realvlgr1,
  title     = {RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation},
  author    = {Li, Linfei and Zhang, Lin and Shen, Ying},
  booktitle = {CVPR},
  year      = {2026},
}
