lif314/RealVLG-R1

RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
[CVPR 2026]

Linfei Li · Lin Zhang* · Ying Shen

Table of Contents
  1. Installation
  2. Datasets
  3. Benchmarking
  4. Inference
  5. Acknowledgement
  6. Citation

Installation

Note: If you run into version conflicts during installation, refer to EasyR1 and veRL. The tested environment is recorded in install.sh.

conda create -n realvlgr1 python=3.10
conda activate realvlgr1

git clone git@github.com:lif314/RealVLG-R1.git
cd RealVLG-R1

pip install -e .

Datasets

You can download the relevant datasets from ModelScope RealVLG-11B.

Each data sample is annotated as follows:

[
  {
    "image_name": "",
    "image_path": "",
    "object_id": "",
    "mask_path": "",
    "description": "",
    "label": "", # short description
    "bbox": [x1, y1, x2, y2],
    "grasps": [
      [x0, y0, x1, y1, x2, y2, x3, y3],
      ...
    ],
    "contact_points": [
      [x1, y1, x2, y2],
      ...
    ]
  }
]
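A minimal loading sketch, assuming the annotations are stored as a JSON list with the fields shown above. The helper names (`REQUIRED_KEYS`, `check_sample`, `load_samples`) and the metadata file path are hypothetical and not part of the repository; metadata_viewer.py remains the reference loader.

```python
import json
from pathlib import Path

# Field names taken from the annotation template above.
REQUIRED_KEYS = {
    "image_name", "image_path", "object_id", "mask_path",
    "description", "label", "bbox", "grasps", "contact_points",
}


def check_sample(sample: dict) -> None:
    """Raise ValueError if a sample does not match the documented layout."""
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if len(sample["bbox"]) != 4:
        raise ValueError("bbox must be [x1, y1, x2, y2]")
    for grasp in sample["grasps"]:
        if len(grasp) != 8:
            raise ValueError("each grasp must hold four (x, y) corners")
    for contact in sample["contact_points"]:
        if len(contact) != 4:
            raise ValueError("each contact entry must hold two (x, y) points")


def load_samples(metadata_file: str) -> list[dict]:
    """Read a JSON annotation list (hypothetical path) and validate each entry."""
    samples = json.loads(Path(metadata_file).read_text(encoding="utf-8"))
    for sample in samples:
        check_sample(sample)
    return samples
```

Validating the list lengths up front makes malformed entries fail at load time rather than later in training or evaluation.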

The bbox and grasp annotations are defined as shown in the figure below:

[Figure: definition of the bbox and grasp annotations]
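The figure defines the exact corner ordering of a grasp. As a hedged illustration, the sketch below converts one `[x0, y0, ..., x3, y3]` entry into a center, angle, and size. It assumes the four corners are listed consecutively around the rectangle and that the edge from corner 0 to corner 1 gives the grasp direction; both are assumptions, not something this README states, and `grasp_to_rect_params` is a hypothetical helper name.

```python
import math


def grasp_to_rect_params(grasp):
    """Convert [x0, y0, x1, y1, x2, y2, x3, y3] to (cx, cy, angle, width, height).

    Assumption: corners are consecutive around the rectangle, and the
    edge corner0 -> corner1 defines the grasp direction.
    """
    if len(grasp) != 8:
        raise ValueError("a grasp must hold four (x, y) corners")
    xs = grasp[0::2]
    ys = grasp[1::2]
    # The center is the mean of the four corners.
    cx = sum(xs) / 4.0
    cy = sum(ys) / 4.0
    # Angle (radians) and length of the first edge.
    dx, dy = xs[1] - xs[0], ys[1] - ys[0]
    angle = math.atan2(dy, dx)
    width = math.hypot(dx, dy)
    # Length of the adjacent edge.
    height = math.hypot(xs[2] - xs[1], ys[2] - ys[1])
    return cx, cy, angle, width, height
```

If the dataset uses a different corner order, only the choice of edges used for `angle`, `width`, and `height` needs to change.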

Usage

Download the dataset and extract xxx_VLG.zip. In each xxx_VLG folder, run python metadata_viewer.py to browse the annotations. The left/right keys switch between objects in the same image, and the up/down keys switch between images. Visualizations of the data subsets are shown below:

Subsets: Cornell_VLG, VMRD_VLG, OCID_VLG, GraspNet_VLG, Jacquard_VLG
[Demo images for each subset]

For more details on data loading, see metadata_viewer.py.
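Because one image can carry several annotated objects, it can help to group samples by image before iterating, mirroring the viewer's image/object navigation. The sketch below assumes `image_name` identifies an image within a subset; `group_by_image` is a hypothetical helper and is not taken from metadata_viewer.py.

```python
from collections import defaultdict


def group_by_image(samples):
    """Map each image_name to the list of object samples annotated on it."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["image_name"]].append(sample)
    # Return a plain dict so missing keys raise KeyError as usual.
    return dict(groups)
```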

Note: Jacquard_VLG is a simulated dataset not discussed in the paper. Its language annotations are derived from ShapeNetSem category labels.

Benchmarking

Training

# Bbox
bash examples/realvlg/bbox/qwen2_5_vl_3b_graspnet10p_bbox_grpo.slurm

# Seg
# First, download the SAM2 model to `./third_party/sam2/checkpoints/sam2.1_hiera_large.pt`.
# Stay in the repository root so the training script path below resolves.
mkdir -p ./third_party/sam2/checkpoints/
wget -P ./third_party/sam2/checkpoints/ https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt

bash examples/realvlg/bbox_sam2/qwen2_5_vl_3b_graspnet10p_SAM_grpo.slurm

# Grasp
bash examples/realvlg/grasp/qwen2_5_vl_3b_graspnet10p_Grasp_grpo.slurm

# Contact
bash examples/realvlg/contact/qwen2_5_vl_3b_graspnet10p_contact_grpo.slurm

Convert to Hugging Face model

python3 scripts/model_merger.py --local_dir path/to/the/model

Evaluation

  • Bbox
model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/bbox"
python3 evaluation/eval_bbox.py \
    --model_path=$model_path \
    --data_root=$data_root \
    --output_dir=$output_dir
  • Bbox+SAM2
sam_path="./third_party/sam2/checkpoints/sam2.1_hiera_large.pt"
model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/sam"
python3 evaluation/eval_sam.py \
    --model_path=$model_path \
    --sam_model_path=$sam_path \
    --data_root=$data_root \
    --output_dir=$output_dir
  • Grasp
model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/grasp"
python3 evaluation/eval_grasp.py \
    --model_path=$model_path \
    --data_root=$data_root \
    --output_dir=$output_dir
  • Contact
model_path=path/to/model/huggingface
data_root="path/to/GraspNet_VLG"
output_dir="./outputs/evaluation/contact"
python3 evaluation/eval_contact.py \
    --model_path=$model_path \
    --data_root=$data_root \
    --output_dir=$output_dir
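The evaluation scripts define the actual metrics. As an illustration only, boxes in the `[x1, y1, x2, y2]` format are commonly scored with intersection over union (IoU); the sketch below assumes that convention and is not taken from evaluation/eval_bbox.py. `box_iou` is a hypothetical helper name.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```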

Inference

TODO

Object Detection

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GSPO_Bbox_3B.git

python infer_bbox.py

Semantic Segmentation

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GSPO_Bbox_3B.git

python infer_sam.py

4-DoF Grasping

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GRPO_Grasp_3B.git

python infer_grasp.py

Contact Point Prediction

git clone https://www.modelscope.cn/cslinfeili/RealVLG-R1_GRPO_Contact_3B.git

python infer_contact.py

Acknowledgement

We thank the authors of the following repositories for their open-source code:

If you use this dataset, please cite the relevant work and comply with their licenses.

Citation

If you find our paper and code useful for your research, please use the following BibTeX entry.

@inproceedings{li2026realvlgr1,
  title     = {RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation},
  author    = {Li, Linfei and Zhang, Lin and Shen, Ying},
  booktitle = {CVPR},
  year      = {2026},
}
