Building foundation AI systems for maps, mobility, and interactive worlds.
We advance large language models, reinforcement learning, agent systems, multimodal understanding,
generative AI, world models, autonomous driving, and intelligent mobility.
Our work appears at top-tier venues including ICLR, CVPR, ACL, AAAI, SIGGRAPH, ICCV, EMNLP, and ACM MM.
We are looking for talented interns and full-time researchers.
A framework enabling LLM agent skills to evolve collectively from real interactions, with automatic deduplication, improvement, and verification across sessions, agents, and devices.
Adopts tree-search rollouts in place of independent chain-based rollouts for LLM agent RL, achieving superior performance with only a quarter of the rollout budget.
A minimalist RL approach (Group Policy Gradient) that directly optimizes the original RL objective, eliminating critic/reference models and KL constraints while outperforming GRPO.
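Group-based policy-gradient methods of this kind typically baseline each sampled response against the mean reward of its group, so no learned critic or reference model is needed. A minimal sketch of that idea (hypothetical function names, not the paper's implementation):

```python
import numpy as np

def group_advantages(rewards):
    """Center each reward on its group's mean reward: the group
    mean acts as the baseline, so no critic model is needed."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def policy_gradient_loss(logprobs, rewards):
    """Plain REINFORCE-style objective over a group of sampled
    responses to the same prompt; note there is no KL penalty."""
    adv = group_advantages(rewards)
    # Maximize E[advantage * log pi] -> minimize the negative.
    return -float(np.mean(adv * np.asarray(logprobs)))

# One group of 4 sampled responses to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
logprobs = [-1.2, -0.8, -1.5, -0.9]
loss = policy_gradient_loss(logprobs, rewards)
```

Unlike GRPO, which also divides by the group's reward standard deviation, this sketch keeps only the mean-centering baseline, matching the "directly optimizes the original RL objective" framing.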
Proposes difficulty-aware GRPO and multi-aspect question reformulation to boost math reasoning by targeting harder questions from both algorithmic and data perspectives.
A framework for training LLM agents via agent-data mutual evolution, using RL with failure-signal-driven task synthesis under changing training distributions.
A novel text editing framework for multi-line scene text in complex visual scenarios, with a Condition Injection LoRA module and a regional text perceptual loss.
An RL-based single-pass 3D scene editing framework using VGGT as geometry-aware reward model and GRPO to anchor 2D editing priors onto the 3D consistency manifold.
Leverages stochastic block-dropping to construct sub-networks for training-free guidance, surpassing CFG on text-to-image and text-to-video generation.
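The guidance step here follows the usual classifier-free-guidance extrapolation, except the "weak" prediction comes from a sub-network with randomly dropped blocks rather than an unconditional pass. A toy sketch with stand-in models (all names and the weakening factor are illustrative assumptions):

```python
import numpy as np

def full_model(x):
    # Stand-in for the full denoiser's noise prediction.
    return 0.9 * x

def submodel(x, drop_rate=0.3):
    # Stand-in for a sub-network with a random subset of blocks
    # skipped; modeled here as a crudely weakened prediction.
    return (1.0 - drop_rate) * full_model(x)

def guided_prediction(x, scale=2.0):
    """CFG-style extrapolation: push the full model's output
    away from the weaker sub-network's output."""
    strong = full_model(x)
    weak = submodel(x)
    return weak + scale * (strong - weak)

x = np.ones(4)
out = guided_prediction(x)
```

The point of the sketch is the combination rule, not the models: dropping blocks gives a degraded predictor "for free", playing the role the unconditional model plays in standard CFG.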
Elucidates the SNR-t bias of diffusion probabilistic models and proposes a differential correction method that improves generation quality across various diffusion models.
Unified self-supervised pretraining via masked latent modeling in VAE space, significantly improving diffusion model convergence and generation quality.
A cascaded expert framework explicitly decoupling motion generation and appearance synthesis for high-quality music-driven dance video generation, with 70K-clip MA-Data dataset.
Combines contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition using GRPO with hierarchical rewards.
A prompt-guided adaptive test-time search strategy that dynamically adjusts search space and reward for imaginative video generation with long-distance semantic dependencies.
Introduces DualityForge, a controllable diffusion framework generating counterfactual videos for contrastive training, reducing MLLM video hallucinations by 24%.
A VLM-based GUI world model that predicts dynamic transitions via renderable code generation, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation.
A general-purpose world model for interactive world simulation, generating diverse, high-fidelity worlds that users can explore, control, and transform with event prompts.
A vision-language-action model using rule-based RL to elicit reasoning and self-reflection for autonomous driving trajectory prediction with physics-grounded rewards.
A vision-language reasoning framework for urban socio-semantic segmentation that simulates human annotation via cross-modal recognition and multi-stage RL-based reasoning.
A 14,715-image UGC dataset with 10 fine-grained attributes for realistic image quality and aesthetic scoring; achieves SOTA on 5 public IQA/IAA benchmarks using next-token prediction.