Yizhi Song

I am a research scientist working on Generative AI at Bytedance / Tiktok. I obtained my PhD in Computer Science from CGVLab at Purdue University, advised by Prof. Daniel Aliaga. Before coming to Purdue, I recieved my B.S. in Computer Science from Zhejiang University. I interned at Qualcomm in summer 2021, and worked as a research intern at Adobe in summer 2022 & 2023 & 2024. During my internships, I'm fortunate to work with Dr. Meng-Lin Wu, Dr. Zhifei Zhang, Dr. Wei Xiong and Dr. Zhe Lin.

I’m interested in GenAI, with a strong focus on post-training and RL. My work includes building diffusion models for image & video generation, and improving MLLMs for image & video understanding.

Please feel free to reach out if you're interested in research collaborations or internships.

Email / CV / Scholar / Github / LinkedIn

News

03-2026	We release PhysAlign, a video generation model that achieves strong physics coherence and physically plausible motions.
01-2026	We release ThinkRL-Edit, a reasoning-centric RL framework for image editing.
09-2025	MMIG-Bench (multi-modal image generation benchmark) has been accepted to D&B track of NeurIPS 2025.
05-2025	We release MMIG-Bench, a comprehensive benchmark for evaluating multi-modal image generation models.
04-2025	ObjectStitch and IMPRINT have been productized and appear in Adobe Summit Sneak and are mentioned in Adobe Research news!
04-2025	Kubrick (video generation agent) has been accepted to CVPR 2025 AI4CC Workshop.
03-2025	Passed my PhD thesis defense!
01-2025	Refine-by-Align (A model to refine/fix generative artifacts from any generated images) has been accepted to ICLR 2025 .
09-2024	We release GroundingBooth , an image customization model with finegrained layout control.
08-2024	We release Kubrick , the first multimodal agent-based video generation pipeline.
07-2024	Thinking Outside the BBox (a generative model that automatically drops & harmonizes foreground objects in background images at reasonable locations) has been accepted to ECCV 2024 .
02-2024	IMPRINT (the state-of-the-art model for image customization / object dropping) has been accepted to CVPR 2024 .

Selected Research Work

	ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, Deng Cai Under Review PDF / Project Page / Code We propose ThinkRL-Edit, a reasoning-centric RL framework for instruction-driven image editing that explicitly separates visual reasoning from image synthesis. By introducing CoT based planning and reflection before generation, unbiased preference grouping across rewards, and a binary checklist–based instruction reward, our method enables deeper reasoning exploration and more stable optimization.
	PhysAlign: Physics-Coherent Video Generation through Feature and 3D Representation Alignment Zhexiao Xiong, Yizhi Song, Liu He, Wei Xiong, Yu Yuan, Feng Qiao, Nathan Jacobs Under Review PDF / Project Page We proposed a framework that leverages vision-language model(VLM)'s physics understanding to enable video generation with physically consistent motion and accurate 3D dynamics; it achieved physically plausible video generation by combining relational alignment with foundation video understanding models, physics-aware feature encoding, and 3D geometry alignment.
	MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, Jiebo Luo NeurIPS 2025 PDF / Project Page / Code / Data We present MMIG-Bench, a comprehensive benchmark for evaluating multi-modal image generation models. MMIG-Bench unifies compositional evaluation across T2I and customized generation, introduces explainable aspect-level metrics, and offers a thorough analysis of SOTA diffusion, autoregressive, and API-based models (e.g., GPT-4o, Gemini*).
	Advancing MLLMs by Large-Scale 3D Visual Instruction Dataset Generation Liu He, Xiao Zeng, Yizhi Song, Albert Y. C. Chen, Lu Xia, Shashwat Verma, Sankalp Dayal, Min Sun, Daniel Aliaga WACV 2026 PDF / Project Page MLLMs struggle with accurately capturing camera-object relations, especially for object orientation, camera viewpoint, and camera shots. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. MLLMs fine-tuned on our dataset outperform commercial models by a large margin.
	Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment Yizhi Song, Liu He, Zhifei Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Zhe Lin, Brian L. Price, Scott Cohen, Jianming Zhang, Daniel Aliaga ICLR 2025 PDF / Project Page We introduce a new task: reference-guided refinement of generative artifacts. Given a synthesized image, a reference and a free-form mask marking the artifacts, the model automatically identifies the correspondence in the reference and extracts the localized feature, which is then used to fix the artifacts.
	GroundingBooth: Grounding Text-to-Image Customization Zhexiao Xiong, Wei Xiong, Jing Shi, He Zhang, Yizhi Song, Nathan Jacobs PDF / Project Page We introduce GroundingBooth, a framework that achieves zero-shot instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task.
	Kubrick: Multimodal Agent Collaborations for Video Generation Liu He, Yizhi Song, Hejun Huang, Daniel Aliaga, Xin Zhou CVPR 2025 AI4CC Workshop PDF / Project Page We build the first multimodal agent-based video generation pipeline through 3D engine scripting. Given any text prompt, multimodal agents collaborate to produce detailed Blender scripts to generate video with plausible character and motion consistency in any length.
	IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian L. Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Daniel Aliaga CVPR 2024 PDF / Project Page Our tuning-free model achieves advanced image composition with a decent identity preservation, automatic object viewpoint/pose adjustment, color and lighting harmonization, and shadow synthesis. All these effects are achieved in a single framework!
	Thinking Outside the BBox: Unconstrained Generative Object Compositing Gemma Canet Tarrés, Zhe Lin, Zhifei Zhang, Jianming Zhang, Yizhi Song, Dan Ruta, Andrew Gilbert, John Collomosse, Soo Ye Kim ECCV 2024 PDF / Project Page (coming soon!) We introduce a novel task, unconstrained image compositing, where the generation is not bounded by the input mask and can even occur without one (thus supports automatic object placement). This allows the generation of realistic object effects (shadows and reflections) that go beyond the mask while preserving the surrounding background.
	ObjectStitch: Object Compositing With Diffusion Model Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian L. Price, Jianming Zhang, Soo Ye Kim, Daniel Aliaga CVPR 2023 Project Page / Paper / arXiv / Reposted by AK / Appeared in Adobe Summit Sneak We define a novel task: generative image compositing, and present the first diffusion model-based framework, ObjectStitch, which can handle multiple aspects of compositing such as viewpoint, geometry, lighting and shadow together in a unified model.
	A Three-Stage Real-Time Detector for Traffic Signs in Large Panoramas Yizhi Song, Ruochen Fan, Sharon Huang, Zhe Zhu, Ruofeng Tong CVM 2019 (oral) PDF We propose a novel three-stage traffic sign detection framework which achieves state-ofthe-art detection accuracy in real-time.