Zengyi Qin

qinzy [at] mit.edu

I am an MIT PhD student affiliated with MIT CSAIL. My current research focus is LLM Reasoning in Multi-modal Context (vision, language, audio and spatial). My ultimate goal is to develop general-purpose intelligence that advances how machines understand, interact, and create in both the digital and physical worlds. I am fortunate to work with Dr. Brian Anthony and Prof. William T. Freeman at MIT.

I have solid experience pre-training and post-training LLMs. I was the project lead of JetMoE, an MoA+MoE LLM pre-trained and post-trained from scratch (not fine-tuned from existing models) for less than 0.1M USD, yet it outperforms LLaMA2-7B.

I am also the main author of several popular open-source projects. One has 28k stars and trended 1st on GitHub; another receives >4M average monthly downloads on Hugging Face (more than Stable Diffusion).

Previously, I was a visiting researcher in the Stanford Vision and Learning Lab, where I had the privilege of working with Prof. Fei-Fei Li and Prof. Silvio Savarese.

Beyond research, I also have extensive experience turning cutting-edge research into practical applications. I co-founded MyShell.ai and developed its agentic framework, which lets anyone build AI agents without coding. The platform now has >3M users, who have built more than 10K apps.

News

  • [Apr 2024] We released JetMoE-8B, an MoA+MoE LLM pre-trained and post-trained from scratch for less than 0.1M USD, yet it outperforms LLaMA2-7B. It democratizes high-performance LLM pre-training and post-training and has received strong positive feedback from the field.
      Technical blog
      MIT CSAIL posts
      Comments from the field (1 2 3)
      "The breakthrough represented by JetMoE-8B signals a significant democratization of AI technology" (1)
  • [Jan 2024] We released OpenVoice, allowing users to clone any voice and generate speech in various styles and languages.
      Technical blog
      Trended 1st on GitHub. Now 27k stars
      Serving >3M users on MyShell with a solid, production-grade algorithm
      Covered by VentureBeat, HyScaler, and other media outlets
      "AI Voice Cloning Redefined: OpenVoice Unveils Revolutionary Open-Source Technology" (1)

Projects in Generative Models

Visual Reasoning by Learning Latent Symbolization
Despite the impressive capabilities of LLMs in text-based reasoning, they remain far from comparable proficiency in visual reasoning. We identify the core issue as their inability to symbolize visual input and propose learning latent symbolization to enhance their visual reasoning capabilities.
paper coming soon

JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars
Yikang Shen, Zhen Guo, Tianle Cai and Zengyi Qin
Technical Report, 2024

JetMoE is pre-trained and post-trained from scratch for less than 0.1M USD, yet it outperforms LLaMA2-7B, democratizing high-performance LLM pre-training and post-training with remarkable cost efficiency.
website | github | tech report
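For readers unfamiliar with sparse Mixture-of-Experts layers, the sketch below shows a generic top-k MoE feed-forward block in PyTorch. It only illustrates the sparse-activation idea that MoE architectures such as JetMoE rely on; the layer sizes, routing, and the absence of load balancing are illustrative assumptions, not JetMoE's actual implementation.

```python
# Generic top-k Mixture-of-Experts feed-forward block (illustrative only,
# not the JetMoE implementation). Each token is routed to its top-k experts,
# so only a fraction of the parameters is activated per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, n_experts)
        topk_val, topk_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(topk_val, dim=-1)      # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
print(SparseMoE()(tokens).shape)  # torch.Size([4, 512])
```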

OpenVoice: Versatile Instant Voice Cloning
Zengyi Qin, Wenliang Zhao, Xumin Yu and Xin Sun
Technical Report, 2024

Instantly clone any voice to generate speech in various styles and languages.
paper | website | source code

Trended 1st on GitHub
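For a sense of how the released code is used, here is a minimal sketch of the two-stage OpenVoice V1 pipeline: synthesize speech with a base speaker, then convert its tone color to match a short reference clip. The checkpoint paths and exact class/method names follow the repository's demo as I recall it and are assumptions that may differ across versions.

```python
# Minimal OpenVoice V1 voice-cloning sketch (follows the repo's demo;
# checkpoint paths and exact APIs are assumptions and may differ by version).
import torch
from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Base speaker TTS produces speech in a default voice and style.
base_tts = BaseSpeakerTTS('checkpoints/base_speakers/EN/config.json', device=device)
base_tts.load_ckpt('checkpoints/base_speakers/EN/checkpoint.pth')

# Tone color converter transfers the timbre of the reference speaker.
converter = ToneColorConverter('checkpoints/converter/config.json', device=device)
converter.load_ckpt('checkpoints/converter/checkpoint.pth')

source_se = torch.load('checkpoints/base_speakers/EN/en_default_se.pth').to(device)
# Extract a tone-color embedding from a short reference recording.
target_se, _ = se_extractor.get_se('reference.mp3', converter,
                                   target_dir='processed', vad=True)

base_tts.tts("This audio is generated by OpenVoice.", 'tmp.wav',
             speaker='default', language='English', speed=1.0)
converter.convert(audio_src_path='tmp.wav', src_se=source_se,
                  tgt_se=target_se, output_path='cloned.wav')
```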

DreamVoice: Text-Guided Voice Conversion
Jiarui Hai, Karan Thakkar, Helin Wang, Zengyi Qin, Mounya Elhilali
Interspeech, 2024

Convert any voice into a target voice described by the input text prompt.
paper | website | source code

MeloTTS: A high-quality multi-lingual multi-accent text-to-speech library
Wenliang Zhao, Xumin Yu, Zengyi Qin

High-quality multi-lingual text-to-speech library that supports English (American, British, Australian, and Indian accents), Spanish, French, Chinese, Japanese, and Korean.
source code

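A minimal usage sketch with the library's Python API; the speaker key and language code below follow the repository's examples and are assumptions that may vary by release.

```python
# Minimal MeloTTS usage sketch (follows the repository's examples;
# exact speaker keys and language codes may vary by release).
from melo.api import TTS

model = TTS(language='EN', device='auto')   # picks GPU if available, else CPU
speaker_ids = model.hps.data.spk2id         # accent keys, e.g. 'EN-US', 'EN-BR'
model.tts_to_file("Text to speech with an American accent.",
                  speaker_ids['EN-US'], 'en_us.wav', speed=1.0)
```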

Projects in 3D Computer Vision

MonoGRNet: A General Framework for Monocular 3D Object Detection
Zengyi Qin, Jinglu Wang and Yan Lu
The IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021

A general monocular 3D object detection framework that flexibly adapts to both fully and weakly supervised learning, which alleviates the need for extensive 3D labels and requires only ground-truth 2D bounding boxes during training.

paper

Weakly Supervised 3D Object Detection from Point Clouds
Zengyi Qin, Jinglu Wang and Yan Lu
ACM Multimedia (ACM MM), 2020

A state-of-the-art framework for weakly supervised 3D object detection from point clouds that uses no ground-truth 3D bounding boxes for training. The core of our method is an unsupervised 3D object proposal module and a cross-modal knowledge distillation strategy.

paper | code

Triangulation Learning Network: from Monocular to Stereo 3D Object Detection
Zengyi Qin, Jinglu Wang and Yan Lu
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

A pioneering work on stereo-image-based 3D object detection that does not compute pixel-level depth maps. We proposed a triangulation learning method that learns object-level stereo geometric correspondences for 3D object detection.

paper | video | code | website

MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Zengyi Qin, Jinglu Wang and Yan Lu
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), 2019, Oral Presentation, Acceptance Rate < 8%

A state-of-the-art monocular 3D object detection approach based on geometric reasoning. We proposed to decompose the whole task into four progressive sub-tasks, which significantly facilitates monocular 3D object detection.

paper | video | code | website

Projects in Robotics

SABLAS: Learning Safe Control for Black-Box Dynamical Systems
Zengyi Qin, Dawei Sun and Chuchu Fan
IEEE Robotics and Automation Letters (RA-L), 2022

Learning control barrier functions (CBFs) for safe control of black-box systems. CBFs are a powerful tool for providing safety guarantees, but before this work they could not be directly applied to black-box systems whose dynamics models are unavailable.

paper | code
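For context, the generic control barrier function condition that such safe controllers enforce is sketched below; this is the textbook form from the CBF literature, not necessarily the exact formulation used in the paper.

```latex
% Generic CBF safety condition (textbook form, not the paper's exact notation).
% Safe set: C = {x : h(x) >= 0}. The set C stays forward invariant if, at every
% state x, the controller picks a control u satisfying
\[
  \dot h(x, u) \;=\; \nabla h(x)^{\top} f(x, u) \;\ge\; -\alpha\big(h(x)\big),
\]
% where f is the (possibly black-box) dynamics and \alpha is an extended
% class-K function. SABLAS addresses the setting where f is available only
% as a black-box simulator, so this condition cannot be checked analytically.
```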

KETO: Learning Keypoint Representations for Tool Manipulation
Zengyi Qin, Kuan Fang, Yuke Zhu, Li Fei-Fei and Silvio Savarese
The International Conference on Robotics and Automation (ICRA), 2020

KETO is a framework for robots to manipulate unseen objects as tools to complete diverse tasks. We proposed a method to learn keypoint representations of objects, which simplifies manipulation and improves generalization to novel objects.

paper | video | website | code

Learning Safe Multi-agent Control with Decentralized Neural Barrier Certificates
Zengyi Qin, Kaiqing Zhang, Yuxiao Chen, Jingkai Chen and Chuchu Fan
The International Conference on Learning Representations (ICLR), 2021

We study the multi-agent safe control problem where agents should avoid any collision while reaching their goals. Our method can scale up to an arbitrarily large number of agents (e.g., >1000 in our experiments) and achieve a 99-100% safety rate.

paper | video | code | website

Reactive and Safe Road User Simulations using Neural Barrier Certificate
Yue Meng, Zengyi Qin and Chuchu Fan
The International Conference on Intelligent Robots and Systems (IROS), 2021

Reactive and safe agent modeling is important for modern traffic simulator design and safe planning applications. We propose a control-barrier-function-based method to simulate traffic agents that behave like humans or human-controlled vehicles and react to other road participants.

paper | website

Density Constrained Reinforcement Learning
Zengyi Qin, Yuxiao Chen and Chuchu Fan
The International Conference on Machine Learning (ICML), 2021

We study constrained reinforcement learning (CRL) from a novel perspective by setting constraints directly on state density functions, rather than the value functions considered by previous work. State density has a clear physical and mathematical interpretation, and is able to express a wide variety of constraints such as resource limits and safety requirements.

paper | code | website
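In generic form, the problem setting can be written as below; the notation is an illustrative sketch, not the paper's exact formulation.

```latex
% Density-constrained RL, generic form (illustrative notation).
% Maximize expected return subject to an upper bound c(s) on the state
% density rho_pi induced by the policy pi:
\[
  \max_{\pi}\; \mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
  \quad \text{s.t.} \quad \rho_{\pi}(s) \,\le\, c(s) \;\; \forall\, s \in \mathcal{S}.
\]
% Setting c(s) = 0 on unsafe states encodes a safety requirement; a finite
% c(s) caps how often a state can be visited, encoding a resource limit.
```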

Safe Nonlinear Control Using Robust Neural Lyapunov-Barrier Functions
Charles Dawson, Zengyi Qin, Sicun Gao and Chuchu Fan
The Conference on Robot Learning (CoRL), 2021

Safety and stability are common requirements for robotic control systems. We propose a feedback control method based on robust control Lyapunov-barrier functions that generalizes despite model uncertainty and provides safety and stability guarantees.

paper

Controller synthesis for linear system with reach-avoid specifications
Chuchu Fan, Zengyi Qin, Umang Mathur, Qiang Ning, Sayan Mitra, and Mahesh Viswanathan
IEEE Transactions on Automatic Control (TAC), 2021

We address the problem of synthesizing provably correct controllers for linear systems with reach-avoid specifications. Our solution decomposes the overall synthesis problem into two smaller and more tractable problems, achieving a 2-150x speedup over previous techniques.

paper

Projects in AI for Healthcare

Learning fine-grained estimation of physiological states from coarse-grained labels by distribution restoration
Zengyi Qin, Jiansheng Chen, Zhenyu Jiang, Xumin Yu, Chunhua Hu, Yu Ma, Suhua Miao and Rongsong Zhou
Scientific Reports, 2020

Our method allows machine learning algorithms to perform fine-grained estimation of physiological states (e.g., sleep depth) even if the training labels are coarse-grained.

paper | code

sEMG-based Tremor Severity Evaluation for Parkinson's Disease using a Light-weight CNN
Zengyi Qin*, Zhenyu Jiang*, Jiansheng Chen, Chunhua Hu and Yu Ma
IEEE Signal Processing Letters (SPL), 2019

A machine learning framework to assist the diagnosis of Parkinson's Disease by assessing the pathological tremor. We proposed a light-weight convolutional neural network and a similarity learning strategy to handle the scarcity of medical data.

paper | website