Community Project Profile
NExT-GPT
An any-to-any multimodal LLM across text, image, video, and audio
- Organisation: NUS NExT++ Research Center
- Group: University / research
- Category: Any-to-any multimodal model
- Status: Research open source
- Started: 2023-08
- Language / Form: Python
- License: BSD-3-Clause
- GitHub Stars: 3,621
- Updated: 2026-05-04
NExT-GPT is a representative multimodal LLM project from NUS, aiming to let a single system both understand and generate across text, image, video, and audio.
What It Is
NExT-GPT uses a large language model as the hub, connecting encoders and generators for different modalities. A user can input text, images, videos, or audio, and the system can output another modality or multiple modalities.
Its goal is to push multimodal modeling beyond image-text question answering toward fuller any-to-any conversion.
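In the paper, inputs pass through a unified multimodal encoder and projection layers into the LLM, and the LLM emits special signal tokens that condition diffusion decoders for image, video, and audio output. The sketch below compresses that hub-and-spoke pattern into a minimal Python shape; every name in it (AnyToAnyModel, Segment, the encoders/generators dictionaries) is an illustrative placeholder, not NExT-GPT's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Segment:
    modality: str    # "text", "image", "video", or "audio"
    payload: object  # raw text, pixels, frames, waveform, ...

class AnyToAnyModel:
    """Hub-and-spoke sketch: one LLM at the center, per-modality
    encoders on the way in, per-modality generators on the way out."""

    def __init__(
        self,
        llm: Callable[[List[object]], List[Tuple[str, str]]],
        encoders: Dict[str, Callable],
        generators: Dict[str, Callable],
    ):
        self.llm = llm                # central reasoning model
        self.encoders = encoders      # modality -> projection into LLM space
        self.generators = generators  # modality -> conditional generator

    def respond(self, inputs: List[Segment]) -> List[Segment]:
        # 1. Project every input segment into the LLM's representation space.
        tokens = [self.encoders[seg.modality](seg.payload) for seg in inputs]
        # 2. The LLM reasons over the fused sequence and plans the output:
        #    a list of (modality, condition) pairs, where non-text entries
        #    stand in for the "signal tokens" sent to downstream decoders.
        plan = self.llm(tokens)
        # 3. Route each planned step to the matching generator.
        return [Segment(m, self.generators[m](cond)) for m, cond in plan]
```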
AI Relevance
Multimodality is one of the core directions for the next stage of large models. NExT-GPT explores the orchestration problem early: how specialized models can coordinate around an LLM instead of retraining one giant model for every input-output pairing.
That path matters for research, and it gives application builders a reference for composable architectures.
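To make the composability claim concrete, here is a hypothetical use of the sketch above: every component is a stub callable, and supporting a new modality is a matter of registering an encoder and a generator rather than retraining the central model. All names here are toy stand-ins, not real NExT-GPT modules.

```python
# Toy stand-ins: each component is a plain callable, so the hub can be
# exercised end-to-end without loading any real models.
model = AnyToAnyModel(
    llm=lambda tokens: [("text", "Here is your picture:"),
                        ("image", "a red bicycle at sunset")],
    encoders={"text": lambda s: s.split()},
    generators={"text": lambda cond: cond,
                "image": lambda cond: f"<pixels conditioned on '{cond}'>"},
)

out = model.respond([Segment("text", "draw a red bicycle at sunset")])
for seg in out:
    print(seg.modality, "->", seg.payload)

# Extending coverage is registration, not retraining the central model:
model.encoders["audio"] = lambda wav: ["<audio-tokens>"]
model.generators["audio"] = lambda cond: f"<waveform conditioned on '{cond}'>"
```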
Singapore Relevance
NExT-GPT shows that NUS has globally visible work in multimodal foundation-model research. It is not a local Singapore application project, but an example of Singapore academia competing in the global race over model paradigms.
This page is a place to keep collecting citations, follow-on models, industry adoption, and links to other NUS multimodal teams.
Milestones
- 2023-08: NExT-GPT repository released
- 2024: Paper published at ICML 2024