
Community Project Profile

NExT-GPT

An any-to-any multimodal LLM across text, image, video, and audio

GitHub stars
3.6k+
Paper
ICML 2024
Modalities
text / image / video / audio
Organisation
NUS NExT++ Research Center
Group
University / research
Category
Any-to-any multimodal model
Status
Research open source
Started
2023-08
Language / Form
Python
License
BSD-3-Clause
GitHub Stars
3,621
Updated
2026-05-04

NExT-GPT is a representative multimodal LLM project from NUS, aiming to let one system understand and generate across text, image, video, and audio.

What It Is

NExT-GPT uses a large language model as the hub, connecting encoders and generators for different modalities. A user can input text, images, videos, or audio, and the system can output another modality or multiple modalities.

Its aim is to push multimodal systems beyond image-text question answering toward fuller any-to-any conversion.
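The hub-and-spoke design described above can be sketched in a few lines. This is an illustrative toy, not NExT-GPT's actual code: the stub `encode`/`generate` functions, the `Modality` registry, and the pass-through "plan" all stand in for the real encoders, projection layers, and diffusion decoders a system like this would use.

```python
# Illustrative sketch of an LLM-hub "any-to-any" pipeline (assumption: not NExT-GPT's real code).
# Each modality gets an encoder into a shared representation and a generator out of it;
# the central LLM decides what to produce and routes it to the right generator.

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Modality:
    name: str
    encode: Callable[[object], str]    # raw input -> hub-readable representation
    generate: Callable[[str], object]  # hub output signal -> raw modality output


def build_registry() -> Dict[str, Modality]:
    # Stub encoders/generators; real systems would plug in multimodal
    # encoders and diffusion-style decoders here.
    return {
        "text":  Modality("text",  lambda x: str(x),       lambda s: s),
        "image": Modality("image", lambda x: f"<img:{x}>", lambda s: f"image({s})"),
        "audio": Modality("audio", lambda x: f"<aud:{x}>", lambda s: f"audio({s})"),
        "video": Modality("video", lambda x: f"<vid:{x}>", lambda s: f"video({s})"),
    }


def any_to_any(inp, in_mod: str, out_mod: str, registry: Dict[str, Modality]):
    # 1) Encode the input modality into the hub's shared space.
    tokens = registry[in_mod].encode(inp)
    # 2) The LLM hub would reason over the tokens; stubbed as a fixed plan string.
    plan = f"respond[{out_mod}]:{tokens}"
    # 3) Route the hub's output signal to the target modality's generator.
    return registry[out_mod].generate(plan)
```

The point of the pattern is that adding a fifth modality means registering one more encoder/generator pair, not retraining the hub for every new input-output pairing.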

AI Relevance

Multimodality is one of the core directions for the next stage of large models. NExT-GPT explores the orchestration problem early: how specialized models can coordinate around an LLM instead of retraining one giant model for every input-output pairing.

That path matters for research and gives application builders a composable architecture reference.

Singapore Relevance

NExT-GPT shows that NUS has globally visible work in multimodal foundation-model research. It is not a local Singapore application project, but an example of Singapore academia participating in the global competition over model paradigms.

This page is a place to keep adding citations, follow-on models, industry adoption, and links to other NUS multimodal teams.

Milestones

  1. 2023-08
    NExT-GPT repository released
  2. 2024
    Paper published at ICML 2024
