
Community Project Profile

NExT-GPT

An any-to-any multimodal LLM across text, image, video, and audio

GitHub stars
3.6k+
Paper
ICML 2024
Modalities
text / image / video / audio
Organisation
NUS NExT++ Research Center
Group
University / research
Category
Any-to-any multimodal model
Status
Research open source
Started
2023-08
Language / Form
Python
License
BSD-3-Clause
GitHub Stars
3,621
Updated
2026-05-04

NExT-GPT is a representative multimodal LLM project from NUS, aiming to let one system understand and generate across text, image, video, and audio.

What It Is

NExT-GPT uses a large language model as the hub, connecting encoders and generators for different modalities. A user can input text, images, videos, or audio, and the system can output another modality or multiple modalities.

Its aim is to push multimodal systems beyond image-text question answering toward fuller any-to-any conversion.
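The hub-and-spoke design described above can be sketched in a few lines. This is an illustrative toy, not NExT-GPT's actual code: the stub `encode`/`generate` functions, the `Modality` registry, and the pass-through "plan" all stand in for the real encoders, projection layers, and diffusion decoders a system like this would use.

```python
# Illustrative sketch of an LLM-hub "any-to-any" pipeline (assumption: not NExT-GPT's real code).
# Each modality gets an encoder into a shared representation and a generator out of it;
# the central LLM decides what to produce and routes it to the right generator.

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Modality:
    name: str
    encode: Callable[[object], str]    # raw input -> hub-readable representation
    generate: Callable[[str], object]  # hub output signal -> raw modality output


def build_registry() -> Dict[str, Modality]:
    # Stub encoders/generators; real systems would plug in multimodal
    # encoders and diffusion-style decoders here.
    return {
        "text":  Modality("text",  lambda x: str(x),       lambda s: s),
        "image": Modality("image", lambda x: f"<img:{x}>", lambda s: f"image({s})"),
        "audio": Modality("audio", lambda x: f"<aud:{x}>", lambda s: f"audio({s})"),
        "video": Modality("video", lambda x: f"<vid:{x}>", lambda s: f"video({s})"),
    }


def any_to_any(inp, in_mod: str, out_mod: str, registry: Dict[str, Modality]):
    # 1) Encode the input modality into the hub's shared space.
    tokens = registry[in_mod].encode(inp)
    # 2) The LLM hub would reason over the tokens; stubbed as a fixed plan string.
    plan = f"respond[{out_mod}]:{tokens}"
    # 3) Route the hub's output signal to the target modality's generator.
    return registry[out_mod].generate(plan)
```

The point of the pattern is that adding a fifth modality means registering one more encoder/generator pair, not retraining the hub for every new input-output pairing.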

AI Relevance

Multimodality is one of the core directions for the next stage of large models. NExT-GPT explores the orchestration problem early: how specialized models can coordinate around an LLM instead of retraining one giant model for every input-output pairing.

That path matters for research and gives application builders a composable architecture reference.

Singapore Relevance

NExT-GPT shows that NUS has globally visible work in multimodal foundation-model research. It is not a local Singapore application project, but an example of Singapore academia participating in the global competition over model paradigms.

This page is a place to keep adding citations, follow-on models, industry adoption, and links to other NUS multimodal teams.

Milestones

  1. 2023-08
    NExT-GPT repository released
  2. 2024
    Paper published at ICML 2024
