🧠 Core Technology Platform / Framework Active Founded 2023-12

SEA-LION

Parent

AI Singapore

Lead Ministry

Prime Minister’s Office / SNDGO (via AISG)

Scale / KPIs

11 Southeast Asian languages; flagship model at 70B parameters; downloads in the millions on HuggingFace

Website

aisingapore.org/aiproducts/sea-lion

Last Updated

2026-05-02

SEA-LION (Southeast Asian Languages In One Network) is the open-source LLM family AI Singapore has been developing since 2023, **purpose-built for semantic fidelity in 11 Southeast Asian languages** (including Malay, Tamil, Burmese, Khmer and other smaller languages). It does not compete with GPT/Claude/Gemini on general capability — it occupies the gap that "Western big tech has no incentive to fill and Southeast Asian players lack the compute to address". By 2026, SEA-LION has reached v3 with a flagship 70B model — the **first genuinely Southeast-Asia-oriented open LLM foundation in the world**.

📖 What it is

SEA-LION is an open-source LLM family, not a single model. It includes multiple sizes (3B, 7B/8B, 70B), multiple base architectures (originally in-house, then based on Llama 3 and Gemma from v3 onwards via continued training), and multiple variants (base, instruct fine-tuned, RAG-adapted).

On the technical stack:

Training data: centred on the 11 official Southeast Asian languages (English, Chinese, Malay, Indonesian, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao); training corpus around 1 trillion tokens, with SEA languages far over-represented compared to general LLMs
Base model choice: v1 self-built architecture → v2 based on Llama 2 → v3 based on Llama 3 / Gemma with continued pre-training and instruction tuning
Compute: relies on the Singapore National Supercomputing Centre (NSCC) and sponsored compute from Google Cloud / AWS
Open-source licence: MIT / Apache, commercially friendly, allowing direct enterprise use
Companion tools: SEA-HELM (evaluation benchmark) and SEA-Guard (safety) form the complete tooling chain

Models can be downloaded directly from HuggingFace, or accessed via the official sea-lion.ai API. It is one of the few LLMs that is produced by a national-level institution yet fully open and explicitly designed to encourage commercial use.

🤖 Relation to AI

SEA-LION holds a very clear position in the LLM ecosystem: "the SOTA foundation for Southeast Asian languages".

The core technical problem it solves: general LLMs collapse on smaller Southeast Asian languages. GPT-4 might score 95 on English/Chinese tasks but drops to 30–40 on Burmese, Khmer, or Lao (reproducible on SEA-HELM). The root cause is training data — SEA languages typically make up under 1% of general LLM training corpora.

SEA-LION's approach is continued pre-training:

Take a strong base model with general capabilities (Llama 3 / Gemma)
Continue pre-training with large amounts of SEA language corpora to restore semantic fidelity in smaller languages
Without sacrificing too much English capability (the technical challenge)

Once this works, SEA-LION beats same-sized Llama 3, Gemma, and Qwen on Southeast Asian language tasks in SEA-HELM — its most compelling hard evidence.

At a broader level, SEA-LION is also an important case study for "regional adaptation of open LLMs". It proves: not every country needs to train its own GPT-4, but every language region may need its own continued pre-training variant — a pattern Indonesia, Malaysia, and Vietnam are now imitating.

🇸🇬 Relation to Singapore

SEA-LION is the most symbolically important output of Singapore's AI strategy — clearer than any policy document on "what kind of AI Singapore wants to do".

In the seven-lever framework, SEA-LION sits across three levers:

Lever 5 (government adoption): government agencies deploy localised AI services on SEA-LION, avoiding sending data to overseas big tech
Lever 6 (international): SEA-LION is Singapore's "tech calling card" at ASEAN AI cooperation, GPAI, Bletchley/Seoul summits — proof that small countries can produce globally usable open-source models
Lever 3 (industry adoption): once open-sourced, local enterprises (especially in finance, government, healthcare with sensitive data) can fine-tune directly without depending on overseas APIs

A take: SEA-LION's real value is not in its benchmark numbers but in being a "sovereign AI" reference project — it tells Southeast Asia: "you can also have your own LLM foundation, you don't have to use only OpenAI". This narrative value far exceeds its lift on any single benchmark.

But SEA-LION has real bottlenecks too:

Not trained from scratch — it depends on Meta/Google open-source bases (Llama 3 / Gemma); if those move closed-source, the whole project must restart
Resources far smaller than big tech — AISG's compute budget is roughly 1/100 of big tech, iteration speed is naturally slower
Commercial loop unproven — currently mostly used by government and open-source community; enterprise paid scenarios have not scaled

These bottlenecks are exactly the questions to be answered in the NAIS 2.0 era — should SEA-LION v4/v5 move to a self-built base? Should compute be bound to a regional GPU cluster?

🗓️ Key Milestones

2023-04
AISG launches the SEA-LION project
Announced "Southeast Asia's own open-source LLM"; initial target of 11 languages.
2023-12
SEA-LION v1 released
3B and 7B sizes; in-house architecture; MIT-licensed open source.
2024-04
SEA-HELM benchmark released
Established standardized evaluation for Southeast Asian language models; complements SEA-LION as industry benchmark.
2024-12
SEA-LION v3 released (Llama 3-based)
70B and 8B variants; SOTA on Southeast Asian languages; downloads on HuggingFace pass the million mark.
2025
SEA-Guard safety toolkit released
Companion safety evaluation and guardrail layer for SEA-LION; addresses Southeast Asian context content safety.
2025
Government agencies deploy SEA-LION
Multiple Singapore government agencies deploy SEA-LION-based internal AI assistants and public service prototypes.