xGen-small: Enterprise-ready Small Language Models

Salesforce AI Research introduces xGen-small, an enterprise-ready family of compact language models that combines domain-focused data curation, scalable pre-training, length extension, instruction fine-tuning, and reinforcement learning to deliver long-context Enterprise AI at predictable, low cost.

Flipping the Traditional Scale-Up Paradigm

Small language models are uniquely suited to tackle the complex, high-volume inference demands of modern enterprises. Business workflows rarely revolve around isolated queries — instead, decisions hinge on synthesizing information from internal documentation, code repositories, research reports, and real-time data streams. Yet, as the past two years have shown, chasing ever-larger model architectures yields diminishing returns: impressive capabilities come tied to skyrocketing per-request costs, constant hardware upgrades, and increased risk of exposing sensitive data. Additionally, the accelerating energy demands of these models may impede AI development. Enterprises need solutions that balance long-context comprehension with efficient, predictable, low-cost serving and robust privacy guarantees.

Historically, language models were constrained by relatively short context windows — often just a few thousand tokens — limiting their ability to sustain long conversations, process multi-page documents, or integrate disparate sources like code repositories and research reports. As practitioners began relying on retrieval-augmented generation (RAG), external tool calls, and memory mechanisms to pull in or persist information beyond those narrow windows, it became clear that genuinely long-context capacity was essential. These very augmentations illustrate why extended context is invaluable: they allow retrieved passages, tool outputs, and conversational or document history to be fed into a single forward pass, streamlining pipelines and eliminating brittle “stitching.” By engineering small language models with true long-context capabilities, we can natively process entire transcripts, filings, and codebases in a single sweep — lowering latency, preserving privacy, and ensuring precision.

Our vision flips the traditional scale-up paradigm: instead of bloating parameter counts, we shrink model size while sharpening data distributions toward enterprise-relevant domains and training protocols. This “small but long” strategy demands deep expertise across every stage — from raw data curation and scalable pre-training to length-extension mechanisms, targeted post-training, reinforcement-learning, and rigorous evaluation. Only through a vertically integrated pipeline can we optimize each component to work in concert, ensuring that the final model delivers best-in-class performance where it matters most.

In this way, small language models will provide a strategic advantage in the enterprise world. They provide the cost efficiency, privacy safeguards, and long-context understanding that large, resource-hungry counterparts cannot match — offering businesses a clear, sustainable, predictable path to deploy Enterprise AI at scale.

Small Scale, Long Context

Enterprises increasingly demand small language models (LMs) that can process extensive context, comply with stringent privacy requirements, and operate at predictable low cost and reduced environmental impact. In this work, we introduce compact yet highly competitive LMs — xGen-small — trained from scratch and optimized for compact sizes of 4B and 9B parameters with long-context support. Our vertically integrated pipeline unifies high-quality, domain-sharpened data curation with scalable pre-training, targeted length-extension techniques, task-specific post-training, reinforcement-learning fine-tuning, and rigorous evaluation.

By deliberately reducing parameter count while extending sequence capacity, we achieve long-context understanding without prohibitive inference costs. Our results show that small language models, when crafted with holistic expertise across every training stage, can rival larger counterparts while offering predictable low cost-to-serve, low energy use, and enhanced data privacy, heralding a new era where compact, context-aware models provide a strategic advantage in Enterprise AI.

Pipeline: From Data-Curation to Post-Training

Our pipeline unites every stage into a streamlined workflow. We harvest a multi-trillion-token data corpus and apply heuristic filters, classifier-based quality gating, and near-duplicate removal to sculpt a clean, diverse dataset. We then pre-train at scale on TPU with optimized learning schedules, apply targeted length-extension to expand context windows, and perform task-specific post-training to sharpen capabilities. Finally, reinforcement learning driven by reward functions refines behavior, followed by rigorous evaluation to certify performance.

Data-Curation

To help xGen-small achieve peak performance in a compact footprint, we initially harvested a corpus many times larger than our final budget of eight trillion training tokens from a wide spectrum of publicly available sources, and then carefully distilled it down.

General data formed the backbone, but we layered in carefully curated sets for code, mathematics, and natural language content so that critical verticals were present in the right proportions. Raw text was swept through fast heuristic filters to strip boilerplate and spam before passing a two‑stage quality gate: a lightweight ensemble of small classifiers handled high‑volume triage, and a larger, slower model delivered the final score that decided whether a document was discarded, kept, or even up‑sampled. Exact hashing and fuzzy fingerprinting then removed near‑duplicates so no single web page could dominate the mix, reducing memorization risk and boosting long‑tail diversity.
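
A minimal sketch of this filter, gate, and deduplicate flow might look like the code below; the heuristics, scorer callables, thresholds, and bottom-k fingerprint are illustrative stand-ins for the production components, not our actual curation code.

```python
# Illustrative curation flow: heuristic filter -> two-stage quality gate ->
# fuzzy deduplication. All thresholds, scorers, and parameters are hypothetical.
import hashlib

def heuristic_ok(doc: str) -> bool:
    # Cheap boilerplate/spam checks: minimum length and alphabetic ratio.
    alpha = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return len(doc) > 200 and alpha > 0.6

def fingerprint(doc: str, k: int = 8, keep: int = 64) -> tuple:
    # Bottom-k sketch of hashed word k-grams: a simple stand-in for MinHash/LSH.
    words = doc.split()
    shingles = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
    digests = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return tuple(digests[:keep])

def curate(docs, fast_scorer, slow_scorer, keep_threshold=0.5, upsample_threshold=0.9):
    seen, kept = set(), []
    for doc in docs:
        if not heuristic_ok(doc):
            continue
        if fast_scorer(doc) < keep_threshold:    # stage 1: high-volume triage
            continue
        score = slow_scorer(doc)                 # stage 2: final quality score
        if score < keep_threshold:
            continue
        fp = fingerprint(doc)
        if fp and fp in seen:                    # drop docs with identical fuzzy signatures
            continue
        seen.add(fp)
        kept.extend([doc, doc] if score > upsample_threshold else [doc])
    return kept
```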

Over a hundred ablation studies tweaked thresholds, sampling ratios, and deduplication strategies until we landed on the recipe that drove gains in factual accuracy and overall model usefulness — proof that painstaking data curation compounds just as powerfully as model architecture tweaks.

Pre-Training

Strong base models capture broad world knowledge and enable the “aha” moments needed for downstream RL. We pre-train on TPU v5p pods with our in-house Jaxformer v8 library — leveraging FSDP, sequence-parallel attention, aggressive prefetching, and splash kernels to maximize throughput and efficiency.
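
Jaxformer v8 is an in-house library, so the sketch below only illustrates the general FSDP idea of sharding parameters across devices using JAX’s public sharding API; the mesh axes, weight shape, and dtype are assumptions made for illustration.

```python
# Hedged sketch of FSDP-style parameter sharding with JAX's public API.
# The (fsdp, tensor) mesh layout and weight shape are illustrative only.
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental import mesh_utils

n = jax.device_count()
mesh = Mesh(mesh_utils.create_device_mesh((n, 1)), ("fsdp", "tensor"))

# Shard each weight matrix along its first dimension across the "fsdp" axis,
# so every device stores only 1/n of the parameters between computations.
w = jax.device_put(
    jnp.zeros((8192, 8192), dtype=jnp.bfloat16),
    NamedSharding(mesh, P("fsdp", "tensor")),
)
print(w.sharding)  # shows how the weight is partitioned across devices
```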

Our learning-rate schedule begins with a gradual warmup into a sustained high plateau, then moves through a fast cosine anneal, a slower taper, and a final low-rate phase — ensuring efficient training dynamics across trillions of tokens.
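
The shape of that schedule can be sketched as a simple piecewise function; the phase boundaries, peak rate, and floor below are made-up placeholders rather than the actual training configuration.

```python
# Hedged sketch of the described schedule shape: warmup -> plateau ->
# fast cosine anneal -> slower linear taper -> final low-rate phase.
import math

def lr_at(step, total_steps, peak=3e-4, floor=3e-6,
          warmup=0.01, plateau=0.50, anneal=0.80, taper=0.95):
    t = step / total_steps
    if t < warmup:                          # gradual warmup
        return peak * t / warmup
    if t < plateau:                         # sustained high plateau
        return peak
    if t < anneal:                          # fast cosine anneal down to ~10% of peak
        frac = (t - plateau) / (anneal - plateau)
        return 0.1 * peak + 0.9 * peak * 0.5 * (1 + math.cos(math.pi * frac))
    if t < taper:                           # slower linear taper toward the floor
        frac = (t - anneal) / (taper - anneal)
        return (1 - frac) * 0.1 * peak + frac * floor
    return floor                            # final low-rate phase

schedule = [lr_at(s, 100_000) for s in range(0, 100_000, 5_000)]
```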

Over training time, we sharpen the data distribution by blending low-entropy code corpora, high-entropy natural-language examples, mathematically rigorous texts, and up-weighted, classifier-filtered high-quality subsets — capturing both diversity and quality in the training mix.
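
One simple way to picture this sharpening is a mixture of domain streams whose sampling weights are interpolated over training progress; the domains and weights below are placeholders, not the real data recipe.

```python
# Illustrative only: shift sampling weights toward classifier-filtered,
# high-quality subsets as training progresses. Domains/weights are hypothetical.
import random

EARLY = {"web": 0.60, "code": 0.20, "math": 0.10, "high_quality": 0.10}
LATE  = {"web": 0.30, "code": 0.30, "math": 0.15, "high_quality": 0.25}

def mixture(progress: float) -> dict:
    # Linearly interpolate between the early and late mixtures (0 <= progress <= 1).
    return {d: (1 - progress) * EARLY[d] + progress * LATE[d] for d in EARLY}

def sample_domain(progress: float) -> str:
    weights = mixture(progress)
    return random.choices(list(weights), weights=list(weights.values()))[0]

print(sample_domain(0.9))  # late in training, high-quality data is drawn more often
```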

Results are competitive with the most recent strong baselines in their respective size classes.

Table 1: Benchmark comparison on various tasks for pre-trained base models with 4B parameters or fewer. Top score is bolded and the second-best score is underlined.

Table 2: Benchmark comparison on various tasks for pre-trained base models with 9B parameters or fewer. Top score is bolded and the second-best score is underlined.

Length-Extension

Our xGen base models are top performers on the RULER long-context benchmark, with the 9B model achieving state-of-the-art results and the 4B model placing second. In the 7-9B model class, xGen-9B remains strong as the context length increases from 4k to 128k tokens, in contrast to other open-source models whose performance tends to drop sharply. For example, the RULER score of the second-best performer, Llama-3.1-8B, drops by more than 10 points (on a 100-point scale) when the context length increases from 64k to 128k, while our model drops only about 2 points.

Table 3: Average RULER scores across context lengths of 4k to 128k tokens. Top score is bolded and the second-best score is underlined. We run RULER tests on all models with and without a chat template and report the best result for each model.

Figure 1: Detailed breakdown of RULER test results across context lengths ranging from 4k to 128k. While Llama-3.1-8B demonstrates marginally superior performance within the 4k to 32k context-length range, its effectiveness declines markedly at the 128k context length. In contrast, xGen-9B consistently exhibits strong performance across all evaluated context lengths and attains the highest overall score.

We would like to share the key factors in our long-context training experiments:

  • Two-stage length-extension – We first extend our base model to 32k and then to 128k.
  • Over-length training – Although our goal is to reach 128k, we over-train our model to 256k to further improve performance at 128k.
  • Sequence parallelism – To fit the 256k context length in memory, we use sequence sharding to reduce per-device memory usage; a minimal sketch follows after this list.
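
The sequence-sharding idea in the last bullet can be sketched with JAX’s public sharding API; the mesh layout, tensor shapes, and dtype are illustrative assumptions, and attention over a sharded sequence additionally requires cross-device collectives that a full implementation would handle.

```python
# Minimal sketch: shard activations along the sequence axis so each device
# holds only a slice of the context. Shapes are shrunk for illustration;
# in training, the sequence axis would be up to 256k tokens.
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental import mesh_utils

mesh = Mesh(mesh_utils.create_device_mesh((jax.device_count(),)), ("seq",))

batch, seq_len, d_model = 1, 8_192, 1_024
x = jax.device_put(
    jnp.zeros((batch, seq_len, d_model), dtype=jnp.bfloat16),
    NamedSharding(mesh, P(None, "seq", None)),
)
print(x.sharding)  # partitioned along the sequence dimension
```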

We would like to thank the authors of ProLong; their paper helped us shape our long-context training strategy.

Post-Training

To transform base models into capable instruction models, we apply a two-stage post-training pipeline: supervised fine-tuning followed by reinforcement learning. 

We begin by curating a broad, high-quality instruction dataset spanning math, coding, safety, and general-purpose domains such as creative writing, project management, and data analysis. This ensures both depth and breadth across tasks. 

This data foundation enables alignment, strengthens instruction-following abilities, and enhances reasoning capacity. Through supervised learning, the model learns a broad set of core behaviors, including accurate instruction following, step-by-step reasoning, and traits like helpfulness, honesty, and harmlessness. These capabilities are further refined during a large-scale reinforcement learning stage, which sharpens the model’s policy for robust downstream performance.
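
As one concrete illustration of the supervised stage, a common pattern (assumed here for illustration, not a confirmed detail of our recipe) is to average the next-token loss only over response tokens, so the model is optimized for following instructions rather than reproducing prompts.

```python
# Hypothetical prompt-masked SFT loss: per-token negative log-likelihoods are
# averaged only over assistant-response tokens. A standard pattern, shown as
# a sketch rather than the exact recipe used for xGen-small.
def sft_loss(token_nlls, is_response):
    """token_nlls: per-token negative log-likelihoods from the model.
    is_response: 1 for tokens in the assistant response, 0 for prompt tokens."""
    masked = [nll * m for nll, m in zip(token_nlls, is_response)]
    return sum(masked) / max(sum(is_response), 1)

# Example: the loss is taken only over the last three (response) tokens.
print(sft_loss([2.1, 1.8, 0.9, 0.7, 0.5], [0, 0, 1, 1, 1]))  # ≈ 0.7
```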

Our post-training approach yields strong instruction models, especially in areas that demand robust reasoning, such as math, coding, and STEM.

Table 4: Benchmark comparison on various tasks for post-trained instruction models with 4B parameters or fewer. Top score is bolded and the second-best score is underlined. * Denotes non-thinking mode used during evaluation. Results for mathematics benchmark use greedy sampling (best of 1).

Table 5: Benchmark comparison on various tasks for post-trained instruction models with 9B parameters or fewer. Top score is bolded and the second-best score is underlined. * Denotes non-thinking mode used during evaluation. Results for mathematics benchmark use greedy sampling (best of 1).

Conclusion

In summary, our work shows that deliberately limiting model size while extending sequence capacity delivers transformative benefits for Enterprise AI. Compact architectures drive down inference costs and hardware requirements, simplifying deployment and maintenance without sacrificing accuracy. At the same time, long-context processing enables seamless integration of internal documents, codebases, and other domain-specific knowledge — minimizing reliance on external retrieval steps and reducing hallucination risk. By combining meticulous data curation, scalable pre-training, targeted length-extension, and reinforcement-learning alignment within a unified pipeline, small language models can match or exceed the performance of larger counterparts. This “small but long” approach offers businesses a predictable, sustainable, cost-effective, and privacy-preserving path to harness AI at scale.
