Multi-Node Instruction Tuning with NVIDIA NeMo: Scaling Qwen2.5-32B and Reviving s1

Outline
1. Preface
2. s1K Dataset Construction and Filtering Pipeline
3. Technical Advantages of NVIDIA NeMo for Multi-Node Fine-Tuning
4. Qwen2.5-32B-Instruct Fine-Tuning Implementation Overview
5. Outcome and Performance
6. Conclusion
7. Afterword
1. Preface
In early 2025, Dr. Fei-Fei Li’s research team introduced "s1: Simple test-time scaling". The paper presents a streamlined, reproducible pipeline showing that a model can exhibit clear test-time scaling behavior, and achieve substantial improvements in reasoning performance, by applying supervised fine-tuning (SFT) on a small amount of high-quality reasoning data together with a simple test-time control method (budget forcing). This led us to ask: given that modern foundation models already possess strong capabilities, can data quality be more effective than data quantity in improving reasoning ability?
Motivated by this question, we conducted an internal study at APMIC using NVIDIA NeMo for cross-node distributed training. We fine-tuned Qwen2.5-32B-Instruct using the same s1K dataset of 1,000 math-reasoning examples to reproduce the paper’s findings. At the time (March 2025), NeMo was transitioning from 1.0 to 2.0, and NeMo 2.0 did not yet provide a complete out-of-the-box recipe for Qwen2.5-32B. As a result, we manually patched the training configuration and workflow, and completed training on a 16× H100 multi-node setup. This article summarizes the key environment and configuration fixes, the training procedure, and our benchmark results and observations before and after fine-tuning.
2. s1K Dataset Construction and Filtering Pipeline
s1K is a 1,000-example reasoning dataset designed around the principle of being small in size, yet sufficiently difficult and diverse. Its creation can be summarized in two steps: first, generate reasoning trajectories that can serve as supervision signals for training from a large pool of problems; then, refine them into the final 1,000 samples through a multi-stage filtering process.
The paper first aggregates roughly 59K problems from 16 open-source sources, and uses Gemini Thinking to generate a corresponding reasoning trace and final answer for each problem, forming supervision signals suitable for next-token prediction. The dataset is then distilled through three gates:
Quality filtering: remove samples with formatting errors or incomplete content.
Difficulty filtering: attempt to solve each problem using Qwen2.5-7B and Qwen2.5-32B, and have Claude 3.5 Sonnet evaluate correctness. Problems that both models can solve easily are discarded, while the token length of the reasoning trace is used as an auxiliary indicator of difficulty.
Diversity filtering: categorize problems into multiple domains (around 50 domains) and perform cross-domain sampling with a preference for more challenging problems, yielding the final 1,000 curated samples.
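As a rough illustration, the three gates can be sketched in a few lines of Python. The field names and signals below (for example solved_by_7b, solved_by_32b, domain) are hypothetical stand-ins for the paper's actual grading with Claude 3.5 Sonnet and its domain classification; this is a sketch, not the authors' implementation:

import random

def filter_s1k(candidates, num_samples=1000):
    # Hedged sketch of the s1K three-gate filtering pipeline. Each candidate is
    # assumed to be a dict with hypothetical keys: 'trace', 'answer', 'domain',
    # 'solved_by_7b', 'solved_by_32b' (grading results for the two Qwen models).

    # Gate 1: quality - drop malformed or incomplete samples.
    pool = [c for c in candidates if c["trace"] and c["answer"]]

    # Gate 2: difficulty - discard problems both Qwen models solve easily.
    pool = [c for c in pool if not (c["solved_by_7b"] and c["solved_by_32b"])]

    # Gate 3: diversity - sample across domains, preferring longer reasoning
    # traces as an auxiliary indicator of difficulty.
    by_domain = {}
    for c in pool:
        by_domain.setdefault(c["domain"], []).append(c)
    for bucket in by_domain.values():
        bucket.sort(key=lambda c: len(c["trace"]), reverse=True)

    selected, domains = [], list(by_domain)
    while len(selected) < num_samples and domains:
        domain = random.choice(domains)            # uniform choice over domains
        selected.append(by_domain[domain].pop(0))  # hardest remaining problem
        if not by_domain[domain]:
            domains.remove(domain)
    return selected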
The authors later released s1K-1.1, which keeps the same set of problems but regenerates the reasoning traces and answers using a stronger reasoning model, DeepSeek-R1, further improving data quality and highlighting that the quality of the reasoning trace directly affects post-SFT reasoning performance.
3. Technical Advantages of NVIDIA NeMo for Multi-Node Fine-Tuning
The NeMo framework offers several key advantages that make it particularly well-suited for large-scale, multi-node fine-tuning of LLMs like Qwen2.5-32B.
First, NeMo natively supports distributed training through its deep integration with Megatron-LM (Megatron Core). It allows Tensor Parallelism (TP), Pipeline Parallelism (PP), and Data Parallelism (DP) to be combined to fully utilize the GPU cluster. Memory overhead is further reduced by the distributed optimizer, which shards optimizer states (and, depending on configuration, gradients) across data-parallel ranks in the spirit of ZeRO.
Second, NeMo is built on NCCL (NVIDIA Collective Communications Library) and takes full advantage of NVLink and InfiniBand for low-latency, high-throughput communication between GPUs and across nodes. This ensures efficient scaling even in complex cluster topologies, and it supports launchers such as torchrun or Slurm for seamless orchestration.
Third, NeMo introduces automatic selective activation recomputation, which intelligently recomputes memory-intensive intermediate activations—like those in Transformer attention or MLP blocks—during backpropagation. This optimization dramatically reduces peak GPU memory usage without requiring manual layer-level annotations.
Finally, NeMo offers a robust checkpointing strategy that improves both fault tolerance and I/O efficiency. Users can configure checkpoint saving frequency and retention policies, ensuring that training jobs remain stable and can recover from interruptions without consuming excessive disk space.
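In NeMo 2.0 this is typically configured through the checkpoint callback attached to the logger. The sketch below is an assumption-labeled example; the import path and argument names follow the NeMo 2.0 API but may differ between releases:

from nemo import lightning as nl

# Hedged sketch: keep only the two best checkpoints, write one every 50 steps,
# and always retain the most recent checkpoint so training can resume after an
# interruption. Argument names mirror PyTorch Lightning's ModelCheckpoint.
checkpoint_callback = nl.ModelCheckpoint(
    save_top_k=2,
    every_n_train_steps=50,
    save_last=True,
)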
Together, these capabilities enable developers to train trillion-parameter-scale models across nodes with minimal friction and maximum hardware efficiency.
4. Qwen2.5-32B-Instruct Fine-Tuning Implementation Overview
This experiment presents a comprehensive implementation of fine-tuning the Qwen2.5-32B model within the NVIDIA NeMo framework. It covers key stages such as data preprocessing, model format conversion, parallelization strategy design, distributed training deployment, and customized training architecture construction.
4.1 Data Preparation and Format Conversion
The NeMo framework requires input data in .jsonl format using the standard question-answer (QA pair) structure. The s1K high-quality instruction dataset used in this experiment was converted into the Alpaca-compatible format, where each line is a JSON object containing input and output fields. For example:
{"input": "Let $a,b,A,B$ be given reals. We consider the function defined by \\[ f(x) = 1 - a \\cdot \\cos(x) - b \\cdot \\sin(x) - A \\cdot \\cos(2x) - B \\cdot \\sin(2x). \\] Prove that if for any real number $x$ we have $f(x) \\geq 0$ then $a^2 + b^2 \\leq 2$ and $A^2 + B^2 \\leq 1.$.", "output": "Let $P(x) = 1 - a \\cos x - b \\sin x - A \\cos 2x - B \\sin 2x$.
Since $P(x) \\geq 0$ for all real $x$, by Fejer-Riesz theorem..."}
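Since s1K is distributed through Hugging Face, a short conversion script is enough to produce this .jsonl file. The column names below (question, thinking_trajectories, attempt) are assumptions based on the public simplescaling/s1K release and should be verified against the actual dataset schema:

import json
from datasets import load_dataset

# Hedged sketch: convert s1K into the input/output .jsonl format described above.
# Column names are assumptions; check them against the simplescaling/s1K card.
dataset = load_dataset("simplescaling/s1K", split="train")

with open("s1k.jsonl", "w", encoding="utf-8") as f:
    for example in dataset:
        traces = example.get("thinking_trajectories") or []
        record = {
            "input": example["question"],
            # Use the reasoning trace followed by the final attempt as the target.
            "output": (traces[0] + "\n" if traces else "") + example["attempt"],
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")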
4.2 Training Environment and Performance Evaluation
Fine-tuning was conducted on two hardware configurations. In both setups, the training converged successfully with notable efficiency gains. We launched the job using torchrun, configured inter-node communication via IP addresses, and achieved a streamlined and efficient multi-node training workflow.
Multi-node distributed training: Deployed on two nodes, each equipped with 8 NVIDIA H100 GPUs. The training was executed for 315 steps and completed in just 16 minutes—significantly faster than the 26 minutes reported in the original s1 configuration.
Single-node training: Conducted on a single machine with 8 NVIDIA B200 GPUs. The entire fine-tuning workflow was completed smoothly with full convergence.
4.3 Parallelization Strategy
Given the Qwen2.5-32B-Instruct model’s large parameter size (~32B) and the high memory bandwidth of modern GPUs like the H100 and B200, we adopted a flexible hybrid parallelism strategy tailored to different training environments:
Training Configuration | Tensor Parallel | Pipeline Parallel | Context Parallel | Notes |
16 × H100 (2 nodes) | 8 | 2 | 1 | Recommended for multi-node scaling |
8 × B200 (single node) | 8 | 2 | 1 | Main experiment configuration, memory-optimized |
Tensor Parallelism (TP): Splits large matrix computations across multiple GPUs for efficient execution of large models.
Pipeline Parallelism (PP): Divides the model layers into sequential stages, allowing different GPUs to process different parts of the model and reduce memory pressure.
Context Parallelism (CP): splits long input sequences across GPUs along the sequence dimension. It is set to 1 in this experiment, so each GPU processes the full sequence and contextual integrity is preserved.
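In a NeMo 2.0 recipe these values are exposed on the trainer's Megatron strategy. Below is a hedged sketch for the 16× H100 layout, assuming the attribute names of the default MegatronStrategy (they may shift between releases) and the configure_recipe function shown in the next subsection:

# Hedged sketch: apply the 16x H100 parallelism layout (TP=8, PP=2, CP=1).
# `recipe` is the object returned by configure_recipe() in Section 4.4.
recipe.trainer.strategy.tensor_model_parallel_size = 8    # split weight matrices across the 8 GPUs in a node
recipe.trainer.strategy.pipeline_model_parallel_size = 2  # split model layers across the 2 nodes
recipe.trainer.strategy.context_parallel_size = 1         # keep each sequence on a single GPU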
4.4 Custom Recipe and Modular Training Workflow
NeMo 2.0 introduces the Recipe concept, which modularizes components such as model initialization, data loading, optimizer setup, and learning rate scheduling. This enables users to launch a full fine-tuning pipeline with a single command, promoting reusability and configurability.
At the time of this experiment, Qwen2.5-32B was not yet natively supported in the NeMo container. Therefore, we manually defined the Qwen25Config32B class and integrated it with the built-in llm.Qwen2Model architecture as the backbone for training.
The core training configuration was defined as follows:
import nemo_run as run
from nemo.collections import llm
from nemo.collections.llm.recipes.finetune_default import default_finetune_recipe
# Import paths reflect the NeMo 2.0 container used at the time and may differ in later releases.

def configure_recipe(args):
    # Wrap the manually defined Qwen25Config32B (see above) in the built-in
    # llm.Qwen2Model backbone and build a default fine-tuning recipe around it.
    recipe = default_finetune_recipe(
        run.Config(llm.Qwen2Model, config=run.Config(Qwen25Config32B)),
        hf_id="Qwen/Qwen2.5-32B-Instruct",
        dir="your path",  # experiment output / checkpoint directory
        name=args.experiment,
        num_nodes=args.num_nodes,
        num_gpus_per_node=args.num_gpus,
        packed_sequence=True,
    )
    return recipe
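With the recipe defined, one way to launch it is through NeMo-Run's LocalExecutor with the torchrun launcher, mirroring the pattern in the NeMo 2.0 quickstart. The sketch below covers the single-node (8 GPU) case; for the two-node run, torchrun additionally needs the head node's IP address and each node's rank, as noted in Section 4.2:

import nemo_run as run

# Hedged sketch: launch the fine-tuning recipe with torchrun on one node.
def local_executor_torchrun(devices: int = 8) -> run.LocalExecutor:
    return run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun")

if __name__ == "__main__":
    recipe = configure_recipe(args)  # `args` as parsed by the surrounding script
    run.run(recipe, executor=local_executor_torchrun(devices=args.num_gpus))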
It is worth noting that after the completion of this experiment, NVIDIA officially added native support for the Qwen2.5 model series in later versions of the NeMo container. As a result, users can now directly reference the official configuration without the need for manual model structure definition.
5. Outcome and Performance
We evaluated the fine-tuned model on the MMLU benchmark, a popular multi-task test suite for general-purpose language understanding. The results were compelling:
Qwen2.5-32B-Instruct (fine-tuned on the s1K data): 82.05% accuracy
Qwen2.5-32B-Instruct (base model): 80.74% accuracy
These results further confirm the paper’s findings: even without large-scale pretraining or massive datasets, targeted fine-tuning with a small amount of high-quality data in NeMo can yield measurable improvements over the base model.
Notably, the above accuracy was achieved using a 16×H100 GPU setup (2 nodes). We also observed that similar performance was reproducible on a single-node system equipped with 8×B200 GPUs, underscoring the efficiency, adaptability, and scalability of the NeMo framework when combined with a data-centric approach.
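For readers who want to reproduce this kind of comparison, the post does not state which evaluation harness produced these numbers; one common option is lm-evaluation-harness run against an exported Hugging Face-format checkpoint. A hedged sketch, with the checkpoint path and settings as placeholders and lm-eval ≥ 0.4 assumed:

import lm_eval

# Hedged sketch: score a local HF-format checkpoint on MMLU (5-shot).
# The path is a placeholder, and this is not necessarily the harness behind
# the numbers reported above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/qwen2.5-32b-s1k,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["mmlu"])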
6. Conclusion
By integrating s1 with NVIDIA NeMo’s scalable AI infrastructure, we unlock the ability to fine-tune and deploy models tailored to specific domains and applications.
Fine-tuning large language models no longer requires massive data or compute resources—what it demands is precision in data curation and infrastructure that scales intelligently. NVIDIA NeMo offers a compelling solution: it enables memory-efficient, multi-node training with best-in-class communication, automatic recomputation, and highly configurable checkpointing.
Through this approach, our Qwen2.5-32B fine-tuning pipeline achieved improvements on real-world benchmarks like MMLU, affirming the value of high-quality instruction data and NeMo’s scalable training architecture. This workflow not only delivers strong model performance but also provides a practical blueprint for enterprise and research teams aiming to fine-tune LLMs efficiently and reliably.
7. Afterword
As of 2026, the NeMo 2.0 training stack and recipes have officially migrated to the Megatron Bridge framework. NeMo Megatron Bridge is a refactored version of the previous NeMo training stack, built on a native PyTorch training loop, giving developers greater flexibility and customizability.
Author information: Ethan Kuo (Researcher / LLM Engineer at APMIC’s R&D division, specializing in LLM fine-tuning using NVIDIA toolkits).


