Source: MachineLearningMastery.com
Some language models are too large to train on a single GPU. When the model fits on a single GPU but cannot be trained with a large batch size, you can use data parallelism. However, when the model is too large to fit on a single GPU, you need to split it across multiple GPUs. In this article, you will learn how to use pipeline parallelism to split models for training. In particular, you will learn about:
- What is pipeline parallelism
- How to use pipeline parallelism in PyTorch
- How to save and restore the model with pipeline parallelism
Let’s get started!

Train Your Large Model on Multiple GPUs with Pipeline Parallelism.
Photo by Ivan Ivankovic. Some rights reserved.
Overview
This article is divided into six parts; they are:
- Pipeline Parallelism Overview
- Model Preparation for Pipeline Parallelism
- Stage and Pipeline Schedule
- Training Loop
- Distributed Checkpointing
- Limitations of Pipeline Parallelism
Pipeline Parallelism Overview
Pipeline parallelism means creating the model as a pipeline of stages. If you have worked on a scikit-learn project, you may be familiar with the concept of a pipeline. An example of a scikit-learn pipeline is:
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression())
    ])
When you pass data to this pipeline, it is processed by the first stage (StandardScaler), and the output is passed to the second stage (LogisticRegression).
A transformer model is typically just a stack of transformer blocks. Each block takes one tensor as input and produces one tensor as output. This makes it a perfect candidate for a pipeline: each stage is a transformer block, and the blocks are chained together. Executing the pipeline is mathematically equivalent to executing the model.
With a transformer model, it is straightforward to manually create a pipeline. At a high level, all you need to do is the following:
    import torch

    # TransformerBlock is a placeholder for your own nn.Module implementation
    stage1 = TransformerBlock().to("cuda:0")
    stage2 = TransformerBlock().to("cuda:1")
    stage3 = TransformerBlock().to("cuda:2")

    batch_size, seq_length, hidden_size = 4, 512, 768
    input_tensor = torch.randn(batch_size, seq_length, hidden_size).to("cuda:0")

    # Each stage runs on its own GPU; the output is copied to the next GPU
    output1 = stage1(input_tensor)
    output2 = stage2(output1.to("cuda:1"))
    output3 = stage3(output2.to("cuda:2"))
However, this method is not efficient. While stage1 runs on GPU 0, GPUs 1 and 2 are idle. Only after stage1 finishes and the tensor output1 is ready can stage2 start on GPU 1, and so on.
In PyTorch, there is infrastructure for managing the pipeline to keep all GPUs busy. This is based on the concept of micro-batches: instead of processing a batch of size $N$, you split the batch into $n$ micro-batches of size $N/n$ each. When stage2 processes the $i$-th micro-batch, stage1 can process the $(i+1)$-th micro-batch. Once all micro-batches are processed, aggregate the results to produce the final output.
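To make the overlap concrete, here is a minimal sketch that simulates only the forward pass of an idealized pipeline, assuming every stage takes exactly one time step per micro-batch. The function forward_timeline() is a hypothetical helper written for this illustration and is not part of PyTorch.

```python
from typing import Optional


def forward_timeline(n_stages: int, n_microbatches: int) -> list[list[Optional[int]]]:
    """Return timeline[t][s]: the micro-batch that stage s processes at step t, or None if idle."""
    n_steps = n_stages + n_microbatches - 1
    timeline = []
    for t in range(n_steps):
        row = []
        for s in range(n_stages):
            mb = t - s  # stage s can only start micro-batch 0 at step s
            row.append(mb if 0 <= mb < n_microbatches else None)
        timeline.append(row)
    return timeline


timeline = forward_timeline(n_stages=3, n_microbatches=4)
for t, row in enumerate(timeline):
    cells = ["idle" if mb is None else f"mb{mb}" for mb in row]
    print(f"step {t}: " + " | ".join(cells))
```

With 3 stages and 4 micro-batches, this idealized forward pass finishes in 6 steps, whereas feeding the micro-batches through one at a time would take 12 steps.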
Let’s see how you can implement a training script for pipeline parallelism in PyTorch.
Warning: The PyTorch pipeline parallelism API is experimental and may change in the future. The code in this article was tested on PyTorch 2.9.1. Running the code on a different PyTorch version may not work.
Model Preparation for Pipeline Parallelism
If your model can fit on a single GPU, distributed data parallel is preferable. When you need pipeline parallelism, your model is likely too large to fit on a single device.
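As a rough way to judge whether a model fits on one device, the sketch below estimates the training memory needed for weights, gradients, and optimizer state only. The figure of 16 bytes per parameter is an assumption (fp32 weights and gradients plus two fp32 Adam moments), activations and framework overhead are ignored, and the function name and the 7-billion-parameter example are illustrative rather than taken from this article.

```python
def training_memory_gib(n_params: int, bytes_per_param: int = 16) -> float:
    """Estimate memory for weights, gradients, and Adam state in GiB (activations excluded)."""
    return n_params * bytes_per_param / 2**30


n_params = 7_000_000_000   # hypothetical model size
n_stages = 4               # hypothetical number of pipeline stages

full = training_memory_gib(n_params)
per_stage = training_memory_gib(n_params // n_stages)
print(f"whole model: {full:.1f} GiB, per stage: {per_stage:.1f} GiB")
```

If the whole-model estimate exceeds your GPU memory while the per-stage estimate does not, splitting the model into stages is worth considering.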
Before you set up the pipeline, you need to create your model first. You have two options: either create the model for one stage so it fits on your GPU, or create the full model on a fake device and then trim it before transferring it to an actual GPU. The former requires defining your model with a stage argument in its constructor so that a particular stage can be created. For the latter, you can do the following:
    ...
    with torch.device("meta"):
        model_config = LlamaConfig()
        model = LlamaForPretraining(model_config, stage=rank)

    # Partition the model by removing some layers
    num_layers = model_config.num_hidden_layers
    partition = [num_layers // 3, 2 * num_layers // 3, num_layers]
    if rank == 0:
        # from embedding to 1/3 of the decoder layers
        for n in range(partition[0], partition[2]):
            model.base_model.layers[str(n)] = None
        model.base_model.norm = None
        model.lm_head = None
    elif rank == 1:
        # from 1/3 to 2/3 of the decoder layers
        model.base_model.embed_tokens = None
        for n in range(0, partition[0]):
            model.base_model.layers[str(n)] = None
        for n in range(partition[1], partition[2]):
            model.base_model.layers[str(n)] = None
        model.base_model.norm = None
        model.lm_head = None
    elif rank == 2:
        # from 2/3 to the end of the decoder layers and the final norm layer, LM head
        model.base_model.embed_tokens = None
        for n in range(partition[1]):
            model.base_model.layers[str(n)] = None
    else:
        raise ValueError(f"Invalid rank: {rank}")
The model is created using the class LlamaForPretraining defined in the previous post. If the model is too large, instantiating it would cause an out-of-memory error. Here, you create the model on a fake device meta. When a model is created on meta, the weights are not allocated.
In the code above, you partition the model into three stages: at rank 0 (the first stage), the model keeps the embedding layer and the first 1/3 of the decoder layers. At rank 1 (the second stage), the model keeps only the middle 1/3 of the decoder layers. At rank 2 (the third stage), the model keeps the last 1/3 of the decoder layers, the final normalization layer, and the prediction head. Components not needed in a particular stage are set to None. These stages have no overlap and tightly partition the model.
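The three-way split above can be generalized to any number of stages. The following is a minimal sketch of that arithmetic in plain Python; the helper stage_layer_ranges() is a hypothetical name introduced here, and it only computes which decoder layer indices each rank keeps.

```python
def stage_layer_ranges(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, non-overlapping ranges."""
    bounds = [i * num_layers // num_stages for i in range(num_stages + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(num_stages)]


ranges = stage_layer_ranges(num_layers=12, num_stages=3)
for rank, layers in enumerate(ranges):
    print(f"rank {rank} keeps decoder layers {list(layers)}")
```

For 12 layers and 3 stages, the boundaries are 4, 8, and 12, the same values as the partition list in the code above. When the number of layers is not divisible by the number of stages, the later stages receive the extra layers.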
To make such a model work, you need to modify the model code so that when a component is None, it is skipped in the forward pass. This needs to be done in the classes LlamaModel and LlamaForPretraining:
    ...
    class LlamaModel(nn.Module):
        """The full Llama model without any pretraining heads."""

        def __init__(self, config: LlamaConfig) -> None:
            super().__init__()
            self.rope = RotaryPositionEncoding(
                config.hidden_size // config.num_attention_heads,
                config.max_position_embeddings,
            )
            self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
            self.layers = nn.ModuleDict({
                str(i): LlamaDecoderLayer(config)
                for i in range(config.num_hidden_layers)
            })
            self.norm = nn.RMSNorm(config.hidden_size, eps=1e-5)

        def forward(self, input_ids: Tensor) -> Tensor:
            # Convert input token IDs to embeddings; on later stages the input
            # is already the hidden states from the previous stage
            if self.embed_tokens is not None:
                hidden_states = self.embed_tokens(input_ids)
            else:
                hidden_states = input_ids
            # Process through the transformer layers kept on this stage, then the final norm layer
            for n in range(len(self.layers)):
                if self.layers[str(n)] is not None:
                    hidden_states = self.layers[str(n)](hidden_states, self.rope)
            if self.norm is not None:
                hidden_states = self.norm(hidden_states)
            # Return the final hidden states
            return hidden_states


    class LlamaForPretraining(nn.Module):
        def __init__(self, config: LlamaConfig, stage) -> None:
            super().__init__()
            self.base_model = LlamaModel(config)
            self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
            self.stage = stage

        def forward(self, input_ids: Tensor) -> Tensor:
            hidden_states = self.base_model(input_ids)
            if self.lm_head is not None:
                hidden_states = self.lm_head(hidden_states)
            return hidden_states
You can see that several if-statements are added to check if the component is None before allowing it to process the hidden_states tensor.
After you create the partial model, you need to transfer it to the actual GPU. Transferring a model from the meta device to a real GPU device is done using the method to_empty(), not to(), as you need to allocate the weight tensors during the transfer:
    ...
    def reset_all_weights(model: nn.Module) -> None:
        @torch.no_grad()
        def weight_reset(m: nn.Module):
            reset_parameters = getattr(m, "reset_parameters", None)
            if callable(reset_parameters):
                m.reset_parameters()

        # Applies fn recursively to model itself and all of model.children()
        model.apply(fn=weight_reset)

    model.to_empty(device=device)
    reset_all_weights(model)
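The function reset_all_weights() calls the reset_parameters() method on every model component that defines one. This initializes the weights with each module’s default scheme, such as uniformly distributed random values in nn.Linear modules, normally distributed random values in nn.Embedding modules, and all ones in nn.RMSNorm modules. Modules that do not define reset_parameters() are skipped, so any buffers they own still hold the uninitialized memory allocated by to_empty() and must be recomputed separately.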
Stage and Pipeline Schedule
In PyTorch, pipeline parallelism should be executed using the torchrun command rather than running it as a plain Python script. This means multiple processes will be launched, each handling a stage of the pipeline.
When you write a script for torchrun, remember that the same script will be executed by multiple processes, and each process should operate only on its own scope of work. In pipeline parallelism, this means:
- The script should create only one stage of the model
- The script should set up a pipeline to allow communication between stages
The key is to use the process group in the torch.distributed module. When torchrun launches multiple processes, the total number of processes is called the world size. Each process has a unique rank, numbered from 0 to the world size minus 1. If you run these processes across multiple computers on a network, each process is also assigned a local rank, which is its index among the processes on the same machine. The local rank is used as the GPU device ID on that machine.
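The relationship between world size, rank, and local rank can be sketched in plain Python. This illustration assumes that every machine runs the same number of processes and that ranks are numbered machine by machine; the helper rank_layout() is hypothetical and does not query a real distributed job.

```python
def rank_layout(num_nodes: int, gpus_per_node: int) -> list[dict]:
    """List the rank, node, and local rank of every process in the job."""
    world_size = num_nodes * gpus_per_node
    layout = []
    for rank in range(world_size):
        local_rank = rank % gpus_per_node  # index of the process within its machine
        layout.append({
            "rank": rank,
            "node": rank // gpus_per_node,
            "local_rank": local_rank,
            "device": f"cuda:{local_rank}",
        })
    return layout


# Two machines with two GPUs each: the world size is 4
layout = rank_layout(num_nodes=2, gpus_per_node=2)
for proc in layout:
    print(proc)
```

On a single machine, the rank and the local rank coincide, which is why a one-machine script can use either to pick its GPU.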
As with distributed data parallel, you should initialize the distributed environment before you set up the pipeline:
    import os

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = dist.get_world_size()
    device = torch.device(f"cuda:{local_rank}")
Then, you can create the stage object. It specifies which stage your model belongs to, which device it should run on, and how many stages there are in total:
    ...
    from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

    stage = PipelineStage(model, stage_index=rank, num_stages=world_size, device=device)
Now that you have set up the model pipeline, you still need to specify how the data is split into micro-batches and fed through this pipeline. PyTorch offers multiple algorithms to utilize the pipeline, called schedules. The simplest one is ScheduleGPipe:
    ...
    def loss_fn(logits, target_ids):
        logits = logits.view(-1, logits.size(-1))
        target_ids = target_ids.view(-1)
        loss = F.cross_entropy(logits, target_ids, ignore_index=PAD_TOKEN_ID)
        return loss

    n_microbatches = 4  # num split per batch
    schedule = ScheduleGPipe(stage, n_microbatches=n_microbatches, loss_fn=loss_fn)
As mentioned above, the transformer model you used is a stack of transformer blocks, each of which takes one tensor as input and produces one tensor as output. In pipeline parallelism, you do not explicitly run the model’s forward and backward passes; instead, you use the pipeline schedule to coordinate the different stages of the pipeline.
Recall that the backward pass uses the output from the forward pass to compute the loss metric, then propagates the gradient back to the model parameters based on the loss. For the pipeline schedule to know how to trigger the backward pass, you need to implement a loss function, such as loss_fn() above.
The n_microbatches argument specifies how to split the batch into micro-batches. When you use pipeline parallelism, PyTorch expects a batched tensor as input to the pipeline schedule, which is then split and fed into the pipeline stages sequentially.
Micro-batches are key to keeping all GPUs busy, as each stage can process a different micro-batch in parallel. Once all micro-batches are processed, you aggregate the results to get the final output and perform gradient updates. This completes one training step; you then proceed to the next batch.
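Why aggregating over micro-batches gives the same update as processing the whole batch can be checked with simple arithmetic. The sketch below uses a one-parameter model with a mean squared error loss and made-up data; it only illustrates the idea for equal-size micro-batches and says nothing about how PyTorch implements the aggregation internally.

```python
def grad_mse(w: float, xs: list[float], ts: list[float]) -> float:
    """Gradient with respect to w of mean((w*x - t)^2) over the given samples."""
    return sum(2 * (w * x - t) * x for x, t in zip(xs, ts)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ts = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# Gradient computed on the full batch at once
full_grad = grad_mse(w, xs, ts)

# Gradient computed per micro-batch, then averaged
n_microbatches = 4
size = len(xs) // n_microbatches
micro_grads = [
    grad_mse(w, xs[i * size:(i + 1) * size], ts[i * size:(i + 1) * size])
    for i in range(n_microbatches)
]
accumulated = sum(micro_grads) / n_microbatches

print(full_grad, accumulated)
```

Both numbers agree because every micro-batch has the same size, so the mean of the micro-batch gradients equals the gradient of the full-batch mean loss.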
Not all GPUs are busy at all times. The number of idle GPUs and the duration of idle time are collectively called the bubble. Pipeline scheduling algorithms vary in how they minimize bubble formation, which is critical to the efficiency of pipeline parallelism.

Bubbles in pipeline parallelism: The numbered boxes are the micro-batches that each device is processing. The backward pass usually takes at least twice as long as the forward pass. The grey areas mean the device is idle. The illustration is from Fig. 3 of Narayanan et al. (2021).
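The size of the bubble can be estimated with a short calculation. The sketch below assumes an idealized GPipe-style schedule in which every stage takes the same time per micro-batch, an assumption that real models only approximate. Under that assumption, each device is idle for a fraction (p - 1) / (m + p - 1) of the step, where p is the number of stages and m is the number of micro-batches. The function name is a hypothetical helper for this illustration.

```python
def bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Idle share of each device's time in an idealized GPipe-style schedule."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)


for m in (1, 4, 16, 64):
    print(f"3 stages, {m:2d} micro-batches: {bubble_fraction(3, m):.1%} idle")
```

The bubble shrinks as the number of micro-batches grows, which is why a batch is usually split into more micro-batches than there are stages.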
Training Loop
Once you have instantiated the partial model, created the pipeline stage object, and configured the schedule, the data loader, optimizer, and learning rate scheduler are the same as in single-GPU training.
However, in the training loop, you should use the pipeline schedule for the forward and backward passes. You should not call the model or compute the loss metric directly. Moreover, each stage of the pipeline works differently in the training loop. Below is how you should modify the training loop for pipeline parallelism:
    ...
    for epoch in range(epochs):
        pbar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}", disable=(rank != world_size - 1))
        for batch_id, batch in enumerate(pbar):
            # zero grad before forward pass, since no explicit backward pass is called
            optimizer.zero_grad(set_to_none=True)
            # get batched data and run the pipeline
            input_ids, target_ids = batch
            if rank == 0:
                schedule.step(input_ids)
            elif rank == world_size - 1:
                losses = []  # expects one loss per microbatch
                logits = schedule.step(target=target_ids, losses=losses)
                with torch.no_grad():
                    pbar.set_postfix(loss=sum(losses).item() / len(losses))
            else:
                schedule.step()
            # gradient update through optimizer
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            # iterating over pbar already advances the progress bar by one per batch
        pbar.close()
You create the model object but never call it directly in the training loop. Instead, you pass the input tensor input_ids to the pipeline schedule if you are at rank 0. This is how you send the input to the first stage of the pipeline. For the remaining stages, call schedule.step() to have the pipeline process the output from the previous stage. In the final stage, you expect the model to produce its output. You provide the target tensor target_ids to signal that the loss function should be called to compute the loss metric and trigger the backward pass. The loss metric is not used explicitly in the training loop, as the pipeline schedule handles it internally. However, you can provide a Python list in the losses argument to store the loss metrics for each micro-batch.
After the model completes its forward and backward passes, the gradient is computed and stored with the model. You can then perform the usual gradient update processes, including gradient clipping, optimizer step, and learning rate scheduler update.
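One detail is worth keeping in mind about gradient clipping: each process holds only the parameters of its own stage, so clip_grad_norm_(model.parameters(), 1.0) measures the gradient norm of that stage alone, not of the whole model. The plain-Python sketch below uses made-up gradient values to show how the per-stage norms relate to the whole-model norm. It is idealized arithmetic only, not a distributed implementation, and it ignores the small epsilon that PyTorch adds when scaling.

```python
import math

# Hypothetical gradient values held by each of three pipeline stages
stage_grads = {0: [3.0, 4.0], 1: [0.0, 12.0], 2: [0.6, 0.8]}
max_norm = 1.0


def l2_norm(values: list[float]) -> float:
    return math.sqrt(sum(v * v for v in values))


# Norm seen by each process when it clips only its own parameters
local_norms = {rank: l2_norm(g) for rank, g in stage_grads.items()}
# Norm of the whole model: combine the per-stage norms
global_norm = l2_norm(list(local_norms.values()))

local_scale = {rank: min(1.0, max_norm / n) for rank, n in local_norms.items()}
global_scale = min(1.0, max_norm / global_norm)

print("per-stage scale factors:", local_scale)
print("whole-model scale factor:", global_scale)
```

With per-stage clipping, each stage is scaled by a different factor, whereas clipping by the whole-model norm would scale all stages by the same, smaller factor.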
Since multiple processes run concurrently, you want to keep your output clean. Therefore, the tqdm progress bar is displayed only on the last stage, where you can collect the loss metric and print it. Note that cross-entropy loss is averaged per prediction by default, so each entry in losses is already the mean over one micro-batch. Averaging these entries across all micro-batches gives a figure comparable to single-GPU training.
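The mean of the micro-batch means matches the per-token mean of the whole batch only when every micro-batch contains the same number of non-padding targets. The sketch below illustrates this with made-up per-token losses, where None marks a padding position that the loss ignores.

```python
def mean(values: list[float]) -> float:
    return sum(values) / len(values)


# Hypothetical per-token losses of two micro-batches; None marks an ignored padding position
micro_batches = [
    [2.0, 4.0, None, None],
    [1.0, 1.0, 1.0, 3.0],
]

# What the progress bar reports: the mean of the per-micro-batch means
per_microbatch = [mean([x for x in mb if x is not None]) for mb in micro_batches]
mean_of_means = mean(per_microbatch)

# What single-GPU training on the whole batch would report
all_tokens = [x for mb in micro_batches for x in mb if x is not None]
token_mean = mean(all_tokens)

print(mean_of_means, token_mean)
```

The two figures differ here because the first micro-batch has only two non-padding tokens. For monitoring purposes, the difference is usually small when sequences have similar lengths.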
Distributed Checkpointing
Pipeline parallelism is unique in that no process contains the full model. Therefore, you cannot use model.state_dict() to get the model weights and save them with torch.save().
Saving the model with pipeline parallelism is tricky: you need to ensure all processes save the model at the same training step, so that no process saves weights that have already been updated while another saves weights that have not. You also want to avoid reassembling the full model in any process to maintain speed.
In PyTorch, you need to use the distributed checkpointing API for this purpose. You typically save both the model and optimizer state together since they are tightly coupled. Below is a save function:
    ...
    from torch.distributed.checkpoint import load, save
    from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict, StateDictOptions

    def save_checkpoint(model, optimizer):
        dist.barrier()
        model_state, optimizer_state = get_state_dict(
            model, optimizer, options=StateDictOptions(full_state_dict=True)
        )
        save(
            {"model": model_state, "optimizer": optimizer_state},
            checkpoint_id="checkpoint-dist",  # each rank will save its own file
        )
        dist.barrier()
Before you save, call dist.barrier() to synchronize all processes. After you save, call dist.barrier() again to ensure all save operations are complete before resuming training, so that no process moves on to the next weight update while another is still writing its checkpoint.
Unlike torch.save(), you do not save to a single file. Instead, each process saves to a different file based on its rank. You also do not use model.state_dict() for this purpose. The save() function takes a checkpoint ID, which is the directory name to use. The file created by each process will be named __3_0.distcp for rank 3, for example. This is not in the same format as files created by torch.save().
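To check which ranks have written their part of a checkpoint, you can group the files in the checkpoint directory by the rank encoded in their names. The sketch below is a plain-Python illustration that creates empty placeholder files in a temporary directory to mimic the naming pattern described above; the helper shards_by_rank() and the exact set of files are assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path


def shards_by_rank(checkpoint_dir: str) -> dict[int, list[str]]:
    """Group files named like __<rank>_<index>.distcp by their rank."""
    pattern = re.compile(r"^__(\d+)_(\d+)\.distcp$")
    shards: dict[int, list[str]] = {}
    for path in sorted(Path(checkpoint_dir).iterdir()):
        match = pattern.match(path.name)
        if match:
            shards.setdefault(int(match.group(1)), []).append(path.name)
    return shards


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for a checkpoint written by three ranks
    for name in ["__0_0.distcp", "__1_0.distcp", "__2_0.distcp", ".metadata"]:
        (Path(tmp) / name).touch()
    shards = shards_by_rank(tmp)

print(shards)
```

If a rank is missing from the result, the checkpoint is incomplete and should not be used to resume training.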
To restore the model, you use a similar workflow:
    ...
    def load_checkpoint(model, optimizer):
        dist.barrier()
        model_state, optimizer_state = get_state_dict(
            model, optimizer, options=StateDictOptions(full_state_dict=True)
        )
        load(
            {"model": model_state, "optimizer": optimizer_state},
            checkpoint_id="checkpoint-dist",  # each rank will load its own file
        )
        # necessary if model.load_state_dict() should be called
        set_state_dict(
            model, optimizer,
            model_state_dict=model_state, optim_state_dict=optimizer_state,
            options=StateDictOptions(broadcast_from_rank0=True, full_state_dict=True)
        )
        dist.barrier()
The load() function is similar to save(): you need to pass a checkpoint ID and a dictionary of states. Unlike torch.load(), which returns a state dictionary, load() fills the dictionary you pass in place. Therefore, you first need to call get_state_dict() to obtain the model and optimizer state dictionaries for load() to overwrite.
Since load() updates the weights in place, you simply need to call it with the correct arguments and fence it with dist.barrier() to ensure all processes are synchronized. However, some models may override the load_state_dict() method to perform additional operations. To be safe, you can call set_state_dict() as shown above to trigger the load_state_dict() method on both the model and optimizer. This does no harm if the in-place weight updates are already sufficient.
Also note that if you have other objects not managed by the pipeline, such as the learning rate scheduler, you still need to use torch.save() and torch.load() to save and restore them.
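As an illustration of saving such an object, the sketch below stores and restores a learning rate scheduler with torch.save() and torch.load(). It is a minimal single-process example: the file name scheduler.pt, the dictionary keys, and the tiny SGD optimizer are assumptions made for the demonstration, not part of the training script in this article.

```python
import os
import tempfile

import torch
from torch.optim.lr_scheduler import LinearLR

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=10)

# Pretend three training steps have happened
for _ in range(3):
    optimizer.step()
    scheduler.step()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scheduler.pt")
    # Save the scheduler state together with the step counter needed to resume
    torch.save({"scheduler": scheduler.state_dict(), "step": 3}, path)

    # On restart: rebuild the same optimizer and scheduler, then load the saved state
    new_optimizer = torch.optim.SGD([param], lr=0.1)
    new_scheduler = LinearLR(new_optimizer, start_factor=0.1, end_factor=1.0, total_iters=10)
    state = torch.load(path)
    new_scheduler.load_state_dict(state["scheduler"])

print(new_scheduler.last_epoch, new_scheduler.get_last_lr())
```

In a multi-process job, every rank steps its scheduler identically, so you may choose to write this file from a single rank, provided all ranks can read it when resuming.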
That’s all that’s needed to run model training with pipeline parallelism. For completeness, below is the full code:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 |
import dataclasses import os import datasets import tokenizers import torch import torch.distributed as dist import torch.nn as nn import torch.nn.functional as F import torch.optim.lr_scheduler as lr_scheduler import tqdm from torch import Tensor from torch.distributed.checkpoint import load, save from torch.distributed.checkpoint.state_dict import StateDictOptions, get_state_dict, set_state_dict from torch.distributed.pipelining import PipelineStage, ScheduleGPipe # Build the model @dataclasses.dataclass class LlamaConfig: “”“Define Llama model hyperparameters.”“” vocab_size: int = 50000 # Size of the tokenizer vocabulary max_position_embeddings: int = 2048 # Maximum sequence length hidden_size: int = 768 # Dimension of hidden layers intermediate_size: int = 4*768 # Dimension of MLP’s hidden layer num_hidden_layers: int = 12 # Number of transformer layers num_attention_heads: int = 12 # Number of attention heads num_key_value_heads: int = 3 # Number of key-value heads for GQA class RotaryPositionEncoding(nn.Module): “”“Rotary position encoding.”“” def __init__(self, dim: int, max_position_embeddings: int) -> None: “”“Initialize the RotaryPositionEncoding module. Args: dim: The hidden dimension of the input tensor to which RoPE is applied max_position_embeddings: The maximum sequence length of the input tensor ““” super().__init__() self.dim = dim self.max_position_embeddings = max_position_embeddings # compute a matrix of ntheta_i N = 10_000.0 inv_freq = 1.0 / (N ** (torch.arange(0, dim, 2) / dim)) inv_freq = torch.cat((inv_freq, inv_freq), dim=–1) position = torch.arange(max_position_embeddings) sinusoid_inp = torch.outer(position, inv_freq) # save cosine and sine matrices as buffers, not parameters self.register_buffer(“cos”, sinusoid_inp.cos()) self.register_buffer(“sin”, sinusoid_inp.sin()) def forward(self, x: Tensor) -> Tensor: “”“Apply RoPE to tensor x. 
Args: x: Input tensor of shape (batch_size, seq_length, num_heads, head_dim) Returns: Output tensor of shape (batch_size, seq_length, num_heads, head_dim) ““” batch_size, seq_len, num_heads, head_dim = x.shape dtype = x.dtype # transform the cosine and sine matrices to 4D tensor and the same dtype as x cos = self.cos.to(dtype)[:seq_len].view(1, seq_len, 1, –1) sin = self.sin.to(dtype)[:seq_len].view(1, seq_len, 1, –1) # apply RoPE to x x1, x2 = x.chunk(2, dim=–1) rotated = torch.cat((–x2, x1), dim=–1) output = (x * cos) + (rotated * sin) return output class LlamaAttention(nn.Module): “”“Grouped-query attention with rotary embeddings.”“” def __init__(self, config: LlamaConfig) -> None: super().__init__() self.hidden_size = config.hidden_size self.num_heads = config.num_attention_heads self.head_dim = self.hidden_size // self.num_heads self.num_kv_heads = config.num_key_value_heads # GQA: H_kv < H_q # hidden_size must be divisible by num_heads assert (self.head_dim * self.num_heads) == self.hidden_size # Linear layers for Q, K, V projections self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False) self.k_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False) self.v_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False) self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding) -> Tensor: bs, seq_len, dim = hidden_states.size() # Project inputs to Q, K, V query_states = self.q_proj(hidden_states).view(bs, seq_len, self.num_heads, self.head_dim) key_states = self.k_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim) value_states = self.v_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim) # Apply rotary position embeddings query_states = rope(query_states) key_states = rope(key_states) # Transpose tensors from BSHD to BHSD dimension for 
scaled_dot_product_attention query_states = query_states.transpose(1, 2) key_states = key_states.transpose(1, 2) value_states = value_states.transpose(1, 2) # Use PyTorch’s optimized attention implementation # setting is_causal=True is incompatible with setting explicit attention mask attn_output = F.scaled_dot_product_attention( query_states, key_states, value_states, is_causal=True, dropout_p=0.0, enable_gqa=True, ) # Transpose output tensor from BHSD to BSHD dimension, reshape to 3D, and then project output attn_output = attn_output.transpose(1, 2).reshape(bs, seq_len, self.hidden_size) attn_output = self.o_proj(attn_output) return attn_output class LlamaMLP(nn.Module): “”“Feed-forward network with SwiGLU activation.”“” def __init__(self, config: LlamaConfig) -> None: super().__init__() # Two parallel projections for SwiGLU self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False) self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False) self.act_fn = F.silu # SwiGLU activation function # Project back to hidden size self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False) def forward(self, x: Tensor) -> Tensor: # SwiGLU activation: multiply gate and up-projected inputs gate = self.act_fn(self.gate_proj(x)) up = self.up_proj(x) return self.down_proj(gate * up) class LlamaDecoderLayer(nn.Module): “”“Single transformer layer for a Llama model.”“” def __init__(self, config: LlamaConfig) -> None: super().__init__() self.input_layernorm = nn.RMSNorm(config.hidden_size, eps=1e–5) self.self_attn = LlamaAttention(config) self.post_attention_layernorm = nn.RMSNorm(config.hidden_size, eps=1e–5) self.mlp = LlamaMLP(config) def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding) -> Tensor: # First residual block: Self-attention residual = hidden_states hidden_states = self.input_layernorm(hidden_states) attn_outputs = self.self_attn(hidden_states, rope=rope) hidden_states = 
attn_outputs + residual # Second residual block: MLP residual = hidden_states hidden_states = self.post_attention_layernorm(hidden_states) hidden_states = self.mlp(hidden_states) + residual return hidden_states class LlamaModel(nn.Module): “”“The full Llama model without any pretraining heads.”“” def __init__(self, config: LlamaConfig) -> None: super().__init__() self.rope = RotaryPositionEncoding( config.hidden_size // config.num_attention_heads, config.max_position_embeddings, ) self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size) self.layers = nn.ModuleDict({ str(i): LlamaDecoderLayer(config) for i in range(config.num_hidden_layers) }) self.norm = nn.RMSNorm(config.hidden_size, eps=1e–5) def forward(self, input_ids: Tensor) -> Tensor: # Convert input token IDs to embeddings if self.embed_tokens is not None: hidden_states = self.embed_tokens(input_ids) else: hidden_states = input_ids # Process through all transformer layers, then the final norm layer for n in range(len(self.layers)): if self.layers[str(n)] is not None: hidden_states = self.layers[str(n)](hidden_states, self.rope) if self.norm is not None: hidden_states = self.norm(hidden_states) # Return the final hidden states, and copy over the attention mask return hidden_states class LlamaForPretraining(nn.Module): def __init__(self, config: LlamaConfig) -> None: super().__init__() self.base_model = LlamaModel(config) self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) def forward(self, input_ids: Tensor) -> Tensor: hidden_states = self.base_model(input_ids) if self.lm_head is not None: hidden_states = self.lm_head(hidden_states) return hidden_states # Generator function to create padded sequences of fixed length class PretrainingDataset(torch.utils.data.Dataset): def __init__(self, dataset: datasets.Dataset, tokenizer: tokenizers.Tokenizer, seq_length: int, device: torch.device = None): self.dataset = dataset self.tokenizer = tokenizer self.device = device 
        self.seq_length = seq_length
        self.bot = tokenizer.token_to_id("[BOT]")
        self.eot = tokenizer.token_to_id("[EOT]")
        self.pad = tokenizer.token_to_id("[PAD]")

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        """Get a sequence of token ids from the dataset. [BOT] and [EOT] tokens
        are added. Clipped and padded to the sequence length.
        """
        seq = self.dataset[index]["text"]
        tokens: list[int] = [self.bot] + self.tokenizer.encode(seq).ids + [self.eot]
        # pad to target sequence length
        toklen = len(tokens)
        if toklen < self.seq_length+1:
            pad_length = self.seq_length+1 - toklen
            tokens += [self.pad] * pad_length
        # return the sequence
        x = torch.tensor(tokens[:self.seq_length], dtype=torch.int64, device=self.device)
        y = torch.tensor(tokens[1:self.seq_length+1], dtype=torch.int64, device=self.device)
        return x, y


def load_checkpoint(model: nn.Module, optimizer: torch.optim.Optimizer) -> None:
    dist.barrier()
    model_state, optimizer_state = get_state_dict(
        model, optimizer, options=StateDictOptions(full_state_dict=True),
    )
    load(
        {"model": model_state, "optimizer": optimizer_state},
        checkpoint_id="checkpoint-dist",
    )
    set_state_dict(
        model, optimizer,
        model_state_dict=model_state, optim_state_dict=optimizer_state,
        options=StateDictOptions(broadcast_from_rank0=True, full_state_dict=True),
    )
    dist.barrier()


def save_checkpoint(model: nn.Module, optimizer: torch.optim.Optimizer) -> None:
    dist.barrier()
    model_state, optimizer_state = get_state_dict(
        model, optimizer, options=StateDictOptions(full_state_dict=True),
    )
    save(
        {"model": model_state, "optimizer": optimizer_state},
        checkpoint_id="checkpoint-dist",
    )
    dist.barrier()


# Load the tokenizer and dataset
tokenizer = tokenizers.Tokenizer.from_file("bpe_50K.json")
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", "sample-10BT", split="train")

# Initialize the distributed environment
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
local_rank = int(os.environ["LOCAL_RANK"])
world_size = dist.get_world_size()
device = torch.device(f"cuda:{local_rank}")
print(f"World size {world_size}, rank {rank}, local rank {local_rank}. Using {device}")
assert world_size == 3, f"This script is designed for 3 GPUs, got {world_size}"

# Create pretraining model with default config on meta device to prevent OOM
with torch.device("meta"):
    model_config = LlamaConfig()
    model = LlamaForPretraining(model_config)

# Partition the model by removing some layers
num_layers = model_config.num_hidden_layers
partition = [num_layers // 3, 2 * num_layers // 3, num_layers]
if rank == 0:
    # from embedding to 1/3 of the decoder layers
    for n in range(partition[0], partition[2]):
        model.base_model.layers[str(n)] = None
    model.base_model.norm = None
    model.lm_head = None
elif rank == 1:
    # from 1/3 to 2/3 of the decoder layers
    model.base_model.embed_tokens = None
    for n in range(0, partition[0]):
        model.base_model.layers[str(n)] = None
    for n in range(partition[1], partition[2]):
        model.base_model.layers[str(n)] = None
    model.base_model.norm = None
    model.lm_head = None
elif rank == 2:
    # from 2/3 to the end of the decoder layers and the final norm layer, LM head
    model.base_model.embed_tokens = None
    for n in range(partition[1]):
        model.base_model.layers[str(n)] = None
else:
    raise ValueError(f"Invalid rank: {rank}")


# Move model from meta device to CUDA device, then initialize the weights
def reset_all_weights(model: nn.Module) -> None:
    @torch.no_grad()
    def weight_reset(m: nn.Module):
        reset_parameters = getattr(m, "reset_parameters", None)
        if callable(reset_parameters):
            m.reset_parameters()

    # Applies fn recursively to model itself and all of model.children()
    model.apply(fn=weight_reset)


model.to_empty(device=device)
reset_all_weights(model)
model.train()
stage = PipelineStage(model, stage_index=rank, num_stages=world_size, device=device)

# Training parameters
epochs = 3
learning_rate = 1e-3
batch_size = 64
seq_length = 512
num_warmup_steps = 1000
PAD_TOKEN_ID = tokenizer.token_to_id("[PAD]")

# DataLoader, optimizer, scheduler, and loss function
dataset = PretrainingDataset(dataset, tokenizer, seq_length, device)
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
)
num_training_steps = len(dataloader) * epochs
print(f"Number of training steps: {num_training_steps} = {len(dataloader)} * {epochs}")

optimizer = torch.optim.AdamW(
    model.parameters(), lr=learning_rate, betas=(0.9, 0.99), eps=1e-8, weight_decay=0.1,
)
warmup_scheduler = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.1, end_factor=1.0, total_iters=num_warmup_steps,
)
cosine_scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps - num_warmup_steps,
    eta_min=0,
)
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[num_warmup_steps],
)

# if checkpoint-dist dir exists, load the checkpoint to model and optimizer
# Note: You should implement how to reset the epoch and step to allow correct resume
if os.path.exists("checkpoint-dist"):
    load_checkpoint(model, optimizer)


# Create pipeline schedule
def loss_fn(logits: Tensor, target_ids: Tensor) -> Tensor:
    logits = logits.view(-1, logits.size(-1))
    target_ids = target_ids.view(-1)
    return F.cross_entropy(logits, target_ids, ignore_index=PAD_TOKEN_ID)


n_microbatches = 4  # num split per batch
schedule = ScheduleGPipe(stage, n_microbatches=n_microbatches, loss_fn=loss_fn)

# start training
for epoch in range(epochs):
    pbar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}", disable=(rank != world_size - 1))
    for batch_id, batch in enumerate(pbar):
        if batch_id % 1000 == 0:
            save_checkpoint(model, optimizer)
        # zero grad before forward pass, since no explicit backward pass is called
        optimizer.zero_grad(set_to_none=True)
        # get batched data
        input_ids, target_ids = batch
        if rank == 0:
            schedule.step(input_ids)
        elif rank == world_size - 1:
            losses = []  # expects one loss per microbatch
            logits = schedule.step(target=target_ids, losses=losses)
            with torch.no_grad():
                pbar.set_postfix(loss=sum(losses).item() / len(losses))
        else:
            schedule.step()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
    pbar.close()

# Save the model
save_checkpoint(model, optimizer)

# Clean up the distributed environment
dist.destroy_process_group()
|
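To make the partitioning arithmetic in the script easier to follow, here is a minimal sketch in plain Python. It reproduces the `partition` list from the code above and reports which decoder layers each rank keeps. The helper name `layers_for_rank` and the layer count of 12 are hypothetical, chosen only for illustration; the actual count comes from `LlamaConfig`. The microbatch size assumes the schedule splits each batch evenly along the batch dimension.

```python
def layers_for_rank(num_layers: int, rank: int) -> list[int]:
    """Hypothetical helper: decoder layer indices kept by a rank in a 3-stage split."""
    # Same boundaries as the `partition` list in the training script
    partition = [num_layers // 3, 2 * num_layers // 3, num_layers]
    start = 0 if rank == 0 else partition[rank - 1]
    return list(range(start, partition[rank]))


num_layers = 12  # assumed value for illustration only
for rank in range(3):
    kept = layers_for_rank(num_layers, rank)
    print(f"rank {rank} keeps decoder layers {kept[0]}..{kept[-1]}")

# Values taken from the training script
batch_size = 64
n_microbatches = 4
microbatch_size = batch_size // n_microbatches
print(f"each microbatch holds {microbatch_size} sequences")
```

When the number of layers is not divisible by 3, the integer division makes the stages slightly uneven, so one GPU may hold a layer or two more than the others.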
Be sure to run this script with the torchrun command. For example, on a single computer with 3 GPUs:
|
torchrun --standalone --nproc_per_node=3 training.py |
If you need to run it on multiple machines, run a command like the following on each machine, with a different node rank on each:
|
# run on each machine, with a different node rank
torchrun --nnodes=3 --nproc_per_node=1 --node_rank=0 --master_addr=10.1.1.1 --master_port=12345 training.py |
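The training script reads both the global rank and `LOCAL_RANK`, and the difference matters once you move from one machine to several. The sketch below is a plain-Python illustration: `read_torchrun_env` is a hypothetical helper, and the simulated environments are assumed values showing what `torchrun` would typically export (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) for the two launch modes above.

```python
import os
from typing import Mapping


def read_torchrun_env(env: Mapping[str, str] = os.environ) -> dict[str, int]:
    """Hypothetical helper: collect the rank variables that torchrun exports."""
    return {
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }


# Simulated environments (assumed values), one dict per process.
# Single machine with 3 GPUs: global rank and local rank coincide.
single_machine = [
    {"RANK": str(r), "LOCAL_RANK": str(r), "WORLD_SIZE": "3"} for r in range(3)
]
# Three machines with 1 GPU each: every process has local rank 0.
three_machines = [
    {"RANK": str(r), "LOCAL_RANK": "0", "WORLD_SIZE": "3"} for r in range(3)
]

for name, envs in [("single machine", single_machine), ("three machines", three_machines)]:
    for env in envs:
        info = read_torchrun_env(env)
        print(f"{name}: pipeline stage {info['rank']} runs on cuda:{info['local_rank']}")
```

This is why the script selects the GPU with the local rank but selects the pipeline stage with the global rank: on three single-GPU machines, every process uses `cuda:0` yet each one hosts a different stage.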
Limitations of Pipeline Parallelism
Comparing the model code from the previous post and the code above, you can see that the model no longer takes the attention mask as input. Instead, the attention function in the class LlamaAttention is called with is_causal=True to create a causal attention mask internally.
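Numerically, these two implementations are equivalent. Padding is appended at the end of each sequence, so under a causal mask no real token ever attends to a padding token, and the training loss ignores the padding positions. However, without the padding mask, you spend more time computing attention weights at padded positions that are never used.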
This modification is necessary to use pipeline parallelism, as the pipeline schedule does not work well when the model takes two arguments in the forward pass. This may improve in the future, as the PyTorch pipeline-parallelism API is still experimental.
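The equivalence between the causal-only mask and the causal-plus-padding mask can be checked on a toy example. The sketch below is not the article's `LlamaAttention`; it is a simplified single-head attention in plain Python that assumes `q = k = v = x` and uses made-up 2-dimensional embeddings, with padding placed at the end of the sequence as in `PretrainingDataset`.

```python
import math


def attention(x, allowed):
    """Toy single-head attention with q = k = v = x.

    allowed(i, j) tells whether query position i may attend to key position j.
    """
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        keys = [j for j in range(len(x)) if allowed(i, j)]
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d) for j in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * x[j][c] for w, j in zip(weights, keys)) / total
                    for c in range(d)])
    return out


# Three real tokens followed by two padding tokens (made-up embeddings)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [0.5, -0.5]]
is_pad = [False, False, False, True, True]

causal_only = attention(x, lambda i, j: j <= i)
causal_and_pad = attention(x, lambda i, j: j <= i and not is_pad[j])

for i, pad in enumerate(is_pad):
    diff = max(abs(a - b) for a, b in zip(causal_only[i], causal_and_pad[i]))
    kind = "pad " if pad else "real"
    print(f"position {i} ({kind}): max difference = {diff:.6f}")
```

The outputs at the real-token positions are identical under both masks, while the padded positions differ. Since the loss skips targets equal to the padding token, those differing positions do not affect training, only the amount of computation.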
Further Readings
Below are some resources that you may find useful:
- L. Guan, D. Li, J. Liang, W. Wang, K. Ge, X. Lu (2024) Advances of pipeline model parallelism for deep learning training: An overview. Journal of Computer Science and Technology, Vol 39(3), pp. 567–584. Springer.
- Narayanan et al (2021) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- Qi et al (2023) Zero Bubble Pipeline Parallelism
- Phillip Lippe (2022), “Training Models at Scale”, UvA DL Notebooks
- H. Huang (2024), Training with Zero-Bubble Pipeline Parallelism
- Pipeline Parallelism, from PyTorch documentation
- Introduction to Distributed Pipeline Parallelism, from PyTorch tutorials
- Getting Started with Distributed Checkpoint, from PyTorch recipes
Summary
In this article, you learned about pipeline parallelism and how to use it in PyTorch. Specifically, you learned:
- Pipeline parallelism is a technique to train a model on multiple GPUs by splitting the model into multiple stages.
- The pipeline schedule coordinates the pipeline’s stages.
- Distributed checkpointing is used to save and restore the model weights and optimizer state in a distributed environment, since you no longer have a single process with access to the full model.
- There are limitations in the current PyTorch pipeline-parallelism API. Your model may require modifications to support pipeline parallelism.
