[LLM] Building GPT-2 from Scratch (3)
Advanced Training Optimizations for GPT-2
1. Hyperparameters, Gradient Clipping, and the Learning Rate Schedule
The model's hyperparameter settings are taken from the GPT-3 paper. In addition, after the backward pass we clip the global norm of the gradients: a bad batch early in training can otherwise produce a very large loss and therefore very large gradients, knocking the freshly initialized weights far off course and hurting the rest of training.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), eps=1e-8)
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip global grad norm to 1.0, called after loss.backward()
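The GPT-3 paper additionally uses a weight decay of 0.1. A common refinement (a sketch of the usual pattern, not necessarily this tutorial's exact code) is to apply decay only to the 2-D parameters (matmul weights and embeddings) and exempt biases and LayerNorm parameters:

# sketch: decay only 2-D parameters; 0.1 follows the GPT-3 paper
decay_params = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
nodecay_params = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optim_groups = [
    {"params": decay_params, "weight_decay": 0.1},
    {"params": nodecay_params, "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(optim_groups, lr=3e-4, betas=(0.9, 0.95), eps=1e-8)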
GPT-2 uses cosine decay for the learning rate, with a linear warmup at the start; you can use PyTorch's built-in schedulers or implement it yourself:
import math

max_lr = 3e-4          # peak learning rate, matching the optimizer setting above
min_lr = max_lr * 0.1  # decay down to 10% of the peak
warmup_steps = 10      # example values; tune these for your run
max_steps = 50

def get_lr(it):
    # 1) linear warmup for warmup_steps steps
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    # 2) if it > max_steps, return the minimum learning rate
    if it > max_steps:
        return min_lr
    # 3) in between, use cosine decay down to the minimum learning rate
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # coeff starts at 1 and goes to 0
    return min_lr + coeff * (max_lr - min_lr)
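Because get_lr is a plain function rather than a torch.optim scheduler, the learning rate is written into the optimizer by hand once per iteration (step below is the current iteration counter):

# inside the training loop, before optimizer.step()
lr = get_lr(step)
for param_group in optimizer.param_groups:
    param_group['lr'] = lr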
2. Gradient Accumulation
Trading time for memory: when the target batch size is too large to fit on the GPU, split it into micro-batches and accumulate gradients over several forward/backward passes before each optimizer step, as shown in the sketch after the code below.
total_batch_size = 524288  # 2**19 tokens per optimizer step, close to GPT-3's ~0.5M
B = 16                     # micro-batch size (sequences per forward pass)
T = 1024                   # sequence length
assert total_batch_size % (B * T) == 0, "total_batch_size must be divisible by B * T"
grad_accum_steps = total_batch_size // (B * T)
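Putting it together, one optimizer step looks like the sketch below. The key detail is dividing the loss by grad_accum_steps: loss.backward() sums gradients across calls, so the scaling restores the mean over the full batch. The train_loader.next_batch() method is a hypothetical stand-in for your data loader.

optimizer.zero_grad()
loss_accum = 0.0
for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()  # hypothetical loader returning (inputs, targets)
    x, y = x.to(device), y.to(device)
    logits, loss = model(x, y)
    loss = loss / grad_accum_steps    # restore the mean over the full 524288-token batch
    loss_accum += loss.detach()
    loss.backward()                   # gradients accumulate across micro-steps
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# set the learning rate for this step as shown in section 1, then:
optimizer.step()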
3. DDP (DistributedDataParallel)
Training usually runs on multiple GPUs. DDP hands each GPU a different slice of each batch, runs them in parallel, and synchronizes gradients across processes, greatly accelerating training.
DDP initialization:
import os
import torch
from torch.distributed import init_process_group

ddp = int(os.environ.get('RANK', -1)) != -1  # is this a ddp run?
if ddp:
    # use of DDP atm demands CUDA, we set the device appropriately according to rank
    assert torch.cuda.is_available(), "for now i think we need CUDA for DDP"
    init_process_group(backend='nccl')
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0  # this process will do logging, checkpointing etc.
else:
    # vanilla, non-DDP run
    ddp_rank = 0
    ddp_local_rank = 0
    ddp_world_size = 1
    master_process = True
    # attempt to autodetect device
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        device = "mps"
    print(f"using device: {device}")
4. Training Dataset
The datasets used to train GPT-2 and GPT-3 were never released; the dataset recommended here is FineWeb.
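For example, the 10B-token sample of the FineWeb-Edu subset can be pulled from Hugging Face with the datasets library (the sample-10BT config name comes from the dataset card; pick a size that fits your compute budget):

from datasets import load_dataset

# 10B-token educational subset of FineWeb; tokenize and shard this to disk for training
fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")
print(fw[0]["text"][:200])  # each record carries raw text plus metadata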
5. Splitting the Dataset and Tracking Validation Loss
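Hold out some shards of the tokenized data as a validation split and evaluate on it periodically during training. A minimal sketch, assuming a val_loader with the same hypothetical next_batch() interface as the training loader:

model.eval()
with torch.no_grad():
    val_loss_accum = 0.0
    val_loss_steps = 20                 # number of validation micro-batches to average over
    for _ in range(val_loss_steps):
        x, y = val_loader.next_batch()  # hypothetical validation loader
        x, y = x.to(device), y.to(device)
        logits, loss = model(x, y)
        val_loss_accum += loss.detach() / val_loss_steps
if master_process:
    print(f"validation loss: {val_loss_accum.item():.4f}")
model.train()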
With that, the GPT-2 pretraining setup is complete.