Record model training traces with torch.profiler and visualize and analyze them with TensorBoard

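This post uses torch.profiler to record a trace of model training and analyzes it visually with TensorBoard. The workflow is: import the required libraries, prepare the model and dataset, configure the profiler, generate JSON trace files, and finally view them in TensorBoard.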

weixin_46091520 · Published 2024-05-01 16:27:05

Steps

Prepare the data and model
Use profiler to record execution events
Run the profiler
Use TensorBoard to view results and analyze model performance
Improve performance with the help of profiler
Analyze performance with other advanced features
Additional Practices: Profiling PyTorch on AMD GPUs

1. Prepare the data and model

Import the required libraries:

import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T

Prepare the dataset:

transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
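As a small worked example of what the transform does to pixel values: T.ToTensor scales 8-bit image data to the range [0, 1], and T.Normalize then computes (value - mean) / std per channel. The sketch below reproduces that arithmetic in plain Python for the mean and std of 0.5 used above; the helper name `normalize` is only illustrative and is not part of torchvision.

```python
# Per-channel statistics passed to T.Normalize in the transform above.
mean, std = 0.5, 0.5


def normalize(value):
    """Illustrative helper: the per-channel formula T.Normalize applies."""
    return (value - mean) / std


# Pixel values after T.ToTensor lie in [0, 1]; see where the endpoints land.
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
```

With these statistics the inputs end up in the range [-1, 1], centered on zero.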

Define the model:

device = torch.device("cuda:0")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()
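The code above assumes a CUDA device is present. On a machine without a GPU, one possible approach is to pick the device at runtime and move the model with `.to(device)`. The sketch below uses a tiny `torch.nn.Linear` layer as a stand-in for ResNet-18 so that it runs quickly without downloading weights; the stand-in model is an assumption for illustration only.

```python
import torch

# Fall back to the CPU when no CUDA device is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in model for illustration; the post itself uses ResNet-18.
model = torch.nn.Linear(8, 2).to(device)
criterion = torch.nn.CrossEntropyLoss().to(device)

print("running on", device)
```

The same `.to(device)` call works for both CPU and GPU, so the rest of the training code does not need to change.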

Define the training step:

def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

2. Use profiler to record execution events

Some useful parameters are as follows:

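schedule: controls which steps are profiled. With wait=1, warmup=1, active=3, repeat=1, the profiler skips the first step/iteration, warms up on the second, and records the next three. In total, the cycle repeats once. Each cycle is called a "span" in the TensorBoard plugin.

During the wait phase the profiler is disabled. During the warmup phase the profiler starts tracing but discards the results. This reduces the impact of profiling overhead: the overhead at the start of profiling is high and would otherwise skew the results.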
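To see how these phases map onto step numbers, the object returned by `torch.profiler.schedule` can be called directly with a step index; it returns a `ProfilerAction` value. The following is a minimal sketch, assuming a reasonably recent PyTorch release, that prints the action for the first few steps of the schedule described above.

```python
import torch.profiler

# Same schedule as described above: 1 wait step, 1 warmup step, 3 active steps.
sched = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)

# The schedule is a callable mapping a step index to a ProfilerAction.
actions = [sched(step) for step in range(7)]
for step, action in enumerate(actions):
    print(f"step {step}: {action.name}")
```

Step 0 maps to NONE (wait), step 1 to WARMUP, steps 2 and 3 to RECORD, and step 4 to RECORD_AND_SAVE, which ends the cycle. Because repeat=1, every later step maps to NONE again.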

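on_trace_ready: a callable invoked at the end of each cycle. For example, torch.profiler.tensorboard_trace_handler generates the result files that TensorBoard reads. After profiling, the result files are saved in ./log/resnet18.

record_shapes: whether to record the shapes of the operator inputs.

profile_memory: track tensor memory allocation and deallocation.

with_stack: record source information (file and line number) for the operators. If TensorBoard is launched inside VS Code, clicking a stack frame jumps to the corresponding line of code.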

https://code.visualstudio.com/docs/datascience/pytorch-support#_tensorboard-integration
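Besides the TensorBoard handler, on_trace_ready accepts any callable that takes the profiler object. The sketch below is a CPU-only illustration under stated assumptions: the handler name `print_summary`, the `summaries` list, and the tiny linear model are hypothetical stand-ins, not part of the original setup. The handler prints a table of the most expensive operators at the end of the cycle.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

summaries = []  # hypothetical container so the tables can be inspected later


def print_summary(prof):
    """Hypothetical handler: called by the profiler when a cycle finishes."""
    table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
    summaries.append(table)
    print(table)


# Tiny stand-in workload; the post itself profiles ResNet-18 on a GPU.
model = torch.nn.Linear(16, 4)
inputs = torch.randn(8, 16)

with profile(
    activities=[ProfilerActivity.CPU],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=print_summary,
    record_shapes=True,
) as prof:
    for _ in range(5):
        model(inputs)
        prof.step()  # mark the step boundary for the schedule
```

With a single cycle of five steps, the handler should run once, after the last active step.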

Start and stop the profiler as a context manager:

with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        prof.step()  # Need to call this at each step to notify profiler of steps' boundary.
        if step >= 1 + 1 + 3:
            break
        train(batch_data)
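The loop bound `1 + 1 + 3` in the code above is the sum of the wait, warmup, and active values. One way to keep the bound and the schedule in sync is to derive both from the same constants. The names below (`WAIT`, `WARMUP`, `ACTIVE`, `REPEAT`, `schedule_kwargs`) are hypothetical and only serve this sketch.

```python
# Hypothetical constants mirroring the schedule arguments used above.
WAIT, WARMUP, ACTIVE, REPEAT = 1, 1, 3, 1

# One cycle covers wait + warmup + active steps; the schedule runs REPEAT cycles.
steps_per_cycle = WAIT + WARMUP + ACTIVE
total_steps = steps_per_cycle * REPEAT

# Keyword arguments that could be passed as torch.profiler.schedule(**schedule_kwargs).
schedule_kwargs = dict(wait=WAIT, warmup=WARMUP, active=ACTIVE, repeat=REPEAT)

print("steps to run:", total_steps)
```

The training loop can then break on `step >= total_steps` instead of a hard-coded sum.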

Alternatively, start and stop the profiler without a context manager:

prof = torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        with_stack=True)
prof.start()
for step, batch_data in enumerate(train_loader):
    prof.step()
    if step >= 1 + 1 + 3:
        break
    train(batch_data)
prof.stop()