⚡ HANDS-ON WORKSHOP

Distributed Training with PyTorch

Hands-on: DDP (DistributedDataParallel)

90 min · Kaggle Notebook · GPU T4 × 2

From a single-GPU training loop to a 2-GPU parallel trainer, one step at a time.

01 PyTorch Basics

What is PyTorch?

An open-source deep learning framework, originally developed by Meta AI and now maintained under the PyTorch Foundation.

Define-by-run (eager execution): the computation graph is built on the fly as your Python code runs, so it is easy to debug and feels like ordinary Python.

The dominant framework in research, and increasingly common in production.
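To make "define-by-run" concrete, here is a minimal sketch. The function name `repeated_double` and the input values are illustrative assumptions, not taken from the workshop notebook. A plain Python `while` loop decides how many operations run, and autograd records exactly those operations.

```python
import torch

def repeated_double(x: torch.Tensor, limit: float) -> torch.Tensor:
    """Double x until its norm reaches `limit`, using a plain Python loop."""
    steps = 0
    while x.norm() < limit:  # data-dependent control flow, evaluated eagerly
        x = x * 2
        steps += 1
    print(f"doubled {steps} times")  # ordinary print/debugger calls work here
    return x

x = torch.tensor([1.0, 1.0], requires_grad=True)
y = repeated_double(x, limit=10.0).sum()
y.backward()   # gradients flow through whichever ops actually ran
print(x.grad)  # three doublings, so d(sum)/dx is 8 for each element
```

Because the graph is rebuilt on every call, a different `limit` or input can produce a different number of recorded operations without any special graph syntax.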

Three core objects

1. torch.Tensor: a multi-dimensional array that can live on the GPU and track gradients.

2. autograd: automatic differentiation. Call loss.backward() and PyTorch computes the gradient of the loss for every tensor that requires one.

3. nn.Module: the base class for models and layers; it bundles parameters with the forward logic.
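The three objects above can be seen together in a few lines. This is a minimal sketch; `TinyNet` and the tensor values are hypothetical names and data chosen for illustration.

```python
import torch
import torch.nn as nn

# 1. torch.Tensor: requires_grad=True asks autograd to track operations on it.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# 2. autograd: backward() fills w.grad with d(loss)/d(w).
loss = (w ** 2).sum()
loss.backward()
print(w.grad)  # the gradient of sum(w^2) is 2 * w

# 3. nn.Module: bundles parameters with the forward logic.
class TinyNet(nn.Module):  # hypothetical example model
    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(3, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(x))

net = TinyNet()
out = net(torch.randn(4, 3))  # batch of 4 samples, 3 features each
n_params = sum(p.numel() for p in net.parameters())  # 3*2 weights + 2 biases
print(out.shape, n_params)
```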

The training loop pattern

    # train_step.py: every step is these five lines
    opt.zero_grad()        # clear old grads
    out = model(x)         # forward
    loss = crit(out, y)    # compute loss
    loss.backward()        # autograd → grads
    opt.step()             # update weights

Remember this shape; the whole course comes back to it.
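For reference, here are the five lines in a complete, runnable script. This is a sketch under stated assumptions: the synthetic data, the small `nn.Sequential` model, and the SGD settings are placeholders, not the model or dataset used in the workshop notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 256 samples, 10 features, 3 classes.
x = torch.randn(256, 10)
y = torch.randint(0, 3, (256,))

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
crit = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(50):
    opt.zero_grad()        # clear old grads
    out = model(x)         # forward
    loss = crit(out, y)    # compute loss
    loss.backward()        # autograd computes grads
    opt.step()             # update weights
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```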

The PyTorch stack: high-level → low-level

High-level: the API you write (nn.Module, optimizers, DataLoader). This is the clean Python you actually touch.

▼ ▼ ▼

Backend: kernels. cuDNN, cuBLAS, and hand-written CUDA do the actual math on the GPU.

Key idea: you stay high-level, and PyTorch translates your calls down to fast GPU code.
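A small sketch of what "staying high-level" looks like in practice: the same Python line runs on CPU or GPU, and only the `device` changes. The matrix sizes are arbitrary, and the comment about which backend library handles the matmul describes typical behavior, not a guarantee.

```python
import torch

# Pick the GPU when one is visible, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(512, 512, device=device)
b = torch.randn(512, 512, device=device)

# Identical Python on either device; on CUDA this is typically
# dispatched to a cuBLAS matrix-multiply kernel.
c = a @ b

print(device, c.shape)
print("cuDNN available:", torch.backends.cudnn.is_available())
```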

`torch.compile`: Python → Triton (part 1)

Since PyTorch 2.0, one line, model = torch.compile(model), can speed up training and inference with almost no code changes. Under the hood it works in three steps:

1. TorchDynamo: captures your Python code into a graph (graph capture).

2. TorchInductor: the default backend; it compiles that graph into optimized kernels.

3. Codegen: on GPU it generates Triton kernels; on CPU it generates C++/OpenMP.
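The one-line change can be sketched as follows. The helper name `compile_and_check` and the small model are illustrative assumptions. The `backend` argument is exposed only so the sketch can be tried on machines without a compiler toolchain.

```python
import torch
import torch.nn as nn

def compile_and_check(backend: str = "inductor") -> bool:
    """Compile a small model and confirm it matches the eager output."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 4))
    compiled = torch.compile(model, backend=backend)  # the one-line change

    x = torch.randn(8, 16)
    with torch.no_grad():
        eager_out = model(x)
        compiled_out = compiled(x)  # first call triggers capture and compilation
    return torch.allclose(eager_out, compiled_out, atol=1e-5)
```

Calling `compile_and_check()` uses the default TorchInductor backend, which needs a working C++ compiler on CPU or Triton on GPU. Passing `backend="eager"` runs only the TorchDynamo capture step and executes the captured graph without code generation, which is a quick way to check that a model can be captured at all.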

Triton & kernel fusion (part 2)

Triton is a language for writing GPU kernels in Python-like syntax. It makes fused kernels easy to express and handles memory tiling automatically, and its performance often comes close to hand-tuned CUDA.

Kernel fusion is the big win: many small ops merge into one kernel, which means fewer round-trips to GPU memory and therefore faster execution.

Takeaway: you write high-level PyTorch; torch.compile turns it into Triton GPU code for you.
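To see why fewer memory round-trips matter, here is a back-of-the-envelope sketch. It uses a deliberately simplified cost model (an assumption): every unfused op reads its inputs from GPU memory and writes its output back, and all tensors are the same size. The function names are hypothetical.

```python
import torch

def scale_shift_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Three pointwise ops; in eager mode each one is a separate kernel."""
    a = x * 2.0            # kernel 1: read x, write a
    b = a + y              # kernel 2: read a and y, write b
    return torch.relu(b)   # kernel 3: read b, write the result

def tensor_transfers(fused: bool) -> int:
    """Count tensor-sized memory reads plus writes under the simplified model."""
    if fused:
        return 2 + 1  # one kernel: read x and y, write the result
    return (1 + 1) + (2 + 1) + (1 + 1)  # sum over the three kernels above

x = torch.randn(1024)
y = torch.randn(1024)
out = scale_shift_relu(x, y)

print("unfused transfers:", tensor_transfers(fused=False))  # 7
print("fused transfers:  ", tensor_transfers(fused=True))   # 3
```

Under this model the fused kernel moves less than half the data for the same arithmetic, which is the kind of saving a compiler can get by merging pointwise ops. Real speedups depend on the hardware and on which ops can actually be fused.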

02 The Kaggle Platform

What is Kaggle?

A Google-owned platform for data science and machine learning. People use it for three things:

1. Competitions: companies post problems with prize money; you submit models.

2. Datasets: thousands of public datasets to explore and share.

3. Notebooks: free in-browser notebooks with GPUs / TPUs built in.

Why Kaggle for this course?

Free GPUs, including a 2× T4 option: exactly what we need to practice multi-GPU DDP without owning hardware.

Zero setup: PyTorch, CUDA, and torchvision are all pre-installed, and everything runs in the browser.

Weekly GPU quota: a limited number of hours per week, but plenty for a 90-min class.

Two settings you MUST check before class

Settings → Accelerator → GPU T4 × 2 (required). In the right-hand Settings panel, set Accelerator to GPU T4 × 2. Otherwise you only get one card and the 2-GPU DDP exercise can't run.

Settings → Internet → ON (required). Needed to download CIFAR-10. New notebooks have it OFF by default.
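A quick way to confirm the accelerator setting took effect is to ask PyTorch how many GPUs it can see. This is a minimal sketch; the helper name `gpu_report` is hypothetical, and it checks only the GPU count, not the Internet switch.

```python
import torch

def gpu_report(required: int = 2) -> bool:
    """Print the visible CUDA devices; return True if at least `required` exist."""
    count = torch.cuda.device_count()
    for i in range(count):
        print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
    ok = count >= required
    status = "OK" if ok else "check Settings → Accelerator"
    print(f"{count} GPU(s) visible, need {required}: {status}")
    return ok

ready = gpu_report()
```

Run it in the first notebook cell; if it reports fewer than two devices, change the accelerator before continuing.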

03 Course Notebook

Hands-on DDP Notebook

The full DDP walkthrough, from single-GPU loop to 2-GPU trainer, lives in the Kaggle notebook below.

We'll work through it together, cell by cell, live.

🔗 kaggle.com/code/rrrrr5kkkkkk/pytorch-tenser
