⚡ HANDS-ON WORKSHOP
Distributed Training with PyTorch
Hands-on distributed training in PyTorch: DDP (DistributedDataParallel)
90 min
Kaggle Notebook
GPU T4 × 2
From a single-GPU loop ───▶ to a 2-GPU parallel trainer
Convert a single-GPU training script into a 2-GPU parallel one, step by step.
01 PyTorch Basics
What is PyTorch?
An open-source deep learning framework, originally developed by Meta AI and now maintained under the PyTorch Foundation.
Define-by-run (eager execution): the computation graph is built on the fly as your Python code runs. That makes it easy to debug, and writing it feels like normal Python.
The dominant framework in research, and increasingly common in production too.
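To make "define-by-run" concrete, here is a minimal sketch. The function name `scale_until_large` and the limit of 100 are made up for illustration. The number of loop iterations depends on the data, and autograd simply records whatever ops actually ran.

```python
import torch


def scale_until_large(x: torch.Tensor, limit: float = 100.0) -> torch.Tensor:
    """Double x until its norm exceeds `limit` (hypothetical example)."""
    # A plain Python loop: the condition is evaluated eagerly on real values,
    # so how many multiplications end up in the graph depends on the input.
    while x.norm() < limit:
        x = x * 2
    return x


x = torch.ones(3, requires_grad=True)
y = scale_until_large(x)

# Intermediate results are ordinary tensors you can print or inspect in a debugger.
print(y)

# Differentiate through the graph that was recorded during this particular run.
y.sum().backward()
print(x.grad)  # tensor([64., 64., 64.])  (six doublings happened for this input)
```

Starting from a norm of about 1.73, six doublings are needed to pass 100, so each output element is 64 times its input and the gradient is 64 everywhere.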
Three core objects
1
torch.Tensor
A multi-dimensional array that can live on the GPU and track gradients.
2
autograd
Automatic differentiation. Call loss.backward() and PyTorch computes every gradient for you.
3
nn.Module
The base class for models and layers. It bundles parameters with the forward logic.
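A minimal sketch of the three objects working together. `TinyNet` is a hypothetical model used only for illustration.

```python
import torch
from torch import nn

# 1. torch.Tensor: an n-dimensional array; requires_grad=True asks autograd to track it.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# 2. autograd: backward() fills w.grad with d(loss)/dw. Here loss = sum(w^2), so grad = 2w.
loss = (w ** 2).sum()
loss.backward()
print(w.grad)  # tensor([2., 4., 6.])


# 3. nn.Module: bundles parameters and forward logic in one object.
class TinyNet(nn.Module):  # hypothetical example model
    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(3, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)


net = TinyNet()
num_params = sum(p.numel() for p in net.parameters())
print(num_params)  # 4 (3 weights + 1 bias)
```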
The training loop pattern
train_step.py: every step is these five lines
opt.zero_grad()        # clear old grads from the previous step
out = model(x)         # forward pass
loss = crit(out, y)    # compute the loss
loss.backward()        # autograd computes the gradients
opt.step()             # update the weights
Remember this shape: the whole course comes back to it.
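The five lines need a model, a loss, an optimizer, and a batch around them before they run. The sketch below fills those in with placeholders (a tiny linear model, a synthetic batch, SGD with lr=0.1); none of these choices come from the course notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(10, 2)                            # placeholder model
crit = nn.CrossEntropyLoss()                        # placeholder loss
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # placeholder optimizer

x = torch.randn(32, 10)                             # synthetic batch of 32 samples
y = torch.randint(0, 2, (32,))                      # synthetic class labels


def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    opt.zero_grad()        # clear old grads
    out = model(x)         # forward
    loss = crit(out, y)    # compute loss
    loss.backward()        # autograd computes grads
    opt.step()             # update weights
    return loss.item()


# Repeating the step on the same batch should drive the loss down.
losses = [train_step(x, y) for _ in range(20)]
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```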
The PyTorch stack: high-level → low-level
High-level
The API you write: nn.Module, optimizers, DataLoader. Clean Python you actually touch.
▼ ▼ ▼
Backend
Kernels: cuDNN, cuBLAS, and hand-written CUDA do the actual math on the GPU.
Key idea: you stay high-level; PyTorch translates down to fast GPU code.
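A small sketch of what "staying high-level" means in practice: the same Python line runs on either device, and PyTorch picks the backend kernel. The matrix sizes are arbitrary.

```python
import torch

# Use the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)

# One high-level line. On a GPU this is typically dispatched to a cuBLAS
# matrix-multiply kernel; on a CPU it goes to a CPU BLAS routine instead.
c = a @ b

print(c.device, tuple(c.shape))
```

Nothing in the user code names a kernel or a library; the choice follows from where the tensors live.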
torch.compile: Python → Triton (part 1)
Since PyTorch 2.0, one line, model = torch.compile(model), can make training and inference faster with little or no rewrite. Under the hood there are three stages:
1
TorchDynamo
Captures your Python code into a graph (graph capture).
2
TorchInductor
The default backend. It compiles that graph into optimized kernels.
3
Codegen
On GPU → generates Triton kernels; on CPU → C++/OpenMP.
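A minimal sketch of the one-line usage, with a placeholder model. The try/except fallback is an addition for environments without a working compiler toolchain; it is not part of torch.compile itself.

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # placeholder model
x = torch.randn(8, 16)

eager_out = model(x)  # reference result from ordinary eager execution

try:
    # torch.compile is lazy: wrapping the model does not compile anything yet.
    compiled_model = torch.compile(model)
    # The first call triggers graph capture (TorchDynamo) and code generation
    # (TorchInductor), so it is much slower than later calls.
    compiled_out = compiled_model(x)
except Exception as err:
    # Assumption: if compilation is unavailable here, keep using the eager model.
    print(f"torch.compile unavailable, using eager mode: {type(err).__name__}")
    compiled_model, compiled_out = model, eager_out

# Compilation changes speed, not results (up to floating-point tolerance).
print(torch.allclose(eager_out, compiled_out, atol=1e-4))
```

Because the first call pays the compilation cost, timing comparisons are usually made after a few warm-up iterations.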
Triton & kernel fusion (part 2)
Triton is a language for writing GPU kernels in Python-like syntax. It makes fused kernels easy to express and handles much of the memory tiling and scheduling automatically, with performance that often approaches hand-tuned CUDA.
Kernel fusion is the big win: many small ops merge into one kernel → fewer memory round-trips → faster.
Takeaway: you write high-level PyTorch; torch.compile turns it into Triton GPU code for you.
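A back-of-the-envelope sketch of why fewer memory round-trips matter. The traffic model is a deliberate simplification (an assumption, not a measurement): each unfused pointwise op reads its input from GPU memory and writes its output back, while a fused kernel reads once and writes once. Caches are ignored, and `memory_traffic_bytes` is a made-up helper.

```python
def memory_traffic_bytes(num_elements: int, num_ops: int,
                         bytes_per_element: int = 4, fused: bool = False) -> int:
    """Estimate bytes moved to/from global memory for a chain of pointwise ops.

    Simplified model: one read plus one write per element per kernel launch.
    """
    kernel_launches = 1 if fused else num_ops
    return kernel_launches * 2 * num_elements * bytes_per_element


n = 1_000_000  # one million float32 values, e.g. relu(x * 2 + 1) is three pointwise ops
unfused = memory_traffic_bytes(n, num_ops=3)
fused = memory_traffic_bytes(n, num_ops=3, fused=True)

print(unfused, fused, unfused / fused)  # 24000000 8000000 3.0
```

Under this model a three-op chain moves three times less data when fused. Pointwise ops are usually limited by memory bandwidth rather than arithmetic, which is why the saving tends to show up as speed.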
02 The Kaggle Platform
What is Kaggle?
A Google-owned platform for data science and machine learning. People use it for three things:
1
Competitions
Companies post problems and prize money; you submit models.
2
Datasets
Thousands of public datasets to explore and share.
3
Notebooks
Free in-browser notebooks with GPUs / TPUs built in.
Why Kaggle for this course?
Free GPUs, including a 2× T4 option: exactly what we need to practice multi-GPU DDP without owning hardware.
Zero setup. PyTorch, CUDA, and torchvision are all pre-installed, and everything runs in the browser.
Weekly GPU quota (a limited number of hours per week), which is plenty for a 90-min class.
Two settings you MUST check before class
Settings → Accelerator → GPU T4 × 2 (required)
In the right-hand Settings panel, set Accelerator to GPU T4 × 2. Otherwise you only get one card and the two-GPU DDP exercise can't run.
Settings → Internet → ON (required)
Needed to download CIFAR-10. New notebooks have it OFF by default.
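A quick sanity check you could run in the first cell to confirm the accelerator setting took effect. `check_gpus` is a hypothetical helper, not something Kaggle or PyTorch provides.

```python
import torch


def check_gpus(required: int = 2) -> bool:
    """Return True if at least `required` CUDA devices are visible to PyTorch."""
    found = torch.cuda.device_count()
    for i in range(found):
        print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
    if found < required:
        print(f"Found {found} GPU(s), need {required}: "
              "set Settings -> Accelerator to GPU T4 x 2.")
        return False
    return True


ok = check_gpus()
```

The Internet setting shows up later rather than here: a call such as `torchvision.datasets.CIFAR10(root="./data", download=True)` is expected to fail with a network error if Internet is still OFF.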
03 Course Notebook
Hands-on DDP Notebook
The full DDP walkthrough, from single-GPU loop to 2-GPU trainer, lives in this Kaggle notebook:
🔗 kaggle.com/code/rrrrr5kkkkkk/pytorch-tenser
We'll work through it together, cell by cell, live.
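As a preview, here is a rough sketch of how the five-line training step typically changes under DDP. It is not the notebook's code: the model, the synthetic data, the `ddp_worker` name, and the file-based rendezvous are all assumptions chosen so the example also runs as a single process on CPU.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def ddp_worker(rank: int, world_size: int, init_method: str, epochs: int = 1) -> float:
    """Run a tiny training job as process `rank` out of `world_size` (illustrative)."""
    use_cuda = torch.cuda.is_available() and torch.cuda.device_count() >= world_size
    backend = "nccl" if use_cuda else "gloo"  # NCCL for GPUs, Gloo as a CPU fallback
    dist.init_process_group(backend, init_method=init_method,
                            rank=rank, world_size=world_size)
    try:
        device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
        if use_cuda:
            torch.cuda.set_device(device)

        torch.manual_seed(0)  # same seed on every rank -> identical synthetic dataset
        data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
        # The sampler gives each rank a disjoint shard of the dataset.
        sampler = DistributedSampler(data, num_replicas=world_size, rank=rank)
        loader = DataLoader(data, batch_size=16, sampler=sampler)

        model = nn.Linear(10, 2).to(device)
        # DDP averages gradients across ranks during backward().
        model = DDP(model, device_ids=[rank] if use_cuda else None)
        crit = nn.CrossEntropyLoss()
        opt = torch.optim.SGD(model.parameters(), lr=0.1)

        loss = torch.tensor(0.0)
        for epoch in range(epochs):
            sampler.set_epoch(epoch)  # reshuffle differently each epoch
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                opt.zero_grad()       # the same five lines as the single-GPU loop
                out = model(x)
                loss = crit(out, y)
                loss.backward()
                opt.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


# Single-process smoke test (world_size=1); a real run starts one process per GPU.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
final_loss = ddp_worker(rank=0, world_size=1, init_method=f"file://{init_file}")
print(f"final loss: {final_loss:.3f}")
```

The inner loop is unchanged. The additions are the process group, the `DistributedSampler`, and the `DDP` wrapper. For two GPUs, one common approach is to launch two processes from a script, for example with `torch.multiprocessing.spawn(ddp_worker, args=(2, init_method), nprocs=2)`, which passes each process its rank as the first argument.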