paper-to-code · run · attention-is-all-you-need
Run output of popular task 1: “implement Attention Is All You Need”.
Below is the real on-disk tree, plus three side-by-side previews that match the three things this course promises — citation anchors in src/model.py, an ambiguity audit in REPRODUCTION_NOTES.md, and a paper-quote → code-cell walkthrough in notebooks/walkthrough.ipynb.
# §3.2.1, Eq. 1 — Scaled dot-product attention
def scaled_dot_product_attention(query, key, value, mask=None, dropout=None):
    """Attention(Q, K, V) = softmax(QKᵀ / √d_k) V — §3.2.1, Eq. 1"""
    d_k = query.size(-1)  # §3.2.1, Eq. 1 — d_k
    scores = torch.matmul(query, key.transpose(-2, -1))  # §3.2.1, Eq. 1 — Q Kᵀ
    scores = scores / math.sqrt(d_k)  # §3.2.1, Eq. 1 — / √d_k
    if mask is not None:
        scores = scores + mask  # §3.2.3 — additive mask, see decision #19
    weights = F.softmax(scores, dim=-1)  # §3.2.1, Eq. 1 — softmax
    if dropout is not None:
        weights = dropout(weights)  # §5.4 — attention-weight dropout
    output = torch.matmul(weights, value)  # §3.2.1, Eq. 1 — · V
    return output, weights

# §3.2.2, Eq. 2 — Multi-head attention
class MultiHeadAttention(nn.Module):
    """MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O — §3.2.2"""
    def __init__(self, cfg):
        super().__init__()
        self.h = cfg.n_heads  # §3.2.2 — h
        self.d_k = cfg.d_head  # §3.2.2 footnote — d_k = d_model / h
        self.w_q = nn.Linear(cfg.d_model, cfg.d_model, bias=cfg.linear_bias)
        self.w_k = nn.Linear(cfg.d_model, cfg.d_model, bias=cfg.linear_bias)
        self.w_v = nn.Linear(cfg.d_model, cfg.d_model, bias=cfg.linear_bias)
        self.w_o = nn.Linear(cfg.d_model, cfg.d_model, bias=cfg.linear_bias)
        self.attn_dropout = nn.Dropout(cfg.dropout)  # §5.4 — see [UNSPECIFIED] #11

# §3.3, Eq. 4 — Position-wise feed-forward network
class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2 — §3.3, Eq. 4"""
    def __init__(self, cfg):
        super().__init__()
        self.w_1 = nn.Linear(cfg.d_model, cfg.d_ff)  # §3.3, Eq. 4 — W_1, b_1
        self.w_2 = nn.Linear(cfg.d_ff, cfg.d_model)  # §3.3, Eq. 4 — W_2, b_2
        self.act = F.relu  # §3.3, Eq. 4 — max(0, ·)

# §3.5, Eq. 5–7 — Sinusoidal positional encoding
class SinusoidalPositionalEncoding(nn.Module):
    """PE(pos,2i)=sin(pos/10000^(2i/d_model)) — §3.5, Eq. 5
    PE(pos,2i+1)=cos(pos/10000^(2i/d_model)) — §3.5, Eq. 6"""

# §3.4 — Embeddings · √d_model + weight tying
if cfg.tie_weights:
    self.out_proj.weight = self.embed.emb.weight  # §3.4 — "share the same weight matrix"
Reading order matches Figure 1: attention → multi-head → FFN → positional encoding → embeddings + tying. The green spans are the live anchors carried in the file; the red ones cross-reference the audit row on the next tab.
“We employed label smoothing of value ε_ls = 0.1 … This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.”
— Vaswani et al. 2017, §5.4
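The trade-off in the quote can be seen on a toy example. The sketch below is not the repository's loss code. It assumes the smoothing variant that spreads ε uniformly over all K classes, and `smoothed_targets` and `cross_entropy` are hypothetical helper names written in plain Python so the example runs without torch.

```python
import math

def smoothed_targets(target, num_classes, eps=0.1):
    """Mix the one-hot target with a uniform distribution over all classes.

    Assumption: eps is spread over all K classes, including the target class.
    """
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == target else uniform
            for k in range(num_classes)]

def cross_entropy(target_dist, logits):
    """H(q, p) = -sum_k q_k * log p_k, with p = softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(q * (x - log_z) for q, x in zip(target_dist, logits))

# Toy 4-class case in which the model is very confident in class 2.
logits = [0.0, 0.0, 10.0, 0.0]
hard = [0.0, 0.0, 1.0, 0.0]
soft = smoothed_targets(target=2, num_classes=4, eps=0.1)

loss_hard = cross_entropy(hard, logits)
loss_soft = cross_entropy(soft, logits)
print(f"hard-label loss: {loss_hard:.4f}")
print(f"smoothed loss:   {loss_soft:.4f}")
```

With smoothed targets, a near-certain prediction is penalised more than it is under hard labels. This is one way to read the remark that the model "learns to be more unsure" at a cost in perplexity.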
| # | Decision | Status | Our value | Paper anchor | Alternatives |
|---|---|---|---|---|---|
| 1 | d_model | SPECIFIED | 512 | §3, Table 3 | 1024 (“big” model) |
| 7 | LayerNorm placement | PARTIAL | post-norm | Fig. 1, “Add & Norm” | pre-norm (Xiong 2020) |
| 8 | LayerNorm ε | [UNSPECIFIED] | 1e-6 | — | 1e-5 (PyTorch), 1e-8 |
| 11 | Attention-weight dropout | [UNSPECIFIED] | enabled @ 0.1 | §5.4 silent on placement | drop only at residual; drop inside FFN |
| 14 | Embedding × √d_model | SPECIFIED | enabled | §3.4 | off → tiny logits early on |
| 15 | Weight tying (embed ↔ out) | SPECIFIED | shared | §3.4 | tied with source embed too (Press & Wolf 2017) |
| 17 | Weight initialisation | [UNSPECIFIED] | Xavier-uniform | — | normal(0, 0.02) (GPT), Kaiming-ReLU |
| 21 | LR schedule (Eq. 3) | SPECIFIED | Noam, warmup 4000 | §5.3 | inverse-sqrt without warmup; cosine |
| 26 | BLEU implementation | [UNSPECIFIED] | sacrebleu (case-insensitive) | §6.1 footnote | multi-bleu.perl (paper-era), nltk |
Excerpt — full audit ships 28 rows in REPRODUCTION_NOTES.md, with paper-quote columns and a debugging guide. [UNSPECIFIED] rows are the ones a future reader can safely change without diverging from the paper.
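Row 21 is the easiest row to sanity-check by hand. The sketch below evaluates the schedule from the paper's §5.3, lrate = d_model^-0.5 · min(step^-0.5, step · warmup^-1.5), using the values from rows 1 and 21. The function name `noam_lr` is an illustrative assumption, not necessarily the name used in the repository.

```python
def noam_lr(step, d_model=512, warmup=4000):
    """Learning rate at a given optimiser step (1-indexed).

    Rises linearly for `warmup` steps, then decays as 1 / sqrt(step).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, f"{noam_lr(step):.6e}")
```

The two branches of the `min` meet at `step == warmup`, so the peak rate is reached exactly at the end of warmup. Quadrupling the step count after that point halves the rate.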
Anchor in code: src/model.py::scaled_dot_product_attention. The next cell verifies that the per-head attention weights are row-stochastic — the property the softmax in Eq. 1 is supposed to give us.
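A dependency-free sketch of that check follows, under stated assumptions. Plain Python lists stand in for tensors so the cell runs without torch. `attention_weights` is a hypothetical helper that mirrors the score, scale, mask and softmax steps of `scaled_dot_product_attention` for a single head. The causal mask filled with -inf is an assumed way of building the additive mask.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, key, mask=None):
    """softmax(Q K^T / sqrt(d_k)) for one head; inputs are lists of row vectors."""
    d_k = len(query[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in key]
              for q_row in query]
    if mask is not None:
        # Additive mask: 0.0 keeps a position, -inf removes it.
        scores = [[s + m for s, m in zip(s_row, m_row)]
                  for s_row, m_row in zip(scores, mask)]
    return [softmax(row) for row in scores]

# Toy input: 3 positions, d_k = 2.
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
key = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.5]]
causal = [[0.0 if j <= i else NEG_INF for j in range(3)] for i in range(3)]

weights = attention_weights(query, key, mask=causal)
for i, row in enumerate(weights):
    assert math.isclose(sum(row), 1.0, abs_tol=1e-12)  # row-stochastic
    assert all(w >= 0.0 for w in row)                  # valid probabilities
    assert all(w == 0.0 for w in row[i + 1:])          # future positions masked
print("all rows sum to 1; masked positions carry zero weight")
```

The check covers the masked case as well. Adding -inf before the softmax removes a position without breaking the row sum, because the remaining weights are renormalised.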
Anchor in code: SinusoidalPositionalEncoding. The check below pins the closed-form values at (pos=0, dim=0,1) = (sin 0, cos 0) = (0, 1).
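One way to write that check without torch is sketched here. It assumes an even `d_model` and uses a hypothetical helper, `sinusoidal_pe`, that evaluates the two closed-form expressions for a single position instead of calling `SinusoidalPositionalEncoding` itself.

```python
import math

def sinusoidal_pe(pos, d_model=512):
    """Positional-encoding row for one position.

    Even dimensions 2i use sin, odd dimensions 2i+1 use cos, both with
    the angle pos / 10000^(2i / d_model).
    """
    row = []
    for dim in range(d_model):
        i = dim // 2
        angle = pos / (10000 ** (2 * i / d_model))
        row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return row

pe0 = sinusoidal_pe(0)
assert pe0[0] == 0.0 and pe0[1] == 1.0       # (sin 0, cos 0) = (0, 1)
assert all(v == 0.0 for v in pe0[0::2])      # every sin dimension is 0 at pos 0
assert all(v == 1.0 for v in pe0[1::2])      # every cos dimension is 1 at pos 0

pe1 = sinusoidal_pe(1)
assert math.isclose(pe1[0], math.sin(1.0))   # i = 0 gives angle = pos
assert math.isclose(pe1[1], math.cos(1.0))
print("positional-encoding closed-form checks passed")
```

Position 0 is a convenient pin because every angle is zero there, so the expected row is known exactly for any `d_model`.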