
A citation-anchored Transformer reproduction: every line points back to the paper.

Run output of popular task 1: “implement Attention Is All You Need”. Below is the real on-disk tree, plus three side-by-side previews that match the three things this course promises: citation anchors in src/model.py, an ambiguity audit in REPRODUCTION_NOTES.md, and a paper-quote → code-cell walkthrough in notebooks/walkthrough.ipynb.

paper: arXiv:1706.03762 · Vaswani et al., 2017 · framework: PyTorch · base config: N=6, d_model=512, h=8, d_ff=2048 (§3, Table 3)

files shipped: 11 (tree below)
paper anchors: 42 (§3.1–§5.4, Eq. 1–7)
ambiguity audit: 28 rows, 8 flagged [UNSPECIFIED]
# src/model.py (excerpt)
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

# §3.2.1, Eq. 1 — Scaled dot-product attention
def scaled_dot_product_attention(query, key, value, mask=None, dropout=None):
    """Attention(Q, K, V) = softmax(QKᵀ / √d_k) V    — §3.2.1, Eq. 1"""
    d_k = query.size(-1)                              # §3.2.1, Eq. 1 — d_k
    scores = torch.matmul(query, key.transpose(-2, -1))  # §3.2.1, Eq. 1 — Q Kᵀ
    scores = scores / math.sqrt(d_k)                  # §3.2.1, Eq. 1 — / √d_k
    if mask is not None:
        scores = scores + mask                        # §3.2.3 — additive mask, see decision #19
    weights = F.softmax(scores, dim=-1)               # §3.2.1, Eq. 1 — softmax
    if dropout is not None:
        weights = dropout(weights)                    # §5.4 — attention-weight dropout
    output = torch.matmul(weights, value)             # §3.2.1, Eq. 1 — · V
    return output, weights

# §3.2.2, Eq. 2 — Multi-head attention
class MultiHeadAttention(nn.Module):
    """MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O    — §3.2.2"""
    def __init__(self, cfg):
        super().__init__()
        self.h = cfg.n_heads                          # §3.2.2 — h
        self.d_k = cfg.d_head                         # §3.2.2 footnote — d_k = d_model / h
        self.w_q = nn.Linear(cfg.d_model, cfg.d_model, bias=cfg.linear_bias)
        self.w_k = nn.Linear(cfg.d_model, cfg.d_model, bias=cfg.linear_bias)
        self.w_v = nn.Linear(cfg.d_model, cfg.d_model, bias=cfg.linear_bias)
        self.w_o = nn.Linear(cfg.d_model, cfg.d_model, bias=cfg.linear_bias)
        self.attn_dropout = nn.Dropout(cfg.dropout)   # §5.4 — see [UNSPECIFIED] #11

# §3.3, Eq. 4 — Position-wise feed-forward network
class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2     — §3.3, Eq. 4"""
    def __init__(self, cfg):
        super().__init__()
        self.w_1 = nn.Linear(cfg.d_model, cfg.d_ff)   # §3.3, Eq. 4 — W_1, b_1
        self.w_2 = nn.Linear(cfg.d_ff, cfg.d_model)   # §3.3, Eq. 4 — W_2, b_2
        self.act = F.relu                              # §3.3, Eq. 4 — max(0, ·)

# §3.5, Eq. 5–7 — Sinusoidal positional encoding
class SinusoidalPositionalEncoding(nn.Module):
    """PE(pos,2i)=sin(pos/10000^(2i/d_model))   — §3.5, Eq. 5
       PE(pos,2i+1)=cos(pos/10000^(2i/d_model)) — §3.5, Eq. 6"""

# §3.4 — Embeddings · √d_model + weight tying (fragment from inside the model's __init__)
if cfg.tie_weights:
    self.out_proj.weight = self.embed.emb.weight  # §3.4 — "share the same weight matrix"

Reading order matches Figure 1: attention → multi-head → FFN → positional encoding → embeddings + tying. In the rendered preview, green spans mark the live anchors carried in the file; red spans cross-reference the matching row of the ambiguity audit in REPRODUCTION_NOTES.md.

“We employed label smoothing of value ε_ls = 0.1 … This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.”
— Vaswani et al. 2017, §5.4
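
To make the quoted trade-off concrete, here is a minimal stdlib-only sketch of label smoothing with ε_ls = 0.1. The paper gives only the value of ε_ls; spreading the smoothing mass uniformly over all classes (the target included) is an assumption of this sketch, and the helper names `smoothed_targets`, `log_softmax`, and `cross_entropy` are hypothetical, not taken from the shipped code.

```python
import math


def smoothed_targets(target_index, num_classes, eps=0.1):
    """Mix a one-hot target with a uniform distribution over all classes.

    Assumption: the eps mass is spread over every class, target included.
    """
    dist = [eps / num_classes] * num_classes
    dist[target_index] += 1.0 - eps
    return dist


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(log_probs, target_dist):
    """Cross-entropy between a target distribution and predicted log-probs."""
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))


# A confident, correct prediction for class 0.
log_probs = log_softmax([4.0, 1.0, 0.0, -1.0])
hard_loss = cross_entropy(log_probs, smoothed_targets(0, 4, eps=0.0))
soft_loss = cross_entropy(log_probs, smoothed_targets(0, 4, eps=0.1))
print("hard:", hard_loss, "smoothed:", soft_loss)
```

With smoothed targets the loss on a confident correct prediction is higher than with one-hot targets, because the model is penalised for putting almost no probability on the other classes. That is one way to read the paper's remark that smoothing hurts perplexity while the model "learns to be more unsure".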

| # | Decision | Status | Our value | Paper anchor | Alternatives |
|---|---|---|---|---|---|
| 1 | d_model | SPECIFIED | 512 | §3, Table 3 | 1024 (“big” model) |
| 7 | LayerNorm placement | PARTIAL | post-norm | Fig. 1, “Add & Norm” | pre-norm (Xiong 2020) |
| 8 | LayerNorm ε | [UNSPECIFIED] | 1e-6 | - | 1e-5 (PyTorch), 1e-8 |
| 11 | Attention-weight dropout | [UNSPECIFIED] | enabled @ 0.1 | §5.4 silent on placement | drop only at residual; drop inside FFN |
| 14 | Embedding × √d_model | SPECIFIED | enabled | §3.4 | off → tiny logits early on |
| 15 | Weight tying (embed ↔ out) | SPECIFIED | shared | §3.4 | tied with source embed too (Press & Wolf 2017) |
| 17 | Weight initialisation | [UNSPECIFIED] | Xavier-uniform | - | normal(0, 0.02) (GPT), Kaiming-ReLU |
| 21 | LR schedule (Eq. 3) | SPECIFIED | Noam, warmup 4000 | §5.3 | inverse-sqrt without warmup; cosine |
| 26 | BLEU implementation | [UNSPECIFIED] | sacrebleu (case-insens) | §6.1 footnote | multi-bleu.perl (paper-era), nltk |

Excerpt — full audit ships 28 rows in REPRODUCTION_NOTES.md, with paper-quote columns and a debugging guide. [UNSPECIFIED] rows are the ones a future reader can safely change without diverging from the paper.
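
Row 21 of the audit ("Noam, warmup 4000") refers to the schedule in §5.3, lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)). The sketch below evaluates that formula directly; the function name `noam_lr` is hypothetical and is not claimed to match the shipped training code.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from §5.3: linear warmup, then inverse-sqrt decay.

    `step` counts from 1; step 0 would divide by zero in step ** -0.5.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The two branches of min() meet at step == warmup_steps, which is the peak.
for step in (1, 1000, 4000, 16000, 100000):
    print(step, noam_lr(step))
```

For the base config (d_model=512, warmup 4000) the peak value is (512 · 4000)^(-0.5), roughly 7.0e-4. Rates rise linearly before step 4000 and fall as step^(-0.5) afterwards.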

§3.2.1, Eq. 1 — Scaled dot-product attention

“We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.” — Vaswani et al., §3.2.1

Anchor in code: src/model.py::scaled_dot_product_attention. The next cell verifies that the per-head attention weights are row-stochastic — the property the softmax in Eq. 1 is supposed to give us.

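# shape check — Eq. 1 returns row-stochastic weights, one row per query position
# note: the printed shape below implies a small demo cfg (n_heads=4, d_head=16), not the base config (h=8)
B, h, S_q, S_k, d_k = 2, cfg.n_heads, 5, 7, cfg.d_head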
q = torch.randn(B, h, S_q, d_k)
k = torch.randn(B, h, S_k, d_k)
v = torch.randn(B, h, S_k, d_k)
out, weights = scaled_dot_product_attention(q, k, v)
assert out.shape == (B, h, S_q, d_k)
assert torch.allclose(weights.sum(-1), torch.ones(B, h, S_q), atol=1e-5)
print('out', tuple(out.shape), f'row-sum: {float(weights.sum(-1).mean()):.4f}')
out (2, 4, 5, 16) row-sum: 1.0000
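
The row-stochastic property should also survive the additive mask used by the `scores + mask` line in src/model.py. The stdlib-only sketch below illustrates this for a single head. The mask convention (0.0 where attention is allowed, -inf where it is blocked) and the helper names are assumptions for illustration; the excerpt does not show decision #19, so the shipped mask may differ.

```python
import math


def softmax(row):
    """Stable softmax; exp(-inf) evaluates to 0.0, so blocked keys get zero weight."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_additive_mask(size):
    """Assumed convention: 0.0 where key <= query position, -inf elsewhere."""
    return [[0.0 if k <= q else float("-inf") for k in range(size)]
            for q in range(size)]


def attention_weights(queries, keys, mask=None):
    """softmax(Q K^T / sqrt(d_k) + mask) for one head, using plain lists."""
    d_k = len(queries[0])
    weights = []
    for qi, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if mask is not None:
            scores = [s + m for s, m in zip(scores, mask[qi])]
        weights.append(softmax(scores))
    return weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = attention_weights(x, x, causal_additive_mask(3))
for row in w:
    print(row, "sum =", sum(row))
```

Each row still sums to 1, and every position to the right of the diagonal receives exactly zero weight, which is the behaviour §3.2.3 asks of decoder self-attention.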

§3.5, Eq. 5–7 — Sinusoidal positional encoding

“PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).” — Vaswani et al., §3.5

Anchor in code: SinusoidalPositionalEncoding. The check below pins the closed-form values at (pos=0, dim=0,1) = (sin 0, cos 0) = (0, 1).

# Eq. 5–7 — at pos=0 the table is exactly (0, 1, 0, 1, …)
pe = SinusoidalPositionalEncoding(cfg.d_model, cfg.max_seq_len)
table = pe.pe.squeeze(0)
assert math.isclose(table[0, 0].item(), 0.0, abs_tol=1e-6)
assert math.isclose(table[0, 1].item(), 1.0, abs_tol=1e-6)
print('PE[0, 0:4] =', table[0, :4].tolist())
PE[0, 0:4] = [0.0, 1.0, 0.0, 1.0]
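
As a cross-check that does not depend on the PyTorch module, the quoted formulas can be evaluated directly. This is a minimal sketch assuming an even d_model; `sinusoidal_pe` is a hypothetical helper, not part of src/model.py.

```python
import math


def sinusoidal_pe(pos, d_model):
    """One row of the positional-encoding table, from the §3.5 formulas.

    Even dimensions 2i use sin, odd dimensions 2i+1 use cos, and both share
    the angle pos / 10000^(2i / d_model).
    """
    row = []
    for dim in range(d_model):
        i = dim // 2
        angle = pos / (10000 ** (2 * i / d_model))
        row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return row


print("PE[0, 0:4] =", sinusoidal_pe(0, 8)[:4])
print("PE[1, 0:2] =", sinusoidal_pe(1, 8)[:2])
```

At pos=0 every angle is zero, so the row alternates between sin 0 = 0 and cos 0 = 1, matching the table check above. At pos=1 the first pair is (sin 1, cos 1), since the i=0 frequency is 1.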

§3.4 — Embeddings + weight tying

“In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation … In the embedding layers, we multiply those weights by √d_model.” — Vaswani et al., §3.4

# §3.4 — input embedding and output projection back the same tensor
model = Transformer(cfg)
assert model.embed.emb.weight.data_ptr() == model.out_proj.weight.data_ptr()
print('weight tying:', 'shared')
weight tying: shared