# Day 3 (11/6)

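## Transformer
The Transformer is a model developed by Google that is built on Self-Attention. **It is the basic architecture of today's LLMs and sits at their core.**
Today's lecture has two parts. (1) After a brief explanation of the underlying concepts, we read through a well-known blog post that explains the Transformer.
We use that post because its visual presentation makes the ideas very easy to follow.
(2) We then work through hands-on exercises with T5 (Text-to-Text Transfer Transformer), a successor to the Transformer.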

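In this lecture we cover only the core ideas of the theory. The finer details can be looked up when you need them.
Because this is a hands-on class, we focus on actually running code rather than digging deep into theory.
The goal is for you to experiment freely and come up with your own ideas, such as "Could I do this too?" or "It might be fun to try that."
(This section uses T5; from the next section onward we plan to use Llama3.2-1B.)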


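### Basic concepts

#### Encoder and decoder structure and their roles
On Day 2 we learned about encode / decode as performed by the tokenizer.
This time we look at the encoder / decoder at the model level.
The roles are similar in spirit.
The encoder takes in the meaning and structure of the input sentence and produces feature vectors (an internal representation) that capture the important information.
The decoder uses those feature vectors to generate the output for the target task (translation, summarization, question answering, and so on).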

```{image} /_static/img/day3/The_transformer_encoders_decoders.png
:width: 80%
:align: center
```

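#### Self-Attention
Self-Attention is the core mechanism of the Transformer. Each encoder and decoder contains a self-attention mechanism, which lets the model capture relationships between the words of its input flexibly.
For example, the figure below uses the following English sentence:
"The animal didn’t cross the street because it was too tired."

Look at the word "it" (highlighted in gray on the right):
strong attention is directed at "animal" on the left.
In other words, the model has learned that "it" refers to "animal" (the two are related in context).
(The strength of the attention is shown by how dark each line is: the darker the line, the stronger the attention.)

In this way, every word in the sentence refers to every other word under Self-Attention.
By learning which words matter and which words are closely related,
the model can capture the meaning of the whole sentence.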

```{image} /_static/img/day3/transformer_self-attention_visualization.png
:width: 50%
:align: center
```
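
The idea above can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product self-attention, not the implementation used by any particular library: the three token vectors are made-up numbers, and the same vectors are used as queries, keys, and values (a real Transformer first applies three learned projection matrices).

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a list of token vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity between this token's query and every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # New representation = weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional vectors for three tokens (made-up, not from a trained model)
tokens = ["animal", "street", "it"]
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, attn = self_attention(x, x, x)
for tok, w in zip(tokens, attn):
    print(tok, [round(v, 2) for v in w])
```

Because the toy vector for "it" was chosen to be close to the one for "animal", the row printed for "it" puts more weight on "animal" than on "street", which mirrors the darker line in the figure.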


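#### Positional Encoding
What should happen when the same word appears twice in a sentence?
The same word often plays a different role or carries a different meaning depending on where it appears.
The model therefore needs access to each word's **position**.
Positional Encoding provides this.

Self-Attention does not handle order information by itself, so a wave-like pattern that represents the position is added to each word's vector.
This lets the model tell the order of the words apart.

The figure below visualizes the positional encoding values.
The vertical axis is the token position and the horizontal axis is the embedding dimension.
The changing colors (the stripes) are the values of the sine and cosine waves at each position,
and you can see that every position gets a different pattern.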

```{image} /_static/img/day3/attention-is-all-you-need-positional-encoding.png
:width: 60%
:align: center
```
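
As a rough sketch, the sinusoidal pattern can be computed as follows. The code follows the formula in the original paper, where even dimensions use sine and odd dimensions use cosine; the sizes (4 positions, 8 dimensions) are arbitrary values picked for readability, and the exact dimension layout in the figure may differ.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encoding: one row of d_model values per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each printed row is the vector added to the word embedding at that position. No two rows are the same, which is what lets the model distinguish positions.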


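#### Decoding
The Transformer decoder generates its output autoregressively.
That is, it predicts the next word one at a time, based on the words it has already generated.

##### The autoregressive flow
In the figure below, the input sentence (e.g., Je suis étudiant) is converted into feature vectors by the encoder.
The decoder then outputs one word at a time, using the words generated so far as context.

At each step, **the model computes a probability distribution over every word in the vocabulary to answer the question "which word is most likely to come next?"**
The next token is chosen from that distribution.

For example: <br>
Step 1: generate "I" <br>
Step 2: "I" → generate "am" <br>
Step 3: "I am" → generate "a" <br>
Step 4: "I am a" → generate "student" <br>

The decoder thus completes the sentence by repeatedly producing a probability distribution and predicting one word at a time.
The figure below shows how the output is generated at each time step.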

![](/_static/img/day3/transformer_decoding_1.gif)
![](/_static/img/day3/transformer_decoding_2.gif)
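
The loop can be sketched with a toy stand-in for the decoder. The probability table `NEXT_TOKEN_PROBS` is hypothetical and only looks at the previous token; a real decoder conditions on the encoder output and on all tokens generated so far.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "<start>": {"I": 0.9, "a": 0.1},
    "I": {"am": 0.8, "a": 0.2},
    "am": {"a": 0.7, "student": 0.3},
    "a": {"student": 0.9, "<end>": 0.1},
    "student": {"<end>": 1.0},
}

def generate(max_steps=10):
    """Generate tokens one at a time until <end> or the step limit."""
    tokens = ["<start>"]
    for _ in range(max_steps):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]      # distribution for the next token
        next_token = max(probs, key=probs.get)    # take the most probable one
        if next_token == "<end>":
            break
        tokens.append(next_token)                 # feed it back in on the next step
    return tokens[1:]

print(generate())  # ['I', 'am', 'a', 'student']
```

The key point is the feedback: each generated token becomes part of the input for the next step.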


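##### How the output token is chosen (decoding strategies)
Several methods can be used to pick the next word from the probability distribution obtained above.

| Method | Description |
| ------ | ----------- |
| **Greedy decoding** | Picks the most probable word at every step. Simple, but the output tends to be monotonous. |
| **Sampling** | Draws a word at random in proportion to its probability in the output distribution. This tends to produce more varied text. |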
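
The difference between the two strategies is easy to see on a small made-up distribution. The vocabulary and probabilities below are illustrative values, not model output.

```python
import random

vocab = ["student", "teacher", "doctor", "cat"]
probs = [0.6, 0.25, 0.1, 0.05]   # made-up distribution for the next token

# Greedy decoding: always take the most probable token
greedy_token = vocab[probs.index(max(probs))]

# Sampling: draw a token at random, in proportion to its probability
rng = random.Random(0)           # fixed seed so that a rerun gives the same draws
sampled_tokens = rng.choices(vocab, weights=probs, k=10)

print("greedy :", greedy_token)
print("sampled:", sampled_tokens)
```

Greedy decoding returns "student" every time, while sampling mostly returns "student" but occasionally picks the less likely words.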


:::{admonition} Figure credits
:class: note
All of the figures above are taken from [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).
:::


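#### Other details
If you are interested, look into the smaller components and later refinements on your own.
Keep in mind, though, that some of what the original paper describes has since been replaced.
It is practical to understand the basics first and then check the current mechanisms and improvements.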

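```{admonition} How, and how much, have the components of recent Transformers changed compared with the original Transformer?
:class: tip
The table below summarizes the changes. These components keep being refined, so unless you specialize in model architecture research, you do not need to follow all of them all the time. Looking them up when needed is enough.
Keeping up with the latest developments is of course worthwhile, but papers with small improvements appear one after another, and tracking everything is hard. It is therefore enough to know the major improvements that have been validated over time and are actually in wide use.
```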
| Component | Original Transformer (2017) | Recent Transformer / LLM models | Main purpose / effect |
| --------- | --------------------------- | ------------------------------- | --------------------- |
| **Self-Attention structure** | ・Global attention across all tokens | ・Sparse / Local / Sliding Window Attention<br>・Grouped Query Attention (GQA), Multi-Query Attention (MQA)<br>・Faster computation with FlashAttention | Lower compute and memory cost; more efficient long-text processing |
| **Positional Encoding** | ・Fixed sinusoidal (sin/cos) absolute positional encoding<br>・Position information is not learned | ・RoPE (Rotary Position Embedding), learnable relative position embeddings<br>・Generalizes more easily to long texts | Represents positional relationships naturally even in long contexts |
| **Feed Forward Network (FFN)** | ・Simple two-layer MLP + ReLU | ・Gated FFN variants such as Gated Linear Unit (GLU), SwiGLU, and ReGLU | Gradient stability and better performance |
| **Normalization** | ・Post-LayerNorm (LayerNorm after each block) | ・Pre-LayerNorm, RMSNorm | Stabilizes large models |


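### The Transformer explainer blog
There is a very famous blog post explaining the Transformer that many researchers have relied on: [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/). (Our designated textbook, 「直感LLM」, was written by the same author and builds on this post.)
We will skim through [a Japanese translation of it](https://tips-memo.com/translation-jayalmmar-transformer) together.
In class we only cover the parts that correspond to "Basic concepts", but if you are curious, please read the other parts as well. Follow the figures to get a feel for the important steps.

- [These slides by Lucas Beyer](https://docs.google.com/presentation/d/1ZXFIhYczos679r70Yu8vV9uO6B1J0ztzeDxbnBxD1S0/edit?slide=id.g31364026ad_3_2#slide=id.g31364026ad_3_2) are also a very clear resource.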


:::{admonition} Decoder-only models
:class: tip
Most current LLMs (GPT, LLaMA, and others) are **decoder-only models**. So why did the encoder fall out of use? It is an interesting question to think about.
Decoder-only is the mainstream today, but encoder-decoder models may well attract attention again.
In fact, Google DeepMind recently published a related paper, [Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model](https://arxiv.org/abs/2510.26622), so please read it if you are interested.
:::


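### Exercises with T5 (Text-to-Text Transfer Transformer)
The original Transformer is a fairly old model, and no official pretrained checkpoint is published on Hugging Face, so it is not easy to try out.  
Instead, we will practice with T5, a successor model that, like the Transformer, was developed by Google.

Like the Transformer, T5 has an encoder-decoder structure. The original Transformer was specialized for translation, whereas T5 was trained on larger and more diverse data. It handles a wide range of tasks, including translation, summarization, question answering, classification, and generation, and achieves higher performance.

[T5 is available in several versions.](https://huggingface.co/collections/google/t5-release)
There are five sizes: `small`, `base`, `large`, `xl (3B)`, and `xxl (11B)`.
This exercise uses [T5-Small](https://huggingface.co/google-t5/t5-small).
According to Hugging Face, it was downloaded about 2.65 million times last month, so it is still widely used even though it dates from 2020.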

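```{admonition} Exercise (40 min)
:class: important
Let's run a T5 model with the Hugging Face transformers library. Read through the following and run the code as you go.
```
To run on Colab: open a [New Notebook](https://colab.new/) and enable the GPU: "Runtime" → "Change runtime type" → select "T4 GPU" and click "Save".

On Day 1 we used the `Pipeline` class. Now that we have covered the `Tokenizer` on Day 2, let's also try the `AutoModel` classes.
`Pipeline` is convenient and easy to use, but `AutoModel` is the usual choice in research and development when you want to change the details.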


#### Using T5 with Pipeline
```python
import torch
from transformers import pipeline

# 1. Create a pipeline for the text2text-generation task: a T5 model generates text output from text input
pipe = pipeline(
    task="text2text-generation",  # text-to-text generation (e.g., translation, summarization)
    model="google-t5/t5-small",   # lightweight T5 model published by Google
    dtype=torch.float16,          # compute in half precision (float16) to save memory
    device=0                      # use GPU device 0; pass -1 for CPU
)

# 2. Translation example: the prefix "translate English to French:" requests English-to-French translation
result = pipe("translate English to French: The weather is nice today.")
print(result)
```


#### Using AutoModel
:::{admonition} Seq2Seq
:class: tip
Encoder-decoder models are also called Seq2Seq (Sequence to Sequence) models.
That is why the code below uses a class named `AutoModelForSeq2SeqLM`.
`Auto` means that the appropriate class is chosen automatically from the model name you pass in.
`LM` stands for Language Model.

Similar classes include `AutoModelForCausalLM` for decoder-only models (GPT, LLaMA, etc.) and `AutoModelForMaskedLM` for encoder-only models (BERT, etc.).
:::

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# 1. Load the tokenizer: it converts text into token IDs (numbers)
tokenizer = AutoTokenizer.from_pretrained(
    "google-t5/t5-small"   # small T5 model published by Google
)

# 2. Load the T5 model itself: a Seq2Seq (encoder-decoder) model that generates output text from input text
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google-t5/t5-small",
    dtype=torch.float16,    # half precision (float16) to reduce memory use
    device_map="auto"       # place the model automatically on an available GPU/CPU device
)

# 3. Tokenize the input text: the prefix "translate English to French:" selects the translation task
input_ids = tokenizer(
    "translate English to French: The weather is nice today.",
    return_tensors="pt"     # return PyTorch tensors
).to(model.device)          # move to the same device as the model (e.g., the GPU)

# 4. Pass the input to the model and generate output. cache_implementation="static" is a cache optimization setting (for speed)
output = model.generate(**input_ids, cache_implementation="static")

# 5. Decode the output tokens back into human-readable text
print(tokenizer.decode(output[0], skip_special_tokens=True))
```


#### Walking through T5 step by step
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer
model_name = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")

# Inspect the model structure (it consists of an encoder and a decoder)
print("Model architecture:", model)

# Input text
prompt = "The capital of France is"

# 1. Tokenize (text → token IDs)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# 2. Get the encoder output
encoder_outputs = model.encoder(input_ids=input_ids)
print("Encoder output shape:", encoder_outputs.last_hidden_state.shape)
# => (batch size, number of tokens, hidden size)

# 3. Build the decoder input (the start token)
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]]).to("cuda")

# 4. Get the decoder output (it attends to the encoder output while predicting the next token)
decoder_outputs = model.decoder(
    input_ids=decoder_input_ids,
    encoder_hidden_states=encoder_outputs.last_hidden_state
)

# 5. Pass the output vectors through lm_head to get a score per vocabulary entry (the values the probability distribution is built from)
lm_head_output = model.lm_head(decoder_outputs.last_hidden_state)
print("LM head output shape:", lm_head_output.shape)

# 6. Pick the highest-scoring token (= the first step of greedy decoding)
token_id = lm_head_output[0, -1].argmax(-1)
print("Next token:", tokenizer.decode(token_id))
```
Internally, model.generate() runs the encoder once and then repeats the "Decoder → lm_head → next-token prediction" part in a loop, appending each new token to the decoder input, until the end-of-sequence token or the length limit is reached.
The code above shows just the first step of that loop.


#### Trying different tasks
**(1) Summarization:** Passing "summarization" to pipeline makes the model perform a summarization task.
```python
from transformers import pipeline

# 1. Create the pipeline: use a T5 model for summarization
summarizer = pipeline("summarization", model="google-t5/t5-small")

# 2. Input text
text = (
    "The Transformer model, proposed by Google in 2017, "
    "introduced a new architecture that does not rely on recurrence or convolution. "
    "By using only the self-attention mechanism, it was able to capture long-range dependencies efficiently. "
    "This innovation led to significant performance improvements across many NLP tasks, "
    "such as translation, summarization, and text generation, outperforming RNN- and CNN-based models. "
    "The success of Transformer paved the way for later models like BERT and GPT, "
    "which have further advanced natural language processing. "
    "Building on this, T5 extended the Transformer framework and unified all NLP tasks "
    "into a single text-to-text format, allowing translation, summarization, question answering, "
    "and classification to be handled within one model."
)

# 3. Run
summary = summarizer(text, max_length=100, min_length=10, do_sample=False)

# 4. Show the summary
print("results:", summary[0]['summary_text'])
```

**(2) Question Answering:** Starting the input with the prefix "question:" asks the model to generate an answer to the question.
```python
from transformers import pipeline

# 1. Question answering task (a simple question without context)
qa = pipeline("text2text-generation", model="google-t5/t5-small")

# 2. A simple question (one with a common-sense answer)
text = "question: What color is the sky?"

# 3. Pass the input to the model and generate an answer
result = qa(text)

# 4. Show the result
print("Answer:", result[0]["generated_text"])
```

**(3) Text Classification:** Given the instruction "classify sentiment:", the model (Flan-T5 here) estimates the sentiment of the input sentence.
```python
from transformers import pipeline

# 1. Sentiment classification task
# "google/flan-t5-small" is a lightweight model that Google further trained from T5 on instruction data.
# Flan-T5 is designed as a text2text model that follows a given instruction and turns the input text into output text.
# It therefore responds naturally to requests such as "Please ..." and to instructions like "classify sentiment:" or "complete the sentence:".
classifier = pipeline(
    "text2text-generation", model="google/flan-t5-small"
)

# 2. Input sentence (to be judged positive or negative)
text = "classify sentiment: I really loved this movie! It was amazing."

# 3. Run inference
result = classifier(text)

# 4. Show the result
print("Predicted label:", result[0]["generated_text"])
```

**(4) Text Generation:** Given the instruction "complete the following sentence:", the model (Flan-T5 here) generates a natural continuation of the sentence.
```python
from transformers import pipeline

# 1. Text generation task (continue the sentence)
generator = pipeline("text2text-generation", model="google/flan-t5-small")

# 2. Input (a simple, easy-to-follow prompt)
text = "complete the following sentence: In the morning, people usually start"

# 3. Generate text
result = generator(text, max_length=100, do_sample=True)

# 4. Show the result
print("Generated text:", result[0]["generated_text"])
```


- Reference: Transformer paper (about 200,000 citations at the time of writing): Vaswani et al., Attention Is All You Need, NIPS 2017
- Reference: T5 paper (about 28,000 citations at the time of writing): Raffel et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, JMLR 2020


## Using pre-trained models

```{admonition} Exercise (30 min)
:class: important
Run the prompt engineering code below and experiment with it.
```
Let's use a pre-trained model for a variety of tasks. A public model can do many things as-is, without any additional training step such as fine-tuning.
The code below is adapted from part of [Chapter 6 - Prompt Engineering](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter06/Chapter%206%20-%20Prompt%20Engineering.ipynb), which accompanies the reference book.

To use Llama 3.2, first log in to Hugging Face with the code below.
Use the Access Token you created on Day 1.
If you no longer have your Access Token, follow the Day 1 material to issue a new one.
```python
from huggingface_hub import login
login()
```

### Try the simple prompt "Create a funny joke about chickens"
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the Llama 3.2 1B Instruct model and tokenizer
model_id = "meta-llama/Llama-3.2-1B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",       # place the model on the GPU
    torch_dtype="auto",      # pick the dtype automatically from the checkpoint
    trust_remote_code=True,  # allow custom code from the model repository to run
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # set the padding token

# Create a text-generation pipeline
pipe = pipeline(
    "text-generation",       # specify the text-generation task
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # return only the generated part
    max_new_tokens=500,      # maximum number of tokens to generate
    do_sample=False,         # generate deterministically, without sampling
)

# Prepare the input prompt
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Feed the prompt to the model and generate a response
output = pipe(messages)
print("[Response]:", output[0]["generated_text"])
```

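#### Converting to the model-specific prompt
The text "Create a funny joke about chickens." is converted by apply_chat_template() into the chat-format prompt specific to Llama 3.2.
This function arranges the user input into the internal format the model understands (a system + user structure).
```python
# Convert to the chat-format prompt that is passed to the model
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)  # check what the prompt was actually converted into
```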
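#### Adjusting output diversity with temperature
`temperature` is a parameter that controls how varied the output is. A higher value (e.g., 1.0) makes the output more random and diverse. A lower value (e.g., 0.2) makes it more consistent and stable.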
```python
# Using a high temperature
output = pipe(messages, do_sample=True, temperature=1)
print(output[0]["generated_text"])
```
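
To see why temperature has this effect, here is a small sketch of the usual formulation, in which the scores (logits) are divided by the temperature before the softmax. The three logits are made-up values, and the helper `softmax_with_temperature` is written for this illustration only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At a low temperature almost all of the probability goes to the top-scoring token, so sampling behaves nearly like greedy decoding. At a high temperature the distribution flattens and the other tokens are chosen more often.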

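#### Controlling randomness with top-p
`top_p` (also called nucleus sampling) restricts sampling to the smallest set of most probable tokens whose cumulative probability reaches p.
With a value of 1.0, effectively every token is a candidate and the output becomes more diverse. Lowering the value gives narrower, more consistent responses.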
```python
# Using a high top_p
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])
```
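
The filtering step behind top-p can be sketched as follows. This is a simplified illustration on a made-up distribution, with a hypothetical helper `top_p_filter`; library implementations work on tensors of logits and may differ in details.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so that the kept probabilities sum to 1 again
    return {token: p / cumulative for token, p in kept.items()}

probs = {"student": 0.5, "teacher": 0.3, "doctor": 0.15, "cat": 0.05}  # made-up values
print(top_p_filter(probs, top_p=0.7))   # only the most likely tokens remain
print(top_p_filter(probs, top_p=1.0))   # every token remains a candidate
```

With `top_p=0.7` only "student" and "teacher" survive, so unlikely words such as "cat" can never be sampled. With `top_p=1.0` nothing is filtered out.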


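### Detailed prompts
Unlike a simple one-sentence prompt, a detailed prompt combines several elements, such as a persona (role), the goal, context, tone, and audience, to give the model clearer and more specific instructions.
Writing detailed prompts like this can improve the consistency and quality of the output.

For example, combining the elements shown in the code below helps the model understand the task accurately and produce the output you expect.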
```python
# ==== Define each prompt element ====
# persona: give the model a specific role or expertise
persona = "You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.\n"

# instruction: state clearly what you want the model to do
instruction = "Summarize the key findings of the paper provided.\n"

# context: add the focus or purpose of the output
context = "Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.\n"

# data_format: specify the output format (e.g., bullet points plus a summary paragraph)
data_format = "Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.\n"

# audience: adjust the style and level of expertise to the intended readers
audience = "The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.\n"

# tone: keep the tone of the output consistent (e.g., formal, casual)
tone = "The tone should be professional and clear.\n"

# === Text to summarize ===
# An excerpt from "The Illustrated Transformer" is used here
text = """In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.
The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand to people without in-depth knowledge of the subject matter.
Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar what attention does in seq2seq models).
Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.
As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.
Now We’re Encoding!
As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
"""

# Embed the input text into the prompt
data = f"Text to summarize: {text}"

# === Join all elements into the final prompt ===
# Try removing or adding parts and see how the output changes!
query = persona + instruction + context + data_format + audience + tone + data
```

#### Checking the final prompt
First, format the query (the whole detailed prompt) into the Llama 3.2 chat format so that the model can understand it.
```python
messages = [
    {"role": "user", "content": query}  # define the input from the user to the model
]
print(tokenizer.apply_chat_template(messages, tokenize=False))
```

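#### Feeding the prompt to the model and generating output
Pass the formatted messages to the model to get a response from Llama 3.2. Check whether it actually follows the instructions written in the prompt.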
```python
# Generate the output
outputs = pipe(messages)
print(outputs[0]["generated_text"])
```


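### Chain-of-Thought: think before answering
Chain-of-Thought (CoT) is a prompt design technique that **has the model put its reasoning process into words**.
Rather than just asking for the answer, having the model reason as it answers can lead to more accurate and logical output.

#### A standard question (no reasoning requested)
First, we ask the model in the standard question format. Because no reasoning is requested and the example answer is just a number, the model tends to output only the answer.
```python
# Standard question without explicit reasoning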
standard_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "11"},
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Generate the output
outputs = pipe(standard_prompt)
print(outputs[0]["generated_text"])
```
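In this case the model simply predicts the answer, and the reasoning that leads to it is skipped. As a result, it may give a wrong answer.


#### Answering with the reasoning included (Chain-of-Thought)
Next, we use a prompt that explicitly includes a chain of thought. Showing the model how the answer was reasoned out encourages more logical reasoning.
```python
# Answer with the reasoning process included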
cot_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."},
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Generate the output
outputs = pipe(cot_prompt)
print(outputs[0]["generated_text"])
```
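Here the example assistant turn does not just answer "11"; it explains how it arrived at that answer.
The model imitates the given reasoning example and performs similar step-by-step reasoning.
As a result, this time it can arrive at the correct answer (23 - 20 + 6 = 9).

#### Zero-shot Chain-of-Thought
This time we give no reasoning example and simply add a one-line instruction to think while answering. In zero-shot CoT, adding a trigger phrase such as "Let's think step-by-step." is enough to prompt the model to reason.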
```python
# Zero-shot Chain-of-Thought
zeroshot_cot_prompt = [
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? Let's think step-by-step."}
]

# Generate the output
outputs = pipe(zeroshot_cot_prompt)
print(outputs[0]["generated_text"])
```


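### Zero-shot and one-shot prompts
How you instruct the model (the prompt) changes the accuracy and consistency of its output.
Here we compare two kinds of prompt, zero-shot and one-shot, on the same task: "create a character profile for an RPG game".


#### Zero-shot prompting
This method gives only the task instruction, with no examples at all. The model relies on the knowledge it acquired during training to produce output from the instruction.
```python
# Zero-shot prompt: instruct without giving any example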
zeroshot_prompt = [
    {"role": "user", "content": "Create a character profile for an RPG game in JSON format."}
]

# Generate the output
outputs = pipe(zeroshot_prompt)
print(outputs[0]["generated_text"])
```

#### One-shot prompting
This method gives one output example (a sample of the format) to show the model explicitly the structure and style you want.
```python
# One-shot prompt: give one output example
one_shot_template = """Create a short character profile for an RPG game. Make sure to only use this format:

{
  "description": "A SHORT DESCRIPTION",
  "name": "THE CHARACTER'S NAME",
  "armor": "ONE PIECE OF ARMOR",
  "weapon": "ONE OR MORE WEAPONS"
}
"""
one_shot_prompt = [
    {"role": "user", "content": one_shot_template}
]

# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])
```
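As this shows, presenting the output format just once lets the model **imitate** it and produce JSON output that is more accurate and stable.
Besides specifying the format (the shape of the output) like this, you can also **present one actual output example** for the model to imitate. That approach is what is usually called one-shot prompting.
- Note 1: The code in the textbook 「直感LLM」 also presents the format-only approach as a "one-shot prompt". In general, however, "one-shot prompt" more often refers to showing one actual output example.
- Note 2: There is also few-shot prompting, which shows several examples so that the model can imitate a wider range of output patterns.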
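
A few-shot prompt can be assembled as a chat history in which each example is a user turn followed by the ideal assistant turn. The sketch below uses a hypothetical helper, `build_few_shot_messages`, and made-up example profiles; neither comes from the textbook code.

```python
def build_few_shot_messages(examples, new_request):
    """Turn (request, ideal answer) pairs into a chat history, then append the new request."""
    messages = []
    for request, answer in examples:
        messages.append({"role": "user", "content": request})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": new_request})
    return messages

# Made-up examples for illustration
examples = [
    ("Create a character profile for an RPG game in JSON format. Class: mage",
     '{"name": "Elara", "armor": "Silk robe", "weapon": "Oak staff"}'),
    ("Create a character profile for an RPG game in JSON format. Class: warrior",
     '{"name": "Borin", "armor": "Iron plate", "weapon": "Battle axe"}'),
]
few_shot_prompt = build_few_shot_messages(
    examples,
    "Create a character profile for an RPG game in JSON format. Class: thief",
)

for m in few_shot_prompt:
    print(f'{m["role"]}: {m["content"]}')
```

The resulting list has the same structure as the other prompts in this section, so it could be passed to the pipeline in the same way, for example `pipe(few_shot_prompt)`.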


## Fine-tuning the model
```{admonition} Exercise (30 min)
:class: important
Run the Supervised Fine-Tuning (SFT) code below and experiment with it.
```
So far we have looked at applications that use public models as-is. From here on, we look at an example of **fine-tuning**, which updates a model slightly with a small amount of data. We use a modified version of the code from the reference book: [Chapter 12 - Fine-tuning Generation Models](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter12/Chapter%2012%20-%20Fine-tuning%20Generation%20Models.ipynb).

We use a library called `trl`, which is not included in the Colab environment by default. It is a tool from Hugging Face that makes fine-tuning convenient.
Run the following command to install it.
```python
!pip install trl==0.9.4
```

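### Fine-tuning basics: SFT and LoRA
#### Supervised Fine-Tuning (SFT)
SFT (supervised fine-tuning) readjusts an already pretrained large language model (LLM) to a specific task.
The model is given many pairs of "input and ideal output (the correct example)" and trained further so that its output moves closer to what humans intend.

Example 1) Give GPT "question → answer" data to specialize it for question answering. <br>
Example 2) Retrain on a small task-specific dataset, such as translation or summarization.


#### LoRA (Low-Rank Adaptation)
LoRA is an efficient fine-tuning method.
Instead of updating all of the model's parameters, it adds and trains **only a small number of parameters (5% or less of the total) in the form of low-rank matrices (A and B)**, as shown in the figure.
Because only the small matrices A and B are trained and the pretrained weights are left untouched, it can achieve an effect close to retraining the whole model
(although performance can be slightly lower than with full fine-tuning).
LoRA is **an important technique for efficiently readjusting large models in a limited GPU environment**.
In the figure below, the blue part is the pretrained weights (kept frozen and not updated), and the orange part is the small matrices (A and B) added by LoRA, which are the only parts trained.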

```{image} /_static/img/day3/LoRA.jpg
:width: 30%
:align: center
```
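
The arithmetic behind LoRA can be sketched with tiny matrices. This is an illustration of the update W + (alpha / r) · B · A under stated assumptions, not the PEFT implementation: the matrix sizes and values are made up, and the layer size 2048 in the parameter count is an assumed example size.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(W, A, B, alpha, r):
    """Effective weight W + (alpha / r) * B @ A; W itself is never modified."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

d, r, alpha = 4, 2, 4                                            # tiny sizes for illustration
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen pretrained weight
A = [[0.1 * (i + j + 1) for j in range(d)] for i in range(r)]    # r x d, trainable
B = [[0.0] * r for _ in range(d)]                                # d x r, trainable, starts at zero

# B starts at zero, so before training the layer behaves exactly like the pretrained one
print(lora_weight(W, A, B, alpha, r) == W)

# Pretend that training changed one entry of B: the effective weight shifts, W stays frozen
B[0][0] = 0.5
W_eff = lora_weight(W, A, B, alpha, r)
print([round(v, 2) for v in W_eff[0]])

# Parameter count for one square layer (2048 is an assumed example size)
d_model, rank = 2048, 8
full_params = d_model * d_model
lora_params = 2 * d_model * rank
print(full_params, lora_params, f"{100 * lora_params / full_params:.2f}%")
```

For a single 2048 × 2048 layer with rank 8, the two LoRA matrices hold 32,768 parameters against 4,194,304 in the frozen weight, which is under 1% of that layer.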


### Preparing for training
We fine-tune on a dataset called [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k).
It is also worth clicking the ultrachat_200k link to look at the data itself.
```python
# ------------------------------------------------------------
# Import the required libraries
# ------------------------------------------------------------
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import load_dataset

# ------------------------------------------------------------
# 1️⃣ Load the model and tokenizer
# ------------------------------------------------------------
model_name = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # use EOS as the PAD token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to save memory (fits small GPUs such as the T4)
    device_map="auto",
)

# ------------------------------------------------------------
# 2️⃣ LoRA configuration
# ------------------------------------------------------------
peft_config = LoraConfig(
    r=8,                      # rank of the low-rank matrices
    lora_alpha=16,            # scaling factor
    target_modules=["q_proj", "v_proj"],  # layers that LoRA is applied to
    lora_dropout=0.05,        # dropout rate
    bias="none",
    task_type="CAUSAL_LM",    # causal language model (generation tasks)
)
model = get_peft_model(model, peft_config) # add the LoRA part (low-rank matrices A and B) to the model
model.print_trainable_parameters()

# ------------------------------------------------------------
# 3️⃣ Prepare the dataset
# ------------------------------------------------------------
def format_prompt(example):
    """Convert to the Llama 3.2 chat template format"""
    chat = example["messages"]
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)
    return {"text": prompt}

# Use part of the UltraChat data (to shorten training time)
dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
    .shuffle(seed=42)
    .select(range(3000))
)
dataset = dataset.map(format_prompt)
print("[0-th data]:", dataset["text"][0])
```

### Supervised Fine-Tuning (SFT)
```python
# ------------------------------------------------------------
# 4️⃣ Training configuration
# ------------------------------------------------------------
# TrainingArguments collects the basic training settings,
# such as the output directory, learning rate, and number of epochs.
training_args = TrainingArguments(
    output_dir="./llama32-lora-sft",     # where checkpoints are written
    per_device_train_batch_size=1,       # batch size per GPU
    gradient_accumulation_steps=4,       # accumulate gradients to emulate a larger batch
    learning_rate=2e-4,                  # learning rate
    num_train_epochs=1,                  # number of epochs
    logging_steps=20,                    # logging interval
    save_steps=200,                      # checkpoint interval
    fp16=True,                           # half-precision training (fits small GPUs such as the T4)
    report_to=[],                        # disable external loggers such as wandb
)

# ------------------------------------------------------------
# 5️⃣ Define the SFTTrainer (TRL 0.9.4)
# ------------------------------------------------------------
# Use SFTTrainer from the TRL (Transformer Reinforcement Learning) library
# to fine-tune the LoRA-equipped model on supervised data.
trainer = SFTTrainer(
    model=model,                         # model with LoRA already applied
    train_dataset=dataset,                # training dataset
    peft_config=peft_config,              # PEFT (LoRA) configuration
    dataset_text_field="text",            # name of the text column used for training
    max_seq_length=512,                   # maximum sequence length in tokens
    tokenizer=tokenizer,                  # tokenizer
    args=training_args,                   # training configuration
)

# ------------------------------------------------------------
# 6️⃣ Run training
# ------------------------------------------------------------
# Calling train() updates only the LoRA part (the small matrices A and B).
trainer.train() # takes about 13 minutes on a T4 GPU

# ------------------------------------------------------------
# 7️⃣ Save the model
# ------------------------------------------------------------
# Save the trained model and the tokenizer.
trainer.model.save_pretrained("./llama32-lora-sft-final")
tokenizer.save_pretrained("./llama32-lora-sft-final")

print("✅ Training finished! The LoRA model was saved to ./llama32-lora-sft-final.")
```
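
As a quick sanity check on these settings, the number of optimizer steps can be estimated by hand. The values below are copied from the configuration above; `num_gpus = 1` is an assumption (a single Colab GPU), and the estimate is approximate because the trainer handles rounding itself.

```python
import math

num_examples = 3000                 # .select(range(3000)) in the data preparation step
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 1
num_gpus = 1                        # assumption: a single Colab GPU

# Gradients from 4 batches of size 1 are accumulated before each weight update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print("effective batch size:", effective_batch_size)   # 4
print("optimizer steps     :", total_steps)            # 750
```

With about 750 steps, `logging_steps=20` prints the loss a few dozen times during the run, and `save_steps=200` writes checkpoints around steps 200, 400, and 600.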


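### Comparing inference before and after fine-tuning
SFT often helps the model follow instructions more accurately and improves the style and format of its output.
Here we put the model before training side by side with the model fine-tuned with LoRA.
We check how the output changes for the same prompt.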

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

model_path_before = "meta-llama/Llama-3.2-1B-Instruct"
model_path_after = "./llama32-lora-sft-final"  # model (LoRA adapter) saved after SFT

tokenizer = AutoTokenizer.from_pretrained(model_path_before)
tokenizer.pad_token = tokenizer.eos_token

# ------------------------------------------------------------
# Inference pipeline (before training)
# ------------------------------------------------------------
pipe_before = pipeline(
    "text-generation",
    model=AutoModelForCausalLM.from_pretrained(
        model_path_before,
        torch_dtype=torch.float16,
        device_map="auto"
    ),
    tokenizer=tokenizer
)

# ------------------------------------------------------------
# Inference pipeline (after training)
# ------------------------------------------------------------
pipe_after = pipeline(
    "text-generation",
    model=AutoModelForCausalLM.from_pretrained(
        model_path_after,
        torch_dtype=torch.float16,
        device_map="auto"
    ),
    tokenizer=tokenizer
)
```

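#### Feeding the same prompt and comparing the outputs
Because we used little data and trained only briefly, we cannot claim that performance clearly improved.
Comparing the two, however, shows that the output has changed. That change is the effect of SFT.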

```python
prompt = "How can I improve my concentration while studying?"

# Before training
print("=== 💭 Before fine-tuning ===")
result_before = pipe_before(prompt, max_new_tokens=1024, do_sample=False)[0]["generated_text"]
print(result_before)

# After training
print("\n=== 🚀 After fine-tuning ===")
result_after = pipe_after(prompt, max_new_tokens=1024, do_sample=False)[0]["generated_text"]
print(result_after)
```