Day 5 (11/11)¶

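Today is the final day of lectures. We will look at applied and peripheral technologies around LLMs.

Before that, a note about the Day 3 exercises.

GPU usage limits on Colab

Colab's T4 GPU is free, but it comes with usage limits. The exact cap is not published, but once you hit it, you apparently have to wait 12 hours or more before you can use a GPU again. If you plan to do fine-tuning for your free-form project, a run of a few hours seems to work on Colab, so I recommend limiting yourself to relatively small fine-tuning jobs of that size.

I will also explain the plan for Day 6 onward first.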

RAG¶

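Let's start with Retrieval-Augmented Generation (RAG).

Principle¶

RAG is the most basic technique for letting an LLM consult an external knowledge source. When an LLM is used as a service, it is generally hard to train the model further after deployment, so in most cases training ends once the initial training is done. Adding information to the LLM would therefore mean retraining it, which is costly. With RAG, however, information can easily be added at inference time. This lets the LLM work with information newer than its training data, or with information you do not want to publish externally, such as confidential company documents.

Technically, RAG mainly consists of the following steps.

Offline
- Prepare the information sources (e.g., a large collection of documents)
- Extract features from the sources (yielding a large number of $D$-dimensional vectors)
Online
- A query text (the prompt) is given
- Extract features from the query as well, compare them with the source features, and retrieve the top-ranked texts
- Combine these texts with the original query into a new prompt and send it to the LLM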
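The offline and online steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the bag-of-words `embed` function stands in for a learned text encoder, the three documents are made up, and the final prompt is printed rather than sent to an LLM.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and strip simple punctuation (a real system would use a proper tokenizer)."""
    return [w.strip(".,?!").lower() for w in text.split()]

def embed(text, vocab):
    """Toy bag-of-words embedding, L2-normalized. Stand-in for a learned text encoder."""
    counts = Counter(tokenize(text))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# --- Offline: prepare the sources and extract their features ---
docs = [
    "The cafeteria opens at 8 am on weekdays.",
    "The library closes at 10 pm.",
    "Parking permits are issued by the facilities office.",
]
vocab = sorted({w for d in docs for w in tokenize(d)})
doc_vecs = [embed(d, vocab) for d in docs]

# --- Online: embed the query, rank the sources, build the new prompt ---
query = "When does the library close?"
query_vec = embed(query, vocab)
scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
context = "\n".join(docs[i] for i in ranked[:2])
prompt = f"Use the context below to answer the question.\nContext:\n{context}\nQuestion: {query}"
print(prompt)
```

The only part that changes in a realistic setting is `embed`, which becomes a neural encoder, and the last step, where the prompt goes to an LLM.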

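As you can see, RAG is a very simple technique that can be carried out just by comparing feature vectors. The figure below illustrates it.

For a more in-depth explanation of RAG, please refer to these slides.

Example with a Wikipedia article + SentenceTransformer¶

Let's look at some code. The code below is based on the OpenAI Cookbook. Everything runs on Colab. Because it works through API calls to Azure OpenAI, Colab's CPU mode is sufficient.

First, import the libraries and configure Azure OpenAI. This part is the same as what we did on Day 1.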

from openai import AzureOpenAI
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize

# Azure OpenAI settings
endpoint = "xxxxxx"                 # Azure OpenAI endpoint URL (the one shared on Slack)
deployment = "gpt-4o"               # LLM deployment; we use 4o this time
subscription_key = "yyyyy"          # Azure OpenAI key (the one shared on Slack)
api_version = "2024-12-01-preview"  # API version

# Initialize the Azure OpenAI client
client = AzureOpenAI(
    api_version=api_version,
    azure_endpoint=endpoint,
    api_key=subscription_key,
)

Now let's ask GPT-4o which countries won the most medals at the 2024 Summer Olympics. GPT-4o's training data predates the 2024 Games, so it cannot answer this question.

# The user's question
query = 'Which countries won the maximum number of gold, silver and bronze medals respectively at 2024 Summer Olympics?'

# Ask the LLM directly without RAG (for comparison)
print("=" * 80)
print("Answer without RAG:")
print("=" * 80)

# Query the LLM
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the 2024 Games or latest events.'},
        {'role': 'user', 'content': query},
    ],
    model=deployment,
    temperature=0,
)

print(response.choices[0].message.content)

In this case the output looks something like the following, i.e., the model says it cannot answer.

I currently do not have information about the results of the 2024 Summer Olympics, as they are set to take place in Paris from July 26 to August 11, 2024. Medal counts and results will only be available after the events conclude. Stay tuned for updates during or after the Games!

Next, let's prepare a knowledge source as follows. Here we use excerpts from a Wikipedia article.

# Article on the 2024 Summer Olympics, taken from Wikipedia
# Source: https://en.wikipedia.org/wiki/2024_Summer_Olympics

wikipedia_article = """2024 Summer Olympics

The 2024 Summer Olympics (French: Les Jeux Olympiques d'été de 2024), officially the Games of the XXXIII Olympiad (French: Jeux de la XXXIIIe olympiade de l'ère moderne) and branded as Paris 2024, were an international multi-sport event held from 26 July to 11 August 2024 in France, with several events started from 24 July. Paris was the host city, with events (mainly football) held in 16 additional cities spread across metropolitan France, including the sailing centre in the second-largest city of France, Marseille, on the Mediterranean Sea, as well as one subsite for surfing in Tahiti, French Polynesia.[4]

Paris was awarded the Games at the 131st IOC Session in Lima, Peru, on 13 September 2017. After multiple withdrawals that left only Paris and Los Angeles in contention, the International Olympic Committee (IOC) approved a process to concurrently award the 2024 and 2028 Summer Olympics to the two remaining candidate cities; both bids were praised for their high technical plans and innovative ways to use a record-breaking number of existing and temporary facilities. Having previously hosted in 1900 and 1924, Paris became the second city ever to host the Summer Olympics three times (after London, which hosted the games in 1908, 1948, and 2012).[5][6] Paris 2024 marked the centenary of Paris 1924 and Chamonix 1924 (the first Winter Olympics), as well as the sixth Olympic Games hosted by France (three Summer Olympics and three Winter Olympics) and the first with this distinction since the 1992 Winter Games in Albertville. The Summer Games returned to the traditional four-year Olympiad cycle, after the 2020 edition was postponed to 2021 due to the COVID-19 pandemic.

Paris 2024 featured the debut of breaking as an Olympic sport,[7] and was the final Olympic Games held during the IOC presidency of Thomas Bach.[8] The 2024 Games were expected to cost €9 billion.[9][10][11] The opening ceremony was held outside of a stadium for the first time in modern Olympic history, as athletes were paraded by boat along the Seine. Paris 2024 was the first Olympics in history to reach full gender parity on the field of play, with equal numbers of male and female athletes.[12]

The United States topped the medal table for the fourth consecutive Summer Games and 19th time overall, with 40 gold and 126 total medals.[13] 
China tied with the United States on gold (40), but finished second due to having fewer silvers; the nation won 91 medals overall. 
This is the first time a gold medal tie among the two most successful nations has occurred in Summer Olympic history.[14] Japan finished third with 20 gold medals and sixth in the overall medal count. Australia finished fourth with 18 gold medals and fifth in the overall medal count. The host nation, France, finished fifth with 16 gold and 64 total medals, and fourth in the overall medal count. Dominica, Saint Lucia, Cape Verde and Albania won their first-ever Olympic medals, the former two both being gold, with Botswana and Guatemala also winning their first-ever gold medals. 
The Refugee Olympic Team also won their first-ever medal, a bronze in boxing. At the conclusion of the games, despite some controversies throughout relating to politics, logistics and conditions in the Olympic Village, the Games were considered a success by the press, Parisians and observers.[a] The Paris Olympics broke all-time records for ticket sales, with more than 9.5 million tickets sold (12.1 million including the Paralympic Games).[15]

Medal table
Main article: 2024 Summer Olympics medal table
See also: List of 2024 Summer Olympics medal winners
Key
 ‡  Changes in medal standings (see below)

  *   Host nation (France)

2024 Summer Olympics medal table[171][B][C]
Rank	NOC	Gold	Silver	Bronze	Total
1	 United States‡	40	44	42	126
2	 China	40	27	24	91
3	 Japan	20	12	13	45
4	 Australia	18	19	16	53
5	 France*	16	26	22	64
6	 Netherlands	15	7	12	34
7	 Great Britain	14	22	29	65
8	 South Korea	13	9	10	32
9	 Italy	12	13	15	40
10	 Germany	12	13	8	33
11–91	Remaining NOCs	129	138	194	461
Totals (91 entries)	329	330	385	1,044

Podium sweeps
There was one podium sweep during the games:

Date	Sport	Event	Team	Gold	Silver	Bronze	Ref
2 August	Cycling	Men's BMX race	 France	Joris Daudet	Sylvain André	Romain Mahieu	[176]


Medals
Medals from the Games, with a piece of the Eiffel Tower
The President of the Paris 2024 Olympic Organizing Committee, Tony Estanguet, unveiled the Olympic and Paralympic medals for the Games in February 2024, which on the obverse featured embedded hexagon-shaped tokens of scrap iron that had been taken from the original construction of the Eiffel Tower, with the logo of the Games engraved into it.[41] Approximately 5,084 medals would be produced by the French mint Monnaie de Paris, and were designed by Chaumet, a luxury jewellery firm based in Paris.[42]

The reverse of the medals features Nike, the Greek goddess of victory, inside the Panathenaic Stadium which hosted the first modern Olympics in 1896. Parthenon and the Eiffel Tower can also be seen in the background on both sides of the medal.[43] Each medal weighs 455–529 g (16–19 oz), has a diameter of 85 mm (3.3 in) and is 9.2 mm (0.36 in) thick.[44] The gold medals are made with 98.8 percent silver and 1.13 percent gold, while the bronze medals are made up with copper, zinc, and tin.[45]


Opening ceremony
Main article: 2024 Summer Olympics opening ceremony

Pyrotechnics at the Pont d'Austerlitz marking the start of the Parade of Nations

The cauldron flying above the Tuileries Garden during the games. LEDs and aerosol produced the illusion of fire, while the Olympic flame itself was kept in a small lantern nearby
The opening ceremony began at 19:30 CEST (17:30 GMT) on 26 July 2024.[124] Directed by Thomas Jolly,[125][126][127] it was the first Summer Olympics opening ceremony to be held outside the traditional stadium setting (and the second ever after the 2018 Youth Olympic Games one, held at Plaza de la República in Buenos Aires); the parade of athletes was conducted as a boat parade along the Seine from Pont d'Austerlitz to Pont d'Iéna, and cultural segments took place at various landmarks along the route.[128] Jolly stated that the ceremony would highlight notable moments in the history of France, with an overall theme of love and "shared humanity".[128] The athletes then attended the official protocol at Jardins du Trocadéro, in front of the Eiffel Tower.[129] Approximately 326,000 tickets were sold for viewing locations along the Seine, 222,000 of which were distributed primarily to the Games' volunteers, youth and low-income families, among others.[130]

The ceremony featured music performances by American musician Lady Gaga,[131] French-Malian singer Aya Nakamura, heavy metal band Gojira and soprano Marina Viotti [fr],[132] Axelle Saint-Cirel (who sang the French national anthem "La Marseillaise" atop the Grand Palais),[133] rapper Rim'K,[134] Philippe Katerine (who portrayed the Greek god Dionysus), Juliette Armanet and Sofiane Pamart, and was closed by Canadian singer Céline Dion.[132] The Games were formally opened by president Emmanuel Macron.[135]

The Olympics and Paralympics cauldron was lit by Guadeloupean judoka Teddy Riner and sprinter Marie-José Pérec; it had a hot air balloon-inspired design topped by a 30-metre-tall (98 ft) helium sphere, and was allowed to float into the air above the Tuileries Garden at night. For the first time, the cauldron was not illuminated through combustion; the flames were simulated by an LED lighting system and aerosol water jets.[136]

Controversy ensued at the opening ceremony when a segment was interpreted by some as a parody of the Last Supper. The organisers apologised for any offence caused.[137] The Olympic World Library and fact-checkers would later debunk the interpretation that the segment was a parody of the Last Supper. The Olympic flag was also raised upside down.[138][139]

During the day of the opening ceremony, there were reports of a blackout in Paris, although this was later debunked.[140]

Closing ceremony


The ceremony and final fireworks
Main article: 2024 Summer Olympics closing ceremony
The closing ceremony was held at Stade de France on 11 August 2024, and thus marked the first time in any Olympic edition since Sarajevo 1984 that opening and closing ceremonies were held in different locations.[127] Titled "Records", the ceremony was themed around a dystopian future, where the Olympic Games have disappeared, and a group of aliens reinvent it. It featured more than a hundred performers, including acrobats, dancers and circus artists.[158] American actor Tom Cruise also appeared with American performers Red Hot Chili Peppers, Billie Eilish, Snoop Dogg, and H.E.R. during the LA28 Handover Celebration portion of the ceremony.[159][160] The Antwerp Ceremony, in which the Olympic flag was handed to Los Angeles, the host city of the 2028 Summer Olympics, was produced by Ben Winston and his studio Fulwell 73.[161]


Security
France reached an agreement with Europol and the UK Home Office to help strengthen security and "facilitate operational information exchange and international law enforcement cooperation" during the Games.[46] The agreement included a plan to deploy more drones and sea barriers to prevent small boats from crossing the Channel illegally.[47] The British Army would also provide support by deploying Starstreak surface-to-air missile units for air security.[48] To prepare for the Games, the Paris police held inspections and rehearsals in their bomb disposal unit, similar to their preparations for the 2023 Rugby World Cup at the Stade de France.[49]

As part of a visit to France by Qatari Emir Sheikh Tamim bin Hamad Al-Thani, several agreements were signed between the two nations to enhance security for the Olympics.[50] In preparation for the significant security demands and counterterrorism measures, Poland pledged to contribute security troops, including sniffer dog handlers, to support international efforts aimed at ensuring the safety of the Games.[51][52] The Qatari Minister of Interior and Commander of Lekhwiya (the Qatari security forces) convened a meeting on 3 April 2024 to discuss security operations ahead of the Olympics, with officials and security leaders in attendance, including Nasser Al-Khelaifi and Sheikh Jassim bin Mansour Al Thani.[53] A week before the opening ceremony, the Lekhwiya were reported to have been deployed in Paris on 16 July 2024.[54]

In the weeks running up to the opening of the Paris Olympics, it was reported that police officers would be deployed from Belgium,[55] Brazil,[56] Canada (through the RCMP/OPP/CPS/SQ),[57][58][59] Cyprus,[60] the Czech Republic,[61] Denmark,[62] Estonia,[63][64] Finland,[65] Germany (through Bundespolizei[66][67]/NRW Police[68]),[69] India,[70][71] Ireland,[72] Italy,[73] Luxembourg,[74] Morocco,[75] Netherlands,[76] Norway,[58] Poland,[77] Portugal,[78] Slovakia,[79] South Korea,[80][81] Spain (through the CNP/GC),[82] Sweden,[83] the UAE,[84] the UK,[49] and the US (through the LAPD,[85] LASD,[86] NYPD,[87] and the Fairfax County Police Department[88]), with more than 40 countries providing police assistance to their French counterparts.[89][90]

Security concerns impacted the plans that had been announced for the opening ceremony, which was to take place as a public event along the Seine; the expected attendance was reduced by half from an estimated 600,000 to 300,000, with plans for free viewing locations now being by invitation only. In April 2024, after Islamic State claimed responsibility for the Crocus City Hall attack in March, and made several threats against the UEFA Champions League quarter-finals, French president Emmanuel Macron indicated that the opening ceremony could be scaled back or re-located if necessary.[91][92][93] French authorities had placed roughly 75,000 police and military officials on the streets of Paris in the lead-up to the Games.[94]

Following the end of the Games, the national counterterrorism prosecutor, Olivier Christen, revealed that French authorities foiled three terror plots meant to attack the Olympic and Paralympic Games, resulting in the arrest of five suspects.[95]

"""

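This text actually contains the answer. What we want, then, is to extract the useful parts of this source text and pass them to the LLM. Sending the entire document to the LLM would be inefficient, and it is simply not feasible when there is a large amount of data.

Here we first split the source into sentences and extract a text feature vector from each one. To extract text features we use Sentence Transformers, probably the easiest library for this purpose, with the lightweight and fast all-MiniLM-L6-v2 model. We then L2-normalize the resulting vectors: similarity between these embeddings is measured by cosine similarity, so after L2 normalization the ranking can be computed with inner products alone.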

# Split into sentences (a crude approach: split on '.')
texts = [t.strip(' \n') for t in wikipedia_article.split('.')]

# Load the Sentence Transformer model
# all-MiniLM-L6-v2: a lightweight, fast sentence-embedding model
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Convert each text to a vector and L2-normalize
# After L2 normalization, cosine similarity reduces to an inner product
embeddings = normalize(text_encoder.encode(texts))
print(f"Embedding shape: {embeddings.shape}")  # (number of sentences, 384)
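Splitting on '.' is deliberately crude: it also cuts decimals such as "9.5 million" and "3.3 in" in half. As a hedged alternative, the sketch below splits only where sentence-ending punctuation (optionally followed by Wikipedia-style citation markers such as [15]) is followed by whitespace. The function name `split_sentences` and the sample text are introduced here for illustration only, and the citation markers at split points are dropped as a side effect.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by optional [n] markers and whitespace."""
    parts = re.split(r"(?<=[.!?])(?:\[[^\]]+\])*\s+", text)
    return [p.strip() for p in parts if p.strip()]

sample = (
    "The Paris Olympics sold more than 9.5 million tickets.[15] "
    "Each medal has a diameter of 85 mm (3.3 in). "
    "Japan finished third."
)

naive = [t.strip(" \n") for t in sample.split(".")]
better = split_sentences(sample)

print(len(naive), naive)
print(len(better), better)
```

This is still a heuristic: abbreviations such as "H.E.R." followed by a space would be split as well. For the lecture's purposes the simple '.' split is good enough, since retrieval only needs roughly sentence-sized chunks.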

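Next, convert the query text itself into an embedding vector. From this query vector we select the 10 most similar texts; since the vectors are normalized, computing inner products and taking the top results is enough. Because the retrieved texts are highly similar to the query (that is, the question), we can expect them to be useful.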

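# Convert the query (question) to a vector as well and L2-normalize
query_embedding = normalize(text_encoder.encode([query]))

# Inner product of L2-normalized vectors = cosine similarity
# Compute the similarity between the query and each text
similarities = query_embedding @ embeddings.T
similarities = similarities[0]  # (1, 71) -> (71,)

# Sort by similarity and take the indices of the top 10 (most similar first)
top_10_indices = similarities.argsort()[-10:][::-1]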
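The claim that the inner product of L2-normalized vectors equals cosine similarity rests on a one-line identity, written out here for reference. The symbols $\mathbf{q}$ (query vector) and $\mathbf{d}$ (text vector) are introduced only for this note.

```latex
\cos(\mathbf{q}, \mathbf{d})
  = \frac{\mathbf{q}^\top \mathbf{d}}{\lVert \mathbf{q} \rVert_2 \, \lVert \mathbf{d} \rVert_2},
\qquad
\lVert \mathbf{q} \rVert_2 = \lVert \mathbf{d} \rVert_2 = 1
\;\Longrightarrow\;
\cos(\mathbf{q}, \mathbf{d}) = \mathbf{q}^\top \mathbf{d}.
```

Ranking the texts by `query_embedding @ embeddings.T` therefore gives the same order as ranking them by cosine similarity.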

Printing them shows that the retrieved texts do look close to the question.

print("\n" + "=" * 80)
print(f"Query: {query}")
print("=" * 80)
print("Top 10 similar texts:")
print("=" * 80)
for i, idx in enumerate(top_10_indices, 1):
    print(f"\n{i}. Similarity: {similarities[idx]:.4f}")
    print(f"Text: {texts[idx]}")

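Now let's gather these top 10 texts, add them to the original query, and build a new prompt for RAG. Here we simply append the 10 texts and state directly in the prompt that the model should use this article to answer.

# Join the top 10 texts to use as context
top_10_texts = '\n\n'.join([texts[idx] for idx in top_10_indices])

# Build a prompt that includes the retrieved information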
query_rag = f"""Use the below article on the 2024 Summer Olympics to answer the subsequent question.
Article:
\"\"\"
{top_10_texts}
\"\"\"
Question: {query}
"""

print("\n" + "=" * 80)
print("RAG prompt:")
print("=" * 80)
print(query_rag)

With this prompt, the model arrives at the correct answer, as shown below.

# Ask the LLM using the prompt that includes the retrieved information
print("\n" + "=" * 80)
print("Answer with RAG:")
print("=" * 80)
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the recent events.'},
        {'role': 'user', 'content': query_rag},
    ],
    model=deployment,
    temperature=0,
)

print(response.choices[0].message.content)

The answer then looks, for example, like this.

================================================================================
Answer with RAG:
================================================================================
At the 2024 Summer Olympics:

- **Gold medals**: The **United States** and **China** tied for the maximum number of gold medals, with 40 each.
- **Silver medals**: The **United States** won the maximum number of silver medals, with 44.
- **Bronze medals**: The **United States** also won the maximum number of bronze medals, with 42.

Tips

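In the example above, the search was done with a plain inner product in Python. With too much data, however, such brute-force search becomes slow. In that case it is a good idea to use an approximate nearest neighbor search library. A well-known one is faiss, developed by Meta.

As an aside, the only official way to install faiss at present is through conda. Nevertheless, almost all books and libraries that use faiss, including 直感LLM, the textbook designated for this course, rely on faiss-cpu, which can be installed with pip. That package is in fact not official but a community build maintained single-handedly by Yamaguchi-san of CyberAgent. His contribution is significant, and it would be nice if the faiss project also provided an official pip package.

Exercise (20 min)

Type in the code above yourself and run it. Also try various questions and swap in different information sources.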

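Multimodal language models¶

So far in these exercises we have dealt only with language, but multimodal language models that handle other kinds of information are also an active research area. Such models are called Multimodal Large Language Models (MLLMs). Language models that can handle images in particular are also called vision-language models (VLMs). You probably know that you can upload an image to ChatGPT and have it describe the contents; multimodal language models are the technology behind this. Here we touch on MLLMs briefly.

Feature extractor: CLIP¶

Before turning to multimodal language models, let's first study Contrastive Language-Image Pre-Training (CLIP). CLIP is a feature extractor that handles image and language information together. It consists of two encoders, an image encoder and a text encoder, trained on a large number of image-text pairs. As a result, CLIP maps images and text into the same space, where they can be compared. For example, the feature vector obtained by feeding the string "a picture of a cat" to the text encoder and the one obtained by feeding a cat image to the image encoder have the same dimensionality and point in similar directions (their cosine similarity is high).

This enables multimodal search, where a text string is used as the query to find images. The figure below shows this relationship: text and image information can be compared directly.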
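Once both encoders map into the same space, text-to-image search is just a nearest-neighbor lookup across modalities. The sketch below shows only that final ranking step. The three-dimensional vectors are made up for illustration and are not actual CLIP outputs, which have hundreds of dimensions and come from the image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up stand-ins for image-encoder outputs (NOT real CLIP features).
image_embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}

# Made-up stand-in for the text-encoder output of "a picture of a cat".
text_embedding = [0.8, 0.2, 0.1]

# Rank images by similarity to the text query, most similar first.
ranking = sorted(
    image_embeddings,
    key=lambda name: cosine(text_embedding, image_embeddings[name]),
    reverse=True,
)
print(ranking)
```

With a real model, the two dictionaries would be filled by running the image encoder over an image collection and the text encoder over the query, and the ranking code would stay the same.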

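CLIP has become a truly foundational technology in computer vision in recent years, and it is fair to say that it removed the barrier between the world of text and the world of images. As for what the image encoder is concretely, a Transformer for images (Vision Transformer; ViT) is typically used.

Tips

In discussions of large language models such as GPT-5, CLIP's role is that of a "feature extractor", but CLIP itself is also a large Transformer-based model. Depending on the context, CLIP may therefore also be referred to as a large language model, so keep this in mind.

Exercise (20 min)

Try building a simple demo that uses CLIP to search images with a text query. This is intended to be run locally.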

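An MLLM example: BLIP-2¶

With something like CLIP, image information can be represented as a vector. How can we connect this to a large language model? Concretely, what does it take for a large language model to read an image and describe its contents?

There are various approaches, but the basic idea is to turn the image embedding into a form the large language model can understand. In BLIP-2, the image-side encoder, a Vision Transformer, is a pretrained one kept frozen, and the LLM is likewise an off-the-shelf model kept frozen. A module that bridges the two, called the Q-Former, is then trained on a large number of image-text pairs. In the end, given an input image, the image features are extracted, converted by the Q-Former into embeddings the LLM can understand, and fed into the LLM.

Now let's look at concrete code. The following is based on this code from the textbook and runs on Colab. Because it downloads and runs an LLM, turn the GPU on.

First, load the required libraries, the BLIP-2 model, and its corresponding processor (an abstraction over the processing steps).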

from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch
from PIL import Image
from urllib.request import urlopen

# Load the BLIP-2 model and processor
blip_processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"  # place the model on the GPU automatically
)
device = next(model.parameters()).device  # keep the device for later use

Next, load and display a sample image. We use PIL, the package commonly used for handling images in Python. Let's also check the image size: it is (492, 520), an irregular size.

car_path = "https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/chapter09/images/car.png"
image = Image.open(urlopen(car_path)).convert("RGB")
print(image.size)
image

Next, as preprocessing, convert the image into a form the BLIP-2 model can understand.

# Preprocess the image
inputs = blip_processor(image, return_tensors="pt").to(device, torch.float16)
print(f"Pixel values shape: {inputs['pixel_values'].shape}")

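The output should be Pixel values shape: torch.Size([1, 3, 224, 224]). The processor resizes the image, so its size has changed.

Now, at last, let's generate text from the image. The preprocessed inputs can be passed directly to the model's generate function. The LLM receives the image information and generates a description of it, returned here as token IDs.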

generated_ids = model.generate(**inputs, max_new_tokens=20)

Let's decode the result and check it. You will get something like Generated caption: an orange supercar driving on the road at sunset, showing that the model describes the image.

generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
generated_text = generated_text[0].strip()
print(f"Generated caption: {generated_text}")

The above was a simple example that just turns an image into a sentence, but by passing a prompt to blip_processor as shown below, you can ask the LLM specific questions about the image.

# Chat-like prompt
prompt = "Question: Write down what you see in this picture. Answer: A sports car driving on the road at sunset. Question: What would it cost me to drive that car? Answer:"

# Preprocess
inputs = blip_processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=30)

# Decode back to a string and check
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
generated_text = generated_text[0].strip()
generated_text

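BLIP-2 above is an early VLM; today there are many VLMs, such as the open model LLaVA and closed models like GPT-4V (and the later GPT series).

Exercise (20 min)

Type in the code above yourself and run it. Also try various images and prompts.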

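Agents¶

Principle¶

Next we study the idea of agents. An LLM is a knowledge source holding a vast amount of knowledge, but there are many cases where simply using something external is more effective. The best-known examples are (1) a calculator and (2) a search engine.

Take the calculator. An LLM can indeed answer a question like "What is 1+1?", but it gets there by working surprisingly hard in a roundabout way. The blog post LLMのキモい算術 is a useful reference. Such simple calculations can of course be solved effectively with a calculator. The agent idea is that the LLM itself calls a calculator, lets it do the computation, receives the result, and reuses it. Something used in this way, like the calculator, is called a tool.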
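To make "tool" concrete: from the program's point of view, a tool is just an ordinary function that takes the text the LLM produced and returns text for the LLM to read. Below is a minimal sketch of a calculator tool. The name `calculator` and its design (walking the syntax tree instead of calling `eval`) are choices made here for illustration, not how any particular agent library implements it.

```python
import ast
import operator

# Arithmetic operators this toy calculator accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculator(expression: str):
    """Evaluate a plain arithmetic expression without using eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval").body)

print(calculator("1 + 1"))
print(calculator("123456789 * 987654321"))
```

The LLM's job is then reduced to deciding when to call the tool and with which expression; the exact arithmetic is done by ordinary code.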

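The same goes for search engines. LLMs hold a vast amount of knowledge, but not the latest information. For example, an LLM trained before 2025 cannot correctly answer "Who won the 2025 Nobel Prize in Physics?" In such cases, the LLM automatically calling a search engine to fetch up-to-date information and generating an answer from it is also part of the agent idea.

If you have used agent-style coding assistants such as Claude Code, the agent idea may already be familiar. Claude Code itself searches the web and runs Linux commands, assisting with coding through trial and error. Here the LLM acts like a brain that oversees the whole process.

Tips

The problem that an LLM cannot know anything from after its training was also illustrated in the RAG section. RAG retrieves information from trusted data you already have, which resembles reading the textbook when studying for an exam. Using a search engine from an agent resembles googling while studying for an exam (or rather, it is exactly that). Both let the LLM consult external information and differ only in approach. In terms of layering, however, RAG refers to a specific technical operation, whereas an agent is the broader concept. It is therefore possible to speak of "an agent that performs RAG".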

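A ReAct framework example with LangChain¶

Now let's try out an agent. We use the ReAct (Reasoning and Acting) framework from the LangChain library. In ReAct, given a prompt from the user, the LLM repeats the following three steps.

Thought: consider what to do about the input prompt
Action: use an external tool such as a calculator or a search engine
Observation: the result is returned to the LLM, which observes it

Here we use DuckDuckGo as the search engine and combine it with a calculator to look up the price of a MacBook.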
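Before turning to LangChain, it may help to see how small the Thought/Action/Observation loop is in principle. The sketch below is a toy, not LangChain's actual implementation: `scripted_llm` is a hard-coded stand-in for a real LLM, the two tools and the price are made up, and the text format is parsed with simple regular expressions.

```python
import re

# Hypothetical tools for illustration: each takes a string and returns a string.
def lookup_price(item: str) -> str:
    prices = {"notebook": "1000"}  # made-up number, not a real price
    return prices.get(item.strip().lower(), "unknown")

def multiply(expression: str) -> str:
    left, right = (float(part) for part in expression.split("*"))
    return str(left * right)

TOOLS = {"lookup_price": lookup_price, "multiply": multiply}

def scripted_llm(transcript: str) -> str:
    """Stand-in for a real LLM: chooses the next step from the observations so far."""
    n_observations = transcript.count("Observation:")
    last = transcript.rsplit("Observation:", 1)[-1].strip()
    if n_observations == 0:
        return "Thought: I need the price first.\nAction: lookup_price\nAction Input: notebook"
    if n_observations == 1:
        return f"Thought: Now convert it to EUR.\nAction: multiply\nAction Input: {last} * 0.85"
    return f"Thought: I now know the final answer\nFinal Answer: {last} EUR"

def react(question: str, llm, tools, max_iterations: int = 5) -> str:
    """Repeat Thought -> Action -> Observation until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_iterations):
        step = llm(transcript)          # Thought (and either an Action or a Final Answer)
        transcript += step + "\n"
        final = re.search(r"Final Answer:\s*(.*)", step)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(.*)", step).group(1).strip()
        action_input = re.search(r"Action Input:\s*(.*)", step).group(1).strip()
        observation = tools[action](action_input)   # Action: run the chosen tool
        transcript += f"Observation: {observation}\n"  # Observation: feed the result back
    return "stopped: iteration limit reached"

answer = react("How much is a notebook in EUR at 0.85 EUR per USD?", scripted_llm, TOOLS)
print(answer)
```

A library such as LangChain adds the pieces this toy leaves out: a real LLM call, robust output parsing, and error handling when the model names a tool that does not exist.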

The following is based on the notebook below from the textbook. Everything runs on Colab. Because it works through API calls to Azure OpenAI, Colab's CPU mode is sufficient.

First, install the additional libraries.

!pip install -U langchain langchain-classic langchain-openai langchain-community langchain-core ddgs numexpr

Next, import the required libraries.

# Sample ReAct agent using LangChain
# Implements an agent that performs web search and calculation with Azure OpenAI (gpt-4o)

from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_community.agent_toolkits.load_tools import load_tools
from langchain_core.tools import Tool
from langchain_community.tools import DuckDuckGoSearchResults
from langchain_classic.agents import create_react_agent, AgentExecutor

Configure Azure OpenAI, the same as for RAG, and then initialize the LLM using LangChain's AzureChatOpenAI class.

# Azure OpenAI settings
endpoint = "xxxxxx"                 # Azure OpenAI endpoint URL (the one shared on Slack)
deployment = "gpt-4o"               # LLM deployment; we use 4o this time
subscription_key = "yyyyy"          # Azure OpenAI key (the one shared on Slack)
api_version = "2024-12-01-preview"  # API version

# Initialize the LLM
openai_llm = AzureChatOpenAI(
    temperature=0,
    api_key=subscription_key,
    azure_endpoint=endpoint,
    api_version=api_version,
    azure_deployment=deployment
)

Next, define the ReAct prompt template. This is the key part, where the concrete instructions are written.

# Define the ReAct prompt template
# ReAct = a prompting technique that combines Reasoning and Acting
react_template = """Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

IMPORTANT: You must ONLY output ONE action at a time. After outputting an Action and Action Input, STOP and wait for the Observation. Do NOT write the Observation yourself - it will be provided to you. Do NOT include a Final Answer until you have completed all necessary actions.

Begin!

Question: {input}
Thought:{agent_scratchpad}"""

prompt = PromptTemplate(
    template=react_template,
    input_variables=["tools", "tool_names", "input", "agent_scratchpad"]
)

Next, prepare the tools the agent will use. We set up a tool that runs the DuckDuckGo search engine and also use a calculator tool, which LangChain already provides. Both are collected in the tools list.

# Tools available to the agent
# 1. Web search tool (DuckDuckGo)
search = DuckDuckGoSearchResults()
search_tool = Tool(
    name="duckduck",
    description="A web search engine. Use this as a search engine for general queries.",
    func=search.run,
)

# 2. Calculator tool (llm-math), combined with the web search tool
tools = load_tools(["llm-math"], llm=openai_llm)
tools.append(search_tool)

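Finally, create the agent from the LLM, the tools, and the prompt template, and run it.

# Build the ReAct agent and its executor
agent = create_react_agent(openai_llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,                    # show the execution trace
    handle_parsing_errors=True,      # handle output-parsing errors automatically
    max_iterations=20                # iteration cap (prevents infinite loops)
)

# Run the agent
# Example: search the web for the MacBook Pro price and convert it at the given exchange rate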
result = agent_executor.invoke(
    {
        "input": "What is the current price of a MacBook Pro in USD? How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD?"
    }
)

# Show the result
print("\n" + "="*50)
print("Final result:")
print("="*50)
print(result["output"])

This automatically carries out the whole sequence: searching DuckDuckGo for the MacBook Pro price and converting it with the calculator tool. The result will look something like the following.

> Entering new AgentExecutor chain...
I need to find the current price of a MacBook Pro in USD first.  
Action: duckduck  
Action Input: Current price of MacBook Pro in USDsnippet: 3 days ago - Amazon has the 10-Core, 16GB RAM, 512GB 14-inch MacBook Pro for $200 off this month, a second-best price. ... Higher-end 14-inch models have not been updated and the current generation is still using the M4 Pro and M4 Max chips from 2024, ..., title: MacRumors Best MacBook Pro and MacBook Air Deals for November 2025 - MacRumors, link: https://www.macrumors.com/guide/macbook-pro-air-deals/, snippet: May 2, 2025 - Microcenter has the 36GB version for $1699 if you're in search of more RAM. https://www.microcenter.com/product/692052/apple-macbook-pro-14-z1au0029n-(late-2023)-142-laptop-computer-space-black ..., title: Reddit r/buildapcsales on Reddit: [Laptop] Apple 14" MacBook Pro (M3 Pro, Space Black) - $1399 ($1999 - $600) (B&H Photo Video), link: https://www.reddit.com/r/buildapcsales/comments/1kd0d06/laptop_apple_14_macbook_pro_m3_pro_space_black/, snippet: 4 days ago - If a big screen is your top priority, you can score a $264 discount on the 16-inch model of the latest MacBook Pro, dropping the price to $2,235 on Amazon . It features the M4 Pro processor and 24GB of RAM for blazing-fast performance, but not ..., title: CNET Best MacBook Deals: With Record Low Prices Now, Who Needs Black Friday? - CNET, link: https://www.cnet.com/deals/best-macbook-deals/, snippet: 3 weeks ago - In the Apple Silicon era, MacBooks have become more affordable than ever – with brand new models starting as low as $599. With looming global tariffs, these great deals could potentially be coming to an end soon – so if you’re in the market ..., title: 9to5Mac These are the best new MacBook deals right now: options starting at $599 - 9to5Mac, link: https://9to5mac.com/2025/10/18/best-apple-macbook-deals-right-now-october-2025/Action: duckduck  
Action Input: Price of 14-inch MacBook Pro (M3 Pro, Space Black) in USDsnippet: Oct 6, 2019 · 1) Befor the distributor can quote you a price on an equivalent pump, a sales engineer has to identify all the specifications of the existing unit, such as shaft, mounting, ports and …, title: "price on" and "price for" - English Language & Usage Stack …, link: https://english.stackexchange.com/questions/514713/price-on-and-price-for, snippet: Feb 9, 2011 · Price point means a point on a scale of possible prices at which something might be marketed; its meaning is different from the meaning of price, which is (principally, but not only) …, title: meaning - Differences between "price point" and "price" - English ..., link: https://english.stackexchange.com/questions/11879/differences-between-price-point-and-price, snippet: Feb 22, 2019 · The Merriam Webster dictionary defines cheap as charging or obtainable at a low price a: a good cheap hotel cheap tickets b : purchasable below the going price or the real value …, title: word usage - Should it be "cheaper price" or "lower price"?, link: https://english.stackexchange.com/questions/486705/should-it-be-cheaper-price-or-lower-price, snippet: Jul 28, 2014 · The preposition "OF" is used here to indicate that the price belongs to/is used in relation with prices of spare parts. Now, the definition of "FOR" as a preposition- For Used to …, title: "Prices of" vs "prices for" - English Language & Usage Stack …, link: https://english.stackexchange.com/questions/187425/prices-of-vs-prices-forAction: duckduck  
Action Input: Current price of 14-inch MacBook Pro M3 Pro in USDsnippet: 5 days ago - The MacBook Pro with Apple silicon is a line of Mac notebook computers introduced in November 2020 by Apple. It is the higher-end model of the MacBook family, sitting above the consumer-focused MacBook Air, and is currently sold with 14-inch (360 mm) and 16-inch (410 mm) screens., title: Wikipedia MacBook Pro (Apple silicon) - Wikipedia, link: https://en.wikipedia.org/wiki/MacBook_Pro_(Apple_silicon), snippet: ... 14 - inch M2 MacBook Pro is an excellent laptop for majority of users and currently you can pick one for $1,750 at Amazon , a $250 discount on its MSRP ..., title: Apple Macbook Pro 14 M2 is $250 off right now!, link: https://laptopdecision.com/blog/Apple-Macbook-Pro-14-M2-is-250-off-right-now, snippet: The updated MacBook Pros also have a lower starting price than the prior version, with the 14 - inch M3 -powered MacBook Pro starting at $1,599., title: Apple announces refreshed 14-inch, 16-inch MacBook Pros and, link: https://www.dpreview.com/news/9267805975/apple-m3-macbook-pro-m3-imac, snippet: ... report (when translated by Google) notes that supply chain sources revealed that "the new MacBook Pro is powered by a 16- inch LCD screen instead of ..., title: 16-inch MacBook Pro to Ship in October for Nearly $3,000, link: https://www.laptopmag.com/articles/16-inch-macbook-pro-release-date-priceAction: Calculator  
Action Input: 1599 * 0.85  Answer: 1359.1499999999999Final Answer: The current price of a 14-inch MacBook Pro (M3 Pro) is $1,599 USD. At an exchange rate of 0.85 EUR for 1 USD, it would cost approximately €1,359.15.

> Finished chain.

==================================================
最終結果:
==================================================
The current price of a 14-inch MacBook Pro (M3 Pro) is $1,599 USD. At an exchange rate of 0.85 EUR for 1 USD, it would cost approximately €1,359.15.
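The calculator step in the log above returned 1359.1499999999999, which is ordinary binary floating-point noise. As a minimal sketch, assuming we only want an amount rounded to two decimal places, the same multiplication can be done with Python's `decimal` module. The variable names below are illustrative and the two input values are simply the ones that appear in the log.

```python
from decimal import Decimal, ROUND_HALF_UP

# Values taken from the agent log above.
usd_price = Decimal("1599")
eur_per_usd = Decimal("0.85")

# Decimal arithmetic avoids binary floating-point rounding noise.
eur_price = (usd_price * eur_per_usd).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)

print(1599 * 0.85)  # binary float result, as seen in the log
print(eur_price)    # 1359.15
```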

Exercise (20 min)

Type in the code above by hand and run it. Then change the prompt and the input and observe how the behavior changes.

MCP¶

To close out today, let's study the Model Context Protocol (MCP). MCP is a protocol designed for building AI agents efficiently.

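In the earlier chapter on agents, an agent was simply a feature of a library that runs locally. However, if the part that provides tools to an agent follows a common standard, other people could access your data through an agent you built, and LLMs could exchange information with each other and with other services. MCP is that common standard. It was introduced by Anthropic in 2024, is becoming the de facto standard protocol for LLMs to access third-party information sources, and describes itself as "a USB-C port for AI applications". Here we touch on MCP only briefly.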

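Like a web application, MCP has the notions of a "server" and a "client". An MCP server provides tools, and an MCP client uses them. In other words, a tool provider publishes an MCP server to offer tools, and anyone who wants to use them accesses that MCP server, for example through the OpenAI API.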
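To make the server/client split a little more concrete: MCP messages are JSON-RPC 2.0, and a client typically discovers tools with a `tools/list` request and invokes one with `tools/call`. The following is a minimal sketch that imitates this exchange in plain Python with no networking. `toy_server`, `TOOLS`, and the `add` tool are hypothetical names made up for illustration, and the message shapes are simplified (a real session starts with an `initialize` handshake, and a real tool listing also carries an input schema).

```python
import json

# Hypothetical tool registry standing in for the tools of an MCP server.
TOOLS = {
    "add": {
        "description": "Add two integers.",
        "func": lambda a, b: a + b,
    },
}


def toy_server(request_json: str) -> str:
    """Handle one JSON-RPC 2.0 request, roughly the way an MCP server would."""
    req = json.loads(request_json)
    if req["method"] == "tools/list":
        result = {
            "tools": [
                {"name": name, "description": tool["description"]}
                for name, tool in TOOLS.items()
            ]
        }
    elif req["method"] == "tools/call":
        params = req["params"]
        value = TOOLS[params["name"]]["func"](**params["arguments"])
        result = {"content": [{"type": "text", "text": str(value)}], "isError": False}
    else:
        # -32601 is the JSON-RPC error code for "method not found".
        error = {"code": -32601, "message": "Method not found"}
        return json.dumps({"jsonrpc": "2.0", "id": req["id"], "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})


# Client side: first ask which tools exist, then call one of them.
list_req = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
call_req = json.dumps({
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "add", "arguments": {"a": 2, "b": 3}},
})

list_resp = json.loads(toy_server(list_req))
call_resp = json.loads(toy_server(call_req))
print(list_resp["result"]["tools"])
print(call_resp["result"]["content"][0]["text"])  # 5
```

In the real setup below, FastMCP plays the role of `toy_server` and the LLM API plays the role of the client, so none of these messages have to be written by hand.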

Let's look at some code right away. Here we use FastMCP 2.0, a library that makes it easy to try MCP, and connect it to the OpenAI API. Specifically, the discussion is based on "OpenAI API 🤝 FastMCP".

MCP server¶

First we build an MCP server. We assume it runs on your local computer. Install uv, create a virtual environment, and then install FastMCP.

$ uv pip install fastmcp

Tips

If you have never used uv, install it from the official site. After that, there are two ways to proceed.

(1) The proper way

Create a project as follows:

$ uv init -p 3.10 mcp_practice
$ cd mcp_practice

Then add fastmcp to the project with uv add:

$ uv add fastmcp

Activate the environment:

$ . .venv/bin/activate

(2) The quick way (venv + better pip)

Create a virtual environment:

$ uv venv

Activate it:

$ . .venv/bin/activate

Install fastmcp with uv pip:

$ uv pip install fastmcp

Then save the following as a file named server.py.

import random
from fastmcp import FastMCP

# Set up the FastMCP server
mcp = FastMCP(name="Dice Roller")

# Define a tool; here, one that rolls dice
@mcp.tool
def roll_dice(n_dice: int) -> list[int]:
    """Roll `n_dice` 6-sided dice and return the results."""
    return [random.randint(1, 6) for _ in range(n_dice)]

if __name__ == "__main__":
    # Start the server locally on localhost:8000
    mcp.run(transport="http", port=8000)

This is an MCP server that provides a tool called roll_dice, which rolls dice.
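Because the argument of a tool is chosen by an LLM (and, once the server is public, by anyone who can reach it), it may be worth validating inputs inside the tool. The following is a minimal sketch in plain Python. `roll_dice_checked` and the bound `MAX_DICE` are hypothetical and are not part of the original server.py; the function could presumably be registered with the same `@mcp.tool` decorator.

```python
import random

MAX_DICE = 100  # hypothetical upper bound, chosen only for illustration


def roll_dice_checked(n_dice: int) -> list[int]:
    """Roll `n_dice` 6-sided dice, rejecting out-of-range requests."""
    if not 1 <= n_dice <= MAX_DICE:
        raise ValueError(f"n_dice must be between 1 and {MAX_DICE}, got {n_dice}")
    return [random.randint(1, 6) for _ in range(n_dice)]


rolls = roll_dice_checked(3)
print(rolls)  # three random values, each between 1 and 6
```

Without such a check, a request like `n_dice=10**9` would make the server try to build a list with a billion elements.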

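Start this program with $ python server.py. A server then comes up at http://127.0.0.1:8000/mcp (http://localhost:8000/mcp). If you open that URL in a browser, you will see something like a "server-error" page. This is expected: the browser sends an ordinary page request rather than an MCP request, so the server rejects it.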
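To see why a browser gets an error, it helps to look at what an MCP client sends instead. As far as we understand the HTTP transport, the client POSTs a JSON-RPC message (the first one being `initialize`) and declares that it accepts both JSON and event-stream responses. The following is a rough sketch using only the standard library. The `protocolVersion` string and the `clientInfo` values are assumptions for illustration, and the block only builds the request without sending it.

```python
import json
import urllib.request

MCP_URL = "http://127.0.0.1:8000/mcp"  # the local server started above


def build_initialize_request(url: str = MCP_URL) -> urllib.request.Request:
    """Build the first JSON-RPC message an MCP client POSTs to the server."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            # Assumed version string; use one that your server supports.
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # Hypothetical client name and version.
            "clientInfo": {"name": "manual-check", "version": "0.0.1"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


req = build_initialize_request()
print(req.get_method(), req.full_url)

# With server.py running, the request could be sent like this:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status, resp.read().decode("utf-8"))
```

Whether the reply arrives as plain JSON or as an event stream depends on the server, which is why the commented-out part just prints the raw body. In practice a client library handles all of this for you.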

Tips

At this point you can check the behavior with the following command.

$ fastmcp dev server.py

If you are told that npx is required, install it, for example by following the steps below. You can skip this check and still move on.

# Install node env (with Volta)
$ curl https://get.volta.sh | bash
$ volta -V
$ volta install node

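We have now started an MCP server locally. Next, we try making this server reachable from outside. For this we use ngrok, a service that exposes a local server such as localhost:8000 to the internet. Create an account on the official ngrok site and log in. The dashboard explains how to install ngrok and do the initial setup, so follow the instructions for your environment. For example, if you use Ubuntu under WSL on Windows, snap (a package manager that ships with Ubuntu, similar to apt) is probably the easiest route.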

$ sudo snap install ngrok

On a Mac, something like the following should work:

$ brew install ngrok

Then do the initial setup. Replace xxxxxxxx with the token shown on your own dashboard.

$ ngrok config add-authtoken xxxxxxxx

The ngrok command is now ready to use. For example, running ngrok http 80 issues a URL such as https://xxx.yyy.zzz.ngrok-free.dev (the xxx.yyy.zzz part differs for each user), and accessing that URL reaches localhost:80.

Tips

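ngrok is a very dangerous command that can expose the contents of your computer to the whole world. When you actually use it, make sure that what you expose contains no personal or confidential information, and shut it down as soon as you are done using it.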

Now let's expose the MCP server to the world. Open two terminals. In the first one, start the MCP server.

$ python server.py

Leave it running, and in the other terminal start ngrok.

$ ngrok http 8000

This brings up the ngrok status screen, which should contain a line like the following:

Forwarding                    https://xxx.yyy.zzz.ngrok-free.dev -> http://localhost:8000

The https://xxx.yyy.zzz.ngrok-free.dev part is the URL published on the internet. Open this URL in a browser with /mcp appended, for example https://xxx.yyy.zzz.ngrok-free.dev/mcp. You should reach the MCP server and see the same "server-error" page as before. The MCP server is now public. Keep both terminals open the whole time you run the client below.

MCP client¶

Next is the client side. As before, we call the LLM through the API from Colab using the Azure OpenAI client, this time making it use the MCP server we just set up.

Enter the following in Colab and run it.

from openai import AzureOpenAI

# Azure OpenAI settings
endpoint = "xxxxxx"                 # Azure OpenAI endpoint URL; the one shared on Slack.
deployment = "gpt-4o"               # LLM model; we use 4o this time.
subscription_key = "yyyyy"          # Azure OpenAI key; the one shared on Slack.
api_version = "2025-03-01-preview"  # API version; this time we use the newer 2025-03-01-preview.

# Initialize the Azure OpenAI client
client = AzureOpenAI(
    api_version=api_version,
    azure_endpoint=endpoint,
    api_key=subscription_key,
)

# Server URL; set this to the URL published by ngrok above.
url = 'https://xxx.yyy.zzz.ngrok-free.dev'

# Query the LLM while letting it use the MCP tools
resp = client.responses.create(
    model=deployment,
    tools=[
        {
            "type": "mcp",
            "server_label": "dice_server",
            "server_url": f"{url}/mcp",
            "require_approval": "never",
        },
    ],
    input="Roll a few dice!",
)

print(resp.output_text)

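This code just calls GPT-4o through the API as usual. The difference is that tools specifies the MCP server to use. Note that the server is called without any authentication. Running it gives, for example, the following result, which shows that the model used the dice tool.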

You rolled a **1** and a **5**! 🎲

If you look at the two terminals at this point, both should show traces of the access. We have thus confirmed both publishing an MCP server and accessing it from an MCP client.
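Besides the terminal logs, the response object itself should also record the tool use. As far as we understand the Responses API, `resp.output` is a list of items, and MCP tool invocations appear as items whose `type` is `"mcp_call"`. The following is a rough sketch under that assumption. `summarize_mcp_calls` is a hypothetical helper, the attribute names may differ on Azure, and the stand-in objects are made up so that the block runs without an API call.

```python
from types import SimpleNamespace


def summarize_mcp_calls(output_items) -> list[str]:
    """Return one line per MCP tool call found in a response's output items."""
    lines = []
    for item in output_items:
        # Items of other types (tool listing, final message, ...) are skipped.
        if getattr(item, "type", None) == "mcp_call":
            lines.append(
                f"{item.server_label}.{item.name}({item.arguments}) -> {item.output}"
            )
    return lines


# Stand-in objects for illustration; with a real response, pass resp.output instead.
fake_output = [
    SimpleNamespace(type="mcp_list_tools", server_label="dice_server"),
    SimpleNamespace(
        type="mcp_call",
        server_label="dice_server",
        name="roll_dice",
        arguments='{"n_dice": 2}',
        output="[1, 5]",
    ),
    SimpleNamespace(type="message"),
]

for line in summarize_mcp_calls(fake_output):
    print(line)
```

With the client code above, calling `summarize_mcp_calls(resp.output)` would be the corresponding usage.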

Exercise (20 min)

Run the code above. Then modify the MCP server in various ways and access it from the client. For example, add the following to the server.

@mcp.tool
def greet(name: str) -> str:
    return f"Hello, {name}!"

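After restarting server.py so that the new tool is registered, change input="Roll a few dice!" on the client side to something like input="Greet Alice" and run it again. This time you should see that the greet tool is called.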
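The model can call greet with the right argument because the tool's name, docstring, and type hints are turned into a tool description with an input schema, which the client passes to the LLM. The following sketch illustrates the idea with the standard library only. It is not FastMCP's actual implementation; `sketch_input_schema` and the `JSON_TYPES` mapping are hypothetical and cover only a few basic types.

```python
import inspect
from typing import get_type_hints

# Simplified mapping from Python types to JSON Schema type names
# (an assumption made for this sketch).
JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def sketch_input_schema(func) -> dict:
    """Derive a minimal JSON-Schema-like description of a function's parameters."""
    hints = get_type_hints(func)
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unknown or missing annotations fall back to "string" in this sketch.
        properties[name] = {"type": JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


def greet(name: str) -> str:
    """Return a greeting for `name`."""
    return f"Hello, {name}!"


schema = sketch_input_schema(greet)
print(schema)
```

This is also why a docstring and precise type hints are worth writing for every tool: they are the main information the model has when deciding whether and how to call it.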

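Exercise (20 min)

The URL created by ngrok is accessible to other people too. So, after building your own MCP server, give the ngrok URL to the person sitting next to you and have them access it. This lets you experience having someone else use your MCP server.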

Tips

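To repeat, the ngrok approach is risky, so do not overuse it. Here it is for demonstration purposes only. If you actually want to deploy an MCP server you built, FastMCP appears to offer a hosting service called FastMCP Cloud, for example, and using an official option like that is a good idea.

Also, if you use an MCP server locally (for example, accessing your own data from your own Claude Desktop), there is no need to expose it to the world in the first place.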

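Next sessions¶

From the next session onward, class time is for the free project. See the Day 6 materials for example projects. If you want to work as a team, form your team by the next session. A team may have at most three members.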