{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Transformers (Hugging Face) Tutorial - The Pretrained Model Hub\n",
    "\n",
    "Welcome to this tutorial on the Hugging Face Transformers library! `transformers` has become the de facto standard for modern natural language processing (NLP). It provides tens of thousands of pretrained models (especially models based on the Transformer architecture, such as BERT, GPT, and T5), along with convenient tools for downloading, loading, and using them for inference and fine-tuning.\n",
    "\n",
    "**Why does Hugging Face Transformers matter?**\n",
    "\n",
    "1. **A huge model library (Hugging Face Hub)**: easy access to a large number of state-of-the-art (SOTA) models covering NLP, computer vision, audio, and other domains.\n",
    "2. **Easy-to-use API**: a high-level `pipeline` API for quick inference, plus unified classes such as `AutoModel` and `AutoTokenizer` for loading models and tokenizers.\n",
    "3. **Framework compatibility**: supports PyTorch, TensorFlow, and JAX.\n",
    "4. **Standardization and reproducibility**: promotes model sharing and reproducible research.\n",
    "5. **A strong community**: an active community contributes many models, datasets, and tutorials.\n",
    "6. **A transfer learning workhorse**: makes it easy to fine-tune large pretrained models on downstream tasks.\n",
    "\n",
    "**This tutorial covers the core concepts and usage of the Transformers library:**\n",
    "\n",
    "1. Installation and setup\n",
    "2. Core concepts: Pipelines, Tokenizers, Models\n",
    "3. Quick inference with the `pipeline` API (including zero-shot classification)\n",
    "4. Loading pretrained models and tokenizers (`AutoModel`, `AutoTokenizer`)\n",
    "5. Tokenizing and encoding text\n",
    "6. Running inference with a loaded model (getting logits or hidden states)\n",
    "7. (Overview) The fine-tuning workflow\n",
    "8. Introduction to the Hugging Face Hub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Installation and Setup\n",
    "\n",
    "You need to install the `transformers` library. It is usually a good idea to install PyTorch or TensorFlow (or both) alongside it.\n",
    "\n",
    "```bash\n",
    "pip install transformers\n",
    "\n",
    "# Install PyTorch (get the command for your system and CUDA version from pytorch.org)\n",
    "# pip install torch torchvision torchaudio\n",
    "\n",
    "# Or install TensorFlow (get the command for your system and CUDA version from tensorflow.org)\n",
    "# pip install tensorflow\n",
    "```\n",
    "Some tasks or models need extra dependencies, for example `sentencepiece` (used by some tokenizers) or `datasets` (for loading and processing datasets).\n",
    "```bash\n",
    "pip install sentencepiece datasets\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the required libraries\n",
    "import transformers\n",
    "from transformers import pipeline  # high-level API\n",
    "from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModel  # lower-level API\n",
    "import torch  # the examples mainly use PyTorch\n",
    "\n",
    "print(f\"Transformers version: {transformers.__version__}\")\n",
    "print(f\"PyTorch version: {torch.__version__}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Core Concepts: Pipelines, Tokenizers, Models\n",
    "\n",
    "* **Pipeline**: the simplest interface. It wraps preprocessing (such as tokenization), model inference, and postprocessing, so it can take raw input (such as text) and return easy-to-read results. It is well suited to quick applications and prototyping.\n",
    "* **Tokenizer**: converts raw text into the numeric inputs a model understands (typically token IDs, an attention mask, and so on). Every pretrained model has its own tokenizer, and the two must be used together.\n",
    "* **Model**: represents the architecture and weights of a pretrained model. `transformers` provides model classes for different tasks (for example `AutoModelForSequenceClassification` for sequence classification, `AutoModelForQuestionAnswering` for question answering, and `AutoModel` for the outputs of the base Transformer)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Quick Inference with the `pipeline` API\n",
    "\n",
    "The `pipeline()` function automatically downloads and caches the model and tokenizer it needs, so you can run a wide range of tasks with very little code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"--- Using pipeline API ---\")\n",
    "\n",
    "# --- Task 1: Sentiment Analysis ---\n",
    "print(\"\\n--- Sentiment Analysis ---\")\n",
    "try:\n",
    "    # The model is downloaded on first run (the default is usually distilbert-base-uncased-finetuned-sst-2-english)\n",
    "    sentiment_pipeline = pipeline(\"sentiment-analysis\")\n",
    "\n",
    "    text1 = \"This movie was absolutely fantastic! Highly recommended.\"\n",
    "    text2 = \"The plot was predictable and the acting was mediocre.\"\n",
    "\n",
    "    results = sentiment_pipeline([text1, text2])\n",
    "    for i, result in enumerate(results):\n",
    "        print(f\"Text {i+1}: Label='{result['label']}', Score={result['score']:.4f}\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"Error running sentiment analysis pipeline (maybe model download failed?): {e}\")\n",
    "\n",
    "# --- Task 2: Text Generation ---\n",
    "print(\"\\n--- Text Generation ---\")\n",
    "try:\n",
    "    # Use a small GPT-2 model\n",
    "    generator = pipeline(\"text-generation\", model=\"gpt2\")\n",
    "    prompt = \"In a hole in the ground there lived a\"\n",
    "    # do_sample=True is set explicitly because returning several sequences requires sampling (or beam search)\n",
    "    generated_texts = generator(prompt, max_length=30, num_return_sequences=2, do_sample=True)\n",
    "\n",
    "    print(f\"Prompt: '{prompt}'\")\n",
    "    for i, text in enumerate(generated_texts):\n",
    "        print(f\"Generated sequence {i+1}: {text['generated_text']}\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"Error running text generation pipeline: {e}\")\n",
    "\n",
    "# --- Task 3: Zero-Shot Classification ---\n",
    "print(\"\\n--- Zero-Shot Classification ---\")\n",
    "try:\n",
    "    # Classifies text without any fine-tuning on the specific labels\n",
    "    zero_shot_classifier = pipeline(\"zero-shot-classification\")\n",
    "    sequence_to_classify = \"Who are you voting for in 2024?\"\n",
    "    candidate_labels = [\"politics\", \"economy\", \"entertainment\", \"environment\"]\n",
    "\n",
    "    result = zero_shot_classifier(sequence_to_classify, candidate_labels)\n",
    "    formatted_scores = [f\"{s:.3f}\" for s in result[\"scores\"]]\n",
    "    print(f\"Sequence: '{sequence_to_classify}'\")\n",
    "    print(f\"Predicted labels and scores: {result['labels']} - {formatted_scores}\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"Error running zero-shot classification pipeline: {e}\")\n",
    "\n",
    "# --- Task 4: Fill-Mask ---\n",
    "print(\"\\n--- Fill-Mask ---\")\n",
    "try:\n",
    "    unmasker = pipeline(\"fill-mask\")  # usually a BERT-style model\n",
    "    # The input must contain the model's mask token. Its spelling depends on the model\n",
    "    # (for example <mask> or [MASK]), so it is read from the tokenizer here.\n",
    "    mask_token = unmasker.tokenizer.mask_token\n",
    "    masked_text = f\"Paris is the {mask_token} city of France.\"\n",
    "    results = unmasker(masked_text, top_k=3)  # get the 3 most likely fillers\n",
    "\n",
    "    print(f\"Masked text: '{masked_text}'\")\n",
    "    for result in results:\n",
    "        print(f\"  Prediction: '{result['token_str']}' (Score: {result['score']:.4f}, Sequence: {result['sequence']})\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"Error running fill-mask pipeline: {e}\")\n",
    "\n",
    "# Many other tasks are available: 'ner' (named entity recognition), 'question-answering', 'summarization', 'translation_xx_to_yy', and more"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Loading Pretrained Models and Tokenizers (`AutoModel`, `AutoTokenizer`)\n",
    "\n",
    "The `AutoClasses` (such as `AutoTokenizer`, `AutoModel`, and `AutoModelForSequenceClassification`) are very useful factory classes. You only provide a model identifier (usually a model name on the Hugging Face Hub, such as `bert-base-uncased` or `distilbert-base-uncased`), and they infer the model architecture and load the matching tokenizer and model class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"--- Loading Tokenizer and Model using AutoClasses ---\")\n",
    "\n",
    "# Choose a model identifier (checkpoint)\n",
    "model_checkpoint = \"distilbert-base-uncased-finetuned-sst-2-english\"  # same as the default sentiment analysis model\n",
    "# model_checkpoint = \"bert-base-uncased\"\n",
    "\n",
    "try:\n",
    "    # 1. Load the tokenizer\n",
    "    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)\n",
    "    print(f\"Tokenizer loaded for '{model_checkpoint}'\")\n",
    "    print(f\"Tokenizer class: {type(tokenizer)}\")\n",
    "\n",
    "    # 2. Load the model\n",
    "    # Pick the model class that fits the task, here sequence classification\n",
    "    model_for_classification = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)\n",
    "    print(f\"\\nModel loaded for sequence classification: '{model_checkpoint}'\")\n",
    "    print(f\"Model class: {type(model_for_classification)}\")\n",
    "\n",
    "    # If you only want the model's hidden states (embeddings), use AutoModel\n",
    "    # base_model = AutoModel.from_pretrained(model_checkpoint)\n",
    "    # print(f\"\\nBase model loaded: '{model_checkpoint}'\")\n",
    "    # print(f\"Base model class: {type(base_model)}\")\n",
    "\n",
    "except OSError as e:\n",
    "    print(f\"\\nError loading model/tokenizer '{model_checkpoint}'. Check model name or internet connection.\")\n",
    "    print(f\"Error details: {e}\")\n",
    "    tokenizer = None\n",
    "    model_for_classification = None\n",
    "except Exception as e:\n",
    "    print(f\"\\nAn unexpected error occurred: {e}\")\n",
    "    tokenizer = None\n",
    "    model_for_classification = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Tokenizing and Encoding Text\n",
    "\n",
    "The tokenizer converts text into a format the model can process.\n",
    "\n",
    "* **Tokenization**: split the text into subword units (tokens).\n",
    "* **Conversion to IDs**: map each token to its unique integer ID in the model's vocabulary.\n",
    "* **Adding special tokens**: add the special tokens the model expects, such as `[CLS]` (classification token), `[SEP]` (separator token), and `[PAD]` (padding token).\n",
    "* **Generating the attention mask**: create a binary mask with the same length as the input ID sequence that marks which tokens are real input and which are padding (`1` for real, `0` for padding).\n",
    "\n",
    "Calling `tokenizer(text, ...)` performs all of the steps above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"--- Tokenization and Encoding ---\")\n",
    "\n",
    "if tokenizer is not None:\n",
    "    text_example = \"This is a sample sentence for tokenization.\"\n",
    "\n",
    "    # Basic tokenization\n",
    "    tokens = tokenizer.tokenize(text_example)\n",
    "    print(f\"\\nText: '{text_example}'\")\n",
    "    print(f\"Tokens: {tokens}\")  # a ## prefix marks a subword continuation\n",
    "\n",
    "    # Convert tokens to IDs\n",
    "    token_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
    "    print(f\"Token IDs: {token_ids}\")\n",
    "\n",
    "    # Decode back to text\n",
    "    decoded_text = tokenizer.decode(token_ids)\n",
    "    print(f\"Decoded text: '{decoded_text}'\")  # may differ slightly from the original\n",
    "\n",
    "    # --- Encode in one step with tokenizer() (recommended) ---\n",
    "    print(\"\\n--- Encoding using tokenizer() ---\")\n",
    "    text_batch = [\n",
    "        \"First sentence.\",\n",
    "        \"This is a slightly longer second sentence.\"\n",
    "    ]\n",
    "\n",
    "    # padding=True: pad every sentence in the batch to the length of the longest one\n",
    "    # truncation=True: truncate sentences that exceed the model's maximum length\n",
    "    # return_tensors=\"pt\": return PyTorch tensors ('tf' for TensorFlow, 'np' for NumPy)\n",
    "    encoded_input = tokenizer(text_batch, padding=True, truncation=True, return_tensors=\"pt\")\n",
    "\n",
    "    print(\"Encoded input (PyTorch Tensors):\")\n",
    "    # .items() makes it easy to print each field\n",
    "    for key, value in encoded_input.items():\n",
    "        print(f\"  {key}:\")\n",
    "        print(value)\n",
    "\n",
    "    print(f\"\\nShape of input_ids: {encoded_input['input_ids'].shape}\")\n",
    "    print(f\"Shape of attention_mask: {encoded_input['attention_mask'].shape}\")\n",
    "\n",
    "    # Inspect the special tokens\n",
    "    print(f\"\\nSpecial tokens: {tokenizer.special_tokens_map}\")\n",
    "    print(f\"CLS token: {tokenizer.cls_token}, ID: {tokenizer.cls_token_id}\")\n",
    "    print(f\"SEP token: {tokenizer.sep_token}, ID: {tokenizer.sep_token_id}\")\n",
    "    print(f\"PAD token: {tokenizer.pad_token}, ID: {tokenizer.pad_token_id}\")\n",
    "else:\n",
    "    print(\"Tokenizer not loaded, skipping encoding examples.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Running Inference with a Loaded Model\n",
    "\n",
    "Pass the encoded input to the loaded model to get its output.\n",
    "* For classification (`AutoModelForSequenceClassification`), the output usually contains `logits` (raw scores).\n",
    "* For a base model (`AutoModel`), the output usually contains `last_hidden_state` (the hidden states/embeddings of the last layer)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"--- Model Inference ---\")\n",
    "\n",
    "if model_for_classification is not None and tokenizer is not None:\n",
    "    texts_for_inference = [\n",
    "        \"This library is incredibly useful!\",\n",
    "        \"I am not sure if I like this product.\"\n",
    "    ]\n",
    "\n",
    "    # 1. Encode the input\n",
    "    inputs = tokenizer(texts_for_inference, padding=True, truncation=True, return_tensors=\"pt\")\n",
    "    print(f\"Encoded inputs for inference:\\n{inputs}\")\n",
    "\n",
    "    # 2. Move the model and the inputs to the same device (GPU if available, otherwise CPU)\n",
    "    device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "    model_for_classification.to(device)\n",
    "    inputs = inputs.to(device)\n",
    "    print(f\"\\nModel is on device: {next(model_for_classification.parameters()).device}\")\n",
    "\n",
    "    # 3. Run inference (under torch.no_grad(), since no gradients are needed)\n",
    "    with torch.no_grad():\n",
    "        outputs = model_for_classification(**inputs)  # ** unpacks the mapping into keyword arguments\n",
    "\n",
    "    # 4. Process the output\n",
    "    print(f\"\\nModel output type: {type(outputs)}\")\n",
    "    print(f\"Model output keys: {outputs.keys()}\")  # usually contains 'logits'\n",
    "\n",
    "    logits = outputs.logits\n",
    "    print(f\"\\nLogits (raw scores) shape: {logits.shape}\")  # [batch_size, num_labels]\n",
    "    print(f\"Logits:\\n{logits}\")\n",
    "\n",
    "    # Convert logits to probabilities (softmax)\n",
    "    probabilities = torch.softmax(logits, dim=-1)\n",
    "    print(f\"\\nProbabilities:\\n{probabilities.round(decimals=3)}\")\n",
    "\n",
    "    # Get the predicted class (index of the highest probability)\n",
    "    predictions = torch.argmax(probabilities, dim=-1)\n",
    "    print(f\"\\nPredicted class indices: {predictions}\")\n",
    "\n",
    "    # Map indices back to labels\n",
    "    # model.config holds the model configuration, including the label mapping\n",
    "    id2label = model_for_classification.config.id2label\n",
    "    predicted_labels = [id2label[idx.item()] for idx in predictions]\n",
    "    print(f\"Predicted labels: {predicted_labels}\")\n",
    "\n",
    "else:\n",
    "    print(\"Model or Tokenizer not loaded, skipping inference example.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. (Overview) The Fine-tuning Workflow\n",
    "\n",
    "Fine-tuning means continuing to train a large pretrained model on a task-specific dataset so that it adapts to that task. It is a form of transfer learning and is usually more efficient than training a model from scratch.\n",
    "\n",
    "**Basic steps:**\n",
    "1. **Load the pretrained model and tokenizer**: use a task-specific Auto class (such as `AutoModelForSequenceClassification`) to load a model suited to the downstream task.\n",
    "2. **Prepare the dataset**: load your task-specific dataset (the `datasets` library can help) and encode it with the model's tokenizer.\n",
    "3. **Define the training arguments**: use the `TrainingArguments` class to set training hyperparameters (learning rate, number of epochs, batch size, and so on).\n",
    "4. **Create a `Trainer`**: the `Trainer` class wraps the training and evaluation loops.\n",
    "5. **Start training**: call `trainer.train()`.\n",
    "6. **(Optional) Evaluate**: call `trainer.evaluate()`.\n",
    "\n",
    "**Example (conceptual sketch; `NUM_YOUR_CLASSES` and `\"your_dataset_name\"` are placeholders you must fill in):**\n",
    "```python\n",
    "from datasets import load_dataset\n",
    "from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer\n",
    "\n",
    "# 1. Load the model and tokenizer\n",
    "model_name = \"bert-base-uncased\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=NUM_YOUR_CLASSES)\n",
    "\n",
    "# 2. Load and preprocess the dataset\n",
    "raw_datasets = load_dataset(\"your_dataset_name\")  # or load from files\n",
    "\n",
    "def tokenize_function(examples):\n",
    "    return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n",
    "\n",
    "tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)\n",
    "# ... you may also need to set the format or remove unneeded columns ...\n",
    "train_dataset = tokenized_datasets[\"train\"]\n",
    "eval_dataset = tokenized_datasets[\"validation\"]\n",
    "\n",
    "# 3. Define the training arguments\n",
    "training_args = TrainingArguments(\n",
    "    output_dir=\"./results\",          # output directory\n",
    "    eval_strategy=\"epoch\",           # evaluate after every epoch (named evaluation_strategy in older releases)\n",
    "    learning_rate=2e-5,              # learning rate\n",
    "    per_device_train_batch_size=16,  # training batch size\n",
    "    per_device_eval_batch_size=16,   # evaluation batch size\n",
    "    num_train_epochs=3,              # number of training epochs\n",
    "    weight_decay=0.01,               # weight decay\n",
    "    push_to_hub=False,               # whether to push to the Hub (optional)\n",
    ")\n",
    "\n",
    "# 4. Create the Trainer\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=training_args,\n",
    "    train_dataset=train_dataset,\n",
    "    eval_dataset=eval_dataset,\n",
    "    # compute_metrics=compute_metrics_function  # (optional) custom metrics function\n",
    ")\n",
    "\n",
    "# 5. Start training\n",
    "trainer.train()\n",
    "\n",
    "# 6. Evaluate\n",
    "trainer.evaluate()\n",
    "```\n",
    "Fine-tuning is a deeper topic that involves details such as data preparation and hyperparameter selection. The official Hugging Face documentation and tutorials provide more detailed guidance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Introduction to the Hugging Face Hub\n",
    "\n",
    "The Hugging Face Hub ([huggingface.co](https://huggingface.co/)) is a collaboration platform that hosts:\n",
    "* **Tens of thousands of pretrained models**: loadable directly in `transformers` by model identifier.\n",
    "* **Thousands of datasets**: easy to load with the `datasets` library.\n",
    "* **Spaces**: a platform for hosting and running ML application demos.\n",
    "* **Evaluation metrics**.\n",
    "\n",
    "You can browse the Hub to find models and datasets that fit your task, and read the Model Cards to learn about a model's details, usage, and limitations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "The Hugging Face Transformers library greatly simplifies using and deploying state-of-the-art pretrained models (especially Transformer-based ones). Its easy-to-use `pipeline` API, powerful `AutoClasses`, and solid integration with PyTorch/TensorFlow make it an essential tool for research and application development in NLP, CV, audio, and other fields.\n",
    "\n",
    "**Key takeaways:**\n",
    "* `pipeline` is the simplest way to run quick inference.\n",
    "* `AutoTokenizer` and `AutoModel` (and their variants) load the model components.\n",
    "* Tokenization is the core step that turns text into model input.\n",
    "* A loaded model lets you run detailed inference and inspect internal states.\n",
    "* The library supports fine-tuning pretrained models for specific tasks.\n",
    "* The Hugging Face Hub is a key resource for finding and sharing models and datasets.\n",
    "\n",
    "Mastering the Transformers library lets you apply powerful pretrained models to a wide range of complex machine learning tasks."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 5
}