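{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scikit-learn: A Tutorial on Python's General-Purpose Machine Learning Library\n", "\n", "Welcome to the Scikit-learn tutorial! Scikit-learn (also written `sklearn`) is one of the most widely used and comprehensive Python libraries for “classical” machine learning. It provides tools for data preprocessing, model selection, model training, and evaluation, along with implementations of common machine learning algorithms (classification, regression, clustering, dimensionality reduction).\n", "\n", "**Why does Scikit-learn matter for ML, DL, and data science?**\n", "\n", "1. **Consistent API**: a simple, uniform interface (`fit`, `predict`, `transform`) for working with different algorithms.\n", "2. **Broad algorithm coverage**: it includes most of the commonly used non-deep-learning algorithms.\n", "3. **Preprocessing tools**: feature scaling, encoding, missing-value imputation, and more.\n", "4. **Model selection and evaluation**: built-in cross-validation, hyperparameter tuning, and a wide range of evaluation metrics.\n", "5. **Good integration with other libraries**: it works closely with NumPy, SciPy, Pandas, and Matplotlib.\n", "6. **Strong documentation and community support**: clear documentation, plentiful examples, and an active community.\n", "7. **A foundation for further learning**: Scikit-learn's design and API style have influenced many other machine learning libraries, including parts of some deep learning library interfaces.\n", "\n", "**This tutorial covers Scikit-learn's core workflow and commonly used features:**\n", "\n", "1. Loading datasets\n", "2. Data preprocessing (feature scaling, encoding)\n", "3. Splitting the dataset (training set / test set)\n", "4. Model training (`fit`)\n", "5. Model prediction (`predict`, `predict_proba`)\n", "6. Common model examples (classification: Logistic Regression, Random Forest; regression: Linear Regression; clustering: KMeans)\n", "7. Model evaluation (common metrics)\n", "8. Cross-validation (`cross_val_score`)\n", "9. Hyperparameter tuning (`GridSearchCV`)\n", "10. 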
Pipelines (`Pipeline`)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup: Importing the Required Libraries\n", "\n", "We import the Scikit-learn modules we need, along with NumPy, Pandas, Matplotlib, and Seaborn. The same cell also loads the example datasets used throughout the tutorial." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Standard Libraries\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Scikit-learn modules\n", "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.linear_model import LinearRegression, LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", "from sklearn.svm import SVC\n", "from sklearn.cluster import KMeans\n", "# Parentheses let the import list span several lines\n", "from sklearn.metrics import (\n", "    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,\n", "    mean_squared_error, r2_score, confusion_matrix, classification_report,\n", ")\n", "from sklearn.pipeline import Pipeline\n", "# load_boston was removed in scikit-learn 1.2, so it is not imported here;\n", "# the California Housing dataset is loaded below instead.\n", "from sklearn.datasets import load_iris\n", "from sklearn.compose import ColumnTransformer\n", "\n", "# Set plotting style\n", "sns.set_theme(style=\"whitegrid\")\n", "\n", "print(\"Libraries imported.\")\n", "\n", "# Load the California Housing dataset for the regression examples.\n", "# The data is downloaded on first use, so a network failure (OSError) is handled as well.\n", "try:\n", " from sklearn.datasets import fetch_california_housing\n", " california_housing = fetch_california_housing(as_frame=True) # Use as_frame=True to get a Pandas DataFrame\n", " print(\"Using California Housing dataset.\")\n", " regression_data = california_housing\n", "except (ImportError, OSError):\n", " print(\"fetch_california_housing not available. 
Some regression examples might be limited.\")\n", " regression_data = None\n", "\n", "# Load Iris dataset for classification\n", "iris = load_iris(as_frame=True)\n", "iris_df = iris.data.copy() # Copy so the columns added below do not modify iris.data\n", "iris_df['target'] = iris.target\n", "iris_df['species'] = iris_df['target'].map({i: name for i, name in enumerate(iris.target_names)}) # Add species names\n", "print(\"\\nIris dataset loaded into DataFrame:\")\n", "print(iris_df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Data Preprocessing\n", "\n", "Machine learning models usually expect numeric, consistently scaled input. Preprocessing covers handling missing values, converting categorical features, and scaling numeric features.\n", "\n", "* **Feature scaling**: brings numeric features into a similar range so that features with large magnitudes do not dominate the model.\n", "    * `StandardScaler`: standardization (zero mean and unit variance per feature).\n", "    * `MinMaxScaler`: rescales each feature to [0, 1] (or another specified range).\n", "* **Categorical encoding**: converts text or category labels into a numeric representation.\n", "    * `LabelEncoder`: encodes labels as integers from 0 to n_classes-1 (usually used for the target variable).\n", "    * `OneHotEncoder`: turns a feature with k categories into k binary (0/1) features (usually used for input features, which avoids implying a spurious ordering).\n", "* **Imputation**: fills missing values (`NaN`) using a chosen strategy such as the mean, median, or most frequent value.\n", "    * `SimpleImputer`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"--- Data Preprocessing Examples ---\")\n", "\n", "# --- Feature Scaling ---\n", "print(\"\\n--- Feature Scaling ---\")\n", "data_scale = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])\n", "print(f\"Original data:\\n{data_scale}\")\n", "\n", "scaler_standard = StandardScaler()\n", "scaled_standard = scaler_standard.fit_transform(data_scale)\n", "print(f\"\\nStandardized data (mean ~0, std ~1):\\n{scaled_standard}\")\n", "print(f\"Mean after scaling: {scaled_standard.mean(axis=0)}\")\n", "print(f\"Std after scaling: {scaled_standard.std(axis=0)}\")\n", "\n", "scaler_minmax = MinMaxScaler()\n", "scaled_minmax = scaler_minmax.fit_transform(data_scale)\n", "print(f\"\\nMinMax scaled data (range [0, 1]):\\n{scaled_minmax}\")\n", "\n", "# --- Categorical Encoding ---\n", "print(\"\\n--- Categorical Encoding ---\")\n", "categorical_feature = [['Male', 'Low'], ['Female', 'Medium'], 
['Female', 'High'], ['Male', 'Low']]\n", "df_cat = pd.DataFrame(categorical_feature, columns=['Gender', 'Level'])\n", "print(f\"Original categorical data:\\n{df_cat}\")\n", "\n", "# OneHotEncoder for input features\n", "onehot_encoder = OneHotEncoder(sparse_output=False) # sparse_output=False returns a dense NumPy array (parameter name in scikit-learn >= 1.2)\n", "encoded_features = onehot_encoder.fit_transform(df_cat)\n", "print(f\"\\nOneHot encoded features:\\n{encoded_features}\")\n", "print(f\"Feature names: {onehot_encoder.get_feature_names_out()}\")\n", "\n", "# LabelEncoder for target variable (example)\n", "target_labels = ['Cat', 'Dog', 'Cat', 'Fish', 'Dog']\n", "label_encoder = LabelEncoder()\n", "encoded_labels = label_encoder.fit_transform(target_labels)\n", "print(f\"\\nOriginal labels: {target_labels}\")\n", "print(f\"Label encoded labels: {encoded_labels}\")\n", "print(f\"Encoded classes: {label_encoder.classes_}\")\n", "\n", "# --- Imputation ---\n", "print(\"\\n--- Imputation (Missing Values) ---\")\n", "data_missing = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])\n", "print(f\"Data with missing values:\\n{data_missing}\")\n", "\n", "# strategy='mean' replaces each NaN with the mean of its column\n", "imputer_mean = SimpleImputer(strategy='mean')\n", "imputed_data = imputer_mean.fit_transform(data_missing)\n", "print(f\"\\nImputed data (using mean):\\n{imputed_data}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Splitting the Dataset (Training Set / Test Set)\n", "\n", "Splitting the data into a training set and a test set is a key step in estimating how well a model generalizes to unseen data.\n", "`train_test_split` is the usual tool for this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"--- Train/Test Split --- \")\n", "# Use the Iris dataset\n", "X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']] # Features\n", "y = iris_df['target'] # Target variable (class labels 0, 1, 2)\n", "\n", "# test_size: fraction (float) or absolute number (int) of samples in the test set\n", "# random_state: random seed, so the split is the same on every run\n", "# stratify=y: for classification, keeps the class proportions in the training and test sets the same as in the original data\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)\n", "\n", "print(f\"Original dataset shape: X={X.shape}, y={y.shape}\")\n", "print(f\"Training set shape: X_train={X_train.shape}, y_train={y_train.shape}\")\n", "print(f\"Test set shape: X_test={X_test.shape}, y_test={y_test.shape}\")\n", "print(f\"\\nProportion of classes in original y:\\n{y.value_counts(normalize=True)}\")\n", "print(f\"\\nProportion of classes in y_train:\\n{y_train.value_counts(normalize=True)}\")\n", "print(f\"\\nProportion of classes in y_test:\\n{y_test.value_counts(normalize=True)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Model Training (`fit`)\n", "\n", "Scikit-learn's core API is very consistent:\n", "1. Choose a model class and instantiate it.\n", "2. Call the instance's `fit(X_train, y_train)` method with the training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"--- Model Training Example (Logistic Regression) ---\")\n", "\n", "# 0. (Optional but often needed) Scale the features\n", "scaler = StandardScaler()\n", "X_train_scaled = scaler.fit_transform(X_train) # Fit on training data, then transform\n", "# IMPORTANT: Use the SAME scaler fitted on training data to transform the test data\n", "X_test_scaled = scaler.transform(X_test)\n", "print(\"Features scaled using StandardScaler.\")\n", "\n", "# 1. 
Instantiate the model\n", "log_reg = LogisticRegression(random_state=42, max_iter=200) # Increase max_iter if needed\n", "print(f\"Model instantiated: {log_reg}\")\n", "\n", "# 2. Train the model using scaled training data\n", "log_reg.fit(X_train_scaled, y_train)\n", "print(\"Model training completed (fitted).\")\n", "\n", "# The model is now trained: its internal parameters have been learned and are available as log_reg.coef_ and log_reg.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Model Prediction (`predict`, `predict_proba`)\n", "\n", "A trained model can be used to make predictions on new data (usually the test set).\n", "* `predict(X_test)`: predicts a class label (classification) or a numeric value (regression) for each sample in `X_test`.\n", "* `predict_proba(X_test)`: (classifiers that support probability estimates only) returns, for each sample, the probability of belonging to each class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"--- Model Prediction --- \")\n", "\n", "# Use the trained log_reg model with the scaled test data X_test_scaled\n", "y_pred = log_reg.predict(X_test_scaled)\n", "print(f\"Predicted labels for test set (first 10): {y_pred[:10]}\")\n", "print(f\"Actual labels for test set (first 10): {y_test.values[:10]}\")\n", "\n", "# Get the predicted probabilities\n", "y_pred_proba = log_reg.predict_proba(X_test_scaled)\n", "print(f\"\\nPredicted probabilities for test set (first 5 samples):\\n{y_pred_proba[:5].round(3)}\")\n", "# Each row is one sample and each column is the probability of one class (column order follows log_reg.classes_)\n", "print(f\"Model classes: {log_reg.classes_}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. 
Common Model Examples\n", "\n", "Scikit-learn provides many models behind the same `fit`/`predict` interface. The cell below trains a random forest, a support vector classifier, a linear regression, and K-Means." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"--- Other Model Examples ---\")\n", "\n", "# --- Random Forest Classifier ---\n", "print(\"\\nTraining Random Forest Classifier...\")\n", "rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1) # n_jobs=-1 uses all CPU cores\n", "rf_clf.fit(X_train_scaled, y_train)\n", "y_pred_rf = rf_clf.predict(X_test_scaled)\n", "print(\"Random Forest training complete.\")\n", "# The predictions in y_pred_rf can be scored with the metrics from section 7\n", "\n", "# --- Support Vector Classifier (SVC) ---\n", "print(\"\\nTraining Support Vector Classifier...\")\n", "svc_clf = SVC(probability=True, random_state=42) # probability=True enables predict_proba\n", "svc_clf.fit(X_train_scaled, y_train)\n", "y_pred_svc = svc_clf.predict(X_test_scaled)\n", "print(\"SVC training complete.\")\n", "\n", "# --- Linear Regression (using California Housing) ---\n", "if regression_data is not None:\n", "    print(\"\\n--- Linear Regression Example --- \")\n", "    X_reg = regression_data.data\n", "    y_reg = regression_data.target\n", "\n", "    X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)\n", "\n", "    # Scale regression features\n", "    reg_scaler = StandardScaler()\n", "    X_reg_train_scaled = reg_scaler.fit_transform(X_reg_train)\n", "    X_reg_test_scaled = reg_scaler.transform(X_reg_test)\n", "\n", "    lin_reg = LinearRegression()\n", "    lin_reg.fit(X_reg_train_scaled, y_reg_train)\n", "    y_pred_reg = lin_reg.predict(X_reg_test_scaled)\n", "    print(\"Linear Regression training complete.\")\n", "    print(f\"First 5 regression predictions: {y_pred_reg[:5].round(2)}\")\n", "    print(f\"First 5 actual regression values: {y_reg_test.values[:5].round(2)}\")\n", "else:\n", "    print(\"\\nSkipping Linear Regression example (dataset not loaded).\")\n", "\n", "# --- K-Means Clustering (Unsupervised) ---\n", "print(\"\\n--- K-Means Clustering 
Example --- \")\n", "# Use the Iris features without labels (pretend the classes are unknown)\n", "X_iris_unscaled = X # Use original unscaled data for clustering interpretation, or scale it\n", "kmeans = KMeans(n_clusters=3, random_state=42, n_init=10) # n_clusters usually has to be chosen in advance or found by trying several values; n_init for stability\n", "kmeans.fit(X_iris_unscaled) # Unsupervised learning: only X is needed\n", "cluster_labels = kmeans.labels_\n", "print(f\"K-Means training complete. Cluster labels (first 20): {cluster_labels[:20]}\")\n", "print(f\"Cluster centers (centroids):\\n{kmeans.cluster_centers_.round(2)}\")\n", "\n", "# Compare the clusters found by K-Means with the true classes (cluster IDs are arbitrary and need not match the true label values)\n", "df_cluster_comparison = pd.DataFrame({'TrueLabel': y, 'ClusterLabel': cluster_labels})\n", "print(\"\\nCross-tabulation of True Labels vs K-Means Clusters:\")\n", "print(pd.crosstab(df_cluster_comparison['TrueLabel'], df_cluster_comparison['ClusterLabel']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Model Evaluation\n", "\n", "Evaluating a model's performance on the test set is essential.\n", "\n", "**Common classification metrics:**\n", "* `accuracy_score`: accuracy (fraction of correct predictions).\n", "* `precision_score`: precision (of the samples predicted positive, the fraction that are actually positive).\n", "* `recall_score`: recall (of the samples that are actually positive, the fraction correctly predicted positive).\n", "* `f1_score`: F1 score (harmonic mean of precision and recall).\n", "* `confusion_matrix`: confusion matrix.\n", "* `classification_report`: a text report with precision, recall, and F1 score.\n", "* `roc_auc_score`: area under the ROC curve (binary problems, or multiclass via OvR/OvO).\n", "\n", "**Common regression metrics:**\n", "* `mean_squared_error` (MSE): mean squared error.\n", "* `mean_absolute_error` (MAE): mean absolute error (not imported in this notebook; available from `sklearn.metrics`).\n", "* `r2_score`: R-squared (coefficient of determination), the proportion of the target's variance explained by the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"--- Model Evaluation ---\")\n", "\n", "# --- Classification Evaluation (using Logistic Regression results) ---\n", "print(\"\\n--- Logistic Regression Evaluation ---\")\n", "accuracy_logreg = accuracy_score(y_test, y_pred)\n", "# For multiclass precision/recall/f1, need 'average' parameter (e.g., 'macro', 'micro', 'weighted')\n", "precision_logreg = precision_score(y_test, y_pred, average='weighted')\n", "recall_logreg = recall_score(y_test, 
y_pred, average='weighted')\n", "f1_logreg = f1_score(y_test, y_pred, average='weighted')\n", "\n", "print(f\"Accuracy: {accuracy_logreg:.4f}\")\n", "print(f\"Precision (weighted): {precision_logreg:.4f}\")\n", "print(f\"Recall (weighted): {recall_logreg:.4f}\")\n", "print(f\"F1 Score (weighted): {f1_logreg:.4f}\")\n", "\n", "print(\"\\nConfusion Matrix:\")\n", "print(confusion_matrix(y_test, y_pred))\n", "\n", "print(\"\\nClassification Report:\")\n", "print(classification_report(y_test, y_pred, target_names=iris.target_names))\n", "\n", "# ROC AUC (requires probabilities, use OvR for multiclass)\n", "# roc_auc = roc_auc_score(y_test, y_pred_proba, multi_class='ovr', average='weighted')\n", "# print(f\"ROC AUC (weighted OvR): {roc_auc:.4f}\")\n", "\n", "# --- Regression Evaluation (using Linear Regression results) ---\n", "if regression_data is not None:\n", "    print(\"\\n--- Linear Regression Evaluation ---\")\n", "    mse_linreg = mean_squared_error(y_reg_test, y_pred_reg)\n", "    r2_linreg = r2_score(y_reg_test, y_pred_reg)\n", "    print(f\"Mean Squared Error (MSE): {mse_linreg:.4f}\")\n", "    print(f\"R-squared (R2): {r2_linreg:.4f}\")\n", "else:\n", "    print(\"\\nSkipping Regression Evaluation.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. 
Cross-Validation (`cross_val_score`)\n", "\n", "A single train/test split can give unstable results, because the estimate depends on how the data happened to be split. Cross-validation divides the data into several “folds”, uses each fold in turn as the validation set while training on the rest, and averages the scores, which gives a more robust estimate of model performance.\n", "\n", "`cross_val_score` is a convenient function for this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"--- Cross-Validation Example (Random Forest on Iris) ---\")\n", "\n", "# Cross-validate on the full Iris dataset.\n", "# Random forests are insensitive to feature scaling, so the raw features are used here.\n", "# This also avoids fitting a scaler on the full dataset, which would leak validation-fold\n", "# statistics into training and overwrite the scaler fitted in section 4.\n", "X_iris = X\n", "y_iris = y\n", "\n", "rf_clf_cv = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)\n", "\n", "# cv: number of folds (e.g., 5-fold cross-validation)\n", "# scoring: evaluation metric (e.g., 'accuracy', 'f1_weighted', 'neg_mean_squared_error' for regression)\n", "cv_scores = cross_val_score(rf_clf_cv, X_iris, y_iris, cv=5, scoring='accuracy')\n", "\n", "print(f\"Cross-validation scores (accuracy): {cv_scores}\")\n", "print(f\"Average CV accuracy: {cv_scores.mean():.4f}\")\n", "print(f\"Standard deviation of CV accuracy: {cv_scores.std():.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. 
Hyperparameter Tuning (`GridSearchCV`)\n", "\n", "Machine learning models usually have several **hyperparameters** (parameters set before training, such as `n_estimators` for Random Forest or `C` and `gamma` for SVC) that affect model performance.\n", "\n", "`GridSearchCV` systematically tries every combination in a hyperparameter grid, scores each combination with cross-validation, and reports the best-performing setting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"--- Hyperparameter Tuning with GridSearchCV (SVC on Iris) ---\")\n", "\n", "# Define the parameter grid to search\n", "param_grid = {\n", "    'C': [0.1, 1, 10, 100], # Regularization parameter (smaller C means stronger regularization)\n", "    'gamma': [1, 0.1, 0.01, 0.001], # RBF kernel coefficient ('rbf' is the default kernel; ignored by 'linear')\n", "    'kernel': ['rbf', 'linear'] # Try different kernels\n", "}\n", "\n", "# Create the GridSearchCV object\n", "# estimator: the model instance to tune\n", "# param_grid: the parameter grid\n", "# cv: number of cross-validation folds\n", "# scoring: evaluation metric\n", "# n_jobs=-1: use all CPU cores\n", "grid_search = GridSearchCV(SVC(random_state=42),\n", "                           param_grid,\n", "                           cv=5,\n", "                           scoring='accuracy',\n", "                           n_jobs=-1,\n", "                           verbose=1) # verbose controls how much progress output is printed\n", "\n", "# Run the grid search on the training data (cross-validation is performed automatically)\n", "print(\"Starting Grid Search...\")\n", "grid_search.fit(X_train_scaled, y_train) # Fit on the training set\n", "print(\"Grid Search complete.\")\n", "\n", "# Inspect the best parameters and the best score\n", "print(f\"\\nBest parameters found: {grid_search.best_params_}\")\n", "print(f\"Best cross-validation accuracy: {grid_search.best_score_:.4f}\")\n", "\n", "# Get the best model (refit on the whole training set by default)\n", "best_svc = grid_search.best_estimator_\n", "\n", "# Evaluate the best model on the test set\n", "y_pred_best_svc = best_svc.predict(X_test_scaled)\n", "accuracy_best_svc = accuracy_score(y_test, y_pred_best_svc)\n", "print(f\"\\nAccuracy on test set with best SVC: {accuracy_best_svc:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10. 
Pipelines (`Pipeline`)\n", "\n", "A `Pipeline` chains several data-processing steps (such as scaling or feature selection) and a final estimator into one object. This has several benefits:\n", "* **Convenience**: you call `fit` and `predict` once, on the pipeline.\n", "* **Protection against data leakage**: preprocessing steps (such as scaling) are `fit` only on the training data and then applied to the test data, which matters most in cross-validation and grid search.\n", "* **Concise code**: the whole workflow is wrapped in a single object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"--- Pipeline Example (Scaling + SVC) ---\")\n", "\n", "# Create a pipeline with a scaler and a classifier\n", "pipeline = Pipeline([\n", "    ('scaler', StandardScaler()), # Step 1: named 'scaler', uses StandardScaler\n", "    ('svc', SVC(random_state=42)) # Step 2: named 'svc', uses SVC\n", "])\n", "\n", "# Train the pipeline directly on the raw training data\n", "# The pipeline automatically runs:\n", "# 1. scaler.fit_transform(X_train)\n", "# 2. svc.fit(scaled_X_train, y_train)\n", "pipeline.fit(X_train, y_train)\n", "print(\"Pipeline trained.\")\n", "\n", "# Predict with the pipeline\n", "# The pipeline automatically runs:\n", "# 1. scaler.transform(X_test) (using the scaler fitted on the training set)\n", "# 2. svc.predict(scaled_X_test)\n", "y_pred_pipeline = pipeline.predict(X_test)\n", "accuracy_pipeline = accuracy_score(y_test, y_pred_pipeline)\n", "print(f\"Accuracy using pipeline on test set: {accuracy_pipeline:.4f}\")\n", "\n", "# A pipeline can also be combined with GridSearchCV to tune model parameters and even preprocessing parameters\n", "print(\"\\n--- Pipeline with GridSearchCV ---\")\n", "param_grid_pipeline = {\n", "    'svc__C': [0.1, 1, 10], # Note the parameter name format: step name + double underscore + parameter name\n", "    'svc__gamma': [0.1, 0.01]\n", "}\n", "\n", "grid_search_pipe = GridSearchCV(pipeline, param_grid_pipeline, cv=3, scoring='accuracy', verbose=0)\n", "print(\"Starting Grid Search with Pipeline...\")\n", "grid_search_pipe.fit(X_train, y_train) # Fit on original X_train\n", "print(\"Grid Search with Pipeline complete.\")\n", "print(f\"Best params for pipeline: {grid_search_pipe.best_params_}\")\n", "print(f\"Best CV score for pipeline: {grid_search_pipe.best_score_:.4f}\")\n", "\n", "# Evaluate the best pipeline\n", "best_pipeline = grid_search_pipe.best_estimator_\n", "y_pred_best_pipe = best_pipeline.predict(X_test)\n", "accuracy_best_pipe = accuracy_score(y_test, y_pred_best_pipe)\n", "print(f\"Accuracy on 
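test set with best pipeline: {accuracy_best_pipe:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "Scikit-learn is a powerful and comprehensive Python library for classical machine learning tasks. Its consistent API and its rich set of algorithms and tools make it an essential part of the toolkit for data scientists and machine learning engineers.\n", "\n", "**Key takeaways:**\n", "* Follow the basic workflow: **load data -> preprocess -> split -> train -> predict -> evaluate**.\n", "* Get comfortable with preprocessing tools such as `StandardScaler` and `OneHotEncoder`.\n", "* Master the core methods `fit()`, `predict()`, and `transform()`.\n", "* Know the Scikit-learn implementations of common classification, regression, and clustering algorithms.\n", "* Use cross-validation and grid search for robust model evaluation and hyperparameter tuning.\n", "* Use `Pipeline` to simplify the workflow and avoid data leakage.\n", "\n", "Scikit-learn offers much more than this, including feature selection, dimensionality reduction (PCA, t-SNE), more advanced model ensembling, and semi-supervised learning. The official documentation and user guide are strongly recommended for further study." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 5 }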