{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Vector Database 客户端库教程 (FAISS 重点)\n",
    "\n",
    "欢迎来到向量数据库客户端库教程！随着嵌入（Embeddings）技术在表示文本、图像等非结构化数据方面的成功，**向量数据库**（或称向量搜索引擎）成为了现代 AI 应用的关键基础设施。它们专门用于存储高维向量，并能高效地执行**相似性搜索 (Similarity Search)**，即根据一个查询向量找出数据库中最相似的向量。\n",
    "\n",
    "**为什么需要向量数据库/搜索库？**\n",
    "\n",
    "1.  **高效检索**: 对于大规模向量数据集（百万甚至十亿级别），传统的线性扫描计算相似度非常慢。向量数据库使用近似最近邻 (Approximate Nearest Neighbor, ANN) 等算法来大幅加速搜索过程。\n",
    "2.  **RAG 的核心**: 在检索增强生成 (RAG) 中，向量数据库用于存储文档块的嵌入向量，并根据用户问题的嵌入向量快速找到相关的文档块。\n",
    "3.  **推荐系统**: 找到与用户或物品嵌入向量相似的其他用户或物品。\n",
    "4.  **图像/音频搜索**: 基于内容的图像或音频检索。\n",
    "5.  **重复数据删除**: 查找相似的文本或图像。\n",
    "\n",
    "**本教程重点介绍 `faiss-cpu` (或 `faiss-gpu`)**: \n",
    "*   FAISS (Facebook AI Similarity Search) 是由 Facebook AI 开发的一个非常高效的向量相似性搜索库。\n",
    "*   它提供了多种索引类型，可以在内存或磁盘上运行。\n",
    "*   它是一个库，而不是一个数据库服务，通常嵌入在应用程序中或由其他框架（如 LangChain, LlamaIndex）调用。\n",
    "\n",
    "**其他流行的向量数据库/库 (简介):**\n",
    "*   **ChromaDB**: 开源，本地优先，易于使用，与 LangChain/LlamaIndex 集成良好。\n",
    "*   **Pinecone**: 商业化的、完全托管的云原生向量数据库服务。\n",
    "*   **Weaviate**: 开源的云原生向量数据库，支持 GraphQL。\n",
    "*   **Milvus**: 开源的云原生向量数据库。\n",
    "\n",
    "**本教程将涵盖 FAISS 的核心用法：**\n",
    "\n",
    "1.  安装 FAISS\n",
    "2.  准备示例向量数据 (使用 Sentence Transformers 获取文本嵌入)\n",
    "3.  构建 FAISS 索引 (如 `IndexFlatL2`, `IndexIVFFlat`)\n",
    "4.  向索引添加向量\n",
    "5.  执行相似性搜索 (`index.search()`)\n",
    "6.  (简介) 索引的保存与加载\n",
    "7.  (简介) 与 LangChain/LlamaIndex 的集成"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. 安装 FAISS 和 Sentence Transformers\n",
    "\n",
    "```bash\n",
    "# 安装 FAISS CPU 版本\n",
    "pip install faiss-cpu\n",
    "\n",
    "# 或者，如果你有兼容的 GPU 和 CUDA 环境，可以安装 GPU 版本\n",
    "# pip install faiss-gpu\n",
    "\n",
    "# 安装 Sentence Transformers 用于生成文本嵌入\n",
    "pip install sentence-transformers\n",
    "\n",
    "# 其他依赖\n",
    "pip install numpy pandas matplotlib\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import time\n",
    "import os\n",
    "\n",
    "# 尝试导入 faiss\n",
    "try:\n",
    "    import faiss\n",
    "    print(f\"FAISS version: {faiss.__version__}\")\n",
    "    faiss_available = True\n",
    "except ImportError:\n",
    "    print(\"FAISS library not found. Please install faiss-cpu or faiss-gpu.\")\n",
    "    print(\"pip install faiss-cpu\")\n",
    "    faiss_available = False\n",
    "\n",
    "# 尝试导入 sentence-transformers\n",
    "try:\n",
    "    from sentence_transformers import SentenceTransformer\n",
    "    print(\"SentenceTransformer imported successfully.\")\n",
    "    st_available = True\n",
    "    # 加载一个预训练的嵌入模型 (第一次运行时会自动下载)\n",
    "    # 'all-MiniLM-L6-v2' 是一个常用且相对较小的模型\n",
    "    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')\n",
    "    embedding_dim = embedding_model.get_sentence_embedding_dimension()\n",
    "    print(f\"Loaded Sentence Transformer model. Embedding dimension: {embedding_dim}\")\n",
    "except ImportError:\n",
    "    print(\"sentence-transformers library not found. Please install it: pip install sentence-transformers\")\n",
    "    st_available = False\n",
    "    embedding_model = None\n",
    "    embedding_dim = None\n",
    "except Exception as e:\n",
    "    print(f\"Error loading Sentence Transformer model: {e}. Check internet connection or model name.\")\n",
    "    st_available = False\n",
    "    embedding_model = None\n",
    "    embedding_dim = None\n",
    "\n",
    "# 辅助函数\n",
    "def time_it(func, *args, **kwargs):\n",
    "    start = time.time()\n",
    "    result = func(*args, **kwargs)\n",
    "    end = time.time()\n",
    "    print(f\"Execution time: {end - start:.4f} seconds\")\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. 准备示例向量数据\n",
    "\n",
    "我们需要一组向量来构建索引。这里，我们使用 `sentence-transformers` 将一些示例文本转换为嵌入向量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"--- Preparing Sample Data and Embeddings ---\")\n",
    "\n",
    "documents = [\n",
    "    \"The cat sat on the mat.\",\n",
    "    \"The dog chased the ball.\",\n",
    "    \"Apples and oranges are fruits.\",\n",
    "    \"Paris is the capital of France.\",\n",
    "    \"The weather is sunny today.\",\n",
    "    \"Machine learning models require data.\",\n",
    "    \"A feline rested upon the rug.\", # Similar to sentence 1\n",
    "    \"Information retrieval is key for RAG.\"\n",
    "]\n",
    "\n",
    "doc_embeddings = None\n",
    "if st_available and embedding_model:\n",
    "    print(f\"Generating embeddings for {len(documents)} documents...\")\n",
    "    # 使用 embedding_model.encode() 获取嵌入向量\n",
    "    doc_embeddings = embedding_model.encode(documents)\n",
    "    \n",
    "    # FAISS 需要 float32 类型的 NumPy 数组\n",
    "    doc_embeddings = np.array(doc_embeddings).astype('float32')\n",
    "    \n",
    "    print(f\"Embeddings generated. Shape: {doc_embeddings.shape}\") # (num_documents, embedding_dim)\n",
    "    # print(\"Sample embedding (first 5 dims of first doc):\")\n",
    "    # print(doc_embeddings[0, :5])\n",
    "else:\n",
    "    print(\"Sentence Transformer model not available. Cannot generate embeddings.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. 构建 FAISS 索引\n",
    "\n",
    "FAISS 提供了多种索引类型，适用于不同的数据规模、内存限制和搜索速度/精度权衡。\n",
    "\n",
    "*   **`faiss.IndexFlatL2`**: \n",
    "    *   最简单的索引，进行精确的暴力 L2 (欧氏距离) 搜索。\n",
    "    *   不需要训练。\n",
    "    *   适用于小型数据集，精度最高，但速度最慢。\n",
    "*   **`faiss.IndexFlatIP`**: 类似 `IndexFlatL2`，但使用内积 (Inner Product) 作为相似度度量（对于归一化向量，等价于余弦相似度）。\n",
    "*   **`faiss.IndexIVFFlat`**: \n",
    "    *   基于倒排文件 (Inverted File) 的索引，是常用的近似最近邻 (ANN) 算法。\n",
    "    *   需要一个**训练 (train)** 阶段，使用一部分数据（或全部数据）来学习数据空间中的聚类中心 (centroids)。\n",
    "    *   搜索时，先找到查询向量最近的几个聚类中心，然后在这些中心对应的列表中进行搜索，从而减少搜索范围。\n",
    "    *   参数：`quantizer` (通常是 `IndexFlatL2`)，`d` (向量维度)，`nlist` (聚类中心数量)。\n",
    "    *   `nprobe` 参数控制搜索时要检查的聚类列表数量（影响速度和精度）。\n",
    "*   **其他索引**: 如 `IndexPQ` (Product Quantization), `IndexHNSWFlat` (Hierarchical Navigable Small World graphs) 等，用于更大规模或需要更高压缩率/速度的场景。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- Building FAISS Index --- \")\n",
    "index_flat_l2 = None\n",
    "index_ivf_flat = None\n",
    "\n",
    "if faiss_available and doc_embeddings is not None:\n",
    "    d = embedding_dim # 向量维度\n",
    "    n_docs = doc_embeddings.shape[0]\n",
    "    \n",
    "    # --- 1. IndexFlatL2 (精确，暴力搜索) --- \n",
    "    print(\"\\nBuilding IndexFlatL2...\")\n",
    "    index_flat_l2 = faiss.IndexFlatL2(d)\n",
    "    print(f\"IndexFlatL2 created. Is trained: {index_flat_l2.is_trained}\")\n",
    "    print(f\"Initial vector count: {index_flat_l2.ntotal}\")\n",
    "    \n",
    "    # --- 2. IndexIVFFlat (近似，需要训练) --- \n",
    "    print(\"\\nBuilding IndexIVFFlat...\")\n",
    "    nlist = 4 # 聚类中心数量 (通常选择 sqrt(N) 到 N/100 之间，这里 N 很小)\n",
    "    quantizer = faiss.IndexFlatL2(d) # 底层使用 L2 距离计算中心\n",
    "    index_ivf_flat = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)\n",
    "    # faiss.METRIC_L2: 使用 L2 距离\n",
    "    # faiss.METRIC_INNER_PRODUCT: 使用内积\n",
    "    \n",
    "    print(f\"IndexIVFFlat created. Is trained: {index_ivf_flat.is_trained}\")\n",
    "    \n",
    "    # 训练索引 (学习聚类中心)\n",
    "    if n_docs >= nlist: # Need enough data to train\n",
    "      print(f\"Training IndexIVFFlat with {n_docs} vectors...\")\n",
    "      time_it(index_ivf_flat.train, doc_embeddings)\n",
    "      print(f\"IndexIVFFlat trained: {index_ivf_flat.is_trained}\")\n",
    "    else:\n",
    "      print(f\"Skipping IndexIVFFlat training (need at least {nlist} vectors, got {n_docs}).\")\n",
    "      index_ivf_flat = None # Cannot use untrained IVF index\n",
    "      \n",
    "else:\n",
    "    print(\"FAISS or embeddings not available, cannot build index.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. 向索引添加向量\n",
    "\n",
    "使用索引对象的 `.add(vectors)` 方法将向量（必须是 float32 NumPy 数组）添加到索引中。\n",
    "对于某些索引（如 IVF），向量会被分配到最近的聚类中心对应的列表中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- Adding Vectors to Index --- \")\n",
    "\n",
    "if index_flat_l2 and doc_embeddings is not None:\n",
    "    print(\"Adding vectors to IndexFlatL2...\")\n",
    "    index_flat_l2.add(doc_embeddings)\n",
    "    print(f\"IndexFlatL2 vector count: {index_flat_l2.ntotal}\")\n",
    "\n",
    "if index_ivf_flat and index_ivf_flat.is_trained and doc_embeddings is not None:\n",
    "    print(\"\\nAdding vectors to IndexIVFFlat...\")\n",
    "    index_ivf_flat.add(doc_embeddings)\n",
    "    print(f\"IndexIVFFlat vector count: {index_ivf_flat.ntotal}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. 执行相似性搜索 (`index.search()`)\n",
    "\n",
    "使用 `.search(query_vectors, k)` 方法来查找与查询向量最相似的 `k` 个向量。\n",
    "\n",
    "*   `query_vectors`: 一个包含一个或多个查询向量的 float32 NumPy 数组 (形状 `[num_queries, dimension]`)。\n",
    "*   `k`: 要查找的最近邻的数量。\n",
    "*   返回两个数组：\n",
    "    *   `D`: 距离数组 (形状 `[num_queries, k]`)，包含查询向量到每个最近邻的距离（L2 距离或负内积）。\n",
    "    *   `I`: 索引数组 (形状 `[num_queries, k]`)，包含每个最近邻在原始添加数据中的索引。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- Performing Similarity Search --- \")\n",
    "\n",
    "if (index_flat_l2 or index_ivf_flat) and embedding_model:\n",
    "    query_text = \" feline animal \" # Query related to 'cat sat on the mat'\n",
    "    k = 3 # Find top 3 similar documents\n",
    "    \n",
    "    print(f\"Query text: '{query_text}'\")\n",
    "    query_embedding = embedding_model.encode([query_text]).astype('float32')\n",
    "    print(f\"Query embedding shape: {query_embedding.shape}\")\n",
    "    \n",
    "    # --- Search using IndexFlatL2 --- \n",
    "    if index_flat_l2 and index_flat_l2.ntotal > 0:\n",
    "        print(\"\\nSearching using IndexFlatL2...\")\n",
    "        distances_flat, indices_flat = time_it(index_flat_l2.search, query_embedding, k)\n",
    "        print(f\"  Distances (L2): {distances_flat[0]}\")\n",
    "        print(f\"  Indices: {indices_flat[0]}\")\n",
    "        print(\"  Retrieved Documents (FlatL2):\")\n",
    "        for i, idx in enumerate(indices_flat[0]):\n",
    "            if 0 <= idx < len(documents):\n",
    "                 print(f\"    {i+1}. Index={idx}, Dist={distances_flat[0][i]:.4f} - '{documents[idx]}'\")\n",
    "            else:\n",
    "                 print(f\"    {i+1}. Invalid index {idx} found.\")\n",
    "                 \n",
    "    # --- Search using IndexIVFFlat --- \n",
    "    if index_ivf_flat and index_ivf_flat.is_trained and index_ivf_flat.ntotal > 0:\n",
    "        index_ivf_flat.nprobe = 2 # Search in the 2 nearest clusters (adjust for speed/accuracy)\n",
    "        print(f\"\\nSearching using IndexIVFFlat (nprobe={index_ivf_flat.nprobe})...\")\n",
    "        distances_ivf, indices_ivf = time_it(index_ivf_flat.search, query_embedding, k)\n",
    "        print(f\"  Distances (L2): {distances_ivf[0]}\")\n",
    "        print(f\"  Indices: {indices_ivf[0]}\")\n",
    "        print(\"  Retrieved Documents (IVFFlat):\")\n",
    "        # Note: Indices might be -1 if fewer than k results are found in probed clusters\n",
    "        for i, idx in enumerate(indices_ivf[0]):\n",
    "            if idx != -1 and 0 <= idx < len(documents):\n",
    "                 print(f\"    {i+1}. Index={idx}, Dist={distances_ivf[0][i]:.4f} - '{documents[idx]}'\")\n",
    "            elif idx == -1:\n",
    "                 print(f\"    {i+1}. No result found in probed clusters for this rank.\")\n",
    "            else:\n",
    "                 print(f\"    {i+1}. Invalid index {idx} found.\")\n",
    "else:\n",
    "    print(\"Index or embedding model not available, skipping search example.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. (简介) 索引的保存与加载\n",
    "\n",
    "FAISS 索引可以保存到磁盘，以便后续重用，避免重新构建。\n",
    "\n",
    "```python\n",
    "# import faiss\n",
    "\n",
    "# index_to_save = index_flat_l2 # Or index_ivf_flat\n",
    "# index_filename = \"my_faiss_index.index\"\n",
    "\n",
    "# # --- 保存 --- \n",
    "# if index_to_save:\n",
    "#     print(f\"Saving index to {index_filename}...\")\n",
    "#     faiss.write_index(index_to_save, index_filename)\n",
    "#     print(\"Index saved.\")\n",
    "\n",
    "# # --- 加载 --- \n",
    "# if os.path.exists(index_filename):\n",
    "#     print(f\"\\nLoading index from {index_filename}...\")\n",
    "#     loaded_index = faiss.read_index(index_filename)\n",
    "#     print(f\"Index loaded. Vector count: {loaded_index.ntotal}\")\n",
    "      # Ready to use loaded_index.search(...)\n",
    "#     os.remove(index_filename) # Cleanup\n",
    "# else:\n",
    "#     print(\"Index file not found for loading.\")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. (简介) 与 LangChain/LlamaIndex 的集成\n",
    "\n",
    "FAISS 可以作为 LangChain 和 LlamaIndex 的向量存储后端。\n",
    "\n",
    "**LangChain 示例:**\n",
    "```python\n",
    "# from langchain.vectorstores import FAISS\n",
    "# from langchain.embeddings import OpenAIEmbeddings # Or other embeddings\n",
    "# from langchain.docstore.document import Document\n",
    "\n",
    "# # Assuming 'split_docs' is a list of LangChain Document objects\n",
    "# embeddings = OpenAIEmbeddings()\n",
    "# vectorstore_lc = FAISS.from_documents(split_docs, embeddings)\n",
    "# retriever_lc = vectorstore_lc.as_retriever()\n",
    "# results = retriever_lc.get_relevant_documents(\"your query\")\n",
    "# vectorstore_lc.save_local(\"faiss_langchain_index\")\n",
    "# loaded_vectorstore_lc = FAISS.load_local(\"faiss_langchain_index\", embeddings)\n",
    "```\n",
    "\n",
    "**LlamaIndex 示例:**\n",
    "```python\n",
    "# from llama_index.vector_stores import FaissVectorStore\n",
    "# from llama_index import VectorStoreIndex, StorageContext\n",
    "# import faiss # Need faiss installed\n",
    "\n",
    "# # Assuming 'nodes' is a list of LlamaIndex Node objects\n",
    "# # Assuming Settings.embed_model is configured\n",
    "\n",
    "# # 1. Create FAISS index directly\n",
    "# d = Settings.embed_model.embed_dim \n",
    "# faiss_index = faiss.IndexFlatL2(d)\n",
    "\n",
    "# # 2. Create FaissVectorStore wrapper\n",
    "# vector_store_li = FaissVectorStore(faiss_index=faiss_index)\n",
    "\n",
    "# # 3. Create StorageContext and build index\n",
    "# storage_context = StorageContext.from_defaults(vector_store=vector_store_li)\n",
    "# index_li = VectorStoreIndex(nodes, storage_context=storage_context)\n",
    "\n",
    "# # Or, let LlamaIndex handle FAISS creation internally during VectorStoreIndex build\n",
    "# # (This might happen if you don't explicitly provide a vector_store)\n",
    "\n",
    "# # Persisting is usually done via the StorageContext\n",
    "# # index_li.storage_context.persist(persist_dir=\"./faiss_llamaindex_index\")\n",
    "\n",
    "# # Loading\n",
    "# # from llama_index import load_index_from_storage, StorageContext\n",
    "# # storage_context_load = StorageContext.from_defaults(persist_dir=\"./faiss_llamaindex_index\")\n",
    "# # loaded_index_li = load_index_from_storage(storage_context_load)\n",
    "```\n",
    "通常，使用 LangChain 或 LlamaIndex 提供的封装会更方便，它们会处理好索引构建、添加和搜索的细节。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 总结\n",
    "\n",
    "FAISS 是一个用于高效向量相似性搜索的强大库，是构建现代 AI 应用（尤其是 RAG 系统）的关键组件。\n",
    "\n",
    "**关键要点：**\n",
    "*   向量数据库/索引库用于存储和快速检索高维嵌入向量。\n",
    "*   FAISS 提供了多种索引类型，需要在速度、内存和精度之间进行权衡 (`IndexFlatL2` 精确但慢, `IndexIVFFlat` 等 ANN 索引更快但近似)。\n",
    "*   核心操作包括构建索引 (`faiss.Index...`)、训练索引 (如果需要)、添加向量 (`.add()`) 和搜索 (`.search()`)。\n",
    "*   FAISS 需要 NumPy float32 数组作为输入。\n",
    "*   通常与文本嵌入模型 (如 Sentence Transformers) 结合使用。\n",
    "*   可以作为 LangChain 和 LlamaIndex 的向量存储后端，通常由这些框架封装其使用细节。\n",
    "\n",
    "理解向量搜索的基本原理以及 FAISS 等库的核心用法，对于构建和优化基于嵌入的 AI 应用非常有帮助。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 5
}