{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Vector Database 客户端库教程 (FAISS 重点)\n", "\n", "欢迎来到向量数据库客户端库教程!随着嵌入(Embeddings)技术在表示文本、图像等非结构化数据方面的成功,**向量数据库**(或称向量搜索引擎)成为了现代 AI 应用的关键基础设施。它们专门用于存储高维向量,并能高效地执行**相似性搜索 (Similarity Search)**,即根据一个查询向量找出数据库中最相似的向量。\n", "\n", "**为什么需要向量数据库/搜索库?**\n", "\n", "1. **高效检索**: 对于大规模向量数据集(百万甚至十亿级别),传统的线性扫描计算相似度非常慢。向量数据库使用近似最近邻 (Approximate Nearest Neighbor, ANN) 等算法来大幅加速搜索过程。\n", "2. **RAG 的核心**: 在检索增强生成 (RAG) 中,向量数据库用于存储文档块的嵌入向量,并根据用户问题的嵌入向量快速找到相关的文档块。\n", "3. **推荐系统**: 找到与用户或物品嵌入向量相似的其他用户或物品。\n", "4. **图像/音频搜索**: 基于内容的图像或音频检索。\n", "5. **重复数据删除**: 查找相似的文本或图像。\n", "\n", "**本教程重点介绍 `faiss-cpu` (或 `faiss-gpu`)**: \n", "* FAISS (Facebook AI Similarity Search) 是由 Facebook AI 开发的一个非常高效的向量相似性搜索库。\n", "* 它提供了多种索引类型,可以在内存或磁盘上运行。\n", "* 它是一个库,而不是一个数据库服务,通常嵌入在应用程序中或由其他框架(如 LangChain, LlamaIndex)调用。\n", "\n", "**其他流行的向量数据库/库 (简介):**\n", "* **ChromaDB**: 开源,本地优先,易于使用,与 LangChain/LlamaIndex 集成良好。\n", "* **Pinecone**: 商业化的、完全托管的云原生向量数据库服务。\n", "* **Weaviate**: 开源的云原生向量数据库,支持 GraphQL。\n", "* **Milvus**: 开源的云原生向量数据库。\n", "\n", "**本教程将涵盖 FAISS 的核心用法:**\n", "\n", "1. 安装 FAISS\n", "2. 准备示例向量数据 (使用 Sentence Transformers 获取文本嵌入)\n", "3. 构建 FAISS 索引 (如 `IndexFlatL2`, `IndexIVFFlat`)\n", "4. 向索引添加向量\n", "5. 执行相似性搜索 (`index.search()`)\n", "6. (简介) 索引的保存与加载\n", "7. (简介) 与 LangChain/LlamaIndex 的集成" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 安装 FAISS 和 Sentence Transformers\n", "\n", "```bash\n", "# 安装 FAISS CPU 版本\n", "pip install faiss-cpu\n", "\n", "# 或者,如果你有兼容的 GPU 和 CUDA 环境,可以安装 GPU 版本\n", "# pip install faiss-gpu\n", "\n", "# 安装 Sentence Transformers 用于生成文本嵌入\n", "pip install sentence-transformers\n", "\n", "# 其他依赖\n", "pip install numpy pandas matplotlib\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import time\n", "import os\n", "\n", "# 尝试导入 faiss\n", "try:\n", " import faiss\n", " print(f\"FAISS version: {faiss.__version__}\")\n", " faiss_available = True\n", "except ImportError:\n", " print(\"FAISS library not found. Please install faiss-cpu or faiss-gpu.\")\n", " print(\"pip install faiss-cpu\")\n", " faiss_available = False\n", "\n", "# 尝试导入 sentence-transformers\n", "try:\n", " from sentence_transformers import SentenceTransformer\n", " print(\"SentenceTransformer imported successfully.\")\n", " st_available = True\n", " # 加载一个预训练的嵌入模型 (第一次运行时会自动下载)\n", " # 'all-MiniLM-L6-v2' 是一个常用且相对较小的模型\n", " embedding_model = SentenceTransformer('all-MiniLM-L6-v2')\n", " embedding_dim = embedding_model.get_sentence_embedding_dimension()\n", " print(f\"Loaded Sentence Transformer model. Embedding dimension: {embedding_dim}\")\n", "except ImportError:\n", " print(\"sentence-transformers library not found. Please install it: pip install sentence-transformers\")\n", " st_available = False\n", " embedding_model = None\n", " embedding_dim = None\n", "except Exception as e:\n", " print(f\"Error loading Sentence Transformer model: {e}. Check internet connection or model name.\")\n", " st_available = False\n", " embedding_model = None\n", " embedding_dim = None\n", "\n", "# 辅助函数\n", "def time_it(func, *args, **kwargs):\n", " start = time.time()\n", " result = func(*args, **kwargs)\n", " end = time.time()\n", " print(f\"Execution time: {end - start:.4f} seconds\")\n", " return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 准备示例向量数据\n", "\n", "我们需要一组向量来构建索引。这里,我们使用 `sentence-transformers` 将一些示例文本转换为嵌入向量。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"--- Preparing Sample Data and Embeddings ---\")\n", "\n", "documents = [\n", " \"The cat sat on the mat.\",\n", " \"The dog chased the ball.\",\n", " \"Apples and oranges are fruits.\",\n", " \"Paris is the capital of France.\",\n", " \"The weather is sunny today.\",\n", " \"Machine learning models require data.\",\n", " \"A feline rested upon the rug.\", # Similar to sentence 1\n", " \"Information retrieval is key for RAG.\"\n", "]\n", "\n", "doc_embeddings = None\n", "if st_available and embedding_model:\n", " print(f\"Generating embeddings for {len(documents)} documents...\")\n", " # 使用 embedding_model.encode() 获取嵌入向量\n", " doc_embeddings = embedding_model.encode(documents)\n", " \n", " # FAISS 需要 float32 类型的 NumPy 数组\n", " doc_embeddings = np.array(doc_embeddings).astype('float32')\n", " \n", " print(f\"Embeddings generated. Shape: {doc_embeddings.shape}\") # (num_documents, embedding_dim)\n", " # print(\"Sample embedding (first 5 dims of first doc):\")\n", " # print(doc_embeddings[0, :5])\n", "else:\n", " print(\"Sentence Transformer model not available. Cannot generate embeddings.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 构建 FAISS 索引\n", "\n", "FAISS 提供了多种索引类型,适用于不同的数据规模、内存限制和搜索速度/精度权衡。\n", "\n", "* **`faiss.IndexFlatL2`**: \n", " * 最简单的索引,进行精确的暴力 L2 (欧氏距离) 搜索。\n", " * 不需要训练。\n", " * 适用于小型数据集,精度最高,但速度最慢。\n", "* **`faiss.IndexFlatIP`**: 类似 `IndexFlatL2`,但使用内积 (Inner Product) 作为相似度度量(对于归一化向量,等价于余弦相似度)。\n", "* **`faiss.IndexIVFFlat`**: \n", " * 基于倒排文件 (Inverted File) 的索引,是常用的近似最近邻 (ANN) 算法。\n", " * 需要一个**训练 (train)** 阶段,使用一部分数据(或全部数据)来学习数据空间中的聚类中心 (centroids)。\n", " * 搜索时,先找到查询向量最近的几个聚类中心,然后在这些中心对应的列表中进行搜索,从而减少搜索范围。\n", " * 参数:`quantizer` (通常是 `IndexFlatL2`),`d` (向量维度),`nlist` (聚类中心数量)。\n", " * `nprobe` 参数控制搜索时要检查的聚类列表数量(影响速度和精度)。\n", "* **其他索引**: 如 `IndexPQ` (Product Quantization), `IndexHNSWFlat` (Hierarchical Navigable Small World graphs) 等,用于更大规模或需要更高压缩率/速度的场景。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"\\n--- Building FAISS Index --- \")\n", "index_flat_l2 = None\n", "index_ivf_flat = None\n", "\n", "if faiss_available and doc_embeddings is not None:\n", " d = embedding_dim # 向量维度\n", " n_docs = doc_embeddings.shape[0]\n", " \n", " # --- 1. IndexFlatL2 (精确,暴力搜索) --- \n", " print(\"\\nBuilding IndexFlatL2...\")\n", " index_flat_l2 = faiss.IndexFlatL2(d)\n", " print(f\"IndexFlatL2 created. Is trained: {index_flat_l2.is_trained}\")\n", " print(f\"Initial vector count: {index_flat_l2.ntotal}\")\n", " \n", " # --- 2. IndexIVFFlat (近似,需要训练) --- \n", " print(\"\\nBuilding IndexIVFFlat...\")\n", " nlist = 4 # 聚类中心数量 (通常选择 sqrt(N) 到 N/100 之间,这里 N 很小)\n", " quantizer = faiss.IndexFlatL2(d) # 底层使用 L2 距离计算中心\n", " index_ivf_flat = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)\n", " # faiss.METRIC_L2: 使用 L2 距离\n", " # faiss.METRIC_INNER_PRODUCT: 使用内积\n", " \n", " print(f\"IndexIVFFlat created. Is trained: {index_ivf_flat.is_trained}\")\n", " \n", " # 训练索引 (学习聚类中心)\n", " if n_docs >= nlist: # Need enough data to train\n", " print(f\"Training IndexIVFFlat with {n_docs} vectors...\")\n", " time_it(index_ivf_flat.train, doc_embeddings)\n", " print(f\"IndexIVFFlat trained: {index_ivf_flat.is_trained}\")\n", " else:\n", " print(f\"Skipping IndexIVFFlat training (need at least {nlist} vectors, got {n_docs}).\")\n", " index_ivf_flat = None # Cannot use untrained IVF index\n", " \n", "else:\n", " print(\"FAISS or embeddings not available, cannot build index.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. 向索引添加向量\n", "\n", "使用索引对象的 `.add(vectors)` 方法将向量(必须是 float32 NumPy 数组)添加到索引中。\n", "对于某些索引(如 IVF),向量会被分配到最近的聚类中心对应的列表中。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"\\n--- Adding Vectors to Index --- \")\n", "\n", "if index_flat_l2 and doc_embeddings is not None:\n", " print(\"Adding vectors to IndexFlatL2...\")\n", " index_flat_l2.add(doc_embeddings)\n", " print(f\"IndexFlatL2 vector count: {index_flat_l2.ntotal}\")\n", "\n", "if index_ivf_flat and index_ivf_flat.is_trained and doc_embeddings is not None:\n", " print(\"\\nAdding vectors to IndexIVFFlat...\")\n", " index_ivf_flat.add(doc_embeddings)\n", " print(f\"IndexIVFFlat vector count: {index_ivf_flat.ntotal}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 执行相似性搜索 (`index.search()`)\n", "\n", "使用 `.search(query_vectors, k)` 方法来查找与查询向量最相似的 `k` 个向量。\n", "\n", "* `query_vectors`: 一个包含一个或多个查询向量的 float32 NumPy 数组 (形状 `[num_queries, dimension]`)。\n", "* `k`: 要查找的最近邻的数量。\n", "* 返回两个数组:\n", " * `D`: 距离数组 (形状 `[num_queries, k]`),包含查询向量到每个最近邻的距离(L2 距离或负内积)。\n", " * `I`: 索引数组 (形状 `[num_queries, k]`),包含每个最近邻在原始添加数据中的索引。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"\\n--- Performing Similarity Search --- \")\n", "\n", "if (index_flat_l2 or index_ivf_flat) and embedding_model:\n", " query_text = \" feline animal \" # Query related to 'cat sat on the mat'\n", " k = 3 # Find top 3 similar documents\n", " \n", " print(f\"Query text: '{query_text}'\")\n", " query_embedding = embedding_model.encode([query_text]).astype('float32')\n", " print(f\"Query embedding shape: {query_embedding.shape}\")\n", " \n", " # --- Search using IndexFlatL2 --- \n", " if index_flat_l2 and index_flat_l2.ntotal > 0:\n", " print(\"\\nSearching using IndexFlatL2...\")\n", " distances_flat, indices_flat = time_it(index_flat_l2.search, query_embedding, k)\n", " print(f\" Distances (L2): {distances_flat[0]}\")\n", " print(f\" Indices: {indices_flat[0]}\")\n", " print(\" Retrieved Documents (FlatL2):\")\n", " for i, idx in enumerate(indices_flat[0]):\n", " if 0 <= idx < len(documents):\n", " print(f\" {i+1}. Index={idx}, Dist={distances_flat[0][i]:.4f} - '{documents[idx]}'\")\n", " else:\n", " print(f\" {i+1}. Invalid index {idx} found.\")\n", " \n", " # --- Search using IndexIVFFlat --- \n", " if index_ivf_flat and index_ivf_flat.is_trained and index_ivf_flat.ntotal > 0:\n", " index_ivf_flat.nprobe = 2 # Search in the 2 nearest clusters (adjust for speed/accuracy)\n", " print(f\"\\nSearching using IndexIVFFlat (nprobe={index_ivf_flat.nprobe})...\")\n", " distances_ivf, indices_ivf = time_it(index_ivf_flat.search, query_embedding, k)\n", " print(f\" Distances (L2): {distances_ivf[0]}\")\n", " print(f\" Indices: {indices_ivf[0]}\")\n", " print(\" Retrieved Documents (IVFFlat):\")\n", " # Note: Indices might be -1 if fewer than k results are found in probed clusters\n", " for i, idx in enumerate(indices_ivf[0]):\n", " if idx != -1 and 0 <= idx < len(documents):\n", " print(f\" {i+1}. Index={idx}, Dist={distances_ivf[0][i]:.4f} - '{documents[idx]}'\")\n", " elif idx == -1:\n", " print(f\" {i+1}. No result found in probed clusters for this rank.\")\n", " else:\n", " print(f\" {i+1}. Invalid index {idx} found.\")\n", "else:\n", " print(\"Index or embedding model not available, skipping search example.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. (简介) 索引的保存与加载\n", "\n", "FAISS 索引可以保存到磁盘,以便后续重用,避免重新构建。\n", "\n", "```python\n", "# import faiss\n", "\n", "# index_to_save = index_flat_l2 # Or index_ivf_flat\n", "# index_filename = \"my_faiss_index.index\"\n", "\n", "# # --- 保存 --- \n", "# if index_to_save:\n", "# print(f\"Saving index to {index_filename}...\")\n", "# faiss.write_index(index_to_save, index_filename)\n", "# print(\"Index saved.\")\n", "\n", "# # --- 加载 --- \n", "# if os.path.exists(index_filename):\n", "# print(f\"\\nLoading index from {index_filename}...\")\n", "# loaded_index = faiss.read_index(index_filename)\n", "# print(f\"Index loaded. Vector count: {loaded_index.ntotal}\")\n", " # Ready to use loaded_index.search(...)\n", "# os.remove(index_filename) # Cleanup\n", "# else:\n", "# print(\"Index file not found for loading.\")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. (简介) 与 LangChain/LlamaIndex 的集成\n", "\n", "FAISS 可以作为 LangChain 和 LlamaIndex 的向量存储后端。\n", "\n", "**LangChain 示例:**\n", "```python\n", "# from langchain.vectorstores import FAISS\n", "# from langchain.embeddings import OpenAIEmbeddings # Or other embeddings\n", "# from langchain.docstore.document import Document\n", "\n", "# # Assuming 'split_docs' is a list of LangChain Document objects\n", "# embeddings = OpenAIEmbeddings()\n", "# vectorstore_lc = FAISS.from_documents(split_docs, embeddings)\n", "# retriever_lc = vectorstore_lc.as_retriever()\n", "# results = retriever_lc.get_relevant_documents(\"your query\")\n", "# vectorstore_lc.save_local(\"faiss_langchain_index\")\n", "# loaded_vectorstore_lc = FAISS.load_local(\"faiss_langchain_index\", embeddings)\n", "```\n", "\n", "**LlamaIndex 示例:**\n", "```python\n", "# from llama_index.vector_stores import FaissVectorStore\n", "# from llama_index import VectorStoreIndex, StorageContext\n", "# import faiss # Need faiss installed\n", "\n", "# # Assuming 'nodes' is a list of LlamaIndex Node objects\n", "# # Assuming Settings.embed_model is configured\n", "\n", "# # 1. Create FAISS index directly\n", "# d = Settings.embed_model.embed_dim \n", "# faiss_index = faiss.IndexFlatL2(d)\n", "\n", "# # 2. Create FaissVectorStore wrapper\n", "# vector_store_li = FaissVectorStore(faiss_index=faiss_index)\n", "\n", "# # 3. Create StorageContext and build index\n", "# storage_context = StorageContext.from_defaults(vector_store=vector_store_li)\n", "# index_li = VectorStoreIndex(nodes, storage_context=storage_context)\n", "\n", "# # Or, let LlamaIndex handle FAISS creation internally during VectorStoreIndex build\n", "# # (This might happen if you don't explicitly provide a vector_store)\n", "\n", "# # Persisting is usually done via the StorageContext\n", "# # index_li.storage_context.persist(persist_dir=\"./faiss_llamaindex_index\")\n", "\n", "# # Loading\n", "# # from llama_index import load_index_from_storage, StorageContext\n", "# # storage_context_load = StorageContext.from_defaults(persist_dir=\"./faiss_llamaindex_index\")\n", "# # loaded_index_li = load_index_from_storage(storage_context_load)\n", "```\n", "通常,使用 LangChain 或 LlamaIndex 提供的封装会更方便,它们会处理好索引构建、添加和搜索的细节。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 总结\n", "\n", "FAISS 是一个用于高效向量相似性搜索的强大库,是构建现代 AI 应用(尤其是 RAG 系统)的关键组件。\n", "\n", "**关键要点:**\n", "* 向量数据库/索引库用于存储和快速检索高维嵌入向量。\n", "* FAISS 提供了多种索引类型,需要在速度、内存和精度之间进行权衡 (`IndexFlatL2` 精确但慢, `IndexIVFFlat` 等 ANN 索引更快但近似)。\n", "* 核心操作包括构建索引 (`faiss.Index...`)、训练索引 (如果需要)、添加向量 (`.add()`) 和搜索 (`.search()`)。\n", "* FAISS 需要 NumPy float32 数组作为输入。\n", "* 通常与文本嵌入模型 (如 Sentence Transformers) 结合使用。\n", "* 可以作为 LangChain 和 LlamaIndex 的向量存储后端,通常由这些框架封装其使用细节。\n", "\n", "理解向量搜索的基本原理以及 FAISS 等库的核心用法,对于构建和优化基于嵌入的 AI 应用非常有帮助。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 5 }