{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Combining Qdrant and LlamaIndex to keep Q&A systems up-to-date\n", "---\n", "# Introduction\n", "\n", "Have you ever been frustrated with an answer engine that is stuck in the past? As our world rapidly evolves, the accuracy of information changes accordingly. Traditional models can become outdated, providing answers that were once accurate but are now obsolete. The cost of outdated knowledge can be high - misinforming users, impacting decision-making, and ultimately undermining trust in your system.\n", "\n", "Qdrant and LlamaIndex work together seamlessly, continually adapting your engine to the relentless pace of information change. By mastering these tools, you can transform your applications from static knowledge repositories into dynamic, adaptable knowledge machines. Whether you're a seasoned data scientist or an AI enthusiast, join us on this learning journey - the future of answer engines is here, and it's time to embrace it.\n", "\n", "## Learning Outcomes\n", "\n", "In this tutorial, you will learn the following:\n", "\n", "- 1️⃣ How to build a question-answering system using LlamaIndex and Qdrant.\n", " - We will load a news dataset, store it with Qdrant client, and load the data into LlamaIndex.\n", "- 2️⃣ How to keep the QA engine updated and improve the ranking system.\n", " - We will define two postprocessors: Recency and Cohere Rerank; and use these to create various query engines.\n", "- 3️⃣ How to use Node Sources in LlamaIndex to investigate questions and sources on which the answers are based.\n", " - We will query these engines with various questions and compare their responses.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "Main Tools\n", "1. `llama_index`: A powerful tool for building large-scale information retrieval systems. [Learn More](https://gpt-index.readthedocs.io/en/latest/getting_started/starter_example.html) \n", "2. 
`qdrant_client`: A high-performance vector database designed for storing and searching large-scale high-dimensional vectors. In this tutorial, we use Qdrant as our vector storage system.\n", "3. `cohere`: A key reranking service to be used in postprocessing. It takes in a query and a list of texts and returns an ordered array with each text assigned a _new_ relevance score.\n", "4. `OpenAI`: Important for answer generation, as it takes the top few candidates to produce a final answer.\n", "5. `datasets`: Library necessary to import our dataset.\n", "6. `pandas`: Relevant library for data manipulation and analysis.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install Packages\n", "\n", "Before you start, install the required packages with pip:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# !pip install llama-index cohere datasets pandas\n", "# !pip install -U qdrant-client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optional: install Rich to make error messages and stack traces easier to read.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# !pip install 'rich[jupyter]'\n", "%load_ext rich" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import your packages" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "import os\n", "import random\n", "from pathlib import Path\n", "from typing import Any\n", "\n", "import pandas as pd\n", "from datasets import load_dataset\n", "from IPython.display import Markdown, display_markdown\n", "from llama_index import (GPTVectorStoreIndex, ServiceContext,\n", " SimpleDirectoryReader)\n", "from llama_index.indices.postprocessor import FixedRecencyPostprocessor\n", "from llama_index.indices.postprocessor.cohere_rerank import CohereRerank\n", "from llama_index.vector_stores.qdrant import QdrantVectorStore\n", "from 
qdrant_client import QdrantClient\n", "\n", "Path.ls = lambda x: list(x.iterdir())\n", "random.seed(42)  # This is the answer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieve API Keys\n", "\n", "Before you start, you must retrieve two API keys for the following services:\n", "\n", "1. OpenAI key for the LLM. [Link](https://platform.openai.com/account/api-keys) \n", "2. Cohere key for Rerank. [Link](https://dashboard.cohere.ai/api-keys). See also the [Cohere Documentation](https://docs.cohere.com/reference/key). \n", "\n", "By default, this tutorial uses the Qdrant Client, which doesn't require an API key. However, if you choose Qdrant Cloud instead, you need a third key. You can get it from [the Qdrant Cloud main control panel](https://cloud.qdrant.io/). \n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Optional] If you want to use the Qdrant Cloud, please get the Qdrant Cloud API Keys and URL\n" ] } ], "source": [ "def check_environment_keys():\n", "    \"\"\"\n", "    Utility function that checks that you have the necessary keys.\n", "    \"\"\"\n", "    if os.environ.get(\"OPENAI_API_KEY\") is None:\n", "        raise ValueError(\n", "            \"OPENAI_API_KEY cannot be None. Set the key using os.environ['OPENAI_API_KEY']='sk-xxx'\"\n", "        )\n", "    if os.environ.get(\"COHERE_API_KEY\") is None:\n", "        raise ValueError(\n", "            \"COHERE_API_KEY cannot be None. Set the key using os.environ['COHERE_API_KEY']='xxx'\"\n", "        )\n", "    if os.environ.get(\"QDRANT_API_KEY\") is None:\n", "        print(\"[Optional] If you want to use the Qdrant Cloud, please get the Qdrant Cloud API Keys and URL\")\n", "\n", "\n", "check_environment_keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Architecture\n", "\n", "Our answer engine consists of two main parts:\n", "\n", "1. Retrieval - Done with Qdrant\n", "2. 
Synthesis - Done with the OpenAI API\n", "\n", "We will use LlamaIndex to make the Query Engine and Qdrant for our Vector Store. Later, we will add components to keep the engine updated and improve ranking after retrieval.\n", "\n", "The arrows in the diagram show the direction of data flow. The \"Query Engine\" box encapsulates the postprocessing step to indicate that it's a part of the query engine's function. This diagram is meant to provide a high-level understanding of the process and does not include all the details involved.\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load Sample Dataset\n", "\n", "First, we need to load our documents. In this example, we will use the [News Category Dataset v3](https://huggingface.co/datasets/heegyu/news-category-dataset). This dataset contains news articles with various fields like `headline`, `category`, `short_description`, `link`, `authors`, and `date`. Once we load the data, we will reformat it to suit our needs." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Found cached dataset json (/Users/nirantk/.cache/huggingface/datasets/heegyu___json/heegyu--news-category-dataset-a0dcb53f17af71bf/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)\n" ] } ], "source": [ "dataset = load_dataset(\"heegyu/news-category-dataset\", split=\"train\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | \"Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | \"Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |