
@umutseven92
Last active April 26, 2022 10:14

Revisions

  1. umutseven92 revised this gist Apr 26, 2022. 1 changed file with 6 additions and 2 deletions.
    8 changes: 6 additions & 2 deletions search.md
    @@ -22,15 +22,19 @@

    ## Metrics

    * `Precision` is the fraction of relevant search results among the total search results.
    * If the engine returns 30 results, out of which 15 are relevant, then precision is 15/30.
    * `Recall` is the ability of a search engine to retrieve all the relevant results from the corpus.
    * If there are 50 documents that are relevant, but the search engine returns only 30 of those, then the recall is 30 out of 50.
    * `F score` (specifically `F1 score`) is a single number that represents both precision and recall (their harmonic mean).
    * `Mean Reciprocal Rank (MRR)` measures how close to the top the first relevant (e.g. clicked) result appears.
    * It gives a score of 1 if the first result is clicked, ½ if the second is, ⅓ for the third, and so on, averaged over all queries.
    * `Mean Average Precision (MAP)` allows you to quantify the relevance of the top returned results.
    * `Normalized Discounted Cumulative Gain (nDCG)` is like MAP, but weights the relevance of each result by its position (see the sketch below).
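
A minimal sketch of these metrics over a single toy result list (document IDs and relevance judgements are invented for illustration; MRR and MAP would average the per-query values over many queries):

```python
# Toy relevance data; IDs and judgements are made up.
relevant = {1, 3, 5, 8}           # ground-truth relevant document IDs
returned = [3, 7, 1, 9, 5]        # ranked results returned by the engine

true_positives = len(relevant & set(returned))
precision = true_positives / len(returned)    # relevant results / returned results
recall = true_positives / len(relevant)       # relevant results / all relevant documents
f1 = 2 * precision * recall / (precision + recall)

# Reciprocal rank: 1 / rank of the first relevant result (averaged over queries -> MRR).
reciprocal_rank = next(
    (1 / (rank + 1) for rank, doc in enumerate(returned) if doc in relevant), 0.0
)

# Average precision: mean of precision@k at every rank k that holds a relevant
# document (averaged over queries -> MAP).
hits, precisions_at_hits = 0, []
for rank, doc in enumerate(returned):
    if doc in relevant:
        hits += 1
        precisions_at_hits.append(hits / (rank + 1))
average_precision = sum(precisions_at_hits) / len(relevant)

print(precision, recall, f1, reciprocal_rank, average_precision)
```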

    ## Misc

    * Corpus is the complete set of documents that need to be searched.
    * Precision is true positives / selected elements.
    * Recall is true positives / all positives.
    * Shingles are word n-grams. Given a stream of tokens, the shingle filter will create new tokens by concatenating adjacent terms.
    * They give you the ability to pre-bake phrase matching. By building phrases into the index, you can avoid creating phrases at query time and save some processing time/speed.
    * The downside is that you have larger indices and potentially more memory usage.
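
A minimal sketch of shingling via the `_analyze` API, assuming a local cluster at `localhost:9200` and using the Python `requests` library (the filter settings are illustrative):

```python
import requests

# Tokenize a phrase and emit 2-word shingles alongside the original unigrams.
body = {
    "tokenizer": "standard",
    "filter": [{"type": "shingle", "min_shingle_size": 2, "max_shingle_size": 2}],
    "text": "quick brown fox",
}
response = requests.post("http://localhost:9200/_analyze", json=body)
print([t["token"] for t in response.json()["tokens"]])
# Roughly: ['quick', 'quick brown', 'brown', 'brown fox', 'fox']
```
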
  2. umutseven92 revised this gist Oct 28, 2021. 1 changed file with 6 additions and 0 deletions.
    6 changes: 6 additions & 0 deletions search.md
    @@ -14,6 +14,12 @@
    * Words with similar meanings get similar vectors and can be projected onto a 2-dimensional graph for easy visualization.
    * These vectors are trained by ensuring words with similar meanings are physically near each other.

    ## Query

    * Query expansion is where the query is broadened by introducing additional tokens or phrases.
    * Query relaxation is where tokens are ignored to make the query less restrictive and increase recall.
    * Query translation is where a text-translation paradigm is applied to convert tail (i.e., unpopular, 20th percentile) queries into head queries that account for the bulk of traffic, helping to increase recall.
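
A toy sketch of expansion and relaxation on a tokenized query (the synonym table and the "drop trailing tokens" relaxation rule are invented for illustration, not how any particular engine implements them):

```python
# Hypothetical synonym table used for expansion.
SYNONYMS = {"sofa": ["couch", "settee"]}

def expand(tokens):
    """Broaden the query by adding related tokens (increases recall)."""
    extra = [syn for token in tokens for syn in SYNONYMS.get(token, [])]
    return tokens + extra

def relax(tokens, keep=2):
    """Make the query less restrictive by dropping trailing tokens (increases recall)."""
    return tokens[:keep]

query = ["red", "leather", "sofa"]
print(expand(query))  # ['red', 'leather', 'sofa', 'couch', 'settee']
print(relax(query))   # ['red', 'leather']
```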

    ## Metrics

    * `F score` (specifically `F1 score`) is a single number that represents both precision and recall.
  3. umutseven92 revised this gist Oct 27, 2021. 1 changed file with 0 additions and 139 deletions.
    139 changes: 0 additions & 139 deletions search.md
    @@ -1,144 +1,5 @@
    # Search

    ## Elasticsearch

    ### Architecture

    * Nodes join a cluster (named `elasticsearch` by default).
    * One node becomes the master node.
    * Each index data is divided into shards.
    * 5 shards by default.
    * Due to how routing works, the number of shards cannot be increased later; you would need to create a new index.
    * Internally, an index is a logical namespace that points to one or more shards.
    * Each node contains these shards.
    * When you index a document, Elasticsearch will determine which shard the document should be routed to for indexing.
    * Same routing will be used when Elasticsearch is retrieving the document.
    * You can enable custom routing, mainly for performance reasons.
    * A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards.
    * Each shard is a Lucene index.
    * A Lucene index can be thought of as a self-contained search engine, with its own syntax, different from Elasticsearch's.
    * Each shard gets replicated, and stored in another node. These are replica shards.
    * The primary shard is responsible for keeping replica shards updated.
    * Replica shards are used for both availability and searching.
    * Lucene index contains segments.
    * A segment is an inverted index.
    * A search in a shard will search each segment in turn, then combine their results into the final results for that shard.
    * These segments are immutable.
    * When data is written, it is published into segments.
    * While you are indexing documents, Elasticsearch collects them in memory (and in the transaction log, for safety) then every second or so, writes a new small segment to disk, and refreshes the search.
    * A Refresh is what makes the data in the new segment visible to search.
    * A Refresh is done every second, but only on indices that have received one search request or more in the last 30 seconds.
    * A Flush is what actually writes the data into disk.
    * A Flush triggers the Lucene commit and empties the transaction log.
    * Flush happens automatically depending on how many operations get added to the transaction log, how big they are, and when the last flush happened.
    * Both refresh and flush can be triggered manually via the `_refresh` and `_flush` APIs respectively (see the sketch after this list).
    * As segments grow, they get merged into bigger segments.
    * This is called a Merge.
    * Once the new bigger segment is written, the old segments are dropped.
    * Merging can be quite resource intensive, especially with respect to disk I/O.
    * Maximum segment size is 5GB; segments bigger than this won't be considered for merging.
    * Merging can be forced by the `_forcemerge` endpoint.
    * This should only be done on read-only indices, as it can produce segments larger than 5GB, which will no longer be considered by future merges. Documents marked for deletion in these segments will never get cleaned up.
    * When a document is deleted, it gets marked to be deleted, and is not visible. It actually gets deleted during a merge.
    * Each shard copy writes operations (index and delete operations) into its transaction log (`translog`).
    * This is to prevent data loss on operations that haven't been written to disk yet with a Refresh.
    * If a crash happens before the flush, the operations in the transaction log are restored.
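
A minimal sketch of triggering these operations by hand, assuming a local cluster at `localhost:9200`, a hypothetical index called `articles`, and the Python `requests` library:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster
INDEX = "articles"            # hypothetical index name

# Refresh: make recently indexed documents visible to search (writes a new small segment).
requests.post(f"{ES}/{INDEX}/_refresh")

# Flush: trigger a Lucene commit (write segments to disk) and empty the transaction log.
requests.post(f"{ES}/{INDEX}/_flush")

# Force merge: merge segments down; only sensible on read-only indices (see caveat above).
requests.post(f"{ES}/{INDEX}/_forcemerge", params={"max_num_segments": 1})
```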

    ### Node Types

    * A node's type is determined by the roles it gets assigned, e.g. `master`, `data` and `ingest`.

    #### Master Node
    * Only one node is elected as the master node in a cluster. The master is responsible for creating or deleting indexes, tracking nodes, and allocating shards to nodes.

    #### Data Nodes
    * Data nodes are responsible for holding data and performing data-related operations, like CRUD operations, indexing, search, and aggregations.
    * You can configure data nodes so that they only do search and aggregation, not any indexing, to reduce the load in the individual nodes.

    #### Ingest Node
    * Applies ingest pipelines to documents in order to transform and enrich them before indexing.
    * With a heavy ingest load, it makes sense to use dedicated ingest nodes.

    ### Documents

    * Fields are the smallest individual unit of data in Elasticsearch.
    * These fields are customizable and could include, for example: title, author, date, summary, team, score, etc.
    * Multi-fields are fields that can be indexed in more than one way to produce more search results.
    * Meta-fields deal with a document’s metadata and usually start with an underscore.
    * Mapping defines the different types that reside within an index.
    * Mapping can be done via the API, or via Index Templates.
    * You can’t change the mapping or field type of an existing field.
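
A minimal sketch of defining a mapping with a multi-field, assuming a local cluster at `localhost:9200`; the index and field names are illustrative:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# "title" is a multi-field: analyzed text for full-text search, plus an
# un-analyzed "title.raw" keyword sub-field for exact matches, sorting and aggregations.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
            "published": {"type": "date"},
        }
    }
}
requests.put(f"{ES}/articles", json=mapping)
```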

    ### Analysis

    * Result of the analysis gets stored in the inverted index.
    * An inverted index is a data structure that maps terms to the IDs of the documents they are found in.
    * | ID | Term | Documents (Posting List) |
    |----|-----------|----------|
    | 1 | blue | 1 |
    | 2 | butterfly | 2,3 |
    | 3 | brutus | 3 |
    * When a search is performed, this inverted index is what gets searched.
    * In most cases, the same analyzer is used at index and search time.
    * Analysis steps are Character filter -> Tokenizer -> Token filter
    * Tokenizer also records the positions of tokens inside the input, for highlighting & proximity searches.
    * Analysers can be debugged via the `_analyze` endpoint.
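
A minimal sketch of debugging the three analysis steps with the `_analyze` endpoint, assuming a local cluster at `localhost:9200` (the chosen character filter, tokenizer and token filters are just an example chain):

```python
import requests

# Character filter -> tokenizer -> token filter, run explicitly.
body = {
    "char_filter": ["html_strip"],      # 1. strip HTML markup
    "tokenizer": "standard",            # 2. split into tokens, recording positions
    "filter": ["lowercase", "stop"],    # 3. lowercase, then drop stopwords
    "text": "<p>The QUICK Brown Foxes</p>",
}
response = requests.post("http://localhost:9200/_analyze", json=body)
print([(t["token"], t["position"]) for t in response.json()["tokens"]])
# Roughly: [('quick', 1), ('brown', 2), ('foxes', 3)]
```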

    ### Queries

    * Term queries do not perform analysis; they match exact terms.
    * Match queries analyze the search input.
    * `track_total_hits` controls the computation of total hit counts.
    * Setting it to `true` will show the exact count, at the expense of a slower query.
    * Setting it to an integer will show the exact count, up to the defined value.
    * `timeout` is used to set a time out for queries.
    * `multi_match` allows for multi-field queries.
    * `best_fields` finds documents which match any field, but uses the `_score` from the best field. This is equivalent to using a `dis_max` query with `match` on every field.
    * `most_fields` finds documents which match any field and combines the `_score` from each field. This is equivalent to using a boolean query with `should` and `match` on every field.
    * `cross_fields` treats fields with the same analyzer as though they were one big field, and looks for each word in any field. All terms must be present in at least one field for a document to match.
    * `query_string` uses a dedicated query syntax (the Lucene query string syntax) to return documents.
    * Very versatile, but the query is strict and returns an error if the query string includes any invalid syntax.
    * `simple_query_string` is similar, but uses a simpler syntax.
    * Unlike `query_string`, it does not return errors for invalid syntax; it ignores any invalid parts of the query string.
    * `dis_max` returns documents that match one or more of the defined sub-queries.
    * If `tie_breaker` is 0, the highest scoring match is used.
    * Otherwise, the lower-scoring matches are multiplied by `tie_breaker` and added to the highest-scoring match.
    * This is equivalent to using a `multi_match` query with `best_fields` enabled (see the sketch after this list).
    * `combined_fields` query supports searching multiple text fields as if their contents had been indexed into one combined field.
    * It analyzes the query string into individual terms, then looks for each term in any of the fields.
    * It only supports text fields that have the same analyser, unlike `multi_match`, which supports multiple types of fields with different analysers.
    * Query expansion is where the query is broadened by introducing additional tokens or phrases.
    * Query relaxation is where tokens are ignored to make the query less restrictive and increase recall.
    * Query translation is where a text-translation paradigm is applied to convert tail (i.e., unpopular, 20th percentile) queries into head queries that account for the bulk of traffic, helping to increase recall.
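
A minimal sketch of a `best_fields` `multi_match` search (equivalent to a `dis_max` over per-field `match` queries), assuming a local cluster at `localhost:9200` and a hypothetical `articles` index with `title` and `body` fields:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

search_body = {
    "query": {
        "multi_match": {
            "query": "brown fox",
            "type": "best_fields",          # score of the best field wins...
            "tie_breaker": 0.3,             # ...plus 0.3 * the score of the other fields
            "fields": ["title^2", "body"],  # ^2 is a query-time boost on title
        }
    },
    "track_total_hits": True,               # exact hit count, at the expense of a slower query
}
response = requests.post(f"{ES}/articles/_search", json=search_body)
print(response.json()["hits"]["total"])
```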

    #### Boolean Queries

    * `must`: The query must appear in matching documents and will contribute to the score.
    * `filter`: The query must appear in matching documents, but will **not** contribute to the score.
    * `should`: The query should appear in matching documents and will contribute to the score.
    * `must_not`: The query must **not** appear in matching documents, and will **not** contribute to the score.
    * The score from each matching `must` or `should` clause will be added together to provide the final `_score` for each document.
    * `minimum_should_match` specifies the number or percentage of `should` clauses returned documents must match.
    * Each query accepts a `_name` in its top level definition.
    * If named queries are used, the response includes a `matched_queries` property for each hit.
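
A minimal sketch of a boolean query with a named clause, assuming a local cluster at `localhost:9200` and a hypothetical `articles` index; field names are illustrative:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

search_body = {
    "query": {
        "bool": {
            "must": [{"match": {"title": {"query": "fox", "_name": "title-match"}}}],
            "filter": [{"term": {"status": "published"}}],   # must match, no score contribution
            "should": [{"match": {"body": "quick"}}],
            "must_not": [{"term": {"status": "retracted"}}],
            "minimum_should_match": 1,                       # require at least one should clause
        }
    }
}
response = requests.post(f"{ES}/articles/_search", json=search_body)
for hit in response.json()["hits"]["hits"]:
    # matched_queries lists the named clauses each hit satisfied.
    print(hit["_id"], hit["_score"], hit.get("matched_queries", []))
```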

    ### Relevance

    * `score = tf * idf * field_norm * query_norm * coord * index_boost * query_boost`
    * Term Frequency (`tf`): Number of times a given term appears in the document.
    * Inverse Document Frequency (`idf`): Assigns low weight to terms that appear frequently in all of the documents in the index.
    * For performance reasons, Elasticsearch doesn’t calculate the IDF across all documents in the index. Instead, each shard calculates a local IDF for the documents contained in that shard.
    * This means that local IDF and global IDF can be different, which can produce incorrect results.
    * This difference diminishes the more documents you add to the index, so in practice this is not a problem.
    * Unlike `tf`, `idf` cannot be disabled.
    * Field Length Normalization (`field_norm`): A term match found in a field with a low number of total terms is going to be more important than a match found in a field with a large number of terms.
    * Query Normalization (`query_norm`): Used so that different queries can be compared. For any individual query, however, it uses the same score for every document.
    * Coordination (`coord`): Counts the number of terms from the query that appear in the document.
    * If we have a 3 term query and a document contains 2 of those terms, then it will be scored higher than a document that has only 1 of those terms.
    * Index boost (`index_boost`): Boosting done during indexing; usage is strongly discouraged due to adverse effects, like having to reindex all documents to change an index-time boost.
    * Query boost (`query_boost`): Boosting done during the query.
    * The `_explain` endpoint can be used to debug the relevance score.
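
A minimal sketch of using the `_explain` endpoint, assuming a local cluster at `localhost:9200`, a hypothetical `articles` index, and a document with ID `42`:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Explain how the score of document 42 is computed for this query.
body = {"query": {"match": {"title": "brown fox"}}}
response = requests.post(f"{ES}/articles/_explain/42", json=body)
print(response.json()["explanation"])  # nested breakdown of tf, idf, norms and boosts
```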

    ## Learning to Rank

    * Applies supervised machine learning to search relevance ranking.
  4. umutseven92 revised this gist Oct 27, 2021. 1 changed file with 5 additions and 0 deletions.
    5 changes: 5 additions & 0 deletions search.md
    @@ -92,13 +92,18 @@
    * Setting it to `true` will show the exact count, at the expense of a slower query.
    * Setting it to an integer will show the exact count, up to the defined value.
    * `timeout` is used to set a time out for queries.
    * `multi_match` allows for multi-field queries.
    * `best_fields` finds documents which match any field, but uses the `_score` from the best field. This is equivalent to using a `dis_max` query with `match` on every field.
    * `most_fields` finds documents which match any field and combines the `_score` from each field. This is equivalent to using a boolean query with `should` and `match` on every field.
    * `cross_fields` treats fields with the same analyzer as though they were one big field, and looks for each word in any field. All terms must be present in at least one field for a document to match.
    * `query_string` uses a syntax to return documents.
    * Very versatile, but the query is strict and returns an error if the query string includes any invalid syntax.
    * `simple_query_string` is similar, but uses a simpler syntax.
    * Unlike `query_string`, it does not return errors for invalid syntax; it ignores any invalid parts of the query string.
    * `dis_max` returns documents that match one or more of the defined sub-queries.
    * If `tie_breaker` is 0, the highest scoring match is used.
    * Otherwise, the lower-scoring matches are multiplied by `tie_breaker` and added to the highest-scoring match.
    * This is equivalent to using a `multi_match` query with `best_fields` enabled.
    * `combined_fields` query supports searching multiple text fields as if their contents had been indexed into one combined field.
    * It analyzes the query string into individual terms, then looks for each term in any of the fields.
    * Only supports text fields that have the same analyser, unlike `multi_match` that supports multiple types of fields with different analysers.
  5. umutseven92 revised this gist Oct 27, 2021. 1 changed file with 3 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions search.md
    @@ -99,6 +99,9 @@
    * `dis_max` returns documents that match one or more of the defined sub-queries.
    * If `tie_breaker` is 0, the highest scoring match is used.
    * Otherwise, the lower-scoring matches are multiplied by `tie_breaker` and added to the highest-scoring match.
    * `combined_fields` query supports searching multiple text fields as if their contents had been indexed into one combined field.
    * It analyzes the query string into individual terms, then looks for each term in any of the fields.
    * Only supports text fields that have the same analyser, unlike `multi_match` that supports multiple types of fields with different analysers.
    * Query expansion is where the query is broadened by introducing additional tokens or phrases.
    * Query relaxation is where tokens are ignored to make the query less restrictive and increase recall.
    * Query translation is where a text-translation paradigm is applied to convert tail (i.e., unpopular, 20th percentile) queries into head queries that account for the bulk of traffic, helping to increase recall.
  6. umutseven92 revised this gist Oct 27, 2021. 1 changed file with 3 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions search.md
    @@ -96,6 +96,9 @@
    * Very versatile, but the query is strict and returns an error if the query string includes any invalid syntax.
    * `simple_query_string` is similar, but uses a simpler syntax.
    * Unlike `query_string`, it does not return errors for invalid syntax; it ignores any invalid parts of the query string.
    * `dis_max` returns documents that match one or more of the defined sub-queries.
    * If `tie_breaker` is 0, the highest scoring match is used.
    * Otherwise, the lower-scoring matches are multiplied by `tie_breaker` and added to the highest-scoring match.
    * Query expansion is where the query is broadened by introducing additional tokens or phrases.
    * Query relaxation is where tokens are ignored to make the query less restrictive and increase recall.
    * Query translation is where a text-translation paradigm is applied to convert tail (i.e., unpopular, 20th percentile) queries into head queries that account for the bulk of traffic, helping to increase recall.
  7. umutseven92 revised this gist Oct 25, 2021. 1 changed file with 15 additions and 0 deletions.
    15 changes: 15 additions & 0 deletions search.md
    @@ -92,10 +92,25 @@
    * Setting it to `true` will show the exact count, at the expense of a slower query.
    * Setting it to an integer will show the exact count, up to the defined value.
    * `timeout` is used to set a time out for queries.
    * `query_string` uses a syntax to return documents.
    * Very versatile, but the query is strict and returns an error if the query string includes any invalid syntax.
    * `simple_query_string` is similar, but uses a simpler syntax.
    * Unlike `query_string`, it does not return errors for invalid syntax; it ignores any invalid parts of the query string.
    * Query expansion is where the query is broadened by introducing additional tokens or phrases.
    * Query relaxation is where tokens are ignored to make the query less restrictive and increase recall.
    * Query translation is where a text-translation paradigm is applied to convert tail (i.e., unpopular, 20th percentile) queries into head queries that account for the bulk of traffic, helping to increase recall.

    #### Boolean Queries

    * `must`: The query must appear in matching documents and will contribute to the score.
    * `filter`: The query must appear in matching documents, but will **not** contribute to the score.
    * `should`: The query should appear in matching documents and will contribute to the score.
    * `must_not`: The query must **not** appear in matching documents, and will **not** contribute to the score.
    * The score from each matching `must` or `should` clause will be added together to provide the final `_score` for each document.
    * `minimum_should_match` specifies the number or percentage of `should` clauses returned documents must match.
    * Each query accepts a `_name` in its top level definition.
    * If named queries are used, the response includes a `matched_queries` property for each hit.

    ### Relevance

    * `score = tf * idf * field_norm * query_norm * coord * index_boost * query_boost`
  8. umutseven92 revised this gist Oct 25, 2021. 1 changed file with 4 additions and 0 deletions.
    4 changes: 4 additions & 0 deletions search.md
    @@ -88,6 +88,10 @@

    * Term queries do not do analysis; it matches exact terms.
    * Match queries analyze the search input.
    * `track_total_hits` controls the computation of total hit counts.
    * Setting it to `true` will show the exact count, at the expense of a slower query.
    * Setting it to an integer will show the exact count, up to the defined value.
    * `timeout` is used to set a time out for queries.
    * Query expansion is where the query is broadened by introducing additional tokens or phrases.
    * Query relaxation is where tokens are ignored to make the query less restrictive and increase recall.
    * Query translation is where a text-translation paradigm is applied to convert tail (i.e., unpopular, 20th percentile) queries into head queries that account for the bulk of traffic, helping to increase recall.
  9. umutseven92 revised this gist Oct 18, 2021. 1 changed file with 12 additions and 4 deletions.
    16 changes: 12 additions & 4 deletions search.md
    @@ -73,7 +73,7 @@

    * Result of the analysis gets stored in the inverted index.
    * An inverted index is a data structure that maps terms to the IDs of the documents they are found in.
    * | ID | Term | Document |
    * | ID | Term | Documents (Posting List) |
    |----|-----------|----------|
    | 1 | blue | 1 |
    | 2 | butterfly | 2,3 |
    @@ -88,9 +88,10 @@

    * Term queries do not do analysis; it matches exact terms.
    * Match queries analyze the search input.
    * Query expansion is when the query is broadened by introducing additional tokens or phrases.
    * Query relaxation is when tokens are ignored to make the query less restrictive and increase recall.

    * Query expansion is where the query is broadened by introducing additional tokens or phrases.
    * Query relaxation is where tokens are ignored to make the query less restrictive and increase recall.
    * Query translation is where a text-translation paradigm is applied to convert tail (i.e., unpopular, 20th percentile) queries into head queries that account for the bulk of traffic, helping to increase recall.

    ### Relevance

    * `score = tf * idf * field_norm * query_norm * coord * index_boost * query_boost`
    @@ -122,8 +123,15 @@
    * Words with similar meanings get similar vectors and can be projected onto a 2-dimensional graph for easy visualization.
    * These vectors are trained by ensuring words with similar meanings are physically near each other.

    ## Metrics

    * `F score` (specifically `F1 score`) is a single number that represents both precision and recall.
    * `Mean Average Precision (MAP)` allows you to quantify the relevance of the top returned results.
    * `Normalized Discounted Cumulative Gain (nDCG)` is like MAP, but weights the relevance of the result by its position.

    ## Misc

    * Corpus is the complete set of documents that need to be searched.
    * Precision is true positives / selected elements.
    * Recall is true positives / all positives.
    * Shingles are word n-grams. Given a stream of tokens, the shingle filter will create new tokens by concatenating adjacent terms.
  10. umutseven92 revised this gist Oct 7, 2021. 1 changed file with 5 additions and 3 deletions.
    8 changes: 5 additions & 3 deletions search.md
    @@ -26,9 +26,12 @@
    * These segments are immutable.
    * When data is written, it is published into segments.
    * While you are indexing documents, Elasticsearch collects them in memory (and in the transaction log, for safety) then every second or so, writes a new small segment to disk, and refreshes the search.
    * This is referred to as a Refresh.
    * A Refresh is what makes the data in the new segment visible to search.
    * This is done only on indices that have received one search request or more in the last 30 seconds.
    * A Refresh is done every second, but only on indices that have received one search request or more in the last 30 seconds.
    * A Flush is what actually writes the data into disk.
    * A Flush triggers the Lucene commit and empties the transaction log.
    * Flush happens automatically depending on how many operations get added to the transaction log, how big they are, and when the last flush happened.
    * Both refresh and flush can be triggered by `_refresh` and `_flush` APIs respectively.
    * As segments grow, they get merged into bigger segments.
    * This is called a Merge.
    * Once the new bigger segment is written, the old segments are dropped.
    @@ -39,7 +42,6 @@
    * When a document is deleted, it gets marked to be deleted, and is not visible. It actually gets deleted during a merge.
    * Each shard copy writes operations (index and delete operations) into its transaction log (`translog`).
    * This is to prevent data loss on operations that haven't been written to disk yet with a Refresh.
    * A flush triggers the Lucene commit and empties the transaction log.
    * If a crash happens before the flush, the operations in the transaction log are restored.

    ### Node Types
  11. umutseven92 revised this gist Oct 7, 2021. 1 changed file with 12 additions and 5 deletions.
    17 changes: 12 additions & 5 deletions search.md
    @@ -16,7 +16,10 @@
    * You can enable custom routing, mainly for performance reasons.
    * A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards.
    * Each shard is a Lucene index.
    * A Lucene index can be thought of as a self-contained search engine.
    * A Lucene index can be thought of as a self-contained search engine, with its own syntax, different from Elasticsearch's.
    * Each shard gets replicated, and stored in another node. These are replica shards.
    * The primary shard is responsible for keeping replica shards updated.
    * Replica shards are used for both availability and searching.
    * Lucene index contains segments.
    * A segment is an inverted index.
    * A search in a shard will search each segment in turn, then combine their results into the final results for that shard.
    @@ -25,15 +28,19 @@
    * While you are indexing documents, Elasticsearch collects them in memory (and in the transaction log, for safety) then every second or so, writes a new small segment to disk, and refreshes the search.
    * This is referred to as a Refresh.
    * A Refresh is what makes the data in the new segment visible to search.
    * This is done only on indices that have received one search request or more in the last 30 seconds.
    * As segments grow, they get merged into bigger segments.
    * This is called a Merge.
    * Once the new bigger segment is written, the old segments are dropped.
    * Merging can be quite resource intensive, especially with respect to disk I/O.
    * Maximum segment size is 5GBs.
    * Maximum segment size is 5GBs; segments bigger than this won't be considered for merging.
    * Merging can be forced by the `_forcemerge` endpoint.
    * This should only be done on read-only indices, as it can produce segments larger than 5GB, which will no longer be considered by future merges. Documents marked for deletion in these segments will never get cleaned up.
    * When a document is deleted, it gets marked to be deleted, and is not visible. It actually gets deleted during a merge.
    * Each shard gets replicated, and stored in another node. These are replica shards.
    * The primary shard is responsible for keeping replica shards updated.
    * Replica shards are used for both availability and searching.
    * Each shard copy writes operations (index and delete operations) into its transaction log (`translog`).
    * This is to prevent data loss on operations that haven't been written to disk yet with a Refresh.
    * A flush triggers the Lucene commit and empties the transaction log.
    * If a crash happens before the flush, the operations in the transaction log are restored.

    ### Node Types

  12. umutseven92 revised this gist Oct 6, 2021. 1 changed file with 24 additions and 13 deletions.
    37 changes: 24 additions & 13 deletions search.md
    @@ -52,40 +52,51 @@

    ### Documents

    * Fields are the smallest individual unit of data in Elasticsearch. These are customizable and could include, for example: title, author, date, summary, team, score, etc.
    * Multi-fields can (and should) be indexed in more than one way to produce more search results.
    * Fields are the smallest individual unit of data in Elasticsearch.
    * These fields are customizable and could include, for example: title, author, date, summary, team, score, etc.
    * Multi-fields are fields that can be indexed in more than one way to produce more search results.
    * Meta-fields deal with a document’s metadata and usually start with an underscore.
    * Mapping defines the different types that reside within an index.
    * Mapping can be done via the API, or via Index Templates.
    * You can’t change the mapping or field type of an existing field.

    ### Analysis

    * Result of the analysis gets stored at the inverted index.
    * When a search is performed, this result is what gets searched.
    * Result of the analysis gets stored in the inverted index.
    * An inverted index is a data structure that maps terms to the IDs of the documents they are found in.
    * | ID | Term | Document |
    |----|-----------|----------|
    | 1 | blue | 1 |
    | 2 | butterfly | 2,3 |
    | 3 | brutus | 3 |
    * When a search is performed, this inverted index is what gets searched.
    * In most cases, the same analyzer is used at index and search time.
    * Character filter -> Tokenizer -> Token filter
    * Analysis steps are Character filter -> Tokenizer -> Token filter
    * Tokenizer also records the positions of tokens inside the input, for highlighting & proximity searches.
    * Analysers can be debugged via the `_analyze` endpoint.

    ### Queries

    * Term queries do not do analysis; it matches exact terms.
    * Match queries analyze the search input.
    * Query expansion: Broaden the query by introducing additional tokens or phrases.
    * Query relaxation: Ignore tokens to make the query less restrictive and increase recall.
    * Query expansion is when the query is broadened by introducing additional tokens or phrases.
    * Query relaxation is when tokens are ignored to make the query less restrictive and increase recall.

    ### Relevance

    * `score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )`
    * Term Frequency (tf): Number of times a given term appears in the document.
    * Inverse Document Frequency (idf): Assigns low weight/relevance to terms that appear frequently in all of the documents in the index.
    * `score = tf * idf * field_norm * query_norm * coord * index_boost * query_boost`
    * Term Frequency (`tf`): Number of times a given term appears in the document.
    * Inverse Document Frequency (`idf`): Assigns low weight to terms that appear frequently in all of the documents in the index.
    * For performance reasons, Elasticsearch doesn’t calculate the IDF across all documents in the index. Instead, each shard calculates a local IDF for the documents contained in that shard.
    * This means that local IDF and global IDF can be different, which can produce incorrect results.
    * This difference diminishes the more documents you add to the index, so in practice this is not a problem.
    * Field Length Normalization (norm): The inverse square root of the number of terms in the field; smaller fields are higher valued.
    * Query Normalization (queryNorm): Used so that different queries can be compared. For any individual query, however, it uses the same score for every document.
    * Coordination (coord): Counts the number of terms from the query that appear in the document. If we have a 3-term query and a document contains 2 of those terms, then it will be scored higher than a document that has only 1 of those terms.
    * Unlike `tf`, `idf` cannot be disabled.
    * Field Length Normalization (`field_norm`): A term match found in a field with a low number of total terms is going to be more important than a match found in a field with a large number of terms.
    * Query Normalization (`query_norm`): Used so that different queries can be compared. For any individual query, however, it uses the same score for every document.
    * Coordination (`coord`): Counts the number of terms from the query that appear in the document.
    * If we have a 3 term query and a document contains 2 of those terms, then it will be scored higher than a document that has only 1 of those terms.
    * Index boost (`index_boost`): Boosting done during indexing; usage is strongly discouraged due to adverse effects, like having to reindex all documents to change an index-time boost.
    * Query boost (`query_boost`): Boosting done during the query.
    * The `_explain` endpoint can be used to debug the relevance score.

    ## Learning to Rank
  13. umutseven92 revised this gist Oct 6, 2021. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion search.md
    @@ -66,14 +66,15 @@
    * In most cases, the same analyzer is used at index and search time.
    * Character filter -> Tokenizer -> Token filter
    * Tokenizer also records the positions of tokens inside the input, for highlighting & proximity searches.
    * Analysers can be debugged via the `_analyze` endpoint.

    ### Queries

    * Term queries do not do analysis; it matches exact terms.
    * Match queries analyze the search input.
    * Query expansion: Broaden the query by introducing additional tokens or phrases.
    * Query relaxation: Ignore tokens to make the query less restrictive and increase recall.

    ### Relevance

    * `score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )`
  14. umutseven92 revised this gist Oct 6, 2021. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions search.md
    @@ -85,6 +85,7 @@
    * Field Length Normalization (norm): The inverse square root of the number of terms in the field; smaller fields are higher valued.
    * Query Normalization (queryNorm): Used so that different queries can be compared. For any individual query, however, it uses the same score for every document.
    * Coordination (coord): Counts the number of terms from the query that appear in the document. If we have a 3-term query and a document contains 2 of those terms, then it will be scored higher than a document that has only 1 of those terms.
    * The `_explain` endpoint can be used to debug the relevance score.

    ## Learning to Rank

  15. umutseven92 revised this gist Oct 6, 2021. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion search.md
    @@ -46,7 +46,7 @@
    * Data nodes are responsible for holding data and performing data-related operations, like CRUD operations, indexing, search, and aggregations.
    * You can configure data nodes so that they only do search and aggregation, not any indexing, to reduce the load in the individual nodes.

    ### Ingest Node
    #### Ingest Node
    * Applies ingest pipelines to documents in order to transform and enrich them before indexing.
    * With a heavy ingest load, it makes sense to use dedicated ingest nodes.

  16. umutseven92 revised this gist Oct 6, 2021. 1 changed file with 15 additions and 0 deletions.
    15 changes: 15 additions & 0 deletions search.md
    @@ -35,6 +35,21 @@
    * The primary shard is responsible for keeping replica shards updated.
    * Replica shards are used for both availability and searching.

    ### Node Types

    * A node's type is determined by the roles it gets assigned, e.g. `master`, `data` and `ingest`.

    #### Master Node
    * Only one node is elected as the master node in a cluster. The master is responsible for creating or deleting indexes, tracking nodes, and allocating shards to nodes.

    #### Data Nodes
    * Data nodes are responsible for holding data and performing data-related operations, like CRUD operations, indexing, search, and aggregations.
    * You can configure data nodes so that they only do search and aggregation, not any indexing, to reduce the load in the individual nodes.

    ### Ingest Node
    * Applies ingest pipelines to documents in order to transform and enrich them before indexing.
    * With a heavy ingest load, it makes sense to use dedicated ingest nodes.

    ### Documents

    * Fields are the smallest individual unit of data in Elasticsearch. These are customizable and could include, for example: title, author, date, summary, team, score, etc.
  17. umutseven92 created this gist Sep 30, 2021.
    97 changes: 97 additions & 0 deletions search.md
    @@ -0,0 +1,97 @@
    # Search

    ## Elasticsearch

    ### Architecture

    * Nodes join a cluster (named `elasticsearch` by default).
    * One node becomes the master node.
    * Each index data is divided into shards.
    * 5 shards by default.
    * Due to how routing works, the number of shards cannot be increased later; you would need to create a new index.
    * Internally, an index is a logical namespace that points to one or more shards.
    * Each node contains these shards.
    * When you index a document, Elasticsearch will determine which shard the document should be routed to for indexing.
    * Same routing will be used when Elasticsearch is retrieving the document.
    * You can enable custom routing, mainly for performance reasons.
    * A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards.
    * Each shard is a Lucene index.
    * A Lucene index can be thought of as a self-contained search engine.
    * Lucene index contains segments.
    * A segment is an inverted index.
    * A search in a shard will search each segment in turn, then combine their results into the final results for that shard.
    * These segments are immutable.
    * When data is written, it is published into segments.
    * While you are indexing documents, Elasticsearch collects them in memory (and in the transaction log, for safety) then every second or so, writes a new small segment to disk, and refreshes the search.
    * This is referred to as a Refresh.
    * A Refresh is what makes the data in the new segment visible to search.
    * As segments grow, they get merged into bigger segments.
    * This is called a Merge.
    * Once the new bigger segment is written, the old segments are dropped.
    * Merging can be quite resource intensive, especially with respect to disk I/O.
    * Maximum segment size is 5GBs.
    * When a document is deleted, it gets marked to be deleted, and is not visible. It actually gets deleted during a merge.
    * Each shard gets replicated, and stored in another node. These are replica shards.
    * The primary shard is responsible for keeping replica shards updated.
    * Replica shards are used for both availability and searching.

    ### Documents

    * Fields are the smallest individual unit of data in Elasticsearch. These are customizable and could include, for example: title, author, date, summary, team, score, etc.
    * Multi-fields can (and should) be indexed in more than one way to produce more search results.
    * Meta-fields deal with a document’s metadata and usually start with an underscore.
    * Mapping defines the different types that reside within an index.
    * Mapping can be done via the API, or via Index Templates.
    * You can’t change the mapping or field type of an existing field.

    ### Analysis

    * Result of the analysis gets stored at the inverted index.
    * When a search is performed, this result is what gets searched.
    * In most cases, the same analyzer is used at index and search time.
    * Character filter -> Tokenizer -> Token filter
    * Tokenizer also records the positions of tokens inside the input, for highlighting & proximity searches.

    ### Queries

    * Term queries do not do analysis; it matches exact terms.
    * Match queries analyze the search input.
    * Query expansion: Broaden the query by introducing additional tokens or phrases.
    * Query relaxation: Ignore tokens to make the query less restrictive and increase recall.

    ### Relevance

    * `score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )`
    * Term Frequency (tf): Number of times a given term appears in the document.
    * Inverse Document Frequency (idf): Assigns low weight/relevance to terms that appear frequently in all of the documents in the index.
    * For performance reasons, Elasticsearch doesn’t calculate the IDF across all documents in the index. Instead, each shard calculates a local IDF for the documents contained in that shard.
    * This means that local IDF and global IDF can be different, which can produce incorrect results.
    * This difference diminishes the more documents you add to the index, so in practice this is not a problem.
    * Field Length Normalization (norm): The inverse square root of the number of terms in the field; smaller fields are higher valued.
    * Query Normalization (queryNorm): Used so that different queries can be compared. For any individual query, however, it uses the same score for every document.
    * Coordination (coord): Counts the number of terms from the query that appear in the document. If we have a 3-term query and a document contains 2 of those terms, then it will be scored higher than a document that has only 1 of those terms.

    ## Learning to Rank

    * Applies supervised machine learning to search relevance ranking.
    * A model is trained based on some preprocessed data.
    * This data with hand-labeled ranks is then used to optimize future results.

    ## Vector Based Search

    * Each word is represented as a numerical quantity known as vector, using tools like word2vec (Google) or GloVe (Stanford).
    * This representation is called a text (word) embedding.
    * Each vector captures information about what a word means (semantics).
    * Words with similar meanings get similar vectors and can be projected onto a 2-dimensional graph for easy visualization.
    * These vectors are trained by ensuring words with similar meanings are physically near each other.
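
A toy sketch of comparing word vectors with cosine similarity (the 3-dimensional vectors are invented; real word2vec/GloVe embeddings have hundreds of dimensions):

```python
import math

vectors = {
    "king":  [0.8, 0.3, 0.1],
    "queen": [0.7, 0.4, 0.1],
    "apple": [0.1, 0.9, 0.6],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors["king"], vectors["queen"]))  # high: similar meaning, similar vectors
print(cosine(vectors["king"], vectors["apple"]))  # lower: unrelated meaning
```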

    ## Misc

    * Precision is true positives / selected elements.
    * Recall is true positives / all positives.
    * Shingles are word n-grams. Given a stream of tokens, the shingle filter will create new tokens by concatenating adjacent terms.
    * They give you the ability to pre-bake phrase matching. By building phrases into the index, you can avoid creating phrases at query time and save some processing time/speed.
    * The downside is that you have larger indices and potentially more memory usage.
    * Named-entity recognition (NER) locates, classifies and extracts named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations etc.
    * A named entity is a real-world object, such as persons, locations, organizations, products, etc.
    * In the sentence "Biden is the president of the United States", both "Biden" and the "United States" are named entities.
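
A minimal NER sketch using spaCy (an assumption; the gist does not prescribe a library, and the model name `en_core_web_sm` and the exact labels are illustrative):

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with an NER component
doc = nlp("Biden is the president of the United States")
print([(ent.text, ent.label_) for ent in doc.ents])
# Typically: [('Biden', 'PERSON'), ('the United States', 'GPE')]
```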