Skip to content

Instantly share code, notes, and snippets.

@keegancsmith
Last active August 6, 2020 22:21
Show Gist options
  • Select an option

  • Save keegancsmith/ecdb45c6d0390911fef1aea1ef1f01a4 to your computer and use it in GitHub Desktop.

Select an option

Save keegancsmith/ecdb45c6d0390911fef1aea1ef1f01a4 to your computer and use it in GitHub Desktop.

Revisions

  1. keegancsmith revised this gist Aug 6, 2020. 1 changed file with 5 additions and 1 deletion.
    6 changes: 5 additions & 1 deletion postgres-gin.org
    Original file line number Diff line number Diff line change
    @@ -97,4 +97,8 @@ I also investigated not using =lower(name)= as the index. I couldn't get the
    postgres planner to use just =name=. Tried disabling different planners,
    ensured I was using case insensitive =~*=. No dice. We have had ideas of using
    postgres as the backend for code search, this makes me a bit less confident in
    that idea.
    that idea.

    Compiling =pg_trgrm= with regex debugging didn't help that much. It does
    produce very cool dot diagrams to understand the NFA it uses though. Very
    nice.
  2. keegancsmith created this gist Aug 6, 2020.
    100 changes: 100 additions & 0 deletions postgres-gin.org
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,100 @@
    ** DONE Postgres trigram exploration
    CLOSED: [2020-08-07 Fri 00:19]
    - State "DONE" from "TODO" [2020-08-07 Fri 00:19]
    :LOGBOOK:
    CLOCK: [2020-08-06 Thu 21:30]--[2020-08-07 Fri 00:19] => 2:49
    :END:
    [2020-08-06 Thu 21:30]

    I went a bit off the deep end and compiled postgres to better understand the
    gin + pg_trgm indexes. I was seeing weird behaviour where =~
    'github\.com/sourcegraph'= was slower than =~ 'sourcegraph'=. Ended up reading
    a bunch of source code and developer documentation. Postgres architecture and
    sourcecode is actually quite nice.

    GIN is really nice, it stands for Generalized Inverted Index. Basically it
    abstracts inverted indexes. The developer provides the following functions:

    - extract the list of values in a document. So for pg_trgm that is a list of trigrams.
    - extract the list of values from a query (of a given type). So for pg_trgm if
    doing a regex it is a list of trigrams from the regex. You can smuggle extra
    data, this is used later in =consistent=.
    - =consistent= a function which given list of values returns true if it
    matches the query. So if searching a regex like =(A|B)= consistent will
    check that the trigrams match in A or B.
    - A way to compare values. GIN uses a btree on the values. So for pg_trgrm it
    is just comparison on the trigrams, which are int4's (ie 4 byte int, only
    uses first 3 bytes).

    See https://www.postgresql.org/docs/12/gin-extensibility.html for the details.

    #+begin_src shell
    export PSQL_SRC=/Users/keegan/src/git.postgresql.org/git/postgresql

    # build
    cd $PSQL_SRC
    ./configure --prefix=$PWD/local
    make install
    make -C contrib/citext install
    make -C contrib/pg_trgm COPT=-DTRGM_REGEXP_DEBUG install

    # Start DB
    ./local/bin/initdb -D $PWD/local/pgsql/data
    # modify port to be 5000
    vim local/pgsql/data/postgresql.conf
    ./local/bin/pg_ctl -D $PWD/local/pgsql/data -l logfile start
    ./local/bin/createdb -p 5000 keegan

    # Connect
    ./local/bin/psql -p 5000

    # Insert 434497 repos into table
    git clone https://github.com/sourcegraph/ghdump.git
    cd ghdump/api_response_dump
    find . -type f | xargs cat | jq -r '.items | .[].full_name' | awk '{ print "github.com/" $1 }' | sort | uniq | $PSQL_SRC/local/bin/psql -p 5000 -c 'COPY repo (name) FROM STDIN'

    #+end_src

    #+begin_src sql
    create table repo (id serial primary key, name citext not null);
    ALTER TABLE ONLY repo ADD CONSTRAINT repo_name_unique UNIQUE (name) DEFERRABLE;
    CREATE INDEX repo_name_trgm ON repo USING gin (lower((name)::text) gin_trgm_ops);
    vacuum analyse repo;
    #+end_src

    #+begin_example
    keegan=# explain analyse select name from repo where lower(name) ~ 'github\.com/sourcegraph';
    QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------
    Bitmap Heap Scan on repo (cost=176.34..335.19 rows=43 width=35) (actual time=8.032..8.186 rows=34 loops=1)
    Recheck Cond: (lower((name)::text) ~ 'github\.com/sourcegraph'::text)
    Heap Blocks: exact=1
    -> Bitmap Index Scan on repo_name_trgm (cost=0.00..176.33 rows=43 width=0) (actual time=8.011..8.011 rows=34 loops=1)
    Index Cond: (lower((name)::text) ~ 'github\.com/sourcegraph'::text)
    Planning Time: 4.998 ms
    Execution Time: 8.296 ms
    (7 rows)

    keegan=# explain analyse select name from repo where lower(name) ~ 'sourcegraph';
    QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------
    Bitmap Heap Scan on repo (cost=112.34..271.19 rows=43 width=35) (actual time=1.506..1.655 rows=34 loops=1)
    Recheck Cond: (lower((name)::text) ~ 'sourcegraph'::text)
    Heap Blocks: exact=1
    -> Bitmap Index Scan on repo_name_trgm (cost=0.00..112.33 rows=43 width=0) (actual time=1.475..1.475 rows=34 loops=1)
    Index Cond: (lower((name)::text) ~ 'sourcegraph'::text)
    Planning Time: 1.950 ms
    Execution Time: 1.701 ms
    (7 rows)
    #+end_example

    I haven't gotten to the root of why it is slower to search =sourcegraph= vs
    =github\.com/sourcegraph=. My best guess is when confirming when a regex
    matches a value it is faster to check the former due to not matching the
    github.com prefix.

    I also investigated not using =lower(name)= as the index. I couldn't get the
    postgres planner to use just =name=. Tried disabling different planners,
    ensured I was using case insensitive =~*=. No dice. We have had ideas of using
    postgres as the backend for code search, this makes me a bit less confident in
    that idea.