keegancsmith · August 6, 2020 22:21 · Aug 6, 2020 · Aug 6, 2020
diff --git a/postgres-gin.org b/postgres-gin.org
@@ -97,4 +97,8 @@ I also investigated not using =lower(name)= as the index. I couldn't get the
 postgres planner to use just =name=. Tried disabling different planners,
 ensured I was using case insensitive =~*=. No dice. We have had ideas of using
 postgres as the backend for code search, this makes me a bit less confident in
-that idea.
+that idea.
+
+Compiling =pg_trgrm= with regex debugging didn't help that much. It does
+produce very cool dot diagrams to understand the NFA it uses though. Very
+nice.
diff --git a/postgres-gin.org b/postgres-gin.org
@@ -0,0 +1,100 @@
+** DONE Postgres trigram exploration
+   CLOSED: [2020-08-07 Fri 00:19]
+   - State "DONE"       from "TODO"       [2020-08-07 Fri 00:19]
+   :LOGBOOK:
+   CLOCK: [2020-08-06 Thu 21:30]--[2020-08-07 Fri 00:19] =>  2:49
+   :END:
+  [2020-08-06 Thu 21:30]
+
+I went a bit off the deep end and compiled postgres to better understand the
+gin + pg_trgm indexes. I was seeing weird behaviour where =~
+'github\.com/sourcegraph'= was slower than =~ 'sourcegraph'=. Ended up reading
+a bunch of source code and developer documentation. Postgres architecture and
+sourcecode is actually quite nice.
+
+GIN is really nice, it stands for Generalized Inverted Index. Basically it
+abstracts inverted indexes. The developer provides the following functions:
+
+- extract the list of values in a document. So for pg_trgm that is a list of trigrams.
+- extract the list of values from a query (of a given type). So for pg_trgm if
+  doing a regex it is a list of trigrams from the regex. You can smuggle extra
+  data, this is used later in =consistent=.
+- =consistent= a function which given list of values returns true if it
+  matches the query. So if searching a regex like =(A|B)= consistent will
+  check that the trigrams match in A or B.
+- A way to compare values. GIN uses a btree on the values. So for pg_trgrm it
+  is just comparison on the trigrams, which are int4's (ie 4 byte int, only
+  uses first 3 bytes).
+
+See https://www.postgresql.org/docs/12/gin-extensibility.html for the details.
+
+#+begin_src shell
+  export PSQL_SRC=/Users/keegan/src/git.postgresql.org/git/postgresql
+
+  # build
+  cd $PSQL_SRC
+  ./configure --prefix=$PWD/local
+  make install
+  make -C contrib/citext install
+  make -C contrib/pg_trgm COPT=-DTRGM_REGEXP_DEBUG install
+
+  # Start DB
+  ./local/bin/initdb -D $PWD/local/pgsql/data
+  # modify port to be 5000
+  vim local/pgsql/data/postgresql.conf
+  ./local/bin/pg_ctl -D $PWD/local/pgsql/data -l logfile start
+  ./local/bin/createdb -p 5000 keegan
+
+  # Connect
+  ./local/bin/psql -p 5000
+
+  # Insert 434497 repos into table
+  git clone https://github.com/sourcegraph/ghdump.git
+  cd ghdump/api_response_dump
+  find . -type f | xargs cat | jq -r '.items | .[].full_name' | awk '{ print "github.com/" $1 }' | sort | uniq | $PSQL_SRC/local/bin/psql -p 5000 -c 'COPY repo (name) FROM STDIN'
+
+#+end_src
+
+#+begin_src sql
+  create table repo (id serial primary key, name citext not null);
+  ALTER TABLE ONLY repo ADD CONSTRAINT repo_name_unique UNIQUE (name) DEFERRABLE;
+  CREATE INDEX repo_name_trgm ON repo USING gin (lower((name)::text) gin_trgm_ops);
+  vacuum analyse repo;
+#+end_src
+
+#+begin_example
+  keegan=# explain analyse select name from repo where lower(name) ~ 'github\.com/sourcegraph';
+                                                          QUERY PLAN
+  ---------------------------------------------------------------------------------------------------------------------------
+   Bitmap Heap Scan on repo  (cost=176.34..335.19 rows=43 width=35) (actual time=8.032..8.186 rows=34 loops=1)
+     Recheck Cond: (lower((name)::text) ~ 'github\.com/sourcegraph'::text)
+     Heap Blocks: exact=1
+     ->  Bitmap Index Scan on repo_name_trgm  (cost=0.00..176.33 rows=43 width=0) (actual time=8.011..8.011 rows=34 loops=1)
+           Index Cond: (lower((name)::text) ~ 'github\.com/sourcegraph'::text)
+   Planning Time: 4.998 ms
+   Execution Time: 8.296 ms
+  (7 rows)
+
+  keegan=# explain analyse select name from repo where lower(name) ~ 'sourcegraph';
+                                                          QUERY PLAN
+  ---------------------------------------------------------------------------------------------------------------------------
+   Bitmap Heap Scan on repo  (cost=112.34..271.19 rows=43 width=35) (actual time=1.506..1.655 rows=34 loops=1)
+     Recheck Cond: (lower((name)::text) ~ 'sourcegraph'::text)
+     Heap Blocks: exact=1
+     ->  Bitmap Index Scan on repo_name_trgm  (cost=0.00..112.33 rows=43 width=0) (actual time=1.475..1.475 rows=34 loops=1)
+           Index Cond: (lower((name)::text) ~ 'sourcegraph'::text)
+   Planning Time: 1.950 ms
+   Execution Time: 1.701 ms
+  (7 rows)
+#+end_example
+
+I haven't gotten to the root of why it is slower to search =sourcegraph= vs
+=github\.com/sourcegraph=. My best guess is when confirming when a regex
+matches a value it is faster to check the former due to not matching the
+github.com prefix.
+
+I also investigated not using =lower(name)= as the index. I couldn't get the
+postgres planner to use just =name=. Tried disabling different planners,
+ensured I was using case insensitive =~*=. No dice. We have had ideas of using
+postgres as the backend for code search, this makes me a bit less confident in
+that idea.
No results found