Skip to content

Instantly share code, notes, and snippets.

@timmc
Last active October 4, 2024 14:00
Show Gist options
  • Select an option

  • Save timmc/df5fbb6e069fb8c1c4e181a29930ace3 to your computer and use it in GitHub Desktop.

Select an option

Save timmc/df5fbb6e069fb8c1c4e181a29930ace3 to your computer and use it in GitHub Desktop.

Revisions

  1. timmc revised this gist May 27, 2020. 2 changed files with 50 additions and 1 deletion.
    49 changes: 49 additions & 0 deletions pwned-passwords-sqlite-build.py
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,49 @@
    #!/usr/bin/env python3
    # Build a SQLite3 DB for looking up SHA-1 hashes of leaked passwords.
    #
    # This can be fed the txt file from one of Have I Been Pwned's hash
    # lists available from https://haveibeenpwned.com/Passwords -- but any
    # text file with line format ``hash-hex:count`` will work.
    #
    # When run on the v5 hash-ordered SHA-1 file, expect the build to take
    # about 35 minutes and produce a 15.7 GiB file (~30.5 bytes per record).
    #
    # This example shows querying the resulting database for the
    # vulnerable password "qwerty123", and finding that it was present
    # with a count of 621679:
    #
    # >>> import sqlite3
    # >>> conn = sqlite3.connect("pwned-passwords-sha1-with-counts-v5.sqlite")
    # >>> hash_bytes = hashlib.sha1('qwerty123'.encode()).digest()
    # >>> conn.execute("SELECT * FROM hashes WHERE hash = :sha1 LIMIT 1", {'sha1': hash_bytes}).fetchone()
    # (b'\\\xec\x17[\x16^=^b\xc9\xe1<\xe8H\xefo\xea\xc8\x1b\xff', 621679)

    import os
    import sqlite3
    import sys

    def record_generator(in_path):
    with open(in_path) as hashes:
    for line in hashes:
    (sha1_hex, count_str) = line.split(':', 2)
    sha1_bytes = bytes.fromhex(sha1_hex)
    count = int(count_str)
    yield (sha1_bytes, count)

    def build(in_path, out_path):
    with sqlite3.connect(out_path) as conn:
    conn.execute('pragma journal_mode=memory')
    conn.execute('CREATE TABLE hashes("hash" BLOB PRIMARY KEY, "count" INT) WITHOUT ROWID')
    conn.executemany(
    'INSERT INTO hashes(hash, count) VALUES (?, ?)',
    record_generator(in_path))
    conn.commit()

    def main(*args):
    if len(args) != 2:
    print("Usage: build.py <hash-and-count-input.txt> <output.sqlite>")
    sys.exit(1)
    build(*args)

    if __name__ == '__main__':
    main(*sys.argv[1:])
    2 changes: 1 addition & 1 deletion pwned-passwords-sqlite.md
    Original file line number Diff line number Diff line change
    @@ -1,4 +1,4 @@
    (There are better ways to do this, such as making a small Python program to load in the data as binary instead of hex—but this is what I did in a pinch and it worked well for what I needed!)
    **Update:** Added a Python version (`pwned-passwords-sqlite-build.py`) that preserves counts and uses binary keys rather than text.

    Last executed 2019-06-25 with the v4 dump:

  2. timmc renamed this gist May 27, 2020. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  3. timmc revised this gist Mar 2, 2020. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -1,3 +1,5 @@
    (There are better ways to do this, such as making a small Python program to load in the data as binary instead of hex—but this is what I did in a pinch and it worked well for what I needed!)

    Last executed 2019-06-25 with the v4 dump:

    1. Make sure you have 60 GB free disk space and some extra to
  4. timmc renamed this gist Aug 9, 2019. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  5. timmc created this gist Jun 25, 2019.
    38 changes: 38 additions & 0 deletions gistfile1.txt
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,38 @@
    Last executed 2019-06-25 with the v4 dump:

    1. Make sure you have 60 GB free disk space and some extra to
    spare. Alternatively, take a walk on the wild side and delete
    source files as soon as you've used them.
    2. Download the SHA-1 (ordered by hash) torrent from
    https://haveibeenpwned.com/Passwords
    3. Unpack and strip off the counts:
    ```
    7z x -so pwned-passwords-sha1-ordered-by-hash-v4.7z pwned-passwords-sha1-ordered-by-hash-v4.txt | sed 's/:.*//' > hashes.lst
    ```
    4. Load into sqlite:
    ```
    $ sqlite3 pwned-passwords-by-hash-v4.sqlite
    sqlite> pragma journal_mode=memory;
    sqlite> CREATE TABLE hashes("hash" BLOB PRIMARY KEY) WITHOUT ROWID;
    sqlite> .mode csv
    sqlite> .import hashes.lst hashes
    ```
    5. Confirm that queries will use the primary key's index:
    ```
    sqlite> EXPLAIN QUERY PLAN SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1;
    0|0|0|SEARCH TABLE hashes USING PRIMARY KEY (hash=?)
    ```
    A sample query should return almost instantly:
    ```
    sqlite> SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1;
    D657187D9C9C1AD04FDA5132338D405FDB112FA1
    ```

    A few notes on construction:

    - Journaling is disabled to speed up initial import.
    - Making the hash the primary key tells sqlite that the data is unique
    and ordered—and the primary key data is stored in a B-tree, not
    duplicated into an additional index.
    - We can also avoid storing rowids; no idea how much space this saves
    us, but no harm here: https://www.sqlite.org/withoutrowid.html