Skip to content

Instantly share code, notes, and snippets.

@mneedham
Created April 14, 2023 06:06
Show Gist options
  • Select an option

  • Save mneedham/9c96fc2f6cd223c9c88c828b8b4fce42 to your computer and use it in GitHub Desktop.

Select an option

Save mneedham/9c96fc2f6cd223c9c88c828b8b4fce42 to your computer and use it in GitHub Desktop.

Revisions

  1. mneedham created this gist Apr 14, 2023.
    119 changes: 119 additions & 0 deletions queries.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,119 @@
    # Intro to Window Functions

    Queries based on the [Citi bikes dataset](https://www.kaggle.com/datasets/rosenthal/citi-bike-stations?resource=download) using [DuckDB](https://duckdb.org/).

    Associated YouTube video: https://youtu.be/u9AtW5P0m5c

    ## Import data
    ```sql
    CREATE OR REPLACE TABLE bikeStations AS
    FROM read_csv_auto('data/*.csv', types={"station_id": "VARCHAR"})
    WHERE missing_station_information = false;
    ```
    ```sql
    ALTER TABLE bikeStations ADD COLUMN ts TIMESTAMP;
    ```
    ```sql
    UPDATE bikeStations
    SET ts = to_timestamp(station_status_last_reported)::TIMESTAMP AT TIME ZONE 'US/Eastern';
    ```

    ## Overview
    ```sql
    .mode line
    SELECT * FROM bikeStations LIMIT 1;
    ```

    ## General purpose window functions

    ## Row numbers
    ```sql
    .mode duckbox
    SELECT station_id, ts, row_number() OVER () AS rn
    FROM bikeStations
    LIMIT 10;
    ```

    ## Row number by time
    ```sql
    SELECT station_id, ts, row_number() OVER (ORDER BY ts) AS rn
    FROM bikeStations
    LIMIT 10;
    ```

    ## Row number then sorted by time (subtly different)
    ```sql
    SELECT station_id, ts, row_number() OVER () AS rn
    FROM bikeStations
    ORDER BY ts
    LIMIT 10;
    ```

    ## Row number by station
    ```sql
    SELECT station_id, station_name, ts, num_bikes_available AS available,
    row_number() OVER (PARTITION BY station_id ORDER BY ts) AS rn
    FROM bikeStations
    LIMIT 10;
    ```

    ## First entry for each station (does not work)
    ```sql
    SELECT station_id, station_name, ts, num_bikes_available AS available,
    row_number() OVER (PARTITION BY station_id ORDER BY ts) AS rn
    FROM bikeStations
    WHERE rn = 1
    LIMIT 10;
    ```

    ## CTE to do this
    ```sql
    WITH bikeStationsRn AS (
    SELECT station_id, station_name, ts, num_bikes_available AS available,
    row_number() OVER (PARTITION BY station_id ORDER BY ts) AS rn
    FROM bikeStations
    )

    SELECT *
    FROM bikeStationsRn
    WHERE rn = 1
    LIMIT 10;
    ```

    ## Lead
    ```sql
    SELECT station_id AS id, station_name, ts,
    num_bikes_available AS current,
    LEAD(num_bikes_available) OVER (PARTITION BY station_id ORDER BY ts) AS next,
    next - current AS delta
    FROM bikeStations
    ORDER BY station_id, ts
    LIMIT 10;
    ```

    ## Lag
    ```sql
    SELECT station_id AS id, station_name, ts,
    LAG(num_bikes_available) OVER (PARTITION BY station_id ORDER BY ts) AS prev,
    num_bikes_available AS current,
    current - prev AS delta
    FROM bikeStations
    ORDER BY station_id, ts
    LIMIT 10;
    ```

    ## Window Clause
    ## The WINDOW clause can be used to define a named window that can be shared between multiple window functions:
    ```sql
    SELECT station_id AS id, station_name, ts,
    LAG(num_bikes_available) OVER station AS prev,
    num_bikes_available AS current,
    LEAD(num_bikes_available) OVER station AS next,
    FROM bikeStations
    WINDOW station AS (
    PARTITION BY station_id
    ORDER BY ts
    )
    ORDER BY station_id, ts
    LIMIT 10;
    ```