Created
April 14, 2023 06:06
-
-
Save mneedham/9c96fc2f6cd223c9c88c828b8b4fce42 to your computer and use it in GitHub Desktop.
Revisions
-
mneedham created this gist
Apr 14, 2023 .There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,119 @@ # Intro to Window Functions Queries based on the [Citi bikes dataset](https://www.kaggle.com/datasets/rosenthal/citi-bike-stations?resource=download) using [DuckDB](https://duckdb.org/). Associated YouTube video: https://youtu.be/u9AtW5P0m5c ## Import data ```sql CREATE OR REPLACE TABLE bikeStations AS FROM read_csv_auto('data/*.csv', types={"station_id": "VARCHAR"}) WHERE missing_station_information = false; ``` ```sql ALTER TABLE bikeStations ADD COLUMN ts TIMESTAMP; ``` ```sql UPDATE bikeStations SET ts = to_timestamp(station_status_last_reported)::TIMESTAMP AT TIME ZONE 'US/Eastern'; ``` ## Overview ```sql .mode line SELECT * FROM bikeStations LIMIT 1; ``` ## General purpose window functions ## Row numbers ```sql .mode duckbox SELECT station_id, ts, row_number() OVER () AS rn FROM bikeStations LIMIT 10; ``` ## Row number by time ```sql SELECT station_id, ts, row_number() OVER (ORDER BY ts) AS rn FROM bikeStations LIMIT 10; ``` ## Row number then sorted by time (subtly different) ```sql SELECT station_id, ts, row_number() OVER () AS rn FROM bikeStations ORDER BY ts LIMIT 10; ``` ## Row number by station ```sql SELECT station_id, station_name, ts, num_bikes_available AS available, row_number() OVER (PARTITION BY station_id ORDER BY ts) AS rn FROM bikeStations LIMIT 10; ``` ## First entry for each station (does not work) ```sql SELECT station_id, station_name, ts, num_bikes_available AS available, row_number() OVER (PARTITION BY station_id ORDER BY ts) AS rn FROM bikeStations WHERE rn = 1 LIMIT 10; ``` ## CTE to do this ```sql WITH bikeStationsRn AS ( SELECT station_id, station_name, ts, num_bikes_available AS available, row_number() OVER (PARTITION BY station_id ORDER BY ts) AS rn FROM bikeStations ) SELECT * FROM bikeStationsRn WHERE rn = 1 LIMIT 10; ``` ## Lead ```sql SELECT station_id AS id, station_name, ts, num_bikes_available AS current, LEAD(num_bikes_available) OVER (PARTITION BY station_id ORDER BY ts) AS next, next - current AS delta FROM bikeStations ORDER BY station_id, ts LIMIT 10; ``` ## Lag ```sql SELECT station_id AS id, station_name, ts, LAG(num_bikes_available) OVER (PARTITION BY station_id ORDER BY ts) AS prev, num_bikes_available AS current, current - prev AS delta FROM bikeStations ORDER BY station_id, ts LIMIT 10; ``` ## Window Clause ## The WINDOW clause can be used to define a named window that can be shared between multiple window functions: ```sql SELECT station_id AS id, station_name, ts, LAG(num_bikes_available) OVER station AS prev, num_bikes_available AS current, LEAD(num_bikes_available) OVER station AS next, FROM bikeStations WINDOW station AS ( PARTITION BY station_id ORDER BY ts ) ORDER BY station_id, ts LIMIT 10; ```