@mneedham
Created October 14, 2022 18:24

Revisions

  1. mneedham created this gist Oct 14, 2022.
    11 changes: 11 additions & 0 deletions parquet-cli.sh
    @@ -0,0 +1,11 @@
    # The NYC Taxis Dataset - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

    pip install parquet-cli

    parq data/yellow_tripdata_2022-01.parquet

    parq data/yellow_tripdata_2022-01.parquet --schema

    parq data/yellow_tripdata_2022-01.parquet --head 10

    parq data/yellow_tripdata_2022-01.parquet --tail 10
    12 changes: 12 additions & 0 deletions parquet.py
    @@ -0,0 +1,12 @@
    import pyarrow.parquet as pq

    file = pq.ParquetFile("data/yellow_tripdata_2022-01.parquet")
    file.metadata
    file.schema

    file.read().to_pandas()

    df = file.read().to_pandas()

    df.to_csv("trips.csv")
    df.to_json("trips.json", orient="records", lines=True)
    3 changes: 3 additions & 0 deletions size.sh
    @@ -0,0 +1,3 @@
    stat -f %z data/yellow_tripdata_2022-01.parquet | numfmt --to=iec
    stat -f %z trips.csv | numfmt --to=iec
    stat -f %z trips.json | numfmt --to=iec
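
The size.sh commands above combine `stat -f %z`, which is the BSD/macOS form of `stat`, with `numfmt`, which ships with GNU coreutils, so they only run as written where both are available. A minimal sketch of the same size comparison in Python, using only the standard library, might look like the following. The helper names `human_size` and `report_sizes` are hypothetical, and the rounding is a simple one-decimal format rather than a reproduction of `numfmt --to=iec` output.

```python
import os


def human_size(num_bytes: int) -> str:
    """Format a byte count with binary suffixes, e.g. 1536 -> '1.5K'.

    Hypothetical helper: approximates the style of `numfmt --to=iec`
    but does not reproduce its exact rounding rules.
    """
    size = float(num_bytes)
    for suffix in ("", "K", "M", "G", "T"):
        if size < 1024 or suffix == "T":
            # Plain byte counts print as integers; scaled values get one decimal.
            return f"{int(size)}" if suffix == "" else f"{size:.1f}{suffix}"
        size /= 1024


def report_sizes(paths):
    """Return {path: human-readable size} for each path that exists."""
    return {p: human_size(os.path.getsize(p)) for p in paths if os.path.exists(p)}


if __name__ == "__main__":
    # File names taken from the scripts above; missing files are skipped.
    files = [
        "data/yellow_tripdata_2022-01.parquet",
        "trips.csv",
        "trips.json",
    ]
    for path, size in report_sizes(files).items():
        print(f"{size:>8}  {path}")
```

Run from the directory that holds `data/` and the exported `trips.csv` and `trips.json`, this prints one line per file that exists, which makes it easy to compare the Parquet file against its CSV and JSON exports on any platform.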