dannguyen · October 2, 2025 18:53 · Oct 15, 2024 · Oct 15, 2024 · Oct 10, 2024 · Oct 9, 2024
diff --git a/README.openai-structured-output-demo.md b/README.openai-structured-output-demo.md
@@ -79,6 +79,9 @@ pip install openai pydantic
 For ease of use, these scripts are set up to use gpt-4o-mini's vision capabilities to ingest PNG files via web URLs. If you want to modify the script to test a URL of your choosing, simply modify the `INPUT_URL` variable at the top of the script. 
 
 
+### Scanned financial disclosure
+
+
 
 
 <br>
@@ -249,7 +252,18 @@ It left the values as text, e.g. `"value": "$5,000,001 - $25,000,000"` versus `"
 
 
 
+### Scanned financial disclosure
+
+- [extract-scanned-financial-disclosure.py](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-extract-scanned-financial-disclosure-py)
+- [output-scanned-financial-disclosure.json](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-scanned-financial-disclosure-json)
+
+As I said at the beginning of this section, the report screenshot comes from a [PDF with actual text](https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2023/10059734.pdf) — most Congressional disclosure filings in the past 5 years have used the [e-filing system](https://disclosures-clerk.house.gov/FinancialDisclosure), which inherently results in more regular data even when the output is PDF.
+
+So I tried using Structured Outputs on a screenshot of a [2008-era report](https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2008/8135973.pdf), and the results were [pretty solid](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-scanned-financial-disclosure-json). 
+
+<img width="842" alt="image" src="https://gist.github.com/user-attachments/assets/e430e76a-2519-43fa-a370-85a584b816b6">
 
+The main caveat is that I had to rotate the page by 90 degrees. The model did try to parse the vertically oriented page and got about half of the values right, which is probably one of the worst-case scenarios: you'd prefer the model to flub things completely, so that you could at least catch the failure with automated error checks.
 
 
 

diff --git a/extract-scanned-financial-disclosure.py b/extract-scanned-financial-disclosure.py
@@ -3,11 +3,11 @@
 """
 extract-financial-disclosure.py
 Parses and extracts structured data from the screenshot at the given URL:
-https://private-user-images.githubusercontent.com/121520/376677240-e11735ce-71d1-47b1-9127-2188be17c42e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjkwMDU3NDYsIm5iZiI6MTcyOTAwNTQ0NiwicGF0aCI6Ii8xMjE1MjAvMzc2Njc3MjQwLWUxMTczNWNlLTcxZDEtNDdiMS05MTI3LTIxODhiZTE3YzQyZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQxMDE1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MTAxNVQxNTE3MjZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lZTgwYjAwMDFmYWNiNGI0MmEzNGU5YzllNzgzMzdhYTlmZjkwOGU0ZGNhYmRkZDhhODA2MzZlMmM5YmZhY2ZlJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.BhDYMwoVUmMt4OTKw04PfqQBKKqWzv0KuAsdFaV_WMU
+https://gist.github.com/user-attachments/assets/e430e76a-2519-43fa-a370-85a584b816b6
 
 The page comes from page 5 of 24; Schedule III, of the full financial
 disclosure report found here:
-https://gist.github.com/user-attachments/assets/12c3fce6-9dd5-4140-bc59-75606062799c
+https://gist.github.com/user-attachments/assets/e430e76a-2519-43fa-a370-85a584b816b6
 
 (the page was manually rotated 90 degrees from its original orientation in the scanned document)
 

diff --git a/output-scanned-financial-disclosure.py → output-scanned-financial-disclosure.json b/output-scanned-financial-disclosure.py → output-scanned-financial-disclosure.json
diff --git a/extract-scanned-financial-disclosure.py b/extract-scanned-financial-disclosure.py
@@ -0,0 +1,91 @@
+#!/usr/bin/env python3
+
+"""
+extract-financial-disclosure.py
+Parses and extracts structured data from the screenshot at the given URL:
+https://private-user-images.githubusercontent.com/121520/376677240-e11735ce-71d1-47b1-9127-2188be17c42e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjkwMDU3NDYsIm5iZiI6MTcyOTAwNTQ0NiwicGF0aCI6Ii8xMjE1MjAvMzc2Njc3MjQwLWUxMTczNWNlLTcxZDEtNDdiMS05MTI3LTIxODhiZTE3YzQyZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQxMDE1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MTAxNVQxNTE3MjZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lZTgwYjAwMDFmYWNiNGI0MmEzNGU5YzllNzgzMzdhYTlmZjkwOGU0ZGNhYmRkZDhhODA2MzZlMmM5YmZhY2ZlJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.BhDYMwoVUmMt4OTKw04PfqQBKKqWzv0KuAsdFaV_WMU
+
+The page comes from page 5 of 24; Schedule III, of the full financial
+disclosure report found here:
+https://gist.github.com/user-attachments/assets/12c3fce6-9dd5-4140-bc59-75606062799c
+
+(the page was manually rotated 90 degrees from its original orientation in the scanned document)
+
+This script assumes your API key is set up in the default way,
+  i.e. environment variable: $OPENAI_API_KEY
+  https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
+"""
+import base64
+import json
+from openai import OpenAI
+from pathlib import Path
+from pydantic import BaseModel, Field
+from typing import Union, Literal
+
+INPUT_URL = "https://gist.github.com/user-attachments/assets/52c5c8f5-886f-45fe-a338-d1cd3e36ecc8"
+
+
+# OpenAI examples of Structured Output scripts and data definitions
+# https://platform.openai.com/docs/guides/structured-outputs/examples?context=ex2
+
+
+# Define the data structures in Pydantic:
+# a Disclosure Report has a list of assets
+class Asset(BaseModel):
+    owner: Union[Literal['SP', 'DC', 'JT'], None] = Field(description="The leftmost first column of the table")
+
+    asset_name: str = Field(
+        description="The name of the asset, the second column of the table"
+    )
+    asset_value_low: Union[int, None] = Field(
+        description="In the third column, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
+    )
+    asset_value_high: Union[int, None] = Field(
+        description="In the third column, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
+    )
+    income_type: str = Field(description="The fourth column")
+
+    income_low: Union[int, None] = Field(
+        description="In the 5th column, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'.  If the value is enclosed in parentheses, then the income values are meant to be negative"
+    )
+    income_high: Union[int, None] = Field(
+        description="In the 5th column,  the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'. If the value is enclosed in parentheses, then the income values are meant to be negative"
+    )
+
+    transaction_type: Union[Literal['P', 'S', 'E'], None]
+
+class DisclosureReport(BaseModel):
+    assets: list[Asset]
+
+## initialize OpenAI client
+client = OpenAI()
+
+
+# Example of message format for passing in an image via URL
+# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
+input_messages = [
+    {"role": "system", "content": "Output the result in JSON format."},
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Extract the text from this image"},
+            {
+                "type": "image_url",
+                "image_url": {"url": INPUT_URL},
+            },
+        ],
+    },
+]
+
+# gpt-4o-mini is cheap and fast and has vision capabilities
+response = client.beta.chat.completions.parse(
+    response_format=DisclosureReport,
+    model="gpt-4o-mini",
+    messages=input_messages
+)
+
+message = response.choices[0].message
+
+# Print it out in readable format
+obj = json.loads(message.content)
+print(json.dumps(obj, indent=2))
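
The script above imports `base64` and `Path` but only ever sends a remote URL. As a minimal sketch, assuming you want to send a local PNG instead (for example a page you rotated yourself), the image can be embedded as a base64 data URL, which the chat completions `image_url` field also accepts. The helper name `png_to_data_url` and the file name in the usage comment are hypothetical, not part of the gist.

```python
import base64
from pathlib import Path


def png_to_data_url(png_path):
    """Return a base64 data URL for the PNG file at png_path."""
    raw_bytes = Path(png_path).read_bytes()
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


# Hypothetical usage: swap the remote URL for a local, already-rotated page.
# INPUT_URL = png_to_data_url("rotated-page.png")
```

The rest of the script would stay the same, since `INPUT_URL` is only used inside the `image_url` message part.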
diff --git a/output-scanned-financial-disclosure.py b/output-scanned-financial-disclosure.py
@@ -0,0 +1,134 @@
+{
+  "assets": [
+    {
+      "owner": "SP",
+      "asset_name": "820 Sir Francis Drake Blvd., San Anselmo, CA - Commercial Property",
+      "asset_value_low": 1000001,
+      "asset_value_high": 5000000,
+      "income_type": "RENT",
+      "income_low": 100001,
+      "income_high": 1000000,
+      "transaction_type": "P"
+    },
+    {
+      "owner": "SP",
+      "asset_name": "Access Technology Partners, LP",
+      "asset_value_low": 0,
+      "asset_value_high": 0,
+      "income_type": "PARTNERSHIP INCOME/(LOSS)",
+      "income_low": -1000000,
+      "income_high": -1,
+      "transaction_type": "S"
+    },
+    {
+      "owner": "SP",
+      "asset_name": "Active, LLC",
+      "asset_value_low": 15001,
+      "asset_value_high": 50000,
+      "income_type": "PARTNERSHIP INCOME/(LOSS)",
+      "income_low": -200,
+      "income_high": -1,
+      "transaction_type": "P"
+    },
+    {
+      "owner": "SP",
+      "asset_name": "Agile Software Corp. - Public Common Stock",
+      "asset_value_low": 0,
+      "asset_value_high": 0,
+      "income_type": "CAPITAL GAIN",
+      "income_low": 15001,
+      "income_high": 50000,
+      "transaction_type": "S"
+    },
+    {
+      "owner": "SP",
+      "asset_name": "Akamai Technologies Inc. - Public Common Stock",
+      "asset_value_low": 50001,
+      "asset_value_high": 100000,
+      "income_type": "NONE",
+      "income_low": null,
+      "income_high": null,
+      "transaction_type": "P"
+    },
+    {
+      "owner": "SP",
+      "asset_name": "Alcatel Lucent Ads - Public Common Stock",
+      "asset_value_low": 1001,
+      "asset_value_high": 15000,
+      "income_type": "DIVIDENDS",
+      "income_low": 1,
+      "income_high": 200,
+      "transaction_type": null
+    },
+    {
+      "owner": "SP",
+      "asset_name": "Alcoa Inc. - Public Common Stock",
+      "asset_value_low": 15001,
+      "asset_value_high": 50000,
+      "income_type": "DIVIDENDS",
+      "income_low": 201,
+      "income_high": 1000,
+      "transaction_type": "P"
+    },
+    {
+      "owner": "SP",
+      "asset_name": "American International Group Inc. - Public Common Stock",
+      "asset_value_low": 250001,
+      "asset_value_high": 500000,
+      "income_type": "DIVIDENDS",
+      "income_low": 2501,
+      "income_high": 5000,
+      "transaction_type": null
+    },
+    {
+      "owner": "SP",
+      "asset_name": "Americas Doctors.com - Preferred Stock",
+      "asset_value_low": 1001,
+      "asset_value_high": 15000,
+      "income_type": "NONE",
+      "income_low": null,
+      "income_high": null,
+      "transaction_type": null
+    },
+    {
+      "owner": "SP",
+      "asset_name": "Apple Computer - Public Common Stock",
+      "asset_value_low": 5000001,
+      "asset_value_high": 25000000,
+      "income_type": "CAPITAL GAIN",
+      "income_low": 100001,
+      "income_high": 1000000,
+      "transaction_type": "S"
+    },
+    {
+      "owner": "SP",
+      "asset_name": "Aristotle, LLC",
+      "asset_value_low": 15001,
+      "asset_value_high": 50000,
+      "income_type": "NONE",
+      "income_low": null,
+      "income_high": null,
+      "transaction_type": null
+    },
+    {
+      "owner": "SP",
+      "asset_name": "Ashlar, Inc. - Common Stock",
+      "asset_value_low": 0,
+      "asset_value_high": 0,
+      "income_type": "CAPITAL GAIN/(LOSS)",
+      "income_low": -1001,
+      "income_high": -201,
+      "transaction_type": "S"
+    },
+    {
+      "owner": "SP",
+      "asset_name": "AT&T - Public Common Stock",
+      "asset_value_low": 250001,
+      "asset_value_high": 500000,
+      "income_type": "DIVIDENDS",
+      "income_low": 5001,
+      "income_high": 15000,
+      "transaction_type": null
+    }
+  ]
+}
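
The README's caveat about automated error checks could be sketched roughly as follows. This is an illustration rather than part of the gist: `check_asset` and `known_brackets` are hypothetical names, and the rules (both bounds present or both null, low not above high, no income bounds when `income_type` is `NONE`) are assumptions inferred from the sample output above, not official form rules.

```python
def check_asset(asset, known_brackets=None):
    """Return a list of human-readable problems found in one asset record."""
    problems = []
    for prefix in ("asset_value", "income"):
        low = asset.get(f"{prefix}_low")
        high = asset.get(f"{prefix}_high")
        if low is None and high is None:
            continue  # a fully missing range is allowed
        if low is None or high is None:
            problems.append(f"{prefix}: only one bound is set")
        elif low > high:
            problems.append(f"{prefix}: low {low} exceeds high {high}")
        elif known_brackets is not None and (low, high) not in known_brackets:
            # Optional: compare against a caller-supplied set of (low, high) pairs
            problems.append(f"{prefix}: ({low}, {high}) is not a known bracket")
    if asset.get("income_type") == "NONE" and asset.get("income_low") is not None:
        problems.append("income_type is NONE but income bounds are set")
    return problems


# One record copied from the output above, reduced to the checked fields
akamai = {
    "asset_value_low": 50001,
    "asset_value_high": 100000,
    "income_type": "NONE",
    "income_low": None,
    "income_high": None,
}
print(check_asset(akamai))
```

Checks like these only catch structurally impossible values. A value that is plausible but wrong, which is the half-right scenario described in the README, would still need a spot check against the scanned page.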
diff --git a/README.openai-structured-output-demo.md b/README.openai-structured-output-demo.md
@@ -0,0 +1,593 @@
+# Extracting financial disclosure reports and police blotter narratives using OpenAI's Structured Output
+
+> **tl;dr** this demo shows how to call OpenAI's [gpt-4o-mini model](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/), provide it with the URL of a screenshot of a document, and extract data that follows a schema you define. The results are pretty solid even with little effort in defining the data and no effort doing data prep. OpenAI's API could be a cost-efficient tool for large-scale data gathering projects involving public documents.
+
+
+OpenAI announced [Structured Outputs for its API](https://openai.com/index/introducing-structured-outputs-in-the-api/), a feature that allows users to specify the fields and schema of extracted data, and guarantees that the JSON output will follow that specification. 
+
+For example, given a Congressional financial disclosure report, with assets defined in a table like this:
+
+<img width="859" alt="image" src="https://gist.github.com/user-attachments/assets/e64c7ad1-d7af-4e51-a3f2-5961fde4fac3">
+
+You define the data model you're expecting to extract, either in JSON schema or (as this demo does) via [the pydantic library](https://docs.pydantic.dev/latest/concepts/json_schema/):
+
+```py
+class Asset(BaseModel):
+    asset_name: str
+    owner: str
+    location: Union[str, None]
+    asset_value_low: Union[int, None]
+    asset_value_high: Union[int, None]
+    income_type: str
+    income_low: Union[int, None]
+    income_high: Union[int, None]
+    tx_gt_1000: bool
+
+class DisclosureReport(BaseModel):
+    assets: list[Asset]
+```
+
+OpenAI's API infers from the field names (the above example is basic; there are [ways to provide detailed descriptions](https://docs.pydantic.dev/latest/concepts/fields/#default-values) for each data field) how your data model relates to the actual document you're trying to parse, and produces the extracted data in JSON format:
+
+```json
+    {
+      "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]",
+      "owner": "JT",
+      "location": "St. Helena/Napa, CA, US",
+      "asset_value_low": 5000001,
+      "asset_value_high": 25000000,
+      "income_type": "Grape Sales",
+      "income_low": 100001,
+      "income_high": 1000000,
+      "tx_gt_1000": false
+    },
+    {
+      "asset_name": "25 Point Lobos - Commercial Property [RP]",
+      "owner": "SP",
+      "location": "San Francisco/San Francisco, CA, US",
+      "asset_value_low": 5000001,
+      "asset_value_high": 25000000,
+      "income_type": "Rent",
+      "income_low": 100001,
+      "income_high": 1000000,
+      "tx_gt_1000": false
+    }
+```
+
+This demo gist provides code and results for two scenarios:
+
+- [Financial disclosure reports](#example-financial-disclosures): this is a data-tables-in-PDF problem where you'd typically have to use a PDF parsing library like [pdfplumber](https://github.com/jsvine/pdfplumber) and write your own data parsing methods.
+- [Newspaper police blotter](#example-police-blotter): this is a situation of irregular information — brief descriptions of reported crime incidents, written by a human reporter —  where you'd employ humans to read, interpret, and do data entry. 
+
+Note: these are very basic examples, using the bare minimum of instructions to the API (e.g. "Extract the text from this image") and relatively little code to define the expected data schema.
+
+## How to run this code/use this demo
+
+Each example has the Python script used to produce the corresponding JSON output. To re-run these scripts on your own, the first thing you need to do is to create your own OpenAI developer account at [platform.openai.com](http://platform.openai.com), then: 
+
+- Put a [couple of bucks into your account balance](https://platform.openai.com/settings/organization/billing/overview). Both of these examples use around [30,000-50,000 tokens](https://openai.com/api/pricing/), i.e. each costs about half a cent to execute.
+- Create an [API key](https://platform.openai.com/api-keys)
+- Set it as your [$OPENAI_API_KEY environment variable](https://platform.openai.com/docs/quickstart/create-and-export-an-api-key)
+    - [alternatively](https://github.com/openai/openai-python?tab=readme-ov-file#usage), you can paste your key into the `api_key` argument, i.e. replace `client = OpenAI()` with `client = OpenAI(api_key='Yourkeyhere')`
+
+Then install the [OpenAI Python SDK](https://github.com/openai/openai-python) and [pydantic](https://docs.pydantic.dev/latest/#why-use-pydantic):
+
+```sh
+pip install openai pydantic
+```
+
+For ease of use, these scripts are set up to use gpt-4o-mini's vision capabilities to ingest PNG files via web URLs. If you want to modify the script to test a URL of your choosing, simply modify the `INPUT_URL` variable at the top of the script. 
+
+
+
+
+<br>
+<hr>
+<p id="example-financial-disclosures"></p>
+
+## Financial disclosure report
+
+- The script: [extract-financial-disclosure.py](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-extract-financial-disclosure-py)
+- The results: [output-financial-disclosure.json](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-financial-disclosure-json)
+
+The following screenshot is taken from the PDF of the full report, which can be found at [disclosures-clerk.house.gov](https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2023/10059734.pdf). Note that this example simply passes a *PNG screenshot of the PDF* to OpenAI's API, so results may be different or more efficient if you send it the actual PDF.
+
+<img width="905" alt="image" src="https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504">
+
+
+As shown in the following snippet, the results look accurate and as expected. Note that it also correctly parses the "Location" and "Description" fields (when they exist), even though those fields aren't provided in tabular format (i.e. they're globbed into the "Asset" description as free-form text).
+
+
+It also understands that `tx_gt_1000` corresponds to the `Tx. > $1,000?` header, and that that field contains checkboxes. Even though the sample page has no examples of checked checkboxes, the model correctly infers that `tx_gt_1000` is false.
+
+
+<img width="859" alt="image" src="https://gist.github.com/user-attachments/assets/e64c7ad1-d7af-4e51-a3f2-5961fde4fac3">
+
+```json
+    {
+      "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]",
+      "owner": "OL",
+      "location": "New York, NY, US",
+      "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors.",
+      "asset_value_low": 1000001,
+      "asset_value_high": 5000000,
+      "income_type": "Partnership Income",
+      "income_low": 50001,
+      "income_high": 100000,
+      "tx_gt_1000": false
+    },
+    {
+      "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]",
+      "owner": "SP",
+      "location": null,
+      "description": null,
+      "asset_value_low": 5000001,
+      "asset_value_high": 25000000,
+      "income_type": "None",
+      "income_low": null,
+      "income_high": null,
+      "tx_gt_1000": false
+    },
+```
+
+It's also nice that I didn't have to do even the minimum of "data prep": I gave it a screenshot of the report page — the top third of which has info I don't need — and it "knew" that it should only care about the data under the "Schedule A: Assets and Unearned Income" header.
+
+If I were scraping financial disclosures for real, I would make use of [json-schema's "description" attribute](https://json-schema.org/learn/getting-started-step-by-step#define-properties), which can be defined via Pydantic like this:
+
+```py
+from typing import Union
+from pydantic import BaseModel, Field
+
+class Asset(BaseModel):
+    asset_name: str = Field(
+        description="The name of the asset, under the 'Asset' header"
+    )
+    owner: str = Field(
+        description="Under the 'Owner' header, a 2-letter abbreviation, e.g. SP, DC, JT"
+    )
+    location: Union[str, None] = Field(
+        description="Some records have 'Location:' text as part of the 'Asset' header"
+    )
+    description: Union[str, None] = Field(
+        description="Some records have 'Description:' text as part of the 'Asset' header"
+    )
+    asset_value_low: Union[int, None] = Field(
+        description="Under the 'Value of Asset' field, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
+    )
+    asset_value_high: Union[int, None] = Field(
+        description="Under the 'Value of Asset' field, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
+    )
+    income_type: str = Field(description="Under the 'Income Type(s)' field")
+    income_low: Union[int, None] = Field(
+        description="Under the 'Income' field, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
+    )
+    income_high: Union[int, None] = Field(
+        description="Under the 'Income' field, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
+    )
+    tx_gt_1000: bool = Field(
+        description="Under the 'Tx. > $1,000?' header: True if the checkbox is checked, False if it is empty"
+    )
+
+
+class DisclosureReport(BaseModel):
+    assets: list[Asset]
+```
+
+But as you can see from the [result JSON](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-financial-disclosure-json), OpenAI's model seems "smart" enough to understand a basic data-copying task without specific instructions. 
+
+
+### Financial disclosure report with no instruction
+
+I was curious how well the model would do without any instruction, i.e. when you don't bother to define a pydantic model and instead pass in a response format of `{"type": "json_object"}`:
+
+
+```py
+response = client.beta.chat.completions.parse(
+    response_format={"type": "json_object"},
+    model="gpt-4o-mini",
+    messages=input_messages
+)
+```
+
+The answer: just fine. You can see the code and full results here:
+
+- [extract-basic-financial-disclosure.py](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-extract-basic-financial-disclosure-py)
+- [output-basic-financial-disclosure.json](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-basic-financial-disclosure-json)
+
+Without a defined schema, the model treated the entire document (not just the Assets Schedule) as data:
+
+```json
+{
+  "document": {
+    "title": "Financial Disclosure Report",
+    "header": "Clerk of the House of Representatives \u2022 Legislative Resource Center \u2022 B81 Cannon Building \u2022 Washington, DC 20515",
+    "filer_information": {
+      "name": "Hon. Nancy Pelosi",
+      "status": "Member",
+      "state_district": "CA11"
+    },
+    "filing_information": {
+      "filing_type": "Annual Report",
+      "filing_year": "2023",
+      "filing_date": "05/15/2024"
+    },
+    "schedule_a": {
+      "title": "Schedule A: Assets and 'Unearned' Income",
+      "assets": [
+        {
+          "asset": "11 Zinfandel Lane - Home & Vineyard [RP]",
+          "owner": "JT",
+          "value": "$5,000,001 - $25,000,000",
+          "income_type": "Grape Sales",
+          "income": "$100,001 - $1,000,000",
+          "location": "St. Helena/Napa, CA, US"
+        },
+```
+
+It left the values as text, e.g. `"value": "$5,000,001 - $25,000,000"` versus `"asset_value_low": 5000001`. And it left out the optional data fields, e.g. location and description, for entries that didn't have them:
+
+```json
+        {
+          "asset": "AllianceBernstein Holding L.P. Units (AB) [OL]",
+          "owner": "SP",
+          "value": "$1,000,001 - $5,000,000",
+          "income_type": "Partnership Income",
+          "income": "$50,001 - $100,000",
+          "location": "New York, NY, US",
+          "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors."
+        },
+        {
+          "asset": "Alphabet Inc. - Class A (GOOGL) [ST]",
+          "owner": "SP",
+          "value": "$5,000,001 - $25,000,000",
+          "income_type": "None"
+        },
+```
+
+
+
+
+
+
+
+
+
+
+
+<br>
+<hr>
+
+<p id="example-police-blotter"></p>
+
+## Newspaper police blotter
+
+- The script: [extract-police-blotter.py](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-extract-police-blotter-py)
+
+- The results: [output-police-blotter.json](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-police-blotter-json)
+
+
+![image](https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703)
+
+The screenshot was taken from the Stanford Daily archives: [https://archives.stanforddaily.com/2004/04/09?page=3&section=MODSMD_ARTICLE12#article](https://archives.stanforddaily.com/2004/04/09?page=3&section=MODSMD_ARTICLE12#article)
+
+
+For reasons that are explained in detail below, this example isn't meant to be a reasonable test of the model's capabilities. But it's a fun experiment to see how well the model performs with something that isn't meant to be "data" and is inherently riddled with data quality issues.
+
+Consider what the data point of a basic crime incident report might contain:
+
+- When: a date and time
+- Where: a place
+- Who:
+    - a victim
+    - a suspect
+- What: the crime the suspect allegedly committed
+
+It's easy to come up with many variations and edge cases:
+
+- No specific time: i.e. "computer science graduate students reported that they had books stolen from the Gates Computer Science Building in the previous five months"
+- No listed place: it's unclear if the reporter purposefully omitted it, or if it was left off the original police report.
+- No suspect ("an alcohol-related medical call") or no victim (e.g. "an accidental fire call"). Or multiple suspects and multiple victims.
+
+Unlike the financial disclosure example, the input data is freeform narrative text. The onus is entirely on us to define what a blotter report is, which ends up requiring us to define what a crime incident is. Not surprisingly, the corresponding Pydantic code is a lot more verbose, and I bet if you asked 1,000 journalists to write a definition, they'd all be different.
+
+Here's what mine looks like:
+
+```py
+
+# Define the data structures in Pydantic:
+# an Incident involves several Persons (victims, perpetrators)
+class Person(BaseModel):
+    description: str
+    gender: str
+    is_student: bool
+
+
+# Pydantic docs on field descriptions:
+# https://docs.pydantic.dev/latest/concepts/fields/
+class Incident(BaseModel):
+    date: str
+    time: str
+    location: str
+    summary: str = Field(description="""Brief summary, less than 30 chars""")
+    category: str = Field(
+        description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
+    )
+    property_damage: str = Field(
+        description="""If a property crime, then a description of what was stolen/damaged/lost"""
+    )
+    arrest_made: bool
+    perpetrators: list[Person]
+    victims: list[Person]
+    incident_text: str = Field(
+        description="""Include the complete verbatim text from the input that pertains to the incident"""
+    )
+
+class Blotter(BaseModel):
+    incidents: list[Incident]
+```
+
+### Police blotter results
+
+
+I ask the model to provide an `incident_text` field, i.e. the verbatim text from which it extracted the incident data point. This is helpful for evaluating the experiment. But for an actual data project, you might want to omit it, since it adds to the number of output tokens and thus the API cost.
+
+```py
+    incident_text: str = Field(
+        description="""Include the complete verbatim text from the input that pertains to the incident"""
+    )
+```
+
+<img width="212" alt="image" src="https://gist.github.com/user-attachments/assets/4ca1b517-dfd6-45aa-b6dd-fa8073de035e">
+
+
+
+The resulting `incident_text` field extracted from the above snippet is basically correct:
+
+> A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.
+
+However, it leaves off the `11:40 p.m.`, which is at the beginning of the printed incident, and is something that I normally would like to include because I want to know everything the model looked at when extracting the data point. 
+
+The `11:40 p.m.` time is correctly included in the rest of the data output:
+
+```json
+{
+  "date": "April 2",
+  "time": "11:40 p.m.",
+  "location": "Rains apartments",
+  "summary": "Bike vandalized",
+  "category": "property",
+  "property_damage": "Wheel of bike",
+  "arrest_made": false,
+  "perpetrators": [
+    {
+      "description": "Two unknown suspects",
+      "gender": "unknown",
+      "is_student": false
+    }
+  ],
+  "victims": [
+    {
+      "description": "A graduate student in the School of Education",
+      "gender": "unknown",
+      "is_student": true
+    }
+  ],
+  "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments."
+}
+
+```
+
+
+
+#### The good
+
+As with the financial disclosure report, my script provides a screenshot and leaves it up to OpenAI's model to figure out what's going on. I was pleasantly surprised at how well gpt-4o-mini did in gleaning structure from a newspaper print listicle, with instructions as basic as: "Extract the text from this image"
+
+For example, at first glance, it seems that every incident in the blotter has a date (in the subhed) and a time (at the beginning of the graf). But under "Thursday, April 1", you can see that pattern already broken:
+
+<img width="214" alt="image" src="https://gist.github.com/user-attachments/assets/c5d37af5-56e0-4b0b-81ab-9cf5e3359316">
+
+
+Is that second graf ("A female administrator in Materials Science...") a continuation of the 9:30 p.m. incident where a "man reported that someone removed his rear license plate"? 
+
+Most human readers, after reading both paragraphs — and then the rest of the blotter — will realize that these are 2 separate incidents. But there's nothing at all in the structure of the text to indicate that. Before I ran this experiment, I thought I would have to provide detailed parsing instructions to the model, e.g.
+
+> What you are reading is a police blotter, a list of reported incidents that police were called to. Every paragraph should be treated as a separate incident. Most incidents, but not all, begin with a timestamp, e.g. "11:20 p.m". 
+
+But the model saw on its own that there are 2 incidents, and that the second one happened on April 1 at an unspecified time.  
+
+```json
+ {
+      "date": "April 1",
+      "time": "9:30 p.m.",
+      "location": "Toyon parking lot",
+      "summary": "License plate stolen",
+      "category": "property",
+      "property_damage": "rear license plate",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "Man",
+          "gender": "unknown",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot."
+    },
+    {
+      "date": "April 1",
+      "time": "unknown",
+      "location": "unknown",
+      "summary": "Unauthorized purchase reported",
+      "category": "other",
+      "property_damage": "computer equipment",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "Female administrator",
+          "gender": "female",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry\u2019s Electronics sometime in the past five months."
+    },
+```
+
+By my count, there are 19 incidents in this issue of the Stanford Daily's police blotter, and the API correctly returns 19 different incidents.
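A count like this is easy to check mechanically. Here is a minimal sketch, assuming the saved output is a JSON object with its records under a top-level `incidents` key. That key name and the `summarize_incidents` helper are assumptions for illustration; the `date`, `time`, and `summary` fields come from the output shown above.

```python
import json
from collections import Counter

# Two records trimmed from the output shown above.
# The top-level "incidents" key is an assumed name, not confirmed by the script.
SAMPLE = json.loads("""
{
  "incidents": [
    {"date": "April 1", "time": "9:30 p.m.", "summary": "License plate stolen"},
    {"date": "April 1", "time": "unknown", "summary": "Unauthorized purchase reported"}
  ]
}
""")


def summarize_incidents(data):
    """Return (total, per-date counts, incidents whose time is "unknown")."""
    incidents = data["incidents"]
    per_date = Counter(item["date"] for item in incidents)
    untimed = [item for item in incidents if item["time"] == "unknown"]
    return len(incidents), per_date, untimed


total, per_date, untimed = summarize_incidents(SAMPLE)
print(total, dict(per_date), [item["summary"] for item in untimed])
```

Run against the full output file, the first value is the number to compare with a manual count of the blotter.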
+
+
+#### The bad
+
+Again, the data model is inherently messy, and I put in minimal effort to describe what an "incident" is, such as the variety of situations and edge cases. That, plus the inherent limitations of the data, are the root cause of most of the model's problems. 
+
+For example, I intended the `perpetrators` and `victims` to be lists of proper nouns or simple nouns, so that we could ask questions like: "how many incidents involved multiple people?" Given the following incident text:
+
+> A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.
+
+— this is how the model parsed the suspects:
+
+```json
+ "perpetrators": [
+    {
+      "description": "Two unknown suspects",
+      "gender": "unknown",
+      "is_student": false
+    }
+  ]
+```
+
+For a data project, I might have preferred a structure that makes a count of `2` trivial to compute, e.g.: 
+
+```json
+ "perpetrators": [
+    {
+      "description": "Unknown suspect",
+      "gender": "unknown",
+      "is_student": false
+    },
+    {
+      "description": "Unknown suspect",
+      "gender": "unknown",
+      "is_student": false
+    }
+  ]
+```
+
+But how should the model know what I'm trying to do sans specific instructions? I think most humans, given the same minimalist instructions, would have also recorded `"Two unknown suspects"`.
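If changing the schema or the prompt isn't an option, a count can be approximated after the fact. This is a rough sketch, not part of the original scripts: `estimate_headcount` and the `NUMBER_WORDS` lookup are hypothetical, and the heuristic only reads a leading number word or digit in `description`.

```python
import re

# Small, deliberately incomplete lookup of number words (an assumption; extend as needed).
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def estimate_headcount(people):
    """Estimate how many individuals a list of person records describes."""
    total = 0
    for person in people:
        words = re.findall(r"[a-z]+|\d+", person["description"].lower())
        first = words[0] if words else ""
        if first.isdigit():
            total += int(first)
        else:
            # A description without a leading number word counts as one person.
            total += NUMBER_WORDS.get(first, 1)
    return total


perpetrators = [
    {"description": "Two unknown suspects", "gender": "unknown", "is_student": False}
]
print(estimate_headcount(perpetrators))
```

The more robust fix is probably to ask the model for the number directly, but that means changing the data model rather than the post-processing.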
+
+However, the model struggled to fill out the `perpetrators` and `victims` lists. It frequently listed the suspect/perpetrator as the victim when no specific victim was mentioned:
+
+> A male undergraduate was cited and released for running a stop sign on his bike and for not having a bike light or bike license.
+
+```json
+      "victims": [
+        {
+          "description": "A male undergraduate",
+          "gender": "male",
+          "is_student": true
+        }
+      ]
+```
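This kind of mix-up can at least be flagged for review automatically. A minimal sketch, assuming each record has the `victims`, `perpetrators`, and `incident_text` fields shown above; the `flag_suspect_as_victim` helper and its verb list are hypothetical and far from exhaustive.

```python
import re

# Verbs suggesting police acted against the subject (assumed, incomplete list).
ENFORCEMENT = re.compile(r"\b(cited|arrested|booked)\b", re.IGNORECASE)


def flag_suspect_as_victim(incidents):
    """Return incidents that list victims but no perpetrators despite an enforcement verb."""
    return [
        item
        for item in incidents
        if item["victims"]
        and not item["perpetrators"]
        and ENFORCEMENT.search(item["incident_text"])
    ]


SAMPLE_INCIDENTS = [
    {
        "perpetrators": [],
        "victims": [{"description": "A male undergraduate"}],
        "incident_text": "A male undergraduate was cited and released for running "
        "a stop sign on his bike and for not having a bike light or bike license.",
    },
    {
        "perpetrators": [],
        "victims": [{"description": "A man"}],
        "incident_text": "A man reported that someone removed the rear license plate "
        "from his vehicle when it was parked in the Toyon parking lot.",
    },
]

flagged = flag_suspect_as_victim(SAMPLE_INCIDENTS)
print(len(flagged))
```

A flag here only means the record deserves a second look; it does not say who the victim actually is.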
+
+Unsurprisingly, the model also missed when the narrative was more complicated. For example, in the case of the unauthorized purchases at Fry's:
+
+> A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months.
+
+The "female administrator" is not the victim, but the person who reported the crime. The victim would be Stanford University, or more specifically, its MScE department. 
+
+I'm not surprised the model had problems with identifying victims and suspects, though I'm unsure how much extra instruction would be needed to get reliable results from a general model.  
+
+One thing that the model frequently and inexplicably erred on was classifying people's gender.
+
+This is how I defined a `Person` using pydantic:
+
+```py
+class Person(BaseModel):
+    description: str
+    gender: str
+    is_student: bool
+```
+
+Even when the subject's noun has an obvious gender, the model would still flub it:
+
+> A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot.
+
+```json
+      "victims": [
+        {
+          "description": "A man",
+          "gender": "unknown",
+          "is_student": false
+        }
+      ]
+```
+
+It was worse when the subject's noun did not indicate gender, but the rest of the sentence did:
+
+> A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.
+
+```json
+"victims": [
+        {
+          "description": "A graduate student in the School of Education",
+          "gender": "unknown",
+          "is_student": true
+        }
+      ],
+```
+
+Not sure what the issue is. It might be remedied if I provided explicit and thorough instructions and examples, but this seemed like a much easier thing to infer than the other things that OpenAI's model was able to infer on its own.
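Until the cause is clear, records like these can be caught with a crude consistency check. The sketch below is an assumption-heavy heuristic, not something the original scripts do: `gender_hint`, `unknown_gender_with_hint`, and the cue-word lists are hypothetical, and a pronoun in the text may refer to someone other than the victim.

```python
import re

# Assumed cue lists; word boundaries stop "man" from matching inside "woman".
MALE = re.compile(r"\b(man|male|his|him|he)\b", re.IGNORECASE)
FEMALE = re.compile(r"\b(woman|female|her|hers|she)\b", re.IGNORECASE)


def gender_hint(text):
    """Return "male", "female", or "unknown" based on cue words in the text."""
    has_male = bool(MALE.search(text))
    has_female = bool(FEMALE.search(text))
    if has_male and not has_female:
        return "male"
    if has_female and not has_male:
        return "female"
    return "unknown"


def unknown_gender_with_hint(incident):
    """Return (description, hint) pairs for victims marked "unknown" despite cues in the text."""
    hint = gender_hint(incident["incident_text"])
    if hint == "unknown":
        return []
    return [
        (victim["description"], hint)
        for victim in incident["victims"]
        if victim["gender"] == "unknown"
    ]


incident = {
    "victims": [
        {
            "description": "A graduate student in the School of Education",
            "gender": "unknown",
            "is_student": True,
        }
    ],
    "incident_text": "A graduate student in the School of Education reported that "
    "two unknown suspects vandalized the wheel of his bike when it was locked "
    "near the Rains apartments.",
}
print(unknown_gender_with_hint(incident))
```

Texts that contain both male and female cues come back as "unknown", so the check stays quiet whenever it cannot decide.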
+
+
+
+#### The weird
+
+With so many things left to the interpretation of the LLM, it was no surprise that I got different results every time I ran the [extract-police-blotter.py](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-extract-police-blotter-py) script, especially when it comes to the categorization of crimes. 
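One way to see how much two runs disagree is to compare them field by field. A minimal sketch, assuming both runs return the same incidents in the same order, which is a simplification since runs can also differ in how many incidents they return. The `diff_runs` helper is hypothetical; the sample values are taken from two actual runs of the script.

```python
def diff_runs(run_a, run_b, fields=("category", "arrest_made")):
    """Compare two runs position by position; return (index, field, a, b) per mismatch."""
    changes = []
    for index, (a, b) in enumerate(zip(run_a, run_b)):
        for field in fields:
            if a.get(field) != b.get(field):
                changes.append((index, field, a.get(field), b.get(field)))
    return changes


run_a = [
    {"summary": "Medical call responded", "category": "other", "arrest_made": False},
    {"summary": "Citation issued", "category": "other", "arrest_made": True},
]
run_b = [
    {"summary": "Medical call", "category": "call for service", "arrest_made": False},
    {"summary": "Bike citation", "category": "other", "arrest_made": False},
]

changes = diff_runs(run_a, run_b)
for change in changes:
    print(change)
```

Free-text fields such as `summary` are left out of the default comparison because they change on almost every run.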
+
+In the data specification, I did attempt to describe for the model what I wanted for `category`: 
+
+```py
+category: str = Field(
+    description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
+)
+```
+
+Given the option of saying "other", the model seemed eager to use it for any slightly vague situation. It classified the unauthorized purchases at Fry's as "other", even though embezzlement would better fit under property crimes by the [FBI's UCR definition](https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/topic-pages/offense-definitions). Maybe this could be fixed by providing the model with detailed examples and definitions of statutes and criminal code?  
+
+But ultimately, as I said from the start, the model's performance is bounded by the limitations and errors in the source data. For example, an incident where someone gets hit on the head with a bottle seems to me obviously "violent", i.e. assault:
+
+> A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested.
+
+However, the model thinks it is "other":
+
+```json
+{
+      "date": "April 4",
+      "time": "3:05 a.m.",
+      "location": "Sigma Alpha Epsilon",
+      "summary": "Altercation reported",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": false,
+      "perpetrators": [
+        {
+          "description": "Two undergraduate suspects",
+          "gender": "unknown",
+          "is_student": true
+        }
+      ],
+      "victims": [
+        {
+          "description": "A male undergraduate",
+          "gender": "male",
+          "is_student": true
+        }
+      ],
+      "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
+    }
+```
+
+But is the model necessarily wrong? Two "suspects" were apparently identified, but no one was actually arrested. I took this to mean that the suspects fled and hadn't been located at the time of the report. But maybe it's something more benign: an "altercation" happened, but when the cops arrived, everyone was cool including the guy who got hit by the bottle, thus no allegation of assault for police to act on or file as part of their UCR statistics. Ultimately we have to guess the author's intent.
+
+The OpenAI model's performance here wouldn't be good enough for a real data project, but again, this was just a toy experiment. It doesn't represent what you'd get if you spent more than 10 minutes thinking about the data model, never mind picked a data source slightly more structured than a newspaper listicle. I think OpenAI's model would work very well for something with more substantive text and formal structure, such as obituaries. 
diff --git a/extract-basic-financial-disclosure.py b/extract-basic-financial-disclosure.py
@@ -0,0 +1,60 @@
+#!/usr/bin/env python3
+
+"""
+extract-basic-financial-disclosure.py
+Parses and extracts structured data — and lets the model infer the structure by itself —
+from the screenshot at the given URL:
+
+https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504
+
+Full financial disclosure report:
+https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2023/10059734.pdf
+
+
+This script assumes your API key is set up in the default way,
+  i.e. environment variable: $OPENAI_API_KEY
+  https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
+
+"""
+import base64
+import json
+from openai import OpenAI
+from pathlib import Path
+from typing import Union
+
+INPUT_URL = "https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504"
+
+
+## initialize OpenAI client
+client = OpenAI()
+
+# Example of message format for passing in an image via URL
+# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
+input_messages = [
+    {"role": "system", "content": "Output the result in JSON format."},
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Extract the text from this image"},
+            {
+                "type": "image_url",
+                "image_url": {"url": INPUT_URL},
+            },
+        ],
+    },
+]
+
+# we are letting the model infer the data structure by itself
+# but we still need to tell it to respond in JSON, hence
+# response_format={"type": "json_object"}
+response = client.beta.chat.completions.parse(
+    response_format={"type": "json_object"},
+    model="gpt-4o-mini",
+    messages=input_messages
+)
+
+message = response.choices[0].message
+
+# Print it out in readable format
+obj = json.loads(message.content)
+print(json.dumps(obj, indent=2))
diff --git a/output-basic-financial-disclosure.json b/output-basic-financial-disclosure.json
@@ -0,0 +1,66 @@
+{
+  "document": {
+    "title": "Financial Disclosure Report",
+    "header": "Clerk of the House of Representatives \u2022 Legislative Resource Center \u2022 B81 Cannon Building \u2022 Washington, DC 20515",
+    "filer_information": {
+      "name": "Hon. Nancy Pelosi",
+      "status": "Member",
+      "state_district": "CA11"
+    },
+    "filing_information": {
+      "filing_type": "Annual Report",
+      "filing_year": "2023",
+      "filing_date": "05/15/2024"
+    },
+    "schedule_a": {
+      "title": "Schedule A: Assets and 'Unearned' Income",
+      "assets": [
+        {
+          "asset": "11 Zinfandel Lane - Home & Vineyard [RP]",
+          "owner": "JT",
+          "value": "$5,000,001 - $25,000,000",
+          "income_type": "Grape Sales",
+          "income": "$100,001 - $1,000,000",
+          "location": "St. Helena/Napa, CA, US"
+        },
+        {
+          "asset": "25 Point Lobos - Commercial Property [RP]",
+          "owner": "SP",
+          "value": "$5,000,001 - $25,000,000",
+          "income_type": "Rent",
+          "income": "$100,001 - $1,000,000",
+          "location": "San Francisco/San Francisco, CA, US"
+        },
+        {
+          "asset": "45 Belden Place - Four Story Commercial Building [RP]",
+          "owner": "SP",
+          "value": "$5,000,001 - $25,000,000",
+          "income_type": "Rent",
+          "income": "$100,001 - $1,000,000",
+          "location": "San Francisco/San Francisco, CA, US"
+        },
+        {
+          "asset": "AllianceBernstein Holding L.P. Units (AB) [OL]",
+          "owner": "SP",
+          "value": "$1,000,001 - $5,000,000",
+          "income_type": "Partnership Income",
+          "income": "$50,001 - $100,000",
+          "location": "New York, NY, US",
+          "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors."
+        },
+        {
+          "asset": "Alphabet Inc. - Class A (GOOGL) [ST]",
+          "owner": "SP",
+          "value": "$5,000,001 - $25,000,000",
+          "income_type": "None"
+        },
+        {
+          "asset": "Amazon.com, Inc. (AMZN) [ST]",
+          "owner": "SP",
+          "value": "$5,000,001 - $25,000,000",
+          "income_type": "None"
+        }
+      ]
+    }
+  }
+}
diff --git a/extract-financial-disclosure.py b/extract-financial-disclosure.py
@@ -1,6 +1,8 @@
 #!/usr/bin/env python3
 
 """
+extract-financial-disclosure.py
+
 Parses and extracts structured data from the screenshot at the given URL:
 
 https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504

diff --git a/extract-police-blotter.py b/extract-police-blotter.py
@@ -1,8 +1,9 @@
 #!/usr/bin/env python3
+
 """
-blotter-extract.py
+extract-police-blotter.py
 
-Parses and extracts structured data from the screenshot of a police blotter at the given URL:
+Parses and extracts structured data from the screenshot at the given URL:
 
 https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703
 
@@ -40,7 +41,7 @@ class Incident(BaseModel):
     location: str
     summary: str = Field(description="""Brief summary, less than 30 chars""")
     category: str = Field(
-        description="""Type of crime, broadly speaking: "violent" , "property", "traffic", or "other" """
+        description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
     )
     property_damage: str = Field(
         description="""If a property crime, then a description of what was stolen/damaged/lost"""

diff --git a/output-financial-disclosure.json b/output-financial-disclosure.json
@@ -4,6 +4,7 @@
       "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]",
       "owner": "JT",
       "location": "St. Helena/Napa, CA, US",
+      "description": null,
       "asset_value_low": 5000001,
       "asset_value_high": 25000000,
       "income_type": "Grape Sales",
@@ -15,6 +16,7 @@
       "asset_name": "25 Point Lobos - Commercial Property [RP]",
       "owner": "SP",
       "location": "San Francisco/San Francisco, CA, US",
+      "description": null,
       "asset_value_low": 5000001,
       "asset_value_high": 25000000,
       "income_type": "Rent",
@@ -26,6 +28,7 @@
       "asset_name": "45 Belden Place - Four Story Commercial Building [RP]",
       "owner": "SP",
       "location": "San Francisco/San Francisco, CA, US",
+      "description": null,
       "asset_value_low": 5000001,
       "asset_value_high": 25000000,
       "income_type": "Rent",
@@ -35,8 +38,9 @@
     },
     {
       "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]",
-      "owner": "SP",
+      "owner": "OL",
       "location": "New York, NY, US",
+      "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors.",
       "asset_value_low": 1000001,
       "asset_value_high": 5000000,
       "income_type": "Partnership Income",
@@ -48,6 +52,7 @@
       "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]",
       "owner": "SP",
       "location": null,
+      "description": null,
       "asset_value_low": 5000001,
       "asset_value_high": 25000000,
       "income_type": "None",
@@ -59,6 +64,7 @@
       "asset_name": "Amazon.com, Inc. (AMZN) [ST]",
       "owner": "SP",
       "location": null,
+      "description": null,
       "asset_value_low": 5000001,
       "asset_value_high": 25000000,
       "income_type": "None",

diff --git a/output-police-blotter.json b/output-police-blotter.json
@@ -6,12 +6,12 @@
       "location": "Toyon parking lot",
       "summary": "License plate stolen",
       "category": "property",
-      "property_damage": "rear license plate",
+      "property_damage": "Rear license plate",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Man",
+          "description": "A man",
           "gender": "unknown",
           "is_student": false
         }
@@ -21,28 +21,28 @@
     {
       "date": "April 1",
       "time": "unknown",
-      "location": "unknown",
+      "location": "Fry's Electronics",
       "summary": "Unauthorized purchase reported",
       "category": "other",
-      "property_damage": "computer equipment",
+      "property_damage": "None",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Female administrator",
+          "description": "A female administrator in Materials Science and Engineering",
           "gender": "female",
           "is_student": false
         }
       ],
-      "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry\u2019s Electronics sometime in the past five months."
+      "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months."
     },
     {
       "date": "April 2",
       "time": "3:30 p.m.",
       "location": "unknown",
-      "summary": "Report of license plate stolen",
+      "summary": "Rear license plate stolen reported",
       "category": "property",
-      "property_damage": "rear license plate",
+      "property_damage": "Rear license plate",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
@@ -52,74 +52,62 @@
           "is_student": false
         }
       ],
-      "incident_text": "Another man reported that the rear license plate was missing from his vehicle, which had been parked near 655 Serra Street."
+      "incident_text": "Another man reported that the rear license plate was missing from his vehicle."
     },
     {
-      "date": "April 1",
+      "date": "April 2",
       "time": "11:40 p.m.",
       "location": "Rains apartments",
-      "summary": "Vandalism reported",
+      "summary": "Bike vandalized",
       "category": "property",
-      "property_damage": "bike wheel vandalized",
+      "property_damage": "Wheel of bike",
       "arrest_made": false,
       "perpetrators": [
         {
-          "description": "Unknown suspects",
+          "description": "Two unknown suspects",
           "gender": "unknown",
           "is_student": false
         }
       ],
       "victims": [
         {
-          "description": "Graduate student",
+          "description": "A graduate student in the School of Education",
           "gender": "unknown",
           "is_student": true
         }
       ],
       "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments."
     },
     {
-      "date": "April 1",
+      "date": "April 2",
       "time": "11:40 p.m.",
       "location": "Adelfa",
-      "summary": "Medical call responded",
-      "category": "other",
-      "property_damage": "",
-      "arrest_made": false,
-      "perpetrators": [],
-      "victims": [],
-      "incident_text": "Police responded to an alcohol-related medical call in Adelfa."
-    },
-    {
-      "date": "April 1",
-      "time": "unknown",
-      "location": "Gates Computer Science Building",
-      "summary": "Theft reported",
-      "category": "property",
-      "property_damage": "books stolen",
+      "summary": "Medical call",
+      "category": "call for service",
+      "property_damage": "None",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Three computer science graduate students",
+          "description": "unknown",
           "gender": "unknown",
-          "is_student": true
+          "is_student": false
         }
       ],
-      "incident_text": "Three computer science graduate students reported that they had books stolen from the Gates Computer Science Building in the previous five months."
+      "incident_text": "Police responded to an alcohol-related medical call in Adelfa."
     },
     {
       "date": "April 3",
       "time": "10:20 p.m.",
       "location": "unknown",
-      "summary": "Citation issued",
+      "summary": "Bike citation",
       "category": "other",
-      "property_damage": "",
-      "arrest_made": true,
+      "property_damage": "None",
+      "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Male undergraduate",
+          "description": "A male undergraduate",
           "gender": "male",
           "is_student": true
         }
@@ -129,15 +117,15 @@
     {
       "date": "April 3",
       "time": "11:20 p.m.",
-      "location": "Lomit Drive",
-      "summary": "Minor in possession cited",
+      "location": "Lomita Drive",
+      "summary": "Minor in possession citation",
       "category": "other",
-      "property_damage": "",
-      "arrest_made": true,
+      "property_damage": "None",
+      "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Woman",
+          "description": "A woman",
           "gender": "female",
           "is_student": false
         }
@@ -148,14 +136,14 @@
       "date": "April 3",
       "time": "11:40 p.m.",
       "location": "unknown",
-      "summary": "Citation issued",
+      "summary": "Driving citation",
       "category": "other",
-      "property_damage": "",
-      "arrest_made": true,
+      "property_damage": "None",
+      "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Man",
+          "description": "A man",
           "gender": "male",
           "is_student": false
         }
@@ -166,14 +154,14 @@
       "date": "April 4",
       "time": "1:00 a.m.",
       "location": "unknown",
-      "summary": "Damage report",
+      "summary": "Car damage reported",
       "category": "property",
-      "property_damage": "car damage",
+      "property_damage": "Trunk and hood",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Man",
+          "description": "A man",
           "gender": "male",
           "is_student": false
         }
@@ -184,10 +172,10 @@
       "date": "April 4",
       "time": "1:31 a.m.",
       "location": "Mayfield Avenue",
-      "summary": "Cited for biking rules",
+      "summary": "Bikes U-Locked incident",
       "category": "other",
-      "property_damage": "",
-      "arrest_made": true,
+      "property_damage": "None",
+      "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
@@ -204,18 +192,18 @@
       "location": "Sigma Alpha Epsilon",
       "summary": "Altercation reported",
       "category": "other",
-      "property_damage": "injury reported",
+      "property_damage": "None",
       "arrest_made": false,
       "perpetrators": [
         {
-          "description": "Two undergraduates",
+          "description": "Two undergraduate suspects",
           "gender": "unknown",
           "is_student": true
         }
       ],
       "victims": [
         {
-          "description": "Male undergraduate",
+          "description": "A male undergraduate",
           "gender": "male",
           "is_student": true
         }
@@ -226,14 +214,14 @@
       "date": "April 5",
       "time": "2:45 a.m.",
       "location": "unknown",
-      "summary": "Arrest made",
+      "summary": "Arrest for intoxication",
       "category": "other",
-      "property_damage": "",
+      "property_damage": "None",
       "arrest_made": true,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Man",
+          "description": "A man",
           "gender": "male",
           "is_student": false
         }
@@ -243,75 +231,87 @@
     {
       "date": "April 6",
       "time": "7:15 a.m.",
-      "location": "Andronico\u2019s Supermarket",
-      "summary": "Assistance provided",
-      "category": "other",
-      "property_damage": "",
+      "location": "Andronico's Supermarket",
+      "summary": "Assisted in detaining suspect",
+      "category": "call for service",
+      "property_damage": "None",
       "arrest_made": false,
       "perpetrators": [],
-      "victims": [],
-      "incident_text": "Police assisted Andronico\u2019s Supermarket in detaining a suspect after a call was made on a blue emergency phone. Palo Alto police later took the suspect into custody."
+      "victims": [
+        {
+          "description": "unknown",
+          "gender": "unknown",
+          "is_student": false
+        }
+      ],
+      "incident_text": "Police assisted Andronico's Supermarket in detaining a suspect after a call was made on a blue emergency phone. Palo Alto police later took the suspect into custody."
     },
     {
       "date": "April 6",
       "time": "3:20 p.m.",
       "location": "Studio 3 on Angell Court",
-      "summary": "Fire call responded",
-      "category": "other",
-      "property_damage": "",
+      "summary": "Accidental fire call",
+      "category": "call for service",
+      "property_damage": "None",
       "arrest_made": false,
       "perpetrators": [],
-      "victims": [],
+      "victims": [
+        {
+          "description": "unknown",
+          "gender": "unknown",
+          "is_student": false
+        }
+      ],
       "incident_text": "Police responded to an accidental fire call to Studio 3 on Angell Court."
     },
     {
       "date": "April 6",
       "time": "9:00 p.m.",
       "location": "unknown",
-      "summary": "Report of found property",
+      "summary": "Found property reported",
       "category": "property",
-      "property_damage": "personal property",
+      "property_damage": "None",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Man",
+          "description": "A man",
           "gender": "male",
           "is_student": false
         }
       ],
-      "incident_text": "A man reported that he found someone else\u2019s personal property in his locked car."
+      "incident_text": "A man reported that he found someone else's personal property in his locked car."
     },
     {
       "date": "April 6",
       "time": "1:45 a.m.",
       "location": "San Jose main jail",
       "summary": "Arrest for trespassing",
       "category": "other",
-      "property_damage": "",
+      "property_damage": "None",
       "arrest_made": true,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Local vagrant",
+          "description": "A local vagrant",
           "gender": "unknown",
           "is_student": false
         }
       ],
-      "incident_text": "A local vagrant was booked into the San Jose main jail for trespassing \u2013 his seventh time trespassing in a month."
+      "incident_text": "A local vagrant was booked into the San Jose main jail for trespassing - his seventh time trespassing in a month."
     },
     {
       "date": "April 6",
       "time": "2:50 a.m.",
       "location": "Palm Drive",
-      "summary": "Citation issued",
+      "summary": "Driving citation",
       "category": "other",
-      "property_damage": "",
-      "arrest_made": true,
+      "property_damage": "None",
+      "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "Man",
+          "description": "A man",
           "gender": "male",
           "is_student": false
         }

diff --git a/extract-financial-disclosure.py b/extract-financial-disclosure.py
@@ -39,7 +39,6 @@ class Asset(BaseModel):
     income_type: str
     income_low: Union[int, None]
     income_high: Union[int, None]
-
     tx_gt_1000: bool
 
 class DisclosureReport(BaseModel):

diff --git a/output-disclosure-report.json → output-financial-disclosure.json b/output-disclosure-report.json → output-financial-disclosure.json
diff --git a/output-blotter-entries.json → output-police-blotter.json b/output-blotter-entries.json → output-police-blotter.json
diff --git a/financial-disclosure-extractor.py → extract-financial-disclosure.py b/financial-disclosure-extractor.py → extract-financial-disclosure.py
diff --git a/blotter-extract.py → extract-police-blotter.py b/blotter-extract.py → extract-police-blotter.py
diff --git a/output-disclosure-report.json b/output-disclosure-report.json
@@ -3,6 +3,7 @@
     {
       "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]",
       "owner": "JT",
+      "location": "St. Helena/Napa, CA, US",
       "asset_value_low": 5000001,
       "asset_value_high": 25000000,
       "income_type": "Grape Sales",
@@ -13,6 +14,7 @@
     {
       "asset_name": "25 Point Lobos - Commercial Property [RP]",
       "owner": "SP",
+      "location": "San Francisco/San Francisco, CA, US",
       "asset_value_low": 5000001,
       "asset_value_high": 25000000,
       "income_type": "Rent",
@@ -23,6 +25,7 @@
     {
       "asset_name": "45 Belden Place - Four Story Commercial Building [RP]",
       "owner": "SP",
+      "location": "San Francisco/San Francisco, CA, US",
       "asset_value_low": 5000001,
       "asset_value_high": 25000000,
       "income_type": "Rent",
@@ -33,6 +36,7 @@
     {
       "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]",
       "owner": "SP",
+      "location": "New York, NY, US",
       "asset_value_low": 1000001,
       "asset_value_high": 5000000,
       "income_type": "Partnership Income",
@@ -43,6 +47,7 @@
     {
       "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]",
       "owner": "SP",
+      "location": null,
       "asset_value_low": 5000001,
       "asset_value_high": 25000000,
       "income_type": "None",
@@ -53,6 +58,7 @@
     {
       "asset_name": "Amazon.com, Inc. (AMZN) [ST]",
       "owner": "SP",
+      "location": null,
       "asset_value_low": 5000001,
       "asset_value_high": 25000000,
       "income_type": "None",

diff --git a/financial-disclosure-extractor.py b/financial-disclosure-extractor.py
@@ -33,6 +33,7 @@
 class Asset(BaseModel):
     asset_name: str
     owner: str
+    location: Union[str, None]
     asset_value_low: Union[int, None]
     asset_value_high: Union[int, None]
     income_type: str

diff --git a/extracted-blotter-entries.json → output-blotter-entries.json b/extracted-blotter-entries.json → output-blotter-entries.json
diff --git a/extracted-disclosure-report.json → output-disclosure-report.json b/extracted-disclosure-report.json → output-disclosure-report.json
diff --git a/README.police-blotter-structured-output-demo.md b/README.police-blotter-structured-output-demo.md
@@ -1,5 +0,0 @@
-# Police Blotter data extraction using OpenAI's Structured Output
-
-![image](https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703)
-
-<img width="905" alt="image" src="https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504">

diff --git a/extracted-disclosure-report.json b/extracted-disclosure-report.json
@@ -1,52 +1,64 @@
 {
   "assets": [
     {
-      "asset_name": "11 Zinfandel Lane - Home & Vineyard",
+      "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]",
       "owner": "JT",
-      "asset_value_low": "$5,000,001",
-      "asset_value_high": "$25,000,000",
+      "asset_value_low": 5000001,
+      "asset_value_high": 25000000,
       "income_type": "Grape Sales",
-      "tx_gt_1000": true
+      "income_low": 100001,
+      "income_high": 1000000,
+      "tx_gt_1000": false
     },
     {
-      "asset_name": "25 Point Lobos - Commercial Property",
+      "asset_name": "25 Point Lobos - Commercial Property [RP]",
       "owner": "SP",
-      "asset_value_low": "$5,000,001",
-      "asset_value_high": "$25,000,000",
+      "asset_value_low": 5000001,
+      "asset_value_high": 25000000,
       "income_type": "Rent",
-      "tx_gt_1000": true
+      "income_low": 100001,
+      "income_high": 1000000,
+      "tx_gt_1000": false
     },
     {
-      "asset_name": "45 Belden Place - Four Story Commercial Building",
-      "owner": "RP",
-      "asset_value_low": "$5,000,001",
-      "asset_value_high": "$25,000,000",
+      "asset_name": "45 Belden Place - Four Story Commercial Building [RP]",
+      "owner": "SP",
+      "asset_value_low": 5000001,
+      "asset_value_high": 25000000,
       "income_type": "Rent",
-      "tx_gt_1000": true
+      "income_low": 100001,
+      "income_high": 1000000,
+      "tx_gt_1000": false
     },
     {
-      "asset_name": "AllianceBernstein Holding L.P. Units (AB)",
-      "owner": "OL",
-      "asset_value_low": "$1,000,001",
-      "asset_value_high": "$5,000,000",
+      "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]",
+      "owner": "SP",
+      "asset_value_low": 1000001,
+      "asset_value_high": 5000000,
       "income_type": "Partnership Income",
-      "tx_gt_1000": true
+      "income_low": 50001,
+      "income_high": 100000,
+      "tx_gt_1000": false
     },
     {
-      "asset_name": "Alphabet Inc. - Class A (GOOGL)",
+      "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]",
       "owner": "SP",
-      "asset_value_low": "$5,000,001",
-      "asset_value_high": "$25,000,000",
+      "asset_value_low": 5000001,
+      "asset_value_high": 25000000,
       "income_type": "None",
+      "income_low": null,
+      "income_high": null,
       "tx_gt_1000": false
     },
     {
-      "asset_name": "Amazon.com, Inc. (AMZN)",
+      "asset_name": "Amazon.com, Inc. (AMZN) [ST]",
       "owner": "SP",
-      "asset_value_low": "$5,000,001",
-      "asset_value_high": "$25,000,000",
+      "asset_value_low": 5000001,
+      "asset_value_high": 25000000,
       "income_type": "None",
+      "income_low": null,
+      "income_high": null,
       "tx_gt_1000": false
     }
   ]
-}
+}
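
The revision above gets integers instead of strings like `"$5,000,001"` by retyping the schema fields. For output already saved in the older string form, a post-processing step is one alternative. The sketch below is only an illustration: `dollars_to_int` is a hypothetical helper, not part of the gist, and it assumes whole-dollar amounts with no cents.

```python
import re


def dollars_to_int(value):
    """Convert a string like "$5,000,001" to the integer 5000001.

    Hypothetical helper: assumes whole-dollar amounts (no cents).
    Returns None when the value contains no digits, e.g. "None".
    """
    if value is None:
        return None
    digits = re.sub(r"[^\d]", "", str(value))
    return int(digits) if digits else None


# Values copied from the earlier, string-typed output
old_asset = {"asset_value_low": "$5,000,001", "asset_value_high": "$25,000,000"}
normalized = {key: dollars_to_int(val) for key, val in old_asset.items()}
print(normalized)
```

Typing the fields as `Union[int, None]` in the Pydantic model, as the later revision does, pushes this work onto the model and keeps the script shorter; the helper is mainly useful for cleaning up older output files.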
diff --git a/financial-disclosure-extractor.py b/financial-disclosure-extractor.py
@@ -0,0 +1,78 @@
+#!/usr/bin/env python3
+
+"""
+Parses and extracts structured data from the screenshot at the given URL:
+
+https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504
+
+Full financial disclosure report:
+https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2023/10059734.pdf
+
+
+This script assumes your API key is set up in the default way,
+  i.e. environment variable: $OPENAI_API_KEY
+  https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
+
+"""
+import base64
+import json
+from openai import OpenAI
+from pathlib import Path
+from pydantic import BaseModel, Field
+from typing import Union
+
+INPUT_URL = "https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504"
+
+
+# OpenAI examples of Structured Output scripts and data definitions
+# https://platform.openai.com/docs/guides/structured-outputs/examples?context=ex2
+
+
+# Define the data structures in Pydantic:
+# a Disclosure Report has a list of assets
+class Asset(BaseModel):
+    asset_name: str
+    owner: str
+    asset_value_low: Union[int, None]
+    asset_value_high: Union[int, None]
+    income_type: str
+    income_low: Union[int, None]
+    income_high: Union[int, None]
+
+    tx_gt_1000: bool
+
+class DisclosureReport(BaseModel):
+    assets: list[Asset]
+
+## initialize OpenAI client
+client = OpenAI()
+
+
+# Example of message format for passing in an image via URL
+# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
+input_messages = [
+    {"role": "system", "content": "Output the result in JSON format."},
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Extract the text from this image"},
+            {
+                "type": "image_url",
+                "image_url": {"url": INPUT_URL},
+            },
+        ],
+    },
+]
+
+# gpt-4o-mini is cheap and fast and has vision capabilities
+response = client.beta.chat.completions.parse(
+    response_format=DisclosureReport,
+    model="gpt-4o-mini",
+    messages=input_messages
+)
+
+message = response.choices[0].message
+
+# Print it out in readable format
+obj = json.loads(message.content)
+print(json.dumps(obj, indent=2))
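
Since the script prints plain JSON, a cheap automated check can run over the result before anyone trusts it. The following is a minimal sketch, not part of the gist: `check_asset` is a hypothetical helper, and it assumes every range should have both bounds set with low no greater than high. Open-ended brackets on a real form (a top bracket with no upper bound, say) would need an exception to that rule.

```python
import json


def check_asset(asset):
    """Return a list of human-readable problems found in one asset dict."""
    problems = []
    for prefix in ("asset_value", "income"):
        low = asset.get(f"{prefix}_low")
        high = asset.get(f"{prefix}_high")
        if (low is None) != (high is None):
            problems.append(f"{prefix}: only one bound is set")
        elif low is not None and low > high:
            problems.append(f"{prefix}: low {low} exceeds high {high}")
    return problems


# Sample shaped like the script's printed output.
# The second asset is made up and deliberately broken.
sample = json.loads("""
{"assets": [
  {"asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]", "owner": "SP",
   "asset_value_low": 5000001, "asset_value_high": 25000000,
   "income_type": "None", "income_low": null, "income_high": null,
   "tx_gt_1000": false},
  {"asset_name": "Hypothetical bad row", "owner": "SP",
   "asset_value_low": 25000000, "asset_value_high": 5000001,
   "income_type": "Rent", "income_low": 100001, "income_high": null,
   "tx_gt_1000": false}
]}
""")

for asset in sample["assets"]:
    for problem in check_asset(asset):
        print(f'{asset["asset_name"]}: {problem}')
```

Checks like this only catch values that are internally inconsistent. A plausible but wrong bracket would pass, so spot-checking against the source image is still needed.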
diff --git a/README.police-blotter-structured-output-demo.md b/README.police-blotter-structured-output-demo.md
@@ -1,3 +1,5 @@
 # Police Blotter data extraction using OpenAI's Structured Output
 
 ![image](https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703)
+
+<img width="905" alt="image" src="https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504">
diff --git a/extracted-disclosure-report.json b/extracted-disclosure-report.json
@@ -0,0 +1,52 @@
+{
+  "assets": [
+    {
+      "asset_name": "11 Zinfandel Lane - Home & Vineyard",
+      "owner": "JT",
+      "asset_value_low": "$5,000,001",
+      "asset_value_high": "$25,000,000",
+      "income_type": "Grape Sales",
+      "tx_gt_1000": true
+    },
+    {
+      "asset_name": "25 Point Lobos - Commercial Property",
+      "owner": "SP",
+      "asset_value_low": "$5,000,001",
+      "asset_value_high": "$25,000,000",
+      "income_type": "Rent",
+      "tx_gt_1000": true
+    },
+    {
+      "asset_name": "45 Belden Place - Four Story Commercial Building",
+      "owner": "RP",
+      "asset_value_low": "$5,000,001",
+      "asset_value_high": "$25,000,000",
+      "income_type": "Rent",
+      "tx_gt_1000": true
+    },
+    {
+      "asset_name": "AllianceBernstein Holding L.P. Units (AB)",
+      "owner": "OL",
+      "asset_value_low": "$1,000,001",
+      "asset_value_high": "$5,000,000",
+      "income_type": "Partnership Income",
+      "tx_gt_1000": true
+    },
+    {
+      "asset_name": "Alphabet Inc. - Class A (GOOGL)",
+      "owner": "SP",
+      "asset_value_low": "$5,000,001",
+      "asset_value_high": "$25,000,000",
+      "income_type": "None",
+      "tx_gt_1000": false
+    },
+    {
+      "asset_name": "Amazon.com, Inc. (AMZN)",
+      "owner": "SP",
+      "asset_value_low": "$5,000,001",
+      "asset_value_high": "$25,000,000",
+      "income_type": "None",
+      "tx_gt_1000": false
+    }
+  ]
+}
diff --git a/extracted-blotter-entries.json b/extracted-blotter-entries.json
@@ -1,48 +1,48 @@
 {
   "incidents": [
     {
-      "date": "Thursday, April 1",
+      "date": "April 1",
       "time": "9:30 p.m.",
       "location": "Toyon parking lot",
       "summary": "License plate stolen",
       "category": "property",
-      "property_damage": "Rear license plate",
+      "property_damage": "rear license plate",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "A man",
+          "description": "Man",
           "gender": "unknown",
           "is_student": false
         }
       ],
       "incident_text": "A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot."
     },
     {
-      "date": "Thursday, April 1",
+      "date": "April 1",
       "time": "unknown",
-      "location": "Fry's Electronics",
-      "summary": "Unauthorized purchase",
+      "location": "unknown",
+      "summary": "Unauthorized purchase reported",
       "category": "other",
-      "property_damage": "None",
+      "property_damage": "computer equipment",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "A female administrator in Materials Science and Engineering",
+          "description": "Female administrator",
           "gender": "female",
           "is_student": false
         }
       ],
-      "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months."
+      "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry\u2019s Electronics sometime in the past five months."
     },
     {
-      "date": "Friday, April 2",
+      "date": "April 2",
       "time": "3:30 p.m.",
       "location": "unknown",
-      "summary": "License plate stolen",
+      "summary": "Report of license plate stolen",
       "category": "property",
-      "property_damage": "Rear license plate",
+      "property_damage": "rear license plate",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
@@ -52,51 +52,51 @@
           "is_student": false
         }
       ],
-      "incident_text": "Another man reported that the rear license plate was missing from his vehicle."
+      "incident_text": "Another man reported that the rear license plate was missing from his vehicle, which had been parked near 655 Serra Street."
     },
     {
-      "date": "Thursday, April 1",
+      "date": "April 1",
       "time": "11:40 p.m.",
       "location": "Rains apartments",
-      "summary": "Vandalism",
+      "summary": "Vandalism reported",
       "category": "property",
-      "property_damage": "Wheel of bike",
+      "property_damage": "bike wheel vandalized",
       "arrest_made": false,
       "perpetrators": [
         {
-          "description": "Two unknown suspects",
+          "description": "Unknown suspects",
           "gender": "unknown",
           "is_student": false
         }
       ],
       "victims": [
         {
-          "description": "A graduate student in the School of Education",
+          "description": "Graduate student",
           "gender": "unknown",
           "is_student": true
         }
       ],
       "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments."
     },
     {
-      "date": "Thursday, April 1",
+      "date": "April 1",
       "time": "11:40 p.m.",
-      "location": "Adelifa",
-      "summary": "Medical call",
+      "location": "Adelfa",
+      "summary": "Medical call responded",
       "category": "other",
-      "property_damage": "None",
+      "property_damage": "",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [],
-      "incident_text": "Police responded to an alcohol-related medical call in Adelifa."
+      "incident_text": "Police responded to an alcohol-related medical call in Adelfa."
     },
     {
-      "date": "Thursday, April 1",
+      "date": "April 1",
       "time": "unknown",
       "location": "Gates Computer Science Building",
-      "summary": "Theft",
+      "summary": "Theft reported",
       "category": "property",
-      "property_damage": "Books",
+      "property_damage": "books stolen",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
@@ -109,103 +109,85 @@
       "incident_text": "Three computer science graduate students reported that they had books stolen from the Gates Computer Science Building in the previous five months."
     },
     {
-      "date": "Saturday, April 3",
+      "date": "April 3",
       "time": "10:20 p.m.",
       "location": "unknown",
-      "summary": "Traffic violation",
+      "summary": "Citation issued",
       "category": "other",
-      "property_damage": "None",
-      "arrest_made": false,
+      "property_damage": "",
+      "arrest_made": true,
       "perpetrators": [],
       "victims": [
         {
-          "description": "A male undergraduate",
+          "description": "Male undergraduate",
           "gender": "male",
           "is_student": true
         }
       ],
       "incident_text": "A male undergraduate was cited and released for running a stop sign on his bike and for not having a bike light or bike license."
     },
     {
-      "date": "Saturday, April 3",
+      "date": "April 3",
       "time": "11:20 p.m.",
-      "location": "Lomita Drive",
-      "summary": "Possession of alcohol",
+      "location": "Lomit Drive",
+      "summary": "Minor in possession cited",
       "category": "other",
-      "property_damage": "None",
-      "arrest_made": false,
+      "property_damage": "",
+      "arrest_made": true,
       "perpetrators": [],
       "victims": [
         {
-          "description": "A woman",
+          "description": "Woman",
           "gender": "female",
           "is_student": false
         }
       ],
       "incident_text": "A woman was cited and released on Lomita Drive for being a minor in possession of alcohol."
     },
     {
-      "date": "Saturday, April 3",
+      "date": "April 3",
       "time": "11:40 p.m.",
       "location": "unknown",
-      "summary": "Driving without license",
+      "summary": "Citation issued",
       "category": "other",
-      "property_damage": "None",
-      "arrest_made": false,
+      "property_damage": "",
+      "arrest_made": true,
       "perpetrators": [],
       "victims": [
         {
-          "description": "A man",
+          "description": "Man",
           "gender": "male",
           "is_student": false
         }
       ],
       "incident_text": "A man was cited and released for driving with a suspended license after he was stopped near Galvez Street and Campus Drive."
     },
     {
-      "date": "Saturday, April 4",
-      "time": "11:45 p.m.",
-      "location": "Lomita Drive",
-      "summary": "Urinating in public",
-      "category": "other",
-      "property_damage": "None",
-      "arrest_made": true,
-      "perpetrators": [],
-      "victims": [
-        {
-          "description": "A woman",
-          "gender": "female",
-          "is_student": false
-        }
-      ],
-      "incident_text": "A woman was cited and released for urinating in public on Lomita Drive."
-    },
-    {
-      "date": "Sunday, April 4",
+      "date": "April 4",
       "time": "1:00 a.m.",
       "location": "unknown",
-      "summary": "Traffic violation",
-      "category": "other",
-      "property_damage": "Damage to car",
+      "summary": "Damage report",
+      "category": "property",
+      "property_damage": "car damage",
       "arrest_made": false,
       "perpetrators": [],
       "victims": [
         {
-          "description": "A man",
+          "description": "Man",
           "gender": "male",
           "is_student": false
         }
       ],
       "incident_text": "A man reported that someone walked over the top of his car, causing damage to the trunk, top and hood."
     },
     {
-      "date": "Sunday, April 4",
+      "date": "April 4",
       "time": "1:31 a.m.",
       "location": "Mayfield Avenue",
-      "summary": "Possession of stolen bikes",
+      "summary": "Cited for biking rules",
       "category": "other",
-      "property_damage": "None",
-      "arrest_made": false,
+      "property_damage": "",
+      "arrest_made": true,
       "perpetrators": [],
       "victims": [
         {
@@ -217,12 +199,12 @@
       "incident_text": "Two men were cited and released for walking with bikes U-Locked to themselves on Mayfield Avenue after neither could establish ownership of the bikes."
     },
     {
-      "date": "Sunday, April 4",
+      "date": "April 4",
       "time": "3:05 a.m.",
       "location": "Sigma Alpha Epsilon",
-      "summary": "Altercation",
+      "summary": "Altercation reported",
       "category": "other",
-      "property_damage": "None",
+      "property_damage": "injury reported",
       "arrest_made": false,
       "perpetrators": [
         {
@@ -233,61 +215,103 @@
       ],
       "victims": [
         {
-          "description": "A male undergraduate",
+          "description": "Male undergraduate",
           "gender": "male",
           "is_student": true
         }
       ],
       "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
     },
     {
-      "date": "Monday, April 5",
+      "date": "April 5",
       "time": "2:45 a.m.",
       "location": "unknown",
-      "summary": "Public intoxication",
+      "summary": "Arrest made",
       "category": "other",
-      "property_damage": "None",
+      "property_damage": "",
       "arrest_made": true,
       "perpetrators": [],
       "victims": [
         {
-          "description": "A man",
+          "description": "Man",
           "gender": "male",
           "is_student": false
         }
       ],
-      "incident_text": "Police arrested a man for being drunk in public."
+      "incident_text": "Police arrested a man for being drunk in public on Palm Drive near the entrance arch."
     },
     {
-      "date": "Tuesday, April 6",
+      "date": "April 6",
+      "time": "7:15 a.m.",
+      "location": "Andronico\u2019s Supermarket",
+      "summary": "Assistance provided",
+      "category": "other",
+      "property_damage": "",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [],
+      "incident_text": "Police assisted Andronico\u2019s Supermarket in detaining a suspect after a call was made on a blue emergency phone. Palo Alto police later took the suspect into custody."
+    },
+    {
+      "date": "April 6",
+      "time": "3:20 p.m.",
+      "location": "Studio 3 on Angell Court",
+      "summary": "Fire call responded",
+      "category": "other",
+      "property_damage": "",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [],
+      "incident_text": "Police responded to an accidental fire call to Studio 3 on Angell Court."
+    },
+    {
+      "date": "April 6",
+      "time": "9:00 p.m.",
+      "location": "unknown",
+      "summary": "Report of found property",
+      "category": "property",
+      "property_damage": "personal property",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "Man",
+          "gender": "male",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A man reported that he found someone else\u2019s personal property in his locked car."
+    },
+    {
+      "date": "April 6",
       "time": "1:45 a.m.",
       "location": "San Jose main jail",
-      "summary": "Trespassing",
+      "summary": "Arrest for trespassing",
       "category": "other",
-      "property_damage": "None",
+      "property_damage": "",
       "arrest_made": true,
       "perpetrators": [],
       "victims": [
         {
-          "description": "A local vagrant",
-          "gender": "male",
+          "description": "Local vagrant",
+          "gender": "unknown",
           "is_student": false
         }
       ],
-      "incident_text": "A local vagrant was booked into the San Jose main jail for trespassing \u2014 his seventh time trespassing in a month."
+      "incident_text": "A local vagrant was booked into the San Jose main jail for trespassing \u2013 his seventh time trespassing in a month."
     },
     {
-      "date": "Tuesday, April 6",
+      "date": "April 6",
       "time": "2:50 a.m.",
       "location": "Palm Drive",
-      "summary": "Driving without a license",
+      "summary": "Citation issued",
       "category": "other",
-      "property_damage": "None",
+      "property_damage": "",
       "arrest_made": true,
       "perpetrators": [],
       "victims": [
         {
-          "description": "A man",
+          "description": "Man",
           "gender": "male",
           "is_student": false
         }

diff --git a/README.md → ....police-blotter-structured-output-demo.md b/README.md → ....police-blotter-structured-output-demo.md
diff --git a/extract.py → blotter-extract.py b/extract.py → blotter-extract.py
@@ -1,7 +1,8 @@
 #!/usr/bin/env python3
-
 """
-Parses and extracts structured data from the screenshot at the given URL:
+blotter-extract.py
+
+Parses and extracts structured data from the screenshot of a police blotter at the given URL:
 
 https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703
 
@@ -10,7 +11,6 @@
   https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
 
 """
-
 import base64
 import json
 from openai import OpenAI
@@ -19,19 +19,28 @@
 
 INPUT_URL = "https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703"
 
+
+# OpenAI examples of Structured Output scripts and data definitions
+# https://platform.openai.com/docs/guides/structured-outputs/examples?context=ex2
+
+
+# Define the data structures in Pydantic:
+# an Incident involves several Persons (victims, perpetrators)
 class Person(BaseModel):
     description: str
     gender: str
     is_student: bool
 
 
+# Pydantic docs on field descriptions:
+# https://docs.pydantic.dev/latest/concepts/fields/
 class Incident(BaseModel):
     date: str
     time: str
     location: str
     summary: str = Field(description="""Brief summary, less than 30 chars""")
     category: str = Field(
-        description="""Type of crime, either "violent" or "property", or "other" """
+        description="""Type of crime, broadly speaking: "violent", "property", "traffic", or "other" """
     )
     property_damage: str = Field(
         description="""If a property crime, then a description of what was stolen/damaged/lost"""
@@ -44,36 +53,43 @@ class Incident(BaseModel):
     )
 
 
-
 class Blotter(BaseModel):
     incidents: list[Incident]
 
 
+## done defining the data structures
+##################################################
+
+
 ## initialize OpenAI client
 client = OpenAI()
 
-with open(INPUT_PATH, "rb") as img:
-    image_data = base64.b64encode(img.read()).decode("utf-8")
 
+# Example of message format for passing in an image via URL
+# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
 input_messages = [
     {"role": "system", "content": "Output the result in JSON format."},
-    {"role": "user", "content": [
-        {"type": "text", "text": "Extract the text from this image"},
-        {
-            "type": "image_url",
-            "image_url": {"url": INPUT_URL},
-        },
-
-    ]},
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Extract the text from this image"},
+            {
+                "type": "image_url",
+                "image_url": {"url": INPUT_URL},
+            },
+        ],
+    },
 ]
 
+# gpt-4o-mini is cheap and fast and has vision capabilities
 response = client.beta.chat.completions.parse(
-    model="gpt-4o-mini",
     response_format=Blotter,
+    model="gpt-4o-mini",
     messages=input_messages
 )
 
 message = response.choices[0].message
-obj = json.loads(message.content)
 
+# Print it out in readable format
+obj = json.loads(message.content)
 print(json.dumps(obj, indent=2))
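
The `Field(description=...)` strings in the schema read as guidance to the model. As far as I understand, they are not enforced constraints, so the category list and the "less than 30 chars" hint may not always be honored. A rough sketch of checking them after the fact follows. `check_incident`, `ALLOWED_CATEGORIES`, and `MAX_SUMMARY_CHARS` are hypothetical names that mirror the descriptions in the script, and the second sample incident is made up.

```python
# Categories and limit taken from the field descriptions in blotter-extract.py
ALLOWED_CATEGORIES = {"violent", "property", "traffic", "other"}
MAX_SUMMARY_CHARS = 30  # the description asks for "less than 30 chars"


def check_incident(incident):
    """Return a list of ways an incident dict strays from the field hints."""
    problems = []
    if incident["category"] not in ALLOWED_CATEGORIES:
        problems.append(f'unexpected category: {incident["category"]!r}')
    if len(incident["summary"]) >= MAX_SUMMARY_CHARS:
        problems.append(f'summary is {len(incident["summary"])} chars')
    return problems


incidents = [
    # Taken from the extracted output
    {"summary": "License plate stolen", "category": "property"},
    # Hypothetical entry that breaks both hints
    {"summary": "A hypothetical summary that runs well past the limit",
     "category": "misc"},
]

for number, incident in enumerate(incidents, start=1):
    for problem in check_incident(incident):
        print(f"incident {number}: {problem}")
```

If the category set really needs to be enforced, typing the field as an enum or `Literal` in the Pydantic model is probably the more direct route, though that would need testing against the API before relying on it.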
diff --git a/extracted-blotter-entries.json b/extracted-blotter-entries.json
@@ -0,0 +1,298 @@
+{
+  "incidents": [
+    {
+      "date": "Thursday, April 1",
+      "time": "9:30 p.m.",
+      "location": "Toyon parking lot",
+      "summary": "License plate stolen",
+      "category": "property",
+      "property_damage": "Rear license plate",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "A man",
+          "gender": "unknown",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot."
+    },
+    {
+      "date": "Thursday, April 1",
+      "time": "unknown",
+      "location": "Fry's Electronics",
+      "summary": "Unauthorized purchase",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "A female administrator in Materials Science and Engineering",
+          "gender": "female",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months."
+    },
+    {
+      "date": "Friday, April 2",
+      "time": "3:30 p.m.",
+      "location": "unknown",
+      "summary": "License plate stolen",
+      "category": "property",
+      "property_damage": "Rear license plate",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "Another man",
+          "gender": "unknown",
+          "is_student": false
+        }
+      ],
+      "incident_text": "Another man reported that the rear license plate was missing from his vehicle."
+    },
+    {
+      "date": "Thursday, April 1",
+      "time": "11:40 p.m.",
+      "location": "Rains apartments",
+      "summary": "Vandalism",
+      "category": "property",
+      "property_damage": "Wheel of bike",
+      "arrest_made": false,
+      "perpetrators": [
+        {
+          "description": "Two unknown suspects",
+          "gender": "unknown",
+          "is_student": false
+        }
+      ],
+      "victims": [
+        {
+          "description": "A graduate student in the School of Education",
+          "gender": "unknown",
+          "is_student": true
+        }
+      ],
+      "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments."
+    },
+    {
+      "date": "Thursday, April 1",
+      "time": "11:40 p.m.",
+      "location": "Adelifa",
+      "summary": "Medical call",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [],
+      "incident_text": "Police responded to an alcohol-related medical call in Adelifa."
+    },
+    {
+      "date": "Thursday, April 1",
+      "time": "unknown",
+      "location": "Gates Computer Science Building",
+      "summary": "Theft",
+      "category": "property",
+      "property_damage": "Books",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "Three computer science graduate students",
+          "gender": "unknown",
+          "is_student": true
+        }
+      ],
+      "incident_text": "Three computer science graduate students reported that they had books stolen from the Gates Computer Science Building in the previous five months."
+    },
+    {
+      "date": "Saturday, April 3",
+      "time": "10:20 p.m.",
+      "location": "unknown",
+      "summary": "Traffic violation",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "A male undergraduate",
+          "gender": "male",
+          "is_student": true
+        }
+      ],
+      "incident_text": "A male undergraduate was cited and released for running a stop sign on his bike and for not having a bike light or bike license."
+    },
+    {
+      "date": "Saturday, April 3",
+      "time": "11:20 p.m.",
+      "location": "Lomita Drive",
+      "summary": "Possession of alcohol",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "A woman",
+          "gender": "female",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A woman was cited and released on Lomita Drive for being a minor in possession of alcohol."
+    },
+    {
+      "date": "Saturday, April 3",
+      "time": "11:40 p.m.",
+      "location": "unknown",
+      "summary": "Driving without license",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "A man",
+          "gender": "male",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A man was cited and released for driving with a suspended license after he was stopped near Galvez Street and Campus Drive."
+    },
+    {
+      "date": "Saturday, April 4",
+      "time": "11:45 p.m.",
+      "location": "Lomita Drive",
+      "summary": "Urinating in public",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": true,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "A woman",
+          "gender": "female",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A woman was cited and released for urinating in public on Lomita Drive."
+    },
+    {
+      "date": "Sunday, April 4",
+      "time": "1:00 a.m.",
+      "location": "unknown",
+      "summary": "Traffic violation",
+      "category": "other",
+      "property_damage": "Damage to car",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "A man",
+          "gender": "male",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A man reported that someone walked over the top of his car, causing damage to the trunk, top and hood."
+    },
+    {
+      "date": "Sunday, April 4",
+      "time": "1:31 a.m.",
+      "location": "Mayfield Avenue",
+      "summary": "Possession of stolen bikes",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": false,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "Two men",
+          "gender": "unknown",
+          "is_student": false
+        }
+      ],
+      "incident_text": "Two men were cited and released for walking with bikes U-Locked to themselves on Mayfield Avenue after neither could establish ownership of the bikes."
+    },
+    {
+      "date": "Sunday, April 4",
+      "time": "3:05 a.m.",
+      "location": "Sigma Alpha Epsilon",
+      "summary": "Altercation",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": false,
+      "perpetrators": [
+        {
+          "description": "Two undergraduates",
+          "gender": "unknown",
+          "is_student": true
+        }
+      ],
+      "victims": [
+        {
+          "description": "A male undergraduate",
+          "gender": "male",
+          "is_student": true
+        }
+      ],
+      "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
+    },
+    {
+      "date": "Monday, April 5",
+      "time": "2:45 a.m.",
+      "location": "unknown",
+      "summary": "Public intoxication",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": true,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "A man",
+          "gender": "male",
+          "is_student": false
+        }
+      ],
+      "incident_text": "Police arrested a man for being drunk in public."
+    },
+    {
+      "date": "Tuesday, April 6",
+      "time": "1:45 a.m.",
+      "location": "San Jose main jail",
+      "summary": "Trespassing",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": true,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "A local vagrant",
+          "gender": "male",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A local vagrant was booked into the San Jose main jail for trespassing \u2014 his seventh time trespassing in a month."
+    },
+    {
+      "date": "Tuesday, April 6",
+      "time": "2:50 a.m.",
+      "location": "Palm Drive",
+      "summary": "Driving without a license",
+      "category": "other",
+      "property_damage": "None",
+      "arrest_made": true,
+      "perpetrators": [],
+      "victims": [
+        {
+          "description": "A man",
+          "gender": "male",
+          "is_student": false
+        }
+      ],
+      "incident_text": "A man was cited and released for driving without a license on Palm Drive."
+    }
+  ]
+}
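
The `category` and `summary` constraints in the schema are written as `Field` descriptions. As far as I understand, these act as hints to the model rather than enforced rules, so a cheap post-check on the JSON output can catch drift. The sketch below uses only the standard library. The helper names (`check_incident`, `check_blotter`) and the second sample incident are made up for illustration; the first sample incident is copied from the output above.

```python
import json

# Constraints taken from the Field descriptions in the schema.
ALLOWED_CATEGORIES = {"violent", "property", "other"}
MAX_SUMMARY_CHARS = 30  # the schema asks for "less than 30 chars"


def check_incident(incident: dict) -> list[str]:
    """Return a list of human-readable problems found in one incident."""
    problems = []
    if incident.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {incident.get('category')!r}")
    if len(incident.get("summary", "")) >= MAX_SUMMARY_CHARS:
        problems.append("summary is 30 chars or longer")
    if not incident.get("incident_text", "").strip():
        problems.append("incident_text is empty")
    return problems


def check_blotter(blotter: dict) -> dict[int, list[str]]:
    """Map incident index -> problems, omitting incidents that pass."""
    report = {}
    for i, incident in enumerate(blotter.get("incidents", [])):
        problems = check_incident(incident)
        if problems:
            report[i] = problems
    return report


# First incident: copied from the real output. Second: a hypothetical
# bad record, invented here only to exercise the checks.
sample = json.loads("""
{
  "incidents": [
    {
      "summary": "Altercation",
      "category": "other",
      "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
    },
    {
      "summary": "A bike was stolen from outside the library",
      "category": "theft",
      "incident_text": "Hypothetical incident text."
    }
  ]
}
""")

for index, problems in check_blotter(sample).items():
    print(index, problems)
```

Checks like these only catch violations of the stated constraints; they cannot tell whether a value that looks plausible was actually read correctly from the image.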
diff --git a/README.md b/README.md
@@ -1,2 +1,3 @@
 # Police Blotter data extraction using OpenAI's Structured Output
 
+![image](https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703)
diff --git a/extract.py b/extract.py
@@ -1,8 +1,9 @@
 #!/usr/bin/env python3
 
 """
-Parses the screenshot at the given URL:
+Parses and extracts structured data from the screenshot at the given URL:
 
+https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703
 
 This script assumes your API key is set up in the default way,
   i.e. environment variable: $OPENAI_API_KEY
@@ -16,7 +17,7 @@
 from pathlib import Path
 from pydantic import BaseModel, Field
 
-INPUT_URL = ""
+INPUT_URL = "https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703"
 
 class Person(BaseModel):
     description: str

diff --git a/README.md b/README.md
@@ -0,0 +1,2 @@
+# Police Blotter data extraction using OpenAI's Structured Output
+
diff --git a/parse_blotter.py → extract.py b/parse_blotter.py → extract.py
diff --git a/parse_blotter.py b/parse_blotter.py
@@ -0,0 +1,78 @@
+#!/usr/bin/env python3
+
+"""
+Parses the screenshot at the given URL:
+
+
+This script assumes your API key is set up in the default way,
+  i.e. environment variable: $OPENAI_API_KEY
+  https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
+
+"""
+
+import base64
+import json
+from openai import OpenAI
+from pathlib import Path
+from pydantic import BaseModel, Field
+
+INPUT_URL = ""
+
+class Person(BaseModel):
+    description: str
+    gender: str
+    is_student: bool
+
+
+class Incident(BaseModel):
+    date: str
+    time: str
+    location: str
+    summary: str = Field(description="""Brief summary, less than 30 chars""")
+    category: str = Field(
+        description="""Type of crime, either "violent" or "property", or "other" """
+    )
+    property_damage: str = Field(
+        description="""If a property crime, then a description of what was stolen/damaged/lost"""
+    )
+    arrest_made: bool
+    perpetrators: list[Person]
+    victims: list[Person]
+    incident_text: str = Field(
+        description="""Include the complete verbatim text from the input that pertains to the incident"""
+    )
+
+
+
+class Blotter(BaseModel):
+    incidents: list[Incident]
+
+
+## initialize OpenAI client
+client = OpenAI()
+
+# The image is referenced by URL (INPUT_URL) in the message below, so there is
+# no local file to read or base64-encode at this point.
+
+input_messages = [
+    {"role": "system", "content": "Output the result in JSON format."},
+    {"role": "user", "content": [
+        {"type": "text", "text": "Extract the text from this image"},
+        {
+            "type": "image_url",
+            "image_url": {"url": INPUT_URL},
+        },
+
+    ]},
+]
+
+response = client.beta.chat.completions.parse(
+    model="gpt-4o-mini",
+    response_format=Blotter,
+    messages=input_messages
+)
+
+message = response.choices[0].message
+obj = json.loads(message.content)
+
+print(json.dumps(obj, indent=2))
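
The script sends the image by web URL. To test a local PNG instead, one common approach is to embed the file as a base64 data URL in the same `image_url` field. The sketch below is a minimal version of that idea, not part of the original script: the helper names `image_to_data_url` and `build_user_message` are hypothetical, and it only builds the message payload without calling the API.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: Path) -> str:
    """Encode a local image file as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None:
        # Fall back to a generic type when the extension is unknown.
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_user_message(image_url: str) -> dict:
    """Build a user message in the same shape the script uses."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this image"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
```

With these helpers, `build_user_message(image_to_data_url(Path("page.png")))` would take the place of the user message that currently points at `INPUT_URL`; `page.png` is a placeholder filename.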