Last active
October 2, 2025 18:53
Revisions
-
dannguyen revised this gist
Oct 15, 2024 . 3 changed files with 16 additions and 2 deletions.
@@ -79,6 +79,9 @@ pip install openai pydantic

For ease of use, these scripts are set up to use gpt-4o-mini's vision capabilities to ingest PNG files via web URLs. If you want to modify the script to test a URL of your choosing, simply modify the `INPUT_URL` variable at the top of the script.

### Scanned financial disclosure

<br>

@@ -249,7 +252,18 @@ It left the values as text, e.g. `"value": "$5,000,001 - $25,000,000"` versus `"

### Scanned financial disclosure

- [extract-scanned-financial-disclosure.py](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-extract-scanned-financial-disclosure-py)
- [output-scanned-financial-disclosure.json](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-scanned-financial-disclosure-json)

As I said at the beginning of this section, the report screenshot comes from a [PDF with actual text](https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2023/10059734.pdf); most Congressional disclosure filings in the past 5 years have used the [e-filing system](https://disclosures-clerk.house.gov/FinancialDisclosure), which inherently results in more regular data even when the output is PDF.

So I tried using Structured Outputs on a screenshot of a [2008-era report](https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2008/8135973.pdf), and the results were [pretty solid](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-scanned-financial-disclosure-json).
<img width="842" alt="image" src="https://gist.github.com/user-attachments/assets/e430e76a-2519-43fa-a370-85a584b816b6">

The main caveat is that I had to rotate the page orientation by 90 degrees. The model did try to parse the vertically oriented page, and got about half of the values right, which is probably one of the worst-case scenarios (you'd prefer the model to completely flub things, so that you could at least catch the failure with automated error checks).

@@ -3,11 +3,11 @@
"""
extract-financial-disclosure.py

Parses and extracts structured data from the screenshot at the given URL:
https://gist.github.com/user-attachments/assets/e430e76a-2519-43fa-a370-85a584b816b6

The page comes from page 5 of 24; Schedule III, of the full financial disclosure report found here:
https://gist.github.com/user-attachments/assets/e430e76a-2519-43fa-a370-85a584b816b6

(the page was manually rotated 90 degrees from its original orientation in the scanned document)

File renamed without changes.
-
dannguyen revised this gist
Oct 15, 2024 . 2 changed files with 225 additions and 0 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,91 @@ #!/usr/bin/env python3 """ extract-financial-disclosure.py Parses and extracts structured data from the screenshot at the given URL: https://private-user-images.githubusercontent.com/121520/376677240-e11735ce-71d1-47b1-9127-2188be17c42e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjkwMDU3NDYsIm5iZiI6MTcyOTAwNTQ0NiwicGF0aCI6Ii8xMjE1MjAvMzc2Njc3MjQwLWUxMTczNWNlLTcxZDEtNDdiMS05MTI3LTIxODhiZTE3YzQyZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQxMDE1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MTAxNVQxNTE3MjZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lZTgwYjAwMDFmYWNiNGI0MmEzNGU5YzllNzgzMzdhYTlmZjkwOGU0ZGNhYmRkZDhhODA2MzZlMmM5YmZhY2ZlJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.BhDYMwoVUmMt4OTKw04PfqQBKKqWzv0KuAsdFaV_WMU The page comes from page 5 of 24; Schedule III, of the full financial disclosure report found here: https://gist.github.com/user-attachments/assets/12c3fce6-9dd5-4140-bc59-75606062799c (the page was manually rotated 90 degrees from its original orientation in the scanned document) This script assumes your API key is set up in the default way, i.e. 
environment variable: $OPENAI_API_KEY
https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
"""

import base64
import json
from openai import OpenAI
from pathlib import Path
from pydantic import BaseModel, Field
from typing import Union, Literal

INPUT_URL = "https://gist.github.com/user-attachments/assets/52c5c8f5-886f-45fe-a338-d1cd3e36ecc8"

# OpenAI examples of Structured Output scripts and data definitions
# https://platform.openai.com/docs/guides/structured-outputs/examples?context=ex2

# Define the data structures in Pydantic:
# a Disclosure Report has a list of assets
class Asset(BaseModel):
    owner: Union[Literal['SP', 'DC', 'JT'], None] = Field(description="The leftmost first column of the table")
    asset_name: str = Field(
        description="The name of the asset, the second column of the table"
    )
    asset_value_low: Union[int, None] = Field(
        description="In the third column, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
    )
    asset_value_high: Union[int, None] = Field(
        description="In the third column, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
    )
    income_type: str = Field(description="The fourth column")
    income_low: Union[int, None] = Field(
        description="In the 5th column, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'. If the value is enclosed in parentheses, then the income values are meant to be negative"
    )
    income_high: Union[int, None] = Field(
        description="In the 5th column, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'. 
If the value is enclosed in parentheses, then the income values are meant to be negative"
    )
    transaction_type: Union[Literal['P', 'S', 'E'], None]


class DisclosureReport(BaseModel):
    assets: list[Asset]


## initialize OpenAI client
client = OpenAI()

# Example of message format for passing in an image via URL
# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
input_messages = [
    {"role": "system", "content": "Output the result in JSON format."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this image"},
            {
                "type": "image_url",
                "image_url": {"url": INPUT_URL},
            },
        ],
    },
]

# gpt-4o-mini is cheap and fast and has vision capabilities
response = client.beta.chat.completions.parse(
    response_format=DisclosureReport, model="gpt-4o-mini", messages=input_messages
)

message = response.choices[0].message

# Print it out in readable format
obj = json.loads(message.content)
print(json.dumps(obj, indent=2))
@@ -0,0 +1,134 @@
{ "assets": [ { "owner": "SP", "asset_name": "820 Sir Francis Drake Blvd., San Anselmo, CA - Commercial Property", "asset_value_low": 1000001, "asset_value_high": 5000000, "income_type": "RENT", "income_low": 100001, "income_high": 1000000, "transaction_type": "P" }, { "owner": "SP", "asset_name": "Access Technology Partners, LP", "asset_value_low": 0, "asset_value_high": 0, "income_type": "PARTNERSHIP INCOME/(LOSS)", "income_low": -1000000, "income_high": -1, "transaction_type": "S" }, { "owner": "SP", "asset_name": "Active, LLC", "asset_value_low": 15001, "asset_value_high": 50000, "income_type": "PARTNERSHIP INCOME/(LOSS)", "income_low": -200, "income_high": -1, "transaction_type": "P" }, { "owner": "SP", "asset_name": "Agile Software Corp. - Public Common Stock", "asset_value_low": 0, "asset_value_high": 0, "income_type": "CAPITAL GAIN", "income_low": 15001, "income_high": 50000, "transaction_type": "S" }, { "owner": "SP", "asset_name": "Akamai Technologies Inc. - Public Common Stock", "asset_value_low": 50001, "asset_value_high": 100000, "income_type": "NONE", "income_low": null, "income_high": null, "transaction_type": "P" }, { "owner": "SP", "asset_name": "Alcatel Lucent Ads - Public Common Stock", "asset_value_low": 1001, "asset_value_high": 15000, "income_type": "DIVIDENDS", "income_low": 1, "income_high": 200, "transaction_type": null }, { "owner": "SP", "asset_name": "Alcoa Inc. - Public Common Stock", "asset_value_low": 15001, "asset_value_high": 50000, "income_type": "DIVIDENDS", "income_low": 201, "income_high": 1000, "transaction_type": "P" }, { "owner": "SP", "asset_name": "American International Group Inc. 
- Public Common Stock", "asset_value_low": 250001, "asset_value_high": 500000, "income_type": "DIVIDENDS", "income_low": 2501, "income_high": 5000, "transaction_type": null }, { "owner": "SP", "asset_name": "Americas Doctors.com - Preferred Stock", "asset_value_low": 1001, "asset_value_high": 15000, "income_type": "NONE", "income_low": null, "income_high": null, "transaction_type": null }, { "owner": "SP", "asset_name": "Apple Computer - Public Common Stock", "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "CAPITAL GAIN", "income_low": 100001, "income_high": 1000000, "transaction_type": "S" }, { "owner": "SP", "asset_name": "Aristotle, LLC", "asset_value_low": 15001, "asset_value_high": 50000, "income_type": "NONE", "income_low": null, "income_high": null, "transaction_type": null }, { "owner": "SP", "asset_name": "Ashlar, Inc. - Common Stock", "asset_value_low": 0, "asset_value_high": 0, "income_type": "CAPITAL GAIN/(LOSS)", "income_low": -1001, "income_high": -201, "transaction_type": "S" }, { "owner": "SP", "asset_name": "AT&T - Public Common Stock", "asset_value_low": 250001, "asset_value_high": 500000, "income_type": "DIVIDENDS", "income_low": 5001, "income_high": 15000, "transaction_type": null } ] } -
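Output like the JSON above can be given a quick sanity check before it is trusted. The following is a minimal sketch using only the standard library, not part of the original scripts: it assumes each low/high pair should be either both null or ordered low <= high. The function name `find_range_problems` is hypothetical, and the three sample records are copied from the output above.

```python
def find_range_problems(report: dict) -> list[str]:
    """Return a description of every low/high pair that looks inconsistent."""
    problems = []
    for asset in report["assets"]:
        for prefix in ("asset_value", "income"):
            low = asset.get(f"{prefix}_low")
            high = asset.get(f"{prefix}_high")
            if low is None and high is None:
                # Both bounds missing, e.g. an asset with no income
                continue
            if low is None or high is None:
                problems.append(f"{asset['asset_name']}: {prefix} has only one bound")
            elif low > high:
                problems.append(f"{asset['asset_name']}: {prefix}_low > {prefix}_high")
    return problems


# Three records copied from the output above (other fields omitted)
sample = {
    "assets": [
        {"asset_name": "Access Technology Partners, LP",
         "asset_value_low": 0, "asset_value_high": 0,
         "income_low": -1000000, "income_high": -1},
        {"asset_name": "Akamai Technologies Inc. - Public Common Stock",
         "asset_value_low": 50001, "asset_value_high": 100000,
         "income_low": None, "income_high": None},
        {"asset_name": "Ashlar, Inc. - Common Stock",
         "asset_value_low": 0, "asset_value_high": 0,
         "income_low": -1001, "income_high": -201},
    ]
}

print(find_range_problems(sample))  # prints [] for these three records
```

A check like this only catches structurally impossible values; it cannot tell whether a plausible-looking range was read from the wrong row.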
dannguyen revised this gist
Oct 10, 2024 . No changes.
-
Dan Nguyen @ Decision Sciences revised this gist
Oct 9, 2024 . 1 changed file with 593 additions and 0 deletions.
@@ -0,0 +1,593 @@

# Extracting financial disclosure reports and police blotter narratives using OpenAI's Structured Output

> **tl;dr** this demo shows how to call OpenAI's [gpt-4o-mini model](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/), provide it with the URL of a screenshot of a document, and extract data that follows a schema you define. The results are pretty solid even with little effort in defining the data, and no effort doing data prep. OpenAI's API could be a cost-efficient tool for large-scale data gathering projects involving public documents.

OpenAI announced [Structured Outputs for its API](https://openai.com/index/introducing-structured-outputs-in-the-api/), a feature that allows users to specify the fields and schema of extracted data, and guarantees that the JSON output will follow that specification.
For example, given a Congressional financial disclosure report, with assets defined in a table like this: <img width="859" alt="image" src="https://gist.github.com/user-attachments/assets/e64c7ad1-d7af-4e51-a3f2-5961fde4fac3"> You define the data model you're expecting to extract, either in JSON schema or (as this demo does) via [the pydantic library](https://docs.pydantic.dev/latest/concepts/json_schema/): ```py class Asset(BaseModel): asset_name: str owner: str location: Union[str, None] asset_value_low: Union[int, None] asset_value_high: Union[int, None] income_type: str income_low: Union[int, None] income_high: Union[int, None] tx_gt_1000: bool class DisclosureReport(BaseModel): assets: list[Asset] ``` OpenAI's API infers from the field names (the above example is basic; there are [ways to provide detailed descriptions](https://docs.pydantic.dev/latest/concepts/fields/#default-values) for each data field) how your data model relates to the actual document you're trying to parse, and produces the extracted data in JSON format: ```json { "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]", "owner": "JT", "location": "St. Helena/Napa, CA, US", "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "Grape Sales", "income_low": 100001, "income_high": 1000000, "tx_gt_1000": false }, { "asset_name": "25 Point Lobos - Commercial Property [RP]", "owner": "SP", "location": "San Francisco/San Francisco, CA, US", "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "Rent", "income_low": 100001, "income_high": 1000000, "tx_gt_1000": false } ``` This demo gist provides code and results for two scenarios: - [Financial disclosure reports](#example-financial-disclosures): this is a data-tables-in-PDF problem where you'd typically have to use a PDF parsing library like [pdfplumber](https://github.com/jsvine/pdfplumber) and write your own data parsing methods. 
- [Newspaper police blotter](#example-police-blotter): this is a situation of irregular information (brief descriptions of reported crime incidents, written by a human reporter) where you'd employ humans to read, interpret, and do data entry.

Note: these are very basic examples, using the bare minimum of instructions to the API (e.g. "Extract the text from this image") and relatively little code to define the expected data schema.

## How to run this code/use this demo

Each example has the Python script used to produce the corresponding JSON output. To re-run these scripts on your own, the first thing you need to do is to create your own OpenAI developer account at [platform.openai.com](http://platform.openai.com), then:

- Put in a [couple bucks into your account balance](https://platform.openai.com/settings/organization/billing/overview) (both of these examples use around [30,000-50,000 tokens](https://openai.com/api/pricing/), i.e. cost about half a cent to execute)
- Create an [API key](https://platform.openai.com/api-keys)
- Set it as your [$OPENAI_API_KEY environment variable](https://platform.openai.com/docs/quickstart/create-and-export-an-api-key)
  - [alternatively](https://github.com/openai/openai-python?tab=readme-ov-file#usage), you can paste your key into the `api_key` argument, i.e. replace `client = OpenAI()` with `client = OpenAI(api_key='Yourkeyhere')`

Then install the [OpenAI Python SDK](https://github.com/openai/openai-python) and [pydantic](https://docs.pydantic.dev/latest/#why-use-pydantic):

```sh
pip install openai pydantic
```

For ease of use, these scripts are set up to use gpt-4o-mini's vision capabilities to ingest PNG files via web URLs. If you want to modify the script to test a URL of your choosing, simply modify the `INPUT_URL` variable at the top of the script.
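The scripts expect `INPUT_URL` to be a web URL. If the screenshot is a local file instead, one option is to encode it as a base64 data URL, which the `image_url` field of the Chat Completions API also accepts. This is a sketch rather than part of the original scripts; the helper name `to_data_url` and the file name in the usage comment are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_path: str) -> str:
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    path = Path(image_path)
    # Guess the MIME type from the file extension; fall back to PNG
    mime_type = mimetypes.guess_type(path.name)[0] or "image/png"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Usage (assumes a screenshot exists at this made-up path):
# INPUT_URL = to_data_url("my-screenshot.png")
```

The rest of the script can stay unchanged, since the data URL goes in the same `{"url": INPUT_URL}` slot as a web URL.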
<br>
<hr>
<p id="example-financial-disclosures"></p>

## Financial disclosure report

- The script: [extract-financial-disclosure.py](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-extract-financial-disclosure-py)
- The results: [output-financial-disclosure.json](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-financial-disclosure-json)

The following screenshot is taken from the PDF of the full report, which can be found at [disclosures-clerk.house.gov](https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2023/10059734.pdf). Note that this example simply passes a *PNG screenshot of the PDF* to OpenAI's API; results may be different/more efficient if you send it the actual PDF.

<img width="905" alt="image" src="https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504">

As shown in the following snippet, the results look accurate and as expected. Note that it also correctly parses the "Location" and "Description" fields (when they exist), even though those fields aren't provided in tabular format (i.e. they're globbed into the "Asset" description as free form text). It also understands that `tx_gt_1000` corresponds to the `Tx. > $1,000?` header, and that that field contains checkboxes. Even though the sample page has no examples of checked checkboxes, the model correctly infers that `tx_gt_1000` is false.

<img width="859" alt="image" src="https://gist.github.com/user-attachments/assets/e64c7ad1-d7af-4e51-a3f2-5961fde4fac3">

```json
{ "asset_name": "AllianceBernstein Holding L.P. 
Units (AB) [OL]", "owner": "OL", "location": "New York, NY, US", "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors.", "asset_value_low": 1000001, "asset_value_high": 5000000, "income_type": "Partnership Income", "income_low": 50001, "income_high": 100000, "tx_gt_1000": false }, { "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]", "owner": "SP", "location": null, "description": null, "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "None", "income_low": null, "income_high": null, "tx_gt_1000": false }, ``` It's also nice that I didn't have to do even the minimum of "data prep": I gave it a screenshot of the report page — the top third of which has info I don't need — and it "knew" that it should only care about the data under the "Schedule A: Assets and Unearned Income" header. If I were scraping financial disclosures for real, I would make use of [json-schema's "description" attribute](https://json-schema.org/learn/getting-started-step-by-step#define-properties), which can be defined via Pydantic like this: ```py from pydantic import BaseModel, Field class Asset(BaseModel): asset_name: str = Field( description="The name of the asset, under the 'Asset' header" ) owner: str = Field( description="Under the 'Owner' header, a 2-letter abbreviation, e.g. SP, DC, JT" ) location: Union[str, None] = Field( description="Some records have 'Location:' text as part of the 'Asset' header" ) description: Union[str, None] = Field( description="Some records have 'Description:' text as part of the 'Asset' header" ) asset_value_low: Union[int, None] = Field( description="Under the 'Value of Asset' field, the left value of the string and converted to an integer, e.g. 
'15001' from '$15,001 - $50,000'"
    )
    asset_value_high: Union[int, None] = Field(
        description="Under the 'Value of Asset' field, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
    )
    income_type: str = Field(description="Under the 'Income Type(s)' field")
    income_low: Union[int, None] = Field(
        description="Under the 'Income' field, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
    )
    income_high: Union[int, None] = Field(
        description="Under the 'Income' field, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
    )
    tx_gt_1000: bool = Field(
        description="Under the 'Tx. > $1,000?' header: True if the checkbox is checked, False if it is empty"
    )


class DisclosureReport(BaseModel):
    assets: list[Asset]
```

But as you can see from the [result JSON](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-financial-disclosure-json), OpenAI's model seems "smart" enough to understand a basic data-copying task without specific instructions.

### Financial disclosure report with no instruction

I was curious how well the model would do without any instruction, i.e. when you don't bother to define a pydantic model and instead pass in a response format of `{"type": "json_object"}`:

```py
response = client.beta.chat.completions.parse(
    response_format={"type": "json_object"}, model="gpt-4o-mini", messages=input_messages
)
```

The answer: just fine.
You can see the code and full results here: - [extract-basic-financial-disclosure.py](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-extract-basic-financial-disclosure-py) - [output-basic-financial-disclosure.json](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-basic-financial-disclosure-json) Without a defined schema, the model treated the entire document (not just the Assets Schedule) as data: ```json { "document": { "title": "Financial Disclosure Report", "header": "Clerk of the House of Representatives \u2022 Legislative Resource Center \u2022 B81 Cannon Building \u2022 Washington, DC 20515", "filer_information": { "name": "Hon. Nancy Pelosi", "status": "Member", "state_district": "CA11" }, "filing_information": { "filing_type": "Annual Report", "filing_year": "2023", "filing_date": "05/15/2024" }, "schedule_a": { "title": "Schedule A: Assets and 'Unearned' Income", "assets": [ { "asset": "11 Zinfandel Lane - Home & Vineyard [RP]", "owner": "JT", "value": "$5,000,001 - $25,000,000", "income_type": "Grape Sales", "income": "$100,001 - $1,000,000", "location": "St. Helena/Napa, CA, US" }, ``` It left the values as text, e.g. `"value": "$5,000,001 - $25,000,000"` versus `"asset_value_low": 5000001`. And it left out the optional data fields, e.g. location and description, for entries that didn't have them: ```json { "asset": "AllianceBernstein Holding L.P. Units (AB) [OL]", "owner": "SP", "value": "$1,000,001 - $5,000,000", "income_type": "Partnership Income", "income": "$50,001 - $100,000", "location": "New York, NY, US", "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors." }, { "asset": "Alphabet Inc. 
- Class A (GOOGL) [ST]", "owner": "SP", "value": "$5,000,001 - $25,000,000", "income_type": "None" },
```

<br>
<hr>
<p id="example-police-blotter"></p>

## Newspaper police blotter

- The script: [extract-police-blotter.py](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-extract-police-blotter-py)
- The results: [output-police-blotter.json](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-output-police-blotter-json)

The screenshot was taken from the Stanford Daily archives: [https://archives.stanforddaily.com/2004/04/09?page=3&section=MODSMD_ARTICLE12#article](https://archives.stanforddaily.com/2004/04/09?page=3&section=MODSMD_ARTICLE12#article)

For reasons that are explained in detail below, this example isn't meant to be a reasonable test of the model's capabilities. But it's a fun experiment to see how well the model performs with something that is not meant to be "data" and is inherently riddled with data quality issues.

Consider what the data point of a basic crime incident report might contain:

- When: a date and time
- Where: a place
- Who:
  - a victim
  - a suspect
- What: the crime the suspect allegedly committed

It's easy to come up with many variations and edge cases:

- No specific time: i.e. "computer science graduate students reported that they had books stolen from the Gates Computer Science Building in the previous five months"
- No listed place: it's unclear if the reporter purposefully omitted it, or if it was left off the original police report.
- No suspect ("an alcohol-related medical call") or no victim (e.g. "an accidental fire call"). Or multiple suspects and multiple victims.

Unlike the financial disclosure example, the input data is freeform narrative text. The onus is entirely on us to define what a blotter report is, which ends up requiring defining what a crime incident is.
Not surprisingly, the corresponding Pydantic code is a lot more verbose, and I bet if you asked 1,000 journalists to write a definition, they'd all be different. Here's what mine looks like: ```py # Define the data structures in Pydantic: # an Incident involves several Persons (victims, perpetrators) class Person(BaseModel): description: str gender: str is_student: bool # Pydantic docs on field descriptions: # https://docs.pydantic.dev/latest/concepts/fields/ class Incident(BaseModel): date: str time: str location: str summary: str = Field(description="""Brief summary, less than 30 chars""") category: str = Field( description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """ ) property_damage: str = Field( description="""If a property crime, then a description of what was stolen/damaged/lost""" ) arrest_made: bool perpetrators: list[Person] victims: list[Person] incident_text: str = Field( description="""Include the complete verbatim text from the input that pertains to the incident""" ) class Blotter(BaseModel): incidents: list[Incident] ``` ### Police blotter results I ask the model to provide an `incident_text` field, i.e. the verbatim text from which it extracted the incident data point. This is helpful for evaluating the experiment. But for an actual data project, you might want to omit it as it adds to the number of output tokens and API cost ```py incident_text: str = Field( description="""Include the complete verbatim text from the input that pertains to the incident""" ) ``` <img width="212" alt="image" src="https://gist.github.com/user-attachments/assets/4ca1b517-dfd6-45aa-b6dd-fa8073de035e"> The resulting `incident_text` field extracted from the above snippet is basically correct: > A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments. 
However, it leaves off the `11:40 p.m.`, which is at the beginning of the printed incident, and is something that I normally would like to include because I want to know everything the model looked at when extracting the data point. The `11:40 p.m.` time is correctly included in the rest of the data output: ```json { "date": "April 2", "time": "11:40 p.m.", "location": "Rains apartments", "summary": "Bike vandalized", "category": "property", "property_damage": "Wheel of bike", "arrest_made": false, "perpetrators": [ { "description": "Two unknown suspects", "gender": "unknown", "is_student": false } ], "victims": [ { "description": "A graduate student in the School of Education", "gender": "unknown", "is_student": true } ], "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments." } ``` #### The good As with the financial disclosure report, my script provides a screenshot and leaves it up to OpenAI's model to figure out what's going on. I was pleasantly surprised at how well gpt-4o-mini did in gleaning structure from a newspaper print listicle, with instructions as basic as: "Extract the text from this image" For example, on first glance of the blotter, it seems that every incident has a date (in the subhed) and time (at the beginning of the graf). But under "Thursday, April 1", you can see that pattern already broken: <img width="214" alt="image" src="https://gist.github.com/user-attachments/assets/c5d37af5-56e0-4b0b-81ab-9cf5e3359316"> Is that second graf ("A female administrator in Materials Science...") a continuation of the 9:30 p.m. incident where a "man reported that someone removed his rear license plate"? Most human readers, after reading both paragraphs — and then the rest of the blotter — will realize that these are 2 separate incidents. But there's nothing at all in the structure of the text to indicate that. 
Before I ran this experiment, I thought I would have to provide detailed parsing instructions to the model, e.g. > What you are reading is a police blotter, a list of reported incidents that police were called to. Every paragraph should be treated as a separate incident. Most incidents, but not all, begin with a timestamp, e.g. "11:20 p.m". But the model saw on its own that there are 2 incidents, and that the second one happened on April 1 at an unspecified time. ```json { "date": "April 1", "time": "9:30 p.m.", "location": "Toyon parking lot", "summary": "License plate stolen", "category": "property", "property_damage": "rear license plate", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "Man", "gender": "unknown", "is_student": false } ], "incident_text": "A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot." }, { "date": "April 1", "time": "unknown", "location": "unknown", "summary": "Unauthorized purchase reported", "category": "other", "property_damage": "computer equipment", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "Female administrator", "gender": "female", "is_student": false } ], "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry\u2019s Electronics sometime in the past five months." }, ``` By my count, there are 19 incidents in this issue of the Stanford Daily's police blotter, and the API correctly returns 19 different incidents. #### The bad Again, the data model is inherently messy, and I put in minimal effort to describe what an "incident" is, such as the variety of situations and edge cases. That, plus the inherent limitations of the data, are the root cause of most of the model's problems. 
For example, I intended the `perpetrators` and `victims` to be lists of proper nouns or simple nouns, so that we could ask questions like: "how many incidents involved multiple people". Given the following incident text:

> A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.

This is how the model parsed the suspects:

```json
"perpetrators": [
  { "description": "Two unknown suspects", "gender": "unknown", "is_student": false }
]
```

For a data project, I might have preferred a result that would easily yield a count of `2`, e.g.:

```json
"perpetrators": [
  { "description": "Unknown suspect", "gender": "unknown", "is_student": false },
  { "description": "Unknown suspect", "gender": "unknown", "is_student": false }
]
```

But how should the model know what I'm trying to do sans specific instructions? I think most humans, given the same minimalist instructions, would have also recorded `"Two unknown suspects"`.

However, the model greatly struggled with filling out the `perpetrators` and `victims` lists, such as frequently mistaking the suspect/perpetrator for the victim when there was no specific victim mentioned:

> A male undergraduate was cited and released for running a stop sign on his bike and for not having a bike light or bike license.

```json
"victims": [
  { "description": "A male undergraduate", "gender": "male", "is_student": true }
]
```

It goes without saying that the model also stumbled when the narrative was more complicated. For example, in the case of the unauthorized purchases at Fry's:

> A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months.

The "female administrator" is not the victim, but the person who reported the crime.
The victim would be Stanford University, or more specifically, its MScE department.

I'm not surprised the model had problems with identifying victims and suspects, though I'm unsure how much extra instruction would be needed to get reliable results from a general model.

One thing that the model frequently and inexplicably erred on was classifying people's gender. This is how I defined a `Person` using pydantic:

```py
class Person(BaseModel):
    description: str
    gender: str
    is_student: bool
```

Even when the subject's noun has an obvious gender, the model would inexplicably flub it:

> A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot.

```json
"victims": [
  {
    "description": "A man",
    "gender": "unknown",
    "is_student": false
  }
]
```

It was worse when the subject's noun did not indicate gender, but the rest of the sentence did:

> A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.

```json
"victims": [
  {
    "description": "A graduate student in the School of Education",
    "gender": "unknown",
    "is_student": true
  }
],
```

Not sure what the issue is. It might be remedied if I provided explicit and thorough instructions and examples, but this seemed like a much easier thing to infer than the other things that OpenAI's model was able to infer on its own.

#### The weird

With so many things left to the interpretation of the LLM, it was no surprise that I got different results every time I ran the [extract-police-blotter.py](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8#file-extract-police-blotter-py) script, especially when it comes to the categorization of crimes.

In the data specification, I did attempt to describe for the model what I wanted for `category`:

```py
category: str = Field(
    description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
)
```

Given the option of saying "other", the model seemed eager to use it for any slightly vague situation. It classified the unauthorized purchases at Fry's as "other", even though embezzlement would better fit under property crimes by the [FBI's UCR definition](https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/topic-pages/offense-definitions).

Maybe this could be fixed by providing the model with detailed examples and definitions of statutes and criminal code? But ultimately, as I said from the start, the model's performance is bounded by the limitations and errors in the source data. For example, an incident where someone gets hit on the head with a bottle seems to me obviously "violent", i.e. assault:

> A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested.

However, the model thinks it is "other":

```json
{
  "date": "April 4",
  "time": "3:05 a.m.",
  "location": "Sigma Alpha Epsilon",
  "summary": "Altercation reported",
  "category": "other",
  "property_damage": "None",
  "arrest_made": false,
  "perpetrators": [
    {
      "description": "Two undergraduate suspects",
      "gender": "unknown",
      "is_student": true
    }
  ],
  "victims": [
    {
      "description": "A male undergraduate",
      "gender": "male",
      "is_student": true
    }
  ],
  "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
}
```

But is the model necessarily wrong? Two "suspects" were apparently identified, but no one was actually arrested.
I took this to mean that the suspects fled and hadn't been located at the time of the report. But maybe it's something more benign: an "altercation" happened, but when the cops arrived, everyone was cool, including the guy who got hit by the bottle, thus no allegation of assault for police to act on or file as part of their UCR statistics. Ultimately we have to guess the author's intent.

The OpenAI model's performance here wouldn't work for a real data project, but again, this was just a toy experiment, and doesn't represent what you'd get if you spent more than 10 minutes thinking about the data model, never mind picked a data source slightly more structured than a newspaper listicle. I think OpenAI's model would work very well for something with more substantive text and formal structure, such as obituaries.
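
One untested way to act on the problems described above is to constrain the free-text fields in the schema itself. The sketch below is an assumption, not something from the experiment: the names `StrictPerson` and `StrictIncident` and the `count` field are hypothetical, and whether the tighter schema actually improves the model's answers is an open question. It only shows that `typing.Literal` limits `gender` and `category` to a fixed set of values, and that an explicit `count` makes "how many suspects" a simple sum.

```python
from typing import List, Literal

from pydantic import BaseModel, Field, ValidationError

# Hypothetical tightened variants of the gist's models (illustrative names)
Gender = Literal["male", "female", "unknown"]
Category = Literal["violent", "property", "traffic", "call for service", "other"]


class StrictPerson(BaseModel):
    description: str
    gender: Gender = Field(
        description="Infer from nouns and pronouns in the text, e.g. 'his bike' implies male"
    )
    is_student: bool
    # Assumed extra field so that "Two unknown suspects" can be counted as 2
    count: int = Field(description="Number of people this entry covers")


class StrictIncident(BaseModel):
    category: Category
    perpetrators: List[StrictPerson]
    victims: List[StrictPerson]


# Hand-written record showing the shape the schema would ask for
record = {
    "category": "property",
    "perpetrators": [
        {"description": "Two unknown suspects", "gender": "unknown", "is_student": False, "count": 2}
    ],
    "victims": [
        {
            "description": "A graduate student in the School of Education",
            "gender": "male",
            "is_student": True,
            "count": 1,
        }
    ],
}

incident = StrictIncident(**record)
suspect_total = sum(person.count for person in incident.perpetrators)
print(suspect_total)

# A value outside the allowed set fails validation instead of slipping through
try:
    StrictPerson(description="A man", gender="man", is_student=False, count=1)
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

A schema like this narrows what the model can return, but it does not tell the model who counts as a victim versus a reporting party; that would still need instructions or examples in the prompt.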
dannguyen revised this gist
Oct 7, 2024 . 2 changed files with 126 additions and 0 deletions.
@@ -0,0 +1,60 @@

```py
#!/usr/bin/env python3
"""
extract-basic-financial-disclosure.py

Parses and extracts structured data (and lets the model infer the structure by itself)
from the screenshot at the given URL:
https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504

Full financial disclosure report:
https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2023/10059734.pdf

This script assumes your API key is set up in the default way, i.e. environment variable: $OPENAI_API_KEY
https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
"""

import base64
import json
from openai import OpenAI
from pathlib import Path
from typing import Union

INPUT_URL = "https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504"

## initialize OpenAI client
client = OpenAI()

# Example of message format for passing in an image via URL
# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
input_messages = [
    {"role": "system", "content": "Output the result in JSON format."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this image"},
            {
                "type": "image_url",
                "image_url": {"url": INPUT_URL},
            },
        ],
    },
]

# we are letting the model infer the data structure by itself
# but we still need to tell it to respond in JSON, hence
# response_format={"type": "json_object"}
response = client.beta.chat.completions.parse(
    response_format={"type": "json_object"}, model="gpt-4o-mini", messages=input_messages
)

message = response.choices[0].message

# Print it out in readable format
obj = json.loads(message.content)
print(json.dumps(obj, indent=2))
```

@@ -0,0 +1,66 @@

```json
{
  "document": {
    "title": "Financial Disclosure Report",
    "header": "Clerk of the House of Representatives \u2022 Legislative Resource Center \u2022 B81 Cannon Building \u2022 Washington, DC 20515",
    "filer_information": {
      "name": "Hon. Nancy Pelosi",
      "status": "Member",
      "state_district": "CA11"
    },
    "filing_information": {
      "filing_type": "Annual Report",
      "filing_year": "2023",
      "filing_date": "05/15/2024"
    },
    "schedule_a": {
      "title": "Schedule A: Assets and 'Unearned' Income",
      "assets": [
        {
          "asset": "11 Zinfandel Lane - Home & Vineyard [RP]",
          "owner": "JT",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "Grape Sales",
          "income": "$100,001 - $1,000,000",
          "location": "St. Helena/Napa, CA, US"
        },
        {
          "asset": "25 Point Lobos - Commercial Property [RP]",
          "owner": "SP",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "Rent",
          "income": "$100,001 - $1,000,000",
          "location": "San Francisco/San Francisco, CA, US"
        },
        {
          "asset": "45 Belden Place - Four Story Commercial Building [RP]",
          "owner": "SP",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "Rent",
          "income": "$100,001 - $1,000,000",
          "location": "San Francisco/San Francisco, CA, US"
        },
        {
          "asset": "AllianceBernstein Holding L.P. Units (AB) [OL]",
          "owner": "SP",
          "value": "$1,000,001 - $5,000,000",
          "income_type": "Partnership Income",
          "income": "$50,001 - $100,000",
          "location": "New York, NY, US",
          "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors."
        },
        {
          "asset": "Alphabet Inc. - Class A (GOOGL) [ST]",
          "owner": "SP",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "None"
        },
        {
          "asset": "Amazon.com, Inc. (AMZN) [ST]",
          "owner": "SP",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "None"
        }
      ]
    }
  }
}
```
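
When the model infers the structure on its own, as in the output above, dollar ranges come back as strings such as `"$5,000,001 - $25,000,000"`. If you keep that free-form output, a small post-processing step can split the ranges into integers. This is a minimal sketch, assuming the "low - high" string format seen in this one output; `parse_dollar_range` is a hypothetical helper, not part of the gist's scripts.

```python
import re
from typing import Optional, Tuple


def parse_dollar_range(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Split a string like "$5,000,001 - $25,000,000" into (low, high) integers.

    Returns (None, None) when the text has no dollar amounts, e.g. "None".
    A single amount is treated as the low bound with an unknown high bound.
    """
    # Capture each run of digits and commas that follows a dollar sign
    amounts = [int(match.replace(",", "")) for match in re.findall(r"\$(\d[\d,]*)", text)]
    if not amounts:
        return (None, None)
    if len(amounts) == 1:
        return (amounts[0], None)
    return (amounts[0], amounts[1])


print(parse_dollar_range("$5,000,001 - $25,000,000"))
print(parse_dollar_range("None"))
```

The alternative, which the typed version of the script takes, is to declare integer low/high fields in the schema so the model does this split itself.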
dannguyen revised this gist
Oct 7, 2024 . 4 changed files with 92 additions and 83 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -1,6 +1,8 @@ #!/usr/bin/env python3 """ extract-financial-disclosure.py Parses and extracts structured data from the screenshot at the given URL: https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504 This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -1,8 +1,9 @@ #!/usr/bin/env python3 """ extract-police-blotter.py Parses and extracts structured data from the screenshot at the given URL: https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703 @@ -40,7 +41,7 @@ class Incident(BaseModel): location: str summary: str = Field(description="""Brief summary, less than 30 chars""") category: str = Field( description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """ ) property_damage: str = Field( description="""If a property crime, then a description of what was stolen/damaged/lost""" This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -4,6 +4,7 @@ "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]", "owner": "JT", "location": "St. 
Helena/Napa, CA, US", "description": null, "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "Grape Sales", @@ -15,6 +16,7 @@ "asset_name": "25 Point Lobos - Commercial Property [RP]", "owner": "SP", "location": "San Francisco/San Francisco, CA, US", "description": null, "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "Rent", @@ -26,6 +28,7 @@ "asset_name": "45 Belden Place - Four Story Commercial Building [RP]", "owner": "SP", "location": "San Francisco/San Francisco, CA, US", "description": null, "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "Rent", @@ -35,8 +38,9 @@ }, { "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]", "owner": "OL", "location": "New York, NY, US", "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors.", "asset_value_low": 1000001, "asset_value_high": 5000000, "income_type": "Partnership Income", @@ -48,6 +52,7 @@ "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]", "owner": "SP", "location": null, "description": null, "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "None", @@ -59,6 +64,7 @@ "asset_name": "Amazon.com, Inc. (AMZN) [ST]", "owner": "SP", "location": null, "description": null, "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "None", This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. 
Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -6,12 +6,12 @@ "location": "Toyon parking lot", "summary": "License plate stolen", "category": "property", "property_damage": "Rear license plate", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "A man", "gender": "unknown", "is_student": false } @@ -21,28 +21,28 @@ { "date": "April 1", "time": "unknown", "location": "Fry's Electronics", "summary": "Unauthorized purchase reported", "category": "other", "property_damage": "None", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "A female administrator in Materials Science and Engineering", "gender": "female", "is_student": false } ], "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months." }, { "date": "April 2", "time": "3:30 p.m.", "location": "unknown", "summary": "Rear license plate stolen reported", "category": "property", "property_damage": "Rear license plate", "arrest_made": false, "perpetrators": [], "victims": [ @@ -52,74 +52,62 @@ "is_student": false } ], "incident_text": "Another man reported that the rear license plate was missing from his vehicle." }, { "date": "April 2", "time": "11:40 p.m.", "location": "Rains apartments", "summary": "Bike vandalized", "category": "property", "property_damage": "Wheel of bike", "arrest_made": false, "perpetrators": [ { "description": "Two unknown suspects", "gender": "unknown", "is_student": false } ], "victims": [ { "description": "A graduate student in the School of Education", "gender": "unknown", "is_student": true } ], "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments." 
}, { "date": "April 2", "time": "11:40 p.m.", "location": "Adelfa", "summary": "Medical call", "category": "call for service", "property_damage": "None", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "unknown", "gender": "unknown", "is_student": false } ], "incident_text": "Police responded to an alcohol-related medical call in Adelfa." }, { "date": "April 3", "time": "10:20 p.m.", "location": "unknown", "summary": "Bike citation", "category": "other", "property_damage": "None", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "A male undergraduate", "gender": "male", "is_student": true } @@ -129,15 +117,15 @@ { "date": "April 3", "time": "11:20 p.m.", "location": "Lomita Drive", "summary": "Minor in possession citation", "category": "other", "property_damage": "None", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "A woman", "gender": "female", "is_student": false } @@ -148,14 +136,14 @@ "date": "April 3", "time": "11:40 p.m.", "location": "unknown", "summary": "Driving citation", "category": "other", "property_damage": "None", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "A man", "gender": "male", "is_student": false } @@ -166,14 +154,14 @@ "date": "April 4", "time": "1:00 a.m.", "location": "unknown", "summary": "Car damage reported", "category": "property", "property_damage": "Trunk and hood", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "A man", "gender": "male", "is_student": false } @@ -184,10 +172,10 @@ "date": "April 4", "time": "1:31 a.m.", "location": "Mayfield Avenue", "summary": "Bikes U-Locked incident", "category": "other", "property_damage": "None", "arrest_made": false, "perpetrators": [], "victims": [ { @@ -204,18 +192,18 @@ "location": "Sigma Alpha Epsilon", "summary": "Altercation reported", "category": "other", "property_damage": "None", "arrest_made": false, "perpetrators": [ { "description": "Two undergraduate 
suspects", "gender": "unknown", "is_student": true } ], "victims": [ { "description": "A male undergraduate", "gender": "male", "is_student": true } @@ -226,14 +214,14 @@ "date": "April 5", "time": "2:45 a.m.", "location": "unknown", "summary": "Arrest for intoxication", "category": "other", "property_damage": "None", "arrest_made": true, "perpetrators": [], "victims": [ { "description": "A man", "gender": "male", "is_student": false } @@ -243,75 +231,87 @@ { "date": "April 6", "time": "7:15 a.m.", "location": "Andronico's Supermarket", "summary": "Assisted in detaining suspect", "category": "call for service", "property_damage": "None", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "unknown", "gender": "unknown", "is_student": false } ], "incident_text": "Police assisted Andronico's Supermarket in detaining a suspect after a call was made on a blue emergency phone. Palo Alto police later took the suspect into custody." }, { "date": "April 6", "time": "3:20 p.m.", "location": "Studio 3 on Angell Court", "summary": "Accidental fire call", "category": "call for service", "property_damage": "None", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "unknown", "gender": "unknown", "is_student": false } ], "incident_text": "Police responded to an accidental fire call to Studio 3 on Angell Court." }, { "date": "April 6", "time": "9:00 p.m.", "location": "unknown", "summary": "Found property reported", "category": "property", "property_damage": "None", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "A man", "gender": "male", "is_student": false } ], "incident_text": "A man reported that he found someone else's personal property in his locked car." 
}, { "date": "April 6", "time": "1:45 a.m.", "location": "San Jose main jail", "summary": "Arrest for trespassing", "category": "other", "property_damage": "None", "arrest_made": true, "perpetrators": [], "victims": [ { "description": "A local vagrant", "gender": "unknown", "is_student": false } ], "incident_text": "A local vagrant was booked into the San Jose main jail for trespassing - his seventh time trespassing in a month." }, { "date": "April 6", "time": "2:50 a.m.", "location": "Palm Drive", "summary": "Driving citation", "category": "other", "property_damage": "None", "arrest_made": false, "perpetrators": [], "victims": [ { "description": "A man", "gender": "male", "is_student": false } -
dannguyen revised this gist
Oct 7, 2024 . No changes.
-
dannguyen revised this gist
Oct 7, 2024 . 1 changed file with 0 additions and 1 deletion.
@@ -39,7 +39,6 @@ class Asset(BaseModel):

```py
    income_type: str
    income_low: Union[int, None]
    income_high: Union[int, None]
    tx_gt_1000: bool


class DisclosureReport(BaseModel):
```
dannguyen revised this gist
Oct 7, 2024 . 2 changed files with 0 additions and 0 deletions.
File renamed without changes.File renamed without changes. -
dannguyen revised this gist
Oct 7, 2024 . 2 changed files with 0 additions and 0 deletions.
File renamed without changes.File renamed without changes. -
dannguyen revised this gist
Oct 7, 2024 . 1 changed file with 6 additions and 0 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -3,6 +3,7 @@ { "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]", "owner": "JT", "location": "St. Helena/Napa, CA, US", "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "Grape Sales", @@ -13,6 +14,7 @@ { "asset_name": "25 Point Lobos - Commercial Property [RP]", "owner": "SP", "location": "San Francisco/San Francisco, CA, US", "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "Rent", @@ -23,6 +25,7 @@ { "asset_name": "45 Belden Place - Four Story Commercial Building [RP]", "owner": "SP", "location": "San Francisco/San Francisco, CA, US", "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "Rent", @@ -33,6 +36,7 @@ { "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]", "owner": "SP", "location": "New York, NY, US", "asset_value_low": 1000001, "asset_value_high": 5000000, "income_type": "Partnership Income", @@ -43,6 +47,7 @@ { "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]", "owner": "SP", "location": null, "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "None", @@ -53,6 +58,7 @@ { "asset_name": "Amazon.com, Inc. (AMZN) [ST]", "owner": "SP", "location": null, "asset_value_low": 5000001, "asset_value_high": 25000000, "income_type": "None", -
dannguyen revised this gist
Oct 7, 2024 . 3 changed files with 1 addition and 0 deletions.
@@ -33,6 +33,7 @@ class Asset(BaseModel):

```py
    asset_name: str
    owner: str
    location: Union[str, None]
    asset_value_low: Union[int, None]
    asset_value_high: Union[int, None]
    income_type: str
```

File renamed without changes.
File renamed without changes.
dannguyen revised this gist
Oct 7, 2024 . 1 changed file with 0 additions and 5 deletions.
@@ -1,5 +0,0 @@
dannguyen revised this gist
Oct 7, 2024 . 2 changed files with 115 additions and 25 deletions.
@@ -1,52 +1,64 @@

```json
{
  "assets": [
    {
      "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]",
      "owner": "JT",
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "Grape Sales",
      "income_low": 100001,
      "income_high": 1000000,
      "tx_gt_1000": false
    },
    {
      "asset_name": "25 Point Lobos - Commercial Property [RP]",
      "owner": "SP",
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "Rent",
      "income_low": 100001,
      "income_high": 1000000,
      "tx_gt_1000": false
    },
    {
      "asset_name": "45 Belden Place - Four Story Commercial Building [RP]",
      "owner": "SP",
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "Rent",
      "income_low": 100001,
      "income_high": 1000000,
      "tx_gt_1000": false
    },
    {
      "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]",
      "owner": "SP",
      "asset_value_low": 1000001,
      "asset_value_high": 5000000,
      "income_type": "Partnership Income",
      "income_low": 50001,
      "income_high": 100000,
      "tx_gt_1000": false
    },
    {
      "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]",
      "owner": "SP",
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "None",
      "income_low": null,
      "income_high": null,
      "tx_gt_1000": false
    },
    {
      "asset_name": "Amazon.com, Inc. (AMZN) [ST]",
      "owner": "SP",
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "None",
      "income_low": null,
      "income_high": null,
      "tx_gt_1000": false
    }
  ]
}
```
@@ -0,0 +1,78 @@

```py
#!/usr/bin/env python3
"""
Parses and extracts structured data from the screenshot at the given URL:
https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504

Full financial disclosure report:
https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2023/10059734.pdf

This script assumes your API key is set up in the default way, i.e. environment variable: $OPENAI_API_KEY
https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
"""

import base64
import json
from openai import OpenAI
from pathlib import Path
from pydantic import BaseModel, Field
from typing import Union

INPUT_URL = "https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504"


# OpenAI examples of Structured Output scripts and data definitions
# https://platform.openai.com/docs/guides/structured-outputs/examples?context=ex2

# Define the data structures in Pydantic:
# a Disclosure Report has a list of assets
class Asset(BaseModel):
    asset_name: str
    owner: str
    asset_value_low: Union[int, None]
    asset_value_high: Union[int, None]
    income_type: str
    income_low: Union[int, None]
    income_high: Union[int, None]
    tx_gt_1000: bool


class DisclosureReport(BaseModel):
    assets: list[Asset]


## initialize OpenAI client
client = OpenAI()

# Example of message format for passing in an image via URL
# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
input_messages = [
    {"role": "system", "content": "Output the result in JSON format."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this image"},
            {
                "type": "image_url",
                "image_url": {"url": INPUT_URL},
            },
        ],
    },
]

# gpt-4o-mini is cheap and fast and has vision capabilities
response = client.beta.chat.completions.parse(
    response_format=DisclosureReport, model="gpt-4o-mini", messages=input_messages
)

message = response.choices[0].message

# Print it out in readable format
obj = json.loads(message.content)
print(json.dumps(obj, indent=2))
```
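
The typed fields in `Asset` are what separate this revision's output (integers and `null`) from the earlier output that kept values like `"$5,000,001"` as strings. The sketch below checks that behavior locally with pydantic alone, without calling the API; the two sample rows are hand-copied from the outputs in this gist, and the variable names are illustrative.

```python
from typing import Union

from pydantic import BaseModel, ValidationError


class Asset(BaseModel):
    # Same fields as the Asset model in the script above
    asset_name: str
    owner: str
    asset_value_low: Union[int, None]
    asset_value_high: Union[int, None]
    income_type: str
    income_low: Union[int, None]
    income_high: Union[int, None]
    tx_gt_1000: bool


# Row shaped like the typed output: integers, with None where no income is listed
typed_row = {
    "asset_name": "Amazon.com, Inc. (AMZN) [ST]",
    "owner": "SP",
    "asset_value_low": 5000001,
    "asset_value_high": 25000000,
    "income_type": "None",
    "income_low": None,
    "income_high": None,
    "tx_gt_1000": False,
}
asset = Asset(**typed_row)

# Same row, but with a formatted string where an integer is expected
string_row = dict(typed_row, asset_value_low="$5,000,001")
try:
    Asset(**string_row)
    string_accepted = True
except ValidationError:
    string_accepted = False

print(asset.income_low, string_accepted)
```

`Union[int, None]` is what allows `null` for assets with no listed income while still refusing a formatted dollar string.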
dannguyen revised this gist
Oct 7, 2024 . 2 changed files with 54 additions and 0 deletions.
@@ -1,3 +1,5 @@

# Police Blotter data extraction using OpenAI's Structured Output

<img width="905" alt="image" src="https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504">

@@ -0,0 +1,52 @@

```json
{
  "assets": [
    {
      "asset_name": "11 Zinfandel Lane - Home & Vineyard",
      "owner": "JT",
      "asset_value_low": "$5,000,001",
      "asset_value_high": "$25,000,000",
      "income_type": "Grape Sales",
      "tx_gt_1000": true
    },
    {
      "asset_name": "25 Point Lobos - Commercial Property",
      "owner": "SP",
      "asset_value_low": "$5,000,001",
      "asset_value_high": "$25,000,000",
      "income_type": "Rent",
      "tx_gt_1000": true
    },
    {
      "asset_name": "45 Belden Place - Four Story Commercial Building",
      "owner": "RP",
      "asset_value_low": "$5,000,001",
      "asset_value_high": "$25,000,000",
      "income_type": "Rent",
      "tx_gt_1000": true
    },
    {
      "asset_name": "AllianceBernstein Holding L.P. Units (AB)",
      "owner": "OL",
      "asset_value_low": "$1,000,001",
      "asset_value_high": "$5,000,000",
      "income_type": "Partnership Income",
      "tx_gt_1000": true
    },
    {
      "asset_name": "Alphabet Inc. - Class A (GOOGL)",
      "owner": "SP",
      "asset_value_low": "$5,000,001",
      "asset_value_high": "$25,000,000",
      "income_type": "None",
      "tx_gt_1000": false
    },
    {
      "asset_name": "Amazon.com, Inc. (AMZN)",
      "owner": "SP",
      "asset_value_low": "$5,000,001",
      "asset_value_high": "$25,000,000",
      "income_type": "None",
      "tx_gt_1000": false
    }
  ]
}
```
dannguyen revised this gist
Oct 5, 2024 . 1 changed file with 112 additions and 88 deletions.
@@ -1,48 +1,48 @@
{
  "incidents": [
    {
      "date": "April 1",
      "time": "9:30 p.m.",
      "location": "Toyon parking lot",
      "summary": "License plate stolen",
      "category": "property",
      "property_damage": "rear license plate",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Man",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot."
    },
    {
      "date": "April 1",
      "time": "unknown",
      "location": "unknown",
      "summary": "Unauthorized purchase reported",
      "category": "other",
      "property_damage": "computer equipment",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Female administrator",
          "gender": "female",
          "is_student": false
        }
      ],
      "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry\u2019s Electronics sometime in the past five months."
    },
    {
      "date": "April 2",
      "time": "3:30 p.m.",
      "location": "unknown",
      "summary": "Report of license plate stolen",
      "category": "property",
      "property_damage": "rear license plate",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
@@ -52,51 +52,51 @@
          "is_student": false
        }
      ],
      "incident_text": "Another man reported that the rear license plate was missing from his vehicle, which had been parked near 655 Serra Street."
    },
    {
      "date": "April 1",
      "time": "11:40 p.m.",
      "location": "Rains apartments",
      "summary": "Vandalism reported",
      "category": "property",
      "property_damage": "bike wheel vandalized",
      "arrest_made": false,
      "perpetrators": [
        {
          "description": "Unknown suspects",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "victims": [
        {
          "description": "Graduate student",
          "gender": "unknown",
          "is_student": true
        }
      ],
      "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments."
    },
    {
      "date": "April 1",
      "time": "11:40 p.m.",
      "location": "Adelfa",
      "summary": "Medical call responded",
      "category": "other",
      "property_damage": "",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [],
      "incident_text": "Police responded to an alcohol-related medical call in Adelfa."
    },
    {
      "date": "April 1",
      "time": "unknown",
      "location": "Gates Computer Science Building",
      "summary": "Theft reported",
      "category": "property",
      "property_damage": "books stolen",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
@@ -109,103 +109,85 @@
      "incident_text": "Three computer science graduate students reported that they had books stolen from the Gates Computer Science Building in the previous five months."
    },
    {
      "date": "April 3",
      "time": "10:20 p.m.",
      "location": "unknown",
      "summary": "Citation issued",
      "category": "other",
      "property_damage": "",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "Male undergraduate",
          "gender": "male",
          "is_student": true
        }
      ],
      "incident_text": "A male undergraduate was cited and released for running a stop sign on his bike and for not having a bike light or bike license."
    },
    {
      "date": "April 3",
      "time": "11:20 p.m.",
      "location": "Lomit Drive",
      "summary": "Minor in possession cited",
      "category": "other",
      "property_damage": "",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "Woman",
          "gender": "female",
          "is_student": false
        }
      ],
      "incident_text": "A woman was cited and released on Lomita Drive for being a minor in possession of alcohol."
    },
    {
      "date": "April 3",
      "time": "11:40 p.m.",
      "location": "unknown",
      "summary": "Citation issued",
      "category": "other",
      "property_damage": "",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "Man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A man was cited and released for driving with a suspended license after he was stopped near Galvez Street and Campus Drive."
    },
    {
      "date": "April 4",
      "time": "1:00 a.m.",
      "location": "unknown",
      "summary": "Damage report",
      "category": "property",
      "property_damage": "car damage",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A man reported that someone walked over the top of his car, causing damage to the trunk, top and hood."
    },
    {
      "date": "April 4",
      "time": "1:31 a.m.",
      "location": "Mayfield Avenue",
      "summary": "Cited for biking rules",
      "category": "other",
      "property_damage": "",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
@@ -217,12 +199,12 @@
      "incident_text": "Two men were cited and released for walking with bikes U-Locked to themselves on Mayfield Avenue after neither could establish ownership of the bikes."
    },
    {
      "date": "April 4",
      "time": "3:05 a.m.",
      "location": "Sigma Alpha Epsilon",
      "summary": "Altercation reported",
      "category": "other",
      "property_damage": "injury reported",
      "arrest_made": false,
      "perpetrators": [
        {
@@ -233,61 +215,103 @@
      ],
      "victims": [
        {
          "description": "Male undergraduate",
          "gender": "male",
          "is_student": true
        }
      ],
      "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
    },
    {
      "date": "April 5",
      "time": "2:45 a.m.",
      "location": "unknown",
      "summary": "Arrest made",
      "category": "other",
      "property_damage": "",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "Man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "Police arrested a man for being drunk in public on Palm Drive near the entrance arch."
    },
    {
      "date": "April 6",
      "time": "7:15 a.m.",
      "location": "Andronico\u2019s Supermarket",
      "summary": "Assistance provided",
      "category": "other",
      "property_damage": "",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [],
      "incident_text": "Police assisted Andronico\u2019s Supermarket in detaining a suspect after a call was made on a blue emergency phone. Palo Alto police later took the suspect into custody."
    },
    {
      "date": "April 6",
      "time": "3:20 p.m.",
      "location": "Studio 3 on Angell Court",
      "summary": "Fire call responded",
      "category": "other",
      "property_damage": "",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [],
      "incident_text": "Police responded to an accidental fire call to Studio 3 on Angell Court."
    },
    {
      "date": "April 6",
      "time": "9:00 p.m.",
      "location": "unknown",
      "summary": "Report of found property",
      "category": "property",
      "property_damage": "personal property",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A man reported that he found someone else\u2019s personal property in his locked car."
    },
    {
      "date": "April 6",
      "time": "1:45 a.m.",
      "location": "San Jose main jail",
      "summary": "Arrest for trespassing",
      "category": "other",
      "property_damage": "",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "Local vagrant",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "A local vagrant was booked into the San Jose main jail for trespassing \u2013 his seventh time trespassing in a month."
    },
    {
      "date": "April 6",
      "time": "2:50 a.m.",
      "location": "Palm Drive",
      "summary": "Citation issued",
      "category": "other",
      "property_damage": "",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "Man",
          "gender": "male",
          "is_student": false
        }
dannguyen renamed this gist
Oct 5, 2024 . 1 changed file with 0 additions and 0 deletions.
File renamed without changes.
dannguyen renamed this gist
Oct 5, 2024 . 1 changed file with 33 additions and 17 deletions.
@@ -1,7 +1,8 @@
#!/usr/bin/env python3

"""
blotter-extract.py

Parses and extracts structured data from the screenshot of a police blotter at the given URL:
https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703

@@ -10,7 +11,6 @@
https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
"""

import base64
import json
from openai import OpenAI
@@ -19,19 +19,28 @@

INPUT_URL = "https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703"

# OpenAI examples of Structured Output scripts and data definitions
# https://platform.openai.com/docs/guides/structured-outputs/examples?context=ex2

# Define the data structures in Pydantic:
# an Incident involves several Persons (victims, perpetrators)
class Person(BaseModel):
    description: str
    gender: str
    is_student: bool

# Pydantic docs on field descriptions:
# https://docs.pydantic.dev/latest/concepts/fields/
class Incident(BaseModel):
    date: str
    time: str
    location: str
    summary: str = Field(description="""Brief summary, less than 30 chars""")
    category: str = Field(
        description="""Type of crime, broadly speaking: "violent" , "property", "traffic", or "other" """
    )
    property_damage: str = Field(
        description="""If a property crime, then a description of what was stolen/damaged/lost"""
@@ -44,36 +53,43 @@ class Incident(BaseModel):
    )

class Blotter(BaseModel):
    incidents: list[Incident]

## done defining the data structures
##################################################

## initialize OpenAI client
client = OpenAI()

# Example of message format for passing in an image via URL
# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
input_messages = [
    {"role": "system", "content": "Output the result in JSON format."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this image"},
            {
                "type": "image_url",
                "image_url": {"url": INPUT_URL},
            },
        ],
    },
]

# gpt-4o-mini is cheap and fast and has vision capabilities
response = client.beta.chat.completions.parse(
    response_format=Blotter, model="gpt-4o-mini", messages=input_messages
)

message = response.choices[0].message

# Print it out in readable format
obj = json.loads(message.content)
print(json.dumps(obj, indent=2))
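The field descriptions in the script ask for a summary under 30 characters and a category drawn from a fixed set, but a description is a hint to the model rather than an enforced constraint. Below is a minimal sketch, using only the standard library, of checking one incident from the printed JSON against those hints. The function name `check_incident` and the sample record are illustrative assumptions and are not part of the gist.

```python
# Values taken from the Field descriptions in the script
ALLOWED_CATEGORIES = {"violent", "property", "traffic", "other"}
MAX_SUMMARY_CHARS = 30  # "Brief summary, less than 30 chars"


def check_incident(incident: dict) -> list[str]:
    """Return a list of human-readable problems found in one incident dict."""
    problems = []
    if incident.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {incident.get('category')!r}")
    if len(incident.get("summary", "")) >= MAX_SUMMARY_CHARS:
        problems.append("summary is 30 chars or longer")
    return problems


# Hypothetical sample record, shaped like one entry of the script's output
sample = {"summary": "License plate stolen", "category": "property"}
print(check_incident(sample))
```

Running a check like this over `obj["incidents"]` after each call would flag responses where the model drifted from the requested vocabulary.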
dannguyen revised this gist
Oct 5, 2024 . 1 changed file with 298 additions and 0 deletions.
@@ -0,0 +1,298 @@
{
  "incidents": [
    {
      "date": "Thursday, April 1",
      "time": "9:30 p.m.",
      "location": "Toyon parking lot",
      "summary": "License plate stolen",
      "category": "property",
      "property_damage": "Rear license plate",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot."
    },
    {
      "date": "Thursday, April 1",
      "time": "unknown",
      "location": "Fry's Electronics",
      "summary": "Unauthorized purchase",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A female administrator in Materials Science and Engineering",
          "gender": "female",
          "is_student": false
        }
      ],
      "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months."
    },
    {
      "date": "Friday, April 2",
      "time": "3:30 p.m.",
      "location": "unknown",
      "summary": "License plate stolen",
      "category": "property",
      "property_damage": "Rear license plate",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Another man",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "Another man reported that the rear license plate was missing from his vehicle."
    },
    {
      "date": "Thursday, April 1",
      "time": "11:40 p.m.",
      "location": "Rains apartments",
      "summary": "Vandalism",
      "category": "property",
      "property_damage": "Wheel of bike",
      "arrest_made": false,
      "perpetrators": [
        {
          "description": "Two unknown suspects",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "victims": [
        {
          "description": "A graduate student in the School of Education",
          "gender": "unknown",
          "is_student": true
        }
      ],
      "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments."
    },
    {
      "date": "Thursday, April 1",
      "time": "11:40 p.m.",
      "location": "Adelifa",
      "summary": "Medical call",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [],
      "incident_text": "Police responded to an alcohol-related medical call in Adelifa."
    },
    {
      "date": "Thursday, April 1",
      "time": "unknown",
      "location": "Gates Computer Science Building",
      "summary": "Theft",
      "category": "property",
      "property_damage": "Books",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Three computer science graduate students",
          "gender": "unknown",
          "is_student": true
        }
      ],
      "incident_text": "Three computer science graduate students reported that they had books stolen from the Gates Computer Science Building in the previous five months."
    },
    {
      "date": "Saturday, April 3",
      "time": "10:20 p.m.",
      "location": "unknown",
      "summary": "Traffic violation",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A male undergraduate",
          "gender": "male",
          "is_student": true
        }
      ],
      "incident_text": "A male undergraduate was cited and released for running a stop sign on his bike and for not having a bike light or bike license."
    },
    {
      "date": "Saturday, April 3",
      "time": "11:20 p.m.",
      "location": "Lomita Drive",
      "summary": "Possession of alcohol",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A woman",
          "gender": "female",
          "is_student": false
        }
      ],
      "incident_text": "A woman was cited and released on Lomita Drive for being a minor in possession of alcohol."
    },
    {
      "date": "Saturday, April 3",
      "time": "11:40 p.m.",
      "location": "unknown",
      "summary": "Driving without license",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A man was cited and released for driving with a suspended license after he was stopped near Galvez Street and Campus Drive."
    },
    {
      "date": "Saturday, April 4",
      "time": "11:45 p.m.",
      "location": "Lomita Drive",
      "summary": "Urinating in public",
      "category": "other",
      "property_damage": "None",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "A woman",
          "gender": "female",
          "is_student": false
        }
      ],
      "incident_text": "A woman was cited and released for urinating in public on Lomita Drive."
    },
    {
      "date": "Sunday, April 4",
      "time": "1:00 a.m.",
      "location": "unknown",
      "summary": "Traffic violation",
      "category": "other",
      "property_damage": "Damage to car",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A man reported that someone walked over the top of his car, causing damage to the trunk, top and hood."
    },
    {
      "date": "Sunday, April 4",
      "time": "1:31 a.m.",
      "location": "Mayfield Avenue",
      "summary": "Possession of stolen bikes",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Two men",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "Two men were cited and released for walking with bikes U-Locked to themselves on Mayfield Avenue after neither could establish ownership of the bikes."
    },
    {
      "date": "Sunday, April 4",
      "time": "3:05 a.m.",
      "location": "Sigma Alpha Epsilon",
      "summary": "Altercation",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [
        {
          "description": "Two undergraduates",
          "gender": "unknown",
          "is_student": true
        }
      ],
      "victims": [
        {
          "description": "A male undergraduate",
          "gender": "male",
          "is_student": true
        }
      ],
      "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
    },
    {
      "date": "Monday, April 5",
      "time": "2:45 a.m.",
      "location": "unknown",
      "summary": "Public intoxication",
      "category": "other",
      "property_damage": "None",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "Police arrested a man for being drunk in public."
    },
    {
      "date": "Tuesday, April 6",
      "time": "1:45 a.m.",
      "location": "San Jose main jail",
      "summary": "Trespassing",
      "category": "other",
      "property_damage": "None",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "A local vagrant",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A local vagrant was booked into the San Jose main jail for trespassing \u2014 his seventh time trespassing in a month."
    },
    {
      "date": "Tuesday, April 6",
      "time": "2:50 a.m.",
      "location": "Palm Drive",
      "summary": "Driving without a license",
      "category": "other",
      "property_damage": "None",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A man was cited and released for driving without a license on Palm Drive."
    }
  ]
}
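Once the output is saved as JSON, it can be summarized with the standard library alone. The sketch below tallies incidents by category and lists the dates on which an arrest was recorded. The three-record `raw` string is a hypothetical excerpt shaped like the output above, not the full file, and the variable names are illustrative.

```python
import json
from collections import Counter

# Hypothetical excerpt shaped like the JSON output above (most fields omitted)
raw = """{"incidents": [
  {"date": "Thursday, April 1", "category": "property", "arrest_made": false},
  {"date": "Thursday, April 1", "category": "other", "arrest_made": false},
  {"date": "Monday, April 5", "category": "other", "arrest_made": true}
]}"""

incidents = json.loads(raw)["incidents"]

# Count incidents per category and collect the dates with an arrest
by_category = Counter(incident["category"] for incident in incidents)
arrest_dates = [incident["date"] for incident in incidents if incident["arrest_made"]]

print(by_category)
print(arrest_dates)
```

A tally like this is also a quick way to compare two runs of the script, since fields such as `arrest_made` and `category` vary between the outputs in this revision history.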
dannguyen revised this gist
Oct 5, 2024 . 2 changed files with 4 additions and 2 deletions.
@@ -1,2 +1,3 @@
# Police Blotter data extraction using OpenAI's Structured Output



@@ -1,8 +1,9 @@
#!/usr/bin/env python3

"""
Parses and extracts structured data from the screenshot at the given URL:
https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703

This script assumes your API key is set up in the default way, i.e. environment variable:
$OPENAI_API_KEY
@@ -16,7 +17,7 @@
from pathlib import Path
from pydantic import BaseModel, Field

INPUT_URL = "https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703"

class Person(BaseModel):
    description: str
dannguyen revised this gist
Oct 5, 2024 . 2 changed files with 2 additions and 0 deletions.
@@ -0,0 +1,2 @@
# Police Blotter data extraction using OpenAI's Structured Output

File renamed without changes.
dannguyen created this gist
Oct 5, 2024
@@ -0,0 +1,78 @@
#!/usr/bin/env python3

"""
Parses the screenshot at the given URL:

This script assumes your API key is set up in the default way, i.e. environment variable:
$OPENAI_API_KEY

https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
"""

import base64
import json
from openai import OpenAI
from pathlib import Path
from pydantic import BaseModel, Field

# Set this to the URL of the image to parse
INPUT_URL = ""

class Person(BaseModel):
    description: str
    gender: str
    is_student: bool

class Incident(BaseModel):
    date: str
    time: str
    location: str
    summary: str = Field(description="""Brief summary, less than 30 chars""")
    category: str = Field(
        description="""Type of crime, either "violent" or "property", or "other" """
    )
    property_damage: str = Field(
        description="""If a property crime, then a description of what was stolen/damaged/lost"""
    )
    arrest_made: bool
    perpetrators: list[Person]
    victims: list[Person]
    incident_text: str = Field(
        description="""Include the complete verbatim text from the input that pertains to the incident"""
    )

class Blotter(BaseModel):
    incidents: list[Incident]

## initialize OpenAI client
client = OpenAI()

# The image is passed to the model by URL
input_messages = [
    {"role": "system", "content": "Output the result in JSON format."},
    {"role": "user", "content": [
        {"type": "text", "text": "Extract the text from this image"},
        {
            "type": "image_url",
            "image_url": {"url": INPUT_URL},
        },
    ]},
]

response = client.beta.chat.completions.parse(
    model="gpt-4o-mini", response_format=Blotter, messages=input_messages
)

message = response.choices[0].message
obj = json.loads(message.content)
print(json.dumps(obj, indent=2))
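The script sends the image by web URL. For an image that only exists on disk, the same `image_url` field can carry a base64 data URL instead. The sketch below builds such a user message with the standard library only. The helper name `image_message` and the placeholder bytes are assumptions for illustration, and in real use the bytes would come from reading a PNG file in binary mode.

```python
import base64


def image_message(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build a user message that embeds the image as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this image"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes stand in for the contents of a real PNG file
msg = image_message(b"not-a-real-image")
url = msg["content"][1]["image_url"]["url"]
print(url.split(",", 1)[0])
```

The returned dict could replace the user entry in `input_messages`, leaving the system message and the rest of the call unchanged.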