@mickley
Created November 5, 2024 22:32
Custom VoucherVision prompt used by the Oregon State University Herbarium.
prompt_author: James Mickley
prompt_author_institution: Oregon State University
prompt_name: OSUvA_medium
prompt_version: v-0-2
prompt_description: Prompt developed by James Mickley. Based loosely on SLTPvA_medium v1.0
  ~ Changelog ~
  * v-0-2 - 2024-07-22 - Adds datum and an OSC-specific catalogNumber; specifies that scientificName excludes taxonomic authors,
  that country is spelled out in full, and that county omits the region type.
LLM: General Purpose
instructions: 1. Refactor the unstructured OCR text into a dictionary based on the JSON structure outlined below.
  2. Map the unstructured OCR text to the appropriate JSON key and populate the field given the user-defined rules.
  3. JSON key values are permitted to remain empty strings if the corresponding information is not found in the unstructured OCR text.
  4. Duplicate dictionary fields are not allowed.
  5. Ensure all JSON keys are in camel case and are enclosed in double quotes.
  6. Ensure new JSON field values follow sentence case capitalization and are enclosed in double quotes.
  7. Ensure all key-value pairs in the JSON dictionary strictly adhere to the format and data types specified in the template.
  8. Ensure output JSON string is valid JSON format. It should not have trailing commas or unquoted keys.
  9. Only return a JSON dictionary represented as a string. You should not explain your answer.
json_formatting_instructions: This section provides rules for formatting each JSON value organized by the JSON key.
rules:
  catalogNumber: 'A barcode identifier with 5-7 digits that starts with ''OSC'' or ''OSC-V-'''
  scientificName: The scientific name of the taxon including genus, specific epithet,
    and any lower classifications. Remove taxonomic authors.
  genus: Taxonomic determination to genus. Genus must be capitalized. If genus is not present,
    use taxonomic family name.
  specificEpithet: The first (species) epithet of the scientificName. Only include the species epithet.
  infraspecificEpithet: 'The infraspecific epithet specified in the taxonomic name (''var.'', ''ssp.'', ''subsp.'', ''f.''). Prefix with the infraspecific rank.'
  collector: The name of the primary collector of the specimen.
  associatedCollectors: A comma separated list of full names for additional collectors, if any.
  collectorNumber: Number or unique identifier associated with the collector.
  verbatimDate: Date of collection exactly as it appears in the unformatted text. Do not change the format or correct typos.
  eventDate: 'Date the specimen was collected formatted as YYYY-MM-DD. If
    specific components of the date are unknown, they should be replaced with
    zeros. Examples: ''0000-00-00'' if the entire date is unknown, ''YYYY-00-00''
    if only the year is known, and ''YYYY-MM-00'' if year and month are known
    but day is not.'
  country: Country where the specimen was collected. Spell out the country's name if an abbreviation is given.
  stateProvince: The name of the state, province, canton or region where the specimen was collected.
  county: The name of the county, parish, shire or borough where the specimen was collected. Do not include the type of region or its abbreviation.
  locality: Description of geographic location, landscape, landmarks, regional
    features, nearby places, or other information specifying the site of collection.
    Exclude coordinates and elevation.
  decimalLatitude: Latitude decimal coordinate. Correct and convert the verbatim coordinates to conform with the decimal degrees GPS coordinate format.
  decimalLongitude: Longitude decimal coordinate. Correct and convert the verbatim coordinates to conform with the decimal degrees GPS coordinate format.
  verbatimCoordinates: Latitude/Longitude, TRS, or UTM coordinates exactly as they appear in the unformatted OCR text. Exclude elevation.
  datum: Datum specified in the unformatted text. Possible values include [WGS84, NAD83, NAD27].
  verbatimElevation: The elevation in the unformatted OCR text. Include any units.
  cultivated: Cultivated plants are intentionally grown by humans. In text descriptions,
    look for planting dates, garden locations, ornamental, cultivar names, garden,
    or farm to indicate a cultivated plant. Set to 'cultivated' if cultivated, otherwise use an empty string.
  habitat: Description of a plant's habitat or the location where the specimen was collected. Ignore descriptions of the plant itself.
  plantDescription: Description of plant features such as leaf shape, size, color,
    stem texture, height, flower structure, scent, fruit or seed characteristics,
    root system type, overall growth habit and form, smell or secretions,
    presence of hairs or bristles, and any other distinguishing morphological
    or physiological characteristics.
  associatedSpecies: 'List of species associated with the specimen.
    Usually species names are preceded by ''associated'' or ''with''. When multiple taxa are
    listed together, their names should be separated by commas.'
mapping:
  TAXONOMY:
  - catalogNumber
  - scientificName
  - genus
  - specificEpithet
  - infraspecificEpithet
  GEOGRAPHY:
  - country
  - stateProvince
  - county
  - decimalLatitude
  - decimalLongitude
  - verbatimCoordinates
  - datum
  LOCALITY:
  - locality
  - habitat
  - associatedSpecies
  - verbatimElevation
  COLLECTING:
  - collector
  - associatedCollectors
  - collectorNumber
  - verbatimDate
  - eventDate
  - cultivated
  - plantDescription
  MISC: []
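
The rules above can be spot-checked on a model's reply before it is imported anywhere. The following is a minimal Python sketch of such a check, not part of VoucherVision itself. The function names (validate_record, check_event_date), the regular expressions, and the assumption that every value is a string are illustrative readings of the prompt. In particular, the catalogNumber pattern assumes the prefix is followed directly by 5-7 digits, and the eventDate check assumes zeros may only fill trailing unknown components, as in the examples given in the rule.

```python
import json
import re

# Keys listed under `mapping` in the prompt.
EXPECTED_KEYS = [
    "catalogNumber", "scientificName", "genus", "specificEpithet",
    "infraspecificEpithet", "country", "stateProvince", "county",
    "decimalLatitude", "decimalLongitude", "verbatimCoordinates", "datum",
    "locality", "habitat", "associatedSpecies", "verbatimElevation",
    "collector", "associatedCollectors", "collectorNumber", "verbatimDate",
    "eventDate", "cultivated", "plantDescription",
]

# Assumed reading of the catalogNumber rule: 'OSC' or 'OSC-V-' then 5-7 digits.
CATALOG_RE = re.compile(r"^OSC(?:-V-)?\d{5,7}$")
EVENT_DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def _reject_duplicates(pairs):
    """object_pairs_hook that enforces instruction 4 (no duplicate fields)."""
    keys = [key for key, _ in pairs]
    dupes = sorted({key for key in keys if keys.count(key) > 1})
    if dupes:
        raise ValueError(f"duplicate keys: {dupes}")
    return dict(pairs)


def check_event_date(value):
    """True if value is YYYY-MM-DD with zeros only for unknown trailing parts."""
    match = EVENT_DATE_RE.match(value)
    if not match:
        return False
    year, month, day = (int(part) for part in match.groups())
    if month > 12 or day > 31:
        return False
    if year == 0 and month != 0:  # unknown year implies unknown month
        return False
    if month == 0 and day != 0:  # unknown month implies unknown day
        return False
    return True


def validate_record(raw):
    """Return a list of problems found in a raw JSON reply (empty if none)."""
    try:
        record = json.loads(raw, object_pairs_hook=_reject_duplicates)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key in EXPECTED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], str):  # assumption: all values are strings
            problems.append(f"non-string value: {key}")
    for key in record:
        if key not in EXPECTED_KEYS:
            problems.append(f"unexpected key: {key}")

    catalog = record.get("catalogNumber", "")
    if isinstance(catalog, str) and catalog and not CATALOG_RE.match(catalog):
        problems.append(f"bad catalogNumber: {catalog}")
    event_date = record.get("eventDate", "")
    if isinstance(event_date, str) and event_date and not check_event_date(event_date):
        problems.append(f"bad eventDate: {event_date}")
    cultivated = record.get("cultivated", "")
    if cultivated not in ("", "cultivated"):
        problems.append(f"bad cultivated value: {cultivated!r}")
    return problems


# Usage: build a record with every key empty, fill in two fields, and validate it.
sample = {key: "" for key in EXPECTED_KEYS}
sample["catalogNumber"] = "OSC-V-123456"
sample["eventDate"] = "1987-06-00"
print(validate_record(json.dumps(sample)))
```

Empty strings pass every check because instruction 3 allows a field to stay empty when the label does not carry that information. The datum value is deliberately left unchecked, since the rule says the possible values "include" the three listed rather than being limited to them.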