@dannguyen
Last active July 29, 2025 14:26
Using Python 3.x and Google Cloud Vision API to OCR scanned documents to extract structured data

Using Google Cloud Vision API to OCR scanned documents to extract structured data

Just a quickie test to see if Google Cloud Vision can be used to effectively OCR a scanned data table, in the way that products like ABBYY FineReader provide image-ocr-to-Excel.

The short answer: No. While Cloud Vision provides bounding polygon coordinates in its output, it doesn't provide them at the word or region level, which is what would be needed to calculate the data delimiters.

You can read more about getting started with the Google Cloud Vision API in its official docs. The instructions in this gist are a somewhat simplified version of the official instructions found here:

https://github.com/GoogleCloudPlatform/cloud-vision/tree/master/python/text

Set up a Google developer account and get an API key:

https://cloud.google.com/vision/docs/auth-template/cloud-api-auth#set_up_an_api_key

How to run

The cloudvisreq.py script is included at the bottom of this gist.

$  python cloudvisreq.py API_KEY image1.jpg image2.png

Sample image:

Courtesy of Eli Lilly:

image

Result:

Wrote 3021 bytes to jsons/pdftable.png.json
---------------------------------------------
    Bounding Polygon:
{'vertices': [{'y': 272, 'x': 212}, {'y': 272, 'x': 3066}, {'y': 2295, 'x': 3066}, {'y': 2295, 'x': 212}]}
    Text:
Lilly other Health Care Professional Registry
Data updated on Monday, March 3, 2014
Payments Made: Q1-Q4 2013
The Other Health Care Professional Registry reports direct and indirect payments by Lilly, as well as Lilly's portion of alliance partnership paymen
o health care professionals other than physicians serving as faculty
members. When the "Entity Paid
a company, hospital, or university, and there
a different name on the provider of service
reflects an indirect payment which may or may not actually have been received in
whole or in part by the
ted service provider. Copyright 2014 Eli Lilly and Company. All rights reserved
Note: Due to differences in scope and definitions, the data reported in this report may differ from data included in reports submitted by Lilly for compliance with state payment reporting laws
Payments
All Amounts in US Dollars
Provider of Service
*Payments for reimbursed expenses are not compensation
Entity Paid
Patient Education Professional
Education
Education
WARD, JANET
OH ABDALLAH, RITA $14,025 $500 $2,823 S17348
ABDALLAH, RITA
FAIRVIEW PARK
OH AGOSTI, CAROL $1,050 1,631 $355 $3036
AGOSTI, CAROL
TOLEDO
AINSWORTH, ABBY CORPUS CHRISTI AINSWORTH, ABBY $113 $113
AKERS, REBECCA VINCENNES IN AKERS, REBECCA
$2,400 $2,400
IALCHEMIPHARMA LLC, WAYNE PA BELAZI, DEA $11,889 $1.99 $12,088
MEADOW VISTA
ALEMAN, MARY
ALEMAN, MARY
ALEXANDER, LISA
ALEXANDER, LISA
FALLEN, CONNETTE COTTAGE GROVE
MN ALLEN, CONNETTE $6,113 $875 $2,347 $9,335
MALLISON HEINRICH, LL. LUBBOCK HEINRICH, ALLISON $4,200 $1.238 $5.438
ALLISON, CRYSTAL GOODYEAR
AZ ALLISON, CRYSTAL s250 $250
MALLISON, SUSAN
EASTON
PA ALLISON, SUSAN $7950 $2.138 $2.252 $12.340
ALVAREZ, MICHELLE MCALLEN TX ALVAREZ, MICHELLE
S4,650 $750 $244 $5,644
GRANADA HILLS CA ANGELES, ADA
$39,150 $11,456 $938 $9,181 $60,725
ANGELES WORLD, INC.,
ARBOLEDA, JANE
MIAMI FL ARBOLEDA, JANE
S375 $4,875 $549 $5,799
WACO ARMSTRONG, JULIE
$2,400 $750 $3,150
ARMSTRONG, JULIE
S900 $2,006 $657 $3,563
ARNOLD, MARY BETH
AUGUSTA
GA ARNOLD, MARY BETH
ARRAMBIDE, ROBIN KATY TX ARRAMBIDE, ROBIN
$11719 $11719
ARRINGTON, WANDA CHARLOTTE NC ARRINGTON, WANDA
S2,531 $258 $2789
PA ASH FORD, RICHARD
$3,525 S339 $3,864
WILKES BARRE
ASHFORD, RICHARD
ATTANASIO, MICHAEL SEWELL NJ ATTANASIO, MICHAEL
$450 $450
INDIANA PA AvOLIO JOHN $225 $225
AVOLIO JOHN
BABEY, CHRISTINE SCOTTSDALE AZ BABEY, CHRISTINE $5.100 $13,294 $2.135 $20,529
DALLAS BAILEY-GRAY, PATRICK 226 $226
BAILEY-GRAY, PATRICK
DECATUR IL BAKER, BENITA $17.100 S11738 $6,884 $35,722
BAKER, BENITA
from base64 import b64encode
from sys import argv
from os import makedirs
from os.path import join, basename
import json
import requests

ENDPOINT_URL = 'https://vision.googleapis.com/v1/images:annotate'
RESULTS_DIR = 'jsons'
# exist_ok requires Python 3.2 or later
makedirs(RESULTS_DIR, exist_ok=True)


def make_image_data_dict(image_filenames):
    """Build one TEXT_DETECTION request per image, with the image base64-encoded."""
    img_requests = []
    for imgname in image_filenames:
        with open(imgname, 'rb') as f:
            ctxt = b64encode(f.read()).decode()
            img_requests.append({
                'image': {'content': ctxt},
                'features': [{
                    'type': 'TEXT_DETECTION',
                    'maxResults': 1
                }]
            })
    return img_requests


def make_image_data(image_filenames):
    """Serialize the batch of image requests as a UTF-8 encoded JSON body."""
    imgdict = make_image_data_dict(image_filenames)
    return json.dumps({"requests": imgdict}).encode()


def request_ocr(api_key, image_filenames):
    """POST all images in a single batch request and return the HTTP response."""
    imgdata = make_image_data(image_filenames)
    response = requests.post(ENDPOINT_URL,
                             data=imgdata,
                             params={'key': api_key},
                             headers={'Content-Type': 'application/json'})
    return response


if __name__ == '__main__':
    # Star-unpacking requires Python 3. The "or [None]" fallback avoids a
    # ValueError when no arguments are given, so the usage message is shown.
    api_key, *image_filenames = argv[1:] or [None]
    if not api_key or not image_filenames:
        print("""
    Please supply an api key, then one or more image filenames

    $ python cloudvisreq.py api_key image1.jpg image2.png""")
    else:
        response = request_ocr(api_key, image_filenames)
        if response.status_code != 200:
            print(response.text)
        else:
            # Each entry in 'responses' is matched to its image by position
            for idx, resp in enumerate(response.json()['responses']):
                imgname = image_filenames[idx]
                datatxt = json.dumps(resp, indent=2)
                jpath = join(RESULTS_DIR, basename(imgname) + '.json')
                with open(jpath, 'w') as f:
                    f.write(datatxt)
                print("Wrote", len(datatxt), "bytes to", jpath)
                print("---------------------------------------------")
                # Skip images with no detected text instead of raising KeyError
                annotations = resp.get('textAnnotations')
                if not annotations:
                    print("    No text detected")
                    continue
                txtan = annotations[0]
                print("    Bounding Polygon:")
                print(txtan['boundingPoly'])
                print("    Text:")
                print(txtan['description'])
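
The JSON files written to the jsons directory can be inspected later without calling the API again. Below is a minimal sketch, assuming the saved response has the same shape as the sample result above (a first entry in textAnnotations with description and boundingPoly). The helper names bounding_box and summarize are hypothetical and not part of the gist, and treating a missing x or y as 0 is an assumption.

```python
import json

def bounding_box(annotation):
    """Return (min_x, min_y, max_x, max_y) for one annotation's polygon."""
    vertices = annotation['boundingPoly']['vertices']
    # Assumption: a coordinate missing from the JSON is treated as 0.
    xs = [v.get('x', 0) for v in vertices]
    ys = [v.get('y', 0) for v in vertices]
    return min(xs), min(ys), max(xs), max(ys)

def summarize(response):
    """Return (box, lines) for the first text annotation, or None if absent."""
    annotations = response.get('textAnnotations')
    if not annotations:
        return None
    first = annotations[0]
    return bounding_box(first), first['description'].splitlines()

# Stand-in for: json.load(open('jsons/pdftable.png.json'))
# The polygon and the two text lines are copied from the sample result above.
sample = json.loads(json.dumps({
    'textAnnotations': [{
        'boundingPoly': {'vertices': [
            {'y': 272, 'x': 212}, {'y': 272, 'x': 3066},
            {'y': 2295, 'x': 3066}, {'y': 2295, 'x': 212}]},
        'description': ('Lilly other Health Care Professional Registry\n'
                        'Data updated on Monday, March 3, 2014\n'),
    }]
}))

box, lines = summarize(sample)
print(box)
print(len(lines), "lines of text")
```

Reducing the polygon to a plain rectangle is a simplification; it is only adequate when the scan is not rotated.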
@sabuncu

sabuncu commented Mar 27, 2016

Thank you! This is so useful, and one of only a handful of interesting works currently available for Google Cloud Vision.

@lf94

lf94 commented Sep 3, 2016

Nice write up. I'm trying to simulate what Google Books does with scanning books, because I have a bunch of pdfs. I was wondering if Google's Cloud Vision would be good for this, but apparently not. I'll try tesseract but I heard it isn't great for preserving page layout? Then again, OCRopus (which is what Google Books uses apparently) doesn't seem to do it either (based on my short amount of searching...).

@milapj

milapj commented Sep 26, 2016

Hello, I am having trouble running this file. I got a couple of errors: "Invalid Syntax" for '*image' on line 46, and "makedirs() got an unexpected keyword argument 'exist_ok'". I am new to Python, so this is a bit confusing for me. Also, the api_key link is dead.

@Anid4u2c

Anid4u2c commented Nov 23, 2016

This is great preliminary research. I'm thinking a less complex table would resolve the "Backend timeout!" error. A low-level test with 6 columns, 7 rows, and 1 header row with wrapped text provided similar (-/+ 5 px) coordinates for the beginning (top-left & bottom-left) of the column-based text results. The position of the header row text was not as patterned, because the text was centered prior to the image's creation - the bounding coordinates for the first column and last column could hypothetically create a frame, providing that the header row's height - which could be increased by a subsequent column's text (if wrapped) is higher or lower than the previous bounding coordinates. My hope is that small-sample-group statistical analysis would predict the position of the text within each cell, for each column and row - by grouping their bounds. Those bounds would then be used to identify the grid required to rebuild the table. Further interest motivates me to implement machine learning to assist in the identification of tables, rather than parsing an image such as the "road signs" example above, and identifying it as a table.
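
The idea of grouping bounds into columns can be sketched as follows. This is only an illustration under stated assumptions: the left-edge x coordinates are made up, the 5 px tolerance comes from the observation above, and a simple running-mean grouping stands in for the statistical analysis that the comment proposes. The function name group_columns is hypothetical.

```python
def group_columns(left_edges, tolerance=5):
    """Group x coordinates into columns.

    A value joins the current column when it lies within `tolerance`
    pixels of that column's running mean; otherwise it starts a new one.
    """
    columns = []
    for x in sorted(left_edges):
        if columns:
            current = columns[-1]
            mean = sum(current) / len(current)
            if abs(x - mean) <= tolerance:
                current.append(x)
                continue
        columns.append([x])
    return columns

# Hypothetical left edges of text boxes from a three-column table
left_edges = [40, 43, 38, 210, 212, 207, 395, 399]
columns = group_columns(left_edges)

# The smallest x in each group approximates where the column begins
column_starts = [min(col) for col in columns]
print(columns)
print(column_starts)
```

The same grouping applied to top-edge y coordinates would give candidate row boundaries. Centered header text, as noted above, would not group cleanly by its left edge.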

This is the image I analyzed:
image295

Anyone can "Try the API" here: https://cloud.google.com/vision/

ghost commented Apr 18, 2017

@Milap7:
I think you need to use python3 instead of python2. I got the same error with python2.

@rahpalrah

How do I specify a language other than English, such as French or Hindi? These languages are supported by the Google Vision API, but I am unable to figure out where to add the setting for extraction.

@ngadhvi

ngadhvi commented Feb 13, 2019

I am working on a problem using the same API and want to know whether there is a way to provide coordinates to the Vision API so that it performs OCR only on that region. My trained model generates bounding box coordinates of the ROI (Region of Interest) that needs to be OCR'ed.

@SidharthPati

SidharthPati commented Mar 5, 2019

How do I specify a language other than English, such as French or Hindi? These languages are supported by the Google Vision API, but I am unable to figure out where to add the setting for extraction.

Add this to the request body:

"imageContext": { "languageHints": [ "hi" ] },
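
In terms of the script in this gist, that means adding an imageContext key next to image and features in each request entry. A minimal sketch, assuming the same request layout as make_image_data_dict; the function name make_request and the placeholder image string are hypothetical.

```python
import json

def make_request(b64_content, language_hints=None):
    """Build one request entry, optionally with language hints."""
    request = {
        'image': {'content': b64_content},
        'features': [{'type': 'TEXT_DETECTION', 'maxResults': 1}],
    }
    if language_hints:
        # imageContext is a sibling of 'image' and 'features'
        request['imageContext'] = {'languageHints': list(language_hints)}
    return request

# 'BASE64_IMAGE_DATA' is a placeholder for the base64-encoded image bytes
hindi_request = make_request('BASE64_IMAGE_DATA', ['hi'])
body = json.dumps({'requests': [hindi_request]}, indent=2)
print(body)
```

Leaving language_hints empty produces the same request that the gist's script sends.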

@satishkumarkolamudi

satishkumarkolamudi commented Dec 28, 2023

img5

This is my image. Unable to extract data properly from this image. Are there any limitations to Google OCR API, or is there any other way to overcome this issue?

Below is the description I got:

"Product Code",
"SS/CS",
"SS/CS",
"SS/CS",
"KR002/P",
"(3mm Foamex)",
"600x400mm High Visibility clothing must",
"be worn (3mm Foamex)",
"600x400mm Eye protection must be worn",
"(3mm Foamex)",
"400x600mm Safety helmets and boots must",
"be worn on this site (3mm Foamex)",
"KR022M/QR/P 600x400mm No unauthorised access",
"KR015M/P",
"KR019M/P",
"KR021M/P",
"KR024K/CS",
"Description",
"BESPOKE Construction Grade Self-adhesive",
"Vinyl Sign [Size H150xW300mm[White on Kier",
"Blue] to read: Inducton Room",
"BESPOKE Construction Grade Self-adhesive",
"Vinyl Sign [Size H150xW450mm] [White on Kier",
"Blue] to read: Engineer",
"BESPOKE Construction Grade Self-adhesive",
"Vinyl Sign [Size H150xW450mm] [White on Kier",
"Blue] to read: Print Room",
"915x1830mm Kier Logo",
"KR029K/CS",
"(3mm Foamex)",
"- Language: ENGLISH",
"150x300mm Site office",
"(Construction Grade Self-adhesive Vinyl)",
"150x300mm Reception",
"(Construction Grade Self-adhesive Vinyl)",
"Ordered Despatched",
"1",
"1",
"1",
"6",
"4",
"4",
"4",
"4",
"1",
"1",
"1",
"1",
"6",
"4",
"4",
"4",
"4",
"1",
"1",
"To Follow"
