@lppier
Last active August 23, 2019 02:49
Revisions

  1. lppier revised this gist Aug 23, 2019. No changes.
  2. lppier created this gist Aug 23, 2019.
    27 changes: 27 additions & 0 deletions detect_percentage_english.py
    @@ -0,0 +1,27 @@
    import string
    from nltk.corpus import words

    punctuation = set(string.punctuation)
    english_vocab = set(words.words())  # build once: words.words() is a list, so a per-word lookup is a linear scan

    def remove_punc(text):
        return ''.join(c for c in text if c not in punctuation)  # drop ASCII punctuation

    total_count = 0
    eng_count = 0

    with open('hsbc_th_supplement-pdf-page-1-text.txt') as f:
        for line in f:
            text_words = remove_punc(line).lower().split()
            print(text_words)
            total_count += len(text_words)
            for word in text_words:
                print(f"Finding {word}")
                if word in english_vocab:
                    eng_count += 1

    print('%s English words found' % eng_count)
    print('%s total words found' % total_count)

    percentage_eng = 0 if total_count == 0 else (float(eng_count) / total_count * 100)
    print('%s%% of words were English' % percentage_eng)
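
The same counting logic can be sketched as a reusable function that takes the vocabulary as an argument, which makes it possible to try out without the NLTK corpus or the input file. This is only a sketch under stated assumptions: `percentage_english` and the toy `vocab` set are hypothetical names that are not part of the gist. With NLTK installed, one would presumably pass `set(words.words())` instead.

```python
import string

PUNCTUATION = set(string.punctuation)


def remove_punc(text):
    """Return text with ASCII punctuation characters removed."""
    return ''.join(c for c in text if c not in PUNCTUATION)


def percentage_english(lines, vocab):
    """Return (eng_count, total_count, percentage) for an iterable of lines.

    vocab is any container with fast membership tests, e.g. a set.
    (Hypothetical helper, not part of the original gist.)
    """
    total_count = 0
    eng_count = 0
    for line in lines:
        text_words = remove_punc(line).lower().split()
        total_count += len(text_words)
        eng_count += sum(1 for word in text_words if word in vocab)
    # Guard against division by zero when the input has no words.
    percentage = 0.0 if total_count == 0 else eng_count / total_count * 100
    return eng_count, total_count, percentage


# Toy vocabulary for illustration only; a real run would use a full word list.
vocab = {"the", "bank", "is", "open"}
result = percentage_english(["The bank, is open!", "xyzzy qwerty"], vocab)
print('%s English words found' % result[0])
print('%s total words found' % result[1])
print('%s%% of words were English' % result[2])
```

Passing the vocabulary in as a set keeps each membership test at roughly constant time, and any iterable of strings (including an open file object) can be supplied as `lines`.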