@pmbaumgartner
Created January 10, 2022 15:49
cleaning_tokenizer.py
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc


class CTLTokenizer(Tokenizer):
    # Subclassing approach via https://stackoverflow.com/a/58718664
    def __call__(self, string: str) -> Doc:
        string = self.clean_string(string)
        doc = super().__call__(string)
        return doc

    def clean_string(self, string: str) -> str:
        """String cleaning function. You can call this to clean a string
        without tokenizing, e.g.

            nlp.tokenizer.clean_string("Some example sentence")
        """
        if not string.endswith("."):
            string = string + "."
        return string
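To actually use a tokenizer subclass like this, you have to install it on a pipeline. A reasonable way (a sketch, not part of the original gist) is to construct the subclass from the pipeline's existing tokenizer settings, so the default prefix/suffix/infix rules are preserved and only the pre-tokenization cleaning is added. The `spacy.blank("en")` pipeline and the example sentence below are illustrative choices, not from the source:

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc


class CTLTokenizer(Tokenizer):
    # Clean the incoming string before delegating to the normal tokenizer.
    def __call__(self, string: str) -> Doc:
        return super().__call__(self.clean_string(string))

    def clean_string(self, string: str) -> str:
        # Same cleaning rule as the gist: ensure a trailing period.
        if not string.endswith("."):
            string = string + "."
        return string


nlp = spacy.blank("en")

# Rebuild the tokenizer as our subclass, reusing the existing rules so
# punctuation splitting and exceptions still behave as before.
old = nlp.tokenizer
nlp.tokenizer = CTLTokenizer(
    nlp.vocab,
    rules=old.rules,
    prefix_search=old.prefix_search,
    suffix_search=old.suffix_search,
    infix_finditer=old.infix_finditer,
    token_match=old.token_match,
)

doc = nlp("Some example sentence")
print(doc.text)                     # cleaned text, with the trailing period added
print([t.text for t in doc])
```

Because cleaning happens inside `__call__`, every route into the pipeline (`nlp(...)`, `nlp.pipe(...)`) gets the cleaned text, and `nlp.tokenizer.clean_string(...)` remains available for cleaning without tokenizing.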