    # Exploring Tokenizers from Hugging Face

_Hugging Face_ (HF) has made NLP (Natural Language Processing) a breeze. In this post, we are going to take a look at *tokenization* using a hands-on approach with the help of the **Tokenizers** library. We are going to load a real-world dataset containing 10-K filings of public firms and see how to train a tokenizer from scratch based on the BERT tokenization scheme. In the process we will understand tokenization in detail and some gotchas to keep an eye out for.

    #### Background on NLP (Optional)

    If you already have an understanding of the NLP pipeline, you can safely skip this section.

    For any NLP task, one of the first steps is pre-processing the data so that it can be fed into our NLP models. For those new to NLP, the general pipeline for any NLP task (text classification, question answering, etc.) is as follows:

    - Pre-process
    - Get the data ready into a format that can be passed on to the NLP model.
    - Train
    - Train the model.
    - Evaluate
    - Using metrics suitable for a given task, evaluate how well the trained model performs on some test data.
    - Predict
    - Once we are satisfied with our trained model, make some predictions.

    Of course this is a very broad overview of the steps and there is a lot going on in each step. As mentioned before, in this post we will focus on the first step - **pre-processing** the data and how we can leverage _Hugging Face_ **Tokenizers** to achieve it.

    #### Installation

You can very easily install the **Tokenizers** library in a new Python environment using:

> `pip install tokenizers`

You will also need the **Datasets** library to load the data we will be working with.

> `pip install datasets`

#### Dataset

    Before we can do anything with the HF **Tokenizers** library, we need data to work with. I will be working with a dataset I created on HF but the steps can be applied to any dataset.

```python
# Load our dataset
from datasets import load_dataset

# Most datasets on HF are split into test/train/validate. This is useful when training our
# NLP model. However, during tokenization we want the combined data from all 3. For this
# we pass the "train+test+validation" to the split parameter so that the load_dataset()
# function returns a Dataset object instead of a DatasetDict object and at the same time
# combines the splits together.

# NOTE: THIS IS A LARGE DATASET. IT WILL TAKE A WHILE TO DOWNLOAD AND GENERATE THE SPLITS.
# YOU MAY WANT TO TEST WITH THE SMALLER VERSION USING
# ds = load_dataset('JanosAudran/financial-reports-sec', 'small_lite', split="train+test+validation")

ds = load_dataset('JanosAudran/financial-reports-sec', 'large_lite', split="train+test+validation")
```

Now that we have loaded our dataset, let's check it out. We can gather some info on the dataset size and structure using:

```python
# Size
print(f"Size of the dataset {ds.dataset_size / 1024 ** 3:.2f} GB.")
# 'Size of the dataset 21.09 GB.'

# Let's check the features in the dataset.
ds
# Dataset({
#     features: ['cik', 'sentence', 'section', 'labels', 'filingDate', 'docID', 'sentenceID', 'sentenceCount'],
#     num_rows: 71866962
# })
```

    _Note: Your dataset size may be different depending on whether you loaded the small version or not._

This dataset is almost 21 GB in size and contains over 71 million observations. It also has 8 **features**. We can think of features as columns/fields in a typical database for our current purposes, but note that they can have added functionality depending on the type of feature.

We are interested in only one of the features, the _'sentence'_ feature, which contains a single sentence from a 10-K filing. Let's check an example sentence from this dataset.

    ```python
    # An example sentence from the dataset.
    example_sentence = ds[100]['sentence']
    print(example_sentence)
    # 'Our Expeditionary Services segment competes with a number of divisions of large corporations and other large and small companies.'
    ```
Now that our dataset is loaded, we can look at the pre-processing step in more detail.

    #### Pre-processing

The thing with textual data is that it can be all over the place, so cleaning the text becomes an important first step. After all, the following two sentences convey the same meaning, yet to a machine they are two very different things.

> _Héllò? What aré yòü üptò tòday?_

vs

> _Hello? What are you upto today?_

Next, a string is hard for a machine to understand. In the example above a machine will have no idea whether _**you**_ or _**you upto**_ is a single word. So, we need to pass on the structure of words explicitly.

    Lastly, machines like numbers. We need to convert the sequence of words to some fixed sequence of numbers.

    Each of these steps is a part of a general pipeline:

    - _Normalization_
- _Pre-tokenization_
    - _Tokenization_
    - _Post-processing_

So, when we say pre-process the data or tokenize the data, what we actually have in mind are the steps in the pipeline above. Let's see exactly what they are and how they help to get the data ready into a format the NLP models can work with.

    #### Step 1: Normalization

    This step helps to manage the plethora of Unicode characters that might be present in our text or take care of accented characters. We want to have our text in a consistent format.

Usually, Unicode normalization is applied, which is a topic in itself and outside the scope of this post. We can very easily apply this step as follows:

    ```python
    # Step 1: Load our normalizer.
    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents

# We create our normalizer which will apply Unicode normalization and strip accents
    normalizer = normalizers.Sequence([NFD(), StripAccents()])

    normalizer.normalize_str("Héllò? What aré yòü üptò tòday?")
    # "Hello? What are you upto today?"

    # Example on our dataset
    normalizer.normalize_str(example_sentence)
    # Our Expeditionary Services segment competes with a number of divisions of large corporations and other large and small companies.
    ```

In the code above we have used two normalizers, *NFD* and *StripAccents*. As you can see, we can very easily chain them together using the _Sequence_ class.
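
For example, if we also wanted case-folding we could simply extend the sequence. Below is a minimal sketch using the _Lowercase_ normalizer from the same module; note that we will **not** lowercase in our actual pipeline, since we want our tokens to keep their case.

```python
# Illustration only: chaining a third normalizer is just a matter of
# extending the Sequence.
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents, Lowercase

lowercasing_normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])

lowercasing_normalizer.normalize_str("Héllò? What aré yòü üptò tòday?")
# "hello? what are you upto today?"
```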

    #### Step 2: Pre-tokenization

Here we want to split our long string into individual words. We can split on whitespace or punctuation, or use more specialized pre-tokenizers such as _ByteLevel_ or _BertPreTokenizer_.

A point to note here is that the tokenization pipeline is going to be heavily influenced by the NLP model that will be used subsequently. For example, BERT has its own tokenization pipeline and in order to use the BERT model we must follow the same pipeline. This means the way the normalization, word splitting, etc. was done while training BERT must be used on our data as well, provided we will be using the **pre-trained BERT** for **fine-tuning**. Instead, if we are going to train BERT from scratch then we can follow our own design.
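
As an aside, if we were going the **fine-tuning** route we would normally load the tokenizer that ships with the pre-trained checkpoint instead of training our own. A rough sketch of what that looks like (assuming your version of **Tokenizers** provides `Tokenizer.from_pretrained` and that you have access to the Hugging Face Hub):

```python
# Aside (not used in this post): load the tokenizer bundled with a
# pre-trained BERT checkpoint instead of training one from scratch.
from tokenizers import Tokenizer

pretrained_tok = Tokenizer.from_pretrained("bert-base-cased")
print(pretrained_tok.encode("Hello! What are you upto today?").tokens)
# e.g. ['[CLS]', 'Hello', '!', 'What', 'are', 'you', 'up', '##to', 'today', '?', '[SEP]']
# (the exact splits depend on the pretrained vocabulary)
```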

    The code to pre-tokenize our sentence is as follows:

    ```python
    # Step 2: Load our pre-tokenizer
    from tokenizers.pre_tokenizers import Whitespace

    # We create our pre-tokenizer which will split based on the regex \w+|[^\w\s]+
    pre_tokenizer = Whitespace()

    pre_tokenizer.pre_tokenize_str("Hello! What are you upto today?.")
    # [('Hello', (0, 5)),
    # ('!', (5, 6)),
    # ('What', (7, 11)),
    # ('are', (12, 15)),
    # ('you', (16, 19)),
    # ('upto', (20, 24)),
    # ('today', (25, 30)),
    # ('?.', (30, 32))]

    # Example on our dataset
    pre_tokenizer.pre_tokenize_str(example_sentence)
    # [('Our', (0, 3)),
    # ('Expeditionary', (4, 17)),
    # ('Services', (18, 26)),
    # ('segment', (27, 34)),
    # ('competes', (35, 43)),
    # ('with', (44, 48)),
    # ('a', (49, 50)),
    # ('number', (51, 57)),
    # ('of', (58, 60)),
    # ('divisions', (61, 70)),
    # ('of', (71, 73)),
    # ('large', (74, 79)),
    # ('corporations', (80, 92)),
    # ('and', (93, 96)),
    # ('other', (97, 102)),
    # ('large', (103, 108)),
    # ('and', (109, 112)),
    # ('small', (113, 118)),
    # ('companies', (119, 128)),
    # ('.', (128, 129))]
    ```

As we can see, the pre-tokenizer splits our sentence on whitespace and punctuation. It also returns the character offsets of the words it has generated in our sentence.
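
If we wanted behaviour closer to BERT's original pre-tokenization, we could swap in the _BertPreTokenizer_ mentioned above. A quick sketch, purely for comparison (we will stick with _Whitespace_ in this post):

```python
# Comparison only: BertPreTokenizer splits on whitespace and treats every
# punctuation character as its own token, so '?.' becomes two tokens.
from tokenizers.pre_tokenizers import BertPreTokenizer

bert_pre_tokenizer = BertPreTokenizer()
bert_pre_tokenizer.pre_tokenize_str("Hello! What are you upto today?.")
# [('Hello', (0, 5)), ('!', (5, 6)), ('What', (7, 11)), ('are', (12, 15)),
#  ('you', (16, 19)), ('upto', (20, 24)), ('today', (25, 30)), ('?', (30, 31)), ('.', (31, 32))]
```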

    #### Step 3: Tokenization

    At this point one can say that our work is over. We started with a string, cleaned it and split it into words. We could simply repeat the same process over all the sentences we have and collect all the unique words. Then their index position would serve as an id that we can feed into our models. The words themselves are called **tokens** and the ids are called **token ids**. This can definitely be a strategy.

    But, one would soon see the problem. For any decently sized textual dataset (also called a **corpus** in NLP lingo) we could have tens of thousands of words. This would make the training process for our actual NLP model much longer and less efficient.

    There is another problem. Consider the next 2 sentences:

> _I will give you a dollar tomorrow_

and

> _I will be giving you a dollar tomorrow_

First off, both the sentences convey the same idea. But, we have used 2 different words _give_ and _giving_ here. Semantically they should be interpreted in the same way. Imagine, instead of creating a list (which is called our **vocabulary**) of unique words (tokens) as follows:

> _['I', 'will', 'be', 'give', 'giving', 'you', 'a', 'dollar', 'tomorrow']_

We create,

> _['I', 'will', 'be', 'giv', '##e', '##ing', 'you', 'a', 'dollar', 'tomorrow']_

_Note: The list above is our **vocabulary** not the sentence broken into tokens._

    This might seem a very weird way to create our list of words. We have split **give** and **giving** into a common part and 2 other pieces. But, notice what happens when we replace our sentences with the new words from our vocabulary (_Note: To keep it simple I have kept the other words as they are. However, they might be split as well depending on the corpus._):

> _['I', 'will', 'giv', '##e', 'you', 'a', 'dollar', 'tomorrow']_

and

> _['I', 'will', 'be', 'giv', '##ing', 'you', 'a', 'dollar', 'tomorrow']_

_Note: The list above is our **tokenized sentence** not the vocabulary._

Now, both our sentences, after replacing with the tokens, will have a common token '_giv_', which can be very helpful to an NLP model to understand that the sentences _share a similar meaning_.

    Further, the other token '_##ing_' is a very common ending for many words and will reduce the size of our overall vocabulary. For example, if we had the following sentence:

> I was willing to go to the concert

The new vocabulary is:

> _['I', 'will', 'be', 'giv', '##e', '##ing', 'you', 'a', 'dollar', 'tomorrow', 'was', 'to', 'go', 'the', 'concert']_

See how the word **willing** is already covered by tokens present in the vocabulary ('will' + '##ing'), so no new entry is needed for it.

The above strategy is a very simplified version of an algorithm known as **WordPiece**, which is used by the BERT family of Transformer models. So, let's see how we could implement it in our tokenizer.

    ```python
    # Step 3: Load our model
    from tokenizers.models import WordPiece
    from tokenizers import Tokenizer

    # We create our tokenizer based on the WordPiece algorithm model.
    # We need to supply the token which will represent unknown tokens.
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

    # With our tokenizer object ready we set our normalizer and pre-tokenizer.
    tokenizer.normalizer = normalizer
    tokenizer.pre_tokenizer = pre_tokenizer
    ```

So, all we need to do is create a _Tokenizer_ object. First, we set the normalizer and the pre-tokenizer of this new _Tokenizer_ object to the ones we created earlier. This means we don't have to run our normalizer and pre-tokenizer on the dataset beforehand; they will be run automatically by the _Tokenizer_ object. Second, we have used a _WordPiece_ class object as our model so that our tokenizer uses the **WordPiece** algorithm.

Finally, we have defined a new token '_[UNK]_'. These are what are known as **special tokens** and are dictated by the NLP model that will be used. More on them soon.

    #### Step 4: Training

    With our model/normalizer/pre-tokenizer all ready, we can now train our tokenizer model on the data. The code for it is:

    ```python
    # Step 4: Train our tokenizer
    from tokenizers.trainers import WordPieceTrainer
    import time

    # We will create a batch iterator which will generate a batch of sentences for training
# our tokenizer. This is the preferred way instead of passing single sentences to the
# tokenizer as it will be a lot faster.
def batch_iterator(dataset, batch_size=10000):
    for i in range(0, len(dataset), batch_size):
        lower_idx = i
        # Ensure the upper idx doesn't overflow leading to an 'IndexError'
        upper_idx = i + batch_size if i + batch_size <= len(dataset) else len(dataset)
        yield dataset[lower_idx:upper_idx]["sentence"]

    # We pass in the list of special tokens so that our model knows about them.
    trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

    tic = time.perf_counter()
    # Now, we do batch training based on our iterator that we defined earlier.
    tokenizer.train_from_iterator(batch_iterator(ds), trainer=trainer, length=len(ds))
    toc = time.perf_counter()
    print(f"Elapsed time: {toc - tic:0.4f} seconds")
    ```

_Note: It took me about 30 mins to train the tokenizer on the full 21 GB corpus on an AMD Ryzen Pro 7 8-core machine._

Most of the code is self-explanatory. We create a batch iterator so that we don't train on a single sentence at a time, which makes training a lot faster. We also need a _Trainer_ object to train our tokenizer. This must be compatible with the model that we instantiated our tokenizer with. We used _WordPiece_ as our model:

> `tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))`

So, we use the _WordPieceTrainer_ class to create the trainer object. Again, there are these special tokens that we pass to the constructor. Let's see what they mean.
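
A side note on the trainer: it exposes a few knobs worth knowing about, most notably `vocab_size` (an upper bound on the number of tokens in the final vocabulary) and `min_frequency` (how often a pair must occur before it is merged). The sketch below is illustrative only; the values are arbitrary assumptions, not what we used above.

```python
# Illustration only: a trainer with an explicit vocabulary size cap and a
# minimum pair frequency. The values here are arbitrary.
from tokenizers.trainers import WordPieceTrainer

sized_trainer = WordPieceTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
# This trainer could be passed to train_from_iterator() exactly like the one above.
```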

    #### Special Tokens

When using BERT as a model, there are certain special tokens that need to be used. Other models might use a different set of special tokens. We will keep it simple here and look at the BERT ones. They are:

    > _["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]_
    - **[UNK]**: This is used to represent any word that the tokenizer fails to find in it's vocabulary. This can happen when the word comes from a different corpus to the one the tokenizer was trained on or we set the size of our vocabulary to a small one.
    - **[CLS]**: This token is automatically inserted during post-processing at the start of a sentence or pair of sentences.
    - **[SEP]**: This token is automatically inserted during post-processing at the end of every sentence.
    - **[PAD]**: This token is used to ensure that the size of all sentences in a batch of sentences are of the same length.
    - **[MASK]**: This a special token that is used only during training the BERT model (not the tokenizer) on a Masked Language Modelling task.

    Don't worry if these seem vague. We will be applying them all in our post-processing section.
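
As a quick sanity check, we can already look up the ids our freshly trained tokenizer assigned to these special tokens using `token_to_id` (a small sketch; the exact ids depend on the order in which the special tokens were passed to the trainer):

```python
# A small sanity check: look up the id assigned to each special token.
for special_token in ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]:
    print(special_token, tokenizer.token_to_id(special_token))
# With the trainer above the special tokens sit at the start of the vocabulary,
# e.g. [UNK] -> 0, [CLS] -> 1, [SEP] -> 2, [PAD] -> 3, [MASK] -> 4.
```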

    #### Understanding the _Encoding_ object

    Once we complete the training process, we can use our tokenizer to encode sentences. Let's see what this means.

    ```python
    # Define our example
    example_sentence = ds[100]['sentence']
    print(example_sentence)
    # 'Our Expeditionary Services segment competes with a number of divisions of large corporations and other large and small companies.'

    # Now that the training is done let us check out what the output of the tokenizer looks like.
    output = tokenizer.encode(example_sentence)
    output
    # Encoding(num_tokens=22, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
    ```

OK, we got an _Encoding_ class object. We see a list of attributes the object has, as well as the number of tokens that were generated.

    We can check each of them out as:

    ```python
    # The number of sequences
    output.n_sequences
    # 1

    # The tokens generated after our sentence went through the normalization->
    # pre-tokenization->tokenization(WordPiece) pipeline
    output.tokens

    # The ids assigned to these tokens.
    output.ids

    # The attention masks
    output.attention_mask

    # The sequence ids
    output.sequence_ids

    # The word ids
    output.word_ids

    # The type ids
    output.type_ids

    # The offsets for our tokens.
    output.offsets
    ```

    But, I find it better to see them in a table side-by-side to really get an understanding of what they mean. Here is the same set of outputs in tabular form for a truncated set of tokens:

    | tokens | ids | attention_mask | special_tokens_mask | sequence_ids | word_ids | type_ids | offsets |
    | :----: | :-: | :------------: | :-----------------: | :----------: | :------: | :------: | :-----: |
    | Our | 1817 | 1 | 0 | 0 | 0 | 0 | (0, 3) |
    | Exped | 19910 | 1 | 0 | 0 | 1 | 0 | (4, 9) |
    | ##ition | 1515 | 1 | 0 | 0 | 1 | 0 | (9, 14) |
    | ##ary | 1610 | 1 | 0 | 0 | 1 | 0 | (14, 17) |
    | Services | 3504 | 1 | 0 | 0 | 2 | 0 | (18, 26) |
    | ...| ...| ... | ... | ... | ... | ... | ... |
    | companies | 2351 | 1 | 0 | 0 | 18 | 0 | (119, 128) |
    | . | 18 | 1 | 0 | 0 | 19 | 0 | (128, 129) |

    For now, just focus on **tokens**, **ids**, **word_ids** and **offsets**. The rest will be clearer when we explore the next few sections.

So, we see that our tokenizer kept the word _'Our'_ as a single token, with an integer id of **1817** and a word id of **0** as it's the first word in the sentence. The offset gives the exact index range in the sentence string where this **token** (not the word) is found.

Next, the tokenizer split the word _'Expeditionary'_ into 3 separate **tokens**: _'Exped'_, _'##ition'_ and _'##ary'_. This is the **WordPiece** algorithm in action. It assigned **different** ids to each of them. However, the word id assigned to them was the same, **1**. So, from this we can see that even though the word was split, we still have enough information to reconstruct the word. Finally, the offset again provides the index into our string where the **token** (not the whole word) is found.
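
To make that concrete, here is a small sketch that reconstructs the original word from its pieces using only the **word_ids** and **offsets** we saw above:

```python
# Reconstruct the word with word id 1 ('Expeditionary') from its token pieces.
target_word_id = 1

# Collect the character offsets of every token belonging to that word.
spans = [
    offset
    for word_id, offset in zip(output.word_ids, output.offsets)
    if word_id == target_word_id
]

start, end = spans[0][0], spans[-1][1]
print(example_sentence[start:end])
# 'Expeditionary'
```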

    #### Step 5: Post-processing

Post-processing is highly tied to the NLP model we will be using. We can do all kinds of things in this step, but usually this is where we add the special tokens required by the NLP model. As we are assuming that the tokenized text will be fed into BERT, let us see what BERT needs.

    BERT expects every single sentence to begin with the _'[CLS]'_ token and end with a _'[SEP]'_ token. So, for the following:

> _I love machine learning._

we need to feed into BERT:

> _['[CLS]', 'I', 'love', 'machine', 'learning', '.', '[SEP]']_

BERT can also be fed 2 sentences at a time for a training task known as _Next Sentence Prediction_. So, for the following inputs:

> _I love machine learning. It is cool._

we need to feed:

> _['[CLS]', 'I', 'love', 'machine', 'learning', '.', '[SEP]', 'It', 'is', 'cool', '.', '[SEP]']_

So, the _'[SEP]'_ token goes at the end of every sentence while the _'[CLS]'_ token only goes at the beginning of the first sentence.

With this in mind, let's see how we can easily achieve it using our tokenizer:

    ```python
    from tokenizers.processors import TemplateProcessing

# BERT-like post-processor
post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

    tokenizer.post_processor = post_processor

    output = tokenizer.encode(example_sentence)
    ```

So, we use a new class called _TemplateProcessing_ which can easily be told how to process a single sentence using the _single_ parameter and a pair of sentences using the _pair_ parameter.

    We provide the string:

    > _"[CLS] $A [SEP]"_
    to tell the post-processor that for any sentence represented by **$A** add the _[CLS]_ and _[SEP]_ tokens as defined. For a pair of sentences we provide:

    > _"[CLS] $A [SEP] $B:1 [SEP]:1"_
    Here again, **$A** and **$B** are the two sentences. The extra **:1** basically tells the tokenizer how to identify which sentence a token belongs to when there are a pair of sentences. So, here every token coming from second sentence will have a **type_id** of 1 while every token coming from the first sentence will have a **type_id** of 0 (the default when nothing is specified).

    One last thing in the example above is the **token_to_id** method. This method on the tokenizer object easily gives us the id that is assigned to a token.

So, let's see an example output:

    ```python
    # Multiple sentences
    print(ds[100]["sentence"])
    # "Our Expeditionary Services segment competes with a number of divisions of large corporations and other large and small companies."
    print(ds[101]["sentence"])
    # Although certain of our competitors have substantially greater financial and other resources than we do, we believe that we have maintained a satisfactory competitive position through our responsiveness to customer needs, our attention to quality, and our unique combination of market expertise and technical and financial capabilities.

    output = tokenizer.encode(ds[100]["sentence"], ds[101]["sentence"])
    output.n_sequences
    # 2
    ```

    If we return to our table as before:

    | tokens | ids | attention_mask | special_tokens_mask | sequence_ids | word_ids | type_ids | offsets |
    | :----: | :-: | :------------: | :-----------------: | :----------: | :------: | :------: | :-----: |
    | [CLS] | 1 | 1 | 1 | None | None | 0 | (0, 0) |
    | Our | 1817 | 1 | 0 | 0 | 0 | 0 | (0, 3) |
    | Exped | 19910 | 1 | 0 | 0 | 1 | 0 | (4, 9) |
    | ##ition | 1515 | 1 | 0 | 0 | 1 | 0 | (9, 14) |
    | ##ary | 1610 | 1 | 0 | 0 | 1 | 0 | (14, 17) |
    | Services | 3504 | 1 | 0 | 0 | 2 | 0 | (18, 26) |
    | ...| ...| ... | ... | ... | ... | ... | ... |
    | companies | 2351 | 1 | 0 | 0 | 18 | 0 | (119, 128) |
    | . | 18 | 1 | 0 | 0 | 19 | 0 | (128, 129) |
    | [SEP] | 2 | 1 | 1 | None | None | 0 | (0, 0) |
    | Although | 3854 | 1 | 0 | 1 | 0 | 1 | (0, 8) |
    | certain | 1809 | 1 | 0 | 1 | 1 | 1 | (9, 16) |
    | ...| ...| ... | ... | ... | ... | ... | ... |
    | capabilities | 4870 | 1 | 0 | 1 | 49 | 1 | (323, 335) |
    | . | 18 | 1 | 0 | 1 | 50 | 1 | (335, 336) |
    | [SEP] | 2 | 1 | 1 | None | None | 1 | (0, 0) |

    We see the special tokens have been added. The following needs to be noted:
    - **sequence_ids** and **type_ids** are **0** for any token belonging to the first sentence and **1** for the second sentence.
    - **offsets** are always calculated with respect to the sentence the token comes from, not the combined sentences.
- Special tokens can be identified using the **special_tokens_mask** attribute which is **1** if the token is a special token (see the short sketch after this list).
    - **sequence_ids** and **word_ids** are always **None** for special tokens and the offset is always **(0,0)** as these tokens don't really belong to the sentence.
    - However, **type_ids** work the same for special tokens as normal tokens.
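
For instance, here is a short sketch that uses the **special_tokens_mask** to drop the special tokens again, which can be handy when mapping model outputs back to the original text (it relies only on the attributes we have already seen):

```python
# Keep only the "real" tokens by filtering on the special tokens mask.
content_tokens = [
    token
    for token, is_special in zip(output.tokens, output.special_tokens_mask)
    if is_special == 0
]
print(content_tokens[:5])
# ['Our', 'Exped', '##ition', '##ary', 'Services']
```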

    #### Padding and Attention Masks

Padding comes into the picture when we have multiple sentences in a batch that we want to tokenize and feed into an NLP model. Most models require the input to be of a fixed size, but sentences almost always vary in length. So, one thing we can do is simply add padding tokens until every sentence in our batch is the same length. For example, consider we have 2 sentences in our batch as follows:

> _I love football_
> _I live in Paris_

We could simply tokenize and add a padding token so that the tokenized sentences have the same length:

> _['[CLS]', 'I', 'love', 'football', '[SEP]', '[PAD]']_
> _['[CLS]', 'I', 'live', 'in', 'Paris', '[SEP]']_

We see a pad token is added to the end of the first sentence after the _[SEP]_ token to make the final count of tokens in each sentence the same. More than **1** padding token can be added and we can control whether to pad left or right. Here we will keep it simple and use the defaults:

    ```python
    pad_token = "[PAD]"
    tokenizer.enable_padding(pad_id=tokenizer.token_to_id(pad_token), pad_token=pad_token)

    output = tokenizer.encode_batch([
    [ds[100]["sentence"], ds[101]["sentence"]],
    [ds[102]["sentence"], ds[103]["sentence"]]
    ])
    ```

    This batch produces the following output:

    | tokens | ids | attention_mask | special_tokens_mask | sequence_ids | word_ids | type_ids | offsets |
    | :----: | :-: | :------------: | :-----------------: | :----------: | :------: | :------: | :-----: |
    | [CLS] | 1 | 1 | 1 | None | None | 0 | (0, 0) |
    | Our | 1817 | 1 | 0 | 0 | 0 | 0 | (0, 3) |
    | ...| ...| ... | ... | ... | ... | ... | ... |
    | . | 18 | 1 | 0 | 0 | 19 | 0 | (128, 129) |
    | [SEP] | 2 | 1 | 1 | None | None | 0 | (0, 0) |
    | Although | 3854 | 1 | 0 | 1 | 0 | 1 | (0, 8) |
    | ...| ...| ... | ... | ... | ... | ... | ... |
    | . | 18 | 1 | 0 | 1 | 50 | 1 | (335, 336) |
    | [SEP] | 2 | 1 | 1 | None | None | 1 | (0, 0) |
    | [CLS] | 1 | 1 | 1 | None | None | 0 | (0, 0) |
    | Backlog | 12416 | 1 | 0 | 0 | 0 | 0 | (0, 7) |
    | ...| ...| ... | ... | ... | ... | ... | ... |
    | . | 18 | 1 | 0 | 0 | 18 | 0 | (115, 116) |
    | [SEP] | 2 | 1 | 1 | None | None | 0 | (0, 0) |
    | Backlog | 12416 | 1 | 0 | 1 | 0 | 1 | (0, 7) |
    | ...| ...| ... | ... | ... | ... | ... | ... |
    | . | 18 | 1 | 0 | 1 | 28 | 1 | (161, 162) |
    | [SEP] | 2 | 1 | 1 | None | None | 1 | (0, 0) |
    | ...| ...| ... | ... | ... | ... | ... | ... |
    | [PAD] | 3 | 0 | 1 | None | None | 0 | (0, 0) |
    | [PAD] | 3 | 0 | 1 | None | None | 0 | (0, 0) |

We provide a batch of two, with each input of the batch being a pair of sentences. We see that the tokens of the second pair were padded with _[PAD]_ tokens. As the pad token is a special token, all the earlier discussion about special tokens applies here as well.

Now, we can finally talk about the **attention_mask**. If you looked carefully at the examples before, this was always **1** for all the tokens. Only in the current example is the value different from **1**, and only for the _[PAD]_ token. This ties in with the attention mechanism for Transformer models in general.

I won't go into the details of the attention mechanism. But, intuitively, we can understand why it is 0 for the _[PAD]_ token. The pad token was introduced just to make sure the sizes of all our tokenized sentences are the same. Our model shouldn't really care about it. To ensure it doesn't, we set the **attention_mask** value to 0.

    This won't be the case for the other special tokens. The other special tokens all play a role in the model learning so for them the **attention_mask** value is still **1**.
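
We can verify this directly on the batch we just encoded with a short sketch:

```python
# Check that every [PAD] token in the batch has an attention mask of 0.
for encoding in output:
    pad_masks = [
        mask
        for token, mask in zip(encoding.tokens, encoding.attention_mask)
        if token == "[PAD]"
    ]
    print(len(pad_masks), set(pad_masks))
# The padded entry prints its number of [PAD] tokens together with {0},
# confirming the pads are masked out; the longer entry has no pads at all.
```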

    #### Conclusion

    This wraps up our discussion of **tokenization**. We saw the different aspects of the process and the ideas behind them. We saw how the entire pipeline of _Normalization->Pre-tokenization->Tokenization->Post-processing_ can be easily integrated into a single instance of the **Tokenizer** class and applied to entire batches of textual input.

    If you want more details then the Hugging Face documentation is a great resource to start. Thank you for reading!