@jamescalam
Last active July 8, 2021 09:33

Revisions

  1. jamescalam revised this gist Jul 8, 2021. 1 changed file with 1 addition and 25 deletions.
    26 changes: 1 addition & 25 deletions encode_batch.ipynb
    @@ -39,33 +39,9 @@
     }
     ],
     "source": [
    -"batch = tokenizer.encode_batch(lines)\n",
    +"batch = tokenizer(lines, max_length=512, padding='max_length', truncation=True)\n",
     "len(batch)"
     ]
    -},
    -{
    -"cell_type": "code",
    -"execution_count": 5,
    -"metadata": {},
    -"outputs": [
    -{
    -"output_type": "execute_result",
    -"data": {
    -"text/plain": [
    -"[Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),\n",
    -" Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),\n",
    -" Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),\n",
    -" Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),\n",
    -" Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]"
    -]
    -},
    -"metadata": {},
    -"execution_count": 5
    -}
    -],
    -"source": [
    -"batch[:5]"
    -]
     }
     ]
     }
  2. jamescalam renamed this gist Jun 11, 2021. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  3. jamescalam created this gist Jun 11, 2021.
    71 changes: 71 additions & 0 deletions gistfile1.txt
    @@ -0,0 +1,71 @@
    {
    "metadata": {
    "language_info": {
    "codemirror_mode": {
    "name": "ipython",
    "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.8.5"
    },
    "orig_nbformat": 2,
    "kernelspec": {
    "name": "ml",
    "display_name": "ML",
    "language": "python"
    }
    },
    "nbformat": 4,
    "nbformat_minor": 2,
    "cells": [
    {
    "cell_type": "code",
    "execution_count": 4,
    "metadata": {},
    "outputs": [
    {
    "output_type": "execute_result",
    "data": {
    "text/plain": [
    "10000"
    ]
    },
    "metadata": {},
    "execution_count": 4
    }
    ],
    "source": [
    "batch = tokenizer.encode_batch(lines)\n",
    "len(batch)"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": 5,
    "metadata": {},
    "outputs": [
    {
    "output_type": "execute_result",
    "data": {
    "text/plain": [
    "[Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),\n",
    " Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),\n",
    " Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),\n",
    " Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),\n",
    " Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]"
    ]
    },
    "metadata": {},
    "execution_count": 5
    }
    ],
    "source": [
    "batch[:5]"
    ]
    }
    ]
    }
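
Every `Encoding` in the notebook output reports `num_tokens=512`, which is what you would expect when each sequence is truncated and padded to a fixed length of 512 (the revised cell passes `max_length=512, padding='max_length', truncation=True` explicitly). The sketch below illustrates that idea in plain Python on lists of token ids. It is not the tokenizer library's implementation: `pad_and_truncate`, `pad_id=0`, and the toy ids are hypothetical, and it ignores details such as special tokens.

```python
from typing import Dict, List


def pad_and_truncate(
    batch_ids: List[List[int]],
    max_length: int = 512,
    pad_id: int = 0,  # assumed padding id, for illustration only
) -> Dict[str, List[List[int]]]:
    """Force every sequence of token ids to exactly max_length entries."""
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for ids in batch_ids:
        kept = ids[:max_length]           # truncation: drop anything past max_length
        n_pad = max_length - len(kept)    # padding: fill the remainder with pad_id
        input_ids.append(kept + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy batch: one short sequence and one that exceeds max_length.
toy_batch = [[5, 6, 7], list(range(600))]
out = pad_and_truncate(toy_batch, max_length=512)

print([len(row) for row in out["input_ids"]])  # [512, 512]
print(sum(out["attention_mask"][0]))           # 3
```

The short sequence keeps its three real ids and is padded out to 512, while the long one is cut to its first 512 ids, so both rows end up the same length and can be stacked into a single tensor.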