# bpetokenizer

A Byte Pair Encoding (BPE) tokenizer that algorithmically follows the GPT tokenizer. It handles special tokens, uses a customizable regex pattern for tokenization (the GPT-4 regex pattern is included), and supports saving and loading tokenizers in both `json` and `file` formats.

### Overview

The Byte Pair Encoding (BPE) algorithm is a simple yet powerful method for building a vocabulary of subword units from a text corpus. You can use this package to train an LLM tokenizer on text corpora in various languages. The algorithm was first introduced in the paper [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909) and was later used in the GPT-2 tokenizer ([Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)).

The [notebook](notebooks/tokenization.ipynb) walks through the BPE algorithm in detail and shows how the tokenizers work internally. Every LLM (LLaMA, Gemini, Mistral, ...) uses its own tokenizer trained on its own text dataset.
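At its core, BPE starts from the 256 raw byte values and repeatedly merges the most frequent pair of adjacent tokens into a new token until the target vocabulary size is reached. The sketch below is a minimal, self-contained illustration of that loop; the helper names mirror the `get_stats` and `merge` helpers that the `Tokenizer` class exposes (see below), but the code here is illustrative, not this package's exact implementation.

```py
# Minimal sketch of the core BPE training loop (illustrative, not the
# package's exact code; helper names mirror those in bpetokenizer/base.py).

def get_stats(ids):
    """Count how often each pair of adjacent token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # start from raw UTF-8 bytes (ids 0..255)

num_merges = 3
merges = {}
for i in range(num_merges):
    stats = get_stats(ids)
    top_pair = max(stats, key=stats.get)  # most frequent adjacent pair
    idx = 256 + i                         # new token id beyond the byte range
    ids = merge(ids, top_pair, idx)
    merges[top_pair] = idx
    print(f"merged {top_pair} -> {idx}")
```

Each merge adds one token on top of the 256 base byte tokens, so in this byte-level scheme training to `vocab_size=310` (as in the usage example below) corresponds to roughly 310 − 256 = 54 merges.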
### Features

- Implements the Byte Pair Encoding (BPE) algorithm.
- Handles special tokens.
- Uses a customizable regex pattern for tokenization.
- Compatible with Python 3.9 and above.

#### This repository has 2 different Tokenizers:

- `BPETokenizer`
- `Tokenizer`

1. [Tokenizer](bpetokenizer/base.py): This class contains `train`, `encode`, `decode`, and the `save`/`load` functionality. It also contains a few helper functions (`get_stats`, `merge`, `replace_control_characters`, ...) used to perform the BPE algorithm.
2. [BPETokenizer](bpetokenizer/tokenizer.py): This class demonstrates the real power of the tokenizer (as used in the GPT-4 tokenizer, [tiktoken](https://github.com/openai/tiktoken)). It uses the `GPT4_SPLIT_PATTERN` to split the text, as in the GPT-4 tokenizer, and also handles `special_tokens` (refer to [sample_bpetokenizer](sample/bpetokenizer/sample_bpetokenizer.py)). It inherits the `save` and `load` functionality to save and load the tokenizer.

### Usage

This tutorial demonstrates the use of `special_tokens` with the tokenizer.

Install the package:

```shell
pip install bpetokenizer
```

```py
from bpetokenizer import BPETokenizer

special_tokens = {
    "<|endoftext|>": 1001,
    "<|startoftext|>": 1002,
    "[SPECIAL1]": 1003,
    "[SPECIAL2]": 1004,
}

tokenizer = BPETokenizer(special_tokens=special_tokens) # you can also use the method _special_tokens to register the special tokens (if not passed when initializing)

texts = "<|startoftext|> Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.<|endoftext|>"

tokenizer.train(texts, vocab_size=310, verbose=True)
# tokenizer._special_tokens(special_tokens) # if not passed at initialization of the BPETokenizer

encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer. Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness. Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence. Hello, World! This is yet another sample text, with [SPECIAL1] and [SPECIAL2] making an appearance. Hey there, World! Testing the tokenizer with [SPECIAL1] and [SPECIAL2] to see if it handles special tokens properly. Salutations, Planet!
The tokenizer should recognize [SPECIAL1] and [SPECIAL2] in this long string of text. Hello again, World! [SPECIAL1] and [SPECIAL2] are special tokens that need to be handled correctly by the tokenizer. Welcome, World! Including [SPECIAL1] and [SPECIAL2] multiple times in this large text to ensure proper encoding. Hi, World! Let's add [SPECIAL1] and [SPECIAL2] in various parts of this long sentence to test the tokenizer thoroughly. <|endoftext|>
"""

ids = tokenizer.encode(encode_text, special_tokens="all")
print(ids)

decode_text = tokenizer.decode(ids)
print(decode_text)

tokenizer.save("sample_bpetokenizer", mode="json") # mode: default is "file"
```

Refer to [sample_bpetokenizer](sample/bpetokenizer) for the `vocab` and `model` files of the tokenizer trained on the texts above.

#### To Load the Tokenizer

```py
from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer()
tokenizer.load("sample_bpetokenizer.json", mode="json")

encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.<|endoftext|>"""

print("vocab: ", tokenizer.vocab)
print('---')
print("merges: ", tokenizer.merges)
print('---')
print("special tokens: ", tokenizer.special_tokens)

ids = tokenizer.encode(encode_text, special_tokens="all")
print('---')
print(ids)

decode_text = tokenizer.decode(ids)
print('---')
print(decode_text)

# you can also print the tokens and the text chunks split with the pattern
tokens = tokenizer.tokens(encode_text, verbose=True) # if verbose, prints the text chunks and also the pattern used to split
print('---')
print("tokens: ", tokens)
```

Refer to [load_json_vocab](sample/load_json_vocab/) and run `bpetokenizer_json` to get an overview of `vocab`, `merges`, and `special_tokens`. To view the tokens that the tokenizer splits using the pattern, look at [tokens](sample/load_json_vocab/tokens.py).

### Run Tests

The `tests/` folder contains the tests for the tokenizer, which use pytest.

```
python3 -m pytest
```

Additionally, the workflows are set up to run the tests whenever a PR is opened.

### Contributing

Contributions to the BPE Tokenizer are most welcome! If you would like to contribute, please follow these steps:

- Star and fork the repository.
- Create a new branch (`git checkout -b feature/your-feature`).
- Commit your changes (`git commit -am 'Add some feature'`).
- Push to the branch (`git push origin feature/your-feature`).
- Create a new Pull Request.

Please ensure your code follows the project's coding standards and includes appropriate tests. Also, update the documentation as necessary.

### License

This project is licensed under the MIT License.

----

*This tokenizer is inspired by [minbpe](https://github.com/karpathy/minbpe), but more optimized.*