|
---
license: apache-2.0
language:
- en
library_name: transformers
---
|
|
|
# **ArlowGPT Tokenizer** |
|
|
|
### Overview |
|
The **ArlowGPT Tokenizer** is a byte pair encoding (BPE) tokenizer developed from scratch, optimized for large-scale language modeling and text generation tasks. It features a vocabulary size of **59,575 tokens** and supports a maximum context length of **131,072 tokens**, making it suitable for handling extremely long documents and sequences. |
|
|
|
### Key Features |
|
- **Vocabulary Size**: 59,575 tokens |
|
- **Maximum Context Length**: 131,072 tokens |
|
- **Tokenizer Type**: Byte Pair Encoding (BPE) |
|
- **Special Tokens**: |
|
- `<pad>`: Padding token used for sequence alignment. |
|
- `<mask>`: Special token for masked language modeling tasks. |
|
- `<eos>`: End-of-sequence token. |
|
- `<bos>`: Beginning-of-sequence token. |
|
- **Trained From Scratch**: Trained on a large corpus of English and multilingual text rather than derived from an existing tokenizer (see the usage sketch below).
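The snippet below is a minimal sketch of loading the tokenizer with 🤗 Transformers and inspecting the properties listed above. It assumes the tokenizer is published at the Hub repository `yuchenxie/ArlowGPT-Tokenizer` referenced in the citation at the end of this card; adjust the identifier if you host it elsewhere.

```python
from transformers import AutoTokenizer

# Assumption: the tokenizer lives at the Hub repo referenced in the citation.
tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

print(tokenizer.vocab_size)          # expected: 59575 (per this card)
print(tokenizer.model_max_length)    # expected: 131072
print(tokenizer.special_tokens_map)  # <pad>, <mask>, <eos>, <bos>

# Round-trip a short string to sanity-check the BPE vocabulary.
ids = tokenizer.encode("ArlowGPT is a byte pair encoding tokenizer.")
print(ids)
print(tokenizer.decode(ids))
```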
|
|
|
### Training Data |
|
The tokenizer was trained on **Wikipedia**, providing broad coverage of general knowledge and common domain-specific terminology. Although primarily optimized for English, it also includes some multilingual capability due to the nature of the training dataset.
|
|
|
### Intended Use Cases |
|
This tokenizer is designed for **general-purpose language modeling** and is suitable for tasks such as: |
|
- Autoregressive text generation |
|
- Long-context summarization |
|
- Conversational AI |
|
- Information retrieval over large documents |
|
- General NLP tasks requiring long context processing |
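For long-context tasks like those above, a useful first step is checking how many tokens a document occupies relative to the 131,072-token limit. The sketch below assumes the same Hub repository identifier as above and uses a synthetic document as a stand-in for real input.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# Synthetic stand-in for a real long document.
long_document = " ".join(["Example sentence for a very long document."] * 5000)

encoding = tokenizer(long_document, truncation=True,
                     max_length=tokenizer.model_max_length)
num_tokens = len(encoding["input_ids"])
print(f"{num_tokens} tokens (limit: {tokenizer.model_max_length})")
```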
|
|
|
### Supported Languages |
|
- **Primary Language**: English |
|
- **Secondary Support**: Some multilingual content |
|
|
|
### Performance & Benchmarks |
|
No formal benchmarks have been conducted yet, but the tokenizer has been designed for efficiency in both tokenization speed and memory usage, with a focus on handling extremely long contexts up to **131,072 tokens**. |
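Until formal benchmarks are published, a rough local throughput measurement along the following lines can give you numbers for your own workload (again assuming the Hub repository identifier from the citation).

```python
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# Repeat a sentence to build a reasonably large sample text.
sample = " ".join(["The quick brown fox jumps over the lazy dog."] * 2000)

start = time.perf_counter()
ids = tokenizer(sample)["input_ids"]
elapsed = time.perf_counter() - start

print(f"Tokenized {len(ids)} tokens in {elapsed:.3f}s "
      f"({len(ids) / elapsed:,.0f} tokens/s)")
```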
|
|
|
### Limitations |
|
- **Multilingual Coverage**: While the tokenizer includes some multilingual tokens, it is primarily optimized for English text, and performance on non-English languages may vary. |
|
- **No Benchmarked Metrics**: The tokenizer has not undergone formal benchmarking for speed or performance across various tasks. |
|
|
|
### Citation |
|
If you use the **ArlowGPT Tokenizer** in your work, please cite it as: |
|
```
@misc{arlowgpt_tokenizer,
  title={ArlowGPT Tokenizer},
  author={yuchenxie},
  year={2025},
  howpublished={\url{https://huggingface.co/yuchenxie/ArlowGPT-Tokenizer}}
}
```