---
license: apache-2.0
language:
- en
library_name: transformers
---
# **ArlowGPT Tokenizer**
### Overview
The **ArlowGPT Tokenizer** is a byte pair encoding (BPE) tokenizer developed from scratch, optimized for large-scale language modeling and text generation tasks. It features a vocabulary size of **59,575 tokens** and supports a maximum context length of **131,072 tokens**, making it suitable for handling extremely long documents and sequences.
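A minimal usage sketch (assuming the tokenizer files in this repository load through the `transformers` `AutoTokenizer` API, as the `library_name` metadata above indicates):

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hub; the repo ID matches this model card.
tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

text = "ArlowGPT uses byte pair encoding for long-context language modeling."
ids = tokenizer.encode(text)                   # list of token IDs
tokens = tokenizer.convert_ids_to_tokens(ids)  # corresponding BPE tokens
decoded = tokenizer.decode(ids)                # round-trip back to text

print(tokens)
print(decoded)
```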
### Key Features
- **Vocabulary Size**: 59,575 tokens
- **Maximum Context Length**: 131,072 tokens
- **Tokenizer Type**: Byte Pair Encoding (BPE)
- **Special Tokens** (see the usage sketch after this list):
- `<pad>`: Padding token used for sequence alignment.
- `<mask>`: Special token for masked language modeling tasks.
- `<eos>`: End-of-sequence token.
- `<bos>`: Beginning-of-sequence token.
- **Trained From Scratch**: The tokenizer was trained from scratch using a large corpus of English and multilingual text.
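A short sketch of inspecting the special tokens and vocabulary size listed above, and of padding a small batch with the `<pad>` token (assuming the tokenizer loads via `AutoTokenizer` and exposes these tokens through the standard `transformers` attributes):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# Special tokens described in the feature list above.
print(tokenizer.special_tokens_map)  # e.g. {'bos_token': '<bos>', 'eos_token': '<eos>', ...}
print(len(tokenizer))                # total vocabulary size (59,575 per this card)

# Pad a two-sentence batch to a common length using the padding token.
batch = tokenizer(
    ["A short sentence.", "A somewhat longer sentence in the same batch."],
    padding=True,
)
print([len(ids) for ids in batch["input_ids"]])  # equal lengths after padding
```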
### Training Data
The tokenizer was trained on **Wikipedia**, giving it broad coverage of general knowledge and domain-specific terminology. Although primarily optimized for English, it also offers some multilingual capability because the training corpus contains non-English text.
### Intended Use Cases
This tokenizer is designed for **general-purpose language modeling** and is suitable for tasks such as the following (see the long-context sketch after this list):
- Autoregressive text generation
- Long-context summarization
- Conversational AI
- Information retrieval over large documents
- General NLP tasks requiring long context processing
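For the long-context use cases above, a hedged sketch of truncating a long document to the tokenizer's configured maximum length (this assumes `model_max_length` is set to 131,072 in the tokenizer configuration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# A stand-in for a very long input document.
long_document = "This is a placeholder sentence for a very long document. " * 20_000

encoded = tokenizer(
    long_document,
    truncation=True,
    max_length=tokenizer.model_max_length,  # assumed to be 131,072 per this card
)
print(len(encoded["input_ids"]))            # capped at the maximum context length
```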
### Supported Languages
- **Primary Language**: English
- **Secondary Support**: Some multilingual content
### Performance & Benchmarks
No formal benchmarks have been published yet. The tokenizer is designed for efficient tokenization speed and low memory usage, with a focus on handling very long contexts of up to **131,072 tokens**.
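Since no formal benchmarks exist yet, a rough, hedged sketch of how tokenization throughput could be measured locally (the corpus and timing setup here are illustrative, not an official benchmark):

```python
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# Illustrative corpus: repeat a paragraph to get a reasonably sized workload.
paragraph = "Byte pair encoding merges frequent character pairs into subword units. " * 20
corpus = [paragraph] * 1_000

start = time.perf_counter()
encodings = tokenizer(corpus)
elapsed = time.perf_counter() - start

total_tokens = sum(len(ids) for ids in encodings["input_ids"])
print(f"{total_tokens} tokens in {elapsed:.2f}s "
      f"({total_tokens / elapsed:,.0f} tokens/s)")
```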
### Limitations
- **Multilingual Coverage**: While the tokenizer includes some multilingual tokens, it is primarily optimized for English text, and performance on non-English languages may vary.
- **No Benchmarked Metrics**: The tokenizer has not undergone formal benchmarking for speed or performance across various tasks.
### Citation
If you use the **ArlowGPT Tokenizer** in your work, please cite it as:
```bibtex
@misc{arlowgpt_tokenizer,
  title        = {ArlowGPT Tokenizer},
  author       = {yuchenxie},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/yuchenxie/ArlowGPT-Tokenizer}}
}
```