|
---
license: apache-2.0
language:
- en
library_name: transformers
---
|
|
|
# **ArlowGPT Tokenizer** |
|
|
|
### Overview |
|
The **ArlowGPT Tokenizer** is a byte pair encoding (BPE) tokenizer developed from scratch, optimized for large-scale language modeling and text generation tasks. It features a vocabulary size of **59,575 tokens** and supports a maximum context length of **131,072 tokens**, making it suitable for handling extremely long documents and sequences. |
|
|
|
### Key Features |
|
- **Vocabulary Size**: 59,575 tokens |
|
- **Maximum Context Length**: 131,072 tokens |
|
- **Tokenizer Type**: Byte Pair Encoding (BPE) |
|
- **Special Tokens**: |
|
- `<pad>`: Padding token used for sequence alignment. |
|
- `<mask>`: Special token for masked language modeling tasks. |
|
- `<eos>`: End-of-sequence token. |
|
- `<bos>`: Beginning-of-sequence token. |
|
- **Trained From Scratch**: Trained on a large corpus of English and multilingual text rather than derived from an existing tokenizer (see the usage sketch below).
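The snippet below is a minimal sketch of loading the tokenizer with 🤗 Transformers and inspecting the properties listed above. It assumes the tokenizer is published at the Hub repository `yuchenxie/ArlowGPT-Tokenizer` referenced in the citation at the end of this card; adjust the identifier if you host it elsewhere.

```python
from transformers import AutoTokenizer

# Assumption: the tokenizer lives at the Hub repo referenced in the citation.
tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

print(tokenizer.vocab_size)          # expected: 59575 (per this card)
print(tokenizer.model_max_length)    # expected: 131072
print(tokenizer.special_tokens_map)  # <pad>, <mask>, <eos>, <bos>

# Round-trip a short string to sanity-check the BPE vocabulary.
ids = tokenizer.encode("ArlowGPT is a byte pair encoding tokenizer.")
print(ids)
print(tokenizer.decode(ids))
```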
|
|
|
### Training Data |
|
The tokenizer was trained on **Wikipedia**, providing broad coverage of general knowledge and common domain-specific terminology. Although primarily optimized for English, it also includes some multilingual capability due to the nature of the training dataset.
|
|
|
### Intended Use Cases |
|
This tokenizer is designed for **general-purpose language modeling** and is suitable for tasks such as: |
|
- Autoregressive text generation |
|
- Long-context summarization |
|
- Conversational AI |
|
- Information retrieval over large documents |
|
- General NLP tasks requiring long context processing |
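For long-context tasks like those above, a useful first step is checking how many tokens a document occupies relative to the 131,072-token limit. The sketch below assumes the same Hub repository identifier as above and uses a synthetic document as a stand-in for real input.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# Synthetic stand-in for a real long document.
long_document = " ".join(["Example sentence for a very long document."] * 5000)

encoding = tokenizer(long_document, truncation=True,
                     max_length=tokenizer.model_max_length)
num_tokens = len(encoding["input_ids"])
print(f"{num_tokens} tokens (limit: {tokenizer.model_max_length})")
```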
|
|
|
### Supported Languages |
|
- **Primary Language**: English |
|
- **Secondary Support**: Some multilingual content |
|
|
|
### Performance & Benchmarks |
|
No formal benchmarks have been conducted yet, but the tokenizer has been designed for efficiency in both tokenization speed and memory usage, with a focus on handling extremely long contexts up to **131,072 tokens**. |
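Until formal benchmarks are published, a rough local throughput measurement along the following lines can give you numbers for your own workload (again assuming the Hub repository identifier from the citation).

```python
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# Repeat a sentence to build a reasonably large sample text.
sample = " ".join(["The quick brown fox jumps over the lazy dog."] * 2000)

start = time.perf_counter()
ids = tokenizer(sample)["input_ids"]
elapsed = time.perf_counter() - start

print(f"Tokenized {len(ids)} tokens in {elapsed:.3f}s "
      f"({len(ids) / elapsed:,.0f} tokens/s)")
```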
|
|
|
### Limitations |
|
- **Multilingual Coverage**: While the tokenizer includes some multilingual tokens, it is primarily optimized for English text, and performance on non-English languages may vary. |
|
- **No Benchmarked Metrics**: The tokenizer has not undergone formal benchmarking for speed or performance across various tasks. |
|
|
|
### Citation |
|
If you use the **ArlowGPT Tokenizer** in your work, please cite it as: |
|
```
@misc{arlowgpt_tokenizer,
  title={ArlowGPT Tokenizer},
  author={yuchenxie},
  year={2025},
  howpublished={\url{https://huggingface.co/yuchenxie/ArlowGPT-Tokenizer}}
}
```