Evaluating Long Context #2: SCROLLS and ZeroSCROLLS
In this series of posts tracing the history of long context evaluation, we started with Long Range Arena (LRA). Introduced in 2020, LRA is one of the earliest benchmarks designed to tackle the challenge of long context evaluation. However, it wasn't built to evaluate LLMs, but rather the transformer architecture in general.
The SCROLLS benchmark, introduced in 2022, addresses this gap in NLP/LLM research. SCROLLS challenges models with tasks that require reasoning over extended sequences (long by 2022 standards). So, what does it offer?
1️⃣ Long Text Focus: Unlike LRA, SCROLLS focuses mainly on text and contains inputs thousands of words long, testing models' ability to synthesize information across lengthy documents.
2️⃣ Diverse Tasks: Includes summarization, question answering, and natural language inference across domains like literature, science, and business.
3️⃣ Unified Format: All datasets are available in a text-to-text format, facilitating easy evaluation and comparison of models (see the loading sketch below).
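To make the unified text-to-text format concrete, here is a minimal loading sketch using the Hugging Face datasets library. The 'tau/scrolls' repository ID, the 'gov_report' config, and the 'input'/'output' column names come from the public dataset card rather than from this post, so double-check them against the Hub.

```python
from datasets import load_dataset

# GovReport: long government reports paired with expert summaries, all as plain text.
gov_report = load_dataset(
    "tau/scrolls",
    "gov_report",
    split="validation",
    trust_remote_code=True,  # SCROLLS ships a small loading script
)

example = gov_report[0]
print(len(example["input"].split()))  # inputs are thousands of words long
print(example["output"][:200])        # the target summary is plain text too
```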
Building on SCROLLS, ZeroSCROLLS takes long text evaluation to the next level by focusing on zero-shot learning. Other features include:
1️⃣ New Tasks: Introduces tasks like sentiment aggregation and sorting book chapter summaries (see the sketch below).
2️⃣ Leaderboard: A live leaderboard encourages continuous improvement and competition among researchers.
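As a rough sketch under similar assumptions as above: the ZeroSCROLLS data also lives on the Hub, reportedly as 'tau/zero_scrolls', and I'm assuming 'space_digest' (sentiment aggregation) and 'book_sum_sort' (sorting book chapter summaries) as config names, so verify them on the dataset card before use.

```python
from datasets import load_dataset

# Two of the new ZeroSCROLLS tasks (repository and config names assumed, see above).
space_digest = load_dataset("tau/zero_scrolls", "space_digest", trust_remote_code=True)
book_sum_sort = load_dataset("tau/zero_scrolls", "book_sum_sort", trust_remote_code=True)

# Zero-shot: inspect the splits and fields, then prompt a model directly, no fine-tuning.
print(space_digest)
print(book_sum_sort)
```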
What are some other landmark benchmarks in the history of long context evaluation? Feel free to share your thoughts and suggestions in the comments.
I just released Sentence Transformers v3.4.0, featuring a memory leak fix, compatibility between the powerful Cached... losses and the Matryoshka loss modifier, and a bunch of fixes & small features.
Matryoshka & Cached loss compatibility
It is now possible to combine the powerful Cached... losses (which use in-batch negatives and a caching mechanism to allow for arbitrarily large batch sizes and negative counts) with the Matryoshka loss modifier, which modifies a base loss so that it is trained not only on the maximum dimensionality (e.g. 1024 dimensions) but also on many lower dimensions (e.g. 768, 512, 256, 128, 64, 32). After training, these models' embeddings can be truncated for faster retrieval, etc.
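Here's a minimal sketch of what that combination looks like; the base model 'microsoft/mpnet-base', the mini-batch size, and the dimension list are placeholder choices, and CachedMultipleNegativesRankingLoss stands in for whichever Cached loss you're using.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss, MatryoshkaLoss

model = SentenceTransformer("microsoft/mpnet-base")  # placeholder base model

# Cached loss: in-batch negatives with gradient caching, so the logical batch size
# (and thus the number of negatives) isn't capped by GPU memory.
base_loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

# Matryoshka modifier: also optimize the first 512, 256, ... dimensions of each embedding,
# so embeddings can later be truncated with minimal quality loss.
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64, 32])
```

From here, the combined loss is passed to the trainer like any other Sentence Transformers loss.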
Resolve memory leak when Model and Trainer are re-initialized
Due to a circular dependency between Trainer -> Model -> ModelCardData -> Trainer, deleting both the trainer and the model still didn't free up the memory. This led to a memory leak in scripts that repeatedly re-initialize them.
New Features
Many new small features, e.g. multi-GPU support for 'mine_hard_negatives', a 'margin' parameter for the TripletEvaluator, and the Matthews Correlation Coefficient in the BinaryClassificationEvaluator.
Bug Fixes
Also a bunch of fixes, for example that subsequent batches were not sorted when using the "no_duplicates" batch sampler. See the release notes for more details.
Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! This includes 2 fully open models, plus training scripts, datasets, and metrics.
We apply our recipe to train 2 Static Embedding models that we release today! We release:
- an English Retrieval model and a general-purpose Multilingual Similarity model (e.g. for classification, clustering, etc.), both Apache 2.0
- my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
- my training scripts, using the Sentence Transformers library (see the sketch below)
- my Weights & Biases reports with losses & metrics
- my list of 30 training and 13 evaluation datasets
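For a rough idea of what the model setup in those scripts looks like (this is a sketch, not the released training code; the tokenizer and embedding dimension are placeholder choices):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
from tokenizers import Tokenizer

# One embedding vector per token, mean-pooled into a sentence embedding; no attention layers.
tokenizer = Tokenizer.from_pretrained("google-bert/bert-base-uncased")  # placeholder tokenizer
static = StaticEmbedding(tokenizer, embedding_dim=1024)

model = SentenceTransformer(modules=[static])
print(model.get_sentence_embedding_dimension())  # 1024
```

From there, training proceeds with the regular Sentence Transformers trainer and losses.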
The 2 Static Embedding models have the following properties:
- Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
- Zero active parameters: no Transformer blocks, no attention, not even a matrix multiplication. Super speed!
- No maximum sequence length! Embed texts at any length (note: longer texts may embed worse)
- Linear instead of quadratic complexity: 2x longer text takes 2x longer to embed, instead of 2.5x or more
- Matryoshka support: lets you truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% performance decrease on English Similarity tasks)
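A minimal usage sketch, assuming the English retrieval model is published as 'sentence-transformers/static-retrieval-mrl-en-v1' (check the blogpost below for the exact released model IDs); truncate_dim uses the Matryoshka property to shrink embeddings at load time.

```python
from sentence_transformers import SentenceTransformer

# Model ID assumed; see the blogpost for the released checkpoints.
model = SentenceTransformer(
    "sentence-transformers/static-retrieval-mrl-en-v1",
    truncate_dim=256,  # Matryoshka: keep only the first 256 dimensions
)

embeddings = model.encode([
    "Static embeddings are just token lookups plus pooling, so this runs fast on CPU.",
    "There is no maximum sequence length, although very long texts may embed worse.",
])
print(embeddings.shape)  # (2, 256)
```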
Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co/blog/static-embeddings
The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.
That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!
Details:
- Based on ModernBERT-base with 149M parameters
- Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
- Immediate FA2 and unpadding support for super efficient inference
- Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256
- Maximum sequence length of 8192 tokens!
- Trained in 2 stages: unsupervised contrastive data -> high-quality labeled datasets
- Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
- Apache 2.0 licensed: fully commercially permissible
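A minimal usage sketch with Sentence Transformers; the model ID 'nomic-ai/modernbert-embed-base' and the 'search_query:'/'search_document:' prefixes follow the conventions of the earlier nomic-embed models and should be verified on the model card.

```python
from sentence_transformers import SentenceTransformer

# Model ID and task prefixes assumed from nomic-embed conventions; check the model card.
model = SentenceTransformer("nomic-ai/modernbert-embed-base", truncate_dim=256)

query_emb = model.encode(["search_query: What is ModernBERT?"])
doc_emb = model.encode([
    "search_document: ModernBERT is a modernized encoder supporting sequences up to 8192 tokens.",
])

print(model.similarity(query_emb, doc_emb))
```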