The community has been busy distilling DeepSeek-R1 from inference providers, but we decided to have a go at doing it ourselves from scratch 💪
What's new compared to existing reasoning datasets?
♾️ Based on AI-MO/NuminaMath-1.5: we focus on math reasoning traces and generate answers for problems in NuminaMath 1.5, an improved version of the popular NuminaMath-CoT dataset.
🐳 800k R1 reasoning traces: We generate two answers for 400k problems using DeepSeek R1. The filtered dataset contains 220k problems with correct reasoning traces.
512 H100s running locally: Instead of relying on an API, we leverage vLLM and SGLang to run generations locally on our science cluster, generating 180k reasoning traces per day.
⏳ Automated filtering: We apply Math Verify to retain only problems with at least one correct answer. We also leverage Llama3.3-70B-Instruct as a judge to retrieve more correct examples (e.g. for cases with malformed answers that can't be verified with a rules-based parser).
We match the performance of DeepSeek-Distill-Qwen-7B by finetuning Qwen-7B-Math-Instruct on our dataset.
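For a concrete picture of the rule-based filtering stage, here is a minimal sketch assuming the open-source math-verify package; the actual Open-R1 pipeline may differ in details, and the function and field names here are placeholders:

```python
# Minimal sketch of the rule-based filtering stage, assuming the
# math-verify package (pip install math-verify).
from math_verify import parse, verify

def keep_problem(gold_answer: str, generations: list[str]) -> bool:
    """Keep a problem if at least one generated trace matches the gold answer."""
    gold = parse(gold_answer)
    for generation in generations:
        try:
            if verify(gold, parse(generation)):
                return True
        except Exception:
            # Malformed answers that the rules-based parser cannot handle
            # fall through to the LLM-as-judge stage described above.
            continue
    return False
```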
Evaluating Long Context #2: SCROLLS and ZeroSCROLLS
In this series of posts tracing the history of long context evaluation, we started with Long Range Arena (LRA). Introduced in 2020, LRA is one of the earliest benchmarks designed to tackle the challenge of long context evaluation. However, it was built to evaluate the transformer architecture in general rather than LLMs specifically.
The SCROLLS benchmark, introduced in 2022, addresses this gap in NLP/LLM research. SCROLLS challenges models with tasks that require reasoning over extended sequences (by 2022 standards). So, what does it offer?
1️⃣ Long Text Focus: Unlike LRA, SCROLLS focuses mainly on text, with inputs of thousands of words, testing models' ability to synthesize information across lengthy documents.
2️⃣ Diverse Tasks: Includes summarization, question answering, and natural language inference across domains like literature, science, and business.
3️⃣ Unified Format: All datasets are available in a text-to-text format, facilitating easy evaluation and comparison of models.
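To see the unified format in practice, here is a hedged sketch of loading one SCROLLS task with the datasets library; the config and field names are as published on the Hugging Face Hub, so adjust if they differ:

```python
from datasets import load_dataset

# "gov_report" is one of several configs (others include qasper, qmsum, quality).
ds = load_dataset("tau/scrolls", "gov_report", split="validation")
example = ds[0]
print(example["input"][:500])  # long source document (plus query, where relevant)
print(example["output"])       # target text
```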
Building on SCROLLS, ZeroSCROLLS takes long text evaluation to the next level by focusing on zero-shot learning. Other features include:
1️⃣ New Tasks: Introduces tasks like sentiment aggregation and sorting book chapter summaries.
2️⃣ Leaderboard: A live leaderboard encourages continuous improvement and competition among researchers.
💡 What are some other landmark benchmarks in the history of long context evaluation? Feel free to share your thoughts and suggestions in the comments.
This first unit of the course sets you up with all the fundamentals to become a pro in agents.
- What's an AI Agent?
- What are LLMs?
- Messages and Special Tokens
- Understanding AI Agents through the Thought-Action-Observation Cycle
- Thought, Internal Reasoning and the ReAct Approach
- Actions, Enabling the Agent to Engage with Its Environment
- Observe, Integrating Feedback to Reflect and Adapt
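As a rough illustration of the Thought-Action-Observation cycle, here is a self-contained toy loop; the scripted fake_llm and calculator tool are stand-ins for a real model and real tools, which the course builds up properly:

```python
# Toy Thought-Action-Observation loop; fake_llm is a scripted stand-in.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def fake_llm(transcript: str) -> str:
    # Scripted responses purely to demonstrate the loop.
    if "Observation:" not in transcript:
        return "I need to compute this. Action: calculator 6*7"
    return "Final Answer: 42"

def react_agent(task: str, llm=fake_llm, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        thought = llm(transcript)                          # Thought: reason about the task
        transcript += f"Thought: {thought}\n"
        if "Final Answer:" in thought:
            return thought.split("Final Answer:")[-1].strip()
        if "Action:" in thought:                           # Action: call a tool
            name, arg = thought.split("Action:")[-1].strip().split(" ", 1)
            observation = TOOLS[name](arg)
            transcript += f"Observation: {observation}\n"  # Observation: feed result back
    return "no answer"

print(react_agent("What is 6 times 7?"))  # -> 42
```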
What you need to know about spaCy NER models:
✔️ Each model is a Python package; packages can be installed directly into the environment or via the Python CLI.
✔️ The library provides a pipeline for optimized batch processing of requests.
✔️ Architecture: DNN embedding-based models (not transformers)
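A minimal sketch of these points, assuming the small English model en_core_web_sm:

```python
# Install the model as a package via the CLI:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # the model itself is a Python package

texts = ["Apple is looking at buying a U.K. startup for $1 billion."]
# nlp.pipe processes requests in optimized batches instead of one-by-one calls
for doc in nlp.pipe(texts, batch_size=32):
    for ent in doc.ents:
        print(ent.text, ent.label_)
```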
Fascinating deep dive into Swiggy's Hermes - their in-house Text-to-SQL solution that's revolutionizing data accessibility!
Hermes enables natural language querying within Slack, generating and executing SQL queries with an impressive <2 minute turnaround time. The system architecture is particularly intriguing:
Technical Implementation: - Built on GPT-4 with a Knowledge Base + RAG approach for Swiggy-specific context - AWS Lambda middleware handles communication between Slack UI and the Gen AI model - Databricks jobs orchestrate query generation and execution
Under the Hood: The pipeline employs a sophisticated multi-stage approach:
1. Metrics retrieval using embedding-based vector lookup
2. Table/column identification through metadata descriptions
3. Few-shot SQL retrieval with vector-based search
4. Structured prompt creation with data snapshots
5. Query validation with automated error correction
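Swiggy's actual implementation isn't public, so here is a hedged sketch of how stages 1-4 might fit together; the embedding function, metric/table descriptions, and few-shot examples are all illustrative placeholders:

```python
import numpy as np

# Stand-in embedding: deterministic random vectors per text within a run.
# A production system would use a real embedding model here.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def top_k(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Cosine-similarity vector lookup over short descriptions."""
    q = embed(query)
    scores = {name: float(embed(desc) @ q) for name, desc in corpus.items()}
    return sorted(scores, key=scores.__getitem__, reverse=True)[:k]

# Hypothetical charter-specific metadata.
metrics = {"orders_per_day": "count of delivered orders per day",
           "avg_delivery_time": "mean minutes from order placement to delivery"}
tables = {"fact_orders": "order_id, restaurant_id, delivered_at, city",
          "dim_restaurant": "restaurant_id, name, cuisine, city"}
few_shot = {"daily orders by city": "SELECT city, COUNT(*) FROM fact_orders GROUP BY city;"}

question = "How many orders were delivered per day last week?"
shots = [f"Q: {q}\nSQL: {few_shot[q]}" for q in top_k(question, few_shot, k=1)]
prompt = "\n".join([
    "Relevant metrics: " + ", ".join(top_k(question, metrics)),  # stage 1
    "Candidate tables: " + ", ".join(top_k(question, tables)),   # stage 2
    *shots,                                                      # stage 3
    f"Q: {question}",                                            # stage 4
    "SQL:",
])
print(prompt)
```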
Architecture Highlights: - Compartmentalized by business units (charters) for better context management - Snowflake integration with seamless authentication - Automated metadata onboarding with QA validation - Real-time feedback collection via Slack
What's particularly impressive is how they've solved the data context challenge through charter-specific implementations, significantly improving query accuracy for well-defined metadata sets.
Kudos to the Swiggy team for democratizing data access across their organization. This is a brilliant example of practical AI implementation solving real business challenges.
Wanted: Peak Data. I'm collecting audio data to train another TTS model: + AVM data: ChatGPT Advanced Voice Mode audio & text from source + Professional audio: Permissive (CC0, Apache, MIT, CC-BY)
This audio should *impress* most native speakers, not just barely pass their audio Turing tests. Professional-caliber means S or A-tier, not your average bloke off the street. Traditional TTS may not make the cut. Absolutely no low-fi microphone recordings like Common Voice.
The bar is much higher than last time, so there are no timelines yet and I expect it may take longer to collect such mythical data. Raising the bar means evicting quite a bit of old data, and voice/language availability may decrease. The theme is *quality* over quantity. I would rather have 1 hour of A/S-tier than 100 hours of mid data.
I have nothing to offer but the north star of a future Apache 2.0 TTS model, so prefer data that you *already have* and costs you *nothing extra* to send. Additionally, *all* the new data may be used to construct public, Apache 2.0 voicepacks, and if that arrangement doesn't work for you, no need to send any audio.
Last time I asked for horses; now I'm asking for unicorns. As of writing this post, I've currently got a few English & Chinese unicorns, but there is plenty of room in the stable. Find me over on Discord at rzvzn: https://discord.gg/QuGxSWBfQy
The most difficult part was getting the model running in the first place, but the next steps are simple:
✔️ Implement sentence splitting, allowing for streamed responses (a sketch follows below)
✔️ Multilingual support (only phonemization left)
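A naive sketch of the sentence-splitting step; the regex is a stand-in for a proper splitter, and synthesize is a placeholder for the actual model call:

```python
import re

def split_sentences(text: str):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield sentence

def stream_tts(text: str, synthesize):
    """Yield audio chunk-by-chunk so playback can start before the full text is done."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)  # synthesize(sentence) -> audio chunk

# Example: fake synthesizer that just returns the text length
for chunk in stream_tts("Hello there. How are you? Great!", synthesize=len):
    print(chunk)
```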
SmolLM2 paper released! Learn how the 🤗 team built one of the best small language models: from data choices to training insights. Check out our findings and share your thoughts!
Colox, a reasoning AI model. I am currently working on a model smarter than OpenAI's o1 that thinks before it speaks. It is coming tomorrow afternoon.
Note: Python 3.9-3.10 is expected; Accelerate on Python 3.11 may require further tweaks to launch. I might try wrapping other frameworks later on here: https://github.com/nicolay-r/nlp-thirdgate
The new release, bulk-ner 0.25.1, includes the following updates:
✅ Removed sentence index from output #21
✅ API + support function for custom entity construction
✅ Hub for providers
🚨 Key takeaway on quickly mastering sentiment analysis today. Through the questionnaire from the past RuOpinionNE-2024 competition, we gathered insights into participants' model preference choices. Our main conclusion:
✨ The top-performing submissions exploit few-shot learning with LLMs.
A takeaway note compared with the prior RuSentNE-2023 competition: 🧠 Step-by-step reasoning requires more effort to tweak. Most recent solutions empowered with Chain-of-Thought tend to overthink. Earlier we saw improvements for Flan-T5 (2.8B) in fine-tuned mode, but not among the zero-shot approaches. nicolay-r/flan-t5-tsa-thor-xl
Our Streamlit application provides a text-to-speech conversion tool using the Kokoro library, allowing users to input text, select language and voice, and adjust speech speed. The generated audio can be played or downloaded as a WAV file. Optionally, an OpenAI API key enables text translation to English, with subsequent speech generation for both the original and translated text.

This functionality, along with helpful instructions and sample prompts, positions the application for various business opportunities. It can be offered as a SaaS platform with tiered subscriptions for access to features like diverse voices, languages, and translation. Target markets include content creators, language learning platforms, accessibility tools, and businesses needing automated voice responses. Further revenue streams can be generated through API integration with other applications, custom voice creation or cloning services, and affiliate marketing with related services.
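A condensed sketch of the core app flow, assuming the kokoro package's KPipeline interface; the voice names and widget layout are illustrative rather than a copy of the actual app:

```python
import io

import numpy as np
import soundfile as sf
import streamlit as st
from kokoro import KPipeline  # assumption: kokoro's KPipeline interface

text = st.text_area("Text to speak", "Hello from Kokoro!")
voice = st.selectbox("Voice", ["af_heart", "af_bella"])  # illustrative voice ids
speed = st.slider("Speed", 0.5, 2.0, 1.0)

if st.button("Generate"):
    pipeline = KPipeline(lang_code="a")  # 'a' = American English
    # The pipeline yields (graphemes, phonemes, audio) chunks at 24 kHz
    chunks = [np.asarray(audio) for _, _, audio in pipeline(text, voice=voice, speed=speed)]
    wav = np.concatenate(chunks)
    buf = io.BytesIO()
    sf.write(buf, wav, 24000, format="WAV")
    st.audio(buf.getvalue(), format="audio/wav")
    st.download_button("Download WAV", buf.getvalue(), "speech.wav")
```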
G2P is an underrated piece of small TTS models, like offensive linemen who do a bunch of work and get no credit.
Instead of relying on explicit G2P, larger speech models implicitly learn this task by eating many thousands of hours of audio data. They often use a 500M+ parameter LLM at the front to predict latent audio tokens over a learned codebook, then decode these tokens into audio.
Kokoro instead relies on G2P preprocessing, is 82M parameters, and thus needs less audio to learn. Because of this, we can cherrypick high fidelity audio for training data, and deliver solid speech for those voices. In turn, this excellent audio quality & lack of background noise helps explain why Kokoro is very competitive in single-voice TTS Arenas.
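To make the G2P step concrete, here is a tiny example using the phonemizer package; Kokoro ships its own G2P preprocessing, so this is only an illustration of the idea:

```python
# Requires: pip install phonemizer, plus the espeak-ng backend on the system.
from phonemizer import phonemize

phonemes = phonemize(
    "How are you today?",
    language="en-us",
    backend="espeak",
    strip=True,
)
print(phonemes)  # an IPA-like string, the kind of input a small acoustic model consumes
```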
Seeing AI develop has been a wild ride, from trying to explain why we'd bother to generate a single sentence with a *neural network* to explaining that AI is not a magic, all-knowing box. Recent weeks and months have involved a lot of talking about how AI works: to policymakers, to other developers, but mainly to friends and family without a technical background.
Yesterday, the first provisions of the EU AI Act came into force, and one of the key highlights is the AI literacy requirement for organisations deploying AI systems. This isn't just a box-ticking exercise. Ensuring that employees and stakeholders understand AI systems is crucial for fostering responsible and transparent AI development. From recognising biases to understanding model limitations, AI literacy empowers individuals to engage critically with these technologies and make informed decisions.
In the context of Hugging Face, AI literacy has many facets: enabling more people to contribute to AI development, providing courses and documentation to ensure access, and building accessible AI tools that help users understand how AI systems function. This isn't just a regulatory milestone; it's an opportunity to foster a culture where AI literacy becomes foundational, enabling stakeholders to recognise biases, assess model limitations, and engage critically with technology.
Embedding these principles into daily practice, and eventually extending our learnings in AI literacy to the general public, is essential for building trustworthy AI that aligns with societal values.
Exciting breakthrough in Streaming Recommendation Systems! @BytedanceTalk researchers have developed the "Long-Term Interest Clock" (LIC), a revolutionary approach to understanding user preferences throughout the day.
>> Technical Innovation The system introduces two groundbreaking modules: - Clock-based General Search Unit (Clock-GSU): Intelligently retrieves relevant user behaviors by analyzing time patterns and content similarity - Clock-based Exact Search Unit (Clock-ESU): Employs time-gap-aware attention mechanism to precisely model user interests
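The paper's exact formulation isn't reproduced here, but a hedged sketch of the intuition behind time-gap-aware attention (scores between a candidate item and past behaviors modulated by how long ago each behavior happened) might look like this:

```python
import numpy as np

def time_gap_attention(query, keys, values, gaps_hours, tau=24.0):
    """query: (d,); keys/values: (n, d); gaps_hours: (n,) hours since each behavior."""
    scores = keys @ query / np.sqrt(len(query))  # content-based similarity
    scores = scores - gaps_hours / tau           # hypothetical penalty on stale behaviors
    weights = np.exp(scores - scores.max())      # softmax over the behavior sequence
    weights /= weights.sum()
    return weights @ values                      # time-aware interest vector

rng = np.random.default_rng(0)
d, n = 16, 8
interest = time_gap_attention(
    rng.standard_normal(d),                  # candidate item embedding
    rng.standard_normal((n, d)),             # past behavior keys
    rng.standard_normal((n, d)),             # past behavior values
    gaps_hours=rng.uniform(0, 24 * 365, n),  # up to a year of history
)
print(interest.shape)  # (16,)
```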
>> Key Advantages LIC addresses critical limitations of existing systems by: - Providing fine-grained time perception instead of discrete hour-based recommendations - Analyzing long-term user behavior patterns rather than just short-term interactions - Operating at item-level granularity versus broad category-level interests
>> Real-World Impact Already deployed in Douyin Music App, the system has demonstrated remarkable results: - 0.122% improvement in user active days - Significant boost in engagement metrics including likes and play rates - Enhanced user satisfaction with reduced dislike rates
>> Under the Hood The system processes user behavior sequences spanning an entire year, utilizing multi-head attention mechanisms and sophisticated time-gap calculations to understand user preferences. It pre-computes embeddings stored in parameter servers for real-time performance, making it highly scalable for production environments.
This innovation marks a significant step forward in personalized content delivery, especially for streaming platforms where user preferences vary throughout the day. The research has been accepted for presentation at WWW '25, Sydney.