Krinal Joshi

krinal

AI & ML interests

NLP, Speech

Organizations

Blog-explorers, Hugging Face Discord Community

krinal's activity

reacted to lewtun's post with 👍 about 11 hours ago
Introducing OpenR1-Math-220k!

open-r1/OpenR1-Math-220k

The community has been busy distilling DeepSeek-R1 from inference providers, but we decided to have a go at doing it ourselves from scratch 💪

What's new compared to existing reasoning datasets?

♾ Based on AI-MO/NuminaMath-1.5: we focus on math reasoning traces and generate answers for problems in NuminaMath 1.5, an improved version of the popular NuminaMath-CoT dataset.

🐳 800k R1 reasoning traces: We generate two answers for 400k problems using DeepSeek R1. The filtered dataset contains 220k problems with correct reasoning traces.

📀 512 H100s running locally: Instead of relying on an API, we leverage vLLM and SGLang to run generations locally on our science cluster, generating 180k reasoning traces per day.

⏳ Automated filtering: We apply Math Verify to only retain problems with at least one correct answer. We also leverage Llama3.3-70B-Instruct as a judge to retrieve more correct examples (e.g., for cases with malformed answers that can't be verified with a rules-based parser).
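
The filtering rule described above (retain a problem when at least one of its generated answers verifies) can be sketched roughly as follows. This is an illustration only, not the Open-R1 pipeline: `verify_answer` is a hypothetical stand-in for Math Verify that merely compares strings, and the record layout and sample data are made up.

```python
from typing import Callable


def verify_answer(candidate: str, reference: str) -> bool:
    """Hypothetical stand-in for a real verifier such as Math Verify.

    It only compares strings with spaces removed; a real verifier parses
    and compares the mathematical expressions themselves.
    """
    return candidate.replace(" ", "") == reference.replace(" ", "")


def filter_problems(
    problems: list[dict],
    verify: Callable[[str, str], bool] = verify_answer,
) -> list[dict]:
    """Keep only problems where at least one generated answer verifies."""
    kept = []
    for problem in problems:
        flags = [verify(answer, problem["reference"]) for answer in problem["answers"]]
        if any(flags):
            # Record which answers verified so later stages can use them.
            kept.append({**problem, "correct": flags})
    return kept


# Made-up sample records: two generated answers per problem.
problems = [
    {"id": 1, "reference": "1/2", "answers": ["1 / 2", "2"]},
    {"id": 2, "reference": "7", "answers": ["8", "9"]},
]
filtered = filter_problems(problems)
print([p["id"] for p in filtered])  # [1]
print(filtered[0]["correct"])       # [True, False]
```

A judge model, as mentioned in the post, could be plugged in as a second `verify` callable for the answers that the rule-based check rejects.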

📊 We match the performance of DeepSeek-Distill-Qwen-7B by finetuning Qwen-7B-Math-Instruct on our dataset.

🔎 Read our blog post for all the nitty-gritty details: https://huggingface.co/blog/open-r1/update-2
reacted to mmhamdy's post with 👍 about 11 hours ago
⛓ Evaluating Long Context #2: SCROLLS and ZeroSCROLLS

In this series of posts about tracing the history of long context evaluation, we started with Long Range Arena (LRA). Introduced in 2020, LRA is one of the earliest benchmarks designed to tackle the challenge of long context evaluation. However, it was introduced to evaluate the transformer architecture in general rather than LLMs.

📜 The SCROLLS benchmark, introduced in 2022, addresses this gap in NLP/LLM research. SCROLLS challenges models with tasks that require reasoning over extended sequences (according to 2022 standards). So, what does it offer?

1️⃣ Long Text Focus: SCROLLS (unlike LRA) focuses mainly on text and contains inputs with thousands of words, testing models' ability to synthesize information across lengthy documents.
2️⃣ Diverse Tasks: Includes summarization, question answering, and natural language inference across domains like literature, science, and business.
3️⃣ Unified Format: All datasets are available in a text-to-text format, facilitating easy evaluation and comparison of models.

Building on SCROLLS, ZeroSCROLLS takes long text evaluation to the next level by focusing on zero-shot learning. Other features include:

1️⃣ New Tasks: Introduces tasks like sentiment aggregation and sorting book chapter summaries.
2️⃣ Leaderboard: A live leaderboard encourages continuous improvement and competition among researchers.

💡 What are some other landmark benchmarks in the history of long context evaluation? Feel free to share your thoughts and suggestions in the comments.

- SCROLLS Paper: SCROLLS: Standardized CompaRison Over Long Language Sequences (2201.03533)
- ZeroSCROLLS Paper: ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding (2305.14196)
reacted to burtenshaw's post with 👍 about 11 hours ago
The Hugging Face agents course is finally out!

👉 https://huggingface.co/agents-course

This first unit of the course sets you up with all the fundamentals to become a pro in agents.

- What's an AI Agent?
- What are LLMs?
- Messages and Special Tokens
- Understanding AI Agents through the Thought-Action-Observation Cycle
- Thought, Internal Reasoning and the Re-Act Approach
- Actions, Enabling the Agent to Engage with Its Environment
- Observe, Integrating Feedback to Reflect and Adapt
reacted to nicolay-r's post with 👍 1 day ago
📢 If you want to equip an LLM with NER for English texts, I can recommend spaCy. Sharing a wrapper of spaCy NER models for bulk-ner, dedicated to handling CSV / JSONL content:
Script: https://github.com/nicolay-r/nlp-thirdgate/blob/master/tutorials/ner_spacy_383.sh
Code: https://raw.githubusercontent.com/nicolay-r/nlp-thirdgate/refs/heads/master/ner/spacy_383.py

What you need to know about spaCy NER models:
☑️ Models are Python packages; they can be installed directly into the environment or via the Python CLI.
☑️ The library has a pipeline for optimized request handling in batches.
☑️ Architecture: DNN embedding-based models (not transformers)

🤖 List of models:
https://huggingface.co/spacy
📋 Supported NER types:
https://github.com/explosion/spaCy/discussions/9147

⚠️ NOTE: chunking does not seem applicable because of the specifics of the models and their use of the internal pipeline mechanism

🚀 Performance for sentences (en):
Model: spacy/en_core_web_sm 🔥 530 sentences per second 🔥 (similar to larger solutions)

🌌 Other wrappers for bulk-ner in nlp-thirdgate: https://github.com/nicolay-r/nlp-thirdgate#ner
reacted to singhsidhukuldeep's post with 👍 1 day ago
Fascinating deep dive into Swiggy's Hermes - their in-house Text-to-SQL solution that's revolutionizing data accessibility!

Hermes enables natural language querying within Slack, generating and executing SQL queries with an impressive <2 minute turnaround time. The system architecture is particularly intriguing:

Technical Implementation:
- Built on GPT-4 with a Knowledge Base + RAG approach for Swiggy-specific context
- AWS Lambda middleware handles communication between Slack UI and the Gen AI model
- Databricks jobs orchestrate query generation and execution

Under the Hood:
The pipeline employs a sophisticated multi-stage approach:
1. Metrics retrieval using embedding-based vector lookup
2. Table/column identification through metadata descriptions
3. Few-shot SQL retrieval with vector-based search
4. Structured prompt creation with data snapshots
5. Query validation with automated error correction
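
Stages 2 to 4 of the list above (identifying tables, retrieving few-shot SQL by vector search, and assembling a structured prompt) might look roughly like the sketch below. It is not Swiggy's implementation: the table names, example queries, prompt layout, and the hand-written 3-dimensional vectors standing in for real embeddings are all assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], items: list[dict], k: int = 1) -> list[dict]:
    """Return the k items whose 'vec' is most similar to the query vector."""
    ranked = sorted(items, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, tables: list[dict], examples: list[dict]) -> str:
    """Assemble a structured prompt from retrieved metadata and few-shot SQL."""
    lines = ["Write a SQL query for the question.", "", "Tables:"]
    lines += [f"- {t['name']}: {t['description']}" for t in tables]
    lines += ["", "Examples:"]
    for ex in examples:
        lines += [f"Q: {ex['question']}", f"SQL: {ex['sql']}"]
    lines += ["", f"Q: {question}", "SQL:"]
    return "\n".join(lines)


# Hypothetical metadata and few-shot store; toy vectors replace embeddings.
tables = [
    {"name": "orders", "description": "one row per order", "vec": [0.9, 0.1, 0.0]},
    {"name": "couriers", "description": "one row per courier", "vec": [0.0, 1.0, 0.0]},
]
examples = [
    {"question": "How many orders were placed yesterday?",
     "sql": "SELECT COUNT(*) FROM orders WHERE order_date = CURRENT_DATE - 1",
     "vec": [0.8, 0.2, 0.0]},
    {"question": "How many couriers are active?",
     "sql": "SELECT COUNT(*) FROM couriers WHERE active",
     "vec": [0.0, 0.9, 0.1]},
]
question = "How many orders were placed today?"
question_vec = [1.0, 0.0, 0.0]  # would come from an embedding model

prompt = build_prompt(question, top_k(question_vec, tables), top_k(question_vec, examples))
print(prompt)
```

Metrics retrieval (stage 1) and query validation with error correction (stage 5) are left out; the latter would wrap the model call in a loop that feeds execution errors back into the prompt.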

Architecture Highlights:
- Compartmentalized by business units (charters) for better context management
- Snowflake integration with seamless authentication
- Automated metadata onboarding with QA validation
- Real-time feedback collection via Slack

What's particularly impressive is how they've solved the data context challenge through charter-specific implementations, significantly improving query accuracy for well-defined metadata sets.

Kudos to the Swiggy team for democratizing data access across their organization. This is a brilliant example of practical AI implementation solving real business challenges.
reacted to schuler's post with 👍 1 day ago
📢 New Research Alert: Making Language Models Smaller & Smarter!

Thrilled to share the latest technical report demonstrating how to reduce language model parameters by 77% while maintaining performance.

The secret? Grouped pointwise convolutions. Yes. We brought a method from computer vision to the transformers arena.

🔑 Key Findings:
• 77% parameter reduction.
• Maintained model capabilities.
• Improved generalization.
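
To see why grouping shrinks a layer, it may help to count weights. A pointwise (1x1) convolution over channels has `c_in * c_out` weights; splitting the channels into `g` independent groups leaves `c_in * c_out / g`. The sketch below does only this arithmetic. The layer sizes and group count are illustrative assumptions, not the configuration from the report, and the 77% figure is not derived here.

```python
def pointwise_params(c_in: int, c_out: int, groups: int = 1) -> int:
    """Weight count of a (grouped) pointwise convolution, ignoring biases."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each group maps c_in/groups input channels to c_out/groups output channels.
    return groups * (c_in // groups) * (c_out // groups)


dense = pointwise_params(1024, 4096)               # 4194304 weights
grouped = pointwise_params(1024, 4096, groups=4)   # 1048576 weights
saving = 1 - grouped / dense

print(dense, grouped, f"{saving:.0%}")  # 4194304 1048576 75%
```

The saving applies only to the layers that are replaced, so the reduction for a whole model depends on which layers are grouped and on how information is mixed between groups.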

Paper: https://www.researchgate.net/publication/388835829_SAVING_77_OF_THE_PARAMETERS_IN_LARGE_LANGUAGE_MODELS_TECHNICAL_REPORT
Code: https://github.com/joaopauloschuler/less-parameters-llm
reacted to hexgrad's post with 👍 3 days ago
Wanted: Peak Data. I'm collecting audio data to train another TTS model:
+ AVM data: ChatGPT Advanced Voice Mode audio & text from source
+ Professional audio: Permissive (CC0, Apache, MIT, CC-BY)

This audio should *impress* most native speakers, not just barely pass their audio Turing tests. Professional-caliber means S or A-tier, not your average bloke off the street. Traditional TTS may not make the cut. Absolutely no low-fi microphone recordings like Common Voice.

The bar is much higher than last time, so there are no timelines yet and I expect it may take longer to collect such mythical data. Raising the bar means evicting quite a bit of old data, and voice/language availability may decrease. The theme is *quality* over quantity. I would rather have 1 hour of A/S-tier than 100 hours of mid data.

I have nothing to offer but the north star of a future Apache 2.0 TTS model, so prefer data that you *already have* and costs you *nothing extra* to send. Additionally, *all* the new data may be used to construct public, Apache 2.0 voicepacks, and if that arrangement doesn't work for you, no need to send any audio.

Last time I asked for horses; now I'm asking for unicorns. As of writing this post, I've currently got a few English & Chinese unicorns, but there is plenty of room in the stable. Find me over on Discord at rzvzn: https://discord.gg/QuGxSWBfQy
reacted to Xenova's post with 👍 4 days ago
We did it. Kokoro TTS (v1.0) can now run 100% locally in your browser w/ WebGPU acceleration. Real-time text-to-speech without a server. ⚡️

Generate 10 seconds of speech in ~1 second for $0.

What will you build? 🔥
webml-community/kokoro-webgpu

The most difficult part was getting the model running in the first place, but the next steps are simple:
✂️ Implement sentence splitting, allowing for streamed responses
🌍 Multilingual support (only phonemization left)
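
The sentence-splitting idea can be sketched as follows: split the text at sentence boundaries and synthesize one sentence at a time, so the first audio chunk is available before the whole text is processed. This is written in Python for illustration only; the browser demo itself is not shown here, `fake_tts` is a hypothetical placeholder for a real TTS call, and the regular expression is deliberately naive (it would wrongly split after abbreviations such as "Dr.").

```python
import re
from typing import Callable, Iterator

# Split at whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter based on ., ! and ? followed by whitespace."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def stream_speech(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield one audio chunk per sentence instead of one chunk for the whole text."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)


def fake_tts(sentence: str) -> bytes:
    """Hypothetical placeholder: a real implementation would return audio samples."""
    return sentence.encode("utf-8")


chunks = list(stream_speech("Hello there. How are you? Fine!", fake_tts))
print(chunks)  # [b'Hello there.', b'How are you?', b'Fine!']
```

Because `stream_speech` is a generator, a caller can start playing the first chunk while later sentences are still being synthesized.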

Who wants to help?
upvoted an article 4 days ago
Smol but Mighty: Can Small Models Reason well? 🤔
By evijit
reacted to fdaudens's post with 👍 4 days ago
reacted to retronic's post with 👍 4 days ago
Colox, a reasoning AI model. I am currently working on a model smarter than GPT o1 that thinks before it speaks. It is coming tomorrow in the afternoon.
reacted to nicolay-r's post with 👍 4 days ago
📢 For those who would like to embed NER into an LLM pipeline: I just made an example using the pretrained multilingual BERT from the DeepPavlov framework via bulk-ner:
📔 : https://github.com/nicolay-r/nlp-thirdgate/blob/master/tutorials/ner_deeppavlov_130.ipynb

Note: Python 3.9-3.10 is expected. Accelerate on Python 3.11 may require further tweaks to launch. I might try wrapping other frameworks later on here ↗️: https://github.com/nicolay-r/nlp-thirdgate

The new release, bulk-ner 0.25.1, includes the following updates:
✅ Removed the sentence index from output (#21)
✅ API + support function for custom entities construction
✅ Hub for providers

🌟 bulk-ner: https://github.com/nicolay-r/bulk-ner
reacted to nicolay-r's post with 👍 5 days ago
🚨 Key takeaway for quickly mastering sentiment analysis nowadays. Through the questionnaire 📝 of the past RuOpinionNE-2024 competition, we gained insights into participants' model preferences and choices. Our main conclusion:

✨ The top-performing submissions exploit few-shot learning for LLMs.

A takeaway compared with the prior RuSentNE-2023 competition:
🧠 Step-by-step reasoning requires more tweaking. The most recent solutions empowered with Chain-of-Thought tend to think too much. Earlier, we could see improvements for Flan-T5 (2.8B) in fine-tuned mode but not among the zero-shot approaches.
nicolay-r/flan-t5-tsa-thor-xl

Related materials:
https://github.com/dialogue-evaluation/RuOpinionNE-2024
RuSentNE-2023: Evaluating Entity-Oriented Sentiment Analysis on Russian News Texts (2305.17679)
Large Language Models in Targeted Sentiment Analysis (2404.12342)
upvoted an article 6 days ago
Open-source DeepResearch – Freeing our search agents
reacted to shukdevdatta123's post with 👍 6 days ago
Introducing Kokoro TTS Translate For All users:

shukdevdatta123/Kokoro-TTS

https://colab.research.google.com/drive/1DIpBzJSBBeTcpkyxkHcpngLumMapEWQz?usp=sharing

(colab link for GPU access)

Our Streamlit application provides a text-to-speech conversion tool using the Kokoro library, allowing users to input text, select language and voice, and adjust speech speed. The generated audio can be played or downloaded as a WAV file. Optionally, an OpenAI API key enables text translation to English, with subsequent speech generation for both the original and translated text. This functionality, along with helpful instructions and sample prompts, positions the application for various business opportunities. It can be offered as a SaaS platform with tiered subscriptions for access to features like diverse voices, languages, and translation. Target markets include content creators, language learning platforms, accessibility tools, and businesses needing automated voice responses. Further revenue streams can be generated through API integration with other applications, custom voice creation or cloning services, and affiliate marketing with related services.
upvoted an article 6 days ago
reacted to hexgrad's post with 👍 6 days ago
I wrote an article about G2P: https://hf.co/blog/hexgrad/g2p

G2P is an underrated piece of small TTS models, like offensive linemen who do a bunch of work and get no credit.

Instead of relying on explicit G2P, larger speech models implicitly learn this task by eating many thousands of hours of audio data. They often use a 500M+ parameter LLM at the front to predict latent audio tokens over a learned codebook, then decode these tokens into audio.

Kokoro instead relies on G2P preprocessing, is 82M parameters, and thus needs less audio to learn. Because of this, we can cherrypick high fidelity audio for training data, and deliver solid speech for those voices. In turn, this excellent audio quality & lack of background noise helps explain why Kokoro is very competitive in single-voice TTS Arenas.
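
As a rough picture of what explicit G2P preprocessing means, the sketch below maps words to phoneme strings through a lexicon lookup before anything reaches a TTS model. It is a toy under stated assumptions, not Kokoro's G2P: the three-word lexicon, the `?` marker for unknown words, and the function name are all made up for illustration.

```python
import re

# Toy pronunciation lexicon (IPA). A real lexicon holds many thousands of entries.
TOY_LEXICON = {"the": "ðə", "cat": "kæt", "sat": "sæt"}


def g2p(text: str, lexicon: dict[str, str] = TOY_LEXICON, unknown: str = "?") -> str:
    """Convert text to a space-separated phoneme string via dictionary lookup."""
    words = re.findall(r"[a-z']+", text.lower())
    # Words missing from the lexicon are marked instead of guessed.
    return " ".join(lexicon.get(word, unknown) for word in words)


print(g2p("The cat sat."))  # ðə kæt sæt
print(g2p("The dog"))       # ðə ?
```

Practical G2P systems need a fallback for out-of-vocabulary words, typically rules or a learned model, and must also handle numbers, abbreviations, and words whose pronunciation depends on context.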
upvoted an article 7 days ago
Powerful ASR + diarization + speculative decoding with Hugging Face Inference Endpoints
reacted to frimelle's post with 👍 7 days ago
Seeing AI develop has been a wild ride, from trying to explain why we'd bother to generate a single sentence with a *neural network* to explaining that AI is not a magic, all-knowing box. The recent weeks and months have involved a lot of talking about how AI works: to policy makers, to other developers, but also, and mainly, to friends and family without a technical background.

Yesterday, the first provisions of the EU AI Act came into force, and one of the key highlights is the AI literacy requirements for organisations deploying AI systems. This isn't just a box-ticking exercise. Ensuring that employees and stakeholders understand AI systems is crucial for fostering responsible and transparent AI development. From recognising biases to understanding model limitations, AI literacy empowers individuals to engage critically with these technologies and make informed decisions.

In the context of Hugging Face, AI literacy has many facets: allowing more people to contribute to AI development, providing courses and documentation to ensure access is possible, and offering accessible AI tools that empower users to better understand how AI systems function. This isn't just a regulatory milestone; it's an opportunity to foster a culture where AI literacy becomes foundational, enabling stakeholders to recognise biases, assess model limitations, and engage critically with technology.

Embedding these principles into daily practice, and eventually extending our learnings in AI literacy to the general public, is essential for building trustworthy AI that aligns with societal values.
reacted to singhsidhukuldeep's post with 👍 7 days ago
Exciting breakthrough in Streaming Recommendation Systems! @BytedanceTalk researchers have developed "Long-Term Interest Clock" (LIC), a revolutionary approach to understand user preferences throughout the day.

>> Technical Innovation
The system introduces two groundbreaking modules:
- Clock-based General Search Unit (Clock-GSU): Intelligently retrieves relevant user behaviors by analyzing time patterns and content similarity
- Clock-based Exact Search Unit (Clock-ESU): Employs time-gap-aware attention mechanism to precisely model user interests

>> Key Advantages
LIC addresses critical limitations of existing systems by:
- Providing fine-grained time perception instead of discrete hour-based recommendations
- Analyzing long-term user behavior patterns rather than just short-term interactions
- Operating at item-level granularity versus broad category-level interests

>> Real-World Impact
Already deployed in Douyin Music App, the system has demonstrated remarkable results:
- 0.122% improvement in user active days
- Significant boost in engagement metrics including likes and play rates
- Enhanced user satisfaction with reduced dislike rates

>> Under the Hood
The system processes user behavior sequences spanning an entire year, utilizing multi-head attention mechanisms and sophisticated time-gap calculations to understand user preferences. It pre-computes embeddings stored in parameter servers for real-time performance, making it highly scalable for production environments.
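
One way to picture a time-gap-aware attention step is to penalize each past behavior's attention score by how far its time of day lies from the current request. The sketch below is an assumption-laden illustration, not the formulation from the LIC paper: the linear penalty, the `alpha` weight, the circular 24-hour gap, and all names are hypothetical.

```python
import math


def clock_gap_hours(t1: float, t2: float) -> float:
    """Circular distance between two times of day, in hours (0 to 12)."""
    diff = abs(t1 - t2) % 24.0
    return min(diff, 24.0 - diff)


def time_gap_attention(query, keys, values, key_hours, now_hour, alpha=0.5):
    """Single-head attention whose scores are reduced by the time-of-day gap."""
    d = len(query)
    scores = []
    for key, hour in zip(keys, key_hours):
        similarity = sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        scores.append(similarity - alpha * clock_gap_hours(now_hour, hour))
    # Numerically stable softmax over the penalized scores.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Two past behaviors with identical content but different times of day.
output, weights = time_gap_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
    key_hours=[22.0, 9.0],
    now_hour=23.0,
)
print(weights)  # the 22:00 behavior receives almost all of the weight
```

With equal content similarity, the behavior at 22:00 dominates a request made at 23:00, which is the kind of time-of-day sensitivity the post describes.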

This innovation marks a significant step forward in personalized content delivery, especially for streaming platforms where user preferences vary throughout the day. The research has been accepted for presentation at WWW '25, Sydney.