Given an input image, it generates several queries along with explanations to justify them. This approach can generate synthetic data for fine-tuning ColPali models.
Multimodal
- We have released SmolVLM, the tiniest VLMs yet, in 256M and 500M sizes, along with their retrieval models ColSmol for multimodal RAG
- UI-TARS are new models by ByteDance to unlock agentic GUI control, in 2B, 7B and 72B
- Alibaba DAMO lab released VideoLLaMA3, new video LMs that come in 2B and 7B
- MiniMaxAI released MiniMax-VL-01, whose decoder is based on the MiniMax-Text-01 456B MoE model with long context
- Dataset: Yale released a new benchmark called MMVU
- Dataset: CAIS released Humanity's Last Exam (HLE), a new challenging multimodal benchmark
LLMs
- DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 660B reasoning models by DeepSeek, plus six distilled dense models, on par with o1 and released under an MIT license!
- Qwen2.5-Math-PRM: new math process reward models by Qwen in 7B and 72B
- NVIDIA released AceMath and AceInstruct, a new family of models along with their datasets (SFT and reward ones too!)
Audio
- Llasa is a new speech synthesis model based on Llama that comes in 1B, 3B, and 8B
- TangoFlux is a new audio generation model trained from scratch and aligned with CRPO
Image/Video/3D Generation
- Flex.1-alpha is a new 8B pre-trained diffusion model by ostris, similar to Flux
- Tencent released Hunyuan3D-2, new 3D asset generation from images
reacted to kadirnar's post · 23 days ago
The AI Agent hype is real! This blog post dives deep into everything you need to know before deploying AI agents: from key definitions to practical recommendations. A must-read for anyone building the future of autonomous systems.
Key insight: a clear table breaking down the 5 levels of AI agents, from simple processors to fully autonomous systems. An essential framework for understanding where your agent stands on the autonomy spectrum.
Deep analysis of 15 core values reveals critical trade-offs: accuracy, privacy, safety, equity & more. The same features that make agents powerful can make them risky. Understanding these trade-offs is crucial for responsible deployment.
6 key recommendations for the road ahead:
- Create rigorous evaluation protocols
- Study societal effects
- Understand ripple effects
- Improve transparency
- Open source can make a positive difference
- Monitor base model evolution
Since I published it on GitHub a few days ago, Hugging Face's new agentic library smolagents has gathered nearly 4k stars!
But we are just getting started on agents, so we are hiring an ML Engineer to join me and double down on this effort!
The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.
Exciting News in AI: JinaAI Releases JINA-CLIP-v2!
The team at Jina AI has just released a groundbreaking multilingual multimodal embedding model that's pushing the boundaries of text-image understanding. Here's why this is a big deal:
Technical Highlights:
- Dual encoder architecture combining a 561M-parameter Jina XLM-RoBERTa text encoder and a 304M-parameter EVA02-L14 vision encoder
- Supports 89 languages with an 8,192-token context length
- Processes images up to 512×512 pixels with a 14×14 patch size
- Implements FlashAttention2 for text and xFormers for vision processing
- Uses Matryoshka Representation Learning for efficient vector storage
Under The Hood:
- Multi-stage training process with progressive resolution scaling (224→384→512)
- Contrastive learning using the InfoNCE loss in both directions (a sketch below)
- Trained on a massive multilingual dataset including 400M English and 400M multilingual image-caption pairs
- Incorporates specialized datasets for document understanding, scientific graphs, and infographics
- Uses hard negative mining with 7 negatives per positive sample
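For readers unfamiliar with the loss mentioned above, here is a minimal sketch of a bidirectional InfoNCE objective in PyTorch. This is not Jina's training code; the temperature value and batch setup are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_info_nce(text_emb: torch.Tensor, image_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired (text, image) embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature                   # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)    # matched pairs sit on the diagonal
    loss_t2i = F.cross_entropy(logits, targets)                     # text -> image direction
    loss_i2t = F.cross_entropy(logits.T, targets)                   # image -> text direction
    return 0.5 * (loss_t2i + loss_i2t)
```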
Performance:
- Outperforms previous models on visual document retrieval (52.65% nDCG@5)
- Achieves 89.73% image-to-text and 79.09% text-to-image retrieval on the CLIP benchmark
- Strong multilingual performance across 30 languages
- Maintains performance even with 75% dimension reduction (256D vs. 1024D)
Key Innovation: The model solves the long-standing challenge of unifying text-only and multi-modal retrieval systems while adding robust multilingual support. Perfect for building cross-lingual visual search systems!
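As an illustration of that cross-lingual search use case, here is a rough usage sketch. The encode_text / encode_image methods and the truncate_dim argument follow the pattern of Jina's other trust_remote_code embedding models, and the image URL is a placeholder; treat all of these as assumptions and check the model card for the exact API.

```python
from transformers import AutoModel

# Load the model; the custom encoding helpers are exposed via trust_remote_code (assumption).
model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

queries = ["a beach with palm trees", "une plage avec des palmiers"]  # cross-lingual queries
images = ["https://example.com/beach.jpg"]                            # placeholder image URL

# truncate_dim exploits Matryoshka training: keep only the first 256 dimensions.
text_emb = model.encode_text(queries, truncate_dim=256)
image_emb = model.encode_image(images, truncate_dim=256)

scores = text_emb @ image_emb.T  # higher score = better text-image match
print(scores)
```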
Kudos to the research team at Jina AI for this impressive advancement in multimodal AI!
reacted to merve's post · 3 months ago
I recently wrote a survey about deep reinforcement learning. The paper is a compact guide to some of the key concepts in reinforcement learning. Find the paper below:
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
- a new vision language model with 9x fewer image tokens, super efficient
- aligned with DPO to reduce hallucinations (a quick sketch of the DPO objective below)
- Apache 2.0 license
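Since the post highlights DPO alignment, here is a minimal sketch of the DPO objective computed from per-sequence log-probabilities. This is not NexaAIDev's training code, and beta is an illustrative value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss on summed per-sequence log-probs."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # how much the policy upweights the preferred answer
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # ... and the dispreferred (e.g. hallucinated) one
    # Push the policy to prefer chosen over rejected responses relative to the reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```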
Models
Coding: The Qwen team released two Qwen2.5-Coder checkpoints, 32B and 7B. Infly released OpenCoder: 1.5B and 8B coding models with instruction SFT'd versions and their datasets!
Image/Video Gen: Alibaba vision lab released In-Context LoRA, 10 LoRA models on different themes based on Flux. Also, Mochi, the state-of-the-art video generation model with an Apache 2.0 license, is now natively supported in diffusers (a usage sketch below).
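A rough sketch of running Mochi through diffusers, following the usual text-to-video pipeline pattern; the frame count, offloading strategy, and prompt are assumptions that may need tuning for your hardware and diffusers version.

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

# Load the Mochi pipeline in bfloat16 and offload to fit on a single GPU (assumed setup).
pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# Generate a short clip and save it as a video file.
frames = pipe("A chameleon slowly changing colors on a branch", num_frames=84).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
```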
VLMs/Multimodal: NexaAIDev released Omnivision-968M, a new vision language model aligned with DPO for reducing hallucinations, which also comes with GGUF checkpoints. Microsoft released LLM2CLIP, a new CLIP-like model with a longer context window, allowing complex text inputs and better search.
AGI?: Etched released Oasis 500M, a diffusion-based open-world model that takes keyboard input and outputs gameplay.
Datasets
Common Corpus: a text dataset with 2T tokens under a permissive license for EN/FR, spanning sources such as code, science, finance, and culture (a streaming-load sketch below).
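To browse a 2T-token corpus without downloading it, streaming mode works well. A minimal sketch, assuming the dataset lives under the PleIAs/common_corpus repo id with a train split; check the dataset card for the actual id and record fields.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading terabytes of text (repo id and split are assumptions).
ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Peek at a few records to inspect the schema.
for example in ds.take(3):
    print(example)
```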
Another great week in open ML! Here's a small recap.
Model releases
Video Language Models: AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long video language model based on DINOv2, SigLIP, Qwen2 and Llama 3.2
Small language models: Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a family of new smol language models with an Apache 2.0 license that come in sizes 135M, 360M and 1.7B, along with datasets (a quickstart sketch below). Meta released facebook/MobileLLM-1B, a new family of on-device LLMs in sizes 125M, 350M and 600M.
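A minimal quickstart for the base SmolLM2 checkpoint mentioned above, using the standard transformers text-generation API; the prompt and generation settings are just examples.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B"  # base checkpoint from the release above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Greedy-decode a short continuation from an example prompt.
inputs = tokenizer("Gravity is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```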
Any-to-Any: gpt-omni/mini-omni2, the closest reproduction of GPT-4o so far, is released: a new LLM that can take image, text, and audio input and output speech!
Dataset releases
Spawning/PD12M, a new image-caption dataset of 12.4 million examples with captions generated using Florence-2
Microsoft released a groundbreaking model that can be used for web automation, with an MIT license: microsoft/OmniParser
An interesting highlight for me was the model's performance on Mind2Web (a benchmark for web navigation), which unlocks agentic behavior for RPA agents.
No need for hefty web automation pipelines that break when the website/app design changes! Amazing work.
Lastly, the authors also fine-tune this model on open-set detection for interactable regions to see if they can use it as a plug-in for VLMs, and it actually outperforms off-the-shelf open-set detectors like GroundingDINO.
OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT-4V in parsing.