Given an input image, it generates several queries along with explanations to justify them. This approach can generate synthetic data for fine-tuning ColPali models.
Multimodal
- We have released SmolVLM, the tiniest VLMs yet, in 256M and 500M sizes, along with their retrieval models ColSmol for multimodal RAG
- UI-TARS are new models by ByteDance to unlock agentic GUI control, in 2B, 7B and 72B
- Alibaba DAMO lab released VideoLLaMA3, new video LMs that come in 2B and 7B
- MiniMaxAI released MiniMax-VL-01, whose decoder is based on the MiniMax-Text-01 456B MoE model with long context
- Dataset: Yale released a new benchmark called MMVU
- Dataset: CAIS released Humanity's Last Exam (HLE), a new challenging multimodal benchmark
LLMs
- DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 660B reasoning models by DeepSeek, plus six distilled dense models, on par with o1 and released under an MIT license!
- Qwen2.5-Math-PRM: new math process reward models by Qwen in 7B and 72B
- NVIDIA released AceMath and AceInstruct, a new family of models along with their datasets (SFT and reward ones too!)
Audio
- Llasa is a new speech synthesis model based on Llama that comes in 1B, 3B, and 8B
- TangoFlux is a new audio generation model trained from scratch and aligned with CRPO
Image/Video/3D Generation
- Flex.1-alpha is a new 8B pre-trained diffusion model by ostris, similar to Flux
- Tencent released Hunyuan3D-2, new 3D asset generation from images
reacted to kadirnar's post · 23 days ago
The AI Agent hype is real! This blog post dives deep into everything you need to know before deploying AI agents: from key definitions to practical recommendations. A must-read for anyone building the future of autonomous systems.
Key insight: a clear table breaking down the 5 levels of AI agents, from simple processors to fully autonomous systems. An essential framework for understanding where your agent stands on the autonomy spectrum.
Deep analysis of 15 core values reveals critical trade-offs: accuracy, privacy, safety, equity & more. The same features that make agents powerful can make them risky. Understanding these trade-offs is crucial for responsible deployment.
6 key recommendations for the road ahead:
- Create rigorous evaluation protocols
- Study societal effects
- Understand ripple effects
- Improve transparency
- Open source can make a positive difference
- Monitor base model evolution
Since I published it on GitHub a few days ago, Hugging Face's new agentic library smolagents has gathered nearly 4k stars!
But we are just getting started on agents, so we are hiring an ML Engineer to join me and double down on this effort!
The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.
Exciting News in AI: JinaAI Releases JINA-CLIP-v2!
The team at Jina AI has just released a groundbreaking multilingual multimodal embedding model that's pushing the boundaries of text-image understanding. Here's why this is a big deal:
Technical Highlights:
- Dual encoder architecture combining a 561M-parameter Jina XLM-RoBERTa text encoder and a 304M-parameter EVA02-L14 vision encoder
- Supports 89 languages with an 8,192-token context length
- Processes images up to 512×512 pixels with a 14×14 patch size
- Implements FlashAttention2 for text and xFormers for vision processing
- Uses Matryoshka Representation Learning for efficient vector storage
Under The Hood:
- Multi-stage training process with progressive resolution scaling (224→384→512)
- Contrastive learning using the InfoNCE loss in both directions (a sketch below)
- Trained on a massive multilingual dataset including 400M English and 400M multilingual image-caption pairs
- Incorporates specialized datasets for document understanding, scientific graphs, and infographics
- Uses hard negative mining with 7 negatives per positive sample
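For readers unfamiliar with the loss mentioned above, here is a minimal sketch of a bidirectional InfoNCE objective in PyTorch. This is not Jina's training code; the temperature value and batch setup are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_info_nce(text_emb: torch.Tensor, image_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired (text, image) embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature                   # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)    # matched pairs sit on the diagonal
    loss_t2i = F.cross_entropy(logits, targets)                     # text -> image direction
    loss_i2t = F.cross_entropy(logits.T, targets)                   # image -> text direction
    return 0.5 * (loss_t2i + loss_i2t)
```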
Performance:
- Outperforms previous models on visual document retrieval (52.65% nDCG@5)
- Achieves 89.73% image-to-text and 79.09% text-to-image retrieval on the CLIP benchmark
- Strong multilingual performance across 30 languages
- Maintains performance even with 75% dimension reduction (256D vs. 1024D)
Key Innovation: The model solves the long-standing challenge of unifying text-only and multi-modal retrieval systems while adding robust multilingual support. Perfect for building cross-lingual visual search systems!
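As an illustration of that cross-lingual search use case, here is a rough usage sketch. The encode_text / encode_image methods and the truncate_dim argument follow the pattern of Jina's other trust_remote_code embedding models, and the image URL is a placeholder; treat all of these as assumptions and check the model card for the exact API.

```python
from transformers import AutoModel

# Load the model; the custom encoding helpers are exposed via trust_remote_code (assumption).
model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

queries = ["a beach with palm trees", "une plage avec des palmiers"]  # cross-lingual queries
images = ["https://example.com/beach.jpg"]                            # placeholder image URL

# truncate_dim exploits Matryoshka training: keep only the first 256 dimensions.
text_emb = model.encode_text(queries, truncate_dim=256)
image_emb = model.encode_image(images, truncate_dim=256)

scores = text_emb @ image_emb.T  # higher score = better text-image match
print(scores)
```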
Kudos to the research team at Jina AI for this impressive advancement in multimodal AI!
reacted to merve's post · 3 months ago
I recently wrote a survey about deep reinforcement learning. The paper is a compact guide to some of the key concepts in reinforcement learning. Find the paper below:
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
- a new vision language model with 9x fewer image tokens, super efficient
- aligned with DPO to reduce hallucinations (a quick sketch of the DPO objective below)
- Apache 2.0 license
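Since the post highlights DPO alignment, here is a minimal sketch of the DPO objective computed from per-sequence log-probabilities. This is not NexaAIDev's training code, and beta is an illustrative value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss on summed per-sequence log-probs."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # how much the policy upweights the preferred answer
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # ... and the dispreferred (e.g. hallucinated) one
    # Push the policy to prefer chosen over rejected responses relative to the reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```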
Models
Coding: The Qwen team released two Qwen2.5-Coder checkpoints, 32B and 7B. Infly released OpenCoder: 1.5B and 8B coding models with instruction SFT'd versions and their datasets!
Image/Video Gen: Alibaba vision lab released In-Context LoRA, 10 LoRA models on different themes based on Flux. Also, Mochi, the state-of-the-art video generation model with an Apache 2.0 license, is now natively supported in diffusers (a usage sketch below).
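A rough sketch of running Mochi through diffusers, following the usual text-to-video pipeline pattern; the frame count, offloading strategy, and prompt are assumptions that may need tuning for your hardware and diffusers version.

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

# Load the Mochi pipeline in bfloat16 and offload to fit on a single GPU (assumed setup).
pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# Generate a short clip and save it as a video file.
frames = pipe("A chameleon slowly changing colors on a branch", num_frames=84).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
```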
VLMs/Multimodal: NexaAIDev released Omnivision-968M, a new vision language model aligned with DPO for reducing hallucinations, which also comes with GGUF checkpoints. Microsoft released LLM2CLIP, a new CLIP-like model with a longer context window, allowing complex text inputs and better search.
AGI?: Etched released Oasis 500M, a diffusion-based open-world model that takes keyboard input and outputs gameplay.
Datasets
Common Corpus: a text dataset with 2T tokens under a permissive license for EN/FR, spanning sources such as code, science, finance, and culture (a streaming-load sketch below).
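To browse a 2T-token corpus without downloading it, streaming mode works well. A minimal sketch, assuming the dataset lives under the PleIAs/common_corpus repo id with a train split; check the dataset card for the actual id and record fields.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading terabytes of text (repo id and split are assumptions).
ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Peek at a few records to inspect the schema.
for example in ds.take(3):
    print(example)
```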
Another great week in open ML! Here's a small recap.
Model releases
Video Language Models: AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long video language model based on DINOv2, SigLIP, Qwen2 and Llama 3.2
Small language models: Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a family of new smol language models with an Apache 2.0 license that come in sizes 135M, 360M and 1.7B, along with datasets (a quickstart sketch below). Meta released facebook/MobileLLM-1B, a new family of on-device LLMs in sizes 125M, 350M and 600M.
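A minimal quickstart for the base SmolLM2 checkpoint mentioned above, using the standard transformers text-generation API; the prompt and generation settings are just examples.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B"  # base checkpoint from the release above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Greedy-decode a short continuation from an example prompt.
inputs = tokenizer("Gravity is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```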
Any-to-Any: gpt-omni/mini-omni2, the closest reproduction of GPT-4o so far, is released: a new LLM that can take image, text, and audio input and output speech!
Dataset releases
Spawning/PD12M, a new image-caption dataset of 12.4 million examples with captions generated using Florence-2
Microsoft released a groundbreaking model that can be used for web automation, with an MIT license: microsoft/OmniParser
An interesting highlight for me was the model's performance on Mind2Web (a benchmark for web navigation), which unlocks agentic behavior for RPA agents.
No need for hefty web automation pipelines that break when the website/app design changes! Amazing work.
Lastly, the authors also fine-tune this model on open-set detection for interactable regions to see if they can use it as a plug-in for VLMs, and it actually outperforms off-the-shelf open-set detectors like GroundingDINO.
OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT-4V in parsing.