AV LLMs
A collection of Audio, Video and Visual LLMs.
- Text-to-Speech • Updated • 422
- 1.02k
OpenVoice
🤗 dataautogpt3/ProteusV0.3
Text-to-Image • Updated • 98k • 93
ByteDance/SDXL-Lightning
Text-to-Image • Updated • 158k • 1.98k
openai/whisper-large-v3
Automatic Speech Recognition • Updated • 4.02M • 4.02k
stabilityai/TripoSR
Image-to-3D • Updated • 31.6k • 517
Efficient-Large-Model/VILA-7b
Text Generation • Updated • 182 • 26
google/paligemma-3b-pt-896
Image-Text-to-Text • Updated • 3.57k • 116
microsoft/Phi-3-vision-128k-instruct
Text Generation • Updated • 156k • 947
stabilityai/stable-audio-open-1.0
Text-to-Audio • Updated • 22.7k • 1.06k
OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 37
aiola/whisper-medusa-v1
Updated • 137 • 178
merve/idefics3llama-vqav2
Updated • 8
black-forest-labs/FLUX.1-schnell
Text-to-Image • Updated • 987k • 3.36k
- 112
Llama3.1 S V0.2 Checkpoint 2024 08 20
😻Convert text to audio and vice versa
gpt-omni/mini-omni
Text-to-Speech • Updated • 1 • 416
fishaudio/fish-speech-1.4
Text-to-Speech • Updated • 1.03k • 448
- 166
Tonic's GOT OCR
📲GOT - OCR (from : UCAS, Beijing)
stepfun-ai/GOT-OCR2_0
Image-Text-to-Text • Updated • 373k • 1.37k
apple/coreml-sam2-large
Mask Generation • Updated • 20 • 25
coreml-projects/sam-2-studio
Updated • 22
mistralai/Pixtral-12B-2409
Image-Text-to-Text • Updated • 599
allenai/Molmo-72B-0924
Image-Text-to-Text • Updated • 4.61k • 280
openai/whisper-large-v3-turbo
Automatic Speech Recognition • Updated • 6.84M • 1.93k
Revai/reverb-asr
Automatic Speech Recognition • Updated • 10 • 79
- 344
GOT Online
💬Extract text from images using various OCR modes
facebook/vfusion3d
Image-to-3D • Updated • 76 • 66
facebook/cotracker
Updated • 866 • 35
rhymes-ai/Aria
Image-Text-to-Text • Updated • 25.7k • 612
SWivid/F5-TTS
Text-to-Speech • Updated • 1.15M • 894
- 63
Ichigo Llama3.1 S Instruct
🏢Generate text from audio recordings
kyutai/moshiko-mlx-q4
Updated • 355 • 28
kyutai/moshiko-mlx-q8
Updated • 251 • 5
- 99
Open VLM Video Leaderboard
🌎VLMEvalKit Eval Results in video understanding benchmark
jimmycarter/LibreFLUX
Text-to-Image • Updated • 489 • 159
microsoft/OmniParser
Image-Text-to-Text • Updated • 2.09k • 1.55k
- 236
Aya Expanse
🌍Interact with Aya Expanse to chat, speak, and generate images in 23 languages
CohereForAI/aya-expanse-32b
Text Generation • Updated • 47.4k • 208
stabilityai/stable-diffusion-3.5-medium
Text-to-Image • Updated • 151k • 568
OuteAI/OuteTTS-0.1-350M
Text-to-Speech • Updated • 5.27k • 300
vidore/colpali
Updated • 47k • 418
vidore/colpali-v1.2
Updated • 66.1k • 105
si-pbc/hertz-dev
Audio-to-Audio • Updated • 209
- 38
Talk To Ultravox
⚡Talk to Fixie.ai's Ultravox with WebRTC ⚡️
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 113
Xkev/Llama-3.2V-11B-cot
Image-Text-to-Text • Updated • 5.36k • 141
google/paligemma-3b-pt-224
Image-Text-to-Text • Updated • 47.5k • 299
apple/coreml-mobileclip
Updated • 312 • 40
InstantX/InstantIR
Image-to-Image • Updated • 1 • 165
- 79
InstantIR
🖼diffusion-based Image Restoration model
- 137
Flux IP Adapter
🖼Prompt with Images in flux[dev]
- 38
Image Preferences - Argilla annotation space
🖼A community project to create an image preferences dataset.
fishaudio/fish-speech-1.5
Text-to-Speech • Updated • 8.69k • 445
meta-llama/Llama-3.3-70B-Instruct
Text Generation • Updated • 611k • 1.9k
- 42
Paligemma2 Vqav2
🐨PaliGemma2 LoRA finetuned on VQAv2
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper • 2412.04467 • Published • 106
fancyfeast/llama-joycaption-alpha-two-hf-llava
Updated • 79.6k • 138
taohu/mask
Updated • 5
[MASK] is All You Need
Paper • 2412.06787 • Published • 2
- 605
Open VLM Leaderboard
🌎VLMEvalKit Evaluation Results Collection
microsoft/LLM2CLIP-Llama3.2-1B-EVA02-L-14-336
Zero-Shot Image Classification • Updated • 10
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper • 2411.04997 • Published • 37
Generative Powers of Ten
Paper • 2312.02149 • Published • 6
- 62
StoryStar
💬Fantasy story generator
GoodiesHere/Apollo-LMMs-Apollo-7B-t32
Video-Text-to-Text • Updated • 417 • 50
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 139
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • Updated • 1.7M • 1.11k
XiaoduoAILab/Xmodel_VLM
Text Generation • Updated • 68 • 12
nvidia/Cosmos-1.0-Diffusion-14B-Text2World
Updated • 88.4k • 49
nvidia/Cosmos-1.0-Autoregressive-12B
Updated • 718 • 28
nvidia/Cosmos-1.0-Autoregressive-13B-Video2World
Updated • 791 • 31
nvidia/Cosmos-1.0-Diffusion-7B-Text2World
Updated • 215k • 201
nvidia/Cosmos-1.0-Diffusion-14B-Video2World
Updated • 3.54k • 50
- 353
Stable Point-Aware 3D
⚡Create 3D models from images
hexgrad/Kokoro-82M
Text-to-Speech • Updated • 364k • 3.04k
- 1.98k
Kokoro TTS
❤Upgraded to v1.0!
openbmb/MiniCPM-o-2_6
Any-to-Any • Updated • 460k • 936
- 287
TTS Spaces Arena
🤗Blind vote on HF TTS models!
google/paligemma2-10b-pt-896
Image-Text-to-Text • Updated • 4.09k • 29
NovaSky-AI/Sky-T1-32B-Preview
Text Generation • Updated • 20k • 527
MiniMaxAI/MiniMax-VL-01
Image-Text-to-Text • Updated • 2.31k • 234
- 46
SmolVLM
📊Generate descriptions from images and text prompts
HKUSTAudio/Llasa-3B
Text-to-Speech • Updated • 7.88k • 426
HuggingFaceTB/SmolVLM-500M-Instruct
Image-Text-to-Text • Updated • 19.2k • 98
deepseek-ai/Janus-Pro-7B
Any-to-Any • Updated • 381k • 2.85k
- 233
Kokoro TTS Zero
🎴✨[With v1.0.0] Accelerated TTS on Kokoro-82M
kyutai/hibiki-2b-mlx-bf16
Translation • Updated • 165 • 14
kyutai/hibiki-2b-pytorch-bf16
Translation • Updated • 182 • 39
ARTPARK-IISc/Vaani
Viewer • Updated • 9.72M • 1.71k • 18
Zyphra/Zonos-v0.1-hybrid
Text-to-Speech • Updated • 874 • 501
Zyphra/Zonos-v0.1-transformer
Updated • 2.88k • 153