Dual Caption Preference Optimization for Diffusion Models Paper • 2502.06023 • Published 2 days ago • 7
FlashVideo:Flowing Fidelity to Detail for Efficient High-Resolution Video Generation Paper • 2502.05179 • Published 4 days ago • 19
VideoRoPE: What Makes for Good Video Rotary Position Embedding? Paper • 2502.05173 • Published 4 days ago • 58
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting Paper • 2502.05176 • Published 4 days ago • 24
HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution Paper • 2501.10045 • Published 26 days ago • 9
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published 21 days ago • 81
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass Paper • 2501.13928 • Published 19 days ago • 16
view article Article The SOTA Text-to-speech and Zero Shot Voice cloning model that no one knows about... By srinivasbilla • 22 days ago • 60
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot Paper • 2501.09012 • Published 27 days ago • 10
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction Paper • 2501.06282 • Published Jan 10 • 43
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis Paper • 2501.04561 • Published Jan 8 • 16
The Superposition of Diffusion Models Using the Itô Density Estimator Paper • 2412.17762 • Published Dec 23, 2024 • 12
Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free Paper • 2410.10814 • Published Oct 14, 2024 • 49
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis Paper • 2410.08261 • Published Oct 10, 2024 • 50
Intriguing Properties of Large Language and Vision Models Paper • 2410.04751 • Published Oct 7, 2024 • 16