Suyuchen Wang

sheryc

https://suyuchen.wang/

AI & ML interests

Playing with LLMs

Recent Activity

reacted to ahmed-masry's post with 🚀 7 days ago

Happy to announce AlignVLM 📏, a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs) 🌍📄🖼

🔗 Read the paper: https://huggingface.co/papers/2502.01341

🧐 What's the challenge? Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance. ❌

🎯 Our Solution: ALIGN Connector. We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space that the LLM can effectively interpret. ✅

🔬 How does it perform? We compared ALIGN against common connectors like MLPs, Perceiver Resampler, and Ovis, all trained under similar configurations. The results? ALIGN outperforms them all 🏆 on diverse document understanding tasks 📄.

📊 Meet the AlignVLM Model Family! We trained Llama 3.1 (1B, 3B, 8B) using our connector and benchmarked them against various models. The results:
✅ AlignVLM surpasses all Base VLMs trained under similar configurations.
✅ Our models also perform competitively against Instruct VLMs such as Qwen2-VL and InternVL-2.5 🚀.

🤔 What about robustness to noise? We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector:
✅ ALIGN Connector: Minimal drop (↓1.67%), proving its high robustness!
❌ MLP Connector: Severe degradation (↓25.54%), struggling with noisy inputs.

Code & model weights coming soon! Stay tuned! 🔥
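The connector idea in the post (vision features mapped to a weighted average of LLM text embeddings) can be illustrated with a toy sketch. This is not the authors' released code: the post does not say how the weights are computed, so the linear scoring matrix `proj_weights`, the softmax normalization, and the helper names `align_connector` and `add_gaussian_noise` are all assumptions made for illustration. The sketch only shows why such an output stays inside the span of the text embeddings, even when the input is perturbed with Gaussian noise (μ=0, σ=3) as in the robustness test.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def align_connector(vision_feature, proj_weights, text_embeddings):
    """Map one vision feature to a convex combination of text embeddings.

    proj_weights is a hypothetical (vocab_size x feature_dim) matrix that
    scores each vocabulary entry; the post does not specify this step.
    text_embeddings is the (vocab_size x embed_dim) embedding table.
    """
    # One score per vocabulary entry (assumed linear scoring).
    logits = [sum(w * x for w, x in zip(row, vision_feature)) for row in proj_weights]
    # Non-negative weights that sum to 1 (assumed softmax normalization).
    weights = softmax(logits)
    embed_dim = len(text_embeddings[0])
    # Weighted average of the text embeddings, one coordinate at a time.
    return [
        sum(p * emb[d] for p, emb in zip(weights, text_embeddings))
        for d in range(embed_dim)
    ]


def add_gaussian_noise(feature, sigma=3.0, rng=random):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in feature]


# Toy setup: 3 "vocabulary" embeddings in 2-D, 4-D vision features.
rng = random.Random(0)
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
proj_weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
feature = [0.5, -0.2, 0.1, 0.9]

clean = align_connector(feature, proj_weights, text_embeddings)
noisy = align_connector(
    add_gaussian_noise(feature, 3.0, rng), proj_weights, text_embeddings
)
```

Because the weights are non-negative and sum to 1, both `clean` and `noisy` are convex combinations of the rows of `text_embeddings`, so every coordinate stays within the range the embeddings cover no matter how large the input noise is. An MLP connector has no such constraint on its output.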

authored a paper 7 days ago

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

upvoted a paper 7 days ago

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

sheryc's activity

upvoted a paper 7 days ago

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Paper • 2502.01341 • Published 8 days ago • 33

upvoted a paper about 2 months ago

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Paper • 2412.04626 • Published Dec 5, 2024 • 13

upvoted a paper 5 months ago

LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

Paper • 2409.00509 • Published Aug 31, 2024 • 38

upvoted a collection 8 months ago

VisionLM

Collection

662 items • Updated 1 day ago • 40

upvoted 2 papers 8 months ago

VCR: Visual Caption Restoration

Paper • 2406.06462 • Published Jun 10, 2024 • 10

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Paper • 2405.21060 • Published May 31, 2024 • 64

upvoted a paper 9 months ago

The Road Less Scheduled

Paper • 2405.15682 • Published May 24, 2024 • 23

upvoted 2 papers 11 months ago

Stealing Part of a Production Language Model

Paper • 2403.06634 • Published Mar 11, 2024 • 91

Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU

Paper • 2403.06504 • Published Mar 11, 2024 • 53