Abstract
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors to V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at https://github.com/Wiselnn570/VideoRoPE.
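To make the abstract's three design choices concrete, the sketch below shows one way they could fit together: the temporal group takes the lowest-frequency rotary bands (low-frequency temporal allocation), spatial indices are offset by the scaled frame index (diagonal layout), and a multiplier stretches temporal indices relative to spatial ones (adjustable temporal spacing). This is a minimal illustrative sketch, not the paper's released implementation; the function names, the 32/16/16 dimension split, and the `t_scale` default are assumptions.

```python
import torch

def video_rope_freqs(head_dim=128, t_dim=32, theta=10000.0):
    """Hypothetical 3D rotary frequency split in the spirit of VideoRoPE.

    Splits the rotary pairs into temporal / horizontal / vertical groups and
    gives the temporal group the LOWEST-frequency bands (longest periods) to
    damp periodic oscillations. The split sizes are assumptions.
    """
    n_pairs = head_dim // 2
    inv_freq = 1.0 / (theta ** (torch.arange(n_pairs) * 2.0 / head_dim))
    # inv_freq is sorted high -> low frequency; take the tail (lowest) for time.
    t_freq = inv_freq[-t_dim:]           # low-frequency temporal bands
    xy_freq = inv_freq[:-t_dim]          # remaining bands for space
    x_freq, y_freq = xy_freq.chunk(2)    # split horizontal / vertical
    return t_freq, x_freq, y_freq

def video_rope_angles(t, x, y, t_freq, x_freq, y_freq, t_scale=2.0):
    """Rotation angles for a token at frame t, column x, row y.

    Temporal indices are stretched by `t_scale` (adjustable temporal spacing),
    and spatial indices are offset by the scaled frame index so a frame's
    visual tokens sit on a diagonal relative to surrounding text tokens
    (our reading of the diagonal layout; treat as illustrative).
    """
    ts = t_scale * t
    return torch.cat([
        ts * t_freq,            # temporal rotation, scaled spacing
        (x + ts) * x_freq,      # horizontal, offset diagonally by frame index
        (y + ts) * y_freq,      # vertical, offset diagonally by frame index
    ])

# Example: angles for a token at frame 5, pixel-patch position (3, 7).
t_f, x_f, y_f = video_rope_freqs()
angles = video_rope_angles(t=5, x=3, y=7, t_freq=t_f, x_freq=x_f, y_freq=y_f)
# angles has head_dim // 2 entries; each rotates one (cos, sin) pair of query/key.
```

For the exact allocation, layout, and spacing rules, see the linked repository.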
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models (2024)
- B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens (2024)
- Vulnerability-Aware Spatio-Temporal Learning for Generalizable and Interpretable Deepfake Video Detection (2025)
- Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency (2025)
- Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection (2025)
- When SAM2 Meets Video Shadow and Mirror Detection (2024)
- TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend