🏆 [2025-02-26]: Our paper was accepted to CVPR 2025 🎉
🔥 [2025-04-04]: Our paper was selected as a Highlight (Top 3–5%) paper ✨
Widely shared videos on the internet are often edited. Although Video Large Language Models (Vid-LLMs) have recently made great progress in general video understanding tasks, their capabilities in video editing understanding (VEU) tasks remain unexplored. To address this gap, we introduce VEU-Bench (Video Editing Understanding Benchmark), a comprehensive benchmark that categorizes video editing components across various dimensions, from intra-frame features like shot size to inter-shot attributes such as cut types and transitions. Unlike previous video editing understanding benchmarks that focus mainly on editing element classification, VEU-Bench encompasses 19 fine-grained tasks across three stages: recognition, reasoning, and judging. To automate VEU annotation, we build an annotation pipeline integrated with an ontology-based knowledge base. Through extensive experiments with 11 state-of-the-art Vid-LLMs, we find that current Vid-LLMs face significant challenges on VEU tasks, with some performing worse than random choice. To alleviate this issue, we develop Oscars★, a VEU expert model fine-tuned on the curated VEU-Bench dataset. It outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3% in accuracy and achieves performance comparable to commercial models like GPT-4o. We also demonstrate that incorporating VEU data significantly enhances the performance of Vid-LLMs on general video understanding benchmarks, with an average improvement of 8.3% across nine reasoning tasks.
We present VEU-Bench (Video Editing Understanding Benchmark), the first comprehensive benchmark designed to evaluate the video editing understanding capabilities of Video Large Language Models (Vid-LLMs). Unlike general video understanding tasks, video editing understanding (VEU) requires models to recognize abstract and symbolic editing elements—such as shot types, camera motions, cut types, and transitions—and to reason about their functions and stylistic intentions within narrative contexts.
VEU-Bench introduces a three-level evaluation paradigm—recognition, reasoning, and judging—across 10 editing dimensions spanning intra-frame (e.g., shot size, angle, color), intra-shot (e.g., motion, speed), and inter-shot elements (e.g., cut type, transition). With over 50K high-quality QA pairs grounded in real edited videos, VEU-Bench offers a rich and diverse benchmark for evaluating models' ability to perceive visual editing cues, explain changes, and interpret artistic intentions.
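For concreteness, the 10 evaluated dimensions can be listed as in the sketch below. The grouping of Shot Subject, Shot Location, and Shot Type under the intra-frame category is our reading of the leaderboard columns, not an official schema.

```python
# Hedged sketch: the 10 VEU-Bench editing dimensions grouped by level.
# The placement of "shot_subject", "shot_location", and "shot_type" in the
# intra-frame group is an assumption based on the leaderboard columns.
VEU_DIMENSIONS = {
    "intra_frame": ["shot_subject", "shot_color", "shot_size",
                    "shot_angle", "shot_location", "shot_type"],
    "intra_shot": ["shot_motion", "shot_speed"],
    "inter_shot": ["transition", "cut_type"],
}

assert sum(len(v) for v in VEU_DIMENSIONS.values()) == 10
```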
To generate high-quality annotations, we design an ontology-based annotation pipeline built upon domain-specific knowledge extracted from professional video editing tutorials. This system rewrites abstract editing concepts into video-specific prompts and explanations, enabling scalable generation of reasoning and judging tasks with minimal human intervention.
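As a rough illustration of this idea, the sketch below shows how an ontology entry might be rewritten into a video-specific prompt. The knowledge-base entries, the `build_annotation_prompt` helper, and the clip description are hypothetical placeholders, not the paper's actual pipeline.

```python
# Hypothetical sketch of an ontology-based annotation step (not the paper's code).
# A small knowledge base maps each editing concept to a definition and its
# typical narrative function; these are spliced into a video-specific prompt
# that an LLM could answer to produce reasoning/judging annotations.

EDITING_ONTOLOGY = {
    "match_cut": {
        "definition": "a cut joining two shots with matching composition or action",
        "function": "creates visual continuity and links two ideas or spaces",
    },
    "dolly_zoom": {
        "definition": "the camera moves while the lens zooms in the opposite direction",
        "function": "conveys unease or a character's sudden realization",
    },
}

def build_annotation_prompt(concept: str, clip_description: str) -> str:
    """Rewrite an abstract editing concept into a video-specific prompt."""
    entry = EDITING_ONTOLOGY[concept]
    return (
        f"The clip is described as: {clip_description}\n"
        f"It uses a {concept.replace('_', ' ')} ({entry['definition']}).\n"
        f"Typical function: {entry['function']}.\n"
        "Explain why the editor may have chosen this technique here, "
        "citing visual evidence from the clip."
    )

if __name__ == "__main__":
    print(build_annotation_prompt("match_cut", "a door closing, then a coffin lid closing"))
```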
Through extensive evaluations, we reveal that current state-of-the-art Vid-LLMs perform poorly on VEU tasks—often worse than random guessing in some categories—due to their weak alignment between editing knowledge and visual perception. To address this, we introduce Oscars, a VEU expert model fine-tuned on VEU-Bench. Oscars achieves a 28.3% performance gain over existing open-source models and even rivals commercial models like GPT-4o.
More importantly, we demonstrate that training on VEU-Bench can significantly improve Vid-LLMs on general video reasoning tasks, with an average boost of 8.3% across multiple benchmarks. These findings highlight VEU-Bench as not only a challenge for editing-specific evaluation but also a valuable dataset for enhancing abstract reasoning in video foundation models.
While previous VEU benchmarks primarily focus on basic recognition tasks, VEU-Bench extends the evaluation to include both reasoning and judging.
Recognition: Models classify editing elements across 10 dimensions through multiple-choice questions.
Reasoning: Models explain changes in editing elements (e.g., shot size, transitions) with supporting evidence.
Judging: Models assess the purpose and impact of editing choices, demonstrating an understanding of the creator's intent and storytelling effects.
This three-level design offers a more comprehensive and realistic assessment of video editing understanding compared to earlier benchmarks.
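To make the three levels concrete, the items below sketch what questions at each level might look like. The field names, options, and answers are illustrative assumptions, not actual entries from the VEU-Bench dataset.

```python
# Illustrative (not actual) VEU-Bench-style items for the three task levels.
example_items = [
    {   # Recognition: multiple-choice classification of an editing element.
        "task": "recognition",
        "dimension": "shot_size",
        "question": "What is the shot size of this clip?",
        "options": ["A. Close-up", "B. Medium shot", "C. Long shot", "D. Extreme long shot"],
        "answer": "A",
    },
    {   # Reasoning: explain the change in an editing element with evidence.
        "task": "reasoning",
        "dimension": "transition",
        "question": "How does the video move from the first shot to the second, and what signals it?",
        "answer": "A cross dissolve blends the two shots, visible as several frames where both images overlap.",
    },
    {   # Judging: assess the purpose and impact of the editing choice.
        "task": "judging",
        "dimension": "cut_type",
        "question": "Why might the editor have used this cut here?",
        "answer": "The hard cut on the action keeps the pacing fast and hides the change of location.",
    },
]
```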
Comparison between VEU-Bench and previous VEU benchmarks. VEU-Bench encompasses a wider range of video editing components and includes high-level reasoning and judgment tasks.
Answer length statistics of the curated VEU-Bench dataset. Answer lengths range from 0 to 115 characters, with the majority between 31 and 50 characters.
Video duration statistics of the curated VEU-Bench Dataset. The video durations range from 1 to over 60 seconds, with the majority between 1 and 12 seconds.
Sampled video sources in our VEU-Bench dataset from various domains. The majority of the videos are selected from the AVE dataset, with the rest from MovieCuts and AutoTransition.
Task proportion statistics of our dataset. Most tasks fall under recognition, followed by reasoning and judging.
We evaluate a range of models, including both commercial and open-source ones. Current Vid-LLMs perform poorly across all benchmark dimensions, while our expert model Oscars improves on every dimension compared to its baseline, Qwen2-VL.
The leaderboard reports, for each model (with its size and number of input frames), Recognition scores (Overall, Shot Subject, Shot Color, Shot Size, Shot Angle, Shot Location, Shot Type, Shot Motion, Shot Speed, Transition, Cut Type), Reasoning & Judging scores (Overall, Shot Size, Shot Angle, Shot Location, Shot Type, Shot Motion, Transition, Cut Type), and an overall All Score.
Overall results of different models on the VEU-Bench leaderboard.
The best-performing model in each category is shown in bold, and the second best is underlined.
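For reference, the sketch below shows one way per-dimension recognition accuracy could be computed from multiple-choice predictions. The item format and field names are our assumptions for illustration, not the official evaluation script.

```python
from collections import defaultdict

def per_dimension_accuracy(items, predictions):
    """Compute multiple-choice accuracy per editing dimension.

    `items` is a list of dicts with "dimension" and "answer" (e.g. "A");
    `predictions` is a parallel list of predicted option letters.
    This mirrors the leaderboard's recognition columns but uses an assumed
    data format, not the official evaluation code.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        dim = item["dimension"]
        total[dim] += 1
        correct[dim] += int(pred.strip().upper().startswith(item["answer"]))
    return {dim: correct[dim] / total[dim] for dim in total}

# Example usage with toy data:
items = [{"dimension": "shot_size", "answer": "A"}, {"dimension": "cut_type", "answer": "C"}]
print(per_dimension_accuracy(items, ["A", "B"]))  # {'shot_size': 1.0, 'cut_type': 0.0}
```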
@inproceedings{li_2025_veubench,
  author    = {Li, Bozheng and Wu, Yongliang and Lu, Yi and Yu, Jiashuo and Tang, Licheng and Cao, Jiawang and Zhu, Wenqing and Sun, Yuyang and Wu, Jay and Zhu, Wenbo},
  title     = {VEU-Bench: Towards Comprehensive Understanding of Video Editing},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  url       = {https://cvpr.thecvf.com/virtual/2025/poster/34180},
}