The paper is currently under review. We will add the links as soon as it is published.
For now, our code and an additional report are provided as supplementary materials.
In this study, we focus on video captioning with fully open multimodal large language models (MLLMs). Understanding visual sequences is challenging because of their intricate temporal dependencies and substantial length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with sequence length, making long-video processing computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables scalable processing of video sequences. ABMamba adopts deep state space models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba achieves competitive performance compared to typical MLLMs while delivering approximately three times higher throughput.
ABMamba consists of a vision encoder, an Aligned Hierarchical Bidirectional Scan (AHBS) module, and a Mamba-based LLM. Input videos are encoded into a sequence of tokens, and the AHBS module propagates information both forward and backward across multiple temporal resolutions, overcoming the coarse summarization and loss of sequential cues associated with simple downsampling or projection strategies. A sketch of how these components fit together is given below.
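As a rough illustration only, the following PyTorch sketch wires the three components together. The class name `ABMambaPipeline`, the tensor shapes, and the way video tokens are prepended to text embeddings are our assumptions for exposition, not the released implementation.

```python
import torch
import torch.nn as nn

class ABMambaPipeline(nn.Module):
    """Hypothetical sketch of the ABMamba forward pass:
    vision encoder -> AHBS module -> Mamba-based LLM."""

    def __init__(self, vision_encoder: nn.Module, ahbs: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # per-frame image encoder (assumed)
        self.ahbs = ahbs                      # Aligned Hierarchical Bidirectional Scan
        self.llm = llm                        # Mamba-based language backbone

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, channels, height, width)
        b, t = frames.shape[:2]
        # Encode each frame independently into patch tokens.
        patch_tokens = self.vision_encoder(frames.flatten(0, 1))  # (b*t, p, d_vis)
        patch_tokens = patch_tokens.unflatten(0, (b, t))          # (b, t, p, d_vis)
        # AHBS propagates information forward and backward across
        # multiple temporal resolutions and compresses the tokens.
        video_tokens = self.ahbs(patch_tokens)                    # (b, t', d_llm)
        # Prepend the compressed video tokens to the caption embeddings.
        inputs = torch.cat([video_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```

Here `vision_encoder`, `ahbs`, and `llm` are placeholders for the actual components; only the overall data flow is intended to match the description above.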
Overview of the Aligned Hierarchical Bidirectional Scan (AHBS) module. (a) The module consists of dimension-wise token compression, a projector, the AHBS, and temporal token compression. (b) The AHBS explicitly models intricate temporal dynamics through multi-resolution parallel bidirectional scans.
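A minimal sketch of how such a module might look follows. The choice of temporal resolutions, the per-frame mean pooling, and the `nn.GRU` standing in for the selective state-space scan are all assumptions made for illustration; the paper's actual AHBS kernel is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AHBSSketch(nn.Module):
    """Hypothetical AHBS sketch: dimension-wise token compression ->
    projector -> multi-resolution parallel bidirectional scans ->
    temporal token compression. A bidirectional GRU is a stand-in
    for the state-space scan (assumption)."""

    def __init__(self, d_vis: int, d_llm: int, resolutions=(1, 2, 4)):
        super().__init__()
        self.resolutions = resolutions
        self.compress = nn.Linear(d_vis, d_llm)   # dimension-wise token compression
        self.projector = nn.Linear(d_llm, d_llm)  # projector into the LLM space
        # One bidirectional scan per temporal resolution; d_llm assumed even.
        self.scans = nn.ModuleList(
            nn.GRU(d_llm, d_llm // 2, batch_first=True, bidirectional=True)
            for _ in resolutions
        )
        self.temporal_pool = nn.AvgPool1d(2, stride=2)  # temporal token compression

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames, patches, d_vis); pool patches per frame.
        x = self.projector(self.compress(tokens.mean(dim=2)))  # (b, t, d_llm)
        outputs = []
        for stride, scan in zip(self.resolutions, self.scans):
            # Downsample to a coarser temporal resolution,
            coarse = x[:, ::stride]
            # scan it forward and backward in parallel,
            scanned, _ = scan(coarse)
            # then upsample back so all resolutions stay aligned.
            aligned = F.interpolate(
                scanned.transpose(1, 2), size=x.size(1), mode="nearest"
            ).transpose(1, 2)
            outputs.append(aligned)
        fused = torch.stack(outputs).sum(dim=0)  # fuse the aligned resolutions
        # Temporal token compression halves the number of video tokens.
        return self.temporal_pool(fused.transpose(1, 2)).transpose(1, 2)
```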
Quantitative results of ABMamba on the VATEX and MSR-VTT datasets. ABMamba achieves competitive performance compared to existing methods while maintaining significantly higher throughput.
To appear.