- [R] Vid2Seq: a pretrained visual language model for describing multi-event videos https://arxiv.org/abs/2302.14115 14 comments machinelearning
Linking pages
Linked paper: [2302.14115] Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning