Hacker News
- Bridging Images and Text – A Survey of VLMs https://nanonets.com/blog/bridging-images-and-text-a-survey-of-vlms/ 2 comments
- A Survey of Latest VLMs and VLM Benchmarks https://nanonets.com/blog/bridging-images-and-text-a-survey-of-vlms/ 2 comments
Linking pages
- Table Extraction using LLMs: Unlocking Structured Data from Documents https://nanonets.com/blog/table-extraction-using-llms-unlocking-structured-data-from-documents/ 2 comments
- Fine-Tuning Vision Language Models (VLMs) for Data Extraction https://nanonets.com/blog/fine-tuning-vision-language-models-vlms-for-data-extraction/ 2 comments
Linked pages
- LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets | LAION https://laion.ai/blog/laion-5b/ 104 comments
- [2403.09611] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training https://arxiv.org/abs/2403.09611 63 comments
- [2103.00020] Learning Transferable Visual Models From Natural Language Supervision https://arxiv.org/pdf/2103.00020.pdf 11 comments
- Visual Genome https://homes.cs.washington.edu/~ranjay/visualgenome/index.html 6 comments
- [2111.06377] Masked Autoencoders Are Scalable Vision Learners https://arxiv.org/abs/2111.06377 5 comments
- [2111.11432] Florence: A New Foundation Model for Computer Vision https://arxiv.org/abs/2111.11432 2 comments
- [2106.13884] Multimodal Few-Shot Learning with Frozen Language Models https://arxiv.org/abs/2106.13884 2 comments
- COCO - Common Objects in Context http://cocodataset.org 2 comments
- GitHub - kakaobrain/coyo-dataset: COYO-700M: Large-scale Image-Text Pair Dataset https://github.com/kakaobrain/coyo-dataset 1 comment
- [2304.08485] Visual Instruction Tuning https://arxiv.org/abs/2304.08485 1 comment
- [2403.05525] DeepSeek-VL: Towards Real-World Vision-Language Understanding https://arxiv.org/abs/2403.05525 1 comment
- [2111.09734] ClipCap: CLIP Prefix for Image Captioning https://arxiv.org/abs/2111.09734 0 comments
- [2212.08045] Image-and-Language Understanding from Pixels Only https://arxiv.org/abs/2212.08045 0 comments
- [2301.12597] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models https://arxiv.org/abs/2301.12597 0 comments
- [2111.15664] OCR-free Document Understanding Transformer https://arxiv.org/abs/2111.15664 0 comments
- [2306.14824] Kosmos-2: Grounding Multimodal Large Language Models to the World https://arxiv.org/abs/2306.14824 0 comments
- GitHub - haotian-liu/LLaVA: Visual Instruction Tuning: Large Language-and-Vision Assistant built towards multimodal GPT-4 level capabilities. https://github.com/haotian-liu/LLaVA 0 comments
- [2309.10952] LMDX: Language Model-based Document Information Extraction and Localization https://arxiv.org/abs/2309.10952 0 comments
- MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models https://mathvista.github.io/ 0 comments
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark https://mmmu-benchmark.github.io/ 0 comments