Hacker News
- Bridging Images and Text – A Survey of VLMs https://nanonets.com/blog/bridging-images-and-text-a-survey-of-vlms/ 2 comments
- A Survey of Latest VLMs and VLM Benchmarks https://nanonets.com/blog/bridging-images-and-text-a-survey-of-vlms/ 2 comments
Linking pages
- Table Extraction using LLMs: Unlocking Structured Data from Documents https://nanonets.com/blog/table-extraction-using-llms-unlocking-structured-data-from-documents/ 2 comments
- Fine-Tuning Vision Language Models (VLMs) for Data Extraction https://nanonets.com/blog/fine-tuning-vision-language-models-vlms-for-data-extraction/ 2 comments
Linked pages
- LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets | LAION https://laion.ai/blog/laion-5b/ 104 comments
- [2403.09611] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training https://arxiv.org/abs/2403.09611 63 comments
- [2103.00020] Learning Transferable Visual Models From Natural Language Supervision https://arxiv.org/pdf/2103.00020.pdf 11 comments
- Visual Genome https://homes.cs.washington.edu/~ranjay/visualgenome/index.html 6 comments
- [2111.06377] Masked Autoencoders Are Scalable Vision Learners https://arxiv.org/abs/2111.06377 5 comments
- [2111.11432] Florence: A New Foundation Model for Computer Vision https://arxiv.org/abs/2111.11432 2 comments
- [2106.13884] Multimodal Few-Shot Learning with Frozen Language Models https://arxiv.org/abs/2106.13884 2 comments
- COCO - Common Objects in Context http://cocodataset.org 2 comments
- GitHub - kakaobrain/coyo-dataset: COYO-700M: Large-scale Image-Text Pair Dataset https://github.com/kakaobrain/coyo-dataset 1 comment
- [2304.08485] Visual Instruction Tuning https://arxiv.org/abs/2304.08485 1 comment
- [2403.05525] DeepSeek-VL: Towards Real-World Vision-Language Understanding https://arxiv.org/abs/2403.05525 1 comment
- [2111.09734] ClipCap: CLIP Prefix for Image Captioning https://arxiv.org/abs/2111.09734 0 comments
- [2212.08045] Image-and-Language Understanding from Pixels Only https://arxiv.org/abs/2212.08045 0 comments
- [2301.12597] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models https://arxiv.org/abs/2301.12597 0 comments
- [2111.15664] OCR-free Document Understanding Transformer https://arxiv.org/abs/2111.15664 0 comments
- [2306.14824] Kosmos-2: Grounding Multimodal Large Language Models to the World https://arxiv.org/abs/2306.14824 0 comments
- GitHub - haotian-liu/LLaVA: Visual Instruction Tuning: Large Language-and-Vision Assistant built towards multimodal GPT-4 level capabilities. https://github.com/haotian-liu/LLaVA 0 comments
- [2309.10952] LMDX: Language Model-based Document Information Extraction and Localization https://arxiv.org/abs/2309.10952 0 comments
- MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models https://mathvista.github.io/ 0 comments
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark https://mmmu-benchmark.github.io/ 0 comments