GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.

Linking pages

GitHub - microsoft/BlingFire: A lightning fast Finite State machine and REgular expression manipulation library. https://github.com/microsoft/blingfire 92 comments
GitHub - llSourcell/DoctorGPT: DoctorGPT is an LLM that can pass the US Medical Licensing Exam. It works offline, it's cross-platform, & your health data stays private. https://github.com/llSourcell/DoctorGPT 75 comments
GitHub - niedev/RTranslator: Open source real-time translation app for Android that runs locally https://github.com/niedev/RTranslator 64 comments
Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments - Lightning AI https://lightning.ai/pages/community/lora-insights/ 39 comments
Tokenization for language modeling: Byte Pair Encoding vs Unigram Language Modeling | Nick Dingwall https://ndingwall.github.io/blog/tokenization 39 comments
GitHub - google-research/bert: TensorFlow code and pre-trained models for BERT https://github.com/google-research/bert 21 comments
GitHub - facebookresearch/LASER: Language-Agnostic SEntence Representations https://github.com/facebookresearch/LASER 11 comments
spaCy meets Transformers: Fine-tune BERT, XLNet and GPT-2 · Explosion https://explosion.ai/blog/spacy-pytorch-transformers 11 comments
How the BPE tokenization algorithm used by large language models works. | sidsite https://sidsite.com/posts/bpe/ 11 comments
Towards an ImageNet Moment for Speech-to-Text https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/ 10 comments
GitHub - maziarraissi/Applied-Deep-Learning: Applied Deep Learning Course https://github.com/maziarraissi/Applied-Deep-Learning 6 comments
ML-Enhanced Code Completion Improves Developer Productivity – Google AI Blog https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.html 3 comments
GitHub - VKCOM/YouTokenToMe: Unsupervised text tokenizer focused on computational efficiency https://github.com/VKCOM/YouTokenToMe 3 comments
Overview of tokenization algorithms in NLP | by Ane Berasategi | Towards Data Science https://towardsdatascience.com/overview-of-nlp-tokenization-algorithms-c41a7d5ec4f9 3 comments
GitHub - NLPOptimize/flash-tokenizer: EFFICIENT AND OPTIMIZED TOKENIZER ENGINE FOR LLM INFERENCE SERVING https://github.com/NLPOptimize/flash-tokenizer 2 comments
GitHub - argosopentech/argos-train: Training scripts for Argos Translate https://github.com/argosopentech/argos-train 1 comment
How we used Universal Sentence Encoder and FAISS to make our search 10x smarter | by Maxim Leonovich | OneBar https://blog.onebar.io/building-a-semantic-search-engine-using-open-source-components-e15af5ed7885 1 comment
GitHub - amrzv/awesome-colab-notebooks: Collection of google colaboratory notebooks for fast and easy experiments https://github.com/amrzv/awesome-colab-notebooks 0 comments
spaCy meets Transformers: Fine-tune BERT, XLNet and GPT-2 · Explosion https://explosion.ai/blog/spacy-transformers 0 comments
Subword Tokenization - Handling Misspellings and Multilingual Data - the Thought Vector blog - Blog Vector https://www.thoughtvector.io/blog/subword-tokenization/ 0 comments
