Hacker News
- Fine tune LLAMA3 on million scale dataset in consumer GPU using QLora, DeepSpeed https://medium.com/@sumandas0/fine-tune-llama3-on-million-scale-dataset-in-consumer-gpu-using-qlora-deepspeed-3ae8ad75299a 26 comments
- US brokers selling overlapping datasets with people as “actively pregnant” https://gizmodo.com/data-brokers-selling-pregnancy-roe-v-wade-abortion-1849148426 4 comments
- Turning petabytes of raw video data into a high-quality ML dataset https://medium.com/@mvoodarla/curating-a-dataset-from-raw-images-and-videos-c8b962eca9ba 2 comments
- The Pandora Papers – leaked dataset of 11.9M financial documents https://twitter.com/ICIJorg/status/1444474822797545476 5 comments
- Hobbling computer vision datasets against unauthorized use https://www.unite.ai/hobbling-computer-vision-datasets-against-unauthorized-use/ 4 comments
- Ethical issues in research using datasets of illicit origin https://www.lightbluetouchpaper.org/2017/11/07/ethical-issues-in-research-using-datasets-of-illicit-origin/ 19 comments
- The ImageNet dataset transformed AI research https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/ 55 comments
Lobsters
- cleanlab 2.0: Automatically Find Errors in ML Datasets https://cleanlab.ai/blog/cleanlab-2/ 3 comments ai, python, release
- Build dataflows with larger-than-memory datasets. Use two Python open source libraries in this hands-on guide to create Big Data pipelines https://medium.com/@marine.gosselin/big-data-models-vs-computer-memory-b345814ece9f 8 comments programming
- Collecting datasets for CV: any wishlists? https://www.tictag.io 14 comments computervision
- Having troubles with loading a custom dataset into yoloV5 https://learnopencv.com/custom-object-detection-training-using-yolov5/?ck_subscriber_id=1373562521#Custom-Object-Detection-Training-using-YOLOv5 3 comments learnmachinelearning
- Random Walk Dataset 2 comments octave
- Air pollution causes nearly 2 million asthma cases, and a similar number of excess deaths, per year. The researchers examined an existing dataset that looked at nitrogen dioxide concentrations in 58 countries during 2010–12. https://cosmosmagazine.com/earth/climate/air-pollution-asthma-excess-deaths/ 9 comments science
- [R] AnimeCeleb: Large-Scale Animation CelebFaces Dataset via Controllable 3D Synthetic Models https://arxiv.org/abs/2111.07640 2 comments machinelearning
- DeepMind open-sources protein structure dataset generated by AlphaFold 2 https://venturebeat.com/2021/07/22/deepmind-open-sources-protein-structure-dataset-generated-by-alphafold-2/ 3 comments science
- Handling large datasets using pandas if you have memory constraint. https://www.kaggle.com/c/avazu-ctr-prediction/overview 6 comments learnmachinelearning
- What colors should I prepare for use for graphs with unknown datasets? https://www.reddit.com/r/web_design/comments/n4f6gb/what_colors_should_i_prepare_for_use_for_graphs/ 7 comments web_design
- survivoR R package: "a collection of datasets detailing events and the cast across all 40 seasons of the US Survivor, including castaway information, vote history, immunity and reward challenge winners, jury votes, and viewers" http://gradientdescending.com/survivor-now-on-cran/ 4 comments rstats
- Rush Limbaugh downplaying hurricane Irma may have decreased evacuations. Phone-location dataset shows correlation with election results. https://arstechnica.com/science/2020/09/rush-limbaugh-downplaying-hurricane-irma-may-have-decreased-evacuations/ 5 comments politics
- Rust Notebooks: Loading Datasets from CSV into NDArray https://shahinrostami.com/posts/programming/rust-notebooks/loading-datasets-from-csv-into-ndarray/ 3 comments rust
- Google just published 25 million free datasets https://towardsdatascience.com/google-just-published-25-million-free-datasets-d83940e24284 89 comments technews
- Not sure if this is the correct subreddit, but this is a dataset of "Good" and "Evil" chat messages in a video game and I want someone to train a classifer on it https://drive.google.com/file/d/1bxAQrt-Nomj4npg7GLLtA3FTs374kK6l/view?usp=sharing 4 comments learnmachinelearning
- Training a YOLOv3 Object Detection Model with a Custom Dataset https://blog.roboflow.ai/training-a-yolov3-object-detection-model-with-a-custom-dataset/ 3 comments computervision
- Twelve Million Phones, One Dataset, Zero Privacy https://www.nytimes.com/interactive/2019/12/19/opinion/location-tracking-cell-phone.html 28 comments firefox
- 70+ Machine Learning Datasets - Gain real-world experience with Data Science projects! https://data-flair.training/blogs/machine-learning-datasets/ 3 comments programming
- What Does ‘Broken’ Sound Like? First-Ever Audio Dataset of Malfunctioning Industrial Machines https://medium.com/syncedreview/what-does-broken-sound-like-first-ever-audio-dataset-of-malfunctioning-industrial-machines-b4f8f6d81dd7 3 comments artificial
- Github Releases Dataset Of Six Million Methods From Open Source Projects For CodeSearchNet Challenge https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/?utm_campaign=1569513857&utm_medium=social&utm_source=twitter&utm_content=1569513857 13 comments programming
- 300+ Free Datasets for Machine Learning divided into 10 Use Cases https://lionbridge.ai/business-resources/open-datasets-for-machine-learning/ 11 comments datascience
- Training a chatbot with dialogues dataset https://www.reddit.com/r/artificial/comments/an337w/training_a_chatbot_with_dialogues_dataset/ 10 comments artificial
- I Compiled a Dataset of ~2.5 Million /r/WallStreetBets Comments for an Algorithmic Trading Strategy Based on Market Volatility [X-Post /r/AlgoTrading] https://www.kaggle.com/theriley106/wallstreetbetscomments 29 comments wallstreetbets
- Linux Game Compatibility Checker update - new datasets, grouping & sorting & filtering options, more statistics http://lgc.lysioneer.nl/ 5 comments linux_gaming
- MNIST Tutorial with Tensorflow Dataset API http://cjalmeida.net/post/tensorflow-mnist/ 3 comments programming
- Unstable MLP's accuracy on train dataset https://datascience.stackexchange.com/questions/19763/unstable-mlps-accuracy-on-train-dataset 3 comments learnmachinelearning
- Scraping a Craft Beer dataset http://www.jeannicholashould.com/python-web-scraping-tutorial-for-craft-beers.html 7 comments datascience
- Where to find a large datasets of indian songs lyrics combine into one text file? https://www.reddit.com/r/india/comments/5kis4r/where_to_find_a_large_datasets_of_indian_songs/ 4 comments india
- Advice on checksums for very large datasets https://www.reddit.com/r/crypto/comments/5j5gxr/advice_on_checksums_for_very_large_datasets/ 15 comments crypto
- At how large of a dataset should I be using something like haystack/elasticsearch rather than using the built in ORM? https://www.reddit.com/r/django/comments/5b5sgd/at_how_large_of_a_dataset_should_i_be_using/ 3 comments django
- The UK has been using massive datasets to spy on innocent civilians for years http://uk.businessinsider.com/the-uk-has-been-using-datasets-to-spy-on-innocent-civilians-2016-4 18 comments worldnews
- Joining big public datasets: How much attention does a Hacker News frontpage post drive to a GitHub project? https://www.reddit.com/r/bigquery/comments/3qpyor/joining_hacker_news_and_github_how_much_attention/ 3 comments programming
- Benchmarking BDB, CDB and Tokyo Cabinet on large datasets http://www.dmo.ca/blog/benchmarking-hash-databases-on-large-data/ 18 comments programming