Hacker News
- Large language model data pipelines and Common Crawl https://blog.christianperone.com/2023/06/appreciating-llms-data-pipelines/ 12 comments
- Ask HN: Alternatives to Common Crawl? https://groups.google.com/g/common-crawl/c/BvMGYUY-dro 2 comments
- Common Crawl https://commoncrawl.org/ 7 comments
- Common Crawl https://commoncrawl.org/ 61 comments
- Using Common Crawl to play Family Feud https://fulmicoton.com/posts/commoncrawl/ 4 comments
- Lexalytics Text Analysis Work with Common Crawl Data http://commoncrawl.org/lexalytics-text-analysis-work-with-common-crawl-data/ 2 comments
- SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data [video] http://commoncrawl.org/startup-profile-swiftkeys-head-data-scientist-on-the-value-of-common-crawls-open-data/ 2 comments
- Triv.io donates URL index to Common Crawl http://commoncrawl.org/common-crawl-url-index/ 16 comments
- Blekko donates search data to Common Crawl http://blog.blekko.com/2012/12/17/common-crawl-donation/ 36 comments
- Common Crawl announces Open Source Big Data code contest winners http://commoncrawl.org/announcing-the-winners-of-the-code-contest/ 9 comments
- Common Crawl code contest - with fresh crawl of 3.2 billion web pages http://commoncrawl.org/common-crawls-brand-spanking-new-video-and-first-ever-code-contest/ 5 comments
- Common Crawl http://commoncrawl.org/ 5 comments
- MapReduce for the Masses: Zero to Hadoop in 5 minutes with Common Crawl http://www.commoncrawl.org/mapreduce-for-the-masses/ 26 comments
- Tokenising the English text of 30TB Common Crawl http://matpalm.com/blog/2011/12/10/common_crawl_visible_text/ 7 comments
- Free 5 Billion Page Web Index Now Available from Common Crawl Foundation http://www.readwriteweb.com/archives/common_crawl_foundation_announces_5_billion_page_w.php 39 comments
Lobsters
- Of indexing common-crawl with tantivy https://fulmicoton.com/posts/commoncrawl/ 4 comments rust
- A Linguistics powered "Theme Thesaurus" that crawls Google-News finding words mentioned most commonly with any "target" word. https://surl.im/IuJc 5 comments datascience
- Blog post: Practical Common Lisp - Crawling InterfaceLift with Common Lisp - second try [plug] http://christian.ftwca.de:8080/post/practical-common-lisp---crawling-interfacelift-with-common-lisp---second-try 9 comments lisp
- MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl | CommonCrawl http://www.commoncrawl.org/mapreduce-for-the-masses/ 4 comments programming
- This Is Why Your Website Is Slow - Ghostery releases its annual list of common embeds, including widgets and analytics, that can slow websites to a crawl. http://www.technologyreview.com/blog/mimssbits/27371/ 17 comments web_design
- This Is Why Your Website Is Slow: Ghostery releases its annual list of common embeds, including widgets and analytics, that can slow websites to a crawl https://www.technologyreview.com/blog/mimssbits/27371/?p1=blogs 8 comments technology