Robots.txt meant for search engines don’t work well for web archives - Internet Archive Blogs - discu.eu

Hacker News

Robots.txt meant for search engines don’t work well for web archives http://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ 143 comments 21/4/2017

Linking pages

With the rise of AI, web crawlers are suddenly controversial - The Verge https://www.theverge.com/24067997/robots-txt-ai-text-file-web-crawlers-spiders 101 comments
A Curious Case of Disregarded Robots.txt – mike.pub https://mike.pub/20170425-disregarded-robots-txt 10 comments
Robots.txt is 25 years old – Martijn Koster's Pages https://www.greenhills.co.uk/posts/robotstxt-25/ 2 comments
Common Crawl And Unlocking Web Archives For Research https://www.forbes.com/sites/kalevleetaru/2017/09/28/common-crawl-and-unlocking-web-archives-for-research/ 1 comment
What is the Internet Archive doing with our books? | NWU https://nwu.org/what-is-the-internet-archive-doing-with-our-books/ 0 comments
2018-04-24: Why we need multiple web archives: the case of blog.reidreport.com https://ws-dl.blogspot.com/2018/04/2018-04-24-why-we-need-multiple-web.html 0 comments
Internet Archive to ignore robots.txt directives | Boing Boing http://boingboing.net/2017/04/22/internet-archive-to-ignore-rob.html 0 comments
What Celine Dion–Fan Dreams Say About the Early Internet - The Atlantic https://www.theatlantic.com/technology/archive/2020/01/celine-dreams-fan-site-geocities-internet-archive/604750/ 0 comments
GitHub - buren/wayback_archiver: Ruby gem to send URLs to Wayback Machine https://github.com/buren/wayback_archiver 0 comments
On Robots and Text – Pixel Envy https://pxlnv.com/blog/on-robots-and-text/ 0 comments
The trouble with openness | anderegg.ca https://anderegg.ca/2024/11/27/the-trouble-with-openness 0 comments

