That is what I did, collecting the robots.txt files from a wide range of blogs and websites. Below you will find them.
Key Takeaways
- Only 2 out of 30 websites that I checked were not using a robots.txt file
- Even if you don’t have any specific requirements for the search bots, therefore, you probably should use a simple robots.txt file
- Most people stick to the “User-agent: *” attribute to cover all agents
- The most common “Disallowed” factor is the RSS Feed
- Google itself is using a combination of closed folders (e.g., /searchhistory/) and open ones (e.g., /search), which probably means they are treated differently
- A minority of the sites included the sitemap URL on the robots.txt file
The Minimalistic Guys
Problogger.net
User-agent: *
Disallow:
Marketing Pilgrim
User-agent: *Search Engine Journal
Disallow:
User-agent: *Matt Cutts
Disallow:
User-agent: *Pronet Advertising
Allow:
User-agent: *
Disallow: /files/
User-agent: *TechCrunch
Disallow: /mt
Disallow: /*.cgi$
User-agent: *
Disallow: /*/feed/
Disallow: /*/trackback/
The Structured Ones
Online Marketing BlogUser-agent: GooglebotShoemoney
Disallow: */feed/
User-agent: *
Disallow: /Blogger/
Disallow: /wp-admin/
Disallow: /stats/
Disallow: /cgi-bin/
Disallow: /2005x/
User-Agent: GooglebotScoreboard Media
Disallow: /link.php
Disallow: /gallery2
Disallow: /gallery2/
Disallow: /category/
Disallow: /page/
Disallow: /pages/
Disallow: /feed/
Disallow: /feed
User-agent: *SEOMoz.org
Disallow: /cgi-bin/
User-agent: Googlebot
Disallow: /category/
Disallow: /page/
Disallow: */feed/
Disallow: /2007/
Disallow: /2006/
Disallow: /wp-*
User-agent: *Wolf-Howl
Disallow: /blogdetail.php?ID=537
Disallow: /blog?page
Disallow: /blog/author/
Disallow: /blog/category/
Disallow: /tracker
Disallow: /ugc?page
Disallow: /ugc/author/
Disallow: /ugc/category/
User-agent: *John Chow
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /noindex/
Disallow: /privacy-policy/
Disallow: /about/
Disallow: /company-biographies/
Disallow: /press-media-room/
Disallow: /newsletter/
Disallow: /contact-us/
Disallow: /terms-of-service/
Disallow: /terms-of-service/
Disallow: /information/comment-policy/
Disallow: /faq/
Disallow: /contact-form/
Disallow: /advertising/
Disallow: /information/licensing-information/
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /2008/
Disallow: /2009/
Disallow: /2004/
Disallow: /*?*
Disallow: /page/
Disallow: /iframes/
sitemap: http://www.johnchow.com/sitemap.xmlSmashing Magazine
User-agent: *
Disallow: /cgi-bin/
Disallow: /go/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /author/
Disallow: /page/
Disallow: /category/
Disallow: /wp-images/
Disallow: /images/
Disallow: /backup/
Disallow: /banners/
Disallow: /archives/
Disallow: /trackback/
Disallow: /feed/
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
User-agent: Mediapartners-Google
Allow: /
User-agent: duggmirror
Disallow: /
Sitemap: http://www.smashingmagazine.com/sitemap.xmlGizmodo
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: /styles/
Disallow: /inc/
Disallow: /tag/
Disallow: /cc/
Disallow: /category/
User-agent: MSIECrawler
Disallow: /
User-agent: psbot
Disallow: /
User-agent: Fasterfox
Disallow: /
User-agent: Slurp
Crawl-delay: 200
User-Agent: GooglebotLifehacker
Disallow: /index.xml$
Disallow: /excerpts.xml$
Allow: /sitemap.xml$
Disallow: /*view=rss$
Disallow: /*?view=rss$
Disallow: /*format=rss$
Disallow: /*?format=rss$
Sitemap: http://gizmodo.com/sitemap.xml
User-Agent: Googlebot
Disallow: /index.xml$
Disallow: /excerpts.xml$
Allow: /sitemap.xml$
Disallow: /*view=rss$
Disallow: /*?view=rss$
Disallow: /*format=rss$
Disallow: /*?format=rss$
Sitemap: http://lifehacker.com/sitemap.xml
The Mainstream Media
Wall Street JournalUser-agent: *ZDNet
Disallow: /article_email/
Disallow: /article_print/
Disallow: /PA2VJBNA4R/
Sitemap: http://online.wsj.com/sitemap.xml
User-agent: *NY Times
Disallow: /Ads/
Disallow: /redir/
# Disallow: /i/ is removed per 190723
Disallow: /av/
Disallow: /css/
Disallow: /error/
Disallow: /clear/
Disallow: /mac-ad
Disallow: /adlog/
# URS per bug 239819, these were expanded
Disallow: /1300-
Disallow: /1301-
Disallow: /1302-
Disallow: /1303-
Disallow: /1304-
Disallow: /1305-
Disallow: /1306-
Disallow: /1307-
Disallow: /1308-
Disallow: /1309-
Disallow: /1310-
Disallow: /1311-
Disallow: /1312-
Disallow: /1313-
Disallow: /1314-
Disallow: /1315-
Disallow: /1316-
Disallow: /1317-
# robots.txt, www.nytimes.com 6/29/2006YouTube
#
User-agent: *
Disallow: /pages/college/
Disallow: /college/
Disallow: /library/
Disallow: /learning/
Disallow: /aponline/
Disallow: /reuters/
Disallow: /cnet/
Disallow: /partners/
Disallow: /archives/
Disallow: /indexes/
Disallow: /thestreet/
Disallow: /nytimes-partners/
Disallow: /financialtimes/
Allow: /pages/
Allow: /2003/
Allow: /2004/
Allow: /2005/
Allow: /top/
Allow: /ref/
Allow: /services/xml/
User-agent: Mediapartners-Google*
Disallow:
# robots.txt file for YouTube
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: /profile
Disallow: /results
Disallow: /browse
Disallow: /t/terms
Disallow: /t/privacy
Disallow: /login
Disallow: /watch_ajax
Disallow: /watch_queue_ajax
Bonus
GoogleUser-agent: *
Allow: /searchhistory/
Disallow: /news?output=xhtml&
Allow: /news?output=xhtml
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /default
Disallow: /m?
Disallow: /m/search?
Disallow: /wml?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/search?
No comments:
Post a Comment