Web-based robots.txt parser using Google’s open-source code
Will Critchlow
The punchline: I’ve been playing around with a toy project recently and have deployed it as a free web-based tool for checking how Google will parse your robots.txt files, given that their own online tool does not replicate actual Googlebot behaviour. Check it out at realrobotstxt.com.
While preparing for my recent presentation at SearchLove London, I got mildly obsessed with robots.txt: the deeper I dug into how these files work, the more surprising things I found, and the more places I found conflicting information from different sources. Google’s open source robots.txt parser should have made everything easy, not only by complying with their newly published draft specification, but also by apparently being real production Google code.
Two challenges led me further down the rabbit hole that ultimately led to me building a web-based tool:
- It’s a C++ project, so needs to be compiled, which requires at least some programming / code administration skills, so I didn’t feel like it was especially accessible to the wider search community
- When I got it compiled and played with it, I discovered that it was missing crucial Google-specific functionality to enable us to see how Google crawlers like the images and video crawlers will interpret robots.txt files
Ways this tool differs from other resources
Apart from the benefit of being a web-based tool rather than requiring compilation to run locally, my realrobotstxt.com tool should be 100% compliant with the draft specification that Google released, as it is entirely powered by their open source tool except for two specific changes that I made to bring it in line with my understanding of how real Google crawlers work:
- Googlebot-image, Googlebot-video and Googlebot-news(*) should all fall back on obeying Googlebot directives if there are no rulesets specifically targeting their own individual user agents – we have verified that this is at least how the images bot behaves in the real world
- Google has a range of bots (AdsBot-Google, AdsBot-Google-Mobile, and the AdSense bot, Mediapartners-Google) which apparently ignore User-agent: * directives and only obey rulesets specifically targeting their own individual user agents
[(*) Note: this is unrelated to the tweaks I’ve made, but relevant because I mentioned Googlebot-news. It is not at all well known that Googlebot-news is not a crawler and apparently hasn’t been since 2011. If you didn’t know this, don’t worry, you’re not alone. I only learned it recently, and it’s pretty hard to discern from the documentation, which regularly refers to it as a crawler. The only real official reference I can find is the blog post announcing its retirement. It makes sense to me, because having different crawlers for web and news search opens up dangerous cloaking opportunities, but why then refer to it as a crawler’s user agent throughout the docs? It seems, though I haven’t been able to test this in real life, as though rules directly targeting Googlebot-news function somewhat like a Google News-specific noindex. This is very confusing, because regular Googlebot blocking does not keep URLs out of the web index, but there you go.]
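To make the two tweaks above concrete, here is a rough Python sketch of the group-selection logic as I have described it. It is an illustration only, not the code behind realrobotstxt.com and not Google’s parser: the function names `parse_groups` and `select_rules` are made up for this example, user-agent matching is simplified to an exact, case-insensitive comparison, and only `Allow` / `Disallow` lines are read.

```python
# Simplified sketch of the two behaviours described above (hypothetical helper names).

# Crawlers that fall back to Googlebot rules when no group targets them directly.
FALLBACK_TO_GOOGLEBOT = {"googlebot-image", "googlebot-video", "googlebot-news"}
# Crawlers that apparently ignore "User-agent: *" and obey only their own groups.
IGNORES_WILDCARD = {"adsbot-google", "adsbot-google-mobile", "mediapartners-google"}


def parse_groups(robots_txt):
    """Map each lowercased user-agent token to its list of (directive, path) rules."""
    groups = {}
    current_agents = []
    seen_rule = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if seen_rule:
                # A user-agent line after rules starts a new group.
                current_agents = []
                seen_rule = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif key in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((key, value))
    return groups


def select_rules(groups, crawler):
    """Pick the rules a crawler would obey under the assumptions in this post."""
    crawler = crawler.lower()
    if crawler in groups:
        return groups[crawler]  # a group targeting this crawler always wins
    if crawler in FALLBACK_TO_GOOGLEBOT and "googlebot" in groups:
        return groups["googlebot"]  # tweak 1: fall back to Googlebot rules
    if crawler in IGNORES_WILDCARD:
        return []  # tweak 2: no specific group means no restrictions
    return groups.get("*", [])


EXAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /no-google/
"""

example_groups = parse_groups(EXAMPLE)
print(select_rules(example_groups, "Googlebot-Image"))  # uses the Googlebot group
print(select_rules(example_groups, "AdsBot-Google"))    # wildcard group is ignored
print(select_rules(example_groups, "SomeOtherBot"))     # uses the wildcard group
```

With this example file, the images crawler picks up the `/no-google/` rule from the Googlebot group, AdsBot-Google ends up with no rules at all, and any other bot gets the wildcard `/private/` rule.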
I expect to see the Search Console robots.txt checker retired soon
We have seen a gradual move to turn off old Search Console features and I expect that the robots.txt checker will be retired soon. Googlers have recently been referring to it as being out of step with how their actual crawlers work, and we can see differences in our own testing:
These cases seem to be handled correctly by the open source parser – here’s my web-based tool on the exact same scenario:
This felt like all the more reason for me to release my web-based version, as the only official web-based tool we have is out of date and likely going away. Who knows whether Google will release an updated version based on their open source parser – but until they do, my tool might prove useful to some people.
I’d like to see the documentation updated
Unfortunately, while I can make a pull request against the open source code, I can’t do the same with Google documentation. Despite hints from Google that the old Search Console checker isn’t in sync with real Googlebot, and hence shouldn’t be trusted as the authoritative answer about how Google will parse a robots.txt file, references to it remain widespread in the documentation:
- Introduction to robots.txt
- Avoid common mistakes
- Create a robots.txt file
- Test your robots.txt with the robots.txt Tester
- Submit your updated robots.txt to Google
- Debugging your pages
In addition, although it’s natural that old blog posts might not be updated with new information, these are still prominently ranking for some related searches:
Who knows. Maybe they’ll update the docs with links to my tool 😉
Let me know if it’s useful to you
Anyway. I hope you find my tool useful – I enjoyed hacking around with a bit of C++ and Python to make it – it’s good to have a “maker” project on the go sometimes when your day job doesn’t involve shipping code. If you spot any weirdness, have questions, or just find it useful, please drop me a note to let me know. You can find me on Twitter.