Google Is Telling Webmasters To Remove Noindex From Robots.txt
As of 1 September 2019, Google will stop honouring unsupported and unpublished rules in robots.txt files. The ‘noindex’ directive in particular has been targeted, with the search giant recently announcing that any such directives should be removed from robots.txt and implemented through supported methods instead.
This move comes after Google posted to their Webmaster Central Blog formalising the Robots Exclusion Protocol specification, introducing the REP as an official internet standard for developers and webmasters across the web. They stated that “developers have interpreted the protocol somewhat differently over the years,” going on to describe how the REP hasn’t been updated to cater for modern internet requirements either. For this reason, Google has worked with webmasters, search engines and the original author of the REP to update the rules and clarify a number of points. The draft has since been submitted to the IETF (Internet Engineering Task Force) and appears to be awaiting review.
Drawing on their experience with robots.txt, Google’s proposed standard clarifies a few previously uncertain points, particularly relating to indexing. The rules are largely the same, but Google states that the new specification “defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web.”
What Are The New Robots Exclusion Protocol Specification Points?
With the above in mind, Google has listed the following as the most notable points from the proposed new standard:
- Any URI-based transfer protocol, not just HTTP, can use robots.txt, including FTP or CoAP.
- There will be a new maximum file size, requiring developers to parse at least the first 500 kibibytes of a robots.txt file. This alleviates strain on servers and ensures connections aren’t held open for extended periods of time.
- Website owners will gain better flexibility to update robots.txt via a new maximum caching time of 24 hours (or the value of a cache directive, where available). This ensures that crawlers aren’t overloading websites with robots.txt requests.
- In cases where the robots.txt file is no longer available, Google or any other crawl bot will not crawl previously known disallowed pages for a reasonable period of time.
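To make the second and fourth points above concrete, here is a minimal sketch of how a crawler might apply two of the proposed limits. This is an illustration only, not Google’s implementation: the function names, the simple line-based parsing and the in-memory cache are all assumptions for the example.

```python
# Sketch of two proposed robots.txt rules: parse only the first
# 500 kibibytes of the file, and cache the parsed result for at
# most 24 hours before refetching. Not a production parser.
import time

MAX_ROBOTS_BYTES = 500 * 1024     # 500 kibibytes, per the proposed standard
MAX_CACHE_SECONDS = 24 * 60 * 60  # 24-hour maximum caching time

_cache = {}  # host -> (fetched_at, parsed_rules)

def parse_robots(raw_bytes):
    """Parse at most the first 500 KiB of a robots.txt payload."""
    text = raw_bytes[:MAX_ROBOTS_BYTES].decode("utf-8", errors="replace")
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" in line:
            field, _, value = line.partition(":")
            rules.append((field.strip().lower(), value.strip()))
    return rules

def get_rules(host, fetch, now=time.time):
    """Return cached rules for `host`, refetching only after 24 hours."""
    entry = _cache.get(host)
    if entry and now() - entry[0] < MAX_CACHE_SECONDS:
        return entry[1]  # still fresh: no request made to the server
    rules = parse_robots(fetch(host))
    _cache[host] = (now(), rules)
    return rules
```

The truncation means an oversized robots.txt can never tie up a connection indefinitely, and the cache window means a busy crawler hits each site’s robots.txt at most roughly once a day.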
How Should I Stop Google From Indexing My Site?
Following Google’s notification to webmasters that they should stop using robots.txt to control indexing, you’d be forgiven for wondering how you should block indexing instead. Thankfully, there are a number of ways you can do so correctly in advance of the 1 September deadline. Google stated that the usage of robots.txt rules relating to crawl-delay, nofollow and noindex was “contradicted by other rules in all but 0.001% of all robots.txt files on the internet”, before stating that these mistakes were actually harming the websites and their visibility.
Their suggested alternative options include:
- Add noindex to your robots meta tags. This directive is supported both in HTML meta tags and, via the X-Robots-Tag header, in HTTP responses.
- 404 and 410 HTTP status codes can be used to drop URLs from the index once these codes have been processed.
- Password protect account-related, paywall or subscription-related content using a log-in page. This will usually remove these URLs from Google’s index.
- Disallow the URL in robots.txt. A disallow rule prevents crawling, and because search engines can only index pages they know about, blocking crawling usually keeps the content out of the index.
- Remove the URL using the Search Console Remove URL tool. This is quick, simple, and reduces the risk of webmaster error.
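For reference, the first and fourth options above amount to small configuration changes. A sketch, with placeholder paths and values:

```text
# HTML: in the <head> of the page you want kept out of the index
<meta name="robots" content="noindex">

# ...or the equivalent HTTP response header
X-Robots-Tag: noindex

# robots.txt: disallow rule (blocks crawling, which usually
# keeps the content out of the index)
User-agent: *
Disallow: /private/
```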
It is uncertain whether Google will provide more direct clarification ahead of the IETF publishing the new standard; however, the information given thus far has provided webmasters with a starting point for improving their robots.txt files.