Google Can Index Blocked URLs Without Crawling


Google’s John Mueller recently “liked” a tweet by search marketing consultant Barry Adams (of Polemic Digital) that concisely stated the purpose of the robots.txt exclusion protocol. It freshened up an old topic and quite possibly gave us a new way to think about it.

Google Can Index Blocked Pages

The question arose when a publisher tweeted that Google had indexed a website that was blocked by robots.txt.

Screenshot of a tweet by a person who says Google indexed a web page that was blocked by Robots.txt

John Mueller responded:

“URLs can be indexed without being crawled, if they’re blocked by robots.txt – that’s by design.

Usually that comes from links from somewhere, judging from that number, I’d imagine from within your site somewhere.”

How Robots.txt Works

Barry (@badams) tweeted:

“Robots.txt is a crawl management tool, not an index management tool.”

We typically think of robots.txt as a way to block Google from including a page in its index. But robots.txt only controls which pages Google crawls.

That’s why, if another website links to a certain page, Google can still index that page (to a certain extent) even though it is blocked from crawling it.
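
To make that concrete, here is a minimal robots.txt sketch (the directory name is hypothetical). A rule like this tells compliant crawlers not to fetch anything under /private/, but it does nothing to stop those URLs from appearing in the index if other pages link to them:

# Hypothetical robots.txt served at the root of the site
# Blocks compliant crawlers from fetching anything under /private/
User-agent: *
Disallow: /private/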

Barry then went on to explain how to keep a page out of Google’s index:

“Use meta robots directives or X-Robots-Tag HTTP headers to prevent indexing – and (counter-intuitively) let Googlebot crawl those pages you don’t want it to index so it sees those directives.”
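
As a rough sketch of the second option, the X-Robots-Tag is sent as an HTTP response header (handy for non-HTML files such as PDFs). In this hypothetical response, the file must not be disallowed in robots.txt, otherwise Googlebot never sees the header:

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex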

NoIndex Meta Tag

The noindex meta tag allows crawled pages to be kept out of Google’s index. It doesn’t stop the page from being crawled, but it does ensure the page is kept out of Google’s index.

The noindex meta tag is superior to the robots.txt exclusion protocol for keeping a web page from being indexed.
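
As a minimal illustration, the tag sits in the head of the page you want kept out of the index. The page itself is hypothetical, and it must remain crawlable so Googlebot can see the directive:

<!-- Hypothetical page that should be crawlable but not indexed -->
<head>
  <meta name="robots" content="noindex">
</head>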

Here is what John Mueller said in a tweet from August 2018:

“…if you want to prevent them from indexing, I’d use the noindex robots meta tag instead of robots.txt disallow.”

Screenshot of a tweet by Google's John Mueller recommending the noindex meta tag to prevent Google from indexing a web page

Robots Meta Tag Has Many Uses

A cool thing about the robots meta tag is that it can be used to solve problems until a better fix comes along.

For example, a publisher was having trouble generating 404 response codes because the AngularJS framework kept producing 200 status codes.

His tweet asking for help said:

“Hi @JohnMu I’m having many troubles with managing 404 pages in angularJS, always give me a 200 status on them. Any way to solve it? Thanks”

Screenshot of a tweet about 404 pages resolving as 200 response codes

John Mueller suggested using a robots noindex meta tag. This would cause Google to drop that 200 response code page from the index and regard the page as a soft 404.

“I’d make a normal error page and just add a noindex robots meta tag to it. We’ll call it a soft-404, but that’s fine there.”

So, although the web page returns a 200 response code (which means the page was successfully served), the robots meta tag will keep the page out of Google’s index, and Google will treat it as if the page was not found, which is what a 404 response signals.

Screenshot of John Mueller tweet explaining how robots meta tag works
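
A sketch of that approach, using a hypothetical error template: the page still returns a 200 status code, but the noindex directive lets Google drop the URL from the index and treat it as a soft 404:

<!-- Hypothetical error template served with a 200 status code -->
<html>
  <head>
    <meta name="robots" content="noindex">
    <title>Page not found</title>
  </head>
  <body>
    <h1>Sorry, this page could not be found.</h1>
  </body>
</html>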

Official Description of Robots Meta Tag

According to the official documentation at the World Wide Web Consortium (W3C), the body that sets web standards, this is what the robots meta tag does:

“Robots and the META element
The META element allows HTML authors to tell visiting robots whether a document may be indexed, or used to harvest more links.”

This is how the W3C documentation describes robots.txt:

“When a Robot visits a Web site, it firsts checks for …robots.txt. If it can find this document, it will analyze its contents to see if it is allowed to retrieve the document.”

Screenshot of a page from the W3C showing the official standard for the robots meta tag

The W3C describes the role of robots.txt as that of a gatekeeper for which files are retrieved. Retrieved means crawled by a robot that obeys the robots.txt exclusion protocol.

Barry Adams was right to describe the robots.txt exclusion protocol as a way to manage crawling, not indexing.

It may be helpful to think of robots.txt as a security guard at the door of your website, keeping certain web pages blocked. That framing can make untangling unusual Googlebot activity on blocked web pages a bit easier.


Images by Shutterstock, Modified by Author
Screenshots by Author, Modified by Author


