Google Can Index Blocked URLs Without Crawling


Google’s John Mueller recently “liked” a tweet by search marketing consultant Barry Adams (of Polemic Digital) that concisely stated the purpose of the robots.txt exclusion protocol. It revisited an old topic and quite possibly gave us a new way to think about it.

Google Can Index Blocked Pages

The issue began when a publisher tweeted that Google had indexed a web page that was blocked by robots.txt.

Screenshot of a tweet by a person who says Google indexed a web page that was blocked by Robots.txt

John Mueller responded:

“URLs can be indexed without being crawled, if they’re blocked by robots.txt – that’s by design.

Usually that comes from links from somewhere, judging from that number, I’d imagine from within your site somewhere.”

How Robots.txt Works

Barry (@badams) tweeted:

“Robots.txt is a crawl management tool, not an index management tool.”

We sometimes think of robots.txt as a way to block Google from including a web page in Google’s index. But robots.txt is only a way to block which pages Google crawls.

That’s why, if another site has a link to a certain web page, Google can still index that web page (to a certain extent) without crawling it.
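As a rough illustration, a disallow rule in robots.txt looks like this (the /private-folder/ path is just a made-up example, not taken from any real site):

    User-agent: *
    Disallow: /private-folder/

That rule only tells compliant crawlers not to fetch URLs under /private-folder/. It says nothing about indexing, so if another page links to a URL in that folder, Google can still list the URL in search results, typically without a description, because it never crawled the content.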

Barry then went on to explain how to keep a web page out of Google’s index:

“Use meta robots directives or X-Robots-Tag HTTP headers to prevent indexing – and (counter-intuitively) let Googlebot crawl those pages you don’t want it to index so it sees those directives.”
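The X-Robots-Tag mentioned in that tweet is the HTTP-header version of the same directive. A minimal sketch of the response headers a server might send for a page (or a PDF, which has no HTML head to hold a meta tag) that should stay out of the index:

    HTTP/1.1 200 OK
    Content-Type: text/html
    X-Robots-Tag: noindex

Either way, Googlebot has to be allowed to crawl the URL in order to see the directive, which is Barry’s point about not blocking those pages in robots.txt.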

NoIndex Meta Tag

The noindex meta tag allows crawled pages to be kept out of Google’s index. It doesn’t stop the page from being crawled, but it does ensure the page will be kept out of Google’s index.

The noindex meta tag is superior to the robots.txt exclusion protocol for keeping a web page from being indexed.
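The tag itself is a single line placed in the page’s HTML head, for example:

    <meta name="robots" content="noindex">

When Googlebot crawls the page and sees this tag, it keeps the page out of the index (or drops it if it was already indexed).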

Here is what John Mueller said in a tweet from August 2018:

“…if you want to prevent them from indexing, I’d use the noindex robots meta tag instead of robots.txt disallow.”

Screenshot of a tweet by Google's John Mueller recommending the noindex meta tag to prevent Google from indexing a web page

Robots Meta Tag Has Many Uses

A useful thing about the robots meta tag is that it can be used to solve problems until a better fix comes along.

For example, a publisher was having trouble producing 404 response codes because the AngularJS framework kept producing 200 status codes.

His tweet asking for help said:

“Hi @JohnMu I´m having many troubles with managing 404 pages in angularJS, always give me a 200 status on them. Any way to solve it? Thanks”

Screenshot of a tweet about 404 pages resolving as 200 response codes

John Mueller suggested using a robots noindex meta tag. This would cause Google to drop that 200 response code page from the index and regard that page as a soft 404.

“I’d make a normal error page and just add a noindex robots meta tag to it. We’ll call it a soft-404, but that’s fine there.”

So, although the web page is returning a 200 response code (which means the page was successfully served), the robots meta tag will keep the page out of Google’s index and Google will treat it as if the page was not found, which is a 404 response.
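As an illustrative sketch (not the publisher’s actual markup), the error template that AngularJS serves with a 200 status would simply carry the directive in its head:

    <head>
      <title>Page not found</title>
      <meta name="robots" content="noindex">
    </head>

Googlebot crawls the page, sees the noindex, and treats the URL as a soft 404.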

Screenshot of John Mueller tweet explaining how robots meta tag works

Official Description of Robots Meta Tag

According to the official documentation of the World Wide Web Consortium (W3C), the body that decides web standards, this is what the robots meta tag does:

“Robots and the META element
The META element allows HTML authors to tell visiting robots whether a document may be indexed, or used to harvest more links.”
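The two behaviors in that description, indexing and harvesting links, map to the standard directive pairs index/noindex and follow/nofollow, which can be combined in the content attribute, for example:

    <meta name="robots" content="noindex, follow">

That combination keeps the page out of the index while still allowing robots to follow its links.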

This is how the W3C documents describe robots.txt:

“When a Robot visits a Web site, it firsts checks for …robots.txt. If it can find this document, it will analyze its contents to see if it is allowed to retrieve the document.”

Screenshot of a page from the W3C showing the official standard for the robots meta tag

The W3C interprets the role of robots.txt as being like a gatekeeper for what files are retrieved. Retrieved means crawled by a robot that obeys the robots.txt exclusion protocol.

Barry Adams was right to describe the robots.txt exclusion as a way to manage crawling, not indexing.

It may be useful to think of robots.txt as being like security guards at the door of your site, keeping certain web pages blocked. That framing can make untangling unusual Googlebot activity on blocked web pages a bit easier.


Images by Shutterstock, Modified by Author
Screenshots by Author, Modified by Author


