Hannah Pennington

What Is A Robots.txt File?

5th Oct 2020 SEO 5 minutes to read

A key part of your on-site optimisation is the robots.txt file. It is often overlooked, but we consider it essential to the relationship search engines have with your site. The file is often no more than a few bytes in size, yet it is well worth including in your optimisation strategy. The robots.txt file is usually found in your root directory, and its purpose is to regulate the bots that crawl your site. It’s here that you grant or deny permission to all or some specific search engine robots to access certain pages, or your site as a whole. The convention was developed in 1994 and is known as the Robots Exclusion Standard or Robots Exclusion Protocol.

More info here: http://www.robotstxt.org/

Robots.txt rules

The rules of the Robots Exclusion Standard are loose and there is no official body that governs them. The commonly used elements are listed below:

  • User-agent: This refers to the specific bots the rules apply to
  • Disallow: This refers to the site areas that the bot specified by the user-agent is not supposed to crawl
  • Allow: Used instead of or in addition to the above, with the opposite meaning

The robots.txt file often mentions the location of the sitemap as well. Most existing search bots, including those belonging to the main search engines, understand the above elements, but not all of them stick to the rules! As with everything, there are also certain cases that fall outside of this, as Google explains:

“While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.”

This indicates that Google can still index the URLs of pages even if those pages are blocked in the robots.txt file. Below is a link to the Google support guide, which also includes a further how-to section.

https://support.google.com/webmasters/answer/6062608?hl=en&visit_id=637350840876502223-3007179655&rd=1

File structure

Let us look at a typical file structure to start with:

User-agent: *
Disallow:
Sitemap: https://www.yoursite.com/sitemap.xml

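The example shown means that all robots (User-agent: *) may crawl the entire site, because the empty Disallow rule blocks nothing, and that the sitemap can be found at https://www.yoursite.com/sitemap.xml.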
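As a quick check of this reading, the file above can be fed to Python's standard-library `urllib.robotparser`. This is only a minimal sketch: the page URL and the `Googlebot` user-agent string are illustrative assumptions, and `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The typical file from the article, held in a string so nothing is fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow:
Sitemap: https://www.yoursite.com/sitemap.xml
"""


def parse_robots(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of a live URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)

# An empty Disallow blocks nothing, so any path is allowed for any bot.
allowed = parser.can_fetch("Googlebot", "https://www.yoursite.com/any/page.html")
print(allowed)

# The Sitemap line is exposed as a list of URLs.
print(parser.site_maps())
```

Parsing from a string keeps the example self-contained; against a real site you would point the parser at the live file instead.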

Access Rules

To set access rules for a specific robot, e.g. Googlebot, the user-agent needs to be defined accordingly:

  • User-agent: Googlebot
  • Disallow:/images/

In the above example, Googlebot is denied access to the /images/ folder of a site. Additionally, a specific rule can be set to explicitly disallow access to all files within a folder:

  • Disallow:/images/*

The wildcard in this case refers to all files within the folder. But robots.txt can be even more flexible and define access rules for a specific page:

  • Disallow:/blog/readme.txt

– or a certain filetype:

  • Disallow:/content/*.pdf
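To see how per-bot rules behave, the sketch below combines the Googlebot folder rule and the single-page rule from above into one file and queries it with Python's standard-library `urllib.robotparser`. `Bingbot` and the example URLs are assumptions added for illustration. As far as we are aware, this parser matches rules by simple path prefix and does not interpret `*` inside a path, so the sketch sticks to folder and page rules.

```python
from urllib.robotparser import RobotFileParser

# Rules for Googlebot only, combining two of the article's examples.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /images/
Disallow: /blog/readme.txt
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://www.yoursite.com"  # placeholder domain from the article


def may_crawl(agent: str, path: str) -> bool:
    """Return True if the given bot is allowed to crawl the given path."""
    return parser.can_fetch(agent, BASE + path)


# Googlebot is kept out of the folder and away from the single page...
print(may_crawl("Googlebot", "/images/logo.png"))   # False
print(may_crawl("Googlebot", "/blog/readme.txt"))   # False
print(may_crawl("Googlebot", "/blog/post.html"))    # True

# ...while a bot with no matching User-agent group is not restricted.
print(may_crawl("Bingbot", "/images/logo.png"))     # True
```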

If a site uses parameters in URLs and they result in pages with duplicate content, you can ask bots not to crawl them by using a corresponding rule, something like:

  • Disallow: /*?*

The above means: do not crawl any URLs with ‘?’ in them. This is usually how parameters are included in URLs.

With such an extensive set of commands, it’s easy to see how this can be tricky for website owners and webmasters alike, and how mistakes, which can be costly, are made.
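The wildcard rules in this section can be tried out with a small matcher. This is a hedged sketch rather than any search engine's actual implementation: it assumes `*` means "any run of characters" and that a rule matches from the start of the URL path, and the helper names `rule_to_regex` and `is_blocked` are our own.

```python
import re
from typing import List


def rule_to_regex(rule_path: str) -> "re.Pattern[str]":
    """Turn a Disallow path into a regex, treating '*' as 'anything'."""
    # Escape each literal piece, then join the pieces with '.*'.
    pieces = [re.escape(piece) for piece in rule_path.split("*")]
    return re.compile("^" + ".*".join(pieces))


def is_blocked(url_path: str, disallow_rules: List[str]) -> bool:
    """Return True if any non-empty Disallow rule matches the path."""
    return any(
        rule_to_regex(rule).match(url_path) is not None
        for rule in disallow_rules
        if rule  # an empty Disallow blocks nothing
    )


RULES = ["/images/*", "/content/*.pdf", "/*?*"]

print(is_blocked("/images/logo.png", RULES))      # True
print(is_blocked("/content/guide.pdf", RULES))    # True
print(is_blocked("/shop?colour=red", RULES))      # True
print(is_blocked("/shop", RULES))                 # False
```

Testing rules against a handful of real paths like this is a cheap way to catch a rule that blocks more, or less, than intended.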

Common robots.txt mistakes

Some mistakes are easy to spot. They are listed below.

No robots.txt file at all

Having no robots.txt file for your site means it is completely open for any spider to crawl. If you have a simple static site with minimal pages and nothing you wish to hide, this may not be an issue, but it’s likely you are running a CMS.

No CMS is perfect, and the chances are there are indexable instances of duplicate content, because the same articles are accessible via different URLs, as well as backend pages not intended for visitors to the site.

Empty robots.txt

An empty file can also be problematic. As well as the issues above, and depending on the CMS used on the site, both cases carry a risk of URLs like the example below getting indexed:

  • https://www.somedomain.com/srgoogle/?utm_source=google&utm_content=some%20bad%20keyword&utm_term=&utm_campaign…

This can expose your site to being indexed in the context of a bad neighbourhood. (The actual domain name has of course been replaced, but the domain where this type of URL was indexable had an empty robots.txt file.)

Default robots.txt allowing access to everything

A robots.txt file that looks like the example below:

  • User-agent: *
  • Allow: /

Or like this:

  • User-agent: *
  • Disallow:

As in the two previous cases, you are leaving your site completely unprotected, and there is little point in having a robots.txt file like this at all unless, again, you are running a minimal static site and don’t want to hide anything on the server.
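The three cases so far (no file, an empty file, and an allow-everything file) all come down to the same thing: there is no Disallow rule with a path in it. A rough way to flag this is sketched below; `is_wide_open` is a hypothetical helper name, and the check deliberately ignores which user-agent a rule belongs to.

```python
def is_wide_open(robots_txt: str) -> bool:
    """Return True if the text contains no Disallow rule with a path."""
    for raw_line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            return False  # at least one path is being blocked
    return True


print(is_wide_open(""))                                   # True
print(is_wide_open("User-agent: *\nAllow: /"))            # True
print(is_wide_open("User-agent: *\nDisallow:"))           # True
print(is_wide_open("User-agent: *\nDisallow: /admin/"))   # False
```

A missing file can be treated as the empty string here, which is why the first case also reports `True`.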

Best practice is not to mislead the search engines: if your sitemap.xml file contains URLs explicitly blocked by your robots.txt, this is a contradiction. This can often happen if your robots.txt and/or sitemap.xml files are generated by different automated tools and not checked manually afterwards.

It’s easy to check for this using Google Webmaster Tools. To do so, you will need to have added your site to Google Webmaster Tools, verified it, and submitted an XML sitemap for it. You can then see a report on the crawling of the URLs submitted via the sitemap in the Optimization > Sitemaps section of Google Webmaster Tools.
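The same contradiction can also be looked for with a short script before anything is submitted. The sketch below is an assumption-laden illustration: the sample robots.txt, the sample sitemap and the helper names are made up for the example, and it relies on the standard-library parser, which only does prefix matching.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample files, standing in for the real ones.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.yoursite.com/</loc></url>
  <url><loc>https://www.yoursite.com/private/report.html</loc></url>
</urlset>
"""


def sitemap_urls(sitemap_xml: str) -> List[str]:
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


def blocked_sitemap_urls(robots_txt: str, sitemap_xml: str,
                         user_agent: str = "*") -> List[str]:
    """List sitemap URLs that robots.txt tells the given bot not to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls(sitemap_xml)
            if not parser.can_fetch(user_agent, url)]


conflicts = blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML)
print(conflicts)
```

Any URL that ends up in `conflicts` is being advertised in the sitemap and blocked in robots.txt at the same time, so one of the two files needs correcting.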

Blocking access to sensitive areas with robots.txt

If there are areas of your site that need to be blocked, password protect them. DO NOT do this with robots.txt.

Remember that robots.txt is a recommendation, not a mandatory set of rules, which means that anyone not following the protocol, including rogue bots, could still access that area. The best rule of thumb is: if it needs to be 100% private, it is best not to put it online! The SEO community has even been known to discover Google projects that had not yet been released to the public by looking at Google’s robots.txt.

There is a lot to consider with robots.txt, and as it is a key part of your optimisation, we hope this is useful when deciding on the best course of action. There is still some fun to be had with robots.txt, however, and for a more light-hearted look at this, the link below provides some great reading!

https://www.link-assistant.com/blog/10-robots-txt-files-worth-to-have-a-look-at/


Hannah Pennington

Client Services Manager

Artistic Hannah loves spending her time either creating something or giving back and each year she commits to getting hands-on with a different charity. It’s handy then that she wants to be an Octopus because she could use all those extra arms and skills to do all of this at once.
