You might be surprised to hear that one small text file, known as robots.txt, could be the downfall of your website. If you get the file wrong, you could end up telling search engine robots not to crawl your site, meaning your web pages won’t appear in the search results. Therefore, it’s important that you understand the purpose of a robots.txt file and learn how to check that you’re using it correctly.
There is often disagreement about what should and shouldn’t be put in robots.txt files, so one point is worth stressing up front: robots.txt is not meant to deal with security issues for your website. It is designed to act as a guide for web robots, and not all of them will abide by your instructions, so I’d recommend that the location of any admin or private pages on your site isn’t included in the robots.txt file. If you want to securely prevent robots from accessing private content on your website, you need to password protect the area where it is stored.
What is a robots.txt file?
A robots.txt file gives instructions to web robots about the pages the website owner doesn’t wish to be ‘crawled’. For instance, if you didn’t want your images to be listed by Google and other search engines, you’d block them using your robots.txt file.
You can check whether your website has a robots.txt file by adding /robots.txt immediately after your domain name in the browser’s address bar. The URL you enter should look like this: http://www.examplewebsite.com/robots.txt (with your own domain name instead, obviously!).
How does it work?
Before a search engine crawls your site, it will look at your robots.txt file for instructions on which pages it is allowed to crawl (visit) and index (save) in the search engine results. Robots.txt files are useful:
- If you want search engines to ignore any duplicate pages on your website
- If you don’t want search engines to index your internal search results pages
- If you don’t want search engines to index certain areas of your website or a whole website
- If you don’t want search engines to index certain files on your website (images, PDFs, etc.)
- If you want to tell search engines where your sitemap is located
How to create a robots.txt file
If you’ve found that you don’t currently have a robots.txt file, I’d advise you to create one as soon as possible. You will need to:
- Create a new text file and save it with the exact name “robots.txt” – you can use the Notepad program on Windows PCs or TextEdit for Macs, then “Save As” a plain text file.
- Upload it to the root directory of your website – this is usually a root level folder called “htdocs” or “www” which makes it appear directly after your domain name.
- If you use subdomains, you’ll need to create a robots.txt file for each subdomain.
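If you prefer to script it, the steps above can be sketched in a few lines of Python. This is only an illustration: the sitemap URL is a placeholder for your own, and you would still need to upload the resulting file to your site’s root directory.

```python
# A minimal sketch of the creation step: build a starter robots.txt
# that allows everything and points at a (placeholder) sitemap URL.
starter = "\n".join([
    "User-agent: *",
    "Allow: /",
    "",
    "# Sitemap Reference",
    "Sitemap: https://www.example.com/sitemap.xml",
    "",
])

# Save it as plain text named exactly "robots.txt"; upload it to the
# root of the site so it resolves at yourdomain.com/robots.txt.
with open("robots.txt", "w", encoding="utf-8") as f:
    f.write(starter)
```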
What to include in your robots.txt file
Let’s look at different examples of how you may want to use the robots.txt file:
Allow everything and submit the sitemap – This is the best option for most websites. It allows all search engines to fully crawl the website and index all its data, and it also shows the search engines where the XML sitemap is located so they can find new pages very quickly:
User-agent: *
Allow: /
# Sitemap Reference
Sitemap: http://www.example.com/sitemap.xml
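You can sanity-check a file like this with Python’s built-in urllib.robotparser before uploading it. The sketch below feeds the directives to the parser and confirms that any URL may be crawled and that the sitemap reference is picked up (site_maps() needs Python 3.8+; example.com stands in for your own domain).

```python
from urllib import robotparser

# Sketch: parse the "allow everything + sitemap" configuration and
# confirm that any URL on the (placeholder) domain may be crawled.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /",
    "Sitemap: http://www.example.com/sitemap.xml",
])

print(rp.can_fetch("*", "http://www.example.com/any-page.html"))  # True
print(rp.site_maps())  # ['http://www.example.com/sitemap.xml']
```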
Allow everything apart from one sub-directory – Sometimes you may have an area of your website that you don’t want to show in the search engine results, for example a checkout area, image files, an irrelevant part of a forum or an adult section of a website, as shown below. Any URL whose path starts with a disallowed path will be excluded by the search engines:
User-agent: *
Allow: /
# Disallowed Sub-Directories
Disallow: /checkout/
Disallow: /website-images/
Disallow: /forum/off-topic/
Disallow: /adult-chat/
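You can test sub-directory rules like these with Python’s built-in urllib.robotparser. One caveat for this sketch: Python’s parser applies rules in file order rather than Google’s longest-match rule, so the blanket Allow: / line is left out below; anything not disallowed is allowed by default anyway.

```python
from urllib import robotparser

# Sketch: check the sub-directory rules above against sample URLs.
# "Allow: /" is omitted because Python's parser is first-match and
# unlisted paths are allowed by default.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /checkout/",
    "Disallow: /website-images/",
    "Disallow: /forum/off-topic/",
    "Disallow: /adult-chat/",
])

print(rp.can_fetch("*", "http://www.example.com/checkout/basket"))  # False
print(rp.can_fetch("*", "http://www.example.com/forum/general/"))   # True
```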
Allow everything apart from certain files – Sometimes you may want to show media on your website or provide documents, but don’t want them to appear in image search results, social network previews or document search engine listings. Files you may wish to block could be animated GIFs, PDF instruction manuals or any development PHP files, as shown below:
User-agent: *
Allow: /
# Disallowed File Types
Disallow: /*.gif$
Disallow: /*.pdf$
Disallow: /*.PDF$
Disallow: /*.php$
Allow everything apart from certain webpages – Some webpages on your website may not be suitable to show in search engine results, and you can block individual pages using the robots.txt file too. Webpages that you may wish to block could be your terms and conditions page, a page which you need to remove quickly for legal reasons, or a page with sensitive information on it which you don’t want to be searchable (remember that people can still read your robots.txt file, and the pages will still be visited by unscrupulous crawler bots):
User-agent: *
Allow: /
# Disallowed Web Pages
Disallow: /terms.html
Disallow: /blog/how-to-blow-up-the-moon
Disallow: /secret-list-of-contacts.php
Allow everything apart from certain patterns of URLs – Lastly, you may have an awkward pattern of URLs which you wish to disallow, ones which can’t be neatly grouped into a single sub-directory. Examples of URL patterns you may wish to block might be internal search result pages, leftover test pages from development, or the 2nd, 3rd, 4th etc. pages of an ecommerce category:
User-agent: *
Allow: /
# Disallowed URL Patterns
Disallow: /*search=
Disallow: /*_test.php$
Disallow: /*?page=*
Putting it all together
Clearly you may wish to use a combination of these methods to block off different areas of your website. The key things to remember are:
- If you disallow a sub-directory, then ANY file, sub-directory or webpage within that URL pattern will be disallowed
- The star symbol (*) substitutes for any character or number of characters
- The dollar symbol ($) signifies the end of the URL; without using it when blocking file extensions, you may block a huge number of URLs by accident
- URL matching is case sensitive, so you may have to put in both caps and non-caps versions to capture everything
- It can take search engines several days to a few weeks to notice a disallowed URL and remove it from their index
- The “User-agent” setting allows you to block certain crawler bots or treat them differently if needed; a named user agent replaces the catch-all star symbol (*)
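The star and dollar behaviour can be made concrete with a small matcher. This is only a sketch of how Google-style crawlers interpret the wildcards (Python’s own urllib.robotparser ignores them), and the helper names are my own; the patterns come from the examples above.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt Disallow pattern into a regex:
    '*' matches any run of characters, a trailing '$' anchors
    the end of the URL, everything else is literal."""
    regex = re.escape(rule).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"   # restore the end-of-URL anchor
    return re.compile(regex)       # .match() anchors at the start of the path

def is_blocked(path, rules):
    """True if any Disallow pattern matches the start of the path."""
    return any(rule_to_regex(r).match(path) for r in rules)

rules = ["/*.gif$", "/*search=", "/*?page=*"]
print(is_blocked("/photos/cat.gif", rules))      # True  ($ matches the end)
print(is_blocked("/photos/cat.gif?v=2", rules))  # False ($ stops the match)
print(is_blocked("/shop?page=3", rules))         # True
```

Note how the second example shows why the dollar symbol matters: without it, a rule meant for .gif files would also catch URLs that merely contain “.gif” somewhere in the middle.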
Google have put together a ‘fishy’ looking overview of what’s blocked and what’s not blocked on their in-depth robots.txt documentation page.
What not to include in your robots.txt file
Occasionally, a website has a robots.txt file which includes the following command:
User-agent: *
Disallow: /
This is telling all bots to ignore THE WHOLE domain, meaning none of that website’s pages or files would be listed at all by the search engines!
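You can see the lock-out effect with Python’s built-in urllib.robotparser; in this sketch (using a placeholder example.com domain), every URL on the site is refused:

```python
from urllib import robotparser

# Sketch: a "Disallow: /" file refuses every path on the site.
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

print(rp.can_fetch("*", "http://www.example.com/"))          # False
print(rp.can_fetch("*", "http://www.example.com/any/page"))  # False
```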
The aforementioned example highlights the importance of properly implementing a robots.txt file, so be sure to check yours to ensure you’re not unknowingly restricting your chances of being indexed by search engines.
What happens if you have no robots.txt file?
Without a robots.txt file, search engines will have a free run to crawl and index anything they find on the website. This is fine for most websites, but it’s really good practice to at least point out where your XML sitemap is, so search engines can find new content without having to slowly crawl through all the pages on your website and bump into them days later.
NB – This is an updated version of Ben Wood‘s Robots.txt introductory post from 2012.