In my previous post on my other site, I wrote about Search Engine Optimization and explained why a robots.txt file matters for a website. Besides choosing a good domain name, how search engines crawl your site, and therefore how it shows up in their results, is influenced by your site’s robots.txt.
In this post, I will go further and discuss in detail what a robots.txt file is, what it really does on a website, how to create one and where to place it. I will even share a sample robots.txt file from one of my websites.
So, what is a robots.txt?
The brief answer:
It is simply a small document, written in plain text and placed in the top-level directory of a website, that tells crawlers how they should scan your site. Wait… some terminology there, so let me explain the terms one by one:
Top-level directory – The root folder where your website documents are placed. This is usually the same folder that contains the index.php or index.html file.
Crawlers – Programs, popularly known as robots, that traverse website pages looking for information to index.
Indexing – After robots have successfully scanned/crawled a site, they pass the information on to the search engine, which stores it so that pages can be shown on Search Engine Results Pages. This process is known as indexing, and it makes it easier for us to find what we are looking for on Google, for example.
Scanning/Crawling – The process of reading/traversing a website to collect information.
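Since the file always sits in the top-level directory, its address can be worked out from any page on the site. Here is a minimal Python sketch of that idea. The function name robots_url and the example.com address are my own placeholders, not part of any standard tool.

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    # The leading slash makes urljoin drop the page's own path, so the
    # result always points at the top-level directory of the site.
    return urljoin(page_url, "/robots.txt")


# Hypothetical page address, used only for illustration.
print(robots_url("https://example.com/blog/my-post.html"))
```

Whatever page you start from, the answer is the same file at the root of the site, which is exactly where a crawler goes looking for it.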
You have now seen the importance of this small file in your website. This is how it works:
When a robot first visits a website, it looks for the robots.txt file to find out what it is allowed to crawl. If it finds the file, it reads the rules inside, for example these two lines:
User-agent: *
Disallow: /
The code above blocks every robot from crawling the site. This is because it contains / after Disallow:, which tells all robots to stay out of every path on the site. If, for example, Googlebot visits a website and finds the code above, it will turn away without crawling, so the site’s content will not be picked up for search results. Keep in mind that this relies on robots choosing to obey the file, which well-behaved crawlers do.
Now look at this:
User-agent: *
Disallow:
Note that there is no / after Disallow: this time. It tells all robots that they are not only allowed in, but may also crawl every file on the site. As I said in my previous post, this is the part of the robots.txt that deserves the most care, because a single character decides between “crawl everything” and “crawl nothing”. Forget the final / and everything on your site, including pages you never meant to show up in search results, is open to crawling. Keep in mind that robots.txt only advises robots. It does not protect a site from hackers, so anything truly private needs real access control, such as a password.
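To see how much that one slash matters, here is a small sketch using Python’s built-in urllib.robotparser module. The page address and the helper name load_rules are made up for the example, and other crawlers may interpret edge cases differently from this module.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse robots.txt text and return a parser that can answer questions."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


BLOCK_ALL = "User-agent: *\nDisallow: /"   # slash present: nothing may be crawled
ALLOW_ALL = "User-agent: *\nDisallow:"     # slash missing: everything may be crawled

page = "https://example.com/about.html"    # hypothetical page

blocked_site = load_rules(BLOCK_ALL)
open_site = load_rules(ALLOW_ALL)

print(blocked_site.can_fetch("Googlebot", page))  # False
print(open_site.can_fetch("Googlebot", page))     # True
```

The two files differ by one character, yet the answers for the same page are opposite.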
The third example is:
User-agent: *
Disallow: /folder1/
To go back a little bit, it’s good to know that the robots.txt file is a very important but often overlooked element of a website, and a misconfigured one can be a reason why a site performs poorly in search results.
If robots come to your website and cannot find a robots.txt file, they will assume everything is okay and proceed to crawl the site, because that’s what they are programmed to do. So even without a robots.txt file, robots will still crawl whatever pages they can reach. Another point to note is that robots.txt is a publicly accessible file, readable by anyone who has your website URL. It should therefore NEVER be used to hide files!
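To show why hiding files this way backfires, here is a rough Python sketch that pulls every Disallow path out of a robots.txt. Anyone who downloads the file could do the same. The folder names in the sample are invented for illustration, and the function disallowed_paths is my own helper, not a library call.

```python
def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path listed in the robots.txt text."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments and spaces
        field, _, value = line.partition(":")     # split "Disallow: /x/" once
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file: the folder names are made up for this example.
sample = "User-agent: *\nDisallow: /private-reports/\nDisallow: /old-backups/"
print(disallowed_paths(sample))
```

In other words, listing a secret folder in robots.txt works more like a signpost than a lock.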
In the third example above, robots are being blocked from folder1. This is achieved by writing the folder’s path, starting from the top-level directory, after Disallow:.
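A Disallow path is matched against the beginning of each address, so the closing slash matters. The sketch below checks a few made-up paths against the folder1 rule with Python’s built-in urllib.robotparser. The paths and the example.com domain are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /folder1/"])


def allowed(path):
    """Ask the parser whether any robot may fetch this path."""
    # example.com is a placeholder domain.
    return parser.can_fetch("*", "https://example.com" + path)


for path in ["/folder1/page.html", "/folder10/page.html", "/index.html"]:
    print(path, allowed(path))
```

Only the first path begins with /folder1/, so only that one is blocked. A similarly named folder such as folder10 is left alone because of the closing slash in the rule.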
However, there is another instruction, written as Allow. It is used to allow access to certain files contained in a folder/directory that has been blocked. It sounds confusing, but an example is the best way to understand it.
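Look at this robots.txt file:
User-agent: *
Allow: /folder1/file3
Let’s assume folder1 is a directory containing file1, file2 and file3. File1 and file2 contain important/private information that I don’t want robots to crawl. However, I want robots to access file3. On its own, the Allow line above changes nothing, because nothing has been blocked yet. I will now go ahead and combine it with Disallow as follows:
User-agent: *
Disallow: /folder1/
Allow: /folder1/file3
Crawlers that follow the current robots.txt standard apply the most specific (longest) matching rule, so file3 stays open even though the rest of folder1 is blocked. Note that there is no trailing / after file3, because file3 is a file and not a folder.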
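Here is a simplified Python sketch of how a crawler could choose between a Disallow and an Allow that both match: the longest matching path wins, and Allow wins a tie. I wrote the matching by hand because Python’s built-in urllib.robotparser applies the first matching rule instead. The function is_allowed, the RULES list and the paths are assumptions for the example, and wildcards are ignored.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, prefix) pairs, for example
    ("disallow", "/folder1/"). The longest matching prefix wins and
    Allow wins a tie. A path that matches no rule is allowed.
    """
    best_length = -1
    allowed = True
    for directive, prefix in rules:
        # An empty prefix blocks nothing, and non-matching rules are skipped.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and is_allow
        if longer or tie_for_allow:
            best_length = len(prefix)
            allowed = is_allow
    return allowed


# Rules written with a folder path and a file path, as an assumption.
RULES = [("disallow", "/folder1/"), ("allow", "/folder1/file3")]

for path in ["/folder1/file1", "/folder1/file3", "/index.html"]:
    print(path, is_allowed(path, RULES))
```

With these rules, file1 is blocked by the folder rule, file3 is rescued by the longer Allow rule, and anything outside folder1 is untouched.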
So far we have established that robots.txt is a crucial file that every website should have, though its presence doesn’t influence how your site functions for visitors. Here is a sample robots.txt from one of my WordPress websites:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /cgi-bin/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /category/*/*
Disallow: /trackback/
Disallow: /comments/
Disallow: /Tags/
Disallow: /feed/
Disallow: */feed/
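Some lines in this sample use * as a wildcard meaning “any characters”, which major crawlers such as Googlebot understand. Below is a rough Python sketch of how such a rule could be compared with a path by turning it into a regular expression. The helper rule_matches and the test paths are my own assumptions, and real crawlers may treat unusual patterns differently.

```python
import re


def rule_matches(rule, path):
    """Return True if a robots.txt rule (with optional *) matches the path."""
    # Escape everything, then turn each escaped * back into "any characters".
    pattern = re.escape(rule).replace(r"\*", ".*")
    # re.match anchors at the start, so the rule works as a prefix.
    return re.match(pattern, path) is not None


# Hypothetical WordPress-style paths, for illustration only.
print(rule_matches("/wp-admin/", "/wp-admin/options.php"))
print(rule_matches("/category/*/*", "/category/news/2020/"))
print(rule_matches("/category/*/*", "/category/news"))
print(rule_matches("*/feed/", "/my-post/feed/"))
```

The third check is the only one that does not match, because /category/*/* needs a second slash after the category name.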
That’s all I had about the robots.txt file; I hope it helps. If you have any questions, post them in the comments section below. I am always happy to help by defining technology in everyone’s language…
Love you all.