In this article, we are going to talk in depth about the robots.txt file.
We will not only see what it is and how to use its directives, but also cases where we can use it to improve the positioning of our website.
Understanding the particularities and operation of this file is vitally important, because an error in it can cause our site to stop being crawled or indexed by search engines, with the consequent loss of organic traffic.
This file matters most on very large websites, with many thousands of URLs, where robots may have difficulty crawling all of the content.
For small sites, we may not need to make major modifications, because crawlers have no trouble covering the entire site.
What is the Robots.txt File?
The robots.txt is a file that indicates which parts of our website can be accessed by search engine robots or crawlers.
Thanks to it, we can allow or block the crawling of certain content on our website.
By default, unless otherwise indicated in the robots.txt, all the content on our website will be crawlable by search engines.
Our goal as SEOs is to optimize the crawl budget, that is, to prioritize crawling toward content that is new, updated periodically, or relevant.
Clarification: although Google states in its Webmaster guidelines that it follows the robots.txt to the letter, other robots may not do so.
Likewise, in certain cases, such as Google Translate, Google ignores this file.
Since explaining how this file works in theory may seem more difficult than it actually is, I’ll include examples throughout the content.
A robots.txt file can look like this:
User-agent: *
Disallow: /temporary-offers/
Disallow: /browser/
Allow: /browser/find-course
Sitemap: https://domain.com/sitemap.xml
We can access the robots.txt of any website by entering /robots.txt after the domain, for example: https://domain.com/robots.txt
Difference Between Crawling and Indexing
Before we go any further into the specifics of this file, I’d like to talk about the difference between crawling and indexing, as there seems to be a lot of confusion around this.
- We talk about crawling when the robot accesses the content of the website.
- We talk about indexing when that content is added to the search engine’s index.
Generally, the content is first crawled and then indexed, although there are exceptions that we will see later in this article.
The standard process that Google follows to index content is:
- Detect new content through links.
- Crawl and analyze that content.
- Add it to the index.
It is possible that content blocked by robots.txt will be indexed if it receives links and Google considers it relevant.
If we want to prevent content from being indexed, we must use the “noindex” meta tag inside the <head> as follows:
<meta name="robots" content="noindex">
This way, the content will not be shown in the search results, although robots can still crawl it.
If we want robots not to access certain content, we must use the robots.txt.
Using noindex and blocking that same content in robots.txt can be counterproductive and is discouraged, and there is a reason for this.
To read the noindex tag, Google must access (crawl) that content.
If we block access to it in robots.txt, Google will not be able to read that tag.
Therefore, there is a possibility that the content will end up being indexed if it receives links.
When content blocked in robots.txt ends up indexed, the result usually shows a description along the lines of “No information is available for this page”, because Google has not been able to crawl that content to see its tags.
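To illustrate the conflict, imagine a page under /private-offers/ (a hypothetical path) whose HTML contains the noindex tag:

<meta name="robots" content="noindex">

while the robots.txt blocks its directory:

User-agent: *
Disallow: /private-offers/

Since the robot cannot crawl anything under /private-offers/, it will never see the noindex tag, and the URL can still end up indexed if it receives links.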
Characteristics of the Robots.txt
Now that it is clear that robots.txt is used to manage the crawling of our website, but not its indexing, let’s talk about its characteristics.
It must be located in the root of the website and is a plain-text file encoded in UTF-8.
The file must always be named “robots.txt”, and there must be only one per website.
If we have several subdomains, each one can have its own robots.txt, whose rules will apply to it.
A robots.txt hosted on the subdomain cars.domain.com will not apply to a URL belonging to the subdomain motorcycles.domain.com.
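For example, using the subdomains above, each one would serve its own file at its own root:

https://cars.domain.com/robots.txt (applies only to URLs on cars.domain.com)
https://motorcycles.domain.com/robots.txt (applies only to URLs on motorcycles.domain.com)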
The robots.txt is made up of groups (each starting with a User-agent line), and each group contains a series of directives indicating whether or not a robot can access certain content.
Therefore, each group must have at least one Allow or Disallow directive.
Within this file, we can add comments with the hash symbol (#).
These are useful for clarifying why we block or allow access to certain content, so that in the future we can understand everything at a glance.
Their purpose is similar to comments in CSS or HTML.
Example of Robots.txt with Comments:
User-agent: *
Disallow: /find/
# I don’t want it to crawl the content that hangs under the /find/ directory
Generally, the path to the website’s sitemap.xml is indicated at the end of the file.
How to Create it? Directives and Other Commands
We now move on to the important part: learning how to create the file and its directives.
We can use any plain-text editor, even Notepad, to create it, always keeping in mind that it must not contain formatting and that the extension must be .txt.
As I discussed above, each group starts with User-agent:
Here we indicate which robots (user agents) the directives below will apply to.
We may not want a certain bot to access part of the website while still allowing Google to do so.
If we add an asterisk (*), it implies that the directives will apply to all robots.
User-agent: *
In this case, the directives will only affect the Google image crawling robot.
User-agent: Googlebot-Image
Next, we indicate the directives to apply, of which there are two:
- Disallow: to indicate directories or pages that should not be crawled
- Allow: to indicate directories or pages that can be accessed
The Allow directive overrides the Disallow.
It is used when we want to allow the crawling of certain content that belongs to a directory whose crawling is blocked.
User-agent: *
Disallow: /offers/
Allow: /offers/product-1
In this case, we have an online store that does not want robots to access the offers category, but does want a specific product within it to be crawled.
To block a specific URL, we must indicate it as it is displayed in the browser.
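For example, this would block crawling of a single product URL only (the path is hypothetical):

User-agent: *
Disallow: /offers/product-2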
If we add a trailing slash (/), we block access to the entire directory: robots will be denied access to any URL that hangs from it.
User-agent: *
Disallow: /offers/
Continuing with the previous example, in this case the robot will not be able to crawl any product that hangs from this category (such as /offers/product-1).
However, the category URL without the trailing slash (/offers) can still be crawled.
If we want it not to crawl the category URL itself or anything attached to it, we can use the asterisk (*).
The asterisk (*) acts as a wildcard and allows us to simplify robots.txt rules.
We can use it both as a prefix and as a suffix.
User-agent: *
Disallow: /offer*
Here we are stating that no URL starting with /offer can be crawled.
Note: the URL /offers, and any content hanging from it, would also be blocked, since we are blocking crawling for all URLs beginning with “/offer”.
If we want to block a file type or URL ending, we add the dollar symbol ($) after that ending.
User-agent: *
Disallow: /*.pdf$
This way, robots will not be able to access any PDF file hosted on our website.
We can also use it with images or any other extension.
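For example, to prevent crawling of a given image format (here .jpg, purely as an illustration):

User-agent: *
Disallow: /*.jpg$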
It is common to indicate the path of the sitemap.xml in the robots.txt.
It can be placed anywhere in the document, although it is usually added at the end.
The full URL must be used, and if there is more than one sitemap, several can be listed.
Sitemap: https://domain.com/sitemap1.xml
Sitemap: https://domain.com/sitemap2.xml
When Is It Advisable to Allow Crawling and When Not? Allow vs Disallow
We may think we want Google to crawl all the content on our website.
However, because of the crawl budget, this is not always the case, above all on very large sites.
On websites with hundreds or a few thousand URLs, we should not worry too much about it.
In very large e-commerce or news sites with a high frequency of updates and new content, crawling can become a serious problem.
Take the case of a large e-commerce site that uploads 100 new products every day.
The amount of content that Google crawls per day from that site is limited.
If the robot “wastes time” on content we are not interested in, these new products may not be crawled, and as far as Google is concerned, it is as if we had never uploaded them to the web.
They do not exist!
Our work as SEOs is failing, because there is content we want to rank that is not even being crawled.
We have a crawling problem.
Crawling problems usually occur with URLs generated by filters, user-generated content, internal search results, etc.
Every site is different, and we will need to analyze its peculiarities to establish a robots.txt that prioritizes crawling toward relevant and new content.
We will try to ensure that duplicate or low-value content is not crawled in order to optimize our crawl budget.
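As a reference, a robots.txt along these lines could keep crawlers away from internal search results and filtered listings; the paths and parameters here are hypothetical and must be adapted to how each site generates its URLs:

User-agent: *
Disallow: /search/
Disallow: /*?filter=
Disallow: /*?order=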
We must also be careful about which parts of the website we block from crawling.
We may think that the pagination pages of a blog or e-commerce should not be crawled but, like almost everything in SEO, the answer is: it depends.
On some sites where the structure is not well planned and internal links are scarce, pagination may be the only way for bots to reach older content.
If we block that only path, search engines may not see the changes we make to that content, because it will take too long for them to crawl it again.
There may often be basic errors because the website was not built with SEO in mind, but we must be careful with the changes we make, because the remedy can be worse than the disease.
My advice before you start blocking parts of the website is to first understand how that website works and why.
The next step is to analyze the server logs.
This is where we can actually see which parts of the website the bots have accessed and when.
This way we can detect parts of our website with no SEO value that are being heavily crawled, or important parts that the robots never reach.
We can also see whether the robots waste time crawling content that no longer exists, or redirects.
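As a starting point, a minimal sketch like the one below counts which paths Googlebot requests most often. It assumes a combined-format log file named access.log (the file name and format are assumptions, and a simple user-agent match does not verify that the requests really come from Google):

from collections import Counter

hits = Counter()
with open("access.log", encoding="utf-8", errors="ignore") as log:
    for line in log:
        if "Googlebot" not in line:  # keep only lines whose user-agent mentions Googlebot
            continue
        parts = line.split('"')      # in combined format, the request is the first quoted field
        if len(parts) < 2:
            continue
        request = parts[1].split()   # e.g. ['GET', '/offers/', 'HTTP/1.1']
        if len(request) >= 2:
            hits[request[1]] += 1    # count the requested path

for path, count in hits.most_common(20):
    print(count, path)

Dedicated log analysis tools go much further, but even a rough count like this often reveals sections that are crawled far more than they deserve.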
How to Check it From Search Console?
From Search Console, we can also detect errors in our robots.txt or test if a certain URL is allowed or denied access.
You can access it through this link; the verification process is manual.
We must enter each URL for which we want to check whether access is allowed or denied for a given Google bot.
The positive side of this tool is that we can carry out live tests to see how changes would affect the robots.
Clarification: changes made in the tool are not applied to the actual file; if we want to keep them, we then have to modify the file uploaded to our website.
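If we prefer to run quick checks outside of Search Console, Python’s standard library includes a robots.txt parser. A minimal sketch, using the hypothetical domain and paths from the earlier examples:

from urllib.robotparser import RobotFileParser

# Download and parse the live robots.txt
parser = RobotFileParser("https://domain.com/robots.txt")
parser.read()

# Test whether a given user agent may fetch each URL
print(parser.can_fetch("Googlebot", "https://domain.com/offers/"))           # expected False if blocked
print(parser.can_fetch("Googlebot", "https://domain.com/offers/product-1"))  # expected True if allowed

Keep in mind this only interprets the rules; how Google actually applies them is best confirmed in Search Console itself.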
Final Conclusion
As you can see, robots.txt is a very important file because it affects a vital part of SEO: crawling.
Understanding the needs of our website and optimizing this file accordingly will be of vital importance to keep our site in good health.
We can rely on the tools seen in this article whenever we have doubts about whether we are managing this file correctly.
Thanks for Reading.