How Google Interprets robots.txt Definitions #GoogleSeoGuide #s1ep36
Google's automated crawlers support the Robots Exclusion Protocol (REP). This means that before crawling a site, Google's crawlers download and parse the site's robots.txt file to extract information about which parts of the site may be crawled. The REP isn't applicable to Google's crawlers that are controlled by users (for example, feed subscriptions), or crawlers that are used to increase user safety (for example, malware analysis).
This page describes Google's interpretation of the REP. For the original standard, check RFC 9309.
What is a robots.txt file
If you don't want crawlers to access sections of your site, you can create a robots.txt file with appropriate rules. A robots.txt file is a simple text file containing rules about which crawlers may access which parts of a site. For example, the robots.txt file for example.com may look like this:
# This robots.txt file controls crawling of URLs under https://example.com.
# All crawlers are disallowed to crawl files in the "includes" directory, such
# as .css, .js, but Google needs them for rendering, so Googlebot is allowed
# to crawl them.
User-agent: *
Disallow: /includes/
User-agent: Googlebot
Allow: /includes/
Sitemap: https://example.com/sitemap.xml
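As a quick illustration (not part of Google's documentation), the effect of this file can be checked with Python's built-in urllib.robotparser, keeping in mind that it only approximates Google's precedence rules:

from urllib import robotparser

rules = """\
User-agent: *
Disallow: /includes/

User-agent: Googlebot
Allow: /includes/

Sitemap: https://example.com/sitemap.xml
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own group, which allows /includes/.
print(parser.can_fetch("Googlebot", "https://example.com/includes/style.css"))      # True
# Any other crawler falls back to the * group, which disallows /includes/.
print(parser.can_fetch("SomeOtherBot", "https://example.com/includes/style.css"))   # False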
File location and range of validity
You must place the robots.txt file in the top-level directory of a site, on a supported protocol.
The URL for the robots.txt file is (like other URLs) case-sensitive. In the case of Google Search, the supported protocols are HTTP, HTTPS, and FTP. On HTTP and HTTPS, crawlers fetch the robots.txt file with an HTTP non-conditional GET request; on FTP, crawlers use a standard RETR (RETRIEVE) command, using anonymous login.
The rules listed in the robots.txt file apply only to the host, protocol, and port number where the robots.txt file is hosted.
Examples of valid robots.txt URLs
The following are examples of robots.txt URLs and the URL paths they're valid for.
https://example.com/robots.txt
This is the general case. It's not valid for other subdomains, protocols, or port numbers. It's valid for all files in all subdirectories on the same host, protocol, and port number.
Valid for: https://example.com/ (and any file in any subdirectory, such as https://example.com/folder/file)
Not valid for: https://other.example.com/, http://example.com/, https://example.com:8181/
https://www.example.com/robots.txt
A robots.txt file on a subdomain (in this example, the www subdomain) is only valid for that subdomain.
Valid for: https://www.example.com/
Not valid for: https://example.com/ or https://shop.www.example.com/
https://example.com/folder/robots.txt
Not a valid robots.txt file. Crawlers don't check for robots.txt files in subdirectories.
ftp://example.com/robots.txt
Valid for: ftp://example.com/
Not valid for: https://example.com/
https://212.96.82.21/robots.txt
A robots.txt with an IP-address as the host name is only valid for crawling of that IP address as host name. It isn't automatically valid for all websites hosted on that IP address (though it's possible that the robots.txt file is shared, in which case it would also be available under the shared host name).
Valid for: https://212.96.82.21/
Not valid for: https://example.com/ (even if hosted on 212.96.82.21)
https://example.com:80/robots.txt
Standard port numbers (80 for HTTP, 443 for HTTPS, 21 for FTP) are equivalent to their default host names.
Valid for: https://example.com:80/ and https://example.com/
Not valid for: https://example.com:81/
https://example.com:8181/robots.txt
Robots.txt files on non-standard port numbers are only valid for content made available through those port numbers.
Valid for: https://example.com:8181/
Not valid for: https://example.com/
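To make the range of validity concrete, here is a small sketch (not from Google's documentation; the helper names are invented) that derives the robots.txt URL for a page and checks whether two URLs fall under the same robots.txt file:

from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def robots_txt_url(url: str) -> str:
    # The robots.txt file sits at the top-level directory of the same
    # protocol, host, and port as the given URL.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

def same_scope(url_a: str, url_b: str) -> bool:
    # True if both URLs are governed by the same robots.txt file:
    # same protocol, same host, and same (possibly default) port.
    a, b = urlsplit(url_a), urlsplit(url_b)
    port_a = a.port or DEFAULT_PORTS.get(a.scheme)
    port_b = b.port or DEFAULT_PORTS.get(b.scheme)
    return (a.scheme, a.hostname, port_a) == (b.scheme, b.hostname, port_b)

print(robots_txt_url("https://example.com/folder/page"))                    # https://example.com/robots.txt
print(same_scope("https://example.com:443/", "https://example.com/"))       # True  (default HTTPS port)
print(same_scope("https://example.com/", "https://www.example.com/page"))   # False (different host)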
Handling of errors and HTTP status codes
When requesting a robots.txt file, the HTTP status code of the server's response affects how the robots.txt file will be used by Google's crawlers. In summary, Googlebot treats robots.txt files as follows for the different HTTP status code classes:
2xx (success): Google processes the robots.txt file as served.
3xx (redirection): Google follows at least five redirect hops and then treats the robots.txt file as a 404.
4xx (client errors), except 429: Google treats the site as if no valid robots.txt file exists, which means crawling is unrestricted.
5xx (server errors) and 429: Google temporarily treats the site as completely disallowed; if the robots.txt file remains unreachable for an extended period (more than 30 days), Google uses the last cached copy or, if none is available, assumes there are no crawl restrictions.
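As a rough illustration of this behavior (a sketch based on the summary above, not Google's actual implementation; the function name is made up):

def robots_policy(status_code: int) -> str:
    # Map the status code of a robots.txt fetch to a crawl policy,
    # following the summary above.
    if 200 <= status_code < 300:
        return "parse"               # use the returned rules
    if 300 <= status_code < 400:
        return "follow_redirect"     # follow a limited number of redirect hops
    if status_code == 429 or status_code >= 500:
        return "assume_disallowed"   # temporary server problem: treat the site as fully disallowed
    if 400 <= status_code < 500:
        return "assume_allowed"      # no usable robots.txt: crawling is unrestricted
    return "assume_disallowed"       # anything unexpected: be conservative

print(robots_policy(200), robots_policy(404), robots_policy(503))
# parse assume_allowed assume_disallowed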
Caching
Google generally caches the contents of a robots.txt file for up to 24 hours, but may cache it longer in situations where refreshing the cached version isn't possible (for example, due to timeouts or 5xx errors). The cached response may be shared by different crawlers. Google may increase or decrease the cache lifetime based on max-age Cache-Control HTTP headers.
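To illustrate how such a header could influence cache lifetime, here is a minimal sketch (not Google's implementation; the function name is invented, and the 24-hour default comes from the paragraph above):

import re
from typing import Optional

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # the 24-hour default described above

def cache_ttl(cache_control_header: Optional[str]) -> int:
    # Use the max-age directive if present, otherwise fall back to the default.
    if not cache_control_header:
        return DEFAULT_TTL_SECONDS
    match = re.search(r"max-age\s*=\s*(\d+)", cache_control_header, re.IGNORECASE)
    return int(match.group(1)) if match else DEFAULT_TTL_SECONDS

print(cache_ttl("public, max-age=3600"))  # 3600
print(cache_ttl(None))                    # 86400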
File format
The robots.txt file must be a UTF-8 encoded plain text file and the lines must be separated by CR, CR/LF, or LF.
Google ignores invalid lines in robots.txt files, including the Unicode Byte Order Mark (BOM) at the beginning of the robots.txt file, and uses only valid lines. For example, if the content downloaded is HTML instead of robots.txt rules, Google will try to parse the content and extract rules, and ignore everything else.
Similarly, if the character encoding of the robots.txt file isn't UTF-8, Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.
Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). Content after the maximum file size is ignored. You can reduce the size of the robots.txt file by consolidating rules that would result in an oversized robots.txt file. For example, place excluded material in a separate directory.
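Putting these constraints together, a crawler-side preprocessing step might look roughly like this (an illustrative sketch, not Google's parser; the constant and function names are invented):

MAX_ROBOTS_BYTES = 500 * 1024  # the 500 KiB limit mentioned above

def robots_lines(raw: bytes) -> list:
    # Keep at most 500 KiB, decode as UTF-8 while skipping invalid bytes,
    # drop a leading byte order mark, and split on CR, CR/LF, or LF.
    raw = raw[:MAX_ROBOTS_BYTES]
    text = raw.decode("utf-8", errors="ignore")
    text = text.lstrip("\ufeff")
    return text.splitlines()

sample = b"\xef\xbb\xbfUser-agent: *\r\nDisallow: /private/\n"
print(robots_lines(sample))  # ['User-agent: *', 'Disallow: /private/']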
Syntax and Rules
Valid robots.txt lines consist of a field, a colon, and a value. Spaces are optional but recommended to improve readability. Space at the beginning and at the end of the line is ignored. To include comments, precede your comment with the # character. Keep in mind that everything after the # character will be ignored. The general format is <field>:<value><#optional-comment>.
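For example, a single line could be parsed into a field and value like this (a minimal sketch of the format described above; the function name is illustrative):

def parse_robots_line(line: str):
    # Returns (field, value) for a valid line, or None for blank,
    # comment-only, or invalid lines.
    line = line.split("#", 1)[0].strip()   # everything after '#' is ignored
    if not line or ":" not in line:
        return None
    field, value = line.split(":", 1)      # split only on the first colon
    return field.strip().lower(), value.strip()

print(parse_robots_line("Disallow: /includes/  # keep crawlers out"))  # ('disallow', '/includes/')
print(parse_robots_line("# just a comment"))                           # None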
Google supports the following fields:
user-agent: identifies which crawler the rules apply to.
allow: a URL path that may be crawled.
disallow: a URL path that must not be crawled.
sitemap: the complete URL of a sitemap.
The allow and disallow fields are also called rules (also known as directives). These rules are always specified in the form of rule: [path] where [path] is optional. By default, there are no restrictions for crawling for the designated crawlers. Crawlers ignore rules without a [path].
The [path] value, if specified, is relative to the root of the website from where the robots.txt file was fetched (using the same protocol, port number, host and domain names). The path value must start with / to designate the root and the value is case-sensitive. Learn more about URL matching based on path values.
Grouping of lines and rules
You can group together rules that apply to multiple user agents by repeating?user-agent?lines for each crawler.
Order of precedence for user agents
Only one group is valid for a particular crawler. Google's crawlers determine the correct group of rules by finding in the robots.txt file the group with the most specific user agent that matches the crawler's user agent. Other groups are ignored. All non-matching text is ignored (for example, both googlebot/1.2 and googlebot* are equivalent to googlebot). The order of the groups within the robots.txt file is irrelevant.
If there's more than one specific group declared for a user agent, all the rules from the groups applicable to the specific user agent are combined internally into a single group. User agent specific groups and global groups (*) are not combined.
Examples
Matching of user-agent fields
user-agent: googlebot-news
(group 1)
user-agent: *
(group 2)
user-agent: googlebot
(group 3)
For this robots.txt file, Googlebot News follows group 1, because googlebot-news is the most specific matching user agent; Googlebot (web) follows group 3; and a crawler with no specific group of its own, such as Storebot-Google, follows the generic group 2.
Grouping of rules
1. If there are multiple groups in a robots.txt file that are relevant to a specific user agent, Google's crawlers internally merge the groups. For example:
user-agent: googlebot-news
disallow: /fish
user-agent: *
disallow: /carrots
user-agent: googlebot-news
disallow: /shrimp
The crawlers internally group the rules based on user agent, for example:
user-agent: googlebot-news
disallow: /fish
disallow: /shrimp
user-agent: *
disallow: /carrots
2. Rules other than allow, disallow, and user-agent are ignored by the robots.txt parser. This means that the robots.txt snippet below is treated as one group, and thus both user-agent a and b are affected by the disallow: / rule:
user-agent: a
sitemap: https://example.com/sitemap.xml
user-agent: b
disallow: /
When the crawlers process the robots.txt rules, they ignore the sitemap line. For example, this is how the crawlers would understand the previous robots.txt snippet:
user-agent: a
user-agent: b
disallow: /
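The merging and group-selection behavior described in this section can be sketched as follows (an illustrative approximation, not Google's implementation; the user agent matching is simplified to a prefix check and all names are made up):

from collections import defaultdict

def build_groups(parsed_lines):
    # parsed_lines: (field, value) pairs such as ('user-agent', 'googlebot').
    # Rules are merged per user agent token, so repeated groups for the same
    # agent end up combined, as in the example above.
    groups = defaultdict(list)
    current_agents = []
    expecting_more_agents = False
    for field, value in parsed_lines:
        if field == "user-agent":
            if not expecting_more_agents:
                current_agents = []                 # a new group starts here
            current_agents.append(value.lower())
            expecting_more_agents = True
        elif field in ("allow", "disallow"):
            expecting_more_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
        # other fields (such as sitemap) don't belong to any group
    return groups

def select_group(groups, crawler_token):
    # Pick the most specific matching user agent; '*' is only a fallback.
    token = crawler_token.lower()
    matches = [agent for agent in groups if agent != "*" and token.startswith(agent)]
    if matches:
        return groups[max(matches, key=len)]
    return groups.get("*", [])

parsed = [
    ("user-agent", "googlebot-news"), ("disallow", "/fish"),
    ("user-agent", "*"), ("disallow", "/carrots"),
    ("user-agent", "googlebot-news"), ("disallow", "/shrimp"),
]
groups = build_groups(parsed)
print(select_group(groups, "googlebot-news"))   # [('disallow', '/fish'), ('disallow', '/shrimp')]
print(select_group(groups, "otherbot"))         # [('disallow', '/carrots')] -- falls back to *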
URL matching based on path values
Google, Bing, and other major search engines support a limited form of wildcards for path values. These wildcard characters are:
* designates 0 or more instances of any valid character.
$ designates the end of the URL.
Example path matches
/fish matches /fish, /fish.html, and /fish/salmon.html, but not /Fish.asp or /catfish (the match is case-sensitive and anchored to the beginning of the path).
/fish/ matches anything in the /fish/ folder, such as /fish/salmon.htm, but not /fish or /fish.html.
/*.php matches any path that contains .php, such as /index.php or /folder/filename.php, but not /windows.PHP.
/*.php$ matches only paths that end with .php, such as /filename.php, but not /filename.php?parameters.
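One way to picture this matching is to translate a rule path into a regular expression (an illustrative sketch, not Google's matcher; the function name is invented):

import re

def path_pattern(rule_path: str):
    # '*' matches zero or more characters; a trailing '$' anchors the rule
    # to the end of the URL path. Matching is case-sensitive and anchored
    # to the start of the path.
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))

print(bool(path_pattern("/fish").match("/fish/salmon.html")))   # True
print(bool(path_pattern("/fish").match("/Fish.asp")))           # False (case-sensitive)
print(bool(path_pattern("/*.php$").match("/folder/file.php")))  # True
print(bool(path_pattern("/*.php$").match("/file.php?x=1")))     # False (must end with .php)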
Order of precedence for rules
When matching robots.txt rules to URLs,
1. crawlers use the most specific rule based on the length of the rule path;
2. in case of conflicting rules, including those with wildcards, Google uses the least restrictive rule.
The following examples demonstrate which rule Google's crawlers will apply on a given URL.
Sample situations
https://example.com/page
allow: /p
disallow: /
Applicable rule: allow: /p, because it's more specific.

https://example.com/folder/page
allow: /folder
disallow: /folder
Applicable rule: allow: /folder, because in case of conflicting rules, Google uses the least restrictive rule.

https://example.com/page.htm
allow: /page
disallow: /*.htm
Applicable rule: disallow: /*.htm, because the rule path is longer and it matches more characters in the URL, so it's more specific.

https://example.com/page.php5
allow: /page
disallow: /*.ph
Applicable rule: allow: /page, because in case of conflicting rules, Google uses the least restrictive rule.

https://example.com/
allow: /$
disallow: /
Applicable rule: allow: /$, because it's more specific.

https://example.com/page.htm
allow: /$
disallow: /
Applicable rule: disallow: /, because the allow rule only applies on the root URL.
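The precedence logic shown in these examples can be sketched as follows (an illustrative approximation that reproduces the outcomes above; the helper names are made up, and the path translation repeats the wildcard sketch from earlier):

import re

def path_pattern(rule_path: str):
    # Same wildcard translation as in the earlier sketch.
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))

def applicable_rule(rules, url_path):
    # rules: (kind, path) pairs, e.g. ('allow', '/p').
    # The longest matching rule path wins; on a tie, the least
    # restrictive rule (allow) is preferred.
    matching = [(kind, path) for kind, path in rules
                if path_pattern(path).match(url_path)]
    if not matching:
        return None                       # no rule matches: crawling is allowed by default
    longest = max(len(path) for _, path in matching)
    candidates = [(kind, path) for kind, path in matching if len(path) == longest]
    for kind, path in candidates:
        if kind == "allow":
            return kind, path
    return candidates[0]

print(applicable_rule([("allow", "/p"), ("disallow", "/")], "/page"))
# ('allow', '/p') -- the more specific rule
print(applicable_rule([("allow", "/folder"), ("disallow", "/folder")], "/folder/page"))
# ('allow', '/folder') -- conflicting rules, least restrictive wins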
To be continued...
Thank you for learning with us. This episode is based on the Fundamentals of SEO Starter Guide by Google. Remember, you can always follow all of the episodes of "Google SEO Guide" from the document below:
How Will We Do It?
We won’t care what anyone says, and SEO & Digital Marketing will change our lives!!!
What Will We Learn:
Click the link below and get the #LearningSEOsocially Calendar Season 1&2. Follow this plan and learn with me how to develop #SEO skills with free guides, resources, tools, and more.
Follow Each Other to #LearnSEO Socially:
BONUS:
Don't forget to get the list of 29 Must-Have #SEO Skills, from basic to mastery.