Updated March 14, 2026 · 12 min read

Robots.txt: Complete Guide with Examples (2026)

Your robots.txt file is the first thing every search engine reads when it visits your website. A single misplaced line can block Google from indexing your entire site, costing you thousands of visitors overnight. Yet most website owners either ignore this file completely or copy-paste a template without understanding what it does.

This guide covers everything you need to know about robots.txt: what it is, how search engine crawlers use it, the complete syntax reference, real-world examples for every common scenario, and the mistakes that silently destroy your SEO. Whether you are running a WordPress blog, a Shopify store, or a custom React single-page application, you will find a ready-to-use robots.txt example tailored to your platform.

If you just need to generate a robots.txt file quickly, our free robots.txt generator will create one for you in seconds. But understanding what each directive does will save you from costly mistakes down the road.

What Is Robots.txt and Why It Matters

A robots.txt file is a plain text file that lives at the root of your website (for example, https://example.com/robots.txt). It follows the Robots Exclusion Protocol, a convention created in 1994 and formalized as an internet standard (RFC 9309) in 2022, that tells web crawlers which parts of your site they are allowed to access and which parts they should stay away from.

When a search engine crawler like Googlebot arrives at your website, the very first thing it does is look for a robots.txt file. Before crawling a single page, it reads your robots.txt to understand the rules you have set. If the file says a certain directory is off-limits, well-behaved crawlers will respect that instruction and skip it entirely.

This matters for three critical reasons:

  1. Crawl budget: blocking low-value pages (internal search results, filter URLs, admin screens) lets crawlers spend their limited visits on the content you actually want ranked.
  2. Duplicate content: parameterized, sorted, and filtered URLs can generate thousands of near-identical pages; keeping crawlers away from them protects your canonical pages.
  3. Catastrophic mistakes: a single wrong line, such as Disallow: /, can stop search engines from crawling your entire site, so the file deserves careful attention.

Important: Robots.txt is not a security tool. It is a publicly readable file that politely asks crawlers to stay away. Malicious bots will ignore it. Never use robots.txt to protect sensitive information like passwords, customer data, or private documents.

How Search Engine Crawlers Use Robots.txt

Understanding the crawl process helps you write better robots.txt rules. Here is what happens step by step when a search engine discovers your website:

  1. Discovery: The search engine learns about your site through links from other websites, your sitemap submission, or direct URL submission in tools like Google Search Console.
  2. Robots.txt check: Before crawling any page, the crawler fetches yourdomain.com/robots.txt. If the file exists, the crawler reads and caches the rules. If the file does not exist (404 response), the crawler assumes everything is allowed.
  3. Rule matching: For each URL the crawler wants to visit, it checks whether any Disallow rule matches that URL path. If a matching Disallow exists and no more specific Allow rule overrides it, the crawler skips that page.
  4. Crawling: For allowed pages, the crawler downloads the HTML, follows links to discover new pages, and repeats the process.
  5. Re-checking: Crawlers periodically re-fetch your robots.txt to check for changes. Google typically caches it for up to 24 hours.
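Steps 2 and 3 can be sketched with Python's standard-library robots.txt parser. This is only an illustration of the crawler's decision process; the domain, rules, and URLs below are hypothetical:

```python
# Sketch of the robots.txt check a crawler performs (steps 2-3 above),
# using Python's built-in parser. Domain and rules are hypothetical.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# A real crawler would fetch the live file:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Parsing inline keeps this sketch self-contained.
rp.parse("""\
User-agent: *
Disallow: /admin/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
""".splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
print(rp.crawl_delay("*"))                                           # 10
print(rp.site_maps())                # ['https://example.com/sitemap.xml']
```

Note that Python's parser implements the classic first-match rule rather than Google's exact precedence logic, so treat it as a quick sanity check, not a byte-for-byte Googlebot simulator.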

Each major search engine has its own crawler with a specific user-agent name. Google uses Googlebot, Bing uses Bingbot, and so on. Your robots.txt can set different rules for different crawlers, giving you granular control over which search engines can access which parts of your site.

Robots.txt Syntax: The Complete Reference

The robots.txt syntax is straightforward, but the details matter. Every directive must follow a specific format, and even small errors can cause crawlers to misinterpret your rules. Here is every directive you need to know:

User-agent

The User-agent directive specifies which crawler the following rules apply to. Use * as a wildcard to target all crawlers, or specify a particular bot by name.

# Rules for all crawlers
User-agent: *

# Rules specifically for Google
User-agent: Googlebot

# Rules specifically for Bing
User-agent: Bingbot

You can have multiple User-agent blocks in a single robots.txt file. Each block starts with a User-agent line and includes the Disallow and Allow rules that apply to that crawler.

Disallow

The Disallow directive tells crawlers not to access a specific URL path. The path is case-sensitive and matches from the beginning of the URL path.

# Block a specific page
Disallow: /private-page.html

# Block an entire directory
Disallow: /admin/

# Block all URLs containing a query parameter
Disallow: /*?

# Block everything (disallow all pages)
Disallow: /

# An empty Disallow means allow everything
Disallow:

An empty Disallow: directive (with nothing after the colon) means "allow everything." This is the standard way to explicitly state that no restrictions apply to the specified user-agent.

Allow

The Allow directive overrides a Disallow rule for a specific path. This is particularly useful when you want to block a directory but allow access to certain files within it.

User-agent: *
Disallow: /private/
Allow: /private/public-page.html

In this example, crawlers are blocked from the entire /private/ directory except for /private/public-page.html, which they are allowed to access. When both Allow and Disallow match a URL, the more specific (longer) rule takes precedence.
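The longest-rule-wins behavior can be sketched in a few lines of Python. This toy matcher (the function name is mine, not part of any library) handles only literal prefixes, not wildcards:

```python
# Toy implementation of Google's precedence rule: among all matching
# Allow/Disallow rules, the one with the longest path wins; on a tie,
# Allow wins. Handles literal path prefixes only (no * or $ wildcards).
def is_allowed(path, rules):
    # rules is a list of (directive, path_prefix) pairs,
    # e.g. ("disallow", "/private/")
    matches = [(len(p), d == "allow") for d, p in rules if path.startswith(p)]
    if not matches:
        return True  # no rule matches: allowed by default
    # Sorting by (length, allow-flag) puts the longest rule last,
    # with Allow beating Disallow at equal length.
    return sorted(matches)[-1][1]

rules = [("disallow", "/private/"), ("allow", "/private/public-page.html")]
print(is_allowed("/private/public-page.html", rules))  # True: longer Allow wins
print(is_allowed("/private/secret.html", rules))       # False: only Disallow matches
print(is_allowed("/about", rules))                     # True: no rule matches
```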

Sitemap

The Sitemap directive tells crawlers where to find your XML sitemap. Unlike other directives, the Sitemap line is not tied to a User-agent block. It applies globally and can appear anywhere in the file.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml

You can include multiple Sitemap directives if your site uses multiple sitemaps. Always use the full absolute URL (including https://), not a relative path.

Crawl-delay

The Crawl-delay directive asks crawlers to wait a specified number of seconds between requests. This protects your server from being overwhelmed by aggressive crawling.

User-agent: *
Crawl-delay: 10

This tells crawlers to wait at least 10 seconds between each page request. Note that Google ignores Crawl-delay entirely; Googlebot adjusts its crawl rate automatically based on how quickly your server responds (the old crawl-rate setting in Google Search Console was retired in early 2024). Bing, Yandex, and several other crawlers do respect Crawl-delay.

Pro tip: Use Crawl-delay sparingly. Setting it too high means search engines will crawl fewer pages per visit, which can slow down how quickly new or updated content appears in search results.

Wildcard Patterns

Most modern crawlers support two wildcard characters in robots.txt paths: * matches any sequence of characters, and $ anchors a rule to the end of the URL.

# Block all PDF files
Disallow: /*.pdf$

# Block all URLs with query strings
Disallow: /*?*

# Block URLs ending in .json
Disallow: /*.json$
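Under the hood, these patterns behave like simple regular expressions. Here is a hedged sketch of the translation (the function name is mine, and real crawlers may differ in edge cases):

```python
import re

# Convert a robots.txt path pattern to a regex: '*' matches any sequence
# of characters, a trailing '$' anchors the match to the end of the URL,
# and everything else matches literally from the start of the path.
def robots_pattern_to_regex(pattern):
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

pdf_rule = robots_pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))             # True
print(bool(pdf_rule.match("/files/report.pdf?download=1")))  # False: $ anchor

query_rule = robots_pattern_to_regex("/*?*")
print(bool(query_rule.match("/products?sort=price")))        # True
print(bool(query_rule.match("/products")))                   # False: no '?'
```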

Generate Your Robots.txt in Seconds

Select your platform, choose your rules, and download a properly formatted robots.txt file. No syntax errors, no guesswork.

Open the Free Robots.txt Generator →

Robots.txt Examples for Every Common Scenario

Here are ready-to-use robots.txt examples you can adapt for your site. Each one solves a specific, common problem.

Allow All Crawlers (Default Open Access)

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

This is the simplest robots.txt possible. The empty Disallow directive means "disallow nothing," so all crawlers can access every page. This is appropriate for most public websites that want maximum search engine visibility.

Block All Crawlers (Staging or Development Sites)

User-agent: *
Disallow: /

The forward slash after Disallow matches every URL on your site, effectively blocking all crawlers from all pages. Use this for staging environments, development servers, and sites that are not ready for public access. This is the most critical robots.txt rule to know: one line that keeps your entire site out of search engines.

Watch out: If you launch your production site with this rule still in place, Google will stop crawling your site and your pages will gradually fall out of the index. Always check your robots.txt before going live.

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /search/
Disallow: /tmp/

Sitemap: https://example.com/sitemap.xml

This is a common pattern for ecommerce and content sites. It blocks admin panels, user account areas, cart and checkout flows (which create duplicate URLs), internal search results, and temporary directories while allowing crawlers to access all public content.
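Before deploying a file like this, it is worth sanity-checking it programmatically, for example with Python's standard-library parser (the domain below is a placeholder):

```python
# Verify that directory-blocking rules behave as intended before deploying.
# The domain is a placeholder; the rules mirror a subset of the example.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /search/
""".splitlines())

blocked = ["https://example.com/admin/", "https://example.com/cart/item"]
public = ["https://example.com/", "https://example.com/products/widget"]
print(all(not rp.can_fetch("*", url) for url in blocked))  # True
print(all(rp.can_fetch("*", url) for url in public))       # True
```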

Block Specific Bots Only

# Allow all crawlers by default
User-agent: *
Disallow:

# Block specific AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Sitemap: https://example.com/sitemap.xml

This configuration allows all standard search engine crawlers while blocking specific AI training bots. As of 2026, many publishers use this pattern to allow Google and Bing indexing while preventing their content from being used to train large language models. Google-Extended blocks Google's AI training crawler specifically, without affecting Googlebot's regular search indexing.

Block URL Parameters (Prevent Duplicate Content)

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=
Disallow: /*?ref=
Disallow: /*?utm_

Sitemap: https://example.com/sitemap.xml

Sorting, filtering, pagination, referral tracking, and UTM parameters create thousands of duplicate URLs that waste your crawl budget. Blocking them in robots.txt keeps crawlers focused on your canonical pages.

Robots.txt vs Meta Robots Tag: When to Use Which

This is one of the most common points of confusion in SEO. Robots.txt and meta robots tags serve different purposes and operate at different levels. Here is how they compare:

Feature                 | Robots.txt                                 | Meta Robots Tag
Where it lives          | Root of your domain (/robots.txt)          | In the HTML <head> of each page
What it controls        | Whether crawlers can access a page         | Whether a page is indexed or links are followed
Scope                   | Entire directories or URL patterns         | Individual pages
Can prevent indexing?   | No (URL can still appear in results)       | Yes (noindex removes from results)
Can prevent crawling?   | Yes                                        | No (page must be crawled to read the tag)
Affects link equity?    | Yes (blocked pages cannot pass link value) | Yes (nofollow stops link equity flow)

The key insight: If you block a page in robots.txt, Google cannot see a noindex tag on that page because it never crawls the HTML. This means if other sites link to a page you have blocked in robots.txt, Google may still show the URL in search results with no description. To truly remove a page from search results, you need to allow crawling but add a <meta name="robots" content="noindex"> tag in the page's HTML.

Use robots.txt for broad crawl management (blocking directories, managing crawl budget). Use meta robots for page-level indexing control (keeping specific pages out of search results).

Common Robots.txt Mistakes That Hurt Your SEO

These mistakes are surprisingly common and can silently damage your search rankings for weeks or months before you notice. Review your robots.txt against this list to make sure you are not making any of them.

1. Blocking CSS and JavaScript Files

Years ago, it was common practice to block /css/ and /js/ directories in robots.txt. This is now actively harmful. Google renders pages using their CSS and JavaScript to understand layout, content, and user experience. If you block these resources, Googlebot sees a broken, unstyled page and may rank it lower.

# DO NOT do this
Disallow: /css/
Disallow: /js/
Disallow: /images/

Google has explicitly stated that blocking CSS and JS can negatively affect indexing. Use the URL Inspection tool in Google Search Console to see how Googlebot renders your page and whether any resources are blocked.

2. Leaving the Staging Robots.txt on Production

This happens more often than anyone admits. A developer sets Disallow: / on a staging server to keep it out of search engines. When the site goes live, the robots.txt is copied along with everything else, and suddenly Google cannot crawl any pages. Traffic drops to zero within days.

Pro tip: Add a robots.txt check to your deployment checklist. Verify the production robots.txt immediately after every deployment.

3. Blocking Important Pages by Accident

A rule like Disallow: /blog blocks not just /blog but also /blog/, /blog/my-best-post, and /blog-archive because robots.txt uses prefix matching. If you meant to block only the blog index page, you need Disallow: /blog$ (if the crawler supports the $ wildcard).

4. Using Robots.txt Instead of Noindex

As discussed above, blocking a page in robots.txt does not remove it from search results. If Google has already indexed a page or if other sites link to it, the URL can still appear in search results. The only way to reliably remove a page is to allow crawling and use a noindex meta tag.

5. Forgetting the Trailing Slash

Disallow: /private matches /private, /private-page, /private/admin, and /privately. If you only want to block the /private/ directory, use Disallow: /private/ with a trailing slash. The slash makes the rule match only that directory and its contents.
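The difference is easy to verify with plain prefix matching, which is exactly how these rules are evaluated (the paths below are hypothetical):

```python
# Demonstrate why the trailing slash matters: robots.txt rules are
# prefix matches against the URL path. Paths are hypothetical.
paths = ["/private", "/private-page", "/private/admin", "/privately"]

without_slash = [p for p in paths if p.startswith("/private")]
with_slash = [p for p in paths if p.startswith("/private/")]

print(without_slash)  # all four paths match
print(with_slash)     # ['/private/admin']
```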

6. Having Multiple Robots.txt Files

Some sites accidentally serve different robots.txt content based on whether the URL uses www or not, or http vs https. Each protocol-subdomain combination is treated as a separate site by search engines. Make sure your canonical domain serves the correct robots.txt and that all other versions redirect to it.

7. Not Including a Sitemap Reference

While not strictly an error, omitting the Sitemap: directive is a missed opportunity. Adding your sitemap URL to robots.txt gives crawlers an immediate path to discover all your important pages, even if you have also submitted the sitemap through Google Search Console.

How to Test Your Robots.txt

Never publish a robots.txt file without testing it first. A single typo can block your entire site. Here are the tools and methods to verify your file works correctly:

Google Search Console Robots.txt Report

Google Search Console includes a robots.txt report (it replaced the older robots.txt Tester). The report lists the robots.txt files Google found for your properties, when each was last fetched, and any parsing errors or warnings. This is the most authoritative check because it shows exactly how Googlebot reads your file; to test whether a specific URL is blocked, combine it with the URL Inspection tool described below.

Manual Browser Check

Simply navigate to yourdomain.com/robots.txt in your browser. Verify that the file loads as plain text, the syntax is correct, and there are no unexpected rules. This is the fastest sanity check you can do.

URL Inspection Tool

After deploying your robots.txt, use the URL Inspection tool in Google Search Console to check specific pages. It will show you whether Googlebot can access the page and whether any robots.txt rules are blocking it. It also shows a rendered screenshot of what Googlebot sees, which helps catch blocked CSS or JavaScript issues.

Third-Party Validators

Several online tools validate robots.txt syntax and test URL matching against your rules. These are useful for checking syntax before uploading the file to your server. Our robots.txt generator produces properly formatted files, eliminating syntax errors entirely.

Build a Perfect Robots.txt File

Choose your rules visually, preview the output, and copy it to your server. Zero syntax errors guaranteed.

Generate Your Robots.txt Free →

Robots.txt for Different Platforms

Different platforms have different crawling needs. Here are optimized robots.txt configurations for the most popular website platforms in 2026.

Robots.txt for WordPress

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-json/
Disallow: /author/
Disallow: /trackback/
Disallow: /feed/
Disallow: /?s=
Disallow: /search/

Sitemap: https://example.com/sitemap.xml

This WordPress robots.txt blocks the admin area (while allowing admin-ajax.php, which many themes and plugins need), internal search results, feeds, author archives (which often duplicate content), and trackbacks. The wp-content/plugins/ and wp-content/cache/ directories contain files that do not need to be indexed.

Do not block /wp-content/uploads/ unless you specifically want to hide your media files from search engines. Images in your uploads folder can drive significant traffic through Google Images.

Robots.txt for Shopify

Shopify generates its own robots.txt automatically, but since 2021 you can customize it by editing the robots.txt.liquid template. Here is a solid configuration:

User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /orders
Disallow: /checkouts/
Disallow: /checkout
Disallow: /account
Disallow: /collections/*sort_by*
Disallow: /collections/*+*
Disallow: /*/collections/*sort_by*
Disallow: /search
Disallow: /*?variant=*
Disallow: /*?q=*

Sitemap: https://example.com/sitemap.xml

The key Shopify-specific rules here are blocking collection sort parameters (which create massive duplicate content), variant parameters (each product variant generates a separate URL), and the search page. Shopify's cart, checkout, and account pages should always be blocked since they contain user-specific content that wastes crawl budget.

Robots.txt for React SPAs (Single-Page Applications)

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

For React, Next.js, Vue, and other JavaScript-based single-page applications, keep your robots.txt as open as possible. SPAs already face crawling challenges because their content is rendered by JavaScript. Blocking any resources (especially JS bundles) will prevent search engines from rendering your pages entirely.

The real SEO work for SPAs happens at the rendering level: use server-side rendering (SSR) or static site generation (SSG) to ensure crawlers can see your content. Your robots.txt should not add any additional obstacles. Focus on providing a comprehensive sitemap, as JavaScript-rendered navigation can be difficult for crawlers to follow.

Robots.txt for API-Only Backends

User-agent: *
Disallow: /

If your domain serves only an API with no public-facing pages, block everything. There is no content for search engines to index, and API endpoints should not appear in search results.

Advanced Robots.txt Strategies

Managing Crawl Budget for Large Sites

Sites with more than 10,000 pages need to think carefully about crawl budget. Search engines will not crawl every page on every visit. By blocking low-value pages in robots.txt, you direct crawlers to spend their limited budget on your most important content.

Common low-value pages to block: paginated archive pages beyond page 2, tag pages with only one or two posts, internal search results (which create infinite URL combinations), and print-friendly versions of pages.

Handling Multiple Subdomains

Each subdomain needs its own robots.txt file. The rules at example.com/robots.txt do not apply to blog.example.com or shop.example.com. If you use subdomains, create and maintain separate robots.txt files for each one.

Dealing with Aggressive Bots

Some bots ignore robots.txt entirely and crawl your site aggressively, consuming server resources. For these, robots.txt is not enough. Use server-level blocking through .htaccess rules, firewall rules, or your CDN's bot management features to deny access at the network level.

Frequently Asked Questions

What is a robots.txt file?

A robots.txt file is a plain text file placed at the root of your website that tells search engine crawlers which pages or sections of your site they can and cannot access. It follows the Robots Exclusion Protocol and is the first file crawlers check before indexing your site.

Does robots.txt block pages from appearing in Google?

Not exactly. Robots.txt prevents crawlers from accessing a page, but if other sites link to that page, Google may still index the URL and show it in search results without a description. To fully block a page from search results, use a meta robots noindex tag instead.

Where should I place my robots.txt file?

Your robots.txt file must be placed in the root directory of your domain. For example, https://example.com/robots.txt. It will not work if placed in a subdirectory. Each subdomain needs its own separate robots.txt file.

What is the difference between robots.txt and meta robots tags?

Robots.txt controls whether crawlers can access a page at all, while meta robots tags control whether a page should be indexed or whether links should be followed. Robots.txt works at the crawl level; meta robots works at the indexing level. For best results, use them together.

Can I use robots.txt to hide sensitive information?

No. Robots.txt is publicly accessible and only serves as a polite request to well-behaved crawlers. Malicious bots can ignore it entirely. Never use robots.txt to protect sensitive data. Use proper authentication and server-side security instead.

How do I test my robots.txt file?

Use Google Search Console's robots.txt report to check your file for fetch and parsing errors, and the URL Inspection tool to test whether specific URLs are blocked or allowed. You can also visit yourdomain.com/robots.txt directly in your browser to verify the file is accessible and correctly formatted.

Need a Robots.txt File Right Now?

Skip the manual editing. Our free robots.txt generator lets you select your platform, toggle common rules on and off, and download a properly formatted file in seconds. It includes presets for WordPress, Shopify, and static sites, with built-in validation to catch syntax errors before they reach your server.

Open the Free Robots.txt Generator →

© 2026 BizToolkit. Free tools for freelancers and small businesses.