What is robots.txt?

robots.txt is a file that tells search engine crawlers which paths they can or cannot request. Learn how it works and how it affects crawling and indexing.

robots.txt is a plain-text file at the root of your site (e.g. https://yoursite.com/robots.txt) that gives crawlers rules about which URLs they’re allowed to request. It does not remove pages from the index by itself—for that you use noindex or Search Console—but it controls crawl access and can block entire sections from being fetched.

How it works

  • Allow — Allow: /blog/ means crawlers can access that path.
  • Disallow — Disallow: /admin/ means crawlers should not request URLs under that path.
  • Sitemap — You can list sitemap URLs so search engines discover them easily.
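
Putting the directives above together, a minimal robots.txt might look like this (the paths and sitemap URL are placeholders, not recommendations):

```txt
User-agent: *
Allow: /blog/
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml
```

The User-agent line says which crawlers the following rules apply to; `*` means all of them.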

Crawlers that respect the standard (e.g. Googlebot) follow these rules. Blocking a URL in robots.txt prevents Google from reading its content: if the page was already indexed, it may stay in the index with stale information; if it was never indexed, it is unlikely to be added.
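You can check how a standards-following crawler would interpret a given file with Python's standard-library `urllib.robotparser`. The rules and URLs below are illustrative, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed directly from a string
# instead of being fetched over the network.
rules = """\
User-agent: *
Allow: /blog/
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# can_fetch(user_agent, url) answers: may this crawler request this URL?
print(rp.can_fetch("Googlebot", "https://yoursite.com/blog/post"))    # True
print(rp.can_fetch("Googlebot", "https://yoursite.com/admin/panel"))  # False
```

In production you would typically call `rp.set_url(".../robots.txt")` and `rp.read()` to fetch the live file instead of parsing a string.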

Common issues

  • Blocking important pages — Accidentally disallowing key sections so they aren’t crawled or updated.
  • Wrong location or syntax — The file must be served from the site root (e.g. /robots.txt) and use valid directive syntax.
  • Confusion with noindex — robots.txt controls crawling; noindex (meta or header) and Search Console control indexing.
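
To make the last distinction concrete: indexing is controlled on the page itself. A common pattern is a robots meta tag in the page's `<head>` (shown here on a generic page):

```html
<meta name="robots" content="noindex">
```

The equivalent for non-HTML resources is the `X-Robots-Tag: noindex` HTTP response header. Note that a noindex directive only works if the page can be crawled; if robots.txt blocks the URL, search engines never see the directive.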

How BearAudit uses it

BearAudit fetches and parses robots.txt for each property. We show what’s allowed and disallowed and highlight URLs that are in sitemaps but disallowed by robots (or vice versa), so you can fix crawl coverage and avoid blocking pages you want indexed.
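
A cross-check like the one described can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs, not BearAudit's actual implementation:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical inputs: parsed robots.txt rules and a list of URLs
# extracted from the site's sitemap.
robots_lines = """\
User-agent: *
Disallow: /admin/
""".splitlines()
sitemap_urls = [
    "https://yoursite.com/blog/hello",
    "https://yoursite.com/admin/login",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# URLs listed in the sitemap but disallowed by robots.txt —
# a likely crawl-coverage mistake worth flagging.
blocked = [u for u in sitemap_urls if not rp.can_fetch("*", u)]
print(blocked)  # ['https://yoursite.com/admin/login']
```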
