ai.txt, llms.txt, and robots.txt are three different files with three different jobs. robots.txt controls whether an AI crawler is allowed to fetch your pages at all. ai.txt is a newer, usage-oriented standard that governs what your content may be used for (e.g. AI training). llms.txt controls nothing — it helps language models understand your most important content faster. In short: robots.txt and ai.txt are access and usage rules, while llms.txt is a comprehension aid.
This distinction is business-critical in 2026. AI crawlers now account for a measurable share of web traffic: across 2024/2025 Cloudflare reported that GPTBot became the most active AI crawler on its network, with AI bot traffic rising sharply over the year. If you do not actively manage access and usage, you leave both to chance — and forfeit the chance to be cited correctly in AI answers.
The confusion around the three files is understandable. They all end in `.txt`, all live at the root of your domain, and all get mentioned in the same breath. But they solve different problems, and whoever conflates them either blocks too much (and vanishes from AI answers) or too little (and forfeits control over their content). This article sorts the three files cleanly: what each one does, which one actually controls AI crawlers, how they work together, and how to set them up correctly in under an hour.
What is the difference between ai.txt, llms.txt and robots.txt?
The difference is functional. robots.txt is the oldest standard (the Robots Exclusion Protocol, dating to 1994 and standardized as RFC 9309 in 2022) and tells crawlers which URLs they may and may not fetch. ai.txt is a newer proposal (popularized by Spawning.ai, among others) that specifically governs the AI and training use of content — not just "may you read this" but "may you train on this." llms.txt was proposed in September 2024 by Jeremy Howard (co-founder of Answer.AI and fast.ai) and is a curated Markdown map of your most important content that makes comprehension easier for language models.
The table below summarizes the core differences:
| File | Primary job | Controls crawlers? | Format | Path |
|---|---|---|---|---|
| robots.txt | Allow/block access | Yes (widely honored) | Directives | /robots.txt |
| ai.txt | Govern AI use & training | Partly (young, voluntary) | Directives | /ai.txt |
| llms.txt | Improve comprehension | No | Markdown | /llms.txt |
Mnemonic: robots.txt = the door, ai.txt = the usage contract, llms.txt = the signpost.
robots.txt in detail
robots.txt is the foundation. Every serious search engine and every reputable AI crawler reads this file first, before fetching your pages. It works through simple rules: a `User-agent` names the crawler, while `Disallow` and `Allow` set blocked and permitted paths. Important: robots.txt prevents crawling, not necessarily indexing — a URL that is already known can still appear as a result despite a block. For AI crawlers, robots.txt is nonetheless the strongest control instrument, because GPTBot, ClaudeBot, and the rest honor it per their providers' documentation.
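To see how `User-agent`, `Disallow`, and `Allow` interact, here is a minimal sketch that evaluates a small rule set with Python's standard-library `urllib.robotparser`. The sample rules, the `/internal/` path, and the helper name `is_allowed` are assumptions for illustration, not a recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; the /internal/ path is a made-up example.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /internal/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "OAI-SearchBot", "ClaudeBot"):
    for path in ("/magazin/what-is-llms-txt", "/internal/drafts"):
        url = f"https://your-domain.com{path}"
        print(f"{agent:14} {path:28} allowed={is_allowed(SAMPLE_ROBOTS_TXT, agent, url)}")
```

A crawler with its own group (GPTBot, OAI-SearchBot) follows only that group, while ClaudeBot falls back to the `*` group. One caveat: Python's parser applies rules in file order within a group, which can differ from the longest-match rule in RFC 9309 when `Allow` and `Disallow` paths overlap, so treat this as a quick check and not as a reference implementation.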
ai.txt in detail
ai.txt goes a step beyond robots.txt. Where robots.txt only governs fetching, ai.txt aims to differentiate the use of fetched content — for example permitting indexing while disallowing training. The standard is still young and not formally ratified; in 2026 several initiatives (Spawning.ai, the "AI Preferences" discussions at the IETF) are competing over a unified form. In practice this means ai.txt is a useful signal — valuable legally and for communication — but not a technical hard stop. Reputable providers increasingly take it into account; there is no guarantee.
llms.txt in detail
llms.txt is the counterpart: it blocks nothing, it opens. The idea is to offer language models a clean Markdown overview of your most important content — free of navigation, ads, and HTML clutter. Models have limited context windows; a focused llms.txt helps them quickly find and accurately reproduce the relevant pages. Learn more about structure and impact in our [guide to llms.txt](/magazin/what-is-llms-txt).
Why three files instead of one
You might ask: why not put everything in a single file? The answer lies in history and in the separate goals. robots.txt grew over three decades and is optimized for path access — it has no semantics for "training." ai.txt emerged in response to exactly that gap, because creators wanted to distinguish "read" from "train." And llms.txt solves an entirely different problem: it is not about control but about efficiency of comprehension. Three problems, three files, three formats. This separation is not an oversight but clean division of labor — each file stays simple and does exactly one thing well.
Which file controls AI crawlers?
robots.txt controls AI crawlers most reliably, provided the crawler honors the protocol. The major AI providers publish named user-agents that respect robots.txt: OpenAI uses GPTBot (for training) and OAI-SearchBot (for ChatGPT search), Anthropic uses ClaudeBot, and Perplexity uses PerplexityBot. Google handles this with Google-Extended, which is a robots.txt product token and not a separate crawler: it governs AI training use without affecting normal Google Search. Block these in robots.txt and the reputable providers comply.
ai.txt targets a finer layer: it aims to govern not just fetching but use — for example "indexing allowed, training prohibited." That is conceptually stronger, but as of 2026 it is not yet universally supported by every crawler; compliance is voluntary. llms.txt controls nothing at all. It is purely declarative and offers no blocking power — if you want to block access, you need robots.txt.
The key AI crawler user-agents in 2026
To control crawlers, you have to know their names. Keep these user-agents in view in your rules and logs:
| Provider | User-agent | Purpose |
|---|---|---|
| OpenAI | GPTBot | Model training |
| OpenAI | OAI-SearchBot | ChatGPT search / live retrieval |
| OpenAI | ChatGPT-User | User-triggered fetches |
| Anthropic | ClaudeBot | Training & retrieval |
| Google | Google-Extended | AI training opt-out (Gemini) |
| Perplexity | PerplexityBot | Indexing for Perplexity |
| Common Crawl | CCBot | Open training corpus |
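Important: a block against `Google-Extended` affects only Google's AI training, not your normal ranking in Google Search. That separation lets you keep classic SEO visibility while disallowing AI training. Because `Google-Extended` is a control token that Google's existing crawlers evaluate, it does not appear as its own user-agent in your logs.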
What robots.txt cannot do
robots.txt is not a security measure. Disreputable scrapers simply ignore the file, and even a blocked path remains technically reachable — robots.txt is a request, not a lock. If you genuinely want to protect content, you need server-side measures such as authentication, rate limiting, or bot management (for example via Cloudflare). robots.txt governs cooperative crawlers; it does nothing against hostile ones.
The legal context in Europe
ai.txt in particular gains weight against the backdrop of EU law. Article 4 of the EU's DSM Directive (2019/790) provides an opt-out mechanism for text and data mining: rights holders can object to the machine processing of their works, provided they declare that reservation in a "machine-readable" form. ai.txt and corresponding robots.txt directives are increasingly regarded as an accepted way to express exactly that machine-readable reservation. The EU AI Act likewise points to the obligation of generative model providers to respect such reservations. The implication: a well-maintained ai.txt in 2026 is not only a technical but also a legal signal, a documented record of your usage preference. For legally binding questions rely on qualified counsel; as a practitioner, though, you should know and use the mechanism.
How do they work together?
The three files do not contradict each other — they interlock at different levels. Picture a three-stage pipeline: access → use → comprehension. robots.txt decides at the door whether a crawler may enter at all. If you let it in, ai.txt specifies what it may do with the content (read, cite, train). And once a language model processes your content, llms.txt helps it pick up the right pages in priority order and in clean structure.
The order of effect matters: a block in robots.txt makes downstream files moot. If you fully bar GPTBot, it does not matter what your llms.txt says — the crawler never reaches your content. So rather than blanket-blocking everything, differentiate deliberately: perhaps block training, but allow search and citation so you stay visible in AI answers.
A concrete interplay scenario
Take a typical case: a B2B SaaS magazine that wants to be cited in ChatGPT and Perplexity but does not want to serve as training material. The solution combines all three files. In robots.txt you allow OAI-SearchBot and PerplexityBot but block GPTBot and Google-Extended. In ai.txt you document this preference again, explicitly and machine-readable — as an additional, future-proof signal. In llms.txt you link the ten most important guide articles, so the search crawlers prioritize exactly the content you want to be cited for. Result: maximum citability, minimal training use.
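The robots.txt part of this scenario can be sketched as a small generator that turns an allow/block decision per crawler into directive groups. The function name `build_robots_txt`, the policy dictionary, and the sitemap URL are illustrative assumptions; the crawler names come from the scenario above.

```python
def build_robots_txt(policy: dict[str, bool], sitemap_url: str) -> str:
    """Render one robots.txt group per crawler: True allows, False blocks."""
    groups = []
    for user_agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {user_agent}\n{rule}")
    return "\n\n".join(groups) + f"\n\nSitemap: {sitemap_url}\n"


# Scenario: stay citable in ChatGPT and Perplexity, opt out of training.
SCENARIO_POLICY = {
    "OAI-SearchBot": True,
    "PerplexityBot": True,
    "GPTBot": False,
    "Google-Extended": False,
}

robots_txt = build_robots_txt(SCENARIO_POLICY, "https://your-domain.com/sitemap.xml")
print(robots_txt)
```

Keeping the policy in one data structure makes it easier to mirror the same preference in ai.txt later without the two files drifting apart.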
This logic is the core of any well-considered [GEO strategy](/magazin/generative-engine-optimization-guide): manage access deliberately instead of reflexively shutting AI crawlers out and losing your visibility in ChatGPT, Perplexity, and Google AI Overviews. The three files are not an either/or but a coordinated set.
The decision matrix
Which file you need, and when, depends on your goal. This matrix helps you place yourself:
| Your goal | robots.txt | ai.txt | llms.txt |
|---|---|---|---|
| Block crawlers entirely | required | optional | irrelevant |
| Get cited in AI answers | allow | allow | recommended |
| Block training, allow search | differentiate | recommended | recommended |
| Maximize AI comprehension | allow | allow | required |
Most brands land in row three in 2026: they want to appear in AI answers but control whether their content flows into model training. For that goal you need all three files — and that is precisely what makes understanding their interplay so valuable.
How do you set them up?
All three files live at the root of your domain (e.g. `https://your-domain.com/robots.txt`) and can be set up in under an hour. Work in this order:
1. Define robots.txt. Decide which AI crawlers you allow or block. A typical setup for maximum AI visibility with a simultaneous training opt-out looks like this:
```
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Google-Extended
Disallow: /

Sitemap: https://your-domain.com/sitemap.xml
```
Here you block training (GPTBot, Google-Extended) but allow ChatGPT search (OAI-SearchBot) — you stay citable without supplying training data.
2. Add ai.txt. State your usage preferences explicitly. A minimal, readable variant can permit indexing while disallowing training. Because ai.txt is still young, treat it as a supplementary signal, not your sole protection.
3. Write llms.txt. Build a curated Markdown map of your most important pages — an H1 with the project name, a short blockquote summary, then topically grouped link lists. Keep it current and lean.
4. Validate. Open each file in a browser, confirm an HTTP 200 status, and check your server logs to see whether the named AI user-agents actually respect the directives.
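The HTTP 200 check from step 4 can be scripted. The following is a minimal sketch using only the standard library; the helper names (`http_status`, `check_discovery_files`, `missing_files`) and the injectable `fetch_status` parameter are assumptions made for illustration, and the script only checks status codes, not file contents.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

DISCOVERY_FILES = ("robots.txt", "ai.txt", "llms.txt")


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if the request fails entirely."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # 4xx/5xx responses arrive as exceptions that carry the status code.
        return error.code
    except OSError:
        # DNS failures, refused connections, and timeouts end up here.
        return 0


def check_discovery_files(base_url: str, fetch_status=http_status) -> dict[str, int]:
    """Map each discovery file name to the status returned at the domain root."""
    base = base_url.rstrip("/")
    return {name: fetch_status(f"{base}/{name}") for name in DISCOVERY_FILES}


def missing_files(statuses: dict[str, int]) -> list[str]:
    """List the files that did not answer with HTTP 200."""
    return [name for name, status in statuses.items() if status != 200]


# Usage against a live site (performs real requests):
# print(missing_files(check_discovery_files("https://your-domain.com")))
```

Passing the fetch function as a parameter keeps the check testable without network access, for example by substituting a stub that returns canned status codes.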
Example of a lean llms.txt
A good llms.txt is short and curated. Here is how the start might look:
```
# Your Brand

> A platform for X that solves Y for Z.

## Core guides

- [What is GEO](/magazin/generative-engine-optimization-guide): The complete guide
- [What is llms.txt](/magazin/what-is-llms-txt): Definition and setup

## Product

- [Features](/features): Overview
- [Pricing](/pricing): Plans compared
```
Each link carries a short, descriptive note — that helps the model gauge relevance before it even fetches the page.
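If your pages already live in a content model, a file in this shape can be generated instead of written by hand. Here is a minimal sketch; the function name `build_llms_txt` and the input structure (section heading mapped to label, URL, and note) are assumptions for illustration, and the sample values are the placeholders from the example above.

```python
def build_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render an H1, a blockquote summary, and one link list per section."""
    lines = [f"# {title}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}", ""]
        lines += [f"- [{label}]({url}): {note}" for label, url, note in links]
    return "\n".join(lines) + "\n"


llms_txt = build_llms_txt(
    title="Your Brand",
    summary="A platform for X that solves Y for Z.",
    sections={
        "Core guides": [
            ("What is GEO", "/magazin/generative-engine-optimization-guide", "The complete guide"),
            ("What is llms.txt", "/magazin/what-is-llms-txt", "Definition and setup"),
        ],
        "Product": [
            ("Features", "/features", "Overview"),
            ("Pricing", "/pricing", "Plans compared"),
        ],
    },
)
print(llms_txt)
```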
Example of an ai.txt
ai.txt does not yet have a single ratified standard, so several spellings circulate in practice. A common, readable variant borrows robots.txt syntax and adds a usage directive. What matters is that your intent is documented unambiguously and machine-readably:
```
# ai.txt: usage preferences for AI systems
User-Agent: *
Disallow-AI-Training: /
Allow-AI-Search: /
Contact: contact@your-domain.com
```
Because directive names vary by initiative, it pays to add a short plain-text comment to your ai.txt that states the intent in one sentence — so humans and newer crawler generations grasp your preference too. Treat ai.txt in 2026 as a bridge: it documents your intent legally and communicatively, while technical enforcement still runs primarily through robots.txt.
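To check that your own ai.txt says what you intend, you can read it back with a few lines of code. The sketch below parses simple `Key: value` lines; the directive names `Disallow-AI-Training` and `Allow-AI-Search` are taken from the example variant above and are not part of any ratified standard, and the helper names are hypothetical.

```python
# Directive names follow the article's example variant, not a ratified spec.
SAMPLE_AI_TXT = """\
# ai.txt: usage preferences for AI systems
User-Agent: *
Disallow-AI-Training: /
Allow-AI-Search: /
Contact: contact@your-domain.com
"""


def parse_ai_txt(text: str) -> dict[str, list[str]]:
    """Collect 'Key: value' lines into a dict keyed by lowercased directive name."""
    directives: dict[str, list[str]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        directives.setdefault(key.strip().lower(), []).append(value.strip())
    return directives


def training_reserved(directives: dict[str, list[str]]) -> bool:
    """True if training is disallowed for the whole site ('/')."""
    return "/" in directives.get("disallow-ai-training", [])


parsed = parse_ai_txt(SAMPLE_AI_TXT)
print(parsed)
print("training reserved:", training_reserved(parsed))
```

Because the parser treats everything after `#` as a comment, values that legitimately contain `#` would be cut off; for a format this young, that trade-off keeps the sketch short.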
Setup with Next.js and modern frameworks
If you run your project on a modern framework like Next.js, you can place these files in the static public directory, but generating them is usually easier to maintain. Next.js offers route handlers and metadata file conventions (`robots.ts`, `sitemap.ts`) that produce robots.txt and your sitemap in a type-safe way, typically at build time. ai.txt and llms.txt can be served via simple route handlers that pull content from your content model. The advantage: when you publish a new category or a new cornerstone article, your llms.txt updates automatically, with no manual upkeep and nothing forgotten.
Maintenance and automation
Discovery files are not a one-off project. User-agents change, new crawlers appear, and your most important content shifts. If you maintain a searchable prompt and content library of your own, you can generate these files automatically — which is exactly why Prompt2Love builds GEO hygiene into the stack, so your content is cleanly accessible and comprehensible to AI crawlers from day one. Set yourself a quarterly reminder to check robots.txt, ai.txt, and llms.txt against the current crawler landscape.
What does this mean legally?
These files are primarily communication and compliance instruments, not technical locks — but they are more legally relevant than many assume. In the EU, Article 4 of the DSM Directive (2019/790) lets rights holders reserve text and data mining in a machine-readable way. robots.txt and ai.txt are exactly such machine-readable opt-outs: set correctly, you document a usage reservation that AI trainers must respect if they want to rely on the TDM exception. The EU AI Act likewise points to this reservation for training data.
In practice this means a well-maintained robots.txt and ai.txt are not just SEO hygiene but part of your copyright governance — they create a paper trail. Anyone who monetizes or licenses content should set the reservation deliberately and consistently and mirror it in the website's terms of use. But never rely on it alone: against actors who ignore standards, only server-side protection helps. These files govern the behavior of cooperative, reputable providers; they are no substitute for authentication, contracts, and, in the worst case, legal action.
Should I allow AI crawlers at all?
In most cases the answer is yes, at least the search crawlers. The decision hinges on three questions. First: do you live on visibility (magazine, SaaS, consulting, local business)? Then you want to appear in AI answers, because Gartner projects that traditional search query volume will fall roughly 25 percent by 2026 as users increasingly turn to AI assistants — miss that and you lose reach. Second: is your content itself the product you sell (paywalled journalism, databases, course material)? Then a training opt-out combined with search permission is usually the best compromise. Third: do you process sensitive or personal data? Then that path belongs behind authentication anyway, not merely in a robots.txt.
So make the decision deliberately rather than blocking everything by reflex. Block it all and you vanish from ChatGPT, Perplexity, and Google AI Overviews — the channels through which a growing share of research now flows. A differentiated [GEO strategy](/magazin/generative-engine-optimization-guide) almost always beats the sledgehammer block, and our [guide to llms.txt](/magazin/what-is-llms-txt) shows how to steer the crawlers you do allow toward your best content.
Common mistakes and how to avoid them
The most expensive mistake is blanket-blocking every AI crawler. Many teams set `User-agent: *` with `Disallow: /` and then wonder why they no longer appear in any AI answer. Differentiating almost always beats a global block: opt out of training, allow search and citation.
The second classic is conflating blocking with comprehension. llms.txt protects nothing — anyone who thinks an llms.txt controls AI access is mistaken. Conversely, a perfect robots.txt does not improve how your content is understood; for that you need llms.txt plus clean, passage-level citable content.
Third mistake: stale files. AI crawler user-agents change; OpenAI split GPTBot and OAI-SearchBot, and Google introduced Google-Extended. Schedule a quarterly review. Fourth mistake: the wrong path or a typo in the user-agent name — even a small misspelling renders a directive useless. Always validate live in the browser and in your logs. Fifth mistake: contradictory rules between robots.txt and ai.txt — keep both consistent, otherwise you create ambiguous signals that crawlers may interpret differently.
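The fourth mistake, a misspelled user-agent, is easy to catch automatically. Here is a minimal sketch that compares each `User-agent` line against the crawler names listed in this article and flags near misses using `difflib` from the standard library. The function name `lint_user_agents` and the similarity cutoff of 0.75 are assumptions you may want to tune.

```python
import difflib

# Crawler names as listed in this article; extend the list as the landscape changes.
KNOWN_AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Google-Extended", "PerplexityBot", "CCBot",
]


def lint_user_agents(robots_txt: str) -> list[str]:
    """Warn about User-agent values that look like typos of known AI crawlers."""
    known_lower = {name.lower(): name for name in KNOWN_AI_AGENTS}
    warnings = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        key, _, value = line.partition(":")
        if key.strip().lower() != "user-agent":
            continue
        agent = value.strip()
        if agent == "*" or agent.lower() in known_lower:
            continue  # wildcard or exact (case-insensitive) match
        close = difflib.get_close_matches(agent.lower(), list(known_lower), n=1, cutoff=0.75)
        if close:
            warnings.append(
                f"line {number}: '{agent}' looks like a typo of '{known_lower[close[0]]}'"
            )
    return warnings


print(lint_user_agents("User-agent: GTPBot\nDisallow: /\n"))
```

Names that are simply unknown to the list, such as other search engine crawlers, pass without a warning; only near misses of the listed names are reported.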
How to verify it actually works
Setting a directive is one thing; proving its effect is another. The most reliable test is your server logs: filter for the known user-agents (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot) and check whether, after you set a block, they actually disappear from the affected paths. Reputable crawlers usually respond within a few days, because they re-read robots.txt periodically. As a complement, you can use the robots.txt report in Google Search Console (it replaced the older robots.txt tester) and run spot checks inside the AI engines themselves: ask ChatGPT or Perplexity about one of your topics and see whether your domain appears as a source. If it does not, despite strong content, an accidental block is worth investigating.
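The log check can be sketched in a few lines. This example assumes access logs in the common "combined" format, where the request line and the user-agent are both quoted; the function name `count_ai_hits`, the sample log lines, and the IP addresses are made up for illustration.

```python
from collections import Counter

AI_AGENTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")


def count_ai_hits(log_lines, path_prefix: str = "/") -> Counter:
    """Count requests per AI crawler for paths starting with path_prefix."""
    hits = Counter()
    for line in log_lines:
        # Combined format: ip - - [time] "METHOD /path HTTP/x" status size "referer" "user-agent"
        parts = line.split('"')
        if len(parts) < 6:
            continue  # not a combined-format line
        request_fields = parts[1].split()
        user_agent = parts[5].lower()
        if len(request_fields) < 2 or not request_fields[1].startswith(path_prefix):
            continue
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                hits[agent] += 1
    return hits


# Made-up sample lines; real logs would be read from your server's log file.
SAMPLE_LOG = [
    '203.0.113.7 - - [12/Jan/2026:10:00:00 +0000] "GET /magazin/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.8 - - [12/Jan/2026:10:01:00 +0000] "GET /magazin/b HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.9 - - [12/Jan/2026:10:02:00 +0000] "GET /pricing HTTP/1.1" 200 901 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.4 - - [12/Jan/2026:10:03:00 +0000] "GET /magazin/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(count_ai_hits(SAMPLE_LOG, path_prefix="/magazin/"))
```

Run it once before and once a few days after changing a rule: a crawler you blocked should drop to zero on the affected paths, while the ones you allowed should keep appearing.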
The one-sentence summary
If you take away only one thing: robots.txt decides access, ai.txt decides use, llms.txt decides comprehension — block deliberately, document your intent, and make it easy for models to cite you correctly. These three files cost you an hour of setup and help determine whether your brand stays visible in the AI era or vanishes into the background noise.
