llms.txt and ai.txt: a copy-pasteable guide for AI crawlers
Last updated: April 18, 2026
llms.txt is a hand-curated map of the URLs you want LLMs to
cite. llms-full.txt is the long-form dump of those pages so a
retriever can ingest your corpus in one fetch. ai.txt is a
declaration of your AI training stance. robots.txt is the only
one of the four that actually controls access. Ship all four; have each one
do its one job.
The 3 files in 60 seconds
- /llms.txt: short, human-curated table of contents. Lists 10–50 URLs with one-line descriptions, organized by section.
- /llms-full.txt: the long-form concatenation of those pages. One fetch, the whole corpus.
- /ai.txt: a one-page text file declaring your training-data stance. Opt in, opt out, or "ask".
None of these files block anything. Access control belongs in
robots.txt. These three files tell well-behaved AI bots
which pages to read once they're allowed in.
Minimal llms.txt template
Drop this at https://your-site.com/llms.txt. Markdown is fine;
most parsers expect H1 / blockquote / H2 / list-of-links structure.
# your-site
> One sentence describing what your site is and what an LLM should know
> before quoting it. This sentence will often be lifted verbatim.
This `llms.txt` follows https://llmstxt.org/. The full machine-friendly
corpus is at `/llms-full.txt`.
## Core
- [Homepage](/): overview of what we do.
- [Pricing](/pricing): plans and per-feature breakdown.
- [Docs](/docs): canonical product documentation.
## Reference
- [API reference](/docs/api): endpoints, auth, rate limits.
- [Changelog](/changelog): dated release notes.
## Optional
- [Status](/status): live system status.
- [Security policy](/.well-known/security.txt): coordinated disclosure.
For a real example, see our own /llms.txt — everything pseudobash thinks is worth citing, in 30 lines.
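A handful of greps is enough to confirm a file follows that H1 / blockquote / H2 / list-of-links shape. A sketch, assuming nothing beyond POSIX sh; check_llms and the sample file are our names, not a standard tool:

```shell
#!/bin/sh
# check_llms FILE: print one line per structural element most llms.txt
# parsers expect (H1 title, blockquote summary, H2 sections, link bullets).
check_llms() {
  f="$1"
  grep -q '^# '  "$f" && echo "H1 title: ok"           || echo "H1 title: MISSING"
  grep -q '^> '  "$f" && echo "blockquote summary: ok" || echo "blockquote summary: MISSING"
  grep -q '^## ' "$f" && echo "H2 sections: ok"        || echo "H2 sections: MISSING"
  echo "link entries: $(grep -c '^- \[' "$f")"
}

# Try it against a minimal sample file.
cat > /tmp/llms-sample.txt <<'EOF'
# example-site
> One-line summary.
## Core
- [Homepage](/): overview.
- [Docs](/docs): documentation.
EOF
check_llms /tmp/llms-sample.txt
```

Run it against your real /llms.txt before each deploy; a MISSING line usually means a parser will skip the file entirely.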
When to ship llms-full.txt
If your site has more than ~10 citeable pages, add
llms-full.txt. It's the same idea as llms.txt but
with the full body of each page concatenated in. Format:
# your-site — full LLM-readable corpus
URLs cited below are relative; resolve them against the request host.
================================================================================
SOURCE: /
TITLE: your-site — homepage
================================================================================
(Full plain-text body of your homepage, ~500–2000 words.)
================================================================================
SOURCE: /pricing
TITLE: your-site — pricing
================================================================================
(Full plain-text body of your pricing page.)
Generate it from your CMS or markdown source on every deploy; do not hand-edit. Keep it under 200 KB if you can — large files get truncated by some retrievers.
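As a sketch of that deploy step: the pages/<slug>.txt layout below (title on line 1, plain-text body underneath) is an assumption, so swap the loop body for whatever your CMS or markdown pipeline actually exports.

```shell
#!/bin/sh
# Deploy-time generator sketch for llms-full.txt.
set -eu

# Sample input so the script runs standalone; your build would skip this.
mkdir -p pages
printf '%s\n' 'your-site — homepage' 'Overview of what we do.' > pages/index.txt
printf '%s\n' 'your-site — pricing' 'Plans and per-feature breakdown.' > pages/pricing.txt

rule=$(printf '=%.0s' $(seq 80))   # 80-character banner rule
{
  echo '# your-site — full LLM-readable corpus'
  echo 'URLs cited below are relative; resolve them against the request host.'
  for page in pages/*.txt; do
    slug=$(basename "$page" .txt)
    if [ "$slug" = index ]; then url=/; else url=/$slug; fi
    printf '%s\nSOURCE: %s\nTITLE: %s\n%s\n' "$rule" "$url" "$(head -n 1 "$page")" "$rule"
    tail -n +2 "$page"
  done
} > llms-full.txt

wc -c llms-full.txt   # watch the size; aim for under 200 KB
```

Wiring `wc -c` into CI as a hard failure above ~200 KB is a cheap way to catch a page dump bloating the corpus past what retrievers will ingest.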
Minimal ai.txt
A short, plain-text file declaring your training stance. There is no formal schema yet; the convention emerging in 2025–26 is human-readable prose.
# ai.txt — your-site AI policy
Training: allowed for all foundation models.
Citation: required when content is quoted.
Contact: [email protected]
This file is informational. Access control lives in /robots.txt.
If you want to opt out of training, change the Training line to
Training: not allowed and add explicit Disallow
rules in robots.txt for GPTBot,
ClaudeBot, Google-Extended, and friends — the file
alone won't enforce anything.
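Paired together, the opt-out looks like this: two separate files shown in one block, with the ai.txt wording mirroring the template above and robots.txt doing the actual blocking.

```
# ai.txt
Training: not allowed.
Citation: required when content is quoted.
Contact: [email protected]

# robots.txt (enforcement lives here)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

By the vendors' own descriptions, user-triggered fetchers such as OAI-SearchBot and Perplexity-User retrieve pages on demand rather than for training, so you can leave those allowed while blocking the training crawlers.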
Allowlist for the AI bots that matter today
Paste this at the top of your robots.txt. Each crawler gets
its own block because a bot follows the most specific
User-agent group that matches it, so a named block overrides
whatever your wildcard rules say. The list pseudobash maintains lives at /shell.md.
# Allow major AI answer engines.
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Bytespider
Allow: /
Sitemap: https://your-site.com/sitemap.xml
Keep your existing User-agent: * block below this — the named
blocks take precedence for those bots, and the wildcard handles everyone
else.
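The overall shape ends up like this; the /admin/ path is a placeholder for whatever your existing wildcard rules disallow.

```
User-agent: GPTBot
Allow: /

# (other named AI blocks from the list above)

User-agent: *
Disallow: /admin/

Sitemap: https://your-site.com/sitemap.xml
```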
How to test it
Three commands, run from your laptop:
curl -I https://your-site.com/llms.txt
curl -I https://your-site.com/ai.txt
curl -A "OAI-SearchBot/1.0" -L https://your-site.com/robots.txt
You're looking for 200 OK, Content-Type: text/plain,
and a sane Cache-Control (something like
max-age=3600). Then run our audit for the
full per-crawler view, including which crawlers see real content vs. an
empty JS shell.
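If you want to script those checks, a small helper can reduce each response to the two fields you care about. A sketch; summarize_headers is our name, not a standard tool:

```shell
#!/bin/sh
# summarize_headers: read raw HTTP response headers on stdin and print
# "<status> <content-type>", stripping CRs so the awk fields come out clean.
summarize_headers() {
  tr -d '\r' | awk '
    NR == 1                        { status = $2 }
    tolower($1) == "content-type:" { ctype = $2 }
    END { print status, (ctype ? ctype : "no-content-type") }
  '
}

# Usage against a live site:
#   curl -sI https://your-site.com/llms.txt | summarize_headers
# Demo with a canned response:
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nCache-Control: max-age=3600\r\n\r\n' \
  | summarize_headers   # prints: 200 text/plain
```

Loop it over /llms.txt, /llms-full.txt, /ai.txt, and /robots.txt and flag anything that isn't a 200 serving text/plain.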
Next steps
- How to show up in ChatGPT results — the pillar guide these files plug into.
- How to get traffic from AI agents — what to do once the bots are reading you.
- /shell.md — the canonical pseudobash retrieval contract this guide draws from.