
llms.txt and ai.txt: a copy-pasteable guide for AI crawlers

Last updated: April 18, 2026

Short answer

llms.txt is a hand-curated map of the URLs you want LLMs to cite. llms-full.txt is the long-form dump of those pages so a retriever can ingest your corpus in one fetch. ai.txt is a declaration of your AI training stance. robots.txt is the only one of the four that actually controls access. Ship all four; have each one do its one job.

The three files in 60 seconds

None of these files block anything. Access control belongs in robots.txt. These three files tell well-behaved AI bots which pages to read once they're allowed in.

Minimal llms.txt template

Drop this at https://your-site.com/llms.txt. Markdown is fine; most parsers expect H1 / blockquote / H2 / list-of-links structure.

# your-site

> One sentence describing what your site is and what an LLM should know
> before quoting it. This sentence will often be lifted verbatim.

This `llms.txt` follows https://llmstxt.org/. The full machine-friendly
corpus is at `/llms-full.txt`.

## Core

- [Homepage](/): overview of what we do.
- [Pricing](/pricing): plans and per-feature breakdown.
- [Docs](/docs): canonical product documentation.

## Reference

- [API reference](/docs/api): endpoints, auth, rate limits.
- [Changelog](/changelog): dated release notes.

## Optional

- [Status](/status): live system status.
- [Security policy](/.well-known/security.txt): coordinated disclosure.

For a real example, see our own /llms.txt — everything pseudobash thinks is worth citing, in 30 lines.
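Before you ship the file, it's worth a quick shape check. This is a minimal sketch: the heredoc stands in for your real llms.txt (fetch yours with something like curl -sO https://your-site.com/llms.txt and drop the heredoc), and the greps just count the structural markers parsers look for.

```shell
# Sample llms.txt for demonstration; replace with your real file.
cat > llms.txt <<'EOF'
# your-site

> One sentence describing the site.

## Core

- [Homepage](/): overview of what we do.
- [Docs](/docs): canonical product documentation.
EOF

grep -c '^# '  llms.txt   # H1 count: want exactly 1
grep -c '^> '  llms.txt   # blockquote lines: want at least 1
grep -c '^## ' llms.txt   # one per section
grep -c '^- \[' llms.txt  # one per link entry
```

If the H1 count isn't 1 or the link count is 0, most llms.txt parsers will silently give up on the file.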

When to ship llms-full.txt

If your site has more than ~10 citeable pages, add llms-full.txt. It's the same idea as llms.txt but with the full body of each page concatenated in. Format:

# your-site — full LLM-readable corpus

URLs cited below are relative; resolve them against the request host.

================================================================================
SOURCE: /
TITLE: your-site — homepage
================================================================================

(Full plain-text body of your homepage, ~500–2000 words.)

================================================================================
SOURCE: /pricing
TITLE: your-site — pricing
================================================================================

(Full plain-text body of your pricing page.)

Generate it from your CMS or markdown source on every deploy; do not hand-edit. Keep it under 200 KB if you can — large files get truncated by some retrievers.
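A deploy-time generator can be very small. The sketch below assumes a layout of my own invention — a pages/ directory where each pages/<slug>.txt holds the page title on line 1 and the plain-text body below it — so adapt the path-to-URL mapping to whatever your CMS actually exports. The two printf'd sample pages exist only to make the sketch runnable.

```shell
# Demo input: pages/<slug>.txt, title on line 1, body after (assumed layout).
mkdir -p pages
printf 'your-site — homepage\nFull plain-text body of the homepage.\n' > pages/index.txt
printf 'your-site — pricing\nFull plain-text body of the pricing page.\n' > pages/pricing.txt

sep='================================================================================'
out=llms-full.txt
{
  printf '# your-site — full LLM-readable corpus\n\n'
  printf 'URLs cited below are relative; resolve them against the request host.\n\n'
  for f in pages/*.txt; do
    slug=$(basename "$f" .txt)
    if [ "$slug" = index ]; then url=/; else url=/$slug; fi
    # Emit the separator block, then the body (everything after the title line).
    printf '%s\nSOURCE: %s\nTITLE: %s\n%s\n\n' "$sep" "$url" "$(head -n 1 "$f")" "$sep"
    tail -n +2 "$f"
    printf '\n'
  done
} > "$out"
```

Wire this into your build step so the file can never drift out of sync with the pages it mirrors.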

Minimal ai.txt

A short, plain-text file declaring your training stance. There is no formal schema yet; the convention emerging in 2025–26 is human-readable prose.

# ai.txt — your-site AI policy

Training: allowed for all foundation models.
Citation: required when content is quoted.
Contact: [email protected]

This file is informational. Access control lives in /robots.txt.

If you want to opt out of training, change the Training line to Training: not allowed and add explicit Disallow rules in robots.txt for GPTBot, ClaudeBot, Google-Extended, and friends — the file alone won't enforce anything.
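As a sketch, the opt-out pairing might look like this (the bot list is illustrative, not exhaustive):

```
# ai.txt
Training: not allowed
Citation: required when content is quoted.
Contact: [email protected]

# robots.txt (the part that actually enforces it)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```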

Allowlist for the AI bots that matter today

Paste this at the top of your robots.txt. Each crawler gets its own block because, under the robots.txt spec (RFC 9309), a bot that finds a group naming it ignores your User-agent: * rules entirely. The list pseudobash maintains lives at /shell.md.

# Allow major AI answer engines.

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Allow: /

Sitemap: https://your-site.com/sitemap.xml

Keep your existing User-agent: * block below this — the named blocks take precedence for those bots, and the wildcard handles everyone else.
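To double-check what you shipped, a crude line-based scan (not a full robots.txt parser — it ignores path specificity and grouping edge cases) can list every named crawler you've granted a blanket Allow. The heredoc is a sample; point awk at your real robots.txt instead.

```shell
# Sample robots.txt for demonstration; use your real file in practice.
cat > robots.txt <<'EOF'
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /admin/
EOF

# Remember the most recent User-agent line; print it when a blanket
# "Allow: /" follows and the agent isn't the wildcard.
awk 'tolower($1) == "user-agent:" { ua = $2 }
     tolower($1) == "allow:" && $2 == "/" && ua != "*" { print ua }' robots.txt
# prints GPTBot and ClaudeBot for the sample above
```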

How to test it

Three commands, run from your laptop:

curl -I https://your-site.com/llms.txt
curl -I https://your-site.com/ai.txt
curl -A "OAI-SearchBot/1.0" -L https://your-site.com/robots.txt

You're looking for 200 OK, Content-Type: text/plain, and a sane Cache-Control (something like max-age=3600). Then run our audit for the full per-crawler view, including which crawlers see real content vs. an empty JS shell.
