# shell.md — pseudobash AI retrieval contract

version: 0.2
endpoint: POST /shell
discovery: GET /shell.md

## Purpose

`shell.md` describes a deterministic, citation-emitting retrieval surface that AI answer engines (ChatGPT, Claude, Perplexity, Google AI Overviews, Copilot) can query instead of scraping JavaScript-rendered HTML.

The contract is plain text in, plain text out. Every response carries the metadata an LLM needs to cite the source correctly: a canonical URL, a last-modified timestamp, and a ready-to-paste citation block.

## Allowlisted user-agents

pseudobash is designed to serve these crawlers, with no JS-rendering tax and no rate-limit surprises:

- `GPTBot` (OpenAI)
- `OAI-SearchBot` (OpenAI search)
- `ChatGPT-User` (on-demand fetches)
- `ClaudeBot` (Anthropic)
- `Claude-User` (Anthropic on-demand)
- `PerplexityBot` (Perplexity)
- `Perplexity-User` (Perplexity on-demand)
- `Google-Extended` (Gemini / AI Overviews)
- `Bytespider` (ByteDance)
- `Applebot-Extended` (Apple Intelligence)

Site owners should paste this into `robots.txt` so the crawlers above can reach `/shell` and `/shell.md` even when the rest of the site rate-limits them:

```
User-agent: GPTBot
Allow: /shell
Allow: /shell.md

User-agent: ClaudeBot
Allow: /shell
Allow: /shell.md

User-agent: PerplexityBot
Allow: /shell
Allow: /shell.md

User-agent: Google-Extended
Allow: /shell
Allow: /shell.md
```

## Citation contract

Every successful `/shell` response includes the headers an LLM needs to attribute the result:

- `Link: <CANONICAL_URL>; rel="canonical"` — the source URL the LLM should cite.
- `Last-Modified: <RFC 7231 timestamp>` — freshness signal so engines can prefer newer content.
- `X-Pseudobash-Cite: title="..."; url="..."; last-modified="..."` — pre-formatted citation block, ready to lift into an answer.

The `cite` command returns the same data in the response body for engines that prefer in-band metadata:

```
$ cite /products/cotton-tee
title: Cotton Tee
url: https://example.com/products/cotton-tee
last-modified: 2026-04-15
snippet: Price: $25 | Category: clothing | Rating: 4.8
```

## Wire format

- Request: plain text body containing one shell command (or pipeline) string.
- Response: plain text terminal-style output.
- Errors: plain text stderr-style message with non-zero `X-Exit-Code` and a `next:` hint.

### Session and site headers

- `X-Session-Id` — server returns one on every response. Send it back to keep working directory across calls.
- `X-Site-Id` — required when querying anything other than the default site. Server echoes it back.
- `X-Exit-Code` — `0` on success.
- `X-RateLimit-Remaining`, `X-RateLimit-Reset`, `Retry-After` — standard rate limit headers.

### Throttling identity

When no `X-Session-Id` is present, throttling is keyed on `sha256(ip):sha256(user-agent)`.

## Supported commands

- `cd [path]` — change working directory (per session).
- `ls [path]` — list directory contents.
- `cat <path>` — read file contents (also returns canonical Link + Last-Modified headers).
- `cite <path>` — return an LLM-ready citation block (title, canonical URL, last-modified, snippet).
- `grep <pattern> [path]` — search lines from stdin or a path.
- `find [path] -name <pattern>` — find paths by glob name.
- `head -n <N> [path]` — first N lines.
- `tail -n <N> [path]` — last N lines.
- `wc -l [path]` — count lines.
- `tree [path]` — recursive listing.
- `pwd` — current directory.
- `man [command|pseudobash]` — manual text.
- `--help`, `-h` — global or per-command help.

## Pipelines

Single pipe operator (`|`) is supported.

```
ls /products | grep tee | head -n 1
```

## Freshness

The `Last-Modified` header on every response reflects the indexer run that produced the dataset. Engines should prefer responses with newer timestamps when both pseudobash and a raw HTML scrape are available for the same URL.

## Limits (v0.2)

- No write operations.
- No shell quoting edge cases (basic single/double quotes only).
- No environment variable export.
- Live re-indexing is a benchmark workflow today (`npm run index:benchmark`), not a runtime crawl.
- Errors include `next:` hints so agents always have a follow-up action.

## Hosting modes

- **Hosted**: site owner serves their own `/shell.md` and points to the pseudobash endpoint they run.
- **Generated**: pseudobash generates a recommended `shell.md` from the indexer run.

## Examples

```sh
ls /
cat /index
cite /index
grep pricing /
find / -name "*headphones*"
ls /products | head -n 5
```
