How to Make Your Payload CMS Site AI-Ready Right Now

Make your Next.js + Payload CMS site AI-ready: let AI crawlers in, serve raw Markdown, and wire up @graph schema to let LLMs find and cite your content.

Unlike Googlebot, AI crawlers flood your site with heavy traffic, look for raw Markdown, and scan for custom metadata. If you’re building on Next.js and Payload, implementing these optimizations takes minimal overhead, but leaving them misconfigured means your content stays invisible to LLMs.

The shift is already measurable. Ahrefs analyzed 300,000 keywords and found that when AI Overviews appear, the top organic result loses 34.5% of its clicks. Their February 2026 follow-up put that figure at 58%. Vercel went from less than 1% to 10% of new signups arriving from ChatGPT in six months. The traffic mix is moving - quietly, but fast.

TL;DR

  • Access first. AI crawlers are blocked by Cloudflare's Bot Fight Mode by default. Unblock search bots at the CDN level, then use robots.ts to explicitly allow search indexers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) while blocking training scrapers (GPTBot, ClaudeBot, Google-Extended). Nothing else here matters until requests reach your server.

  • Serve Markdown via content negotiation. Add a middleware check for Accept: text/markdown and return raw Markdown from a route handler. This cuts token overhead by ~80% versus HTML and makes your content structurally preferred by LLM pipelines — high signal, low effort.

  • Link your structured data into one @graph. Three Schema types - BreadcrumbList for hierarchy, FAQPage for citable facts, and Article linked to Author and Organization - are what AI search bots actually use to assess relevance and authority. Without structured authorship, models have no signal to trust your content over anyone else's.

  • Skip llms.txt unless you run a documentation site. Adoption is ~2% outside developer tooling, Google ignores it entirely, and AI search crawlers don't fetch it unprompted. Spend that time on the Markdown endpoint instead.

How Do You Let AI Crawlers Reach Your Site?

AI bots are blocked by default on Cloudflare’s Bot Fight Mode. You have to explicitly unblock them in Cloudflare, then split search bots from training scrapers in your robots.ts.

Before you optimize anything, confirm the crawlers can reach your server.

If you use Cloudflare, Bot Fight Mode is usually on by default and blocks GPTBot and PerplexityBot before requests reach your app. Choose the option for your plan:

  1. Paid plans - flip the AI-bot toggles in Security → Bots.
  2. Free plan - add a WAF Custom Rule with action Skip matching the user agents you want through.

Same goal either way: the request has to reach your origin before anything downstream matters.

Cloudflare Security → Bots panel: Bot Fight Mode on, AI Scrapers and Crawlers off.

On the Next.js side, you handle this in app/robots.ts. A wildcard rule lets everyone in. The smarter move is to welcome AI search while blocking AI training - surface in real-time answers on ChatGPT and Claude, without silently feeding the next round of LLM training.

Each major vendor now ships separate user agents for those two jobs. Split them cleanly.

| User-Agent | Purpose | Action |
| --- | --- | --- |
| OAI-SearchBot | Search indexer | Allow |
| ChatGPT-User | Live in-chat fetches | Allow |
| Claude-SearchBot | Search indexer | Allow |
| Claude-User | Live in-chat fetches | Allow |
| PerplexityBot | Conversational search crawler | Allow |
| GPTBot | Training scraper | Block |
| ClaudeBot | Training scraper | Block |
| Google-Extended | Gemini training corpus | Block |

// app/robots.ts
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        // AI search indexers and live in-chat fetchers: allowed everywhere.
        userAgent: [
          "OAI-SearchBot",
          "ChatGPT-User",
          "Claude-SearchBot",
          "Claude-User",
          "PerplexityBot",
        ],
        allow: "/",
      },
      {
        // Training scrapers: blocked everywhere.
        userAgent: ["GPTBot", "ClaudeBot", "Google-Extended"],
        disallow: "/",
      },
      {
        // Everyone else: the public site, minus the admin panel and API.
        userAgent: "*",
        allow: "/",
        disallow: ["/admin", "/api"],
      },
    ],
    sitemap: `${process.env.NEXT_PUBLIC_SITE_URL}/sitemap.xml`,
  };
}

Deploy, then check your access logs three to five days later. If you don't see successful requests from OAI-SearchBot or PerplexityBot, something upstream - a WAF, CDN, or origin firewall - is still blocking them, and it's time to audit your network configuration and find the bottleneck.
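One way to run that log check, as a minimal sketch: the helper below counts successful (2xx) requests per allowed AI user agent from raw access-log lines. It assumes a combined-log-style format where the status code follows the quoted request line and the user agent is the last quoted field. `countAiBotHits` and `LINE_PATTERN` are hypothetical names, so adjust the parsing to whatever your host or CDN actually emits.

```typescript
// Search and live-fetch agents that the robots.ts rules allow.
// Training scrapers are left out on purpose: they are supposed to stay away.
const AI_SEARCH_AGENTS = [
  "OAI-SearchBot",
  "ChatGPT-User",
  "Claude-SearchBot",
  "Claude-User",
  "PerplexityBot",
];

// Assumed line shape: ... "GET /path HTTP/1.1" 200 1234 "referer" "user agent"
const LINE_PATTERN = /"[A-Z]+ \S+ [^"]*" (\d{3}) .*"([^"]*)"$/;

function countAiBotHits(lines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const agent of AI_SEARCH_AGENTS) counts[agent] = 0;

  for (const line of lines) {
    const match = LINE_PATTERN.exec(line.trim());
    if (!match) continue; // skip lines that do not fit the assumed format

    const status = Number(match[1]);
    const userAgent = match[2];
    if (status < 200 || status >= 300) continue; // only successful responses count

    const agent = AI_SEARCH_AGENTS.find((name) => userAgent.includes(name));
    if (agent) counts[agent] += 1;
  }
  return counts;
}
```

A zero next to an agent after several days suggests the request is being stopped before it reaches your origin. A run of 403s for the same agent in the raw log points the same way.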

How Do You Expose a Markdown Endpoint for Every Page?

Use content negotiation to serve Markdown. Return raw Markdown when bots hit your URLs with an Accept: text/markdown header.

Sure, AI crawlers can parse HTML. But they process plain Markdown way more efficiently: fewer tokens, zero layout noise, and a much cleaner structure. Cloudflare reported roughly 80% fewer tokens on one of their own posts when serving Markdown vs. HTML. That's why LLM pipelines prefer it when it's available, and why a Markdown version of every public page is the highest-signal surface you can offer them.

The implementation has two parts:

  1. A middleware rewrite (src/proxy.ts) - checks for the Accept: text/markdown header and routes matching requests to an internal /md/* path. This keeps URLs clean and uniform for both users and crawlers, with no need for .md extensions or extra canonicals.
  2. A route handler (app/md/posts/[slug]/route.ts) - fetches the document from Payload and returns it as plain Markdown.

// src/proxy.ts
import { NextResponse } from "next/server";
import type { NextRequest } from "next/server";

export default function proxy(request: NextRequest) {
  const accept = request.headers.get("accept") ?? "";

  // Clients that ask for Markdown are rewritten to the internal /md/* routes;
  // the public URL stays the same.
  if (accept.includes("text/markdown")) {
    return NextResponse.rewrite(
      new URL(`/md${request.nextUrl.pathname}`, request.url),
    );
  }

  // Everyone else falls through to the normal HTML page.
  return NextResponse.next();
}

// Skip Next.js internals, API routes, and the /md/* routes themselves.
export const config = {
  matcher: ["/((?!_next/|api/|md/).*)"],
};

// app/md/posts/[slug]/route.ts
import { getPayload } from "payload";
import configPromise from "@payload-config";

// buildMarkdown is your own helper (described below); import it from wherever you define it.

export async function GET(
  _req: Request,
  { params }: { params: Promise<{ slug: string }> },
) {
  const { slug } = await params;

  const payload = await getPayload({ config: configPromise });
  const result = await payload.find({
    collection: "posts",
    where: {
      and: [{ slug: { equals: slug } }, { _status: { equals: "published" } }],
    },
    depth: 1,
    limit: 1,
    draft: false,
    overrideAccess: false, // respect collection access control
  });
  const post = result.docs[0];
  if (!post) return new Response("Not found", { status: 404 });

  return new Response(buildMarkdown(post), {
    headers: {
      "Content-Type": "text/markdown; charset=utf-8",
      Vary: "Accept", // HTML and Markdown share one URL, so caches must key on Accept
    },
  });
}

buildMarkdown assembles the response: an H1 title, the body, and whatever fields you need (publication date, excerpt, author). What goes inside the body depends on how you store content.

If your collection has a contentMarkdown plain-text field, use it directly. If content lives in Lexical JSON - the default in Payload 3.x - use the built-in convertLexicalToMarkdown from @payloadcms/richtext-lexical.
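To make that concrete, here is a minimal sketch of what buildMarkdown could look like. It assumes the body is already available as a contentMarkdown string and that the author relationship is populated (depth: 1). The `PostForMarkdown` type and its field names are illustrative stand-ins, not Payload's generated types.

```typescript
// Simplified stand-in for a Payload post document; field names are assumptions.
type PostForMarkdown = {
  title: string;
  publishedAt?: string | null; // ISO date string
  excerpt?: string | null;
  author?: { name: string } | null;
  contentMarkdown: string; // body already serialized to Markdown
};

function buildMarkdown(post: PostForMarkdown): string {
  // Short metadata list under the title; entries are skipped when the field is empty.
  const meta: string[] = [];
  if (post.author?.name) meta.push(`- Author: ${post.author.name}`);
  if (post.publishedAt) meta.push(`- Published: ${post.publishedAt.slice(0, 10)}`);

  const sections = [
    `# ${post.title}`,
    meta.join("\n"),
    post.excerpt ? `> ${post.excerpt}` : "",
    post.contentMarkdown.trim(),
  ];

  // Drop empty sections, separate the rest with blank lines, end with a newline.
  return sections.filter((section) => section.length > 0).join("\n\n") + "\n";
}
```

Keeping the helper a pure function of the document makes it easy to unit-test and to reuse from other /md/* route handlers.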

For complex content models with custom blocks, don't serialize on every request. Add a beforeChange hook that converts Lexical to Markdown at save time and stores it alongside the source. Serialization is a publish-time operation, not a read-time one.
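A minimal sketch of that save-time hook, under a few assumptions: the rich text lives in a field called content, the Markdown copy goes into contentMarkdown, and the converter is passed in as a function (in a real project it would wrap convertLexicalToMarkdown). The types here are simplified stand-ins for Payload's hook types so the snippet stands on its own, and `createMarkdownSyncHook` is a hypothetical name.

```typescript
// Simplified stand-ins for Payload's hook argument types (assumptions, not the real types).
type PostData = { content?: unknown; contentMarkdown?: string; [key: string]: unknown };
type BeforeChangeArgs = { data: PostData };
type LexicalToMarkdown = (lexicalState: unknown) => string;

// Returns a beforeChange-style hook that stores a Markdown copy of `content` on save.
function createMarkdownSyncHook(toMarkdown: LexicalToMarkdown) {
  return ({ data }: BeforeChangeArgs): PostData => {
    // Partial updates may not include the body; leave the stored Markdown untouched.
    if (data.content === undefined || data.content === null) return data;
    return { ...data, contentMarkdown: toMarkdown(data.content) };
  };
}
```

You would register the returned function under the collection's hooks.beforeChange array. The route handler then reads contentMarkdown directly and never serializes on a request.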

The Vary: Accept header on the Markdown response matters. It tells caches and CDNs to store HTML and Markdown variants separately, so a browser never gets the Markdown body and an agent never gets the HTML page.
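To see why the header matters, here is a toy model of how a shared cache picks its storage key. It is a deliberately simplified sketch with a hypothetical `cacheKey` function, not how any particular CDN implements caching. Real caches differ in how fully they honor Vary, so check your provider's behavior.

```typescript
// Toy cache-key model: the URL plus the value of every request header
// that the response's Vary header names.
function cacheKey(
  url: string,
  vary: string,
  requestHeaders: Record<string, string>,
): string {
  // Header names are case-insensitive, so normalize them first.
  const normalized: Record<string, string> = {};
  for (const [name, value] of Object.entries(requestHeaders)) {
    normalized[name.toLowerCase()] = value.trim();
  }

  const parts = vary
    .split(",")
    .map((name) => name.trim().toLowerCase())
    .filter((name) => name.length > 0)
    .sort()
    .map((name) => `${name}=${normalized[name] ?? ""}`);

  return [url, ...parts].join("|");
}
```

With Vary: Accept, a browser request and a Markdown request for /posts/hello produce different keys and are stored separately. Without it, both collapse to the bare URL, and whichever variant is cached first gets served to everyone.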

curl with Accept: text/markdown returning a clean Markdown body from the same URL a browser would get HTML for

Which Structured Data Do AI Search Bots Actually Read?

Three JSON-LD types - BreadcrumbList, FAQPage, and Article linked to Author/Organization - cover the structural, citation, and authority signals AI search bots actually use. Generate them under one @graph per page.

AI search bots prefer content they can parse cleanly. Three specific Schema types handle most of this heavy lifting, and all of them map perfectly onto Payload CMS collections.

| Schema | What it signals | Source in Payload |
| --- | --- | --- |
| BreadcrumbList | Hierarchy | plugin-nested-docs breadcrumbs |
| FAQPage | Citation units | faqs collection or array field |
| Article + Author + Organization | Authority | posts collection + globals |

You can drop these into one @graph block or split them across separate JSON-LD scripts - parsers handle both the same way. We'll use @graph below.

Editorial tooling that feeds the same signals

The structured data above is only as good as the content it describes. Two plugins close that loop inside the Payload admin. @payloadcms/plugin-seo exposes auto-generate hooks for title, description, and image — you wire them to your own functions, so editors can regenerate meta fields from the document's actual content rather than filling them in by hand.

For the content itself, payload-ai adds AI-assisted generation directly to Text and RichText fields. The practical value for GEO isn't speed, it's consistency: content drafted with your FAQ schema, author fields, and sameAs links already populated gives the structured data something accurate to point at.

Hierarchy from plugin-nested-docs

@payloadcms/plugin-nested-docs adds two fields to every document in the collections you enable: parent is a self-referencing relationship (editors pick the parent doc), and breadcrumbs is an auto-populated array of ancestors, each entry with a label and a URL.

Payload admin: Parent field and auto-generated Breadcrumbs array from plugin-nested-docs

Hierarchy is one of the strongest signals AI crawlers use. A page at /docs/getting-started/installation already tells a model what it's about and where it fits - before reading a single line of body text. BreadcrumbList JSON-LD turns that path into something machines can read, and plugin-nested-docs builds it for you: every document already knows its full chain of parent pages, with labels and URLs, ready to drop into the schema below. No URL logic to write by hand, and nothing breaks when an editor renames a parent page.

FAQ and Author/Organization

An FAQPage schema packages questions and answers into a single machine-readable unit that AI bots can lift verbatim - attribution included. Think of it as an inverted pyramid: keep the acceptedAnswer to one or two tight sentences, and leave the deep details in the page body for human readers. Long, vague answers don’t get cited; concise, specific ones do. In Payload, a dedicated faqs collection or a simple array field gives editors a clean, structured way to manage this.

Article linked to Author and Organization is how machines read authority. A bio written in the page text does nothing for them. The links have to be set in the markup. The Author is a Person with a sameAs array pointing to verified profiles (LinkedIn, GitHub, ORCID); the publisher is an Organization global edited once with its own sameAs and logo. Without structured authorship, you lose the authority signal - leaving the LLM to just guess whether your content is trustworthy.

The combined @graph

// `doc`, `breadcrumbs` (doc.breadcrumbs from plugin-nested-docs), and `org`
// (the Organization global) are fetched from Payload before this point.
const jsonLd = {
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "BreadcrumbList",
      itemListElement: [
        {
          "@type": "ListItem",
          position: 1,
          name: "Home",
          item: process.env.NEXT_PUBLIC_SITE_URL,
        },
        // Ancestors start at position 2, after the hard-coded Home entry.
        ...breadcrumbs.map((item, index) => ({
          "@type": "ListItem",
          position: index + 2,
          name: item.label,
          item: item.url,
        })),
      ],
    },
    {
      "@type": "FAQPage",
      mainEntity: doc.faqs.map((item) => ({
        "@type": "Question",
        name: item.question,
        acceptedAnswer: {
          "@type": "Answer",
          text: item.answerPlainText,
        },
      })),
    },
    {
      "@type": "Article",
      headline: doc.title,
      datePublished: doc.publishedAt,
      author: {
        "@type": "Person",
        name: doc.author.name,
        sameAs: doc.author.socialLinks?.map((l) => l.url),
      },
      publisher: {
        "@type": "Organization",
        name: org.name,
        logo: {
          "@type": "ImageObject",
          url: org.logo.url,
        },
      },
    },
  ],
};

Two key rules here. First, render everything server-side: your schema must match the DOM. If you hide your FAQ behind a JS accordion that only mounts on click, your JSON-LD will point to phantom markup. Second, keep Organization as a Payload global - edit it once, inject it everywhere, and prevent data drift when your logo or legal name changes. Validate the final output with Google's Rich Results Test before shipping.
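For the server-side rendering part, a minimal sketch of how the object could be serialized before it goes into a script tag with type application/ld+json. `serializeJsonLd` is a hypothetical helper name. The escaping step follows the common recommendation to neutralize "<" so that CMS-authored text cannot close the script tag early.

```typescript
// Serialize JSON-LD for inlining in a <script type="application/ld+json"> tag.
// Escaping "<" keeps a string like "</script>" inside the data from ending the tag.
function serializeJsonLd(jsonLd: unknown): string {
  return JSON.stringify(jsonLd).replace(/</g, "\\u003c");
}
```

In a Server Component, the resulting string would typically be passed to the script element through dangerouslySetInnerHTML. JSON.stringify also drops properties whose value is undefined, so an author without socialLinks simply ends up with no sameAs key instead of an invalid one.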

When does llms.txt actually matter?

Mostly worth it for developer documentation. AI agents only fetch llms.txt when explicitly pointed at it - and that traffic is almost entirely coding agents reading API docs in real time. For a marketing site or blog: ~2% adoption, Google ignores it, and inference crawlers barely fetch it on their own.

llms.txt is a proposed plain-text index for AI crawlers that lists your site name, description, and grouped links. The idea is to give models a clean map so they find what matters, but most sites don't actually need it. A lot of companies publish one: Cloudflare, Anthropic, Vercel, Stripe, Supabase, and thousands of others. Directories like llmstxthub.com and directory.llmstxt.cloud list who's doing it.

Look at who actually adopts it, though. Almost every meaningful adopter is a developer-tooling company with documentation that AI coding agents need to read in real time. Cloudflare went so far as to embed a literal warning at the top of every doc page: "STOP! If you are an AI agent or LLM, read this before continuing... HTML wastes context." They publish per-product llms.txt and a llms-full.txt for the entire docs surface. The use case is concrete: a developer asks a coding agent to write something against the Workers API and points it at the docs. Only then does the agent fetch llms.txt to find the right page and pull the Markdown version. That flow works because the agent has a continuous, evolving need for an accurate API surface.

llms.txt files from Next.js, Perplexity, and Cloudflare - all developer documentation, the only segment where llms.txt actually gets fetched

That's not how a marketing site or a blog gets read. Visitors arrive, read, leave. AI crawlers - whether training scrapers like GPTBot and ClaudeBot or search indexers answering real-time queries - don't have an ongoing relationship with your content the way a coding agent has with a docs site. They just grab a sample and move on, and the numbers show it:

  • ~2% adoption of llms.txt across analyzed sites (Web Almanac 2025)
  • 1.1% of llms.txt requests come from OAI-SearchBot in a 30-day audit across 1,000 Adobe domains - the rest is Googlebot (Longato)
  • Google doesn't use llms.txt and has no plans to (Search Engine Land)
  • 94% of crawler hits in the most-cited counter-study came from OAI-SearchBot; GPTBot showed up on just 2 of 8 sites (Archer Education)

The pragmatic split:

  • Documentation portal (public API references, SDK guides, technical tutorials) - generate llms.txt and llms-full.txt straight from your CMS collections. The use case is real, and it only takes half a day to ship.
  • Blog or marketing site - skip the directory maps entirely. Instead, spend that time setting up Markdown content negotiation and cleaning up your HTML hierarchy. Those are the only surfaces AI search bots actually care about.
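For the documentation case, generating the file can be as small as the sketch below. It follows the proposed llms.txt layout (an H1 with the site name, a blockquote summary, then H2 groups of links). `buildLlmsTxt` and the `LlmsSection` shape are hypothetical, and in practice you would fill the sections from your Payload collections.

```typescript
type LlmsLink = { title: string; url: string; description?: string };
type LlmsSection = { heading: string; links: LlmsLink[] };

function buildLlmsTxt(
  siteName: string,
  summary: string,
  sections: LlmsSection[],
): string {
  const lines: string[] = [`# ${siteName}`, "", `> ${summary}`];

  for (const section of sections) {
    if (section.links.length === 0) continue; // skip empty groups

    lines.push("", `## ${section.heading}`, "");
    for (const link of section.links) {
      const note = link.description ? `: ${link.description}` : "";
      lines.push(`- [${link.title}](${link.url})${note}`);
    }
  }
  return lines.join("\n") + "\n";
}
```

Serve the result from a route handler at /llms.txt with a text/plain content type, and regenerate it whenever a document is published.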

What Changed, In Numbers?

AI Overviews now cost the top organic result up to 58% of its clicks, GEO-style content lifts visibility by up to 40%, and Markdown cuts the tokens an AI crawler reads by ~80%.

| Metric | Finding | Source |
| --- | --- | --- |
| CTR drop when AI Overview appears (top organic result) | 34.5% (April 2025) → 58% (February 2026) | Ahrefs, 300k keywords |
| Visibility lift from GEO-style content (citations, stats, authority) | Up to +40% in generative responses | Aggarwal et al., KDD 2024 |
| Token reduction when serving Markdown vs HTML | 16,180 → 3,150 tokens (~80% less) | Cloudflare |
| ChatGPT share of Vercel signups | <1% → 4.8% (Mar 2025) → 10% (Apr 2025) | Rauch on X |

Two patterns are worth pulling out:

  • The optimization vector inverted. The tactics that get you noticed by generative engines - like direct citations, verifiable statistics, and structured authority - are the exact opposite of legacy SEO. In fact, old-school tricks like keyword stuffing actually tanked visibility in recent GEO studies.
  • Citation share is the new rank. Position #3 vs. #5 on a Google SERP increasingly misses the question that matters: when an AI synthesizes an answer about your category, does it mention you, and how often? Old rank trackers don't measure it.

The Payload setup we’ve covered - the robots config, Markdown endpoints, and a structured @graph - directly feeds what models actually care about: hierarchy, citable facts, and verifiable authorship. None of this is isolated effort. It all combines into one strong surface. This places your content exactly where LLMs search for answers.

Key Takeaways

On access

AI search bots and training scrapers are different user agents with different jobs; allowing one doesn't mean allowing the other. The split is explicit and configurable.

On content format

Markdown cuts token overhead by ~80% compared to HTML. Serving it via content negotiation costs half a day and makes your content structurally preferred over sites that don't.

On structured data

Three Schema types – BreadcrumbList, FAQPage, and Article linked to Author – cover hierarchy, citation units, and authority. Without structured authorship, the model has no signal to trust your content over anyone else's.

On llms.txt

Only worth implementing if you run a documentation portal. For blogs and marketing sites adoption is ~2%, Google ignores it entirely, and the same effort spent on Markdown endpoints returns more.

On the competitive shift

The tactics that improve GEO visibility - direct citations, verifiable statistics, structured authority - are the opposite of legacy SEO keyword optimization. Optimizing for the old model actively hurts visibility in the new one.

On what's actually being measured now

Traditional rank position is the wrong metric. The question that matters is whether an AI synthesizing an answer in your category cites you, and how often.

AI crawlers are no longer a footnote in your traffic mix. They actively consume your content and provide answers to questions that buyers are already asking. Most sites haven't adapted yet, which means the structural work covered here - access configuration, Markdown endpoints, linked Schema - still creates a real gap. This isn't a new discipline called "AI SEO." It's the same job content has always had: be readable, be credible, be findable. The audience and the approach to optimization just expanded.

AI-readiness is an architecture decision

The robots config, Markdown endpoints, and structured @graph covered here aren't one-off optimizations — they're decisions baked into how your Payload project is shaped. Getting them right from the start is easier than retrofitting them later, especially as AI crawlers get more selective about what they surface.

If you're building on Payload CMS and want this layer handled properly, FocusReactive specializes in this stack. Reach out to hash out the details of your setup.

FAQs

Short answers to the questions that come up most when teams start making a Payload site AI-ready. Each is marked up as FAQPage, so AI search bots can quote it directly.