Is AI Visibility Tracking a Scam? The Non-Determinism Problem Nobody Talks About
AI citations change run-to-run. SparkToro found recommendation lists repeat less than 1% of the time. Here's what AEO tracking actually measures — and what vendors oversell.

Last week we ran the same query in ChatGPT three times: “What are the best tools for optimizing content for AI search?” Three runs. Three different lists of cited sources. Two of the cited sources appeared in only one run, and the third run cited a page that didn't appear in either of the first two.
This isn't a bug. It's how large language models work. And it's the elephant in the room for a growing industry of AI visibility tracking tools that promise to measure your “AI search ranking” the way Google Search Console measures your organic positions.
We build AEO optimization services for B2B SaaS companies. We have free tools on this site that check AI search visibility. Step 5 of our 5-Step AEO Framework is “Cross-Platform Monitoring.” We are not outside critics. We are practitioners with skin in the game, and we're going to tell you what most vendors in this space won't.
The short version: AI visibility tracking is not a scam — but it is routinely oversold. LLM outputs are non-deterministic: citation results vary run-to-run even for identical queries. SparkToro's 2025 research found AI recommendation lists repeat less than 1% of the time. Tracking tools provide directional intelligence about citation probability over time, not stable “rankings” you can report to a board. The value is real. The framing is wrong.
- <1%: the share of AI recommendation lists that are identical when the same prompt is run twice (SparkToro & Gumshoe, 2025)
- 60%+: the share of AI search engine responses that contain inaccurate citations (Tow Center, Columbia University, 2025)
- 2.8x: the number of sources Perplexity cites per query relative to ChatGPT (Qwairy Q3 2025 analysis)
Why We're Writing This (And Why It's Uncomfortable)
We have two free tools on this site that check AI search readiness. Our AEO Citation Readiness Checker audits whether your content is structurally prepared for AI citations. Our 5-Step AEO Framework — the same one we walk through in our complete AEO guide — ends with “Cross-Platform Monitoring.”
If AI tracking is a scam, we're complicit.
But here's the thing: we built those tools knowing their limitations. And we're increasingly uncomfortable with how other vendors in this space frame what these tools actually measure.
There's a difference between measuring signal and selling certainty. A growing number of tools promise “AI search rankings” and “AI visibility dashboards” that imply a stability of measurement that the underlying technology makes impossible. We want to explain where that line is — honestly, with the technical receipts.
The Technical Reality: Why LLMs Can't Give You the Same Answer Twice
Temperature, Probability, and the Myth of Determinism
Large language models generate text by predicting the next token (roughly, a word or word-fragment) from a probability distribution across their entire vocabulary. The “temperature” setting controls how much randomness is introduced into this selection. At temperature 0 (greedy decoding), the model always picks the most probable next token. In theory, this should produce identical outputs for identical inputs.
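In code, that selection step looks roughly like the sketch below (a minimal illustration of temperature-based sampling in Python, not any platform's actual decoder):

```python
import numpy as np

def next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Pick the next token id from a vector of raw model scores (logits)."""
    if temperature == 0:
        # Greedy decoding: always take the single most probable token.
        return int(np.argmax(logits))
    # Otherwise rescale the logits and sample from the softmax distribution.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng()
logits = np.array([2.0, 1.8, 0.4, -1.0])  # toy four-token vocabulary
print([next_token(logits, 0.0, rng) for _ in range(5)])  # [0, 0, 0, 0, 0] every time
print([next_token(logits, 1.0, rng) for _ in range(5)])  # varies run to run, e.g. [0, 1, 0, 0, 1]
```

On paper, the temperature-0 branch removes every source of randomness.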
In practice, it doesn't. A 2024 peer-reviewed study titled “Non-Determinism of ‘Deterministic’ LLM Settings” tested five LLMs across eight common tasks at temperature 0. The findings:
- Accuracy variations of up to 15% across runs of the same prompt
- A gap between best possible and worst possible performance of up to 70%
- None of the five models delivered repeatable accuracy across all tasks
Why? Three root causes, all architectural:
- GPU floating-point non-associativity. GPUs perform massive parallel computations. Floating-point math is not perfectly associative — (a+b)+c can differ from a+(b+c) by tiny amounts. These tiny differences in logit calculations cascade through billions of parameters.
- Mixture-of-Experts routing. GPT-4 and similar models use MoE architectures where different “expert” subnetworks handle different tokens. Under capacity constraints, tokens compete for expert slots. Different concurrent queries change the routing, producing different outputs.
- Batch invariance failures. The output for your specific query depends on what other queries are being processed in the same batch. Different server load at different times means different outputs.
A June 2025 follow-up paper from researchers at MIT confirmed that this non-determinism is “perhaps essential to the efficient use of compute resources” — meaning it's a fundamental engineering tradeoff, not a bug that will be patched.
Temperature 0 makes the sampling step deterministic. It does nothing to control the computational chaos happening before sampling ever occurs.
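The first of those causes is easy to demonstrate even without a GPU. The toy NumPy sketch below illustrates the arithmetic, not model-scale behavior:

```python
import numpy as np

# Floating-point addition is not associative: summing the same numbers in a
# different order can produce a slightly different result. GPU kernels split
# reductions across many threads, so summation order depends on scheduling and
# batch composition. Identical inputs can therefore yield logits that differ in
# the last few bits, enough to flip an argmax that sits near a tie.
rng = np.random.default_rng(42)
values = rng.standard_normal(1_000_000).astype(np.float32)

ordered = float(np.sum(values))                    # one summation order
shuffled = float(np.sum(rng.permutation(values)))  # same numbers, different order

print(ordered, shuffled)    # usually differ in the trailing digits
print(ordered == shuffled)  # typically False
```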
What This Means for Citations Specifically
When ChatGPT searches the web in browsing mode, it queries Bing's index, reads the top results, and synthesizes an answer. The non-determinism cascades: a slightly different token at position 47 can change which source the model selects at position 200. Add the variability of which web results are fetched (timing, index updates, server location), and you get a system where asking the same question twice and getting different cited sources is a mathematical certainty, not an edge case.
Perplexity searches across multiple engines. Google AI Overviews pull from Google's own index. Claude draws primarily from training data. Each platform introduces its own layers of variability on top of the base-level LLM non-determinism.
The ranking framing: “Your brand has a stable 'AI ranking' that can be tracked like a Google position. Higher ranking = more visibility. Track it monthly, report it to your board.”
Result: false confidence in a stable metric.
The probability framing: “Your brand has a citation probability that fluctuates based on query phrasing, model state, retrieval timing, and randomness inherent in next-token prediction. No two runs produce identical results.”
Result: a probabilistic signal that requires statistical discipline to interpret.
The Evidence: How Much Do AI Citations Actually Vary?
The SparkToro Study: Less Than 1 in 100
In late 2025, Rand Fishkin (SparkToro) and Patrick O'Donnell (Gumshoe.ai) ran one of the most rigorous studies on AI citation consistency to date. Six hundred volunteers ran 12 different prompts through ChatGPT, Claude, and Google AI — a combined 2,961 queries across categories including headphones, chef's knives, digital marketing consultants, and cancer care hospitals.
The headline finding: the odds of getting the same list of brand recommendations twice were less than 1 in 100. Nearly every response was unique in three dimensions — the brands listed, their order, and the number of items returned.
Fishkin's verdict on AI ranking tools was direct: “Any tool that gives a ‘ranking position in AI’ is full of baloney.”
But there's a nuance we'll come back to in the steel-man section. Despite the chaos in list composition, category leaders appeared frequently across runs. The list changed every time, but the top brands kept showing up. That distinction — between position tracking (meaningless) and frequency tracking (meaningful) — is the crux of this entire debate.
The Tow Center Study: Confident and Wrong
In March 2025, researchers at Columbia University's Tow Center for Digital Journalism ran 200 queries across eight AI search engines — ChatGPT Search, Perplexity, Perplexity Pro, Gemini, DeepSeek, Grok 2, Grok 3, and Copilot. The study focused specifically on citation accuracy.
The collective failure rate: over 60% of responses contained inaccurate citations. Perplexity performed best with a 37% error rate. Grok 3 performed worst at 94%.
The researchers flagged an especially dangerous pattern: these platforms were confidently wrong. Gemini and Grok 3 provided more fabricated source links than correct ones. The problem isn't just variability — it's that the citations themselves are frequently inaccurate even when they appear authoritative.
Platform-by-Platform Divergence
Qwairy's Q3 2025 analysis of 118,101 AI-generated answers with 669,065 total citations revealed how differently each platform handles sources. Perplexity averages 21.87 citations per question. ChatGPT averages 7.92 — a 2.8x difference. Only OpenAI relies meaningfully on Wikipedia (4.8% of citations).
This means your “AI visibility” depends heavily on which platform you're measuring. A tool that tracks your mentions in ChatGPT tells you nothing about your visibility in Perplexity, and both are nearly independent of your presence in Google AI Overviews. There is no single “AI search ranking.” There are multiple, loosely correlated citation systems that each behave differently.
The Steel-Man Case for AI Visibility Tracking
We've made the strongest case against tracking. Intellectual honesty requires making the strongest case for it. Here are the four best arguments.
Weather Forecasts Are Probabilistic Too
A 70% chance of rain is not certainty. But nobody calls weather forecasting useless because the probability changes from day to day. Similarly, tracking citation probability over time — even with run-to-run variability — reveals meaningful trends about whether your content is becoming more or less citable.
If your citation rate for a target query goes from 5% to 30% over three months, that trend is real and actionable — even though any single run is noise. The error isn't in tracking. The error is in treating a probabilistic measure like a deterministic rank.
Brand Presence Is More Consistent Than Rankings
Here's the finding from Fishkin's own research that cuts the other direction: while exact recommendation lists varied wildly, category leaders appeared in the majority of responses. In tight categories like cloud computing, the top brands appeared in most responses. Bose, Sony, and Apple showed up in most headphone recommendation runs even though the specific list changed every time.
In B2B SaaS — where there are fewer competitors per category — brand-level presence is more consistent than list-level composition. If you're one of three companies that an AI platform consistently mentions for “fintech compliance tools,” that consistency is measurable and meaningful.
The Market Is Real, Even If Measurement Is Imperfect
The adoption data makes the channel impossible to ignore:
- 94% of B2B buyers now use AI tools during their purchasing process (Forrester, 2025)
- 38% of software buyers start their search with AI chatbots — up 11 points year-over-year (Gartner, 2026)
- $750 billion in US revenue is projected to funnel through AI-powered search by 2028 (McKinsey, 2025)
You don't need perfect measurement to know this channel matters. You need honest measurement to know what to do about it.
AI Traffic Converts at Higher Rates
Even at approximately 1% of total web traffic, AI referrals convert meaningfully better. Semrush's July 2025 data showed LLM visitors converting at 4.4x the rate of organic search visitors. Rocket Agency reported ChatGPT traffic converting 5.1x higher. Platform-specific rates ranged from 15.9% (ChatGPT) to 3% (Gemini), compared to Google organic's 1.76%.
The caveat: these numbers come from early adopter audiences, small sample sizes, and industries where AI usage skews high. They'll normalize as AI search goes mainstream. But even discounting for sample bias, the signal is clear — AI-referred visitors arrive with higher intent.
What AI Tracking Actually Tells You vs. What Vendors Claim
| What Vendors Often Claim | What the Data Shows | What You Should Do |
|---|---|---|
| “Track your AI ranking” | No stable ranking exists. Citation is probabilistic. | Track citation frequency across many runs over time, not a single “rank.” |
| “Monitor brand mentions in AI” | Mention detection works. Accuracy validation usually doesn't. | Track mentions AND verify what the AI actually says about you. |
| “Optimize to rank #1 in ChatGPT” | There is no persistent #1. Citation sources rotate. | Optimize for citation-worthy content structure, not a position. |
| “AI search dashboard like Search Console” | No API-level data exists. All tools use prompt sampling. | Use tools for directional intelligence, not as source of truth. |
| “We track 10,000 queries automatically” | Scale without re-runs per query is noise, not signal. | Fewer queries with multiple runs per query beats many queries run once. |
The Accuracy Validation Gap
Most AI visibility tools count mentions without validating accuracy. A tool that tells you ChatGPT mentioned your brand 47 times this month is incomplete if ChatGPT was saying something wrong about you in half of those mentions. Of the major tools we've evaluated, very few consistently detect when an AI platform is misrepresenting what a brand does.
This matters for B2B SaaS companies especially. If an AI platform tells a potential buyer that your compliance tool “handles SOC 2 audits” when it actually handles SOC 2 readiness, that inaccurate citation is worse than no citation at all.
The Sampling Problem
Here's a question to ask any AI visibility tool vendor: “How many times do you run each prompt before recording a result?”
A tool that runs 10,000 prompts once each gives less reliable data than a tool that runs 500 prompts five times each. When the SparkToro study showed less than 1% consistency across runs, the implication is clear: single-run data points are statistically meaningless. Re-run methodology determines whether your data has any confidence behind it.
Most vendors don't disclose their sampling approach. If they run each prompt once per month — which Ahrefs Brand Radar explicitly does — the data reflects a single snapshot of a system that changes by the minute.
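A quick simulation shows why this matters. Assume, hypothetically, that your true citation probability for a query is 30%, and compare what single-run tracking reports against a five-run average:

```python
import numpy as np

rng = np.random.default_rng(7)
TRUE_CITATION_RATE = 0.30  # hypothetical: your page is cited in ~30% of runs
MONTHS = 6

# Vendor A: runs the prompt once per measurement cycle.
single_run = rng.random(MONTHS) < TRUE_CITATION_RATE

# Vendor B: runs the prompt five times per cycle and reports the citation rate.
multi_run = (rng.random((MONTHS, 5)) < TRUE_CITATION_RATE).mean(axis=1)

print("Single-run 'visibility':", ["cited" if hit else "absent" for hit in single_run])
print("Five-run citation rate: ", multi_run.round(2))
```

The single-run column swings between “you're visible” and “you've vanished” purely by chance; the five-run column hovers near the true rate. Neither is perfect, but only one is interpretable.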
The Three Tiers of AI Measurement (What We Actually Recommend)
Instead of asking “What's my AI ranking?” — a question that has no stable answer — we recommend structuring AI measurement around three tiers of increasing complexity and decreasing certainty.
Three Tiers of AI Measurement
Tier 1: Structural Readiness
Binary checks — can AI platforms access, parse, and extract from your content? Deterministic. Pass or fail.
Tier 2: Citation Probability
Multi-run prompt testing across platforms. Track frequency trends over months, not snapshots. Statistical, not deterministic.
Tier 3: Business Impact
Referral traffic from AI platforms, branded search trends, and pipeline attribution. The metrics that actually matter.
Tier 1: Structural Readiness (What You CAN Measure Deterministically)
These are binary, deterministic checks with clear pass/fail answers:
- Does your robots.txt allow GPTBot, ClaudeBot, and PerplexityBot?
- Is your content server-side rendered, or hidden behind JavaScript that AI crawlers can't execute?
- Do you have schema markup — Organization, Article, FAQ, BreadcrumbList?
- Are your headings descriptive? Are definitions extractable?
- Does max-image-preview restrict AI platforms from previewing your content?
This is what our AEO Citation Readiness Checker measures. Not probabilistic citation outcomes — structural prerequisites. These checks have definitive answers. Either your robots.txt allows GPTBot or it doesn't. Either your content renders in initial HTML or it requires JavaScript execution.
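The crawler-access check, for example, fits in a dozen lines of Python's standard library. In the sketch below, example.com and the crawler list are placeholders; substitute your own domain and the agents you care about:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"  # placeholder: replace with your domain
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the live robots.txt

for agent in AI_CRAWLERS:
    allowed = parser.can_fetch(agent, f"{SITE}/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```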
Start here. If Tier 1 isn't passing, nothing else matters.
Tier 2: Citation Probability (What Requires Statistical Discipline)
This is where most tracking tools operate — and where the honesty gap lives. Citation probability is measurable, but only with methodological rigor:
- Define 20-30 queries that represent your buyer's actual research journey — not vanity queries about your brand name.
- Run each query 3-5 times per platform per measurement cycle (weekly or biweekly for priority queries, monthly for the rest).
- Record four things: Were you cited? How often across runs (citation rate)? Was the citation accurate? Who else was cited?
- Track the trend month-over-month. A single measurement is noise. Three months of directional change is signal.
- Accept the uncertainty. This is weather forecasting, not GPS coordinates. Report probability ranges, not positions.
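A minimal sketch of what that record-keeping can look like follows; the field names and the Wilson confidence interval are our illustrative choices, not an industry standard:

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class QueryResult:
    query: str
    platform: str
    runs: int       # how many times the prompt was run this cycle
    citations: int  # how many of those runs cited your domain

def citation_rate(result: QueryResult, z: float = 1.96) -> tuple[float, float, float]:
    """Observed citation rate plus a 95% Wilson score interval."""
    n = result.runs
    p = result.citations / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, center - margin), min(1.0, center + margin)

# Hypothetical cycles for one query on one platform.
march = QueryResult("best fintech compliance tools", "chatgpt", runs=5, citations=1)
june = QueryResult("best fintech compliance tools", "chatgpt", runs=5, citations=4)
for cycle in (march, june):
    rate, low, high = citation_rate(cycle)
    print(f"{cycle.platform}: cited in {rate:.0%} of runs (95% CI {low:.0%}-{high:.0%})")
```

With only five runs the intervals are wide, and that is the point: they push you to wait for a trend instead of reacting to one cycle's number.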
This is how we built Step 5 of our 5-Step AEO Framework. Not as a dashboard fantasy with green and red indicators, but as a disciplined monitoring practice that acknowledges the statistical nature of the data.
Tier 3: Business Impact (What Actually Matters)
The metrics that connect AI visibility to pipeline:
- Referral traffic from chat.openai.com, perplexity.ai, gemini.google.com, and related AI platform domains — measurable in any analytics tool.
- Branded search volume trends. If AI platforms mention you frequently, branded searches increase as people verify what the AI told them.
- Sales conversation signals. “I saw you recommended by ChatGPT” or “Perplexity mentioned your tool” — track these qualitatively in CRM notes.
- Pipeline attribution from organic + AI referral sources combined.
The question isn't “Are we ranked in ChatGPT?” The question is “Is AI search driving qualified attention to our brand?” That second question is answerable.
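Most of Tier 3 already lives in your analytics; the main setup work is classifying referrers. Here's a sketch. The domain list is illustrative and will drift as platforms change their referrer hostnames:

```python
from urllib.parse import urlparse

# Illustrative mapping only; AI platforms change referrer domains over time.
AI_REFERRER_DOMAINS = {
    "chat.openai.com": "ChatGPT",
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "www.perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "copilot.microsoft.com": "Copilot",
}

def classify_referrer(referrer_url: str) -> str:
    """Map a raw referrer URL to an AI platform name, or 'other'."""
    host = urlparse(referrer_url).netloc.lower()
    return AI_REFERRER_DOMAINS.get(host, "other")

sessions = [
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=fintech+compliance+tools",
    "https://www.google.com/",
]
print([classify_referrer(s) for s in sessions])  # ['ChatGPT', 'Perplexity', 'other']
```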
What This Means for How We Build AEO Services
This understanding shapes how we run AEO optimization engagements:
We don't promise “AI rankings.” We promise structural optimization that increases citation probability. There is no guarantee that ChatGPT will cite your page for a specific query on a specific day — and anyone who offers that guarantee either doesn't understand how these systems work or is counting on you not asking questions.
Step 5 of our framework — Cross-Platform Monitoring — is built around Tier 2 methodology. Multi-run sampling, trend tracking, statistical thinking. It's slower and messier than a dashboard with a big green number. But it reflects reality.
We built free tools that measure Tier 1 because that's the part we can give definitive answers on. If your content can't be accessed, parsed, or extracted by AI platforms, no amount of monitoring will help.
When we work with clients on measurement, we're transparent that we're tracking probability, not position. If that's not what you want to hear, that's fine. But we'd rather lose a prospect by being honest than win one by overselling.
Frequently Asked Questions
Is AI visibility tracking worth paying for?
Yes, but only if the tool discloses its sampling methodology, tracks citation accuracy (not just mention count), and you treat the data as directional intelligence rather than a stable metric. Most tools are worth the time savings over manual testing. Few are worth it for the “AI ranking” dashboards they sell. Ask the vendor: “How many times do you run each prompt before recording a result?” If the answer is once, the data is a single frame from a movie.
Can I track AI search visibility for free?
Yes. Manual testing across ChatGPT, Perplexity, and Claude takes 2-3 hours per week for 20-30 queries. Free tools like the AI Search Visibility Checker on this site assess structural readiness — whether your content meets the technical prerequisites for AI citation. The paid tools primarily add automation and historical trending, which has real value for teams that need it.
Why do different AI platforms cite different sources for the same query?
Each platform uses different retrieval mechanisms. ChatGPT browses via Bing's index. Perplexity searches across multiple engines. Google AI Overviews pull from Google's own index. Claude draws primarily from training data. Different retrieval inputs produce different citation outputs — even before the non-determinism of the LLM layer adds variability. Qwairy's research found Perplexity cites 2.8x more sources per query than ChatGPT, confirming fundamentally different citation architectures.
If AI results are non-deterministic, does AEO optimization even work?
Yes — the same way SEO works even though Google's algorithm changes constantly. You can systematically increase the probability that your content gets cited by optimizing structure, schema, entity signals, and content quality. You cannot guarantee a specific outcome on a specific query at a specific time. The parallel to SEO is almost exact: no one guarantees a #1 Google ranking, but systematic optimization demonstrably improves the odds.
How does XEO's approach to AEO measurement differ from other agencies?
We're transparent that AI citations are probabilistic, not deterministic. We measure across three tiers: structural readiness (binary pass/fail), citation probability (multi-run sampling over time), and business impact (referral traffic, branded search, pipeline). We don't sell “AI rankings” because stable AI rankings do not exist. We invest in the inputs — content structure, schema, entity authority — that increase citation probability across platforms, rather than chasing output metrics that change by the minute.
AI visibility tracking is not a scam. But the framing around it often is.
The value is real: directional intelligence about whether your content is becoming more or less citable across AI platforms. The fraud is in selling that intelligence as stable rankings, deterministic positions, or dashboard metrics you can take to your board with the same confidence as a Google Search Console report.
For the market economics side of this story — why a $1B valuation on licensed data in a market with 20+ competitors doesn't add up — see our analysis of the AI visibility tool bubble.
If you're evaluating whether to invest in AEO, start with Tier 1 — structural readiness. It's free, it's deterministic, and it's the foundation everything else depends on.
Run a free citation readiness check on your site →
Or, if you want a full AEO audit with multi-platform citation analysis, that's what we do.

Ankur Shrestha, Founder, XEO.works
Ankur Shrestha is the founder of XEO.works, a cross-engine optimization agency for B2B SaaS companies in fintech, healthtech, and other regulated verticals. With experience across YMYL industries including financial services compliance (PCI DSS, SOX) and healthcare data governance (HIPAA, HITECH), he builds SEO + AEO content engines that tie content to pipeline — not just traffic.