How Leap Tests AI Detectors and Humanizers
Why this page exists
Over the last year our /review and /vs pages have become a trust surface. Other pages on this site cite /methodology when they make a relative claim ("Leap preserves meaning better than tool X," "Originality scores in the top-tier band on unmodified output"). That only works if the destination is specific enough to defend. So this is the long version, not the soundbite.
Testing corpus — what goes in
We generate input samples from the model families that produce the bulk of real-world AI writing today, refreshed monthly as frontier models ship:
- OpenAI — ChatGPT on GPT-4o and GPT-5, default settings, no custom system prompt.
- Anthropic — Claude Sonnet 4.6 and Claude Opus 4.7, default settings, no custom system prompt.
- Google — Gemini 2.5 Pro, default settings.
- Meta — Llama 3.1 70B Instruct, used to capture open-weight model output distributions that behave differently from the API providers.
- Mistral — Mistral Large, for the same reason as Llama.
Genre matters as much as model. AI output on a marketing brief has a different rhythm than AI output on a college essay, and detectors catch them at different rates. Every model above generates samples across four genres:
- Academic writing — argumentative essays, literature analysis, research summaries.
- Marketing copy — landing pages, product descriptions, email drafts.
- Long-form blog posts — listicles, how-tos, opinion pieces.
- Professional communication — emails, reports, summaries.
Two control groups sit alongside the AI samples:
- Hand-written human samples — original prose drafted by the Leap team and outside writers, checked for false positives on every detector in the panel.
- Edited samples — AI output that a human has rewritten in their own voice, which is how real-world documents actually show up in 2026. A detector that flags everything the model touched, even after thorough human editing, is producing noise.
Detector panel — what we score against
The panel is not about crowning a single best detector. It is about covering the detectors that a teacher, editor, or hiring manager is actually likely to paste text into:
- GPTZero — the educator standard with the longest public benchmark history, integrated into multiple LMSes. If a teacher flags a student paper in 2026, it is most often through GPTZero.
- Turnitin — the institutional ground truth. Most universities do not publish the score; students see a flag, not a percentage. We still test it because the flag decision drives real outcomes.
- Originality.ai — publisher and SEO adoption, API-first. The default pick for content ops teams vetting freelance submissions.
- Copyleaks — enterprise, multi-language. Often shows up in corporate procurement as an extension of their plagiarism platform.
- Winston AI — the professional tier. PDF reports, plagiarism bundle, the detector editors and agencies pay for.
ZeroGPT and Scribbr are spot-checks rather than primary panel members. They are the free detectors casual users reach for first. Accuracy lags the paid panel, but they matter for the "will this pass a casual check" scenario.
Score normalization
Detectors report scores on different scales, and averaging them naively produces garbage. Our normalization:
- AI-probability scales (0–100). GPTZero, Originality, Copyleaks, Turnitin, ZeroGPT all report an "AI probability" where higher means more AI-like. We use these as-is.
- Winston's inverted scale. Winston reports a "human score" where 100 means human-written and 0 means AI-written — the opposite direction. We flip it before comparing, so Winston's 15 becomes 85 in our AI-probability column.
- Rank-order, not absolute. When we compare across detectors, we rank the same sample against itself across the panel rather than averaging the raw scores. "Sample A scored in the top quartile on four of five detectors" is a claim we can defend; "Sample A scored 78.4% on average" is not.
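The normalization above can be sketched in a few lines. This is illustrative, not the eval repo's actual code; the function names are ours, and only Winston's inversion and the rank-order comparison come from the rules stated above:

```python
def to_ai_probability(detector: str, raw_score: float) -> float:
    """Normalize a raw detector score to a 0-100 AI-probability scale.
    Winston reports a human score (100 = human), so it is inverted;
    the other detectors already report AI probability and pass through."""
    if detector == "winston":
        return 100.0 - raw_score
    return raw_score

def rank_across_panel(panel_scores: dict[str, float]) -> list[str]:
    """Rank detectors by normalized AI probability for one sample,
    most AI-like first. Rank order, not a raw average, is what the
    comparison pages report."""
    normalized = {d: to_ai_probability(d, s) for d, s in panel_scores.items()}
    return sorted(normalized, key=normalized.get, reverse=True)

scores = {"gptzero": 91, "originality": 88, "winston": 15, "copyleaks": 76}
print(to_ai_probability("winston", 15))  # 85.0, per the example above
print(rank_across_panel(scores))
```

Note that Winston's raw 15 ranks third here once flipped to 85, which is exactly the comparison the raw scales would have gotten backwards.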
Pass threshold — what "passed" means
Our internal pass threshold is an AI-probability score below 50 on the detector in question. That number is not a guarantee that a human reader would agree the text reads human — it is the band at which most detectors stop flagging. Detector vendors often use lower thresholds (25 or 30) for their own strict-mode flags; we use 50 because it reflects the default behavior most readers will see. When a /vs or /review page says "Leap passes detector X," that is the threshold it is measured against.
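In code, the pass decision reduces to a per-detector comparison against that threshold. A minimal sketch — the 50 constant is the one stated above; the strict-mode example threshold is a vendor behavior we describe, not ours:

```python
PASS_THRESHOLD = 50  # AI-probability below this counts as "passed" on our pages

def passed(ai_probability: float, threshold: float = PASS_THRESHOLD) -> bool:
    """A sample passes a detector when its normalized AI-probability
    falls below the threshold. Vendor strict modes flag lower (25-30);
    we use 50 because it matches the default behavior readers see."""
    return ai_probability < threshold

print(passed(42))                 # True: below the default flag band
print(passed(50))                 # False: at the threshold still counts as flagged
print(passed(28, threshold=25))   # False under a vendor's strict-mode cutoff
```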
Refresh cadence
We rerun the full panel monthly. Detectors update their weights on their own schedule (weekly for some, quarterly for others), and a result from three months ago is already decorative. The full refresh log lives in the eval repo; a brief changelog summary appears on /review pages whenever a detector's stance materially changes.
If a new frontier model ships between monthly runs (GPT-5.x, a Claude point release, a Gemini update), we regenerate that model's corpus immediately and rerun the panel. Frozen-snapshot benchmarks are how vendors end up quoting stale numbers.
What we measure per sample
Three dimensions, scored independently so we do not optimize one at the cost of the others:
- Pass-rate — does the humanized output fall below the detector's AI flag threshold? Measured per detector, not averaged, because thresholds and false-positive rates differ.
- Readability — does the output read naturally? Scored by blind human review against the original input. A humanizer that bypasses detection by mangling grammar is not shipping a usable product.
- Meaning preservation — does the output still say what the input said? Measured via semantic similarity plus spot checks. Loss of key claims, numbers, or nuance is a failure even if the detector score drops.
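The three dimensions can be modeled as independent fields on one per-sample record, so a gain on one axis never masks a loss on another. A sketch with illustrative names and floors — the actual review rubric is human judgment, not these numbers:

```python
from dataclasses import dataclass

@dataclass
class SampleResult:
    detector: str
    passed: bool              # below that detector's flag threshold
    readability: float        # blind human review vs. the input, 0-1
    meaning_preserved: float  # semantic similarity plus spot checks, 0-1

    def usable(self, readability_floor=0.7, meaning_floor=0.9) -> bool:
        """A detector pass only counts if the output is still readable
        and still says what the input said. Floors are illustrative."""
        return (self.passed
                and self.readability >= readability_floor
                and self.meaning_preserved >= meaning_floor)

r = SampleResult("gptzero", passed=True, readability=0.9, meaning_preserved=0.95)
print(r.usable())  # True: all three dimensions clear their bar
```

A result that passes the detector but scores 0.5 on meaning preservation is a failure under this model, which is the point of scoring the dimensions separately.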
What we do NOT claim
Negative claims are as important as positive ones in a category this crowded:
- No head-to-head bypass percentages. Detector weights rot. Any specific "Leap achieves X% bypass vs competitor Y's Z%" claim is stale the week after it is written. Our comparison language is rank-order ("top-tier band," "softens significantly," "lags on edge cases"), never a decimal.
- No star ratings derived from an algorithm. The ratings that do appear on /review/* pages are editorial — they reflect our team's judgment on fit for a given audience, not a computed score.
- No "100% undetectable" or "zero false positives" guarantees. Every detector produces false positives, particularly on non-native English writing. Every humanizer softens on some samples and fails on others. Any vendor who tells you otherwise is selling you the claim, not the product.
How to reproduce this
The methodology lives in the open-source portion of the Leap repo, under evals/. If you want to run any part of it yourself:
- evals/detection/ — our in-house detector quality evals. Tests that Leap's detector scores AI text above 70 and human text below 30 across the model family matrix.
- evals/detectors/ — the third-party detector harness. Wraps the GPTZero and Winston APIs and scores samples across the panel. Requires your own API keys for the detectors you want to test.
- evals/fixtures/ — the sample corpus. AI-generated and hand-written samples that feed both suites.
Running the detector harness costs money (third-party detector API credits), so we do not run it in CI. It runs on demand, and we log the results. When we are ready to publish a fully reproducible dataset with a third-party audit, it will ship from the same directory.
API contract — /api/humanize
Programmatic callers (Chrome extension, SDKs, third-party integrations) hit the same humanizer the dashboard uses. The endpoint supports two modes, and the response shape changes slightly depending on which one you use.
Legacy compound mode. Send { "text": "..." } and the endpoint runs detect → humanize → detect, returning the rewrite with both before- and after-scores filled in. detectionScoreAfter is always a number. This is the contract older API clients already depend on; nothing about it has changed, it is just slower (p50 ~37s) because it runs two detection passes inline.
Dashboard-style skip-detect-after mode. Send { "text": "...", "skipDetectAfter": true } and the endpoint skips the post-humanization detect, returning as soon as the rewrite is ready. In this mode detectionScoreAfter is null, and the caller is expected to call /api/detect separately on the returned outputText if an after-score is needed. This is what the dashboard uses; p50 drops to ~13.6s.
When the caller already has a fresh detection score for the input text (e.g., the user just ran it through the detector two seconds ago), pass { "text": "...", "detectionScoreBefore": 87 } to skip the pre-humanization detect as well. The value must be a finite number in the range 0–100; anything else is ignored and the endpoint runs detection normally. Combining both flags (pass detectionScoreBefore and set skipDetectAfter: true) is the minimum-latency path.
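A request body following those rules might be built like this. The field names match the contract above; the builder itself is our sketch, not an official client:

```python
import math

def build_humanize_payload(text: str,
                           skip_detect_after: bool = False,
                           detection_score_before=None) -> dict:
    """Build a /api/humanize request body. An out-of-range or non-finite
    detectionScoreBefore is dropped, mirroring the endpoint's behavior of
    ignoring invalid values and running detection normally."""
    payload: dict = {"text": text}
    if skip_detect_after:
        payload["skipDetectAfter"] = True
    if (isinstance(detection_score_before, (int, float))
            and not isinstance(detection_score_before, bool)
            and math.isfinite(detection_score_before)
            and 0 <= detection_score_before <= 100):
        payload["detectionScoreBefore"] = detection_score_before
    return payload

# Minimum-latency path: both flags set, no inline detection passes.
print(build_humanize_payload("draft...", skip_detect_after=True,
                             detection_score_before=87))
```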
Response shape. { outputText: string, wordCount: number, detectionScoreBefore: number, detectionScoreAfter: number | null, latencyMs: number }. The only field that widened is detectionScoreAfter, from number to number | null. Existing clients that never send skipDetectAfter will never see null and do not need to change. Clients that opt into the faster path must handle null and fetch the after-score separately.
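Handling the widened field is one branch. In this sketch, fetch_detect stands in for a separate call to /api/detect and is our assumption, not a documented client helper:

```python
def resolve_after_score(response: dict, fetch_detect) -> float:
    """Return detectionScoreAfter, calling the detect endpoint on the
    returned outputText only when the skip-detect-after path left it null."""
    after = response.get("detectionScoreAfter")
    if after is not None:
        return after  # legacy compound mode: always a number
    return fetch_detect(response["outputText"])

resp = {"outputText": "rewritten...", "detectionScoreAfter": None}
print(resolve_after_score(resp, lambda text: 22.0))  # 22.0 via the stub
```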
Rate limits. Enforced per authenticated user per hour, tiered by plan. Free: 60 requests/hour; Pro: 600 requests/hour. Anonymous callers hit a pre-auth circuit-breaker bucket (10/hour) before auth is even checked. Every response carries standard X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers; 429 responses include Retry-After. Word-count quotas are enforced separately in the Convex layer and are not bypassable by staying under the request-rate cap.
Pricing data cadence
Every pricing number on /review/* and /vs/* pages is pulled from the vendor's public pricing page. Pricing is dated April 2026. Verify before buying. Vendors reshape their tiers often, and a page we published two weeks ago may already be off by a dollar or a credit tier. When a vendor materially changes pricing (launches a new plan, raises a tier 20%+), we update the affected review pages in the next monthly refresh.
No-affiliate disclosure
Leap does not take affiliate revenue from any detector or humanizer linked anywhere on this site. We are not in the Originality.ai affiliate program, the GPTZero affiliate program, Winston's, Copyleaks', or any humanizer program. Competitor links use rel="noreferrer noopener" and carry no tracking parameters we control. When a /review or /vs page recommends another product for a specific use case — and several of them do — that recommendation is not paid.
What we are still building
This methodology is a work in progress. Four pieces still need to land before we publish benchmark percentages:
- Reproducible corpus. A versioned input set so every release is scored against the same bar. Today the corpus evolves informally; it needs to be pinned, hashed, and published.
- Public result CSV. Raw scores per detector per model per release, downloadable. No vibes, no curated excerpts.
- Third-party audit. An independent team reruns the methodology on the published corpus and reports whether the numbers replicate. Without this, scores are marketing.
- Version-controlled prompts. Humanization prompts change over time. Every benchmark needs to be tagged to a specific prompt version so improvements (and regressions) are traceable. The prompt changelog already lives at evals/PROMPT_CHANGELOG.md.
We would rather ship this slowly and correctly than publish round numbers we cannot back up.
How you can verify for yourself
You do not have to take any humanizer's word for it — ours included. The test takes about two minutes:
- Run your AI-generated input through Leap's free AI detector and note the score.
- Humanize the same input through Leap's humanizer (500 words free, no signup).
- Re-score the humanized output against the detector you actually care about — GPTZero, Turnitin, Originality.ai, Winston, whichever one will read your text downstream.
That is the benchmark that matters. Your input, your detector, your threshold. For the full detector and humanizer landscape, start at /review. When our reproducible dataset is ready to publish, it will ship from evals/ and we will link it here.
Frequently asked questions
Why don't you publish specific bypass percentages?
Because detector weights drift week to week, and any single number is out of date before it ships. A '99.8% bypass' claim is measured against one detector on one corpus at one moment in time, usually by the vendor. We rank-order tools against each other and describe performance qualitatively instead. When we can publish a reproducible corpus with a third-party audit attached, we will — with the data and scripts, not just the headline.
Will you publish benchmark data later?
Yes, that is the goal. We are building toward a versioned corpus, a public result CSV per detector per model per release, and an independent audit. Until those four pieces are in place, the comparison pages describe relative performance qualitatively and competitive claims are bounded by what we can defend on the record.
How should I interpret the relative claims on your comparison pages?
As directional, not as benchmarks. When a /vs/* page says Leap preserves meaning better than tool X, that reflects our internal testing and judgment at the time of writing — not a lab result. The fairest way to interpret any humanizer comparison is to run your own samples through both tools and through the detector you actually care about. Our free AI detector is built for exactly that test.
Do you take affiliate revenue from the tools you review?
No. We do not participate in any affiliate or referral program from the detectors or humanizers we review, link to, or compare against. Pricing pages are linked with rel='noreferrer noopener'. When a /review or /vs page recommends a competitor for a specific workflow, that recommendation is not paid.