Summary of my process with testing various AI-detectors in March 2025

Over the past few weeks this March, I’ve been running extensive tests on free online AI-detectors. I wanted to see whether any of them were reliable at determining if a story contains unedited AI-generated text. This was not because we plan to scan every story on the site; we are emphatically avoiding that, just to be clear.

But we had issues with people not self-reporting when posting AI stories. And we were uncertain if the approvers team should just trust our own instincts about identifying these stories as AI, and tag accordingly. So, when brainstorming solutions, I got curious if AI-detectors had any usefulness in this context. Or, conversely, if they were totally unreliable, and should be avoided.

This thread is a summary of my experiment, and what I learned.


My main interest in testing these detectors was to see for myself what they could actually do. I’d heard a lot of hearsay about AI-detectors (mostly about how unreliable they were). Unfortunately a lot of the hearsay also contradicted itself! And hearsay is an awful source of reliable information…

Also, the hearsay did say the detectors were unreliable. But unreliable in which direction? Would they flag human-written stories as AI too? Or, were they unreliable because they regularly failed to detect AI? (Or, worst case scenario, a bit of both?)

I got tired of swimming through hearsay and decided I might as well get hard numbers on how they actually worked. I’d do it by testing them myself.

To test the detectors, I pulled a pool of one hundred stories that were all posted to the site four or more years ago—all guaranteed to be fully human-written.

I also pulled a pool of about fifty self-identified AI stories that were posted recently to the site.

I noted them all in a spreadsheet, and over the next several weeks, I began testing both pools of stories against seven different free AI-detectors, and noted down each result.
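The bookkeeping described above amounts to a simple loop: every (detector, story) pair gets one recorded verdict. Here’s a minimal sketch in Python, where the detectors are hypothetical stand-in functions (in practice each one was a free web tool used by hand, and the results went into a spreadsheet rather than a list):

```python
def run_tests(detectors, human_stories, ai_stories):
    """Record one verdict per (detector, story) pair, like the spreadsheet.

    `detectors` maps a detector's name to a function that returns True
    when it flags a story as AI-generated. Each result row is a tuple of
    (detector_name, true_label, flagged_as_ai).
    """
    results = []
    for name, detect in detectors.items():
        for story in human_stories:
            results.append((name, "human", detect(story)))
        for story in ai_stories:
            results.append((name, "ai", detect(story)))
    return results
```

With roughly one hundred human stories, fifty AI stories, and seven detectors, that works out to on the order of a thousand manual checks, which lines up with this taking several weeks.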

Very quickly into my tests, I ruled out five of these AI-detectors as useless for our purposes. One of them flagged everything as AI, no matter what it was. (Comically, it was more confident that human-written stories were AI than actual AI stories, which it tended to flag as ‘maybe human’. I struck that detector off my list immediately.) Several other detectors were pretty awful at detecting AI, even if they at least didn’t flag human-written stories as AI.

Two of the AI-detectors, however, proved very reliable. The first one detected AI with reasonable confidence (though it completely missed AI generated by certain models). The second one detected AI with a strong accuracy that regularly surprised me.

But more importantly: not once in my tests did either of these detectors incorrectly flag a human-written story as AI. So, per my tests, these two detectors went 100 for 100 at not flagging human-written stories as AI.

In other words: whatever their occasional unreliability in flagging AI as AI, they are reliable in not giving false positives.

This was the most important part for me.

What it meant is that if one of these detectors flagged a story as AI, well, I now have the extensive numbers to back up its accuracy and say, “yeah, we have strong reason to believe it.”
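If each test result is recorded as a (detector, true_label, flagged_as_ai) row, the two numbers that matter here, the false-positive rate (human stories wrongly flagged as AI) and the detection rate (AI stories correctly flagged), can be computed directly. A sketch in Python; the row layout and function name are my own invention, not anything from the actual spreadsheet:

```python
def rates(results):
    """Per-detector false-positive and detection rates.

    `results` is an iterable of (detector_name, true_label, flagged_as_ai)
    rows, where true_label is "human" or "ai" and flagged_as_ai is a bool.
    """
    counts = {}
    for name, label, flagged in results:
        c = counts.setdefault(name, {"fp": 0, "humans": 0, "tp": 0, "ais": 0})
        if label == "human":
            c["humans"] += 1
            c["fp"] += flagged      # a flagged human story is a false positive
        else:
            c["ais"] += 1
            c["tp"] += flagged      # a flagged AI story is a true positive
    return {
        name: {
            "false_positive_rate": c["fp"] / c["humans"] if c["humans"] else 0.0,
            "detection_rate": c["tp"] / c["ais"] if c["ais"] else 0.0,
        }
        for name, c in counts.items()
    }
```

A detector that never flags human stories but catches most AI stories would come back with a false-positive rate of 0.0 and a detection rate somewhere below 1.0, which is exactly the pattern described above.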


And just to confirm, as mentioned at the beginning:

We are not going to be running most people’s stories through these AI-detectors.

The only time we’ll be using these detectors, in practice, is when the approvers read a story in the queue and are pretty sure it is AI-generated, but the story appears to be missing the AI tags as per our rules.

In that situation, as a final impartial step, we’ll run the story through the AI-detectors. We will be doing this to help us avoid accidentally tagging a human-written story as AI. The detectors will act as a check against possible approver bias, and thus they will help the authors by keeping things fairer.

If these AI-detectors (which, again, have not yet flagged a human-written story as AI) agree with our assessment that the story is in fact AI-generated, then we will apply the tag and move on.


Some miscellaneous thoughts and observations:

ADDITIONAL HEARSAY

I’d heard via hearsay that one reason the AI-detectors are unreliable is because they give different results on different days. This was something I tested for myself. For the two reliable detectors, I found them to give the exact same numbers when fed the same material on different days, even down to the percentage point.

So I suspect that the ‘different results on different days’ issue is probably due to the AI-detectors being updated between tests, with the newer model of a detector giving a different number than the older one did.

RE-CHECKING THE DETECTORS

I found that, when testing the AI detectors, I could usually tell if a detector would be accurate within the first five tests.

I’d say that a round of twenty tests (ten human-written stories + ten AI stories) would be enough to predict the trend of how useful that detector would end up being.

(For example, I was able to spot the ‘reliable’ detectors very early on, and they continued to prove reliable through all one-hundred-and-fifty stories I ran them through.)

What this means, to me, is that if the two reliable detectors get updated and stop being reliable, we could re-run a small percentage of the tests on them, and use those smaller numbers to confirm whether the detectors are still functioning or have become unreliable. So if someone believes their story was flagged by the detectors inaccurately and would like us to re-test the detectors, we can do so to confirm they are still operating accurately.
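That spot-check could be as simple as drawing a small balanced sample from each pool and re-running it, something like this (the pool contents, sample size, and function name are all hypothetical):

```python
import random

def spot_check_sample(human_pool, ai_pool, n_each=10, seed=None):
    """Draw a small balanced re-test sample (10 human + 10 AI by default)
    to confirm a detector still behaves the way it did in the full run.
    Pass a `seed` to make the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(human_pool, n_each), rng.sample(ai_pool, n_each)
```

Per the observation above, a round of about twenty re-tests should be enough to tell whether a detector’s trend has changed.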

AVOIDING FUTURE MASS-TESTING (FOR NOW)

After some thought, I’ve made a conscious choice to avoid sharing the names of the seven AI-detectors I tested. I’m doing this because I want to discourage users from needlessly running the site’s stories through these detectors themselves. Something which did not occur to me until I’d nearly concluded my tests (and which Derek Williams then pointed out to me) is that AI-detectors can potentially train on what you feed into them. Like the rest of the internet, our site has been scraped several times over by AI models looking to train on stolen material. So yes, it’s likely that whatever data the AI-detectors are working with unfortunately already includes our stories…

…but still, I would have structured my testing process differently if I’d been aware of this detail earlier.

So, just in case, I’d like to avoid feeding the AI models further beyond what feels necessary. That is why I’m discouraging mass testing by our reader base for now.

That being said, if you are an author who published stories prior to November 2022 (aka, prior to ChatGPT’s release), and you don’t mind if your stories are used as part of a future control group when testing AI-detectors: then please let us know in this thread!

We’re a large and robust community of authors, and I’m certain we can solve this dilemma with a bit of crowdsourcing. :blush:

Thank you!
