How can you tell if text is AI-generated? Researchers have figured out a new method

Have you ever looked at a piece of writing and thought something might be “off”? It might be hard to pinpoint exactly what it is. There might be too many adjectives or the sentence structure might be overly repetitious. It might get you thinking, “Did a human write this or was it generated by artificial intelligence?”

In a new paper, researchers at Northeastern University set out to make it a little easier to answer that question by analyzing the syntax, or sentence structure, in AI-generated text. What they found is that AI models tend to produce specific patterns of nouns, verbs and adjectives more frequently than humans.

The work is published on the arXiv preprint server.

“It empirically validates the sense that a lot of these generations are formulaic,” says Byron Wallace, director of Northeastern’s data science program and the Sy and Laurie Sternberg interdisciplinary associate professor. “Literally, they’re formulaic.”

It’s already well known that AI models tend to repeat certain words; ChatGPT went through a period where it constantly used “delve into,” Wallace says. But word choice alone is “not really capturing the whole story” when it comes to identifying AI-generated text, he adds. Wallace and Chantal Shaib, a Ph.D. student at Northeastern who led this research, decided to look beyond which words an AI model chooses and focus on syntax.

The researchers prompted a wide range of AI models to produce certain kinds of texts, like summaries of movie reviews and news articles or biomedical research. They then analyzed all of the AI-generated text and identified what they call syntactic templates, certain sequences of parts of speech that get repeated by AI models.
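To make the idea concrete, here is a minimal sketch of what counting syntactic templates can look like in practice: tag every word with its part of speech and count how often short sequences of tags repeat. It uses NLTK’s off-the-shelf tagger and a four-tag window purely for illustration; the paper’s actual definitions, template lengths and measurement procedure may differ.

from collections import Counter

import nltk

# Tokenizer and tagger models (resource names vary slightly across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)


def pos_ngram_counts(text, n=4):
    """Tag each word with its part of speech and count length-n tag sequences."""
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]  # e.g. ['DT', 'JJ', 'CC', 'JJ', 'NN']
    return Counter(zip(*(tags[i:] for i in range(n))))


sample = (
    "A unique and intense viewing experience. "
    "A highly original and impressive debut. "
    "A magical and thought-provoking film."
)

# Tag sequences that repeat (count > 1) are the kind of formulaic pattern the study
# measures; the paired-adjective construction above should surface as one of them.
for template, count in pos_ngram_counts(sample).most_common():
    if count > 1:
        print(" ".join(template), count)

A human writer and an AI model will both produce some repeated tag sequences this way; the study’s point is that the models repeat them at a much higher rate.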

The types of syntactic patterns produced in the text varied from AI model to AI model. It was almost like “each model has its own signature,” Wallace says. In some cases, that signature looked like pairs of adjectives. A movie review summary of “The Last Black Man in San Francisco” described the film as a “unique and intense viewing experience,” a “highly original and impressive debut” for the director, and “magical and thought-provoking,” all within the span of two paragraphs.

“Humans can also produce these templates,” Shaib says. “They can have repeated syntax in their writing, but it’s at a much lower rate than what the models produce.”

Shaib adds that the size of an AI model didn’t impact how likely it was to produce these templates. Every model they analyzed tended to repeat syntactic patterns at a higher rate than humans.

However, the gap between how often AI models and humans used those patterns depended on the style of writing examined. The gap was much smaller in biomedical writing, which follows a specific style guide. In movie reviews and news articles, genres where writers can get more creative, AI models far exceeded humans in producing the same patterns, Shaib says.

So where are these templates coming from?

“What we found is that actually, it’s not something that the model makes up during the generation process,” Shaib says. “We were able to find about 75% of these templates in the training data.”

Shaib acknowledges that this research is not about creating a foolproof method for determining whether a piece of text is AI-generated. However, it offers a new way to talk about AI-generated text, widening the frame from specific word choices to an entire style of writing.

“The biggest takeaway from this is it gives us a tool to talk about exactly why certain texts just seem kind of off to us, especially when we see a lot of them in a row,” Shaib says. “It gives us a methodology to actually analyze what’s going on here as opposed to just relying on a feeling.”

More information:
Chantal Shaib et al, Detection and Measurement of Syntactic Templates in Generated Text, arXiv (2024). DOI: 10.48550/arXiv.2407.00211

Journal information:
arXiv

Provided by
Northeastern University

This story is republished courtesy of Northeastern Global News news.northeastern.edu.

