Data mining reveals too many similarities between papers
If copying is the sincerest form of flattery, then journals are publishing a lot of amazingly flattering science. Of course, to most of us, the authors of such reports would best be labeled plagiarists, warranting censure, not praise.
But Harold R. Garner and his colleagues at the University of Texas Southwestern Medical Center at Dallas aren’t calling anybody names. They’re just posting a large and growing bunch of research papers — pairs of them — onto the Internet and highlighting patches in each that are identical.
Says Garner: “We’re pointing out possible plagiarism. You be the judge.” But this physicist notes that in terms of wrongdoing, authors of the newer paper in most pairs certainly appear to have been “caught with their hands in the cookie jar.”
Garner's team developed data-mining software about eight years ago that allows a researcher to input lots of text (the entire abstract of a paper, for instance) and ask the program to compare it to everything posted on a database such as the National Library of Medicine's MEDLINE, which abstracts all major biomedical journal articles. The software then looks for matches to words, phrases, numbers, anything, and pulls up entries that are similar. The idea: to help scientists find papers that offer similar findings, contradictions, even speculations that might suggest promising new directions in a given research field.
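To make the idea of comparing one abstract against a whole database concrete, here is a minimal sketch in Python. It is not the Garner lab's algorithm, which this post does not describe; it swaps in a common stand-in technique, Jaccard similarity over three-word "shingles." The function names, the 0.5 threshold, and the sample abstracts are all hypothetical and invented for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap / size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_matches(query, corpus, k=3, threshold=0.5):
    """Return (score, doc_id) pairs scoring at or above threshold, best first."""
    q = shingles(query, k)
    scored = [(jaccard(q, shingles(text, k)), doc_id)
              for doc_id, text in corpus.items()]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# Invented abstracts, for illustration only (not real MEDLINE records).
query = "Aspirin reduced the risk of stroke in 120 patients over five years."
corpus = {
    "B": "Aspirin reduced the risk of stroke in 120 patients over two years.",
    "C": "A new method for imaging zebrafish embryos is described.",
}

matches = rank_matches(query, corpus)
for score, doc_id in matches:
    print(f"{doc_id}: {score:.2f}")
```

In this toy run only abstract B clears the threshold, because it shares most of its three-word runs with the query. A high score flags a pair for a human to inspect; it does not by itself establish plagiarism.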
Early on, Garner says, his team realized this software could also flag potential plagiarism. But that was not their first priority. In fact, his group didn't really begin looking in earnest for signs of copycatting until about two years ago.
Today, Garner’s group has published a short paper in Science on results of a survey it conducted among authors of pairs of remarkably similar papers (identified from MEDLINE), and the editors who published those papers. The Texas team wanted to find out whether the apparent copycats — not only the authors but also the editors who published their work — would own up to plagiarism. And once confronted with this public finger pointing, what would they do about it?
The real surprise, says Garner (indeed, “the shock”), was that so few authors of the initial papers were aware of the copycats’ antics. Some 93 percent said they had been unaware of the newer paper before receiving emailed PDFs that highlighted the identical passages in each set of paired papers.
Since those newer papers were all available via MEDLINE searches, they should have come up every time authors of the first paper searched for work on topics related to their own. In fact, Garner points out, because MEDLINE posts search results in reverse chronological order, copycatted papers should turn up before the papers on which they had been based.
To date, 83 of the 212 pairs of largely identical papers that the team’s data-mining software has identified have triggered formal investigations by the journals involved. In 46 instances, editors of the second papers have issued retractions. However, what constituted a retraction varied considerably. It might have been broad publication of problems with the offending second paper, both in the journal and in a notice sent to MEDLINE.
Other times, some website might have acknowledged the retraction of some or all of a paper, with no notification of the problem forwarded to MEDLINE. In such cases, Garner notes, anyone using MEDLINE's search function would get no warning that the abstract it pulled up relates to findings that have been discredited.
I asked Garner whether his team had ever shared this material on apparent plagiarism with the administrators of the second papers' authors. "No, that would have put us into this situation where we would be acting more as police or an investigatory body," he said. And they're not eager to serve as honesty cops.
So far, his team's software has turned up more than 9,000 'highly similar' papers in biomedical journals indexed by MEDLINE. And only 212 are copycats? Actually, Garner says, that estimate is probably way low. Of that big number, "We have only gotten through looking at 212 so far." Their investigations continue.
For more on the implications of such copycatting, check out my next post.
Long, T.C., . . . and H.R. Garner. 2009. Responding to Possible Plagiarism. Science 323(March 6):1293.
Deja Vu: A Database of Highly Similar and Duplicate Citations. This is being compiled using data mining software that has been developed by the Harold Garner lab at the University of Texas, Southwestern Medical Center in Dallas.