Goverp Advocate
Joined: 07 Mar 2007 Posts: 2177
Posted: Fri Aug 16, 2024 4:43 pm Post subject: Any programs to identify duplicated content?
My problem is about 0.5 GB of stuff spread across 24 Word files. OK, I'm converting them to LibreOffice .odt as I go.
The author/curator suffers from verbal diarrhoea, a lack of organization, and a bad memory, so text is repeated all over the place, often verbatim ('cos cut and paste is so easy).
Does anyone know of software to identify duplicated text within files?
This is NOT a request for awk/sed/bash/perl/whatever scripts. This stuff is embedded in the .docx/.odt files. I could probably profitably "normalize" them by printing to .ps and searching the result. But whatever, it's likely to need a purpose-built application.
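A minimal sketch of the "normalize" idea, for anyone who wants plain text without a print-to-.ps detour. It relies on the fact that .docx and .odt files are both zip archives holding the body as XML (`word/document.xml` and `content.xml` respectively). The function name `extract_text` is hypothetical, and treating every element named `p` or `h` as a paragraph is a simplifying assumption: tabs and line breaks stored as empty elements are lost, and text boxes nested inside paragraphs may appear twice.

```python
import os
import re
import zipfile
import xml.etree.ElementTree as ET

# Archive member that holds the main document body for each format.
BODY_PART = {".docx": "word/document.xml", ".odt": "content.xml"}


def extract_text(path):
    """Return the body text of a .docx or .odt file as one whitespace-normalized string."""
    suffix = os.path.splitext(path)[1].lower()
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read(BODY_PART[suffix]))

    paragraphs = []
    for elem in root.iter():
        # Tags look like "{namespace}p"; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name in ("p", "h"):  # paragraph or (ODF) heading: an assumption
            # Join runs without a separator so formatting changes mid-word
            # do not split the word.
            paragraphs.append("".join(elem.itertext()))

    # Collapse all whitespace so layout differences do not hide duplicates.
    return re.sub(r"\s+", " ", " ".join(paragraphs)).strip()
```

The resulting strings can then be fed to any text-comparison tool, including the ones mentioned further down.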
I think I read about something related based on software to identify common sequences of bases in DNA, which would probably be able to handle "near misses" resulting from formatting tweaks, typo corrections and the like. This seems to be a family of programs called BLAST, but (blast it!) I suspect it only supports alphabets with 4 characters (the 4 nucleotide bases...) Or maybe Blat, or Blas, there seems to be a similar problem with these application names!
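To illustrate the sequence-alignment idea: BLAST is a fast heuristic for local alignment, and the underlying technique does not depend on the alphabet size. The sketch below is a plain Smith-Waterman local alignment over lists of words, not BLAST itself. The function name `local_align` and the scoring values are assumptions chosen for illustration, and the full score matrix makes it suitable only for passage-sized inputs, not 0.5 GB of text.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment of two word lists.

    Returns (score, a_slice, b_slice) for the best-scoring pair of regions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero, which is what makes
    # the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end_i], b[j:end_j]


original = "please find attached the quarterly report for review".split()
pasted = "intro text find attached the quartely report for review thanks".split()
score, left, right = local_align(original, pasted)
print(score, " ".join(left), "|", " ".join(right))
```

With these scores a single typo ("quartely") costs one point but does not break the match, which is the "near miss" behaviour described above.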
Another approach might be plagiarism detectors (seem to be described as lexical similarity testers), which seem to exist in Ubuntu (similarity-tester) and elsewhere (Sherlock, wcopyfind, ospc), but no version AFAICT in Gentoo.
This site for WCopyfind seems useful. I might play with that.
An update, even before the first post!
WCopyfind is a Windows .exe file. The 64-bit version runs happily under wine, and produces a directory containing a tree of web pages that show the scores and files with highlighted duplicates. Just what I wanted!
I'll post this to help anyone with a similar problem.
_________________
Greybeard
krumpf Apprentice
Joined: 15 Jul 2018 Posts: 187
Goverp Advocate
Joined: 07 Mar 2007 Posts: 2177
Posted: Fri Aug 16, 2024 7:24 pm
Thanks for the suggestion, but that's to identify and/or eliminate duplicate files (replacing them with links, IIRC); my problem was to find long strings of text embedded in one file that occur in other files - typically because the author pasted an email into one document, and then later, forgetting it was there, pasted it into another file.
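The distinction can be made concrete with a small sketch: instead of comparing whole files, compare overlapping runs of n words ("shingles") and report the runs that turn up in more than one document. This is only an illustration of the general phrase-matching idea, not a description of how WCopyfind works internally. The names `shingles` and `shared_phrases` are hypothetical, and the inputs are assumed to be plain text already extracted from the documents.

```python
import re
from collections import defaultdict


def shingles(text, n):
    """Return the set of all n-word phrases in text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(docs, n=6):
    """Map each n-word phrase found in two or more documents to those documents' names.

    docs is a dict of {document name: plain text}.
    """
    seen = defaultdict(set)
    for name, text in docs.items():
        for phrase in shingles(text, n):
            seen[phrase].add(name)
    return {phrase: sorted(names) for phrase, names in seen.items() if len(names) > 1}


docs = {
    "a.odt": "Dear Bob, the meeting is moved to Friday. Regards",
    "b.odt": "Notes. The meeting is moved to Friday, apparently.",
}
for phrase, names in sorted(shared_phrases(docs, n=4).items()):
    print(phrase, "->", names)
```

A larger n cuts down on chance matches of common phrases, at the cost of missing short pasted fragments.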