Gentoo Forums
Any programs to identify duplicated content?
Goverp
Advocate


Joined: 07 Mar 2007
Posts: 2115

Posted: Fri Aug 16, 2024 4:43 pm    Post subject: Any programs to identify duplicated content?

My problem is about 0.5 GB of stuff spread across 24 Word files. OK, I'm converting them to LibreOffice .odt as I go.
The author/curator suffers from verbal diarrhoea, a lack of organization, and a bad memory, so text is repeated all over the place, often verbatim ('cos cut and paste is so easy).

Does anyone know of software to identify duplicated text within files?

This is NOT a request for awk/sed/bash/perl/whatever scripts. This stuff is embedded in the .docx/.odt files. I could probably profitably "normalize" them by printing to .ps and searching that. But whatever, it's likely to need a purpose-built application.

I think I read about something related, based on software that identifies common sequences of bases in DNA, which would probably be able to handle "near misses" resulting from formatting tweaks, typo corrections and the like. This seems to be a family of programs called BLAST, but (blast it!) I suspect it only supports small fixed alphabets (the 4 DNA bases, or the 20 amino acids of proteins) rather than arbitrary text. Or maybe Blat, or Blas; there seems to be a similar problem with these application names!

Another approach might be plagiarism detectors (they seem to be described as lexical similarity testers), which exist in Ubuntu (similarity-tester) and elsewhere (Sherlock, WCopyfind, ospc), but AFAICT none of them is packaged in Gentoo.

This site for WCopyfind seems useful. I might play with that.

An update, even before the first post!
WCopyfind is a Windows .exe file. The 64-bit version runs happily under wine, and produces a directory containing a tree of web pages that show the scores and files with highlighted duplicates. Just what I wanted!

I'll post this to help anyone with a similar problem.
_________________
Greybeard
krumpf
Apprentice


Joined: 15 Jul 2018
Posts: 182

Posted: Fri Aug 16, 2024 6:05 pm

Not sure this is helpful, especially since you already found a solution, but wouldn't app-misc/rdfind suit your needs?
_________________
Dragon Princess Music Games Heroes and villains
Goverp
Advocate


Joined: 07 Mar 2007
Posts: 2115

Posted: Fri Aug 16, 2024 7:24 pm

Thanks for the suggestion, but that's to identify and/or eliminate duplicate files (replacing them with links, IIRC); my problem was to find long strings of text embedded in one file that occur in other files - typically because the author pasted an email into one document, and then later, forgetting it was there, pasted it into another file.
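
For anyone who wants to see the general idea of matching long runs of words between two documents, here is a minimal Python sketch. It is an illustration only and not how WCopyfind or any of the tools above actually work. It assumes the text has already been extracted from the .docx/.odt files into plain strings; the function names and the min_words threshold are made up for the example.

```python
import re
from collections import defaultdict


def words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shared_passages(text_a, text_b, min_words=6):
    """Return runs of at least min_words consecutive words found in both texts."""
    a, b = words(text_a), words(text_b)

    # Index every min_words-long window of b by its starting position.
    index = defaultdict(list)
    for j in range(len(b) - min_words + 1):
        index[tuple(b[j:j + min_words])].append(j)

    passages = []
    i = 0
    while i <= len(a) - min_words:
        best = 0
        for j in index.get(tuple(a[i:i + min_words]), ()):
            # Extend the match for as long as both texts keep agreeing.
            n = min_words
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1
            best = max(best, n)
        if best:
            passages.append(" ".join(a[i:i + best]))
            i += best  # skip past the passage just reported
        else:
            i += 1
    return passages
```

Because punctuation and case are dropped before comparing, a pasted email still matches after minor reformatting, but a corrected typo inside the passage splits it into two shorter matches (or hides it if the pieces fall below min_words). Handling that kind of "near miss" properly is where the purpose-built tools earn their keep.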
_________________
Greybeard
All times are GMT