Ask HN: Tool to find text reuse, similar paragraphs, fuzzy/near dupes in folder?
Do you know of any tool that I can use to search my own vault of notes and documents for copied paragraphs or nearly identical phrases? Normal diffing or hashing won't work, since the documents are slightly modified copies of one another and each file has to be compared against all the others.
I found the following tools, which seem related but are not quite what I need. Maybe I'm missing a particular term of art?
https://github.com/YaleDHLab/intertext
Python app. It requires loading and tagging a corpus of text, and it is used to compare different works visually.
https://github.com/e-orlov/neardup
CLI Java tool. It looks like a re-upload to GitHub of an older project. I haven't tried it.
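To make the goal concrete, here is a rough sketch of the kind of comparison I have in mind. It is not taken from either tool above. It assumes paragraphs are separated by blank lines and that "similar" can be approximated by the Jaccard overlap of 3-word shingles. The function names, the `*.md` pattern, and the 0.5 threshold are all placeholders I made up for illustration.

```python
import itertools
import re
from pathlib import Path


def paragraphs(text):
    """Split text into paragraphs on blank lines, dropping empty ones."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def shingles(paragraph, k=3):
    """Return the set of k-word shingles (overlapping word n-grams)."""
    words = re.findall(r"\w+", paragraph.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole paragraph as one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Compare every paragraph against every paragraph of the other files.

    docs maps a (hypothetical) file name to its text. Returns a list of
    (score, name_a, para_a, name_b, para_b), highest score first.
    """
    items = [
        (name, para, shingles(para, k))
        for name, text in docs.items()
        for para in paragraphs(text)
    ]
    hits = []
    for (na, pa, sa), (nb, pb, sb) in itertools.combinations(items, 2):
        if na == nb:
            continue  # only report reuse across different files
        score = jaccard(sa, sb)
        if score >= threshold:
            hits.append((score, na, pa, nb, pb))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


def load_folder(folder, pattern="*.md"):
    """Read every file matching pattern under folder into a name -> text dict."""
    return {
        str(path): path.read_text(encoding="utf-8", errors="ignore")
        for path in Path(folder).rglob(pattern)
    }
```

Usage would be something like `near_duplicates(load_folder("my_vault"))`. This compares every pair of paragraphs, so the cost grows quadratically with the size of the vault; it is only meant to describe the output I'm after, not to be the tool itself.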