Detecting copied or similar text blocks

Submitted by 时光总嘲笑我的痴心妄想 on 2021-01-27 13:43:03

Question


I have a bunch of texts about programming in Markdown format. There is a build process that converts those texts into Word/HTML and also applies simple validation rules, such as spell checking or checking whether a document has the required header structure. I would like to extend that build code to also check for copy-pasted or similar chunks across all the texts.

Is there any existing Java/Groovy library that can help me with that analysis?

My first idea was to use PMD's CopyPasteDetector (CPD), but it is geared towards analysing real source code, and I don't see how to use it on ordinary text.


Answer 1:


You might want to try Dude, my own quick and dirty duplication detector for text files. Besides giving you a quick estimate of how much two text files share, it can also detect copying across a set of files and draw a graph of the sharing relations.




Answer 2:


I ended up using CPD and Groovy after all. Here is the code in case someone is interested:

import net.sourceforge.pmd.cpd.Tokens
import net.sourceforge.pmd.cpd.TokenEntry
import net.sourceforge.pmd.cpd.Tokenizer
import net.sourceforge.pmd.cpd.CPDNullListener
import net.sourceforge.pmd.cpd.MatchAlgorithm
import net.sourceforge.pmd.cpd.SourceCode
import net.sourceforge.pmd.cpd.SourceCode.StringCodeLoader
import net.sourceforge.pmd.cpd.SimpleRenderer

// Prepare empty token data.
TokenEntry.clearImages()
def tokens = new Tokens()

// List all source files with text.
def source = new TreeMap<String, SourceCode>()
new File('.').eachFile { file ->
  if (file.isFile() && file.name.endsWith('.txt')) {
    def analyzedText = file.text
    def sourceCode = new SourceCode(new StringCodeLoader(analyzedText, file.name))
    source.put(sourceCode.fileName, sourceCode)
    analyzedText.eachLine { line, lineNumber ->
      // Split on runs of non-word characters (\W already covers whitespace, tabs and form feeds).
      line.split('\\W+').each { token ->
        token = token.trim()
        if (token) {
          tokens.add(new TokenEntry(token, sourceCode.fileName, lineNumber + 1))
        }
      }
    }
    tokens.add(TokenEntry.getEOF())
  }
}

// Run matching algorithm.
def maxTokenChain = 15
def matchAlgorithm = new MatchAlgorithm(source, tokens, maxTokenChain, new CPDNullListener())
matchAlgorithm.findMatches()

// Produce report.
matchAlgorithm.matches().each { match ->
  println "  ========================================"
  match.iterator().each { mark ->
    println "  DUPLICATION ERROR: <${mark.tokenSrcID}:${mark.beginLine}> [DUPLICATION] Found a ${match.lineCount} line (${match.tokenCount} tokens) duplication!"
  }
  def indentedTextSlice = ""
  match.sourceCodeSlice.eachLine { line ->
    indentedTextSlice += "  $line\n"
  }
  println "  ----------------------------------------"
  println indentedTextSlice
  println "  ========================================"
}
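
For readers who want to see the underlying idea without PMD on the classpath, here is a minimal plain-Java sketch. It is a simplified illustration, not CPD's actual matching algorithm: it splits text into word tokens the same way as the script above and reports every window of a fixed number of consecutive tokens that occurs more than once. The class and method names (`TokenWindowDuplicates`, `tokenize`, `repeatedWindows`) are made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: naive repeated-token-window detection, not CPD's algorithm.
final class TokenWindowDuplicates {

    /** Splits text into word tokens, dropping punctuation and whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Maps each window of `size` tokens that occurs at least twice to its start offsets. */
    static Map<String, List<Integer>> repeatedWindows(List<String> tokens, int size) {
        Map<String, List<Integer>> seen = new LinkedHashMap<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            String window = String.join(" ", tokens.subList(i, i + size));
            seen.computeIfAbsent(window, k -> new ArrayList<>()).add(i);
        }
        // Keep only the windows that were seen more than once.
        seen.values().removeIf(offsets -> offsets.size() < 2);
        return seen;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Run the build. Then run the build again!");
        repeatedWindows(tokens, 2).forEach((window, offsets) ->
                System.out.println("\"" + window + "\" starts at token offsets " + offsets));
    }
}
```

The window size plays roughly the role of `maxTokenChain` in the script above: larger values suppress short coincidental matches. Because every window is stored as a string, this sketch is only suitable for small inputs.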



Answer 3:


You can start with a simple implementation of the Longest Common Substring (LCS) algorithm for two strings. See one Java implementation.
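
As a rough illustration of that starting point, here is a minimal dynamic-programming sketch of the longest common substring of two strings. It is not the linked implementation; the class name `LongestCommonSubstring` and the method `find` are chosen for this example only.

```java
// Illustrative sketch: longest common substring via dynamic programming, O(n*m) time.
final class LongestCommonSubstring {

    /** Returns the longest substring that occurs in both a and b ("" if there is none). */
    static String find(String a, String b) {
        // prev[j] holds the length of the longest common suffix of a[0..i-1) and b[0..j).
        int[] prev = new int[b.length() + 1];
        int bestLen = 0;
        int bestEnd = 0; // exclusive end index of the best match within a
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                    if (curr[j] > bestLen) {
                        bestLen = curr[j];
                        bestEnd = i;
                    }
                }
            }
            prev = curr;
        }
        return a.substring(bestEnd - bestLen, bestEnd);
    }

    public static void main(String[] args) {
        String first = "The build converts Markdown into HTML.";
        String second = "Our script converts Markdown into Word.";
        System.out.println("[" + find(first, second) + "]");
    }
}
```

This compares characters; for prose it may be more useful to run the same recurrence over word tokens so that matches align on word boundaries. The quadratic cost is fine for a pair of documents but becomes slow when every text is compared with every other one.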

Next, you can look into suffix arrays and the Genetics and string algorithms.
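
To give a feel for the suffix array approach, here is a small sketch that finds the longest substring occurring at least twice in a single text. It builds the suffix array naively by sorting suffixes, which is only reasonable for small inputs; real implementations use faster construction algorithms. All names (`SuffixArrayRepeats`, `suffixArray`, `longestRepeated`) are invented for this example.

```java
import java.util.Arrays;

// Illustrative sketch: naive suffix array plus adjacent-suffix comparison.
final class SuffixArrayRepeats {

    /** Returns suffix start positions sorted by the suffixes they denote (naive construction). */
    static Integer[] suffixArray(String text) {
        Integer[] starts = new Integer[text.length()];
        for (int i = 0; i < starts.length; i++) {
            starts[i] = i;
        }
        Arrays.sort(starts, (x, y) -> text.substring(x).compareTo(text.substring(y)));
        return starts;
    }

    /** Length of the common prefix of the suffixes starting at x and y. */
    static int commonPrefixLength(String text, int x, int y) {
        int n = 0;
        while (x + n < text.length() && y + n < text.length()
                && text.charAt(x + n) == text.charAt(y + n)) {
            n++;
        }
        return n;
    }

    /** Longest substring that occurs at least twice (occurrences may overlap). */
    static String longestRepeated(String text) {
        Integer[] sa = suffixArray(text);
        int bestLen = 0;
        int bestStart = 0;
        // Repeated substrings show up as common prefixes of neighbours in sorted order.
        for (int i = 1; i < sa.length; i++) {
            int len = commonPrefixLength(text, sa[i - 1], sa[i]);
            if (len > bestLen) {
                bestLen = len;
                bestStart = sa[i];
            }
        }
        return text.substring(bestStart, bestStart + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(longestRepeated("banana"));
    }
}
```

To search across several documents, one option is to concatenate them with a separator character that does not occur in any text and then map match positions back to their files.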

See also Longest Common Substring in a big text.



Source: https://stackoverflow.com/questions/17504560/detecting-copied-or-similar-text-blocks
