I came across an accusation of plagiarism on the web today and thought it would be interesting to write some Python that finds texts that may plagiarize a given reference text. The obvious idea: query Google for matches.
So the first step was to find the unusual words that would form the Google signature. I tracked down a list of English word frequencies and saved it to disk. Then I wrote code to load the reference text into memory and count the occurrences of each word. The top-scoring words are the ones that occur much more often in the reference text than the word frequency list would predict.
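Here is a rough sketch of that scoring step, not the actual contents of plagiarism.py. It assumes the frequency file has one "word count" pair per line, which may not match the real file, and the names load_frequencies, unusual_words, top_n and floor are made up for illustration. Each word is scored as its observed rate in the reference text divided by its expected rate in general English.

```python
import re
from collections import Counter


def load_frequencies(lines):
    """Turn 'word count' lines into relative frequencies (assumed file format)."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        try:
            counts[word.lower()] = counts.get(word.lower(), 0) + int(count)
        except ValueError:
            continue  # count column was not an integer
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def unusual_words(text, frequencies, top_n=5, floor=1e-7):
    """Return the top_n words that are most over-represented in text.

    floor is an assumed stand-in frequency for words missing from the
    list, so unknown words rank as very unusual instead of dividing by zero.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        return []
    observed = Counter(words)
    n = len(words)
    # Score = observed rate in the text / expected rate in general English.
    scores = {w: (c / n) / frequencies.get(w, floor) for w, c in observed.items()}
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


if __name__ == "__main__":
    sample_list = ["the 1000", "cat 10", "sat 10", "quokka 1"]
    freqs = load_frequencies(sample_list)
    print(unusual_words("the quokka sat the quokka", freqs, top_n=2))
```

The handful of words this returns could then be joined into a single search query. A ratio like this favours rare words even when they appear only once, so a real version might also want a minimum count.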
Downloads: plagiarism.py • word frequency file
