Theses
Ivan Tokić
Međujezično otkrivanje plagijata s Wikipedije
Cross-Lingual Plagiarism Detection from Wikipedia
2017
Undergraduate
Jan Šnajder
FER
FER2
5331
49
EN
Detekcija plagijata je zadatak iz područja analize prirodnih jezika koji se bavi pronalaženjem plagijariziranih rečenica, paragrafa te komada teksta u sumnjivom dokumentu i dohvaćanjem njihovih izvora. U ovom slučaju uzimamo u obzir radove na hrvatskom s detekcijom plagijariziranih izvora na engleskom. Pošto je Wikipedija jedan od najčešćih izvora plagijarizma, odabrana je kao tijelo teksta u odnosu na koje izvršavamo detekciju, s ograničenjem na podskup Wikipedije koji se bavi temama iz područja strojnog učenja. Ovaj rad istražuje prijevodni model koji koristi sintaktičke i semantičke značajke u kombinaciji s latentnom semantičkom analizom.
Plagiarism detection is a natural language processing task that aims to find plagiarised sentences, paragraphs, or text fragments in a suspicious document and to retrieve their sources. Cross-lingual plagiarism detection extends this task to sources written in a language other than that of the suspicious document. In this particular case, we consider papers written in Croatian and detect plagiarised English sources. As Wikipedia is one of the most common sources of plagiarism, it was chosen as the corpus against which detection is performed, restricted to the subset of articles pertaining to machine learning. The work explores a translation-based model that combines a number of syntactic and semantic features with latent semantic analysis.
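To make the role of latent semantic analysis more concrete, the following is a minimal sketch and not the implementation used in the thesis. It assumes the texts have already been brought into a single language, uses raw term counts rather than any particular weighting, and computes the truncated decomposition with a simple power iteration so that it runs with the standard library only. All function names are hypothetical; in practice a library SVD routine would normally be used instead.

```python
import math


def term_document_matrix(docs):
    """Rows are terms, columns are documents; entries are raw term counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    row = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, d in enumerate(docs):
        for w in d.lower().split():
            matrix[row[w]][j] += 1.0
    return matrix


def gram(matrix):
    """Document-by-document matrix A^T A of the term-document matrix A."""
    n = len(matrix[0])
    return [[sum(r[i] * r[j] for r in matrix) for j in range(n)] for i in range(n)]


def top_eigenpairs(g, k, iterations=200):
    """Power iteration with deflation (assumption: small symmetric matrix,
    start vector not orthogonal to the dominant eigenvector)."""
    n = len(g)
    g = [r[:] for r in g]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _ in range(iterations):
            w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[i] * sum(g[i][j] * v[j] for j in range(n)) for i in range(n)
        )
        pairs.append((eigenvalue, v))
        # Deflate so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                g[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def lsa_document_vectors(docs, k):
    """Coordinates of each document in a k-dimensional latent space.
    The singular values of A are the square roots of the eigenvalues of A^T A."""
    pairs = top_eigenpairs(gram(term_document_matrix(docs)), k)
    return [
        [math.sqrt(max(lam, 0.0)) * vec[j] for lam, vec in pairs]
        for j in range(len(docs))
    ]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom > 0 else 0.0
```

Documents that share vocabulary end up close together in the latent space, so the cosine between their vectors can serve as one semantic similarity feature among others.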
međujezična detekcija plagijata, latentna semantička analiza, procesiranje prirodnih jezika, strojno učenje, Wikipedija, hrvatski jezik
cross-lingual plagiarism detection, latent semantic analysis, natural language processing, machine learning, Wikipedia, Croatian language
5.7.2017.
Plagiarism detection is an authorship analysis task that aims at determining the originality of text using natural language processing techniques. In extrinsic plagiarism detection, plagiarized text is detected by computing the semantic similarity between a suspicious text and candidate source texts. Ideally, such systems are capable of discovering not only text fragments that are identical to each other, but also fragments that are not identical but semantically similar, i.e., paraphrased. Furthermore, systems for cross-lingual extrinsic plagiarism detection can analyze texts written in different languages, a case that often arises in practice.
The topic of the thesis is cross-lingual plagiarism detection with English Wikipedia as the source and students' theses in Croatian as the target. Do a literature survey on methods for extrinsic plagiarism detection, including monolingual and cross-lingual methods. Devise and implement a cross-lingual semantic similarity model and a method for cross-lingual plagiarism detection based on machine learning. For cross-lingual semantic similarity, you may rely on publicly available machine translation services or online dictionaries. Choose a couple of topics from computer science and compile a suitable test collection for system evaluation. Additionally, you should evaluate the system on a collection of student theses. All references must be cited, and all source code, documentation, executables, and datasets must be provided with the thesis.
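The dictionary-based route to cross-lingual similarity mentioned above could be sketched roughly as follows. This is an illustration under stated assumptions, not the method developed in the thesis: the dictionary is a hypothetical toy lookup table standing in for an online dictionary or machine translation service, similarity is plain Jaccard overlap of word sets, and a fixed threshold stands in for the machine learning model that the assignment asks for.

```python
import re

# Hypothetical toy Croatian-to-English dictionary; a real system would query
# a machine translation service or a full bilingual dictionary instead.
TOY_DICTIONARY = {
    "strojno": "machine",
    "učenje": "learning",
    "je": "is",
    "grana": "branch",
    "umjetne": "artificial",
    "inteligencije": "intelligence",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def translate_tokens(tokens, dictionary):
    """Word-by-word lookup; unknown words are kept unchanged."""
    return [dictionary.get(t, t) for t in tokens]


def jaccard(a, b):
    """Jaccard overlap of two token collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def detect(suspicious_sentences, source_sentences, dictionary, threshold=0.5):
    """Return (suspicious sentence, best source sentence, score) for every
    suspicious sentence whose best match reaches the threshold.
    The threshold value is an assumption chosen for illustration."""
    hits = []
    for sentence in suspicious_sentences:
        translated = translate_tokens(tokenize(sentence), dictionary)
        best_score, best_source = max(
            ((jaccard(translated, tokenize(src)), src) for src in source_sentences),
            default=(0.0, None),
        )
        if best_score >= threshold:
            hits.append((sentence, best_source, best_score))
    return hits
```

In a learned detector, the overlap score would more plausibly be one input feature to a classifier rather than being compared against a hand-set threshold.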