Theses
David Lozić
Intrinzično otkrivanje plagijata u studentskim radovima
Intrinsic Plagiarism Detection in Student Theses
2017
Undergraduate
Jan Šnajder
FER
FER2
5324
32
EN
Cilj intrinzičnog otkrivanja plagijata je prepoznavanje krađe unutar dokumenta bez pomoći referentnih tekstova. U ovom radu su predložena dva pristupa. Prvi koristi one-class SVM u prepoznavanju stršećih vrijednosti, dok drugi radi klasifikaciju koristeći niz klasifikatora. Za potrebe zadatka stvoren je umjetni skup podataka gdje su dokumenti analizirani na razini rečenice koristeći tehniku klizećeg prozora. Iako su postignuti pozitivni rezultati, sustav nije dovoljno učinkovit za samostalnu detekciju plagijata. F1 rezultat prvog modela je 0.3, dok je F1 rezultat drugog modela 0.25 za klasu plagijata.
The goal of intrinsic plagiarism detection is to uncover plagiarism without the aid of external references by analyzing the discrepancies within a single document. In this thesis, two approaches are proposed. The first focuses on outlier detection using a one-class SVM, and the second performs classification using several classifiers. An artificial dataset was created for the task, and the documents are analyzed at the sentence level using a sliding window. Although some positive results were obtained, the system alone is not accurate enough for confident plagiarism detection at the sentence level. On the plagiarism class, the first model achieved an F1 score of 0.3 and the second model an F1 score of 0.25.
obrada prirodnog jezika, strojno učenje, detekcija plagijata, intrinzična detekcija plagijata, hrvatski jezik, SVM, one-class SVM
natural language processing, machine learning, plagiarism detection, intrinsic plagiarism detection, Croatian language, SVM, one-class SVM
6.7.2017.
Plagiarism detection is an authorship analysis task that aims to determine the originality of a text using natural language processing techniques. In intrinsic plagiarism detection, plagiarized text is detected through a statistical stylometric analysis of the text itself. Namely, inconsistencies in the statistical style features of different text fragments indicate which parts of the text might have been plagiarized. In contrast to extrinsic procedures, intrinsic plagiarism detection can be used even when the original text is not available.
The topic of this thesis is intrinsic plagiarism detection for the Croatian language based on the analysis of stylometric features and machine learning. Do a literature survey on methods for intrinsic plagiarism detection as well as methods for computational stylometric analysis. Devise a system for intrinsic plagiarism detection in student theses. Compile a suitable test collection of student thesis texts written by a number of different authors. Implement the system and carry out an experimental evaluation on the test collection. Design an application programming interface in such a way that the system can be used as a stand-alone module. All references must be cited, and all source code, documentation, executables, and datasets must be provided with the thesis.
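The sentence-level, sliding-window analysis described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the two style features (mean word length and type-token ratio), the naive sentence splitter, the window size, and the z-score threshold are all assumptions chosen for illustration, and a simple z-score test stands in for the one-class SVM so that the example needs only the Python standard library.

```python
import re
import statistics


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def window_features(sentences):
    # Two toy stylometric features for a window of sentences:
    # mean word length and type-token ratio.
    words = [w.lower() for s in sentences for w in re.findall(r"\w+", s)]
    if not words:
        return (0.0, 0.0)
    mean_len = sum(len(w) for w in words) / len(words)
    ttr = len(set(words)) / len(words)
    return (mean_len, ttr)


def sliding_windows(sentences, size=3):
    # One window per sentence, centred on it and clipped at the document edges.
    half = size // 2
    return [sentences[max(0, i - half): i + half + 1]
            for i in range(len(sentences))]


def flag_outliers(text, size=3, threshold=2.0):
    # Flag sentences whose window deviates from the document-wide mean
    # by more than `threshold` standard deviations in any feature.
    # (Stand-in for a one-class SVM; hypothetical helper, not from the thesis.)
    sentences = split_sentences(text)
    feats = [window_features(w) for w in sliding_windows(sentences, size)]
    flagged = set()
    for dim in range(2):
        column = [f[dim] for f in feats]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        if std == 0:
            continue  # feature is constant, so it carries no signal
        for i, value in enumerate(column):
            if abs(value - mean) / std > threshold:
                flagged.add(i)
    return sentences, sorted(flagged)
```

Calling `flag_outliers(document_text)` returns the split sentences together with the indices of those whose surrounding window looks stylistically unusual relative to the rest of the same document. No reference corpus is consulted, which is the defining property of the intrinsic setting.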