Theses

Student

David Lozić

Title HR

Intrinzično otkrivanje plagijata u studentskim radovima

Title EN

Intrinsic Plagiarism Detection in Student Theses

Year

2017

Level

Undergraduate

Supervisor

Jan Šnajder

Co-supervisor

Hands-on assistant

Study Programme

FER

Programme

FER2

Thesis ID

5324

Number of pages

Language

Abstract HR

Cilj intrinzičnog otkrivanja plagijata je prepoznavanje krađe unutar dokumenta bez pomoći referentnih tekstova. U ovom radu su predložena dva pristupa. Prvi koristi one-class SVM u prepoznavanju stršećih vrijednosti, dok drugi radi klasifikaciju koristeći niz klasifikatora. Za potrebe zadatka stvoren je umjetni skup podataka gdje su dokumenti analizirani na razini rečenice koristeći tehniku klizećeg prozora. Iako su postignuti pozitivni rezultati, sustav nije dovoljno učinkovit za samostalnu detekciju plagijata. F1 rezultat prvog modela je 0.3, dok je F1 rezultat drugog modela 0.25 za klasu plagijata.

Abstract EN

The goal of intrinsic plagiarism detection is to uncover theft without the aid of external references by analyzing discrepancies within a single corpus. In this thesis, two approaches are proposed. The first focuses on outlier detection using a one-class SVM, and the second performs classification using several classifiers. An artificial dataset was created for the task, and the documents are analyzed at the sentence level using a sliding window. Although some positive results were achieved, the system alone is not effective enough for confident plagiarism detection at the sentence level. The first model achieved an F1 score of 0.3 on the plagiarism class, and the second model achieved an F1 score of 0.25.

Keywords HR

obrada prirodnog jezika, strojno učenje, detekcija plagijata, intrinzična detekcija plagijata, hrvatski jezik, SVM, one-class SVM

Keywords EN

natural language processing, machine learning, plagiarism detection, intrinsic plagiarism detection, Croatian language, SVM, one-class SVM

Defense date

6.7.2017.

Thesis task HR

Thesis task EN

Plagiarism detection is an authorship analysis task that aims at determining the originality of text using natural language processing techniques. In intrinsic plagiarism detection, plagiarized text is detected based on a statistical stylometric analysis of the text. Namely, inconsistencies in the statistical style features of different text fragments indicate which parts of the text might have been plagiarized. In contrast to extrinsic procedures, intrinsic plagiarism detection can be used even when the original text is not available. The topic of this thesis is intrinsic plagiarism detection for the Croatian language based on the analysis of stylometric features and machine learning. Do a literature survey on methods for intrinsic plagiarism detection as well as methods for computational stylometric analysis. Devise a system for intrinsic plagiarism detection in student theses. Compile a suitable test collection of student thesis texts composed of texts from a number of different authors. Implement the system and carry out an experimental evaluation on the test collection. Design an application programming interface in such a way that the system can be used as a stand-alone module. All references must be cited, and all source code, documentation, executables, and datasets must be provided with the thesis.

Publicly available

Published paper(s)

File

TakeLab-ZR-2017-DavidLozic.pdf