Theses

Student

Ivan Tokić

Title HR

Međujezično otkrivanje plagijata s Wikipedije

Title EN

Cross-Lingual Plagiarism Detection from Wikipedia

Year

2017

Level

Undergraduate

Supervisor

Jan Šnajder

Co-supervisor

Hands-on assistant

Study Programme

FER

Programme

FER2

Thesis ID

5331

Number of pages

Language

Abstract HR

Detekcija plagijata je zadatak iz područja analize prirodnih jezika koji se bavi pronalaženjem plagijariziranih rečenica, paragrafa te komada teksta u sumnjivom dokumentu i dohvaćanjem njihovih izvora. U ovom slučaju uzimamo u obzir radove na hrvatskom s detekcijom plagijariziranih izvora na engleskom. Pošto je Wikipedija jedan od najčešćih izvora plagijarizma, odabrana je kao tijelo teksta u odnosu na koje izvršavamo detekciju, s ograničenjem na podskup Wikipedije koji se bavi temama iz područja strojnog učenja. Ovaj rad istražuje prijevodni model koji koristi sintaktičke i semantičke značajke u kombinaciji s latentnom semantičkom analizom.

Abstract EN

Plagiarism detection is a natural language processing task that aims to find plagiarised sentences, paragraphs, or text fragments in a suspicious document and to retrieve their sources. Cross-lingual plagiarism detection extends this by considering source works written in languages other than that of the document being checked. In this particular case, we consider Croatian papers and detect plagiarised English sources. As Wikipedia is one of the most common sources of plagiarism, it was chosen as the corpus against which detection is performed, restricted to a subset of it pertaining to machine learning. The work explores a translation model that uses a number of syntactic and semantic features in combination with latent semantic analysis.

Keywords HR

međujezična detekcija plagijata, latentna semantička analiza, procesiranje prirodnih jezika, strojno učenje, Wikipedija, hrvatski jezik

Keywords EN

cross-lingual plagiarism detection, latent semantic analysis, natural language processing, machine learning, Wikipedia, Croatian language

Defense date

5.7.2017.

Thesis task HR

Thesis task EN

Plagiarism detection is an authorship analysis task that aims at determining the originality of a text using natural language processing techniques. In extrinsic plagiarism detection, plagiarized text is detected by computing the semantic similarity between two texts. Ideally, such systems are capable of discovering not only text fragments that are identical to each other, but also fragments that are not identical but semantically similar, i.e., paraphrased. Furthermore, systems for cross-lingual extrinsic plagiarism detection can analyze texts written in different languages - a case that often arises in practice. The topic of the thesis is cross-lingual plagiarism detection with English Wikipedia as the source and students' theses in Croatian as the target. Do a literature survey on methods for extrinsic plagiarism detection, including monolingual and cross-lingual methods. Devise and implement a cross-lingual semantic similarity model and a method for cross-lingual plagiarism detection based on machine learning. For cross-lingual semantic similarity, you may rely on publicly available machine translation services or online dictionaries. Choose a couple of topics from computer science and compile a suitable test collection for system evaluation. Additionally, you should evaluate the system on a collection of student theses. All references must be cited, and all source code, documentation, executables, and datasets must be provided with the thesis.

Publicly available

Published paper(s)

File

TakeLab-ZR-2017-IvanTokic.pdf