Theses

Student

Martin Tutek

Title HR

Višeoznačna klasifikacija dokumenata u pojmovniku EuroVoc

Title EN

Multi-label Document Classification using EuroVoc Thesaurus

Year

2014

Level

Graduate

Supervisor

Jan Šnajder

Co-supervisor

Hands-on assistant

Study Programme

FER

Programme

FER2

Thesis ID

776

Number of pages

Language

Abstract HR

EuroVoc je višejezični konceptualni pojmovnik pravnih dokumenata zemalja članica Europske Unije. Sastoji se od preko 6000 različitih oznaka klasa koje se dodjeljuju dokumentima, zadatak koji spada u područje strojnog učenja. Problem pojmovnika EuroVoc je visoko rasipanje oznaka te njihova neravnomjerna raspodjeljenost. Slabije reprezentirane oznake su teže za prepoznati i klasificirati, te najviše škode procesu klasifikacije. Razni načini poboljšanja postupka su analizirani s naglaskom na povezivanje sličnih oznaka u grupe te potom strožu diskriminaciju među članovima iste grupe. Rezultati su uspoređeni s osnovnim algoritmom stroja s potpornim vektorima, te su izneseni zaključci i analizirani mogući novi smjerovi istraživanja.

Abstract EN

EuroVoc is a multilingual conceptual thesaurus covering the activities of the European Union. It consists of over 6000 different class labels that are assigned to documents, a task tackled by machine learning algorithms for multi-label classification. Problems with the thesaurus include label sparsity and an uneven descriptor distribution, among others. The main line of research was an attempt to group similar labels, splitting the problem into two steps: differentiating between groups of labels, and then differentiating between similar labels within a group. The model was compared to the SVM baseline and various other multi-label classification algorithms, and the results and possible new lines of research were analysed.

Keywords HR

EuroVoc, klasifikacija teksta, neravnomjerna distribucija klasa, strojno učenje

Keywords EN

EuroVoc, text classification, multi-label classification, label sparsity, machine learning

Defense date

8.7.2014.

Thesis task HR

Višeoznačna klasifikacija dokumenata, odnosno tzv. indeksiranje dokumenata, odnosi se na automatsko pridjeljivanje više oznaka istom dokumentu. Za indeksiranje pravnih dokumenata, koje ima za svrhu poboljšanje pristupa legislativi, može se koristiti hijerarhijski i višejezični pojmovnik EuroVoc. No, višeoznačna klasifikacija dokumenata pojmovnikom EuroVoc pokazala se iznimno izazovnim zadatkom zbog problema rijetkih oznaka, koji proizlazi iz razmjerno velikog broja oznaka i njihove neravnomjerne razdiobe. Tema ovog rada jest indeksiranje dokumenata pomoću pojmovnika EuroVoc temeljeno na strojnom učenju. Potrebno je dati iscrpan pregled postojećih pristupa indeksiranju dokumenata pojmovnikom EuroVoc, višeoznačnoj klasifikaciji, hijerarhijskoj klasifikaciji te hibridnim algoritmima koji se temelje na kombinaciji potonjih dvaju pristupa. Potrebno je implementirati algoritme višeoznačne klasifikacije dokumenata uključujući nasumične šume (engl. random forests), stroj s potpornim vektorima (engl. support vector machine, SVM) te algoritam HOMER (Tsoumakas et al., 2008). Implementacija se može graditi pomoću postojećih biblioteka otvorenog koda poput biblioteke Mulan. Potrebno je implementirati tehnike za rješavanje problema rijetkih oznaka, kao što su dodavanja težina značajkama i označavanja značajki. Vrednovati i usporediti modele višeoznačne klasifikacije na skupovima podataka na hrvatskome i engleskome jeziku, na korpusima NN13205 odnosno JRC AC. Kod vrednovanja u obzir treba uzeti višeoznačnu i hijerarhijsku prirodu problema. Usporediti rezultate modela s višeoznačnim strojem s potpornim vektorima kao osnovnom metodom te provesti iscrpnu analizu pogreške nad ispitnim skupom. Radu priložiti izvorni i izvršni kod razvijenog sustava, označene skupove podataka i potrebnu dokumentaciju te citirati korištenu literaturu.

Thesis task EN

Multi-label document classification, also known as document indexing, is the task of automatically assigning multiple labels to the same document. The multilingual and hierarchical EuroVoc thesaurus has been used to index documents of the European Parliament as well as documents of national legislatures to improve access to legislation. Multi-label classification using the EuroVoc thesaurus has proven to be a particularly challenging task due to label sparsity, which arises from the large number of labels and their unbalanced distribution. The topic of this thesis is machine learning experiments on multi-label document classification using the EuroVoc thesaurus. Provide a thorough overview of the existing approaches to EuroVoc document indexing, multi-label classification, hierarchical classification, and algorithms that combine the latter two approaches. Implement multi-label document classification algorithms based on, but not limited to, random forests, support vector machines (SVM), and the HOMER algorithm of Tsoumakas et al. (2008). The implementation may build on existing open-source libraries such as Mulan. Implement techniques for addressing the label sparsity problem, such as feature weighting and feature labeling. Evaluate and compare the multi-label classification models on English and Croatian datasets, using the JRC AC corpus and the NN13205 corpus, respectively. The evaluation should account for the multi-label and hierarchical nature of the problem. Compare the models against multi-class SVM as the baseline and carry out a thorough error analysis on the test set. All references listed should be cited and all cited references must be included in the reference list. All algorithms, source code, and additional documentation must be provided with the thesis.

Publicly available

Published paper(s)

File

TakeLab-DR-2014-MartinTutek.pdf