Theses

Student

Luka Skukan

Title HR

Označavanje vremenskih izraza u tekstovima na hrvatskome jeziku

Title EN

Temporal Expression Tagging for Croatian Texts

Year

2014

Level

Undergraduate

Supervisor

Jan Šnajder

Co-supervisor

Hands-on assistant

Goran Glavaš

Study Programme

FER

Programme

FER2

Thesis ID

3795

Number of pages

Language

Abstract HR

Označavanje vremenskih izraza u zadnje je vrijeme postalo jedno od fokusa obrade prirodnog jezika. Ono se sastoji od identifikacije vremenskih izraza i njihove normalizacije u kanonski oblik. Nedavni uspjesi označivača baziranih na pravilima, a posebno sustava HeidelTime, kao i nedostatak uspješnih označivača za hrvatski jezik, potaknuli su oblikovanje skupa pravila za hrvatski za HeidelTime. Stvoren je i korpus sastavljen od članaka s hrvatske Wikipedije baziranih na člancima o ratovima. Korpus je anotiran TIMEX3 oznakama, a njegova je svrha bila pomoć u razvoju skupa pravila, ali otvara i mogućnost korištenja za druge alate i svrhe. Implementacija pravila za hrvatski postigla je rezultate usporedive s rezultatima implementacija za druge jezike, a korpus WikiWarsHR usporediv je sa sličnim korpusima na drugim jezicima po duljini i broju označenih vremenskih izraza.

Abstract EN

Temporal expression extraction has recently become a popular field of natural language processing. This task consists of locating temporal expressions and normalising them into a canonical form. Recent successes achieved by rule-based temporal taggers, HeidelTime in particular, and the lack of a good temporal tagger for Croatian have inspired the construction of a HeidelTime rule set for the Croatian language. Additionally, WikiWarsHR, a corpus of Wikipedia-based historical narratives tagged with TIMEX3, has been developed, both for the purpose of developing the HeidelTime rule set and for future use. Results achieved by the Croatian implementation of HeidelTime are comparable to those of implementations for other languages, while WikiWarsHR is close in size and density of tags to other publicly available corpora.

Keywords HR

TimeML, TIMEX3, označavanje bazirano na pravilima, normalizacija vremenskih izraza, obrada prirodnog jezika

Keywords EN

TimeML, TIMEX3, rule-based tagging, temporal expression normalisation, natural language processing

Defense date

3 July 2014

Thesis task HR

Temporal expression tagging refers to identifying the parts of a text that denote intervals or moments in time and normalizing them to canonical forms. This is an important process in semantic text analysis and a prerequisite for more advanced information extraction methods, such as event extraction, document summarization, and temporal and causal reasoning. Previous research has shown that the task can be solved efficiently using methods based on a hand-crafted set of rules. The topic of this thesis is rule-based approaches to temporal expression tagging. Provide an overview of the various methods for detecting and normalizing temporal expressions, with a focus on rule-based methods, including the HeidelTime temporal expression tagger (Strötgen and Gertz, 2013). Develop and implement a temporal expression detection and normalization system for the Croatian language as an extension of the HeidelTime tagger. Compile a text corpus manually annotated with temporal expressions, along the lines of the WikiWars corpus (Mazur and Dale, 2010). Perform an experimental evaluation and a detailed error analysis. All references must be cited, and all source code, documentation, executables, and datasets must be provided with the thesis.

Thesis task EN

Publicly available

Published paper(s)

File

TakeLab-ZR-2014-LukaSkukan.pdf