Theses
Ivan Mršić
Prepoznavanje i klasifikacija imenovanih entiteta za jezično sučelje bazi podataka
Entity Recognition and Classification for a Natural Language Database Interface
2017
Undergraduate
Jan Šnajder
FER
FER2
5326
38
EN
Zbog uvijek prisutne ljudske potrebe za razumijevanjem što većih količina podataka i rasta računalne snage baze grafova dobivaju na popularnosti. Još jedan korak k boljoj integraciji čovjeka i računala jest izrada jezičnog sučelja za komunikaciju s takvom bazom. U ovom radu razvijen je i ispitan sustav za izvlačenje entiteta iz upita u svrhu izgradnje takvog sučelja, pogotovo imenovanih entiteta te su u tu svrhu izgrađena četiri modela. Dobiveni rezultati su ohrabrujući, te pokazuju da se i na vrlo malom skupu podataka mogu naučiti modeli koji rade sa specijaliziranim upitima.
Driven by the ever-present human need to understand large quantities of data and by the growth in computing power, graph databases are gaining in popularity. A further step toward better integration of humans and computers is a natural language interface for communicating with such a database. This thesis describes the development and evaluation of a system for extracting entities from queries, which is one component of such an interface. Special attention was given to named entity extraction, and four models were built for this purpose. The results are encouraging and show that models that handle specialized queries can be trained even on a very small dataset.
obrada prirodnog jezika, izvlačenje imenovanih entiteta, stroj potpornih vektora, stabla odluke, CRF, jezično sučelje, graf baza podataka
natural language processing, named entity extraction, support vector machine, decision trees, CRF, natural language interface, graph database
6.7.2017.
A natural language database interface allows users to query a database in a controlled natural language. Such interfaces rely on natural language processing and machine learning to analyze the query. One step in query analysis is the extraction of key information from the query, such as named entities.
The topic of this thesis is the automatic extraction of named entities from user queries over a database of famous people. Study the methods for named entity extraction, with an emphasis on machine learning methods for sequential labeling. Devise and implement a method for named entity extraction from user queries in English, which will also cover the classification of each entity as either a target or reference entity with respect to the query intent. Compile a suitable collection for model training and testing, which will include query examples with pre-labeled named entities. Carry out an experimental evaluation of the model, a comparison against a baseline, a statistical analysis of the results, and an error analysis. All references must be cited, and all source code, documentation, executables, and datasets must be provided with the thesis.
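The sequential labeling task described above can be illustrated with a minimal sketch of BIO tagging, a common encoding for named entity spans. Everything specific here is an assumption made for illustration and is not taken from the thesis: the example query, the label names TARGET and REFERENCE, and the helper names spans_to_bio and bio_to_spans are hypothetical, and the thesis may define its label set and entity roles differently.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, label); labels are hypothetical.
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Encode labeled token spans as one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags back into labeled spans."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Illustrative query over a database of famous people (invented example).
tokens = ["Which", "actors", "were", "born", "in", "the", "same",
          "city", "as", "Nikola", "Tesla", "?"]
# Assumed roles: "actors" is what the query asks for, "Nikola Tesla" anchors it.
spans = [(1, 2, "TARGET"), (9, 11, "REFERENCE")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence labeling model such as a CRF would be trained to predict one such tag per token; the decoding helper then recovers the entity spans and their assumed roles from the predicted tags.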