Theses

Student

Matej Paradžik

Title HR

Postupci prilagodbe domeni za analizu sentimenta u tekstu

Title EN

Domain Adaptation for Sentiment Analysis from Text

Year

2017

Level

Graduate

Supervisor

Jan Šnajder

Co-supervisor

Hands-on assistant

Zoran Medić

Study Programme

FER

Programme

FER2

Thesis ID

1531

Number of pages

Language

Abstract HR

Uobičajeni pristup izgradnji sustava za analizu sentimenta temelji se na klasifikaciji sentimenta korištenjem algoritama nadziranog strojnog učenja. Da bi takav sustav radio dobro, dokumenti, čiji sentiment će naučeni klasifikator predviđati, trebali bi biti iz iste domene kao i dokumenti na kojima je klasifikator učen. Umjesto skupog označavanja dokumenata za svaku novu domenu, moguće je iskoristiti postojeće označene dokumente iz neke druge domene i primijeniti neki postupak prilagodbe domeni. U ovom radu proučeni su postupci prilagodbe domeni temeljeni na dodjeljivanju težina (engl. instance weighting) i postupci temeljeni na promjeni prostora značajki (engl. change of representation). Proučeni postupci iskorišteni su za prilagodbu domene pri učenju klasifikatora sentimenta temeljenog na stroju potpornih vektora. Postupci prilagodbe domene eksperimentalno su vrednovani pri izgradnji klasifikatora sentimenta za hrvatski i engleski jezik. Kao najbolji postupak za prilagodbu domene pokazala se primjena marginaliziranih odšumljavajućih naslaganih autoenkodera (engl. marginalized denoising stacked autoencoders).

Abstract EN

Typical sentiment analysis systems are based on supervised classification algorithms, which work well only when the documents used to train the classifier come from the same domain as the documents whose sentiment the classifier will predict. To alleviate the need to label documents for each new domain of interest, domain adaptation techniques can be applied to labeled documents already available from a different domain. This thesis studies several domain adaptation techniques based on instance weighting and change of representation. The studied techniques were used to build domain-adapted sentiment classifiers based on a linear support vector machine. The performance of the domain-adapted classifiers was experimentally evaluated on texts in Croatian and English. The domain adaptation technique based on marginalized denoising stacked autoencoders gave the best results across all domain adaptation tasks in both languages.

Keywords HR

obrada prirodnog jezika, analiza sentimenta, prilagodba domeni, klasifikacija sentimenta, hrvatski jezik, engleski jezik

Keywords EN

natural language processing, sentiment analysis, domain adaptation, cross-domain sentiment analysis, Croatian language, English language

Defense date

13.7.2017.

Thesis task HR

Thesis task EN

The increase in online communication is paralleled by an increase in the amount of user-generated text. Texts cover diverse genres and domains, often expressing users' opinions or sentiments toward various topics, persons, products, etc. However, sentiment analysis systems are typically tailored for a specific domain, which impedes their use on texts from other domains. Besides domain dependence, sentiment analysis systems are often also dependent on a specific text genre, such as microblogs, reviews, or short comments. Developing sentiment analysis systems for each domain and genre separately is time-consuming and costly. The problem can be alleviated by the use of domain adaptation techniques. The focus of this thesis is domain adaptation techniques for sentiment analysis in the Croatian language. Study the domain adaptation techniques, with a focus on domain adaptation for sentiment analysis. Analyze and implement at least three domain adaptation techniques. Apply the techniques to suitable datasets in Croatian and English, covering different domains and genres. Carry out a detailed experimental evaluation of the sentiment analysis system and the domain adaptation techniques, including error analysis and statistical significance testing. All references must be cited, and all source code, documentation, executables, and datasets must be provided with the thesis.

Publicly available

Published paper(s)

File

TakeLab-DR-2017-MatejParadzik.pdf