Thesis etd-06212021-172428
Thesis type
Master's thesis
Author
BERTOLDO, SARA
URN
etd-06212021-172428
Title
Probing the linguistic knowledge of word embeddings: A case study on colexification
Department
FILOLOGIA, LETTERATURA E LINGUISTICA
Degree program
INFORMATICA UMANISTICA
Supervisors
Supervisor: Prof. Lenci, Alessandro
Co-supervisor: Dott. Brochhagen, Thomas
Keywords
- colexification
- fasttext
- natural language processing
- window size
- word embedding
Defense session start date
12/07/2021
Availability
Thesis not available for consultation
Abstract
In recent years it has become clear that data is the new source of power and wealth. Companies that can manage data and extract useful information from it are the ones expected to endure and grow their profits.
One of the ways in which data is conveyed is natural language: every day we produce an enormous amount of linguistic data, in written or spoken form. With the help of computational resources, we can handle such a large quantity of information in an automated, scalable way. Before we can do this, however, we need to find ways for computers to represent linguistic knowledge. This is a genuine problem, since computers do not have the linguistic proficiency that we humans do.
For words to be processed by machine models, they typically need some form of numerical representation that models can use in their computations. One approach that has become influential in recent years is word embeddings: representations of terms as real-valued vectors, such that words that are closer in the vector space are expected to be closer in meaning.
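To illustrate the "closer in vector space" idea, the following is a minimal sketch of cosine similarity between two word vectors; the vectors and the word pair are invented toy values, not data from the thesis.

```python
# Toy sketch of cosine similarity between word vectors (values invented, not from the thesis).
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; values near 1 indicate similar words."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vec_moon = np.array([0.8, 0.1, 0.3])   # hypothetical 3-dimensional embedding of "moon"
vec_month = np.array([0.7, 0.2, 0.4])  # hypothetical 3-dimensional embedding of "month"
print(cosine_similarity(vec_moon, vec_month))  # high value -> the words are close in the space
```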
These techniques are very popular and have shown great success in multiple studies, yet it is still not clear what kind of linguistic knowledge they actually acquire. It is also an open question exactly how some of their parameters affect the knowledge they acquire. The present work is motivated by these questions.
We test these models on a linguistic problem. The issue under examination is colexification: the phenomenon in which, within a language, multiple meanings are expressed by a single word form (for example, some languages use the same word for both 'moon' and 'month').
One suggested reason for this phenomenon is a semantic connection between the meanings: two similar meanings are more likely to be conveyed by a single term than two meanings belonging to completely different semantic fields. We assume that there is a relationship between distributional similarity and colexification, in the sense that the former is informative about the latter. This assumption is grounded in the results of Xu et al. (2020), which we use as a general guide for this investigation. We trained word embedding models, specifically fastText models with different window sizes, and used them to obtain cosine similarity values between pairs of words.
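As a concrete illustration of this setup, the sketch below trains fastText models with different window sizes and queries their cosine similarities. It assumes gensim's FastText implementation and uses a placeholder corpus and placeholder window values; the abstract does not state the actual toolkit, corpus, or settings.

```python
# Minimal sketch of training fastText with different window sizes.
# Assumptions: gensim's FastText implementation, a placeholder corpus,
# and placeholder window-size values (none of these are stated in the thesis).
from gensim.models import FastText

corpus = [
    ["the", "moon", "rises", "once", "a", "month"],
    ["a", "month", "has", "roughly", "thirty", "days"],
]  # placeholder corpus; the real training data is a large text collection

for window in (2, 5, 10):  # placeholder window-size values
    model = FastText(sentences=corpus, vector_size=100, window=window,
                     min_count=1, epochs=10)
    sim = model.wv.similarity("moon", "month")  # cosine similarity between the two word vectors
    print(f"window={window}: cos(moon, month) = {sim:.3f}")
```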
Subsequently, we performed two predictive tasks, showing that with a predictive model such as logistic regression, and nothing other than the cosine similarity values between word vectors, it is possible to predict whether a pair of meanings is a highly frequent colexification, or whether it is a colexification at all.
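A minimal sketch of this kind of predictive task is shown below, with invented data: cosine similarity is the only feature fed to a logistic regression classifier.

```python
# Minimal sketch (invented data): logistic regression with cosine similarity
# as the single feature, predicting whether a meaning pair is colexified.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds the cosine similarity of one word pair; labels: 1 = colexified, 0 = not.
X = np.array([[0.82], [0.75], [0.68], [0.31], [0.12], [0.05]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression().fit(X, y)
# Estimated probability that a pair with cosine similarity 0.7 is a colexification.
print(clf.predict_proba([[0.7]])[0, 1])
```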
The results suggest that the models in use were able to acquire some knowledge of word meaning. Additionally, by varying the window-size parameter, we inspected what kind of linguistic knowledge the computational models acquired concerning colexification.
The project covered the whole workflow, from data collection, exploration, and cleaning, to the training of the fastText models and the evaluation of the results obtained by the predictive model.
Our findings indicate that a narrow window size is sufficient for the model to acquire a good level of semantic knowledge in a distributional similarity task. Additionally, depending on the task, the window-size parameter does not always lead to different results. This raises a broader question: in which tasks does window size matter, and what does that tell us about those tasks?
Files
File name | Size
---|---
Thesis not available for consultation. | n/a