Thesis etd-03142023-165209

Thesis type

Master's thesis

URN

etd-03142023-165209

Title

Text to Time Series Representations: Towards Interpretable Predictive Models

Department

FILOLOGIA, LETTERATURA E LINGUISTICA

Degree programme

INFORMATICA UMANISTICA

Supervisors

supervisor Prof. Guidotti, Riccardo
supervisor Dr. Spinnato, Francesco

Keywords

interpretability
nlp
sentence embedding
text to time series
time series classification

Exam session start date

13/04/2023

Availability

Full

Abstract

Natural Language Processing (NLP) and time series analysis are two research domains that have seen a surge of interest in recent years. At the core of NLP are text representation techniques that convert text into machine-readable input. These techniques have proven remarkably accurate, especially with state-of-the-art transformer models. However, they hinder human interpretation because they generate an implicit vector representation that embeds the text as a whole. In contrast, time series analysis techniques are designed to extract meaningful information from time-dependent data while preserving local features. One prominent time series classification approach uses shapelets, which are representative subsequences of the target variable, and classifies each time series by its proximity to them. This distinguishes it from time-independent methods that rely on global time series statistics. K-nearest neighbors (KNN) classification has also been adapted to time series through distance measures such as Dynamic Time Warping (DTW), which detect local patterns shared by two time series even when they do not occur simultaneously.
This study proposes a novel approach to text-to-time-series representation that preserves the text's sequential structure. Specifically, we treat a text as a sequence of sentences and map each sentence to a feature vector using feature extraction techniques derived from sentence embedding, sentiment analysis, and custom linguistic features. We explore pooling methods that convert the resulting multivariate time series into univariate ones, making it easier to identify distinctive one-dimensional patterns among text time series. We propose adapting principal component analysis (PCA) to reduce the global feature vector space of the sentence timestamps to one dimension. With this reduction, each text time series becomes a one-dimensional signal that is easier to analyze and interpret.
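The PCA-based pooling described above can be illustrated with a minimal sketch. It is not the thesis implementation: the feature values, the text names, and the helper functions `first_principal_axis` and `pool_text` are hypothetical, and the first principal axis is estimated with plain power iteration instead of a library PCA. The sketch fits one axis on the sentence vectors of all texts together (the "global" feature space) and projects every sentence onto it, so each text becomes a univariate series with one value per sentence.

```python
import math


def first_principal_axis(rows, n_iter=200):
    """Estimate the mean and first principal axis of `rows` by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the centered feature vectors.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Assumption: the uniform start vector is not orthogonal to the main axis.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v


def pool_text(sentence_vectors, mean, axis):
    """Project each sentence vector onto the axis: one scalar per sentence."""
    return [sum((x - m) * a for x, m, a in zip(vec, mean, axis))
            for vec in sentence_vectors]


# Hypothetical toy data: two texts, three sentences each, three features each.
texts = {
    "text_a": [[0.1, 0.2, 0.0], [0.4, 0.9, 0.1], [0.8, 1.7, 0.0]],
    "text_b": [[0.9, 1.8, 0.1], [0.5, 1.0, 0.0], [0.0, 0.1, 0.1]],
}

# Fit the axis on all sentences of all texts, then pool each text separately.
all_sentences = [vec for vecs in texts.values() for vec in vecs]
mean, axis = first_principal_axis(all_sentences)
series = {name: pool_text(vecs, mean, axis) for name, vecs in texts.items()}

for name, values in series.items():
    print(name, [round(x, 3) for x in values])
```

Fitting a single axis over all texts keeps the resulting series comparable with each other, which is what a distance-based classifier needs. The sign of the axis is arbitrary, so only the shape of each series is meaningful.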
We test the resulting transformation on several datasets, consisting mainly of song lyrics, through classification tasks. We compare feature-based approaches, which use global statistics from the multivariate time series, with KNN and shapelet-based approaches applied to the resulting univariate time series. Despite the approximation introduced by pooling, shapelet-based methods perform similarly to feature-based ones, although results vary with the text domain, the feature type, and the specifics of the dimensionality reduction. Finally, we demonstrate how to extract interpretable features from the original multivariate shapelets and how to map the shapelets back to the source text.
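The two distance-based ideas in the abstract, KNN with DTW and shapelet proximity, can be sketched on univariate series as follows. This is an illustrative sketch only: the toy series, the labels, and the function names `dtw_distance`, `knn1_predict`, and `locate_shapelet` are hypothetical and do not come from the thesis. The best-matching start index returned by `locate_shapelet` is a sentence position, which is the kind of information needed to point a shapelet back at the source text.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping cost with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def knn1_predict(query, train):
    """1-nearest-neighbor prediction: label of the DTW-closest training series."""
    return min(train, key=lambda item: dtw_distance(query, item[0]))[1]


def locate_shapelet(series, shapelet):
    """Return (distance, start) of the window of `series` closest to `shapelet`."""
    length = len(shapelet)
    candidates = []
    for start in range(len(series) - length + 1):
        dist = math.sqrt(sum((series[start + k] - shapelet[k]) ** 2
                             for k in range(length)))
        candidates.append((dist, start))
    return min(candidates)


# Hypothetical univariate text time series (one value per sentence).
peak = [0, 0, 1, 2, 1, 0, 0]
flat = [0, 0, 0, 0, 0, 0, 0]
train = [(peak, "peak"), (flat, "flat")]

# The query has the same bump, but earlier and in a longer text.
query = [0, 1, 2, 1, 0, 0, 0, 0]

print(knn1_predict(query, train))
print(locate_shapelet(query, [1, 2, 1]))
```

DTW aligns the shifted bump in `query` with the one in `peak`, so the two series are close even though the pattern occurs at different positions and the texts differ in length.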

File

File name	Size
tesi_mat...gioli.pdf	852.91 Kb

ETD

Digital archive of theses defended at the Università di Pisa
