ETD

Digital archive of the theses defended at the Università di Pisa

Thesis etd-03142023-165209


Thesis type
Master's thesis
Author
POGGIOLI, MATTIA
URN
etd-03142023-165209
Title
Text to Time Series Representations: Towards Interpretable Predictive Models
Department
FILOLOGIA, LETTERATURA E LINGUISTICA
Degree programme
INFORMATICA UMANISTICA
Supervisors
supervisor Prof. Guidotti, Riccardo
supervisor Dott. Spinnato, Francesco
Keywords
  • interpretability
  • nlp
  • sentence embedding
  • time series classification
  • text to time series
Defense session start date
13/04/2023
Availability
Full
Abstract
Natural Language Processing (NLP) and time series analysis are two research domains that have seen a surge of interest in recent years. At the core of NLP are text representation techniques that convert text into machine-readable input. These techniques have proven remarkably accurate, especially with state-of-the-art transformer models. However, they hinder human interpretation because they generate an implicit vector representation that embeds the text as a whole. In contrast, time series analysis techniques are inherently designed to extract meaningful information from time-dependent data while preserving local features. One prominent time series classification approach uses shapelets to identify subsequences that are representative of the target variable. It classifies time series by their proximity to shapelets, which distinguishes it from time-independent methods that rely on global time series statistics. k-nearest neighbors (KNN) classification has also been adapted to time series with distance measures such as dynamic time warping (DTW), which detect local patterns shared between time series even when they do not occur simultaneously. This study proposes a novel approach to text-to-time-series representations that preserve the text's sequential structure. Specifically, we treat text as a sequence of sentences and map it using feature extraction techniques derived from sentence embedding, sentiment analysis, and custom linguistic features. We explore pooling methods that convert multivariate time series into univariate ones, which makes it easier to identify distinctive one-dimensional patterns among text-time series. We propose adapting principal component analysis (PCA) to reduce the global feature vector space of the sentence timestamps to one dimension. With this reduction, the text-time series can be represented as a one-dimensional signal that is easier to analyze and interpret.
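The pipeline described above (sentences as timestamps, one feature vector per sentence, PCA-based pooling to one dimension) can be illustrated with a minimal sketch. It is not the thesis implementation: the three per-sentence features (word count, mean word length, exclamation marks) are toy stand-ins for the sentence embedding, sentiment, and linguistic features, the function names are hypothetical, and the first principal component is estimated with a simple power iteration so that the example needs only the standard library.

```python
import math
import re

def sentence_features(text):
    """Split text into sentences; return one toy feature vector per sentence."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    rows = []
    for s in sentences:
        words = re.findall(r"[A-Za-z']+", s)
        n_words = len(words)
        mean_len = sum(len(w) for w in words) / n_words if n_words else 0.0
        # Hypothetical features: word count, mean word length, number of "!"
        rows.append([float(n_words), mean_len, float(s.count("!"))])
    return rows

def fit_first_component(rows, iterations=500):
    """Estimate the global mean and first principal direction by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / max(n - 1, 1)
            for j in range(d)] for i in range(d)]
    # Non-uniform, normalized start vector; the iteration only fails if this
    # vector is exactly orthogonal to the leading direction.
    v = [1.0 / (j + 1) for j in range(d)]
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v

def pool_to_univariate(rows, mean, direction):
    """Project each sentence vector onto the fitted direction (one value per sentence)."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for r in rows]

corpus = [
    "I walk alone. The night is long! Nobody waits for me.",
    "We dance! We sing! The morning comes and we are still here together.",
]
multivariate = [sentence_features(t) for t in corpus]
# Fit the projection once on the sentence vectors of all texts (the global
# feature space), then apply it to each text separately.
all_rows = [row for series in multivariate for row in series]
mean, direction = fit_first_component(all_rows)
univariate = [pool_to_univariate(series, mean, direction) for series in multivariate]
```

Each entry of `univariate` is a one-dimensional series with one value per sentence, in sentence order. The sign of a principal direction is arbitrary, so the series may appear flipped relative to another PCA implementation.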
We test the resulting transformation on several datasets, consisting mainly of song lyrics, through classification tasks. We compare feature-based approaches that use global statistics from the multivariate time series with KNN and shapelet-based approaches applied to the resulting univariate time series. Despite the approximation introduced by pooling, shapelet-based methods perform similarly to feature-based ones, although results vary with the text domain, the feature type, and the specifics of the dimensionality reduction. Finally, we demonstrate how to extract interpretable features from the original multivariate shapelets and how to map the shapelets back to the source text.
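As an illustration of the KNN approach on univariate series, the sketch below implements the textbook dynamic time warping recurrence and a 1-nearest-neighbor classifier on top of it. It is a generic example under stated assumptions, not the experimental setup of the thesis: the series values, the labels, and the function names are made up for the demonstration.

```python
def dtw_distance(a, b):
    """Classic DTW with absolute difference as the local cost, no window constraint."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def predict_1nn(train, query):
    """Return the label of the training series closest to `query` under DTW."""
    return min(train, key=lambda item: dtw_distance(item[0], query))[1]

# Hypothetical univariate text-time series (one value per sentence).
train = [
    ([0.0, 1.0, 0.0, 0.0], "peak"),
    ([0.0, 0.0, 0.0, 0.0], "flat"),
]
# Same peak as the first training series, but later and in a longer series.
query = [0.0, 0.0, 1.0, 0.0, 0.0]
label = predict_1nn(train, query)
```

Because DTW may stretch either series, the shifted peak in `query` aligns with the peak of the first training series, which is the property that lets the classifier match local patterns that do not occur at the same position.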
File