
ETD

Digital archive of the theses defended at the Università di Pisa

Thesis etd-09122019-134119


Thesis type
Master's thesis
Author
MOLINARI, ALESSIO
URN
etd-09122019-134119
Title
Risk Minimization Models for Technology-Assisted Review and their Application to e-discovery
Department
FILOLOGIA, LETTERATURA E LINGUISTICA
Degree programme
INFORMATICA UMANISTICA
Supervisors
supervisor Esuli, Andrea
supervisor Sebastiani, Fabrizio
Keywords
  • cost-sensitive classification
  • decision theory
  • e-discovery
  • machine learning
  • risk minimization
Defence session start date
30/09/2019
Availability
Full
Abstract
In several subfields of data science, the term “review” refers to the activities, carried
out by one or more human annotators, of checking the correctness of the class labels
attributed by an automatic process to unlabelled data items, and of replacing wrong
labels with the correct labels.
Given a set D of unlabelled items that are automatically classified, only a subset
of such items is usually reviewed (otherwise, the previous automatic classification
step would be useless). This is because reviewing comes at a cost, a constraint that
becomes more pressing as the automatically classified dataset grows larger, which
is increasingly often the case in many application domains.
The number of data items that are reviewed depends (among other factors) on the
annotation budget, i.e., the maximum number of data items that the annotator is willing or has time to review, or the maximum cost one is willing to pay an annotator for having the data reviewed.
In order to support the activity of human annotators, it is important to
identify which data items in D should be reviewed and which should not, in such
a way that overall costs are minimized and the accuracy of the resulting labels is
maximized. Identifying exactly these data items is the task of
technology-assisted review (TAR) algorithms. Essentially, these algorithms attempt to strike an optimal tradeoff between the contrasting goals of minimizing the cost of human intervention and maximizing the accuracy of the resulting labelled data. They do so by focusing on those data items that are most likely to have been misclassified by the automatic classifier, and/or those data items for which review would bring about the highest benefit.
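The tradeoff described above can be illustrated with a minimal sketch. It is not MINECORE itself: the function names, the binary setting, the cost parameters `cost_fp` and `cost_fn`, and the per-item `review_cost` are all assumptions introduced here for illustration. Each item is ranked by the expected misclassification cost of its automatic label, and the items with the highest positive net gain are selected until the annotation budget is exhausted.

```python
def expected_risk(p_pos, cost_fp=1.0, cost_fn=1.0):
    """Expected misclassification cost of the cheaper automatic label.

    p_pos is the classifier's posterior probability of the positive class.
    cost_fp and cost_fn are hypothetical costs of a false positive and of a
    false negative (assumptions, not values taken from the thesis).
    """
    risk_if_labelled_neg = p_pos * cost_fn          # item is actually positive
    risk_if_labelled_pos = (1.0 - p_pos) * cost_fp  # item is actually negative
    return min(risk_if_labelled_neg, risk_if_labelled_pos)


def select_for_review(posteriors, budget, review_cost=0.0,
                      cost_fp=1.0, cost_fn=1.0):
    """Return the indices of at most `budget` items worth reviewing.

    An item is worth reviewing only if the risk removed by a (here assumed
    perfect) reviewer exceeds the cost of reviewing it.
    """
    gains = [(expected_risk(p, cost_fp, cost_fn) - review_cost, i)
             for i, p in enumerate(posteriors)]
    # Highest gain first; ties are broken by original position.
    gains.sort(key=lambda g: (-g[0], g[1]))
    return [i for gain, i in gains[:budget] if gain > 0]


if __name__ == "__main__":
    posteriors = [0.95, 0.5, 0.1, 0.6]
    print(select_for_review(posteriors, budget=2))
```

With equal costs the ranking reduces to reviewing the items whose posterior is closest to 0.5; unequal costs shift attention towards the more expensive kind of error.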
In this work we introduce and test three major modifications of MINECORE,
a recently proposed TAR algorithm whose goal is to minimize the expected
costs of review for topical relevance (also known as “responsiveness”) and sensitivity
(also known as “privilege”) in
e-discovery. E-discovery is the task of determining, within a civil lawsuit in legal systems based on common law, which documents need to be produced by one party to the other party in the lawsuit.
The first modification attempts to increase the quality of the “posterior”
probabilities that are input to MINECORE. We attempt to do this via an
instance of the EM algorithm which leverages the transductive nature of e-discovery,
i.e., the fact that the set of documents that must be classified is finite and available
at training time.
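The abstract does not say which EM instance is used. As a hedged illustration only, the sketch below shows one well-known EM procedure of this kind for the binary case: it re-estimates the class prior on the unlabelled set and rescales the posteriors accordingly. The function name `em_adjust_posteriors` and its parameters are assumptions introduced here, and the thesis may use a different variant.

```python
def em_adjust_posteriors(posteriors, train_prior, max_iter=100, tol=1e-6):
    """Adjust binary posteriors to the (unknown) prior of the unlabelled set.

    posteriors:  classifier outputs P(positive | x) on the unlabelled items,
                 assumed to lie strictly between 0 and 1.
    train_prior: fraction of positives in the training set, in (0, 1).
    Returns the adjusted posteriors and the estimated prior.
    """
    prior = train_prior
    adjusted = list(posteriors)
    for _ in range(max_iter):
        # E-step: rescale each posterior by the ratio of current to training prior.
        adjusted = []
        for p in posteriors:
            pos = p * prior / train_prior
            neg = (1.0 - p) * (1.0 - prior) / (1.0 - train_prior)
            adjusted.append(pos / (pos + neg))
        # M-step: the new prior is the mean of the adjusted posteriors.
        new_prior = sum(adjusted) / len(adjusted)
        converged = abs(new_prior - prior) < tol
        prior = new_prior
        if converged:
            break
    return adjusted, prior


if __name__ == "__main__":
    adjusted, prior = em_adjust_posteriors([0.9, 0.2, 0.1, 0.2, 0.1], 0.5)
    print(prior, adjusted)
```

Because the whole set to be classified is available in advance, the prior can be estimated on exactly the documents whose posteriors are being corrected, which is what makes the transductive setting convenient for this kind of procedure.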
The second modification relaxes a simplifying assumption on which MINECORE
relies, i.e., that we should “trust” the posterior probabilities output by our
machine-learned classifiers.
The third modification relaxes yet another simplifying assumption on which
MINECORE relies, i.e., the assumption that human annotators always attribute
the correct class label to the data items they review.
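To see why relaxing this assumption matters, consider a minimal sketch in which a reviewer flips the true label with a fixed probability. The model, the function name `review_gain`, and the parameter `annotator_error` are assumptions made here for illustration and are not taken from the thesis.

```python
def review_gain(p_pos, annotator_error=0.0, review_cost=0.0,
                cost_fp=1.0, cost_fn=1.0):
    """Expected net benefit of sending one item to a fallible reviewer.

    p_pos:           posterior probability that the item is positive.
    annotator_error: hypothetical probability that the reviewer assigns the
                     wrong label (0.0 recovers the perfect-reviewer case).
    """
    # Risk of keeping the cheaper automatic label without review.
    risk_without_review = min(p_pos * cost_fn, (1.0 - p_pos) * cost_fp)
    # Risk that remains after review: a positive mislabelled as negative
    # costs cost_fn, a negative mislabelled as positive costs cost_fp.
    risk_after_review = annotator_error * (
        p_pos * cost_fn + (1.0 - p_pos) * cost_fp)
    return risk_without_review - risk_after_review - review_cost


if __name__ == "__main__":
    for err in (0.0, 0.1, 0.5):
        print(err, review_gain(0.5, annotator_error=err))
```

Under this simple model, review of a confidently classified item can have negative gain once the reviewer's error rate exceeds the classifier's own uncertainty on that item.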
We report experimental results obtained on a large (≈ 800K) dataset of textual
documents, on which we compare the original version of MINECORE
with our three variants.