Thesis etd-02072023-162429
Thesis type
Master's degree thesis
Author
MANNARI, FRANCESCA
URN
etd-02072023-162429
Thesis title
Design and development of an intelligent system for resource forecasting
Department
INGEGNERIA DELL'INFORMAZIONE
Course of study
INGEGNERIA ROBOTICA E DELL'AUTOMAZIONE
Supervisors
Supervisor Prof.ssa Lazzerini, Beatrice
Supervisor Prof. Marcelloni, Francesco
Supervisor Prof. Pollini, Lorenzo
Keywords
- association rules
- classification
- data mining
- machine learning
- text mining
Graduation session start date
23/02/2023
Availability
Withheld
Release date
23/02/2093
Summary
The final goal of this thesis is to develop a system able to suggest the list of spare parts needed to perform a maintenance service before the service engineer leaves the logistics center.
Starting from the data available, two classes of variables can be identified: call taker data, which are available before the service engineer leaves for the job and can therefore fairly be used to make predictions, and data reported by the service engineer after a visit is concluded, which can be used as the target of the predictions. A brief description of the two kinds of data follows; the “Data analysis” chapter describes them in detail and shows examples that help understand their content. Call taker data, defined as data available before the service engineer's intervention, should give an idea of the kind of intervention (cleaning, repair) and of which materials, if any, are necessary and in which quantity.
Data reported by the service engineer, available after the intervention, contain technical information: from the code and quantity of the material used to the code that identifies a possible fault. Among them, the error code is certainly a key variable, since it describes the fault; predicting this value might therefore be useful to suggest the materials.
After the data have been collected and explored, a preprocessing phase is necessary before making predictions. Null data are discarded or filled in, categorical data such as the main symptom (a variable related to the type of machine) are numerically encoded, and textual data such as the error description given by the customer are numerically transformed by applying specific text mining techniques.
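As a minimal illustration of these preprocessing steps, the sketch below label-encodes a categorical field and turns free text into bag-of-words counts. The field names and records are hypothetical stand-ins for the call taker data; the thesis applies more specific text mining techniques than plain word counting.

```python
from collections import Counter

# Hypothetical toy records standing in for the call taker fields.
records = [
    {"main_symptom": "no heat", "customer_text": "Machine does not heat water"},
    {"main_symptom": "leak",    "customer_text": "Water leak under the machine"},
    {"main_symptom": "no heat", "customer_text": "No heat, error on display"},
]

# Numerically encode the categorical field (simple label encoding).
categories = sorted({r["main_symptom"] for r in records})
cat_to_id = {c: i for i, c in enumerate(categories)}

# Turn free text into bag-of-words counts (a minimal text mining step).
def bag_of_words(text):
    return Counter(text.lower().split())

encoded = [
    {"symptom_id": cat_to_id[r["main_symptom"]],
     "text_counts": bag_of_words(r["customer_text"])}
    for r in records
]
print(encoded[0]["symptom_id"])           # encoded category
print(encoded[0]["text_counts"]["heat"])  # word frequency in the description
```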
During this first step, it emerged that in about half of the cases the service operator did not report any material used for the maintenance.
For this reason, the first goal is to train a binary classifier that predicts whether any material is needed. An intervention that uses no material is the ideal case: neither a trip to the warehouse nor a material order is necessary.
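The thesis trains a binary classifier on the call taker features; purely as an illustration of the decision it makes, the sketch below implements a per-symptom frequency baseline over hypothetical past interventions, not the actual model.

```python
from collections import defaultdict

# Hypothetical past interventions: (main symptom, material was used?).
history = [
    ("no heat", True), ("no heat", True), ("no heat", False),
    ("leak", False),   ("leak", False),   ("leak", True),
]

# Estimate, per symptom, how often material was actually used.
used = defaultdict(list)
for symptom, material_used in history:
    used[symptom].append(material_used)

def predict_material_needed(symptom, threshold=0.5):
    outcomes = used.get(symptom, [])
    if not outcomes:
        return True  # conservative default: prepare material
    return sum(outcomes) / len(outcomes) >= threshold

print(predict_material_needed("no heat"))  # True: 2/3 of past cases used material
print(predict_material_needed("leak"))     # False: only 1/3 used material
```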
The binary classification can go wrong in two ways: on one hand, ordering and picking materials that are then not used to repair the appliance wastes time; on the other hand, predicting that no material is needed when some is necessary wastes both time and resources, since a second visit becomes necessary.
The aim is to find a good balance between suggesting the correct materials and reducing the waste of time and resources as much as possible. Ideally, if all the necessary materials were suggested, every task would be completed in a single intervention. Obviously, the higher the number of suggested materials, the higher the probability of solving the problem with a single intervention. The trade-off between suggesting too many materials, which leads to unnecessary costs, and suggesting too few, which increases the probability of not solving the problem in a single intervention, should therefore be investigated carefully. A strategy for classifying materials in such a large dataset is to predict the error codes first.
This intermediate step is motivated by the assumed relationship between error codes and materials. Instead of classifying over 5,000 different materials, the focus is on the 260 different error codes, which are clustered because many of them have low support and give poor classification results. Two clustering approaches, for a total of three different techniques, are applied to group the error codes by similarity. The first approach is manual and based on the textual description of the error codes: codes with similar descriptions are grouped together. The second approach groups error codes associated with similar materials and is adopted in two different clustering techniques explained in the dedicated chapter; the idea is to cluster error codes for which the same materials are used. If the binary classifier predicts, within a certain threshold, that material is necessary, a multi-class classifier is trained to predict the error code cluster.
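The second approach, grouping error codes whose historical material sets are similar, can be sketched as follows. The error codes, materials, Jaccard similarity measure and greedy merging are all hypothetical illustrations; the thesis describes its actual clustering techniques in the dedicated chapter.

```python
# Hypothetical mapping from error code to the set of materials
# historically used for it; codes and materials are made up.
error_materials = {
    "E01": {"pump", "seal"},
    "E02": {"pump", "seal", "hose"},
    "E03": {"board"},
    "E04": {"board", "fuse"},
}

def jaccard(a, b):
    """Similarity between two material sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def cluster_codes(error_materials, threshold=0.4):
    """Greedy grouping: join a cluster if similar enough to its seed,
    otherwise start a new cluster."""
    clusters = []  # list of (seed material set, member codes)
    for code, mats in error_materials.items():
        for seed, members in clusters:
            if jaccard(mats, seed) >= threshold:
                members.append(code)
                break
        else:
            clusters.append((mats, [code]))
    return [members for _, members in clusters]

print(cluster_codes(error_materials))  # codes grouped by shared materials
```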
After the error code cluster has been selected, the materials most frequently used in the past for the same cluster are selected. Since, as explained in the data analysis chapter, from zero to six materials can be used in each intervention, the six most frequent materials are selected. On average, 2.4 materials are used per task. To determine whether a task is fulfilled, it is checked whether all the materials actually used are included in the list of suggested materials. For the model that predicts the error code cluster and selects the six most frequently used materials from past data with the same cluster, outlined in figure 1.9, the percentage of fulfilled tasks is 51%.
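The suggestion step and the fulfilment check can be sketched as below, on hypothetical past interventions for one error code cluster; material names are invented for illustration.

```python
from collections import Counter

# Hypothetical past interventions for one error code cluster:
# each entry is the list of materials actually used.
past_interventions = [
    ["pump", "seal"], ["pump"], ["seal", "hose"],
    ["pump", "seal"], ["hose"], ["pump", "filter"],
]

# Suggest the six most frequently used materials for this cluster
# (at most six materials can be registered per intervention).
counts = Counter(m for materials in past_interventions for m in materials)
suggested = [material for material, _ in counts.most_common(6)]

def is_fulfilled(used, suggested):
    """A task is fulfilled if every material actually used was suggested."""
    return set(used) <= set(suggested)

print(suggested)                               # most frequent materials first
print(is_fulfilled(["pump", "seal"], suggested))
```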
If past data are filtered not only by error code cluster but also by main symptom, machine device and machine group, all features available before the intervention, the accuracy of the model improves to 59%.
Increasing the number of suggested materials from six to twelve leads to a 4% increase in accuracy.
The aim of association rules is to identify frequent patterns between call taker features and materials. The idea is to suggest a list of materials without predicting the error code, by mining materials that are commonly used together for certain machines, identified by a list of antecedents that can include the machine device, the machine production year and the main symptom. Although this approach is still a prototype, since it does not use the textual description of the error given by the customer, it has been tested and evaluated. Used in combination with the binary classifier, so that a list of materials is suggested only if material is predicted to be necessary, association rules fulfil 54% of tasks when six materials are suggested and 57% when twelve are suggested.
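The core of a rule antecedent → material can be illustrated with the confidence measure used in association rule mining. The transactions, antecedent items and materials below are hypothetical; a full implementation would also filter candidate rules by minimum support.

```python
# Hypothetical transactions: call taker antecedents plus materials used.
transactions = [
    {"device:dishwasher", "symptom:leak", "pump", "seal"},
    {"device:dishwasher", "symptom:leak", "pump"},
    {"device:oven", "symptom:no heat", "heating_element"},
    {"device:dishwasher", "symptom:leak", "pump", "hose"},
]

def confidence(antecedent, material, transactions):
    """Confidence of antecedent -> material:
    fraction of transactions containing the antecedent that also
    contain the material."""
    matching = [t for t in transactions if antecedent <= t]
    if not matching:
        return 0.0
    return sum(material in t for t in matching) / len(matching)

rule_lhs = {"device:dishwasher", "symptom:leak"}
print(confidence(rule_lhs, "pump", transactions))  # 1.0: pump in all 3 matches
print(confidence(rule_lhs, "seal", transactions))  # seal in 1 of 3 matches
```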
Every model suggests materials in the same way: since up to six materials can be registered for each intervention, the six most frequently used materials are selected from past data and suggested; what makes the difference is how past data are filtered. The model based on error code prediction and the association rules are compared with a baseline model that selects the six most frequently used materials only by applying filters on past data. Considering, for a fairer comparison, only tasks where at least one material was used to complete the intervention, the percentage of fulfilled tasks of the model that predicts the error code is 6% higher than that of the baseline model; association rules still perform worse than the baseline, but there is room for improvement by adding the customer's error description to the list of antecedents.
File
File name | Size |
---|---|
Thesis not available for consultation. |