## Thesis etd-09252017-235419

Thesis type

Master's thesis (tesi di laurea magistrale)

Author

MALARA, ANDREA

email address

andreamalara@icloud.com

URN

etd-09252017-235419

Thesis title

Search for a massive resonance decaying into a pair of Higgs with 4 b-quarks in the final state and development of machine learning technique applied to b-tagging algorithms.

Department

FISICA

Course of study

FISICA

Supervisors

**supervisor** Prof. Rizzi, Andrea

Keywords

- b quarks
- cms
- deep learning
- higgs

Graduation session start date

18/10/2017

Availability

Full

Summary

The discovery of the Higgs boson opens new possibilities for searches beyond the Standard Model. In particular, the recently discovered particle can be used as a tool to search for heavy resonances decaying into a pair of Higgs bosons. The largest branching ratio of the Higgs boson is into a pair of b-quarks. Thus, a natural choice is to search for a new massive state X decaying into a pair of Higgs bosons through four b-jets in the final state. This thesis presents a search for X → H(bb̄)H(bb̄) performed using 35.9 fb⁻¹ of proton-proton collision data recorded by the CMS detector at the LHC at a centre-of-mass energy of 13 TeV during 2016.

Signal events are expected to produce a peak on top of the multi-jet background in the invariant mass distribution of the four reconstructed b-jets. The final goal of this analysis is to observe an excess of events above the background or, otherwise, to set an upper bound on the production cross section multiplied by the branching ratio of the process.

The event selection begins by identifying events containing at least four central b-jets. Among these jets, two pairs are chosen according to criteria that make them compatible with Higgs boson decays. This search covers a broad range of mass hypotheses for the resonance, between 260 GeV and 1200 GeV. The kinematics of the decay of such a resonance change substantially over this range, and thus two different mass ranges have been used: the Low Mass Regime (LMR) covers masses up to 650 GeV, while the Medium Mass Regime (MMR) covers masses above 550 GeV. In the MMR the Higgs bosons are sufficiently boosted that the two jets from each decay have a small angular separation, which can be used as a discriminating variable to select the jets originating from the same Higgs boson. For the LMR, on the other hand, we use the looser requirement that the di-jet invariant mass is compatible with the nominal Higgs mass.
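As a rough illustration of the LMR requirement, the sketch below enumerates the three ways of splitting four jets into two pairs and keeps the pairing whose di-jet masses lie closest to the Higgs mass. The four-vector layout, the function names, the 125 GeV reference value and the distance metric are all assumptions made for this example; the abstract does not specify the actual criteria.

```python
import math

HIGGS_MASS = 125.0  # GeV; nominal value assumed for this illustration


def invariant_mass(*jets):
    """Invariant mass (GeV) of a set of (E, px, py, pz) four-vectors."""
    e = sum(j[0] for j in jets)
    px = sum(j[1] for j in jets)
    py = sum(j[2] for j in jets)
    pz = sum(j[3] for j in jets)
    m2 = e * e - px * px - py * py - pz * pz
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding


# The three distinct ways to split jets 0..3 into two pairs.
PAIRINGS = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def best_pairing(jets):
    """Return the pairing whose two di-jet masses lie closest to HIGGS_MASS."""
    def distance(pairing):
        (a, b), (c, d) = pairing
        m1 = invariant_mass(jets[a], jets[b])
        m2 = invariant_mass(jets[c], jets[d])
        return math.hypot(m1 - HIGGS_MASS, m2 - HIGGS_MASS)

    return min(PAIRINGS, key=distance)
```

An MMR-style selection would replace the mass-based distance with one built from the angular separation of the jets in each pair.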

To perform the analysis, we use a signal shape modelled from Monte Carlo (MC) simulation. The simulated signal events are generated under the assumption of a narrow-width resonance produced via gluon-gluon fusion, for two different spin hypotheses. The signal samples were produced for 20 mass points in steps of 10, 50 or 100 GeV across the mass range. Since the mass resolution is smaller than the step size, we interpolate both the shape and the normalisation of the signal to account for intermediate mass points.
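The interpolation between simulated mass points can be pictured with a simple linear scheme. The sketch below assumes that each simulated point is summarised by a few shape parameters and a yield, and interpolates them linearly in mass. The parameter names are invented for illustration, and the thesis may well use a different interpolation method.

```python
from bisect import bisect_right


def interpolate_signal(mass, points):
    """Linearly interpolate signal parameters between simulated mass points.

    `points` maps a simulated mass (GeV) to a dict of parameters,
    e.g. {"mean": ..., "width": ..., "yield": ...} (hypothetical names).
    """
    masses = sorted(points)
    if not masses[0] <= mass <= masses[-1]:
        raise ValueError("mass outside the simulated range")
    if mass in points:
        return dict(points[mass])
    hi = bisect_right(masses, mass)  # index of the first simulated mass above `mass`
    m_lo, m_hi = masses[hi - 1], masses[hi]
    t = (mass - m_lo) / (m_hi - m_lo)  # fractional position between the neighbours
    lo_p, hi_p = points[m_lo], points[m_hi]
    return {k: (1 - t) * lo_p[k] + t * hi_p[k] for k in lo_p}
```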

The non-resonant multi-jet background, on the other hand, is modelled with a data-driven approach, using a fit with a smooth function. To avoid possible biases, the analysis is designed and tested without using the data events falling in the Signal Region (SR); in this sense, it is carried out as a “blind” procedure. For this reason, the fit strategy is validated in several control regions, defined in the 2D space of the two reconstructed Higgs candidate masses.

My personal contribution has been the optimisation of this analysis for the data collected in Run 2 at the LHC. I carried out almost all parts of the analysis, studying the effects of the new b-tagging algorithms and the bias related to the choice of the background model; to minimise this bias, I improved the background modelling.

The background modelling strategy used for Run 1 data fits the distribution in the control regions. Unlike in previous searches, where this approach worked properly, the larger data sample collected during Run 2 enhances some features of the distribution, making it difficult to model properly. Therefore we adopt two different strategies for the LMR and MMR cases. For the LMR, an “ABCD” method has been used to predict a background shape compatible with that of the signal region, on which the fit strategy is validated. I also showed that it is convenient to split the LMR into two overlapping ranges, separating the turn-on from the tails and avoiding most of the fitting problems. At the same time, this choice has the advantage of extending the total LMR coverage and of excluding from the MMR the kinematically problematic lower mass range, which is covered by the LMR. To prove this statement I used several selection criteria that led us to minimise the uncertainty in the choice of the functional form for our background model.
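The idea behind an “ABCD” estimate can be sketched as follows, under the usual assumption that the two variables defining the four regions are uncorrelated for background events. The region labels and the function name are hypothetical, and the sketch shows the textbook yield estimate rather than the shape prediction used in the thesis.

```python
import math


def abcd_estimate(n_b, n_c, n_d):
    """Predict the background yield in region A as N_B * N_C / N_D.

    Returns (estimate, statistical uncertainty), propagating Poisson
    errors on the three control-region counts.
    """
    if min(n_b, n_c, n_d) <= 0:
        raise ValueError("control-region counts must be positive")
    estimate = n_b * n_c / n_d
    rel_unc = math.sqrt(1.0 / n_b + 1.0 / n_c + 1.0 / n_d)
    return estimate, estimate * rel_unc
```

The same ratio can be applied bin by bin to a histogram to obtain a predicted shape, provided the independence assumption holds in each bin.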

Once the whole procedure had been defined, tested and verified, the analysis was unblinded and, given these assumptions, I searched for an excess of events above the background expectation. No statistically significant excess is observed for masses between 260 GeV and 1200 GeV, and upper limits at 95% confidence level have been set on the production cross section multiplied by the branching ratio, as a function of the mass and for both spin hypotheses.

In the second part of this thesis, as a possible improvement to b-tagging for future analyses, I worked on machine learning algorithms as a first step towards the development of a new algorithm. In particular, in this search with four b-quarks in the final state, any improvement in the tagging of a single b-jet yields a better overall signal extraction efficiency.
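To see why per-jet gains matter, consider the simplifying assumption that the four jets are tagged independently and with the same efficiency, so that the event-level efficiency scales as the fourth power of the per-jet efficiency. The numbers below are illustrative only and are not taken from the thesis.

```python
def event_efficiency(per_jet_eff, n_jets=4):
    """Efficiency for tagging all `n_jets` jets, assuming independent tags."""
    return per_jet_eff ** n_jets


baseline = event_efficiency(0.70)    # hypothetical per-jet efficiency
improved = event_efficiency(0.77)    # same tagger with a 10% relative gain per jet
relative_gain = improved / baseline  # equals 1.1 ** 4, roughly 1.46
```

Under this assumption, a 10% relative gain per jet translates into a gain of about 46% in the fraction of signal events with all four jets tagged.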

As a first step, I selected the optimal working point for the most recent algorithms (CMVA and deep-CSV) provided by the CMS offline reconstruction. I then optimised a new algorithm that uses a deep neural network and takes low-level information as input, unlike most of the state-of-the-art algorithms in CMS, which rely on reconstructed secondary-vertex information. It is reasonable to expect that a deep neural network given the low-level inputs directly can recover some information that would otherwise be lost.

The choice of the input variables is inspired by the standard vertex reconstruction algorithm (IVF), although no analytical fit of the secondary vertex is used. The use of a deep neural network is justified by the large number of variables needed to describe each jet accurately, of the order of a thousand if all the tracks are included, which would be difficult to optimise with a simple neural network. After many attempts, a satisfactory implementation of the network was reached using feed-forward layers, a recurrent network (LSTM) and convolutional units, each addressing a different task according to the shape of its input variables, so that the network architecture reflects the physics that motivated the choice of inputs.

The results obtained from this first step are promising for future developments, since the new algorithm outperforms the existing ones, improving the tagging efficiency by up to 10-15% at a fixed mistag rate (false positive rate). In particular, I tested how the new algorithm behaves for high-momentum jets, which is particularly relevant in the X → H(bb̄)H(bb̄) analysis.
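Comparing taggers at a fixed mistag rate amounts to choosing, for each tagger, the score threshold that lets through a given fraction of non-b jets, and then measuring the fraction of b-jets above that threshold. A minimal sketch, with hypothetical function and variable names and plain Python lists of discriminator scores:

```python
def efficiency_at_mistag(b_scores, light_scores, mistag_rate):
    """b-jet efficiency at the threshold giving at most `mistag_rate` on light jets."""
    ranked = sorted(light_scores, reverse=True)
    n_allowed = int(mistag_rate * len(ranked))  # light jets allowed above threshold
    if n_allowed >= len(ranked):
        threshold = float("-inf")  # every jet passes
    else:
        # The threshold sits at the first light-jet score that must be rejected.
        threshold = ranked[n_allowed]
    passed = sum(1 for s in b_scores if s > threshold)
    return passed / len(b_scores)
```

Evaluating this function for two taggers on the same jets and at the same `mistag_rate` gives the kind of efficiency comparison quoted above.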


File

File name | Size |
---|---|
thesis.pdf | 16.22 MB |
thesis_frn.pdf | 593.94 KB |
