Higgs-ML

Prediction of the Higgs boson from its decay signature using regularized logistic regression.

Intro

Getting started

Project description

The aim of this Machine Learning project is to predict whether a decay signature comes from a Higgs boson or from some other particle. The model takes as input a vector of features describing a collision event between two high-speed protons. More details about the project are available in references/project1_description.pdf. Here, a regularized logistic regression is implemented and trained on 8 subsets of the full dataset.
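As an illustration of the technique named above, here is a minimal sketch of L2-regularized logistic regression trained with full-batch gradient descent. It is not the code from run.py: the function names, the label encoding (y in {0, 1}), the penalty form lambda_ * ||w||^2, and the hyperparameters lambda_ and gamma are assumptions made for the example.

```python
import numpy as np


def sigmoid(t):
    """Logistic function, written with tanh so large |t| does not overflow."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))


def reg_logistic_loss_and_grad(y, tx, w, lambda_):
    """Mean negative log-likelihood plus an L2 penalty, and its gradient.

    Assumes labels y are in {0, 1} and tx has shape (n_samples, n_features).
    """
    z = tx @ w
    # log(1 + exp(z)) computed stably with logaddexp
    loss = np.mean(np.logaddexp(0.0, z) - y * z) + lambda_ * np.sum(w ** 2)
    grad = tx.T @ (sigmoid(z) - y) / len(y) + 2.0 * lambda_ * w
    return loss, grad


def reg_logistic_regression(y, tx, lambda_, initial_w, max_iters, gamma):
    """Full-batch gradient descent; returns the final weights and loss."""
    w = initial_w.copy()
    for _ in range(max_iters):
        _, grad = reg_logistic_loss_and_grad(y, tx, w, lambda_)
        w = w - gamma * grad
    loss, _ = reg_logistic_loss_and_grad(y, tx, w, lambda_)
    return w, loss
```

In a setup like the one described here, one such model would be fitted per subset of the data, each with its own weights.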

Data

The dataset comes from a popular machine learning challenge, finding the Higgs boson, which uses original data from CERN. The dataset is available at https://www.aicrowd.com/challenges/epfl-machine-learning-higgs. To reproduce the results, a folder data/ should be added to the repo, as described in Repo Architecture. A detailed description of the dataset is available in references/The_Higgs_boson_ML_challenge.pdf.

Report

The choices made and the methodology used throughout this project are described in detail in report.pdf. The report explains the assumptions and decisions made during the project, as well as the results obtained.

Reproduce results

Requirements

  • Python==3.9.13
  • Numpy==1.21.5
  • Matplotlib

Instructions to run

Move to the root folder and execute:

python run.py

Make sure all the requirements are installed and the data folder is in the root. Be aware that training the models for 1000 epochs takes around 5 min on an Apple silicon M1 Pro. The best model was trained for 15000 epochs.

If you want to run the cross-validation, move to the root folder and execute:

python optimization.py

The cross-validation took around 1 hour per sub-model (on an Apple silicon M1 Pro), so around 8 hours for the whole model.
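For readers unfamiliar with the procedure, the following is a rough sketch of k-fold cross-validation, assuming that this is the scheme used. It does not reproduce optimization.py: build_k_indices, cross_validate, and the fit and score callables are hypothetical names chosen for the example.

```python
import numpy as np


def build_k_indices(n_samples, k_fold, seed=1):
    """Shuffle sample indices and split them into k equally sized folds.

    Samples left over when n_samples is not divisible by k_fold are dropped.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    fold_size = n_samples // k_fold
    return [indices[i * fold_size:(i + 1) * fold_size] for i in range(k_fold)]


def cross_validate(y, tx, k_fold, fit, score, seed=1):
    """Average a validation score over k train/validation splits.

    fit(y_train, tx_train) returns a model;
    score(y_val, tx_val, model) returns a float.
    """
    folds = build_k_indices(len(y), k_fold, seed)
    scores = []
    for i in range(k_fold):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k_fold) if j != i])
        model = fit(y[train_idx], tx[train_idx])
        scores.append(score(y[val_idx], tx[val_idx], model))
    return float(np.mean(scores))
```

Running this once per candidate hyperparameter value, and once per sub-model, is what makes the search expensive: every candidate requires k full training runs.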

If you want to visualize the performance of the model during training, move to the root folder and execute:

python plot_performance.py

Results

The performance of the model is assessed on AIcrowd from data/submission.csv, which is generated by run.py. The model achieves an overall accuracy of 0.818 with an F1-score of 0.722.
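For reference, the two metrics quoted above can be computed locally on a labeled validation set as sketched below. This is a generic illustration, not the AIcrowd scoring code; the function name and the choice of 1 as the positive (signal) label are assumptions.

```python
import numpy as np


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary predictions.

    `positive` is the label treated as the signal class (assumed to be 1).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = np.mean(y_true == y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    denom = 2 * tp + fp + fn
    # F1 = 2 * precision * recall / (precision + recall) = 2TP / (2TP + FP + FN)
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return float(accuracy), float(f1)
```

Accuracy counts all correct predictions equally, while the F1-score only rewards correctly identified signal events, which is why the two values can differ noticeably when the classes are imbalanced.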

Here is the performance of each sub-model during training:

[Figures: training performance of the sub-models (mass, no_mass)]

Authors

  1. Mery Tom, SCIPER: 297217 (tom.mery@epfl.ch)
  2. Lelièvre Maxime, SCIPER: 296777 (maxime.lelievre@epfl.ch)
  3. Peduto Matteo, SCIPER: 316194 (matteo.peduto@epfl.ch)