Intro
Getting started
Project description
The aim of this machine learning project is to predict whether a decay signature comes from a Higgs boson or from some other particle.
The model is based on a vector of features describing a collision event between two high-speed protons. More details about the project are available in references/project1_description.pdf. Here, a regularized logistic regression is implemented and trained on 8 subsets of the full dataset.
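As an illustration of the technique named above, here is a minimal NumPy sketch of L2-regularized logistic regression trained with full-batch gradient descent. It is not the code from this repository: the function names, the 0/1 label encoding, the penalty form lambda_ * ||w||^2 and the default hyperparameters are all assumptions made for the example.

```python
import numpy as np


def sigmoid(t):
    """Logistic function, written with tanh to avoid overflow for large |t|."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))


def reg_logistic_loss(y, tx, w, lambda_):
    """Mean negative log-likelihood plus an L2 penalty (assumes y in {0, 1})."""
    z = tx @ w
    # log(1 + exp(z)) is computed stably with logaddexp(0, z)
    return np.mean(np.logaddexp(0.0, z) - y * z) + lambda_ * np.sum(w ** 2)


def reg_logistic_gradient(y, tx, w, lambda_):
    """Gradient of reg_logistic_loss with respect to w."""
    return tx.T @ (sigmoid(tx @ w) - y) / len(y) + 2.0 * lambda_ * w


def train(y, tx, lambda_=1e-3, gamma=0.1, epochs=1000):
    """Full-batch gradient descent starting from w = 0 (hypothetical defaults)."""
    w = np.zeros(tx.shape[1])
    for _ in range(epochs):
        w -= gamma * reg_logistic_gradient(y, tx, w, lambda_)
    return w
```

Under these assumptions, one such model would be fitted per subset, each with its own feature matrix tx and weight vector w.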
Data
The dataset comes from a popular machine learning challenge, finding the Higgs boson, which uses original data from CERN. The dataset is available at https://www.aicrowd.com/challenges/epfl-machine-learning-higgs. To reproduce the results, a data/ folder should be added to the repo, as described in Repo Architecture. A detailed description of the dataset is available in references/The_Higgs_boson_ML_challenge.pdf.
Report
All the details about the choices made and the methodology used throughout this project are available in report.pdf. The report explains the assumptions and decisions made during the project and the results obtained.
Reproduce results
Requirements
- Python==3.9.13
- Numpy==1.21.5
- Matplotlib
Instructions to run
Move to the root folder and execute:
python run.py
Make sure all the requirements are installed and the data folder is in the root. Be aware that training the models for 1000 epochs takes around 5 min on an Apple silicon M1 Pro. The best model was trained for 15000 epochs.
If you want to run the cross-validation, move to the root folder and execute:
python optimization.py
The cross-validation took around 1 hour per sub-model (on an Apple silicon M1 Pro), so around 8 hours for the whole model.
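For readers unfamiliar with the procedure, the following is a generic k-fold cross-validation sketch in NumPy. It does not reproduce optimization.py: the helper names, the choice of k, the seed and the fit/score callables are hypothetical and only show the general pattern of training on k-1 folds and scoring on the held-out fold.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


def cross_validate(y, tx, fit, score, k=5, seed=0):
    """Average held-out score over k folds.

    fit(y_train, tx_train) returns a model;
    score(y_test, tx_test, model) returns a float.
    Both callables are supplied by the caller (hypothetical interface).
    """
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, which is held out for scoring
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(y[train_idx], tx[train_idx])
        scores.append(score(y[test_idx], tx[test_idx], model))
    return float(np.mean(scores))
```

A hyperparameter search would call cross_validate once per candidate value (for example, per regularization strength) and keep the value with the best average score.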
If you want to visualize the performance of the model during training, move to the root folder and execute:
python plot_performance.py
Results
The performance of the models is assessed on AIcrowd from data/submission.csv, which is generated by run.py. The model achieves a global accuracy of 0.818 with an F1-score of 0.722.
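To make the two reported metrics concrete, here is a small sketch of how accuracy and F1-score can be computed from predicted and true labels. The function name is hypothetical, and the label encoding (1 for signal, -1 for background) is an assumption; the scores quoted above come from the AIcrowd platform, not from this snippet.

```python
import numpy as np


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) with `positive` treated as the signal class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    accuracy = float(np.mean(y_true == y_pred))
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives at all
    f1 = float(2 * tp / denom) if denom else 0.0
    return accuracy, f1


# Toy example with assumed labels: 1 = signal, -1 = background
acc, f1 = accuracy_and_f1([1, -1, 1, 1, -1], [1, 1, -1, 1, -1])
print(acc, round(f1, 3))  # 0.6 0.667
```

F1 is the harmonic mean of precision and recall, so it penalizes a model that misses signal events even when overall accuracy is high.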
Here is the performance of each sub-model during training:
Authors
- Mery Tom, SCIPER: 297217 (tom.mery@epfl.ch)
- Lelièvre Maxime, SCIPER: 296777 (maxime.lelievre@epfl.ch)
- Peduto Matteo, SCIPER: 316194 (matteo.peduto@epfl.ch)