Context Retrieval on SQuAD Dataset
Getting started
Description
The goal of context retrieval for question answering is to efficiently identify relevant passages that contain the answer to a given question. In this repository, retrievers based on TF-IDF, Okapi BM25 and BERT models are implemented and tested on the SQuAD dataset.
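As a rough illustration of the retrieval idea, the sketch below builds TF-IDF vectors for a handful of contexts with scikit-learn and returns the most similar one for a question; the retrievers implemented in this repository may differ in preprocessing and scoring.

```python
# Minimal TF-IDF retrieval sketch (illustrative only; the repo's retrievers may differ).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

contexts = [
    "Official corporal punishment, often by caning, remains commonplace in schools ...",
    "The Stanford Question Answering Dataset is built from Wikipedia articles.",
]

vectorizer = TfidfVectorizer()
context_matrix = vectorizer.fit_transform(contexts)  # one TF-IDF vector per context

question = "What is a common practice in official corporal punishment?"
question_vec = vectorizer.transform([question])

# Rank contexts by cosine similarity to the question and keep the best match (top-1).
scores = cosine_similarity(question_vec, context_matrix)[0]
print(contexts[scores.argmax()])
```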
Data
Context retrieval is performed on the SQuAD dataset. The Stanford Question Answering Dataset (SQuAD) is a collection of over 100,000 question-answer pairs based on articles from Wikipedia. Versions 1.1 and 2.0 of SQuAD are already available in the dataset folder.
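For reference, SQuAD files use a nested JSON layout (articles, then paragraphs, then question-answer pairs). A minimal sketch of reading question-context pairs, assuming the files sit in the dataset folder, could look like this:

```python
# Sketch of reading (question, context) pairs from a SQuAD-format file.
# The path below assumes the JSON files live in the `dataset` folder.
import json

with open("dataset/dev-v1.1.json", encoding="utf-8") as f:
    squad = json.load(f)

pairs = []
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            pairs.append((qa["question"], context))

print(len(pairs), "question-context pairs")
```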
Reproduce results
Requirements
- python==3.7.7
- numpy==1.21.6
- nltk==3.7
- scikit-learn==1.0.2
- torch==1.13.1
- transformers==4.27.3
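Assuming a pip-based environment, the pinned versions can be installed with, for example:
pip install numpy==1.21.6 nltk==3.7 scikit-learn==1.0.2 torch==1.13.1 transformers==4.27.3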
Instructions to run
First, make sure all the requirements are installed. You might also need to manually download the NLTK corpora:
python -m nltk.downloader all-corpora
To start retrieving without training and initializing the retrievers, please download the retrievers folder here and place it at the root of the repository (see repo architecture).
The following commands give more details about the positional arguments and describe what each script does when it runs:
python retrieve.py -h
python train.py -h
python initialize.py -h
Please run them before running the following. The commands shown below have to be executed in the given order to keep things consistent.
To retrieve a context from a given question run the following:
python retrieve.py model_type question
Example:
python retrieve.py "BM25" "What is a common practice in official corporal punishment?"
- BM25 retriever successfully loaded.
- Context retrieved in 0.04s:
Official corporal punishment, often by caning, remains commonplace in schools in some Asian, African and Caribbean countries. For details of individual countries see School corporal punishment.
YOU DO NOT NEED to execute the following commands if you have already downloaded and copied the retrievers folder into the root of the repo.
To train the BiEncoder BERT-based model from scratch, run:
python train.py data_path nb_epochs batch_size
Training the BERT model is a long process and should be done on GPUs.
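For context, a bi-encoder encodes the question and each candidate passage separately and scores pairs by comparing the two embeddings. The sketch below only illustrates that scoring step with the Hugging Face transformers API; the checkpoint (bert-base-uncased), the [CLS] pooling, and everything else are assumptions, and the architecture and training loop in train.py may differ.

```python
# Bi-encoder scoring sketch (illustrative; the model and pooling used by train.py may differ).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(texts):
    # Tokenize a batch of texts and use the [CLS] token embedding as the text representation.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = encoder(**batch)
    return output.last_hidden_state[:, 0]  # shape: (batch_size, hidden_size)

question_emb = embed(["What is a common practice in official corporal punishment?"])
context_embs = embed([
    "Official corporal punishment, often by caning, remains commonplace in schools ...",
    "SQuAD is a collection of question-answer pairs based on Wikipedia articles.",
])

# Dot-product similarity between the question and each context; the highest score wins.
scores = question_emb @ context_embs.T
print(scores.argmax(dim=1).item())
```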
To initialize the TF-IDF, BM25 or BERT-based model and get its accuracy, run:
python initialize.py model_type data_path
Beware that initializing a model can take a few minutes and will produce pickle files in the retrievers folder.
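For reference, caching an initialized retriever with pickle looks roughly like the sketch below; the file name and the objects actually written by initialize.py are assumptions here.

```python
# Sketch of caching a fitted retriever; assumes the `retrievers` folder already exists.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

contexts = ["First context ...", "Second context ..."]
vectorizer = TfidfVectorizer().fit(contexts)

# Hypothetical file name; initialize.py may write different files and contents.
with open("retrievers/tfidf_retriever.pkl", "wb") as f:
    pickle.dump({"vectorizer": vectorizer, "contexts": contexts}, f)

with open("retrievers/tfidf_retriever.pkl", "rb") as f:
    retriever = pickle.load(f)
```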
Results
The accuracies of the models on the SQuAD v1.1 validation set (dev-v1.1.json) are reported below. Accuracy is computed as the percentage of contexts correctly retrieved (exact match with the target) over the whole dataset. A retrieved context that contains the answer to the question but was not originally the context associated with the question in the dataset is counted as incorrect. Calculating accuracy this way makes the task harder.
Model Type | Top-1 Accuracy |
---|---|
TF-IDF | 59.25% |
BM25 | 77.60% |
BERT | 84.79% |
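As a sketch, the exact-match top-1 accuracy described above could be computed like this (the repository's evaluation code may differ):

```python
# Top-1 exact-match accuracy sketch: a retrieval counts only if it equals the gold context.
def top1_accuracy(retrieved_contexts, gold_contexts):
    correct = sum(
        retrieved == gold
        for retrieved, gold in zip(retrieved_contexts, gold_contexts)
    )
    return 100.0 * correct / len(gold_contexts)

# Example: 2 of 3 retrieved contexts exactly match the gold context -> 66.67%.
print(top1_accuracy(["c1", "c2", "c3"], ["c1", "c2", "x"]))
```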