UKB-Tools

Tools in Python to quickly start using the UK-BioBank dataset before UKB RAP.

Introduction

The repository is organized as follows:

├── scripts/
│   ├── create_data.py
│   └── get_newest_baskets.py
└── src/
    ├── __init__.py
    ├── data.py
    ├── logger.py
    └── tools.py

Installation

Clone the repository:

git clone https://github.com/TemryL/UKB-Tools.git

Move to the directory:

cd UKB-Tools

Create a virtual environment with Python 3.11, then install the dependencies:

pip install -r requirements.txt

Usage

UK-BioBank data is organized by projects and baskets. Each project ID can have several basket IDs associated with it: when someone requests new fields or a data update under the same project ID, a new basket is created. Data cannot be merged across projects because participant IDs (eids) are randomized per project. Data from baskets of the same project can be merged, however, and it is preferable to take the data for a given UKB field from the most recent basket that contains it.
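
The selection rule described above can be sketched in a few lines. This is an illustration only, not the code in scripts/get_newest_baskets.py: the `Basket` class, its `released` date, and the `newest_basket_per_field` helper are hypothetical, and how the repository actually decides which basket is the most recent is not shown here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Basket:
    """Hypothetical description of one basket of a project."""
    basket_id: str
    released: date          # assumption: a release date is known for each basket
    fields: frozenset[str]  # UKB field IDs available in this basket


def newest_basket_per_field(baskets, wanted_fields):
    """Map each wanted field to the most recently released basket containing it."""
    result = {}
    # Visit baskets from oldest to newest so later baskets overwrite earlier ones.
    for basket in sorted(baskets, key=lambda b: b.released):
        for field in wanted_fields:
            if field in basket.fields:
                result[field] = basket.basket_id
    return result


# Made-up baskets for illustration; real basket names look different.
baskets = [
    Basket("basket_a", date(2020, 1, 15), frozenset({"31", "3066"})),
    Basket("basket_b", date(2022, 6, 1), frozenset({"31", "131369"})),
]
mapping = newest_basket_per_field(baskets, ["31", "131369", "3066"])
```

In this sketch, field 31 is present in both baskets and resolves to the newer one, while a field that appears in no basket is simply absent from the result.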

Suppose we want to create a dataset with UKB fields 31, 131369 and 3066. Store the field IDs in a text file, one per line, as follows:

ukb_fields.txt:

31
131369
3066

Run the following command to retrieve, for a given project ID, the most recent basket that contains the given UKB fields:

python scripts/get_newest_baskets.py ${/dir/to/ukb_folder} ${project_id} ${data/ukb_fields.txt} ${data/field_to_basket.json}

The results will be stored in a JSON file as follows:

field_to_basket.json:

{
    "31": "project_52887_41230",
    "131369": "project_52887_676883",
    "3066": "project_52887_669338"
}
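
Since several fields may map to the same basket, a consumer of this file would typically group the fields by basket so that each basket is read only once. A minimal sketch using only the standard library; the `group_fields_by_basket` name is hypothetical and not part of this repository, and the basket names in the example are made up:

```python
import json
from collections import defaultdict


def group_fields_by_basket(field_to_basket):
    """Invert a field -> basket mapping into basket -> sorted list of fields."""
    grouped = defaultdict(list)
    for field, basket in field_to_basket.items():
        grouped[basket].append(field)
    # Sort field IDs numerically so the output order is deterministic.
    return {basket: sorted(fields, key=int) for basket, fields in grouped.items()}


# Inline JSON in the same shape as field_to_basket.json.
raw = '{"31": "basket_a", "131369": "basket_b", "3066": "basket_a"}'
by_basket = group_fields_by_basket(json.loads(raw))
```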

Finally, to merge the data in a single CSV file, run the following command:

python scripts/create_data.py ${/dir/to/ukb_folder} ${data/field_to_basket.json} ${data.csv}
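
Conceptually, the merge takes the requested columns from each basket and joins them on the participant ID (eid), which is possible because all baskets belong to the same project. The sketch below illustrates this on small in-memory tables using only the standard library. It is not the implementation in scripts/create_data.py: the table layout, the simplified column names, and the `merge_baskets` and `to_csv` helpers are assumptions made for illustration.

```python
import csv
import io


def merge_baskets(tables, field_to_basket):
    """Join selected fields from several baskets into one row per eid.

    tables: basket name -> {eid -> {field -> value}} (assumed layout)
    field_to_basket: field -> name of the basket to take that field from
    """
    merged = {}
    for field, basket in field_to_basket.items():
        for eid, row in tables[basket].items():
            # Outer join: keep every eid seen in any selected basket.
            merged.setdefault(eid, {})[field] = row.get(field, "")
    return merged


def to_csv(merged, fields):
    """Serialize merged rows to CSV text with the eid column first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["eid", *fields], restval="", lineterminator="\n"
    )
    writer.writeheader()
    for eid in sorted(merged):
        writer.writerow({"eid": eid, **merged[eid]})
    return buffer.getvalue()


# Made-up data: participant 1002 has no value for field 3066.
tables = {
    "basket_a": {"1001": {"31": "0"}, "1002": {"31": "1"}},
    "basket_b": {"1001": {"3066": "72"}},
}
field_to_basket = {"31": "basket_a", "3066": "basket_b"}
merged = merge_baskets(tables, field_to_basket)
csv_text = to_csv(merged, ["31", "3066"])
```

The outer join is a design choice of this sketch: participants missing from one basket keep an empty cell rather than being dropped.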

Contribute

Feel free to contribute to this repo by fixing issues, improving performance or adding new features!