EyeRub

iOS/watchOS app that detects and monitors eye rubbing and various other hand-face interactions.

Join the Beta!

If you are interested in joining the beta, check the website. Below is a brief explanation of the methods used to develop EyeRub. For more details, please check out the paper (under review). You can also watch the presentation (in French) given at SAFIR 2023 in Paris here.

Abstract

In this work, we present a new machine learning method based on the Transformer neural network to detect eye rubbing using a smartwatch. In ophthalmology, accurate detection and prevention of eye rubbing could reduce the incidence and progression of ectatic disorders such as keratoconus, and prevent blindness. Our approach leverages the state-of-the-art capabilities of the Transformer network, widely recognized for its success in natural language processing (NLP). We evaluate our method against several baselines using a newly collected dataset and achieve an impressive accuracy of 97% with fine-tuning. Notably, our model operates in real time on an Apple Watch, enabling prompt detection and response.

Problem Statement

The goal of the study is to create a machine learning tool that identifies eye rubbing from Apple Watch sensor data, with the aim of investigating its link to corneal diseases such as keratoconus. A key challenge is distinguishing eye rubbing from similar hand-face interactions. The proposed solution is a machine learning model capable of classifying different hand-face activities, as depicted in the pipeline illustration below.

[Figure: study pipeline]

Input

The Apple Watch provides sensor measurements sampled at 50 Hz. The signals are composed of the following 19 features provided by the Apple Watch sensors:

  1. Raw Accelerometer Data:
    • Acceleration x, y, z in G
  2. Processed Device-Motion Data:
    • Yaw, Roll, Pitch in rad
    • Rotation Rate x, y, z in rad/s
    • User Acceleration x, y, z in G
    • Quaternion x, y, z, w
    • Gravity x, y, z in G
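
As a point of reference, here is a minimal Core Motion collection sketch. The `MotionSample` type, the feature grouping, and the use of `CMMotionManager` on the watch are illustrative assumptions, not the exact implementation used by EyeRub.

```swift
import CoreMotion

// Illustrative container for the 19 per-sample features listed above.
struct MotionSample {
    var rawAcceleration: [Double]   // ax, ay, az (G)
    var attitude: [Double]          // yaw, roll, pitch (rad)
    var rotationRate: [Double]      // x, y, z (rad/s)
    var userAcceleration: [Double]  // x, y, z (G)
    var quaternion: [Double]        // x, y, z, w
    var gravity: [Double]           // x, y, z (G)
}

final class MotionCollector {
    private let manager = CMMotionManager()

    func start(handler: @escaping (MotionSample) -> Void) {
        // 50 Hz sampling, matching the setup described above.
        manager.accelerometerUpdateInterval = 1.0 / 50.0
        manager.deviceMotionUpdateInterval = 1.0 / 50.0

        manager.startAccelerometerUpdates()
        manager.startDeviceMotionUpdates(to: .main) { [weak self] motion, _ in
            guard let motion,
                  let accel = self?.manager.accelerometerData?.acceleration else { return }
            let q = motion.attitude.quaternion
            handler(MotionSample(
                rawAcceleration: [accel.x, accel.y, accel.z],
                attitude: [motion.attitude.yaw, motion.attitude.roll, motion.attitude.pitch],
                rotationRate: [motion.rotationRate.x, motion.rotationRate.y, motion.rotationRate.z],
                userAcceleration: [motion.userAcceleration.x, motion.userAcceleration.y, motion.userAcceleration.z],
                quaternion: [q.x, q.y, q.z, q.w],
                gravity: [motion.gravity.x, motion.gravity.y, motion.gravity.z]
            ))
        }
    }
}
```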

Output

The classes of the classification task are illustrated below:

[Figure: hand-face interaction classes]

Methods

Real-time classification

To enable real-time operation on the Apple Watch, a sliding window method is employed, where the continuous sensor data stream is segmented into fixed-size windows of 3 seconds, with a step size of 0.5 seconds. Each window is analyzed by a machine learning model to extract features and classify human activities, enabling activity recognition every 0.5 seconds based on the preceding 3 seconds of sensor data.
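
A minimal sketch of this buffering, assuming 50 Hz sampling (a 3-second window is 150 samples and a 0.5-second step is 25 samples); the `SlidingWindow` type and its return convention are illustrative, not the app's actual code:

```swift
// Sliding-window buffer: keeps the last 150 samples (3 s at 50 Hz) and
// emits a full window every 25 new samples (0.5 s).
final class SlidingWindow {
    private let windowSize = 150
    private let stepSize = 25
    private var buffer: [[Double]] = []   // one 19-feature vector per sample
    private var samplesSinceLastEmit = 0

    /// Appends one sample; returns the current 3 s window every 0.5 s
    /// once enough data has accumulated, otherwise nil.
    func append(_ features: [Double]) -> [[Double]]? {
        buffer.append(features)
        if buffer.count > windowSize {
            buffer.removeFirst(buffer.count - windowSize)
        }
        samplesSinceLastEmit += 1
        guard buffer.count == windowSize, samplesSinceLastEmit >= stepSize else { return nil }
        samplesSinceLastEmit = 0
        return buffer
    }
}
```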

[Figure: sliding window segmentation]
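
Each emitted window is then classified on-device. The sketch below assumes the trained classifier is bundled as a Core ML model whose input is a 150 × 19 multi-array named "signal" and whose output is a string feature named "label"; both names, and the shape layout, are placeholders rather than the app's actual interface.

```swift
import CoreML

// Hedged sketch: runs one prediction on a 150 x 19 window (time x features).
func classify(window: [[Double]], with model: MLModel) throws -> String {
    let input = try MLMultiArray(shape: [150, 19], dataType: .double)
    for (t, sample) in window.enumerated() {
        for (f, value) in sample.enumerated() {
            input[[NSNumber(value: t), NSNumber(value: f)]] = NSNumber(value: value)
        }
    }
    // "signal" and "label" are assumed feature names.
    let provider = try MLDictionaryFeatureProvider(dictionary: ["signal": input])
    let output = try model.prediction(from: provider)
    return output.featureValue(for: "label")?.stringValue ?? "unknown"
}
```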

Model Architecture

The model employs an attention-based (Transformer) architecture, and its encoder is pre-trained by denoising unlabeled sequences.

[Figure: model architecture]
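
The exact pre-training objective is described in the paper; as a rough, assumed sketch, a common denoising formulation for multivariate time series masks part of each unlabeled window and trains the encoder to reconstruct the masked values:

```latex
% Assumed masked-denoising reconstruction loss (illustrative, not necessarily
% the exact objective used in the paper).
% x_t in R^19 is the sensor vector at time step t, M the set of masked steps,
% and \hat{x}_t the model's reconstruction.
\mathcal{L}_{\mathrm{denoise}} = \frac{1}{|M|} \sum_{t \in M} \lVert \hat{x}_t - x_t \rVert_2^2
```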

Dataset

The table below summarizes the statistics of the collected datasets. The automatic labelling setup resulted in signals of variable length; for those signals, per-user statistics of the raw collected signals are presented as interactive plots here.

[Table: dataset statistics]

Results

Effectiveness of unsupervised pre-training

The results below confirm that unsupervised pre-training offers a substantial performance benefit over fully supervised learning, both in terms of classification performance (F1-score) and prediction confidence (cross-entropy loss).

[Figure: effect of unsupervised pre-training]

Performance Comparison

Based on the results presented below, we confirmed that the attention-based model (Transformer) outperforms both traditional machine learning and deep learning methods by a significant margin.

[Figure: performance comparison]