Model Description
- A RoBERTa [Liu et al., 2019] model fine-tuned for de-identification of medical notes.
- Sequence labeling (token classification): the model was trained to predict protected health information (PHI/PII) entity spans. The list of protected health information categories is defined by HIPAA.
- Each token is classified either as non-PHI or as one of 11 PHI types. Token predictions are aggregated into spans using BILOU tagging.
- The PHI labels that were used for training and other details can be found here: Annotation Guidelines
- More details on how to use this model, the data format, and other useful information are available in the GitHub repo: Robust DeID.
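The BILOU aggregation mentioned above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `bilou_to_spans`, the tag strings (e.g. `B-STAFF`, `U-DATE`), and the handling of inconsistent tag sequences are all assumptions made for the example.

```python
from typing import List, Tuple


def bilou_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse per-token BILOU tags into (start, end_exclusive, label) token spans.

    Hypothetical helper: B = begin, I = inside, L = last, O = outside, U = unit.
    """
    spans = []
    start = None  # token index of the currently open B- tag, if any
    label = None  # entity label of the currently open span
    for i, tag in enumerate(tags):
        prefix, _, ent = tag.partition("-")
        if prefix == "U":
            # Single-token entity
            spans.append((i, i + 1, ent))
            start, label = None, None
        elif prefix == "B":
            start, label = i, ent
        elif prefix == "I" and start is not None and ent == label:
            continue  # still inside the open span
        elif prefix == "L" and start is not None and ent == label:
            spans.append((start, i + 1, ent))
            start, label = None, None
        else:
            # "O" or an inconsistent sequence: discard any open span (an assumption)
            start, label = None, None
    return spans


tokens = ["Seen", "by", "Dr.", "Jane", "Doe", "on", "01/02/2019"]
tags = ["O", "O", "O", "B-STAFF", "L-STAFF", "O", "U-DATE"]
for start, end, label in bilou_to_spans(tags):
    print(label, tokens[start:end])
```

Dropping malformed sequences (for example an `I-` tag with no preceding `B-`) is only one possible policy; a more lenient decoder might keep them as partial spans.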
How to use
- A demo of the model (using model predictions to de-identify a medical note) is available in this Space: Medical-Note-Deidentification.
- Steps on how this model can be used to run a forward pass can be found here: Forward Pass
- In brief, the steps are:
  - Sentencize (the model aggregates the sentences back to the note level) and tokenize the dataset.
  - Use the predict function of this model to gather the predictions (i.e., predictions for each token).
  - Additionally, the model predictions can be used to remove PHI from the original note/text.
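The last step, removing PHI from the original text, can be sketched as below. This is an illustrative example only: the `redact` function, the `<<LABEL>>` placeholder format, and the character-offset span representation are assumptions, not the API of the Robust DeID repository.

```python
from typing import List, Tuple


def redact(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start_char, end_char_exclusive, label) span with <<LABEL>>.

    Hypothetical helper: spans are assumed to be character offsets into `text`.
    """
    pieces = []
    cursor = 0  # position in `text` up to which output has been written
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap an already redacted region
        pieces.append(text[cursor:start])
        pieces.append(f"<<{label}>>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Seen by Dr. Jane Doe on 01/02/2019."
predicted_spans = [(12, 20, "STAFF"), (24, 34, "DATE")]
print(redact(note, predicted_spans))
# Seen by Dr. <<STAFF>> on <<DATE>>.
```

Replacing spans with their label, rather than deleting them, keeps the note readable and preserves which kind of information was removed.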
Dataset
| PHI LABEL | I2B2 TRAIN SET (790 NOTES): COUNT | I2B2 TRAIN SET (790 NOTES): PERCENTAGE | I2B2 TEST SET (514 NOTES): COUNT | I2B2 TEST SET (514 NOTES): PERCENTAGE |
| --- | --- | --- | --- | --- |
| DATE | 7502 | 43.69 | 4980 | 44.14 |
| STAFF | 3149 | 18.34 | 2004 | 17.76 |
| HOSP | 1437 | 8.37 | 875 | 7.76 |
| AGE | 1233 | 7.18 | 764 | 6.77 |
| LOC | 1206 | 7.02 | 856 | 7.59 |
| PATIENT | 1316 | 7.66 | 879 | 7.79 |
| PHONE | 317 | 1.85 | 217 | 1.92 |
| ID | 881 | 5.13 | 625 | 5.54 |
| PATORG | 124 | 0.72 | 82 | 0.73 |
| EMAIL | 4 | 0.02 | 1 | 0.01 |
| OTHERPHI | 2 | 0.01 | 0 | 0 |
| TOTAL | 17171 | 100 | 11283 | 100 |
Training procedure
Results
Questions?
Post a GitHub issue on the repo: Robust DeID.