A framework for evaluating clinical artificial intelligence systems without ground-truth annotations

Description of datasets

Stanford diverse dermatology images

The Stanford diverse dermatology images (DDI) dataset consists of dermatology images collected in the Stanford Clinics between 2010 and 2020. These images (n = 656) each depict either a benign or a malignant skin lesion from patients in one of three skin-tone groups (Fitzpatrick I-II, III-IV, V-VI). For further details, we refer interested readers to the original publication14. We chose this dataset as the data in the wild because a recent study reported that the performance of several models degraded when they were deployed on the DDI dataset. These models (see Description of models) were trained on the HAM10000 dataset, which we treated as the source dataset.

HAM10000 dataset

The HAM10000 dataset consists of dermatology images collected over 20 years from the Medical University of Vienna and the practice of Cliff Rosendahl16. These images (n = 10,015) reflect a wide range of skin conditions, from Bowen’s disease and basal cell carcinoma to melanoma. In line with a recent study14, and to remain consistent with the labels of the Stanford DDI dataset, we mapped these skin conditions to a binary label: benign or malignant. We randomly split this dataset into a training set and a held-out set using an 80:20 ratio. We did not use a validation set because publicly available models already existed and therefore did not need to be trained from scratch.
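The label mapping and random split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the assignment of HAM10000 diagnosis codes to benign or malignant, the function names `to_binary` and `split_80_20`, and the fixed seed are all assumptions introduced here, since the text does not specify them.

```python
import random

# ASSUMPTION: one plausible grouping of the HAM10000 diagnosis codes into
# binary labels. The source does not list the exact assignment used.
MALIGNANT_CODES = {"mel", "bcc", "akiec"}
BENIGN_CODES = {"nv", "bkl", "df", "vasc"}


def to_binary(diagnosis_code):
    """Map a diagnosis code to 1 (malignant) or 0 (benign)."""
    if diagnosis_code in MALIGNANT_CODES:
        return 1
    if diagnosis_code in BENIGN_CODES:
        return 0
    raise ValueError(f"Unknown diagnosis code: {diagnosis_code!r}")


def split_80_20(items, train_fraction=0.8, seed=0):
    """Randomly split items into a training set and a held-out set.

    The seed is a hypothetical choice made only for reproducibility.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Under these assumptions, calling `split_80_20` on the 10,015 image identifiers gives 8,012 training images and 2,003 held-out images, with no separate validation set.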

Camelyon17-WILDS dataset

The Camelyon17-WILDS dataset consists of histopathology patches from 50 whole slide images collected from 5 different hospitals29. These images (n = 450,000) depict lymph node tissue with or without the presence of a tumour. We use the exact same training (n = 302,436), validation (n = 33,560), and test (n = 85,054) splits constructed by the
