# VinDr-Mammo: A large-scale benchmark dataset for computer-aided detection and diagnosis in full-field digital mammography

Published: March 21, 2022. Version: 1.0.0

Pham, H. H., Nguyen Trung, H., & Nguyen, H. Q. (2022). VinDr-Mammo: A large-scale benchmark dataset for computer-aided detection and diagnosis in full-field digital mammography (version 1.0.0). PhysioNet. https://doi.org/10.13026/br2v-7517.

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

## Abstract

Breast cancer is one of the most prevalent types of cancer and a leading cause of cancer death. Mammography is the recommended imaging modality for periodic breast cancer screening. A few datasets have been published to develop computer-aided tools for mammography analysis. However, these datasets either have a limited sample size or consist of screen-film mammography (SFM), which has been replaced by full-field digital mammography (FFDM) in clinical practice. This project introduces a large-scale full-field digital mammography dataset of 5,000 four-view exams, which are double read by experienced mammographers to provide cancer assessment and breast density following the Breast Imaging Reporting and Data System (BI-RADS). Breast abnormalities that require further examination are also marked by bounding rectangles.

## Background

With about 2.2 million new cases in 2020, breast cancer is among the most common cancers [1]. Early detection of breast cancer can make treatment more likely to be successful. A previous study shows that biennial screening can bring about a 30% reduction in the breast cancer mortality rate [2]. In breast cancer screening, mammography is the recommended breast examination [3]. However, reading mammograms for breast cancer screening is challenging, with a recall rate significantly higher than the cancer detection rate [4].

Several works have studied the potential use of computer-aided diagnosis (CAD) tools for breast cancer screening in clinical practice [5,6,7,8]. The computer-aided algorithms leveraged in these works are learning-based algorithms [9], which require a large-scale annotated dataset for their development.

Currently, only a few mammography datasets are publicly available to the research community. Some of the most notable are the Digital Database for Screening Mammography (DDSM) [10], the Mammographic Image Analysis Society (MIAS) dataset [11], and INbreast [12]. These datasets are provided with detailed annotations of breast abnormalities, yet their limited sample size might hinder the performance of deep learning networks [13].

This project introduces a large-scale benchmark dataset of full-field digital mammography, called VinDr-Mammo, which consists of 5,000 four-view exams with breast-level assessment and finding annotations. Each of these exams was independently double read, with discordance (if any) being resolved by arbitration by a third radiologist. To the best of our knowledge, the VinDr-Mammo dataset is currently the largest public dataset of full-field digital mammography that contains BI-RADS assessment and abnormality annotations.

## Methods

### Overview

The Institutional Review Boards of Hanoi Medical University Hospital (HMUH) and Hospital 108 (H108), from which the data was collected, provided ethical approval for this study. Patient-identifiable information and protected health information were removed from the data. As the clinical care at these hospitals was not affected by this study, patient informed consent was waived. The data creation procedure consists of three steps: (1) data acquisition, (2) mammography reading, and (3) data stratification.

### Data acquisition

From the pool of mammography examinations taken between 2018 and 2020 and stored in the Picture Archiving and Communication Systems (PACS) of HMUH and H108, 5,000 mammography exams, equivalent to 20,000 images, were randomly sampled and then de-identified. In the DICOM metadata, the image specifications used for processing, the patient's age, and the imaging device's model were retained, while all other patient information was removed to protect patient privacy. No patients aged over 89 years appear in the dataset. For the imaging device, only the Manufacturer and Manufacturer's Model Name DICOM tags were kept. Patient information also appeared in the pixel data of some images, which were identified manually. As this information always appears in the corners of the image, it was removed by setting all pixels in a pre-defined rectangle at each of these corners to black. Subsequently, each image was inspected in two rounds by two different reviewers to ensure that it was properly de-identified.
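
The corner-blanking step can be illustrated with a minimal sketch. This is not the authors' code: the function name `mask_corners`, the rectangle size, and the use of NumPy are assumptions made for illustration, and the sketch assumes a 2-D grayscale array in which the value 0 is displayed as black (as with the MONOCHROME2 photometric interpretation).

```python
import numpy as np


def mask_corners(pixels: np.ndarray, box_height: int, box_width: int) -> np.ndarray:
    """Return a copy of a 2-D image with a rectangle at each corner set to 0."""
    if pixels.ndim != 2:
        raise ValueError("expected a 2-D grayscale image")
    if box_height <= 0 or box_width <= 0:
        raise ValueError("rectangle size must be positive")
    out = pixels.copy()
    h, w = box_height, box_width
    out[:h, :w] = 0    # top-left corner
    out[:h, -w:] = 0   # top-right corner
    out[-h:, :w] = 0   # bottom-left corner
    out[-h:, -w:] = 0  # bottom-right corner
    return out


# Toy 6x8 image filled with a bright value; the rectangle size is hypothetical.
image = np.full((6, 8), 255, dtype=np.uint16)
masked = mask_corners(image, box_height=2, box_width=3)
```

The function copies the input so the original pixel array stays available for the manual review rounds described above.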

### Mammography reading

This dataset aims to support the development of both CADx and CADe tools, so the reading result includes both the overall assessment of the breast and information about abnormal regions in the breast. The result was reported following the schema and lexicon of the Breast Imaging Reporting and Data System (BI-RADS) [14]. In terms of overall breast assessment, BI-RADS assessment categories (from 1 to 5) and breast density levels (A, B, C, or D) are provided. Regarding abnormal regions, the finding categories included in this study are mass, calcification, asymmetries, architectural distortion, and other associated features, namely suspicious lymph node, skin thickening, skin retraction, and nipple retraction. The four abnormality categories (mass, calcification, asymmetries, and architectural distortion) are also assigned a BI-RADS assessment. Findings of BI-RADS 2, i.e., benign, were not marked. Only findings of BI-RADS 3, 4, or 5, which require follow-up examination, were annotated with bounding boxes.

Three radiologists participated in the reading step, each with more than ten years of experience in mammography assessment and a healthcare profession certificate issued by the Vietnamese Ministry of Health. The reading procedure followed the European guideline [15]: each exam was read independently by two radiologists. Any discordance between the two readers was resolved by arbitration involving the third radiologist. The reading procedure was facilitated by a web-based annotation tool, VinDr Lab, which was developed specifically for reading medical images [16].
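
The double-reading rule can be sketched in a few lines. This is only an illustration of the control flow: the function `resolve_reading`, the `third_reader` stand-in, and the label strings are hypothetical, and the text does not specify how discordance was defined for bounding boxes, so the sketch compares single labels only.

```python
from typing import Callable, List, Tuple


def resolve_reading(first: str, second: str, arbitrate: Callable[[str, str], str]) -> str:
    """Return the agreed label, or defer to the arbiter when the two reads differ."""
    if first == second:
        return first
    return arbitrate(first, second)


# Stand-in for the third radiologist; it records each case sent to arbitration.
arbitrated_cases: List[Tuple[str, str]] = []


def third_reader(first: str, second: str) -> str:
    arbitrated_cases.append((first, second))
    return second  # placeholder decision, chosen only for the demonstration


agreed = resolve_reading("BI-RADS 2", "BI-RADS 2", third_reader)
disputed = resolve_reading("BI-RADS 3", "BI-RADS 4", third_reader)
```

Only the discordant exam reaches the arbiter, which mirrors the described workflow in which the third radiologist is involved only when the first two disagree.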

### Data stratification

For learning-based algorithms, separating the dataset into training and test sets is essential for developing and assessing their performance. By providing a pre-defined separation between training and test sets, we aim to create consistency between different studies, since the algorithms will be trained and tested on the same exams. The dataset is split into 1,000 test exams and 4,000 training exams, with the frequencies of each BI-RADS category, density level, and abnormality category being preserved by applying an iterative stratification algorithm [17].
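
To make the idea concrete, the following is a simplified sketch of iterative stratification for multi-label data in the spirit of [17]. It is not the code used to build the released split: the function name, the toy labels, and the way label sets are formed are assumptions. For this dataset, one could imagine each exam's label set combining its BI-RADS category, density level, and finding categories, but the exact encoding is not described in the text.

```python
import random
from collections import defaultdict


def iterative_stratification(label_sets, ratios, seed=0):
    """Assign each sample to a fold so that label frequencies follow `ratios`.

    label_sets: list of sets of labels, one set per sample.
    ratios: fold proportions summing to 1, e.g. [0.8, 0.2].
    Returns a list of fold indices, one per sample.
    """
    rng = random.Random(seed)
    n_folds = len(ratios)
    remaining = set(range(len(label_sets)))
    # Desired number of samples per fold, overall and per label.
    want_total = [len(label_sets) * r for r in ratios]
    label_count = defaultdict(int)
    for labels in label_sets:
        for label in labels:
            label_count[label] += 1
    want_label = {lab: [n * r for r in ratios] for lab, n in label_count.items()}
    assignment = [None] * len(label_sets)

    def place(index, fold):
        assignment[index] = fold
        remaining.discard(index)
        want_total[fold] -= 1
        for lab in label_sets[index]:
            want_label[lab][fold] -= 1

    while remaining:
        # Count the remaining samples per label and pick the rarest label.
        counts = defaultdict(int)
        for i in remaining:
            for lab in label_sets[i]:
                counts[lab] += 1
        if not counts:
            break  # only unlabeled samples are left
        rarest = min(counts, key=lambda lab: (counts[lab], str(lab)))
        for i in sorted(i for i in remaining if rarest in label_sets[i]):
            # Prefer the fold that most needs this label; break ties by the
            # fold's overall need, then randomly.
            best = max(
                range(n_folds),
                key=lambda f: (want_label[rarest][f], want_total[f], rng.random()),
            )
            place(i, best)

    # Unlabeled samples are distributed by overall need only.
    for i in sorted(remaining):
        place(i, max(range(n_folds), key=lambda f: (want_total[f], rng.random())))
    return assignment


# Toy data: 100 samples with a common, a less common, and a rare label.
samples = [
    {name for name, period in (("A", 2), ("B", 5), ("rare", 25)) if i % period == 0}
    for i in range(100)
]
folds = iterative_stratification(samples, ratios=[0.8, 0.2])
test_size = folds.count(1)
rare_in_test = sum(1 for labels, fold in zip(samples, folds) if fold == 1 and "rare" in labels)
```

Handling the rarest label first is the key design choice: rare categories have the least room for error, so they are placed while every fold still has capacity for them.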

## Data Description

The project directory contains the annotation files breast-level_annotations.csv and finding_annotations.csv, the metadata file metadata.csv, and a subfolder images that contains the DICOM files.

• images: Contains 5,000 subdirectories corresponding to the 5,000 exams in the dataset, where each folder name is the hashed study identifier of the exam. Each folder has four DICOM files for the two standard views of each breast. The path to each image file is images/<study_id>/<image_id>.dicom.
• breast-level_annotations.csv: Each row corresponds to an image and provides the BI-RADS assessment of the breast depicted by the image along with some metadata of the image. The attributes in each row are:
    • study_id: The encoded study identifier.
    • series_id: The encoded series identifier.
    • laterality: Laterality of the breast depicted in the image. Either L or R.
    • view_position: Orientation of the image with respect to the breast. Standard views are CC and MLO.
    • height: Height of the image.
    • width: Width of the image.
    • breast_birads: BI-RADS assessment of the breast that the image depicts.
    • breast_density: Density category of the breast that the image depicts.
    • split: The split to which the image belongs, either training or test.
• finding_annotations.csv: Each row represents an annotation of a breast abnormality in an image. Each finding annotation includes the image's metadata, namely image_id, study_id, series_id, laterality, view_position, height, width, breast_birads, breast_density, and split, as well as the annotation's own metadata:
    • finding_categories: List of finding categories attached to the marked region. For example, a mass with skin retraction would be represented as ["Mass", "Skin Retraction"].
    • finding_birads: BI-RADS assessment of the marked finding.
    • xmin: Left boundary of the box.
    • ymin: Top boundary of the box.
    • xmax: Right boundary of the box.
    • ymax: Bottom boundary of the box.
• metadata.csv: This file contains some information provided by DICOM tags that might be relevant for prospective research, namely the patient's age and the imaging device's model and manufacturer.
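
As a rough illustration of how the columns listed above fit together, the sketch below parses one finding row and builds the corresponding image path. The sample row is made up, and two details are assumptions rather than documented facts: that finding_categories is stored as a list literal readable by `ast.literal_eval`, and that finding_birads is a plain string.

```python
import ast
import csv
import io
from pathlib import PurePosixPath


def image_path(study_id: str, image_id: str) -> PurePosixPath:
    """Build the relative path images/<study_id>/<image_id>.dicom."""
    return PurePosixPath("images") / study_id / f"{image_id}.dicom"


def parse_finding(row: dict) -> dict:
    """Convert one finding_annotations.csv row into typed Python values."""
    return {
        "path": image_path(row["study_id"], row["image_id"]),
        # Assumed to be a list literal such as ["Mass", "Skin Retraction"].
        "categories": ast.literal_eval(row["finding_categories"]),
        "birads": row["finding_birads"],
        # Box corners in the order left, top, right, bottom.
        "box": tuple(float(row[key]) for key in ("xmin", "ymin", "xmax", "ymax")),
    }


# Hypothetical row for illustration; the identifiers and values are invented.
sample_csv = io.StringIO(
    "study_id,image_id,finding_categories,finding_birads,xmin,ymin,xmax,ymax\n"
    'study0,image0,"[""Mass"", ""Skin Retraction""]",BI-RADS 4,10.5,20.0,110.5,220.0\n'
)
findings = [parse_finding(row) for row in csv.DictReader(sample_csv)]
```

With the real file, the `io.StringIO` object would be replaced by an open handle to finding_annotations.csv. Rows without a marked finding, if any exist, may leave the box columns empty and would need separate handling.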

The folder structure of the dataset is as follows:

├── metadata.csv
├── breast-level_annotations.csv
├── finding_annotations.csv
└── images
    ├── 0025a5dc99fd5c742026f0b2b030d3e9
    │   ├── 451562831387e2822923204cf8f0873e.dicom
    │   ├── 47c8858666bcce92bcbd57974b5ce522.dicom
    │   └── fcf12c2803ba8dc564bf1287c0c97d9a.dicom
    ├── ...
    └── fff2339ea4b5d2f1792672ba7d52b318
        ├── 5144bf29398269fa2cf8c36b9c6db7f3.dicom
        ├── e4199214f5b40bd40847f5c2aedc44ef.dicom
        ├── e9b6ffe97a3b4b763cf94c9982254beb.dicom
        └── f1b6aa1cc6246c2760b882243657212e.dicom
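
After downloading, a quick sanity check of this layout might look like the sketch below, which reports study folders whose number of .dicom files differs from the expected four. The helper name and the demonstration folder names are hypothetical. The check runs here on a throwaway directory rather than on the real data.

```python
import tempfile
from pathlib import Path


def studies_with_unexpected_count(images_dir: Path, expected: int = 4) -> dict:
    """Map study folder name to its .dicom file count when it differs from `expected`."""
    unexpected = {}
    for study_dir in sorted(p for p in images_dir.iterdir() if p.is_dir()):
        count = sum(1 for f in study_dir.glob("*.dicom") if f.is_file())
        if count != expected:
            unexpected[study_dir.name] = count
    return unexpected


# Demonstration on a temporary directory with made-up folder and file names.
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    for study, n_files in (("study_a", 4), ("study_b", 3)):
        (images / study).mkdir(parents=True)
        for i in range(n_files):
            (images / study / f"image_{i}.dicom").touch()
    report = studies_with_unexpected_count(images)
```

On the real dataset, the call would be `studies_with_unexpected_count(Path("images"))` from the project directory, and an empty dictionary would indicate that every exam has four images.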

## Usage Notes

The VinDr-Mammo dataset is a large-scale full-field digital mammography dataset that can be used to develop and evaluate algorithms for cancer assessment and breast density estimation following the Breast Imaging Reporting and Data System (BI-RADS). In addition, the dataset can be used for tasks in medical imaging and computer vision in general. Regarding training and test splits, different splits can be used as desired, since the whole dataset was created via a single procedure.

One limitation of this project is that some abnormalities, namely skin retraction and nipple retraction, have fewer than 40 samples, which could make studies of these abnormalities based on this dataset unreliable.

## Release Notes

This is the first public release (v1.0) of the VinDr-Mammo dataset.

## Ethics

The authors declare no ethics concerns. The Institutional Review Boards of Hanoi Medical University Hospital (HMUH) and Hospital 108 (H108) approved the release of the de-identified data.

## Acknowledgements

The authors would like to acknowledge the Hanoi Medical University Hospital and Hospital 108 for providing us access to their image databases and for agreeing to make the dataset publicly available. We are especially thankful to our team of radiologists, Nhung Hong Luu, Minh Thi Ngoc Nguyen, and Huong Thu Lai, who participated in the data collection and labeling process.

## Conflicts of Interest

VinBigData JSC supported the creation of this resource. Hieu Trung Nguyen and Ha Quy Nguyen are currently employed by VinBigData. VinBigData JSC did not profit from the work done in this project.

## References

1. Sung, H. et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians 71, 209–249 (2021).
2. Mandelblatt, J. S. et al. Collaborative modeling of the benefits and harms associated with different US breast cancer screening strategies. Annals of Internal Medicine 164, 215–225 (2016).
3. Siu, A. L. Screening for breast cancer: US Preventive Services Task Force recommendation statement. Annals of Internal Medicine 164, 279–296 (2016).
4. Lehman, C. D. et al. National performance benchmarks for modern screening digital mammography: update from the Breast Cancer Surveillance Consortium. Radiology 283, 49–58 (2017).
5. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
6. Dembrower, K. et al. Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload: a retrospective simulation study. The Lancet Digit. Heal. 2, e468–e474 (2020).
7. Rodríguez-Ruiz, A. et al. Stand-alone artificial intelligence for breast cancer detection in mammography: comparison with 101 radiologists. JNCI: J. Natl. Cancer Inst. 111, 916–922 (2019).
8. Rodríguez-Ruiz, A. et al. Detection of breast cancer with mammography: effect of an artificial intelligence support system. Radiology 290, 305–314 (2019).
9. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
10. Bowyer, K. et al. The digital database for screening mammography. In Third international workshop on digital mammography, vol. 58, 27 (1996).
11. Suckling, J. et al. The mammographic image analysis society digital mammogram database. Digit. Mammo 375–386 (1994).
12. Moreira, I. C. et al. INbreast: toward a full-field digital mammographic database. Acad. Radiol. 19, 236–248 (2012).
13. Sun, C., Shrivastava, A., Singh, S. & Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, 843–852 (2017).
14. Sickles, E. A. et al. ACR BI-RADS® Mammography (American College of Radiology, 2013), fifth edn.
15. Amendoeira, I. et al. European guidelines for quality assurance in breast cancer screening and diagnosis (European Commission, 2013).
16. Nguyen, N. T. et al. VinDr Lab: A Data Platform for Medical AI. https://github.com/vinbigdata-medical/vindr-lab (2021).
17. Sechidis, K., Tsoumakas, G. & Vlahavas, I. On the stratification of multi-label data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 145–158 (Springer, 2011).

##### Access

Access Policy:
Only registered users who sign the specified data use agreement can access the files.

PhysioNet Restricted Health Data License 1.5.0

Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0

##### Discovery

Project Website:
https://vindr.ai/
