Database Restricted Access
VinDr-Mammo: A large-scale benchmark dataset for computer-aided detection and diagnosis in full-field digital mammography
Hieu Huy Pham, Hieu Nguyen Trung, Ha Quy Nguyen
Published: March 21, 2022. Version: 1.0.0
When using this resource, please cite:
Pham, H. H., Nguyen Trung, H., & Nguyen, H. Q. (2022). VinDr-Mammo: A large-scale benchmark dataset for computer-aided detection and diagnosis in full-field digital mammography (version 1.0.0). PhysioNet. https://doi.org/10.13026/br2v-7517.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Breast cancer is one of the most prevalent types of cancer and the leading type of cancer death. Mammography is the recommended imaging modality for periodic breast cancer screening. A few datasets have been published to support the development of computer-aided tools for mammography analysis. However, these datasets either have a limited sample size or consist of screen-film mammography (SFM), which has been replaced by full-field digital mammography (FFDM) in clinical practice. This project introduces a large-scale full-field digital mammography dataset of 5,000 four-view exams, which were double read by experienced mammographers to provide cancer assessment and breast density following the Breast Imaging Reporting and Data System (BI-RADS). Breast abnormalities that require further examination are also marked by bounding rectangles.
With about 2.2 million new cases in 2020, breast cancer is among the most common cancers [1]. Early detection of breast cancer can make treatment more likely to be successful. A previous study shows that biennial screening can bring about a 30% reduction in the breast cancer mortality rate [2]. In breast cancer screening, mammography is the recommended breast examination [3]. However, reading mammograms for breast cancer screening is challenging, with a recall rate significantly higher than the cancer detection rate [4].
Several works have studied the potential use of computer-aided diagnosis (CAD) tools for breast cancer screening in clinical practice [5,6,7,8]. The computer-aided algorithms leveraged in these works are learning-based algorithms [9], which require a large-scale annotated dataset for their development.
Currently, only a few mammography datasets are publicly available to the research community. Some of the most notable are the Digital Database for Screening Mammography (DDSM) [10], the Mammographic Image Analysis Society (MIAS) dataset [11], and INbreast [12]. These datasets are provided with detailed annotations of breast abnormalities, yet their limited sample size might hinder the performance of deep learning networks [13].
This project introduces a large-scale benchmark dataset of full-field digital mammography, called VinDr-Mammo, which consists of 5,000 four-view exams with breast-level assessment and finding annotations. Each of these exams was independently double read, with discordance (if any) being resolved by arbitration by a third radiologist. To the best of our knowledge, the VinDr-Mammo dataset is currently the largest public dataset of full-field digital mammography that contains BI-RADS assessment and abnormality annotations.
The Institutional Review Board of Hanoi Medical University Hospital (HMUH) and Hospital 108 (H108), from which the data was collected, provided the ethical approval for this study. Patient-identifiable information and protected health information were removed from the data. As the clinical care at these hospitals was not affected by this study, the requirement for informed patient consent was waived. The data creation procedure consists of three steps: (1) data acquisition, (2) mammography reading, and (3) data stratification.
From the pool of mammography examinations taken between 2018 and 2020 and stored in the Picture Archiving and Communication Systems (PACS) of HMUH and H108, 5,000 mammography exams, equivalent to 20,000 images, were randomly sampled and then de-identified. In the DICOM metadata, the image specifications used for processing, the patient's age, and the imaging device's model were retained, while all other patient information was removed to protect patient privacy. Regarding age, no patients aged over 89 years appear in the dataset. Regarding the imaging device's model, we keep only the Manufacturer's Model Name tags in the DICOM metadata. In addition, patient information also appears in the pixel data of some images, which were identified manually. As this information always appears in the corners of the image, it was removed by setting to black all pixels in a pre-defined rectangle at each of these corners. Subsequently, each image was inspected in two rounds by two different reviewers to ensure that it was properly de-identified.
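The corner-masking step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `black_out_corners`, the rectangle size, and the use of a nested list as the pixel grid are all assumptions. The source states only that pixels in a pre-defined rectangle at the affected corners were set to black.

```python
def black_out_corners(pixels, box_h, box_w, corners=("tl", "tr", "bl", "br")):
    """Return a copy of a 2D pixel grid with rectangles at the given corners set to 0.

    pixels is a list of rows (each row a list of ints). box_h and box_w are the
    hypothetical rectangle height and width in pixels; corners uses the codes
    "tl", "tr", "bl", "br" (top/bottom, left/right).
    """
    height, width = len(pixels), len(pixels[0])
    box_h, box_w = min(box_h, height), min(box_w, width)
    out = [row[:] for row in pixels]  # copy so the input grid is left untouched
    row_ranges = {"t": range(0, box_h), "b": range(height - box_h, height)}
    col_ranges = {"l": range(0, box_w), "r": range(width - box_w, width)}
    for corner in corners:
        for r in row_ranges[corner[0]]:
            for c in col_ranges[corner[1]]:
                out[r][c] = 0  # black
    return out


# Toy 4x6 "image" of white pixels; mask only the top-left and bottom-right corners.
image = [[255] * 6 for _ in range(4)]
masked = black_out_corners(image, box_h=1, box_w=2, corners=("tl", "br"))
```

On real DICOM pixel data one would more likely use array slicing in a numerical library; the nested-list version is used here only to keep the sketch dependency-free.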
To provide a dataset that can be used for the development of both CADx and CADe tools, the reading result includes both the overall assessment of the breast and information about abnormal regions in the breast. The result was reported following the schema and lexicon of the Breast Imaging Reporting and Data System (BI-RADS) [14]. In terms of overall breast assessment, BI-RADS assessment categories (from 1 to 5) and breast density levels (A, B, C, or D) are provided. Regarding abnormal regions, the finding categories included in this study are mass, calcification, asymmetries, architectural distortion, and other associated features, namely suspicious lymph node, skin thickening, skin retraction, and nipple retraction. Findings in the four abnormality categories (mass, calcification, asymmetries, and architectural distortion) are also assigned a BI-RADS assessment. Findings of BI-RADS 2, i.e., benign, were not marked. Only findings of BI-RADS 3, 4, or 5, which require follow-up examination, were annotated with bounding boxes.
Three radiologists participated in the reading step, each with more than ten years of experience in mammography assessment and each holding a healthcare profession certificate issued by the Vietnamese Ministry of Health. The reading procedure followed the European guideline [15]: each exam was double read by two radiologists independently, and any discordance between the two radiologists was resolved by arbitration with the involvement of the third radiologist. The reading procedure was facilitated by a web-based annotation tool, VinDr Lab, which was developed specifically for reading medical images [16].
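The double-reading rule with arbitration can be expressed as a small sketch. The function and the toy arbitrator below are hypothetical and only illustrate the control flow; the source does not describe how the third radiologist reached a decision.

```python
def resolve_reading(reader_a, reader_b, arbitrate):
    """Return the final label for one exam under double reading with arbitration.

    reader_a and reader_b are the two independent readings. arbitrate is a
    callable standing in for the third radiologist; it is consulted only when
    the two readings disagree.
    """
    if reader_a == reader_b:
        return reader_a
    return arbitrate(reader_a, reader_b)


arbitration_log = []


def third_reader(a, b):
    """Placeholder arbitrator: records the case and picks the higher category.

    The "pick the higher category" rule is an arbitrary stand-in, not a
    description of clinical practice.
    """
    arbitration_log.append((a, b))
    return max(a, b)


agreed = resolve_reading(2, 2, third_reader)    # readers agree, no arbitration
disputed = resolve_reading(3, 4, third_reader)  # readers disagree, arbitrated
```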
For learning-based algorithms, separating the dataset into training and test sets is essential for developing the algorithms and assessing their performance. By providing a pre-defined separation between training and test sets, we aim to create consistency between different studies, since the algorithms will be trained and tested on the same exams. The dataset is split into 1,000 test exams and 4,000 training exams, with the frequencies of each BI-RADS category, density level, and abnormality category being preserved by applying an iterative stratification algorithm [17].
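One way to check that a split preserves label frequencies is to compare the relative frequency of each label across splits. The sketch below does only that check; it does not implement the iterative stratification algorithm itself. The helper name `label_frequencies` and the cell values ("training", "test", "BI-RADS 1", "BI-RADS 2") are illustrative assumptions about how the annotation files encode these fields.

```python
from collections import Counter, defaultdict


def label_frequencies(rows, label_key, split_key="split"):
    """Return {split: {label: relative frequency}} for a list of row dicts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[split_key]][row[label_key]] += 1
    return {
        split: {label: n / sum(counter.values()) for label, n in counter.items()}
        for split, counter in counts.items()
    }


# Toy rows with assumed cell values: 10 training rows and 5 test rows.
rows = (
    [{"split": "training", "breast_birads": "BI-RADS 1"}] * 8
    + [{"split": "training", "breast_birads": "BI-RADS 2"}] * 2
    + [{"split": "test", "breast_birads": "BI-RADS 1"}] * 4
    + [{"split": "test", "breast_birads": "BI-RADS 2"}] * 1
)
freqs = label_frequencies(rows, "breast_birads")

# Largest gap between training and test frequency over all labels.
max_gap = max(
    abs(freqs["training"][label] - freqs["test"][label])
    for label in freqs["training"]
)
```

A well-stratified split should give a small `max_gap` for each stratified attribute.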
The project directory contains the annotation files breast-level_annotations.csv and finding_annotations.csv, the metadata file metadata.csv, and a subfolder images that contains the DICOM files.
images: contains 5,000 subdirectories corresponding to 5,000 exams in the dataset, where folder name is the hashed study identifier of the exam. Each folder has four DICOM files for two standard views of each breast. The path to each image file is
breast-level_annotations.csv: Each row corresponds to an image and provides the BI-RADS assessment of the breast depicted by the image along with some metadata of the image. The attributes in each row are:
study_id: The encoded study identifier.
series_id: The encoded series identifier.
laterality: Laterality of the breast depicted in the image. Either
view_position: Orientation with respect to the breast of the image. Standard views are
height: Height of the image.
width: Width of the image.
breast_birads: BI-RADS assessment of the breast that the image depicts.
breast_density: Density category of the breast that the image depicts.
split: indicating the split to which the image belongs, either
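A minimal sketch of reading breast-level_annotations.csv and grouping its rows by exam is shown below, using only the column names listed above. The sample rows and their values (identifiers, laterality and view codes, image sizes, label strings) are made-up placeholders, and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Placeholder CSV text with the documented column names and invented values.
SAMPLE = "\n".join([
    "study_id,series_id,laterality,view_position,height,width,"
    "breast_birads,breast_density,split",
    "s1,a,L,CC,3518,2800,BI-RADS 1,DENSITY C,training",
    "s1,b,L,MLO,3518,2800,BI-RADS 1,DENSITY C,training",
    "s1,c,R,CC,3518,2800,BI-RADS 2,DENSITY C,training",
    "s1,d,R,MLO,3518,2800,BI-RADS 2,DENSITY C,training",
    "s2,e,L,CC,3518,2800,BI-RADS 1,DENSITY B,test",
    "s2,f,L,MLO,3518,2800,BI-RADS 1,DENSITY B,test",
    "s2,g,R,CC,3518,2800,BI-RADS 1,DENSITY B,test",
])


def views_per_study(csv_text):
    """Map each study_id to the list of (laterality, view_position) pairs it has."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["study_id"]].append((row["laterality"], row["view_position"]))
    return dict(groups)


def incomplete_studies(groups, expected=4):
    """Return study ids that do not have the expected number of distinct views."""
    return sorted(s for s, views in groups.items() if len(set(views)) != expected)


groups = views_per_study(SAMPLE)
missing = incomplete_studies(groups)
```

For the real file, the same functions would be applied to the text read from breast-level_annotations.csv; since each exam is described as having four images, `incomplete_studies` would be expected to return an empty list.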
finding_annotations.csv: Each row represents an annotation of a breast abnormality in an image. Metadata for each finding annotation includes image's metadata, namely
split, and annotation's metadata:
finding_categories: List of finding categories attached to the marked region. For example, a mass with skin retraction would be represented as ["Mass", "Skin Retraction"].
finding_birads: BI-RADS assessment of the marked finding.
xmin: Left boundary of the box.
ymin: Top boundary of the box.
xmax: Right boundary of the box.
ymax: Bottom boundary of the box.
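The annotation fields above can be parsed as in the following sketch. It assumes that the finding_categories cell holds a list literal as text (as in the example above) and that box coordinates are pixel values with the origin at the top-left of the image, which is consistent with ymin being the top boundary. The function name `parse_finding` and the sample values are hypothetical.

```python
import ast


def parse_finding(row):
    """Convert one finding_annotations.csv row (a dict of strings) to typed values."""
    # literal_eval safely parses a textual list such as '["Mass", "Skin Retraction"]'.
    categories = ast.literal_eval(row["finding_categories"])
    xmin, ymin, xmax, ymax = (
        float(row[key]) for key in ("xmin", "ymin", "xmax", "ymax")
    )
    return {
        "categories": categories,
        "birads": row["finding_birads"],
        "box": (xmin, ymin, xmax, ymax),
        "width": xmax - xmin,    # horizontal extent of the box
        "height": ymax - ymin,   # vertical extent of the box
    }


# Invented example row; only the column names come from the dataset description.
row = {
    "finding_categories": '["Mass", "Skin Retraction"]',
    "finding_birads": "BI-RADS 4",
    "xmin": "100.0",
    "ymin": "200.0",
    "xmax": "350.0",
    "ymax": "500.0",
}
finding = parse_finding(row)
```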
metadata.csv: This file contains some information provided by DICOM tags that might be relevant for prospective research, namely the patient's age, imaging device's model and manufacturer.
The folder structure of the dataset is as follows:
├── metadata.csv
├── breast-level_annotations.csv
├── finding_annotations.csv
└── images
    ├── 0025a5dc99fd5c742026f0b2b030d3e9
    │   ├── 2ddfad7286c2b016931ceccd1e2c7bbc.dicom
    │   ├── 451562831387e2822923204cf8f0873e.dicom
    │   ├── 47c8858666bcce92bcbd57974b5ce522.dicom
    │   └── fcf12c2803ba8dc564bf1287c0c97d9a.dicom
    ├── ...
    └── fff2339ea4b5d2f1792672ba7d52b318
        ├── 5144bf29398269fa2cf8c36b9c6db7f3.dicom
        ├── e4199214f5b40bd40847f5c2aedc44ef.dicom
        ├── e9b6ffe97a3b4b763cf94c9982254beb.dicom
        └── f1b6aa1cc6246c2760b882243657212e.dicom
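Given the layout shown above, the DICOM files of each exam can be listed with the standard library alone. The sketch below builds a tiny temporary tree to stay self-contained; the helper name `dicom_files_by_study` and the toy folder and file names are hypothetical.

```python
import tempfile
from pathlib import Path


def dicom_files_by_study(dataset_root):
    """Map each exam folder under <root>/images to its sorted .dicom file names."""
    images_dir = Path(dataset_root) / "images"
    return {
        study_dir.name: sorted(p.name for p in study_dir.glob("*.dicom"))
        for study_dir in sorted(images_dir.iterdir())
        if study_dir.is_dir()
    }


# Build a toy tree with one exam folder holding four empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    study = Path(tmp) / "images" / "study_a"
    study.mkdir(parents=True)
    for name in ("v1", "v2", "v3", "v4"):
        (study / f"{name}.dicom").touch()
    listing = dicom_files_by_study(tmp)

# Exams whose folder does not contain exactly four images.
unexpected = [s for s, files in listing.items() if len(files) != 4]
```

Reading the pixel data itself requires a DICOM library, which is outside the scope of this sketch.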
The VinDr-Mammo dataset is a large-scale full-field digital mammography dataset that can be used to develop and evaluate algorithms for cancer assessment and breast density estimation following the Breast Imaging Reporting and Data System (BI-RADS). In addition, the dataset can be used for tasks in medical imaging and computer vision in general. Regarding the training and test splits, different splits can be used as desired, since the whole dataset was created via a single procedure.
One limitation of this project is that some abnormalities, namely skin retraction and nipple retraction, have fewer than 40 samples, which could make studies of these abnormalities based on this dataset unreliable.
This is the first public release (v1.0) of the VinDr-Mammo dataset.
The authors declare no ethics concerns. The Institutional Review Board of Hanoi Medical University Hospital (HMUH) and Hospital 108 (H108) approved the release of the de-identified data.
The authors would like to acknowledge the Hanoi Medical University Hospital and Hospital 108 for providing us access to their image databases and for agreeing to make the dataset publicly available. We are especially thankful to our team of radiologists, Nhung Hong Luu, Minh Thi Ngoc Nguyen, and Huong Thu Lai, who participated in the data collection and labeling process.
Conflicts of Interest
VinBigData JSC supported the creation of this resource. Hieu Trung Nguyen and Ha Quy Nguyen are currently employed by VinBigData. VinBigData JSC did not profit from the work done in this project.
1. Sung, H. et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians 71, 209–249 (2021).
2. Mandelblatt, J. S. et al. Collaborative modeling of the benefits and harms associated with different US breast cancer screening strategies. Annals internal medicine 164, 215–225 (2016).
3. Siu, A. L. Screening for breast cancer: US Preventive Services Task Force recommendation statement. Annals internal medicine 164, 279–296 (2016).
4. Lehman, C. D. et al. National performance benchmarks for modern screening digital mammography: update from the Breast Cancer Surveillance Consortium. Radiology 283, 49–58 (2017).
5. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
6. Dembrower, K. et al. Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload: a retrospective simulation study. The Lancet Digit. Heal. 2, e468–e474 (2020).
7. Rodriguez-Ruiz, A. et al. Stand-alone artificial intelligence for breast cancer detection in mammography: comparison with 101 radiologists. JNCI: J. Natl. Cancer Inst. 111, 916–922 (2019).
8. Rodríguez-Ruiz, A. et al. Detection of breast cancer with mammography: effect of an artificial intelligence support system. Radiology 290, 305–314 (2019).
9. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
10. Bowyer, K. et al. The digital database for screening mammography. In Third international workshop on digital mammography, vol. 58, 27 (1996).
11. Suckling, J. et al. The mammographic image analysis society digital mammogram database. Digit. Mammo 375–386 (1994).
12. Moreira, I. C. et al. INbreast: toward a full-field digital mammographic database. Acad. radiology 19, 236–248 (2012).
13. Sun, C., Shrivastava, A., Singh, S. & Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, 843–852 (2017).
14. Sickles, E. A. et al. ACR BI-RADS® Mammography (American College of Radiology, 2013), fifth edn.
15. Amendoeira, I. et al. European guidelines for quality assurance in breast cancer screening and diagnosis (European Commission, 2013).
16. Nguyen, N. T. et al. VinDr Lab: A Data Platform for Medical AI. https://github.com/vinbigdata-medical/vindr-lab (2021).
17. Sechidis, K., Tsoumakas, G. & Vlahavas, I. On the stratification of multi-label data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 145–158 (Springer, 2011).
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
- sign the data use agreement for the project