Database Credentialed Access

Radiology Report Expert Evaluation (ReXVal) Dataset

Feiyang Yu Mark Endo Rayan Krishnan Ian Pan Andy Tsai Eduardo Pontes Reis Eduardo Kaiser Ururahy Nunes Fonseca Henrique Lee Zahra Shakeri Andrew Ng Curtis Langlotz Vasantha Kumar Venugopal Pranav Rajpurkar

Published: June 20, 2023. Version: 1.0.0

When using this resource, please cite:
Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E. P., Kaiser Ururahy Nunes Fonseca, E., Lee, H., Shakeri, Z., Ng, A., Langlotz, C., Venugopal, V. K., & Rajpurkar, P. (2023). Radiology Report Expert Evaluation (ReXVal) Dataset (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E. P., ... & Rajpurkar, P. (2022). Evaluating Progress in Automatic Chest X-Ray Radiology Report Generation. medRxiv, 2022-08.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


The Radiology Report Expert Evaluation (ReXVal) Dataset is a publicly available dataset of radiologist evaluations of errors in automatically generated radiology reports. The dataset contains annotations from 6 board-certified radiologists on clinically significant and clinically insignificant errors under 6 error categories for candidate radiology reports with respect to ground-truth reports from the MIMIC-CXR dataset. There are 4 candidate reports for each of 50 studies, yielding 200 pairs of candidate and ground-truth reports on which radiologists provided annotations. The dataset has been used to evaluate the alignment between the scoring of automated metrics and that of radiologists, investigate the failure modes of automated metrics, and build a composite automated metric, in a study on how to meaningfully measure progress in radiology report generation. It was also created to support additional medical AI research in radiology and other expert tasks.


Artificial Intelligence (AI) has been making great strides in tasks that require expert knowledge, such as playing Go [1-4], writing code [5-6], and driving vehicles [7-8]. In the medical domain, AI has reached similar exciting milestones [9], including the effective prediction of 3D protein structures [10-11]. Enabled by rapidly evolving imaging and computer vision technologies, AI has also made formidable progress on image interpretation tasks, including chest X-ray interpretation. However, the application of AI to image interpretation tasks has often been limited to the identification of a handful of individual pathologies [12-14], representing an over-simplification of the image interpretation task. In contrast, the generation of complete narrative radiology reports [15-20] moves past that simplification and is consistent with how radiologists communicate diagnostic information: the narrative report allows for highly diverse and nuanced findings, including association of findings with anatomic location, and expressions of uncertainty. Although the generation of radiology reports in their full complexity would signify a tremendous achievement for AI, the task remains far from solved. Our work aims to tackle one of the most important bottlenecks for progress: the limited ability to meaningfully measure progress on the report generation task.

Automatically measuring the quality of generated radiology reports is challenging. Most prior works have relied on a set of metrics inspired by similar setups in natural language generation, where radiology report text is treated as generic text [21]. However, unlike generic text, radiology reports involve complex, domain-specific knowledge and critically depend on factual correctness. Even metrics that were designed to evaluate the correctness of radiology information by capturing domain-specific concepts do not align with radiologists [22]. Therefore, improvement on existing metrics may not produce clinically meaningful progress or indicate the direction for further progress. This fundamental bottleneck hinders understanding of the quality of report generation methods, thereby impeding work toward improving existing methods. We seek to remove this bottleneck by developing meaningful measures of progress in radiology report generation. Doing so is imperative to understanding which metrics can guide us towards generating reports that are clinically indistinguishable from those generated by radiologists.

In [23], we quantitatively examine the correlation between automated metrics and the scoring of reports by radiologists using candidate radiology reports and ground-truth reports from the MIMIC-CXR dataset [24-26]. Collecting this dataset of radiologist annotations, along with scores for 4 automated metrics: BLEU, BERTScore, CheXbert vector similarity (s_emb) and RadGraph entity and relation F1, we investigate the alignment between automated metrics and expert evaluation to understand how to meaningfully measure progress in radiology report generation. We also identify specific failure modes of automated metrics and build the composite metric “RadCliQ” (Radiology Report Clinical Quality) from a linear combination of automated metrics.

Here we make use of two natural language generation metrics: BLEU and BERTScore. The BLEU scores are computed as BLEU-2 (bigram) scores. BERTScore uses the contextual embeddings from a BERT model to compute the similarity of two text sequences. CheXbert vector similarity and RadGraph F1 are designed to capture clinical information in radiology reports. Since radiology reports are a special form of structured text that communicates diagnostic information, their quality depends highly on the correctness of clinical objects and descriptions, which is not a focus of traditional natural language metrics. To address this gap, the CheXbert labeler (an improvement on the CheXpert labeler) [12-13] and RadGraph [27] were developed to parse radiology reports. CheXbert vector similarity is the cosine similarity between the CheXbert model embeddings of the generated report and the test report. RadGraph is an approach for parsing radiology reports into knowledge graphs containing entities (nodes) and relations (edges), which can capture radiology concept dependencies and semantic meaning. We proposed a novel metric, the RadGraph entity and relation F1 score, which measures the overlap in parsed RadGraph graph structures. Computation of the automated metrics is detailed in [23].
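The two clinically oriented metrics above rest on simple building blocks, which can be sketched as follows. This is only an illustration under stated assumptions: the embeddings and the parsed entity/relation items are assumed to have been extracted already (the sketch does not run CheXbert or RadGraph), the function names `cosine_similarity` and `set_f1` are hypothetical, and the exact matching rules of the RadGraph F1 metric are those detailed in [23], not necessarily the plain set overlap used here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def set_f1(candidate_items, reference_items):
    """F1 overlap between two collections of hashable items.

    Items stand in for parsed entities or relations; how they are
    represented and matched is an assumption of this sketch.
    """
    candidate_items = set(candidate_items)
    reference_items = set(reference_items)
    true_positives = len(candidate_items & reference_items)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(candidate_items)
    recall = true_positives / len(reference_items)
    return 2 * precision * recall / (precision + recall)
```

For instance, `set_f1` applied to a candidate and a reference that each contain two items and share one of them gives a precision and recall of 0.5 each, hence an F1 of 0.5.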



We randomly selected a subset of 50 studies from the test set of the MIMIC-CXR dataset. For each study, we collected 5 radiology reports: one ground-truth report for the study and four reasonably accurate generated reports. We collected the four generated reports by selecting, from the MIMIC-CXR train set, candidate reports that score highly against the ground-truth report according to each of the automated metrics. The quality of the generated reports ensures that radiologists can identify and enumerate distinct errors in the reports relative to the ground-truth. In total, there are 200 pairs of generated and ground-truth reports.
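The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the procedure used to build the dataset: `retrieve_candidates` is a hypothetical helper, it assumes each metric is a function `score(candidate, reference)` where higher is better, and `token_jaccard` is a toy stand-in for the real metrics (BLEU, BERTScore, s_emb, RadGraph F1), which are not implemented here.

```python
def token_jaccard(candidate, reference):
    """Toy stand-in metric: Jaccard overlap of lowercased whitespace tokens."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_candidates(ground_truth, train_reports, metrics):
    """For each named metric, pick the train report scoring highest against ground_truth.

    `metrics` maps a metric name to a scoring function (an assumption of this sketch).
    """
    return {
        name: max(train_reports, key=lambda report: score(report, ground_truth))
        for name, score in metrics.items()
    }


# Usage with made-up report snippets (not taken from MIMIC-CXR).
train = [
    "no acute cardiopulmonary process",
    "large left pleural effusion",
    "small left pleural effusion with atelectasis",
]
picked = retrieve_candidates("small left pleural effusion", train, {"toy": token_jaccard})
```

With one scoring function per automated metric, the same call would return one candidate per metric, mirroring the four candidate reports per study.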


Six board-certified radiologists provided annotations on the pairs of generated and ground-truth reports. The radiologists were given instructions (available in the Supplementary of [23]) prior to starting the evaluation. They were blinded to the source of the reports.


For each study, the radiologists were presented with the ground-truth report at the top of the survey page followed by four generated reports shuffled in random order. The radiologists annotated the number of clinically significant errors and the number of clinically insignificant errors under 6 error categories:

  1. False prediction of finding
  2. Omission of finding
  3. Incorrect location/position of finding
  4. Incorrect severity of finding
  5. Mention of comparison that is not present in the reference impression
  6. Omission of comparison describing a change from a previous study

Data Description

The dataset contains 6 radiologists' evaluation of clinically significant and clinically insignificant errors under 6 error categories for generated radiology reports against ground-truth reports from the MIMIC-CXR dataset. For each ground-truth report, there are four candidate reports that correspond to four automated metrics: BLEU, BERTScore, CheXbert vector similarity (s_emb) and RadGraph entity and relation F1.

The dataset is organized as follows:

  1. 50_samples_gt_and_candidates.csv: Each row corresponds to a study with study ID as specified in the "study_id" column, containing one ground-truth report (column "gt_report") and four candidate reports (columns "bleu"/"bertscore"/"s_emb"/"radgraph"). The column names for the candidate reports specify the metric with respect to which each report was retrieved.
  2. 6_valid_raters_per_rater_error_categories.csv: Each row corresponds to an annotation provided by one radiologist on one candidate report for one error category. Column "study_number" specifies the row index of a study in 50_samples_gt_and_candidates.csv, ranging from 0 to 49. Column "candidate_type" specifies the candidate report the annotation applies to and takes one of four values: "bleu"/"bertscore"/"s_emb"/"radgraph". Column "error_category" is one of the 6 error categories defined in the Methods section, ranging from 1 to 6. Column "rater_index" is the radiologist index, ranging from 0 to 5; it uniquely identifies a radiologist, and all annotations provided by the same radiologist carry the same index. Column "clinically_significant" specifies whether the error count is for clinically significant errors ("TRUE") or clinically insignificant errors ("FALSE"). Column "num_errors" is the number of errors determined by the radiologist.
  3. data_analysis.ipynb: Sample analysis notebook for the dataset. The notebook contains code for loading the dataset, computing the mean number of errors over radiologists, handling clinically significant and clinically insignificant errors, visualizing error distributions, and fetching report texts. It can be used as a starting point for analysis of the dataset.
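As a rough illustration of how the two CSV files fit together, the sketch below loads them with the Python standard library, totals each radiologist's error counts per report, and averages over radiologists. It is not the code from data_analysis.ipynb. It assumes the files have a header row with the column names listed above, that "clinically_significant" is stored as the text "TRUE" or "FALSE", and that "num_errors" parses as a number; the helper names are hypothetical.

```python
import csv
from collections import defaultdict

# Error categories as listed in this description (codes 1 to 6).
ERROR_CATEGORIES = {
    1: "False prediction of finding",
    2: "Omission of finding",
    3: "Incorrect location/position of finding",
    4: "Incorrect severity of finding",
    5: "Mention of comparison that is not present in the reference impression",
    6: "Omission of comparison describing a change from a previous study",
}


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def mean_errors_per_report(annotations, significant=True):
    """Return {(study_number, candidate_type): mean total errors per radiologist}."""
    # Sum error counts over categories for each (report, radiologist) pair.
    totals = defaultdict(float)
    for row in annotations:
        flag = str(row["clinically_significant"]).strip().upper() == "TRUE"
        if flag != significant:
            continue
        key = (int(row["study_number"]), row["candidate_type"], int(row["rater_index"]))
        totals[key] += float(row["num_errors"])
    # Average the per-radiologist totals for each report.
    per_report = defaultdict(list)
    for (study, candidate, _rater), total in totals.items():
        per_report[(study, candidate)].append(total)
    return {key: sum(values) / len(values) for key, values in per_report.items()}


def report_pair(samples, study_number, candidate_type):
    """Return (ground-truth report, candidate report) for one annotation key."""
    row = samples[int(study_number)]  # study_number is a 0-based row index
    return row["gt_report"], row[candidate_type]
```

A typical call would be `annotations = load_rows("6_valid_raters_per_rater_error_categories.csv")` followed by `mean_errors_per_report(annotations, significant=True)`. Note that the mean is taken over the radiologists who have at least one matching row for a report, which equals the mean over all 6 radiologists only if every radiologist has rows for every report.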

Usage Notes

This dataset has been used in [23] to investigate the alignment of automated metrics to radiologists on scoring generated radiology reports and the failure modes of different automated metrics, and to build the composite metric RadCliQ. The dataset can be used for a broad range of medical AI research in radiology and other expert medical tasks, particularly ones with complex text.

Note that the candidate reports were generated by retrieving reports from the training set of MIMIC-CXR that achieve the highest metric score with the ground-truth report with respect to the four automated metrics. This design offers two primary advantages for the study in [23]: (1) the candidate reports are sufficiently accurate for radiologists to pinpoint specific errors and not be bogged down by reports that aren’t remotely similar to the test reports; (2) the candidate reports allow us to analyze where certain metrics fail since the reports are the hypothetical top retrievals. A potential limitation that follows is that the dataset concentrates on radiologist annotations of relatively high quality reports. Therefore, the dataset may not be as representative of generated reports that differ significantly from ground-truth reports.

Usage of this dataset involves the usage of the MIMIC-CXR dataset.

Code for computing the four automated metrics and the composite metric RadCliQ on pairs of ground-truth reports and candidate reports is available at [28].

Release Notes

This is the first public release of the dataset.


This dataset was derived from the MIMIC-CXR Dataset. It falls under the same IRB as MIMIC-CXR.

An IRB was not required for the study in which the dataset was collected, because 6 co-investigators of the study constituted the board certified radiologists who contributed annotations to the dataset.


We thank M.A. Endo MD for helpful review and feedback on the radiologist evaluation survey design and the manuscript of the associated paper [23]. Support for this work was provided in part by the Medical Imaging Data Resource Center (MIDRC) under contracts 75N92020C00008 and 75N92020C00021 from the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health.

Conflicts of Interest

The Authors declare no Competing Non-Financial Interests but the following Competing Financial Interests:

I.P. is a consultant for and Diagnosticos da America (Dasa). 

C.P.L. serves on the board of directors and is a shareholder of Bunkerhill Health. He is an advisor and option holder for GalileoCDS, Sirona Medical, Adra, and Kheiron. He is an advisor to Sixth Street and an option holder in His research program has received grant or gift support from Carestream, Clairity, GE Healthcare, Google Cloud, IBM, IDEXX, Hospital Israelita Albert Einstein, Kheiron, Lambda, Lunit, Microsoft, Nightingale Open Science, Nines, Philips, Subtle Medical, VinBrain,, the Paustenbach Fund, the Lowenstein Foundation, and the Gordon and Betty Moore Foundation.


  1. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S. Mastering the game of Go with deep neural networks and tree search. Nature. 2016 Jan;529(7587):484-9.
  2. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y. Mastering the game of Go without human knowledge. Nature. 2017 Oct;550(7676):354-9.
  3. Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T, Lillicrap T. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science. 2018 Dec 7;362(6419):1140-4.
  4. Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, Lockhart E, Hassabis D, Graepel T, Lillicrap T. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature. 2020 Dec 24;588(7839):604-9.
  5. Li Y, Choi D, Chung J, Kushman N, Schrittwieser J, Leblond R, Eccles T, Keeling J, Gimeno F, Dal Lago A, Hubert T. Competition-level code generation with AlphaCode. Science. 2022 Dec 9;378(6624):1092-7.
  6. Chen M, Tworek J, Jun H, Yuan Q, Pinto HP, Kaplan J, Edwards H, Burda Y, Joseph N, Brockman G, Ray A. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. 2021 Jul 7.
  7. Bojarski M, Del Testa D, Dworakowski D, Firner B, Flepp B, Goyal P, Jackel LD, Monfort M, Muller U, Zhang J, Zhang X. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. 2016 Apr 25.
  8. Fridman L, Brown DE, Glazer M, Angell W, Dodd S, Jenik B, Terwilliger J, Patsekin A, Kindelsberger J, Ding L, Seaman S. MIT advanced vehicle technology study: Large-scale naturalistic driving study of driver behavior and interaction with automation. IEEE Access. 2019 Jul 1;7:102021-38.
  9. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nature Medicine. 2022 Jan;28(1):31-8.
  10. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Žídek A, Nelson AW, Bridgland A, Penedones H. Improved protein structure prediction using potentials from deep learning. Nature. 2020 Jan 30;577(7792):706-10.
  11. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug 26;596(7873):583-9.
  12. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, Marklund H, Haghgoo B, Ball R, Shpanskaya K, Seekins J. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence 2019 Jul 17 (Vol. 33, No. 01, pp. 590-597).
  13. Smit A, Jain S, Rajpurkar P, Pareek A, Ng AY, Lungren MP. CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv preprint arXiv:2004.09167. 2020 Apr 20.
  14. Pino P, Parra D, Besa C, Lagos C. Clinically correct report generation from chest X-rays using templates. In Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings 12 2021 (pp. 654-663). Springer International Publishing.
  15. Miura Y, Zhang Y, Tsai EB, Langlotz CP, Jurafsky D. Improving factual completeness and consistency of image-to-text radiology report generation. arXiv preprint arXiv:2010.10042. 2020 Oct 20.
  16. Chen Z, Song Y, Chang TH, Wan X. Generating radiology reports via memory-driven transformer. arXiv preprint arXiv:2010.16056. 2020 Oct 30.
  17. Endo M, Krishnan R, Krishna V, Ng AY, Rajpurkar P. Retrieval-based chest X-ray report generation using a pre-trained contrastive language-image model. In Machine Learning for Health 2021 Nov 28 (pp. 209-219). PMLR.
  18. Yan A, He Z, Lu X, Du J, Chang E, Gentili A, McAuley J, Hsu CN. Weakly supervised contrastive learning for chest x-ray report generation. arXiv preprint arXiv:2109.12242. 2021 Sep 25.
  19. Nicolson A, Dowling J, Koopman B. Improving chest X-Ray report generation by leveraging warm-starting. arXiv preprint arXiv:2201.09405. 2022 Jan 24.
  20. Zhou HY, Chen X, Zhang Y, Luo R, Wang L, Yu Y. Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nature Machine Intelligence. 2022 Jan;4(1):32-40.
  21. Hossain MZ, Sohel F, Shiratuddin MF, Laga H. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR). 2019 Feb 4;51(6):1-36.
  22. Boag W, Kané H, Rawat S, Wei J, Goehler A. A Pilot Study in Surveying Clinical Judgments to Evaluate Radiology Report Generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency 2021 Mar 3 (pp. 458-465).
  23. Yu F, Endo M, Krishnan R, Pan I, Tsai A, Reis EP, Fonseca EK, Lee HM, Abad ZS, Ng AY, Langlotz CP. Evaluating Progress in Automatic Chest X-Ray Radiology Report Generation. medRxiv. 2022:2022-08.
  24. Johnson A, Pollard T, Mark R, Berkowitz S, Horng S. MIMIC-CXR Database (version 2.0.0). PhysioNet.
  25. Johnson AE, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng CY, Mark RG, Horng S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data. 2019 Dec 12;6(1):317.
  26. Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, Mietus JE, Moody GB, Peng CK, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000 Jun 13;101(23):e215-20.
  27. Jain S, Agrawal A, Saporta A, Truong SQ, Duong DN, Bui T, Chambon P, Zhang Y, Lungren MP, Ng AY, Langlotz CP. RadGraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463. 2021 Jun 28.
  28. Accessed on 6-17-2023.

Parent Projects
Radiology Report Expert Evaluation (ReXVal) Dataset was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research


DOI (version 1.0.0):

DOI (latest version):
