Database Credentialed Access

Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

Jong Hak Moon, Geon Choi, Paloma Rabaey, Min Gwam Kim, Hyuk Gi Hong, Jung Oh Lee, Hangyul Yoon, Eunwoo Doe, Jiyoun Kim, Harshita Sharma, Daniel Coelho de Castro, Javier Alvarez Valle, Edward Choi

Published: Jan. 11, 2026. Version: 1.0.0


When using this resource, please cite:
Moon, J. H., Choi, G., Rabaey, P., Kim, M. G., Hong, H. G., Lee, J. O., Yoon, H., Doe, E., Kim, J., Sharma, H., Coelho de Castro, D., Alvarez Valle, J., & Choi, E. (2026). Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/pk42-4v91

Additionally, please cite the original publication:

Moon, Jong Hak, et al. "Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation." arXiv preprint arXiv:2505.21190 (2025).

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset of structured radiology reports that serves as a gold standard for evaluating structured report frameworks. It is designed to support comprehensive assessment of both single-report interpretation and longitudinal reasoning. Constructed from a subset of the MIMIC-CXR test set, LUNGUAGE comprises 1,473 chest X-ray reports from 230 patients, annotated with over 17,000 expert-verified entities and 23,000 relation–attribute pairs across 18 relation types. An additional subset of 80 sequential reports from 10 patients captures disease progression across 3 to 14 studies per patient, covering time intervals from 1 to 1,200 days. These are annotated with over 41,000 pairwise comparisons, organized into semantically and temporally coherent groups. The dataset also includes a schema-aligned vocabulary covering diagnostic entities and attributes. All annotations were conducted and verified by board-certified radiologists, resulting in a clinically grounded resource for structured understanding and temporal reasoning in radiology.


Background

Chest radiograph reports are the primary diagnostic outputs derived from interpreting chest X-ray images, encapsulating image-based findings, temporal comparisons, and clinical reasoning. These reports inform key clinical decisions by describing current abnormalities, referencing prior studies, and contextualizing observations within the patient’s broader clinical context. However, as they are typically composed as unstructured free text, they exhibit considerable variability in terminology, level of detail, and organizational style across radiologists. This linguistic heterogeneity poses challenges for consistent computational interpretation and limits the reliability of automated systems in both training and evaluation settings.

To address these challenges, structured report benchmark datasets have been introduced to standardize the representation of radiologic content by extracting discrete clinical entities and attributes from free-text [1–6]. Such structured formats reduce linguistic variability, allow systematic and scalable evaluation of report generation models, and preserve clinically important details necessary for applications like decision support, patient cohort identification, and tracking disease progression over time.

Despite these advancements, most existing benchmarks [1–6] remain limited in two key aspects. First, they assess reports independently, overlooking the temporal comparisons critical to radiologic interpretation. Expressions such as “new consolidation” or “no change in pleural effusion” cannot be validated without reference to earlier studies. Second, they often reduce rich radiologic content to coarse categories, failing to capture fine-grained attributes such as lesion size, precise location, or morphological descriptors—elements crucial for clinical accuracy and treatment planning.

To address these limitations, we introduce LUNGUAGE, a benchmark dataset for fine-grained and temporally aware interpretation of chest radiograph reports. LUNGUAGE provides a clinically grounded resource for evaluating models that extract, generate, or reason over structured radiology reports. By capturing detailed diagnostic attributes and longitudinal patterns, it supports the development and assessment of systems capable of nuanced clinical understanding and temporal reasoning.


Methods

We propose two complementary annotation schemas for structured understanding of radiology reports: a single-report schema capturing fine-grained interpretation within individual reports, and a sequential schema modeling patient-level diagnostic trajectories across time. Both schemas were refined with four board-certified radiologists to ensure clinical validity.

1. Data Source

LUNGUAGE aims to support patient-level evaluation of chest X-ray reports by modeling longitudinal diagnostic scenarios. To this end, we curated a benchmark dataset from the official test split of MIMIC-CXR [8], including all 1,473 reports corresponding to 230 patients. We followed the official MIMIC-CXR preprocessing protocol to extract structured text from each report. Specifically, we parsed the history (including “Indication”), findings, and impression sections. The history/indication field provides contextual information relevant to diagnostic reasoning, such as presenting symptoms (e.g., “fever,” “fatigue,” “cough”) or evaluation intents (e.g., “rule out pneumonia”). In contrast, the findings and impression sections describe image-based observations and interpretations.
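As an illustration, a minimal Python sketch of this kind of section extraction is shown below. The regular expression and header names are simplifying assumptions for exposition; the official MIMIC-CXR preprocessing code handles many more header variants.

import re

# Hypothetical, simplified section splitter for a raw MIMIC-CXR report.
SECTION_RE = re.compile(r"^(INDICATION|HISTORY|FINDINGS|IMPRESSION):", re.MULTILINE)

def split_sections(report_text: str) -> dict:
    """Return {section_name: section_text} for the headers matched above."""
    matches = list(SECTION_RE.finditer(report_text))
    sections = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report_text)
        sections[m.group(1).lower()] = report_text[start:end].strip()
    return sections

example = "INDICATION: Fever, cough. Rule out pneumonia.\nFINDINGS: No focal consolidation.\nIMPRESSION: No acute process."
print(split_sections(example))
# {'indication': 'Fever, cough. Rule out pneumonia.', 'findings': 'No focal consolidation.', 'impression': 'No acute process.'}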

2. Single Structured Report: Schema

We propose a schema that captures the internal structure of single reports by extracting clinically relevant information as typed entities and relations. It is designed to reflect the typical subsections of radiology reports—indication/history, findings, and impression—and supports relation extraction across sentence boundaries within each section. Notably, the indication/history section is included to preserve contextual information that influences diagnostic interpretation at the patient trajectory level.

Under this schema, each radiology report is represented as a structured collection of (entity, relation, attribute) triplets. The schema is designed to encode the diagnostic content of reports in a form that supports structured analysis, longitudinal reasoning, and machine-readable interpretation. It captures both observable features from chest X-ray (CXR) images and additional contextual elements embedded in clinical narratives.
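For instance, the sentence “There is a new moderate right pleural effusion.” could be encoded roughly as follows. This is a hand-written Python illustration of the schema, not an actual dataset record:

# Illustrative (entity, relation, attribute) triplets for the sentence
# "There is a new moderate right pleural effusion."
triplets = [
    ("pleural effusion", "Location", "right"),
    ("pleural effusion", "Severity", "moderate"),
    ("pleural effusion", "Onset",    "new"),
]

# The entity itself additionally carries category and diagnostic labels
# (entity types and attributes are defined in the subsections below):
entity = {"ent": "pleural effusion", "cat": "PF",
          "dx_status": "positive", "dx_certainty": "definitive"}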

2-1. Entity Types

Entities represent clinically meaningful units such as findings, diagnoses, objects, or background context. Each entity is assigned one of six mutually exclusive Cat (category) labels, depending on whether it originates from the CXR image or external clinical sources.

Chest X-ray Findings are entities that can be directly visualized on the chest X-ray or inferred through image-based interpretation, possibly with minimal supporting context. These form the core of radiologic description and are divided into the following types:

  • PF (Perceptual Findings): Visual features that are explicitly visible in the image and correspond to anatomical or pathological structures (e.g., “opacity”, “pleural effusion”, “pneumothorax”). These are the most direct and objective form of image evidence.
  • CF (Contextual Findings): Diagnoses that require interpretation of visual findings in light of limited contextual knowledge (e.g., “pneumonia”, “congestive heart failure”). These may involve reasoning beyond the image but still rely primarily on radiographic evidence.
  • OTH (Other Objects): Non-anatomic elements such as medical devices, surgical hardware, or foreign materials visible on the image (e.g., “endotracheal tube”, “central venous catheter”, “foreign body”). These often require placement verification or complication monitoring.

Non-Chest X-ray Findings are entities that cannot be determined from the image alone and must be inferred from patient history, clinical documentation, or other diagnostic modalities:

  • COF (Clinical Objective Findings): Structured clinical measurements or physical findings derived from sources such as laboratory tests or vital signs (e.g., “elevated white cell count”, “low oxygen saturation”). These provide objective support for contextual interpretation.
  • NCD (Non-CXR Diagnosis): Diagnoses that originate from non-CXR modalities (e.g., CT, MRI, serology) and are either mentioned for completeness or used to explain findings (e.g., “stroke”, “AIDS”).
  • PATIENT INFO: Historical or subjective patient information, such as symptoms or clinical background, that contributes to interpretation (e.g., “fever”, “history of malignancy”, “recent trauma”).

Each entity is additionally annotated with the following attributes that define its diagnostic interpretation within the report:

  • DxStatus: Indicates whether the entity is considered present or absent in the current study. This label is determined from the report language, taking into account implications of stability or change. For example, “resolved effusion” is annotated as Positive, while “unchanged opacity” is Positive unless the prior state was normal, in which case it is Negative.
  • DxCertainty: Reflects the level of confidence expressed by the radiologist, labeled as either Definitive or Tentative. Typical cues include phrases like “suggests”, “cannot exclude”, or “possibly indicative of”, all of which lead to a Tentative label (illustrated in the sketch below).
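As a toy illustration of how such hedging cues could drive a DxCertainty label, the snippet below flags a sentence as Tentative when it contains one of the phrases listed above. The cue list is only a small sample chosen for exposition; the gold labels in LUNGUAGE were assigned and verified by radiologists, not by any such rule.

# Toy DxCertainty heuristic based on the hedging cues mentioned above.
# The cue list is illustrative and far from exhaustive.
TENTATIVE_CUES = ("suggests", "cannot exclude", "possibly")

def dx_certainty(sentence: str) -> str:
    s = sentence.lower()
    return "tentative" if any(cue in s for cue in TENTATIVE_CUES) else "definitive"

print(dx_certainty("Opacity possibly indicative of pneumonia."))  # tentative
print(dx_certainty("There is a right pleural effusion."))         # definitive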

2-2. Relation Types

Relations describe either attributes of a single entity or clinically relevant links between multiple entities. All relations must be grounded in the report text and can span across sentences within the same section.

1. Diagnostic Reasoning: These relations connect semantically and clinically related entities. They encode the logic behind diagnostic interpretation.

  • Associate: A bidirectional, non-causal relationship between entities that co-occur or are conceptually linked (e.g., “opacity” ↔ “consolidation”). When Evidence is used, a corresponding Associate is also required in the reverse direction.
  • Evidence: A unidirectional relation in which a finding supports a diagnosis (e.g., “pneumonia” → “opacity”).

2. Spatial and Descriptive Attributes: These relations describe intrinsic visual characteristics of an entity as observed within a single chest X-ray image. Unlike temporal attributes, these do not require comparison with prior studies. Instead, they provide descriptive detail that refines the interpretation of a finding or object in terms of location, form, extent, intensity, and symmetry.

  • Location: Specifies the anatomical or spatial position of the entity (e.g., “right upper lobe”, “3 cm above the carina”). An entity may have multiple location labels, annotated as a comma-separated list (e.g., “right upper lobe, suprahilar”). Location applies to both disease findings and device placements (e.g., “fragmentation” of “sternal wires”).
  • Morphology: Describes the shape, form, or structural appearance of the entity (e.g., “nodular”, “linear”, “reticular”, “confluent”). Morphological terms help differentiate types of opacities or identify characteristic patterns of pathology.
  • Distribution: Refers to the anatomical spread or pattern of the entity (e.g., “focal”, “diffuse”, “multifocal”, “bilateral”). This helps characterize whether the finding is localized or widespread, and whether it follows typical anatomical distributions.
  • Measurement: Captures quantitative properties such as size, count, or volume (e.g., “2.5 cm”, “few”, “multiple”). These descriptors are typically numerical or ordinal and assist in severity grading or follow-up comparison.
  • Severity: Reflects the degree of abnormality or clinical impact, often based on radiologic intensity or extent (e.g., “mild”, “moderate”, “severe”, “marked”).
  • Comparison: Indicates asymmetry or difference across anatomical sides or regions within the same image (e.g., “left greater than right”, “right lung appears denser”). This is distinct from temporal comparison and only refers to spatial contrasts visible in the current image.

3. Temporal Change: These relations capture how an entity has changed over time by comparing the current study to previous imaging or known clinical baselines. Temporal attributes are essential for longitudinal interpretation and reflect disease progression, treatment response, or clinical stability. Unlike static descriptors, these attributes require temporal context and often imply clinical decision points.

  • Onset: Indicates the timing or duration of a finding as described in the report (e.g., “acute”, “subacute”, “chronic”, “new”). These descriptors suggest whether a condition has recently appeared or has been long-standing.
  • Improved: Signals that a finding has regressed or resolved compared to a prior state (e.g., “resolved effusion”, “decreased consolidation”). It is typically associated with positive treatment response or natural recovery.
  • Worsened: Indicates that the condition has progressed, increased in extent, or become more severe over time (e.g., “enlarging opacity”, “increased pleural effusion”). This is often associated with disease progression or complications.
  • No Change: Describes a finding that has remained stable since a prior study (e.g., “unchanged opacity”, “persistent nodule”). Although these are annotated as Positive by default, they are marked as Negative if the prior state was normal (i.e., continued absence of disease).
  • Placement: Applies specifically to entities labeled as OTH (devices). It describes both the position (e.g., “in expected position”, “malpositioned”) and temporal actions involving the device (e.g., “inserted”, “withdrawn”, “removed”). This attribute is crucial for monitoring device-related interventions over time.

4. Contextual Information: This category captures auxiliary information that influences the interpretation of findings but is not a primary descriptor of the radiologic appearance. These relations provide critical contextual cues—such as modality constraints, patient factors, or historical references—that support diagnostic interpretation. While not visual in the conventional sense, they are essential for accurately situating radiologic findings within the broader clinical scenario.

  • Past Hx: Refers to the patient’s prior medical or surgical history that contextualizes current findings (e.g., “status post lobectomy”, “known tuberculosis”). These mentions often justify or explain current observations or exclude certain diagnoses.
  • Other Source: Indicates that part of the reported information is derived from modalities other than chest X-ray (e.g., “seen on CT”, “confirmed on MRI”). This distinction is important when findings cannot be visualized directly on the image being interpreted.
  • Assessment Limitations: Describes technical or procedural factors that constrain the radiologist’s ability to interpret the image accurately (e.g., “poor inspiration”, “rotated patient position”, “limited view due to overlying hardware”). These limitations help qualify the certainty or completeness of the report’s conclusions.
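Because every Evidence link must be mirrored by an Associate link in the reverse direction (see Diagnostic Reasoning above), relation extractions can be validated with a simple consistency check. The tuple format below is a hypothetical in-memory representation, not the released file format:

# Hypothetical in-memory relation tuples: (head_entity, relation, tail_entity).
relations = [
    ("pneumonia", "Evidence",  "opacity"),
    ("opacity",   "Associate", "pneumonia"),
]

def evidence_has_reverse_associate(rels) -> bool:
    """Check that each Evidence link has a reverse-direction Associate link."""
    associates = {(h, t) for h, r, t in rels if r == "Associate"}
    return all((t, h) in associates for h, r, t in rels if r == "Evidence")

print(evidence_has_reverse_associate(relations))  # True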

2-3. Single Report Annotation Process

To construct a clinically reliable gold-standard dataset, we implemented a structured annotation pipeline that reviewed and refined the initial triplets generated by GPT-4 (0613). Unlike the vocabulary construction phase—which focused on individual terms without considering report context—this stage involved section-by-section review of all structured outputs in each report to ensure contextual accuracy and logical consistency.

All 1,473 chest X-ray reports in LUNGUAGE were divided evenly among annotators. Each annotator independently reviewed approximately one-quarter of the dataset, ensuring balanced coverage and minimizing reviewer bias across the annotated corpus. Within each report, annotators examined the structured outputs across the history/indication, findings, and impression sections. The goal was to verify whether the extracted (entity, relation, attribute) triplets accurately captured the meaning of the source text and aligned with the predefined schema. This review explicitly included schema elements that require contextual interpretation and cannot be evaluated at the lexical level alone—namely, DXSTATUS, DXCERTAINTY, ASSOCIATE, and EVIDENCE. These attributes reflect interpretive judgments, such as identifying when an “opacity” supports a diagnosis of “pneumonia” or whether two entities should be linked through an associative relation. Annotators verified whether such relations were correctly inferred from the surrounding text and whether the attributes assigned to each entity (e.g., presence, uncertainty, temporal change) matched the narrative context.

To support this process, we developed a custom annotation interface that displayed the original report text alongside GPT-4’s predicted triplets and an editable table of structured fields. Each sentence in the report was paired with its associated annotations, including entity category, relation type, and all relevant attributes. Annotators could directly add, edit, remove, or merge entries to reflect clinically accurate interpretations. For example, terms like “ground glass opacity”—which could be mistakenly split—were merged into a single PF (perceptual finding) entity based on how radiologists commonly use the phrase. Annotation was conducted separately for each section (history, findings, impression), and the interface supported sentence-level review within each section to ensure consistent entity–relation mappings when terms appeared across multiple sentences.

3. Sequential Structured Report: Schema

Longitudinal radiology reports often exhibit lexical variation, abstraction shifts, and inconsistent phrasing. The same pathology may be described differently over time (e.g., "right opacity" vs. "focal consolidation"), complicating semantic alignment and temporal reasoning. To address this, we introduce a schema that structures reports across patient timelines through two key components:

  • ENTITYGROUPS identify observations that refer to the same underlying clinical finding, even when expressed using different terms, anatomical references, or levels of abstraction. Within each patient, all observation pairs are compared to detect semantic equivalence, regardless of when they appear in the timeline, whether the finding is reported as present or absent (DXSTATUS), or whether it is stated definitively or tentatively (DXCERTAINTY). For example, “PICC line tip in lower SVC” and “at the cavoatrial junction” may describe the same catheter tip location, reflecting inherent ambiguity in 2D imaging. Similarly, “lung volumes” reported as low on day 10 and described as “no change” on day 90 can be grouped to indicate persistent low lung volume.
  • TEMPORALGROUPS divide each ENTITYGROUP into distinct diagnostic episodes based on temporal distance, shifts in status or certainty, and explicit expressions of clinical change (e.g., “worsening,” “resolved”). This approach captures clinically meaningful transitions in a patient’s condition. For example, “fever” mentioned in both the day 10 and day 90 reports appears in the “history” section but occurs far apart in time; treating them as part of separate temporal groups better reflects clinical reasoning. Together, these components support fine-grained evaluation of both semantic consistency and temporal coherence in longitudinal model outputs.
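A hypothetical illustration of these two grouping levels for one patient is shown below; the field names are chosen for exposition, while the released CSV encodes the same information in the gt_entity_group and gt_temporal_group columns described under Data Description:

# Hypothetical grouping of one patient's observations across studies.
entity_group = {
    "gt_entity_group": "low lung volumes",
    "members": [
        {"study_day": 10, "phrase": "low lung volumes",          "gt_temporal_group": 1},
        {"study_day": 12, "phrase": "lung volumes remain low",   "gt_temporal_group": 1},
        # A long temporal gap starts a new diagnostic episode:
        {"study_day": 90, "phrase": "no change in lung volumes", "gt_temporal_group": 2},
    ],
}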

3-1. Sequential Report Annotation Process

We annotated 80 chest X-ray reports from 10 patients within the 230-patient cohort used in the single-report annotation to create a gold dataset for longitudinal evaluation. The same four physicians from the earlier phase participated in the annotation process, with patients divided equally among them. Each physician independently annotated their assigned patients’ reports in chronological order, identifying observations referring to the same underlying finding (ENTITYGROUP, represented as linearized phrases combining an entity and its attributes, e.g., "pleural effusion right lung increasing") and grouping them into diagnostic episodes (TEMPORALGROUP, numbered sequentially as 1, 2, 3, etc. to distinguish separate temporal progressions) based on clinical and temporal continuity. Terminology was normalized when appropriate (e.g., aligning "right clavicle hardware" and "orthopedic side plate"), while preserving distinctions in abstraction and anatomical specificity. This process required significant effort due to the complexity of longitudinal comparison: patients had between 3 and 14 reports, with time intervals ranging from 1 to 1,200 days.
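One reason for this effort is that exhaustive pairwise comparison scales quadratically with the number of observations per patient, as the short computation below illustrates (the observation counts are made up for exposition):

# A patient with n grouped observations requires n * (n - 1) / 2
# pairwise comparisons when every pair is checked for equivalence.
def n_pairs(n: int) -> int:
    return n * (n - 1) // 2

print(n_pairs(10))   # 45
print(n_pairs(100))  # 4950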

4. Vocabulary

To systematically capture the range of descriptive, temporal, spatial, and contextual attributes in radiologic reporting, we constructed a structured vocabulary of relation terms grounded in all schema-defined relation types instantiated in LUNGUAGE. The process followed four stages: (1) automatic candidate extraction, (2) expert review and refinement, (3) clinical subcategory definition, and (4) normalization. This pipeline was designed to maximize coverage while ensuring clinical interpretability and internal consistency.

Candidate Extraction. We first piloted schema and prompt designs on 100 sample reports, iteratively refining them before applying the finalized schema to the full set of 1,473 reports. Using GPT-4, we produced initial structured outputs and extracted candidate terms corresponding to each relation type. This step emphasized high recall to capture the breadth of linguistic variation in free-text radiology reports and provided a foundation for analyzing hierarchical consistency across categories.

Expert Review and Refinement. Four board-certified physicians independently reviewed the candidate vocabularies for each relation category, verifying accurate categorization and removing spurious or ambiguous expressions. Disagreements were resolved through consensus meetings, prioritizing clinical interpretability and reproducibility.

This process was particularly critical for borderline cases, such as:

  • Distinguishing between Condition terms under Morphology and subtle gradations of Severity
  • Differentiating between field-of-view limitations and patient-related limitations

Subcategory Definition (Entity, Location, and Attribute Taxonomies). All vocabularies were hierarchically organized to reflect radiologic conventions and enable reasoning across different levels of granularity. They are grouped into three major taxonomies:

  1. Entity Taxonomy. Entities were first assigned to one of six mutually exclusive Cat labels: PF (Perceptual Findings), CF (Contextual Findings), COF (Clinical Objective Findings), NCD (Non-CXR Diagnosis), OTH (Other Objects), and PATIENT INFO (Patient Information). Within each label, entities were further classified into subcategories such as Diagnostic Observations, Anatomical Entities, Diseases and Disorders, Medical Devices, or Symptoms & Signs. Representative examples include: “opacity” and “right hilum” (PF), “pneumonia” and “congestive heart failure” (CF), “oxygen saturation” (COF), “stroke” (NCD), “central venous catheter” (OTH), and “fever” or “chronic dyspnea” (PATIENT INFO). Normalization ensured consistent representation, while diverse raw expressions were linked at the lowest level (e.g., “pneumonia” → “PNA,” “pneumonias”).
  2. Location Taxonomy. The most extensive vocabulary, comprising 546 terms, was organized into hierarchical paths that mirror clinical localization practices. High-level systems included respiratory (229), musculoskeletal (84), cardiovascular (73), and others (160). Examples of hierarchical paths include: “lung → lobe → right → upper,” “heart → chamber → atrium → left,” and “spine → thoracic → vertebra → T4.” This structuring enables reasoning from coarse system-level interpretation to fine-grained anatomical localization.
  3. Attribute Taxonomy. Attributes were systematically organized into descriptive and temporal axes. MORPHOLOGY (205) was divided into shape and structure, texture and density, and condition. Temporal change included ONSET (57), IMPROVED (118), WORSENED (102), and NO CHANGE (138), each stratified into graded interpretations (e.g., “moderate improvement,” “minimal worsening”). Device-related metadata were captured under PLACEMENT (74), describing both positional accuracy (e.g., “malpositioned”) and procedural changes (e.g., “removed,” “repositioned”). Additional axes included MEASUREMENT (139), SEVERITY (86), DISTRIBUTION (37), and COMPARISON (44). Auxiliary types captured contextual but clinically relevant information: ASSESSMENT LIMITATIONS (233; e.g., “rotated patient,” “poor inspiration”), OTHER SOURCE (55; e.g., CT, MRI), and PAST HX (39; e.g., “status post,” “history of malignancy”). Our vocabulary was restricted to relation types that correspond to lexically explicit attributes. Four relation types—EVIDENCE, ASSOCIATE, DXSTATUS, and DXCERTAINTY—were excluded. These relations are critical to the annotation schema but represent pragmatic inference rather than explicit lexical expressions. For instance, EVIDENCE and ASSOCIATE encode reasoning links between entities, often spanning sentences, while DXSTATUS and DXCERTAINTY capture interpretive stance (e.g., presence vs. absence, tentative vs. definitive).

Normalization. The resulting vocabulary includes 14 relation types derived from lexical evidence, each normalized to a preferred set of terms and organized into semantically coherent subcategories. We additionally performed UMLS mapping wherever possible to align relation terms with existing biomedical ontologies, while preserving terms that fall outside conventional coverage. This ensured both lexical consistency and clinical validity, supporting future integration. Beyond its role in structuring chest X-ray reports, this vocabulary provides a reusable lexicon for tasks such as query expansion, ontology alignment, multimodal grounding, and patient-level reasoning, thereby establishing a clinically grounded and internally consistent taxonomy of radiologic language.
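As a small usage sketch, the released vocabulary can serve as a lookup table that maps raw report phrases to their preferred normalized forms. This assumes the Lunguage_vocab.csv file described under Data Description, loaded with pandas:

import pandas as pd

# Build a raw-term -> normalized-term lookup from the released vocabulary.
vocab = pd.read_csv("Lunguage_vocab.csv")
norm_map = dict(zip(vocab["target_term"], vocab["normed_term"]))

print(norm_map.get("pulmonary vascular congestion"))
# -> "pulmonary vascular congestion" (this term is already in normalized form)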


Data Description

LUNGUAGE is a benchmark dataset for fine-grained and temporally aware interpretation of chest radiograph reports. It contains 1,473 expert-annotated reports from the MIMIC-CXR test set, including 80 longitudinal reports from 10 patients with 3 to 14 studies each, spanning time intervals from 1 to 1,200 days. The dataset comprises three components: (1) a schema-aligned vocabulary that defines diagnostic entities and attributes with subcategories, normalized forms, and UMLS concept codes; (2) single structured reports with 17,949 expert-validated entities and 23,307 relation–attribute pairs across 18 relation types, capturing detailed diagnostic information at the sentence level; and (3) sequential structured reports annotated with 41,122 pairwise comparisons, organized into ENTITYGROUPS for semantically equivalent observations and TEMPORALGROUPS for temporally aligned clinical episodes.

Single Structured Report Statistics

  • Total number of reports: 1,473 chest X-ray reports
  • Total number of patients: 230
  • Number of imaging studies per patient: ranges from 1 to 15
  • Total number of annotated entities: 17,949
  • Total number of annotated relation–attribute pairs: 23,307

Sequential Structured Report Statistics

  • Total number of reports: 80 chest X-ray reports
  • Total number of patients: 10 patients (subset of the 230-patient cohort)
  • Number of reports per patient: ranges from 3 to 14
  • Time intervals between reports: ranges from 1 day to 1,200 days
  • Total number of observation pairs compared: 41,122
  • Number of observation pairs per patient: ranges from 34 to 141

Vocabulary Statistics

  • Entity Categories

Category | Subcategory (count) | Example terms
pf (Perceptual Findings) | Diagnostic Observations (303), Anatomical Entities (157), Diseases and Disorders (76) | opacity, right hilum
cf (Contextual Findings) | Diseases and Disorders (271), Diagnostic Observations (43) | congestive heart failure, pneumonia
cof (Clinical Objective Findings) | Diagnostic Observations (85), Diseases and Disorders (19) | oxygen saturation, anti pd1 antibody
ncd (Non-CXR Diagnosis) | Diseases and Disorders (130), Diagnostic Observations (8) | stroke, seizure disorder
oth (Other Objects) | Medical Devices (489), Procedures & Surgeries (157), Treatment & Medications (19) | central venous catheter, lobectomy
patient info | Symptoms & Signs (171), Diseases & Disorders (25), Treatment & Medications (8), Procedures & Surgeries (6) | fever, cough, chronic dyspnea
  • Attribute Categories – Spatial and Descriptive Attributes

Category | Subcategory (count) | Example terms
severity | Extreme (14), Significant (22), Moderate (27), Mild (9), Minimal (14) | moderate, severe
measurement | Size (86), Quantity (39), Normality (14) | 2.5 cm, multiple
morphology | Shape & Structure (109), Texture & Density (62), Condition (34) | nodular, reticular
distribution | Pattern (24), Extent (9), General Description (3) | diffuse, focal
comparison | Location & Laterality (41), Degree & Description (3) | left greater than right
  • Attribute Categories – Temporal Change

Category | Subcategory (count) | Example terms
onset | Acute/Sudden (24), Chronic/Long-term (20), Progressive (13) | acute, chronic
improved | Extreme (28), Significant (16), Moderate (38), Mild (19), Minimal (17) | resolved, decreased
worsened | Extreme (1), Significant (17), Moderate (53), Mild (5), Minimal (26) | enlarging, increased
no change | No Change (113), Minimal Change (16) | unchanged, persistent
placement | Standard Position (41), Repositioning (15), New Placement (4), Removal (10), Nonstandard Position (4) | inserted, malpositioned
  • Attribute Categories – Contextual Information

Category | Subcategory (count) | Example terms
assessment limitations | Evaluation (112), Field-of-View (49), Patient-Related (55), Technical (17) | poor inspiration, rotated patient
other source | Image (46), Signal (3), External Source (6) | CT, MRI
past hx | Past Hx (39) | status post, known

Location Taxonomy and Coverage

The Location subcategory is organized as a multilevel taxonomy covering major anatomical systems—including respiratory (n=229), musculoskeletal (n=82), cardiovascular (n=73), abdominal (n=33), and mediastinal (n=21)—as well as anatomical directions (n=79) and other spatial descriptors. In addition, it includes a small number of non-anatomical references such as medical devices (n=5) and other location descriptors. Although medical devices are not anatomical structures, they are treated as locations when used to indicate the position of a finding—for example, “opacity adjacent to the endotracheal tube”. In such cases, the device acts as a spatial anchor and provides clinically meaningful localization.

Top-level Location Category | Term Count (% of total) | Example Anatomical Sites | Max Depth | Example Location Paths
Respiratory | 229 (≈42%) | Lungs, pleura, bronchi, thoracic wall | up to 7 | lung > lobes > right > upper; pleura > left > upper
Cardiovascular | 73 (≈13%) | Heart chambers & valves, aorta, vena cava, jugular/supra-cardiac veins | up to 6 | vessels > aorta > arch; heart > chambers > atrium > right; veins > jugular > internal > right
Musculoskeletal | 82 (≈15%) | Spine (cervical–lumbar), ribs, clavicle, shoulder & acromioclavicular joints | up to 6 | spine > thoracic; bones > ribs > left; joints > shoulder > right
Abdominal | 33 (≈6%) | Stomach, bowel segments, abdominal quadrants, sub-diaphragmatic spaces | up to 6 | stomach > fundus; quadrants > right > upper; organs > intestines > duodenum
Mediastinum | 21 (≈4%) | Paratracheal, carinal, paramediastinal compartments | up to 5 | paratracheal > right; paramediastinal_region > right; carina
Other structures / Descriptors | 105 (≈19%) | Axilla, neck, extremities, directional and general descriptors, device placements | up to 5 | axilla > left; neck > lower; medical_device
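Because each location path is serialized with “>” separators, as in the table above, coarse-to-fine reasoning can be supported by splitting a path into its levels. A minimal sketch:

# Split a serialized location path into its hierarchy levels,
# e.g. "lung > lobes > right > upper" -> ["lung", "lobes", "right", "upper"].
def parse_location_path(path: str) -> list:
    return [level.strip() for level in path.split(">")]

levels = parse_location_path("lung > lobes > right > upper")
print(levels[0])   # "lung"  (coarse, system level)
print(levels[-1])  # "upper" (fine-grained)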

Files and Structure

We present two CSV files: Lunguage.csv and Lunguage_vocab.csv.

Directory Structure:

Lunguage
  ├── Lunguage.csv
  └── Lunguage_vocab.csv

File Format and Contents:

Lunguage.csv is the LUNGUAGE benchmark dataset, providing fine-grained, structured labels for both single and sequential reports. Each row represents an annotated entity and its associated attributes and relations, with rich metadata for temporal and semantic grouping.

Identifiers

  • subject_id: Unique patient identifier.
  • study_id: Unique imaging study (e.g., chest X-ray) identifier.
  • ent_idx: Local identifier for each entity within the report section.
  • section: Report subsection (e.g., “findings”, “impression”) where the entity appears.
  • sent: Target sentence within the report section.
  • sent_idx: Index of the target sentence within the report section.
  • StudyDateTime: Imaging study date and time.
  • time_from_first: Time elapsed since the patient’s first report (e.g., “2 days 20:07:54”).
  • sequence: Index of the report within the patient’s longitudinal sequence (used for sequential modeling).
  • report, section_report: Original report or section text tied to the annotations.

Entity and Relations

  • ent: Extracted entity phrase.
  • normed_ent: Normalized entity.
  • cat: Entity category, e.g., PF, CF, OTH, COF, NCD, PATIENT INFO.
  • Diagnostic status and certainty: dx_status, dx_certainty
  • Spatial & appearance: location, morphology, distribution, measurement, severity, comparison
  • Temporal change: onset, improved, worsened, no change, placement
  • Contextual info: past hx, other source, assessment limitations
  • Inter-entity relations: evidence, associate
    • These fields represent relational links between entities, such as when one finding supports a diagnosis (evidence) or is conceptually co-occurring (associate). Instead of plain text values, these columns contain structured references pointing to other entities within the same report section using ent and ent_idx. Example: If a row with ent_idx = 3 (e.g., “opacity”) has evidence = pneumonia, idx5, it indicates that this entity (opacity) supports the entity at ent_idx = 5 (“pneumonia”) within the same section and study.
  • Temporal and Semantic Grouping (for Sequential Tasks)
    • gt_entity_group: Groups all expressions across time referring to the same underlying clinical finding (e.g., “opacity” and “consolidation” in different reports).
    • gt_temporal_group: Further splits the entity group into distinct temporal episodes based on changes over time (e.g., progression, resolution).

For concreteness, here is an example instance:

{
  "subject_id": "p10274145",
  "study_id": "s53183707",
  "section": "find",
  "sent_idx": 4,
  "ent_idx": 6,
  "gt_temporal_group": 1.0,
  "gt_entity_group": "heart size upper limits of normal",
  "sequence": 2.0,
  "report": "the lungs are clear bilaterally with no areas ...",
  "section_report": "(1) the lungs are clear bilaterally with no ar...",
  "sent": "cardiomegaly is stable.",
  "cat": "pf",
  "ent": "cardiomegaly",
  "normed_ent": "cardiomegaly",
  "dx_status": "positive",
  "dx_certainty": "definitive",
  "location": null,
  "evidence": null,
  "associate": null,
  "morphology": null,
  "distribution": null,
  "measurement": null,
  "severity": null,
  "comparison": null,
  "onset": null,
"no_change": "stable",
  "improved": null,
"worsened": null,
  "placement": null,
  "past_hx": null,
  "other_source": null,
  "assessment_limitations": null,
  "StudyDateTime": "2174-06-04 15:45:16",
  "time_from_first": "2 days 20:07:54"
}
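To make the row format concrete in practice, the following minimal pandas sketch loads the benchmark and follows one finding across a patient's timeline. The column names are as documented above; the filtering values are taken from the example instance, and the selection logic is purely illustrative:

import pandas as pd

df = pd.read_csv("Lunguage.csv")

# All annotated entities for one patient, ordered along the timeline.
patient = df[df["subject_id"] == "p10274145"].sort_values("StudyDateTime")

# Follow a single finding across studies via its entity group.
cardiomegaly = patient[patient["gt_entity_group"] == "heart size upper limits of normal"]
print(cardiomegaly[["study_id", "sent", "ent", "no_change", "gt_temporal_group"]])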

The Lunguage_vocab.csv file provides a structured vocabulary mapping for descriptive and contextual attributes found in chest X-ray reports. The file consists of 3,827 rows and includes the following five columns:

  • category
    • Indicates the high-level relation type, such as location, morphology, severity, assessment limitations, comparison, etc.
  • subcategory
    • Provides a more fine-grained semantic classification under each category. For example, within assessment limitations, you may see evaluation limitations, patient-related limitations, etc.
  • target_term
    • The original lexical phrase as it appears in the reports (e.g., “obscuration of the hemidiaphragm”, “partly obscuring the visualization”).
  • normed_term
    • The normalized form of the term, used to standardize expressions across different lexical variations (e.g., “obscured”, “limited assessment”).
  • UMLS (w code)
    • A textual mapping to a Unified Medical Language System (UMLS) concept, including the concept name and code (e.g., “metastasis (code: c4255448)”). In some cases, multiple distinct values may map to the same UMLS concept if they are semantically equivalent. When values are semantically different or cannot be mapped to any UMLS concept, they are marked as “– (code: –)”.
{
  "category": "cf",
  "subcategory": "diseases and disorders",
  "target_term": "pulmonary vascular congestion",
  "normed_term": "pulmonary vascular congestion",
  "UMLS (w code)": "pulmonary vascular congestion (code: c5849517)"
}
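Because the UMLS column stores the concept name and code in a single string, downstream ontology alignment needs a small parsing step. A sketch, with the pattern inferred from the examples above:

import re

# Extract the concept code from a "name (code: cXXXXXXX)" string.
# Unmapped entries such as "– (code: –)" yield None here.
def umls_code(value: str):
    m = re.search(r"\(code:\s*(c\d+)\)", value)
    return m.group(1) if m else None

print(umls_code("pulmonary vascular congestion (code: c5849517)"))  # c5849517
print(umls_code("– (code: –)"))  # None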

Usage Notes

LUNGUAGE is the first benchmark dataset to provide fine-grained and temporally annotated structured reports for chest radiographs. Its design supports both single-report understanding and longitudinal reasoning, and its schema-aligned vocabulary promotes interoperability with external medical ontologies. Detailed information about the dataset and how to use it can be found on the GitHub repository [9] and in the paper [10].

Limitations

Despite the expert-driven design, several limitations remain. The current release includes only 1,473 single reports and 80 sequential reports, which may limit its use for large-scale training or full linguistic coverage. To support scalability, we provide code for generating silver-standard datasets. While many vocabulary entries are mapped to UMLS, some could not be reliably linked and are left blank or marked as “– (code: –).” Finally, despite unified guidelines and regular consensus meetings, minor interpretive differences may persist due to the subjectivity of radiologic reporting. Broader consensus across institutions and countries is needed to improve generalizability and consistency.


Release Notes

This is version 1.0.0 of the Lunguage dataset. For any questions or concerns regarding this dataset, please feel free to reach out to us (jhak.moon@kaist.ac.kr).


Ethics

The authors have no ethical concerns to declare.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Jain S, Agrawal A, Saporta A, Truong SQH, Duong DN, Bui T, et al. RadGraph: extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463. 2021.
  2. Khanna S, Dejl A, Yoon K, Truong SQH, Duong H, Saenz A, et al. RadGraph2: modeling disease progression in radiology reports via hierarchical information extraction. In: Proceedings of the Machine Learning for Healthcare Conference. PMLR; 2023. p. 381–402.
  3. Wu JT, Agu NN, Lourentzou I, Sharma A, Paguio JA, Yao JS, et al. Chest ImaGenome dataset for clinical reasoning. arXiv preprint arXiv:2108.00316. 2021.
  4. Zhang M, Hu X, Gu L, Harada T, Kobayashi K, Summers RM, et al. CAD-Chest: comprehensive annotation of diseases based on MIMIC-CXR radiology report. 2023.
  5. Zhao W, Wu C, Zhang X, Zhang Y, Wang Y, Xie W. RaTEScore: a metric for radiology report generation. arXiv preprint arXiv:2406.16845. 2024.
  6. Delbrouck JB, Chambon P, Chen Z, Varma M, Johnston A, Blankemeier L, et al. RadGraph-XL: a large-scale expert-annotated dataset for entity and relation extraction from radiology reports. In: Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics; 2024. p. 12902–15. Available from: https://aclanthology.org/2024.findings-acl.765
  7. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(Suppl 1):D267–70. PMID: 14681409.
  8. Johnson AEW, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng C, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data. 2019;6(1):317.
  9. Lunguage GitHub Repository. https://github.com/SuperSupermoon/Lunguage
  10. Moon JH, Choi G, Rabaey P, Kim MG, Hong HG, Lee JO, et al. Lunguage: A benchmark for structured and sequential chest X-ray interpretation. arXiv. 2025; arXiv:2505.21190.
