Database Contributor Review
Medical Information Mart for Intensive Care Brazil (MIMIC-BR): a Brazilian Dataset of Anonymized Hospital and ICU Clinical Data
Gabriela Steil , Adhara Brandão Lima Vanhoz , Mateus de Lima Freitas , Alice Barone de Andrade , Marcos Silva de Mendonça , Rafael Gustavo Bezerra , Maria Tereza Abrahão , Cesar Truyts , Diogo Patrão , Chrystinne Fernandes , Edson Amaro , Adriano Jose Pereira
Published: May 21, 2026. Version: 1.0.0
When using this resource, please cite:
Steil, G., Brandão Lima Vanhoz, A., Freitas, M. d. L., Barone de Andrade, A., Silva de Mendonça, M., Bezerra, R. G., Abrahão, M. T., Truyts, C., Patrão, D., Fernandes, C., Amaro, E., & Pereira, A. J. (2026). Medical Information Mart for Intensive Care Brazil (MIMIC-BR): a Brazilian Dataset of Anonymized Hospital and ICU Clinical Data (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/0vk7-vw29
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
Medical Information Mart for Intensive Care Brazil (MIMIC-BR) is a database of anonymized clinical data from ICU and hospital admissions in Brazil. Its format and intended use match those of the original Medical Information Mart for Intensive Care (MIMIC), a large, freely available database of de-identified health-related data from patients admitted to the critical care units of the Beth Israel Deaconess Medical Center. MIMIC-BR includes 30,599 adult patients admitted to Einstein Hospital Israelita during a defined retrospective period of three consecutive years within the interval between 2015 and 2025. The exact years are not specified in order to further protect patient privacy. The dataset contains demographic information, diagnoses, procedures, laboratory tests, medications, and vital signs.
Data are structured using the OMOP Common Data Model (OMOP-CDM) and anonymized in accordance with the Brazilian General Data Protection Law (LGPD), aligned with the current Digital Health Strategy for Brazil 2020-2028, and international best practices to ensure privacy and security. This resource aims to reduce barriers to research, ensure greater representation of Brazilian and Latin American populations in global studies, foster interoperability within the OMOP/OHDSI ecosystem, and promote open science, education, and innovation based on real-world clinical data.
Background
In Brazil, there are few public datasets based on hospital Electronic Health Records (EHRs) [1], and virtually none that combine (i) high clinical data granularity, (ii) international data standardization through OMOP-CDM [2-3], and (iii) a robust, auditable anonymization process. Previous initiatives, such as BRAX (the Brazilian labeled chest X-ray dataset) [4], have demonstrated the feasibility of balancing privacy, governance, and scientific impact when sharing data responsibly.
Building on these experiences and on the original Medical Information Mart for Intensive Care (MIMIC) [5, 6], a large, freely available database of de-identified health-related data from patients admitted to the critical care units of the Beth Israel Deaconess Medical Center, MIMIC-BR aims to open the path for a public Brazilian database of anonymized hospital clinical data, structured according to OMOP-CDM and aligned with international best practices. The MIMIC-BR initiative was proposed as part of a series of datathons and work meetings to design the database pari passu with other similar databases (e.g., K-MIMIC, the Korea Medical Information Mart for Intensive Care). We intend to add data from other national databases in future releases of MIMIC-BR, following the evolution of the original MIMIC database, currently in its fourth generation [6].
By providing a dataset comparable to MIMIC-IV [6], anonymized in accordance with the Brazilian General Data Protection Law (LGPD), aligned with the current Digital Health Strategy for Brazil, and international best practices to ensure privacy and security [7], MIMIC-BR addresses a strategic gap in the national research infrastructure and supports the development of predictive models and evidence-based healthcare solutions tailored to the Brazilian and Latin American context, contributing to reducing racial and other types of bias in scientific research [8, 9].
Methods
MIMIC-BR is derived from the institutional OMOP-CDM repository at Einstein Hospital Israelita, which integrates data from multiple clinical and administrative systems, including the EHR, laboratory information system (LIS), and coding systems [10, 11]. The creation of MIMIC-BR followed the steps below.
Acquisition
A cohort of 30,599 adult patients (≥18 years) admitted to inpatient units, ICUs, and emergency departments during a defined retrospective period (three consecutive years within the interval between 2015 and 2025) was selected from the OMOP-CDM repository. All relevant OMOP tables (e.g., person, visit_occurrence, condition_occurrence, procedure_occurrence, drug_exposure, measurement, observation, death, visit_detail, drug_era, condition_era, concept, vocabulary) were filtered to include only records associated with this cohort.
Preparation
Data were reorganized into a simplified structure derived from OMOP-CDM to facilitate retrospective analysis and public sharing. This process included selecting relevant domains, removing unnecessary or sensitive attributes, and applying transformations such as grouping rare categories and normalizing units. The schema includes modifications and extensions while preserving compatibility with OMOP-CDM for use with OHDSI tools.
De-identification
A robust anonymization pipeline was applied in compliance with LGPD and international best practices. Steps included:
- Removal of direct identifiers and any reversible keys;
- Date shifting per patient to maintain internal temporal consistency;
- Aggregation of extreme ages (≥89 years) and rare categories to reduce re-identification risk;
- Application of k-anonymity, l-diversity, and t-closeness principles to quasi-identifiers;
- Free-text fields were excluded from this release. Structured data were validated to ensure no residual identifiers remained.
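The date-shifting and age-aggregation steps listed above can be illustrated with a minimal sketch. It is not the pipeline used to build MIMIC-BR: the offset range, the helper names (`make_offsets`, `shift_date`, `cap_age`), and the sample records are assumptions made only for illustration.

```python
from datetime import date, timedelta
import random

AGE_CAP = 89  # ages at or above this value are aggregated into one group


def make_offsets(person_ids, max_days=3650, seed=0):
    """Draw one random day offset per patient (hypothetical range)."""
    rng = random.Random(seed)
    return {pid: timedelta(days=rng.randint(1, max_days)) for pid in sorted(person_ids)}


def shift_date(d, person_id, offsets):
    """Shift a date by the patient's own offset."""
    return d + offsets[person_id]


def cap_age(age):
    """Collapse extreme ages into the cap value."""
    return min(age, AGE_CAP)


# Toy records: (person_id, admission date, discharge date, age)
visits = [
    (1, date(2001, 3, 1), date(2001, 3, 9), 93),
    (2, date(2001, 3, 1), date(2001, 3, 4), 57),
]
offsets = make_offsets({pid for pid, *_ in visits})
shifted = [
    (pid, shift_date(start, pid, offsets), shift_date(end, pid, offsets), cap_age(age))
    for pid, start, end, age in visits
]
for (pid, start, end, _), (_, s_start, s_end, s_age) in zip(visits, shifted):
    # Both dates of a patient move by the same offset, so the interval is unchanged.
    print(pid, (end - start).days, (s_end - s_start).days, s_age)
```

Because every date of a given patient moves by the same offset, intervals such as length of stay survive the shift, while calendar dates cannot be compared across patients.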
Because of limitations of the underlying timestamp types in Spark (especially version 3.0+), timestamps prior to 1582 cannot be represented. To work around this limitation, preserve historical dates, and support de-identification, all dates are stored as strings rather than converted to timestamps.
Records with a discharge date (visit_end_date) set to the year 9999 indicate that the patient was still hospitalized at the time of data extraction, or that the discharge date could not be recovered for any reason.
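Since dates are stored as strings and the year 9999 marks an unknown discharge, a loader may want to parse the strings and map the sentinel to a missing value before computing durations. The sketch below assumes an ISO-like `YYYY-MM-DD` format, which is not stated in this description and should be checked against the files; the helper names are hypothetical.

```python
from datetime import datetime

# Per the text above: still hospitalized, or discharge date not recoverable.
SENTINEL_YEAR = 9999


def parse_date(text, fmt="%Y-%m-%d"):
    """Parse a date stored as a string; the format is an assumption."""
    return datetime.strptime(text, fmt).date()


def parse_visit_end(text):
    """Return None for the year-9999 sentinel, otherwise the parsed date."""
    d = parse_date(text)
    return None if d.year == SENTINEL_YEAR else d


def length_of_stay_days(start_text, end_text):
    """Length of stay in days, or None when the discharge date is unknown."""
    end = parse_visit_end(end_text)
    if end is None:
        return None
    return (end - parse_date(start_text)).days


print(length_of_stay_days("2130-01-05", "2130-01-12"))  # 7
print(length_of_stay_days("2130-01-05", "9999-12-31"))  # None
```

Returning `None` rather than a very large number keeps sentinel rows from distorting length-of-stay statistics.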
In the visit_occurrence table, unique length-of-stay (LOS) categories with a total number of patients fewer than 20 were grouped into new aggregated categories. For these cases, the visit_end_date was reassigned according to the aggregated category definition.
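As an illustration of the small-cell rule described above, the following sketch finds the LOS values shared by fewer than 20 distinct patients, which are the candidates for aggregation. How the aggregated categories were actually defined is not specified here, so the sketch stops at identifying them; the function name and toy data are assumptions.

```python
from collections import defaultdict

MIN_PATIENTS = 20  # threshold quoted in the text above


def rare_los_values(visits, min_patients=MIN_PATIENTS):
    """Return LOS values (in days) seen in fewer than `min_patients` distinct patients."""
    patients_by_los = defaultdict(set)
    for person_id, los_days in visits:
        patients_by_los[los_days].add(person_id)
    return sorted(
        los for los, people in patients_by_los.items() if len(people) < min_patients
    )


# Toy data: 25 patients with a 3-day stay, 2 patients with a 120-day stay.
visits = [(pid, 3) for pid in range(25)] + [(100, 120), (101, 120)]
print(rare_los_values(visits))  # [120]
```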
Data Description
MIMIC-BR is derived from the OMOP Common Data Model and includes a subset of tables reorganized for public release.
Each table contains anonymized and structured data relevant for clinical and epidemiological research.
Below is an overview of the main tables:
person (n = 30,599)
Contains one row per patient. Includes demographic attributes such as age (with aggregation for ≥89 years), sex, and race/ethnicity (when available).
visit_occurrence (n = 37,978)
Represents hospital visits, including admission type (elective, emergency), source and destination, unit of care (ward, ICU), and length of stay.
condition_occurrence (n = 125,570)
Stores diagnoses coded in ICD-10 and mapped to SNOMED-CT concepts. Includes shifted timestamps and visit linkage.
procedure_occurrence (n = 181,608)
Contains surgical and non-surgical procedures, mapped to OMOP standard vocabularies: TUSS ("Terminologia Unificada da Saúde Suplementar") for procedures, and SIGTAP ("Sistema de Gerenciamento da Tabela de Procedimentos, Medicamentos e OPM") of the Brazilian Unified Health System ("Sistema Único de Saúde", SUS), which covers not only procedures but also medicines, orthoses, and prostheses.
drug_exposure (n = 853,730)
Records medication prescriptions and administrations, mapped to RxNorm or equivalent vocabularies.
measurement (n = 1,337,890)
Includes laboratory test results (numeric and categorical) and vital signs such as heart rate, blood pressure, oxygen saturation, and temperature. Standardized using LOINC (Logical Observation Identifiers Names and Codes) where possible.
observation (n = 640,612)
Captures additional structured clinical observations not classified as measurements, such as anthropometric data.
death (n = 46)
Indicates in-hospital mortality events, with anonymized dates.
visit_detail (n = 118,115)
Provides more granular information about sub-visits or movements within a hospital stay (e.g., transfers between wards or ICUs).
drug_era (n = 695,168)
Summarizes periods of continuous drug exposure, derived from individual drug_exposure records.
condition_era (n = 158,205)
Aggregates condition occurrences into broader time intervals representing ongoing disease episodes.
Each table uses surrogate keys for patients and visits, ensuring internal consistency while preventing linkage to original identifiers. All dates are shifted per patient using an anchor date and offset, preserving relative intervals for longitudinal analysis. Free-text fields (e.g., clinical notes) are excluded from this release.
[REDACTED] values appear in some fields as part of the de-identification process. They indicate that the original data were intentionally suppressed to protect patient privacy.
An independent institutional third-party company conducted a formal review and certification of the de-identification process. They evaluated the strategies and metrics used to assess the risk of re-identification for all data utilized in the project.
Usage Notes
This is the first publicly available dataset of Brazilian hospital and ICU patients that combines (i) high clinical granularity, (ii) international data standardization through OMOP-CDM, and (iii) a robust, auditable anonymization process. Future releases may provide larger data volumes. Free-text reports (clinical notes, imaging, pathology, or other reports) are not provided in the current version.
As real-world data (RWD), that is, data collected during routine clinical practice, the dataset reflects the idiosyncrasies of that practice. Implausible values may be present and should be analyzed with caution. Researchers should follow best-practice guidelines when analyzing the data.
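One common way to handle implausible values is to flag them against a plausibility range rather than silently dropping them. The sketch below shows the idea; the bounds, the measurement names, and the `flag_implausible` helper are illustrative assumptions, not clinical recommendations and not part of MIMIC-BR.

```python
# Hypothetical plausibility bounds per measurement name (illustrative only).
PLAUSIBLE_RANGES = {
    "heart_rate": (20.0, 300.0),       # beats per minute
    "body_temperature": (25.0, 45.0),  # degrees Celsius
}


def flag_implausible(rows, ranges=PLAUSIBLE_RANGES):
    """Return each row with a boolean 'plausible' field instead of dropping it."""
    flagged = []
    for name, value in rows:
        # Measurements without a configured range are never flagged.
        low, high = ranges.get(name, (float("-inf"), float("inf")))
        flagged.append({"name": name, "value": value, "plausible": low <= value <= high})
    return flagged


rows = [("heart_rate", 72.0), ("heart_rate", 720.0), ("body_temperature", 36.8)]
for row in flag_implausible(rows):
    print(row)
```

Keeping the flag as a separate field leaves the decision to exclude, winsorize, or investigate each value to the analyst.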
Tutorial on how to download and use OMOP Concepts Tables and vocabularies (OHDSI)
The text below explains how to download OMOP CDM vocabulary tables and how to relate them to clinical OMOP tables using concept identifiers (essential steps to allow analysis using MIMIC-BR data).
Step 1 — Download Vocabulary Files
ATHENA allows you to both search and load standardized vocabularies. It is a resource to be used, not a software tool to install. To download a zip file with all standardized vocabulary tables select all the vocabularies you need for your OMOP CDM. Vocabularies with standard concepts and very common usage are preselected. Add vocabularies that are used in your source data. Vocabularies that are proprietary have no select button. Click on the “License required” button to incorporate such a vocabulary into your list. The Vocabulary Team will contact you and request that you demonstrate your license or help you connect to the right folks to obtain one. The platform can be accessed through the ATHENA vocabulary search page [12], and a brief overview of its functionality is available in the “10-Minute Tutorial” video [13].
Go to the ATHENA website [12] and log in to your account.
From the top menu, click on Download.
Select the desired vocabularies:
| ID | Code (CDM v5) | Name |
| 1 | SNOMED | Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO) |
| 6 | LOINC | Logical Observation Identifiers Names and Codes (Regenstrief Institute) |
| 8 | RxNorm | RxNorm (NLM) |
| 12 | Gender | OMOP Gender |
| 13 | Race | Race and Ethnicity Code Set (USBC) |
| 16 | Multum | Cerner Multum (Cerner) |
| 21 | ATC | WHO Anatomic Therapeutic Chemical Classification |
| 34 | ICD10 | International Classification of Diseases, Tenth Revision (WHO) |
| 44 | Ethnicity | OMOP Ethnicity |
| 82 | RxNorm Extension | OMOP RxNorm Extension |
| 87 | Specimen Type | OMOP Specimen Type |
| 102 | SUS | Table of Procedures, Drugs, Orthoses, Prostheses and Special Materials (Brazilian Unified Health System) |
| 111 | Episode Type | OMOP Episode Type |
| 128 | OMOP Extension | OMOP Extension (OHDSI) |
Click Download to download the ZIP file.
Step 2 — Extract the Files
After extracting the ZIP file, you will find CSV files corresponding to each OMOP vocabulary table.
Included Vocabulary Tables
- CONCEPT
- CONCEPT_ANCESTOR
- CONCEPT_CLASS
- CONCEPT_RELATIONSHIP
- CONCEPT_SYNONYM
- VOCABULARY
- RELATIONSHIP
- DOMAIN
- DRUG_STRENGTH
Step 3 — Connect OMOP Concept tables to MIMIC-BR Clinical Tables
The OMOP Common Data Model does not store free-text clinical values in "fact tables". Instead, all clinical meaning is represented using standardized identifiers called concept_id, which reference rows in the CONCEPT table.
Standard Relationship Pattern
Most OMOP clinical tables follow this structure:
- xxx_concept_id: standardized concept (references CONCEPT.concept_id)
- xxx_source_concept_id: original source value, also mapped to CONCEPT
- xxx_source_value: raw value from the source system
Main Clinical Tables and Their Concept Columns
- CONDITION_OCCURRENCE.condition_concept_id: medical conditions
- DRUG_EXPOSURE.drug_concept_id: medications
- PROCEDURE_OCCURRENCE.procedure_concept_id: procedures
- MEASUREMENT.measurement_concept_id: labs and measurements
- OBSERVATION.observation_concept_id: clinical observations
- VISIT_OCCURRENCE.visit_concept_id: visit types
- DEATH.cause_concept_id: cause of death
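The relationship pattern above amounts to a join between a clinical table and CONCEPT on the concept identifier. The following sketch shows that join on an in-memory SQLite database with a handful of toy rows; only a few columns of each table are created, and the rows are illustrative rather than taken from MIMIC-BR.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    vocabulary_id TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_source_value TEXT
);
-- Toy rows for illustration only.
INSERT INTO concept VALUES (201826, 'Type 2 diabetes mellitus', 'SNOMED');
INSERT INTO concept VALUES (320128, 'Essential hypertension', 'SNOMED');
INSERT INTO condition_occurrence VALUES (1, 10, 201826, 'E11');
INSERT INTO condition_occurrence VALUES (2, 10, 320128, 'I10');
INSERT INTO condition_occurrence VALUES (3, 11, 320128, 'I10');
""")

# Resolve each condition's concept_id to a human-readable name.
rows = con.execute("""
SELECT co.person_id, co.condition_source_value, c.concept_name
FROM condition_occurrence AS co
JOIN concept AS c ON c.concept_id = co.condition_concept_id
ORDER BY co.condition_occurrence_id
""").fetchall()
for row in rows:
    print(row)
```

The same join applies to the other clinical tables by swapping in the corresponding concept column (for example, drug_concept_id for DRUG_EXPOSURE).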
Hierarchies with CONCEPT_ANCESTOR
CONCEPT_ANCESTOR enables hierarchical queries such as grouping all descendant concepts under a parent concept (e.g., all types of diabetes). This is essential for cohort definitions and analytics.
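A hierarchical query of this kind can be sketched as follows: all patients with any condition that descends from a chosen parent concept. The toy hierarchy, the parent identifier, and the reduced column set are assumptions for illustration; in a real vocabulary download, CONCEPT_ANCESTOR would be loaded from the ATHENA files.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE concept_ancestor (
    ancestor_concept_id INTEGER,
    descendant_concept_id INTEGER,
    min_levels_of_separation INTEGER,
    max_levels_of_separation INTEGER
);
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
-- Toy hierarchy: one parent concept, its self-link, and two children.
INSERT INTO concept_ancestor VALUES (201820, 201820, 0, 0);
INSERT INTO concept_ancestor VALUES (201820, 201826, 1, 1);
INSERT INTO concept_ancestor VALUES (201820, 201254, 1, 1);
INSERT INTO condition_occurrence VALUES (10, 201826);
INSERT INTO condition_occurrence VALUES (11, 201254);
INSERT INTO condition_occurrence VALUES (12, 320128);
""")

PARENT_CONCEPT_ID = 201820  # parent concept chosen for this toy example

# Patients whose condition is the parent concept or any of its descendants.
persons = [
    row[0]
    for row in con.execute(
        """
        SELECT DISTINCT co.person_id
        FROM condition_occurrence AS co
        JOIN concept_ancestor AS ca
          ON ca.descendant_concept_id = co.condition_concept_id
        WHERE ca.ancestor_concept_id = ?
        ORDER BY co.person_id
        """,
        (PARENT_CONCEPT_ID,),
    )
]
print(persons)  # [10, 11]
```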
Mappings and Relationships
CONCEPT_RELATIONSHIP defines mappings between source vocabularies and standard concepts (e.g., ICD to SNOMED). RELATIONSHIP describes the type of linkage.
CONCEPT_SYNONYM stores alternative names for concepts, while DRUG_STRENGTH adds dosage information for drug concepts.
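The source-to-standard mapping can be followed through CONCEPT_RELATIONSHIP rows whose relationship_id is 'Maps to'. The sketch below looks up the standard concept for an ICD-10 code on toy data; the concept_id values are made up for the example, and `map_to_standard` is a hypothetical helper.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    vocabulary_id TEXT,
    concept_code TEXT,
    standard_concept TEXT
);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER,
    concept_id_2 INTEGER,
    relationship_id TEXT
);
-- Toy rows; the concept_id values are made up for this example.
INSERT INTO concept VALUES (1, 'Type 2 diabetes mellitus', 'ICD10', 'E11', NULL);
INSERT INTO concept VALUES (2, 'Type 2 diabetes mellitus', 'SNOMED', '44054006', 'S');
INSERT INTO concept_relationship VALUES (1, 2, 'Maps to');
INSERT INTO concept_relationship VALUES (2, 1, 'Mapped from');
""")


def map_to_standard(con, vocabulary_id, concept_code):
    """Follow 'Maps to' links from a source code to its standard concept(s)."""
    return con.execute(
        """
        SELECT target.vocabulary_id, target.concept_code, target.concept_name
        FROM concept AS source
        JOIN concept_relationship AS cr
          ON cr.concept_id_1 = source.concept_id
         AND cr.relationship_id = 'Maps to'
        JOIN concept AS target ON target.concept_id = cr.concept_id_2
        WHERE source.vocabulary_id = ? AND source.concept_code = ?
        """,
        (vocabulary_id, concept_code),
    ).fetchall()


print(map_to_standard(con, "ICD10", "E11"))
```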
Example Usage in Spark
spark.read.csv('/path/CONCEPT.csv', header=True)
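Vocabulary files downloaded from ATHENA carry a .csv extension but are typically tab-delimited, so the delimiter may need to be set explicitly (in Spark, through the `sep` option of `spark.read.csv`); it is worth inspecting the first lines of the file to confirm. The sketch below reads such a file without Spark, using only the Python standard library. A small temporary file stands in for the real CONCEPT.csv, and `load_concepts` is a hypothetical helper.

```python
import csv
import os
import tempfile


def load_concepts(path, delimiter="\t"):
    """Read a vocabulary file into a dict keyed by integer concept_id."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters in concept names as literal text.
        reader = csv.DictReader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        return {int(row["concept_id"]): row for row in reader}


# Stand-in for the downloaded file (two columns only, for brevity).
sample = "concept_id\tconcept_name\n201826\tType 2 diabetes mellitus\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "CONCEPT.csv")
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(sample)
    concepts = load_concepts(path)

print(concepts[201826]["concept_name"])
```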
Additional details on the OMOP CDM are provided in the official documentation [14].
We hope this dataset can contribute to reducing the number of under-represented populations in the available pool of datasets containing patient clinical data used for the development of models for clinical decision support.
Release Notes
Patient composition
All patients have at least one ICU admission during the dataset period (3 years). The ICU data focuses on clinical data collected during intensive care unit stays. It includes detailed measurements, interventions, and events that reflect the complexity of critical care. Some examples of ICU data are listed below:
measurement
Contains vital signs and laboratory results recorded in the ICU, including heart rate, respiratory rate, arterial blood gases, and temperature.
procedure_occurrence
Documents ICU-specific procedures such as mechanical ventilation, dialysis, and invasive monitoring. Procedures are mapped to OMOP standard concepts.
drug_exposure
Includes ICU medication administrations, such as vasoactive drugs, sedatives, and continuous infusions, with precise timing for dose adjustments.
observation
Additional ICU-related observations, such as Glasgow Coma Scale or sedation scores.
visit_detail
Identifies ICU admissions and transfers within the hospital stay, providing timestamps for entry and discharge from the ICU.
The hospital data includes data related to patient hospitalizations in the wards, intermediate care, surgeries, and all associated clinical events. It provides a longitudinal view of care during inpatient stays, including diagnoses, procedures, medications, and outcomes. Currently, information on Emergency Department visits is not available.
Note that some columns (34 in total) in a few tables are currently empty (e.g., cause_source_concept_id, visit_source_concept_id, admitting_source_concept_id, ethnicity_source_concept_id, value_as_concept_id) because no information is currently available. They are intentionally kept in order to preserve the OMOP CDM structure. In future releases, some of these fields may be filled when additional source information becomes available.
Ethics
The project was approved by the Institutional Review Board of Einstein Hospital Israelita (#93984125.1.0000.0071). The requirement for individual patient consent was waived. The study database was anonymized, with all identifiable patient information removed in compliance with the Brazilian General Data Protection Law (LGPD).
Acknowledgements
We would like to thank the Laboratory of Computational Physiology (LCP), Massachusetts Institute of Technology (MIT), especially Leo Anthony Celi and Tom Pollard, for the inspiration, motivation, and all the continued support of the MIMIC-BR project.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
1. Dias H, Ulbrich AHDPd. BRATECA (Brazilian Tertiary Care Dataset): a clinical information dataset for the Portuguese language (version 1.0) [Internet]. PhysioNet; 2022 [accessed 31 Dec 2025]. Available from: https://doi.org/10.13026/cmab-j041
2. Reich C, Ostropolets A, Ryan P, Rijnbeek P, Schuemie M, Davydov A, et al. OHDSI standardized vocabularies—a large-scale centralized reference ontology for international data harmonization. J Am Med Inform Assoc. 2024;31(3):583–590.
3. Observational Health Data Sciences and Informatics (OHDSI). Data standardization: the OMOP Common Data Model [Internet]. [accessed 31 Dec 2025]. Available from: https://www.ohdsi.org/data-standardization/
4. Reis EP, Paiva JPQ de, Silva MCB da, et al. BRAX: Brazilian labeled chest x-ray dataset. Scientific Data. 2022;9:487. doi:10.1038/s41597-022-01608-8.
5. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3:160035.
6. Johnson AEW, Bulgarelli L, Shen L, et al. MIMIC-IV, a freely accessible electronic health record database. Scientific Data. 2023;10:1–21.
7. Brasil. Ministério da Saúde. Secretaria-Executiva. Departamento de Informática do SUS. Estratégia de Saúde Digital para o Brasil 2020–2028 [Internet]. Brasília: Ministério da Saúde; 2020 [accessed 31 Dec 2025]. Available from: https://bvsms.saude.gov.br/bvs/publicacoes/strategy_health_digital_brazilian.pdf
8. Cortes-Bergoderi M, et al. Validity of cardiovascular risk prediction models in Latin America and among Hispanics in the United States of America: a systematic review. Rev Panam Salud Publica. 2012;32(2):131–139.
9. Obermeyer Z, et al. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–453.
10. Abrahão MTF, et al. Challenges and opportunities in adopting OMOP-CDM in Brazilian healthcare: a report from Hospital Israelita Albert Einstein [Internet]. OHDSI Global Symposium; 2023 [accessed 31 Dec 2025]. Available from: https://www.ohdsi.org/wp-content/uploads/2023/10/12-Abrahao-BriefReport.pdf
11. Abrahão MTF, Freitas ML, Flato UAP, et al. Common data models in intensive care medicine during COVID-19 pandemics: the Hospital Israelita Albert Einstein experience. Einstein (Sao Paulo) [Internet]. 2022;20(Suppl 1):S1–S16 [accessed 31 Dec 2025]. Available from: https://journal.einstein.br/wp-content/uploads/2022/08/Site_volume-20-supplement-1-2022_online_150822_1401.pdf
12. Observational Health Data Sciences and Informatics (OHDSI). ATHENA [Internet]. [accessed 31 Dec 2025]. Available from: https://athena.ohdsi.org
13. Observational Health Data Sciences and Informatics (OHDSI). ATHENA 10-Minute Tutorial [Internet]. [accessed 31 Dec 2025]. Available from: https://youtu.be/2WdwBASZYLk
14. Observational Health Data Sciences and Informatics (OHDSI). OMOP Common Data Model v5.3 [Internet]. [accessed 31 Dec 2025]. Available from: https://ohdsi.github.io/CommonDataModel/cdm53.html
Access
Access Policy:
Only credentialed users who sign the DUA can access the files. In addition, users must have individual studies reviewed by the contributor.
License (for files):
PhysioNet Contributor Review Health Data License 1.5.0
Data Use Agreement:
PhysioNet Contributor Review Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/0vk7-vw29
DOI (latest version):
https://doi.org/10.13026/x1gx-7j59
Topics:
critical care
dataset
artificial intelligence
intensive care unit
machine learning
tertiary healthcare
data anonymization
inpatients
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- submit a request to the authors to use the data for your project