Database Open Access

A Multi-Modal Satellite Imagery Dataset for Public Health Analysis in Colombia

Sebastian A Cajas David Restrepo Dana Moukheiber Kuan Ting Kuo Chenwei Wu David Santiago Garcia Chicangana Atika Rahman Paddo Mira Moukheiber Lama Moukheiber Sulaiman Moukheiber Saptarshi Purkayastha Diego M Lopez Po-Chih Kuo Leo Anthony Celi

Published: Jan. 30, 2024. Version: 1.0.0


When using this resource, please cite:
Cajas, S. A., Restrepo, D., Moukheiber, D., Kuo, K. T., Wu, C., Garcia Chicangana, D. S., Paddo, A. R., Moukheiber, M., Moukheiber, L., Moukheiber, S., Purkayastha, S., Lopez, D. M., Kuo, P., & Celi, L. A. (2024). A Multi-Modal Satellite Imagery Dataset for Public Health Analysis in Colombia (version 1.0.0). PhysioNet. https://doi.org/10.13026/xr5s-xe24.

Additionally, please cite the original publication:

Kuo KT, Moukheiber D, Ordonez SC, Restrepo D, Paddo AR, Chen TY, Moukheiber L, Moukheiber M, Moukheiber S, Purkayastha S, Kuo PC. DengueNet: Dengue Prediction using Spatiotemporal Satellite Imagery for Resource-Limited Countries.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

We introduce a cost-effective public health analysis solution for low- and middle-income countries: the Multi-Modal Satellite Imagery Dataset in Colombia. By leveraging high-quality, spatiotemporally aligned satellite images and corresponding metadata, the dataset integrates economic, demographic, meteorological, and epidemiological data. Single-forward and forward-backward artifact removal techniques yield clear satellite images with minimal cloud cover for every epi-week, improving overall data quality. The extraction process uses the satellite extractor package, powered by the SentinelHub API, and results in a dataset of 12,636 satellite images from 81 municipalities in Colombia between 2016 and 2018, along with relevant metadata. Beyond expediting public health data analysis across diverse locations and timeframes, this framework consistently captures multimodal features. Its applications extend to other areas of multimodal AI, including deforestation monitoring, forecasting education indices, water quality assessment, tracking extreme climatic events, addressing epidemic illnesses, and optimizing precision agriculture.


Background

In low- and middle-income countries (LMICs), traditional data collection and data accessibility remain persistent challenges, with cost acting as a barrier to comprehensive data gathering. These difficulties also hinder public health initiatives and economic advancement. The United Nations has underscored the essential role that data availability plays in achieving global development goals, particularly in critical areas such as fairness, health, and poverty [1]. Unequal access to adequate and timely high-quality data exacerbates wealth disparities between countries with abundant resources and those with limited or moderate resources.

The constraints of standard data collection have prompted efforts to explore alternative sources, with satellite imagery emerging as a viable option. Especially in areas facing economic, social, or environmental challenges, satellite imaging is cost-effective and offers near real-time data access [2-4]. It is applied in fields including environmental research, conservation, and public health, with use cases such as poverty detection, deforestation tracking, and forecasting of climate-sensitive diseases [5-10]. In public health, combining satellite imaging insights with public health metadata gives decision-makers a comprehensive understanding of the broader socio-economic and environmental factors affecting health outcomes [11-12], enabling more targeted and effective resource allocation. Satellite imagery also helps decision-makers anticipate educational demands and identify educational inequities [13]. Moreover, by enabling the identification of disadvantaged regions through the analysis of socioeconomic data, these images support the fight against poverty [6-7]. The provision of open-access processed satellite imagery and corresponding metadata enhances reproducibility and fosters greater transparency and accessibility [14]. The data could contribute to forecasting environmentally sensitive diseases such as dengue [10], malaria [15-18], and Zika [19-23].


Methods

Our approach involves implementing a multimodal fusion pipeline that seamlessly integrates a diverse range of satellite imagery and corresponding metadata. To establish the robustness and versatility of our framework, we meticulously curated a dataset comprising 12,636 images and embeddings, complemented by comprehensive metadata collected from 81 municipalities in Colombia between 2016 and 2018. This dataset underwent rigorous evaluation in three crucial tasks: predicting dengue cases, assessing poverty levels, and evaluating access to education. Our contributions include developing a framework for acquiring spatiotemporally aligned images along with relevant metadata.

The baseline approach used the satellite extractor to process raw Sentinel-2 L1C data, incorporating recursive artifact removal and cloud removal based on least cloud coverage, with spatial resolution enhanced through nearest-neighbor interpolation. We then tailored the dataset into distinct versions covering 5 cities, 10 cities, and 81 cities. The primary method for recursive artifact and cloud removal substituted the satellite image for a given day with its closest temporal neighbor. Further enhancements, detailed in the releases section, introduced a recursive forward-backward artifact removal algorithm, which was applied in this dataset. Additionally, inter-band data augmentation and albumentation wrapper modules were incorporated in the satellite extraction framework to enhance data quality, giving users the flexibility to customize and expand datasets based on their specific requirements.

Furthermore, the satellite extraction method used to acquire these datasets lets users explore other Sentinel-2 products, such as L2A, as well as different numeric precisions, such as 16-bit and 8-bit. We offer two data download methods: 1) single-forward artifact removal, and 2) forward-backward artifact removal. The first method replaces images that carry no information (a pixel sum of zero) with the nearest future neighbor image. The second method replaces such images with the nearest neighbor from either the past or the future. The code for the satellite extractor is available on GitHub [28].
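The difference between the two replacement strategies can be illustrated with a minimal sketch. It is not the satellite extractor's code: the toy nested-list images, the function names, and the tie-breaking rule (the future neighbor wins at equal distance) are assumptions made for illustration, and the search is a plain nearest-neighbor scan rather than the recursive procedure used by the authors.

```python
from typing import List, Optional, Sequence

Image = List[List[int]]  # toy single-band image as nested lists (assumption)


def is_black(img: Image) -> bool:
    """An image counts as empty when its pixel sum is zero."""
    return sum(sum(row) for row in img) == 0


def nearest_valid(images: Sequence[Image], i: int, backward: bool) -> Optional[int]:
    """Index of the closest non-black image to position i, or None.

    Future neighbors are always searched; past neighbors only when
    backward=True. At equal distance the future image wins (assumption).
    """
    for dist in range(1, len(images)):
        fwd, bwd = i + dist, i - dist
        if fwd < len(images) and not is_black(images[fwd]):
            return fwd
        if backward and bwd >= 0 and not is_black(images[bwd]):
            return bwd
    return None


def fill_black(images: Sequence[Image], backward: bool = False) -> List[Image]:
    """Replace each black image with its nearest valid temporal neighbor."""
    filled = []
    for i, img in enumerate(images):
        j = nearest_valid(images, i, backward) if is_black(img) else i
        # If no valid neighbor exists, the black image is left in place.
        filled.append(images[j] if j is not None else img)
    return filled


black, a, b = [[0, 0], [0, 0]], [[1, 2], [3, 4]], [[5, 6], [7, 8]]
series = [a, black, black, black, b, black]

forward_only = fill_black(series, backward=False)     # [a, b, b, b, b, black]
forward_backward = fill_black(series, backward=True)  # [a, a, b, b, b, b]
```

In this toy series the last image has no future neighbor, so the single-forward variant leaves it black, while the forward-backward variant fills it from the past.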


Data Description

Satellite imagery extraction

In the process of extracting satellite imagery, we obtained 12,636 images, one per municipality per epi-week (81 municipalities × 156 epi-weeks = 12,636), from Sentinel-2-L1C and Sentinel-2-L1A products with a 5-day revisit time. Using the Sentinel Hub API, we upsampled all bands to a 10 m resolution. The spectral bands covered are B02, B03, B04, and B08 (native 10 m); B05, B06, B07, B11, and B12 (native 20 m); and B01, B09, and B10 (native 60 m).
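As an illustration of what upsampling to a common 10 m grid means, the sketch below repeats each pixel of a coarser band by the ratio of its native resolution to 10 m (nearest-neighbor interpolation). It is a plain-Python toy under stated assumptions: the function names and the nested-list band representation are hypothetical, and the actual resampling in this dataset was performed through the Sentinel Hub API.

```python
from typing import List

Band = List[List[float]]  # toy band as nested lists (assumption)

# Native ground resolution in metres of the bands named above.
NATIVE_RES_M = {
    "B02": 10, "B03": 10, "B04": 10, "B08": 10,
    "B05": 20, "B06": 20, "B07": 20, "B11": 20, "B12": 20,
    "B01": 60, "B09": 60, "B10": 60,
}


def upsample_nearest(band: Band, factor: int) -> Band:
    """Repeat every pixel factor x factor times (nearest-neighbor)."""
    out = []
    for row in band:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def to_10m(band: Band, name: str) -> Band:
    """Bring a band to the 10 m grid using its native resolution."""
    return upsample_nearest(band, NATIVE_RES_M[name] // 10)


b05 = [[1.0, 2.0], [3.0, 4.0]]  # toy 2x2 tile at 20 m
b05_10m = to_10m(b05, "B05")    # 4x4 tile at 10 m
b01_10m = to_10m(b05, "B01")    # same toy tile treated as a 60 m band: 12x12
```

A 20 m band therefore grows by a factor of 2 along each axis and a 60 m band by a factor of 6, while the 10 m bands are unchanged.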

In the first stage, a customized framework facilitated the spatiotemporal retrieval of satellite images based on Epi-week, allowing users to specify various parameters. This framework, dockerized for reproducibility, used the Sentinel Hub API to download the best image per epi-week, prioritizing the least cloud coverage through the least cloud coverage mosaicking order algorithm.
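The selection idea behind choosing the best image per epi-week can be sketched as follows. This is a simplified stand-in, not the Sentinel Hub mosaicking implementation: the `Acquisition` class, its fields, and the tie-breaking rule are hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class Acquisition:
    """Hypothetical record for one satellite pass over a municipality."""
    day: date
    cloud_cover: float  # percentage, 0-100


def best_of_week(candidates: Iterable[Acquisition]) -> Acquisition:
    """Pick the pass with the least cloud cover; the earliest wins ties."""
    return min(candidates, key=lambda a: (a.cloud_cover, a.day))


# Two passes in the same Sunday-to-Saturday week (toy values).
week = [
    Acquisition(date(2016, 1, 3), 62.0),
    Acquisition(date(2016, 1, 8), 11.5),
]
chosen = best_of_week(week)
```

With a 5-day revisit time, a week typically offers one or two candidate passes, so the choice reduces to keeping the clearer of the two.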

In the second stage, a recursive process removed black images and addressed instances of repeated images, improving quality by reducing cloud occlusion and noise artifacts. The dockerized framework supports reproducibility and scalability for future deployments.

In the third stage, a hash analysis was implemented to assess spatiotemporal variation and identify duplicated images. The difference hash (dhash), which is robust to changes in scale, brightness, and contrast, generated a compact fingerprint for each image per epi-week. This hash function, denoted F, maps each image I in the collection to a small, fixed-size output y = F(I), so that duplicated images can be detected by comparing fingerprints, supporting checks of data integrity.
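A minimal sketch of a difference hash is given below to make the fingerprinting step concrete. It follows the common dhash recipe (shrink to a small grayscale grid, then compare horizontally adjacent pixels) and is not taken from the authors' code: the function names, the nearest-neighbor resize, and the 8x8 hash size are assumptions.

```python
from typing import List

Gray = List[List[float]]  # toy grayscale image as nested lists (assumption)


def resize_nearest(img: Gray, width: int, height: int) -> Gray:
    """Resample a grayscale image by nearest-neighbor sampling."""
    src_h, src_w = len(img), len(img[0])
    return [
        [img[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def dhash(img: Gray, hash_size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    small = resize_nearest(img, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


gradient = [[float(c) for c in range(18)] for _ in range(16)]
brighter = [[2.0 * v + 10.0 for v in row] for row in gradient]  # gain + offset
flipped = [row[::-1] for row in gradient]                       # different content

same = dhash(gradient) == dhash(brighter)
distance = hamming(dhash(gradient), dhash(flipped))
```

Because only the sign of neighboring differences is kept, a brightness offset or a positive contrast gain leaves the fingerprint unchanged, whereas a genuinely different image yields a large Hamming distance. Two epi-week images with identical fingerprints are candidates for duplicates.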

Multilabel Metadata

Municipality-specific metadata extraction was achieved through the utilization of unique municipality codes in Colombia. The extracted data encompassed both static and dynamic information, with time resolutions varying between weeks and months. Static data included variables representing social determinants of health (SDOH), such as poverty indices, school and water access, as well as sociodemographic factors sourced from the National Administrative Department of Statistics of Colombia (DANE) based on the 2018 census [24].

Dynamic data comprise epidemiological and climatic metadata. Dengue cases per epi-week were sourced from the Colombian Public Health Surveillance System (SIVIGILA) website [25]; dengue was chosen as a case study because of its sensitivity to climate change. Monthly climatic variables, including temperature and precipitation, were extracted for each of the 81 municipalities using WorldClim.

In the top ten municipalities with the highest dengue cases, weekly temperature and precipitation data were extracted for baseline model generation using Google Earth Engine. Daily temperature data came from MODIS, while precipitation data utilized CHIRPS [26-27]. These daily values were grouped by the coordinates of the region of interest (ROI) to derive mean temperature and cumulative precipitation per Epi-week.
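The aggregation of daily values into epi-week statistics can be sketched as below. The sketch assumes Sunday-to-Saturday weeks and a simple `(day, temperature, precipitation)` tuple layout; both are illustrative assumptions, the function names are hypothetical, and the authors' Google Earth Engine workflow is not reproduced here.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple

# (day, temperature in deg C, precipitation in mm): assumed record layout
Daily = Tuple[date, float, float]


def week_start(day: date) -> date:
    """Sunday that opens the Sunday-to-Saturday week containing `day`."""
    return day - timedelta(days=(day.weekday() + 1) % 7)


def weekly_summary(records: Iterable[Daily]) -> Dict[date, Tuple[float, float]]:
    """Map each week start to (mean temperature, cumulative precipitation)."""
    temps = defaultdict(list)
    rain = defaultdict(float)
    for day, temp, precip in records:
        key = week_start(day)
        temps[key].append(temp)
        rain[key] += precip
    return {k: (sum(v) / len(v), rain[k]) for k, v in sorted(temps.items())}


records = [
    (date(2016, 1, 3), 24.0, 0.0),   # Sunday
    (date(2016, 1, 6), 26.0, 5.0),
    (date(2016, 1, 9), 28.0, 2.5),   # Saturday, same week
    (date(2016, 1, 10), 30.0, 1.0),  # next Sunday, new week
]
summary = weekly_summary(records)
```

Temperature is averaged because it is a state variable, while precipitation is summed because it accumulates over the week.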

Metadata Descriptor: The CSV file (metadata.csv) contains organized socioeconomic, demographic, dengue, and climate-related statistics for the municipalities between 2007 and 2019. It supports in-depth analysis of municipalities and provides insight into environmental and socioeconomic issues during that time frame.
Each municipality and image is identified by the columns:

  • Municipality code: Unique identifier for each municipality.
  • Municipality: Name of the municipality.

The following information is available for each municipality:

1. Population Year
   - Description: Total population of the municipality in the year for which demographic and socioeconomic data are reported.
   - Data Type: Numeric (Integer)

2. Age Initial Range-Final Range (%)
   - Description: Age distribution represented as a percentage range from the initial to the final age range.
   - Data Type: Numeric (Percentage)

3. Afrocolombian Population (%)
   - Description: Percentage of the population identifying as Afrocolombian.
   - Data Type: Numeric (Percentage)

4. Indian Population (%)
   - Description: Percentage of the population identifying as Native American.
   - Data Type: Numeric (Percentage)

5. People with Disabilities (%)
   - Description: Percentage of the population with disabilities.
   - Data Type: Numeric (Percentage)

6. People who cannot read or write (%)
   - Description: Percentage of the population unable to read or write.
   - Data Type: Numeric (Percentage)

7. Secondary / Higher Education (%)
   - Description: Percentage of the population with secondary or higher education.
   - Data Type: Numeric (Percentage)

8. Employed Population (%)
   - Description: Percentage of the population employed.
   - Data Type: Numeric (Percentage)

9. Unemployed Population (%)
   - Description: Percentage of the population unemployed.
   - Data Type: Numeric (Percentage)

10. People Doing Housework (%)
    - Description: Percentage of the population engaged in housework.
    - Data Type: Numeric (Percentage)

11. Retired People (%)
    - Description: Percentage of the population retired.
    - Data Type: Numeric (Percentage)

12. Men (%), Women (%)
    - Description: Percentage distribution of the population by gender.
    - Data Type: Numeric (Percentage)

13. Households Without Water Access (%)
    - Description: Percentage of households without access to water.
    - Data Type: Numeric (Percentage)

14. Households Without Internet Access (%)
    - Description: Percentage of households without access to the internet.
    - Data Type: Numeric (Percentage)

15. Building Stratification (Value between 1 and 6) (%)
    - Description: Percentage distribution of building stratification values ranging from 1 to 6.
    - Data Type: Numeric (Percentage)

16. Number of Hospitals per km^2
    - Description: Density of hospitals per square kilometer.
    - Data Type: Numeric (Density)

17. Number of Houses per km^2
    - Description: Density of houses per square kilometer.
    - Data Type: Numeric (Density)

18. Municipality Code
    - Description: Code assigned to each municipality.
    - Data Type: Numeric (Code)

19. Municipality Name
    - Description: Name of the municipality.
    - Data Type: Text (String)

20. Dengue Cases
    - Description: Number of reported dengue cases in that municipality.
    - Data Type: Numeric (Count)

21. Monthly Temperature and Precipitation
    - Description: Monthly temperature and precipitation values for each year from 2007 to 2018. Temperature columns are named TEMPERATURE_jan_07, TEMPERATURE_feb_07, ..., TEMPERATURE_dec_18; precipitation columns are named PRECIPITATION_jan_07, PRECIPITATION_feb_07, ..., PRECIPITATION_dec_18.
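The monthly column names follow a regular pattern, so they can be generated rather than typed by hand. The sketch below assumes that the months between the examples shown (jan, feb, dec) use lowercase three-letter English abbreviations and that the year is written with two digits; users should confirm this against the header of the CSV file. The helper name is hypothetical.

```python
from typing import List

# Assumption: lowercase three-letter English month abbreviations.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def monthly_columns(variable: str, first_year: int = 2007,
                    last_year: int = 2018) -> List[str]:
    """Build names such as TEMPERATURE_jan_07 in chronological order."""
    return [
        f"{variable}_{month}_{year % 100:02d}"
        for year in range(first_year, last_year + 1)
        for month in MONTHS
    ]


temperature_cols = monthly_columns("TEMPERATURE")
precipitation_cols = monthly_columns("PRECIPITATION")
```

Twelve years of monthly values give 144 columns per variable, which can then be used to select or reshape the climate block of the metadata table.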

Dataset description:

We are releasing a dataset of Colombian municipalities with their corresponding metadata, aligned temporally by name.

  • 10_municipalities: This dataset comprises the top 10 cities in Colombia with the highest Dengue proliferation. The data has undergone single-forward artifact removal, as detailed in the methods section.

  • 81_municipalities_v1.0: Featuring information from 81 municipalities in Colombia, this dataset provides a comprehensive overview of sociodemographic-related data. It is processed using a single forward artifact removal method, as explained in the methods section.

  • 81_municipalities_v2.0: In this updated version, labeled as Version 2.0, black images have been replaced using the forward-backward artifact removal technique outlined in the methods section.

We substituted the prefix "image" with the respective municipality code for 81_municipalities_v1.0 and 81_municipalities_v2.0. This adjustment enhances the dataset's clarity and facilitates a more intuitive understanding of the data associated with each municipality.
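Because the image file names start with the municipality code, an image can be linked to its metadata row with a simple lookup. The sketch below is illustrative only: the file name pattern `<code>_<date>.tiff`, the toy CSV rows, and the helper names are assumptions, while the column names "Municipality code" and "Municipality" are those listed in the metadata descriptor.

```python
import csv
import io
from typing import Dict

# Toy stand-in for the metadata file (illustrative rows only).
TOY_METADATA = """Municipality code,Municipality
41001,Neiva
76001,Cali
"""


def load_metadata(handle) -> Dict[str, Dict[str, str]]:
    """Index metadata rows by municipality code, kept as text."""
    return {
        row["Municipality code"].strip(): row
        for row in csv.DictReader(handle)
    }


def code_from_filename(filename: str) -> str:
    """Take the text before the first underscore as the municipality code.

    Assumption: the real separator and file layout may differ.
    """
    return filename.split("_", 1)[0]


metadata = load_metadata(io.StringIO(TOY_METADATA))
row = metadata[code_from_filename("76001_2016-01-03.tiff")]
```

For the released files, `io.StringIO(TOY_METADATA)` would be replaced by an open handle on the metadata CSV, after checking how the codes and file names are actually formatted.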


Usage Notes

All scripts used in these processes are written in Python and are openly accessible. The code for the satellite extractor is available on GitHub [28], and models using this dataset are openly available on Hugging Face [29]. The dataset has also been used for dengue prediction in work accepted to the IJCAI 2023 Workshop on Bridge-AI: from Climate Change to Health Equity (BridgeAICCHE) [30].

This dataset is designed for reuse with different metadata. If the additional metadata is also spatiotemporally aligned at consistent time intervals, the dataset can be extended to forecast other targets. The extraction framework, in turn, is applicable to any location and timestamp worldwide, which allows it to be combined with other metadata for the development of multimodal applications.

However, users should be aware of certain limitations. The selection of municipalities is driven by dengue proliferation in Colombian cities. Users who need data for other municipalities, countries, or applications should use the satellite extractor framework to obtain a spatially unbiased extension. The current extraction follows the epi-week calendar, but temporal sampling can be customized, for example to 5 or 6 weeks. Note also that non-endemic regions, such as mountainous cities, may be excluded, which matters when applying dengue forecasting to other locations.


Ethics

The data is sourced from open-access and publicly available repositories, eliminating the need for additional permissions. 


Acknowledgements

This project is supported by the ESA Network of Resources Initiative (Request ID: 1c081a) and benefits from Oracle for Research Cloud Credits through the Oracle for Research Program for the "Towards a Smart Eco-epidemiological Model of Dengue in Colombia using Satellite Images" initiative. MIT Critical Data team: Sebastian Cajas, Dana Moukheiber, David Restrepo, David Santiago Chicangana, María Patricia Arbeláez Montoya, Lama Moukheiber, Chenwei Wu, Kuan-Ting Ku, Po-Chih Kuo, Sulaiman Moukheiber, Kuan-Ting Kuo, Juan Sebastian Osorio-Valencia, Saptarshi Purkayastha, Mira Moukheiber, Atika Rahman Paddo, Braiam Escobar, Cheng Che Tsai, Wilson Arbey Diaz, Luis Jesús Martínez, Alessa Álvarez, Siyi Tang, Amara Tariq, Imon Banerjee, Aakanksha Rana, Maria Patricia Arbelaez-Montoya, Laura Sofía Daza Rosero, Jhon Fredy Romero Núñez, Ivan Darío Velez, Diego M. López, Leo Anthony Celi.


Conflicts of Interest

The authors declare no competing interests.


References

  1. United Nations. THE 17 GOALS | Sustainable Development. [Online]. Available from: https://sdgs.un.org/goals [Accessed 21 December 2021].
  2. Castro, D. A. & Álvarez, M. A. Predicting socioeconomic indicators using transfer learning on imagery data: an application in Brazil. GeoJournal 88, 1081–1102 (2023).
  3. Hall, O., Ohlsson, M. & Rögnvaldsson, T. A review of explainable AI in the satellite data, deep machine learning, and human poverty domain. Patterns 3, 100600 (2022).
  4. Hargreaves, P. K. & Watmough, G. R. Satellite Earth observation to support sustainable rural development. Int. J. Appl. Earth Obs. Geoinformation 103, 102466 (2021).
  5. Kaselimi, M., Voulodimos, A., Daskalopoulos, I., Doulamis, N. & Doulamis, A. A Vision Transformer Model for Convolution-Free Multilabel Classification of Satellite Imagery in Deforestation Monitoring. IEEE Trans. Neural Netw. Learn. Syst. 34, 3299–3307 (2023).
  6. Jean, N. et al. Combining satellite imagery and machine learning to predict poverty. Science 353, 790–794 (2016).
  7. Chitturi, V. & Nabulsi, Z. Predicting Poverty Level from Satellite Imagery using Deep Neural Networks. Preprint at https://doi.org/10.48550/arXiv.2112.00011 (2021).
  8. Bhatia, S. et al. A Retrospective Study of Climate Change Affecting Dengue: Evidences, Challenges and Future Directions. Front. Public Health 10, 884645 (2022).
  9. Kurane, I. The Effect of Global Warming on Infectious Diseases. Osong Public Health Res. Perspect. 1, 4–9 (2010).
  10. Gibbons, R. V. & Vaughn, D. W. Dengue: an escalating problem. BMJ 324, 1563–1566 (2002).
  11. Holmes Fee C, Hicklen RS, Jean S, Abu Hussein N, Moukheiber L, de Lota MF, Moukheiber M, Moukheiber D, Anthony Celi L, Dankwa-Mullan I. Strategies and solutions to address Digital Determinants of Health (DDOH) across underinvested communities. PLOS digital health. 2023 Oct 12;2(10):e0000314.
  12. Phuong J, Ordóñez P, Cao J, Moukheiber M, Moukheiber L, Caspi A, Swenor BK, Naawu DK, Mankoff J. Telehealth and digital health innovations: A mixed landscape of access. PLOS Digital Health. 2023 Dec 15;2(12):e0000401.
  13. Z. Han, C. Cui, Y. Kong, Q. Li, Y. Chen, and X. Chen, “Improving educational equity by maximizing service coverage in rural Changyuan, China: An evaluation-optimization-validation framework based on spatial accessibility to schools,” Appl. Geogr., vol. 152, p. 102891, Mar. 2023, doi: 10.1016/j.apgeog.2023.102891.
  14. Alberto IR, Alberto NR, Ghosh AK, Jain B, Jayakumar S, Martinez-Martin N, McCague N, Moukheiber D, Moukheiber L, Moukheiber M, Moukheiber S. The impact of commercial health datasets on medical research and health-care algorithms. The Lancet Digital Health. 2023 May 1;5(5):e288-94.
  15. I. Kurane, “The Effect of Global Warming on Infectious Diseases,” Osong Public Health Res. Perspect., vol. 1, no. 1, pp. 4–9, Dec. 2010, doi: 10.1016/j.phrp.2010.12.004.
  16. D. J. Rogers, S. E. Randolph, R. W. Snow, and S. I. Hay, “Satellite imagery in the study and forecast of malaria,” Nature, vol. 415, no. 6872, Art. no. 6872, Feb. 2002, doi: 10.1038/415710a.
  17. R. W. Snow, M. Craig, U. Deichmann, and K. Marsh, “Estimating mortality, morbidity and disability due to malaria among Africa’s non-pregnant population,” Bull. World Health Organ., vol. 77, no. 8, pp. 624–640, 1999.
  18. M. F. Myers, D. J. Rogers, J. Cox, A. Flahault, and S. I. Hay, “Forecasting disease risk for increased epidemic preparedness in public health,” Adv. Parasitol., vol. 47, pp. 309–330, 2000, doi: 10.1016/s0065-308x(00)47013-2.
  19. S. Bhatia, D. Bansal, S. Patil, S. Pandya, Q. M. Ilyas, and S. Imran, “A Retrospective Study of Climate Change Affecting Dengue: Evidences, Challenges and Future Directions,” Front. Public Health, vol. 10, p. 884645, May 2022, doi: 10.3389/fpubh.2022.884645
  20. D. S. Shepard, E. A. Undurraga, and Y. A. Halasa, “Economic and Disease Burden of Dengue in Southeast Asia,” PLoS Negl. Trop. Dis., vol. 7, no. 2, p. e2055, Feb. 2013, doi: 10.1371/journal.pntd.0002055
  21. O. Mudele, A. C. Frery, L. F. R. Zanandrez, A. E. Eiras, and P. Gamba, “Dengue Vector Population Forecasting Using Multisource Earth Observation Products and Recurrent Neural Networks,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 14, pp. 4390–4404, 2021, doi: 10.1109/JSTARS.2021.3073351.
  22. A. Hussain, F. Ali, O. B. Latiwesh, and S. Hussain, “A Comprehensive Review of the Manifestations and Pathogenesis of Zika Virus in Neonates and Adults,” Cureus, vol. 10, no. 9, p. e3290, doi: 10.7759/cureus.3290.
  23. V. S. H. Rao and R. Durvasula, Eds., Dynamic Models of Infectious Diseases: Volume 1: Vector-Borne Diseases. New York, NY: Springer, 2013. doi: 10.1007/978-1-4614-3961-5.
  24. Departamento Administrativo Nacional de Estadística - DANE. DANE - Censo Nacional de Población y Vivienda 2018. [Online]. Available from: https://www.dane.gov.co/index.php/estadisticas-por-tema/demografia-y-poblacion/censo-nacional-de-poblacion-y-vivenda-2018 [Accessed 21 December 2021].
  25. Sistema de Salud Pública. PortalSivigila2019 Estadísticas de Vigilancia Rutinaria. [Online]. Available from: https://portalsivigila.ins.gov.co/Paginas/Vigilancia-Rutinaria.aspx [Accessed 21 December 2021].
  26. NASA. MODIS Web. [Online]. Available from: https://modis.gsfc.nasa.gov/about/ [Accessed 21 December 2021].
  27. Climate Hazards Group InfraRed - UC Santa Barbara. CHIRPS: Rainfall Estimates from Rain Gauge and Satellite Observations | Climate Hazards Center - UC Santa Barbara. [Online]. Available from: https://www.chc.ucsb.edu/data/chirps [Accessed 21 December 2021].
  28. Satellite.extractor. [Online]. Available from: https://github.com/sebasmos/satellite.extractor [Accessed 21 December 2021].
  29. Cajas Ordonez SA, Restrepo D, López DM, Chicangana DS, Celi LA. MIT Critical Data. Huggingface. 2023. Available at https://huggingface.co/MITCriticalData [Accessed 1/30/2024]
  30. Kuo KT, Moukheiber D, Ordonez SC, Restrepo D, Paddo AR, Chen TY, Moukheiber L, Moukheiber M, Moukheiber S, Purkayastha S, Kuo PC. DengueNet: Dengue Prediction using Spatiotemporal Satellite Imagery for Resource-Limited Countries. arXiv preprint arXiv:2401.11114. 2024 Jan 20.

Access

Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.

License (for files):
Creative Commons Zero 1.0 Universal Public Domain Dedication


Files

Total uncompressed size: 65.0 GB.

Access the files
Name Size Modified
10_municipalities
81_municipalities_v1.0
81_municipalities_v2.0
LICENSE.txt 6.5 KB 2024-01-30
SHA256SUMS.txt 3.2 MB 2024-01-30
metadata.csv 7.1 MB 2023-12-17