Articles

0404/2025 - Use of artificial intelligence and environmental data to estimate respiratory hospitalizations in support of SUS management
Uso de inteligência artificial e dados ambientais na estimativa de internações respiratórias como apoio à gestão no SUS

Author:

• Gabriel Fuscald Scursone - Scursone, GF - <gabriel_scursone@hotmail.com>
ORCID: https://orcid.org/0000-0001-7684-5681

Co-author(s):

• Diana Francisca Adamatti - Adamatti, DF - <dianaada@gmail.com>
ORCID: https://orcid.org/0000-0003-3829-3075

• Flavio Manoel Rodrigues da Silva Júnior - Silva Júnior, FMR - <f.m.r.silvajunior@gmail.com>
ORCID: https://orcid.org/0000-0002-7344-4679

• Ronan Adler Tavella - Tavella, RA - <ronan_tavella@hotmail.com>
ORCID: https://orcid.org/0000-0003-2436-4186

• Alicia da Silva Bonifácio - Bonifácio, AS - <aliciabonifaciob@gmail.com>
ORCID: https://orcid.org/0000-0002-5581-9844

• Ricardo Arend Machado - Machado, RA - <ricardoarend@gmail.com>
ORCID: https://orcid.org/0009-0000-8614-2138

• Elizabet Saes-Silva - Saes-Silva, E - <betssaes@gmail.com>
ORCID: https://orcid.org/0000-0003-2356-7774

• Washington Correia Filho - Correia Filho, W - <wlfcfm@gmail.com>
ORCID: https://orcid.org/0000-0002-4029-4491



Abstract:

This study developed a predictive model of hospital admissions due to respiratory diseases using environmental and meteorological data from the state of São Paulo (2017–2022). It analyzed hospitalizations classified under ICD-10 codes J00–J99. Weekly public data on air pollutants (PM₂.₅, PM₁₀, O₃, NO₂, SO₂, CO) and climate variables (temperature, humidity, precipitation, among others) were used. The methodology included feature engineering (lags, moving averages, interactions, trend, seasonality), Lasso regression for variable selection, and application of the CatBoost algorithm optimized via GridSearchCV. The model showed strong performance (R² ≈ 0.895), with good accuracy in estimating healthcare demand. The SHapley Additive exPlanations (SHAP) technique ensured model explainability, identifying the most influential predictors of respiratory admissions. The results highlight the potential of AI as a strategic Digital Health tool, especially for early outbreak detection and resource allocation within Brazil’s Unified Health System (SUS).

Keywords:

Artificial Intelligence; Respiratory Tract Diseases; Air Pollution; Predictive Models; Unified Health System.

Abstract (Portuguese):

Este estudo desenvolveu um modelo preditivo de internações hospitalares por doenças respiratórias no estado de São Paulo, com base em dados ambientais e meteorológicos de 2017 a 2022. Foram analisadas internações segundo os códigos J00–J99 da CID-10. Utilizaram-se dados públicos organizados semanalmente sobre poluentes atmosféricos (PM₂.₅, PM₁₀, O₃, NO₂, SO₂, CO) e variáveis climáticas (temperatura, umidade, precipitação, entre outras). A metodologia envolveu engenharia de atributos (defasagens, médias móveis, interações, tendência, sazonalidade), seleção de variáveis com regressão Lasso e aplicação do algoritmo CatBoost, otimizado via GridSearchCV. O modelo apresentou bom desempenho (R² ≈ 0,895), com acurácia nas estimativas da demanda assistencial. A técnica SHAP (Explicações Aditivas de Shapley) foi empregada para garantir a explicabilidade do modelo, permitindo identificar os preditores mais influentes. Os achados destacam o potencial da inteligência artificial como ferramenta da Saúde Digital, especialmente para detecção precoce de surtos e alocação eficiente de recursos no SUS.

Keywords (Portuguese):

Poluição do Ar; Doenças Respiratórias; Inteligência Artificial; Modelos Preditivos; Sistema Único de Saúde.

Content:

INTRODUCTION
The growing incorporation of digital technologies into health management has highlighted the potential of artificial intelligence (AI) as a decision-support tool. Initiatives such as the Digital Health in Maré project, developed by the Oswaldo Cruz Foundation (Fiocruz) in partnership with the Brazilian Ministry of Health and the Municipal Health Department of Rio de Janeiro, demonstrate the feasibility of co-creating digital solutions within communities. This was exemplified by workshops held in the Nova Holanda neighborhood, where technologies were adapted to local practices, enabling patients to receive medical consultations without leaving their homes¹,².
In the Brazilian context, Ordinance GM/MS No. 3,691, dated May 23, 2024, identifies Digital Health as a foundational axis of primary care and health surveillance, establishing guidelines for the use of artificial intelligence (AI) in the identification of risk factors and the anticipation of healthcare demands². Within this framework, the integration of predictive models into the databases of the Unified Health System (SUS) represents a promising path toward promoting efficiency, transparency, and improved service delivery, especially in vulnerable regions.
One of the main challenges faced by Brazilian urban centers is the increasing burden of respiratory diseases linked to air pollution and climate variability. Long-term exposure to fine particulate matter (PM₂.₅) has been associated with a wide spectrum of adverse health outcomes, including bronchitis, lung cancer, and cardiovascular diseases³??. In densely populated cities, these effects are further intensified by the urban heat island phenomenon, which amplifies both pollutant concentration and heat retention?. PM₂.₅, in particular, is especially concerning due to its high toxicity, its ability to penetrate deep into the pulmonary system, and its capacity to carry harmful substances?.
In this context, smart urban ecosystems, supported by technological infrastructure and data-driven governance, are emerging as strategic approaches to mitigate the effects of climate change and improve air quality?. The use of AI in smart cities has shown promise by enabling real-time analysis of environmental indicators and facilitating the implementation of preventive measures with greater agility?. Machine learning–based models, such as neural networks, have been applied to predict respiratory hospitalizations based on air pollutant levels¹?, highlighting their potential to support public policy.
In Brazil, air pollution remains a pressing concern. Tavella et al.¹¹ observed that temperature increases led to ozone concentration rises of up to 14% in Porto Alegre (RS), particularly during the summer, along with significant seasonal variations in particulate matter levels. These findings underscore the importance of seasonally sensitive models tailored to local contexts. The methodological approach proposed by Tavella et al.¹¹, which involved projecting future scenarios based on climate data, highlights the role of AI in designing preventive strategies. The present study aligns with this perspective by proposing a predictive model for respiratory hospitalizations in the state of São Paulo, grounded in environmental and meteorological data.
The model developed in this study combines technical innovation, explainability, and practical applicability, aligning with the guidelines established by Ordinance GM/MS No. 3,691/2024². The model was trained using the CatBoost algorithm¹², a gradient boosting library based on decision trees, commonly employed for regression and classification tasks. Weekly data from 2017 to 2022 were used for training, incorporating temporal variables (lags, moving averages, trend, and seasonality), pollutant–climate interactions, feature selection via Lasso regression, and cross-validation through GridSearchCV¹?,¹?.
Given the growing impact of environmental changes on public health, this study aimed to develop a predictive model of hospital admissions due to respiratory diseases using environmental and meteorological variables through artificial intelligence algorithms. The proposed model demonstrated high performance and practical applicability, positioning itself as a strategic tool for the management of healthcare networks. Its implementation can support decision-making processes related to resource allocation, early outbreak detection, and mitigation of health burdens, particularly in contexts marked by high socio-environmental vulnerability. By integrating data science, epidemiological surveillance, and climate variables, this research contributes to strengthening the analytical capacity of public health managers and expands the methodological toolkit available within Brazil’s Unified Health System (SUS), in alignment with contemporary guidelines for digital health and evidence-based governance.
METHODS
The study area encompasses the state of São Paulo, located in southeastern Brazil, covering a total of 248,219.485 km², with a Human Development Index (HDI) of 0.806 in 2021. The analysis period spanned from January 2017 to December 2022. Publicly available and anonymized secondary data were used, sourced from databases such as the Department of Informatics of the Unified Health System (DATASUS)¹³, the National Institute of Meteorology (INMET)¹?, and the Environmental Company of the State of São Paulo (CETESB)¹?. The dataset included hospital admissions for respiratory diseases (ICD-10 codes J00–J99), atmospheric pollutant concentrations (PM₂.₅, PM₁₀, O₃, NO₂, SO₂, CO), and meteorological conditions (mean temperature, relative humidity, precipitation, solar radiation, wind speed, and atmospheric pressure).
All data were integrated and organized into a single dataset using the Python programming language. Initially, daily records were transformed into weekly averages (for environmental and meteorological variables) and weekly sums (for the number of hospital admissions), using the epidemiological week as the temporal unit. According to the Brazilian Ministry of Health, the epidemiological week is a standardized count that begins on Sundays and is widely adopted in surveillance systems for monitoring health conditions¹³,²?.
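The weekly aggregation described above can be sketched with pandas. This is a minimal illustration under stated assumptions: the column names and values are hypothetical, and the Saturday-anchored weekly bins only approximate the Sunday-to-Saturday epidemiological week (they do not reproduce the official week numbering).

```python
import numpy as np
import pandas as pd

# Hypothetical daily records; column names and values are illustrative only.
days = pd.date_range("2017-01-01", periods=28, freq="D")  # 2017-01-01 is a Sunday
daily = pd.DataFrame(
    {
        "pm25": np.linspace(10.0, 37.0, 28),         # pollutant concentration
        "temp_mean": np.linspace(20.0, 26.75, 28),   # meteorological variable
        "admissions": np.arange(100, 128),           # daily hospital admissions
    },
    index=days,
)

# "W-SAT" bins end on Saturday, so each bin runs Sunday through Saturday,
# matching the Sunday start of the epidemiological week.
weekly = daily.resample("W-SAT").agg(
    {"pm25": "mean", "temp_mean": "mean", "admissions": "sum"}
)
print(weekly)
```

Environmental and meteorological columns are averaged while admissions are summed, mirroring the distinction made in the text.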
Temporal variables were also created, including the month and the season corresponding to each epidemiological week, categorized as summer (December to February), autumn (March to May), winter (June to August), or spring (September to November).
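As a minimal sketch of these temporal variables, the month-to-season mapping can be written as a lookup table. The dates are hypothetical, and assigning each week to the month of its last day is an assumption, since the text does not say how weeks spanning two months were handled.

```python
import pandas as pd

# Southern Hemisphere seasons, as defined in the text.
SEASON_BY_MONTH = {
    12: "summer", 1: "summer", 2: "summer",
    3: "autumn", 4: "autumn", 5: "autumn",
    6: "winter", 7: "winter", 8: "winter",
    9: "spring", 10: "spring", 11: "spring",
}

# Hypothetical week-ending dates (Saturdays).
weeks = pd.DataFrame(
    {
        "week_end": pd.to_datetime(
            ["2017-01-07", "2017-04-15", "2017-07-22", "2017-10-28", "2017-12-02"]
        )
    }
)
weeks["month"] = weeks["week_end"].dt.month        # assumption: month of the week's last day
weeks["season"] = weeks["month"].map(SEASON_BY_MONTH)
print(weeks)
```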
Feature Engineering
A feature engineering step was subsequently conducted. Continuous variables were smoothed through truncation, limiting their values to the range between the 1st and 99th percentiles in order to reduce the influence of extreme values (outliers). For the dependent variable (hospital admissions), lag features were created from 1 to 10 weeks, along with moving averages over 3, 7, and 14 weeks, and a variable representing weekly variation (delta). Additionally, the hospitalization time series was decomposed using the classical additive model to extract trend and seasonality components.
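The truncation, lag, moving-average, and delta features can be sketched as follows. The series is synthetic and the column names are hypothetical. Shifting the moving averages and the delta by one week, so that a week's own value never enters its predictors, is an assumption of this sketch; the text does not state how the windows were aligned. The additive decomposition is not covered here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical weekly admissions series (synthetic values).
adm = pd.Series(rng.poisson(1000, size=60).astype(float), name="admissions")

# Truncate to the 1st-99th percentile range to damp extreme values.
low, high = adm.quantile([0.01, 0.99])
adm = adm.clip(lower=low, upper=high)

feats = pd.DataFrame({"admissions": adm})
for k in range(1, 11):                    # lags of 1 to 10 weeks
    feats[f"lag_{k}"] = adm.shift(k)
for w in (3, 7, 14):                      # moving averages over past weeks
    # shift(1) keeps the current week's value out of its own predictor
    feats[f"ma_{w}"] = adm.shift(1).rolling(w).mean()
feats["delta"] = adm.diff().shift(1)      # last observed week-to-week change

model_rows = feats.dropna()               # rows with a complete feature history
print(model_rows.shape)
```

The longest window (14 weeks, shifted by one) determines how many initial weeks are lost to missing history.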
Derived variables were computed for air pollutants and meteorological conditions, including moving averages over 3, 7, 14, and 21 days, as well as rolling standard deviations over 7 and 14 days. Interaction terms were incorporated, such as PM₂.₅ × relative humidity, O₃ × temperature, and NO₂ × atmospheric pressure. In addition, binary variables were created to indicate exceedance of recommended environmental thresholds, such as PM₂.₅ > 25 µg/m³ and O₃ > 100 µg/m³.
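The interaction terms and exceedance indicators amount to element-wise products and threshold comparisons. A minimal sketch, with hypothetical column names and values:

```python
import pandas as pd

# Hypothetical weekly means; column names and values are illustrative only.
env = pd.DataFrame(
    {
        "pm25": [12.0, 31.5, 25.0],              # µg/m³
        "o3": [80.0, 120.0, 100.0],              # µg/m³
        "no2": [30.0, 45.0, 38.0],               # µg/m³
        "rel_humidity": [75.0, 60.0, 68.0],      # %
        "temp_mean": [22.0, 27.0, 24.0],         # °C
        "atm_pressure": [925.0, 930.0, 928.0],   # hPa
    }
)

# Pollutant-climate interaction terms named in the text.
env["pm25_x_humidity"] = env["pm25"] * env["rel_humidity"]
env["o3_x_temp"] = env["o3"] * env["temp_mean"]
env["no2_x_pressure"] = env["no2"] * env["atm_pressure"]

# Strict inequalities, as written in the text (> 25 and > 100 µg/m³).
env["pm25_exceeds"] = (env["pm25"] > 25).astype(int)
env["o3_exceeds"] = (env["o3"] > 100).astype(int)
print(env[["pm25_exceeds", "o3_exceeds"]])
```

A week sitting exactly on a threshold (third row) is not flagged, because the comparison is strict.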
Following the enrichment of the dataset, missing values were addressed through appropriate imputation procedures. Categorical variables, such as month and season, were encoded using one-hot encoding to facilitate their inclusion in the model. Numerical features were standardized using Z-score normalization to ensure comparability across variables. To reduce dimensionality and address multicollinearity, Lasso regression was applied with a regularization parameter of 0.01 and a maximum of 10,000 iterations. This approach enabled the identification of the most relevant explanatory features, contributing to improved model interpretability and predictive performance.
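A minimal sketch of the encoding, standardization, and Lasso selection steps is shown below. It assumes scikit-learn (the text names GridSearchCV but not the Lasso implementation) and uses synthetic data in which only two of three numerical features drive the target; all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
# Synthetic data: only x1 and x2 drive the target; x3 is pure noise.
X_num = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
season = pd.Series(
    rng.choice(["summer", "autumn", "winter", "spring"], size=n), name="season"
)
y = 3.0 * X_num["x1"] - 2.0 * X_num["x2"] + rng.normal(scale=0.1, size=n)

# One-hot encode the categorical variable and z-score the numerical ones.
X_cat = pd.get_dummies(season, prefix="season", dtype=float)
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X_num), columns=X_num.columns
)
X = pd.concat([X_scaled, X_cat], axis=1)

# Settings quoted in the text: regularization 0.01, up to 10,000 iterations.
lasso = Lasso(alpha=0.01, max_iter=10_000).fit(X, y)

# Features with a non-zero coefficient survive the selection.
selected = [name for name, coef in zip(X.columns, lasso.coef_) if coef != 0.0]
print(selected)
```

In a train/test workflow the scaler and the Lasso would normally be fitted on the training portion only; the sketch fits on all rows for brevity.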
Model Training and Validation
The final dataset was split into 70% for training and 30% for testing. The chosen algorithm was CatBoost Regressor, a gradient boosting method that constructs predictive models through the sequential combination of weak learners, typically decision trees¹?. This approach iteratively fits each new model to correct the residual errors of previous predictions by optimizing a loss function using gradient descent. Gradient boosting has gained prominence due to its high predictive accuracy and flexibility, and it is widely applied in tasks such as classification, regression, and ranking¹². Its ability to capture complex, non-linear interactions among heterogeneous variables makes it particularly suitable for structured (tabular) data, establishing it as one of the most robust and effective techniques in supervised learning¹?.
Hyperparameter tuning was performed using GridSearchCV with 3-fold cross-validation (k = 3), employing the coefficient of determination (R²) as the primary evaluation metric. The search space included tree depth (6 and 8), learning rate (0.02 and 0.03), number of iterations (500 and 1000), L2 regularization strength (1 and 3), sampling temperature (0.5 and 1), and randomness strength (1 and 2).
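The size of this search can be checked with scikit-learn's ParameterGrid. The values come from the text; the keys follow CatBoost's parameter names, and mapping "sampling temperature" to bagging_temperature and "randomness strength" to random_strength is an assumption of this sketch.

```python
from sklearn.model_selection import ParameterGrid

# Search space from the text, keyed by CatBoost parameter names (assumed mapping).
param_grid = {
    "depth": [6, 8],
    "learning_rate": [0.02, 0.03],
    "iterations": [500, 1000],
    "l2_leaf_reg": [1, 3],
    "bagging_temperature": [0.5, 1],
    "random_strength": [1, 2],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 values for each of 6 parameters
n_folds = 3
n_fits = n_candidates * n_folds                # models trained during the search
print(n_candidates, n_fits)  # 64 192
```

A dictionary of this shape could then be passed to GridSearchCV together with a CatBoostRegressor, cv=3, and scoring="r2"; that call is omitted here so the sketch runs without CatBoost installed.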
Performance Evaluation and Interpretability
Model performance was assessed using the coefficient of determination (R²), mean absolute error (MAE), and root mean squared error (RMSE), by comparing predicted versus actual hospitalization values. To support interpretation, a scatter plot of actual versus predicted values, a time series comparison, and a plot of absolute error over time were generated. Additionally, a correlation matrix was constructed to explore relationships among environmental variables.
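For reference, the three metrics can be computed as in the sketch below, which uses scikit-learn and a handful of hypothetical values rather than the study's data. R² is one minus the ratio of the residual sum of squares to the total sum of squares, MAE is the mean of the absolute errors, and RMSE is the square root of the mean squared error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed and predicted weekly admissions (not the study's data).
y_true = np.array([900.0, 1000.0, 1100.0, 1200.0])
y_pred = np.array([950.0, 980.0, 1150.0, 1120.0])

r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot
mae = mean_absolute_error(y_true, y_pred)           # mean |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt of mean squared error
print(r2, mae, rmse)
```

RMSE is never smaller than MAE, and the gap widens when a few weeks carry large errors, which is why the two are usually reported together.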
Model explainability was examined using the SHAP (SHapley Additive exPlanations) technique, which provides both local and global interpretations of each variable’s contribution to predictions. This approach enhances algorithmic transparency and supports the auditability of the predictive system.
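The additive attribution idea behind SHAP can be illustrated with an exact, brute-force Shapley computation on a toy function. This is not the shap library and not the study's model: the function, the input, and the all-zero baseline are hypothetical, and practical implementations for tree ensembles rely on much faster specialized algorithms.

```python
from itertools import permutations

def predict(x):
    # Toy model with an interaction term; stands in for a fitted regressor.
    return 2.0 * x[0] + x[1] * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        z = list(baseline)
        prev = f(z)
        for i in order:
            z[i] = x[i]           # switch feature i from baseline to its actual value
            cur = f(z)
            phi[i] += cur - prev  # marginal contribution of i in this ordering
            prev = cur
    return [p / len(orders) for p in phi]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, baseline)
print(phi)  # [2.0, 3.0, 3.0]
```

The contributions sum to the difference between the prediction at x and the prediction at the baseline (8.0 here), which is the local accuracy property that makes per-feature attributions add up to the model output.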
RESULTS
The application of artificial intelligence in public health presents promising opportunities for predicting health outcomes and strengthening management across various levels of Brazil’s Unified Health System (SUS). In this study, we developed a predictive model using the CatBoost algorithm to estimate the weekly number of hospital admissions due to respiratory diseases in the state of São Paulo, based on environmental, meteorological, and seasonal data collected between 2017 and 2022.
The final model achieved an R² of 0.8948, indicating that approximately 89.5% of the variability in weekly respiratory hospitalizations was explained by the environmental and temporal predictors included. The RMSE was 73.47 hospitalizations per week, while the MAE reached 50.80, underscoring the model’s accuracy even during periods of greater data fluctuation.
Figure 1 presents the comparison between actual and predicted values of respiratory hospitalizations over a period of 47 weeks, based on the estimates generated by the CatBoost model. The curves exhibit similar variation patterns, demonstrating strong agreement between the observed values (solid blue line) and the model’s predictions (dashed orange line).

Fig. 1

The metrics reported above (R² = 0.8948, MAE = 50.80, RMSE = 73.47) were computed on the test set, so they describe the model’s performance on weeks not used for training.
From a temporal perspective, the model successfully captured the peaks and troughs in the weekly hospitalization time series, an essential feature for operational applications within Brazil’s Unified Health System (SUS), such as early outbreak detection and proactive allocation of hospital resources. The model’s predictive stability across different seasons of the year also suggests sensitivity to seasonal variability, an effect incorporated through features such as month, season, trend, and seasonal components extracted from the time series.
These results stand out due to the high performance achieved in a practical application context using real-world data, integrating environmental and meteorological variables. The close alignment between the predicted and observed curves demonstrates the feasibility of using machine learning-based models as decision-support tools for epidemiological surveillance and territorial management of the healthcare network. When incorporated into systems such as VIGIAR or SIVEP-Gripe, such approaches can enhance the responsiveness of local health authorities in the face of critical events related to air pollution and adverse climatic conditions.
Individual Prediction Assessment – Scatter Plot of Actual vs. Predicted Values
Figure 2 presents the scatter plot of actual versus predicted hospitalizations due to respiratory diseases, as estimated by the CatBoost model. The diagonal dashed line represents the ideal scenario in which predictions perfectly match observed values. Most data points are closely clustered around this line, highlighting the model’s accuracy and its ability to generalize well on the test set.
The distribution of points suggests the absence of systematic overestimation or underestimation bias, reinforcing the model’s predictive robustness. The clustering of data near the ideal line in the central hospitalization range (approximately 800 to 1,200 cases per week) indicates that the model was particularly effective in predicting periods of moderate healthcare demand, which can be strategic for the tactical and operational planning of healthcare facilities.

Fig. 2

In contexts of higher or lower hospitalization volumes, particularly at the distribution extremes, a slight dispersion is observed, which is expected in time series characterized by seasonal variability and multiple intervening factors. Nevertheless, deviations remain within a range considered acceptable for practical applications in public health, especially given the heterogeneity of environmental, meteorological, and social determinants associated with respiratory diseases. These findings further reinforce the reliability of the predictive model under real-world conditions, even when faced with atypical data fluctuations.
The scatterplot analysis supports the quantitative results presented in Figure 1 (R² = 0.8948), demonstrating that the model can explain, with a high degree of confidence, the variability in hospitalizations based on environmental and temporal variables. In the context of digital health, this constitutes evidence that machine learning-based models, such as CatBoost, are viable tools to support decision-making in health surveillance and the continuous monitoring of healthcare demand.
Error Distribution – Histogram of Prediction Errors
Figure 3 presents the histogram of the prediction errors from the CatBoost model, overlaid with a kernel density estimate (KDE) curve, which enables visualization of the residual dispersion pattern. Errors were calculated as the simple difference between the actual values of hospital admissions due to respiratory diseases and the values predicted by the model, considering the test dataset subset.
According to Figure 3, the error distribution is approximately symmetrical around zero, with a slight negative skew. Most prediction errors are concentrated within the range of -50 to +50 weekly hospital admissions, indicating that the model is generally well-calibrated and does not systematically overestimate or underestimate the actual values. This characteristic is particularly desirable in predictive models applied to health management, as it ensures greater reliability in evidence-based decision-making.
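The quantities read off the histogram (centering, share of errors within ±50, direction of skew) can be computed directly from the residuals. A minimal sketch with hypothetical values, not the study's test set:

```python
import numpy as np

# Hypothetical test-set values (not the study's data).
y_true = np.array([1000.0, 950.0, 1100.0, 1200.0, 870.0, 1020.0])
y_pred = np.array([980.0, 990.0, 1085.0, 1290.0, 900.0, 1010.0])

errors = y_true - y_pred                  # actual minus predicted, as in the text
mean_error = errors.mean()                # close to zero when there is no directional bias
share_within_50 = np.mean(np.abs(errors) <= 50)

# Sample skewness: third central moment over the 1.5 power of the second.
centered = errors - mean_error
skewness = np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5
print(mean_error, share_within_50, skewness)
```

With errors defined as actual minus predicted, a negative skew means the longer tail lies on the side where the model overpredicts.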

Fig. 3

The distribution also exhibits relatively short tails, with few extreme outliers in absolute error, suggesting predictive stability even in the presence of seasonal or contextual variations not directly modeled. The absence of systematic bias and the near-normal shape of the distribution indicate that the model successfully captured a substantial portion of the explainable variance in the dependent variable, corroborating the previously reported satisfactory R², RMSE, and MAE values.
From a practical application standpoint within the SUS context, models that exhibit systematic errors or directional bias may lead to misallocation of resources or underestimation of critical scenarios, thereby compromising healthcare efficiency and equity. In this regard, the performance observed in this analysis highlights the model’s suitability for continuous and predictive monitoring of hospital demand due to respiratory conditions.
Finally, the analysis of the error distribution underscores the importance of integrating statistical evaluation methods into the development of decision-support tools in digital health. This ensures not only predictive performance but also transparency and auditability, fundamental pillars of algorithmic fairness as outlined in Ordinance GM/MS No. 3,691/2024².
Variable Importance – SHAP Summary Plot
Figure 4 presents the SHAP (SHapley Additive exPlanations) summary plot, used to interpret the CatBoost model applied to the prediction of hospital admissions for respiratory diseases. This explainability approach shows the individual contribution of each variable to the model’s predictions, linking the magnitude of the impact (x-axis) to the intensity of the variable values (represented by the color gradient in the side bar).

Fig. 4

Taken together, the results obtained through the SHAP analysis demonstrate the model’s ability to capture complex, non-linear relationships between air pollutants, climate, and health.
Environmental Context – Correlation Heatmap
Figure 5 presents the Pearson correlation matrix between the environmental and meteorological variables used in the model and the variable HOSPCIDX, representing the weekly number of hospital admissions for respiratory diseases. The matrix color scale ranges from blue (negative correlation) to red (positive correlation), and the numerical coefficients indicate the magnitude of the correlations.

Fig. 5

The meteorological variables include AVERAGE TEMP. (weekly mean temperature), PRECIPIT (accumulated precipitation), RELATIVE HUMIDITY (relative air humidity), RADIATION (solar radiation), WIND SPEED (mean wind speed), and ATM-P (atmospheric pressure). Finally, HOSPCIDX represents the weekly number of hospital admissions for respiratory diseases, classified according to ICD-10 codes J00–J99.
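A Pearson correlation matrix of this kind can be produced with pandas, as in the minimal sketch below. The data are synthetic and deliberately built so that two particulate series are nearly collinear; only the column name HOSPCIDX comes from the text, and the 0.9 cutoff for flagging pairs is an illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
# Synthetic weekly series; only the name HOSPCIDX is taken from the text.
pm10 = rng.normal(30.0, 8.0, n)
pm25 = 0.6 * pm10 + rng.normal(0.0, 1.0, n)   # built to be nearly collinear with pm10
temp = rng.normal(22.0, 4.0, n)
hosp = 1000.0 + 5.0 * pm25 - 10.0 * temp + rng.normal(0.0, 40.0, n)

df = pd.DataFrame(
    {"PM10": pm10, "PM25": pm25, "AVERAGE_TEMP": temp, "HOSPCIDX": hosp}
)
corr = df.corr(method="pearson")

# Flag predictor pairs whose absolute correlation exceeds the illustrative cutoff.
predictors = corr.drop(index="HOSPCIDX", columns="HOSPCIDX")
high_pairs = [
    (a, b, round(predictors.loc[a, b], 2))
    for i, a in enumerate(predictors.index)
    for b in predictors.columns[i + 1:]
    if abs(predictors.loc[a, b]) > 0.9
]
print(corr.round(2))
print(high_pairs)
```

Pairs flagged this way are the kind of redundancy that a selection step such as Lasso is meant to reduce.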
DISCUSSION
The findings of this study highlight the potential of artificial intelligence (AI) as a decision-support tool in public health management, particularly within the context of Brazil’s Unified Health System (SUS). As illustrated in Figure 1, the CatBoost model achieved high accuracy in predicting the weekly number of hospital admissions due to respiratory diseases in the state of São Paulo, with a coefficient of determination of R² = 0.8948, a mean absolute error (MAE) of 50.8, and a root mean squared error (RMSE) of 73.47. These metrics indicate a strong explanatory capacity, even in the presence of seasonal variability and the multivariate nature of the determinants analyzed. Such performance is especially relevant in the context of climate change and the increasing burden of respiratory diseases, particularly among vulnerable urban populations.
Figure 2 presents the scatter plot of actual versus predicted hospital admissions due to respiratory diseases, highlighting the agreement between the CatBoost model and the observed data. Beyond demonstrating the model’s overall accuracy, the scatter plot analysis provides insight into its robustness for forecasting operational scenarios. The model exhibits greater adherence within intermediate ranges of healthcare demand, indicating strong performance under conditions of moderate hospital occupancy. This behavior is desirable for public systems such as the SUS, where resource management often depends on realistic forecasts for routine operational contexts. Although a slight dispersion is observed at the extremes of hospitalization volumes, this trend is consistent with the multifactorial complexity of respiratory episodes, which involve social, environmental, biological, and structural determinants that are not always fully captured by the available data.
Figure 3 depicts the distribution of the model’s prediction errors, which appears approximately symmetric around zero, with a slight negative skew. Most errors are concentrated within the range of –50 to +50 weekly hospital admissions, indicating good calibration and the absence of systematic bias. This near-normal distribution, with short tails and few outliers, suggests that the model successfully captured most of the explainable variance of the dependent variable. Such characteristics confer robustness to the model, ensuring greater reliability in evidence-based decision-making, as recommended by Chen et al.? and Zhang et al.? for AI applications in public health.
Complementarily, the statistical analysis of the error distribution suggests that the model is unbiased and exhibits predictive stability. This stability is essential in public health contexts, where decisions based on inaccurate models can compromise the equity of care. The low frequency of extreme outliers indicates that CatBoost, combined with effective preprocessing and feature selection, can minimize critical errors. In particular, the approximately normal shape of the distribution reinforces the robustness of the methodological approach adopted, indicating that most of the explainable variance was indeed captured by the selected variables and the structure of the algorithm.
In Figure 4, the SHAP (SHapley Additive exPlanations) analysis reveals the variables with the greatest impact on model prediction. Solar radiation emerged as the most influential factor, suggesting an association with photochemical processes that increase tropospheric ozone (O₃) concentrations, a well-known respiratory irritant¹?. According to Xiang et al.¹?, this relationship is particularly significant in urban areas with high vehicular density and limited atmospheric ventilation. PM₂.₅ also ranked among the most relevant predictors, corroborating evidence of its systemic inflammatory effects and its association with the exacerbation of asthma, bronchitis, and lung infections, especially in vulnerable groups?. Pizzulli et al.²? emphasize that continuous exposure to fine particles is one of the main drivers of climate change impacts on respiratory health.
The variable HOSPCIDX, representing hospital admissions, stood out for its importance in the model, reflecting the temporal persistence pattern frequently observed in respiratory diseases. In line with epidemiological patterns, outbreaks often extend over several consecutive weeks, making recent history a robust indicator of future behavior.
Other atmospheric pollutants, such as CO and SO₂, also exhibited a positive impact on the predictions, reinforcing their role as environmental markers of health risk. The weight of HOSPCIDX reflects temporal autocorrelation in the historical series, a pattern consistent with the progressive nature of respiratory diseases and the influence of cumulative exposures, as highlighted by Requia et al.²¹. Seasonality, incorporated through time series decomposition, also carried significant weight, reinforcing the importance of annual cyclical patterns in hospital admissions, which increase particularly in the autumn–winter period²².
Meteorological variables such as PRESSAOATM (atmospheric pressure) and RADIACAO (solar radiation) showed a relevant impact, indicating that high-pressure atmospheric conditions favor pollutant stagnation and the formation of thermal inversions, which worsen air quality and increase respiratory health risks²³. As identified by Li et al.²?, this type of phenomenon occurs more frequently during the winter and during severe pollution episodes, particularly under atmospheric stability conditions. Ozone (O₃) exhibited a moderate impact, reflecting its dependence on meteorological factors such as solar radiation and temperature and reinforcing its seasonal variability¹¹.
According to Figure 5, the correlation analysis among these variables allows the identification of environmental and climatic patterns associated with variations in respiratory hospitalizations, thereby supporting epidemiological surveillance strategies and public health planning. The variables PM₁₀, PM₂.₅, NO₂, SO₂, and CO showed a moderate positive correlation with hospital admissions, with particular emphasis on the gaseous pollutants NO₂ (r = 0.27), CO (r = 0.25), and SO₂ (r = 0.27).
Conversely, meteorological variables such as average temperature (r = –0.32), solar radiation (r = –0.39), and relative humidity (r = –0.09) exhibited negative correlations with HOSPCIDX, suggesting a potential protective effect of these climatic conditions, particularly during periods of greater atmospheric ventilation and pollutant dispersion. These patterns reflect the higher incidence of respiratory diseases during colder and drier periods, which are associated with increased atmospheric stability and pollutant concentration, particularly in winter. This finding reinforces the relevance of seasonality, which had already been captured by the CatBoost model (see Figure 4).
It is worth noting the high multicollinearity observed among particulate pollutants themselves (PM₁₀ and PM₂.₅, r = 0.98) and among combustion gases (e.g., NO₂ and CO, r = 0.87), which justified the use of automatic variable selection techniques, such as Lasso regression, to mitigate redundancies in the predictive modeling process.
The positive correlation between atmospheric pressure (PRESSAOATM) and hospital admissions for respiratory diseases (r = 0.33) warrants attention, as it may reflect thermal inversion episodes in which high-pressure systems hinder pollutant dispersion, thereby promoting their accumulation near the ground.
In Figure 5, the correlation matrix confirms expected relationships: the pollutants NO₂, CO, and SO₂ showed positive correlations with hospital admissions, whereas variables such as temperature and relative humidity exhibited negative correlations. The presence of these structured associations provides internal logic to the model, reinforcing its epidemiological and statistical coherence. Recent studies, such as that by Piracha et al.²?, indicate that the combination of thermal discomfort and poor air quality is a key determinant of increased hospitalizations for respiratory causes in densely urbanized regions.
These findings demonstrate that the proposed model is sensitive to seasonality, autocorrelation, and environmental variability, characteristics that are desirable for applications in predictive surveillance systems. Future integration with systems such as VIGIAR²? and SIVEP-Gripe²?, both linked to the Brazilian Ministry of Health, would enable real-time monitoring of respiratory risks, with the capacity to anticipate hospital demand, optimize resources, and plan preventive actions.
Despite the high performance achieved, certain limitations must be acknowledged. The absence of disaggregated data by age, comorbidities, or social indicators limits the personalization of analyses. Similarly, structural variables such as vaccination coverage, urban mobility, or seasonal events (e.g., viral outbreaks) were not integrated into the model, despite their recognized influence on hospitalization dynamics. Future iterations of the model could incorporate these factors, as well as conduct external validations in regions with distinct environmental characteristics, such as the states in Brazil’s North and Northeast regions.
Nevertheless, the approach proposed in this study advances the field by providing a practical, explainable, and replicable tool with the potential to strengthen Brazil’s Digital Health ecosystem. Aligned with the Brazilian Ministry of Health Ordinance GM/MS No. 3,691/2024² and the World Health Organization (WHO) guidelines²?, the methodology integrates data science, epidemiological surveillance, and public policy, fostering equity, transparency, and innovation within the Unified Health System (SUS).
CONCLUSION
This study developed and validated a highly accurate predictive model to estimate weekly hospital admissions for respiratory diseases in the state of São Paulo, integrating environmental and meteorological data within a machine learning framework. The modeling process employed the CatBoost algorithm, recognized for its robustness in handling tabular data with multiple correlated variables, within a rigorous pipeline comprising feature engineering, variable selection through Lasso regression, and hyperparameter optimization with cross-validation.
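The feature-engineering step of such a pipeline can be illustrated with a small sketch that builds lagged values and a moving average from a weekly series. The function name `add_lags_and_rolling_mean`, the chosen lags and window, and the sample values are assumptions for illustration; the lags and windows actually used in the study are not restated here, and the Lasso and CatBoost stages are not shown.

```python
from statistics import mean

def add_lags_and_rolling_mean(series, lags=(1, 2), window=3):
    """Build lagged and moving-average features for a weekly series.

    Every feature for week t is computed from weeks before t, so the
    predictors never contain information from the week being predicted.
    """
    start = max(max(lags), window)  # first week with a complete history
    rows = []
    for t in range(start, len(series)):
        row = {f"lag_{k}": series[t - k] for k in lags}
        # Moving average over the `window` weeks that precede week t
        row[f"ma_{window}"] = mean(series[t - window:t])
        row["target"] = series[t]
        rows.append(row)
    return rows

# Illustrative weekly admission counts (assumed values, not study data)
weekly_admissions = [100, 120, 90, 110, 130, 95]
features = add_lags_and_rolling_mean(weekly_admissions)
for row in features:
    print(row)
```

Restricting every feature to past weeks is the design choice that matters most here: it keeps the evaluation honest when the rows are later split chronologically for cross-validation.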
The performance achieved (R² ≈ 0.895; RMSE = 73.47; MAE = 50.80) demonstrated the model’s ability to capture non-linear relationships and complex seasonal patterns, ensuring consistent predictions even under conditions of high environmental variability. The error analysis revealed an approximately symmetric distribution with a low frequency of outliers, reinforcing the statistical calibration and predictive reliability of the proposed system.
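For readers who want to reproduce these three metrics on their own predictions, the following sketch computes R², RMSE, and MAE from their standard definitions in plain Python. The function name `regression_metrics` and the sample values are hypothetical; the study itself may have relied on library implementations.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for paired observed and predicted values."""
    n = len(y_true)
    residuals = [obs - pred for obs, pred in zip(y_true, y_pred)]
    mean_obs = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)                   # residual sum of squares
    ss_tot = sum((obs - mean_obs) ** 2 for obs in y_true)    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }

# Illustrative values only (assumed for the example, not study data)
observed = [100, 120, 90, 110]
predicted = [110, 110, 100, 100]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Because RMSE squares each residual before averaging, it is never smaller than MAE; a gap between the two, as in the values reported above, is what one would expect when a minority of weeks carry larger errors.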
The application of the SHAP (SHapley Additive exPlanations) technique ensured the explainability of the model by identifying the predictors with the greatest individual impact and contributing to its auditability, a fundamental requirement in the context of algorithmic governance in public health. Variables such as solar radiation, PM₂.₅, atmospheric pressure, and temporal autocorrelation emerged as critical factors, corroborating empirical evidence from the literature and demonstrating epidemiological consistency with the seasonal cycles of respiratory diseases.
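The idea behind SHAP can be made concrete with a brute-force sketch of the Shapley values it approximates. The code below is not the SHAP library and not the study's CatBoost model: `exact_shapley`, `toy_model`, the feature names, and all numbers are hypothetical, and absent features are simply set to an assumed baseline value.

```python
from itertools import permutations

def exact_shapley(predict, instance, baseline):
    """Exact Shapley values, averaging marginal contributions over all
    feature orderings. Features not yet added keep their baseline value.
    Cost grows factorially, so this suits only tiny illustrative cases.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]
            value = predict(current)
            totals[name] += value - previous  # marginal contribution of `name`
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

# Hypothetical toy model with an interaction term (not the study's model)
def toy_model(x):
    return 2.0 * x["pm25"] + 1.0 * x["lag_1"] + 0.5 * x["pm25"] * x["lag_1"]

instance = {"pm25": 2.0, "lag_1": 3.0}
baseline = {"pm25": 0.0, "lag_1": 0.0}
phi = exact_shapley(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that makes this family of explanations auditable.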
Aligned with Ordinance GM/MS No. 3,691/2024² and the World Health Organization’s recommendations²⁸ on the ethical and transparent use of artificial intelligence in healthcare, the model developed in this study stands out as a strategic tool for supporting epidemiological surveillance and public health management within the Brazilian Unified Health System (SUS). Its potential integration into platforms such as VIGIAR²⁶ and SIVEP-Gripe²⁷ would enhance real-time monitoring of respiratory health risks, enabling more efficient resource allocation and the anticipation of critical scenarios.
From a technical-scientific perspective, this study advances Digital Health applied to environmental health surveillance by proposing a replicable, scalable, and technically sound model, with potential for territorial adaptation and interoperability with national health information systems. To enhance its applicability, future developments should incorporate sociodemographic variables, vaccination indicators, urban mobility patterns, and epidemic events, as well as conduct external validations in heterogeneous regional contexts.
ACKNOWLEDGMENTS
The authors express their gratitude to the Laboratory of Climate Change and Air Pollution for the technical support, infrastructure, and collaborative environment that enabled the development of this research. Special thanks are extended to the team of researchers and technical staff, whose dedication, expertise, and cooperation were essential for the collection, organization, and analysis of the data presented in this study.

FUNDING
The authors acknowledge the financial support provided by the Coordination for the Improvement of Higher Education Personnel – Brazil (CAPES) – Finance Code 001.

Data Availability Statement
The dataset that supports this article is available in the SciELO Data repository, in the Ciência & Saúde Coletiva Dataverse, at the following link: https://doi.org/10.48331/SCIELODATAB.SICY0A

REFERENCES
1. Fundação Oswaldo Cruz (Fiocruz). Projeto Saúde Digital na Maré: conectando comunidades, ciência e cuidado [Internet]. Rio de Janeiro: Fiocruz; 2023 [cited 2025 Aug 7]. Available from: https://portal.fiocruz.br/projeto/saude-digital-na-mare.
2. Brasil. Ministério da Saúde. Portaria GM/MS nº 3.691, de 23 de maio de 2024. Institui a Ação Estratégica SUS Digital - Telessaúde [Internet]. Brasília: MS; 2024 [cited 2025 Aug 7]. Available from: https://bvs.saude.gov.br/bvs/saudelegis/gm/2024/prt3691_29_05_2024.html.
3. Atkinson RW, Kang S, Anderson HR, Mills IC, Walton HA. Epidemiological time series studies of PM2.5 and daily mortality and hospital admissions: a systematic review and meta-analysis. Thorax. 2014;69(7):660-665.
4. Chen R, Yin P, Meng X, et al. Associations between coarse particulate matter air pollution and cause-specific mortality: a nationwide analysis in 272 Chinese cities. Environ Health Perspect. 2019;127(1):017008.
5. Zhang L, Wilson JP, Zhao N, Zhang W, Wu Y. The dynamics of cardiovascular and respiratory deaths attributed to long-term PM2.5 exposures in global megacities. Sci Total Environ. 2022;842:156951.
6. Chen S, Bao Z, Ou Y, Chen K. Synergistic effects of air pollution and urban heat island on public health, a gender oriented nationwide study of China. Urban Clim. 2023;51:101671.
7. Ali MU, Liu G, Yousaf B, et al. A systematic review on global pollution status of particulate matter-associated potential toxic elements and health perspectives in urban environment. Environ Geochem Health. 2019;41(3):1131-1162.
8. Palumbo R, Fakhar Manesh M, Pellegrini MM, Caputo A, Flamini G. Organizing a sustainable smart urban ecosystem: perspectives and insights from a bibliometric analysis and literature review. J Clean Prod. 2021;297:126622.
9. Temirbekov N, Temirbekova M, Tamabay D, et al. Assessment of the negative impact of urban air pollution on population health using machine learning method. Int J Environ Res Public Health. 2023;20(18):6770.
10. Ku Y, Kwon SB, Yoon JH, Mun SK, Chang M. Machine learning models for predicting the occurrence of respiratory diseases using climatic and air-pollution factors. Clin Exp Otorhinolaryngol. 2022;15(2):168-176.
11. Tavella RA, Silva Júnior FMR, Adamatti DF, Scursone GF, et al. Predicting air pollution changes due to temperature increases in two Brazilian capitals using machine learning, a necessary perspective for a climate resilient health future. Int J Environ Health Res. 2025;35(11):3392-3406.
12. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support [Internet]. 2018 [cited 2025 Aug 7]. Available from: https://arxiv.org/abs/1810.11363.
13. Brasil. Ministério da Saúde. Departamento de Informática do SUS (DATASUS) [Internet]. Brasília: MS; 2024 [cited 2025 Aug 7]. Available from: https://datasus.saude.gov.br/.
14. Instituto Nacional de Meteorologia (INMET). Banco de dados meteorológicos [Internet]. Brasília: INMET; 2024 [cited 2025 Aug 7]. Available from: https://bdmep.inmet.gov.br/.
15. Companhia Ambiental do Estado de São Paulo (CETESB). Qualidade do ar no estado de São Paulo [Internet]. São Paulo: CETESB; 2024 [cited 2025 Aug 7]. Available from: https://cetesb.sp.gov.br/ar/.
16. Bentéjac C, Csörgő A, Martínez-Muñoz G. A comparative analysis of gradient boosting algorithms. Artif Intell Rev. 2021;54:1937-1967.
17. Gonçalves VSF, de Carvalho VR. A review of interpretability methods for gradient boosting decision trees. J Braz Comput Soc. 2025;31(1):640-654.
18. Wu W, Yao M, Yang X, et al. Mortality burden attributable to long-term ambient PM2.5 exposure in China: using novel exposure-response functions with multiple exposure windows. Atmos Environ. 2021;246:118098.
19. Xiang S, Guo X, Kou W, et al. Substantial short- and long-term health effects of PM2.5 and its constituents even under future emission reductions in China. Sci Total Environ. 2023;874:162433.
20. Pizzulli VA, Telesca V, Covatariu G. Analysis of correlation between climate change and human health based on a machine learning approach. Healthcare (Basel). 2021;9(1):86.
21. Requia WJ, Vicedo Cabrera AM, Amini H, da Silva GL, Schwartz JD, Koutrakis P. Short term air pollution exposure and hospital admissions for cardiorespiratory diseases in Brazil: a nationwide time series study between 2008 and 2018. Environ Res. 2023;217:114794.
22. Xu B, Wang J, Li Z, et al. Seasonal association between viral causes of hospitalised acute lower respiratory infections and meteorological factors in China: a retrospective study. Lancet Planet Health. 2021;5(3):e154-e163.
23. Wang W. Progress in the impact of polluted meteorological conditions on the incidence of asthma. J Thorac Dis. 2016;8(1):E57-E61.
24. Li C, Yan F, Kang S, et al. Corrigendum to “Carbonaceous matter in the atmosphere and glaciers of the Himalayas and the Tibetan plateau: an investigative review” [Environ Int. 2021;146:106281]. Environ Int. 2023;179:108133.
25. Piracha A, Chaudhary MT. Urban air pollution, urban heat island and human health: a review of the literature. Sustainability. 2022;14(15):9234.
26. Brasil. Ministério da Saúde. Sistema de Vigilância da Qualidade do Ar – VIGIAR [Internet]. Brasília: MS; 2024 [cited 2025 Aug 7]. Available from: https://www.gov.br/saude/pt-br/vigiar.
27. Brasil. Ministério da Saúde. Sistema de Informação da Vigilância Epidemiológica da Gripe – SIVEP-Gripe [Internet]. Brasília: MS; 2024 [cited 2025 Aug 7]. Available from: https://opendatasus.saude.gov.br/dataset/sivep-gripe.
28. World Health Organization (WHO). Ethics and governance of artificial intelligence for health: guidance on large multi-modal models (LMMs) [Internet]. Geneva: WHO; 2024 [cited 2025 Aug 7]. Available from: https://www.who.int/publications/i/item/978924009225


How to cite:
Scursone, GF, Adamatti, DF, Silva Júnior, FMR, Tavella, RA, Bonifácio, AS, Machado, RA, Saes-Silva, E, Correia Filho, W. Use of artificial intelligence and environmental data to estimate respiratory hospitalizations in support of SUS management. Cien Saude Colet [online journal] (2025/Dec). [Cited 16/12/2025]. Available at: http://www.cienciaesaudecoletiva.com.br/artigos/use-of-artificial-intelligence-and-environmental-data-to-estimate-respiratory-hospitalizations-in-support-of-sus-management/19880
