CatBoost algorithm and Bayesian network model analysis based on risk prediction of cardiovascular and cerebrovascular diseases
Aimin WANG, Fenglin WANG, Yiming HUANG, Yaqi XU, Wenjing ZHANG, Xianzhu CONG, Weiqiang SU, Suzhen WANG, Mengyao GAO, Shuang LI, Yujia KONG, Fuyan SHI, Enxue TAO
Objective To identify the main characteristic variables affecting the incidence of cardiovascular and cerebrovascular diseases, to construct a Bayesian network model of incidence risk based on the top 10 characteristic variables, and to provide a reference for predicting the risk of cardiovascular and cerebrovascular disease incidence.
Methods A total of 315 896 participants and their related variables were drawn from the UK Biobank database. The participants were randomly divided into a training set and a test set at a ratio of 7:3, and feature selection was performed with the categorical boosting (CatBoost) algorithm. A Bayesian network model was then constructed with the max-min hill-climbing (MMHC) algorithm.
Results The prevalence of cardiovascular and cerebrovascular diseases in this study was 28.8%. The top 10 variables selected by the CatBoost algorithm were age, body mass index (BMI), low-density lipoprotein cholesterol (LDL-C), total cholesterol (TC), the triglyceride-glucose (TyG) index, family history, apolipoprotein A/B ratio, high-density lipoprotein cholesterol (HDL-C), smoking status, and gender. For the CatBoost model, the area under the receiver operating characteristic (ROC) curve (AUC) was 0.770 and the accuracy was 0.764 on the training set; on the validation set, the AUC was 0.759 and the accuracy was 0.763. The clinical efficacy analysis showed a threshold range of 0.06-0.85 for the training set and 0.09-0.81 for the validation set. The Bayesian network model indicated that age, gender, smoking status, family history, BMI, and apolipoprotein A/B ratio were directly related to the incidence of cardiovascular and cerebrovascular diseases and were significant risk factors. The TyG index, HDL-C, LDL-C, and TC indirectly affected the risk of cardiovascular and cerebrovascular diseases through their effects on BMI and apolipoprotein A/B ratio.
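The evaluation quantities reported in the Results (AUC, accuracy, and a threshold range from the clinical efficacy analysis) can be illustrated with a minimal Python sketch. It is not the authors' code. The labels and scores below are made-up toy values, the 0.5 classification cutoff is an assumption, and reading the "threshold range" as the range of decision thresholds with positive net benefit is also an assumption, since the abstract does not name the method used.

```python
from typing import Sequence


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random
    non-case, counting ties as one half (rank-based definition)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true: Sequence[int], scores: Sequence[float],
             cutoff: float = 0.5) -> float:
    """Share of participants classified correctly at an assumed cutoff."""
    hits = sum((s >= cutoff) == (y == 1) for y, s in zip(y_true, scores))
    return hits / len(y_true)


def net_benefit(y_true: Sequence[int], scores: Sequence[float],
                threshold: float) -> float:
    """Net benefit at decision threshold pt: TP/n - FP/n * pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Hypothetical toy data (1 = disease, 0 = no disease); not study data.
y = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
p = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7, 0.35, 0.05]

print(f"AUC = {auc(y, p):.3f}")
print(f"accuracy = {accuracy(y, p):.3f}")
print(f"net benefit at 0.5 = {net_benefit(y, p, 0.5):.3f}")
```

Scanning `net_benefit` over a grid of thresholds and keeping those where it exceeds both the treat-none value (0) and the treat-all value would give a threshold range of the kind reported above, under the stated assumption.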
Conclusion Controlling BMI, apolipoprotein A/B ratio, and smoking behavior can reduce the risk of cardiovascular and cerebrovascular disease incidence. The Bayesian network model can be used to predict this risk.
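How a Bayesian network turns evidence into a risk estimate can be sketched in a few lines of Python. The example is deliberately tiny and hypothetical: it keeps only two of the direct factors named above (smoking status and BMI, both reduced to binary variables), every probability is invented for illustration, and the structure is written by hand rather than learned with MMHC.

```python
from itertools import product

# Hypothetical priors and conditional probabilities; NOT estimates from the study.
P_SMOKER = 0.2        # P(smoker = 1)
P_HIGH_BMI = 0.3      # P(high_bmi = 1)
P_DISEASE = {         # P(disease = 1 | smoker, high_bmi)
    (0, 0): 0.10,
    (0, 1): 0.20,
    (1, 0): 0.25,
    (1, 1): 0.40,
}


def joint(smoker: int, high_bmi: int, disease: int) -> float:
    """Joint probability from the factorisation
    P(smoker) * P(high_bmi) * P(disease | smoker, high_bmi)."""
    prob = P_SMOKER if smoker else 1 - P_SMOKER
    prob *= P_HIGH_BMI if high_bmi else 1 - P_HIGH_BMI
    p_d = P_DISEASE[(smoker, high_bmi)]
    prob *= p_d if disease else 1 - p_d
    return prob


def posterior_disease(evidence: dict) -> float:
    """P(disease = 1 | evidence) by enumerating every joint state
    that is consistent with the evidence."""
    numerator = denominator = 0.0
    for s, b, d in product((0, 1), repeat=3):
        state = {"smoker": s, "high_bmi": b, "disease": d}
        if any(state[name] != value for name, value in evidence.items()):
            continue
        prob = joint(s, b, d)
        denominator += prob
        if d == 1:
            numerator += prob
    return numerator / denominator


print(f"baseline risk: {posterior_disease({}):.3f}")
print(f"risk given smoker: {posterior_disease({'smoker': 1}):.3f}")
print(f"risk given smoker and high BMI: "
      f"{posterior_disease({'smoker': 1, 'high_bmi': 1}):.3f}")
```

Enumeration is practical only for a handful of variables; a network over all 10 selected variables would normally rely on a dedicated inference library.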
Cardiovascular and cerebrovascular disease / CatBoost algorithm / Bayesian network / Risk inference
R54