COVID-19 mutasyonlarının tespitinde yapay zeka tabanlı algoritmaların kullanılması

Use of artificial intelligence-based algorithms in detecting COVID-19 mutations


Thesis No: 879079
Author: MEHMET BURUKANLI
Advisors: PROF. DR. NEJAT YUMUŞAK
Thesis Type: Doctorate
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2024
Language: Turkish
University: Sakarya University
Institute: Institute of Natural Sciences
Department: Department of Computer Engineering
Division: Computer Engineering Division
Number of Pages: 155

Abstract

Coronavirus disease 2019 (COVID-19) virus is a recently emerged, highly contagious and deadly type of coronavirus. The rapid spread of the COVID-19 virus caused great fear and panic among people. To combat the virus, countries had to take measures such as full lockdowns and curfews. Despite these measures, the COVID-19 virus continued to spread. Another way to combat the COVID-19 virus is to develop vaccines and drugs, which is of great importance. The effectiveness of the vaccines and drugs that were developed has either decreased significantly or disappeared completely as a result of mutations of the COVID-19 virus. Combating COVID-19 mutations is therefore very important. If mutations that may occur in the structure of the COVID-19 virus in the future can be predicted in advance, vaccines and drugs can be developed more easily. Infected areas could then be quarantined, and the fight against the COVID-19 virus would ultimately be easier. Artificial intelligence-based approaches also offer promising results in detecting the COVID-19 virus. A review of the literature shows that most studies on the COVID-19 virus deal with other aspects of the virus. There is therefore a serious gap in the literature regarding mutation prediction for the COVID-19 virus. In this thesis, we aimed to fill this gap to some extent. Three artificial intelligence-based models (TfrAdmCov, StackGridCov and HyperAttCov) are proposed to predict mutations that may occur in the structure of the COVID-19 virus in the future. The first proposed model, TfrAdmCov, is based entirely on a transformer encoder and uses the Adam optimization algorithm. With the proposed TfrAdmCov model, dependencies between the variables in the input sequence can be captured easily.
Because it is transformer-based, the proposed TfrAdmCov model can perform computations in parallel. In addition, to improve the performance of the proposed TfrAdmCov model, the agglomerative clustering algorithm was used when creating the training, test and KFold datasets. The GridSearchCV algorithm was also used to tune the hyperparameters of the machine learning algorithms. A detailed examination of the experimental results showed that the proposed TfrAdmCov model outperforms both classical artificial intelligence-based models and several state-of-the-art models. On the COVID-19 test dataset, the proposed TfrAdmCov model reached 99.93% accuracy, 100.00% precision, 97.38% sensitivity, 98.67% F1-score and 98.65% MCC. Similarly, averaged over 10 random trials, the proposed TfrAdmCov model achieved better results than the other models on the COVID-19 test dataset, with 99.924% accuracy, 97.18% sensitivity, 98.57% F1-score and 98.54% MCC. To compare the proposed TfrAdmCov model statistically with deep learning models, the results obtained by averaging 10 random trials with different random seeds were analyzed. A detailed evaluation was carried out for each model in terms of the accuracy, precision, recall, F1-score and MCC performance metrics, using statistical measures such as the mean, standard deviation, median, minimum and maximum. In addition, mutation prediction was performed on the influenza A/H3N2 HA dataset to evaluate the performance of the proposed TfrAdmCov model. On the H3N2 HA test dataset, the proposed TfrAdmCov model achieved better results than the other models, with 96.33% accuracy, 81.55% precision, 52.33% sensitivity, 63.75% F1-score and 63.61% MCC.
The results on the influenza H3N2 HA test dataset showed that the proposed TfrAdmCov model is quite robust. Second, we proposed a robust StackGridCov model for mutation prediction of the COVID-19 virus. The proposed StackGridCov model is based entirely on ensemble learning. The GridSearchCV hyperparameter tuning algorithm was used to improve the performance of the proposed StackGridCov model and the other models. To evaluate the performance of the proposed StackGridCov model and the other models, stratified 10-fold cross-validation was used in addition to the holdout technique. In addition, to evaluate the performance of the proposed StackGridCov model, mutation prediction was performed on a dataset of the previously emerged influenza A/H1N1 HA virus. The proposed StackGridCov model with the GridSearchCV method outperformed the other algorithms on the COVID-19 test dataset, with an accuracy of 0.6623, an F1-score of 0.6723, an MCC of 0.3273 and an AUC of 0.7018. The proposed StackGridCov model also outperformed the other models on the influenza A/H1N1 HA test dataset, with an accuracy of 0.9460, a sensitivity of 0.7969, an F1-score of 0.8093 and an MCC of 0.7780. Overall, the GridSearchCV hyperparameter technique was observed to generally improve the performance of the proposed StackGridCov model and the other models. Third, the HyperAttCov model, based on HyperMixer and attention mechanisms, was proposed for COVID-19 virus mutation prediction. Attention mechanisms were used to maximize the performance of the proposed HyperAttCov model. The proposed HyperAttCov model achieved better performance than many deep learning-based and machine learning models.
A detailed examination of the experimental results showed that the proposed HyperAttCov model reached 70.0% accuracy, 92.0% precision and 46.5% MCC on the COVID-19 test dataset. Similarly, averaged over 10 random trials, the proposed HyperAttCov model reached 70.2% accuracy, 90.4% sensitivity and 46.2% MCC on the COVID-19 test dataset. In addition, the proposed HyperAttCov model achieved very successful results on the test dataset compared with the study in the literature. In conclusion, the proposed TfrAdmCov, StackGridCov and HyperAttCov models can successfully predict the mutations that will occur in the COVID-19 dataset. The results obtained are promising for vaccine and drug development.

Abstract (Translation)

Coronavirus disease 2019 (COVID-19) virus is a deadly type of coronavirus that emerged recently and is highly contagious. The rapid spread of the COVID-19 virus caused great fear and panic among people. Countries had to take measures such as full lockdowns and curfews to combat the COVID-19 virus. Despite these measures, the virus continued to spread. Another way to combat the COVID-19 virus is the development of vaccines and drugs, which is of great importance. The effectiveness of the vaccines and drugs that were developed has either decreased significantly or disappeared completely as a result of mutations of the COVID-19 virus. It is therefore very important to combat COVID-19 mutations. If future mutations in the structure of the COVID-19 virus can be predicted, vaccines and drugs can be developed more easily. Infected areas can then be quarantined, and the fight against the COVID-19 virus will ultimately be easier. Artificial intelligence-based approaches also offer promising results in detecting or predicting the COVID-19 virus. A review of the literature shows that most studies on the COVID-19 virus address other aspects of the virus. For this reason, there is a serious gap in the literature in terms of mutation prediction for the COVID-19 virus. In this thesis, we aim to fill this gap to some extent. Three artificial intelligence-based models (TfrAdmCov, StackGridCov and HyperAttCov) are proposed to predict future mutations in the COVID-19 Spike (S) protein structure. The first proposed model, TfrAdmCov, is based entirely on a transformer encoder and uses the Adam optimization algorithm. With the proposed TfrAdmCov model, dependencies between the variables in the input sequence can be captured easily.
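How a transformer encoder relates every position of an input sequence to every other position can be illustrated with a single self-attention step. The following is a minimal sketch, assuming NumPy and random toy embeddings; the function name `self_attention`, the sizes and the random projection matrices are illustrative assumptions and are not taken from the TfrAdmCov implementation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) embeddings; w_q, w_k, w_v: (d_model, d_k) projections.
    Returns the attended values and the (seq_len, seq_len) weight matrix.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights

# Toy input: 6 sequence positions with 8-dimensional embeddings (assumed sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` sums to 1 and says how strongly one position attends to every other position. All rows come from the same matrix products, so the positions are processed together rather than one step at a time.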
Because it is a transformer encoder-based architecture, the proposed TfrAdmCov model can perform computations in parallel. In addition, to increase the performance of the proposed TfrAdmCov model, the agglomerative clustering algorithm was used when creating the training, testing and KFold datasets. The GridSearchCV algorithm was also used to set the best hyperparameter values of the machine learning algorithms. A detailed examination of the experimental results shows that the proposed TfrAdmCov model achieves better performance than both classical artificial intelligence-based models and several state-of-the-art models. The proposed TfrAdmCov model achieved 99.93% accuracy, 100.00% precision, 97.38% recall, 98.67% F1-score and 98.65% MCC on the COVID-19 testing dataset. On this dataset, the TfrAdmCov model with the Adam optimization algorithm correctly predicted 335 of the 344 samples in the "mutation" class and incorrectly predicted only 9. It also correctly predicted all 12386 samples in the "no mutation" class. Similarly, averaged over 10 random experiments, the proposed TfrAdmCov model achieved better results than the other models on the COVID-19 testing dataset, with 99.924% accuracy, 97.18% recall, 98.57% F1-score and 98.54% MCC. In addition, to compare the proposed TfrAdmCov model statistically with the deep learning models, the results obtained by averaging 10 random trials with different random seeds were analyzed. A detailed evaluation was carried out for each model in terms of the accuracy, precision, recall, F1-score and MCC performance metrics, using statistical measures such as the mean, standard deviation, median, minimum and maximum.
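The TfrAdmCov figures on the COVID-19 testing dataset can be cross-checked from the class counts quoted above (335 correct and 9 incorrect in the "mutation" class, all 12386 correct in the "no mutation" class). The sketch below uses the standard definitions of the five metrics; the helper name `binary_metrics` is an assumption introduced for illustration.

```python
from math import sqrt

def binary_metrics(tp, fn, tn, fp):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# "mutation" is treated as the positive class.
metrics = binary_metrics(tp=335, fn=9, tn=12386, fp=0)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Rounded to two decimals, these values match the reported 99.93% accuracy, 100.00% precision, 97.38% recall, 98.67% F1-score and 98.65% MCC.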
Over the 10 accuracy values obtained on the COVID-19 testing dataset, the proposed TfrAdmCov model had a mean of 0.999238, a standard deviation of 0.000036, a median of 0.999214, a minimum of 0.999214 and a maximum of 0.999293. We also performed mutation prediction on the influenza A/H3N2 HA dataset to evaluate the performance of the proposed TfrAdmCov model. On the H3N2 HA testing dataset, the proposed TfrAdmCov model achieved better results than the other models, with 96.33% accuracy, 81.55% precision, 52.33% recall, 63.75% F1-score and 63.61% MCC. On this dataset, the model correctly predicted 853 of the 1630 samples in the "mutation" class and incorrectly predicted 777. It also correctly predicted 24577 of the 24770 samples in the "no mutation" class and incorrectly predicted 193. The results on the influenza H3N2 HA testing dataset showed that the proposed TfrAdmCov model is quite robust. Secondly, we propose a robust StackGridCov model for mutation prediction of the COVID-19 virus. The proposed StackGridCov model is based on ensemble learning and aims to maximize performance by combining many machine learning algorithms. The main reason for its success is that it reduces the possibility of overfitting by combining the strengths of several base models. These base models may make errors in different parts of the input sequences. By combining the outputs of these base classifiers, the meta-classifier can compensate for these errors and ultimately make a more accurate prediction. The proposed StackGridCov model is flexible, as different machine learning algorithms can be used in both the level-0 layer and the level-1 layer.
The proposed StackGridCov model is more robust than other ensemble learning and other artificial intelligence techniques because it is less affected by overfitting. The base learners are trained on the same training dataset, while the meta learner is trained on a new large dataset built by combining the predictions of these base classifiers on the training dataset, which ultimately reduces the possibility of overfitting. In this thesis, SVM, RF, XGBoost, ANN, DT, GB and ET were selected as the base learners at level-0, while AdaBoost was chosen as the meta classifier at level-1. This selection of base classifiers and meta classifier significantly improved the performance of the proposed StackGridCov model. In addition, we use the GridSearchCV hyperparameter tuning algorithm to improve the performance of the proposed StackGridCov model and the other models. To evaluate the performance of the proposed StackGridCov model and the other models, stratified 10-fold cross-validation was used in addition to the holdout technique. Additionally, to evaluate the performance of the proposed StackGridCov model, mutation prediction was performed on a dataset of the previously emerged influenza A/H1N1 HA virus. On the COVID-19 testing dataset, the proposed StackGridCov model with the GridSearchCV method outperformed the other algorithms, with an accuracy of 0.6623, an F1-score of 0.6723, an MCC of 0.3273 and an AUC of 0.7018. It also outperformed the StackGridCov model without GridSearchCV on the same dataset: GridSearchCV increased the accuracy (from 0.6016 to 0.6623), precision (from 0.5833 to 0.6415), recall (from 0.6566 to 0.7062), F1-score (from 0.6178 to 0.6723), MCC (from 0.2063 to 0.3273) and AUC (from 0.6133 to 0.7018).
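The level-0/level-1 arrangement with grid search and stratified cross-validation can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the thesis code: it uses synthetic data from `make_classification`, keeps only three of the listed base learners (SVM, RF, DT) to stay short, and searches a small hypothetical parameter grid. The helper name `build_search` is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_search(n_splits=10, seed=0):
    """Stacking ensemble (AdaBoost meta classifier) wrapped in a grid search."""
    base_learners = [                      # level-0 (subset, for brevity)
        ("svm", SVC()),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=seed)),
        ("dt", DecisionTreeClassifier(random_state=seed)),
    ]
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=AdaBoostClassifier(random_state=seed),  # level-1
        cv=3,  # folds used to build the meta learner's training inputs
    )
    param_grid = {                         # hypothetical grid
        "rf__max_depth": [3, None],
        "final_estimator__n_estimators": [25, 50],
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return GridSearchCV(stack, param_grid, cv=cv, scoring="accuracy")

# Synthetic stand-in data; 3 outer folds keep the demo fast
# (the abstract reports stratified 10-fold cross-validation).
X, y = make_classification(n_samples=150, n_features=10, random_state=0)
search = build_search(n_splits=3).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

In scikit-learn's `StackingClassifier`, the meta classifier is fitted on cross-validated predictions of the base learners rather than on their in-sample predictions, which is one common way to limit overfitting at level-1.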
On the COVID-19 testing dataset, the proposed StackGridCov model with the GridSearchCV method correctly predicted 399 of the 565 samples in the "mutation" class and incorrectly predicted 166. In the "no mutation" class, it correctly predicted 364 of the 587 samples and incorrectly predicted 223. Similarly, on the KFold dataset, the proposed StackGridCov model outperformed the other models, with an accuracy of 0.6610, a precision of 0.6614, an F1-score of 0.6607 and an MCC of 0.3226. It also outperformed the other models on the influenza A/H1N1 HA testing dataset, with an accuracy of 0.9460, a recall of 0.7969, an F1-score of 0.8093 and an MCC of 0.7780. Overall, using the GridSearchCV hyperparameter technique was observed to generally increase the performance of the proposed StackGridCov model and the other models. Thirdly, the HyperAttCov model, which is based on LSTM, HyperMixer and attention mechanisms, is proposed for COVID-19 virus mutation prediction. The proposed HyperAttCov model is able to capture the most relevant input features and long-term temporal dependencies in the input sequence. Two attention mechanisms (an input attention mechanism and a temporal attention mechanism) are used to maximize its performance by focusing on important parts of the COVID-19 dataset. The input attention mechanism is applied to the entire input dataset, while the temporal attention mechanism is applied to the data obtained from the HyperMixer architecture.
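The division of labor between an input attention step and a temporal attention step can be sketched generically. The abstract does not give the HyperAttCov equations, so everything below is an assumption for illustration: softmax-based weighting, the function names, the toy sizes, and a `tanh` projection standing in for the sequence model (it is neither an LSTM nor a HyperMixer).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def input_attention(x, w):
    """Reweight the features of every time step. x: (T, F), w: (F, F)."""
    alpha = softmax(x @ w, axis=-1)           # (T, F): one weight per feature
    return alpha * x

def temporal_attention(h, v):
    """Pool hidden states over time. h: (T, H), v: (H,)."""
    beta = softmax(h @ v, axis=0)             # (T,): one weight per time step
    return beta @ h, beta                     # context vector (H,), weights

rng = np.random.default_rng(0)
T, F, H = 5, 4, 3                             # toy sizes (assumed)
x = rng.normal(size=(T, F))
weighted = input_attention(x, rng.normal(size=(F, F)))
h = np.tanh(weighted @ rng.normal(size=(F, H)))   # placeholder sequence model
context, beta = temporal_attention(h, rng.normal(size=H))
```

Here `weighted` keeps the shape of the input, while `context` condenses all time steps into one vector whose contributions are given by `beta`.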
The proposed HyperAttCov model achieved better performance than many deep learning-based and machine learning models. A detailed examination of the experimental results showed that the proposed HyperAttCov model reached 70.0% accuracy, 92.0% precision and 46.5% MCC on the COVID-19 testing dataset. Similarly, averaged over 10 random trials, it achieved 70.2% accuracy, 90.4% precision and 46.2% MCC on the COVID-19 testing dataset. In addition, the proposed HyperAttCov model achieved very successful results on the COVID-19 testing dataset compared with the study in the literature. In conclusion, the proposed TfrAdmCov, StackGridCov and HyperAttCov models can successfully predict mutations that will occur in both the COVID-19 S protein and the influenza datasets. It was also observed that the agglomerative clustering algorithm and the GridSearchCV hyperparameter technique played an effective role in mutation prediction for the COVID-19 virus. The results obtained in this thesis are promising for vaccine and drug development.

Similar Theses

Thesis No
870170
Çok ilaca dirençli Mycobacterium tuberculosis suşlarının sekonder ilaçlara direnci ve dirençle ilişkili mutasyonların araştırılması
Investigation of resistance of multi-drug resistant Mycobacterium tuberculosis strains to secondary drugs and resistance-associated mutations
İHSAN KULAKSIZ
Medical Specialty
Turkish
2023
Microbiology, Istanbul University
Department of Medical Microbiology
PROF. DR. MELTEM UZUN
Thesis No
856743
Türkiye'de SARS-CoV-2 virüsünün s geni mutasyonlarının araştırılması ve filogenetik analizler
Investigation of s gene mutations of SARS-CoV-2 virus in Turkey and phylogenetic analyses
ASLI AĞAR
Master's
Turkish
2023
Biology, Gazi University
Department of Biology
PROF. DR. LEYLA AÇIK
Thesis No
812764
COVİD-19 testi pozitif çıkan ailevi akdeniz ateşi hastalarında mefv gen mutasyonlarının dağılımı
Distribution of mefv mutations in familial mediterranean fever patients who testing positive for COVİD-19
TUĞBA TEKELİ
Master's
Turkish
2023
Medical Biology, Necmettin Erbakan University
Department of Medical Biology
ASSOC. PROF. DR. HATİCE GÜL DURSUN
Thesis No
753507
Face mask detection using deep learning methods
Derin öğrenme yöntemleriyle yüz maskesi tespiti
YOUNUS ALQADIRI
Master's
English
2022
Computer Engineering and Computer Science and Control, Bahçeşehir University
Computer Engineering Division
ASST. PROF. DR. ZAFER İŞCAN
Thesis No
842533
Acil servise başvuran COVİD-19 enfeksiyonu olan hastalarda protrombin G20210A (faktör 2) ve PAI-1-4G/5G gen polimorfizminin incelenmesi
Investigation of prothrombin G20210A (factor 2) and PAI-1-4G/5G gene polymorphism in patients with covid-19 infection presenting to emergency department
AYŞEGÜL BAŞTAŞ
Medical Specialty
Turkish
2023
Emergency Medicine, Pamukkale University
Department of Emergency Medicine
PROF. DR. İBRAHİM TÜRKÇÜER
