Makine öğrenmesi yöntemleri ile web isteklerinde anomali tespiti

Anomaly detection in web requests using machine learning methods

Thesis No: 731374
Author: ÇAĞLAR ABABAY
Advisor: ASST. PROF. DR. FİGEN ÖZEN
Thesis Type: Master's
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2022
Language: Turkish
University: Haliç Üniversitesi
Institute: Institute of Graduate Education
Department: Department of Computer Engineering
Division: Division of Computer Engineering
Number of Pages: 53

Abstract

One of a website's top priorities is to provide uninterrupted service. Besides heavy user traffic, malicious attacks can also exhaust system resources and interrupt service. A service interruption causes both financial loss and loss of reputation for the website. It is also thought to have psychological effects on people. Many news reports, in our country and around the world, illustrate such outages and their effects. For this reason there is great interest in advances in anomaly detection. Such outages can cost large companies billions of dollars and may put small companies at risk of being wiped out of the market. In addition, when a very high-traffic website (such as a social media giant) goes down, a large share of the traffic that is normal for that site piles onto another website, which creates an anomaly there. This is not actually an attack; it is an anomaly caused by a disturbed balance. When the balance is disturbed, detecting anomalies and taking precautions also becomes harder. For these reasons, and given the scale and importance of internet use today, it is clear that this study can be of great benefit in all fields. The aim of this thesis is to detect abnormal requests in user traffic and to enable them to be blocked quickly. At its core is the separation of normal web requests from abnormal ones, that is, the classification of requests. Whether traffic is anomalous is computed by looking at all requests made within a given time window (for example, the last 1 hour or the last 6 hours). By blocking abnormal traffic that is detected instantly or within a very short time, the website can maintain continuity of service or limit any interruption to a very short one.
Popular machine learning methods (artificial neural network, boosted decision tree, Naive Bayes classifier, logistic regression classifier, support vector machine, autoencoder) were used and compared. The core of the study is to classify the visits within a given time window using a set of metrics. For example, if an average request takes 100 milliseconds to process, a 500-millisecond request may constitute an anomaly. Using this and many similar metrics, the study computes how far, as a vector distance, the requesting IP address lies from normal behavior, and if a threshold is exceeded the IP address is judged to be anomalous. The comparisons were based on accuracy percentages, F1 scores, and running time. Experiments were run on data from a real website, consisting of requests per IP address over a specific date range. This raw data was first processed and made ready before any machine learning method was applied. The raw data consists of date-based rows, one per request. The prepared data is organized per IP address and consists of parameters such as how many requests per second that IP address made within the time window, how many milliseconds its requests took on average, and the size of its requests. The data also contains a real attack, and the IP addresses that carried it out are known. In the experiments, these real anomaly cases strengthened the reliability of the results and also supported the training of the models. Development was done in Python, one of the most suitable programming languages for working with data. Python 3.7, one of the recent stable versions at the time, was used.
In addition, the widely used “numpy”, “sklearn”, and “tensorflow” libraries were used for data processing, for the implementations of the machine learning methods, and for data visualization. Each machine learning method was prepared as a separate project working on the same data, so that the methods could be compared with one another and their efficiency measured more accurately. The outcome of this study is an application that many websites and their providers can use and that works independently of the type of website and traffic. In this way, websites can be protected from financial and reputational losses and provide uninterrupted service to their customers.
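The data preparation and distance threshold described in this abstract can be illustrated with a minimal sketch. It is not the thesis's code: the `Request` record, the functions `per_ip_features` and `distances_from_normal`, the z-score scaling, the assumption that a set of known-normal IP addresses is available as a baseline, the threshold of 3.0, and all sample values are hypothetical. It uses only the Python standard library rather than the libraries named above.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt
from statistics import mean, pstdev


@dataclass
class Request:
    """One raw log row (hypothetical layout)."""
    timestamp: float    # seconds since the start of the log
    ip: str
    duration_ms: float
    size_bytes: int


def per_ip_features(requests, window_start, window_end):
    """Aggregate raw request rows into one feature vector per IP address."""
    window_s = window_end - window_start
    grouped = defaultdict(list)
    for r in requests:
        if window_start <= r.timestamp < window_end:
            grouped[r.ip].append(r)
    return {
        ip: (
            len(rows) / window_s,               # requests per second
            mean(r.duration_ms for r in rows),  # mean duration in ms
            mean(r.size_bytes for r in rows),   # mean size in bytes
        )
        for ip, rows in grouped.items()
    }


def distances_from_normal(features, baseline_ips):
    """Euclidean distance of each IP from the baseline mean, in z-score units."""
    columns = list(zip(*(features[ip] for ip in baseline_ips)))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return {
        ip: sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(vec, mus, sigmas)))
        for ip, vec in features.items()
    }


# Hypothetical one-hour window: three quiet clients and one aggressive client.
normal_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
requests = []
for i, ip in enumerate(normal_ips):
    for t in range(0, 3600, 60):        # one request per minute
        requests.append(Request(t, ip, 100.0 + 5 * i, 2000 + 100 * i))
for t in range(3600):                   # one request per second, 500 ms each
    requests.append(Request(t, "10.0.0.9", 500.0, 2000))

THRESHOLD = 3.0                         # assumed cut-off, not from the thesis
features = per_ip_features(requests, 0, 3600)
distances = distances_from_normal(features, normal_ips)
flagged = sorted(ip for ip, d in distances.items() if d > THRESHOLD)
print(flagged)
```

Scaling each metric by the baseline's standard deviation keeps a metric with large raw values, such as size in bytes, from dominating the distance. Other distance measures or a learned classifier could take its place.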

Abstract (Translation)

One of the most important priorities of a website is to provide uninterrupted service. In addition to heavy user traffic, malicious attacks can also exhaust system resources and interrupt service. An interruption of service causes both financial loss and loss of reputation for the website. It is also thought to have psychological effects on people. There are plenty of news reports, in our country and around the world, about such outages and their effects. Therefore, there is great interest in the development of anomaly detection. Such service interruptions can mean the loss of billions of dollars for large companies and the danger of being wiped out of the market for small ones. In addition, when a very high-traffic website (such as a social media giant) is interrupted, most of the traffic that is normal for that site piles up on another website, which creates an anomaly. This situation is not in fact an attack; it is an anomaly caused by a disturbed balance. When the balance is disturbed, detecting anomalies and taking precautions becomes harder. The aim of this thesis is to detect abnormal requests in user traffic and to block them quickly. The basis for this is separating normal web requests from abnormal ones, that is, classifying requests. Anomaly status was calculated by looking at all requests made in a given time interval (such as the last 1 hour or the last 6 hours). Abnormal traffic detected instantly or within a very short time is blocked, so that the website stays available or suffers only a very short interruption. Popular machine learning methods (artificial neural network, boosted decision tree, Naive Bayes classifier, logistic regression classifier, support vector machine, autoencoder) were used and compared. The basis of the study is to classify the visits in a given time period using certain metrics.
For example, if the processing time of an average request is 100 milliseconds, a 500-millisecond request may constitute an anomaly. Using this and many similar metrics, the study calculates how far the requesting IP address lies, as a vector distance, from normal behavior, and if the threshold is exceeded the IP address is judged to be an anomaly. Comparisons were made on accuracy percentages, F1 scores, and running time. Experiments were carried out on data from a real website, consisting of requests per IP address over a specific date range. This raw data was processed first and made ready before any of the machine learning methods was applied. The raw data consists of date-based rows, one per request. The prepared data, on the other hand, is organized per IP address and consists of parameters such as how many requests per second that IP address made within the time interval, how many milliseconds its requests took on average, and the size of its requests. In addition, the data includes a real attack, and the IP addresses that carried it out are known. In the experiments, these real anomaly cases strengthened the reliability of the results and also supported the training of the models. Development was done in Python, one of the most suitable programming languages for working with data. Python 3.7, one of the recent stable versions at the time, was used. In addition, the widely used “numpy”, “sklearn”, and “tensorflow” libraries were used for data processing, for the implementations of the machine learning methods, and for data visualization. Each machine learning method was prepared as a separate project that works with the same data, so that the methods could be compared with each other and their efficiency measured more accurately. As a result of this study, an application has been created that can be used by many websites and their providers and that works regardless of the type of website and traffic.
In this way, websites will be protected from financial losses and loss of reputation, and they will be able to provide uninterrupted service to their customers.
Although the open-source datasets used in similar studies are ideal for a simulation environment, it is clear that they cannot produce accurate results in practice. In one study from 2019, the training dataset contains 67,343 normal requests and 45,927 requests labeled with attack types. A milder class imbalance makes the problem easier to solve. In real life, however, anomalies are expected to be far rarer than normal cases. In line with this expectation, cases where the class imbalance is much higher should also be evaluated during the modeling phase. In the 2019 study, the accuracy of the experiment with the support vector machine is 99%. In this study, the experiment with the support vector machine yields 97%. As mentioned before, however, this study used the user access records of a real website, so the class imbalance in the data is quite high. Another study, which uses an open-source user request log dataset with well-balanced classes, applied the decision tree method, as this study does. The accuracy in that experiment was 97.7%. In this study, the decision tree method was strengthened with the “AdaBoost” method, and the accuracy was 99.9% even though the class imbalance was very high. All of the studies reviewed in the literature used the openly available NSL-KDD dataset. In addition, in most of these studies the training data was prepared with equal amounts of normal requests and attack-labeled data in order to balance the classes. Yet any condition that qualifies as an anomaly should occur less often than the normal condition; otherwise it contradicts the very meaning of the word.
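The effect of class imbalance on accuracy can be made concrete with a small calculation. The sketch below computes the accuracy of a model that always predicts the larger class, first for the class counts quoted above and then for a hypothetical log with one attack request per thousand. The function name and the second set of counts are illustrative assumptions, not thesis data.

```python
def majority_baseline_accuracy(n_normal: int, n_attack: int) -> float:
    """Accuracy of a model that labels every request with the larger class."""
    return max(n_normal, n_attack) / (n_normal + n_attack)


# Class counts quoted above for the 2019 study's training set.
balanced = majority_baseline_accuracy(67_343, 45_927)

# Hypothetical log with 1 attack request per 1,000 (not the thesis data).
imbalanced = majority_baseline_accuracy(99_900, 100)

print(f"{balanced:.3f}")    # 0.595
print(f"{imbalanced:.3f}")  # 0.999
```

On the heavily imbalanced log, a model that never detects an attack already scores 99.9% accuracy, which is one reason to read accuracy together with the F1 score and the confusion matrix.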
When the classes in the training data are balanced artificially, all conditions in the simulation environment are ideal, so the results will be more successful and more predictable. A simulation environment, however, is expected to resemble the real environment closely. This section evaluates six different applications written in the Python programming language, all using the same dataset. These applications use, respectively, the artificial neural network, boosted decision tree, Naive Bayes classifier, logistic regression classifier, support vector machine, and autoencoder methods. The findings are evaluated using the accuracy score, the F1 score, and the confusion matrix. The running time of each application, in the same physical environment and on the same dataset, is also taken into account. Each method achieves different levels of success on different kinds of datasets, and examples of this can be seen in many studies. This section shows which of these methods is more efficient on the dataset used in this thesis.
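The evaluation criteria listed above (accuracy, F1 score, confusion matrix, running time on the same data) can be sketched as a small harness. This is an illustration under stated assumptions, not the thesis's code: the `evaluate` and `confusion_matrix` helpers, the two toy classifiers, and the sample rows are hypothetical stand-ins for the six methods and the real per-IP data.

```python
import time


def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as the anomaly class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(name, predict, rows, y_true):
    """Time one classifier and report accuracy, F1, and the confusion matrix."""
    start = time.perf_counter()
    y_pred = [predict(row) for row in rows]
    elapsed = time.perf_counter() - start
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # F1 = 2*TP / (2*TP + FP + FN); taken as 0 here if the denominator is zero.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"name": name, "accuracy": accuracy, "f1": f1,
            "confusion": (tp, fp, fn, tn), "seconds": elapsed}


# Hypothetical per-IP rows: (requests per second, mean duration in ms).
rows = [(0.02, 100.0)] * 995 + [(5.0, 500.0)] * 5
y_true = [0] * 995 + [1] * 5

# Toy stand-ins for the compared methods; each maps one row to a label.
candidates = {
    "always-normal": lambda row: 0,
    "duration-threshold": lambda row: 1 if row[1] > 300.0 else 0,
}
results = [evaluate(n, f, rows, y_true) for n, f in candidates.items()]
for r in results:
    print(r["name"], r["accuracy"], r["f1"], r["confusion"])
```

Running every candidate through the same function on the same rows keeps the metrics and timings comparable. The always-normal baseline scores high accuracy but an F1 of zero on this imbalanced sample, which the confusion matrix makes visible.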

Similar Theses

Thesis No: 492641
Title (Turkish): Uygulama katmanı için güvenlik duvarı geliştirilmesi
Title (English): An efficient firewall for web applications (EFWA)
Author: METİN ŞAHİN
Thesis Type: Master's
Language: Turkish
Year: 2018
Subject: Computer Engineering and Computer Science and Control
University: Gebze Teknik Üniversitesi
Department: Department of Computer Engineering
Advisor: PROF. DR. İBRAHİM SOĞUKPINAR

Thesis No: 805435
Title (Turkish): XSS saldırılarının tespiti için web uygulama güvenlik duvarı (WAF) ve makine öğrenme teknikleri kullanan hibrit bir yaklaşım
Title (English): A hybrid approach using web application firewall (WAF) and machine learning techniques to detect XSS attacks
Author: İDRİS OLCAY
Thesis Type: Master's
Language: Turkish
Year: 2023
Subject: Computer Engineering and Computer Science and Control
University: Eskişehir Osmangazi Üniversitesi
Department: Department of Computer Engineering
Advisor: ASST. PROF. DR. ESRA NERGİS YOLAÇAN

Thesis No: 745676
Title (Turkish): Web servislerinde mesajin iki katmanlı QR kod ile iletimi ve makine öğrenmesi yöntemleri ile tespiti
Title (English): Two layer QR code transmission of message in web services and detection with machine learning methods
Author: MİRSAT YEŞİLTEPE
Thesis Type: Doctorate
Language: Turkish
Year: 2022
Subject: Computer Engineering and Computer Science and Control
University: Yıldız Teknik Üniversitesi
Department: Department of Mathematical Engineering
Advisor: PROF. DR. MUHAMMET KURULAY

Thesis No: 833695
Title (English): Complex network-based link prediction in computer science, social science, and medical science publications in Iraq
Title (Turkish): Iraq'ta bilgisayar bilimi, sosyal bilimler ve tıbbi bilimler yayınlarında karmaşık ağ tabanlı bağlantı tahmini
Author: ALBATOL ABDULMAHDI SALEH AL-DHAYAB
Thesis Type: Master's
Language: English
Year: 2023
Subject: Engineering Sciences
University: Karabük Üniversitesi
Department: Department of Computer Engineering
Advisor: ASST. PROF. DR. EMRAH ÖZKAYNAK

Thesis No: 915356
Title (English): AI-powered web application security mechanisms
Title (Turkish): Yapay zeka destekli ağ uygulaması güvenliği düzenekleri
Author: DİLEK YILMAZER DEMİREL
Thesis Type: Doctorate
Language: English
Year: 2024
Subject: Computer Engineering and Computer Science and Control
University: İstanbul Teknik Üniversitesi
Department: Department of Computer Engineering
Advisor: ASST. PROF. DR. MEHMET TAHİR SANDIKKAYA
