Yanıltıcı gözlem saldırılarına karşı çok etmenli rehberli derin pekiştirmeli öğrenme yaklaşımı

Multi agent guided deep reinforcement learning approach against state perturbed adversarial attack

Thesis No: 933958
Author: ÇAĞRI ÇERÇİ
Advisor: PROF. DR. HAKAN TEMELTAŞ
Thesis Type: Doctoral
Subject: Mechatronics Engineering
Keywords: Not specified.
Year: 2025
Language: Turkish
University: İstanbul Teknik Üniversitesi
Institute: Lisansüstü Eğitim Enstitüsü (Graduate School)
Department: Department of Mechatronics Engineering
Discipline: Not specified.
Page Count: 93

Abstract

Deep reinforcement learning (DRL) algorithms interact with the environment in both single-agent and multi-agent systems and aim to learn without any labeled data. In high-dimensional workspaces, they improve their policies to obtain the maximum reward they can collect. Actor-critic DRL algorithms can increase safety by optimizing the formation of a group of autonomous ground or aerial vehicles in tasks such as coordinated tracking and encirclement. DRL algorithms enable agents to reduce the risk of collision, avoid obstacles, and find the most efficient path to their targets. They help robots learn their environment by allowing them to adapt their strategies to the problems they encounter. Through actor-critic DRL algorithms, multi-agent robotic systems can therefore learn to make accurate, cooperative decisions and complete their tasks successfully. These algorithms have produced highly successful results in recent years, with applications in areas such as search and rescue, reconnaissance, military operations, firefighting, and autonomous vehicles. There are, however, situations the algorithms struggle to handle. In simulation environments, the observation data coming from the sensors is assumed to be received accurately and in full. Because of the nature of neural networks, when inputs differ from those used during training, the network cannot extrapolate well enough to handle them. The policy will struggle to produce the most accurate decision under uncertain conditions. This leaves the algorithm fragile against any corrupted state data that may arise in real-world applications. In the approach we developed for making suitable decisions under uncertain conditions, we model the uncertainty of the conditions with the SA-MDP framework.
To increase the robustness of the reinforcement learning algorithm, we propose a novel multi-agent guided reinforcement learning approach. In this approach, a guide actor network that observes clean data guides the control actor network during training. The guide actor is used only during training; agents make their decisions based solely on the control actor. The outputs of the guide actors are used as a regularizer in the loss function proposed for the control actor. Our algorithm was applied to the Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithms. When the outputs of the critic and value networks are computed, the control network is used for the actor being evaluated and the guide network for the other actors. This was observed to improve the reward feedback. The proposed algorithm was applied to a target tracking and encirclement task in a multi-agent simulation system built with pyglet. To show that the approach is scalable, the algorithm was trained separately for 3, 5, and 7 agents, and the results are presented. Performance is reported in detail in terms of angle, distance to the target, collision probability, and equidistance to the two nearest agents. In terms of total reward, compared with the conventional method, the Guided MA-SAC algorithm achieves an improvement of 14% for 3 agents, 11% for 5 agents, and approximately 14% for 7 agents. When MA-TD3 is compared with Guided MA-TD3, the reward gain is 9% for 3 agents, 14% for 5 agents, and 32% for 7 agents. These results show that our algorithm, trained to make robust decisions in a noisy environment, achieves performance close to that of the MA-SAC and MA-TD3 algorithms trained in noise-free environments.

Abstract (Translation)

Markov Decision Processes (MDPs) provide a fundamental framework for modeling decision-making problems. In an MDP, an agent interacts with its environment by taking actions, which lead to new states according to a defined transition model. The agent receives rewards as feedback on these actions, and its goal is to find the optimal policy, the one that maximizes the cumulative reward. This structure enables a systematic approach to a wide range of uncertain and complex problems. In scenarios involving multiple agents, MDPs can be extended to capture the interactions and dependencies between agents, which makes them suitable for modeling collaborative and competitive tasks. Reinforcement learning algorithms build on these principles to address both single-agent and multi-agent systems in dynamic environments. With the growth in processing power in recent years, Deep Reinforcement Learning (DRL) models are increasingly employed to solve highly complex problems. DRL algorithms operate on MDP principles. They learn without labeled data by interacting with the environment and training their policies to maximize cumulative reward. In high-dimensional workspaces, they explore the environment independently and continuously update their strategies based on the feedback they receive, which lets them adapt to dynamic and unpredictable conditions. Among DRL algorithms, Actor-Critic algorithms stand out for their ability to handle continuous action spaces and to learn effectively in multi-agent environments. In these algorithms, the actor network receives observation data from the environment and produces an action. The critic network takes the observation data and the action generated by the actor network as input and estimates the expected reward if the current policy is followed. Both networks are updated iteratively so that the expected reward increases.
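As a minimal sketch of the two quantities this paragraph relies on, the snippet below computes a discounted cumulative reward and a one-step bootstrapped target of the kind a critic is commonly regressed toward. The discount factor `gamma` and the function names are assumptions made for illustration; the abstract does not state them.

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward of one episode.

    `gamma` is an assumed discount factor, not a value from the thesis.
    """
    g = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, next_value, gamma=0.99, done=False):
    """One-step target for a critic: r + gamma * V(s'), or r at episode end."""
    return reward + (0.0 if done else gamma * next_value)


# Three steps of reward 1.0 with gamma = 0.5: 1 + 0.5 * (1 + 0.5 * 1) = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

In an actor-critic method the actor is then adjusted toward actions for which the critic's estimate is higher.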
Actor-Critic DRL algorithms are widely used in multi-agent systems. These algorithms help robots learn about their environment by adapting their strategies to the problems they encounter. Through Actor-Critic DRL algorithms, multi-agent robotic systems can thus learn to make accurate and cooperative decisions and complete their tasks successfully. These algorithms have yielded highly successful results in recent years, with applications in fields such as search and rescue, reconnaissance, military operations, firefighting, and autonomous vehicles. However, there are also situations in which these algorithms struggle. In simulation environments, it is assumed that the exact values of the observation data from the sensors are obtained accurately. This assumption can lead to a significant gap between simulation and reality, particularly where real-world sensor readings may be noisy, partial, or corrupted. Because of the nature of neural networks, when inputs differ from those used during training, the network cannot extrapolate reliably enough to handle them. Under uncertain conditions, the policy may struggle to produce the most accurate decision. This makes the algorithm fragile when it faces potentially corrupted state data and undermines its ability to perform reliably and safely. To address these challenges, within the scope of this thesis, adversarial models were trained with an optimal state attack strategy that minimizes the future cumulative reward. These adversarial models intentionally distort the state data to simulate worst-case scenarios that could be encountered in real-world applications. A separate adversarial model is used for each actor network. Each attack model attempts to manipulate the action obtained from the policy so that the total reward is minimized.
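The idea of a state attack can be illustrated with a small sketch in which an adversary proposes a distortion of the observation and the distortion is limited to a fixed budget. The per-component bound `eps` and the function name are assumptions for illustration only; in the thesis the distortion comes from trained adversarial models, whose form the abstract does not give.

```python
def perturb_observation(obs, delta, eps):
    """Return the observation the policy sees after a bounded state attack.

    obs   : true observation (list of floats)
    delta : distortion proposed by an adversary (same length as obs)
    eps   : assumed per-component attack budget (hypothetical parameter)
    """
    if len(obs) != len(delta):
        raise ValueError("obs and delta must have the same length")
    if eps < 0:
        raise ValueError("eps must be non-negative")
    # Clip each component of the distortion to [-eps, eps] before adding it.
    return [o + max(-eps, min(eps, d)) for o, d in zip(obs, delta)]


# The first component is clipped to the budget, the second is within it.
print(perturb_observation([0.0, 1.0], [0.5, -0.125], eps=0.25))  # [0.25, 0.875]
```

Only the observation passed to the policy changes; the true state of the environment is left untouched.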
To enhance robustness, we propose a guided approach that introduces a dual-actor system for each agent. The guide actor generates results based on healthy data, while the control actor observes only corrupted data. To improve the robustness of deep reinforcement learning algorithms in State Adversarial Markov Decision Process (SA-MDP) environments, the guide actor guides the control actor network, which makes decisions based on the outputs of the adversarial neural networks. The guide actor is used only during training. Agents make decisions solely based on the control actor. The outputs of the guide actors are used as a regularizer in the proposed loss function for the control actor. Additionally, when the results of the Critic and Value networks are obtained, the control network is used for the evaluated actor network, while the guide network is used for the other actors, which improves the overall reward feedback and the stability of the learning process. Our approach was found to produce resilient decisions under adversarial attacks. It was applied to the target tracking and encirclement task in a multi-agent simulation system developed using the Pyglet library. Each agent can observe the relative positions of the other agents and the target object, as well as its own velocity. Our algorithm was implemented on both the Multi-Agent Soft Actor-Critic (MA-SAC) and Multi-Agent Twin Delayed Deep Deterministic Policy Gradient (MA-TD3) algorithms. The reward function consists of four components: angle success, deviation from the desired distance to the target, probability of collision, and deviation from equal spacing between the two closest agents. Accordingly, agents can reduce the risk of collisions with each other and the target object while ensuring that the target object is tracked in a proper formation.
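The two mechanisms in this paragraph can be sketched as follows. The abstract does not give the form of the regularizer, so the mean squared difference between the control and guide actions and the weight `lam` are assumptions, as are all function names; this is an illustration, not the loss used in the thesis.

```python
def guided_actor_loss(policy_loss, control_action, guide_action, lam=1.0):
    """Control-actor loss with the guide actor's output as a regularizer.

    policy_loss    : the usual actor loss of the base algorithm (a float)
    control_action : action of the control actor on the corrupted observation
    guide_action   : action of the guide actor on the clean observation
    lam            : assumed regularization weight (hypothetical)
    """
    # Assumed penalty: mean squared difference between the two actions.
    mse = sum((c - g) ** 2 for c, g in zip(control_action, guide_action))
    mse /= len(control_action)
    return policy_loss + lam * mse


def joint_action_for_critic(i, control_actions, guide_actions):
    """Actions fed to the critic when agent i's actor is being evaluated.

    Agent i contributes its control action; every other agent contributes
    its guide action, as the paragraph describes.
    """
    return [control_actions[j] if j == i else guide_actions[j]
            for j in range(len(control_actions))]


print(guided_actor_loss(-1.0, [0.5, 0.0], [0.0, 0.0], lam=2.0))  # -0.75
print(joint_action_for_critic(1, ["c0", "c1", "c2"], ["g0", "g1", "g2"]))
```

Because the guide actor appears only in the training loss and in the critic inputs, it can be dropped at deployment, which matches the statement that agents act solely on the control actor.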
To demonstrate the scalability of our approach, our algorithm was applied separately to groups of three, five, and seven agents, and the results are presented in detail. The training phase was carried out over 20,000 episodes, each consisting of 50 steps. The testing phase was conducted over 5,000 episodes. In terms of cumulative reward, compared to traditional methods, the Guided MA-SAC algorithm achieved gains of 14% for three agents, 11% for five agents, and approximately 14% for seven agents. When the MA-TD3 algorithm was compared with the Guided MA-TD3 algorithm, there was a reward increase of 9% for three agents, 14% for five agents, and 32% for seven agents. These results indicate that our algorithm, trained to make robust decisions in a noisy environment, achieves success close to that of the MA-SAC and MA-TD3 algorithms trained in noise-free environments. The results also demonstrate the scalability of the proposed method. In the following sections, we first present the introduction and related developments in the literature. We then present state adversarial neural network algorithms and discuss their implementation in various scenarios. Subsequently, we explain in detail how we integrated our approach into the MA-TD3 and MA-SAC algorithms. We also provide a detailed description of our problem definition and the reward functions for the encirclement task. Finally, we explain our results thoroughly and discuss the direction we plan to take in future studies.
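For orientation, the training budget and the kind of percentage comparison quoted here can be worked out as below. The gain formula is an assumption about how such percentages are typically computed, and the reward values in the example are made up; the abstract reports only the percentages, not the underlying rewards.

```python
def total_steps(episodes, steps_per_episode):
    """Number of environment steps in a run."""
    return episodes * steps_per_episode


def relative_gain_percent(baseline, guided):
    """Percentage gain of `guided` over `baseline` (assumed formula).

    abs() keeps the sign meaningful when cumulative rewards are negative.
    """
    if baseline == 0:
        raise ValueError("baseline reward must be non-zero")
    return 100.0 * (guided - baseline) / abs(baseline)


# Training budget stated in the abstract: 20,000 episodes of 50 steps each.
print(total_steps(20_000, 50))  # 1000000

# Hypothetical rewards, chosen only to show what a 14% gain looks like.
print(relative_gain_percent(-100.0, -86.0))  # 14.0
```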

