Finans alanında veri mahremiyeti yöntemleri: Federe öğrenme ve sentetik veri üretimi

Data privacy methods in finance: Federated learning and synthetic data generation

Thesis No: 953595
Author: ELİF ÖZCAN
Advisor: Assoc. Prof. Dr. YUSUF YASLAN
Thesis Type: Master's
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2025
Language: Turkish
University: İstanbul Teknik Üniversitesi
Institute: Graduate School
Department: Department of Computer Engineering
Division: Computer Engineering
Page Count: 71

Abstract

This thesis comparatively examines the classification performance and applicability of synthetic data generation and federated learning, two methods that can address the data privacy problem arising because financial datasets contain personal and sensitive data. Models trained with federated learning and synthetic data generation, on a dataset used for financial classification problems such as credit card default prediction, were compared with centralized or local models trained on real data. After it was shown that there is no loss of performance, the effect of synthetic data and federated learning on the class imbalance problem was demonstrated. To evaluate the applicability of graph-based models, the tabular data was converted into a graph structure and the contribution of topological features to model performance was analyzed. The results show that models can be trained without performance loss even when data sharing is not possible, and that these methods can also offer a solution to the class imbalance problem. Graph-based models substantially improved classification performance, and topological features provided a modest additional gain. This thesis offers important insights into how synthetic data generation and federated learning can be used together or separately to balance financial data privacy with high model performance, and it proposes solutions compatible with real-world applications and regulations.

Abstract (Translation)

This thesis compares the classification performance and applicability of synthetic data generation and federated learning, two approaches to the data privacy problems that arise because financial datasets contain sensitive and personally identifiable information. Both approaches were evaluated on a dataset commonly used for financial classification tasks such as credit card default prediction.

Before training, the dataset was split into training and test sets. In the federated learning setup, the training data was divided evenly among five clients. Synthetic data was generated separately for each client to simulate a more realistic and practical scenario. All experiments used the same test set for consistency, and each experiment was repeated ten times and averaged to reduce the effect of randomness on the results. In the federated learning experiments, training ran for 10 communication rounds. Classification performance was evaluated with metrics including accuracy, F1 score, recall, precision, and AUC. The quality of the synthetic data was assessed with metrics including the Column Shapes Score, Column Pair Trends Score, and Overall Quality Score.

Synthetic data generation creates an artificial dataset that statistically resembles the real data but contains no sensitive or personally identifiable information. This lets organizations share data or train models without breaching privacy regulations such as the General Data Protection Regulation (GDPR) or similar local laws. If the synthetic dataset reflects the statistical properties of the original closely enough, it can be used for analysis and machine learning training even when the real data is unavailable.
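As a rough illustration of what "statistically resembles" can mean for a single categorical column, the sketch below scores two columns by the complement of their total variation distance (1.0 means identical category frequencies, 0.0 means no overlap). Whether the Column Shapes Score reported in the thesis is computed this way is an assumption; the function names and the toy columns are hypothetical.

```python
from collections import Counter


def category_frequencies(values):
    """Map each category to its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def tv_complement(real_values, synthetic_values):
    """Return 1 minus the total variation distance between two columns."""
    real_freq = category_frequencies(real_values)
    synth_freq = category_frequencies(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq.get(c, 0.0) - synth_freq.get(c, 0.0)) for c in categories
    )
    return 1.0 - distance


# Toy columns (hypothetical): the synthetic column over-represents "default".
real = ["default", "paid", "paid", "paid"]
synthetic = ["default", "default", "paid", "paid"]
score = tv_complement(real, synthetic)
```

A score like this covers one column at a time; pairwise trends between columns would need a separate measure, and both input columns are assumed to be non-empty.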
Beyond supporting compliance with data privacy regulations, synthetic data can also help with class imbalance. The literature shows that in domains such as fraud detection, where the minority class is underrepresented, synthetic data generation improves model performance by augmenting the rare-class samples. It also lets organizations simulate rare events or counterfactual scenarios, enabling stress testing and risk modeling where historical data is scarce, and it facilitates secure collaboration and innovation across institutions by allowing privacy-compliant datasets to be released for research and development.

Federated learning is a machine learning approach that enables collaborative model training without sharing raw data. Traditional centralized machine learning requires collecting the data in a single location, which is often infeasible in sensitive domains like finance because of privacy regulations. In federated learning, each client trains a local model on its own data and shares only the model weights or gradients with a central server. The server aggregates the client updates using a chosen strategy and redistributes the updated model to all participants. Repeating this process over several rounds yields a global model. Research has shown that federated models can match or even exceed the performance of centralized or local models, particularly because of the greater data diversity they can draw on. Like synthetic data generation, federated learning can also address class imbalance.

In this thesis, multiple experiments were conducted to evaluate the utility of both approaches, and two aggregation methods were employed.
The first method, FedAVG, is a widely used federated learning technique that computes a weighted average of the model parameters from all participating clients, with weights based on local data size, so that the global model reflects each client's contribution. The second method, FedF1, was used specifically in the experiment designed to evaluate the effect of federated learning at the client level. FedF1 aggregates client models according to their F1 scores, giving more influence to clients whose models show higher classification performance.

In the first experiment, federated learning and synthetic data generation were compared with a traditional anonymization method, using four machine learning algorithms. Federated learning and synthetic data generation showed no performance degradation relative to centralized models and outperformed the anonymization method.

In the second experiment, centralized and federated models were trained on three versions of the dataset: real, synthetic, and hybrid (a combination of real and synthetic data). Local models were also compared to evaluate the client-level benefit of federated learning. Federated learning not only matched the performance of centralized models but also outperformed local models in client-level evaluations, making it a valuable alternative when centralized training is not feasible.

Several additional experiments assessed the effects of class imbalance. In one scenario, an extreme imbalance was simulated in which one of the three clients had data only from the majority class while the other two had balanced datasets. In this case, synthetic data generation was not feasible, but federated learning successfully leveraged knowledge from the other clients to improve minority-class predictions.
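The two aggregation strategies named in this abstract, FedAVG and FedF1, can both be read as weighted averages of client parameters that differ only in the weights. The sketch below is a minimal illustration under that reading: model parameters are flattened into plain lists, and FedF1 is assumed to weight clients in direct proportion to their F1 scores, which the abstract does not specify. All function names and numbers are hypothetical.

```python
def weighted_average(client_params, weights):
    """Average parameter vectors element-wise using the given client weights."""
    total = sum(weights)
    n_params = len(client_params[0])
    return [
        sum(w * params[j] for params, w in zip(client_params, weights)) / total
        for j in range(n_params)
    ]


def fedavg(client_params, client_sizes):
    """FedAVG: weight each client by its number of local training samples."""
    return weighted_average(client_params, client_sizes)


def fedf1(client_params, client_f1_scores):
    """FedF1 (assumed form): weight each client by its model's F1 score."""
    return weighted_average(client_params, client_f1_scores)


# Two hypothetical clients, each with a two-parameter model.
params = [[1.0, 2.0], [3.0, 4.0]]
by_size = fedavg(params, client_sizes=[1, 3])         # second client has more data
by_f1 = fedf1(params, client_f1_scores=[0.5, 0.5])    # equal F1 gives a plain mean
```

In a full training loop the server would send the aggregated vector back to every client and repeat for the configured number of communication rounds.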
To measure the impact of data balancing techniques, classical approaches were compared with the synthetic data generation method used in this thesis. Each method was applied to balance the dataset, and classification models were trained on the result. The synthetic approach outperformed the classical balancing methods, likely because the generated samples represent the features better.

To further improve classification performance, graph-based models were tested after the original tabular dataset was transformed into a graph using similarity measures. These models outperformed traditional machine learning algorithms. When tested under class imbalance using realistic test data (rather than cross-validation), F1 score and accuracy increased, although AUC and recall declined slightly. Topological features extracted from the graph structure were also evaluated and yielded modest improvements in classification performance.

Finally, the most effective graph-based models were integrated with synthetic data generation and federated learning. Combining these methods did not cause performance loss compared to centralized models.

This thesis offers valuable insights into how synthetic data generation and federated learning, used individually or in combination, can balance financial data privacy with high classification performance. It proposes viable solutions aligned with real-world applications and regulatory frameworks.
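The tabular-to-graph transformation mentioned above can be pictured with a small sketch. It assumes cosine similarity, a k-nearest-neighbor rule for adding edges, and node degree as the topological feature; the abstract names none of these, so all three choices, along with the function names and toy rows, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_graph(rows, k=1):
    """Link each row to its k most similar rows; return undirected edges."""
    edges = set()
    for i, row in enumerate(rows):
        scored = [
            (cosine_similarity(row, other), j)
            for j, other in enumerate(rows)
            if j != i
        ]
        scored.sort(reverse=True)  # most similar neighbors first
        for _, j in scored[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def node_degrees(n_nodes, edges):
    """Count edges per node, a simple topological feature."""
    degrees = [0] * n_nodes
    for i, j in edges:
        degrees[i] += 1
        degrees[j] += 1
    return degrees


# Four toy records (hypothetical) forming two similar pairs.
rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_graph(rows, k=1)
degrees = node_degrees(len(rows), edges)
```

The degree list could be appended to each record as an extra column before training; richer topological features would follow the same pattern.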

Similar Theses

Thesis No: 887328
Title: Privacy and security enhancements of federated learning
Title (Translation): Federe öğrenme uygulamalarında mahremiyet ve güvenlik geliştirmeleri
Author: ŞÜKRÜ ERDAL
Thesis Type: Master's
Language: English
Year: 2024
Subject: Computer Engineering and Computer Science and Control
University: İstanbul Teknik Üniversitesi
Department: Department of Informatics Applications
Advisors: PROF. DR. ENVER ÖZDEMİR; DR. FERHAT KARAKOÇ

Thesis No: 832713
Title: Blok zinciri ve sağlık uygulamaları
Title (Translation): Block chain and health application
Author: HAMİT MIZRAK
Thesis Type: Master's
Language: Turkish
Year: 2023
Subject: Computer Engineering and Computer Science and Control
University: Malatya Turgut Özal Üniversitesi
Department: Department of Informatics
Advisor: Asst. Prof. Dr. SERPİL ASLAN

Thesis No: 761576
Title: Türkiye'de dijital devlet paydaşları arasındaki etkileşim süreçlerinde blokzincirinin kullanılabilirliği
Title (Translation): Usability of blockchain in interaction processes between digital government stakeholders in Turkey
Author: MUSTAFA SAYIN
Thesis Type: Master's
Language: Turkish
Year: 2022
Subject: Public Administration
University: Ankara Hacı Bayram Veli Üniversitesi
Department: Department of Public Administration
Advisor: PROF. DR. TÜRKSEL KAYA BENSGHİR

Thesis No: 887239
Title: Privacy-preserving mechanisms for face verification systems
Title (Translation): Yüz doğrulama sistemleri için gizliliği koruyucu mekanizmalar
Author: MARAM H. W. ALAGHBAR
Thesis Type: Master's
Language: English
Year: 2024
Subject: Electrical and Electronics Engineering
University: Yıldız Teknik Üniversitesi
Department: Department of Electronics and Communication Engineering
Advisor: PROF. DR. TÜLAY YILDIRIM

Thesis No: 953797
Title: Post quantum cryptography: homomorphic encryption
Title (Translation): Kuantum sonrası kriptografi: homomorfik şifreleme
Author: EMRULLAH ULUIŞIK
Thesis Type: Master's
Language: English
Year: 2025
Subject: Defense and Defense Technologies
University: İstanbul Teknik Üniversitesi
Department: Department of Mathematical Engineering
Advisors: PROF. DR. GÜLÇİN ÇİVİ BİLİR; Asst. Prof. Dr. ERDEM ALKIM
