
Finans alanında veri mahremiyeti yöntemleri: Federe öğrenme ve sentetik veri üretimi

Data privacy methods in finance: Federated learning and synthetic data generation

  1. Thesis No: 953595
  2. Author: ELİF ÖZCAN
  3. Advisors: ASSOC. PROF. DR. YUSUF YASLAN
  4. Thesis Type: Master's
  5. Subjects: Computer Engineering and Computer Science and Control
  6. Keywords: Not specified.
  7. Year: 2025
  8. Language: Turkish
  9. University: İstanbul Teknik Üniversitesi (Istanbul Technical University)
  10. Institute: Lisansüstü Eğitim Enstitüsü (Graduate School)
  11. Department: Department of Computer Engineering
  12. Division: Computer Engineering
  13. Number of Pages: 71

Abstract

This thesis comparatively examines the classification performance and applicability of synthetic data generation and federated learning, two methods that can address the data privacy problem that arises because datasets in finance contain personal and sensitive data. Models trained with federated learning and synthetic data generation, on a dataset used for financial classification problems such as credit card default prediction, were compared against centralized or local models trained on real data. After showing that these methods incur no performance loss, the effect of synthetic data and federated learning on the class imbalance problem was demonstrated. To evaluate the applicability of graph-based models, the tabular data was transformed into a graph structure, and the contribution of topological features to model performance was analyzed. The results show that models can be trained without performance loss even when data sharing is not possible, and that these methods can also offer a solution to the class imbalance problem. Graph-based models substantially improved classification performance, and topological features provided a further modest gain. This thesis offers important insights into how synthetic data generation and federated learning can be used, together or separately, to balance financial data privacy against high model performance, and it proposes solutions compatible with real-world applications and regulations.

Abstract (Translation)

This thesis comparatively investigates the classification performance and applicability of synthetic data generation and federated learning, two approaches that can address the data privacy issues arising from the sensitive and personally identifiable nature of financial datasets. These approaches were evaluated on a dataset commonly used in financial classification tasks such as credit card default prediction.

Prior to training, the dataset was split into training and test sets. In the federated learning setup, the training data was divided evenly among five clients. Synthetic data was generated separately for each client to simulate a more realistic and practical scenario. All experiments used the same test set to ensure consistency, and each experiment was repeated ten times and the results averaged to reduce the impact of randomness. In the federated learning approach, each experiment trained the model over 10 communication rounds. Classification performance was measured with metrics such as accuracy, F1 score, recall, precision, and AUC. The quality of the synthetic dataset was assessed with metrics such as Column Shapes Score, Column Pair Trends Score, and Overall Quality Score.

Synthetic data generation involves creating an artificial dataset that statistically resembles real data but does not contain any sensitive or personally identifiable information. This approach enables organizations to share data or train models without breaching privacy regulations such as the General Data Protection Regulation (GDPR) or similar local laws. If the synthetic dataset sufficiently reflects the statistical properties of the original, it can be used for analysis and machine learning training even in the absence of real data.
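
The evaluation protocol described above (one held-out test set, an even split of the training data across five clients, and ten averaged repetitions) can be sketched as follows. This is only an illustration: the helper names, the 20% test fraction, and the round-robin partitioning are assumptions, since the abstract does not state how the split was implemented.

```python
import random
from statistics import mean


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle once and hold out a fixed test set (test_fraction is an assumed value)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def partition_evenly(rows, n_clients=5):
    """Deal rows round-robin so that client sizes differ by at most one."""
    return [rows[i::n_clients] for i in range(n_clients)]


def average_over_runs(run_experiment, n_runs=10):
    """Call run_experiment(seed) n_runs times and average the returned metric."""
    return mean(run_experiment(seed) for seed in range(n_runs))


rows = list(range(100))            # stand-in for the dataset rows
train, test = split_train_test(rows)
clients = partition_evenly(train)  # five client shards, same test set for all
```

In a real experiment, `run_experiment` would train a model with the given seed and return a metric such as the F1 score on the shared test set.
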
Beyond enabling compliance with data privacy regulations, synthetic data can also help address class imbalance. The literature shows that in domains such as fraud detection, where minority classes are underrepresented, synthetic data generation improves model performance by augmenting the rare-class samples. Synthetic data generation also allows organizations to simulate rare events or counterfactual scenarios, enabling stress testing and risk modeling when historical data is scarce. Furthermore, it facilitates secure collaboration and innovation across institutions by enabling the release of privacy-compliant datasets for research and development.

Federated learning is a machine learning approach that enables collaborative model training without sharing raw data. In traditional centralized machine learning, data must be collected in a single location, which is often not feasible in sensitive domains like finance due to privacy regulations. In federated learning, each client trains a local model on its own data and shares only the model weights or gradients with a central server. The server then aggregates the client updates using a chosen aggregation strategy and redistributes the updated model to all participants. This process is repeated for several rounds, eventually yielding a global model. Research has shown that federated learning models can achieve performance comparable to or even exceeding that of centralized or local models, especially due to the greater data diversity they can draw on. Like synthetic data generation, federated learning can also address class imbalance.

In this thesis, multiple experiments were conducted to evaluate the utility of both approaches, and two different aggregation methods were employed.
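
The train, share, aggregate, and redistribute loop described above can be illustrated with a deliberately tiny example. The sketch below assumes a one-parameter linear model `y = w * x`, plain gradient descent on each client, and an unweighted mean on the server. All of these are placeholders chosen for brevity, not the models or aggregation strategies used in the thesis.

```python
def local_update(w, data, lr=0.05, epochs=5):
    """Run a few gradient-descent steps on one client's private (x, y) pairs."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_training(client_data, rounds=10, w0=0.0):
    """Each round: clients train locally, the server averages and redistributes."""
    w_global = w0
    for _ in range(rounds):
        # Only the updated weight leaves each client, never the raw data.
        local_weights = [local_update(w_global, data) for data in client_data]
        # Placeholder aggregation: unweighted mean of the client weights.
        w_global = sum(local_weights) / len(local_weights)
    return w_global


# Three hypothetical clients whose data all follow y = 2 * x.
client_data = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0)],
]
w_final = federated_training(client_data)
```

After ten rounds `w_final` lies very close to the true slope of 2, even though no client ever sees another client's data.
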
The first method, FedAVG, is a widely used federated learning technique that computes a weighted average of the model parameters from all participating clients, with weights given by their local data sizes, thereby producing a global model that reflects the contribution of each client. The second method, FedF1, was used specifically in the experiment designed to evaluate the effect of federated learning at the client level. FedF1 aims to optimize the global model by aggregating client models based on their F1 scores, giving more influence to clients whose models demonstrate higher classification performance.

In the first experiment, federated learning and synthetic data generation were compared with a traditional anonymization method. Four machine learning algorithms were used to evaluate their performance. Federated learning and synthetic data generation did not suffer performance degradation relative to centralized models and outperformed the anonymization method.

In the second experiment, centralized and federated models were trained on three different versions of the dataset: real, synthetic, and hybrid (a combination of real and synthetic data). To evaluate the client-level benefit of federated learning, local models were also compared. Federated learning not only matched the performance of centralized models but also outperformed local models in client-level evaluations, making it a valuable alternative when centralized training is not feasible.

To assess the effects of class imbalance, several additional experiments were performed. In one scenario, an extreme imbalance was simulated in which one of the three clients had data only from the majority class, while the other two had balanced datasets. In this case, synthetic data generation was not feasible, but federated learning successfully leveraged knowledge from the other clients to improve minority class predictions.
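
The two aggregation rules described above differ only in the weights given to each client. The sketch below shows FedAVG as a data-size-weighted average of parameter vectors. The F1-weighted variant is an assumption about how FedF1 might turn F1 scores into weights, because the abstract says only that clients with higher F1 scores receive more influence. Function names are illustrative.

```python
def weighted_average(client_params, weights):
    """Average parameter vectors coordinate-wise using the given client weights."""
    total = sum(weights)
    n_params = len(client_params[0])
    return [
        sum(w * params[i] for params, w in zip(client_params, weights)) / total
        for i in range(n_params)
    ]


def fedavg(client_params, client_sizes):
    """FedAVG: weight each client by its number of local training samples."""
    return weighted_average(client_params, client_sizes)


def f1_score(tp, fp, fn):
    """F1 from confusion counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_weighted(client_params, client_f1):
    """Assumed FedF1-style rule: weight each client by its F1 score."""
    return weighted_average(client_params, client_f1)


# Two hypothetical clients with two parameters each.
params = [[1.0, 0.0], [3.0, 4.0]]
global_by_size = fedavg(params, [100, 300])
global_by_f1 = f1_weighted(params, [0.25, 0.75])
```

With these inputs both rules give the second client three times the influence of the first, so the two results coincide. They diverge whenever a small client has a high F1 score or a large client has a low one.
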
To measure the impact of data balancing techniques, classical approaches were compared with the synthetic data generation method used in this thesis. Each method was applied to balance the dataset, and classification models were trained on the balanced data. The synthetic approach outperformed the classical balancing methods, likely due to better feature representation in the generated samples.

To further enhance classification performance, graph-based models were tested after the original tabular dataset was transformed into a graph via similarity measures. These models outperformed traditional machine learning algorithms. When tested under class imbalance using realistic test data (rather than cross-validation), F1 score and accuracy increased, although AUC and recall declined slightly. Additionally, topological features extracted from the graph structure were evaluated and yielded modest improvements in classification performance. Finally, the most effective graph-based models were integrated with synthetic data generation and federated learning. The combined use of these methods did not lead to performance loss compared to centralized models.

This thesis presents valuable insights into how synthetic data generation and federated learning, used individually or in combination, can help balance the preservation of financial data privacy against high classification performance. It proposes viable solutions aligned with real-world applications and regulatory frameworks.
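
The tabular-to-graph transformation mentioned above can be sketched as a nearest-neighbor graph over the rows. The abstract does not say which similarity measure, neighborhood size, or topological features were used, so cosine similarity, `k`, and node degree below are assumptions chosen only to make the idea concrete.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two numeric feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_graph(rows, k=2):
    """Undirected edge set linking each row to its k most similar rows."""
    edges = set()
    for i, row in enumerate(rows):
        sims = sorted(
            ((cosine_similarity(row, other), j)
             for j, other in enumerate(rows) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def degree_feature(edges, n_nodes):
    """Node degree, one simple topological feature that can be appended to a row."""
    degree = [0] * n_nodes
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return degree


# Four hypothetical rows forming two obvious pairs.
rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_graph(rows, k=1)
degrees = degree_feature(edges, len(rows))
```

The resulting edge set can be handed to a graph-based model, and the degree values can be appended to the original columns as extra features.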

Similar Theses

  1. Privacy and security enhancements of federated learning

    Federe öğrenme uygulamalarında mahremiyet ve güvenlik geliştirmeleri

    ŞÜKRÜ ERDAL

    Master's

    English

    2024

    Computer Engineering and Computer Science and Control, İstanbul Teknik Üniversitesi

    Department of Informatics Applications

    PROF. DR. ENVER ÖZDEMİR

    DR. FERHAT KARAKOÇ

  2. Blok zinciri ve sağlık uygulamaları

    Block chain and health application

    HAMİT MIZRAK

    Master's

    Turkish

    2023

    Computer Engineering and Computer Science and Control, Malatya Turgut Özal Üniversitesi

    Department of Informatics

    ASST. PROF. DR. SERPİL ASLAN

  3. Türkiye'de dijital devlet paydaşları arasındaki etkileşim süreçlerinde blokzincirinin kullanılabilirliği

    Usability of blockchain in interaction processes between digital government stakeholders in Turkey

    MUSTAFA SAYIN

    Master's

    Turkish

    2022

    Public Administration, Ankara Hacı Bayram Veli Üniversitesi

    Department of Public Administration

    PROF. DR. TÜRKSEL KAYA BENSGHİR

  4. Privacy-preserving mechanisms for face verification systems

    Yüz doğrulama sistemleri için gizliliği koruyucu mekanizmalar

    MARAM H. W. ALAGHBAR

    Master's

    English

    2024

    Electrical and Electronics Engineering, Yıldız Teknik Üniversitesi

    Department of Electronics and Communication Engineering

    PROF. DR. TÜLAY YILDIRIM

  5. Post quantum cryptography: homomorphic encryption

    Kuantum sonrası kriptografi: homomorfik şifreleme

    EMRULLAH ULUIŞIK

    Master's

    English

    2025

    Defense and Defense Technologies, İstanbul Teknik Üniversitesi

    Department of Mathematical Engineering

    PROF. DR. GÜLÇİN ÇİVİ BİLİR

    ASST. PROF. DR. ERDEM ALKIM