Deep learning-based solutions to the sound event detection problem
Utilizing footstep sound event detection by using cnn techniques for assuring property security
- Thesis No: 959379
- Advisor: PROF. DR. NEJAT YUMUŞAK
- Thesis Type: PhD
- Subjects: Computer Engineering and Computer Science and Control
- Keywords: Not specified.
- Year: 2025
- Language: Turkish
- University: Sakarya Üniversitesi
- Institute: Fen Bilimleri Enstitüsü
- Department: Department of Computer Engineering
- Discipline: Computer Engineering
- Page Count: 87
Abstract
Advanced camera systems produced for property security often prove insufficient because they can easily be disabled physically. To overcome this limitation, incorporating audio data associated with potential threats into security systems offers an important complement to traditional methods. Audio-based detection systems, which can be built with low-cost hardware such as microphones and have a compact footprint, constitute an effective and scalable alternative to vision-based security solutions. This study proposes a Convolutional Neural Network (CNN)-based deep learning method for classifying footstep sound events. The developed model focuses on detecting footstep sounds among environmental sounds, facilitating the early recognition of threats to property security. In the first stage of the study, the ReaLISED dataset, which contains a variety of real-life sound events, was used. However, the initial evaluation results based on this dataset showed relatively low performance. Analyses revealed that class imbalance and low-quality or inconsistent audio samples limited the model's capacity to learn generalizable patterns. To address these problems, in the second stage low-quality data were carefully filtered out and the dataset was enriched with high-quality footstep sound samples obtained from the Epidemic Sound platform. This refinement not only balanced the class distributions but also increased overall data quality, enabling the model to deliver more consistent and reliable classification performance under real-world conditions. Each sound event was converted into a Mel-Frequency Cepstral Coefficient (MFCC) representation. In addition, comprehensive parameter optimization was carried out on the MFCC settings; in particular, the n_fft and n_mfcc values were optimized with respect to feature resolution and classification accuracy. The resulting MFCC representations were used as image-like inputs for training the CNN model. The proposed model succeeded in detecting footstep sound events among 17 different sound categories with a 98% accuracy rate. The model's robustness was tested with Repeated Stratified K-Fold Cross-Validation using 5 folds and 10 repeats; F1-scores ranged from 0.905 to 0.992, with an average of 0.960. In addition, comparative evaluations against recent architectures such as ResNet50, ResNet101, EfficientNetB0, EfficientNetB4, and CRNN reinforced the effectiveness of the proposed CNN model. The strategic integration of openly available data into the system supports the principles of transparency and reproducibility, strengthening the system's practical value for real-world audio-based surveillance applications.
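A minimal sketch of the MFCC extraction step summarized above, assuming the librosa library; the sampling rate, the fixed frame count, and the padding scheme are illustrative assumptions, since the thesis only reports the choices n_fft = 2048 and n_mfcc = 20.

```python
import numpy as np
import librosa

N_FFT = 2048   # FFT window size reported in the study
N_MFCC = 20    # number of MFCC coefficients reported in the study

def extract_mfcc(path, sr=22050, max_frames=128):
    """Load one audio clip and return a fixed-size, image-like MFCC array."""
    signal, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC, n_fft=N_FFT)
    # Pad or truncate along the time axis so every clip yields the same shape.
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :max_frames]
    return mfcc  # shape: (N_MFCC, max_frames), usable as a 2-D CNN input

# Usage (illustrative): features = np.stack([extract_mfcc(p) for p in clip_paths])
```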
Abstract (Translation)
Although camera-based security systems developed for property protection have reached an advanced level in recent years, they continue to exhibit a number of significant vulnerabilities that limit their overall effectiveness. One of the most pressing concerns is their susceptibility to deliberate tampering or sabotage. In particular, these systems can be easily disabled or rendered inoperative through direct physical intervention by intruders, such as covering, destroying, or disconnecting the cameras. This vulnerability severely undermines their reliability, especially in high-risk or unsupervised environments. As a result, the need for supplementary or alternative security solutions that do not rely solely on visual data has become increasingly evident. In response to this need, researchers and system developers have begun exploring audio-based security systems, which detect and classify environmental sounds such as footsteps, door movements, and glass breakage. Audio-based event detection systems offer a range of practical advantages over their visual counterparts. These include simpler and more flexible installation, significantly lower hardware costs due to the affordability and availability of microphones, and greater resistance to tampering or sabotage. Unlike cameras, microphones can be easily concealed, making them less likely to be noticed or targeted by intruders. Furthermore, audio sensors can capture events occurring outside the direct line of sight, thus overcoming some of the spatial limitations associated with camera systems. Taken together, these advantages highlight the potential of audio-based systems as a cost-effective, resilient, and complementary approach to enhancing property security.

This study proposes a deep learning-based approach for the classification of footstep sound events, addressing a critical component of intelligent surveillance systems. Footstep sounds are widely regarded as a primary acoustic indicator of human presence and can serve as early warnings of unauthorised entry, especially in secured or restricted areas. Unlike general-purpose sound event detection models, the model developed in this research is specifically optimised to distinguish footstep sounds from a variety of other environmental noises commonly found in indoor settings. The classification task is treated as a binary problem: detecting whether a given audio signal contains a footstep sound or not.

To develop and refine the model, a two-stage experimental framework was employed. In the initial stage, the model was trained and evaluated using the ReaLISED dataset, which contains a broad range of labeled environmental audio events recorded under realistic indoor conditions. While this dataset provided a useful starting point due to its variety and availability, the initial performance of the model was found to be suboptimal, especially in terms of generalisation and robustness. A deeper examination of the dataset revealed major issues affecting model performance: significant class imbalance and the inclusion of low-quality and mislabeled audio samples. The class imbalance was particularly problematic, as the dataset included a disproportionately small number of footstep events compared to other sound classes, causing the model to develop a bias toward the majority (non-footstep) class.
Furthermore, some audio clips labeled as "walking" were actually recordings of running, jumping, or stair climbing, each of which exhibits temporal and spectral characteristics distinct from normal footsteps. The dataset also contained segments with poor acoustic clarity, background interference, and faint or ambiguous signals, making it difficult for the model to extract meaningful features. These issues collectively hindered the model's ability to learn the generalised patterns necessary for reliable footstep detection.

To address the limitations identified in the initial phase, a second experimental stage was designed with a strong emphasis on dataset enhancement and refinement. This stage not only sought to resolve the issue of class imbalance by ensuring an equal distribution of footstep and non-footstep audio samples but also significantly improved the quality and diversity of the dataset. Additional high-quality footstep recordings were sourced from Epidemic Sound, a professional sound library, and integrated with carefully selected samples from the original ReaLISED dataset. The resulting collection, termed the "Footsteps Sound Dataset", was structured so that each class contained 333 audio samples, yielding a total of 666 clips. This deliberate balance between classes was essential to prevent the model from being biased toward dominant categories and to promote fair learning.

Prior to feeding the data into the classification model, all audio clips were processed into Mel-Frequency Cepstral Coefficient (MFCC) representations. These features, widely recognized for their ability to mimic the human auditory system's perception of sound, proved especially suitable for distinguishing footstep events from other types of audio. The MFCCs were structured as two-dimensional arrays, analogous to grayscale images, making them highly compatible with Convolutional Neural Networks (CNNs). A thorough hyperparameter optimisation process was conducted, focusing on key parameters such as n_fft (set to 2048) and n_mfcc (set to 20), which directly influence the resolution and detail of the extracted features. These optimised parameters were chosen after a series of empirical trials and literature-guided benchmarking to ensure optimal performance in both training stability and generalisation.

The CNN model trained on these MFCC inputs demonstrated strong classification capabilities, accurately distinguishing footstep sounds from a background of 17 other environmental sound categories. The model achieved a classification accuracy of approximately 98% on the validation set. To further assess the model's generalisability and robustness, an extensive Repeated Stratified K-Fold Cross-Validation procedure was employed. This validation scheme, consisting of 5 folds repeated 10 times, ensured that the model's performance was not contingent on a specific data split and mitigated the risk of overfitting. The model achieved F1-scores ranging from 0.905 to 0.992, with an average of 0.960 and a low standard deviation, indicating not only high overall performance but also consistency across multiple data partitions and experimental runs. These results underscore the effectiveness of the proposed CNN-based approach for reliable footstep detection in diverse acoustic environments.
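The validation protocol described above (5 stratified folds repeated 10 times, scored with F1) could be set up roughly as follows. This is a sketch assuming scikit-learn; `X`, `y`, and `build_model` are hypothetical placeholders for the MFCC feature array, the binary labels, and a constructor for the proposed CNN, none of which are specified at this level of detail in the thesis.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import f1_score

# X: MFCC feature array, y: binary labels (footstep / non-footstep),
# build_model: constructor for the proposed CNN -- all hypothetical placeholders.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
fold_scores = []
for train_idx, test_idx in rskf.split(X, y):
    model = build_model()
    model.fit(X[train_idx], y[train_idx])          # training details omitted
    probs = model.predict(X[test_idx])
    fold_scores.append(f1_score(y[test_idx], (probs.ravel() > 0.5).astype(int)))

print(f"F1 mean={np.mean(fold_scores):.3f}  "
      f"min={np.min(fold_scores):.3f}  max={np.max(fold_scores):.3f}")
```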
To comprehensively assess the proposed model's effectiveness, it was systematically benchmarked against several widely recognized deep learning architectures, each known for strong performance in audio classification and computer vision tasks. These included ResNet50, ResNet101, EfficientNetB0, EfficientNetB4, and Convolutional Recurrent Neural Networks (CRNNs), which combine convolutional layers with recurrent units to capture both spatial and temporal dependencies in audio signals. The comparative evaluation was carried out using identical datasets and experimental conditions to ensure fairness. Performance metrics such as precision, recall, and F1-score were used as the primary indicators of model capability. The results from these experiments consistently revealed that the custom-designed CNN model exceeded the performance of all benchmarked architectures in footstep detection tasks. While some models, particularly CRNN and ResNet50, produced results close to those of the proposed CNN, none matched its consistently high classification accuracy across all trials. These findings emphasise the benefits of a specialised, task-focused architecture tailored specifically for detecting footstep events within noisy, multi-class acoustic environments.

Another significant contribution of this study lies in the strategic integration of open-access audio data to construct a robust and diverse dataset. By incorporating professionally recorded samples from platforms such as Epidemic Sound and selectively combining them with the cleaned and curated portions of the ReaLISED dataset, this work ensured not only higher quality and balance in the training data but also enhanced reproducibility and openness in the research methodology. The use of publicly accessible sources contributes to the transparency and verifiability of the proposed approach, thereby supporting its adoption in both academic and applied contexts. Moreover, this strategy reinforces the practical relevance of the study by demonstrating that high-performing audio detection systems can be trained and validated using readily available resources, paving the way for real-world deployment in areas such as property security, smart buildings, and health monitoring systems. Ultimately, this research presents a scalable, efficient, and highly accurate framework for footstep sound detection, offering both theoretical advancement and practical utility.
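The reference architectures used in the comparative benchmark above are available in tf.keras.applications and could be instantiated along the lines of the sketch below. The input shape, the three-channel stacking of MFCC maps, and the binary classification head are illustrative assumptions rather than the configuration reported in the thesis, and the CRNN baseline is omitted because it requires a custom definition.

```python
import tensorflow as tf

# MFCC maps are assumed to be stacked to three channels and zero-padded so that
# both spatial dimensions meet the backbones' 32-pixel minimum; the exact shape
# is an illustrative choice, not the thesis configuration.
INPUT_SHAPE = (64, 128, 3)

BUILDERS = {
    "ResNet50": tf.keras.applications.ResNet50,
    "ResNet101": tf.keras.applications.ResNet101,
    "EfficientNetB0": tf.keras.applications.EfficientNetB0,
    "EfficientNetB4": tf.keras.applications.EfficientNetB4,
}

def make_candidate(name):
    """Build one candidate backbone with a binary footstep / non-footstep head."""
    base = BUILDERS[name](weights=None, include_top=False, input_shape=INPUT_SHAPE)
    return tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# Each candidate would then be trained and scored on the same folds as the
# proposed CNN, with precision, recall, and F1 compared across models.
```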
To further demonstrate the practical applicability of the proposed system, a real-world deployment scenario was developed and analysed. In this envisioned use case, multiple microphones are strategically positioned at various locations within a property, such as entryways, hallways, or near windows, to continuously capture environmental audio. These microphones transmit the collected data in real time to a central processing hub, which preprocesses the audio, extracts the relevant features, and runs the trained CNN-based footstep detection model. Upon identifying acoustic patterns indicative of human footsteps, particularly during user-specified active monitoring periods, the system forwards the results to a control interface. This control interface includes a user-friendly dashboard where property owners can configure system settings, define monitoring schedules, and manage notification preferences. When unexpected footstep activity is detected, especially during periods when no movement is anticipated, the system immediately triggers an alert mechanism. These alerts can be delivered through push notifications, emails, or other messaging services, ensuring that users are promptly informed of potential intrusions. The ability to selectively activate or deactivate the system adds flexibility and user control, enhancing both usability and reliability. This end-to-end framework highlights the model's efficiency and readiness for real-world application. Compared to traditional surveillance systems, which often rely solely on visual input and require complex hardware and infrastructure, the audio-based approach provides a more discreet, cost-effective, and tamper-resistant alternative. By leveraging the strengths of deep learning and acoustic analysis, the proposed system offers a robust solution for intelligent, real-time security monitoring.

In conclusion, this study introduces a reliable, low-cost, and high-accuracy audio-based approach for enhancing property security, particularly through the detection of footstep sound events. The proposed CNN-based model demonstrates strong performance and robustness, effectively addressing common limitations of traditional camera-based surveillance systems, such as high cost, limited coverage, and susceptibility to tampering. By functioning as a complementary layer to existing security infrastructure, or even as a standalone solution in specific contexts, this method significantly broadens the scope of intelligent surveillance technologies. Additionally, the carefully designed model architecture, the rigorous two-phase training process, and the extensive data preprocessing techniques employed throughout the study offer a solid foundation for future research. These contributions not only improve the feasibility of real-world audio-based detection systems but also serve as a valuable guideline for researchers and practitioners aiming to advance sound-based monitoring and intelligent audio analysis in both security and broader acoustic event detection applications.
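Finally, the alerting behaviour in the deployment scenario above could be captured by a small decision routine like the one below. The probability threshold, the monitoring-window format, and the `notify` callback are hypothetical placeholders, since the thesis describes the workflow rather than a concrete implementation.

```python
from datetime import datetime, time

FOOTSTEP_THRESHOLD = 0.5                            # assumed decision threshold
MONITORING_WINDOWS = [(time(22, 0), time(6, 0))]    # example: overnight only

def in_monitoring_window(now, windows=MONITORING_WINDOWS):
    """Return True if the given time falls inside any user-defined window."""
    t = now.time()
    for start, end in windows:
        if start <= end:
            if start <= t <= end:
                return True
        elif t >= start or t <= end:                # window wraps past midnight
            return True
    return False

def handle_prediction(footstep_prob, source, now=None, notify=print):
    """Trigger an alert when a footstep is detected during an active window."""
    now = now or datetime.now()
    if footstep_prob >= FOOTSTEP_THRESHOLD and in_monitoring_window(now):
        notify(f"[ALERT] Footsteps detected at {source} "
               f"({now:%Y-%m-%d %H:%M}), p={footstep_prob:.2f}")

# Usage (illustrative): handle_prediction(0.97, "hallway microphone")
```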
Similar Theses
- Lifelong learning for auditory scene analysis
İşitsel sahne analizi için hayat boyu öğrenme
BARIŞ BAYRAM
PhD
English
2022
Computer Engineering and Computer Science and Control, İstanbul Teknik Üniversitesi, Department of Computer Engineering
ASSOC. PROF. DR. GÖKHAN İNCE
- Classification of abnormal respiratory sounds using deep learning techniques
Solunum seslerinin derin öğrenme yöntemleri ile sınıflandırılması
AHAMADI ABDALLAH IDRISSE
Master's
English
2023
Computer Engineering and Computer Science and Control, Gazi Üniversitesi, Department of Computer Science
ASSOC. PROF. DR. OKTAY YILDIZ
- İnşaat projelerinde bilgi paylaşımının önündeki engellerin kaldırılmasının firma performansına etkileri: Bir örnek olay incelemesi
The effects of eliminating knowledge sharing barriers in the construction projects on the firm's performance: A case study
AYSEL KARDELEN VATANSEVER
Master's
Turkish
2016
Architecture, İstanbul Teknik Üniversitesi, Department of Architecture
ASSOC. PROF. DR. HAKAN YAMAN
- A distributed human identification system for indoor environments
Kapalı ortamlar için dağıtık mimarili insan tanıma sistemi
EMRE SERCAN ASLAN
Master's
English
2016
Computer Engineering and Computer Science and Control, İstanbul Teknik Üniversitesi, Department of Computer Engineering
ASST. PROF. DR. GÖKHAN İNCE
- Vibration and flutter analysis of fluid loaded plates
Akışkan yüklü eğimli plakların titreşim ve flater analizi
ABDURRAHMAN ŞEREF CAN