Uçtan uca derin öğrenme yaklaşımlarıyla Türkçe eşgönderge çözümlemesi

Neural end to end Turkish coreference resolution


Thesis No: 930713
Author: TUĞBA PAMAY ARSLAN
Advisors: PROF. DR. GÜLŞEN ERYİĞİT
Thesis Type: Doctorate
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2025
Language: Turkish
University: İstanbul Teknik Üniversitesi
Institute: Lisansüstü Eğitim Enstitüsü
Department: Department of Computer Engineering
Division: Computer Engineering
Page Count: 165

Abstract

Coreference Resolution (CR) is the resolution of the referential relations between words (mentions) in a document that represent the same real-world entity (e.g., a person, place, or event). An important task in the semantic layer of Natural Language Processing (NLP), CR analyzes the context of a text in depth, helping the document to be understood correctly and the desired information to be extracted accurately. In this task, the words or phrases whose relations are to be resolved are called mentions. An end-to-end CR system consists of two stages: 1) Mention Detection and 2) Coreference Linking. In the mention detection stage, all referential mentions in the document are identified. The relations between these mentions are then resolved, and mentions that represent the same real-world entity are merged into the same mention cluster. Turkish is a morphologically rich, pro-drop language. These properties allow some pronouns to be left out of the text. A comprehensive CR system for Turkish must therefore treat dropped pronouns as mentions and resolve their relations, which is essential for correctly understanding the semantic coherence of a Turkish text. Information about a dropped pronoun is carried in the morphology of another word in the sentence. Words must therefore be examined not only in their original forms but also at the morpheme level, which makes the Turkish CR problem more complex than in other languages. A review of the CR literature shows that most studies have been carried out on English, while CR work on languages linguistically similar to Turkish has begun only in recent years.
Because the linguistic structure of Turkish requires coreference resolution at the morpheme level, systems developed for English cannot be applied directly to Turkish. The goal of this thesis is to build the first end-to-end Turkish CR model that takes the linguistic properties of Turkish into account and makes use of artificial neural network methods. To this end: 1) the structure of Turkish was examined with respect to dropped pronouns, a CR-specific annotation scheme was proposed for this information, and an up-to-date Turkish CR dataset was presented in which dropped pronouns are annotated as referential mentions under this scheme; 2) Turkish CR models using deep learning methods were developed with different CR approaches and their performance was compared; 3) work was completed toward including the proposed Turkish CR dataset in the relevant dataset collections so that it can be used in multilingual CR studies; 4) multilingual CR models that also cover Turkish were developed and their performance was compared; 5) finally, it was shown that multilingual CR models that build on decoder-architecture large language models and are trained with instruction-based training achieve the best performance on Turkish CR. In addition, the improvements made to the multilingual models also raised CR performance in other languages, particularly those linguistically similar to Turkish. In the thesis, the existing annotated Turkish CR dataset was improved and the referential relations of dropped pronouns were annotated, producing the most up-to-date Turkish CR dataset in the literature. The effect on Turkish CR performance of neural network models developed with different coreference resolution approaches (mention pair, mention ranking, end-to-end) was examined.
The effect of dataset quality and of dropped-pronoun annotations on the performance of Turkish CR models was investigated. The use of graph neural network layers in deep-learning-based Turkish CR models, and its effect on performance, was also examined. The monolingual models trained on Turkish were extended to a multilingual setting to evaluate the effect of cross-lingual transfer on Turkish CR performance. At this stage, the thesis examined how CR performance in Turkish and in other languages is affected by what the languages learn from one another. Given the morphological richness of Turkish, the effect of using linguistic information as features in CR models was studied on a multilingual CR dataset of Turkish and similar languages. Finally, the performance in Turkish and other languages of a multilingual CR model built on a decoder architecture with an instruction-based method was examined. The results show that deep learning methods improve Turkish CR performance. Turkish CR models trained on high-quality data achieved better results. Annotating dropped pronouns and training on these mentions also had a positive effect on overall CR performance. The hypothesis that graph neural networks would improve Turkish CR performance could not be confirmed. By developing multilingual models, the positive effects of cross-lingual transfer on Turkish CR performance were demonstrated. Using explicitly provided morphological features was observed to benefit CR performance in Turkish and in languages with similar linguistic properties. Finally, the multilingual CR model developed with instruction-based training, which draws on the strength of large language models, improved both Turkish and multilingual CR performance.

Abstract (Translation)

Coreference Resolution (CR) is the task of resolving referential relationships between words or phrases (i.e., mentions) in a document that represent the same real-world entity (e.g., a person, place, or event). CR, an important task in the semantic layer of Natural Language Processing (NLP), helps in understanding a document by analyzing its context, making it easier to extract the needed information correctly. An end-to-end CR system consists of two stages: 1) Mention Detection and 2) Coreference Linking. In the mention detection stage, all possible referential mentions are extracted. Then, coreferential relations between these automatically predicted mentions are resolved, grouping mentions that represent the same real-world entity into the same mention cluster. Turkish is a pro-drop, morphologically rich language (PD-MRL). In such languages, personal and possessive pronouns can be omitted if the entities they refer to can be understood from the context. A comprehensive CR system for Turkish must treat dropped pronouns as mentions and resolve their relationships so that a text's semantic coherence is understood correctly. These dropped pronouns can be identified through pronominal markers (i.e., verbal agreement or possessive suffixes) present in the morphology of other words within the same sentence. This requires examining not only the original forms of words but also their morpheme-level features, making the CR task in Turkish more complex than in other languages. A review of the CR literature shows that most studies have been conducted on English. Research on languages with linguistic features similar to Turkish has emerged only recently. Given these characteristics of Turkish, existing English-focused CR models cannot be directly applied to Turkish or to other languages with similar linguistic features.
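The second stage described above, grouping linked mentions into mention clusters, can be illustrated with a minimal sketch. It assumes that mention detection and pairwise linking have already produced a list of mentions and a list of coreference links. The function name `cluster_mentions`, the union-find grouping, and the `pro(...)` identifiers for dropped pronouns are illustrative assumptions, not the thesis's implementation or annotation format.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def cluster_mentions(
    mentions: Iterable[Hashable],
    links: Iterable[Tuple[Hashable, Hashable]],
) -> List[List[Hashable]]:
    """Group mentions into clusters from pairwise coreference links (union-find)."""
    ordered = list(mentions)
    parent: Dict[Hashable, Hashable] = {m: m for m in ordered}

    def find(m: Hashable) -> Hashable:
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[Hashable, List[Hashable]] = {}
    for m in ordered:  # keeps document order inside each cluster
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())


if __name__ == "__main__":
    # Toy input for "Ayşe kitabını aldı. Sonra eve gitti."
    # ("Ayşe took her book. Then (she) went home.")
    # "pro(...)" ids are hypothetical placeholders for dropped pronouns,
    # named after the word that hosts them.
    toy_mentions = ["Ayşe", "pro(kitabını)", "kitabını", "pro(gitti)"]
    toy_links = [("Ayşe", "pro(kitabını)"), ("pro(kitabını)", "pro(gitti)")]
    print(cluster_mentions(toy_mentions, toy_links))
```

In this toy input the two dropped pronouns end up in the same cluster as "Ayşe", while "kitabını" remains a singleton. A real system would obtain the links from a trained model rather than from a hand-written list.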
This thesis investigates CR in Turkish, considering the natural structure of Turkish, which presents requirements different from those addressed by existing models. The main aim of this thesis is to implement the first end-to-end Turkish CR model, considering the linguistic features of Turkish and utilizing artificial neural networks. This will bring Turkish CR studies in line with the latest advancements in the CR literature and improve performance in Turkish to a level on par with the achievements seen in studies of other languages. In this regard: 1) The structure of Turkish, in terms of dropped pronouns, has been analyzed, and an annotation scheme for the CR task has been proposed. A modern Turkish CR dataset has been created in which dropped pronouns are tagged as mentions under this scheme, 2) Turkish CR models utilizing deep learning methods and based on various CR approaches have been developed and their performances compared, 3) Efforts have been made to ensure the inclusion of the proposed Turkish CR dataset in multilingual CR dataset collections for wider use, 4) Multilingual CR models, including Turkish, have been developed, and their performances have been compared, 5) As a result, it has been demonstrated that decoder-only large language models fine-tuned using instruction-based training achieve the best performance on Turkish CR. Furthermore, improvements to the multilingual models have led to enhanced CR performance in languages linguistically similar to Turkish.
From this perspective, the contributions of this thesis are outlined as follows: 1) Introduction of a new publicly available Turkish Coreference Resolution (CR) dataset (i.e., ITCC). 2) Exploration of various neural CR architectures (i.e., Mention Pair, Mention Ranking, and End-to-End) for Turkish. 3) Presentation of the first neural end-to-end coreference resolution results for Turkish. 4) Demonstration of the impact of dropped pronouns on Turkish CR, marking the first such analysis in the literature. 5) Investigation of interlingual transfer in CR tasks that include Turkish. 6) Development of a neural multilingual end-to-end CR system that incorporates morphological information into transformer-based word embeddings. 7) Development of a graph attention neural network-based Turkish CR model, which is one of the pioneering studies in the literature utilizing this approach. 8) Introduction of a decoder-only large language model adapted using instruction-based fine-tuning for the multilingual coreference resolution task, for the first time in the literature.
In the thesis, an existing labeled Turkish CR dataset has been improved, and the coreferential relations of dropped pronouns have been annotated to create the newest and most comprehensive Turkish CR dataset (İTÜ Turkish Coreference Corpus - ITCC) in the literature. This dataset represents the first step in developing comprehensive CR models that can work with Turkish and also with other languages. The dataset is openly available for research purposes under the CorefUD 1.2 dataset collection. Turkish CR performance has been investigated by analyzing the impact of neural CR models developed with various CR approaches (mention pair, mention ranking, and end-to-end). The mention pair and mention ranking models are primitive ones that focus solely on the coreference linking stage. End-to-end models include architectures specifically designed for the CR task, as well as models in which a large language model is fine-tuned for CR using a sequence-to-sequence approach. The research first investigates Turkish CR models based on neural networks and presents the findings. The Turkish CR models leveraging neural networks outperform the traditional machine learning-based Turkish CR models in the literature.
The use of high-quality datasets has been shown to contribute positively to the performance of these data-driven Turkish CR models. Additionally, the presence of annotations for dropped pronouns and their inclusion in model training positively affects the resolution of coreferential relations between explicitly stated mentions (overt mentions) in documents. The first steps have been taken in the Turkish CR literature regarding the use of graph attention neural networks for the CR task, but no reliable improvement in performance has been achieved through this approach. In addition to the investigations on Turkish CR, this thesis also presents findings related to multilingual CR tasks. First, the impact of interlingual transfer in a multilingual CR model has been examined. This analysis demonstrates how knowledge learned in one language affects CR performance in another language. The research shows that multilingual training improves CR performance on Turkish, and that the inclusion of Turkish in multilingual training significantly improves CR performance in languages that share linguistic similarities with Turkish. It has been observed that using morpheme-level information for morphologically rich languages in a multilingual CR model improves both language-specific and cross-lingual CR performance. Finally, leveraging the power of the latest large language models, the performance of decoder-only large language models in the multilingual CR task has been examined. In this context, fine-tuning a large language model specifically for the CR task has resulted in a significant milestone in both Turkish CR and cross-lingual average CR performance, marking an important contribution to the literature.
As future work, the range of languages used in the development of multilingual models should first be broadened. Because dataset labeling is a costly process, however, it could be beneficial to focus on model fine-tuning using unsupervised learning methods. Furthermore, fine-tuning large language models for multiple NLP tasks is expected to improve performance on a specific task. In this context, fine-tuning CR models on other, similar tasks could promote transfer between tasks, leading to better coreference resolution performance.
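The abstract names mention ranking as one of the coreference linking approaches. As commonly formulated, mention ranking picks for each mention the single best-scoring earlier mention as its antecedent, whereas a mention-pair model classifies each pair independently. The sketch below illustrates only that selection step. The function `rank_antecedents`, the `dummy_score` threshold for "no antecedent", and the exact-match toy scorer are assumptions for illustration. The thesis's models use learned neural scorers, which are not reproduced here.

```python
from typing import Callable, List, Optional, Sequence


def rank_antecedents(
    mentions: Sequence[str],
    score: Callable[[str, str], float],
    dummy_score: float = 0.0,
) -> List[Optional[int]]:
    """Pick, for each mention, the index of its best-scoring earlier mention.

    Returns None for a mention when no candidate scores above `dummy_score`,
    which stands in for a "no antecedent" option.
    """
    chosen: List[Optional[int]] = []
    for i, mention in enumerate(mentions):
        best_index: Optional[int] = None
        best_score = dummy_score
        for j in range(i):  # only earlier mentions are candidates
            candidate_score = score(mentions[j], mention)
            if candidate_score > best_score:
                best_index, best_score = j, candidate_score
        chosen.append(best_index)
    return chosen


def exact_match_score(antecedent: str, mention: str) -> float:
    # Hypothetical placeholder scorer: rewards identical strings only.
    return 1.0 if antecedent.lower() == mention.lower() else -1.0


if __name__ == "__main__":
    toy = ["Ankara", "Istanbul", "ankara"]
    print(rank_antecedents(toy, exact_match_score))
```

With this toy scorer, the third mention selects the first as its antecedent and the other two select none. The chosen antecedent links can then be merged into clusters.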

Similar Theses

Thesis No: 776131
Title: Aydınlatmanın görüntü işleme problemlerine etkisinin yapay zeka teknikleri kullanılarak analizi
Title (English): Analysis of the effect of lighting on image processing problems using artificial intelligence techniques
Author: BİRKAN BÜYÜKARIKAN
Thesis Type: Doctorate
Language: Turkish
Year: 2022
Subject: Computer Engineering and Computer Science and Control
University: Konya Teknik Üniversitesi
Department: Department of Computer Engineering
Advisors: PROF. DR. ERKAN ÜLKER

Thesis No: 759206
Title: Beyin tümör tespiti için derin öğrenme temelli bilgisayar destekli tanı sistemi
Title (English): Deep learning based computer aided diagnostic system for brain tumor detection
Author: TARIKCAN DOĞANAY
Thesis Type: Master's
Language: Turkish
Year: 2022
Subject: Science and Technology
University: Gazi Üniversitesi
Department: Department of Health Informatics
Advisors: DOÇ. DR. OKTAY YILDIZ

Thesis No: 956166
Title: Fingerprint recognition with deep learning
Title (Turkish): Derin öğrenme ile parmak izi tanıma
Author: RESUL TAHA ÇALGIN
Thesis Type: Master's
Language: English
Year: 2025
Subject: Electrical and Electronics Engineering
University: Eskişehir Osmangazi Üniversitesi
Department: Department of Electrical and Electronics Engineering
Advisors: PROF. DR. ABDURRAHMAN KARAMANCIOĞLU

Thesis No: 595820
Title: Mini autonomous car architecture for urban driving scenarios
Title (Turkish): Şehir içi sürüş senaryoları için mini otonom araç mimarisi
Author: GÖKHAN KARABULUT
Thesis Type: Master's
Language: English
Year: 2019
Subject: Computer Engineering and Computer Science and Control
University: Orta Doğu Teknik Üniversitesi
Department: Department of Computer Engineering
Advisors: PROF. DR. TOLGA CAN; DR. ÖĞR. ÜYESİ SELİM TEMİZER

Thesis No: 683055
Title: Data-driven condition monitoring and fault diagnosis of VFD-fed induction motors
Title (Turkish): Değişken frekanslı sürücü ile beslenen asenkron motorlarda veri odaklı durum izleme ve arıza tanılama
Author: ALPER SENEM
Thesis Type: Master's
Language: English
Year: 2021
Subject: Electrical and Electronics Engineering
University: İstanbul Teknik Üniversitesi
Department: Department of Mechatronics Engineering
Advisors: PROF. DR. ŞENİZ ERTUĞRUL
