Tagging and normalization of Turkish temporal expressions

Original title: Türkçe zamansal ifadelerin etiketlenmesi ve normalleştirilmesi

Thesis No: 684648
Author: AYŞENUR GENÇ
Advisors: ASSOC. PROF. DR. AHMET CÜNEYD TANTUĞ
Thesis Type: Master's
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2021
Language: Turkish
University: Istanbul Technical University (İstanbul Teknik Üniversitesi)
Institute: Graduate School (Lisansüstü Eğitim Enstitüsü)
Department: Department of Computer Engineering
Division: Computer Engineering
Page Count: 57

Abstract

Research on extracting information from unstructured text has an important place in natural language processing. Alongside structural tasks such as stemming, part-of-speech tagging, and dependency parsing, information extraction research has gained importance in recent years. Normalizing the semantic information identified in a text into a structured form is important so that the information can be used effectively in various natural language processing applications. Temporal expression tagging and normalization has an important place within information extraction systems. Expressions that carry information such as time, duration, frequency, or interval about the events in a text (e.g. bugün "today", iki ay sonra "two months later", 19 Temmuz'da "on 19 July", her hafta "every week") are called temporal expressions. Detecting temporal expressions and normalizing them according to a specified standard is a widely studied research area, particularly for languages such as English, Spanish, German, Chinese, and Arabic. In the literature, many temporal expression tagging and normalization systems have been presented for these languages, and datasets with manually or automatically annotated temporal expressions have been published. Semantic evaluation workshops have been organized to evaluate these systems on the datasets. To the best of our knowledge, no structured corpus annotated with temporal expressions has been published for Turkish so far. Moreover, during our literature review we did not encounter any system that performs Turkish temporal expression detection and normalization end to end. In this thesis, a rule-based temporal expression tagging and normalization system was developed. It is the first end-to-end system that also incorporates the morphological structure of Turkish, and it can be regarded as a foundational study in Turkish temporal expression extraction and normalization.
To be used in the development and testing of the system, the temporal expressions in 109 news texts were annotated manually. The dataset developed within the thesis has been made publicly available for use in future research. The developed system was run on the published test dataset. The performance of the system was measured with the precision and recall formulas used in temporal expression tagging studies. Temporal expressions in the text were detected with an F1 score of 89%, while normalization of the “type” and “value” attributes of the correctly detected expressions achieved F1 scores of 89% and 88%, respectively. In future work, higher-performing Turkish temporal expression tagging and normalization studies can be carried out by taking into account the shortcomings and recommendations discussed in the error analysis and system limitations sections.
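The F1 scores reported above combine precision and recall. As a minimal sketch of that calculation, assuming the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two), the following Python function could be used. The function name and the example counts are hypothetical and are not taken from the thesis.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from raw match counts."""
    predicted = true_positives + false_positives   # everything the tagger marked
    gold = true_positives + false_negatives        # everything in the annotation
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only, not results from the thesis.
p, r, f1 = precision_recall_f1(true_positives=8, false_positives=2, false_negatives=2)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

How a detected span is counted as a true positive (exact versus partial overlap with the gold span) is a separate design decision that this sketch leaves open.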

Abstract (Translation)

Extracting information from unstructured data is an active research topic in natural language processing. Identifying unstructured semantic information and normalizing the identified expressions are important for using semantic information effectively in many NLP applications. In named entity recognition (NER) systems, temporal expression identification is treated as a sub-categorization of entities of type Date, and it has recently attracted significant attention in information retrieval systems. The primary reason for treating temporal expression tagging as a main task separate from named entity recognition is to improve question answering performance on temporal questions, for instance “When was Einstein born?” or “When was Atatürk elected as the first president?”, because temporal question answering had become a non-trivial task in information retrieval systems. A temporal expression is a word or phrase that conveys information about the occurrence, repetition, or elapsed time of an event or an action. A temporal expression can be absolute, for example 12.01.2021, or relative to the document creation time, such as next month. Most studies in temporal expression tagging concentrate on detecting five temporal types: Date, Time, Duration, Interval, and Set. After a temporal expression is identified, its information should be converted to a structured form so that it is more useful in downstream NLP applications. This transformation step created the need for a normalization standard for detected temporal expressions. Almost all studies on temporal expression extraction and normalization rely on TimeML, a standard markup language for marking temporal expressions, events, and the relations between these events. TimeML provides 7 different tag schemes for annotating and normalizing events and temporal expressions.
However, in this study, we focus only on detecting and normalizing temporal expressions, excluding events and relations. Therefore, only the TIMEML and TIMEX3 tags are used in this work. Temporal expression identification and normalization systems have been proposed for many languages, for instance English, Spanish, German, Chinese, and Arabic. Several distinct approaches have been adopted for temporal expression tagging so far. The earliest works were rule-based systems; later, hybrid methods were proposed that combine rule-based systems with machine learning or scheme-based systems. Recent temporal expression identification and normalization systems take advantage of deep learning to build multilingual or more accurate systems. To develop, train, and test temporal expression tagging models, various manually or automatically annotated datasets and lexicons have been proposed in many languages. Evaluation workshops are organized to evaluate temporal expression tagging systems with specified evaluation metrics. However, no temporal expression tagger or temporally annotated corpus has been proposed for Turkish up to now. In this study, we developed a morphology-aware, rule-based temporal expression identification and normalization system, ITUTime, which is able to detect and normalize the Date, Time, Duration, Interval, and Set types. Our system relies on morphology-aware, regular-expression-based rules operating on the local context. It does not require a morphological analyzer or a part-of-speech (POS) tagger. Turkish is a morphologically rich language. Therefore, the use of morphological analysis tools is practical yet exhaustive for Turkish language processing. Instead of using a morphological analyzer and a morphological disambiguation tool to preprocess the text, we created simple lookup tables of possible inflections.
Most rule-based temporal identification and normalization systems require tokenization as a preprocessing step, and their rules run over token patterns. Our temporal expression identification rules simply run over the free text instead of tokens, which frees the proposed system from any preprocessing steps. Because of its rule-based structure and the lack of any preprocessing phase, our system is not able to process non-canonical text. ITUTime has four main submodules: text number normalization, detector, normalizer, and text number restoration. Initially, the text number normalization module uses regular expressions to convert numbers written in words to their numerical representations, e.g. converting two days to 2 days. To make it possible to reconstruct the original input text at the output, these rules keep the original word-based representations as markup attributes. In the detector component, a set of nested rules for each temporal type is executed sequentially to detect temporal expressions in the text and determine their TIMEX3 types. In total, 67 complex regular expression rules are defined in the detector module. We provide only textual regex rules to extract temporal expressions. In the regular expression rule set, we define a capturing group that contains all possible inflectional suffixes, in order to detect both nominative and inflected nouns. Once a temporal expression is identified by a rule, the extracted expression is not modified later by compositional or filtering rules, and its TIMEX3 type is set to the type of the rule set that matched. Therefore, applying the rules in a specific order is crucial for extracting temporal expressions correctly in our approach. To produce correct outputs, we apply the rules in the following order: Interval, Time, Date, Duration, and Set. The aim of the normalization module is to construct a structured representation of the unstructured time expressions captured by the detector module.
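The text number normalization and ordered detection steps described above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the number-word table, the unit and month lists, the suffix group, the three rules, and the function names are hypothetical and far smaller than the 67 rules of ITUTime, and Interval and Time rules (which would run first in the stated order) are omitted.

```python
import re

# Hypothetical lookup table of Turkish number words (tiny subset).
NUMBER_WORDS = {"bir": 1, "iki": 2, "üç": 3, "dört": 4, "beş": 5}
UNIT = r"(?:gün|hafta|ay|yıl)"  # day, week, month, year
MONTH = (r"(?:Ocak|Şubat|Mart|Nisan|Mayıs|Haziran|Temmuz|Ağustos|"
         r"Eylül|Ekim|Kasım|Aralık)")
# Assumed subset of inflectional suffixes, kept in one capturing group
# so that both nominative and inflected forms match.
SUFFIX = r"(?P<suffix>'(?:da|de|ta|te|dan|den|tan|ten))?"


def normalize_text_numbers(text):
    """Rewrite number words before a time unit as digits; keep the originals."""
    originals = []

    def replace(match):
        originals.append(match.group(1))
        return str(NUMBER_WORDS[match.group(1)])

    words = "|".join(NUMBER_WORDS)
    pattern = rf"\b({words})(?=\s+{UNIT}\b)"
    return re.sub(pattern, replace, text), originals


# Rules run over free text in a fixed order; earlier rule sets win.
RULES = [
    ("DATE", re.compile(rf"\b\d{{1,2}}\s+{MONTH}{SUFFIX}\b")),
    ("DURATION", re.compile(rf"\b\d+\s+{UNIT}\b")),
    ("SET", re.compile(rf"\bher\s+{UNIT}\b")),
]


def detect(text):
    """Return (start, end, type, surface) tuples sorted by position."""
    found = []
    for timex_type, rule in RULES:
        for m in rule.finditer(text):
            # A span already claimed by an earlier rule is never modified.
            if any(m.start() < end and start < m.end()
                   for start, end, _, _ in found):
                continue
            found.append((m.start(), m.end(), timex_type, m.group(0)))
    return sorted(found)


sentence = ("Toplantı 19 Temmuz'da başlayacak, iki gün sürecek "
            "ve her hafta tekrarlanacak.")
normalized, originals = normalize_text_numbers(sentence)
for start, end, timex_type, surface in detect(normalized):
    print(timex_type, repr(surface))
```

The `originals` list stands in for the markup attributes that the text number restoration step would use to put the word forms back. Skipping any match that overlaps an accepted span is one simple way to realize the "first rule wins" behavior; the actual system may resolve overlaps differently.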
This component is composed of two rule sets: (1) a set for exact temporals and (2) a set for relative temporals. Normalizing exact temporals seems relatively easy; however, expressions such as 2020'nin son çeyreği (last quarter of 2020), bu yılın son ayında (in the last month of this year), and similar cases make this task non-trivial. Normalizing relative temporals with respect to the document creation time is also challenging. To cover the possible relative temporal constructions, we have written distinct rules that take the surrounding keyword clues into account. At the end of the normalization procedure, each temporal is annotated with a TIMEX3 tag. Finally, we apply a post-processing step to restore the numbers originally written in words that were converted in the text number normalization module. To the best of our knowledge, no temporally annotated dataset has been published for Turkish. To build a temporal expression dataset in Turkish, we crawled 109 news articles, published between 3rd and 19th March 2018, from a daily Turkish newspaper. The collected articles belong to distinct categories: economy, politics, breaking news, movies, travel, and music. The dataset was annotated manually, following the TIMEX3 format described in the TimeML 1.2.1 guideline. We split the dataset into two parts: a development split (87 articles) and a test split (22 articles). In total, we annotated 109 articles containing 32,600 tokens and ended up with 1,147 TIMEX3 tags. Our tagger achieved an 89% F1-score in detection, and 89% and 88% F1-scores in normalizing the “type” and “value” TIMEX3 attributes, respectively, according to the specified precision and recall calculations commonly used in temporal extraction systems.
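As a rough illustration of normalizing relative temporals against the document creation time, the sketch below resolves a few day-level expressions and wraps the result in a TIMEX3 tag. The keyword table, the single offset pattern, the helper names, and the sample creation date are all assumptions; they are not the rules used in ITUTime, which cover many more constructions.

```python
import re
from datetime import date, timedelta

# Hypothetical keyword clues mapped to day offsets:
# dün = yesterday, bugün = today, yarın = tomorrow.
RELATIVE_DAYS = {"dün": -1, "bugün": 0, "yarın": 1}


def normalize_relative(expression, creation_date):
    """Return an ISO 8601 date for a relative expression, or None."""
    if expression in RELATIVE_DAYS:
        offset = RELATIVE_DAYS[expression]
        return (creation_date + timedelta(days=offset)).isoformat()
    # "<n> gün sonra" = n days later, "<n> gün önce" = n days earlier.
    m = re.fullmatch(r"(\d+) gün (sonra|önce)", expression)
    if m:
        sign = 1 if m.group(2) == "sonra" else -1
        offset = sign * int(m.group(1))
        return (creation_date + timedelta(days=offset)).isoformat()
    return None  # not covered by this sketch


def to_timex3(expression, value, timex_type="DATE", tid="t1"):
    """Wrap an expression in a TIMEX3 tag with tid, type, and value."""
    return (f'<TIMEX3 tid="{tid}" type="{timex_type}" '
            f'value="{value}">{expression}</TIMEX3>')


# Assumed document creation time, chosen for illustration only.
dct = date(2018, 3, 5)
value = normalize_relative("2 gün sonra", dct)
print(to_timex3("2 gün sonra", value))
```

The input "2 gün sonra" is assumed to have already passed through text number normalization, so the pattern only needs to handle digits.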

Similar Theses

Thesis No
603820
Recognition of non-manual signs in sign language
İşaret dilinde yüz ifadeleri ve kafa hareketlerinin tanınması
MÜJDE AKTAŞ
Master's
Turkish
2019
Computer Engineering and Computer Science and Control, Boğaziçi University
Department of Computer Engineering
PROF. DR. LALE AKARUN
Thesis No
967033
Hareket algısı-bilişsel zaman etkileşimi: Hareketle ilişkili uyaranların zaman algısı ve hareket yönünün zihinsel zaman çizgisi ile etkileşiminin etkileri
Motion perception-cognitive time interaction: Time perception of motion-related stimuli and effects of the interaction of motion direction with mental time line
REYHAN ÜNVER
Doctorate
Turkish
2025
Psychology, Istanbul University
Department of Psychology
PROF. DR. SEVTAP CİNAN
Thesis No
632294
Kur'ân'ı anlamanın imkânı: Erken dönem dil verilerinin kronolojik değeri
The possibility of understanding the Qur'an: The chronological value of the early linguistic data
HÜSEYİN ORAL
Doctorate
Turkish
2020
Religion, Erciyes University
Department of Basic Islamic Sciences
PROF. DR. ERDOĞAN PAZARBAŞI
Thesis No
163914
Açık kanallarda türbülanslı sınır tabakasının incelenmesi
No title translation available
GAZİ DARICI
Master's
Turkish
1986
Civil Engineering, Çukurova University
Department of Civil Engineering
ASSOC. PROF. DR. SALİH KIRKGÖZ
Thesis No
345189
William Alston'un din dili görüşü
William Alston's view of religious language
ÖMER FARUK ÖZDEMİR
Master's
Turkish
2013
Religion, Kahramanmaraş Sütçü İmam University
Department of Philosophy and Religious Sciences
ASST. PROF. NECATİ DEMİR