Paraphrase identification using knowledge-lean techniques

No title translation available.

  1. Thesis No: 403458
  2. Author: ASLI EYECİOĞLU ÖZMUTLU
  3. Advisors: Dr. BILL KELLER
  4. Thesis Type: Doctorate
  5. Subjects: Computer Engineering and Computer Science and Control
  6. Keywords: Paraphrasing, Knowledge-Lean, Twitter, Turkish, MSRPC, SVMs, N-grams, Overlap methods, Word2Vec
  7. Year: 2016
  8. Language: English
  9. University: University of Sussex
  10. Institute: Foreign Institute
  11. Department: Not specified.
  12. Discipline: Not specified.
  13. Page Count: 13

Abstract

No abstract available.

Abstract (Translation)

This research addresses the identification of sentential paraphrases; that is, the ability of an estimator to predict whether two sentence-level text fragments are paraphrases of each other. Paraphrase identification has practical importance in Natural Language Processing (NLP) because of the pervasive problem of linguistic variation. Accurate methods for identifying paraphrases should help to improve the performance of NLP systems that require language understanding, including key applications such as machine translation, information retrieval and question answering. Over the last decade, a growing body of research has been conducted on paraphrase identification, and it has become a distinct area of NLP. Our objective is to investigate whether techniques that require fewer resources can achieve results comparable to methods that employ more sophisticated NLP tools and other resources. These techniques, which we call “knowledge-lean”, range from simple, shallow overlap methods based on lexical items or n-grams to more sophisticated methods that employ automatically generated distributional thesauri. The work begins by focusing on techniques that exploit lexical overlap and text-based statistics and that have little need for NLP tools. We investigate the question “To what extent can these methods be used for paraphrase identification?” On two gold-standard datasets, we obtained competitive results on the Microsoft Research Paraphrase Corpus (MSRPC) and state-of-the-art results on the Twitter Paraphrase Corpus, using only n-gram overlap features in conjunction with support vector machines (SVMs).
These techniques do not require any language-specific tools or external resources and appear to perform well without the need to normalise colloquial language such as that found on Twitter. It was natural to extend the scope of the research to another language that is poor in resources. The scarcity of available paraphrase data led us to construct our own paraphrase corpus in Turkish. This corpus is relatively small but provides a representative collection that includes a variety of texts. While there is still debate as to whether binary or fine-grained judgements are more appropriate for a paraphrase corpus, we chose to provide data for a sentential textual similarity task with fine-grained scoring, knowing that this can be converted to binary scoring but not the other way around. The correlation between the results from different corpora is promising, which suggests that resource-poor languages can benefit from knowledge-lean techniques. The investigation of knowledge-lean techniques was then extended to techniques that use distributional statistical features of text by representing each word as a vector (word2vec). While recent research with word2vec focuses on larger fragments of text, such as phrases, sentences and even paragraphs, we present a new approach that introduces vectors of character n-grams carrying the same attributes as word vectors. The proposed method can capture syntactic as well as semantic relations without semantic knowledge, and it is shown to be competitive on Twitter compared to more sophisticated methods.
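
The n-gram overlap features mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which overlap measures were used, so the Jaccard and Dice scores below are assumptions chosen for illustration, and the function names (`ngram_set`, `overlap_scores`, `feature_vector`) are hypothetical. The sketch only builds features for a sentence pair; a vector like this could then be passed to an SVM classifier.

```python
def ngram_set(text, n, unit="word"):
    """Return the set of n-grams of `text` over words or characters."""
    # Assumption: lowercasing plus whitespace splitting is the only preprocessing.
    tokens = text.lower().split() if unit == "word" else list(text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_scores(s1, s2, n, unit="word"):
    """Return (Jaccard, Dice) overlap between the n-gram sets of two texts."""
    g1, g2 = ngram_set(s1, n, unit), ngram_set(s2, n, unit)
    shared = len(g1 & g2)
    union = len(g1 | g2)
    total = len(g1) + len(g2)
    # Texts shorter than n yield empty sets; score those pairs as 0.0.
    jaccard = shared / union if union else 0.0
    dice = 2 * shared / total if total else 0.0
    return jaccard, dice


def feature_vector(s1, s2, max_n=3):
    """Concatenate overlap scores for word and character n-grams, n = 1..max_n."""
    features = []
    for unit in ("word", "char"):
        for n in range(1, max_n + 1):
            features.extend(overlap_scores(s1, s2, n, unit))
    return features
```

Because the features rely only on surface strings, the same code applies unchanged to English tweets or Turkish sentences, which is the sense in which such methods need no language-specific tools.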

Similar Theses

  1. Bir uçağın hareket denklemlerinde yer alan kararlılık türevlerinin hesaplanmasında kullanılan alternatif yöntemler

    Determination of stability and control derivatives of an aircraft by using alternative methods

    MURAD ÖZKIRŞEHİRLİ

    Master's

    Turkish

    Turkish

    2002

    Aeronautical Engineering, İstanbul Teknik Üniversitesi

    Department of Aeronautical Engineering

    Assoc. Prof. Dr. RAMAZAN TAŞALTIN

  2. Paraphrase identification in turkish using distributed representations of words and sentences

    Dağıtık kelime ve cümle temsilleri kullanılarak türkçe eşanlatım tespiti

    HAKKI ENGİN YORGANCIOĞLU

    Master's

    English

    English

    2019

    Computer Engineering and Computer Science and Control, Ege Üniversitesi

    Department of International Computing

    Prof. Dr. BAHAR KARAOĞLAN

  3. Toplum sözleşmesinin felsefi kaynaklarını aramak: Thomas Hobbes, John Locke ve Jean Jacques Rousseau mukayesesi

    Looking for the philosophical sources of the social contract: A comparison of Thomas Hobbes, John Locke and Jean Jacques Rousseau

    MEHMET KUTLU

    Master's

    Turkish

    Turkish

    2020

    Political Science, Muş Alparslan Üniversitesi

    Department of Political Science and Public Administration

    Asst. Prof. Dr. YUSUF ÇİFCİ

  4. Une Etude critique et suggestions sur la fonction performative du langage

    No title translation available

    CİHAT ALGAN

    Master's

    French

    French

    1991

    French Language and Literature, Hacettepe Üniversitesi

    Assoc. Prof. Dr. ZEYNEL KIRAN

  5. Perîşannameya Mela Perîşanê Dînewerî (mein-analîz)

    The Perişanname of Molla Perişan-ı Dineverî (text analysis)

    EROL ŞAYBAK

    Master's

    Kurdish

    Kurdish

    2019

    Eastern Languages and Literature, Mardin Artuklu Üniversitesi

    Department of Kurdish Language and Culture

    Asst. Prof. Dr. SHAHAB VALI