Document classification using improved word embeddings

Geliştirilmiş kelime gömmeleri kullanarak belge sınıflandırma


Thesis No: 826408
Author: RAAD SAADI MAHMOOD MAHMOOD
Advisor: ASST. PROF. DR. AYHAN AKBAŞ
Thesis Type: Master's
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2023
Language: English
University: Çankırı Karatekin University
Institute: Graduate School of Natural and Applied Sciences
Department: Department of Electronics and Computer Engineering
Discipline: Not specified.
Page Count: 60

Abstract

In this study, one of the most important goals is to build a trained deep learning model that can classify documents in order to make predictions.

Abstract (Translation)

One of the primary objectives of this study is to develop a trained deep learning model that classifies documents. Neural networks have produced strong results in natural language processing, notably in document classification, and classification prediction has been studied extensively. Convolutional networks, recurrent networks, and various embedding mechanisms are used to extract (embed) text from documents at the sentence or word level. Early work used the Word2Vec model to represent words based on their context, and this was later combined with Long Short-Term Memory (LSTM) networks. Using N-gram features, together with the textual context and the associations between words, has been shown to improve prediction accuracy in classification tasks.

Previous studies have mainly based document classification on visual methods or document format, possibly concentrating on titles and abstracts. This thesis instead argues that classification should be based on the words a document contains. Using a dataset of 47,000 texts and topics, we employ word embeddings to determine document themes. The embedded words, drawn from a large body of text, are sorted into seven primary categories, which serve as the classes of the dataset. The data is then used to train and test deep learning models for document classification. Once trained, the model can classify documents automatically based on their embedded words and texts.

Our approach begins by extracting words from the texts in the dataset; two Word2Vec models are then built from them. The words are lemmatized, that is, reduced to their base form. Superfluous elements such as symbols and punctuation are removed so that only semantically significant words remain.
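The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual pipeline: the abstract does not say which lemmatizer was used, so the `LEMMAS` lookup table is a hypothetical stand-in for a real one, and the `STOP_WORDS` list and the ASCII-only letter filter are assumptions made for this example.

```python
import re
from typing import List

# Hypothetical lemma table standing in for a real lemmatizer; the abstract
# does not name the tool that was actually used.
LEMMAS = {
    "documents": "document",
    "classified": "classify",
    "networks": "network",
    "models": "model",
}

# Illustrative stop-word list (an assumption, not taken from the source).
STOP_WORDS = {"the", "a", "an", "of", "and", "are", "is", "by", "to", "in"}


def preprocess(text: str) -> List[str]:
    """Lowercase, strip symbols and punctuation, drop stop words, lemmatize."""
    # Replace every character that is not an ASCII letter or whitespace
    # with a space (assumes English-language texts).
    letters_only = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = letters_only.split()
    # Drop stop words, then map each remaining token to its base form if known.
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("Documents are classified by neural networks (LSTM)!"))
```

The resulting token lists are the kind of cleaned word sequences that would then be passed to the Word2Vec training step.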
These cleaned word sequences are then used to train the two models, which aim to learn associations between words from the word sequences in the text. The first model assigns vectors to words based on their context and tries to predict the context from those words. The second model, in contrast, relies on the relationships between words and predicts specific words based on classification, yielding a relational concept termed "neighbor word". Finally, we employ a deep learning model based on Long Short-Term Memory (LSTM), supported by the relationships learned by the two Word2Vec models. The two configurations are then evaluated against each other to determine which performs better in predictive classification.
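The two prediction directions described above can be illustrated with a small sketch. The abstract does not name the two architectures; the descriptions resemble Word2Vec's two standard variants, skip-gram (predict the context from a word) and CBOW (predict a word from its context), and that mapping is an assumption here. The function names and the `window` parameter are likewise illustrative. The code only builds the training pairs each variant would learn from; it does not train any vectors.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Neighbours of tokens[i] within `window` positions, excluding itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """(input word, context word to predict): the word -> context direction."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """(context words, word to predict): the context -> word direction."""
    return [
        (context_window(tokens, i, window), tokens[i])
        for i in range(len(tokens))
    ]


if __name__ == "__main__":
    sentence = ["document", "classify", "neural", "network"]
    for pair in skipgram_pairs(sentence, window=1):
        print("skip-gram:", pair)
    for pair in cbow_pairs(sentence, window=1):
        print("cbow:", pair)
```

In both variants the words inside the window play the role of the "neighbor words" mentioned in the abstract; the only difference is which side of the pair the model is asked to predict.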

Similar Theses

Thesis No: 828505
Title: A decision support system for reviewer assignment automation: An integrated approach with natural language processing and data-driven optimization
Turkish Title: Hakem atama otomasyonu için bir karar destek sistemi: Doğal dil işleme ve veri-güdümlü optimizasyon ile bütünleşik bir yaklaşım
Author: MELTEM AKSOY
Thesis Type: Doctorate
Language: Turkish
Year: 2023
Subjects: Computer Engineering and Computer Science and Control
University: Istanbul Technical University
Department: Department of Industrial Engineering
Advisors: ASSOC. PROF. DR. SEDA YANIK ÖZBAY, PROF. DR. MEHMET FATİH AMASYALI

Thesis No: 507612
Title: Turkish coreference resolution
Turkish Title: Türkçe eşgönderge çözümlemesi
Author: TUĞBA PAMAY
Thesis Type: Master's
Language: Turkish
Year: 2018
Subjects: Computer Engineering and Computer Science and Control
University: Istanbul Technical University
Department: Department of Computer Engineering
Advisor: ASST. PROF. DR. GÜLŞEN ERYİĞİT

Thesis No: 629631
Title: Combining latent dirichlet allocation and Word2vec for a novel document representation model
Turkish Title: Gizli dirichlet ayrımı ve Word2vec yöntemlerinin birleşimi ile özgün bir metin temsil modeli geliştirilmesi
Author: HALİL İBRAHİM ÇELENLİ
Thesis Type: Master's
Language: Turkish
Year: 2020
Subjects: Computer Engineering and Computer Science and Control
University: Kocaeli University
Department: Department of Computer Engineering
Advisors: ASSOC. PROF. DR. SEVİNÇ İLHAN OMURCA, ASSOC. PROF. DR. MURAT CAN GANİZ

Thesis No: 471840
Title: Text classification via word embeddings: An application for Turkish music mood detection
Turkish Title: Kelime temsilleri yoluyla metin sınıflaması: Türkçe müziklerde duygu tespiti uygulaması
Author: BARIŞ ÇİMEN
Thesis Type: Master's
Language: English
Year: 2017
Subjects: Computer Engineering and Computer Science and Control
University: Boğaziçi University
Department: Department of Management Information Systems
Advisor: ASST. PROF. DR. AHMET ONUR DURAHİM

Thesis No: 724931
Title: Deep learning methods with pre-trained word embeddings and pre-trained transformers for extreme multi label text classification
Turkish Title: Çoklu etiket sınıflandırması için önceden eğitilmiş kelime vektörleri ve önceden eğitilmiş transformer modelleri ile derin öğrenme yöntemleri
Author: NECDET EREN ERCİYES
Thesis Type: Master's
Language: English
Year: 2022
Subjects: Computer Engineering and Computer Science and Control
University: Çankaya University
Department: Department of Computer Engineering
Advisor: ASST. PROF. DR. ABDÜL KADİR GÖRÜR
