Task-based clustering on search queries

Original title: Arama sorguları üzerinde görev tabanlı kümeleme


Thesis No: 507150
Author: ALMILA SELCEN AKGÜN
Advisor: ASST. PROF. DR. YUSUF YASLAN
Thesis Type: Master's
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2018
Language: Turkish
University: İstanbul Teknik Üniversitesi
Institute: Institute of Science (Fen Bilimleri Enstitüsü)
Department: Department of Computer Engineering
Discipline: Computer Engineering
Number of Pages: 67

Abstract

Task extraction from query texts is an important and interesting topic used in search engines and many search-based applications. Users frequently issue queries on web search engines during their daily activities and while making personal plans. Query texts processed with text mining methods are used to identify popular topics and to discover user characteristics. Similar queries entered by users can be grouped according to various features to extract meaningful tasks. Task extraction plays an important role in returning correct results for the searched domain, in search text completion, and in offering suggestions that match the user's intent. Existing approaches extract tasks using the user's session information, the contents of clicked documents, and query entities. The time from when the user opens the search engine until the user closes it is called the user session. Queries issued during this period may target a single task, or several tasks may be queried in parallel. In studies based on the semantic properties of query texts, the contents of the documents clicked in the results and article titles gain importance. The query entities used in previous studies are the proper names, place names, and phrases that appear in the search text. Entities improve performance in assigning to the correct task clusters words that carry one meaning when used alone in a query but a different meaning when combined. Recent studies form task clusters by using, in the clustering stage, Wikipedia category hierarchies and Probase taxonomy information, which describe entity-category relationships. In this study, entity categories are used for feature extraction rather than in the clustering stage. The aim is to measure the semantic similarity between search queries more precisely through categories.
Query feature vectors are used to compute the similarity between queries. Unlike other studies, n-gram and word2vec methods are used to vectorize textual features such as entity and category. URL, session, user, entity, and category features are combined in different ways, and the queries are clustered with centroid-based and density-based clustering methods. In this way, the effect of different feature sets on clustering performance is measured. Because numerical and textual features are used together in the clustering stage, the K-Medoids algorithm, a customized form of the K-Means algorithm, is used as the centroid-based method. The experimental results show that centroid-based clustering methods are affected by noisy data. To minimize the effect of noisy data, the DB-Scan algorithm, a density-based clustering method, was also run on the features, and the performance of the results improved. The different feature vectors built to represent search queries are evaluated by the performance of the resulting clusterings. Clustering results are compared by measuring the similarity of the elements within a task cluster and the distance between task cluster centers. Within-cluster evaluation uses how close the elements assigned to the same cluster are to each other (the compactness value). Between-cluster evaluation uses how far apart the cluster centers are once each query has been assigned to a cluster (the separation value). As a result of this study, the entity and category information obtained from queries is treated together as semantic features, vectorized with the word2vec method, and clustered with a density-based method, which improves task-based clustering performance.
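The centroid-based step mentioned above can be illustrated with a minimal K-Medoids sketch. It is not the thesis implementation: it assumes a precomputed pairwise distance matrix (for example, one minus cosine similarity), a simple alternating assign/update scheme, and hand-picked initial medoids. The function names and the toy matrix are hypothetical.

```python
def assign(distance, medoids):
    """Give each point the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: distance[i][medoids[c]])
            for i in range(len(distance))]

def k_medoids(distance, initial_medoids, max_iter=100):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(distance, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, label in enumerate(labels) if label == c]
            if members:
                # The medoid is the member with the smallest total distance to the others.
                best = min(members, key=lambda m: sum(distance[m][j] for j in members))
            else:
                best = old  # keep the previous medoid if a cluster is empty
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(distance, medoids)

# Toy distances: points 0-2 are close, points 3-4 are close, point 5 is an outlier.
distance = [
    [0.0, 0.1, 0.2, 0.9, 0.9, 0.7],
    [0.1, 0.0, 0.1, 0.9, 0.9, 0.7],
    [0.2, 0.1, 0.0, 0.9, 0.9, 0.7],
    [0.9, 0.9, 0.9, 0.0, 0.1, 0.8],
    [0.9, 0.9, 0.9, 0.1, 0.0, 0.8],
    [0.7, 0.7, 0.7, 0.8, 0.8, 0.0],
]
medoids, labels = k_medoids(distance, initial_medoids=[0, 3])
```

In this toy run the outlier (point 5) is still forced into the first cluster, which is one way to see why a centroid-based method can be sensitive to noisy queries.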

Abstract (Translation)

Task extraction from query texts is an important and interesting topic for search engines and many search-based applications. Users often search web search engines for their daily activities and personal plans. Query texts processed by text mining methods are used to determine popular topics and to explore user behavior. This thesis analyzes the queries entered into search engines, which collect large amounts of data, and extracts meaningful tasks from them. Similar queries entered by users can be aggregated according to their various features to form meaningful tasks. Task extraction plays an important role in providing suggestions that match the user's intent, in search text completion, and in returning correct results for the domain being searched. Task clusters are identified as phrases, expressions, or more complex representations. Existing approaches use session information, clicked document contents, and query entities for feature extraction. The time from when the user opens the search engine until the user closes it is called the user session. During this period, queries may be directed at a single task, or multiple tasks may be queried in parallel. A user session therefore cannot be treated as a single task, and it is difficult to understand the user's goal from the queries entered in a session. In addition to log features, some other approaches use semantic features extracted from search queries to find the disjoint tasks. Semantic features, such as entity or category, represent the meaning of part of a search query. In studies based on the semantic features of queries, the contents of the documents returned for a query and article titles become important. Proper names, place names, and phrases in the query text are labeled as query entities.
Entities improve the assignment to the correct task clusters of words that have one meaning when used alone in a query and another when they appear together. In recent studies, task extraction relies on Wikipedia category hierarchies or the Probase taxonomy, applied at the clustering stage. Queries are aggregated by entity category so that similar queries are assigned to the same task. The focus of this study is to increase task clustering performance by making the best use of the semantic and lexical features of queries. In this study, noisy search queries first pass through a data preprocessing step. First, punctuation marks are removed from the dataset. Next, stop words (e.g. “and, the, then, also, after” in English), which are widely used in the language but contribute no information to semantic analysis in natural language processing (NLP), are removed from the query log. Queries left empty by these operations are deleted from the dataset. Lexical and semantic features are then extracted from the processed data to calculate the similarity between queries. The entities of each query in the dataset are extracted with Dexter. A query may contain more than one entity, or none at all. For this reason, the experiments show that the entity vector improves clustering performance for queries that contain proper names but does not affect performance for queries made of general words. Dexter is also used to obtain the categories to which each entity belongs. An entity can have more than one category, and a category can contain more than one entity. Dexter maps category information using the DBpedia ontology, which is built from Wikipedia articles.
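The preprocessing steps described above (punctuation removal, stop-word removal, dropping empty queries) can be sketched as follows. This is a minimal illustration, not the thesis code: the stop-word list contains only the five examples quoted in the abstract, lowercasing is an added assumption, and the function names are hypothetical.

```python
import string

# Assumption: only the five example stop words from the abstract; the real list is not given.
STOP_WORDS = {"and", "the", "then", "also", "after"}

# Map every ASCII punctuation character to a space so adjacent words stay separated.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess_query(query: str) -> str:
    """Remove punctuation and stop words from one query (lowercasing is an assumption)."""
    tokens = query.translate(PUNCT_TO_SPACE).lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def preprocess_log(queries: list[str]) -> list[str]:
    """Clean every query and drop the ones that end up empty."""
    cleaned = (preprocess_query(q) for q in queries)
    return [q for q in cleaned if q]

raw_log = ["Cherry, or grape?", "and then the...", "benefit of banana!"]
clean_log = preprocess_log(raw_log)
```

The second raw query consists only of stop words and punctuation, so it disappears from `clean_log`.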
For instance, when the user enters the consecutive queries “cherry or grape” and “benefit of banana” into a web search engine, “cherry” and “grape” are tagged as query entities for the first query and “banana” is tagged as the entity for the second query. The category of the “cherry”, “grape”, and “banana” entities is “fruit”. The two queries yield different entities but share the same category feature. If entity information alone were used to measure semantic similarity, these queries would appear dissimilar. Adding category information to the feature set increases their similarity. In this study, entity categories are used for feature extraction rather than in the clustering phase. The aim is to measure the semantic similarity between search queries more precisely through categories. Unlike some other studies, the word2vec method is used to vectorize semantic features such as entity and category. In addition, the n-gram method, an NLP technique, is applied to the entity and category features. In another approach, used for comparison, the similarities between queries were calculated by applying the word2vec method to each word in the queries. Here the query texts are split into words at the space (“ ”) character. The vector representation of each word was obtained using the word2vec method, which uses the Google News corpus. Words not found in the corpus are treated as zero vectors so that they do not affect the result. If a query contains more than one word vector, the query vector is calculated as the sum of the word vectors. Query feature vectors are used in the similarity calculations between queries. Cosine similarity is used to compute the similarity of numerical and semantic features. The cosine similarities between the generated query vectors form a similarity matrix, which is used in the clustering process.
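The cherry/grape/banana example and the vector-sum construction can be combined in a small sketch. The four-dimensional vectors below are hypothetical stand-ins for word2vec embeddings (real Google News vectors are far larger), and returning 0 for an all-zero vector is a convention chosen here, not something the abstract specifies.

```python
import math

# Hypothetical toy vectors standing in for word2vec embeddings.
TOY_VECTORS = {
    "cherry": [1.0, 0.0, 0.0, 0.0],
    "grape":  [0.0, 1.0, 0.0, 0.0],
    "banana": [0.0, 0.0, 1.0, 0.0],
    "fruit":  [0.0, 0.0, 0.0, 1.0],
}
DIM = 4

def query_vector(features, vectors=TOY_VECTORS, dim=DIM):
    """Sum the vectors of a query's features; unknown features act as zero vectors."""
    total = [0.0] * dim
    for feature in features:
        for i, value in enumerate(vectors.get(feature, [0.0] * dim)):
            total[i] += value
    return total

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

def similarity_matrix(query_vectors):
    """Pairwise cosine similarities between all query vectors."""
    return [[cosine_similarity(u, v) for v in query_vectors] for u in query_vectors]

# Query 1: "cherry or grape", query 2: "benefit of banana".
entities_only = [query_vector(["cherry", "grape"]), query_vector(["banana"])]
with_category = [query_vector(["cherry", "grape", "fruit"]),
                 query_vector(["banana", "fruit"])]

sim_entities = similarity_matrix(entities_only)[0][1]
sim_with_category = similarity_matrix(with_category)[0][1]
```

With these toy vectors the two queries have no similarity on entities alone, and a positive similarity once the shared "fruit" category is added to both feature sets.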
The queries are clustered by centroid-based and density-based clustering methods with different combinations of feature sets. URL, session, user, entity, and category information is extracted from the search queries and used as feature sets. Thus, the effect of different feature sets on clustering performance is measured in this study. Centroid-based and density-based clustering algorithms determine the task clusters. Because numerical and semantic features are used together in the clustering phase, the K-Medoids algorithm, a customized version of the K-Means algorithm, is used as the centroid-based clustering method. Experimental results show that centroid-based clustering methods are affected by noisy data. In this study, the DB-Scan algorithm, a density-based clustering method, was also run on the feature vectors to compare the results of centroid-based and density-based clustering. DB-Scan clusters queries using two parameters: the minimum number of elements in a cluster and the maximum distance between two elements. To find the neighbors of each point in the dataset, the algorithm repeatedly calculates the distances to all other points. The similarity matrix is used to avoid the cost of these repeated distance calculations. Comparing the best results shows that the DB-Scan algorithm achieves more successful results than the K-Medoids algorithm. Density-based clustering methods are preferred in text-based studies where the number of clusters cannot be determined precisely in advance, because the number of clusters is determined by the dataset. To evaluate the results, the similarity among cluster members and the distance between cluster centers are measured. Within-cluster evaluation uses how close the elements assigned to the same cluster are to each other (the compactness value).
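The density-based step described in this paragraph can be sketched as a small DBSCAN that reads neighborhoods from a precomputed matrix instead of recomputing distances. This is an illustrative sketch, not the thesis code: it assumes the matrix holds distances (for example, one minus cosine similarity), and the names `dbscan_precomputed`, `eps`, and `min_samples` are chosen here for the two parameters the text mentions.

```python
def dbscan_precomputed(distance, eps, min_samples):
    """Run DBSCAN on a precomputed distance matrix.

    Returns one label per point: 0, 1, ... for clusters and -1 for noise.
    """
    n = len(distance)
    # Neighborhoods come straight from the matrix; each point counts as its own neighbor.
    neighbors = [[j for j in range(n) if distance[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, so keep expanding
    return labels

# Toy distances: points 0-2 are close, points 3-4 are close, point 5 is far from everything.
distance = [
    [0.0, 0.1, 0.2, 0.9, 0.9, 0.7],
    [0.1, 0.0, 0.1, 0.9, 0.9, 0.7],
    [0.2, 0.1, 0.0, 0.9, 0.9, 0.7],
    [0.9, 0.9, 0.9, 0.0, 0.1, 0.8],
    [0.9, 0.9, 0.9, 0.1, 0.0, 0.8],
    [0.7, 0.7, 0.7, 0.8, 0.8, 0.0],
]
labels = dbscan_precomputed(distance, eps=0.3, min_samples=2)
```

Here the number of clusters is not supplied up front, and the isolated point is labeled as noise (-1) rather than being forced into a cluster.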
Between-cluster evaluation uses how far apart the cluster centers are once each query has been assigned to a cluster (the separation value). Our experiments demonstrate the importance of using different feature sets together to improve task extraction. The effects of log features and of semantic feature generation with the word2vec method on queries are investigated. When the entity and category semantic features are used in task extraction, more comprehensive tasks are generated. When all features are taken into account, using word2vec to determine entity and category similarity has a positive effect on the success of task extraction. Experimental evaluation of the proposed method, which uses entity and category vectors generated by the word2vec method, indicates that it outperforms existing approaches. Entity and category vectors are treated as query features, and task-based clustering performance is increased by using a density-based clustering method.
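The abstract does not give formulas for the compactness and separation values, so the sketch below uses one common reading as an assumption: compactness is the mean Euclidean distance from each point to its own cluster center, and separation is the mean distance between all pairs of cluster centers. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def compactness(clusters):
    """Mean distance from each point to its own cluster center (lower is tighter)."""
    distances = []
    for members in clusters:
        center = centroid(members)
        distances.extend(euclidean(point, center) for point in members)
    return sum(distances) / len(distances)

def separation(clusters):
    """Mean distance between every pair of cluster centers; needs at least two clusters."""
    centers = [centroid(members) for members in clusters]
    pairs = list(combinations(centers, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

# Two toy clusters of 2-D query vectors.
clusters = [
    [[0.0, 0.0], [0.0, 2.0]],
    [[4.0, 0.0], [4.0, 2.0]],
]
within = compactness(clusters)
between = separation(clusters)
```

Under this reading, a good clustering has a low compactness value and a high separation value; other definitions (for example, using medoids or cosine distance) would follow the same pattern.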

Similar Theses

Thesis No: 899096
Title (English): Task based management of user queries for effective query suggestions
Title (Turkish): Etkin sorgu önerileri için kullanıcı sorgularının görev tabanlı yönetilmesi
Author: NURULLAH ATEŞ
Thesis Type: Doctorate
Language: Turkish
Year: 2024
Subjects: Computer Engineering and Computer Science and Control
University: İstanbul Teknik Üniversitesi
Department: Department of Computer Engineering
Advisor: ASSOC. PROF. DR. YUSUF YASLAN

Thesis No: 313548
Title (English): Cascaded cross entropy-based search result diversification
Title (Turkish): Çapraz entropi tabanlı kademeli arama sonuç çeşitlendirmesi
Author: BİLGE KÖROĞLU
Thesis Type: Master's
Language: English
Year: 2012
Subjects: Computer Engineering and Computer Science and Control
University: İhsan Doğramacı Bilkent Üniversitesi
Department: Department of Computer Engineering
Advisor: PROF. DR. FAZLI CAN

Thesis No: 436562
Title (English): Content-based search on microarray gene expression databases
Title (Turkish): Mikrodizi gen ifade veritabanlarında içerik-tabanlı arama
Author: ESMA ERGÜNER ÖZKOÇ
Thesis Type: Doctorate
Language: Turkish
Year: 2016
Subjects: Computer Engineering and Computer Science and Control
University: Ege Üniversitesi
Department: International Computer Department (Uluslararası Bilgisayar Ana Bilim Dalı)
Advisors: PROF. DR. MEHMET EMİN DALKILIÇ, PROF. DR. HASAN OĞUL

Thesis No: 847173
Title (English): Fake news classification using machine learning and deep learning approaches
Title (Turkish): Makine öğrenimi ve derin öğrenme yaklaşımlarını kullanarak sahte haber sınıflandırması
Author: SAJA ABDULHALEEM MAHMOOD AL-OBAIDI
Thesis Type: Master's
Language: English
Year: 2023
Subjects: Computer Engineering and Computer Science and Control
University: Gazi Üniversitesi
Department: Department of Computer Engineering
Advisor: ASST. PROF. DR. TUBA ÇAĞLIKANTAR

Thesis No: 605272
Title (English): The use of local data in architectural design through augmented reality
Title (Turkish): Mimari tasarımda artırılmış gerçeklik aracılığıyla yerel veri kullanımı
Author: FARUK CAN ÜNAL
Thesis Type: Doctorate
Language: English
Year: 2019
Subjects: Science and Technology
University: İstanbul Teknik Üniversitesi
Department: Department of Informatics
Advisor: ASSOC. PROF. DR. YÜKSEL DEMİR
