Parallel processing of large scale genomic data

Thesis No: 402722
Author: MÜCAHİD KUTLU
Advisors: DR. GAGAN AGRAWAL
Thesis Type: Doctorate
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2015
Language: English
University: The Ohio State University
Institute: Foreign Institute
Department: Not specified.
Division: Not specified.
Number of Pages: 172

Abstract

No abstract available.

Abstract (Translation)

An increasing amount of genomic data is becoming available to researchers with the development of high-throughput, low-cost sequencing technologies. Analysis of such data has significant potential for new scientific and medical advances. However, as the amount of available data increases, the use of parallelism and the effective utilization of computing resources become even more critical. Thus, novel parallelization approaches and frameworks that help researchers develop parallel applications without dealing with the low-level details of parallel coding are urgently needed for new advances in genomic research.

In this dissertation, we introduce parallel genomic data analysis tools and middleware systems for easily developing efficient parallel genomic applications. With the proposed frameworks and parallel algorithms, we address the following challenges: (1) How can genomic data be partitioned for parallel SNP calling and sequence quantification tools? (2) Is it possible to utilize existing genomic applications in parallel executions? (3) How can we take advantage of domain-specific knowledge to increase the performance of the applications? (4) How should data-intensive tasks be scheduled? (5) How can we implement efficient parallel genomic applications for memory-constrained many-core architectures such as Intel Xeon Phi?

First, we focus on identifying variants in large-scale genomic data in parallel. After examining possible approaches, we identify one that does not require any communication. However, achieving load balance is non-trivial because of the data-dependent nature of the processing. We develop three scheduling schemes: a dynamic scheme, which reduces scheduling overheads by using two different chunk sizes; a static scheme, which uses a pre-processing step to estimate workloads; and a combined scheme. We evaluate our schemes with various configurations and analyze their performance.

Second, we develop a middleware system, PAGE, which supports MapReduce-like processing but differs significantly from a system like Hadoop in order to be useful and effective for parallelizing the analysis of genomic data. In particular, it can work with map functions written in any language, which allows existing serial tools (even those for which only an executable is available) to be used as map functions. It can therefore greatly simplify parallel application development for scenarios involving complex data formats and/or nuanced serial algorithms, as is often the case for genomic data. It supports multiple partitioning methods and provides different scheduling schemes and execution models to match the nature of algorithms common in genetic research. In the evaluation of our middleware system, we show that PAGE is able to parallelize various genomic applications and achieves high parallel efficiency and scalability.

Third, we focus on the data-intensive computation challenge in genomic applications and develop a novel framework, RE-PAGE, for processing genomic data stored on distributed disks. We pay attention to scheduling and load balancing, particularly in view of the data-intensive nature of the target applications. Specifically, our framework includes: (1) use of domain-specific information in the formation of data chunks (which can be of non-uniform sizes), (2) replication and placement of each chunk on a small number of nodes, performed in an intelligent way, and (3) scheduling schemes for achieving load balance when data movement costs outweigh processing costs and the chunks are of non-uniform sizes.

As stated above, the amount of available genomic data is increasing rapidly with recent developments in sequencing technologies. At the same time, computational technologies are also developing at an enormous speed. For instance, the trend in computing technologies is toward architectures with a large number of cores and a smaller memory size per core (e.g., Intel Xeon Phi). Innovative solutions that meet the requirements of parallel genomic data processing under the constraints of the new computational architectures are urgently needed. Thus, as a fourth contribution, we introduce a novel middleware system, GEM, for implementing shared-memory parallel genomic applications on memory-constrained many-core architectures. We propose a load-map-reduce approach and a novel scheduling scheme to decrease I/O contention and to prevent over-consumption of the limited memory. We also use domain-specific knowledge to decrease the memory requirements of the tasks. In our experiments, we show that GEM has high scalability on the Intel Xeon Phi architecture and outperforms other existing frameworks for genomic data processing.

In our last work, we focus on the probabilistic assignment of fragments that are ambiguously mapped to target sequences. This is a very significant but time-consuming procedure for the downstream analysis of genomic data. We develop a distributed-memory parallel version of a popular probabilistic fragment assignment tool, eXpress, which is based on the expectation-maximization algorithm. We discuss possible data distribution techniques and propose a parallelization approach that preserves the original algorithm's accuracy and does not require communication among all processes at the end of each iteration. In our experiments, we show that our approach achieves high speedup over eXpress without decreasing its accuracy.
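
As a rough illustration of the dynamic scheduling scheme with two chunk sizes mentioned in the abstract, the sketch below hands out large chunks first and small chunks toward the end, then simulates idle workers pulling the next chunk. The abstract does not say how the two sizes are used, so the large-then-small ordering, the `small_fraction` parameter, and the function names `make_chunks` and `simulate` are all assumptions made for illustration; this is not the dissertation's implementation.

```python
import heapq


def make_chunks(n_items, large, small, small_fraction=0.25):
    """Split range(n_items) into large chunks followed by small chunks.

    Assumption (not from the thesis): the final `small_fraction` of the
    items is cut into small chunks so that workers finish at similar times,
    while the earlier large chunks keep the number of scheduling requests low.
    """
    boundary = int(n_items * (1 - small_fraction))
    chunks = []
    start = 0
    while start < boundary:
        end = min(start + large, boundary)
        chunks.append((start, end))
        start = end
    while start < n_items:
        end = min(start + small, n_items)
        chunks.append((start, end))
        start = end
    return chunks


def simulate(chunks, cost, n_workers):
    """Simulate dynamic scheduling and return the finishing time (makespan).

    Each chunk goes to whichever worker becomes idle first; `cost[i]` is the
    hypothetical processing time of item i.
    """
    workers = [(0.0, w) for w in range(n_workers)]  # (time when idle, worker id)
    heapq.heapify(workers)
    for start, end in chunks:
        idle_at, w = heapq.heappop(workers)
        heapq.heappush(workers, (idle_at + sum(cost[start:end]), w))
    return max(t for t, _ in workers)
```

For example, `simulate(make_chunks(100, 20, 5), [1] * 100, 2)` schedules 100 unit-cost items on two simulated workers using nine chunks instead of the twenty that a uniform chunk size of 5 would need.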

Similar Theses

Thesis No: 946115
Title: Development of a modular and open-source tomographic imaging software: enhancing the reconstruction module for low-dose CT and DBT
Turkish Title: Modüler ve açık kaynak kodlu tomografik görüntüleme yazılımı: düşük doz BT ve SMT taramaları için rekonstrüksiyon alt modülünün geliştirilmesi
Author: SEMA ALTUN
Thesis Type: Master's
Language: English
Year: 2024
Subject: Electrical and Electronics Engineering
University: İstanbul Teknik Üniversitesi
Department: Electronics and Communication Engineering
Advisor: ASSOC. PROF. DR. İSA YILDIRIM
Thesis No: 315729
Title: Hadoop tabanlı büyük ölçekli görüntü işleme altyapısı
English Title: Hadoop based large scale image processing infrastructure
Author: İLGİNÇ DEMİR
Thesis Type: Master's
Language: Turkish
Year: 2012
Subject: Computer Engineering and Computer Science and Control
University: Kocaeli Üniversitesi
Department: Computer Engineering
Advisor: ASST. PROF. DR. AHMET SAYAR
Thesis No: 213249
Title: Genel amaçlı bir yapay sinir ağının karma bir donanımla gerçeklenmesi
English Title: Mixed mode hardware design of a general purposed artificial neural network
Author: BURCU ERKMEN
Thesis Type: Doctorate
Language: Turkish
Year: 2007
Subject: Electrical and Electronics Engineering
University: Yıldız Teknik Üniversitesi
Department: Electronics and Communication Engineering
Advisor: PROF. DR. TÜLAY YILDIRIM
Thesis No: 708833
Title: The solution of large-scale electromagnetic problems with MLFMA on single-GPU systems
Turkish Title: Büyük ölçekli elektromanyetik problemlerin ÇSHÇY ile tekli-GİB sistemlerinde çözümü
Author: MEHMET FATİH ERKAL
Thesis Type: Master's
Language: English
Year: 2022
Subject: Electrical and Electronics Engineering
University: İhsan Doğramacı Bilkent Üniversitesi
Department: Electrical and Electronics Engineering
Advisors: PROF. DR. VAKUR BEHÇET ERTÜRK, DR. BARIŞCAN KARAOSMANOĞLU
Thesis No: 491077
Title: Platform development for parallel operation of single board computers
Turkish Title: Tek kart bilgisayarlarla paralel işlem yapabilmesi için platform geliştirilmesi
Author: KÜBRA KARADAĞ
Thesis Type: Master's
Language: English
Year: 2017
Subject: Mechatronics Engineering
University: Dokuz Eylül Üniversitesi
Department: Mechatronics Engineering
Advisor: ASST. PROF. DR. ÖZGÜR TAMER