Wednesday, September 28, 2011

Multithreaded downloading with wget


GNU wget is versatile and robust, but it lacks support for multithreaded downloading. When downloading multiple files, it fetches them one by one, which is quite inefficient if the bandwidth of each connection is limited.

There is a way to achieve nearly the same effect as multithreaded downloading (link), and here is how you do it:
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
Copy the command as many times as you want parallel downloading processes. The key is the -N option, which tells wget to download a file only when its local timestamp is older than the one on the server, so the parallel processes skip files another process has already fetched.
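Another way to get much the same effect, assuming GNU xargs and a plain list of URLs in url-list.txt (both assumptions, not part of the recipe above), is to let xargs fan the URLs out to several wget processes:
xargs -a url-list.txt -n 1 -P 4 wget -N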

Alternatively, I wrote a wrapper, pwget (short for parallel wget), that adds multithreading to wget. The program is available from https://github.com/songqiang/pwget. It has two options, --max-num-threads and --sleep. The first, --max-num-threads, gives the maximum number of simultaneous connections you allow; this number is usually determined by the server-side settings, and the default is 3. The second, --sleep, specifies how often (in seconds) the master thread checks the status of the downloading threads. When the master thread wakes up, it removes finished threads and adds new downloading threads if necessary. Suppose you have the list of URLs in the file url-list.txt; then run
./pwget.py --max-num-threads 5 --sleep 2 -i url-list.txt
pwget will begin downloading the URLs in url-list.txt with at most 5 connections at a time. You can also specify wget options on the command line, and they will be passed to the worker threads.
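The idea behind pwget's master thread can be sketched in plain bash; this is only an illustration of the polling design described above, not pwget's actual code:
MAX_THREADS=5   # counterpart of --max-num-threads
SLEEP=2         # counterpart of --sleep
while read -r url; do
    # wait until one of the background downloads finishes and frees a slot
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_THREADS" ]; do sleep "$SLEEP"; done
    wget "$url" &
done < url-list.txt
wait   # let the remaining downloads finish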

This tool has several limitations. pwget parallelizes at the level of whole URLs, so you need to list all the URLs in advance. Furthermore, it does not help with a single large file; in that case, you may consider using aria2 (http://aria2.sourceforge.net/).
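For a single large file, aria2 can split the download and open several connections to the same server. A typical invocation (the URL and connection counts here are only placeholders) looks like:
aria2c -x 5 -s 5 http://example.com/large-file.tar.gz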

Tuesday, September 20, 2011

Running SSH on a non-standard port

The default port for an SSH connection is 22. However, some servers change the default port to another one, for example 22222, for security reasons. Here I list some common commands for dealing with a non-standard SSH port.

Suppose you have an SSH server ssh.example.edu with SSH port number 22222.

To copy your ssh public key to the server, run:
ssh-copy-id '-p 22222 jon@ssh.example.edu'
Note that the single quotes are necessary.

To log in to the SSH server, run
ssh -p 22222 jon@ssh.example.edu 
To copy files between your local machine and the server with scp, run
scp -P 22222 local-files jon@ssh.example.edu:~
Note that the "-P" option is capitalised.
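To avoid typing the port every time, you can also record it in ~/.ssh/config (a small sketch; the host and user names are the placeholders used above):
Host ssh.example.edu
    Port 22222
    User jon
After that, ssh and scp to ssh.example.edu use port 22222 automatically.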

References:
1. http://www.itworld.com/nls_unixssh0500506
2. http://mikegerwitz.com/2009/10/07/ssh-copy-id-and-sshd-port/


Thursday, July 14, 2011

Setting Up a Hadoop Cluster

This post lists the steps to set up a Hadoop cluster on Ubuntu 11.04. Most of the code can be copied and pasted directly.

* Hadoop
** Install Java
#+begin_src shell
sudo apt-get install sun-java6-jdk
sudo update-java-alternatives -s java-6-sun
#+end_src

** Add Hadoop User and Group
#+begin_src shell
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
#+end_src

** Configuring SSH and Password-less Login
#+begin_src sh
  # In the master node
  su hadoop
  ssh-keygen -t rsa -P ""
 
  for node in $(cat $HOME/hadoop/conf/slaves);   # the list of slave hostnames, one per line
  do
      ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@$node;
  done
#+end_src
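To verify that password-less login works before moving on (a quick sanity check, not part of the original steps), each slave should report its hostname without prompting for a password:
#+begin_src sh
  # run on the master as the hadoop user
  for node in $(cat $HOME/hadoop/conf/slaves);
  do
      ssh hadoop@$node hostname;
  done
#+end_src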

** Install Hadoop
*** Install
#+begin_src sh
  ## download and install (the archive URL below is an example; adjust to your mirror and version)
  cd /home/hadoop/
  wget http://archive.apache.org/dist/hadoop/core/hadoop-0.21.0/hadoop-0.21.0.tar.gz
  tar xzf hadoop-0.21.0.tar.gz
  mv hadoop-0.21.0 hadoop
#+end_src
*** Update .bashrc
#+begin_src sh
  ## update .bashrc
  # Set Hadoop-related environment variables
  export HADOOP_HOME=/home/hadoop/hadoop
  export HADOOP_COMMON_HOME="/home/hadoop/hadoop"
  export PATH=$PATH:$HADOOP_HOME/bin
  export PATH=$PATH:$HADOOP_COMMON_HOME/bin/
#+end_src
*** Update conf/hadoop-env.sh
#+begin_src sh
  export JAVA_HOME=/usr/lib/jvm/java-6-sun
  export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
#+end_src
*** Update conf/core-site.xml
#+begin_src html
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://128.125.86.89:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>


</configuration>
#+end_src
*** Update conf/mapred-site.xml
#+begin_src html
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!-- In: conf/mapred-site.xml -->
<property>
<name>mapreduce.jobtracker.address</name>
<value>128.125.86.89:54311</value>
</property>

</configuration>
#+end_src
*** Update conf/hdfs-site.xml
#+begin_src html
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>

</configuration>
#+end_src
*** Update conf/masters (master node only)
#+begin_src sh
128.125.86.89
#+end_src
*** Update conf/slaves (master node only)
#+begin_src sh
128.125.86.89
slave-ip1
slave-ip2
......
#+end_src
*** Copy hadoop installation and configuration files to slave nodes
#+begin_src sh
  # In the master node
  su hadoop
  for node in $(cat $HOME/hadoop/conf/slaves);
  do
      scp ~/.bashrc hadoop@$node:~;
      scp -r ~/hadoop hadoop@$node:~;
  done
#+end_src
** Run Hadoop
*** Format HDFS
#+begin_src sh
hdfs namenode -format
#+end_src
*** Start Hadoop
#+begin_src sh
start-dfs.sh && sleep 300 && start-mapred.sh && echo "GOOD"
#+end_src
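As a quick check (not part of the original recipe), jps from the JDK lists the Java daemons on a node; on the master you should see NameNode, SecondaryNameNode and JobTracker, and on the slaves DataNode and TaskTracker:
#+begin_src sh
  # list the running Hadoop daemons on this node
  jps
#+end_src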
*** Run Jobs
#+begin_src sh
  ## submit a job, e.g. the bundled word-count example
  ## (the jar name varies by Hadoop version; C++ jobs go through "hadoop pipes")
  hadoop jar $HADOOP_HOME/hadoop-mapred-examples-*.jar wordcount input output
#+end_src
*** Stop Hadoop
#+begin_src sh
stop-mapred.sh && stop-dfs.sh
#+end_src
** References:
1. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ 
2. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
3. http://fclose.com/b/cloud-computing/290/hadoop-tutorial/
4. Fix "could only be replicated to 0 nodes, instead of 1" error

Thursday, March 10, 2011

Identifying dispersed epigenomic domains from ChIP-Seq data

Published in Bioinformatics http://bioinformatics.oxfordjournals.org/content/27/6/870.full

1 INTRODUCTION

Post-translational modifications to histone tails, including methylation and acetylation, have been associated with important regulatory roles in cell differentiation and disease development (Kouzarides, 2007). The application of ChIP-Seq to histone modification study has proved very useful for understanding the genomic landscape of histone modifications (Barski et al., 2007; Mikkelsen et al., 2007). Certain histone modifications are tightly concentrated, covering a few hundred base pairs. For example, H3K4me3 is usually associated with active promoters, and occurs only at nucleosomes close to transcription start sites (TSSs). On the other hand, many histone modifications are diffuse and occupy large regions, ranging from thousands to several millions of base pairs. A well-known example, H3K36me3, is associated with active gene expression and often spans the whole gene body (Barski et al., 2007). Reflected in ChIP-Seq data, the signals of these histone modifications are enriched over large regions, but lack well-defined peaks. It is worth pointing out that the property of being ‘diffuse’ is a matter of degree. Besides the modification frequency, the modification profile over a region is also affected by nucleosome densities and the strength of nucleosome positioning. By visual inspection of read-density profiles, we found that H2BK5me1, H3K79me1, H3K79me2, H3K79me3, H3K9me1, H3K9me3 and H3R2me1 show similar diffuse profiles.
There are several general questions about dispersed epigenomic domains that remain unanswered. Many of these questions center around how these domains are established and maintained. One critical step in answering them is to accurately locate the boundaries of these domains. However, most existing methods for ChIP-Seq data analysis were originally designed for identifying transcription factor binding sites. These focus on locating highly concentrated ‘peaks’ and are inappropriate for identifying domains of dispersed histone modification marks (Pepke et al., 2009). Moreover, the quality of ‘peak’ analysis is measured in terms of the sensitivity and specificity of peak calling (accuracy), along with how narrow the peaks are (precision; often determined by the underlying platform). But for diffuse histone modifications, significant ‘peaks’ are usually lacking, and often the utility of identifying domains depends on how clearly the boundaries are located.

2 METHODS

Our method for identifying epigenomic domains is based on a hidden Markov model (HMM) framework, including Baum–Welch training and posterior decoding (see Rabiner, 1989 for a general description).
Single sample analysis: we first obtain the read density profile by dividing the genome into non-overlapping fixed length bins and counting the number of reads in each bin. The bin size can be determined automatically as a function of the total number of reads and the effective genome size (Supplementary Section S1.5). We model the read counts with the negative binomial distribution after correcting for the effect of genomic deadzones. We first exclude unassembled regions of a genome from our analysis. Second, when two locations in the genome have identical sequences of length greater than or equal to the read length, any read derived from one of those locations will necessarily be ambiguous and is discarded. We refer to contiguous sets of locations to which no read can map uniquely as ‘deadzones’. Those bins within large deadzones (referred to as ‘deserts’) are ignored. For those bins outside of deserts, we correct for the deadzone effect by scaling distribution parameters according to the proportion of the bin which is not within a deadzone (Supplementary Section S1.3).
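For reference, a standard parameterization of the negative binomial distribution is shown below; this particular form, and the reading of the deadzone correction as a scaling of the mean by the non-deadzone fraction f of a bin, are illustrative assumptions rather than formulas quoted from the paper:
P(X = k) = \frac{\Gamma(k + r)}{k!\,\Gamma(r)}\, p^{r} (1 - p)^{k}, \quad k = 0, 1, 2, \ldots, \qquad E[X] = \frac{r(1 - p)}{p},
so that a bin with non-deadzone fraction f would have expected count f \cdot E[X].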
We assume a bin may be in one of two states: a foreground state with high histone modification frequency or a background state with low histone modification frequency. We developed a two-state HMM to segment the genome into foreground domains and background domains.
Identifying and evaluating domain boundaries: while predicted domains themselves give the locations of boundaries, we characterize the boundaries with the following metrics. We evaluate domain boundaries based on posterior probabilities of transitions between the foreground state and the background state as estimated by the HMM. For each pair of consecutive genomic bins, the posterior probability is calculated for all possible transitions between those bins. If a boundary corresponds to the beginning of a domain, the boundary score is the posterior probability of a background to foreground transition and vice versa.
Next an empirical distribution of posterior transition probabilities is constructed by computing posterior transition probabilities from a dataset of randomly permuted bins with the same HMM parameters. Those bins whose posterior transition probabilities have significant empirical P-values are kept and consecutive significant bins are joined as being one boundary. We score each boundary with the posterior probability that a single transition occurs in this boundary. The peak of a boundary is set to the start of the bin with the largest transition probability (see Supplementary Section S3 for details).
Incorporating a control sample: ChIP-Seq experiments are influenced by background noise, contamination and other possible sources of error, and researchers have begun to realize the necessity of generating experimental controls in ChIP-Seq experiments. Two common forms of control exist: a non-specific antibody such as IgG to control the immunoprecipitation, and sequencing of whole-cell extract to control for contamination and other possible sources of error. With the availability of a control sample, we use a similar two-state HMM with the novel NBDiff distribution to describe the relationship between the read counts in the two samples. Analogous to the Skellam distribution (Skellam, 1946), the NBDiff distribution describes the difference of two independent negative binomial random variables (see Supplementary Section S1.2 for details).
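For concreteness, since the NBDiff distribution describes the difference of two independent negative binomial random variables, its probability mass function follows from the generic convolution identity for the difference of two independent, non-negative, integer-valued random variables (standard probability, not a formula quoted from the paper): if Z = X - Y, then
P(Z = k) = \sum_{j = \max(0, -k)}^{\infty} P(X = k + j)\, P(Y = j), \quad k \in \mathbb{Z}.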
Simultaneously segmenting two modifications: the simultaneous analysis of two histone modification marks may reveal more accurate information about the status of genomic regions and helps in understanding the functions of different histone modification marks. It is also of interest to compare samples from different cell types, because histone modification patterns are dynamic and subject to change during cell differentiation. We use the NBDiff distribution to model the read count difference between the two samples and employ a three-state HMM, where the basal state means the two signals are similar, the second state represents that the signal in test sample A is greater than that in test sample B, and the third state represents the opposite case (details given in Supplementary Section S2.1).

3 EVALUATION AND APPLICATIONS

We simulated H3K36me3 ChIP-Seq data and compared RSEG, SICER (Zang et al., 2009) and HPeak (Qin et al., 2010). In terms of domain identification, RSEG outperforms SICER and HPeak for single-sample analysis and yields comparable results to SICER for analysis with control samples (Supplementary Sections S4.1 and S4.2). We applied RSEG to the H3K36me3 ChIP-Seq dataset from Barski et al. (2007) and found a strong association of H3K36me3 domain boundaries with TSSs and transcription termination sites (TTSs), which supports the claim that RSEG can find high-quality domain boundaries (Supplementary Section S4.3).
We applied RSEG to four histone modification marks (H3K9me3, H3K27me3, H3K36me3 and H3K79me2) from two separate studies (Barski et al., 2007; Mikkelsen et al., 2007) (Supplementary Section S5.1). In particular, we discovered an interesting relationship between the two gene-overlapping marks H3K36me3 and H3K79me2 through boundary analysis. H3K79me2 tends to associate with the 5′ ends of genes, while H3K36me3 associates with the 3′ ends. About 41% of gene-overlapping K79 domains cover the TSS, in contrast to 11% of K36 domains. On the other hand, 84% of K36 domains cover the TTS, in contrast to 23% of K79 domains (Table 1). In genes with both H3K36me3 and H3K79me2 signals, H3K79me2 domains tend to precede H3K36me3 domains, for example at the DPF2 gene (Fig. 1) (see Supplementary Section S5.2 for more information). This novel discovery demonstrates the usefulness of boundary analysis for dispersed histone modification marks.
Fig. 1. The H3K36me3 and H3K79me2 domains and their boundaries at DPF2 (chr11:64,854,646–64,880,304).
Table 1. Location of H3K36me3 and H3K79me2 domain boundaries relative to genes.
Finally, we applied our three-state HMM to simultaneously analyze H3K36me3 and H3K79me2 (Supplementary Section S5.4). The results agree with the above observations. The application of our three-state HMM to finding differential histone modification regions is given in Supplementary Section S5.3.

Saturday, February 26, 2011

Sustainable Scientific Data Archiving Model

As many researchers may have noticed, NCBI plans to discontinue the Short Read Archive (SRA) service due to budget constraints. This news surprises me and, I believe, concerns the broader biomedical research community in general. As biomedical research enters the -omics era and becomes more and more data-driven, the sudden closure of the SRA raises the question of how scientific data should be archived under a sustainable model. I discuss two strategies for preserving scientific data in a sustainable manner. The first proposes a central data repository that charges a data deposition fee. The other proposes that the data be stored in a P2P manner, with a central gateway that gathers metadata and tracks links to the P2P seeds.

A sustainable data archiving model includes the following aspects: first, the data should include necessary and accurate metadata; second, the data should be stored securely and remain authentic and correct for a long time; third, the data should also include the essential software and scripts needed to analyze it; finally, the data should be easily searched and accessed by the broader research community, now and for a considerable period in the future, so that researchers may use the dataset from different perspectives and even re-analyze it if new hypotheses and analytic methods emerge.

However, as has been noted elsewhere, there is a disconnect between the effort to produce data and the effort to preserve it. Simply put, funding agencies provide the money to produce the data but not the money to maintain it. The grant for producing the data operates on a rather short time scale, usually two to five years. Once the grant is over, the project is done and the original researchers have moved on to other projects, the data produced is in danger of being lost. Fortunately, the biomedical research community has a pretty good record of depositing biological datasets for public research, as exemplified by GenBank and GEO. The Short Read Archive was designed to meet the requirements of massively parallel sequencing read data. However, the discontinuance of this service demonstrates the uncertainty of the current data-sharing model due to the lack of dedicated funding. Therefore I consider the following two strategies for sustaining scientific data.

In the first strategy, we still rely on a central data repository like the SRA that curates, stores and distributes biological datasets. To meet the financial requirements of such a central repository, it charges a fee for the data it hosts. It works as follows: the original data producers finish their research and submit a paper to a journal; the journal requires that the data be deposited in a designated repository and charges a data deposition fee; the journal then passes the major part of that fee on to the central data repository. The proposed data deposition fee is charged only once, so it can be covered by the initial grant of the original data producers. With the ever-decreasing cost of data storage, the continual influx of one-time data deposition fees should keep the central data repository running.

The second strategy was initially brought up to me by my friend Li Xia and was further inspired by Morgan Langille, the creator of BioTorrents. In this strategy, the dataset is stored by multiple hosts who have the resources and interest to keep it. A central gateway keeps track of the BitTorrent seeds for the raw data and also stores the metadata associated with each dataset, such as the contact information of the data producer, the experimental protocols and descriptions of the raw data. In particular, the central gateway stores the version of the raw dataset and its MD5 or SHA checksum, so that data users can make sure they are obtaining an up-to-date and authentic dataset from the essentially unreliable and untrustworthy data hosts in a P2P network. Since the central gateway only needs to track this metadata, its running cost is significantly smaller than that of a central data repository, and it could work as a new section of the NCBI infrastructure.

I hope this discussion publicizes the urgency of sustainable scientific data archiving, so that the biomedical research community can work out a way forward after the SRA ends.