Monday, February 04, 2013

FASTQ Quality Score Conversion Table

In the FASTQ format, the fourth line of each record encodes the quality scores of the bases given in the second line. The scheme was initially used with the Phred base-calling program: each ASCII character encodes the probability that the corresponding base call is wrong in traditional Sanger sequencing. Illumina/Solexa sequencing uses the same format; however, the mapping from probability values to characters differs slightly from the Phred score and also varies between versions of the Solexa/Illumina pipeline. The exact formulas are given elsewhere. The tables below list the conversion for each platform and/or version.
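
For reference, the two standard definitions are Q = -10 log10(p) for the Phred score and Q = -10 log10(p / (1 - p)) for the older Solexa score, where p is the error probability. The following is a minimal Python sketch of both conversions; the function names are my own and are only for illustration.

```python
import math


def phred_to_error_prob(q):
    """Phred quality Q -> error probability p, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q):
    """Solexa quality Q -> error probability p, from Q = -10 * log10(p / (1 - p))."""
    return 1 / (1 + 10 ** (q / 10))


def error_prob_to_phred(p):
    """Error probability p (0 < p <= 1) -> Phred quality."""
    return -10 * math.log10(p)


def error_prob_to_solexa(p):
    """Error probability p (0 < p < 1) -> Solexa quality."""
    return -10 * math.log10(p / (1 - p))
```

The Error-Prob. columns in the tables below are consistent with these formulas: for example, a Phred score of 20 gives p = 0.01, while a Solexa score of 20 gives p = 1/101.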


Range

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
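
As a small illustration of the legend above, the sketch below decodes a quality string into numeric scores given the encoding's offset. The dictionary keys and function names are hypothetical, and guess_offset is only a crude heuristic based on the typical raw-read ranges listed above, not a reliable detector.

```python
# ASCII offsets taken from the legend above; the key names are made up.
OFFSETS = {
    "sanger": 33,        # S: Phred+33
    "solexa": 64,        # X: Solexa+64
    "illumina-1.3": 64,  # I: Phred+64
    "illumina-1.5": 64,  # J: Phred+64
    "illumina-1.8": 33,  # L: Phred+33
}


def decode_quality(qual, encoding):
    """Convert a FASTQ quality string into a list of integer scores."""
    offset = OFFSETS[encoding]
    return [ord(ch) - offset for ch in qual]


def guess_offset(qual):
    """Heuristic guess of the offset (33 or 64); None when ambiguous.

    Characters below ';' (59) occur only in the +33 encodings, and
    characters above 'J' (74) are only typical of the +64 encodings.
    """
    codes = [ord(ch) for ch in qual]
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None
```

For example, decode_quality("!5I", "sanger") gives the scores 0, 20 and 40, matching the Sanger table below.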


Sanger sequencing score

   |------+-------+-------+--------------|
   | char | value | Phred |  Error-Prob. |
   |------+-------+-------+--------------|
   | !    |    33 |     0 | 1.0000000000 |
   | "    |    34 |     1 | 0.7943282347 |
   | #    |    35 |     2 | 0.6309573445 |
   | $    |    36 |     3 | 0.5011872336 |
   | %    |    37 |     4 | 0.3981071706 |
   | &    |    38 |     5 | 0.3162277660 |
   | '    |    39 |     6 | 0.2511886432 |
   | (    |    40 |     7 | 0.1995262315 |
   | )    |    41 |     8 | 0.1584893192 |
   | *    |    42 |     9 | 0.1258925412 |
   | +    |    43 |    10 | 0.1000000000 |
   | ,    |    44 |    11 | 0.0794328235 |
   | -    |    45 |    12 | 0.0630957344 |
   | .    |    46 |    13 | 0.0501187234 |
   | /    |    47 |    14 | 0.0398107171 |
   | 0    |    48 |    15 | 0.0316227766 |
   | 1    |    49 |    16 | 0.0251188643 |
   | 2    |    50 |    17 | 0.0199526231 |
   | 3    |    51 |    18 | 0.0158489319 |
   | 4    |    52 |    19 | 0.0125892541 |
   | 5    |    53 |    20 | 0.0100000000 |
   | 6    |    54 |    21 | 0.0079432823 |
   | 7    |    55 |    22 | 0.0063095734 |
   | 8    |    56 |    23 | 0.0050118723 |
   | 9    |    57 |    24 | 0.0039810717 |
   | :    |    58 |    25 | 0.0031622777 |
   | ;    |    59 |    26 | 0.0025118864 |
   | <    |    60 |    27 | 0.0019952623 |
   | =    |    61 |    28 | 0.0015848932 |
   | >    |    62 |    29 | 0.0012589254 |
   | ?    |    63 |    30 | 0.0010000000 |
   | @    |    64 |    31 | 0.0007943282 |
   | A    |    65 |    32 | 0.0006309573 |
   | B    |    66 |    33 | 0.0005011872 |
   | C    |    67 |    34 | 0.0003981072 |
   | D    |    68 |    35 | 0.0003162278 |
   | E    |    69 |    36 | 0.0002511886 |
   | F    |    70 |    37 | 0.0001995262 |
   | G    |    71 |    38 | 0.0001584893 |
   | H    |    72 |    39 | 0.0001258925 |
   | I    |    73 |    40 | 0.0001000000 |
   |------+-------+-------+--------------|


Solexa score (prior to 1.3)

   |------+-------+-------+--------------|
   | char | value | Phred |  Error-Prob. |
   |------+-------+-------+--------------|
   | ;    |    59 |    -5 | 0.7597469266 |
   | <    |    60 |    -4 | 0.7152527510 |
   | =    |    61 |    -3 | 0.6661394246 |
   | >    |    62 |    -2 | 0.6131368202 |
   | ?    |    63 |    -1 | 0.5573116338 |
   | @    |    64 |     0 | 0.5000000000 |
   | A    |    65 |     1 | 0.4426883662 |
   | B    |    66 |     2 | 0.3868631798 |
   | C    |    67 |     3 | 0.3338605754 |
   | D    |    68 |     4 | 0.2847472490 |
   | E    |    69 |     5 | 0.2402530734 |
   | F    |    70 |     6 | 0.2007600089 |
   | G    |    71 |     7 | 0.1663375308 |
   | H    |    72 |     8 | 0.1368068886 |
   | I    |    73 |     9 | 0.1118157698 |
   | J    |    74 |    10 | 0.0909090909 |
   | K    |    75 |    11 | 0.0735875561 |
   | L    |    76 |    12 | 0.0593509431 |
   | M    |    77 |    13 | 0.0477267210 |
   | N    |    78 |    14 | 0.0382865039 |
   | O    |    79 |    15 | 0.0306534300 |
   | P    |    80 |    16 | 0.0245033676 |
   | Q    |    81 |    17 | 0.0195623039 |
   | R    |    82 |    18 | 0.0156016622 |
   | S    |    83 |    19 | 0.0124327353 |
   | T    |    84 |    20 | 0.0099009901 |
   | U    |    85 |    21 | 0.0078806839 |
   | V    |    86 |    22 | 0.0062700123 |
   | W    |    87 |    23 | 0.0049868787 |
   | X    |    88 |    24 | 0.0039652856 |
   | Y    |    89 |    25 | 0.0031523092 |
   | Z    |    90 |    26 | 0.0025055927 |
   | [    |    91 |    27 | 0.0019912892 |
   | \    |    92 |    28 | 0.0015823853 |
   | ]    |    93 |    29 | 0.0012573425 |
   | ^    |    94 |    30 | 0.0009990010 |
   | _    |    95 |    31 | 0.0007936978 |
   | `    |    96 |    32 | 0.0006305595 |
   | a    |    97 |    33 | 0.0005009362 |
   | b    |    98 |    34 | 0.0003979487 |
   | c    |    99 |    35 | 0.0003161278 |
   | d    |   100 |    36 | 0.0002511256 |
   | e    |   101 |    37 | 0.0001994864 |
   | f    |   102 |    38 | 0.0001584642 |
   | g    |   103 |    39 | 0.0001258767 |
   | h    |   104 |    40 | 0.0000999900 |
   |------+-------+-------+--------------|


Solexa score 1.3+

   |------+-------+-------+--------------|
   | char | value | Phred |   Error Prob |
   |------+-------+-------+--------------|
   | @    |    64 |     0 | 1.0000000000 |
   | A    |    65 |     1 | 0.7943282347 |
   | B    |    66 |     2 | 0.6309573445 |
   | C    |    67 |     3 | 0.5011872336 |
   | D    |    68 |     4 | 0.3981071706 |
   | E    |    69 |     5 | 0.3162277660 |
   | F    |    70 |     6 | 0.2511886432 |
   | G    |    71 |     7 | 0.1995262315 |
   | H    |    72 |     8 | 0.1584893192 |
   | I    |    73 |     9 | 0.1258925412 |
   | J    |    74 |    10 | 0.1000000000 |
   | K    |    75 |    11 | 0.0794328235 |
   | L    |    76 |    12 | 0.0630957344 |
   | M    |    77 |    13 | 0.0501187234 |
   | N    |    78 |    14 | 0.0398107171 |
   | O    |    79 |    15 | 0.0316227766 |
   | P    |    80 |    16 | 0.0251188643 |
   | Q    |    81 |    17 | 0.0199526231 |
   | R    |    82 |    18 | 0.0158489319 |
   | S    |    83 |    19 | 0.0125892541 |
   | T    |    84 |    20 | 0.0100000000 |
   | U    |    85 |    21 | 0.0079432823 |
   | V    |    86 |    22 | 0.0063095734 |
   | W    |    87 |    23 | 0.0050118723 |
   | X    |    88 |    24 | 0.0039810717 |
   | Y    |    89 |    25 | 0.0031622777 |
   | Z    |    90 |    26 | 0.0025118864 |
   | [    |    91 |    27 | 0.0019952623 |
   | \    |    92 |    28 | 0.0015848932 |
   | ]    |    93 |    29 | 0.0012589254 |
   | ^    |    94 |    30 | 0.0010000000 |
   | _    |    95 |    31 | 0.0007943282 |
   | `    |    96 |    32 | 0.0006309573 |
   | a    |    97 |    33 | 0.0005011872 |
   | b    |    98 |    34 | 0.0003981072 |
   | c    |    99 |    35 | 0.0003162278 |
   | d    |   100 |    36 | 0.0002511886 |
   | e    |   101 |    37 | 0.0001995262 |
   | f    |   102 |    38 | 0.0001584893 |
   | g    |   103 |    39 | 0.0001258925 |
   | h    |   104 |    40 | 0.0001000000 |
   |------+-------+-------+--------------|


Solexa score 1.5+

   |------+-------+-------+--------------|
   | char | value | Phred |   Error Prob |
   |------+-------+-------+--------------|
   | C    |    67 |     3 | 0.5011872336 |
   | D    |    68 |     4 | 0.3981071706 |
   | E    |    69 |     5 | 0.3162277660 |
   | F    |    70 |     6 | 0.2511886432 |
   | G    |    71 |     7 | 0.1995262315 |
   | H    |    72 |     8 | 0.1584893192 |
   | I    |    73 |     9 | 0.1258925412 |
   | J    |    74 |    10 | 0.1000000000 |
   | K    |    75 |    11 | 0.0794328235 |
   | L    |    76 |    12 | 0.0630957344 |
   | M    |    77 |    13 | 0.0501187234 |
   | N    |    78 |    14 | 0.0398107171 |
   | O    |    79 |    15 | 0.0316227766 |
   | P    |    80 |    16 | 0.0251188643 |
   | Q    |    81 |    17 | 0.0199526231 |
   | R    |    82 |    18 | 0.0158489319 |
   | S    |    83 |    19 | 0.0125892541 |
   | T    |    84 |    20 | 0.0100000000 |
   | U    |    85 |    21 | 0.0079432823 |
   | V    |    86 |    22 | 0.0063095734 |
   | W    |    87 |    23 | 0.0050118723 |
   | X    |    88 |    24 | 0.0039810717 |
   | Y    |    89 |    25 | 0.0031622777 |
   | Z    |    90 |    26 | 0.0025118864 |
   | [    |    91 |    27 | 0.0019952623 |
   | \    |    92 |    28 | 0.0015848932 |
   | ]    |    93 |    29 | 0.0012589254 |
   | ^    |    94 |    30 | 0.0010000000 |
   | _    |    95 |    31 | 0.0007943282 |
   | `    |    96 |    32 | 0.0006309573 |
   | a    |    97 |    33 | 0.0005011872 |
   | b    |    98 |    34 | 0.0003981072 |
   | c    |    99 |    35 | 0.0003162278 |
   | d    |   100 |    36 | 0.0002511886 |
   | e    |   101 |    37 | 0.0001995262 |
   | f    |   102 |    38 | 0.0001584893 |
   | g    |   103 |    39 | 0.0001258925 |
   | h    |   104 |    40 | 0.0001000000 |
   |------+-------+-------+--------------|


Solexa score 1.8+

   |------+-------+-------+--------------|
   | char | value | Phred |  Error-Prob. |
   |------+-------+-------+--------------|
   | !    |    33 |     0 | 1.000000e+00 |
   | "    |    34 |     1 | 7.943282e-01 |
   | #    |    35 |     2 | 6.309573e-01 |
   | $    |    36 |     3 | 5.011872e-01 |
   | %    |    37 |     4 | 3.981072e-01 |
   | &    |    38 |     5 | 3.162278e-01 |
   | '    |    39 |     6 | 2.511886e-01 |
   | (    |    40 |     7 | 1.995262e-01 |
   | )    |    41 |     8 | 1.584893e-01 |
   | *    |    42 |     9 | 1.258925e-01 |
   | +    |    43 |    10 | 1.000000e-01 |
   | ,    |    44 |    11 | 7.943282e-02 |
   | -    |    45 |    12 | 6.309573e-02 |
   | .    |    46 |    13 | 5.011872e-02 |
   | /    |    47 |    14 | 3.981072e-02 |
   | 0    |    48 |    15 | 3.162278e-02 |
   | 1    |    49 |    16 | 2.511886e-02 |
   | 2    |    50 |    17 | 1.995262e-02 |
   | 3    |    51 |    18 | 1.584893e-02 |
   | 4    |    52 |    19 | 1.258925e-02 |
   | 5    |    53 |    20 | 1.000000e-02 |
   | 6    |    54 |    21 | 7.943282e-03 |
   | 7    |    55 |    22 | 6.309573e-03 |
   | 8    |    56 |    23 | 5.011872e-03 |
   | 9    |    57 |    24 | 3.981072e-03 |
   | :    |    58 |    25 | 3.162278e-03 |
   | ;    |    59 |    26 | 2.511886e-03 |
   | <    |    60 |    27 | 1.995262e-03 |
   | =    |    61 |    28 | 1.584893e-03 |
   | >    |    62 |    29 | 1.258925e-03 |
   | ?    |    63 |    30 | 1.000000e-03 |
   | @    |    64 |    31 | 7.943282e-04 |
   | A    |    65 |    32 | 6.309573e-04 |
   | B    |    66 |    33 | 5.011872e-04 |
   | C    |    67 |    34 | 3.981072e-04 |
   | D    |    68 |    35 | 3.162278e-04 |
   | E    |    69 |    36 | 2.511886e-04 |
   | F    |    70 |    37 | 1.995262e-04 |
   | G    |    71 |    38 | 1.584893e-04 |
   | H    |    72 |    39 | 1.258925e-04 |
   | I    |    73 |    40 | 1.000000e-04 |
   | J    |    74 |    41 | 7.943282e-05 |
   |------+-------+-------+--------------|

Tuesday, January 22, 2013

A Guess on the Encryption Design of MEGA

The newly relaunched MEGA, successor to MegaUpload, has attracted a lot of fanfare on the net. A novel feature of the new MEGA site is its encryption function. There are two interesting articles about the encryption technique used by the new site. One, from Ars Technica, questions the security and usefulness of MEGA's encryption design (http://arstechnica.com/business/2013/01/megabad-a-quick-look-at-the-state-of-megas-encryption/). The other, posted on the MEGA blog, addresses those concerns (https://mega.co.nz/#blog_3).

In my opinion, the Ars Technica editor does not understand, or at least misunderstands, MEGA's encryption design. Some comments on that Ars article explain the basic idea quite clearly, and Mega's reply confirms it.

If my guess is right, the encryption design of MEGA is as illustrated in the figure below. A PDF version of the figure is at https://www.box.com/s/uswje6orhhqahyv97ijk

Friday, December 28, 2012

Notes on Android Phones

Some notes from using my Google Nexus 4 phone.

1. Back up Android files

Install Software Data Cable (https://play.google.com/store/apps/details?id=com.lyy.softdatacable&hl=en). This app starts an FTP server on your phone so that you can access your files over WiFi. After you start the app, it shows its address on the home screen, something like ftp://192.168.1.27:8888/

Next, on your local machine, you can use lftp to mirror the files on your phone to your desktop. The following command copies all newer files under the root directory of the phone to the n4 directory on your desktop:
lftp ftp://192.168.1.27:8888/ -e "mirror --verbose --only-newer / n4"

2. Access Developer Options

Go to Settings -> About Phone and tap Build Number seven times. This enables Developer options, from which you can adjust the animation scale.






Friday, April 27, 2012

Notes on Upgrading to Ubuntu 12.04 Precise Pangolin



0. In general, the upgrade process was smooth. I have been using it for two days. Since this version is an LTS release, I would recommend that everyone upgrade.

1. The full-fledged Unity is so unstable that it is essentially unusable. By default, the Unity plugin is not enabled. When you log in for the first time, there is just a blank desktop: no Dash, no launcher. Someone wrote about how to fix this (http://askubuntu.com/questions/17381/unity-doesnt-load-no-launcher-no-dash-appears and http://askubuntu.com/questions/121782/blank-desktop-after-updates-today-only-unity2d-works-now); however, it still does not work well for me. In particular, if you log out and then log in again, the same blank desktop appears :-(.

2. Unity 2D works fine. The new HUD (Heads-Up Display) is the killer feature.

3. Gnome Shell generally works. However, most extensions, such as User Themes, are unavailable for now.

4. The default setting for dual monitors in Gnome Shell behaves weirdly. You can only switch workspaces on the primary monitor, while the workspace on the secondary monitor stays the same. To make workspaces span both monitors, run the following command:
gsettings set org.gnome.shell.overrides workspaces-only-on-primary false
as pointed out at http://gregcor.com/2011/05/07/fix-dual-monitors-in-gnome-3-aka-my-workspaces-are-broken/

5. In Gnome Shell, the wallpaper does not show up even though I have already set it from System Settings -> Appearance. This can be fixed as follows:

Open gnome-tweak-tool
Click on the Desktop tab
Turn on "Have file manager handle the desktop"
Turn on "Computer icon visible on desktop"

Wednesday, September 28, 2011

Multithreaded downloading with wget


GNU wget is versatile and robust, but lacks support for multithreaded downloading. When downloading multiple files, it just goes one by one, which is quite inefficient if the bandwidth of each connection is limited.

There is a way to achieve nearly the same effect as multithreaded downloading, and here is how you do it:
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
Copy the command as many times as you deem appropriate to have that many processes downloading. The key is the -N option, which tells wget to download a file only when the local copy is older than the one on the server.

Alternatively, I wrote a wrapper, pwget (short for parallel wget), that adds multithreading to wget. The program is available from https://github.com/songqiang/pwget. It has two options, --max-num-threads and --sleep. The first, --max-num-threads, gives the maximum number of connections you allow to be established. This number is usually limited by the settings on the server side; the default is 3. The second, --sleep, specifies how often (in seconds) the master thread checks the status of the downloading threads. When the master thread wakes up, it removes finished threads and adds new downloading threads if necessary. Suppose you have a list of URLs in the file url-list.txt; then run
./pwget.py --max-num-threads 5 --sleep 2 -i url-list.txt
pwget will begin downloading the URLs listed in url-list.txt with at most 5 connections at once. You can also specify wget options on the command line; they are passed on to the worker threads.
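
The general idea of running one wget process per URL with a cap on concurrency can be sketched as follows. This is not the actual pwget code: it uses a thread pool from the Python standard library instead of a polling master thread, and the function names and the runner parameter (used to swap in a fake for testing) are my own assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(url, wget_args=()):
    """Build the wget command line for one URL."""
    return ["wget", *wget_args, url]


def download_all(urls, max_workers=3, wget_args=(), runner=subprocess.call):
    """Run one wget process per URL, at most max_workers at a time.

    Returns a dict mapping each URL to the exit code returned by runner.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the input URLs.
        codes = list(pool.map(lambda u: runner(build_command(u, wget_args)), urls))
    return dict(zip(urls, codes))
```

With the URLs read from url-list.txt into a list, a call such as download_all(url_list, max_workers=5, wget_args=["-N"]) would start at most five wget processes at once.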

This tool has several limitations. pwget parallelizes at the level of URLs, so you need to list all URLs in advance. Furthermore, if you have a single large file, pwget does not help. In that case, you may consider using aria2 (http://aria2.sourceforge.net/).

Tuesday, September 20, 2011

Running SSH on a non-standard port

The default port for SSH connections is 22. However, some servers change it to another port, for example 22222, for security reasons. Here I list some common commands for dealing with a non-standard SSH port.

Suppose you have an SSH server ssh.example.edu listening on port 22222.

To copy your ssh public key to the server, run:
ssh-copy-id '-p 22222 jon@ssh.example.edu'
Note that the single quotes are necessary.

To log in to the SSH server, run:
ssh -p 22222 jon@ssh.example.edu

To copy files between your local machine and the server with scp, run:
scp -P 22222 local-files jon@ssh.example.edu:~
Note that the "-P" option is capitalised.

References:
1. http://www.itworld.com/nls_unixssh0500506
2. http://mikegerwitz.com/2009/10/07/ssh-copy-id-and-sshd-port/


Thursday, July 14, 2011

Setting Up a Hadoop Cluster

This post lists the steps to set up a Hadoop cluster on Ubuntu 11.04. Most of the code can be copied and pasted directly.

* Hadoop
** Install Java
#+begin_src shell
sudo apt-get install sun-java6-jdk
sudo update-java-alternatives -s java-6-sun
#+end_src

** Add Hadoop User and Group
#+begin_src shell
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
#+end_src

** Configuring SSH and Password-less Login
#+begin_src sh
  # In the master node
  su hadoop
  ssh-keygen -t rsa -P ""
 
  for node in $(cat /conf/slaves);
  do
      ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@$node;
  done
#+end_src

** Install Hadoop
*** Install
#+begin_src sh
  ## download and install
  cd /home/hadoop/
  tar xzf hadoop-0.21.0.tar.gz
  mv hadoop-0.21.0 hadoop
#+end_src
*** Update .bashrc
#+begin_src sh
  ## update .bashrc
  # Set Hadoop-related environment variables
  export HADOOP_HOME=/home/hadoop/hadoop
  export HADOOP_COMMON_HOME="/home/hadoop/hadoop"
  export PATH=$PATH:$HADOOP_HOME/bin
  export PATH=$PATH:$HADOOP_COMMON_HOME/bin/
#+end_src
*** Update conf/hadoop-env.sh
#+begin_src sh
  export JAVA_HOME=/usr/lib/jvm/java-6-sun
  export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
#+end_src
*** Update conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://128.125.86.89:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>


</configuration>
*** Update conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!-- In: conf/mapred-site.xml -->
<property>
<name>mapreduce.jobtracker.address</name>
<value>128.125.86.89:54311</value>
</property>

</configuration>
*** Update conf/hdfs-site.xml
#+begin_src html
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>

</configuration>
#+end_src
*** Update conf/masters (master node only)
#+begin_src sh
128.125.86.89
#+end_src
*** Update conf/slaves (master node only)
#+begin_src sh
128.125.86.89
slave-ip1
slave-ip2
......
#+end_src
*** Copy hadoop installation and configuration files to slave nodes
#+begin_src sh
# In the master node
su hadoop
for node in $(cat /conf/slaves);
do
      # Copy the shell settings and the Hadoop installation to each slave
      scp ~/.bashrc hadoop@$node:~
      scp -r ~/hadoop hadoop@$node:~
done
#+end_src
** Run Hadoop
*** Format HDFS
#+begin_src sh
hdfs namenode -format
#+end_src
*** Start Hadoop
#+begin_src sh
start-dfs.sh && sleep 300 && start-mapred.sh && echo "GOOD"
#+end_src
*** Run Jobs
#+begin_src sh
hadoop jar hadoop pipes
#+end_src
*** Stop Hadoop
#+begin_src sh
stop-mapred.sh && stop-dfs.sh
#+end_src
** References:
1. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ 
2. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
3. http://fclose.com/b/cloud-computing/290/hadoop-tutorial/
4. Fix could only be replicated to 0 nodes instead of 1 error

Thursday, March 10, 2011

Identifying dispersed epigenomic domains from ChIP-Seq data

Published in Bioinformatics http://bioinformatics.oxfordjournals.org/content/27/6/870.full

1 INTRODUCTION

Post-translational modifications to histone tails, including methylation and acetylation, have been associated with important regulatory roles in cell differentiation and disease development (Kouzarides, 2007). The application of ChIP-Seq to histone modification study has proved very useful for understanding the genomic landscape of histone modifications (Barski et al., 2007; Mikkelsen et al., 2007). Certain histone modifications are tightly concentrated, covering a few hundred base pairs. For example, H3K4me3 is usually associated with active promoters, and occurs only at nucleosomes close to transcription start sites (TSSs). On the other hand, many histone modifications are diffuse and occupy large regions, ranging from thousands to several millions of base pairs. A well-known example, H3K36me3, is associated with active gene expression and often spans the whole gene body (Barski et al., 2007). Reflected in ChIP-Seq data, the signals of these histone modifications are enriched over large regions, but lack well-defined peaks. It is worth pointing out that the property of being ‘diffuse’ is a matter of degree. Besides the modification frequency, the modification profile over a region is also affected by nucleosome densities and the strength of nucleosome positioning. By visual inspection of read-density profiles, we found that H2BK5me1, H3K79me1, H3K79me2, H3K79me3, H3K9me1, H3K9me3 and H3R2me1 show similar diffuse profiles.
There are several general questions about dispersed epigenomic domains that remain unanswered. Many of these questions center around how these domains are established and maintained. One critical step in answering these questions is to accurately locate the boundaries of these domains. However, most existing methods for ChIP-Seq data analysis were originally designed for identifying transcription factor binding sites. These focus on locating highly concentrated ‘peaks’, and are inappropriate for identifying domains of dispersed histone modification marks (Pepke et al., 2009). Moreover, the quality of ‘peak’ analysis is measured in terms of sensitivity and specificity of peak calling (accuracy), along with how narrow the peaks are (precision; often determined by the underlying platform). But for diffuse histone modifications, significant ‘peaks’ are usually lacking and often the utility of identifying domains depends on how clearly the boundaries are located.

2 METHODS

Our method for identifying epigenomic domains is based on the hidden Markov model (HMM) framework, including Baum–Welch training and posterior decoding (see Rabiner, 1989 for a general description).
Single sample analysis: we first obtain the read density profile by dividing the genome into non-overlapping fixed length bins and counting the number of reads in each bin. The bin size can be determined automatically as a function of the total number of reads and the effective genome size (Supplementary Section S1.5). We model the read counts with the negative binomial distribution after correcting for the effect of genomic deadzones. We first exclude unassembled regions of a genome from our analysis. Second, when two locations in the genome have identical sequences of length greater than or equal to the read length, any read derived from one of those locations will necessarily be ambiguous and is discarded. We refer to contiguous sets of locations to which no read can map uniquely as ‘deadzones’. Those bins within large deadzones (referred to as ‘deserts’) are ignored. For those bins outside of deserts, we correct for the deadzone effect by scaling distribution parameters according to the proportion of the bin which is not within a deadzone (Supplementary Section S1.3).
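
As an illustration of the first step only, a read density profile over fixed-length bins could be computed as in the sketch below. This is not the RSEG implementation: the function name is hypothetical, reads are represented simply by their 0-based start positions on one chromosome, and the deadzone correction is left out.

```python
def bin_read_counts(read_starts, chrom_length, bin_size):
    """Count reads in non-overlapping fixed-length bins along one chromosome.

    read_starts: 0-based start positions of mapped reads.
    Returns a list with one count per bin; the last bin may be shorter.
    """
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < chrom_length:  # ignore positions outside the chromosome
            counts[pos // bin_size] += 1
    return counts
```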
We assume a bin may be in one of two states: a foreground state with high histone modification frequency and a background state with low histone modification frequency. We developed a two-state HMM for segmenting the genome into foreground domains and background domains.
Identifying and evaluating domain boundaries: while predicted domains themselves give the locations of boundaries, we characterize the boundaries with the following metrics. We evaluate domain boundaries based on posterior probabilities of transitions between the foreground state and the background state as estimated by the HMM. For each pair of consecutive genomic bins, the posterior probability is calculated for all possible transitions between those bins. If a boundary corresponds to the beginning of a domain, the boundary score is the posterior probability of a background to foreground transition and vice versa.
Next an empirical distribution of posterior transition probabilities is constructed by computing posterior transition probabilities from a dataset of randomly permuted bins with the same HMM parameters. Those bins whose posterior transition probabilities have significant empirical P-values are kept and consecutive significant bins are joined as being one boundary. We score each boundary with the posterior probability that a single transition occurs in this boundary. The peak of a boundary is set to the start of the bin with the largest transition probability (see Supplementary Section S3 for details).
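
The posterior transition probabilities described above can be obtained from the standard scaled forward-backward recursions. The sketch below is a generic illustration under simplifying assumptions, not the RSEG code: the model parameters are given rather than trained with Baum–Welch, both states share one dispersion parameter, there is no deadzone correction, and all names are hypothetical.

```python
import math


def nb_pmf(count, mean, size):
    """Negative binomial pmf parameterised by mean (> 0) and size (dispersion)."""
    p = size / (size + mean)
    log_pmf = (math.lgamma(count + size) - math.lgamma(count + 1) - math.lgamma(size)
               + size * math.log(p) + count * math.log(1 - p))
    return math.exp(log_pmf)


def posterior_transitions(counts, means, size, trans, init):
    """Return xi with xi[t][i][j] = P(state_t = i, state_{t+1} = j | counts)."""
    n, n_states = len(counts), len(means)
    emit = [[nb_pmf(c, m, size) for m in means] for c in counts]

    # Scaled forward pass: each alpha[t] is normalised to sum to 1.
    alpha, scale = [], []
    for t in range(n):
        if t == 0:
            row = [init[j] * emit[0][j] for j in range(n_states)]
        else:
            row = [emit[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(n_states))
                   for j in range(n_states)]
        total = sum(row)
        scale.append(total)
        alpha.append([v / total for v in row])

    # Scaled backward pass, reusing the forward scaling factors.
    beta = [[1.0] * n_states for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for i in range(n_states):
            beta[t][i] = sum(trans[i][j] * emit[t + 1][j] * beta[t + 1][j]
                             for j in range(n_states)) / scale[t + 1]

    # Posterior probability of each transition between consecutive bins.
    return [[[alpha[t][i] * trans[i][j] * emit[t + 1][j] * beta[t + 1][j] / scale[t + 1]
              for j in range(n_states)] for i in range(n_states)]
            for t in range(n - 1)]
```

With state 0 as background and state 1 as foreground, xi[t][0][1] is the score for a domain starting between bins t and t+1, and xi[t][1][0] is the score for a domain ending there.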
Incorporating a control sample: ChIP-Seq experiments are influenced by background noise, contamination and other possible sources of error, and researchers have begun to realize the necessity of generating experimental controls in ChIP-Seq experiments. Two common forms of control exist: a non-specific antibody such as IgG to control the immunoprecipitation, and sequencing of whole cell extract to control for contamination and other possible sources of error. With the availability of a control sample, we use a similar two-state HMM with the novel NBDiff distribution to describe the relationship between the read counts in the two samples. Analogous to the Skellam distribution (Skellam, 1946), the NBDiff distribution describes the difference of two independent negative binomial random variables (see Supplementary Section S1.2 for details).
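
To make the definition concrete, the probability mass function of the difference of two independent negative binomial variables can be evaluated numerically as a truncated convolution. This is only a sketch under my own assumptions: the mean/size parameterisation, the function names and the truncation point max_count are not from the paper, which may compute the distribution differently.

```python
import math


def nb_pmf(count, mean, size):
    """Negative binomial pmf parameterised by mean (> 0) and size (dispersion)."""
    p = size / (size + mean)
    log_pmf = (math.lgamma(count + size) - math.lgamma(count + 1) - math.lgamma(size)
               + size * math.log(p) + count * math.log(1 - p))
    return math.exp(log_pmf)


def nbdiff_pmf(d, mean_a, size_a, mean_b, size_b, max_count=500):
    """P(X - Y = d) for independent X ~ NB(mean_a, size_a) and Y ~ NB(mean_b, size_b).

    The infinite sum over y is truncated at max_count, which is adequate
    only when both distributions have negligible mass beyond that point.
    """
    start = max(0, -d)  # X = y + d must be non-negative
    return sum(nb_pmf(y + d, mean_a, size_a) * nb_pmf(y, mean_b, size_b)
               for y in range(start, max_count + 1))
```

When both variables have the same parameters the distribution is symmetric around zero, and in general its mean is the difference of the two means.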
Simultaneously segmenting two modifications: the simultaneous analysis of two histone modification marks may reveal more accurate information about the status of genomic regions. It helps to understand the functions of different histone modification marks. It is also of interest to compare samples from different cell types because histone modification patterns are dynamic and subject to change during cell differentiation. We use the NBDiff distribution to model the read count difference between the two samples, and employ a three-state HMM: the basal state means the two signals are similar, the second state means the signal in test sample A is greater than that in test sample B and the third state represents the opposite case (details given in Supplementary Section S2.1).

3 EVALUATION AND APPLICATIONS

We simulated H3K36me3 ChIP-Seq data and compared RSEG, SICER (Zang et al., 2009) and HPeak (Qin et al., 2010). In terms of domain identification, RSEG outperforms SICER and HPeak for single-sample analysis and yields comparable results to SICER for analysis with control samples (Supplementary Section S4.1 and 4.2). We applied RSEG to the H3K36me3 ChIP-Seq dataset from Barski et al. (2007) and found a strong association of H3K36me3 domain boundaries with TSS and transcription termination site (TTS), which supports the claim that RSEG can find high-quality domain boundaries (Supplementary Section S4.3).
We applied RSEG to four histone modification marks (H3K9me3, H3K27me3, H3K36me3 and H3K79me2) from two separate studies (Barski et al., 2007; Mikkelsen et al., 2007) (Supplementary Section S5.1). In particular, we discovered an interesting relationship between the two gene-overlapping marks H3K36me3 and H3K79me2 through boundary analysis. H3K79me2 tends to associate with 5′-ends of genes, while H3K36me3 associates with 3′-ends. About 41% of gene-overlapping K79 domains cover TSS in contrast to 11% of K36 domains. On the other hand, 84% of K36 domains cover TTS in contrast to 23% of K79 domains (Table 1). In those genes with both H3K36me3 and H3K79me2 signals, H3K79me2 domains tend to precede H3K36me3 domains, for example the DPF2 gene (Fig. 1) (see Supplementary Section S5.2 for more information). This novel discovery demonstrates the usefulness of boundary analysis for dispersed histone modification marks.
Fig. 1. The H3K36me3 and H3K79me2 domains and their boundaries at DPF2 (chr11:64,854,646–64,880,304).
Table 1. Location of H3K36me3 and H3K79me2 domain boundaries relative to genes
Finally, we applied our three-state HMM to simultaneously analyze H3K36me3 and H3K79me2 (Supplementary Section S5.4). The result agrees with the above observations. The application of our three-state HMM to find differential histone modification regions is given in Supplementary Section S5.3.

Saturday, February 26, 2011

Sustainable Scientific Data Archiving Model

As many researchers may have noticed, NCBI plans to discontinue the Short Read Archive (SRA) service due to budget constraints. This news surprises me and, I believe, concerns the broad biomedical research community. As biomedical research enters the -omics era and becomes more and more data-driven, the sudden closure of SRA raises the question of how scientific data can be archived under a sustainable model. I discuss two strategies for preserving scientific data in a sustainable manner. The first proposes a central data repository that charges a data deposition fee. The second proposes that the data be stored in a P2P manner, with a central gateway that gathers metadata and tracks links to P2P seeds.

A sustainable data archiving model has the following aspects. First, the data should include the necessary and accurate metadata. Second, the data should be stored securely and remain authentic and correct for a long time. Third, the data should include the essential software and scripts needed to analyze it. Finally, the data should be easy to search and access by the broader research community, now and for a considerable period in the future, so that researchers can use the dataset from different perspectives and even re-analyze it if new hypotheses and analytic methods emerge.

However, as has been noted elsewhere, there is a disconnect between the effort to produce the data and the effort to preserve it. Simply put, funding agencies provide the money to produce the data but not the money to maintain it. Grants for producing data run on a rather short time scale, usually two to five years. Once the grant is over, the project is done and the original researchers have switched to other projects, the data is in danger of being lost. Fortunately, the biomedical research community has a pretty good record of depositing biological datasets for public research, as exemplified by GenBank and GEO. The Short Read Archive was designed to meet the requirements of massively parallel sequencing read data. However, the discontinuation of this service demonstrates the uncertainty of the current data sharing model, which lacks dedicated funding. Therefore I am considering the following two strategies for sustainable scientific data archiving.

In the first strategy, we still rely on a central data repository like SRA that curates, stores and distributes biological datasets. To meet its financial requirements, the repository charges a fee for the data it hosts. It works as follows. When the original data producers finish their research and submit a paper to a journal, the journal requires that their data be deposited in a designated repository and charges a data deposition fee. The journal then passes the major part of that fee to the central data repository. The fee is charged only once, so it can be covered by the initial grant of the original data producer. With the ever-decreasing cost of data storage, the continual influx of one-time deposition fees should keep the central data repository running.

The second strategy was first suggested to me by my friend Li Xia and is further inspired by Morgan Langille, the creator of BioTorrents. In this strategy, the dataset is stored by multiple hosts who have the resources and interest to keep it. A central gateway keeps track of the BitTorrent seeds for the raw data and also stores the metadata associated with each dataset, such as the contact details of the data producer, experimental protocols and descriptions of the raw data. In particular, the central gateway stores the version of the raw dataset and its MD5 or SHA checksum, so that data users can make sure they are obtaining an up-to-date and authentic dataset from the essentially unreliable and untrusted hosts of a P2P network. Since the central gateway only needs to track this metadata, its running cost is significantly smaller than that of a central data repository, and it could therefore operate as just a new section of the NCBI infrastructure.
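
The checksum comparison a data user would run after downloading from a P2P host can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names sha256_of and is_authentic are hypothetical, SHA-256 is picked as one member of the SHA family, and expected_hex stands for the checksum the gateway would publish with the other metadata.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_authentic(path, expected_hex):
    """Compare a downloaded file against the checksum published by the gateway."""
    # expected_hex is assumed to come from the (hypothetical) gateway metadata.
    return sha256_of(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat even for multi-gigabyte read archives; a mismatch means the copy should be discarded and fetched from another host.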

I hope this discussion publicizes the urgency of sustainable scientific data archiving, so that the biomedical research community can work out a way forward after SRA ends.

Monday, October 12, 2009

Next Generation Genome Browser

When David, Fang Fang and I talked about the UCSC Genome Browser today, I said "I would like a genome browser like Google Maps". Later I found myself more and more excited about this idea: the next generation genome browser, which provides a more user-friendly and powerful platform to organize and display genomic information.

What does the next genome browser ("genome map") look like?

First, smoother zooming in and out. Genomes are organized in a hierarchical structure. Sometimes we need a bird's-eye view of the whole genome and sometimes we are interested in subtle local structures. It is of great value to be able to change the resolution while examining the genome. So we need dynamic, smooth zooming, just like the little slider in Google Maps. (Update: I came across JBrowse and the Anno-J browser, which seem to have this function. See the references.)

Second, advanced search functions. Current genome browsers can only search by genomic location; as a result, the vast amount of annotation information cannot be searched in the genome browser. It would be cool to have a search box: users enter a keyword, such as a gene name, and the genome map displays the regions that match the query.

Third, what kind of web technology would we need? Ajax? A database back end? XML? Maybe Google Maps is a good starting point.

It seems quite possible that such a genome map will appear. What other features are you looking for in the next generation genome browser?

Ref:
1. Skinner ME, Uzilov AV, Stein LD, Mungall CJ, Holmes IH. JBrowse: A next-generation genome browser. Genome Res. (2009) http://jbrowse.org/
2. AnnoJ Browser http://www.annoj.org/index.shtml

Friday, March 06, 2009

Exon/Intron Statistics in Human Genome

Data from: http://www.bioinfo.de/isb/2004/04/0032/main.html#tab-1

Table 1: Exon - intron distributions for human genome
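Columns, in order: Chr | Total genes | Total exons | Total introns | Max exons/gene | Chromosome size (determined, bp) | Avg exons/gene | Avg exon length (bp) | Avg intron length (bp) | Std dev exon length | Std dev intron length | Total exon length (bp) | Total intron length (bp) | Shortest exon (bp) | Shortest intron (bp) | Shortest gene (bp) | Longest exon (bp) | Longest intron (bp) | Longest gene (bp)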
1 2514 22345 19831 107 226828929 8.89 167.01 4736.52 229.37 14268.19 3731870 93929919 2 1 78 8449 476158 980961
2 1354 12506 11152 148 238349289 9.24 163.98 5883.23 226.88 17012.24 2050855 65609873 2 1 90 7572 483412 1897544
3 1394 13517 12123 118 195073306 9.70 164.06 6375.63 224.21 21019.22 2217700 77291760 2 1 150 6654 497816 990999
4 926 8299 7373 85 187239983 8.96 174.78 7168.94 266.64 19497.08 1450541 52856617 2 53 132 6255 494708 1467842
5 1186 9946 8760 90 177696509 8.39 189.50 7277.28 332.86 21277.20 1884777 63748970 2 1 150 6574 370360 930401
6 1306 11406 10100 145 169212327 8.73 173.62 5961.61 253.56 18967.75 1980397 60212251 2 31 159 7152 469892 1377570
7 2508 23045 20537 82 310210944 9.19 167.87 6703.87 271.88 20177.41 3868769 137677396 2 1 14 11923 458139 1641567
8 908 7823 6915 86 143297300 8.62 171.16 7354.15 258.43 21384.09 1339052 50853964 2 54 84 7308 453268 2055833
9 1033 8941 7908 72 117790386 8.66 170.66 5351.68 253.19 14121.26 1525926 42321074 2 33 105 6598 276306 865661
10 1017 10273 9256 69 132016990 10.10 153.79 6412.91 219.97 20271.48 1579898 59357955 2 52 105 7812 482575 1727184
11 1567 12459 10892 87 130908954 7.95 177.66 4341.42 237.03 15362.46 2213526 47286795 3 1 87 6183 437543 1463302
12 1299 12399 11100 89 129826379 9.55 158.07 4570.21 192.23 12979.23 1959945 50729293 2 30 81 6324 328545 1248678
13 426 3784 3358 83 95749578 8.88 183.47 7351.75 396.79 19082.4 694268 24687182 2 37 279 11555 317646 1175762
14 854 6837 6106 114 87191216 8.01 176.24 5653.70 276.66 19076.38 1204982 33826109 2 51 51 11304 479079 1210740
15 843 8106 7263 104 81992482 9.62 169.79 4660.70 271.38 11542.05 1376321 33850721 2 1 168 9527 207178 620362
16 1093 9986 8893 62 79932432 9.14 166.96 3661.25 242.60 13092.99 1667340 32559472 2 1 75 8607 466049 1167938
17 1459 13179 11720 74 79376966 9.03 165.08 3193.16 215.89 9875.72 2175698 37423835 2 30 63 4786 283762 712668
18 367 3333 2966 75 74658403 9.08 174.9 7905.40 256.53 19377.24 583054 23447419 3 67 225 4721 411175 1189866
19 1609 12169 10560 106 55878340 7.56 187.31 2032.87 279.92 4741.54 2279436 21467122 2 1 81 5059 170796 298909
20 775 6492 5717 80 59424990 8.38 160.34 4403.10 215.29 13613.39 1040952 25172558 3 54 135 3738 303713 1108855
21 309 2539 2230 47 33924367 8.22 168.59 5086.89 306.51 16098.67 428056 11343761 3 74 102 5916 323563 833627
22 671 5173 4502 54 34352072 7.71 171.14 3924.83 281.85 12999.39 885356 17669584 3 42 38 6762 447252 492969
X 1048 8568 7520 79 152118949 8.18 185.33 7627.85 299.66 23527.35 1587926 57361443 2 54 129 6102 493512 2217347
Y 98 660 562 44 24649555 6.73 173.74 5288.54 255.05 19676.46 114670 2972162 3 67 228 2493 400349 681119
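
Two columns of the table can be derived from others, which gives a quick sanity check when copying rows. A gene with n exons has n - 1 introns, so total introns should equal total exons minus total genes, and the average number of exons per gene is total exons divided by total genes. A minimal Python sketch using three rows from the table (the helper name check_row is hypothetical):

```python
# Rows copied from the table: (chromosome, genes, exons, introns, avg exons/gene)
ROWS = [
    ("1", 2514, 22345, 19831, 8.89),
    ("14", 854, 6837, 6106, 8.01),
    ("Y", 98, 660, 562, 6.73),
]


def check_row(genes, exons, introns, avg_exons):
    """Return (introns_ok, avg_ok) for one table row."""
    # Each gene contributes one fewer intron than exons.
    introns_ok = introns == exons - genes
    # The table rounds the average to two decimals, so allow half a unit
    # in the last printed digit.
    avg_ok = abs(exons / genes - avg_exons) <= 0.005
    return introns_ok, avg_ok


for chrom, genes, exons, introns, avg in ROWS:
    print(chrom, check_row(genes, exons, introns, avg))
```

Chromosomes 1 and Y pass both checks. The chromosome 14 row fails the intron check, since 6837 - 854 = 5983 rather than the listed 6106, so that cell is worth verifying against the original source.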

Wednesday, February 18, 2009

Emacs Note

1. Packages for using emacs as an IDE

http://xtalk.msk.su/~ott/common/emacs/rc/emacs-rc-cedet.el.html
http://cedet.sourceforge.net/
http://cscope.sourceforge.net/
http://ecb.sourceforge.net/

2. How can I use emacs without gui when I work on a remote machine with a slow connection?
emacs -nw

3. In emacs shell mode, what setting do I need to modify to make the shell prompt (PS1) display correctly, e.g. with colors as in a terminal?

Add the following code to your .emacs:
(autoload 'ansi-color-for-comint-mode-on "ansi-color" nil t)
(add-hook 'shell-mode-hook 'ansi-color-for-comint-mode-on)

4. After I updated to emacs 23, invoking flyspell-mode gives the error "Enabling flyspell-mode gave an error".

This is caused by a conflict between the site dictionaries and the dictionaries shipped with emacs 23. It can be fixed as follows:
cd /usr/share/emacs23/site-lisp/dictionaries-common
sudo rm *.el *.elc

5. How do I enable double spacing in a LaTeX document edited in emacs?

This feature is provided by the setspace package. Add the following commands to the preamble of your tex file.
\usepackage{setspace}
\doublespacing

6. Which font looks pretty in emacs?
My personal favourite is Nimbus Mono L regular. 

7. In org-mode, how can I change the default browser?
Add the following two lines in your .emacs file:
(setq browse-url-browser-function (quote browse-url-generic))
(setq browse-url-generic-program "google-chrome")
Similarly, if you want to use the open sourced version of Chrome Browser instead of the Google rebranded version, replace "google-chrome" with "chromium-browser"; if you want to use firefox, replace "google-chrome" with "firefox". 

8. After copying some text in emacs, how can I paste it into another application?
Add the following line to your .emacs file:
(setq x-select-enable-clipboard t)

Monday, February 16, 2009

Latex Notes

Tips:

1. How to type mathematical symbols
http://www.artofproblemsolving.com/LaTeX/AoPS_L_GuideSym.php

3. Use the following packages to make your documents prettier
\usepackage{times, fullpage}

4.  How can I input the addition assign (+=) operator in latex?
\mathrel{\mathop+}=

5. How can I move all figures and tables to the end of article?
Use the package endfloat  http://www.ctan.org/pkg/endfloat

6. How can I edit and generate files in Chinese?
Use the xelatex command, see a simple template at https://github.com/songqiang/latex-templates/blob/master/latex-template-xelatex.tex



Good read:
  1. 陈硕: 用 LaTeX 排版技术书籍 https://github.com/chenshuo/typeset
  2. 无有的笔记空间: LaTeX 排版学习笔记 http://zoho.is-programmer.com/posts/30662.html


Thursday, December 04, 2008

Installing Ubuntu on HP Pavilion dv4 1114nr

I got a new HP Pavilion dv4 1114nr this holiday season. It came with Windows Vista Home Edition pre-installed. Here is a log of how I installed Ubuntu on this laptop.

1. Create Windows Vista recovery disk
Boot into Windows Vista. Since HP no longer provides recovery disks with new laptops, you need to create your own in case you need Windows Vista in the future. Go to Start -> Recovery Disk Creation and follow the instructions.

2. Repartition the hard drive
Windows Vista comes with hard drive resizing and repartitioning utilities. That's cool! It saves us the trouble of searching for third-party software.
Follow the instructions in the following documents:
1. Screenshot Tour: Repartition your hard drive in Windows Vista
2. Can I repartition my hard disk?

3. Download
Don't bother downloading the Ubuntu installation ISO and burning your own installation CD. If you have internet access (a fairly weak condition, isn't it?), you can use UNetbootin (http://en.wikipedia.org/wiki/UNetbootin).

I am not exactly sure, but there seems to be a bug in UNetbootin.
I partitioned my hard drive into three partitions: C: Windows system partition; D: HP recovery partition; F: unformatted free partition, intended for the Linux installation.

But when I select Hard Drive mode, only the C: partition is displayed; I have to select USB Live mode and select the F: partition there. I am not sure what this implies, and I am still waiting for the result.

5. Sound issues
After the installation, the speaker and the microphone did not work. In particular, I could not use Skype :-(.

Solution to the "no sound" problem
Open
sudo vi /etc/modprobe.d/alsa-base
Add the following line to the end of the file
options snd-hda-intel model=laptop enable_msi=1

Solution to the microphone problem:
The mic is possibly muted.
Open Volume Control by double-clicking the icon at the top-right corner. Select Preferences and select the devices for recording and playback. Then uncheck the mute option.

Solution to the Skype "Audio playback" problem
Execute the following commands in a terminal

killall pulseaudio
sudo apt-get remove pulseaudio # this seems not necessary
sudo apt-get install esound
sudo rm /etc/X11/Xsession.d/70pulseaudio
refer to http://www.econowics.com/news-from-the-net/170/skype-problem-with-audio-playback-ubuntu-810-intrepid-ibex/

refer to
https://bugs.launchpad.net/ubuntu/+bug/269586
https://help.ubuntu.com/community/HdaIntelSoundHowto


6. install skype
7. install songbird
8. install Java Runtime Environment
9. install Open Office 3.0
10. install Mac4lin
11. install VLC and other codecs
12. install sopcast and gsopcast (online TV channel)
13. install fcitx Chinese input
First remove default scim framework and install fcitx
sudo apt-get autoremove scim
sudo apt-get install fcitx
Next, modify Xsession to start fcitx automatically for all users. Open
sudo gedit /etc/X11/Xsession.d/95xinput
and change it to
export XMODIFIERS=@im=fcitx
export XIM=fcitx
export XIM_PROGRAM=fcitx
export GTK_IM_MODULE=fcitx
export QT_IM_MODULE=XIM
fcitx
Open
sudo vim /usr/lib/gtk-2.0/2.10.0/immodule-files.d/libgtk2.0-0.immodules
Change the line about xim to
"xim" "X Input Method" "gtk20" "/usr/share/locale" "en:ko:ja:th:zh"
======
Well, I have come back to update this post. I just returned this HP laptop. This was the first time I bought a laptop from HP, and unfortunately it was a disappointing experience. I have two issues to complain about. The CPU fan is too noisy: even after I disabled the "Keep fan running" feature in the BIOS, the fan still makes too much noise. The CD-ROM drive is not quiet either; it feels like an earthquake when the CD drive is working.

The recovery tool is also annoying. I could not restore my laptop to the factory configuration, either via the hard drive recovery tool or via the recovery CDs. It failed with "error 1002", and HP customer service could not provide any useful help (they outsource customer service to India, so we have to adapt to Indian English).

Anyway, I will blacklist this model from HP: HP Pavilion dv4.

Reference:
1. Screenshot Tour: Repartition your hard drive in Windows Vista
2. Can I repartition my hard disk?
3. Unetbootin http://unetbootin.sourceforge.net/
4. Tutorial: Ubuntu Linux on HP Pavilion
http://aldeby.org/blog/index.php/howto-ubuntu-linux-on-hp-pavilion-dv2000-dv6000-dv9000-series-laptops
5. http://www.dailygyan.com/2008/11/10-things-you-should-do-immediately.html
6. Top 10 Ubuntu downloads http://lifehacker.com/5227309/top-10-ubuntu-downloads
7. http://theindexer.wordpress.com/2009/04/24/to-do-list-after-installing-ubuntu-904-aka-jaunty-jackalope/
8. Install Microsoft YaHei font http://hi.baidu.com/zzy011/blog/item/6651e3ed44a9c62f63d09f37.html

Saturday, November 08, 2008

<R>andom Notes

1. How to measure the running time of an R function?

R has a function proc.time() http://rweb.stat.umn.edu/R/library/base/html/proc.time.html
Sample code:
## a way to time an R expression: system.time is preferred
> ptm <- proc.time()
> for (i in 1:50) mad(stats::runif(500))
> proc.time() - ptm
   user  system elapsed 
  0.039   0.001   0.052 

2. string manipulation in R

define a string
> s = "some characters"

convert other type into a string
> s = as.character(some_variable_in_other_type)

Convert a string into numbers
> pi = as.numeric("3.14159")


string length
> nchar(s)

string concatenation
> s1 = "string1"
> s2 = "string2"
> paste(s1, s2, sep = "")

given a vector of strings, vs, return a string that is the concatenation of vs's elements
> vs = c("song", "qiang")
> paste(vs, collapse = "")
[1] "songqiang"

string slicing
Suppose s is a string; how do we extract a substring given its starting and ending positions?
Use substr(). There is no default value for stop. If stop is larger than the length of the string, it is truncated to the length of the string.
> substr(s, start = 1, stop = 12)

string split
strsplit() returns a list with one element per input string
> strsplit("song qiang", split=" ")
[[1]]
[1] "song"  "qiang"


3. When making figures with a legend box, the text spills out of the legend box when we use dev.copy2eps() to convert the figure to an EPS file

This problem comes from the different font size specifications of different devices. An ugly way to solve it is to specify text.width=strwidth("some string") in the call to legend(),
where "some string" is the longest legend text plus some extra characters. The optimal number of extra characters has to be determined by trial and error.

4. How to handle exceptions in R?
Read about the two functions try and tryCatch (R FAQ 7.32). An example with try is shown below:
for (i in 1:16)
{
   # try() returns an object of class "try-error" when the call fails
   result <- try(nonlinear_modeling(i))
   if (inherits(result, "try-error")) next
}

GNU/Linux Notes

1. How to speed up Linux booting?
See Bootchart http://www.bootchart.org/index.html
and remove unnecessary services from the boot process


2. One important thing to remember when creating a SVN repository
In Subversion 1.1, a repository is created with a Berkeley
DB back-end by default. This behavior may change in future
releases. Regardless, the type can be explicitly chosen with
the --fs-type argument:
$ svnadmin create --fs-type fsfs /path/to/repos
$ svnadmin create --fs-type bdb /path/to/other/repos

Do not create a Berkeley DB repository on a network
share—it cannot exist on a remote
filesystem such as NFS, AFS, or Windows SMB. Berkeley DB
requires that the underlying filesystem implement strict POSIX
locking semantics, and more importantly, the ability to map
files directly into process memory. Almost no network
filesystems provide these features. If you attempt to use
Berkeley DB on a network share, the results are
unpredictable—you may see mysterious errors right away,
or it may be months before you discover that your repository
database is subtly corrupted.
If you need multiple computers to access the repository,
you create an FSFS repository on the network share, not a
Berkeley DB repository. Or better yet, set up a real server
process (such as Apache or svnserve), store
the repository on a local filesystem which the server can
access, and make the repository available over a network.
Chapter 6, Server Configuration covers this process in
detail.

3. Count the number of files in a directory and its subdirectories

total number of files
find some_directory -type f | wc -l

list number of files in each directory in detail
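#!/usr/bin/env python3
"""Print the number of files under each directory, recursively."""

import os
import sys


def count(p):
    # A plain file counts as one.
    if not os.path.isdir(p):
        print("%s\t%d" % (p, 1))
        return 1

    total = 0
    for name in os.listdir(p):
        child = os.path.join(p, name)  # listdir() returns bare names
        if os.path.isdir(child):
            total += count(child)
        else:
            total += 1

    print("%s\t%d" % (p, total))
    return total


if __name__ == "__main__":
    count(sys.argv[1])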
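
For the total alone, a shorter alternative is os.walk from the Python standard library, which visits every subdirectory for you. A minimal sketch (the function name total_files is hypothetical; it counts every non-directory entry, so symbolic links to files are included):

```python
import os


def total_files(root):
    """Count non-directory entries under root, including all subdirectories."""
    return sum(len(filenames) for _dirpath, _dirnames, filenames in os.walk(root))
```

Calling total_files(".") returns the count for the current directory. Unlike the recursive script, it does not print a per-directory breakdown.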


4.  Ubuntu DNS Server Problem
Problem Description: I run Ubuntu 9.04 on my computer and use Wicd (Wired and Wireless Network Manager) to configure network settings. However, sometimes when I use a wireless network, Wicd can connect to the router (it is pingable) but fails to resolve domain names. Something is wrong with the DNS server settings.


Tentative Solution: 1) First disable all DNS-related settings inside Wicd, i.e. use neither a static nor a global DNS server; 2) edit /etc/resolv.conf and add available DNS servers; 3) restart the computer; 4) [Optional] sometimes, if Wicd is configured to connect automatically and use a static DNS server, it freezes while setting the static server. In this case, edit /etc/wireless-settings.conf to disable automatic connection and the static DNS server.


5. How to rename files or directories to remove white space from their names?
for f in *\ *; do
     [ -e "$f" ] || continue   # the glob matched nothing
     mv -- "$f" "${f// /-}"    # bash: replace every space with a hyphen
done
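
The same renaming can be sketched in Python, which sidesteps shell word splitting entirely because each filename is handled as a single string. The function name dash_spaces is hypothetical, and the sketch skips a file rather than overwrite an existing target:

```python
import os


def dash_spaces(directory="."):
    """Rename entries in directory so that spaces become hyphens.

    Returns the list of (old_name, new_name) pairs that were renamed.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        if " " not in name:
            continue
        new_name = name.replace(" ", "-")
        target = os.path.join(directory, new_name)
        if os.path.exists(target):
            continue  # never clobber an existing file
        os.rename(os.path.join(directory, name), target)
        renamed.append((name, new_name))
    return renamed
```

Returning the list of renamed pairs makes it easy to log what changed or to undo the operation later.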

6. How to back up files (or directories) with tar and 7-zip?
First we create tarballs with the tar utility and then compress them with the 7z program. If the content is sensitive, you can encrypt it with the built-in encryption option of 7z or with GPG. The code is as follows:
for i in *; do
     tar cfv "$i.tar" "$i" && 7z a "$i.tar.7z" "$i.tar"
     # Uncomment to delete the originals once the archive has been verified:
     # rm -rf "$i" "$i.tar"
done
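
Where 7z is not installed, a similar archive-then-compress step can be sketched with Python's standard tarfile module. Note the swap: this writes an xz-compressed tarball rather than a 7z archive, and it offers no encryption, so sensitive data would still need GPG or 7z. The function name backup is hypothetical.

```python
import os
import tarfile


def backup(path, dest_dir="."):
    """Pack path (a file or directory) into <name>.tar.xz and return the archive path."""
    name = os.path.basename(os.path.normpath(path))
    archive = os.path.join(dest_dir, name + ".tar.xz")
    # "w:xz" creates the tar archive and compresses it with xz in one pass.
    with tarfile.open(archive, "w:xz") as tar:
        tar.add(path, arcname=name)
    return archive
```

Writing the archive into a separate dest_dir avoids packing the archive into itself when backing up the current directory's contents.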

7. How do I print only the part of a line that matches a regex?
Use grep -o PATTERN.

Wednesday, May 07, 2008

Connecting USC VPN Network in Ubuntu

[Update 2013-02-12]
Surprisingly, this old post still receives visitors occasionally. Right now, if you just want to browse the internet and download some papers, you may try the web VPN service: sslvpn1.usc.edu.

[Original Post:]
At USC, when you use computers on campus, you can directly access electronic resources, databases and electronic journals because you are on the USC private network. Now suppose that you go back to your off-campus apartment or travel away from USC: how can you access the electronic resources that USC pays for? That's where VPN comes into play. VPN stands for "virtual private network"; it is a secure method, also called IP tunneling, to access computer resources in a private network. Generally speaking, USC runs a VPN server that listens for your connection and access requests. You run a VPN client on your own computer, which connects to the server and gives you access to USC resources as if you were on the USC private network.

However, ITS only provides official support for VPN clients on Windows (link) and Mac OS (link). Here we give a VPN solution for Linux users (taking Ubuntu 8.04 as an example).

1. Install Network Manager Applet through Add/Remove in the Ubuntu menu. (Most of the time this applet is installed by default; if so, just skip to step 2.)

2. Install the VPN plug-in network-manager-vpnc. Open Synaptic Package Manager, search for package network-manager-vpnc and install;

3. Left-click the network manager applet (usually in the top right corner of your screen) and select VPN Connections->Configure VPN->Add. Type a name in the Connection Name box, USC VPN for example. In the Gateway field, type vpn3k.usc.edu. In the Group Name field, type USC. Click the Optional tab, select Override user name, and type your USC account (the same as your USC email) in the textbox below. Click Apply. Close the window titled VPN Connections.


4. Left-click the network manager applet, select VPN Connections, then click your USC connection (USC VPN) to connect. In the upper password box, type the password associated with your USC account; in the Group password box below it, type GoTrojan. OK, we are done!


This tutorial is based on Ubuntu. I think you can also configure the VPN client in a similar way on Debian, Fedora, openSUSE and other Linux distributions.

References:
1. VPN Client on Ubuntu https://help.ubuntu.com/community/VPNClient
2. Configuring the Cisco VPN 3000 Client (Windows 2000/XP/Vista) http://www.usc.edu/its/vpn/vpn3k47win.html#help

Saturday, May 03, 2008

Fixing Resolution Problem of Ubuntu on Parallels Desktop

Problem

After installing Ubuntu 8.04 Hardy Heron in Parallels Desktop on my MacBook Pro, the default resolution is 1024*768. I want to use my MacBook Pro's full 1440*900 resolution. I tried System->Preferences->Screen Resolution, but 1440*900 is not offered at all.

Solution

Basic idea: the problem arises because Ubuntu fails to detect my monitor's settings automatically. So can I manually modify xorg.conf to set the right resolution? Let's go!

Open up a terminal. First back up the original xorg.conf

sudo cp /etc/X11/xorg.conf  /etc/X11/xorg.conf.backup

Next, open xorg.conf with your favorite editor

sudo vi /etc/X11/xorg.conf

Search for the "Screen" section, like below.

Section "Screen"
Identifier "Default Screen"
Device "Generic Video Card"
Monitor "Generic Monitor"
EndSection

Probably your file contains more lines, similar to the following

SubSection "Display"
Depth 24
Modes "1024x768" "800x600" "640x480"
EndSubSection

Note the line Modes "1024x768" "800x600" "640x480". It lists three available resolutions, but our desired resolution 1440x900 is missing. So we can simply add it. After modification it looks like the following

SubSection "Display"
Depth 24
Modes "1440x900" "1024x768" "800x600" "640x480"
EndSubSection

It’ll appear several times throughout the file. Each time you see it, just add your desired resolution (in this case, 1440×900).
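
Because the same Modes line recurs in every Display subsection, the edit can also be scripted. The following Python sketch works on the text of xorg.conf and prepends a resolution to each Modes line that lacks it. The function name add_mode is hypothetical, and the sketch assumes each Modes entry sits on a single line, as in the example above. Back up the file before writing anything back.

```python
import re

# A Modes line: optional indentation, the keyword, then the quoted mode list.
MODES_LINE = re.compile(r'^([ \t]*Modes[ \t]+)(".*)$', re.MULTILINE)


def add_mode(conf_text, mode="1440x900"):
    """Return conf_text with mode prepended to every Modes line lacking it."""
    quoted = '"%s"' % mode

    def prepend(match):
        prefix, modes = match.group(1), match.group(2)
        if quoted in modes:
            return match.group(0)  # already listed, leave the line untouched
        return "%s%s %s" % (prefix, quoted, modes)

    return MODES_LINE.sub(prepend, conf_text)
```

Putting the new mode first matters because X tries the listed modes in order. Running the function twice leaves the text unchanged, so it is safe to re-run.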

If your file doesn't contain a similar SubSection "Display" inside the Section "Screen" (as shown above), just add the SubSection "Display" yourself. The final result looks like

Section "Screen"
Identifier "Default Screen"
Device "Generic Video Card"
Monitor "Generic Monitor"
DefaultDepth 24
SubSection "Display"
Depth 1
Modes "1440x900" "1024x768" "800x600" "640x480"
EndSubSection
SubSection "Display"
Depth 4
Modes "1440x900" "1024x768" "800x600" "640x480"
EndSubSection
SubSection "Display"
Depth 8
Modes "1440x900" "1024x768" "800x600" "640x480"
EndSubSection
SubSection "Display"
Depth 15
Modes "1440x900" "1024x768" "800x600" "640x480"
EndSubSection
SubSection "Display"
Depth 16
Modes "1440x900" "1024x768" "800x600" "640x480"
EndSubSection
SubSection "Display"
Depth 24
Modes "1440x900" "1024x768" "800x600" "640x480"
EndSubSection
EndSection

Finally, save the modifications. Restart your X session by pressing Ctrl-Alt-Backspace (or reboot Ubuntu), and it just works!

If you end up with a total mess after this modification, don't panic: you still have the backup of the original xorg.conf! Restore it with sudo cp /etc/X11/xorg.conf.backup /etc/X11/xorg.conf and restart X.

Reference
1. http://gonz.wordpress.com/2007/09/22/fixing-screen-resolution-on-ubuntu-linux-in-parallels-desktop/
2. http://www.simplehelp.net/2007/04/30/how-to-increase-the-screen-resolutions-available-to-ubuntu-while-running-in-parallels-for-os-x/