Tuesday, December 10, 2013

A Reference Methylome Database and Analysis Pipeline to Facilitate Integrative and Comparative Epigenomics

Original link: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0081148
 
DNA methylation is implicated in a surprising diversity of regulatory, evolutionary processes and diseases in eukaryotes. The introduction of whole-genome bisulfite sequencing has enabled the study of DNA methylation at a single-base resolution, revealing many new aspects of DNA methylation and highlighting the usefulness of methylome data in understanding a variety of genomic phenomena. As the number of publicly available whole-genome bisulfite sequencing studies reaches into the hundreds, reliable and convenient tools for comparing and analyzing methylomes become increasingly important. We present MethPipe, a pipeline for both low and high-level methylome analysis, and MethBase, an accompanying database of annotated methylomes from the public domain. Together these resources enable researchers to extract interesting features from methylomes and compare them with those identified in public methylomes in our database.

Examples of high-level methylation features available in MethBase through the UCSC Genome Browser track hub.

Wednesday, December 04, 2013

Convert PDF files to high quality PNG figures

To display figures on a website, you often need to convert PDF files to image files in PNG format. However, the conversion sometimes produces low-quality figures, especially if the original PDF files contain text. Below is the procedure I use to convert PDF files to high-quality PNG files. It consists of two steps:

1. Use Preview to convert PDF files to PNG files
Open your PDF file with Preview on Mac OS. Click File->Export and select PNG in the Format field. Below the Format selector there is a Resolution text box, which is the key to preserving quality. Enter a fairly high number, say 300 pixels/inch, and click Save. This produces a high-quality PNG file.

2. Use OptiPNG to reduce PNG file size
The PNG file from the above step is usually quite big, which may make your website slow to load. The OptiPNG (http://optipng.sourceforge.net/) program can be used to reduce the file size. With the default settings, it is able to reduce the PNG file size by about half without perceptible loss in image quality.

You can see a PNG figure produced with the above procedure on my MethPipe website (http://smithlab.usc.edu/methpipe/).
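
As a rough sketch of the two steps in script form: the first helper shows how the Resolution setting translates into pixel dimensions (PDF page sizes are measured in points, 72 per inch), which is why a 300 pixels/inch export gets large. The second helper runs OptiPNG with its default settings on the exported file. The function names are my own, and the script assumes the optipng executable is on your PATH.

```python
import shutil
import subprocess


def pixel_size(width_pt, height_pt, ppi=300):
    """Return the (width, height) in pixels of a page exported at `ppi`.

    PDF page sizes are given in points; one point is 1/72 inch.
    """
    return round(width_pt / 72 * ppi), round(height_pt / 72 * ppi)


def optimize_png(path):
    """Run optipng with default settings on `path`, rewriting it in place.

    Returns False if optipng is not installed (assumption: it is looked
    up on PATH), True once optipng has finished successfully.
    """
    exe = shutil.which("optipng")
    if exe is None:
        return False
    subprocess.run([exe, path], check=True)
    return True
```

For a US Letter page (612 x 792 points), pixel_size(612, 792) gives 2550 x 3300 pixels at 300 pixels/inch.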


Tuesday, August 20, 2013

Add trunk/tags/branches directories to an existing SVN repository

In a standard SVN repository, the top-level directory is the project directory, which contains three subdirectories: trunk, tags and branches (SVN Best Practices). Most of the time, you actively work on and update the trunk directory. When you release a new version, you take a snapshot of the trunk directory by copying it to tags. The branches directory is where you try out new ideas.

Occasionally, you may have an SVN repository that does not follow the recommended layout, probably because it did not seem worth the effort when you first started that toy-like project. However, as development continues, the repository may accumulate lots of commits, and you find that the trunk/tags/branches layout would be much more convenient (for example, link). Here I give a step-by-step tutorial.

First, we need to dump the old repository with svnadmin, and then create a new clean repository.
svnadmin dump /srv/svn/repos/test > test-repo.dump
mv /srv/svn/repos/test /srv/svn/repos/test-backup
svnadmin create /srv/svn/repos/test
Next, check out the clean repository, and add trunk, tags, and branches directories.
svn checkout PATH-TO-TEST-REPO
cd test
svn mkdir trunk tags branches
svn ci trunk tags branches -m "add trunk tags branches structure"
Finally, load the previous repository dump into the trunk subdirectory. Note that the --parent-dir option is essential; without it, the old contents are loaded at the repository root instead of under trunk.
svnadmin load /srv/svn/repos/test --parent-dir trunk < test-repo.dump
Done!
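
If you need to repeat this for several repositories, the steps above can be collected in a small helper. The following is only a dry-run sketch: it assembles the same commands as strings and runs nothing. The function name, the checkout URL argument (for example a file:// URL) and the extra "cd .." step, which I added so that the relative dump path still resolves, are my own assumptions.

```python
import shlex


def relayout_commands(repo_path, checkout_url, dump_file="test-repo.dump"):
    """Return the shell commands from the tutorial, in order (dry run only)."""
    repo_path = repo_path.rstrip("/")
    repo = shlex.quote(repo_path)
    backup = shlex.quote(repo_path + "-backup")
    dump = shlex.quote(dump_file)
    # svn checkout creates a working copy named after the last path component
    workdir = shlex.quote(repo_path.split("/")[-1])
    return [
        f"svnadmin dump {repo} > {dump}",
        f"mv {repo} {backup}",
        f"svnadmin create {repo}",
        f"svn checkout {shlex.quote(checkout_url)}",
        f"cd {workdir}",
        "svn mkdir trunk tags branches",
        'svn ci trunk tags branches -m "add trunk tags branches structure"',
        "cd ..",  # assumption: return to where the dump file was written
        f"svnadmin load {repo} --parent-dir trunk < {dump}",
    ]


if __name__ == "__main__":
    for cmd in relayout_commands("/srv/svn/repos/test", "file:///srv/svn/repos/test"):
        print(cmd)
```

Printing the commands first lets you review them before pasting them into a shell.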

Friday, August 16, 2013

A Simple Python ConfigParser Class for Parsing Configuration Files

The default ConfigParser in Python is flexible and sophisticated, but it is surprisingly annoying when working with simple configuration files. It requires every option to belong to a section (link); if there is no section, it aborts with an error. Additionally, it automatically converts keys to lower case, so it is case-insensitive regarding keys (link).

To deal with these annoyances, I implemented an alternative ConfigParser (https://github.com/songqiang/configparser). It aims to work with simple configuration files that contain a key and its value on each line. The delimiter between a key and its value can be an equals sign (=), a colon (:), spaces, or tabs. Section names are optional. It implements the same interface as the default ConfigParser, excluding the functionality for writing and sophisticated customization. To use my ConfigParser, just download the ConfigParser.py file and put it in the same directory as the calling Python script. Since Python first searches the directory containing the script when importing a module, my ConfigParser will override the default one.
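
To make the file format concrete, here is a minimal sketch of a parser for such files. It is not the code from the repository above, just an illustration of the rules described: one key and value per line, "=", ":" or whitespace as the delimiter, optional sections, and case-sensitive keys. The function name, the "DEFAULT" name for the section-less part, and the treatment of "#" and ";" lines as comments are my own assumptions.

```python
import re

# A key, then "=", ":" or whitespace as the delimiter, then the value.
_LINE = re.compile(r"^([^\s=:]+)\s*[=:\s]\s*(.*)$")


def parse_simple_config(text, default_section="DEFAULT"):
    """Parse simple key/value text into {section: {key: value}}.

    Keys keep their original case, and options that appear before any
    [section] header are stored under `default_section`.
    """
    config = {default_section: {}}
    section = default_section
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            config.setdefault(section, {})
            continue
        match = _LINE.match(line)
        if match:
            key, value = match.groups()
            config[section][key] = value.strip()
        else:
            config[section][line] = ""  # a key with no value
    return config
```

For example, the three lines "Name = Alice", "path: /tmp/x" and "Threads 4" all parse without any section header, and the key "Name" is not lowercased.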


Monday, February 04, 2013

FASTQ Quality Score Conversion Table

In FASTQ format, the fourth line of each record encodes the quality scores of the bases in the second line. This scheme was initially used by the Phred base-calling program, which uses ASCII characters to encode the probability that the corresponding base call is wrong in traditional Sanger sequencing. The same format is also used by Illumina/Solexa sequencing; however, the mapping from probability values to characters differs slightly from the Phred score and also varies between versions of the Solexa pipeline. The exact formulas are given elsewhere. The following lists the conversion table for each platform and/or version.


Range

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
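
For reference, the values in the tables below are consistent with the standard definitions: a Phred score Q corresponds to an error probability p = 10^(-Q/10), while a Solexa score (prior to 1.3) corresponds to p = 1 / (1 + 10^(Q/10)). The sketch below decodes a quality string using the letters from the legend above; the function and dictionary names are my own.

```python
def phred_error_prob(q):
    """Phred score -> error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def solexa_error_prob(q):
    """Solexa score (prior to 1.3) -> error probability: p = 1 / (1 + 10^(Q/10))."""
    return 1 / (1 + 10 ** (q / 10))


# ASCII offset and probability formula for each letter in the legend.
ENCODINGS = {
    "S": (33, phred_error_prob),   # Sanger
    "X": (64, solexa_error_prob),  # Solexa
    "I": (64, phred_error_prob),   # Illumina 1.3+
    "J": (64, phred_error_prob),   # Illumina 1.5+
    "L": (33, phred_error_prob),   # Illumina 1.8+
}


def decode_quality(quality_string, encoding):
    """Return a list of (score, error probability) pairs, one per character."""
    offset, to_prob = ENCODINGS[encoding]
    return [(ord(c) - offset, to_prob(ord(c) - offset)) for c in quality_string]
```

The same character means different things under different encodings: "I" is score 40 in Sanger format but score 9 in Illumina 1.3+ format.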


Sanger sequencing score

   |------+-------+-------+--------------|
   | char | value | Phred |  Error-Prob. |
   |------+-------+-------+--------------|
   | !    |    33 |     0 | 1.0000000000 |
   | "    |    34 |     1 | 0.7943282347 |
   | #    |    35 |     2 | 0.6309573445 |
   | $    |    36 |     3 | 0.5011872336 |
   | %    |    37 |     4 | 0.3981071706 |
   | &    |    38 |     5 | 0.3162277660 |
   | '    |    39 |     6 | 0.2511886432 |
   | (    |    40 |     7 | 0.1995262315 |
   | )    |    41 |     8 | 0.1584893192 |
   | *    |    42 |     9 | 0.1258925412 |
   | +    |    43 |    10 | 0.1000000000 |
   | ,    |    44 |    11 | 0.0794328235 |
   | -    |    45 |    12 | 0.0630957344 |
   | .    |    46 |    13 | 0.0501187234 |
   | /    |    47 |    14 | 0.0398107171 |
   | 0    |    48 |    15 | 0.0316227766 |
   | 1    |    49 |    16 | 0.0251188643 |
   | 2    |    50 |    17 | 0.0199526231 |
   | 3    |    51 |    18 | 0.0158489319 |
   | 4    |    52 |    19 | 0.0125892541 |
   | 5    |    53 |    20 | 0.0100000000 |
   | 6    |    54 |    21 | 0.0079432823 |
   | 7    |    55 |    22 | 0.0063095734 |
   | 8    |    56 |    23 | 0.0050118723 |
   | 9    |    57 |    24 | 0.0039810717 |
   | :    |    58 |    25 | 0.0031622777 |
   | ;    |    59 |    26 | 0.0025118864 |
   | <    |    60 |    27 | 0.0019952623 |
   | =    |    61 |    28 | 0.0015848932 |
   | >    |    62 |    29 | 0.0012589254 |
   | ?    |    63 |    30 | 0.0010000000 |
   | @    |    64 |    31 | 0.0007943282 |
   | A    |    65 |    32 | 0.0006309573 |
   | B    |    66 |    33 | 0.0005011872 |
   | C    |    67 |    34 | 0.0003981072 |
   | D    |    68 |    35 | 0.0003162278 |
   | E    |    69 |    36 | 0.0002511886 |
   | F    |    70 |    37 | 0.0001995262 |
   | G    |    71 |    38 | 0.0001584893 |
   | H    |    72 |    39 | 0.0001258925 |
   | I    |    73 |    40 | 0.0001000000 |
   |------+-------+-------+--------------|


Solexa score (prior to 1.3)

   |------+-------+-------+--------------|
   | char | value | Score |  Error-Prob. |
   |------+-------+-------+--------------|
   | ;    |    59 |    -5 | 0.7597469266 |
   | <    |    60 |    -4 | 0.7152527510 |
   | =    |    61 |    -3 | 0.6661394246 |
   | >    |    62 |    -2 | 0.6131368202 |
   | ?    |    63 |    -1 | 0.5573116338 |
   | @    |    64 |     0 | 0.5000000000 |
   | A    |    65 |     1 | 0.4426883662 |
   | B    |    66 |     2 | 0.3868631798 |
   | C    |    67 |     3 | 0.3338605754 |
   | D    |    68 |     4 | 0.2847472490 |
   | E    |    69 |     5 | 0.2402530734 |
   | F    |    70 |     6 | 0.2007600089 |
   | G    |    71 |     7 | 0.1663375308 |
   | H    |    72 |     8 | 0.1368068886 |
   | I    |    73 |     9 | 0.1118157698 |
   | J    |    74 |    10 | 0.0909090909 |
   | K    |    75 |    11 | 0.0735875561 |
   | L    |    76 |    12 | 0.0593509431 |
   | M    |    77 |    13 | 0.0477267210 |
   | N    |    78 |    14 | 0.0382865039 |
   | O    |    79 |    15 | 0.0306534300 |
   | P    |    80 |    16 | 0.0245033676 |
   | Q    |    81 |    17 | 0.0195623039 |
   | R    |    82 |    18 | 0.0156016622 |
   | S    |    83 |    19 | 0.0124327353 |
   | T    |    84 |    20 | 0.0099009901 |
   | U    |    85 |    21 | 0.0078806839 |
   | V    |    86 |    22 | 0.0062700123 |
   | W    |    87 |    23 | 0.0049868787 |
   | X    |    88 |    24 | 0.0039652856 |
   | Y    |    89 |    25 | 0.0031523092 |
   | Z    |    90 |    26 | 0.0025055927 |
   | [    |    91 |    27 | 0.0019912892 |
   | \    |    92 |    28 | 0.0015823853 |
   | ]    |    93 |    29 | 0.0012573425 |
   | ^    |    94 |    30 | 0.0009990010 |
   | _    |    95 |    31 | 0.0007936978 |
   | `    |    96 |    32 | 0.0006305595 |
   | a    |    97 |    33 | 0.0005009362 |
   | b    |    98 |    34 | 0.0003979487 |
   | c    |    99 |    35 | 0.0003161278 |
   | d    |   100 |    36 | 0.0002511256 |
   | e    |   101 |    37 | 0.0001994864 |
   | f    |   102 |    38 | 0.0001584642 |
   | g    |   103 |    39 | 0.0001258767 |
   | h    |   104 |    40 | 0.0000999900 |
   |------+-------+-------+--------------|


Illumina 1.3+ score

   |------+-------+-------+--------------|
   | char | value | Phred |  Error-Prob. |
   |------+-------+-------+--------------|
   | @    |    64 |     0 | 1.0000000000 |
   | A    |    65 |     1 | 0.7943282347 |
   | B    |    66 |     2 | 0.6309573445 |
   | C    |    67 |     3 | 0.5011872336 |
   | D    |    68 |     4 | 0.3981071706 |
   | E    |    69 |     5 | 0.3162277660 |
   | F    |    70 |     6 | 0.2511886432 |
   | G    |    71 |     7 | 0.1995262315 |
   | H    |    72 |     8 | 0.1584893192 |
   | I    |    73 |     9 | 0.1258925412 |
   | J    |    74 |    10 | 0.1000000000 |
   | K    |    75 |    11 | 0.0794328235 |
   | L    |    76 |    12 | 0.0630957344 |
   | M    |    77 |    13 | 0.0501187234 |
   | N    |    78 |    14 | 0.0398107171 |
   | O    |    79 |    15 | 0.0316227766 |
   | P    |    80 |    16 | 0.0251188643 |
   | Q    |    81 |    17 | 0.0199526231 |
   | R    |    82 |    18 | 0.0158489319 |
   | S    |    83 |    19 | 0.0125892541 |
   | T    |    84 |    20 | 0.0100000000 |
   | U    |    85 |    21 | 0.0079432823 |
   | V    |    86 |    22 | 0.0063095734 |
   | W    |    87 |    23 | 0.0050118723 |
   | X    |    88 |    24 | 0.0039810717 |
   | Y    |    89 |    25 | 0.0031622777 |
   | Z    |    90 |    26 | 0.0025118864 |
   | [    |    91 |    27 | 0.0019952623 |
   | \    |    92 |    28 | 0.0015848932 |
   | ]    |    93 |    29 | 0.0012589254 |
   | ^    |    94 |    30 | 0.0010000000 |
   | _    |    95 |    31 | 0.0007943282 |
   | `    |    96 |    32 | 0.0006309573 |
   | a    |    97 |    33 | 0.0005011872 |
   | b    |    98 |    34 | 0.0003981072 |
   | c    |    99 |    35 | 0.0003162278 |
   | d    |   100 |    36 | 0.0002511886 |
   | e    |   101 |    37 | 0.0001995262 |
   | f    |   102 |    38 | 0.0001584893 |
   | g    |   103 |    39 | 0.0001258925 |
   | h    |   104 |    40 | 0.0001000000 |
   |------+-------+-------+--------------|


Illumina 1.5+ score

   |------+-------+-------+--------------|
   | char | value | Phred |  Error-Prob. |
   |------+-------+-------+--------------|
   | C    |    67 |     3 | 0.5011872336 |
   | D    |    68 |     4 | 0.3981071706 |
   | E    |    69 |     5 | 0.3162277660 |
   | F    |    70 |     6 | 0.2511886432 |
   | G    |    71 |     7 | 0.1995262315 |
   | H    |    72 |     8 | 0.1584893192 |
   | I    |    73 |     9 | 0.1258925412 |
   | J    |    74 |    10 | 0.1000000000 |
   | K    |    75 |    11 | 0.0794328235 |
   | L    |    76 |    12 | 0.0630957344 |
   | M    |    77 |    13 | 0.0501187234 |
   | N    |    78 |    14 | 0.0398107171 |
   | O    |    79 |    15 | 0.0316227766 |
   | P    |    80 |    16 | 0.0251188643 |
   | Q    |    81 |    17 | 0.0199526231 |
   | R    |    82 |    18 | 0.0158489319 |
   | S    |    83 |    19 | 0.0125892541 |
   | T    |    84 |    20 | 0.0100000000 |
   | U    |    85 |    21 | 0.0079432823 |
   | V    |    86 |    22 | 0.0063095734 |
   | W    |    87 |    23 | 0.0050118723 |
   | X    |    88 |    24 | 0.0039810717 |
   | Y    |    89 |    25 | 0.0031622777 |
   | Z    |    90 |    26 | 0.0025118864 |
   | [    |    91 |    27 | 0.0019952623 |
   | \    |    92 |    28 | 0.0015848932 |
   | ]    |    93 |    29 | 0.0012589254 |
   | ^    |    94 |    30 | 0.0010000000 |
   | _    |    95 |    31 | 0.0007943282 |
   | `    |    96 |    32 | 0.0006309573 |
   | a    |    97 |    33 | 0.0005011872 |
   | b    |    98 |    34 | 0.0003981072 |
   | c    |    99 |    35 | 0.0003162278 |
   | d    |   100 |    36 | 0.0002511886 |
   | e    |   101 |    37 | 0.0001995262 |
   | f    |   102 |    38 | 0.0001584893 |
   | g    |   103 |    39 | 0.0001258925 |
   | h    |   104 |    40 | 0.0001000000 |
   |------+-------+-------+--------------|


Illumina 1.8+ score

   |------+-------+-------+--------------|
   | char | value | Phred |  Error-Prob. |
   |------+-------+-------+--------------|
   | !    |    33 |     0 | 1.000000e+00 |
   | "    |    34 |     1 | 7.943282e-01 |
   | #    |    35 |     2 | 6.309573e-01 |
   | $    |    36 |     3 | 5.011872e-01 |
   | %    |    37 |     4 | 3.981072e-01 |
   | &    |    38 |     5 | 3.162278e-01 |
   | '    |    39 |     6 | 2.511886e-01 |
   | (    |    40 |     7 | 1.995262e-01 |
   | )    |    41 |     8 | 1.584893e-01 |
   | *    |    42 |     9 | 1.258925e-01 |
   | +    |    43 |    10 | 1.000000e-01 |
   | ,    |    44 |    11 | 7.943282e-02 |
   | -    |    45 |    12 | 6.309573e-02 |
   | .    |    46 |    13 | 5.011872e-02 |
   | /    |    47 |    14 | 3.981072e-02 |
   | 0    |    48 |    15 | 3.162278e-02 |
   | 1    |    49 |    16 | 2.511886e-02 |
   | 2    |    50 |    17 | 1.995262e-02 |
   | 3    |    51 |    18 | 1.584893e-02 |
   | 4    |    52 |    19 | 1.258925e-02 |
   | 5    |    53 |    20 | 1.000000e-02 |
   | 6    |    54 |    21 | 7.943282e-03 |
   | 7    |    55 |    22 | 6.309573e-03 |
   | 8    |    56 |    23 | 5.011872e-03 |
   | 9    |    57 |    24 | 3.981072e-03 |
   | :    |    58 |    25 | 3.162278e-03 |
   | ;    |    59 |    26 | 2.511886e-03 |
   | <    |    60 |    27 | 1.995262e-03 |
   | =    |    61 |    28 | 1.584893e-03 |
   | >    |    62 |    29 | 1.258925e-03 |
   | ?    |    63 |    30 | 1.000000e-03 |
   | @    |    64 |    31 | 7.943282e-04 |
   | A    |    65 |    32 | 6.309573e-04 |
   | B    |    66 |    33 | 5.011872e-04 |
   | C    |    67 |    34 | 3.981072e-04 |
   | D    |    68 |    35 | 3.162278e-04 |
   | E    |    69 |    36 | 2.511886e-04 |
   | F    |    70 |    37 | 1.995262e-04 |
   | G    |    71 |    38 | 1.584893e-04 |
   | H    |    72 |    39 | 1.258925e-04 |
   | I    |    73 |    40 | 1.000000e-04 |
   | J    |    74 |    41 | 7.943282e-05 |
   |------+-------+-------+--------------|
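
Since the character ranges of the encodings overlap only partly, the tables above suggest a simple heuristic for guessing which offset a FASTQ file uses from the lowest and highest quality characters it contains. The sketch below is my own illustration of that idea, not a definitive detector: it cannot tell Sanger from Illumina 1.8+, nor Solexa reads without negative scores from Illumina 1.3+, and it reports "ambiguous" when the observed characters fit both offsets.

```python
def guess_offset(quality_strings):
    """Guess the quality encoding from the range of characters observed."""
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' (59) occur only with an offset of 33.
        return "Phred+33"
    if highest > 74:
        # Characters above 'J' (74) occur only with an offset of 64;
        # characters below '@' (64) then imply negative Solexa scores.
        return "Solexa+64" if lowest < 64 else "Phred+64"
    return "ambiguous"
```

In practice you would feed it the quality lines of the first few thousand reads rather than a single read.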

Tuesday, January 22, 2013

A Guess on the Encryption Design of MEGA

The newly relaunched MEGA, successor to MegaUpload, has generated a lot of fanfare on the net. A novel feature of the new MEGA site is its encryption. There are two interesting articles about the encryption technique used by the new site. One, from Ars Technica, questions the security and usefulness of MEGA's encryption design (http://arstechnica.com/business/2013/01/megabad-a-quick-look-at-the-state-of-megas-encryption/). The other, posted on the MEGA blog, addresses those concerns (https://mega.co.nz/#blog_3).

In my opinion, the editor of Ars Technica does not understand, or at least misunderstands, MEGA's encryption design. Some comments on the Ars article explain the basic idea quite clearly, and MEGA's reply confirms their explanation.

If my guess is right, the encryption design of MEGA is as illustrated in the figure below. A PDF version of the figure is at https://www.box.com/s/uswje6orhhqahyv97ijk