Friday, August 16, 2013

The default ConfigParser in Python is flexible and sophisticated, but it is surprisingly annoying to use with simple configuration files. It requires every option to belong to a section; if a file has no section header, parsing aborts with an error. Additionally, it automatically converts keys to lower case, so it is case-insensitive with respect to keys.
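As a quick illustration of both behaviors, here is a minimal sketch (using the Python 3 module name configparser; the Python 2 ConfigParser module behaves the same way):

import configparser

parser = configparser.ConfigParser()

# 1. A file with no [section] header is rejected outright.
try:
    parser.read_string("key = value\n")
except configparser.MissingSectionHeaderError as err:
    print("rejected:", err)

# 2. Keys are lower-cased by default (optionxform is str.lower).
parser.read_string("[main]\nMyKey = 1\n")
print(parser.options("main"))  # ['mykey'], not ['MyKey']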
To deal with these annoyances, I implemented an alternative ConfigParser (https://github.com/songqiang/configparser). It is aimed at simple configuration files that contain one key and its value per line. The delimiter between a key and its value can be an equals sign (=), a colon (:), or whitespace (spaces or tabs), and section names are optional. It implements the same set of interfaces as the default ConfigParser, excluding the writing functionality and the more sophisticated customization. To use my ConfigParser, just download the ConfigParser.py file and put it in the same directory as the calling Python script. Since Python searches the script's directory before the standard library when importing a module, my ConfigParser will override the default one.
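For example, a file like the following (mixing the supported delimiters, with no section header) is accepted; the file name and values here are made up for illustration:

host example.org
port: 8080
user = alice

It can then be loaded through the usual read-side interface, e.g. (a sketch only; see the repository for the exact method set):

import ConfigParser   # picks up the local ConfigParser.py rather than the standard module
parser = ConfigParser.ConfigParser()
parser.read("settings.conf")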
Wednesday, September 28, 2011
Multithreaded downloading with wget
GNU wget is versatile and robust, but lacks support for multithreaded downloading. When downloading multiple files, it just goes one by one, which is quite inefficient if the bandwidth of each connection is limited.
There is a way to achieve nearly the same effect as multithreaded downloading, and here is how you do it: run

wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &

as many times as you deem appropriate, so that you have as many processes downloading in parallel. The key is the -N option, which tells wget to download a file only when its local timestamp is older than the one on the server side, so each process skips files that another process has already fetched.
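If you do not want to launch the processes by hand, the same trick can be scripted; here is a small sketch using Python's subprocess module (the URL and process count are placeholders):

import subprocess

URL = "http://example.org/data/"   # placeholder URL
NUM_PROCS = 3                      # how many wget processes to run at once

# Start several recursive, timestamp-checking wget processes; -N makes each
# one skip files that another process has already brought up to date.
procs = [subprocess.Popen(["wget", "-r", "-np", "-N", URL])
         for _ in range(NUM_PROCS)]

# Wait for all of them to finish.
for p in procs:
    p.wait()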
Alternatively, I wrote a wrapper, pwget (short for parallel wget), that adds multithreading to wget. The program is available from https://github.com/songqiang/pwget. It has two options, --max-num-threads and --sleep. The first, --max-num-threads, gives the maximum number of connections you allow to be established at once; this number is usually limited by settings on the server side, and by default it is 3. The second, --sleep, specifies how often (in seconds) the master thread checks the status of the downloading threads; when it wakes up, it removes finished threads and adds new downloading threads if necessary (a rough sketch of this loop is shown after the example below). Suppose you have the list of URLs in the file url-list.txt; then run
./pwget.py --max-num-threads 5 --sleep 2 -i url-list.txt

wget will begin downloading the list of URLs in url-list.txt with at most 5 connections at once. You can also specify options for wget on the command line, and they will be passed to the working threads.
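To make the behavior concrete, here is a rough sketch of the master loop described above, not the actual pwget code; it uses subprocesses rather than the threads pwget manages, and the option names and defaults simply follow the description:

import subprocess
import time

def pwget_sketch(urls, max_num_threads=3, sleep=2, wget_args=()):
    # Keep at most max_num_threads wget processes alive, polling every
    # `sleep` seconds and starting new downloads as old ones finish.
    pending = list(urls)
    running = []
    while pending or running:
        # Drop processes that have finished.
        running = [p for p in running if p.poll() is None]
        # Fill the free slots with new downloads.
        while pending and len(running) < max_num_threads:
            url = pending.pop(0)
            running.append(subprocess.Popen(["wget", "-N"] + list(wget_args) + [url]))
        time.sleep(sleep)

# e.g. pwget_sketch(open("url-list.txt").read().split(), max_num_threads=5)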
This tool has several limitations. pwget parallelizes at the level of individual URLs, so you need to list all the URLs in advance. Furthermore, if you have a single large file, pwget does not help; in that case, you may consider using aria2 (http://aria2.sourceforge.net/), which can split a single download across multiple connections.