Wednesday, September 28, 2011

Multithreaded downloading with wget


GNU wget is versatile and robust, but it lacks support for multithreaded downloading. When given multiple files, it fetches them one by one, which is quite inefficient if the bandwidth of a single connection is limited.

There is a way to achieve nearly the same effect as multithreaded downloading (link), and here is how you do it:
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
Repeat the command as many times as you deem appropriate; each copy runs as a separate download process. The key is the -N option, which tells wget to download a file only if its local timestamp is older than the one on the server side, so each process mostly skips files that another copy has already finished.
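If you would rather script this than paste the command several times, a minimal Python sketch along the following lines launches the same kind of parallel wget processes (the URL and the process count here are placeholders, not taken from the post):

# Rough sketch: start several "wget -r -np -N" processes in parallel
# and wait for all of them to finish.
import subprocess

URL = "http://example.org/some/mirror/"   # placeholder URL
NUM_PROCS = 4                             # how many parallel wget processes

procs = [subprocess.Popen(["wget", "-r", "-np", "-N", URL])
         for _ in range(NUM_PROCS)]
for p in procs:
    p.wait()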

Alternatively, I wrote a wrapper, pwget (short for parallel wget), that adds multithreading to wget. The program is available from https://github.com/songqiang/pwget. It has two options, --max-num-threads and --sleep. The first, --max-num-threads, sets the maximum number of connections you are allowed to establish; this number is usually determined by the server-side settings, and by default it is 3. The second, --sleep, specifies how often (in seconds) the master thread checks the status of the downloading threads. When the master thread wakes up, it removes finished threads and adds new downloading threads if necessary. Suppose you have the list of URLs in the file url-list.txt; then run
./pwget.py --max-num-threads 5 --sleep 2 -i url-list.txt
pwget will then download the URLs in url-list.txt with at most 5 connections at once. You can also pass wget options on the command line; they are forwarded to the worker threads.
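As a rough illustration of the polling scheme described above, the master loop can be written along these lines. This is a simplified sketch, not the actual pwget source; it uses subprocesses rather than threads, and the two constants stand in for the --max-num-threads and --sleep options:

# Simplified sketch of the idea behind pwget (not its actual code):
# keep at most MAX_NUM_THREADS wget processes alive, check every
# SLEEP seconds which ones have finished, and start new ones.
import subprocess
import time

MAX_NUM_THREADS = 5   # stands in for --max-num-threads
SLEEP = 2             # stands in for --sleep (seconds)

with open("url-list.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

running = []
while urls or running:
    # Drop processes that have finished.
    running = [p for p in running if p.poll() is None]
    # Fill free slots with new downloads while URLs remain.
    while urls and len(running) < MAX_NUM_THREADS:
        running.append(subprocess.Popen(["wget", "-N", urls.pop(0)]))
    time.sleep(SLEEP)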

This tool has several limitations. pwget parallelizes at the level of whole URLs, so you need to list all the URLs in advance. Furthermore, it does not help when you are downloading a single large file; in that case, you may consider aria2 (http://aria2.sourceforge.net/), which can split a single download across multiple connections.
