Wednesday, September 28, 2011

Multithreaded downloading with wget


GNU wget is versatile and robust, but it lacks support for multithreaded downloading. When given multiple files, it fetches them one at a time, which is inefficient if the bandwidth of each connection is limited.

There is a way to achieve nearly the same effect as multithreaded downloading (link). Run the same command several times in the background:
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
Repeat the command as many times as you want downloading processes. The key is the -N option (time-stamping), which tells wget to skip a file unless the local copy is missing or older than the one on the server.

Alternatively, I wrote a wrapper, pwget (short for parallel wget), that adds multithreading to wget. The program is available from https://github.com/songqiang/pwget. It has two options, --max-num-threads and --sleep. The first, --max-num-threads, sets the maximum number of simultaneous connections. The right value usually depends on what the server allows; the default is 3. The second, --sleep, specifies how often (in seconds) the master thread checks the status of the downloading threads. When the master thread wakes up, it removes finished threads and adds new downloading threads if necessary.

Suppose you have the list of URLs in the file url-list.txt. Then run
./pwget.py --max-num-threads 5 --sleep 2 -i url-list.txt
pwget will then download the URLs in url-list.txt using at most 5 connections at once. You can also put wget options on the command line; they are passed on to the worker threads.
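
The master-thread loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual pwget source: the function names parallel_fetch and download, the command parameter, and the use of wget's -q option are my own assumptions.

```python
import subprocess
import threading
import time


def download(url, command, results):
    """Run one downloader process for a single URL and record its exit code."""
    results[url] = subprocess.call(list(command) + [url])


def parallel_fetch(urls, max_num_threads=3, sleep=1.0, command=("wget", "-q")):
    """Download urls with at most max_num_threads downloads running at once.

    Hypothetical sketch: `command` is the downloader to run for each URL.
    """
    pending = list(urls)
    running = []
    results = {}
    while pending or running:
        # Remove the threads that have finished since the last check.
        running = [t for t in running if t.is_alive()]
        # Start new downloads while there is spare capacity.
        while pending and len(running) < max_num_threads:
            url = pending.pop(0)
            t = threading.Thread(target=download, args=(url, command, results))
            t.start()
            running.append(t)
        # The master thread sleeps before checking on the workers again.
        if pending or running:
            time.sleep(sleep)
    return results


# Example call (assumes wget is installed and `urls` is a list of strings):
# parallel_fetch(urls, max_num_threads=5, sleep=2)
```

The returned dictionary maps each URL to the exit code of its downloader, so failed downloads can be spotted and retried.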

This tool has several limitations. pwget parallelizes at the level of whole URLs, so you need to list all the URLs in advance. Furthermore, if you have a single large file, pwget does not help. In that case, you may consider using aria2 (http://aria2.sourceforge.net/).

Tuesday, September 20, 2011

Running SSH on a non-standard port

The default port for SSH connections is 22. However, some servers listen on a different port, for example 22222, for security reasons. Here are some common commands for working with a non-standard SSH port.

Suppose you have an SSH server ssh.example.edu listening on port 22222.

To copy your ssh public key to the server, run:
ssh-copy-id '-p 22222 jon@ssh.example.edu'
Note that the single quotes are necessary: they make the port option and the host name reach ssh-copy-id as a single argument.

To log in to the SSH server, run:
ssh -p 22222 jon@ssh.example.edu

To copy files between your local machine and the server with scp, run:
scp -P 22222 local-files jon@ssh.example.edu:~
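Note that scp takes the port with an uppercase -P, whereas ssh uses a lowercase -p (in scp, lowercase -p preserves modification times and modes instead).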
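
If you call these commands from a script, the difference between -p and -P is easy to get wrong. Here is a minimal Python sketch that builds the argument lists for the two commands above. The helper names ssh_command and scp_upload_command are made up for illustration; the lists can be handed to subprocess.run.

```python
def ssh_command(user, host, port):
    """Build the argument list for logging in; ssh uses a lowercase -p."""
    return ["ssh", "-p", str(port), f"{user}@{host}"]


def scp_upload_command(user, host, port, local_files, remote_dir="~"):
    """Build the argument list for uploading files; scp uses an uppercase -P."""
    return ["scp", "-P", str(port), *local_files, f"{user}@{host}:{remote_dir}"]


# Example call (assumes ssh is installed and the server is reachable):
# import subprocess
# subprocess.run(ssh_command("jon", "ssh.example.edu", 22222))
```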

References:
1. http://www.itworld.com/nls_unixssh0500506
2. http://mikegerwitz.com/2009/10/07/ssh-copy-id-and-sshd-port/