How can I download these gazillion links?

Posted by admin on October 06, 2010  /   Posted in Data Best Practices, Linux

Don’t you hate it when you find yourself needing to download hundreds of files, by clicking on each HTML link, one-at-a-time?

For example, some publishers decide that it is cute to chop their big document/manual/manuscript/research paper/etc. into per-chapter PDF files.  Not so cute for those of us who want to save them all, is it?

Or your ace graphic artist in the Philippines refused to learn Zip and has instead given you access to the directory containing hundreds of images that you need for tomorrow’s demo to the client.

I bet your first thought is to view the source of the web page and copy out all of the links, right? That would work, but it still means hunting and pecking for links amid the hairy HTML/CSS/JS code in the page.

Sure, it’s easy if you are a UNIX command line hacker who can spit out the exact grep/awk/perl script to automate this in a heartbeat.  But in this journal we want an easier approach that most of us sophisticated (read: lazy) geeks would want to use.
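For the record, that command-line-hacker route might look something like the sketch below: pull the href values out of a saved copy of the page (page.html is a hypothetical filename here, and the pattern only catches double-quoted links).

```shell
# Extract every href="..." value from a saved page and write one URL
# per line.  Crude, but good enough for a page full of plain links.
grep -oE 'href="[^"]+"' page.html \
  | sed -e 's/^href="//' -e 's/"$//' > links_list.txt
```

It breaks on single-quoted or relative links, which is exactly why the add-on below is the lazier, friendlier option.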

Well, here is *one* easy way to solve this problem:

Use Firefox (version 3.6 at the time of this writing) and install a nifty add-on called Link Gopher.  After installation and a restart, a small label called Links will appear at the bottom right corner of the browser (next to the window resizer).

Go to the web page that lists all the links.

Right-click that Links label, and select Extract All Links. Voilà! All of the links now appear on a neat, clean web page.  Simply copy and paste all the links that you want to download into a text file, and run this command on a terminal:

cat links_list.txt | xargs -n 1 -P 4 wget

This command pipes the content of links_list.txt (that is the name of the file you saved the links into, by the way) into xargs. The -n 1 flag hands each ‘wget’ invocation exactly one of the links, and -P 4 keeps up to four of those wget processes running in parallel (NOTE: that -P parameter is pretty slick).
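If you want to see what xargs is about to do before unleashing it, a handy dry-run trick is to substitute echo for wget, so each command line is printed instead of executed:

```shell
# Dry run: print the wget command xargs would run for each URL.
# -n 1 gives each process one URL; -P 4 allows four at a time.
cat links_list.txt | xargs -n 1 -P 4 echo wget
```

With -P 4 the output lines may come out in any order, but you will get exactly one per URL.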

That’s it! Now all you need to do is wait.  The output of the wget processes downloading each link is also fun to watch.

When all the processes have finished, you’ll be left with all the files you need, plus a text file listing their URLs.  Now that’s pretty easy, wouldn’t you say?
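One last sanity check worth sketching: if a few downloads died mid-flight, you can compare the list against what actually landed on disk and re-queue only the missing ones (this assumes each URL ends in a unique filename, which is usually the case for per-chapter PDFs):

```shell
# Write any URL whose file is not present in the current directory
# to missing.txt, ready for another xargs/wget pass.
while read -r url; do
  [ -e "$(basename "$url")" ] || echo "$url"
done < links_list.txt > missing.txt
```

Then simply run the same xargs command against missing.txt instead.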
