Scrape MP3 Files
14 February 2007
Here's how I do it:
wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off -w5 -i ~/mp3blogs.txt
And here's what this all means:
-r -H -l1 -np: These options tell wget to download recursively. That means it goes to a URL, downloads the page there, then follows every link it finds. The -H tells the app to span domains, meaning it should follow links that point away from the blog. And the -l1 (a lowercase L with a numeral one) means to go only one level deep; that is, don't follow links on the linked site. In other words, these options work together to ensure that you don't send wget off to download the entire Web, or at least as much as will fit on your hard drive. Rather, it will take each link on the blogs in your list and download it. The -np switch stands for “no parent”, which instructs wget never to follow a link up to a parent directory.
We don't, however, want all the links, just those that point to audio files we haven't yet seen. Including -A.mp3 tells wget to download only files that end with the .mp3 extension. And -N turns on timestamping, which means wget won't download something with the same name unless it's newer.
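To get a feel for what that suffix filter does, here's a rough sketch in plain shell. It only approximates wget's accept list: accepts_mp3 is a helper name I made up, the example.com addresses are placeholders, and real wget compares file names rather than whole URLs. Wget also still fetches HTML pages so it can read their links, then deletes them afterward.

```shell
# accepts_mp3 is a hypothetical helper that mimics only the suffix test
# behind -A.mp3; it is not how wget is implemented.
accepts_mp3() {
    case "$1" in
        *.mp3) return 0 ;;   # name ends in .mp3: keep it
        *)     return 1 ;;   # anything else: skip it
    esac
}

# Placeholder links of the kind a blog page might contain.
for url in \
    "http://example.com/songs/track01.mp3" \
    "http://example.com/about.html" \
    "http://example.com/cover.jpg"
do
    if accepts_mp3 "$url"; then
        echo "keep $url"
    else
        echo "skip $url"
    fi
done
```

Only the first address is reported as kept; the other two are skipped.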
To keep things clean, we'll add -nd, which makes the app save everything it finds in one directory, rather than mirroring the directory structure of linked sites. And -erobots=off tells wget to ignore the standard robots.txt files. Normally, this would be a terrible idea, since we'd want to honor the wishes of the site owner. However, since we're only grabbing one file per site, we can safely skip these and keep our directory much cleaner. Also, along the lines of good net citizenship, we'll add -w5 to wait 5 seconds between requests so as not to pound the poor blogs. The -t1 option limits wget to one try per download, so it moves on rather than retrying when a request fails.
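If the short flags are hard to keep straight, the same options can be spelled out with wget's long names. What follows is a sketch rather than the command from the post: fetch_mp3s and the WGET variable are names I've assumed for illustration, and the function takes the list file as its argument (--input-file is the long form of -i).

```shell
# WGET and fetch_mp3s are illustrative names, not part of wget itself.
# WGET lets you swap in another command (such as echo) for a dry run.
WGET="${WGET:-wget}"

fetch_mp3s() {
    # $1: path to a text file with one blog URL per line
    "$WGET" \
        --recursive --level=1 --span-hosts --no-parent \
        --accept=.mp3 --timestamping \
        --no-directories -e robots=off \
        --tries=1 --wait=5 \
        --input-file="$1"
}
```

Calling fetch_mp3s ~/mp3blogs.txt should behave like the one-liner at the top, with the 5-second wait included. Setting WGET=echo first prints the arguments instead of downloading anything, which is a handy way to check them.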
Finally, -i ~/mp3blogs.txt is a little shortcut. Typically, I'd just add a URL to the command line with wget and start the downloading. But since I wanted to visit multiple mp3 blogs, I listed their addresses in a text file (one per line) and told wget to use that as the input.
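Here's a minimal sketch of building such a list. The two addresses are placeholders on example.com and example.org, and LIST is a variable name I've assumed; it defaults to a separate example file so that an existing ~/mp3blogs.txt is not overwritten.

```shell
# LIST is an assumed variable name; the post itself uses ~/mp3blogs.txt.
LIST="${LIST:-./mp3blogs-example.txt}"

# Placeholder addresses, one per line; replace them with real blogs.
cat > "$LIST" <<'EOF'
http://example.com/mp3blog/
http://example.org/music/
EOF

# Print the number of lines, which should match the number of blogs.
wc -l < "$LIST"
```

Once the file looks right, pass its path to wget with -i in place of ~/mp3blogs.txt.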