Screen scraping with your browser's JavaScript console

I needed to experiment a bit with language packs for IE 10 the other day, which meant downloading and installing every available language pack. Unfortunately, I couldn't find a single convenient download that would install everything. The language packs were only available as separate downloads, one for each supported language.

This was a problem, as I was in no mood to download each file individually and there were hundreds of "download" buttons on the page. I figured I'd see if I could screen scrape the links from the DOM of the page and then write a little script to download all of them in one go. So I fired up an instance of IE, hit F12 to launch the developer tools, and used the "Select element by click" button to quickly navigate to the markup associated with a "download" button.

The markup showed that every download button is an anchor tag whose href attribute points to the MSU file for that particular language. Each of these anchor tags also has a class called "download" applied to it, so I should be able to fetch all the links by selecting every anchor tag that has the "download" class. I switched to the "Console" tab in the developer tools window and ran the following script:

document.querySelectorAll("a.download")

And sure enough, this produced a list of all the anchor tags I was interested in. I needed the URLs, however, and not the DOM elements themselves. So I ran this next:

Array.prototype.forEach.call(
    document.querySelectorAll("a.download"),
    function (a) {
        console.log(a.href);
    });

This produced a list of links such as this (snipped since there are quite a lot of them):

http://download.microsoft.com/download/D/9/A/.../IE10-Windows6.1-LanguagePack-x64-zh-tw.msu 
http://download.microsoft.com/download/D/9/A/.../IE10-Windows6.1-LanguagePack-x64-zu-za.msu 
http://download.microsoft.com/download/D/9/A/.../IE10-Windows6.1-LanguagePack-x86-af-za.msu 
http://download.microsoft.com/download/D/9/A/.../IE10-Windows6.1-LanguagePack-x86-am-et.msu
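If you'd rather get the URLs back as a single block of text to copy, instead of one console line per link, the same approach works with map. This is only a minimal sketch: collectHrefs is a name made up for this example, and the fake anchors (with placeholder URLs) stand in for the NodeList so that the snippet also runs outside a browser.

```javascript
// Hypothetical helper: returns the href of every element in an array-like list.
function collectHrefs(anchors) {
    return Array.prototype.map.call(anchors, function (a) {
        return a.href;
    });
}

// Stand-in for a NodeList; these are placeholder URLs, not the real links.
var fakeAnchors = {
    0: { href: "http://example.com/pack-x64-aa.msu" },
    1: { href: "http://example.com/pack-x86-aa.msu" },
    length: 2
};

// One URL per line, as a single string.
console.log(collectHrefs(fakeAnchors).join("\n"));

// In the browser console you would pass the real node list instead:
// console.log(collectHrefs(document.querySelectorAll("a.download")).join("\n"));
```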

If you're wondering why I had to iterate through the list of nodes returned by querySelectorAll via Array.prototype.forEach.call, it's because what querySelectorAll returns isn't a JavaScript array, i.e., it doesn't inherit from Array.prototype. It is instead a NodeList object, which looks a lot like an array: it has numeric properties from 0 to N-1, where N is the number of elements returned, and a length property equal to N. It turns out that the Array.prototype methods are intentionally generic, so methods such as forEach, map, filter and reduce handle such "array-like" objects just as well as genuine, certified JavaScript arrays. Here's an example of what I am talking about:

var notArray = {
    0: "This ",
    1: "is ",
    2: "not ",
    3: "really ",
    4: "an ",
    5: "array.",
    length: 6
};

console.log(Array.prototype.reduce.call(
    notArray,
    function (previous, current) {
        return previous + current;
    },
    ""));

This snippet prints the following text to the console:

This is not really an array.
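The same generic behaviour also gives you a way to turn an array-like object into a real array, which is handy if you'd rather call the methods on it directly. Here is a small sketch along the same lines; arrayLike is just an illustrative object, not something from the page.

```javascript
var arrayLike = { 0: "x86", 1: "x64", length: 2 };

// slice with no arguments copies every indexed element into a new, genuine array.
var realArray = Array.prototype.slice.call(arrayLike);

console.log(Array.isArray(arrayLike)); // false
console.log(Array.isArray(realArray)); // true

// Array methods can now be called directly on the copy.
console.log(realArray.join(", ")); // x86, x64
```

The same call should work on the NodeList returned by querySelectorAll, although I'd treat that as something to verify in your own browser rather than a guarantee.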

If you take another look at the list of URLs our script printed to the console, you'll notice from the file names that this list includes both x86 files and x64 files. I wanted only x64 files. So, I next changed the script to this:

Array.prototype.forEach.call(
    document.querySelectorAll("a.download[href*=x64]"),
    function (a) {
        console.log(a.href);
    });

The selector above matches all anchor tags in the DOM that have the "download" class applied and whose href attribute value contains the string "x64" (the *= operator is a substring match). I had first implemented this via another call to Array.prototype.filter before learning that CSS3 selector syntax already provides for it! Pretty nifty, no? That's pretty much it. I wanted to run a download script to fetch all the files, so I slightly modified the script to produce wget calls like so:

Array.prototype.forEach.call(
    document.querySelectorAll("a.download[href*=x64]"),
    function (a) {
        console.log("wget " + a.href);
    });

I then plonked the output into a batch file and ran it. Mission accomplished!
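For comparison, the Array.prototype.filter approach mentioned above might look something like the sketch below. This is a reconstruction for illustration, not the code I originally ran: buildWgetLines is a made-up name, and the fake anchors (with placeholder URLs) stand in for the NodeList so the snippet runs outside a browser too.

```javascript
// Hypothetical helper: builds one "wget <url>" line for every anchor
// whose href contains the given substring.
function buildWgetLines(anchors, needle) {
    // filter.call accepts the array-like list and returns a genuine array,
    // so map can be called on the result directly.
    return Array.prototype.filter.call(anchors, function (a) {
        return a.href.indexOf(needle) !== -1;
    }).map(function (a) {
        return "wget " + a.href;
    });
}

// Stand-in for a NodeList; these are placeholder URLs, not the real links.
var fakeAnchors = {
    0: { href: "http://example.com/pack-x64-aa.msu" },
    1: { href: "http://example.com/pack-x86-aa.msu" },
    2: { href: "http://example.com/pack-x64-bb.msu" },
    length: 3
};

console.log(buildWgetLines(fakeAnchors, "x64").join("\n"));

// In the browser console:
// console.log(buildWgetLines(document.querySelectorAll("a.download"), "x64").join("\n"));
```

Joining the lines into one string means the whole batch file can be copied from the console in one go.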

Now, it turned out that the page in question also includes the jQuery library, as you can see when you pull up the file list from the "Script" tab in the developer tools.

I could have done the same thing with slightly terser syntax using jQuery. Here's how:

$("a.download[href*=x64]").each(function () {
    console.log("wget " + this.href);
});

Not having to resort to the Array.prototype weirdness does make the code a lot cleaner, doesn't it?
