Sfynx Blog: Downloading and parsing external content through Javascript and VBScript

In this first post we share a method to download and parse a remote HTTP page using the XMLHttpRequest object available in JavaScript.

Background
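If you want to get some information from a different web site, you can let the browser download that page through script and then parse it for the desired content. Keep in mind that browsers apply the same-origin policy to XMLHttpRequest, so a request to another site succeeds only where the browser allows it, for example when the target server sends CORS headers.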

Suggested procedure
1. Create an XMLHttpRequest object using a browser-independent method.
2. Prepare a function that removes or replaces all HTML tags that could cause the browser to run scripts, load external content, or behave in any unwanted way.
3. Request the target page, make the returned content safe, and insert it into a hidden iframe.
4. Parse the iframe DOM and look for the wanted content.
5. Process and reformat the extracted content.

1. Creating the XMLHttpRequest object
The following JavaScript function first tries to create the object for Internet Explorer through ActiveX, using conditional compilation (see http://www.javascriptkit.com/javatutors/conditionalcompile.shtml) so that only Internet Explorer runs that code.
If that does not work and the XMLHttpRequest class (see http://www.w3schools.com/xml/xml_http.asp) is defined, that class is used instead. As a last resort the function tries window.createRequest where it exists.

function createXmlHttp() {
    var xmlhttp=false;
    /*@cc_on @*/
    /*@if (@_jscript_version >= 5)
    // Through conditional compilation we can handle old Internet Explorer versions
    // and security blocked creation of the XMLHTTP object.

    try {
        xmlhttp = new ActiveXObject("Msxml2.XMLHTTP");
    } catch (e) {
        try {
            xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
        } catch (E) {
            xmlhttp = false;
        }
    }
    @end @*/
    if (!xmlhttp && typeof XMLHttpRequest!='undefined') {
        try {
            xmlhttp = new XMLHttpRequest();
        } catch (e) {
            xmlhttp=false;
        }
    }
    if (!xmlhttp && window.createRequest) {
        try {
            xmlhttp = window.createRequest();
        } catch (e) {
            xmlhttp=false;
        }
    }
    return xmlhttp;
}
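
The same "try each candidate in turn" idea can also be written without conditional compilation. The following is only a sketch of that alternative, not a replacement for the function above: the names firstAvailable and xmlHttpFactories are hypothetical and introduced here for illustration.

```javascript
// Hypothetical helper: returns the result of the first factory that
// succeeds, or false when every factory throws or returns nothing.
function firstAvailable(factories) {
    for (var i = 0; i < factories.length; i++) {
        try {
            var obj = factories[i]();
            if (obj) {
                return obj;
            }
        } catch (e) {
            // this factory is not supported here; try the next one
        }
    }
    return false;
}

// The same candidates as in createXmlHttp(), expressed as factories.
// A missing class throws inside its factory and is simply skipped.
var xmlHttpFactories = [
    function () { return new ActiveXObject("Msxml2.XMLHTTP"); },
    function () { return new ActiveXObject("Microsoft.XMLHTTP"); },
    function () { return new XMLHttpRequest(); }
];

// Usage in the browser:
// var xmlhttp = firstAvailable(xmlHttpFactories);
```

As with createXmlHttp(), the caller should check for a false return value before using the object.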

2. Making content safe
Since we are going to insert the downloaded markup into a hidden iframe, we don't want the browser to load any scripts or other active content from it.
Since VBScript is very easy to use for string manipulation (RegEx could be used, but is less straightforward), we can create a VBScript function to do the work for us. Note that VBScript runs only in Internet Explorer.

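    <script language="vbscript">
        ' Renames tags that would run scripts or load external content.
        ' vbTextCompare makes each match case-insensitive, so <SCRIPT> is caught too.
        Function Safe(txt)
            txt = Replace(txt, "<link", "<safelink", 1, -1, vbTextCompare)
            txt = Replace(txt, "</link>", "</safelink>", 1, -1, vbTextCompare)
            txt = Replace(txt, "<script", "<safescript", 1, -1, vbTextCompare)
            ' The closing tag is split so the HTML parser does not end this block early.
            txt = Replace(txt, "</scr" & "ipt>", "</safescript>", 1, -1, vbTextCompare)
            txt = Replace(txt, "<iframe", "<safeframe", 1, -1, vbTextCompare)
            txt = Replace(txt, "</iframe>", "</safeframe>", 1, -1, vbTextCompare)
            txt = Replace(txt, "<img", "<safeimg", 1, -1, vbTextCompare)
            txt = Replace(txt, "</img>", "</safeimg>", 1, -1, vbTextCompare)

            Safe = txt
        End Function
    </script>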

This way any <script> tag will be renamed to <safescript> and therefore it will not behave like an ordinary script tag.
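
For browsers without VBScript, the same renaming can be done with a regular expression in JavaScript. This is a minimal sketch, assuming the same four tags and the same replacement names as the VBScript version; the names safeJs and SAFE_NAMES are hypothetical. One difference is that the \b in the pattern only matches whole tag names, so a tag such as <linker> is left alone.

```javascript
// Hypothetical JavaScript counterpart of the VBScript Safe() function.
// Tag names to neutralize, mapped to the replacement names used above.
var SAFE_NAMES = {
    link: "safelink",
    script: "safescript",
    iframe: "safeframe",
    img: "safeimg"
};

function safeJs(txt) {
    // Matches "<" or "</" followed by one of the tag names, ignoring case.
    // \b requires the tag name to end there.
    return txt.replace(/<(\/?)(link|script|iframe|img)\b/gi, function (match, slash, name) {
        return "<" + slash + SAFE_NAMES[name.toLowerCase()];
    });
}

// Usage:
// safeJs('<SCRIPT src="a.js"></Script>')
// returns '<safescript src="a.js"></safescript>'
```

Like the VBScript function, this only renames tags. It does not touch event handler attributes such as onload, so it should be treated as a starting point.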

3. Request the target page

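This part is quite straightforward: open an asynchronous GET request and handle the result once readyState reaches 4 (request complete). For more information see http://www.w3schools.com/xml/xml_http.asp.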

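var xmlhttp = createXmlHttp();  // false if no request object could be created
xmlhttp.open("GET", url, true); // url: address of the target page; true = asynchronous
xmlhttp.onreadystatechange = handleHttpResult;
xmlhttp.send(null);

function handleHttpResult() {
    if (xmlhttp.readyState == 4) { // 4 = request complete
        parseResponse(xmlhttp.responseText);
    }
}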
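
The handler above passes on whatever the server returned, including error pages. A possible refinement, sketched here under the assumption that only 2xx responses should be parsed, is to wrap the request in a function that takes callbacks. The name fetchPage and its parameters are hypothetical.

```javascript
// Hypothetical wrapper: requests `url` with the supplied request object
// and reports the outcome through callbacks.
function fetchPage(xhr, url, onSuccess, onError) {
    xhr.open("GET", url, true);
    xhr.onreadystatechange = function () {
        if (xhr.readyState != 4) {
            return; // request still in progress
        }
        if (xhr.status >= 200 && xhr.status < 300) {
            onSuccess(xhr.responseText);
        } else {
            onError(xhr.status);
        }
    };
    xhr.send(null);
}

// Usage in the browser, with the functions defined in this post:
// fetchPage(createXmlHttp(), url, parseResponse, function (status) {
//     alert("Request failed with status " + status);
// });
```

Passing the request object in as a parameter also avoids the global xmlhttp variable, so several requests can run at the same time.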

4. Parsing the IFRAME DOM

First, it is convenient to have two helper functions, $() and $$(), to retrieve DOM objects from the main document and from the iframe document, respectively.

var loadedDoc; // the iframe's document, set in parseResponse()

function $(id) {
    return document.getElementById(id);
}
function $$(id) {
    return loadedDoc.getElementById(id);
}

function parseResponse(txt) {
    txt = Safe(txt); // make the content safe using the VBScript function above
    loadedDoc = $("iframe").contentWindow.document; // the hidden iframe must have id="iframe"
    loadedDoc.open();     // clear the iframe document
    loadedDoc.write(txt); // insert content into iframe document
    loadedDoc.close();    // finish parsing so the DOM is complete

    // Below an example of parsing. This of course needs to be adapted to every context.
    var container = $$("interesting-content-tag");
    if (container == null)
    {
        alert("NO RESULT FOUND");
        return;
    }
    var list = container.childNodes;
    var tag = null;
    for (var i = 0; i < list.length; i++)
    {
        if (list[i].nodeType != 1) continue; // skip text and comment nodes
        var p = list[i].getElementsByTagName("p")[0];
        var ul = list[i].getElementsByTagName("ul")[0];
        if (p && p.className == "title" && ul) // if current tag contains a tag <p class="title">
        {
            var li = ul.getElementsByTagName("li")[0]; // then retrieve the first <li> of the first <ul> tag.
            if (li)
            {
                tag = li;
                alert("Found tag: " + tag.innerHTML); // alert the user.
            }
        }
    }
    if (tag == null) alert("No tag found");
}
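
Step 5 of the procedure, processing and reformatting the extracted content, depends entirely on the target page. As a minimal sketch, assuming the extracted innerHTML should end up as a single line of plain text, a helper along these lines could be used. The name toPlainText is hypothetical, and only two common entities are decoded.

```javascript
// Hypothetical helper for step 5: turns an innerHTML fragment into plain text.
function toPlainText(html) {
    return html
        .replace(/<[^>]*>/g, " ")  // drop any remaining tags
        .replace(/&nbsp;/g, " ")   // treat non-breaking spaces as spaces
        .replace(/&amp;/g, "&")    // decode ampersands last, so "&amp;nbsp;" stays literal
        .replace(/\s+/g, " ")      // collapse runs of whitespace
        .replace(/^ | $/g, "");    // trim without String.prototype.trim (missing in old IE)
}

// Usage:
// toPlainText('  <b>Tom &amp; Jerry</b>\n<br> episode&nbsp;1 ')
// returns 'Tom & Jerry episode 1'
```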


October 25, 2011
