Scraper Host Object

1.0 Introduction

The Scraper object allows data to be extracted from HTML pages and presented in XML format. It provides a bridge to data sources that don't have XML or Web service representations at present.

1.1 Example

                var config = <config>
                                <var-def name='response'>
                                    <html-to-xml>
                                        <http method='get' url='http://ww2.wso2.org/~builder/'/>
                                    </html-to-xml>
                                </var-def>
                             </config>;
                
                var scraper = new Scraper(config);
                var result = scraper.response;
            

2.0 Scraper Object

The Scraper object takes a set of scraping instructions, written in an XML configuration language, in its constructor. The scraping component wraps WebHarvest.

Note that there are a few caveats when using the screen scraping language from within the Scraper object and within E4X, as listed below:

  1. The result of the scrape must be saved in a variable:
                            var config = <config>
                                            <var-def name='response'>
                                                <html-to-xml>
                                                    <http method='get' url='http://ww2.wso2.org/~builder/'/>
                                                </html-to-xml>
                                            </var-def>
                                         </config>;
                        
    The contents of the variable appear as a property on the Scraper object.
  2. The result is currently returned as a string. When the result represents XML, you must parse it into XML yourself, and you must remove the XML declaration before doing so. The E4X XML constructor does not parse whole documents, only node lists, and rejects the declaration as an illegal processing instruction:
                            var scraper = new Scraper(config);
                            var result = scraper.response;
    
                            // strip off the XML declaration and parse as XML.
                            var resultXML = new XML(result.substring(result.indexOf('?>') + 2));
                            return resultXML;
                        
  3. The WebHarvest language <template> instruction allows variables to be referenced, using the notation ${variable-name}. The curly brackets conflict with the use of XML literals in E4X, where they cause evaluation of the enclosed data. To escape the curly brackets in E4X (so they will be interpreted by WebHarvest), use the character entity references &#x7B; and &#x7D; for '{' and '}' respectively.
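The declaration-stripping step in caveat 2 can be made a little more defensive. In the snippet above, indexOf returns -1 when the result carries no XML declaration, so the substring call would silently drop the first character of the result. The following is a minimal sketch in plain JavaScript; stripXmlDeclaration is a hypothetical helper name, not part of the Scraper API, and it assumes the declaration, if present, is the first thing in the string apart from whitespace.

```javascript
// Hypothetical helper (not part of the Scraper API): removes a leading
// XML declaration, if there is one, and leaves any other input unchanged.
function stripXmlDeclaration(text) {
    // Optional leading whitespace, then '<?xml ' ... '?>' at the very start only.
    // The whitespace after 'xml' keeps other processing instructions,
    // such as <?xml-stylesheet ...?>, from matching.
    var declaration = /^\s*<\?xml\s[^?]*\?>\s*/;
    return text.replace(declaration, '');
}
```

Inside a mashup, the cleaned string could then be handed to the E4X constructor, for example new XML(stripXmlDeclaration(scraper.response)).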

3.0 API Documentation

Member: Constructor Scraper(XML config);
Description: The WebHarvest config object (as E4X XML) that defines the scrape. The scrape will be executed immediately, and the results will be exposed as properties on the Scraper object.
Supported in version: 0.1

Member: readonly property String var-name
Description: After the scrape completes, each result variable defined in the config definition (using <var-def name="var-name">) will appear as a property on the Scraper object.
Supported in version: 0.1
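
Because each result is exposed under the name given in <var-def>, a name that is not a valid JavaScript identifier (one containing a hyphen, for instance) cannot be read with dot notation and needs bracket notation instead. The sketch below illustrates this with a hypothetical helper, getScrapeResult, which is not part of the Scraper API. It assumes that the Scraper object answers undefined for a variable name it does not define, as an ordinary JavaScript object would.

```javascript
// Hypothetical helper (not part of the Scraper API): reads a scrape result
// by the name used in its <var-def name='...'> element.
function getScrapeResult(scraper, varName) {
    // Bracket notation works for any property name, including 'page-title'.
    var value = scraper[varName];
    if (value === undefined) {
        // Assumption: missing result variables read as undefined.
        throw new Error("No scrape result named '" + varName + "'");
    }
    return value;
}
```

With the config from section 1.1, getScrapeResult(scraper, 'response') would return the same string as scraper.response.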

4.0 References