24 Oct, 2007

Mashing up a National Geographic Photo of the Day Feed

  • Jonathan Marsh
  • Vice President - Strategy - WSO2

I recently wrote a neat little mashup which demonstrates a little of the power of the WSO2 Mashup Server to flow information from one place to another, and from one format to another.  I had a simple set of requirements:

  1. I use the Google Photos Screensaver to show a slideshow of interesting photographs when the family-room computer isn’t being used.  Since that computer drives a 37-inch LCD screen as a second monitor, high-quality photos come out really clear and make for a nice, constantly changing design element for the room.  It works best if the set of photographs changes before it gets old.
  2. I recently found the National Geographic site’s “photo of the day” section as an interesting source of high-quality photographs that updates on a daily basis.  However, National Geographic doesn’t provide a feed for the photo of the day.

Essentially, then, the task was to scrape the image URLs from the photo-of-the-day pages and package them into a feed.  The complication comes from the fact that there doesn’t seem to be a list of photos of the day available on the National Geographic web site – just links from a particular photo to the one for the previous (or next) day.  Because a feed of 30 photos requires 30 different pages to be scraped, some caching becomes necessary to improve performance, especially since feed readers can be expected to bombard the service if it proves popular.

I initially broke down the task into three parts:

  1. Scraping a photo of the day page to extract the useful metadata: the date, title, photographer’s credit, and description of the photo, a set of links to the actual image in various sizes, and links to the page being scraped (so one can return there easily) and to the previous page in the photo stream.  Since this metadata shouldn’t change, it can be cached locally for faster retrieval.
  2. Searching the cache, or going to the web site (and thus populating the cache), to acquire the metadata for a particular date.
  3. Formatting the metadata for a particular range of dates into a feed.

Here’s how I approached each of these tasks.

Scraping a photo of the day page

The first order of business for scraping a page like this is simply to fetch the page and tidy it into XML so we can navigate it using tools like XPath.  The WSO2 Mashup Server provides a “Scraper” object that accepts an XML language describing the steps involved in scraping.  This configuration language is defined by the Web Harvest component that we use for scraping.  I usually start a scraping mashup with a simple function that configures and performs the scrape, and returns the results:

 
function scrape_picture_page() {
    // Web Harvest configuration: fetch the page and tidy the HTML into XML,
    // storing the result in a variable named "response".
    var config =
        <config>
            <var-def name='response'>
                <html-to-xml>
                    <http method='get' url="https://photography.nationalgeographic.com/photography/photo-of-the-day" />
                </html-to-xml>
            </var-def>
        </config>;

    // Run the scrape; the "response" variable defined in the config becomes a
    // property of the Scraper object.
    var scraper = new Scraper(config);

    // The response is a string of XML text including an XML declaration, which
    // E4X can't parse, so strip everything up to the end of the declaration.
    var bodyWithoutXMLDecl = scraper.response.substring(scraper.response.indexOf('?>') + 2);
    var result = new XML(bodyWithoutXMLDecl);

    return result;
}

 

The config language itself is pretty straightforward, once you learn to read it inside out – the <http> element fetches the requested URL, and the <html-to-xml> element does just what it sounds like and tidies the result, which is put into a variable named “response”.  The scrape is performed by initializing a new “Scraper” object with the config, and the output is made available through the “response” property on that object – corresponding to the “response” variable we defined within the config.  One trick though – the result is a stream of XML text, including an XML declaration.  The E4X extensions can parse this into XML (new XML()), but can’t handle the XML declaration.  We have to strip off the declaration ourselves using string manipulation.

By placing the above function in a file named “nationalgeographic.js” in the “scripts” directory of the Mashup Server, a Web service with a scrape_picture_page operation will be deployed.  We can get to it through the try-it page (https://localhost:7762/services/jonathan/nationalgeographic?tryit) and see what the tidied HTML looks like for the page.

Extracting the data from the page can be a tedious process, involving looking at HTTP request-response pairs and trawling through the HTML source of a page.  Fortunately the National Geographic site’s HTML is simple and straightforwardly structured, with a number of well-placed identifiers to help us zero in on the interesting content.  I usually end up using Firebug (a Firefox debugging extension) to navigate the live HTML of the page and develop XPath expressions that extract the desired metadata.  I’ve also found that, since Web Harvest communicates between components using strings rather than parsed XML, defining a lot of XPath filters to extract information one element at a time during a scrape can perform poorly.  Instead it seems much faster to wrap a series of XPath expressions into a simple XSLT stylesheet, so the XML can be parsed once, queried as much as needed, and an XML structure containing the results returned in one action.  To do that, I added an XSLT stylesheet to the above configuration:

    var config =
        <config>
            <var-def name='response'>
                <xslt>
                    <xml>
                        <html-to-xml>
                            <http method='get' url={url} />
                        </html-to-xml>
                    </xml>
                    <stylesheet>
                        <![CDATA[
                            <xsl:stylesheet version="1.0" xmlns:xsl="https://www.w3.org/1999/XSL/Transform">
                                <xsl:output method="xml" omit-xml-declaration="yes"/>
                                <xsl:template match="/">
                                    <photo>
                                        <xsl:for-each select="//*[@id='content-center-well']">
                                            <date><xsl:value-of select="div[@class='date']"/></date>
                                            <previous>https://photography.nationalgeographic.com<xsl:value-of select="div[@class='slide-navigation'][1]/p/a/@href"/></previous>
                                            <xsl:for-each select="div[@class='image-viewer clearfix']">
                                                <xsl:for-each select="table/tbody/tr[1]/td/a">
                                                    <page>https://photography.nationalgeographic.com/photography/photo-of-the-day/<xsl:value-of select="substring-before(substring-after(@href,'enlarge/'),'_pod_image.html')"/>.html</page>
                                                    <xsl:variable name="href" select="concat('https://photography.nationalgeographic.com', substring-before(img/@src, '-ga.jpg'))"/>
                                                    <location type='small'><xsl:value-of select="$href"/>-ga.jpg</location>
                                                    <location type='medium'><xsl:value-of select="$href"/>-sw.jpg</location>
                                                    <location type='large'><xsl:value-of select="$href"/>-lw.jpg</location>
                                                    <location type='wide'><xsl:value-of select="$href"/>-xl.jpg</location>
                                                </xsl:for-each>
                                                
                                                <xsl:for-each select="div[@class='summary']">
                                                    <title><xsl:value-of select="h3"/></title>
                                                    <credit><xsl:value-of select="p[@class='credit']"/></credit>
                                                    <description>
                                                        <xsl:copy-of select="div[@class='description']/node()"/>
                                                    </description>
                                                </xsl:for-each>
                                            </xsl:for-each>
                                        </xsl:for-each>
                                    </photo>
                                </xsl:template>
                            </xsl:stylesheet>
                        ]]>
                    </stylesheet>
                </xslt>
            </var-def>
        </config>;

 

Again, fairly straightforward – the <xslt> task has two inputs, <xml> and <stylesheet>.  The stylesheet unfortunately has to be enclosed in a CDATA section rather than written as straight XML.  One other nice trick though – since the output now comes from an XSLT transform, the “omit-xml-declaration” flag can be used to strip off the XML declaration so we don’t have to do it through text manipulation, simplifying and accelerating our Javascript code.

So we’re almost there with this capability.  Some minor improvements and adding caching are all we need:

  1. Add an optional “url” parameter to allow this operation to work on any photo-of-the-day page URL.  Using E4X’s curly braces we can substitute this value right into the config.
  2. If the result was successful (i.e. the <photo/> element has children), calculate the date in yyyy-mm-dd format and use the storexml service to cache it – choosing a path unlikely to conflict with other users of the storexml service.  To make the storexml service easy to call, we import its stub, which I got from https://localhost:7762/services/system/storexml?stub&lang=e4x and saved into the nationalgeographic.resources folder which serves as the sandbox for this service.  It’s important to save a copy because when the Mashup Server boots up, the nationalgeographic service might be deployed before the storexml service – attempts to generate the stub at that time will fail and cause the nationalgeographic service to fail too.  The Mashup Server doesn’t yet track these dependencies (and we’re still thinking about whether this is a tractable problem or not).
  3. Since this operation isn’t really meant to be called by end-users of the feed, I could make it private using scrape_picture_page.visible = false, but instead I’ve just used the “operationName” property to rename it, indicating to users that it really is just for test purposes.
  4. Add type annotations.
  5. Add documentation annotations (not shown below).

// Import the storexml stub so we can call the storexml service for caching.
system.include("storexml.stub.js");
var cachePath = "nationalgeographic/cache/";

// Rename the operation to signal that it's intended for testing only.
scrape_picture_page.operationName = "test_scrape_picture_page";
scrape_picture_page.inputTypes = {"url" : "xs:string?"};
scrape_picture_page.outputType = "xml";
function scrape_picture_page(url) {
    if (url == null)
        url = "https://photography.nationalgeographic.com/photography/photo-of-the-day";
    var config =
        <config>
            <var-def name='response'>
                <xslt>
                    <xml>
                        <html-to-xml>
                            <http method='get' url={url} />
                        </html-to-xml>
                    </xml>
                    <stylesheet>
...
                    </stylesheet>
                </xslt>
            </var-def>
        </config>;

    var scraper = new Scraper(config);

    // The stylesheet omits the XML declaration, so the response parses directly.
    var result = new XML(scraper.response);

    // If the scrape succeeded, cache the metadata under its yyyy-mm-dd date.
    if (result.hasComplexContent()) {
        var date = xsDate(new Date(result.date));
        storexml.store(cachePath + date, result);
    }

    return result;
}

 

xsDate.visible = false;
// Format a Javascript Date as an xs:date string (yyyy-mm-dd), using UTC fields.
// getUTCMonth() is zero-based, hence the "+ 1" and the "< 9" zero-padding test.
function xsDate(d)
{
    return d.getUTCFullYear() + "-" +
          (d.getUTCMonth() < 9 ? "0": "" ) + (d.getUTCMonth() + 1) + "-" +
          (d.getUTCDate() < 10 ? "0": "" ) + d.getUTCDate();
}

As an aside, this exercise surfaced a few wishes:

  1. <xml> is a reserved tag name in XML, so it’s unfortunate that Web Harvest doesn’t use something else.
  2. Web Harvest’s requirement that the stylesheet be enclosed in a CDATA section is unfortunate – it means well-formedness errors can’t be caught at Javascript/E4X compile time, only at runtime.  This slows down the development process.  I could put the stylesheet in a separate file, but that just makes it harder to share the service and see what’s going on.
  3. I’d prefer a way to get E4X XML back from Web Harvest directly so I wouldn’t have to parse it myself, worry about the XML Declaration, and so forth.  Maybe we can do something about this in a future release.
  4. Managing date formats becomes a bit of a chore.  I prefer the operations and cache to work with the xs:date format (yyyy-mm-dd), but the page metadata is in the form “Month day, year” (directly from the scraped page), and Javascript prefers to manipulate dates through its own Date object.  Soon we’ll see that the RSS profile defines its own date format (a subset of the Javascript serialization), which means a fourth conversion.

Finding a picture for a particular date

Now that we have a function that can scrape a page given a URL, and given that the data returned and cached by that function contains a link to the previous day’s page, we can do some walking around in the cache to find the data for a particular date.  That’s what this function does.

First, we look in the cache for a photo’s metadata.  If it’s there, we can simply return it – we’re done.  Otherwise we need to find the URL for the page representing that date and call the scrape_picture_page operation.

If I can’t find the requested date in the cache, I look for the next later date, and so on, until I do find a photo in the cache (or run past today’s date).  That’s the first while loop.  Then, using the <previous> page URL, I work backward again, incidentally populating the cache as I go, until I’m back to the date I was looking for.  The couple of “if” statements look for exceptional conditions: the first handles the case where I’ve looked all the way forward to today but still haven’t found anything in the cache, and the second makes sure that if a page can’t be scraped for some reason, we give up and return what little we have before we dig ourselves any deeper.

picture_for_date.inputTypes = {"date" : "xs:string"};
picture_for_date.outputType = "xml";
function picture_for_date(date) {
    try {
        return storexml.retrieve(cachePath + date);
    } catch (e) {
        print("failed to find cached photo for date " + date);
        var photo;
        var startDate = parseDate(date);
        var today = new Date();
        // work forwards in the cache until we find something (or hit today)
        while (startDate <= today) {
            try {
                photo = storexml.retrieve(cachePath + xsDate(startDate));
                break;
            } catch (e) {
                startDate.setUTCDate(startDate.getUTCDate() + 1);
            }
        }
        // start with the most recent thing in the cache (if any) and work backwards to the
        //    requested date, filling in the cache as we go...
        var targetDate = parseDate(date);
        while (startDate > targetDate) {
            var previousPageUrl;
            if (photo == null) previousPageUrl = null;
            else previousPageUrl = photo.previous;
           
            print("fetching photo for " + startDate);
            photo = scrape_picture_page(previousPageUrl);
            if (!photo.hasComplexContent())
                break;
            startDate.setUTCDate(startDate.getUTCDate() - 1);
        }
       
        return photo;
    }
}
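
One detail not listed in the article is the parseDate helper that picture_for_date calls to turn an xs:date string back into a Javascript Date.  A minimal sketch of what it presumably looks like (my own assumption, not the author’s code) follows:

// Hypothetical helper (not shown in the original article): parse an xs:date
// string (yyyy-mm-dd) into a Javascript Date at midnight UTC, the mirror
// image of xsDate() above.
parseDate.visible = false;
function parseDate(s)
{
    var parts = s.split("-");
    // Date.UTC expects a zero-based month, so subtract one.
    return new Date(Date.UTC(parts[0], parts[1] - 1, parts[2]));
}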

 

Generating the feed

Now we have all the pieces in place to aggregate the data and generate a list of some kind as output.  The picture_of_the_day operation does that for us.

The function has some parameters controlling aspects of the feed – whether to link to the small, medium, large, or wide images, and how many items to include.  If no number is specified, we generate a feed of the latest 30 photos – so each photo stays in the rotation just long enough to enjoy it, but not so long that we get tired of it.

The WSO2 Mashup Server has a Feed object to help construct feeds, but because I’m targeting this feed at the Google Photos Screensaver I need to include some feed extensions that aren’t supported in the 0.2 release (though they’ve just been added to the nightly build!).  It’s not hard to create an RSS feed by hand though, so that’s what I chose to do.  First I prepopulate the channel with a title, link, and description, and then loop through the photos, adding an item for each of them.  The first time through the loop, I also add a <pubDate> reflecting the date of today’s photo.

Again, this isn’t rocket science – the hardest part is simply formatting the dates appropriately.  During the loop I use Javascript Date objects to step backward a day at a time and tick over correctly at month boundaries.  I convert each date to an xs:date to access the cache, to an RSS Profile-conformant string for the <pubDate>, and to an xs:dateTime for the <atom:published/> element, which seems useful for the subscription page displayed in Internet Explorer 7.

picture_of_the_day.inputTypes = {"size" : "small | medium | large | wide", "numPhotos" : "number?"};
picture_of_the_day.outputType = "#raw";
function picture_of_the_day(size, numPhotos) {
    if (numPhotos == null) numPhotos = 30;

    // Prepopulate the channel with the title, link, and description.
    var feed =
        <rss version="2.0">
            <channel>
                <title>National Geographic Picture-of-the-day (from WSO2 Mashup Server)</title>
                <link>https://mashups.wso2.org/services/nationalgeographic/picture_of_the_day?size={size}</link>
                <description>WSO2 Mashup Server mashup acquiring and caching links to the National Geographic
                Photo of the Day (https://photography.nationalgeographic.com/photography/photo-of-the-day),
                and exposing them as a feed.  Sizes of "small", "medium", "large", and "wide" are available.
                A max number of photos can be specified with the "numPhotos" parameter.</description>
            </channel>
        </rss>;

    // Walk backwards one day at a time, adding an <item> for each photo found.
    var startDate = new Date();
    var photo, photoDate, url, urlsmall, entry;
    for (var i = 0; i < numPhotos; i++) {
        photo = picture_for_date(xsDate(startDate));
        if (photo.hasComplexContent()) {
            url = photo.location.(@type == size).toString();
            urlsmall = photo.location.(@type == 'small').toString();
            photoDate = new Date(photo.date.toString());
            // The channel's <pubDate> reflects the date of the most recent photo.
            if (i == 0) {
                feed.channel.appendChild(<pubDate>{rssDate(photoDate)}</pubDate>);
            }
            entry = <item xmlns:media="https://search.yahoo.com/mrss/">
                    {photo.title}
                    <description>
                        &lt;a href='{url}'>&lt;img src='{urlsmall}'/>&lt;/a>
                        {photo.description.*.toXMLString()}
                    </description>
                    <pubDate>{rssDate(photoDate)}</pubDate>
                    <link>{photo.page.toString()}</link>
                    <guid isPermaLink='false'>{photo.page.toString()}</guid>
                    <media:content url={url} type="image/jpeg" />
                    <media:thumbnail url={urlsmall} />
                    <atom:published xmlns:atom="https://www.w3.org/2005/Atom">{xsDate(photoDate)}T00:00:00Z</atom:published>
                </item>;
            feed.channel.appendChild(entry);
        }
        startDate.setUTCDate(startDate.getUTCDate() - 1);
    }
    return feed;
}
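
The rssDate helper used above is another function the article doesn’t list.  Assuming it simply formats a Javascript Date as the RFC 822-style string the RSS Profile expects, a minimal sketch might look like this:

// Hypothetical helper (not shown in the original article): format a Date as an
// RSS Profile / RFC 822-style date string, e.g. "Wed, 24 Oct 2007 00:00:00 GMT".
// The time is pinned to midnight UTC since the page only supplies a date.
rssDate.visible = false;
function rssDate(d)
{
    var days   = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    var months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];
    return days[d.getUTCDay()] + ", " +
           (d.getUTCDate() < 10 ? "0" : "") + d.getUTCDate() + " " +
           months[d.getUTCMonth()] + " " + d.getUTCFullYear() +
           " 00:00:00 GMT";
}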

 

You can access this operation through the try-it page at https://localhost:7762/services/jonathan/nationalgeographic?tryit and see that the operation returns a feed.  However, the try-it uses SOAP by default under the covers, which isn’t terribly friendly to feed readers like the Google Photos Screensaver.  No problem – the Mashup Server also exposes its operations through a REST interface.  By accessing the URL https://localhost:7762/services/jonathan/nationalgeographic/picture_of_the_day?size=wide, you can see the feed directly in the browser, point the screen saver at it, subscribe to it, and so on.  By adjusting the “size” and “numPhotos” parameters you can generate variants of the feed that suit your purpose.

Publishing the feed

Once I had written the service and tried it for a day or two to ensure it was stable (fixing a couple of edge cases as a result), I used the administrative UI in the Mashup Server to publish it to https://mooshup.com, which hosts the service live on the internet for others to use.  The publishing process is simple – click the share button, confirm that https://mooshup.com is the destination, and click OK.  While we have lots to do to make this site an attractive and useful place for members of the mashup community to hang out, it does give me a stable internet URL for the feed so others can enjoy it.  You can exercise the try-it page live from there, look at the metadata, or download the service to your local installation of the Mashup Server and run it there.

Last Word

Hopefully this helps you get a feel for the Mashup Server in action.  We did some screen scraping, added fairly sophisticated caching by invoking an external storexml Web service, formulated an RSS feed, and made it (and the intermediate functions) available through a Web service with SOAP 1.2, SOAP 1.1, and HTTP bindings, including an HTTP GET binding amenable to RSS agents.  Although we didn’t look at them in detail in this article, the Mashup Server also generated a try-it page for debugging and exercising the service, WSDL, Schema, stubs for accessing the service simply from Javascript or E4X environments, and even some human-readable documentation for the mashup.  We ran the service locally, then published it live onto the internet.  It also would not be hard to generate a custom HTML interface providing (for example) a slideshow of these photos, but in this case I wanted to show that user interfaces can go beyond HTML pages by using the Google Photos Screensaver as my ultimate user interface.

So what’s next for this service?  The main improvement I can think of is rewriting the code to use the Feed object once it becomes capable of handling the images.  It took me a while to figure out which RSS extensions were necessary, and it would be nice not to worry about the representation of dates.  Maybe I could even offer an Atom feed in parallel.  Another idea, related to performance, would be to experiment with a different (perhaps additional) caching strategy: cache the entire feed to disk and periodically refresh it using the recurrence capabilities of the Mashup Server.  But those are perhaps good topics for future articles!

Until then, enjoy the great photos available from National Geographic!


Author

Jonathan Marsh, Director Mashup Technologies, jonathan at wso2 dot com

 
