Introduction to StAX (Streaming API for XML)

By Eran Chinthaka
  • 19 Jun, 2007
  • Level:  Introductory
Streaming API for XML (StAX) gives more control and flexibility to applications that need to process XML documents. The performance gains and memory efficiency that StAX makes possible are important for most applications that use XML. This article by Eran Chinthaka introduces StAX and provides some insights into using it.
Eran Chinthaka
Software Engineer
WSO2 Inc.

Introduction to StAX

XML has become the standard for data exchange between peers. When more and more systems started using XML, processing XML documents became a critical and integral part of those applications. Depending on different environments, there are various ways to process an XML document within a program. Let's first look at these approaches in brief. All approaches can be broadly categorized into two:

  1. Tree-based APIs - The whole XML document is parsed and a model of it is built in memory. DOM (Document Object Model) based implementations use this technique. This approach makes it possible to move backward and forward through XML that has already been read. The model is usually larger than the original XML document, so it duplicates data and wastes memory.
  2. Event-based APIs - An event-based parser parses the whole XML document and emits "events" depending on the information content of the XML. The common approach has been to use a push-based API like SAX to process the XML. In the push model, the parser continuously pushes events to the calling application until it finishes reading the whole XML document. This is more efficient than DOM in terms of memory. However, the problem with the push model is that once started, it goes to the end of the document and the caller must be ready to handle all the events in one shot. The caller that invokes the parser has no control over the parsing process.

Both these models give little or no control to the user in the parsing process. Once started, tree based or event based push models consume the whole data stream at once. JSR 173 defined a pull streaming model, StAX, for processing XML documents. In this model, unlike in SAX, the client has the full control to start, proceed, pause, and resume the parsing process.

Where will this come in handy? Think about an XML message router or mediator. It looks for certain parameters in the XML message and routes the message to the proper destination. This node doesn't need to read the whole XML message. At the same time, it can start forwarding to the destination as soon as its criteria are satisfied. This is an ideal scenario for a pull parser.

StAX

Now let's look at how this is achieved and how the API accommodates such a model.

The client first creates a parser, supplying the XML to be parsed as a java.io.InputStream, a java.io.Reader or a javax.xml.transform.Source.

FileInputStream fileInputStream = new FileInputStream(fileLocation);

XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(fileInputStream);

Then the user should ask the parser to proceed by calling the next() method on the xmlStreamReader. Each call to this method will emit one of the events listed below.

  • XMLStreamConstants.START_ELEMENT
  • XMLStreamConstants.END_ELEMENT
  • XMLStreamConstants.PROCESSING_INSTRUCTION
  • XMLStreamConstants.CHARACTERS
  • XMLStreamConstants.COMMENT
  • XMLStreamConstants.SPACE
  • XMLStreamConstants.START_DOCUMENT
  • XMLStreamConstants.END_DOCUMENT
  • XMLStreamConstants.ENTITY_REFERENCE
  • XMLStreamConstants.ATTRIBUTE
  • XMLStreamConstants.DTD
  • XMLStreamConstants.CDATA
  • XMLStreamConstants.NAMESPACE
  • XMLStreamConstants.NOTATION_DECLARATION
  • XMLStreamConstants.ENTITY_DECLARATION

Depending on the event, you can get more information by calling other corresponding methods appropriate to the event. For example, if the START_ELEMENT event is thrown, then calling getLocalName() will return the local name of the element. Here's a list of corresponding methods that can be called for a given event.

Event | Valid Methods
All States | getProperty(), hasNext(), require(), close(), getNamespaceURI(), isStartElement(), isEndElement(), isCharacters(), isWhiteSpace(), getNamespaceContext(), getEventType(), getLocation(), hasText(), hasName()
START_ELEMENT | next(), getName(), getLocalName(), hasName(), getPrefix(), getAttributeXXX(), isAttributeSpecified(), getNamespaceXXX(), getElementText(), nextTag()
ATTRIBUTE | next(), nextTag(), getAttributeXXX(), isAttributeSpecified()
NAMESPACE | next(), nextTag(), getNamespaceXXX()
END_ELEMENT | next(), getName(), getLocalName(), hasName(), getPrefix(), getNamespaceXXX(), nextTag()
CHARACTERS | next(), getTextXXX(), nextTag()
CDATA | next(), getTextXXX(), nextTag()
COMMENT | next(), getTextXXX(), nextTag()
SPACE | next(), getTextXXX(), nextTag()
START_DOCUMENT | next(), getEncoding(), getVersion(), isStandalone(), standaloneSet(), getCharacterEncodingScheme(), nextTag()
END_DOCUMENT | close()
PROCESSING_INSTRUCTION | next(), getPITarget(), getPIData(), nextTag()
ENTITY_REFERENCE | next(), getLocalName(), getText(), nextTag()
DTD | next(), getText(), nextTag()
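
To make the table concrete, here is a minimal sketch that calls only methods listed as valid for the current event. The input string, the class name EventMethodsSketch and the describe() helper are made up for this illustration, and the sketch assumes the javax.xml.stream API bundled with Java SE 6 and later rather than any particular StAX implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class EventMethodsSketch {

    // Hypothetical input, invented for this illustration.
    static final String XML =
        "<book id=\"42\"><title>Introducing StAX</title></book>";

    /** Collects one line per handled event, using only methods valid for that event. */
    static List<String> describe(String xml) throws XMLStreamException {
        List<String> lines = new ArrayList<>();
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("book".equals(name)) {
                        // getAttributeValue(namespaceURI, localName); a null
                        // namespace means the namespace is not checked.
                        lines.add("book id=" + reader.getAttributeValue(null, "id"));
                    } else if ("title".equals(name)) {
                        // getElementText() reads the text content and leaves the
                        // cursor on the matching END_ELEMENT of <title>.
                        lines.add("title text=" + reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    lines.add("end " + reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return lines;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String line : describe(XML)) {
            System.out.println(line);
        }
    }
}
```

Note that the END_ELEMENT of title never reaches the loop, because getElementText() has already moved the cursor onto it before the next call to next().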

Playing with StAX

Reading XML

Now let's see how we can play around a bit with the StAX API. We need the StAX API JAR and an implementation of the StAX API. There are a couple of implementations available today. Let's use the Woodstox implementation.

Let's first print events from sample1.xml. (sample1.xml can be found in the resources folder of sources.zip)


<article:Article xmlns:article="http://www.article.org"
                 xmlns:author="http://author.org">
    <!-- This sample1.xml is used for samples in
         "Introducing StAX" article -->
    <Name>Introducing StAX</Name>
    <author:Author>Eran Chinthaka</author:Author>

    <?This_is_some_processing_instruction?>
</article:Article>


First you need to create an instance of XMLStreamReader. The StAX API provides XMLInputFactory to create an instance of XMLStreamReader.


FileInputStream fileInputStream = new FileInputStream(fileLocation);

XMLStreamReader xmlStreamReader =
        XMLInputFactory.newInstance().createXMLStreamReader(fileInputStream);

Then we have to ask the parser to proceed through each event. XMLStreamReader provides an iterator-like API to check the existence of the next event.


while (xmlStreamReader.hasNext()) {
    printEventInfo(xmlStreamReader);
}

xmlStreamReader.close();

This code will iterate until xmlStreamReader has no further events to be thrown. Note that closing the xmlStreamReader instance is not required, but is considered good programming practice.

Now we need to get the events from the parser and call appropriate methods to extract information about the XML.


// Body of printEventInfo; "reader" is the XMLStreamReader passed in by the loop above.
int eventCode = reader.next();

switch (eventCode) {
    case XMLStreamConstants.START_ELEMENT:
        System.out.println("event = START_ELEMENT");
        System.out.println("Localname = " + reader.getLocalName());
        break;
    case XMLStreamConstants.END_ELEMENT:
        System.out.println("event = END_ELEMENT");
        System.out.println("Localname = " + reader.getLocalName());
        break;
    case XMLStreamConstants.PROCESSING_INSTRUCTION:
        System.out.println("event = PROCESSING_INSTRUCTION");
        System.out.println("PIData = " + reader.getPIData());
        break;
    ..............................
    ..............................
    ..............................

The interesting thing to note here is that the user must call the parser to proceed by calling reader.next(). The parser will proceed to the next step only after that. This is the main difference between pull and push parsing. In push parsing, as with SAX, once the SAX parser starts sending events, the user or the client application has no control over it. In pull parsing, as seen here, the client application can decide the pace of parsing at its own discretion.
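
The following minimal sketch illustrates the pacing point: the parser advances only when the client asks it to. The class name PullPaceSketch, the nextElementName() helper and the tiny input document are hypothetical and used only for illustration. The sketch assumes the javax.xml.stream API bundled with Java SE 6 and later.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PullPaceSketch {

    /** Advances to the next START_ELEMENT and returns its local name, or null at the end. */
    static String nextElementName(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                return reader.getLocalName();
            }
        }
        return null;
    }

    /** Pulls a few elements, with the client deciding when parsing continues. */
    static List<String> demo() throws XMLStreamException {
        String xml = "<a><b/><c/></a>"; // hypothetical input
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> log = new ArrayList<>();
        try {
            log.add("element " + nextElementName(reader));
            // Nothing is parsed while the client is busy here.
            log.add("client pauses and does other work");
            log.add("element " + nextElementName(reader));
            log.add("element " + nextElementName(reader));
            // No start elements are left, so the helper returns null.
            log.add("element " + nextElementName(reader));
        } finally {
            reader.close();
        }
        return log;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

With a SAX handler, the equivalent pause would have to be simulated inside the callbacks, because the parser, not the client, drives the loop.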

Say you want to process only one element of the XML, if present. In this approach, you put a simple if statement in the START_ELEMENT handling code and you are done. If you do not want to process any XML after that, you can simply close the stream and forget about it, rather than parsing the whole XML.

One typical example of this kind of processing is when you relay pieces of XML. Most of the time the intermediary node will look for a particular XML element and will then forward it to the proper destination, without requiring the parsing of the whole XML chunk.
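
As a rough sketch of this kind of selective processing, the helper below pulls events only until it finds the first element with a given local name, returns its text and stops. The class name EarlyStopSketch, the firstElementText() helper and the trimmed-down copy of the sample document are assumptions made for this illustration. The sketch assumes the javax.xml.stream API bundled with Java SE 6 and later.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class EarlyStopSketch {

    // A trimmed-down version of sample1.xml (the comment is omitted).
    static final String SAMPLE =
        "<article:Article xmlns:article=\"http://www.article.org\""
        + " xmlns:author=\"http://author.org\">"
        + "<Name>Introducing StAX</Name>"
        + "<author:Author>Eran Chinthaka</author:Author>"
        + "<?This_is_some_processing_instruction?>"
        + "</article:Article>";

    /**
     * Returns the text of the first element with the given local name,
     * or null if there is none. No further events are pulled after a match.
     * The target element is assumed to contain text only.
     */
    static String firstElementText(String xml, String localName) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            // Close the reader and forget about the rest of the document.
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(firstElementText(SAMPLE, "Author"));
    }
}
```

A router would typically forward the message at the point where this sketch returns, instead of returning a string.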

When you run the above piece of code against sample1.xml, the output will be as follows:

event = START_ELEMENT
Localname = Article
========================
event = COMMENT
Comment = This sample1.xml is used for samples in "Introducing StAX" article
========================
event = START_ELEMENT
Localname = Name
========================
event = CHARACTERS
Characters = Introducing StAX
========================
event = END_ELEMENT
Localname = Name
========================
event = START_ELEMENT
Localname = Author
========================
event = CHARACTERS
Characters = Eran Chinthaka
========================
event = END_ELEMENT
Localname = Author
========================
event = PROCESSING_INSTRUCTION
PIData =
========================
event = END_ELEMENT
Localname = Article
========================
event = END_DOCUMENT
Document Ended
========================

Writing XML

Now let's try to write the same XML to the output using the XMLStreamWriter interface.

In just the same way as we create XMLStreamReader to read XML using the XMLInputFactory, we need to create an instance of XMLStreamWriter using the XMLOutputFactory.


XMLStreamWriter writer =
        XMLOutputFactory.newInstance().createXMLStreamWriter(outStream);

Then this writer can be used to write events. For example,

  • to write a start element : writer.writeStartElement("Name")
  • to write an end element : writer.writeEndElement()
  • to write a comment : writer.writeComment("This sample1.xml is used for samples in \"Introducing StAX\" article")
  • to write a namespace : writer.writeNamespace("author", "http://author.org")
  • to write text : writer.writeCharacters("Introducing StAX")
  • to write a processing instruction : writer.writeProcessingInstruction("This_is_some_processing_instruction")

Having written these events to the XMLStreamWriter you must flush and close the writer.


writer.flush();
writer.close();
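
Putting the calls above together, the following sketch writes a document equivalent to sample1.xml into a string. The class name WriteSampleSketch and the writeSample() helper are made up for this illustration, and the three-argument writeStartElement(prefix, localName, namespaceURI) overload is used here for the prefixed elements, which the list above does not show. The sketch assumes the javax.xml.stream API bundled with Java SE 6 and later; the exact serialization (for example, attribute quoting) may differ between implementations.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class WriteSampleSketch {

    /** Writes a document equivalent to sample1.xml and returns it as a string. */
    static String writeSample() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer =
            XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        // Root element with its prefix, followed by the namespace declarations.
        writer.writeStartElement("article", "Article", "http://www.article.org");
        writer.writeNamespace("article", "http://www.article.org");
        writer.writeNamespace("author", "http://author.org");

        writer.writeComment(
            " This sample1.xml is used for samples in \"Introducing StAX\" article ");

        writer.writeStartElement("Name");
        writer.writeCharacters("Introducing StAX");
        writer.writeEndElement();

        writer.writeStartElement("author", "Author", "http://author.org");
        writer.writeCharacters("Eran Chinthaka");
        writer.writeEndElement();

        writer.writeProcessingInstruction("This_is_some_processing_instruction");

        writer.writeEndElement(); // closes article:Article

        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(writeSample());
    }
}
```

The writer emits no indentation or line breaks of its own, so the result is a single line unless whitespace is written explicitly with writeCharacters().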

StAX API

StAX contains two distinct APIs for working with XML: the cursor API and the iterator API. What we have discussed so far is the cursor API. As you can see, the cursor API always points to one thing at a time, and it always moves forward, never backward.

The iterator API, on the other hand, represents the XML stream as a set of event objects. The base interface of the iterator API is XMLEvent, and there are sub-interfaces for each event type. The XMLEventReader interface has the following methods to interact with the XML info-set.


public XMLEvent nextEvent() throws XMLStreamException;
public boolean hasNext();
public XMLEvent peek() throws XMLStreamException;

More information on the iterator API can be found in the JSR 173 specification listed under Resources.
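
As a small sketch of the iterator API, the example below uses the three XMLEventReader methods shown above to list element names and to check, via peek(), whether text follows a start tag. The class name EventReaderSketch, the elementNames() helper and the input string are hypothetical. The sketch assumes the javax.xml.stream API bundled with Java SE 6 and later.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class EventReaderSketch {

    // Hypothetical input, loosely based on sample1.xml without namespaces.
    static final String XML =
        "<Article><Name>Introducing StAX</Name><Author>Eran Chinthaka</Author></Article>";

    /** Lists element names in document order, noting where text directly follows. */
    static List<String> elementNames(String xml) throws XMLStreamException {
        List<String> names = new ArrayList<>();
        XMLEventReader events =
            XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        try {
            while (events.hasNext()) {
                // Each event is a self-contained object, unlike the cursor API.
                XMLEvent event = events.nextEvent();
                if (event.isStartElement()) {
                    String name = event.asStartElement().getName().getLocalPart();
                    // peek() returns the upcoming event without consuming it.
                    XMLEvent upcoming = events.peek();
                    boolean textFollows = upcoming != null && upcoming.isCharacters();
                    names.add(textFollows ? name + " (text follows)" : name);
                }
            }
        } finally {
            events.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String name : elementNames(XML)) {
            System.out.println(name);
        }
    }
}
```

Because each XMLEvent is an object, events can be stored or passed around after the reader has moved on, which the cursor API does not allow.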

Advantages of Pull Parsing

Most applications that process XML benefit from stream parsing, and most of the time they do not require the entire DOM model in memory. Having mentioned that as the main advantage of pull parsing, let's look at the other aspects.

  • As we discussed at the start of this article, the client gains control in this parsing model and the parsing happens according to client requirements. In the push model, however, the client is "pushed" data, irrespective of whether it is needed.
  • Pull parsing libraries are much smaller than the respective push libraries, and even the client code that interacts with these libraries is small, even for complex documents.
  • Filtering of elements is easier, as the client knows that when a particular element comes in, it has time to make decisions.

Now let's compare StAX with some of the existing XML parsing technologies available today. (table adapted from "Does StAX Belong in Your XML Toolbox?" (http://www.developer.com/xml/article.php/3397691) by Jeff Ryan)

Feature | StAX | SAX | DOM | TrAX
API Style | Pull events; streaming | Push events; streaming | In memory tree based | XSLT Rule based templates
Ease of Use | High | Medium | High | Medium
XPath Capability | No | No | Yes | Yes
CPU and Memory Utilization | Good | Good | Depends | Depends
Forward Only | Yes | Yes | No | No
Reading | Yes | Yes | Yes | Yes
Writing | Yes | No | Yes | Yes
Create, Read, Update, Delete (CRUD) | No | No | Yes | No

Summary

This approach of XML processing gives more control to the client application than to the parser, enabling faster and memory-efficient processing. This is becoming a standard across different domains of XML processing. For example, Apache Axis2, one of the prominent SOAP processing engines, improved its performance four times, on average, over its predecessor by using a StAX-based XML processing model called Axiom. Axiom is more memory-efficient and performant than the existing object models available today due to the usage of StAX as its XML parsing technology.

Resources

  • Sample code for this article
  • JSR 173 Streaming API for XML Specification
  • Apache Axiom The XML Object Model which uses StAX as its underlying XML parsing methodology

Author

Eran Chinthaka, WS PMC Member, Member, Apache Software Foundation, chinthaka(!) at apache(!) dot org(!)