2008/08/17

Understanding Axiom Text Manipulation APIs

  • Eran Chinthaka
  • Software Engineer - WSO2

Applies To

Project/Language: Apache Axiom/Java

Table of Contents

  1. Introduction
  2. Axiom Text Handling API
  3. Summary

Introduction

As you might already know, Axiom is based on StAX parsing. While reading an XML document, a StAX parser reports events that depend on the information items it encounters. Axiom captures all of this information in the form of OMNode objects and adds them to the parent element.
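To make the event model concrete, here is a minimal sketch that uses only the JDK's javax.xml.stream API (not Axiom itself) to pull events from a small document and list their names. The class name StaxEventsSketch, the helper names, and the sample XML are illustrative assumptions, not part of Axiom.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEventsSketch {

    /** Pulls every event from the parser and returns a readable name for each one. */
    static List<String> eventNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                names.add(name(reader.next()));
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
        return names;
    }

    /** Maps the event codes discussed in this article to their constant names. */
    private static String name(int eventType) {
        switch (eventType) {
            case XMLStreamConstants.START_ELEMENT:    return "START_ELEMENT";
            case XMLStreamConstants.END_ELEMENT:      return "END_ELEMENT";
            case XMLStreamConstants.CHARACTERS:       return "CHARACTERS";
            case XMLStreamConstants.CDATA:            return "CDATA";
            case XMLStreamConstants.SPACE:            return "SPACE";
            case XMLStreamConstants.ENTITY_REFERENCE: return "ENTITY_REFERENCE";
            case XMLStreamConstants.END_DOCUMENT:     return "END_DOCUMENT";
            default:                                  return "OTHER(" + eventType + ")";
        }
    }

    public static void main(String[] args) {
        for (String eventName : eventNames("<greeting>Hello <b>Axiom</b></greeting>")) {
            System.out.println(eventName);
        }
    }
}
```

An object model such as Axiom sits on top of a loop like this one and turns each reported event into a node, rather than printing its name.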

Text content within an XML document can come from different events. Inside Axiom, we capture the following events from the parser and model the information items within those events as OMText nodes.

  1. XMLStreamConstants.CHARACTERS
  2. XMLStreamConstants.CDATA
  3. XMLStreamConstants.SPACE
  4. XMLStreamConstants.ENTITY_REFERENCE

Even though we capture all these information items within an OMText node, we save the type of the event in the type attribute of OMText. At the same time, a given element can contain more than one type of text information. Because Axiom is an XML object model that preserves the full information content, we don't throw away any of the information retrieved from the parser at any point. Even a space is an important part of the information content. If you already know about XML security, you will understand the importance of each and every character inside an XML document in the context of canonicalization.
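The point about spaces can be checked with a short sketch. It uses only the JDK's StAX parser, not Axiom, and collects the text of every text-bearing event, so white space between child elements shows up in the result. The class and method names are hypothetical. Entity references are left out because the JDK parser replaces them by default, and which event type a given parser uses for CDATA or white space can vary, so the sketch treats all three types alike.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TextEventsSketch {

    /** Concatenates the text of every text-bearing event, white space included. */
    static String allText(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                    case XMLStreamConstants.SPACE:
                        text.append(reader.getText());
                        break;
                    default:
                        // Not a text event: nothing to collect.
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The two single spaces around <b/> are reported by the parser, not dropped.
        System.out.println("[" + allText("<a> <b/> </a>") + "]");
    }
}
```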

When we were designing the Axiom API, we understood that a user's information requirements may vary, often based on the level of expertise and the context of use. Some users just need the text content inside an element, regardless of whether it is a combination of multiple character information items, and they need to get it easily. Others want to dig into the details of the character information, and they benefit from a more advanced API.

Let's now see how you can work with the Axiom API to get the text information you need.

Axiom Text Handling API

Convenience API to Get Character Information Items

The OMElement.getText() method retrieves all the OMText nodes of type XMLStreamConstants.CHARACTERS within a given element, concatenates them, and returns the result as a single string. This is neat and simple if you know what to expect from the text content. It is only a convenience method, however, and it does not retrieve all the text information from the OMElement. For example, white space nodes are not appended to the returned string.
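The filtering described above can be pictured with a small, self-contained model. This is not Axiom code: TextItem is a hypothetical stand-in for an OMText node (an event type plus its text), and charactersOnly only mirrors the behaviour as this article describes it, without claiming to match Axiom's actual implementation.

```java
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLStreamConstants;

class GetTextSketch {

    /** Hypothetical stand-in for a text node: the parser event type plus the text. */
    static final class TextItem {
        final int type;
        final String text;

        TextItem(int type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /** Concatenates only the CHARACTERS items and skips every other text type. */
    static String charactersOnly(List<TextItem> items) {
        StringBuilder result = new StringBuilder();
        for (TextItem item : items) {
            if (item.type == XMLStreamConstants.CHARACTERS) {
                result.append(item.text);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<TextItem> items = Arrays.asList(
                new TextItem(XMLStreamConstants.CHARACTERS, "Hello"),
                new TextItem(XMLStreamConstants.SPACE, "   "),
                new TextItem(XMLStreamConstants.CHARACTERS, "World"));
        System.out.println(charactersOnly(items)); // prints "HelloWorld"
    }
}
```

The SPACE item is silently skipped, which is exactly why the convenience method is not enough when every character matters.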

Getting All Text Information

If you want to get all the text information as it is, you need to retrieve all the text nodes from an element and process them yourself. This is very similar to how the org.w3c.dom API exposes text content.

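        import java.util.Iterator;
        import javax.xml.stream.XMLStreamConstants;
        import org.apache.axiom.om.OMNode;
        import org.apache.axiom.om.OMText;

        // 'element' is an OMElement obtained elsewhere.
        Iterator children = element.getChildren();
        while (children.hasNext()) {
            OMNode omNode = (OMNode) children.next();
            // Only text nodes are of interest here; skip child elements, comments, etc.
            if (omNode instanceof OMText) {
                OMText omText = (OMText) omNode;
                // The node type records which parser event produced this text.
                switch (omText.getType()) {
                    case XMLStreamConstants.CHARACTERS:
                        // process characters
                        break;
                    case XMLStreamConstants.CDATA:
                        // process CDATA
                        break;
                    case XMLStreamConstants.ENTITY_REFERENCE:
                        // process entity references
                        break;
                    case XMLStreamConstants.SPACE:
                        // process white spaces
                        break;
                    default:
                        // no other text types are expected
                        break;
                }
            }
        }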
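For comparison with the org.w3c.dom API mentioned above, the following sketch walks the children of a DOM element in the same way, using only JDK classes. The class name DomTextSketch and the "TEXT:"/"CDATA:" labels are illustrative assumptions. It relies on the JDK's default DocumentBuilderFactory settings, which are non-coalescing, so CDATA sections are kept as separate nodes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomTextSketch {

    /** Lists the text-like children of the root element, labelled by node type. */
    static List<String> textChildren(String xml) {
        List<String> found = new ArrayList<>();
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            for (Node node = root.getFirstChild(); node != null; node = node.getNextSibling()) {
                switch (node.getNodeType()) {
                    case Node.TEXT_NODE:
                        found.add("TEXT:" + node.getNodeValue());
                        break;
                    case Node.CDATA_SECTION_NODE:
                        found.add("CDATA:" + node.getNodeValue());
                        break;
                    case Node.ENTITY_REFERENCE_NODE:
                        found.add("ENTITY_REFERENCE:" + node.getNodeName());
                        break;
                    default:
                        // Child elements, comments, and so on are not text.
                        break;
                }
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
        return found;
    }

    public static void main(String[] args) {
        for (String item : textChildren("<a>Hello <![CDATA[World]]><b/> </a>")) {
            System.out.println("[" + item + "]");
        }
    }
}
```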

Working with Text as QNames

There is another tricky aspect of certain text content: some text items are namespace qualified. If you are familiar with SOAP fault messages, you will have seen error codes like <Code>env:Server</Code>. Here, the prefix env is associated with a namespace within this context, usually the SOAP envelope's namespace. Such situations require users to combine namespace information with text content. The convenience methods described below exist because this functionality is in great demand from the Axis2 SOAP engine, which is the main client of the Axiom object model.

The OMElement.getTextAsQName() method tries to resolve the namespace information within the text content and returns a QName. During this resolution, it converts a prefix:local QName string into a proper QName by evaluating it in the OMElement's context. If the text content contains an unprefixed QName, it is resolved to the local namespace, without harming the information content within it.
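The resolution step can be sketched with JDK classes only. This is an analogue built on org.w3c.dom, not Axiom's implementation: the class TextAsQNameSketch and its methods are hypothetical, and it assumes that an unprefixed value should pick up the default namespace in scope, which is one reading of "the local namespace" above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TextAsQNameSketch {

    /** Resolves "prefix:local" text against the namespaces in scope on the element. */
    static QName resolve(Element element) {
        String text = element.getTextContent().trim();
        int colon = text.indexOf(':');
        String prefix = colon < 0 ? "" : text.substring(0, colon);
        String local = colon < 0 ? text : text.substring(colon + 1);
        // DOM expects null, not "", when asking for the default namespace.
        String uri = element.lookupNamespaceURI(prefix.isEmpty() ? null : prefix);
        // A null URI (nothing in scope) becomes "no namespace" inside QName.
        return new QName(uri, local, prefix);
    }

    /** Parses a document with namespace support and returns its root element. */
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so xmlns declarations are understood
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        Element code = parse(
                "<Code xmlns:env=\"http://schemas.xmlsoap.org/soap/envelope/\">env:Server</Code>");
        // prints "{http://schemas.xmlsoap.org/soap/envelope/}Server"
        System.out.println(resolve(code));
    }
}
```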

The Axiom API also has a method to set the text content as a QName. For example, you might want to have an element like <MyElement xmlns:myns="https://example.org">myns:Information</MyElement>. You can first declare the myns namespace on MyElement and then add the text content. The easier way, however, is to pass the namespace information along with the text content, as shown below.

myElement.setText(new QName("https://example.org", "Information", "myns"));

When you set the namespace information in the text QName, Axiom first checks for a namespace matching the given information. If it finds one, it associates the text content with it. If not, it declares the namespace on the Axiom element and associates the text content with it.
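The find-or-declare logic can be illustrated with a JDK-only analogue. This sketch uses org.w3c.dom and is not how Axiom implements setText(QName). The class SetTextAsQNameSketch and its helpers are hypothetical, and it assumes the QName carries a non-empty prefix.

```java
import javax.xml.XMLConstants;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SetTextAsQNameSketch {

    /** Writes the QName as "prefix:local" text, declaring the namespace if needed. */
    static void setText(Element element, QName qname) {
        if (qname.getPrefix().isEmpty()) {
            throw new IllegalArgumentException("This sketch needs a prefixed QName");
        }
        String uri = qname.getNamespaceURI();
        // Reuse a prefix that is already bound to this namespace, if any.
        String prefix = element.lookupPrefix(uri);
        if (prefix == null) {
            // Nothing in scope: declare xmlns:prefix="uri" on the element itself.
            prefix = qname.getPrefix();
            element.setAttributeNS(XMLConstants.XMLNS_ATTRIBUTE_NS_URI, "xmlns:" + prefix, uri);
        }
        element.setTextContent(prefix + ":" + qname.getLocalPart());
    }

    /** Creates a new document whose root element has the given name. */
    static Element newElement(String name) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder().newDocument();
            Element element = document.createElementNS(null, name);
            document.appendChild(element);
            return element;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        Element myElement = newElement("MyElement");
        setText(myElement, new QName("https://example.org", "Information", "myns"));
        System.out.println(myElement.getTextContent()); // prints "myns:Information"
    }
}
```

If the element already binds the namespace to another prefix, the sketch reuses that prefix instead of adding a second declaration.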

As we mentioned earlier, it is worth noting that these are convenience methods that came out of the requirements of the Axis2 project and do not necessarily belong in an XML object model.

Summary

We discussed different ways of extracting text information from Axiom, depending on your information requirements. Axiom provides additional convenience methods for working with special text content, while preserving all the information items received from the parser or the user and exposing them through the common text extraction methods.

Author(s)

Author : Eran Chinthaka, WS PMC Member/Member-Apache Software Foundation, eran(dot)chinthaka(Y)gmail.com, where y=@

 
