CDATA Sections with XmlBeans

May 17th, 2008

CDATA sections are a nagging problem for Java <-> XML serialization libraries. They’re not tags as such, and they’re usually ignored when reading the text content of a node. Writing is a different story: when a Java object has to be serialized as an XML string, there are four options available:

  1. output a node’s content as-is: this could lead to malformed XML;
  2. always enclose a node’s content in a CDATA section: this, too, is a less-than-optimal solution, since 1) we don’t know if it’s really necessary, and 2) the size of the output increases and, after all, there are scenarios where size matters, whatever they say;
  3. parse the node’s content and substitute problematic characters with the corresponding entities (e.g. &lt; for <) or character references (e.g. &#nnn;, where nnn is the decimal Unicode code point of that specific character);
  4. parse the node’s content and, if problematic characters are present, enclose the whole node’s content in a CDATA section.
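
To make options 3 and 4 a bit more concrete, here is a minimal sketch in plain Java. It is not how XmlBeans (or any other library) does it; the class and method names are made up for illustration, and only the characters that can break element content are handled.

```java
final class XmlTextWriter {

    /** Option 3: replace the characters that would break the markup. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '&': out.append("&amp;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Option 4: wrap the whole content in CDATA, but only when needed. */
    static String cdataIfNeeded(String text) {
        boolean problematic = text.indexOf('<') >= 0
                || text.indexOf('&') >= 0
                || text.contains("]]>");
        if (!problematic) {
            return text;
        }
        // A literal "]]>" would close the section early,
        // so it is split across two adjacent CDATA sections.
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        String content = "Ampersand and less-than: & <";
        System.out.println(escape(content));
        System.out.println(cdataIfNeeded(content));
    }
}
```

Note the trade-off: escaping grows the text a little for every problematic character, while CDATA adds a fixed 12 characters per section no matter how many there are.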

XmlBeans lets you control explicitly when CDATA is used during serialization, but only since version 2.3.0, through a couple of methods in the XmlOptions class (namely, setSaveCDataLengthThreshold and setSaveCDataEntityCountThreshold).

The last released version of the previous development branch (1.0.4) has no such methods; it decides on its own when to enclose a node’s content in a CDATA section and when to choose the character substitution route instead. In effect, it combines the third and fourth options mentioned above.

The chosen algorithm is somewhat arbitrary (for all curious people out there: it’s implemented in the private method entitizeContent of the inner class Parse, inside the org.apache.xmlbeans.impl.store.Saver class).
To summarize, this XmlBeans version uses CDATA sections only if all of the following hold:

  • the node’s content contains more than 32 characters;
  • said content contains at least 6 characters that should be replaced by their respective entities;
  • the total number of those characters represents more than 1% of the length of the whole text.

In any other case, the problematic characters are replaced with their entities instead.
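
The three conditions can be written down as a small predicate. This is only my paraphrase of the rules listed above, not the actual XmlBeans source: the names are invented, and I’m assuming that only < and & count as characters needing an entity.

```java
final class CDataHeuristic {

    /** Counts the characters that would need an entity (assumption: only '<' and '&'). */
    static int countEntityChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '<' || c == '&') {
                count++;
            }
        }
        return count;
    }

    /** True when all three conditions from the list hold. */
    static boolean usesCData(String text) {
        int length = text.length();
        int count = countEntityChars(text);
        return length > 32               // more than 32 characters
                && count >= 6            // at least 6 characters to replace
                && count * 100 > length; // they make up more than 1% of the text
    }

    public static void main(String[] args) {
        // 28 characters, 2 of them problematic: fails the first two conditions
        System.out.println(usesCData("Ampersand and less-than: & <"));
    }
}
```

Seen this way, a short string with one or two ampersands never gets a CDATA section, which is exactly the case I ran into.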

What made me go and look for such a thing? Necessity, as is often the case. I was forced to use this particular (old) version of XmlBeans, since it was already in use and shared with other development teams in these parts. So I had to come up with a workaround:

import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.XmlObject;

/* XmlHelper is the helper class linked below;
 * it is assumed to live in the same package */
public class CDataExample {

	public static void main(String[] args) throws Exception {
		/* Our test XML */
		String xml = "<root><first></first><second></second></root>";

		/* Let's build our XmlObject based
		 * on the XML string just defined */
		XmlObject xmlObject = XmlObject.Factory.parse(xml);

		/* The problematic content */
		String content = "Ampersand and less-than: & <";

		/* We'll add this content through a cursor:
		 * move from the document start to <root>, then to <first> */
		XmlCursor cursor = xmlObject.newCursor();
		cursor.toNextToken();
		cursor.toChild("first");

		/* The first node shall have our "raw" content */
		cursor.setTextValue(content);

		/* The second node shall have our content
		 * in a CDATA section */
		cursor.toNextSibling();
		cursor.setTextValue(XmlHelper.insertCDATATrigger(content));

		/* We're done with the cursor */
		cursor.dispose();

		/* Here's the result */
		System.out.println(XmlHelper.getStringFromXmlBean(xmlObject));

		/*
		 * <root>
		 *   <first>Ampersand and less-than: &amp; &lt;</first>
		 *   <second><![CDATA[Ampersand and less-than: & <]]></second>
		 * </root>
		 */
	}
}

Every time I want to be sure that a node’s content will be comfortably wrapped in CDATA markers, I use the insertCDATATrigger method when setting the node’s text value. Afterwards, when it’s time to serialize, getStringFromXmlBean comes into play (I needed a String as output, but it can be easily adapted if you use one of the other save methods).
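
XmlHelper itself isn’t reproduced in this post, so what follows is only a guess at how a trigger of this kind could work, not the actual class: append a marker holding enough & characters to satisfy the three conditions listed earlier, then strip the marker from the serialized string. Every name here (CDataTriggerSketch, addTrigger, stripTrigger, the %%TRIGGER marker) is hypothetical, and the sketch has not been run against XmlBeans 1.0.4.

```java
final class CDataTriggerSketch {

    // Hypothetical marker; it must not occur in real content.
    private static final String PREFIX = "%%TRIGGER";
    private static final String SUFFIX = "%%";

    /** Appends a marker with enough '&' characters to meet all three conditions. */
    static String addTrigger(String content) {
        int n = content.length() + PREFIX.length() + SUFFIX.length();
        // At least 6 special characters, total length above 32,
        // and more than 1% of the final text (99 * k > n is sufficient).
        int k = Math.max(6, Math.max(33 - n, n / 99 + 1));
        StringBuilder sb = new StringBuilder(content).append(PREFIX);
        for (int i = 0; i < k; i++) {
            sb.append('&');
        }
        return sb.append(SUFFIX).toString();
    }

    /** Removes the marker from the serialized output. */
    static String stripTrigger(String serialized) {
        return serialized.replaceAll("%%TRIGGER&+%%", "");
    }

    /** The three conditions, counting only '<' and '&' (an assumption). */
    static boolean meetsConditions(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '<' || c == '&') {
                count++;
            }
        }
        return text.length() > 32 && count >= 6 && count * 100 > text.length();
    }

    public static void main(String[] args) {
        String triggered = addTrigger("Ampersand and less-than: & <");
        System.out.println(meetsConditions(triggered));
        System.out.println(stripTrigger("<second><![CDATA[" + triggered + "]]></second>"));
    }
}
```

The marker only survives as literal text because the content ends up inside a CDATA section; if the serializer chose entities instead, the & characters would come out as &amp; and the stripping pattern would no longer match.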

The complete class is available here: XmlHelper.java

The elegance of this kind of solution is rather questionable, I can see that. But it works and, short of re-implementing serialization, or modifying your own XmlBeans code (which could be generated code, the kind it’s better not to prod), I find it to be sensible enough.


Incipit

April 30th, 2008

You have to start somewhere. So, I’ll have to feign indifference towards the blank page looking at me, and begin to admit it: yes, it appears I’m entering the blogging tunnel.

In my own (partial) defense, I can point out that this blog will be mainly of the techie persuasion, ranging from code fragments (mine and others’) to software reviews, intermingled with some of my own ramblings (which will probably use technology and computer science only as a starting point).

All right, I know: the same can be said of hundreds of thousands of other blogs; you’ll just have to trust me… 8-)

Moreover, I can’t be any more precise (yet) – and, now that I think of it, I wouldn’t want to, either: there’s a balance between legitimate expectations and absolute boredom, and it’s quite a faint one, if you ask me.

Anyway, I’ve been toying with computer keyboards for the last twenty years (I probably like it) and my tastes and interests have been updating themselves following technology’s pace, and were naturally conditioned by its social impact.

What am I trying to tell you? That I’m the first not to know where this thing will be going… :-) …but I believe in discovery being part and parcel of having fun, so I’ll just have to wish good exploration to you and me both.
