CDATA Sections with XmlBeans
CDATA sections are a nagging problem for Java <-> Xml serialization libraries. They’re not tags, as such, and when the text content of a node is read the markers are usually ignored. Writing is a different matter: when a Java object has to be serialized as an Xml string, there are four options available:
- output a node’s content as-is: this could lead to malformed Xml;
- always enclose a node’s content in a CDATA section: this, too, is a less-than-optimal solution, since 1) we don’t know if it’s really necessary, and 2) the size of the output increases and, after all, there are scenarios where size matters, whatever they say;
- parse the node’s content and substitute problematic characters with the corresponding entities (e.g. &lt; for <) or character references (e.g. &#nnn;, where nnn stands for the Unicode code point of that specific character);
- parse the node’s content and, if problematic characters are present, enclose the whole node’s content in a CDATA section.
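The third and fourth options can be sketched in plain Java, with no XmlBeans involved. This is only an illustration: the class and method names are made up for this example, and the set of "problematic" characters is assumed to be &, < and > plus the ]]> sequence, which cannot appear inside a CDATA section and has to be split across two sections.

```java
class CDataOptionsSketch {

    /** Option 3: replace markup-significant characters with entities. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Option 4: wrap the whole content in CDATA, but only when needed. */
    static String wrapIfNeeded(String text) {
        boolean needed = text.indexOf('&') >= 0
                || text.indexOf('<') >= 0
                || text.contains("]]>");
        if (!needed) {
            return text; // nothing problematic, emit as-is
        }
        // "]]>" would close the section early, so split it in two:
        // "]]" ends the first section, ">" starts the next one.
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }
}
```

For the string `Ampersand and less-than: & <`, `escape` yields `Ampersand and less-than: &amp; &lt;`, while `wrapIfNeeded` yields `<![CDATA[Ampersand and less-than: & <]]>`.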
XmlBeans allows explicit management of content during serialization with regard to CDATA, but only since version 2.3.0, through a couple of methods in the XmlOptions class (namely, setSaveCDataLengthThreshold and setSaveCDataEntityCountThreshold).
The last released version of the previous development branch (1.0.4) has no such methods; it decides on its own when to enclose a node’s content in a CDATA section and when to choose the character substitution route instead. In effect, it combines the third and fourth options mentioned above.
The chosen algorithm is somewhat arbitrary (for all curious people out there: it’s implemented in the private method entitizeContent of the inner class Parse, inside the org.apache.xmlbeans.impl.store.Saver class).
To summarize, this XmlBeans version uses CDATA sections only if all of the following conditions hold:
- the node’s content contains more than 32 characters;
- said content contains at least 6 characters that should be replaced by their respective entities;
- the total number of those characters represents more than 1% of the length of the whole text.
In any other case, the problematic characters are replaced by their entities.
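The three conditions above can be written down as a small predicate. This is a sketch of the summary given here, not a copy of the XmlBeans source: the class and method names are invented, and the characters counted as "to be replaced" are assumed to be & and <.

```java
class CDataHeuristicSketch {

    static final int MIN_LENGTH = 32;  // content must be longer than this
    static final int MIN_SPECIAL = 6;  // at least this many replaceable characters

    /** Counts the characters that would otherwise become entities (assumed: & and <). */
    static int countSpecial(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '&' || c == '<') {
                count++;
            }
        }
        return count;
    }

    /** True when all three conditions from the summary are met. */
    static boolean prefersCData(String text) {
        int length = text.length();
        int special = countSpecial(text);
        return length > MIN_LENGTH
                && special >= MIN_SPECIAL
                && (long) special * 100 > length; // more than 1% of the text
    }
}
```

Under these assumptions, a short string such as `Ampersand and less-than: & <` (28 characters, 2 of them replaceable) fails the first two conditions, so it would be written with entities rather than wrapped in CDATA.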
What made me go and look for such a thing? Necessity, as is often the case. I was forced to use this particular (old) version of XmlBeans, since it was already in use and shared with other development teams in these parts. So I had to come up with a workaround:
import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.XmlObject;

/* XmlHelper is the helper class linked after this listing;
 * the name of this wrapper class is arbitrary. */
public class CDataExample {

    public static void main(String[] args) throws Exception {
        /* Our test XML */
        String xml = "<root><first></first><second></second></root>";

        /* Let's build our XmlObject based
         * on the XML string just defined */
        XmlObject xmlObject = XmlObject.Factory.parse(xml);

        /* The problematic content */
        String content = "Ampersand and less-than: & <";

        /* We'll add this content through a cursor */
        XmlCursor cursor = xmlObject.newCursor();
        cursor.toNextToken();
        cursor.toChild("first");

        /* The first node shall have our "raw" content */
        cursor.setTextValue(content);

        /* The second node shall have our content
         * in a CDATA section */
        cursor.toNextSibling();
        cursor.setTextValue(XmlHelper.insertCDATATrigger(content));

        /* Release the cursor once we're done with it */
        cursor.dispose();

        /* Here's the result */
        System.out.println(XmlHelper.getStringFromXmlBean(xmlObject));
        /*
         * <root>
         * <first>Ampersand and less-than: &amp; &lt;</first>
         * <second><![CDATA[Ampersand and less-than: & <]]></second>
         * </root>
         */
    }
}
Every time I want to be sure that a node’s content will be comfortably wrapped in CDATA markers, I use the insertCDATATrigger method when setting the node’s text value. Afterwards, when it’s time to serialize, here comes getStringFromXmlBean (I needed a String as output, but it can easily be adapted if you use one of the other save methods).
The complete class is available here: XmlHelper.java
The elegance of this kind of solution is rather questionable, I can see that. But it works and, short of re-implementing serialization, or modifying your own XmlBeans code (which could be generated code, the kind it’s better not to prod), I find it to be sensible enough.