Sanitize XML String


The following table lists the ranges of valid XML 1.0 characters. Any character outside these ranges is not allowed.

Hexadecimal          Decimal
#x9                  #9
#xA                  #10
#xD                  #13
#x20-#xD7FF          #32-#55295
#xE000-#xFFFD        #57344-#65533
#x10000-#x10FFFF     #65536-#1114111

In other words: any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.

ref : http://www.w3.org/TR/REC-xml/#charsets.

CDATA sections are not an exception to this rule: the specification defines their content in terms of the same character ranges, so a character outside the above ranges is invalid there as well.

For example, if data comes from a cut-and-paste operation on a Microsoft Word document, you may end up with 0x1A characters. Later, when the XML data is parsed, an exception such as "hexadecimal value 0x1A, is an invalid character" will be thrown.
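To see this kind of failure on the Java side, the sketch below feeds a document containing 0x1A to the JDK's built-in DOM parser. The class name InvalidCharDemo and the helper parses are hypothetical names made up for this illustration, and the exact wording of the error depends on the parser in use.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original how-to.
class InvalidCharDemo {

    // Returns true if the parser accepts the document,
    // false if it rejects it with a SAXParseException.
    public static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (Exception e) {
            // Configuration or I/O problems are unrelated to the demo.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        char sub = 0x1A; // the character a Word paste may leave behind
        System.out.println(parses("<note>ok</note>"));
        System.out.println(parses("<note>bad" + sub + "char</note>"));
    }
}
```

The first document should be accepted and the second rejected, which is why the string has to be cleaned before it reaches the parser.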

The following methods remove all invalid XML characters from a given string (CDATA sections are not handled specially).

Using Regex

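  import java.util.regex.Pattern;

  // ref : http://www.w3.org/TR/REC-xml/#charsets
  // Matches any character outside the valid XML ranges.
  // The \x{h...h} code point syntax requires JDK 7 or later.
  // Compiled once: Pattern instances are immutable and thread-safe.
  private static final Pattern XML_INVALID_CHARS =
    Pattern.compile(
       "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]"
      );

  public static String sanitizeXmlChars(String xml) {
    if (xml == null || xml.isEmpty()) return "";
    return XML_INVALID_CHARS.matcher(xml).replaceAll("");
  }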

Using StringBuilder and for-loop

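  public static String sanitizeXmlChars(String in) {
    if (in == null || in.isEmpty()) return "";
    StringBuilder out = new StringBuilder(in.length());

    // Iterate by code point, not by char: a char is 16 bits wide and can
    // never reach 0x10000, so a char-based loop would drop both halves of
    // every surrogate pair and lose valid supplementary characters.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.appendCodePoint(current);
        // Advance by 2 chars for a surrogate pair, 1 otherwise.
        i += Character.charCount(current);
    }
    return out.toString();
  }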
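
A Java String stores characters above #xFFFF as two surrogate chars, so the range test has to be applied to whole code points rather than to individual chars. As an alternative sketch (assuming Java 8 or later for String.codePoints()), the same filter can be written with a stream. The class name CodePointSanitizer and the helper isXmlChar are hypothetical names used only for this illustration.

```java
// Hypothetical demo class, not part of the original how-to.
class CodePointSanitizer {

    // True if the code point falls in one of the valid XML 1.0 ranges.
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Keeps only valid code points; unpaired surrogates are dropped
    // because codePoints() reports them as values in D800-DFFF.
    public static String sanitize(String in) {
        if (in == null || in.isEmpty()) return "";
        return in.codePoints()
                 .filter(CodePointSanitizer::isXmlChar)
                 .collect(StringBuilder::new,
                          StringBuilder::appendCodePoint,
                          StringBuilder::append)
                 .toString();
    }

    public static void main(String[] args) {
        // U+1F600 is a valid supplementary character (a surrogate pair).
        String emoji = new String(Character.toChars(0x1F600));
        String dirty = "a" + (char) 0x1A + "b" + emoji + (char) 0xFFFE;
        // 0x1A and 0xFFFE are removed; the supplementary character survives.
        System.out.println(sanitize(dirty).equals("ab" + emoji));
    }
}
```

Whether the stream form or the explicit loop reads better is a matter of taste; both make a single pass over the input.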
