Handle UTF8 file with BOM
Tag(s): IO
From Wikipedia, the byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
The common BOMs are:
Encoding | Representation (hexadecimal) | Representation (decimal) |
UTF-8 | EF BB BF | 239 187 191 |
UTF-16 (BE) | FE FF | 254 255 |
UTF-16 (LE) | FF FE | 255 254 |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 |
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 |
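The table above can be turned into a small lookup. The following is only a sketch: the class name BomDetector and its detect method are hypothetical helpers written for this illustration, not part of the JDK. Note that the UTF-32 (LE) mark begins with the same two bytes as the UTF-16 (LE) mark, so the longer pattern is tested first. A UTF-16 (LE) file whose first character is U+0000 would be misreported by this simple approach.

```java
// Hypothetical helper (not a JDK class): maps the leading bytes of a file
// to the encoding announced by its BOM, following the table above.
final class BomDetector {

    private BomDetector() {}

    /** Returns the encoding announced by the BOM, or null when no known BOM is found. */
    static String detect(byte[] head) {
        // UTF-32 (LE) must be tested before UTF-16 (LE): both start with FF FE.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data (as unsigned values) with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(utf8)); // prints UTF-8
    }
}
```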
UTF-8 files are a special case because adding a BOM to them is not recommended. The presence of a UTF-8 BOM can break other tools, Java among them. Java assumes that UTF-8 data does not start with a BOM, so if one is present it is not discarded and is seen as data.
To create a UTF-8 file with a BOM, open Windows Notepad, create a simple text file and save it as utf8.txt with the encoding UTF-8 (recent versions of Notepad list this option as "UTF-8 with BOM").
Now if you examine the file content as binary, you see the BOM (EF BB BF) at the beginning.
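If no hex viewer is at hand, the first bytes can be printed from Java itself. This is a minimal sketch: the class HeadDump and its toHex method are names made up for this example, and the default path is the one used in this article. The whole file is read into memory, which is fine for a small test file but not for large ones.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: prints the first bytes of a file in hexadecimal.
final class HeadDump {

    private HeadDump() {}

    /** Formats at most count leading bytes of data as space-separated hex pairs. */
    static String toHex(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count && i < data.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the article; pass another file as first argument.
        String name = args.length > 0 ? args[0] : "c:/temp/utf8.txt";
        byte[] content = Files.readAllBytes(Paths.get(name));
        System.out.println(toHex(content, 8));
    }
}
```

For a file that contains helloworld and was saved with a BOM, the line printed should start with EF BB BF, followed by the bytes of the text (68 65 6C 6C 6F for "hello").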
If we read it with Java, the BOM is returned as part of the first line:
import java.io.*;

public class x {
    public static void main(String args[]) {
        try {
            FileInputStream fis = new FileInputStream("c:/temp/utf8.txt");
            BufferedReader r = new BufferedReader(new InputStreamReader(fis, "UTF8"));
            // print each line as decoded; the BOM is not skipped
            for (String s = ""; (s = r.readLine()) != null;) {
                System.out.println(s);
            }
            r.close();
            System.exit(0);
        }
        catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
}
The output is:
?helloworld
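The leading "?" is most likely the console's substitute for a character it cannot display. The character itself is U+FEFF, decoded from the three BOM bytes. The sketch below shows this without any file: the class BomAsData is a hypothetical name, and the input bytes are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows that the UTF-8 decoder keeps the BOM as a character.
final class BomAsData {

    private BomAsData() {}

    /** Decodes the bytes as UTF-8; a leading BOM is kept in the result. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // EF BB BF followed by the two ASCII letters "hi"
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String s = decode(raw);
        System.out.println(s.length());                       // prints 3
        System.out.println(Integer.toHexString(s.charAt(0))); // prints feff
    }
}
```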
This behaviour is documented in the Java bug database. There will be no fix for now because it would break existing tools like javadoc or XML parsers.
The Apache Commons IO library provides some tools to handle this situation. Its BOMInputStream class detects the BOM and, if required, can automatically skip it and return the subsequent byte as the first byte in the stream.
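The same idea can be sketched with the JDK alone, for cases where adding a library is not an option. This is not the Commons IO implementation: Utf8BomSkipper and skipUtf8Bom are hypothetical names, and only the UTF-8 BOM is handled. The stream is wrapped in a PushbackInputStream so that the first three bytes can be put back when they are not a BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper: returns a stream positioned after a UTF-8 BOM, if any.
final class Utf8BomSkipper {

    private Utf8BomSkipper() {}

    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may deliver fewer bytes than requested, so loop until 3 bytes or EOF
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r == -1) break;
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n); // not a BOM: give the bytes back to the caller
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(raw));
        System.out.println((char) in.read()); // prints h
    }
}
```

The returned stream can then be passed to an InputStreamReader as in the earlier example.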
Or you can do it manually. The next example converts a UTF-8 file to ANSI (Cp1252). We check the first line for the presence of the BOM and, if it is present, we simply discard it.
import java.io.*;

public class UTF8ToAnsiUtils {

    // FEFF because this is the Unicode char represented by the UTF-8 byte order mark (EF BB BF).
    public static final String UTF8_BOM = "\uFEFF";

    public static void main(String args[]) {
        try {
            if (args.length != 2) {
                System.out.println("Usage : java UTF8ToAnsiUtils utf8file ansifile");
                System.exit(1);
            }
            boolean firstLine = true;
            FileInputStream fis = new FileInputStream(args[0]);
            BufferedReader r = new BufferedReader(new InputStreamReader(fis, "UTF8"));
            FileOutputStream fos = new FileOutputStream(args[1]);
            Writer w = new BufferedWriter(new OutputStreamWriter(fos, "Cp1252"));
            for (String s = ""; (s = r.readLine()) != null;) {
                if (firstLine) {
                    s = UTF8ToAnsiUtils.removeUTF8BOM(s);
                    firstLine = false;
                }
                w.write(s + System.getProperty("line.separator"));
                w.flush();
            }
            w.close();
            r.close();
            System.exit(0);
        }
        catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
            s = s.substring(1);
        }
        return s;
    }
}