This can be useful before inserting data into a database, to make sorting easier.
Technique 1
This is a simple technique using the sun.text.Normalizer class. However, since the class is in the sun.* package,
it is considered outside of the Java platform, can be different across OS platforms
(Solaris, Windows, Linux, Macintosh, etc.) and can change at any time without notice
with SDK versions (1.2, 1.2.1, 1.2.3, etc). In general, writing Java programs that rely on
sun.* is risky: they are not portable and are not supported.
For an alternative to the sun.text.Normalizer class, you may want to take a look at IBM's ICU4J project on SourceForge.
We call normalize() with the DECOMP option (for decomposition, see Unicode Normalization). So if we pass à, the method returns a + ` . Then, using a regular expression, we clean up the string to keep only valid US-ASCII characters.
JDK1.4
import sun.text.Normalizer;

public class Accent {
    public static String value = "é à î _ @";

    public static void main(String args[]) throws Exception {
        System.out.println(formatString(value));
        // output : e a i _ @
    }

    public static String formatString(String s) {
        // decompose each accented letter into base letter + accent,
        // then drop everything that is not US-ASCII
        String temp = Normalizer.normalize(s, Normalizer.DECOMP, 0);
        return temp.replaceAll("[^\\p{ASCII}]","");
    }
}
A note from ajmacher:
The Normalizer API changed in JDK6... it can now be found in java.text.Normalizer and its usage is slightly different (but enough to break it), so Technique 1 will cause compiler errors in JDK6. Try :
java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD);
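For JDK 6 and later, the note above can be turned into a complete class. This is only a minimal sketch: the class name AccentJdk6 is made up for illustration, and the test string is written with Unicode escapes so that the source compiles whatever the file encoding is.

```java
import java.text.Normalizer;

final class AccentJdk6 {

    // Decompose to NFD (base letter followed by combining marks),
    // then remove every character that is not US-ASCII.
    static String formatString(String s) {
        String temp = Normalizer.normalize(s, Normalizer.Form.NFD);
        return temp.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // e-acute, a-grave, i-circumflex, then plain ASCII
        String value = "\u00E9 \u00E0 \u00EE _ @";
        System.out.println(formatString(value));
        // output : e a i _ @
    }
}
```

Note that a character with no canonical decomposition, such as ø (U+00F8), is not converted to a base letter; the regular expression simply removes it.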
Technique 2
As an alternative, replaceAll() and regular expressions on a String can also be used :
public class Test {
    public static void main(String args[]) {
        String s = "È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô";
        // lowercase letters
        s = s.replaceAll("[èéêë]","e");
        s = s.replaceAll("[ûù]","u");
        s = s.replaceAll("[ïî]","i");
        s = s.replaceAll("[àâ]","a");
        s = s.replaceAll("ô","o");
        // uppercase letters
        s = s.replaceAll("[ÈÉÊË]","E");
        s = s.replaceAll("[ÛÙ]","U");
        s = s.replaceAll("[ÏÎ]","I");
        s = s.replaceAll("[ÀÂ]","A");
        s = s.replaceAll("Ô","O");
        System.out.println(s);
        // output : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o
    }
}
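The same replacements can also be driven from a small table, so that supporting another letter only means adding a row. This is a sketch under the same assumptions as Technique 2 (only the French accented letters listed above are handled); the names AccentTable, RULES and strip are hypothetical, and the letters are written as Unicode escapes to avoid depending on the source file encoding.

```java
final class AccentTable {

    // Each row: a regular expression and its ASCII replacement.
    private static final String[][] RULES = {
        {"[\u00E8\u00E9\u00EA\u00EB]", "e"},   // e with grave, acute, circumflex, diaeresis
        {"[\u00FB\u00F9]", "u"},
        {"[\u00EF\u00EE]", "i"},
        {"[\u00E0\u00E2]", "a"},
        {"\u00F4", "o"},
        {"[\u00C8\u00C9\u00CA\u00CB]", "E"},
        {"[\u00DB\u00D9]", "U"},
        {"[\u00CF\u00CE]", "I"},
        {"[\u00C0\u00C2]", "A"},
        {"\u00D4", "O"}
    };

    // Apply every rule in turn; characters not covered by a rule are left unchanged.
    static String strip(String s) {
        for (String[] rule : RULES) {
            s = s.replaceAll(rule[0], rule[1]);
        }
        return s;
    }

    public static void main(String[] args) {
        // E-grave, e-acute, O-circumflex, o-circumflex
        System.out.println(strip("\u00C8,\u00E9,\u00D4,\u00F4"));
        // output : E,e,O,o
    }
}
```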
Technique 3
While the two techniques above are OK, they are a little bit slow.
The following HowTo is faster because it uses one String to hold all the characters to be converted and a second String to hold the ASCII equivalents, in the same order. So we only need to find the position of a character in the first String and then look up the character at that position in the second String.
public class AsciiUtils {
    private static final String PLAIN_ASCII =
          "AaEeIiOoUu"    // grave
        + "AaEeIiOoUuYy"  // acute
        + "AaEeIiOoUuYy"  // circumflex
        + "AaOo"          // tilde
        + "AaEeIiOoUuYy"  // umlaut
        + "Aa"            // ring
        + "Cc"            // cedilla
        ;

    // same order as PLAIN_ASCII : the character at position n here
    // is replaced by the character at position n in PLAIN_ASCII
    private static final String UNICODE =
          "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
        + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
        + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
        + "\u00C3\u00E3\u00D5\u00F5"
        + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
        + "\u00C5\u00E5"
        + "\u00C7\u00E7"
        ;

    // private constructor, can't be instantiated!
    private AsciiUtils() { }

    // remove accented characters from a string and replace them with their ASCII equivalent
    public static String convertNonAscii(String s) {
        StringBuffer sb = new StringBuffer();
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            int pos = UNICODE.indexOf(c);
            if (pos > -1) {
                sb.append(PLAIN_ASCII.charAt(pos));
            }
            else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String args[]) {
        String s =
            "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
        System.out.println(AsciiUtils.convertNonAscii(s));
        // output :
        // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
    }
}
Written and compiled by Réal Gagnon ©1998-2008