This can be useful before inserting data into a database to make sorting easier.
Technique 1
This is simple to do with the sun.text.Normalizer class. However, since the class is in a sun.* package,
it is considered outside of the Java platform, can be different across OS platforms
(Solaris, Windows, Linux, Macintosh, etc.) and can change at any time without notice
between SDK versions (1.2, 1.2.1, 1.2.3, etc.). In general, writing Java programs that rely on
sun.* classes is risky: they are not portable and are not supported.
For an alternative to the sun.text.Normalizer class, you may want to take a look at IBM's ICU4J project on SourceForge.
We call normalize() with the option DECOMP (for decomposition, see Unicode Normalization). So if we pass à, the method returns the letter a followed by a combining grave accent (U+0300). Then, using a regular expression, we clean up the string to keep only valid US-ASCII characters.
JDK1.4
import sun.text.Normalizer;

public class Accent {
    public static String value = "é à î _ @";

    public static void main(String args[]) throws Exception {
        System.out.println(formatString(value));
        // output : e a i _ @
    }

    public static String formatString(String s) {
        // decompose each accented character into base letter + combining mark
        String temp = Normalizer.normalize(s, Normalizer.DECOMP, 0);
        // drop everything that is not US-ASCII, including the combining marks
        return temp.replaceAll("[^\\p{ASCII}]", "");
    }
}
A note from ajmacher:
The Normalizer API changed in JDK6. It can now be found in java.text.Normalizer and its usage is slightly different (enough to break existing code), so Technique 1 will cause compiler errors in JDK6. Try:
java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD);
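On JDK 6 and later, the Technique 1 example could be adapted roughly as follows. This is a minimal sketch: the class name AccentJdk6 is made up for illustration, and the input is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.text.Normalizer;

final class AccentJdk6 {
    // Same idea as Technique 1, using the supported java.text.Normalizer API.
    static String formatString(String s) {
        // NFD splits each accented letter into base letter + combining mark
        String temp = Normalizer.normalize(s, Normalizer.Form.NFD);
        // remove everything outside US-ASCII, which includes the combining marks
        return temp.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // "\u00E9 \u00E0 \u00EE _ @" is "é à î _ @"
        System.out.println(formatString("\u00E9 \u00E0 \u00EE _ @"));
        // output : e a i _ @
    }
}
```

Note that, as in Technique 1, any non-ASCII character without a decomposition is simply dropped from the result.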
Technique 2
As an alternative, replaceAll() and regular expressions on a String can also be used:
public class Test {
    public static void main(String args[]) {
        String s = "È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô";

        // lowercase letters
        s = s.replaceAll("[èéêë]", "e");
        s = s.replaceAll("[ûù]", "u");
        s = s.replaceAll("[ïî]", "i");
        s = s.replaceAll("[àâ]", "a");
        s = s.replaceAll("ô", "o");

        // uppercase letters
        s = s.replaceAll("[ÈÉÊË]", "E");
        s = s.replaceAll("[ÛÙ]", "U");
        s = s.replaceAll("[ÏÎ]", "I");
        s = s.replaceAll("[ÀÂ]", "A");
        s = s.replaceAll("Ô", "O");

        System.out.println(s);
        // output : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o
    }
}
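Each call to String.replaceAll() compiles its regular expression again. If that matters, the same character classes can be compiled once and reused. The following is a sketch only: the class name RegexAccentStripper and its RULES table are hypothetical, and it covers just the handful of characters used in the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class RegexAccentStripper {
    // Hypothetical rule table: precompiled character class -> ASCII replacement.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("[\u00E8\u00E9\u00EA\u00EB]"), "e"); // è é ê ë
        RULES.put(Pattern.compile("[\u00FB\u00F9]"), "u");             // û ù
        RULES.put(Pattern.compile("[\u00EF\u00EE]"), "i");             // ï î
        RULES.put(Pattern.compile("[\u00E0\u00E2]"), "a");             // à â
        RULES.put(Pattern.compile("\u00F4"), "o");                     // ô
        RULES.put(Pattern.compile("[\u00C8\u00C9\u00CA\u00CB]"), "E"); // È É Ê Ë
        RULES.put(Pattern.compile("[\u00DB\u00D9]"), "U");             // Û Ù
        RULES.put(Pattern.compile("[\u00CF\u00CE]"), "I");             // Ï Î
        RULES.put(Pattern.compile("[\u00C0\u00C2]"), "A");             // À Â
        RULES.put(Pattern.compile("\u00D4"), "O");                     // Ô
    }

    private RegexAccentStripper() { }

    // Applies every rule in turn; characters not covered by a rule are kept.
    static String strip(String s) {
        String result = s;
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            result = rule.getKey().matcher(result).replaceAll(rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // "\u00C8,\u00E9,\u00F4,\u00D4" is "È,é,ô,Ô"
        System.out.println(strip("\u00C8,\u00E9,\u00F4,\u00D4"));
        // output : E,e,o,O
    }
}
```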
Technique 3
While the two techniques above are OK, they are a little bit slow.
The following HowTo is faster because it uses one String to hold all the accented characters to be converted and a second String with the ASCII equivalents at the same positions. We find the position of a character in the first String, then look up the replacement at that position in the second String.
public class AsciiUtils {
    // ASCII replacements, in the same order as the characters in UNICODE
    private static final String PLAIN_ASCII =
          "AaEeIiOoUu"    // grave
        + "AaEeIiOoUuYy"  // acute
        + "AaEeIiOoUuYy"  // circumflex
        + "AaOoNn"        // tilde
        + "AaEeIiOoUuYy"  // umlaut
        + "Aa"            // ring
        + "Cc"            // cedilla
        + "OoUu"          // double acute
        ;

    // accented characters, position for position with PLAIN_ASCII
    private static final String UNICODE =
          "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
        + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
        + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
        + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
        + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
        + "\u00C5\u00E5"
        + "\u00C7\u00E7"
        + "\u0150\u0151\u0170\u0171"
        ;

    // private constructor, can't be instantiated!
    private AsciiUtils() { }

    // replace accented characters in a string with their ASCII equivalents
    public static String convertNonAscii(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder();
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            int pos = UNICODE.indexOf(c);
            if (pos > -1) {
                sb.append(PLAIN_ASCII.charAt(pos));
            }
            else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String args[]) {
        String s =
            "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
        System.out.println(AsciiUtils.convertNonAscii(s));
        // output :
        // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
    }
}
As a bonus, here is a method to convert a given string to uppercase with no accents. This can be useful for a database field, to simplify searching for names whether or not they are typed with accents.
public class StringUtils {
    private StringUtils() {}

    private static final String UPPERCASE_ASCII =
          "AEIOU"   // grave
        + "AEIOUY"  // acute
        + "AEIOUY"  // circumflex
        + "AON"     // tilde
        + "AEIOUY"  // umlaut
        + "A"       // ring
        + "C"       // cedilla
        + "OU"      // double acute
        ;

    private static final String UPPERCASE_UNICODE =
          "\u00C0\u00C8\u00CC\u00D2\u00D9"
        + "\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD"
        + "\u00C2\u00CA\u00CE\u00D4\u00DB\u0176"
        + "\u00C3\u00D5\u00D1"
        + "\u00C4\u00CB\u00CF\u00D6\u00DC\u0178"
        + "\u00C5"
        + "\u00C7"
        + "\u0150\u0170"
        ;

    public static String toUpperCaseSansAccent(String txt) {
        if (txt == null) {
            return null;
        }
        String txtUpper = txt.toUpperCase();
        StringBuilder sb = new StringBuilder();
        int n = txtUpper.length();
        for (int i = 0; i < n; i++) {
            char c = txtUpper.charAt(i);
            int pos = UPPERCASE_UNICODE.indexOf(c);
            if (pos > -1) {
                sb.append(UPPERCASE_ASCII.charAt(pos));
            }
            else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String args[]) throws Exception {
        String s =
            "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
        System.out.println(
            StringUtils.toUpperCaseSansAccent(s));
        // output :
        // THE RESULT : E,E,E,E,U,U,I,I,A,A,O,E,E,E,E,U,U,I,I,A,A,O,C
    }
}
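To illustrate the name-searching use case, here is a small sketch that compares two names through an accent-free uppercase key. It is not the lookup-table method above: to stay self-contained it builds the key with java.text.Normalizer (JDK 6 or later) instead, and the class NameSearchKey and its methods are hypothetical names chosen for this example.

```java
import java.text.Normalizer;
import java.util.Locale;

final class NameSearchKey {
    private NameSearchKey() { }

    // Builds an uppercase key with combining accents removed (assumed helper).
    static String searchKey(String name) {
        if (name == null) {
            return null;
        }
        // NFD splits accented letters into base letter + combining mark
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition
        return decomposed.replaceAll("\\p{M}", "").toUpperCase(Locale.ROOT);
    }

    // Two names match when their keys are equal.
    static boolean sameName(String a, String b) {
        if (a == null || b == null) {
            return a == b;
        }
        return searchKey(a).equals(searchKey(b));
    }

    public static void main(String[] args) {
        // "R\u00E9al Gagnon" is "Réal Gagnon"
        System.out.println(searchKey("R\u00E9al Gagnon"));
        // output : REAL GAGNON
        System.out.println(sameName("R\u00E9al Gagnon", "real gagnon"));
        // output : true
    }
}
```

In a database, such a key would typically be stored in its own column next to the original name, so that the search runs on the key while the display keeps the accents.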
Written and compiled by Réal Gagnon ©1998-2008