
Unaccent letters

The following snippets remove accented letters from a String and replace them with their plain ASCII equivalents.

This can be useful before inserting data into a database, to make sorting easier.

Technique 1
It's simple using the sun.text.Normalizer class. However, since the class is in the sun.* package, it is considered outside of the Java platform, can differ across OS platforms (Solaris, Windows, Linux, Macintosh, etc.) and can change at any time without notice between SDK versions (1.2, 1.2.1, 1.2.3, etc.). In general, writing Java programs that rely on sun.* is risky: they are not portable and are not supported.

For an alternative to the sun.text.Normalizer class, you may want to take a look at IBM's ICU4J project on SourceForge.

We call normalize() with the DECOMP option (for decomposition, see Unicode Normalization). If we pass à, the method returns the base letter followed by a separate combining accent (a + `). A regular expression then cleans up the string to keep only valid US-ASCII characters.

JDK1.4

import sun.text.Normalizer;

public class Accent {
   public static String value = "é à î _ @";

   public static void main(String args[]) throws Exception{
       System.out.println(formatString(value));
       // output : e a i _ @
   }

   public static String formatString(String s) {
       String temp = Normalizer.normalize(s, Normalizer.DECOMP, 0);
       return temp.replaceAll("[^\\p{ASCII}]","");
   }

}

A note from ajmacher:

The Normalizer API changed in JDK6... it can now be found in java.text.Normalizer and its usage is slightly different (but enough to break it), so Technique 1 will cause compiler errors in JDK6. Try :

java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD);
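
For reference, here is a minimal sketch of Technique 1 rewritten around that call, assuming JDK 6 or later. The class name AccentJdk6 is made up for this illustration; the regular expression is the same one used in the JDK1.4 version. The accented test characters are written as Unicode escapes so that the result does not depend on the source file encoding.

```java
import java.text.Normalizer;

class AccentJdk6 {
    // Decompose each accented letter into base letter + combining mark (NFD),
    // then drop everything that is not a US-ASCII character.
    static String formatString(String s) {
        String temp = Normalizer.normalize(s, Normalizer.Form.NFD);
        return temp.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // a-grave alone: after decomposition it is 'a' followed by U+0300
        String decomposed = Normalizer.normalize("\u00E0", Normalizer.Form.NFD);
        System.out.println(decomposed.length());
        // output : 2

        // e-acute, a-grave, i-circumflex, underscore, at sign
        String value = "\u00E9 \u00E0 \u00EE _ @";
        System.out.println(formatString(value));
        // output : e a i _ @
    }
}
```

Note that the final replaceAll() removes any non-ASCII character that has no decomposition, rather than replacing it with a look-alike letter.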

Technique 2
As an alternative, replaceAll() and regular expressions on a String can also be used :

public class Test {
  public static void main(String args[]) {
    String s = "È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô";

    s = s.replaceAll("[èéêë]","e");
    s = s.replaceAll("[ûù]","u");
    s = s.replaceAll("[ïî]","i");
    s = s.replaceAll("[àâ]","a");
    s = s.replaceAll("ô","o");

    s = s.replaceAll("[ÈÉÊË]","E");
    s = s.replaceAll("[ÛÙ]","U");
    s = s.replaceAll("[ÏÎ]","I");
    s = s.replaceAll("[ÀÂ]","A");
    s = s.replaceAll("Ô","O");

    System.out.println(s);
    // output : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o
  }
}

Technique 3
While the two techniques above are OK, they are a little bit slow.

The following HowTo is faster because it uses one String containing all the characters to be converted and a second String holding the ASCII equivalent of each one at the same index. For each input character, we find its position in the first String and then look up the replacement at that position in the second String.


public class AsciiUtils {
    private static final String PLAIN_ASCII =
      "AaEeIiOoUu"    // grave
    + "AaEeIiOoUuYy"  // acute
    + "AaEeIiOoUuYy"  // circumflex
    + "AaOoNn"        // tilde
    + "AaEeIiOoUuYy"  // umlaut
    + "Aa"            // ring
    + "Cc"            // cedilla
    + "OoUu"          // double acute
    ;

    private static final String UNICODE =
     "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"             
    + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD" 
    + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177" 
    + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
    + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF" 
    + "\u00C5\u00E5"                                                             
    + "\u00C7\u00E7" 
    + "\u0150\u0151\u0170\u0171" 
    ;

    // private constructor, can't be instantiated!
    private AsciiUtils() { }

    // remove accented characters from a string and replace them with their ASCII equivalent
    public static String convertNonAscii(String s) {
       if (s == null) return null;
       StringBuilder sb = new StringBuilder();
       int n = s.length();
       for (int i = 0; i < n; i++) {
          char c = s.charAt(i);
          int pos = UNICODE.indexOf(c);
          if (pos > -1){
              sb.append(PLAIN_ASCII.charAt(pos));
          }
          else {
              sb.append(c);
          }
       }
       return sb.toString();
    }

    public static void main(String args[]) {
       String s = 
         "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
       System.out.println(AsciiUtils.convertNonAscii(s));
       // output : 
       // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
    }
}
Thanks to MV Bastos for the "tilde" bug fix
Thanks to L.. Tama for the missing Ñ !
Thanks to T. Hague for the missing "double acute"!

As a bonus, here is a method to convert a given string to uppercase with no accents. Storing the result in a database field can simplify name searches, since a name then matches whether it was typed with accents or not.


public class StringUtils {
  private StringUtils() {}
  
  private static final String UPPERCASE_ASCII =
    "AEIOU" // grave
    + "AEIOUY" // acute
    + "AEIOUY" // circumflex
    + "AON" // tilde
    + "AEIOUY" // umlaut
    + "A" // ring
    + "C" // cedilla
    + "OU" // double acute
    ;

  private static final String UPPERCASE_UNICODE =
    "\u00C0\u00C8\u00CC\u00D2\u00D9"
    + "\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD"
    + "\u00C2\u00CA\u00CE\u00D4\u00DB\u0176"
    + "\u00C3\u00D5\u00D1"
    + "\u00C4\u00CB\u00CF\u00D6\u00DC\u0178"
    + "\u00C5"
    + "\u00C7"
    + "\u0150\u0170"
    ;

  public static String toUpperCaseSansAccent(String txt) {
       if (txt == null) {
          return null;
       } 
       String txtUpper = txt.toUpperCase();
       StringBuilder sb = new StringBuilder();
       int n = txtUpper.length();
       for (int i = 0; i < n; i++) {
          char c = txtUpper.charAt(i);
          int pos = UPPERCASE_UNICODE.indexOf(c);
          if (pos > -1){
            sb.append(UPPERCASE_ASCII.charAt(pos));
          }
          else {
            sb.append(c);
          }
       }
       return sb.toString();
  }
  
  
  public static void main(String args[]) throws Exception {
    String s = 
      "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
    System.out.println(
         StringUtils.toUpperCaseSansAccent(s));
    // output : 
    //  THE RESULT : E,E,E,E,U,U,I,I,A,A,O,E,E,E,E,U,U,I,I,A,A,O,C
  }
}



Written and compiled by Réal Gagnon ©1998-2008