Real's HowTo

Unaccent letters
Tag(s): Internationalization String/Number


The following snippets remove accented letters from a String and replace them with their plain ASCII equivalents.

This can be useful before inserting data into a database, to make sorting easier.

Using java.text.Normalizer

It's simple using the java.text.Normalizer class.

We call normalize() with the NFD (canonical decomposition) form. If we pass à, the method returns the letter a followed by the combining grave accent (U+0300). Then, using a regular expression, we strip the combining diacritical marks so that only the base letters remain.

import java.text.Normalizer;
import java.util.regex.Pattern;

public class StringUtils {
  private StringUtils() {}

  public static String unAccent(String s) {
    //
    // JDK1.5
    //   use sun.text.Normalizer.normalize(s, Normalizer.DECOMP, 0);
    //
    String temp = Normalizer.normalize(s, Normalizer.Form.NFD);
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(temp).replaceAll("");
  }

  public static void main(String args[]) throws Exception{
    String value = "é à î _ @";
    System.out.println(StringUtils.unAccent(value));
    // output : e a i _ @
  }
}
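
To make the decomposition step visible, here is a minimal sketch. The class name NormalizerDemo and its two helper methods are made up for this illustration. It also uses the broader \p{M} category (any Unicode combining mark) instead of the InCombiningDiacriticalMarks block, and compiles the pattern once so it can be reused between calls.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original HowTo.
class NormalizerDemo {
  // \p{M} matches any Unicode combining mark; compiled once and reused.
  private static final Pattern MARKS = Pattern.compile("\\p{M}+");

  // Canonical decomposition: base letter followed by its combining marks.
  static String decompose(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFD);
  }

  // Decompose, then drop the combining marks.
  static String stripMarks(String s) {
    return MARKS.matcher(decompose(s)).replaceAll("");
  }

  public static void main(String[] args) {
    String d = decompose("\u00E0");                        // a with grave accent
    System.out.println(d.length());                        // 2
    System.out.println(Integer.toHexString(d.charAt(1)));  // 300
    // "ete a Noel" once the accents are stripped
    System.out.println(stripMarks("\u00E9t\u00E9 \u00E0 No\u00EBl"));
    // o-with-stroke and sharp s have no canonical decomposition,
    // so they come back unchanged
    System.out.println(stripMarks("\u00F8 \u00DF"));
  }
}
```

Note that letters without a canonical decomposition, such as ø or ß, pass through this technique untouched. If they must be converted too, they need an explicit mapping.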

Using String.replaceAll()

As an alternative, replaceAll() with regular expressions can be applied to the String:

public class Test {
  public static void main(String args[]) {
    String s = "È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô";

    s = s.replaceAll("[èéêë]","e");
    s = s.replaceAll("[ûù]","u");
    s = s.replaceAll("[ïî]","i");
    s = s.replaceAll("[àâ]","a");
    s = s.replaceAll("ô","o");

    s = s.replaceAll("[ÈÉÊË]","E");
    s = s.replaceAll("[ÛÙ]","U");
    s = s.replaceAll("[ÏÎ]","I");
    s = s.replaceAll("[ÀÂ]","A");
    s = s.replaceAll("Ô","O");

    System.out.println(s);
    // output : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o
  }
}
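
String.replaceAll() compiles its regular expression on every call. If the conversion runs often, one possible variation is to compile the patterns once and keep them in a table. The sketch below assumes the same French letters as the example above; the class name ReplaceAllDemo and the RULES table are invented for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original HowTo.
class ReplaceAllDemo {
  // Each row: a character class to match and its ASCII replacement.
  private static final String[][] RULES = {
    {"[\u00E8\u00E9\u00EA\u00EB]", "e"},  // e grave, acute, circumflex, umlaut
    {"[\u00FB\u00F9]", "u"},              // u circumflex, grave
    {"[\u00EF\u00EE]", "i"},              // i umlaut, circumflex
    {"[\u00E0\u00E2]", "a"},              // a grave, circumflex
    {"\u00F4", "o"},                      // o circumflex
    {"[\u00C8\u00C9\u00CA\u00CB]", "E"},
    {"[\u00DB\u00D9]", "U"},
    {"[\u00CF\u00CE]", "I"},
    {"[\u00C0\u00C2]", "A"},
    {"\u00D4", "O"},
  };

  // Patterns compiled once, in the same order as RULES.
  private static final Pattern[] PATTERNS = new Pattern[RULES.length];
  static {
    for (int i = 0; i < RULES.length; i++) {
      PATTERNS[i] = Pattern.compile(RULES[i][0]);
    }
  }

  static String unAccent(String s) {
    for (int i = 0; i < PATTERNS.length; i++) {
      s = PATTERNS[i].matcher(s).replaceAll(RULES[i][1]);
    }
    return s;
  }

  public static void main(String[] args) {
    // E grave, o circumflex, O circumflex, e acute
    System.out.println(unAccent("\u00C8,\u00F4,\u00D4,\u00E9"));
    // output : E,o,O,e
  }
}
```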

Using String.indexOf()

The two techniques above work, but they are a little slow.

The following HowTo is faster because it uses one String that holds all the characters to be converted and a second String that holds their ASCII equivalents at the same positions. For each character, we find its position in the first String and then look up the replacement at that position in the second String.


public class AsciiUtils {
  private static final String PLAIN_ASCII =
      "AaEeIiOoUu"    // grave
    + "AaEeIiOoUuYy"  // acute
    + "AaEeIiOoUuYy"  // circumflex
    + "AaOoNn"        // tilde
    + "AaEeIiOoUuYy"  // umlaut
    + "Aa"            // ring
    + "Cc"            // cedilla
    + "OoUu"          // double acute
    ;

  private static final String UNICODE =
     "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
    + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
    + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
    + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
    + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
    + "\u00C5\u00E5"
    + "\u00C7\u00E7"
    + "\u0150\u0151\u0170\u0171"
    ;

  // private constructor, can't be instantiated!
  private AsciiUtils() { }

  // remove accents from a string and replace them with their ASCII equivalent
  public static String convertNonAscii(String s) {
    if (s == null) return null;
    StringBuilder sb = new StringBuilder();
    int n = s.length();
    for (int i = 0; i < n; i++) {
      char c = s.charAt(i);
      int pos = UNICODE.indexOf(c);
      if (pos > -1) {
        sb.append(PLAIN_ASCII.charAt(pos));
      }
      else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  public static void main(String args[]) {
    String s =
      "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
    System.out.println(AsciiUtils.convertNonAscii(s));
    // output :
    // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
  }
}
Thanks to MV Bastos for the "tilde" bug fix
Thanks to L.. Tama for the missing Ñ !
Thanks to T. Hague for the missing "double acute"
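
The two parallel Strings must stay the same length, and indexOf() still scans the first String for every character. As a sketch of one possible refinement, the pairs can be expanded once into a char array indexed by character value, which turns each lookup into a single array access. The class LookupTableDemo and its shortened character lists are assumptions made for this illustration, not the full table above.

```java
// Hypothetical demo class, not part of the original HowTo.
class LookupTableDemo {
  // Shortened sample lists; positions must match one to one.
  private static final String ACCENTED =
      "\u00C0\u00E0\u00C9\u00E9\u00D1\u00F1\u0150\u0151";
  private static final String PLAIN = "AaEeNnOo";

  // TABLE[c] holds the replacement for character c (or c itself).
  private static final char[] TABLE = buildTable();

  private static char[] buildTable() {
    // Catch the kind of mismatch that a missing letter would cause.
    if (ACCENTED.length() != PLAIN.length()) {
      throw new IllegalStateException("mapping strings differ in length");
    }
    int max = 0;
    for (int i = 0; i < ACCENTED.length(); i++) {
      max = Math.max(max, ACCENTED.charAt(i));
    }
    char[] table = new char[max + 1];
    for (int c = 0; c <= max; c++) {
      table[c] = (char) c;                      // identity by default
    }
    for (int i = 0; i < ACCENTED.length(); i++) {
      table[ACCENTED.charAt(i)] = PLAIN.charAt(i);
    }
    return table;
  }

  static String convert(String s) {
    if (s == null) return null;
    char[] out = s.toCharArray();
    for (int i = 0; i < out.length; i++) {
      char c = out[i];
      if (c < TABLE.length) {
        out[i] = TABLE[c];                      // direct array lookup
      }
    }
    return new String(out);
  }

  public static void main(String[] args) {
    System.out.println(convert("\u00C9t\u00E9 \u00F1 \u0151 z"));
    // output : Ete n o z
  }
}
```

The table costs a few hundred chars of memory here because the highest mapped character is U+0151. Whether this is measurably faster than indexOf() depends on the input, so it is worth timing both on real data.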

As a bonus, here is a method that converts a given string to uppercase without accents. This can be useful for a database field, to simplify searching for names with or without accents.


public class StringUtils {
  private StringUtils() {}

  private static final String UPPERCASE_ASCII =
    "AEIOU" // grave
    + "AEIOUY" // acute
    + "AEIOUY" // circumflex
    + "AON" // tilde
    + "AEIOUY" // umlaut
    + "A" // ring
    + "C" // cedilla
    + "OU" // double acute
    ;

  private static final String UPPERCASE_UNICODE =
    "\u00C0\u00C8\u00CC\u00D2\u00D9"
    + "\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD"
    + "\u00C2\u00CA\u00CE\u00D4\u00DB\u0176"
    + "\u00C3\u00D5\u00D1"
    + "\u00C4\u00CB\u00CF\u00D6\u00DC\u0178"
    + "\u00C5"
    + "\u00C7"
    + "\u0150\u0170"
    ;

  public static String toUpperCaseSansAccent(String txt) {
    if (txt == null) {
      return null;
    }
    String txtUpper = txt.toUpperCase();
    StringBuilder sb = new StringBuilder();
    int n = txtUpper.length();
    for (int i = 0; i < n; i++) {
      char c = txtUpper.charAt(i);
      int pos = UPPERCASE_UNICODE.indexOf(c);
      if (pos > -1){
        sb.append(UPPERCASE_ASCII.charAt(pos));
      }
      else {
        sb.append(c);
      }
    }
    return sb.toString();
  }


  public static void main(String args[]) throws Exception {
    String s =
      "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
    System.out.println(
         StringUtils.toUpperCaseSansAccent(s));
    // output :
    //  THE RESULT : E,E,E,E,U,U,I,I,A,A,O,E,E,E,E,U,U,I,I,A,A,O,C
  }
}
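
The no-argument toUpperCase() uses the default locale of the JVM, so the result can differ between machines (under a Turkish locale, for example, "i" becomes "İ"). The sketch below shows one way to build a locale-independent search key by combining java.text.Normalizer with toUpperCase(Locale.ROOT). The class SearchKeyDemo and the method searchKey are hypothetical names chosen for this example.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original HowTo.
class SearchKeyDemo {
  private static final Pattern MARKS =
      Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

  // Strip the accents, then uppercase with a fixed locale.
  static String searchKey(String txt) {
    if (txt == null) {
      return null;
    }
    String decomposed = Normalizer.normalize(txt, Normalizer.Form.NFD);
    return MARKS.matcher(decomposed).replaceAll("").toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(searchKey("Ren\u00E9e C\u00F4t\u00E9"));
    // output : RENEE COTE
  }
}
```

Storing such a key in a separate column lets a search match a name whether or not the user typed the accents, as long as the search term goes through the same method.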


Written and compiled by Réal Gagnon ©1998-2014