Dec 29, 2010

String.trim() on Unicode space characters In Depth

As developers, we know that \u00A0 is a Unicode space character (the no-break space). In a Java string literal, the octal escape \240 denotes the very same character. As a Java developer, have you worked with String.trim()? I am sure your answer is "Yes"!

Now, let us come to an interesting point. Have you ever looked at how String.trim() actually works? If you are tech savvy and have read the Java API docs, I am sure you already know about the bizarre nature of the String class's trim() method.

For everyone else, let us go further. Just type in the following code as a stand-alone Java program and run it!

Sample Program
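public class Sample {
    public static void main(String[] args) {
        // "\240" is an octal escape for U+00A0, the no-break space
        String test = "\240";
        System.out.println("hai:" + test.trim() + ":hai");
        for (int i = 0; i < test.length(); i++) {
            char ch = test.charAt(i);
            // isSpaceChar() follows the Unicode definition of a space character
            System.out.println(ch + ":" + Character.isSpaceChar(ch));
        }
    }
}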
Output:
hai: :hai
 :true

When you run the above example, you may be surprised to see that there is still a space between "hai:" and ":hai", even after calling trim() on the test string. As a developer, I expected "hai::hai" to be printed! But "hai: :hai" surprised me. On digging a bit further, I found that the trim() method of Java's String class works a bit differently.

So I was wondering: does trim() really trim all whitespace and space characters? Maybe not, and here goes the clarification.

Trim(): Clarification in Detail

Java Strings are encoded in UTF-16. However, the documentation for trim() says that it strips only the leading and trailing characters whose code is less than or equal to U+0020 (the ASCII space and the control characters below it); it does not consult Unicode's list. So it does not trim all the space and (or) whitespace characters defined by Unicode, and U+00A0 is left alone. In this context, we need to understand 2 more points.
  1. Character.isWhitespace(char/int): Checks whether the specified character or code point is whitespace according to Java. 'Whitespace' here means the Unicode space, line and paragraph separators, excluding the non-breaking spaces (U+00A0, U+2007, U+202F), plus a handful of control characters that Java adds itself: U+0009 to U+000D and U+001C to U+001F.
  2. Character.isSpaceChar(char/int): Checks whether the given character is a space character as defined by Unicode, i.e. a space, line or paragraph separator. This includes the non-breaking spaces.
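To see how these checks disagree, here is a small sketch that prints what trim() and the two Character methods say about a few sample characters. The class name SpaceCheck, the describe() helper and the choice of sample characters are mine, purely for illustration.

```java
public class SpaceCheck {
    // Hypothetical helper: reports how trim(), isWhitespace() and
    // isSpaceChar() each treat a single character.
    static String describe(char ch) {
        // A one-character string becomes empty only if trim() removes that character
        boolean trimmed = String.valueOf(ch).trim().isEmpty();
        return String.format("U+%04X trim=%b isWhitespace=%b isSpaceChar=%b",
                (int) ch, trimmed, Character.isWhitespace(ch), Character.isSpaceChar(ch));
    }

    public static void main(String[] args) {
        // ASCII space, no-break space, tab, em space
        char[] samples = {'\u0020', '\u00A0', '\t', '\u2003'};
        for (char ch : samples) {
            System.out.println(describe(ch));
        }
    }
}
```

The no-break space is the odd one out: Unicode counts it as a space character, but neither trim() nor isWhitespace() does. The tab goes the other way round, and the em space is whitespace for both Character methods yet is still left alone by trim().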

So, as I understand it, the String class's trim() only trims characters up to U+0020, not the space characters as defined by Unicode. This is definitely not as intuitive as it should have been, and hence this post!
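If you do want the no-break space and its friends gone, one option is to write your own trim. The following is only a minimal sketch, assuming you want to strip everything that trim(), Character.isWhitespace() or Character.isSpaceChar() would count as a space; UnicodeTrim, isTrimmable() and trimUnicode() are hypothetical names and not part of the JDK.

```java
public class UnicodeTrim {
    // Hypothetical helper: true for anything trim(), isWhitespace()
    // or isSpaceChar() would treat as a space.
    static boolean isTrimmable(char ch) {
        return ch <= ' ' || Character.isWhitespace(ch) || Character.isSpaceChar(ch);
    }

    // Strips trimmable characters from both ends of the string.
    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isTrimmable(s.charAt(start))) {
            start++;
        }
        while (end > start && isTrimmable(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String test = "\240"; // U+00A0, same string as in the sample program
        System.out.println("hai:" + trimUnicode(test) + ":hai");
    }
}
```

With this helper, the sample string from the beginning of the post comes out as "hai::hai", which is what I expected from trim() in the first place.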

As always, I appreciate your feedback/hollers regarding this article any time!
