Special Character Conversion Problems – ISO-8859-1 to Unicode
Ever seen one of these funny characters: � or (4 digits in a box)?
The black question mark appears when the value for a special character doesn’t match a character in the character set used for displaying the text. I found this happened when working with text data that was encoded in ISO-8859-1 and displayed in UTF-8. This is an expected problem, as text needs to be converted from one character set to the other before you can display it.
But what do you do when you convert the data from the incoming encoding to the outgoing encoding and still end up with � or (4 digits in a box)?
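The mismatch described above can be reproduced in a few lines. This is only a sketch in Python (the project itself was .NET), and the sample bytes are an assumption made for illustration: a single byte 0x95 sitting in otherwise plain ASCII text.

```python
# Hypothetical sample data: ASCII text containing the single byte 0x95.
raw = b"Item \x95 one"

# Reading the bytes as UTF-8 fails on 0x95, because a lone 0x95 is not a
# valid UTF-8 sequence. errors="replace" substitutes U+FFFD, the
# replacement character that browsers draw as a black question mark.
shown = raw.decode("utf-8", errors="replace")

print(repr(shown))  # 'Item \ufffd one'
```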
Well, that was the problem I encountered.
For about a week I banged my head trying to solve that character conversion problem in .NET for a Web Design Toronto project. I received data in ISO-8859-1 format and displayed it in UTF-8. The problem occurred when the special character bullet (•) was in the data received. It showed up as � because I was not doing a conversion to UTF-8. So I did some research and found a function to convert the data from ISO-8859-1 to Unicode.
After I used this function and displayed the text, the bullet had been converted to \u0095, which was displayed as a box with 0095 in it. I thought it had not converted correctly, so I searched Google for \u0095 and kept getting references to Unicode. I came across Bullet – Unicode Character – FileFormat, which lists the conversion chart for a bullet; the correct Unicode character is \u2022. Obviously \u0095 is not \u2022, so I wondered if the conversion was broken. I researched a little more and found Message Waiting – Unicode Character – FileFormat, which describes the \u0095 character.
So I had converted successfully from ISO-8859-1 to Unicode, but when the text was displayed in a browser as UTF-8, the browser did not seem to recognize that character, so I ended up with the box with four digits in it.
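A short sketch makes the \u0095 versus \u2022 discrepancy concrete. It is written in Python for illustration and again assumes the bullet arrives as the single byte 0x95. ISO-8859-1 maps every byte to the code point with the same number, so 0x95 becomes U+0095. Windows-1252, a closely related encoding, maps the same byte to the bullet U+2022. Whether the incoming data was really Windows-1252 mislabelled as ISO-8859-1 is an assumption here, not something the investigation above confirmed.

```python
# Hypothetical sample: the byte that arrived where a bullet was expected.
raw = b"\x95"

# ISO-8859-1 maps byte N straight to code point N, so 0x95 -> U+0095,
# a control character that fonts have no glyph for.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 assigns printable characters to 0x80-0x9F; 0x95 -> U+2022.
as_cp1252 = raw.decode("cp1252")

print(f"U+{ord(as_latin1):04X}")  # U+0095
print(f"U+{ord(as_cp1252):04X}")  # U+2022
```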
How To Get the Browser To Display The Special Unicode Characters
As I examined the chart at FileFormat for Message Waiting, I saw that it listed &#149; (•) as the HTML entity for the Message Waiting dot. So I looked for how to convert Unicode to HTML entities in .NET.
But the built-in HTMLEncode didn’t solve my problem: it only converted special characters below 127 in the ASCII table. My research led me to a post about expanding HTMLEncode to include special characters above 127. Apparently, the integer value of the Unicode character is also the HTML entity number. So prefixing the integer value with &# and following it with a semicolon gives the HTML entity for that Unicode character. Example: &#149; (•).
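The rule described above (code point number wrapped in &# and a semicolon) is small enough to sketch directly. The function name `to_numeric_entities` is hypothetical, and Python stands in for the .NET code, which is not reproduced here. Note that this sketch does not escape <, > or &, so it is not a replacement for a full HTML encoder.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every character above 127 with a decimal HTML entity."""
    return "".join(
        ch if ord(ch) < 128 else f"&#{ord(ch)};"
        for ch in text
    )

# U+0095 has the integer value 149, so it becomes &#149;.
encoded = to_numeric_entities("Item \u0095 one")
print(encoded)  # Item &#149; one
```

Browsers render &#149; as a bullet because the HTML parsing rules remap numeric references in the 128 to 159 range to their Windows-1252 equivalents.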
The Final Conversion Method
I put the ISO-8859-1 conversion to Unicode together with the special character conversion to make sure the data will display in the browser.
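The original .NET method is not shown here, so the following is only a minimal sketch of the same two steps chained together, written in Python. The function name `latin1_bytes_to_html` and the sample input are assumptions made for illustration.

```python
def latin1_bytes_to_html(raw: bytes) -> str:
    """Decode ISO-8859-1 bytes, then entity-encode everything above 127."""
    # Step 1: ISO-8859-1 bytes -> Unicode text (byte N becomes code point N).
    text = raw.decode("iso-8859-1")
    # Step 2: non-ASCII characters -> decimal HTML entities.
    return "".join(
        ch if ord(ch) < 128 else f"&#{ord(ch)};"
        for ch in text
    )

html = latin1_bytes_to_html(b"Item \x95 one")
print(html)  # Item &#149; one
```

Because the output is pure ASCII, it displays the same way whatever encoding the page itself declares.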