I was reading a blog post by a Brazilian colleague about the BOM (Byte Order Mark) bytes in UTF-8 files, and I thought it would be worthwhile to translate a summary. With Igor Escobar’s permission, of course, here is the part I found most useful as a tutorial:
What is the BOM signature in UTF-8 documents?
Some applications insert a particular combination of bytes at the beginning of a file to indicate that the content that follows is Unicode text. This byte sequence is known as the Byte Order Mark (BOM). Some editors show the signature as an extra line; other applications, like Zend Studio, show it as (ï»¿).
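To see where those three odd characters come from, here is a minimal Python sketch. It assumes the misbehaving tool reads the file as Windows-1252 (a guess on my part; any similar single-byte encoding gives the same picture), and the variable names are only illustrative.

```python
# The BOM is the Unicode code point U+FEFF; in UTF-8 it becomes three bytes.
bom_bytes = "\ufeff".encode("utf-8")

# Hexadecimal view of those bytes, as a binary editor would show them.
bom_hex = " ".join(f"{b:02X}" for b in bom_bytes)

# Assumption: the tool decodes the file as Windows-1252, so each of the
# three bytes is displayed as a separate character.
misread = bom_bytes.decode("cp1252")

print(bom_hex)   # EF BB BF
print(misread)   # ï»¿
```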
Is the BOM signature important?
In the case of UTF-8 encoded files, it is not. You may remove the signature without causing interpretation problems. The BOM signature is only important for UTF-16 and UTF-32 documents, where it tells the user agent in which byte order to read the characters.
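The difference can be illustrated with a short Python sketch using the standard `codecs` module. The sample text `"A"` and the variable names are my own; this is only meant to show the idea, not how any particular user agent works.

```python
import codecs

text = "A"

# The same character in UTF-16, stored in the two possible byte orders,
# each preceded by the matching BOM.
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # FF FE 41 00
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")     # FE FF 00 41

# The generic "utf-16" codec reads the BOM to decide the byte order.
decoded_little = little.decode("utf-16")
decoded_big = big.decode("utf-16")

# UTF-8 has a single byte order, so the BOM carries no such information.
# The "utf-8-sig" codec simply skips it when it is present.
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
decoded_utf8 = with_bom.decode("utf-8-sig")

print(decoded_little, decoded_big, decoded_utf8)
```

Decoding `with_bom` with the plain `"utf-8"` codec instead would keep the BOM as an invisible U+FEFF character at the start of the string.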
How to detect the presence of the BOM signature in UTF-8 files?
First we need to determine whether that extra line at the beginning of the file really is a BOM signature. You could try by eye, but if your editor interprets the signature correctly, you will see nothing. If it does not, you will see the characters ï»¿ at the beginning of your document. In a binary editor that displays hexadecimal values, the signature appears as the bytes EF BB BF. Alternatively, a good editor will tell you the document encoding in its status bar or in some menu.
If you still cannot find it, there are web applications capable of detecting the BOM signature.
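A few lines of script can do the same check as the hexadecimal editor. The following is a minimal Python sketch; the function name `has_utf8_bom` is my own, and the file names are whatever you pass on the command line.

```python
import codecs
import sys


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the bytes EF BB BF."""
    with open(path, "rb") as f:          # binary mode: no decoding is applied
        return f.read(3) == codecs.BOM_UTF8


if __name__ == "__main__":
    # Report on every file named on the command line.
    for name in sys.argv[1:]:
        status = "BOM found" if has_utf8_bom(name) else "no BOM"
        print(f"{name}: {status}")
```

The file is opened in binary mode on purpose: in text mode the signature could be decoded or hidden before you get a chance to look at it.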
Some editors, like Notepad++ (Windows, free) and Komodo (Linux, free), let you choose whether or not to include the signature when you save your document. Take a look at the “Format” menu.
Be careful with the BOM
In some editors, like Windows Notepad, if you choose to save your file as UTF-8, the editor automatically adds the BOM signature.
A BOM signature in CSS files can cause certain user agents to misinterpret some rules, so it should be removed.
In some browsers, the presence of the signature can cause ALL characters on your page to be interpreted as if they were UTF-8, regardless of any declaration to the contrary.
For those who read Portuguese, here is the link to the full article, where you can also find a Perl script to remove the BOM signature.
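For those who do not, here is a rough Python sketch of the same idea. It is not the Perl script from the original article, just my own minimal version, and the function names `strip_utf8_bom` and `remove_bom_from_file` are made up for this example.

```python
import codecs


def strip_utf8_bom(data):
    """Return `data` (bytes) without a leading UTF-8 BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def remove_bom_from_file(path):
    """Rewrite the file in place only when it starts with a BOM.

    Returns True if the file was changed, False otherwise.
    """
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False                     # nothing to remove, file untouched
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Only the first three bytes are ever removed, and the rest of the file is written back byte for byte. It reads the whole file into memory, so it suits ordinary source files such as HTML, CSS or PHP; keep a backup before running it over a whole project.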