#PSCXTip How to determine the byte order mark of a text file

Text files created by Windows PowerShell with the redirection operator (>) or Out-File are little endian Unicode (UTF-16LE) by default.  You can see this by inspecting the first few bytes of a text file for a BOM, i.e. a byte order mark.  A BOM is not required, but PowerShell usually writes one when it creates a text file.  Typical BOMs you’ll encounter with Windows and PowerShell are:

UTF-8 		: 0xEF 0xBB 0xBF
UTF-16LE 	: 0xFF 0xFE
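
As a rough sketch of how the table above can be turned into a check, the function below reads the first four bytes of a file and compares them against known BOM signatures. The name Get-BomName is hypothetical (it is not part of PowerShell or PSCX), and the UTF-16BE and UTF-32 signatures are added here for completeness beyond the two BOMs listed above.

```powershell
# Hypothetical helper (not part of PowerShell or PSCX): names a file's BOM, if any.
function Get-BomName {
    param([Parameter(Mandatory = $true)][string]$Path)

    # .NET does not follow PowerShell's current location, so resolve to a full path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # Read at most the first four bytes; that covers every signature checked below.
    $buffer = New-Object byte[] 4
    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        $count = $stream.Read($buffer, 0, 4)
    }
    finally {
        $stream.Dispose()
    }

    # Hex string of the bytes actually read, e.g. 'FFFE0D00'.
    $hex = [System.BitConverter]::ToString($buffer, 0, $count).Replace('-', '')

    # Test the longest signatures first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    switch -Regex ($hex) {
        '^0000FEFF' { return 'UTF-32BE' }
        '^FFFE0000' { return 'UTF-32LE' }
        '^EFBBBF'   { return 'UTF-8' }
        '^FEFF'     { return 'UTF-16BE' }
        '^FFFE'     { return 'UTF-16LE' }
        default     { return 'No BOM found' }
    }
}
```

Calling Get-BomName .\date.txt on a file would return one of the names above. Note that a missing BOM does not tell you the encoding, and a UTF-16LE file whose first character is U+0000 would be reported as UTF-32LE, so treat the result as a hint rather than proof.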

You can’t use code like [System.IO.File]::ReadAllText() to view a BOM because it returns only the decoded text, not the bytes of the BOM itself.  Get-Content works the same way unless you use the -Encoding Byte parameter.  Given a file created in PowerShell:

PS> Get-Date > date.txt

You can view the first three bytes of the file, which include the BOM, using Get-Content like so:

PS> Get-Content .\date.txt -Encoding Byte -TotalCount 3

However, Get-Content displays each byte as a decimal number, so unless you’re quick with your decimal to hex conversions, this output isn’t ideal. The PowerShell Community Extensions (PSCX) come with a command called Format-Hex that formats its input, or a specified file, in hex. This utility is much like the od command from UNIX. The output from the Format-Hex command for the same file as above would be:

PS> Format-Hex .\date.txt -Count 16

Address:  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F ASCII
-------- ----------------------------------------------- ----------------
00000000 FF FE 0D 00 0A 00 53 00 75 00 6E 00 64 00 61 00 ......S.u.n.d.a.

Here we can see the first two bytes are 0xFF 0xFE, which is the BOM for UTF-16LE, or little endian Unicode.  If we instead save date.txt as UTF-8:

PS> Get-Date | Out-File date.txt -Encoding Utf8
PS> Format-Hex .\date.txt -Count 16

Address:  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F ASCII
-------- ----------------------------------------------- ----------------
00000000 EF BB BF 0D 0A 53 75 6E 64 61 79 2C 20 44 65 63 .....Sunday, Dec

Here we can see the UTF-8 BOM 0xEF 0xBB 0xBF.  This tip is most useful when you’re using PowerShell to process a file created by another program and you need to make sure you leave the file in the same encoding that it started out with.
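
One possible way to leave a file in its original encoding is sketched below. It relies on the .NET StreamReader class, which can detect an encoding from a BOM, and writes the text back with the encoding it found. The function name Edit-TextPreservingEncoding and its parameters are hypothetical, and the fallback of UTF-8 without a BOM for files that have no BOM is an assumption you may need to change.

```powershell
# Hypothetical helper: rewrites a text file while keeping the encoding signalled by its BOM.
function Edit-TextPreservingEncoding {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        # Script block that receives the file text and returns the new text as one string.
        [Parameter(Mandatory = $true)][scriptblock]$Transform
    )

    # .NET does not follow PowerShell's current location, so resolve to a full path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # Assumption: files without a BOM are treated as UTF-8 and written back without a BOM.
    $fallback = New-Object System.Text.UTF8Encoding -ArgumentList $false

    # The third argument ($true) asks StreamReader to detect the encoding from the BOM.
    $reader = New-Object System.IO.StreamReader -ArgumentList $fullPath, $fallback, $true
    try {
        $text     = $reader.ReadToEnd()
        $encoding = $reader.CurrentEncoding   # only meaningful after the first read
    }
    finally {
        $reader.Dispose()
    }

    $newText = & $Transform $text

    # WriteAllText emits the preamble (BOM) of the encoding it is given, if it has one.
    [System.IO.File]::WriteAllText($fullPath, [string]$newText, $encoding)
}
```

For example, Edit-TextPreservingEncoding .\date.txt { param($t) $t.ToUpper() } would upper-case the text while keeping the BOM the file started with. You can confirm the result by viewing the first bytes of the file again.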

Note: There are many more useful PowerShell Community Extensions (PSCX) commands. If you are interested in this great community project led by PowerShell MVPs Keith Hill and Oisin Grehan, give PSCX a try at http://pscx.codeplex.com.

Filed in: Columns, Tips and Tricks

© 2016 PowerShell Magazine. All rights reserved.