How to detect encoding file in ANSI, UTF8 and UTF8 without BOM -
Question
-
Hi all,
I am having a problem detecting the encoding of a .txt/.csv file. I need to distinguish between ANSI, UTF8 and UTF8 without BOM, but the problem is that ANSI and UTF8 without BOM look the same. I checked the function below and saw that it returns the same encoding for ANSI and for UTF8 without BOM. So how can I detect a UTF8 without BOM file? I need to handle this case in my code.
Thank you.
///////////////////////////////////////////////////////////////////
public Encoding GetFileEncoding(string srcFile)
{
    // *** Use Default of Encoding.Default (Ansi CodePage)
    Encoding enc = Encoding.Default;
    // *** Detect byte order mark if any - otherwise assume default
    byte[] buffer = new byte[10];
    FileStream file = new FileStream(srcFile, FileMode.Open);
    file.Read(buffer, 0, 10);
    file.Close();
    if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;
    else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        // 00 00 FE FF is the UTF-32 big-endian BOM (Encoding.UTF32 is little-endian)
        enc = new UTF32Encoding(true, true);
    else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;
    else if (buffer[0] == 0xFE && buffer[1] == 0xFF)
        // 1201 unicodeFFFE Unicode (Big-Endian)
        enc = Encoding.GetEncoding(1201);
    else if (buffer[0] == 0xFF && buffer[1] == 0xFE)
        // 1200 utf-16 Unicode
        enc = Encoding.GetEncoding(1200);
    return enc;
}
//////////////////////////////////////////////
Answers
-
Hi,
There is no 100% reliable way to determine if a byte stream is in ANSI or UTF-8 (without BOM).
If by ANSI you really meant ASCII, things are a little simpler - if there are any bytes over 0x7F, then you could infer it's UTF-8 (or bad ASCII, or another code page altogether. Only you know if you can exclude those possibilities).
If it really is ANSI (and btw, there is no such single code page. There is a set of Windows code pages, the most common of which is code page 1252, so I will assume you mean that), you are going to have to do a little more work. You could look to see if the bytes above 0x7F are legal UTF-8. They will always come in at least pairs, and up to four bytes can be used to encode a single character (the original design allowed up to six, but RFC 3629 limits it to four). (See here: http://en.wikipedia.org/wiki/UTF-8).
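As a rough illustration of what "legal UTF-8" means at the byte level, here is a minimal sketch. The names Utf8Check and LooksLikeUtf8 are made up for this example. It only checks the structure (a lead byte followed by the right number of 10xxxxxx continuation bytes), so it does not reject every ill-formed sequence a strict decoder would, such as some overlong forms or encoded surrogates.

```csharp
using System;

public static class Utf8Check
{
    // Hypothetical helper: true if every byte above 0x7F belongs to a
    // structurally well-formed multi-byte UTF-8 sequence.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        int i = 0;
        while (i < bytes.Length)
        {
            byte b = bytes[i];
            int extra; // number of continuation bytes expected after b

            if (b <= 0x7F) extra = 0;                    // 0xxxxxxx: ASCII
            else if (b >= 0xC2 && b <= 0xDF) extra = 1;  // 110xxxxx: 2-byte lead
            else if (b >= 0xE0 && b <= 0xEF) extra = 2;  // 1110xxxx: 3-byte lead
            else if (b >= 0xF0 && b <= 0xF4) extra = 3;  // 11110xxx: 4-byte lead
            else return false;                           // stray continuation or invalid lead

            // The sequence must not run past the end of the data.
            if (i + extra >= bytes.Length) return false;

            // Every continuation byte must look like 10xxxxxx.
            for (int k = 1; k <= extra; k++)
            {
                if ((bytes[i + k] & 0xC0) != 0x80) return false;
            }

            i += extra + 1;
        }
        return true;
    }
}
```

For example, the bytes 63 61 66 C3 A9 ("café" in UTF-8) pass, while 63 61 66 E9 (the same word in code page 1252) fail because E9 announces a three-byte sequence that never arrives.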
Then, if the byte stream is not legal UTF-8, you could infer it's ANSI (code page 1252). (Or just bad UTF-8 or something else. Again, only you know if these can be excluded.)
If the byte stream has no byte values over 0x7F, you're in luck - either encoding will work.
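Putting the three cases together, a minimal sketch could look like the following. EncodingGuess and GuessBomlessEncoding are hypothetical names, and the sketch assumes the BOM has already been ruled out, that the whole file is passed in (a partial read could cut a multi-byte character in half), and that code page 1252 is the only alternative to UTF-8. It leans on the framework's strict decoder, UTF8Encoding with throwOnInvalidBytes set to true, rather than hand-rolled validation.

```csharp
using System;
using System.Text;

public static class EncodingGuess
{
    // Hypothetical helper: guesses the encoding of BOM-less file content.
    // Returns an encoding name rather than an Encoding object.
    public static string GuessBomlessEncoding(byte[] bytes)
    {
        // No byte above 0x7F: plain ASCII, so either decoding yields the same text.
        if (Array.TrueForAll(bytes, b => b <= 0x7F))
            return "ascii";

        // Strict decoder: emits no BOM and throws on invalid byte sequences.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(bytes);
            return "utf-8";
        }
        catch (DecoderFallbackException)
        {
            // Not legal UTF-8. Assumption: the only other candidate is code page 1252.
            return "windows-1252";
        }
    }
}
```

A caller might pass File.ReadAllBytes(path) and then pick the Encoding from the returned name. Note that on .NET Core and .NET 5+, Encoding.GetEncoding(1252) only works after the code pages provider from System.Text.Encoding.CodePages has been registered. The result is still a guess: text in code page 1252 that happens to form legal UTF-8 will be reported as "utf-8".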
HTH,
Nick
- Marked as answer by Wednesday, July 9, 2022 9:45 AM
Source: https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/b172cd4d-25fe-4696-8c0f-37226c053d71/how-to-detect-encoding-file-in-ansi-utf8-and-utf8-without-bom
Posted by: brocksucken.blogspot.com