281

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When read, these files sometimes contain garbage, because they were created in a different/unknown codepage.

Is there a way to (automatically) detect the codepage of a text file?

The detectEncodingFromByteOrderMarks option on the StreamReader constructor works for UTF-8 and other Unicode BOM-marked files, but I'm looking for a way to detect codepages like ibm850 or windows1252.


Thanks for your answers; this is what I've done.

The files we receive come from end-users who do not have a clue about codepages. The recipients are also end-users; by now this is all they know about codepages: codepages exist, and they are annoying.

Solution:

  • Open the received file in Notepad and look at a garbled piece of text. If somebody is called François or something similar, your human intelligence can guess it.
  • I've created a small app that the user can open the file with, entering a piece of text that the user knows will appear in the file when the correct codepage is used.
  • Loop through all codepages and display the ones that produce the user-provided text (a sketch of this loop follows the list).
  • If more than one codepage pops up, ask the user to specify more text.
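
For reference, a minimal sketch of that codepage loop (not the actual app; filePath and expectedText stand in for the user-supplied file and search text, and on .NET Core/.NET 5+ the legacy codepages additionally require registering CodePagesEncodingProvider):

using System;
using System.IO;
using System.Text;

class CodepageGuesser
{
    static void Main(string[] args)
    {
        string filePath = args[0];       // placeholder: path of the received file
        string expectedText = args[1];   // placeholder: text the user knows appears in the file

        // On .NET Core / .NET 5+, uncomment to make legacy codepages available:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = File.ReadAllBytes(filePath);

        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            Encoding candidate = info.GetEncoding();
            string decoded = candidate.GetString(raw);

            // A codepage is a candidate if decoding with it produces the expected text.
            if (decoded.Contains(expectedText))
                Console.WriteLine("{0} ({1})", candidate.WebName, candidate.CodePage);
        }
    }
}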

20 Answers


252

You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

Anyway, this is what you need to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.


  • Found it: en.wikipedia.org/wiki/Bush_hid_the_facts - JV.
  • I downvoted this answer for two reasons. First, saying that "you need to be told" is not helpful. Who would tell me, and through what medium would they do so? If I'm the one who saved the file, who would I ask? Myself? Second, the article is not especially helpful as a resource for answering the question. The article is more of a history of encoding written in a David Sedaris style. I appreciate the narrative, but it doesn't simply / directly answer the question. - geneorama
  • @geneorama, I think Joel's article addresses your questions better than I ever could, but here goes... The medium surely depends on the environment in which the text is received. Better that the file (or whatever) contains that information (I'm thinking HTML and XML). Otherwise the person sending the text should be allowed to supply that information. If you were the one who created the file, how can you not know what encoding it uses? - JV.
  • @geneorama, continued... Finally, I suppose the main reason the article doesn't answer the question simply is because there is no simple answer to that question. If the question were "How can I guess..." then I would have answered differently. - JV.
  • @JV I later learned that xml/html can specify character encoding, thanks for mentioning that useful tidbit. - geneorama

30

If you're looking to detect non-UTF encodings (i.e. no BOM), you're basically down to heuristics and statistical analysis of the text. You might want to take a look at the Mozilla paper on universal charset detection (same link, with better formatting via Wayback Machine).


  • Funnily enough my Firefox 3.05 installation detects that page as UTF-8, showing a number of question-mark-in-a-diamond glyphs, although the source has a meta tag for Windows-1252. Manually changing the character encoding shows the document correctly. - devstuff
  • Your sentence "If you're looking to detect non-UTF encodings (i.e. no BOM)" is slightly misleading; the unicode standard does not recommend adding a BOM to utf-8 documents! (and this recommendation, or lack thereof, is the source of many headaches). ref: en.wikipedia.org/wiki/Byte_order_mark#UTF-8 - Tao
  • This is done so you can concatenate UTF-8 strings without accumulating redundant BOMs. Besides, a Byte-Order Mark is not needed for UTF-8, unlike UTF-16 for example. - sashoalm
  • The link is down. - Mateusz Piotrowski

21

Have you tried the C# port of Mozilla Universal Charset Detector?

Example from http://code.google.com/p/ude/

using System;
using System.IO;

public class Program
{
    public static void Main(String[] args)
    {
        string filename = args[0];
        using (FileStream fs = File.OpenRead(filename)) {
            Ude.CharsetDetector cdet = new Ude.CharsetDetector();
            cdet.Feed(fs);
            cdet.DataEnd();
            if (cdet.Charset != null) {
                Console.WriteLine("Charset: {0}, confidence: {1}",
                     cdet.Charset, cdet.Confidence);
            } else {
                Console.WriteLine("Detection failed.");
            }
        }
    }
}


  • Worked flawlessly for Windows-1252 type. - seebiscuit
  • And how can you use it to read a text file to string using that? CharsetDetector returns the name of the encoding in string format and that's it... - Bartosz
  • @Bartosz private Encoding GetEncodingFromString(string encoding) { try { return Encoding.GetEncoding(encoding); } catch { return Encoding.ASCII; } } - PrivatePyle

15

You can't detect the codepage

This is clearly false. Every web browser has some kind of universal charset detector to deal with pages which have no indication whatsoever of an encoding. Firefox has one. You can download the code and see how it does it. See some documentation here. Basically, it is a heuristic, but one that works really well.

Given a reasonable amount of text, it is even possible to detect the language.

Here's another one I just found using Google:


  • "heuristics" - so the browser isn't quite detecting it, it's making an educated guess. "works really well" - so it doesn't work all the time then? Sounds to me like we're in agreement. - JV.
  • The standard for HTML dictates that, if the character set is not defined by the document, then it should be considered to be encoded as UTF-8. - Jon Trauntvein
  • Which is cool unless we're reading non-standard HTML documents. Or non-HTML documents. - Kos
  • This answer is wrong, so I had to downvote. Claiming that it's false that you cannot detect the codepage is wrong. You can guess, and your guesses can be rather good, but you cannot "detect" a codepage. - z80crew
  • @JonTrauntvein According to the HTML5 specs a character encoding declaration is required even if the encoding is US-ASCII – a missing declaration results in using a heuristic algorithm, not in falling back to UTF-8. - z80crew

8

I know it's very late for this question and this solution won't appeal to some (because of its English-centric bias and its lack of statistical/empirical testing), but it has worked very well for me, especially for processing uploaded CSV data:

http://www.architectshack.com/TextFileEncodingDetector.ashx

Advantages:

  • BOM detection built-in
  • Default/fallback encoding customizable
  • pretty reliable (in my experience) for Western European files containing some exotic data (e.g. French names), with a mixture of UTF-8 and Latin-1-style files - basically the bulk of US and Western European environments.

Note: I'm the one who wrote this class, so obviously take it with a grain of salt! :)


7

Notepad++ has this feature out-of-the-box. It also supports changing it.


7

Looking for a different solution, I found that

https://code.google.com/p/ude/

is rather heavyweight.

I needed some basic encoding detection, based on the first 4 bytes and possibly XML charset detection, so I took some sample source code from the internet and added a slightly modified version of

http://lists.w3.org/Archives/Public/www-validator/2002Aug/0084.html

which was written for Java.

    public static Encoding DetectEncoding(byte[] fileContent)
    {
        if (fileContent == null)
            throw new ArgumentNullException();

        if (fileContent.Length < 2)
            return Encoding.ASCII;      // Default fallback

        // FF FE: UTF-16 LE BOM, unless the next two bytes are zero (that would be a UTF-32 LE BOM)
        if (fileContent[0] == 0xff
            && fileContent[1] == 0xfe
            && (fileContent.Length < 4
                || fileContent[2] != 0
                || fileContent[3] != 0
                )
            )
            return Encoding.Unicode;

        // FE FF: UTF-16 BE BOM
        if (fileContent[0] == 0xfe
            && fileContent[1] == 0xff
            )
            return Encoding.BigEndianUnicode;

        if (fileContent.Length < 3)
            return null;

        // EF BB BF: UTF-8 BOM
        if (fileContent[0] == 0xef && fileContent[1] == 0xbb && fileContent[2] == 0xbf)
            return Encoding.UTF8;

        // 2B 2F 76: UTF-7 BOM
        if (fileContent[0] == 0x2b && fileContent[1] == 0x2f && fileContent[2] == 0x76)
            return Encoding.UTF7;

        if (fileContent.Length < 4)
            return null;

        // FF FE 00 00: UTF-32 LE BOM
        if (fileContent[0] == 0xff && fileContent[1] == 0xfe && fileContent[2] == 0 && fileContent[3] == 0)
            return Encoding.UTF32;

        // 00 00 FE FF: UTF-32 BE BOM (codepage 12001)
        if (fileContent[0] == 0 && fileContent[1] == 0 && fileContent[2] == 0xfe && fileContent[3] == 0xff)
            return Encoding.GetEncoding(12001);

        String probe;
        int len = fileContent.Length;

        if( fileContent.Length >= 128 ) len = 128;
        probe = Encoding.ASCII.GetString(fileContent, 0, len);

        MatchCollection mc = Regex.Matches(probe, "^<\\?xml[^<>]*encoding[ \\t\\n\\r]?=[\\t\\n\\r]?['\"]([A-Za-z]([A-Za-z0-9._]|-)*)", RegexOptions.Singleline);
        // Add '[0].Groups[1].Value' to the end to test regex

        if( mc.Count == 1 && mc[0].Groups.Count >= 2 )
        {
            // Typically picks up 'UTF-8' string
            Encoding enc = null;

            try {
                enc = Encoding.GetEncoding( mc[0].Groups[1].Value );
            } catch (Exception) { }

            if( enc != null )
                return enc;
        }

        return Encoding.ASCII;      // Default fallback
    }

Reading just the first 1024 bytes of the file is probably enough, but I'm loading the whole file.
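
A rough usage sketch (the path is a placeholder): read the leading bytes, hand them to DetectEncoding, and fall back to a default when it returns null.

    // Usage sketch: the path is a placeholder; ~1024 bytes are plenty for the checks above.
    string path = @"c:\temp\input.txt";
    byte[] head;
    using (FileStream fs = File.OpenRead(path))
    {
        int len = (int)Math.Min(1024, fs.Length);
        head = new byte[len];
        fs.Read(head, 0, len);
    }

    Encoding enc = DetectEncoding(head) ?? Encoding.Default;
    string text = File.ReadAllText(path, enc);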


5

If someone is looking for a 93.9% solution, this works for me:

public static class StreamExtension
{
    /// <summary>
    /// Convert the content to a string.
    /// </summary>
    /// <param name="stream">The stream.</param>
    /// <returns></returns>
    public static string ReadAsString(this Stream stream)
    {
        var startPosition = stream.Position;
        try
        {
            // 1. Check for a BOM
            // 2. or try with UTF-8. The most (86.3%) used encoding. Visit: http://w3techs.com/technologies/overview/character_encoding/all/
            var streamReader = new StreamReader(stream, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true), detectEncodingFromByteOrderMarks: true);
            return streamReader.ReadToEnd();
        }
        catch (DecoderFallbackException ex)
        {
            stream.Position = startPosition;

            // 3. The second most (6.7%) used encoding is ISO-8859-1. So use Windows-1252 (0.9%, also known as ANSI), which is a superset of ISO-8859-1.
            var streamReader = new StreamReader(stream, Encoding.GetEncoding(1252));
            return streamReader.ReadToEnd();
        }
    }
}


  • Very nice solution. One can easily wrap the body of ReadAsString() in a loop of allowed encodings if more than 2 encodings (UTF-8 and ANSI 1252) should be allowed; see the sketch after these comments. - ViRuSTriNiTy
  • After trying tons of examples, I finally got to yours. I am in a happy place right now. lol Thank!!!!!!! - Sedrick
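
A possible shape for that loop (an untested sketch, not part of the original answer): probe each allowed encoding in strict mode, catch the DecoderFallbackException, and use the last encoding in the list as the permissive fallback.

public static string ReadAsString(Stream stream, params Encoding[] allowedEncodings)
{
    var startPosition = stream.Position;

    // Probe every encoding except the last in strict mode, so invalid bytes throw instead of being replaced.
    for (int i = 0; i < allowedEncodings.Length - 1; i++)
    {
        try
        {
            stream.Position = startPosition;
            var strict = Encoding.GetEncoding(allowedEncodings[i].WebName,
                EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

            // leaveOpen keeps the stream alive for the next attempt.
            using (var reader = new StreamReader(stream, strict,
                detectEncodingFromByteOrderMarks: true, bufferSize: 1024, leaveOpen: true))
            {
                return reader.ReadToEnd();
            }
        }
        catch (DecoderFallbackException)
        {
            // Invalid byte sequence for this candidate; try the next one.
        }
    }

    // The last candidate is used permissively, mirroring the Windows-1252 fallback above.
    stream.Position = startPosition;
    using (var reader = new StreamReader(stream, allowedEncodings[allowedEncodings.Length - 1],
        detectEncodingFromByteOrderMarks: true, bufferSize: 1024, leaveOpen: true))
    {
        return reader.ReadToEnd();
    }
}

Called, for example, as ReadAsString(stream, Encoding.UTF8, Encoding.GetEncoding(1252)).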

4

I've done something similar in Python. Basically, you need lots of sample data from various encodings, which are broken down by a sliding two-byte window and stored in a dictionary (hash), keyed on byte-pairs providing values of lists of encodings.

Given that dictionary (hash), you take your input text and:

  • if it starts with any BOM ('\xfe\xff' for UTF-16-BE, '\xff\xfe' for UTF-16-LE, '\xef\xbb\xbf' for UTF-8, etc.), treat it as suggested
  • if not, take a large enough sample of the text, take all byte-pairs of the sample, and choose the encoding that is the least commonly suggested from the dictionary.

If you've also sampled UTF-encoded texts that do not start with any BOM, the second step will cover those that slipped through the first step.

So far, it works for me (the sample data and subsequent input data are subtitles in various languages) with diminishing error rates.


3

The StreamReader class's constructor takes a 'detect encoding' parameter.
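
For example (a minimal sketch; path stands in for your file name). Note that BOM detection only helps for files that actually begin with a byte order mark:

string path = "input.txt";   // placeholder
// Sketch: StreamReader sniffs the BOM; without one it falls back to UTF-8.
using (var reader = new StreamReader(path, detectEncodingFromByteOrderMarks: true))
{
    string text = reader.ReadToEnd();
    Console.WriteLine(reader.CurrentEncoding);   // only meaningful after the first read
}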


  • It's just "encoding" link here.. and the description says we have to provide the Encoding.. - SurajS
  • @SurajS: Look at the other overloads. - leppie
  • the original author wants to detect the encoding for a file, which would potentially not have the BOM Marker. The StreamReader detects encoding from BOM Header as per signature. public StreamReader( Stream stream, bool detectEncodingFromByteOrderMarks ) - ibondre

3

The tool "uchardet" does this well using character frequency distribution models for each charset. Larger files and more "typical" files have more confidence (obviously).

On ubuntu, you just apt-get install uchardet.

On other systems, get the source, usage & docs here: https://github.com/BYVoid/uchardet
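
To call it from .NET, shelling out is one option (a sketch; it assumes uchardet is installed and on the PATH):

// Sketch: run uchardet and read the charset name it prints to stdout.
string path = "input.txt";   // placeholder
var psi = new System.Diagnostics.ProcessStartInfo("uchardet", "\"" + path + "\"")
{
    RedirectStandardOutput = true,
    UseShellExecute = false
};
using (var process = System.Diagnostics.Process.Start(psi))
{
    string charset = process.StandardOutput.ReadToEnd().Trim();
    process.WaitForExit();
    Console.WriteLine(charset);   // e.g. "UTF-8" or "WINDOWS-1252"
}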


  • On Mac via homebrew: brew install uchardet - Paul B

1

If you can link to a C library, you can use libenca. See http://cihar.com/software/enca/. From the man page:

Enca reads given text files, or standard input when none are given, and uses knowledge about their language (must be supported by you) and a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings.

It's GPL v2.


0

I've got the same problem but haven't found a good solution yet for detecting it automatically. Now I'm using PSPad (www.pspad.com) for that ;) Works fine


0

Since it basically comes down to heuristics, it may help to use the encoding of previously received files from the same source as a first hint.

Most people (or applications) do stuff in pretty much the same order every time, often on the same machine, so it's quite likely that when Bob creates a .csv file and sends it to Mary it'll always be using Windows-1252 or whatever his machine defaults to.

Where possible a bit of customer training never hurts either :-)


0

I was actually looking for a generic, non-programming way of detecting the file encoding, but I haven't found one yet. What I did find by testing with different encodings was that my text was UTF-7.

So where I first was doing: StreamReader file = File.OpenText(fullfilename);

I had to change it to: StreamReader file = new StreamReader(fullfilename, System.Text.Encoding.UTF7);

OpenText assumes it's UTF-8.

You can also create the StreamReader like this: new StreamReader(fullfilename, true), the second parameter meaning that it should try to detect the encoding from the byte-order mark of the file, but that didn't work in my case.


  • Yikes! Who is writing files in UTF-7??? - John Machin
  • @JohnMachin I agree that it is rare, but it is mandated e.g. in some parts of the IMAP protocol. If that's where you are, you would not have to guess, though. - tripleee

0

Open the file in AkelPad (or just copy/paste some garbled text), go to Edit -> Selection -> Recode... -> check "Autodetect".


0

As an addition to ITmeze's post, I've used this function to convert the output of the C# port of Mozilla Universal Charset Detector:

    private Encoding GetEncodingFromString(string codePageName)
    {
        try
        {
            return Encoding.GetEncoding(codePageName);
        }
        catch
        {
            return Encoding.ASCII;
        }
    }

MSDN
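
For example, combined with the Ude detector from the earlier answer (a sketch; the file name is a placeholder):

    // Sketch: detect the charset with Ude, then map the reported name to an Encoding and read the file.
    string filename = @"c:\temp\input.txt";   // placeholder
    Ude.CharsetDetector cdet = new Ude.CharsetDetector();
    using (FileStream fs = File.OpenRead(filename))
    {
        cdet.Feed(fs);
        cdet.DataEnd();
    }

    // GetEncodingFromString falls back to ASCII when cdet.Charset is null or unknown.
    Encoding encoding = GetEncodingFromString(cdet.Charset);
    string content = File.ReadAllText(filename, encoding);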


0

Thanks @Erik Aronesty for mentioning uchardet.

Meanwhile the (same?) tool exists for Linux: chardet.
Or, on Cygwin you may want to use: chardetect.

See: chardet man page: https://www.commandlinux.com/man-page/man1/chardetect.1.html

This will heuristically detect (guess) the character encoding for each given file and will report the name and confidence level for each file's detected character encoding.


0

10 years (!) have passed since this was asked, and still I see no mention of MS's good, non-GPL'ed solution: the IMultiLanguage2 API.

Most libraries already mentioned are based on Mozilla's UDE - and it seems reasonable that browsers have already tackled similar problems. I don't know what Chrome's solution is, but since IE 5.0 MS has released theirs, and it is:

  1. Free of GPL-and-the-like licensing issues,
  2. Backed and maintained probably forever,
  3. Gives rich output - all valid candidates for encoding/codepages along with confidence scores,
  4. Surprisingly easy to use (it is a single function call).

It is a native COM call, but here's some very nice work by Carsten Zeumer that handles the interop mess for .NET usage. There are some other wrappers around, but by and large this library doesn't get the attention it deserves.


-2

I use this code to detect Unicode and the Windows default ANSI codepage when reading a file. For other encodings a check of the content is necessary, manually or programmatically. This can be used to save the text with the same encoding as when it was opened. (I use VB.NET)

'Works for Default and unicode (auto detect)
Dim mystreamreader As New StreamReader(LocalFileName, Encoding.Default) 
MyEditTextBox.Text = mystreamreader.ReadToEnd()
Debug.Print(mystreamreader.CurrentEncoding.CodePage) 'Autodetected encoding
mystreamreader.Close()
