Text encodings


When a computer stores text, it encodes each character as a numeric value and stores the byte (or bytes) associated with that number. When it needs to display or print that character, it consults the encoding scheme to determine which character the number represents.
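This round trip between characters and numbers can be sketched in a few lines of Python (the language choice here is ours, not OxonStat's):

```python
# Storing text: each character becomes a numeric value.
code = ord("A")   # character -> number
print(code)       # 65

# Displaying text: the number is looked up in the encoding scheme.
char = chr(code)  # number -> character
print(char)       # A
```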

Early computers used an encoding scheme called "ASCII", which stands for American Standard Code for Information Interchange. It specifies 128 values, including codes for upper- and lower-case letters, numbers, the common symbols on a keyboard, and some non-visible control codes that were heavily used at the time.
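A quick Python illustration of the ASCII range (again, an illustrative sketch rather than anything OxonStat-specific):

```python
# ASCII defines exactly 128 codes, 0-127.
# Codes 0-31 (plus 127) are non-printing control codes,
# e.g. 9 (tab), 10 (line feed) and 13 (carriage return).
for label, ch in [("line feed", "\n"),
                  ("digit 0", "0"),
                  ("uppercase A", "A"),
                  ("lowercase a", "a")]:
    print(f"{label:12} -> {ord(ch)}")
```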

As computers became more sophisticated and were introduced in non-English speaking countries, the limitations of the ASCII encoding scheme became apparent. It didn’t include codes for accented characters, such as é and ü, and had no chance of handling ideographic languages, such as Japanese or Chinese, which require thousands of characters.

As a result, extensions to the ASCII encoding scheme were developed. However, many groups devised their own schemes with no thought for how they would interact: outside the initial 128 codes, the schemes generally do not agree on which codes represent which letters or symbols. In the US, for example, macOS and Windows use different encodings for these higher values, so the code for é on Windows differs from the code on macOS. Many other encoding schemes for languages that use non-ASCII characters were also developed.
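Python ships codecs for both legacy schemes, so the disagreement is easy to demonstrate (this comparison is our illustration, not part of OxonStat):

```python
# The same character gets different byte values in the legacy
# single-byte encodings used on Windows and on classic Mac OS.
windows_byte = "é".encode("cp1252")     # Windows Western European
mac_byte = "é".encode("mac_roman")      # classic Mac OS Roman
print(windows_byte)  # b'\xe9'  (code 233)
print(mac_byte)      # b'\x8e'  (code 142)
```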

The most modern and general solution to the problem is an encoding called Unicode. It is designed to handle every character in every language, and it lets you mix languages within a single text stream. The emoji you see on your phone are also represented in Unicode. The standard is maintained by an international committee and is extended from time to time as required, which ensures that it can remain standard around the world and cope with any new languages and symbols that come along.
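To see how one Unicode string mixes scripts and emoji, each with a single agreed-upon code point (a Python sketch of our own):

```python
# One string, four writing systems plus an emoji -
# every character has exactly one Unicode code point.
text = "Héllo мир 日本 😀"
for ch in ["é", "м", "日", "😀"]:
    print(ch, hex(ord(ch)))
# é  0xe9
# м  0x43c
# 日 0x65e5
# 😀 0x1f600
```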

OxonStat

Internally, OxonStat uses Unicode (specifically UTF-8) to represent all string data. This allows OxonStat to handle text in any language, including non-Latin scripts such as Cyrillic, Japanese, Chinese and Arabic. Any language can be used for Variable names and in the data for Text variables.

Importing data

When you import data from external file formats, you may find that the file was created with an encoding other than Unicode. In these cases you will need to tell OxonStat which encoding was used when the file was created; it can then perform the conversion required for the text to display correctly. Examples of external files that may require a Text Encoding to be specified include:

Microsoft Excel (.xlsx) files and OxonStat's own format use Unicode encoding, so the text is recognised automatically.

In each case the system will allow you to click the Advanced options button and select an appropriate Text Encoding. If you are unsure of the correct encoding, you can try importing the file with different encodings until you find the one that shows the data correctly. If your data contains only the letters and numbers defined in standard ASCII, you can simply use UTF-8 and the data will be read correctly.
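The trial-and-error step, and why plain ASCII data is safe to read as UTF-8, can be sketched in Python (the file contents and encoding names here are our own illustration):

```python
# Decoding the same bytes with different candidate encodings shows
# why the wrong choice produces garbled text.
data = "café".encode("cp1252")         # file written on Windows
print(data.decode("cp1252"))           # café  (correct guess)
print(data.decode("mac_roman"))        # cafÈ  (wrong guess, garbled)

# Pure ASCII bytes are also valid UTF-8, so an ASCII-only file
# decodes correctly when read as UTF-8.
print(b"plain ascii".decode("utf-8"))  # plain ascii
```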