Character Encoding Research

2024-05-05 14:15:03

Character encoding schemes map characters to sequences of bytes. They underpin both file formats and the way programming languages represent text data.
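As a minimal sketch of what "mapping characters to binary data" means in practice, the following Python snippet encodes one character with three UTF encodings. The helper name encode_all is hypothetical and chosen only for this illustration; the big-endian variants are used so that no BOM is emitted.

```python
def encode_all(text: str) -> dict[str, bytes]:
    """Encode the same text with several UTF encodings (no BOM)."""
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16-be": text.encode("utf-16-be"),
        "utf-32-be": text.encode("utf-32-be"),
    }


# U+00E9 (LATIN SMALL LETTER E WITH ACUTE) in each encoding.
for name, data in encode_all("\u00e9").items():
    print(name, data.hex(" "))
```

The same code point yields two bytes in UTF-8, two in UTF-16, and four in UTF-32, which is why the encoding must be known before the bytes can be read back.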

Character Encoding Schemes

ZhiZe recommends using the Unicode UTF family of encoding schemes:

Unicode

The following schemes are compatible with or closely related to Unicode:

ASCII
ISO 8859
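One way to see the relationship is to check it directly. The sketch below, in Python, assumes two well-known properties: ASCII text has identical bytes in UTF-8, and each ISO 8859-1 byte value equals the Unicode code point it represents. The function names are hypothetical, and only part 1 of ISO 8859 is covered; other parts map their upper range to different code points.

```python
def ascii_matches_utf8(text: str) -> bool:
    """True if the ASCII and UTF-8 encodings of text are byte-identical."""
    return text.encode("ascii") == text.encode("utf-8")


def latin1_code_points() -> list[int]:
    """Decode every possible byte as ISO 8859-1 and return the code points."""
    decoded = bytes(range(256)).decode("iso-8859-1")
    return [ord(ch) for ch in decoded]


print(ascii_matches_utf8("Hello"))
print(latin1_code_points() == list(range(256)))
```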

The following encoding schemes can be replaced by Unicode or are deprecated:

Big5
GB Character Encoding Scheme Series
IBM PC Code Page 437
JIS Character Encoding Scheme Series
KS X Character Encoding Scheme Series
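Replacing one of these schemes usually means transcoding existing data to a UTF encoding. The following is a minimal Python sketch, assuming the source encoding is already known; the helper transcode_to_utf8 is hypothetical, and the codec names (big5, gb18030, cp437, shift_jis, euc_kr) are Python's standard codec names picked as representatives of each family, not a claim about which variant a given file uses.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if data is not valid in source_encoding.
    """
    return data.decode(source_encoding).encode("utf-8")


# Sample legacy-encoded inputs, produced here only for demonstration.
samples = {
    "big5": "中文".encode("big5"),
    "gb18030": "中文".encode("gb18030"),
    "cp437": b"\x82",  # 0x82 is "é" in Code Page 437
    "shift_jis": "日本語".encode("shift_jis"),
    "euc_kr": "한국어".encode("euc_kr"),
}

for encoding, raw in samples.items():
    print(encoding, transcode_to_utf8(raw, encoding).decode("utf-8"))
```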

File Format

Files without an extension are treated as plain text files by default. If you want to add an extension to a plain text file, use .txt.

Plain text files use UTF-8 encoding by default, without a BOM (Byte Order Mark). Files in the other UTF encodings (UTF-16, UTF-32) begin with a BOM by default, which distinguishes them from UTF-8.
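Under this convention, a reader can pick the decoder by inspecting the first bytes of a file. The sketch below is one possible Python implementation, not a definitive detector: the function name detect_utf_variant is hypothetical, and the fallback to UTF-8 when no BOM is found is the assumption stated above. The UTF-32 marks are checked first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Order matters: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_utf_variant(data: bytes) -> str:
    """Return the UTF variant implied by a leading BOM, else 'utf-8'."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return "utf-8"  # assumed default when no BOM is present


print(detect_utf_variant(b"plain text"))
print(detect_utf_variant(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")))
```

A caller would strip the BOM before decoding with the returned codec name, or decode with Python's "utf-16" or "utf-32" codecs, which consume the BOM themselves.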

Character or String Representation in Programming Models

Different programming languages use different encoding schemes. The language specification determines which UTF variant a language uses by default.

Generally speaking, even if a programming language uses UTF-16 or UTF-32 as its default character and string encoding scheme, it does not explicitly add a BOM at the beginning of the character or string data structure. Byte order is instead determined by the language's compiler or runtime environment.
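The difference between in-memory data and serialized data can be illustrated in Python. This is only an analogy: Python's own str type is not defined as UTF-16, so the hypothetical helper native_utf16 below stands in for how a UTF-16 language might lay out code units in memory, using the platform byte order reported by sys.byteorder and no BOM. Python's "utf-16" codec, by contrast, is meant for serialization and prepends a BOM.

```python
import codecs
import sys


def native_utf16(text: str) -> bytes:
    """UTF-16 code units in the platform's byte order, without a BOM."""
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return text.encode(codec)


def serialized_utf16(text: str) -> bytes:
    """UTF-16 as written by Python's 'utf-16' codec: BOM, then code units."""
    return text.encode("utf-16")


in_memory = native_utf16("A")
on_disk = serialized_utf16("A")
print(len(in_memory), len(on_disk))  # the serialized form carries 2 extra BOM bytes
print(on_disk == codecs.BOM_UTF16 + in_memory)
```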


ZhiZe
