Understanding Character Set in Oracle

When we’re woking with multiple character sets (such as GBK and UTF-8), Chinese characters (or other languages) might be displayed in some odd codes. Therefore we need to pay attention to the underlying principles of character sets.

In computing, a character encoding is used to represent a repertoire of characters by some kind of an encoding system (From Wikipedia). The same characters could have different encodings in different character set. For instance, the characters ‘中国’ is stored as ‘d6,d0,b9,fa’ (two bytes for one character) in GBK and ‘e4,b8,ad,e5,9b,bd’ (three bytes for one character) in UTF-8.

Typically, we will use terminal (SecureCRT, Xshell, putty, etc. ) to connect to remote Linux host (using SSH) and interact with the Oracle database server in SQL*Plus. In this case, we should take care of the following character set settings:

windows client OS (where the terminal resides)
open an cmd window, input “chcp” to check code page of the windows client.
code 936 <– This is the code for GBK character set.

Linux OS (where the database server runs on):

Secure CRT (terminal) : Default Settings
DB server character set: AL32UTF8

Most of the time, we can only change settings in terminal and NLS_LANG. When the connection to database server has been established, we could query v$session_connect_info to identify character set used by client:

select osuser, client_charset, client_connection, client_version, client_driver, client_oci_library
from v$session_connect_info
where sid=123 and serial#=19025;

-------------------- ------------------------------ ------------- -------------------- --------- ---------------------------
oracle US7ASCII Homogeneous SQL*PLUS Home-based

We can see that default SecureCRT character setting is US7ASCII.

Now we can test several scenarios to understand how the character sets work.

SecureCRT: Default Settings

SQL> create table t1(a varchar2(32));

Table created.

SQL> insert into t1 values('中国');

1 row created.

SQL> commit;

Commit complete.

SQL> select a,dump(a,1016) from t1;

A DUMP(A,1016)
-------------------------------- ---------------------------------
???? Typ=1 Len=12 CharacterSet=AL32UTF8: ef,bf,bd,ef,bf,bd,ef,bf,bd,ef,bf,bd

— odd code detected, data in database is incorrect
— client character set is US7ASCII, which is different from AL32UTF8, characters converted incorrectly

secureCRT — default or GB2312

— client character set : ZHS16GBK

SQL> select a,dump(a,1016) from t1;

A DUMP(A,1016)
-------------------- ---------------------------------
中国 Typ=1 Len=6 CharacterSet=AL32UTF8: e4,b8,ad,e5,9b,bd

— This is UTF-8 encoding of ‘中国’. character set converted correctly.
secureCRT — UTF8

SQL> select a,dump(a,1016) from t1;

A DUMP(A,1016)
---------------- ---------------------------
中国 Typ=1 Len=9 CharacterSet=AL32UTF8: e6,b6,93,ee,85,9e,e6,b5,97

— conversion happened : ZHS16GBK -> AL32UTF8
— Characters we typed are encoded in UTF-8 format. Since character sets are different between client and server, Oracle treat the UTF-8 formatted code as ZHS16GBK and converted it to the corresponding UTF-8 codes. The ‘中国’characters are displayed currectly simply because NLS_LANG is different from database character set. When we set NLS_LANG to AL32UTF8, we can see that characters stored in database are actually strange codes.

secureCRT — default or GB2312

SQL> select a,dump(a,1016) from t1;

A DUMP(A,1016)
------------- --------------------------------
中国 Typ=1 Len=4 CharacterSet=AL32UTF8: d6,d0,b9,fa

— This is ZHS16GBK code format of ‘中国’.
— Character set was not converted because client character set is consistent with server.
— When the terminal setting is default, the encoding was determined by windows client code page, namely GBK.

secureCRT — UTF8

SQL> select a,dump(a,1016) from t1;

A DUMP(A,1016)
------------ ----------------------------
中国 Typ=1 Len=6 CharacterSet=AL32UTF8: e4,b8,ad,e5,9b,bd

— This is UTF-8 encoding of ‘中国’. This is the perfect case, no conversion, no information loss.


If NLS_LANG is consistent with server character set, characters will not be converted between client and server and vice versa. Encodings will be determined by terminal and client OS settings. Terminal setting enjoys higher priority. When the database character set is UTF-8, the best setting would be to set NLS_LANG and terminal to UTF8 together.



We can also use the following query to determine character setting of the current session:
select userenv(‘LANGUAGE’) from dual;

Please be advised that LANGUAGE returns the language and territory used by the current session along with the database character set. Therefore the character set output here is the DATABASE character set, NOT the CLIENT character set.

In addition, we can use bc to determine the true meaning of a particular encoding. For instance, we can conduct the following steps to translate ‘e4b8ad’ to UTF8 character (please bear in mind your settings of NLS_LANG and terminal):

[oracle@bcmsdbs01 ~]$ bc
bc 1.06.95
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
ibase=16 <-- input base is 16. default output base is 10
E4B8AD <-- All the characters are in upper case
[oracle@bcmsdbs01 ~]$ exit

SQL> select chr(14989485) from dual;

中 <-- We can see that the encoding is actually chinese character '中'

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s