Monday, September 19, 2011

UNICODE: CODE POINT IN SQL, PL/SQL

Unicode

Code Points in SQL or PL/SQL

If you have ever processed text strings in PL/SQL you may have found the need for a function that returns a character's Unicode code point value. E.g. given an 'A' the function should return a 65 because 'A' is the 65th character in the Unicode code chart.
The following table of test values gives some examples of characters in various planes of the Unicode codespace and the desired code point values we would want from a code point function.
create table test_values
( input          varchar2(10)
, note           varchar2(25)
, decimal_output integer
, hex_output     varchar2(10)
) ;

insert into test_values values ( unistr('\0041'      ), 'LATIN CAPITAL LETTER A'  , 65    , '41'    );
insert into test_values values ( unistr('\00C6'      ), 'LATIN CAPITAL LETTER AE' , 198   , 'C6'    );
insert into test_values values ( unistr('\27A8'      ), 'BLACK RIGHTWARDS ARROW'  , 10152 , '27A8'  );
insert into test_values values ( unistr('\BBBB'      ), 'HANGUL SYLLABLE'         , 48059 , 'BBBB'  );
insert into test_values values ( unistr('\D83C\DF55' ), 'SLICE OF PIZZA (Plane 1)', 127829, '1F355' );
insert into test_values values ( unistr('\D86D\DF40' ), 'CJK IDEOGRAPH (Plane 2)' , 177984, '2B740' );
insert into test_values values ( null                 , 'NULL'                    , null  , null    );

commit;

set null "(null)"
set linesize 120

column hex_output format a18

select * from test_values ;

INPUT      NOTE                      DECIMAL_OUTPUT     HEX_OUTPUT
---------- ------------------------- ------------------ ------------------
A          LATIN CAPITAL LETTER A                    65 41
Æ          LATIN CAPITAL LETTER AE                  198 C6
➨          BLACK RIGHTWARDS ARROW                 10152 27A8
뮻          HANGUL SYLLABLE                        48059 BBBB
         SLICE OF PIZZA (Plane 1)              127829 1F355
띀         CJK IDEOGRAPH (Plane 2)               177984 2B740
(null)     NULL                      (null)             (null)
Note: Your browser may not render all the characters in the INPUT column correctly.

DUMP, ASCII, ASCIISTR

Unfortunately Oracle doesn't supply a SQL or PL/SQL function that fits the bill (as of 11gR2 at least). Functions like DUMP, ASCII, and ASCIISTR threaten to be useful in crafting a custom solution but the queries below demonstrate that none of them can be relied upon to handle the entirety of the Unicode codespace, including supplementary characters, consistently across different database character sets.
column dump format a30
column ascii format 99999999999
column asciistr format a10

select
  note
, decimal_output
, hex_output
, dump( input ) as "DUMP"
, ascii( input ) as "ASCII"
, asciistr( input ) as "ASCIISTR"
from test_values ;

------------------------------------------------------------
-- sample output from a db with NLS_CHARACTERSET=AL32UTF8
------------------------------------------------------------

NOTE                      DECIMAL_OUTPUT HEX_OUTPUT DUMP                            ASCII ASCIISTR
------------------------- -------------- ---------- ------------------------ ------------ ----------
LATIN CAPITAL LETTER A                65 41         Typ=1 Len=1: 65                    65 A
LATIN CAPITAL LETTER AE              198 C6         Typ=1 Len=2: 195,134            50054 \00C6
BLACK RIGHTWARDS ARROW             10152 27A8       Typ=1 Len=3: 226,158,168     14851752 \27A8
HANGUL SYLLABLE                    48059 BBBB       Typ=1 Len=3: 235,174,187     15445691 \BBBB
SLICE OF PIZZA (Plane 1)          127829 1F355      Typ=1 Len=4: 240,159,141   4036988309 \D83C\DF55
                                                    ,149
CJK IDEOGRAPH (Plane 2)           177984 2B740      Typ=1 Len=4: 240,171,157   4037778816 \D86D\DF40
                                                    ,128
NULL                      (null)         (null)     NULL                     (null)       (null)

7 rows selected.
------------------------------------------------------------
-- sample output from a db with NLS_CHARACTERSET=UTF8
------------------------------------------------------------

NOTE                      DECIMAL_OUTPUT HEX_OUTPUT DUMP                            ASCII ASCIISTR
------------------------- -------------- ---------- ------------------------ ------------ ----------
LATIN CAPITAL LETTER A                65 41         Typ=1 Len=1: 65                    65 A
LATIN CAPITAL LETTER AE              198 C6         Typ=1 Len=2: 195,134            50054 \00C6
BLACK RIGHTWARDS ARROW             10152 27A8       Typ=1 Len=3: 226,158,168     14851752 \27A8
HANGUL SYLLABLE                    48059 BBBB       Typ=1 Len=3: 235,174,187     15445691 \BBBB
SLICE OF PIZZA (Plane 1)          127829 1F355      Typ=1 Len=6: 237,160,188     15573180 \D83C\DF55
                                                    ,237,189,149
CJK IDEOGRAPH (Plane 2)           177984 2B740      Typ=1 Len=6: 237,161,173     15573421 \D86D\DF40
                                                    ,237,189,128
NULL                                                NULL

7 rows selected.
The results show that both DUMP and ASCII return different values for databases using different character sets. As well their return values are far removed from the code point values we're looking for.
ASCIISTR produces consistent values across character sets but only returns code point values for non-ASCII Plane 0 characters plus the character "\".
Thus, these functions offer little help in meeting our requirement. Let's consider some alternatives.

XQuery fn:string_to_codepoints()

While there appear to be no SQL or PL/SQL functions that return a character's code point value other languages available in Oracle do have such functions. XQuery's fn:string_to_codepoints() function is one example. (XQuery is available in Oracle releases 10gR2 and higher, but not Oracle XE.)
column xquery_string_to_codepoints format a30

select
  note
, decimal_output
, xmlquery
  ( 'fn:string-to-codepoints( $p_input )'
    passing input as "p_input"
    returning content
  ).getstringval()
  as xquery_string_to_codepoints
from test_values ;

NOTE                      DECIMAL_OUTPUT XQUERY_STRING_TO_CODEPOINTS
------------------------- -------------- ------------------------------
LATIN CAPITAL LETTER A                65 65
LATIN CAPITAL LETTER AE              198 198
BLACK RIGHTWARDS ARROW             10152 10152
HANGUL SYLLABLE                    48059 48059
SLICE OF PIZZA (Plane 1)          127829 127829
CJK IDEOGRAPH (Plane 2)           177984 177984
NULL                      (null)         (null)
The results are the same on databases with AL32UTF8 and UTF8 character sets.
Note that fn:string_to_codepoints() is only available via the XMLQUERY command in SQL statements or via the XQUERY SQL*Plus command. If we wanted to process hundreds or thousands of individual characters in a PL/SQL block we would need to issue hundreds or thousands of SELECT XMLQUERY ... FROM DUAL commands to use fn:string_to_codepoints(). From a performance perspective this is not desireable so let's keep looking at alternatives.

Java codePointAt()

Java has a method called codePointAt() which we can leverage as follows. (Note that Java is not available in Oracle XE.)
create or replace and compile java source named my_java as
public class my_java
{
  public static int codepoint( String p_string )
  {
    if ( p_string == null )
    { return -1 ; }
    else
    { return p_string.codePointAt(0) ; }
  }
};
/
show errors

create or replace function javacodepoint
  ( p_string in varchar2 )
  return pls_integer
  -- result_cache -- uncomment this line to use the PL/SQL result cache
  as language java
  name 'my_java.codepoint( java.lang.String ) return java.lang.int'
;
/
show errors

column javacodepoint format 999999999999999

select
  note
, decimal_output
, javacodepoint( input ) as javacodepoint
from test_values ;

NOTE                      DECIMAL_OUTPUT    JAVACODEPOINT
------------------------- -------------- ----------------
LATIN CAPITAL LETTER A                65               65
LATIN CAPITAL LETTER AE              198              198
BLACK RIGHTWARDS ARROW             10152            10152
HANGUL SYLLABLE                    48059            48059
SLICE OF PIZZA (Plane 1)          127829           127829
CJK IDEOGRAPH (Plane 2)           177984           177984
NULL                      (null)                       -1
The results are the same on databases with AL32UTF8 and UTF8 character sets.
Note that the "int" datatype in Java does not allow nulls so the sample code above returns -1 for a null input.
Now that we have a solution that we can use in PL/SQL which doesn't require a SQL context switch it would be tempting to stop here. I'm a bit of a PL/SQL purist though and prefer pure SQL or PL/SQL solutions whenever possible since they typically run faster, leaner, and scale better than, for lack of a better word, impure solutions. They also don't require technologies that may not be available on all Oracle versions. So let's forge ahead and try one final alternative.

Custom PL/SQL Solution - UNICODEPOINT()

With the help of a built-in character conversion utility called utl_i18n.string_to_raw and some binary arithmetic courtesy of the Unicode FAQ for "UTF-8, UTF-16, UTF-32 & BOM we can craft a custom function called UNICODEPOINT that gives us a pure PL/SQL solution.
set scan off

create function unicodepoint ( p_char in varchar2 )
  return pls_integer
  -- result_cache -- uncomment this line to use the PL/SQL result cache
is
  ----------------------------------------------------------------------
  -- Notes:
  --
  -- This function contains logic adapted from the Unicode FAQ for 
  -- "UTF-8, UTF-16, UTF-32 & BOM" at
  -- 
  --   http://unicode.org/faq/utf_bom.html
  --
  -- See these questions in the FAQ.
  --
  --  What's the algorithm to convert from UTF-16 to character codes?
  --  Isn't there a simpler way to do this?
  --
  -- Some of the expressions in the FAQ use the C << left shift operator
  -- on 16 bit numbers within a 32 bit register, e.g. (0xD800 << 10).
  -- Since PL/SQL doesn't have a "<<" operator we will instead
  -- multiply the 16 bit values by 2**10 (i.e. 1024), which is
  -- equivalent to a left shift by 10 bits.
  ----------------------------------------------------------------------

  l_first_char constant varchar2(2 char) := substrc( p_char, 1, 1 );
  l_raw_char   constant raw(8)           := utl_i18n.string_to_raw( l_first_char, 'AL16UTF16' ) ;
  l_hex_val    constant varchar2(8)      := rawtohex( l_raw_char ) ;

  l_hex_lead   constant char(4)          := substr( l_hex_val, 1, 4 );
  l_hex_trail  constant char(4)          := substr( l_hex_val, 5, 4 );

  l_dec_lead   constant pls_integer := to_number( l_hex_lead , 'XXXX' );
  l_dec_trail  constant pls_integer := to_number( l_hex_trail, 'XXXX' );

  ----------------------------------------------------------------------
  -- the surrogate offset value below was derived from the following
  -- C expression in the UTF-16 FAQ mentioned above
  --
  --   ( 0x10000 - (0xD800 << 10) - 0xDC00 )
  ----------------------------------------------------------------------

  l_dec_surrogate_offset constant pls_integer := -56613888 ;

begin

  if p_char is null then return( null ); end if ;

  if l_hex_trail is null then

    return( l_dec_lead );

  else

    ----------------------------------------------------------------------
    -- the expression in the return() argument was derived from the
    -- following C expression in the UTF-16 FAQ mentioned above
    --
    --   (lead << 10) + trail + SURROGATE_OFFSET;
    ----------------------------------------------------------------------

    return( l_dec_lead * 1024 + l_dec_trail + l_dec_surrogate_offset );

  end if;

end ;
/
show errors

column unicodepoint_in_hex format a20

select
  note
, decimal_output
, unicodepoint( input ) as unicodepoint
, hex_output
, to_char( unicodepoint( input ), 'FMXXXXX' ) as unicodepoint_in_hex
from test_values ;

NOTE                      DECIMAL_OUTPUT UNICODEPOINT HEX_OUTPUT UNICODEPOINT_IN_HEX
------------------------- -------------- ------------ ---------- --------------------
LATIN CAPITAL LETTER A                65           65 41         41
LATIN CAPITAL LETTER AE              198          198 C6         C6
BLACK RIGHTWARDS ARROW             10152        10152 27A8       27A8
HANGUL SYLLABLE                    48059        48059 BBBB       BBBB
SLICE OF PIZZA (Plane 1)          127829       127829 1F355      1F355
CJK IDEOGRAPH (Plane 2)           177984       177984 2B740      2B740
NULL                      (null)         (null)       (null)     (null)
The results are the same on databases with AL32UTF8 and UTF8 character sets. Unfortunately I don't have access to databases that use other character sets and was unable to test it there. I suspect the results will be unchanged. If you plan to use UNICODEPOINT on such databases be sure to test the code thoroughly.

Comparison

I was curious to see how the three solutions above stacked up against each other. Here are the results of running them through a test harness. In my tests each solution was used in an anonymous PL/SQL block to find the code points for 5,120 different characters (4,096 Plane 0 characters and 1024 Plane 1 characters). Here are the results.
METRIC_NAME               fn:string-to-codepoints() JAVACODEPOINT UNICODEPOINT
------------------------- ------------------------- ------------- ------------
CPU used by this session  122                       37            13
CPU used when call starte 122                       36            13
DB time                   428                       206           58
Elapsed Time (1/100 sec)  443                       237           65
Latch Gets Total          2,812                     116,999       1,842
bytes received via SQL*Ne 1,096                     768           765
calls to get snapshot scn 6,029                     4             2
enqueue releases          9                         12,056        8
enqueue requests          13                        12,060        12
execute count             6,028                     4             3
java call heap collected  0                         175,584       0
java call heap collected  0                         2,604         0
java call heap object cou 0                         2,604         0
java call heap object cou 0                         2,604         0
java call heap total size 0                         786,432       0
java call heap total size 0                         786,432       0
java call heap used size  0                         175,584       0
java call heap used size  0                         175,584       0
java session heap live si 0                         106,496       0
java session heap live si 0                         106,496       0
java session heap used si 0                         106,496       0
java session heap used si 0                         106,496       0
non-idle wait time        102                       7             2
opened cursors cumulative 6,029                     7             3
parse time elapsed        283                       187           51
recursive calls           6,040                     18            4
session cursor cache hits 6,022                     0             0
session pga memory        327,680                   2,949,120     0
session pga memory max    327,680                   2,949,120     131,072
session uga memory        196,548                   0             0
session uga memory max    196,548                   123,452       123,452
The results confirmed my suspicion that a pure PL/SQL solution would run the fastest, use the least amount of memory, and scale the best (i.e. use the fewest latches) of the three solutions.
I hope you find the UNICODEPOINT() function as useful as I do.

Cleanup

drop table test_values ;

drop java source my_java ;

drop function javacodepoint ;

drop function unicodepoint ;

No comments:

Post a Comment

Oracle Escape Characters

Oracle allows the assignment of special escape characters to tell Oracle that the character is interpreted literally.  Certain characters ...