Unicode
Code Points in SQL or PL/SQL
If you have ever processed text strings in PL/SQL you may have found the need for a function that returns a character's Unicode code point value. E.g. given an 'A' the function should return a 65 because 'A' is the 65th character in the Unicode code chart.
The following table of test values gives some examples of characters in various planes of the Unicode codespace and the desired code point values we would want from a code point function.
ASCIISTR produces consistent values across character sets but only returns code point values for non-ASCII Plane 0 characters plus the character "\".
Thus, these functions offer little help in meeting our requirement. Let's consider some alternatives.
Note that fn:string_to_codepoints() is only available via the XMLQUERY command in SQL statements or via the XQUERY SQL*Plus command. If we wanted to process hundreds or thousands of individual characters in a PL/SQL block we would need to issue hundreds or thousands of SELECT XMLQUERY ... FROM DUAL commands to use fn:string_to_codepoints(). From a performance perspective this is not desireable so let's keep looking at alternatives.
Note that the "int" datatype in Java does not allow nulls so the sample code above returns -1 for a null input.
Now that we have a solution that we can use in PL/SQL which doesn't require a SQL context switch it would be tempting to stop here. I'm a bit of a PL/SQL purist though and prefer pure SQL or PL/SQL solutions whenever possible since they typically run faster, leaner, and scale better than, for lack of a better word, impure solutions. They also don't require technologies that may not be available on all Oracle versions. So let's forge ahead and try one final alternative.
I hope you find the UNICODEPOINT() function as useful as I do.
The following table of test values gives some examples of characters in various planes of the Unicode codespace and the desired code point values we would want from a code point function.
create table test_values
( input varchar2(10)
, note varchar2(25)
, decimal_output integer
, hex_output varchar2(10)
) ;
insert into test_values values ( unistr('\0041' ), 'LATIN CAPITAL LETTER A' , 65 , '41' );
insert into test_values values ( unistr('\00C6' ), 'LATIN CAPITAL LETTER AE' , 198 , 'C6' );
insert into test_values values ( unistr('\27A8' ), 'BLACK RIGHTWARDS ARROW' , 10152 , '27A8' );
insert into test_values values ( unistr('\BBBB' ), 'HANGUL SYLLABLE' , 48059 , 'BBBB' );
insert into test_values values ( unistr('\D83C\DF55' ), 'SLICE OF PIZZA (Plane 1)', 127829, '1F355' );
insert into test_values values ( unistr('\D86D\DF40' ), 'CJK IDEOGRAPH (Plane 2)' , 177984, '2B740' );
insert into test_values values ( null , 'NULL' , null , null );
commit;
set null "(null)"
set linesize 120
column hex_output format a18
select * from test_values ;
INPUT NOTE DECIMAL_OUTPUT HEX_OUTPUT
---------- ------------------------- ------------------ ------------------
A LATIN CAPITAL LETTER A 65 41
Æ LATIN CAPITAL LETTER AE 198 C6
➨ BLACK RIGHTWARDS ARROW 10152 27A8
뮻 HANGUL SYLLABLE 48059 BBBB
SLICE OF PIZZA (Plane 1) 127829 1F355
띀 CJK IDEOGRAPH (Plane 2) 177984 2B740
(null) NULL (null) (null)
Note: Your browser may not render all the characters in the INPUT column correctly.DUMP, ASCII, ASCIISTR
Unfortunately Oracle doesn't supply a SQL or PL/SQL function that fits the bill (as of 11gR2 at least). Functions like DUMP, ASCII, and ASCIISTR threaten to be useful in crafting a custom solution but the queries below demonstrate that none of them can be relied upon to handle the entirety of the Unicode codespace, including supplementary characters, consistently across different database character sets.column dump format a30
column ascii format 99999999999
column asciistr format a10
select
note
, decimal_output
, hex_output
, dump( input ) as "DUMP"
, ascii( input ) as "ASCII"
, asciistr( input ) as "ASCIISTR"
from test_values ;
------------------------------------------------------------
-- sample output from a db with NLS_CHARACTERSET=AL32UTF8
------------------------------------------------------------
NOTE DECIMAL_OUTPUT HEX_OUTPUT DUMP ASCII ASCIISTR
------------------------- -------------- ---------- ------------------------ ------------ ----------
LATIN CAPITAL LETTER A 65 41 Typ=1 Len=1: 65 65 A
LATIN CAPITAL LETTER AE 198 C6 Typ=1 Len=2: 195,134 50054 \00C6
BLACK RIGHTWARDS ARROW 10152 27A8 Typ=1 Len=3: 226,158,168 14851752 \27A8
HANGUL SYLLABLE 48059 BBBB Typ=1 Len=3: 235,174,187 15445691 \BBBB
SLICE OF PIZZA (Plane 1) 127829 1F355 Typ=1 Len=4: 240,159,141 4036988309 \D83C\DF55
,149
CJK IDEOGRAPH (Plane 2) 177984 2B740 Typ=1 Len=4: 240,171,157 4037778816 \D86D\DF40
,128
NULL (null) (null) NULL (null) (null)
7 rows selected.
------------------------------------------------------------
-- sample output from a db with NLS_CHARACTERSET=UTF8
------------------------------------------------------------
NOTE DECIMAL_OUTPUT HEX_OUTPUT DUMP ASCII ASCIISTR
------------------------- -------------- ---------- ------------------------ ------------ ----------
LATIN CAPITAL LETTER A 65 41 Typ=1 Len=1: 65 65 A
LATIN CAPITAL LETTER AE 198 C6 Typ=1 Len=2: 195,134 50054 \00C6
BLACK RIGHTWARDS ARROW 10152 27A8 Typ=1 Len=3: 226,158,168 14851752 \27A8
HANGUL SYLLABLE 48059 BBBB Typ=1 Len=3: 235,174,187 15445691 \BBBB
SLICE OF PIZZA (Plane 1) 127829 1F355 Typ=1 Len=6: 237,160,188 15573180 \D83C\DF55
,237,189,149
CJK IDEOGRAPH (Plane 2) 177984 2B740 Typ=1 Len=6: 237,161,173 15573421 \D86D\DF40
,237,189,128
NULL NULL
7 rows selected.
The results show that both DUMP and ASCII return different values for databases using different character sets. As well their return values are far removed from the code point values we're looking for. ASCIISTR produces consistent values across character sets but only returns code point values for non-ASCII Plane 0 characters plus the character "\".
Thus, these functions offer little help in meeting our requirement. Let's consider some alternatives.
XQuery fn:string_to_codepoints()
While there appear to be no SQL or PL/SQL functions that return a character's code point value other languages available in Oracle do have such functions. XQuery's fn:string_to_codepoints() function is one example. (XQuery is available in Oracle releases 10gR2 and higher, but not Oracle XE.)column xquery_string_to_codepoints format a30
select
note
, decimal_output
, xmlquery
( 'fn:string-to-codepoints( $p_input )'
passing input as "p_input"
returning content
).getstringval()
as xquery_string_to_codepoints
from test_values ;
NOTE DECIMAL_OUTPUT XQUERY_STRING_TO_CODEPOINTS
------------------------- -------------- ------------------------------
LATIN CAPITAL LETTER A 65 65
LATIN CAPITAL LETTER AE 198 198
BLACK RIGHTWARDS ARROW 10152 10152
HANGUL SYLLABLE 48059 48059
SLICE OF PIZZA (Plane 1) 127829 127829
CJK IDEOGRAPH (Plane 2) 177984 177984
NULL (null) (null)
The results are the same on databases with AL32UTF8 and UTF8 character sets.Note that fn:string_to_codepoints() is only available via the XMLQUERY command in SQL statements or via the XQUERY SQL*Plus command. If we wanted to process hundreds or thousands of individual characters in a PL/SQL block we would need to issue hundreds or thousands of SELECT XMLQUERY ... FROM DUAL commands to use fn:string_to_codepoints(). From a performance perspective this is not desireable so let's keep looking at alternatives.
Java codePointAt()
Java has a method called codePointAt() which we can leverage as follows. (Note that Java is not available in Oracle XE.)create or replace and compile java source named my_java as
public class my_java
{
public static int codepoint( String p_string )
{
if ( p_string == null )
{ return -1 ; }
else
{ return p_string.codePointAt(0) ; }
}
};
/
show errors
create or replace function javacodepoint
( p_string in varchar2 )
return pls_integer
-- result_cache -- uncomment this line to use the PL/SQL result cache
as language java
name 'my_java.codepoint( java.lang.String ) return java.lang.int'
;
/
show errors
column javacodepoint format 999999999999999
select
note
, decimal_output
, javacodepoint( input ) as javacodepoint
from test_values ;
NOTE DECIMAL_OUTPUT JAVACODEPOINT
------------------------- -------------- ----------------
LATIN CAPITAL LETTER A 65 65
LATIN CAPITAL LETTER AE 198 198
BLACK RIGHTWARDS ARROW 10152 10152
HANGUL SYLLABLE 48059 48059
SLICE OF PIZZA (Plane 1) 127829 127829
CJK IDEOGRAPH (Plane 2) 177984 177984
NULL (null) -1
The results are the same on databases with AL32UTF8 and UTF8 character sets.Note that the "int" datatype in Java does not allow nulls so the sample code above returns -1 for a null input.
Now that we have a solution that we can use in PL/SQL which doesn't require a SQL context switch it would be tempting to stop here. I'm a bit of a PL/SQL purist though and prefer pure SQL or PL/SQL solutions whenever possible since they typically run faster, leaner, and scale better than, for lack of a better word, impure solutions. They also don't require technologies that may not be available on all Oracle versions. So let's forge ahead and try one final alternative.
Custom PL/SQL Solution - UNICODEPOINT()
With the help of a built-in character conversion utility called utl_i18n.string_to_raw and some binary arithmetic courtesy of the Unicode FAQ for "UTF-8, UTF-16, UTF-32 & BOM we can craft a custom function called UNICODEPOINT that gives us a pure PL/SQL solution.set scan off
create function unicodepoint ( p_char in varchar2 )
return pls_integer
-- result_cache -- uncomment this line to use the PL/SQL result cache
is
----------------------------------------------------------------------
-- Notes:
--
-- This function contains logic adapted from the Unicode FAQ for
-- "UTF-8, UTF-16, UTF-32 & BOM" at
--
-- http://unicode.org/faq/utf_bom.html
--
-- See these questions in the FAQ.
--
-- What's the algorithm to convert from UTF-16 to character codes?
-- Isn't there a simpler way to do this?
--
-- Some of the expressions in the FAQ use the C << left shift operator
-- on 16 bit numbers within a 32 bit register, e.g. (0xD800 << 10).
-- Since PL/SQL doesn't have a "<<" operator we will instead
-- multiply the 16 bit values by 2**10 (i.e. 1024), which is
-- equivalent to a left shift by 10 bits.
----------------------------------------------------------------------
l_first_char constant varchar2(2 char) := substrc( p_char, 1, 1 );
l_raw_char constant raw(8) := utl_i18n.string_to_raw( l_first_char, 'AL16UTF16' ) ;
l_hex_val constant varchar2(8) := rawtohex( l_raw_char ) ;
l_hex_lead constant char(4) := substr( l_hex_val, 1, 4 );
l_hex_trail constant char(4) := substr( l_hex_val, 5, 4 );
l_dec_lead constant pls_integer := to_number( l_hex_lead , 'XXXX' );
l_dec_trail constant pls_integer := to_number( l_hex_trail, 'XXXX' );
----------------------------------------------------------------------
-- the surrogate offset value below was derived from the following
-- C expression in the UTF-16 FAQ mentioned above
--
-- ( 0x10000 - (0xD800 << 10) - 0xDC00 )
----------------------------------------------------------------------
l_dec_surrogate_offset constant pls_integer := -56613888 ;
begin
if p_char is null then return( null ); end if ;
if l_hex_trail is null then
return( l_dec_lead );
else
----------------------------------------------------------------------
-- the expression in the return() argument was derived from the
-- following C expression in the UTF-16 FAQ mentioned above
--
-- (lead << 10) + trail + SURROGATE_OFFSET;
----------------------------------------------------------------------
return( l_dec_lead * 1024 + l_dec_trail + l_dec_surrogate_offset );
end if;
end ;
/
show errors
column unicodepoint_in_hex format a20
select
note
, decimal_output
, unicodepoint( input ) as unicodepoint
, hex_output
, to_char( unicodepoint( input ), 'FMXXXXX' ) as unicodepoint_in_hex
from test_values ;
NOTE DECIMAL_OUTPUT UNICODEPOINT HEX_OUTPUT UNICODEPOINT_IN_HEX
------------------------- -------------- ------------ ---------- --------------------
LATIN CAPITAL LETTER A 65 65 41 41
LATIN CAPITAL LETTER AE 198 198 C6 C6
BLACK RIGHTWARDS ARROW 10152 10152 27A8 27A8
HANGUL SYLLABLE 48059 48059 BBBB BBBB
SLICE OF PIZZA (Plane 1) 127829 127829 1F355 1F355
CJK IDEOGRAPH (Plane 2) 177984 177984 2B740 2B740
NULL (null) (null) (null) (null)
The results are the same on databases with AL32UTF8 and UTF8 character sets. Unfortunately I don't have access to databases that use other character sets and was unable to test it there. I suspect the results will be unchanged. If you plan to use UNICODEPOINT on such databases be sure to test the code thoroughly.Comparison
I was curious to see how the three solutions above stacked up against each other. Here are the results of running them through a test harness. In my tests each solution was used in an anonymous PL/SQL block to find the code points for 5,120 different characters (4,096 Plane 0 characters and 1024 Plane 1 characters). Here are the results.METRIC_NAME fn:string-to-codepoints() JAVACODEPOINT UNICODEPOINT
------------------------- ------------------------- ------------- ------------
CPU used by this session 122 37 13
CPU used when call starte 122 36 13
DB time 428 206 58
Elapsed Time (1/100 sec) 443 237 65
Latch Gets Total 2,812 116,999 1,842
bytes received via SQL*Ne 1,096 768 765
calls to get snapshot scn 6,029 4 2
enqueue releases 9 12,056 8
enqueue requests 13 12,060 12
execute count 6,028 4 3
java call heap collected 0 175,584 0
java call heap collected 0 2,604 0
java call heap object cou 0 2,604 0
java call heap object cou 0 2,604 0
java call heap total size 0 786,432 0
java call heap total size 0 786,432 0
java call heap used size 0 175,584 0
java call heap used size 0 175,584 0
java session heap live si 0 106,496 0
java session heap live si 0 106,496 0
java session heap used si 0 106,496 0
java session heap used si 0 106,496 0
non-idle wait time 102 7 2
opened cursors cumulative 6,029 7 3
parse time elapsed 283 187 51
recursive calls 6,040 18 4
session cursor cache hits 6,022 0 0
session pga memory 327,680 2,949,120 0
session pga memory max 327,680 2,949,120 131,072
session uga memory 196,548 0 0
session uga memory max 196,548 123,452 123,452
The results confirmed my suspicion that a pure PL/SQL solution would run the fastest, use the least amount of memory, and scale the best (i.e. use the fewest latches) of the three solutions. I hope you find the UNICODEPOINT() function as useful as I do.
Cleanup
drop table test_values ;
drop java source my_java ;
drop function javacodepoint ;
drop function unicodepoint ;
No comments:
Post a Comment