Monday, September 19, 2011

Regular Expression

Regular Expressions

Regular expression support for pattern matching was introduced in Oracle 10g. It offers a powerful alternative to older, less robust pattern matching features such as the LIKE operator. Oracle regular expression support is based on the POSIX standard, but it also has some non-POSIX extensions similar to those used in the PERL programming language.
In the first four tutorials the Oracle conditions and functions that support regular expressions will be introduced. Since all Oracle regular expression features recognize the same regular expression patterns, these four tutorials will not attempt to explore regular expression syntax in detail. This will be done in the remaining tutorials which discuss four different aspects of regular expression pattern matching.
  1. WHAT single piece of text are we trying to match?
  2. HOW MANY repetitions of that text are we trying to match?
  3. WHERE in the source string should we look for a match?
  4. WHICH occurrence of the text do we want to match?




REGEXP_LIKE
The REGEXP_LIKE pattern matching condition is similar to the LIKE condition. Whereas LIKE only performs simple pattern matching using the wildcards "%" and "_", REGEXP_LIKE performs complex regular expression pattern matching allowing it to match a much greater range of string patterns than LIKE. The following examples compare pattern matching with both LIKE and REGEXP_LIKE.
Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
execute set_target( 'aaa'  )

execute add_target( 'abc'  )
execute add_target( 'bbb'  )
execute add_target( 'bcd'  )
execute add_target( 'cde'  )
execute add_target( 'def'  )
execute add_target( 'efg'  )

column target heading "TARGET|STRING"

LIKE vs REGEXP_LIKE

Matching an entire string is performed as follows.
select target
from   targets
where  target LIKE 'abc' ;

TARGET
STRING
----------
abc

select target
from   targets
where  REGEXP_LIKE( target, '^abc$' ) ;

TARGET
STRING
----------
abc

This is how to match a string anywhere in the target.
select target
from   targets
where  target LIKE '%bc%' ;

TARGET
STRING
----------
abc
bcd

select target
from   targets
where  REGEXP_LIKE( target, 'bc' ) ;

TARGET
STRING
----------
abc
bcd

These examples show how to match a string at the start of the target.
select target
from   targets
where  target LIKE 'b%' ;

TARGET
STRING
----------
bbb
bcd

select target
from   targets
where  REGEXP_LIKE( target, '^b' ) ;

TARGET
STRING
----------
bbb
bcd

These examples show how to match a string at the end of the target.
select target
from   targets
where  target LIKE '%b' ;

TARGET
STRING
----------
bbb

select target
from   targets
where  REGEXP_LIKE( target, 'b$' ) ;

TARGET
STRING                    
----------
bbb

To match any single character do the following.
select target
from   targets
where  target LIKE 'a_c' ;

TARGET
STRING
----------
abc

select target
from   targets
where  REGEXP_LIKE( target, '^a.c$' ) ;

TARGET
STRING
----------
abc

One example of a complex pattern that REGEXP_LIKE can handle but a single, basic LIKE condition cannot (without using text manipulation functions) is a list of characters. In this example we search for strings that have either an "a", "c", or "f" as the second character of the string.
--
-- not possible
--

select target
from   targets
where  REGEXP_LIKE( target, '^.[acf]' ) ;

TARGET
STRING
----------
aaa
bcd
efg


REGEXP_INSTR
The REGEXP_INSTR function is similar to the INSTR function except that it finds the location of a pattern of characters, specified with a regular expression, instead of finding the location of a simple string literal. The following examples compare searches using INSTR with those that use REGEXP_INSTR.
Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
execute set_target( 'aaa'  )

execute add_target( 'abc'  )
execute add_target( 'bbb'  )
execute add_target( 'bcd'  )
execute add_target( 'cde'  )
execute add_target( 'def'  )
execute add_target( 'efg'  )

column target heading "TARGET|STRING"

INSTR vs REGEXP_INSTR

Matching a string literal is performed as follows.
select
  target ,
  INSTR( target, 'b' ) position
from
  targets
;

TARGET
STRING     POSITION
---------- --------
aaa               0
abc               2
bbb               1
bcd               1
cde               0
def               0
efg               0

select
  target ,
  REGEXP_INSTR( target, 'b' ) position
from
  targets
;

TARGET
STRING     POSITION
---------- --------
aaa               0
abc               2
bbb               1
bcd               1
cde               0
def               0
efg               0

Matching a complex pattern is possible with REGEXP_INSTR, but not INSTR (unless you use other text manipulation functions as well). The following example shows how to find the position of either the first "a", "c", or "f" character in the target.
--
-- not possible
--

select
  target ,
  REGEXP_INSTR( target, '[acf]' ) position
from
  targets
;

TARGET
STRING     POSITION
---------- --------
aaa               1
abc               1
bbb               0
bcd               2
cde               1
def               3
efg               2



REGEXP_REPLACE      
The REGEXP_REPLACE function is similar to both the REPLACE function and the TRANSLATE function except that it replaces a string pattern, specified with a regular expression, instead of a string literal. As well, REGEXP_REPLACE allows you to specify backreferences (references to sub-patterns defined in the main search pattern) for the replacement string whereas REPLACE only allows string literals in the replacement string. The following examples compare replacements performed using REPLACE or TRANSLATE with those using REGEXP_REPLACE.
Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
execute set_target( 'aaa'  )

execute add_target( 'abc'  )
execute add_target( 'bbb'  )
execute add_target( 'bcd'  )
execute add_target( 'cde'  )
execute add_target( 'def'  )
execute add_target( 'efg'  )

column target heading "TARGET|STRING"

REPLACE and TRANSLATE vs REGEXP_REPLACE

Replacing a string literal is performed as follows.
select
  target ,
  REPLACE( target, 'bc', 'XX' )
    as replaced_target
from
  targets
;

TARGET
STRING     REPLACED_TARGET
---------- ---------------
aaa        aaa
abc        aXX
bbb        bbb
bcd        XXd
cde        cde
def        def
efg        efg

select
  target ,
  REGEXP_REPLACE( target, 'bc', 'XX' )
    as replaced_target
from
  targets
;

TARGET
STRING     REPLACED_TARGET
---------- ---------------
aaa        aaa
abc        aXX
bbb        bbb
bcd        XXd
cde        cde
def        def
efg        efg

The following example shows how to replace all of the "a", "c", and "f" characters in the target with an "X".
select
  target ,
  TRANSLATE
    ( target, 'acf', 'XXX' )
    as replaced_target
from
  targets
;

TARGET
STRING     REPLACED_TARGET
---------- ---------------
aaa        XXX
abc        XbX
bbb        bbb
bcd        bXd
cde        Xde
def        deX
efg        eXg

select
  target ,
  REGEXP_REPLACE
    ( target, '[acf]', 'X' )
    as replaced_target
from
  targets
;

TARGET
STRING     REPLACED_TARGET
---------- ---------------
aaa        XXX
abc        XbX
bbb        bbb
bcd        bXd
cde        Xde
def        deX
efg        eXg

While the TRANSLATE function can perform many different single character replacements all at once, this cannot be done with a single, basic REGEXP_REPLACE expression.
select
  target ,
  TRANSLATE
    ( target, 'acf', 'XYZ' )
    as replaced_target
from
  targets
;

TARGET
STRING     REPLACED_TARGET
---------- ---------------
aaa        XXX
abc        XbY
bbb        bbb
bcd        bYd
cde        Yde
def        deZ
efg        eZg

--
-- not possible with REGEXP_REPLACE
--

Matching a simple search pattern, like a "bc" at the end of a string, cannot be easily performed with a single, basic REPLACE or TRANSLATE expression. REGEXP_REPLACE handles such pattern matches easily.
--
-- not possible with
-- REPLACE or TRANSLATE
--

select
  target ,
  REGEXP_REPLACE( target, 'bc$', 'XY' )
    as replaced_target
from
  targets
;

TARGET
STRING     REPLACED_TARGET
---------- ---------------
aaa        aaa
abc        aXY
bbb        bbb
bcd        bcd
cde        cde
def        def
efg        efg


REGEXP_COUNT

Introduced in Oracle 11g the REGEXP_COUNT function greatly simplifies counting the number of times a pattern appears inside a string. The following examples demonstrate how to use REGEXP_COUNT with some simple patterns. For versions prior to 11g alternative logic that provides similar functionality is also presented.
Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
execute set_target( 'aaa'        )
execute add_target( 'a b c'      )
execute add_target( 'b  b  b  b' )
execute add_target( '  bbb xx  ' )
execute add_target( 'wdef def w' )
execute add_target( 'defxdefdef' )              
execute add_target( null         )
set null "(null)"
column target        heading "TARGET|STRING"
column pattern_count heading "PATTERN|COUNT"

Count String Pattern Occurrences (Oracle 11g)

To count the occurrences of the letter "b" in the target strings we can use a query like this.
select
  target ,
  REGEXP_COUNT( TARGET, 'b' ) as pattern_count
from
  targets
;
 
TARGET        PATTERN
STRING          COUNT
---------- ----------
aaa                 0
a b c               1
b  b  b  b          4
  bbb xx            3
wdef def w          0
defxdefdef          0
(null)     (null)
 
This query finds the total occurrences of the letters "a", "c", or "f" in the target strings.
select
  target ,
  REGEXP_COUNT( TARGET, '[acf]' ) as pattern_count
from
  targets
;     
 
TARGET        PATTERN
STRING          COUNT
---------- ----------
aaa                 3
a b c               2
b  b  b  b          0
  bbb xx            0
wdef def w          2
defxdefdef          3
(null)     (null)
 
To calculate the number of times a string like "def" occurs we use a query like this.
select
  target ,
  REGEXP_COUNT( TARGET, 'def' ) as pattern_count
from
  targets
;
      
TARGET        PATTERN
STRING          COUNT
---------- ----------
aaa                 0
a b c               0
b  b  b  b          0
  bbb xx            0
wdef def w          2
defxdefdef          3
(null)     (null)
                       

Count String Pattern Occurrences (Oracle 10g)

While Oracle 10g supports most of the regular expression features described in this section it unfortunately does not support REGEXP_COUNT. However, we can use the LENGTH and REGEXP_REPLACE functions to produce similar results.
To count the occurrences of the letter "b" in the target strings we can use a query like this (the column INTERMEDIATE_STRING is included to demonstrate the inner workings of each solution).
column intermediate_string format a12 heading "INTERMEDIATE|STRING"
select
  target ,
  regexp_replace( target, '[^b]', null )                     as intermediate_string ,
  NVL( LENGTH( REGEXP_REPLACE( TARGET, '[^b]', NULL ) ), 0 ) as pattern_count
from
  targets
;
 
TARGET     INTERMEDIATE    PATTERN
STRING     STRING            COUNT
---------- ------------ ----------
aaa        (null)                0
a b c      b                     1
b  b  b  b bbbb                  4
  bbb xx   bbb                   3
wdef def w (null)                0
defxdefdef (null)                0
(null)     (null)                0     
 
This query finds the total occurrences of the letters "a", "c", or "f" in the target strings.
select
  target ,
  regexp_replace( target, '[^acf]', null )                     as intermediate_string ,
  NVL( LENGTH( REGEXP_REPLACE( TARGET, '[^acf]', NULL ) ), 0 ) as pattern_count
from
  targets
;
 
TARGET     INTERMEDIATE    PATTERN
STRING     STRING            COUNT
---------- ------------ ----------
aaa        aaa                   3
a b c      ac                    2
b  b  b  b (null)                0
  bbb xx   (null)                0
wdef def w ff                    2
defxdefdef fff                   3
(null)     (null)                0
 
To calculate the number of times a string like "def" occurs we use a query like this.
select
  target ,
  regexp_replace( target, '(d)ef|.', '\1' )                     as intermediate_string ,
  NVL( LENGTH( REGEXP_REPLACE( TARGET, '(d)ef|.', '\1' ) ), 0 ) as pattern_count
from
  targets
;
 
TARGET     INTERMEDIATE    PATTERN
STRING     STRING            COUNT
---------- ------------ ----------
aaa        (null)                0
a b c      (null)                0
b  b  b  b (null)                0
  bbb xx   (null)                0
wdef def w dd                    2
defxdefdef ddd                   3
(null)     (null)                0
 
To find the number of words in a string we can use a query like this (for our purposes we define a word as any series of characters that are not a space).
select
  target ,
  REGEXP_COUNT( TARGET, '[^ ]+' ) as pattern_count
from
  targets
;
 
TARGET        PATTERN
STRING          COUNT
---------- ----------
aaa                 1
a b c               3
b  b  b  b          4
  bbb xx            2
wdef def w          3
defxdefdef          1
(null)     (null)       
 
To ensure PATTERN_COUNT always returns a non-null value use NVL.
select
  target ,
  NVL( regexp_count( target, '[^ ]+' ), 0 ) as pattern_count
from
  targets
;
 
TARGET        PATTERN
STRING          COUNT
---------- ----------
aaa                 1
a b c               3
b  b  b  b          4
  bbb xx            2
wdef def w          3
defxdefdef          1
(null)              0

To find the number of words in a string we can use a query like this.
select
  target ,
  regexp_replace( target, ' *[^ ]+ *', 'w' )                     as intermediate_string ,
  NVL( LENGTH( REGEXP_REPLACE( TARGET, ' *[^ ]+ *', 'w' ) ), 0 ) as pattern_count
from
  targets
;
 
TARGET     INTERMEDIATE    PATTERN
STRING     STRING            COUNT
---------- ------------ ----------
aaa        w                     1
a b c      www                   3
b  b  b  b wwww                  4
  bbb xx   ww                    2
wdef def w www                   3
defxdefdef w                     1
(null)     (null)                0                          
 

Count String Pattern Occurrences (Any Oracle Version)

In any Oracle version we can use the LENGTH and REPLACE functions to count patterns, though the logic is slightly more complex than the techniques above.
To count the occurrences of the letter "b" in the target strings we can use a query like this.
column target_length       heading "TARGET|LENGTH"
column intermediate_length heading "INTERMEDIATE|LENGTH"
select
  target ,
  replace( target, 'b', null )                     as intermediate_string ,
  nvl( length( target ), 0 )                       as target_length ,
  nvl( length( replace( target, 'b', null ) ), 0 ) as intermediate_length ,
  NVL( LENGTH( TARGET ), 0 )
    - NVL( LENGTH( REPLACE( TARGET, 'b', NULL ) ), 0 )
    as pattern_count
from
  targets
;
 
TARGET     INTERMEDIATE     TARGET INTERMEDIATE    PATTERN
STRING     STRING           LENGTH       LENGTH      COUNT
---------- ------------ ---------- ------------ ----------
aaa        aaa                   3            3          0
a b c      a  c                  5            4          1
b  b  b  b                      10            6          4
  bbb xx      xx                10            7          3
wdef def w wdef def w           10           10          0
defxdefdef defxdefdef           10           10          0
(null)     (null)                0            0          0
 
(The INTERMEDIATE_STRING value in the third row is six spaces. In the fourth row it is three spaces followed by "xx" followed by two spaces.)
This query finds the total occurrences of the letters "a", "c", or "f" in the target strings.
select
  target ,
  translate( target, 'xacf', 'x' )                     as intermediate_string ,
  nvl( length( target ), 0 )                           as target_length ,
  nvl( length( translate( target, 'xacf', 'x' ) ), 0 ) as intermediate_length ,
  NVL( LENGTH( TARGET ), 0 )
    - NVL( LENGTH( TRANSLATE( TARGET, 'xacf', 'x' ) ), 0 )
    as pattern_count
from
  targets
;
 
TARGET     INTERMEDIATE     TARGET INTERMEDIATE    PATTERN
STRING     STRING           LENGTH       LENGTH      COUNT
---------- ------------ ---------- ------------ ----------
aaa        (null)                3            0          3
a b c       b                    5            3          2
b  b  b  b b  b  b  b           10           10          0
  bbb xx     bbb xx             10           10          0
wdef def w wde de w             10            8          2
defxdefdef dexdede              10            7          3
(null)     (null)                0            0          0
 
To calculate the number of times a string like "def" occurs we use a query like this.
column intermediate_string_1 format a12 heading "INTERMEDIATE|STRING 1"
column intermediate_string_2 format a12 heading "INTERMEDIATE|STRING 2"
select
  target ,
  replace( target, 'def', 'd' )                      as intermediate_string_1 ,
  replace( target, 'def', null )                     as intermediate_string_2 ,
  nvl( length( replace( target, 'def', 'd'  ) ), 0 ) as length_1 ,
  nvl( length( replace( target, 'def', null ) ), 0 ) as length_2 ,
  NVL( LENGTH( REPLACE( TARGET, 'def', 'd' ) ), 0 )
    - NVL( LENGTH( REPLACE( TARGET, 'def', NULL ) ), 0 )
    as pattern_count
from
  targets
;
 
TARGET     INTERMEDIATE INTERMEDIATE                          PATTERN
STRING     STRING 1     STRING 2       LENGTH_1   LENGTH_2      COUNT
---------- ------------ ------------ ---------- ---------- ----------
aaa        aaa          aaa                   3          3          0
a b c      a b c        a b c                 5          5          0
b  b  b  b b  b  b  b   b  b  b  b           10         10          0
  bbb xx     bbb xx       bbb xx             10         10          0
wdef def w wd d w       w  w                  6          4          2
defxdefdef dxdd         x                     4          1          3
(null)     (null)       (null)                0          0          0
 
To find the number of words in a string we can use a query like this. Note the query will not work for target strings containing more than one space between words.
column trimmed_length_plus_1 heading "TRIMMED LENGTH|PLUS 1"
select
  target ,
  trimmed ,
  replace( trimmed, ' ', null )                     as intermediate_string ,
  nvl( length( trimmed ) + 1, 0 )                   as trimmed_length_plus_1 ,
  nvl( length( replace( trimmed, ' ', null ) ), 0 ) as intermediate_length ,
  NVL( LENGTH( TRIMMED ) + 1, 0 )
    - NVL( LENGTH( REPLACE( TRIMMED, ' ', NULL ) ), 0 )
    as pattern_count
from
  ( select target, ltrim( rtrim(target) ) as trimmed from targets )
where
  nvl( ltrim( rtrim( target ) ), 'x' ) not like '%  %'
;
 
TARGET                     INTERMEDIATE TRIMMED LENGTH INTERMEDIATE    PATTERN
STRING     TRIMMED         STRING               PLUS 1       LENGTH      COUNT
---------- --------------- ------------ -------------- ------------ ----------
aaa        aaa             aaa                       4            3          1
a b c      a b c           abc                       6            3          3
  bbb xx   bbb xx          bbbxx                     7            5          2
wdef def w wdef def w      wdefdefw                 11            8          3
defxdefdef defxdefdef      defxdefdef               11           10          1
(null)     (null)          (null)                    0            0          0
 
In Oracle 8i or later the LTRM and RTRIM functions can be replaced with a single TRIM call.
select
  target ,
  trimmed ,
  replace( trimmed, ' ', null )                     as intermediate_string ,
  nvl( length( trimmed ) + 1, 0 )                   as trimmed_length_plus_1 ,
  nvl( length( replace( trimmed, ' ', null ) ), 0 ) as intermediate_length ,
  nvl( length( trimmed ) + 1, 0 )
    - nvl( length( replace( trimmed, ' ', null ) ), 0 )
    as pattern_count
from
  ( select target, TRIM( target ) as trimmed from targets )
where
  nvl( TRIM( target ), 'x' ) not like '%  %'
;
 
TARGET                     INTERMEDIATE TRIMMED LENGTH INTERMEDIATE    PATTERN
STRING     TRIMMED         STRING               PLUS 1       LENGTH      COUNT
---------- --------------- ------------ -------------- ------------ ----------
aaa        aaa             aaa                       4            3          1
a b c      a b c           abc                       6            3          3
  bbb xx   bbb xx          bbbxx                     7            5          2
wdef def w wdef def w      wdefdefw                 11            8          3
defxdefdef defxdefdef      defxdefdef               11           10          1
(null)     (null)          (null)                    0            0          0
 

REGEXP_SUBSTR       
The REGEXP_SUBSTR function provides a superset of the functionality available with the SUBSTR function. While SUBSTR extracts a string from a specific location in the target, REGEXP_SUBSTR extracts a string that matches a given pattern, specified with a regular expression, from anywhere in the target. The following examples compare expressions using SUBSTR with those that use REGEXP_SUBSTR.
Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
execute set_target( 'aaa'  )

execute add_target( 'abc'  )
execute add_target( 'bbb'  )
execute add_target( 'bcd'  )
execute add_target( 'cde'  )
execute add_target( 'def'  )
execute add_target( 'efg'  )

column target heading "TARGET|STRING"

SUBSTR vs REGEXP_SUBSTR

Extracting a string from a fixed position, e.g. characters 2 and 3, is performed as follows.
select
  target ,
  SUBSTR( target, 2, 2 ) substring
from
  targets
;

TARGET
STRING     SUBSTRING
---------- ---------------
aaa        aa
abc        bc
bbb        bb
bcd        cd
cde        de
def        ef
efg        fg

select
  target ,
  REGEXP_SUBSTR( target, '..', 2 ) substring
from
  targets
;

TARGET
STRING     SUBSTRING
---------- ---------------
aaa        aa
abc        bc
bbb        bb
bcd        cd
cde        de
def        ef
efg        fg

Matching a complex pattern is possible with REGEXP_SUBSTR, but not SUBSTR (unless you use other functions as well). The following example shows how to extract either the first "a", "c", or "f" character in the target.
--
-- not possible
--

select
  target ,
  REGEXP_SUBSTR( target, '^[acf]' ) substring
from
  targets
;

TARGET
STRING     SUBSTRING
---------- ---------------
aaa        a
abc        a
bbb        (null)
bcd        (null)
cde        c
def        (null)
efg        (null)




Pattern Matching
Preface
In the tutorials to follow we will examine the syntax for specifying patterns using regular expressions. These patterns can be used in any of the REGEXP_ features described earlier in this section. For simplicity, pattern matching will be broken down into four different categories.
1.      WHAT single piece of text are we trying to match?
2.      HOW MANY repetitions of that text are we trying to match?
3.      WHERE in the source string should we look for a match?
4.      WHICH occurrence of the text do we want to match?
Note: When using regular expressions in your own code you will do so using the REGEXP_LIKE, REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR features described in preceeding tutorials. In many of the examples to follow you will not see these features used directly. Instead they are embedded in a view called TEST_RESULTS, which we will select from to see the effect of various pattern and target combinations (initialized with custom procedures like SET_PATTERN, SET_TARGET, and ADD_TARGET). This is done so that we can focus solely on pattern matching and not repeat syntax for features already presented.
The source code for the custom procedures and the view can be found in the Setup topic at the end of this section. The TEST_RESULTS view contains the following columns.






Column Name
Description
PATTERN
contains the regular expression for the pattern being tested
TARGET
contains the target string being examined
MATCH
indicates whether or not a match was found using REGEXP_LIKE
MATCHED_VALUE
contains the value within the TARGET that matched the PATTERN; the value is extracted using REGEXP_SUBSTR; if a match was not found or a match to an empty string was made MATCHED_VALUE will contain a null (indicated as '(null)')
POSITION
indicates the matched value's location within the TARGET as returned by REGEXP_INSTR

No comments:

Post a Comment

Oracle Escape Characters

Oracle allows the assignment of special escape characters to tell Oracle that the character is interpreted literally.  Certain characters ...