Regular Expressions
Regular expression support for pattern matching was introduced in Oracle 10g. It offers a powerful alternative to older, less robust pattern matching features such as the LIKE operator. Oracle regular expression support is based on the POSIX standard, but it also has some non-POSIX extensions similar to those used in the PERL programming language.
In the first four tutorials the Oracle conditions and functions that support regular expressions will be introduced. Since all Oracle regular expression features recognize the same regular expression patterns, these four tutorials will not attempt to explore regular expression syntax in detail. This will be done in the remaining tutorials which discuss four different aspects of regular expression pattern matching.
- WHAT single piece of text are we trying to match?
- HOW MANY repetitions of that text are we trying to match?
- WHERE in the source string should we look for a match?
- WHICH occurrence of the text do we want to match?
REGEXP_LIKE
The REGEXP_LIKE pattern matching condition is similar to the LIKE condition. Whereas LIKE only performs simple pattern matching using the wildcards "%" and "_", REGEXP_LIKE performs complex regular expression pattern matching allowing it to match a much greater range of string patterns than LIKE. The following examples compare pattern matching with both LIKE and REGEXP_LIKE.
Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
execute set_target( 'aaa' )
execute add_target( 'abc' )
execute add_target( 'bbb' )
execute add_target( 'bcd' )
execute add_target( 'cde' )
execute add_target( 'def' )
execute add_target( 'efg' )
column target heading "TARGET|STRING"
LIKE vs REGEXP_LIKE
Matching an entire string is performed as follows. | |
select target from targets where target LIKE 'abc' ; TARGET STRING ---------- abc | select target from targets where REGEXP_LIKE( target, '^abc$' ) ; TARGET STRING ---------- abc |
This is how to match a string anywhere in the target. | |
select target from targets where target LIKE '%bc%' ; TARGET STRING ---------- abc bcd | select target from targets where REGEXP_LIKE( target, 'bc' ) ; TARGET STRING ---------- abc bcd |
These examples show how to match a string at the start of the target. | |
select target from targets where target LIKE 'b%' ; TARGET STRING ---------- bbb bcd | select target from targets where REGEXP_LIKE( target, '^b' ) ; TARGET STRING ---------- bbb bcd |
These examples show how to match a string at the end of the target. | |
select target from targets where target LIKE '%b' ; TARGET STRING ---------- bbb | select target from targets where REGEXP_LIKE( target, 'b$' ) ; TARGET STRING ---------- bbb |
To match any single character do the following. | |
select target from targets where target LIKE 'a_c' ; TARGET STRING ---------- abc | select target from targets where REGEXP_LIKE( target, '^a.c$' ) ; TARGET STRING ---------- abc |
One example of a complex pattern that REGEXP_LIKE can handle but a single, basic LIKE condition cannot (without using text manipulation functions) is a list of characters. In this example we search for strings that have either an "a", "c", or "f" as the second character of the string. | |
-- -- not possible -- | select target from targets where REGEXP_LIKE( target, '^.[acf]' ) ; TARGET STRING ---------- aaa bcd efg |
REGEXP_INSTR
The REGEXP_INSTR function is similar to the INSTR function except that it finds the location of a pattern of characters, specified with a regular expression, instead of finding the location of a simple string literal. The following examples compare searches using INSTR with those that use REGEXP_INSTR.
Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
execute set_target( 'aaa' )
execute add_target( 'abc' )
execute add_target( 'bbb' )
execute add_target( 'bcd' )
execute add_target( 'cde' )
execute add_target( 'def' )
execute add_target( 'efg' )
column target heading "TARGET|STRING"
INSTR vs REGEXP_INSTR
Matching a string literal is performed as follows. | |
select target , INSTR( target, 'b' ) position from targets ; TARGET STRING POSITION ---------- -------- aaa 0 abc 2 bbb 1 bcd 1 cde 0 def 0 efg 0 | select target , REGEXP_INSTR( target, 'b' ) position from targets ; TARGET STRING POSITION ---------- -------- aaa 0 abc 2 bbb 1 bcd 1 cde 0 def 0 efg 0 |
Matching a complex pattern is possible with REGEXP_INSTR, but not INSTR (unless you use other text manipulation functions as well). The following example shows how to find the position of either the first "a", "c", or "f" character in the target. | |
-- -- not possible -- | select target , REGEXP_INSTR( target, '[acf]' ) position from targets ; TARGET STRING POSITION ---------- -------- aaa 1 abc 1 bbb 0 bcd 2 cde 1 def 3 efg 2 |
REGEXP_REPLACE
The REGEXP_REPLACE function is similar to both the REPLACE function and the TRANSLATE function except that it replaces a string pattern, specified with a regular expression, instead of a string literal. As well, REGEXP_REPLACE allows you to specify backreferences (references to sub-patterns defined in the main search pattern) for the replacement string whereas REPLACE only allows string literals in the replacement string. The following examples compare replacements performed using REPLACE or TRANSLATE with those using REGEXP_REPLACE.
Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
execute set_target( 'aaa' )
execute add_target( 'abc' )
execute add_target( 'bbb' )
execute add_target( 'bcd' )
execute add_target( 'cde' )
execute add_target( 'def' )
execute add_target( 'efg' )
column target heading "TARGET|STRING"
REPLACE and TRANSLATE vs REGEXP_REPLACE
Replacing a string literal is performed as follows. | |
select target , REPLACE( target, 'bc', 'XX' ) as replaced_target from targets ; TARGET STRING REPLACED_TARGET ---------- --------------- aaa aaa abc aXX bbb bbb bcd XXd cde cde def def efg efg | select target , REGEXP_REPLACE( target, 'bc', 'XX' ) as replaced_target from targets ; TARGET STRING REPLACED_TARGET ---------- --------------- aaa aaa abc aXX bbb bbb bcd XXd cde cde def def efg efg |
The following example shows how to replace all of the "a", "c", and "f" characters in the target with an "X". | |
select target , TRANSLATE ( target, 'acf', 'XXX' ) as replaced_target from targets ; TARGET STRING REPLACED_TARGET ---------- --------------- aaa XXX abc XbX bbb bbb bcd bXd cde Xde def deX efg eXg | select target , REGEXP_REPLACE ( target, '[acf]', 'X' ) as replaced_target from targets ; TARGET STRING REPLACED_TARGET ---------- --------------- aaa XXX abc XbX bbb bbb bcd bXd cde Xde def deX efg eXg |
While the TRANSLATE function can perform many different single character replacements all at once, this cannot be done with a single, basic REGEXP_REPLACE expression. | |
select target , TRANSLATE ( target, 'acf', 'XYZ' ) as replaced_target from targets ; TARGET STRING REPLACED_TARGET ---------- --------------- aaa XXX abc XbY bbb bbb bcd bYd cde Yde def deZ efg eZg | -- -- not possible with REGEXP_REPLACE -- |
Matching a simple search pattern, like a "bc" at the end of a string, cannot be easily performed with a single, basic REPLACE or TRANSLATE expression. REGEXP_REPLACE handles such pattern matches easily. | |
-- -- not possible with -- REPLACE or TRANSLATE -- | select target , REGEXP_REPLACE( target, 'bc$', 'XY' ) as replaced_target from targets ; TARGET STRING REPLACED_TARGET ---------- --------------- aaa aaa abc aXY bbb bbb bcd bcd cde cde def def efg efg |
REGEXP_COUNT
Introduced in Oracle 11g the REGEXP_COUNT function greatly simplifies counting the number of times a pattern appears inside a string. The following examples demonstrate how to use REGEXP_COUNT with some simple patterns. For versions prior to 11g alternative logic that provides similar functionality is also presented.Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
execute set_target( 'aaa' )
execute add_target( 'a b c' )
execute add_target( 'b b b b' )
execute add_target( ' bbb xx ' )
execute add_target( 'wdef def w' )
execute add_target( 'defxdefdef' )
execute add_target( null )
set null "(null)"
column target heading "TARGET|STRING"
column pattern_count heading "PATTERN|COUNT"
Count String Pattern Occurrences (Oracle 11g)
To count the occurrences of the letter "b" in the target strings we can use a query like this.select
target ,
REGEXP_COUNT( TARGET, 'b' ) as pattern_count
from
targets
;
TARGET PATTERN
STRING COUNT
---------- ----------
aaa 0
a b c 1
b b b b 4
bbb xx 3
wdef def w 0
defxdefdef 0
(null) (null)
This query finds the total occurrences of the letters "a", "c", or "f" in the target strings. select
target ,
REGEXP_COUNT( TARGET, '[acf]' ) as pattern_count
from
targets
;
TARGET PATTERN
STRING COUNT
---------- ----------
aaa 3
a b c 2
b b b b 0
bbb xx 0
wdef def w 2
defxdefdef 3
(null) (null)
To calculate the number of times a string like "def" occurs we use a query like this. select
target ,
REGEXP_COUNT( TARGET, 'def' ) as pattern_count
from
targets
;
TARGET PATTERN
STRING COUNT
---------- ----------
aaa 0
a b c 0
b b b b 0
bbb xx 0
wdef def w 2
defxdefdef 3
(null) (null)
Count String Pattern Occurrences (Oracle 10g)
While Oracle 10g supports most of the regular expression features described in this section it unfortunately does not support REGEXP_COUNT. However, we can use the LENGTH and REGEXP_REPLACE functions to produce similar results.To count the occurrences of the letter "b" in the target strings we can use a query like this (the column INTERMEDIATE_STRING is included to demonstrate the inner workings of each solution).
column intermediate_string format a12 heading "INTERMEDIATE|STRING"
select
target ,
regexp_replace( target, '[^b]', null ) as intermediate_string ,
NVL( LENGTH( REGEXP_REPLACE( TARGET, '[^b]', NULL ) ), 0 ) as pattern_count
from
targets
;
TARGET INTERMEDIATE PATTERN
STRING STRING COUNT
---------- ------------ ----------
aaa (null) 0
a b c b 1
b b b b bbbb 4
bbb xx bbb 3
wdef def w (null) 0
defxdefdef (null) 0
(null) (null) 0
This query finds the total occurrences of the letters "a", "c", or "f" in the target strings. select
target ,
regexp_replace( target, '[^acf]', null ) as intermediate_string ,
NVL( LENGTH( REGEXP_REPLACE( TARGET, '[^acf]', NULL ) ), 0 ) as pattern_count
from
targets
;
TARGET INTERMEDIATE PATTERN
STRING STRING COUNT
---------- ------------ ----------
aaa aaa 3
a b c ac 2
b b b b (null) 0
bbb xx (null) 0
wdef def w ff 2
defxdefdef fff 3
(null) (null) 0
To calculate the number of times a string like "def" occurs we use a query like this. select
target ,
regexp_replace( target, '(d)ef|.', '\1' ) as intermediate_string ,
NVL( LENGTH( REGEXP_REPLACE( TARGET, '(d)ef|.', '\1' ) ), 0 ) as pattern_count
from
targets
;
TARGET INTERMEDIATE PATTERN
STRING STRING COUNT
---------- ------------ ----------
aaa (null) 0
a b c (null) 0
b b b b (null) 0
bbb xx (null) 0
wdef def w dd 2
defxdefdef ddd 3
(null) (null) 0
To find the number of words in a string we can use a query like this (for our purposes we define a word as any series of characters that are not a space).
select
target ,
REGEXP_COUNT( TARGET, '[^ ]+' ) as pattern_count
from
targets
;
TARGET PATTERN
STRING COUNT
---------- ----------
aaa 1
a b c 3
b b b b 4
bbb xx 2
wdef def w 3
defxdefdef 1
(null) (null)
To ensure PATTERN_COUNT always returns a non-null value use NVL. select
target ,
NVL( regexp_count( target, '[^ ]+' ), 0 ) as pattern_count
from
targets
;
TARGET PATTERN
STRING COUNT
---------- ----------
aaa 1
a b c 3
b b b b 4
bbb xx 2
wdef def w 3
defxdefdef 1
(null) 0
select
target ,
regexp_replace( target, ' *[^ ]+ *', 'w' ) as intermediate_string ,
NVL( LENGTH( REGEXP_REPLACE( TARGET, ' *[^ ]+ *', 'w' ) ), 0 ) as pattern_count
from
targets
;
TARGET INTERMEDIATE PATTERN
STRING STRING COUNT
---------- ------------ ----------
aaa w 1
a b c www 3
b b b b wwww 4
bbb xx ww 2
wdef def w www 3
defxdefdef w 1
(null) (null) 0
Count String Pattern Occurrences (Any Oracle Version)
In any Oracle version we can use the LENGTH and REPLACE functions to count patterns, though the logic is slightly more complex than the techniques above.To count the occurrences of the letter "b" in the target strings we can use a query like this.
column target_length heading "TARGET|LENGTH"
column intermediate_length heading "INTERMEDIATE|LENGTH"
select
target ,
replace( target, 'b', null ) as intermediate_string ,
nvl( length( target ), 0 ) as target_length ,
nvl( length( replace( target, 'b', null ) ), 0 ) as intermediate_length ,
NVL( LENGTH( TARGET ), 0 )
- NVL( LENGTH( REPLACE( TARGET, 'b', NULL ) ), 0 )
as pattern_count
from
targets
;
TARGET INTERMEDIATE TARGET INTERMEDIATE PATTERN
STRING STRING LENGTH LENGTH COUNT
---------- ------------ ---------- ------------ ----------
aaa aaa 3 3 0
a b c a c 5 4 1
b b b b 10 6 4
bbb xx xx 10 7 3
wdef def w wdef def w 10 10 0
defxdefdef defxdefdef 10 10 0
(null) (null) 0 0 0
(The INTERMEDIATE_STRING value in the third row is six spaces. In the fourth row it is three spaces followed by "xx" followed by two spaces.) This query finds the total occurrences of the letters "a", "c", or "f" in the target strings.
select
target ,
translate( target, 'xacf', 'x' ) as intermediate_string ,
nvl( length( target ), 0 ) as target_length ,
nvl( length( translate( target, 'xacf', 'x' ) ), 0 ) as intermediate_length ,
NVL( LENGTH( TARGET ), 0 )
- NVL( LENGTH( TRANSLATE( TARGET, 'xacf', 'x' ) ), 0 )
as pattern_count
from
targets
;
TARGET INTERMEDIATE TARGET INTERMEDIATE PATTERN
STRING STRING LENGTH LENGTH COUNT
---------- ------------ ---------- ------------ ----------
aaa (null) 3 0 3
a b c b 5 3 2
b b b b b b b b 10 10 0
bbb xx bbb xx 10 10 0
wdef def w wde de w 10 8 2
defxdefdef dexdede 10 7 3
(null) (null) 0 0 0
To calculate the number of times a string like "def" occurs we use a query like this. column intermediate_string_1 format a12 heading "INTERMEDIATE|STRING 1"
column intermediate_string_2 format a12 heading "INTERMEDIATE|STRING 2"
select
target ,
replace( target, 'def', 'd' ) as intermediate_string_1 ,
replace( target, 'def', null ) as intermediate_string_2 ,
nvl( length( replace( target, 'def', 'd' ) ), 0 ) as length_1 ,
nvl( length( replace( target, 'def', null ) ), 0 ) as length_2 ,
NVL( LENGTH( REPLACE( TARGET, 'def', 'd' ) ), 0 )
- NVL( LENGTH( REPLACE( TARGET, 'def', NULL ) ), 0 )
as pattern_count
from
targets
;
TARGET INTERMEDIATE INTERMEDIATE PATTERN
STRING STRING 1 STRING 2 LENGTH_1 LENGTH_2 COUNT
---------- ------------ ------------ ---------- ---------- ----------
aaa aaa aaa 3 3 0
a b c a b c a b c 5 5 0
b b b b b b b b b b b b 10 10 0
bbb xx bbb xx bbb xx 10 10 0
wdef def w wd d w w w 6 4 2
defxdefdef dxdd x 4 1 3
(null) (null) (null) 0 0 0
To find the number of words in a string we can use a query like this. Note the query will not work for target strings containing more than one space between words. column trimmed_length_plus_1 heading "TRIMMED LENGTH|PLUS 1"
select
target ,
trimmed ,
replace( trimmed, ' ', null ) as intermediate_string ,
nvl( length( trimmed ) + 1, 0 ) as trimmed_length_plus_1 ,
nvl( length( replace( trimmed, ' ', null ) ), 0 ) as intermediate_length ,
NVL( LENGTH( TRIMMED ) + 1, 0 )
- NVL( LENGTH( REPLACE( TRIMMED, ' ', NULL ) ), 0 )
as pattern_count
from
( select target, ltrim( rtrim(target) ) as trimmed from targets )
where
nvl( ltrim( rtrim( target ) ), 'x' ) not like '% %'
;
TARGET INTERMEDIATE TRIMMED LENGTH INTERMEDIATE PATTERN
STRING TRIMMED STRING PLUS 1 LENGTH COUNT
---------- --------------- ------------ -------------- ------------ ----------
aaa aaa aaa 4 3 1
a b c a b c abc 6 3 3
bbb xx bbb xx bbbxx 7 5 2
wdef def w wdef def w wdefdefw 11 8 3
defxdefdef defxdefdef defxdefdef 11 10 1
(null) (null) (null) 0 0 0
In Oracle 8i or later the LTRM and RTRIM functions can be replaced with a single TRIM call. select
target ,
trimmed ,
replace( trimmed, ' ', null ) as intermediate_string ,
nvl( length( trimmed ) + 1, 0 ) as trimmed_length_plus_1 ,
nvl( length( replace( trimmed, ' ', null ) ), 0 ) as intermediate_length ,
nvl( length( trimmed ) + 1, 0 )
- nvl( length( replace( trimmed, ' ', null ) ), 0 )
as pattern_count
from
( select target, TRIM( target ) as trimmed from targets )
where
nvl( TRIM( target ), 'x' ) not like '% %'
;
TARGET INTERMEDIATE TRIMMED LENGTH INTERMEDIATE PATTERN
STRING TRIMMED STRING PLUS 1 LENGTH COUNT
---------- --------------- ------------ -------------- ------------ ----------
aaa aaa aaa 4 3 1
a b c a b c abc 6 3 3
bbb xx bbb xx bbbxx 7 5 2
wdef def w wdef def w wdefdefw 11 8 3
defxdefdef defxdefdef defxdefdef 11 10 1
(null) (null) (null) 0 0 0
REGEXP_SUBSTR
The REGEXP_SUBSTR function provides a superset of the functionality available with the SUBSTR function. While SUBSTR extracts a string from a specific location in the target, REGEXP_SUBSTR extracts a string that matches a given pattern, specified with a regular expression, from anywhere in the target. The following examples compare expressions using SUBSTR with those that use REGEXP_SUBSTR.
Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
execute set_target( 'aaa' )
execute add_target( 'abc' )
execute add_target( 'bbb' )
execute add_target( 'bcd' )
execute add_target( 'cde' )
execute add_target( 'def' )
execute add_target( 'efg' )
column target heading "TARGET|STRING"
SUBSTR vs REGEXP_SUBSTR
Extracting a string from a fixed position, e.g. characters 2 and 3, is performed as follows. | |
select target , SUBSTR( target, 2, 2 ) substring from targets ; TARGET STRING SUBSTRING ---------- --------------- aaa aa abc bc bbb bb bcd cd cde de def ef efg fg | select target , REGEXP_SUBSTR( target, '..', 2 ) substring from targets ; TARGET STRING SUBSTRING ---------- --------------- aaa aa abc bc bbb bb bcd cd cde de def ef efg fg |
Matching a complex pattern is possible with REGEXP_SUBSTR, but not SUBSTR (unless you use other functions as well). The following example shows how to extract either the first "a", "c", or "f" character in the target. | |
-- -- not possible -- | select target , REGEXP_SUBSTR( target, '^[acf]' ) substring from targets ; TARGET STRING SUBSTRING ---------- --------------- aaa a abc a bbb (null) bcd (null) cde c def (null) efg (null) |
Pattern Matching
Preface
In the tutorials to follow we will examine the syntax for specifying patterns using regular expressions. These patterns can be used in any of the REGEXP_ features described earlier in this section. For simplicity, pattern matching will be broken down into four different categories.
1. WHAT single piece of text are we trying to match?
2. HOW MANY repetitions of that text are we trying to match?
3. WHERE in the source string should we look for a match?
4. WHICH occurrence of the text do we want to match?
Note: When using regular expressions in your own code you will do so using the REGEXP_LIKE, REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR features described in preceeding tutorials. In many of the examples to follow you will not see these features used directly. Instead they are embedded in a view called TEST_RESULTS, which we will select from to see the effect of various pattern and target combinations (initialized with custom procedures like SET_PATTERN, SET_TARGET, and ADD_TARGET). This is done so that we can focus solely on pattern matching and not repeat syntax for features already presented.
The source code for the custom procedures and the view can be found in the Setup topic at the end of this section. The TEST_RESULTS view contains the following columns.
Column Name | Description |
PATTERN | contains the regular expression for the pattern being tested |
TARGET | contains the target string being examined |
MATCH | indicates whether or not a match was found using REGEXP_LIKE |
MATCHED_VALUE | contains the value within the TARGET that matched the PATTERN; the value is extracted using REGEXP_SUBSTR; if a match was not found or a match to an empty string was made MATCHED_VALUE will contain a null (indicated as '(null)') |
POSITION | indicates the matched value's location within the TARGET as returned by REGEXP_INSTR |
No comments:
Post a Comment