Chapter 7 – Character strings

Note

You are reading a digital copy of SQL-99 Complete, Really, a book that documents the SQL-99 standard.

The book and the standard does not reflect the features of CrateDB, we are just publishing it as a service for the community and for reference purposes.

On the one hand, CrateDB does not implement the SQL-99 standard thoroughly, on the other hand, it extends the standard for implementing cluster features and others.

For more information specific to CrateDB, check out the CrateDB Reference documentation.

In SQL, a character string is any sequence of zero or more alphanumeric characters that belong to a given Character set. A Character set is a named characters that belong to a given Character set. A Character set is a named character repertoire that includes a Form-of-use encoding scheme which defines how the repertoire’s characters are encoded as numbers. The SQL Standard defines several standard Character sets that your DBMS must support. A character string value may be a <literal>, an <identifier>, the value of a parameter or a host language variable or the result of any expression or argument that evaluates to a character string. All character strings belong to some Character set and are governed by the rules of some Collation during comparisons. A Collation is a named collating sequence. Character strings that belong to the same Character set are compatible.

A character string has a length – a non-negative integer equal to the number of characters in the string. Characters in a character string are numbered from left to right beginning with 1. A Character string also has a coercibility attribute; this helps your DBMS determine which Collation to use for a comparison that doesn’t provide an explicit COLLATE clause. The coercibility attribute can be either COERCIBLE, EXPLICIT, IMPLICIT or NO COLLATION. (A coercibility attribute of COERCIBLE, EXPLICIT or IMPLICIT means the string has a current default Collation. A coercibility attribute of NO COLLATION means the string does not have a current default Collation.) Character strings are stored in either of the six character string <data type>s: CHARACTER, CHARACTER VARYING, CHARACTER LARGE OBJECT, NATIONAL CHARACTER, NATIONAL CHARACTER VARYING or NATIONAL CHARACTER LARGE OBJECT.

Table of Contents

Character string <literal>s

An SQL <character string literal> has five parts:

  1. Its value: the sequence of characters that make up the <literal>.

  2. Its length: the number of characters that make up the <literal>.

  3. The name of the Character set that the <literal> belongs to.

  4. The name of the <literal>’s default Collation. (This is the Collation that may be used to compare the <literal> with another character string in the absence of an explicit COLLATE clause.)

  5. The <literal>’s coercibility attribute: normally COERCIBLE, but can be EXPLICIT. (The coercibility attribute helps your DBMS determine which Collation to use for a comparison that doesn’t provide an explicit COLLATE clause.)

A <character string literal> is either a <character string literal> or a national <character string literal>.

<character string literal>

The required syntax for a <character string literal> is as follows.

<character string literal> ::=
[ <Character set name> ]'string' [ COLLATE <Collation name> ]

A <character string literal> is a string of zero or more alphanumeric characters inside a pair of single quote marks. The string’s characters must all belong to the same Character set. Its <data type> is fixed length CHARACTER, though it is compatible with the CHARACTER, CHARACTER VARYING, CHARACTER LARGE OBJECT, NATIONAL CHARACTER, NATIONAL CHARACTER VARYING and NATIONAL CHARACTER LARGE OBJECT <data type>s. The <literal>’s length is the number of characters inside the quote marks; the delimiting single quotes aren’t part of the <literal>, so they’re not included in the calculation of the <character string literal>’s size. Two consecutive single quotes within a character string (i.e.: ‘’) represent one single quote mark; together, they count as one character when calculating the size of the <literal>. Here is an example of a <character string literal>:

'This is a <character string literal>'

[Obscure Rule] The optional Character set specification – an underline character immediately preceding a <Character set name> (no space is allowed between them) – names the Character set that the <literal> belongs to. Your current <AuthorizationID> must have the USAGE Privilege for that Character set. For example, this <character string literal>:

LATIN1 'Hello'

belongs to the LATIN1 Character set. Note: For qualified names, the underline character always precedes the highest level of explicit qualification in the <Character set name>. If you omit the Character set specification, the characters in the <literal> must belong to the Character set of the SQL-client Module that contains the <literal>. Here are two examples of a <character string literal>:

'This is a string in the default Character set'
LATIN1'This is a string in the LATIN1 Character set'

[Obscure Rule] A <character string literal> normally has a coercibility attribute of COERCIBLE and a default Collation that is the Collation defined for its Character set – See “Character Strings and Collations” on page 159. The optional COLLATE clause names the <literal>´s EXPLICIT Collation for an operation. The Collation named must be a Collation defined for the relevant Character set, but you may specify a default Collation for a <literal> that is different from the default Collation of it´s Character set. If you´re using COLLATE in an SQL-Schema statement, then the <AuthorizationID> that iwns the containing Schema must have the USAGE Privileg on <Collation name>. If you´re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on <Collation name>. Here are four more examples of a <character string literal>.

'This string in the default Character set will use the default Character set''s Collation'
LATIN1'This in the LATIN1 Character set will use LATIN1''s Collation'
'This string in the default Character set will use a Collation named MY.COLLATION_1' COLLATE my.collation_1
LATIN1'This string in the LATIN1 Character set will use a Collation named MYCOLLATION_1' COLLATE my.collation_1

[Obscure Rule] SQL allows you to break a long <character string literal> up into two or more smaller <character string literal>s, split by a <separator> that includes a newline character. When it sees such a <literal>, your DBMS will ignore the <separator> and treat the multiple strings as a single <literal>. For example, here are two equivalent <character string literal>s:

'This is part of a string'
' and this is the other part'

'This is part of a string and this is the other part'

(In the first example, there is a carriage return newline <separator> between string' and and.)

If you want to restrict your code to Core SQL, don’t add a Character set specification to <character string literal>s, don’t add a COLLATE clause to <character string literal>s and don’t split long <character string literal>s into smaller strings.

<national character string literal>

The required syntax for a <national character string literal> is as follows.

<national character string literal> ::=
N'string' [ COLLATE <Collation name> ]

A <national character string literal> is a <character string literal> preceded by the letter N; it is a synonym for a <character string literal> that belongs to a predefined “national” Character set. Its <data type> is fixed length NATIONAL CHARACTER, though it is compatible with the CHARACTER, CHARACTER VARYING, CHARACTER LARGE OBJECT, NATIONAL CHARACTER, NATIONAL CHARACTER VARYING and NATIONAL CHARACTER LARGE OBJECT <data type>s.

Other than the fact that you may not add a Character set specification to a <national character string literal> because N'string' implies the same national Character set used for NCHAR, NCHAR VARYING and NCLOB <data type>s, the specifications for the two types of <character string literal>s are the same. Here are two examples of a <national character string literal>:

N'This string in the national Character set will use the national Character set''s Collation'
N'This string in the national Character set will use a Collation named
MY.COLLATION_1' COLLATE my.collation_1

[NON-PORTABLE] The national Character set used by <national character string literal>s and the NCHAR, NCHAR VARYING and NCLOB <data type>s is non-standard because the SQL Standard requires implementors to define a national Character set. [OCELOT Implementation] The OCELOT DBMS that comes with this book defines the national Character set to be ISO8BIT. For example, here are two equivalent <character string literal>s:

N'Hello there'
ISO8BIT'Hello there'

If you want to restrict your code to Core SQL, don’t use <national character string literal>s.

Character string <data type>s

A character string <data type> is defined by a descriptor that contains five pieces of information.

  1. The <data type>’s name: either CHARACTER, CHARACTER VARYING, CHARACTER LARGE OBJECT, NATIONAL CHARACTER, NATIONAL CHARACTER VARYING or NATIONAL CHARACTER LARGE OBJECT.

  2. The <data type>’s fixed length or maximum length (as applicable).

  3. The name of the Character set that the <data type>’s set of valid values belong to. (An operation that attempts to make a character string <data type> contain a character that does not belong to its Character set will fail: your DBMS will return the SQLSTATE error 22021 "data exception-character not in repertoire".)

  4. The name of the <data type>’s default Collation. (This is the Collation that may be used to compare the <data type>’s values in the absence of an explicit COLLATE clause.)

  5. The <data type>’s coercibility attribute – normally IMPLICIT, but can be EXPLICIT. (The coercibility attribute helps your DBMS determine which Collation to use for a comparison that doesn’t provide an explicit COLLATE clause.)

CHARACTER

The required syntax for a CHARACTER <data type> specification is as follows.

CHARACTER <data type> ::=
CHARACTER [ (length) ]
[ CHARACTER SET <Character set name> ]
[ COLLATE <Collation name> ]

CHARACTER may be abbreviated as CHAR and is a fixed length alphanumeric string, exactly “length” characters long. It defines a set of character string values that belong to a given Character set. For example, this <character string literal>:

'BOB'

is a valid value for this <data type> specification:

CHAR(3)

The optional length, if specified, is an unsigned, positive integer that defines the exact length, in characters, of acceptable values. The minimum length and the default length are both 1. For example, these two <data type> specifications:

CHAR
CHAR(1)

both define a set of character string values that are exactly one character long.

[NON-PORTABLE] The maximum length for CHAR is non-standard because the SQL Standard requires implementors to define CHAR’s maximum length. FIPS says that CHAR should have a maximum length of at least 1000 characters. [OCELOT Implementation] The OCELOT DBMS that comes with this book allows you to define a length ranging from 1 to 4096 for CHAR.

[Obscure Rule] The optional CHARACTER SET clause names the Character set that the <data type>’s values belong to. Your current <AuthorizationID> must have the USAGE Privilege for that Character set. For example, this <data type> specification:

CHAR(15) CHARACTER SET LATIN1

defines a set of character string values, exactly 15 characters long, that belong to the LATIN1 Character set. If you omit the CHARACTER SET clause when specifying a character string <data type> in a <Column definition or a CREATE DOMAIN statement, the <data type>’s Character set is the Character set named in the DEFAULT CHARACTER SET clause of the CREATE SCHEMA statement that defines the Schema that the <data type> belongs to. For example, consider this SQL statement, which creates a Schema that has a default Character set of LATIN1:

CREATE SCHEMA schema_example
   DEFAULT CHARACTER SET INFORMATION_SCHEMA.LATIN;

Based on this definition, the <data type> specification in this SQL statement defines a set of character string values, exactly 15 characters long, that belong to the LATIN1 Character set:

CREATE TABLE schema_example.Table_1 (
   column_1 CHAR(15));

[NON-PORTABLE] If you omit the CHARACTER SET clause when specifying a character string <data type> anywhere other than in a <Column definition> or a CREATE DOMAIN statement, the <data type>’s Character set is non-standard because the SQL Standard requires implementors to define a default Character set for such situations. [OCELOT Implementation] The OCELOT DBMS that comes with this book defines ISO8BIT as the default Character set for such situations.

[Obscure Rule] A CHAR <data type> has a coercibility attribute of IMPLICIT. The optional COLLATE clause defines the <data type>’s default Collation. The Collation named must be a Collation defined for the relevant Character set, but you may define a Column, Field or Domain with a default Collation that is different from the default Collation of its Character set. If you’re using COLLATE in a SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Collation name>”. If you’re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Collation name>”. For example, these two <data type> specifications:

CHAR(15) COLLATE my.collation_1
CHAR(15) CHARACTER SET my.charset_1 COLLATE my.collation_1

both define a set of character string values exactly 15 characters long. The first example defines the <data type>’s Character set to be the default Character set. The second example defines the <data type>’s Character set to be a Character set named my.charset_1. Both examples define the <data type>’s default Collation to be a Collation named my.colllation_1. If you omit the COLLATE clause, the <data type> is defined as if its Character set’s default Collation was explicitly specified – see “Character Strings and Collations”.

If you want to restrict your code to Core SQL, don’t use the CHARACTER SET clause or the COLLATE clause for CHAR <data type> specifications.

NATIONAL CHARACTER

The required syntax for a NATIONAL CHARACTER <data type> specification is as follows.

NATIONAL CHARACTER <data type> ::=
NATIONAL CHARACTER [ (length) ] [ COLLATE <Collation name> ]

NATIONAL CHARACTER may be abbreviated as NATIONAL CHAR and as NCHAR. NCHAR is a synonym for a CHAR <data type> that belongs to a predefined “national” Character set.

Other than the fact that you may not add a CHARACTER SET clause to an NCHAR <data type> specification because NCHAR implies the same national Character set used for <national character string literal>s and NCHAR VARYING and NCLOB <data type>s, the specifications for the NCHAR and CHAR <data type>s are the same. Here are two examples of an NCHAR <data type> specification:

NCHAR(10)
 -- uses the national Character set's Collation
NCHAR(10) COLLATE my.collation_1
 -- uses a Collation named MY.COLLATION_1

[NON-PORTABLE] The national Character set used by <national character string literal>s and the NCHAR, NCHAR VARYING and NCLOB <data type>s is non-standard because the SQL Standard requires implementors to define a national Character set. [OCELOT Implementation] The OCELOT DBMS that comes with this book defines the national Character set to be ISO8BIT. For example, these two <data type> specifications both define the same set of valid values:

NCHAR(10)
CHAR(10) CHARACTER SET ISO8BIT

If you want to restrict your code to Core SQL, don’t use the NCHAR <data type>.

CHARACTER VARYING

The required syntax for a CHARACTER VARYING <data type> specification is as follows.

CHARACTER VARYING <data type> ::=
CHARACTER VARYING (length)
[ CHARACTER SET <Character set name> ]
[ COLLATE <Collation name> ]

CHARACTER VARYING may be abbreviated as CHAR VARYING and as VARCHAR and is a variable length alphanumeric string, from zero to “length” characters long. It defines a set of character string values that belong to a given Character set. For example, the follwoing three <character string literal>s:

'BOB'
'BOBBY'
'ROBERT'

are all valid values for this <data type> specification:

VARCHAR(6)

The mandatory length specification is an unsigned, positive integer that defines the maximum length, in characters, of acceptable values. The minimum length is 1.

[NON-PORTABLE] The maximum length for VARCHAR is non-standard because the SQL Standard requires implementors to define VARCHAR’s maximum length. FIPS says that VARCHAR should have a maximum length of at least 1000 characters. [OCELOT Implementation] The OCELOT DBMS that comes with this book allows you to define a length ranging from 1 to 4096 for VARCHAR.

[Obscure Rule] The optional CHARACTER SET clause names the Character set that the <data type>’s values belong to; see the remarks under "CHARACTER". For example, this <data type> specification:

VARCHAR(15) CHARACTER SET LATIN1

defines a set of character string values, 0 to 15 characters long, that belong to the LATIN1 Character set. (Zero length strings may be stored in a VARCHAR field.)

[Obscure Rule] A VARCHAR <data type> has a coercibility attribute of IMPLICIT. The optional COLLATE clause defines the <data type>’s default Collation; see the remarks under "CHARACTER". For example, these two <data type> specifications:

VARCHAR(15) COLLATE my.collation_1
VARCHAR(15) CHARACTER SET my.charset_1 COLLATE my.collation_1

both define a set of character string values, 0 to 15 characters long, that have a default Collation called my.collation_1.

If you want to restrict your code to Core SQL, don’t use the CHARACTER SET clause or the COLLATE clause for VARCHAR <data type> specifications.

NATIONAL CHARACTER VARYING

The required syntax for a NATIONAL CHARACTER VARYING <data type> specification is:

NATIONAL CHARACTER VARYING <data type> ::=
NATIONAL CHARACTER VARYING (length) [ COLLATE <Collation name> ]

NATIONAL CHARACTER VARYING may be abbreviated as NATIONAL CHAR VARYING and as NCHAR VARYING. NCHAR VARYING is a synonym for a VARCHAR <data type> that belongs to a predefined “national” Character set.

Other than the fact that you may not add a CHARACTER SET clause to an NCHAR VARYING <data type> specification because NCHAR VARYING implies the same national Character set used for <national character string literal>s and NCHAR and NCLOB <data type>s, the specifications for the NCHAR VARYING and VARCHAR <data type>s are the same. Here are two examples of an NCHAR VARYING <data type> specification:

NCHAR VARYING(10)
 -- uses the national Character set's Collation
NCHAR VARYING(10) COLLATE my.collation_1
 -- uses a Collation named MY.COLLATION_1

[NON-PORTABLE] The national Character set used by <national character string literal>s and the NCHAR, NCHAR VARYING and NCLOB <data type>s is non-standard because the SQL Standard requires implementors to define a national Character set. [OCELOT Implementation] The OCELOT DBMS that comes with this book defines the national Character set to be ISO8BIT.For example, these two <data type> specifications both define the same set of valid values:

NCHAR VARYING(10)
VARCHAR(10) CHARACTER SET ASCII_FULL

If you want to restrict your code to Core SQL, don’t use the NCHAR VARYING <data type>.

CHARACTER LARGE OBJECT

The required syntax for a CHARACTER LARGE OBJECT <data type> specification is as follows.

CHARACTER LARGE OBJECT <data type> ::=
{CHARACTER LARGE OBJECT | CLOB} [ (length) ]
[ CHARACTER SET <Character set name> ]
[ COLLATE <Collation name> ]

CHARACTER LARGE OBJECT may be abbreviated as CHAR LARGE OBJECT and as CLOB and is a variable length alphanumeric string, from zero to “length” characters long. It defines a set of large object character string values that belong to a given Character set. For example, these three <character string literal>s:

'BOB'
'BOBBY'
'ROBERT'

are all valid values for this <data type> specification:

CLOB(6)

The optional length, if specified, is an unsigned positive integer, possibly followed by a letter code (either “K”, “M” or “G”). It defines the maximum length of acceptable values in the CLOB field.

  • If the length n is not followed by a letter code, the CLOB may hold up to n characters.

  • The length may include the letter code “K” (kilobyte), “M” (megabyte), or “G” (gigabyte). If the length is defined as nk, the CLOB may hold up to n*1024 characters. If the length is defined as nG, the CLOB may hold up to n*1,073,741,824 characters.

For example, the following <data type> specification defines a set of large Object character string values that may range from zero to 20 characters.

CLOB(20)

(Zero length large object character strings can be stored in a CLOB field.)

This <data type> specification defines a set of large object character string values that may range from zero to 2048 characters:

CLOB(2K)

This <data type> specification defines a set of large object character string values that may range from zero to 2,097,152 characters:

CLOB(2M)

And this <data type> specification defines a set of large object character string values that may range from zero to 2,147,483,648 characters:

CLOB(2G)

[NON-PORTABLE] The default length and the maximum length for CLOB are non- standard because the SQL Standard requires implementors to define CLOB’s default and maximum lengths.

[OCELOT Implementation] The OCELOT DBMS that comes with this book allows the length of CLOB to range from 1 to 32 and sets the default length of a CLOB <data type> to 1K. For example, these two <data type> specifications are equivalent: both define a set of large object character string values that may range from zero to 1024 characters:

CLOB
CLOB(1K)

[Obscure Rule] The optional CHARACTER SET clause names the Character set that the <data type>’s values belong to; see the remarks under "CHARACTER". For example, this <data type> specification:

CLOB(5M) CHARACTER SET LATIN1

defines a set of large object character string values, 0 to 5,242,880 characters long, that belong to the LATIN1 Character set.

[Obscure Rule] A CLOB <data type> has a coercibility attribute of IMPLICIT. The optional COLLATE clause defines the <data type>’s default Collation; see the remarks under "CHARACTER". For example, these two <data type> specifications:

CLOB(3G) COLLATE my.collation_1
CLOB(3G) CHARACTER SET my.charset_1 COLLATE my.collation_1

both define a set of large object character string values, 0 to 3,221,225,472 characters long, that have a default Collation called my.colllation_1.

If you want to restrict your code to Core SQL, don’t use the CHARACTER SET clause or the COLLATE clause for CLOB <data type> specifications.

NATIONAL CHARACTER LARGE OBJECT

The required syntax for a NATIONAL CHARACTER LARGE OBJECT <data type> specification is as follows.

NATIONAL CHARACTER LARGE OBJECT <data type> ::=
NATIONAL CHARACTER LARGE OBJECT [ (length) ][ COLLATE <Collation name> ]

NATIONAL CHARACTER LARGE OBJECT may be abbreviated as NCHAR LARGE OBJECT and as NCLOB. NCLOB is a synonym for a CLOB <data type> that belongs to a predefined “national” Character set.

Other than the fact that you may not add a CHARACTER SET clause to an NCLOB <data type> specification because NCLOB implies the same national Character set used for <national character string literal>s and NCHAR and NCHAR VARYING <data type>s, the specifications for the NCLOB and CLOB <data type>s are the same. Here are two examples of an NCLOB <data type> specification:

NCLOB(2K)
 -- uses the national Character set's Collation
NCLOB(2K) COLLATE my.collation_1
 -- uses a Collation named MY.COLLATION_1

[NON-PORTABLE] The national Character set used by <national character string literal>s and the NCHAR, NCHAR VARYING and NCLOB <data type>s is non-standard because the SQL Standard requires implementors to define a national Character set. [OCELOT Implementation] The OCELOT DBMS that comes with this book defines the national Character set to be ISO8BIT. For example, these two <data type> specifications both define the same set of valid values:

NCLOB(1G)
CLOB(1G) CHARACTER SET ISO8BIT

If you want to restrict your code to Core SQL, don’t use the NCLOB <data type>.

Now that we’ve described SQL’s character string <data type>s, let’s look at some example SQL statements that put them to use.

These SQL statements make a Table with six fixed length character string Columns, insert a row, then search for any string greater than ‘hi’.

CREATE TABLE Fixed_Char_Examples (
   occurrence_char_1 CHAR(2),
   occurrence_char_2 CHAR(2) CHARACTER SET LATIN1,
   occurrence_char_3 CHAR(2) COLLATE my.collation_1,
   occurrence_char_4 CHAR(2) CHARACTER SET LATIN1 COLLATE my.collation_1,
   occurrence_nchar_1 NCHAR(2),
   occurrence_nchar_2 NCHAR(2) COLLATE my.collation_1);

 INSERT INTO Fixed_Char_Examples (
       occurrence_char_1,
       occurrence_char_2,
       occurrence_char_3,
       occurrence_char_4,
       occurrence_nchar_1,
       occurrence_nchar_2)
       VALUES ('mm','mm','mm','mm','mm','mm');

 SELECT occurrence_char_1,
        occurrence_char_2,
        occurrence_char_3,
        occurrence_char_4,
        occurrence_nchar_1,
        occurrence_nchar_2
 FROM   Fixed_Char_Examples
 WHERE  occurrence_char_4 > 'hi';

 SELECT occurrence_char_1,
        occurrence_char_2,
        occurrence_char_3,
        occurrence_char_4,
        occurrence_nchar_1,
        occurrence_nchar_2
 FROM   Fixed_Char_Examples
 WHERE  occurrence_nchar_2 > N'hi';

These SQL statements make a Table with six variable length character string Columns, insert a row, then search for any string not equal to ‘hi’.

CREATE TABLE Varying_Char_Examples (
   occurrence_varchar_1 VARCHAR(5),
   occurrence_varchar_2 VARCHAR(5) CHARACTER SET LATIN1,
   occurrence_varchar_3 VARCHAR(5) COLLATE my.collation_1,
   occurrence_varchar_4 VARCHAR(5) CHARACTER SET LATIN1 COLLATE my.collation_1,
   occurrence_nvchar_1 NCHAR VARYING(5),
   occurrence_nvchar_2 NCHAR VARYING(5) COLLATE my.collation_1);

 INSERT INTO Varying_Char_Examples (
       occurrence_varchar_1,
       occurrence_varchar_2,
       occurrence_varchar_3,
       occurrence_varchar_4,
       occurrence_nvchar_1,
       occurrence_nvchar_2)
       VALUES ('mm','mm','mm','mm','mm','mm');

 SELECT occurrence_varchar_1,
        occurrence_varchar_2,
        occurrence_varchar_3,
        occurrence_varchar_4,
        occurrence_nvchar_1,
        occurrence_nvchar_2
 FROM   Varying_Char_Examples
 WHERE  occurrence_varchar_4 <> 'hi';

 SELECT occurrence_varchar_1,
        occurrence_varchar_2,
        occurrence_varchar_3,
        occurrence_varchar_4,
        occurrence_nvchar_1,
        occurrence_nvchar_2
 FROM   Varying_Char_Examples
 WHERE  occurrence_nvchar_2 <> N'hi';

These SQL statements make a Table with six large object character string Columns, insert a row, then search for any string equal to ‘hi’.

CREATE TABLE Large_Char_Examples (
   occurrence_clob_1 CLOB(10),
   occurrence_clob_2 CLOB(10K) CHARACTER SET LATIN1,
   occurrence_clob_3 CLOB(10M) COLLATE my.collation_1,
   occurrence_clob_4 CLOB(10G) CHARACTER SET LATIN1 COLLATE my.collation_1,
   occurrence_nclob_1 NCLOB(2K),
   occurrence_nclob_2 NCLOB COLLATE my.collation_1);

 INSERT INTO Large_Char_Examples (
       occurrence_clob_1,
       occurrence_clob_2,
       occurrence_clob_3,
       occurrence_clob_4,
       occurrence_nclob_1,
       occurrence_nclob_2)
       VALUES ('mm','mm','mm','mm','mm','mm');

 SELECT occurrence_clob_1,
        occurrence_clob_2,
        occurrence_clob_3,
        occurrence_clob_4,
        occurrence_nclob_1,
        occurrence_nclob_2
 FROM   Large_Char_Examples
 WHERE  occurrence_clob_4 = 'hi';

 SELECT occurrence_clob_1,
        occurrence_clob_2,
        occurrence_clob_3,
        occurrence_clob_4,
        occurrence_nclob_1,
        occurrence_nclob_2
 FROM   Large_Char_Examples
 WHERE  occurrence_nclob_2 = N'hi';

Character String Operations

A character string is compatible with, and comparable to, all other character strings from the same Character set – that is, character strings are mutually comparable and mutually assignable as long as they belong to the same Character set. Character strings may not be directly compared with, or directly assigned to, any other <data type> class, though implicit type conversions can occur in expressions, SELECTs, INSERTs, DELETEs and UPDATEs. Explicit character string type conversions can be forced with the CAST operator.

CAST

In SQL, CAST is a scalar operator that converts a given scalar value to a given scalar <data type>. The required syntax for the CAST operator is as follows.

CAST (<cast operand> AS <cast target>)
   <cast operand> ::= scalar_expression
   <cast target> ::= <Domain name> | <data type>

The CAST operator converts values of a source <data type> into values of a target <data type>, where each <data type> is an SQL pre-defined <data type> (data conversions between UDTs are done with a user-defined cast). The source <data type>, or <cast operand>, can be any expression that evaluates to a single value. The target <data type>, or <cast target>, is either an SQL predefined <data type> specification or the name of a Domain whose defined <data type> is the SQL predefined <data type> that you want to convert the value of “scalar_expression” into. (If you use CAST ()… AS <Domain name>) your current <AuthorizationID> must have the USAGE Privilege on that Domain.)

It isn’t, of course, possible to convert the values of every <data type> into the values of every other <data type>. For character strings, the rules are:

  • CAST (NULL AS <data type>) and CAST (character_string_source_is_a_null_value AS <data type>) both result in NULL.

  • You can CAST a fixed length or variable length character string or a CLOB or NCLOB source to these targets:

    • exact numeric

    • approximate numeric

    • fixed length character string (if source and target belong to the same Character set)

    • variable length character string (if source and target belong to the same Character set)

    • CLOB (if source and target belong to the same Character set)

    • NCLOB (if source and target belong to the same Character set),

    • fixed length bit string

    • variable length bit string

    • date

    • time

    • timestamp

    • year-month interval

    • day-time interval

    • boolean.

    You can also CAST a fixed length or variable length character string or a CLOB or NCLOB source to a UDT target or a <reference type> target if a user-defined cast exists for this purpose and your current <AuthorizationID> has the EXECUTE Privilege on that user-defined cast.

When you CAST a character string to an exact numeric target or an approximate numeric target, your DBMS strips any leading or trailing spaces from the source and converts the remaining string – which must be the character representation of a number – to that number. For example, CAST ('-25' AS SMALLINT) results in a SMALLINT value of -25. If your source string doesn’t represent a number, the CAST will fail: your DBMS will return the SQLSTATE error 22018 "data exception-invalid character value for cast".

When you CAST a character string to a fixed length character string target, a variable length character string target or a CLOB or NCLOB target, both source and target must belong to the same Character set, the result has the COERCIBLE coercibility attribute and the Collation of the result is the default Collation of the target’s Character set.

  • For fixed length character string targets, if the length of the source equals the fixed length of the target, the result of the CAST is the source string. If the length of the source is shorter than the fixed length of the target, the result of the CAST is the source string padded on the right with however many spaces are required to make the lengths match. If the length of the source is longer than the fixed length of the target, the result of the CAST is a character string that contains as much of the source string as possible – in this case, if the truncated characters are not all spaces, your DBMS will return the SQLSTATE warning 01004 "warning-string data, right truncation".

  • For variable length character string or CLOB or NCLOB targets, if the length of the source is less than or equals the maximum length of the target, the result of the CAST is the source string. If the length of the source is longer than the maximum length of the target, the result of the CAST is a character string that contains as much of the source string as possible – in this case, if the truncated characters are not all spaces, your DBMS will return the SQLSTATE warning 01004 "warning-string data, right truncation".

When you CAST a character string to a fixed length bit string or a variable length bit string target, the result is the character string converted to a string of bits. That is, when you CAST a character string (which has a “form- of-use encoding”) to a bit string (which has no encoding), you get the bits that make up the characters in the source string. For example, assume the source character string for a CAST belongs to the ISO8BIT Character set. In this Character set, the code for the letter 'A' is 41 hexadecimal (the binary number 01000001) and the code for the letter 'B' is 42 hexadecimal (the binary number 01000010) – so CAST ('AB' TO BIT(16)) will result in B'0100000101000010'.

  • For fixed length bit string targets, if the bit length of the converted source string equals the fixed bit length of the target, the result of the CAST is the converted source string. If the converted source value’s bit length is larger than the fixed bit length of the target, the result of the CAST is a bit string that contains as much of the converted source string as possible. In this case, if the truncated bits are not all zero-bits, your DBMS will return the SQLSTATE warning 01004 "warning-string data, right truncation". If the converted source value’s bit length is less than the fixed bit length of the target, the result of the CAST is the converted bit string, padded on the least significant end with as many zero-bits as required to make the lengths match – in this case, your DBMS will return the SQLSTATE warning 01008 "warning-implicit zero-bit padding".

  • For variable length bit string targets, if the bit length of the converted source string is less than or equals the maximum bit length of the target, the result of the CAST is the converted source string. If the converted source value’s bit length is larger than the maximum bit length of the target, the result of the CAST is a bit string that contains as much of the converted source string as possible – in this case, if the truncated bits are not all zero-bits, your DBMS will return the SQLSTATE warning 01004 "warning-string data, right truncation".

When you CAST a character string to a date target, your DBMS strips any leading or trailing spaces from the source and converts the remaining string – which must be the character representation of a valid date – to that date. If your source string doesn’t represent a valid date, the CAST will fail: your DBMS will return the SQLSTATE error 22007 "data exception-invalid datetime format".

When you CAST a character string to a time target, your DBMS strips any leading or trailing spaces from the source and converts the remaining string - - which must be the character representation of a valid time – to that time. If your source string doesn’t represent a valid time, the CAST will fail: your DBMS will return the SQLSTATE error 22007 "data exception-invalid datetime format". If your source string isn’t a string that could represent any time (even an invalid one), the CAST will also fail: your DBMS will return the SQLSTATE error 22018 "data exception-invalid character value for cast".

When you CAST a character string to a timestamp target, your DBMS strips any leading or trailing spaces from the source and converts the remaining string - - which must be the character representation of a valid timestamp – to that timestamp. If your source string doesn’t represent a valid timestamp, the CAST will fail: your DBMS will return the SQLSTATE error 22007 "data exception-invalid datetime format". If your source string isn’t a string that could represent any timestamp (even an invalid one), the CAST will also fail: your DBMS will return the SQLSTATE error 22018 "data exception-invalid character value for cast".

When you CAST a character string to an interval target, your DBMS strips any leading or trailing spaces from the source and converts the remaining string - - which must be the character representation of a valid interval for the target – to that interval. If your source string doesn’t represent a valid interval for the target, the CAST will fail: your DBMS will return the SQLSTATE error 22XXX "data exception-invalid interval format". If your source string isn’t a string that could represent any interval (even an invalid one for the target), the CAST will also fail: your DBMS will return the SQLSTATE error 22007 "data exception-invalid datetime format".

When you CAST a character string to a boolean target, your DBMS strips any leading or trailing spaces from the source and converts the remaining string - - which must be the character representation of one of the truth values TRUE, FALSE or UNKNOWN – to that truth value. If your source string doesn’t represent a truth value, the CAST will fail: your DBMS will return the SQLSTATE error 22018 "data exception-invalid character value for cast".

When you CAST a character string to a UDT or a <reference type> target, your DBMS invokes the user defined cast routine, with the source value as the routine’s argument. The CAST result is the value returned by the user defined cast.

Here are some examples of CAST operations with character string source operands:

CAST (N'The rain in Spain' AS CHAR (18))
-- result is the string 'The rain in Spain'
CAST ('050.00' AS DECIMAL (3,1))
-- result is the number 50.0
CAST ('1997-04-04' AS DATE)
--result is the date DATE '1997-04-04'
CAST ('FALSE' AS BOOLEAN)
--result is the truth value FALSE

If you want to restrict your code to Core SQL, (a) don’t use <Domain name> as a CAST target: CAST only to a <data type> and (b) don’t use CAST to convert any CLOB or NCLOB values to another <data type>.

Assignment

SQL allows you to assign only compatible character strings – that is, character strings are mutually assignable only if the source string and the target string belong to the same Character set. If you need to assign a character string to a target that belongs to a different Character set, use the TRANSLATE function to translate the source into an equivalent string that belongs to the target’s Character set.

In SQL, when a character string is assigned to a character string target, the assignment is done one character at a time, from left to right.

When a character string is taken from SQL-data to be assigned to a fixed length character string target and the source is shorter than the target, the source is padded (on the right) with spaces until it matches the target’s size. If the source is longer than the target, the source is truncated to fit the target. In this case, your DBMS will return the SQLSTATE warning 01004 "warning-string data, right truncation". When a character string is taken from SQL-data to be assigned to a variable length character string or CLOB target, the size of the target is first set either to the size of the source or to its own maximum length, whichever is less. The source may then be truncated, if necessary, to match the size of the target. In this case, your DBMS will return the SQLSTATE warning 01004 "warning-string data, right truncation".

[Obscure Rule] Since only SQL accepts null values, if your source is NULL, then your target’s value is not changed. Instead, your DBMS will set its indicator parameter to -1, to indicate that an assignment of the null value was attempted. If your target doesn’t have an indicator parameter, the assignment will fail: your DBMS will return the SQLSTATE error 22002 "data exception-null value, no indicator parameter". If your source is a non-null value that fits into your target, your DBMS will set the target’s indicator parameter (if any) to zero. If your source is longer than your target, your DBMS will set your target’s indicator parameter to the length of the source; that is, if your source is 12 characters long and your target can accept only 10 characters, your DBMS will set the target’s indicator parameter to 12, to indicate that 2 characters were lost on assignment. If the source’s length is too big to be assigned to the indicator, the assignment will fail: your DBMS will return the SQLSTATE error 22022 "data exception-indicator overflow". We’ll talk more about indicator parameters in our chapters on SQL binding styles.

When a character string is assigned to a fixed length SQL-data character string target and the source is shorter than the target, the source is padded (on the right) with spaces until it matches the target’s size. If the source is larger than the target, but the extra characters are all spaces, the source’s significant character string value is assigned to the target. If the source is larger than the target and the extra characters are not all spaces, the assignment will fail: your DBMS will return the SQLSTATE error 22001 "data exception-string data, right truncation". When a character string is assigned to a variable length SQL-data character string or CLOB target, the size of the target is first set either to the size of the source or to its own maximum length, whichever is less. If the source is larger than the target, but the extra characters are all spaces, the source’s significant character string value is assigned to the target. If the source is larger than the target and the extra characters are not all spaces, the assignment will fail: your DBMS will return the SQLSTATE error 22001 "data exception-string data, right truncation".

[Obscure Rule] There are two ways to assign a null value to an SQL-data target. Within SQL, you can use the <keyword> NULL in an INSERT or an UPDATE statement to indicate that the target should be set to NULL; that is, if your source is NULL, your DBMS will set your target to NULL. Outside of SQL, if your source has an indicator parameter that is set to -1, your DBMS will set your target to NULL (regardless of the value of the source). An indicator parameter with a value less than -1 will cause an error: your DBMS will return the SQLSTATE error 22010 "data exception-invalid indicator parameter value".

Comparison

SQL provides the usual scalar comparison operators, = and <> and < and <= and > and >= – to perform operations on CHAR, VARCHAR, NCHAR and NCHAR VARYING character strings but provides only the = and <> operators to perform operations on CLOB and NCLOB character strings. All of them will be familiar; there are equivalent operators in other computer languages. If any of the comparands are NULL, the result of the operation is UNKNOWN. For example:

'hello' < 'zebra'

returns TRUE.

'hello' > {result is NULL}

returns UNKNOWN.

SQL also provides three quantifiers – ALL, SOME, ANY – which you can use along with a comparison operator to compare a CHAR, VARCHAR, NCHAR or NCHAR VARYING value with the collection of values returned by a <table subquery>. (You can’t use quantifiers in CLOB or NCLOB comparisons.) Place the quantifier after the comparison operator, immediately before the <table subquery>. For example:

SELECT char_column
FROM   Table_1
WHERE  char_column = ALL (
   SELECT char_column
   FROM   Table_2);

ALL returns TRUE either (a) if the collection is an empty set (i.e.: if it contains zero rows) or (b) if the comparison operator returns TRUE for every value in the collection. ALL returns FALSE if the comparison operator returns FALSE for at least one value in the collection.

SOME and ANY are synonyms. They return TRUE if the comparison operator returns TRUE for at least one value in the collection. They return FALSE either (a) if the collection is an empty set or (b) if the comparison operator returns FALSE for every value in the collection. The search condition = ANY (collection) is equivalent to IN (collection).

SQL allows you to compare character strings only if (a) they belong to the same Character set and (b) have, or can be coerced into having, the same Collation for the comparison – that is, character strings are mutually comparable only if both their Character sets and their Collations are the same. When a character string is compared to another character string, the comparison is done one character at a time, from left to right. The result of the comparison is determined by the rules of the relevant Collation. Because of this, two strings with different lengths may, or may not, compare as equal.

Although a comparison involves two character strings, both of which have a default Collation, only one of the Collations can be used to govern the result of the comparison. So, when you (a) compare character strings that have different current default Collations and (b) don’t explicitly specify a Collation for the comparison, your DBMS will use the coercibility attribute of each string to choose the relevant Collation.

A <Column name>, <Column reference> or other character string value that includes a COLLATE clause has a coercibility attribute of EXPLICIT – its Collation is the Collation named. If a <Column name> or <Column reference> doesn’t include a COLLATE clause, it has a coercibility attribute of IMPLICIT – its Collation is the Collation specified when the Column was created (see the <data type> definitions). If any other character string value (e.g.: a host variable or a <literal>) doesn’t include a COLLATE clause, it normally has a coercibility attribute of COERCIBLE – its Collation is the default Collation for its Character set. Sometimes a character string value is the result of an expression that joins strings with different Collations (e.g.: a concatenation operation that doesn’t include a COLLATE clause). These character strings have a coercibility attribute of NO COLLATION.

After determining the coercibility attribute of each character string in a comparison, your DBMS will choose the relevant Collation using these rules:

  • Strings with COERCIBLE coercibility may be compared to strings with any coercibility attribute except NO COLLATION. If both comparands have COERCIBLE coercibility, the relevant Collation is the default Collation of their mutual Character set. If one comparand has COERCIBLE coercibility and the other has EXPLICIT coercibility, the relevant Collation is the EXPLICIT Collation. If one comparand has COERCIBLE coercibility and the other has IMPLICIT coercibility, the relevant Collation is the IMPLICIT Collation.

  • Strings with EXPLICIT coercibility may be compared to strings with any coercibility attribute. If one comparand has EXPLICIT coercibility and the other has COERCIBLE, IMPLICIT or NO COLLATION coercibility, the relevant Collation is the EXPLICIT Collation. If both comparands have EXPLICIT coercibility, they must also have the same Collation. The relevant Collation is their mutual EXPLICIT Collation.

  • Strings with IMPLICIT coercibility may be compared to strings with any coercibility attribute except NO COLLATION.If one comparand has IMPLICIT coercibility and the other has COERCIBLE coercibility, the relevant Collation is the IMPLICIT Collation. If one comparand has IMPLICIT coercibility and the other has EXPLICIT coercibility, the relevant Collation is the EXPLICIT Collation. If both comparands have IMPLICIT coercibility, they must also have the same Collation. The relevant Collation is their mutual IMPLICIT Collation.

  • Strings with NO COLLATION coercibility may only be compared to strings with a coercibility attribute of EXPLICIT. The relevant Collation is the EXPLICIT Collation.

When you compare character strings that have different lengths, the result also depends on whether the relevant Collation has the PAD SPACE attribute or the NO PAD attribute. If the relevant Collation has the PAD SPACE attribute, your DBMS will extend the shorter character string to the length of the larger string (by padding it on the right with spaces) before comparing the strings. If the relevant Collation has the NO PAD attribute, then – all other things being equal – the longer string will evaluate as greater than the shorter string. That is, with a NO PAD Collation, the result of these rules is that a shorter comparand which is equal to the same-length substring of a larger comparand will evaluate as less than the larger comparand – even if the remainder of the larger string consist only of spaces or controll characters. For example, a comparison of these two <literal>s:

'BOB'
'BOB    '

would result in the first <literal> being evaluated as less than the second with a PAD SPACE Collation and as less than the second with a NO PAD Collation (assuming that the Collations both use the familiar Latin collating rules.)

CREATE TABLE Table_1 (
   char_column CHAR(5));

INSERTS INTO Table_1 (char_column)
VALUES ('A');

In this example, the string actually inserted is five characters long, i.e.:

'A    '

Thus, with a PAD SPACE Collation, this predicate is TRUE:

... WHERE char_column = 'A'

and with a NO PAD Collation, the same predicate is FALSE.

To summarize, SQL doesn’t allow character strings to be compared unless they belong to the same Character set and have the same Collation for the comparison. You may explicitly specify the relevant Character set or allow it to default to an implicit Character set chosen by your DBMS. You may also explicitly specify the relevant Collation by adding a COLLATE clause to your expression; this will override the expression’s default collating sequence. If you omit the COLLATE clause, your DBMS will choose the relevant Collation for you – see “Character Strings and Collations”, later in this chapter.

If you want to restrict your code to Core SQL, don’t use CLOBs or NCLOBs in comparisons.

Other Operations

With SQL, you have a wide range of operations that you can perform on character strings, or on other values to get a character string result.

Concatenation

The required syntax for a character string concatenation is as follows.

character concatenation ::=
character_string_operand_1 || character_string_operand_2
[ COLLATE <Collation name> ]

The concatenation operator operates on two operands, both of which must evaluate to character strings belonging to the same Character set. It joins the strings together in the order given and returns a character string with a length equal to the sum of the lengths of its operands. If either of the operands is NULL, the result of the operation is also NULL. Here are two examples of character string concatenations:

'hello' || ' bob'
 -- returns hello bob

char_column || 'hello'
-- returns CHAR_COLUMN's value followed by hello

[Obscure Rule] If both operands are fixed length character strings, the concatenation result is a fixed length character string with a length equal to the sum of the lengths of the operands – this length may not exceed the maximum allowed for a fixed length character string.

[Obscure Rule] If either operand is a variable length character string and the sum of their lengths is not greater than the maximum allowed length for a variable length character string, the concatenation result is a variable length character string with a length equal to the sum of the lengths of the operands. If the sum of the operands’ lengths is greater than the maximum allowed, but the extra characters are all spaces, the concatenation result is a variable length character string with a length equal to the maximum allowed length. If the sum of the operands’ lengths is greater than the maximum allowed, and the extra characters are not all spaces, the concatenation will fail: your DBMS will return the SQLSTATE error 22001 "data exception-string data, right truncation".

[Obscure Rule] The result of a character string concatenation normally has a coercibility attribute and Collation determined by Table 7-2 “Collating Sequences and Coercibility Rules for Dyadic Operations”, but you can use the optional COLLATE clause to force EXPLICIT coercibility with a specific Collation. The Collation named must be a Collation defined for the relevant Character set. If you’re using COLLATE in an SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Collation name>”. If you’re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Collation name>”. For example:

'hello' || 'bob' COLLATE my.collation_1

specifies that the result of the concatenation should use a Collation named my.collation_1.

If you want to restrict your code to Core SQL, don’t use the concatenation operator with CLOBs or NCLOBs and don’t use the COLLATE clause to force an EXPLICIT Collation for any character string concatenation.

Scalar functions

SQL provides eleven scalar functions that return a character string: the <case expression>, the <cast specification>, the <char substring function>, the <char overlay function>, the <char trim function>, the <fold function>, the <character translation function>, the <form-of-use conversion function>, the <regular expression substring function>, the <specific type function> and the <niladic user function>. It also provides four scalar functions that operate on character strings, returning a number: the <char position expression>, the <bit length expression>, the <char length expression> and the <octet length expression>. We’ll discuss all but the <specific type function>, the <niladic user function>, the <case expression> and the <cast specification> here. Look for the rest in other chapters; for now, just remember that they all evaluate to a character string and can therefore be used anywhere in SQL that a character string could be used.

<char substring function>

The required syntax for a <char substring function> is as follows.

<char substring function> ::=
SUBSTRING (character_string_argument
   FROM start_argument [ FOR length_argument ]
   [ COLLATE <Collation name> ])

SUBSTRING operates on three arguments: the first must evaluate to a character string, the other two must evaluate to exact numeric integers. It extracts a substring from character_string_argument and returns a variable length character string with a maximum length that equals the fixed length or maximum variable length (as applicable) of the character string argument. If any of the arguments are NULL, SUBSTRING returns NULL.

The start_argument is a number that marks the first character you want to extract from character_string_argument. If SUBSTRING includes the (optional) FOR clause, “length_argument” is the total number of characters you want to extract. If you omit the FOR clause, SUBSTRING will begin at start_argument and extract all the rest of the characters in character_string_argument. Here are some examples of SUBSTRING.

SUBSTRING('epiphany' FROM 5)
-- returns hany

SUBSTRING('epiphany' FROM 5 FOR 3)
-- returns han

SUBSTRING(char_column FROM 1 FOR 4)
-- returns the first four characters of the value in CHAR_COLUMN

. . . WHERE SUBSTRING (char_column FROM 3 FOR 1) = 'A'
-- returns "true" if the third character of the value in CHAR_COLUMN is the letter A

If length_argument is negative, your DBMS will return SQLSTATE error 22011 "data exeption-substring error." If start_argument is greater than the length of character_string_argument, or if (start_argument + length_argument) is less than one, SUBSTRING returns a zero-length character string. If start_argument is negative, or if (start_argument + length_argument) is greater than the length of character_string_argument, that´s okay – the DBMS just ignores any characters before the start of character_string_argument or after the end of character_string_argument. Note that

SUBSTRING('abc' FROM -2 FOR 4)

is legal SQL syntax, but pointless because it won´t return the “expected”result. The SQL Standart requires SUBSTRING to return 'a' for this operation – not 'ab'.

[Obscure Rule] The result of SUBSTRING belongs to the same Character set that its string argument does. It normally has a coercibility attribute and Collation determined by Table 7-1 “Collating Sequences and Coercibility Rules for Monadic Operations”, where character_string_argument is the monadic operator, but you can use the optional COLLATE clause to force EXPLICIT coercibility with a specific Collation. The Collation named must be a Collation defined for the relevant Character set. If you’re using COLLATE in an SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Collation name>”. If you’re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Collation name>”. For example,

SUBSTRING(char_column FROM 1 FOR 4) COLLATE my.collation_1

specifies that the result of SUBSTRING should use a Collation named my.collation_1.

[Obscure Rule] SUBSTRING can also operate on a bit string and a BLOB. We’ve ignored these options for now – look for them in our chapters on bit strings and BLOBs.

If you want to restrict your code to Core SQL, don’t use SUBSTRING with NCLOBs and don’t use the COLLATE clause to force an EXPLICIT Collation for any SUBSTRING operation.

<char overlay function>

The required syntax for a <char overlay function> is as follows.

<character overlay function> ::=
OVERLAY (character_string_argument_1
PLACING character_string_argument_2
FROM start_argument [ FOR length_argument ]
[ COLLATE <Collation name> ])

OVERLAY operates on four arguments: the first two must evaluate to character strings belonging to the same Character set, the other two must evaluate to exact numeric integers. It extracts a substring from character_argument_1, replacing it with character_string_argument_2, and returns the resulting character string. If any of the arguments are NULL, OVERLAY returns NULL.

The start_argument is a number that marks the first character you want to replace in character_string_argument_1. If OVERLAY includes the (optional) FOR clause, length_argument is the total number of characters you want to replace. Thus, start_argument and length_argument identify the portion of character_string_argument_1 you want to replace, while character_string_2 is what you want to with. If you omit the FOR clause, then length_argument defaults to the length of character_string_argument_2. Here are some examples of OVERLAY:

OVERLAY('epiphany' PLACING 'no' FROM 5)
-- returns epipnony

OVERLAY('epiphany' PLACING 'no' FROM 5 FOR 3)
-- returns epipnoy

[Obscure Rule] The result of OVERLAY belongs to the same Character set that its arguments do. It normally has a coercibility attribute and Collation determined by Table 7-2 “Collating Sequences and Coercibility Rules for Dyadic Operations”, but you can use the optional COLLATE clause to force EXPLICIT coercibility with a specific Collation. The Collation named must be a Collation defined for the relevant Character set. If you’re using COLLATE in an SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Collation name>”. If you’re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Collation name>”. For example:

OVERLAY('epiphany' PLACING 'no' FROM 5) COLLATE my.collation_1

specifies that the result of OVERLAY should use a Collation named my.collation_1.

[Obscure Rule] OVERLAY can also operate on a BLOB. We’ve ignored this option for now – look for it in our chapter on BLOBs.

If you want to restrict your code to Core SQL, don’t use OVERLAY with character strings.

<char trim function>

The required syntax for a <char trim function> is as follows.

<char trim function> ::=
TRIM ([ [ {LEADING | TRAILING | BOTH} ] [ character_string_argument_1 ]
FROM ] character_string_argument_2
[ COLLATE <Collation name> ])

TRIM operates on two arguments, both of which must evaluate to character strings that belong to the same Character set and have the same Collation for the operation. It strips all leading, all trailing or all leading and all trailing trim characters from character_string_argument_2 and returns the resulting variable length character string. The result has a maximum length that equals the fixed length or maximum variable length (as applicable) of character_string_argument_2. If any of the arguments are NULL, TRIM returns NULL.

The trim specification is either LEADING (i.e.: trim all leading trim characters), TRAILING (i.e.: trim all trailing trim characters) or BOTH (i.e.: trim all leading and all trailing trim characters). If this clause is omitted, TRIM defaults to BOTH. For example, these two TRIM functions are equivalent: they both strip away all leading and all trailing letters A:

TRIM('A' FROM char_column)
TRIM(BOTH 'A' FROM char_column)

The character_string_argument_1 defines the trim character: the character that should be stripped away by the TRIM function. If character_string_argument_1 is omitted, TRIM strips spaces away. For example, these two TRIM functions are equivalent: they both strip away all trailing spaces:

TRIM(TRAILING FROM char_column)
TRIM(TRAILING ' ' FROM char_column)

These two TRIM functions are equivalent – they both strip away all leading spaces:

TRIM(LEADING FROM char_column)
TRIM(LEADING ' ' FROM char_column)

These two TRIM functions are equivalent – they both strip away all leading and all trailing spaces:

TRIM(char_column)
TRIM(BOTH ' ' FROM char_column)

If the length of character_string_argument_1 is not one character, TRIM will fail: your DBMS will return the SQLSTATE error 22027 "data exception-trim error".

[Obscure Rule] The result of TRIM belongs to the same Character set that its arguments do. It normally has a coercibility attribute and Collation determined by Table 7.1, “Collating Sequences and coercibility rules for monadic operations,” where character_string_argument_2 is the monadic operand, but you can use the optional COLLATE clause to force EXPLICIT coercibility with a specific Collation. The Collation named must be a Collation defined for the relevant Character set. If you’re using COLLATE in an SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Collation name>”. If you’re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Collation name>”. For example:

TRIM(BOTH ' ' FROM char_column) COLLATE my.collation_1

specifies that the result of TRIM should use a Collation named my.collation_1.

[Obscure Rule] TRIM can also operate on a BLOB. We’ve ignored this option for now – look for it in our chapter on BLOBs.

If you want to restrict your code to Core SQL, don’t use TRIM with NCLOBs and don’t use the COLLATE clause to force an EXPLICIT Collation for any TRIM operation.

<fold function>

The required syntax for a <fold function> is as follows.

<fold function> ::=
{ UPPER | LOWER } (character_string_argument [ COLLATE <Collation name> ])

UPPER and LOWER operate on an argument that evaluates to a character string. UPPER converts every lower case letter in character_string_argument to its corresponding upper case equivalent, while LOWER converts every upper case letter in character_string_argument to its corresponding lower case equivalent. Any character that has no upper or lower case equivalent (as applicable) remains unchanged. The conversion reflects the normal rules for letters of the simple Latin 26-letter alphabet – that is abcdefghijklmnopqrstuvwxyz converts to and from ABCDEFGHIJKLMNOPQRSTUVWXYZ – but it also reflects the normal rules for the accented letters in character_string_argument’s Character set, e.g.: ö converts to and from Ö.

Both UPPER and LOWER return a character string with a length that equals the fixed length or maximum variable length (as applicable) of character_string_argument. If the character string argument is NULL``, UPPER and LOWER return NULL. Here are some examples:

UPPER('E. E. Cummings')
-- returns E. E. CUMMINGS

LOWER('E. E. Cummings')
-- returns e. e. cummings

UPPER(LOWER('E. E. Cummings'))
-- returns E. E. CUMMINGS

In the last example, UPPER and LOWER do not cancel each other out; the output string is not the same as the input string. Such information loss occurs because fold functions don’t affect characters which are already in the right case.

Tip

A string which contains no letters will be the same after UPPER and LOWER, so if you need to test whether a character string contains letters do this:

SELECT character_column
FROM   Table_Of_Character_Strings
WHERE  UPPER(character_column) <> LOWER(character_column);

Such a query will find '1a 2' but will not find '1$ 2'; thus you can use fold functions to filter out strings which contain letters.

[Obscure Rule] The result of UPPER and LOWER is a fixed length string if character_string_argument is a fixed length string and a variable length string if character_string_argument is a variable length string. In either case, the result belongs to the same Character set that the argument does. It normally has a coercibility attribute and Collation determined by Table 7-1 “Collating Sequences and Coercibility Rules for Monadic Operations”, but you can use the optional COLLATE clause to force EXPLICIT coercibility with a specific Collation. The Collation named must be a Collation defined for the relevant Character set. If you’re using COLLATE in an SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Collation name>”. If you’re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Collation name>”. For example:

UPPER('hello') COLLATE my.collation_1

specifies that the result of UPPER should use a Collation named my.collation_1 and

LOWER('HELLO') COLLATE my.collation_1

specifies that the result of LOWER should use a Collation named my.collation_1.

If you want to restrict your code to Core SQL, don’t use the COLLATE clause to force an EXPLICIT Collation for any UPPER or LOWER operation.

<character translation function>

The required syntax for a <character translation function> is as follows.

<character translation function> ::=
TRANSLATE (character_string_argument USING <Translation name>
   [ COLLATE <Collation name> ])

TRANSLATE operates on an argument that evaluates to a character string. It converts every character in character_string_argument to its corresponding equivalent in another Character set (by changing each character according to some many-to-one or one-to-one mapping) and returns a variable length character string that belongs to the target Character set defined for “<Translation name>”. If the character string argument is NULL, TRANSLATE returns NULL. Here is an example of TRANSLATE:

TRANSLATE('hello' USING my.translation_1)
-- returns a string, equivalent to hello, that belongs to the target Character set defined for a Translation called MY.TRANSLATION_1

(Translations are defined using the CREATE TRANSLATION statement.)

If you’re using TRANSLATE in an SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Translation name>”. If you’re using TRANSLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Translation name>”.

[Obscure Rule] The result of TRANSLATE normally has a coercibility attribute of IMPLICIT and uses the default Collation of the Translation’s target Character set, but you can use the optional COLLATE clause to force EXPLICIT coercibility with a specific Collation. The Collation named must be a Collation defined for the target Character set. If you’re using COLLATE in an SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Collation name>”. If you’re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Collation name>”. For example:

TRANSLATE('hello' USING my.translation_1) COLLATE my.collation_1

specifies that the result of TRANSLATE should use a Collation named my.collation_1.

If you want to restrict your code to Core SQL, don’t use TRANSLATE.

<form-of-use conversion function>

The required syntax for a <form-of-use conversion function> is as follows.

<form-of-use conversion function> ::=
CONVERT (character_string_argument USING <Form-of-use conversion name>
[ COLLATE <Collation name> ])

CONVERT operates on an argument that evaluates to a character string. It converts every character in character_string_argument to its corresponding equivalent using another Form-of-use and returns the resulting variable length character string. If the character string argument is NULL, CONVERT returns NULL. Here is an example of CONVERT:

CONVERT('hello' USING INFORMATION_SCHEMA.new_form)
-- returns a string, equivalent to hello, whose characters are encoded using a Form-of-use called NEW_FORM

A Form-of-use is a character repertoire’s encoding scheme – the one-to-one mapping scheme between each character in the repertoire and a set of internal codes (usually 8-bit values) that define how the repertoire’s characters are encoded as numbers. (These codes are also used to specify the order of the characters within the repertoire.) Supported Forms-of-use are all predefined by your DBMS and thus belong to INFORMATION_SCHEMA. SQL provides no ability to define your own Forms-of-use.

CONVERT’s purpose is to allow you to transfer character strings between SQL- data and your host application, therefore you may only use the function in certain places. When transferring SQL-data to the host, CONVERT is legal only as part of a <select sublist>. For example,

SELECT CONVERT(char_column USING INFORMATION_SCHEMA.new_form)
FROM   Table_1
WHERE  char_column = 'hello';

When transferring host values into SQL-data, use CONVERT to change any host parameter. For example:

INSERT INTO Table_1 (char_column)
VALUES (CONVERT(:char_parameter USING INFORMATION_SCHEMA.new_form));

Tip

You might want to use CONVERT to change a character string’s encoding scheme from 8-bit to 16-bit.

[NON-PORTABLE] Whether you can use CONVERT or not is non-standard because the SQL Standard requires implementors to define all Forms-of-use supported – but has no requirement that a DBMS must support any Form-of-use at all. However, you can use TRANSLATE to provide an equivalent operation. [OCELOT Implementation] The OCELOT DBMS that comes with this book does not provide support for any Form-of-use.

[Obscure Rule] The result of CONVERT belongs to a Character set that consists of the same character repertoire that its argument’s Character set has – but with a different Form-of-use encoding. It normally has a coercibility attribute of IMPLICIT and uses the default Collation of its Character set, but you can use the optional COLLATE clause to force EXPLICIT coercibility with a specific Collation. The Collation named must be a Collation defined for the target Character set. If you’re using COLLATE in an SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Collation name>”. If you’re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Collation name>”. For example:

CONVERT('hello' USING INFORMATION_SCHEMA.new_form) COLLATE my.collation_1

specifies that the result of CONVERT should use a Collation named my.collation_1.

If you want to restrict your code to Core SQL, don’t use CONVERT.

<regular expression substring function>

The required syntax for a <regular expression substring function> is as follows.

<regular expression substring function> ::=
SUBSTRING (character_string_argument FROM pattern FOR escape_character
[ COLLATE <Collation name> ])

SUBSTRING operates on three arguments, all of which evaluate to character strings that belong to the same Character set. It extracts a substring based on pattern from character_string_argument and returns a variable length character string with a maximum length that equals the fixed length or maximum variable length (as applicable) of character_string_argument. Both pattern and escape_character must be regular expressions (see our discussion of the SIMILAR predicate) and escape_character must be exactly one character long. If any of the arguments are NULL, SUBSTRING returns NULL.

The pattern shows the substring you want to extract from character_string_argument. It’s actually a triple pattern: it must consist of three regular expressions, the middle of which is a tagged regular expression (a regular expression that is delimited by scape_character immediately followed by a double quote sign). For example, if your escape character is ?, then pattern must contain ? exactly two times, as in:

'The rain?"%?"Spain'

The three parts of patternstart pattern ?"middle pattern?" end pattern – must match character_string_argument’s start, middle and end; that is, the expression:

'character_string_argument' SIMILAR TO 'pattern'

must be TRUE if the “escape <double quote>” markers are stripped from the pattern. If that’s not the case – that is, if the start or end patterns aren’t in the string – SUBSTRING returns NULL. Otherwise the result of SUBSTRING is character_string_argument’s middle string which corresponds to the middle pattern. Thus, for this SUBSTRING function:

SUBSTRING('The rain in Spain' FROM 'The rain?"%?"Spain' FOR '?')

the result is ' in ' – that is, the return from SUBSTRING is the string of characters which appears between the start pattern ('The rain') and the end pattern ('Spain').

[Obscure Rule] The result of SUBSTRING belongs to the same Character set that its string arguments do. It normally has a coercibility attribute and Collation determined by Table 7-1 “Collating Sequences and Coercibility Rules for Monadic Operations”, where character_string_argument" is the monadic operator, but you can use the optional COLLATE clause to force EXPLICIT coercibility with a specific Collation. The Collation named must be a Collation defined for the relevant Character set. If you’re using COLLATE in an SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Collation name>”. If you’re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Collation name>”. For example:

SUBSTRING(char_column FROM 'hi/"[b-o]/"by' FOR '/') COLLATE my.collation_1

specifies that the result of SUBSTRING should use a Collation named my.collation_1.

If you want to restrict your code to Core SQL, don’t use the <regular expression substring function> form of SUBSTRING.

<char position expression>

The required syntax for a <char position expression> is as follows.

<char position expression> ::=
POSITION (character_string_argument_1 IN character_string_argument_2)

POSITION operates on two arguments, both of which must evaluate to character strings that belong to the same Character set. It determines the first character position (if any) at which character_string_argument_1 is found in character_string_argument_2 and returns this as an exact numeric integer. If either of the arguments are NULL, POSITION returns NULL. If character_string_argument_1 is a zero-length character string, POSITION returns one. If character_string_argument_1 is not found in character_string_argument_2, POSITION returns zero. Here are some examples of POSITION:

POSITION('is' IN 'mistake')
-- returns 2

POSITION('yy' IN 'mistake')
-- returns 0

POSITION('' IN 'mistake')
-- returns 1

[NON-PORTABLE] The precision of POSITION’s result is non-standard because the SQL Standard requires implementors to define the result’s precision. [OCELOT Implementation] The OCELOT DBMS that comes with this book gives the result of POSITION an INTEGER <data type>.

[Obscure Rule] POSITION can also operate on a bit string and a BLOB. We’ve ignored these options for now – look for them in our chapters on bit strings and BLOBs.

If you want to restrict your code to Core SQL, don’t use POSITION with character strings.

<bit length expression>

The required syntax for a <bit length expression> is as follows.

<bit length expression> ::=
BIT_LENGTH (character_string_argument)

BIT_LENGTH operates on an argument that evaluates to a character string. It determines the length of the argument, in bits, and returns this as an exact numeric integer, e.g., BIT_LENGTH('hello') returns 40 (assuming that an 8-bit Character set is in use). If the argument is NULL, BIT_LENGTH returns NULL.

Tip

The length of a character string argument depends on the Character set it belongs to. Most Character sets are 8-bit sets, so BIT_LENGTH would return 8 for each character in your argument. But if you’re using a DBCS, remember that BIT_LENGTH will allot 16 bits for each character.

Tip

BIT_LENGTH will return the total length of your character string argument – including any trailing (or leading) spaces. If you’re looking for the length of the significant value only, use TRIM with BIT_LENGTH. For example:

BIT_LENGTH('hello   ')
-- returns 64; the length of hello followed by 3 spaces
BIT_LENGTH(TRIM('hello   '))
-- returns 40; the length of hello

[NON-PORTABLE] The precision of BIT_LENGTH’s result is non-standard because the SQL Standard requires implementors to define the result’s precision. [OCELOT Implementation] The OCELOT DBMS that comes with this book gives the result of BIT_LENGTH an INTEGER <data type>.

[Obscure Rule] BIT_LENGTH can also operate on a bit string and a BLOB. We’ve ignored these options for now – look for them in our chapters on bit strings and BLOBs.

If you want to restrict your code to Core SQL, don’t use BIT_LENGTH with NCLOBs.

<char length expression>

The required syntax for a <char length expression> is as follows.

<char length expression> ::=
{CHAR_LENGTH | CHARACTER_LENGTH} (character_string_argument)

CHAR_LENGTH (or CHARACTER_LENGTH) operates on an argument that evaluates to a character string. It determines the length of the argument, in characters, and returns this as an exact numeric integer, e.g., CHAR_LENGTH('hello') returns 5. If the argument is NULL, CHAR_LENGTH returns NULL.

Tip

CHAR_LENGTH will return the total length of your character string argument – including any trailing (or leading) spaces. If you’re looking for the length of the significant value only, use TRIM with CHAR_LENGTH. For example:

CHAR_LENGTH('hello   ')
 -- returns 8; the length of hello followed by 3 spaces
CHAR_LENGTH(TRIM('hello   '))
 -- returns 5; the length of hello

Tip

CHAR_LENGTH returns the number of Latin letters in your character string argument. For example, if your argument is 'Chorizo' and you’re using a Spanish Collation, the second character is ‘h’ – it is not 'o' despite the digraph, because CHAR_LENGTH doesn’t care about the Collation. Thus:

CHAR_LENGTH('Chorizo') COLLATE my.spanish_collation

returns 7, not 6.

[NON-PORTABLE] The precision of CHAR_LENGTH’s result is non-standard because the SQL Standard requires implementors to define the result’s precision. [OCELOT Implementation] The OCELOT DBMS that comes with this book gives the result of CHAR_LENGTH an INTEGER <data type>.

[Obscure Rule] CHAR_LENGTH can also operate on a bit string and a BLOB. We’ve ignored these options for now – look for them in our chapters on bit strings and BLOBs.

If you want to restrict your code to Core SQL, don’t use CHAR_LENGTH with NCLOBs.

<octet length expression>

The required syntax for a <octet length expression> is as follows.

<octet length expression> ::=
OCTET_LENGTH (character_string_argument)

OCTET_LENGTH operates on an argument that evaluates to a character string. It determines the length of the argument, in octets, and returns this as an exact numeric integer, e.g., OCTET_LENGTH('hello') returns 5 (assuming that an 8-bit Character set is in use; the octet length of a string is the bit length divided by 8, ignoring any remainder.) If the argument is NULL, OCTET_LENGTH returns NULL.

Tip

The length of a character string argument depends on the Character set it belongs to. Most Character sets are 8-bit sets, so OCTET_LENGTH would return 1 for each character in your argument. But if you’re using a DBCS, remember that OCTET_LENGTH will allot 2 octets for each character.

Tip

OCTET_LENGTH will return the total length of your character string argument – including any trailing (or leading) spaces. If you’re looking for the length of the significant value only, use TRIM with OCTET_LENGTH. For example:

OCTET_LENGTH('hello   ')
-- returns 8; the length of hello followed by 3 spaces

OCTET_LENGTH(TRIM('hello   '))
-- returns 5; the length of hello

[NON-PORTABLE] The precision of OCTET_LENGTH’s result is non-standard because the SQL Standard requires implementors to define the result’s precision. [OCELOT Implementation] The OCELOT DBMS that comes with this book gives the result of OCTET_LENGTH an INTEGER <data type>.

[Obscure Rule] OCTET_LENGTH can also operate on a bit string and a BLOB. We’ve ignored these options for now – look for them in our chapters on bit strings and BLOBs.

If you want to restrict your code to Core SQL, don’t use OCTET_LENGTH with NCLOBs.

Set functions

SQL provides five set functions that operate on CHAR, VARCHAR, NCHAR and NCHAR VARYING character strings: COUNT, MAX, MIN and GROUPING. SQL also provides three set functions that operate on CLOB and NCLOB character strings: COUNT and GROUPING. Since none of these operate exclusively with character string arguments, we won’t discuss them here; look for them in our chapter on set functions.

Predicates

In addition to the comparison operators, SQL provides ten other predicates that operate on CHAR, VARCHAR, NCHAR and NCHAR VARYING character strings: the <like predicate>, the <similar predicate>, the <between predicate>, the <in predicate>, the <null predicate>, the <exists predicate>, the <unique predicate>, the <match predicate>, the <quantified predicate> and the <distinct predicate>. SQL also provides five predicates that operate on CLOB and NCLOB character strings: the <like predicate>, the <similar predicate>, the <null predicate>, the <exists predicate> and the <quantified predicate>. Each will return a boolean value: either TRUE, FALSE or UNKNOWN. Only the <like predicate> and the <similar predicate> operate strictly on strings; we’ll discuss them here. Look for the rest in our chapter on search conditions.

<like predicate>

The required syntax for a <like predicate> is as follows.

<like predicate> ::=
character_string_argument [ NOT ] LIKE pattern [ ESCAPE escape_character ]

LIKE is a predicate that operates on three operands that evaluate to character strings belonging to the same Character set: it searches for values that contain a given pattern. NOT LIKE is the converse and lets you search for values that don’t contain a given pattern. The character_string_argument is the character string you’re searching within, the “pattern” is the pattern you’re searching for and the optional escape_character is a character that tells your DBMS to treat a metacharacter in the pattern as itself (rather than as a metacharacter). If character_string_argument contains the pattern, LIKE returns TRUE and NOT LIKE returns FALSE. If character_string_argument does not contain the pattern, LIKE returns FALSE and NOT LIKE returns TRUE. If any of the operands are NULL, LIKE and NOT LIKE return UNKNOWN.

The pattern you specify in pattern may contain any combination of regular characters and metacharacters. Any single character in pattern that is not a metacharacter or the escape_character represents itself in the pattern. For example, this predicate:

char_column LIKE 'A'

is TRUE for 'A'.

Special significance is attached to metacharacters in a pattern. The metacharacters are: _ and %. If the predicate doesn’t include an ESCAPE clause, they are interpreted as follows:

  • _ An underline character means “any single character”. For example, the predicate:

    char_column LIKE 'A_C'
    

is TRUE for 'A C', 'AAC', 'ABC', 'AxxxxxxxxC' and so on.

  • % A percent sign means “any string of zero or more characters”. For example, this predicate:

    char_column LIKE 'A%C'
    

is TRUE for 'AC', 'A C', 'AxC', 'AxxxxxxxxC' and so on.

If you want to search for a character that would normally be interpreted as a metacharacter, you must use the optional ESCAPE clause. To do so,

  1. Pick a character that you won’t need in the pattern and designate it as your escape character.

  2. In the pattern, use your escape character followed immediately by the metacharacter, to designate the metacharacter as a character you want to search for. For example:

    ... LIKE 'B$%'
    

    (without an ESCAPE clause) means “like the letter B followed by a dollar sign followed by anything at all”, while:

    ... LIKE 'B$?%' ESCAPE '?'
    

    means “like the letter B followed by a dollar sign followed by a percent sign” (since % is preceded by the escape character it has no special significance in this pattern). Your escape character can also be followed by itself in the pattern, if you want to search for the escape character. For example:

    ... LIKE 'B$??' ESCAPE '?'
    

    means “like the letter B followed by a dollar sign followed by a question mark” (since ? is preceded by the escape character it has no special significance in this pattern). Your best choice for an escape character is an SQL special character which isn’t a [NOT] LIKE metacharacter. We suggest the question mark.

The escape_character must be exactly one character long. If it isn’t, [NOT] LIKE will fail: your DBMS will return the SQLSTATE error 22019 "data exception-invalid escape character". If escape_character is _ or % and that metacharacter is used once only in your pattern, or if escape_character is used without being followed by a metacharacter (or by itself) in your pattern, [NOT] LIKE will fail: your DBMS will return the SQLSTATE error 22025 "data exception-invalid escape sequence". For example, this predicate will result in SQLSTATE 22025:

LIKE 'B%B' ESCAPE '%'

For the purposes of [NOT] LIKE, a substring of character_string_argument is a sequence of zero or more contiguous characters, where each character belongs to exactly one such substring (this includes any trailing spaces in the argument). A substring specifier of pattern is either (a) _: an arbitrary character specifier, (b) %: an arbitrary string specifier, (c) escape_character followed by _ or % or escape_character or (d) any other single character. If character_string_argument and pattern are both variable length character strings with a length of zero, LIKE returns TRUE. LIKE also returns TRUE if “pattern” is found in character_string_argument. That is, LIKE returns TRUE only if the number of substrings in character_string_argument equals the number of substring specifiers in pattern and all of these conditions are also met:

  • If the pattern’s n-th substring specifier is _, then the argument’s n-th substring must be any single character.

  • If the pattern’s n-th substring specifier is %, then the argument’s n-th substring must be any sequence of zero or more characters.

  • If the pattern’s n-th substring specifier is any other character, then the argument’s n-th substring must be equal (in length and character representation) to that substring specifier – without trailing spaces being added to the argument. Note that this means if the pattern is found in the argument, but the lengths don’t match, LIKE returns FALSE. For example, these four predicates all return TRUE:

    'bob' LIKE 'b_b'
    'bob' LIKE 'b%b'
    'bob   ' LIKE 'b_b   '
    'bob   ' LIKE 'b%b   '
    

    But these two predicates return FALSE because of the trailing spaces in character_string_argument that aren’t found in pattern:

    'bob   ' LIKE 'b_b'
    'bob   ' LIKE 'b%b'
    

    And these two predicates return FALSE because of the trailing spaces in pattern that aren’t found in character_string_argument:

    'bob' LIKE 'b_b   '
    'bob' LIKE 'b%b   '
    

Note that this is only a problem with fixed length character string arguments. Here’s a more complete example:

CREATE TABLE Test_Stuffs (
   column_1 CHAR(4));

INSERT INTO Test_Stuffs (column_1)
VALUES ('ABC');
-- actually inserts 'ABC ' (four characters)

SELECT *
FROM   Test_Stuffs
WHERE  column_1 = 'ABC';
-- works because comparisons will pad the shorter argument (assuming the relevant Collation has the PAD SPACE attribute) so the test is WHERE 'ABC ' = 'ABC '

SELECT *
FROM   Test_Stuffs
WHERE  column_1 LIKE '%C';
-- fails because LIKE never pads the shorter argument, no matter what Collation is used, so the test is "find a value of any length that ends in C" -- and 'ABC ' ends in a space, not in C

To get around this, use TRIM to get rid of trailing spaces in your character_string_argument, like this:

SELECT *
FROM   Test_Stuffs
WHERE  TRIM (TRAILING FROM column_1) LIKE '%C';

[Obscure Rule] The result of [NOT] LIKE belongs to the same Character set that its operands do. If you omit the ESCAPE clause, then it has a Collation determined by Table 7-3 “Collating Sequences used for Comparisons”, where character_string_argument is comparand 1 and pattern is comparand 2. If you include the ESCAPE clause, then it also has a Collation determined by Table 7-3 “Collating Sequences used for Comparisons”, where

  • comparand 1 is determined by Table 7-2 “Collating Sequences and Coercibility Rules for Dyadic Operations”, where character_string_argument is operand 1 and pattern is operand 2.

  • comparand 2 is escape_character.

[Obscure Rule] [NOT] LIKE can also operate on BLOBs. We’ve ignored this option for now – look for it in our chapter on BLOBs.

If you want to restrict your code to Core SQL, don’t use the [NOT] LIKE predicate with CLOBs or NCLOBs and, when you do use [NOT LIKE], make sure your character_string_argument is a <Column reference> and that your pattern and your escape_character are both <value specification>s.

<similar predicate>

The required syntax for a <similar predicate> is as follows.

<similar predicate> ::=
character_string_argument [ NOT ] SIMILAR TO pattern
[ ESCAPE escape_character ]

SIMILAR is a predicate that operates on three operands that evaluate to character strings belonging to the same Character set. It works much like Unix’s grep: it searches for values that contain a given pattern. NOT SIMILAR is the converse and lets you search for values that don’t contain a given pattern. The character_string_argument is the character string you’re searching within, the pattern is the pattern you’re searching for and the optional escape_character is a character that tells your DBMS to treat a metacharacter in the pattern as itself (rather than as a metacharacter). If character_string_argument contains the pattern, SIMILAR returns TRUE and NOT SIMILAR returns FALSE. If character_string_argument does not contain the pattern, SIMILAR returns ``FALSE and NOT SIMILAR returns TRUE. If any of the operands are NULL, SIMILAR and NOT SIMILAR return UNKNOWN.

The pattern you specify in pattern must be a regular expression: a sequence of ordinary characters combined with some special characters (or metacharacters). It may contain character ranges, repetitions and combinations. Any single character in pattern that is not a metacharacter or the escape_character represents itself in the pattern. For example, this predicate,

char_column SIMILAR TO 'A'

is TRUE for 'A'.

Special significance is attached to metacharacters in a pattern. The metacharacters are: _ and % and * and + and | and ( and ) and [ and ] and ^ and - and :. If the predicate doesn’t include an ESCAPE clause, they are interpreted as follows:

  • _ An underline character means “any single character”. For example, this predicate:

    char_column SIMILAR TO 'A_C'
    

    is TRUE for 'A C', 'AAC', 'ABC', 'A#C' and so on.

  • % A percent sign means “any string of zero or more characters”. For example, this predicate:

    char_column SIMILAR TO 'A%C'
    

    is TRUE for 'AC', 'A C', 'AxC', 'AxxxxxxxxC' and so on.

  • “*” An asterisk means “preceding repeats indefinitely” (from zero to infinity times). For example, this predicate:

    char_column SIMILAR TO 'A*'
    

    is TRUE for 'A', 'AA', 'AAA', 'AAAA' and so on.

  • “+” A plus sign means “preceding repeats indefinitely” (from one to infinity times). For example, this predicate:

    char_column SIMILAR TO 'A+'
    

    is TRUE for 'A', 'AA', 'AAA', 'AAAA' and so on.

  • [ ] Brackets are used for character enumeration in the pattern. There are two ways to enumerate: as a simple list or with a minus sign, with the result that a match is made with any one of the characters inside the brackets. For example, this predicate:

    char_column SIMILAR TO '[A]'
    

    is TRUE for 'A'. This predicate

    char_column SIMILAR TO '[AQZ]'
    

    is TRUE for 'A' or 'Q' or 'Z'. This predicate:

    char_column SIMILAR TO '[A-E]'
    

    is TRUE for 'A' or 'B' or 'C' or 'D' or 'E'. And this predicate:

    char_column SIMILAR TO '[A-EQ-S]'
    

    is TRUE for 'A' or 'B' or 'C' or 'D' or 'E' or 'Q' or 'R' or 'S'.

  • [^ ] A circumflex inside enumerating brackets means negative enumeration. The options are the same as for ordinary enumeration, with a negated meaning. For example, this predicate:

    char_column SIMILAR TO '[^A-C]'
    

    is TRUE for anything not equal to 'A' or to 'B' or to 'C'. This predicate:

    char_column SIMILAR TO '[^AQZ]'
    

    is TRUE for anything not equal to 'A' or 'Q' or 'Z'. And this predicate:

    'ABCDE' SIMILAR TO '[^C-F]'
    

    is FALSE, since the last character in the character string argument must not be 'C' or 'D' or 'E' or 'F'.

  • [: :] Brackets containing colons surrounding one of: ALPHA, UPPER, LOWER, DIGIT or ALNUM are used for set enumeration in the pattern. For example, this predicate:

    char_column SIMILAR TO '[:ALPHA:]'
    

    is TRUE for values of char_column that are equal to any simple Latin letter, i.e.: to any of ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz. This predicate:

    char_column SIMILAR TO '[:UPPER:']
    

    is TRUE for values of char_column that are equal to any simple Latin upper case letter, i.e.: to any of ABCDEFGHIJKLMNOPQRSTUVWXYZ. This predicate:

    char_column SIMILAR TO '[:LOWER:]'
    

    is TRUE for values of char_column that are equal to any simple Latin lower case letter, i.e.: to any of abcdefghijklmnopqrstuvwxyzv. This predicate:

    char_column SIMILAR TO '[:DIGIT:]'
    

    is TRUE for values of char_column that are equal to any digit, i.e.: to any of 0123456789. And this predicate:

    char_column SIMILAR TO '[:ALNUM:]'
    

    is TRUE for values of char_column that are equal to any simple Latin letter or to any digit, i.e.: to any of ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789.

  • | The vertical bar means “the logical OR of the first and second expressions”. For example, this predicate,

    char_column SIMILAR TO '[A-C']|[:DIGIT:]'
    

    is TRUE for any of ABC0123456789. The | operator has a lower priority than * and + have.

  • || The concatenation operator means “concatenate one element from first expression with one element from second expression”. For example, this predicate:

    char_column SIMILAR TO '[A-C]||[:DIGIT:]'
    

    is TRUE for: 'A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8' and 'C9'.

  • ( ) Parentheses in a pattern force the order of evaluation, in the usual way. For example, this predicate:

    char_column SIMILAR TO '[:UPPER:]|([:DIGIT:][:DIGIT:])'
    

    is TRUE for any single upper case letter, or for any two digits.

  • If pattern is not a valid pattern, [NOT] SIMILAR will fail: your DBMS will return the SQLSTATE error 2201B "data exception-invalid regular expression". Here are two examples of invalid patterns:

    '^[:UPER:]'
    '[:abc:]'
    

If you want to search for a character that would normally be interpreted as a metacharacter, you must use the optional ESCAPE clause. To do so:

  1. Pick a character that you won’t need in the pattern and designate it as your escape character.

  2. In the pattern, use your escape character followed immediately by the metacharacter, to designate it as a character you want to search for. For example:

    ... SIMILAR TO 'B$%'
    

    (without an ESCAPE clause) means “similar to the letter B followed by a dollar sign followed by anything at all”, while:

    ... SIMILAR TO 'B$?%' ESCAPE '?'
    

    means “similar to the letter B followed by a dollar sign followed by a percent sign” (since % is preceded by the escape character it has no special significance in this pattern). Your escape character can also be followed by itself in the pattern, if you want to search for the escape character. For example:

    ... SIMILAR TO 'B$??' ESCAPE '?'
    

    means “similar to the letter B followed by a dollar sign followed by a question mark” (since ? is preceded by the escape character it has no special significance in this pattern). Your best choice for an escape character is an SQL special character which isn’t a [NOT] SIMILAR metacharacter. We suggest the question mark.

The escape_character must be exactly one character long. If it isn’t, [NOT] SIMILAR will fail: your DBMS will return the SQLSTATE error 22019 "data exception-invalid escape character". If escape_character is [ or ] or ( or ) or | or ^ or - or + or * or _ or % and that metacharacter is used once only in your pattern, or if escape_character is used without being followed by a metacharacter (or itself) in your pattern, [NOT] SIMILAR will fail: your DBMS will return the SQLSTATE error 2200C "data exception-invalid use of escape character". For example, this predicate will result in SQLSTATE 2200C:

SIMILAR TO 'B?B' ESCAPE '?'

If escape_character is a colon and your pattern contains that metacharacter surrounding one of: ALPHA, UPPER, LOWER, DIGIT or ALNUM, [NOT] SIMILAR will fail: your DBMS will return the SQLSTATE error 2200B "data exception-escape character conflict".

[Obscure Rule] The result of [NOT] SIMILAR belongs to the same Character set that its operands do. If you omit the ESCAPE clause, then it has a Collation determined by Table 7-3 “Collating Sequences used for Comparisons”, where character_string_argument is comparand 1 and pattern is comparand 2. If you include the ESCAPE clause, then it also has a Collation determined by Table 7-3 “Collating Sequences used for Comparisons”, where:

  • comparand 1 is determined by Table 7-2 “Collating Sequences and Coercibility Rules for Dyadic Operations”, where character_string_argument is operand 1 and pattern is operand 2.

  • comparand 2 is “escape_character”.

If you want to restrict your code to Core SQL, don’t use the [NOT] SIMILAR predicate.

Common checks

Although [NOT] SIMILAR is not terribly useful in WHERE clauses (it’s too inefficient) it is great for CHECK clauses. Here are some real-world examples of strings which have rigid format specifications. [NOT] SIMILAR is appropriate for making sure the strings meet the specifications.

Postal Codes

These strings must be “letter digit letter space digit letter digit” – and the letters must be upper case simple Latin letters, e.g.: 'T5E 1G7', 'V1K 4K0'. To make sure your data fits these requirements, use a simple Domain Constraint:

ALTER DOMAIN postal_code
ADD CONSTRAINT postal_code_specs
CHECK (VALUE SIMILAR TO
  '[:UPPER:][DIGIT:][:UPPER] [:DIGIT]:[UPPER:][:DIGIT:]');

Periodic Table Symbols

These strings are either a single upper case simple Latin letter capital letter (e.g.: ‘H’, ‘O’) or one upper case and one lower case letter (E.G.: ‘Al’, ‘Fe’). To make sure your data fits these requirements, use another simple Domain Constraint:

CREATE DOMAIN periodic_table_element CHAR(2)
CHECK (VALUE SIMILAR TO '[A-Z]|([A-Z][a-z])');

North American Telephone Numbers

These strings must be “digit digit digit minus-sign digit digit digit digit”; optionally preceded by “left-parenthesis digit digit digit right-parenthesis” (e.g.: ‘498-1234’ or ‘(604)498-1234’). This is a hard one: the string includes both an optional format and a special character that needs “escaping”. To make sure your data fits these requirements, use a Table Constraint:

CREATE TABLE Table_1 (
phone_number CHAR(13),
CHECK (phone_number SIMILAR TO
   '([:DIGIT:][:DIGIT:][:DIGIT:]?-[:DIGIT:][:DIGIT:][:DIGIT:][:DIGIT:])
    |
    (?([:DIGIT:][:DIGIT:][:DIGIT:]?)[:DIGIT:][:DIGIT:][:DIGIT:]?-
    [:DIGIT:][:DIGIT:][:DIGIT:][:DIGIT:])' ESCAPE '?');

(This example is shown on multiple lines for clarity; in reality the string may not contain carriage returns, nor would the “pattern”).

[NOT] LIKE <B+>or [NOT] SIMILAR?

[NOT] LIKE and [NOT] SIMILAR are both pattern-matching predicates. You should continue to use [NOT] LIKE if the pattern contains only _ and % wildcards – although [NOT] SIMILAR can use these wildcards too, there is no advantage in using an SQL3 expression when an SQL-92 expression will do. For more complex patterns, your choice is between [NOT] SIMILAR and nothing. When you make the switchover, remember that there are subtle differences between [NOT] SIMILAR and [NOT] LIKE, in the way that collating sequences are handled (which affects which characters are regarded as “equal” and whether there are pad spaces at the end of a string).

Character Strings and Character Sets

[Obscure Rule] applies for this entire section.

In the last chapter, we made the observation that a computer character set has two parts: a character repertoire and an agreement on how the repertoire’s characters will be encoded as numbers. SQL has a similar rule: An SQL Character set is a combination of two things:

  1. A character repertoire: the set of characters that belong to the Character set.

  2. A Form-of-use: the repertoire’s encoding scheme – the one-to-one mapping scheme between each character in the repertoire and a set of internal codes (usually 8-bit values) that define how the repertoire’s characters are encoded as numbers. (These codes are also used to specify the order of the characters within the repertoire.)

All SQL character strings belong to some Character set. Whenever you’re working with an SQL character string, you may either specify the Character set it belongs to, or allow it to belong to a default Character set chosen by your DBMS. (To simplify matters, we recommend that you always follow the latter course. This will ensure that you get standard results across SQL-sessions.)

To explicitly specify a Character set for a character string, add a CHARACTER SET clause to a <data type> specification and/or _<Character set name> to a <literal>, as shown in the appropriate syntax diagrams in this chapter. Your current <AuthorizationID> must have the USAGE Privilege for the Character set named.

If you choose not to specify a Character set for a character string, the current default Character set is implicit. Your DBMS will choose the current default Character set using these rules:

  1. A character string <data type> specification (in CREATE SCHEMA, CREATE TABLE, CREATE DOMAIN, ALTER TABLE and ALTER DOMAIN) that doesn’t include an explicit CHARACTER SET clause is treated as if the default Character set of the Schema it’s defined in was explicitly named.

    [NON-PORTABLE] In any operation other than defining a Domain, defining a Column or defining a Field (e.g.: in a CAST operation), a character string <data type> specification that doesn’t include a CHARACTER SET clause will be treated as if it belongs to a Character set that is non-standard because the SQL Standard requires implementors to define what the operation’s default Character set is. [OCELOT Implementation] The OCELOT DBMS that comes with this book uses ISO8BIT – the DBMS’s initial default Character set – as the default Character set for such operations.

  2. Any other character string value that doesn’t include an explicit Character set specification must either consist only of <SQL language character>s or the value’s Character set defaults to (a) the default Character set of the Schema, if it’s found in a CREATE SCHEMA statement, (b) the default Character set of the SQL-session, if it’s found in a dynamic SQL statement or (c) the default Character set of the Module you’re running, if it’s found inany other SQL statement in a Module.

Every Character set has at least one Collation – its default Collation. You may define additional Collations for any Character set.

If you want to restrict your code to Core SQL, don’t explicitly define the Character set that any character string belongs to – always allow it to belong to the default Character set.

Character Strings and Collations

[Obscure Rule] applies for this entire section.

A Collation, or collating sequence, is a set of rules that determines the result when character strings are compared. The result of any character string comparison thus depends on the Collation used – we’ll call this the relevant Collation. Different Collations might result in different comparison results for the same two strings, e.g.: a case-sensitive Collation will determine that the letter “A” and the letter “a” are not equal, but a case-insensitive Collation will determine that “A” and “a” are equal.

Whenever you’re comparing an SQL character string, you may either specify the relevant Collation, or allow the comparison to be governed by a default Collation chosen by your DBMS. (To simplify matters, we recommend that you always follow the latter course. This will ensure that you get standard results across SQL-sessions.)

To explicitly specify a Collation for a comparison, add a COLLATE clause to your character string, as shown in the appropriate syntax diagrams in chapter. The Collation you name must either (a) be the default Collation for the relevant Character set or (b) be defined as a Collation for the relevant Character set by some CREATE COLLATION statement. If you’re using COLLATE in an SQL-Schema statement, then the <AuthorizationID> that owns the containing Schema must have the USAGE Privilege on “<Collation name>”. If you’re using COLLATE in any other SQL statement, then your current <AuthorizationID> must have the USAGE Privilege on “<Collation name>”.

If you choose not to specify a Collation for a comparison, the current default Collation is implicit. Your DBMS will choose the current default Collation for a character string using these rules.

First, to choose a character string’s default Collation:

  • If a character string <data type> specification doesn’t include a COLLATE clause but does include a CHARACTER SET clause, the default Collation for that <data type>’s values is the default Collation of the Character set named.

  • If any other character string value doesn’t include a COLLATE clause, the default Collation for that value is the default Collation of the value’s Character set.

Second, to choose one of the comparand’s default Collations for the comparison:

  • Expressions that involve only non-Columns (i.e.: a <literal>, host language variable, parameter or expression result) are compared using the default Collation for the character string values’ mutual Character set.

  • Expressions that involve both Columns (i.e.: <Column name>s or <Column reference>s) and non-Columns are compared using the Column(s)’ mutual default Collation. If you want to compare values from multiple Columns with different default Collations, you must include a COLLATE clause in your expression.

Table 7-1 shows how the collating sequence and coercibility attribute are determined for the result of a monadic operation.

Table 7.1 Collating sequences and coercibility rules for monadic operations

OPERAND’S COERCIBILITY ATRIBUTE

OPERAND’S COLLATION

RESULT’S COERCIBILITY ATTRIBUTE

RESULT’S COLLATION

Coercible

Default

Coercible

Default

Implicit

X

Implicit

X

No Collation

None

No Collation

None

Explicit

X

Explicit

X

Table 7-2 shows how the collating sequence and coercibility attribute are determined for the result of a dyadic operation.

Table 7-3 shows how the collating sequence is determined for a particular comparison.

OPERAND_1’S COERCIBILITY ATRIBUTE

OPERAND_1’S COLLATION

OPERAND_2’S COERCIBILITY ATTRIBUTE

OPERAND_2’S COLLATION

RESULT’S COERBILITY ATRIBUTE

RESULT’S COLLATION

Coercible

Default

Coercible

Default

Coercible

Default

Coercible

Default

Implicit

X

Implicit

X

Coercible

Default

No Collation

None

No Collation

None

Coercible

Default

Explicit

X

Explicit

X

Implicit

X

Coercible

Default

Implicit

X

Implicit

X

Implicit

X

Implicit

X

Implicit

X

Implicit

Y<>X

No Collation

None

Implicit

X

No Collation

None

No Collation

None

Implicit

X

Explicit

Y

Explicit

Y

No Collation

None

Coercible

Default

No Collation

None

No Collation

None

Implicit

X

No Collation

None

No Collation

None

No Collation

None

No Collation

None

No Collation

None

Explicit

X

Explicit

X

Explicit

X

Coercible

Default

Explicit

X

Explicit

X

Implicit

Y

Explicit

X

Explicit

X

No Collation

None

Explicit

X

Explicit

X

Explicit

X

Explicit

X

Explicit

X

Explicit

Y<>X

invalid syntax

Table 7.3 shows how the collation sequence is determined for a particular comparison.

Table 7,3 Collating sequences used for comparison

OPERAND_1´S COERCIBILITY ATRIBUTE

OPERAND_1´S COLLATION

OPERAND_2´S COERCIBILITY ATTRIBUTE

OPERAND_2´S COLLATION

COLLATION USE FOR COMPARISON

Coercible

Default

Coercible

Default

Default

Coercible

Default

Implicit

X

X

Coercible

Default

No Collation

None

invalid syntax

Coercible

Default

Explicit

X

X

Implicit

X

Coercible

Default

X

Implicit

X

Implicit

X

X

Implicit

X

Implicit

Y<>X

invalid syntax

Implicit

X

No Collation

None

invalid syntax

Implicit

X

Explicit

Y

Y

No Collation

None

Coercible

Default

invalid syntax

No Collation

None

Implicit

X

invalid syntax

No Collation

None

No Collation

None

invalid syntax

No Collation

None

Explicit

X

X

Explicit

X

Coercible

Default

X

Explicit

X

Implicit

Y

X

Explicit

X

No Collation

None

X

Explicit

X

Explicit

X

X

Explicit

X

Explicit

Y<>X

invalid syntax

Note

For expressions involving more than two comparands, the collating sequnce is effectively determined on a cumulative basis; the result for the first two comparands becomes comparand_1 for the next comparison, the result for this becomes comparand_1 for the comparison after that, and so on.

Dialects

The “typical” SQL DBMS supports most of the standard character data types, but often uses preferred local names. For example, Oracle has a 2000-byte maximum (2048 for Oracle Lite) for the CHAR <data type> and offers a (non-standard) LONG VARCHAR type to define larger character string fields.

Sybase allows for a large variety of Character sets, with only one “group” (Character set) allowed at a time. Baltic languages are in the East European group (8859-2), though it doesn’t seem possible to get a correct result in this case. Collations supported are: English+French+German (all together!), Spanish, Hungarian, Russian; then everything else is binary. Sybase does not support SQL CHARACTER SETs, COLLATIONs or TRANSLATIONs, nor does it support CONVERT – to convert you need an offline utility. It does provide some Unicode support.