Wednesday, December 19, 2007

STRINGS IN C, A BRIEF REVIEW

A C-style string is defined as an array of characters that terminates in the null character. For example, in a C program the following would declare a string variable str with a storage allocation of 6 characters, the last character being reserved for the terminating null character:

char str[6];

If we wish to also initialize a string variable at the time it is declared, we can do so by

char str[5 + 1] = "hello";

or by

char str[] = "hello"; /* (A) */

where we have omitted the length of the array str. The double-quoted string of characters on the right-hand side, "hello", is called a string literal.[2] A string literal is a string constant, very much like the number 5 is an integer constant. Since a string literal is stored as an array of chars, the compiler represents it in most expressions by the memory address of the first character, in the above case the address of the character h. More precisely, the literal "hello" has the array type char[6] in C and const char[6] in C++, and this array is converted to a pointer to its first element (char* or const char*, respectively) wherever a pointer is expected.

We can also use a character pointer directly to represent a string, as in

char* str = "hello"; /* (B) */

which causes the address of the first character, h, of the string literal "hello" to be stored in the pointer variable str. Note that the declaration in (B) gives you direct access to the block of memory, that is read-only, in which the string literal is stored. On the other hand, the declaration in (A) copies the string literal from wherever it is stored into the designated array.

While we may declare a string variable to be an array of characters, as in the definition in line (A) above, or to be a character pointer, as in the definition in line (B), the two versions are not always interchangeable. In the array version, the individual characters can be modified, as would be the case with an array in general. However, with the pointer version, the individual characters of the string must not be changed, because a string literal is typically stored in a read-only section of the memory and any attempt to modify it has undefined behavior. The statement shown in line (B) is nevertheless legal in C, where a string literal converts to char*. C++ historically permitted the same initialization as a deprecated conversion from const char* to char*, which modern C++ compilers flag with a warning or reject. So whereas the pointer str in line (B) is of type char*, it is pointing to a block of read-only memory in which the string literal itself is stored.[3] For another difference between the string definitions in lines (A) and (B), the identifier str in the array version is the name of an array; it cannot be assigned to, because an array name is not a modifiable lvalue. On the other hand, in the pointer version in line (B), str is a pointer variable that, during program execution, could be given any value of type char*.

We will now review briefly the frequently used functions in C that are provided by the string.h header file for performing operations on strings. These include strcmp whose prototype is given by

int strcmp( const char* arg1, const char* arg2 );

for comparing two strings that are supplied to it as arg1 and arg2. It returns a value less than, equal to, or greater than 0 depending on whether arg1 is less than, equal to, or greater than arg2. Typically, ASCII character sets are used and strings are compared using the ASCII integer codes associated with the characters. For example, the following inequality is true for the one-character strings shown

strcmp( "A", "a" ) < 0

because the ASCII code for the character A is 65, whereas the ASCII code for a is 97, making the string literal "A" less than the string literal "a". Given this character-by-character comparison on the basis of ASCII codes, longer strings are compared using lexicographic ordering, an ordering that is akin to how words are arranged in a dictionary. For example, in lexicographic ordering, the string "abs" will occur before the string "absent", so the former is less than the latter. However, the string "Zebra" will occur before the string "debra", because the ASCII codes for all uppercase letters, A through Z, occupy the range 65 through 90, whereas the codes for lowercase letters, a through z, occupy the range 97 through 122.
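
The ordering rules just described can be checked with a small helper. The following is a minimal sketch, written in C++ against the same C library function through the cstring header; the name compareSign is hypothetical and not part of any library. It reduces the result of strcmp to -1, 0, or +1, since only the sign of the returned value is specified, and it assumes an ASCII-based character set as in the discussion above.

```cpp
#include <cstring>

// Hypothetical helper: reduces the result of strcmp to -1, 0, or +1.
// Only the sign of strcmp's return value is specified, not its magnitude.
int compareSign( const char* a, const char* b ) {
    int r = std::strcmp( a, b );
    return ( r > 0 ) - ( r < 0 );
}
```

With this helper, comparing "abs" with "absent", or "Zebra" with "debra", yields -1 in both cases, in agreement with the lexicographic ordering explained above.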

Another frequently used string function from the string.h header file is the strlen function for ascertaining the length of a string. This function has the following prototype:

size_t strlen( const char* arg );

where the return type, size_t, defined in the header file stddef.h, is usually either unsigned int or unsigned long int. For practically all cases, we can simply think of the value returned by strlen as an integer. To illustrate,

strlen( "hello" )

returns 5. Note that the integer count returned by strlen does not include the terminating null character.
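
Because strlen does not count the terminating null character, the storage needed for a copy of a string is always one more than its length. A minimal sketch of this arithmetic follows; the helper name storageNeeded is an assumption made for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: number of chars needed to hold a copy of s,
// including the terminating null character that strlen leaves out.
std::size_t storageNeeded( const char* s ) {
    return std::strlen( s ) + 1;
}
```

For an array declared as char str[] = "hello", the value of sizeof( str ) agrees with storageNeeded( str ): both are 6, whereas strlen( str ) is 5.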

Another very useful C function for dealing with strings is

char* strcpy( char* arg1, const char* arg2 );

which copies the characters from the string arg2 into the memory locations pointed to by arg1. For illustration, we could say

char str1[6];
char* str2 = "hello";
strcpy( str1, str2 );

or, using the C memory allocation function malloc(),

char* str1 = (char*) malloc( 6 );
char* str2 = "hello";
strcpy( str1, str2 );

In both cases above, the string "hello" will be copied into the memory locations pointed to by the character pointer str1. The function strcpy() returns the pointer that is its first argument. However, in most programming, the value returned by strcpy() is ignored. The returned value can be useful in nested calls to this function [45, p. 252].

When one wants to join two strings together, the following function from the string.h header comes in handy:

char* strcat( char* arg1, const char* arg2 );

This function appends the string pointed to by arg2 to the string pointed to by arg1. For example,

char str1[8];
strcpy( str1, "hi" );
strcat( str1, "there" );

will cause the string "hithere" to be stored at the memory locations pointed to by str1. As with the strcpy() function, the string concatenation function returns the pointer that is its first argument. As before, the returned value is usually ignored.

[2]The initialization syntax shown at (A) copies over the string literal stored in a read-only section of the memory into the array. Therefore, effectively, the declaration shown at (A) is equivalent to

char str[] = { 'h', 'e', 'l', 'l', 'o', '\0' };


[3]Some C and C++ compilers do allow a string literal to be modified through a pointer to which the string literal is assigned. For example, the following will work with some compilers:

char* str = "hello";
*str = 'j';

But modifying a string literal through a pointer in this manner has undefined behavior: the code is non-portable and may crash the program. If you must modify a string literal, it is best to first copy it into an array that is stored at a location different from where the string literal itself is stored, as in

char str[] = "hello";
str[0] = 'j';

Treating string literals as constant allows for code optimization, such as storing only one copy of each literal.

SOME COMMON SHORTCOMINGS OF C-STYLE STRINGS
C-style strings can be painful to use, especially after you have seen the more modern representations of strings in other languages. For starters, when invoking some of the most commonly used string library functions in C, such as strcpy(), strcat(), and so on, you have to ensure that sufficient memory is allocated for the output string. This requirement, seemingly natural to those who do most of their programming in C, appears onerous after you have experienced the convenience of the modern string types.

Consider this sample code for the string type from the C++ Standard Library:

string str1 = "hi";
string str2 = "there";
string str3;
str3 = str1 + str2;

We are joining the strings str1 and str2 together and copying the resulting string into the string object str3. Using the operator + for joining two strings together seems very natural. More particularly, note that we do not worry about whether or not we have allocated sufficient memory for the new longer string. The system automatically ensures that the string object str3 has sufficient memory available to it for storing the new string, regardless of its length.

Now compare the above code fragment with the following fragment that tries to do the same thing but with C-style strings using commonly used functions for string processing in C:

char* str1 = "hi";
char* str2 = "there";
char* str3 = (char*) malloc( strlen( str1 ) + strlen( str2 ) + 1 );
strcpy( str3, str1 );
strcat( str3, str2 );

The syntax here is definitely more tortured. A visual examination of the code, if too hasty, can be confusing with regard to the purpose of the code. You have to remind yourself about the roles of the functions strcpy and strcat to comprehend what's going on. You also have to remember to allocate memory for str3—forgetting to do so is not as uncommon as one might like to think. What's worse, for proper memory allocation for str3 you have to remember to add 1 for the null terminator to the byte count obtained by adding the values returned by strlen for the strings str1 and str2. (Just imagine the disastrous consequences if you should forget!)
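
One way to contain the bookkeeping described above is to keep the length arithmetic and the allocation in a single place. The following is a minimal sketch under that assumption; the name concatCopy is hypothetical, the function is written in C++ against the C library headers cstdlib and cstring, and the caller remains responsible for releasing the result with free().

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns a newly allocated string holding a followed
// by b, or a null pointer if the allocation fails. The caller must free()
// the returned pointer.
char* concatCopy( const char* a, const char* b ) {
    std::size_t lenA = std::strlen( a );
    std::size_t lenB = std::strlen( b );
    // The extra 1 makes room for the terminating null character.
    char* result = static_cast<char*>( std::malloc( lenA + lenB + 1 ) );
    if ( result == nullptr ) return nullptr;
    std::memcpy( result, a, lenA );
    std::memcpy( result + lenA, b, lenB + 1 );   // also copies b's '\0'
    return result;
}
```

Even with such a helper, the burden of calling free() at the right time stays with the programmer, which is one of the chores that the C++ string type removes.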

For another example of the low-level tedium involved and the potential for introducing bugs when using C-style strings, consider the following function:

void strip( char* q ) {
char* p = q + strlen( q ) - 1; //(A)
while ( p >= q && *p == ' ' ) //(B)
*p-- = '\0'; //(C)
}

which could be used to strip off blank space at the trailing end of a string. So in a call such as

char* str = (char*) malloc( 11 ); /* 5 letters + 5 blanks + null */
strcpy( str, "hello     " );
strip( str );

the function strip would erase the five blank space characters after "hello" in the string str. Going back to the definition of strip, in line (A) we first set the local pointer p to point to the last character in the string. In line (B), we dereference this pointer to make sure that a blank space is stored there and that we have not yet traversed all the way back to the beginning of the string. If both these conditions are satisfied, in line (C) we dereference the pointer again, setting its value equal to the null character, and subsequently decrement the pointer.[4] If someone were to write in a hurry the implementation code for strip, it is not inconceivable that they'd write it in the following form:

void strip( char* q ) {
char* p = q + strlen( q ) - 1;
while ( *p == ' ' ) //(D)
*p-- = '\0';
}

where in line (D) we have forgotten to make sure that the local pointer p does not get decremented to a value before the start of the argument string. While this program would compile fine and would probably also give correct results much of the time, it could also exhibit unpredictable behavior. In programs such as this, one could also potentially forget to dereference a string pointer, resulting in programs that would compile all right but not run without crashing.
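
For comparison, here is a minimal sketch of the same operation written with an index rather than a pointer that walks backward. The name stripTrailingBlanks is hypothetical. The bounds test is placed before the character test, so nothing before the start of the string is ever read, and an empty string or a string consisting only of blanks is handled without special cases.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical variant of strip: removes trailing blanks from the
// null-terminated string q, in place.
void stripTrailingBlanks( char* q ) {
    std::size_t n = std::strlen( q );
    while ( n > 0 && q[ n - 1 ] == ' ' )   // bounds check comes first
        q[ --n ] = '\0';
}
```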

[4]Recall from C programming that the unary postfix decrement operator, ‘--’, has a higher precedence than the indirection operator ‘*’. So the expression *p-- in line (C) is parsed as *(p--). But because the decrement operator is postfix, the expression p-- evaluates to the value that p had before the decrement. Therefore, what gets dereferenced is the original p. It is only after the evaluation of the expression that p is decremented by the postfix decrement operator.


C++ STRINGS
The C++ Standard Library provides a type string that avoids the pitfalls of C-style strings.[5] Since many aspects of this type cannot be fully explained until we discuss concepts such as operator overloading, our goal in this section is limited to familiarizing the reader with some rudimentary aspects of this type to the extent that we can use it in some of the examples in this and later chapters. To use the C++ string type, you must include its associated header file:

#include <string>

4.3.1 Constructing a C++ String Object
To declare a string with initialization, we can say

string str( "hi there");

which is a call to the constructor of the string class with "hi there" as its const char* argument. An alternative way to initialize a string is

string str = "hi there";

We can also use the following syntax:

string str = string( "hi there" );

We can think of the right-hand side here as constructing an anonymous string object that is then used to initialize the variable str through what's known as the copy constructor for the string class.[6] We also have the option of invoking the new operator to obtain a pointer to a string object:

string* p = new string( "hi there" );

An empty string can be declared in the following manner[7]

string str;

or as

string str = "";

These declarations create an object of class string whose name is str.[8] A principal feature of this object is that it stores inside it the string of characters specified by the initialization syntax. The stored string may or may not be null terminated. If in a particular implementation of C++, the string is not null terminated, that does not create a problem because also stored in the object is the exact length of the string. So there is never any question about how many memory locations would need to be accessed in order to read an entire string.

While the string constructor invocations illustrated above show us how to convert a const char* string into a string object, what about the opposite? How does one convert a C++ string object back into a C-style null-terminated string? This is done by invoking the c_str() member function for the string class:

string str( "hello" );
const char* c_string = str.c_str();
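
A typical reason for calling c_str() is to hand a C++ string to a function that was written for C-style strings. The following is a minimal sketch; the helper name startsWithC is hypothetical, and strncmp from the cstring header merely stands in for any C function that expects a const char* argument.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: passes a C++ string to a C library function that
// expects a null-terminated const char* argument.
bool startsWithC( const std::string& s, const char* prefix ) {
    return std::strncmp( s.c_str(), prefix, std::strlen( prefix ) ) == 0;
}
```

The pointer returned by c_str() should be used promptly, since it stays valid only as long as the string object exists and is not modified.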

4.3.2 Accessing Individual Characters
The individual characters of a C++ string can be accessed for reading and writing by using either the subscript operator ‘[]’ or the member function at(). The former is not range checked, while the latter is. What that means is that if you try to access a character position that does not really exist, what you get with the subscript operator is unpredictable; formally, the behavior is undefined. On the other hand, if you try to access a nonexistent position with the at() function, the function throws an out_of_range exception and, unless that exception is caught, the program aborts. This is illustrated by the following program where we have commented out the line in which we invoke the at() function with an argument that is clearly outside the range valid for the "hello" string. If you uncomment this line, the program will abort at run time when the flow of control reaches that line. On the other hand, when we try to reach the same index with the subscript operator, we may simply obtain some garbage character.

--------------------------------------------------------------------------------
// StringCharIndexing.cc

#include <string>
using namespace std;

int main()
{
string str( "hello" );
char ch = str[0]; // ch initialized to 'h'
str[0] = 'j'; // str now equals "jello"
ch = str.at( 0 ); // ch's value is now 'j'
str.at( 0 ) = 'h'; // str again equals "hello"
ch = str[ 1000 ]; // out of range: garbage value for ch
// ch = str.at( 1000 ); // program aborts if uncommented
return 0 ;
}
--------------------------------------------------------------------------------
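
Since the range check performed by at() reports a failure by throwing an exception of type out_of_range, declared in the stdexcept header, a program can catch the exception instead of aborting. The following is a minimal sketch of that idea; the function name charAtOr and its fallback parameter are assumptions made for illustration.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the character at index i, or fallback
// if i does not designate a character of s.
char charAtOr( const std::string& s, std::string::size_type i, char fallback ) {
    try {
        return s.at( i );                  // range checked
    } catch ( const std::out_of_range& ) {
        return fallback;                   // index was outside the string
    }
}
```

A call such as charAtOr( str, 1000, '?' ) on the "hello" string then yields the fallback character rather than terminating the program.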

4.3.3 String Comparison
Two strings can be compared for equality (or inequality) on the basis of the ASCII codes associated with the characters using the binary operators ‘==’, ‘!=’, ‘>’, ‘>=’, ‘<’, and ‘<=’. Two strings are equal if and only if they are composed of identical character sequences. A string is less than another string if the former occurs earlier in a lexicographic ordering of the strings on the basis of the ASCII codes associated with the characters.

While the operators listed above are all two-valued, in the sense that they return either true or false, sometimes it is more useful to employ a 3-valued comparison function, compare(), that is defined for the string class. Given two string objects str1 and str2, the invocation

str1.compare( str2 );

returns one of three possible values:

a positive value if str1 is greater than str2

0 if str1 is equal to str2

a negative value if str1 is less than str2

For example,

string str1( "abc" );
string str2( "abc123" );
if ( str1.compare( str2 ) == 0 ) // test returns false
.....
if ( str1.compare( str2 ) < 0 ) // test returns true
.....
if ( str1.compare( str2 ) > 0 ) // test returns false
....

It is also possible to invoke compare with additional arguments that designate which portion of the invoking string and which portion of the argument string take part in the comparison. In the standard string class, the position and the character count for the invoking string come first, followed by the argument string and then by the position and the character count for the argument string. In the following example, "hello" is the string that invokes compare on the argument string "ellolotion" in line (A). The first two arguments to compare in line (A), 1 and 4, designate the index at which to start the character comparisons in the string "hello" and the number of characters to use from it. This means that the string comparison will begin at the letter ‘e’ of "hello". The last two arguments, 0 and 4, designate the starting index and the number of characters from the string "ellolotion" that will be used for string comparison.

string str1("hello");
string str2("ellolotion");
if ( str1.compare( 1, 4, str2, 0, 4 ) == 0 ) //(A)
cout << "\nThe substring starting at index 1 "
"of 'hello' is the same as the first "
"four chars of 'ellolotion'."
<< endl; //(B)
else
cout << "The compare test failed" << endl;

For the example code shown, the comparison test in line (A) returns true and the message in the statement leading up to line (B) is printed out.

In the five-argument version of compare shown in line (A) above, the position and count arguments are all of type string::size_type,[9] an unsigned integer type that for most practical purposes can be treated as an ordinary integer. There is also a three-argument version of compare in which the first two arguments play the same role as in the example shown. Now the comparison is with the entire argument string. We should also mention that the compare function works in much the same way if the argument string is a C-style const char* string.[10]
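
A substring comparison of this kind can be wrapped in a small function. The following is a minimal sketch that uses the argument order of the standard string class, in which the position and the count for the invoking string come first and the argument string follows with its own position and count. The name sameSlice is hypothetical.

```cpp
#include <string>

// Hypothetical helper: true if the n characters of a starting at posA
// equal the n characters of b starting at posB. Each position must not
// exceed the size of its string, otherwise compare throws out_of_range.
bool sameSlice( const std::string& a, std::string::size_type posA,
                const std::string& b, std::string::size_type posB,
                std::string::size_type n ) {
    return a.compare( posA, n, b, posB, n ) == 0;
}
```

For instance, sameSlice( "hello", 1, "ellolotion", 0, 4 ) compares "ello" with "ello" and therefore returns true.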

A 3-valued string comparison function, such as the compare function, is what you'd need for certain kinds of string sorting functions. Let's say we wish to sort an array of strings initialized from string literals as shown below:

string wordList[] = {"hello", "halo", "jello", "yellow", //(C)
"mellow", "Hello", "JELLO", "Yello",
"MELLOW"};

Although later the reader will be introduced to the sorting functions designed expressly for C++, we can sort this array by using the venerated qsort function defined originally in the stdlib.h header file of the C standard library and made available in C++ through the header file cstdlib. The function qsort, frequently an implementation of quick-sort, is capable of sorting an array of any data type as long as you are able to specify a comparison function for the elements of the array.[11] (Strictly speaking, standard C++ guarantees the behavior of qsort only for arrays of trivial, C-compatible element types; applying it to string objects, as is done here for illustration, depends on the implementation.) The prototype of qsort is

void qsort( void* base, //(D)
size_t nmemb,
size_t size,
int (* compar) ( const void*, const void* ) );

where base is a pointer to the first element of the array to be sorted, nmemb the number of elements to be sorted,[12] size the size of each element in bytes,[13] and, finally, compar a pointer to a user-defined function for comparing any two elements of the array. The user-defined comparison function that will be bound to the parameter compar must return an int and must take exactly two arguments, both of type const void*. Furthermore, for qsort() to work correctly, the int returned by the comparison function must be positive when the entity pointed to by the first argument is greater than the entity pointed to by the second argument; must be negative when the opposite is the case; and must be zero when the two entities are equal.

Here is a possible comparison function for the fourth argument of qsort for sorting the elements of the array wordList of line (C) above:[14]

int compareStrings( const void* arg1, const void* arg2 ) { //(E)
return ( *( static_cast<const string*>( arg1 ) ) ).compare(
*( static_cast<const string*>( arg2 ) ) );
}

In terms of the return type and the parameter structure, this comparison function corresponds exactly to what is specified for the fourth argument of qsort() in line (D). The actual comparison is carried out by invoking the compare function of the string class.

Shown below is a simple program that pulls together the code fragments shown above into a complete program:

--------------------------------------------------------------------------------
//Qsort.cc
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

int compareStrings( const void* arg1, const void* arg2 );
int main()
{
string wordList[] = {"hello", "halo", "jello", "yellow",
"mellow", "Hello", "JELLO", "Yello",
"MELLOW"};
cout << sizeof( wordList ) << endl; // 36 (implementation dependent)

int sizeArray = sizeof( wordList ) / sizeof( wordList[ 0 ] );
cout << sizeArray << endl; // 9

qsort( wordList, sizeArray , sizeof(string), compareStrings);
int j = 0;
while ( j < sizeArray )
cout << wordList[j++] << " ";
//Hello JELLO MELLOW Yello halo hello jello mellow yellow
cout << endl;
return 0;
}

int compareStrings( const void* arg1, const void* arg2 ) {
return ( *( static_cast<const string*>( arg1 ) ) ).compare(
*( static_cast<const string*>( arg2 ) ) );
}
--------------------------------------------------------------------------------
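
For comparison, the same ordering can be obtained with the sort function from the algorithm header of the C++ Standard Library, one of the C++ sorting facilities that the text defers to a later discussion. The following is a minimal sketch and not the program above: the names comesBefore and sortedCopy are hypothetical, and a vector is used in place of the built-in array. Note that sort expects a predicate that returns true or false, so the 3-valued result of compare is reduced to a less-than test.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical ordering predicate built on the 3-valued compare().
bool comesBefore( const std::string& a, const std::string& b ) {
    return a.compare( b ) < 0;
}

// Hypothetical helper: returns a sorted copy of words.
std::vector<std::string> sortedCopy( std::vector<std::string> words ) {
    std::sort( words.begin(), words.end(), comesBefore );
    return words;
}
```

Because sort moves string objects through their own copy and assignment operations, it does not rely on elements being relocatable byte by byte.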

4.3.4 Joining Strings Together
Through the overloading of the ‘+’ operator, the string class makes it very easy to join strings together without having to worry whether or not you allocated sufficient memory for the result string.[15] For example, we can say

string str1( "hello" );
string str2( "there" );
string str3 = str1 + " " + str2; // "hello there"
str2 += str1; // "therehello"

which would result in the object str3 storing the string "hello there" and the object str2 storing the string "therehello". The operator ‘+’ works the same if one of the operands is of type const char* or just char, as long as the other operand is an object of type string.[16] So while the following will not work

string s = "hello" + " there"; // Wrong

the following does:

string s = string( "hello" ) + " there";

It is also possible to use the append member function for joining two strings, or one string with a part of another string, as the following example illustrates:

string string1( "hello" );
string string2( " the world at large" );
string string3 = string1;

string3.append( string2 ); //(A)
cout << string3; // "hello the world at large"

string1.append( string2, 4, 6 ); //(B)
cout << string1; // "hello world"

In the one-argument invocation of append in line (A), the entire argument string is appended to the invoking string. In the three-argument version of append, shown in line (B), a substring from the argument string is appended to the invoking string. The substring begins at the index specified by the second argument, with the third argument specifying its length. The second and the third arguments in the three-argument version are both of type string::size_type, which as mentioned before can be treated as an ordinary integer for the purpose of program design.

There is also a two-argument version of append in which the second argument is the same as the second argument of the three-argument version. In this case, the entire argument string starting at the specified index is appended to the invoking string.

As is true of all string class member functions, the argument string can also be a C-style const char* string.
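
The index arithmetic in the three-argument append is easy to get wrong by one, so it helps to check it on a concrete case. In the string " the world at large", which starts with a blank, the six-character substring " world" begins at index 4. The following is a minimal sketch; the helper name withSlice is an assumption made for illustration.

```cpp
#include <string>

// Hypothetical helper: returns base followed by the count characters of
// source that start at index start. A count larger than what remains in
// source is clipped to the end of source.
std::string withSlice( std::string base, const std::string& source,
                       std::string::size_type start,
                       std::string::size_type count ) {
    base.append( source, start, count );
    return base;
}
```

With these definitions, withSlice( "hello", " the world at large", 4, 6 ) produces "hello world".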

4.3.5 Searching for Substrings and Characters
A frequent problem in string processing is that we want to know if a given string has particular substrings or particular characters in it. Consider, for example, the problem of isolating words in a text file. Of the many different ways of solving this problem, one would be to read the file one line at a time and to then look for whitespace characters in each line. If not excessively large, we could even read the entire file as a single string and then look for whitespace characters (which include line-feeds and carriage returns) to break the string into individual words.

The C++ string library provides a number of functions for searching for substrings and individual characters in a string. These functions are named find, rfind, find_first_of, find_last_of, find_first_not_of, and find_last_not_of. In all there are 24 functions with these six names, the various versions of the functions catering to different types of arguments. In this section, we will explain how one can invoke find and find_first_of on string type objects with string or char type arguments. (Their usage on const char* type arguments is parallel to the usage on string arguments.) The functions rfind do the same thing as find, except that they start the search from the end of a string towards its beginning. The functions find_last_of again do the same thing as find_first_of, except that they start their search at the end of a string toward its beginning.

Here is an example that illustrates how one can invoke find to search for a substring in a string:

string::size_type pos = 0;
string quote( "Some cause happiness wherever they go,"
" others whenever they go - Oscar Wilde" );
if ( ( pos = quote.find( "happiness" ) ) != string::npos ) //(A)
cout << "The quote contains the word 'happiness'" << endl;

The function find returns the index of the character in the invoking string where it scores a match with the argument string. This index is of type string::size_type, an unsigned integer type; it is best stored in a variable of that type, as is done above, so that the comparison with npos works correctly. If no match is found, find returns a symbolic constant string::npos, a static data member of the string class that is also of type string::size_type. The actual value of npos is such that no actual character index in any valid string would ever correspond to it. In the above program fragment, note how we compare the value returned by find with the symbolic constant npos to establish the presence or the absence of the substring.
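
The test against npos is the part of this idiom that matters most, and it can be isolated in a pair of small functions. The following is a minimal sketch; the names contains and lastPositionOf are hypothetical, and the second function uses rfind, which searches from the end of the string.

```cpp
#include <string>

// Hypothetical helper: true if word occurs anywhere in text.
bool contains( const std::string& text, const std::string& word ) {
    return text.find( word ) != std::string::npos;
}

// Hypothetical helper: index of the last occurrence of word in text,
// or std::string::npos if there is none.
std::string::size_type lastPositionOf( const std::string& text,
                                       const std::string& word ) {
    return text.rfind( word );
}
```

Both functions keep positions in string::size_type, so the comparison with npos behaves as intended.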

The following program shows a simple demonstration of the use of find. It also shows how replace, another member function of the string class, can be used together with find to search for each occurrence of a substring in a string and, when found, how the substring can be replaced with another string. The program produces the output

4
32
one armadillo is like any other armadillo

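where the numbers 4 and 32 are the position indices at which find locates the substring "hello" on successive passes through the loop: 4 in the original string "one hello is like any other hello", and 32 in the string as it stands after the first replacement has lengthened it by four characters. Here is the program:[17]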

--------------------------------------------------------------------------------
//StringFind.cc

#include <cassert>
#include <iostream>
#include <string>
using namespace std;

int main()
{
string str( "one hello is like any other hello" );
string searchString( "hello" );
string replaceString( "armadillo" );

assert( searchString != replaceString );

string::size_type pos = 0;
while ( (pos = str.find(searchString, pos)) != string::npos ) {
cout << pos << endl; // 4, then 32
str.replace( pos, searchString.size(), replaceString );
pos++;
}
cout << str << endl; //one armadillo is like any other armadillo
return 0;
}
--------------------------------------------------------------------------------

Note the use of the 2-argument version of find in the above program. The second argument tells find where to begin the search for the substring. When you are searching for a character or a substring with find, after you have obtained the first match, you need to increment the index represented by pos so that the search can continue on for the next occurrence. If you don't do that, find will keep on returning the same index ad infinitum.

The above example code also illustrates the use of the 3-argument replace. This function can take up to five arguments. The two additional arguments, both of type string::size_type, specify the position in the argument string and the number of characters to be taken starting at that position for the purpose of replacement.
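
The search-and-replace loop of StringFind.cc can be packaged as a reusable function. The following is a minimal sketch under two assumptions that differ from the program above: the name replaceAll is hypothetical, and the position is advanced past the inserted text rather than by one, so that the loop also terminates when the replacement string itself contains the search string.

```cpp
#include <string>

// Hypothetical helper: replaces every occurrence of from in text with to
// and returns the number of replacements made. An empty search string is
// rejected, since it would match at every position.
int replaceAll( std::string& text, const std::string& from,
                const std::string& to ) {
    if ( from.empty() ) return 0;
    int count = 0;
    std::string::size_type pos = 0;
    while ( ( pos = text.find( from, pos ) ) != std::string::npos ) {
        text.replace( pos, from.size(), to );
        pos += to.size();   // resume the search after the inserted text
        ++count;
    }
    return count;
}
```

Replacing "a" with "aa", for example, would never finish if the position were advanced by only one character after each hit.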

Shown below is an example of how one can use the string library function find_first_of to locate and count some of the more frequently used punctuation marks in a string. We place all the punctuation marks we are looking for in a string called marks, with the original string stored in quote. We invoke find_first_of on quote and supply it with marks as its first argument, the second argument consisting of the position index in quote where we want the search to begin. Note how we increment pos after each hit. If we did not do so, the function find_first_of will keep on returning the same location where it found the first punctuation mark. For the example shown, the program returns a count of five.

string quote( "Ah, Why, ye Gods, should two and two "
"make four? - Alexander Pope" );
string marks( ",.?:;-" );
string::size_type pos = 0;
int count = 0;
while ( ( pos = quote.find_first_of( marks, pos ) )
!= string::npos ) {
++pos;
++count;
}
cout << count << endl; // 5
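
The word-isolation problem mentioned at the start of this section can be approached with the same family of functions. The following is a minimal sketch; the name splitWords and the default set of whitespace delimiters are assumptions. It alternates between find_first_not_of, to locate the start of a word, and find_first_of, to locate the delimiter that ends it, and it uses the substr member function, described in the next section, to copy each word.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits text into words separated by any of the
// characters in delims.
std::vector<std::string> splitWords( const std::string& text,
                                     const std::string& delims = " \t\n\r" ) {
    std::vector<std::string> words;
    std::string::size_type start = text.find_first_not_of( delims );
    while ( start != std::string::npos ) {
        std::string::size_type end = text.find_first_of( delims, start );
        // substr clips the count, so end == npos takes the rest of text
        words.push_back( text.substr( start, end - start ) );
        // a start position at or beyond the end of text yields npos
        start = text.find_first_not_of( delims, end );
    }
    return words;
}
```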

4.3.6 Extracting Substrings
The string library offers the function substr for extracting a substring from a source string on which the function is invoked. This function can be invoked with one argument, of type string::size_type, that designates the index of the character that marks the start of the substring desired from the source string. The extracted substring will extend all the way to the end of the source string. This use is illustrated by the following code fragment. Here the string returned by substr will start at the position indexed 44 and go to the end of the quote. As a result, the output produced by line (A) is "Fiction has to make sense. - Tom Clancy".

string quote( "The difference between reality and fiction? "
"Fiction has to make sense. - Tom Clancy" );
string str = quote.substr( 44 );
cout << str << endl; //(A)

There is also a two-argument version of substr in which the first argument works the same as in the example shown above. The second argument, also of type string::size_type, now designates the number of characters to be extracted from the source string. If the number of characters requested exceeds the number remaining in the source string, the extracted substring will stop at the end of the source string. The following code fragment, which will output "Fiction", illustrates this usage.

string quote( "The difference between reality and fiction? "
"Fiction has to make sense. - Tom Clancy" );
string str = quote.substr( 44, 7 );
cout << str << endl; // Fiction

It is also possible to invoke the substr function with no arguments, in which case it simply returns a copy of the string object on which it is invoked.

Substrings can also be extracted by invoking the string constructor with a string argument and with additional optional arguments to specify the starting index for substring extraction and the number of characters to be extracted from the first argument string. In the invocations of the string constructor below that construct the objects str_1 and str_2, the first yields the substring "Fiction has to make sense. - Tom Clancy", and the second just the word "Fiction".

string quote( "The difference between reality and fiction? "
"Fiction has to make sense. - Tom Clancy" );
string str_1( quote, 44 );
string str_2( quote, 44, 7 );
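
Hard-coded indices such as 44 are fragile, since they silently go wrong as soon as the source string changes by a single character. The following is a minimal sketch that computes the starting index with find before calling substr. The name textAfter is hypothetical.

```cpp
#include <string>

// Hypothetical helper: returns the part of text that follows the first
// occurrence of marker, or an empty string if marker does not occur.
std::string textAfter( const std::string& text, const std::string& marker ) {
    std::string::size_type pos = text.find( marker );
    if ( pos == std::string::npos ) return std::string();
    return text.substr( pos + marker.size() );
}
```

Applied to the quote above with the marker "? ", the function returns the sentence that begins with "Fiction" without any character counting by hand.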

4.3.7 Erasing and Inserting Substrings
The string class member function erase can be used to erase a certain number of characters in the string on which the function is invoked. The function can be invoked with zero arguments, with one argument, and with two arguments. When invoked with no arguments, the function erases the string stored in the invoking object and replaces it with the empty string "". When invoked with one argument, which must be of type string::size_type, the string stored in the invoking object is erased from the position indexed by that argument to the end. When invoked with two arguments, both of type string::size_type, the second argument designates the number of characters to be erased starting at the position specified by the first argument.

The following code fragment illustrates the two-argument erase. It also illustrates the insert member function which can be used to insert a new substring into a string object. The function insert can be invoked with either two arguments, or three arguments, or four arguments. When invoked with two arguments, the first argument, of type string::size_type, designates the index of the position at which the new insertion is to begin, and the second argument the string to be inserted. In the three-argument version, the additional argument specifies a position in the argument string that designates the start of the substring to be inserted; the substring continues to the end. In the four-argument invocation, the last argument specifies the number of characters to be taken from the argument string for the purpose of insertion.

The example below shows two-argument and four-argument versions of insert.

string::size_type pos = 0;
string quote = "Some cause happiness wherever they go, "
               "others whenever they go - Oscar Wilde";
if ( ( pos = quote.find( "happiness" ) ) != string::npos ) {
    quote.erase( pos, 9 );
    quote.insert( pos, "excitement" );
}
cout << quote << endl;                                   //(A)
quote.erase( pos, 10 );
cout << quote << endl;                                   //(B)
quote.insert( pos, "infinite happiness in the air", 9, 9 );
cout << quote << endl;                                   //(C)

The code produces the following output:

FROM LINE (A):
Some cause excitement wherever they go, others whenever they go - Oscar Wilde

FROM LINE (B):
Some cause  wherever they go, others whenever they go - Oscar Wilde

FROM LINE (C):
Some cause happiness wherever they go, others whenever they go - Oscar Wilde

4.3.8 Size and Capacity
The size() (or length(), which does the same thing) member function when invoked on a string object will ordinarily return the number of characters in the string stored in the object. This will also ordinarily be the amount of memory allocated to a string object for the storage of the characters of the string.

string str( "0123456789" );
cout << str.size() << endl; // returns 10

When you extend the length of a string by using, say, the ‘+=’ operator, the size of the allocated memory is automatically increased to accommodate the longer length. But if a string is going to be extended in bits and pieces frequently, you can reduce the background memory-allocation work by preallocating additional memory for the string through the reserve() member function. If we refer to the total amount of memory currently available to a string for the storage of its characters as the string object's capacity, which is reported by the capacity() member function, we can use reserve to request a capacity of at least any desired number of characters without changing the size of the string. The resize() member function, by contrast, changes the size itself: when it makes a string longer, it appends fill characters to the string. In the program shown below, we initially create a string object of size 10 characters. We then increase the size to 20 by invoking resize, which appends ten null characters, although the number of characters before the first null is still 10. Finally, we invoke reserve to ask for room for at least 100 characters, which leaves the size unchanged.

--------------------------------------------------------------------------------
//StringSize.cc

#include <string>
#include <iostream>
#include <cstring>
using namespace std;

int main()
{
    string str = "0123456789";

    cout << "The current size of the string is: "
         << str.size() << endl;                             // 10
    str.resize( 20 );

    cout << "The new size of the string is: "
         << str.size() << endl;                             // 20

    cout << "The length up to the first null character is: " // 10
         << strlen( str.c_str() ) << endl;

    cout << "The string object after resizing "
         << "to 20 a 10 character string: "
         << str << endl;     // "0123456789" followed by 10 null characters
    str += "hello";
    cout << str << endl;     // "0123456789", 10 null characters, "hello"

    str.reserve( 100 );      // preallocate; the size does not change
    cout << "The capacity is now at least 100: "
         << str.capacity() << endl;                         // 100 or more
    cout << "The size is still: " << str.size() << endl;    // 25

    return 0;
}
--------------------------------------------------------------------------------

This code shows a one-argument version of resize. When supplied with an optional second argument, which must be of type char, the designated character is used to fill the positions added to the string, the default being the null character.

While on the subject of size, we also want to clarify the relationship between the size of a string object and the size of the string held by a string object. The size of a string object can be ascertained by invoking sizeof( string ), which for g++ returns 4 for all strings (but could return 8 on some systems). Before we go into why sizeof( string ) returns the same number for all strings on any given system, let's quickly review the nature of sizeof.

Remember from C that, despite its appearance, sizeof is not a function, but an operator. It is not a function in the sense that it does not evaluate its argument; it only looks at the type of its argument. To illustrate the nature of this operator, all of the following invocations of sizeof[18]

int x = 4;
int y = 5;
sizeof(x);
sizeof(x + y);
sizeof x;
sizeof( int );

return on the author's machine the same value, which is 4 for the 4 bytes that it takes to store an int.[19] So if we say

string s1 = "hello";
string s2 = "hello there";

and then invoke the sizeof operator by

sizeof( s1 ); // returns 4 for g++
sizeof( s2 ); // returns 4 for g++

we'd get exactly the same answer in both cases, the number 4 (or 8 for some compilers). Compare this with the following case of applying sizeof to the string literals directly:

sizeof( "hello" ); // returns 6
sizeof( "hello there" ); // returns 12

We get 6 for the string literal "hello" because it is NOT stored as a string object and because its internal representation is a null-terminated array of characters. Similarly for the string literal "hello there".

The constant value of 4 returned by sizeof( string ) is easy to understand if we think of the string class as having been provided with a single non-static data member of type char* for holding a character pointer to a null-terminated array of characters.

class string {
char* ptr;
// static data members if needed
public:
// string functions
};

Then the memory occupied by a string object would be what's needed by its sole nonstatic data member shown—4 bytes for the pointer. On the other hand, if a compiler returned 8 bytes for sizeof ( string ), that's because the string class used by that compiler comes with an additional data member—of possibly an unsigned integer type—for holding the size of the string pointed to by the first data member. In this case, it would not be absolutely necessary for the char* string to be null terminated since the second data member would tell us directly how many characters belonged to the string.

Note that if we applied the sizeof operator to any pointer type on such a machine, we'd get 4 for the four bytes needed to hold a memory address. For example,

sizeof ( string* ) -> 4
sizeof ( int* ) -> 4
sizeof ( char* ) -> 4

We have brought the above statements together in the following program:

--------------------------------------------------------------------------------
//StringSizeOf.cc
#include <string>
#include <iostream>
using namespace std;

// The sizes shown in the comments for string objects and pointers
// are those reported on the author's machine.
int main()
{
    cout << sizeof( "hello" ) << endl;            // 6
    cout << sizeof( "hello there" ) << endl;      // 12
    string str1 = "hello";
    string str2 = "hello there";

    cout << sizeof( str1 ) << endl;               // 4
    cout << sizeof( str2 ) << endl;               // 4

    const char* s1 = "hello";
    const char* s2 = "hello there";

    cout << sizeof( s1 ) << endl;                 // 4
    cout << sizeof( s2 ) << endl;                 // 4

    char c_arr[] = "how are you?";
    cout << sizeof( c_arr ) << endl;              // 13

    return 0;
}
--------------------------------------------------------------------------------

Before ending this subsection, we should remind the reader that sizeof () can sometimes show seemingly unexpected behavior. Consider the role of sizeof in the following program that attempts to find the size of the array in a called function by invoking sizeof:

--------------------------------------------------------------------------------
//ArraySizeOf.cc

#include <iostream>
using namespace std;

int sum( int [], int );

int main()
{
    int data[100] = {2, 3};
    int m = sizeof( data ) / sizeof( data[0] );       // (A)
    cout << sum( data, 100 ) << endl;
    return 0;
}

int sum( int a[], int arr_size ) {
    //the following value of n is not very useful
    int n = sizeof( a ) / sizeof( a[0] );             // (B)

    int result = 0;
    int* p = a;
    while ( p - a < arr_size )
        result += *p++;
    return result;
}
--------------------------------------------------------------------------------

While at (A) the number m will be set to 100, at (B) the number n will be set to 1. The reason for this is that when an array name is a function parameter, it is treated strictly as a pointer. So the numerator on the right-hand side at (B) is synonymous with sizeof( int* ) which yields 4.

4.3.9 Some Other String Functions
The string library offers a function swap that can be used to swap the actual strings stored inside two string objects. In the following code fragment, after the execution of the third statement, the object str1 will store the string "lemonade", whereas the object str2 will store the string "lemon".

string str1 = "lemon";
string str2 = "lemonade";
str1.swap( str2 );

A different effect is achieved by the assign function. After the execution of the third statement below, both the objects str1 and str2 will contain the string "lemonade":

string str1 = "lemon";
string str2 = "lemonade";
str1.assign( str2 );

[5]Actually, the built-in string type in C++ is the template class basic_string. The C++ string class is a typedef alias for basic_string<char>, which is the basic_string template with char as its template parameter. The concept of a template class, introduced briefly in Chapter 3, is presented more fully in Chapter 13.

[6]Copy constructors are discussed in Chapter 11.

[7]Depending on how the string type is implemented, a C++ string may not include a null terminator at the end. In that case, an empty C++ string can be truly empty, as opposed to a "" string in C which consists of the null terminator.

[8]We could also have said: "This declaration creates an object of type string." For nonprimitive types, the characterizations type and class are used interchangeably in object-oriented programming.

[9]On the basis of the notation explained in Section 3.16.1 of Chapter 3, the syntax string::size_type refers to the inner type size_type defined for the string class.

[10]This is actually true of all string member functions. They work the same for both string and const char* arguments.

[11]In Chapter 5, we discuss the notion of stable sorting for class type objects and point out that qsort may not be the best sorting function to invoke in some cases.

[12]We can think of size_t as an unsigned integer.

[13]For the example array shown, each element of the array is a string object that is initialized by the corresponding string literal on the right hand side of the declaration for wordList. So we can use sizeof(string) for the third argument of qsort.

[14]Typical C syntax for the same function would be

int compareStrings( const void* arg1, const void* arg2 ) {
return (*(const string*) arg1).compare(*(const string*) arg2);
}

The difference between the C way of writing this function and the C++ syntax shown in line (E) is with regard to casting. What is done by the cast operator (const string*) in the C version here is accomplished by static_cast<const string*>() in the C++ definition in line (E). The static_cast and other C++ cast operators are presented in Chapters 6 and 16.

[15]Obviously, there has to be sufficient free memory available to the memory allocator used by the string class for this to be the case. If the memory needed is not available, the memory allocator will throw an exception.

[16]As we will explain in Chapter 12, for class type operands the compiler translates the expression

str1 + str2;
into
str1.operator+( str2 );

where the function operator+ contains the overload definition for the ‘+’ operator. That makes str1 the operand on which the function operator+ is invoked and str2 the argument operand. We may loosely refer to str1 as the invoking operand.

[17]Note the use of the assert function in this program. The test stated in the argument to this function must evaluate to true for the thread of execution to proceed beyond the point of this function call.

[18]Although the parentheses are not really needed in sizeof(x), in the sense that we could also have said sizeof x, because of operator precedence the compiler would understand sizeof (x + y) and sizeof x + y differently. Since the operator sizeof is a unary operator and since unary operators have higher precedence than binary operators, sizeof x + y; would be interpreted as sizeof (x) + y.

[19]To be precise, the sizeof operator in C++ returns the size of a type-name in terms of the size of a char. However, in most implementations, the size of a char is 1 for the 1 byte that it takes to hold a character in C++. Also as a point of difference between C++ and C, in C sizeof ( 'x' ) returns 4, whereas sizeof ( char ) returns 1. On the other hand, in C++, both sizeof ( 'x' ) and sizeof ( char ) return 1. The reason for the discrepancy between the two sizeof values for C is that a char argument to the operator is read as an int, as is often the case with char arguments in C. Despite this discrepancy in C, the following idiom in C

int size;
char arr[3] = {'*', 'y', 'z'};
size = sizeof ( arr ) / sizeof( arr[0] );

does exactly what the programmer wants it to do (the value of size is set to 3, the number of elements in the array) because the sizeof operator looks only at the type of arr[0] in the denominator. In other words, even though sizeof( 'x' ) returns 4 in C, sizeof( arr[0] ) will always return 1.
STRINGS IN JAVA
Java provides two classes, String and StringBuffer, for representing strings and for string processing. An object of type String cannot be modified after it is created.[20] It can be deleted by the garbage collector if there are no variables holding references to it, but it cannot be changed. For this reason, string objects of type String are called immutable. If you want to carry out an in-place modification of a string, the string needs to be an object of type StringBuffer.

As in C++, a string literal in Java is double-quoted. String literals in Java are objects of type String. As in C++, two string literals consisting of the same sequence of characters are one and the same object in the memory. That is, there is only one String object stored for each string literal even when that literal is mentioned at different places in a program, in different classes, or even in different packages of a Java program.

That a string literal consisting of a given sequence of characters is stored only once in the memory is made clear by the following program. Lines (A) and (B) of the program define two different String variables, strX and strY, in two different classes; both strX and strY are initialized with string literals consisting of the same sequence of characters. Nonetheless, a comparison of the two with the ‘==’ operator in line (D) tests true. Recall that the operator ‘==’ returns true only when its two operands are one and the same object in the memory.

Line (C) of the program illustrates the following string-valued constant expression on the right-hand-side of the assignment operator

"hell" + "o"

In such cases, the Java compiler creates a new string literal by joining the two string literals "hell" and "o". Being still a literal, the resulting literal is not stored separately in the memory if it was previously seen by the compiler. So in our case, the variable strZ in line (C) will point to the same location in the memory as the variables strX in line (A) and strY in line (B). This is borne out by the fact that the ‘==’ comparison in line (E) tests true.

While joining two string literals together results in a constant expression that is resolved at compile time, the assignment to the variable s3 in the following three instructions can only be made at run time. Therefore, the string hello constructed on the right-hand side in the third statement below will have a separate existence as a String object in the memory even if a string literal consisting of the same sequence of characters was created previously by the program. That should explain why the comparison in line (F) of the program tests false.

String s1 = "hel";
String s2 = "lo";
String s3 = s1 + s2;

However, Java provides a mechanism through the method intern() defined for the String class that allows a string created at run time to be added to the pool of string literals (if it was not in the pool already). If the above three instructions are replaced with

String s1 = "hel";
String s2 = "lo";
String s3 = (s1 + s2).intern();

Java will compare the character sequence in the string object returned by s1 + s2 with the string literals already in store. If a match is found, intern() returns a reference to that literal. If a match is not found, then the string returned by s1 + s2 is added to the pool of string literals and a reference to the new literal is returned. That should explain why the ‘==’ comparison in line (G) of the program tests true. The reference returned by (s1 + s2).intern() will point to the same string literal as the data member strX of class X.

Here is the program:

--------------------------------------------------------------------------------
//StringLiteralUniqueness.java

class X { public static String strX = "hello"; } //(A)

class Y { public static String strY = "hello"; } //(B)

class Z { public static String strZ = "hell" + "o"; } //(C)

class Test {
public static void main( String[] args ) {

// output: true
System.out.println( X.strX == Y.strY ); //(D)

// output: true
System.out.println( X.strX == Z.strZ ); //(E)

String s1 = "hel";
String s2 = "lo";

// output: false
System.out.println( X.strX == (s1 + s2 ) ); //(F)

// output: true
System.out.println( X.strX == (s1 + s2).intern() ); //(G)
}
}
--------------------------------------------------------------------------------

4.4.1 Constructing String and StringBuffer Objects
String objects are commonly constructed using the following syntax

String str = "hello there";

or

String str = new String( "hello there" );

For constructing a StringBuffer object, the first declaration does not work because of type incompatibilities caused by the fact that the right hand side would be a String object and the left hand side a StringBuffer object.

StringBuffer strbuf = "hello there"; //WRONG

StringBuffer objects are commonly constructed using the following syntax

StringBuffer strbuf = new StringBuffer( "hello there" );

An empty String object, meaning a String object with no characters stored in it, can be created by

String s0 = "";

or by

String s0 = new String();

To create an empty StringBuffer object, use either

StringBuffer sb0 = new StringBuffer( "" );

or

StringBuffer sb0 = new StringBuffer();

When a String object is created with a nonempty initialization, the amount of memory allocated to the object for the storage of the characters equals exactly what's needed for the characters. On the other hand, when a new StringBuffer object is created, the amount of memory allocated to the object for actual representation of the string is often 16 characters larger than what is needed. This is to reduce the memory allocation overhead for modifications to a string that add a small number of characters to the string at a time. The number of characters that a StringBuffer object can accommodate without additional memory allocation is called its capacity. The number of characters stored in a String or a StringBuffer object can be ascertained by invoking the method length() and the capacity of a StringBuffer object by invoking the method capacity():

String str = "hello there";
System.out.println( str.length() ); // 11
StringBuffer strbuf = new StringBuffer( "hello there" );
System.out.println( strbuf.length() ); // 11
System.out.println( strbuf.capacity() ); // 27

One is, of course, not limited to the capacity that comes with the default initialization of a StringBuffer object, usually 16 over what is needed for the initialization string. If we invoke the StringBuffer constructor with an int argument, it constructs a string buffer with no characters in it, but with a capacity as specified by the argument. So the following invocation

StringBuffer strbuf = new StringBuffer( 1024 );

would create a string buffer of capacity 1024. Characters may then be inserted into the buffer by using, say, the append function that we will discuss later in this section.
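The relationship between length and capacity for a preallocated buffer can be sketched as follows. The class name StringBufferCapacity and the helper method fill are made up for this illustration; only the StringBuffer constructor and the append, length, and capacity methods come from the Java library.

```java
//StringBufferCapacity.java   (hypothetical file and class name)

class StringBufferCapacity {

    // Hypothetical helper: creates a buffer with the requested capacity,
    // appends the text, and returns { length, capacity }.
    static int[] fill( int initialCapacity, String text ) {
        StringBuffer strbuf = new StringBuffer( initialCapacity );
        strbuf.append( text );
        return new int[] { strbuf.length(), strbuf.capacity() };
    }

    public static void main( String[] args ) {
        int[] result = fill( 1024, "hello" );
        System.out.println( result[0] );      // 5
        System.out.println( result[1] );      // 1024
    }
}
```

Since the five appended characters fit within the preallocated space, the capacity stays at the value given to the constructor.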

While we have shown all the different possible constructor invocations for the StringBuffer class, the String class allows for many more, all with different types of arguments. In the rest of this section, we will show a few more of the String constructors. One of the String constructors takes a char array argument to construct a String object from an array of characters, as in the following example:[21]

char[] charArr = { 'h', 'e', 'l', 'l', 'o' };
String str4 = new String( charArr );

A String object can also be constructed from an array of bytes, as in

byte[] byteArr = { 'h', 'e', 'l', 'l', 'o' };
String str5 = new String( byteArr ); // "hello"

Each byte of the byte array byteArr will be set to the ASCII encoding of the corresponding character in the initializer. When constructing a String from the byte array, the Java Virtual Machine translates the bytes into characters using the platform's default encoding, which on most platforms agrees with ASCII for characters such as these. Subsequently, the String object is constructed from the decoded characters.

If the default encoding will not do the job for constructing a String from a byte array, it is possible to specify the encoding to be used.[22] In the following example, the byte array is specified so that each pair of bytes starting from the beginning corresponds to a Unicode representation of the character shown by the second byte of the pair. For example, the 16-bit pattern obtained by joining together one-byte ASCII based representations of '\0' and 'h' is the Unicode in its big-endian representation for the character 'h'. As a result, the string formed by the constructor is again "hello".

byte[] byteArr2 = { '\0', 'h', '\0', 'e', '\0', 'l',
                    '\0', 'l', '\0', 'o' };
String str6 = new String( byteArr2, "UTF-16BE" ); // "hello"

If we wanted to specify the byte order in the little-endian representation, we'd need to use the "UTF-16LE" encoding, as shown below:

byte[] byteArr3 = { 'h', '\0', 'e', '\0', 'l', '\0',
'l', '\0', 'o', '\0' };
String str7 = new String( byteArr3, "UTF-16LE" ); // "hello"

The last two invocations of the String constructor throw the UnsupportedEncodingException if the specified encoding is not supported by a JVM. The topic of exceptions and how to deal with them will be discussed in Chapter 10.
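Because the two-argument constructor declares UnsupportedEncodingException, a call to it has to be wrapped in a try-catch block or placed in a method that declares the exception. The following is only a sketch of the first option; the class name ByteDecoding, the helper decode, and the deliberately bogus encoding name are assumptions made for this illustration.

```java
//ByteDecoding.java   (hypothetical file and class name)

import java.io.UnsupportedEncodingException;

class ByteDecoding {

    // Hypothetical helper: decodes the bytes with the named encoding and
    // returns null if the JVM does not support that encoding.
    static String decode( byte[] bytes, String encodingName ) {
        try {
            return new String( bytes, encodingName );
        } catch ( UnsupportedEncodingException e ) {
            return null;
        }
    }

    public static void main( String[] args ) {
        byte[] bigEndian    = { 0, 'h', 0, 'i' };
        byte[] littleEndian = { 'h', 0, 'i', 0 };
        System.out.println( decode( bigEndian, "UTF-16BE" ) );         // hi
        System.out.println( decode( littleEndian, "UTF-16LE" ) );      // hi
        // "NO-SUCH-ENCODING" is a made-up name used to trigger the exception
        System.out.println( decode( bigEndian, "NO-SUCH-ENCODING" ) ); // null
    }
}
```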

4.4.2 Accessing Individual Characters
The individual characters of a Java string can be accessed by invoking the charAt method with an int argument:

String str = "hello";
char ch = str.charAt( 1 ); // 'e'

StringBuffer strbuf = new StringBuffer( "hello" );
ch = strbuf.charAt( 1 ); // 'e'

Since the strings created through the StringBuffer class are mutable, it is possible to write into each character position in such a string, as the following example illustrates:

StringBuffer strbuf = new StringBuffer( "hello" );
strbuf.setCharAt( 0, 'j' );

which would convert "hello" into "jello".

Indexing for accessing the individual characters of a string is always range checked in Java. If you try to access an index that is outside the valid limits for a string, the JVM will throw an exception of type StringIndexOutOfBoundsException:

String str = "hello";
char ch = str.charAt( 100 ); // ERROR

StringBuffer strbuf = new StringBuffer( "hello" );
ch = strbuf.charAt( 100 ); // ERROR

For a StringBuffer string, it is a range violation if you try to access an index that is outside the length of the string even if the index is inside the capacity.

StringBuffer strbuf = new StringBuffer( "hello" );
System.out.println( strbuf.capacity() ); // 21
ch = strbuf.charAt( 20 ); // ERROR
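A program that cannot rule out a bad index ahead of time can trap the range violation instead of letting it terminate the program. The sketch below assumes a hypothetical helper charOrDefault that returns '?' for an invalid index; it catches IndexOutOfBoundsException, which is the superclass of StringIndexOutOfBoundsException.

```java
//RangeCheck.java   (hypothetical file and class name)

class RangeCheck {

    // Hypothetical helper: returns the character at index i, or '?' when
    // i is outside the range 0 .. length()-1 of the buffer.
    static char charOrDefault( StringBuffer strbuf, int i ) {
        try {
            return strbuf.charAt( i );
        } catch ( IndexOutOfBoundsException e ) {
            return '?';
        }
    }

    public static void main( String[] args ) {
        StringBuffer strbuf = new StringBuffer( "hello" );
        System.out.println( charOrDefault( strbuf, 1 ) );     // e
        // index 20 is inside the capacity (21) but outside the length (5)
        System.out.println( charOrDefault( strbuf, 20 ) );    // ?
    }
}
```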

For a StringBuffer string, you can delete a character by invoking deleteCharAt:

StringBuffer strbuf = new StringBuffer( "hello" );
strbuf.deleteCharAt( 0 );
System.out.println( strbuf.length() ); // 4, was 5
System.out.println( strbuf.capacity() ); // 21, was 21

By deleting a character, the deleteCharAt method shrinks the length of the string by one, but note that the capacity of the string buffer remains unaltered.

4.4.3 String Comparison
Java strings are compared using the equals and compareTo methods, and the ‘==’ operator. The method equals returns a true/false answer, whereas the method compareTo returns an integer that tells us whether the String on which the method is invoked is less than, equal to, or greater than the argument String. For example, in the following program fragment

String str1 = "stint";
String str2 = "stink";
System.out.println( str1.equals( str2 ) ); // false

String str3 = "stint";
String str4 = "stink";
System.out.println( str3.compareTo( str4 ) > 0 ); // true

the first print statement outputs false because the strings pointed to by str1 and str2 are composed of different character sequences. The second print statement outputs true because the string str3 is indeed "greater" than the string str4. We'll have more to say on the compareTo method later in this subsection when we talk about sorting arrays of strings.
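Since only the sign of the integer returned by compareTo matters for ordering, the three possible outcomes can be sketched as shown next. The class name CompareToSign and the helper relation are hypothetical names introduced for this illustration.

```java
//CompareToSign.java   (hypothetical file and class name)

class CompareToSign {

    // Hypothetical helper: reports how a stands relative to b according
    // to the sign of the value returned by compareTo.
    static String relation( String a, String b ) {
        int result = a.compareTo( b );
        if ( result < 0 ) return "less";
        if ( result > 0 ) return "greater";
        return "equal";
    }

    public static void main( String[] args ) {
        System.out.println( relation( "stint", "stink" ) );   // greater
        System.out.println( relation( "stink", "stint" ) );   // less
        System.out.println( relation( "stint", "stint" ) );   // equal
    }
}
```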

With regard to the ‘==’ operator, as we have already mentioned, the operator can only be used for testing whether two different String variables are pointing to the same String object. Suppose we have the following statements in a program

String s1 = new String("Hello");
String s2 = s1;

then s1 == s2 would evaluate to true because both s1 and s2 will be holding references to the same string object, meaning an object that resides at the same place in the memory. On the other hand, if we say

String s1 = new String("hello");
String s2 = new String("hello");

then s1 == s2 will evaluate to false because we now have two distinct String objects at two different places in the memory even though the contents of both objects are identical in value, since they are both formed from the same string literal.

As was mentioned earlier in Chapter 3, both equals and ‘==’ are defined for the Object class, the root class in the Java hierarchy of classes, and the system-supplied definitions for both are the same for Object: comparison on the basis of equality of reference. So, as defined for Object, both these predicates tell us whether the two references point to exactly the same object in the memory. However, while equals can be overridden, ‘==’ cannot because it is an operator. The method equals has already been overridden for us in the String class. So it carries out its comparisons on the basis of equality of content for String type strings. But since, in general, operators cannot be overridden in Java, the operator ‘==’ retains its meaning as defined in the Object class.

A word of caution about comparing objects of type StringBuffer: While the system provides us with an overridden definition for the equals method for the String class, it does not do so for the StringBuffer class. In other words, while for the String class you can use the equals method to test for the equality of content, you cannot do so for the StringBuffer class, as borne out by the following code:

String s1 = new String( "Hello" );
String s2 = new String( "Hello" );
System.out.println( ( s1.equals( s2 ) ) + "" ); // true

StringBuffer s3 = new StringBuffer( "Hello" );
StringBuffer s4 = new StringBuffer( "Hello" );
System.out.println( ( s3.equals( s4 )) + "" ); // false

If you must compare two StringBuffer objects for equality of content, you can do so by first constructing String objects out of them via the toString method, as in

StringBuffer sb = new StringBuffer( "Hello" );
if ( sb.toString().equals( "jello" ) )
....
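The content comparison via toString can be packaged as a small helper, as in the following sketch. The class name BufferContentCompare and the method sameContent are hypothetical. The sketch also uses the contentEquals method of the String class, which is not discussed in the text above; we assume a Java version (1.4 or later) that provides it.

```java
//BufferContentCompare.java   (hypothetical file and class name)

class BufferContentCompare {

    // Hypothetical helper: compares two buffers on the basis of content
    // by first converting each of them to a String.
    static boolean sameContent( StringBuffer a, StringBuffer b ) {
        return a.toString().equals( b.toString() );
    }

    public static void main( String[] args ) {
        StringBuffer s3 = new StringBuffer( "Hello" );
        StringBuffer s4 = new StringBuffer( "Hello" );
        System.out.println( s3.equals( s4 ) );               // false
        System.out.println( sameContent( s3, s4 ) );         // true
        // String.contentEquals compares a String with a StringBuffer
        System.out.println( "Hello".contentEquals( s3 ) );   // true
    }
}
```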

We will now revisit the compareTo method for the String class. The String class implements the Comparable interface by providing an implementation for the compareTo method. The compareTo method as provided for the String class compares two strings lexicographically using the Unicode values associated with the characters in the string.[23] Because the String class comes equipped with a compareTo method, we say that String objects possess a natural ordering, which implies that we are allowed to sort an array of Strings by invoking, say, java.util.Arrays.sort without having to explicitly supply a comparison function to the sort method. This is in accord with our Chapter 3 discussion on comparing objects in Java. The following example illustrates invoking java.util.Arrays.sort for sorting an array of strings.

If we do not want the array of strings to be sorted according to the compareTo comparison function, we can invoke a two-argument version of java.util.Arrays.sort and supply for its second argument an object of type Comparator that has an implementation for a method called compare that tells the sort function how to carry out comparisons.[24] If all you want to do is to carry out a case-insensitive comparison, you can use the Comparator object CASE_INSENSITIVE_ORDER that comes as a static data member of the String class. In the code example shown below, the second sort is a case-insensitive sort. The java.util.Arrays.sort is based on the merge-sort algorithm.

--------------------------------------------------------------------------------
//StringSort.java

import java.util.*;

class StringSort {
    public static void main( String[] args ) {
        String[] strArr = { "apples", "bananas", "Apricots", "Berries",
                            "oranges", "Oranges", "APPLES", "peaches"};
        String[] strArr2 = strArr;
        System.out.println("Case sensitive sort with Arrays.sort:" );
        Arrays.sort( strArr );
        for (int i=0; i<strArr.length; i++)
            System.out.println( strArr[i] );
        System.out.println("\nCase insensitive sort:" );
        Arrays.sort( strArr2, String.CASE_INSENSITIVE_ORDER );
        for (int i=0; i<strArr2.length; i++)
            System.out.println( strArr2[i] );
    }
}
--------------------------------------------------------------------------------

The output of this program is

Case sensitive sort with Arrays.sort:
APPLES
Apricots
Berries
Oranges
apples
bananas
oranges
peaches
Case insensitive sort:
APPLES
apples
Apricots
bananas
Berries
Oranges
oranges
peaches
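
For an ordering other than the two shown above, the two-argument version of java.util.Arrays.sort can be given a programmer-defined Comparator. The sketch below orders strings by their length. The class names LengthComparator and StringSortByLength are hypothetical, and the generic form Comparator<String> assumes Java 5 or later.

```java
//StringSortByLength.java   (hypothetical file and class names)

import java.util.Arrays;
import java.util.Comparator;

// Hypothetical comparator: shorter strings come before longer ones.
class LengthComparator implements Comparator<String> {
    public int compare( String a, String b ) {
        return a.length() - b.length();
    }
}

class StringSortByLength {

    // Returns a sorted copy so that the argument array is left untouched.
    static String[] sortedByLength( String[] words ) {
        String[] copy = words.clone();
        Arrays.sort( copy, new LengthComparator() );
        return copy;
    }

    public static void main( String[] args ) {
        String[] words = { "bananas", "fig", "apples", "kiwi" };
        String[] sorted = sortedByLength( words );
        for ( int i = 0; i < sorted.length; i++ )
            System.out.println( sorted[i] );     // fig kiwi apples bananas
    }
}
```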

4.4.4 Joining Strings Together
In general, Java does not overload its operators. But there is one exception to that general rule, the operator ‘+’ for just the String type (and not even for the StringBuffer type). The overload definition for this operator will cause the object str3 in the following code fragment to store the string "hello there".

String str1 = "hello";
String str2 = " there";
String str3 = str1 + str2;

Strings of type StringBuffer can be joined by invoking the append method, as in

StringBuffer strbuf = new StringBuffer( "hello" );
StringBuffer strbuf2 = new StringBuffer( " there" );
strbuf.append( strbuf2 );
System.out.println( strbuf ); // "hello there"
String str = "!";
strbuf.append( str );
System.out.println( strbuf ); // "hello there!"

The capacity of a string buffer is automatically increased if it runs out of space as additional characters are added to the string already there.

In addition to invoking the append method with either the String or the StringBuffer arguments, you can also invoke it with some of the other types that Java supports, as illustrated by:

StringBuffer strbuf = new StringBuffer( "hello" );
int x = 123;
strbuf.append( x );
System.out.println( strbuf ); // "hello123"
double d = 9.87;
strbuf.append( d );
System.out.println( strbuf ); // "hello1239.87"

As you can see, append first converts its argument to a string representation and then appends the new string to the one already in the buffer. This permits append to be invoked for any object, even a programmer-defined object, as long as it is possible to convert the object into its string representation. As we saw in Chapter 3, when a class is supplied with an override definition for the toString method, the system can automatically create string representations of the objects made from the class.
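Appending a programmer-defined object can be sketched as follows. The classes Point and AppendObject are hypothetical and exist only for this illustration; the point being made is that append uses the toString override to obtain the string representation.

```java
//AppendObject.java   (hypothetical file and class names)

// Hypothetical programmer-defined class with a toString override.
class Point {
    private final int x;
    private final int y;

    Point( int x, int y ) {
        this.x = x;
        this.y = y;
    }

    public String toString() {
        return "(" + x + ", " + y + ")";
    }
}

class AppendObject {

    // Appends the string representation of p to a fixed prefix.
    static String describe( Point p ) {
        StringBuffer strbuf = new StringBuffer( "point at " );
        strbuf.append( p );              // uses Point's toString
        return strbuf.toString();
    }

    public static void main( String[] args ) {
        System.out.println( describe( new Point( 3, 4 ) ) );  // point at (3, 4)
    }
}
```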

Going back to the joining of String type strings, an immutable string class is inefficient for serial concatenation of substrings, as in

String s = "hello" + " there" + " how" + " are" + " you";

The string concatenations on the right are equivalent to

String s = ((("hello" + " there") + " how") + " are") + " you";

If the Java compiler had available to it only the immutable String class for string processing, each parenthesized concatenation on the right would demand that a new String object be created. Therefore, this example would entail creation of five String objects, of which only one would be used. And then there would be further work entailed in the garbage collection of the eventually unused String objects. Fortunately, the Java compiler does not really use the String class for the operations on the right. Instead, it uses the mutable StringBuffer class and the append method of that class to carry out the concatenations shown above. The final result is then converted back to a String.
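
When the pieces to be joined are known only at run time, for instance when they are gathered in a loop, a programmer can use a StringBuffer directly in the same spirit. The following is a minimal sketch; the class name JoinWords and the method join are hypothetical.

```java
//JoinWords.java   (hypothetical file and class name)

class JoinWords {

    // Hypothetical helper: joins the words with single blanks by using
    // one StringBuffer instead of creating a new String at every step.
    static String join( String[] words ) {
        StringBuffer strbuf = new StringBuffer();
        for ( int i = 0; i < words.length; i++ ) {
            if ( i > 0 )
                strbuf.append( ' ' );
            strbuf.append( words[i] );
        }
        return strbuf.toString();        // converted to a String only once
    }

    public static void main( String[] args ) {
        String[] words = { "hello", "there", "how", "are", "you" };
        System.out.println( join( words ) );   // hello there how are you
    }
}
```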

4.4.5 Searching and Replacing
One can search for individual characters and substrings in a String type string by invoking the indexOf method:

String str = "hello there";
int n = str.indexOf( "the" ); // 6

By supplying indexOf with a second int argument, it is also possible to specify the index of the starting position for the search. This can be used to search for all occurrences of a character or a substring, as the following code fragment illustrates:

String mystr = new String( "one hello is like any other hello" );
String search = "hello";
int pos = 0;
while ( true ) {
    pos = mystr.indexOf( search, pos );
    if ( pos == -1 ) break;
    System.out.println( "hello found at: " + pos ); // 4 and 28
    pos++;
}
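
The same loop can be packaged as a reusable method, and it works equally well for individual characters because indexOf is overloaded for a character argument. The sketch below uses our own names (FindAll, findAll); it also guards against an empty search string, which would otherwise match at every position.

```java
// FindAll.java
import java.util.ArrayList;
import java.util.List;

class FindAll {

    // Returns the starting index of every occurrence of search in text.
    static List<Integer> findAll( String text, String search ) {
        List<Integer> positions = new ArrayList<Integer>();
        if ( search.length() == 0 ) return positions;   // nothing sensible to report
        int pos = 0;
        while ( ( pos = text.indexOf( search, pos ) ) != -1 ) {
            positions.add( pos );
            pos++;                  // continue just past the start of this match
        }
        return positions;
    }

    // Same idea for a single character.
    static List<Integer> findAll( String text, char ch ) {
        List<Integer> positions = new ArrayList<Integer>();
        int pos = 0;
        while ( ( pos = text.indexOf( ch, pos ) ) != -1 ) {
            positions.add( pos );
            pos++;
        }
        return positions;
    }

    public static void main( String[] args ) {
        System.out.println(
            findAll( "one hello is like any other hello", "hello" ) );  // [4, 28]
        System.out.println( findAll( "hello there", 'e' ) );            // [1, 8, 10]
    }
}
```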

To parallel our C++ program StringFind.cc, we show next a program that searches for all occurrences of a substring and replaces each occurrence it finds by another string. Since a String is immutable, we'll have to use a StringBuffer for representing the original string. But since early versions of the StringBuffer class provided no search functions (indexOf and lastIndexOf were added to the class only in Java 1.4), we have to somehow combine the mutability of a StringBuffer with the searching capability of a String. The following program illustrates this to convert "one hello is like any other hello" into "one armadillo is like any other armadillo".

--------------------------------------------------------------------------------
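//StringFind.java
class StringFind {
    public static void main( String[] args ) {
        StringBuffer strbuf = new StringBuffer(
                        "one hello is like any other hello" );
        String searchString = "hello";
        String replacementString = "armadillo";
        int pos = 0;
        // Search a String snapshot of the buffer, then edit the buffer itself
        while ( ( pos = strbuf.toString().indexOf(
                                    searchString, pos ) ) != -1 ) {
            strbuf.replace( pos, pos +
                        searchString.length(), replacementString );
            // Resume just past the inserted text so that a replacement
            // containing the search string is never matched again
            pos += replacementString.length();
        }
        System.out.println( strbuf );  // "one armadillo is like any other armadillo"
    }
}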
--------------------------------------------------------------------------------
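
On Java 1.4 and later, StringBuffer also provides its own indexOf method, so the conversion to a String on every pass through the loop can be avoided. The following sketch, with our own names ReplaceAllSketch and replaceAll, shows one way this might look; it advances the search position past the inserted text so that a replacement containing the search string does not get matched again.

```java
// ReplaceAllSketch.java
class ReplaceAllSketch {

    // Replaces every occurrence of search in original by replacement.
    static String replaceAll( String original, String search,
                                               String replacement ) {
        if ( search.length() == 0 ) return original;    // nothing to search for
        StringBuffer strbuf = new StringBuffer( original );
        int pos = 0;
        // StringBuffer.indexOf( String, int ) is available since Java 1.4
        while ( ( pos = strbuf.indexOf( search, pos ) ) != -1 ) {
            strbuf.replace( pos, pos + search.length(), replacement );
            pos += replacement.length();    // skip over what we just inserted
        }
        return strbuf.toString();
    }

    public static void main( String[] args ) {
        System.out.println( replaceAll(
            "one hello is like any other hello", "hello", "armadillo" ) );
                        // "one armadillo is like any other armadillo"
    }
}
```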

There is also the method lastIndexOf that searches for the rightmost occurrence of a character or a substring:

String str = "hello there";
int n = str.lastIndexOf( "he" ); // 7

The methods endsWith and startsWith can be invoked to check for suffixes and prefixes in strings:

String str = "hello there";
if (str.startsWith( "he" ) ) // true
....
if ( str.endsWith( "re" ) ) // true
....
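
Like indexOf, the lastIndexOf method accepts a second int argument, which in this case gives the index at which the backward search is to begin. The sketch below, using our own names LastIndexDemo and findAllBackward, collects all occurrences of a substring from right to left. It also shows the two-argument version of startsWith, which tests for a prefix beginning at a given offset.

```java
// LastIndexDemo.java
import java.util.ArrayList;
import java.util.List;

class LastIndexDemo {

    // Returns the starting indices of all occurrences of search in text,
    // rightmost occurrence first.
    static List<Integer> findAllBackward( String text, String search ) {
        List<Integer> positions = new ArrayList<Integer>();
        if ( search.length() == 0 ) return positions;
        int pos = text.length();
        while ( ( pos = text.lastIndexOf( search, pos ) ) != -1 ) {
            positions.add( pos );
            pos--;      // a negative start index simply yields -1 next time
        }
        return positions;
    }

    public static void main( String[] args ) {
        System.out.println( findAllBackward(
            "one hello is like any other hello", "hello" ) );   // [28, 4]
        String str = "hello there";
        System.out.println( str.startsWith( "the", 6 ) );       // true
    }
}
```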

4.4.6 Erasing and Inserting Substrings
The following example shows how we can search for a substring, erase it, and then insert another substring in its place. What erase did for C++ is now done by delete, which takes two int arguments: the beginning index of the character sequence to be deleted and the index just past its last character. Insertion of a substring is carried out with the insert method, whose first argument, of type int, specifies the index where the new substring is to be spliced in.

--------------------------------------------------------------------------------
// StringInsert.java
class StringInsert {
    public static void main( String[] args ) {
        int pos = 0;
        StringBuffer quote = new StringBuffer(
                "Some cause happiness wherever they go,"
                + " others whenever they go - Oscar Wilde" );
        String search = "happiness";
        // Search a String snapshot of the buffer for the word to replace
        if ( ( pos = quote.toString().indexOf( search ) ) != -1 ) {
            quote.delete( pos, pos + search.length() );  // end index is exclusive
            quote.insert( pos, "excitement" );
        }
        System.out.println( quote );
    }
}
--------------------------------------------------------------------------------
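
The index conventions of delete and insert are easier to see on a string whose characters are their own indices. The following is a small sketch with our own names DeleteInsertDemo and splice; it also shows that insert, like append, accepts arguments of types other than String.

```java
// DeleteInsertDemo.java
class DeleteInsertDemo {

    // Removes the characters in [start, end) and splices insertion
    // in at the position where the removed characters began.
    static String splice( String original, int start, int end,
                                           String insertion ) {
        StringBuffer strbuf = new StringBuffer( original );
        strbuf.delete( start, end );        // the character at end is kept
        strbuf.insert( start, insertion );
        return strbuf.toString();
    }

    public static void main( String[] args ) {
        StringBuffer strbuf = new StringBuffer( "0123456789" );
        strbuf.delete( 2, 5 );              // removes "234"
        System.out.println( strbuf );       // "0156789"
        strbuf.insert( 2, "abc" );
        System.out.println( strbuf );       // "01abc56789"
        strbuf.insert( 0, 42 );             // an int is converted to "42"
        System.out.println( strbuf );       // "4201abc56789"
    }
}
```

A delete followed by an insert at the same position has the same effect as a single call to replace with the same pair of indices.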

4.4.7 Extracting Substrings
Both String and StringBuffer support substring extraction by invoking the substring method with either one int argument or two int arguments. When only one argument is supplied to substring, that is the beginning index for the substring to be extracted. The substring extracted will include all of the characters from the beginning index through the end of the string. When two arguments are supplied, the second argument is the ending index of the desired substring; the character at the ending index is itself not included, so the extracted substring has as many characters as the difference between the two indices. In all cases, for both String and StringBuffer, the returned object is a new String. For illustration:

String str = "0123456789abc";
System.out.println( str.substring( 5 ) ); // "56789abc"
System.out.println( str.substring( 5, 9 ) ); // "5678"
StringBuffer stb = new StringBuffer( "0123456789abc" );
System.out.println( stb.substring( 5 ) ); // "56789abc"
System.out.println( stb.substring( 5, 9 ) ); // "5678"
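
Because the ending index is exclusive, substring combines naturally with indexOf. The sketch below, using our own names SubstringDemo and between, extracts the text enclosed by a pair of delimiters; returning null when a delimiter is missing is merely a convention we have assumed for this example.

```java
// SubstringDemo.java
class SubstringDemo {

    // Returns the text between the first occurrence of open and the next
    // occurrence of close, or null if either delimiter is missing.
    static String between( String text, String open, String close ) {
        int start = text.indexOf( open );
        if ( start == -1 ) return null;
        start += open.length();                 // first character after open
        int end = text.indexOf( close, start );
        if ( end == -1 ) return null;
        return text.substring( start, end );    // character at end is excluded
    }

    public static void main( String[] args ) {
        String str = "0123456789abc";
        System.out.println( str.substring( 5, 9 ).length() );       // 4
        System.out.println( between( "name=[Oscar]", "[", "]" ) );  // "Oscar"
    }
}
```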

[20]Operations on String type objects sometimes have the appearance that you might be changing an object of type String, but that is never the case. Any such operation that yields a different character sequence forms a new String object. For example, in the following statements the string literal "jello" in line (A) did not get changed into "hello" in line (B). The string literals "jello" and "hello" occupy separate places in the memory. Initially, s1 holds a reference to the former literal and then to the latter literal. After s1 changes its reference to "hello", the object for "jello" is no longer reachable through s1. An ordinary String object in this position would eventually be garbage collected if no other variable held a reference to it, although string literals are interned and typically stay alive for as long as the class that contains them remains loaded. The statement in line (C) results in the creation of a new String object whose reference is held by the variable s2.

String s1 = "jello"; //(A)
s1 = "hello"; //(B)
String s2 = s1 + " there"; //(C)

By the same token, in lines (D) and (E) below, the object s2 is a new string object, as opposed to being an extension of the object s1:

String s1 = "hello"; //(D)
String s2 = s1.concat( "there" ); //(E)

The invocation of the concat method in line (E) returns a new string that is a concatenation of the string on which the method is invoked and the argument string.
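
The difference between the two classes in this respect can be checked by comparing object references. In the following sketch, whose class and method names are our own, concat leaves the original String untouched and returns a different object, whereas append on a StringBuffer modifies the buffer in place and returns a reference to that same buffer.

```java
// ImmutableDemo.java
class ImmutableDemo {

    // True if concat produced a new object and left s1 as it was.
    static boolean concatLeavesOriginalUnchanged() {
        String s1 = "hello";
        String s2 = s1.concat( " there" );
        return s1.equals( "hello" )
            && s2.equals( "hello there" )
            && s1 != s2;                    // two distinct objects
    }

    // True if append changed the buffer itself and returned that buffer.
    static boolean appendModifiesInPlace() {
        StringBuffer sb1 = new StringBuffer( "hello" );
        StringBuffer sb2 = sb1.append( " there" );
        return sb1 == sb2                   // same object
            && sb1.toString().equals( "hello there" );
    }

    public static void main( String[] args ) {
        System.out.println( concatLeavesOriginalUnchanged() );  // true
        System.out.println( appendModifiesInPlace() );          // true
    }
}
```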

[21]The reader may wish to read the rest of this subsection after we discuss the different primitive types in Java in Chapter 6.

[22]Java supports the following character encodings that we will discuss further in Chapter 6:

US-ASCII (this is the seven-bit ASCII)

ISO-8859-1 (ISO-Latin-1)

UTF-8 (8-bit Unicode Transformation Format)

UTF-16BE (16-bit Unicode in big-endian byte order)

UTF-16LE (16-bit Unicode in little-endian byte order)

UTF-16 (16-bit Unicode in which the byte order is specified by a mandatory initial byte-order mark)
