- Howard Chu
The Sad State of C Strings
Updated: Aug 5, 2021
Character strings are an essential component of any programming language, but C is a bit unusual in not defining an explicit string type. The C standard specifies some standard library functions for operating on C strings, which gives them a de facto type and standard methods, but these functions were horrible when first invented in the 1970s, and nothing sane has replaced them yet. In honor of the Chinese Lunar New Year, and the 30th anniversary of this rant, I delve once more into these problems.
There are numerous design flaws in the C string library, including the lack of a string type with explicit lengths, and other such misfeatures. One of my pet peeves is the sheer idiocy of the basic Copy function, strcpy(). It takes two arguments: a pointer to a destination and a pointer to a source, and returns the pointer to the destination. This is a prime example of idiotic API design - a function should never return a value to a caller that the caller already knows. It’s a waste of a return value, redundant information. Staying within the C string paradigm, the smart thing to have done would be to return a pointer to the *end* of the destination - this is new information that the caller doesn’t have, and is extraordinarily useful. Knowing the end and the beginning of the destination lets you quickly compute the string’s length, thus obviating a call to strlen(). (And because string operations so often need to know the lengths of the strings they’re operating on, strlen() is otherwise heavily used, at a cost equal to the length of the string.)
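A minimal sketch of the return convention argued for above, assuming a function named strcopy() that behaves like strcpy() except for its return value. The helper copy_and_measure() is hypothetical and exists only to show the length falling out of pointer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strcopy(): copy src (including its terminating NUL) to dst
 * and return a pointer to that NUL in dst, i.e. the end of the new string.
 * The caller must guarantee dst is large enough, exactly as with strcpy(). */
char *strcopy(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    while ((*dst = *src++) != '\0')
        dst++;
    return dst;
}

/* Hypothetical helper: the length comes from pointer subtraction,
 * so no separate strlen() pass over the destination is needed. */
size_t copy_and_measure(char *dst, const char *src)
{
    return (size_t)(strcopy(dst, src) - dst);
}
```

Because the returned pointer is the position where the next piece would start, it can be fed straight back in as the destination of the next copy.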
Furthermore, this misdesign causes the C library to need another almost-identical function, strcat() for concatenating two strings together. Since knowledge of where the string ends is not retained anywhere, you need an explicit function like strcat() that walks to the end of the destination string before then copying on the contents of the source string.
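To make the hidden cost visible, here is an illustrative sketch of what a strcat()-style function has to do, with a counter added for the bytes it rescans. Both counting_strcat() and rescanned_bytes() are hypothetical names invented for this illustration, not library functions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative strcat() lookalike that also counts how many bytes of the
 * destination it had to skip before it could start appending. */
char *counting_strcat(char *dst, const char *src, size_t *skipped)
{
    char *p = dst;
    assert(dst != NULL && src != NULL && skipped != NULL);
    while (*p != '\0') {        /* rescan everything already in dst */
        p++;
        (*skipped)++;
    }
    while ((*p++ = *src++) != '\0')
        ;                       /* copy src, including its NUL */
    return dst;                 /* the same pointer the caller passed in */
}

/* Append `piece` to an empty buffer `count` times and return the total
 * number of bytes rescanned. buf must be large enough for the result. */
size_t rescanned_bytes(char *buf, const char *piece, size_t count)
{
    size_t skipped = 0;
    buf[0] = '\0';
    for (size_t i = 0; i < count; i++)
        counting_strcat(buf, piece, &skipped);
    return skipped;
}
```

With a 2-byte piece, 4 appends rescan 0 + 2 + 4 + 6 = 12 bytes and 8 appends rescan 56, so doubling the number of pieces roughly quadruples the wasted work.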
As I pointed out in my original posts, constructing a larger string out of multiple smaller strings is a pretty common programming task. With the C library’s definitions, you can do this:
    char buf[MAXLEN];
    int len;
    strcat(strcat(strcat(strcpy(buf, "This "),"is "),"a long "),"string.");
    len = strlen(buf);
The above example executes in quadratic rather than linear time as the number of pieces grows, because each strcat() rescans everything already copied. It’s often used as an example of Shlemiel the Painter algorithms (although that name came about 15 years after I first drew attention to the problem). Using my strcopy() proposal you could do this:
    len = strcopy(strcopy(strcopy(strcopy(buf, "x"),"y"),"z"),"phooey") - buf;
which executes in linear time and avoids the 2x cost from strlen() running through the entire string a second time.
The same misdesign also applies to memcpy().
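The same idea carries over to raw memory. The sketch below assumes a hypothetical wrapper, mem_copy_end(), that returns the position just past the copied bytes instead of echoing the destination back. The name and the pack_two() helper are invented for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mem_copy_end(): like memcpy(), but returns a pointer to the
 * byte just past the last byte written. The regions must not overlap. */
void *mem_copy_end(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
    return (char *)dst + n;
}

/* Hypothetical helper: pack two fields back to back and return the total
 * number of bytes written, without tracking a running offset by hand. */
size_t pack_two(void *out, const void *a, size_t alen,
                const void *b, size_t blen)
{
    char *p = out;
    p = mem_copy_end(p, a, alen);
    p = mem_copy_end(p, b, blen);
    return (size_t)(p - (char *)out);
}
```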
Meanwhile, as the years rolled on and programmers continued to get bitten by the poor design of the Standard C Library’s string functions, a new concern raised its head - buffer overflows. Again, this is a direct consequence of the C language lacking an explicit string type with explicit lengths. Many solutions have been proposed for this, with BSD’s strlcpy() gaining the most adoption. Unfortunately, it too is an idiotic design. strlcpy() takes three arguments - the destination and source, as with strcpy(), and also the size of the destination buffer. Passing the buffer size allows strlcpy() to stop short of overflowing the buffer, which is an admirable goal, just poorly executed in the API.
The obvious flaw in strlcpy() is again to do with constructing long strings. You can no longer do the above example in a single statement:
    strcat(strcat(strcat(strcpy(buf, "This "),"is "),"a long "),"string.");
because with strlcpy() you would have to recompute the remaining buffer size for every call, yielding horrendously redundant and inefficient code:
    char buf[MAXLEN], *ptr = buf;
    int len, rem = sizeof(buf);
    len = strlcpy(ptr, "This ", rem);
    rem -= len; ptr += len;
    len = strlcpy(ptr, "is ", rem);
    rem -= len; ptr += len;
    len = strlcpy(ptr, "a long ", rem);
    rem -= len; ptr += len;
    len = strlcpy(ptr, "string.", rem);
    ptr += len;
    len = ptr - buf;
The correct API design would simply pass a pointer to the end of the buffer. This will be a constant and thus not require recomputing before each invocation:
    char buf[MAXLEN], *end = buf+sizeof(buf);
    int len;
    len = strecopy(strecopy(strecopy(strecopy(buf, "This ", end),"is ",end),"a long ",end),"string.",end) - buf;
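One way strecopy() could be written is sketched below. The text only fixes its signature and return convention, so the truncation behavior here is an assumption: the copy stops one byte short of `end`, the result is always NUL-terminated when there is room for at least one byte, and the returned pointer addresses that NUL. The build_sentence() wrapper is a hypothetical helper for demonstration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strecopy(): copy src to dst without ever writing at or
 * beyond `end` (one past the last byte of the buffer). Returns a pointer
 * to the terminating NUL it wrote, or dst unchanged if there was no room. */
char *strecopy(char *dst, const char *src, const char *end)
{
    assert(dst != NULL && src != NULL && end != NULL);
    if (dst >= end)
        return dst;                 /* no room at all: write nothing */
    while (dst < end - 1 && *src != '\0')
        *dst++ = *src++;
    *dst = '\0';                    /* assumed policy: truncate silently */
    return dst;
}

/* Build the sentence from the text in one expression; `end` is computed
 * once and reused unchanged for every call. Returns the resulting length. */
size_t build_sentence(char *buf, size_t size)
{
    const char *end = buf + size;
    return (size_t)(strecopy(strecopy(strecopy(strecopy(buf,
        "This ", end), "is ", end), "a long ", end), "string.", end) - buf);
}
```

With a 64-byte buffer this yields the full 22-character string. With an 8-byte buffer it stops at "This is" and the remaining calls copy nothing. Note that this sketch gives the caller no direct signal that truncation happened; a real design would have to decide how to report that.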
It’s a shame that after so much time has passed and so much energy has been expended on these topics, the community still hasn’t adopted an intelligent solution that both preserves the original use cases and solves the extant problems. In OpenLDAP we exterminated most uses of plain C strings years ago; it was a lengthy process that began in earnest in October 2001 and was not really completed until February 2003. When such poor APIs are so deeply ingrained into the official specification of the language it’s difficult to make improvements and get new programmers to use them. Even the programmers who might be aware of the question typically consider it too minor a detail to sweat over, but as always, the Devil’s in the details.
One of the fundamental principles of good code is “don’t compute the same thing twice.” The Standard C Library violates this principle in its functions that return a value to the caller that the caller already possesses. The BSD strl* functions violate this principle in forcing the programmer to recompute the position of the end of a destination buffer, even though that endpoint doesn’t change. The inefficiencies that result from violating this principle are insidious and far-reaching, turning otherwise straightforward-looking code into performance disasters.