What's the rationale for null terminated strings?

Billy ONeal Source

As much as I love C and C++, I can't help but scratch my head at the choice of null terminated strings:

  • Length prefixed (i.e. Pascal) strings existed before C
  • Length prefixed strings make several algorithms faster by allowing constant time length lookup.
  • Length prefixed strings make it more difficult to cause buffer overrun errors.
  • Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here.
  • Pretty much every other language (i.e. Perl, Pascal, Python, Java, C#, etc) use length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings.
  • C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation.
  • Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.

Several of these things have come to light more recently than C, so it would make sense for C to not have known of them. However, several were plain well before C came to be. Why would null terminated strings have been chosen instead of the obviously superior length prefixing?

EDIT: Since some asked for facts (and didn't like the ones I already provided) on my efficiency point above, they stem from a few things:

  • Concat using null terminated strings requires O(n + m) time complexity. Length prefixing often require only O(m).
  • Length using null terminated strings requires O(n) time complexity. Length prefixing is O(1).
  • Length and concat are by far the most common string operations. There are several cases where null terminated strings can be more efficient, but these occur much less often.

From answers below, these are some cases where null terminated strings are more efficient:

  • When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules.
  • In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only in the case that you haven't dynamically allocated the string (Because then you'd have to free it, necessitating using that CPU register you saved to hold the pointer you originally got from malloc and friends).

None of the above are nearly as common as length and concat.

There's one more asserted in the answers below:

  • You need to cut off the end of the string

but this one is incorrect -- it's the same amount of time for null terminated and length prefixed strings. (Null terminated strings just stick a null where you want the new end to be, length prefixers just subtract from the prefix.)

c++cstringnull-terminated

Answers

answered 7 years ago Robert S Ciaccio #1

C doesn't have a string as part of the language. A 'string' in C is just a pointer to char. So maybe you're asking the wrong question.

"What's the rationale for leaving out a string type" might be more relevant. To that I would point out that C is not an object oriented language and only has basic value types. A string is a higher level concept that has to be implemented by in some way combining values of other types. C is at a lower level of abstraction.

in light of the raging squall below:

I just want to point out that I'm not trying to say this is a stupid or bad question, or that the C way of representing strings is the best choice. I'm trying to clarify that the question would be more succinctly put if you take into account the fact that C has no mechanism for differentiating a string as a datatype from a byte array. Is this the best choice in light of the processing and memory power of todays computers? Probably not. But hindsight is always 20/20 and all that :)

answered 7 years ago khachik #2

I think, it has historical reasons and found this in wikipedia:

At the time C (and the languages that it was derived from) were developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (though also used by early versions of BASIC), used a leading byte to store the length of the string. This allows the string to contain NUL and made finding the length need only one memory access (O(1) (constant) time). But one byte limits the length to 255. This length limitation was far more restrictive than the problems with the C string, so the C string in general won out.

answered 7 years ago Sanjit Saluja #3

The null termination allows for fast pointer based operations.

answered 7 years ago Hans Passant #4

From the horse's mouth

None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

Dennis M Ritchie, Development of the C Language

answered 7 years ago R.. #5

Obviously for performance and safety, you'll want to keep the length of a string while you're working with it rather than repeatedly performing strlen or the equivalent on it. However, storing the length in a fixed location just before the string contents is an incredibly bad design. As J├Ârgen pointed out in the comments on Sanjit's answer, it precludes treating the tail of a string as a string, which for example makes a lot of common operations like path_to_filename or filename_to_extension impossible without allocating new memory (and incurring the possibility of failure and error handling). And then of course there's the issue that nobody can agree how many bytes the string length field should occupy (plenty of bad "Pascal string" languages used 16-bit fields or even 24-bit fields which preclude processing of long strings).

C's design of letting the programmer choose if/where/how to store the length is much more flexible and powerful. But of course the programmer has to be smart. C punishes stupidity with programs that crash, grind to a halt, or give your enemies root.

answered 7 years ago Jonathan Wood #6

In many ways, C was primitive. And I loved it.

It was a step above assembly language, giving you nearly the same performance with a language that was much easier to write and maintain.

The null terminator is simple and requires no special support by the language.

Looking back, it doesn't seem that convenient. But I used assembly language back in the 80s and it seemed very convenient at the time. I just think software is continually evolving, and the platforms and tools continually get more and more sophisticated.

answered 7 years ago kriss #7

The question is asked as a Length Prefixed Strings (LPS) vs zero terminated strings (SZ) thing, but mostly expose benefits of length prefixed strings. That may seem overwhelming, but to be honest we should also consider drawbacks of LPS and advantages of SZ.

As I understand it, the question may even be understood as a biased way to ask "what are the advantages of Zero Terminated Strings ?".

Advantages (I see) of Zero Terminated Strings:

  • very simple, no need to introduce new concepts in language, char arrays/char pointers can do.
  • the core language just include minimal syntaxic sugar to convert something between double quotes to a bunch of chars (really a bunch of bytes). In some cases it can be used to initialize things completely unrelated with text. For instance xpm image file format is a valid C source that contains image data encoded as a string.
  • by the way, you can put a zero in a string literal, the compiler will just also add another one at the end of the literal: "this\0is\0valid\0C". Is it a string ? or four strings ? Or a bunch of bytes...
  • flat implementation, no hidden indirection, no hidden integer.
  • no hidden memory allocation involved (well, some infamous non standard functions like strdup perform allocation, but that's mostly a source of problem).
  • no specific issue for small or large hardware (imagine the burden to manage 32 bits prefix length on 8 bits microcontrollers, or the restrictions of limiting string size to less than 256 bytes, that was a problem I actually had with Turbo Pascal eons ago).
  • implementation of string manipulation is just a handful of very simple library function
  • efficient for the main use of strings : constant text read sequentially from a known start (mostly messages to the user).
  • the terminating zero is not even mandatory, all necessary tools to manipulate chars like a bunch of bytes are available. When performing array initialisation in C, you can even avoid the NUL terminator. Just set the right size. char a[3] = "foo"; is valid C (not C++) and won't put a final zero in a.
  • coherent with the unix point of view "everything is file", including "files" that have no intrinsic length like stdin, stdout. You should remember that open read and write primitives are implemented at a very low level. They are not library calls, but system calls. And the same API is used for binary or text files. File reading primitives get a buffer address and a size and return the new size. And you can use strings as the buffer to write. Using another kind of string representation would imply you can't easily use a literal string as the buffer to output, or you would have to make it have a very strange behavior when casting it to char*. Namely not to return the address of the string, but instead to return the actual data.
  • very easy to manipulate text data read from a file in-place, without useless copy of buffer, just insert zeroes at the right places (well, not really with modern C as double quoted strings are const char arrays nowaday usually kept in non modifiable data segment).
  • prepending some int values of whatever size would implies alignment issues. The initial length should be aligned, but there is no reason to do that for the characters datas (and again, forcing alignment of strings would imply problems when treating them as a bunch of bytes).
  • length is known at compile time for constant literal strings (sizeof). So why would anyone want to store it in memory prepending it to actual data ?
  • in a way C is doing as (nearly) everyone else, strings are viewed as arrays of char. As array length is not managed by C, it is logical length is not managed either for strings. The only surprising thing is that 0 item added at the end, but that's just at core language level when typing a string between double quotes. Users can perfectly call string manipulation functions passing length, or even use plain memcopy instead. SZ are just a facility. In most other languages array length is managed, it's logical that is the same for strings.
  • in modern times anyway 1 byte character sets are not enough and you often have to deal with encoded unicode strings where the number of characters is very different of the number of bytes. It implies that users will probably want more than "just the size", but also other informations. Keeping length give use nothing (particularly no natural place to store them) regarding these other useful pieces of information.

That said, no need to complain in the rare case where standard C strings are indeed inefficient. Libs are available. If I followed that trend, I should complain that standard C does not include any regex support functions... but really everybody knows it's not a real problem as there is libraries available for that purpose. So when string manipulation efficiency is wanted, why not use a library like bstring ? Or even C++ strings ?

EDIT: I recently had a look to D strings. It is interesting enough to see that the solution choosed is neither a size prefix, nor zero termination. As in C, literal strings enclosed in double quotes are just short hand for immutable char arrays, and the language also has a string keyword meaning that (immutable char array).

But D arrays are much richer than C arrays. In the case of static arrays length is known at run-time so there is no need to store the length. Compiler has it at compile time. In the case of dynamic arrays, length is available but D documentation does not state where it is kept. For all we know, compiler could choose to keep it in some register, or in some variable stored far away from the characters data.

On normal char arrays or non literal strings there is no final zero, hence programmer has to put it itself if he wants to call some C function from D. In the particular case of literal strings, however the D compiler still put a zero at the end of each strings (to allow easy cast to C strings to make easier calling C function ?), but this zero is not part of the string (D does not count it in string size).

The only thing that disappointed me somewhat is that strings are supposed to be utf-8, but length apparently still returns a number of bytes (at least it's true on my compiler gdc) even when using multi-byte chars. It is unclear to me if it's a compiler bug or by purpose. (OK, I probably have found out what happened. To say to D compiler your source use utf-8 you have to put some stupid byte order mark at beginning. I write stupid because I know of not editor doing that, especially for UTF-8 that is supposed to be ASCII compatible).

answered 7 years ago Cristian #8

Assuming for a moment that C implemented strings the Pascal way, by prefixing them by length: is a 7 char long string the same DATA TYPE as a 3-char string? If the answer is yes, then what kind of code should the compiler generate when I assign the former to the latter? Should the string be truncated, or automatically resized? If resized, should that operation be protected by a lock as to make it thread safe? The C approach side stepped all these issues, like it or not :)

answered 7 years ago dvhh #9

Lazyness, register frugality and portability considering the assembly gut of any language, especially C which is one step above assembly (thus inheriting a lot of assembly legacy code). You would agree as a null char would be useless in those ASCII days, it (and probably as good as an EOF control char ).

let's see in pseudo code

function readString(string) // 1 parameter: 1 register or 1 stact entries
    pointer=addressOf(string) 
    while(string[pointer]!=CONTROL_CHAR) do
        read(string[pointer])
        increment pointer

total 1 register use

case 2

 function readString(length,string) // 2 parameters: 2 register used or 2 stack entries
     pointer=addressOf(string) 
     while(length>0) do 
         read(string[pointer])
         increment pointer
         decrement length

total 2 register used

That might seem shortsighted at that time, but considering the frugality in code and register ( which were PREMIUM at that time, the time when you know, they use punch card ). Thus being faster ( when processor speed could be counted in kHz), this "Hack" was pretty darn good and portable to register-less processor with ease.

For argument sake I will implement 2 common string operation

stringLength(string)
     pointer=addressOf(string)
     while(string[pointer]!=CONTROL_CHAR) do
         increment pointer
     return pointer-addressOf(string)

complexity O(n) where in most case PASCAL string is O(1) because the length of the string is pre-pended to the string structure (that would also mean that this operation would have to be carried in an earlier stage).

concatString(string1,string2)
     length1=stringLength(string1)
     length2=stringLength(string2)
     string3=allocate(string1+string2)
     pointer1=addressOf(string1)
     pointer3=addressOf(string3)
     while(string1[pointer1]!=CONTROL_CHAR) do
         string3[pointer3]=string1[pointer1]
         increment pointer3
         increment pointer1
     pointer2=addressOf(string2)
     while(string2[pointer2]!=CONTROL_CHAR) do
         string3[pointer3]=string2[pointer2]
         increment pointer3
         increment pointer1
     return string3

complexity O(n) and prepending the string length wouldn't change the complexity of the operation, while I admit it would take 3 time less time.

On another hand, if you use PASCAL string you would have to redesign your API for taking in account register length and bit-endianness, PASCAL string got the well known limitation of 255 char (0xFF) beacause the length was stored in 1 byte (8bits), and it you wanted a longer string (16bits->anything) you would have to take in account the architecture in one layer of your code, that would mean in most case incompatible string APIs if you wanted longer string.

Example:

One file was written with your prepended string api on an 8 bit computer and then would have to be read on say a 32 bit computer, what would the lazy program do considers that your 4bytes are the length of the string then allocate that lot of memory then attempt to read that many bytes. Another case would be PPC 32 byte string read(little endian) onto a x86 (big endian), of course if you don't know that one is written by the other there would be trouble. 1 byte length (0x00000001) would become 16777216 (0x0100000) that is 16 MB for reading a 1 byte string. Of course you would say that people should agree on one standard but even 16bit unicode got little and big endianness.

Of course C would have its issues too but, would be very little affected by the issues raised here.

answered 7 years ago Pyry Jahkola #10

Somehow I understood the question to imply there's no compiler support for length-prefixed strings in C. The following example shows, at least you can start your own C string library, where string lengths are counted at compile time, with a construct like this:

#define PREFIX_STR(s) ((prefix_str_t){ sizeof(s)-1, (s) })

typedef struct { int n; char * p; } prefix_str_t;

int main() {
    prefix_str_t string1, string2;

    string1 = PREFIX_STR("Hello!");
    string2 = PREFIX_STR("Allows \0 chars (even if printf directly doesn't)");

    printf("%d %s\n", string1.n, string1.p); /* prints: "6 Hello!" */
    printf("%d %s\n", string2.n, string2.p); /* prints: "48 Allows " */

    return 0;
}

This won't, however, come with no issues as you need to be careful when to specifically free that string pointer and when it is statically allocated (literal char array).

Edit: As a more direct answer to the question, my view is this was the way C could support both having string length available (as a compile time constant), should you need it, but still with no memory overhead if you want to use only pointers and zero termination.

Of course it seems like working with zero-terminated strings was the recommended practice, since the standard library in general doesn't take string lengths as arguments, and since extracting the length isn't as straightforward code as char * s = "abc", as my example shows.

answered 7 years ago Daniel C. Sobral #11

Calavera is right, but as people don't seem to get his point, I'll provide some code examples.

First, let's consider what C is: a simple language, where all code has a pretty direct translation into machine language. All types fit into registers and on the stack, and it doesn't require an operating system or a big run-time library to run, since it were meant to write these things (a task to which is superbly well-suited, considering there isn't even a likely competitor to this day).

If C had a string type, like int or char, it would be a type which didn't fit in a register or in the stack, and would require memory allocation (with all its supporting infrastructure) to be handled in any way. All of which go against the basic tenets of C.

So, a string in C is:

char s*;

So, let's assume then that this were length-prefixed. Let's write the code to concatenate two strings:

char* concat(char* s1, char* s2)
{
    /* What? What is the type of the length of the string? */
    int l1 = *(int*) s1;
    /* How much? How much must I skip? */
    char *s1s = s1 + sizeof(int);
    int l2 = *(int*) s2;
    char *s2s = s2 + sizeof(int);
    int l3 = l1 + l2;
    char *s3 = (char*) malloc(l3 + sizeof(int));
    char *s3s = s3 + sizeof(int);
    memcpy(s3s, s1s, l1);
    memcpy(s3s + l1, s2s, l2);
    *(int*) s3 = l3;
    return s3;
}

Another alternative would be using a struct to define a string:

struct {
  int len; /* cannot be left implementation-defined */
  char* buf;
}

At this point, all string manipulation would require two allocations to be made, which, in practice, means you'd go through a library to do any handling of it.

The funny thing is... structs like that do exist in C! They are just not used for your day-to-day displaying messages to the user handling.

So, here is the point Calavera is making: there is no string type in C. To do anything with it, you'd have to take a pointer and decode it as a pointer to two different types, and then it becomes very relevant what is the size of a string, and cannot just be left as "implementation defined".

Now, C can handle memory in anyway, and the mem functions in the library (in <string.h>, even!) provide all the tooling you need to handle memory as a pair of pointer and size. The so-called "strings" in C were created for just one purpose: showing messages in the context of writting an operating system intended for text terminals. And, for that, null termination is enough.

answered 6 years ago supercat #12

One point not yet mentioned: when C was designed, there were many machines where a 'char' was not eight bits (even today there are DSP platforms where it isn't). If one decides that strings are to be length-prefixed, how many 'char's worth of length prefix should one use? Using two would impose an artificial limit on string length for machines with 8-bit char and 32-bit addressing space, while wasting space on machines with 16-bit char and 16-bit addressing space.

If one wanted to allow arbitrary-length strings to be stored efficiently, and if 'char' were always 8-bits, one could--for some expense in speed and code size--define a scheme were a string prefixed by an even number N would be N/2 bytes long, a string prefixed by an odd value N and an even value M (reading backward) could be ((N-1) + M*char_max)/2, etc. and require that any buffer which claims to offer a certain amount of space to hold a string must allow enough bytes preceding that space to handle the maximum length. The fact that 'char' isn't always 8 bits, however, would complicate such a scheme, since the number of 'char' required to hold a string's length would vary depending upon the CPU architecture.

answered 6 years ago Brangdon #13

"Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string."

First, extra 3 bytes may be considerable overhead for short strings. In particular, a zero-length string now takes 4 times as much memory. Some of us are using 64-bit machines, so we either need 8 bytes to store a zero-length string, or the string format can't cope with the longest strings the platform supports.

There may also be alignment issues to deal with. Suppose I have a block of memory containing 7 strings, like "solo\0second\0\0four\0five\0\0seventh". The second string starts at offset 5. The hardware may require that 32-bit integers be aligned at an address that is a multiple of 4, so you have to add padding, increasing the overhead even further. The C representation is very memory-efficient in comparison. (Memory-efficiency is good; it helps cache performance, for example.)

answered 3 years ago supercat #14

Many design decisions surrounding C stem from the fact that when it was originally implemented, parameter passing was somewhat expensive. Given a choice between e.g.

void add_element_to_next(arr, offset)
  char[] arr;
  int offset;
{
  arr[offset] += arr[offset+1];
}

char array[40];

void test()
{
  for (i=0; i<39; i++)
    add_element_to_next(array, i);
}

versus

void add_element_to_next(ptr)
  char *p;
{
  p[0]+=p[1];
}

char array[40];

void test()
{
  int i;
  for (i=0; i<39; i++)
    add_element_to_next(arr+i);
}

the latter would have been slightly cheaper (and thus preferred) since it only required passing one parameter rather than two. If the method being called didn't need to know the base address of the array nor the index within it, passing a single pointer combining the two would be cheaper than passing the values separately.

While there are many reasonable ways in which C could have encoded string lengths, the approaches that had been invented up to that time would have all required functions that should be able to work with part of a string to accept the base address of the string and the desired index as two separate parameters. Using zero-byte termination made it possible to avoid that requirement. Although other approaches would be better with today's machines (modern compilers often pass parameters in registers, and memcpy can be optimized in ways strcpy()-equivalents cannot) enough production code uses zero-byte terminated strings that it's hard to change to anything else.

PS--In exchange for a slight speed penalty on some operations, and a tiny bit of extra overhead on longer strings, it would have been possible to have methods that work with strings accept pointers directly to strings, bounds-checked string buffers, or data structures identifying substrings of another string. A function like "strcat" would have looked something like [modern syntax]

void strcat(unsigned char *dest, unsigned char *src)
{
  struct STRING_INFO d,s;
  str_size_t copy_length;

  get_string_info(&d, dest);
  get_string_info(&s, src);
  if (d.si_buff_size > d.si_length) // Destination is resizable buffer
  {
    copy_length = d.si_buff_size - d.si_length;
    if (s.src_length < copy_length)
      copy_length = s.src_length;
    memcpy(d.buff + d.si_length, s.buff, copy_length);
    d.si_length += copy_length;
    update_string_length(&d);
  }
}

A little bigger than the K&R strcat method, but it would support bounds-checking, which the K&R method doesn't. Further, unlike the current method, it would be possible to easily concatenate an arbitrary substring, e.g.

/* Concatenate 10th through 24th characters from src to dest */

void catpart(unsigned char *dest, unsigned char *src)
{
  struct SUBSTRING_INFO *inf;
  src = temp_substring(&inf, src, 10, 24);
  strcat(dest, src);
}

Note that the lifetime of the string returned by temp_substring would be limited by those of s and src, which ever was shorter (which is why the method requires inf to be passed in--if it was local, it would die when the method returned).

In terms of memory cost, strings and buffers up to 64 bytes would have one byte of overhead (same as zero-terminated strings); longer strings would have slightly more (whether one allowed amounts of overhead between two bytes and the maximum required would be a time/space tradeoff). A special value of the length/mode byte would be used to indicate that a string function was given a structure containing a flag byte, a pointer, and a buffer length (which could then index arbitrarily into any other string).

Of course, K&R didn't implement any such thing, but that's most likely because they didn't want to spend much effort on string handling--an area where even today many languages seem rather anemic.

answered 2 years ago BenK #15

According to Joel Spolsky in this blog post,

It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

After seeing all the other answers here, I'm convinced that even if this is true, it's only part of the reason for C having null-terminated "strings". That post is quite illuminating as to how simple things like strings can actually be quite hard.

answered 8 months ago kkaaii #16

gcc accept the codes below:

char s[4] = "abcd";

and it's ok if we treat is as an array of chars but not string. That is, we can access it with s[0], s[1], s[2], and s[3], or even with memcpy(dest, s, 4). But we'll get messy characters when we trying with puts(s), or worse with strcpy(dest, s).

comments powered by Disqus