The zend_string API
Strings in C are usually represented as null-terminated char * pointers. As PHP supports strings that contain
null bytes, PHP needs to explicitly store the length of the string. Additionally, PHP needs strings to fit into its
general framework of reference-counted structures. This is the purpose of the zend_string type.
Structure
A zend_string has the following structure:
struct _zend_string {
    zend_refcounted_h gc;
    zend_ulong        h;
    size_t            len;
    char              val[1];
};
Like many other structures in PHP, it embeds a zend_refcounted_h header, which stores the reference count, as well
as some flags.
The actual character content of the string is stored using the so-called “struct hack”: the string content is
appended to the end of the structure. While it is declared as char[1], the actual size is determined dynamically.
This means that the zend_string header and the string contents are combined into a single allocation, which is
more efficient than using two separate ones. You will find that PHP uses the struct hack in quite a number of places
where a fixed-size header is combined with a dynamic amount of data.
The length of the string is stored explicitly in the len member. This is necessary to support strings that
contain null bytes, and is also good for performance, because the string length does not need to be constantly
recalculated. It should be noted that while len stores the length without a trailing null byte, the actual
string contents in val must always contain a trailing null byte. The reason is that there are quite a few C APIs
that accept a null-terminated string, and we want to be able to use these APIs without creating a separate
null-terminated copy of the string. To give an example, the PHP string "foo\0bar" would be stored with
len = 7, but val = "foo\0bar\0".
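The layout described above can be illustrated with a small standalone sketch. The demo_string type and demo_string_init() function are hypothetical names made up for this example, and the sketch uses a C99 flexible array member where PHP declares val[1]; it is a model of the idea, not the engine's code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for zend_string: a fixed header plus inline bytes. */
typedef struct demo_string {
    size_t len;   /* length without the trailing null byte */
    char   val[]; /* C99 flexible array member; PHP declares val[1] instead */
} demo_string;

/* Copy len bytes from src into a single header+content allocation. */
static demo_string *demo_string_init(const char *src, size_t len) {
    demo_string *s = malloc(offsetof(demo_string, val) + len + 1);
    if (s == NULL) {
        return NULL;
    }
    s->len = len;
    memcpy(s->val, src, len);
    s->val[len] = '\0'; /* always keep the trailing null byte */
    return s;
}
```

Calling demo_string_init("foo\0bar", 7) yields a string whose len is 7, while strlen() on its val stops at the embedded null byte and reports 3. This is why the length has to be stored explicitly.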
Finally, the string stores a cache of the hash value h, which is used when using strings as hashtable keys.
It starts with value 0 to indicate that the hash has not been computed yet, while the real hash is computed
on first use.
String accessors
Just like with zvals, you don’t manipulate zend_string fields by hand but use a number of access macros instead:
zend_string *str = zend_string_init("foo", strlen("foo"), 0);
php_printf("This is my string: %s\n", ZSTR_VAL(str));
php_printf("It is %zu chars long\n", ZSTR_LEN(str)); // %zu is the printf format for size_t
zend_string_release(str);
The two most important ones are ZSTR_VAL(), which returns the string contents as char *, and ZSTR_LEN(),
which returns the string length as size_t.
The naming of these macros is slightly unfortunate in that both ZSTR_VAL/ZSTR_LEN and Z_STRVAL/Z_STRLEN exist,
and the two pairs differ only by the position of the underscore. Remember that ZSTR_* macros work on zend_string,
while Z_* macros work on zval:
zval val;
ZVAL_STRING(&val, "foo");
// Z_STRLEN, Z_STRVAL work on zval.
php_printf("string(%zu) \"%s\"\n", Z_STRLEN(val), Z_STRVAL(val));
// ZSTR_LEN, ZSTR_VAL work on zend_string.
zend_string *str = Z_STR(val);
php_printf("string(%zu) \"%s\"\n", ZSTR_LEN(str), ZSTR_VAL(str));
zval_ptr_dtor(&val);
The hash value cache of the string can be accessed using ZSTR_H(). However, this accesses the raw cache, which
will be zero if the hash has not been computed yet. Instead, ZSTR_HASH() or zend_string_hash_val() should be
used to either get the pre-cached hash, or compute it. In the very rare case where a string is modified after initial
construction, it is possible to discard the cached value using zend_string_forget_hash_val().
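The lazy hash cache can be modelled in a few lines of standalone C. Everything named demo_* below is hypothetical, and the hash function is a simplified times-33 hash chosen for illustration; the sketch forces the top bit on so that a computed hash can never be confused with the "not computed yet" value 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: only the fields needed to show the hash cache. */
typedef struct demo_hstring {
    uint64_t    h;   /* 0 means "not computed yet" */
    size_t      len;
    const char *val;
} demo_hstring;

/* Simplified times-33 hash; the top bit is forced on so the result is never 0. */
static uint64_t demo_hash_bytes(const char *p, size_t len) {
    uint64_t hash = 5381;
    for (size_t i = 0; i < len; i++) {
        hash = hash * 33 + (unsigned char)p[i];
    }
    return hash | UINT64_C(0x8000000000000000);
}

/* Plays the role of ZSTR_HASH(): return the cached value or compute it once. */
static uint64_t demo_hash_val(demo_hstring *s) {
    if (s->h == 0) {
        s->h = demo_hash_bytes(s->val, s->len);
    }
    return s->h;
}

/* Plays the role of zend_string_forget_hash_val(): drop the cache after a write. */
static void demo_forget_hash_val(demo_hstring *s) {
    s->h = 0;
}
```

Reading the h field directly corresponds to ZSTR_H() and may return 0, whereas demo_hash_val() always returns a usable hash. After the contents are modified, the cache must be dropped, otherwise the stale hash would be used for hashtable lookups.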
Memory management
While we already know how to initialize string zvals, the only direct string creation API that has been introduced
so far is zend_string_init(), which is used to create a zend_string from an existing string and length.
The most fundamental string creation function on which all others are based is zend_string_alloc():
size_t len = 40;
zend_string *str = zend_string_alloc(len, /* persistent */ 0);
for (size_t i = 0; i < len; i++) {
    ZSTR_VAL(str)[i] = 'a';
}
// Don't forget to null-terminate!
ZSTR_VAL(str)[len] = '\0';
This function allocates a string of a certain length (as always, the length does not include the trailing null byte), and leaves its initialization to you. Like all string allocation functions, it accepts a parameter that determines whether to use the per-request allocator, or the persistent one.
The zend_string_safe_alloc(n, m, l, persistent) function allocates a string of length n * m + l. This
function is commonly useful for encoding changes. For example, this is how we could hex encode a string:
zend_string *convert_to_hex(zend_string *orig_str) {
    zend_string *hex_str = zend_string_safe_alloc(2, ZSTR_LEN(orig_str), 0, /* persistent */ 0);
    char *p = ZSTR_VAL(hex_str);
    for (size_t i = 0; i < ZSTR_LEN(orig_str); i++) {
        const char *to_hex = "0123456789abcdef";
        unsigned char c = ZSTR_VAL(orig_str)[i];
        *p++ = to_hex[c >> 4];
        *p++ = to_hex[c & 0xf];
    }
    *p = '\0';
    return hex_str;
}
Why can’t we simply use zend_string_alloc(2 * ZSTR_LEN(orig_str), 0) instead? The reason is that the
zend_string_safe_alloc() function will make sure that the n * m + l calculation does not overflow. For
example, if you are on a 32-bit system, and the string is exactly 2GB large, then multiplying the length by two will
overflow and result in a zero length. The loop that follows the allocation would then write far beyond the bounds of
the allocation and corrupt unrelated memory. The zend_string_safe_alloc() API detects this situation and throws a
fatal error in this case.
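To make the overflow check concrete, here is a minimal standalone sketch of an n * m + l computation that reports overflow instead of wrapping around. The demo_safe_size() helper is a hypothetical name for this example; the real API additionally has to account for the size of the string header and raises a fatal error rather than returning a flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: computes n * m + l into *out, or reports overflow. */
static bool demo_safe_size(size_t n, size_t m, size_t l, size_t *out) {
    if (m != 0 && n > SIZE_MAX / m) {
        return false; /* n * m would wrap around */
    }
    size_t product = n * m;
    if (product > SIZE_MAX - l) {
        return false; /* adding l would wrap around */
    }
    *out = product + l;
    return true;
}
```

With this helper, demo_safe_size(2, SIZE_MAX / 2 + 1, 0, &out) returns false, whereas the unchecked expression 2 * (SIZE_MAX / 2 + 1) would silently evaluate to zero.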
It is also possible to change the size of a string using zend_string_realloc() and its variations:
zend_string *zend_string_realloc(zend_string *s, size_t len, bool persistent);
// Requires new length larger than old length.
zend_string *zend_string_extend(zend_string *s, size_t len, bool persistent);
// Requires new length smaller than old length.
zend_string *zend_string_truncate(zend_string *s, size_t len, bool persistent);
// n * m + l safe variant of zend_string_realloc.
zend_string *zend_string_safe_realloc(zend_string *s, size_t n, size_t m, size_t l, bool persistent);
As strings are refcounted structures, the realloc functions also take the refcount into account. While this is not how these functions are implemented, their semantics are equivalent to doing something like this:
zend_string *new_str = zend_string_init(ZSTR_VAL(s), ZSTR_LEN(s), persistent);
zend_string_release(s);
return new_str;
That is, these functions release the string passed to them, but it is safe to use them with shared (or immutable) strings. If the string is shared, the refcount is decremented, but the string is not destroyed.
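The following standalone sketch models how a resize can respect the refcount. The rc_string type and its functions are hypothetical and ignore persistence and immutable strings; the point is only the split between the uniquely owned case and the shared case.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted string; not the real zend_string layout. */
typedef struct rc_string {
    unsigned refcount;
    size_t   len;
    char     val[];
} rc_string;

static rc_string *rc_string_alloc(size_t len) {
    rc_string *s = malloc(offsetof(rc_string, val) + len + 1);
    if (s == NULL) {
        abort(); /* keep the sketch short: no error recovery */
    }
    s->refcount = 1;
    s->len = len;
    s->val[len] = '\0';
    return s;
}

/* Resize s to len bytes, consuming the caller's reference to s. */
static rc_string *rc_string_realloc(rc_string *s, size_t len) {
    if (s->refcount == 1) {
        /* Uniquely owned: the block can be resized in place. */
        rc_string *r = realloc(s, offsetof(rc_string, val) + len + 1);
        if (r == NULL) {
            abort();
        }
        r->len = len;
        r->val[len] = '\0';
        return r;
    }
    /* Shared: copy what fits and drop only our own reference. */
    rc_string *r = rc_string_alloc(len);
    memcpy(r->val, s->val, len < s->len ? len : s->len);
    s->refcount--;
    return r;
}
```

In both branches the caller must continue with the returned pointer only. When the string grows, the bytes between the old and the new length are left uninitialized in this sketch and have to be filled in by the caller.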
This also brings us to the next topic: refcount management. Rather than using raw GC_* macros, the
zend_string API contains two helpers to increase the refcount:
zend_string_addref(str);
return str;
// More compact:
return zend_string_copy(str);
Unlike GC_ADDREF(), the zend_string_addref() function will handle immutable strings properly. However, the
function that is used most often by far is zend_string_copy(). This function not only increments the refcount,
but also returns the original string. This makes code more readable in practice.
While a zend_string_dup() function that performs an actual copy of the string (rather than only a refcount
increment) also exists, the behavior is often considered confusing, because it only copies non-immutable strings.
If you want to force a copy of a string, you are better off creating a new one using zend_string_init().
If the duplication is for the purpose of modifying an already existing string, zend_string_separate() can be
used instead:
zend_string *modify_char(zend_string *orig_str) {
    zend_string *str = zend_string_separate(orig_str, /* persistent */ 0);
    ZEND_ASSERT(ZSTR_LEN(str) > 0);
    ZSTR_VAL(str)[0] = 'A';
    return str;
}
Just like the general zval separation concept, this will return the original string (with discarded hash cache) if it has a refcount of one, and is thus uniquely owned, and will create a copy otherwise.
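The separation rule can be sketched in standalone C as follows. The sep_string type and its functions are hypothetical names for this illustration and leave out immutable strings and the persistent allocator.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted string with a cached hash. */
typedef struct sep_string {
    unsigned      refcount;
    unsigned long h;     /* cached hash, 0 when unknown */
    size_t        len;
    char          val[];
} sep_string;

static sep_string *sep_string_init(const char *src, size_t len) {
    sep_string *s = malloc(offsetof(sep_string, val) + len + 1);
    if (s == NULL) {
        abort(); /* keep the sketch short: no error recovery */
    }
    s->refcount = 1;
    s->h = 0;
    s->len = len;
    memcpy(s->val, src, len);
    s->val[len] = '\0';
    return s;
}

/* Return a string the caller may write to, consuming its reference to s. */
static sep_string *sep_string_separate(sep_string *s) {
    if (s->refcount == 1) {
        s->h = 0;   /* contents may change, so the cached hash is stale */
        return s;
    }
    s->refcount--;  /* other holders keep the original untouched */
    return sep_string_init(s->val, s->len);
}
```

A uniquely owned string comes back as the same pointer with its hash cache cleared, so no allocation happens. A shared string is copied, which guarantees that writing to the result cannot be observed by the other holders.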
Finally, strings need to be released when no longer used. You are already familiar with the zend_string_release()
API, which will decrement the refcount, and free the string if it drops to zero. You are well served by using only
this function.
However, you may also encounter a number of optimized variations. The most common is zend_string_release_ex(),
which allows you to specify whether the passed string is persistent or non-persistent:
zend_string_release_ex(str, /* persistent */ 0);
Normally, this would be determined based on the string flags. Specifying it explicitly avoids the runtime check and generates less code. Finally, there are two more functions that only work on strings with refcount one:
// Requires refcount 1 or immutable.
zend_string_free(str);
// Requires refcount 1 and not immutable.
zend_string_efree(str);
You should avoid using these functions, as it is easy to introduce critical bugs when some API changes from returning new strings to reusing existing ones.
Other operations
The zend_string API supports a few additional operations. The most common one is comparing strings:
zend_string *foo = zend_string_init("foo", sizeof("foo")-1, 0);
zend_string *FOO = zend_string_init("FOO", sizeof("FOO")-1, 0);
// Case-sensitive comparison between zend_strings.
bool result = zend_string_equals(foo, FOO); // false
// Case-insensitive comparison between zend_strings.
result = zend_string_equals_ci(foo, FOO); // true
// Case-sensitive comparison with a string literal.
result = zend_string_equals_literal(foo, "FOO"); // false
// Case-insensitive comparison with a string literal.
result = zend_string_equals_literal_ci(foo, "FOO"); // true
zend_string_release(foo);
zend_string_release(FOO);
There are also helpers to concatenate two or three strings. If you need to concatenate more strings, you should use
the smart_str API discussed in the next chapter instead.
zend_string *foo = zend_string_init("foo", sizeof("foo")-1, 0);
zend_string *bar = zend_string_init("bar", sizeof("bar")-1, 0);
// Creates "foobar"
zend_string *foobar = zend_string_concat2(
    ZSTR_VAL(foo), ZSTR_LEN(foo),
    ZSTR_VAL(bar), ZSTR_LEN(bar));
// Creates "foo::bar"
zend_string *foo_bar = zend_string_concat3(
    ZSTR_VAL(foo), ZSTR_LEN(foo),
    "::", sizeof("::")-1,
    ZSTR_VAL(bar), ZSTR_LEN(bar));
zend_string_release(foo);
zend_string_release(bar);
zend_string_release(foobar);
zend_string_release(foo_bar);
As you can see, these APIs accept pairs of char * and lengths, rather than zend_string structures. This
allows parts of the concatenation to be provided using string literals, without having to allocate a zend_string
for them.
Finally, the zend_string_tolower() API can be used to lower-case a string:
zend_string *FOO = zend_string_init("FOO", sizeof("FOO")-1, 0);
zend_string *foo = zend_string_tolower(FOO);
zend_string_release(foo);
zend_string_release(FOO);
The lower-casing uses ASCII rules and is not locale dependent. It is commonly used as a way to make hashtable keys case-insensitive.
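What "ASCII rules" means can be shown with a small standalone sketch. The demo_ascii_tolower() function is a hypothetical helper that works in place on a plain buffer, unlike zend_string_tolower(), which returns a zend_string.

```c
#include <stddef.h>

/* Lower-case len bytes in place using ASCII rules only; bytes outside
 * 'A'..'Z' are left untouched regardless of the current locale. */
static void demo_ascii_tolower(char *s, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] = (char)(s[i] - 'A' + 'a');
        }
    }
}
```

Because only the bytes 'A' to 'Z' are touched, the result does not depend on setlocale(), which is what makes lower-cased strings reliable as hashtable keys.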
Interned strings
Just a quick word here about interned strings. You may need this concept in extension development. Interned strings also interact with the opcache extension.
Interned strings are deduplicated strings. When used with opcache, they also get reused from request to request.
Say you want to create the string “foo”. What you tend to do is simply create a new string “foo”:
zend_string *foo;
foo = zend_string_init("foo", strlen("foo"), 0);
/* ... */
But a question arises: hasn’t that string already been created before you needed it? Your code is executed at some point in PHP’s life, which means that some piece of code running before yours may have needed the exact same string (“foo” in our example).
Interned strings are about asking the engine to probe the interned strings store and reuse the already allocated pointer if it finds your string. If it does not, the engine creates a new string and “interns” it, that is, makes it available to other parts of the PHP source code (other extensions, the engine itself, etc.).
Here is an example:
zend_string *foo;
foo = zend_string_init("foo", strlen("foo"), 0);
foo = zend_new_interned_string(foo);
php_printf("This string is interned : %s", ZSTR_VAL(foo));
zend_string_release(foo);
What we do in the code above is create a new zend_string in the classical way. Then we pass that zend_string to
zend_new_interned_string(). This function looks for the same string (“foo” here) in the engine’s interned string
buffer. If it finds it (meaning someone already created such a string), it releases your string (probably freeing
it) and replaces it with the string from the interned string buffer. If it does not find it, it adds it to the
interned string buffer and so makes it available for future usage or other parts of PHP.
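The lookup-or-store behaviour can be modelled with a toy intern table in standalone C. The demo_intern() function and its fixed-size array are hypothetical; the sketch uses plain null-terminated strings and a linear scan for brevity, which means it does not handle embedded null bytes and is far slower than a hash table lookup.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_INTERN_MAX 64

/* Hypothetical fixed-size store of canonical strings. */
static char  *demo_interned[DEMO_INTERN_MAX];
static size_t demo_interned_count;

/* Takes ownership of the malloc()ed string s and returns the canonical
 * pointer for its contents. */
static char *demo_intern(char *s) {
    for (size_t i = 0; i < demo_interned_count; i++) {
        if (strcmp(demo_interned[i], s) == 0) {
            free(s);                 /* duplicate: drop the caller's copy */
            return demo_interned[i]; /* hand back the stored one */
        }
    }
    if (demo_interned_count == DEMO_INTERN_MAX) {
        return s;                    /* store full: leave s un-interned */
    }
    demo_interned[demo_interned_count++] = s;
    return s;
}
```

Two separately allocated copies of "foo" passed through demo_intern() end up as the same pointer, so later equality checks can compare pointers first. As with zend_new_interned_string(), the caller has to keep using the returned pointer, since the one passed in may have been freed.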
You must take care with memory allocation. Interned strings always have a refcount set to one, because they don’t need to be refcounted: they are shared through the interned strings buffer, and thus they can’t be destroyed outside of it.
Example:
zend_string *foo, *foo2;
foo = zend_string_init("foo", strlen("foo"), 0);
foo2 = zend_string_copy(foo); /* increments refcount of foo */
/* foo points to the interned string buffer, and refcount
* in original zend_string falls back to 1 */
foo = zend_new_interned_string(foo);
/* This doesn't do anything, as foo is interned */
zend_string_release(foo);
/* The original buffer referenced by foo2 is released */
zend_string_release(foo2);
/* At the end of the process, PHP will purge its interned
string buffer, and thus free() our "foo" string itself */
It’s all about garbage collection.
When a string is interned, its GC flags are changed to add the IS_STR_INTERNED flag, whatever memory allocation
class it uses (permanent or request based).
This flag is probed when you want to copy or release a string. If the string is interned, the engine does not increment
its refcount when you copy the string, and it neither decrements it nor frees it when you release the string. It
silently does nothing. At the end of the process lifetime, the engine destroys its interned strings buffer and frees
your interned strings.
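The effect of the flag on copy and release can be sketched in standalone C. The flag_string type, the DEMO_FLAG_INTERNED constant and both functions are hypothetical stand-ins for this illustration; the real flag lives in the GC info of the zend_refcounted_h header.

```c
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_FLAG_INTERNED 1u  /* stand-in for IS_STR_INTERNED */

/* Hypothetical header: just the refcount and the flags. */
typedef struct flag_string {
    unsigned refcount;
    unsigned flags;
} flag_string;

/* Copy: bump the refcount unless the string is interned. */
static flag_string *flag_string_copy(flag_string *s) {
    if (!(s->flags & DEMO_FLAG_INTERNED)) {
        s->refcount++;
    }
    return s;
}

/* Release: returns true when the string was actually freed. */
static bool flag_string_release(flag_string *s) {
    if (s->flags & DEMO_FLAG_INTERNED) {
        return false;          /* owned by the interned store */
    }
    if (--s->refcount == 0) {
        free(s);
        return true;
    }
    return false;
}
```

An interned string keeps its refcount of one no matter how often it is copied or released, so callers can treat interned and regular strings uniformly and still never free memory that belongs to the interned store.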
This process is in fact a little more complex than this. If you make use of an interned string outside of request processing, that string will be interned for sure. However, if you make use of an interned string while PHP is handling a request, then this string will only be interned for the current request, and will be cleared after that. All of this holds if you don’t use the opcache extension, which you should not do: use it.
When using the opcache extension, if you make use of an interned string outside of request processing, that string will be interned for sure and will also be shared with every PHP process or thread spawned by your parallelism layer. Also, if you make use of an interned string while PHP is handling a request, this string will also be interned by opcache itself, and shared with every PHP process or thread spawned by your parallelism layer.
The interned string mechanism thus changes when the opcache extension kicks in. Opcache not only allows interning
strings that come from a request, but also allows sharing them with every PHP process of the same pool. This is done
using shared memory. When saving an interned string, opcache will also add the IS_STR_PERMANENT flag to its GC info.
That flag means the memory allocation used for the structure (zend_string here) is permanent; it could be a shared
read-only memory segment.
Interned strings save memory, because the same string is never stored more than once in memory. But they can cost some CPU time, as the interned strings store often needs to be looked up, even though that lookup is well optimized. As an extension designer, here are some general rules:
If opcache is used (it should be), and if you need to create read-only strings: use an interned string.
If you need a string you know for sure PHP will have interned (a well-known PHP string, for example “php” or “str_replace”), use an interned string.
If the string is not read-only and could or should be altered after it has been created, do not use an interned string.
If the string is unlikely to be reused in the future, do not use an interned string.
Warning
Never ever try to modify (write to) an interned string: you’ll likely crash.
Interned strings are detailed in Zend/zend_string.c.