String optimization in BaCon

Aka beating strlen.

Preamble

As BaCon is a plain Basic-to-C converter, the resulting code depends on the C implementation of strings, which, in fact, comes down to the known concept of character arrays. A string is nothing more than a series of values in memory which ends with a '0' value.

Plain C code is notorious for its bad performance on strings, especially when it needs to calculate the length of a string. Usually, the bytes in memory are verified until the terminating '0' value is encountered. For large strings this will take a considerable amount of time, especially in repetitive operations.

In the ongoing attempt to improve the performance of string operations, I have investigated several approaches for the BaCon project. The below report is a short summary of techniques which I have tried.


Different binary layout

A common implementation stores the actual string length in the beginning of the string. This means that, for example, the first 4 bytes contain the length of the string, and the next 4 bytes contain the size of the buffer holding the string. (This would limit the size of the string to 4294967295 bytes (4Gb), which, for most purposes, should be enough.) After that, the actual characters of the string itself are stored.


b1

b2

b3

b4

l1

l2

l3

l4

char1

char2

char<n>

String length

Buffer length

Actual bytes of the string


Programming languages like Pascal and Basic usually implement their strings this way. They change the binary layout of the actual string, where, as mentioned, the first 8 bytes contain meta information about length and buffer size.

For BaCon this seems a promising approach, however, as BaCon is a Basic-to-C converter, it makes use of the string functions provided by libc. I am referring to functions like printf, strcat, strcpy and friends. If strings are being fed to standard C functions like these, then this would cause unpredictable results. The libc string functions always assume character arrays in a correct format, otherwise they will not work properly - encountering binary values would cause printf to print garbage on the screen, for example.

Now, it is possible to add a default offset of 8 bytes to the beginning of the memory address to overcome this problem. However, we do not know in advance what type of string comes in. For example, the used string very well may contain a string literal. This is plain text hard coded in the BaCon program.

PRINT “Hello” & world$

In this example, both a string literal and also a memory address are used. Clearly, such string literal does not have the 8 bytes offset with meta information. So how can we check whether a string consists of a string literal, or whether it contains a memory address with the 8 byte offset?

We cannot, but if we could, the BaCon implementation itself had to be adapted on all places to make sure the standard libc string functions work correctly. Such change would be long, tedious and error-prone.


Hash table

Another idea is to use hash values paired with the string length. The approach is to take the actual memory address of a string, which is a unique value. This unique memory address then can be put into a hash table which also can store a string length value (key-value store).

The hash table should not contain any collisions, meaning that the hash table should be perfect. For performance reasons it would be nice if the table would provide values in a sequence starting from 1 to have indexes available for an array. This is called a “minimal” hash.

Furthermore, as we do not know in advance how many memory addresses the BaCon program is going to use, the hash table should be able to grow and shrink dynamically.

So, we're looking for a dynamic minimal perfect hash.

Algorithms which implement such hash tables do exist, but when it comes to performance, they are very slow. The implementations of a dynamic hash table often make use of linked lists, and the unavoidable memory inserts and deletions simply take too much time. Eventually, I found that they take more time than a string length calculation which I am trying to beat.


Pointer swap

While working on these ideas, a very simple concept popped up, which actually has been implemented. In BaCon, all string operating functions make use of a temporary buffer to store their (intermediate) results. This is to ensure that nested string functions can pass their results to a higher level.

text$ = MID$(“Goodbye, Hello”, 9) & CHR$(32) & RIGHT$(“The World”, 5)

The above example concatenates three string operations into one resulting string. The functions MID$, CHR$ and RIGHT$ first will store their result into temporary buffers, which then are passed to the '&' operator, which, in its turn, stores the result of the concatenation into another temporary buffer. The final result then is assigned to the variable 'text$'. The assignment is performed by copying the contents from the temporary buffer to the variable 'text$'.

So, instead of copying the result into a variable, it also is possible to swap the memory addresses of the variable and the temporary buffer. The variable 'text$' will point to the result, and the contents of the temporary buffer will be changed to the previous character array of the variable. And because the buffer is temporary, it is going to be overwritten anyway in a next string operation, so there's no need to care about that.

This technique of pointer swapping will save some time, because it avoids a needless copy of bytes from one memory area to another. Even though it does not help us with string length calculation, it does help to improve the performance of string operations considerably.


Memory pool

As strings are stored in memory, it sometimes can be a good idea to allocate a block of memory before the actual program starts. Then a private function to share and administer can assign parts of that allocated memory to the program. This could save time, because the real memory allocation on the heap already took place, and the private memory allocation simply has to administer small chunks.

But it only saves time in case a lot of subsequent allocations and frees take place in the program. In BaCon, this is not the case. Sure, string variables need to have memory allocated and local variables need to be freed. But a test setup shows that a good private memory administration not only is difficult to implement, but also that a very basic version of such administration already is slower than the allocations from libc (malloc, calloc).

I concluded that this approach would improve performance only in a situation where there is a huge amount of memory allocations each claiming large blocks of memory from the heap.


Static character pointers

When using character pointers in functions, it is a good idea to declare them as static. Such pointers keep their value, e.g. the memory address they are pointing to, so the next time when the function is called, the pointer still points to the allocated memory. Therefore, there is no need to free a pointer at the end of a function, or to allocate new memory when entering the function. This will save a lot of system calls and kernel activity.

Note that it may be needed to set the memory of the string to an empty string at the beginning of the function.

Also note that because static pointers always will point to the same memory, recursive functions may not work properly. So in each level of recursion, the same memory is overwritten.

The technique of static pointers is partly implemented in BaCon: static pointers are used for string variables and string arrays which get their value assigned at declaration time. This technique is not suitable for regular string variables, because of the aforementioned problems with recursion.



Using a pre-buffer

The last technique to discuss, which actually has been implemented, is using a pre-buffer. This technique is implemented by libraries as SDS and GB Strings. In this approach, a block of memory is allocated, of which the first bytes contain meta information about length and buffer size. The returned pointer itself is a pointer to the start of the actual string. The benefit of this is that the string functions from libc will work properly, because they see the actual string.

As mentioned, this concept is used in BaCon. However, there are a couple of problems which need to be solved.

First of all, how do we know if an incoming string already uses a pre-buffer?

PRINT a$ & b$ & c$

In this example, we have three variables. If one of the variables does not have a pre-buffer, and BaCon likes to look into it, then a crash will occur (segfault). BaCon cannot simply look into memory which is not allocated.

To solve this problem, we have to think of a trick. As mentioned before, variables point to memory addresses which contain data for the string. The memory addresses themselves are allocated by libc functions like 'malloc' and 'calloc'. Now, we have to realize that such memory allocations always return an address which is aligned to 8 or 16 bytes (depending on architecture). Maybe, on some exotic platforms, it is 4 bytes or 32 bytes. But the point is, the returned memory addresses themselves always are even numbers.

So, if a memory address returned by malloc/calloc is an even number, we simply state that such address does not have a pre-buffer.

Therefore, an internal function is implemented, which will transform a regular even memory address for a string in such a way, that a pre-buffer is present, and that the returned address is an odd number. So in the BaCon implementation, when performing string operations, it always can be assumed that an odd memory address for a string has a pre-buffer present.

String operations then can quickly fetch information from this pre-buffer, saving time which otherwise would have been needed for length calculations.


b1

b2

b3

b4

l1

l2

l3

l4

byte

char1

char2

char<n>

String length

Buffer length

Not used

Actual bytes of the string








returned pointer

The above picture represents the new layout. Meta information is stored in bytes 0-7 and byte 8 is not used. The returned pointer will be at the start of the string as indicated.

Then, the next problem is string literals (again).

PRINT a$ & “Hello world” & c$

A string literal is put into the text memory segment during compile time. The generated C program, created by BaCon, uses addresses in this segment to obtain the string literal. The problem is that these addresses for string literals maybe be even, or odd.

So, simply from the fact that a memory address is odd or even, we cannot see if there is a pre-buffer. In fact, if we try to store information to such pre-buffer, a crash will occur. String literals in the text segment reside in a certain memory area, which is read-only. While the addresses for string variables in the data segment reside in a read-write part in memory.

To solve this problem, BaCon will store the lowest and highest memory address of the odd addresses created by itself in global variables. Then, to check for a pre-buffer, it also must include a range check: the address itself should not only be an odd number, it also should lie in between the minimum and maximum stored. This way, it can be assured that the odd address really is a memory address, and not some address in the text segment of the program.

The last problem is the 'strtok' function from libc. BaCon uses this function only in the FOR..IN statement.

FOR x$ IN “Hello World”

The 'strtok' function splits up a string into pieces and returns a memory address to the next piece. It therefore can return an address which is odd and which lies in the minimum/maximum range.

In BaCon, this problem is solved by simply re-implementing the FOR..IN statement without the use of 'strtok'.


Conclusion

By implementing pointer swapping, static pointers and pre-buffers, the performance for string operations has increased significantly. The average performance gain for simple string concatenations is more than 70% and can grow even more when using large strings.



(c) Peter van Eerten - March 16, 2016 – http://www.basic-converter.org

Document updated on November 9, 2016.