The common method for interning strings breaks in fantastic ways. In Kerf, we’ve taken the old method and revised it for success in the current generation of languages.
If you’ve forgotten what string interning is from your time with Java, it’s a simple way of ensuring that any string appears only once. So for instance, if I have ten objects whose value is the string “lion”, I don’t have ten copies of the string “lion” in memory. What I have instead is a single copy of “lion” and then ten links which all point to the single copy somehow.
Most often these links are pointers (raw addresses in memory). We’ll discuss how this breaks soon. The reference copy of “lion” is never allowed to be moved, changed, or released. It’s permanent. The other implementation detail to figure out is how to keep the strings unique. The next time we create a “lion” string we need to trade our string in for a link, and what is usually done is that the reference copy is stored in a hashtable or some other deduplicating data structure. This lets us figure out that lion already exists, and when we perform the check we can walk away with our link at the same time. If the phrase is “bullmation”, and it doesn’t already exist, then we can add the initial copy to the data structure at the same time.
One benefit of string interning is that you don’t waste space if you’re storing the same long string over and over again.
There is another benefit to string interning, which is that it converts a list of variable-length strings to a fixed-width array. An array of pointers may be indexed into easily, just as any other array. It is not possible to index into a list of variable-length strings without additional overhead.
One really bad thing about string interning is that you usually can’t get rid of any string you’ve seen. If your language is naively storing pointers, you have no way of knowing when it gets rid of the last one. This means that the more strings you come across the more memory you lose forever. Simple parsing jobs can cause languages with string interning to run out of memory.
Java most famously uses string interning, but the concept originated in early Lisp. It shows up in some form or another in most languages. Kerf’s authors learned much about string interning from K, which got it seemingly directly from Lisp, and we’ll get to that in a second.
The Java Language
Java likes string interning because it provides a shortcut method for testing string equality. To test the equality of two interned strings in Java you simply check whether the links are equal. Links are a small fixed size, but strings can be any length, so the test takes O(1) time instead of O(n). Java also touts that this test can be written with the pointer-comparison operator, as in “string1 == string2”, instead of the more verbose “string1.equals(string2)”: since “==” is only correct when both strings are interned, this does little other than generate trivia and serve as a further point of confusion.
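Python happens to expose the same mechanism through its real `sys.intern` function, so the shortcut is easy to demonstrate directly; the `join` calls below just prevent the compiler from interning the literals on its own.

```python
import sys

# Two "lion" strings built at runtime, so they start as distinct objects.
# sys.intern trades each one in for a link to the single canonical copy.
s1 = sys.intern("".join(["li", "on"]))
s2 = sys.intern("".join(["li", "on"]))

assert s1 is s2        # O(1) identity test, the analogue of Java's ==
assert s1 == s2        # O(n) content test, the analogue of .equals()
```

The identity test compares one machine word regardless of string length; the content test walks both strings character by character.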
Java is half-hearted in its commitment to interned strings. Performance reasons force Java to expose multiple other categories of string types. Character arrays are a basic type in Java, and character buffers are a special object. You can see why Java needs these types. With permanent default interning, any sort of sequence involving character-level appends, such as reading the contents of a book from a file, would result in a preposterous O(n²) version of an otherwise trivial technique. The multiple categories of string objects lead to a fragmented ecosystem and a real headache.
What is exasperating about Java is that its designers probably shouldn’t have exposed string interning at all. The kind of string-editing operation that devolves to O(n²), and pollutes the pool, is by far the most common use case for Java strings. And the kind of use case that demands interning, frequent equality testing, such as in a database, rarely appears in Java, which is not an appropriate language for that level of performance anyway. So they got it backwards. What they should’ve done is make “lion” an uninterned character array and suffer the penalty on equality testing, which in Java applications would be unnoticeable.
(Commenters point out that Java has changed over the years, so the preceding refers only to Java as I was taught.)
The K Language
Arthur Whitney’s K series of languages makes more appropriate use of string interning. K supports both interned and non-interned strings, but correctly makes the basic string type non-interned, and reserves the interned type for special occasions.
K’s string interning uses raw pointers. K calls interned strings “symbols”—a name adopted from Lisp—and leans heavily on interning’s constant-time comparison benefits. The most common use for a symbol in K is for representing a stock ticker, such as ‘IBM’ or ‘MSFT’, and K stores long columns of these symbols. If you do a table scan, for instance, there’s a savings in doing a single pointer comparison over doing a character-by-character comparison each time.
Internally, K depends on symbols for a variety of other speedups. It uses linear-time lookup dictionaries, instead of constant hash lookups, so interning is a necessity there. Variable lookup also depends on interning, and uses the same linear-time dictionaries. Without interning the core parts of the language would be appreciably slower.
Whitney’s thinking about symbols informs much of our views on the subject, and we consider our thoughts on the matter to be the natural extension of his. Working with his symbol implementation helped us to see where to take the concept next.
What’s broken in previous implementations of string interning is the global, per-process pool for immutable strings, and the use of memory pointers to reference them. So, everything.
The core issue is that process-level interning is the wrong way to think about strings. The evolving view is that the process is a single, individually-unimportant node in the graph of some inter-process operation. So focusing on the process as the end-all is the wrong way to go about it. Dynamic pointers are one of the big offenders here. But so is the global pool. Processes change from launch to launch, and while it is usually only modest overhead to rebuild an intern pool, the transitory nature of the process and the oracular nature of the pool make global interning an awkward choice. This is most noticeable when writing objects to disk or sending them across the network. As soon as the process needs to communicate the scheme breaks down. The pointer scheme makes life easier for the language implementor, at least at first, but it makes it harder for everything and everyone else involved in supporting the language.
In the old system, each process is an island. A pointer from one process is useless to another process, or to a file on disk. At a minimum this adds an extra layer of serialization and deserialization. What should you turn the pointer into: null-delimited strings, string objects? Do you send the whole pool each time? How do you avoid storing a fixed-width column as a variable-length one?
One common and fairly bad way to send an array of interned strings across the network is to send them as a null-delimited list of variable-length strings. So if you have the string “manticore” repeated hundreds of times, some systems actually transfer the entire null-delimited string hundreds of times, with no deduplication. The strategy here replaces all fixed-width pointers with variable-length null-delimited strings.
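The waste is easy to measure. A Python sketch with made-up sizes (300 repeats of a 9-character symbol), contrasting the naive wire format with a deduplicated one that ships the pool once plus fixed-width indices:

```python
# The naive wire format: every occurrence of the symbol is shipped in
# full, null-terminated, with no deduplication.
symbols = ["manticore"] * 300

naive_wire = b"".join(s.encode() + b"\x00" for s in symbols)
assert len(naive_wire) == 300 * 10     # 10 bytes per repeat, every time

# A deduplicated form: the pool once, then one index byte per entry.
pool = sorted(set(symbols))
indices = bytes(pool.index(s) for s in symbols)
deduped_wire = b"\x00".join(p.encode() for p in pool) + b"\x00" + indices
assert len(deduped_wire) == 10 + 300   # pool once, plus 1 byte per entry
```

Here the naive format costs 3,000 bytes against 310, and the gap widens with longer symbols and longer columns.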
Another poor technique we’ve seen is a scheme for writing a column of symbols to disk. In this case, the product writes a separate deduplicated null-terminated list of symbols for each table, that effectively represents a static copy of the intern pool. This symbol list is shared among all columns. Then the individual column is written, which consists of a fixed-width array of numeric indices into the symbol list. So the pointers have been translated into indices into this list.
When it comes time to read a column from the disk, the entire symbol list must be loaded into memory, and then the symbol column must be translated from indices into dynamic pointers. There is overhead in inserting the symbols from the list into the pool (rehashing), and overhead in translating to pointers. This translation is always necessary because you cannot reuse the same integer mapping across two different tables (to see why, consider a table where ‘1’ is ‘YHOO’ and another where ‘1’ is ‘SBUX’). More generally, such symbol lists cannot be shared, though sharing is often desirable, and they trivially fall out of sync between tables and processes.
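The YHOO/SBUX collision can be made concrete with a small Python sketch; the table names and the `to_live` helper are invented for illustration:

```python
# Two tables wrote their symbol lists independently, so the same integer
# names different symbols in each.
table_a_pool = ["YHOO", "SBUX"]      # on disk for table A
table_b_pool = ["SBUX", "YHOO"]      # on disk for table B
col_a = [1, 0, 1]                    # A's indices: SBUX, YHOO, SBUX
col_b = [1, 0, 1]                    # B's indices: YHOO, SBUX, YHOO

# The raw index columns agree, but the symbols they name do not:
assert col_a == col_b
assert [table_a_pool[i] for i in col_a] != [table_b_pool[i] for i in col_b]

# So on load, each column must be re-translated into the process's
# live pool, rehashing every symbol along the way.
live_pool = {}
def to_live(sym):
    return live_pool.setdefault(sym, len(live_pool))

loaded_a = [to_live(table_a_pool[i]) for i in col_a]
loaded_b = [to_live(table_b_pool[i]) for i in col_b]
```

After translation the two columns finally share one mapping, but only by paying the per-symbol rehashing cost on every load.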
Depending on how you want to look at it, the “global pool of pointers” method forces you to use three distinct representations (memory, disk, and network) and at least four translation layers: back and forth from memory to the disk, and back and forth from memory to the network.
The right way to think about interning is to reduce the scope of the intern pool. In Kerf, instead of maintaining one global pool per process, we maintain local pools on a per-object basis. We assume that the process is one of many, and we avoid the idea of a global pool. Then we get rid of dynamic pointers. Kerf indexes into object-local pools using numeric indices. This works exceptionally well for array languages, but the technique extends to regular imperative languages as well.
This technique allows objects to have the same representation in-memory, on disk, and transferred over the network. No translation layer is necessary. Objects with interned strings can now be mapped. The technique also allows Kerf to get away with using a single string type, instead of several incompatible ones.
Kerf is an array language, and to support string interning it exposes a special kind of array called an ‘Enum’. An Enum works like you would expect a regular array to work, with the exception that, internally, the array deduplicates objects. So the following list, when stored as an Enum,
['lion', 'lion', 'lion']
appears to be a regular array, but actually stores only one copy of the string ‘lion’ in an associated hashset, and then stores three integer references into the hashset. The pool is dynamic. You could use a hashtable or a regular array/list instead of a hashset, though those options are less suitable. When you insert an item into the enum, it checks whether that item already exists, and adds only an integer reference. Same story when you concatenate a list to the enum (except optimized), and so on. But in every instance the enum offers the same functionality as a regular array. There is no limit to the number of interned items, and Kerf’s enums support arbitrary data types, not just strings. Kerf uses enums for columns in a table, say.
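A minimal Python sketch of the idea, not Kerf’s actual implementation: the class name, fields, and the use of a list-plus-dict in place of a hashset are all illustrative. The point is that the pool lives inside the object, and the array itself is fixed-width integers.

```python
# Sketch of an object-local enumerated array: one deduplicated pool per
# object, indexed by small integers instead of pointers.
class Enum:
    def __init__(self, items=()):
        self.pool = []        # deduplicated values, local to this object
        self.lookup = {}      # value -> slot in pool (stands in for a hashset)
        self.indices = []     # the "array" itself: fixed-width ints
        for x in items:
            self.append(x)

    def append(self, x):
        i = self.lookup.get(x)
        if i is None:                     # first sighting: grow the pool
            i = len(self.pool)
            self.pool.append(x)
            self.lookup[x] = i
        self.indices.append(i)            # otherwise just add a reference

    def __getitem__(self, k):             # reads behave like a plain array
        return self.pool[self.indices[k]]

    def __len__(self):
        return len(self.indices)

e = Enum(["lion", "lion", "lion"])
assert e.pool == ["lion"]                # one stored copy
assert e.indices == [0, 0, 0]            # three integer references
assert e[2] == "lion"
```

Because the indices are plain integers, the whole object can be written to disk or shipped over the network byte-for-byte, pool and all, with no pointer translation.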
The natural objection to this is that, if you use multiple columns, you create multiple copies of the strings. In practice this objection turns out not to matter. The number of pools in which a given string may appear is relatively small (perhaps 10). The pools themselves are small (anywhere from ten entries to one million entries). The size of the pools is dominated by the number of indices referencing them (say billions of indices). Reference-counted objects mean that all of this matters even less anyway.
Let’s take an example. There are about 3,000 tickers traded on the NASDAQ. If we assume they’re 10 characters each (they’re less), that’s only about 30k in overhead per pool. With ten enumerated columns, that’s only 300k in overhead. A daily column may have 30M entries, and a historical column may have thousands of times that. So for stock prices we’re talking duplication ratios starting at about 0.1%, and falling thousands of times lower from there. In practice, because the local pools can be reference counted, there may not be any duplication at all.
The second objection is that it is not immediately obvious how to reconcile two separate objects with different pools. We’ve built an entire language on top of this architecture, and we can say from experience that in practice this also turns out to be simple. These cases appear less frequently than you might think. Standard unoptimized tools handle basically all of them without any additional effort. In the cases where additional effort is necessary, the architecture turns out to be the right tool for the job. Consider the example of optimizing the case where two potentially different lists of interned strings are tested for equality. You cannot simply compare the vectors of indices, because the adjoining pools might be composed differently. What you can do, however, is create an O(c) lookup table, where c is the size of the pool, that handles the translation. Because the pools are small, the lookup tables are guaranteed to be small as well. The operation remains linear in the number of strings.
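The translation-table trick can be sketched in Python. Each enum is modeled here as a bare (pool, indices) pair; the variable names are ours:

```python
# Two enums holding the same strings, but with differently composed pools.
pool_a, idx_a = ["lion", "tiger"], [0, 1, 0]
pool_b, idx_b = ["tiger", "lion"], [1, 0, 1]    # same contents, other order

# Build the O(c) translation table once: map each of A's pool slots to
# the matching slot in B's pool (-1 marks a symbol absent from B).
b_lookup = {s: i for i, s in enumerate(pool_b)}
translate = [b_lookup.get(s, -1) for s in pool_a]

# Then a single linear pass over the indices decides equality, with no
# string comparisons at all.
equal = len(idx_a) == len(idx_b) and \
        all(translate[a] == b for a, b in zip(idx_a, idx_b))
assert equal     # both spell lion, tiger, lion
```

The table costs O(c) space and time for a pool of size c, which the text above notes is small, so the overall comparison stays linear in the number of strings.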
All sorts of interesting results pop out of this theory of string interning. Perhaps the most impressive is how easy it is to grade (sort) interned strings in O(n) time instead of O(n log n) time. I’ll skip over the details here, but effectively, you get to use a counting sort on a list of integers, which is a linear sort, and likely at the practical lower bound. Imagine that: linear sorting for strings. It works so well that I suspect there are cases where you want to enumerate a regular list before sorting it. I find this to be incredible. Other results like this one continue to appear naturally, which to us is strong evidence that this is the right path for string interning.
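This is not Kerf’s actual grading code, but the idea can be sketched in Python: sort the tiny pool once (O(c log c) for pool size c, negligible next to n), then counting-sort the n integer indices by pool rank, with no per-element string comparisons.

```python
# Sort an enumerated column in time linear in the number of entries.
pool = ["tiger", "lion", "zebra"]
indices = [2, 0, 1, 1, 2, 0]        # zebra tiger lion lion zebra tiger

# Rank each pool slot by sorting the pool itself: O(c log c), c tiny.
order = sorted(range(len(pool)), key=lambda i: pool[i])
rank_of = [0] * len(pool)
for r, i in enumerate(order):
    rank_of[i] = r

# Counting sort the n indices by rank: O(n + c), comparison-free.
counts = [0] * len(pool)
for i in indices:
    counts[rank_of[i]] += 1

out = []
for r, i in enumerate(order):       # walk pool slots in sorted order
    out.extend([pool[i]] * counts[r])

assert out == ["lion", "lion", "tiger", "tiger", "zebra", "zebra"]
```

The n strings are never compared to each other; all comparison work happens inside the c-element pool, which is why enumerating a regular list first can pay off before sorting it.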