Atlas Data Structure

Kerf introduces a new data structure called an “Atlas” that we think is a helpful thing to have inside of a programming language. We’ve found it very useful to have database Tables inside of a programming language, and the Atlas concept is a natural extension of that. Be careful about jumping to conclusions, though, since an Atlas does a few special things that make it unlike what you might expect. I’ll give a preview of the major points here:

  1. Automatic indexing
  2. Responds to SQL
  3. First-class language type
  4. Columnar implementation

We didn’t spend much time talking about Atlases originally, because interest in NoSQL-style computation seemed to be on the wane. Recently, however, we’ve had a lot of requests for those sorts of applications, and so a detailed description seems in order. (If you’ve come across any interesting problems like that in your work, tell us about them.)

If a Table is a way of including a SQL table inside of the language, then an Atlas is a way of including a NoSQL store: none of the rows have to have the same columns. In other words, an Atlas is a schemaless table, or simply a list of JSON objects, with keys in the objects pointing to values. The Atlas is a generalization of the Table where the rows don’t have to be uniform. In one sense, an Atlas is a sparse Table. You could also think of a Table as a vectorized Atlas.

 

In Kerf we use the term “Map” to refer to what JSON calls an Object {"a":1}. So the term Atlas to refer to a List of Maps (with an Index) was an obvious choice. We use the special convenience syntax {{ and }} to create a table, and the related {[ and ]} convenience syntax to create an atlas. Both of these came about because they fit in well with JSON, and Kerf is a JSON (& SQL) superset.

1. The first thing that makes an Atlas special is that every key is indexed invisibly and by default. This means that no matter what key you query on, your query will be optimized. Atlases let you forget about indexing entirely. In a typical NoSQL store, if you want optimized queries on a key you have to issue the equivalent of a CREATE INDEX command, wait for the index to be created, incur future performance penalties, and so on. With Atlases you don’t have to fool around with any of that. The keys are already indexed, there’s no additional performance penalty, and adding a never-before-seen key doesn’t impact performance. Effectively, you can just forget about indices entirely, even for nested keys.

 

2. The second thing that makes Atlases interesting is that you can use SQL on them and it will just work. Typical NoSQL stores force you to learn a proprietary query language. It’s a lot easier to be able to use the SQL that you already know, and then to have the output be what you would expect. Even though there’s no specification in SQL for how a NoSQL table should behave, most programmers have an intuitive sense of what should come back. A lot of crazy corner cases have to be handled to accommodate this method of thinking, but because that’s all taken care of on our end (the internals), you don’t have to worry about it at all.

 

3. The third thing that make Atlases special is that the data type is integrated with the language. An Atlas is a first-class data type, just like an int or an array. This means that from inside the language, you can create Atlases, query them, manipulate the results, etc. If you’re querying an old-fashioned database, you need to slurp your results in through the tiny straw of an API. But in Kerf, the language and database are the same, so the code you write works directly on the data. There’s no traversal between a language layer and a database layer. Because the two layers are integrated, so to speak, this means that you can perform “out-of-core” computations without putting special thought into it. It also means that there’s no impact to performance.

 

4. The fourth thing that makes Atlases special is that they are completely and entirely columnar in order to optimize performance. The index is columnar, the keys are columnar, the values are columnar, everything is columnar. This helps with performance and also with serializing the object.

Because a few existing solutions for schemaless data don’t really perform as advertised, we’ve come across several companies who are suffering in silence with irregular data problems they don’t know how to solve. But there is a tool that exists to solve it, called an Atlas, and we package it in Kerf. If you are faced with a problem with a lot of columns, or ragged or irregular columns, or sparse data, or more than one table mixed together, or have just been disappointed in NoSQL in general, you may want to look into Kerf. Contact us for any reason at founders@kerfsoftware.com.

Advertisements

Time Bars

Customers always ask us whether Kerf supports rolling up times into “time bars” or “time buckets.” The answer is yes, Kerf support time bars, and our time bars do a lot more than you might expect. Because not everybody is familiar with what time bars do, in this post we’ll give a basic overview of what time bars are and how they work.

Motivation

Time-series data always contains a timestamp column, and this column usually does more than mere record-keeping. With traditional data, you might track when a user registers for your e-commerce website, but the time when the user registers is irrelevant to the purpose of the site. A distinguishing quality of time-series data is that the time column matters. Changing the order of the users table doesn’t make a difference, but changing the times on a column of stock prices makes a big difference. Time matters here.

What this means is that you’re not really interested in a pure ordered table of events. The table is only a compacted depiction of what happened. If the table were really accurate, there would be empty space between the rows, proportional to the length of time between events. This is impractical, so we timestamp the events, and the distance between them is implied.



┌─────┬───────────────────────┐
│price│time                   │
├─────┼───────────────────────┤
│ 20.0│2016.11.11T00:00:11.000│
│ 24.0│2016.11.11T00:00:13.000│ //+02s difference
│ 23.0│2016.11.11T00:00:19.000│ //+06s difference
│ 23.0│2016.11.11T00:00:37.000│ //+18s difference
│ 25.0│2016.11.11T00:01:31.000│ //+36s difference
└─────┴───────────────────────┘

Though ultimately we work with tables, mentally we have to think of them as timelines. Timelines have a lot of weird properties tables don’t. Our ordered table can only have one row in the first position, one row in the second, and so on. A timeline, in comparison, can have 132 events in the first hour, none in the second, 15 in the third, etc. The arrival of events is irregular. Because this is tricky to deal with, we want to turn a timeline back into an ordered table, that is, we want to make an irregular sequence of events regular. This is what time bars are for.


select bars(5s,time) from t

┌───────────────────────┐
│time                   │
├───────────────────────┤
│2016.11.11T00:00:10.000│
│2016.11.11T00:00:10.000│ //repeat
│2016.11.11T00:00:15.000│
│2016.11.11T00:00:35.000│ //gap
│2016.11.11T00:01:30.000│
└───────────────────────┘

Time bars are used to normalize time-series. In Kerf, the bars function floors all the timestamps to the nearest multiple of five minutes, say. By design, this output contains a lot of repeats. The point of the repeats is that they are keys that you can group on. By performing aggregation functions on groups of rows, you can find the average price over five minute intervals, or the sum of the lots that were traded during those times. The aggregation reduces what happened during each window into a uniform output.

select avg(price) from t group by bars(5s, time)

┌───────────────────────┬─────┐
│time                   │price│
├───────────────────────┼─────┤
│2016.11.11T00:00:10.000│ 22.0│
│2016.11.11T00:00:15.000│ 23.0│
│2016.11.11T00:00:35.000│ 23.0│
│2016.11.11T00:01:30.000│ 25.0│
└───────────────────────┴─────┘

Aggregation is lossy. We’re condensing many rows into one, so we’re losing data. We’ve also rounded off the times, so we’re losing data there, too. The part that concerns us here is that we’re losing accuracy in the output.

Losing accuracy is a bad thing, but it’s sometimes worth it if you get something in return. The first advantange to lossiness is that it can make data comprehensible to people: looking at quarterly performance is more revealing than looking at pages of raw ticks. This is using a filter, in particular a downsampling filter, and downsampling filters are lossy methods for finding signal in lots of noisy data.

The second advantage to lossiness is that sometimes you need to delete data more than you need to conserve accuracy, and so you can bar your data to reduce what you have to store. This is great when the cost of disk is more important than some extra accuracy. It’s less great when insufficient technology forces you to use time bars because it otherwise can’t handle the volume of data. Kerf is designed to always handle the full volume of ticks without bars.

Absolute Bars

The standard way to create time bars is using what we’ll call absolute bars. Set to one-hour increments, an absolute time bar will roll up all the events from today at 7:00-7:59AM into a 7:00AM bucket, all the events from 8:00-8:59AM into an 8:00AM bucket, and so on. Tomorrow if events happen at 7:00AM, those would go in a separate bucket; future 7:00AM events never go into today’s 7:00AM bucket. That means that an hour bucket needs to keep track of the day, the month, and the year, even though it’s only grouping by hour. The rule is that every unit greater than the unit you’re grouping on is kept and unchanged. Units less than the one you’re grouping on will be truncated to zero. It’s easier to explain this with a picture.



┌─────┬───────────────────────┐
│price│time                   │
├─────┼───────────────────────┤
│   10│2016.06.22T06:55:00.000│
│   12│2016.06.22T07:01:00.000│
│   11│2016.06.22T07:04:00.000│
│   11│2016.06.22T07:22:00.000│
│   12│2016.06.22T07:43:00.000│
│   12│2016.06.22T09:04:00.000│
│   13│2016.06.23T07:33:00.000│ //next day
└─────┴───────────────────────┘



┌─────┬───────────────────────┬───────────────────────┐
│price│time                   │timebars               │
├─────┼───────────────────────┼───────────────────────┤
│   10│2016.06.22T06:55:00.000│2016.06.22T06:00:00.000│
│   12│2016.06.22T07:01:00.000│2016.06.22T07:00:00.000│
│   11│2016.06.22T07:04:00.000│2016.06.22T07:00:00.000│
│   11│2016.06.22T07:22:00.000│2016.06.22T07:00:00.000│
│   12│2016.06.22T07:43:00.000│2016.06.22T07:00:00.000│
│   12│2016.06.22T09:04:00.000│2016.06.22T09:00:00.000│
│   13│2016.06.23T07:33:00.000│2016.06.23T07:00:00.000│
└─────┴───────────────────────┴───────────────────────┘

Here we’ve updated the table to show the time bars. In practice no one does this, since you can group on the result of the bars function without storing it, and that’s quicker if you’re already familiar with bars.

select avg price from t group by bars(1h,time)

┌───────────────────────┬─────┐
│time                   │price│
├───────────────────────┼─────┤
│2016.06.22T06:00:00.000│ 10.0│
│2016.06.22T07:00:00.000│ 11.5│
│2016.06.22T09:00:00.000│ 12.0│
│2016.06.23T07:00:00.000│ 13.0│
└───────────────────────┴─────┘

Relative times can be expressed succinctly in Kerf. In all our examples, relative times are given as the first argument to the bars function. Some common times to want to group on:



1y     yearly
3m     quarterly
1m     monthly
7d     weekly
1d     daily
1h     hourly
5i     five-minute bars ('i' since 'm' is month)
1i     one-minute bars
1s     one-second bars
1000n  microseconds (via nanoseconds)

Modular “Wall Clock” Bars

Many time-series exhibit some form of periodicity. Home electricity usage increases during the summer months. E-commerce websites sell more goods during the holidays. Stock trading activity decreases during lunch. A good way to isolate this trend is to stack the hours from different days on top of each other. We can call these modular bars.

The first step to generating modular bars is to consider 7AM as a time on the wall that recurs every day. Using this method you would group hours from different days together. This lets you see how the hours in the day perform against each other over time.


┌─────┬───────────────────────┐
│price│time                   │
├─────┼───────────────────────┤
│  101│2016.06.22T06:30:00.000│
│  102│2016.06.22T12:30:00.000│
│  103│2016.06.22T18:30:00.000│
│  102│2016.06.23T00:30:00.000│
│  102│2016.06.23T06:30:00.000│
│  101│2016.06.23T12:30:00.000│
└─────┴───────────────────────┘





select time['hour'] as hour from t

┌────┐
│hour│
├────┤
│   6│
│  12│
│  18│
│   0│
│   6│
│  12│
└────┘




select time['day'] as day, time['hour'] as hour from t

┌───┬────┐
│day│hour│
├───┼────┤
│ 22│   6│
│ 22│  12│
│ 22│  18│
│ 23│   0│
│ 23│   6│
│ 23│  12│
└───┴────┘

In Kerf, indexing in to a time or a time array with special keywords like ‘hour’ will return the attribute in question. With this technique you can extract


'date'     //2016.06.22
'time'     //10:11:00.000
'year'     //2016
'month'    //6
'day'      //...
'hour'
'minute'
'second'
'millisecond'
'microsecond'
'nanosecond'
'week'

Note that this “drops” any time attribute larger or smaller than the attribute in question. So it differs from our absolute bars method from earlier. So different days may produce the same output for ‘hour’. These are keys we can group on.


select avg(price) from t group by time['hour']

┌────┬─────┐
│time│price│
├────┼─────┤
│   6│101.5│
│  12│101.5│
│  18│103.0│
│   0│102.0│
└────┴─────┘

To separate by day and hour, but not year, add another element to the list of group by arguments.


select avg(price) from t group by time['day'], time['hour']

┌────┬─────┬─────┐
│time│time1│price│
├────┼─────┼─────┤
│  22│    6│101.0│
│  22│   12│102.0│
│  22│   18│103.0│
│  23│    0│102.0│
│  23│    6│102.0│
│  23│   12│101.0│
└────┴─────┴─────┘

Which is the same, in this limited case, as the average on the absolute bars.


select avg price from t group by bars(1h, time)

┌───────────────────────┬─────┐
│time                   │price│
├───────────────────────┼─────┤
│2016.06.22T06:00:00.000│101.0│
│2016.06.22T12:00:00.000│102.0│
│2016.06.22T18:00:00.000│103.0│
│2016.06.23T00:00:00.000│102.0│
│2016.06.23T06:00:00.000│102.0│
│2016.06.23T12:00:00.000│101.0│
└───────────────────────┴─────┘

To contact us about Kerf, mail kevin@kerfsoftware.com

Kerf meets the smartgrid

Kerf is primarily designed as a tool for dealing with data streams. The most obvious streams of data which bring in customers are stock market data, but there are lots of interesting data streams in the world. Enough that people are starting to have a name for this “vertical” -aka “internet of things.” One of the “things of the internet” I’ve worked with in the past is smartmeter data. This is a stream of electrical power consumption data from individual power meters in homes, commercial sites, factories, electric car chargers and what have you. While everyone knows that markets create big streams of potentially interesting data, the data from power grids is also big and interesting. It’s big because a typical city might have millions of power meters in it, recording data at minute or quarter hour increments. It’s interesting because power companies need to know what is going on with their networks to provide uninterrupted electrical service that we all take for granted. It’s also very interesting to power companies who rely on “renewables” such as wind or solar which are not steady or controllable producers of power. If you have a power grid which relies on mostly renewables, you have all kinds of complicated operations research problems to deal with during times when it is cloudy, dark or the breezes are calm.

Problems

The data volume of “internet of things” applications can be unwieldy to manage in standard tools, but there are analytics challenges as well. A standard challenge for smartgrid stuff is doing forecasts. Such models can get really hairy. Every smart meter installation is quite different from every other smart meter. A power meter for a car charging station is vastly different from that of a school, an apartment or a house, and all of those things are different from one another. The “users” of each apartment and house have different schedules, appliances and devices. Getting this right in an automated way can pay big dividends; you can provide services to the people with the meter, and you can do more accurate forecasts on a city wide level if you understand the generating process at the individual meter level.

If you want to get something like this right in an automated fashion, it’s a fairly large project involving paying people like me to think about it for a few months. Doing all this in an automated, general and reliable way involves feature creation, feature selection algorithms, machine learning, data gathering, data cleansing, weather forecasting, wind forecasting, optimization and the kitchen sink. I’d love to eventually have the full suite of such tools to do such a project in Kerf as primitives, but we don’t know if customers would appreciate or use such things, so I’ll make do with some small steps for this blog.

Let’s say we have some simple meter data with the average power usage per installation. No temperature or weather data here, so we’ll pretend these things are not important.

KeRF> meta_table(smeter)

┌────────────┬────┬────────────┬────────────┬───────┐
│column      │type│type_name   │is_ascending│is_disk│
├────────────┼────┼────────────┼────────────┼───────┤
│       Power│  -3│float vector│           0│      1│
│        time│  -4│stamp vector│           0│      1│
│installation│   8│        enum│           0│      1│
└────────────┴────┴────────────┴────────────┴───────┘

len smeter
  21894311

A standard observation about forecasting smart grid stuff is that electricity use is periodic. People use different amounts of electricity depending on what hour of the day it is, and what month of the year it is. These are pretty good features that would go into any machine learning gizmo which would ultimately solve this problem.

A simple thing to do toward building a forecasting model for this sort of data would be to pull out some average baselines for hour of the day and month of the year, while adding some other quantities for debugging purposes:

tmp:select avg(Power) as PowerAvg,max(Power) as PowerMax,std(Power) as PowerStd,min(Power) as PowerMin from smeter where time>2014.01.01,group by time['month'],time['hour'],installation
tmp: select time as month,time1 as hour,* from tmp
tmp1:delete_keys(tmp,['time','time1'])
tmp:tmp1


┌─────┬────┬────────────┬────────┬────────┬────────┬─────────┐
│month│hour│installation│PowerAvg│PowerMax│PowerStd│PowerMin │
├─────┼────┼────────────┼────────┼────────┼────────┼─────────┤
│    9│   5│    1ce2e29f│ 113.889│ 363.697│ 77.6535│      0.0│
│    9│   6│    1ce2e29f│ 161.965│ 586.212│ 123.717│  10.7606│
│    9│   7│    1ce2e29f│  183.03│  905.73│ 186.052│      0.0│
│    9│   8│    1ce2e29f│ 252.629│ 1125.37│ 262.054│      0.0│
│    9│   9│    1ce2e29f│ 268.239│  1234.6│  281.07│  7.78113│
│    9│  10│    1ce2e29f│  270.93│ 1189.17│ 273.545│      0.0│
│    9│  11│    1ce2e29f│ 283.249│ 1186.31│ 288.657│      0.0│
│    9│  12│    1ce2e29f│ 281.222│ 1170.18│ 282.604│0.0728503│
│    9│  13│    1ce2e29f│ 284.771│ 1159.36│ 276.452│ 0.691112│
│    9│  14│    1ce2e29f│ 291.867│ 1194.55│ 280.146│  2.42096│
│    9│  15│    1ce2e29f│ 289.168│ 1342.17│  273.67│      0.0│
│    9│  16│    1ce2e29f│ 263.163│ 1081.73│ 254.826│      0.0│
│    9│  17│    1ce2e29f│ 177.601│ 905.783│ 162.548│  8.05548│
│    9│  18│    1ce2e29f│ 153.921│ 564.091│  125.28│  9.07539│
│    9│  19│    1ce2e29f│ 154.173│ 597.273│ 130.062│      0.0│
│    9│  20│    1ce2e29f│ 157.302│ 618.441│ 135.536│      0.0│
└─────┴────┴────────────┴────────┴────────┴────────┴─────────┘

The snippet of result I included from the selection gives the average power for different months and hours for each meter installation. You can see in the example above that the meter shown meter uses more power at noon than it does at midnight, at least for September, which is more or less what you’d expect.

You could use these averages as a sort of baseline model, either predicting each month/hour as being what they were before, or attempting to build on them in some other way. But, you can see that the peak usage is very high, as is the standard deviation, even in sample, so that’s not likely to work. Finally, a lot of the minima are 0 or very low.

Forging ahead anyway with our dumb model, perhaps joining this with our original data set might be a useful thing to do.

smeter['month']: flatten xvals select time['month'] from smeter
smeter['hour']: flatten xvals select time['hour'] from smeter

left_join(smeter,(select month,hour,installation,PowerAvg from tmp),["installation","month","hour"])

┌───────┬───────────────────────┬────────────┬─────┬────┬────────┐
│Power  │time                   │installation│month│hour│PowerAvg│
├───────┼───────────────────────┼────────────┼─────┼────┼────────┤
│53.6267│2014.09.21T05:15:00.000│    f387e7b0│    9│   5│ 101.992│
│41.4469│2014.09.21T05:30:00.000│    f387e7b0│    9│   5│ 101.992│
│86.6634│2014.09.21T05:45:00.000│    f387e7b0│    9│   5│ 101.992│
│ 91.229│2014.09.21T06:00:00.000│    f387e7b0│    9│   6│ 100.237│
│ 65.806│2014.09.21T06:15:00.000│    f387e7b0│    9│   6│ 100.237│
│57.8399│2014.09.21T06:30:00.000│    f387e7b0│    9│   6│ 100.237│
│81.6965│2014.09.21T06:45:00.000│    f387e7b0│    9│   6│ 100.237│
│47.6387│2014.09.21T07:00:00.000│    f387e7b0│    9│   7│ 103.481│
│53.9568│2014.09.21T07:15:00.000│    f387e7b0│    9│   7│ 103.481│
│211.005│2014.09.21T07:30:00.000│    f387e7b0│    9│   7│ 103.481│
│55.2504│2014.09.21T07:45:00.000│    f387e7b0│    9│   7│ 103.481│
│63.5775│2014.09.21T08:00:00.000│    f387e7b0│    9│   8│ 106.195│
│ 47.352│2014.09.21T08:15:00.000│    f387e7b0│    9│   8│ 106.195│
│81.0503│2014.09.21T08:30:00.000│    f387e7b0│    9│   8│ 106.195│
│83.0564│2014.09.21T08:45:00.000│    f387e7b0│    9│   8│ 106.195│
│65.3051│2014.09.21T09:00:00.000│    f387e7b0│    9│   9│ 111.119│
└───────┴───────────────────────┴────────────┴─────┴────┴────────┘

Well, we can see this won’t work, but how badly won’t it work? MAPE is easily defined in Kerf, and unlike them other databases, you can do sensible things like drop the (numerous) zero examples from the actual Power readings.

dropInf:{[x] x[which(abs(x)!=inf)]}
def mape(x,y){
 xx:abs(dropInf((x-y)/x));
 sum(xx) / count(xx)} 

 select 100*mape(Power,PowerAvg) as MAPE from smeter where time>2014.01.01 group by installation

┌────────────┬───────┐
│installation│MAPE   │
├────────────┼───────┤
│    f387e7b0│271.342│
│    eeb1e1e1│4898.16│
│    8d525ff4│157.625│
│    9cd89c1e│ 260.53│
│    106c60a8│354.543│
│    4925364d│59.0307│
│    e74a5b3f│101.899│
│    cd1f9a75│138.086│
│            ⋮│      ⋮│
└────────────┴───────┘

This is a failure as a forecasting technique, but it illustrates a Kerf workflow with internet of things data. You can go very fast on a large database even on a crummy little laptop. You can also do better things.

Let’s say you wanted to build some linear regression models based on, say, time of day and month of year. Easily done in Kerf. You need a little helper function to add a dummy variable, as the standard kerf  least squares primitive doesn’t include a constant (it assumes beta*x = y).

xx: xvals select month,hour from tmp where installation='6af73e77'
  [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...], 
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...]]

We want this matrix to have a constant laminated to the front of it, so

transpose 1 join mapright transpose xx
  [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...], 
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...], 
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...]]

 

Now, let’s stick it in a convenience function for pulling out C’s and betas for a linear regression model:

y: flatten xvals select PowerAvg from tmp where installation='6af73e77'
Cbeta: {[x, y] lsq((transpose 1 join mapright transpose x), y)}
Cbeta(xx,y)
  [82.8605, -1.35372, 0.302477]

 

This could be useful if we built some dummy variables. If you regress as we did in this function, we get betas which are correlated to the encodings. This isn’t useful unless the hour of the day or month of the year as a growing function actually means something. Probably not. You’d probably want to blow those guys out as a linear regression on month of year, hour of day as dummy variables. Something like this:

xx: (1+range(12))= mapright flatten xvals select month from tmp where installation = '6af73e77'
  [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
...
Cbeta(transpose xx,y)
  [71.5752, 1.78876, -1.54225, 3.20775, 3.62774, 6.81636, 26.7092, 25.2714, 29.176, 28.5876, 9.84373, -31.3805, -30.5305]

Using all these parts, we can build regression models based on hour of day, month of year. We can also see patterns in the months of the year; this installation uses more power in the summer months than in winter.

Exponential smoothing is another thing which we might want for examining or predicting things. Kerf doesn’t come with this as a core verb, but we just need to write the definition of exponential smoothing down along with the deconverge combinator; one of those beautiful situations where Kerf is a clean mathematical notation:

def ewma(x,a) {
 {[z,y] (a*z)+(1-a)*y} deconverge x
}

Predicting electrical load a day or week ahead is considered a difficult machine learning problem due to all the nonlinearities involved, and the fact that you’re ultimately predicting what a bunch of monkeys do with light switches. The general solution to this problem is something I’ve done in Lisp (and ported to R) using metric space and information theoretic techniques. When I was doing this, I had absurd overhead from the data storage system. Pulling the pieces of the data in and out of the database was awful. The data was too big to keep in memory all at once, because the DB system I was using was not designed for time series data. Had I some Kerf to solve the problem with, I could have as easily ported the code from Lisp to Kerf, and the whole thing would have been ridiculously scalable and deployable on an as-is basis, with additional features such as streaming queries and live monitoring immediately available. That’s the power of a tool like Kerf. Programmable database engines allows us to do lots of fun out of core calculations in a high level language.

Meanwhile this data set allows us to develop lots of interesting time oriented techniques for solving difficult problems. I’ll visit it again in further blogs.

The Best Algorithm No One Knows About

Here’s a program roughly 0% of programmers know how to write: generate a list of random tweets, without duplication. You may think you know how to do this, but you almost assuredly don’t.

Say you work at Twitter and have to select just one tweet at random. This is an easy problem. Generate a random number less than the total number of tweets, and draw the tweet that corresponds somehow to your number. If you have to make a list of tweets, then, provided you don’t mind duplicates, you can repeat this same process over and over. We get a list of tweets drawn at random, possibly with some duplication. Any programmer could write this program.

The catch comes when you add the requirement that the selected tweets be unique. The whole problem changes. Once uniqueness is required, you get trapped by another, implicit, requirement, which is that we expect every tweet to be equally likely to appear.

(Stated more formally: given non-negative integers k and n with k <= n, generate a list of k distinct random numbers, each less than n. We reasonably ask that this be done in O(k) time and O(1) additional space. Finally, the distribution must be uniform.)

We started with the language of tweets, but it’s also possible to use the language of cards: deal k cards from a deck of size n, without replacement. So a poker hand like A2345 of clubs is valid for k=5 and n=52, but a hand containing AAAAA of clubs is not.

The misleading part in using the language of cards is that you don’t often consider a deck of size 264. What’s nice about the card formulation though is that it conveys how simple the problem statement really is. This is a fundamental problem that was open for a long time. Nobody knew how to deal cards.

The first obstacle that makes the “dealing” (without replacement) hard is that the “draw” method doesn’t work. If you draw (with replacement) over and over again, ignoring cards you’ve already picked, you can simulate a deal, but you run into problems. The main one is that eventually you’re ignoring too many cards and the algorithm doesn’t finish on time (this is the coupon collector’s problem).

There are several things to try next and none of them work. You can’t generate a random number for every card in the deck (too many cards). You can’t rearrange the deck (too much space). Once you get the basic complexity requirements down, what makes the problem slippery is keeping the balance in the distribution. You can try to play with “resizing the window” you’re looking at, but most of these methods add bias (usually, you undersample early items and oversample later ones).

Card dealing was unsolved for a long time. Finally, Jeffrey Vitter published the solution in 1984. A cleaned-up version appeared in 1987. To put this in context, public-key cryptography algorithms appeared about a decade earlier in the 1970s. Basically all other fundamental problems that sound like this were solved in the 1960s and 1950s, so we might have expected to have a solution from then. But in the 1980s, dealing cards was a research-level problem in Knuth.

The algorithm is still relatively unknown. When I came across a need for the algorithm in my work, it took a lot of detective work to even find it. There was no source code online. The algorithm is not referenced often, and when it is, it is in a highly obscure fashion. After reaching Vitter’s papers it again takes a concentrated effort to figure out what you need. This was a big surprise to me.

I think the explanation for this is that the method was originally presented as a capstone effort in the area of “reservoirs” and “tape sampling”. These days, while it’s possible you’ve worked at a place that uses tape drives for backup, chances are you’ve never programmed for a tape drive. I certainly haven’t. Based on my reading of the research, by the mid-80s subjects like tape drive sampling were considered antique. The most important algorithm appeared after people had stopped looking at that field. The only reason I know the field exists is it’s a giant section in Knuth that I’ve flipped past many times. Earlier algorithms for “card dealing” or “urn sampling” are hidden in there.

Today tape drive algorithms seem pointless, but there’s a good reason for approaching the problem from this perspective. Random access is costly on a tape drive. Regular access is costly on a tape drive. Tape drives are much less nimble than disk drives, and this enforces a strongly sequential read mentality. A sequential, long-duration read pattern is a useful way to frame the problem, as it results in a solution which is both sorted and “online”. Sorted means the dealt cards are in order, and online means you can do them one at a time, perhaps while you wait for a spindle to turn. This completist approach may be how you need to think about the problem in order to solve it in the first place.

The reason it’s nice for an algorithm to produce results in sorted order is because it’s easy to randomize the order after the fact. The reverse is not true. Sorting algorithms are linear only in restricted cases, and so the addition of an O(k*log(k)) sort will forfeit our original O(k) runtime. List shuffling algorithms are O(k) and easy to implement.

That it’s so easy to shuffle a small list is probably why the algorithm is overlooked. When most people need to deal cards they shuffle the entire deck. This works as long as the size of the deck is within reason. What happens is that once the deck is over a certain size, people resort to simple drawing algorithms, and nobody looks too closely at bias or inefficiency in the randomized application.

This is unfortunate. If you’ve ever sat through a statistics or combinatorics class, you know drawing balls from urns without replacement is a fundamental sampling application. In array languages like APL, sampling with and without replacement are implemented as first-class primitives “deal” and “draw”. It’s nice in theory to have sampling as a primitive since then you don’t have to worry about screwing it up. You do have to worry about the language implementors getting it wrong, though. I’ve talked with several people who’ve implemented deal-like primitives in the past, and Vitter’s name only comes up when I bring it up, so I assume many existing deal implementations either use slower methods, or, I guess, some kind of ad hoc method with an incorrect distribution.

We use Vitter’s algorithm in Kerf, and deal() is a first-class primitive. Kerf’s deal() is great for when banks need to sample a random collection of stock tickers, or a random collection of trades from throughout the day. Uses appear all the time.

I’ve cleaned up an older version of the C source code we use and made it available below. This source first appeared in a previous project here. A description of the algorithm and pseudocode can be found here in Vitter’s paper.


//Copyright Kevin Lawler, released under ISC License

double random_double()  //your random double function
{
  //[0,1) uniformly random double, mt19937-64.c source is fine
  //keep in mind most double-based implementations drop randomness to 52-bits
  return genrand64_real2();
}

//POTENTIAL_OPTIMIZATION_POINT: Christian Neukirchen points out we can replace exp(log(x)*y) by pow(x,y)
//POTENTIAL_OPTIMIZATION_POINT: Vitter paper points out an exponentially distributed random var can provide speed ups
//Vitter, J.S. - An Efficient Algorithm for Sequential Random Sampling - ACM Trans. Math. Software 11 (1985), 37-57.
//'a' is space allocated for the hand
//'n' is the size of the hand
//'N' is the upper bound on the random card values
void vitter(int64_t *a, int64_t n, int64_t N) //Method D
{
 

  int64_t i = 0;
  int64_t j = -1;
  int64_t t; 
  int64_t qu1 = -n + 1 + N; 
  int64_t S; 
  int64_t negalphainv = -13;
  int64_t threshold = -negalphainv*n;
  int64_t limit;

  double nreal = n;
  double Nreal = N;
  double ninv = 1.0/n;
  double nmin1inv = 1.0/(n-1);
  double Vprime = exp(log(random_double())*ninv);

  double qu1real = -nreal + 1.0 + Nreal;
  double negSreal;
  double U; 
  double X; 
  double y1; 
  double y2; 
  double top; 
  double bottom;
  

  while(n > 1 && threshold < N)
  {
    nmin1inv=1.0/(-1.0+nreal);

    while(1)
    {
      while(1)
      {
        X = Nreal * (-Vprime + 1.0);
        S = floor(X);

        if(S < qu1)
        {
          break;
        }

        Vprime = exp(log(random_double()) * ninv);
      }

      U = random_double();
      
      negSreal = -S;
      
      y1 = exp(log(U*Nreal/qu1real) * nmin1inv);
      
      Vprime = y1 * (-X / Nreal+1.0) * (qu1real / (negSreal + qu1real));
      
      if(Vprime <= 1.0)
      {
        break;
      }
      
      y2 = 1.0; 
      
      top = -1.0 + Nreal;       
      
      if(-1+n > S)
      {
        bottom = -nreal + Nreal;
        limit = -S + N;
      }
      else
      {
        bottom = -1.0 + negSreal + Nreal; 
        limit = qu1;
      }
      
      for(t = N-1; t >= limit; t--)
      {
        y2=(y2 * top) / bottom;
        top--; 
        bottom--;
      }
      
      if(Nreal / (-X + Nreal) >= y1 * exp(log(y2) * nmin1inv))
      {
        Vprime = exp(log(random_double()) * nmin1inv);
        break;
      }

      Vprime = exp(log(random_double()) * ninv);
    }

    j += S + 1;

    a[i++] = j;

    N = -S + (-1 + N); 

    Nreal = negSreal + (-1.0 + Nreal);

    n--;
    nreal--;
    ninv = nmin1inv;

    qu1 = -S + qu1;
    qu1real = negSreal + qu1real;

    threshold += negalphainv;
  }

  if(n>1)
  {
    vitter_a(a+i,n,N,j); // if i>0 then n has been decremented
  }
  else
  {
    S = floor(N * Vprime);

    j += S + 1;

    a[i++] = j;
  }

}

void vitter_a(int64_t *a, int64_t n, int64_t N, int64_t j) //Method A
{
  int64_t S, i = 0;
  double top = N - n, Nreal = N, V, quot;

  while(n >= 2)
  {
    V = random_double(); 
    S=0;
    quot=top/Nreal;

    while (quot>V)
    {
      S++; top--; Nreal--;
      quot = (quot * top)/Nreal;
    }

    j+=S+1;

    a[i++]=j;

    Nreal--;

    n--;
  }

  S = floor(round(Nreal) * random_double());

  j+=S+1;

  a[i++]=j;

}

String Interning Done Right

Refresher

The common method for interning strings breaks in fantastic ways. In Kerf, we’ve taken the old method and revised it for success in the current generation of languages.

If you’ve forgotten what string interning is from your time with Java, it’s a simple way of ensuring that any string appears only once. So for instance, if I have ten objects whose type is the string “lion”, I don’t have ten copies of the string “lion” in memory. What I have instead is a single copy of “lion” and then ten links which all point to the single copy somehow.

Most often these links are pointers (raw addresses in memory). We’ll discuss how this breaks soon. The reference copy of “lion” is never allowed to be moved, changed, or released. It’s permanent. The other implementation detail to figure out is how to keep the strings unique. The next time we create a “lion” string we need to trade our string in for a link, and what is usually done is that the reference copy is stored in a hashtable or some other deduplicating data structure. This lets us figure out that lion already exists, and when we perform the check we can walk away with our link at the same time. If the phrase is “bullmation”, and it doesn’t already exist, then we can add the initial copy to the data structure at the same time.

intern-diagram-01

One benefit of string interning is that you don’t waste space if you’re storing the same long string over and over again.

There is another benefit to string interning, which is that it converts a list of variable-length strings to a fixed-width array. An array of pointers may be indexed into easily, just as any other array. It is not possible to index into a list of variable-length strings without additional overhead.

One really bad thing about string interning is that you usually can’t get rid of any string you’ve seen. If your language is naively storing pointers, you have no way of knowing when it gets rid of the last one. This means that the more strings you come across the more memory you lose forever. Simple parsing jobs can cause languages with string interning to run out of memory.

Java most famously uses string interning, but the concept originated in early Lisp. It shows up in some form or another in most languages. Kerf’s authors learned much about string interning from K, which got it seemingly directly from Lisp, and we’ll get to that in a second.

The Java Language

Java likes string interning because it provides a shortcut method for testing string equality. To test the equality of two strings in Java you simply check whether the links are equal. Links are a small fixed size, but strings can be any length, so testing takes O(1) time instead of O(n). Java also touts that this test can be accomplished with pointers, as in “string1 == string2” instead of the more verbose “string1.equals(string2)”: this does little other than to generate trivia and serve as a further point of confusion.

Java is half-hearted in its commitment to interned strings. Performance reasons force Java to expose multiple other categories of string types. Character arrays are a basic type in Java, and character buffers are a special object. You can see why Java needs these types. With permanent default interning, any sort of sequence involving character-level appends, such as reading the contents of a book from a file, would result in a preposterous O(n²) version of an otherwise trivial technique. The multiple categories of string objects leads to a fragmented ecosystem and a real headache.

What is exasperating about Java, is that they probably shouldn’t have exposed string interning at all. The kind of string editing operation that devolves to O(n²), and pollutes the pool, is by far the most common use case for Java strings. And the kind of use case that demands interning, frequent equality testing, such as in a database, rarely appears in Java, which is not an appropriate language for that level of performance anyway. So they got it backwards. What they should’ve done is made “lion” an uninterned character array, and suffered the penalty on equality testing, which in Java applications would be unnoticeable.

(Commenters point out that Java has changed over the years, so the preceding refers only to Java as I was taught.)

The K Language

Arthur Whitney’s K series of languages makes more appropriate use of string interning. K supports both interned and non-interned strings, but correctly makes the basic string type non-interned, and reserves the interned type for special occasions.

K’s string interning uses raw pointers. K calls interned strings “symbols”—a name adopted from Lisp—and leans heavily on interning’s constant-time comparison benefits. The most common use for a symbol in K is for representing a stock ticker, such as ‘IBM’ or ‘MSFT’, and K stores long columns of these symbols. If you do a table scan, for instance, there’s a savings in doing a single pointer comparison over doing a character-by-character comparison each time.

Internally, K depends on symbols for a variety of other speedups. It uses linear-time lookup dictionaries, instead of constant hash lookups, so interning is a necessity there. Variable lookup also depends on interning, and uses the same linear-time dictionaries. Without interning the core parts of the language would be appreciably slower.

Whitney’s thinking about symbols informs much of our views on the subject, and we consider our thoughts on the matter to be the natural extension of his. Working with his symbol implementation helped us to see where to take the concept next.

Modern Problems

What’s broken in previous implementations of string interning is the global, per-process pool for immutable strings, and the use of memory pointers to reference them. So, everything.

The core issue is that process-level interning is the wrong way to think about strings. The evolving view is that the process is a single, individually-unimportant node in the graph of some inter-process operation. So focusing on the process as the end-all is the wrong way to go about it. Dynamic pointers are one of the big offenders here. But so is the global pool. Processes change from launch to launch, and while it is usually only modest overhead to rebuild an intern pool, the transitory nature of the process and the oracular nature of the pool make global interning an awkward choice. This is most noticeable when writing objects to disk or sending them across the network. As soon as the process needs to communicate the scheme breaks down. The pointer scheme makes life easier for the language implementor, at least at first, but it makes it harder for everything and everyone else involved in supporting the language.

In the old system, each process is an island. A pointer from one process is useless to another process, or to a file on disk. At a minimum this adds an extra layer of serialization and deserialization. What should you turn the pointer into (null delimited strings, string objects?)? Do you send the whole pool each time? How do you avoid storing a fixed-width column as a variable-length one?

One common and fairly bad way to send an array of interned strings across the network is to send them as a null-delimited list of variable length strings. So if you have the string “manticore” repeated hundreds of times, some systems actually transfer the entire null delimited string hundreds of times, with no deduplication. The strategy here is to replace all fixed-width pointers with variable-length null delimited strings.

Another poor technique we’ve seen is a scheme for writing a column of symbols to disk. In this case, the product writes a separate deduplicated null-terminated list of symbols for each table, that effectively represents a static copy of the intern pool. This symbol list is shared among all columns. Then the individual column is written, which consists of a fixed-width array of numeric indices into the symbol list. So the pointers have been translated into indices into this list.

When it comes time to read a column from the disk, the entire symbol list must be loaded into memory, and then the symbol column must be translated from indices into dynamic pointers. There is overhead in inserting the symbol from the list into the pool (rehashing), and overhead in translating to pointers. This translation is always necessary because you can’t use the same integer mapping from two different tables (to see why, consider a table where ‘1’ is ‘YHOO’ and another where ‘1’ is ‘SBUX’). More generally, such symbol lists cannot be shared, though this was often desirable, and they trivially fall out of sync between tables and processes.

Depending on how you want to look at it, the “global pool of pointers” method forces you to use three distinct representations (memory, disk, and network) and at least four translation layers: back and forth from memory to the disk, and back and forth from memory to the network.

Kerf’s Solution

The right way to think about interning is to reduce the scope of the intern pool. In Kerf, instead of maintaining one global pool per process, we maintain local pools on a per-object basis. We assume that the process is one of many, and we avoid the idea of a global pool. Then we get rid of dynamic pointers. Kerf indexes into object-local pools using numeric indices. This works exceptionally well for array languages, but the technique extends to regular imperative languages as well.

This technique allows objects to have the same representation in-memory, on disk, and transferred over the network. No translation layer is necessary. Objects with interned strings can now be mapped. The technique also allows Kerf to get away with using a single string type, instead of several incompatible ones.

Kerf is an array language, and to support string interning it exposes a special kind of array called an ‘Enum’. An Enum works like you would expect a regular array to work, with the exception that, internally, the array deduplicates objects. So the following list, when stored as an Enum,

['lion', 'lion', 'lion']

appears to be a regular array, but actually only stores one copy of the string ‘lion’ in an associated hashset, and then stores three integer references to inside the hashset. The pool is dynamic. You could use a hashtable or a regular array/list instead of a hashset, though those options are less suitable. When you insert an item into the enum it checks to see whether that item exists, and adds only an integer reference. Same story when you concatenate a list to the enum (except optimized), and so on. But in every instance the enum offers the same functionality as a regular array. There is no limit to the number of interned items, and Kerf’s enum support arbitrary data types, not just strings. Kerf uses enums for columns in a table, say.

intern-diagram-02

The natural objection to this is that, if you use multiple columns, you create multiple copies of the strings. In practice this objection turns out not to matter. The number of pools in which a given string may appear is relatively small (perhaps 10). The pools themselves are small (anywhere from ten entries to one million entries). The size of the pools is dominated by the number of indices referencing them (say billions of indices). Reference counted objects means that all of this matters even less anyways.

Let’s take an example. There are about 3,000 tickers traded on the NASDAQ. If we assume they’re 10 characters each (they’re less), that’s only about 30k in overhead per pool. With ten enumerated columns, that’s only 300k in overhead. A daily column may have 30M entries, a historical column may have thousands of times that. So for stock prices we’re talking starting duplication ratios of about 0.1% to thousands of times less. In practice, because the local pools can be reference counted, there may not be any duplication at all.

The second objection is that it is not immediately obvious how to reconcile two separate objects with different pools. We’ve built an entire language on top of this architecture, and we can say from experience that in practice this also turns out to be simple. These cases appear less frequently than you might think. Standard unoptimized tools handle basically all of these without any additional effort. In cases where additional effort is necessary, the architecture turns out to be the right tool for the job. Consider the example of optimizing the case where two potentially different lists of interned strings are tested for equality. You cannot simply compare the vectors of indices, because the adjoining pools might be composed differently. What you can do, however, is create an O(c) lookup table that handles the translation. Because the pools are small, the lookup tables are guaranteed to be small as well. The operation remains linear in the number of strings.

All sorts of interesting results pop out of this theory of string interning. Perhaps the most impressive is how easy it is to grade (sort) interned strings in O(n) time instead of O(n*lg n) time. I’ll skip over the details here, but effectively, you get to use a counting sort on a list of integers, which is a linear sort, and likely at the practical lower bound. Imagine that: linear sorting for strings. It works so well that I suspect there are cases where you want to enumerate a regular list before sorting it. I find this to be incredible. Other results like this one continue to appear naturally, which to us is strong evidence this is the right path for string interning.

An Introduction to Combinators

One of the most powerful features of Kerf are the combinators. Kerf is a vector language. You can operate on vectors and matrices as if they are natural data units, achieving interpreter speedups over looping constructs. While everyone is used to for loops in an interpreter, interpreters do very badly on them, which is why they always encourage people programming in R and Matlab to use the vector constructs if available. If you think about what goes on in various kinds of interpreter, you realize that there is a lot going on inside of a for loop.
Depending on how the interpreter is implemented, you may have to parse each line in the loop for every iteration in the loop; you have to evaluate test conditions, maintain state and so on.

Even if your loop is trivial and does some element wise operation on vectors it ends up going slower than it would in a vector operation. Interpreters need to check things, build stacks, move things around in a for loop. For example:

a=rnorm(1000000)
sumint <- function(x){
x0=0
 for(i in 1:length(x)) {
 x0 = x0+ x[i]
 }
 x0
}
system.time(sumint(a))
   user  system elapsed 
  0.361   0.005   0.367 
system.time(sum(a))
   user  system elapsed 
  0.001   0.000   0.001

Compilers can generally optimize down to the metal, so that’s one way out. In Kerf, you can use combinators to help the interpreter avoid these problems. For example, summation amounts to putting + in between all the elements of a vector. Fold puts the function to its left in between all of the elements of the thing on the right.

timing(1)
a:rand(1000000,1.0)
+ fold a
 1ms

This is a trivial application, but it illustrates the power of the idea, and its speed in action. Tell the interpreter to do a lot of things at once, and it will be done at close to machine speed, even if there is no compiled primitive to accomplish the task. With a good set of combinators you can achieve the satori known as “No Stinking Loops.”

Sum is a trivial example (there is a primitive in Kerf for this), but one which illustrates the power of the idea. Fold takes any function with two inputs and sticks it in between elements

  function foo(a,b){ sqrt(a+b)}
foo fold range(10)
  3.51283

unfold is like fold, except you keep your intermediate results around

function foo(a,b){ sqrt(a+b)}
foo unfold range(10)
  3.51283
  [0, 1.0, 1.73205, 2.17533, 2.48502, 2.73588, 2.95565, 3.15526, 3.33995, 3.51283]

mapdown can be used with unary functions or binary functions. While mapdown is pretty similar to applying a unary function to a list, it retains the shape of a map or table:

count mapdown   {a:1 2 3,b:1 2 }
  [3, 2]

For binary functions, it assumes the left and right data arguments have the same shape:

1 2 3 + mapdown 4 5 6
  [5, 7, 9]
 2 3 + mapdown 4 5 6
         ^
 Conformable error

Just as “fold” is a lot like “reduce,” “mapdown” is like “map” in the standard functional programming model that gets so much attention.


mapback is one of my favorite combinators, as this sort of operation is incredibly dreary to write in a for loop. Have you ever wanted to run an operation on an element and the element before it in a vector? I know I have. Calculating differences for example.

reverse range(10)
  [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
- mapback reverse range(10)
[9, -1, -1, -1, -1, -1, -1, -1, -1, -1]

While many language give you primitives for calculating differences on a vector, only a few APLs give you the ability to do this in general. To see what is going on, use the binary function “join”

join(1,2)
  [1, 2]

join mapback reverse range(10)
  [9, 
 [8, 9], 
 [7, 8], 
 [6, 7], 
 [5, 6], 
 [4, 5], 
 [3, 4], 
 [2, 3], 
 [1, 2], 
 [0, 1]]

mapright maps binary verb and data on the left hand side to each of the values on the right side,

1 2 + mapright 1 2 3
  [[2, 3], 
 [3, 4], 
 [4, 5]]
1 2 join mapright 1 2 3
  [[1, 2, 1], 
 [1, 2, 2], 
 [1, 2, 3]]

mapleft, as you might expect, maps the data on the right to binary verb on the left hand side to each of the values on the left hand side.

1 2 + mapleft 1 2 3
  [[2, 3, 4], 
 [3, 4, 5]]
1 2 join mapleft 1 2 3
  [[1, 1, 2, 3], 
 [2, 1, 2, 3]]

Sometimes mapleft and mapright evaluate to the same thing. Remembering that the equals sign and “equals” are binary functions in Kerf, we can use mapright to create a diagonal array.

range(5) equals mapright range(5)
  [[1, 0, 0, 0, 0], 
 [0, 1, 0, 0, 0], 
 [0, 0, 1, 0, 0], 
 [0, 0, 0, 1, 0], 
 [0, 0, 0, 0, 1]]

range(5) equals mapleft range(5)
  [[1, 0, 0, 0, 0], 
 [0, 1, 0, 0, 0], 
 [0, 0, 1, 0, 0], 
 [0, 0, 0, 1, 0], 
 [0, 0, 0, 0, 1]]

“converge” is a classic combinator from functional programming which operates on unary functions. For binary functions, it is the same thing as “fold.” It has a brother function in deconverge which allows you to see the intermediate steps. There are two ways to use it.

The simplest usage of converge is to execute a function on a value, then its result, until that value no longer changes. Non numeric people probably don’t run into this very often, but numerics guys are always doing this. Rather than constructing some elaborate while loop, you can just write the recursion relation down. For example, calculating the golden ratio:

{[x] 1+1/x} converge 1
  1.61803

“deconverge” is the same thing, except it keeps around the intermediate results in a list. This is useful for debugging purposes, but this intermediate list might be something you’re also interested in.

{[x] 1+1/x} deconverge 1 
  [1, 2.0, 1.5, 1.66667, 1.6, 1.625, 1.61538, 1.61905, 1.61765, 1.61818, 1.61798, 1.61806, 1.61803, 1.61804, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803]

The other way to use converge is to declare you’re going to do something to the right hand value, left hand value number of times:

4 {[x] x * 2} converge [4,2]
  [64, 32]

deconverge applied to the same lambda function:

4 {[x] x * 2} deconverge [4,2]
  [[4, 2], 
 [8, 4], 
 [16, 8], 
 [32, 16], 
 [64, 32]]

Or, in the case of the Golden Ratio, you might be interested in just running 10 iterations to see what’s going on (and make sure it doesn’t run forever; don’t worry though, ctrl-C still works):

10 {[x] 1+1/x} deconverge 1
  [1, 2.0, 1.5, 1.66667, 1.6, 1.625, 1.61538, 1.61905, 1.61765, 1.61818, 1.61798]

Combinators are one of the things that makes Kerf different. They’re the same idea as adverbs in J or K, but they represent the patterns that are most often productively used. Personally, I have always hated writing loops; it drives me bananas having to say, “OK, computer, I want to count through my list this way and do these things to it: start at index 0, then step forward by 1.” Too many ways to insert a typo in the loop, and you get sick of typing the same damn thing, over and over again. Even if they run slower than a loop, I’d want some kind of combinator thing, just to make my life easier and less error prone.

Timestamps done right

I’ve used a lot of tools meant for dealing with time series. Heck, I’ve written a few at this point. The most fundamental piece of dealing with timeseries is a timestamp type. Under the covers, a timestamp is just a number which can be indexed. Normal humans have a hard time dealing with a number that represents seconds of the epoch, or nanoseconds since whenever. Humans need to see things which look like the ISO format for timestamps.

Very few programming languages have timestamps as a native type. Some SQLs do, but SQL isn’t a very satisfactory programming language by itself. At some point you want to pull your data into something like R or Matlab and deal with your timestamps in an environment that you can do linear regressions in. Kerf is the exception.

Consider the case where you have a bunch of 5 minute power meter readings (say, from a factory) with timestamps. You’re probably storing your data in a database somewhere, because it won’t fit into memory in R. Every time you query your data for a useful chunk, you have to parse the stamps in the chunk into a useful type; timeDate in the case of R. Because the guys who wrote R didn’t think to include a useful timestamp data type, the DB package doesn’t know about timeDate (it is an add on package), and so each timestamp for each query has to be parsed. This seems trivial, but a machine learning gizmo I built was entirely performance bound by this process. Instead of parsing the timestamps once in an efficient way into the database, and passing the timestamp type around as if it were an int or a float, you end up parsing them every time you run the forecast, and in a fairly inefficient way. I don’t know of any programming languages other than Kerf which get this right. I mean, just try it in Java.

Kerf gets around this by integrating the database with the language.

Kerf also has elegant ways of dealing with timestamps within the language itself.

Consider a timestamp in R’s timeDate. R’s add-on packages timeDate + zoo or xts are my favorite way of doing such things in R, and it’s the one I know best, so this will be my comparison class.

require(timeDate) 
a=as.timeDate("2012-01-01")
GMT
[1] [2012-01-01]

In Kerf, we can just write the timestamp down

a:2012.01.01
  2012.01.01

A standard problem is figuring out what a date is relative to a given day. In R, you have to know that it’s basically storing seconds, so:

as.timeDate("2012-01-01") + 3600*24
GMT
[1] [2012-01-02]

Kerf, just tell it to add a day:

2012.01.01 + 1d
  2012.01.02

This gets uglier when you have to do something more complex. Imagine you have to add a month and a day. To do this in general in R is complex and involves writing functions.

In Kerf, this is easy:

2012.01.01 + 1m1d
  2012.02.02

Same story with hours, minutes and seconds

2012.01.01 + 1m1d + 1h15i17s
  2012.02.02T01:15:17.000

And if you have to find a bunch of times which are a month, day, hour and 15 minutes and 17 seconds away from the original date, you can do a little Kerf combinator magic:

b: 2012.01.01 + (1m1d + 1h15i17s) times mapright  range(10)
  [2012.01.01, 2012.02.02T01:15:17.000, 2012.03.03T02:30:34.000, 2012.04.04T03:45:51.000, 2012.05.05T05:01:08.000, 2012.06.06T06:16:25.000, 2012.07.07T07:31:42.000, 2012.08.08T08:46:59.000, 2012.09.09T10:02:16.000, 2012.10.10T11:17:33.000]

The mapright combinator runs the verb and noun to its right on the vector which is to the left. So you’re multiplying (1m1d + 1h15i17s) by range(10) (which is the usual [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] ), then adding it to 2012.01.01.

You can’t actually do this in a simple way in R.  Since there is no convenient token to add a month, you have to generate a time sequence with monthly periods. The rest is considerably less satisfying as well, since you have to remember to add numbers. In my opinion, this is vastly harder to read and maintain than the kerf line.

b=timeSequence(from=as.timeDate("2012-01-01"),length.out=10,by="month") + (3600*24 + 3600 + 15*60 + 17) *0:9
 [2012-01-01 00:00:00] [2012-02-02 01:15:17] [2012-03-03 02:30:34] [2012-04-04 03:45:51] [2012-05-05 05:01:08] [2012-06-06 06:16:25] [2012-07-07 07:31:42] [2012-08-08 08:46:59] [2012-09-09 10:02:16] [2012-10-10 11:17:33]

This represents a considerable achievement in language design; an APL which is easier to read than a commonly used programming language for data scientists. I am not tooting my own horn here, Kevin did it.

If I wanted to know what week or second these times occur at, I can subset the implied fields in a simple way in Kerf:

b['week']
  [1, 6, 10, 15, 19, 24, 28, 33, 37, 42]
b['second']
  [0, 17, 34, 51, 8, 25, 42, 59, 16, 33]

I think the way to do this in R is with the “.endpoints” function, but it doesn’t seem to do the right thing

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04 LTS
other attached packages:
[1] xts_0.9-7         zoo_1.7-12        timeDate_3012.100

.endpoints(b, on="week")
 [1]  0  1  2  3  4  5  6  7  8  9 10
.endpoints(b, on="second")
 [1]  0  1  2  3  4  5  6  7  8  9 10

You can cast to a POSIXlt and get the second at least, but no week of year.

as.POSIXlt(b)$week
NULL
as.POSIXlt(b)$sec
 [1]  0 17 34 51  8 25 42 59 16 33

Maybe doing this using one of the other date classes, like as.Date…

 weekGizmo<-function(x){ as.numeric(format(as.Date(time(x))+3,"%U")) }

Not exactly clear, but it does work. If you have ever done things with time in R, you will have had an experience like this. I’m already reaching for a few different kinds of date and time objects in R. There are probably a dozen kinds of timestamps in R which do different subsets of things, because whoever wrote them wasn’t happy with what was available at the time. One good one is better. That way when you have some complex problem, you don’t have to look at 10 different R manuals and add on packages to get your problem solved.

Here’s a more complex problem. Let’s say you had a million long timeseries with some odd periodicities and you want to find the values which occur at week 10, second 33 of any hour.

ts:{{pwr:rand(1000000,1.0),time:(2012.01.01 + (1h15i17s times mapright  range(1000000)))}}
timing(1)
select *,time['second'] as seconds,time['week'] as weeks from ts where time['second']=33 ,time['week'] =10

┌────────┬───────────────────────┬───────┬─────┐
│pwr     │time                   │seconds│weeks│
├────────┼───────────────────────┼───────┼─────┤
│0.963167│2012.03.01T01:40:33.000│     33│   10│
│0.667559│2012.03.04T04:57:33.000│     33│   10│
│0.584127│2013.03.06T05:06:33.000│     33│   10│
│0.349303│2013.03.09T08:23:33.000│     33│   10│
│0.397669│2014.03.05T01:58:33.000│     33│   10│
│0.850102│2014.03.08T05:15:33.000│     33│   10│
│0.733821│2015.03.03T22:50:33.000│     33│   10│
│0.179552│2015.03.07T02:07:33.000│     33│   10│
│       ⋮│                      ⋮│      ⋮│    ⋮│
└────────┴───────────────────────┴───────┴─────┘
    314 ms

In R, I’m not sure how to do this … you’d have to use a function that outputs the week of year then something like this (which, FWIIW, is fairly slow) function to do the query.

require(xts)
ts=xts(runif(1000000), as.timeDate("2012-01-01") + (3600 + 15*60 + 17) *0:999999)
weekGizmo<-function(x){ as.numeric(format(as.Date(time(x))+3,"%U")) }
queryGizmo newx
 newx[(wks==10) & (secs==33)]
}
system.time(queryGizmo(ts))
   user  system elapsed 
  4.215   0.035   4.254

The way R does timestamps isn’t terrible for a language designed in the 1980s, and the profusion of time classes is to be expected from a language that has been around that long. Still, it is 2016, and there is nothing appreciably better out there other than Kerf.

Lessons for future language authors:

  1. If you query from a database, that type needs to be propagated through to the language as a first class type. If this isn’t possible to do directly, there should be some way of quickly translating between classes in the DB and classes that doesn’t involve parsing a string representation.
  2. timestamps should be a first class type in your programming language. Not an add on type as in R or Python or Java.
  3. timestamps should have performant and intuitive ways of accessing implied fields
  4. it would be nice if it handles nanoseconds gracefully, even though it is hard to measure nanoseconds.