Kerf meets the smartgrid

Kerf is primarily designed as a tool for dealing with data streams. The most obvious data streams which bring in customers are stock market feeds, but there are lots of other interesting data streams in the world. Enough that people are starting to have a name for this “vertical,” aka the “internet of things.” One of the “things of the internet” I’ve worked with in the past is smart meter data. This is a stream of electrical power consumption data from individual power meters in homes, commercial sites, factories, electric car chargers and what have you. While everyone knows that markets create big streams of potentially interesting data, the data from power grids is also big and interesting. It’s big because a typical city might have millions of power meters in it, recording data at minute or quarter-hour increments. It’s interesting because power companies need to know what is going on with their networks to provide the uninterrupted electrical service we all take for granted. It’s also very interesting to power companies which rely on “renewables” such as wind or solar, which are not steady or controllable producers of power. If you have a power grid which relies mostly on renewables, you have all kinds of complicated operations research problems to deal with during times when it is cloudy, dark or the breezes are calm.

Problems

The data volume of “internet of things” applications can be unwieldy to manage in standard tools, but there are analytics challenges as well. A standard challenge for smartgrid stuff is doing forecasts. Such models can get really hairy. Every smart meter installation is quite different from every other: a power meter for a car charging station is vastly different from that of a school, an apartment or a house, and all of those are different from one another. The “users” of each apartment and house have different schedules, appliances and devices. Getting this right in an automated way can pay big dividends: you can provide services to the people with the meter, and you can do more accurate forecasts at the city-wide level if you understand the generating process at the individual meter level.

If you want to get something like this right in an automated fashion, it’s a fairly large project involving paying people like me to think about it for a few months. Doing all this in an automated, general and reliable way involves feature creation, feature selection algorithms, machine learning, data gathering, data cleansing, weather forecasting, wind forecasting, optimization and the kitchen sink. I’d love to eventually have the full suite of tools for such a project available as Kerf primitives, but we don’t know if customers would appreciate or use such things, so I’ll make do with some small steps for this blog post.

Let’s say we have some simple meter data with the average power usage per installation. No temperature or weather data here, so we’ll pretend these things are not important.

KeRF> meta_table(smeter)

┌────────────┬────┬────────────┬────────────┬───────┐
│column      │type│type_name   │is_ascending│is_disk│
├────────────┼────┼────────────┼────────────┼───────┤
│       Power│  -3│float vector│           0│      1│
│        time│  -4│stamp vector│           0│      1│
│installation│   8│        enum│           0│      1│
└────────────┴────┴────────────┴────────────┴───────┘

len smeter
  21894311

A standard observation about forecasting smart grid stuff is that electricity use is periodic. People use different amounts of electricity depending on what hour of the day it is, and what month of the year it is. These are pretty good features that would go into any machine learning gizmo which would ultimately solve this problem.

A simple thing to do toward building a forecasting model for this sort of data would be to pull out some average baselines for hour of the day and month of the year, while adding some other quantities for debugging purposes:

tmp:select avg(Power) as PowerAvg,max(Power) as PowerMax,std(Power) as PowerStd,min(Power) as PowerMin from smeter where time>2014.01.01,group by time['month'],time['hour'],installation
tmp: select time as month,time1 as hour,* from tmp
tmp1:delete_keys(tmp,['time','time1'])
tmp:tmp1


┌─────┬────┬────────────┬────────┬────────┬────────┬─────────┐
│month│hour│installation│PowerAvg│PowerMax│PowerStd│PowerMin │
├─────┼────┼────────────┼────────┼────────┼────────┼─────────┤
│    9│   5│    1ce2e29f│ 113.889│ 363.697│ 77.6535│      0.0│
│    9│   6│    1ce2e29f│ 161.965│ 586.212│ 123.717│  10.7606│
│    9│   7│    1ce2e29f│  183.03│  905.73│ 186.052│      0.0│
│    9│   8│    1ce2e29f│ 252.629│ 1125.37│ 262.054│      0.0│
│    9│   9│    1ce2e29f│ 268.239│  1234.6│  281.07│  7.78113│
│    9│  10│    1ce2e29f│  270.93│ 1189.17│ 273.545│      0.0│
│    9│  11│    1ce2e29f│ 283.249│ 1186.31│ 288.657│      0.0│
│    9│  12│    1ce2e29f│ 281.222│ 1170.18│ 282.604│0.0728503│
│    9│  13│    1ce2e29f│ 284.771│ 1159.36│ 276.452│ 0.691112│
│    9│  14│    1ce2e29f│ 291.867│ 1194.55│ 280.146│  2.42096│
│    9│  15│    1ce2e29f│ 289.168│ 1342.17│  273.67│      0.0│
│    9│  16│    1ce2e29f│ 263.163│ 1081.73│ 254.826│      0.0│
│    9│  17│    1ce2e29f│ 177.601│ 905.783│ 162.548│  8.05548│
│    9│  18│    1ce2e29f│ 153.921│ 564.091│  125.28│  9.07539│
│    9│  19│    1ce2e29f│ 154.173│ 597.273│ 130.062│      0.0│
│    9│  20│    1ce2e29f│ 157.302│ 618.441│ 135.536│      0.0│
└─────┴────┴────────────┴────────┴────────┴────────┴─────────┘

The snippet of the result included above gives the average power for different months and hours for each meter installation. You can see that the meter shown uses more power at noon than it does in the early morning, at least for September, which is more or less what you’d expect.

You could use these averages as a sort of baseline model, either predicting each month/hour as being what it was before, or attempting to build on them in some other way. But you can see that the peak usage is very high, as is the standard deviation, even in sample, so that’s not likely to work. Finally, a lot of the minima are 0 or very low.
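To put a number on that, take the 13:00 row in the table above: the in-sample ratio of standard deviation to mean is roughly

  PowerStd / PowerAvg ≈ 276.45 / 284.77 ≈ 0.97

so even against the very data it was averaged from, an hour-of-day mean is typically off by about its own magnitude. That sets expectations for the error numbers further down.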

Forging ahead anyway with our dumb model, perhaps joining this with our original data set might be a useful thing to do.

smeter['month']: flatten xvals select time['month'] from smeter
smeter['hour']: flatten xvals select time['hour'] from smeter

left_join(smeter,(select month,hour,installation,PowerAvg from tmp),["installation","month","hour"])

┌───────┬───────────────────────┬────────────┬─────┬────┬────────┐
│Power  │time                   │installation│month│hour│PowerAvg│
├───────┼───────────────────────┼────────────┼─────┼────┼────────┤
│53.6267│2014.09.21T05:15:00.000│    f387e7b0│    9│   5│ 101.992│
│41.4469│2014.09.21T05:30:00.000│    f387e7b0│    9│   5│ 101.992│
│86.6634│2014.09.21T05:45:00.000│    f387e7b0│    9│   5│ 101.992│
│ 91.229│2014.09.21T06:00:00.000│    f387e7b0│    9│   6│ 100.237│
│ 65.806│2014.09.21T06:15:00.000│    f387e7b0│    9│   6│ 100.237│
│57.8399│2014.09.21T06:30:00.000│    f387e7b0│    9│   6│ 100.237│
│81.6965│2014.09.21T06:45:00.000│    f387e7b0│    9│   6│ 100.237│
│47.6387│2014.09.21T07:00:00.000│    f387e7b0│    9│   7│ 103.481│
│53.9568│2014.09.21T07:15:00.000│    f387e7b0│    9│   7│ 103.481│
│211.005│2014.09.21T07:30:00.000│    f387e7b0│    9│   7│ 103.481│
│55.2504│2014.09.21T07:45:00.000│    f387e7b0│    9│   7│ 103.481│
│63.5775│2014.09.21T08:00:00.000│    f387e7b0│    9│   8│ 106.195│
│ 47.352│2014.09.21T08:15:00.000│    f387e7b0│    9│   8│ 106.195│
│81.0503│2014.09.21T08:30:00.000│    f387e7b0│    9│   8│ 106.195│
│83.0564│2014.09.21T08:45:00.000│    f387e7b0│    9│   8│ 106.195│
│65.3051│2014.09.21T09:00:00.000│    f387e7b0│    9│   9│ 111.119│
└───────┴───────────────────────┴────────────┴─────┴────┴────────┘

Well, we can see this won’t work, but how badly won’t it work? MAPE (mean absolute percentage error) is easily defined in Kerf, and unlike them other databases, you can do sensible things like drop the (numerous) zero examples from the actual Power readings.

dropInf:{[x] x[which(abs(x)!=inf)]}
def mape(x,y){
 xx:abs(dropInf((x-y)/x));
 sum(xx) / count(xx)} 
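
For reference, what this computes is the usual

  MAPE = (100/n) * sum( |actual - predicted| / |actual| )

with the factor of 100 applied in the select below. Readings where the actual Power is zero divide out to infinity, and dropInf discards them instead of letting them blow up the average.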

 select 100*mape(Power,PowerAvg) as MAPE from smeter where time>2014.01.01 group by installation

┌────────────┬───────┐
│installation│MAPE   │
├────────────┼───────┤
│    f387e7b0│271.342│
│    eeb1e1e1│4898.16│
│    8d525ff4│157.625│
│    9cd89c1e│ 260.53│
│    106c60a8│354.543│
│    4925364d│59.0307│
│    e74a5b3f│101.899│
│    cd1f9a75│138.086│
│            ⋮│      ⋮│
└────────────┴───────┘

This is a failure as a forecasting technique, but it illustrates a Kerf workflow with internet of things data. You can go very fast on a large database even on a crummy little laptop. You can also do better things.

Let’s say you wanted to build some linear regression models based on, say, time of day and month of year. Easily done in Kerf. You need a little helper function to add a constant column, as the standard Kerf least squares primitive doesn’t include an intercept (it assumes beta*x = y).

xx: xvals select month,hour from tmp where installation='6af73e77'
  [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...], 
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...]]

We want this matrix to have a constant laminated to the front of it, so

transpose 1 join mapright transpose xx
  [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...], 
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...], 
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...]]

 

Now, let’s stick it in a convenience function for pulling out C’s and betas for a linear regression model:

y: flatten xvals select PowerAvg from tmp where installation='6af73e77'
Cbeta: {[x, y] lsq((transpose 1 join mapright transpose x), y)}
Cbeta(xx,y)
  [82.8605, -1.35372, 0.302477]
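
Reading the output (the coefficients come back in the same order as the design matrix columns: constant, month, hour), the straight-line fit for this installation is roughly

  PowerAvg ≈ 82.86 - 1.35*month + 0.30*hour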

 

This could be useful if we built some dummy variables. If you regress as we did here, the betas are tied to the integer encodings of month and hour, which is only meaningful if power usage actually grows linearly with the hour of the day or the month of the year. Probably not. You’d probably want to blow those out into dummy variables, one per month of the year or hour of the day. Something like this:

xx: (1+range(12))= mapright flatten xvals select month from tmp where installation = '6af73e77'
  [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
...
Cbeta(transpose xx,y)
  [71.5752, 1.78876, -1.54225, 3.20775, 3.62774, 6.81636, 26.7092, 25.2714, 29.176, 28.5876, 9.84373, -31.3805, -30.5305]
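
Reading this one the same way: the first number is the constant and the rest are one coefficient per month, so the fitted monthly average for this installation is the constant plus the relevant month’s beta, e.g.

  August:   71.58 + 29.18 ≈ 100.8
  December: 71.58 - 30.53 ≈ 41.0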

Using all these parts, we can build regression models based on hour of day and month of year. We can also see patterns across the months of the year: this installation uses more power in the summer months than in winter.

Exponential smoothing is another tool we might want for examining or predicting this sort of data. Kerf doesn’t come with it as a core verb, but we just need to write the definition of exponential smoothing down along with the deconverge combinator; one of those beautiful situations where Kerf reads like clean mathematical notation:

def ewma(x,a) {
 {[z,y] (a*z)+(1-a)*y} deconverge x
}
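
Assuming deconverge scans left to right, with z carrying the running smoothed value and y the next raw reading, this is just the textbook recurrence

  s[1] = x[1]
  s[t] = a*s[t-1] + (1-a)*x[t]

so values of a closer to 1 give heavier smoothing. For example, something like ewma(y, 0.9) would smooth the per-installation hourly averages we pulled out earlier (0.9 being an arbitrary choice of decay here).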

Predicting electrical load a day or week ahead is considered a difficult machine learning problem due to all the nonlinearities involved, and the fact that you’re ultimately predicting what a bunch of monkeys do with light switches. The general solution to this problem is something I’ve done in Lisp (and ported to R) using metric space and information theoretic techniques. When I was doing this, I had absurd overhead from the data storage system. Pulling the pieces of the data in and out of the database was awful. The data was too big to keep in memory all at once, because the DB system I was using was not designed for time series data. Had I some Kerf to solve the problem with, I could have just as easily ported the code from Lisp to Kerf, and the whole thing would have been ridiculously scalable and deployable on an as-is basis, with additional features such as streaming queries and live monitoring immediately available. That’s the power of a tool like Kerf. Programmable database engines allow us to do lots of fun out-of-core calculations in a high level language.

Meanwhile, this data set allows us to develop lots of interesting time-oriented techniques for solving difficult problems. I’ll revisit it in future posts.
