Example of how to use CatBoost with time series data - catboost

In the introduction/promo video (https://www.youtube.com/watch?v=s8Q_orF4tcI) you mention that CatBoost can analyse historical time series data for weather forecasting.
But I was not able to find anything like this in the tutorials: https://github.com/catboost/catboost/tree/master/catboost/tutorials

Here are some examples of time series models using CatBoost (no affiliation):
Kaggle: CatBoost - forget about time series
Forecasting Time Series with Gradient Boosting
One thing I have seen mentioned, though I don't have first-hand experience with it, is using the has_time parameter to specify that observations should be processed in order (not randomly permuted) according to a timestamp column.
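For anyone else landing here, below is a minimal sketch (not taken from the official tutorials) of how has_time can be used; the file name, column names, and lag features are made up for illustration.

    # Train a regressor on lag features with has_time=True so CatBoost keeps the
    # rows in their chronological order instead of shuffling them internally.
    import pandas as pd
    from catboost import CatBoostRegressor

    # hypothetical dataset with a 'timestamp' column and a numeric 'target' column
    df = pd.read_csv("sensor_data.csv", parse_dates=["timestamp"])
    df = df.sort_values("timestamp")  # rows must already be in time order

    # simple lag features as regressors (made-up feature engineering)
    lags = (1, 2, 3)
    for lag in lags:
        df[f"target_lag_{lag}"] = df["target"].shift(lag)
    df = df.dropna()

    X = df[[f"target_lag_{lag}" for lag in lags]]
    y = df["target"]

    model = CatBoostRegressor(has_time=True, iterations=500, verbose=100)
    model.fit(X, y)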

Related

Evaluating the performance of variational autoencoder on unlabeled data

I've designed a variational autoencoder (VAE) that clusters sequential time series data.
To evaluate the performance of the VAE on labeled data, I first run KMeans on the raw data and compare the generated labels with the true labels using the Adjusted Mutual Info score (AMI). Then, after the model is trained, I pass the validation data through it, run KMeans on the latent vectors, and compare the generated labels with the true labels of the validation data using AMI. Finally, I compare the two AMI scores to see whether KMeans performs better on the latent vectors than on the raw data.
My question is this: how can we evaluate the performance of the VAE when the data is unlabeled?
I know we can run KMeans on the raw data and generate labels for it, but in this case, since we would be treating the generated labels as the true labels, how can we compare the performance of KMeans on the raw data with KMeans on the latent vectors?
Note: the model is totally unsupervised. Labels (if they exist) are not used in the training process; they are used only for evaluation.
In unsupervised learning you evaluate the performance of a model either with labelled data or by visual analysis. In your case you do not have labelled data, so you need to rely on analysis. One way to do this is by looking at the predictions: if you know how the raw data should be labelled, you can qualitatively evaluate the accuracy. Another method, since you are using KMeans, is to visualize the clusters. If the clusters are well separated into distinct groups, that is usually a good sign; if they are close together and overlapping, the labelling of vectors in those regions may be less accurate. Alternatively, you can use an internal clustering metric (for example the silhouette score) to evaluate the clusters, or come up with your own.
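If it helps, here is a minimal sketch of that last suggestion, assuming X_raw holds the raw (flattened) sequences and Z the latent vectors from the trained VAE encoder; the silhouette score is just one example of an internal metric, and the number of clusters is made up.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    k = 5  # hypothetical number of clusters

    labels_raw = KMeans(n_clusters=k, random_state=0).fit_predict(X_raw)
    labels_latent = KMeans(n_clusters=k, random_state=0).fit_predict(Z)

    # higher is better; each score is computed in its own feature space
    print("silhouette on raw data:    ", silhouette_score(X_raw, labels_raw))
    print("silhouette on latent space:", silhouette_score(Z, labels_latent))

Keep in mind that the two scores are computed in different feature spaces, so treat the comparison as indicative rather than conclusive.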

Forecasting out of sample with Fourier regressors

I'm trying to create a multivariate multi-step-ahead forecast using machine learning (weekly and yearly seasonality).
I use some exogenous variables, including Fourier terms. I'm happy with the results of testing the model with in sample data, but now I want to go for production and make real forecasts on completely unseen data. While I can update the other regressors (variables) since they are dummy variables and related to time, I don't know how I will generate new Fourier terms for the N steps ahead.
I have a conceptual question I want to check with you: when you generate the Fourier terms based on the periodicity and the number of sin/cos pairs used to decompose the time series you want to forecast, this process should be independent of the values of the time series. Is that right?
If so, how do you extend the terms for the N steps?
Just for the sake of completeness, I use R.
Thank you
From what I am reading and understanding, you want to obtain the Fourier terms for the next N steps. To do this, you need to shift your calculated time frame so that it ends at some point in the past (say N-1). This is just simple causality: you cannot model the future from the series values themselves (for example, you can't have x(N-1) = a·x(N+1) + b·x(N-2) + c·x(N)).
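For completeness, here is a minimal sketch (in Python rather than R, purely for illustration) of the premise in the question: Fourier regressors are functions of the time index and the chosen period only, so once the time frame is fixed they can be computed for any future step without knowing the series values. The sizes, period, and order below are made up.

    import numpy as np

    def fourier_terms(t, period, K):
        """sin/cos pairs of order 1..K evaluated at time indices t."""
        t = np.asarray(t, dtype=float)
        cols = []
        for k in range(1, K + 1):
            cols.append(np.sin(2 * np.pi * k * t / period))
            cols.append(np.cos(2 * np.pi * k * t / period))
        return np.column_stack(cols)

    n_train, horizon = 200, 14  # hypothetical training length and forecast horizon
    X_train = fourier_terms(np.arange(n_train), period=7, K=3)                      # in-sample regressors
    X_future = fourier_terms(np.arange(n_train, n_train + horizon), period=7, K=3)  # N steps ahead

(In R, the forecast package's fourier() function with its h argument performs the same out-of-sample extension.)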

Best regression model where some fields may be intentionally blank for some samples

I'm looking to build a regression model where I have time-based variables that may or may not exist for each data sample.
For instance, let's say we wanted to build a regression model where we could predict how long a new car will last. One of the values is when the car gets its first servicing. However, there are some samples where the car never gets serviced at all. In these situations, how can I account for this when building the model? Can I even use a linear regression model or will I have to choose a different regression model?
When I think about it, this is basically the equivalent of having 2 fields: one for whether the car was serviced and if that is true, a second field for when. But I'm not sure how to build a regression that has data that is intentionally missing.
Apply regression without treating the problem as a time series. To capture seasonality in the data, encode the date/time columns into binary (one-hot) columns representing the year, day of year, day of month, day of week, etc.
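A minimal sketch of the two-field idea from the question combined with an ordinary regression; the file and column names are hypothetical.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # hypothetical data: 'first_service_days' is NaN when the car was never serviced
    df = pd.read_csv("cars.csv")

    df["was_serviced"] = df["first_service_days"].notna().astype(int)
    df["first_service_days"] = df["first_service_days"].fillna(0)  # value is moot when indicator is 0

    X = df[["was_serviced", "first_service_days"]]
    y = df["lifetime_days"]

    model = LinearRegression().fit(X, y)

scikit-learn's SimpleImputer with add_indicator=True automates the same pattern when several columns can be intentionally blank.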

Is BatchNorm turned off when inferencing?

I have read several sources that implicitly suggest BatchNorm is turned off for inference, but I have not found a definitive answer.
The most common approach is to use a moving average of the mean and std for your batch normalization, as Keras does for example (https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py). If you simply turn it off, the network will perform worse on the same data, because the inputs are no longer normalized the way they were during training.
This is done by storing a running average of the mean and the std over all the batches used while training the network. At inference time this moving average is used for normalization instead of the per-batch statistics.
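A minimal sketch of that behaviour in tf.keras (illustrating the mechanism, not the linked implementation itself): the same BatchNormalization layer normalizes with per-batch statistics when called with training=True and with the stored moving averages when called with training=False, which is what model.predict uses.

    import numpy as np
    import tensorflow as tf

    bn = tf.keras.layers.BatchNormalization()
    x = np.random.normal(loc=5.0, scale=2.0, size=(32, 10)).astype("float32")

    y_train_mode = bn(x, training=True)   # uses this batch's mean/std and updates the moving averages
    y_infer_mode = bn(x, training=False)  # uses the stored moving averages

    print(bn.moving_mean.numpy()[:3], bn.moving_variance.numpy()[:3])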

How to store historic time series data

We're storing a bunch of time series data from several measurement devices.
All devices may provide different dimensions (energy, temp, etc.)
Currently we're using MySQL to store all this data in different tables (according to the dimension) in the format
idDevice, DateTime, val1, val2, val3
We're also aggregating this data from min -> Hour -> Day -> Month -> Year each time we insert new data
This is running quite fine, but we're running out of disk space as we grow, and in general I doubt that an RDBMS is the right answer for keeping archive data.
So we're thinking of moving old/cold data to Amazon S3 and writing some fancy getter that can retrieve this data.
So here comes my question: what would be a good data format to support the following needs?
The data must be extensible: once in a while a device will provide more data than in the past -> the number of rows can grow/increase
The data must be updated. When a customer delivers historic data, we need to be able to update that for the past.
We're using PHP -> would be nice to have connectors/classes :)
I've had a look at HDF5, but it seems there is no PHP lib.
We're also willing to have a look at cloud-based time series databases.
Thank you in advance!
B
You might consider moving to a dedicated time-series database. I work for InfluxDB and our product meets most of your requirements right now, although it is still pre-1.0 release.
We're also aggregating this data from min -> Hour -> Day -> Month -> Year each time we insert new data
InfluxDB has built-in tools to automatically downsample and expire data. All you do is write the raw points and set up a few continuous queries and retention policies; InfluxDB handles the rest internally.
The data must be extensible: once in a while a device will provide more data than in the past -> the number of rows can grow/increase
As long as historic writes are fairly infrequent, they are no problem for InfluxDB. If you frequently write non-sequential data, write performance can slow down, but only while the non-sequential points are being replicated.
InfluxDB is not quite schema-less, but the schema cannot be pre-defined, and is derived from the points inserted. You can add new tags (metadata) or fields (metrics) simply by writing a new point that includes them, and you can automatically compose or decompose series by excluding or including the relevant tags when querying.
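To make the tags/fields point concrete, here is a minimal sketch using the influxdb Python client (the PHP clients build the same structure); the measurement, tag, and field names are made up. Adding a new tag or field later is just a matter of including it in the next point you write.

    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", port=8086, database="metrics")

    point = {
        "measurement": "energy",                            # per-dimension measurement
        "tags": {"device_id": "dev42", "site": "plant-1"},  # metadata, indexed
        "time": "2015-06-01T12:00:00Z",
        "fields": {"val1": 1.2, "val2": 3.4},               # the actual metrics
    }
    client.write_points([point])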
The data must be updated. When a customer delivers historic data, we need to be able to update that for the past.
InfluxDB silently overwrites points when a new matching point comes in. (Matching means the same series and timestamp, to the nanosecond.)
We're using PHP -> would be nice to have connectors/classes :)
There are a handful of PHP libraries out there for InfluxDB 0.9. None are officially supported, but likely one fits your needs well enough to extend or fork.
You haven't specified your requirements in enough detail.
Do you care about latency? If not, just write all your data points to per-interval files in S3, then periodically collect and process them. (No Hadoop needed; a simple script downloading the new files should usually be plenty fast enough.) This is how logging to S3 works.
The really nice part about this is you will never outgrow S3 or do any maintenance. If you prefix your files correctly, you can grab a day's worth of data or the last hour of data easily. Then you do your day/week/month roll-ups on that data, then store only the roll-ups in a regular database.
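As a minimal sketch of that prefix idea with boto3 (bucket name and key layout are hypothetical): writing points under keys like raw/YYYY/MM/DD/HH/... lets you fetch exactly one day or one hour of data with a prefix listing.

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-measurements"

    # everything for one day / one hour (use a paginator if there are >1000 keys)
    day = s3.list_objects_v2(Bucket=bucket, Prefix="raw/2015/06/01/")
    hour = s3.list_objects_v2(Bucket=bucket, Prefix="raw/2015/06/01/12/")

    for obj in day.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        # ...parse and feed into the day/week/month roll-up job...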
Do you need the old data at high resolution? You can use Graphite to roll up your data automatically. The downside is that it loses resolution as the data ages. But the upside is that your data is a fixed size and never grows, and writes can be handled quickly. You can even combine the above approach and send data to Graphite for quick viewing, but keep the data in S3 for other uses down the road.
I haven't researched the various TSDBs extensively, but here is a nice HN thread about it. InfluxDB is nice, but new. Cassandra is more mature, but the tooling to use it as a TSDB isn't all there yet.
How much new data do you have? Most tools will handle 10,000 datapoints per second easily, but not all of them can scale beyond that.
I'm with the team that develops Axibase Time-Series Database. It's a non-relational database that allows you to efficiently store timestamped measurements with various dimensions. You can also store device properties (id, location, type, etc) in the same database for filtering and grouped aggregations.
ATSD doesn't delete raw data by default. Each sample takes 3.5+ bytes per time:value tuple. Period aggregations are performed at request time, and the list of functions includes MIN, MAX, AVG, SUM, COUNT, PERCENTILE(n), STANDARD_DEVIATION, FIRST, LAST, DELTA, RATE, WAVG, and WTAVG, as well as some additional functions for computing threshold violations per period.
Backfilling historical data is fully supported except that the timestamp has to be greater than January 1, 1970. Time precision is milliseconds or seconds.
As for deployment options, you could host this database on AWS. It runs on most Linux distributions. We could run some storage efficiency and throughput tests for you if you want to post sample data from your dataset here.