Performance socket nodejs + mysql - mysql

I am developing an application that consists of a server in node.js, basically a socket that listens for incoming connections.
The data that arrives at my server comes from GPS trackers (approximately 30), each of which sends 5 records per minute. So in a minute I will have 5*30 = 150 records, in an hour 150*60 = 9000 records, in a day 9000*24 = 216000 and in a month 216000*30 = 6,480,000 records (about 6.5 million).
In addition to the latitude and longitude, I have to store in the database (MySQL) the cumulative distance of each tracker. Each tracker sends positions to the server, and I have to calculate the kilometers between 2 points every time I receive data (to reduce the work for the database once it holds millions of records).
So the question is, what is the correct way to sum the kilometers and store the result?
I think summing over the entire database is not a solution, because with millions of records it will be very slow. Maybe, every time I have to store a new point (150 times per minute), I can select the last record in the database and then add the newly calculated distance to its cumulative kilometers?

2.5 inserts per second is only a modest rate. 6M records/month -- no problem.
How do you compute the KMs? Compute the distance of the previous GPS reading to the current? Or maybe back to the start? Keep in mind that GPS readings can be a bit flaky; a car going in a straight line may look drunk when plotted every 12 seconds. Meanwhile, I will assume you need some kind of index on (sensor, sequence) to keep track of the previous (or first) reading to do the distance.
But, what will you do with the distance? Is it being continually read out for display somewhere? That is, are you updating some non-MySQL thingie 150 times per minute? If so, you have an app that should receive the new GPS reading, store it into MySQL, read the starting point (or remember it), compute the kms and update the graph. That is, MySQL is not the focus here, but your app is.
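To make the "remember the previous reading and add the new leg" idea concrete, here is a minimal sketch. It assumes the haversine great-circle formula is accurate enough for this purpose and that the app can keep the last point per tracker in memory. The names `haversine_km` and `DistanceTracker` are made up for illustration and are not from the question.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (an approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


class DistanceTracker:
    """Keeps the last point and the running total per tracker in memory."""

    def __init__(self):
        # tracker_id -> (lat, lon, total_km); hypothetical in-memory state
        self._state = {}

    def add_reading(self, tracker_id, lat, lon):
        """Record a new point and return the tracker's cumulative km."""
        prev = self._state.get(tracker_id)
        if prev is None:
            total = 0.0  # first reading: nothing to measure against yet
        else:
            total = prev[2] + haversine_km(prev[0], prev[1], lat, lon)
        self._state[tracker_id] = (lat, lon, total)
        return total
```

The value returned by `add_reading` is what would be stored alongside the new row. After a restart the in-memory state is empty, so the last row per tracker would have to be read back from MySQL once. Filtering out GPS jitter is not handled here.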
As for representation of lat/lng, I reference my cheat sheet to see that FLOAT may be optimal.
Kilometers should almost certainly be stored as FLOAT. That gives you about 7 significant digits of precision. You should decide whether the value represents "meters" or "kilometers". (The precision is the same.)

Related

Alerting framework for incoming traffic

Currently I put hourly traffic (total number of input requests) of my website in a MySQL table. I keep data for the last 90 days.
Every hour, say for the 6th hour, I want to check whether the traffic has increased or decreased beyond some threshold compared with the 6th-hour traffic of the last 7 days or last 30 days. Basically, I see a pattern in the traffic: different hours have different values.
-> Is there some alerting framework which I can use as it is for this purpose?
-> If yes, can someone suggest some open source?
-> If no, I am thinking of keeping a running average of the last 7 days / last 30 days in a MySQL table for every hour, and accordingly writing a script to generate alerts based on those numbers. I am not sure whether I should use the mean, the median or the standard deviation. Can someone enlighten me there?
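One possible shape for the script, as a sketch only: compare the current hour's count with the mean of the same hour on previous days, and flag it when it deviates by more than a chosen number of standard deviations. The function name `is_anomalous` and the default of 3 sigmas are assumptions for illustration. A median-based rule would be a reasonable alternative if the history contains outliers.

```python
from statistics import mean, stdev


def is_anomalous(history, current, num_sigmas=3.0):
    """Return True if `current` deviates strongly from `history`.

    history: request counts for the same hour on previous days
             (e.g. the last 7 or 30 values for the 6th hour).
    """
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        return current != mu  # perfectly flat history: any change stands out
    return abs(current - mu) > num_sigmas * sigma
```

For example, with a history of `[100, 110, 90, 105, 95, 100, 100]` a value of 115 stays within three standard deviations, while 150 does not.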

Rails - how to search through a big data collection and display them in a few-second-intervals?

I have a database table with hundreds of thousands of records that I fetch, and I compute the geographic distance between these records. The problem is that this search takes 15 - 20 seconds, so I am trying to speed it up.
I don't think I can do more with column indexes, as I am grabbing the whole database table. Most of the time is consumed by computing the geographic distance (from longitude and latitude). I don't know if there's a way to speed up this computation.
This task is, in my mind, almost the same as searching for flight tickets, where you set a "from city" point and a "destination city" point and the search engine gradually displays the results it finds to the user at time intervals, like:
it displays some results
in 2 seconds it adds more computed results
in another 2 seconds, more computed results
and so on
I think this way of displaying results would be the best for my case. However, how does such an engine work? How can one build a search that displays new results every 2 seconds?
As the application is written in Ruby on Rails, the options for this kind of search would be:
AJAX
delayed_job
Possibly something else yet?
Or am I thinking about this problem in a wrong way and is there a better one to solve it?
Thank you.

Database design - How much data to store, performance vs quality

There is some value, x, which I am recording every 30 seconds, currently into a database with three fields:
ID
Time
Value
I am then creating a mobile app which will use that data to plot charts in views of:
Last hour
Last 24 hours.
7 Day
30 Day
Year
Obviously, saving every 30 seconds for the last year and then sending that data to a mobile device will be too much (it would mean sending 1051200 values).
My second thought was perhaps I could use the average function in MySQL, for example, collect all of the averages for every 7 days (creating 52 points for a year), and send those points. This would work, but still MySQL would be trawling through creating averages and if many users connect, it's going to be bad.
So simply put, if these are my views, then I do not need to keep track of all that data. Nobody should care what x was a year ago to the precision of every 30 seconds, this is fine. I should be able to use "triggers" to create some averages.
I'm looking for someone to check what I have below is reasonable:
Store values every 30s in a table (this will be used for the hour view, 120 points)
When there are 120 rows in the 30s table (120 * 30s = 60 mins = 1 hour), use a trigger to store the first half hour as an average in a "half hour average" table and remove the first 60 entries from the 30s table. This new table will need to have an id, start time, end time and value. This half hour average will be used for the 24 hour view (48 data points).
When the half hour table has more than 24 entries (12 hours), store the first 12 (6 hours) as an average in a 6 hour average table and then remove them from the half hour table. This 6 hour average will be used for the 7 day view (28 data points).
When there are 8 entries in the 6 hour table, remove the first 4 and store this as an average day, to be used in the 30 day view (30 data points).
When there are 14 entries in the day view, remove the first 7 and store in a week table, this will be used for the year view.
This doesn't seem like the best way to me, as it seems to be more complicated than I would imagine it should be.
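Each step of the cascade above is the same operation: take the oldest rows of the finer table in fixed-size groups and replace each group with its average. A rough sketch of that single step, using a hypothetical helper `roll_up` on a plain list of values (the start/end time bookkeeping and the actual trigger are left out):

```python
def roll_up(values, chunk_size):
    """Average consecutive full chunks of `values`.

    Returns (averages, leftover), where leftover holds the trailing
    values that do not yet fill a whole chunk.
    """
    full = len(values) - len(values) % chunk_size
    averages = [
        sum(values[i:i + chunk_size]) / chunk_size
        for i in range(0, full, chunk_size)
    ]
    return averages, values[full:]
```

With `chunk_size=60` this turns 30-second samples into half-hour averages. The same function could be applied again to those averages for the coarser tables. Which rows stay in the finer table afterwards is a separate decision.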
The alternative is to keep all of the data and let mysql find averages as and when needed. This will create a monstrously huge database. I have no idea about the performance yet. The id is an int, time is a datetime and value is a float. Is 1051200 records too many? Now is a good time to add, I would like to run this on a raspberry pi, but if not.. I do have my main machine which I could use.
Your proposed design looks OK. Perhaps there are more elegant ways of doing this, but your proposal should work too.
RRD (http://en.wikipedia.org/wiki/Round-Robin_Database) is a specialised database designed to do all of this automatically, and it should be much more performant than MySQL for this specialised purpose.
An alternative is the following: keep only the original table (1051200 records), but have a trigger that generates the last hour/day/year etc views every time a new record is added (e.g. every 30 seconds) and store/cache the result somewhere. Then your number-crunching workload is independent of the number of requests/clients you have to serve.
1051200 records may or may not be too many. Test on your Raspberry Pi to find out.
Let me give a suggestion on the physical layout of your table, regardless of whether you decide to keep all data or "prune" it from time to time...
Since you generate a new row "every 30 seconds", Time can serve as a natural key without fear of exceeding the resolution of the underlying data type and causing duplicated keys. You don't need ID in this scenario [1], so your table is simply:
Time (PK)
Value
And since InnoDB tables are clustered, not having secondary indexes [2] means the whole table is stored in a single B-Tree, which is as efficient as it gets from a storage and querying perspective. On top of that, Value is automatically covered, which may not have been the case in your original design unless you specifically designed your index(es) for that.
Using time as a key can be tricky in general, but I think it may be worth it in this particular case.
[1] Unless there are other tables that reference it through FOREIGN KEYs, or you have already written too much code that depends on it.
[2] Which would be necessary in the original design to support efficient aggregation.

Filter records from a database on a minimum time interval for making graph

We have a MySQL database table with statistical data that we want to present as a graph, with timestamp used as the x axis. We want to be able to zoom in and out of the graph between resolutions of, say, 1 day and 2 years.
In the zoomed out state, we will not want to get all data from the table, since that would mean too much data being shipped through the servers, and the graph resolution will be good enough with less data anyway.
In MySQL you can write queries that select only, e.g., every tenth value, which could be usable in this case. However, the intervals between values stored in the database aren't consistent: two values can be separated by as little as 10 minutes or as much as 6 hours, possibly more.
So the issue is that it is difficult to calculate a good stepping interval for the query. If we keep only every tenth value for some resolution, that may work for series with values 10 minutes apart, but for 6 hour intervals we will throw away too much and the graph will end up with too low a resolution for comfort.
My impression is that MySQL can't make a stepping interval depend on time, so that it would skip rows that are, e.g., within five minutes of an included row.
One solution could be to set 6 hours as a minimal resolution requirement for the graph, so we don't throw away values unless 6 hours is represented by a sufficiently small distance in the graph. I fear that this may result in too much data being read and sent through the system if the interval actually is smaller.
Another solution is to have more intelligence in the Java code, reading sets of data iteratively from low resolution and downwards until the data is good enough.
Any ideas for a solution that would enable us to get optimal resolution in one read, without too large result sets being read from the database, while not putting too much load on the database? I'm having wild ideas about installing an intermediate NoSQL component to store the values in, that might support time intervals the way I want - not sure if that actually is an option in the organisation.
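One way to make the step depend on time rather than on row count is to bucket rows by timestamp and keep one aggregated point per non-empty bucket. Dense stretches collapse, while sparse stretches keep every value. A sketch of the idea follows. The function name `downsample` and the choice of averaging within a bucket are assumptions for illustration.

```python
def downsample(points, bucket_seconds):
    """Reduce (unix_ts, value) pairs to one averaged point per time bucket.

    Buckets with no data produce no point, so widely spaced values
    are never discarded.
    """
    buckets = {}  # bucket index -> (sum of values, count)
    for ts, value in points:
        key = ts // bucket_seconds
        total, count = buckets.get(key, (0.0, 0))
        buckets[key] = (total + value, count + 1)
    return [
        (key * bucket_seconds, total / count)
        for key, (total, count) in sorted(buckets.items())
    ]
```

The same grouping can probably be pushed into the query itself, for instance by grouping on `UNIX_TIMESTAMP(ts) DIV bucket_seconds` (where `ts` stands for the timestamp column), so that only the reduced result set leaves the database.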

What is a more efficient way to keep a daily ranking log for each user with MySQL?

I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
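The job such a cron entry would run can be very small. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for MySQL so that it is self-contained. The `day` column (an ISO date string) is a hypothetical name, and a MySQL driver would use a different placeholder syntax.

```python
import sqlite3
from datetime import date, timedelta


def prune_old_rows(conn, today, keep_days=90):
    """Delete RankHistory rows older than `keep_days`; return rows removed."""
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    # ISO date strings sort chronologically, so a plain comparison works
    cur = conn.execute("DELETE FROM RankHistory WHERE day < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Run once a day, this keeps the table at roughly (number of users) x 90 rows instead of letting it grow without bound.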
Another option: can you create some "roll-up" aggregate per user based on whatever the criteria are (counts, sales, whatever), all stored by employee + date of activity? Then you could have your pre-aggregated rollups in a much smaller table for however long in history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without dealing with performance issues. This would also help ease moving such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously consider how important "Ranking" is when you have someone in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000 does it really matter???