Database design for heavy timed data logging - Car Tracking System - mysql

I am making a car tracking system and I want to store the data that each car sends every 5 seconds in a MySQL database. Assume that I have 1000 cars transmitting data to my system every 5 seconds, and the data is stored in one table. At some point I will want to query this table to generate reports for a specific vehicle. I am torn between logging all the vehicles' data in one table and creating a table for each vehicle (1000 tables). Which is more efficient?

OK: 86,400 seconds per day / 5 = 17,280 records per car per day.
For 1000 cars this results in 17,280,000 records per day. This is not an issue for MySQL in general.
And a well-designed table will be easy to query.
If you go for one table per car, what happens when there are 2000 cars in the future?
But the question is also: how long do you want to store the data?
It is easy to calculate when your database will reach 200 GB, 800 GB, 2 TB, ...
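The arithmetic above can be checked with a short script. This is only a sketch: the 50 bytes per row is an assumed figure for illustration (the question does not say how wide a row is), and index overhead is ignored.

```python
SECONDS_PER_DAY = 86_400

def rows_per_day(cars: int, interval_s: int) -> int:
    """Rows written per day when every car reports once per interval."""
    return cars * (SECONDS_PER_DAY // interval_s)

def days_until(size_gb: float, cars: int, interval_s: int, bytes_per_row: int) -> float:
    """Days of logging before the raw row data reaches size_gb (indexes ignored)."""
    daily_bytes = rows_per_day(cars, interval_s) * bytes_per_row
    return size_gb * 1_000_000_000 / daily_bytes

# ASSUMPTION: 50 bytes per row is a made-up figure; measure your real row size.
BYTES_PER_ROW = 50

print(rows_per_day(1, 5))      # records per car per day
print(rows_per_day(1000, 5))   # records per day for 1000 cars
for gb in (200, 800, 2000):
    days = round(days_until(gb, 1000, 5, BYTES_PER_ROW))
    print(gb, "GB after about", days, "days")
```

Plugging in a measured row size instead of the assumed one gives a realistic date for each disk-size milestone.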

One table, not one table per car. A database with 1000 tables will be a dumpster fire when you try to back it up or maintain it.
Keep the rows of that table as short as you possibly can; it will have many records.
Index that table both on timestamp and on (car_id, timestamp). The second index will allow you to report on individual cars efficiently.
Read https://use-the-index-luke.com/
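A minimal sketch of the single-table layout with both indexes. To keep it runnable anywhere it uses Python's built-in SQLite as a stand-in for MySQL, and the table and column names (position_log, ts, lat, lon) are hypothetical. The MySQL DDL would look similar, but the data types would differ.

```python
import sqlite3

# ASSUMPTION: SQLite (in memory) stands in for MySQL so the sketch runs anywhere;
# the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE position_log (
        car_id INTEGER NOT NULL,
        ts     INTEGER NOT NULL,   -- Unix time in seconds
        lat    REAL    NOT NULL,
        lon    REAL    NOT NULL
    );
    CREATE INDEX idx_ts     ON position_log (ts);
    CREATE INDEX idx_car_ts ON position_log (car_id, ts);
""")

# Three cars reporting every 5 seconds for one minute.
rows = [(car, t, 0.0, 0.0) for car in (1, 2, 3) for t in range(0, 60, 5)]
conn.executemany("INSERT INTO position_log VALUES (?, ?, ?, ?)", rows)

def car_report(car_id: int, start_ts: int, end_ts: int) -> list:
    """Rows for one car in a time window; the (car_id, ts) index matches this filter."""
    return conn.execute(
        "SELECT ts, lat, lon FROM position_log "
        "WHERE car_id = ? AND ts BETWEEN ? AND ? ORDER BY ts",
        (car_id, start_ts, end_ts),
    ).fetchall()

report = car_report(2, 10, 30)
print(len(report), "rows for car 2")
```

The per-car report filters on car_id first and then on a time range, which is exactly the column order of the second index.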

This is the "tip of the iceberg". There are about 5 threads here and on dba.stackexchange relating to tracking cars/trucks. Here are some further tips.
Keep datatypes as small as possible. Your table(s) will become huge -- threatening to overflow the disk, and slowing down queries due to "bulky rows mean that fewer rows can be cached in RAM".
Do you keep the "same" info for a car that is sitting idle overnight? Think of how much disk space this is taking.
If you are using HDDs, plan on 100 INSERTs/second before you need to do some redesign of the ingestion process. (1000/sec for SSDs.) There are techniques that can give you 10x, maybe 100x, but you must apply them.
Will you be having several servers collecting the data, then doing simple inserts into the database? My point is that that may be your first bottleneck.
PRIMARY KEY(car_id, ...) so that accessing data for one car is efficient.
Today, you say the data will be kept forever. But have you computed how big your disk will need to be?
One way to shrink the data drastically is to consolidate "old" data into, say, 1-minute intervals after, say, one month. Start thinking about what you want to keep. For example: min/max/avg speed, not just instantaneous speed. Have an extra record when any significant change occurs (engine on; engine off; airbag deployed; etc)
(I probably have more tips.)
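The consolidation tip can be sketched as follows: 5-second samples are collapsed into one summary per car per minute, keeping min/max/avg speed. This is an illustration in plain Python rather than the SQL job you would actually schedule, and the sample data and the (car_id, unix_ts, speed) tuple layout are assumptions.

```python
from collections import defaultdict
from statistics import mean

def rollup_per_minute(samples):
    """Collapse (car_id, unix_ts, speed) samples into one summary per car per minute.

    Returns {(car_id, minute_start_ts): (min_speed, max_speed, avg_speed)}.
    """
    buckets = defaultdict(list)
    for car_id, ts, speed in samples:
        minute_start = ts - ts % 60
        buckets[(car_id, minute_start)].append(speed)
    return {key: (min(v), max(v), mean(v)) for key, v in buckets.items()}

# Hypothetical data: car 7 reports every 5 seconds for two minutes.
samples = [(7, t, 40 + (t // 5) % 3) for t in range(0, 120, 5)]
summary = rollup_per_minute(samples)
print(len(samples), "raw rows ->", len(summary), "summary rows")
```

Each minute of 5-second data shrinks from 12 rows to 1, which is where the drastic reduction for old data comes from.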

Related

Database design suggestions for a data scraping/warehouse application?

I'm looking into the database design for a data-warehouse kind of project which involves a large number of inserts daily. The data archives will later be used to generate reports. I will have a list of users (for example, a set of 2 million users) whose daily social networking activity I need to monitor.
For example, let there be a set of 100 users say U1,U2,...,U100
I need to insert their daily status count into my database.
Consider the total status count obtained for a user U1 for period June 30 - July 6, is as follows
June 30 - 99
July 1 - 100
July 2 - 102
July 3 - 102
July 4 - 105
July 5 - 105
July 6 - 107
The database should keep the daily status count of each user, like
For user U1,
July 1- 1 (100-99)
July 2- 2 (102-100)
July 3- 0 (102-102)
July 4- 3 (105-102)
July 5- 0 (105-105)
July 6- 2 (107-105)
Similarly, the database should hold archived details for the full set of users.
In a later phase, I plan to produce aggregate reports from this data, such as total points scored per day, week, month, etc., and to compare them with older data.
I need to start from scratch. I am experienced with PHP as a server-side language and with MySQL. I am unsure about the database side: since I need to process about a million insertions daily, what should I take care of?
I am confused about how to design a MySQL database for this. Which storage engine should be used, and which design patterns should be followed, keeping in mind that the data will later be used with aggregate functions?
Currently I envision a design with one table storing all the user IDs, referenced by foreign key, and a separate status count table for each day. Could having a lot of tables create overhead?
Does MySQL fit my requirements? 2 million or more DB operations are done every day. What should be considered for the server and the rest of the setup in this case?
1) The database should handle concurrent inserts, at a rate of 1-2 million inserts per day.
Before inserting, I plan to calculate the daily status count, i.e. the difference between today's count and yesterday's.
2) In a later phase, the archived data (collected over past days) is used as a data warehouse and aggregation tasks are performed on it.
Comments:
I have read that MyISAM is the best choice for data warehousing projects, and at the same time heard that InnoDB excels in many ways. Many have suggested that proper tuning is needed to make it work; I would like to get thoughts on that as well.
When creating a data warehouse, you don't have to worry about normalization. You're inserting rows and reading rows.
I'd just have one table like this.
Status Count
------------
User id
Date
Count
The primary (clustering) key would be (User id, Date). Another unique index would be (Date, User id).
As far as whether or not MySQL can handle this data warehouse, that depends on the hardware that MySQL is running on.
Since you don't need referential integrity, I'd use MyISAM as the engine.
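A small sketch of that table and its two keys. SQLite stands in for MySQL here so the example runs as-is, and the names (status_count, cnt, idx_day_user) and the dates are hypothetical.

```python
import sqlite3

# ASSUMPTION: SQLite stands in for MySQL here; names and dates are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE status_count (
        user_id INTEGER NOT NULL,
        day     TEXT    NOT NULL,   -- ISO date, e.g. '2024-07-01'
        cnt     INTEGER NOT NULL,
        PRIMARY KEY (user_id, day)
    );
    CREATE UNIQUE INDEX idx_day_user ON status_count (day, user_id);
""")

conn.executemany(
    "INSERT INTO status_count VALUES (?, ?, ?)",
    [(1, "2024-07-01", 1), (1, "2024-07-02", 2), (2, "2024-07-01", 5)],
)

# One user's history is served by the primary key (user_id, day).
history = conn.execute(
    "SELECT day, cnt FROM status_count WHERE user_id = ? ORDER BY day", (1,)
).fetchall()

# One day across all users is served by the (day, user_id) index.
day_total = conn.execute(
    "SELECT SUM(cnt) FROM status_count WHERE day = ?", ("2024-07-01",)
).fetchone()[0]

# The key also rejects a second row for the same user and day.
try:
    conn.execute("INSERT INTO status_count VALUES (1, '2024-07-01', 9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(history, day_total, duplicate_rejected)
```

The two keys cover the two obvious access paths: everything for one user, and everything for one day.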
As for table design, a dimensional model with a star schema is usually a good choice for a datamart where there are mostly inserts and reads. I see two different granularities for the status data, one for status per day and one for status per user, so I would recommend tables similar to:
user_status_fact(user_dimension_id int, lifetime_status int)
daily_status_fact (user_dimension_id int, calendar_dimension_id int, daily_status int)
user_dimension(user_dimension_id, user_id, name, ...)
calendar_dimension(calendar_dimension_id, calendar_date, day_of_week, etc..)
You might also consider having the most detailed data available even though you don't have a current requirement for it as it may make it easier to build aggregates in the future:
status_fact (user_dimension_id int, calendar_dimension_id int, hour_dimension_id, status_dimension_id, status_count int DEFAULT 1)
hour_dimension(hour_dimension_id, hour_of_day_24, hour_of_day_12, ...)
status_dimension(status_dimension_id, status_description string, ...)
If you aren't familiar with the dimensional model, I would recommend the book data warehouse toolkit by Kimball.
I would also recommend MyISAM since you don't need the transactional integrity provided by InnoDB when dealing with a read-mostly warehouse.
I would question whether you want to do concurrent inserts into a production database though. Often in a warehouse environment this data would get batched over time and inserted in bulk and perhaps go through a promotion process.
As for scalability, mysql can certainly handle 2M write operations per day on modest hardware. I'm inserting 500K+ rows/day (batched hourly) on a cloud based server with 8GB of ram running apache + php + mysql and the inserts aren't really noticeable to the php users hitting the same db.
I'm assuming you will get one new row per user per day inserted (not 2M rows a day as some users will have more than one status). You should look at how many new rows per day you expect to be created. When you get to a large number of rows you might have to consider partitioning, sharding and other performance tricks. There are many books out there that can help you with that. Or you could also consider moving to an analytics DB such as Amazon Redshift.
I would create a fact table for each user status for each day. This fact table would connect to a date dimension via a date_key and to a user dimension via a user_key. The primary key for the fact table should be a surrogate key = status_key.
So, your fact table now has four fields: status_key, date_key, user_key, status.
Once the dimension and fact tables have been loaded, then do the processing and aggregating.
Edit: I assumed you knew something about datamarts and star schemas. Here is a simple star schema to base your design on.
This design will store any user's status for a given day. (If the user status can change during the day, just add a time dimension).
This design will work on MySQL or SQL Server. You will have to manage a million inserts per day, don't bog it down with comparisons to previous data points. You can do that with the datamart (star schema) after it's loaded - that's what it's for - analysis and aggregation.
If there are a large number of DML operations and selects against the database, the MyISAM engine would be preferable. InnoDB is mainly used for transaction control (TCL) and referential integrity. You can also specify the engine at the table level.
If you need to generate reports, MyISAM also works faster than InnoDB. Check which tables or data you need for your report.
Remember that if you generate reports from a MySQL database by processing millions of rows in PHP, you could run into problems. You may encounter 500 or 501 errors frequently.
So from a report-generation viewpoint, using MyISAM for the required tables will be useful.
You can also store the data in multiple tables to reduce overhead; otherwise there is a chance of a table crash.
It looks like you need a schema that will keep a single count per user per day. Very simple. You should create a single table which is DAY, USER_ID, and STATUS_COUNT.
Create an index on DAY and USER_ID together, and if possible keep the data in the table sorted by DAY and USER_ID also. This will give you very fast access to the data, as long as you are querying it by day ranges for any (or all) users.
For example:
select * from table where DAY = X and USER_ID in (Y, Z);
would be very fast because the data is ordered on disk sequentially by day, then by user_id, so there are very few seeks to satisfy the query.
On the other hand, if you are more interested in finding a particular user's activity for a range of days:
select * from table where USER_ID = X and DAY between Y and Z;
then the previous method is less optimal because finding the data will require many seeks instead of a sequential scan. Index first by USER_ID, then DAY, and keep the data sorted in that order; this will require more maintenance though, as the table would need to be re-sorted often. Again, it depends on your use case, and how fast you want your queries against the table to respond.
I don't use MySQL extensively, but I believe MyISAM is faster for inserts at the expense of transaction isolation. This should not be a problem for the system you're describing.
Also, 2MM records per day should be child's play (only 23 inserts / second) if you're using decent hardware. Especially if you can batch load the records using mysqlimport. If that's not possible, 23 inserts/second should still be very doable.
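The per-second figure is just the daily volume divided by the seconds in a day. A quick check, assuming the load is spread evenly (real traffic may well be burstier):

```python
def inserts_per_second(rows_per_day: int) -> float:
    """Average insert rate if the load is spread evenly over 24 hours."""
    return rows_per_day / 86_400

rate = inserts_per_second(2_000_000)
print(f"{rate:.1f} inserts/second on average")
```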
I would not do the computation of the delta from the previous day in the insertion of the current day however. There is an analytic function called LAG() that will do that for you very handily (http://explainextended.com/2009/03/10/analytic-functions-first_value-last_value-lead-lag/), not to mention it doesn't seem to serve any practical purpose at the detail level.
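To illustrate what computing the delta at read time means, here is a plain-Python sketch that does what LAG() does in SQL: each day's running total is compared with the previous day's. The input is user U1's totals from the question. The function name daily_deltas is made up for this example.

```python
def daily_deltas(totals):
    """Given (day, running_total) pairs in date order, return (day, delta) pairs.

    Mirrors what LAG() does in SQL: each row is compared with the previous one.
    """
    return [
        (day, total - prev_total)
        for (_, prev_total), (day, total) in zip(totals, totals[1:])
    ]

# Running totals for user U1 from the question.
u1_totals = [
    ("June 30", 99), ("July 1", 100), ("July 2", 102), ("July 3", 102),
    ("July 4", 105), ("July 5", 105), ("July 6", 107),
]
u1_deltas = daily_deltas(u1_totals)
print(u1_deltas)
```

Storing only the running total and deriving the delta when reading keeps the insert path free of lookups against yesterday's row.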
With this detail data, you can aggregate it any way you'd like, truncating the DAY column down to WEEK or MONTH, but be careful how you build aggregates. At 2MM records per day you're talking about over 700 million records per year, and re-building aggregates over so many rows can be very costly, especially on a single database. You might consider doing aggregation processing using Hadoop (I'd also recommend Spark over plain old Map/Reduce; it's far more powerful). This will alleviate any computation burden from your database server (which can't easily scale to multiple servers) and allow it to do its job of recording and storing new data.
You should consider partitioning your table as well. Some purposes of partitioning tables are to distribute query load, ease archival of data, and possibly increase insert performance. I would consider partitioning along the month boundary for an application such as you've described.

MySQL Table Locks

I was asked to do some PHP scripts on MySQL DB to show some data when I noticed the strange design they had.
They want to perform a study that would require collecting up to 2000 records per user and they are automatically creating a new table for each user that registers. It's a pilot study at this stage so they have around 30 tables but they should have 3000 users for the real study.
I wanted to suggest gathering all of them in a single table but since there might be around 1500 INSERTs per minute to that database during the study period, I wanted to ask this question here first. Will that cause table locks in MySQL?
So, is it one table with 1500 INSERTs per minute and a maximum size of 6,000,000 records, or 3000 tables with 30 INSERTs per minute and a maximum size of 2000 records? I would like to suggest the first option but I want to be sure that it will not cause any issues.
I read that InnoDB has row-level locks. So, will that give better performance combined with the one-table option?
This is a huge loaded question. In my experience performance is not really measured accurately by table size alone. It comes down to design. Do you have the primary keys and indexes in place? Is it over indexed? That being said, I have also found that almost always one trip to the DB is faster than dozens. How big is the single table (columns)? What kind of data are you saving (larger than 4000K?). It might be that you need to create some prototypes to see what performs best for you. The most I can recommend is that you carefully judge the size of the data you are collecting and allocate accordingly, create indexes (but not too many, don't over index), and test.

Right design for MySQL database

I want to build a MySQL database for storing the ranking of a game every 1h.
Since this database will become quite large in a short time, I figured it's important to have a proper design. Therefore, some advice would be greatly appreciated.
In order to keep it as small as possible, I decided to log only the first 1500 positions of the ranking. Every ranking of a player holds the following values:
ranking position, playername, location, coordinates, alliance, race, level1, level2, points1, points2, points3, points4, points5, points6, date/time
My approach was to simply grab all values of each top 1500 player every hour by a php script and insert them into the MySQL as one row. So every day the MySQL will grow 36,000 rows. I will have a second script that deletes every row that is older than 28 days, otherwise the database would get insanely huge. Both scripts will run as a cronjob.
The following queries will be performed on this data:
The most important one is simply the query for a certain name. It should return all stats for the player for every hour as an array.
The second is a query in which all players have to be returned that didn't gain points1 during a certain time period from the latest entry. This should return a list of players that didn't gain points (for the last 24h for example).
The third is a query in which all players should be listed that lost a certain amount or more points2 in a certain time period from the latest entry.
The queries shouldn't take a lifetime, so I thought I should probably index playernames, points1 and points2.
Is my approach to this acceptable or will I run into a performance/handling disaster? Is there maybe a better way of doing this?
Here is where you risk a performance problem:
Your indexes will speed up your reads, but will considerably slow down your writes. Especially since your DB will have over 1 million rows in that one table at any given time. Since your writes are happening via cron, you should be okay as long as you insert your 1500 rows in batches rather than one round trip to the DB for every row. I'd also look into query compiling so that you save that overhead as well.
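A sketch of the batching idea: the whole hourly snapshot goes in as one batch inside a single transaction instead of 1500 separate round trips. SQLite stands in for MySQL so the example is runnable, and the table layout, timestamp and snapshot data are hypothetical. Python DB-API drivers provide executemany, and with MySQL a multi-row INSERT serves the same purpose.

```python
import sqlite3

# ASSUMPTION: SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ranking (player_id INTEGER, position INTEGER, "
    "points1 INTEGER, logged_at TEXT)"
)

def insert_snapshot(conn, logged_at, snapshot):
    """Insert one hourly snapshot as a single batch inside one transaction."""
    rows = [(pid, pos, pts, logged_at) for pid, pos, pts in snapshot]
    with conn:  # commits once at the end, or rolls back if any row fails
        conn.executemany("INSERT INTO ranking VALUES (?, ?, ?, ?)", rows)
    return len(rows)

# Hypothetical snapshot of the top 1500 players: (player_id, position, points1).
snapshot = [(1000 + i, i + 1, 5000 - i) for i in range(1500)]
inserted = insert_snapshot(conn, "2024-01-01 12:00:00", snapshot)
total = conn.execute("SELECT COUNT(*) FROM ranking").fetchone()[0]
print(inserted, total)
```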
Ranhiru Cooray is correct, you should only store data like the player name once in the DB. Create a players table and use the primary key to reference the player in your ranking table. The same will go for location, alliance and race. I'm guessing that those are more or less enumerated values that you can store in another table to normalize your design and return in your results with appropriate JOINs. Normalizing your data will reduce the amount of redundant information in your database, which will decrease its size and increase its performance.
Your design may also be flawed in your ranking position. Can that not be calculated by the DB when you select your rows? If not, can it be done by PHP? It's the same as with invoice tables, you never store the invoice total because it is redundant. The items/pricing/etc can be used to calculate the order totals.
With all the adding/deleting, I'd be sure to run OPTIMIZE frequently and keep good backups. MySQL tables, if using MyISAM, can become corrupted easily in high writing/deleting scenarios. InnoDB tends to fare a little better in those situations.
Those are some things to think about. Hope it helps.

What is a more efficient way to keep a daily ranking log for each user with MySQL?

I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
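A sketch of such a purge job, of the kind a daily cron entry could call. SQLite stands in for MySQL, and the table and column names (rank_history, user_rank) and the fixed date are assumptions made so that the example is repeatable.

```python
import sqlite3
from datetime import date, timedelta

# ASSUMPTION: SQLite stands in for MySQL; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rank_history (username TEXT, day TEXT, user_rank INTEGER)")
conn.execute("CREATE INDEX idx_day ON rank_history (day)")

def purge_older_than(conn, today, keep_days=90):
    """Delete rows whose day is more than keep_days before today; return rows removed."""
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    with conn:
        cur = conn.execute("DELETE FROM rank_history WHERE day < ?", (cutoff,))
    return cur.rowcount

today = date(2024, 6, 1)  # fixed date so the example is repeatable
rows = [("alice", (today - timedelta(days=d)).isoformat(), d + 1) for d in range(120)]
conn.executemany("INSERT INTO rank_history VALUES (?, ?, ?)", rows)
conn.commit()

removed = purge_older_than(conn, today)
remaining = conn.execute("SELECT COUNT(*) FROM rank_history").fetchone()[0]
print(removed, remaining)
```

The index on day lets the delete find the old rows without scanning the whole table.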
Another option: can you create some "roll-up" aggregate per user based on whatever the criteria are (counts, sales, whatever), all stored by employee + date of activity? Then you could have your pre-aggregated rollups in a much smaller table for however long in history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without dealing with performance issues. This would also help ease moving such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously consider how important "Ranking" is when you have someone in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000, does it really matter?

What is the cost of indexing multiple db columns?

I'm writing an app with a MySQL table that indexes 3 columns. I'm concerned that after the table reaches a significant amount of records, the time to save a new record will be slow. Please inform how best to approach the indexing of columns.
UPDATE
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
Index what seems the most logical (that should hopefully be obvious, for example, a customer ID column in the CUSTOMERS table).
Then run your application and collect statistics periodically to see how the database is performing. RUNSTATS on DB2 is one example, I would hope MySQL has a similar tool.
When you find some oft-run queries doing full table scans (or taking too long for other reasons), then, and only then, should you add more indexes. It does little good to optimise a once-a-month-run-at-midnight query so it can finish at 12:05 instead of 12:07. However, it's a huge improvement to reduce a customer-facing query from 5 seconds down to 2 seconds (that's still too slow, customer-facing queries should be sub-second if possible).
More indexes tend to slow down inserts and speed up queries. So it's always a balancing act. That's why you only add indexes in specific response to a problem. Anything else is premature optimization and should be avoided.
In addition, revisit the indexes you already have periodically to see if they're still needed. It may be that the queries that caused you to add those indexes are no longer run often enough to warrant it.
To be honest, I don't believe indexing three columns on a table will cause you to suffer unless you plan on storing really huge numbers of rows :-) - indexing is pretty efficient.
After your edit which states:
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
My response is that 200 records a day is an extremely small value for a database, you definitely won't have anything to worry about with those three indexes.
Just this week, I imported a day's worth of transactions into one of our database tables at work and it contained 2.1 million records (we get at least one transaction per second across the entire day from 25 separate machines). And it has four separate composite keys, which is somewhat more intensive than your three individual keys.
Now granted, that's on a DB2 database but I can't imagine IBM are so much better than the MySQL people that MySQL can only handle less than 0.01% of the DB2 load.
I made some simple tests using my real project and a real MySQL database.
My results are: adding an average index (1-3 columns per index) to a table makes inserts slower by 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
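The 40-50% range follows from the 2.1% figure under two different readings. Which one applies is an assumption, since the answer does not say whether the per-index cost adds up or compounds.

```python
PER_INDEX_SLOWDOWN = 0.021   # 2.1% per index, the figure measured above
INDEXES = 20

# If each index adds a fixed 2.1% of the baseline insert time.
additive = PER_INDEX_SLOWDOWN * INDEXES

# If each index slows inserts by 2.1% relative to the previous state.
compounded = (1 + PER_INDEX_SLOWDOWN) ** INDEXES - 1

print(f"additive: {additive:.0%}, compounded: {compounded:.1%}")
```

The additive reading gives the low end of the range and the compounded reading lands slightly above the high end.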
So is it ok to add many indexes? - It depends :) I gave you my results - You decide!
Nothing for select queries, though updates and especially inserts will be orders of magnitude slower, which you won't really notice before you start inserting a LOT of rows at the same time...
In fact at a previous employer (single user, desktop system) we actually DROPPED indexes before starting our "import routine" - which would first delete all records before inserting a huge number of records into the same table...
Then when we were finished with the insertion job we would re-create the indexes...
We would save 90% of the time for this operation by dropping the indexes before starting the operation and re-creating the indexes afterwards...
This was a Sybase database, but the same numbers apply for any database...
So be careful with indexes, they're FAR from "free"...
Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
With that kind of insertion rate, the cost of indexing an extra column will be negligible.
Without some more details about expected usage of the data in your table worrying about indexes slowing you down smells a lot like premature optimization that should be avoided.
If you are really concerned about it, then set up a test database and simulate performance in the worst-case scenarios. A test proving that it is or is not a problem will probably be much more useful than trying to guess and worry about what may happen. If there is a problem you will be able to use your test setup to try different methods to fix the issue.
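A rough shape for such a test, sketched with an in-memory SQLite database as a stand-in. The absolute timings will not transfer to MySQL, so treat this as a template for the harness rather than as a measurement. The table name runs and the generated data are hypothetical, and the three indexed columns come from the question.

```python
import sqlite3
import time

def time_inserts(n_rows: int, with_indexes: bool) -> float:
    """Insert n_rows into a fresh in-memory table and return elapsed seconds."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runs (point_value INTEGER, user_id INTEGER, event_id INTEGER)"
    )
    if with_indexes:
        for col in ("point_value", "user_id", "event_id"):
            conn.execute(f"CREATE INDEX idx_{col} ON runs ({col})")
    rows = [(i % 5, i % 300, i % 2000) for i in range(n_rows)]
    start = time.perf_counter()
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO runs VALUES (?, ?, ?)", rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
    assert count == n_rows
    conn.close()
    return elapsed

# A quarter million rows matches the five-season estimate in the question.
plain = time_inserts(250_000, with_indexes=False)
indexed = time_inserts(250_000, with_indexes=True)
print(f"no indexes: {plain:.3f}s, three indexes: {indexed:.3f}s")
```

Pointing the same harness at a real MySQL test instance, with realistic row widths, is what would actually answer the question.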
The index is there to speed retrieval of data, so the question should be "What data do I need to access quickly?". Without the index, some queries will do a full table scan (go through every row in the table) in order to find the data that you want. With a significant number of records this will be a slow and expensive operation. If it is for a report that you run once a month then maybe that's okay; if it is for frequently accessed data then you will need the index to give your users a better experience.
If you find the speed of the insert operations are slow because of the index then this is a problem you can solve at the hardware level by throwing more CPUs, RAM and better hard drive technology at the problem.
What Pax said.
For the dimensions you describe, the only significant concern I can imagine is "What is the cost of failing to index multiple db columns?"