Database design suggestions for a data scraping/warehouse application? - mysql

I'm looking into the database design for a data warehouse kind of project which involves a large number of inserts daily. The data archives will be further used to generate reports. I will have a list of users (for example, a user set of 2 million), for which I need to monitor daily social networking activities associated with them.
For example, let there be a set of 100 users, say U1, U2, ..., U100.
I need to insert their daily status count into my database.
Consider that the total status count obtained for a user U1 for the period June 30 - July 6 is as follows:
June 30 - 99
July 1 - 100
July 2 - 102
July 3 - 102
July 4 - 105
July 5 - 105
July 6 - 107
The database should keep the daily status count of each user, like:
For user U1,
July 1- 1 (100-99)
July 2- 2 (102-100)
July 3- 0 (102-102)
July 4- 3 (105-102)
July 5- 0 (105-105)
July 6- 2 (107-105)
Similarly, the database should hold archived details for the full set of users.
And in a later phase, I envision taking aggregate reports out of these data, like total points scored on each day, week, month, etc., and comparing them with older data.
I need to start things from scratch. I am experienced with PHP as a server-side scripting language and with MySQL. I am confused on the database side. Since I need to process about a million insertions daily, what all should be taken care of?
I am confused about how to design a MySQL database in this regard: which storage engine to use and which design patterns to follow, keeping in mind that the data should later be usable effectively with aggregate functions.
Currently I envision the DB design with one table storing all the user ids (referenced via a foreign key) and a separate status count table for each day. Could a lot of tables create some overhead?
Does MySQL fit my requirement? 2 million or more DB operations will be done every day. How should the server and other things be considered in this case?
1) The database should handle concurrent inserts, which should enable 1-2 million inserts per day.
Before inserting, I plan to calculate the daily status count, i.e. the difference between today's count and yesterday's.
2) In a later phase, the archived data (collected over past days) will be used as a data warehouse, and aggregation tasks will be performed on it.
Comments:
I have read that MyISAM is the best choice for data warehousing projects, and at the same time I have heard that InnoDB excels in many ways. Many have suggested proper tuning to get it done; I would like to get thoughts on that as well.

When creating a data warehouse, you don't have to worry about normalization. You're inserting rows and reading rows.
I'd just have one table like this.
Status Count
------------
User id
Date
Count
The primary (clustering) key would be (User id, Date). Another unique index would be (Date, User id).
As far as whether or not MySQL can handle this data warehouse, that depends on the hardware that MySQL is running on.
Since you don't need referential integrity, I'd use MyISAM as the engine.
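A minimal DDL sketch of that single table, assuming MySQL (the column names here are illustrative):

CREATE TABLE status_count (
    user_id     INT UNSIGNED NOT NULL,        -- the user being tracked
    stat_date   DATE NOT NULL,                -- the day the count applies to
    daily_count INT UNSIGNED NOT NULL,        -- status count for that day
    PRIMARY KEY (user_id, stat_date),         -- the primary key suggested above
    UNIQUE KEY idx_date_user (stat_date, user_id)   -- the second unique index
) ENGINE=MyISAM;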

As for table design, a dimensional model with a star schema is usually a good choice for a datamart where there are mostly inserts and reads. I see two different granularities for the status data, one for status per day and one for status per user, so I would recommend tables similar to:
user_status_fact(user_dimension_id int, lifetime_status int)
daily_status_fact (user_dimension_id int, calendar_dimension_id int, daily_status int)
user_dimension(user_dimension_id, user_id, name, ...)
calendar_dimension(calendar_dimension_id, calendar_date, day_of_week, etc..)
You might also consider having the most detailed data available even though you don't have a current requirement for it as it may make it easier to build aggregates in the future:
status_fact (user_dimension_id int, calendar_dimension_id int, hour_dimension_id, status_dimension_id, status_count int DEFAULT 1)
hour_dimension(hour_dimension_id, hour_of_day_24, hour_of_day_12, ...)
status_dimension(status_dimension_id, status_description string, ...)
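If it helps, here is a rough MySQL sketch of the daily-grain star described above; the data types and surrogate-key setup are assumptions, not a definitive design:

CREATE TABLE user_dimension (
    user_dimension_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id           INT UNSIGNED NOT NULL,   -- natural key from the source system
    name              VARCHAR(100),
    UNIQUE KEY (user_id)
) ENGINE=MyISAM;

CREATE TABLE calendar_dimension (
    calendar_dimension_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    calendar_date         DATE NOT NULL,
    day_of_week           TINYINT NOT NULL,
    UNIQUE KEY (calendar_date)
) ENGINE=MyISAM;

CREATE TABLE daily_status_fact (
    user_dimension_id     INT UNSIGNED NOT NULL,
    calendar_dimension_id INT UNSIGNED NOT NULL,
    daily_status          INT NOT NULL,        -- points gained that day
    PRIMARY KEY (user_dimension_id, calendar_dimension_id)
) ENGINE=MyISAM;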
If you aren't familiar with the dimensional model, I would recommend the book The Data Warehouse Toolkit by Kimball.
I would also recommend MyISAM since you don't need the transactional integrity provided by InnoDB when dealing with a read-mostly warehouse.
I would question whether you want to do concurrent inserts into a production database though. Often in a warehouse environment this data would get batched over time and inserted in bulk and perhaps go through a promotion process.
As for scalability, MySQL can certainly handle 2M write operations per day on modest hardware. I'm inserting 500K+ rows/day (batched hourly) on a cloud-based server with 8GB of RAM running Apache + PHP + MySQL, and the inserts aren't really noticeable to the PHP users hitting the same DB.
I'm assuming you will get one new row per user per day inserted (not 2M rows a day, as some users will have more than one status). You should look at how many new rows per day you expect to be created. When you get to a large number of rows you might have to consider partitioning, sharding and other performance tricks. There are many books out there that can help you with that. Or you could also consider moving to an analytics DB such as Amazon Redshift.
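To keep those batched loads cheap, multi-row INSERTs or LOAD DATA work well; a sketch against the daily_status_fact table sketched above (the values and file path are placeholders):

-- Multi-row INSERT: one round trip for many rows
INSERT INTO daily_status_fact (user_dimension_id, calendar_dimension_id, daily_status)
VALUES (1, 42, 2), (2, 42, 0), (3, 42, 5);

-- Or bulk-load a CSV produced by the PHP collector
LOAD DATA LOCAL INFILE '/tmp/daily_status.csv'
INTO TABLE daily_status_fact
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(user_dimension_id, calendar_dimension_id, daily_status);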

I would create a fact table for each user status for each day. This fact table would connect to a date dimension via a date_key and to a user dimension via a user_key. The primary key for the fact table should be a surrogate key = status_key.
So, your fact table now has four fields: status_key, date_key, user_key, status.
Once the dimension and fact tables have been loaded, then do the processing and aggregating.
Edit: I assumed you knew something about datamarts and star schemas. Here is a simple star schema to base your design on.
This design will store any user's status for a given day. (If the user status can change during the day, just add a time dimension).
This design will work on MySQL or SQL Server. You will have to manage a million inserts per day, so don't bog it down with comparisons to previous data points. You can do that with the datamart (star schema) after it's loaded - that's what it's for - analysis and aggregation.
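As a sketch of what that post-load aggregation could look like (the fact and dimension table names are assumed, since only the key/field names were given above):

-- Total status per calendar day
SELECT d.calendar_date, SUM(f.status) AS total_status
FROM   status_fact f
JOIN   date_dimension d ON d.date_key = f.date_key
GROUP  BY d.calendar_date;

-- Weekly or monthly totals are the same query grouped by d.year, d.week_of_year or d.month.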

If there are a large number of DML operations as well as record selects, the MyISAM engine would be preferred. InnoDB is mainly used when you need transaction control and referential integrity. You can also specify the engine at the table level.
If you need to generate reports, MyISAM also tends to work faster than InnoDB; look at which tables or data you need for your reports.
Remember that if you generate reports from the MySQL database, processing millions of rows in PHP could create a problem; you may encounter 500 or 501 errors frequently.
So from the report-generation point of view, the MyISAM engine will be useful for the required tables.
You can also spread the data across multiple tables to reduce overhead; otherwise there is a chance of a DB table crash.

It looks like you need a schema that will keep a single count per user per day. Very simple. You should create a single table with columns DAY, USER_ID, and STATUS_COUNT.
Create an index on DAY and USER_ID together, and if possible keep the data in the table sorted by DAY and USER_ID also. This will give you very fast access to the data, as long as you are querying it by day ranges for any (or all) users.
For example:
select * from table where DAY = X and USER_ID in (Y, Z);
would be very fast because the data is ordered on disk sequentially by day, then by user_id, so there are very few seeks to satisfy the query.
On the other hand, if you are more interested in finding a particular user's activity for a range of days:
select * from table where USER_ID = X and DAY between Y and Z;
then the previous method is less optimal because finding the data will require many seeks instead of a sequential scan. Index first by USER_ID, then DAY, and keep the data sorted in that order; this will require more maintenance though, as the table would need to be re-sorted often. Again, it depends on your use case, and how fast you want your queries against the table to respond.
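In MySQL terms, that is simply a choice between two composite indexes; a sketch assuming the table is called daily_status with the DAY/USER_ID/STATUS_COUNT columns described above:

-- Optimised for "all (or several) users over a day or day range"
ALTER TABLE daily_status ADD INDEX idx_day_user (day, user_id);

-- Optimised for "one user over a range of days"
ALTER TABLE daily_status ADD INDEX idx_user_day (user_id, day);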
I don't use MySQL extensively, but I believe MyISAM is faster for inserts at the expense of transaction isolation. This should not be a problem for the system you're describing.
Also, 2MM records per day should be child's play (only 23 inserts / second) if you're using decent hardware. Especially if you can batch load the records using mysqlimport. If that's not possible, 23 inserts/second should still be very doable.
I would not do the computation of the delta from the previous day during the insertion of the current day, however. There is an analytic function called LAG() that will do that for you very handily (http://explainextended.com/2009/03/10/analytic-functions-first_value-last_value-lead-lag/); besides, the delta doesn't seem to serve any practical purpose at the detail level.
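Note that MySQL itself only gained window functions in 8.0; on older versions you would use the emulation described at that link. A sketch assuming MySQL 8.0+ and the raw running count stored per day in the daily_status table above:

SELECT user_id,
       day,
       status_count
         - LAG(status_count) OVER (PARTITION BY user_id ORDER BY day) AS daily_delta
FROM   daily_status;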
With this detail data, you can aggregate it any way you'd like, truncating the DAY column down to WEEK or MONTH, but be careful how you build aggregates. You're talking about hundreds of millions of records per year, and re-building aggregates over so many rows can be very costly, especially on a single database. You might consider doing the aggregation processing with Hadoop (I'd recommend Spark over plain old Map/Reduce; it's far more powerful). This will take the computation burden off your database server (which can't easily scale to multiple servers) and allow it to do its job of recording and storing new data.
You should consider partitioning your table as well. Some purposes of partitioning tables are to distribute query load, ease archival of data, and possibly increase insert performance. I would consider partitioning along the month boundary for an application such as you've described.
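A hedged sketch of that month-boundary partitioning (partition names and dates are illustrative; note that MySQL requires the partitioning column to be part of every unique key on the table):

ALTER TABLE daily_status
PARTITION BY RANGE (TO_DAYS(day)) (
    PARTITION p201306 VALUES LESS THAN (TO_DAYS('2013-07-01')),
    PARTITION p201307 VALUES LESS THAN (TO_DAYS('2013-08-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);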

Related

Database design for heavy timed data logging - Car Tracking System

I am making a car tracking system and I want to store the data that each car sends every 5 seconds in a MySQL database. Assume that I have 1000 cars transmitting data to my system every 5 seconds, and the data is stored in one table. At some point I would want to query this table to generate reports for a specific vehicle. I am torn between logging all the vehicles' data in one table or creating a table for each vehicle (1000 tables). Which is more efficient?
OK: 86,400 seconds per day / 5 = 17,280 records per car per day.
That will result in 17,280,000 records per day. This is not an issue for MySQL in general.
And a well-designed table will be easy to query.
If you go for one table for each car - what happens when there are 2000 cars in the future?
But the question is also: how long do you want to store the data?
It is easy to calculate when your database will reach 200 GB, 800 GB, 2 TB, ...
One table, not one table per car. A database with 1000 tables will be a dumpster fire when you try to back it up or maintain it.
Keep the rows of that table as short as you possibly can; it will have many records.
Index that table both on timestamp and on (car_id, timestamp). The second index will allow you to report on individual cars efficiently.
Read https://use-the-index-luke.com/
This is the "tip of the iceberg". There are about 5 threads here and on dba.stackexchange relating to tracking cars/trucks. Here are some further tips.
Keep datatypes as small as possible. Your table(s) will become huge -- threatening to overflow the disk, and slowing down queries due to "bulky rows mean that fewer rows can be cached in RAM".
Do you keep the "same" info for a car that is sitting idle overnight? Think of how much disk space this is taking.
If you are using HDD disks, plan on 100 INSERTs/second before you need to do some redesign of the ingestion process. (1000/sec for SSDs.) There are techniques that can give you 10x, maybe 100x, but you must apply them.
Will you be having several servers collecting the data, then doing simple inserts into the database? My point is that that may be your first bottleneck.
PRIMARY KEY(car_id, ...) so that accessing data for one car is efficient.
Today, you say the data will be kept forever. But have you computed how big your disk will need to be?
One way to shrink the data drastically is to consolidate "old" data into, say, 1-minute intervals after, say, one month. Start thinking about what you want to keep. For example: min/max/avg speed, not just instantaneous speed. Have an extra record when any significant change occurs (engine on; engine off; airbag deployed; etc)
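For example, a consolidation job along those lines might look like this (table and column names are assumptions for illustration):

-- Roll raw readings older than a month up into 1-minute min/max/avg rows
INSERT INTO readings_1min (car_id, minute_start, min_speed, max_speed, avg_speed)
SELECT car_id,
       DATE_FORMAT(ts, '%Y-%m-%d %H:%i:00'),
       MIN(speed), MAX(speed), AVG(speed)
FROM   readings_raw
WHERE  ts < NOW() - INTERVAL 1 MONTH
GROUP  BY car_id, DATE_FORMAT(ts, '%Y-%m-%d %H:%i:00');

-- Then purge the raw rows that were consolidated
DELETE FROM readings_raw WHERE ts < NOW() - INTERVAL 1 MONTH;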
(I probably have more tips.)

Right design for MySQL database

I want to build a MySQL database for storing the ranking of a game every 1h.
Since this database will become quite large in a short time, I figured it's important to have a proper design. Therefore, some advice would be greatly appreciated.
In order to keep it as small as possible, I decided to log only the first 1500 positions of the ranking. Every ranking of a player holds the following values:
ranking position, playername, location, coordinates, alliance, race, level1, level2, points1, points2, points3, points4, points5, points6, date/time
My approach was to simply grab all values for each of the top 1,500 players every hour with a PHP script and insert them into MySQL as one row each. So every day the table will grow by 36,000 rows. I will have a second script that deletes every row older than 28 days; otherwise the database would get insanely huge. Both scripts will run as cron jobs.
The following queries will be performed on this data:
The most important one is simply the query for a certain name. It should return all stats for the player for every hour as an array.
The second is a query in which all players have to be returned that didn't gain points1 during a certain time period from the latest entry. This should return a list of players that didn't gain points (for the last 24h for example).
The third is a query in which all players should be listed that lost a certain amount or more points2 in a certain time period from the latest entry.
The queries shouldn't take a lifetime, so I thought I should probably index playernames, points1 and points2.
Is my approach to this acceptable or will I run into a performance/handling disaster? Is there maybe a better way of doing this?
Here is where you risk a performance problem:
Your indexes will speed up your reads, but will considerably slow down your writes. Especially since your DB will have over 1 million rows in that one table at any given time. Since your writes are happening via cron, you should be okay as long as you insert your 1500 rows in batches rather than one round trip to the DB for every row. I'd also look into query compiling so that you save that overhead as well.
Ranhiru Cooray is correct: you should only store data like the player name once in the DB. Create a players table and use the primary key to reference the player in your ranking table. The same goes for location, alliance and race. I'm guessing that those are more or less enumerated values that you can store in another table to normalize your design, to be returned in your results with the appropriate JOINs. Normalizing your data will reduce the amount of redundant information in your database, which will decrease its size and increase its performance.
Your design may also be flawed regarding the ranking position. Can that not be calculated by the DB when you select your rows? If not, can it be done by PHP? It's the same as with invoice tables: you never store the invoice total because it is redundant; the items/pricing/etc. can be used to calculate the order totals.
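For instance, on MySQL 8.0+ the position can be derived at query time with a window function rather than stored; the table and column names below are assumptions based on the question:

SELECT player_id,
       points1,
       RANK() OVER (ORDER BY points1 DESC) AS ranking_position
FROM   hourly_stats
WHERE  snapshot_time = '2018-07-06 12:00:00';   -- one hourly snapshot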
With all the adding/deleting, I'd be sure to run OPTIMIZE frequently and keep good backups. MySQL tables (if using MyISAM) can become corrupted easily in high write/delete scenarios. InnoDB tends to fare a little better in those situations.
Those are some things to think about. Hope it helps.

What is a more efficient way to keep a daily ranking log for each user with MySQL?

I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
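A minimal sketch of such a purge, assuming the log table is called rank_history with a log_date column:

-- Run from cron, or schedule it inside MySQL (requires event_scheduler=ON)
DELETE FROM rank_history WHERE log_date < CURDATE() - INTERVAL 90 DAY;

CREATE EVENT purge_rank_history
    ON SCHEDULE EVERY 1 DAY
    DO DELETE FROM rank_history WHERE log_date < CURDATE() - INTERVAL 90 DAY;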
Another option: can you create some "roll-up" aggregate per user based on whatever the criteria is (counts, sales, whatever), all stored by user + date of activity? Then you could keep your pre-aggregated rollups in a much smaller table for however long a history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without running into performance issues. This would also help ease moving such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously consider how important the "ranking" really is when someone sits in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000, does it really matter?

Database architecture for millions of new rows per day

I need to implement a custom-developed web analytics service for a large number of websites. The key entities here are:
Website
Visitor
Each unique visitor will have a single row in the database with information like landing page, time of day, OS, browser, referrer, IP, etc.
I will need to do aggregated queries on this database such as 'COUNT all visitors who have Windows as OS and came from Bing.com'
I have hundreds of websites to track and the number of visitors for those websites range from a few hundred a day to few million a day. In total, I expect this database to grow by about a million rows per day.
My questions are:
1) Is MySQL a good database for this purpose?
2) What could be a good architecture? I am thinking of creating a new table for each website. Or perhaps starting with a single table and then spawning a new table (daily) if the number of rows in an existing table exceeds 1 million (is my assumption correct?). My only worry is that if a table grows too big, the SQL queries can get dramatically slow. So, what is the maximum number of rows I should store per table? Moreover, is there a limit on the number of tables that MySQL can handle?
3) Is it advisable to do aggregate queries over millions of rows? I'm ready to wait for a couple of seconds to get results for such queries. Is it a good practice or is there any other way to do aggregate queries?
In a nutshell, I am trying a design a large scale data-warehouse kind of setup which will be write heavy. If you know about any published case studies or reports, that'll be great!
If you're talking larger volumes of data, then look at MySQL partitioning. For these tables, a partition by date/time would certainly help performance. There's a decent article about partitioning here.
Look at creating two separate databases: one for all raw data for the writes with minimal indexing; a second for reporting using the aggregated values; with either a batch process to update the reporting database from the raw data database, or use replication to do that for you.
EDIT
If you want to be really clever with your aggregation reports, create a set of aggregation tables ("today", "week to date", "month to date", "by year"). Aggregate from raw data to "today" either daily or in "real time"; aggregate from "by day" to "week to date" on a nightly basis; from "week to date" to "month to date" on a weekly basis, etc. When executing queries, join (UNION) the appropriate tables for the date ranges you're interested in.
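A rough sketch of that promotion-plus-UNION idea, with assumed table and column names:

-- Nightly promotion from the "today" tier into "week to date"
INSERT INTO visits_week_to_date (site_id, visit_date, visitor_count)
SELECT site_id, visit_date, visitor_count FROM visits_today;
TRUNCATE TABLE visits_today;

-- Reporting across a range by UNIONing the relevant tiers
SELECT site_id, SUM(visitor_count) AS visitors
FROM (
    SELECT site_id, visitor_count FROM visits_today
    UNION ALL
    SELECT site_id, visitor_count FROM visits_week_to_date
) AS combined
GROUP BY site_id;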
EDIT #2
Rather than one table per client, we work with one database schema per client. Depending on the size of the client, we might have several schemas in a single database instance, or a dedicated database instance per client. We use separate schemas for raw data collection, and for aggregation/reporting for each client. We run multiple database servers, restricting each server to a single database instance. For resilience, databases are replicated across multiple servers and load balanced for improved performance.
Some suggestions in a database-agnostic fashion.
The simplest rationale is to distinguish between read-intensive and write-intensive tables. It is probably a good idea to create two parallel schemas: a daily/weekly schema and a history schema, partitioned appropriately. A batch job can update the history schema with data from the daily/weekly schema. In the history schema you can again create separate data tables per website (based on the data volume).
If all you are interested in is the aggregation stats alone (which may not be true), it is a good idea to have summary tables (monthly, daily) in which summaries are stored, like total unique visitors, repeat visitors, etc.; these summary tables are updated at the end of the day. This enables on-the-fly computation of stats without waiting for the history database to be updated.
You should definitely consider splitting the data by site across databases or schemas - this not only makes it much easier to back up, drop, etc. an individual site/client, but also eliminates much of the hassle of making sure no customer can see any other customer's data by accident or through poor coding. It also means it is easier to make choices about partitioning, over and above database table-level partitioning for time or client.
Also, you said that the data volume is 1 million rows per day. That's not particularly heavy and doesn't require huge grunt power to log/store, nor indeed to report (though if you were generating 500 reports at midnight you might logjam). However, you also said that some sites had 1m visitors daily, so perhaps your figure is too conservative?
Lastly you didn't say if you want real-time reporting a la chartbeat/opentracker etc or cyclical refresh like google analytics - this will have a major bearing on what your storage model is from day one.
M
You really should test your way forward with simulated environments as close as possible to the live environment, using "real fake" data (correct format & length). Benchmark queries and variants of the table structures. Since you seem to know MySQL, start there. It shouldn't take you that long to set up a few scripts bombarding your database with queries. Studying how your database behaves with your kind of data will help you realise where the bottlenecks will occur.
Not a solution but hopefully some help on the way, good luck :)

Performance of additional columns vs additional rows

I have a question about table design and performance. I have a number of analytical machines that produce varying amounts of data (which has been stored in text files up to this point via the DOS programs which run the machines). I have decided to modernise and create a new database to store all the machine results in.
I have created separate tables to store results by type e.g. all results from the balance machine get stored in the balance results table etc.
I have a common results table format for each machine which is as follows:
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
AnalyteName
UnitOfMeasure
Value
A typical ClientRequest might have 50 samples which need to be tested by various machines. Each machine records only 1 line per sample, so there are approx. 50 rows per table associated with any given ClientRequest.
This is fine for all machines except one!
It measures 20-30 analytes per sample (and just spits them out in one long row), whereas on all the other machines I am only ever measuring 1 analyte per RequestID/SampleNumber.
If I stick to this format, this machine will generate over a million rows per year, because every sample can have as many as 30 measurements.
My other tables will only grow at a rate of 3000-5000 rows per year.
So after all that, my question is this:
Am I better off sticking to the common format for this table and having bucketloads of rows, or is it better to just add extra columns to represent each analyte, so that it generates only 1 row per sample (like the other tables)? The machine can only ever measure a max of 30 analytes (and at $250k per machine, I won't be getting another in my lifetime).
All I am worried about is reporting performance and online editing. In both cases, the PK: RequestID and SampleNumber remain the same, so I guess it's just a matter of what would load quicker. I know the multiple column approach is considered woeful from a design perspective, but would it yield better performance in this instance?
BTW the database is MS Jet / Access 2010
Any help would be greatly appreciated!
Millions of rows in a Jet/ACE database are not a problem if the rows have few columns.
However, my concern is how these records are inserted -- is this real-time data collection? If so, I'd suggest this is probably more than Jet/ACE can handle reliably.
I'm an experienced Access developer who is a big fan of Jet/ACE, but from what I know about your project, if I was starting it out, I'd definitely choose a server database from the get go, not because Jet/ACE likely can't handle it right now, but because I'm thinking in terms of 10 years down the road when this app might still be in use (remember Y2K, which was mostly a problem of apps that were designed with planned obsolescence in mind, but were never replaced).
You can decouple the AnalyteName column from the 'common results' table:
-- Table Common Results
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
UnitOfMeasure
Value

-- Table Results Analyte
ClientRequestID PK
SampleNumber PK
AnalyteName
You join on the PK (Request + Sample). That way you don't needlessly duplicate all the rest of the data, you can avoid the join in queries where you don't need the AnalyteName, you can support extra analytes, and it is overall saner. Unless you really start having a performance problem, this is the approach I'd follow.
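A sketch of the join described above (table names are illustrative; the syntax works in Access/Jet SQL as well as in server databases):

SELECT r.ClientRequestID, r.SampleNumber, a.AnalyteName, r.UnitOfMeasure, r.Value
FROM CommonResults AS r
INNER JOIN ResultsAnalyte AS a
        ON a.ClientRequestID = r.ClientRequestID
       AND a.SampleNumber = r.SampleNumber;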
Heck, even if you start having performance problems, I'd first move to a real database to see if that fixes the problems before adding columns to the results table.