When working with data values, should I create a single table storing the hourly values, and also the aggregated daily/monthly values, or should I create separate tables for these?
I'd imagine multiple tables would be the way to go, but I'm a complete amateur here. It sounds like something that would improve performance and possibly maintenance, but I'd also like to know if this even makes a difference. In the end, having 3-4 tables vs 1 could also cause some maintenance issues I would imagine.
So basically, a values_table containing:
id | value | datetime            | range
---|-------|---------------------|-------
1  | 33    | 2022-05-13 11:00:00 | hourly
2  | 54    | 2022-05-13 12:00:00 | hourly
3  | 840   | 2022-05-13          | daily
...
vs
hourly_values_table containing:
id | value | datetime
---|-------|--------------------
1  | 33    | 2022-05-13 11:00:00
2  | 54    | 2022-05-13 12:00:00
...
And a daily_values_table containing:
id | value | datetime
---|-------|-----------
1  | 840   | 2022-05-13
...
What would be the proper way to handle this?
Your hourly data is a Data Warehouse 'Fact' table. It is, I assume, written 'continually' and never updated.
"Summary Table(s)" are useful for performance. Usually only 1 is needed. For example a "daily" table gives you about a 24x reduction. From that table you can fetch weekly, monthly, or any arbitrary date range reasonably efficiently. (I need more metrics and a better feel for what type of data you are storing to be surer of what I am saying.)
I discuss using MySQL for DW and Summary tables.
Sure, purists debate the storing of "redundant" data. But when you get to a billion rows, you really need summary tables to avoid performance bottlenecks.
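To make that concrete, here is a minimal sketch of the summary-table pattern in MySQL, with hypothetical table and column names (readings for the hourly facts, readings_daily for the summary):

-- Assuming a raw hourly fact table readings(dt DATETIME, value INT), insert-only

-- Daily summary table: roughly a 24x reduction in rows
CREATE TABLE readings_daily (
    dy        DATE NOT NULL,
    value_sum INT NOT NULL,
    PRIMARY KEY (dy)
);

-- Roll yesterday's hourly rows up once per day
INSERT INTO readings_daily (dy, value_sum)
SELECT DATE(dt), SUM(value)
FROM readings
WHERE dt >= CURDATE() - INTERVAL 1 DAY
  AND dt <  CURDATE()
GROUP BY DATE(dt);

-- Weekly, monthly, or arbitrary-range reports then read only the small table
SELECT SUM(value_sum) FROM readings_daily
WHERE dy BETWEEN '2022-05-01' AND '2022-05-31';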
As for how long to hold onto the data in the Fact table or the Summary table, I often suggest:
Use Partitioning for speedy purging of old data (after, say, a month), thereby saving disk space (see the sketch after this list);
Keep the summary tables 'forever', since they are 'small'.
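A sketch of how the partition-based purge could look, again with the hypothetical readings table; each partition holds one month, and dropping a partition is a near-instant metadata operation rather than a slow bulk DELETE:

-- Hourly fact table partitioned by month so old raw data can be purged cheaply
CREATE TABLE readings (
    dt    DATETIME NOT NULL,
    value INT NOT NULL,
    PRIMARY KEY (dt)
)
PARTITION BY RANGE (TO_DAYS(dt)) (
    PARTITION p2022_04 VALUES LESS THAN (TO_DAYS('2022-05-01')),
    PARTITION p2022_05 VALUES LESS THAN (TO_DAYS('2022-06-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Purge a whole month of raw data; the summary table is untouched and kept forever
ALTER TABLE readings DROP PARTITION p2022_04;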
I don't understand your purpose or your approach.
You have to start with the purpose of the database: what data are you trying to store, and why?
From reading your description I can't tell whether the data is supposed to be connected to a person or is for an accounting purpose. There's no context.
Start with the purpose of the database and this will identify the tables/names, which will then reveal the structure and relationships. And go to my post here for clarification, which could help conceptually. Link
I am creating an analytics module in our 'Tours & Travels' application.
Following are the steps a user goes through in our application:
Step 1: The user searches tours for a city.
Step 2: The user views the details of a tour.
Step 3: If the user finds the perfect tour, he/she books the tour.
Step 4: While booking the tour, the user enters passenger details.
Step 5: The user reviews the final data.
Step 6: The user pays online & the tour gets booked.
Now I want to store each user activity on our system for analysis purposes. For this I have the table structure below:
Id  | user_id | tour_id | city_id | searched_at         | viewed_at           | entered_pax_info_at | reviewed_at         | booked_at
----|---------|---------|---------|---------------------|---------------------|---------------------|---------------------|--------------------
151 | 34      | 678     | 1290    | 2021-03-14 12:00:00 | 2021-03-14 12:05:00 | 2021-03-14 12:10:00 | 2021-03-14 12:15:00 | 2021-03-14 12:20:00
Now while analyzing the data from this structure, the admin user may want data based on any of these columns: searched_at, viewed_at, entered_pax_info_at, reviewed_at, or booked_at.
E.g. the admin user can ask for data like: "Give me a report of tour 'ABC' which got booked from Jan 2021 to March 2021", etc.
Now, to make such searches on huge data efficient, I will have to put indexes on each of the above-mentioned columns. By doing this there will be no efficiency problem while reading the data, but it will cost me on write and update operations.
To counter the above problem I am considering the table structure below:
id | user_id | tour_id | city_id | activity_type | date
---|---------|---------|---------|---------------|--------------------
50 | 34      | 678     | 1290    | searched      | 2021-03-14 12:00:00
51 | 34      | 678     | 1290    | viewed        | 2021-03-14 12:05:00
52 | 34      | 678     | 1290    | pax_info      | 2021-03-14 12:10:00
53 | 34      | 678     | 1290    | reviewed      | 2021-03-14 12:15:00
54 | 34      | 678     | 1290    | booked        | 2021-03-14 12:20:00
Now, to make searches on huge data efficient with this table structure, I may have to put indexes only on the activity_type & date columns.
But in my view the disadvantage of this structure is that it is going to take more space compared to the first approach.
I am left confused about which approach (of the above two, or any other) will be future-proof in terms of scalability and efficiency.
Any help to sort out this would be appreciated.
Your second alternative is far better than your first. It allows your system to be flexible about the number of steps you will analyze, for one thing. Normalized (vertical) tables almost always scale up better than denormalized (horizontal) tables.
And about the space used by your tables and indexes? Fuggedaboudit! Disk / SSD space is really cheap, and getting exponentially cheaper by the month.
Unless your system already has tens of millions of rows AND your database administrator is pressuring you to denormalize your tables for performance's sake, do not worry about the size of your tables. Seriously.
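As a rough sketch of what the second (vertical) design and its indexes might look like in MySQL - the names and types here are illustrative assumptions, not prescriptions:

CREATE TABLE user_activity (
    id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id       INT UNSIGNED NOT NULL,
    tour_id       INT UNSIGNED NOT NULL,
    city_id       INT UNSIGNED NOT NULL,
    activity_type ENUM('searched','viewed','pax_info','reviewed','booked') NOT NULL,
    activity_date DATETIME NOT NULL,
    PRIMARY KEY (id),
    KEY idx_type_date (activity_type, activity_date),              -- "booked between X and Y"
    KEY idx_tour_type_date (tour_id, activity_type, activity_date) -- per-tour reports
) ENGINE=InnoDB;

-- "Give me a report of tour 678 which got booked from Jan 2021 to March 2021"
SELECT * FROM user_activity
WHERE tour_id = 678
  AND activity_type = 'booked'
  AND activity_date >= '2021-01-01'
  AND activity_date <  '2021-04-01';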
The analytic database should not be the operational database. In fact, I often work with analytic databases that are batch loaded and rarely -- if ever -- updated. Typically, analysts don't like their data changing under them as they are solving a problem.
In other words, either you need to rethink your approach or you have not described the full problem.
The first table you described looks like a good summary table for users that might be quite appropriate for analysts. It is not appropriate as an operational store for the data. In the world I live in, people are not so consistent about their searches. They search for the best tour in one city, find the price and other details, go back and check others. And so on. This is "navigation" and "path analysis", which your structure does not allow.
Such a summary table can be produced in a batch process. Even on a relatively large amount of data, that might take just a minute or two and it might be sufficient to do it once per day. If so, problem solved. There are no updates. The indexes are the ones needed on the analytic side.
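One way such a batch could be written, sketched against the two hypothetical structures above (user_activity as the log, tour_funnel_summary as the wide per-user-per-tour summary table):

-- Nightly batch: pivot yesterday's activity log into the wide summary table
INSERT INTO tour_funnel_summary
    (user_id, tour_id, city_id, searched_at, viewed_at, entered_pax_info_at, reviewed_at, booked_at)
SELECT
    user_id,
    tour_id,
    city_id,
    MAX(CASE WHEN activity_type = 'searched' THEN activity_date END),
    MAX(CASE WHEN activity_type = 'viewed'   THEN activity_date END),
    MAX(CASE WHEN activity_type = 'pax_info' THEN activity_date END),
    MAX(CASE WHEN activity_type = 'reviewed' THEN activity_date END),
    MAX(CASE WHEN activity_type = 'booked'   THEN activity_date END)
FROM user_activity
WHERE activity_date >= CURDATE() - INTERVAL 1 DAY
  AND activity_date <  CURDATE()
GROUP BY user_id, tour_id, city_id;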
On the other hand, there is lots of analysis that this structure does not support. For instance, how many cities did a user look at before deciding on the final city? Well, maybe you could eke out the answer to that question.
I think you need 2 tables
one for browsing activity. In this situation, you probably should not even identify the user; let them be anonymous.
one for booking, paying, etc. (Probably more than one table, due to normalization, etc.)
The browsing table probably only gets INSERTs, and lots of them. If there will be many millions of rows, then we should talk about "summary tables". You don't have to decide on what, exactly, they summarize, but instead wait until the admins have requested some "reports".
The booking table(s) will have fewer INSERTs and possibly more UPDATEs than INSERTs.
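A rough sketch of that split, with hypothetical names and only enough columns to illustrate the idea:

-- Insert-only, anonymous browsing log; expect lots of INSERTs and no UPDATEs
CREATE TABLE browse_activity (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    tour_id    INT UNSIGNED NOT NULL,
    city_id    INT UNSIGNED NOT NULL,
    action     ENUM('searched','viewed') NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id),
    KEY idx_tour_time (tour_id, created_at)
) ENGINE=InnoDB;

-- Booking data lives in normal, user-keyed transactional tables (normalized further in practice)
CREATE TABLE booking (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED NOT NULL,
    tour_id    INT UNSIGNED NOT NULL,
    status     ENUM('pax_info','reviewed','booked','paid') NOT NULL,
    updated_at DATETIME NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;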
I am making a car tracking system and I want to store the data that each car sends every 5 seconds in a MySQL database. Assume I have 1000 cars transmitting data to my system every 5 seconds, and the data is stored in one table. At some point I would want to query this table to generate reports for a specific vehicle. I am torn between logging all the vehicles' data in one table or creating a table for each vehicle (1000 tables). Which is more efficient?
OK, 86400 seconds per day / 5 = 17280 records per car per day.
For 1000 cars that results in 17,280,000 records per day. This is not an issue for MySQL in general.
And a well-designed table will be easy to query.
If you go for one table per car - what happens when there are 2000 cars in the future?
But the question is also: how long would you like to store the data?
It is easy to calculate when your database will reach 200 GB, 800 GB, 2 TB, ...
One table, not one table per car. A database with 1000 tables will be a dumpster fire when you try to back it up or maintain it.
Keep the rows of that table as short as you possibly can; it will have many records.
Index that table both on timestamp and on (car_id, timestamp). The second index will allow you to report on individual cars efficiently.
Read https://use-the-index-luke.com/
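Here is a sketch of that single-table design with both suggested indexes (the (car_id, timestamp) one doubles as the primary key, in line with the tips below); the column names and types are only an assumption about what a tracking record contains:

CREATE TABLE car_position (
    car_id      SMALLINT UNSIGNED NOT NULL,   -- fits tens of thousands of cars
    recorded_at DATETIME NOT NULL,
    latitude    DECIMAL(8,5) NOT NULL,
    longitude   DECIMAL(8,5) NOT NULL,
    speed_kmh   SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY (car_id, recorded_at),        -- covers per-car reporting
    KEY idx_time (recorded_at)                -- fleet-wide, time-based queries
) ENGINE=InnoDB;

-- Report for one vehicle over a date range: a single range scan on the primary key
SELECT * FROM car_position
WHERE car_id = 42
  AND recorded_at >= '2023-06-01'
  AND recorded_at <  '2023-07-01';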
This is the "tip of the iceberg". There are about 5 threads here and on dba.stackexchange relating to tracking cars/trucks. Here are some further tips.
Keep datatypes as small as possible. Your table(s) will become huge -- threatening to overflow the disk, and slowing down queries due to "bulky rows mean that fewer rows can be cached in RAM".
Do you keep the "same" info for a car that is sitting idle overnight? Think of how much disk space this is taking.
If you are using HDD disks, plan on 100 INSERTs/second before you need to do some redesign of the ingestion process. (1000/sec for SSDs.) There are techniques that can give you 10x, maybe 100x, but you must apply them.
Will you be having several servers collecting the data, then doing simple inserts into the database? My point is that that may be your first bottleneck.
PRIMARY KEY(car_id, ...) so that accessing data for one car is efficient.
Today, you say the data will be kept forever. But have you computed how big your disk will need to be?
One way to shrink the data drastically is to consolidate "old" data into, say, 1-minute intervals after, say, one month. Start thinking about what you want to keep. For example: min/max/avg speed, not just instantaneous speed. Have an extra record when any significant change occurs (engine on; engine off; airbag deployed; etc)
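A sketch of what that consolidation could look like, assuming the hypothetical car_position table above and a 1-minute rollup table car_position_minute(car_id, minute_start, speed_min, speed_max, speed_avg):

-- Consolidate raw 5-second rows older than one month into 1-minute min/max/avg rows
INSERT INTO car_position_minute (car_id, minute_start, speed_min, speed_max, speed_avg)
SELECT
    car_id,
    DATE_FORMAT(recorded_at, '%Y-%m-%d %H:%i:00'),
    MIN(speed_kmh),
    MAX(speed_kmh),
    AVG(speed_kmh)
FROM car_position
WHERE recorded_at < CURDATE() - INTERVAL 1 MONTH
GROUP BY car_id, DATE_FORMAT(recorded_at, '%Y-%m-%d %H:%i:00');

-- Then purge the raw rows that were consolidated (ideally via DROP PARTITION rather than DELETE)
DELETE FROM car_position
WHERE recorded_at < CURDATE() - INTERVAL 1 MONTH;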
(I probably have more tips.)
First off, I am new to database design so apologies for use of incorrect terminology.
For a university assignment I have been tasked with creating the database schema for a website. In part of the website a user selects their availability for hosting an event, but the event can be at any time, so for example from 12/12/2015 - 15/12/2015 and 16/01/2016 - 22/12/2016, and also single dates such as 05/01/2016. They also have the option of hosting the event all the time.
So I am unsure how to store all these kinds of values in a database table without using a lot of rows. The example below is a basic one that would store each date of availability, but that is a lot of records, and that is just for one event. Is there a better method of storing these values, or would this be stored elsewhere, outside of a database?
calendar_id | event_id | available_date
---------------------------------------
492 | 602 | 12/12/2015
493 | 602 | 13/12/2015
494 | 602 | 14/12/2015
495 | 602 | 15/12/2015
496 | 602 | 05/01/2016
497 | 602 | 16/01/2016
498 | 602 | 17/01/2016
etc...
This definitely requires a database. I don't think you should be concerned about the number of records in a database... that is what databases do best. However, from a university perspective there is something called Normalization. In simple terms normalization is about minimizing data repetition.
Steps to design a schema
Identify entities
As the first step of designing a database schema I tend to identify all the entities in the system. Looking at your example I see (1) Events and (2) EventTimes (event occurrences/bookings) with a one-to-many relation since one Event might have multiple EventTimes. I would suggest that you keep these two entities separate in the database. That way an Event can be extended with more attributes/fields without affecting its EventTimes. Most importantly you can add many EventTimes on an Event without repeating all the event's fields (which would be the case if you use a single table).
Identify attributes
The second step for me is to identify all the attributes/fields of each entity. Additionally, I always suggest an auto-increment id in every table to uniquely identify a row.
Identify constraints
This might be a bit more advanced, but most of the time you have constraints on which data values are acceptable or on what uniquely identifies a row in real life. For example, the Event.id might identify the row in the database, but you might also require that each event has a unique title.
Example schema
This has to be adjusted to the assignment or, in a real application, to the system's requirements; a rough DDL sketch follows the field lists below.
Events table
id int auto-increment
title varchar unique: Event's title
always_on boolean/enum: If 'Y' then the event is on all the time
... more fields here ... (category, tags, notes, description, venue,...)
EventTimes
id int auto-increment
event_id foreign key pointing to Event.id
start_datetime datetime or int (int if you go for a unix timestamp)
end_datetime : as above
... more fields again... (recursion below is a hard one! avoid it if you can)
recursion enum/int : Is the event repeated? Weekly, Monthly, etc.
recursion_interval int: Every x days, months, years, etc
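One possible translation of this schema into MySQL DDL (names and types are just an example to adjust to the assignment; the recursion fields are left out, as suggested above):

CREATE TABLE events (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    title     VARCHAR(255) NOT NULL,
    always_on BOOLEAN NOT NULL DEFAULT FALSE,  -- event is on all the time
    PRIMARY KEY (id),
    UNIQUE KEY uq_title (title)
);

CREATE TABLE event_times (
    id             INT UNSIGNED NOT NULL AUTO_INCREMENT,
    event_id       INT UNSIGNED NOT NULL,
    start_datetime DATETIME NOT NULL,
    end_datetime   DATETIME NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (event_id) REFERENCES events(id)
);

-- The 12/12/2015 - 15/12/2015 availability becomes one row instead of four daily rows
INSERT INTO event_times (event_id, start_datetime, end_datetime)
VALUES (602, '2015-12-12 00:00:00', '2015-12-15 23:59:59');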
A note on dates/times: as a rule of thumb, whenever you deal with dates and times in a database, always store them in UTC. You probably don't want/need to mess with timezones in an assignment... but keep it in mind.
Possible extensions to the example
When designing a complete system one might add tables such as Venues, Organizers, Locations, etc... this can go on forever! I do try to think of future requirements when designing, but do not overdo it because you end up with a lot of fields that you don't use and increased complexity.
Conclusion
Normalization is something you have to keep in mind when designing a database; however, you can see that the more you normalize your schema, the more complex your selects and joins become. There is a trade-off there between data efficiency and query efficiency... That is the reason I used "from a university perspective" earlier. In a real-life system with complex data structures (for example graphs!) you might need to under-normalize the tables to make your queries more efficient/faster or easier. There are other approaches to deal with such issues (functions in the database, temporary/staging tables, views, etc.) but it always depends on the specific case.
Another really useful thing to keep in mind is: requirements always change! Design your databases taking for granted that fields will be added/removed, more tables will be added, new constraints will appear, etc., and thus make them as extensible and easy to modify as possible... (now we are scratching a bit at "Agile" methodologies)
I hope this helps and does not confuse things more. I am not a DBA per se but I have designed a few schemas. All the above comes from experience rather than a book and may not be 100% accurate. It is definitely not the only way to design a database... it's kind of an art, this job :)
I'm looking into the database design for a data warehouse kind of project which involves a large number of inserts daily. The data archives will later be used to generate reports. I will have a list of users (for example a set of 2 million users), for whom I need to monitor daily social networking activity.
For example, let there be a set of 100 users say U1,U2,...,U100
I need to insert their daily status count into my database.
Consider that the total status count obtained for user U1 for the period June 30 - July 6 is as follows:
June 30 - 99
July 1 - 100
July 2 - 102
July 3 - 102
July 4 - 105
July 5 - 105
July 6 - 107
The database should keep the daily status count of each user, like:
For user U1,
July 1- 1 (100-99)
July 2- 2 (102-100)
July 3- 0 (102-102)
July 4- 3 (105-102)
July 5- 0 (105-105)
July 6- 2 (107-105)
Similarly the database should hold archived details of the full set of users.
And in a later phase, I envision taking aggregate reports out of this data, like total points scored on each day/week/month, etc., and comparing them with older data.
I need to start things from scratch. I am experienced with PHP as a server-side scripting language and with MySQL. I am confused on the database side: since I need to process about a million insertions daily, what all should be taken care of?
I am unsure how to design a MySQL database in this regard: which storage engine to use and which design patterns to follow, keeping in mind that the data should later be usable effectively with aggregate functions.
Currently I envision the DB design with one table storing all the user ids, referenced by a foreign key, and a separate status count table for each day. Could lots of tables create some overhead?
Does MySQL fit my requirements? 2 million or more DB operations will be done every day. What should be considered about the server and other things in this case?
1) The database should handle concurrent inserts, which should enable 1-2 million inserts per day.
Before inserting, I plan to calculate the daily status count, i.e. the difference between today's count and yesterday's.
2) In a later phase, the archived data (collected over past days) is used as a data warehouse, and aggregation tasks are performed on it.
Comments:
I have read that MyISAM is the best choice for data warehousing projects, and at the same time have heard that InnoDB excels in many ways. Many have suggested proper tuning to get it done; I would like to get thoughts on that as well.
When creating a data warehouse, you don't have to worry about normalization. You're inserting rows and reading rows.
I'd just have one table like this.
Status Count
------------
User id
Date
Count
The primary (clustering) key would be (User id, Date). Another unique index would be (Date, User id).
As far as whether or not MySQL can handle this data warehouse, that depends on the hardware that MySQL is running on.
Since you don't need referential integrity, I'd use MyISAM as the engine.
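A sketch of that single table in MySQL, with hypothetical column names (the engine choice follows the discussion above and below):

CREATE TABLE status_count (
    user_id     INT UNSIGNED NOT NULL,
    status_date DATE NOT NULL,
    daily_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, status_date),             -- per-user history scans
    UNIQUE KEY uq_date_user (status_date, user_id)  -- per-day scans across all users
) ENGINE=MyISAM;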
As for table design, a dimensional model with a star schema is usually a good choice for a datamart where there are mostly inserts and reads. I see two different granularities for the status data, one for status per day and one for status per user, so I would recommend tables similar to:
user_status_fact(user_dimension_id int, lifetime_status int)
daily_status_fact (user_dimension_id int, calendar_dimension_id int, daily_status int)
user_dimension(user_dimension_id, user_id, name, ...)
calendar_dimension(calendar_dimension_id, calendar_date, day_of_week, etc..)
You might also consider having the most detailed data available even though you don't have a current requirement for it as it may make it easier to build aggregates in the future:
status_fact (user_dimension_id int, calendar_dimension_id int, hour_dimension_id, status_dimension_id, status_count int DEFAULT 1)
hour_dimension(hour_dimension_id, hour_of_day_24, hour_of_day_12, ...)
status_dimension(status_dimension_id, status_description string, ...)
If you aren't familiar with the dimensional model, I would recommend the book The Data Warehouse Toolkit by Kimball.
I would also recommend MyISAM since you don't need the transactional integrity provided by InnoDB when dealing with a read-mostly warehouse.
I would question whether you want to do concurrent inserts into a production database though. Often in a warehouse environment this data would get batched over time and inserted in bulk and perhaps go through a promotion process.
As for scalability, MySQL can certainly handle 2M write operations per day on modest hardware. I'm inserting 500K+ rows/day (batched hourly) on a cloud-based server with 8GB of RAM running Apache + PHP + MySQL, and the inserts aren't really noticeable to the PHP users hitting the same db.
I'm assuming you will get one new row per user per day inserted (not 2M rows a day, as some users will have more than one status). You should look at how many new rows per day you expect to be created. When you get to a large number of rows you might have to consider partitioning, sharding and other performance tricks. There are many books out there that can help you with that. Or you could also consider moving to an analytics db such as Amazon Redshift.
I would create a fact table for each user status for each day. This fact table would connect to a date dimension via a date_key and to a user dimension via a user_key. The primary key for the fact table should be a surrogate key = status_key.
So, your fact table now has four fields: status_key, date_key, user_key, status.
Once the dimension and fact tables have been loaded, then do the processing and aggregating.
Edit: I assumed you knew something about datamarts and star schemas. Here is a simple star schema to base your design on.
This design will store any user's status for a given day. (If the user status can change during the day, just add a time dimension).
This design will work on MySQL or SQL Server. You will have to manage a million inserts per day, so don't bog it down with comparisons to previous data points. You can do that with the datamart (star schema) after it's loaded - that's what it's for: analysis and aggregation.
If there are a large number of DML operations and you are mostly selecting records from the database, the MyISAM engine would be preferred. InnoDB is mainly used for transaction control and referential integrity. You can also specify the engine at the table level.
If you need to generate reports, the MyISAM engine also works faster than InnoDB. Check which tables or data you need for your report.
Remember that if you generate reports from a MySQL database, processing millions of rows with PHP could create a problem. You may encounter 500 or 501 errors frequently.
So from a report-generation viewpoint, the MyISAM engine for the required tables will be useful.
You can also store data in multiple tables to reduce overhead; otherwise there is a chance of a DB table crash.
It looks like you need a schema that will keep a single count per user per day. Very simple. You should create a single table which is DAY, USER_ID, and STATUS_COUNT.
Create an index on DAY and USER_ID together, and if possible keep the data in the table sorted by DAY and USER_ID also. This will give you very fast access to the data, as long as you are querying it by day ranges for any (or all) users.
For example:
select * from table where DAY = X and USER_ID in (Y, Z);
would be very fast because the data is ordered on disk sequentially by day, then by user_id, so there are very few seeks to satisfy the query.
On the other hand, if you are more interested in finding a particular user's activity for a range of days:
select * from table where USER_ID = X and DAY between Y and Z;
then the previous method is less optimal because finding the data will require many seeks instead of a sequential scan. Index first by USER_ID, then DAY, and keep the data sorted in that order; this will require more maintenance though, as the table would need to be re-sorted often. Again, it depends on your use case, and how fast you want your queries against the table to respond.
I don't use MySQL extensively, but I believe MyISAM is faster for inserts at the expense of transaction isolation. This should not be a problem for the system you're describing.
Also, 2MM records per day should be child's play (only 23 inserts / second) if you're using decent hardware. Especially if you can batch load the records using mysqlimport. If that's not possible, 23 inserts/second should still be very doable.
I would not do the computation of the delta from the previous day in the insertion of the current day, however. There is an analytic function called LAG() that will do that for you very handily (http://explainextended.com/2009/03/10/analytic-functions-first_value-last_value-lead-lag/), and the precomputed delta doesn't seem to serve any practical purpose at the detail level anyway.
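For illustration, a sketch of computing the daily delta at query time with LAG(); it assumes a hypothetical table user_status_total(user_id, status_date, total_count) holding the raw cumulative counts, and a MySQL version with window functions (8.0+):

-- Daily delta derived from cumulative counts at read time
SELECT
    user_id,
    status_date,
    total_count
        - LAG(total_count) OVER (PARTITION BY user_id ORDER BY status_date) AS daily_delta
FROM user_status_total
ORDER BY user_id, status_date;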
With this detail data, you can aggregate it any way you'd like, truncating the DAY column down to WEEK or MONTH, but be careful how you build aggregates. You're talking about over 700 million records per year, and re-building aggregates over so many rows can be very costly, especially on a single database. You might consider doing the aggregation processing with Hadoop (I'd recommend Spark over plain old Map/Reduce; it's far more powerful). This will take the computation burden off your database server (which can't easily scale to multiple servers) and allow it to do its job of recording and storing new data.
You should consider partitioning your table as well. Some purposes of partitioning tables are to distribute query load, ease archival of data, and possibly increase insert performance. I would consider partitioning along the month boundary for an application such as you've described.
I have a general question about the best way to set up my tables to deal with large volume data that I import on a daily basis.
I will import 10 CSV files containing thousands of records each day, so this table will expand rapidly.
It consists of 15 or so columns ranging from tiny and medium ints to 30 character varchars.
There is no ID field - I can combine 6 columns to form a primary key - this would be a varchar with a total length of about 45.
When it's imported I need to report on this data through a web front end at summary levels, so I see myself having to build reporting tables from it after importing.
Within this data are many fields that repeat themselves in each day's import - date, region, customer, etc.; only half the columns each day are specific to the record.
Questions:
Should I import it all into one table immediately as a dump table?
Should I transform the data during the import process and split the import across different tables?
Should I form an id field based on the columns I can combine, to get a unique key during the import?
Should I use an auto-increment id field for this?
What sort of table should this be - InnoDB, etc.?
My fear is data overload on this table, which will make extracting to reporting tables harder and harder as it grows.
Advice really helpful. Thanks.
Having an autoinc id is usually more helpful than not having one.
To ensure data integrity you can have a unique index on your 6 columns that make up the natural key.
MySQL is pretty comfortable with millions of records in a database if you have enough RAM.
If you still have a fear of millions of records - just aggregate your data on a monthly basis into another table. If you can't - add more RAM.
Transform as much of your data during import as possible, as long as it doesn't hurt performance. Transforming the data when it's already imported adds unnecessary load to the MySQL server, and if you can avoid doing so - avoid it.
MyISAM is (was?) usually better for the statistical kind of data, the kind that doesn't get UPDATEd too often, but InnoDB has caught up in the past few years (have a look at Percona's XtraDB engine) and is basically the same performance-wise.
I think the most important point here is to define your data retention rates - it's rare that you have to retain daily resolution after a year or two.
Aggregate into lower-resolution frames and archive (mysqldump > bzip is quite efficient) if you think you might still need daily resolution in the future.
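A sketch tying these suggestions together, with hypothetical table and column names (daily_import, monthly_summary) standing in for the real 15-column import:

-- Dump/staging table: autoinc id plus a unique index over the 6 natural-key columns
CREATE TABLE daily_import (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    import_date DATE NOT NULL,
    region      VARCHAR(30) NOT NULL,
    customer    VARCHAR(30) NOT NULL,
    key_col4    VARCHAR(30) NOT NULL,
    key_col5    VARCHAR(30) NOT NULL,
    key_col6    VARCHAR(30) NOT NULL,
    amount      INT NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_natural (import_date, region, customer, key_col4, key_col5, key_col6)
) ENGINE=InnoDB;

-- Monthly rollup for reporting; run once per month (or rebuild as needed)
INSERT INTO monthly_summary (month_start, region, customer, amount_sum)
SELECT DATE_FORMAT(import_date, '%Y-%m-01'), region, customer, SUM(amount)
FROM daily_import
WHERE import_date >= '2023-01-01' AND import_date < '2023-02-01'
GROUP BY DATE_FORMAT(import_date, '%Y-%m-01'), region, customer;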