Horizontal vs vertical data approach in MySql - mysql

I am creating an analytics module in our 'Tours & Travels' application.
These are the steps a user goes through in our application:
Step 1: User searches tours for a city.
Step 2: User views the details of a tour.
Step 3: If the user finds the right tour, he/she books it.
Step 4: While booking the tour, the user enters passenger details.
Step 5: User reviews the final data.
Step 6: User pays online & the tour gets booked.
Now I want to store each activity a user performs on our system for analysis purposes. For this I have the table structure below:
Id  | user_id | tour_id | city_id | searched_at         | viewed_at           | entered_pax_info_at | reviewed_at         | booked_at
----|---------|---------|---------|---------------------|---------------------|---------------------|---------------------|--------------------
151 | 34      | 678     | 1290    | 2021-03-14 12:00:00 | 2021-03-14 12:05:00 | 2021-03-14 12:10:00 | 2021-03-14 12:15:00 | 2021-03-14 12:20:00
Now while analyzing the data from this structure, the admin user may want data filtered on any one of the columns below:
searched_at
viewed_at
entered_pax_info_at
reviewed_at
booked_at
E.g. the admin user can ask for data like: give me a report of tour 'ABC' bookings from Jan 2021 to March 2021, and so on.
To make such searches efficient on huge data, I will have to put an index on each of the columns mentioned above. With those indexes there will be no efficiency problem while reading the data, but they will cost me on write and update operations.
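For illustration, this is roughly what those per-column indexes and the example report query would look like (the table name tour_activities is an assumption for this sketch):

-- Hypothetical: one index per searchable timestamp column on the wide table.
CREATE INDEX idx_searched_at ON tour_activities (searched_at);
CREATE INDEX idx_viewed_at   ON tour_activities (viewed_at);
CREATE INDEX idx_booked_at   ON tour_activities (booked_at);
-- ...and likewise for entered_pax_info_at and reviewed_at.

-- "Report of tour ABC booked from Jan 2021 to March 2021"
SELECT *
FROM tour_activities
WHERE tour_id = 678
  AND booked_at >= '2021-01-01'
  AND booked_at <  '2021-04-01';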
To counter this problem I am considering the table structure below:
id | user_id | tour_id | city_id | activity_type | date
---|---------|---------|---------|---------------|--------------------
50 | 34      | 678     | 1290    | searched      | 2021-03-14 12:00:00
51 | 34      | 678     | 1290    | viewed        | 2021-03-14 12:05:00
52 | 34      | 678     | 1290    | pax_info      | 2021-03-14 12:10:00
53 | 34      | 678     | 1290    | reviewed      | 2021-03-14 12:15:00
54 | 34      | 678     | 1290    | booked        | 2021-03-14 12:20:00
To make searches efficient on huge data with this table structure, I may only have to put indexes on the activity_type and date columns.
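Again only a sketch of what that could look like (the table name, the ENUM values, and the second index are assumptions; activity_date stands in for the date column above):

-- Hypothetical DDL for the vertical (one row per activity) design.
CREATE TABLE tour_activity_log (
  id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  user_id       INT NOT NULL,
  tour_id       INT NOT NULL,
  city_id       INT NOT NULL,
  activity_type ENUM('searched','viewed','pax_info','reviewed','booked') NOT NULL,
  activity_date DATETIME NOT NULL,
  KEY idx_type_date (activity_type, activity_date),
  KEY idx_tour_type_date (tour_id, activity_type, activity_date)
);

-- The same report as before:
SELECT *
FROM tour_activity_log
WHERE tour_id = 678
  AND activity_type = 'booked'
  AND activity_date >= '2021-01-01'
  AND activity_date <  '2021-04-01';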
But as I see it, the disadvantage of this structure is that it will take more space than the first approach.
I am unsure which approach (of the two above, or any other) will be future proof in terms of scalability and efficiency.
Any help sorting this out would be appreciated.

Your second alternative is far better than your first. It allows your system to be flexible about the number of steps you will analyze, for one thing. Normalized (vertical) tables almost always scale up better than denormalized (horizontal) tables.
And about the space used by your tables and indexes? Fuggedaboudit! Disk / SSD space is really cheap, and getting exponentially cheaper by the month.
Unless your system already has tens of millions of rows AND your database administrator is pressuring you to denormalize your tables for performance's sake, do not worry about the size of your tables. Seriously.

The analytic database should not be the operational database. In fact, I often work with analytic databases that are batch loaded and rarely -- if ever -- updated. Typically, analysts don't like their data changing under them as they are solving a problem.
In other words, either you need to rethink your approach or you have not described the full problem.
The first table you described looks like a good summary table for users that might be quite appropriate for analysts. It is not appropriate as an operational store for the data. In the world I live in, people are not so consistent about their searches. They search for the best tour in one city, find the price and other details, go back and check others. And so on. This is "navigation" and "path analysis", which your structure does not allow.
Such a summary table can be produced in a batch process. Even on a relatively large amount of data, that might take just a minute or two and it might be sufficient to do it once per day. If so, problem solved. There are no updates. The indexes are the ones needed on the analytic side.
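As a sketch of what such a batch could look like, assuming the vertical activity log from the question (all table and column names here are assumptions):

-- Hypothetical nightly batch: pivot yesterday's activity log into a per-user, per-tour summary row.
INSERT INTO user_tour_summary
  (user_id, tour_id, city_id, searched_at, viewed_at, entered_pax_info_at, reviewed_at, booked_at)
SELECT
  user_id,
  tour_id,
  city_id,
  MIN(CASE WHEN activity_type = 'searched' THEN activity_date END),
  MIN(CASE WHEN activity_type = 'viewed'   THEN activity_date END),
  MIN(CASE WHEN activity_type = 'pax_info' THEN activity_date END),
  MIN(CASE WHEN activity_type = 'reviewed' THEN activity_date END),
  MIN(CASE WHEN activity_type = 'booked'   THEN activity_date END)
FROM tour_activity_log
WHERE activity_date >= CURDATE() - INTERVAL 1 DAY
  AND activity_date <  CURDATE()
GROUP BY user_id, tour_id, city_id;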
On the other hand, there is a lot of analysis that this structure does not support. For instance, how many cities did a user look at before deciding on the final city? Well, maybe you could eke out the answer to that question.

I think you need 2 tables
one for browsing activity. In this situation, you probably should not even identify the user; let them be anonymous.
one for booking, paying, etc. (Probably more than one table, due to normalization, etc.)
The browsing table probably only gets INSERTs, and lots of them. If there will be many millions of rows, then we should talk about "summary tables". You don't have to decide on what, exactly, they summarize, but instead wait until the admins have requested some "reports".
The booking table(s) will have fewer INSERTs and possibly more UPDATEs than INSERTs.
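A minimal sketch of that split, with hypothetical names (the booking side would normally be several normalized tables):

-- Hypothetical browsing log: insert-only, and anonymous.
CREATE TABLE browse_activity (
  id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  session_id    CHAR(36) NOT NULL,   -- anonymous session rather than user_id
  tour_id       INT NOT NULL,
  city_id       INT NOT NULL,
  activity_type ENUM('searched','viewed') NOT NULL,
  created_at    DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY idx_tour_created (tour_id, created_at)
);

-- Hypothetical booking table: far fewer rows, updated as the booking progresses.
CREATE TABLE bookings (
  id             BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  user_id        INT NOT NULL,
  tour_id        INT NOT NULL,
  status         ENUM('pax_info','reviewed','booked','paid') NOT NULL,
  pax_entered_at DATETIME NULL,
  reviewed_at    DATETIME NULL,
  booked_at      DATETIME NULL,
  paid_at        DATETIME NULL,
  KEY idx_tour_booked (tour_id, booked_at)
);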

Related

MySQL single table with hourly, daily, monthly values, or separate tables?

When working with data values, should I create a single table storing the hourly values, and also the aggregated daily/monthly values, or should I create separate tables for these?
I'd imagine multiple tables would be the way to go, but I'm a complete amateur here. It sounds like something that would improve performance and possibly maintenance, but I'd also like to know if this even makes a difference. In the end, having 3-4 tables vs 1 could also cause some maintenance issues I would imagine.
So basically, a values_table containing:
id | value | datetime            | range
---|-------|---------------------|-------
1  | 33    | 2022-05-13 11:00:00 | hourly
2  | 54    | 2022-05-13 12:00:00 | hourly
3  | 840   | 2022-05-13          | daily
...
vs
hourly_values_table containing:
id | value | datetime
---|-------|--------------------
1  | 33    | 2022-05-13 11:00:00
2  | 54    | 2022-05-13 12:00:00
...
And a daily_values_table containing:
id | value | datetime
---|-------|-----------
1  | 840   | 2022-05-13
...
What would be the proper way to handle this?
Your hourly data is a Data Warehouse "Fact" table. It is, I assume, written 'continually' and never updated.
"Summary Table(s)" are useful for performance. Usually only 1 is needed. For example a "daily" table gives you about a 24x reduction. From that table you can fetch weekly, monthly, or any arbitrary date range reasonably efficiently. (I need more metrics and a better feel for what type of data you are storing to be surer of what I am saying.)
I discuss using MySQL for DW and Summary tables elsewhere.
Sure, purists debate the storing of "redundant" data. But when you get a billion rows, you really need summary tables to avoid performance bottlenecks.
As for how long to hold onto the data in the Fact table or the Summary table, I often suggest:
Use Partitioning for speedy purging of old data (after, say, a month), thereby saving disk space (a sketch follows this list);
Keep the summary tables 'forever', since they are 'small'.
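A sketch of the partition-for-purging idea (all names are assumptions; note that in MySQL the partitioning column must be part of the primary key):

-- Hypothetical: range-partition the fact table by day so old data can be dropped cheaply.
CREATE TABLE hourly_values_part (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value      INT NOT NULL,
  `datetime` DATETIME NOT NULL,
  PRIMARY KEY (id, `datetime`)
)
PARTITION BY RANGE (TO_DAYS(`datetime`)) (
  PARTITION p20220512 VALUES LESS THAN (TO_DAYS('2022-05-13')),
  PARTITION p20220513 VALUES LESS THAN (TO_DAYS('2022-05-14')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- Purging a day's data is then a near-instant metadata operation, not a slow DELETE:
ALTER TABLE hourly_values_part DROP PARTITION p20220512;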
I don't understand your purpose or your approach.
You have to start with the purpose of the database: what data are you trying to store, and why?
From reading your description I can't tell whether the data is supposed to be connected to a person or is for an accounting purpose; there's no context.
Start with the purpose of the database and this will identify the tables and their names, which will then reveal the structure and relationships. My post on this could help conceptually.

What is the best way to structure a database for a stock exchange?

I am trying to make a stock market simulator and I want it to be as realistic as possible.
My question is: Nasdaq has 3000+ companies in their database of stocks, right? But is there one entry row for every share of every symbol in the SQL db, like the following example?
Company Microsoft = MSFT
db `companies_shares`
ID | symbol | price | owner_id* | company_id | last_trade_datetime
---|--------|-------|-----------|------------|--------------------
1  | msft   | 58.99 | 54334     | 101        | 2019-06-15 13:09:32
2  | msft   | 58.99 | 54334     | 101        | 2019-06-15 13:09:32
3  | msft   | 58.91 | 2231      | 101        | 2019-06-15 13:32:32
4  | msft   | 58.91 | 544       | 101        | 2019-06-15 13:32:32
*owner_id = user id of the person that last bought the share.
Or is it calculated based on the shares available to trade and the buy/sell demand provided by the market maker?
I've already tried the first example but it takes a lot of space in my db, and I'm concerned about the bandwidth of all those trades, especially when millions of requests (trades) are being made every minute.
What is the best solution? Database or math?
Thanks in advance.
You might also want to Google many to many relationships.
Think about it this way. One person might own many stocks. One stock might be held by many people. That is a many-to-many relationship, usually modelled using three tables in a physical database. This is often written as M:M.
Also, people might buy or sell a single company's stock on multiple occasions; this would likely be modelled using another table. From the person's perspective there will be many transactions, so we have a new type of relationship: one (person) to many (transactions). This is often written as a 1:M relationship.
As to what to store, as a general rule it is best to store the atomic pieces of data. For example, for a transaction, store the customer ID, transaction date/time, the quantity bought or sold and the price at the very least.
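A sketch of that shape with hypothetical table names: three tables for the M:M holding relationship, plus a 1:M transactions table storing the atomic facts of each trade.

-- Hypothetical: people, stocks, and the M:M "holdings" table between them.
CREATE TABLE persons (
  person_id INT AUTO_INCREMENT PRIMARY KEY,
  name      VARCHAR(100) NOT NULL
);

CREATE TABLE stocks (
  stock_id INT AUTO_INCREMENT PRIMARY KEY,
  symbol   VARCHAR(10) NOT NULL UNIQUE
);

CREATE TABLE holdings (
  person_id INT NOT NULL,
  stock_id  INT NOT NULL,
  quantity  BIGINT NOT NULL,
  PRIMARY KEY (person_id, stock_id),
  FOREIGN KEY (person_id) REFERENCES persons (person_id),
  FOREIGN KEY (stock_id)  REFERENCES stocks (stock_id)
);

-- 1:M from person to transactions; store the atomic data of each trade.
CREATE TABLE transactions (
  transaction_id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  person_id      INT NOT NULL,
  stock_id       INT NOT NULL,
  traded_at      DATETIME NOT NULL,
  quantity       BIGINT NOT NULL,         -- positive = buy, negative = sell
  price          DECIMAL(12,4) NOT NULL,  -- price per share at trade time
  FOREIGN KEY (person_id) REFERENCES persons (person_id),
  FOREIGN KEY (stock_id)  REFERENCES stocks (stock_id)
);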
You might also want to read up about normalization. Usually 3rd normal form is a good level to strive for, but a lot of this is "it depends upon your circumstances and what you need to do". Often people will denormalize for speed of access at the expense of more storage and potentially more complicated updating.
You also mentioned performance. More often than not, big companies such as NASDAQ will use multiple layers of IT infrastructure. Each layer will have a different role and thus different functional and performance characteristics. Often there will be multiple servers operating together in a cluster. For example, they might use a NoSQL system to manage the high volume of trading. From there, there might be a feed (e.g. Kafka) into other systems for other purposes (e.g. fraud prevention, analytics, reporting and so on).
You also mention data volumes. I do not know how much data you are talking about, but one financial customer I've worked with has several petabytes of storage (1 petabyte = 1000 TB) running on over 300 servers just for their analytics platform. They were probably on the medium to large side as far as financial institutions go.
I hope this helps point you in the right direction.

Table Schema for storing various dates of availability

First off, I am new to database design so apologies for use of incorrect terminology.
For a university assignment I have been tasked with creating the database schema for a website. In part of the website, a user selects their availability for hosting an event, but the event can be at any time, for example from 12/12/2015 - 15/12/2015 and 16/01/2016 - 22/12/2016, and also on single dates such as 05/01/2016. They also have the option of having the event all the time.
So I am unsure how to store all these kinds of values in a database table without using a lot of rows. The example below is a basic one that would store each date of availability, but that is a lot of records, and that is just for one event. Is there a better method of storing these values, or would this be stored elsewhere, outside of a database?
calendar_id | event_id | available_date
---------------------------------------
492 | 602 | 12/12/2015
493 | 602 | 13/12/2015
494 | 602 | 14/12/2015
495 | 602 | 15/12/2015
496 | 602 | 05/01/2016
497 | 602 | 16/01/2016
498 | 602 | 17/01/2016
etc...
This definitely requires a database. I don't think you should be concerned about the number of records in a database... that is what databases do best. However, from a university perspective there is something called Normalization. In simple terms normalization is about minimizing data repetition.
Steps to design a schema
Identify entities
As the first step of designing a database schema I tend to identify all the entities in the system. Looking at your example I see (1) Events and (2) EventTimes (event occurrences/bookings) with a one-to-many relation since one Event might have multiple EventTimes. I would suggest that you keep these two entities separate in the database. That way an Event can be extended with more attributes/fields without affecting its EventTimes. Most importantly you can add many EventTimes on an Event without repeating all the event's fields (which would be the case if you use a single table).
Identify attributes
The second step for me is to identify all the attributes/fields of each entity. Additionally, I always suggest an auto-increment id in every table to uniquely identify a row.
Identify constraints
This might be a bit more advanced, but most of the time you have constraints on what data values are acceptable or what uniquely identifies a row in real life. For example, the Event.id might identify the row in the database, but you might also require that each event has a unique title.
Example schema
This has to be adjusted to the assignment or, in a real application, to the system's requirements
Events table
id int auto-increment
title varchar unique: Event's title
always_on boolean/enum: If 'Y' then the event is on all the time
... more fields here ... (category, tags, notes, description, venue,...)
EventTimes
id int auto-increment
event_id foreign key pointing to Event.id
start_datetime datetime or int (int if you go for a unix timestamp)
end_datetime : as above
... more fields again... (recursion below is a hard one! avoid it if you can)
recursion enum/int : Is the event repeated? Weekly, Monthly, etc.
recursion_interval int: Every x days, months, years, etc
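A rough DDL sketch of the two tables above (a starting point only; the exact types and the ENUM values for recursion are assumptions):

-- Hypothetical translation of the example schema into MySQL DDL.
CREATE TABLE Events (
  id        INT AUTO_INCREMENT PRIMARY KEY,
  title     VARCHAR(255) NOT NULL UNIQUE,
  always_on BOOLEAN NOT NULL DEFAULT FALSE
  -- more fields here: category, tags, notes, description, venue, ...
);

CREATE TABLE EventTimes (
  id                 INT AUTO_INCREMENT PRIMARY KEY,
  event_id           INT NOT NULL,
  start_datetime     DATETIME NOT NULL,   -- stored in UTC
  end_datetime       DATETIME NOT NULL,   -- stored in UTC
  recursion          ENUM('none','daily','weekly','monthly','yearly') NOT NULL DEFAULT 'none',
  recursion_interval INT NOT NULL DEFAULT 1,
  FOREIGN KEY (event_id) REFERENCES Events (id)
);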
A note on date/times, as a rule of thumb whenever you deal with dates and times in a database, always store them in UTC format. You probably don't want/need to mess with timezones in an assignment... but keep it in mind.
Possible extensions to the example
Designing a complete system one might add the tables: Venues, Organizers, Locations, etc... this can go on forever! I do try to think of future requirements when designing, but do not overdo it because you end up with a lot of fields that you don't use and increased complexity.
Conclusion
Normalization is something you have to keep in mind when designing a database; however, you can see that the more you normalize your schema, the more complex your selects and joins become. There is a trade-off there between data efficiency and query efficiency... That is the reason I used "from a university perspective" earlier. In a real-life system with complex data structures (for example graphs!) you might need to denormalize the tables to make your queries more efficient/faster or easier. There are other approaches to deal with such issues (functions in the database, temporary/staging tables, views, etc.) but it always depends on the specific case.
Another really useful thing to keep in mind is: requirements always change! Design your databases taking for granted that fields will be added/removed, more tables will be added, new constraints will appear, etc., and thus make it as extensible and easy to modify as possible... (now we are scratching a bit at "Agile" methodologies)
I hope this helps and does not confuse things more. I am not a DBA per se, but I have designed a few schemas. All the above comes from experience rather than a book and may not be 100% accurate. It is definitely not the only way to design a database... this job is kind of an art :)

Database design suggestions for a data scraping/warehouse application?

I'm looking into the database design for a data warehouse kind of project which involves a large number of inserts daily. The data archives will later be used to generate reports. I will have a list of users (for example a set of 2 million users), for whom I need to monitor daily social networking activity.
For example, let there be a set of 100 users say U1,U2,...,U100
I need to insert their daily status count into my database.
Consider that the total status count obtained for user U1 for the period June 30 - July 6 is as follows:
June 30 - 99
July 1 - 100
July 2 - 102
July 3 - 102
July 4 - 105
July 5 - 105
July 6 - 107
The database should keep the daily status count of each user, like:
For user U1,
July 1- 1 (100-99)
July 2- 2 (102-100)
July 3- 0 (102-102)
July 4- 3 (105-102)
July 5- 0 (105-105)
July 6- 2 (107-105)
Similarly the database should hold archived details of the full set of users.
At a later phase, I plan to produce aggregate reports from this data, like total points scored on each day, week, month, etc., and to compare them with older data.
I need to start from scratch. I am experienced with PHP as a server-side language and with MySQL, but I am confused on the database side. Since I need to process about a million insertions daily, what should be taken care of?
I am confused about how to design a MySQL database for this: which storage engine should be used and which design patterns followed, keeping in mind that the data should later be usable effectively with aggregate functions?
Currently I envision the DB design with one table storing all the user ids referenced by a foreign key, and a separate status count table for each day. Could lots of tables create overhead?
Does MySQL fit my requirements? 2 million or more DB operations will be done every day. What should be considered about the server and other things in this case?
1) The database should handle concurrent inserts, which should enable 1-2 million inserts per day.
Before inserting, I plan to calculate the daily status count, i.e. the difference between today's count and yesterday's.
2) At a later phase, the archived data (collected over past days) is used as a data warehouse and aggregation tasks are performed on it.
Comments:
I have read that MyISAM is the best choice for data warehousing projects, and at the same time heard that InnoDB excels in many ways. Many have suggested proper tuning to get it done; I would like to get thoughts on that as well.
When creating a data warehouse, you don't have to worry about normalization. You're inserting rows and reading rows.
I'd just have one table like this.
Status Count
------------
User id
Date
Count
The primary (clustering) key would be (User id, Date). Another unique index would be (Date, User id).
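A sketch of that single table (the table name and column types are assumptions):

-- Hypothetical fact table keyed by (user_id, date).
CREATE TABLE status_count (
  user_id INT NOT NULL,
  `date`  DATE NOT NULL,
  `count` INT NOT NULL,
  PRIMARY KEY (user_id, `date`),
  UNIQUE KEY idx_date_user (`date`, user_id)
);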
As far as whether or not MySQL can handle this data warehouse, that depends on the hardware that MySQL is running on.
Since you don't need referential integrity, I'd use MyISAM as the engine.
As for table design, a dimensional model with a star schema is usually a good choice for a datamart where there are mostly inserts and reads. I see two different granularities for the status data, one for status per day and one for status per user, so I would recommend tables similar to:
user_status_fact(user_dimension_id int, lifetime_status int)
daily_status_fact (user_dimension_id int, calendar_dimension_id int, daily_status int)
user_dimension(user_dimension_id, user_id, name, ...)
calendar_dimension(calendar_dimension_id, calendar_date, day_of_week, etc..)
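For example, a report over that star schema might look like this (a sketch; the filter columns and date values are placeholders):

-- Hypothetical: one user's daily status over a date range, via the fact table and its dimensions.
SELECT c.calendar_date, f.daily_status
FROM daily_status_fact f
JOIN calendar_dimension c ON c.calendar_dimension_id = f.calendar_dimension_id
JOIN user_dimension     u ON u.user_dimension_id     = f.user_dimension_id
WHERE u.user_id = 'U1'
  AND c.calendar_date BETWEEN '2013-06-30' AND '2013-07-06'
ORDER BY c.calendar_date;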
You might also consider having the most detailed data available even though you don't have a current requirement for it as it may make it easier to build aggregates in the future:
status_fact (user_dimension_id int, calendar_dimension_id int, hour_dimension_id, status_dimension_id, status_count int DEFAULT 1)
hour_dimension(hour_dimension_id, hour_of_day_24, hour_of_day_12, ...)
status_dimension(status_dimension_id, status_description string, ...)
If you aren't familiar with the dimensional model, I would recommend the book The Data Warehouse Toolkit by Kimball.
I would also recommend MyISAM since you don't need the transactional integrity provided by InnoDB when dealing with a read-mostly warehouse.
I would question whether you want to do concurrent inserts into a production database though. Often in a warehouse environment this data would get batched over time and inserted in bulk and perhaps go through a promotion process.
As for scalability, mysql can certainly handle 2M write operations per day on modest hardware. I'm inserting 500K+ rows/day (batched hourly) on a cloud based server with 8GB of ram running apache + php + mysql and the inserts aren't really noticeable to the php users hitting the same db.
I'm assuming you will get one new row per user per day inserted (not 2M rows a day, as some users will have more than one status). You should look at how many new rows per day you expect to be created. When you get to a large number of rows you might have to consider partitioning, sharding and other performance tricks. There are many books out there that can help you with that. Or you could also consider moving to an analytics db such as Amazon Redshift.
I would create a fact table for each user status for each day. This fact table would connect to a date dimension via a date_key and to a user dimension via a user_key. The primary key for the fact table should be a surrogate key = status_key.
So, your fact table now has four fields: status_key, date_key, user_key, status.
Once the dimension and fact tables have been loaded, then do the processing and aggregating.
Edit: I assumed you knew something about datamarts and star schemas. Here is a simple star schema to base your design on.
This design will store any user's status for a given day. (If the user status can change during the day, just add a time dimension).
This design will work on MySQL or SQL Server. You will have to manage a million inserts per day; don't bog it down with comparisons to previous data points. You can do that with the datamart (star schema) after it's loaded - that's what it's for - analysis and aggregation.
If there are a large number of DML operations and you are selecting records from the database, the MyISAM engine would be preferred. InnoDB is mainly used for transaction control and referential integrity. You can also specify the engine at the table level.
If you need to generate reports, the MyISAM engine also works faster than InnoDB. Look at which tables or data you need for your reports.
Remember that if you generate reports from a MySQL database, processing millions of rows with PHP could create a problem; you may frequently encounter 500 or 501 errors.
So from a report-generation point of view, the MyISAM engine for the required tables will be useful.
You can also store data in multiple tables to prevent overhead; otherwise there is a chance of a DB table crash.
It looks like you need a schema that will keep a single count per user per day. Very simple. You should create a single table which is DAY, USER_ID, and STATUS_COUNT.
Create an index on DAY and USER_ID together, and if possible keep the data in the table sorted by DAY and USER_ID also. This will give you very fast access to the data, as long as you are querying it by day ranges for any (or all) users.
For example:
select * from table where DAY = X and USER_ID in (Y, Z);
would be very fast because the data is ordered on disk sequentially by day, then by user_id, so there are very few seeks to satisfy the query.
On the other hand, if you are more interested in finding a particular user's activity for a range of days:
select * from table where USER_ID = X and DAY between Y and Z;
then the previous method is less optimal because finding the data will require many seeks instead of a sequential scan. Index first by USER_ID, then DAY, and keep the data sorted in that order; this will require more maintenance though, as the table would need to be re-sorted often. Again, it depends on your use case, and how fast you want your queries against the table to respond.
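A sketch of the two index orderings being discussed (the table name daily_status is a placeholder; choose based on your dominant query pattern):

-- Optimized for queries like: WHERE DAY = X AND USER_ID IN (Y, Z)
CREATE INDEX idx_day_user ON daily_status (`day`, user_id);

-- Optimized for queries like: WHERE USER_ID = X AND DAY BETWEEN Y AND Z
CREATE INDEX idx_user_day ON daily_status (user_id, `day`);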
I don't use MySQL extensively, but I believe MyISAM is faster for inserts at the expense of transaction isolation. This should not be a problem for the system you're describing.
Also, 2MM records per day should be child's play (only 23 inserts / second) if you're using decent hardware. Especially if you can batch load the records using mysqlimport. If that's not possible, 23 inserts/second should still be very doable.
I would not compute the delta from the previous day during the insertion of the current day, however. There is an analytic function called LAG() that will do that for you very handily (http://explainextended.com/2009/03/10/analytic-functions-first_value-last_value-lead-lag/), not to mention that the stored delta doesn't seem to serve any practical purpose at the detail level.
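A sketch of that with window functions (available in MySQL 8.0+; the table and column names are placeholders matching the single-table design above):

-- Hypothetical: derive the daily delta at query time instead of storing it.
SELECT
  user_id,
  `day`,
  status_count,
  status_count - LAG(status_count) OVER (PARTITION BY user_id ORDER BY `day`) AS daily_delta
FROM daily_status
ORDER BY user_id, `day`;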
With this detail data, you can aggregate it any way you'd like, truncating the DAY column down to WEEK or MONTH, but be careful how you build aggregates. You're talking about over 7 billion records per year, and re-building aggregates over so many rows can be very costly, especially on a single database. You might consider doing aggregation processing using Hadoop (I'd recommend Spark over plain old Map/Reduce also; it's far more powerful). This will alleviate any computation burden from your database server (which can't easily scale to multiple servers) and allow it to do its job of recording and storing new data.
You should consider partitioning your table as well. Some purposes of partitioning tables are to distribute query load, ease archival of data, and possibly increase insert performance. I would consider partitioning along the month boundary for an application such as you've described.

best way to store user's "favorites" in MySQL

I have a photo gallery. I want to add an "Add to favorites" button so a user can add another user to his/her favorites. And then I want each user to be able to view his list of favorite users, as well as the list of users who added him to their favorites.
I found two ways, and the first is:
faver_id | faved_id
---------|---------
1        | 10
1        | 31
1        | 24
10       | 1
10       | 24
I don't like this method because of:
1) a lot of repetition, and 2) a very large table in the future (if I have at least 1001 users, and each likes the other 1000 users, that is 1,001,000 records), which I suppose will slow down my database.
The second way is:
user_id | favs
--------|------------------------
1       | 1 23 34 56 87 23
10      | 45 32 67 54 34 88 101
I can take these favs and explode() them in PHP, or check whether a user likes some other user with a MySQL query: select count(user_id) from users where favs LIKE '% 23 %' and user_id=10;
But I feel the second way is not very "correct" in MySQL terms.
Can you advise me on this?
Think about this. Your argument against using the first approach is that your tables might get too big, but you then go on to say that if you use the second approach you could run a wildcard query to find fields which contain something.
The second approach forces a full table search, and is unindexable. With the first approach, you just slap indexes on each of your columns and you're good to go. The first approach scales much, much, much better than the second one. Since scaling seems to be your only concern with the first, I think the answer is obvious.
Go with the first approach. Many-to-Many tables are used everywhere, and for good reason.
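A minimal sketch of the first approach with the indexes it needs (table and index names are placeholders):

-- Hypothetical many-to-many "favorites" table.
CREATE TABLE favorites (
  faver_id INT NOT NULL,
  faved_id INT NOT NULL,
  PRIMARY KEY (faver_id, faved_id),         -- serves "who did I favorite?" and prevents duplicates
  KEY idx_faved_faver (faved_id, faver_id)  -- serves "who favorited me?"
);

-- Users that user 1 added to favorites:
SELECT faved_id FROM favorites WHERE faver_id = 1;

-- Users who added user 10 to their favorites:
SELECT faver_id FROM favorites WHERE faved_id = 10;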
Edit:
Another problem is that the second approach is handing off a lot of the work in maintaining the database off to the application. This is fine in some cases, but the cases you're talking about are things that the database excels at. You would only be reinventing the wheel, and badly.
Definitely go with the first way.
Well, the second way is not that easy when you want to remove or make changes, but it's all right in terms of MySQL.
Though, Joomla even stores different kinds of information in a single field called params.