How to extract relational data faster than SQL - MySQL

I have graph-like data in SQL. The data can be described as:
products table - a list of SKUs classified into two (2) classes:
Class 1: non-vehicle-specific (universally fits all vehicles)
Class 2: vehicle-specific (custom-fit to a specific set of vehicles)
1 SKU fits one or more vehicles (YMMSE)
vehicle master table (year, make, model, submodel, engine) aka YMMSE
e.g.
2014 Ford Fiesta S 4 Cylinder, 1.6L
applications table - the relationship between custom-fit products and the corresponding vehicle YMMSE.
I have an applications table that runs into gigabytes, with approximately 85 million records.
The problem is that querying for the SKU-specific vehicle YMMSE takes a long time in SQL, especially for SKUs that have a lot of application mappings, aka "almost-universal" SKUs.
The applications table gets updated frequently, so I need to be able to re-run the expensive queries every time, to the point where the MySQL server is almost giving up or replication starts to lag as a result.
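A typical query looks roughly like the following sketch (table and column names are simplified assumptions, not the actual schema):

-- hypothetical schema: applications(sku, ymmse_id), vehicles(ymmse_id, year, make, model, submodel, engine)
SELECT v.year, v.make, v.model, v.submodel, v.engine
FROM applications a
JOIN vehicles v ON v.ymmse_id = a.ymmse_id
WHERE a.sku = 'ABC-123';

With a composite index on applications (sku, ymmse_id) the lookup itself is cheap; most of the time goes into producing and transferring the very large result set for an "almost-universal" SKU.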
The question is:
Would a distributed processing framework like Hadoop or Spark be able to help me speed up the process of discovering SKU-specific vehicle mappings?
Regards,
Jun

Frameworks like Hadoop or Spark can help take some pressure off your database, but they are not designed for low-latency operations. If the data is graph-like and your queries represent some type of graph traversal, you'd be better off with a dedicated tool such as a graph database.

Related

Rails ActiveRecord Sharding for B2B Application

I have a B2B application with the following requirements:
We are on Rails5. Mysql 5.6 is our main production DB.
Below is the schema for table user_preferences
client_id, user_id, pref1, pref2,.. pref10, created_at, updated_at
The total number of preference columns is 10. The pref fields' data types are tinyint and int.
Also there are 2 datetime fields.
All above data is currently stored in a single table.
As more and more clients join our application, we expect this table to grow huge soon. It's possible for a single client to have 10 million users.
Question:
I am looking for a way to shard my tables based on client ID, e.g. client 1 has table client1_user_preferences, client 2 has client2_user_preferences, etc. All read/write queries will go to the respective client's user preference table.
Schema will be something like:
client1_user_preferences
user_id, pref1, pref2,.. pref10, created_at, updated_at
Is there a way Rails ActiveRecord supports this? If there is no gem/plugin, I am open to other suggestions too.
Why am I looking to separate tables based on client?
We have a business use case to give clients the option to manage their own users' data by moving the table to their own DB after they have used the application service. With sharded tables the migration will be very smooth.
Extensive read/write operations are performed on this table. With this approach the data per table will be smaller, hence faster read and write queries.
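To make the proposed layout concrete, each per-client table would be a sketch along these lines (column types are assumptions based on the description above):

CREATE TABLE client1_user_preferences (
  user_id    BIGINT UNSIGNED NOT NULL,
  pref1      TINYINT,
  pref2      TINYINT,
  -- ... pref3 through pref10 (tinyint/int as described above) ...
  created_at DATETIME NOT NULL,
  updated_at DATETIME NOT NULL,
  PRIMARY KEY (user_id)
) ENGINE=InnoDB;

Handing a client their data then reduces to dumping or moving that single table to their own DB.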
Since your aspirations are high and your budget is low, I recommend you implement something now. But plan on re-engineering the entire design within 6 months.
The "something" you implement now will give you some feel for the pitfalls you will encounter, and teach you what needs to be fixed in the re-engineering. And it may point out the limitations of the software you are using. (Generally, big projects, have to throw out any 3rd party package that stands between the client software and the database. It was handy for proof-of-concept, but fell apart when scaling)
And do not think that the second step is the last upheaval -- plan to again rebuild the entire system in another 6 months.
The thing that hurts big projects the most is getting stuck with an early design, yet not having the willingness to throw it out when it cannot scale further.

Database (OLTP) and Reporting

I am working on a trading platform that has reporting as a big portion of its business.
The set up is the following:
SQL OLTP database (about 200 tables) – rather small in number of records (20,000 records in the biggest table, but it keeps growing every week).
For reporting services, SQL views are used to query the live transaction database. Imagine the result set of the views as a denormalized one, in the spirit of a data warehouse approach. These data sets are then passed to a third-party reporting platform (like Tableau, Power BI or Sisense), which takes the data sets and throws them into cubes (probably some columnar structure, like MongoDB, Hadoop, etc.). From there the reports are generated.
Current challenges.
The SQL views (about 8) are huge and very hard to maintain. To give you an example, one of the views outputs 100 fields, but each of these fields is a calculated field with complicated CASE statements, nested IF statements, inline functions, and what not, which makes the view as big as 700 lines of SQL code. I inherited these from another employee and now, sadly, I have to maintain them.
Because the data grows weekly by several hundred records (through migration and transactions) and the number of fields in the views also grows (a few every week), the cube build takes longer and longer. To give you an example, a few months ago we set the cube to rebuild every 10 minutes to refresh the data (the build was taking 5 minutes). Currently it takes 12-15 minutes to build, so we set it to every 30 minutes. As you can imagine, this will get worse as the data and the number of fields keep growing, and we kind of need the data as current as possible.
The only good thing is that once the cube is built, the reports load fast because they are pulled from the 3rd-party platform, so no concerns there.
What I have in mind
I would like to get rid of the views so that I can ease the maintenance burden and also keep the duration of the cube rebuild to a minimum.
Options:
To build a data warehouse, and then build SSIS packages to populate this structure with the live transactional data. The denormalized structure would probably look very similar to the views mentioned above. The drawback here is that I don't really feel like I'm simplifying much; I'm actually adding one more layer, which is the data migration from the OLTP to the OLAP (data warehouse). And I would still have to rebuild the cube.
To turn the current views into SQL indexed views (materialized views), but in their current state I simply cannot do it, because of the aggregates and inline functions used heavily across the entire views.
Another option I read about is to build an ODS (Operational Data Store – a database that would contain the necessary tables, similar to the SQL views I have now, and that would be refreshed constantly). Maybe using triggers, or transaction logs? But I am not sure what it involves to build such a thing and how hard it is to maintain.
Question:
What approach should I take?
Do any of the 3 above make any sense?
Of course, I am interested in other ideas or suggestions, as well.
Thank you!
From my experience your best approach will be option 1. It is costly, but it will give you better benefits. Create a ROLAP DWH (I recommend Kimball's "The Data Warehouse Toolkit" for best practices and design patterns) and, if you have the opportunity, use a columnar data store (like Amazon Redshift or SAP Sybase IQ). All the CASE statements, nested IFs and other operations you mentioned would be applied at ETL time, so in the ROLAP layer everything is precalculated and optimized for consumption. And don't forget about applying indexes (depending on the underlying technology you use). Some database vendors have published "indexing best practices" for ROLAP, which tell you which type of index to apply depending on the type of table (dimension) and the data type, for example.
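To illustrate the "apply the CASE logic at ETL time" point, a load step might look like the following sketch (all table and column names here are invented for illustration): the nested CASE logic runs once per load, and the cube then reads plain, precalculated columns.

-- hypothetical incremental load into a denormalized reporting/fact table
INSERT INTO fact_trades (trade_id, trade_date, client_id, risk_band, net_amount)
SELECT t.trade_id,
       t.trade_date,
       t.client_id,
       CASE WHEN t.exposure > 100000 THEN 'HIGH'
            WHEN t.exposure > 10000  THEN 'MEDIUM'
            ELSE 'LOW'
       END AS risk_band,                       -- computed once, at load time
       t.gross_amount - t.fees AS net_amount   -- no calculated fields left for the reporting layer
FROM trades t
WHERE t.trade_date >= CURRENT_DATE - INTERVAL 1 DAY;  -- only reload recent rows

The reporting tool then selects risk_band and net_amount directly, with no nested CASEs evaluated at query time.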

Database design suggestions for a data scraping/warehouse application?

I'm looking into the database design for a data-warehouse kind of project which involves a large number of inserts daily. The data archives will be further used to generate reports. I will have a list of users (for example a user set of 2 million) for whom I need to monitor the daily social networking activities associated with them.
For example, let there be a set of 100 users say U1,U2,...,U100
I need to insert their daily status count into my database.
Consider that the total status count obtained for user U1 for the period June 30 - July 6 is as follows:
June 30 - 99
July 1 - 100
July 2 - 102
July 3 - 102
July 4 - 105
July 5 - 105
July 6 - 107
The database should keep the daily status count of each user, like:
For user U1,
July 1- 1 (100-99)
July 2- 2 (102-100)
July 3- 0 (102-102)
July 4- 3 (105-102)
July 5- 0 (105-105)
July 6- 2 (107-105)
Similarly, the database should hold archived details for the full set of users.
At a later phase, I envision taking aggregate reports out of these data, like total points scored on each day, week, month, etc., and comparing them with older data.
I need to start things from scratch. I am experienced with PHP as a server-side scripting language and with MySQL, but I am confused on the database side. Since I need to process about a million insertions daily, what all should be taken care of?
I am confused about how to design a MySQL database in this regard: which storage engine to use and which design patterns to follow, keeping in mind that the data could later be used effectively with aggregate functions.
Currently I envision the DB design with one table storing all the user IDs with a foreign key and a separate status count table for each day. Would lots of tables create overhead?
Does MySQL fit my requirement? 2 million or more DB operations will be done every day. How should the server and other things be considered in this case?
1) The database should handle concurrent inserts, which should enable 1-2 million inserts per day.
Before inserting, I suggest calculating the daily status count, i.e. the difference between today's count and yesterday's.
2) At a later phase, the archived data (collected over past days) is used as a data warehouse and aggregation tasks are to be performed on it.
Comments:
I have read that MyISAM is the best choice for data warehousing projects, and at the same time I've heard that InnoDB excels in many ways. Many have suggested proper tuning to get it done; I would like to get thoughts on that as well.
When creating a data warehouse, you don't have to worry about normalization. You're inserting rows and reading rows.
I'd just have one table like this.
Status Count
------------
User id
Date
Count
The primary (clustering) key would be (User id, Date). Another unique index would be (Date, User id).
As far as whether or not MySQL can handle this data warehouse, that depends on the hardware that MySQL is running on.
Since you don't need referential integrity, I'd use MyISAM as the engine.
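A minimal DDL sketch of that design (column types and index names are assumptions):

CREATE TABLE status_count (
  user_id   INT UNSIGNED NOT NULL,
  stat_date DATE NOT NULL,
  count     INT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id, stat_date),              -- primary key (User id, Date)
  UNIQUE KEY idx_date_user (stat_date, user_id)  -- secondary unique index (Date, User id)
) ENGINE=MyISAM;   -- MyISAM per the reasoning above; InnoDB also works if you prefer it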
As for table design, a dimensional model with a star schema is usually a good choice for a datamart where there are mostly inserts and reads. I see two different granularities for the status data, one for status per day and one for status per user, so I would recommend tables similar to:
user_status_fact(user_dimension_id int, lifetime_status int)
daily_status_fact (user_dimension_id int, calendar_dimension_id int, daily_status int)
user_dimension(user_dimension_id, user_id, name, ...)
calendar_dimension(calendar_dimension_id, calendar_date, day_of_week, etc..)
You might also consider having the most detailed data available even though you don't have a current requirement for it as it may make it easier to build aggregates in the future:
status_fact (user_dimension_id int, calendar_dimension_id int, hour_dimension_id, status_dimension_id, status_count int DEFAULT 1)
hour_dimension(hour_dimension_id, hour_of_day_24, hour_of_day_12, ...)
status_dimension(status_dimension_id, status_description string, ...)
If you aren't familiar with the dimensional model, I would recommend the book The Data Warehouse Toolkit by Kimball.
I would also recommend MyISAM since you don't need the transactional integrity provided by InnoDB when dealing with a read-mostly warehouse.
I would question whether you want to do concurrent inserts into a production database though. Often in a warehouse environment this data would get batched over time and inserted in bulk and perhaps go through a promotion process.
As for scalability, mysql can certainly handle 2M write operations per day on modest hardware. I'm inserting 500K+ rows/day (batched hourly) on a cloud based server with 8GB of ram running apache + php + mysql and the inserts aren't really noticeable to the php users hitting the same db.
I'm assuming you will get one new row per user per day inserted (not 2M rows a day, as some users will have more than one status). You should look at how many new rows per day you expect to be created. When you get to a large number of rows you might have to consider partitioning, sharding and other performance tricks. There are many books out there that can help you with that. You could also consider moving to an analytics DB such as Amazon Redshift.
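A minimal DDL sketch of the daily-grain star schema described above (column types and lengths are assumptions):

CREATE TABLE user_dimension (
  user_dimension_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_id           INT UNSIGNED NOT NULL,
  name              VARCHAR(100)
);

CREATE TABLE calendar_dimension (
  calendar_dimension_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  calendar_date         DATE NOT NULL,
  day_of_week           TINYINT NOT NULL
);

CREATE TABLE daily_status_fact (
  user_dimension_id     INT UNSIGNED NOT NULL,
  calendar_dimension_id INT UNSIGNED NOT NULL,
  daily_status          INT NOT NULL,
  PRIMARY KEY (user_dimension_id, calendar_dimension_id)  -- one row per user per day
);

Aggregates by day, week or month then become GROUP BY queries joining daily_status_fact to calendar_dimension.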
I would create a fact table for each user status for each day. This fact table would connect to a date dimension via a date_key and to a user dimension via a user_key. The primary key for the fact table should be a surrogate key = status_key.
So, your fact table now has four fields: status_key, date_key, user_key, status.
Once the dimension and fact tables have been loaded, then do the processing and aggregating.
Edit: I assumed you knew something about datamarts and star schemas. Here is a simple star schema to base your design on.
This design will store any user's status for a given day. (If the user status can change during the day, just add a time dimension).
This design will work on MySQL or SQL Server. You will have to manage a million inserts per day, so don't bog it down with comparisons to previous data points. You can do that with the datamart (star schema) after it's loaded – that's what it's for: analysis and aggregation.
If there are a large number of DML operations and you are selecting records from the database, the MyISAM engine would be preferred. InnoDB is mainly used for TCL (transaction control) and referential integrity. You can also specify the engine at the table level.
If you need to generate reports, the MyISAM engine also works faster than InnoDB. Check which tables or data you need for your reports.
Remember that if you generate reports from the MySQL database, processing millions of rows with PHP could create a problem; you may frequently encounter 500 or 501 errors.
So from a report-generation viewpoint, the MyISAM engine for the required tables will be useful.
You can also store the data in multiple tables to reduce overhead; otherwise there is a chance of a DB table crash.
It looks like you need a schema that will keep a single count per user per day. Very simple. You should create a single table which is DAY, USER_ID, and STATUS_COUNT.
Create an index on DAY and USER_ID together, and if possible keep the data in the table sorted by DAY and USER_ID also. This will give you very fast access to the data, as long as you are querying it by day ranges for any (or all) users.
For example:
select * from table where DAY = X and USER_ID in (Y, Z);
would be very fast because the data is ordered on disk sequentially by day, then by user_id, so there are very few seeks to satisfy the query.
On the other hand, if you are more interested in finding a particular user's activity for a range of days:
select * from table where USER_ID = X and DAY between Y and Z;
then the previous method is less optimal because finding the data will require many seeks instead of a sequential scan. Index first by USER_ID, then DAY, and keep the data sorted in that order; this will require more maintenance though, as the table would need to be re-sorted often. Again, it depends on your use case, and how fast you want your queries against the table to respond.
I don't use MySQL extensively, but I believe MyISAM is faster for inserts at the expense of transaction isolation. This should not be a problem for the system you're describing.
Also, 2MM records per day should be child's play (only 23 inserts / second) if you're using decent hardware. Especially if you can batch load the records using mysqlimport. If that's not possible, 23 inserts/second should still be very doable.
I would not compute the delta from the previous day during the insertion of the current day's row, however. There is an analytic function called LAG() that will do that for you very handily (http://explainextended.com/2009/03/10/analytic-functions-first_value-last_value-lead-lag/), not to mention that the stored delta doesn't seem to serve any practical purpose at the detail level.
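For example, if the table stores the raw cumulative count per user per day, the daily delta can be derived at query time. Note that LAG() is only built into MySQL 8.0+; on older versions you would emulate it with a self-join or user variables, as the linked article describes (table and column names follow the sketch above and are assumptions):

SELECT user_id,
       day,
       status_count
         - LAG(status_count) OVER (PARTITION BY user_id ORDER BY day) AS daily_delta
FROM daily_status;   -- hypothetical table with DAY, USER_ID, STATUS_COUNT columns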
With this detail data, you can aggregate it any way you'd like, truncating the DAY column down to WEEK or MONTH, but be careful how you build aggregates. You're talking about over 700 million records per year, and re-building aggregates over so many rows can be very costly, especially on a single database. You might consider doing the aggregation processing in Hadoop (I'd recommend Spark over plain old Map/Reduce; it's far more powerful). This will take the computation burden off your database server (which can't easily scale to multiple servers) and allow it to do its job of recording and storing new data.
You should consider partitioning your table as well. Some purposes of partitioning tables are to distribute query load, ease archival of data, and possibly increase insert performance. I would consider partitioning along the month boundary for an application such as you've described.
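A sketch of month-boundary range partitioning for such a table (partition names and boundary dates are purely illustrative):

CREATE TABLE daily_status (
  user_id      INT UNSIGNED NOT NULL,
  day          DATE NOT NULL,
  status_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (day, user_id)   -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (TO_DAYS(day)) (
  PARTITION p201307 VALUES LESS THAN (TO_DAYS('2013-08-01')),
  PARTITION p201308 VALUES LESS THAN (TO_DAYS('2013-09-01')),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);

Archiving a month then becomes a cheap ALTER TABLE ... DROP PARTITION instead of a huge DELETE.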

MySQL: Huge tables that need to be joined – how should they be split for optimization?

The case:
I have been developing a web application in which I store data from different automated data sources. Currently I am using MySQL as the DBMS and PHP as the programming language, on a shared LAMP server.
I use several tables to identify the data sources and two tables for the data updates. Data sources are in a three-level hierarchy, and updates are timestamped.
One table contains the two upper levels of the hierarchy (geographic location and instrument), plus the timestamp and an “update ID”. The other table contains the update ID, the third level of the hierarchy (meter), and the value.
Most queries involve a join between these two tables.
Currently the first table contains nearly 2.5 million records (290 MB) and the second table has over 15 million records (1.1 GB); each hour about 500 records are added to the first table and 3,000 to the second one, and I expect these numbers to increase. I don't think these numbers are too big, but I've been experiencing some performance drawbacks.
Most queries involve looking for immediate past activity (per site, per group of sites, and per instrument), which is no problem, but some involve summaries of daily, weekly and monthly activity (per site and per instrument). Those pages take several seconds to load, sometimes exceeding the server's timeout (30s).
It also seems that the automatic updates are suffering from these timeouts, causing the connection to fail.
The question:
Is there any rational way to split these tables so that queries perform more quickly?
Or should I attempt other types of optimizations not involving splitting tables?
(I think the tables are properly indexed, and I know that a possible answer is to move to a dedicated server, probably running something other than MySQL, but I cannot make that move just yet and any optimization will help this scenario.)
If the queries that are slow are the historical summary queries, then you might want to consider a Data Warehouse. As long as your history data is relatively static, there isn't usually much risk to pre-calculating transactional summary data.
Data warehousing and designing schemas for Business Intelligence (BI) reporting is a very broad topic. You should read up on it and ask any specific BI design questions you may have.
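As a sketch of that kind of pre-calculation (table and column names are guesses based on the description, not the actual schema), a daily summary table could be rebuilt once a day from the join of the two tables, so the summary pages read a handful of pre-aggregated rows instead of re-joining millions:

-- hypothetical summary table and daily load query
CREATE TABLE daily_summary (
  site_id       INT NOT NULL,
  instrument_id INT NOT NULL,
  meter_id      INT NOT NULL,
  summary_date  DATE NOT NULL,
  total_value   DECIMAL(18,4) NOT NULL,
  reading_count INT NOT NULL,
  PRIMARY KEY (summary_date, site_id, instrument_id, meter_id)
);

INSERT INTO daily_summary
SELECT u.site_id, u.instrument_id, d.meter_id, DATE(u.recorded_at),
       SUM(d.value), COUNT(*)
FROM updates u
JOIN update_details d ON d.update_id = u.update_id
WHERE u.recorded_at >= CURDATE() - INTERVAL 1 DAY
  AND u.recorded_at <  CURDATE()
GROUP BY u.site_id, u.instrument_id, d.meter_id, DATE(u.recorded_at);

Weekly and monthly summaries can then be rolled up from daily_summary rather than from the raw tables.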

Database (MySQL) structuring: pros and cons of multiple tables

I am collecting data and storing it in MySQL, for:
75 variables
55 countries
Each year
At this stage, since I am building this tool, I have created a single table of variables / countries (storing 1 year's worth of data).
Next year (and for several years after that) a new set of data will be input for each country.
There are therefore 3 variables controlling the data returned to a user reviewing all collected data. The general form of any query would be:
Show me these specifics variables, for these specific countries, for these specific years.
(Show me average age and weight, for USA and Canada, for 2012 and 2009, for example)
My question is, it seems that I have two options for arranging this data:
- Multiple tables, where I create a table of country / variable for each year data is collected
- Single table, where I simply add a column (field) for the year the data relates to
As far as I can tell I could make these database calls with either structure, but is one more powerful / efficient / quicker, and why?
Thanks for your consideration.
It's a PDO / PHP interface, if that is relevant.
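For illustration, under the single-table option the example request above is one straightforward query (table and column names are assumed), whereas with a table per year the same request would need a UNION across the per-year tables:

SELECT country, year, average_age, average_weight
FROM country_data                     -- hypothetical single table with a year column
WHERE country IN ('USA', 'Canada')
  AND year IN (2009, 2012);

An index on (country, year) keeps this fast, and adding a new year is just more rows rather than a new table.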
Using a relational approach generally involves more tables. This translates into queries being a bit slower (though probably not noticeably so in small databases) and the database size being smaller. It makes it simpler to update information properly and thus ensure data integrity: for example, if Joe's address changes, you know it will be changed on all reports using Joe's address.
Using fewer linked tables, where one field can be repeated multiple times, you risk having disparities between data in different tables where you would naturally expect it to be equal. Access speed should be a bit faster if you arrange your tables properly, because your information will be grouped according to how you access it.
For example, in the first method you would have an Orders table with separate Supplier and Client tables to make a complete invoice, whereas in the second method you would put some information from both Supplier and Client in the Orders table, such that finding the row corresponding to the invoice number you are looking for would return the entire set of data you need (thus eliminating the need for joins on Supplier and Client and reducing load on the database server).
Edit: I think a better answer would require a bit more information about your data (samples for example).