Optimal way of storing performance data for statistics (graphs) - mysql

Currently I'm working on a dashboard in PHP/MySQL which contains several statistics/facts such as: amount of items sold, revenue, gender (male/female) ratio of users etc. (all filterable on last week/month/year). The amount of data is (currently) not that much: 20,000 user rows, 1,000 items, 500 items sold per day, but it is expected to grow in the future, perhaps even exponentially.
Now, there is a wish to have several graphs displaying the performance to see whether strategy changes have impacts on the amount of users, revenue, gender ratio etc. For this, it is necessary to have numbers per day. Currently, the dashboard can only display "NOW() - 1 week/1 month/1 year" but for showing a graph outlining the growth, these numbers should be saved on a daily basis.
My question is: what are the options in this case? A cron job could be set up to save these numbers and write them to a separate 'performance' or 'history' table that stores the visitors, sales, gender ratio etc. in rows linked to the date of that day. This is good for performance, but certain data gets lost. Another option is to compute these numbers with complex queries (group by day etc.), but that seems too intensive since the queries are performed on the production database, especially since the database structure is a little complex. To avoid overloading the production database, is setting up a data warehouse with ETL processes a better option? In that case the data would not be displayed live.
I honestly have no idea what is the best option in this case. I'm very curious about the answers! Many thanks.

Running queries on a production database (especially one which is growing in volume and complexity) becomes a losing proposition very quickly. There are a lot of possible alternatives; basically the entire field of Business Intelligence has grown up as a solution to this problem.
For a small system where you just want to avoid querying the production database, developing a full-blown data warehouse is probably overkill. It is impossible to give a reasonable answer without knowing more, but I would go for one of the following (in growing order of complexity and quality of result):
1. Instead of directly showing the result of the query, save it in a table and query that table.
2. Clone your production database, then query the clone.
3. Extract the relevant data from the production database into a structure which stores it and preserves history (google Data Vault).
4. Directly over the production DB, or over solution 2 or 3, build a dimensional model (google Kimball Dimensional Model). Note that to do a good job you have to consider what kind of queries you want to run. You could end up with different designs for different requirements.
It is also relevant which technology you are using and which options are available in your architecture. Depending on what you have on hand, some solutions, even complex ones, could be very much simplified. Do some research.
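To make option 1 (save the result in a table and query the table) a bit more concrete, here is a minimal sketch. It uses Python's built-in sqlite3 with an in-memory database purely as a runnable stand-in for MySQL; the `sales` and `daily_stats` tables, their columns and the `snapshot_day` function are all made up for illustration. In MySQL the upsert would more likely be `INSERT ... ON DUPLICATE KEY UPDATE` or `REPLACE INTO`, and the function would be called once a day from cron.

```python
import sqlite3


def snapshot_day(conn, day):
    """Compute one day's figures once and store them in daily_stats.

    INSERT OR REPLACE keeps the job safe to re-run for the same day.
    """
    conn.execute(
        """
        INSERT OR REPLACE INTO daily_stats (day, items_sold, revenue)
        SELECT ?, COUNT(*), COALESCE(SUM(price), 0)
        FROM sales
        WHERE sold_on = ?
        """,
        (day, day),
    )
    conn.commit()


# Hypothetical schema and sample data, not taken from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, sold_on TEXT, price REAL);
    CREATE TABLE daily_stats (
        day TEXT PRIMARY KEY, items_sold INTEGER, revenue REAL
    );
    INSERT INTO sales (sold_on, price) VALUES
        ('2024-01-01', 10.0), ('2024-01-01', 5.0), ('2024-01-02', 7.5);
""")

for day in ("2024-01-01", "2024-01-02"):
    snapshot_day(conn, day)

# The dashboard graph reads the small summary table, not the sales table.
rows = conn.execute(
    "SELECT day, items_sold, revenue FROM daily_stats ORDER BY day"
).fetchall()
print(rows)
```

The trade-off is the one the question already names: the graph only knows what the snapshot stored, so any figure you did not save per day cannot be reconstructed later from this table alone.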

Related

Database (OLTP) and Reporting

I am working on a trading platform that has reporting as a big portion of its business.
The set up is the following:
SQL OLTP database (about 200 tables), rather small in number of records (the biggest table has 20,000 records, but it keeps growing every week).
For reporting services, SQL views are used to query the live transaction database. Imagine the result set of each view as a de-normalized one, in the spirit of a data warehouse approach. These data sets are then passed to a third-party reporting platform (like Tableau, Power BI or SiSense), which takes them and loads them into cubes (probably some columnar structure, like MongoDB, Hadoop, etc.). From there the reports are generated.
Current challenges.
The SQL views (about 8) are huge and very hard to maintain. To give you an example, one of the views outputs 100 fields, and each of these is a calculated field with complicated CASE statements, nested IF statements, inline functions and whatnot, which makes this view as big as 700 lines of SQL code. I inherited these from another employee and now, sadly, I have to maintain them.
Because the data grows weekly by several hundred records (through migration and transactions) and the number of fields in the views also grows (a few every week), the cube build takes longer and longer. To give you an example, a few months ago we set up the cube rebuild to run every 10 minutes to refresh the data (the build was taking 5 minutes). It currently takes 12-15 minutes to build, so we run it every 30 minutes. As you can imagine, this will get worse as the data and the number of fields keep growing, and we kind of need the data to be as current as possible.
The only good thing is that once the cube is built, the reports load fast because they are pulled from the third-party platform, so no concerns there.
What I have in mind
I would like to get rid of the views so I can ease maintenance and also keep the duration of the cube rebuild to a minimum.
Options:
Build a data warehouse, and then build SSIS packages to populate this structure with the live transactional data. The de-normalized structure would probably look very similar to the views mentioned above. The drawback here is that I don't really feel like I would simplify much; I would actually be adding one more layer, which is the data migration from the OLTP to the OLAP (data warehouse). And I would still have to rebuild the cube.
Turn the current views into SQL indexed views (materialized views). In their current state I simply cannot do it, because of the aggregate and inline functions used throughout each view.
Another option I read about is to build an ODS (Operational Data Store), which would be a database containing the necessary tables, similar to the SQL views I have now, and refreshed constantly, maybe using triggers or transaction logs. But I am not sure what is involved in building such a thing and how hard it is to maintain.
Question:
What approach should I take?
Do any of the 3 above make any sense?
Of course, I am interested in other ideas or suggestions, as well.
Thank you!
From my experience your best approach will be option 1. It is costly, but will give you the most benefit. Create a ROLAP DWH (I recommend Kimball's "The Data Warehouse Toolkit" for best practices and design patterns) and, if you have the opportunity, use a columnar data store (like Amazon Redshift or SAP Sybase IQ). All the CASE statements, nested IFs and other operations you mentioned would then be applied at ETL time, so in the ROLAP everything is precalculated and optimized for consumption. And don't forget about applying indexes (depending on the underlying technology you use). Some database vendors have published "indexing best practices" for ROLAP that tell you which type of index to apply depending on, for example, the type of table (dimension) and the data type.
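As a rough illustration of "apply the CASE logic at ETL time", here is a small sketch. It is not SSIS and not the asker's schema: it uses Python's sqlite3 so it can be run as-is, the `trades` and `fact_trades` tables and the two derived columns are invented, and it assumes source rows are append-only with increasing ids (updates to existing trades are not handled).

```python
import sqlite3

# Hypothetical OLTP source table and pre-calculated fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (
        id INTEGER PRIMARY KEY, side TEXT, qty INTEGER, price REAL
    );
    CREATE TABLE fact_trades (
        trade_id INTEGER PRIMARY KEY, signed_value REAL, size_band TEXT
    );
    INSERT INTO trades VALUES (1, 'BUY', 10, 2.0), (2, 'SELL', 500, 1.0);
""")


def load_new_trades(conn):
    """Incremental load: derive the calculated fields once, for new rows only.

    Assumes trades are append-only; rows already loaded are never revisited.
    """
    cur = conn.execute("""
        INSERT INTO fact_trades (trade_id, signed_value, size_band)
        SELECT id,
               CASE side WHEN 'SELL' THEN -qty * price ELSE qty * price END,
               CASE WHEN qty >= 100 THEN 'large' ELSE 'small' END
        FROM trades
        WHERE id > (SELECT COALESCE(MAX(trade_id), 0) FROM fact_trades)
    """)
    conn.commit()
    return cur.rowcount


first_load = load_new_trades(conn)   # loads both existing trades
conn.execute("INSERT INTO trades VALUES (3, 'BUY', 100, 1.5)")
second_load = load_new_trades(conn)  # only touches the new trade

facts = conn.execute(
    "SELECT trade_id, signed_value, size_band FROM fact_trades ORDER BY trade_id"
).fetchall()
print(first_load, second_load, facts)
```

The point of the sketch is that the expensive expressions run once per row when it is loaded, so the cube (or any report) reads plain stored columns and the work per refresh is proportional to the new rows rather than to the whole table.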

Recurring data demand - automated query, or store data directly in SQL?

This is a simple question even though the title sounds complicated.
Let's say I'm storing data from a bunch of applications into one central database/ data warehouse. This is data at a pretty fine level -- say, daily summaries of various metrics.
HOWEVER, I know in the front-end I will be frequently displaying weekly and monthly aggregates of this data as well.
One idea would be to have a scripting language do this for me after querying the SQL database, but that seems horribly inefficient.
The second idea would be to have views in the database that represent business weeks and months -- this might be the best way to do it.
But my final idea is -- couldn't a SQL client simply run a query that aggregates all the daily data into weeks (or months) and store them in a separate table? The advantage of this is that it would reduce querying time of any user, since all the query work is done before a website or button is even loaded/ pushed. Even with a view, I guess that aggregation calculation would have to be done as soon as the view was queried.
The only downside to having the queries aggregated from the weeks/ months perhaps even once a day (instead of every time the website is loaded) -- is that it won't be up-to-date/ may reflect inconsistencies.
I'm not really an expert when it comes to this bigger picture stuff -- anyone have any thoughts? thanks
It depends on the user experience you're trying to create.
Is the user base expecting to watch monthly aggregates with one finger on the F5 key when watching this month's statistics? To cover this scenario, you might want a view with criteria that present a window always relative to getdate(). Keep in mind that good indexing strategies and query design should reduce the impact of this sort of approach to nearly nothing.
Is the user expecting informational data that doesn't include today's data? More performance might be seen out of a nightly job that does the aggregation into a new table.
Of all the scenarios, though, I would not recommend manual aggregation. Down that road are unexpected bugs and exceptions that could instead be handled with a good SQL statement. Aggregates are a big part of every DBMS; let its software handle that and work on the rest of your application.
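For the nightly-job scenario, a minimal sketch of what "let the DBMS do the aggregation" could look like is below. It runs against an in-memory SQLite database via Python's sqlite3 so it is self-contained; the `daily_metrics` and `weekly_metrics` tables and the `signups` column are hypothetical, and weeks are assumed to start on Monday. The date arithmetic uses SQLite's `date()` modifiers; other engines have their own date functions for this.

```python
import sqlite3

# Hypothetical daily summary table plus the pre-aggregated weekly table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_metrics (day TEXT PRIMARY KEY, signups INTEGER NOT NULL);
    CREATE TABLE weekly_metrics (
        week_start TEXT PRIMARY KEY, signups INTEGER NOT NULL
    );
    INSERT INTO daily_metrics VALUES
        ('2024-01-01', 10), ('2024-01-03', 20), ('2024-01-08', 5);
""")


def rebuild_weekly(conn):
    """Nightly job: one GROUP BY in the database, no looping in the script.

    'weekday 0' moves forward to Sunday (or stays on it), and '-6 days'
    steps back to that week's Monday.
    """
    with conn:  # commit both statements together, or neither
        conn.execute("DELETE FROM weekly_metrics")
        conn.execute("""
            INSERT INTO weekly_metrics (week_start, signups)
            SELECT date(day, 'weekday 0', '-6 days') AS week_start,
                   SUM(signups)
            FROM daily_metrics
            GROUP BY week_start
        """)


rebuild_weekly(conn)
weekly = conn.execute(
    "SELECT week_start, signups FROM weekly_metrics ORDER BY week_start"
).fetchall()
print(weekly)
```

The same SELECT could just as well be the body of a view for the "finger on F5" scenario; the only difference is whether the grouping runs at read time or once per night.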

Creating a MySQL Database Schema for large data set

I'm struggling to find the best way to build out a structure that will work for my project. The answer may be simple but I'm struggling due to the massive number of columns or tables, depending on how it's set up.
We have several tools, each that can be run for many customers. Each tool has a series of questions that populate a database of answers. After the tool is run, we populate another series of data that is the output of the tool. We have roughly 10 tools, all populating a spreadsheet of 1500 data points. Here's where I struggle... each tool can be run multiple times, and many tools share the same data point. My next project is to build an application that can begin data entry for a tool, but allow import of data that shares the same datapoint for a tool that has already been run.
A simple example:
Tool 1 - company, numberofusers, numberoflocations, cost
Tool 2 - company, numberofusers, totalstorage, employeepayrate
So if the same company completed tool 1, I need to be able to populate "numberofusers" (or offer to populate) when they complete tool 2 since it already exists.
I think what it boils down to is, would it be better to create a structure that has 1500 tables, 1 for each data element with additional data around each data element, or to create a single massive table - something like...
customerID(FK), EventID(fk), ToolID(fk), numberofusers, numberoflocations, cost, total storage, employee pay,.....(1500)
If I go this route and have one large table I'm not sure how that will impact performance. Likewise - how difficult it will be to maintain 1500 tables.
Another dimension is that it would be nice to have a description of each field:
numberofusers,title,description,active(bool). I assume this is only possible if each element is in its own table?
Thoughts? Suggestions? Sorry for the lengthy question, new here.
Build a main table with all the common data: company, # users, .. other stuff. Give each row a unique id.
Build a table for each unique tool with the company id from above and any data unique to that implementation. Give each table a primary (unique) key for 'tool use' and 'company'.
This covers the common data in one place, identifies each 'customer' and provides for multiple uses of a given tool for each customer. Every use and customer is trackable and distinct.
More about normalization here.
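A small sketch of that layout, using the two example tools from the question, might look like the following. It is written with Python's sqlite3 so it can be run directly; the table names, the `run_id` key and the `prefill_shared_fields` helper are assumptions made for illustration, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Common data lives in one place.
    CREATE TABLE companies (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE,
        numberofusers INTEGER
    );
    -- One table per tool, holding only what is unique to that tool.
    -- run_id allows the same company to run a tool many times.
    CREATE TABLE tool1_runs (
        run_id INTEGER PRIMARY KEY,
        company_id INTEGER NOT NULL REFERENCES companies(id),
        numberoflocations INTEGER,
        cost REAL
    );
    CREATE TABLE tool2_runs (
        run_id INTEGER PRIMARY KEY,
        company_id INTEGER NOT NULL REFERENCES companies(id),
        totalstorage INTEGER,
        employeepayrate REAL
    );
""")


def prefill_shared_fields(conn, company_name):
    """Return the shared data points already known for a company, if any."""
    row = conn.execute(
        "SELECT id, numberofusers FROM companies WHERE name = ?",
        (company_name,),
    ).fetchone()
    if row is None:
        return None
    return {"company_id": row[0], "numberofusers": row[1]}


# The company completes tool 1 first...
conn.execute("INSERT INTO companies (name, numberofusers) VALUES ('Acme', 50)")
conn.execute(
    "INSERT INTO tool1_runs (company_id, numberoflocations, cost) "
    "VALUES (1, 3, 1200.0)"
)

# ...so when it starts tool 2, numberofusers can be offered as a default.
shared = prefill_shared_fields(conn, "Acme")
conn.execute(
    "INSERT INTO tool2_runs (company_id, totalstorage, employeepayrate) "
    "VALUES (?, 2000, 35.0)",
    (shared["company_id"],),
)
print(shared)
```

If a shared value can legitimately differ between runs, it would need to move from `companies` into the run tables (or a history table), which this sketch does not attempt.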
I agree with etherbubunny on normalization, but with larger datasets there are performance considerations that quickly become important. Joins, which are often required in normalized databases to display human-readable information, can be performance killers on even medium-sized tables, which is why a lot of data warehouse models use de-normalized datasets for reporting. This essentially means pre-building the joined reporting data into new tables, with heavy use of indexing, archiving and partitioning.
In many cases smart use of partitioning on its own can also effectively help reduce the size of the datasets being queried. This usually takes quite a bit of maintenance, though, unless certain parameters remain fixed.
Ultimately in your case (and most others) I highly recommend building it in a way that you are able to maintain and understand, and then performing regular performance checks via slow query logs, EXPLAIN, and performance monitoring tools like Percona's tool set. This will give you insight into what is really happening and give you some data to come back here or to the MySQL forums with. We can always speculate here, but ultimately the real data and your setup will be the driving force behind what is right for you.

Efficient MySQL Database Design for Multi-User Web Apps

I am developing a site that will allow for users to track sales numbers for personal crafts. The way it will work is that a user will be able to submit/edit weekly sales data, and then once the data is stored, be able to view it in various forms of table or graph, track trends, etc.
My concern is that as the userbase grows, if it grows, I want a database design that will scale with it, and be manageable. I am self taught when it comes to proper web apps like this, and while I have all the PHP and JS knowledge I need to assemble the site, and I've worked with jQuery before, this one I am less sure about.
Am I better off storing users' weekly reports in one big table, or creating a separate database, in which each user has their own table, in turn containing that user's weekly reports? There's going to be far more pulling this data for charts done than there is altering or adding to it, so my goal is primarily efficiency and simplicity of storing/recalling the data.
The thing that has me most stumped is the best way to handle the fact that different users will have different amounts of products, and those amounts will change. In a user's first week, perhaps they log sales for 2 items, but come the third week, they add a new item to the list of things they are selling. The database needs to allow for this kind of thing with low overhead, as most users will have more than 1 product.
How would you structure this database?
I would suggest one large table with the InnoDB engine, for row-level locking instead of table locking. Then create an index on the username and entry time.
I would suggest that a table per user is a bit much: you would be wasting hard-disk and database space allocated for tables that a user may not need. MySQL has no problem supporting 5+ million rows, if your table even gets that big.
Simplicity is best.
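Here is one possible shape for that single table, sketched with Python's sqlite3 so it runs as-is (SQLite has no storage engines, so the InnoDB part is not shown). The row-per-user-per-product-per-week layout, the column names and the sample data are assumptions; they are one way to handle users adding products over time without schema changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per user, product and week: a new product is just a new row.
    CREATE TABLE weekly_sales (
        user_id INTEGER NOT NULL,
        product TEXT NOT NULL,
        week_start TEXT NOT NULL,
        units INTEGER NOT NULL,
        PRIMARY KEY (user_id, product, week_start)
    );
    -- Index on user and entry time, as suggested above.
    CREATE INDEX idx_user_week ON weekly_sales (user_id, week_start);

    INSERT INTO weekly_sales VALUES
        (1, 'scarf', '2024-01-01', 4),
        (1, 'mug',   '2024-01-01', 2),
        (1, 'scarf', '2024-01-15', 3),
        (1, 'mug',   '2024-01-15', 1),
        (1, 'print', '2024-01-15', 6),  -- product added in the third week
        (2, 'hat',   '2024-01-01', 9);
""")


def weekly_totals(conn, user_id):
    """Units sold per week for one user, ready to feed a chart."""
    return conn.execute(
        """
        SELECT week_start, SUM(units)
        FROM weekly_sales
        WHERE user_id = ?
        GROUP BY week_start
        ORDER BY week_start
        """,
        (user_id,),
    ).fetchall()


totals = weekly_totals(conn, 1)
print(totals)
```

Every chart query filters on `user_id` first, so the index keeps each user's reads confined to that user's rows no matter how many other users share the table.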

Database architecture for millions of new rows per day

I need to implement a custom-developed web analytics service for large number of websites. The key entities here are:
Website
Visitor
Each unique visitor will have a single row in the database with information like landing page, time of day, OS, browser, referrer, IP, etc.
I will need to do aggregated queries on this database such as 'COUNT all visitors who have Windows as OS and came from Bing.com'
I have hundreds of websites to track and the number of visitors for those websites range from a few hundred a day to few million a day. In total, I expect this database to grow by about a million rows per day.
My questions are:
1) Is MySQL a good database for this purpose?
2) What could be a good architecture? I am thinking of creating a new table for each website. Or perhaps start with a single table and then spawn a new table (daily) if the number of rows in an existing table exceeds 1 million (is my assumption correct?). My only worry is that if a table grows too big, the SQL queries can get dramatically slow. So, what is the maximum number of rows I should store per table? Moreover, is there a limit on the number of tables that MySQL can handle?
3) Is it advisable to do aggregate queries over millions of rows? I'm ready to wait for a couple of seconds to get results for such queries. Is it a good practice or is there any other way to do aggregate queries?
In a nutshell, I am trying to design a large-scale, data-warehouse kind of setup which will be write heavy. If you know about any published case studies or reports, that'll be great!
If you're talking larger volumes of data, then look at MySQL partitioning. For these tables, a partition by date/time would certainly help performance. There's a decent article about partitioning here.
Look at creating two separate databases: one for all raw data for the writes with minimal indexing; a second for reporting using the aggregated values; with either a batch process to update the reporting database from the raw data database, or use replication to do that for you.
EDIT
If you want to be really clever with your aggregation reports, create a set of aggregation tables ("today", "week to date", "month to date", "by year"). Aggregate from raw data to "today" either daily or in "real time"; aggregate from "by day" to "week to date" on a nightly basis; from "week to date" to "month to date" on a weekly basis, etc. When executing queries, join (UNION) the appropriate tables for the date ranges you're interested in.
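A stripped-down sketch of the idea, with only one aggregation level, is shown below. It uses Python's sqlite3 against an in-memory database so it is runnable; the `visits` and `agg_by_day` tables, their columns and both function names are invented for the example. Closed days are read from the aggregate table, and only the current day is counted from raw rows, with the two parts joined by UNION ALL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw, write-heavy table (hypothetical columns).
    CREATE TABLE visits (
        site_id INTEGER, visited_on TEXT, os TEXT, referrer TEXT
    );
    -- Reporting table: one row per site, day, OS and referrer.
    CREATE TABLE agg_by_day (
        site_id INTEGER, day TEXT, os TEXT, referrer TEXT, visitors INTEGER,
        PRIMARY KEY (site_id, day, os, referrer)
    );
    INSERT INTO visits VALUES
        (1, '2024-01-01', 'Windows', 'bing.com'),
        (1, '2024-01-01', 'Windows', 'bing.com'),
        (1, '2024-01-01', 'Linux',   'google.com'),
        (1, '2024-01-02', 'Windows', 'bing.com');
""")


def roll_up_day(conn, day):
    """Batch step: aggregate one finished day from raw rows."""
    conn.execute(
        """
        INSERT OR REPLACE INTO agg_by_day
            (site_id, day, os, referrer, visitors)
        SELECT site_id, visited_on, os, referrer, COUNT(*)
        FROM visits
        WHERE visited_on = ?
        GROUP BY site_id, visited_on, os, referrer
        """,
        (day,),
    )
    conn.commit()


def count_visitors(conn, os_name, referrer, today):
    """Closed days come from the aggregate; today is counted from raw data."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(n), 0) FROM (
            SELECT visitors AS n FROM agg_by_day
            WHERE os = ? AND referrer = ? AND day < ?
            UNION ALL
            SELECT COUNT(*) AS n FROM visits
            WHERE os = ? AND referrer = ? AND visited_on = ?
        )
        """,
        (os_name, referrer, today, os_name, referrer, today),
    ).fetchone()
    return row[0]


roll_up_day(conn, "2024-01-01")
total = count_visitors(conn, "Windows", "bing.com", today="2024-01-02")
print(total)
```

Week-to-date and month-to-date tables would follow the same pattern, each rolled up from the level below it, with the query picking the coarsest table that fully covers each part of the requested date range.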
EDIT #2
Rather than one table per client, we work with one database schema per client. Depending on the size of the client, we might have several schemas in a single database instance, or a dedicated database instance per client. We use separate schemas for raw data collection, and for aggregation/reporting for each client. We run multiple database servers, restricting each server to a single database instance. For resilience, databases are replicated across multiple servers and load balanced for improved performance.
Some suggestions in a database agnostic fashion.
The simplest rationale is to distinguish between read-intensive and write-intensive tables. It is probably a good idea to create two parallel schemas: a daily/weekly schema and a history schema. The partitioning can be done accordingly. One can think of a batch job to update the history schema with data from the daily/weekly schema. In the history schema, again, you can create separate data tables per website (based on the data volume).
If all you are interested in is the aggregation stats alone (which may not be true), it is a good idea to have summary tables (monthly, daily) in which summaries such as total unique visitors, repeat visitors etc. are stored; these summary tables are updated at the end of the day. This enables on-the-fly computation of stats without waiting for the history database to be updated.
You should definitely consider splitting the data by site across databases or schemas. This not only makes it much easier to back up, drop etc. an individual site/client, but also eliminates much of the hassle of making sure no customer can see any other customer's data by accident or through poor coding. It also means it is easier to make choices about partitioning, over and above database table-level partitioning by time or client etc.
Also, you said that the data volume is 1 million rows per day (that's not particularly heavy and doesn't require huge grunt power to log/store, nor indeed to report, though if you were generating 500 reports at midnight you might logjam). However, you also said that some sites had 1m visitors daily, so perhaps your figure is too conservative?
Lastly you didn't say if you want real-time reporting a la chartbeat/opentracker etc or cyclical refresh like google analytics - this will have a major bearing on what your storage model is from day one.
M
You really should test your way forward with simulated environments as close as possible to the live environment, with "real fake" data (correct format and length). Benchmark queries and variants of table structures. Since you seem to know MySQL, start there. It shouldn't take you that long to set up a few scripts bombarding your database with queries. Studying the results of your database with your kind of data will help you realise where the bottlenecks will occur.
Not a solution but hopefully some help on the way, good luck :)
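As a starting point for that kind of experiment, here is a tiny benchmark sketch. It is only a stand-in: it uses Python's sqlite3 with an in-memory database rather than MySQL, 50,000 generated rows rather than realistic volumes, and an invented `visitors` table, so the absolute timings mean nothing for a real setup. What carries over is the method: generate fake data in the right shape, time the query, change one thing (here, an index), and time it again.

```python
import random
import sqlite3
import time

random.seed(0)  # reproducible fake data

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visitors (id INTEGER PRIMARY KEY, os TEXT, referrer TEXT)"
)
oses = ["Windows", "macOS", "Linux"]
refs = ["bing.com", "google.com", "direct"]
conn.executemany(
    "INSERT INTO visitors (os, referrer) VALUES (?, ?)",
    ((random.choice(oses), random.choice(refs)) for _ in range(50_000)),
)
conn.commit()

QUERY = (
    "SELECT COUNT(*) FROM visitors "
    "WHERE os = 'Windows' AND referrer = 'bing.com'"
)


def timed(conn, sql, repeats=20):
    """Run a query several times; return its result and mean seconds per run."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = conn.execute(sql).fetchone()[0]
    return result, (time.perf_counter() - start) / repeats


count_before, secs_before = timed(conn, QUERY)

# Variant under test: add a composite index, then repeat the measurement.
conn.execute("CREATE INDEX idx_os_ref ON visitors (os, referrer)")
count_after, secs_after = timed(conn, QUERY)

# The query plan shows whether the index is actually being used.
plan = " ".join(str(row[3]) for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY))
print(f"no index: {secs_before:.6f}s  indexed: {secs_after:.6f}s")
print(plan)
```

Against MySQL the same loop would go through a client library and `EXPLAIN` instead, with row counts and value distributions chosen to resemble the production data.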