Most efficient way to generate reports in MySQL on massive datasets

I need to build a reporting interface to an application I'm working on which requires administrators to visualise huge quantities of collected data over time.
Think something similar to Google Analytics etc.
Most of the data that needs to be visualised sits in a basic table which contains a datetime, an 'action' varchar and other filterable data. The table currently holds 1.5M rows and is growing every day.
At the moment I'm doing a simple select with the filters applied, grouped by day, and it's running pretty well, but I was wondering if there's a smarter, more efficient way to extract this data.
Cheers

1) Two tiers -- raw data, and summarized data. For the raw data, indexes will likely be of no help: you are doing aggregations, and in most cases that necessitates a full table scan. If it doesn't, reorganize so it does; it'll be faster.
2) Figure out your aggregates, generate them automatically, and run the reports off the aggregate data. Do index these summary tables! (See the sketch after this list.)
3) Avoid joins on the raw data. Aggregate first, materialize the results of the group-bys, then join the aggregated results.
4) Partition. Keep data for one day (or whatever granularity makes sense) separate from data for another day. Write automated table-creation scripts if necessary (grown-up -- or feature-heavy, depending on your point of view -- databases give you something called "partitioning" to do this in a saner way).
5) Read up on "data warehousing":
http://en.wikipedia.org/wiki/Data_warehouse
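A minimal sketch of points 2 and 3, assuming a hypothetical raw table events(action, created_at) and a nightly job; the names and the daily granularity are illustrative only:

-- summary table, indexed for the reporting filters
CREATE TABLE events_daily (
  day    DATE         NOT NULL,
  action VARCHAR(50)  NOT NULL,
  hits   INT UNSIGNED NOT NULL,
  PRIMARY KEY (day, action),
  KEY idx_action (action)
);

-- nightly aggregation of yesterday's raw rows
INSERT INTO events_daily (day, action, hits)
SELECT DATE(created_at), action, COUNT(*)
FROM events
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY DATE(created_at), action
ON DUPLICATE KEY UPDATE hits = VALUES(hits);

-- reports then run against the small, indexed summary table
SELECT day, hits FROM events_daily WHERE action = 'login' ORDER BY day;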

You can start out by doing a couple of things:
Make sure you add indexes on all the filter columns so queries won't do full table scans.
Check the query plan (EXPLAIN) to make sure there are no places that need optimization; see the sketch below.
Since you have a datetime stamp in your table, partitioning will definitely help you in the future.
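For example, assuming a hypothetical stats table with an 'action' filter column and a datetime column (names are illustrative):

-- composite index on the common filter plus the datetime used for grouping
ALTER TABLE stats ADD INDEX idx_action_created (action, created_at);

-- inspect the plan; the index should show up in the "key" column instead of a full scan
EXPLAIN
SELECT DATE(created_at) AS day, COUNT(*) AS hits
FROM stats
WHERE action = 'login'
  AND created_at >= '2012-01-01'
GROUP BY day;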
Good luck.

You can expect a number of common queries, probably a small number compared to the number of unique combinations of filters that could be generated. You can use this to "compress" the data into companion tables, and run this collection process at night.

Related

How to handle 20M+ records from tables with same structure in MySQL

I have to handle 25M rows of data that I have collected and transformed from about 50 different sources. Every source contributes about 500,000 to 600,000 rows. Each record has the same structure, regardless of the source (let's say: id, title, author, release_date).
For flexibility, I would prefer to create a dedicated table for each source, so I can clear/drop data from a source and reload/upload data very quickly (using LOAD DATA INFILE). This way it seems very easy to truncate a table with no risk of deleting rows from other sources.
But then I don't know how to select records having the same author across the different tables, and, cherry on the cake, with pagination (the LIMIT keyword).
Is the only solution to store everything in a single huge table and deal with the pain of indexing/backing up a 25M+ row database, or is there a kind of abstraction layer to virtually merge 50 tables into a single virtual one?
It is probably a usual question for a DBA, but I could not find any answer yet...
Any help/idea much appreciated. Thx
This might be a good spot for MySQL partitioning.
This lets you handle a big volume of data, while giving you the opportunity to run DML operations on a specific partition when needed (such as truncate, or even drop) very efficiently, and without impacting the rest of your data. Partition selection is also supported in LOAD DATA statements.
You can run queries across partitions as you would with a normal table, or target a specific partition when you need to (which can be done very efficiently).
In your specific use case, list partitioning seems like a relevant choice: you have a pre-defined list of sources, so you would typically have one partition per source.
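A minimal sketch of that layout, with hypothetical source ids and column names:

CREATE TABLE records (
  id           BIGINT UNSIGNED NOT NULL,
  source_id    INT UNSIGNED    NOT NULL,
  title        VARCHAR(255),
  author       VARCHAR(255),
  release_date DATE,
  PRIMARY KEY (id, source_id),   -- the partitioning column must be part of every unique key
  KEY idx_author (author)
)
PARTITION BY LIST (source_id) (
  PARTITION p_source_1 VALUES IN (1),
  PARTITION p_source_2 VALUES IN (2)
  -- ... one partition per source, up to p_source_50
);

-- reload a single source without touching the other 49
ALTER TABLE records TRUNCATE PARTITION p_source_1;
LOAD DATA INFILE '/tmp/source_1.csv'
  INTO TABLE records PARTITION (p_source_1)
  FIELDS TERMINATED BY ',';

-- cross-source queries with pagination work as on any single table
SELECT * FROM records WHERE author = 'Asimov' ORDER BY release_date LIMIT 20 OFFSET 0;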

Running SUM across a large amount of rows

I have a theoretical question which pertains to the SQL function SUM().
Imagine we have a table which contains a column called "value"
"value" is a DECIMAL number either positive or negative.
In our potential solution, we'd like to run a SUM() across all rows for column "value"
SELECT SUM(value)
FROM table
No problems so far, but the dataset is potentially millions of rows. Possibly even hundreds of millions of rows as the data will be retained over years.
So my questions are:
Can you run SUM() across hundreds of millions of rows?
What kind of performance could I expect on a query across that many rows? We haven't settled on a platform yet, but we're looking at MySQL or SQL Server.
You can take a look at the columnstore indexes in SQL Server. In short, you are able to create a columnstore index on your tables, which is different from the traditional rowstore index.
These indexes are specially designed for optimizing aggregate queries when huge amounts of data are involved (for example, in data warehouse star and snowflake schemas).
From the docs:
Columnstore indexes can achieve up to 100x better performance on analytics and data warehousing workloads and up to 10x better data compression than traditional rowstore indexes.
because:
Data compression - you gain many benefits here; for example, columnstore indexes read compressed data from disk, which means fewer bytes of data need to be read into memory;
Column elimination - columnstore indexes skip reading columns that are not required for the query result, which further reduces I/O for query execution and therefore improves query performance (unlike rowstore indexes);
Rowgroup elimination - table scans are optimized using metadata to eliminate specific rowgroups based on your filtering criteria;
Batch mode execution - prior to SQL Server 2019, only queries involving such indexes can benefit from batch mode processing, which reduces your execution time further (check this video to see how great this mode is).
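A hedged sketch of what that looks like in SQL Server, with hypothetical table and column names:

-- nonclustered columnstore index on the columns the aggregate queries touch
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_ledger_cs
    ON dbo.ledger (value, created_at);

-- the SUM then scans compressed column segments instead of whole rows
SELECT SUM(value) FROM dbo.ledger;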
You may certainly run SUM() across an entire table, and the performance would depend roughly on how many records that table has. Note that things like indices would not really help performance in this case, because SQL Server has to touch every record to compute the sum.
If running SUM over the entire table in production does not scale well, one option to consider would be to maintain the sum in a separate table. Then, when a record gets inserted or deleted, you may use a trigger to update the running total appropriately. This way, accessing the sum would be roughly constant time, though you would have some additional overhead because of the trigger logic.
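A minimal MySQL sketch of that trigger approach, with hypothetical table names:

-- single-row table holding the running total
CREATE TABLE ledger_total (total DECIMAL(20,2) NOT NULL);
INSERT INTO ledger_total VALUES (0);

-- keep the total in sync as rows come and go
CREATE TRIGGER trg_ledger_ins AFTER INSERT ON ledger
FOR EACH ROW UPDATE ledger_total SET total = total + NEW.value;

CREATE TRIGGER trg_ledger_del AFTER DELETE ON ledger
FOR EACH ROW UPDATE ledger_total SET total = total - OLD.value;

-- roughly constant-time lookup instead of a full scan
SELECT total FROM ledger_total;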
I'll throw out a couple of ideas. If the data sets that you are working with are absolutely massive, consider running an overnight job to create a view, or some kind of temp table, and refer to this aggregated data blob when you get into the office in the morning. Or, move everything to the cloud, like Azure Databricks, for instance, and run these jobs in Spark. Spark is blazing fast and runs jobs in parallel, so everything is done super-fast. Good luck.

Store large amounts of sensor data in SQL, optimize for query performance

I need to store sensor data from various locations (different factories with different rooms with each different sensors). Data is being downloaded in regular intervals from a device on site in the factories that collects the data transmitted from all sensors.
The sensor data looks like this:
collecting_device_id, sensor_id, type, value, unit, timestamp
Type could be temperature, unit could be degrees_celsius. collecting_device_id will identify the factory.
There are quite a lot of different things (==types) being measured.
I will collect around 500 million to 750 million rows and then perform analyses on them.
Here's the question for storing the data in a SQL database (let's say MySQL InnoDB on AWS RDS, large machine if necessary):
When considering query performance for future queries, is it better to store this data in one huge table just like it comes from the sensors? Or to distribute it across tables (tables for factories, temperatures, humidities, …, everything normalized)? Or to have a wide table with different fields for the data points?
Yes, I know, it's hard to say "better" without knowing the queries. Here's more info and a few things I have thought about:
There's no constant data stream as data is uploaded in chunks every 2 days (a lot of writes when uploading, the rest of the time no writes at all), so I would guess that index maintenance won't be a huge issue.
I will try to reduce the amount of data being inserted upfront (data that can easily be replicated later on, data that does not add additional information, …)
Queries that should be performed are not defined yet (I know, designing the query makes a big difference in terms of performance). It's exploratory work (so we don't know ahead what will be asked and cannot easily pre-compute values), so one time you want to compare data points of one type in a time range to data points of another type, the other time you might want to compare rooms in factories, calculate correlations, find duplicates, etc.
If I had multiple tables and normalized everything, the queries would need a lot of joins (which would probably make everything quite slow).
Queries mostly need to be performed on the whole ~ 500 million rows database, rarely on separately downloaded subsets
There will be very few users (<10), most of them will execute these "complex" queries.
Is a SQL database a good choice at all? Would there be a big difference in terms of performance for this use case to use a NoSQL system?
In this setup with this amount of data, will I have queries that never "come back"? (considering the query is not too stupid :-))
Don't pre-optimize. If you don't know the queries then you don't know the queries. It is too easy to make choices now that will slow down some subset of queries. When you know how the data will be queried you can optimize then -- it is easy to normalize after the fact (pull out temperature data into a related table, for example). For now I suggest you put it all in one table.
You might consider partitioning the data by date or if you have another way that might be useful (recording device maybe?). Often data of this size is partitioned if you have the resources.
After you think about the queries, you will possibly realize that you don't really need all the datapoints. Instead, max/min/avg/etc for, say, 10-minute intervals may be sufficient. And you may want to "alarm" on "over-temp" values. This should not involve the database, but should involve the program receiving the sensor data.
So, I recommend not storing all the data; instead only store summarized data. This will greatly shrink the disk requirements. (You could store the 'raw' data to a plain file in case you are worried about losing it. It will be adequately easy to reprocess the raw file if you need to.)
If you do decide to store all the data in table(s), then I recommend these tips:
High speed ingestion (includes tips on Normalization)
Summary Tables
Data Warehousing
Time series partitioning (if you plan to delete 'old' data) (partitioning is painful to add later)
750M rows -- per day? per decade? Per month - not too much challenge.
By receiving a batch every other day, it becomes quite easy to load the batch into a temp table, do normalization, summarization, etc; then store the results in the Summary table(s) and finally copy to the 'Fact' table (if you choose to keep the raw data in a table).
In reading my tips, you will notice that avg is not summarized; instead, sum and count are. If you also need standard deviation, keep sum-of-squares as well.
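A hedged sketch of such a summary table for the sensor data, with hypothetical names and a 10-minute grain:

CREATE TABLE sensor_summary (
  sensor_id   MEDIUMINT UNSIGNED NOT NULL,
  ts_10min    DATETIME           NOT NULL,  -- start of the 10-minute interval
  value_sum   DOUBLE             NOT NULL,
  value_sumsq DOUBLE             NOT NULL,  -- sum of squares, so stddev can be derived
  value_cnt   INT UNSIGNED       NOT NULL,
  value_min   DOUBLE             NOT NULL,
  value_max   DOUBLE             NOT NULL,
  PRIMARY KEY (sensor_id, ts_10min)
);

-- avg and stddev are derived from sum, count and sum-of-squares at query time
SELECT sensor_id,
       SUM(value_sum) / SUM(value_cnt) AS avg_value,
       SQRT(SUM(value_sumsq) / SUM(value_cnt)
            - POW(SUM(value_sum) / SUM(value_cnt), 2)) AS stddev_value
FROM sensor_summary
WHERE ts_10min >= '2016-01-01' AND ts_10min < '2016-02-01'
GROUP BY sensor_id;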
If you fail to include all the Summary Tables you ultimately need, it is not too difficult to re-process the Fact table (or Fact files) to populate the new Summary Table. This is a one-time task. After that, the summarization of each chunk should keep the table up to date.
The Fact table should be Normalized (for space); the Summary tables should be somewhat denormalized (for performance). Exactly how much denormalization depends on size, speed, etc., and cannot be predicted at this level of discussion.
"Queries on 500M rows" -- Design the Summary tables so that all queries can be done against them, instead. A starting rule-of-thumb: Any Summary table should have one-tenth the number of rows as the Fact table.
Indexes... The Fact table should have only a primary key. (The first 100M rows will work nicely; the last 100M will run so slowly. This is a lesson you don't want to have to learn 11 months into the project; so do pre-optimize.) The Summary tables should have whatever indexes make sense. This also makes querying a Summary table faster than the Fact table. (Note: Having a secondary index on a 500M-rows table is, itself, a non-trivial performance issue.)
NoSQL either forces you to re-invent SQL, or depends on brute-force full-table scans. Summary tables are the real solution. In one (albeit extreme) case, I sped up a 1-hour query to 2 seconds by using a Summary table. So, I vote for SQL, not NoSQL.
As for whether to "pre-optimize" -- I say it is a lot easier than rebuilding a 500M-row table. That brings up another issue: start with the minimal datasize for each field: look at MEDIUMINT (3 bytes), UNSIGNED (an extra bit), CHARACTER SET ascii (utf8 or utf8mb4 only for columns that need it), NOT NULL (NULL costs a bit), etc.
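For illustration, a hypothetical Fact-table definition using the smallest types that fit:

CREATE TABLE fact (
  sensor_id MEDIUMINT UNSIGNED NOT NULL,           -- 3 bytes, up to ~16M sensors
  type_id   TINYINT UNSIGNED   NOT NULL,           -- 1 byte, key into a small types table
  value     FLOAT              NOT NULL,
  unit      CHAR(2) CHARACTER SET ascii NOT NULL,  -- ascii where utf8 isn't needed
  ts        DATETIME           NOT NULL,
  PRIMARY KEY (sensor_id, ts)                      -- only the primary key, no secondary indexes
) ENGINE=InnoDB;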
Sure, it is possible to have 'queries that never come back'. This one 'never comes back', even with only 100 rows in table a: SELECT * FROM a AS a1 JOIN a AS a2 JOIN a AS a3 JOIN a AS a4 JOIN a AS a5. The result set has 10 billion rows.

Performing Calculations in SQL

I am trying to take information from one MySQL table, perform a bunch of calculations on this data, and then put the results in a second MySQL table. What would be the best way of doing this (i.e. in MySQL itself, using python, etc.)?
My apologies for the vagueness, I'll try to be more specific. Table 1 has every meal that every person in my class eats, so each meal is a primary key, and other columns include the person and the number of calories. The primary key for Table 2 is the person, and another column is the percentage of total calories this person has eaten, out of the calories of the entire class. Another column is the percentage of total calories of this person's gender in the class. Every day, I want to take the new eating information, and use it to update the percentages in Table 2. (Thanks for the help!)
Assuming the calculations can be done in SQL (and percentages are definitely doable), you have some choices.
The first, and academically correct, choice is not to store this in a table at all. One of the principles of normalization is that you don't store duplicate or calculated values; instead, you calculate them as you need them.
This isn't just an academic concern - it avoids many silly bugs and anomalies, and it means your data is always up to date - you don't have to wait for your calculation query to run before you can use the data.
If the calculation is non-trivial and/or an essential part of the business domain, common practice is to create a database view, which behaves like a table when queried, but is actually calculated on the fly. This means that the business logic is encapsulated in the view, rather than repeated in multiple queries. You can go further, with materialized views etc. - but the basic principle is the same.
In some cases, where the volume of data is huge, or the calculations are time consuming, or you have calculations that are very hard to embed in a single SQL statement, it's common to create "aggregate tables" - this is what you are suggesting. You can populate these tables either by (scheduled) queries, or by using database triggers.
However, aggregate tables are a last resort - they make the solution much harder to maintain and debug - if the data is wrong, you don't have a single query to debug, you've got to follow the chain of logic all the way through.
Assuming you are in a class of a few dozen people, and are reporting on less than 10 million years of meals, any modern RDBMS can calculate this report in milliseconds - there's really no need to store it in an aggregate table.
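As a concrete illustration of the view approach, here is a minimal sketch assuming hypothetical tables meals(person, calories) and people(person, gender):

CREATE VIEW calorie_share AS
SELECT m.person,
       SUM(m.calories) AS person_calories,
       SUM(m.calories) * 100 / (SELECT SUM(calories) FROM meals) AS pct_of_class,
       SUM(m.calories) * 100 / (SELECT SUM(m2.calories)
                                FROM meals m2
                                JOIN people p2 ON p2.person = m2.person
                                WHERE p2.gender = p.gender) AS pct_of_gender
FROM meals m
JOIN people p ON p.person = m.person
GROUP BY m.person, p.gender;

-- always up to date, no nightly job needed
SELECT * FROM calorie_share WHERE person = 'alice';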
A possible solution could be to create a View or a Materialized View with the complex SELECT query behind it.
A Materialized View could be a good option here, since you wrote that you would like these results to be re-queried/refreshed every day.
If you need to do more advanced operations on those tables, you could create a stored procedure and call it when you need its data.
Note: you can't do further work with a procedure's result set (e.g. you can't call it from a SELECT to join against it); you would have to dump it into, say, a temporary table first.
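MySQL has no built-in materialized views, but the daily refresh can be emulated with a cache table, a stored procedure and the event scheduler; a hedged sketch, reusing the hypothetical calorie_share view from above:

-- one-off: materialize the view into a plain, indexable table
CREATE TABLE calorie_share_cache AS SELECT * FROM calorie_share;

DELIMITER //
CREATE PROCEDURE refresh_calorie_share()
BEGIN
  TRUNCATE TABLE calorie_share_cache;
  INSERT INTO calorie_share_cache SELECT * FROM calorie_share;
END //
DELIMITER ;

-- requires the event scheduler: SET GLOBAL event_scheduler = ON;
CREATE EVENT ev_refresh_calorie_share
ON SCHEDULE EVERY 1 DAY
DO CALL refresh_calorie_share();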

Best database design for storing a high number of columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields, and they all need a column. We need to be able to filter and sort on all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns, since that is the most natural way to do things.
Options: My other options are a meta table that stores keys and values, or using a document-based database so I have access to a variable schema and can scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need storage that can select, sort and filter really fast.
Update:
Some more information about usage: XML feeds will be generated live from this table. We are talking about 100 - 500 requests per hour, but this will be growing. The fields will not change regularly, but it could happen once every 6 months. We will also be updating the datafeeds daily: checking if items are updated, deleting old ones and adding new ones.
120 columns at 100k rows is not enough information; that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a key-value pair table. Those are not great in a relational database, so stick to properly typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach, and only deviating from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTs/UPDATEs are only happening daily, then some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say 'we will be updating the datafeeds daily', so there shouldn't be any need to query the database every time. In any case, 1000 per hour is only 17 per minute. That probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with, so it is necessary to analyze it in advance and pick the best table/column layout. If all 120 of your columns contain textual data, then a single row will take several kilobytes of disk space. In that situation you will want to make all queries highly selective, so that indexes are used to minimize IO; full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be: will each request extract a single row, a small bunch of rows, or a big portion (up to the whole table)?
Second, looking at the data, you might identify a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns (a plain-SQL sketch follows below):
set up a dictionary table, giving it an integer PK;
replace the actual value in the master table's column with the PK from the dictionary.
In my case the transformation is done by triggers written in C, so although it gives me an upload penalty, I do get some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
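The dictionary transformation itself could look like the following sketch (without the C triggers), using hypothetical names:

-- dictionary of the repeated textual values
CREATE TABLE dict_status (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  value VARCHAR(64)  NOT NULL,
  UNIQUE KEY uq_value (value)
);

-- the master table keeps a small integer key instead of the repeated text
ALTER TABLE feed_master
  ADD COLUMN status_id INT UNSIGNED NULL,
  ADD CONSTRAINT fk_status FOREIGN KEY (status_id) REFERENCES dict_status (id);

-- backfill: collect the distinct values, then map rows onto the dictionary keys
INSERT IGNORE INTO dict_status (value) SELECT DISTINCT status FROM feed_master;
UPDATE feed_master m JOIN dict_status d ON d.value = m.status
   SET m.status_id = d.id;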
Third, try to split the data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are used by almost all queries, while the remaining 60-70% are evenly distributed among the queries and only used partially. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another one for the rest of the fields. In fact, you can have several "other ones", logically grouping data into separate tables.
In my practice we had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive, as it was used in the majority of our reports (and reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy), we managed to gain a pleasant performance boost (sorry, I don't have the numbers any longer).
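That split might look like the following hedged sketch, with hypothetical customer tables:

-- fields used by nearly every query
CREATE TABLE customer_core (
  customer_id INT UNSIGNED NOT NULL PRIMARY KEY,
  name        VARCHAR(100) NOT NULL,
  status      VARCHAR(20)  NOT NULL
);

-- rarely used fields, moved out of the hot table
CREATE TABLE customer_extra (
  customer_id INT UNSIGNED NOT NULL PRIMARY KEY,
  banking_details TEXT,
  billing_details TEXT,
  custom_comments TEXT,
  FOREIGN KEY (customer_id) REFERENCES customer_core (customer_id)
);

-- a view keeps the old "wide" shape for the external application
CREATE VIEW customer AS
SELECT c.customer_id, c.name, c.status,
       e.banking_details, e.billing_details, e.custom_comments
FROM customer_core c
LEFT JOIN customer_extra e ON e.customer_id = c.customer_id;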
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.