Running SUM across large amount of rows - mysql

I have a theoretical question which pertains to the SQL function SUM().
Imagine we have a table which contains a column called "value"
"value" is a DECIMAL number either positive or negative.
In our potential solution, we'd like to run a SUM() across all rows for column "value"
SELECT SUM(value)
FROM table
No problems so far, but the dataset is potentially millions of rows. Possibly even hundreds of millions of rows as the data will be retained over years.
So my questions are:
Can you run SUM() across hundreds of millions of rows?
What kind of performance could I expect on a query across that many rows? We haven't settled on a database yet, but we are looking at MySQL or SQL Server.

Take a look at columnstore indexes in SQL Server. In short, you can create a columnstore index on your tables, which is different from the traditional rowstore index.
These indexes are specially designed to optimize aggregate queries over huge amounts of data (for example, in data warehouse star and snowflake schemas).
From the docs:
Columnstore indexes can achieve up to 100x better performance on
analytics and data warehousing workloads and up to 10x better data
compression than traditional rowstore indexes.
because:
Data compression - you get many benefits from this; for example, columnstore indexes read compressed data from disk, which means fewer bytes of data need to be read into memory;
Column elimination - unlike rowstore indexes, columnstore indexes skip reading columns that are not required for the query result, which further reduces I/O and therefore improves query performance;
Rowgroup elimination - table scans are optimized by using metadata to eliminate specific rowgroups based on your filtering criteria;
Batch mode execution - prior to SQL Server 2019, only queries involving such indexes could benefit from batch mode processing, which reduces execution time further (check this video to see how effective this mode is)

You may certainly run SUM() across an entire table, and the performance would depend roughly on how many records that table has. Note that things like indices would not really help performance in this case, because SQL Server has to touch every record to compute the sum.
If running SUM on the entire table in production might not scale well, then one option to consider would be to maintain the sum in a separate table. Then, when a record gets inserted, updated, or deleted, you may use a trigger to adjust the running total accordingly. This way, accessing the sum would be roughly constant time, though every write would carry some additional overhead because of the trigger logic.
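The trigger idea above can be sketched as follows. This is a minimal illustration only: it uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere, and MySQL or SQL Server trigger syntax differs. The table names `ledger` and `ledger_total` are hypothetical, and the values are stored as integers because SQLite has no true DECIMAL type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical source table; "value" is kept as an integer for this sketch.
CREATE TABLE ledger (id INTEGER PRIMARY KEY, value INTEGER NOT NULL);

-- Single-row table that holds the running total.
CREATE TABLE ledger_total (id INTEGER PRIMARY KEY CHECK (id = 1), total INTEGER NOT NULL);
INSERT INTO ledger_total (id, total) VALUES (1, 0);

CREATE TRIGGER ledger_after_insert AFTER INSERT ON ledger
BEGIN
    UPDATE ledger_total SET total = total + NEW.value WHERE id = 1;
END;

CREATE TRIGGER ledger_after_delete AFTER DELETE ON ledger
BEGIN
    UPDATE ledger_total SET total = total - OLD.value WHERE id = 1;
END;

-- Updates change the sum too, so they need their own trigger.
CREATE TRIGGER ledger_after_update AFTER UPDATE OF value ON ledger
BEGIN
    UPDATE ledger_total SET total = total - OLD.value + NEW.value WHERE id = 1;
END;
""")

conn.executemany("INSERT INTO ledger (value) VALUES (?)", [(100,), (-30,), (45,)])
conn.execute("DELETE FROM ledger WHERE value = -30")
conn.execute("UPDATE ledger SET value = 50 WHERE value = 45")

# Constant-time lookup of the maintained total...
running_total = conn.execute("SELECT total FROM ledger_total WHERE id = 1").fetchone()[0]
# ...compared with the full scan it is meant to replace.
full_scan_total = conn.execute("SELECT SUM(value) FROM ledger").fetchone()[0]
print(running_total, full_scan_total)
```

Whether the per-write overhead is worth it depends on how often the total is read compared with how often rows change.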

I'll throw out a couple of ideas. If the data sets that you are working with are absolutely massive, consider running an overnight job to create a view, or some kind of temp table, and refer to this aggregated data blob when you get into the office in the morning. Or, move everything to the cloud, like Azure Databricks, for instance, and run these jobs in Spark. Spark is blazing fast and runs jobs in parallel, so everything is done super-fast. Good luck.

Related

How to improve the performance of table scans with innodb

Brief: Is there any way to improve the performance of table scans on InnoDB tables?
Please, do not suggest adding indexes to avoid table scans. (see below)
innodb_buffer_pool_size sits at 75% of server memory (48 GB/64GB)
I'm using the latest version of Percona (5.7.19) if that changes anything
Longer: We have 600Gb of recent time series data (we aggregate and delete older data) spread over 50-60 tables. So most of it is "active" data that is regularly queried. These tables are somewhat large (400+ numeric columns) and many queries run against a number of those columns (alarming) which is why it is impractical to add indexes (as we would have to add a few dozen). The largest tables are partitioned per day.
I am fully aware that this is an application/table design problem and not a "server tuning" problem. We are currently working to significantly change the way these tables are designed and queried, but have to maintain the existing system until this happens so I'm looking for a way to improve things a bit to buy us a little time.
We recently split this system and have moved a part of it to a new server. It previously used MyISAM, and we tried moving to TokuDB which seemed appropriate but ran into some weird problems. We switched to InnoDB but performance is really bad. I get the impression that MyISAM is better with table scans which is why, barring any better option, we'll go back to it until the new system is in place.
Update
All tables have pretty much the same structure:
-timestamp
-primary key (varchar(20) field)
-about 15 fields of various types representing other secondary attributes that can be filtered upon (along with an appropriately indexed criteria first)
-And then about a few hundred measures (floats), between 200-400.
I already trimmed the row length as much as I could without changing the structure itself. The primary key used to be a varchar(100), all measures used to be doubles, many of the secondary attributes had their data types changed.
Upgrading hardware is not really an option.
Creating small tables with just the set of columns I need would help some processes perform faster. But at the cost of creating that table with a table scan first and duplicating data. Maybe if I created it as a memory table. By my estimate, it would take a couple of GB away from the buffer pool. Also there are aggregation processes that read about as much data from the main tables on a regular basis, and they need all columns.
There is unfortunately a lot of duplication of effort in those queries, which I plan to address in the next version. The alarming and aggregation processes basically reprocess the entire day's worth of data every time some rows are inserted (every half hour) instead of just dealing with new/changed data.
Like I said, the bigger tables are partitioned, so it's usually a scan over a daily partition rather than the entire table, which is a small consolation.
Implementing a system to hold this in memory outside of the DB could work, but that would entail a lot of changes on the legacy system and development work. Might as well spend that time on the better design.
The fact that InnoDB tables are so much bigger than MyISAM tables for the same data (2-3x as big in my case) really hinders performance.
MyISAM is a little bit better at table-scans, because it stores data more compactly than InnoDB. If your queries are I/O-bound, scanning through less data on disk is faster. But this is a pretty weak solution.
You might try using InnoDB compression to reduce the size of data. That might get you closer to MyISAM size, but you're still I/O-bound so it's going to suck.
Ultimately, it sounds like you need a database that is designed for an OLAP workload, like a data warehouse. InnoDB and TokuDB are both designed for OLTP workload.
It smells like a Data Warehouse with "Reports". By judiciously picking what to aggregate (a selection of your floats) over what time period (hour or day is typical), you can build and maintain Summary Tables that work much more efficiently for the Reports. This has the effect of scanning the data only once (to build the Summaries), not repeatedly. The Summary tables are much smaller, so the reports are much faster -- 10x is perhaps typical.
It may also be possible to augment the Summary tables as the raw data is being Inserted. (See INSERT .. ON DUPLICATE KEY UPDATE ..)
And use Partitioning by date to allow for efficient DROP PARTITION instead of DELETE. Don't have more than about 50 partitions.
Summary Tables
Time series Partitioning
If you would like to discuss in more detail, let's start with one of the queries that is scanning so much now.
In the various projects I have worked on, there were between 2 and 7 Summary tables.
With 600GB of data, you may be pushing the limits on 'ingestion'. If so, we can discuss that, too.
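The suggestion above to augment a Summary table while raw rows arrive can be sketched like this. It is an illustration under assumptions: it runs on Python's built-in sqlite3 module, where the upsert is spelled `INSERT ... ON CONFLICT ... DO UPDATE` (available in SQLite 3.24 and later) rather than MySQL's `INSERT .. ON DUPLICATE KEY UPDATE ..`. The table `daily_summary`, its columns, and the helper `record_reading` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical summary table: one row per (day, metric).
conn.execute("""
    CREATE TABLE daily_summary (
        day       TEXT    NOT NULL,
        metric    TEXT    NOT NULL,
        row_count INTEGER NOT NULL,
        sum_value REAL    NOT NULL,
        PRIMARY KEY (day, metric)
    )
""")

def record_reading(conn, timestamp, metric, value):
    """Fold one raw reading into the per-day summary row."""
    day = timestamp[:10]  # 'YYYY-MM-DD' prefix of an ISO-style timestamp
    conn.execute(
        """
        INSERT INTO daily_summary (day, metric, row_count, sum_value)
        VALUES (?, ?, 1, ?)
        ON CONFLICT (day, metric) DO UPDATE SET
            row_count = row_count + 1,
            sum_value = sum_value + excluded.sum_value
        """,
        (day, metric, value),
    )

readings = [
    ("2024-01-01 00:00:00", "temp", 20.0),
    ("2024-01-01 00:30:00", "temp", 22.0),
    ("2024-01-02 00:00:00", "temp", 18.0),
]
for ts, metric, value in readings:
    record_reading(conn, ts, metric, value)

summary_rows = conn.execute(
    "SELECT day, metric, row_count, sum_value FROM daily_summary ORDER BY day"
).fetchall()
print(summary_rows)
```

Reports then read the small summary table instead of scanning the raw data. In practice the upsert would more likely be applied per batch than per row.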

Store large amounts of sensor data in SQL, optimize for query performance

I need to store sensor data from various locations (different factories with different rooms with each different sensors). Data is being downloaded in regular intervals from a device on site in the factories that collects the data transmitted from all sensors.
The sensor data looks like this:
collecting_device_id, sensor_id, type, value, unit, timestamp
Type could be temperature, unit could be degrees_celsius. collecting_device_id will identify the factory.
There are quite a lot of different things (==types) being measured.
I will collect around 500 million to 750 million rows and then perform analyses on them.
Here's the question for storing the data in a SQL database (let's say MySQL InnoDB on AWS RDS, large machine if necessary):
When considering query performance for future queries, is it better to store this data in one huge table just like it comes from the sensors? Or to distribute it across tables (tables for factories, temperatures, humidities, …, everything normalized)? Or to have a wide table with different fields for the data points?
Yes, I know, it's hard to say "better" without knowing the queries. Here's more info and a few things I have thought about:
There's no constant data stream as data is uploaded in chunks every 2 days (a lot of writes when uploading, the rest of the time no writes at all), so I would guess that index maintenance won't be a huge issue.
I will try to reduce the amount of data being inserted upfront (data that can easily be replicated later on, data that does not add additional information, …)
Queries that should be performed are not defined yet (I know, designing the query makes a big difference in terms of performance). It's exploratory work (so we don't know ahead what will be asked and cannot easily pre-compute values), so one time you want to compare data points of one type in a time range to data points of another type, the other time you might want to compare rooms in factories, calculate correlations, find duplicates, etc.
If I would have multiple tables and normalize everything the queries would need a lot of joins (which probably makes everything quite slow)
Queries mostly need to be performed on the whole ~ 500 million rows database, rarely on separately downloaded subsets
There will be very few users (<10), most of them will execute these "complex" queries.
Is a SQL database a good choice at all? Would there be a big difference in terms of performance for this use case to use a NoSQL system?
In this setup with this amount of data, will I have queries that never "come back"? (considering the query is not too stupid :-))
Don't pre-optimize. If you don't know the queries then you don't know the queries. It is too easy to make choices now that will slow down some subset of queries. When you know how the data will be queried you can optimize then -- it is easy to normalize after the fact (pull out temperature data into a related table, for example). For now I suggest you put it all in one table.
You might consider partitioning the data by date or if you have another way that might be useful (recording device maybe?). Often data of this size is partitioned if you have the resources.
After you think about the queries, you will possibly realize that you don't really need all the datapoints. Instead, max/min/avg/etc for, say, 10-minute intervals may be sufficient. And you may want to "alarm" on "over-temp" values. This should not involve the database, but should involve the program receiving the sensor data.
So, I recommend not storing all the data; instead only store summarized data. This will greatly shrink the disk requirements. (You could store the 'raw' data to a plain file in case you are worried about losing it. It will be adequately easy to reprocess the raw file if you need to.)
If you do decide to store all the data in table(s), then I recommend these tips:
High speed ingestion (includes tips on Normalization)
Summary Tables
Data Warehousing
Time series partitioning (if you plan to delete 'old' data) (partitioning is painful to add later)
750M rows -- per day? per decade? Per month - not too much challenge.
By receiving a batch every other day, it becomes quite easy to load the batch into a temp table, do normalization, summarization, etc; then store the results in the Summary table(s) and finally copy to the 'Fact' table (if you choose to keep the raw data in a table).
In reading my tips, you will notice that avg is not summarized; instead sum and count are. If you need standard deviation, also, keep sum-of-squares.
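A small worked example may help show why sum and count (plus sum-of-squares) are stored rather than avg. This is a sketch in plain Python with made-up numbers; the function names are hypothetical and it computes the population standard deviation.

```python
import math

def summarize(values):
    """Summary row for one chunk: the three additive quantities."""
    return {
        "count": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }

def merge(a, b):
    """Summaries combine by simple addition, which averages do not."""
    return {key: a[key] + b[key] for key in ("count", "sum", "sum_sq")}

def mean(summary):
    return summary["sum"] / summary["count"]

def population_stddev(summary):
    m = mean(summary)
    variance = summary["sum_sq"] / summary["count"] - m * m
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding

# Two chunks of unequal size (illustrative data only).
chunk_a = summarize([2, 4, 4])
chunk_b = summarize([4, 5, 5, 7, 9])
total = merge(chunk_a, chunk_b)

true_mean = mean(total)
true_stddev = population_stddev(total)
# Averaging the per-chunk averages ignores the chunk sizes.
average_of_averages = (mean(chunk_a) + mean(chunk_b)) / 2
print(true_mean, true_stddev, average_of_averages)
```

One caveat: the sum-of-squares formula can lose precision when the mean is large relative to the spread, so it is worth checking against your real value ranges.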
If you fail to include all the Summary Tables you ultimately need, it is not too difficult to re-process the Fact table (or Fact files) to populate the new Summary Table. This is a one-time task. After that, the summarization of each chunk should keep the table up to date.
The Fact table should be Normalized (for space); the Summary tables should be somewhat denormalized (for performance). Exactly how much denormalization depends on size, speed, etc., and cannot be predicted at this level of discussion.
"Queries on 500M rows" -- Design the Summary tables so that all queries can be done against them, instead. A starting rule-of-thumb: Any Summary table should have one-tenth the number of rows as the Fact table.
Indexes... The Fact table should have only a primary key. (The first 100M rows will work nicely; the last 100M will run so slowly. This is a lesson you don't want to have to learn 11 months into the project; so do pre-optimize.) The Summary tables should have whatever indexes make sense. This also makes querying a Summary table faster than the Fact table. (Note: Having a secondary index on a 500M-rows table is, itself, a non-trivial performance issue.)
NoSQL either forces you to re-invent SQL, or depends on brute-force full-table-scans. Summary tables are the real solution. In one (albeit extreme) case, I sped up a 1-hour query to 2 seconds by using a Summary table. So, I vote for SQL, not NoSQL.
As for whether to "pre-optimize" -- I say it is a lot easier than rebuilding a 500M-row table. That brings up another issue: Start with the minimal datasize for each field: Look at MEDIUMINT (3 bytes), UNSIGNED (an extra bit), CHARACTER SET ascii (utf8 or utf8mb4 only for columns that need it), NOT NULL (NULL costs a bit), etc.
Sure, it is possible to have 'queries that never come back'. This one 'never comes back', even with only 100 rows in a: SELECT * FROM a JOIN a JOIN a JOIN a JOIN a. The resultset has 100^5 = 10 billion rows.

handling/compressing large datasets in multiple tables

In an application at our company we collect statistical data from our servers (load, disk usage and so on). Since there is a huge amount of data and we don't need all data at all times, we've had a "compression" routine that takes the raw data, calculates min, max and average for a number of data points, stores these new values in the same table and removes the old ones after some weeks.
Now I'm tasked with rewriting this compression routine, and the new routine must keep all uncompressed data we have for one year in one table and "compressed" data in another table. My main concerns now are how to handle the data that is continuously written to the database and whether or not to use a "transaction table" (my own term since I can't come up with a better one; I'm not talking about the commit/rollback transaction functionality).
As of now our data collectors insert all information into a table named ovak_result and the compressed data will end up in ovak_resultcompressed. But are there any specific benefits or drawbacks to creating a table called ovak_resultuncompressed and just use ovak_result as a "temporary storage"? ovak_result would be kept minimal which would be good for the compressing routine, but I would need to shuffle all data from one table into another continually, and there would be constant reading, writing and deleting in ovak_result.
Are there any mechanisms in MySQL to handle this kind of thing?
(Please note: We are talking about quite large datasets here (about 100 M rows in the uncompressed table and about 1-10 M rows in the compressed table). Also, I can do pretty much what I want with both software and hardware configurations so if you have any hints or ideas involving MySQL configurations or hardware set-up, just bring them on.)
Try reading about the ARCHIVE storage engine.
Re your clarification. Okay, I didn't get what you meant from your description. Reading more carefully, I see you did mention min, max, and average.
So what you want is a materialized view that updates aggregate calculations for a large dataset. Some RDBMS brands such as Oracle have this feature, but MySQL doesn't.
One experimental product that tries to solve this is called FlexViews (http://code.google.com/p/flexviews/). This is an open-source companion tool for MySQL. You define a query as a view against your raw dataset, and FlexViews continually monitors the MySQL binary logs, and when it sees relevant changes, it updates just the rows in the view that need to be updated.
It's pretty effective, but it has a few limitations in the types of queries you can use as your view, and it's also implemented in PHP code, so it's not fast enough to keep up if you have really high traffic updating your base table.

Most efficient way to generate reports in MySQL on massive datasets

I need to build a reporting interface to an application I'm working on which requires administrators to visualise huge quantities of collected data over time.
Think something similar to Google Analytics etc.
Most of the data that needs to be visualised sits in a basic table which contains a datetime, 'action' varchar and other filterable data - currently the table holds 1.5M rows, and it's growing every day.
At the moment I'm doing a simple select with the filters applied grouped by day and it's running pretty well, but I was wondering if there's a smarter more efficient way to extract such data.
Cheers
1) Two tiers -- raw data, and summarized data. For raw data, indexes will likely be of no help. You are doing aggregations, and in most cases that necessitates a full table scan. If it doesn't, reorganize so it does; it'll be faster.
2) Figure out your aggregates, automatically generate them, and run the reports off the aggregate data. Do index these summary tables!
3) Avoid joins. Aggregate, materialize results of the group-bys, then join the aggregated results.
4) Partition. Keep data for one day (or whatever granularity makes sense) separate from data for another day. Make automated table creation scripts if necessary (grown-up -- or feature-heavy, depending on your point of view -- databases give you something called "partitioning" to do this in a more sane way).
5) Read up on "data warehousing"
http://en.wikipedia.org/wiki/Data_warehouse
You can start out by doing a couple of things:
Make sure you add indexes on all the filters so they won't do any table scans.
Check the query plan (EXPLAIN in MySQL) to make sure there are no places that need optimization.
Since you have a datetime stamp in your table, partitioning will definitely help you in the future.
Good luck.
You can expect a number of common queries, probably a small number compared to the number of unique combinations of filters that could be generated. You can use this to "compress" the data into companion tables, and run this collection process at night.

What is the cost of indexing multiple db columns?

I'm writing an app with a MySQL table that indexes 3 columns. I'm concerned that after the table reaches a significant number of records, saving a new record will become slow. Please advise on how best to approach the indexing of columns.
UPDATE
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
Index what seems the most logical (that should hopefully be obvious, for example, a customer ID column in the CUSTOMERS table).
Then run your application and collect statistics periodically to see how the database is performing. RUNSTATS on DB2 is one example, I would hope MySQL has a similar tool.
When you find some oft-run queries doing full table scans (or taking too long for other reasons), then, and only then, should you add more indexes. It does little good to optimise a once-a-month-run-at-midnight query so it can finish at 12:05 instead of 12:07. However, it's a huge improvement to reduce a customer-facing query from 5 seconds down to 2 seconds (that's still too slow, customer-facing queries should be sub-second if possible).
More indexes tend to slow down inserts and speed up queries. So it's always a balancing act. That's why you only add indexes in specific response to a problem. Anything else is premature optimization and should be avoided.
In addition, revisit the indexes you already have periodically to see if they're still needed. It may be that the queries that caused you to add those indexes are no longer run often enough to warrant it.
To be honest, I don't believe indexing three columns on a table will cause you to suffer unless you plan on storing really huge numbers of rows :-) - indexing is pretty efficient.
After your edit which states:
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
My response is that 200 records a day is an extremely small value for a database, you definitely won't have anything to worry about with those three indexes.
Just this week, I imported a day's worth of transactions into one of our database tables at work and it contained 2.1 million records (we get at least one transaction per second across the entire day from 25 separate machines). And it has four separate composite keys, which is somewhat more intensive than your three individual keys.
Now granted, that's on a DB2 database but I can't imagine IBM are so much better than the MySQL people that MySQL can only handle less than 0.01% of the DB2 load.
I ran some simple tests using my real project and a real MySQL database.
My results: adding an average index (1-3 columns in an index) to a table makes inserts slower by 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
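The extrapolation from one index to 20 can be checked with a few lines of arithmetic. This is only a sketch: the 2.1% figure is the measurement quoted above, and whether the per-index cost adds up linearly or compounds is an assumption, since only the single-index case was reported.

```python
per_index_slowdown = 0.021  # 2.1% per index, the figure reported above
index_count = 20

# Assumption A: each index adds a fixed 2.1% of the baseline insert time.
additive_slowdown = per_index_slowdown * index_count

# Assumption B: each index makes the already-slowed insert 2.1% slower again.
compounded_slowdown = (1 + per_index_slowdown) ** index_count - 1

print(f"additive:   {additive_slowdown:.1%}")
print(f"compounded: {compounded_slowdown:.1%}")
```

Both models land near the quoted 40-50% range, so that range is at least consistent with the single-index measurement under either assumption.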
So is it ok to add many indexes? - It depends :) I gave you my results - You decide!
Nothing for select queries, though updates and especially inserts will be orders of magnitude slower - which you won't really notice before you start inserting a LOT of rows at the same time...
In fact at a previous employer (single user, desktop system) we actually DROPPED indexes before starting our "import routine" - which would first delete all records before inserting a huge number of records into the same table...
Then when we were finished with the insertion job we would re-create the indexes...
We would save 90% of the time for this operation by dropping the indexes before starting the operation and re-creating the indexes afterwards...
This was a Sybase database, but the same numbers apply for any database...
So be careful with indexes, they're FAR from "free"...
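The drop-then-recreate import routine described above might look roughly like this. It is a sketch, not the original Sybase routine: it runs on Python's built-in sqlite3 module, the table name `runs` and the index names are hypothetical, and the column names are borrowed from the question. It shows the sequence of steps only and makes no claim about the time saved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER, event_id INTEGER, point_value INTEGER)"
)

# Hypothetical secondary indexes, kept as DDL so they can be recreated later.
INDEX_DDL = {
    "idx_runs_user": "CREATE INDEX idx_runs_user ON runs (user_id)",
    "idx_runs_event": "CREATE INDEX idx_runs_event ON runs (event_id)",
    "idx_runs_points": "CREATE INDEX idx_runs_points ON runs (point_value)",
}
for ddl in INDEX_DDL.values():
    conn.execute(ddl)

def bulk_reload(conn, rows):
    """Replace the table contents, building the indexes once at the end."""
    with conn:  # commit on success, roll back on error
        for name in INDEX_DDL:
            conn.execute(f"DROP INDEX IF EXISTS {name}")
        conn.execute("DELETE FROM runs")
        conn.executemany(
            "INSERT INTO runs (user_id, event_id, point_value) VALUES (?, ?, ?)",
            rows,
        )
        for ddl in INDEX_DDL.values():
            conn.execute(ddl)

rows = [(user, event, 1) for user in range(1, 11) for event in range(1, 101)]
bulk_reload(conn, rows)

row_count = conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
index_names = sorted(
    name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'runs'"
    )
)
print(row_count, index_names)
```

This pattern only makes sense for large batch loads on a table nobody is querying at the time; for a few hundred inserts a day it is unnecessary.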
Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
With that kind of insertion rate, the cost of indexing an extra column will be negligible.
Without some more details about expected usage of the data in your table worrying about indexes slowing you down smells a lot like premature optimization that should be avoided.
If you are really concerned about it, then set up a test database and simulate performance in the worst-case scenarios. A test proving that it is or is not a problem will probably be much more useful than trying to guess and worry about what may happen. If there is a problem, you will be able to use your test setup to try different methods to fix the issue.
The index is there to speed retrieval of data, so the question should be "What data do I need to access quickly?". Without the index, some queries will do a full table scan (go through every row in the table) in order to find the data that you want. With a significant number of records this will be a slow and expensive operation. If it is for a report that you run once a month then maybe that's okay; if it is for frequently accessed data then you will need the index to give your users a better experience.
If you find that insert operations are slow because of the index, then this is a problem you can solve at the hardware level by throwing more CPUs, RAM and better hard drive technology at the problem.
What Pax said.
For the dimensions you describe, the only significant concern I can imagine is "What is the cost of failing to index multiple db columns?"