Compress a MySQL partition

I have a table that is partitioned by a timestamp into a separate partition for each day.
During each day around a billion events are received. Each event is tagged with an object, and the business logic needs all the events for an object to decide what to do with them. So the system has a big table with one row per object (which is still hundreds of millions of rows per day), and the events for each object are concatenated into an 'event buffer' MEDIUMTEXT column.
One row per object works really well. It is very fast and suitable for our business logic and reporting to consume. Once upon a time we started with an event table and joining instead, and it was far too slow.
After 5 days no more events will be received for an object. At that point, if we haven't had terminating events, our system adds our own 'timed out' event to the buffer.
We are doing a lot of business logic as events for an object are received, and we have a bool to flag which objects have no final event etc.
Although the "online" system only wants 5 days of object events, the reporting system wants a year's worth.
I want partitions over 5 days old to be compressed. I can run a cron-job to trigger this.
The current approach is: have another table with an identical schema and partitioning to the online table, but with ROW_FORMAT=COMPRESSED. Then, each day, create a new table like these tables but without partitioning. First we ALTER TABLE ... EXCHANGE PARTITION to swap the 5-day-old partition out into the unpartitioned table, then we insert those rows into the compressed table.
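For concreteness, here is a minimal sketch of one reading of that daily job, with hypothetical table and partition names (events_online, events_compressed, events_swap, p20240101):

    -- Same schema as the online table, but unpartitioned (the exchange target).
    CREATE TABLE events_swap LIKE events_online;
    ALTER TABLE events_swap REMOVE PARTITIONING;

    -- Swap the 5-day-old partition out of the online table into events_swap.
    ALTER TABLE events_online EXCHANGE PARTITION p20240101 WITH TABLE events_swap;

    -- Copy the rows into the compressed reporting table, then discard the swap table.
    INSERT INTO events_compressed SELECT * FROM events_swap;
    DROP TABLE events_swap;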
There are two problems with the current approach: 1) that reporting tools have to scan two separate tables, and 2) that there is a race condition when objects are in neither of the main tables.
Is it possible to ALTER the row_format for an individual existing partition?

No, you cannot compress individual partitions. The basic attributes of the table are uniform across all partitions.
Please provide SHOW CREATE TABLE, the table size, and further discussion of "why" you want to compress. There may be a workaround that achieves a similar function.
Reporting
"Reports" usually need "summary" information, not all the raw data. So... Summarize the data each day (or hour, or whatever) and put the summary into a much smaller table. Then throw away the raw data.
If you are concerned about needing the raw data at a later date, then save the logs, compressed, in regular files. Sure, it will be some extra work to pull up old data. But there are trade-offs.
This also solves the interference -- different table, smaller table.
For summarizing, I usually do it periodically, but it could be done as data is inserted into the raw ("Fact") table. See Summary tables and High speed ingestion. The latter link explains a way to gather data in a "staging table", massage it, send it to the Fact table and the Normalization tables (which perhaps you don't have), and do the incremental summarization.
Your raw data might be partitioned in 1-day chunks: PARTITION BY RANGE (TO_DAYS(...)). 30 partitions is a pretty good number (a compromise). The Summary table(s) might need to be partitioned too. Consider 12 or 52 if you are purging after a year. (Actually 14 or 54; see the links for why.)
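A minimal sketch of such daily RANGE partitioning, with hypothetical table and column names:

    CREATE TABLE raw_events (
        object_id    BIGINT UNSIGNED NOT NULL,
        ts           DATETIME        NOT NULL,
        event_buffer MEDIUMTEXT,
        PRIMARY KEY (object_id, ts)      -- the partitioning column must be in every unique key
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(ts)) (
        PARTITION p20240101 VALUES LESS THAN (TO_DAYS('2024-01-02')),
        PARTITION p20240102 VALUES LESS THAN (TO_DAYS('2024-01-03')),
        PARTITION pFuture   VALUES LESS THAN MAXVALUE
    );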
Migrating a partition
Suppose you could move the 6-day-old partition to another, more compressed, table; and use a VIEW in front of a UNION to mask the existence of the split?
If you have 5.7, it is pretty easy to "export a tablespace"; this turns the partition into a table. Then you could either import it into another partitioned table (which does not seem very useful, unless it is somehow compressed), or otherwise transform the data to shrink it.
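A minimal sketch of the VIEW-over-UNION idea, assuming hypothetical table names events_online (hot data) and events_compressed (older, compressed data); reporting queries then reference only the view:

    CREATE OR REPLACE VIEW events_all AS
        SELECT * FROM events_online
        UNION ALL
        SELECT * FROM events_compressed;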
Manual compression
For large text columns, I recommend compressing in the client, storing into a BLOB (instead of TEXT), and uncompressing on the way out. This saves disk space (3x for typical English, code, xml, etc) and bandwidth (to/from client -- especially handy if the client and server are distant).
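The recommendation above is to compress in the client (saving bandwidth as well as disk); purely to illustrate the storage side, here is a sketch using MySQL's built-in COMPRESS()/UNCOMPRESS() with a hypothetical table:

    CREATE TABLE object_events (
        object_id    BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        event_buffer MEDIUMBLOB               -- was MEDIUMTEXT; holds compressed bytes
    );

    INSERT INTO object_events VALUES (1234, COMPRESS('...event data...'));

    SELECT object_id, UNCOMPRESS(event_buffer) AS event_buffer
    FROM   object_events
    WHERE  object_id = 1234;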
Database API Layer
You have users issuing SQL queries? You should seriously consider providing a simple layer between the clients and the database. With that, you can hide the view, the union, the compression, the split into two tables, etc. And you can change things without requiring the customers to make changes. Be sure to create an API that speaks in the client's terms rather than exposing generic SQL. GetObject(1234) does whatever SQL is needed, then returns "object #1234" in some agreed-upon format (JSON, XML, PHP structure, whatever).

Related

How to handle 20M+ records from tables with same structure in MySQL

I have to handle 25M rows of data that I have collected and transformed from about 50 different sources. Every source leads to about 500,000 to 600,000 rows. Each record has the same structure, regardless of the source (let's say: id, title, author, release_date).
For flexibility, I would prefer to create a dedicated table for each source (then I can clear/drop data from a source and reload/upload data very quickly, using LOAD DATA INFILE). This way it seems very easy to truncate the table with no risk of deleting rows from other sources.
But then I don't know how to select records having the same author across the different tables, and cherry on the cake, with pagination (LIMIT keyword).
Is the only solution to store everything in a single huge table and deal with the pain of indexing/backing up a 25M+ row database, or is there some kind of abstraction layer to virtually merge the 50 tables into one?
It is probably a common question for a DBA, but I could not find any answer yet...
Any help/idea much appreciated. Thanks.
This might be a good spot for MySQL partitioning.
This lets you handle a big volume of data while giving you the opportunity to run DML operations on a specific partition when needed (such as TRUNCATE, or even DROP) very efficiently, and without impacting the rest of your data. Partition selection is also supported in LOAD DATA statements.
You can run queries across partitions as you would with a normal table, or target a specific partition when you need to (which can be done very efficiently).
In your specific use case, list partitioning seems like a relevant choice: you have a pre-defined list of sources, so you would typically have one partition per source.
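A minimal sketch of LIST partitioning by source, with hypothetical names and a hypothetical file path; a single source can then be truncated and reloaded without touching the others:

    CREATE TABLE records (
        id           BIGINT UNSIGNED  NOT NULL,
        source_id    TINYINT UNSIGNED NOT NULL,
        title        VARCHAR(255),
        author       VARCHAR(255),
        release_date DATE,
        PRIMARY KEY (id, source_id)          -- the partitioning column must be in the PK
    ) ENGINE=InnoDB
    PARTITION BY LIST (source_id) (
        PARTITION pSource1 VALUES IN (1),
        PARTITION pSource2 VALUES IN (2)
        -- ... one partition per source, up to 50
    );

    -- Reload one source quickly:
    ALTER TABLE records TRUNCATE PARTITION pSource2;
    LOAD DATA INFILE '/path/to/source2.csv'
        INTO TABLE records PARTITION (pSource2)
        FIELDS TERMINATED BY ','
        (id, source_id, title, author, release_date);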

One database or multiple databases for statistical architecture

I currently already have a website running using CodeIgniter and MySQL. The MySQL database has around 110 tables and contains mainly website-specific data, like user data, vacancy data, etc.
Now I want to extend this website to include a complete statistical module as well. We would capture a lot of user actions and other aggregations from the data gathered on our own website, and would also pull in some data from the Google Analytics API to use in our statistics (we will generate a report in Excel, but also show statistical graphs and numbers on a page using chart.js).
We are not planning (in the foreseeable future) to use this data in other programs, but we need to be able to open some data to the public using an API.
We expect to start with about 300,000-350,000 data points gathered per day, but this amount will keep growing every day, of course, the more users we get.
Using multiple databases in CodeIgniter seems to not be an issue, so the main problem I am left with is how I should create the architecture for this statistical module.
I have a couple of ideas on how to start doing this, but I am not aware whether there is a performance impact of one solution versus another, or of other things to take into consideration.
My main idea boils down to having a table containing all "events": we would just insert into that table every time an action is performed, e.g. "user is registered", "user put account on private", "user clicked on X", ...
Then once a day (probably at around midnight), a CRON job would run over that table for the past day and aggregate all the values into a format usable for our statistical metrics. Those aggregated values would be stored in a new table. This way we can clean up the "event" table quite regularly since that will become very big very fast.
Idea 1: Extend the current MySQL database architecture with new tables to incorporate the statistics. I would keep on using the current database architecture and add 2 new tables for the events and the aggregated values.
Idea 2: Create a new database, separate from the current existing one, and use this to insert all the events in a table there and the aggregated values in a new table there.
Note: we already have quite a few CRONs running on our current database, updating statuses and dates, sending emails, ...
Note 2: sync issues between databases are not a problem, since we will never be storing statistics on a per-user level.
MySQL does not care whether tables are in the same database or separate databases. It is just a convenience for the user. Some things:
You might need db1.tbla JOIN db2.tblb to talk across dbs.
It is convenient to have different GRANTs for different databases, but clumsy to have different GRANTs for 110 tables.
I can't think of any performance differences.
Nightly aggregation is a middle-of-the-road approach. Using IODKU (INSERT ... ON DUPLICATE KEY UPDATE) gives you 'immediate' aggregation, but is probably more of a burden on the system; see the sketch below.
My blog on Summary Tables.
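As a sketch of the IODKU approach, with hypothetical table and column names, a counter row is upserted at the same time each raw event is inserted:

    CREATE TABLE daily_stats (
        stat_date  DATE         NOT NULL,
        event_type VARCHAR(40)  NOT NULL,
        cnt        INT UNSIGNED NOT NULL,
        PRIMARY KEY (stat_date, event_type)
    );

    INSERT INTO daily_stats (stat_date, event_type, cnt)
    VALUES (CURDATE(), 'user_registered', 1)
    ON DUPLICATE KEY UPDATE cnt = cnt + 1;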
350K rows inserted per day is only about 4 per second, which is comfortably low, so I don't think we need to discuss performance issues there.
"Summarize and toss" (for events) -- Yes. I like that approach. (Most people fail to think of this option.)
Do the math. Which table is the largest after a year? How many GB will it be? Then think about whether you can shrink any of the columns in it: SMALLINT instead of INT, normalization of long, oft-repeated, strings, etc.
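As a rough worked example (the per-row size is my assumption, not from the question): 350K events/day for a year is about 350,000 x 365, roughly 128M rows; at, say, 50-100 bytes per row that is very roughly 6-13 GB before indexes -- manageable, but worth trimming with smaller datatypes and by summarizing-and-tossing the raw events.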

Store large amounts of sensor data in SQL, optimize for query performance

I need to store sensor data from various locations (different factories with different rooms, each with different sensors). Data is downloaded at regular intervals from a device on site at each factory that collects the data transmitted from all sensors.
The sensor data looks like this:
collecting_device_id, sensor_id, type, value, unit, timestamp
Type could be temperature, unit could be degrees_celsius. collecting_device_id will identify the factory.
There are quite a lot of different things (==types) being measured.
I will collect around 500 million to 750 million rows and then perform analyses on them.
Here's the question for storing the data in a SQL database (let's say MySQL InnoDB on AWS RDS, large machine if necessary):
When considering query performance for future queries, is it better to store this data in one huge table just like it comes from the sensors? Or to distribute it across tables (tables for factories, temperatures, humidities, …, everything normalized)? Or to have a wide table with different fields for the data points?
Yes, I know, it's hard to say "better" without knowing the queries. Here's more info and a few things I have thought about:
There's no constant data stream as data is uploaded in chunks every 2 days (a lot of writes when uploading, the rest of the time no writes at all), so I would guess that index maintenance won't be a huge issue.
I will try to reduce the amount of data being inserted upfront (data that can easily be replicated later on, data that does not add additional information, …)
Queries that should be performed are not defined yet (I know, designing the query makes a big difference in terms of performance). It's exploratory work (so we don't know ahead what will be asked and cannot easily pre-compute values), so one time you want to compare data points of one type in a time range to data points of another type, the other time you might want to compare rooms in factories, calculate correlations, find duplicates, etc.
If I had multiple tables and normalized everything, the queries would need a lot of joins (which would probably make everything quite slow).
Queries mostly need to be performed on the whole ~500 million row dataset, rarely on separately downloaded subsets.
There will be very few users (<10), most of them will execute these "complex" queries.
Is a SQL database a good choice at all? Would there be a big difference in terms of performance for this use case to use a NoSQL system?
In this setup with this amount of data, will I have queries that never "come back"? (considering the query is not too stupid :-))
Don't pre-optimize. If you don't know the queries, then you don't know the queries. It is too easy to make choices now that will slow down some subset of queries. When you know how the data will be queried you can optimize then -- it is easy to normalize after the fact (pull out temperature data into a related table, for example). For now I suggest you put it all in one table.
You might consider partitioning the data by date or if you have another way that might be useful (recording device maybe?). Often data of this size is partitioned if you have the resources.
After you think about the queries, you will possibly realize that you don't really need all the datapoints. Instead, max/min/avg/etc for, say, 10-minute intervals may be sufficient. And you may want to "alarm" on "over-temp" values. This should not involve the database, but should involve the program receiving the sensor data.
So, I recommend not storing all the data; instead only store summarized data. This will greatly shrink the disk requirements. (You could store the 'raw' data to a plain file in case you are worried about losing it. It will be adequately easy to reprocess the raw file if you need to.)
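A minimal sketch of such a summary table, with hypothetical names, keeping 10-minute min/max/sum/count (plus a sum-of-squares, used further below) per sensor instead of every raw reading:

    CREATE TABLE sensor_summary_10min (
        sensor_id  INT UNSIGNED NOT NULL,
        slot_start DATETIME     NOT NULL,    -- start of the 10-minute interval
        min_value  FLOAT        NOT NULL,
        max_value  FLOAT        NOT NULL,
        sum_value  DOUBLE       NOT NULL,
        sum_sq     DOUBLE       NOT NULL,    -- sum of squares, for stddev later
        cnt        INT UNSIGNED NOT NULL,
        PRIMARY KEY (sensor_id, slot_start)
    );

    -- Populate from a staging table of freshly uploaded raw rows
    -- (assumes each upload covers new time slots).
    INSERT INTO sensor_summary_10min
           (sensor_id, slot_start, min_value, max_value, sum_value, sum_sq, cnt)
    SELECT sensor_id,
           FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(`timestamp`) / 600) * 600),
           MIN(value), MAX(value), SUM(value), SUM(value * value), COUNT(*)
    FROM   staging_raw
    GROUP  BY sensor_id, FLOOR(UNIX_TIMESTAMP(`timestamp`) / 600);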
If you do decide to store all the data in table(s), then I recommend these tips:
High speed ingestion (includes tips on Normalization)
Summary Tables
Data Warehousing
Time series partitioning (if you plan to delete 'old' data; partitioning is painful to add later)
750M rows -- per day? Per decade? If it is per month, that is not too much of a challenge.
Since you receive a batch every other day, it is quite easy to load the batch into a temp table, do normalization, summarization, etc.; then store the results in the Summary table(s) and finally copy to the 'Fact' table (if you choose to keep the raw data in a table).
In reading my tips, you will notice that avg is not summarized; instead, sum and count are. If you also need the standard deviation, keep a sum-of-squares as well.
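The point of storing sum, count, and sum-of-squares is that they combine across any set of summary rows; continuing the hypothetical sensor_summary_10min sketch above:

    SELECT SUM(sum_value) / SUM(cnt)                       AS avg_value,
           SQRT(SUM(sum_sq) / SUM(cnt)
                - POW(SUM(sum_value) / SUM(cnt), 2))       AS stddev_value
    FROM   sensor_summary_10min
    WHERE  sensor_id  = 42
      AND  slot_start >= '2024-01-01'
      AND  slot_start <  '2024-02-01';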
If you fail to include all the Summary Tables you ultimately need, it is not too difficult to re-process the Fact table (or Fact files) to populate the new Summary Table. This is a one-time task. After that, the summarization of each chunk should keep the table up to date.
The Fact table should be Normalized (for space); the Summary tables should be somewhat denormalized (for performance). Exactly how much denormalization depends on size, speed, etc., and cannot be predicted at this level of discussion.
"Queries on 500M rows" -- Design the Summary tables so that all queries can be done against them, instead. A starting rule-of-thumb: Any Summary table should have one-tenth the number of rows as the Fact table.
Indexes... The Fact table should have only a primary key. (With secondary indexes, the first 100M rows will insert nicely; the last 100M will go in painfully slowly. This is a lesson you don't want to have to learn 11 months into the project; so do pre-optimize here.) The Summary tables should have whatever indexes make sense. This also makes querying a Summary table faster than the Fact table. (Note: having a secondary index on a 500M-row table is, itself, a non-trivial performance issue.)
NoSQL either forces you to re-invent SQL, or depends on brute-force full-table scans. Summary tables are the real solution. In one (albeit extreme) case, I sped up a 1-hour query to 2 seconds by using a Summary table. So, I vote for SQL, not NoSQL.
As for whether to "pre-optimize" -- I say it is a lot easier than rebuilding a 500M-row table. That brings up another issue: start with the minimal datasize for each field: look at MEDIUMINT (3 bytes), UNSIGNED (an extra bit of range), CHARACTER SET ascii (use utf8/utf8mb4 only for columns that need it), NOT NULL (NULL costs a bit), etc.
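A sketch of what such 'minimal datasize' choices might look like for the sensor rows, with hypothetical names (type and unit pushed out to tiny lookup tables):

    CREATE TABLE readings (
        sensor_id MEDIUMINT UNSIGNED NOT NULL,   -- 3 bytes, up to ~16 million sensors
        type_id   TINYINT UNSIGNED   NOT NULL,   -- 1 byte, lookup table for 'type'
        unit_id   TINYINT UNSIGNED   NOT NULL,   -- 1 byte, lookup table for 'unit'
        ts        TIMESTAMP          NOT NULL,   -- 4 bytes
        value     FLOAT              NOT NULL,   -- 4 bytes
        PRIMARY KEY (sensor_id, ts)
    ) ENGINE=InnoDB;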
Sure, it is possible to have 'queries that never come back'. This one 'never comes back', even with only 100 rows in a: SELECT * FROM a AS a1 JOIN a AS a2 JOIN a AS a3 JOIN a AS a4 JOIN a AS a5. The resultset has 10 billion rows.

Best database design for storing a high number columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. Those all need a column. We need to be able to filter and sort on all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job, but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns, since that is the most natural way to do things.
Options: My other options are a meta table that stores keys and values, or using a document-based database so I have access to a variable schema and can scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows, and I need storage that can select, sort and filter really fast.
Update:
Some more information about usage: XML feeds will be generated live from this table. We are talking about 100-500 requests per hour, but this will grow. The fields will not change regularly, but it could happen once every 6 months. We will also be updating the datafeeds daily, checking whether items have been updated, deleting old ones and adding new ones.
120 columns at 100k rows is not enough information; that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a key-value pair table. They are not great in a relational database, so stick to properly typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say 'we will be updating the datafeeds daily', so there shouldn't be any need to query the database every time. Even 1000 requests per hour is only about 17 per minute. That probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with, so it is necessary to analyze it in advance and pick the best table/column layout. If all of your 120 columns contain textual data, then a single row will take several kilobytes of disk space. In such a situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be: will each request extract a single row, a small bunch of rows, or a big portion (up to the whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
set up a dictionary table, giving it an integer PK;
replace the actual value in the master table's column with the PK from the dictionary (see the sketch below).
The transformation is done by triggers written in C, so although it gives me an upload penalty, I do get some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
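As a sketch of that dictionary transformation (the names dict_country, staging, country, and country_id are hypothetical; the description above does it with C triggers, but plain SQL during the load works the same way):

    CREATE TABLE dict_country (
        country_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        country    VARCHAR(100)      NOT NULL,
        UNIQUE KEY (country)
    );

    -- Add any new values, then replace the text with the dictionary PK
    -- (assumes staging has both a country and a country_id column).
    INSERT IGNORE INTO dict_country (country)
        SELECT DISTINCT country FROM staging;

    UPDATE staging s
    JOIN   dict_country d ON d.country = s.country
    SET    s.country_id = d.country_id;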
Third, try to split the data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are used by practically all queries, while the remaining 60-70% are each used only by some of them. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another one for the rest of the fields. In fact, you can have several such secondary tables, logically grouping the data.
In my practice we had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive, as it was used in the majority of our reports (reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy), we managed to gain a pleasant performance boost (sorry, I no longer have the numbers).
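In MySQL terms, a sketch of such a vertical split with a view on top, using hypothetical names; hot columns stay in a narrow master table, rarely-used bulky columns move next door:

    CREATE TABLE customer_core (
        customer_id INT UNSIGNED     NOT NULL PRIMARY KEY,
        name        VARCHAR(100)     NOT NULL,
        status      TINYINT UNSIGNED NOT NULL
    );

    CREATE TABLE customer_detail (
        customer_id     INT UNSIGNED NOT NULL PRIMARY KEY,
        banking_details TEXT,
        billing_details TEXT,
        custom_comments TEXT
    );

    CREATE OR REPLACE VIEW customer AS
        SELECT c.customer_id, c.name, c.status,
               d.banking_details, d.billing_details, d.custom_comments
        FROM   customer_core c
        LEFT JOIN customer_detail d USING (customer_id);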
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.

handling/compressing large datasets in multiple tables

In an application at our company we collect statistical data from our servers (load, disk usage and so on). Since there is a huge amount of data and we don't need all of it at all times, we have had a "compression" routine that takes the raw data, calculates min, max and average for a number of data points, stores these new values in the same table and removes the old ones after some weeks.
Now I'm tasked with rewriting this compression routine, and the new routine must keep all uncompressed data we have for one year in one table and the "compressed" data in another table. My main concerns now are how to handle the data that is continuously written to the database, and whether or not to use a "transaction table" (my own term, since I can't come up with a better one; I'm not talking about commit/rollback transaction functionality).
As of now our data collectors insert all information into a table named ovak_result, and the compressed data will end up in ovak_resultcompressed. But are there any specific benefits or drawbacks to creating a table called ovak_resultuncompressed and just using ovak_result as "temporary storage"? ovak_result would be kept minimal, which would be good for the compression routine, but I would need to shuffle all data from one table into another continually, and there would be constant reading, writing and deleting in ovak_result.
Are there any mechanisms in MySQL to handle these kind of things?
(Please note: we are talking about quite large datasets here -- about 100M rows in the uncompressed table and about 1-10M rows in the compressed table. Also, I can do pretty much what I want with both software and hardware configurations, so if you have any hints or ideas involving MySQL configuration or hardware setup, just bring them on.)
Try reading about the ARCHIVE storage engine.
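A sketch of what that could look like (the column names here are hypothetical, since the schema of ovak_result isn't shown); ARCHIVE stores rows compressed, supports INSERT and SELECT, but no UPDATE or DELETE and only an index on an AUTO_INCREMENT column:

    CREATE TABLE ovak_result_archive (
        server_id INT UNSIGNED NOT NULL,
        ts        DATETIME     NOT NULL,
        metric    VARCHAR(40)  NOT NULL,
        value     DOUBLE       NOT NULL
    ) ENGINE=ARCHIVE;

    -- Move rows older than a few weeks out of the hot table.
    INSERT INTO ovak_result_archive
        SELECT server_id, ts, metric, value
        FROM   ovak_result
        WHERE  ts < NOW() - INTERVAL 4 WEEK;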
Re your clarification. Okay, I didn't get what you meant from your description. Reading more carefully, I see you did mention min, max, and average.
So what you want is a materialized view that updates aggregate calculations for a large dataset. Some RDBMS brands such as Oracle have this feature, but MySQL doesn't.
One experimental product that tries to solve this is called FlexViews (http://code.google.com/p/flexviews/). This is an open-source companion tool for MySQL. You define a query as a view against your raw dataset, and FlexViews continually monitors the MySQL binary logs, and when it sees relevant changes, it updates just the rows in the view that need to be updated.
It's pretty effective, but it has a few limitations in the types of queries you can use as your view, and it's also implemented in PHP code, so it's not fast enough to keep up if you have really high traffic updating your base table.