Best practice to archive less frequent accessed data in MySQL - mysql

I have an e-commerce application running on MySQL server from last 10 years. There are particularly two tables:
orders (Apporx 30 million rows and 3 GB data)
order_item (Approx 45 million rows and 3.5 GB data)
These tables contain all orders and their respective products purchased in last 10 years. Generally last 6 months data is most frequently accessed and all previous data is either used in multiple reports by business/marketing teams or if a customer checks his order history
As this table constantly growing in size we are running into increased query time for both read and write. Therefore, I was wondering to archive some data from these tables to improve performance. Main concern is that archived data should still be available for reads by reports and customers.
What would be a right archiving strategy for this?

PARTITIONing is unlikely to provide any performance benefit.
Please provide SHOW CREATE TABLE and some of the "slow" queries so I can analyze the access patterns, clustered index choice, the potential for Partitioning, etc.
Sometimes a change to the PK (which is "clustered" with the data) can greatly improve the "locality of reference", which provides performance benefits. A user going back in time may also blow out the cache (the "buffer pool). Changing the PK to start with user_id may significantly improve on that issue.
Another possibility is to use replication to segregate "read-only" queries that go back in time from the "active" records (read/write in the last few months). This can benefit by decreasing the I/O by having customers on the Primary while relegating the "reports" to a Replica.
How much RAM? When the data gets "big", running a "report" may blow out the cache, making everything be slower. Summary table(s) is a very good cure for such.
The optimal indexes for a Partitioned table are necessarily different than for the equivalent non-partitioned table.

Related

Limit before sharding or partitioning a table

I am new to the database system design. After reading many articles, I am really getting confused on what is the limit till which we should have 1 table and not go for sharding or partitioning. I know that it is really hard to provide generic answer and things depend on factors like
size of row
kind of data (strings, blobs, etc)
active queries number
what kind of queries
indexes
read heavy/write heavy
the latency expected
But when someone ask that
what will you do if you have 1 billion data and million rows getting added everyday. The latency needs to be less than 5 ms for 4 read, 1 write and 2 update queries over such a big database, etc.
what will your choice if you have only 10 million rows but the updates and reads are high. The number of new rows added are not significant. High consistency and low latency are the requirement.
If the rows are less that a million and the row size is increasing by thousands then the choice is simple. But it gets trickier when the choice involves for million or billion of rows.
Note: I have not mentioned the latency number in my question. Please
answer according to the latency number which is acceptable to you. Also, we are talking about structured data.
I am not sure but I can add 3 specific questions:
Lets say that you choose sql database for amazon or any ecommerce order management system. The order numbers are increasing everyday by million. There are already 1 billion record. Now, assuming that there is no archival of data. There are high read queries more than thousand queries per second. And there are writes as well. The read:write ratio is 100:1
Let's take an example which smaller number now. Lets say that you choose a sql database for abc or any ecommerce order management system. The order numbers are increasing everyday by thousands. There are already 10 million record. Now, assuming that there is no archival of data. There are high read queries more than ten thousand queries per second. And there are writes as well. The read:write ratio is 10:1
3rd example: Free goodies distribution. We have 10 million goodies to be distributed. 1 goodies per user. High consistency and low latency is the aim. Lets assume that 20 million users already waiting for this free distribution and once the time starts, all of them will try to get the free goodies.
Note: In the whole question, the assumption is that we will go with
SQL solutions. Also, please neglect if the provided usecase doesn't make sense logically. The aim is to get the knowledge in terms of numbers.
Can someone please help with what are the benchmarks. Any practical numbers from the project you are currently working in which can tell that for such a big database with these many queries, this is the latency observed,. Anything which can help me justify the choice for the number of tables for the certain number of queries for particular latency.
Some answers for MySQL. Since all databases are limited by disk space, network latency, etc., other engines may be similar.
A "point query" (fetching one row using a suitable index) takes milliseconds regardless of the number of rows.
It is possible to write a SELECT that will take hours, maybe even days, to run. So you need to understand whether the queries are pathological like this. (I assume this is an example of high "latency".)
"Sharding" is needed when you cannot sustain the number of writes needed on a single server.
Heavy reads can be scaled 'infinitely' by using replication and sending the reads to Replicas.
PARTITIONing (especially in MySQL) has very few uses. More details: Partition
INDEXes are very important for performance.
For Data Warehouse apps, building and maintaining "Summary tables" is vital for performance at scale. (Some other engines have some built-in tools for such.)
INSERTing one million rows per day is not a problem. (Of course, there are schema designs that could make this a problem.) Rules of Thumb: 100/second is probably not a problem; 1000/sec is probably possible; it gets harder after that. More on high speed ingestion
Network latency is mostly determined by how close the client and server are. It takes over 200ms to reach the other side of the earth. On the other hand, if the client and server are in the same building, latency is under 1ms. On another hand, if you are referring to how long it takes too run a query, then here are a couple of Rules of Thumb: 10ms for a simple query that needs to hit an HDD disk; 1ms for SSD.
UUIDs and hashes are very bad for performance if the data is too big to be cached in RAM.
I have not said anything about read:write ratio because I prefer to judge reads and writes independently.
"Ten thousand reads per second" is hard to achieve; I suggest that very few apps really need such. Or they can find better ways to achieve the same goals. How fast can one user issue a query? Maybe one per second? How many users can be connected and active at the same time? Hundreds.
(my opinion) Most benchmarks are useless. Some benchmarks can show that one system is twice as fast as another. So what? Some benchmarks say that when you have more than a few hundred active connections, throughput stagnates and latency heads toward infinity. So what. After you have an app running for some time, capturing the actual queries is perhaps the best benchmark. But it still has limited uses.
Almost always a single table is better than splitting up the table (multiple tables; PARTITIONing; sharding). If you have a concrete example, we can discuss the pros and cons of the table design.
Size of row and kinds of data -- Large columns (TEXT/BLOB/JSON) are stored "off-record", thereby leading to [potentially] an extra disk hit. Disk hits are the most costly part of any query.
Active queries -- After a few dozen, the queries stumble over each other. (Think about a grocery store with lots of shoppers pushing carts -- with "too many" shoppers, each takes a long time to finish.)
When you get into large databases, they fall into a few different types; each with somewhat different characteristics.
Data Warehouse (sensors, logs, etc) -- appending to 'end' of the table; Summary Tables for efficient 'reports'; huge "Fact" table (optionally archived in chunks); certain "dimension tables".
Search (products, web pages, etc) -- EAV is problematical; FULLTEXT is often useful.
Banking, order processing -- This gets heavy into the ACID features and the need for crafting transactions.
Media (images and videos) -- How to store the bulky objects while making searching (etc) reasonably fast.
'Find nearest' -- Need a 2D index, either SPATIAL or some of the techniques here

How to improve the performance of table scans with innodb

Brief: Is there any way to improve the performance of table scans on InnoDB tables?
Please, do not suggest adding indexes to avoid table scans. (see below)
innodb_buffer_pool_size sits at 75% of server memory (48 GB/64GB)
I'm using the latest version of Percona (5.7.19) if that changes anything
Longer: We have 600Gb of recent time series data (we aggregate and delete older data) spread over 50-60 tables. So most of it is "active" data that is regularly queried. These tables are somewhat large (400+ numeric columns) and many queries run against a number of those columns (alarming) which is why it is impractical to add indexes (as we would have to add a few dozen). The largest tables are partitioned per day.
I am fully aware that this is an application/table design problem and not a "server tuning" problem. We are currently working to significantly change the way these tables are designed and queried, but have to maintain the existing system until this happens so I'm looking for a way to improve things a bit to buy us a little time.
We recently split this system and have moved a part of it to a new server. It previously used MyISAM, and we tried moving to TokuDB which seemed appropriate but ran into some weird problems. We switched to InnoDB but performance is really bad. I get the impression that MyISAM is better with table scans which is why, barring any better option, we'll go back to it until the new system is in place.
Update
All tables have pretty much the same structure:
-timestamp
-primary key (varchar(20) field)
-about 15 fields of various types representing other secondary attributes that can be filtered upon (along with an appropriately indexed criteria first)
-And then about a few hundred measures (floats), between 200-400.
I already trimmed the row length as much as I could without changing the structure itself. The primary key used to be a varchar(100), all measures used to be doubles, many of the secondary attributes had their data types changed.
Upgrading hardware is not really an option.
Creating small tables with just the set of columns I need would help some processes perform faster. But at the cost of creating that table with a table scan first and duplicating data. Maybe if I created it as a memory table. By my estimate, it would take a couple of GB away from the buffer pool. Also there are aggregation processes that read about as much data from the main tables on a regular basis, and they need all columns.
There is unfortunately a lot of duplication of effort in those queries which I plan to address in the next version. The alarming and aggregation processes basically reprocess the entire day's worth of data every time some rows inserted (every half hour) instead of just dealing with new/changed data.
Like I said, the bigger tables are partitioned, so it's usually a scan over a daily partition rather than the entire table, which is a small consolation.
Implementing a system to hold this in memory outside of the DB could work, but that would entail a lot of changes on the legacy system and development work. Might as well spend that time on the better design.
The fact that InnoDB table are so much bigger for the same data as MyISAM (2-3x as big in my case) really hinders the performance.
MyISAM is a little bit better at table-scans, because it stores data more compactly than InnoDB. If your queries are I/O-bound, scanning through less data on disk is faster. But this is a pretty weak solution.
You might try using InnoDB compression to reduce the size of data. That might get you closer to MyISAM size, but you're still I/O-bound so it's going to suck.
Ultimately, it sounds like you need a database that is designed for an OLAP workload, like a data warehouse. InnoDB and TokuDB are both designed for OLTP workload.
It smells like a Data Warehouse with "Reports". By judicious picking of what to aggregate (selected of your Floats) over what time period (hour or day is typical), you can build and maintain Summary Tables that work much more efficiently for the Reports. This has the effect of scanning the data only once (to build the Summaries), not repeatedly. The Summary tables are much smaller, so the reports are much faster -- 10x is perhaps typical.
It may also be possible to augment the Summary tables as the raw data is being Inserted. (See INSERT .. ON DUPLICATE KEY UPDATE ..)
And use Partitioning by date to allow for efficient DROP PARTITION instead of DELETE. Don't have more than about 50 partitions.
Summary Tables
Time series Partitioning
If you would like to discuss in more detail, let's start with one of the queries that is scanning so much now.
In the various projects I have worked on, there were between 2 and 7 Summary tables.
With 600GB of data, you may be pushing the limits on 'ingestion'. If so, we can discuss that, too.

MySQL: Huge tables that need to be joined, how should be splitted for optimization?

The case:
I have been developing a web application in which I storage data from different automated data sources. Currently I am using MySQL as DBMS and PHP as programming language on a shared LAMP server.
I use several tables to identify the data sources and two tables for the data updates. Data sources are in a three level hierarchy, and updates are timestamped.
One table contains the two upper levels of the hierarchy (geographic location and instrument), plus the time-stamp and an “update ID”. The other table contains the update ID, the third level of the hierarchy (meter) and the value.
Most queries involve a joint statement between this to tables.
Currently the first table contains near 2.5 million records (290 MB) and the second table has over 15 million records (1.1 GB), each hour near 500 records are added to the first table and 3,000 to the second one, and I expect this numbers to increase. I don't think these numbers are too big, but I've been experiencing some performance drawbacks.
Most queries involve looking for immediate past activity (per site, per group of sites, and per instrument) which are no problem, but some involve summaries of daily, weekly and monthly activity (per site and per instrument). The page takes several seconds to load, sometimes surpassing the server's timeout (30s).
It also seems that the automatic updates are suffering from these timeouts, causing the connection to fail.
The question:
Is there any rational way to split these tables so that queries perform more quickly?
Or should I attempt other types of optimizations not involving splitting tables?
(I think the tables are properly indexed, and I know that a possible answer is to move to a dedicated server, probably running something else than MySQL, but just yet I cannot make this move and any optimization will help this scenario.)
If the queries that are slow are the historical summary queries, then you might want to consider a Data Warehouse. As long as your history data is relatively static, there isn't usually much risk to pre-calculating transactional summary data.
Data warehousing and designing schemas for Business Intelligence (BI) reporting is a very broad topic. You should read up on it and ask any specific BI design questions you may have.

MySQL Table Locks

I was asked to do some PHP scripts on MySQL DB to show some data when I noticed the strange design they had.
They want to perform a study that would require collecting up to 2000 records per user and they are automatically creating a new table for each user that registers. It's a pilot study at this stage so they have around 30 tables but they should have 3000 users for the real study.
I wanted to suggest gathering all of them in a single table but since there might be around 1500 INSERTs per minute to that database during the study period, I wanted to ask this question here first. Will that cause table locks in MySQL?
So, Is it one table with 1500 INSERTs per minute and a maximum size of 6,000,000 records or 3000 tables with 30 INSERTs per minute and a maximum size of 2000 records. I would like to suggest the first option but I want to be sure that it will not cause any issues.
I read that InnoDB has row-level locks. So, will that have a better performance combined with the one table option?
This is a huge loaded question. In my experience performance is not really measured accurately by table size alone. It comes down to design. Do you have the primary keys and indexes in place? Is it over indexed? That being said, I have also found that almost always one trip to the DB is faster than dozens. How big is the single table (columns)? What kind of data are you saving (larger than 4000K?). It might be that you need to create some prototypes to see what performs best for you. The most I can recommend is that you carefully judge the size of the data you are collecting and allocate accordingly, create indexes (but not too many, don't over index), and test.

What is the cost of indexing multiple db columns?

I'm writing an app with a MySQL table that indexes 3 columns. I'm concerned that after the table reaches a significant amount of records, the time to save a new record will be slow. Please inform how best to approach the indexing of columns.
UPDATE
I am indexing a point_value, the
user_id, and an event_id, all required
for client-facing purposes. For an
instance such as scoring baseball runs
by player id and game id. What would
be the cost of inserting about 200 new
records a day, after the table holds
records for two seasons, say 72,000
runs, and after 5 seasons, maybe a
quarter million records? Only for
illustration, but I'm expecting to
insert between 25 and 200 records a
day.
Index what seems the most logical (that should hopefully be obvious, for example, a customer ID column in the CUSTOMERS table).
Then run your application and collect statistics periodically to see how the database is performing. RUNSTATS on DB2 is one example, I would hope MySQL has a similar tool.
When you find some oft-run queries doing full table scans (or taking too long for other reasons), then, and only then, should you add more indexes. It does little good to optimise a once-a-month-run-at-midnight query so it can finish at 12:05 instead of 12:07. However, it's a huge improvement to reduce a customer-facing query from 5 seconds down to 2 seconds (that's still too slow, customer-facing queries should be sub-second if possible).
More indexes tend to slow down inserts and speed up queries. So it's always a balancing act. That's why you only add indexes in specific response to a problem. Anything else is premature optimization and should be avoided.
In addition, revisit the indexes you already have periodically to see if they're still needed. It may be that the queries that caused you to add those indexes are no longer run often enough to warrant it.
To be honest, I don't believe indexing three columns on a table will cause you to suffer unless you plan on storing really huge numbers of rows :-) - indexing is pretty efficient.
After your edit which states:
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
My response is that 200 records a day is an extremely small value for a database, you definitely won't have anything to worry about with those three indexes.
Just this week, I imported a days worth of transactions into one of our database tables at work and it contained 2.1 million records (we get at least one transaction per second across the entire day from 25 separate machines). And it has four separate composite keys which is somewhat more intensive than your three individual keys.
Now granted, that's on a DB2 database but I can't imagine IBM are so much better than the MySQL people that MySQL can only handle less than 0.01% of the DB2 load.
I made some simple tests using my real project and real MySql database.
My results are: adding average index (1-3 columns in an index) to a table - makes inserts slower by 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
So is it ok to add many indexes? - It depends :) I gave you my results - You decide!
Nothing for select queries, though updates and especially inserts will be order of magnitudes slower - which you won't really notice before you start inserting a LOT of rows at the same time...
In fact at a previous employer (single user, desktop system) we actually DROPPED indexes before starting our "import routine" - which would first delete all records before inserting a huge number of records into the same table...
Then when we were finished with the insertion job we would re-create the indexes...
We would save 90% of the time for this operation by dropping the indexes before starting the operation and re-creating the indexes afterwards...
This was a Sybase database, but the same numbers apply for any database...
So be careful with indexes, they're FAR from "free"...
Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
With that kind of insertion rate, the cost of indexing an extra column will be negligible.
Without some more details about expected usage of the data in your table worrying about indexes slowing you down smells a lot like premature optimization that should be avoided.
If you are really concerned about it, then setup a test database and simulate performance in the worst case scenarios. A test proving that is or is not a problem will probably be much more useful then trying to guess and worry about what may happen. If there is a problem you will be able to use your test setup to try different methods to fix the issue.
The index is there to speed retrieval of data, so the question should be "What data do I need to access quickly?". Without the index, some queries will do a full table scan (go through every row in the table) in order to find the data that you want. With a significant amount of records this will be a slow and expensive operation. If it is for a report that you run once a month then maybe thats okay; if it is for frequently accessed data then you will need the index to give your users a better experience.
If you find the speed of the insert operations are slow because of the index then this is a problem you can solve at the hardware level by throwing more CPUs, RAM and better hard drive technology at the problem.
What Pax said.
For the dimensions you describe, the only significant concern I can imagine is "What is the cost of failing to index multiple db columns?"