MySQL performance with a large number of records - partitioning?

I am trying to build a database that will contain a large number of records, each with a lot of columns (fields) - maybe around 200-300 fields total across all tables. Let's say that in a few years I will have about 40,000,000 to 60,000,000 records.
I plan to normalize the database, so I will have a lot of tables (about 30-40) and therefore lots of joins for queries.
The database will relate strictly to the US, meaning queries will be scoped to the 50 states alone (a query will only search/insert/etc. within a single state, never across multiple states).
What can I do to have better performance?
Someone came up with the idea of having a separate table structure per state, meaning I would have 50 copies of each of the 30-40 tables (1,500-2,000 tables in total)! Should I even consider this type of approach?
The next idea was to use partitioning based on the 50 US states. What about this approach?
Any other way?

The best optimization is determined by the queries you run, not by your tables' structure.
If you want to use partitioning, it can be a great optimization, provided the partitioning scheme supports the queries you need to optimize. For instance, you could partition by US state, and that would help queries against data for a specific state. MySQL supports "partition pruning", so the query would run against only the specific partition -- but only if your query mentions a specific value for the column you used as the partition key.
You can always check whether partition pruning is effective by using EXPLAIN PARTITIONS:
EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE state = 'NY';
That should report that the query uses a single partition.
Whereas if you need to run queries by date, for example, then partitioning by state wouldn't help; MySQL would have to repeat the query against all 50 partitions.
EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE date > '2013-05-01';
That would list all partitions. There's a bit of overhead to querying all partitions, so if this is your typical query, you should probably use range partitioning by date instead.
So choose your partition key with the queries in mind.
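For illustration, a state-partitioned table might be declared like the minimal sketch below (column names are hypothetical; LIST COLUMNS partitioning needs MySQL 5.5 or later). Note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key:
CREATE TABLE MyTable (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  state CHAR(2) NOT NULL,
  created DATE NOT NULL,
  PRIMARY KEY (id, state)
)
PARTITION BY LIST COLUMNS (state) (
  PARTITION pNY VALUES IN ('NY'),
  PARTITION pCA VALUES IN ('CA'),
  PARTITION pTX VALUES IN ('TX')
  -- ... one partition per state, 50 in all
);
With this layout, the WHERE state = 'NY' query above would prune down to the single pNY partition.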
Any other optimization technique follows a similar pattern -- it helps some queries, possibly to the disadvantage of other queries. So be sure you know which queries you need to optimize for, before you decide on the optimization method.
Re your comment:
Certainly there are many databases that have 40 million rows or more, but have good performance. They use different methods, including (in no particular order):
Indexing
Partitioning
Caching
Tuning MySQL configuration variables
Archiving
Increasing hardware capacity (e.g. more RAM, solid state drives, RAID)
My point above is that you can't choose the best optimization method until you know the queries you need to optimize. Furthermore, the best choice may be different for different queries, and may even change over time as data or traffic grows. Optimization is a continual process, because you won't know where your bottlenecks are until you see how your data grows and what query traffic your database receives.

Related

Partitioning of MySQL tables doesn't seem to increase performance

I have a huge table with a company IDX (unique ID for each of my companies) as a Partition Key.
I have around 10,000 companies and each company might have up to 200,000 rows.
I have increased the number of partitions, but my query performance doesn't seem to have increased.
Shall I increase the number of partitions further? Perhaps up to one partition per company (for the companies having a lot of rows)?
What is the best architecture solution for me?
I've heard about indexing but not sure if it's relevant in my case.
Partitioning is going to be useful for improving performance and concurrency only if all SELECT, UPDATE and DELETE statements include the partition column (partitioning expression) in the WHERE clause (or an equivalent predicate in the ON clause of a join). Otherwise, if our queries are looking at all partitions, we aren't going to see a performance improvement.
To gain a performance benefit, we need the optimizer to make effective use of partition pruning in the execution plan: eliminating from consideration all of the partitions where it is known that matching rows will not be found.
Before launching into creating more partitions, we need to ensure that our queries (at least the ones we want performance from) are getting execution plans that actually use partition pruning.
We should first check the execution plans, to see what effect partitioning has. We use EXPLAIN for that. (In newer versions of MySQL, partitioning info is shown by default; in older releases, we used EXPLAIN PARTITIONS to get that info.)
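For example, with the table from the question (table and column names are hypothetical), we would want to verify that the plan touches only one partition:
EXPLAIN PARTITIONS
SELECT t.* FROM transactions t WHERE t.company_idx = 42;
If the partitions column of the output lists every partition rather than just one, pruning is not happening for that query.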
Not stated in the question is which type of partitioning is in use (RANGE, LIST, HASH, COLUMNS, KEY), what the partition expression is, or what the WHERE clauses look like. So we can't venture a guess as to whether we'd expect partition pruning.
INDEXES
For large sets, indexes are always relevant.
Which indexes we need to create and maintain, which indexes are most appropriate, really depends on the actual mix of queries that are being executed.
Partitioning is not a substitute for suitable indexes.
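As a hedged sketch (hypothetical names): if a frequent query filters by company and a date range, a composite index serves it directly, with or without partitioning:
-- Supports: SELECT ... WHERE company_idx = 42 AND created >= '2023-01-01'
ALTER TABLE transactions ADD INDEX ix_company_created (company_idx, created);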

How will partitioning affect my current queries in MySQL? When is it time to partition my tables?

I have a table that contains 1.5 million rows, has 39 columns, contains sales data of around 2 years, and grows every day.
I had no problems with it until we moved it to a new server; we probably have less memory now.
Queries are currently taking a very long time. Someone suggested partitioning the large table that is causing most of the performance issues but I have a few questions.
1) Is it wise to partition the table I described, and is it likely to improve its performance?
2) If I do partition it, will I have to make changes to my current INSERT or SELECT statements, or will they continue working the same way?
3) Does the partitioning take a long time to perform? I worry that with the slow performance, something would happen midway through and I would lose the data.
4) Should I be partitioning it by years or months? (We usually look at the numbers within the month, but sometimes we take weeks or years.) And should I also partition the columns? (We have some columns that we rarely or never use, but we might want to use them later.)
(I agree with Bill's answer; I will approach the Question in a different way.)
When is it time to partition my tables?
Probably never.
is it likely to improve its performance?
It is more likely to decrease performance a little.
I have a table that contains 1.5 million rows
Not big enough to bother with partitioning.
Queries are currently taking a very long time
Usually that is due to the lack of a good index, probably a 'composite' one. Second is the formulation of the query. Please show us a slow query, together with SHOW CREATE TABLE.
data of around 2 years, and grows every day
Will you eventually purge "old" data? If so, PARTITION BY RANGE(TO_DAYS(..)) is an excellent idea. However, it only helps during the purge, because DROP PARTITION is a lot faster than DELETE.
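A minimal sketch of that purge pattern (hypothetical names; remember the partition column must be part of every unique key on the table):
ALTER TABLE sales
PARTITION BY RANGE (TO_DAYS(sale_date)) (
  PARTITION p2013_04 VALUES LESS THAN (TO_DAYS('2013-05-01')),
  PARTITION p2013_05 VALUES LESS THAN (TO_DAYS('2013-06-01')),
  PARTITION pfuture VALUES LESS THAN MAXVALUE
);
-- Purging a month is then a near-instant metadata operation:
ALTER TABLE sales DROP PARTITION p2013_04;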
we probably have less memory now.
If you are mostly looking at "recent" data, then the size of memory (cf innodb_buffer_pool_size) may not matter. This is due to caching. However, it sounds like you are doing table scans, perhaps unnecessarily.
will I have to make changes to my current INSERT or SELECT
No. But you probably need to change what column(s) are in the PRIMARY KEY and secondary key(s).
Does the partitioning take a long time to perform?
Slow - yes, because it will copy the entire table over. Note: that means extra disk space, and the partitioned table will take more disk.
something would happen midway through and I would lose the data.
Do not worry. The new table is created, then a very quick RENAME TABLE swaps it into place.
Should I be partitioning it by years or months?
Rule of thumb: aim for about 50 partitions. With "2 years and growing", a likely choice is "monthly".
we usually look at the numbers within the month, but sometimes we take weeks or years
Smells like a typical "Data Warehouse" dataset? Build and incrementally augment a "Summary table" with daily stats. With that table, you can quickly get weekly/monthly/yearly stats -- possibly 10 times as fast. Ditto for any date range. This also significantly helps with "low memory".
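A hedged sketch of such a summary table (all names are hypothetical):
CREATE TABLE sales_daily (
  sale_date DATE NOT NULL,
  product_id INT NOT NULL,
  ct INT NOT NULL,
  total_amount DECIMAL(12,2) NOT NULL,
  PRIMARY KEY (sale_date, product_id)
);
-- Augment it incrementally, e.g. each night for the previous day:
INSERT INTO sales_daily
SELECT sale_date, product_id, COUNT(*), SUM(amount)
  FROM sales
  WHERE sale_date = CURRENT_DATE - INTERVAL 1 DAY
  GROUP BY sale_date, product_id;
Weekly/monthly/yearly numbers then come from scanning a few hundred summary rows instead of millions of raw rows.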
And should I also partition the columns? (We have some columns that we rarely or never use, but we might want to use them later)
You should 'never' use SELECT *; instead, specify the columns you actually need. "Vertical partitioning" is the term for your suggestion. It is sometimes practical. But we need to see SHOW CREATE TABLE with realistic column names to discuss further.
More on partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables
In most circumstances, you're better off using indexes instead of partitioning as your main method of query optimization.
The first thing you should learn about partitioning in MySQL is this rule:
All columns used in the partitioning expression for a partitioned table must be part of every unique key that the table may have.
Read more about this rule here: Partitioning Keys, Primary Keys, and Unique Keys.
This rule makes many tables ineligible for partitioning, because you might want to partition by a column that is not part of the primary or unique key in that table.
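A quick illustration of the rule (hypothetical table):
-- Rejected: the partitioning column `created` is not in the primary key.
CREATE TABLE t (
  id INT NOT NULL PRIMARY KEY,
  created DATE NOT NULL
)
PARTITION BY RANGE (TO_DAYS(created)) (
  PARTITION p0 VALUES LESS THAN MAXVALUE
);
-- Accepted: the partitioning column is folded into the primary key.
CREATE TABLE t (
  id INT NOT NULL,
  created DATE NOT NULL,
  PRIMARY KEY (id, created)
)
PARTITION BY RANGE (TO_DAYS(created)) (
  PARTITION p0 VALUES LESS THAN MAXVALUE
);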
The second thing to know is that partitioning only helps queries using conditions that unambiguously let the optimizer infer which partitions hold the data you're interested in. This is called Partition Pruning. If you run a query that could find data in any or all partitions, MySQL must search all the partitions, and you gain no performance benefit compared to having a regular non-partitioned table.
For example, if you partition by date, but then you run a query for data related to a specific user account, it would have to search all your partitions.
In fact, it might even be a little bit slower to use partitioned tables in such a query, because MySQL has to search each partition serially.
You asked how long it would take to partition the table. Converting to a partitioned table requires an ALTER TABLE to restructure the data, so it takes about the same time as any other alteration that copies the data to a new tablespace. This is proportional to the size of the table, but varies a lot depending on your server's performance. You'll just have to test it out, there's no way we can estimate how long it will take on your server.

Why does SQL choose join index inconsistently?

I have a join between two tables on three columns. The join was taking hours to complete, so I added a composite index on all three columns on each table. Then, sometimes the join would be really fast and sometimes it would still be slow.
Using EXPLAIN, I noticed that it was fast when it chose to join using the composite index and slow when it just chose an index on only one of the columns. But each of these runs was using the same data.
Is there randomness involved in SQL selecting which index to use? Why would it be inconsistent?
If it helps: it is a MySQL database being queried from pandas in Python.
Q: Is there randomness involved in SQL selecting which index to use?
No randomness is involved, per se. The optimizer makes use of table and index statistics (the number of rows and cardinality), along with predicates in the query, to develop estimates, e.g. the number of rows that will need to be retrieved.
MySQL also evaluates the cost for join operations, sort operations, etc. for each possible access plan (e.g. which index to use, which order to access the tables in) to come up with an estimated cost for each plan.
And then the optimizer compares the costs, and uses the plan that has the lowest cost. There are some parameters (MySQL system variables) that influence the cost estimates. (For example, tuning the expected cost for I/O operations.)
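You can inspect those estimates yourself with the JSON plan format (hypothetical table, column and index names here):
EXPLAIN FORMAT=JSON
SELECT t1.* FROM t1 JOIN t2 ON t2.a = t1.a AND t2.b = t1.b AND t2.c = t1.c;
In MySQL 5.7 the output includes cost_info sections showing the figures the optimizer compared.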
Q: Why would it be inconsistent?
For an InnoDB table, there is some randomness that comes into play when gathering statistics. InnoDB uses a sampling technique, doing a "deep dive" into a small set of "random" pages. The results from those sample pages are extrapolated into estimates for the whole table.
Some of the InnoDB tuning parameters (MySQL system variables) influence (increase/decrease) the number of pages that are sampled when gathering statistics. Sampling a smaller number of pages can be faster, but the smaller sample makes it more likely that the sample set is not entirely representative of the entire table. Using a larger number of sample pages alleviates that to a degree, but the sampling takes longer. It's a tradeoff.
Note that InnoDB automatically re-collects statistics when 10% of the rows in the table are changed by DML operations. (There are some cases where the automatic collection of statistics may not be triggered; for example, creating a new (empty) table and populating it with a LOAD DATA statement could result in no statistics being collected.)
So, the most likely explanation for the observed behavior is that at different times, there are different statistics available to the optimizer.
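If stale or unlucky statistics are suspected, re-collecting them manually is cheap (it only re-samples pages) and often restores the good plan:
ANALYZE TABLE t1, t2;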
Note that it is possible to influence the optimizer to opt for a plan that makes use of particular indexes, by including hints in the SQL text. We typically don't need to do that, nor do we want to. But in some cases, where the optimizer is choosing an inefficient plan, a hint can help it get to a better plan.
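As a hedged sketch (hypothetical table and index names), a hint pinning the join to the composite index would look like:
SELECT t1.*
FROM t1
JOIN t2 FORCE INDEX (ix_abc)
  ON t2.a = t1.a AND t2.b = t1.b AND t2.c = t1.c;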
A few references (from the MySQL 5.7 Reference Manual)
https://dev.mysql.com/doc/refman/5.7/en/optimizer-hints.html
https://dev.mysql.com/doc/refman/5.7/en/innodb-performance-optimizer-statistics.html

Is there any performance issue if I query multiple partitions at once in MySQL, compared to querying the same data without partitions?

I have a transactions table that is partitioned by client id (we currently have 4 clients, so 4 partitions). Now, if I query for client id IN (1,2), is there any performance issue compared to running the same query against the table without partitions?
I have heard that MySQL maintains a separate file for each partition, so querying a partitioned table needs to open multiple files internally, and the query will slow down. Is this correct?
PARTITION BY LIST? BY RANGE? BY HASH? other? It can make a big difference.
Use EXPLAIN PARTITIONS SELECT ... to see if it is doing any "pruning". If it is not, then partitioning is a performance drag for that query.
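For the query in the question, that check might look like this (column name is hypothetical); if pruning works for your partition type, the partitions column of the output should name only the two partitions holding clients 1 and 2:
EXPLAIN PARTITIONS
SELECT t.* FROM transactions t WHERE t.client_id IN (1, 2);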
In general, there are very few cases where partitioning provides any performance benefit. It sounds like your case will not benefit from partitioning. Think of it this way... First, it must decide which partition(s) to look in, then it will dig into the index to finish locating the row(s). Without partitioning, the first step is avoided, hence potentially faster.
If you grow to hundreds of "clients", hence hundreds of "partitions", then pruning becomes inefficient, since each partition is essentially a "table".
See http://mysql.rjweb.org/doc.php/partitionmaint for a list of the only 4 use cases that I have found for partitioning.

How to structure an extremely large table

This is more of a conceptual question. It's inspired by working with an extremely large table where even a simple query takes a long time (even though it is properly indexed). I was wondering if there is a better structure than just letting the table grow continually.
By large I mean 10,000,000+ records, growing every day by something like 10,000/day. A table like that would add 10,000,000 records roughly every 2.7 years. Let's say that the more recent records are accessed the most, but the older ones need to remain available.
I have two conceptual ideas to speed it up.
1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then, when querying (let's say the query is expected to pull only a few records from a three-year span), I could use a union to combine the three views and select from those.
2) The other option would be to create a separate table for every year and then, again, use a union to combine them when querying.
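For concreteness, idea 1 might look like this (all names are hypothetical):
CREATE VIEW data_2012 AS
  SELECT * FROM master_data
  WHERE created >= '2012-01-01' AND created < '2013-01-01';
-- ...one view per year, then combine the years a query spans:
SELECT * FROM data_2012 WHERE account_id = 42
UNION ALL
SELECT * FROM data_2013 WHERE account_id = 42;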
Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.
The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two).
There is one immediate benefit: the data is now split across multiple conceptual tables, so any query that includes the partition key can automatically ignore any partition the key would not be in.
From an RDBMS management perspective, having the data divided into separate partitions allows operations to be performed at the partition level: backup, restore, indexing, etc. This helps reduce downtime, as well as allowing for far faster archiving by simply removing an entire partition at a time.
There are also non-relational storage mechanisms such as NoSQL, MapReduce, etc., but ultimately how the data is used, loaded, and archived becomes a driving factor in the decision of which structure to use.
10 million rows is not that large on the scale of large systems; partitioned systems can and will hold billions of rows.
Your second idea looks like partitioning.
I don't know how well it works, but there is support for partitioning in MySQL -- see, in its manual: Chapter 17. Partitioning
There is a good scalability approach for these tables. A union is a valid way, but there is a better one.
If your database engine supports "semantic partitioning", then you can split one table into partitions. Each partition will cover some subrange (say, one partition per year). It will not affect anything in SQL syntax except DDL, and the engine will transparently run the hidden union logic and partitioned index scans with all the parallel hardware it has (CPU, I/O, storage).
For example, Sybase allows up to 255 partitions, as that is the limit of a union. But you will never need the keyword "union" in your queries.
Often the best plan is to have one table and then use database partitioning.
Or you can archive data and create a view of the archived and current data combined, keeping only the active data in the table that most functions reference. You will need a good archiving strategy, though (one which is automated), or you can lose data or fail to move things efficiently. This approach is typically more difficult to maintain.
What you're talking about is horizontal partitioning or sharding.