Partitioning of MySQL tables doesn't seem to increase performance - mysql

I have a huge table with a company IDX (unique ID for each of my companies) as a Partition Key.
I have around 10,000 companies and each company might have up to 200,000 rows.
I have increased the number of partitions, but my query performance doesn't seem to have increased.
Should I increase the number of partitions? Up to one partition per company (for the companies having a lot of rows)?
What is the best architecture solution for me?
I've heard about indexing but not sure if it's relevant in my case.

Partitioning is going to be useful for improving performance and concurrency if all SELECT, UPDATE and DELETE statements include the partition column (partitioning expression) in the WHERE clause (or equivalent predicate in the ON clause of a join). Otherwise, if our queries are looking at all partitions, then we aren't going to see a performance improvement.
To gain a performance benefit, we need the optimizer to make effective use of partition pruning in the execution plan, eliminating from consideration all of the partitions where it is known that matching rows will not be found.
Before launching into creating more partitions, we need to ensure that queries (at least the ones we want performance from) are getting execution plans that are getting partition pruning.
We should first look at the execution plans, to see what effect partitioning has. We use EXPLAIN for that. (In newer versions of MySQL, partitioning info is shown by default. In older releases of MySQL, we used EXPLAIN PARTITIONS to get that info.)
Not stated in the question is which type of partitioning is in use (RANGE, LIST, HASH, COLUMNS, KEY), what the partition expression is, or what the WHERE clauses look like. So we can't venture a guess as to whether we'd expect partition pruning.
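As a sketch of that check (the table, columns, and HASH scheme here are hypothetical, since the question doesn't say which partitioning type is in use):

```sql
-- Hypothetical table partitioned by HASH on the company ID
CREATE TABLE orders (
  company_idx INT NOT NULL,
  order_date  DATE NOT NULL,
  amount      DECIMAL(10,2),
  PRIMARY KEY (company_idx, order_date)
) PARTITION BY HASH (company_idx) PARTITIONS 16;

-- MySQL 5.7+ shows a "partitions" column in plain EXPLAIN;
-- on 5.6 and earlier, use EXPLAIN PARTITIONS instead.
EXPLAIN
SELECT * FROM orders WHERE company_idx = 42;
-- Pruning worked if the partitions column lists a single partition.

EXPLAIN
SELECT * FROM orders WHERE order_date = '2024-01-01';
-- No company_idx predicate: all 16 partitions are listed, so no pruning.
```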
INDEXES
For large sets, indexes are always relevant.
Which indexes we need to create and maintain, which indexes are most appropriate, really depends on the actual mix of queries that are being executed.
Partitioning is not a substitute for suitable indexes.
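For example (table and column names are illustrative), a composite secondary index serves a common lookup pattern whether or not the table is partitioned:

```sql
-- Supports queries of the form:
--   SELECT ... WHERE company_idx = ? AND order_date BETWEEN ? AND ?
ALTER TABLE orders
  ADD INDEX idx_company_date (company_idx, order_date);
```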

Related

How to judge the complexity of SQL queries

Any resource where it is explained how to judge the complexity of SQL queries would be much appreciated.
(By "complexity", I assume you mean "slowness"?) Some tips:
Subqueries may or may not slow down a query a lot.
GROUP BY and ORDER BY -- when both are present but different: Usually requires two sorts.
Usually only a single index is used per SELECT.
OR is almost always inefficient. Switching to UNION allows multiple indexes to be efficiently used.
UNION ALL, with a few restrictions, is more efficient than UNION DISTINCT (which requires a de-duplication pass).
Non-sargable expressions cannot use an index, and are hence severely inefficient.
Only if the entire WHERE, GROUP BY and ORDER BY are handled by a single index can LIMIT be handled efficiently. (Otherwise it must collect all the matching rows, sort them, and only then peel off a few rows.)
Entity-Attribute-Value schema is inefficient.
UUIDs and GUIDs are inefficient on very large tables.
A composite index is often better than a single-column index.
A "covering" index is somewhat better.
Sometimes, especially when a LIMIT is involved, it is better to turn the query inside-out. That is start with a subquery that finds the few ids that you need, then reaches back into the same table and into other tables to get the rest of the desired columns.
"Windowing functions" are poorly implemented in MySQL 8 and MariaDB 10.2. They are useful for "groupwise-max" and "hierarchical schemas". Until the Optimizer improves, I declare them to be "complex".
Recent versions have recognized "row constructors"; previously they were a performance hit.
Having an AUTO_INCREMENT id hurts performance in certain cases; helps in others.
EXPLAIN (or EXPLAIN FORMAT=JSON) tells you what is going on now; it fails to tell you how to rewrite the query or what better index to add.
More indexing tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql In that link, see "Handler counts" for a good way to measure complexity for specific queries. I use it for comparing query formulations, etc, even without populating a large table to get usable timings.
Give me a bunch of Queries; I'll point out the complexities, if any, in each.
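To illustrate the OR-versus-UNION tip above (the table and indexes are hypothetical; the optimizer can sometimes handle OR via an index merge, but the rewrite is often faster):

```sql
-- Often poorly optimized: the OR prevents efficient use of a single index
SELECT id, name FROM customers
WHERE last_name = 'Smith' OR city = 'Boston';

-- Rewritten with UNION, each branch can use its own index
-- (one on last_name, one on city):
SELECT id, name FROM customers WHERE last_name = 'Smith'
UNION
SELECT id, name FROM customers WHERE city = 'Boston';
```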
Check out the official MySQL documentation on Query Execution Plan:
https://dev.mysql.com/doc/refman/5.7/en/execution-plan-information.html
You could use the EXPLAIN command to get more information about your query.

What's the minimum number of rows where indexing becomes valuable in MySQL?

I've read that indexing on some databases (SQL Server is the one I read about) doesn't have much effect until you cross a certain threshold of rows because the database will hold the entire table X in memory.
Ordinarily, I'd plan to index on my WHEREs and unique columns/lesser-changed tables. After hearing about the suggested minimum (which was about 10k), I wanted to learn more about that idea. If there are tables that I know will never pass a certain point, this might change the way I index some of them.
For something like MySQL MyISAM/InnoDB, is there a point where indexing has little value, and what are some ways of determining that?
Note: Very respectfully, I'm not looking for suggestions about structuring my database like "You should index anyway," I'm looking to understand this concept, if it's true or not, how to determine the thresholds, and similar information.
One of the major uses of indexes is to reduce the number of pages being read. The index itself is usually smaller than the table. So, just in terms of page read/writes, you generally need at least three data pages to see a benefit, because using an index requires at least two data pages (one for the index and one for the original data).
(Actually, if the index covers the query, then the breakeven is two.)
The number of data pages needed for a table depends on the size of the records and the number of rows. So, it is really not possible to specify a threshold on the number of rows.
The above very rudimentary explanation leaves out a few things:
The cost of scanning the data pages to do comparisons for each row.
The cost of loading and using index pages.
Other uses of indexing.
But it gives you an idea, and you can see benefits on tables much smaller than 10k rows. That said, you can easily do tests on your data to see how queries work on the tables in question.
Also, I strongly, strongly recommend having primary keys on all tables and using those keys for foreign key relationships. The primary key itself is an index.
Indexes serve a lot of purposes. InnoDB tables are always organized as an index, on the cluster key. Indexes can be used to enforce unique constraints, as well as support foreign key constraints. The topic of "indexes" spans way more than query performance.
In terms of query performance, it really depends on what the query is doing. If we are selecting a small subset of rows, out of large set, then effective use of an index can speed that up by eliminating vast swaths of rows from being checked. That's where the biggest bang comes from.
If we are pulling all of the rows, or nearly all the rows, from a set, then an index typically doesn't help narrow down which rows to check; even when an index is available, the optimizer may choose to do a full scan of all of the rows.
But even when pulling large subsets, appropriate indexes can improve performance for join operations, and can significantly improve performance of queries with GROUP BY or ORDER BY clauses, by making use of an index to retrieve rows in order, rather than requiring a "Using filesort" operation.
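A sketch of that ORDER BY point (names are illustrative): an index whose leading column matches the WHERE equality and whose trailing column matches the ORDER BY lets rows come back already sorted, so the LIMIT can stop early.

```sql
-- Without a suitable index, EXPLAIN shows "Using filesort".
-- With this index, rows for a given status are read in created_at order:
ALTER TABLE tickets
  ADD INDEX idx_status_created (status, created_at);

EXPLAIN
SELECT * FROM tickets
WHERE status = 'open'
ORDER BY created_at
LIMIT 20;
```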
If we are looking for a simple rule of thumb... for a large set, if we are needing to pull (or look at) less than 10% of the total rows, then an access plan using a suitable index will typically outperform a full scan. If we are looking for a specific row, based on a unique identifier, an index is going to be faster than a full scan. If we are pulling all columns for every row in the table in no particular order, then a full scan is going to be faster.
Again, it really comes down to what operations are being performed. What queries are being executed, and the performance profile that we need from those queries. That is going to be the key to determining the indexing strategy.
In terms of gaining understanding, use EXPLAIN to see the execution plan. And learn the operations available to the MySQL optimizer.
(The topic of indexing strategy in terms of database performance is much too large for a StackOverflow question.)
Each situation is different. If you profile your code, then you'll understand better each anti-pattern. To demonstrate the extreme unexpectedness, consider Oracle:
If this were Oracle, I would say zero: if an empty table's high-water mark is very high, then a query that triggers a full table scan returning zero rows would be much more expensive than the same query induced to use even a full index scan.
The same process that I went through to understand Oracle you can do with MySQL: profile your code.

How will partitioning affect my current queries in MySQL? When is it time to partition my tables?

I have a table that contains 1.5 million rows, has 39 columns, contains sales data of around 2 years, and grows every day.
I had no problems with it until we moved it to a new server, we probably have less memory now.
Queries are currently taking a very long time. Someone suggested partitioning the large table that is causing most of the performance issues but I have a few questions.
Is it wise to partition the table I described, and is it likely to improve its performance?
If I do partition it, will I have to make changes to my current INSERT or SELECT statements, or will they continue working the same way?
Does the partitioning take a long time to perform? I worry that with the slow performance, something would happen midway through and I would lose the data.
Should I be partitioning it by years or months? (We usually look at the numbers within the month, but sometimes we look at weeks or years.) And should I also partition the columns? (We have some columns that we rarely or never use, but we might want to use them later.)
(I agree with Bill's answer; I will approach the Question in a different way.)
When is it time to partition my tables?
Probably never.
is it likely to improve its performance?
It is more likely to decrease performance a little.
I have a table that contains 1.5 million rows
Not big enough to bother with partitioning.
Queries are currently taking a very long time
Usually that is due to the lack of a good index, probably a 'composite' one. Secondly is the formulation of the query. Please show us a slow query, together with SHOW CREATE TABLE.
data of around 2 years, and grows every day
Will you eventually purge "old" data? If so, PARTITION BY RANGE(TO_DAYS(..)) is an excellent idea. However, it only helps during the purge, because DROP PARTITION is a lot faster than DELETE.
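A sketch of that purge pattern (table, column, and partition names are illustrative):

```sql
CREATE TABLE sales (
  id        BIGINT NOT NULL AUTO_INCREMENT,
  sale_date DATE NOT NULL,
  amount    DECIMAL(10,2),
  PRIMARY KEY (id, sale_date)   -- partition column must be in the PK
)
PARTITION BY RANGE (TO_DAYS(sale_date)) (
  PARTITION p2023_12 VALUES LESS THAN (TO_DAYS('2024-01-01')),
  PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION pfuture  VALUES LESS THAN MAXVALUE
);

-- Purging a month is a near-instant metadata operation:
ALTER TABLE sales DROP PARTITION p2023_12;
-- ...versus a slow, log-heavy row-by-row delete:
-- DELETE FROM sales WHERE sale_date < '2024-01-01';
```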
we probably have less memory now.
If you are mostly looking at "recent" data, then the size of memory (cf innodb_buffer_pool_size) may not matter. This is due to caching. However, it sounds like you are doing table scans, perhaps unnecessarily.
will I have to make changes to my current INSERT or SELECT
No. But you probably need to change what column(s) are in the PRIMARY KEY and secondary key(s).
Does the partitioning take a long time to perform?
Slow - yes, because it will copy the entire table over. Note: that means extra disk space, and the partitioned table will take more disk.
something would happen midway through and I would lose the data.
Do not worry. The new table is created, then a very quick RENAME TABLE swaps it into place.
Should I be partitioning it by years or months?
Rule of thumb: aim for about 50 partitions. With "2 years and growing", a likely choice is "monthly".
we usually look at the numbers within the month, but sometimes we take weeks or years
Smells like a typical "Data Warehouse" dataset? Build and incrementally augment a "Summary table" with daily stats. With that table, you can quickly get weekly/monthly/yearly stats -- possibly 10 times as fast. Ditto for any date range. This also significantly helps with "low memory".
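A minimal sketch of such a summary table (the base table, column names, and grouping by store are assumptions):

```sql
CREATE TABLE sales_daily_summary (
  sale_date    DATE NOT NULL,
  store_id     INT NOT NULL,
  total_amount DECIMAL(14,2) NOT NULL,
  row_count    INT NOT NULL,
  PRIMARY KEY (sale_date, store_id)
);

-- Incremental nightly augmentation (e.g. summarize yesterday's rows):
INSERT INTO sales_daily_summary (sale_date, store_id, total_amount, row_count)
SELECT sale_date, store_id, SUM(amount), COUNT(*)
FROM sales
WHERE sale_date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY sale_date, store_id;

-- Monthly numbers then come from the small summary table:
SELECT store_id, SUM(total_amount)
FROM sales_daily_summary
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY store_id;
```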
And should I also partition the columns? (We have some columns that we rarely or never use, but we might want to use them later)
You should 'never' use SELECT *; instead, specify the columns you actually need. "Vertical partitioning" is the term for your suggestion. It is sometimes practical. But we need to see SHOW CREATE TABLE with realistic column names to discuss further.
More on partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables
In most circumstances, you're better off using indexes instead of partitioning as your main method of query optimization.
The first thing you should learn about partitioning in MySQL is this rule:
All columns used in the partitioning expression for a partitioned table must be part of every unique key that the table may have.
Read more about this rule here: Partitioning Keys, Primary Keys, and Unique Keys.
This rule makes many tables ineligible for partitioning, because you might want to partition by a column that is not part of the primary or unique key in that table.
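To illustrate the rule (the schema is hypothetical):

```sql
-- Fails (error 1503): the partitioning column (created_at) is not
-- part of the unique key on email:
CREATE TABLE users (
  id         BIGINT NOT NULL AUTO_INCREMENT,
  email      VARCHAR(255) NOT NULL,
  created_at DATE NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (email)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p0 VALUES LESS THAN MAXVALUE
);

-- This alternative is accepted, but only because every unique key now
-- includes created_at -- which weakens the uniqueness guarantee on email:
CREATE TABLE users (
  id         BIGINT NOT NULL AUTO_INCREMENT,
  email      VARCHAR(255) NOT NULL,
  created_at DATE NOT NULL,
  PRIMARY KEY (id, created_at),
  UNIQUE KEY (email, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p0 VALUES LESS THAN MAXVALUE
);
```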
The second thing to know is that partitioning only helps queries using conditions that unambiguously let the optimizer infer which partitions hold the data you're interested in. This is called Partition Pruning. If you run a query that could find data in any or all partitions, MySQL must search all the partitions, and you gain no performance benefit compared to having a regular non-partitioned table.
For example, if you partition by date, but then you run a query for data related to a specific user account, it would have to search all your partitions.
In fact, it might even be a little bit slower to use partitioned tables in such a query, because MySQL has to search each partition serially.
You asked how long it would take to partition the table. Converting to a partitioned table requires an ALTER TABLE to restructure the data, so it takes about the same time as any other alteration that copies the data to a new tablespace. This is proportional to the size of the table, but varies a lot depending on your server's performance. You'll just have to test it out, there's no way we can estimate how long it will take on your server.

Is there any performance issue if i query in mysql multiple partitions at once compared to querying same data without partitions?

I have a transactions table which is partitioned by client IDs (currently we will have 4 clients, so 4 partitions). Now if I query for client ID IN (1,2), is there any performance issue compared to running the same query without partitions on the table?
I hear that MySQL maintains a separate file for each partition, so querying a partitioned table needs to open multiple files internally and the query will slow down. Is this correct?
PARTITION BY LIST? BY RANGE? BY HASH? other? It can make a big difference.
Use EXPLAIN PARTITIONS SELECT ... to see if it is doing any "pruning". If it is not, then partitioning is a performance drag for that query.
In general, there are very few cases where partitioning provides any performance benefit. It sounds like your case will not benefit from partitioning. Think of it this way... First, it must decide which partition(s) to look in, then it will dig into the index to finish locating the row(s). Without partitioning, the first step is avoided, hence potentially faster.
If you grow to hundreds of "clients", hence hundreds of partitions, then the pruning becomes inefficient, since each partition is essentially a "table".
See http://mysql.rjweb.org/doc.php/partitionmaint for a list of the only 4 use cases that I have found for partitioning.

MySQL performance with large number of records - partitioning?

I am trying to build a database that would contain a large number of records, each with a lot of columns (fields) - maybe around 200-300 fields total for all tables. Let's say that in a few years I would have about 40,000,000 to 60,000,000 records.
I plan to normalize the database, so I will have a lot of tables (about 30-40) -> and lots of joins for queries.
Database will be strictly related to US, meaning that queries will be based on the 50 states alone (if a query is made, it won't allow to search/insert/etc in multiple states, but just one).
What can I do to have better performance?
Someone came up with the idea to have all the states in different table structures, meaning I will have 50 tables * the 30-40 for the data (about 200 tables)! Should I even consider this type of approach?
The next idea was to use partitioning based on the US 50 states. How about this?
Any other way?
The best optimization is determined by the queries you run, not by your tables' structure.
If you want to use partitioning, this can be a great optimization, if the partitioning scheme supports the queries you need to optimize. For instance, you could partition by US state, and that would help queries against data for a specific state. MySQL supports "partition pruning" so that the query would only run against the specific partition -- but only if your query mentions a specific value for the column you used as the partition key.
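For instance (the DDL is illustrative), partitioning by state could use LIST COLUMNS, with the state column included in every unique key as the rule above requires:

```sql
CREATE TABLE sales (
  id        BIGINT NOT NULL AUTO_INCREMENT,
  state     CHAR(2) NOT NULL,
  sale_date DATE NOT NULL,
  amount    DECIMAL(10,2),
  PRIMARY KEY (id, state)   -- state must appear in every unique key
)
PARTITION BY LIST COLUMNS (state) (
  PARTITION p_ny VALUES IN ('NY'),
  PARTITION p_ca VALUES IN ('CA'),
  PARTITION p_tx VALUES IN ('TX')
  -- ... one entry per state
);
```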
You can always check whether partition pruning is effective by using EXPLAIN PARTITIONS:
EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE state = 'NY';
That should report that the query uses a single partition.
Whereas if you need to run queries by date for example, then the partitioning wouldn't help; MySQL would have to repeat the query against all 50 partitions.
EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE date > '2013-05-01';
That would list all partitions. There's a bit of overhead to query all partitions, so if this is your typical query, you should probably use range partitioning by date.
So choose your partition key with the queries in mind.
Any other optimization technique follows a similar pattern -- it helps some queries, possibly to the disadvantage of other queries. So be sure you know which queries you need to optimize for, before you decide on the optimization method.
Re your comment:
Certainly there are many databases that have 40 million rows or more, but have good performance. They use different methods, including (in no particular order):
Indexing
Partitioning
Caching
Tuning MySQL configuration variables
Archiving
Increasing hardware capacity (e.g. more RAM, solid state drives, RAID)
My point above is that you can't choose the best optimization method until you know the queries you need to optimize. Furthermore, the best choice may be different for different queries, and may even change over time as data or traffic grows. Optimization is a continual process, because you won't know where your bottlenecks are until after you see how your data grows and the query traffic your database receives.