MySQL Partitioning user generated rows by user - mysql

I have two tables: userMessages and userStatistics
I realized that I need to set up partitioning in order to ensure efficiency. From all the information I could gather, I am supposed to use HASH partitioning.
PARTITION BY HASH(user_id) PARTITIONS 101
Why do I have to define the number of partitions? Is it possible to partition by the number of users? I want to partition all the messages and statistics by user. How many partitions should I use?
More Context
Let's use my userStatistics table as an example. It will store a new entry every day to capture each user's daily activity (impressions, click-throughs, etc.). I expect this table to grow very large within a year (>1m rows). I was thinking of just creating separate tables for each user using an index, but was told about partitioning using HASH. What is the best way to approach this case?

Partition by HASH will not provide any efficiency. In fact, there are only a few RANGE partitionings that provide any efficiency. I suspect your use case does not apply. See http://mysql.rjweb.org/doc.php/partitionmaint
Describe your use case further if you would like to debate the issue.
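On the question of why a partition count is required: plain (non-LINEAR) HASH partitioning picks a partition by taking the expression modulo the number of partitions, so the count is part of the mapping itself. The sketch below only imitates that arithmetic in Python; the function name is made up for illustration and the user ids are arbitrary.

```python
def hash_partition(user_id: int, num_partitions: int) -> int:
    """Partition index that plain HASH partitioning would pick: MOD(expr, N)."""
    return user_id % num_partitions


NUM_PARTITIONS = 101  # the value from the question

# Users 1, 102 and 203 all land in the same partition, so a partition is
# shared by many users rather than dedicated to a single one.
for uid in (1, 2, 102, 203):
    print(uid, "->", "p%d" % hash_partition(uid, NUM_PARTITIONS))
```

In other words, PARTITIONS 101 does not give each user a private partition; it spreads all users over 101 buckets.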

Related

MYSQL - compare between 1 table have n partition and n table have same struct

I am a student and I have a question that came up while researching MySQL partitioning.
For example, I have a table "Label" with 10 partitions by hash(TaskId):
resourceId (PK)
TaskId (PK)
...
Alternatively, I have 10 tables whose names are built as "label" + taskId:
tables:
task1(resourceId,...)
task2(resourceId,...)
...
Could you please tell me the advantages and disadvantages of each approach?
Thanks
Welcome to Stack Overflow. I wish you had offered a third alternative in your question: "just one table with no partitions." That is by far, in almost all cases in the real world, the best way to handle your data. It only requires maintaining and querying one copy of each index, for example. If your data approaches billions of rows in size, it's time to consider stuff like partitions.
But never mind that. Your question was to compare ten tables against one table with ten partitions. Your ten-table approach is often called sharding your data.
First, here's what the two have in common: they both are represented by ten different tables on your storage device (SSD or disk). A query for a row of data that might be anywhere in the ten involves searching all ten, using whatever indexes or other techniques are available. Each of these ten tables consumes resources on your server: open file descriptors, RAM caches, etc.
Here are some differences:
When INSERTing a row into a partitioned table, MySQL figures out which partition to use. When you are using shards, your application must figure out which table to use and write the INSERT query for that particular table.
When querying a partitioned table for a few rows, MySQL automatically figures out from your query's WHERE conditions which partitions it must search. When you search your sharded data, on the other hand, your application must figure out which table or tables to search.
In the case you presented, partitioning by hash on a primary-key column, the only way to get MySQL to search just one partition is to search for a particular value of the partitioning column. In your case that means the WHERE clause must pin down TaskId, for example WHERE resourceId = foo AND TaskId = bar. If you search based on some other criterion, such as WHERE customerId = something, MySQL must search all the partitions. That takes time. In the sharding case, your application can use its own logic to figure out which tables to search.
If your system grows very large, you'll be able to move each shard to its own MySQL server running on its own hardware. Then, of course, your application will need to choose the correct server as well as the correct shard table for each access. This won't work with partitions.
With a partitioned table with an autoincrementing id value on each row inserted, each of your rows will have its own unique id no matter which partition it is in. In the sharding case, each table has its own sequence of autoincrementing ids. Rows from different tables will have duplicate ids.
The Data Definition Language (DDL: CREATE TABLE and the like) for partitioning is slightly simpler than for sharding. It's easier and less repetitive to write the DDL to add a column or an index to a partitioned table than it is to a bunch of shard tables. With the volume of data that justifies sharding or partitioning, you will need to add and modify indexes to match the needs of your application in future.
Those are some practical differences. Pro tip: don't partition and don't shard your data unless you have really good reasons to do so.
Keep in mind that server hardware, disk hardware, and the MySQL software are under active development. If it takes several years for your data to grow very large, new hardware and new software releases may improve fast enough in the meantime that you don't have to worry too much about partitioning / sharding.
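Two of the differences listed above, application-side routing and per-table autoincrement sequences, can be seen in a small sketch. It uses Python's built-in sqlite3 purely as a stand-in for MySQL, with two shard tables instead of ten; the table names, column names, and the modulo routing rule are assumptions made for illustration.

```python
import sqlite3

NUM_SHARDS = 2  # hypothetical; the question used ten tables


def shard_table(task_id: int) -> str:
    """Application-side routing: pick the shard table for a task."""
    return f"task{task_id % NUM_SHARDS}"


conn = sqlite3.connect(":memory:")
for n in range(NUM_SHARDS):
    conn.execute(
        f"CREATE TABLE task{n} (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "task_id INTEGER, resource_id INTEGER)"
    )


def insert_label(task_id: int, resource_id: int) -> tuple:
    """Insert into the shard chosen by the application; return (table, new id)."""
    table = shard_table(task_id)
    cur = conn.execute(
        f"INSERT INTO {table} (task_id, resource_id) VALUES (?, ?)",
        (task_id, resource_id),
    )
    return table, cur.lastrowid


first = insert_label(task_id=10, resource_id=500)   # routed to task0
second = insert_label(task_id=11, resource_id=501)  # routed to task1
print(first, second)  # both rows get id 1, each in its own table
```

With a single partitioned table, neither the routing function nor the duplicate ids would exist: the server picks the partition and one autoincrement sequence covers all rows.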

How will partitioning affect my current queries in MySQL? When is it time to partition my tables?

I have a table that contains 1.5 million rows, has 39 columns, contains sales data of around 2 years, and grows every day.
I had no problems with it until we moved it to a new server; we probably have less memory now.
Queries are currently taking a very long time. Someone suggested partitioning the large table that is causing most of the performance issues but I have a few questions.
Is it wise to partition the table I described and is it likely to improve its performance?
If I do partition it, will I have to make changes to my current INSERT or SELECT statements or will they continue working the same way?
Does the partition take a long time to perform? I worry that with the slow performance, something would happen midway through and I would lose the data.
Should I be partitioning it to years or months? (we usually look at the numbers within the month, but sometimes we take weeks or years). And should I also partition the columns? (We have some columns that we rarely or never use, but we might want to use them later)
(I agree with Bill's answer; I will approach the Question in a different way.)
When is it time to partition my tables?
Probably never.
is it likely to improve its performance?
It is more likely to decrease performance a little.
I have a table that contains 1.5 million rows
Not big enough to bother with partitioning.
Queries are currently taking a very long time
Usually that is due to the lack of a good index, probably a 'composite' one. The second most likely cause is the way the query is formulated. Please show us a slow query, together with SHOW CREATE TABLE.
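As a rough illustration of what a 'composite' index buys, the sketch below creates one and asks the engine for its query plan. It runs on Python's built-in sqlite3 rather than MySQL (where you would use EXPLAIN instead), and the sales table, its columns, and the index name are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store_id INTEGER, "
    "sale_date TEXT, amount REAL)"
)
# Composite index: the equality column first, then the range column.
conn.execute("CREATE INDEX idx_store_date ON sales (store_id, sale_date)")

query = (
    "SELECT SUM(amount) FROM sales "
    "WHERE store_id = ? AND sale_date >= ? AND sale_date < ?"
)
plan = conn.execute(
    "EXPLAIN QUERY PLAN " + query, (7, "2024-01-01", "2024-02-01")
).fetchall()
plan_text = " ".join(row[3] for row in plan)  # 4th column holds the description
print(plan_text)  # the plan names idx_store_date instead of a full scan
```

The point is the column order: one index that matches both the equality test and the date range lets the engine jump straight to the relevant rows.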
data of around 2 years, and grows every day
Will you eventually purge "old" data? If so, then PARTITION BY RANGE(TO_DAYS(...)) is an excellent idea. However, it only helps during the purge. This is because DROP PARTITION is a lot faster than DELETE.
we probably have less memory now.
If you are mostly looking at "recent" data, then the size of memory (cf innodb_buffer_pool_size) may not matter. This is due to caching. However, it sounds like you are doing table scans, perhaps unnecessarily.
will I have to make changes to my current INSERT or SELECT
No. But you probably need to change what column(s) are in the PRIMARY KEY and secondary key(s).
Does the partition take a long time to perform?
Slow - yes, because it will copy the entire table over. Note: that means extra disk space, and the partitioned table will take more disk.
something would happen midway through and I would lose the data.
Do not worry. The new table is created, then a very quick RENAME TABLE swaps it into place.
Should I be partitioning it to years or months?
Rule of thumb: aim for about 50 partitions. With "2 years and growing", a likely choice is "monthly".
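A minimal sketch of what "monthly" could look like: the helper below just generates the partition clauses as text. The column name sale_date and the partition naming scheme are assumptions; the partitioning column would also have to be part of the PRIMARY KEY, as noted above.

```python
from datetime import date


def next_month(d: date) -> date:
    """First day of the month after d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def monthly_partitions(first: date, months: int) -> list:
    """One 'VALUES LESS THAN' clause per month, starting with first's month."""
    clauses = []
    start = date(first.year, first.month, 1)
    for _ in range(months):
        end = next_month(start)
        clauses.append(
            f"PARTITION p{start:%Y%m} VALUES LESS THAN "
            f"(TO_DAYS('{end.isoformat()}'))"
        )
        start = end
    return clauses


clauses = monthly_partitions(date(2023, 11, 1), 3)
ddl = (
    "PARTITION BY RANGE (TO_DAYS(sale_date)) (\n  "
    + ",\n  ".join(clauses)
    + "\n)"
)
print(ddl)
```

Each partition's upper bound is the first day of the following month, so purging a month later becomes a single DROP PARTITION.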
we usually look at the numbers within the month, but sometimes we take weeks or years
Smells like a typical "Data Warehouse" dataset? Build and incrementally augment a "Summary table" with daily stats. With that table, you can quickly get weekly/monthly/yearly stats -- possibly 10 times as fast. Ditto for any date range. This also significantly helps with "low memory".
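Here is one way the summary-table idea could be sketched. It uses Python's built-in sqlite3 as a stand-in for MySQL; the sales and sales_daily tables, their columns, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_date TEXT, product_id INTEGER, amount INTEGER);
CREATE TABLE sales_daily (
    sale_date TEXT, product_id INTEGER,
    total_amount INTEGER, sale_count INTEGER,
    PRIMARY KEY (sale_date, product_id)
);
""")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2024-01-30", 1, 100), ("2024-01-30", 1, 50),
    ("2024-01-31", 1, 25), ("2024-02-01", 1, 10),
])


def summarize_day(day: str) -> None:
    """Rebuild the summary rows for one day (run after each daily load)."""
    conn.execute("DELETE FROM sales_daily WHERE sale_date = ?", (day,))
    conn.execute(
        "INSERT INTO sales_daily "
        "SELECT sale_date, product_id, SUM(amount), COUNT(*) FROM sales "
        "WHERE sale_date = ? GROUP BY sale_date, product_id",
        (day,),
    )


for day in ("2024-01-30", "2024-01-31", "2024-02-01"):
    summarize_day(day)

# Monthly figures come from the small summary table, not the raw rows.
monthly = conn.execute(
    "SELECT substr(sale_date, 1, 7) AS month, SUM(total_amount) "
    "FROM sales_daily GROUP BY month ORDER BY month"
).fetchall()
print(monthly)
```

Weekly, monthly, and yearly reports then read a handful of summary rows per day instead of scanning the full 39-column table.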
And should I also partition the columns? (We have some columns that we rarely or never use, but we might want to use them later)
You should 'never' use SELECT *; instead, specify the columns you actually need. "Vertical partitioning" is the term for your suggestion. It is sometimes practical. But we need to see SHOW CREATE TABLE with realistic column names to discuss further.
More on partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables
In most circumstances, you're better off using indexes instead of partitioning as your main method of query optimization.
The first thing you should learn about partitioning in MySQL is this rule:
All columns used in the partitioning expression for a partitioned table must be part of every unique key that the table may have.
Read more about this rule here: Partitioning Keys, Primary Keys, and Unique Keys.
This rule makes many tables ineligible for partitioning, because you might want to partition by a column that is not part of the primary or unique key in that table.
The second thing to know is that partitioning only helps queries using conditions that unambiguously let the optimizer infer which partitions hold the data you're interested in. This is called Partition Pruning. If you run a query that could find data in any or all partitions, MySQL must search all the partitions, and you gain no performance benefit compared to having a regular non-partitioned table.
For example, if you partition by date, but then you run a query for data related to a specific user account, it would have to search all your partitions.
In fact, it might even be a little bit slower to use partitioned tables in such a query, because MySQL has to search each partition serially.
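The pruning behaviour described above can be imitated in a few lines. This is only a toy model of the decision, not how MySQL implements it; the partition names and date ranges are hypothetical.

```python
from datetime import date

# Hypothetical monthly partitions: name -> (first day, first day of next month)
PARTITIONS = {
    "p202401": (date(2024, 1, 1), date(2024, 2, 1)),
    "p202402": (date(2024, 2, 1), date(2024, 3, 1)),
    "p202403": (date(2024, 3, 1), date(2024, 4, 1)),
}


def partitions_to_search(date_from=None, date_to=None):
    """Return the partitions a query must touch.

    With a date range the list can be pruned; without one (for example a
    query that filters only on a user account) every partition is searched.
    """
    if date_from is None or date_to is None:
        return list(PARTITIONS)
    return [
        name
        for name, (start, end) in PARTITIONS.items()
        if start <= date_to and date_from < end
    ]


pruned = partitions_to_search(date(2024, 2, 10), date(2024, 2, 20))
unpruned = partitions_to_search()  # e.g. WHERE user_id = ... only
print(pruned)
print(unpruned)
```

The first call touches one partition; the second has nothing to prune on and visits all three.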
You asked how long it would take to partition the table. Converting to a partitioned table requires an ALTER TABLE to restructure the data, so it takes about the same time as any other alteration that copies the data to a new tablespace. This is proportional to the size of the table, but varies a lot depending on your server's performance. You'll just have to test it out; there's no way we can estimate how long it will take on your server.

Is there any performance issue if i query in mysql multiple partitions at once compared to querying same data without partitions?

I have a transactions table that is partitioned by client id (currently there are 4 clients, so 4 partitions). Now if I query for client id in (1,2), is there any performance issue compared to running the same query without partitions on the table?
I hear that MySQL maintains a separate file for each partition, so querying a partitioned table needs to open multiple files internally and the query will slow down. Is this correct?
PARTITION BY LIST? BY RANGE? BY HASH? other? It can make a big difference.
Use EXPLAIN PARTITIONS SELECT ... to see if it is doing any "pruning" (from MySQL 5.7 on, plain EXPLAIN already includes a partitions column). If it is not, then partitioning is a performance drag for that query.
In general, there are very few cases where partitioning provides any performance benefit. It sounds like your case will not benefit from partitioning. Think of it this way... First, it must decide which partition(s) to look in, then it will dig into the index to finish locating the row(s). Without partitioning, the first step is avoided, hence potentially faster.
If you grow to hundreds of "clients", hence hundreds of "partitions", then the pruning is inefficient since each partition is essentially a "table".
See http://mysql.rjweb.org/doc.php/partitionmaint for a list of the only 4 use cases that I have found for partitioning.

How to partition MySQL table by day?

I'm running MySQL 5.1 and storing data from web logs into a table. There is a datetime column which I want to partition by day. Every night I add new data from the previous day into the table, which is why I want to partition by day. It is usually a few million rows. I want to partition by day because it usually takes 20 seconds for a MySQL query to complete.
In short, I want to partition by each day because users can click on a calendar to get web log information consisting of a day's worth of data. The data spans millions of rows (for a single day).
The problem that I've seen with a lot of partitioning articles is that you have to explicitly specify what values you want to partition for. I don't like this way because it means that I'll have to alter the table every night in order to add an extra partition. Is there a built-in MySQL feature to do this for me automatically, or will I have to write a bash script/cron job to alter the table for me every night?
For example, if I were to follow the following example:
http://datacharmer.blogspot.com/2008/12/partition-helper-improving-usability.html
In one year, I would have 365 partitions.
Indexes are a must for any table. The details of the index(es) derive from the SELECTs you have; let's see them.
Rules of thumb:
Don't partition a table of less than a million rows.
Don't use more than about 50 partitions.
If you are 'purging old data' after some number of days/weeks/months, see my blog for the code on how to do that.
PARTITION BY RANGE() is the only useful partition mechanism.
I tried this once. I ended up creating a cron job to do the partitioning on a regular basis (once a month). Keep in mind that you have a maximum of 1024 partitions per table (http://dev.mysql.com/doc/refman/5.1/en/partitioning-limitations.html).
Offhand, I probably wouldn't recommend it. For my needs, I saw this created a significant slowdown in any searches that required cross-partition results.
Based on your updated explanation, I would first recommend creating the necessary indexes. I would read the Optimization chapter of the MySQL manual (specifically the section on indexes) to better learn how to ensure you have the necessary indexes. You can also use the slow query log to help isolate the problematic queries.
Once you have that narrowed down, I can see your need for partitioning change to wanting to partition to limit the size of a particular partition (perhaps for storage space or for quick truncation, etc). At that point, you may decide to partition on a monthly or annual basis.
Partitioning using the date as a partition key will obviously force you into creating an index for the date field. Start with that and see how it goes before you get into the extra efforts of partitioning on a scheduled basis.
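If you do end up partitioning on a schedule, the cron job mostly has to build one statement per run. The sketch below only generates that statement as text; the table name web_logs and the partition naming scheme are assumptions, and it presumes a RANGE(TO_DAYS(...)) table without a MAXVALUE catch-all partition (with one, ADD PARTITION is rejected and REORGANIZE PARTITION is needed instead).

```python
from datetime import date, timedelta


def add_partition_sql(table: str, day: date) -> str:
    """ALTER TABLE statement adding a partition that will hold rows for `day`."""
    upper = day + timedelta(days=1)  # exclusive upper bound of the partition
    return (
        f"ALTER TABLE {table} ADD PARTITION ("
        f"PARTITION p{day:%Y%m%d} VALUES LESS THAN "
        f"(TO_DAYS('{upper.isoformat()}')))"
    )


stmt = add_partition_sql("web_logs", date(2024, 12, 31))
print(stmt)
```

A nightly job would call this for the coming day and hand the statement to the MySQL client of its choice.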

How to structure an extremely large table

This is more of a conceptual question. It's inspired by using some extremely large table where even a simple query takes a long time (properly indexed). I was wondering whether there is a better structure than just letting the table grow continually.
By large I mean 10,000,000+ records, growing every day by something like 10,000/day. A table like that would gain 10,000,000 additional records every 2.7 years. Let's say that the more recent records are accessed the most but the older ones need to remain available.
I have two conceptual ideas to speed it up.
1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then when querying, and let's say the query is expected to pull only a few records from a three-year span, I could use a union to combine the three views and select from those.
2) The other option would be to create a separate table for every year. Then, again, use a union to combine them when querying.
Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.
The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two)
There is one immediate benefit, in that the data is now split across multiple conceptual tables, so any query that includes the partition key within the query can automatically ignore any partition that the key would not be in.
From an RDBMS management perspective, having the data divided into separate partitions allows operations to be performed at a partition level: backup, restore, indexing, etc. This helps reduce downtime and allows for far faster archiving by just removing an entire partition at a time.
There are also non relational storage mechanisms such as nosql, map reduce etc, but ultimately how it is used, loaded and data is archived become a driving factor in the decision of the structure to use.
10 million rows is not that large in the scale of large systems; partitioned systems can and will hold billions of rows.
Your second idea looks like partitioning.
I don't know how well it works, but there is support for partition in MySQL -- see, in its manual : Chapter 17. Partitioning
There is a good scalability approach for tables like this. A union is the right idea, but there is a better way.
If your database engine supports "semantic partitioning", you can split one table into partitions. Each partition covers some subrange (say, one partition per year). This does not affect SQL syntax at all, except for DDL. The engine transparently runs the hidden union logic and partitioned index scans, using all the parallel hardware it has (CPU, I/O, storage).
For example, Sybase allows up to 255 partitions, as that is its limit for a union. But you will never need the keyword "union" in queries.
Often the best plan is to have one table and then use database partitioning.
Or you can archive data, create a view over the archived and active data combined, and keep only the active data in the table that most functions reference. You will need a good (automated) archiving strategy, though, or you can lose data or move it inefficiently. This is typically more difficult to maintain.
What you're talking about is horizontal partitioning or sharding.