Simply MySQL Partition on One Column (groupID)

I have a large-ish table (over 10 million rows). I have the following columns:
rowId, groupId and textString. The IDs are both INTs and textString is a simple VARCHAR. I only have a maximum of 50 groupIds at a time, which are stored in another table (possibly not of interest), but the groupIds are NOT sequential (rowId is AUTO_INCREMENT and the PRIMARY KEY).
What I would like to do is partition my table on these groupIds. I know what the list of groupIds is (i.e. 2342, 5251, 1591, 5915, etc.).
How do I do this in MySQL?
Addendum: Running version 5.5.23.
Thanks!
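For reference, what's being asked for here is LIST partitioning on groupId. A minimal sketch (the table name my_table is illustrative, and the partition list would continue for all ~50 known groupIds). Note that in MySQL 5.5 every unique key, including the primary key, must contain the partitioning column, so the primary key has to be widened first:

-- Widen the PK so it contains the partitioning column (a 5.5 requirement)
ALTER TABLE my_table
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (rowId, groupId);

-- Partition on the known, non-sequential groupIds
ALTER TABLE my_table
    PARTITION BY LIST (groupId) (
        PARTITION p2342 VALUES IN (2342),
        PARTITION p5251 VALUES IN (5251),
        PARTITION p1591 VALUES IN (1591),
        PARTITION p5915 VALUES IN (5915)
    );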

Related

MySQL: how to speed up an SQL query for getting data

I am using a MySQL database.
I have a table daily_price_history of stock values stored with the following fields. It has 11 million+ rows:
id
symbolName
symbolId
volume
high
low
open
datetime
close
So for each stock symbolName there are various daily stock values, and the data now totals more than 11 million rows.
The following SQL tries to get the last 100 days of daily data for a set of 1500 symbols:
SELECT `daily_price_history`.`id`,
       `daily_price_history`.`symbolId_id`,
       `daily_price_history`.`volume`,
       `daily_price_history`.`close`
FROM `daily_price_history`
WHERE (`daily_price_history`.`id` IN
        (SELECT U0.`id`
         FROM `daily_price_history` U0
         WHERE (U0.`symbolName` = `daily_price_history`.`symbolName`
                AND U0.`datetime` >= 1598471533546)))
  AND `daily_price_history`.`symbolName` IN ('A', 'AA', ... /* 1500 symbol names */)
I have the table indexed on symbolName and also on datetime.
For getting ~150K rows (i.e. 1500 × 100) of data it takes 20 secs.
I also have weekly_price_history and monthly_price_history tables, and when I run similar SQL against them it takes less time for the same number (~150K) of rows, because they have less data than the daily table:
weekly_price_history: getting 150K rows takes 3s. The total number of rows in it is 2.5 million.
monthly_price_history: getting 150K rows takes 1s. The total number of rows in it is 800K.
So how do I speed this up when the table is large?
As a starter: I don't see the point of the subquery at all. Presumably, your query could filter directly in the where clause:
select id, symbolid_id, volume, close
from daily_price_history
where datetime >= 1598471533546 and symbolname in ('A', 'AA', ...)
Then, you want an index on (datetime, symbolname):
create index idx_daily_price_history
on daily_price_history(datetime, symbolname)
;
The first column of the index matches the predicate on datetime. It is not very likely, however, that the database will be able to use the index to filter symbolname against a large list of values.
An alternative would be to put the list of values in a table, say symbolnames.
create table symbolnames (
symbolname varchar(50) primary key
);
insert into symbolnames values ('A'), ('AA'), ...;
Then you can do:
select p.id, p.symbolid_id, p.volume, p.close
from daily_price_history p
inner join symbolnames s on s.symbolname = p.symbolname
where p.datetime >= 1598471533546
That should allow the database to use the above index. We can go one step further and add the 4 columns of the select clause to the index, making it covering:
create index idx_daily_price_history_2
on daily_price_history(datetime, symbolname, id, symbolid_id, volume, close)
;
When you add INDEX(a,b), remove INDEX(a) as being no longer necessary.
Your dataset and query may be a case for using PARTITIONing.
PRIMARY KEY(symbolname, datetime)
PARTITION BY RANGE(datetime) ...
This will do "partition pruning": datetime >= 1598471533546. Then the PRIMARY KEY will do most of the rest of the work for symbolname in ('A', 'AA', ...).
Aim for about 50 partitions; the exact number does not matter. Too many partitions may hurt performance; too few won't provide effective pruning.
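Filling in that outline as a sketch (the column types are guesses from the question, and the partition boundaries are illustrative epoch-millisecond values, since datetime is a BIGINT):

CREATE TABLE daily_price_history (
    id BIGINT NOT NULL AUTO_INCREMENT,
    symbolName VARCHAR(32) NOT NULL,
    symbolId_id INT NOT NULL,
    volume BIGINT NOT NULL,
    `close` DECIMAL(12,4) NOT NULL,
    `datetime` BIGINT NOT NULL,            -- epoch milliseconds
    PRIMARY KEY (symbolName, `datetime`),  -- clusters each symbol's rows by time
    KEY (id)                               -- keeps AUTO_INCREMENT legal without a unique key on id
)
PARTITION BY RANGE (`datetime`) (
    PARTITION p202006 VALUES LESS THAN (1593561600000),  -- 2020-07-01
    PARTITION p202007 VALUES LESS THAN (1596240000000),  -- 2020-08-01
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

A query with datetime >= 1598471533546 then only opens the partitions that can contain such rows.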
Yes, get rid of the subquery as GMB suggests.
Meanwhile, it sounds like Django is getting in the way.
Some discussion of partitioning: http://mysql.rjweb.org/doc.php/partitionmaint

MySQL Order of records in table

I have a very large staging table that I want to process a few rows at a time into an indexed table.
As the time to write indexes results in longer-than-desired locks on the target table, I usually do this a few hundred thousand rows at a time. I pick the rows using ORDER BY on a unique column, together with LIMIT and OFFSET, to pick a cut-off value, then repeatedly churn away at the staging table:
SELECT unique_id INTO @cut_off
FROM staging_X
ORDER BY unique_id
LIMIT 1 OFFSET 100000;  -- batch size

START TRANSACTION;
INSERT INTO my_indexed_table ([columns])
  SELECT [columns] FROM staging_X WHERE unique_id <= @cut_off;
DELETE FROM staging_X WHERE unique_id <= @cut_off;
COMMIT;
I've done this for a couple of tables successfully, but am now faced with the largest table in my list. This one has more than 100 million rows. It is created by Apache Spark, so I have no control over setting up partitions or anything.
I've been wondering if I can just use LIMIT with a constant value on both the INSERT and DELETE queries without trying to sort the data. But I cannot find anything that states that the rows will be returned in a reliably repeatable order.
For reference I am using MySQL 5.7 and INNODB tables.
Update
On request, the data is something like this:
uuid - Text
timestamp1 - Bigint - unixtime
timestamp2 - Bigint - unixtime
timestamp3 - Bigint - unixtime
timestamp4 - Bigint - unixtime
url - Text
metric1 - Int
metric2 - Int
There are about 30 million rows per day and I can only process them weekly. I can throttle the provision of the data and create multiple tables (I cannot create partitions) with a limited row count each, but ideally I'd just like to be able to reliably get the first N rows, insert them elsewhere and delete them, without trying to sort the data.
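For what it's worth, one pattern that needs neither a sort nor any assumption about row order is to capture an arbitrary batch of keys first, so the INSERT and the DELETE are guaranteed to operate on the same rows. A sketch, assuming uuid is unique; the table batch_ids and the batch size are illustrative:

-- Grab an arbitrary batch of keys once; no ORDER BY needed
CREATE TEMPORARY TABLE batch_ids AS
    SELECT uuid FROM staging_X LIMIT 100000;

-- Both statements below see exactly the rows captured above
INSERT INTO my_indexed_table
    SELECT s.* FROM staging_X AS s
    JOIN batch_ids AS b ON b.uuid = s.uuid;  -- assumes matching column layout

DELETE s FROM staging_X AS s
    JOIN batch_ids AS b ON b.uuid = s.uuid;

DROP TEMPORARY TABLE batch_ids;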

MySQL - Data Loading by Partitions, and Indexes

This is for MySQL 5.7 with InnoDB.
I have a partitioned table, and I'll be doing batch data loading (of a large amount of data) by partitions, i.e. I know that each batch of data I load will fall exclusively into one partition.
Now, the common way to handle indexes with data loading (as far as I know), would be to drop all indexes first, do the data loading, then re-create the indexes.
But I'm wondering: since I'm loading by partitions, is this still the optimal way (dropping and then re-creating indexes)? It seems like I'm unnecessarily "touching" the non-updated partitions this way.
e.g.
Loading data into partition 1.
Drop all indexes - nothing happens, since no data yet.
Load data - all goes into partition 1.
Create indexes - only partition 1 is modified.
Loading data into partition 2.
Drop all indexes - all indexes in partition 1 dropped (unnecessary!)
Load data - all goes into partition 2.
Create indexes - partition 1 indexes re-created (unnecessary!) and partition 2 indexes created.
And hence, loading this second batch of data takes significantly longer than the first batch. And it will get worse for each batch!
In that case, should I just pre-create the indexes and leave them there when loading data?
(BTW, don't worry about queries. The database is "offline" when data loading takes place. The objective here is only to shorten the time for each batch of data loading.)
The table schema is as follows:
CREATE TABLE MYTABLE (
ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
YEAR SMALLINT UNSIGNED NOT NULL,
MONTH TINYINT UNSIGNED NOT NULL,
A CHAR(4),
B VARCHAR(127),
C VARCHAR(15),
D VARCHAR(511),
E TEXT,
F TEXT,
G VARCHAR(127),
H VARCHAR(127),
I VARCHAR(127),
J VARCHAR(511),
K VARCHAR(511),
L BIT(1),
CONSTRAINT PKEY PRIMARY KEY (ID, YEAR, MONTH)
)
PARTITION BY LIST COLUMNS(YEAR, MONTH) (
PARTITION PART1 VALUES IN ((2007, 1)),
PARTITION PART2 VALUES IN ((2007, 2)),
PARTITION PART3 VALUES IN ((2007, 3)),
...
);
And, of course, there are a bunch of indexes (14 in all), mostly involving 2 to 4 columns. Neither of the two TEXT columns appears in any of the indexes.
If you are using InnoDB, do not drop the PRIMARY KEY.
All PARTITIONs always have the same indexes. So you cannot turn on/off indexes separately.
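One route to per-partition index builds on 5.7, offered here only as a sketch (it is not what the question did): load each batch into a standalone staging table and swap it in with EXCHANGE PARTITION. The staging table's definition, indexes included, must match the partitioned table exactly at swap time, so any index dropping and re-creating happens on the staging table alone:

CREATE TABLE MYTABLE_STAGE LIKE MYTABLE;
ALTER TABLE MYTABLE_STAGE REMOVE PARTITIONING;

-- Drop MYTABLE_STAGE's secondary indexes, bulk load, re-create them.
-- Only the staging table is touched; the loaded rows must all belong
-- to the target partition.

ALTER TABLE MYTABLE EXCHANGE PARTITION PART2 WITH TABLE MYTABLE_STAGE;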
Please provide SHOW CREATE TABLE for further critique and advice. I may say that PARTITIONing is of no use; there are very few use cases where it is worth using PARTITION. More info, and use cases

Improve performance in a big MySQL table

I'd like to ask a question about how to improve performance in a big MySQL table using the InnoDB engine:
There's currently a table in my database with around 200 million rows. This table periodically stores the data collected by different sensors. The structure of the table is as follows:
CREATE TABLE sns_value (
value_id int(11) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
date timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
KEY idx_sensor_id (sensor_id),
KEY idx_date (date),
KEY idx_type_id (type_id) );
At first, I thought of partitioning the table in months, but due to the steady addition of new sensors it would reach the current size in about a month.
Another solution that I came up with was partitioning the table by sensors. However, due to the limit of 1024 partitions of MySQL that wasn't an option.
I believe that the right solution would be using a table with the same structure for each of the sensors:
sns_value_XXXXX
This way there would be more than 1,000 tables with an estimated size of 30 million rows per year each. These tables could, at the same time, be partitioned by month for faster access to the data.
What problems would result from this solution? Is there a more normalized solution?
Editing with additional information
I consider the table to be big in relation to my server:
Cloud 2xCPU and 8GB Memory
LAMP (CentOS 6.5 and MySQL 5.1.73)
Each sensor may have more than one variable type (CO, CO2, etc.).
I mainly have two slow queries:
1) Daily summary for each sensor and type (avg, max, min):
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max, type_id
FROM sns_value
WHERE sensor_id=1 AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;
This takes more than 5 min.
2) Vertical to Horizontal view and export:
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1 AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date LIMIT 4500;
This also takes more than 5 min.
Other considerations
Timestamps may be repeated due to the characteristics of the inserts.
Periodic inserts must coexist with selects.
No updates nor deletes are performed on the table.
Assumptions made for the "one table for each sensor" approach:
Tables for each sensor would be much smaller so access would be faster.
Selects will be performed only on one table for each sensor.
Selects mixing data from different sensors are not time-critical.
Update 02/02/2015
We have created a new table for each year of data, which we have also partitioned on a daily basis. Each table has around 250 million rows with 365 partitions. The new index used is as Ollie suggested (sensor_id, date, type_id, value), but the query still takes between 30 seconds and 2 minutes. We do not use the first query (daily summary), just the second (vertical to horizontal view).
In order to be able to partition the table, the primary index had to be removed.
Are we missing something? Is there a way to improve the performance?
Many thanks!
Edited based on changes to the question
One table per sensor is, with respect, a very bad idea indeed. There are several reasons for that:
MySQL servers on ordinary operating systems have a hard time with thousands of tables. Most OSs can't handle that many simultaneous file accesses.
You'll have to create tables each time you add (or delete) sensors.
Queries that involve data from multiple sensors will be slow and convoluted.
My previous version of this answer suggested range partitioning by timestamp. But that won't work with your value_id primary key. However, with the queries you've shown and proper indexing of your table, partitioning probably won't be necessary.
(Avoid the column name date if you can: it's a reserved word and you'll have lots of trouble writing queries. Instead I suggest you use ts, meaning timestamp.)
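A one-statement rename along those lines (a sketch; MySQL's CHANGE COLUMN syntax requires restating the full column definition):

ALTER TABLE sns_value
    CHANGE COLUMN `date` ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;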
Beware: int(11) values aren't big enough for your value_id column. You're going to run out of ids. Use bigint(20) for that column.
You've mentioned two queries. Both these queries can be made quite efficient with appropriate compound indexes, even if you keep all your values in a single table. Here's the first one.
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max,
type_id
FROM sns_value
WHERE sensor_id=1
AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;
For this query, you're first looking up sensor_id using a constant, then you're looking up a range of date values, then you're aggregating by type_id. Finally you're extracting the value column. Therefore, a so-called compound covering index on (sensor_id, date, type_id, value) will be able to satisfy your query directly with an index scan. This should be very fast for you--certainly faster than 5 minutes even with a large table.
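On the existing table, adding that index would look like this (the name query_opt matches the schema shown further down; expect the build to take a while on 200 million rows):

ALTER TABLE sns_value
    ADD INDEX query_opt (sensor_id, `date`, type_id, value);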
In your second query, a similar indexing strategy will work.
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1
AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date
LIMIT 4500;
Again, you start with a constant value of sensor_id and then use a date range. You then extract both type_id and value. That means the same four column index I mentioned should work for you.
CREATE TABLE sns_value (
value_id bigint(20) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
INDEX query_opt (sensor_id, ts, type_id, value)
);
Creating a separate table for a range of sensors would be an idea.
Do not use auto_increment for a primary key if you don't have to. Usually the DB engine clusters the data by its primary key.
Use a composite key instead; depending on your use case, the sequence of columns may be different.
EDIT: Also added the type into the PK. Considering the queries, I would do it like this. The choice of field names is intentional: they should be descriptive, and always consider the reserved words.
CREATE TABLE snsXX_readings (
sensor_id int(11) NOT NULL,
reading int(11) NOT NULL,
reading_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
type_id int(11) NOT NULL,
PRIMARY KEY (reading_time, sensor_id, type_id),
KEY idx_reading_time (reading_time),
KEY idx_type_id (type_id)
);
Also, consider summarizing the readings or grouping them into a single field.
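For the summarizing idea, a sketch of a pre-aggregated daily table (all names here are illustrative), refreshed from sns_value periodically:

CREATE TABLE sns_daily_summary (
    sensor_id   INT  NOT NULL,
    type_id     INT  NOT NULL,
    reading_day DATE NOT NULL,
    mean_value  INT,
    min_value   INT,
    max_value   INT,
    PRIMARY KEY (sensor_id, type_id, reading_day)
);

-- Roll up yesterday's raw readings into one row per sensor/type/day
INSERT INTO sns_daily_summary
SELECT sensor_id, type_id, DATE(`date`),
       ROUND(AVG(value)), MIN(value), MAX(value)
FROM sns_value
WHERE `date` >= CURDATE() - INTERVAL 1 DAY
  AND `date` <  CURDATE()
GROUP BY sensor_id, type_id, DATE(`date`);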
You can try getting randomized summary data.
I have a similar table: MyISAM engine (smallest table size), 10 million records, and no index on the table because it proved useless (tested). Getting the full range over all the data takes about 10 seconds with this query:
SELECT * FROM (
SELECT sensor_id, value, date
FROM sns_value l
WHERE l.sensor_id= 123 AND
(l.date BETWEEN '2013-10-29 12:28:29' AND '2015-10-29 12:28:29')
ORDER BY RAND() LIMIT 2000
) as tmp
ORDER BY tmp.date;
In the first step, this query filters between the dates and randomly picks the first 2,000 rows; in the second step, it sorts that data by date. Each run, the query returns 2,000 results from different data.

How can I perform this MySQL partitioning?

I have a table with an integer column ranging from 1 to 32 (this column identifies the type of record stored).
The types 5 and 12 represent 70% of the total number of rows, and this number is greater than 1M rows, so it seems to make sense to partition the table.
Question is: how can I create a set of 3 partitions, one containing the type 5 records, the second containing the type 12 records, and the third one with the remaining records?
http://dev.mysql.com/doc/refman/5.1/en/partitioning-list.html
create table some_table (
id INT NOT NULL,
some_id INT NOT NULL
)
PARTITION BY LIST(some_id) (
PARTITION fives VALUES IN (5),
PARTITION twelves VALUES IN (12),
PARTITION rest VALUES IN (1,2,3,4,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32)
);
Use PARTITION BY LIST.
Provided that type is indexed, MySQL has already logically partitioned the table for you. Unless you really need physical partitioning, it seems to me you are only making trouble for yourself.