Here's another database problem I stumbled upon.
I have a date-range partitioned MyISAM lookup table with 200M records and ~150 columns.
On this table I need to perform cascading SELECT statements to filter the data.
Output (row counts remaining after each cascading filter):
filter 1: 126M
filter 2: 110M
filter 3: 40M
filter 4: 5M
filter 5: 100k
Every single SELECT is highly complex, with regexes (so no index is possible) and multiple comparisons, which is why I want each one to scan the least possible number of rows.
There are about 500 unique filters and around 200 constant users. Every filter needs to be run for each user, around 100k combinations in total.
Big question:
Is there a way for each subsequent SELECT statement to query only the previous subset?
Example:
Filter #5 should only have to query the 5M rows produced by filter #4 to get those 100k results. At the moment it has to scan through all 200M records.
EDIT
Current approach: cache table
CREATE TABLE IF NOT EXISTS cache (
  filter_id int(11) NOT NULL,
  user_id int(11) NOT NULL,
  lookup_id int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
ALTER TABLE cache ADD PRIMARY KEY (filter_id, user_id, lookup_id); -- lookup_id must be part of the key, otherwise only one row per (filter, user) pair would fit
This would contain the relation between individual data rows from the lookup table and the filters. Plus, I'd be able to use the primary index to get all of the lookup_ids from the previous filter.
Query for subsequent filters:
SELECT SUM(column), COUNT(*)
FROM cache c
LEFT JOIN lookup_table l
       ON c.lookup_id = l.id
WHERE c.filter_id = 1
  AND c.user_id = x
  AND l.regex_column RLIKE ...
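In the same spirit, each subsequent filter could materialize its result into the cache by scanning only the previous filter's cached rows. A sketch, with made-up filter ids, user id and regex:
INSERT INTO cache (filter_id, user_id, lookup_id)
SELECT 2, c.user_id, c.lookup_id
FROM cache c
JOIN lookup_table l ON l.id = c.lookup_id
WHERE c.filter_id = 1
  AND c.user_id = 42
  AND l.regex_column RLIKE '^some.*pattern$';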
Maybe you should save the primary keys of the selected records to some kind of temporary table? On the next step, join that temp table with your main table.
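A minimal sketch of that idea, reusing the table names from the question and hypothetical regex predicates:
-- Step 1: save the matching primary keys once
CREATE TEMPORARY TABLE tmp_filter1 (PRIMARY KEY (id))
SELECT id FROM lookup_table WHERE regex_column RLIKE '^filter1.*';
-- Step 2: the next filter joins against that subset instead of the full 200M rows
SELECT l.*
FROM tmp_filter1 t
JOIN lookup_table l ON l.id = t.id
WHERE l.regex_column RLIKE '^filter2.*';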
There are two tables in MySQL 5.7, and each one has 100,000 records.
And each one contains data like this:
id name
-----------
1 name_1
2 name_2
3 name_3
4 name_4
5 name_5
...
The DDL is:
CREATE TABLE `table_a` (
`id` int(11) NOT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `table_b` (
`id` int(11) NOT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Now I execute the following two queries to see whether the latter will be better.
select SQL_NO_CACHE *
from table_a a
inner join table_b b on a.name = b.name
where a.id between 50000 and 50100;
select SQL_NO_CACHE *
from (
select *
from table_a
where id between 50000 and 50100
) a
inner join table_b b on a.name = b.name;
I think that in the former query, it would iterate up to 100,000 * 100,000 times and then filter the result by the where clause; in the latter query, it would first filter table_a to get 100 intermediate results and then iterate up to 100 * 100,000 times to get the final result. So the latter should be much faster than the former.
But the result is that both queries take about 1.5 seconds, and using the explain statement I can't find any substantial differences.
Does MySQL optimize the former query so that it executes like the latter?
For INNER JOIN, ON and WHERE are optimized the same. For LEFT/RIGHT JOIN, the semantics are different, so the optimization is different. (Meanwhile, please use ON for stating the relationship and WHERE for filtering -- it helps humans in understanding the query.)
Both queries can start by fetching 100 rows from a because of a.id between 50000 and 50100, then reach into the other table 100 times. But each probe has to do a table scan of b because of the lack of any useful index. So 100 x 100,000 operations. ("Nested Loop Join" or "NLJ")
The solution to the slowness is to add
INDEX(name)
Add it at least to b. Or, if this is really a lookup table for mapping "names" to "ids", then UNIQUE(name). With either index, the work should be down to 100 x 100.
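In DDL terms, either of these would do:
ALTER TABLE table_b ADD INDEX (name);
-- or, if this is a unique lookup column:
ALTER TABLE table_b ADD UNIQUE (name);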
Another technique for analyzing queries is:
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';
It counts the actual number of rows (data or index) touched. Counts of 100,000 (or multiples thereof) indicate full table/index scan(s) in your case.
More: Index Cookbook
Joins are generally faster than subqueries, so try to use joins instead of subqueries where you can to speed up the process. In this case, though, both queries are equivalent.
Another way to optimize the query would be to use partitioning. With partitions, MySQL reads only the partitions relevant to your query (partition pruning), which reduces the time spent on unrelated records; see the sketch below.
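A minimal sketch of what that could look like for table_a, with hypothetical partition boundaries; in MySQL 5.7 the partitions column of EXPLAIN shows the pruning:
CREATE TABLE table_a_part (
  `id` int(11) NOT NULL,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (id) (
  PARTITION p0 VALUES LESS THAN (50000),
  PARTITION p1 VALUES LESS THAN (100000),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- the partitions column should list only p1 for this range
EXPLAIN SELECT * FROM table_a_part WHERE id BETWEEN 50000 AND 50100;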
I have two tables in MySQL 5.6 for collecting event data.
When an event occurs, it generates data over a certain time period.
The parent table, named event, remembers the last state of each event.
The child table, named event_version, remembers all data versions generated by any event.
The schemas for these tables look like this:
CREATE TABLE `event` (
  `id` BIGINT(20) NOT NULL,
  `version_id` BIGINT(20), -- refers to last event_version
  `version_number` BIGINT(20), -- consecutive numbers increased when a new version appears
  `first_event_time` TIMESTAMP(6), -- time when a set of event data was generated first time,
                                   -- it is immutable after creation
  `event_time` TIMESTAMP(6), -- time when a set of event data changed last time
  `other_event_data` VARCHAR(30), -- more other columns
  PRIMARY KEY (`id`),
  INDEX `event_time` (`event_time`),
  INDEX `version_id` (`version_id`),
  CONSTRAINT `FK_version_id` FOREIGN KEY (`version_id`) REFERENCES `event_version` (`id`)
);
CREATE TABLE `event_version` (
  `id` BIGINT(20) NOT NULL,
  `event_id` BIGINT(20), -- refers to event
  `version_number` BIGINT(20), -- consecutive numbers increased when a new version appears
  `event_time` TIMESTAMP(6) NULL DEFAULT NULL, -- time when a set of event data was generated
  `other_event_data` VARCHAR(30), -- more other columns
  PRIMARY KEY (`id`),
  INDEX `event_time` (`event_time`), -- time when a set of event data changed
  INDEX `event_id` (`event_id`),
  CONSTRAINT `FK_event_id` FOREIGN KEY (`event_id`) REFERENCES `event` (`id`)
);
I want to get all event_version rows belonging to events that had new rows added in the selected time period.
For example: there is an event with event.id = 21 that appeared at 2019-04-28 and produced versions at:
2019-04-28 version_number: 1, event_version.event_id=21
2019-04-30 version_number: 2, event_version.event_id=21
2019-05-02 version_number: 3, event_version.event_id=21
2019-05-04 version_number: 4, event_version.event_id=21
I want these records to be found when I search for the period from 2019-05-01 to 2019-06-01.
The idea is to find all event_version.event_id values created in the selected period, and then all rows from event_version which have an event_id from this list.
To create the list of event ids I have two inner SELECT queries:
The first query:
SELECT DISTINCT event_id FROM event_version WHERE event_time>='2019-05-01' AND event_time<'2019-06-01';
It takes about 10s and returns about 500,000 records.
But I have a second query, which uses the parent table and looks like this:
SELECT id FROM event WHERE (first_event_time>='2019-05-01' AND first_event_time<'2019-06-01') OR (first_event_time<'2019-05-01' AND event_time>'2019-05-01');
It takes about 7s and returns the same set of ids.
Then I use these subqueries in my final query:
SELECT * FROM event_version WHERE event_id IN (<one of the previous two queries>);
The problem is that when I use the second subquery it takes about 8s to produce the result (about 5 million records).
Creating the same result with the first subquery takes 3 minutes and 15s.
I can't understand why there is such a big difference in execution time even though the subqueries produce the same result list.
I want to use a subquery like in the first example because it depends only on event_time, not on additional data from the parent table.
I have more similar tables where I can rely only on event_time.
My question: is there a way to optimize the query to produce the expected result using only event_time?
As I understand, you want the following query to be optimized:
SELECT *
FROM event_version
WHERE event_id IN (
SELECT DISTINCT event_id
FROM event_version
WHERE event_time >= '2019-05-01'
AND event_time < '2019-06-01'
)
Things I would try:
Create an index on event_version(event_time, event_id). This should improve the performance of the subquery by avoiding a second lookup to get the event_id. Though the overall performance will probably be similar: WHERE IN (<subquery>) tends to be slow (at least in older versions) when the subquery returns a lot of rows.
Try a JOIN with your subquery as a derived table:
SELECT *
FROM (
SELECT DISTINCT event_id
FROM event_version
WHERE event_time >= '2019-05-01'
AND event_time < '2019-06-01'
) s
JOIN event_version USING(event_id)
See if the index mentioned above is of any help here.
Try an EXISTS subquery:
SELECT v.*
FROM event e
JOIN event_version v ON v.event_id = e.id
WHERE EXISTS (
SELECT *
FROM event_version v1
WHERE v1.event_id = e.id
AND v1.event_time >= '2019-05-01'
AND v1.event_time < '2019-06-01'
)
Here you would need an index on event_version(event_id, event_time). Though the performance might be even worse. I would bet on the derived table join solution.
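For reference, the two indexes mentioned could be created like this (the index names are made up):
ALTER TABLE event_version ADD INDEX idx_time_event (event_time, event_id); -- for the IN / derived-table variants
ALTER TABLE event_version ADD INDEX idx_event_time (event_id, event_time); -- for the EXISTS variant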
My guess as to why your second query runs faster: the optimizer is able to convert the IN condition to a JOIN, because the returned column is the primary key of the event table.
I'm guessing the event_version table is a lot bigger than the event table. The subqueries themselves are easy: you scan a table once for a predicate and return the rows. But when you do this inside a subquery, the subquery may get executed for every row the outer query checks. So if event_version has 1M rows, it executes the subquery up to 1M times. There's probably some smarter logic that keeps it from being this extreme, but the principle stays.
However, I fail to see the point of the third query. If you use the third query with the first query as its subquery, you get the exact same rows as if you had run the first query as a plain SELECT * FROM event_version, so why the subquery?
Wouldn't this:
SELECT * FROM event_version WHERE event_id IN (insert query 1);
be the same as
SELECT * FROM event_version WHERE event_time>='2019-05-01' AND event_time<'2019-06-01';
?
I have a table meta with the following structure (this is just an example with denormalized data):
`id` int(3) not null auto_increment primary key,
`category_id` int(3),
`subdomain` varchar(191),
`created_at` timestamp,
`updated_at` timestamp
The subdomain field stores both unique values and repeating values; a value like 'general' can be repeated many times.
Situation 1
I also have an index on subdomain. This index is used for the query
Select `id` from `table` where `subdomain` = 'general'
But when I try to get some non-indexed field, MySQL scans the whole table and the index is not used:
Select `created_at` from `table` where `subdomain` = 'general'
As I understand it, an InnoDB non-clustered (secondary) index stores a reference to the row, so there should be no need to perform a linear search over all rows to retrieve some field.
I also know the optimizer can choose a plan that looks unexpected to a human, but what could the reasons be in this case?
No matter how much data is in the table, the result is always the same.
This can happen when the filtering backed by the index is not very selective, i.e. the value you filter for occurs in a high percentage of your total rows (e.g. 90% of your rows match subdomain = 'general'). If you use the index under that condition, you end up processing more data than a full table scan would.
Example: you have 100 rows and 90 of them match subdomain = 'general'.
A full table scan needs to access all 100 rows to check the condition, and 90 values are read for the result.
An index-backed select needs to access 90 entries in the index to fulfill the condition and then follow the pointer from each index entry to the actual row to read the non-indexed value. That ends up as 90 index lookups + 90 row reads = 180 operations, which is slower than the full table scan, where you just access some rows unnecessarily. The operations might not all have the same cost, but you end up doing more work in the end.
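If the query on the non-indexed column matters, a covering index sidesteps the trade-off entirely: with both columns in the index, the select can be answered from the index alone, with no row lookups. A sketch against the meta table above (the index name is made up):
ALTER TABLE meta ADD INDEX idx_subdomain_created (subdomain, created_at);
-- this can now be an index-only scan:
SELECT created_at FROM meta WHERE subdomain = 'general';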
I'd like to ask a question about how to improve performance on a big MySQL table that uses the InnoDB engine:
There's currently a table in my database with around 200 million rows. This table periodically stores the data collected by different sensors. The structure of the table is as follows:
CREATE TABLE sns_value (
  value_id int(11) NOT NULL AUTO_INCREMENT,
  sensor_id int(11) NOT NULL,
  type_id int(11) NOT NULL,
  date timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  value int(11) NOT NULL,
  PRIMARY KEY (value_id),
  KEY idx_sensor_id (sensor_id),
  KEY idx_date (date),
  KEY idx_type_id (type_id)
);
At first, I thought of partitioning the table in months, but due to the steady addition of new sensors it would reach the current size in about a month.
Another solution that I came up with was partitioning the table by sensors. However, due to the limit of 1024 partitions of MySQL that wasn't an option.
I believe that the right solution would be using a table with the same structure for each of the sensors:
sns_value_XXXXX
This way there would be more than 1,000 tables, with an estimated size of 30 million rows per year each. These tables could, at the same time, be partitioned by month for faster access to the data.
What problems would result from this solution? Is there a more normalized solution?
Editing with additional information
I consider the table to be big in relation to my server:
Cloud 2xCPU and 8GB Memory
LAMP (CentOS 6.5 and MySQL 5.1.73)
Each sensor may have more than one variable type (CO, CO2, etc.).
I mainly have two slow queries:
1) Daily summary for each sensor and type (avg, max, min):
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max, type_id
FROM sns_value
WHERE sensor_id=1 AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;
This takes more than 5 min.
2) Vertical to Horizontal view and export:
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1 AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date LIMIT 4500;
This also takes more than 5 min.
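As an aside, the 1 - abs(sign(type_id - X)) expressions are indicator functions (1 when type_id = X, 0 otherwise); an equivalent and arguably more readable form of the same pivot uses IF():
SELECT date,
       SUM(IF(type_id = 101, value, 0)) AS one,
       SUM(IF(type_id = 141, value, 0)) AS two,
       SUM(IF(type_id = 151, value, 0)) AS three
FROM sns_value
WHERE sensor_id = 1
  AND date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sensor_id, date
LIMIT 4500;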
Other considerations
Timestamps may be repeated due to the characteristics of the inserts.
Periodic inserts must coexist with selects.
No updates nor deletes are performed on the table.
Suppositions made for the "one table for each sensor" approach
Tables for each sensor would be much smaller so access would be faster.
Selects will be performed only on one table for each sensor.
Selects mixing data from different sensors are not time-critical.
Update 02/02/2015
We have created a new table for each year of data, which we have also partitioned on a daily basis. Each table has around 250 million rows with 365 partitions. The new index is as Ollie suggested (sensor_id, date, type_id, value), but the query still takes between 30 seconds and 2 minutes. We do not use the first query (daily summary), just the second (vertical to horizontal view).
In order to be able to partition the table, the primary index had to be removed.
Are we missing something? Is there a way to improve the performance?
Many thanks!
Edited based on changes to the question
One table per sensor is, with respect, a very bad idea indeed. There are several reasons for that:
MySQL servers on ordinary operating systems have a hard time with thousands of tables. Most OSs can't handle that many simultaneous file accesses at once.
You'll have to create tables each time you add (or delete) sensors.
Queries that involve data from multiple sensors will be slow and convoluted.
My previous version of this answer suggested range partitioning by timestamp. But that won't work with your value_id primary key. However, with the queries you've shown and proper indexing of your table, partitioning probably won't be necessary.
(Avoid the column name date if you can: it's a reserved word and you'll have lots of trouble writing queries. Instead I suggest you use ts, meaning timestamp.)
Beware: int(11) values aren't big enough for your value_id column. You're going to run out of ids. Use bigint(20) for that column.
You've mentioned two queries. Both these queries can be made quite efficient with appropriate compound indexes, even if you keep all your values in a single table. Here's the first one.
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max,
type_id
FROM sns_value
WHERE sensor_id=1
AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;
For this query, you're first looking up sensor_id using a constant, then you're looking up a range of date values, then you're aggregating by type_id. Finally you're extracting the value column. Therefore, a so-called compound covering index on (sensor_id, date, type_id, value) will be able to satisfy your query directly with an index scan. This should be very fast for you--certainly faster than 5 minutes even with a large table.
In your second query, a similar indexing strategy will work.
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1
AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date
LIMIT 4500;
Again, you start with a constant value of sensor_id and then use a date range. You then extract both type_id and value. That means the same four column index I mentioned should work for you.
CREATE TABLE sns_value (
value_id bigint(20) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
INDEX query_opt (sensor_id, ts, type_id, value)
);
Creating a separate table for a range of sensors would be an idea.
Do not use auto_increment for a primary key if you don't have to. Usually the DB engine clusters the data by its primary key.
Use a composite key instead; depending on your use case, the sequence of the columns may differ.
EDIT: Also added the type into the PK. Considering the queries, I would do it like this. The choice of field names is intentional: they should be descriptive, and always watch out for reserved words.
CREATE TABLE snsXX_readings (
  sensor_id int(11) NOT NULL,
  reading int(11) NOT NULL,
  reading_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  type_id int(11) NOT NULL,
  PRIMARY KEY (reading_time, sensor_id, type_id),
  KEY idx_reading_time (reading_time),
  KEY idx_type_id (type_id)
);
Also, consider summarizing the readings or grouping them into a single field.
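A rollup along those lines might look like this; the table and column names are made up, and it assumes the ts/value columns of the revised schema above:
-- Daily per-sensor/per-type summary, small and cheap to query
CREATE TABLE sns_daily_summary (
  sensor_id int(11) NOT NULL,
  type_id int(11) NOT NULL,
  day date NOT NULL,
  mean_value int(11),
  min_value int(11),
  max_value int(11),
  PRIMARY KEY (sensor_id, day, type_id)
);
-- Populate it periodically, e.g. once per day for yesterday's readings
INSERT INTO sns_daily_summary
SELECT sensor_id, type_id, DATE(ts),
       ROUND(AVG(value)), MIN(value), MAX(value)
FROM sns_value
WHERE ts >= CURDATE() - INTERVAL 1 DAY AND ts < CURDATE()
GROUP BY sensor_id, type_id, DATE(ts);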
You can try getting randomized summary data.
I have a similar table: MyISAM engine (smallest table size), 10M records, no index on the table because it proved useless (tested). Fetching the full date range of the data takes about 10s with this query:
SELECT * FROM (
    SELECT sensor_id, value, date
    FROM sns_value l
    WHERE l.sensor_id = 123
      AND (l.date BETWEEN '2013-10-29 12:28:29' AND '2015-10-29 12:28:29')
    ORDER BY RAND() LIMIT 2000
) AS tmp
ORDER BY tmp.date;
In the first step this query selects the rows between the dates and randomly picks 2k of them; in the second step it sorts that data by date. Each run returns 2k results drawn from different data.
I have a table that is close to 20 million records and growing. The table was set up as InnoDB. There is a primary index on the two main fields:
entries_to_fields:
Field      Type      Null  Key  Default
entry_id   int(11)   NO    PRI  NULL
field_id   int(11)   NO    PRI  NULL
value      text      NO         NULL
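From that description, the DDL is presumably along these lines (a reconstruction, not the actual dump):
CREATE TABLE `entries_to_fields` (
  `entry_id` int(11) NOT NULL,
  `field_id` int(11) NOT NULL,
  `value` text NOT NULL,
  PRIMARY KEY (`entry_id`, `field_id`)
) ENGINE=InnoDB;
Note that entry_id is the leading column of the primary key, so lookups by entry_id alone can use the index prefix.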
Despite the number of records, most of the queries to this table are exceptionally quick, except for the following:
DELETE FROM `entries_to_fields` WHERE `entry_id` IN (SELECT `id` FROM `entries` WHERE `form_id` = 196)
This deletes all entry data for a specific form.
Currently this is taking over 45 seconds, even when the subquery on the entries table returns no results.
My question is: is there a simple change to the entries_to_fields structure I can make, or can I optimise my query further?
After I read your answer, I wrote this query, which may help you as well (in the future).
DELETE entries_to_fields
FROM entries_to_fields
JOIN entries
ON entries_to_fields.entry_id = entries.id
WHERE entries.form_id = 196
... entries.form_id field should be indexed.
After a bit of trial & error + googling, I found that using IN on indexed fields of large tables is very bad practice.
I've broken the sub-query into a separate query and then created a dynamic query as follows:
DELETE FROM `entries_to_fields` WHERE `entry_id` = 232 OR `entry_id` = 342 ...
Despite generating a potentially large query, this now executes within ~1 sec, even when deleting thousands of entries.
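For what it's worth, an IN list of literal constants behaves the same as the OR chain and reads a bit more cleanly; it is IN (subquery) that older MySQL versions handled poorly:
DELETE FROM `entries_to_fields` WHERE `entry_id` IN (232, 342); -- extend the list with the ids fetched by the separate query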
I would look at the query plan; my guess is that the subquery is returning NULL and making the delete do a full scan.
See: http://dev.mysql.com/doc/refman/5.0/en/in-subquery-optimization.html