I have two tables in MySQL:
Results table: 1,046,928 rows.
Nodes table: 50 rows.
I am joining these two tables with the following query, and execution is very slow.
select res.TIndex, res.PNumber, res.Sender, res.Receiver,
       sta.Nickname, rta.Nickname
from ((Results res join Nodes sta) join Nodes rta)
where ((res.sender_h = sta.name) and (res.receiver_h = rta.name));
Please help me optimize this query. Right now, even pulling just the top 5 rows takes about 5-6 MINUTES. Thank you.
CREATE TABLE `nodes1` (
`NodeID` int(11) NOT NULL,
`Name` varchar(254) NOT NULL,
`Nickname` varchar(254) NOT NULL,
PRIMARY KEY (`NodeID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `Results1` (
`TIndex` int(11) NOT NULL,
`PNumber` int(11) NOT NULL,
`Sender` varchar(254) NOT NULL,
`Receiver` varchar(254) NOT NULL,
`PTime` datetime NOT NULL,
PRIMARY KEY (`TIndex`,`PNumber`),
KEY `PERIOD_TIME_IDX` (`PTime`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
SELECT res.TIndex,
       res.PNumber,
       res.Sender,
       res.Receiver,
       sta.Nickname,
       rta.Nickname
FROM Results AS res
INNER JOIN Nodes AS sta ON res.sender_h = sta.name
INNER JOIN Nodes AS rta ON res.receiver_h = rta.name
Create an index on Results (sender_h).
Create an index on Results (receiver_h).
Create an index on Nodes (name).
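As DDL, those indexes might look like this (a sketch; the index names are my own, and I am assuming the sender_h/receiver_h columns from the original query):

CREATE INDEX idx_results_sender_h ON Results (sender_h);
CREATE INDEX idx_results_receiver_h ON Results (receiver_h);
CREATE INDEX idx_nodes_name ON Nodes (name);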
Joining on the node's name rather than NodeID (the primary key) doesn't look good at all.
Perhaps you should store NodeID foreign keys for sender and receiver in the Results table instead of the names. Adding foreign key constraints is a good idea too; among other things, they may create the supporting indexes automatically, depending on your configuration.
If this change is difficult, at the very least you should enforce uniqueness on the node's name field.
If you change the table definitions in this manner, switch your query to John's recommendation, and add the indexes, it should run a lot better and be a lot more readable.
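A minimal sketch of that schema change (the column and constraint names are my own invention, and the existing name values would need to be backfilled into the new columns before the constraints are added):

ALTER TABLE Results
  ADD COLUMN SenderNodeID int NOT NULL,
  ADD COLUMN ReceiverNodeID int NOT NULL;

-- after backfilling SenderNodeID/ReceiverNodeID from the name columns:
ALTER TABLE Results
  ADD CONSTRAINT fk_results_sender FOREIGN KEY (SenderNodeID) REFERENCES Nodes (NodeID),
  ADD CONSTRAINT fk_results_receiver FOREIGN KEY (ReceiverNodeID) REFERENCES Nodes (NodeID);

The join conditions then become res.SenderNodeID = sta.NodeID and res.ReceiverNodeID = rta.NodeID, comparing integers instead of VARCHAR(254) values.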
I am using MySQL with Django. I am trying to count the number of visitor_page records for a specific dealer within a certain time window.
Here is the raw SQL query, obtained from the Django Debug Toolbar.
SELECT COUNT(*) AS `__count`
FROM `visitor_page`
INNER JOIN `dealer_visitors`
ON (`visitor_page`.`dealer_visitor_id` = `dealer_visitors`.`id`)
WHERE (`visitor_page`.`date_time` BETWEEN '2021-02-01 05:51:00'
AND '2021-03-21 05:50:00'
AND `dealer_visitors`.`dealer_id` = 15)
The issue is that I have more than 13 million records in the visitor_page table and about 1.5 million records in the dealer_visitors table. I have already indexed date_time. I am thinking of using a materialized view, but before attempting that I would really appreciate suggestions on how I could improve this query.
visitor_pages schema:
CREATE TABLE `visitor_page` (
`id` int NOT NULL AUTO_INCREMENT,
`date_time` datetime(6) DEFAULT NULL,
`added_at` datetime(6) DEFAULT NULL,
`updated_at` datetime(6) DEFAULT NULL,
`page_id` int NOT NULL,
`dealer_visitor_id` int NOT NULL,
PRIMARY KEY (`id`),
KEY `visitor_page_page_id_246babdf_fk_web_page_id` (`page_id`),
KEY `visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id` (`dealer_visitor_id`),
KEY `visitor_page_date_time_06e9e9f5` (`date_time`),
CONSTRAINT `visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id` FOREIGN KEY (`dealer_visitor_id`) REFERENCES `dealer_visitors` (`id`),
CONSTRAINT `visitor_page_page_id_246babdf_fk_web_page_id` FOREIGN KEY (`page_id`) REFERENCES `web_page` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=13626649 DEFAULT CHARSET=latin1;
dealer_visitors schema:
CREATE TABLE `dealer_visitors` (
`id` int NOT NULL AUTO_INCREMENT,
`visit_date` datetime(6) DEFAULT NULL,
`added_at` datetime(6) DEFAULT NULL,
`updated_at` datetime(6) DEFAULT NULL,
`dealer_id` int NOT NULL,
`visitor_id` int NOT NULL,
`type` int DEFAULT NULL,
`notes` longtext,
`location` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dealer_visitors_dealer_id_306e2202_fk_dealer_id` (`dealer_id`),
KEY `dealer_visitors_visitor_id_27ae498e_fk_visitor_id` (`visitor_id`),
KEY `dealer_visitors_type_af0f7d79` (`type`),
KEY `dealer_visitors_visit_date_f2b138c9` (`visit_date`),
CONSTRAINT `dealer_visitors_dealer_id_306e2202_fk_dealer_id` FOREIGN KEY (`dealer_id`) REFERENCES `dealer` (`id`),
CONSTRAINT `dealer_visitors_visitor_id_27ae498e_fk_visitor_id` FOREIGN KEY (`visitor_id`) REFERENCES `visitor` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1524478 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Running EXPLAIN ANALYZE on the query gives me the following (EXPLAIN output not included here).
For this query:
SELECT COUNT(*) AS `__count`
FROM visitor_page vp JOIN
dealer_visitors dv
ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time BETWEEN '2021-02-01 05:51:00' AND '2021-03-21 05:50:00' AND
dv.dealer_id = 15;
The best indexes are on dealer_visitors(dealer_id, id) and visitor_page(dealer_visitor_id, date_time); note that date_time is a visitor_page column, so the composite index that includes it belongs on that table.
An index on date_time alone helps a bit, but you are retrieving about a month and a half of data, and that can be a lot of rows to process. Having dealer_id as the first column of the dealer_visitors index restricts the work to only the rows for that dealer.
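Expressed as DDL, that suggestion might look like the following (the index names are illustrative):

CREATE INDEX idx_dv_dealer_id ON dealer_visitors (dealer_id, id);
CREATE INDEX idx_vp_visitor_time ON visitor_page (dealer_visitor_id, date_time);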
Depending on the distribution of the data, the Optimizer might pick one of the tables to start with, or pick the other. So, let's provide optimal indexes for each case:
ON `visitor_page`.`dealer_visitor_id` = `dealer_visitors`.`id`
WHERE `visitor_page`.`date_time` BETWEEN ...
AND `dealer_visitors`.`dealer_id` = 15
Starting with visitor_page:
visitor_page: INDEX(date_time) -- (already exists)
dealer_visitors: (already has PRIMARY KEY(id))
Starting with dealer_visitors:
dealer_visitors: INDEX(dealer_id) -- (already exists)
visitor_page: INDEX(dealer_visitor_id, date_time) -- in this order
and drop visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id as now being redundant (the new composite index starts with the same column, so it can also back the foreign key).
The net is to add one index and drop one index.
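As a sketch in DDL (the new index name is my own; MySQL allows dropping the old FK-backing index here because the new composite index has dealer_visitor_id as its left-most column):

ALTER TABLE visitor_page
  ADD INDEX idx_vp_visitor_time (dealer_visitor_id, date_time),
  DROP INDEX visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id;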
Materialized view -- It is often best for Data Warehouse reports to build and incrementally maintain a "summary table" (a "materialized view"). The very odd date range (1 month + 20 days, less a minute) makes this clumsy to do; typically it is handy to build the table on whole days. If you can shift to daily (or hourly) boundaries, then see http://mysql.rjweb.org/doc.php/summarytables
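A minimal sketch of such a daily summary table (the table and column names are my own, not from the original schema):

CREATE TABLE page_counts_daily (
  dealer_id int NOT NULL,
  day date NOT NULL,
  page_count int NOT NULL,
  PRIMARY KEY (dealer_id, day)
);

-- Refreshed incrementally, e.g. once per day for yesterday's rows:
INSERT INTO page_counts_daily (dealer_id, day, page_count)
SELECT dv.dealer_id, DATE(vp.date_time), COUNT(*)
FROM visitor_page vp
JOIN dealer_visitors dv ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time >= CURRENT_DATE - INTERVAL 1 DAY
  AND vp.date_time < CURRENT_DATE
GROUP BY dv.dealer_id, DATE(vp.date_time);

The report query then just sums page_count over the desired dealer and day range, touching a few dozen rows instead of millions.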
Something else to check: How much RAM do you have? What does SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; say?
I see that the tables have different charsets/collations. This is not a problem for the query in question, but if you have other queries that JOIN on VARCHAR columns, check that they use the same collation.
I have the following SQL query (DB is MySQL 5):
select
event.full_session_id,
DATE(min(event.date)),
event_exe.user_id,
COUNT(DISTINCT event_pat.user_id)
FROM
event AS event
JOIN event_participant AS event_pat ON
event.pat_id = event_pat.id
JOIN event_participant AS event_exe on
event.exe_id = event_exe.id
WHERE
event_pat.user_id <> event_exe.user_id
GROUP BY
event.full_session_id;
"SHOW CREATE TABLE event":
CREATE TABLE `event` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`session_id` varchar(64) DEFAULT NULL,
`full_session_id` varchar(72) DEFAULT NULL,
`pat_id` int(12) DEFAULT NULL,
`exe_id` int(12) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `SESSION_IDX` (`full_session_id`),
KEY `PAT_ID_IDX` (`pat_id`),
KEY `DATE_IDX` (`date`),
KEY `SESSLOGPATEXEC_IDX` (`full_session_id`,`date`,`pat_id`,`exe_id`)
) ENGINE=MyISAM AUTO_INCREMENT=371955 DEFAULT CHARSET=utf8
"SHOW CREATE TABLE event_participant":
CREATE TABLE `event_participant` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(64) NOT NULL,
`alt_user_id` varchar(64) NOT NULL,
`username` varchar(128) NOT NULL,
`usertype` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ALL_UNQ` (`user_id`,`alt_user_id`,`username`,`usertype`),
KEY `USER_ID_IDX` (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5397 DEFAULT CHARSET=utf8
Also, the query itself seems ugly, but this is legacy code on a production system, so we are not expected to change it (at least for now).
The problem is that there are around 36 million records in the event table (on the production system), so there have been frequent crashes of the DB machine due to "Using temporary; Using filesort" processing. (They provided the EXPLAIN outputs; unfortunately I don't have them right now, but I'll try to add them to this post later.)
The customer asks for a "quick fix" by adding indexes. Currently we have separate indexes on full_session_id, pat_id, and date on event, and on user_id on event_participant.
Thus I'm thinking of creating a composite index (pat_id, exe_id, full_session_id, date) on event: this index covers the fields in the joins (the equivalent of a WHERE?), then the GROUP BY, then the aggregate (MIN) parts.
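For concreteness, the index I have in mind would be created like this (the index name is just a placeholder):

ALTER TABLE event ADD INDEX PAT_EXE_SESS_DATE_IDX (pat_id, exe_id, full_session_id, date);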
This is just an idea, because we currently don't have that kind of data volume to test with, so we are trying the best we can first.
My question is:
Could the index above help performance? (The effect is quite confusing, because I have found two really contrasting results: https://dba.stackexchange.com/questions/158385/compound-index-on-inner-join-table versus "Separate Join clause in a Composite Index", where the latter suggests that a composite index on joins won't work and the former says that it will.)
Does this path (adding indexes) have hope? Or should we forget it and just try to optimize the query instead?
Thanks in advance for your help :)
Update:
I have updated the full table description for the two related tables.
MySQL version is 5.1.69. But I think we don't need to worry about the ambiguous-data issue mentioned in the comments, because there won't be ambiguity in our data. Specifically, for each full_session_id, only one event_exe.user_id is returned (that's just business logic in the application).
So, what do you think about my 2 questions ?
I am optimising my queries and found something I can't get my head around.
I am using the following query to select a bunch of categories, combining them with an alias from a table containing old and new aliases for categories:
SELECT `c`.`id` AS `category.id`,
(SELECT `alias`
FROM `aliases`
WHERE category_id = c.id
AND `old` = 0
AND `lang_id` = 1
ORDER BY `id` DESC
LIMIT 1) AS `category.alias`
FROM (`categories` AS c)
WHERE `c`.`status` = 1 AND `c`.`parent_id` = '11';
There are only 2 categories with a parent_id of 11, so it should look up 2 categories in the aliases table.
Still, if I use EXPLAIN, it says it has to process 48 rows. The aliases table contains 1 entry per category here (in general it can be more). Everything is indexed, so if I understand correctly it should find the correct alias immediately.
Now here's the weird thing. When I don't compare the aliases against the categories from the outer query, but manually against the category ids the query returns, it does process only 1 row, as intended with the index.
So if I replace WHERE category_id = c.id with WHERE category_id IN (37, 43), the query gets faster.
The only thing I can think of is that the subquery isn't run over the results of the outer query, but before some of the filtering is done. Any kind of explanation or help is welcome!
Edit: silly me, the WHERE ... IN doesn't work, as it doesn't make a unique selection per category. The question still stands, though!
Create table schema
CREATE TABLE `aliases` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`lang_id` int(2) unsigned NOT NULL DEFAULT '1',
`alias` varchar(255) DEFAULT NULL,
`product_id` int(10) unsigned DEFAULT NULL,
`category_id` int(10) unsigned DEFAULT NULL,
`brand_id` int(10) unsigned DEFAULT NULL,
`page_id` int(10) unsigned DEFAULT NULL,
`campaign_id` int(10) unsigned DEFAULT NULL,
`old` tinyint(1) unsigned DEFAULT '0',
PRIMARY KEY (`id`),
KEY `product_id` (`product_id`),
KEY `category_id` (`category_id`),
KEY `page_id` (`page_id`),
KEY `alias_product_id` (`product_id`,`alias`),
KEY `alias_category_id` (`category_id`,`alias`),
KEY `alias_page_id` (`page_id`,`alias`),
KEY `alias_brand_id` (`brand_id`,`alias`),
KEY `alias_product_id_old` (`alias`,`product_id`,`old`),
KEY `alias_category_id_old` (`alias`,`category_id`,`old`),
KEY `alias_brand_id_old` (`alias`,`brand_id`,`old`),
KEY `alias_page_id_old` (`alias`,`page_id`,`old`),
KEY `lang_brand_old` (`lang_id`,`brand_id`,`old`),
KEY `id_category_id_lang_id_old` (`lang_id`,`old`,`id`,`category_id`)
) ENGINE=InnoDB AUTO_INCREMENT=112392 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT;
SELECT ...
WHERE x=1 AND y=2
ORDER BY id DESC
LIMIT 1
will be performed in one of several ways.
Since you have not shown us the indexes you have (SHOW CREATE TABLE), I will cover some likely cases...
INDEX(x, y, id) -- This can find the last row for that condition, so it does not need to look at more than one row.
Some other index, or no index: Scan DESCending from the last id checking each row for x=1 AND y=2, stopping when (if) such a row is found.
Some other index, or no index: Scan the entire table, checking each row for x=1 AND y=2; collect them into a temp table; sort by id; deliver one row.
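Applied to the query in question, all three tests in the subquery's WHERE are equalities (category_id = c.id, old = 0, lang_id = 1) followed by ORDER BY id DESC LIMIT 1, so the one-row case would need something like this sketch (the equality columns can be in any order as long as id comes last; the index name is mine):

ALTER TABLE aliases ADD INDEX category_lang_old_id (category_id, lang_id, old, id);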
Some of the EXPLAIN clues:
Using where -- does not say much
Using filesort -- it did a sort, apparently for the ORDER BY. (It may have been entirely done in RAM; ignore 'file'.)
Using index condition (not "Using index") -- this indicates an internal optimization in which it can check the WHERE clause more efficiently than it used to in older versions.
Do not trust the "Rows" in EXPLAIN. Often they are reasonably correct, but sometimes they are off by orders of magnitude. Here is a better way to see "how much work" is being done in a rather fast query:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
With the CREATE TABLE, I may have suggestions on how to improve the index.
I have a query that takes about 20 seconds; I would like to understand if there is a way to optimize it.
Table 1:
CREATE TABLE IF NOT EXISTS `sessions` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=9845765 ;
And table 2:
CREATE TABLE IF NOT EXISTS `access` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`session_id` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `session_id` (`session_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=9467799 ;
Now, what I am trying to do is count all the access records connected to all the sessions of one user, so my query is:
SELECT COUNT(*)
FROM access
INNER JOIN sessions ON access.session_id = sessions.id
WHERE sessions.user_id = '6';
It takes almost 20 seconds... and for user_id 6 there are about 3 million sessions stored.
Is there anything I can do to optimize that query?
Change this line in the sessions table:
KEY `user_id` (`user_id`)
To this:
KEY `user_id` (`user_id`, `id`)
What this will do for you is allow the query to be completed entirely from the index, without going back to the table. As it is, MySQL does an index scan on the sessions table for your user_id and, for each entry, goes back to the table row to find the id for the join to the access table. By including id in the index, it can skip going back to the table.
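As a single statement, the change might look like this (keeping the existing index name):

ALTER TABLE sessions DROP INDEX user_id, ADD INDEX user_id (user_id, id);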
Sadly, this will make your inserts into that table slower, and that may be a big deal, given that just one user has 3 million sessions. SQL Server and Oracle address this by letting you include the id column in the index without actually indexing on it, saving a little work at insert time, and also by letting you specify a lower fill factor for the index, reducing the need to rebuild or reorder the index on insert; MySQL doesn't support these.
I need to optimize the indexes in a table that stores more than 10 million rows. The particularly time-consuming query takes up to 10 seconds to run (its WHERE clause filters out only about 2 million rows, so roughly 8 million must be grouped). I have created a few indexes (some of them composite, some simpler) and tried to find out how to speed this up. Perhaps I'm doing something wrong. MySQL is using the optimized_5 index (based on EXPLAIN).
Here is the table's structure and the query:
CREATE TABLE IF NOT EXISTS `geo_reverse` (
`fid` mediumint(8) unsigned NOT NULL,
`tablename` enum('table1','table2') NOT NULL default 'table1',
`geo_continent` varchar(2) NOT NULL,
`geo_country` varchar(2) NOT NULL,
`geo_region` varchar(8) NOT NULL,
`geo_city` mediumint(8) unsigned NOT NULL,
`type` varchar(30) NOT NULL,
PRIMARY KEY (`fid`,`tablename`,`geo_continent`,`geo_country`,`geo_region`,`geo_city`),
KEY `geo_city` (`geo_city`),
KEY `fid` (`fid`),
KEY `geo_region` (`geo_region`,`geo_city`),
KEY `optimized` (`tablename`,`type`,`geo_continent`,`geo_country`,`geo_region`,`geo_city`,`fid`),
KEY `optimized_2` (`fid`,`tablename`),
KEY `optimized_3` (`type`,`geo_city`),
KEY `optimized_4` (`geo_city`,`tablename`),
KEY `optimized_5` (`tablename`,`type`,`geo_city`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
An example query:
SELECT type, COUNT(*) AS objects FROM geo_reverse WHERE tablename = 'table1' AND geo_city IN (5847207,5112771,4916894,...) GROUP BY type
Do you have any idea of how to speed the computation up?
I would use the following index: (geo_city, tablename, type). geo_city is obviously more selective than tablename, so it should be on the left. After those conditions are applied, the remaining rows are already ordered by type for the grouping.
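As DDL (the index name is illustrative):

ALTER TABLE geo_reverse ADD INDEX geo_city_tablename_type (geo_city, tablename, type);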