MySQL Recursive SQL select takes too long to execute

I have a table whose records reference related records as their "sponsors", and I am using this SELECT to obtain a list of 10 records:
SELECT A.ref_user_id, A.ref_user_id_sponsor,
IF(A.businessname IS NULL OR A.businessname = '', LTRIM(RTRIM(CONCAT(A.name, ' ', A.surname))), A.businessname) AS namesurnamesponsor, A.level
FROM (
with recursive parent_users (ref_user_id, ref_user_id_sponsor, name, surname, businessname, level) AS (
SELECT ref_user_id, ref_user_id_sponsor, name, surname, businessname, 1 level
FROM users_details
WHERE ref_user_id = XXXXXXXXX
union all
SELECT t.ref_user_id, t.ref_user_id_sponsor, t.name, t.surname, t.businessname, level + 1
FROM users_details t INNER JOIN parent_users pu
ON t.ref_user_id = pu.ref_user_id_sponsor
)
SELECT * FROM parent_users ) A LIMIT 10
but it seems to take too long to extract just 10 records from a table of only 120 records in total. I also tried to create an index to speed it up:
CREATE INDEX idx_ref_user_id_ref_user_id_sponsor ON users_details (ref_user_id, ref_user_id_sponsor)
but even creating the index takes too long, although it should help the SELECT return just those 10 results.
Do you have a suggestion? An alternative index? Or even an alternative way to obtain the 10 upline sponsors of the record selected by WHERE ref_user_id = XXXXXXXXX? Thanks to all! Cheers
EDIT: I ran EXPLAIN SELECT on the above query and obtained this result:
and the table structure is:
CREATE TABLE IF NOT EXISTS users_details (
ID bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT,
ref_user_id bigint(20) UNSIGNED NOT NULL,
ref_user_id_sponsor bigint(20) UNSIGNED DEFAULT NULL,
sponsorship_code varchar(6) NOT NULL,
name varchar(250) NOT NULL,
surname varchar(250) NOT NULL,
businessname varchar(300) DEFAULT NULL,
activate tinyint(1),
PRIMARY KEY (ID),
CONSTRAINT fk_users_details_id
FOREIGN KEY (ref_user_id)
REFERENCES users(ID)
ON DELETE CASCADE,
CONSTRAINT fk_users_ref_user_id_sponsor
FOREIGN KEY (ref_user_id_sponsor)
REFERENCES users(ID)
)ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
and a SELECT COUNT(*) FROM users_details returns just 120 records. The users table just contains ID, login, pwd and a couple of other columns.
Edit 2: Is there perhaps a better SELECT to obtain the same result, i.e. a list of a user's upline following its sponsor references? Or would it be better just to add a
CREATE INDEX idx_ref_user_id_ref_user_id_sponsor ON users_details (ref_user_id, ref_user_id_sponsor)
to speed it up?
Edit 3:
Here is a screenshot of 20 of those users, obviously with sensitive information hidden; as you can see, ref_user_id and ref_user_id_sponsor are tightly linked:
As for the execution time, the query seems never to finish, which is weird because some time ago, with less data (about 60 users instead of 120), it returned the 10 users quickly enough.
Is there perhaps an alternative recursive SELECT that would give the same result, just to check whether the WITH RECURSIVE clause is the cause? Or should I create an index on the two columns ref_user_id and ref_user_id_sponsor to speed it up?
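One thing that may be worth trying is to bound the recursion inside the CTE itself instead of relying on the outer LIMIT 10. The sketch below is only an assumption about what could help: if the sponsor references happen to form a loop, or if ref_user_id is not unique in users_details, the recursive part can keep producing rows, and a level check stops it after 10 levels either way. @start_user_id is a hypothetical placeholder for the XXXXXXXXX value, and idx_ref_user_id is a hypothetical index name.

```sql
-- Hypothetical placeholder for the user whose upline is wanted.
SET @start_user_id = 1;

-- Hypothetical single-column index; the recursive join looks rows up by ref_user_id.
CREATE INDEX idx_ref_user_id ON users_details (ref_user_id);

WITH RECURSIVE parent_users (ref_user_id, ref_user_id_sponsor, name, surname, businessname, level) AS (
    -- Anchor: the selected user, at level 1.
    SELECT ref_user_id, ref_user_id_sponsor, name, surname, businessname, 1
    FROM users_details
    WHERE ref_user_id = @start_user_id
    UNION ALL
    -- Recursive step: move up to the sponsor of the previous row.
    SELECT t.ref_user_id, t.ref_user_id_sponsor, t.name, t.surname, t.businessname, pu.level + 1
    FROM users_details t
    INNER JOIN parent_users pu ON t.ref_user_id = pu.ref_user_id_sponsor
    WHERE pu.level < 10   -- stop at level 10 even if the sponsor chain loops
)
SELECT ref_user_id,
       ref_user_id_sponsor,
       IF(businessname IS NULL OR businessname = '',
          TRIM(CONCAT(name, ' ', surname)),
          businessname) AS namesurnamesponsor,
       level
FROM parent_users
ORDER BY level
LIMIT 10;
```

If this version returns quickly while the original does not, that would suggest the data rather than the WITH RECURSIVE clause is the problem.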

Related

Get AVG value for each selected row from MySQL table with 500m rows?

I have one table with 500 million records In MySQL 8.x. My regular query to get a certain result set is 200ms, but if I try to get an AVG value the performance drops to 30s+.
Structure:
KW_ID | DATE | SERP | MERCHANT_ID | ARTICLE_ID
-- auto-generated definition
create table merchants_keyword_serps
(
KW_ID mediumint unsigned null,
MERCHANT_ID tinyint unsigned null,
ARTICLE_ID char(10) null,
SERP tinyint unsigned null,
DATE date null,
constraint `unique`
unique (MERCHANT_ID, ARTICLE_ID, KW_ID, DATE),
constraint fk_serps_kwd_t
foreign key (MERCHANT_ID, ARTICLE_ID) references merchants_product_catalog (MERCHANT_ID, ARTICLE_ID)
on delete cascade,
constraint keywords
foreign key (KW_ID) references merchants_keywords (ID)
on delete cascade
);
create index merchants_keyword_serps_SERP_index
on merchants_keyword_serps (SERP);
create index mks_date
on merchants_keyword_serps (DATE);
Goal: get SERP for 20220120 and MERCHANT_ID = 2:
select
mcs.SERP
FROM merchants_keyword_serps mcs
WHERE date = 20220120
AND mcs.MERCHANT_ID = 2;
Now also get the average SERP across all shops:
select
mcs.SERP,
(
SELECT AVG(SERP)
FROM merchants_keyword_serps mcs2
WHERE mcs2.date = 20220120
AND mcs2.KW_ID = mcs.KW_ID
AND mcs2.ARTICLE_ID = mcs.ARTICLE_ID) AS SERP_AVG
from merchants_keyword_serps mcs
WHERE
date = 20220120
AND mcs.MERCHANT_ID = 2;
The expected result would be an additional column with the average SERP value for all shops with the same KW_ID, DATE, ARTICLE_ID.
Is there a way to speed that up with a different approach? The indexes are all set OK, I believe, since the standard query runs perfectly fast in under 200ms.
Where does KW_ID come from? Please provide SHOW CREATE TABLE merchants_keyword_serps.
Using 20220120 for a date is asking for trouble. (I don't see any problem yet.)
Add these:
INDEX(merchant_id, date)
INDEX(kw_id, article_id, date, serp)
and Drop these since they will be redundant:
INDEX(merchant_id)
INDEX(kw_id)
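As a rough sketch of the suggestions above, the two indexes could be added as follows and the query rerun with a proper date literal. The index names are made up for illustration, and the DROP statements are left out because the existing index names are not shown in the question.

```sql
-- Index names here are hypothetical; the column lists follow the suggestion above.
ALTER TABLE merchants_keyword_serps
    ADD INDEX idx_merchant_date (MERCHANT_ID, `DATE`),
    ADD INDEX idx_kw_article_date_serp (KW_ID, ARTICLE_ID, `DATE`, SERP);

-- Same query as in the question, with the date written as a DATE literal.
SELECT mcs.SERP,
       (SELECT AVG(mcs2.SERP)
        FROM merchants_keyword_serps mcs2
        WHERE mcs2.`DATE` = '2022-01-20'
          AND mcs2.KW_ID = mcs.KW_ID
          AND mcs2.ARTICLE_ID = mcs.ARTICLE_ID) AS SERP_AVG
FROM merchants_keyword_serps mcs
WHERE mcs.`DATE` = '2022-01-20'
  AND mcs.MERCHANT_ID = 2;
```

The second index contains every column the correlated subquery touches, so the subquery should be answerable from the index alone; comparing EXPLAIN output before and after would confirm whether it is actually used.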

In MySQL is it faster to execute one JOIN + one LIKE statement or two JOINs?

I have to create a cron job, which is simple in itself, but because it will run every minute I'm worried about performance. I have two tables, one has user names and the other has details about their network. Most of the time a user will belong to just one network, but it is theoretically possible that they might belong to more, but even then very few, maybe two or three. So, in order to reduce the number of JOINs, I saved the network ids separated by | in a field in the user table, e.g.
|1|3|9|
The (simplified for this question) user table structure is
CREATE TABLE `users` (
`u_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`userid` VARCHAR(500) NOT NULL UNIQUE,
`net_ids` VARCHAR(500) NOT NULL DEFAULT '',
PRIMARY KEY (`u_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The (also simplified) network table structure is
CREATE TABLE `network` (
`n_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`netname` VARCHAR(500) NOT NULL UNIQUE,
`login_time` DATETIME DEFAULT NULL,
`timeout_mins` TINYINT UNSIGNED NOT NULL DEFAULT 10,
PRIMARY KEY (`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I have to send a warning when timeout occurs, my query is
SELECT N.netname, N.timeout_mins, N.n_id, U.userid FROM
(SELECT netname, timeout_mins, n_id FROM network
WHERE is_open = 1 AND notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, timeout_mins, login_time), NOW()) < 60) AS N
INNER JOIN users AS U ON U.net_ids LIKE CONCAT('%|', N.n_id, '|%');
I made N a subquery to reduce the number of rows joined. But I would like to know whether it would be faster to add a third table with u_id and n_id as columns, remove the net_ids column from users, and then do a join on all three tables, because I read that using LIKE slows things down.
Which is the most efficient query to use in this case? One JOIN and a LIKE, or two JOINs?
P.S. I did some experimentation and the initial values for using two JOINS are higher than using a JOIN and a LIKE. However, repeated runs of the same query seems to speed things up a lot, I suspect something is cached somewhere, either in my app or the database, and both become comparable, so I did not find this data satisfactory. It also contradicts what I was expecting based on what I have been reading.
I used this table:
CREATE TABLE `user_net` (
`u_id` BIGINT UNSIGNED NOT NULL,
`n_id` BIGINT UNSIGNED NOT NULL,
INDEX `u_id` (`u_id`),
FOREIGN KEY (`u_id`) REFERENCES `users`(`u_id`),
INDEX `n_id` (`n_id`),
FOREIGN KEY (`n_id`) REFERENCES `network`(`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
and this query:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid FROM
(SELECT netname, timeout_mins, n_id FROM network
WHERE is_open = 1 AND notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, timeout_mins, login_time), NOW()) < 60) AS N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id;
You should define composite indexes for the user_net table. One of them can (and should) be the primary key.
CREATE TABLE `user_net` (
`u_id` BIGINT UNSIGNED NOT NULL,
`n_id` BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (`u_id`, `n_id`),
INDEX `uid_nid` (`n_id`, `u_id`),
FOREIGN KEY (`u_id`) REFERENCES `users`(`u_id`),
FOREIGN KEY (`n_id`) REFERENCES `network`(`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I would also rewrite your query to:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, N.timeout_mins, N.login_time), NOW()) < 60
While your subquery will probably not hurt much, there is no need for it.
Note that the last condition cannot use an index, because you have to combine two columns. If your MySQL version is at least 5.7.6 you can define an indexed virtual (calculated) column.
CREATE TABLE `network` (
`n_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`netname` VARCHAR(500) NOT NULL UNIQUE,
`login_time` DATETIME DEFAULT NULL,
`timeout_mins` TINYINT UNSIGNED NOT NULL DEFAULT 10,
`is_open` TINYINT UNSIGNED,
`notify` TINYINT UNSIGNED,
`timeout_dt` DATETIME AS (`login_time` + INTERVAL `timeout_mins` MINUTE),
PRIMARY KEY (`n_id`),
INDEX (`timeout_dt`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now change the query to:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND N.timeout_dt > NOW() - INTERVAL 60 SECOND
and it will be able to use the index.
You can also try to replace
INDEX (`timeout_dt`)
with
INDEX (`is_open`, `notify`, `timeout_dt`)
and see if it is of any help.
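If the network table already exists, the generated column and the index can presumably be added in place rather than by recreating the table. This is only a sketch: it assumes the is_open and notify columns are already present (the question's query uses them), and idx_open_notify_timeout is a made-up index name.

```sql
-- Add the virtual (computed on read) column described above.
ALTER TABLE network
    ADD COLUMN timeout_dt DATETIME
        AS (login_time + INTERVAL timeout_mins MINUTE) VIRTUAL;

-- Composite index variant from the suggestion above; the name is hypothetical.
ALTER TABLE network
    ADD INDEX idx_open_notify_timeout (is_open, notify, timeout_dt);
```

Running EXPLAIN on the rewritten query afterwards should show whether the optimizer picks the new index.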
Reformulate to avoid hiding columns inside functions. I can't grok your date expression, but note this:
login_time < NOW() - INTERVAL timeout_mins MINUTE
If you can achieve something like that, then this index should help:
INDEX(is_open, notify, login_time)
If that is not good enough, let's see the other formulation so we can compare them.
Having stuff separated by comma (or |) is likely to be a really bad idea.
Bottom line: Assume that JOINs are not a performance problem, write the queries with as many JOINs as needed. Then let's optimize that.
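For moving from the |-separated net_ids column to the user_net table, a one-off migration along these lines might work. It is a sketch that reuses the LIKE condition from the question, so it assumes every stored list has the |1|3|9| shape with a leading and trailing separator.

```sql
-- One-time copy of the packed net_ids values into the junction table.
-- The LIKE join is slow, but it only has to run once.
INSERT INTO user_net (u_id, n_id)
SELECT U.u_id, N.n_id
FROM users AS U
INNER JOIN network AS N
        ON U.net_ids LIKE CONCAT('%|', N.n_id, '|%');
```

After checking the copied rows, the net_ids column can be dropped and the cron query can use the two-JOIN form.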

Calling Data from 2 tables

I am kind of new to SQL. I have 2 MySQL Tables. Below is their structure.
Key_Hash Table
CREATE TABLE `key_hash` (
`primary_key` int(11) NOT NULL,
`hash` text NOT NULL,
`totalNumberOfWords` int(11) NOT NULL,
PRIMARY KEY (`primary_key`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
--
Key_Word Table
CREATE TABLE `key_word` (
`primary_key` bigint(20) NOT NULL AUTO_INCREMENT,
`indexVal` int(11) NOT NULL,
`hashed_word` char(3) NOT NULL,
PRIMARY KEY (`primary_key`),
KEY `hashed_word` (`hashed_word`,`indexVal`)
) ENGINE=InnoDB AUTO_INCREMENT=28570982 DEFAULT CHARSET=latin1
Now, below is my query
SELECT `indexVal`, COUNT(`indexVal`) FROM `key_word` WHERE `hashed_word` IN ('001','01v') GROUP BY `indexVal` LIMIT 100;
When you run the above query, you will get an output like below
The important thing to note here is that indexVal in the key_word table holds the same values as primary_key in the key_hash table (I think it could be a foreign key?). In other words, primary_key values in key_hash appear as indexVal in key_word. But please note that indexVal can appear any number of times in key_word because it is not a primary key there.
OK, so this is not exactly the query I need. I need to count how many times each unique indexVal appears in the above search, and divide that by the corresponding value of key_hash.totalNumberOfWords.
I am providing few examples below.
Imagine I ran the above query, now the result is generated. It says
indexVal 0 appeared 10 times in search
indexVal 1 appeared 20 times in search
indexVal 300 appeared 20,000 times in search
Now keep in mind that key_hash.primary_key = key_word.indexVal. First I look up the key_hash.primary_key that matches key_word.indexVal and get the associated key_hash.totalNumberOfWords. Then I divide the COUNT() from the above query by this key_hash.totalNumberOfWords and multiply the result by 100 (to get a percentage). Below is a query I tried, but it has errors.
SELECT `indexVal`,COUNT(`indexVal`), (COUNT(`indexVal`) / (select `numberOfWords` from `key_hash` where `primary_key`=`key_word.indexVal`)*100) FROM `key_word` WHERE `hashed_word` IN ('001','01v') GROUP BY `indexVal` LIMIT 100;
How can I do this job?
EDIT
This is how the key_hash table looks like
This is how the key_word table looks like
You can use a JOIN instead of a sub-query
SELECT w.indexVal
, COUNT(w.indexVal)
, COUNT(w.indexVal) / MAX(h.totalNumberOfWords) * 100
FROM key_word w
INNER JOIN key_hash h ON h.primary_key = w.indexVal
WHERE w.hashed_word IN ('001','01v')
GROUP BY indexVal
LIMIT 100
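For comparison, the sub-query form from the question can probably be made to work as well. The sketch below assumes the column is named totalNumberOfWords, as in the CREATE TABLE statement, and quotes nothing, because `key_word.indexVal` inside a single pair of backticks is read as one identifier rather than table.column. The aliases hits and pct are added for illustration.

```sql
SELECT w.indexVal,
       COUNT(*) AS hits,
       -- Correlated lookup of the word total for this indexVal.
       COUNT(*) / (SELECT h.totalNumberOfWords
                   FROM key_hash h
                   WHERE h.primary_key = w.indexVal) * 100 AS pct
FROM key_word w
WHERE w.hashed_word IN ('001','01v')
GROUP BY w.indexVal
LIMIT 100;
```

One difference from the JOIN version: an indexVal with no matching key_hash row still appears here, with a NULL percentage, whereas the INNER JOIN drops it.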

Mysql Group By implementation details - which row mysql chooses in a Group By query without operators?

I have a table with multiple rows per "website_id"
CREATE TABLE `WebsiteStatus` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`tagCheckResult` int(11) DEFAULT NULL,
`website_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `IX_website_id` (`website_id`)
) ENGINE=InnoDB;
I am trying to select the latest entry per website_id
-- This creates a temporary table with the last entry per website_id, and joins it
-- to get the entire row
SELECT *
FROM `WebsiteStatus` ws1
JOIN (
SELECT MAX(id) max_id, website_id FROM `WebsiteStatus`
GROUP BY website_id) ws2
ON ws1.id = ws2.max_id
Now, I know the correct way to get the last row per website_id is as above. My question is: I also tried the following simpler query, and it seemed to return exactly the same results as above:
SELECT * FROM `WebsiteStatus`
GROUP BY website_id
ORDER BY website_id DESC
I know that in principle GROUP BY without operators (e.g. MAX), like I do in my 2nd query can return any of the relevant rows ... but in practice it returns the last one. Is there an implementation detail in mysql that guarantees this is always the case?
(Just asking for academic curiosity, I know the 1st query is "more correct").
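As far as the MySQL manual goes, there is no such guarantee: with ONLY_FULL_GROUP_BY disabled the server is free to choose any value from each group for the non-aggregated columns, and with it enabled (the default since 5.7.5) the second query is rejected outright. If a shorter deterministic form is wanted on MySQL 8.0 or later, a window function is one option. This is a sketch, and the names ranked and rn are made up for illustration.

```sql
SELECT id, tagCheckResult, website_id
FROM (
    SELECT ws.*,
           -- Number the rows of each website from newest (highest id) to oldest.
           ROW_NUMBER() OVER (PARTITION BY website_id ORDER BY id DESC) AS rn
    FROM WebsiteStatus ws
) ranked
WHERE rn = 1;
```

Unlike the bare GROUP BY, the row picked for each website_id is fixed by the ORDER BY inside the window.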

How could I optimise this MySQL query?

I have a table that stores a pupil_id, a category and an effective date (amongst other things). The dates can be past, present or future. I need a query that will extract a pupil's current status from the table.
The following query works:
SELECT *
FROM pupil_status
WHERE (status_pupil_id, status_date) IN (
SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < NOW() -- to ensure we ignore the "future status"
GROUP BY status_pupil_id );
In MySQL, the table is defined as follows:
CREATE TABLE IF NOT EXISTS `pupil_status` (
`status_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`status_pupil_id` int(10) unsigned NOT NULL, -- a foreign key
`status_category_id` int(10) unsigned NOT NULL, -- a foreign key
`status_date` datetime NOT NULL, -- effective date/time of status change
`status_modify` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`status_staff_id` int(10) unsigned NOT NULL, -- a foreign key
`status_notes` text NOT NULL, -- notes detailing the reason for status change
PRIMARY KEY (`status_id`),
KEY `status_pupil_id` (`status_pupil_id`,`status_category_id`),
KEY `status_pupil_id_2` (`status_pupil_id`,`status_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1409 ;
However, with 950 pupils and just over 1400 statuses in the table, the query takes 0.185 seconds to process. That is perhaps acceptable now, but I'm worried about scalability as the table grows. The production system will likely have over 10000 pupils, each with 15-20 statuses.
Is there a better way to write this query? Are there better indexes that I should have to assist the query? Please let me know.
There are a few things you could try:
1. Use an INNER JOIN instead of the WHERE ... IN
SELECT ps.*
FROM pupil_status ps
INNER JOIN
(SELECT status_pupil_id, MAX(status_date) AS status_date
FROM pupil_status
WHERE status_date < NOW()
GROUP BY status_pupil_id) x
ON ps.status_pupil_id = x.status_pupil_id
AND ps.status_date = x.status_date
2. Store the value of NOW() in a variable. I am not sure whether the DB engine optimizes NOW() into a single call, but if it doesn't, this might help a bit.
These are some suggestions however you will need to compare the query plans and see if there is any appreciable improvement or not.
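The variable idea could be sketched like this. According to the MySQL manual, NOW() already returns a constant time for the whole statement (the time at which the statement began to execute), so this is unlikely to change performance; it mainly makes the cut-off explicit and reusable. @now is a user variable introduced here for illustration.

```sql
-- Capture the cut-off time once.
SET @now := NOW();

SELECT ps.*
FROM pupil_status ps
INNER JOIN (
    -- Latest status per pupil that is already in effect.
    SELECT status_pupil_id, MAX(status_date) AS status_date
    FROM pupil_status
    WHERE status_date < @now
    GROUP BY status_pupil_id
) x ON ps.status_pupil_id = x.status_pupil_id
   AND ps.status_date = x.status_date;
```

The existing key on (status_pupil_id, status_date) covers both the grouped subquery and the join condition, so it is a natural candidate for the optimizer here.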
Based on your usage of indexes as per the Query plan, robob's suggestion above could also come in handy
Find out how long the query takes when you load the system with 10000 pupils, each with 15-20 statuses.
Only refactor if it takes too long.