MySQL Join Best Practice on Large Data - mysql

table1_shard1 (1,000,000 rows per shard x 120 shards)
id_user hash
table2 (100,000 rows)
value hash
Desired Output:
id_user hash value
I am trying to find the fastest way to associate id_user with value from the tables above.
My current query ran for 30 hours without result.
SELECT
table1_shard1.id_user, table1_shard1.hash, table2.value
FROM table1_shard1
LEFT JOIN table2 ON table1_shard1.hash=table2.hash
GROUP BY id_user
UNION
SELECT
table1_shard2.id_user, table1_shard2.hash, table2.value
FROM table1_shard1
LEFT JOIN table2 ON table1_shard2.hash=table2.hash
GROUP BY id_user
UNION
( ... )
UNION
SELECT
table1_shard120.id_user, table1_shard120.hash, table2.value
FROM table1_shard1
LEFT JOIN table2 ON table1_shard120.hash=table2.hash
GROUP BY id_user

Firstly, do you have indexes on the hash fields
I think you should merge your tables in one before the query (at least temporarily)
CREATE TEMPORARY TABLE IF NOT EXISTS tmp_shards
SELECT * FROM table1_shard1;
CREATE TEMPORARY TABLE IF NOT EXISTS tmp_shards
SELECT * FROM table1_shard2;
# ...
Then do the main query
SELECT
table1_shard120.id_user
, table1_shard120.hash
, table2.value
FROM tmp_shards AS shd
LEFT JOIN table2 AS tb2 ON (shd.hash = tb2.hash)
GROUP BY id_user
;
Not sure for the performance gain but it'll be at least more maintainable.

Related

Selecting Counts from Different Tables with a Subquery

I'm new to MySQL, and I'd like some help in setting up a MySQL query to pull some data from a few tables (~100,000 rows) in a particular output format.
This problem involves three SQL tables:
allusers : This one contains user information. The columns of interest are userid and vip
table1 and table2 contain data, but they also have a userid column, which matches the userid column in allusers.
What I'd like to do:
I'd like to create a query which searches through allusers, finds the userid of those that are VIP, and then count the number of records in each of table1 and table2 grouped by the userid. So, my desired output is:
userid | Count in Table1 | Count in Table2
1 | 5 | 21
5 | 16 | 31
8 | 21 | 12
What I've done so far:
I've created this statement:
SELECT userid, count(1)
FROM table1
WHERE userid IN (SELECT userid FROM allusers WHERE vip IS NOT NULL)
GROUP BY userid
This gets me close to what I want. But now, I want to add another column with the respective counts from table2
I also tried using joins like this:
select A.userid, count(T1.userid), count(T2.userid) from allusers A
left join table1 T1 on T1.userid = A.userid
left join table2 T2 on T2.userid = A.userid
where A.vip is not null
group by A.userid
However, this query took a very long time and I had to kill the query. I'm assuming this is because using Joins for such large tables is very inefficient.
Similar Questions
This one is looking for a similar result as I am, but doesn't need nearly as much filtering with subqueries
This one sums up the counts across tables, while I need the counts separated into columns
Could someone help me set up the query to generate the data I need?
Thanks!
You need to pre-aggregate first, then join, otherwise the results will not be what you expect if a user has several rows in both table1 and table2. Besides, pre-aggregation is usually more efficient than outer aggregation in a situation such as yours.
Consider:
select a.userid, t1.cnt cnt1, t2.cnt cnt2
from allusers a
left join (select userid, count(*) cnt from table1 group by userid) t1
on t1.userid = a.userid
left join (select userid, count(*) cnt from table2 group by userid) t2
on t2.userid = a.userid
where a.vip is not null
This is a case where I would recommend correlated subqueries:
select a.userid,
(select count(*) from table1 t1 where t1.userid = a.userid) as cnt1,
(select count(*) from table2 t2 where t2.userid = a.userid) as cnt2
from allusers a
where a.vip is not null;
The reason that I recommend this approach is because you are filtering the alllusers table. That means that the pre-aggregation approach may be doing additional, unnecessary work.

updating million rows based on resultset from previous query

I need to update a table consisting of million rows
there are two tables table1 and table2
SELECT ID
FROM (
select ID from table1 where<condition>
) as result1
INNER JOIN table2 ON result1.field=table2.field
GROUPBY table2.field
HAVING <condtion>
and on this #resultset1 ID, I have to update table 1
UPDATE table1
SET x=true
where ID EXISTS IN (#resultset1)
there are millions of rows in both table. how do i do it?
And Can anyone say whats wrong with this, i am trying some alternative over join
UPDATE table1 t1
SET x=true
WHERE <condition> AND EXISTS(
SELECT* FROM (
SELECT *
FROM table2 t2
WHERE t2.field = t1.field
) AS result
WHERE<condition on resultset field>
);
Output the results of the first query to a file.
Convert the output to update statements, one update per id (using sed or whatever)
Split the statements into separate files, maybe a 1000 per file (using split or whatever)
Run each file as an sql script with a pause (eg 10 seconds) between each execution (to allow logs to update etc and spread the load), using a simple runner script the loops over the files
Why can't you simply do like below. Though I don't really see a need of group by and having provided if this is your actual query.
UPDATE table1
SET x=true
where ID IN (
SELECT ID
FROM (
select ID from table1 where<condition>
) as result1
INNER JOIN table2 ON result1.field=table2.field
GROUPBY table2.field
HAVING <condtion>
)
Another good option would be JOIN both result set and do the UPDATE like
UPDATE table1 t1
JOIN
(
SELECT ID
FROM (
select ID from table1 where<condition>
) as result1
INNER JOIN table2 ON result1.field=table2.field
GROUPBY table2.field
HAVING <condtion>
) X ON t1.ID = X.ID
SET t1.x=true
EDIT:
You can as well store the result of fist query in a temporary table (As already suggested by a_horse_with_no_name) and then do the UPDATE using a JOIN with the temporary table. Somethig like below
create temporary table idtemp(ID INT);
insert into idtemp
SELECT ID
FROM (
select ID from table1 where<condition>
) result1
INNER JOIN table2 ON result1.field=table2.field
GROUPBY table2.field
HAVING <condtion>
Finally do the update like
UPDATE table1 t1
JOIN idtemp it ON t1.ID = it.ID
SET t1.x=true

select from table based on select distinct from another table

the case is that I need to select a field distinct from table1 (no duplicates) and use the result as a key to select from another table2. And I need this to be in one query. Is this possible?!
table1: hID, hName, hLocation
table2: hID, hFrom, hTo, hRate, hRoomType, hMeals
I want to correct version of this query:
SELECT
*
FROM
table1
JOIN (
DISTINCT
hID
FROM
table2
WHERE
hRoomType = Double Room
ON table1.hID = table2.hID)
expected result: all hotels that offer Double Room thanks much –
thanks for help!
Your question is quite vague and confusing. Is this what you are looking for:
SELECT hID, name, location
FROM table2
INNER JOIN table1
ON table1.hID = table2.hID
GROUP BY table2.hID;
Here is a skeleton to achieve this:
SELECT
* -- Don't forget to list the requested fields instead of using `*`!
FROM (
-- This is the distinct list from table1
SELECT DISTINCT
id
FROM
table1 T1
) DT1
INNER JOIN table2 T2
ON T1.id = T2.reference_to_t1_id
Another solution if you don't want to retrieve any columns from table1:
SELECT
* -- Don't forget to list the requested fields instead of using `*`!
FROM
table2 T2
WHERE
-- Sais that get all record from table2 where this condition matches
-- at least one record
EXISTS (
SELECT 1 FROM table1 T1 WHERE T1.id = T2.reference_to_t1_id
)
For your tables and question
SELECT
hID, hName, hLocation
FROM
table1 T1
WHERE
EXISTS (
SELECT 1 FROM
table2 T2
WHERE
T1.hID = T2.hID
AND T.hRoomType = 'Double' -- Assuming that this is the definition of double rooms
)

percentile by COUNT(DISTINCT) with correlated WHERE only works with a view (or without DISTINCT)

I've got a weird one, and I don't know if it's my syntax (which seems straightforward) or a bug (or just unsupported).
Here's my query that works but is needlessly slow:
UPDATE table1
SET table1column1 =
(SELECT COUNT(DISTINCT table2column1) FROM table2view WHERE table2column1 <= (SELECT table2column1 FROM table2 WHERE table2.id = table1.id) )
/
(SELECT COUNT(DISTINCT table2column1) FROM table2)
+ (SELECT COUNT(DISTINCT table2column2) FROM table2view WHERE table2column2 <= (SELECT table2column2 FROM table2 WHERE table2.id = table1.id) )
/
(SELECT COUNT(DISTINCT table2column2) FROM table2)
+ (SELECT COUNT(DISTINCT table2column3) FROM table2view WHERE table2column3 <= (SELECT table2column3 FROM table2 WHERE table2.id = table1.id) )
/ (SELECT COUNT(DISTINCT table2column3) FROM table2);
It's just the sum of three percentiles (of table2column1, table2column2, and table2column3) with duplicates removed.
Here's where it gets weird. I have to use a view for this to work on the subquery with the WHERE or it will only UPDATE the first row of table1, and set the rest of the rows' table1column1 to 0. That table2view is an exact duplicate of table2. Yeah, weird.
If I don't use DISTINCT, I can do it without the view. Does that make sense? Note: I have to have DISTINCT because I have lots of duplicates.
I tried making it SELECT only from the view, but that slowed it down worse.
Does anyone know what the problem is and the best way to rework this query so it doesn't take so long? It's in a TRIGGER, and the updated data is pretty on demand.
Many thanks in advance!
Details
I'm testing the speed in phpMyAdmin's command line.
I'm pretty sure the degradation is coming from the view since the more of the view and the less of the actual table I use, the slower it gets.
When I do the one without DISTINCT, it's lightning fast.
Only works on views?
OK, so I just set up a copy of table2. I tried first to do the original query substituting the view with the copy. No go.
I tried to do the query below with the copy instead of the view. No go.
Hopefully the introduction of these constants will better show what I'm trying to do.
SET #table2column1_distinct_count = (SELECT COUNT(DISTINCT table2column1) FROM table2);
SET #table2column2_distinct_count = (SELECT COUNT(DISTINCT table2column2) FROM table2);
SET #table2column3_distinct_count = (SELECT COUNT(DISTINCT table2column3) FROM table2);
UPDATE table1, table2
SET table1.table1column1 = (SELECT COUNT(DISTINCT table2column1) FROM table2view WHERE table2column1 <= table2.table2column1) / #table2column1_distinct_count
+ (SELECT COUNT(DISTINCT table2column2) FROM table2view WHERE table2column2 <= table2.table2column2) / #table2column2_distinct_count
+ (SELECT COUNT(DISTINCT table2column3) FROM table2view WHERE table2column3 <= table2.table2column3) / #table2column3_distinct_count
WHERE table1.id = table2.id;
Again, when I use table2 instead of the table2view, it only updates the first row properly and sets all other rows' table1.table1column1 = 0.
Math
I'm trying to set table1.table1column1 = to the sum of the percentiles of table2column1, table2column2, and table2column3 by id.
I do a percentile by (counting the distinct values of a table2columnX <= to the current table2columnX ) / (the total count of distinct table2columnXs).
I use DISTINCT to get rid of the excessive duplicates.
View
Here's the SELECT for the view. Does this help?
CREATE VIEW myTable.table2view AS SELECT
table2.table2column1 AS table2column1,
table2.table2column2 AS table2column2,
table2.table2column2 AS table2column3,
FROM table2
GROUP BY table2.id;
Is there something special about the GROUP BY in the view's SELECT that makes this work (that I'm not seeing)?
I would probably say that the query is slow because it is repeatedly accessing the table when the trigger fires.
I am no SQL expert but I have tried to put together a query using temporary tables. You can see if it helps speed up the query. I have used different but similar sounding column names in my code sample below.
EDIT : There was a calculation error in my earlier code. Updated now.
SELECT COUNT(id) INTO #no_of_attempts from tb2;
-- DROP TABLE IF EXISTS S1Percentiles;
-- DROP TABLE IF EXISTS S2Percentiles;
-- DROP TABLE IF EXISTS S3Percentiles;
CREATE TEMPORARY TABLE S1Percentiles (
s1 FLOAT NOT NULL,
percentile FLOAT NOT NULL DEFAULT 0.00
);
CREATE TEMPORARY TABLE S2Percentiles (
s2 FLOAT NOT NULL,
percentile FLOAT NOT NULL DEFAULT 0.00
);
CREATE TEMPORARY TABLE S3Percentiles (
s3 FLOAT NOT NULL,
percentile FLOAT NOT NULL DEFAULT 0.00
);
INSERT INTO S1Percentiles (s1, percentile)
SELECT A.s1, ((COUNT(B.s1)/#no_of_attempts)*100)
FROM (SELECT DISTINCT s1 from tb2) A
INNER JOIN tb2 B
ON B.s1 <= A.s1
GROUP BY A.s1;
INSERT INTO S2Percentiles (s2, percentile)
SELECT A.s2, ((COUNT(B.s2)/#no_of_attempts)*100)
FROM (SELECT DISTINCT s2 from tb2) A
INNER JOIN tb2 B
ON B.s2 <= A.s2
GROUP BY A.s2;
INSERT INTO S3Percentiles (s3, percentile)
SELECT A.s3, ((COUNT(B.s3)/#no_of_attempts)*100)
FROM (SELECT DISTINCT s3 from tb2) A
INNER JOIN tb2 B
ON B.s3 <= A.s3
GROUP BY A.s3;
-- select * from S1Percentiles;
-- select * from S2Percentiles;
-- select * from S3Percentiles;
UPDATE tb1 A
INNER JOIN
(
SELECT B.tb1_id AS id, (C.percentile + D.percentile + E.percentile) AS sum FROM tb2 B
INNER JOIN S1Percentiles C
ON B.s1 = C.s1
INNER JOIN S2Percentiles D
ON B.s2 = D.s2
INNER JOIN S3Percentiles E
ON B.s3 = E.s3
) F
ON A.id = F.id
SET A.sum = F.sum;
-- SELECT * FROM tb1;
DROP TABLE S1Percentiles;
DROP TABLE S2Percentiles;
DROP TABLE S3Percentiles;
What this does is that it records the percentile for each score group and then finally just updates the tb1 column with the requisite data instead of recalculating the percentile for each student row.
You should also index columns s1, s2 and s3 for optimizing the queries on these columns.
Note: Please update the column names according to your db schema. Also note that each percentile calculation has been multiplied by 100 as I believe that percentile is usually calculated that way.

Merge 3 structurally identical tables if value in date column exists in all 3

I have a mysql db/server that has 3 tables that are identical in structure:
west, midwest and east.
I would like to create a national table with the sum of the columns of those regional tables, ONLY if the datetime row matches all 3 tables. That way if one hour is missing in a particular table, I don't end up summing 2 regions and calling it national.
Here is how I am thinking to do it:
All 3 tables have a datetime column.
Merge the tables (union?) only if the datetime row exists in all 3 tables.
Aggregate (sum) the columns grouped by datetime column. I would of course be summing all columns which carry int values.
I am not sure how to run a query that would perform this task.
These tables have 11mil rows so an efficient way would be great.
I am also open to other approaches to solve this problem.
I picked the answer from Neil because although the answer would not work if datetime col is not unique i.e. multiple rows in Table1 with the same datetime. Using any other method the performance I got was horrific, hours of query time. I decided to compromise. I created 3 new tables
westh, midwesth and southh.
These 3 new tables are a creation of aggregating the original tables by hour.
I then used Neils second version with a twist:
INNER JOIN Table2 USING (datetime)
While datetime is indexed in my tables that provides superior performance which is a firm criteria for me.
First version:
SELECT T123.dtcol, SUM(T123.intcol) AS intcolsum
FROM (
SELECT Table1.dtcol, Table1.intcol FROM Table1
UNION
SELECT Table2.dtcol, Table2.intcol FROM Table2
UNION
SELECT Table3.dtcol, Table3.intcol FROM Table3
) T123
GROUP BY T123.dtcol
HAVING COUNT(*) = 3
Second version:
SELECT Table1.dtcol, Table1.intcol + Table2.intcol + Table3.intcol AS intcolsum
FROM Table1 T1
INNER JOIN Table2 T2 ON T2.dtcol = T1.dtcol
INNER JOIN Table3 T2 ON T3.dtcol = T1.dtcol
use
SELECT A.dtcol, SUM (A.intcol) intcolsum FROM
(
SELECT 'T1' T, T1.* FROM Table1 T1
UNION
SELECT 'T2' T, T2.* FROM Table2 T2
UNION
SELECT 'T3' T, T3.* FROM Table3 T3
) A
WHERE A.dtcol IN
(
SELECT T1.dtcol
FROM Table1 T1
INNER JOIN Table2 T2 ON T2.dtcol = T1.dtcol
INNER JOIN Table3 T2 ON T3.dtcol = T1.dtcol
)
GROUP BY A.dtcol