How to optimize two queries combined with UNION in MySQL

I have written the query below, but it takes a long time to fetch the result. It consists of two queries combined with a UNION statement. How can I optimize it in MySQL?
select count(*) as count
from (
    select id, createdUser, patientName, patientFirstName, patientLastName, tagnames
    FROM vw_tagged_forms v1
    where v1.tenant_id = 91
      AND CASE WHEN v1.patsiteId IS NOT NULL
               THEN v1.patsiteId IN (151,2937,1450,1430,2746,1431,1472,1438,2431,1428)
               ELSE v1.patsiteId IS NULL
          END
    group by COALESCE(`message_grp_id`, `id`)
    UNION
    select id, createdUser, patientName, patientFirstName, patientLastName, tagnames
    FROM vw_tagged_forms_logs v2
    where tenant_id = 91
      AND CASE WHEN v2.patsiteId IS NOT NULL
               THEN v2.patsiteId IN (151,2937,1450,1430,2746,1431,1472,1431)
               ELSE v2.patsiteId IS NULL
          END
) a

Simplification: I think that
CASE WHEN x IS NOT NULL
THEN x IN (...)
ELSE x IS NULL
END
can be simplified to either of these:
x IN (...) OR x IS NULL
or
COALESCE(x IN (...), TRUE)
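To check that claim, here is a small sketch that evaluates all three forms side by side. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; both follow the same three-valued logic for IN, IS NULL and COALESCE). The table t, its values and the shortened IN list are made up for illustration.

```python
import sqlite3

# SQLite stands in for MySQL here (assumption). Table `t` and its rows are
# hypothetical; the IN list is shortened to two values from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(151,), (2937,), (7,), (None,)])

rows = conn.execute("""
    SELECT x,
           CASE WHEN x IS NOT NULL THEN x IN (151, 2937) ELSE x IS NULL END AS case_form,
           (x IN (151, 2937) OR x IS NULL)                                  AS or_form,
           COALESCE(x IN (151, 2937), 1)                                    AS coalesce_form
    FROM t
    ORDER BY x
""").fetchall()

# Print one line per row: the value of x, then the three predicate results.
for x, case_form, or_form, coalesce_form in rows:
    print(x, case_form, or_form, coalesce_form)

# True when the three predicates give the same answer on every row.
all_agree = all(c == o == k for _, c, o, k in rows)
print("all three forms agree:", all_agree)
```

The NULL row is the interesting one: x IN (...) evaluates to NULL there, and each form turns that into "keep the row".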
Bug? (The IN lists are not the same; was this an oversight?)
INDEX (for performance): Add this composite index, with the columns in the order given:
INDEX(tenant_id, patsiteId)
When adding a composite index, DROP index(es) with the same leading columns.
That is, when you have both INDEX(a) and INDEX(a,b), toss the former.
DISTINCT, id, etc:
Are the id values different for the two tables? If they are AUTO_INCREMENT and the rows were independently INSERTed, I would expect them to be different. If, on the other hand, you copy a row from v1 to v2, then I would expect the ids to match.
That leads to the question of "what are you counting?" If the ids are expected to match, then do only SELECT id. If they were independently generated, then leave id out of the SELECTs; keep the other 5 columns.

Related

Optimize range query with group by

Having trouble with a query. Here is the outline -
Table structure:
CREATE TABLE `world` (
`placeRef` int NOT NULL,
`forenameRef` int NOT NULL,
`surnameRef` int NOT NULL,
`incidence` int NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb3;
ALTER TABLE `world`
ADD KEY `surnameRef_forenameRef` (`surnameRef`,`forenameRef`),
ADD KEY `forenameRef_surnameRef` (`forenameRef`,`surnameRef`),
ADD KEY `forenameRef` (`forenameRef`,`placeRef`);
COMMIT;
The table has over 600,000,000 rows and contains data like this:
placeRef  forenameRef  surnameRef  incidence
1         1            2           100
2         1            3           600
This represents the number of people with a given forename-surname combination in a place.
I would like to be able to query all the forenames that a surname is attached to; and then perform another search for where those forenames exist, with a count of the sum incidence. For Example: get all the forenames of people who have the surname "Smith"; then get a list of all those forenames, grouped by place and with the sum incidence. I can do this with the following query:
SELECT placeRef, SUM( incidence )
FROM world
WHERE forenameRef IN
(
SELECT DISTINCT forenameRef
FROM world
WHERE surnameRef = 214488
)
GROUP BY world.placeRef
However, this query takes about a minute to execute and will take more time if the surname being searched for is common.
The root problem is: performing a range query with a group doesn't utilize the full index.
Any suggestions how the speed could be improved?
In my experience, if your query has a range condition (i.e. any kind of predicate other than = or IS NULL), the column for that condition is the last column in your index that can be used to optimize search, sort, or grouping.
In other words, suppose you have an index on columns (a, b, c).
The following uses all three columns. It is able to optimize the ORDER BY c: all rows matching the specific values of a and b are by definition tied on those columns, so the matching rows are already in order by c and the ORDER BY is a no-op.
SELECT * FROM mytable WHERE a = 1 AND b = 2 ORDER BY c;
But the next example only uses columns a, b. The ORDER BY needs to do a filesort, because the index is not in order by c.
SELECT * FROM mytable WHERE a = 1 AND b > 2 ORDER BY c;
A similar effect is true for GROUP BY. The following uses a, b for row selection, and it can also optimize the GROUP BY using the index, because each group of values per distinct value of c is guaranteed to be grouped together in the index. So it can count the rows for each value of c, and when it's done with one group, it is assured there will be no more rows later with that value of c.
SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b = 2 GROUP BY c;
But the range condition spoils that. The rows for each value of c are not grouped together. It's assumed that the rows for each value of c may be scattered among each of the higher values of b.
SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b > 2 GROUP BY c;
In this case, MySQL can't optimize the GROUP BY in this query. It must use a temporary table to count the rows per distinct value of c.
MySQL 8.0.13 introduced a new type of optimizer behavior, the Skip Scan Range Access Method. But as far as I know, it only applies to range conditions, not ORDER BY or GROUP BY.
It's still true that if you have a range condition, this spoils the index optimization of ORDER BY and GROUP BY.
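The four example queries above can be compared with a quick experiment. This is only a sketch: it uses Python's built-in sqlite3 module, whose planner is not MySQL's, so treat it as an analogy rather than proof of MySQL behavior. In SQLite's EXPLAIN QUERY PLAN output, a "USE TEMP B-TREE" step plays roughly the role of MySQL's filesort or temporary table. The index name idx_abc is made up.

```python
import sqlite3

# SQLite is a stand-in for MySQL (assumption); idx_abc is a hypothetical name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE INDEX idx_abc ON mytable (a, b, c)")

def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN joined into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Equality on a and b: the index already delivers rows ordered/grouped by c.
eq_order = plan("SELECT * FROM mytable WHERE a = 1 AND b = 2 ORDER BY c")
eq_group = plan("SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b = 2 GROUP BY c")

# Range on b: values of c are scattered across the matching values of b.
rg_order = plan("SELECT * FROM mytable WHERE a = 1 AND b > 2 ORDER BY c")
rg_group = plan("SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b > 2 GROUP BY c")

for label, p in [("eq ORDER BY", eq_order), ("range ORDER BY", rg_order),
                 ("eq GROUP BY", eq_group), ("range GROUP BY", rg_group)]:
    print(f"{label:15} {p}")
```

In MySQL, the equivalent check is to run EXPLAIN on each query and look for "Using filesort" or "Using temporary" in the Extra column.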
Unless I don't understand the task, it seems like this works:
SELECT placeRef, SUM( incidence )
FROM world
WHERE surnameRef = 214488
GROUP BY placeRef;
Give it a try.
It would benefit from a composite index in this order:
INDEX(surnameRef, placeRef, incidence)
Is incidence being updated a lot? If so, leave it off my Index.
You should consider moving from MyISAM to InnoDB. It will need a suitable PK, probably
PRIMARY KEY(placeRef, surnameRef, forenameRef)
and it will take 2x-3x the disk space.

Does UNION overwrite previous results?

I have two queries on the same table and I need to get results which do not appear in the first resultset.
There are several ways to get there (I do not ask for one), my first try was this:
SELECT * FROM
(
    SELECT
        o.custom_order_id AS 'order',
        'yes' AS 'test'
    FROM orders o
    WHERE <first criteria>
    UNION
    SELECT
        o.custom_order_id AS 'order',
        'no' AS 'test'
    FROM orders o
    WHERE <second criteria>
) x
WHERE x.test = 'no'
My understanding was that UNION does not append rows that already appeared in the first result set.
Yet I do get rows like
12345 no
even though 12345 appears in the first result set (the query before UNION).
Why is that?
Edit:
custom_order_id has no index and is not primary key (although it actually is unique) - does UNION need a (unique) index or pk to recognize a row as already-in-first-resultset?
UNION uses the entire tuple to determine if a row is unique. In your case that is (order, test).
As one half of your answers has test set to "yes", and one "no", you can end up with multiple orders with the same id (one for yes, one for no).
You only get rows with "no", because that's what you specified in your WHERE clause at the end:
WHERE x.test = 'no'
Eliminate that and UNION will return unique rows from unioned queries.
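The tuple comparison can be seen in a tiny reproduction. This is a sketch using Python's built-in sqlite3 module in place of MySQL (UNION deduplicates on the whole row in both). The amount column and the two WHERE conditions are invented stand-ins for the question's unspecified criteria.

```python
import sqlite3

# Hypothetical data: order 12345 matches both criteria, 67890 only the second.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (custom_order_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(12345, 20), (67890, 7)])

# With the constant 'yes'/'no' column, (12345, 'yes') and (12345, 'no') are
# different tuples, so UNION keeps both.
tagged = conn.execute("""
    SELECT custom_order_id, 'yes' AS test FROM orders WHERE amount > 10
    UNION
    SELECT custom_order_id, 'no' AS test FROM orders WHERE amount > 5
    ORDER BY custom_order_id, test
""").fetchall()

# Without the constant column the rows are identical and get collapsed.
untagged = conn.execute("""
    SELECT custom_order_id FROM orders WHERE amount > 10
    UNION
    SELECT custom_order_id FROM orders WHERE amount > 5
    ORDER BY custom_order_id
""").fetchall()

print(tagged)
print(untagged)
```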
UNION doesn't necessarily need keys, although optimizing the query is almost always a good idea :) Try running EXPLAIN on the UNION query above and see what is produced.

Large SQL database - solving efficiency

I have the following SQL query. When I originally wrote it, it was exceptionally fast; it now takes over 1 second to complete:
SELECT counted/scount as ratio, [etc]
FROM
playlists
LEFT JOIN (
select AID, PLID FROM (SELECT AID, PLID FROM p_s ORDER BY `order` asc, PLSID desc)as g GROUP BY PLID
) as t USING(PLID)
INNER JOIN (
SELECT PLID, count(PLID) as scount from p_s LEFT JOIN audio USING(AID) WHERE removed='0' and verified='1' GROUP BY PLID
) as g USING(PLID)
LEFT JOIN (
select AID, count(AID) as counted FROM a_p_all WHERE ".time()." - playtime < 2678400 GROUP BY AID
) as r USING(AID)
LEFT JOIN audio USING (AID)
LEFT JOIN members USING (UID)
WHERE scount > 4 ORDER BY ratio desc
LIMIT 0, 20
I have identified the problem, the a_p_all table has over 500k rows. This is slowing down the query. I have come up with a solution:
Create a smaller temporary table, that only stores the data necessary, and deletes anything older than is needed.
However, is there a better method to use? Optimally I wouldn't need a temporary table; what do sites such as YouTube/Facebook do for large tables to keep query times fast?
edit
This is the EXPLAIN table for the query in the answer from @spencer7593
id  select_type         table       type    possible_keys  key      key_len  ref             rows    Extra
1   PRIMARY             <derived3>  ALL     NULL           NULL     NULL     NULL            20
1   PRIMARY             u           eq_ref  PRIMARY        PRIMARY  8        q.AID           1       Using index
1   PRIMARY             m           eq_ref  PRIMARY        PRIMARY  8        q.UID           1       Using index
3   DERIVED             <derived6>  ALL     NULL           NULL     NULL     NULL            20
6   DERIVED             t           ALL     NULL           NULL     NULL     NULL            21
5   DEPENDENT SUBQUERY  s           ALL     NULL           NULL     NULL     NULL            49      Using where; Using filesort
4   DEPENDENT SUBQUERY  c           ALL     NULL           NULL     NULL     NULL            49      Using where
4   DEPENDENT SUBQUERY  o           eq_ref  PRIMARY        PRIMARY  8        database.c.AID  1       Using where
2   DEPENDENT SUBQUERY  a           ALL     NULL           NULL     NULL     NULL            510594  Using where
Two "big rock" issues stand out to me.
Firstly, this predicate
WHERE ".time()." - playtime < 2678400
(I'm assuming that this isn't the actual SQL being submitted to the database, but that what's being sent to the database is something like this:
WHERE 1409192073 - playtime < 2678400
such that we want only rows where playtime is within the past 31 days, i.e. within 31*24*60*60 seconds of the integer value returned by time().)
This predicate can't make use of a range scan operation on a suitable index on playtime. MySQL evaluates the expression on the left side for every row in the table (every row that isn't excluded by some other predicate), and the result of that expression is compared to the literal on the right.
To improve performance, rewrite the predicate so that the comparison is made on the bare column. Compare the value stored in the playtime column to an expression that needs to be evaluated only once, for example:
WHERE playtime > 1409192073 - 2678400
With a suitable index available, MySQL can perform a "range" scan operation, and efficiently eliminate a boatload of rows that don't need to be evaluated.
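The difference between the two predicates can be sketched with Python's built-in sqlite3 module. SQLite is only a stand-in for MySQL here (an assumption), but it shows the same principle: an index on playtime can be used for a range search only when the column stands alone on one side of the comparison. The index name idx_playtime and the sample rows are hypothetical.

```python
import sqlite3

NOW = 1409192073       # the example timestamp from the answer
WINDOW = 2678400       # 31 * 24 * 60 * 60 seconds

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a_p_all (AID INTEGER, playtime INTEGER)")
conn.execute("CREATE INDEX idx_playtime ON a_p_all (playtime)")  # hypothetical index
conn.executemany("INSERT INTO a_p_all VALUES (?, ?)", [
    (1, NOW - 100),          # recent play
    (2, NOW - WINDOW + 1),   # just inside the 31-day window
    (3, NOW - WINDOW - 1),   # just outside the window
])

wrapped_sql = f"SELECT AID FROM a_p_all WHERE {NOW} - playtime < {WINDOW}"
bare_sql = f"SELECT AID FROM a_p_all WHERE playtime > {NOW} - {WINDOW}"

def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN joined into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

wrapped_plan = plan(wrapped_sql)   # column buried in an expression: full scan
bare_plan = plan(bare_sql)         # bare column: range search on the index

# Both predicates select the same rows; only the access path differs.
wrapped_rows = sorted(conn.execute(wrapped_sql).fetchall())
bare_rows = sorted(conn.execute(bare_sql).fetchall())

print(wrapped_plan)
print(bare_plan)
print(wrapped_rows == bare_rows)
```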
The second "big rock" is the inline views, or "derived tables" in MySQL parlance. MySQL is much different than other databases in how inline views are processed. MySQL actually runs that innermost query, and stores the result set as a temporary MyISAM table, and then the outer query runs against the MyISAM table. (The name that MySQL uses, "derived table", makes sense when we understand how MySQL processes the inline view.) Also, MySQL does not "push" predicates down, from an outer query down into the view queries. And on the derived table, there are no indexes created. (I believe MySQL 5.7 is changing that, and does sometimes create indexes, to improve performance.) But large "derived tables" can have a significant performance impact.
Also, the LIMIT clause gets applied last in the statement processing; that's after all the rows in the resultset are prepared and sorted. Even if you are returning only 20 rows, MySQL still prepares the entire resultset; it just doesn't transfer them to the client.
Lots of the column references are not qualified with the table name or alias, so we don't know, for example, which table (p_s or audio) contains the removed and verified columns.
(We know it can't be both, if MySQL isn't throwing an "ambiguous column" error. But MySQL has access to the table definitions, where we don't. MySQL also knows something about the cardinality of the columns, in particular, which columns (or combination of columns) are UNIQUE, and which columns can contain NULL values, etc.)
Best practice is to qualify ALL column references with the table name or (preferably) a table alias. (This makes it much easier on the human reading the SQL, and it also avoids a query from breaking when a new column is added to a table.)
Also, the query has a LIMIT clause, but there's no ORDER BY clause (or implied ORDER BY), which makes the resultset indeterminate. We don't have any guarantee which will be the "first" rows returned.
EDIT
To return only 20 rows from playlists (out of thousands or more), I might try using correlated subqueries in the SELECT list; using a LIMIT clause in an inline view to winnow down the number of rows that I'd need to run the subqueries for. Correlated subqueries can eat your lunch (and your lunchbox too) in terms of performance with large sets, due to the number of times those need to be run.
From what I can gather, you are attempting to return 20 rows from playlists, picking up the related row from member (by the foreign key in playlists), finding the "first" song in the playlist; getting a count of times that "song" has been played in the past 31 days (from any playlist); getting the number of times a song appears on that playlist (as long as it's been verified and hasn't been removed... the outerness of that LEFT JOIN is negated by the predicates on the removed and verified columns, if either of those columns is from the audio table...).
I'd take a shot with something like this, to compare performance:
SELECT q.*
     , ( SELECT COUNT(1)
           FROM a_p_all a
          WHERE a.playtime > 1409192073 - 2678400  -- plays within the past 31 days
            AND a.AID = q.AID
       ) AS counted
  FROM ( SELECT p.PLID
              , p.UID
              , p.[etc]
              , ( SELECT COUNT(1)
                    FROM p_s c
                    JOIN audio o
                      ON o.AID = c.AID
                     AND o.removed='0'
                     AND o.verified='1'
                   WHERE c.PLID = p.PLID
                ) AS scount
              , ( SELECT s.AID
                    FROM p_s s
                   WHERE s.PLID = p.PLID
                   ORDER BY s.`order` ASC, s.PLSID DESC
                   LIMIT 1
                ) AS AID
           FROM ( SELECT t.PLID
                       , t.[etc]
                    FROM playlists t
                   ORDER BY NULL
                   LIMIT 20
                ) p
       ) q
  LEFT JOIN audio u ON u.AID = q.AID
  LEFT JOIN members m ON m.UID = q.UID
 LIMIT 0, 20
UPDATE
Dude, the EXPLAIN output is showing that you don't have suitable indexes available. To get any decent chance at performance with the correlated subqueries, you're going to want to add some indexes, e.g.
... ON a_p_all (AID, playtime)
... ON p_s (PLID, `order`, PLSID, AID)

What is the difference between count(0), count(1).. and count(*) in mySQL/SQL?

I was recently asked this question in an interview.
I tried this in MySQL and got the same final results.
All gave the number of rows in that particular table.
Can anyone explain the major difference between them?
Nothing really, unless you specify a field in a table or an expression within parentheses instead of constant values or *
Let me give you a detailed answer. COUNT gives you the number of non-null values of the given field. Say you have a table named A
select 1 from A
select 0 from A
select * from A
will all return the same number of records, that is, the number of rows in table A. The output is still different, though. Suppose the table has 3 records, with X and Y as field names.
select 1 from A will give you
1
1
1
select 0 from A will give you
0
0
0
select * from A will give you ( assume two columns X and Y is in the table )
X Y
-- --
value1 value1
value2 (null)
value3 (null)
So all three queries return the same number of rows, and counting them gives the same result. The exception is
select count(Y) from A
Since Y has only one non-null value, you will get 1 as output.
COUNT(*) will count the number of rows, while COUNT(expression) will count non-null values in expression and COUNT(column) will count all non-null values in column.
Since both 0 and 1 are non-null values, COUNT(0)=COUNT(1) and they both will be equivalent to the number of rows COUNT(*). It's a different concept, but the result will be the same.
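A quick way to see this is to run all the variants against the three-row table A from the earlier example. The sketch below uses Python's built-in sqlite3 module rather than MySQL (an assumption; COUNT follows the same NULL rules in both).

```python
import sqlite3

# Table A with the three rows from the example: Y is NULL in two of them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (X TEXT, Y TEXT)")
conn.executemany("INSERT INTO A VALUES (?, ?)", [
    ("value1", "value1"),
    ("value2", None),
    ("value3", None),
])

# COUNT(*) counts rows; COUNT(expr) counts rows where expr is not NULL.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(0), COUNT(X), COUNT(Y) FROM A"
).fetchone()

print(dict(zip(["count(*)", "count(1)", "count(0)", "count(X)", "count(Y)"], counts)))
```

Only COUNT(Y) differs, because it is the only expression that is ever NULL.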
Now - they should all perform identically.
In days gone by, though, COUNT(1) (or whatever constant you chose) was sometimes recommended over COUNT(*) because poor query optimisation code would make the database retrieve all of the field data prior to running the count. COUNT(1) was therefore faster, but it shouldn't matter now.
Since the expression 1 is a constant expression, they should always produce the same result, but the implementations might differ as some RDBMS might check whether 1 IS NULL for every single row in the group. This is still being done by PostgreSQL 11.3 as I have shown in this article.
I've benchmarked queries on 1M rows doing the two types of count:
-- Faster
SELECT count(*) FROM t;
-- 10% slower on PostgreSQL 11.3
SELECT count(1) FROM t;
One reason why people might use the less intuitive COUNT(1) could be that historically, it was the other way round.
The result will be the same. However, COUNT(*) is slower in a lot of production environments today, because database engines in production can stay in service for decades. I prefer COUNT(0), and some people use COUNT(1), but I would definitely avoid COUNT(*). Even if it is safe to use on modern database engines, I would rather not depend on the engine, especially when the difference is only one character; the code will also be more portable.
count(any integer value) is faster than count(*); both count all rows, including those with null values.
count(column_name) omits nulls.
Example:
column name: id
values: 1, 1, null, null, 2, 2
count(0), count(1), count(*): the result is 6
count(id): the result is 4
Let's say we have table with columns
Table
-------
col_A col_B
System returns all column (null and non-null) values when we query
select col_A from Table
System returns column values which are non-null when we query
select count(col_A) from Table
System returns total rows when we query
select count(*) from Table
MySQL 5.6:
InnoDB handles SELECT COUNT(*) and SELECT COUNT(1) operations in the same way. There is no performance difference.
Source: 12.19.1 Aggregate Function Descriptions
After finding many conflicting answers, I found the official documentation the fastest way to settle this.
COUNT(*), COUNT(1) , COUNT(0), COUNT('Y') , ...
All of the above return the total number of records (including the null ones).
But COUNT('any constant') is faster than COUNT(*).

Optimizing DISTINCT SQL query with OR conditions

I have the following SQL query:
SELECT DISTINCT business_key
FROM Memory
WHERE concept <> 'case' OR attrib <> 'status' OR value <> 'closed'
What I am trying to achieve is to get all unique business keys that don't have a record with concept=case AND attrib=status AND value=closed. Running this query in MySQL on 500 000 records, all with unique business_keys, is very slow: about 11 seconds.
I placed indexes on the business_key column and on the concept, attrib and value columns. I also tried a combined index on all three columns (concept, attrib, value), but the result is the same.
Here is a screenshot of the EXPLAIN EXTENDED command:
The interesting thing is that running the query without the distinct specifier results in a very fast execution.
I had also tried this:
SELECT DISTINCT m.business_key
FROM Memory m
WHERE m.business_key NOT IN
(SELECT c.business_Key
FROM Memory c
WHERE c.concept = 'case' AND c.attrib = 'status' AND c.value = 'closed')
with even worse results: around 25 seconds
You could add a compound (concept, attrib, value, business_key) index so the query (if MySQL decides to use this index) can find all the info in the index without having to read the whole table.
Your query is equivalent to:
SELECT DISTINCT business_key
FROM Memory
WHERE NOT (concept = 'case' AND attrib = 'status' AND value = 'closed')
and to this (which will probably yield the same execution plan):
SELECT business_key
FROM Memory
WHERE NOT (concept = 'case' AND attrib = 'status' AND value = 'closed')
GROUP BY business_key
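The equivalence of the three forms (the original OR of <> conditions, the NOT (... AND ...) form, and the GROUP BY form) can be checked on a few rows, including rows with NULLs, where it is easy to doubt that De Morgan's law still holds. This is a sketch with Python's built-in sqlite3 module as a stand-in for MySQL (an assumption); the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Memory (business_key TEXT, concept TEXT, attrib TEXT, value TEXT)")
conn.executemany("INSERT INTO Memory VALUES (?, ?, ?, ?)", [
    ("k1", "case", "status", "closed"),    # the one combination being filtered out
    ("k2", "case", "status", "open"),
    ("k2", "case", "owner", "alice"),      # second row for k2: exercises DISTINCT
    ("k3", "task", "status", "closed"),
    ("k4", "case", "status", None),        # predicate evaluates to NULL: row dropped
    ("k5", None, "priority", "closed"),    # NULL concept, but attrib differs: row kept
])

def keys(sql):
    """Run a query and return its business_key values in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))

or_form = keys("""
    SELECT DISTINCT business_key FROM Memory
    WHERE concept <> 'case' OR attrib <> 'status' OR value <> 'closed'
""")
not_form = keys("""
    SELECT DISTINCT business_key FROM Memory
    WHERE NOT (concept = 'case' AND attrib = 'status' AND value = 'closed')
""")
group_form = keys("""
    SELECT business_key FROM Memory
    WHERE NOT (concept = 'case' AND attrib = 'status' AND value = 'closed')
    GROUP BY business_key
""")

print(or_form, not_form, group_form)
```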
Since the 4 columns that are to be put in the index are all VARCHAR(255), the index length will be pretty large. MyISAM will not allow more than 1000 bytes and InnoDB no more than 3072.
One solution is to cut the length of the last part so that the total index length stays under 1000 (counting one byte per character): 255+255+255+220 = 985:
(concept, attrib, value, business_key(220))
It will work but it's really not good to have so large index lengths, performance wise.
Another option is to lower the length of all or some of those 4 columns, if that complies with the data you expect to store there. No need to declare length 255 if you expect to have maximum of 100 in a column.
Another option you may consider is putting those 4 columns in 4 separate reference tables. (Or just the columns that have repeated data. It seems that business_key will have duplicate data but not that many. So, it won't be much good to make a reference table for that column.)
Example: Put concept values in a new table with something like:
CREATE TABLE Concept_Ref
( concept_id INT AUTO_INCREMENT
, concept VARCHAR(255)
, PRIMARY KEY (concept_id)
, UNIQUE INDEX concept_idx (concept)
) ;
INSERT INTO Concept_Ref
( concept )
SELECT DISTINCT
concept
FROM
Memory ;
and then change the Memory table with:
ALTER TABLE Memory
ADD COLUMN concept_id INT ;
do this (once):
UPDATE
Memory m
JOIN
Concept_Ref c
ON c.concept = m.concept
SET m.concept_id = c.concept_id
and then drop the Memory.concept column:
ALTER TABLE Memory
DROP COLUMN concept ;
You can also add FOREIGN KEY references if you change your tables from MyISAM to InnoDB.
After doing the same for all 4 columns, not only the length of new compound index in the Memory table will be much smaller but your table size will be much smaller, too. Additionally, any other index that uses any of those columns will have smaller length.
Of course, the query would then need 4 JOINs. And any INSERT, UPDATE or DELETE statement against this table would have to be changed and carefully designed.
But overall, I think you will have better performance. With the design you have now, it seems that values like 'case', 'status' and 'closed' are repeated many times.
This will allow the use of an index. It will still take some time to retrieve all the rows.
SELECT DISTINCT business_key FROM Memory
WHERE NOT(concept = 'case' AND attrib = 'status' AND value = 'closed')
If the query runs quickly without DISTINCT, have you tried:
SELECT DISTINCT business_key from
(SELECT business_key
FROM Memory
WHERE concept <> 'case' OR attrib <> 'status' OR value <> 'closed') v
?