I have a table with ~500 GB of data and two queries running on it.
-- Query 1
select Count(*) from table
where C1 = A
-- Query 2
select Count(*) from table
where C1 = A and C2 = B
I feel that executing Query 2 against the whole table is unnecessary, since its results are a subset of Query 1's. Is there an optimized way to first execute Query 1, then run Query 2 on its results, and finally return the counts of both?
SELECT
COUNT(*) AS cnt_1,
SUM(c2 = 'B') AS cnt_2
FROM yourTable
WHERE c1 = 'A';
An index on yourTable (c1, c2) will improve performance.
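As a sketch of that index (the name idx_c1_c2 is illustrative):

CREATE INDEX idx_c1_c2 ON yourTable (c1, c2);

With c1 leading, the WHERE clause can seek straight to the matching rows; with c2 also in the index, the whole query can be answered from the index alone.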
No. Any such inter-query optimization would depend on the database, and I'm not familiar with any database that caches intermediate result sets. In addition, such an optimization would be rendered useless if the underlying table changes -- and relational databases are designed to support changing data.
As a note: the optimization would have to be very sophisticated, because the count returned by the first query has nothing to do with the count returned by the second. You are thinking that the rows in the second are a subset of the first, but those rows are not actually returned. Some databases -- including MySQL -- can cache result sets so the same query run later would use the cache. However, MySQL is removing that support because of the complications it introduces.
If you want to phrase this as two queries, your best bet is an index on t(c1, c2). The index will be used for both queries and should be pretty efficient.
Otherwise, use a single query. Akina's solution is the best approach among the other answers because it filters before aggregating.
Use conditional aggregation:
SELECT
SUM(C1 = 'A') AS cnt_1,
SUM(C1 = 'A' AND C2 = 'B') AS cnt_2
FROM yourTable;
The above works because MySQL happens to support summing boolean expressions. On most other databases, you would use this version:
SELECT
COUNT(CASE WHEN C1 = 'A' THEN 1 END) AS cnt_1,
COUNT(CASE WHEN C1 = 'A' AND C2 = 'B' THEN 1 END) AS cnt_2
FROM yourTable;
select SUM(CASE WHEN C1='A' THEN 1 ELSE 0 END) A_CNT,
       SUM(CASE WHEN C1='A' AND C2='B' THEN 1 ELSE 0 END) B_CNT
from table
There are quite a few "why is my GROUP BY so slow" questions on SO, and most of them seem to be resolved with indexes.
My situation is different. I do GROUP BY on non-indexed data, but this is on purpose and it's not something I can change.
However, when I compare the performance of GROUP BY with the performance of a similar query without a GROUP BY (that also doesn't use indexes) - the GROUP BY query is slower by an order of magnitude.
Slow query:
SELECT someFunc(col), COUNT(*) FROM tbl WHERE col2 = 42 GROUP BY someFunc(col)
The results are something like this:
someFunc(col)   COUNT(*)
========================
a                 100000
b                  80000
c                     20
d                     10
Fast(er) query:
SELECT 'a', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'a'
UNION
SELECT 'b', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'b'
UNION
SELECT 'c', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'c'
UNION
SELECT 'd', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'd'
This query yields the same results and is about ten times faster despite actually running multiple separate queries.
I realize that they are not the same from MySQL's point of view, because MySQL doesn't know in advance that someFunc(col) can only yield four different values, but it still seems like too big a difference.
I'm thinking that this has to do with some work GROUP BY is doing behind the scenes (creating temporary tables and stuff like that).
Are there configuration parameters that I could tweak to make the GROUP BY faster?
Is there a way to hint MySQL to do things differently within the query itself? (e.g. refrain from creating a temporary table).
EDIT:
In fact, what I referred to as someFunc(col) above is actually a JSON_EXTRACT(). I just tried to copy the specific data being extracted into its own (unindexed) column, and it makes GROUP BY extremely fast, indeed faster than the alternative UNIONed queries.
The question remains: why? JSON_EXTRACT() might be slow but it should be just as slow with the four queries (in fact slower because more rows are scanned). Also, I've read that MySQL JSON is designed for fast reads.
The difference I'm seeing is between more than 200 seconds (GROUP BY with JSON_EXTRACT()) and 1-2 seconds (GROUP BY on a CONCAT() on an actual unindexed column).
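For reference, a minimal sketch of that copy step as a self-maintaining column (assuming MySQL 5.7+; the JSON path '$.key' and the column name col_extracted are illustrative):

-- Hypothetical: materialize the extracted value once per row,
-- so GROUP BY no longer parses JSON for every row.
ALTER TABLE tbl
  ADD COLUMN col_extracted VARCHAR(64)
    AS (JSON_UNQUOTE(JSON_EXTRACT(col, '$.key'))) STORED;

SELECT col_extracted, COUNT(*)
FROM tbl
WHERE col2 = 42
GROUP BY col_extracted;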
First, for this query:
SELECT someFunc(col), COUNT(*)
FROM tbl
WHERE col2 = 42
GROUP BY someFunc(col);
You should have an index on tbl(col2, col). This is a covering index for the query so it should improve performance.
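As a sketch (the index name is illustrative):

CREATE INDEX idx_tbl_col2_col ON tbl (col2, col);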
Small note: The second version should use UNION ALL rather than UNION. The performance difference for eliminating duplicates is small on 4 rows, but UNION is a bad habit in these cases.
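For example, the first two branches of the faster query would become (the rest follow the same pattern):

SELECT 'a', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'a'
UNION ALL
SELECT 'b', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'b'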
I'm not sure what would cause a 10x performance slowdown. I can readily think of two things that would make the second version faster.
First, this query is calling someFunc() twice for each row being processed. If this is an expensive operation, then that would account for half the increase in query load. This could be much larger if the first version is calling someFunc() on all rows, rather than just on matching rows.
To see if this is an issue, you can try:
SELECT someFunc(col) as someFunc_col, COUNT(*)
FROM tbl
WHERE col2 = 42
GROUP BY someFunc_col;
Second, doing 4 smaller GROUP BYs is going to be a bit faster than doing 1 bigger one. This is because GROUP BY uses a sort, and sorting is O(n log(n)). So sorting 100,000 rows and then 80,000 rows should be faster than sorting 180,000 at once. Your case has about half the data in each of the two large groups. This might account for up to a 50% difference (although I would be surprised if it were this large).
I have one query that is comparatively slow. I have tried to rewrite it many times, but I can't find a better solution. So I want to ask you whether it is written in the wrong way from the beginning, or whether it is OK.
SELECT sql_calc_found_rows
present_id, present_id, present_url, present_name, present_text_short, foto_name, price_id, price_price, price_amount, price_dis
FROM a_present
LEFT JOIN
(SELECT price_id, price_present_id, price_supplier_id, price_dis, price_amount,
(CASE WHEN price_dis <> 0 THEN price_dis ELSE price_amount END) as price_price
FROM a_price
WHERE
price_visibility = 1 AND price_deleted <> 1
GROUP BY price_id ) pri
ON pri.price_present_id = present_id
LEFT JOIN _present_fotos ON foto_id = present_title_foto
LEFT JOIN _cate_pres ON cp_present = present_id
WHERE present_visibility = 1 AND present_deleted <> 1 AND price_price > 0 AND present_out <> 1 AND cp_category IN (30,31,232,32)
GROUP BY present_id
ORDER BY price_price
LIMIT 8
Description: price_dis is the price after discount; price_amount is the price before discount. Each product (present) has more than one price. Is there a faster solution to select the final price?
If you find the table structure bad, I will be in trouble :)
Thank you very much!
EDIT:
explain select
OK, so I see a couple of things that could be improved.
First of all, you are JOINing with a table derived from a subquery; with subqueries, MySQL does not use indexes (hence the slowdown). Instead of joining with a subquery, try JOINing with the table a_price itself, and put that CASE statement in the outer (parent) SELECT. This should allow MySQL to use indexes when JOINing, which is really important when your subquery returns many rows.
It should look somewhat like this (including MIN() and GROUP BY, since you need the minimum price):
SELECT (...), price_amount, price_dis, MIN((CASE WHEN pri.price_dis <> 0 THEN pri.price_dis ELSE pri.price_amount END)) as price_price
FROM a_present
LEFT JOIN a_price pri
ON pri.price_present_id = present_id AND price_visibility = 1 AND price_deleted <> 1
(...)
GROUP BY present_id
Second of all, as EXPLAIN SELECT suggests, MySQL does not use an index on the table _cate_pres. You should make it use an index to JOIN and to select the categories you need (since you put some of them in the IN (...) clause).
Try adding an index on _cate_pres.cp_category, and/or maybe a composite index on this table (using the two columns cp_category and cp_present), as sketched below.
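For example (index names are illustrative):

CREATE INDEX idx_cp_category ON _cate_pres (cp_category);
-- or the composite version, which also covers the join back to present_id:
CREATE INDEX idx_cp_category_present ON _cate_pres (cp_category, cp_present);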
Generally, the result you want to achieve (it's not always possible, but in your case I'm pretty sure it is) is to make the following disappear from the EXPLAIN SELECT output:
[key] NULL - this means no key is used for this particular table
[Extra] Using temporary - this means a temporary table is created to retrieve the results, which is usually bad for performance
[Extra] Using filesort - this means no index is used for sorting, so the sorting process is slow
Read more about indexes in the MySQL docs, and pay close attention to the EXPLAIN output.
[Summary of the question: 2 SQL statements produce the same results, but at different speeds. One statement uses JOIN, the other uses IN. JOIN is faster than IN.]
I tried two kinds of SELECT statements on 2 tables, named booking_record and inclusions. The table inclusions has a many-to-one relation with the table booking_record.
(Table definitions not included for simplicity.)
First statement: (using IN clause)
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
Second statement: (using JOIN)
SELECT
id,
agent,
source
FROM
booking_record
JOIN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
) inclusions
ON
id = foreign_key_booking_record
With 300,000+ rows in the booking_record table and 6,100,000+ rows in the inclusions table, the 2nd statement delivered 127 rows in just 0.08 seconds, but the 1st statement took nearly 21 minutes for the same records.
Why is JOIN so much faster than the IN clause?
This behavior is well-documented. See here.
The short answer is that until MySQL version 5.6.6, MySQL did a poor job of optimizing these types of queries. What would happen is that the subquery would be run again for every row in the outer query: lots and lots of overhead, running the same query over and over. You could improve this by using good indexing and removing the DISTINCT from the IN subquery.
This is one of the reasons that I prefer EXISTS over IN, if you care about performance.
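For reference, a sketch of the EXISTS form of the first statement (same tables and filters as above):

SELECT id, agent, source
FROM booking_record br
WHERE EXISTS
    ( SELECT 1
      FROM inclusions i
      WHERE i.foreign_key_booking_record = br.id
        AND i.foreign_key_bill IS NULL
        AND i.invoice_closure <> FALSE
    );

An index on inclusions (foreign_key_booking_record) would turn each EXISTS probe into a quick lookup.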
EXPLAIN should give you some clues (see the MySQL EXPLAIN syntax docs).
I suspect that the IN version is constructing a list which is then scanned for each item (IN is generally considered a very inefficient construct; I only use it when I have a short list of items to enter manually).
The JOIN is more likely constructing a temp table for the results, making it more like normal JOINs between tables.
You should explore this by using EXPLAIN, as Ollie said.
But in advance, note that the second command has one more filter: id = foreign_key_booking_record.
Check if this has the same performance:
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
id = foreign_key_booking_record -- new filter
AND
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
I was recently asked this question in an interview: what is the difference between COUNT(*), COUNT(1), and COUNT(0)?
I tried them in MySQL and got the same (final) results.
All gave the number of rows in that particular table.
Can anyone explain the major difference between them?
Nothing really, unless you specify a field of a table or an expression within the parentheses instead of constant values or *.
Let me give you a detailed answer. COUNT will give you the number of non-null records for a given field. Say you have a table named A:
select 1 from A
select 0 from A
select * from A
will all return the same number of records, that is, the number of rows in table A. Still, the output is different. Suppose there are 3 records in the table, with X and Y as the field names:
select 1 from A will give you
1
1
1
select 0 from A will give you
0
0
0
select * from A will give you (assuming the two columns X and Y are in the table)
X        Y
------   ------
value1   value1
value2   (null)
value3   (null)
So, all three queries return the same number, unless you use
select count(Y) from A
Since there is only one non-null value in Y, you will get 1 as the output.
COUNT(*) will count the number of rows, while COUNT(expression) will count non-null values in expression and COUNT(column) will count all non-null values in column.
Since both 0 and 1 are non-null values, COUNT(0)=COUNT(1) and they both will be equivalent to the number of rows COUNT(*). It's a different concept, but the result will be the same.
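A quick way to see all three behaviors side by side (a sketch; the table t and its nullable column y are hypothetical):

SELECT COUNT(*) AS all_rows,     -- counts every row
       COUNT(1) AS all_rows_too, -- 1 is never null, so same as COUNT(*)
       COUNT(y) AS non_null_y    -- skips rows where y IS NULL
FROM t;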
Now, they should all perform identically.
In days gone by, though, COUNT(1) (or whatever constant you chose) was sometimes recommended over COUNT(*) because poor query optimisation code would make the database retrieve all of the field data prior to running the count. COUNT(1) was therefore faster, but it shouldn't matter now.
Since the expression 1 is a constant expression, they should always produce the same result, but the implementations might differ as some RDBMS might check whether 1 IS NULL for every single row in the group. This is still being done by PostgreSQL 11.3 as I have shown in this article.
I've benchmarked queries on 1M rows doing the two types of count:
-- Faster
SELECT count(*) FROM t;
-- 10% slower on PostgreSQL 11.3
SELECT count(1) FROM t;
One reason why people might use the less intuitive COUNT(1) could be that historically, it was the other way round.
The result will be the same; however, COUNT(*) is slower in a lot of production environments today, because in production the DB engines can live for decades. I prefer to use COUNT(0); some use COUNT(1); but definitely not COUNT(*). Even if it is, let's say, safe to use on modern DB engines, I would not depend on the engine, especially when it's only a one-character difference. The code will also be more portable.
count(any integer value) is faster than count(*); both give the total count, including rows with null values.
count(column_name) omits nulls.
Example: for a column id with the values 1, 1, null, null, 2, 2:
count(0), count(1), and count(*) all give 6;
count(id) gives 4.
Let's say we have a table with columns:
Table
-------
col_A col_B
The system returns all column values (null and non-null) when we query
select col_A from Table
The system returns the count of the non-null column values when we query
select count(col_A) from Table
The system returns the total number of rows when we query
select count(*) from Table
MySQL 5.6 👇
InnoDB handles SELECT COUNT(*) and SELECT COUNT(1) operations in the same way. There is no performance difference.
12.19.1 Aggregate Function Descriptions
The official doc is the fastest way to settle this, after I found many different answers.
COUNT(*), COUNT(1), COUNT(0), COUNT('Y'), ...
All of the above return the total number of records (including the null ones).
But COUNT('any constant') is faster than COUNT(*).
I want to read data for reporting purposes. Currently, I populate a table using another table's calculated data, and read data for reporting from the populated table. My current logic is to delete the old data and insert the new data, all within a transaction.
UPDATE
Requirements
1) The logic below is to run once every second. Please note that other processes also update tableB at the same refresh rate.
2) TableB is used for reporting purposes. TableA and TableB reside in different databases.
3) TableB contains around 10 million rows; around 4 million rows will be updated once every second by the code below. Other processes update the remaining data (the other 6 million rows) in tableB at the same refresh rate.
My concerns are:
1) The three statements use similar SUM and WHERE clauses, which might be improved.
2) There are about 1-2 million rows in tablea to update into tableB. Using an explicit temporary table might slow things down.
3) Using a transaction might slow things down, but it seems to be the only way.
4) Updating the data might be a better option than delete-and-insert (which one should I choose?)
I want to find a better-performing way (including table redesign etc.). Below is the current way:
pseudocode below:
start/begin transaction here
DELETE from tableb the data that I want to insert below, e.g. delete data where Code = 'code'
INSERT INTO tableb(Code, Total)
SELECT a.Code, sum(a.price)
FROM tablea a
GROUP BY a.Code;
INSERT INTO tableb(Code, Total)
SELECT a.Code, sum(a.price) -- use price
FROM tablea a
WHERE a.meanPrice IS NOT NULL
GROUP BY a.Code;
INSERT INTO tableb(Code, Total)
SELECT a.Code, sum(a.meanPrice) -- use meanPrice
FROM tablea a
WHERE a.meanPrice IS NOT NULL
GROUP BY a.Code;
Commit transaction here
It is for MySQL, but ideally it should be generic.
Any idea?
Do you actually need to update values in the table? They are not tagged with any id or names to identify them.
The following SELECT statement returns the data you want:
SELECT code,
sum(price),
sum(case when a.meanPrice is not null then price else 0 end),
sum(case when a.meanPrice is not null then meanprice else 0 end)
FROM tablea a
GROUP BY a.Code;
If you needed to insert this into a temp table, you can unpivot the data. However, that format does not make sense to me. Can you explain why you are using a table with one numeric column in this way?
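A sketch of that unpivot, feeding the three aggregates into (Code, Total) rows (assuming the aggregation runs once in a derived table; the aliases agg, nums, and t1-t3 are illustrative):

INSERT INTO tableb (Code, Total)
SELECT agg.Code,
       CASE nums.n WHEN 1 THEN agg.t1 WHEN 2 THEN agg.t2 ELSE agg.t3 END
FROM ( SELECT Code,
              sum(price) AS t1,
              sum(case when meanPrice is not null then price else 0 end) AS t2,
              sum(case when meanPrice is not null then meanPrice else 0 end) AS t3
       FROM tablea
       GROUP BY Code ) agg
CROSS JOIN ( SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 ) nums;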
This query does your INSERT task in 1 step, but ... kids, don't do this at home without actually measuring performance:
http://sqlfiddle.com/#!2/381e2/9
INSERT INTO tableb(Total)
SELECT
CASE
WHEN t.v = 1
THEN SUM( price )
WHEN t.v = 2
THEN SUM(
CASE
WHEN meanPrice IS NOT NULL THEN price
ELSE 0
END
)
WHEN t.v = 3
THEN SUM( meanPrice )
END AS Total
FROM tablea
INNER JOIN
( SELECT 1 AS v UNION ALL
SELECT 2 AS v UNION ALL
SELECT 3 AS v
) AS t
GROUP BY tablea.Code, t.v;
Point 3 is false.
Solution 1: Create a stored procedure.
Solution 2: Create a trigger on the impacted tables.
Solution 3: Don't compute the sum every time; do the sum the first time and then save the number in another table. On every modification of the source table, update the sum in your new table. It won't be 1 million records, only one per table. A sketch follows below.
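A minimal sketch of that idea, keeping one running total (sum of price per code); the summary table, trigger name, and column types are hypothetical, and updates and deletes on tablea would need similar triggers:

-- Hypothetical summary table: one running total per code.
CREATE TABLE code_totals (
    Code  VARCHAR(32) PRIMARY KEY,
    Total DECIMAL(18,2) NOT NULL
);

-- Keep the total current as rows arrive in tablea.
CREATE TRIGGER tablea_ai AFTER INSERT ON tablea
FOR EACH ROW
    INSERT INTO code_totals (Code, Total)
    VALUES (NEW.Code, NEW.price)
    ON DUPLICATE KEY UPDATE Total = Total + NEW.price;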
Pivot tables!