Which is faster: SELECT DISTINCT or WHERE foo != 0? - mysql

id | foo | bar
--------------
0 | 0 | ...
1 | 1 | ...
2 | 2 | ...
3 | 0 | ...
4 | 2 | ...
I need all unique foo values, but not "0", which occurs very often.
Which is faster?
SELECT foo FROM `table` WHERE foo != 0
or
SELECT DISTINCT foo FROM `table`
The latter would keep the 0, which I would then remove in PHP.
On my server both were fast enough, but one of these two options might be theoretically faster :)

Here's an indexed data set of 130,000 rows. The sparse column has values in the range 0-100000. The dense column has values in the range 0-100.
SELECT * FROM my_table;
+----+--------+-------+
| id | sparse | dense |
+----+--------+-------+
|  1 |      0 |     0 |
|  2 |  52863 |    87 |
|  3 |  76503 |    21 |
|  4 |  77783 |    25 |
|  6 |  89359 |    73 |
|  7 |  97772 |    69 |
|  8 |  53429 |    59 |
|  9 |  35206 |    99 |
| 13 |  88062 |    44 |
| 14 |  56312 |    49 |
...
SELECT * FROM my_table WHERE sparse <> 0;
130941 rows in set (0.09 sec)
SELECT * FROM my_table WHERE dense <> 0;
130289 rows in set (0.09 sec)
SELECT DISTINCT sparse FROM my_table;
72844 rows in set (0.27 sec)
SELECT DISTINCT dense FROM my_table;
101 rows in set (0.00 sec)
As you can see, whether or not DISTINCT is faster depends very much on the density of the data.
Obviously, in this instance, the two queries are very different from each other!

As per the condition given in the question, DISTINCT will be more expensive, because it sorts all the records in a block fetched into main memory before eliminating the duplicate records, while the SELECT with the WHERE condition only iterates over each record in the block once to filter records out.
Also, the best known comparison sorting algorithms run in O(n log n), while an iterative record check happens in O(n) time.
Thus, the first query is faster here.
Hope that answers your question.

In most cases, SELECT foo FROM table WHERE foo != 0 is faster.
But in your case, it can be even faster:
SELECT foo FROM `table` WHERE foo > 0
From the data you've shown, you do not have negative values, so you don't need to check for any. (as pointed out here - MySQL docs - scroll to the comments section)
From the MySQL Distinct docs:
In most cases, a DISTINCT clause can be considered as a special case of GROUP BY
So, if performance is an issue and you don't really need it, don't use it.
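For what it's worth, a minimal sketch of that equivalence on the question's table (just an illustration; I'm not claiming the execution plans are identical on every version):
-- Both should return the same set of unique foo values;
-- the DISTINCT form behaves much like the GROUP BY form.
SELECT DISTINCT foo FROM `table`;
SELECT foo FROM `table` GROUP BY foo;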

SELECT DISTINCT foo FROM `table`
because there is no WHERE condition

Related

How do I check whether the result rows of a given mysql query exceed a certain number without using count()

My problem is to know whether a mysql query will fetch a result which exceeds a certain row count (like 5000 rows). I know I can use select * ... limit 5001 to replace count() for performance optimization in terms of time efficiency, but it still returns 5001 rows of records, which is totally useless in my scenario, because all I want is a simple 'yes/no' answer. Is there any better approach? big thanks ! ^_^
The accepted answer in the link provided by Devsi Odedra
is substantially correct, but if you don't want a big result set, select a column into a user-defined variable with LIMIT 1,
for example
MariaDB [sandbox]> select * from dates limit 7;
+----+------------+
| id | dte        |
+----+------------+
|  1 | 2018-01-02 |
|  2 | 2018-01-03 |
|  3 | 2018-01-04 |
|  4 | 2018-01-05 |
|  5 | 2018-01-06 |
|  6 | 2018-01-07 |
|  7 | 2018-01-08 |
+----+------------+
SELECT SQL_CALC_FOUND_ROWS ID INTO @ID FROM DATES WHERE ID < 5 LIMIT 1;
SELECT FOUND_ROWS();
+--------------+
| FOUND_ROWS() |
+--------------+
|            4 |
+--------------+
1 row in set (0.001 sec)
SELECT 1 FROM tbl
WHERE ... ORDER BY ...
LIMIT 5000, 1;
will give you either a row or no row -- this indicates whether there are more than 5000 rows or not. Wrapping it in EXISTS( ... ) turns that into "true" or "false" -- essentially the same effort, but perhaps clearer syntax.
Caution: If the WHERE and ORDER BY are used but cannot be handled by an INDEX, the query may still read the entire table before getting to the 5000 and 1.
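A sketch of that EXISTS( ... ) wrapper (tbl and the WHERE/ORDER BY parts are the same placeholders as above):
-- Returns 1 if more than 5000 matching rows exist, otherwise 0
SELECT EXISTS (
    SELECT 1 FROM tbl
    WHERE ... ORDER BY ...
    LIMIT 5000, 1
) AS has_more_than_5000;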
When paginating, I recommend
LIMIT 11, 1
to fetch 10 rows, plus an indication that there are more rows.

How to reuse variables in the select statement of mysql

I would like to use mysql variables to avoid repeating the same expressions. In the following example I would like to sum the salary of each employee and also show twice that sum. Of course the second column is wrong.
MariaDB [Messdaten]> select * from t;
+----+----------+--------+
| id | employee | salery |
+----+----------+--------+
|  1 |       10 |   1000 |
|  2 |       10 |   2000 |
|  3 |       20 |   3000 |
|  4 |       20 |   4000 |
+----+----------+--------+
4 rows in set (0.000 sec)
MariaDB [Messdaten]> select employee, @x:=sum(salery), 2*@x from t group by employee;
+----------+-----------------+-------+
| employee | @x:=sum(salery) | 2*@x  |
+----------+-----------------+-------+
|       10 |            3000 | 14000 |
|       20 |            7000 | 14000 |
+----------+-----------------+-------+
2 rows in set (0.001 sec)
Of course I could use select employee, sum(salery), 2*sum(salery), but in my real use case the expressions are very big and therefore hard to read.
What is going wrong, and if this is a limitation of MySQL, are there any workarounds?
You can use a subquery like so to get the correct result while only summing (or executing a more complex statement) once
SELECT
    employee,
    totalSalary,
    totalSalary * 2 AS doubleSalary
FROM (
    SELECT
        employee,
        SUM(salary) AS totalSalary
    FROM employees
    GROUP BY employee
) AS employeeSalaries;
The unexpected variable behaviour is described in the MySQL docs here.
HAVING, GROUP BY, and ORDER BY, when referring to a variable that is assigned a value in the select expression list do not work as expected because the expression is evaluated on the client and thus can use stale column values from a previous row.

Mysql - deleting duplicates

I have a table with a barcode column with a unique index. The data has been loaded with additional chars (-xx) at the end of each barcode to prevent duplicates, but there will be lots of duplicates once I strip off the suffix. Here is a sample of the data:
itemnumber barcode
17912 2-14
18082 2-1
21870 2-10
29219 2-8
Then I created two temporary tables, marty and manny, both with the itemnumber and the stripped-down barcodes. So, both tables would contain
itemnumber barcode
17912 2
18082 2
21870 2
29219 2
etc
And then I tried to delete all but the first entry with barcode '2' in the marty table (and every other barcode). I hoped then to update the original table with the correct first entry, and the users could fix up the duplicates themselves in time in the application.
So, this was my query to delete all but the first entry in the marty table for each barcode
DELETE FROM marty
WHERE itemnumber NOT IN
(SELECT MIN(itemnumber) FROM manny GROUP BY barcode)
There are 130,000 rows in marty and manny. The query took over 24 hours and then didn't finish properly. The connection to the server crashed and the query did not do all the updates.
Is there a better way to approach this that would not use the subquery, which I think is causing the delay? And the GROUP BY is probably slowing things down too with so many records.
Thanks
One more variant: this one deletes the duplicates without any temporary tables:
DELETE m1
FROM marty m1
JOIN marty m2
  ON m1.barcode = m2.barcode
 AND m1.itemnumber > m2.itemnumber;
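If you want to preview what that would remove before running the DELETE, a minimal sketch (same join, just selecting instead of deleting):
-- Every duplicate row except the lowest itemnumber per barcode
SELECT DISTINCT m1.*
FROM marty m1
JOIN marty m2
  ON m1.barcode = m2.barcode
 AND m1.itemnumber > m2.itemnumber;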
Here is a two-stage approach that avoids use of NOT IN. It also does not use the temporary table "manny". First, join "marty" to itself to pick out rows for which itemnumber != min(itemnumber). Use UPDATE to set barcode for these rows to NULL. A second pass with DELETE then removes all rows that were flagged in the first phase.
For this example, I split the barcode column of "marty" into two columns; it could be done with the table in its original format with some modification (need to split the column values on the fly).
select * from marty;
+------------+---------+---------+
| itemnumber | barcode | subcode |
+------------+---------+---------+
|      17912 |       2 |      14 |
|      18082 |       2 |       1 |
|      21870 |       2 |      10 |
|      29219 |       2 |       8 |
|      30133 |       3 |       5 |
|      30134 |       3 |       7 |
|      30139 |       3 |       9 |
|      30142 |       3 |      12 |
+------------+---------+---------+
8 rows in set (0.00 sec)
UPDATE
  (marty m1
   JOIN
     (SELECT barcode,
             MIN(itemnumber) AS itemnumber
      FROM marty
      GROUP BY barcode) m2
   USING (barcode))
SET m1.barcode = NULL WHERE m1.itemnumber != m2.itemnumber;
mysql> select * from marty;
+------------+---------+---------+
| itemnumber | barcode | subcode |
+------------+---------+---------+
|      17912 |       2 |      14 |
|      18082 |    NULL |       1 |
|      21870 |    NULL |      10 |
|      29219 |    NULL |       8 |
|      30133 |       3 |       5 |
|      30134 |    NULL |       7 |
|      30139 |    NULL |       9 |
|      30142 |    NULL |      12 |
+------------+---------+---------+
8 rows in set (0.00 sec)
DELETE FROM marty WHERE barcode IS NULL;
MySQL is notoriously slow when using IN with very large sets. A scripted alternative:
Use a script to construct a long itemnumber = x OR itemnumber = y OR itemnumber = z clause (chunk size ~1000) and INSERT the matched rows (i.e. the ones that would not have been DELETEd by your previous query) into a new table, TRUNCATE the existing table, and load the contents of the new table back into the old one with INSERT INTO marty SELECT * FROM marty_tmp.
You may want to lock the table or run in a transaction for the final TRUNCATE, INSERT.
edit:
Query SELECT MIN(itemnumber) FROM manny GROUP BY barcode from a script, store results in desiredItemNumbers array
Take batches of 1000 desiredItemNumbers and construct this query: INSERT INTO manny_tmp SELECT * FROM manny WHERE itemnumber = desiredItemNumbers[0] OR itemnumber = desiredItemNumbers[1] .... Rerun this query until you've exhausted the desiredItemNumbers array (n.b. the last query will probably have less than 1000 desiredItemNumbers).
You now have a table with the results that you would have been left with had you DELETEd the rest, so swap the contents of the marty and marty_tmp tables.
TRUNCATE marty
INSERT INTO marty SELECT * FROM marty_tmp
If you are creating temp tables anyway, how about building your table with an INSERT INTO ... SELECT or CREATE TABLE ... AS ... based on:
SELECT MIN(itemnumber) AS itemnumber, barcode
FROM marty
GROUP BY barcode
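For example (a sketch only; marty_dedup is a made-up name for the new table):
CREATE TABLE marty_dedup AS
SELECT MIN(itemnumber) AS itemnumber, barcode
FROM marty
GROUP BY barcode;
-- marty_dedup ends up with one row per barcode, keeping the lowest itemnumber.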

SQL (mysql) - If a given row on a given column has a certain value, don't list that column

I have a query that retrieves some data; among that data, some values are returned as 0. I would like the query to NOT return those columns when that's the case.
How can we do such a thing?
Regards,
MEM
select <column_name> from <table_name> where <column_name> <> 0.0
Here is all the data in a sample database. Notice how there are 3 rows with one having a zero value for the num column.
mysql> select * from test_tbl;
+------+----------+
| num  | some_str |
+------+----------+
|    0 | matt     |
|    2 | todd     |
|    3 | Paul     |
+------+----------+
3 rows in set (0.00 sec)
Now let's use the WHERE clause to specify the rows we want to ignore (it's a little bit of reverse logic, because we are actually specifying which rows we want).
mysql> select * from test_tbl where num <> 0.0;
+------+----------+
| num  | some_str |
+------+----------+
|    2 | todd     |
|    3 | Paul     |
+------+----------+
2 rows in set (0.00 sec)
Note: This will only work without getting messy if 0 is the only value you are worried about. A better way would be to allow nulls in your column and then you can check to see if they are non-null in the where clause.
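For example, a sketch of that NULL-based approach (assuming the num column is made nullable and NULL is stored instead of the magic 0):
-- NULL rows simply drop out; no special-case value to remember
SELECT * FROM test_tbl WHERE num IS NOT NULL;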

need explanation for this MySQL query

I just came across this database query and wonder what exactly it does. Please clarify.
select * from tablename order by priority='High' DESC, priority='Medium' DESC, priority='Low' DESC;
Looks like it'll order the priority by High, Medium then Low.
Because if the ORDER BY clause was just priority DESC then it would sort alphabetically, which would give
Medium
Low
High
It basically lists all fields from the table "tablename", ordered by priority High, Medium, Low.
So High appears first in the list, then Medium, and then finally Low
i.e.
* High
* High
* High
* Medium
* Medium
* Low
Where * is the rest of the fields in the table
Others have already explained what it does (High comes first, then Medium, then Low). I'll just add a few words about WHY that is so.
The reason is that the result of a comparison in MySQL is an integer - 1 if it's true, 0 if it's false. And you can sort by integers, so this construct works. I'm not sure this would fly on other RDBMS though.
Added: OK, a more detailed explanation. First of all, let's start with how ORDER BY works.
ORDER BY takes a comma-separated list of arguments which it evaluates for every row. Then it sorts by these arguments. So, let's take the classical example:
SELECT * from MyTable ORDER BY a, b, c desc
What ORDER BY does in this case, is that it gets the full result set in memory somewhere, and for every row it evaluates the values of a, b and c. Then it sorts it all using some standard sorting algorithm (such as quicksort). When it needs to compare two rows to find out which one comes first, it first compares the values of a for both rows; if those are equal, it compares the values of b; and, if those are equal too, it finally compares the values of c. Pretty simple, right? It's what you would do too.
OK, now let's consider something trickier. Take this:
SELECT * from MyTable ORDER BY a+b, c-d
This is basically the same thing, except that before all the sorting, ORDER BY takes every row and calculates a+b and c-d and stores the results in invisible columns that it creates just for sorting. Then it just compares those values like in the previous case. In essence, ORDER BY creates a table like this:
+-------------------+-----+-----+-----+-----+-----+-----+
| Some columns here | A   | B   | C   | D   | A+B | C-D |
+-------------------+-----+-----+-----+-----+-----+-----+
|                   | 1   | 2   | 3   | 4   | 3   | -1  |
|                   | 8   | 7   | 6   | 5   | 15  | 1   |
|                   | ... | ... | ... | ... | ... | ... |
+-------------------+-----+-----+-----+-----+-----+-----+
And then sorts the whole thing by the last two columns, which it discards afterwards. You don't even see them in your result set.
OK, something even weirder:
SELECT * from MyTable ORDER BY CASE WHEN a=b THEN c ELSE D END
Again - before sorting is performed, ORDER BY will go through each row, calculate the value of the expression CASE WHEN a=b THEN c ELSE D END and store it in an invisible column. This expression will always evaluate to some value, or you get an exception. Then it just sorts by that column which contains simple values, not just a fancy formula.
+-------------------+-----+-----+-----+-----+---------------------------------+
| Some columns here | A   | B   | C   | D   | CASE WHEN a=b THEN c ELSE D END |
+-------------------+-----+-----+-----+-----+---------------------------------+
|                   | 1   | 2   | 3   | 4   | 4                               |
|                   | 3   | 3   | 6   | 5   | 6                               |
|                   | ... | ... | ... | ... | ...                             |
+-------------------+-----+-----+-----+-----+---------------------------------+
Hopefully you are now comfortable with this part. If not, re-read it or ask for more examples.
Next thing is the boolean expressions. Or rather the boolean type, which for MySQL happens to be an integer. In other words SELECT 2>3 will return 0 and SELECT 2<3 will return 1. That's just it. The boolean type is an integer. And you can do integer stuff with it too. Like SELECT (2<3)+5 will return 6.
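You can check this quickly in the mysql client:
SELECT 2 > 3;        -- 0
SELECT 2 < 3;        -- 1
SELECT (2 < 3) + 5;  -- 6, because the comparison result is just an integer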
OK, now let's put all this together. Let's take your query:
select * from tablename order by priority='High' DESC, priority='Medium' DESC, priority='Low' DESC;
What happens is that ORDER BY sees a table like this:
+-------------------+----------+-----------------+-------------------+----------------+
| Some columns here | priority | priority='High' | priority='Medium' | priority='Low' |
+-------------------+----------+-----------------+-------------------+----------------+
|                   | Low      | 0               | 0                 | 1              |
|                   | High     | 1               | 0                 | 0              |
|                   | Medium   | 0               | 1                 | 0              |
|                   | Low      | 0               | 0                 | 1              |
|                   | High     | 1               | 0                 | 0              |
|                   | Low      | 0               | 0                 | 1              |
|                   | Medium   | 0               | 1                 | 0              |
|                   | High     | 1               | 0                 | 0              |
|                   | Medium   | 0               | 1                 | 0              |
|                   | Low      | 0               | 0                 | 1              |
+-------------------+----------+-----------------+-------------------+----------------+
And it then sorts by the last three invisible columns, which are discarded later.
Does it make sense now?
(P.S. In reality, of course, there are no invisible columns and the whole thing is made much trickier to get good speed, using indexes if possible and other stuff. However it is much easier to understand the process like this. It's not wrong either.)