Random sampling using subquery and rand() gives unexpected results - mysql

Edit: If it makes any difference, I am using mysql 5.7.19.
I have a table A, and am trying to randomly sample on average 10% of the rows. I have decided that using rand() in a subquery, and then filtering out on that random result would do the trick, but it is giving unexpected results. When I print out the randomly generated value after filtering, I get random values that do not match my main query's "where" clause, so I suppose it is regenerating the random value in the outer select.
I guess I'm missing something to do with subqueries and when things are executed, but I'm really not sure what's going on.
Can anyone explain what I might be doing wrong? I've checked out this post: In which sequence are queries and sub-queries executed by the SQL engine?, and my subquery is not correlated, so I assume that it is executed first and that the main query then filters on its result. Given my assumptions, I do not understand why the result has values that should have been filtered out.
Query:
select
    *
from
    (
        select
            *,
            rand() as rand_value
        from
            A
    ) a_rand
where
    rand_value < 0.1;
Result:
--------------------------------------
| id | events | rand_value |
--------------------------------------
| c | 1 | 0.5512495763145849 | <- not what I expected
--------------------------------------

I am not able to reproduce this using this SQL Fiddle; use that link and click the blue [Run SQL] button a few times.
CREATE TABLE Table1
(`x` int)
;
INSERT INTO Table1
(`x`)
VALUES
(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
;
Query 1:
select
    *
from (
    select
        *
        , rand() as rand_value
    from Table1
) a_rand
where
    rand_value < 0.1
[Results]:
| x | rand_value |
|---|---------------------|
| 1 | 0.03006686086772649 |
| 1 | 0.09353976332912199 |
| 1 | 0.08519635823107917 |
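A possible explanation for the difference between the two environments, offered as an assumption since nothing in this thread confirms it: MySQL 5.7 can merge a derived table into the outer query (the derived_merge optimizer flag). When that happens, rand() may be evaluated once for the where clause and again for the select list, so the printed value is not the one that was filtered on. The sketch below shows two ways to keep the derived table from being merged, plus a simpler form that skips the subquery. Table A is the one from the question.

```sql
-- Option 1: turn off derived-table merging for this session (MySQL 5.7+).
SET SESSION optimizer_switch = 'derived_merge=off';

SELECT *
FROM (
    SELECT *, RAND() AS rand_value
    FROM A
) a_rand
WHERE rand_value < 0.1;

-- Restore the default afterwards.
SET SESSION optimizer_switch = 'derived_merge=on';

-- Option 2: a LIMIT inside the derived table prevents merging,
-- so rand_value is computed once per row and then filtered.
SELECT *
FROM (
    SELECT *, RAND() AS rand_value
    FROM A
    LIMIT 18446744073709551615
) a_rand
WHERE rand_value < 0.1;

-- Option 3: if the random value itself is not needed, filter directly.
SELECT * FROM A WHERE RAND() < 0.1;
```

Running EXPLAIN on the original query and checking whether a DERIVED row appears is one way to see whether the derived table was merged on a given server.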

Related

Union as sub query using MySQL 8

I want to optimize a query that uses a union as a subquery.
I'm not really sure how to construct the query, though.
I'm using MySQL 8.0.12.
Here is the original query:
---------------
| c1 | c2 |
---------------
| 18182 | 0 |
| 18015 | 0 |
---------------
2 rows in set (0.35 sec)
I'm sorry, but the question cannot be saved if I paste the SQL query as text and format it using Ctrl+K.
Output expected
---------------
| c1 | c2 |
---------------
| 18182 | 167 |
| 18015 | 0 |
---------------
As output, I would like the difference in row counts between the two tables in the UNION ALL.
I processed this question using the wizard https://stackoverflow.com/questions/ask
Since a parenthesized SELECT can be used almost anywhere an expression can go:
SELECT
    ABS( (SELECT COUNT(*) FROM tbl_aaa) -
         (SELECT COUNT(*) FROM tbl_bbb) ) AS diff;
Also, MySQL is happy to allow a SELECT without a FROM.
There are several ways to go about this, including UNION, but I wouldn't recommend that, as it is IMO a bit 'hacky'. Instead, I suggest you use subqueries or CTEs.
With subqueries
SELECT
    ABS(c_tbl_aaa.size - c_tbl_bbb.size) as diff
FROM (
    SELECT
        COUNT(*) as size
    FROM tbl_aaa
) c_tbl_aaa
CROSS JOIN (
    SELECT
        COUNT(*) as size
    FROM tbl_bbb
) c_tbl_bbb
With CTEs, also known as WITHs
WITH c_tbl_aaa AS (
    SELECT
        COUNT(*) as size
    FROM tbl_aaa
), c_tbl_bbb AS (
    SELECT
        COUNT(*) as size
    FROM tbl_bbb
)
SELECT
    ABS(c_tbl_aaa.size - c_tbl_bbb.size) as diff
FROM c_tbl_aaa
CROSS JOIN c_tbl_bbb
In a practical sense, they are the same. Depending on your needs, you might want to join the results explicitly instead; in that case, you could select a constant as a "pseudo id" in each subquery and join on it.
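A minimal sketch of the "pseudo id" idea, assuming the same tbl_aaa and tbl_bbb tables; the column name pseudo_id is made up for illustration. Each subquery returns a single row carrying the constant 1, and the join matches on it:

```sql
SELECT
    ABS(a.size - b.size) AS diff
FROM (
    SELECT 1 AS pseudo_id, COUNT(*) AS size   -- pseudo_id is a hypothetical name
    FROM tbl_aaa
) a
JOIN (
    SELECT 1 AS pseudo_id, COUNT(*) AS size
    FROM tbl_bbb
) b ON a.pseudo_id = b.pseudo_id;
```

With one row on each side this gives the same result as the CROSS JOIN; the explicit key only starts to matter if the subqueries later return several rows, for example one count per category.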
Since you only want to know the differences, I used the ABS function, which returns the absolute value of a number.
Let me know if you want a solution with UNIONs anyway.
Edit: As @Rick James pointed out, COUNT(*) should be used in the subqueries to count the number of rows, as COUNT(id_***) will only count the rows with non-null values in that field.
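For completeness, here is one way the UNION-based variant could look. This is only a sketch, assuming the same two tables; the alias signed_count is invented for the example. The second count is negated so that summing the two rows yields the difference:

```sql
SELECT
    ABS(SUM(signed_count)) AS diff
FROM (
    SELECT COUNT(*) AS signed_count FROM tbl_aaa
    UNION ALL
    SELECT -COUNT(*) FROM tbl_bbb     -- negated so the sum is a difference
) counts;
```

UNION ALL is used rather than UNION so that two equal counts are never collapsed into a single row.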

Return the query when count of a query is greater than a number?

I want to return all rows that have a certain value in a column, but only if that value occurs more than 5 times. For example, if column M contains the value 1 in more than 5 rows, the query should return all rows where M = 1.
select *
from tab
where M = 1
group by id -- ID is the primary key of the table
having count(M) > 5;
EDIT: Here is my table:
id | M | price
--------+-------------+-------
1 | | 100
2 | 1 | 50
3 | 1 | 30
4 | 2 | 20
5 | 2 | 10
6 | 3 | 20
7 | 1 | 1
8 | 1 | 1
9 | 1 | 1
10 | 1 | 1
11 | 1 | 1
Originally, I wanted to put this in a trigger, so that an exception is raised if the number of rows with M = 1 is greater than 5. The query I asked for would go inside the trigger. END EDIT.
But my result is always empty. Can anyone help me out? Thanks!
Try this:
select *
from tab
where M in (select M from tab where M = 1 group by M having count(id) > 5);
SQL Fiddle Demo
Please try:
select *,count(M) from tab where M=1 group by id having count(M)>5
Since you group on your PK (which seems a futile exercise), you are counting per ID, which will indeed always return 1.
As I explain after this code, this query is NOT good and it is NOT the answer; I also explain WHY. Please do not expect this query to run correctly!
select *
from tab
where M = 1
group by M
having count(*) > 5;
Like this, you group on what you are counting, which makes a lot more sense. At the same time, this will have unexpected behaviour, as you are selecting all kinds of columns that are neither in the group by nor in any aggregate. I know MySQL is lenient about that, but I don't even want to know what it will produce.
Try indeed a subquery along these lines:
select *
from tab
where M in
(SELECT M
from tab
group by M
having count(*) > 5)
I've built a SQLFiddle demo (I used 'Test' as the table name out of habit) accomplishing this (I don't have MySQL at hand right now to test it).
-- Made up a structure for testing
CREATE TABLE Test (
id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
M int
);
SELECT id, M FROM Test
WHERE M IN (
SELECT M
FROM Test
WHERE M = 1
GROUP BY M
HAVING COUNT(M) > 5
)
The sub-query is a common "find the duplicates" kind of query, with the added condition of a specific value for the column M, also stating that there must be more than 5 dupes.
It will spit out a series of values of M which you can use to query the table against, ending with the rows you need.
You shouldn't use SELECT *; it's bad practice in general. Don't retrieve data you aren't actually using, and if you are using it, take the little time needed to type in a list of fields: you'll likely see faster querying, and the code will be far more readable.
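For the trigger mentioned in the question's edit, a rough sketch might look like the following. It assumes MySQL 5.5 or later (for SIGNAL) and the tab table from the question; the trigger name and the message text are made up. It rejects an insert that would push the number of M = 1 rows above 5:

```sql
DELIMITER //

-- tab_limit_m1 is a hypothetical trigger name.
CREATE TRIGGER tab_limit_m1
BEFORE INSERT ON tab
FOR EACH ROW
BEGIN
    -- If 5 rows with M = 1 already exist, one more would exceed the limit.
    IF NEW.M = 1 AND (SELECT COUNT(*) FROM tab WHERE M = 1) >= 5 THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'Too many rows with M = 1';
    END IF;
END//

DELIMITER ;
```

DELIMITER is a command of the mysql command-line client, not part of SQL itself, so other tools may need the trigger body submitted differently. Updates that change M to 1 are not covered here and would need a similar BEFORE UPDATE trigger.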

Mysql-Select all tables from a database

I have a database called test with tables called x, y, and z.
Each of x, y, and z has a column called date. How do I select from all three and check whether a particular date is present?
Is there any built-in function that does this?
Update:
Select the date column from all tables in the database called test.
Thanks in advance!
As far as I know, in SQL you cannot 'select a table'; you can select some column(s) from one or many tables at once. The result of such a query is another (temporary) table that you retrieve the data from.
Please be more specific about what exactly you want to do (e.g.: "I want to select a column 'z' from table 'tableA' and column 'y' from table 'tableB'") - then I'm sure your question has a pretty simple answer :)
SELECT x.date AS x_date, y.date AS y_date, z.date AS z_date FROM x,y,z;
That produces a result:
+---------+---------+---------+
| x_date | y_date | z_date |
+---------+---------+---------+
| | | |
| | | |
+---------+---------+---------+
Alternatively, you can get everything in one column by issuing a query:
SELECT date FROM x
UNION ALL
SELECT date FROM y
UNION ALL
SELECT date FROM z;
That produces a result:
+-------+
| date |
+-------+
| |
| |
+-------+
In the example above you would also get duplicate values in the single column. If you want to avoid duplicates, replace 'UNION ALL' with 'UNION'.
I'm still not sure if I understood what you really want to achieve, but I hope that helps.
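Building on the UNION ALL query above, the following sketch shows one way to find which tables in the test database have a date column and then to check each of them for a particular date. The date literal '2011-01-11' and the aliases source_table and matches are placeholders chosen for illustration:

```sql
-- List the tables in database `test` that have a column named `date`.
SELECT TABLE_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'test'
  AND COLUMN_NAME = 'date';

-- Count the matching rows per table for one date (placeholder value).
SELECT 'x' AS source_table, COUNT(*) AS matches FROM x WHERE `date` = '2011-01-11'
UNION ALL
SELECT 'y', COUNT(*) FROM y WHERE `date` = '2011-01-11'
UNION ALL
SELECT 'z', COUNT(*) FROM z WHERE `date` = '2011-01-11';
```

There is no single built-in function that searches every table, so the second query still has to name each table explicitly; the first query only helps to discover which tables to include.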
Also take a look at:
http://www.w3schools.com/sql/sql_union.asp
http://www.sql-tutorial.net/SQL-JOIN.asp

How to optimize SQL query with two where clauses?

My query is something like this
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND (tbl2.date = '$date' OR ('$date' BETWEEN tbl1.planA AND tbl1.planB ))
When I run this query, it is considerably slower than for example this query
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND ('$date' BETWEEN tbl1.planA AND tbl1.planB )
or
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND tbl2.date = '$date'
In localhost, the first query takes about 0.7 second, the second query about 0.012 second and the third one 0.008 second.
My question is: how do I optimize this? If I currently have 1000 rows in my tables and the first query takes 0.7 second, it will take 7 seconds with 10,000 rows, right? That's a massive slowdown compared to the second query (0.12 second) and the third (0.08).
I've tried adding indexes, but the result is no different.
Thanks
Edit : This application will only work locally, so no need to worry about the speed over the web.
Sorry, I didn't include the EXPLAIN because my real query is much more complicated (about 5 joins). But I don't think the joins really matter, because I've tried omitting them and still get approximately the same result as above.
The date belongs to tbl1; planA and planB belong to tbl2. I've tried adding indexes to tbl1.date, tbl2.planA and tbl2.planB, but the improvement is insignificant.
By schema do you mean MyISAM or InnoDB? It's MyISAM.
Okay, I'll just post my query straight away. Hopefully it's not that confusing.
SELECT *
FROM tb_joborder jo
LEFT JOIN tb_project p ON jo.project_id = p.project_id
LEFT JOIN tb_customer c ON p.customer_id = c.customer_id
LEFT JOIN tb_dispatch d ON jo.joborder_id = d.joborder_id
LEFT JOIN tb_joborderitem ji ON jo.joborder_id = ji.joborder_id
LEFT JOIN tb_mix m ON ji.mix_id = m.mix_id
WHERE dispatch_date = '2011-01-11'
OR '2011-01-11'
BETWEEN planA
AND planB
GROUP BY jo.joborder_id
ORDER BY customer_name ASC
And the describe output
id  select_type  table  type    possible_keys  key      key_len  ref                      rows  Extra
1   SIMPLE       jo     ALL     NULL           NULL     NULL     NULL                     453   Using temporary; Using filesort
1   SIMPLE       p      eq_ref  PRIMARY        PRIMARY  4        db_dexada.jo.project_id  1
1   SIMPLE       c      eq_ref  PRIMARY        PRIMARY  4        db_dexada.p.customer_id  1
1   SIMPLE       d      ALL     NULL           NULL     NULL     NULL                     2048  Using where
1   SIMPLE       ji     ALL     NULL           NULL     NULL     NULL                     455
1   SIMPLE       m      eq_ref  PRIMARY        PRIMARY  4        db_dexada.ji.mix_id      1
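In the describe output, the rows for d and ji show type ALL with no possible keys, which means both tables are scanned in full for every joined row. A reasonable first experiment, sketched here with invented index names, is to index the columns used in those two join conditions and then run the describe again:

```sql
-- Index names are hypothetical; the columns come from the ON clauses of the query.
ALTER TABLE tb_dispatch
    ADD INDEX idx_dispatch_joborder (joborder_id);

ALTER TABLE tb_joborderitem
    ADD INDEX idx_joborderitem_joborder (joborder_id);
```

If the indexes are picked up, the type column for d and ji should change from ALL to ref. Whether that removes most of the 0.7 second is something only a test on the real data can show.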
You can just use UNION to merge the results of the 2nd and 3rd queries.
More about UNION.
First thing that comes to mind is to union the two:
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND ('$date' BETWEEN planA AND planB )
UNION ALL
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND date = '$date'
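One caveat with UNION ALL is that a row matching both conditions comes back twice, which the original OR query would not do. Plain UNION removes such duplicates at the cost of an extra de-duplication step. Another option, sketched below, is to exclude from the second branch whatever the first branch already returns. The join columns and the date literal are placeholders, since the real ones are not given:

```sql
SELECT *
FROM tbl1
JOIN tbl2 ON tbl1.id = tbl2.tbl1_id        -- hypothetical join columns
WHERE '2011-01-11' BETWEEN planA AND planB
UNION ALL
SELECT *
FROM tbl1
JOIN tbl2 ON tbl1.id = tbl2.tbl1_id        -- hypothetical join columns
WHERE date = '2011-01-11'
  -- Skip rows the first branch already returned; IS NOT TRUE also
  -- keeps rows where planA or planB is NULL.
  AND ('2011-01-11' BETWEEN planA AND planB) IS NOT TRUE;
```

Each branch then has a single simple condition, which gives the optimizer a chance to use a separate index per branch.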
You have provided too little to make optimizations. We don't know anything about your data structures.
Even though most slow queries are due to the query itself or the index setup of the tables used, you can also try to find your bottleneck with the MySQL Query Profiler. It has been part of MySQL since version 5.0.37.
Before you start your query, activate the profiler with this statement:
mysql> set profiling=1;
Now execute your long query.
With
mysql> show profiles;
you can now find out what internal number (query number) your long query has.
If you now execute the following query, you'll get a lot of detail about what took how long:
mysql> show profile for query (insert query number here);
(example output)
+--------------------+------------+
| Status | Duration |
+--------------------+------------+
| (initialization) | 0.00005000 |
| Opening tables | 0.00006000 |
| System lock | 0.00000500 |
| Table lock | 0.00001200 |
| init | 0.00002500 |
| optimizing | 0.00001000 |
| statistics | 0.00009200 |
| preparing | 0.00003700 |
| executing | 0.00000400 |
| Sending data | 0.00066600 |
| end | 0.00000700 |
| query end | 0.00000400 |
| freeing items | 0.00001800 |
| closing tables | 0.00000400 |
| logging slow query | 0.00000500 |
+--------------------+------------+
This is a more general, administrative approach, but it can help narrow down or even identify the cause of slow queries very nicely.
A good tutorial on how to use the MySQL Query Profiler can be found here in the MySQL articles.

How to optimize an ORDER BY for a computed column on a MASSIVE MySQL table

I have a very large (80+ million row) de-normalized MySQL table. A simplified schema looks like:
+-----------+-------------+--------------+--------------+
| ID | PARAM1 | PARAM2 | PARAM3 |
+-----------+-------------+--------------+--------------+
| 1 | .04 | .87 | .78 |
+-----------+-------------+--------------+--------------+
| 2 | .12 | .02 | .76 |
+-----------+-------------+--------------+--------------+
| 3 | .24 | .92 | .23 |
+-----------+-------------+--------------+--------------+
| 4 | .65 | .12 | .01 |
+-----------+-------------+--------------+--------------+
| 5 | .98 | .45 | .65 |
+-----------+-------------+--------------+--------------+
I'm trying to see if there's a way to optimize a query in which I apply a weight to each PARAM column (where weight is between 0 and 1) and then average them to come up with a computed value SCORE. Then I want to ORDER BY that computed SCORE column.
For example, assuming the weighting for PARAM1 is .5, the weighting for PARAM2 is .23 and the weighting for PARAM3 is .76, you would end up with something similar to:
SELECT ID, ((PARAM1 * .5) + (PARAM2 * .23) + (PARAM3 * .76)) / 3 AS SCORE
FROM sometable
ORDER BY SCORE DESC LIMIT 10
With some proper indexing, this is fast for basic queries, but I can't figure out a good way to speed up the above query on such a large table.
Details:
Each PARAM value is between 0 and 1
Each weight applied to the PARAMs is between 0 and 1
--EDIT--
A simplified version of the problem follows.
This runs in a reasonable amount of time:
SELECT value1, value2
FROM sometable
WHERE id = 1
ORDER BY value2
This does not run in a reasonable amount of time:
SELECT value1, (value2 * an_arbitrary_float) as value3
FROM sometable
WHERE id = 1
ORDER BY value3
Using the above example, is there any solution that allows me to do an ORDER BY without computing value3 ahead of time?
I've found 2 (sort of obvious) things that have helped speed this query up to a satisfactory level:
Minimize the number of rows that need to be sorted. By using an index on the 'id' field and a subselect to trim the number of records first, the filesort on the computed column is not that bad, i.e.:
SELECT t.value1, (t.value2 * an_arbitrary_float) as SCORE
FROM (SELECT * FROM sometable WHERE id = 1) AS t
ORDER BY SCORE DESC
Try increasing sort_buffer_size in my.cnf to speed up those filesorts.
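Before editing the configuration file, the effect of a larger sort buffer can be tried for a single connection. The value below is only an illustrative figure, not a recommendation:

```sql
-- Inspect the current value.
SHOW VARIABLES LIKE 'sort_buffer_size';

-- Raise it for this session only (4 MiB here, chosen arbitrarily).
SET SESSION sort_buffer_size = 4194304;
```

The buffer is allocated per session that needs a sort, so a very large global value can cost a lot of memory on a busy server.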
I know this question is old, but I recently ran into this problem, and the solution I came up with was to use a derived table. In the derived table, create your calculated column. In the outer query, you can order by it. It seems to run considerably faster for my workload (orders of magnitude).
SELECT value1, value3
FROM (
SELECT value1, (value2 * an_arbitrary_float) as value3
FROM sometable
WHERE id = 1
) AS calculated
ORDER BY value3
MySQL lacks many sexy features that could help you with this. Perhaps you could add a column with the calculated ranking, index it and write a couple of triggers to keep it updated.
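As a variation on the extra-column idea, MySQL 5.7.6 and later can maintain such a column without triggers by declaring it as a stored generated column. This is a different technique from the triggers suggested above, and it only helps when the weights are fixed in advance rather than chosen per query. The sketch assumes the table is called sometable and uses the example weights from the question; the column and index names are invented:

```sql
ALTER TABLE sometable
    ADD COLUMN score DOUBLE
        GENERATED ALWAYS AS
            ((PARAM1 * 0.5 + PARAM2 * 0.23 + PARAM3 * 0.76) / 3) STORED,
    ADD INDEX idx_score (score);          -- hypothetical index name

-- The sort can now be served from the index instead of a filesort.
SELECT ID, score
FROM sometable
ORDER BY score DESC
LIMIT 10;
```

On a table with 80+ million rows the ALTER itself is likely to take a long time, so it would be worth trying on a copy first.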