MySQL improve sampling query speed - mysql

I have a table with 3,000,000 records. I tried to randomly extract 300,000 records using the following method, but it takes about 7 minutes.
SELECT * FROM mytable WHERE `class`='faq' ORDER BY RAND() LIMIT 300000
How can I improve the speed of random extraction?
The MySQL version is 5.6.

The cost is most likely due to sorting all the matching data. You don't specify how many rows match the condition, so this sort is likely to be some fraction of 3,000,000 rows.
If you can deal with approximately 300,000, you can use sampling logic in the WHERE clause:
SELECT t.*
FROM mytable t CROSS JOIN
(SELECT COUNT(*) as cnt
FROM mytable
WHERE class = 'faq'
) x
WHERE t.class = 'faq' AND
rand() < (300000 / cnt);
To be more precise, you can take a slightly larger random sample and then use order by/limit:
SELECT t.*
FROM mytable t CROSS JOIN
(SELECT COUNT(*) as cnt
FROM mytable
WHERE class = 'faq'
) x
WHERE t.class = 'faq' AND
rand() < (300000 / cnt) * 1.1
ORDER BY rand()
LIMIT 300000;
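The idea behind the two queries above can be sketched outside the database. This is a minimal Python illustration, not the answer's own code: the function name `sample_about` and its parameters are hypothetical. It keeps each row with probability target/count (slightly inflated), then shuffles and trims only the small surviving set, which is what avoids sorting every matching row.

```python
import random

def sample_about(rows, target, oversample=1.1, seed=None):
    """Keep each row with probability (target / len(rows)) * oversample,
    then shuffle the survivors and trim to at most `target` rows."""
    rng = random.Random(seed)
    if not rows:
        return []
    # Same role as rand() < (300000 / cnt) * 1.1 in the WHERE clause.
    p = min(1.0, target / len(rows) * oversample)
    kept = [r for r in rows if rng.random() < p]  # single pass, no full sort
    rng.shuffle(kept)                             # ORDER BY rand() on the reduced set
    return kept[:target]                          # LIMIT target
```

With a 10% oversample the filter nearly always leaves at least `target` rows when the table is large, but that is a probabilistic expectation, not a guarantee.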

Related

SQL query with a major NOT IN not working

Does anyone know what's wrong with this query?
This works perfectly on its own:
SELECT * FROM
(SELECT * FROM data WHERE site = '".$id."'
AND disabled = '0'
AND carvotes NOT LIKE '0'
AND (time > ( now( ) - INTERVAL 14 DAY ))
GROUP BY car ORDER BY carvotes DESC LIMIT 0 , 10)
X order by time DESC
So does this:
SELECT * FROM data WHERE site = '".$id."' AND disabled = '0' GROUP BY car DESC ORDER BY time desc LIMIT 0 , 30
But combining them like this:
SELECT * FROM data WHERE site = '".$id."' AND disabled = '0' AND car NOT IN (SELECT * FROM
(SELECT * FROM data WHERE site = '".$id."'
AND disabled = '0'
AND carvotes NOT LIKE '0'
AND (time > ( now( ) - INTERVAL 14 DAY ))
GROUP BY car ORDER BY carvotes DESC LIMIT 0 , 10)
X order by time DESC) GROUP BY car DESC ORDER BY time desc LIMIT 0 , 30
Gives errors. Any ideas?
Please try the following...
$result = mysqli_query( $con,
"SELECT *
FROM data
WHERE site = '" . $id .
"' AND disabled = '0'
AND car NOT IN ( SELECT car
FROM ( SELECT car,
carvotes
FROM data
WHERE site = '" . $id .
"' AND disabled = '0'
AND carvotes NOT LIKE '0'
AND ( time > ( NOW( ) - INTERVAL 14 DAY ) )
GROUP BY car
ORDER BY carvotes DESC
LIMIT 10 ) X
)
GROUP BY car
ORDER BY time DESC
LIMIT 30" );
The main cause of your problem is that with car NOT IN ( SELECT * FROM ( SELECT *... you are trying to compare each record's value of car with each row returned by your subquery. IN requires you to have the same number of fields on both sides of the comparison. By using SELECT * at both levels of the subquery you were ensuring that the right side of the comparison had however many fields are in data versus your single field on the left, which confused MySQL.
Since you are aiming to compare to a single field, namely car, our subquery has to select just the car field from its dataset. Since the sort order of the subquery's results has no effect upon the IN comparison, and since our innermost query will be returning just car, I have removed the outer level of the subquery.
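The column-count rule described above can be reproduced in a small sketch. It uses Python's standard-library sqlite3 module as a stand-in for MySQL (an assumption; the two engines word their errors differently), and the table contents are made up for illustration.

```python
import sqlite3

def demo_in_column_count():
    """Show that IN needs a one-column subquery when the left side is one column."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE data (car TEXT, carvotes INTEGER)")
    con.executemany("INSERT INTO data VALUES (?, ?)",
                    [("audi", 5), ("bmw", 0), ("fiat", 3)])
    try:
        # SELECT * returns two columns, but `car` is a single value.
        con.execute(
            "SELECT car FROM data WHERE car NOT IN (SELECT * FROM data)"
        ).fetchall()
        star_failed = False
    except sqlite3.Error:
        star_failed = True
    # Selecting just `car` gives the comparison one column on each side.
    rows = con.execute(
        "SELECT car FROM data WHERE car NOT IN "
        "(SELECT car FROM data WHERE carvotes <> 0) ORDER BY car"
    ).fetchall()
    con.close()
    return star_failed, [r[0] for r in rows]
```

Calling `demo_in_column_count()` reports that the `SELECT *` form is rejected, while the single-column form returns the cars that are not in the subquery's result.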
Beyond changing the first part of the subquery to SELECT car, the only other change that I have made to the subquery is to change LIMIT 0, 10 to LIMIT 10. The former means limit to the 10 records that are offset by 0 from the first record. An offset is useful if you want records 6 to 15, but redundant for 1 to 10, as LIMIT 10 has the same effect and is slightly simpler. Ditto for LIMIT 0, 30 at the end of your overall statement.
As for the main body of the statement, I have not made any attempt to specify what fields (or aggregate functions of those fields) should be returned, since you have made no statement indicating what your requirements / preferences are. If you are satisfied that GROUP BY has left you with a still valid set of values, then all is good, but if not then I recommend that you rewrite your Question to be specific about that detail.
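To make the LIMIT forms concrete, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL (an assumption; SQLite accepts the same `LIMIT offset, count` form for compatibility). The table `t` and its 20 rows are invented for the example.

```python
import sqlite3

def limit_demo():
    """Compare LIMIT 0, 10 with LIMIT 10, and show an offset of 5."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (n INTEGER)")
    con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 21)])
    base = "SELECT n FROM t ORDER BY n LIMIT "
    zero_offset = [r[0] for r in con.execute(base + "0, 10")]  # offset 0, 10 rows
    plain = [r[0] for r in con.execute(base + "10")]           # first 10 rows
    shifted = [r[0] for r in con.execute(base + "5, 10")]      # skip 5, then 10 rows
    con.close()
    return zero_offset, plain, shifted
```

The first two lists are identical (records 1 to 10), and the third holds records 6 to 15.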
By default, MySQL sorts the data subjected to a GROUP BY into ascending order, but if an ORDER BY clause is also present then it overrides the GROUP BY's sort pattern. As such, there is no benefit to specifying DESC after either of your GROUP BY car clauses, so I have removed it where it occurs.
Interesting sidenote: You can override a GROUP BY's sort by specifying ORDER BY NULL.
If you have any questions or comments, then please feel free to post a Comment accordingly.
Further Reading
https://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html - on optimising your ORDER BY sorting
https://dev.mysql.com/doc/refman/5.7/en/select.html - on the SELECT statement's syntax - specifically the parts to do with LIMIT.
https://www.w3schools.com/php/php_mysql_select_limit.asp - a simpler explanation of LIMIT
This is your query:
SELECT *
FROM data
WHERE site = '".$id."' AND disabled = '0' AND
car NOT IN (SELECT *
FROM (SELECT *
FROM data
WHERE site = '".$id."' AND
disabled = '0' AND
carvotes NOT LIKE '0' AND
(time > ( now( ) - INTERVAL 14 DAY ))
GROUP BY car
ORDER BY carvotes DESC
LIMIT 0 , 10
) x
ORDER BY time DESC
)
GROUP BY car DESC
ORDER BY time desc
LIMIT 0 , 30 ;
Several comments:
Do not wrap integer constants in single quotes. This can mislead people. This can mislead optimizers.
Do not use string functions on integers (such as like). Same reason.
NOT IN with subqueries is dangerous. The construct does not handle NULL values the way you expect. Use NOT EXISTS or LEFT JOIN instead.
When using subqueries, ORDER BY is almost never appropriate.
Never use SELECT * with GROUP BY. It is just wrong. Happily, MySQL 5.7 has changed its defaults to reject this anti-pattern.
So, a better way to write this query is something like this:
SELECT d.car, MAX(d.time) as time
FROM data d LEFT JOIN
(SELECT d2.car
FROM data d2
WHERE d2.site = '".$id."' AND
d2.disabled = 0 AND
d2.carvotes <> 0 AND
d2.time > now() - INTERVAL 14 DAY
GROUP BY d2.car
ORDER BY MAX(d2.carvotes) DESC
LIMIT 10
) car10
ON d.car = car10.car
WHERE d.site = '".$id."' AND d.disabled = 0 AND
car10.car IS NULL
GROUP BY d.car
ORDER BY MAX(d.time) DESC
LIMIT 30;
Alternatively, use SELECT * and remove the GROUP BY in the outer query.
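The warning about NOT IN and NULL values can be seen in a small sketch. It uses Python's sqlite3 module as a stand-in for MySQL (an assumption; both follow the standard three-valued logic here), with two invented tables, `cars` and `excluded`.

```python
import sqlite3

def not_in_null_demo():
    """Compare NOT IN, NOT EXISTS and LEFT JOIN when the subquery yields a NULL."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE cars (car TEXT);
        CREATE TABLE excluded (car TEXT);
        INSERT INTO cars VALUES ('audi'), ('bmw'), ('fiat');
        INSERT INTO excluded VALUES ('bmw'), (NULL);
    """)
    def cars(sql):
        return [r[0] for r in con.execute(sql)]
    # 'audi' NOT IN ('bmw', NULL) evaluates to NULL, so the row is filtered out.
    not_in = cars("SELECT car FROM cars "
                  "WHERE car NOT IN (SELECT car FROM excluded) ORDER BY car")
    not_exists = cars("SELECT c.car FROM cars c WHERE NOT EXISTS "
                      "(SELECT 1 FROM excluded e WHERE e.car = c.car) ORDER BY c.car")
    left_join = cars("SELECT c.car FROM cars c LEFT JOIN excluded e ON e.car = c.car "
                     "WHERE e.car IS NULL ORDER BY c.car")
    con.close()
    return not_in, not_exists, left_join
```

A single NULL in the subquery makes the NOT IN version return no rows at all, while the NOT EXISTS and LEFT JOIN ... IS NULL versions return the two cars that are not excluded.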

Select limit percent

I want, for example, 10% of the records in a table, not 10 records.
This query runs in SQL Server:
select top 10 percent * from tablename
Why does this query not run in MySQL?
select top 10 percent * from tablename
You could do it with a subquery. This is pretty basic, since you want everything in one table:
SELECT *
FROM (
SELECT tablename.*, @counter := @counter + 1 AS counter
FROM (SELECT @counter := 0) AS initvar, tablename
ORDER BY value DESC
) AS X
WHERE counter <= (10/100 * @counter)
ORDER BY value DESC;
In MySQL, use ORDER BY and LIMIT. Note that LIMIT takes a row count, not a percentage, so this returns 10 rows:
select * from tablename order by percent desc limit 10
The TOP clause works in SQL Server (MSSQL), not in MySQL.
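Since LIMIT expects a row count, one way to get a percentage is to count the rows first and compute the limit in application code. The sketch below uses Python's sqlite3 module as a stand-in for MySQL (an assumption); the helper name `top_percent` is hypothetical, the table and column names follow the answers above, and rounding down is an arbitrary choice.

```python
import sqlite3

def top_percent(con, percent):
    """Return the top `percent` percent of rows by `value`, rounding down."""
    total = con.execute("SELECT COUNT(*) FROM tablename").fetchone()[0]
    limit = total * percent // 100  # integer row count for LIMIT
    rows = con.execute(
        "SELECT value FROM tablename ORDER BY value DESC LIMIT ?", (limit,)
    ).fetchall()
    return [r[0] for r in rows]
```

This costs two round trips, but it keeps the query itself simple and avoids user variables.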

DELETE with a percentage on total count

Let's say I want to delete 10% of rows, is there a query to do this?
Something like:
DELETE FROM tbl WHERE conditions LIMIT (SELECT COUNT(*) FROM tbl WHERE conditions) * 0.1
If you only need roughly 10% of rows, in no particular order, this should do the trick:
DELETE FROM tbl WHERE RAND() <= 0.1
However, I don't recommend using it on very large data sets due to the overhead of generating random numbers.
I would simply return the total amount of filtered rows, calculate through php and use that value as a limit in my DELETE query.
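// $con is assumed to be an open mysqli connection; "conditions" is a placeholder.
$result = mysqli_query($con, "SELECT COUNT(*) FROM tbl WHERE conditions");
$row = mysqli_fetch_row($result);
$limit = (int) round($row[0] * 0.1); // 10% of the matching rows
mysqli_query($con, "DELETE FROM tbl WHERE conditions LIMIT {$limit}");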
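The same count-then-delete approach can be sketched in Python with the sqlite3 module as a stand-in for MySQL (an assumption). The helper name `delete_fraction` and the condition `flag = 1` are hypothetical placeholders for "conditions". Stock SQLite builds do not accept DELETE ... LIMIT, so the sketch selects the ids to delete in a subquery; in MySQL the DELETE ... LIMIT form shown above would be used directly.

```python
import sqlite3

def delete_fraction(con, fraction):
    """Delete `fraction` of the rows matching the condition; return rows deleted."""
    total = con.execute("SELECT COUNT(*) FROM tbl WHERE flag = 1").fetchone()[0]
    limit = round(total * fraction)  # exact number of rows to remove
    cur = con.execute(
        "DELETE FROM tbl WHERE id IN "
        "(SELECT id FROM tbl WHERE flag = 1 ORDER BY id LIMIT ?)",
        (limit,),
    )
    con.commit()
    return cur.rowcount
```

Unlike the RAND() <= 0.1 filter, this removes an exact share of the matching rows, at the price of a deterministic choice (lowest ids first in this sketch).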
I'm not sure if DELETE allows an advanced query such as this one:
DELETE FROM ( SELECT h2.id
FROM ( SELECT COUNT(*) AS total
FROM tbl
WHERE conditions) AS h
JOIN ( SELECT *, @rownum := @rownum + 1 AS rownum
FROM tbl, (SELECT @rownum := 0) AS vars
WHERE conditions) AS h2
ON '1'
WHERE rownum < total * 0.1) AS h3

Negative limit offset in mysql

I'm creating a high score server, and one of the needed features is being able to retrieve high scores around the user's current score. I currently have the following:
SELECT * FROM highscores
WHERE score >= ( SELECT score FROM highscores WHERE userID = someID )
ORDER BY score, updated ASC
LIMIT -9, 19
The only problem here is that the offset parameter of LIMIT can't be negative; otherwise I believe this would work dandy. So in conclusion, is there any trick / way to supply a negative offset to LIMIT, or is there perhaps a better way to go about this entirely?
You can either do a real pain in the butt single select query, or just do this:
(SELECT * FROM highscores
WHERE score <= ( SELECT score FROM highscores WHERE userID = someID )
ORDER BY score DESC, updated DESC
LIMIT 9)
UNION
(SELECT * FROM highscores
WHERE score = ( SELECT score FROM highscores WHERE userID = someID ))
UNION
(SELECT * FROM highscores
WHERE score >= ( SELECT score FROM highscores WHERE userID = someID )
ORDER BY score, updated ASC
LIMIT 9)
I threw in a piece to grab the indicated user's score so it's in the middle of the list. Optional if you need it. Also, don't use SELECT *, use specific fields. Clarity is always preferable, and performance wise, * sucks.
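The "rows around a user" idea can also be done as three small queries stitched together in application code. This is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL (an assumption); the helper name `neighbours` and the `span` parameter are hypothetical, while the table and column names follow the question. Note the descending order for the lower half, which is what picks the scores nearest to the user's rather than the lowest in the table.

```python
import sqlite3

def neighbours(con, user_id, span=9):
    """Return (userID, score) rows around the user's score, in ascending score order."""
    row = con.execute(
        "SELECT score FROM highscores WHERE userID = ?", (user_id,)
    ).fetchone()
    if row is None:
        return []
    score = row[0]
    below = con.execute(
        "SELECT userID, score FROM highscores WHERE score < ? "
        "ORDER BY score DESC LIMIT ?", (score, span)
    ).fetchall()  # nearest lower scores, closest first
    same = con.execute(
        "SELECT userID, score FROM highscores WHERE score = ? ORDER BY userID",
        (score,),
    ).fetchall()
    above = con.execute(
        "SELECT userID, score FROM highscores WHERE score > ? "
        "ORDER BY score ASC LIMIT ?", (score, span)
    ).fetchall()  # nearest higher scores, closest first
    return below[::-1] + same + above
```

Near the top or bottom of the table one side simply returns fewer rows, so no negative offset is needed.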

MySQL Select Random X Entries - Optimized

MySQL what's the best way to select X random entries (rather than just one) - optimization for heavy use, i.e. on main page of a domain.
Supposedly just blindly using MySQL rand() is going to make this rather scary for large databases - please give me a better optimization answer than that!
The solution is to use PHP.
See this article, which picks solution 3 as the fastest:
http://akinas.com/pages/en/blog/mysql_random_row/
Solution 3 [PHP]
$offset_result = mysql_query( " SELECT FLOOR(RAND() * COUNT(*)) AS `offset` FROM `table` ");
$offset_row = mysql_fetch_object( $offset_result );
$offset = $offset_row->offset;
$result = mysql_query( " SELECT * FROM `table` LIMIT $offset, 1 " );
Solution 4 [SQL] (second fastest)
SELECT * FROM `table` WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM `table` ) ORDER BY id LIMIT 1;
I had an issue with the ids. The id was auto-generated, but the minimum id was very large relative to the total number of records. So I made a small change to make the query more random, though a little slower.
SELECT * FROM `table` WHERE id >= (SELECT FLOOR(MIN(id) + (MAX(id) - MIN(id)) * RAND()) FROM `table`) ORDER BY id LIMIT 10;
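Solution 3 returns a single row; one way to extend it to X rows is to draw X distinct random offsets in application code and fetch one row per offset. The sketch below uses Python's sqlite3 module as a stand-in for MySQL (an assumption); the helper name `random_rows` and the table name `items` are hypothetical. Because it works on row positions rather than id values, gaps or a large minimum id do not skew the result.

```python
import random
import sqlite3

def random_rows(con, x, rng=random):
    """Return the ids of up to `x` distinct rows chosen uniformly at random."""
    total = con.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    # Distinct offsets guarantee distinct rows.
    offsets = rng.sample(range(total), min(x, total))
    return [
        con.execute(
            "SELECT id FROM items ORDER BY id LIMIT 1 OFFSET ?", (off,)
        ).fetchone()[0]
        for off in offsets
    ]
```

Each lookup still has to skip `offset` rows, so this is likely to suit a modest X better than a very large sample.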