Is it possible to quickly select random rows from a table, while also using a where condition?
Example:
SELECT * FROM geo WHERE placeRef = 1 ORDER BY RAND() LIMIT 1
This can take 10+ seconds.
I found this, which is sometimes quick, sometimes very slow:
(SELECT *
FROM geo
INNER JOIN ( SELECT RAND() * ( SELECT MAX( nameRef ) FROM geo ) AS ID ) AS t ON geo.nameRef >= t.ID
WHERE geo.placeRef = 1
ORDER BY geo.nameRef
LIMIT 1)
This provides a quick result, only if there is no extra where condition.
This is the create table:
CREATE TABLE `geo` (
`nameRef` int(8) DEFAULT NULL,
`placeRef` mediumint(7) unsigned DEFAULT NULL,
`category` enum('continent','country','region','subregion') COLLATE utf8_bin DEFAULT NULL,
`parentRef` mediumint(7) DEFAULT NULL,
`incidence` int(9) unsigned NOT NULL,
`percent` decimal(11,9) unsigned DEFAULT NULL,
`ratio` int(11) NOT NULL,
`rank` mediumint(7) unsigned DEFAULT NULL,
KEY `placeRef_rank` (`placeRef`,`rank`),
KEY `nameRef_category` (`nameRef`,`category`),
KEY `nameRef_parentRef` (`nameRef`,`parentRef`),
KEY `nameRef_placeRef` (`nameRef`,`placeRef`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin
N.B. this table has around 550 million rows.
Desired query: query the table where placeRef = x; and then quickly return one row.
Issue: a query like SELECT * FROM geo WHERE placeRef = 1 can provide up to about 15 million results. So selecting a single random row is slow.
That technique is variable because it depends on where the matching rows happen to lie in the table.
The quick fix may be to add this index, assuming that nameRef is the PRIMARY KEY for the table:
INDEX(placeRef, nameRef)
Let's discuss this further after you provide SHOW CREATE TABLE geo and read http://mysql.rjweb.org/doc.php/random
There are (currently) 3 indexes that make this subquery very fast (because of the leading nameRef):
( SELECT MAX( nameRef ) FROM geo )
After that, my suggestion of (placeRef, nameRef) will kick in for these:
WHERE geo.placeRef = 1
geo.nameRef >= t.ID
I think the resulting query should be consistently fast.
This is pulling a result in 1/100th of a second:
SELECT * FROM geo where placeRef = 1 AND nameRef >= CEIL( RAND() * ( SELECT MAX( nameRef ) FROM geo ) ) LIMIT 1
This works well if you have an index on both of the columns you want to query. However, you may need to make a new table that is randomly ordered. In my table the nameRefs tend to be grouped by country. This causes the random results to be drawn from a handful of rows, as most of the results are grouped around the same id. I needed to create a new table, ordered randomly with ORDER BY RAND(), in which each row has a unique id. Now I search this much smaller summary table with:
SELECT * FROM geoSummary where placeRef = 1 AND nameRef >= CEIL( RAND() * ( SELECT MAX( id ) FROM geoSummary ) ) LIMIT 1
To avoid running that SELECT MAX query every time, I saved the maximum id in the server-side code, generate the random number there, and run:
SELECT * FROM geoSummary where placeRef = 1 AND nameRef >= :random_number LIMIT 1
This provides truly random results.
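The server-side variant described above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: it uses Python's built-in sqlite3 module with a tiny in-memory stand-in for geoSummary, and the names pick_random_row and max_id are hypothetical. It assumes the summary table has a unique, randomly assigned id column and compares the threshold against that column. It also adds ORDER BY id and a wrap-around for thresholds that land past the last matching id, neither of which appears in the original queries.

```python
import random
import sqlite3

def pick_random_row(conn, place_ref, max_id, rng=random):
    """Return one row for place_ref, starting from a random id threshold."""
    threshold = rng.randint(1, max_id)  # generated in application code
    row = conn.execute(
        "SELECT id, placeRef, nameRef FROM geoSummary "
        "WHERE placeRef = ? AND id >= ? ORDER BY id LIMIT 1",
        (place_ref, threshold),
    ).fetchone()
    if row is None:
        # Threshold fell past the last matching id: wrap around to the start.
        row = conn.execute(
            "SELECT id, placeRef, nameRef FROM geoSummary "
            "WHERE placeRef = ? ORDER BY id LIMIT 1",
            (place_ref,),
        ).fetchone()
    return row

# Tiny in-memory stand-in for the pre-shuffled summary table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geoSummary (id INTEGER PRIMARY KEY, placeRef INT, nameRef INT)"
)
conn.executemany(
    "INSERT INTO geoSummary (id, placeRef, nameRef) VALUES (?, ?, ?)",
    [(i, i % 3, 1000 + i) for i in range(1, 31)],
)
conn.execute("CREATE INDEX placeRef_id ON geoSummary (placeRef, id)")
max_id = 30  # cached once instead of running SELECT MAX(id) on every request

print(pick_random_row(conn, 1, max_id))
```

The composite index on (placeRef, id) is what lets the lookup jump straight to the first matching row at or after the threshold.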
Related
I have a table like this
http://sqlfiddle.com/#!9/052381/1
I need to write a query that finds VIN codes that meet the following conditions:
VIN starts with XTA%
I have a registration history, i.e. these date_reg_last values: 1306440000, 1506715200, 1555963200. Only VIN codes that have exactly these values should be selected; if a VIN has more or fewer records, it does not match.
I have an owner_type for each of the values 1306440000, 1506715200, 1555963200: 2, 2, 2. That is, for the record with 1306440000 the owner_type must be 2, for the record with 1506715200 also 2, and so on. The type can be different for each record.
As in the third point, I have a region for each record: УЛЬЯНОВСК Г., УЛЬЯНОВСК Г., С РУНГА
I have a year; it must be the same in all records.
I tried a query like this:
SELECT *
FROM `ac_gibdd_shortinfo`
WHERE `vin` LIKE 'XTA%'
AND `model` LIKE '%1119%'
AND `date_reg_first` IN (0,1506715200,1555963200)
AND `date_reg_last` IN (1306440000,1506715200,1555963200)
AND `year` LIKE '2011'
AND `location` IN ('УЛЬЯНОВСК Г.','С РУНГА')
But it also finds VINs that have a different number of registration records. My only idea so far is to fetch all the matching records and then filter them by count with an additional query.
Test this:
SELECT *
FROM `ac_gibdd_shortinfo` t0
WHERE `vin` LIKE 'XTA%'
AND `model` LIKE '%1119%'
AND `date_reg_first` IN (0,1506715200,1555963200)
AND `date_reg_last` IN (1306440000,1506715200,1555963200)
AND `year` LIKE '2011'
AND `location` IN ('УЛЬЯНОВСК Г.','С РУНГА')
AND NOT EXISTS ( SELECT NULL
FROM ac_gibdd_shortinfo t1
WHERE t0.vin = t1.vin
AND t1.date_reg_first NOT IN (0,1506715200,1555963200) )
AND NOT EXISTS ( SELECT NULL
FROM ac_gibdd_shortinfo t2
WHERE t0.vin = t2.vin
AND t2.date_reg_last NOT IN (1306440000,1506715200,1555963200) )
AND NOT EXISTS ( SELECT NULL
FROM ac_gibdd_shortinfo t3
WHERE t0.vin = t3.vin
AND t3.location NOT IN ('УЛЬЯНОВСК Г.','С РУНГА') )
P.S. Appropriate indexes will improve performance.
And the count must match too: for (1306440000,1506715200,1555963200) there must be 3 records in total per VIN – blood73
SELECT vin, model, date_reg_first, date_reg_last, `year`, location
FROM `ac_gibdd_shortinfo` t0
WHERE `vin` LIKE 'XTA%'
AND `model` LIKE '%1119%'
AND `date_reg_first` IN (0,1506715200,1555963200)
AND `date_reg_last` IN (1306440000,1506715200,1555963200)
AND `year` LIKE '2011'
AND `location` IN ('УЛЬЯНОВСК Г.','С РУНГА')
AND 3 = ( SELECT COUNT(*)
FROM ac_gibdd_shortinfo t1
WHERE t0.vin = t1.vin );
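A different way to express "exactly this history and nothing else" is to group by VIN and require each expected record to appear exactly once. The sketch below is only an illustration under assumptions: it runs against an in-memory SQLite table through Python's sqlite3 module rather than MySQL, the helper find_matching_vins and the sample VINs are hypothetical, and the pairing of dates, owner types and regions is taken from the lists in the question. It relies on a comparison evaluating to 0 or 1 so that SUM counts matching rows, which holds in both SQLite and MySQL.

```python
import sqlite3

# Expected history: (date_reg_last, owner_type, location), one entry per record.
EXPECTED = [
    (1306440000, 2, "УЛЬЯНОВСК Г."),
    (1506715200, 2, "УЛЬЯНОВСК Г."),
    (1555963200, 2, "С РУНГА"),
]

def find_matching_vins(conn, vin_prefix, year, expected):
    """Return VINs whose records are exactly the expected history."""
    # One "appears exactly once" condition per expected record.
    per_record = " AND ".join(
        "SUM(date_reg_last = ? AND owner_type = ? AND location = ?) = 1"
        for _ in expected
    )
    sql = (
        "SELECT vin FROM ac_gibdd_shortinfo "
        "WHERE vin LIKE ? "
        "GROUP BY vin "
        f"HAVING COUNT(*) = ? AND SUM(year = ?) = COUNT(*) AND {per_record} "
        "ORDER BY vin"
    )
    params = [vin_prefix + "%", len(expected), year]
    for record in expected:
        params.extend(record)
    return [row[0] for row in conn.execute(sql, params)]

# Hypothetical sample data: A matches, B has one record too few, C one too many.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ac_gibdd_shortinfo "
    "(vin TEXT, year INT, date_reg_last INT, owner_type INT, location TEXT)"
)
rows = [("XTA_A", 2011, d, o, loc) for d, o, loc in EXPECTED]
rows += [("XTA_B", 2011, d, o, loc) for d, o, loc in EXPECTED[:2]]
rows += [("XTA_C", 2011, d, o, loc) for d, o, loc in EXPECTED]
rows.append(("XTA_C", 2011, 1600000000, 2, "С РУНГА"))
conn.executemany("INSERT INTO ac_gibdd_shortinfo VALUES (?, ?, ?, ?, ?)", rows)

print(find_matching_vins(conn, "XTA", 2011, EXPECTED))
```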
I have a table set up like so:
CREATE TABLE `cn` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`type` int(3) unsigned NOT NULL,
`number` int(10) NOT NULL,
`desc` varchar(64) NOT NULL,
`datetime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
number is usually but not necessarily unique.
Most of the table consists of rows with consecutive number entries.
e.g.
101010, 101011, 101012, etc.
I've been trying to find an efficient way to list ranges of consecutive numbers so I can find out where numbers are "missing" easily. What I'd like to do is list the start number, end number, and number of consecutive rows. Since there can be duplicates, I am using SELECT DISTINCT(number) to avoid duplicates.
I've not been having much luck - most of the questions of this type deal with dates and have been hard to generalize. One query was executing forever, so that was a no go. This answer is sort of close but not quite. It uses a CROSS JOIN, which sounds like a recipe for disaster when you have millions of records.
What would the best way to do this be? Some answers use joins, which I'm skeptical of performance wise. Right now there are only 50,000 rows, but it will be millions of records within a few days, and so every ounce of performance matters.
The eventual pseudoquery I have in mind is something like:
SELECT DISTINCT(number) FROM cn WHERE type = 1 GROUP BY [consecutive...] ORDER BY number ASC
This is a gaps-and-islands problem. You can solve it by using the difference between row_number() and number to define groups; gaps are identified by changes in the difference:
select type, min(number) first_number, max(number) last_number, count(*) no_records
from (
select cn.*, row_number() over(order by number) rn
from cn
where type = 1
) c
group by type, number - rn
Note: window functions are available in MySQL 8.0 and MariaDB 10.3 onwards.
In earlier versions, you can emulate row_number() with a session variable:
select type, min(number) first_number, max(number) last_number, count(*) no_records
from (
select c.*, @rn := @rn + 1 rn
from (select * from cn where type = 1 order by number) c
cross join (select @rn := 0) r
) c
group by number - rn
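To see why the difference between number and its row number identifies each island, here is a small sketch of the same idea outside the database. It is plain Python rather than SQL, the function name consecutive_ranges is hypothetical, and it removes duplicates first, as the question intends with SELECT DISTINCT.

```python
from itertools import groupby

def consecutive_ranges(numbers):
    """Return (first_number, last_number, count) for each run of consecutive values."""
    distinct = sorted(set(numbers))  # mirrors SELECT DISTINCT ... ORDER BY number
    ranges = []
    # number - row_number stays constant inside a run and changes at every gap.
    for _, run in groupby(
        enumerate(distinct, start=1), key=lambda pair: pair[1] - pair[0]
    ):
        values = [number for _, number in run]
        ranges.append((values[0], values[-1], len(values)))
    return ranges

print(consecutive_ranges([101010, 101011, 101012, 101012, 101015, 101016, 101020]))
# → [(101010, 101012, 3), (101015, 101016, 2), (101020, 101020, 1)]
```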
I've got a table with an auto-incremented ID in MySQL. I am always adding to this table, never deleting, and always setting the ID value to NULL on insert, so I am pretty sure there are no holes. This is the table structure:
CREATE TABLE mytable (
id smallint(5) unsigned NOT NULL AUTO_INCREMENT,
data1 varchar(200) DEFAULT NULL,
data2 varchar(30) DEFAULT NULL,
PRIMARY KEY (id),
UNIQUE KEY data (data1,data2)
)
I want to pick up a random row from the table. I am using this:
select * from mytable where id=(select floor(1 + rand() * ((select max(id) from mytable) - 1)));
But sometimes I get nothing, sometimes one row, sometimes two. Replacing max(id) with count(*) or count(id) did not help. I understand it may be because rand() is evaluated for each row. As suggested in a similar question, I used this query:
select * from mytable cross join (select @rand := rand()) const where id=floor(1 + @rand*((select count(*) from mytable)-1));
But I still get an empty set sometimes. Same goes for this:
select * from mytable cross join (select @rand := rand()) const where id=floor(@rand*(select count(*) from mytable)+1);
I am looking for a fast way to do this, so that it won't take long on big tables. ORDER BY rand() LIMIT 1 is not an option for me. Can this be done with one query?
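One way to sidestep the per-row evaluation of rand() is to draw the id once in application code and then look the row up by primary key. The sketch below is an illustration only: it uses Python's sqlite3 module with an in-memory copy of mytable, the helper random_row is hypothetical, and it relies on the stated assumption that the ids run from 1 to MAX(id) without holes. It costs two small statements rather than one.

```python
import random
import sqlite3

def random_row(conn, rng=random):
    """Pick one row by drawing the id once, outside the WHERE clause."""
    (max_id,) = conn.execute("SELECT MAX(id) FROM mytable").fetchone()
    if max_id is None:
        return None  # empty table
    wanted = rng.randint(1, max_id)  # inclusive on both ends, drawn exactly once
    return conn.execute(
        "SELECT id, data1, data2 FROM mytable WHERE id = ?", (wanted,)
    ).fetchone()

# In-memory stand-in for mytable with ids 1..10 and no holes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable "
    "(id INTEGER PRIMARY KEY AUTOINCREMENT, data1 TEXT, data2 TEXT)"
)
conn.executemany(
    "INSERT INTO mytable (data1, data2) VALUES (?, ?)",
    [(f"a{i}", f"b{i}") for i in range(10)],
)

print(random_row(conn))
```

Because the id is a bound parameter, the database sees a constant and the primary key lookup returns exactly one row.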
I have two different implementations for retrieving polls from the users someone is following and I want to know which one lends itself to a database that will be more scalable. First I'll show you the tables, and then the two implementations.
poll table
CREATE TABLE `poll` (
`id` int(1) unsigned NOT NULL AUTO_INCREMENT,
`creator_id` int(1) unsigned NOT NULL,
`date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`question` varchar(255) NOT NULL,
`num_of_responses` int(1) unsigned DEFAULT NULL,
`num_of_answers` enum('2','3','4','5') NOT NULL,
PRIMARY KEY (`id`),
KEY `creator_id` (`creator_id`),
KEY `date_created` (`date_created`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
repoll table -necessary for both implementations
CREATE TABLE `repoll` (
`repoller_id` int(1) unsigned NOT NULL,
`poll_id` int(1) unsigned NOT NULL,
`date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
KEY `repoller_id` (`repoller_id`),
KEY `poll_id` (`poll_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
following table
CREATE TABLE `following` (
`follower` int(1) unsigned NOT NULL,
`followee` int(1) unsigned NOT NULL,
KEY `follower` (`follower`),
KEY `followee` (`followee`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
user_feed table -necessary only for second implementation
CREATE TABLE `user_feed` (
`user_id` int(1) unsigned NOT NULL,
`poll_id` int(1) unsigned NOT NULL,
`repoller_id` int(1) unsigned DEFAULT NULL,
`date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
KEY `user_id` (`user_id`),
KEY `date_created` (`date_created`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
First implementation: Doesn't require the user_feed table, but the query seems much more computationally expensive than the query in implementation two.
SELECT P.id, P.creator_id, P.date_created
FROM
following f JOIN
(
SELECT id, creator_id, date_created
FROM poll
UNION ALL
SELECT poll_id, repoller_id, date_created
FROM repoll
) AS P(id, creator_id, date_created)
ON f.followee=P.creator_id
AND f.follower=23
ORDER BY P.date_created DESC
LIMIT 120;
Second implementation: Requires the user_feed table and the repoll table. I add a record to the user_feed table every time someone posts/repolls something. The record is added for each one of the poster's followers. I only keep, say, 120 records for any particular user in the user_feed table. If a post is made and a user already has 120 records in the user_feed table, the oldest record for that user is removed and added to the repoll table; and the new one takes its place. If a user requests more records than there are present in the user_feed table for them, then the first implementation is used to retrieve the excess.
SELECT uf.poll_id, p.creator_id, uf.repoller_id, uf.date_created
FROM
user_feed uf JOIN poll p
ON uf.poll_id=p.id
AND uf.user_id=23
ORDER BY date_created DESC;
Unwinding that original query, it looks like the query is looking for:
polls created by 23
repolls created by 23
polls created by someone followed by 23
repolls created by someone followed by 23
Assuming that there's a restriction on rows in `following` such that 23 cannot be a follower of himself (i.e. no rows allowed where follower=followee),
and also assuming that a user cannot repoll the same poll at the same exact time, that is, the (poll_id, repoller_id, created_on) tuple is UNIQUE in repoll
(and probably some other conditions I've not identified yet...)
It looks like four distinct sets:
1) polls created by 23
SELECT p.id
, p.creator_id
, p.created_on
, NULL AS repoller_id
FROM poll p
WHERE p.creator_id = 23
ORDER BY p.created_on DESC LIMIT 80
2) repolls by 23
SELECT p.id
, p.creator_id
, r.created_on
, r.repoller_id
FROM poll p
JOIN repoll r
ON r.poll_id = p.id
WHERE r.repoller_id = 23
ORDER BY r.created_on DESC LIMIT 80
3) polls created by someone followed by 23
SELECT p.id
, p.creator_id
, p.created_on
, NULL AS repoller_id
FROM poll p
JOIN following f
ON f.followee = p.creator_id
AND f.follower = 23
AND f.follower <> f.followee -- only needed if we don't disallow 23 to follow 23
ORDER BY p.created_on DESC LIMIT 80
4) repolls created by someone followed by 23
SELECT p.id
, p.creator_id
, r.created_on
, r.repoller_id
FROM poll p
JOIN repoll r
ON r.poll_id = p.id
JOIN following f
ON f.followee = r.repoller_id
AND f.follower = 23
AND f.follower <> f.followee -- only needed if we don't disallow 23 to follow 23
ORDER BY r.created_on DESC LIMIT 80
If there's some possibility of duplicates that need to be eliminated (for example, because we don't have appropriate UNIQUE constraints on the tables), we can add a GROUP BY clause to the queries that require it.
We can tune each of these individual queries, making sure appropriate indexes are available and being used, using EXPLAIN.
Then we can combine the queries with UNION ALL set operators. My preference is to not reuse any table alias within a query, even if it's unambiguous to MySQL. It makes the statement easier to read and, in particular, makes the EXPLAIN output easier to decipher when every table reference has a unique alias.
Since the original query orders by created_on in descending order and specifies a limit of 80 rows returned, we can apply that same ordering and limit to each individual subquery. When we get to the ORDER BY on the whole set, we'll have at most 320 (= 4 x 80) rows. To make the result more deterministic, we'll include a second expression in the ORDER BY clauses.
(
SELECT p1.id
, p1.creator_id
, p1.created_on
, NULL AS repoller_id
FROM poll p1
WHERE p1.creator_id = 23 -- query parameter
ORDER BY p1.created_on DESC, p1.id DESC LIMIT 80
)
UNION ALL
(
SELECT p2.id
, p2.creator_id
, r2.created_on
, r2.repoller_id
FROM poll p2
JOIN repoll r2
ON r2.poll_id = p2.id
WHERE r2.repoller_id = 23 -- query parameter
ORDER BY r2.created_on DESC, r2.poll_id DESC LIMIT 80
)
UNION ALL
(
SELECT p3.id
, p3.creator_id
, p3.created_on
, NULL AS repoller_id
FROM poll p3
JOIN following f3
ON f3.followee = p3.creator_id
AND f3.follower = 23 -- query parameter
AND f3.follower <> f3.followee -- only needed if we allow 23 to follow 23
ORDER BY p3.created_on DESC, p3.id DESC LIMIT 80
)
UNION ALL
(
SELECT p4.id
, p4.creator_id
, r4.created_on
, r4.repoller_id
FROM poll p4
JOIN repoll r4
ON r4.poll_id = p4.id
JOIN following f4
ON f4.followee = r4.repoller_id
AND f4.follower = 23 -- query parameter
AND f4.follower <> f4.followee -- only needed if we allow 23 to follow 23
ORDER BY r4.created_on DESC, r4.poll_id DESC LIMIT 80
)
ORDER BY created_on DESC, id DESC LIMIT 80
Even though the SQL text in this query is longer than either of the two options you posted, I would expect we'd have a much better shot at predictable performance, given suitable indexes.
In place of the indexes on the singleton columns `creator_id` and `repoller_id`:
CREATE INDEX poll_IX1 ON poll (creator_id, created_on, id) ;
CREATE INDEX repoll_IX1 ON repoll (repoller_id, created_on, poll_id) ;
And on the `following` table, make a unique constraint, e.g.
ALTER TABLE `following` ADD PRIMARY KEY (follower, followee)
And not for this query, but for other queries that are likely used in the system...
CREATE UNIQUE INDEX following_UX1 ON following (followee, follower)
And drop the (now redundant) indexes on the singleton columns of the `following` table.
Also consider adding appropriate foreign key constraints.
Implementation 1 can be improved thus:
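SELECT P.id, P.creator_id, P.date_created
FROM (
-- filter by follower inside each branch, so each LIMIT counts only rows from followed users
( SELECT p1.id, p1.creator_id, p1.date_created
FROM following f1
JOIN poll p1 ON f1.followee = p1.creator_id
WHERE f1.follower = 23
ORDER BY p1.date_created DESC
LIMIT 120
) UNION ALL
( SELECT r2.poll_id, r2.repoller_id, r2.date_created
FROM following f2
JOIN repoll r2 ON f2.followee = r2.repoller_id
WHERE f2.follower = 23
ORDER BY r2.date_created DESC
LIMIT 120
)
) AS P
ORDER BY P.date_created DESC
LIMIT 120;
poll needs INDEX(creator_id, date_created); repoll needs INDEX(repoller_id, date_created)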
Explanation: In most situations, the seemingly redundant ORDER BY .. LIMIT .. clause is actually an optimization. In the SELECTs in the UNION it minimizes the number of rows to store in the temp table that UNION will create. The temp table will have no more than 2*120 rows; this is important if the tables have millions of rows. The outer query also needs the clauses in order to shuffle the sublists together and whittle the result down to just the 120 that are desired.
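The reasoning behind the per-branch ORDER BY .. LIMIT .. can be checked with a small sketch in plain Python. It is an illustration only: the function names are hypothetical, rows are (id, creator_id, date_created) tuples, and each list is assumed to already contain only rows from followed users. Keeping the newest N rows of each branch and then taking the newest N of that small set gives the same answer as sorting the full union.

```python
import heapq

def newest(rows, limit):
    """Like ORDER BY date_created DESC LIMIT limit on (id, creator_id, date_created) rows."""
    return heapq.nlargest(limit, rows, key=lambda row: row[2])

def feed_with_branch_limits(polls, repolls, limit):
    # Each UNION ALL branch keeps only its newest `limit` rows ...
    temp = newest(polls, limit) + newest(repolls, limit)
    # ... so the outer ORDER BY .. LIMIT sorts at most 2 * limit rows.
    return newest(temp, limit)

def feed_without_branch_limits(polls, repolls, limit):
    # Baseline: materialise the whole union, then sort and cut.
    return newest(polls + repolls, limit)

polls = [(1, 7, 100), (2, 8, 300), (3, 7, 500)]
repolls = [(1, 9, 200), (2, 9, 400), (3, 8, 600)]
print(feed_with_branch_limits(polls, repolls, 2))
# → [(3, 8, 600), (3, 7, 500)]
```

Any row outside the newest N of its own branch has at least N newer rows in that branch alone, so it can never appear in the overall newest N.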
Improve indexes and allow for "covering indexes":
CREATE TABLE `following` (
`follower` int unsigned NOT NULL,
`followee` int unsigned NOT NULL,
PRIMARY KEY(`follower`, followee),
INDEX (`followee`, follower)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Explanation: I assume that the combination of follower and followee is 'unique', hence can be the PRIMARY KEY. The performance benefit of doing things this way is that all the followees for a given follower are found in adjacent rows, and vice versa. What you originally had is significantly slower because it had to first look in the index, then read into the data in order to get both fields. What I give you is both "covering" and "clustered"
Shouldn't your second flavor have a LIMIT?
uf needs INDEX(user_id, date_created)
As for your original question of 'which', ... I don't see that the second query does the right thing.
I have the following table setup in mysql:
CREATE TABLE `games_characters` (
`game_id` int(11) DEFAULT NULL,
`player_id` int(11) DEFAULT NULL,
`character_id` int(11) DEFAULT NULL,
KEY `game_id_key` (`game_id`),
KEY `character_id_key` (`character_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
My objective is to get a game_id where a list of character_ids are all present in this game_id.
An example set of data:
1, 1
1, 2
1, 3
2, 1
2, 2
3, 1
3, 4
Let's say I want to get the game_id where the character_id has 1, 2, and 3. How would I go about making an efficient query? The best idea I have had so far was joining the table to itself multiple times, but I assume there has to be a better way to do this.
Thanks
EDIT: for anyone curious this was the final solution I used as it proved the best query time:
SELECT game_ID
FROM (
SELECT DISTINCT character_ID, game_ID
FROM games_Characters
) AS T
WHERE character_ID
IN ( 1, 2, 3 )
GROUP BY game_ID
HAVING COUNT( * ) =3
Select game_ID from games_Characters
where character_ID in (1,2,3)
group by game_ID
having count(*) = 3
The above makes two assumptions:
1) you know the characters you're looking for
2) the combination of game_ID and character_ID is unique
I don't just assume you can get the 3 for the count; I know you can, since you know the list of characters you're looking for.
This ought to do it.
select game_id
from games_characters
where character_id in (1,2,3)
group by game_id
having count(*) = 3
If that's not dynamic enough for you, you'll need to add a few more steps.
create temporary table character_ids(id int primary key);
insert into character_ids values (1),(2),(3);
select @count := count(*)
from character_ids;
select gc.game_id
from games_characters as gc
join character_ids as c
on (gc.character_id = c.id)
group by gc.game_id
having count(*) = @count;
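When the list of characters comes from application code, the count can be derived from the list itself instead of a temporary table. The following is a minimal sketch under assumptions: it runs against an in-memory SQLite table via Python's sqlite3 module, the helper games_with_all_characters is hypothetical, and it uses COUNT(DISTINCT character_id) so that duplicate (game_id, character_id) rows do not inflate the count.

```python
import sqlite3

def games_with_all_characters(conn, character_ids):
    """Return game_ids in which every requested character_id is present."""
    wanted = sorted(set(character_ids))
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = (
        "SELECT game_id FROM games_characters "
        f"WHERE character_id IN ({placeholders}) "
        "GROUP BY game_id "
        "HAVING COUNT(DISTINCT character_id) = ? "
        "ORDER BY game_id"
    )
    return [row[0] for row in conn.execute(sql, [*wanted, len(wanted)])]

# The example data set from the question: (game_id, character_id).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games_characters (game_id INT, player_id INT, character_id INT)"
)
conn.executemany(
    "INSERT INTO games_characters (game_id, character_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1), (3, 4)],
)

print(games_with_all_characters(conn, [1, 2, 3]))
# → [1]
```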