I have a table containing about 1 billion records. It has the following structure:
id | name | first_id | second_id
I also have an array with a set of specific words:
$arr = ['camel', 'toe', 'glasses', 'book'];
I now have to fetch all records from this table where:
- name contains one or more keywords from this array
- first_id matches 8
- second_id matches 55
Those values are made up of course, they change dynamically in my application.
How can I do this so that it's most efficient?
I tried the following:
SELECT *
FROM table t
WHERE (t.name LIKE '%camel%' OR t.name LIKE '%toe%' OR t.name LIKE '%glasses%' OR t.name LIKE '%book%') AND t.first_id = 8 AND t.second_id = 55;
But it takes about 3.5 s to execute.
I just need about 3-4 random records from this query, so I also tried limiting the results to 300. But that still took about 700 ms, which is way too long.
I also tried randomizing LIMIT and OFFSET, but then I'd have to count all the results first, so it would be even slower.
Is there a way to solve this problem?
First, learn how to use EXPLAIN SELECT. This should tell you a bit about how mysql will pick a strategy for your query.
If just using first_id and second_id reduces the table to a small number of records, it should be pretty fast, but it does mean that you need an index. Only one index can be used per table, so how you build that index depends on the cardinality of both first_id and second_id. If both only contain a limited number of values (say: under a hundred), you should make an index that references both.
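For example, a minimal sketch of such an index (the table and index names are assumptions, since the question uses placeholders):

CREATE INDEX idx_first_second ON `table` (first_id, second_id);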
But if there's still a ton of records in the table even for those first_id and second_id values, it means you need an index on the name field instead.
A regular index will do nothing for you for that field. You need a FULLTEXT index.
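A minimal sketch, assuming the table and columns from the question; note that a FULLTEXT index requires MATCH ... AGAINST and will not speed up LIKE '%...%':

ALTER TABLE `table` ADD FULLTEXT INDEX ft_name (name);

SELECT *
FROM `table` t
WHERE MATCH(t.name) AGAINST('camel toe glasses book' IN BOOLEAN MODE)
  -- in boolean mode, unmarked words act as "any of these", like the ORed LIKEs
  -- caveat: very short words may be ignored, depending on
  -- ft_min_word_len / innodb_ft_min_token_size
  AND t.first_id = 8
  AND t.second_id = 55;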
Related
The query is
SELECT row
FROM `table`
USE INDEX(`indexName`)
WHERE row1 = '0'
AND row2 = '0'
AND row3 >= row4
AND (row5 = '0' OR row5 LIKE 'value')
For this MySQL query I've created an index using:
CREATE INDEX indexName ON `table` (row1, row2, row3, row5);
However, the performance is not really good. It's extracting about 17,000+ rows out of a 5.9+ million row table in anywhere from 6-12 seconds.
It seems like the bottleneck is the row3 >= row4 comparison, because without that part the query runs in 0.6-0.7 seconds.
(from Comment)
The row (placeholder column name) is actually the id column (primary key, indexed) in the table, and it is the result set I'm outputting later on. I'm outputting an array of IDs that match the parameters in my query, and then selecting a random ID from that array to gather data through a final query on a specific row. This was done as a workaround for RAND(). Any adjustments needed based on that knowledge?
17K rows is not a tiny result set. Large result sets often take time just because of the overhead of delivering the data from the MySQL server to the program requesting them.
The contents of the 'value' you use in row5 LIKE 'value' matter a great deal to query performance. If 'value' starts with a wildcard character like % your query will be slow.
That being said, you need a so-called covering index. The index you created comes close, but it isn't perfect.
Your query filters on equality to constant values on row1, row2, and row5, so those columns should come first in your index. The query planner can random-access your index to the first matching entry, and then sequentially scan the index until it gets to the last matching entry. That's as fast as it gets.
Then you want to examine row3 and row4 (to compare them). Those columns should come next in the index. Finally, if your query's SELECT clause mentions a subset of the columns in your table you should put the rest of those columns in the index. So, based on the query in your question, your index should be
CREATE INDEX indexName ON `table` (row1, row2, row5, row3, row4, row);
The query planner will be able to satisfy the entire query by scanning through a subset of the index, using a so-called index range scan. That should be decently fast.
Pro tip: don't force the query planner's hand with USE INDEX(). Instead, structure your indexes to handle your queries efficiently.
An index can't be used to compare two columns in the same table (at best, it could be used for an index scan rather than a table scan if all output fields are contained in the index), so there basically is no "correct" way to do this.
If you have control over the structure AND the processes that fill the table, you could add a calculated field that holds the difference between the two fields. Then add that field to the index and adjust your query to use it instead of the other two.
It ain't pretty and doesn't offer a lot of flexibility (e.g. if you want to compare another field, you need to add that as well), but it does get the job done.
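As a sketch, on MySQL 5.7+ a generated column can maintain that calculated field automatically (column and index names here are illustrative):

-- stored generated column holding the difference
ALTER TABLE `table`
    ADD COLUMN row3_minus_row4 INT AS (row3 - row4) STORED;

-- index it after the equality columns
CREATE INDEX idxDiff ON `table` (row1, row2, row5, row3_minus_row4);

-- then rewrite the predicate  row3 >= row4  as  row3_minus_row4 >= 0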
(This is an adaptation of http://mysql.rjweb.org/doc.php/random )
Let's actually fold the randomization into the query. This will eliminate gathering a bunch of ids, processing them, and then reaching back into the table. It will also avoid the need for an extra index.
- Find the min and max id values.
- Pick a random id between min and max.
- Scan forward, looking for the first row with col1...col5 matching the criteria.
Something like...
SELECT b.* -- should replace with actual list of columns
FROM
( SELECT id
FROM tbl
WHERE id >= ( SELECT MIN(id) +
( MAX(id) - MIN(id)
- 22 -- somewhat avoids running off end
) * RAND()
FROM tbl )
AND col1 = 0 ... -- your various criteria
ORDER BY id
LIMIT 1
) AS a
JOIN tbl AS b USING(id);
Pros/cons:
- Probably faster than anything else you can devise.
- If the RAND() hits too late in the table, it will return nothing. In this (rare) case, run the query again, but starting at 0 (see the sketch below).
- Big gaps in id will lead to a bias in which id is returned. (The link above discusses some kludges to handle that.)
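A sketch of that fallback, keeping the same placeholder criteria: drop the random starting point and scan from the lowest id instead:

SELECT b.*
FROM
    ( SELECT id
      FROM tbl
      WHERE col1 = 0 ...   -- your various criteria, as before
      ORDER BY id
      LIMIT 1
    ) AS a
JOIN tbl AS b USING(id);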
So my query on a table that contains 4 million records executes instantly if I don't use ORDER BY. However, I want to give my clients a way to sort results by the Name field and only show the last 100 of the filtered result. As soon as I add ORDER BY Name, it takes 100 seconds to execute.
My table structure is similar to this:
CREATE TABLE Test(
    ID INT PRIMARY KEY AUTO_INCREMENT,
    Name VARCHAR(100),
    StatusID INT,
    KEY (StatusID),        -- index on StatusID
    KEY (StatusID, Name),  -- index on StatusID, Name
    KEY (Name)             -- index on Name
);
My query simply does something like:
explain SELECT ID, StatusID, Name
FROM Test
WHERE StatusID = 113
ORDER BY Name DESC
LIMIT 0, 100
The EXPLAIN for the ORDER BY Name query shows it using StatusID_2, which is the composite index on StatusID, Name, yet examining far more than 100 rows (EXPLAIN screenshot omitted).
Now if I change ORDER BY Name DESC to ORDER BY ID, the EXPLAIN shows only about 100 rows examined (screenshot omitted).
How can I make it so that it also examines only 100 rows when using ORDER BY Name?
You can try one thing: restrict the name to the range of letters that could appear in the 100 expected result rows, like
SELECT *
FROM Test
-- some JOINs to filter data or get more columns from other tables
WHERE StatusID = 12 AND NAME REGEXP '^[A-H]'
ORDER BY Name DESC
LIMIT 0, 100
Moreover, the index on Name is very important (it is already in place): with it an index range scan is started, and query execution stops as soon as the required number of rows has been generated.
Since the ID is of no use here, the only thing we can try is to exclude letters that cannot appear in the expected result, and that is what the REGEXP does.
It's hard to tell without the joins and the EXPLAIN result, but apparently you're not making use of an index.
It might be because of the joins or because you have another key in the where clause. I'd recommend reading this, it covers all possible cases: http://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html
Increasing the sort_buffer_size and/or read_rnd_buffer_size might help...
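For example (session-scoped, with purely illustrative values; tune them for your workload):

SET SESSION sort_buffer_size = 4194304;       -- 4 MB
SET SESSION read_rnd_buffer_size = 2097152;   -- 2 MB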
You need a composite index based on the filtering WHERE criteria PLUS the ORDER BY... create an index on
( StatusID, Name )
This way, the WHERE jumps right to your StatusID = 12 records and ignores the rest of the 4 million... THEN uses Name as the secondary column to satisfy the ORDER BY.
Without seeing the other tables / join criteria and associated indexes, you might also want to try adding MySQL keyword
SELECT STRAIGHT_JOIN ... rest of query
That forces the query to run in the order you have written it, but I'm unsure of the impact without seeing the other joins, as noted previously.
ADDITION (per feedback)
I would remove the individual index on StatusID only, so the engine doesn't have to guess which one to use. The composite index can serve a StatusID-only query regardless of the Name, so you don't need both (see the ALTER sketch after the query below).
Also, remove the Name-only index UNLESS you will ever be querying PRIMARILY on the name as a WHERE qualifier without the StatusID qualifier. Also, how many total records are even possible for the example StatusIDs you are querying out of the 4 million? You MIGHT want to pull the full set for the StatusID as a sub-query, get a few thousand rows, and have THAT ordered by name, which would be quick... something like:
SELECT *
FROM ( SELECT ID, StatusID, Name
       FROM Test
       WHERE StatusID = 113 ) PreQuery
ORDER BY Name DESC
LIMIT 0, 100
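And the index cleanup suggested above might look like this, assuming MySQL's auto-generated index names (the question's EXPLAIN already shows StatusID_2 for the composite):

ALTER TABLE Test
    DROP INDEX StatusID,   -- single-column index, redundant with the composite
    DROP INDEX Name;       -- drop unless you query primarily by Name alone
-- keep StatusID_2 (StatusID, Name): it also serves StatusID-only lookups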
If I'm trying to increase the performance of a query that uses 4 different columns from a specific table, should I create 4 different indexes (one with each column individually) or should I create 1 index with all columns included?
One index with all 4 columns is, in my experience, the fastest. If you use a WHERE clause, try to put the columns in an order that makes the index useful for it.
An index with all four columns; the columns used in the WHERE should go first, and those you compare for equality should go first of all.
Sometimes, giving priority to integer columns gives better results; YMMV.
So for example,
SELECT title, count(*) FROM table WHERE class = 'post' AND topic_id = 17
AND date > ##BeginDate and date < ##EndDate;
would have an index on: topic_id, class, date, and title, in this order.
The "title" in the index is only used so that the DB may find the value of "title" for those records matching the query, without the extra access to the data table.
The more balanced the distribution of the records on the first fields, the better results you will have. (In this example, say 10% of the rows have topic_id = 17; you would discard the other 90% without ever having to run a string comparison with 'post' -- not that string comparisons are particularly costly.) Depending on the data, you might find it better to index date first and class later, or even to use date as a MySQL PARTITION key.
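A sketch of that index (the table name is a placeholder, as in the example):

CREATE INDEX idx_topic_class_date_title
    ON `table` (topic_id, class, date, title);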
A single index is usually more effective than an index merge, so if you have a condition like f1 = 1 AND f2 = 2 AND f3 = 3 AND f4 = 4, a single index is the right decision.
To achieve the best performance, list the index fields in descending order of cardinality (count of distinct values); this will help reduce the number of rows analyzed.
An index on fewer than 4 fields can be more effective, as it requires less memory.
http://www.mysqlperformanceblog.com/2008/08/22/multiple-column-index-vs-multiple-indexes/
I am trying to include a limitation in a MySQL SELECT query.
My database is structured in such a way that if a record is found in column one, then at most 5000 records with the same name can be found after that one.
Example:
mark
..mark repeated 5000 times
john
anna
..other millions of names
So in this table it would be more efficient to find the first 'mark' and continue to search at most 5000 rows down from that one.
Is it possible to do something like this?
Just make a btree index on the name column:
CREATE INDEX name ON your_table(name) USING BTREE
and MySQL will silently do exactly what you want each time it looks for a name.
Try with:
SELECT name
FROM table
ORDER BY (name = 'mark') DESC
LIMIT 5000
Basically, you sort the 'mark' rows first, then the rest follow, and the LIMIT caps the result.
It's actually quite difficult to understand your desired output, but I think this might be heading in the right direction:
(SELECT name
FROM table
WHERE name = 'mark'
LIMIT 5000)
UNION
(SELECT name
FROM table
WHERE name != 'mark'
ORDER BY name)
This will first get up to 5000 records with the name 'mark', then get the remainder using UNION; you can add a LIMIT to the second query if required.
For performance you should ensure that the columns used by ORDER BY and WHERE are indexed accordingly
If you make sure that the column is properly indexed, MySQL will take care of the optimisation for you.
Edit:
Thinking about it, I figured that this answer is only useful if I specify how to do that. user nobody beat me to the punch: CREATE INDEX name ON your_table(name) USING BTREE
This is exactly what database indexes are designed to do; this is what they are for. MySQL will use the index itself to optimise the search.
I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from http://forums.mysql.com/read.php?24,163940,262235#msg-262235 but it's not exactly random, and it doesn't play well with a LIMIT clause: you don't always get the number of records that you want.
So I thought, since the PK is auto_increment, I could just generate a list of random ids and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of data with records having a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which ids have that specific status, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
- Build a secondary table that contains the same pkeys and statuses as your data table.
- Add and populate a status group column, which will be a kind of sub-pkey that you auto-number yourself (1-based auto-increment relative to a single status):

Pkey  Status  StatusPkey
1     A       1
2     A       2
3     B       1
4     B       2
5     C       1
...   C       ...
n     C       m     (where m = # of C statuses)
When you don't need to filter you can generate rand #s on the pkey as you mentioned above. When you do need to filter then generate rands against the StatusPkeys of the particular status you're interested in.
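For example, a sketch assuming the secondary table is called status_index and the data table data_table (both hypothetical names); the random value is computed once in a separate statement so RAND() isn't re-evaluated per row:

-- pick a random 1..m StatusPkey for status 'C'
SET @pick = (SELECT FLOOR(1 + RAND() * MAX(status_pkey))
             FROM status_index
             WHERE status = 'C');

SELECT d.*
FROM status_index si
JOIN data_table d ON d.pkey = si.pkey
WHERE si.status = 'C'
  AND si.status_pkey = @pick;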
There are several ways to build this table. You could have a procedure that you run on an interval, or you could do it live. The latter would be a performance hit though, since calculating the StatusPkey could get expensive.
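A sketch of a periodic rebuild using user variables, which works on MySQL 5.1 (no window functions); all table and column names are hypothetical:

CREATE TABLE status_index (
    pkey        INT PRIMARY KEY,
    status      CHAR(1),
    status_pkey INT,
    KEY (status, status_pkey)
);

INSERT INTO status_index (pkey, status, status_pkey)
SELECT pkey, status, status_pkey
FROM ( SELECT d.pkey, d.status,
              -- restart the counter at 1 whenever the status changes
              @n := IF(@prev = d.status, @n + 1, 1) AS status_pkey,
              @prev := d.status AS prev_status
       FROM data_table d
       CROSS JOIN (SELECT @n := 0, @prev := NULL) vars
       ORDER BY d.status, d.pkey
     ) numbered;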
Check out this article by Jan Kneschke... It does a great job at explaining the pros and cons of different approaches to this problem...
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer between 0 and the row count minus one, so it is always a valid offset. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
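For example, with a hypothetical status filter standing in for the conditions:

-- step 1: compute a random offset among the matching rows
SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE status = 'X')) AS rand_offset;

-- step 2: bind that integer as the offset (? = rand_offset from step 1)
SELECT * FROM MyTable WHERE status = 'X' LIMIT 1 OFFSET ?;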
Not every problem must be solved in a single SQL query.