Fast mysql query to randomly select N usernames - mysql

In my jsp application I have a search box that lets user to search for user names in the database. I send an ajax call on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow as I have more than 2000 records in my table.
Is there any better approach which takes less time and let me achieve the same..? I need random values.

How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
-- here, MySQL works on indexes only
SELECT userid
FROM tbl_mst_users
WHERE name LIKE 'queryStr%'
ORDER BY RAND() LIMIT 5
) AS sub USING(userid); -- join other columns only after picking the rows in the sub-query.
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by #user1461434 is quite interesting. It describes a solution with almost constant performance. Only drawback is that it returns only one random row at a time.

does table has indexing on name?
if not apply it
2.MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (don't recall which) value in the random number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed on MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random number column).
3.http://jan.kneschke.de/projects/mysql/order-by-rand/

Related

Will records order change between two identical query in mysql without order by

The problem is I need to do pagination.I want to use order by and limit.But my colleague told me mysql will return records in the same order,and since this job doesn't care in which order the records are shown,so we don't need order by.
So I want to ask if what he said is correct? Of course assuming that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even though you may think that nothing changes between queries (ie, no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query MySQL will try hard to return the requested data as quick as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query you'll often see that rows come back in the same order than they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. Query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations but paging will be consistent. This is of course pretty suboptimal, it's hard to code and requieres constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is just base your paginator in the assumption that SQL data sets are ordered sets because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the same because it isn't relevant to the question... well, dammit, it was CakePHP/2). It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown in two different pages because the framework was naively sorting by a single non-unique column and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10 because both sortings complied with the requested condition.

Is this strategy for fast substring search in MySQL fast enough?

I have a USER table with millions of rows. I am implementing a search function that allows someone to look for a user by typing in a username. This autocomplete feature needs to be blazingly fast. Given that, in MySQL, column indexes speed up queries using LIKE {string}%, is the following approach performant enough to return within 200ms? (Note: Memory overhead is not an issue here, username are maximum 30 characters).
Create a USERSEARCH table that has a foreign key to the user table and an indexed ngram username column:
USERSEARCH
user_id username_ngram
-------------------------
1 crazyguy23
1 razyguy23
1 azyguy23
1 zyguy23
...
The query would then be:
SELECT user_id FROM myapp.usersearch WHERE username_ngram LIKE {string}%
LIMIT 10
I am aware that third party solutions exist, but I would like to stay away from them at the moment for other reasons. Is this approach viable in terms of speed? Am I overestimating the power of indexes if the db would need to check all O(30n) rows where n is the number of users?
Probably not. The union distinct is going to process each subquery to completion.
If you just want arbitrary rows, phrase this as:
(SELECT user_id
FROM myapp.usersearch
WHERE username_1 LIKE {string}%
LIMIT 10
) UNION DISTINCT
(SELECT user_id
FROM myapp.usersearch
WHERE username_2 LIKE {string}%
LIMIT 10
)
LIMIT 10;
This will at least save you lots of time for common prefixes -- say 'S'.
That said, this just returns an arbitrary list of 10 user_ids when there might be many more.
I don't know if the speed will be fast enough for your application. You have to make that judgement by testing on an appropriate set of data.
Assuming SSDs, that should be blazing fast, yes.
Here are some further optimizations:
I would add a DISTINCT to your query, since there is no point in returning the same user_id multiple times. Especially when searching for a very common prefix, such as a single letter.
Also consider searching only for at least 3 letters of input. Less tends to be meaningless (since hopefully your usernames are at least 3 characters long) and is a needless hit on your database.
If you're not adding any more columns (I hope you're not, since this table is meant for blazing fast search!), we can do better. Swap the columns. Make the primary key (username_ngram, user_id). This way, you're searching directly on the primary key. (Note the added benefit of the alphabet ordering of the results! Well... alphabetic on the matching suffixes, that is, not the full usernames.)
Make sure you have an index on user_id, to be able to replace everything for a user if you ever need to change a username. (To do so, just delete all rows for that user_id and insert brand new ones.)
Perhaps we can do even better. Since this is just for fast searching, you could use an isolation level of READ_UNCOMMITTED. That avoids placing any read locks, if I'm not mistaken, and should be even faster. It can read uncommitted data, but so what... Afterwards you'll just query any resulting user_ids in another table and perhaps not find them, if that user was still being created. You haven't lost anything. :)
I think you nedd to use mysql full text index to improve performance.
You need to change your syntax to use your full text index.
Create full text index:
CREATE FULLTEXT INDEX ix_usersearch_username_ngram ON usersearch(username_ngram);
The official mysql documentation how to use full text index: https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html

Pagination and how to return all possible options together with database search results?

I'm working on a database containing over 5 million rows.
Question 1.
At the moment I'm doing the following:
SELECT COUNT(*) FROM cars
Count total rows to be returned. The above example is very basic. Queries do get more complex with WHERE clause.
I'm showing 50 rows per page. Using PHP I count total pages and offset based on current page retrieved from PHP $_GET. This gets passed to the following query:
SELECT ID FROM cars ORDER BY ID DESC LIMIT $offset, 50
I fetch all IDs of rows to be displayed in current page put them in a single string.
$ID_list = implode( ',', array_column( $mysqli_fetch, 'ID' ) );
This then gets passed to final query.
SELECT ID, make, model, year, price FROM cars WHERE ID IN ($ID_list)
Performance wise I find that passing IDs to third query is up to 8 times faster than just selecting all required columns in second query.
What is the most efficient way to paginate results while displaying total rows count and page numbers. While OFFSET, LIMIT pagination is not efficient, using seek method is not possible to display page numbers. Is there an alternative method? Maybe I should look into technologies other than MySQLi?
Question 2.
What is the best approach in displaying all possible search results of returned data?
https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=B4%206TB&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&page=1
The search in the website above starts with no filters applied. Now I can click on for example, Make and it shows a number of possible results next to car brand name. Same goes for every other option. How is this achieved?
Question 1's issues and solution is discussed in http://mysql.rjweb.org/doc.php/pagination
That strongly recommends "remember where you left off" instead of OFFSET, providing a significant performance improvement. It gets rid of $ID_list and lets you do the two SELECTs as one (which is another performance benefit). (Your 8x improvement was due the combination of selecting multiple columns and skipping over rows (OFFSET).)
Question 2 is more difficult since you want to do multiple counts. Try usingGROUP BY and COUNT(*) to get all the counts in a single query. The risk is that this might involve so much data (eg, all 5M rows) that it takes "too long". In the few cases where a "covering" index is available, it might not be "too long".
You could do big group-bys every night -- counts by make and no filtering, counts by model-year and no filtering, etc. Store those in a table for quickly fetching. Once you add filtering, the complexity makes this impractical. Note: doing such a nightly tally implies that you analyze the user's request in order to tailor the SELECT.
Even the count-how-many-row-we-are-about-to-page-through (of Question 1) may be too costly.
See this for how to segregate the "common" attributes from the "rare" ones: http://mysql.rjweb.org/doc.php/eav . That leads to having several composite queries of 2-3 columns in order to handle most of the SELECT from people with random filtering criteria.
Keep the table size down by using minimal datatypes. Model_year could use a 2-byte YEAR datatype. An auto_inc for 5M cars could use a 3-byte MEDIUMINT UNSIGNED (16M limit).
Normalization (replacing a long string with a short id) saves space, but is likely to cost too much when the queries filter on multiple criteria. Eg: make = 'Ford' AND model = 'F150'.
AND is relatively easy to optimize in a WHERE clause; IN is worse and OR is even worse. For some of the IN and OR cases, you may need to resort to UNION to rid of such. Example:
( SELECT ... WHERE make = 'BMW' )
UNION ALL
( SELECT ... WHERE make = 'Audi' )
There will be a number of other cases where you really need to "construct" the query in your app code, not simply hope that MySQL can do something optimal.
The above UNION does not allow for pagination; see my links on how to deal with such.

Whether or not SQL query (SELECT) continues or stops reading data from table when find the value

Greeting,
My question; Whether or no sql query (SELECT) continues or stops reading data (records) from table when find the value that I was looking for?
referance: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
link for referance: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
As to whether or not the scanning stops: In general, the database engine has to examine all data in the table (hopefully aided by indices) and won't stop after having found just one hit. If you want just the first hit, use a limit 1 clause to make sure that your result set has only one outcome. But then again, if you have a sort by clause, the database engine cannot stop after the first hit, there might be next ones that should get priority given the sorting.
Summarizing, how the db engine does its scan depends on how smart it is, what indices are available etc.. If your select queries take too long then consider re-organizing your indices, writing your select statements differently, or rebuilding the table.
The RDBMS reading data from disk is something you cannot know, you should not care and you must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, a block can contain records that are not needed by the query at hand. If all the columns needed by the query is available in an index, the RDBMS won't even read the data file, it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors could lead to very different storage access patterns while running the same query several times on a couple of minutes apart.
Yes it scans the entire file. Unless you put something like
select * from user where id=100 limit 1
This of course will still search entire rows if id 100 is the last record.
If id is a primary key it will automatically be indexed and searching would be optimized
I'm sorry... I thought the table.
I will change question and I will explain it in the following image;
I understand that in CASE 1 all columns must be read with each iteration.
My question is: If it's the same in the CASE 2 or columns that are not selected in the query are excluded from reading in each iteration.
Also, are the both queries are the some in performance perspective?
Clarify:
CASE: 1 In first CASE select print all data
CASE: 2 In second CASE select print columns first_name and last_name
Whether in CASE 2 mysql server (SQL query) reads only columns first_name, last_name or read the entire table to get that data(rows)=(first_name, last_name)?
An interest of me how the server reads table row in CASE 1 and CASE 2?

MySQL : Does a query searches inside the whole table?

1. So if I search for the word ball inside the toys table where I have 5.000.000 entries does it search for all the 5 millions?
I think the answer is yes because how should it know else, but let me know please.
2. If yes: If I need more informations from that table isn't more logic to query just once and work with the results?
An example
I have this table structure for example:
id | toy_name | state
Now I should query like this
mysql_query("SELECT * FROM toys WHERE STATE = 1");
But isn't more logical to query for all the table
mysql_query("SELECT * FROM toys"); and then do this if($query['state'] == 1)?
3. And something else, if I put an ORDER BY id LIMIT 5 in the mysql_query will it search for the 5 million entries or just the last 5?
Thanks for the answers.
Yes, unless you have a LIMIT clause it will look through all the rows. It will do a table scan unless it can use an index.
You should use a query with a WHERE clause here, not filter the results in PHP. Your RDBMS is designed to be able to do this kind of thing efficiently. Only when you need to do complex processing of the data is it more appropriate to load a resultset into PHP and do it there.
With the LIMIT 5, the RDBMS will look through the table until it has found you your five rows, and then it will stop looking. So, all I can say for sure is, it will look at between 5 and 5 million rows!
Read this about indexes :-)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
It makes it uber-fast :-)
Full table scan is here only if there are no matching indexes and indeed very slow operation.
Sorting is also accelerated by indexes.
And for the #2 - this is slow because transfer rate from MySQL -> PHP is slow, and MySQL is MUCH faster at doing filtering.
For your #1 question: Depends on how you're searching for 'ball'. If there's no index on the column(s) where you're searching, then the entire table has to be read. If there is an index, then...
WHERE field LIKE 'ball%' will use an index
WHERE field LIKE '%ball%' will NOT use an index
For your #2, think of it this way: Doing SELECT * FROM table and then perusing the results in your application is exactly the same as going to the local super walmart, loading the store's complete inventory into your car, driving it home, picking through every box/package, and throwing out everything except the pack of gum from the impulse buy rack by the front till that you'd wanted in the first place. The whole point of a database is to make it easy to search for data and filter by any kind of clause you could think of. By slurping everything across to your application and doing the filtering there, you've reduced that shiny database to a very expensive disk interface, and would probably be better off storing things in flat files. That's why there's WHERE clauses. "SELECT something FROM store WHERE type=pack_of_gum" gets you just the gum, and doesn't force you to truck home a few thousand bottles of shampoo and bags of kitty litter.
For your #3, yes. If you have an ORDER BY clause in a LIMIT query, the result set has to be sorted before the database can figure out what those 5 records should be. While it's not quite as bad as actually transferring the entire record set to your app and only picking out the first five records, it still involves a bit more work than just retrieving the first 5 records that match your WHERE clause.