Select top N rows efficiently - MySQL

So I have a table, possibly millions of rows long,
user | points
---------------
user1 | 10
user2 | 12
user3 | 7
...
and want to SELECT * FROM mytable ORDER BY points LIMIT 100, 1000
Now that works, but it is horribly slow on huge tables, since MySQL refuses to use any kind of index and performs a full table scan instead. How can I make this more efficient?
My first (obvious) idea was to use an index on points DESC, but then I found out that MySQL does not support those at all.
Next, I tried to reverse the sign on points, essentially having an ascending index on -points. This didn't help either, since MySQL doesn't use the index for sorting.
Lastly, I tried FORCE INDEX. This yielded barely any performance improvement, since MySQL still fetches the entire table, it just no longer sorts it (EXPLAIN shows using filesort: false).
I am sure this must be a solved problem, but I did not find any helpful information online. Any hints would be greatly appreciated.

Some ways to get better performance from a query.
Never, never use SELECT *. It's a rookie mistake: it basically tells the query planner it needs to give you everything. Always enumerate the columns you want in the result set. This is the query you want (assuming you haven't oversimplified your question):
SELECT user, points
FROM mytable
ORDER BY points
LIMIT 100, 1000
Use a compound index. In the case of your query, a compound index on (points, user) will allow the use of a partial index scan to satisfy your query. That should be faster than a full table sort. MySQL can scan indexes backward or forward, so you don't need to worry about descending order.
To add the correct index use a command like this.
ALTER TABLE mytable ADD INDEX points_user (points, user);
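To verify the index is actually being used, check the plan with EXPLAIN (a quick sketch; the table and column names follow the question):
EXPLAIN SELECT user, points
FROM mytable
ORDER BY points
LIMIT 100, 1000;
-- With the (points, user) index in place, the Extra column should show
-- "Using index" and no "Using filesort".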
Edit. The suggestion against using SELECT * here is based on (1) my unconfirmed suspicion that the table in question is oversimplified and has other columns in real life, and (2) the inconvenient reality that sometimes the index has to match the query precisely to get best performance results.
I stand by my opinion, based on experience, that using SELECT * in queries with performance sensitivity is not good engineering practice (unless you like the query so much you want to come back to it again and again).

Related

Will record order change between two identical queries in MySQL without ORDER BY?

The problem is I need to do pagination. I want to use ORDER BY and LIMIT. But my colleague told me MySQL will return records in the same order, and since this job doesn't care in which order the records are shown, we don't need ORDER BY.
So I want to ask: is what he said correct? Of course, assume that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index in MySQL, so reading it in order is usually the most efficient way to retrieve the rows.
However, "probably" may not be good enough for you, and if your actual query is any more complex than this one, "probably" no longer applies. Even though you may think that nothing changes between queries (i.e., no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
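For primary-key pagination specifically, a keyset-style query lets MySQL seek straight into the clustered index instead of scanning and discarding offset rows. A minimal sketch; the table, the columns, and the :last_seen_id parameter are made up for illustration:
SELECT id, name
FROM my_table
WHERE id > :last_seen_id   -- highest id from the previous page
ORDER BY id
LIMIT 100;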
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query, MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see that rows come back in the same order as they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted, and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations, but paging will be consistent. This is of course pretty suboptimal: it's hard to code and requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP/2). It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown in two different pages because the framework was naively sorting by a single non-unique column, and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10 because both sortings complied with the requested condition.
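The standard fix for that anecdote is to append a unique column as a tie-breaker, so the sort order is total. A minimal sketch, assuming a hypothetical items table with a unique id:
-- Pages can no longer overlap, because (type, id) is unambiguous:
SELECT id, type FROM items ORDER BY type, id LIMIT 10;
SELECT id, type FROM items ORDER BY type, id LIMIT 10, 10;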

Is this strategy for fast substring search in MySQL fast enough?

I have a USER table with millions of rows. I am implementing a search function that allows someone to look for a user by typing in a username. This autocomplete feature needs to be blazingly fast. Given that, in MySQL, column indexes speed up queries using LIKE '{string}%', is the following approach performant enough to return within 200 ms? (Note: memory overhead is not an issue here; usernames are at most 30 characters.)
Create a USERSEARCH table that has a foreign key to the user table and an indexed ngram username column:
USERSEARCH
user_id username_ngram
-------------------------
1 crazyguy23
1 razyguy23
1 azyguy23
1 zyguy23
...
The query would then be:
SELECT user_id FROM myapp.usersearch
WHERE username_ngram LIKE '{string}%'
LIMIT 10
I am aware that third party solutions exist, but I would like to stay away from them at the moment for other reasons. Is this approach viable in terms of speed? Am I overestimating the power of indexes if the db would need to check all O(30n) rows where n is the number of users?
Probably not. The UNION DISTINCT is going to process each subquery to completion.
If you just want arbitrary rows, phrase this as:
(SELECT user_id
FROM myapp.usersearch
WHERE username_1 LIKE '{string}%'
LIMIT 10
) UNION DISTINCT
(SELECT user_id
FROM myapp.usersearch
WHERE username_2 LIKE '{string}%'
LIMIT 10
)
LIMIT 10;
This will at least save you lots of time for common prefixes -- say 'S'.
That said, this just returns an arbitrary list of 10 user_ids when there might be many more.
I don't know if the speed will be fast enough for your application. You have to make that judgement by testing on an appropriate set of data.
Assuming SSDs, that should be blazing fast, yes.
Here are some further optimizations:
I would add a DISTINCT to your query, since there is no point in returning the same user_id multiple times. Especially when searching for a very common prefix, such as a single letter.
Also consider searching only for at least 3 letters of input. Less tends to be meaningless (since hopefully your usernames are at least 3 characters long) and is a needless hit on your database.
If you're not adding any more columns (I hope you're not, since this table is meant for blazing fast search!), we can do better. Swap the columns: make the primary key (username_ngram, user_id). This way, you're searching directly on the primary key. (Note the added benefit of the alphabetical ordering of the results! Well... alphabetical on the matching suffixes, that is, not the full usernames.)
Make sure you have an index on user_id, to be able to replace everything for a user if you ever need to change a username. (To do so, just delete all rows for that user_id and insert brand new ones.)
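Putting those two points together, a minimal sketch of what the table could look like (the exact column types and index name are assumptions, not from the question):
CREATE TABLE usersearch (
    username_ngram VARCHAR(30) NOT NULL,
    user_id        INT UNSIGNED NOT NULL,
    -- Prefix searches run directly on the leftmost primary-key column:
    PRIMARY KEY (username_ngram, user_id),
    -- Secondary index so all of a user's rows can be replaced on a username change:
    KEY idx_user_id (user_id)
) ENGINE=InnoDB;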
Perhaps we can do even better. Since this is just for fast searching, you could use the READ UNCOMMITTED isolation level. That avoids placing any read locks, if I'm not mistaken, and should be even faster. It can read uncommitted data, but so what... Afterwards you'll just query any resulting user_ids in another table and perhaps not find them, if that user was still being created. You haven't lost anything. :)
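If you want to try that, the level can be set per session; a sketch using the query from the question with a stand-in prefix:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT user_id FROM usersearch
WHERE username_ngram LIKE 'craz%'   -- 'craz' stands in for the typed prefix
LIMIT 10;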
I think you need to use a MySQL full-text index to improve performance.
You need to change your syntax to use the full-text index.
Create the full-text index:
CREATE FULLTEXT INDEX ix_usersearch_username_ngram ON usersearch(username_ngram);
The official MySQL documentation on how to use full-text search: https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html
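For completeness, the changed syntax would look something like this. A sketch only: note that full-text search matches whole words by default, so a prefix search needs boolean mode with a trailing asterisk ('craz' again stands in for the typed prefix):
SELECT user_id
FROM usersearch
WHERE MATCH (username_ngram) AGAINST ('craz*' IN BOOLEAN MODE);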

Should I avoid ORDER BY in queries for large tables?

In our application, we have a page that displays to the user a set of data (a part of it, actually). It also allows the user to order it by a custom field. So in the end it all comes down to a query like this:
SELECT name, info, description FROM mytable
WHERE active = 1 -- Some filtering by indexed column
ORDER BY name LIMIT 0,50; -- Just a part of it
And this worked just fine, as long as the size of the table was relatively small (it was used only locally in our department). But now we have to scale this application, so let's assume the table has about a million records (we expect that to happen soon). What will happen with the ordering? Do I understand correctly that, in order to run this query, MySQL will have to sort a million records each time and return just a part of the result? This seems like a very resource-heavy operation.
My idea is simply to turn off that feature and not let users select a custom ordering (maybe just filtering), so that the order would be the natural one (by id in descending order; I believe indexing can handle that).
Or is there a way to make this query work much faster with ordering?
UPDATE:
Here is what I read from the official MySQL developer page.
In some cases, MySQL cannot use indexes to resolve the ORDER BY,
although it still uses indexes to find the rows that match the WHERE
clause. These cases include the following:
....
The key used to fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
So yes, it does seem like MySQL will have a problem with such a query. So, what do I do: not use an ORDER BY at all?
The 'problem' here seems to be that you have two requirements (in the example):
active = 1
order by name LIMIT 0, 50
The former you can easily solve by adding an index on the active field.
The latter you can improve by adding an index on name.
Since you do both in the same query, you'll need to combine these into one index that lets MySQL resolve the active value quickly and then fetch the first 50 names from there.
As such, I'd guess that something like this will help you out:
CREATE INDEX idx_test ON myTable (active, name)
(in theory, as always, try before you buy!)
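One way to 'try before you buy' is to compare the plan before and after adding the index, using the query from the question:
EXPLAIN SELECT name, info, description FROM mytable
WHERE active = 1
ORDER BY name LIMIT 0,50;
-- Before the index, expect "Using where; Using filesort" in the Extra column;
-- with (active, name) in place, the filesort should disappear.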
Keep in mind though that there is no such thing as a free lunch; you'll need to consider that adding an index also comes with downsides:
the index will make your INSERT/UPDATE/DELETE statements (slightly) slower; usually the effect is negligible, but only testing will tell
the index will require extra space in the database; think of it as an additional (hidden) special table sitting next to your actual data. The index will only hold the fields required + the PK of the originating table, which usually is a lot less data than the entire table, but for 'millions of rows' it can add up.
if your query selects one or more fields that are not part of the index, then the system will have to fetch the matching PK fields from the index first and then go look for the other fields in the actual table by means of the PK. This probably is still (a lot) faster than not having the index, but keep this in mind when doing something like SELECT * FROM ...: do you really need all the fields?
In the example you use active and name, but from the text I get that these might be 'dynamic', in which case you'd have to foresee all kinds of combinations. From a practical point of view this might not be feasible, as each index comes with the downsides above, and each additional index adds them again (they're cumulative).
PS: I use PK for simplicity but in MSSQL it's actually the fields of the clustered index, which USUALLY is the same thing. I'm guessing MySQL works similarly.
EXPLAIN your query and check whether it goes for a filesort.
If ORDER BY doesn't get any index, or if the MySQL optimizer prefers to avoid the existing index(es) for sorting, it goes with a filesort.
Now, if you're getting a filesort, you should preferably either avoid ORDER BY or create appropriate index(es).
If the data is small enough, the sort happens in memory; otherwise it goes to disk.
So you may try changing the sort_buffer_size variable as well.
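A sketch of how you might inspect and change that variable for your own session (the value shown is just an example):
SHOW VARIABLES LIKE 'sort_buffer_size';
SET SESSION sort_buffer_size = 262144;  -- 256 KB, illustrative value only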
There are always tradeoffs. One way to improve the performance of an ORDER BY query is to raise the sort buffer size and then run the query:
set sort_buffer_size=100000;
-- your ORDER BY query here
If this size is increased too far, performance will start decreasing again.

Order of composite where clause (MySQL, Postgres)

I have a table t with columns a int, b int, c int, and a composite index i (b, c). I fetch some data with the following query:
select * from t where c = 1 and b = 2;
So the question is: will MySQL and Postgres use the index i? And, more generally: does the query composite where clause order affect the possibility of index use?
What you need to do is use the EXPLAIN function in both, to see what's going on. If it says it's using an index, then it is. One caveat is that in a small table with minimal data, it's very likely that PostgreSQL (and probably MySQL) will ignore the indexes in favor of a scan. To get a real result, insert quite a bit of dummy data (at least 20 rows, and I always do about 500) and be sure to analyze the table. Also, realize that if the search criteria will return a large percentage of the table, it will likely not use the index either (as a scan will be faster).
create the table
generate data (perhaps using generate_series)
run explain select * from t where c=1 and b=2
create the index: create index i on t(b,c)
analyze the table: analyze t
run explain select * from t where c=1 and b=2 again and compare with the first run (see the sketch below)
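A minimal end-to-end sketch of those steps in PostgreSQL (the generated values are invented for illustration):
CREATE TABLE t (a int, b int, c int);
-- Generate enough rows that the planner will consider the index:
INSERT INTO t (a, b, c)
SELECT i, i % 100, i % 7
FROM generate_series(1, 10000) AS i;
EXPLAIN SELECT * FROM t WHERE c = 1 AND b = 2;  -- expect a sequential scan
CREATE INDEX i ON t (b, c);
ANALYZE t;
EXPLAIN SELECT * FROM t WHERE c = 1 AND b = 2;  -- expect an index scan now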
Hopefully this will help answer this, and other questions you might have in the future about when indexes will be used. To answer your original question, though: yes, in general PostgreSQL will use the index, regardless of order, if the optimizer determines that to be the best way to get your results. Remember to analyze your table, though, so the optimizer has an idea of what information is in it, and analyze it again any time a ton of data is added to or deleted from it. Depending on your PG version and settings, some of this may be done automatically for you, but it won't hurt to analyze manually, especially when testing this kind of thing.
Edit: the index order may affect the order of the results of your query (especially if you don't use an ORDER BY and the optimizer uses the index): the returned rows may come back in the same order as the index.
No, the order doesn't matter.
The optimizer does a lot of smart things to perform a query in the most efficient way.

Best way to check for updated rows in MySQL

I am trying to see if there were any rows updated since the last time it was checked.
I'd like to know if there are any better alternatives to
"SELECT id FROM xxx WHERE changed > some_timestamp;"
However, as there are 200,000+ rows it can get heavy pretty fast... would a count be any better?
"SELECT count(*) FROM xxx WHERE changed > some_timestamp;"
I have thought of creating a unit test but I am not the best at this yet /:
Thanks for the help!
EDIT: Because in many cases there would not be any rows that changed, would it be better to always test with a MAX(xx) first, and only if it's greater than the old update timestamp, do the full query?
If you just want to know if any rows have changed, the following query is probably faster than either of yours:
SELECT id FROM xxx WHERE changed > some_timestamp LIMIT 1
Just for the sake of completeness: Make sure you have an index on changed.
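For example (the index name is made up):
CREATE INDEX idx_xxx_changed ON xxx (changed);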
Edit: A tiny performance improvement
Now that I think about it, you should probably do a SELECT changed instead of selecting the id, because then the query can be answered from the index alone, eliminating access to the table entirely. This query will tell you pretty quickly if any change was performed:
SELECT changed FROM xxx WHERE changed > some_timestamp LIMIT 1
It should be a tiny bit faster than my first query - not by a lot, though, since accessing a single table row is going to be very fast.
Should I select MAX(changed) instead?
Selecting MAX(changed), as suggested by Federico, should pretty much result in the same index access pattern. Finding the highest element in an index is a very cheap operation. Finding any element that is greater than some constant is potentially cheaper, so both should have approximately the same performance. In either case, both queries are extremely fast even on very large tables if - and only if - there is an index.
Should I first check if any rows were changed, and then retrieve the rows in a separate step?
No. If there is no row that has changed, SELECT id FROM xxx WHERE changed > some_timestamp will be as fast as any such check, making it pointless to perform the check separately. It only turns into a slower operation when there are results. Unless you add expensive operations (such as ORDER BY), the performance should be (almost) linear in the number of rows retrieved.
Make an index on changed and run:
SELECT MAX(changed) FROM xxx;
If the table is MyISAM, the query will be immediate.