Is using LIMIT and OFFSET in MySQL less expensive than returning the full set of records?

It might be a silly question but I am just curious about what goes on behind the curtains.
If I want to paginate database records, I can either use LIMIT and OFFSET or simply get all the records and extract the ones I want with more code.
I know the second option is absolutely silly, I just want to know if it is more expensive.
If I use LIMIT and OFFSET, will the database grab just what I ask for, or will it internally fetch all the records matching my query (even hundreds of thousands) and then internally use a starting index (OFFSET) and an ending index (OFFSET + LIMIT) to cut out the requested subset of records?
I don't even know if I used the right words to describe the doubt I have, I hope someone can shed some light.
Thanks!

Yes, fetching everything and paginating in your own code would be more expensive, for two reasons.
1) MySQL will optimize internally to read only the rows that it needs, rather than retrieving them all. Note that this optimization helps a lot less if you have an ORDER BY in your query, because then MySQL has to match and sort all of the rows in the data set (unless it can read them in order from an index), rather than stopping as soon as it finds the first X rows in your LIMIT.
2) When all the records are returned, they all need to be transmitted over the wire from the database to your application server. That can take time, especially for medium to large data sets.
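As a minimal sketch (the table and column names here are made up), the paginated form asks the server for one page at a time, so only a handful of rows ever cross the wire:

-- page 3 of a 20-rows-per-page listing: at most 60 rows are read, 20 returned
SELECT id, title
FROM articles
ORDER BY id          -- with an index on id, MySQL can stop after row 60
LIMIT 20 OFFSET 40;

-- versus fetching everything and slicing it in application code
SELECT id, title
FROM articles;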

The difference can be enormous. Not only is the network difference sometimes big (a few rows vs. hundreds to thousands), but the number of rows the database needs to examine can also differ a lot. For example, if you ask for 10 rows, the database can stop after finding 10 rows, rather than having to check every row.
Whenever possible, use LIMIT and OFFSET.

Related

Will record order change between two identical queries in MySQL without ORDER BY?

The problem is I need to do pagination. I want to use ORDER BY and LIMIT. But my colleague told me MySQL will return records in the same order, and since this job doesn't care in which order the records are shown, we don't need ORDER BY.
So I want to ask: is what he said correct? Of course, assuming that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is the clustered index in MySQL (InnoDB), so scanning it is usually the most efficient way to retrieve the rows.
However, "probably" may not be good enough for you, and if your actual query is any more complex than this one, "probably" no longer applies. Even though you may think that nothing changes between queries (i.e., no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
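For example, the query above only needs an explicit sort on the same key to make the paging order guaranteed rather than incidental (a sketch, reusing the assumed query from above):

select *
from TABLE
where ID >= :x:
order by ID    -- the rows already come back in primary key order, so this sort costs next to nothing
limit 100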
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query, MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see that rows come back in the same order as they were inserted. That makes sense because the obvious way to store the rows is to append them as they are inserted, and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indexes, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (which is typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations, but paging will be consistent. This is of course pretty suboptimal: it's hard to code and requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is just base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't name it because it isn't relevant to the question... well, dammit, it was CakePHP/2.) It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown on two different pages: the framework was naively sorting by a single non-unique column, and that row satisfied both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10, because both orderings complied with the requested condition.
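A common fix for that failure mode is to add a unique column as a tie-breaker in the sort, so rows with equal values in the user-chosen column still land on exactly one page. A sketch with assumed names:

-- 'type' is not unique, so break ties with the primary key
SELECT *
FROM items
ORDER BY type, id
LIMIT 10 OFFSET 10;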

Is selecting fewer columns speeding up my query?

I have seen several questions comparing SELECT * to selecting all columns explicitly, but what about selecting fewer columns vs. more?
In other words, is:
SELECT id,firstname,lastname,lastlogin,email,phone
More than negligibly faster than:
SELECT id,firstname,lastlogin
I realize there will be small differences from more data being transferred through the system and to the application, but that is a total data/load difference, not a cost of the query itself (larger data in the cells would have the same effect anyway, I believe). I'm only trying to optimize my query, as I will have to load ALL the data at some point anyway...
When my admin user logs in, I'm going to load the entire user database into a cache, but I can either query only critical data upfront to shave some execution time, or just get everything, if it works out roughly the same. I know more rows equal longer query execution, but what about more selected columns in my query?
Under most circumstances, the only difference is going to be slightly larger data for these fields and the additional time to fetch them.
There are two things to consider:
If the additional fields are very big, then this could make a big difference in performance.
If there is an index that covers the columns you actually want, then the query can be answered from the index alone, without touching the table rows. This can speed up the query in the database.
In general, though, the advice is to return the columns you want to the application. If there is complex processing, you should consider doing that in the database rather than the application.
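For instance (the index, table, and column names here are assumed), a covering index on just the columns the narrower query needs lets MySQL answer it from the index alone, which EXPLAIN reports as "Using index":

-- covering index for the narrow query
CREATE INDEX idx_users_login ON users (id, firstname, lastlogin);

-- can be satisfied entirely from the index ("Using index" in the Extra column)
EXPLAIN SELECT id, firstname, lastlogin FROM users;

-- has to read the full rows to get the extra columns
EXPLAIN SELECT id, firstname, lastname, lastlogin, email, phone FROM users;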

MySQL Large Datasets

I have large sets of data, over 40 GB, that I loaded into a MySQL table. I am trying to perform simple queries like SELECT * FROM tablename, but it takes a gazillion minutes to run and eventually times out. If I set a limit, the execution is fairly fast, e.g. SELECT * FROM tablename LIMIT 1000.
The table has over 200 million records.
Tried creating indexes on some columns and that failed too after 3 hours of execution.
Any tips on working with these types of datasets?
First thing you need to do is completely ignore all answers and comments advising some other, awesome, mumbo jumbo technology. It's absolute bullshit. Those things can't work in a different way because they're all constrained with the same problem - hardware.
Now, let's get back to MySQL. The problem is not LIMIT itself so much as a large OFFSET and sorting: a plain SELECT * FROM my_table LIMIT 1000 can stop as soon as it has produced 1000 rows, which is exactly why your limited query comes back quickly. But if you ask for LIMIT 1000 OFFSET 10000000, MySQL still has to read and throw away the first 10 million rows before it can return the 1000 you want, and if there is an ORDER BY on a column without a usable index it may have to sort the entire 200-million-row set first.
Yes, that takes time. Yes, it appears dumb. But MySQL has no idea where your page "starts" until it has walked past everything that comes before it, so deep pagination by OFFSET gets slower and slower.
To improve your search, you can use something like this (assuming you have a numeric primary key):
SELECT * FROM tablename WHERE id < 10000 LIMIT 1000;
In this case, instead of scanning all 200 million rows, MySQL will only work with the rows whose PK is below 10 000. Much easier, much quicker, also readable. The numbers can be tweaked at any point, and if you perform pagination of some sort in a scripting language, you can always pass along the last numeric id that was returned, so MySQL can start from that id onwards in its search.
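A sketch of that "remember the last id" pattern (the placeholder name is assumed), which keeps each page equally cheap no matter how deep into the table you are:

-- first page
SELECT * FROM tablename ORDER BY id LIMIT 1000;

-- next page: the application passes in the largest id it saw on the previous page
SELECT * FROM tablename WHERE id > :last_seen_id ORDER BY id LIMIT 1000;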
Also, you should be using the InnoDB engine, and tune it via innodb_buffer_pool_size, which is the magic sauce that makes MySQL fly.
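For example (the 8 GB figure is an arbitrary assumption; the usual advice is roughly 70-80% of RAM on a dedicated database server), on MySQL 5.7+ the buffer pool can even be resized without a restart:

-- check the current size, then grow it (online resize works on MySQL 5.7+)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SET GLOBAL innodb_buffer_pool_size = 8589934592;  -- 8 GB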
For large databases, one should consider alternative solutions such as Apache Spark. MySQL reads the data from disk, which is a slow operation, and a single server can only scan so fast; technologies based on MapReduce-style distributed processing spread that work across many machines. Take a look at this answer. It is true that with large databases, queries get very challenging.
Anyway, assuming you want to stick with MySQL: first of all, if you are using MyISAM, make sure to convert your database storage to InnoDB. This is especially important if you have lots of read/write operations.
It is also important to partition, which breaks the table into smaller, more manageable pieces and also improves index performance; see the sketch below.
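A minimal sketch of what that could look like here (the column name and range boundaries are assumptions, since the schema isn't shown), using range partitioning on an integer primary key so each partition and its indexes stay far smaller than the full 200-million-row table:

ALTER TABLE tablename
PARTITION BY RANGE (id) (
    PARTITION p0 VALUES LESS THAN (50000000),
    PARTITION p1 VALUES LESS THAN (100000000),
    PARTITION p2 VALUES LESS THAN (150000000),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);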
Do not be too generous with adding indexes. Define indexes wisely. If an index does not need to be UNIQUE, do not define it as one. If an index does not need to include multiple fields, do not include multiple fields.
Most importantly, start monitoring your MySQL instance. Use SHOW ENGINE INNODB STATUS to investigate the performance of your MySQL instance.

Which NoSQL to use for storing billions of integer-pair records?

Right now I have a table in MySQL with 3 columns.
DocId Int
Match_DocId Int
Percentage Match Int
I am storing a document id along with its near-duplicate document id and a percentage which indicates how closely the two documents match.
So if one document has 100 near duplicates, we have 100 rows for that particular document.
Right now, this table has more than 1 billion records for a total of 14 million documents.
I am expecting the total number of documents to go up to 30 million. That means the table which stores near-duplicate information will have more than 5 billion rows, maybe more than that. (Near-duplicate data grows exponentially compared to the total document set.)
Here are a few issues that I have:
Getting all these records from the MySQL table takes a lot of time.
Queries take a lot of time as well.
Here are a few queries that I run:
Check if a particular document has any near duplicates. (This is relatively fast, but still slow.)
For a given set of documents, check how many near duplicates there are in each percentage range (the ranges are 86-90, 91-95, 96-100).
This query takes a lot of time and most of the time it fails. I am doing a GROUP BY on the percentage column; a sketch of the kind of query involved is below.
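Something along these lines, with simplified table and column names, since the exact schema isn't shown:

SELECT
    CASE
        WHEN percentage BETWEEN 86 AND 90 THEN '86-90'
        WHEN percentage BETWEEN 91 AND 95 THEN '91-95'
        ELSE '96-100'
    END AS pct_range,
    COUNT(*) AS near_duplicates
FROM doc_matches
WHERE doc_id IN (1, 2, 3)        -- the given set of documents
  AND percentage >= 86
GROUP BY pct_range;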
Can this be managed with any available NoSQL solution?
I am skeptical about SQL query support in NoSQL solutions, as I need GROUP BY support while querying data.
MySQL
You can try sharding with your current MySQL solution, i.e. splitting your large database into several smaller, distinct databases. The catch is that this is only fast as long as you work with one shard at a time; if you plan to run queries across several shards, it would be painfully slow.
NoSQL
The Apache Hadoop stack is worth looking at. There are several systems on top of it that allow you to perform slightly different kinds of queries. A good point is that they all tend to interoperate well with each other.
Check if a particular document has any near duplicates. (This is relatively fast, but still slow.)
HBase can do this job for big tables.
For a given set of documents, check how many near duplicates there are in each percentage range (86-90, 91-95, 96-100).
This should be a good fit for MapReduce.
There are many other solutions; see this link for a list and brief descriptions of other NoSQL databases.
We have good experiences with Redis. It's fast and can be made as reliable as you want it to be. Other options could be CouchDB or Cassandra.

Does the number of columns affect MySQL speed?

I have a table. I only need to run one type of query: find a given unique value in column 1, then get, say, the first 3 columns out.
Now, how much would it affect speed if I added an extra few columns to the table basically for "data storage"? I know I should use a separate table, but let's assume I am constrained to having just one table, so the only way is to add some columns at the end.
So, if I add on some columns, say 10 at the end of 30 varchar each, will this slow down the query given in the first sentence? If so, by roughly what factor, compared to the table without the extra redundant yet present columns?
Yes, extra data can slow down queries because it means fewer rows can fit into a page, and this means more disk accesses to read a certain number of rows and fewer rows can be cached in memory.
The exact factor in slow down is hard to predict. It could be negligible, but if you are near the boundary between being able to cache the entire table in memory or not, a few extra columns could make a big difference to the execution speed. The difference in the time it takes to fetch a row from a cache in memory or from disk is several orders of magnitude.
If you add a covering index the extra columns should have less of an impact as the query can use the relatively narrow index without needing to refer to the wider main table.
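A sketch of that idea with assumed table and column names: the lookup query only ever touches the narrow index, so the wide "storage" columns at the end of the row don't need to be read for it.

-- narrow covering index: the unique lookup key plus the two other columns returned
CREATE INDEX idx_lookup ON my_table (col1, col2, col3);

-- answered from the index alone, regardless of the 10 extra VARCHAR(30) columns
SELECT col1, col2, col3
FROM my_table
WHERE col1 = 'some-unique-value';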
I don't understand the 'I know I should use a separate table' bit. What you've described is the reason you have a DB, to associate a key with some related data. Look at it another way, how else do you retrieve this information if you don't have the key?
To answer your question, the only way to know what the performance hit is going to be is empirical testing (though Mark's answer, posted just prior to mine, describes one - of VERY many - factors affecting speed).
That depends a bit on how much data you already have in the records. The difference would normally be somewhere between almost none at all and not so big.
The difference comes from how much more data has to be loaded from disk to get at the data you want. The extra columns will likely mean that fewer records fit in each page, but it's also possible that there happens to be enough free space left in each page for most of the extra data, so that few extra blocks are needed. It depends on how well the current data lines up within the pages.
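As a rough back-of-envelope illustration (the row sizes are made-up assumptions; InnoDB's default page size is 16 KB):

16 KB page / ~100-byte rows  ->  roughly 160 rows per page
16 KB page / ~400-byte rows  ->  roughly 40 rows per page, i.e. about 4x as many pages to read or cache for the same number of rows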