MySQL seek-then-scan optimization for Limit Offset - mysql

From the mk-archiver help, we can see there is an option to optimize "seek-then-scan". Any idea how they do this?
What I'm really looking for is this: if I have a table with a single PKey, and queries like
SELECT col1,col2 FROM tbl LIMIT 0,10;
SELECT col1,col2 FROM tbl LIMIT 10,10; ...
SELECT col1,col2 FROM tbl LIMIT m,n;
Is there any way to do this in an optimized fashion, given that m and n are very large values and each SELECT query is initiated in parallel from multiple machines? (I'll address host/network choking later.)
How do others tackle the situation if the table doesn't have a PKey?
*Using MySQL
The default ascending-index optimization causes mk-archiver to
optimize repeated SELECT queries so they seek into the index where the
previous query ended, then scan along it, rather than scanning from
the beginning of the table every time. This is enabled by default
because it is generally a good strategy for repeated accesses.

I believe they are playing directly with the index structures, not relying on SQL; that is the advantage of having access to the MySQL source code. It should be possible to have such an option using SQL, per connection, but with multiple users connecting through intermediate (web) servers it would be more complicated, if possible at all.
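For what it's worth, the seek-then-scan idea can be approximated in plain SQL with keyset pagination. A minimal sketch, assuming the PKey column is called id (an assumption; substitute your own):
-- each batch remembers the largest id it saw and seeks past it;
-- 12345 is a hypothetical value carried over from the previous batch
SELECT col1, col2 FROM tbl WHERE id > 12345 ORDER BY id LIMIT 10;
-- for parallel workers, hand each machine a disjoint id range instead
SELECT col1, col2 FROM tbl WHERE id BETWEEN 100001 AND 200000;
Since the WHERE clause seeks straight into the PRIMARY KEY index, the cost no longer grows with the offset the way LIMIT m,n does.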

Related

Does MySQL analyze / optimize queries before executing them?

If you'd write a query like so:
SELECT * FROM `posts` WHERE `views` > 200 OR `views` > 100
Would MySQL analyze that query and realize that it's actually equivalent to this?
SELECT * FROM `posts` WHERE `views` > 100
In other words, would MySQL optimize the query such that it skips any unnecessary WHERE checks?
I'm asking because I'm working on a piece of code that, for now, generates queries with redundant WHERE clauses. I'm wondering if I should optimize those queries before I send them to MySQL, or if that's unnecessary because MySQL would do it anyway.
Yes. MySQL does optimize queries before running them. In fact, what runs has no obvious relationship to the SQL statement itself -- it is a directed acyclic graph.
In the process, MySQL determines which indexes to use for the query, which join algorithms, sorts the lists of constants in IN lists, and much more.
The optimizer also does some simplifications of the query. I'm not sure if those simplifications extend to inequalities. However, there is little overhead in making the comparison twice.
EXPLAIN SELECT ... shows how the query was rewritten -- but it still has the OR.
The "Optimizer trace" says the same thing. However, when it gets into discussing the "cost", it gets smart and merges the two comparisons. (This is the case at least as far back as 5.6.)
In many cases, OR should be avoided like covid.

MySQL JOIN Keeps Timing Out

I am currently trying to run a JOIN between two tables in a local MySQL database and it's not working. Below is the query; I am even limiting the query to 10 rows just to run a test. After running this query for 15-20 minutes, it tells me "Error Code: 2013. Lost connection to MySQL server during query". My computer is not going to sleep, and I'm not doing anything to interrupt the connection.
SELECT rd_allid.CreateDate, rd_allid.SrceId, adobe.Date, adobe.Id
FROM rd_allid JOIN adobe
ON rd_allid.SrceId = adobe.Id
LIMIT 10
The rd_allid table has 17 million rows of data and the adobe table has 10 million. I know this is a lot, but I have a strong computer. My processor is an i7 6700 3.4GHz and I have 32GB of ram. I'm also running this on a solid state drive.
Any ideas why I cannot run this query?
"Why I cannot run this query?"
There's not enough information to determine definitively what is happening. We can only make guesses and speculations. And offer some suggestions.
I suspect MySQL is attempting to materialize the entire resultset before the LIMIT 10 clause is applied. For this query, there's no optimization for the LIMIT clause.
And we might guess that there is not a suitable index for the JOIN operation, which is causing MySQL to perform a nested loops join.
We also suspect that MySQL is encountering some resource limitation which is causing the session to be terminated. Possibly filling up all space in /tmp (that usually throws an error, something like "invalid/corrupted myisam table '#tmpNNN'", or something of that ilk), or it could be some other resource constraint. Without doing an analysis, we're just guessing.
It's possible MySQL wrote something to the error log (hostname.err). I'd check there.
But whatever condition MySQL is running into (the answer to the question "Why I cannot run this query"), I'm seriously questioning the purpose of the query. Why is that query being run? Why is returning that particular resultset important?
There are several possible queries we could execute. Some of those will run a long time, and some will be much more performant.
One of the best ways to investigate query performance is to use MySQL EXPLAIN. That will show us the query execution plan, revealing the operations that MySQL will perform, in what order, and which indexes will be used.
We can make some suggestions as to some possible indexes to add, based on the query shown, e.g. on adobe (Id, Date).
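For example, something along these lines (the index name is made up; the columns come from the query shown):
EXPLAIN
SELECT rd_allid.CreateDate, rd_allid.SrceId, adobe.Date, adobe.Id
FROM rd_allid JOIN adobe ON rd_allid.SrceId = adobe.Id
LIMIT 10;
-- a covering index for the join predicate and the selected columns
ALTER TABLE adobe ADD INDEX idx_adobe_id_date (Id, Date);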
And we can make some suggestions about modifications to the query (e.g. adding a WHERE clause, using a LEFT JOIN, incorporating inline views, etc.), but we don't have enough of a specification to recommend a suitable alternative.
You can try something like:
SELECT rd_allidT.CreateDate, rd_allidT.SrceId, adobeT.Date, adobeT.Id
FROM
(SELECT CreateDate, SrceId FROM rd_allid ORDER BY SrceId LIMIT 1000) rd_allidT
INNER JOIN
(SELECT Id, Date FROM adobe ORDER BY Id LIMIT 1000) adobeT ON adobeT.Id = rd_allidT.SrceId;
This may help you get faster response times.
Also, if you are not interested in the whole relation, you can put WHERE clauses inside the derived tables; they are applied before the INNER JOIN, making the query faster still.
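For instance (the date filter here is purely hypothetical; substitute whatever restriction fits your data):
SELECT rd_allidT.CreateDate, rd_allidT.SrceId, adobeT.Date, adobeT.Id
FROM
(SELECT CreateDate, SrceId FROM rd_allid
 WHERE CreateDate >= '2017-01-01') rd_allidT   -- hypothetical filter, applied before the join
INNER JOIN
(SELECT Id, Date FROM adobe) adobeT ON adobeT.Id = rd_allidT.SrceId
LIMIT 10;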

speed select vs join

The two queries below do the same thing: basically, show all the ids of table1 which are present in table2. The thing which puzzles me is that the simple SELECT is way, way faster than the JOIN. I would have expected the JOIN to be a bit slower, but not by that much: 5 seconds vs. 0.2.
Can anyone elaborate on this ?
SELECT table1.id FROM
table1,table2 WHERE
table1.id=table2.id
Duration/Fetch 0.295/0.028 (MySQL Workbench 5.2.47)
SELECT table1.id
FROM table1
INNER JOIN table2
ON table1.id=table2.id
Duration/Fetch 5.035/0.027 (MySQL Workbench 5.2.47)
Q: Can anyone elaborate on this?
A: Before we go the "a bug in MySQL" route that @a_horse_with_no_name seems impatient to race down, we'd really need to ensure that this is repeatable behavior and isn't just a quirk.
And to do that, we'd really need to see the elapsed time result from more than one run of the query.
If the query cache is enabled on the server, we want to run the queries with the SQL_NO_CACHE hint added (SELECT SQL_NO_CACHE table1.id ...) so we know we aren't retrieving cached results.
I'd repeat the execution of each query at least three times, and throw out the result from the first run, and average the other runs. (The purpose of this is to eliminate the impact of the table data not being in the cache, either InnoDB buffer, or the filesystem cache.)
Also, run an EXPLAIN SELECT ... for each query. And compare the access plans.
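Concretely, for the queries from the question, those two checks would look like:
SELECT SQL_NO_CACHE table1.id
FROM table1
INNER JOIN table2 ON table1.id = table2.id;
-- and the access plan:
EXPLAIN SELECT table1.id
FROM table1
INNER JOIN table2 ON table1.id = table2.id;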
If either of these tables is MyISAM storage engine, note that MyISAM tables are subject to locking by DML operations; while an INSERT, UPDATE or DELETE operation is run on the table, the SELECT statements will be blocked from accessing the table. (But five seconds seems a bit much for that, unless these are really large tables, or really inefficient DML statements).
With InnoDB, the SELECT queries won't be blocked by DML operations.
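You can check which engine each table uses with either of these:
SHOW TABLE STATUS LIKE 'table1';   -- see the Engine column
SHOW CREATE TABLE table2;          -- ENGINE=... appears in the DDL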
Elapsed time is also going to depend on what else is going on on the system.
But the total elapsed time is going to include more than just the time in the MySQL server. Temporarily turning on the MySQL general_log would allow you to capture the statements that are actually being processed by the server.
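A sketch of that, assuming you have the privileges to set global variables (and remembering to turn the log back off, since it grows quickly):
SET GLOBAL general_log_file = '/tmp/mysql-general.log';   -- hypothetical path
SET GLOBAL general_log = 'ON';
-- ... run both queries from MySQL Workbench ...
SET GLOBAL general_log = 'OFF';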
This looks like something that could be further optimized by the database engine if indeed you are running both queries under the exact same context.
SQL is declarative. By successfully declaring what you want, the engine has free rein to restructure the "How" of your request to bring back the fastest result.
The earliest versions of SQL didn't even have the keyword JOIN. There was only the comma.
There are many coding constructs in SQL that imperatively force a single inferior methodology over another, and those should be avoided; JOIN is not one of them. Something seems amiss. JOIN is the core element of SQL. It would be a shame to always have to use commas.
There are a zillion factors that go into the performance of a JOIN, all based on your environment, schema, and data. Chances are that your table1 and table2 represent a fringe case that may have gotten past the optimization algorithms.
The SQL_NO_CACHE hint worked; the new results are:
Duration/Fetch 5.065 / 0.027 for the select where and
Duration/Fetch 5.050 / 0.027 for the join
I would have thought that the "select where" would be faster, but the join was actually a tad swifter. The difference is negligible, though.
I would like to thank everyone for their response.

select * from <tablename> not using any key

I am using MySQL 5.1 on Windows Server 2008. When I execute the query below:
SELECT * FROM tablename;
It is taking too much time to fetch all the results in that table. This query is listed in the slow query log too, even though the table has a primary key as well as a few more indexes.
I executed the query below to check the execution plan:
explain extended select * from tablename;
and found the following information:
id=1
select_type=SIMPLE
table=tablename
possible_keys=null
key=null
key_len=null
ref=null
rows=85151
Extra=blank
I thought the query should at least use the primary key by default. I executed the query below again and found that the filtered column has the value 100.0:
explain extended select * from tablename;
Is there any specific reason why the query is not using a key?
You are selecting all rows from the table. This is why the whole table (all rows) needs to be scanned.
A key (or index) is only used if you narrow your search (using WHERE). In that case, an index is used to pre-select the rows you want without having to actually scan the whole table for the given criteria.
If you don't need to access all the rows at once, try limiting the returned rows using LIMIT.
SELECT * FROM tablename LIMIT 100;
If you want the next 100 rows, use
SELECT * FROM tablename LIMIT 100,100;
and so on.
Other than that approach (referred to as "paging"), there is not much you can do to speed up this query (other than getting a faster machine, more RAM, a faster disk, or a better network if the DBMS is accessed remotely).
If you need to do some processing, consider moving logic (such as filtering) to the DBMS. This can be achieved using the WHERE portion of a query.
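For example (the status column here is hypothetical; use whatever criteria your processing actually applies):
SELECT * FROM tablename WHERE status = 'active';   -- filter on the server, not the client
If status is indexed, EXPLAIN will then show a key being used.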
Why would it use a key when there is no filter and no order? In this single-table query, there is no approach where a table scan is not going to be at least as fast.
To solve your performance issue, perhaps you have client-side processing that could be passed to the server (after all, you're not really showing 85,151 rows to the end user at once, are you?), or get a faster disk...

Should I be using CREATE VIEW instead of JOIN all the time

I have the following query:
SELECT t.*, a.hits AS ahits
FROM t, a
WHERE (t.TRACK LIKE 'xxx')
AND a.A_ID = t.A_ID
ORDER BY t.hits DESC, a.hits DESC
which runs very frequently. Table t has around 15M+ rows and a has around 3M+ rows.
When I did an EXPLAIN on the above query, I received a note saying that it always creates a temp table. I noticed that creating a temp table based on the above query took quite a while, and this is done plenty of times.
Thus, I am wondering: if I create a view using the above, say:
CREATE VIEW v_t_a AS
SELECT t.*, a.hits AS ahits
FROM t, a
WHERE a.A_ID = t.A_ID
And change my code to:
SELECT * FROM v_t_a WHERE TRACK LIKE 'xxx' ORDER BY hits DESC, ahits DESC
Will it improve the performance? Will it remove the create temp table time?
Thank you so much for your suggestions!
It is very dangerous to assume MySQL will optimize your VIEWs the same way more advanced database systems would. As with subqueries and derived tables, MySQL 5.0 will fail and perform very inefficiently in many cases.
MySQL has two ways of handling VIEWs: the MERGE algorithm, in which case the view is simply expanded like a macro, and the TEMPTABLE algorithm, in which case the view is materialized into a temporary table (without indexes!) which is then used further in query execution.
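You can request the cheaper path explicitly when creating the view; MySQL warns and falls back if the view cannot be merged. Using the view from the question:
CREATE ALGORITHM=MERGE VIEW v_t_a AS
SELECT t.*, a.hits AS ahits
FROM t, a
WHERE a.A_ID = t.A_ID;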
There do not seem to be any optimizations from the outer query applied to the query used to create the temporary table, and if you join more than one TEMPTABLE view together you may have serious issues, because such tables get no indexes.
So be very careful implementing MySQL VIEWs in your application, especially ones that require the temporary-table execution method. VIEWs can be used with very small performance overhead, but only if they are used with caution.
MySQL has a long way to go in getting queries with VIEWs properly optimized.
A VIEW internally JOINs the two tables every time you query it!
To prevent this, create a MATERIALIZED VIEW: a view that is more of a TABLE, which you can query directly like any other table.
But MySQL has no built-in materialized views, so you have to write some TRIGGERS to update it automatically if any underlying table data changes (see the sketch after the link below).
See this : http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
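In MySQL you would roll this by hand. A minimal sketch, assuming hypothetical key columns t.T_ID and a.A_ID (only the UPDATE path for a.hits is shown; real code also needs INSERT and DELETE triggers):
CREATE TABLE mv_t_a (
  T_ID  INT PRIMARY KEY,   -- hypothetical PKey of t
  A_ID  INT,
  TRACK VARCHAR(255),
  hits  INT,
  ahits INT,
  KEY (TRACK)              -- supports the TRACK LIKE 'xxx' lookup
);
-- initial population from the base tables
INSERT INTO mv_t_a
SELECT t.T_ID, t.A_ID, t.TRACK, t.hits, a.hits
FROM t JOIN a ON a.A_ID = t.A_ID;
-- keep ahits current when a.hits changes
CREATE TRIGGER a_hits_sync AFTER UPDATE ON a
FOR EACH ROW
UPDATE mv_t_a SET ahits = NEW.hits WHERE A_ID = NEW.A_ID;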
It's rare that doing exactly the same operations in a view will be more efficient than doing it as a query.
Views are more to manage the complexity of queries than for performance; they simply perform the same actions at the back end as the query would have.
One exception to this is materialised query tables which actually create a separate long-lived table for the query so that subsequent queries are more efficient. I have no idea whether MySQL has such a thing, I'm a DB2 man myself :-)
But, you could possibly implement such a scheme yourself if performance of the query is an issue.
It depends greatly on the rate of change of the table. If the data is changing so often that a materialised query would have to be regenerated every time anyway, it won't be worth it.