I have two tables
LangArticles | columns: id (INT) ,de (VARCHAR),en (VARCHAR),count_links(INT)
WikiLinks | columns: article_id,link_id,nr_in_article (all integer)
The name of an article is in the columns de (German) and en (English).
The id in the LangArticles table is the same as the ids article_id and link_id.
I want now to get all article names which links to another article. So I want all articles which links to 'abc'. 'abc' has the id = '1'
So my normal query (without an order by) looks like:
select distinct(LA.de),W.nr_in_article,LA.count_links from
LangArticles as LA inner join WikiLinks as W on W.article_id = LA.id
where W.link_id in ("1")
This maybe took 0.001 seconds and give me 100000 results. Actually I want the best 5 hits.
Best means in this case the most relevant ones. I want to sort it like this:
The articles which links to 'abc' at the beginning of an article (nr_in_article) and which has a lot of links itself (count_links) should have a high ranking.
I am using an
order by (1-(W.nr_in_article/LA.count_links)) desc
for this.
The problem is that I am not sure how to optimize this order by.
The Explain in mysql says that he has to use a temporary file and filesort and can't use the index on the order by keys. For testing I tried an "easy" order by W.nr_in_article so an normal order with one key.
For your information my indices are:
in LangArticles: id (primary),de (unique),en (unique), count_links(index)
in WikiLinks: article_id(index),link_id(index),nr_in_article(index)
But I tried this two multiindices link_id,nr_in_article & article_id,nr_in_article as well.
And the query with order by tooks approximately 5.5 seconds. :(
I think I know why MySql has to use a temporary file and filesort here because all 100,000 entries has to be found with one index (link_id) and afterwards it has to be sorted and in a temporary file it can't use an index.
But is there any way to make this faster?
Actually I only want the best 5 hits so there is no need to sort everything. I am not sure if sth. like the bad sort (bubble sort) would be faster for this than Quicksort which sorts the hole temporary table.
Since you only need the top 5 I think you could split it into two queries that should lead less results.
First like Sam pointed out,
order by (W.nr_in_article/LA.count_links) asc
should be equivalent to your
order by (1-(W.nr_in_article/LA.count_links)) desc
unless I'm overlooking some corner case here.
Furthermore, anything where
W.nr_in_article > LA.count_links
will be in the TOP 5 unless that result is empty, so I would try the query
select distinct(LA.de),W.nr_in_article,LA.count_links
from LangArticles as LA
inner join WikiLinks_2 as W on W.article_id = LA.id
and W.nr_in_article > LA.count_links
where W.link_id in ("1")
order by W.nr_in_article/La.count_links
limit 5
Only if this returns less than 5 results you have to further execute the query again with a changed where condition.
This however will not bring the runtime down by orders of magnitude, but should help a little. If you need more performance I don't see any other way than a materialized view, which I don't think is available in mysql, but can be simulated using triggers.
Related
I have to query ~500k records in a column and curently it takes about ~15s to finish the entire operation. Currently I am using pagination like this
SELECT
columnA,
columnB,
columnC,
columnD,
name
FROM
table1 t1
INNER JOIN table2 t2 ON
${condition1}
INNER JOIN table3 t3 ON
${condition2}
INNER JOIN table4 t4 ON
${condition3}
WHERE
${WhereConditions}
-- would like to have a key based pagination in where
GROUP BY c.id_number
HAVING ${aggregateCondition}
ORDER BY
name ASC
LIMIT 20 , 20
But my current implementation yeilds no difference in fetching paginated result and all results
So I stumbled upon key-based pagination or seeking-pagination, but all the examples I sqw was based on some key which is an integer or DateTime.
So I was wondering if it's possible to do something like key-based pagination on varchar?
Below image is the Explain part of the query
You hope to paginate a result set with ORDER BY name LIMIT offset, 20. And, of course you've discovered that LIMIT large_offset, 20 is painfully slow.
So you're looking for a way to do WHERE ... name > ?previous_page_name ... ORDER BY name LIMIT 20 to avoid large offsets. That's generally good for efficiency if you can exploit an index.
But here's the problem you face: name values probably are not unique. What if you have 22 Francis names, then 26 Jones, then 13 Smith? Your first page will be fine, but your second page won't. It will skip the last two Francis rows and show 20 Jones rows.
Index-based pagination really requires ordering by an indexed unique value.
And of course any reliable pagination requires stable ordering: with duplicate name values, ORDER BY name doesn't always return the rows in the same order.
In many apps using pagination, there's an underlying assumption that the user will look at one, two, or three pages, and then give up and try a different way of finding the data. The assumption says the user won't paginate all the way through. This means the performance problem with later pages doesn't matter.
So duplicated data is your problem. To give a more detailed suggestion about how to fix it requires more understanding of your data than you have given in your question.
Consider a table Test having 1000 rows
Test Table
id name desc
1 Adi test1
2 Sam test2
3 Kal test3
.
.
1000 Jil test1000
If i need to fetch, say suppose 100 rows(i.e. a small subset) only, then I am using LIMIT clause in my query
SELECT * FROM test LIMIT 100;
This query first fetches 1000 rows and then returns 100 out of it.
Can this be optimised, such that the DB engine queries only 100 rows and returns them
(instead of fetching all 1000 rows first and then returning 100)
Reason for above supposition is that the order of processing will be
FROM
WHERE
SELECT
ORDER BY
LIMIT
You can combine LIMIT ROW COUNT with an ORDER BY, This causes MySQL to stop sorting as soon as it has found the first ROW COUNT rows of the sorted result.
Hope this helps, If you need any clarification just drop a comment.
The query you wrote will fetch only 100 rows, not 1000. But, if you change that query in any way, my statement may be wrong.
GROUP BY and ORDER BY are likely to incur a sort, which is arguably even slower than a full table scan. And that sort must be done before seeing the LIMIT.
Well, not always...
SELECT ... FROM t ORDER BY x LIMIT 100;
together with INDEX(x) -- This may use the index and fetch only 100 rows from the index. BUT... then it has to reach into the data 100 times to find the other columns that you ask for. UNLESS you only ask for x.
Etc, etc.
And here's another wrinkle. A lot of questions on this forum are "Why isn't MySQL using my index?" Back to your query. If there are "only" 1000 rows in your table, my example with the ORDER BY x won't use the index because it is faster to simply read through the table, tossing 90% of the rows. On the other hand, if there were 9999 rows, then it would use the index. (The transition is somewhere around 20%, but it that is imprecise.)
Confused? Fine. Let's discuss one query at a time. I can [probably] discuss the what and why of each one you throw at me. Be sure to include SHOW CREATE TABLE, the full query, and EXPLAIN SELECT... That way, I can explain what EXPLAIN tells you (or does not).
Did you know that having both a GROUP BY and ORDER BY may cause the use of two sorts? EXPLAIN won't point that out. And sometimes there is a simple trick to get rid of one of the sorts.
There are a lot of tricks up MySQL's sleeve.
I have a table with 10 columns, Now I want to give the users an option to sort the data with any column they want. For example suppose a combo box with 7 items that each of them is a column of the table, now the user choose one item and get the data sorted by the chosen column.
Now what is the problem?
My table has 3M records, and if I sort the data with indexed column I have no problem but with a non index column it takes 3.5mins to sort!!!
What is the solution I am thinking about?
Add index to every column of table that is needed to be sort by! In my case I will have index on 8 columns!!!!
What is the problem of my solution?
Having a lot of index on columns may decrease the speed of INSERT/UPDATE queries! In my case the table is updated frequently (every second!!!!!)
What is your solution for this case?!
Read this for more details on optimization: http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
In some cases, MySQL cannot use indexes to resolve the ORDER BY, although it still uses indexes to find the rows that match the WHERE clause. Using index for sorting often comes together with using index to find rows, however it can also be used just for sort for example if you’re just using ORDER BY without and where clauses on the table. In such case you would see “Index” type in EXPLAIN which correspond to scanning (potentially) complete table in the index order. It is very important to understand in which conditions index can be used to sort data together with restricting amount of rows.
Looking at the same index (A,B) things like ORDER BY A ; ORDER BY A,B ; ORDER BY A DESC, B DESC will be able to use full index for sorting (note MySQL may not select to use index for sort if you sort full table without a limit). However ORDER BY B or ORDER BY A, B DESC will not be able to use index because requested order does not line up with the order of data in BTREE. If you have both restriction and sorting things like this would work A=5 ORDER BY B ; A=5 ORDER BY B DESC; A>5 ORDER BY A ; A>5 ORDER BY A,B ; A>5 ORDER BY A DESC which again can be easily visualized as scanning a range in BTREE. Things like this however would not work A>5 ORDER BY B , A>5 ORDER BY A,B DESC or A IN (3,4) ORDER BY B – in these cases getting data in sorting form would require a bit more than simple range scan in the BTREE and MySQL decides to pass it on.
Option #1: If you are limited to MySQL there's no better option but create 8 indexes for the possible order columns. You're insert/update are going to suffer it for sure but no real visitor will wait for 3.5 minutes for a list to be sorted.
Tune #1: To make it a little faster you can create partial indexes instead of standard indexes which will use much less space (I assume some of these columns are varchar) and this means less writes, smaller footprint in memory. You just need to check the entropy for each column with the substring and make sure you still have distinction over 90%.
For example with a query like this:
> select count(distinct(substring(COLUMN, 1, 5))) as part_5, count(distinct(substring(COLUMN, 1, 10))) as part_10, count(distinct(substring(COLUMN, 1, 20))) as part_20, count(distinct(COLUMN)) as sum from TABLE;
+--------+---------+---------+---------+
| part_5 | part_10 | part_20 | sum |
+--------+---------+---------+---------+
| 892183 | 1996053 | 1996058 | 1996058 |
+--------+---------+---------+---------+
Tune #2: You can make you insert/update statements to execute in the background. The application won't be faster but the user experience is going to be much better.
Tune #3: Use bigger transactions if you can for the inserts/updates.
Option #2: You can try to use one of the search engines which have been built for this usage pattern (too). I would recommend Solr as I'm using it for a while with great satisfaction but I heard good about elastic search as well.
I have a situation where I need to use a huge number for the limit. For example,
"select * from a table limit 15824293949,1";
this is....really really slow. Sometimes my home mysql server just dies.
is it possible to make it faster?
sorry the number was 15824293949, not 38975901200
Added:
**Table 'photos' (Sample)**
Photos
img_id img_filename
1 a.jpg
2 b.jpg
3 c.jpg
4 d.jpg
5 e.jpg
and so on
select cp1.img_id,cp2.img_id from photos as cp1 cross join photos as cp2 limit ?,1
How I get 15824293949?
I have 177901 rows in my photo table. I can get the total # of possible combinations by using
(total # of rows * total # of rows ) - total # of rows)/2
MySQL has issues with huge limit offsets with MyISAM engine mostly where InnoDB optimizes that. There are various techniques to get MyISAM limit to behave faster, however add EXPLAIN before your select statement to see what's actually going on. 3 billion rows generated from cross join indicate that the issue lies within the join itself, not the LIMIT clause.
If you're interested in how to make LIMIT behave faster, this link should provide you with enough information.
Try limiting the query with a WHERE clause on a column with an index on it. E.g.:
SELECT * FROM table WHERE id >= 38975901200 LIMIT 1
Update: I think perhaps you don't even need the database? You can find the nth combination of two images by calculating something like 15824293949 / 177901 and 15824293949 % 177901. I suppose you could write a query:
SELECT (15824293949 / 177901) AS img_id1, (15824293949 MOD 177901) AS img_id2
If you're trying to get them from the natural order that they're in the database (and it doesn't happen to be their img_id) then you might have some trouble. Does it matter? It's not clear what you're trying to do here.
Presumably you have this in some sort of script, where the reason you are looking at that specific point is because it is where you left off last.
Ideally, if you also have an auto_increment primary key field (an id), you can store that number. Then just do select * from table where id > last_seen_id limit 1 (maybe do more than 1 at a time :P)
Generally speaking, what you are asking it to do should be slow. Give it something to search for, rather than everything with a limit.
I got a spicy question about mysql...
The idea here is to select the n last records from a table, filtering by a property, (possibly from another table). That simple.
At this point you wanna reply :
let n = 10
SELECT *
FROM huge_table
JOIN another_table
ON another_table.id = huge_table.another_table_id
AND another_table.some_interesting_property
ORDER BY huge_table.id DESC
LIMIT 10
Without the JOIN that's OK, mysql reads the index from the end and trow me 10 items, execution time is negligible
With the join, the execution time become dependent of the size of the table and in many case not negligible, the explain stating that mysql is : "Using where; Using index; Using temporary; Using filesort"
MySQL documentation (http://dev.mysql.com/doc/refman/5.1/en/order-by-optimization.html) states that :
"You are joining many tables, and the columns in the ORDER BY are not all from the first nonconstant table that is used to retrieve rows. (This is the first table in the EXPLAIN output that does not have a const join type.)"
explaining why MySQL can't use index to resolve my ORDER BY prefering a huge file sort ...
My question is : Is it natural to use ORDER BY ... LIMIT 10 to get last items ? Do you really do it while picking last 10 cards in an ascending ordered card deck ? Personally i just pick 10 from the bottom ...
I tried many possibilities but all ended giving the conclusion that i'ts really fast to query 10 first elements and slow to query 10 last cause of the ORDER BY clause.
Can a "Select last 10" really be fast ? Where i am wrong ?
Nice question, I think you should make order by column i.e., id a DESC index.
That should do the trick.
http://dev.mysql.com/doc/refman/5.0/en/create-index.html
With the join you're now restricting rows to "some_interesting_property" and the ID's in your huge_table may no longer be consecutive... Try an index on another_table (some_interesting_property, id) and also huge_table (another_table_id, id) and see if your EXPLAIN gives you better hints.
I'm having trouble reproducing your situation. Whether I use ASC or DESC with my huge_table/another_table mock up, my EXPLAINs and execution time all show approx N rows read and a logical join. Which version of MySQL are you using?
Also, from the EXPLAIN doc, it states that Using index indicates
The column information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row
which doesn't correspond with the fact you're doing a SELECT *, unless you have an index which covers your whole table.
Perhaps you should show your schema, including indexes, and the EXPLAIN output.