I've hit the DB performance bottleneck, where to now? (MySQL)

I have some queries that are taking too long (300ms) now that the DB has grown to a few million records. Luckily for me, the queries don't need to look at the majority of this data: the latest 100,000 records will be sufficient, so my plan is to maintain a separate table with the most recent 100,000 records and run the queries against that. If anyone has a suggestion for a better way of doing this, that would be great. My real question is: if the queries did need to run against the historic data, what would the next step be? Things I've thought of:
Upgrade hardware
Use an in memory database
Cache the objects manually in your own data structure
Are these things correct and are there any other options? Do some DB providers have more functionality than others to deal with these problems, e.g. specifying a particular table/index to be entirely in memory?
Sorry, I should've mentioned this, I'm using mysql.
I forgot to mention indexing in the above. Indexing has been my only source of improvement thus far, to be quite honest. To identify bottlenecks I've been using maatkit to show whether or not the queries are using the indexes.
I understand I'm now getting away from what the question was intended for so maybe I should make another one. My problem is that EXPLAIN is saying the query takes 10ms rather than 300ms which jprofiler is reporting. If anyone has any suggestions I'd really appreciate it. The query is:
select bv.*
from BerthVisit bv
inner join BerthVisitChainLinks on bv.berthVisitID = BerthVisitChainLinks.berthVisitID
inner join BerthVisitChain on BerthVisitChainLinks.berthVisitChainID = BerthVisitChain.berthVisitChainID
inner join BerthJourneyChains on BerthVisitChain.berthVisitChainID = BerthJourneyChains.berthVisitChainID
inner join BerthJourney on BerthJourneyChains.berthJourneyID = BerthJourney.berthJourneyID
inner join TDObjectBerthJourneyMap on BerthJourney.berthJourneyID = TDObjectBerthJourneyMap.berthJourneyID
inner join TDObject on TDObjectBerthJourneyMap.tdObjectID = TDObject.tdObjectID
where
BerthJourney.journeyType='A' and
bv.berthID=251860 and
TDObject.headcode='2L32' and
bv.depTime is null and
bv.arrTime > '2011-07-28 16:00:00'
and the output from EXPLAIN is:
+----+-------------+-------------------------+-------------+---------------------------------------------+-------------------------+---------+------------------------------------------------+------+-------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------------+-------------+---------------------------------------------+-------------------------+---------+------------------------------------------------+------+-------------------------------------------------------+
| 1 | SIMPLE | bv | index_merge | PRIMARY,idx_berthID,idx_arrTime,idx_depTime | idx_berthID,idx_depTime | 9,9 | NULL | 117 | Using intersect(idx_berthID,idx_depTime); Using where |
| 1 | SIMPLE | BerthVisitChainLinks | ref | idx_berthVisitChainID,idx_berthVisitID | idx_berthVisitID | 8 | Network.bv.berthVisitID | 1 | Using where |
| 1 | SIMPLE | BerthVisitChain | eq_ref | PRIMARY | PRIMARY | 8 | Network.BerthVisitChainLinks.berthVisitChainID | 1 | Using where; Using index |
| 1 | SIMPLE | BerthJourneyChains | ref | idx_berthJourneyID,idx_berthVisitChainID | idx_berthVisitChainID | 8 | Network.BerthVisitChain.berthVisitChainID | 1 | Using where |
| 1 | SIMPLE | BerthJourney | eq_ref | PRIMARY,idx_journeyType | PRIMARY | 8 | Network.BerthJourneyChains.berthJourneyID | 1 | Using where |
| 1 | SIMPLE | TDObjectBerthJourneyMap | ref | idx_tdObjectID,idx_berthJourneyID | idx_berthJourneyID | 8 | Network.BerthJourney.berthJourneyID | 1 | Using where |
| 1 | SIMPLE | TDObject | eq_ref | PRIMARY,idx_headcode | PRIMARY | 8 | Network.TDObjectBerthJourneyMap.tdObjectID | 1 | Using where |
+----+-------------+-------------------------+-------------+---------------------------------------------+-------------------------+---------+------------------------------------------------+------+-------------------------------------------------------+
7 rows in set (0.01 sec)

Make sure all your indexes are optimized. Use EXPLAIN on the query to see if it is using your indexes efficiently.
If you are doing some heavy joins, then start thinking about doing that calculation in Java.
Think of using other DBs, such as NoSQL stores. You may be able to do some preprocessing and put data in Memcache to help you a little.

Considering a design change like this is not a good sign - I bet you still have plenty of performance to squeeze out using EXPLAIN, adjusting db variables and improving the indexes and queries. But you're probably past the point where "trying stuff" works very well. It's an opportunity to learn how to interpret the analyses and logs, and use what you learn for specific improvements to indexes and queries.
If your suggestion were a good one, you should be able to tell us why already. And note that this is a popular pessimization; see "What is the most ridiculous pessimization you've seen?"

Well, if you have optimised the database and queries, I'd say that rather than chop up the data, the next step is to look at:
a) the MySQL configuration: make sure that it is making the most of the hardware;
b) the hardware itself. You don't say what hardware you are using. You may find that replication is an option in your case, if you can buy two or three servers to divide up the reads from the database (writes have to go to a central master, but reads can be served from any number of slaves).
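For (a), a quick way to check whether MySQL is making use of the available RAM is to compare the buffer/cache sizes with the miss counters. A minimal sketch (the variable and status names are standard MySQL; the ratio guidance is a rule of thumb, not a hard limit):

```sql
-- How big are the caches? (InnoDB buffer pool, MyISAM key buffer)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'key_buffer_size';

-- How often do reads miss the buffer pool and hit disk?
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';          -- reads from disk
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';  -- logical reads

-- If the miss ratio is high, raise the pool in my.cnf (restart required
-- on older versions), e.g. to roughly 70% of RAM on a dedicated DB server:
--   innodb_buffer_pool_size = 4G
```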

Instead of creating a separate table for the latest results, think about table partitioning. MySQL has had this feature built in since version 5.1.
Just to make it clear: I am not saying this is THE solution for your issues, just one thing you can try.
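As a sketch of what that could look like here (the schema is guessed from the column names in the question, so treat it as illustrative; note that MySQL requires the partitioning column to be part of every unique key, including the primary key):

```sql
-- Hypothetical BerthVisit schema, partitioned so the recent rows live apart
-- from the history. Queries filtering on arrTime get pruned to few partitions.
CREATE TABLE BerthVisit (
    berthVisitID BIGINT NOT NULL,
    berthID      INT NOT NULL,
    arrTime      DATETIME NOT NULL,
    depTime      DATETIME NULL,
    PRIMARY KEY (berthVisitID, arrTime)     -- must include the partition column
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(arrTime)) (
    PARTITION p2011h1 VALUES LESS THAN (TO_DAYS('2011-07-01')),
    PARTITION p2011h2 VALUES LESS THAN (TO_DAYS('2012-01-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);
```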

I would start by trying to optimize the tables/indexes/queries before taking any of the measures you listed. Have you dug into the poorly performing queries to the point where you are absolutely convinced you have reached the limit of your RDBMS's capabilities?
Edit: if you are indeed properly optimized, but still have problems, consider creating a Materialized View for the exact data you need. That may or may not be a good idea based on more factors than you have provided, but I would put it at the top of the list of things to consider.
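One caveat: MySQL has no native materialized views, so in practice this means a summary table that you refresh yourself. A minimal sketch, assuming the BerthVisit table from the question and a made-up name for the summary table:

```sql
-- One-off: clone the structure (CREATE TABLE ... LIKE also copies indexes).
CREATE TABLE recent_berth_visits LIKE BerthVisit;

-- Refresh periodically (from cron or the event scheduler). As written this
-- is not atomic: readers may briefly see an empty table mid-refresh.
TRUNCATE TABLE recent_berth_visits;
INSERT INTO recent_berth_visits
SELECT * FROM BerthVisit
ORDER BY berthVisitID DESC
LIMIT 100000;
```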

Searching in the last 100,000 records should be terribly fast, you definitely have problems with the indexes. Use EXPLAIN and fix it.

I understand I'm now getting away from what the question was intended for
so maybe I should make another one. My problem is that EXPLAIN is saying
the query takes 10ms rather than 300ms which jprofiler is reporting.
Then your problem (and solution) must be in java, right?
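One way to check is to time the statement on the server itself, which excludes JDBC/app overhead. A sketch using tools available in the MySQL versions of that era (SQL_NO_CACHE bypasses the query cache so you measure real execution):

```sql
SET profiling = 1;

-- Re-run the query from the question with caching disabled:
SELECT SQL_NO_CACHE bv.* FROM BerthVisit bv /* ... rest of the joins ... */;

SHOW PROFILES;             -- wall-clock duration of each recent statement
SHOW PROFILE FOR QUERY 1;  -- per-stage breakdown (sending data, sorting, ...)
```

If the server-side time is ~10ms, the missing 290ms is being spent in Java, the driver, or result-set materialization.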

MySQL 5.7 vs 5.6: Index usage wrong at first, but "automagically" fixed weeks later

Context:
I have a MySQL 5.6 master that has two replicas; one of them is also a MySQL 5.6 instance and the other is MySQL 5.7. Heavy queries are distributed evenly across the two replicas.
The 5.6 replica has been up and running for about two months; the 5.7 one is newer, running for about two weeks.
Using AWS RDS.
Weird Behaviour:
Just after the creation, I noticed that some queries were much slower on 5.7 because of "wrong index usage", as shown by comparing the EXPLAIN result on both versions. Even queries with a USE INDEX clause seemed to be unaffected on the 5.7 database.
Given that it was a fresh new replica, I thought: "Maybe the index statistics aren't up to date?"
Not the case. I have a routine that runs ANALYZE TABLE for every table on the system overnight on the master DB, so index statistics should be fine. I can confirm that the ANALYZE TABLE statement is successfully being replicated by checking last_update in the replicas' mysql.innodb_index_stats.
But some days passed by, and without any work on my side, the bad queries started to use the right indexes! All fixed. No work done. Crazy day.
Questions:
What could affect the choice of a specific query execution plan? Aren't index statistics the only input?
Is there any "caching/warming up" mechanism that, over two weeks of actual work, could make the optimizer change its mind?
Shouldn't ANALYZE TABLE be enough to guarantee that both servers would choose the same execution plan?
What else can explain why, after two weeks of executions, MySQL 5.7 starts to behave the same way as MySQL 5.6?
Remember, both databases are REPLICAS of the SAME SOURCE and receive EVENLY DISTRIBUTED traffic.
SQL Example:
The following query (this is an obfuscated version), was using the BAD index, and one day later it was using the GOOD index. Just like Magic™.
SELECT sale.number,
seller.id,
seller.name,
sale.id,
COALESCE(billed_amount, 0),
COALESCE(billing.tax_id, '---'),
billing.date,
CASE COALESCE(sale.seller_name_2, '') WHEN '' THEN sale.seller_name_1 ELSE sale.seller_name_2 END,
sale.business_id,
sale.business_name,
sale.total
FROM app_billing billing
JOIN app_sale sale ON billing.sale_id = sale.id
JOIN app_seller seller ON seller.id = sale.seller_id
WHERE billing.tenant_id = 515
AND billing.removed = FALSE
AND billing.date BETWEEN '2020-08-01' AND '2020-08-31'
AND sale.status = 2
AND sale.seller_id IN (368);
-- MySQL 5.7 (BAD: a really bad estimate and a counter-intuitive decision [it should definitely be a range scan])
-- 1st step: There is only one seller with ID 368, so that's right.
-- 2nd step: 692 is pretty accurate; there are 695 sales for seller ID 368.
-- 3rd step: This is tricky. There are a total of 270776 billing records matching the 695 sale_ids from the previous step. 65% of the matches have only one correspondence, but the remaining 35% have between 1000 and 5000 correspondences. The average would be 1360.
-- Estimated total number of processed records: ~942000.
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | SIMPLE | seller | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100 | NULL |
| 1 | SIMPLE | sale | NULL | ref | PRIMARY,idx_seller_id | idx_seller_id | 4 | const | 692 | 10 | Using where |
| 1 | SIMPLE | billing | NULL | ref | idx_sale_id,idx_tenant_id_removed_date | idx_sale_id | 4 | sale.id | 2 | 2.16 | Using where |
-- MySQL 5.6 (GOOD, using range scan)
-- 1st step: There is only one seller with ID 368, so that's right.
-- 2nd step: 421 is almost perfect; there are actually 422 billing records between '2020-08-01' and '2020-08-31' for tenant_id 515 that aren't removed.
-- 3rd step: That's right; there's only one sale correspondence per billing.
-- Estimated total number of processed records: 421.
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | SIMPLE | seller | const | PRIMARY | PRIMARY | 4 | const | 1 | NULL |
| 1 | SIMPLE | billing | range | idx_sale_id,idx_tenant_id_removed_date | idx_tenant_id_removed_date | 9 | NULL | 421 | Using index condition |
| 1 | SIMPLE | sale | eq_ref | PRIMARY,idx_seller_id | PRIMARY | 4 | billing.sale_id | 1 | Using where |
MySQL 5.7 and later suffer quite badly from a query optimizer heisenbug, which makes it unpredictably pick an egregiously bad execution plan. It sounds like you are falling foul of that. I have seen this sort of inconsistent, unpredictable behaviour on every 5.7 installation I have seen to date (more than I can hope to enumerate).
Regarding your specific questions:
Only the stats should be making a difference. However, the query plan was always also prone to getting confused when there is a large number of indexes in a table (more than approximately 10). And then there is the heisenbug I mentioned above.
No, but table statistics are maintained all the time in the background. It sounds like it finally got the data distribution estimation right - eventually.
It should, but often isn't. You may want to look at the following settings:
innodb_stats_persistent_sample_pages
innodb_stats_persistent
optimizer_switch. You may want to try setting this on 5.7 to the same flag set that you have on 5.6; it sometimes helps (but not too often).
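Concretely, those checks might look like this (the sample-pages value is illustrative; deeper sampling makes ANALYZE slower but the estimates more stable):

```sql
-- Current persistent-statistics settings:
SHOW VARIABLES LIKE 'innodb_stats_persistent%';

-- Sample more index pages, then rebuild the stats for the affected tables:
SET GLOBAL innodb_stats_persistent_sample_pages = 256;
ANALYZE TABLE app_billing, app_sale;

-- Compare the optimizer flags between the 5.6 and 5.7 instances:
SELECT @@optimizer_switch;
```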
There are many reasons why the statistics might be slightly different, leading to a different explain plan.
Since the WHERE clause tests columns in more than one table, the Optimizer does not necessarily know which table to start with. Sometimes it will pick the "wrong" table.
Rather than chasing the statistics and the optimizer's choice, let's improve the indexes, thereby hopefully making the query fast enough to quell the question:
billing: (tenant_id, sale_id, removed, date, tax_id)
billing: (removed, tenant_id, date, sale_id, tax_id)
sale: (status, seller_id)
That set of indexes should help regardless of which table it decides to start with. The order of columns in each index is important. (There are some variations that would work equally well.)
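In DDL form, the indexes above would be something like (the index names are made up):

```sql
ALTER TABLE app_billing
  ADD INDEX idx_tenant_sale    (tenant_id, sale_id, removed, date, tax_id),
  ADD INDEX idx_removed_tenant (removed, tenant_id, date, sale_id, tax_id);

ALTER TABLE app_sale
  ADD INDEX idx_status_seller  (status, seller_id);
```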
Examples of why the Optimizer might "do the wrong thing".
If billing.tenant_id = 515 occurred in only one row, then starting with billing would probably be fast: fetch one row, then join to get the row(s) from the other tables.
Similarly, if sale.status = 2 occurs infrequently, sale would be a better first choice.
There are no statistics detailed enough to capture the above. What there is looks more like: "there are about 100 distinct tenant_id values out of 5000 rows". That is, the optimizer will assume that tenant_id=515 occurs in 50 rows.
Note: I said "about". This is because ANALYZE does not (assuming a large InnoDB table) look at every value in the table. Instead, it takes a sample. (Look for "number of dives" in the documentation.)
Some things that can lead Primary and Replica to act differently
When were the stats last updated? There is nothing to synchronize the updating.
The stats involve "random" probes. Hence slight (or big) variations can occur.
In some situations a slight change in the stats can lead to a big difference in the query plan chosen -- typically between a table scan and using an index. The Optimizer tries to have this quantum change at the point where there is not much difference in performance, but it can be wrong.
Presumably different SELECTs are run on different replicas. SELECTs can interact with the writes coming through replication. This can lead to block splits (etc.) happening at different times and/or different places in the table, which can indirectly impact the stats.
Etc.

How to add related properties to a table in mysql correctly

We have been developing the system at my place of work for some time now, and I feel the database design is getting somewhat out of hand.
For example we have a table widgets (I'm spoofing these somewhat):
+-----------------------+
| Widget |
+-----------------------+
| Id | Name | Price |
| 1 | Sprocket | 100 |
| 2 | Dynamo | 50 |
+-----------------------+
*There are 40+ columns on this table already
We want to add a property to each widget for packaging information. We need to know if it has packaging information, if it doesn't have packaging information, or if we don't know whether it does or doesn't. We then also need to store the type of packaging details (assuming it has any; maybe it doesn't, and the info is now redundant).
We already have another table which stores the detail information (I personally think this table should be divided up, but that's another issue).
PD = PackageDetails
+--------------------------------+
| System Properties |
+--------------------------------+
| Id | Type | Value |
| 28 | PD | Boxed |
| 29 | PD | Vacuum Sealed |
+--------------------------------+
*There are thousands of rows in this table for all the system-wide properties
Instinctively I would create a number of mapping tables to capture this information. I have however been instructed to just add another column onto each table to avoid doing a join.
My solution:
Create tables:
+---------------------------------------------------+
| widgets_packaging |
+---------------------------------------------------+
| Id | widget_id | packing_info | packing_detail_id |
| 1 | 27 | PACKAGED | 2 |
| 2 | 28 | UNKNOWN | NULL |
+---------------------------------------------------+
+--------------------+
| packaging |
+--------------------+
| Id | |
| 1 | Boxed |
| 2 | Vacuum Sealed |
+--------------------+
If I want to know what packaging a widget has, I join through to widgets_packaging, and join again to packaging if I want the exact details. Therefore, no more columns on the widgets table.
I have however been told to ignore this, and instead to add one int column for the packaging information and another as a foreign key to the System Properties table for the packaging details: two more columns on the widgets table, and yet more rows in the system properties table to store package details.
+------------------------------------------------------------+
| Widget |
+------------------------------------------------------------+
| Id | Name |Price | has_packaging | packaging_details |
| 1 | Sprocket |100 | 1 | 28 |
| 2 | Dynamo |50 | 0 | 29 |
+------------------------------------------------------------+
The reason for this is that it's simpler and doesn't involve a join if you only want to know whether the widget has packaging (there are lots of widgets). They are concerned that more joins will slow things down.
Which is the more correct solution here, and are their concerns about speed legitimate? My gut instinct is that we can't just keep adding columns onto the widgets table; it is growing and growing with flags for properties as it is.
The answer to this really depends on whether the application(s) using this database are read or write intensive. If it's read intensive, the de-normalized structure is a better approach because you can make use of indexes. Selects are faster with fewer joins, too.
However, if your application is write intensive, normalization is a better approach (the structure you're suggesting is a more normalized approach). Tables tend to be smaller, which means they have a better chance of fitting into the buffer. Also, normalization tends to lead to less duplication of data, which means updates and inserts only need to be done in one place.
To sum it up:
Write Intensive --> normalization
smaller tables have a better chance of fitting into the buffer
less duplicated data, which means quicker updates / inserts
Read Intensive --> de-normalization
better structure for indexes
fewer joins means better performance
If your application is not heavily weighted toward reads over writes, then a more mixed approach would be better.

More efficient to have more columns or more rows?

I'm currently redesigning a database which could contain a lot of data - I have the option to either include a number of different columns in the database or use a lot of rows instead. It's probably easier if I did some kind of outline below:
item_id | user_id | title | description | content | category | template | comments | status
-------------------------------------------------------------------------------------------
1 | 1 | ABC | DEF | GHI | 1 | default | 1 | 1
2 | 1 | ZYX | | QWE | 2 | default | 0 | 1
3 | 1 | A | | RTY | 2 | default | 0 | 0
4 | 2 | ABC | DEF | GHI | 3 | custom | 1 | 1
5 | 2 | CBA | | GHI | 3 | custom | 1 | 1
Versus something in the following structure:
item_id | user_id | attribute | value
---------------------------------------
1 | 1 | title | ABC
1 | 1 | description | DEF
1 | 1 | content | GHI
... | ... | ... | ...
I may want to create additional attributes in the future (50, for argument's sake), so there could be a lot of empty cells if using multiple columns. The attribute names would be reused, where possible, across different types of content; say a blog entry, event, and gallery would all easily reuse "title".
So my question is: is it more efficient to use multiple columns or multiple rows, in terms of query speed and disk space? Or would you instead recommend relationship tables, so there's a table for blogs, a table for events, etc.? I'm trying to come up with an easily expandable solution, and I ideally do not want to create a table for every kind of content, as I'm thinking of developers creating new kinds of content via an app/API system (with attributes being tightly controlled).
Supplementary Question if Multiple Rows
How could I, in MySQL, convert multiple rows into a usable column format (I guess temporary tables) - so I could do some filtering by content type, as an example.
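On the supplementary question: the usual way to turn attribute/value rows back into columns, without temporary tables, is conditional aggregation. A sketch against the second layout above (the table name item_attributes is assumed):

```sql
SELECT item_id,
       MAX(CASE WHEN attribute = 'title'       THEN value END) AS title,
       MAX(CASE WHEN attribute = 'description' THEN value END) AS description,
       MAX(CASE WHEN attribute = 'content'     THEN value END) AS content
FROM item_attributes
GROUP BY item_id;
```

Filtering by content type then becomes a HAVING condition, or a join against whichever attribute encodes the type.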
Basically, MySQL has a variable row length as long as one does not change the row format at the table level. Thus, empty columns will not use any space (well, almost).
But with BLOB or TEXT columns, it might be better to normalize those out, as they may hold large data that needs to be read / skipped every time the table is scanned. Even if the column is not in the result set, queries outside of an index will take their time over a large number of rows.
As good practice, I think it is fastest to put all administrative and often-used columns in one table and normalize out the rest. A "vertical" design as in your second example will be complex to read, and as soon as you work with temporary tables you will run into performance issues sooner or later.
For a traditional row-based store, the cost of spooling through rows will depend on their width, so scanning a table with wide rows will take longer than one with narrow rows.
That said, it you're using an index to locate the rows that are of interest, this won't be so much of an issue.
If you normalise your data by replacing columns with keys to rows in other tables, you can reduce the amount of storage if the linked tables end up being significantly smaller than the original table, however any query will need to include the cost of required joins into the related table.
As with all these things, it's a balancing act that depends on your requirements, but understanding what's going on under the hood can certainly help you to make more informed decisions.
This question is very difficult to answer, as it all comes down to what you are looking for and how your database will grow in size and complexity over time. I find the best way to answer these types of questions is to read case studies from other successful sites. For example, Reddit would be a case study where they use a lot of rows but very few tables and/or columns. The article is here and a question on it is here.
There is also the option of exploring a NoSQL solution which may be more applicable to what you are trying to achieve.
Google case studies of sites that would have a similar structure to your own and see how they accomplished it as they have most likely encountered all the issues you will and already overcome them.

How to speed up mysql select in database with highly redundant key values

I have a very simple MYSQL database with only 3 columns but several millions of rows.
Two of the columns (hid1, hid2) describe study objects (about 50,000 of them), and the third column (score) is the result of a comparison of hid1 with hid2. Thus, the number of rows is max(hid1)*max(hid2), which is quite a big number. Because the table has to be written only once and read many millions of times, I selected a MyISAM table (I hope this was a good idea). Initially, the plan was to retrieve 'score' for a given pair of hid1, hid2, but it turned out to be more convenient to retrieve all scores (and hid2) for a given hid1.
My table ("result") looks like this:
+-------+-----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-----------------------+------+-----+---------+-------+
| hid1 | mediumint(8) unsigned | YES | MUL | NULL | |
| hid2 | mediumint(8) unsigned | YES | | NULL | |
| score | float | YES | | NULL | |
+-------+-----------------------+------+-----+---------+-------+
and a typical query would be
select hid1,hid2,score from result where hid1=13531 into outfile "/tmp/ttt"
Here is the problem: the query just takes too long, at least sometimes. For some hid1 values, I get the result back in under a second; for others (particularly big numbers), I have to wait for up to 40 seconds. As I said, I have to run thousands of these queries, so I am interested in speeding things up.
Let me reiterate: there are about 50,000 hits per query, and I don't need them in any particular order. Am I doing something wrong here, or is a relational database like MySQL not up to this task?
What I have already tried is to increase key_buffer_size in /etc/mysql/my.cnf.
This appeared to help, but not much. The index on hid1 is a few GB; does the key buffer have to be bigger than the index size to be effective?
Any hint would be appreciated.
Edit: here is an example run with the corresponding 'explain' output:
select hid1,hid2,score from result where hid1=132885 into outfile "/tmp/ttt"
Query OK, 16465 rows affected (31.88 sec)
As you can see below, the index hid1_idx is actually being used:
mysql> explain select hid1,hid2,score from result where hid1=132885 into outfile "/tmp/ttt";
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
| 1 | SIMPLE | result | ref | hid1_index | hid1_index | 4 | const | 15456 | Using where |
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)
What I do find puzzling is the fact that queries with low numbers for hid1 are always much faster than those with high numbers. This is not what I would expect from using an index.
Two random suggestions, based on a query pattern that always involves an equality filter on hid1:
Use an InnoDB table instead and take advantage of the clustered index by making (hid1, hid2) the primary key. That way all rows belonging to the same hid1 will be physically located together, which will speed up retrieval.
Hash-partition the table on hid1, with a suitable number of partitions.
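In DDL form, the two suggestions would be something like the following (the primary key requires NOT NULL columns and unique (hid1, hid2) pairs, and rebuilding a table with millions of rows will take a while):

```sql
-- Suggestion 1: InnoDB with a clustered primary key on (hid1, hid2)
ALTER TABLE result
  MODIFY hid1 MEDIUMINT UNSIGNED NOT NULL,
  MODIFY hid2 MEDIUMINT UNSIGNED NOT NULL,
  ENGINE = InnoDB,
  ADD PRIMARY KEY (hid1, hid2);

-- Suggestion 2: hash partitioning on hid1 (the partition count is illustrative)
ALTER TABLE result
  PARTITION BY HASH (hid1) PARTITIONS 32;
```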
The simplest way to optimize a query like that is to use an index. A simple statement like
alter table results add index(hid1)
would improve the query you sent. Even better, if you want to search by both fields at once, you can use both fields in the index:
alter table results add index(hid1, hid2)
That way, MySQL can access results in a very organized way, and find the information you want.
If you run an explain on the first query, you might see something like
| select_type | table | type|possible_keys| rows |Extra
| SIMPLE | results| ALL | | 7765605| Using where
After adding the index, you should see
| select_type | table | type|possible_keys| rows |Extra
| SIMPLE | results| ref |hid1 | 2816304|
Which tells you that in the first case it needs to check ALL the rows, while in the second case it can find the information using a ref lookup.
If you know the combination of hid1 and hid2 is unique, you should consider making that your primary key. That will automatically also add an index to hid1. See: http://dev.mysql.com/doc/refman/5.5/en/multiple-column-indexes.html
Also, check the output of EXPLAIN. See: http://dev.mysql.com/doc/refman/5.5/en/select-optimization.html and related links.

how can I optimize a query with multiple joins (already have indexes)?

SELECT citing.article_id as citing, lac_a.year, r.id_when_cited, cited_issue.country, citing.num_citations
FROM isi_lac_authored_articles as lac_a
JOIN isi_articles citing ON (lac_a.article_id = citing.article_id)
JOIN isi_citation_references r ON (citing.article_id = r.article_id)
JOIN isi_articles cited ON (cited.id_when_cited = r.id_when_cited)
JOIN isi_issues cited_issue ON (cited.issue_id = cited_issue.issue_id);
I have indexes on all the fields being JOINED on.
Is there anything I can do? My tables are large (some 1 Million records, the references tables has 500 million records, the articles table has 25 Million).
This is what EXPLAIN has to say:
+----+-------------+-------------+--------+--------------------------------------------------------------------------+---------------------------------------+---------+-------------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+--------+--------------------------------------------------------------------------+---------------------------------------+---------+-------------------------------+---------+-------------+
| 1 | SIMPLE | cited_issue | ALL | NULL | NULL | NULL | NULL | 1156856 | |
| 1 | SIMPLE | cited | ref | isi_articles_id_when_cited,isi_articles_issue_id | isi_articles_issue_id | 49 | func | 19 | Using where |
| 1 | SIMPLE | r | ref | isi_citation_references_article_id,isi_citation_references_id_when_cited | isi_citation_references_id_when_cited | 17 | mimir_dev.cited.id_when_cited | 4 | Using where |
| 1 | SIMPLE | lac_a | eq_ref | PRIMARY | PRIMARY | 16 | mimir_dev.r.article_id | 1 | |
| 1 | SIMPLE | citing | eq_ref | PRIMARY | PRIMARY | 16 | mimir_dev.r.article_id | 1 | |
+----+-------------+-------------+--------+--------------------------------------------------------------------------+---------------------------------------+---------+-------------------------------+---------+-------------+
5 rows in set (0.07 sec)
If you really need all the returned data, I would suggest two things:
You probably know the data better than MySQL, and you can try to take advantage of that if MySQL is not correct in its assumptions. Currently, MySQL thinks that it is easier to full-scan the whole isi_issues table at the beginning; if the result really is going to include all issues, that assumption is correct. But if there are many issues that should not be in the result, you may want to force another join order that you consider more correct. It is you who knows which table applies the strongest restrictions and which are the smallest to full-scan (you will need to full-scan something anyway, since there is no WHERE clause).
You can benefit from covering indexes (that is, indexes that contain enough data in themselves that the row data never needs to be touched). For example, an index (article_id, num_citations) on isi_articles, (article_id, year) on isi_lac_authored_articles, and even (country) on isi_issues will significantly speed up this query as long as the indexes fit in memory; on the other hand, they will make your indexes larger and slightly slow down inserts into the table.
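As DDL, the covering indexes would look something like this (the index names are invented, and the isi_issues index also includes the join column issue_id so that the lookup itself is covered):

```sql
ALTER TABLE isi_articles
  ADD INDEX idx_cover_citations (article_id, num_citations);

ALTER TABLE isi_lac_authored_articles
  ADD INDEX idx_cover_year (article_id, year);

ALTER TABLE isi_issues
  ADD INDEX idx_cover_country (issue_id, country);
```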
I think it's about the best you can do; at least it's not using nested/multiple queries. You should do a little benchmarking on the SQL, and you could at least limit your result set as much as possible. 15-30 rows for a return set is pretty reasonable per page (this depends on the app, but 15-30 is my tolerance range).
I believe MySQL clients (phpMyAdmin, the console, any GUI) return some sort of "execution time", which is the time it took to process the query. Compare that with a benchmark of the query run from your server-side code, and then compare that with the query run from the server-side code with your app's interface output included.
That way you can see where your bottleneck is, and that is where you optimize.
Unless the result of your query is input to some other query or system, it is useless to return that many (3M) rows. It would be wiser to return just an acceptable number of rows per query (say 1,000) for visualization.
Looking at your SQL - the lack of a WHERE clause means it is pulling all rows from:
JOIN isi_issues cited_issue ON (cited.issue_id = cited_issue.issue_id)
You could look at partitioning the large isi_issues table, this would allow MySQL to perform a bit quicker (smaller files are easier to handle)
Or alternatively you can loop the statement and use a LIMIT clause:
LIMIT 0, 100000
then
LIMIT 100000, 100000
(the first argument is the offset and the second is the row count, so the second batch starts where the first left off). This will let the statements run quicker, and you can deal with the data in batches.
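One caveat on large offsets: MySQL still reads and throws away all the skipped rows, so each successive batch gets slower. If the driving table has a sequential id, keyset pagination avoids this; a sketch with generic placeholder names (some_table, payload, @last_seen_id):

```sql
-- Instead of LIMIT 100000, 100000 (which scans and discards 100000 rows):
SELECT id, payload
FROM some_table
WHERE id > @last_seen_id   -- the highest id returned by the previous batch
ORDER BY id
LIMIT 100000;
```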