So I've been searching for a solution and reading books, and havent been able to figure it out, the question is rather simple, I have 2 tables. On one table I have 2 fields:
table_1:"chromosome" and "position" both of the being integers.
table_2:"chromosome" "start" and "end", all being integers as well.
I want a query that gives me back all rows from table_1 that are between the start and end of table_2. The query looks like this:
SELECT
table_1 . *
FROM
table_1,
table_2
WHERE
table_1.chromosome = table_2.chromosome
AND table_1.position > table_2.start
AND table_1.position < table_1.end;
So this query works fine, but my tables are many millions of rows (7092713) and (215909) respectvely. I indexed chromosome, pos and chromosome, start, end. The weird part is that if I do the query one by one (perl DBI, do one statement for every row of table_2), this runs a lot faster. Not sure where am I screwing up.
Any help would be appreciated.
Jorge Kageyama
For the sake of clarity, let's start by recasting your query using the standard JOIN syntax. The query is equivalent but easier to read.
SELECT table_1 . *
FROM table_1
JOIN table_2 ON ( table_1.chromosome = table_2.chromosome
AND table_1.position > table_2.start
AND table_1.position < table_1.end)
Second, it's smart when searching large tables (or any tables for that matter) to avoid * in your SELECT clauses. Using * denies useful data to the optimizer about what you do, or don't, need in your result set. So let us say
SELECT table_1.chromosome, table_1.position
for SELECT.
So, it becomes clear that your result set, and your join, need chromosome and position, and nothing else, from your larger table. Try creating a compound BTREE index on that table, as follows.
CREATE INDEX ON table_1(chromosome,position) USING BTREE
Similarly, try creating an index on table_2 as follows.
CREATE INDEX ON table_2(chromosome,start, end) USING BTREE
These are called covering indexes. They contain enough columns that the query can be satisfied from the index without having to bounce back to the original table.
BTREE indexes (the default by the way) are inherently ordered. Appropriate records in table_1 can be found by range scans on the index starting with (chromosome,start) and ending with (chromosome,end).
Third, it's possible you're getting a massive combinatorial explosion of rows from table_1 in your result set. You'll get a row for every combination of rows in the two tables that matches your ON() clause. It's hard to know whether that's the case without knowing a lot about your data.
You could try to reduce that combinatorial explosion using
SELECT DISTINCT table_1.chromosome, table_1.position
Give this a try. If you're still not getting anywhere, maybe another question with complete table definitions and the results of EXPLAIN will be helpful.
Interesting question. Without knowing more about the quantities contained in "position," I would still approach it generally in this way:
Select for position generally from table_1 (with 7.0mm entities) so that the resulting table is a bin of a smaller amount of data. Let's say, for instance, that the "position" quantity is a set of discrete integers from 2-9. Select from table_1 where position is equal to 2, then select from table_2 where "start" is less than 2 and "end" is greater than 2. Iterate over this query selection 8 times updating a new table_3 with results.
I am assuming here that table_2 is unique on chromosome, and table_1 is not. Therefore, you end up with chromosomes that could have multiple positions within the same range (a chromosome has one range, but can appear anywhere within that range). You also, then, can't tell how large the resulting join table is going to be, but it could be quite large as each of the 7mm entities in table_1 could be within all ranges in table_2.
Iterating would allow you to "grow" your results while observing the quality at each point experimentally before committing to the entire loop.
Here is an idea of the query I have in mind (untested):
SELECT table_1.chromosome, table_1.position, table_2.start, table_2.end
FROM
(SELECT table_1.chromosome, table_1.position
from table_1 where table_1.position = 2)
JOIN
(SELECT table_2.chromosome, table_2.start, table_2.end
from table_2 where table_2.start < 2 AND table_2.end > 2)
ON
table_1.chromosome = table_2.chromosome
Good luck, and I hope you find your answer!
Related
It takes around 5 seconds to get result of query from a table consisting 1.5 million row. Query is "select * from table where code=x"
Is there a setting to increase speed ? Or should I jump to another database apart from MySQL ?
You could index the code column. Note that the trade off is that inserting new rows or updating the code column on existing rows will be slowed down a bit since the index also needs to be updated. In any event, you should benchmark the improvement to make sure it's worth it.
WHERE code=x -- needs INDEX(code)
SELECT * when many of the columns are bulky: Large columns are stored "off-record". Hence they take longer to fetch. So, explicitly list the columns you really need, hoping to leave out some of the bulky columns.
When a GROUP BY or LIMIT is involved, it is sometimes best to do
SELECT lots of columns
FROM ( SELECT id FROM t WHERE ... group-by or limit ) AS x
JOIN t AS y USING(id)
etc.
That is, start by finding just the ids as simply as possible, then JOIN back to the original table and other table(s). (This is not the case you presented, but I worry that you over-simplified it.)
I was running a query of this kind of query:
SELECT
-- fields
FROM
table1 JOIN table2 ON (table1.c1 = table.c1 OR table1.c2 = table2.c2)
WHERE
-- conditions
But the OR made it very slow so i split it into 2 queries:
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table.c1
WHERE
-- conditions
UNION
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table.c2
WHERE
-- conditions
Which works much better but now i am going though the tables twice so i was wondering if there was any further optimizations for instance getting set of entries that satisfies the condition (table1.c1 = table.c1 OR table1.c2 = table2.c2) and then query on it. That would bring me back to the first thing i was doing but maybe there is another solution i don't have in mind. So is there anything more to do with it or is it already optimal?
Splitting the query into two separate ones is usually better in MySQL since it rarely uses "Index OR" operation (Index Merge in MySQL lingo).
There are few items I would concentrate for further optimization, all related to indexing:
1. Filter the rows faster
The predicate in the WHERE clause should be optimized to retrieve the fewer number of rows. And, they should be analized in terms of selectivity to create indexes that can produce the data with the fewest filtering as possible (less reads).
2. Join access
Retrieving related rows should be optimized as well. According to selectivity you need to decide which table is more selective and use it as a driving table, and consider the other one as the nested loop table. Now, for the latter, you should create an index that will retrieve rows in an optimal way.
3. Covering Indexes
Last but not least, if your query is still slow, there's one more thing you can do: use covering indexes. That is, expand your indexes to include all the rows from the driving and/or secondary tables in them. This way the InnoDB engine won't need to read two indexes per table, but a single one.
Test
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION ALL
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
/* add one more condition which eliminates the rows selected by 1st subquery */
AND table1.c1 != table2.c1
Copied from the comments:
Nico Haase > What do you mean by "test"?
OP shows query patterns only. So I cannot predict does the technique is effective or not, and I suggest OP to test my variant on his structure and data array.
Nico Haase > what you've changed
I have added one more condition to 2nd subquery - see added comment in the code.
Nico Haase > and why?
This replaces UNION DISTINCT with UNION ALL and eliminates combined rowset sorting for duplicates remove.
I've got a table of the following schema:
+----+--------+----------------------------+----------------------------+
| id | amount | created_timestamp | updated_timestamp |
+----+--------+----------------------------+----------------------------+
| 1 | 1.00 | 2018-01-09 12:42:38.973222 | 2018-01-09 12:42:38.973222 |
+----+--------+----------------------------+----------------------------+
Here, for id = 1, there could be multiple amount entries. I want to extract the last added entry and its corresponding amount, grouped by id.
I've written a working query with an inner join on the self table as below:
SELECT t1.id,
t1.amount,
t1.created_timestamp,
t1.updated_timestamp
FROM transactions AS t1
INNER JOIN (SELECT id,
Max(updated_timestamp) AS last_transaction_time
FROM transactions
GROUP BY id) AS latest_transactions
ON latest_transactions.id = t1.id
AND latest_transactions.last_transaction_time =
t1.updated_timestamp;
I think inner join is an overkill and this can be replaced with a more optimized/efficient query. I've written the following query with where, group by, and having but it isn't working. Can anyone help?
select id, any_value(`updated_timestamp`), any_value(amount) from transactions group by `id` having max(`updated_timestamp`);
There are two (good) options when performing a query like this in MySQL. You have already tried one option. Here is the other:
SELECT t1.id,
t1.amount,
t1.created_timestamp,
t1.updated_timestamp
FROM transactions AS t1
LEFT OUTER JOIN transactions later_transactions
ON later_transactions.id = t1.id
AND later_transactions.last_transaction_time > t1.updated_timestamp
WHERE later_transactions.id IS NULL
These methods are the ones in the documentation, and also the ones I use in my work basically every day. Which one is most efficient depends on a variety of factors, but usually, if one is slow the other will be fast.
Also, as Strawberry points out in the comments, you need a composite index on (id,updated_timestamp). Have separate indexes for id and updated_timestamp is not equivalent.
Why a composite index?
Be aware that an index is just a copy of the data that is in the table. In many respects, it works the same as a table does. So, creating an index is creating a copy of the table's data that the RDBMS can use to query the table's information in a more efficient manner.
An index on just updated_timestamp will create a copy of the data that contains updated_timestamp as the first column, and that data will be sorted. It will also include a hidden row ID value (that will work as a primary key) in each of those index rows, so that it can use that to look up the full rows in the actual table.
How does that help in this query (either version)? If we wanted just the latest (or earliest) updated_timestamp overall, it would help, since it can just check the first or last record in the index. But since we want the latest for each id, this index is useless.
What about just an index on id. Here we have a copy of the id column, sorted by the id column, with the row ID attached to each row in the index.
How does this help this query? It doesn't, because it doesn't even have the updated_timestamp column as part of the index, and so won't even consider using this index.
Now, consider a composite index: (id,updated_timestamp).
This creates a copy of the data with the id column first, sorted, and then the second column updated_timestamp is also included, and it is also sorted within each id.
This is the same way that a phone book (if people still use those things as something more than paperweights) is sorted by last name and then first name.
Because the rows are sorted in this way, MySQL can look, for each id, at just the last record of a given id. It knows that that record contains the highest updated_timestamp value, because of how the index is defined.
So, it only has to look up one row for each id that exists. That is fast. Further explanation into why would take up a lot more space, but you can research it yourself if you like, by just looking into B-Trees. Suffice to say, finding the first (or last) record is easy.
Try the following:
ALTER TABLE transactions
ADD INDEX `LatestTransaction` (`id`,`updated_timestamp`)
And then see whether your original query or my alternate query is faster. Likely both will be faster than having no index. As your table grows, or your select statement changes it may affect which of these queries is faster, but the index is going to provide the biggest performance boost, regardless of which version of the query you use.
I'm facing an issue with an SQL Query. I'm developing a php website, and to avoid making too much queries, I prefer to make a big one looking like :
select m.*, cj.*, cjb.*, me.pseudo as pseudo_acheteur
from mercato m
JOIN cartes_joueur cj
ON m.ID_carte = cj.ID_carte_joueur
JOIN cartes_joueur_base cjb
ON cj.ID_carte_joueur_base = cjb.ID_carte_joueur_base
JOIN membres me
ON me.ID_membre = cj.ID_membre
where not exists (select * from mercato_encheres me where me.ID_mercato = m.ID_mercato)
and cj.ID_membre = 2
and m.status <> 'cancelled'
ORDER BY total_carac desc, cj.level desc, cjb.nom_carte asc
This should return all cards sold by the member without any bet on it. In the result, I need all the information to display them.
Here is the approximate rows in each table :
mercato : 1200
cartes_joueur : 800 000
carte_joueur_base : 62
membres : 2000
mercato_enchere : 15 000
I tried to reduce them (in dev environment) by deleting old data; but the query still needs 10~15 seconds to execute (which is way too long on a website )
Thanks for your help.
Let's take a look.
The use of * in SELECT clauses is harmful to query performance. Why? It's wasteful. It needlessly adds to the volume of data the server must process, and in the case of JOINs, can force the processing of columns with duplicate values. If you possibly can do so, try to enumerate the columns you need.
You may not have useful indexes on your tables for accelerating this. We can't tell. Please notice that MySQL can't exploit multiple indexes in a single query, so to make a query fast you often need a well-chosen compound index. I suggest you try defining the index (ID_membre, ID_carte_jouer, ID_carte_joueur_base) on your cartes_joueur table. Why? Your query matches for equality on the first of those columns, and then uses the second and third column in ON conditions.
I have often found that writing a query with the largest table (most rows) first helps me think clearly about optimizing. In your case your largest table is cartes_jouer and you are choosing just one ID_membre value from that table. Your clearest path to optimization is the knowledge that you only need to examine approximately 400 rows from that table, not 800 000. An appropriate compound index will make that possible, and it's easiest to imagine that index's columns if the table comes first in your query.
You have a correlated subquery -- this one.
where not exists (select *
from mercato_encheres me
where me.ID_mercato = m.ID_mercato)
MySQL's query planner can be stupidly literal-minded when it sees this, running it thousands of times. In your case it's even worse: it's got SELECT * in it: see point 1 above.
It should be refactored to use the LEFT JOIN ... IS NULL pattern. Here's how that goes.
select whatever
from mercato m
JOIN ...
JOIN ...
LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
WHERE mench.ID_mercato IS NULL
and ...
ORDER BY ...
Explanation: The use of LEFT JOIN rather than ordinary inner JOIN allows rows from the mercato table to be preserved in the output even when the ON condition does not match them to tables in the mercato_encheres table. The mismatching rows get NULL values for the second table. The mench.ID_mercato IS NULL condition in the WHERE clause then selects only the mismatching rows.
SELECT DISTINCT `Stock`.`ProductNumber`,`Stock`.`Description`,`TComponent_Status`.`component`, `TComponent_Status`.`certificate`,`TComponent_Status`.`status`,`TComponent_Status`.`date_created`
FROM Stock , TBOM , TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically table TBOM HAS :
24,588,820 rows
The query is ridiculously slow, i'm not too sure what i can do to make it better. I have indexed all the other tables in the query but TBOM has a few duplicates in the columns so i can't even run that command. I'm a little baffled.
To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but it is a good start.
Also, remove the DISTINCT if you don't absolutely need it.
The only thing I can really think of is having an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields in the query, the engine wont be required to go to the full data row of each stock record since both parts are in the index, it can use that. Additionally, you are doing DISTINCT, so having the index available to help optimize the DISTINCTness, should also help.
Now, the other issue for time. Since you are doing a distinct from stock to product to product status, you are asking for all 24 million TBOM items (assume bill of materials), and each BOM component could have multiple status created, you are getting every BOM for EVERY component changed.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
Stock.ProductNumber,
Stock.Description,
JustThese.component,
JustThese.certificate,
JustThese.`status`,
JustThese.date_created
FROM
( select DISTINCT
TCS.Component,
TCS.Certificate,
TCS.`staus`,
TCS.date_created
from
TComponent_Status TCS
where
TCS.date_created >= 'some date you want to limit based upon' ) as JustThese
JOIN TBOM
on JustThese.Component = TBOM.Component
JOIN Stock
on TBOM.Product = Stock.Product
If this is a case, I would ensure an index on the component status table, something like
( date_created, component, certificate, status, date_created ) as the index. This way, the WHERE clause would be optimized, and distinct would be too since pieces already part of the index.
But, how you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 or 1,000 entries in your result set. Take this and span 24 million, and its definitely not going to look good.