I have what may be a basic performance question. I've done a lot of SQL queries, but not much in terms of complex inner joins and such. So, here it is:
I have a database with 4 tables, countries, territories, employees, and transactions.
The transactions links up with the employees and countries. The employees links up with the territories. In order to produce a required report, I'm running a PHP script that processes a SQL query against a mySQL database.
SELECT trans.transactionDate, agent.code, agent.type, trans.transactionAmount, agent.territory
FROM transactionTable as trans
INNER JOIN
(
SELECT agent1.code as code, agent1.type as type, territory.territory as territory FROM agentTable as agent1
INNER JOIN territoryTable as territory
ON agent1.zip=territory.zip
) AS agent
ON agent.code=trans.agent
ORDER BY trans.agent
There are about 50,000 records in the agent table, and over 200,000 in the transaction table. The other two are relatively tiny. It's taking about 7 minutes to run this query. And I haven't even inserted the fourth table yet, which needs to relate a field in the transactionTable (country) to a field in the countryTable (country) and return a field in the countryTable (region).
So, two questions:
Where would I logically put the connection between the transactionTable and the countryTable?
Can anyone suggest a way that this can be quickened up?
Thanks.
Your query should be equivalent to this:
SELECT tx.transactionDate,
a.code,
a.type,
tx.transactionAmount,
t.territory
FROM transactionTable tx,
agentTable a,
territoryTable t
WHERE tx.agent = a.code
AND a.zip = t.zip
ORDER BY tx.agent
or to this if you like to use JOIN:
SELECT tx.transactionDate,
a.code,
a.type,
tx.transactionAmount,
t.territory
FROM transactionTable tx
JOIN agentTable a ON tx.agent = a.code
JOIN territoryTable t ON a.zip = t.zip
ORDER BY tx.agent
In order to work fast, you must have following indexes on your tables:
CREATE INDEX transactionTable_agent ON transactionTable(agent);
CREATE INDEX territoryTable_zip ON territoryTable(zip);
CREATE INDEX agentTable_code ON agentTable(code);
(basically any field that is part of WHERE or JOIN constraint should be indexed).
That said, your table structure looks suspicious in a sense that it is joined by apparently non-unique fields like zip code. You really want to join by more unique entities, like agent id, transaction id and so on - otherwise expect your queries to generate a lot of redundant data and be really slow.
One more note: INNER JOIN is equivalent to simply JOIN, there is no reason to type redundant clause.
Related
I'm writing a query in mysql to join two tables. And both tables have more than 50,000 records.
Table EMP Columns
empid,
project,
code,
Status
Table EMPINFO
empid,
project,
code,
projecttype,
timespent,
skills
In each table there is candidate key [empid, project, code]
So when I join the table using INNER join
like this INNER JOIN
ON a.empid = b.empid
and a.project = b.project
and a.code = b.code
I'm getting the result, but if I add count(*) in outer query to count number of records, it takes lot of time something connection gets failed.
Is there any way to speed up to get number of records ?
And I would like to hear more suggestions to speed up inner join query as well having same candidate key in both tables.
INDEX(empid, project, code) -- in any order.
Are these tables 1:1? If so, why do the JOIN in order to do the COUNT?
Please provide SHOW CREATE TABLE. (If there are datatype differences, this could be a big problem.)
Please provide the actual SELECT.
How much RAM do you have? Please provide SHOW VARIABLES LIKE '%buffer%';.
I'm reasonably new to MySQL and I'm trying to select a distinct set of rows using this statement:
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4);
However, the select statement is taking around 10 minutes, so something is clearly afoot.
One significant factor is that the table gtfsstop_times is huge. (~250 million records)
Indexes seem to be set up properly; all the above joins are using indexed columns. Table sizes are, roughly:
gtfsagencys - 4 rows
gtfsroutes - 56,000 rows
gtfstrips - 5,500,000 rows
gtfsstop_times - 250,000,000 rows
`transportdata`.stoppoints - 400,000 rows
The server has 22Gb of memory, I've set the InnoDB buffer pool to 8G and I'm using MySQL 5.6.
Can anybody see a way of making this run faster? Or indeed, at all!
Does it matter that the stoppoints table is in a different schema?
EDIT:
EXPLAIN SELECT... returns this:
It looks like you are trying to find a collection of stop points, based on certain criteria. And, you're using SELECT DISTINCT to avoid duplicate stop points. Is that right?
It looks like atcoCode is a unique key for your stoppoints table. Is that right?
If so, try this:
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints` AS sp
JOIN (
SELECT DISTINCT st.fk_atco_code AS atcoCode
FROM `vehicledata`.gtfsroutes AS route
JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
WHERE route.agency_id BETWEEN 1 AND 4
) ids ON sp.atcoCode = ids.atcoCode
This does a few things: It eliminates a table (agency) which you don't seem to need. It changes the search on agency_id from IN(a,b,c) to a range search, which may or may not help. And finally it relocates the DISTINCT processing from a situation where it has to handle a whole ton of data to a subquery situation where it only has to handle the ID values.
(JOIN and INNER JOIN are the same. I used JOIN to make the query a bit easier to read.)
This should speed you up a bit. But, it has to be said, a quarter gigarow table is a big table.
Having 250M records, I would shard the gtfsstop_times table on one column. Then each sharded table can be joined in a separate query that can run parallel in separate threads, you'll only need to merge the result sets.
The trick is to reduce how many rows of gtfsstop_times SQL has to evaluate. In this case SQL first evaluates every row in the inner join of gtfsstop_times and transportdata.stoppoints, right? How many rows does transportdata.stoppoints have? Then SQL evaluates the WHERE clause, then it evaluates DISTINCT. How does it do DISTINCT? By looking at every single row multiple times to determine if there are other rows like it. That would take forever, right?
However, GROUP BY quickly squishes all the matching rows together, without evaluating each one. I normally use joins to quickly reduce the number of rows the query needs to evaluate, then I look at my grouping.
In this case you want to replace DISTINCT with grouping.
Try this;
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4)
GROUP BY sp.name
, sp.longitude
, sp.latitude
, sp.atcoCode
There other valuable answers to your question and mine is an addition to it. I assume sp.atcoCode and st.fk_atco_code are indexed columns in their table.
If you can validate and make sure that agency ids in the WHERE clause are valid, you can eliminate joining `vehicledata.gtfsagencys` in the JOINS as you are not fetching any records from the table.
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
WHERE route.agency_id IN (1,2,3,4);
I have following tables:
products - 4500 records
Fields: id, sku, name, alias, price, special_price, quantity, desc, photo, manufacturer_id, model_id, hits, publishing
products_attribute_rel - 35000 records
Fields: id, product_id, attribute_id, attribute_val_id
attribute_values - 243 records
Fields: id, attr_id, value, ordering
manufacturers - 29 records
Fields: id, title,publishing
models - 946 records
Fields: id, manufacturer_id, title, publishing
So I get data from these tables by one query:
SELECT jp.*,
jm.id AS jm_id,
jm.title AS jm_title,
jmo.id AS jmo_id,
jmo.title AS jmo_title
FROM `products` AS jp
LEFT JOIN `products_attribute_rel` AS jpar ON jpar.product_id = jp.id
LEFT JOIN `attribute_values` AS jav ON jav.attr_id = jpar.attribute_val_id
LEFT JOIN `manufacturers` AS jm ON jm.id = jp.manufacturer_id
LEFT JOIN `models` AS jmo ON jmo.id = jp.model_id
GROUP BY jp.id HAVING COUNT(DISTINCT jpar.attribute_val_id) >= 0
This query is slow as hell. It takes hundreds of seconds mysql to handle it.
So how it would be possible to improve this query ? With small data chunks it works
perfectly well. But I guess everything ruins products_attribute_rel table, which
has 35000 records.
Your help would be appreciated.
EDITED
EXPLAIN results of the SELECT query:
The problem is that MySQL uses the join-type ALL for 3 tables. That means that MySQL performs 3 full table scans, puts every possibility together before sorting those out that don't match the ON statement. To get a much faster join-type (for instance eq_ref), you must put an index on the coloumns that are used on the ON statements.
Be aware though that putting an index on every possible coloumn is not recommended. A lot of indexes do speed up SELECT statements, however it also creates an overhead since the index must be stored and managed. This means that manipulation queries like UPDATE and DELETE are much slower. I've seen queries deleting only 1000 records in half an hour. It's a trade-off where you have to decide what happens more often and what is more important.
To get more infos on MySQL join-types, take a look at this.
More on indexes here.
Tables data is not so much huge that it's taking hundreds of seconds. Something is wrong with table schema. Please do proper indexing. That will surly speed up.
select distinct
jm.id AS jm_id,
jm.title AS jm_title,
jmo.id AS jmo_id,
jmo.title AS jmo_title
from products jp,
products_attribute_rel jpar,
attribute_values jav,
manufacturers jm
models jmo
where jpar.product_id = jp.id
and jav.attr_id = jpar.attribute_val_id
and jm.id = jp.manufacturer_id
and jmo.id = jp.model_id
you can do that if you want to select all the data. Hope it works.
I have a table 'Clients' and a sub-table 'Orders'.
For a certain view I need to display the last order for each client.
Since you cannot use LIMIT in a join, I first used a complex solution with a LEFT JOIN, GROUP_CONCAT and SUBSTRING_INDEX to get the last order, but this is quite slow, since there are millions of records.
Then I thought of just storing the last OrderID in the Clients table, that is updated by a trigger each time the Orders table changes. Then I just do a LEFT JOIN to Orders on this field LastOrderID.
Would an index on the field LastOrderID be of any use in this situation? Or wouldn't it be used since the source table is always Clients, so there is no sorting, searching, etc. done on this field ?
The reason I'm asking is that in reality it's a little bit more complex, I might actually need about 20 of these kind of fields.
update:
My query now is :
SELECT *
FROM Clients AS c
LEFT JOIN Orders AS o ON o.OrderID=c.LastOrderID
Would an index on LastOrderID in Clients improve speed, or is it not neccessary?
First of all, do you have an index on the Client foreign key within the Order table?
Doing this alone should increase performance quite considerably.
Perhaps your SQL is wrong?
This is standard SQL: you'll need a single two column index on (ClientID, OrderID) on the Orders table for this which will speed up the aggregate and self join
SELECT
...
FROM
(
SELECT MAX(OrderID) AS LastOrderID, ClientID
FROM Orders
GROUP BY ClientID
) o2
JOIN
Orders o ON o2.LastOrderID = o.ClientID AND o2.OrderID = o.ClientID
JOIN
Clients c PN o.ClientID = c.ClientID
I have the following query:
SELECT c.*
FROM companies AS c
JOIN users AS u USING(companyid)
JOIN jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123;
I have the following questions:
Is the USING syntax synonymous with ON syntax?
Are these joins evaluated left to right? In other words, does this query say: x = companies JOIN users; y = x JOIN jobs; z = y JOIN useraccounts;
If the answer to question 2 is yes, is it safe to assume that the companies table has companyid, userid and jobid columns?
I don't understand how the WHERE clause can be used to pick rows on the companies table when it is referring to the alias "j"
Any help would be appreciated!
USING (fieldname) is a shorthand way of saying ON table1.fieldname = table2.fieldname.
SQL doesn't define the 'order' in which JOINS are done because it is not the nature of the language. Obviously an order has to be specified in the statement, but an INNER JOIN can be considered commutative: you can list them in any order and you will get the same results.
That said, when constructing a SELECT ... JOIN, particularly one that includes LEFT JOINs, I've found it makes sense to regard the third JOIN as joining the new table to the results of the first JOIN, the fourth JOIN as joining the results of the second JOIN, and so on.
More rarely, the specified order can influence the behaviour of the query optimizer, due to the way it influences the heuristics.
No. The way the query is assembled, it requires that companies and users both have a companyid, jobs has a userid and a jobid and useraccounts has a userid. However, only one of companies or user needs a userid for the JOIN to work.
The WHERE clause is filtering the whole result -- i.e. all JOINed columns -- using a column provided by the jobs table.
I can't answer the bit about the USING syntax. That's weird. I've never seen it before, having always used an ON clause instead.
But what I can tell you is that the order of JOIN operations is determined dynamically by the query optimizer when it constructs its query plan, based on a system of optimization heuristics, some of which are:
Is the JOIN performed on a primary key field? If so, this gets high priority in the query plan.
Is the JOIN performed on a foreign key field? This also gets high priority.
Does an index exist on the joined field? If so, bump the priority.
Is a JOIN operation performed on a field in WHERE clause? Can the WHERE clause expression be evaluated by examining the index (rather than by performing a table scan)? This is a major optimization opportunity, so it gets a major priority bump.
What is the cardinality of the joined column? Columns with high cardinality give the optimizer more opportunities to discriminate against false matches (those that don't satisfy the WHERE clause or the ON clause), so high-cardinality joins are usually processed before low-cardinality joins.
How many actual rows are in the joined table? Joining against a table with only 100 values is going to create less of a data explosion than joining against a table with ten million rows.
Anyhow... the point is... there are a LOT of variables that go into the query execution plan. If you want to see how MySQL optimizes its queries, use the EXPLAIN syntax.
And here's a good article to read:
http://www.informit.com/articles/article.aspx?p=377652
ON EDIT:
To answer your 4th question: You aren't querying the "companies" table. You're querying the joined cross-product of ALL four tables in your FROM and USING clauses.
The "j.jobid" alias is just the fully-qualified name of one of the columns in that joined collection of tables.
In MySQL, it's often interesting to ask the query optimizer what it plans to do, with:
EXPLAIN SELECT [...]
See "7.2.1 Optimizing Queries with EXPLAIN"
Here is a more detailed answer on JOIN precedence. In your case, the JOINs are all commutative. Let's try one where they aren't.
Build schema:
CREATE TABLE users (
name text
);
CREATE TABLE orders (
order_id text,
user_name text
);
CREATE TABLE shipments (
order_id text,
fulfiller text
);
Add data:
INSERT INTO users VALUES ('Bob'), ('Mary');
INSERT INTO orders VALUES ('order1', 'Bob');
INSERT INTO shipments VALUES ('order1', 'Fulfilling Mary');
Run query:
SELECT *
FROM users
LEFT OUTER JOIN orders
ON orders.user_name = users.name
JOIN shipments
ON shipments.order_id = orders.order_id
Result:
Only the Bob row is returned
Analysis:
In this query the LEFT OUTER JOIN was evaluated first and the JOIN was evaluated on the composite result of the LEFT OUTER JOIN.
Second query:
SELECT *
FROM users
LEFT OUTER JOIN (
orders
JOIN shipments
ON shipments.order_id = orders.order_id)
ON orders.user_name = users.name
Result:
One row for Bob (with the fulfillment data) and one row for Mary with NULLs for fulfillment data.
Analysis:
The parenthesis changed the evaluation order.
Further MySQL documentation is at https://dev.mysql.com/doc/refman/5.5/en/nested-join-optimization.html
SEE http://dev.mysql.com/doc/refman/5.0/en/join.html
AND start reading here:
Join Processing Changes in MySQL 5.0.12
Beginning with MySQL 5.0.12, natural joins and joins with USING, including outer join variants, are processed according to the SQL:2003 standard. The goal was to align the syntax and semantics of MySQL with respect to NATURAL JOIN and JOIN ... USING according to SQL:2003. However, these changes in join processing can result in different output columns for some joins. Also, some queries that appeared to work correctly in older versions must be rewritten to comply with the standard.
These changes have five main aspects:
The way that MySQL determines the result columns of NATURAL or USING join operations (and thus the result of the entire FROM clause).
Expansion of SELECT * and SELECT tbl_name.* into a list of selected columns.
Resolution of column names in NATURAL or USING joins.
Transformation of NATURAL or USING joins into JOIN ... ON.
Resolution of column names in the ON condition of a JOIN ... ON.
Im not sure about the ON vs USING part (though this website says they are the same)
As for the ordering question, its entirely implementation (and probably query) specific. MYSQL most likely picks an order when compiling the request. If you do want to enforce a particular order you would have to 'nest' your queries:
SELECT c.*
FROM companies AS c
JOIN (SELECT * FROM users AS u
JOIN (SELECT * FROM jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123)
)
as for part 4: the where clause limits what rows from the jobs table are eligible to be JOINed on. So if there are rows which would join due to the matching userids but don't have the correct jobid then they will be omitted.
1) Using is not exactly the same as on, but it is short hand where both tables have a column with the same name you are joining on... see: http://www.java2s.com/Tutorial/MySQL/0100__Table-Join/ThekeywordUSINGcanbeusedasareplacementfortheONkeywordduringthetableJoins.htm
It is more difficult to read in my opinion, so I'd go spelling out the joins.
3) It is not clear from this query, but I would guess it does not.
2) Assuming you are joining through the other tables (not all directly on companyies) the order in this query does matter... see comparisons below:
Origional:
SELECT c.*
FROM companies AS c
JOIN users AS u USING(companyid)
JOIN jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123
What I think it is likely suggesting:
SELECT c.*
FROM companies AS c
JOIN users AS u on u.companyid = c.companyid
JOIN jobs AS j on j.userid = u.userid
JOIN useraccounts AS us on us.userid = u.userid
WHERE j.jobid = 123
You could switch you lines joining jobs & usersaccounts here.
What it would look like if everything joined on company:
SELECT c.*
FROM companies AS c
JOIN users AS u on u.companyid = c.companyid
JOIN jobs AS j on j.userid = c.userid
JOIN useraccounts AS us on us.userid = c.userid
WHERE j.jobid = 123
This doesn't really make logical sense... unless each user has their own company.
4.) The magic of sql is that you can only show certain columns but all of them are their for sorting and filtering...
if you returned
SELECT c.*, j.jobid....
you could clearly see what it was filtering on, but the database server doesn't care if you output a row or not for filtering.