sql (mysql) optimize query oder schema, avoiding full table scan

sql (mysql) optimize query oder schema, avoiding full table scan - mysql

I have one table with data. The table has entries for orders, each order has some types (or state e.g ordered, ..., polished, packed, shipped).
Now, I want to do this query.
select * from orders as o
where not exists
(SELECT * from orders as oo
where
o.order = oo.order and
oo.type="SHIPMENT")
type and shipment have a index, but it is only used after doing a full scan. So the query takes far to long. I want to present the data directly.

Having an index on orders.type doesn't necessarily mean that this index will be used. In fact, the index is not used if it's not selective enough. Also, mysql works a little faster if you use NOT IN or LEFT JOIN/IS NULL approach instead of NOT EXISTS :
// LEFT JOIN/IS NULL:
SELECT o.*
FROM orders o
LEFT JOIN orders oo ON (oo.order = o.order AND oo.type="SHIPMENT")
WHERE oo.id IS NULL

Related

Mysql inner join vs in clause performance

I have a query to get data of friends of user. I have 3 tables, one is user table, second is a user_friend table which has user_id and friend_id (both are foreign key to user table) and 3rd table is feed table which has user_id and feed content. Feed can be shown to friends. I can query in two ways either by join or by using IN clause (I can get all the friends' ids by graph database which I am using for networking).
Here are two queries:
SELECT
a.*
FROM feed a
INNER JOIN user_friend b ON a.user_id = b.friend_id
WHERE b.user_id = 1;
In this query I get friend ids from graph database and will pass to this query:
SELECT
a.*
FROM feed a
WHERE a.user_id IN (2,3,4,5)
Which query runs faster and good for performance when I have millions of records?

With suitable indexes, a one-query JOIN (Choice 1) will almost always run faster than a 2-query (Choice 2) algorithm.
To optimize Choice 1, b needs this composite index: INDEX(user_id, friend_id). Also, a needs an index (presumably the PRIMARY KEY?) starting with user_id.

This depends on your desired result when you have a compared big data in your subquery their always a join is much preferred for such conditions. Because subqueries can be slower than LEFT [OUTER] JOINS / INNER JOIN [LEft JOIN is faster than INNER JOIN], but in my opinion, their strength is slightly higher readability.
So if your data have fewer data to compare then why you chose a complete table join so that depends on how much data you have.
In my opinion, if you have a less number of compared data in IN than it's good but if you have a subquery or big data then you must go for a join...

Query with multiple table joins taking too much time despite indexing

Query-
SELECT SUM(sale_data.total_sale) as totalsale, `sale_data_temp`.`customer_type_cy` as `customer_type`, `distributor_list`.`customer_status` FROM `distributor_list` LEFT JOIN `sale_data` ON `sale_data`.`depo_code` = `distributor_list`.`depo_code` and `sale_data`.`customer_code` = `distributor_list`.`customer_code` LEFT JOIN `sale_data_temp` ON `distributor_list`.`address_coordinates` = `sale_data_temp`.`address_coordinates` LEFT JOIN `item_master` ON `sale_data`.`item_code` = `item_master`.`item_code` WHERE `invoice_date` BETWEEN "2017-04-01" and "2017-11-01" AND `item_master`.`id_category` = 1 GROUP BY `distributor_list`.`address_coordinates`
Query, rewritten with formatting.
SELECT SUM(sale_data.total_sale) as totalsale,
sale_data_temp.customer_type_cy as customer_type,
distributor_list.customer_status
FROM distributor_list
LEFT JOIN sale_data
ON sale_data.depo_code = distributor_list.depo_code
and sale_data.customer_code = distributor_list.customer_code
LEFT JOIN sale_data_temp
ON distributor_list.address_coordinates = sale_data_temp.address_coordinates
LEFT JOIN item_master
ON sale_data.item_code = item_master.item_code
WHERE invoice_date BETWEEN "2017-04-01" and "2017-11-01"
AND item_master.id_category = 1
GROUP BY distributor_list.address_coordinates
DESC-
This Query is taking 7.5 seconds to run. My application contains 3-4 such queries. Therefore loading time appraches 1 min on server.
My sale data table contains 450K records.
Distributor list contains 970 records
Item master contains 7774 records and sale_data_temp contains 324 records.
I am using indexing but it is not being used for sale data table.
All the 400K records are searched as is evident from explain sql.
If I reduce the duration of BETWEEN clause than sale data table uses date index otherwise it scans all 400K rows.
The rows between 01-04-2017 and 01-11-2017 are 84000 but still it scans 400K rows.
MYSQL EXPLAIN-
I have modified queries two times with no success.
Modification 1:
SELECT SUM(sale_data.total_sale) as totalsale, `sale_data_temp`.`customer_type_cy` as `customer_type`, `distributor_list`.`customer_status` FROM `distributor_list` LEFT JOIN `sale_data` ON `sale_data`.`depo_code` = `distributor_list`.`depo_code` and `sale_data`.`customer_code` = `distributor_list`.`customer_code` AND `invoice_date` BETWEEN "2017-04-01" and "2017-11-01" LEFT JOIN `sale_data_temp` ON `distributor_list`.`address_coordinates` = `sale_data_temp`.`address_coordinates` LEFT JOIN `item_master` ON `sale_data`.`item_code` = `item_master`.`item_code` WHERE `item_master`.`id_category` = 1 GROUP BY `distributor_list`.`address_coordinates`
Modification 2
SELECT SQL_NO_CACHE SUM( sd.total_sale ) AS totalsale, `sale_data_temp`.`customer_type_cy` AS `customer_type` , `distributor_list`.`customer_status` FROM `distributor_list` LEFT JOIN (SELECT * FROM `sale_data` WHERE `invoice_date` BETWEEN "2017-04-01" AND "2017-11-01")sd ON `sd`.`depo_code` = `distributor_list`.`depo_code` AND `sd`.`customer_code` = `distributor_list`.`customer_code` LEFT JOIN `sale_data_temp` ON `distributor_list`.`address_coordinates` = `sale_data_temp`.`address_coordinates` LEFT JOIN `item_master` ON `sd`.`item_code` = `item_master`.`item_code` WHERE `item_master`.`id_category` =1 GROUP BY `distributor_list`.`address_coordinates`
HERE ARE MY INDEXES ON SALE DATA TABLE

See the key column of the EXPLAIN results view - no key is being used at the moment so MySQL is not using any of your indexes for filtering out rows so it is scanning the whole table on each query. This is why it is taking so long.
I have taken a look at your first query with relation to your sale_data indices. It looks like you will need to create a new composite index on this table that contains the following columns only:
depo_code, customer_code, item_code, invoice_date, total_sale
I recommend that you name this index test1 and experiment with modifying the ordering of the columns and keep testing again each time using EXPLAIN EXTENDED until you achieve a selected key - you want to see index test1 has been selected in the key column.
See this answer that has helped me before with this, and it will help you understand the importance of correctly ordering your composite indices.
Looking at the cardinality of the single field indices, here is my best attempt at giving you the correct index to apply:
ALTER TABLE `sale_data` ADD INDEX `test1` (`item_code`, `customer_code`, `invoice_date`, `depo_code`, `total_sale`);
Good luck with your mission!

A few things to notice about your query.
You are misusing the notorious MySQL extension to GROUP BY. Read this, then mention the same columns in your GROUP BY clause as you mention in your SELECT clause.
Your LEFT JOIN sale_data and LEFT JOIN item_master operations are actually ordinary JOIN operations. Why? You mention columns from those tables in your WHERE clause.
Your best bet for speedup is doing a date-range scan on an index on sale_data.invoice_date. For some reason known only to the MySQL query planner's feverish machinations, you're not getting it.
Try refactoring your query. Here's one suggestion:
SELECT SUM(sale_data.total_sale) as totalsale,
sale_data_temp.customer_type_cy as customer_type,
distributor_list.customer_status
FROM distributor_list
JOIN sale_data
ON sale_data.invoice_date BETWEEN "2017-04-01" and "2017-11-01"
and sale_data.depo_code = distributor_list.depo_code
and sale_data.customer_code = distributor_list.customer_code
LEFT JOIN sale_data_temp
ON distributor_list.address_coordinates = sale_data_temp.address_coordinates
JOIN item_master
ON sale_data.item_code = item_master.item_code
WHERE item_master.id_category = 1
GROUP BY sale_data_temp.customer_type_cy, distributor_list.customer_status
Try creating a covering index on sale_data for this query. You'll have to mess around a bit to get this right, but this is a starting point. (invoice_date, item_code, depo_code, customer_code, total_sale). The point of a covering index is to allow the query to be satisfied entirely from the index without having to refer back to the table's data. That's why I included total_sale in the index.
Please notice that index I suggested makes your index on invoice_date redundant. You can drop that index.

MySQL LEFT JOIN order of ON conditions

I have a MySQL query with inner joins and one left join and a lot of data in my database, and it's running quite slow. This is roughly my query:
SELECT
main_table.*
FROM
main_table
INNER JOIN
...
LEFT JOIN
second_table ON (main_table.id = second_table.ref_id AND second_table.type = 'foo' AND second_table.bar IS NULL
WHERE
second_table.id IS NULL
;
An entry from main_table may have one or more referenced entries in second_table. I want to get all results from main_table, that either have no results in second_table, or only has irrelevant data in the second table (type 'foo' or bar is NULL).
Taking a look into the EXPLAIN, MySQL searches for bar IS NULL first, followed by type = 'foo', that would still result in many thousands of result, whereas checking for ref_id first would only leave very few results to check the other conditions on.
I only have an index on ref_id, not for type or bar and I don't feel the need to index them if I could just get the query search for ref_id first.
--EDIT: I noticed that on the copy of the database (where it has the actual data and runs slow) does also have an index on type and bar individually, so that's probably why MySQL prefers bar over the other keys. I'm considering a key spanning multiple fields.--
Does anybody have an idea how to optimize this kind of query? Is it possible to force MySQL using a certain order in the ON conditions?
"Solution": I added an index spanned over all the relevant fields.
I don't consider this being a real solution, because I believe, it would also have been faster if the JOIN was done on the indexed ref_id first. It probably did so when that was the only index, however my colleague had the idea to add an index separately on the other fields as well for some reason, probably needed somewhere else in our application.

What happens if you move the "Irrelevant" rows to the where part?
Seems to me the DB should have an easier time joining the tables, and will use the index
Something like
SELECT
main_table.*
FROM
main_table
INNER JOIN
...
LEFT JOIN
second_table ON main_table.id = second_table.ref_id
WHERE
second_table.id IS NULL OR
(second_table.type = 'foo' AND second_table.bar IS NULL)

In MYSQL JOIN is faster then LEFT JOIN so you can write your query like this.
SELECT
main_table.*
FROM
main_table
INNER JOIN
...
LEFT JOIN (SELECT main_table.*,second_table.* FROM main_table
JOIN second_table ON main_table.id = second_table.ref_id AND
second_table.type = 'foo' AND second_table.bar IS NULL) AS main_table2 ON
main_table2.id = main_table.id
WHERE
second_table.id IS NULL;

MySQL : do I need indexes in this situation?

I have a table 'Clients' and a sub-table 'Orders'.
For a certain view I need to display the last order for each client.
Since you cannot use LIMIT in a join, I first used a complex solution with a LEFT JOIN, GROUP_CONCAT and SUBSTRING_INDEX to get the last order, but this is quite slow, since there are millions of records.
Then I thought of just storing the last OrderID in the Clients table, that is updated by a trigger each time the Orders table changes. Then I just do a LEFT JOIN to Orders on this field LastOrderID.
Would an index on the field LastOrderID be of any use in this situation? Or wouldn't it be used since the source table is always Clients, so there is no sorting, searching, etc. done on this field ?
The reason I'm asking is that in reality it's a little bit more complex, I might actually need about 20 of these kind of fields.
update:
My query now is :
SELECT *
FROM Clients AS c
LEFT JOIN Orders AS o ON o.OrderID=c.LastOrderID
Would an index on LastOrderID in Clients improve speed, or is it not neccessary?

First of all, do you have an index on the Client foreign key within the Order table?
Doing this alone should increase performance quite considerably.

Perhaps your SQL is wrong?
This is standard SQL: you'll need a single two column index on (ClientID, OrderID) on the Orders table for this which will speed up the aggregate and self join
SELECT
...
FROM
(
SELECT MAX(OrderID) AS LastOrderID, ClientID
FROM Orders
GROUP BY ClientID
) o2
JOIN
Orders o ON o2.LastOrderID = o.ClientID AND o2.OrderID = o.ClientID
JOIN
Clients c PN o.ClientID = c.ClientID

In what order are MySQL JOINs evaluated?

I have the following query:
SELECT c.*
FROM companies AS c
JOIN users AS u USING(companyid)
JOIN jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123;
I have the following questions:
Is the USING syntax synonymous with ON syntax?
Are these joins evaluated left to right? In other words, does this query say: x = companies JOIN users; y = x JOIN jobs; z = y JOIN useraccounts;
If the answer to question 2 is yes, is it safe to assume that the companies table has companyid, userid and jobid columns?
I don't understand how the WHERE clause can be used to pick rows on the companies table when it is referring to the alias "j"
Any help would be appreciated!

USING (fieldname) is a shorthand way of saying ON table1.fieldname = table2.fieldname.
SQL doesn't define the 'order' in which JOINS are done because it is not the nature of the language. Obviously an order has to be specified in the statement, but an INNER JOIN can be considered commutative: you can list them in any order and you will get the same results.
That said, when constructing a SELECT ... JOIN, particularly one that includes LEFT JOINs, I've found it makes sense to regard the third JOIN as joining the new table to the results of the first JOIN, the fourth JOIN as joining the results of the second JOIN, and so on.
More rarely, the specified order can influence the behaviour of the query optimizer, due to the way it influences the heuristics.
No. The way the query is assembled, it requires that companies and users both have a companyid, jobs has a userid and a jobid and useraccounts has a userid. However, only one of companies or user needs a userid for the JOIN to work.
The WHERE clause is filtering the whole result -- i.e. all JOINed columns -- using a column provided by the jobs table.

I can't answer the bit about the USING syntax. That's weird. I've never seen it before, having always used an ON clause instead.
But what I can tell you is that the order of JOIN operations is determined dynamically by the query optimizer when it constructs its query plan, based on a system of optimization heuristics, some of which are:
Is the JOIN performed on a primary key field? If so, this gets high priority in the query plan.
Is the JOIN performed on a foreign key field? This also gets high priority.
Does an index exist on the joined field? If so, bump the priority.
Is a JOIN operation performed on a field in WHERE clause? Can the WHERE clause expression be evaluated by examining the index (rather than by performing a table scan)? This is a major optimization opportunity, so it gets a major priority bump.
What is the cardinality of the joined column? Columns with high cardinality give the optimizer more opportunities to discriminate against false matches (those that don't satisfy the WHERE clause or the ON clause), so high-cardinality joins are usually processed before low-cardinality joins.
How many actual rows are in the joined table? Joining against a table with only 100 values is going to create less of a data explosion than joining against a table with ten million rows.
Anyhow... the point is... there are a LOT of variables that go into the query execution plan. If you want to see how MySQL optimizes its queries, use the EXPLAIN syntax.
And here's a good article to read:
http://www.informit.com/articles/article.aspx?p=377652
ON EDIT:
To answer your 4th question: You aren't querying the "companies" table. You're querying the joined cross-product of ALL four tables in your FROM and USING clauses.
The "j.jobid" alias is just the fully-qualified name of one of the columns in that joined collection of tables.

In MySQL, it's often interesting to ask the query optimizer what it plans to do, with:
EXPLAIN SELECT [...]
See "7.2.1 Optimizing Queries with EXPLAIN"

Here is a more detailed answer on JOIN precedence. In your case, the JOINs are all commutative. Let's try one where they aren't.
Build schema:
CREATE TABLE users (
name text
);
CREATE TABLE orders (
order_id text,
user_name text
);
CREATE TABLE shipments (
order_id text,
fulfiller text
);
Add data:
INSERT INTO users VALUES ('Bob'), ('Mary');
INSERT INTO orders VALUES ('order1', 'Bob');
INSERT INTO shipments VALUES ('order1', 'Fulfilling Mary');
Run query:
SELECT *
FROM users
LEFT OUTER JOIN orders
ON orders.user_name = users.name
JOIN shipments
ON shipments.order_id = orders.order_id
Result:
Only the Bob row is returned
Analysis:
In this query the LEFT OUTER JOIN was evaluated first and the JOIN was evaluated on the composite result of the LEFT OUTER JOIN.
Second query:
SELECT *
FROM users
LEFT OUTER JOIN (
orders
JOIN shipments
ON shipments.order_id = orders.order_id)
ON orders.user_name = users.name
Result:
One row for Bob (with the fulfillment data) and one row for Mary with NULLs for fulfillment data.
Analysis:
The parenthesis changed the evaluation order.
Further MySQL documentation is at https://dev.mysql.com/doc/refman/5.5/en/nested-join-optimization.html

SEE http://dev.mysql.com/doc/refman/5.0/en/join.html
AND start reading here:
Join Processing Changes in MySQL 5.0.12
Beginning with MySQL 5.0.12, natural joins and joins with USING, including outer join variants, are processed according to the SQL:2003 standard. The goal was to align the syntax and semantics of MySQL with respect to NATURAL JOIN and JOIN ... USING according to SQL:2003. However, these changes in join processing can result in different output columns for some joins. Also, some queries that appeared to work correctly in older versions must be rewritten to comply with the standard.
These changes have five main aspects:
The way that MySQL determines the result columns of NATURAL or USING join operations (and thus the result of the entire FROM clause).
Expansion of SELECT * and SELECT tbl_name.* into a list of selected columns.
Resolution of column names in NATURAL or USING joins.
Transformation of NATURAL or USING joins into JOIN ... ON.
Resolution of column names in the ON condition of a JOIN ... ON.

Im not sure about the ON vs USING part (though this website says they are the same)
As for the ordering question, its entirely implementation (and probably query) specific. MYSQL most likely picks an order when compiling the request. If you do want to enforce a particular order you would have to 'nest' your queries:
SELECT c.*
FROM companies AS c
JOIN (SELECT * FROM users AS u
JOIN (SELECT * FROM jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123)
)
as for part 4: the where clause limits what rows from the jobs table are eligible to be JOINed on. So if there are rows which would join due to the matching userids but don't have the correct jobid then they will be omitted.

1) Using is not exactly the same as on, but it is short hand where both tables have a column with the same name you are joining on... see: http://www.java2s.com/Tutorial/MySQL/0100__Table-Join/ThekeywordUSINGcanbeusedasareplacementfortheONkeywordduringthetableJoins.htm
It is more difficult to read in my opinion, so I'd go spelling out the joins.
3) It is not clear from this query, but I would guess it does not.
2) Assuming you are joining through the other tables (not all directly on companyies) the order in this query does matter... see comparisons below:
Origional:
SELECT c.*
FROM companies AS c
JOIN users AS u USING(companyid)
JOIN jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123
What I think it is likely suggesting:
SELECT c.*
FROM companies AS c
JOIN users AS u on u.companyid = c.companyid
JOIN jobs AS j on j.userid = u.userid
JOIN useraccounts AS us on us.userid = u.userid
WHERE j.jobid = 123
You could switch you lines joining jobs & usersaccounts here.
What it would look like if everything joined on company:
SELECT c.*
FROM companies AS c
JOIN users AS u on u.companyid = c.companyid
JOIN jobs AS j on j.userid = c.userid
JOIN useraccounts AS us on us.userid = c.userid
WHERE j.jobid = 123
This doesn't really make logical sense... unless each user has their own company.
4.) The magic of sql is that you can only show certain columns but all of them are their for sorting and filtering...
if you returned
SELECT c.*, j.jobid....
you could clearly see what it was filtering on, but the database server doesn't care if you output a row or not for filtering.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008