MySQL Sum not correct and join - mysql

I'm building a holiday system. One of its features is buying extra holiday, which you can do at several points in the year, so I want to see, for each user, the total number of days of holiday, how much has been booked and how much has been bought.
This is my query:
SELECT hr_user.name AS username,
hr_user.user_id,
SUM(working_days) AS daysbooked,
sum(hr_buyback.days) AS daysbought
FROM hr_leave
inner join hr_user on hr_user.user_id = hr_leave.user_id
left outer join hr_buyback on hr_buyback.user_id = hr_user.user_id
where active = 'y'
and hr_leave.start_date between '2012-01-01' and '2012-12-31'
and (hr_leave.status = 'approved' OR hr_leave.status = 'pending')
GROUP BY hr_user.name, hr_user.user_id
Now this is bringing back results in the daysbought column waaaay higher than what I was expecting, which is odd because when I get rid of the sum and just have hr_buyback.days it shows all the individual values I'd expect (except I'd much rather they were summed)
Secondly, in MySQL can you do what you can in MSSQL which is
left outer join hr_buyback on (select hr_buyback.user_id where buy_sell = 'buy') = hr_leave.user_id
?
Relevant table definitions (I assume this is what you mean?):
hr_buyback
buyback_id int(11) NO PRI auto_increment
user_id int(11) NO
days int(11) NO
buy_sell varchar(10) NO
status varchar(10) NO pending
year int(11) NO
hr_user
user_id int(11) NO PRI auto_increment
name varchar(40) NO
email varchar(40) NO UNI
level int(5) YES
manager_id int(11) NO
team_id int(11) YES
active varchar(2) NO y
holidays_day int(11) NO
start_date timestamp NO CURRENT_TIMESTAMP
password varchar(60) NO
division_id int(11) YES
day_change int(5) NO 0
priv_hours varchar(2) NO n
po_level int(2) YES 0
po_signoff int(10) YES
hr_leave
leave_id int(11) NO PRI auto_increment
user_id int(11) NO
start_date date NO
end_date date NO
day_type varchar(10) NO
status varchar(20) NO pending
working_days varchar(5) NO
leave_type int(11) NO
cancel int(11) NO 0
date timestamp NO CURRENT_TIMESTAMP

The problem is probably that you will get one copy of each row from hr_buyback for each matching row in hr_leave.
I assume that it is possible to have more than one hr_buyback row per user, and that it is possible to have a hr_buyback row without a hr_leave row. If so, you'll probably want something like this:
SELECT hr_user.name AS username,
hr_user.user_id,
SUM(working_days) AS daysbooked,
(SELECT SUM(days)
FROM hr_buyback
WHERE hr_buyback.user_id = hr_user.user_id) AS daysbought
FROM hr_user
left join hr_leave on hr_user.user_id = hr_leave.user_id
and hr_leave.start_date between '2012-01-01' and '2012-12-31'
and (hr_leave.status = 'approved' OR hr_leave.status = 'pending')
where hr_user.active = 'y'
GROUP BY hr_user.name, hr_user.user_id
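To see the row multiplication described above in isolation, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The tables are cut down to the columns the query touches, the sample rows are invented, and working_days is declared as a number for simplicity (the original column is varchar(5)).

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL tables (assumption: trimmed columns, made-up rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hr_user (user_id INTEGER PRIMARY KEY, name TEXT, active TEXT);
CREATE TABLE hr_leave (leave_id INTEGER PRIMARY KEY, user_id INTEGER, working_days REAL);
CREATE TABLE hr_buyback (buyback_id INTEGER PRIMARY KEY, user_id INTEGER, days INTEGER);
INSERT INTO hr_user VALUES (1, 'alice', 'y');
INSERT INTO hr_leave (user_id, working_days) VALUES (1, 2), (1, 3), (1, 1);
INSERT INTO hr_buyback (user_id, days) VALUES (1, 5), (1, 4);
""")

# Joining both child tables pairs every leave row with every buyback row (3 x 2 = 6 rows),
# so each SUM counts its values several times over.
joined = conn.execute("""
SELECT SUM(l.working_days), SUM(b.days)
FROM hr_leave l
JOIN hr_user u ON u.user_id = l.user_id
LEFT JOIN hr_buyback b ON b.user_id = u.user_id
GROUP BY u.user_id
""").fetchone()

# Summing hr_buyback in a correlated subquery keeps it out of the join entirely.
fixed = conn.execute("""
SELECT SUM(l.working_days),
       (SELECT SUM(b.days) FROM hr_buyback b WHERE b.user_id = u.user_id)
FROM hr_user u
LEFT JOIN hr_leave l ON l.user_id = u.user_id
GROUP BY u.user_id
""").fetchone()

print(joined)  # both sums are inflated
print(fixed)   # the real totals: 6 days booked, 9 days bought
```

Note that in the joined version daysbooked is inflated as well, not just daysbought; it is simply less noticeable when a user has only one buyback row.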

Related

Exclusion of all rows possible via outer join?

I'll start off with the schema:
CREATE TABLE CustomersActions (
`caID` int(10) unsigned NOT NULL AUTO_INCREMENT primary key,
`cusID` int(11) unsigned NOT NULL,
`caTimestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
)
CREATE TABLE `Assignments` (
`asgID` int(10) unsigned NOT NULL AUTO_INCREMENT primary key,
`cusID` int(11) unsigned NOT NULL,
`asgAssigned` date DEFAULT NULL,
`astID` int(10) unsigned NOT NULL
)
CREATE TABLE `AssignmentStatuses` (
`astID` int(10) unsigned NOT NULL AUTO_INCREMENT primary key,
`astStatus` varchar(255) DEFAULT ''
)
My original query is:
SELECT DISTINCT
ca.cusID
FROM
CustomersActions ca
WHERE
NOT EXISTS (
SELECT
TRUE
FROM
Assignments asg
NATURAL JOIN AssignmentStatuses
WHERE
asg.cusID = ca.cusID
AND (
DATE_ADD(asgAssigned, INTERVAL 6 DAY) > NOW()
OR astStatus IN('Not contacted', 'Follow-up')
)
)
What this does is select all cusID entries from CustomersActions if said Customer does not have a row in Assignments that is in "Not contacted" or "Follow-up" (for any date range) OR has an assignment of any status from less than six days ago.
I tried writing the same query using LEFT JOIN to exclude from Assignments like so:
SELECT DISTINCT
ca.cusID
FROM
CustomersActions ca
LEFT JOIN (
Assignments asg
NATURAL JOIN AssignmentStatuses
) ON (ca.cusID = asg.cusID)
WHERE
asgID IS NULL
OR DATE_ADD(asgAssigned, INTERVAL 6 DAY) < NOW()
OR astStatus IN('Not contacted', 'Follow-up')
The problem is that it's possible for a customer to have multiple entries in Assignments so a cusID can be selected even if they have an existing row that should force them to be excluded. This makes sense to me, and the NOT EXISTS solves this problem.
What I'm wondering is if there is a way to perform a single query that has the same effect as the query when using NOT EXISTS. That is, a customer should be excluded if they have any rows that satisfy the exclusion condition, not only if all of their rows satisfy the exclusion condition (or if they have none).
Have you tried using a NOT IN clause, like:
SELECT DISTINCT
ca.cusID
FROM
CustomersActions ca
WHERE cusID
NOT IN (
SELECT
cusID
FROM
Assignments asg
INNER JOIN AssignmentStatuses ast
ON asg.astID = ast.astID
WHERE
DATE_ADD(asgAssigned, INTERVAL 6 DAY) > NOW()
OR astStatus IN('Not contacted', 'Follow-up')
)
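If a join-based form is preferred, one option is an anti-join: put the exclusion test inside a derived table so that only the "bad" assignment rows can match, then keep the customers with no match. The sketch below checks this against NOT EXISTS using Python's sqlite3 module. It is an illustration under assumptions: the rows are invented, and a fixed CUTOFF date stands in for the DATE_ADD(...) > NOW() comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CustomersActions (caID INTEGER PRIMARY KEY, cusID INTEGER NOT NULL);
CREATE TABLE AssignmentStatuses (astID INTEGER PRIMARY KEY, astStatus TEXT);
CREATE TABLE Assignments (asgID INTEGER PRIMARY KEY, cusID INTEGER NOT NULL,
                          asgAssigned TEXT, astID INTEGER NOT NULL);
INSERT INTO CustomersActions (cusID) VALUES (1), (2), (3);
INSERT INTO AssignmentStatuses VALUES (1, 'Closed'), (2, 'Follow-up');
-- Customer 1: an old closed row plus a follow-up row (must be excluded).
-- Customer 2: only an old closed row (kept). Customer 3: no rows at all (kept).
INSERT INTO Assignments (cusID, asgAssigned, astID) VALUES
  (1, '2012-01-01', 1), (1, '2012-01-02', 2), (2, '2012-01-01', 1);
""")

CUTOFF = "2012-06-01"  # hypothetical stand-in for "six days before NOW()"

not_exists = conn.execute("""
SELECT DISTINCT ca.cusID FROM CustomersActions ca
WHERE NOT EXISTS (
  SELECT 1 FROM Assignments asg
  JOIN AssignmentStatuses ast ON ast.astID = asg.astID
  WHERE asg.cusID = ca.cusID
    AND (asg.asgAssigned > ? OR ast.astStatus IN ('Not contacted', 'Follow-up')))
ORDER BY ca.cusID
""", (CUTOFF,)).fetchall()

# Anti-join: only rows that satisfy the exclusion condition survive the derived table,
# so "bad.cusID IS NULL" means the customer has no such row.
anti_join = conn.execute("""
SELECT DISTINCT ca.cusID FROM CustomersActions ca
LEFT JOIN (
  SELECT asg.cusID FROM Assignments asg
  JOIN AssignmentStatuses ast ON ast.astID = asg.astID
  WHERE asg.asgAssigned > ? OR ast.astStatus IN ('Not contacted', 'Follow-up')
) bad ON bad.cusID = ca.cusID
WHERE bad.cusID IS NULL
ORDER BY ca.cusID
""", (CUTOFF,)).fetchall()

print(not_exists, anti_join)
```

The key difference from the LEFT JOIN attempt in the question is where the condition lives: filtering in the outer WHERE tests each joined row separately, while filtering before the join means a single matching row is enough to exclude the customer.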

Performance of MySQL Query

I have inherited some code and the original author is not contactable. My own MySQL knowledge is not great, so I would be extremely grateful for any assistance.
The following query takes around 4 seconds to execute. There are only around 20,000 rows of data in all the tables combined, so I suspect it could be made more efficient, perhaps by splitting it into more than one query. Here it is:
SELECT SQL_CALC_FOUND_ROWS ci.id AS id, ci.customer AS customer, ci.installer AS installer, ci.install_date AS install_date, ci.registration AS registration, ci.wf_obj AS wf_obj, ci.link_serial AS link_serial, ci.sim_serial AS sim_serial, sc.call_status AS call_status
FROM ap_servicedesk.corporate_installs AS ci
LEFT JOIN service_calls AS sc ON ci.wf_obj = sc.wf_obj
WHERE ci.acc_id = 3
GROUP BY ci.id
ORDER BY link_serial
asc
LIMIT 40, 20
Can anyone spot a way to make this more efficient? Thanks.
(Some values are set as variables but running the above query in PHPMyAdmin takes ~4secs)
The id column is the primary index.
More Info as requested:
corporate_installs table:
Field Type Null Key Default Extra
id int(11) NO PRI NULL auto_increment
customer varchar(800) NO NULL
acc_id varchar(11) NO NULL
installer varchar(50) NO NULL
install_date varchar(50) NO NULL
address_name varchar(30) NO NULL
address_street varchar(40) NO NULL
address_city varchar(30) NO NULL
address_region varchar(30) NO NULL
address_post_code varchar(10) NO NULL
latitude varchar(15) NO NULL
longitude varchar(15) NO NULL
registration varchar(50) NO NULL
driver_name varchar(50) NO NULL
vehicle_type varchar(50) NO NULL
make varchar(50) NO NULL
model varchar(50) NO NULL
vin varchar(50) NO NULL
wf_obj varchar(50) NO NULL
link_serial varchar(50) NO NULL
sim_serial varchar(50) NO NULL
tti_inv_no varchar(50) NO NULL
pro_serial varchar(50) NO NULL
eco_serial varchar(50) NO NULL
eco_bluetooth varchar(50) NO NULL
warranty_expiry varchar(50) NO NULL
project_no varchar(50) NO NULL
status varchar(15) NO NULL
service_calls table:
Field Type Null Key Default Extra
id int(11) NO PRI NULL auto_increment
acc_id int(15) NO NULL
ciid int(11) NO NULL
installer_job_no varchar(50) NO NULL
installer_inv_no varchar(50) NO NULL
engineer varchar(50) NO NULL
request_date varchar(50) NO NULL
completion_date varchar(50) NO NULL
call_status varchar(50) NO NULL
registration varchar(50) NO NULL
wf_obj varchar(50) NO NULL
driver_name varchar(50) NO NULL
driver_phone varchar(50) NO NULL
team_leader_name varchar(50) NO NULL
team_leader_phone varchar(50) NO NULL
servicing_address varchar(150) NO NULL
region varchar(50) NO NULL
post_code varchar(50) NO NULL
latitude varchar(50) NO NULL
longitude varchar(50) NO NULL
incident_no varchar(50) NO NULL
service_type varchar(20) NO NULL
fault_description varchar(50) NO NULL
requested_action varchar(50) NO NULL
requested_replacemt varchar(100) NO NULL
fault_detected varchar(50) NO NULL
action_taken varchar(50) NO NULL
parts_used varchar(50) NO NULL
new_link_serial varchar(50) NO NULL
new_sim_serial varchar(50) NO NULL
(Apologies for the formatting, I did the best I could)
Let me know if you need more info thanks.
Further info (I did the query again with EXPLAIN):
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE ci ALL acc_id NULL NULL NULL 7227 Using where; Using temporary; Using filesort
1 SIMPLE sc ALL NULL NULL NULL NULL 410
Add indexes on the two wf_obj columns and on the link_serial column (you may need an index on acc_id, too).
Then try this version:
SELECT ...
FROM
( SELECT *
FROM ap_servicedesk.corporate_installs
WHERE acc_id = 3
ORDER BY link_serial ASC
LIMIT 60
) AS ci
LEFT JOIN service_calls AS sc
ON sc.id = -- id is the PRIMARY KEY of service_calls
( SELECT scm.id
FROM service_calls AS scm
WHERE ci.wf_obj = scm.wf_obj
ORDER BY scm.SomeColumn -- whatever suits you
LIMIT 1
)
ORDER BY ci.link_serial ASC
LIMIT 20 OFFSET 40
The ORDER BY scm.SomeColumn is needed not for performance but to get consistent results. Your query as it is, is joining a row from the first table to all related rows of the second table. But the final GROUP BY aggregates all these rows (of the second table), so your SELECT ... sc.call_status picks a more or less random call_status from one of these rows.
The first place I'd look on this would have to be indexes.
There is a GROUP BY on ci.id, which is the PK, so that is fine. However, you are ordering by link_serial (source table unspecified) and you are filtering on ci.acc_id.
If you add an extra key on the corporate_installs table for the acc_id field, that alone should help performance, as it will be usable for the WHERE clause.
Looking further, you have ci.wf_obj = sc.wf_obj within the join. Joining on an unindexed VARCHAR column will be slow, and you are not actually using it as part of the selection criteria, so a subquery may be your friend. Consider the following:
SELECT
serviceCallData.*,
sc.call_status AS call_status
FROM (
SELECT
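-- SQL_CALC_FOUND_ROWS dropped here: it is a SELECT modifier rather than a column, so it cannot take an alias; fetch the total with a separate COUNT(*) query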
ci.id AS id,
ci.customer AS customer,
ci.installer AS installer,
ci.install_date AS install_date,
ci.registration AS registration,
ci.wf_obj AS wf_obj,
ci.link_serial AS link_serial,
ci.sim_serial AS sim_serial
FROM ap_servicedesk.corporate_installs AS ci
WHERE ci.acc_id = 3
GROUP BY ci.id
ORDER BY ci.link_serial ASC
LIMIT 40, 20
) AS serviceCallData
LEFT JOIN service_calls AS sc ON serviceCallData.wf_obj = sc.wf_obj
In addition to this, change that (acc_id) key to be (acc_id, link_serial), as then it will be usable in the sort. Also add a key on (wf_obj) to service_calls.
This will select the 20 rows from the corporate_installs table first and only then join them onto the service_calls table using the inefficient VARCHAR join.
I hope this is of help
I think the SQL_CALC_FOUND_ROWS option, used with a join and a GROUP BY, could be degrading the performance. It seems, in fact, that indexes are not used in that case.
Try replacing your query with two separate queries, the one with the LIMIT followed by a COUNT().
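A minimal sketch of that two-query approach, using Python's sqlite3 module as a stand-in for MySQL. The table is reduced to three columns, the rows are generated, and the index name idx_acc_serial is made up. acc_id is bound as a string because the column is declared varchar(11) in the question; in MySQL, comparing a string column with a numeric literal such as 3 can prevent the index from being used, so quoting the value may also be worth trying there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corporate_installs (id INTEGER PRIMARY KEY, acc_id TEXT, link_serial TEXT)"
)
# Invented data: 100 installs for account '3' and one for another account.
conn.executemany(
    "INSERT INTO corporate_installs (acc_id, link_serial) VALUES (?, ?)",
    [("3", f"LS{n:04d}") for n in range(100)] + [("9", "LS9999")],
)
# Composite index (hypothetical name) so both the filter and the sort can use it.
conn.execute("CREATE INDEX idx_acc_serial ON corporate_installs (acc_id, link_serial)")

# Query 1: only the requested page.
page = conn.execute(
    "SELECT id, link_serial FROM corporate_installs "
    "WHERE acc_id = ? ORDER BY link_serial LIMIT 20 OFFSET 40",
    ("3",),
).fetchall()

# Query 2: the total number of matching rows, for the pager.
(total,) = conn.execute(
    "SELECT COUNT(*) FROM corporate_installs WHERE acc_id = ?", ("3",)
).fetchone()

print(len(page), page[0][1], total)  # 20 LS0040 100
```

The count query does not need the join or the sort, which is why it tends to be cheap compared with making the paged query compute the full result.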

MySQL GROUP BY return the first item - Need to select the last item

I have a table that has duplicate data for the same user_id. I need to select the newest record for each user_id. When I use GROUP BY and then ORDER BY, MySQL applies them in that order and I get the wrong records.
Table - tblUsersProfile
Field Type Null Default Comments
id int(11) No AI
user_id int(7) No
first_name_id int(11) No
last_name_id int(11) No
location_id int(11) No
dob date Yes NULL
sex int(1) Yes NULL 1 for Male, 0 for Female
created_by int(21) No
activity_ts timestamp No CURRENT_TIMESTAMP
select t1.* from tblUsersProfile as t1
inner join (
select user_id,max(activity_ts) as rct
from tblUsersProfile
group by user_id) as t2
on t1.user_id = t2.user_id and t1.activity_ts = t2.rct
My query may be more complicated than necessary if all the other columns are redundant, that is, identical across every record for a given user.
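As a quick check of the join-on-MAX pattern above, here is a small sketch with Python's sqlite3 module standing in for MySQL. The table is trimmed to a few columns and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblUsersProfile (id INTEGER PRIMARY KEY, user_id INTEGER,
                              location_id INTEGER, activity_ts TEXT);
INSERT INTO tblUsersProfile (user_id, location_id, activity_ts) VALUES
  (7, 10, '2012-01-01 09:00:00'),
  (7, 11, '2012-03-01 09:00:00'),
  (8, 20, '2012-02-01 09:00:00');
""")

# The derived table finds the newest timestamp per user; the join pulls the full row for it.
rows = conn.execute("""
SELECT t1.user_id, t1.location_id, t1.activity_ts
FROM tblUsersProfile AS t1
JOIN (SELECT user_id, MAX(activity_ts) AS rct
      FROM tblUsersProfile
      GROUP BY user_id) AS t2
  ON t1.user_id = t2.user_id AND t1.activity_ts = t2.rct
ORDER BY t1.user_id
""").fetchall()

print(rows)  # user 7 comes back with location 11, the newer of its two rows
```

One caveat: if two rows for the same user share the same activity_ts, both are returned. Joining on MAX(id) instead would break such ties, assuming ids grow with time.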

Want to learn to improve slow mysql query

I have a MySQL query to select all product id's with certain filters applied to the products. This query
works but I want to learn to improve this query. Alternatives for this query are welcome with explanation.
SELECT kkx_products.id from kkx_products WHERE display = 'yes' AND id in
(SELECT product_id FROM `kkx_filters_products` WHERE `filter_id` in
(SELECT id FROM `kkx_filters` WHERE kkx_filters.urlname = "comics" OR kkx_filters.urlname = "comicsgraphicnovels")
group by product_id having count(*) = 2)
ORDER BY kkx_products.id desc LIMIT 0, 24
I've included the structure of the tables being used in the query.
EXPLAIN kkx_filters;
Field Type Null Key Default Extra
id int(11) unsigned NO PRI NULL auto_increment
name varchar(50) NO
filtergroup_id int(11) YES MUL NULL
urlname varchar(50) NO MUL NULL
date_modified timestamp NO CURRENT_TIMESTAMP
orderid float(11,2) NO NULL
EXPLAIN kkx_filters_products;
Field Type Null Key Default Extra
filter_id int(11) NO PRI 0
product_id int(11) NO PRI 0
EXPLAIN kkx_products;
Field Type Null Key Default Extra
id int(11) NO PRI NULL auto_increment
title varchar(255) NO
urlname varchar(50) NO MUL
description longtext NO NULL
price float(11,2) NO NULL
orderid float(11,2) NO NULL
imageurl varchar(255) NO
date_created datetime NO NULL
date_modified timestamp NO CURRENT_TIMESTAMP
created_by varchar(11) NO NULL
modified_by varchar(11) NO NULL
productnumber varchar(32) NO
instock enum('yes','no') NO yes
display enum('yes','no') NO yes
Instead of using inline queries in your criteria statements, try using the EXISTS block...
http://dev.mysql.com/doc/refman/5.0/en/exists-and-not-exists-subqueries.html
You will be able to see the difference in your explain plan. Before, you had a query executing for each and every record in your result set, and every result in that inline view's result set had its own query executing too.
You can see how nested inline views can create an exponential increase in cost. EXISTS doesn't work that way.
Example of the use of EXISTS:
Consider tbl1 has columns id and data. tbl2 has columns id, parentid, and data.
SELECT a.*
FROM tbl1 a
WHERE 1 = 1
AND EXISTS (
SELECT NULL
FROM tbl2 b
WHERE b.parentid = a.id
AND b.data = 'SOME CONDITIONAL DATA TO CONSTRAIN ON'
)
1) We can assume the 1 = 1 is some condition that evaluates to true for every record.
2) It doesn't really matter what we select in the EXISTS statement; NULL is fine.
3) It is important to look at b.parentid = a.id; this links our EXISTS statement to the outer result set.
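Applied to the tables in the question, the same idea could look like the sketch below: one EXISTS per required filter, each correlated to the product row. It uses Python's sqlite3 module as a stand-in for MySQL, with trimmed columns and invented rows, and is one possible rewrite rather than the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kkx_products (id INTEGER PRIMARY KEY, display TEXT);
CREATE TABLE kkx_filters (id INTEGER PRIMARY KEY, urlname TEXT);
CREATE TABLE kkx_filters_products (filter_id INTEGER, product_id INTEGER,
                                   PRIMARY KEY (filter_id, product_id));
INSERT INTO kkx_products VALUES (1, 'yes'), (2, 'yes'), (3, 'no');
INSERT INTO kkx_filters VALUES (1, 'comics'), (2, 'comicsgraphicnovels');
-- Product 1 has both filters, product 2 only 'comics', product 3 both but is hidden.
INSERT INTO kkx_filters_products VALUES (1, 1), (2, 1), (1, 2), (1, 3), (2, 3);
""")

# Each EXISTS is tied to the outer row through fp.product_id = p.id.
sql = """
SELECT p.id FROM kkx_products p
WHERE p.display = 'yes'
  AND EXISTS (SELECT NULL FROM kkx_filters_products fp
              JOIN kkx_filters f ON f.id = fp.filter_id
              WHERE fp.product_id = p.id AND f.urlname = ?)
  AND EXISTS (SELECT NULL FROM kkx_filters_products fp
              JOIN kkx_filters f ON f.id = fp.filter_id
              WHERE fp.product_id = p.id AND f.urlname = ?)
ORDER BY p.id DESC LIMIT 24
"""
ids = [row[0] for row in conn.execute(sql, ("comics", "comicsgraphicnovels"))]

print(ids)  # only product 1 is visible and carries both filters
```

Compared with the GROUP BY ... HAVING COUNT(*) = 2 version, this form needs one EXISTS per filter, so it suits a small, known number of filters.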

How do I get the latest post for each category in this forum using MySQL?

EDIT:
Let me be more specific in what I'm after:
catID, cat1, cat2, cat3, cat4, pri_color,sec_color, and cat_name are all related to each specific category.
The sum_views field should correspond to the sum of all views in the forum for that particular category.
The count_posts field should correspond to the number of posts in the forum for that category.
The userID, forum_id, title, alias, created, and paragraph correspond to the newest post in each category.
So in other words, for each category, I need the corresponding category information, the aggregate forum statistics for each category, and finally, the newest post in each category.
I've been given a small project to create a forum type view for our existing system. In this case, I need to find the newest post (and other information) in each forum category.
My current query is as follows:
SELECT DISTINCT forum.catID AS catID, category.cat1 AS cat1,
category.cat2 AS cat2,
category.cat3 AS cat3,
category.cat4 AS cat4,
SUM(forum.view) AS sum_views,
COUNT(forum.id) AS count_posts,
category.pri_color AS pri_color,
category.sec_color AS sec_color,
category.name AS cat_name,
forum.userID AS userID,
forum.id AS forum_id,
forum.title AS title,
users.alias AS alias,
MAX(forum.created) AS created,
forum.paragraph AS paragraph
FROM forum, category, users
WHERE forum.approved = 'yes'
AND users.id = forum.userID
AND forum.catID = category.id
GROUP BY forum.catID
ORDER BY category.name
And it gives me almost all the correct information I want EXCEPT the actual newest post. I suppose my main culprit is my inexperience with JOINS and GROUP BY. It seems to be grouping the data in such a way that it gives me the newest created timestamp but the oldest forum post.
Note that for now, I cannot change the table structure or create a cache table in the current software, though we will be building a replacement in the near future. Also, the id field in the USERS table is a foreign key for another table.
FORUM
id int(10) unsigned NO PRI NULL auto_increment
userID int(10) unsigned NO MUL 0
catID int(3) unsigned NO MUL 0
regID int(3) unsigned NO MUL 0
approved enum('yes','no') NO MUL yes
title varchar(150) NO MUL
paragraph text NO NULL
view int(10) unsigned NO 1
created int(10) unsigned NO 0
modified int(10) unsigned NO MUL 0
ip varchar(15) NO
oldID int(10) unsigned NO 0
comments int(4) NO MUL 0
responses int(4) NO 0
pics int(4) NO MUL 0
USERS
user_id int(10) unsigned NO PRI NULL auto_increment
id int(10) unsigned NO MUL 0
alias varchar(50) NO MUL new
email varchar(150) NO MUL
fname varchar(30) NO MUL
lname varchar(30) NO MUL
address varchar(200) NO
city varchar(50) NO
state varchar(50) NO
zip varchar(20) NO
country varchar(50) NO
job varchar(150) NO
phone varchar(30) NO
url varchar(200) NO
pref_news enum('yes','no') NO no
pref_contact enum('yes','no') NO yes
pref_summary int(3) NO 20
pref_showLoc enum('yes','no') NO no
pref_showName enum('yes','no') NO no
pref_showEmail enum('yes','no') NO no
pref_showUrl enum('yes','no') NO no
CATEGORY
id int(10) unsigned NO PRI NULL auto_increment
area enum('article','classifieds','news','forum') NO forum
level enum('1','2','3','4') NO 1
cat1 int(10) NO 0
cat2 int(10) unsigned NO 0
cat3 int(10) unsigned NO 0
cat4 int(10) unsigned NO 0
name varchar(150) NO
pri_color varchar(6) NO
sec_color varchar(6) NO
right_nav text NO NULL
left_ad text NO NULL
footer_ad text NO NULL
top_ad text NO NULL
Your query (reformatted for clarity):
SELECT DISTINCT f.catID AS catID,
c.cat1, c.cat2, c.cat3, c.cat4,
SUM(f.view) AS sum_views,
COUNT(f.id) AS count_posts,
c.pri_color, c.sec_color, c.name AS cat_name,
f.userID AS userID,
f.id AS forum_id, f.title, u.alias,
MAX(f.created) AS created,
f.paragraph
FROM forum f JOIN users u ON (u.id = f.userID)
JOIN category c ON (f.catID = c.id)
WHERE f.approved = 'yes'
GROUP BY f.catID ORDER BY c.name
The issue with your query is that the aggregated data cannot be logically related to the unaggregated data. For example, the aggregated count for a particular category does not apply to any particular user. So either you want to get your aggregate data separately or you also want to group by user info:
SELECT c.name, c.pri_color, c.sec_color,
cats.ualias,
cats.sum_views, cats.count_posts, cats.created,
c.cat1, c.cat2, c.cat3, c.cat4
FROM category c JOIN
(SELECT f.catID, u.alias AS ualias,
SUM(f.view) as sum_views,
COUNT(f.id) as count_posts,
MAX(f.created) as created
FROM forum f JOIN users u ON (f.userID=u.id)
WHERE f.approved='yes'
GROUP BY f.catID, u.alias) AS cats
ON (c.id=cats.catID);
Had an issue with account creation which won't let me edit my post, but...
This query:
SELECT DISTINCT category.id, category.cat1, category.cat2, category.cat3, category.cat4,
category.pri_color, category.sec_color,
SUM(forum.view) AS sum_views,
COUNT(forum.id) AS count_posts
FROM category, forum
WHERE forum.catID = category.id
AND category.area = 'forum'
AND forum.approved = 'yes'
GROUP BY category.id
effectively takes care of all of the query dealing with the category table information. Now I just need to know how to link it to the latest post for each category.
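One way to make that link is to compute the per-category statistics together with MAX(created) in a derived table, then join back to forum on that timestamp to pick up the newest post's own columns. The sketch below uses Python's sqlite3 module as a stand-in for MySQL; the tables are trimmed, the rows are invented, and the aliases stats and max_created are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT, area TEXT);
CREATE TABLE forum (id INTEGER PRIMARY KEY, catID INTEGER, title TEXT,
                    "view" INTEGER, created INTEGER, approved TEXT);
INSERT INTO category VALUES (1, 'General', 'forum'), (2, 'Help', 'forum');
INSERT INTO forum (catID, title, "view", created, approved) VALUES
  (1, 'old post', 10, 100, 'yes'),
  (1, 'new post', 5, 200, 'yes'),
  (2, 'only post', 7, 150, 'yes'),
  (2, 'unapproved', 99, 999, 'no');
""")

# stats: one row per category with its aggregates and the newest approved timestamp.
# The second join fetches the post that carries that timestamp.
rows = conn.execute("""
SELECT c.name, stats.sum_views, stats.count_posts, f.title, f.created
FROM category c
JOIN (SELECT catID, SUM("view") AS sum_views, COUNT(id) AS count_posts,
             MAX(created) AS max_created
      FROM forum
      WHERE approved = 'yes'
      GROUP BY catID) AS stats ON stats.catID = c.id
JOIN forum f ON f.catID = stats.catID
            AND f.created = stats.max_created
            AND f.approved = 'yes'
WHERE c.area = 'forum'
ORDER BY c.name
""").fetchall()

print(rows)
```

The users table can be joined onto f in the same way to fetch the alias. If two posts in a category share the same created value, both come back, so a tie-breaker such as MAX(id) may be needed.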