I have an issue with data redundancy. My JOIN query in MySQL creates a very large data set (~8mb) while a lot of the data is redundant. After analysis, I can see that the query is fast, but the data transfer can take several seconds. What options do I have?
For example, say that I have the two tables
Users:
user_id
user_name
1
Alex
2
Joe
And Purchases:
user_id
purchase_id
purchase_amount
1
A
100
2
B
200
1
C
300
1
D
400
If I simply LEFT Join the tables with
SELECT users.user_id, users.user_name, purchase_id, purchase_amount
FROM Users
LEFT JOIN purchases ON users.id = purchases.user_id
I will end up with a result:
user_id
user_name
purchase_id
purchase_amount
1
Alex
A
100
2
Joe
B
200
1
Alex
C
300
1
Alex
D
400
However, as we can see, the user_id 1 and user_name Alex exists in three places. For very large result sets this can become an issue.
I'm thinking about using GROUP BY and GROUP_CONCAT to reduce the redundancy. Is this in general a good idea? My first tests seem to work, but I have to set the MySQL SET SESSION group_concat_max_len = 1000000; which might not be a good thing since I don't know what to set it to.
For example I could do something like
SELECT user_id, user_name, GROUP_CONCAT(CONCAT(purchase_id, ':', purchase_amount))
FROM Users
LEFT JOIN purchases ON users.id = purchases.user_id
GROUP BY user_id, user_name
And end up with a result:
user_id
user_name
GROUP_CONCAT...
1
Alex
A:100,C:300,D:400
2
Joe
B:200
Are there any other options for me? Is this the way to go? Parsing the concatenated column is not an issue. I am trying to solve the large data set being returned.
we can have temp table in between?
use apache spark's map reduce to get data in desired format.
Related
I have three tables. One with notes Notes, one with users Users, and one a relational table between users and notes NotesUsers.
Users
user_id first_name last_name
1 John Smith
2 Jane Doe
Notes
note_id note_name owner_id
1 Math 1
2 Science 1
3 English 2
NoteUsers
user_id note_id
1 1
2 1
2 2
2 3
Hopefully, from the select statement you can tell what I'm trying to do. I am trying to select the notes that user_id = 2 has access to but doesn't necessarily own, but also along with this I'm trying to get the first and last name of the owner.
SELECT Notes.notes_id, note_name
FROM Notes, NotesUsers
WHERE NotesUsers.note_id = Notes.note_id AND NotesUsers.user_id = 2
JOIN SELECT first_name, last_name FROM Users, Notes WHERE Notes.owner_id = Users.user_id
My problem is that because the WHERE clause for first_name, and last_name versus that for notes are different, I don't know how to query the data. I understand that this is not how a JOIN works and
I don't necessarily want to use a JOIN, but I'm not sure how to structure the statement, so I left it in there so that you can understand what I'm trying to do.
You can join Notes with NoteUsers to check for access and with Users to add the user's details to the result:
SELECT n.noted_id, n.note_name, u.first_name, u.last_name
FROM Notes n
JOIN NoteUsers nu ON n.noted_id = nu.note_id AND nu.user_id = 2
JOIN Users u ON n.owner_id = u.user_id
you need here to use a query inside the main query. MySQL will return first all the note_id that the user with user_id = 2 has access to from NoteUser, then well build the outer query to return the first_name and the last_name of the owner.
SELECT u.first_name, u.last_name, n.note_name, n.note_id
FROM Notes AS n
LEFT JOIN Users AS u ON u.user_id = n.owner_id
WHERE n.note_id IN
(SELECT nu.note_id FROM NoteUser WHERE nu.user_id = 2)
I have the following tables:
Users
user_id course_id completion_rate
1 2 0.4
1 23 0.6
1 49 0.5
... ... ...
Courses
course_id title
1 Intro to Python
2 Intro to R
... ...
70 Intro to Flask
Each entry in the user table represents a course that the user took. However, it is rare that users have taken every course.
What I need is a result set with user_id, course_id, completion_rate. In the case that the user has taken the course, the existing completion_rate should be used, but if not then the completion_rate should be set to 0. That is, there would be 70 rows for each user_id, one for each course.
I don't have a lot of experience with SQL, and I'm not sure where to start. Would it be easier to do this in something like R?
Thank you.
You should first cross join the courses with distinct users. Then left join on this to get the desired result. If the user hasn't taken a course the completion_rate would be null and we use coalesce to default a 0.
select c.course_id,cu.user_id,coalesce(u.completion_rate,0) as completion_rate
from courses c
cross join (select distinct user_id from users) cu
left join users u on u.course_id=c.course_id and cu.user_id=u.user_id
Step1: Take the distinct client_id from client_data (abc) and do 1 on 1 merge with the course data (abc1) . 1 on 1 merge helps up write all the courses against each client_id
Step2: Merge the above dataset with the client info on client_id as well as course
create table ans as
select p.*,case when q.completion_rate is not null then q.completion_rate else 0
end as completion_rate
from
(
select a.client_id,b.course from
(select distinct client_id from abc) a
left join
abc1 b
on 1=1
) p
left join
abc q
on p.client_id = q.client_id and p.course = q.course
order by client_id,course;
Let me know in case of any queries.
With this db:
Chef(cid,cname,age),
Recipe(rid,rname),
Cooked(orderid,cid,rid,price)
Customers(cuid,orderid,time,daytime,age)
[cid means chef id, and so on]
Given orders from customers, I need to find for each chef, the difference between his age and the average of people who ordered his/her meals.
I wrote the following query:
select cid, Ch.age - AVG(Cu.age) as Diff
from Chef Ch NATURAL JOIN Cooked Co,Customers Cu
where Co.orderid = Cu.orderid
group by cid
This solves the problem, but if you assume that customers has their unique id, it might not work,because then one can order two meals of the same chef and affect the calculation.
Now I know it can be answered with NOT EXISTS but I'm looking for a soultion which includes the group by function (something similar to what I wrote). So far I couldn't find (I searched and tried many ways, from select distinct , to manipulation in the where clause ,to "having count(distinct..)" )
Edit: People asked for an exmaple. i'm coding using SQLFiddle and it crashes alot, so I'll try my best:
cid | cuid | orderid | Cu.age
-----------------------------
1 333 1 20
1 200 2 41
1 200 5 41
2 4 3 36
Let's say Chef 1's age is 50 . My query will give you 50 - (20+40+40/3) = 16 and 2/3. althought it should actually be 50 - (20+40/2) = 20. (because the guy with id 200 ordered two recipes of our beloved Chef 1.).
Assume Chef 2's age is 47. My query will result:
cid | Diff
----------
1 16.667
2 11
Another edit: I wasn't taught any particular sql-query form.So I really have no idea what are the differences between Oracle's to MySql's to Microsoft Server's, so I'm basically "freestyle" querying.(I hope it will be good in my exam as well :O )
First, you should write your query as:
select cid, Ch.age - AVG(Cu.age) as Diff
from Chef Ch join
Cooked Co
on ch.cid = co.cid join
Customers Cu
on Co.orderid = Cu.orderid
group by cid;
Two different reasons:
NATURAL JOIN is just a bug waiting to happen. List the columns that you want used for the join, lest an unexpected field or spelling difference affect the results.
Never use commas in the FROM clause. Always use explicit JOIN syntax.
Next, the answer to your question is more complicated. For each chef, we can get the average age of the customers by doing:
select cid, avg(age)
from (select distinct co.cid, cu.cuid, cu.age
from Cooked Co join
Customers Cu
on Co.orderid = Cu.orderid
) c
group by cid;
Then, for the difference, you need to bring that information in as well. One method is in the subquery:
select cid, ( age - avg(cuage) ) as diff
from (select distinct co.cid, cu.cuid, cu.age as cuage, c.age as cage
from Chef c join
Cooked Co
on ch.cid = co.cid join
Customers Cu
on Co.orderid = Cu.orderid
) c
group by cid, cage;
I am not a databases guy,but I have been given the "fun" job of cleaning up someone else's database. We have many duplicate record in our databases and some of customers are getting double or triple billed every month.
Given the following Database example
:
Table: Customers
ID Name Phone DoNotBill
1 Acme Inc 5125551212 No
2 ABC LLC 7138221661 No
3 Big Inc 4132229807 No
4 Acme 5125551212 No
5 Tree Top 2127657654 No
Is it possible to write a query that Identifies the all duplicate phone numbers (in this case records 1 and 4) and then marks and duplicate records yes by updating the DoNotBill column. But leaves the first record unmarked.
In this example case we would be left with:
ID Name Phone DoNotBill
1 Acme Inc 5125551212 No
2 ABC LLC 7138221661 No
3 Big Inc 4132229807 No
4 Acme 5125551212 Yes
5 Tree Top 2127657654 No
something like this?
UPDATE
customers cust,
(SELECT
c1.ID,
c1.name,
c1.phone,
c1.DoNotBill
FROM customers c
LEFT JOIN
(SELECT
cc.ID
FROM customers cc
) as c1 on c1.phone = c.phone
) dup
SET cust.DoNotBill = 'Yes' WHERE cust.id=dup.id ;
To begin with I assume that the DoNotBill column only has two possible values; yes and no. In that case it should be bool instead of varchar, meaning it would be either true or false.
Furthermore I don't get the meaning of the DoNotBill column. Why wouldn't you just use something like this?
select distinct phone from customers
SQL SELECT DISTINCT
That would give you the phone numbers without duplicates and without the need for an extra column.
This depends on ur data amount
You can do it in steps and make use some tools like excel...
This qrt
SELECT a.id,b.id,a.phone FROM clients a , clients b WHERE
A.phone =b.phone
And a.id!=b.id
The result is all duplicated records.
Add
Group by a.phone
And u will get 1 record for each 2 duplicates.
if you like the records and they are whT u need. ChNge select to select a.id and
Use this qry as subqry to an update sql statement
UPDATE clients SET billing='no' WHERE id IN ( sql goes here)
UPDATE customers c SET c.DoNotBill="Yes";
UPDATE customers c
JOIN (
SELECT MIN( ID ) ID, Phone
FROM customers
GROUP BY Phone
) u ON c.ID = u.ID AND c.Phone = u.Phone
SET c.DoNotBill="No";
That way not only duplicates are eliminated, but all multiple entries are dealt with.
I have table users and another table premium_users in which I hold the userid and the date when he bought premium membership.
How can I use mysql join , so that in a single query I can select all the columns from the table users and also know for each premium user the date he joined on.
USERS:
ID USERNAME
1 JOHN
2 BILL
3 JOE
4 KENNY
PREMIUM USERS:
ID USERID DATE
1 2 20/05/2010
2 4 21/06/2011
And the final table (the one that will be returned my the query) should look like this:
ID USERNAME DATE
1 JOHN
2 BILL 20/05/2010
3 JOE
4 KENNY 21/06/2011
Is it ok for some rows to have the DATE value empty?
How can I check if that value is empty? $row['date']=='' ?
EDIT:
This was only an example, but the users table has much more columns, how can I select all from users and only date from premium_users without writing all the columns?
select u.*, pu.DATE
from USERS u LEFT OUTER JOIN PREMIUM_USERS pu on
u.ID = pu.USERID
You can check if a row is empty with:
if (!$row['DATE'])
{
...
}
select USERS.ID, USERS.USERNAME, PREMIUM_USERS.DATE
from USERS
join PREMIUM_USERS on USERS.ID = PREMIUM_USERS.ID
order by USERS.ID
This is mssql syntax, but it should be pretty similar...
select *
from users u
left join premiumUsers p
on u.id = p.id
order by u.id asc
SELECT A.*, B.DATE
FROM USERS A
LEFT JOIN PREMIUIM_USERS B on A.ID=B.USERID
EDITED
It might be easier to have it all in one table. You can have nullable fields for isPremium(t/f) and premiumDate. you actually dont even need the isPremium field. just premiumDate if its null they are not premium and if it has value they are premium user and you have the date they joined.