I'm getting grey hair by now...
I have a table like this.
ID - Place - Person
1 - London - Anna
2 - Stockholm - Johan
3 - Gothenburg - Anna
4 - London - Nils
And I want to get the result where all the different persons are included, but I want to choose which Place to order by.
For example, I want to get a list ordered with LONDON first and the rest following, but distinct on PERSON.
Output like this:
ID - Place - Person
1 - London - Anna
4 - London - Nils
2 - Stockholm - Johan
Tried this:
SELECT ID, Person
FROM users
ORDER BY FIELD(Place,'London'), Person ASC
But it gives me:
ID - Place - Person
1 - London - Anna
4 - London - Nils
3 - Gothenburg - Anna
2 - Stockholm - Johan
And I really don't want Anna, or any person, to be in the result more than once.
This is one way to get the specified output, but this uses MySQL specific behavior which is not guaranteed:
SELECT q.ID
, q.Place
, q.Person
FROM ( SELECT IF(p.Person<=>@prev_person,0,1) AS r
, @prev_person := p.Person AS person
, p.Place
, p.ID
FROM users p
CROSS
JOIN (SELECT @prev_person := NULL) i
ORDER BY p.Person, !(p.Place<=>'London'), p.ID
) q
WHERE q.r = 1
ORDER BY !(q.Place<=>'London'), q.Person
This query uses an inline view to return all the rows in a particular order, by Person, so that all of the 'Anna' rows are together, followed by all the 'Johan' rows, etc. The set of rows for each person is ordered with Place='London' first, then by ID.
The "trick" is to use a MySQL user variable to compare the values from the current row with values from the previous row. In this example, we're checking if the 'Person' on the current row is the same as the 'Person' on the previous row. Based on that check, we return a 1 if this is the "first" row we're processing for a person, otherwise we return a 0.
The outermost query processes the rows from the inline view, and excludes all but the "first" row for each Person (using the 0 or 1 we returned from the inline view).
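The same "first row per person" logic can be written out procedurally, which may make the two orderings easier to see. This is only a Python sketch of the idea, not the MySQL query itself; the function name first_row_per_person is made up for illustration, and the rows are the sample data from the question.

```python
from itertools import groupby

# Sample rows from the question: (ID, Place, Person).
rows = [
    (1, "London", "Anna"),
    (2, "Stockholm", "Johan"),
    (3, "Gothenburg", "Anna"),
    (4, "London", "Nils"),
]

def first_row_per_person(rows, preferred_place):
    """Keep one row per person, preferring rows at preferred_place."""
    # Inner ordering: by person, then preferred place first, then ID.
    ordered = sorted(rows, key=lambda r: (r[2], r[1] != preferred_place, r[0]))
    # groupby stands in for the "same person as the previous row?" check;
    # next(group) takes only the first row of each person's run.
    firsts = [next(group) for _, group in groupby(ordered, key=lambda r: r[2])]
    # Outer ordering: preferred place first, then person.
    return sorted(firsts, key=lambda r: (r[1] != preferred_place, r[2]))

result = first_row_per_person(rows, "London")
```

Here result holds the rows for Anna (ID 1), Nils (ID 4) and Johan (ID 2), in that order.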
(This isn't the only way to get the resultset. But this is one way of emulating analytic functions which are available in other RDBMS.)
For comparison, in databases other than MySQL, we could use SQL something like this:
SELECT s.ID
, s.Place
, s.Person
FROM ( SELECT ROW_NUMBER() OVER (PARTITION BY t.Person ORDER BY
CASE WHEN t.Place='London' THEN 0 ELSE 1 END, t.ID) AS rn
, t.ID
, t.Place
, t.Person
FROM users t
) s
WHERE s.rn=1
ORDER BY CASE WHEN s.Place='London' THEN 0 ELSE 1 END, s.Person
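To try the window-function approach without a MySQL server, here is a small sketch using Python's built-in sqlite3 module. SQLite is my substitution, not something the answer used, and it needs SQLite 3.25 or later for ROW_NUMBER(). The row number is computed in a derived table so that the outer query can filter on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (ID INTEGER PRIMARY KEY, Place TEXT, Person TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "London", "Anna"), (2, "Stockholm", "Johan"),
     (3, "Gothenburg", "Anna"), (4, "London", "Nils")],
)

query = """
SELECT s.ID, s.Place, s.Person
FROM (
    SELECT t.ID, t.Place, t.Person,
           -- Number each person's rows, London rows first, then by ID.
           ROW_NUMBER() OVER (
               PARTITION BY t.Person
               ORDER BY CASE WHEN t.Place = 'London' THEN 0 ELSE 1 END, t.ID
           ) AS rn
    FROM users t
) s
WHERE s.rn = 1   -- keep only the first row per person
ORDER BY CASE WHEN s.Place = 'London' THEN 0 ELSE 1 END, s.Person
"""
result = conn.execute(query).fetchall()
```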
Followup
At the beginning of the answer, I referred to MySQL behavior that was not guaranteed. I was referring to the usage of MySQL User-Defined variables within a SQL statement.
Excerpts from MySQL 5.5 Reference Manual http://dev.mysql.com/doc/refman/5.5/en/user-variables.html
"As a general rule, other than in SET statements, you should never assign a value to a user variable and read the value within the same statement."
"For other statements, such as SELECT, you might get the results you expect, but this is not guaranteed."
"the order of evaluation for expressions involving user variables is undefined."
Try this:
SELECT ID, Place, Person
FROM users
GROUP BY Person
ORDER BY FIELD(Place,'London') DESC, Person ASC;
You want to use group by instead of distinct:
SELECT ID, Person
FROM users
GROUP BY ID, Person
ORDER BY MAX(FIELD(Place, 'London')), Person ASC;
The GROUP BY does the same thing as SELECT DISTINCT. But, you are allowed to mention other fields in clauses such as HAVING and ORDER BY.
Related
I have a ternary relationship in which I establish the relation between Offers, Profiles, and Skills. The ternary relationship table, called ternary for example, has the IDs of the three tables as primary key. It could look something like this:
id_Offer - id_Profile - id_Skill
1 - 1 - 1
1 - 1 - 2
1 - 1 - 3
1 - 2 - 1
2 - 1 - 1
2 - 3 - 2
2 - 1 - 3
2 - 5 - 1
[and so on, there would be more registers for each id_Offer from Offer but I want to limit the example]
So I have 2 offers in total, with a number of profiles in each one.
The table Offer looks something like this:
Offer - business_name
1 - business-1
2 - business-1
3 - business-1
4 - business-1
5 - business-2
6 - business-2
7 - business-2
8 - business-3
So when I do a query like
select distinct id_offer, business_name, COUNT(*)
FROM Offer
GROUP BY business_name
Order by COUNT(*);
I get that for business-1 I have 4 offers.
Now if I want to take into account the offers for some Profile, I have to make a join with my ternary relationship. But even if I do something as simple as the following
select distinct business_name
from Offer
INNER JOIN ternary ON Offer.id_Offer = ternary.id_Offer
GROUP BY business_name
WHERE business_name = 'business-1'
No matter what I put on the group by, or if I write distinct or not, I do not get what I want. The reality is that for business-1, I have 4 offers. Right now in the ternary only appear two. So it should return 2 unique offers for this name with no filtering by profile.
But instead I get 8 offers, because that is how many times it appears in the ternary, the id_Offer's that match.
How should this be done? If I need no filters I can simply look at Offers table alone. But what if I need to filter by id_skill or id_Profile AND want to return the business_name?
I have seen solutions such as this, but I cannot make them work. I do not understand what the ? is or what it is called, so I cannot look it up or find out whether MariaDB handles it the same way. When I try to build that query for my data I get:
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '? ORDER BY COUNT(*) DESC' at line 1
But as I said, it is kind of hard to look for '?' as an... Operator? Function?
There are two basic solutions.
SELECT
o.business_name,
COUNT(DISTINCT o.id_offer) AS unique_offers
FROM
Offer AS o
INNER JOIN
ternary AS t
ON t.id_Offer = o.id_Offer
WHERE
o.business_name = 'business-1'
AND t.id_profile IN (1, 2, 3, 5)
GROUP BY
o.business_name
That's the simplest to write and think about. But it can also be quite intensive, because you're still joining each row in Offer to 4 rows in ternary, creating 8 rows to aggregate and process through DISTINCT.
The "better" (in my opinion) route is to filter then aggregate the ternary table in a sub-query.
SELECT
o.business_name,
COUNT(*) AS unique_offers
FROM
Offer AS o
INNER JOIN
(
SELECT id_Offer
FROM ternary
WHERE id_profile IN (1, 2, 3, 5)
GROUP BY id_Offer
)
AS t
ON t.id_Offer = o.id_Offer
WHERE
o.business_name = 'business-1'
GROUP BY
o.business_name
This ensures that t only ever has one row for any given offer. This in turn means that each row in Offer only ever joins to one row in t; no duplication. That in turn means there is no need to use COUNT(DISTINCT), which relieves some overhead (by moving it to the inner query's GROUP BY).
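The row multiplication is easy to see on the sample data from the question. The sketch below loads it into an in-memory SQLite database through Python's sqlite3 module (my choice of engine, assumed only for the sake of a runnable example) and compares the raw join count with the two approaches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Offer (id_Offer INTEGER PRIMARY KEY, business_name TEXT);
CREATE TABLE ternary (id_Offer INTEGER, id_Profile INTEGER, id_Skill INTEGER);
INSERT INTO Offer VALUES
    (1,'business-1'),(2,'business-1'),(3,'business-1'),(4,'business-1'),
    (5,'business-2'),(6,'business-2'),(7,'business-2'),(8,'business-3');
INSERT INTO ternary VALUES
    (1,1,1),(1,1,2),(1,1,3),(1,2,1),(2,1,1),(2,3,2),(2,1,3),(2,5,1);
""")

# Plain join: one output row per matching ternary row.
joined_rows = conn.execute("""
    SELECT COUNT(*) FROM Offer o
    JOIN ternary t ON t.id_Offer = o.id_Offer
    WHERE o.business_name = 'business-1' AND t.id_Profile IN (1, 2, 3, 5)
""").fetchone()[0]

# Approach 1: de-duplicate after the join.
distinct_offers = conn.execute("""
    SELECT COUNT(DISTINCT o.id_Offer) FROM Offer o
    JOIN ternary t ON t.id_Offer = o.id_Offer
    WHERE o.business_name = 'business-1' AND t.id_Profile IN (1, 2, 3, 5)
""").fetchone()[0]

# Approach 2: collapse ternary to one row per offer before joining.
preaggregated_offers = conn.execute("""
    SELECT COUNT(*) FROM Offer o
    JOIN (SELECT id_Offer FROM ternary
          WHERE id_Profile IN (1, 2, 3, 5)
          GROUP BY id_Offer) t ON t.id_Offer = o.id_Offer
    WHERE o.business_name = 'business-1'
""").fetchone()[0]
```

The plain join yields 8 rows, while both approaches report 2 unique offers.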
Are you saying that you want to see offers for a particular business, but you want to limit these according to certain profiles or skills?
We limit query results in the WHERE clause. If we want to look up data in another table, we use IN or EXISTS. For instance:
select *
from offer
where business_name = 'business-1'
and id_offer in
(
select id_offer
from ternary
where id_profile = 1
and id_skill = 2
);
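For comparison, here is a sketch of the same lookup written both with IN and with EXISTS, run against the question's sample data in SQLite via Python's sqlite3 module (an assumption made for runnability). In this Python API, each ? is a placeholder that is filled from the tuple passed to execute().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Offer (id_Offer INTEGER PRIMARY KEY, business_name TEXT);
CREATE TABLE ternary (id_Offer INTEGER, id_Profile INTEGER, id_Skill INTEGER);
INSERT INTO Offer VALUES
    (1,'business-1'),(2,'business-1'),(3,'business-1'),(4,'business-1'),
    (5,'business-2'),(6,'business-2'),(7,'business-2'),(8,'business-3');
INSERT INTO ternary VALUES
    (1,1,1),(1,1,2),(1,1,3),(1,2,1),(2,1,1),(2,3,2),(2,1,3),(2,5,1);
""")

params = ("business-1", 1, 2)  # business name, id_Profile, id_Skill

# IN: collect the matching offer IDs, then test membership.
in_rows = conn.execute("""
    SELECT o.id_Offer, o.business_name FROM Offer o
    WHERE o.business_name = ?
      AND o.id_Offer IN (SELECT id_Offer FROM ternary
                         WHERE id_Profile = ? AND id_Skill = ?)
    ORDER BY o.id_Offer
""", params).fetchall()

# EXISTS: correlated subquery, true if at least one matching row exists.
exists_rows = conn.execute("""
    SELECT o.id_Offer, o.business_name FROM Offer o
    WHERE o.business_name = ?
      AND EXISTS (SELECT 1 FROM ternary t
                  WHERE t.id_Offer = o.id_Offer
                    AND t.id_Profile = ? AND t.id_Skill = ?)
    ORDER BY o.id_Offer
""", params).fetchall()
```

Neither form joins the tables, so an offer appears at most once no matter how many ternary rows match it.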
My SQL query needs to return a list of values alongside the date, but with my limited knowledge I have only been able to get this far.
This is my SQL:
select lsu_students.student_grouping,lsu_attendance.class_date,
count(lsu_attendance.attendance_status) AS count
from lsu_attendance
inner join lsu_students
ON lsu_students.student_grouping="Central1A"
and lsu_students.student_id=lsu_attendance.student_id
where lsu_attendance.attendance_status="Present"
and lsu_attendance.class_date="2015-02-09";
This returns:
student_grouping class_date count
Central1A 2015-02-09 23
I want it to return:
student_grouping class_date count
Central1A 2015-02-09 23
Central1A 2015-02-10 11
Central1A 2015-02-11 21
Central1A 2015-02-12 25
This query gets the list of the dates according to the student grouping:
select distinct(class_date)from lsu_attendance,lsu_students
where lsu_students.student_grouping like "Central1A"
and lsu_students.student_id = lsu_attendance.student_id
order by class_date
I think you just want a group by:
select s.student_grouping, a.class_date, count(a.attendance_status) AS count
from lsu_attendance a inner join
lsu_students s
ON s.student_grouping = 'Central1A' and
s.student_id = a.student_id
where a.attendance_status = 'Present'
group by s.student_grouping, a.class_date;
Comments:
Use single quotes for string constants, unless you have a good reason not to.
If you want a range of class dates, then use a where with appropriate filtering logic.
Notice the table aliases. The query is easier to write and to read.
I added student grouping to the group by. This would be required by any SQL engine other than MySQL.
Just take out the condition and lsu_attendance.class_date="2015-02-09" (or change it to a range), and then add GROUP BY lsu_students.student_grouping, lsu_attendance.class_date at the end.
The group by clause is what you're looking for, to limit aggregates (e.g. the count function) to work within each group.
To get the number of students present in each group on each date, you would do something like this:
select student_grouping, class_date, count(*) as present_count
from lsu_students join lsu_attendance using (student_id)
where attendance_status = 'Present'
group by student_grouping, class_date
Note: for your example, using is simpler than on (if your SQL supports it), and putting the table name before each field name isn't necessary if the column name doesn't appear in more than one table (though it doesn't hurt).
If you want to limit which data rows get included, put your constraints in the where clause (this constrains which rows are counted). If you want to constrain the aggregate values that are displayed, you have to use the having clause. For example, to see the count of Central1A students present each day, but only display those dates where more than 20 students showed up:
select student_grouping, class_date, count(*) as present_count
from lsu_students join lsu_attendance using (student_id)
where attendance_status = 'Present' and student_grouping = 'Central1A'
group by student_grouping, class_date
having count(*) > 20
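The difference between where and having can be checked on a toy dataset. The sketch below uses Python's sqlite3 module with a handful of made-up attendance rows (the student IDs, the North2B group and the threshold of 2 are all hypothetical, chosen so the numbers are easy to verify by hand).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lsu_students (student_id INTEGER PRIMARY KEY, student_grouping TEXT);
CREATE TABLE lsu_attendance (student_id INTEGER, class_date TEXT, attendance_status TEXT);
INSERT INTO lsu_students VALUES
    (1,'Central1A'),(2,'Central1A'),(3,'Central1A'),(4,'North2B');
INSERT INTO lsu_attendance VALUES
    (1,'2015-02-09','Present'),(2,'2015-02-09','Present'),
    (3,'2015-02-09','Present'),(4,'2015-02-09','Present'),
    (1,'2015-02-10','Present'),(2,'2015-02-10','Absent'),
    (3,'2015-02-10','Present'),(4,'2015-02-10','Present');
""")

base = """
    SELECT student_grouping, class_date, COUNT(*) AS present_count
    FROM lsu_students JOIN lsu_attendance USING (student_id)
    WHERE attendance_status = 'Present' AND student_grouping = 'Central1A'
    GROUP BY student_grouping, class_date
"""

# where filters rows before counting: one result row per date.
per_day = conn.execute(base + " ORDER BY class_date").fetchall()

# having filters the groups after counting: only dates with more than 2 present.
busy_days = conn.execute(base + " HAVING COUNT(*) > 2 ORDER BY class_date").fetchall()
```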
I have a mysql table-
User Value
A 1
A 12
A 3
B 4
B 3
B 1
C 1
C 1
C 8
D 34
D 1
E 1
F 1
G 56
G 1
H 1
H 3
C 3
F 3
E 3
G 3
I need to run a query which returns the 2nd distinct value that each user has.
That is, if any 2 values are accessed by every user, then pick the 2nd distinct value based on the number of occurrences.
So as above 1 & 3 is being accessed by each User. Occurrence of 1 is
more than 3 , so 2nd distinct will be 3
So I thought first I will get all distinct user.
create table temp AS Select distinct user from table;
Then I will have an outer query-
Select value from table where value in (...)
Programmatically, I could iterate through each of the values a user contains, like a Map, but I just couldn't write that as a Hive query.
This will return the second most frequent value among those that span all users. No such value exists in your table, which I expect is a typo in the data. In real data you will likely have multiple ties that you need to decide how to handle.
Select value as second_distinct from
(select value, rank() over (order by occurrences desc) as rank
from
(SELECT value, unique_users, max(count_users) as count_users, count(value) as occurrences
from
(select value, size(collect_set(user) over (partition by value))
as count_users from my_table
) t
left outer join
(select count(distinct user) as unique_users from my_table
) t2 on (1=1)
where unique_users=count_users
group by value, unique_users
) t3
) t4
where rank = 2;
This works. It returns NULL because there is only one value that every user visited (the value 1). Value 3 is not a solution because not every user has seen that value in your data. I expect you intended 3 to be returned, but it doesn't span all the users (user D did not see value 3).
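The logic can be checked outside Hive on the question's 21 rows. This is a plain Python sketch of the same idea (values seen by every user, ranked by occurrence count), not a translation of the Hive query; the function name second_common_value is made up, and ties are simply left in whatever order the sort produces.

```python
from collections import Counter, defaultdict

# (user, value) rows copied from the question.
rows = [
    ("A", 1), ("A", 12), ("A", 3), ("B", 4), ("B", 3), ("B", 1),
    ("C", 1), ("C", 1), ("C", 8), ("D", 34), ("D", 1), ("E", 1),
    ("F", 1), ("G", 56), ("G", 1), ("H", 1), ("H", 3), ("C", 3),
    ("F", 3), ("E", 3), ("G", 3),
]

def second_common_value(rows):
    """Second most frequent value among those seen by every user, or None."""
    all_users = {user for user, _ in rows}
    seen_by = defaultdict(set)   # value -> users who have it
    occurrences = Counter()      # value -> number of rows
    for user, value in rows:
        seen_by[value].add(user)
        occurrences[value] += 1
    # Keep only the values that span all users, most frequent first.
    shared = [v for v in occurrences if seen_by[v] == all_users]
    ranked = sorted(shared, key=lambda v: -occurrences[v])
    return ranked[1] if len(ranked) > 1 else None

as_posted = second_common_value(rows)
with_d_3 = second_common_value(rows + [("D", 3)])
```

On the data as posted only the value 1 spans all users, so there is no second value; adding a (D, 3) row makes 3 the answer.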
Not sure how @invoketheshell's answer was marked correct; it doesn't run and it needs 6 MR jobs. This will get you there in 4 and is less code.
Query:
select value
from (
select value, value_count, rank() over (order by value_count desc) rank
from (
select value, count(value) value_count
from (
select value, num_users, max(num_users) over () max_users
from (
select value
, size(collect_set(user) over (partition by value)) num_users
from db.table ) x ) y
where num_users = max_users
group by value ) z ) f
where rank = 2
Output:
3
EDIT: Let me clarify my solution as there seems to be some confusion. The OP's example says
"So as above 1 & 3 is being accessed by each User ... "
As my comment below the question suggests, in the example given, user D never accesses value 3. I made the assumption that this was a typo and added this to the dataset, and then added another 1 as well so that there are more 1's than 3's. So my code correctly returns 3, which was the desired output. If you run this script on the actual dataset it will also produce the correct output, which is nothing, because there isn't a "2nd Distinct". The only time it could produce an incorrect value is if there was no one specific number that was accessed by all users, which illustrates the point I was trying to make to @invoketheshell: if there is no single number that every user has accessed, running a query with 6 map-reduce jobs is an absurd way to find that out. Since we are using Hive, I believe it would be fair to assume that if this problem were a "real-world" problem, it would most likely be executed on at least hundreds of TBs of data (probably more). In the interest of preserving time and resources, it would behoove an individual to at least check that one number had been accessed by all users before running a massive query whose analysis hinges on that assumption being true.
I have a table (tblExam) showing exam data score designed as follow:
Exam Name: String
Score: Number (percent)
Basically I am trying to pull the records by exam name where the score is less than a specific amount (0.695 in my case).
I am using the following statement to get the results:
SELECT DISTINCTROW tblExam.name, Count(tblExam.name) AS CountOfName
FROM tblExam WHERE (((tblExam.Score)<0.695))
GROUP BY tblExam.name;
This works fine but does not display the exam that have 0 records more than 0.695; in other words I am getting this:
Exam Name count
firstExam 2
secondExam 1
thirdExam 3
The count of 0 and any exams with score above 0.695 do not show up. What I would like is something like this:
Exam Name count
firstExam 2
secondExam 1
thirdExam 3
fourthExam 0
fifthExam 0
sixthExam 2
.
..
.etc...
I hope that I am making sense here. I think that I need some kind of LEFT JOIN to display all of the exam names, but I cannot come up with the proper syntax.
It seems you want to display all name groups and, within each group, the count of Score < 0.695. So I think you should move < 0.695 from the WHERE to the Count() expression --- actually remove the WHERE clause.
SELECT
e.name,
Count(IIf(e.Score < 0.695, 1, Null)) AS CountOfName
FROM tblExam AS e
GROUP BY e.name;
That works because Count() counts only non-Null values. You could use Sum() instead of Count() if that seems clearer:
Sum(IIf(e.Score < 0.695, 1, 0)) AS CountOfName
Note DISTINCTROW is not useful in a GROUP BY query, because the grouping makes the rows unique without it. So I removed DISTINCTROW from the query.
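To see why moving the condition out of the WHERE clause brings back the zero-count groups, here is a small sketch in SQLite through Python's sqlite3 module. This is an assumption on my part, since the question is about Access: IIf() is replaced by the equivalent CASE expression, and the scores are made-up sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblExam (name TEXT, Score REAL);
INSERT INTO tblExam VALUES
    ('firstExam', 0.50), ('firstExam', 0.60), ('firstExam', 0.90),
    ('fourthExam', 0.80), ('fourthExam', 0.95);
""")

# Filtering in WHERE removes fourthExam's rows before grouping,
# so that group never exists.
with_where = conn.execute("""
    SELECT name, COUNT(*) FROM tblExam
    WHERE Score < 0.695
    GROUP BY name ORDER BY name
""").fetchall()

# Conditional count: every row is grouped, but only rows under the
# threshold produce a non-NULL value for COUNT() to count.
conditional = conn.execute("""
    SELECT name, COUNT(CASE WHEN Score < 0.695 THEN 1 END)
    FROM tblExam
    GROUP BY name ORDER BY name
""").fetchall()

# The Sum() variant adds 1 or 0 per row instead.
summed = conn.execute("""
    SELECT name, SUM(CASE WHEN Score < 0.695 THEN 1 ELSE 0 END)
    FROM tblExam
    GROUP BY name ORDER BY name
""").fetchall()
```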
Do I detect a contradiction? The query calls for results <0.695 but your text says you are also looking for results >0.695. Perhaps I don't understand. Does this give you what you are looking for:
SELECT DISTINCTROW tblExam.ExamName, Count(tblExam.ExamName) AS CountOfExamName
FROM tblExam
WHERE (((tblExam.Score)<0.695 Or (tblExam.Score)>0.695))
GROUP BY tblExam.ExamName;
I want to search for records where a particular field either STARTS WITH some string (let's say "ar") OR that field CONTAINS the string, "ar".
However, I consider the two conditions different, because I'm limiting the number of results returned to 10 and I want the STARTS WITH condition to be weighted more heavily than the CONTAINS condition.
Example:
SELECT *
FROM Employees
WHERE Name LIKE 'ar%' OR Name LIKE '%ar%'
LIMIT 10
The catch is that if there are names that START with "ar" they should be favored. The only way I should get back a name that merely CONTAINS "ar" is if there are FEWER than 10 names that START with "ar".
How can I do this against a MySQL database?
You need to select them in 2 parts, and add a Preferred tag to the results: take 10 from each segment, then merge them and take the best 10 again. If segment 1 produces 8 entries, then segment 2 of the UNION ALL will provide the remaining 2.
SELECT *
FROM
(
SELECT *, 1 as Preferred
FROM Employees
WHERE Name LIKE 'ar%'
LIMIT 10
) X
UNION ALL
SELECT *
FROM
(
SELECT *, 2
FROM Employees
WHERE Name NOT LIKE 'ar%' AND Name LIKE '%ar%'
LIMIT 10
) Y
ORDER BY Preferred
LIMIT 10
Assign a code value to results, and sort by the code value:
select
*,
(case when name like 'ar%' then 1 else 2 end) as priority
from
employees
where
name like 'ar%' or name like '%ar%'
order by
priority
limit 10
Edit:
See Richard aka cyberkiwi's answer for a more efficient solution if there are potentially lots of matches.
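A quick way to check the priority ordering is to run it on a few made-up names. The sketch below uses SQLite through Python's sqlite3 module, which is an assumption for the sake of a runnable example; SQLite's LIKE is case-insensitive for ASCII by default, while in MySQL that depends on the column's collation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?)",
    [("Mark",), ("Arnold",), ("Carl",), ("Arthur",), ("Bob",)],  # hypothetical names
)

# Names that start with 'ar' get priority 1, other matches priority 2.
# The secondary sort on name only makes the result deterministic.
rows = conn.execute("""
    SELECT name
    FROM employees
    WHERE name LIKE '%ar%'
    ORDER BY CASE WHEN name LIKE 'ar%' THEN 1 ELSE 2 END, name
    LIMIT 3
""").fetchall()
top_three = [name for (name,) in rows]
```

With a limit of 3, the two names starting with "ar" come back first and one "contains" match fills the remaining slot.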
My solution is:
SELECT *
FROM Employees
WHERE Name LIKE '%ar%'
ORDER BY instr(name, 'ar'), name
LIMIT 10
instr() finds the position of the first occurrence of the pattern in question, so AR% will come before xxAR.
This has two benefits:
It should only do a table scan once. Unions and derived tables do three: the first two on the columns to filter out the patterns, and the third on the subset to find where they are equal, since UNION filters out dupes.
It gives a true sort based on the location of the pattern: Wx > xW > xxW > etc.
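The position-based ordering can be tried on made-up names as well. This sketch again assumes SQLite via Python's sqlite3 module; SQLite's instr() is case-sensitive, so the name is lower-cased first, which may not be needed in MySQL depending on collation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?)",
    [("Mark",), ("Edgar",), ("Arnold",), ("Carl",), ("Arthur",), ("Bob",)],  # hypothetical
)

# instr() returns the 1-based position of the first match, so names that
# start with 'ar' (position 1) sort ahead of names where it appears later.
rows = conn.execute("""
    SELECT name, instr(lower(name), 'ar') AS pos
    FROM employees
    WHERE name LIKE '%ar%'
    ORDER BY pos, name
    LIMIT 10
""").fetchall()
```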
Try this (don't have a MySQL instance immediately available to test with):
SELECT * FROM
(SELECT * FROM Employees WHERE Name LIKE 'ar%'
UNION
SELECT * FROM Employees WHERE Name LIKE '%ar%'
) AS matches
LIMIT 10
There are probably better ways to do it, but that immediately sprang to mind.
SELECT *
FROM Employees
WHERE Name LIKE 'ar%' OR Name LIKE '%ar%'
ORDER BY Name LIKE 'ar%' DESC
LIMIT 10
This should work: it orders by the boolean true/false result of the LIKE, and if the column is indexed it should benefit from the index.