Please help me write a SQL script that collates data.
The key difficulty is that I need to create an additional column to sort on.
I have tried to describe the situation in as much detail as possible.
Let's get started. There is a table of the following form:
Given a user ID, the query should return only the data that this user does not have but other users do.
Next step: sort by the artificially created column.
I'll go through it step by step.
So what do I mean by an artificial column?
This column will contain the difference between the marks. To get it, you first need to perform a number of actions:
For every data item that both the given user and another user have rated, calculate the difference between their marks, then take the average of those differences for each other user.
The following two pictures show the same data and then the calculation itself. It seems pretty simple to me.
The column is calculated as follows:
User with id 2:
1: 5 - 1 = 4;
2: 2 - 9 = -7;
3: the next data item that user 1 has is absent for user 2, so we simply skip it;
User with id 3:
1: 3 - 1 = 2;
2: the next data item is absent for the user with id 3;
3: 8 - 9 = -1;
4: 6 - 2 = 4;
5: skipped;
In the end:
User_2 will have new mark = (4 - 7) / 2 = -1.5
User_3 will have new mark = (2 - 1 + 4) / 3 = 1.66666
And in the end I need to return the table:
But that's not all. The data will often be duplicated, and I'd like to get the average of the results obtained. Please look at the following example:
And that is all. I really need your help, experts. I am teaching myself SQL, but it is very difficult for me.
I had the idea of writing the script as follows:
SELECT d.data, (d.mark + myCount(d.user, 1)) newOrder
FROM info d
WHERE -- data from user_1 NOT equal data from other users
ORDER BY newOrder;
But this script would take a long time to run, because it calls a custom function that issues a query for every record, when one query per user would be enough. I hope someone will be able to cope with this task.
Following your steps:
First, we need to isolate the data from the selected user (let's assume it's 1):
CREATE TEMP TABLE sel_user AS
SELECT data, mark FROM info d WHERE user = 1;
Now, we calculate the mark for every other user (again, the selected user is 1):
SELECT d.user user, d.mark - s.mark mark
FROM info d JOIN sel_user s USING (data)
WHERE d.user <> 1;
Result:
user        mark
----------  ----------
2           4
2           -7
3           2
3           -1
3           4
We can query just the average:
SELECT d.user user, AVG(d.mark - s.mark) mark
FROM info d JOIN sel_user s USING (data)
WHERE d.user <> 1 GROUP BY user;
user        mark
----------  ----------
2           -1.5
3           1.66666666
But you still want to do calculations with the marks that do not correspond to user 1:
SELECT d.user user, mark FROM info d
WHERE d.user <> 1 AND d.data NOT IN (SELECT data FROM sel_user);
user        mark
----------  ----------
2           4
3           3
3           10
Specifically, you want to add the previously calculated average to each row:
SELECT d.user user, d.data, d.mark + d2.mark AS neworder FROM info d JOIN (
SELECT d.user user, AVG(d.mark - s.mark) mark
FROM info d JOIN sel_user s USING (data)
WHERE d.user <> 1 GROUP BY user
) d2 USING (user)
WHERE d.data NOT IN (SELECT data FROM sel_user)
ORDER BY neworder DESC;
user        data        neworder
----------  ----------  ----------------
3           6           11.6666666666667
3           3           4.66666666666667
2           5           2.5
And your last request is to get the average for each data:
SELECT data, AVG(neworder) final FROM (
SELECT d.user user, d.data, d.mark + d2.mark AS neworder FROM info d JOIN (
SELECT d.user user, AVG(d.mark - s.mark) mark
FROM info d JOIN sel_user s USING (data)
WHERE d.user <> 1 GROUP BY user
) d2 USING (user)
WHERE d.data NOT IN (SELECT data FROM sel_user)
)
GROUP BY data
ORDER BY final DESC;
data        final
----------  ----------------
6           11.6666666666667
3           4.66666666666667
5           2.5
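The steps above can also be folded into one parameterized query, so no temp table is needed. Here is a minimal sketch using Python's built-in sqlite3 module. The question's table was only shown in pictures, so the rows below are an assumed dataset, chosen so that it reproduces the differences and results listed above. The CTE name avg_diff is just an illustrative name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE info (user INTEGER, data INTEGER, mark INTEGER);
-- Assumed sample rows (not the question's original table).
INSERT INTO info VALUES
  (1, 1, 1), (1, 2, 9), (1, 4, 2),
  (2, 1, 5), (2, 2, 2), (2, 5, 4),
  (3, 1, 3), (3, 2, 8), (3, 4, 6), (3, 3, 3), (3, 6, 10);
""")

QUERY = """
WITH sel_user AS (              -- data already rated by the selected user
    SELECT data, mark FROM info WHERE user = :uid
),
avg_diff AS (                   -- average mark difference per other user
    SELECT d.user AS user, AVG(d.mark - s.mark) AS diff
    FROM info d JOIN sel_user s ON s.data = d.data
    WHERE d.user <> :uid
    GROUP BY d.user
)
SELECT d.data, AVG(d.mark + a.diff) AS final
FROM info d JOIN avg_diff a ON a.user = d.user
WHERE d.data NOT IN (SELECT data FROM sel_user)
GROUP BY d.data
ORDER BY final DESC
"""

rows = conn.execute(QUERY, {"uid": 1}).fetchall()
for data, final in rows:
    print(data, round(final, 4))
```

With the assumed rows this prints data 6, 3 and 5 in that order, with the same final values as the last result table. Passing a different uid re-runs the whole calculation for another selected user.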
Related
I tried to write a query that selects rows with steps that both user 1 and user 2 did, with the combined number of times they did the step (i.e., if user 1 did step 1 3 times and user 2 did it 1 time, then the count should show 4 times).
When I use the condition user_id = 1 and user_id = 2, there is no error, but the query returns nothing when it should return some rows with values.
There are two tables, step and step_taken.
Table step has the columns id and title.
Table step_taken has the columns id, user_id (who performed the step), and step_id.
I want to find the steps that both users, with ids 1 and 2, have done,
and I also want a count that adds up how many times they performed each step.
For example, if user id 1 did the step named meditation 2 times,
and user id 2 did the step named meditation 3 times,
the result I want should look like this:
------------------------------
title | number_of_times
------------------------------
meditation| 5
------------------------------
Here is my SQL query:
select title, count(step_taken.step_id)as number_of_times
from step join step_taken
on step.id = step_taken.step_id
where user_id = 1 and user_id=2
group by title;
It returns nothing, but it should return the steps that both user 1 and user 2 did.
When I write the same query with only user_id = 1 or only user_id = 2, it shows the expected information.
How can I fix my code so that it shows the information I want?
thanks in advance :)
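user_id cannot be 1 and 2 at the same time in a single row. You need to join step_taken a second time, once per user, and then count the distinct rows from each side:
select s.title, count(distinct u1.id) + count(distinct u2.id) as number_of_times
from step s
join step_taken u1 on u1.step_id = s.id
join step_taken u2 on u2.step_id = s.id
where u1.user_id = 1 and u2.user_id = 2
group by s.title;
Note: this assumes title is in step and step_taken.step_id refers to step.id, as the question describes. The distinct counts are needed because the self-join pairs every user 1 row with every user 2 row.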
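An alternative that avoids the second join is to filter on both users with IN and keep only the steps where two distinct users appear. The sketch below uses Python's built-in sqlite3 module. The sample rows, including the extra 'walking' step, are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE step (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE step_taken (id INTEGER PRIMARY KEY, user_id INTEGER, step_id INTEGER);
INSERT INTO step VALUES (1, 'meditation'), (2, 'walking');
-- Made-up rows: user 1 meditates twice, user 2 three times,
-- and only user 1 takes the 'walking' step.
INSERT INTO step_taken (user_id, step_id) VALUES
  (1, 1), (1, 1),
  (2, 1), (2, 1), (2, 1),
  (1, 2);
""")

rows = conn.execute("""
SELECT s.title, COUNT(*) AS number_of_times
FROM step s
JOIN step_taken t ON t.step_id = s.id
WHERE t.user_id IN (1, 2)
GROUP BY s.title
HAVING COUNT(DISTINCT t.user_id) = 2   -- both users must appear
""").fetchall()
print(rows)
```

Here 'walking' is dropped by the HAVING clause because only one of the two users did it, and 'meditation' is counted 2 + 3 = 5 times.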
I have a history mapping table for UserId changes: every time a UserId changes, a row with the new UserId and the old UserId is inserted into the history table.
Below is the sample table and data:
UserIdNew | UserIdOld
---------------------
5 | 1
10 | 5
15 | 10
The above data shows that UserId 1 has gone through the following transitions: 1 -> 5 -> 10 -> 15.
I want to query all the old IDs for a given UserIdNew. How can I do it in a single query?
In this case, if UserIdNew = 15, it should return 1, 5, 10.
If each UserIdNew is always greater than the previous (older) one in a chain of UserIds, i.e. if cases like 10->20->5->1 never happen, this query can do the job (not fully tested; new and old are used instead of your field names, and 15 is the UserIdNew being looked up):
SELECT
CASE
WHEN new=15 THEN @seq:=concat(new,',',old)
WHEN substring_index(@seq,',',-1)=new THEN @seq:=concat(@seq,',',old)
ELSE @seq
END AS SEQUENCE
FROM (SELECT * FROM UserIdsTable ORDER BY new DESC) AS SortedIds
ORDER BY SEQUENCE DESC
LIMIT 1
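Where recursive common table expressions are available (MySQL has them from 8.0, and SQLite supports them too), the chain can be walked directly, without relying on new ids being larger than old ones. Below is a minimal sketch using Python's built-in sqlite3 module. The table name user_id_history is an assumption, since the question does not name the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- user_id_history is a hypothetical name for the question's mapping table.
CREATE TABLE user_id_history (UserIdNew INTEGER, UserIdOld INTEGER);
INSERT INTO user_id_history VALUES (5, 1), (10, 5), (15, 10);
""")

QUERY = """
WITH RECURSIVE chain(old_id) AS (
    -- start with the direct predecessor of the requested id
    SELECT UserIdOld FROM user_id_history WHERE UserIdNew = ?
    UNION                       -- UNION (not UNION ALL) stops on cycles
    SELECT h.UserIdOld
    FROM user_id_history h
    JOIN chain c ON h.UserIdNew = c.old_id
)
SELECT old_id FROM chain ORDER BY old_id
"""

old_ids = [row[0] for row in conn.execute(QUERY, (15,))]
print(old_ids)
```

For the sample data and UserIdNew = 15 this collects 1, 5 and 10. The final ORDER BY only sorts the collected ids; it does not recover the order of the transitions.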
I have a mysql table-
User Value
A 1
A 12
A 3
B 4
B 3
B 1
C 1
C 1
C 8
D 34
D 1
E 1
F 1
G 56
G 1
H 1
H 3
C 3
F 3
E 3
G 3
I need to run a query that returns the second distinct value that every user has.
That is, if there are two values that have been accessed by every user, pick the second one based on the number of occurrences.
In the data above, 1 and 3 are accessed by every user. 1 occurs more often than 3, so the second distinct value is 3.
So I thought I would first get all the distinct users:
create table temp AS Select distinct user from table;
Then I would have an outer query:
Select value from table where value in (...)
Programmatically, I could iterate through each of the values a user has, as with a Map, but I just couldn't write that as a Hive query.
This will return the second most frequent value from your list that spans all users. There isn't one of these values in the table, which I expect is a typo in the data. In real data you will likely have multiple ties that you need to figure out how to handle.
Select value as second_distinct from
(select value, rank() over (order by occurrences desc) as rank
from
(SELECT value, unique_users, max(count_users) as count_users, count(value) as occurrences
from
(select value, size(collect_set(user) over (partition by value))
as count_users from my_table
) t
left outer join
(select count(distinct user) as unique_users from my_table
) t2 on (1=1)
where unique_users=count_users
group by value, unique_users
) t3
) t4
where rank = 2;
This works. It returns NULL because there is only one value that every user has visited (the value 1). Value 3 is not a solution because not every user has seen that value in your data. I expect you intended that 3 should be returned, but again it doesn't span all the users (user D did not see value 3).
Not sure how @invoketheshell's answer was marked correct; it doesn't run and it needs 6 MR jobs. This will get you there in 4 and is less code.
Query:
select value
from (
select value, value_count, rank() over (order by value_count desc) rank
from (
select value, count(value) value_count
from (
select value, num_users, max(num_users) over () max_users
from (
select value
, size(collect_set(user) over (partition by value)) num_users
from db.table ) x ) y
where num_users = max_users
group by value ) z ) f
where rank = 2
Output:
3
EDIT: Let me clarify my solution as there seems to be some confusion. The OP's example says
"So as above 1 & 3 is being accessed by each User ... "
As my comment below the question suggests, in the example given, user D never accesses value 3. I made the assumption that this was a typo and added this to the dataset and then added another 1 as well to make there be more 1's than 3's. So my code correctly returns 3, which was the desired output. If you run this script on the actual dataset it will also produce the correct output, which is nothing, because there isn't a "2nd Distinct". The only time it could produce an incorrect value is if there was no one specific number that was accessed by all users, which illustrates the point I was trying to make to @invoketheshell: if there is no single number that every user has accessed, running a query with 6 map-reduce jobs is an absurd way to find that out. Since we are using Hive, I believe it would be fair to assume that if this problem were a "real-world" problem, it would most likely be executed on at least 100's of TBs of data (probably more). In the interest of preserving time and resources, it would behoove an individual to at least check that one number had been accessed by all users before running a massive query whose analysis hinges on that assumption being true.
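The behaviour both answers describe can be checked on the question's data without a Hive cluster. The sketch below uses Python's built-in sqlite3 module and plain GROUP BY / HAVING in place of the Hive window functions. The table name visits and the helper second_common_value are made-up names, and ties in occurrence counts are not handled.

```python
import sqlite3

# The question's rows, in the order given.
ROWS = [("A", 1), ("A", 12), ("A", 3), ("B", 4), ("B", 3), ("B", 1),
        ("C", 1), ("C", 1), ("C", 8), ("D", 34), ("D", 1), ("E", 1),
        ("F", 1), ("G", 56), ("G", 1), ("H", 1), ("H", 3), ("C", 3),
        ("F", 3), ("E", 3), ("G", 3)]

QUERY = """
SELECT value
FROM visits
GROUP BY value
-- keep only values seen by every distinct user
HAVING COUNT(DISTINCT user) = (SELECT COUNT(DISTINCT user) FROM visits)
ORDER BY COUNT(*) DESC          -- most frequent first
LIMIT 1 OFFSET 1                -- take the second one
"""

def second_common_value(rows):
    """Return the second most frequent value shared by all users, or None."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (user TEXT, value INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    found = conn.execute(QUERY).fetchone()
    conn.close()
    return found[0] if found else None

print(second_common_value(ROWS))               # data as posted
print(second_common_value(ROWS + [("D", 3)]))  # with the presumed missing row
```

On the data as posted the function returns None, because only the value 1 is shared by all eight users. Once user D also has the value 3, it returns 3.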
For each area in my game I have many levels that can be achieved. Once a user earns a certain number of points in an area, his 'progress level' increases for that particular area. I have two tables in my database. One stores the progress of the user for a particular area of my game:
Table A
userID | areaID | progressLevel | total points earnt
1      | 1      | 1             | 1000
1      | 2      | 1             | 500
Another table, B, stores how many points are required to increase the progress level:
areaID | progressLevel | points required
1      | 2             | 5000
1      | 3             | 9000
1      | 4             | 11000
2      | 2             | 9999
When enough points are achieved by the user then I check table B and increase the progress level of the user in table A. For example, if user 1 earns over 5000 points in area 1, I would update table A and set progress level = 2.
My problem is I want to write a query to obtain, for a particular user, all their progress levels for each area as well as the number of points required for the next level. For example, for user with id 1, I would like:
areaID | progressLevelCurrent | total points earnt | points required for next progress level
1      | 1                    | 1000               | 4000
2      | 1                    | 500                | 9499
Is it possible to do this in a single query?
How about this:
select A.areaID, A.progressLevel as progressLevelCurrent, A.`total points earnt`, B.`points required` - A.`total points earnt` as `points required for next progress level`
from A
inner join B on A.areaID = B.areaID and (A.progressLevel + 1) = B.progressLevel
where B.`points required` > A.`total points earnt`;
SELECT
areaID,
progressLevel AS CurrentLevel,
`total points earnt` AS TotalNow,
(
SELECT (TableB.`points required` - TableA.`total points earnt`)
FROM TableB
WHERE TableB.areaID = TableA.areaID
AND TableB.progressLevel = (TableA.progressLevel+1)
) AS ToNextLevel
FROM TableA
WHERE userID = ##
Edited to add:
You could also use a join, which would be a more efficient use of server capacity. The left join will return a result for a person even if the person is at the highest level, i.e. when there is no row matching TableA.progressLevel+1:
SELECT
TableA.areaID,
TableA.progressLevel AS CurrentLevel,
`total points earnt` AS TotalNow,
(TableB.`points required` - TableA.`total points earnt`) AS ToNextLevel
FROM TableA
LEFT JOIN TableB ON TableB.areaID = TableA.areaID
AND TableB.progressLevel = (TableA.progressLevel+1)
WHERE userID = ##
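To check the join approach against the question's sample rows, here is a small sketch using Python's built-in sqlite3 module. The column names total_points and points_required are simplified stand-ins for the question's spaced column names, and the join matches on both areaID and the next progress level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- total_points / points_required are renamed stand-ins for the
-- question's `total points earnt` and `points required` columns.
CREATE TABLE A (userID INTEGER, areaID INTEGER,
                progressLevel INTEGER, total_points INTEGER);
CREATE TABLE B (areaID INTEGER, progressLevel INTEGER,
                points_required INTEGER);
INSERT INTO A VALUES (1, 1, 1, 1000), (1, 2, 1, 500);
INSERT INTO B VALUES (1, 2, 5000), (1, 3, 9000), (1, 4, 11000), (2, 2, 9999);
""")

QUERY = """
SELECT A.areaID,
       A.progressLevel,
       A.total_points,
       B.points_required - A.total_points AS to_next_level
FROM A
LEFT JOIN B ON B.areaID = A.areaID
           AND B.progressLevel = A.progressLevel + 1
WHERE A.userID = ?
ORDER BY A.areaID
"""

rows = conn.execute(QUERY, (1,)).fetchall()
print(rows)
```

For user 1 this gives 4000 points to go in area 1 and 9499 in area 2, matching the table in the question. If a user were already at the top level of an area, to_next_level would come back as NULL (None in Python) because of the left join.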
I have a table of surveys which contains (amongst others) the following columns
survey_id - unique id
user_id - the id of the person the survey relates to
created - datetime
ip_address - of the submission
ip_count - the number of duplicates
Due to the large record set, it's impractical to run this query on the fly, so I am trying to create an update statement that will periodically store a "cached" result in ip_count.
The purpose of ip_count is to show the number of duplicate ip_address survey submissions that have been received for the same user_id within a 12 month period (+/- 6 months of the created date).
Using the following dataset, this is the expected result.
survey_id  user_id  created    ip_address   ip_count  # counted duplicates (survey_id)
1          1        01-Jan-12  123.132.123  1         # 2
2          1        01-Apr-12  123.132.123  2         # 1, 3
3          2        01-Jul-12  123.132.123  0         #
4          1        01-Aug-12  123.132.123  3         # 2, 6
6          1        01-Dec-12  123.132.123  1         # 4
This is the closest solution I have come up with so far, but the query fails to take the date restriction into account, and I am struggling to come up with an alternative method.
UPDATE surveys
JOIN(
SELECT ip_address, created, user_id, COUNT(*) AS total
FROM surveys
WHERE surveys.state IN (1, 3) # survey is marked as completed and confirmed
GROUP BY ip_address, user_id
) AS ipCount
ON (
ipCount.ip_address = surveys.ip_address
AND ipCount.user_id = surveys.user_id
AND ipCount.created BETWEEN (surveys.created - INTERVAL 6 MONTH) AND (surveys.created + INTERVAL 6 MONTH)
)
SET surveys.ip_count = ipCount.total - 1 # minus 1 as this query will match on its own id.
WHERE surveys.ip_address IS NOT NULL # ignore surveys where we have no ip_address
Thank you for your help in advance :)
A few (very) minor tweaks to what is shown above. Thank you again!
UPDATE surveys AS s
INNER JOIN (
SELECT x, count(*) c
FROM (
SELECT s1.id AS x, s2.id AS y
FROM surveys AS s1, surveys AS s2
WHERE s1.state IN (1, 3) # completed and verified
AND s1.id != s2.id # dont self join
AND s1.ip_address != "" AND s1.ip_address IS NOT NULL # not interested in blank entries
AND s1.ip_address = s2.ip_address
AND (s2.created BETWEEN (s1.created - INTERVAL 6 MONTH) AND (s1.created + INTERVAL 6 MONTH))
AND s1.user_id = s2.user_id # where completed for the same user
) AS ipCount
GROUP BY x
) n on s.id = n.x
SET s.ip_count = n.c
I don't have your table with me, so it's hard for me to write SQL that definitely works, but I can take a shot at this and will hopefully be able to help you.
First I would need to take the cartesian product of surveys against itself and filter out the rows I don't want
select s1.survey_id x, s2.survey_id y from surveys s1, surveys s2 where s1.survey_id != s2.survey_id and s1.ip_address = s2.ip_address and s2.created between (s1.created - interval 6 month) and (s1.created + interval 6 month)
The output of this should contain every pair of surveys that match (according to your rules) TWICE (once for each id in the 1st position and once for it to be in the 2nd position)
Then we can do a GROUP BY on the output of this to get a table that basically gives me the correct ip_count for each survey_id
(select x, count(*) c from (select s1.survey_id x, s2.survey_id y from surveys s1, surveys s2 where s1.survey_id != s2.survey_id and s1.ip_address = s2.ip_address and s2.created between (s1.created - interval 6 month) and (s1.created + interval 6 month)) pairs group by x)
So now we have a table mapping each survey_id to its correct ip_count. To update the original table, we need to join that against this and copy the values over
So that should look something like
UPDATE surveys SET s.ip_count = n.c from surveys s inner join (ABOVE QUERY) n on s.survey_id = n.x
There is some pseudo code in there, but I think the general idea should work
I have never had to update a table based on the output of another query myself before. I tried to guess the right syntax for doing this from this question: How do I UPDATE from a SELECT in SQL Server?
Also, if I needed to do something like this for my own work, I wouldn't attempt to do it in a single query. It would be a pain to maintain and might have memory/performance issues. It would be best to have a script traverse the table row by row, updating a single row in a transaction before moving on to the next row. Much slower, but simpler to understand and possibly lighter on your database.
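The pairwise-count idea can be tried on the question's five rows with a correlated subquery instead of a join. The sketch below uses Python's built-in sqlite3 module, so the date arithmetic is SQLite's date() function and not MySQL's INTERVAL. The dates are rewritten as ISO strings and the state filter is left out, both simplifications made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE surveys (survey_id INTEGER PRIMARY KEY, user_id INTEGER,
                      created TEXT, ip_address TEXT, ip_count INTEGER);
-- The question's sample rows, with dates rewritten as ISO strings.
INSERT INTO surveys (survey_id, user_id, created, ip_address) VALUES
  (1, 1, '2012-01-01', '123.132.123'),
  (2, 1, '2012-04-01', '123.132.123'),
  (3, 2, '2012-07-01', '123.132.123'),
  (4, 1, '2012-08-01', '123.132.123'),
  (6, 1, '2012-12-01', '123.132.123');
""")

conn.execute("""
UPDATE surveys
SET ip_count = (
    SELECT COUNT(*)
    FROM surveys AS other
    WHERE other.survey_id <> surveys.survey_id      -- skip the row itself
      AND other.user_id = surveys.user_id           -- same user
      AND other.ip_address = surveys.ip_address     -- same IP
      AND other.created BETWEEN date(surveys.created, '-6 months')
                            AND date(surveys.created, '+6 months')
)
WHERE ip_address IS NOT NULL AND ip_address <> ''
""")

counts = dict(conn.execute("SELECT survey_id, ip_count FROM surveys"))
print(counts)
```

Under the +/- 6 month rule as stated, this stores counts of 1, 2, 0, 2 and 1 for surveys 1, 2, 3, 4 and 6. Survey 4 comes out as 2 (surveys 2 and 6, the two ids listed next to it in the question), not the 3 shown in its ip_count column.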