I have a table of surveys which contains (amongst others) the following columns
survey_id - unique id
user_id - the id of the person the survey relates to
created - datetime
ip_address - of the submission
ip_count - the number of duplicates
Due to a large record set, it's impractical to run this query on the fly, so I am trying to create an update statement which will periodically store a "cached" result in ip_count.
The purpose of ip_count is to show the number of duplicate ip_address survey submissions that have been received for the same user_id within a 12-month period (+/- 6 months of the created date).
Using the following dataset, this is the expected result.
survey_id  user_id  created    ip_address   ip_count  # counted duplicates (survey_id)
1          1        01-Jan-12  123.132.123  1         # 2
2          1        01-Apr-12  123.132.123  2         # 1, 3
3          2        01-Jul-12  123.132.123  0         #
4          1        01-Aug-12  123.132.123  3         # 2, 6
6          1        01-Dec-12  123.132.123  1         # 4
This is the closest solution I have come up with so far, but the query fails to take the date restriction into account, and I am struggling to come up with an alternative method.
UPDATE surveys
JOIN(
SELECT ip_address, created, user_id, COUNT(*) AS total
FROM surveys
WHERE surveys.state IN (1, 3) # survey is marked as completed and confirmed
GROUP BY ip_address, user_id
) AS ipCount
ON (
ipCount.ip_address = surveys.ip_address
AND ipCount.user_id = surveys.user_id
AND ipCount.created BETWEEN (surveys.created - INTERVAL 6 MONTH) AND (surveys.created + INTERVAL 6 MONTH)
)
SET surveys.ip_count = ipCount.total - 1 # minus 1 as this query will match on its own id.
WHERE surveys.ip_address IS NOT NULL # ignore surveys where we have no ip_address
Thank you for your help in advance :)
A few (very) minor tweaks to what is shown above. Thank you again!
UPDATE surveys AS s
INNER JOIN (
SELECT x, count(*) c
FROM (
SELECT s1.id AS x, s2.id AS y
FROM surveys AS s1, surveys AS s2
WHERE s1.state IN (1, 3) # completed and verified
AND s1.id != s2.id # dont self join
AND s1.ip_address != "" AND s1.ip_address IS NOT NULL # not interested in blank entries
AND s1.ip_address = s2.ip_address
AND (s2.created BETWEEN (s1.created - INTERVAL 6 MONTH) AND (s1.created + INTERVAL 6 MONTH))
AND s1.user_id = s2.user_id # where completed for the same user
) AS ipCount
GROUP BY x
) n on s.id = n.x
SET s.ip_count = n.c
I don't have your table with me, so it's hard for me to write SQL that definitely works, but I can take a shot at this and hopefully help you.
First I would need to take the cartesian product of surveys against itself and filter out the rows I don't want
select s1.survey_id x, s2.survey_id y
from surveys s1, surveys s2
where s1.survey_id != s2.survey_id
and s1.ip_address = s2.ip_address
and (s1.created and s2.created fall within 6 months of each other)
The output of this should contain every pair of surveys that match (according to your rules) TWICE (once for each id in the 1st position and once for it to be in the 2nd position)
Then we can do a GROUP BY on the output of this to get a table that basically gives me the correct ip_count for each survey_id
(select x, count(*) c
from (
select s1.survey_id x, s2.survey_id y
from surveys s1, surveys s2
where s1.survey_id != s2.survey_id
and s1.ip_address = s2.ip_address
and (s1.created and s2.created fall within 6 months of each other)
) pairs # MySQL requires an alias on a derived table
group by x)
So now we have a table mapping each survey_id to its correct ip_count. To update the original table, we need to join that against this and copy the values over
So that should look something like
UPDATE surveys SET s.ip_count = n.c from surveys s inner join (ABOVE QUERY) n on s.survey_id = n.x
There is some pseudo code in there, but I think the general idea should work
I have never had to update a table based on the output of another query myself, so I tried to guess the right syntax from this question - How do I UPDATE from a SELECT in SQL Server? Note that this is SQL Server's UPDATE ... FROM form; in MySQL the join goes before SET (UPDATE surveys s INNER JOIN (...) n ON ... SET ...).
Also, if I needed to do something like this for my own work, I wouldn't attempt to do it in a single query. It would be a pain to maintain and might have memory/performance issues. It would be best to have a script traverse the table row by row, updating a single row in a transaction before moving on to the next. Much slower, but simpler to understand and possibly lighter on your database.
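To make the counting rule concrete, here is a minimal sketch of the same idea run against the question's sample rows. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so the date arithmetic (date(..., '-6 months')) and the correlated-subquery UPDATE are SQLite forms, not the MySQL syntax discussed above. The ISO date strings are an assumption, and the state filter is left out because the sample data has no state column. On this data the counts come out as 1, 2, 0, 2, 1; for survey 4 that agrees with the duplicates listed in the question (2 and 6) rather than with the 3 shown in its ip_count column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE surveys (
           survey_id  INTEGER PRIMARY KEY,
           user_id    INTEGER,
           created    TEXT,      -- ISO dates (assumption) so they compare as strings
           ip_address TEXT,
           ip_count   INTEGER DEFAULT 0
       )"""
)
conn.executemany(
    "INSERT INTO surveys (survey_id, user_id, created, ip_address) VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2012-01-01", "123.132.123"),
        (2, 1, "2012-04-01", "123.132.123"),
        (3, 2, "2012-07-01", "123.132.123"),
        (4, 1, "2012-08-01", "123.132.123"),
        (6, 1, "2012-12-01", "123.132.123"),
    ],
)

# For every survey, count the OTHER surveys with the same user and IP
# whose created date lies within +/- 6 months of this survey's date.
conn.execute(
    """UPDATE surveys
       SET ip_count = (
           SELECT COUNT(*)
           FROM surveys AS s2
           WHERE s2.survey_id != surveys.survey_id
             AND s2.user_id = surveys.user_id
             AND s2.ip_address = surveys.ip_address
             AND s2.created BETWEEN date(surveys.created, '-6 months')
                                AND date(surveys.created, '+6 months')
       )
       WHERE ip_address IS NOT NULL AND ip_address != ''"""
)

ip_counts = dict(conn.execute("SELECT survey_id, ip_count FROM surveys"))
print(ip_counts)
```

Unlike the INNER JOIN form, this variant also writes 0 for surveys that have no duplicates, because COUNT(*) over an empty set is 0.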
Related
So I am working with event data.
I need to identify when "X happens", store that data in a column and then identify when "in-production" happens.
Now, I just want the first "in-production" that shows up after "X happens" I do not care about the previous ones.
Note: Between "X happens" and "in-production" happens multiple states can exist.
What I have tried: CASE WHENs, self joins, WITH tables... all to no avail.
Any help is much appreciated, thanks a lot friends!
TblEvents
=========
EventID OrderID EventDate Status
1 2 01/02/2011 in-production
2 2 02/02/2011 pending
3 2 03/02/2011 on-hold
4 2 03/02/2011 stuck
5 2 03/02/2011 *X happens*
6 2 04/02/2011 pending
7 2 05/02/2011 *in-production*
Output table:
date of event X | date of "in-production"
03/02/2011 05/02/2011
Try this:
SELECT
OuterTblEvents.EventDate AS 'date of event X',
(
SELECT EventDate
FROM TblEvents
WHERE
TblEvents.Status = 'in-production'
AND TblEvents.EventID > OuterTblEvents.EventID
ORDER BY TblEvents.EventID
LIMIT 1
) AS 'date of "in-production"'
FROM TblEvents OuterTblEvents
WHERE OuterTblEvents.Status = 'X happens'
Note, though, that if the table is large and/or you run this often, performance may suffer, because the inner query is invoked once for each row found by the outer query.
Also, a composite index on Status, EventID should make both the outer and inner queries more performant. The outer query would use only the Status part of the index, and the inner query would use both Status and EventID. Note that the order of the columns in this composite index matters.
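As a rough check of the correlated subquery and the suggested index, the sketch below loads the question's sample rows into an in-memory SQLite database (a stand-in for MySQL). The aliases outer_ev and inner_ev and the index name are made up for illustration, and the extra OrderID match is an assumption that is not in the query above; it keeps events of different orders apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TblEvents ("
    "EventID INTEGER PRIMARY KEY, OrderID INTEGER, EventDate TEXT, Status TEXT)"
)
conn.executemany(
    "INSERT INTO TblEvents VALUES (?, ?, ?, ?)",
    [
        (1, 2, "01/02/2011", "in-production"),
        (2, 2, "02/02/2011", "pending"),
        (3, 2, "03/02/2011", "on-hold"),
        (4, 2, "03/02/2011", "stuck"),
        (5, 2, "03/02/2011", "X happens"),
        (6, 2, "04/02/2011", "pending"),
        (7, 2, "05/02/2011", "in-production"),
    ],
)

# Composite index in the column order the answer recommends (hypothetical name).
conn.execute("CREATE INDEX idx_status_eventid ON TblEvents (Status, EventID)")

result = conn.execute(
    """SELECT outer_ev.EventDate AS x_date,
              (SELECT inner_ev.EventDate
               FROM TblEvents AS inner_ev
               WHERE inner_ev.Status = 'in-production'
                 AND inner_ev.OrderID = outer_ev.OrderID  -- assumption: same order only
                 AND inner_ev.EventID > outer_ev.EventID
               ORDER BY inner_ev.EventID
               LIMIT 1) AS in_production_date
       FROM TblEvents AS outer_ev
       WHERE outer_ev.Status = 'X happens'"""
).fetchall()
print(result)
```

The earlier "in-production" row (EventID 1) is skipped because its EventID is lower than that of the "X happens" row.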
I have a mysql table-
User Value
A 1
A 12
A 3
B 4
B 3
B 1
C 1
C 1
C 8
D 34
D 1
E 1
F 1
G 56
G 1
H 1
H 3
C 3
F 3
E 3
G 3
I need to run a query which returns the 2nd distinct value that each user has.
That is, if two or more values are accessed by every user, rank them by number of occurrences and pick the 2nd one.
In the data above, 1 and 3 are accessed by every user. 1 occurs more often than 3, so the 2nd distinct value will be 3.
So I thought first I will get all distinct user.
create table temp AS Select distinct user from table;
Then I will have an outer query-
Select value from table where value in (...)
Programmatically, I could iterate through the values each user has using something like a Map, but I just couldn't write that as a Hive query.
This will return the second most frequent value among the values that span all users. There is no such value in the table as posted, which I expect is a typo in the data. In real data you will likely have multiple ties, and you need to decide how to handle them.
Select value as second_distinct from
(select value, rank() over (order by occurrences desc) as rank
from
(SELECT value, unique_users, max(count_users) as count_users, count(value) as occurrences
from
(select value, size(collect_set(user) over (partition by value))
as count_users from my_table
) t
left outer join
(select count(distinct user) as unique_users from my_table
) t2 on (1=1)
where unique_users=count_users
group by value, unique_users
) t3
) t4
where rank = 2;
This works. It returns NULL because there is only one value that every user has visited (the value 1). Value 3 is not a solution because not every user has seen that value in your data. I expect you intended 3 to be returned, but again, it doesn't span all the users (user D did not see value 3).
Not sure how #invoketheshell's answer was marked correct; it doesn't run and it needs 6 MR jobs. This will get you there in 4 and is less code.
Query:
select value
from (
select value, value_count, rank() over (order by value_count desc) rank
from (
select value, count(value) value_count
from (
select value, num_users, max(num_users) over () max_users
from (
select value
, size(collect_set(user) over (partition by value)) num_users
from db.table ) x ) y
where num_users = max_users
group by value ) z ) f
where rank = 2
Output:
3
EDIT: Let me clarify my solution as there seems to be some confusion. The OP's example says
"So as above 1 & 3 is being accessed by each User ... "
As my comment below the question suggests, in the example given, user D never accesses value 3. I assumed this was a typo and added that row to the dataset, and then added another 1 as well so that there are more 1's than 3's. So my code correctly returns 3, which was the desired output. If you run this script on the actual dataset, it will also produce the correct output, which is nothing, because there isn't a "2nd distinct". The only time it could produce an incorrect value is if there were no single number that was accessed by all users, which illustrates the point I was trying to make to #invoketheshell: if there is no single number that every user has accessed, running a query with 6 map-reduce jobs is an absurd way to find that out. Since we are using Hive, I believe it is fair to assume that if this were a "real-world" problem, it would most likely be executed on at least hundreds of TBs of data (probably more). In the interest of preserving time and resources, it would behoove an individual to at least check that one number had been accessed by all users before running a massive query whose analysis hinges on that assumption being true.
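The disagreement above is easy to reproduce on the posted rows. The sketch below is not Hive: it uses Python's sqlite3 module with a plain GROUP BY ... HAVING in place of the window functions, and the table and column names (access_log, user_name, val) are hypothetical. It keeps only values seen by every user, orders them by number of occurrences, and takes the second one. LIMIT/OFFSET does not treat ties the way rank() does, so tie handling is left open here.

```python
import sqlite3

ROWS = [
    ("A", 1), ("A", 12), ("A", 3), ("B", 4), ("B", 3), ("B", 1),
    ("C", 1), ("C", 1), ("C", 8), ("D", 34), ("D", 1), ("E", 1),
    ("F", 1), ("G", 56), ("G", 1), ("H", 1), ("H", 3), ("C", 3),
    ("F", 3), ("E", 3), ("G", 3),
]

QUERY = """
    SELECT val
    FROM access_log
    GROUP BY val
    -- keep only values that every distinct user has at least once
    HAVING COUNT(DISTINCT user_name) =
           (SELECT COUNT(DISTINCT user_name) FROM access_log)
    ORDER BY COUNT(*) DESC
    LIMIT 1 OFFSET 1   -- second most frequent; ties are not handled
"""


def second_distinct(rows):
    """Return the 2nd most frequent value shared by all users, or None."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE access_log (user_name TEXT, val INTEGER)")
    conn.executemany("INSERT INTO access_log VALUES (?, ?)", rows)
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row[0] if row else None


as_posted = second_distinct(ROWS)               # user D never has value 3
with_d3 = second_distinct(ROWS + [("D", 3)])    # the presumed typo, corrected
print(as_posted, with_d3)
```

On the data as posted only the value 1 spans all users, so there is no second value; once a (D, 3) row is added, 3 becomes the answer.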
I've noticed there are a few similar questions on StackOverflow, but nothing has worked for me so far. I'll try to keep this as short as possible.
I am building a query that needs to return a number of issues that may or may not have contracts, that may or may not be completed (completed_at would be set to a DateTime, not nil). Each row needs to include:
one row containing all the issue record's fields
the description from the budget_item
the completed_at date from the most recent contract that was completed (one budget_item could have 0 contracts, 1 contract, or 5+ contracts, and any number of them could be open (completed_at: nil) or closed (completed_at: DateTime))
This is what I have so far (it returns the correct number of rows, but it returns the most recently created contract, not the most recently completed one):
BaseItem.issues
.joins('LEFT JOIN budget_items
ON issues.id = budget_items.issue_id
LEFT JOIN contracts
ON budget_items.id = contracts.budget_item_id')
.select('issues.*, budget_items.description, contracts.completed_at AS resolved_at')
.group('issues.id')
.order('contracts.completed_at')
The code in the models is as follows:
class BaseItem < ActiveRecord::Base
has_many :issues
...
end
class Issue < ActiveRecord::Base
belongs_to :base_item
has_many :budget_items
...
end
class BudgetItem < ActiveRecord::Base
belongs_to :issue
has_many :contracts
...
end
class Contract < ActiveRecord::Base
belongs_to :budget_item
...
end
The end result needs to be something along the line of:
There will likely be multiple issues making up the different rows. Each issue has at least four budget_items. These are used only for the budget_item.description, which needs to appear in the final query, and to join each issue to its many contracts (each budget_item could have 2 or 3 contracts, so the issue could end up having 8-12 contracts). From those contracts, the query needs to order them by their completed_at attribute and return, AS resolved_at, only the most recent contract's completed_at date. If there were 4 contracts and two had completed_at: nil, the query should return the more recent of the two remaining completed_at dates as the resolved_at field of that particular issue.
Any help would be greatly appreciated and please let me know if I need to provide any additional information.
-Dave
The resulting query (from a comment):
SELECT issues.*, budget_items.description, contracts.completed_at AS resolved_at
FROM issues
LEFT JOIN budget_items ON issues.id = budget_items.issue_id
LEFT JOIN contracts ON budget_items.id = contracts.budget_item_id
WHERE issues.base_item_id = 6
GROUP BY issues.id
ORDER BY contracts.completed_at DESC
LIMIT 1
You don't need to show your actual data in the sample to make it useful...
This is what I mean by it. You could have put into your question the following sample data:
issues
id base_item_id
-- ------------
10 6
20 6
30 6
99 123
budget_items
id issue_id description
-- -------- -----------
1 10 'one contract, none completed'
2 20 'one contract, one completed'
3 30 'two contracts, one completed'
4 30 'three contracts, two completed'
contracts
id budget_item_id completed_at
-- -------------- ------------
1 1 NULL
2 2 2015-01-02
3 3 2015-01-03
4 3 NULL
5 4 2015-01-05
6 4 NULL
7 4 2015-01-07
expected result
issues.id contracts.completed_at budget_item.description
--------- ---------------------- -----------------------
10 NULL NULL
20 2015-01-02 one contract, one completed
30 2015-01-07 three contracts, two completed
Here is SQL Fiddle.
Is it what you want? Does my sample data cover all possible edge cases? If not, add more rows to it and show how it affects the result.
This is what the final query may look like. MySQL (before 8.0.14, which added LATERAL derived tables) doesn't have things like CROSS APPLY or LATERAL joins, so it is less efficient than in other databases: the same subquery has to be written, and run, twice.
I have no idea how to translate this SQL to Ruby - I never used Ruby.
SELECT
issues.*
,(
SELECT contracts.completed_at
FROM
budget_items
INNER JOIN contracts ON contracts.budget_item_id = budget_items.id
WHERE
budget_items.issue_id = issues.id
AND contracts.completed_at IS NOT NULL
ORDER BY contracts.completed_at DESC
LIMIT 1
) AS resolved_at
,(
SELECT budget_items.description
FROM
budget_items
INNER JOIN contracts ON contracts.budget_item_id = budget_items.id
WHERE
budget_items.issue_id = issues.id
AND contracts.completed_at IS NOT NULL
ORDER BY contracts.completed_at DESC
LIMIT 1
) AS description
FROM issues
WHERE issues.base_item_id = 6
The main idea is simple. We return one row for each issue. For each issue we find one latest contract using whatever conditions you need (like contracts.completed_at IS NOT NULL to look for completed contracts only).
If there are no completed contracts at all for an issue, it returns NULL for description and resolved_at. You can add an extra filter to the main query to remove such rows if this is what you want. Note that MySQL does not allow a column alias in WHERE, so filter on the alias with HAVING instead (WHERE issues.base_item_id = 6 HAVING resolved_at IS NOT NULL).
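To check the query against the sample data above, here is a small sketch using Python's sqlite3 module, with SQLite standing in for MySQL. It selects issues.id instead of issues.* to keep the output short, and the ORDER BY issues.id is added only to make the row order deterministic; both are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE issues (id INTEGER PRIMARY KEY, base_item_id INTEGER);
    CREATE TABLE budget_items (id INTEGER PRIMARY KEY, issue_id INTEGER, description TEXT);
    CREATE TABLE contracts (id INTEGER PRIMARY KEY, budget_item_id INTEGER, completed_at TEXT);

    INSERT INTO issues VALUES (10, 6), (20, 6), (30, 6), (99, 123);
    INSERT INTO budget_items VALUES
        (1, 10, 'one contract, none completed'),
        (2, 20, 'one contract, one completed'),
        (3, 30, 'two contracts, one completed'),
        (4, 30, 'three contracts, two completed');
    INSERT INTO contracts VALUES
        (1, 1, NULL), (2, 2, '2015-01-02'), (3, 3, '2015-01-03'), (4, 3, NULL),
        (5, 4, '2015-01-05'), (6, 4, NULL), (7, 4, '2015-01-07');
    """
)

# Both subqueries pick the same "latest completed contract" per issue:
# one returns its completed_at, the other its budget item's description.
rows = conn.execute(
    """SELECT issues.id,
              (SELECT contracts.completed_at
               FROM budget_items
               INNER JOIN contracts ON contracts.budget_item_id = budget_items.id
               WHERE budget_items.issue_id = issues.id
                 AND contracts.completed_at IS NOT NULL
               ORDER BY contracts.completed_at DESC
               LIMIT 1) AS resolved_at,
              (SELECT budget_items.description
               FROM budget_items
               INNER JOIN contracts ON contracts.budget_item_id = budget_items.id
               WHERE budget_items.issue_id = issues.id
                 AND contracts.completed_at IS NOT NULL
               ORDER BY contracts.completed_at DESC
               LIMIT 1) AS description
       FROM issues
       WHERE issues.base_item_id = 6
       ORDER BY issues.id"""
).fetchall()

for row in rows:
    print(row)
```

The three rows correspond to the expected result table: issue 10 has no completed contract, so both derived columns are NULL (None in Python).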
.order('contracts.completed_at DESC LIMIT 1')
(or do I not understand what is missing from your code?)
I have the following table with some sample data.
Record_ID Counter Serial Owner
1 0 AAA Jack
2 1 AAA Kevin
3 0 BBB Jane
4 1 BBB Wendy
Based on data similar to the above, I am trying to write a SQL query for MySQL that gets the record with the maximum Counter value per Serial number. The part I seem to be having trouble with is getting the query to get the last 50 unique serial numbers that were updated.
Below is the query I came up with so far based on this StackOverflow question.
SELECT *
FROM `history` his
INNER JOIN(SELECT serial,
Max(counter) AS MaxCount
FROM `tracking`
WHERE serial IN (SELECT serial
FROM `history`)
GROUP BY serial
ORDER BY record_id DESC) q
ON his.serial = q.serial
AND his.counter = q.maxcount
LIMIT 0, 50
It looks like a classic greatest-n-per-group problem, which can be solved by something like this:
select his.Record_ID, his.Counter, his.Serial, his.Owner
from History his
inner join(
select Serial, max(Counter) Counter
from History
group by Serial
) ss on his.Serial = ss.Serial and his.Counter = ss.Counter
If you are to have specific filters on your data set, you should apply the said filters in the sub-query.
Another source with more explanation on the problem here: SQL Select only rows with Max Value on a Column
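Here is a minimal sketch of that join on the question's four sample rows, using Python's sqlite3 module in place of MySQL. The ORDER BY his.Record_ID DESC LIMIT 50 at the end is an assumption: it treats a higher Record_ID as "updated more recently", which the question implies but does not state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE History ("
    "Record_ID INTEGER PRIMARY KEY, Counter INTEGER, Serial TEXT, Owner TEXT)"
)
conn.executemany(
    "INSERT INTO History VALUES (?, ?, ?, ?)",
    [
        (1, 0, "AAA", "Jack"),
        (2, 1, "AAA", "Kevin"),
        (3, 0, "BBB", "Jane"),
        (4, 1, "BBB", "Wendy"),
    ],
)

# The derived table finds the highest Counter per Serial; joining back on
# (Serial, Counter) recovers the full row that holds that maximum.
latest = conn.execute(
    """SELECT his.Record_ID, his.Counter, his.Serial, his.Owner
       FROM History AS his
       INNER JOIN (
           SELECT Serial, MAX(Counter) AS Counter
           FROM History
           GROUP BY Serial
       ) AS ss ON his.Serial = ss.Serial AND his.Counter = ss.Counter
       ORDER BY his.Record_ID DESC   -- assumption: higher Record_ID = more recent
       LIMIT 50"""
).fetchall()
print(latest)
```

Ordering and limiting the outer query, rather than the subquery, is what restricts the result to the most recent serials; an ORDER BY inside the grouped subquery has no defined effect on which rows the join returns.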
I have a very specific query that is acting up and I could use any help at all with debugging it.
There are 4 tables involved in this query.
Transaction_Type
Transaction_ID (primary)
Transaction_amount
Transaction_Type
Transaction
Transaction_ID (primary)
Timestamp
Purchase
Transaction_ID
Item_ID
Item
Item_ID
Client_ID
Let's say there is a transaction in which someone pays $20 in cash and $0 in credit. It inserts two rows into the Transaction_Type table.
//row 1
Transaction_ID: 1
Transaction_amount: 20.00
Transaction_type: cash
//row 2
Transaction_ID: 1
Transaction_amount: 0.00
Transaction_type: credit
here is the specific query:
SELECT
tt.Transaction_Amount, tt.Transaction_ID
FROM
ItemTracker_dbo.Transaction_Type tt
JOIN
ItemTracker_dbo.Transaction t
ON
tt.Transaction_ID = t.Transaction_ID
JOIN
ItemTracker_dbo.Purchase p
ON
p.Transaction_ID = tt.Transaction_ID
JOIN
ItemTracker_dbo.Item i
ON
i.Item_ID = p.Item_ID
WHERE
t.TimeStamp >= "2010-01-06 00:00:00" AND t.TimeStamp <= "2010-01-06 23:59:59"
AND
tt.Transaction_Format IN ('cash', 'credit')
AND
i.Client_ID = 3
when I execute this query, it returns 4 rows for a specific transaction. (it should be 2)
When I remove ALL where clauses and insert WHERE tt.Transaction_ID = problematicID it only returns two.
EDIT:::::
still repeats upon changing date range
The kicker:
When I change the initial daterange it only returns two rows for that specific transaction_id.
::::
Is it the way I use join? that's all I can think of...
EDIT: This is the problem
In Purchase, two separate purchase_IDs can have the same transaction_ID (purchase_ID breaks down specific item sales).
There are duplicate Transaction_ID rows in Purchase.
We would need to see all the data in all the tables to know exactly where the problem is. However, since the joins are producing the extra rows, the cause is that one of your tables has two matching rows where you think it has only one.
There's a problem with your schema. You have rows with the same transaction_id, which is the primary key. I would think they couldn't be marked primary in that database. With two rows with the same id, that could cause unexpected extra rows to come back from the join(s).
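The row multiplication described in the EDIT can be reproduced with a few rows. The sketch below uses Python's sqlite3 module; the table layout is a simplified guess at the schema (the Transaction table and the date filter are left out, and the item ids 100 and 101 are made up). The EXISTS rewrite is one possible way to avoid the duplicates, not necessarily what the original poster ended up using.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transaction_type (transaction_id INTEGER, amount REAL, type TEXT);
    CREATE TABLE purchase (purchase_id INTEGER PRIMARY KEY, transaction_id INTEGER, item_id INTEGER);
    CREATE TABLE item (item_id INTEGER PRIMARY KEY, client_id INTEGER);

    -- One transaction paid as $20 cash + $0 credit: two payment rows.
    INSERT INTO transaction_type VALUES (1, 20.00, 'cash'), (1, 0.00, 'credit');
    -- The same transaction covers two purchased items: two purchase rows.
    INSERT INTO purchase VALUES (1, 1, 100), (2, 1, 101);
    INSERT INTO item VALUES (100, 3), (101, 3);
    """
)

# Plain joins: every payment row pairs with every purchase row (2 x 2 = 4).
joined = conn.execute(
    """SELECT tt.amount, tt.transaction_id
       FROM transaction_type AS tt
       JOIN purchase AS p ON p.transaction_id = tt.transaction_id
       JOIN item AS i ON i.item_id = p.item_id
       WHERE tt.type IN ('cash', 'credit') AND i.client_id = 3"""
).fetchall()

# EXISTS only asks whether a matching purchase exists, so each payment
# row is returned once no matter how many items were bought.
deduped = conn.execute(
    """SELECT tt.amount, tt.transaction_id
       FROM transaction_type AS tt
       WHERE tt.type IN ('cash', 'credit')
         AND EXISTS (
             SELECT 1
             FROM purchase AS p
             JOIN item AS i ON i.item_id = p.item_id
             WHERE p.transaction_id = tt.transaction_id
               AND i.client_id = 3)"""
).fetchall()

print(len(joined), len(deduped))
```

SELECT DISTINCT would also collapse the four rows here, but it would wrongly merge two genuine payments of the same amount and type, which is why EXISTS is used in this sketch.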