Let's say I have the following data:
id disease
1 0
1 1
1 0
2 0
2 1
3 0
4 0
4 0
I would like to remove the duplicate observations in Stata.
For example
id disease
1 1
2 1
3 0
4 0
For group id=1, keep observation 2
For group id=2, keep observation 2
For group id=3, keep observation 1 (because it has only 1 obs)
For group id=4, keep observation 1 (or any of them but one obs)
I am trying Stata's duplicates command,
duplicates tag id if disease==0, generate(info)
drop if info==1
but it's not working as I require.
It is no surprise that duplicates does not do what you want, as it does not fit your problem. For example, the observation with id == 2, disease == 0 is not a duplicate of any other observation. More generally, duplicates does not purport to be a general-purpose command for dropping observations you don't want.
Your criteria appear to be
Keep one observation for each id.
If an id has any observation with disease == 1, that is the one to keep.
A solution to that is
bysort id (disease) : keep if _n == _N
That keeps the last observation for each distinct id: after sorting within id on disease, observations with disease == 1 are necessarily at the end of each group.
Sorry for the beginner question.
I have an Outputs table:
ID  value
0   x
1   y
2   z
And an Inputs table that is linked to the Outputs through the outputsID:
ID  outputsID  name
0   0          A
1   1          B
2   1          C
3   2          B
4   2          C
Assuming that multiple outputs share at least one input (in this example inputs 1 and 3 are both B, and inputs 2 and 4 are both C), is there a way to avoid the duplication of entries in my Inputs table (input IDs 3 and 4)?
The 'normal' answer to your question is no. Rows 1 and 2 address output 1, and rows 3 and 4 address output 2. They aren't duplicates; each reflects something distinct.
So if you are a beginner, I would say you shouldn't want to get rid of these rows.
That said, there are some more advanced techniques. For example, you could have the outputsID column be an array with multiple values. This is harder, more complex, and non-standard.
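A minimal sketch of that array idea, in PostgreSQL syntax (array columns are non-standard SQL; the table and column spellings here are assumptions):

-- Each input row lists every output it feeds, so B and C appear once
-- instead of twice.
CREATE TABLE inputs (
    id        integer PRIMARY KEY,
    outputIDs integer[],  -- PostgreSQL array column
    name      text
);

INSERT INTO inputs VALUES
    (0, ARRAY[0],    'A'),
    (1, ARRAY[1, 2], 'B'),  -- B feeds outputs 1 and 2
    (2, ARRAY[1, 2], 'C');  -- C feeds outputs 1 and 2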
I have a result set that contains order_ids, a total for that order, and the quantities of items within.
Some totals are negative (if a refund has occurred) and others are positive. I would like to work out a count of the orders whose order_total doesn't net out against the negative values.
orders_id order_total products_quantity customers_id
--------- ------------- ----------------- --------------
1140898 -99.95830000 -1 459800
1140868 99.95830000 1 459800
1140867 99.95833333 1 459800
866932 -106.33333333 -2 459800
860100 125.08333333 3 459800
857864 106.33333333 2 459800
Would result in
orders_id order_total products_quantity customers_id
--------- ------------- ----------------- --------------
1140867 99.95833333 1 459800
860100 125.08333333 3 459800
I've attempted to write a cursor to iterate over each result, storing the last order_total and checking the current row for a diff.
This works as long as the negative order comes immediately before or after the positive. Unfortunately, this won't always be the case.
Can anyone explain what approach/methods I should use to ensure the output above is achieved?
Based on your description, the problem is impossible. Consider:
orders_id order_total customers_id
--------- ------------- --------------
1 -100 1
2 50 1
3 50 1
4 50 1
(I assume that each order only affects the "net" for its own customer.)
In the case above, orders_id=1 might be considered to offset 2 and 3, leaving 4 in the output; or 3 and 4, leaving 2; or 2 and 4, leaving 3.
What if the lines with negative amounts are not an exact match for one or more of those with positives? Even if some combination of the negatives adds up to some combination of the positives, you would need to try every possible combination - just calculating the order of that algorithm makes my head hurt (O(N!)^2, I think).
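That said, for the narrow case where every negative total exactly mirrors one positive total one-for-one (as in the sample data), a self anti-join is one possible sketch. The table name orders is an assumption, and it falls apart as soon as amounts repeat or one negative must cancel several positives:

-- Keep only orders with no exactly opposite-signed counterpart for the
-- same customer. Assumes totals pair off one-for-one; with two +50s and
-- one -50 for a customer, all three rows would be wrongly eliminated.
SELECT o.orders_id, o.order_total, o.products_quantity, o.customers_id
FROM orders o
LEFT JOIN orders r
       ON r.customers_id = o.customers_id
      AND r.order_total = -o.order_total
WHERE r.orders_id IS NULL;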
I have a table, poll_response, with three columns: user_id, poll_id, option_id.
Given an arbitrary number of poll/response pairs, how can I determine how many distinct user_ids match all of them?
So, suppose the table's data looks like this:
user_id | poll_id | option_id
1       | 1       | 0
1       | 2       | 1
1       | 3       | 0
1       | 4       | 0
2       | 1       | 1
2       | 2       | 1
2       | 3       | 1
2       | 4       | 0
And suppose I want to know how many users have responded "1" to poll 2 and "0" to poll 3.
In this case, only user 1 matches, so the answer is: there is only one distinct user.
But suppose I want to know how many users have responded "1" to poll 2 and "0" to poll 4.
In this case, both user 1 and user 2 match, so the answer is: there are 2 distinct users.
I'm having trouble constructing the MySQL query to make this happen, especially given that there are an arbitrary number of poll/response pairs. Do I just try to chain a bunch of joins together?
To know how many users have responded "1" to poll 2 and "0" to poll 3:
select count(user_id)
from (
    select user_id
    from tblA
    where (poll_id = 2 and option_id = 1)
       or (poll_id = 3 and option_id = 0)
    group by user_id
    having count(user_id) = 2
) m
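Presumably the same pattern extends to any number of pairs: one OR-term per (poll_id, option_id) pair, with the HAVING count equal to the number of pairs. This relies on each user answering a given poll at most once. A hypothetical three-pair version:

select count(user_id)
from (
    select user_id
    from tblA
    where (poll_id = 2 and option_id = 1)
       or (poll_id = 3 and option_id = 0)
       or (poll_id = 4 and option_id = 0)  -- hypothetical third pair
    group by user_id
    having count(user_id) = 3  -- must equal the number of pairs
) m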
I found it hard to find a fitting title. For simplicity let's say I have the following table:
cook_id cook_rating
1 2
1 1
1 3
1 4
1 2
1 2
1 1
1 3
1 5
1 4
2 5
2 2
Now I would like to get an output of 'good' cooks. A good cook is someone for whom at least 70% of the ratings are 1, 2 or 3, as opposed to 4 or 5.
So in my example table, the cook with id 1 has a total of 10 ratings, 7 of which are of type 1, 2 or 3, and only three of which are of type 4 or 5. Therefore the cook with id 1 is a 'good' cook, and the output should be the cook's id together with the number of good ratings.
cook_id numberOfGoodRatings
1 7
The cook with id 2, however, doesn't satisfy my condition and therefore should not be listed at all.
select cook_id,
       count(cook_rating) - sum(case when cook_rating in (4, 5) then 1 else 0 end) as numberOfGoodRatings
from cook
where cook_rating in (1, 2, 3, 4, 5)
group by cook_id
order by numberOfGoodRatings desc
However, this doesn't take into account the fact that there might be more ratings of 4 or 5 than good ratings, resulting in negative outputs. Plus, the requirement of at least 70% is not included.
You can get this with a comparison in your HAVING clause. If you must have just the two columns in the result set, this can be wrapped in a sub-select: SELECT cook_id, positive_ratings FROM (...)
SELECT
    cook_id,
    COUNT(IF(cook_rating < 4 OR cook_rating IS NULL, 1, NULL)) AS positive_ratings,
    COUNT(*) AS total_ratings
FROM cook
GROUP BY cook_id
HAVING (positive_ratings / total_ratings) >= 0.70
ORDER BY positive_ratings DESC
Edit Note that a bare count(cook_rating < 4) would not work: COUNT counts non-null values, and in MySQL a false comparison evaluates to 0, not NULL, so every row would be counted. Wrapping the condition in IF(..., 1, NULL) restricts the count to rows where the condition holds.
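For what it's worth, the same count can be written with MySQL's boolean arithmetic, where a comparison evaluates to 1 or 0 and SUM adds those up. A sketch; note it does not count NULL ratings as positive, unlike the IS NULL branch above:

SELECT cook_id,
       SUM(cook_rating < 4) AS positive_ratings,  -- 1 per rating of 1-3; NULL ratings contribute nothing
       COUNT(*) AS total_ratings
FROM cook
GROUP BY cook_id
HAVING positive_ratings / total_ratings >= 0.70
ORDER BY positive_ratings DESC;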
I suggest you change your schema a little to make this kind of query trivial.
Suppose you add 5 columns to your cook table, simply to count the number of each rating:
nb_ratings_1 nb_ratings_2 nb_ratings_3 nb_ratings_4 nb_ratings_5
Updating such a table when a new rating is entered in the DB is trivial, as would be recomputing those numbers if having redundancy makes you nervous. And it makes all filtering and sorting fast and easy, as sketched below.
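A minimal sketch of that idea, assuming a separate cooks table keyed by cook_id (distinct from the ratings table queried above):

-- Add one counter column per rating value (assumed table name `cooks`).
ALTER TABLE cooks
    ADD COLUMN nb_ratings_1 INT NOT NULL DEFAULT 0,
    ADD COLUMN nb_ratings_2 INT NOT NULL DEFAULT 0,
    ADD COLUMN nb_ratings_3 INT NOT NULL DEFAULT 0,
    ADD COLUMN nb_ratings_4 INT NOT NULL DEFAULT 0,
    ADD COLUMN nb_ratings_5 INT NOT NULL DEFAULT 0;

-- When a new rating arrives (here: cook 1 receives a 3), bump its counter.
UPDATE cooks SET nb_ratings_3 = nb_ratings_3 + 1 WHERE cook_id = 1;

-- The 70% filter then becomes plain arithmetic, with no grouping needed.
SELECT cook_id,
       nb_ratings_1 + nb_ratings_2 + nb_ratings_3 AS positive_ratings
FROM cooks
WHERE (nb_ratings_1 + nb_ratings_2 + nb_ratings_3) /
      (nb_ratings_1 + nb_ratings_2 + nb_ratings_3
       + nb_ratings_4 + nb_ratings_5) >= 0.70;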
I have a table that looks somewhat like this:
id value
1 0
1 1
1 2
1 0
1 1
2 2
2 1
2 1
2 0
3 0
3 2
3 0
Now for each id, I want to count the number of occurrences of 0 and of 1, as well as the total number of rows for that id (value can be any integer), so the end result should look something like this:
id n0 n1 total
1 2 2 5
2 1 2 4
3 2 0 3
I managed to get the first and last columns with this statement:
SELECT id, COUNT(*) FROM mytable GROUP BY id;
But I'm sort of lost from here. Any pointers on how to achieve this without a huge statement?
With MySQL, you can use SUM(condition):
SELECT id, SUM(value=0) AS n0, SUM(value=1) AS n1, COUNT(*) AS total
FROM mytable
GROUP BY id
As @Zane commented above, the typical method is to use CASE expressions to perform the pivot.
SQL Server now has a PIVOT operator that you might see. DECODE() and IIF() were older approaches on Oracle and Access that you might still find lying around.
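For reference, a sketch of that CASE-expression pivot against the same mytable; it is portable beyond MySQL:

-- Each SUM adds 1 for matching rows and 0 otherwise, giving per-id counts.
SELECT id,
       SUM(CASE WHEN value = 0 THEN 1 ELSE 0 END) AS n0,
       SUM(CASE WHEN value = 1 THEN 1 ELSE 0 END) AS n1,
       COUNT(*) AS total
FROM mytable
GROUP BY id;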