Stata Duplicates within a 5 minute range

For the following dataset example:
11-12-2014 21:59
11-12-2014 21:59
11-12-2014 22:00
11-12-2014 22:06
I need to regard observations that are less than five minutes apart as duplicates and use them in a "bysort" command afterwards. Does anyone know how I can define duplicates to be observations that are <5 minutes apart?

This is an incomplete answer, since for clarity I used simple numbers rather than Stata time values. But it shows the fundamental idea.
clear
input float x
1
3
9
13
17
end
generate run = 0
replace run = x in 1
replace run = cond(x<=run[_n-1]+5,run[_n-1],x) if _n>1
which gives the following result, showing that the variable run identifies sets of "duplicate" observations by your criterion.
. list
     +----------+
     |  x   run |
     |----------|
  1. |  1     1 |
  2. |  3     1 |
  3. |  9     9 |
  4. | 13     9 |
  5. | 17    17 |
     +----------+
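As a rough sketch of how the same rule carries over to real timestamps, here is the idea in Python rather than Stata. The function name assign_runs and the day-month-year reading of the dates are assumptions for illustration, not part of the answer above. Like the Stata line, it compares each value with the start of the current run using <=; switch to < if observations exactly five minutes apart should not count as duplicates.

```python
from datetime import datetime, timedelta


def assign_runs(times, window=timedelta(minutes=5)):
    """Label each timestamp (already sorted) with the start time of its run."""
    runs = []
    for t in times:
        # Same rule as the Stata replace: stay in the current run while the
        # value is within `window` of that run's first value.
        if runs and t <= runs[-1] + window:
            runs.append(runs[-1])
        else:
            runs.append(t)
    return runs


# Timestamps from the question; the format string is an assumption.
stamps = ["11-12-2014 21:59", "11-12-2014 21:59",
          "11-12-2014 22:00", "11-12-2014 22:06"]
times = sorted(datetime.strptime(s, "%d-%m-%Y %H:%M") for s in stamps)
for t, run in zip(times, assign_runs(times)):
    print(t, run)
```

The run label plays the role of the variable run above, so it could serve as the grouping key in a later step.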

Related

Group by adjacent time values

I hope someone can help with that tricky question:
How can I "GROUP BY" some rows which have a configurable adjacent distance between the timestamps?
Example table:
ID | Value | When
1 | 5 | 2017-06-30 11:45:55
2 | 9 | 2017-06-30 11:45:56
3 | 0 | 2017-06-30 11:45:59
4 | 2 | 2017-06-30 11:46:02
5 | 7 | 2017-06-30 17:19:22
6 | 7 | 2017-06-30 17:19:22
7 | 3 | 2017-06-30 17:19:22
8 | 6 | 2017-06-30 17:19:22
Desired result:
ID | Value | When
3 | 0 | 2017-06-30 11:45:59
7 | 3 | 2017-06-30 17:19:22
The result shall find adjacent entries (in the example two groups of four rows each) and tell the lowest "Value".
Adjacent distance can be any value like one minute or ten minutes.
I tried to reformat the date to be able to "GROUP BY" without seconds but this won't work for the first result.
My MySQL programming skills are limited but it could be done with following steps:
SELECT and ORDER BY "When"
Go through the values and compute the difference between the current and previous "When" value; if it is within range, then GROUP, if not, output a new row.
Any idea?

Finally I decided to write a C program which uses a view in the DB to obtain raw sorted data. The program then outputs groups if the difference between the timestamps is within the desired limit.
The output is then put back into the database.
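The C program itself is not shown, so the following is only a minimal Python sketch of the approach described: walk the rows in timestamp order, start a new group whenever the gap to the previous row exceeds the limit, and keep the lowest-value row of each group. The function name min_per_adjacent_group and the tuple layout are assumptions.

```python
from datetime import datetime, timedelta


def min_per_adjacent_group(rows, max_gap):
    """rows: (id, value, when) tuples. Returns the lowest-value row per group."""
    result = []
    previous_when = None
    for row in sorted(rows, key=lambda r: r[2]):
        if previous_when is None or row[2] - previous_when > max_gap:
            result.append(row)        # gap too large: start a new group
        elif row[1] < result[-1][1]:
            result[-1] = row          # same group, lower value found
        previous_when = row[2]
    return result


def ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Example table from the question.
rows = [
    (1, 5, ts("2017-06-30 11:45:55")), (2, 9, ts("2017-06-30 11:45:56")),
    (3, 0, ts("2017-06-30 11:45:59")), (4, 2, ts("2017-06-30 11:46:02")),
    (5, 7, ts("2017-06-30 17:19:22")), (6, 7, ts("2017-06-30 17:19:22")),
    (7, 3, ts("2017-06-30 17:19:22")), (8, 6, ts("2017-06-30 17:19:22")),
]
for row in min_per_adjacent_group(rows, timedelta(minutes=1)):
    print(row)
```

With a one-minute limit this keeps the rows with ID 3 and ID 7, matching the desired result in the question.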

MySQL - Get values from previous rows

I am trying to reconstruct data that has a tree structure.
Example - Country / City:
1) USA
1.1) New York
1.2) Chicago
2) France
2.1) Paris
2.2) Lyon
3) China
In my database it looks like this:
| Element | Level | Row |
|:--------:|:-----:|:---:|
| USA | 1 | 1 |
| New York | 2 | 2 |
| Chicago | 2 | 3 |
| France | 1 | 4 |
| Paris | 2 | 5 |
| Lyon | 2 | 6 |
| China | 1 | 7 |
Based on the sequence (row) of my entries I can reconstruct the tree structure. For each row I look for the nearest previous row that has Level-1.
max(pre.Row) / pre.Row < cur.Row / pre.Level = cur.Level-1
The following code works and returns the right results. My problem is that the table has 7 million rows, so the query takes a lot of time. It is like comparing 7 million rows against 7 million rows...
SELECT cur.`Row`, (
SELECT max(pre.`Row`)
FROM `abc`.`def` AS pre
WHERE pre.`Row` < cur.`Row`
AND pre.`Level`=cur.`Level`-1
) AS prev_row
FROM `abc`.`def` AS cur
;
Is there a faster way to implement this?
Maybe with loops or user variables? I could imagine starting from the current row and testing whether the previous row meets the conditions, otherwise looking at the row before that, and so on. This would reduce the operations to 7 million times ~5. I have never worked with loops, so I have no clue if this is possible in SQL. Any ideas?

Here's my attempt with 3 levels; you can add more variables if you have more levels. I'm not sure why it returns odd values that look encoded, but CAST(... AS UNSIGNED) gets you prev_row, the same as your query.
SELECT Row,
CAST(ELT(level-1,@level_1,@level_2,@level_3) as UNSIGNED) as prev_row,
@level_1 := IF(`level` = 1, row, @level_1),
@level_2 := IF(`level` = 2, row, @level_2),
@level_3 := IF(`level` = 3, row, @level_3)
FROM `def`
ORDER BY Row ASC
http://sqlfiddle.com/#!9/719b2/22
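The idea behind the user variables is a single pass that remembers the most recent row seen at each level. Here is a small Python sketch of that idea, assuming the rows are already sorted by Row; the function name nearest_parent_rows is hypothetical.

```python
def nearest_parent_rows(entries):
    """entries: (row, level) pairs sorted by row.

    Returns {row: nearest previous row with level - 1, or None}.
    """
    last_row_at_level = {}
    parents = {}
    for row, level in entries:
        # The parent is the latest row seen so far one level up.
        parents[row] = last_row_at_level.get(level - 1)
        last_row_at_level[level] = row
    return parents


# (Row, Level) pairs from the question's table.
entries = [(1, 1), (2, 2), (3, 2), (4, 1), (5, 2), (6, 2), (7, 1)]
print(nearest_parent_rows(entries))
```

Each row is visited once, so the work grows linearly with the table instead of comparing every row with every earlier row.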

mySQL get all possible combinations of certain rows

I have a strange request in mySQL. I found many ways to do this for pairs of combinations or a certain other number by adding more joins, but I am wondering if there is a dynamic way of doing it for any number of combinations.
To explain: suppose I have a table with two columns (column_id and column_text):
Id | Text
--------
1 | A
2 | B
3 | B
4 | B
5 | A
Then by running a procedure GetCombinations with parameter A should yield:
CombinationId | Combinations
---------------------------
1 | 1
2 | 5
3 | 1,5
by running a procedure GetCombinations with parameter B should yield:
CombinationId | Combinations
---------------------------
1 | 2
2 | 3
3 | 4
4 | 2,3
5 | 2,4
6 | 3,4
7 | 2,3,4
Obviously the larger the number, then I expect an exponential increase of results.
Is such a query even possible? All I could find was results using Joins limiting the length of each result to the number of Joins.
Thank you
UPDATE
I have found an article here but the maximum number of combinations should be small (max 20 or so). In my case with a 100 combinations I calculated that it would produce: 9426890448883247745626185743057242473809693764078951663494238777294707070023223798882976159207729119823605850588608460429412647567360000000000000000000099 rows (lol)
So I will classify my answer as infeasible
However is there a way to get this result with max 2 combinations?
CombinationId | Combinations
---------------------------
1 | 2
2 | 3
3 | 4
4 | 2,3
5 | 2,4
6 | 3,4
I have found a query to get all combinations using JOIN but I am not sure how to produce the combination id and also how to get the individual rows.
UPDATE 2
Solved it using
SELECT @rownum := @rownum + 1 AS 'CombinationId'
cross join (select @rownum := 0) r
And I did the query with UNION ALL

What you are trying to do is to generate the Power Set of the set of all elements with field Text == <parameter>. As you already found out, this number grows exponentially with the length of the input array.
If you can solve it in another language (say, PHP), take a look at this:
Finding the subsets of an array in PHP
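For comparison, here is a minimal Python sketch of the same enumeration using the standard library's itertools.combinations. The function name id_combinations and the max_size parameter are assumptions for illustration. A set of n elements has 2^n - 1 non-empty subsets, so capping the subset size is what keeps the output manageable.

```python
from itertools import combinations


def id_combinations(ids, max_size=None):
    """Return (combination_id, ids_tuple) pairs, smallest subsets first."""
    ids = sorted(ids)
    if max_size is None:
        max_size = len(ids)           # no cap: the full set of non-empty subsets
    combos = []
    for size in range(1, max_size + 1):
        combos.extend(combinations(ids, size))
    return list(enumerate(combos, start=1))


# Ids of the rows with Text = 'B' in the question, capped at pairs.
for combination_id, combo in id_combinations([2, 3, 4], max_size=2):
    print(combination_id, ",".join(str(i) for i in combo))
```

Calling it without max_size on [2, 3, 4] yields the seven combinations listed in the question.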

How to generalize a sequential COUNT() of chronological data without loops or cursors?

I have read all the arguments: Tell SQL what you want, not how to get it. Use set-based approaches instead of procedural logic. Avoid cursors and loops at all costs.
Unfortunately, I have been racking my brain for weeks and I can't figure out how to come up with a set-based approach to generating an iterative COUNT for sequential subsets of chronologically ordered data.
Here is the specific application of the problem I am working on.
I do football-related research using a database that comprises many years of play-by-play data, which is of course arranged chronologically by year, game, and play. The database is loaded onto a web server running MySQL 5.0.
The fields I need for this particular problem are contained in the core table. Here is some sample data from the relevant part of the table:
GID | PID | OFF | DEF | QTR | MIN | SEC | PTSO | PTSD
--------------------------------------------------------
121 | 2455 | ARI | CHI | 2 | 4 | 30 | 17 | 10
121 | 2456 | ARI | CHI | 2 | 4 | 15 | 17 | 10
121 | 2457 | ARI | CHI | 2 | 3 | 53 | 17 | 10
121 | 2458 | ARI | CHI | 2 | 3 | 31 | 20 | 10
The columns represent, respectively: unique game identifier, unique play identifier, which team is on offense for that play, which team is on defense for that play, the quarter and time the play occurred, and the offense's and defense's scores going into the play. In other words, in (hypothetical) game 121, the Arizona Cardinals scored a field goal on play 2457 (i.e., going into play 2458).
What I want to do is go through several years' worth of data game by game, second by second, and count the number of times any possible score differential occurred at any given elapsed time. The following query arranges the data by seconds elapsed and score differential:
SELECT core.GID, core.PID, core.QTR, core.MIN, core.SEC, core.PTSO, core.PTSD,
((core.QTR - 1) * 900 + (900-(core.MIN * 60 + core.SEC))) AS secEl,
core.PTSO - core.PTSD AS oDif, (core.PTSO - core.PTSD) * -1 AS dDif
FROM core
ORDER BY secEl ASC, oDif ASC;
The result looks something like this:
GID | PID | OFF | DEF | QTR | MIN | SEC | PTSO | PTSD | secEl | oDif | dDif
---------------------------------------------------------------------------------
616 | 100022 | CHI | MIN | 1 | 15 | 00 | 0 | 0 | 0 | 0 | 0
617 | 100169 | HOU | DAL | 1 | 15 | 00 | 0 | 0 | 0 | 0 | 0
618 | 100224 | PHI | SEA | 1 | 15 | 00 | 0 | 0 | 0 | 0 | 0
619 | 100303 | JAX | NYJ | 1 | 15 | 00 | 0 | 0 | 0 | 0 | 0
Although that looks pretty, my goal is not to sort the data chronologically. Rather, I want to step sequentially through every one of the 4,500 possible seconds (four 15-minute quarters plus one 15-minute overtime period) in an NFL game and count the number of times every score differential has ever occurred in each one of those seconds.
In other words, I don't want to count just the number of times a team has been up by, say, 21 points at 1,800 seconds elapsed (i.e., the start of the second quarter) between 2002 and 2013. I want to count the number of times a team has been up by 21 points at any point in a game. On top of that, I want to do this for every score differential that has ever occurred (i.e., -50, -49, -48, ..., 0, 1, 2, ... 48, 49, 50, ...) for every second of every game.
This would be relatively easy to accomplish with a series of nested loops, but it wouldn't be the most reusable of code.
What I want to do is construct set logic that will COUNT the instances of each score differential that has occurred at every second of time elapsed without using loops or cursors. The results would be tabulated as follows:
secondsElapsed | scoreDif | Occurrences
-----------------------------------------
10 | -1 | 12
10 | 0 | 125517
10 | 1 | 0
10 | 2 | 3
Here is a sample query for getting the total number of instances of a specific score differential (+21) at a specific time point (3,000 seconds elapsed):
SELECT ((core.QTR - 1) * 900 + (900-(core.MIN * 60 + core.SEC))) AS timeElapsed,
(core.PTSO - core.PTSD) AS diff, COUNT(core.PTSO - core.PTSD) AS occurrences
FROM core
WHERE ((core.QTR - 1) * 900 + (900-(core.MIN * 60 + core.SEC))) = 3000
AND ABS(core.PTSO - core.PTSD) = 21
That query returns the following results:
timeElapsed | diff | occurrences
----------------------------------
3000 | 21 | 5
Now I want to generalize this query to count the instances of every differential at every second elapsed.

Your description is rather confusing but if you want to "COUNT all of the possible score differentials for every possible second without using loops or cursors" then I would do something like:
1) Build a work table (either a temporary table or a table variable) and fill it with the time increments you want, e.g.
QTR | MIN | SEC |
1 | 00 | 01
1 | 00 | 02
..
1 | 01 | 59
1 | 02 | 00
1 | 02 | 01
1 | 02 | 02
..
4 | 15 | 59
2) You then use this as the basis of your query. Cross Join a list of the games you are interested in with the work table to give you a table of every game and every minute in that game.
3) Left join your query above back onto the result of (2).
With this result set you can then look at a whole game and sum or count as necessary without having to loop.

Not sure if this will cure your problem, but you could try using row_number over a partition...
SELECT ROW_NUMBER() OVER (PARTITION BY <column> ORDER BY <column>) AS aColumn, aColumn FROM aTable

I did it using a sub-query and two variables: one to define the time point and another to define the point difference.
The query returns the diff, then the number of times the offensive side had it, followed by the defensive side and the total.
SET @Diff = 7;
SET @Seconds = 1530;
SELECT ABS(core.PTSO - core.PTSD) AS diff,
SUM(CASE WHEN core.PTSO - core.PTSD <= 0 THEN 1 ELSE 0 END) OffensiveTimes,
SUM(CASE WHEN core.PTSO - core.PTSD >= 0 THEN 1 ELSE 0 END) DefensiveTimes,
SUM(1) TotalTimes
FROM (SELECT core.GID, core.PID, core.QTR, core.MIN, core.SEC, core.PTSO, core.PTSD,
((core.QTR - 1) * 900 + (900-(core.MIN * 60 + core.SEC))) AS secEl,
core.PTSO - core.PTSD AS oDif, (core.PTSO - core.PTSD) * -1 AS dDif
FROM core
) core
WHERE secEl = @Seconds AND ABS(core.PTSO - core.PTSD) = @Diff
GROUP BY ABS(core.PTSO - core.PTSD);
This returns the following for the small dataset you gave:
7 diff, 0 OffensiveTimes, 1 DefensiveTimes, 1 TotalTimes
Hope that was what you were looking for :)
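To generalize from one (time, differential) pair to all of them, the usual set-based move is to group by both computed values at once. The following Python sketch illustrates that grouping with a Counter; the function names and the tuple layout are assumptions, and it only counts combinations that actually appear in the data (a combination with zero occurrences is simply absent).

```python
from collections import Counter


def seconds_elapsed(qtr, minute, sec):
    # Same formula as secEl in the queries above.
    return (qtr - 1) * 900 + (900 - (minute * 60 + sec))


def count_differentials(plays):
    """plays: (QTR, MIN, SEC, PTSO, PTSD) tuples.

    Returns a Counter keyed by (seconds elapsed, offensive differential).
    """
    counts = Counter()
    for qtr, minute, sec, ptso, ptsd in plays:
        counts[(seconds_elapsed(qtr, minute, sec), ptso - ptsd)] += 1
    return counts


# The four sample plays from the question.
plays = [(2, 4, 30, 17, 10), (2, 4, 15, 17, 10),
         (2, 3, 53, 17, 10), (2, 3, 31, 20, 10)]
for (elapsed, diff), n in sorted(count_differentials(plays).items()):
    print(elapsed, diff, n)
```

In SQL terms this corresponds to grouping by the secEl and oDif expressions together rather than filtering on one value of each.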

MYSQL JOIN two tables

We want to write a query for a MySQL database that has 12 tables, one per month (JAN - DEC), each with 32 columns (JAN1, JAN2, JAN3, ... JAN31). The database is used to check hotel availability: if we select a tour for three days (29 JAN - 1 FEB), the query has to check the records in two tables, one for JAN and the other for FEB. Each column stores a number (such as 5, 10, 2, 0, 5) giving the rooms available. We have successfully built a query for a single month, but we are unable to create a MySQL query for two months, because we only want values of at least 1; that is, we only want to show the available rooms.
$num = mysql_query("SELECT DISTINCT id,table_type,JAN29,room_type FROM JAN Where table_type='disp' AND JAN!=0 ");
The above query works fine for me; we just want the same query across two tables, returning only positive values (greater than 0).
Please help to solve this problem ..
Thanks
Rod
ID | JAN1 | JAN2 | JAN3 | JAN31|
34 | 5 | 3 | 3 | 4 |
56 | 4 | 3 | 9 | 3 |
28 | 0 | 7 | 0 | 9 |
