Use duplicate as a condition for if else in R - duplicates

I need to find incomplete records in a big dataset similar to the one below:
|#|Name|Eligible|Code|finish_process|Dup_finish|
|-|------|--------|----|--------------|----------|
|1|Tom Hi| Y | | |N |
|2|Tom Hi| Y | | |N |
|3|John Walls| Y | | |N |
|4|John Walls| Y | | |N |
|5|July Bran| Y | 310 |Y |Y |
|6|July Bran| Y | | |Y |
|5|Mary Doll| Y |741 |Y |Y |
|6|Mary Doll| Y | | |Y |
|5|Howard Smith| Y |182 |Y |Y |
|6|Howard Smith| | | |Y |
|6|Howard Smith| Y | | |Y |
Some entries are duplicates (by name) of people who started the training but never finished it (e.g., Tom Hi). Other entries are also duplicates, but the person finished the training (finish_process=Y) at least once (e.g., July Bran, Howard Smith).
In Excel, I used this formula to create a column called Dup_finish that does what I need:
=IF(OR(COUNTIFS($G$2:$G$112,G2,$L$2:$L$112,{"Y"," "}),OR(L2="Y",L2=" ")),"Y","N")
My objective is to create a column in R with Y/N values that allows me to discern which people are true incomplete entries (they never finished the training). For example, Tom Hi and John Walls are true incomplete records, while the others are not (they finished the training at least once, even though they tried more than once).
Can anyone help me to find the right code to create the Dup_finish column and values using the R software? Thanks.
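The rule behind Dup_finish can be stated as: a name gets Y if any of its rows has finish_process equal to Y, and N otherwise. Below is a minimal Python sketch of that group-wise check, meant only to pin down the logic rather than to be the R answer. The `records` list and the `add_dup_finish` helper are illustrative names that do not come from the original data, and only the two columns the rule needs are modelled.

```python
from collections import defaultdict

# Sample rows modelled on the table above (only the columns the rule needs).
records = [
    {"Name": "Tom Hi", "finish_process": ""},
    {"Name": "Tom Hi", "finish_process": ""},
    {"Name": "John Walls", "finish_process": ""},
    {"Name": "John Walls", "finish_process": ""},
    {"Name": "July Bran", "finish_process": "Y"},
    {"Name": "July Bran", "finish_process": ""},
    {"Name": "Mary Doll", "finish_process": "Y"},
    {"Name": "Mary Doll", "finish_process": ""},
    {"Name": "Howard Smith", "finish_process": "Y"},
    {"Name": "Howard Smith", "finish_process": ""},
    {"Name": "Howard Smith", "finish_process": ""},
]


def add_dup_finish(rows):
    """Return copies of rows with Dup_finish = 'Y' if the name ever finished."""
    finished = defaultdict(bool)
    # First pass: remember every name that has at least one finished row.
    for row in rows:
        if row["finish_process"].strip() == "Y":
            finished[row["Name"]] = True
    # Second pass: stamp the per-name result onto every row of that name.
    return [
        {**row, "Dup_finish": "Y" if finished[row["Name"]] else "N"}
        for row in rows
    ]


flagged = add_dup_finish(records)
# Names whose rows are all unfinished are the true incomplete records.
incomplete = sorted({r["Name"] for r in flagged if r["Dup_finish"] == "N"})
```

The two-pass shape (collect per-name facts, then write them back to each row) is the part that should carry over to any grouping tool.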

Related

SQL query to find set of doc_ids where there is maximum intersection of ent_ids

I have a table with O(1M) rows with columns doc_id and ent_id where (doc_id, ent_id) is the primary key.
+--------+--------+
| doc_id | ent_id |
+--------+--------+
| 1 | a |
| 1 | b |
| 1 | x |
| 1 | y |
| 2 | a |
| 3 | a |
| 3 | x |
| 3 | y |
| 4 | x |
| 4 | y |
+--------+--------+
My question is: how do I efficiently find a set of doc_ids (say the top 1000 or 5000 doc_ids) with the maximum intersection of ent_ids among the selected doc_ids?
For example, in the above table:
If I need the top 2 doc_ids with the maximum intersection among their ent_ids, the result would be doc_ids = {1,3} with [ common ent_ids={a,x,y}, common ent_ids count=3 ].
If I need the top 3 doc_ids with the maximum intersection among their ent_ids, the result would be doc_ids = {1,3,4} with [ common ent_ids={x,y}, common ent_ids count=2 ].
Footnote: if it's not possible to do this efficiently with SQL, any direction towards an alternative method in application code would also be helpful, say, convert to csv -> some data structure [inverted index?]/library + python code -> result set.
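To make the target concrete, here is a small Python sketch that states the problem exactly by trying every combination of k doc_ids. It is exhaustive, so it is only usable on tiny inputs like the sample table and is not a solution for O(1M) rows or k in the thousands; `pairs` and `best_doc_set` are names made up for this illustration.

```python
from collections import defaultdict
from itertools import combinations

# (doc_id, ent_id) rows from the sample table.
pairs = [
    (1, "a"), (1, "b"), (1, "x"), (1, "y"),
    (2, "a"),
    (3, "a"), (3, "x"), (3, "y"),
    (4, "x"), (4, "y"),
]


def best_doc_set(pairs, k):
    """Exhaustively find k doc_ids whose ent_id sets share the most members."""
    ents_by_doc = defaultdict(set)
    for doc_id, ent_id in pairs:
        ents_by_doc[doc_id].add(ent_id)

    best_docs, best_common = None, set()
    # Try every k-subset of doc_ids; ties keep the first subset found.
    for docs in combinations(sorted(ents_by_doc), k):
        common = set.intersection(*(ents_by_doc[d] for d in docs))
        if best_docs is None or len(common) > len(best_common):
            best_docs, best_common = docs, common
    return best_docs, best_common


top2 = best_doc_set(pairs, 2)  # ((1, 3), {'a', 'x', 'y'})
top3 = best_doc_set(pairs, 3)  # ((1, 3, 4), {'x', 'y'})
```

For realistic sizes some heuristic would be needed instead of enumeration, but a reference implementation like this is handy for checking a heuristic on small samples.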

MySQL Substring between two DIFFERENT strings where the second needle comes AFTER the first

I have to extract certain data from a MySQL column. The table looks like so:
+----+---------------------+------------------------+
| id | time | data |
+----+---------------------+------------------------+
| 1 | 2016-10-28 00:12:01 | a Q1!! AF3 !! ext!! z |
| 2 | 2016-10-28 02:19:02 | z !!3F2 !AF66-2!! !!a |
| 3 | 2016-10-28 11:35:03 | AF!a !!! pl6 f !!! dd |
+----+---------------------+------------------------+
I want to grab the string from column data between the characters AF and the NEXT occurrence of !!. So ideally the query SELECT id, [something] AS x FROM tbl would result in:
+----+------+
| id | x |
+----+------+
| 1 | 3 |
| 2 | 66-2 |
| 3 | !a |
+----+------+
Thoughts on how to do this? All the other questions I see don't quite relate, as they don't deal with finding the first occurrence of the second needle (!!) AFTER the first needle (AF).
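There may be faster ways to do this, but this is a good start. Note that the inner call with -1 keeps everything after the last occurrence of AF:
select id, substring_index(substring_index(data, 'AF', -1), '!!', 1) as x from tbl;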
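For readers who want to see what the nested substring_index calls do step by step, the same two cuts can be mimicked in Python. This is only a sketch of the string logic, not MySQL code: `between_af_and_bang` is a made-up helper name, and the final `strip()` is an assumption added so the values line up with the expected table (the raw cut keeps any space before the `!!`).

```python
def between_af_and_bang(data):
    """Mimic substring_index(substring_index(data, 'AF', -1), '!!', 1)."""
    # Everything to the right of the LAST 'AF' (whole string if 'AF' is absent).
    after_af = data.rpartition("AF")[2]
    # Everything to the left of the FIRST '!!' in that remainder.
    return after_af.partition("!!")[0]


# The three sample values from the data column.
samples = {
    1: "a Q1!! AF3 !! ext!! z",
    2: "z !!3F2 !AF66-2!! !!a",
    3: "AF!a !!! pl6 f !!! dd",
}

# strip() is an assumption: it drops the space left in front of '!!'.
extracted = {row_id: between_af_and_bang(text).strip()
             for row_id, text in samples.items()}
```

If a value could contain AF more than once and the first occurrence is the one that matters, `partition("AF")` would be the cut to use instead of `rpartition("AF")`.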

Finding the min distance MYSQL

I want to find the minimum of the squared distance (x2-x1)^2 + (y2-y1)^2 for each macId and timeStamp. I'm trying to find the nearest possible gate location for an individual at one particular instant. So the query should return, for each user at each instant, the single row whose gate is closest.
The data set looks like:
| X1 | Y1 | TimeStamp | MACID | X2 | Y2 | Gate
| 5618 | 5303 |1 12:22:02 | 54:ea:a8:53:5b:eb | 5844 | 5377 | C24
| 5848 | 5046 |1 12:22:02 | 54:ea:a8:53:5b:eb | 5844 | 5377 | C18
| 6094 | 5464 |1 12:22:02 | 54:ea:a8:53:5b:eb | 5844 | 5377 | C17
| 6021 | 6540 |1 13:09:48 | 48:5a:3f:6a:01:b9 | 6210 | 6801 | C23
| 6366 | 7036 |1 13:09:48 | 48:5a:3f:6a:01:b9 | 6210 | 6801 | C14
| 6366 | 7036 |1 13:09:48 | 48:5a:3f:6a:01:b9 | 6210 | 6801 | C13
The result set should look like below:
| X1 | Y1 | TimeStamp | MACID | X2 | Y2 | Gate
| 5848 | 5046 |1 12:22:02 | 54:ea:a8:53:5b:eb | 5844 | 5377 | C18
| 6021 | 6540 |1 13:09:48 | 48:5a:3f:6a:01:b9 | 6210 | 6801 | C23
I tried the query below, but it does not work:
select min((x2-x1)^2 + (y2-y1)^2), macID, timeStamp from maptable
group by macID, timeStamp
I also tried using self joins, but that seems completely wrong.
Where am I going wrong?
You could use this query:
SELECT m.*
FROM maptable m, (
SELECT TimeStamp, macid, MIN(POW((x2-x1), 2) + POW((y2-y1), 2)) mindist
FROM maptable
GROUP BY TimeStamp, macid
) a
WHERE m.TimeStamp = a.TimeStamp AND m.macid = a.macid
AND POW((x2-x1), 2) + POW((y2-y1), 2) = a.mindist;
SQL Fiddle: http://sqlfiddle.com/#!9/d7979/6
But note that it does not return exactly one row per macid and timestamp: in your input data the two final rows have the same coordinates, so the minimum distance is the same for gates C13 and C14.
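The same "group minimum, then keep the rows that match it" idea can be sketched outside the database. The Python below is an illustration under the assumption that the rows have already been fetched as tuples in the column order shown above; `rows`, `sq_dist` and `nearest_rows` are illustrative names. It also squares with `**`, since in MySQL (as in Python) `^` is bitwise XOR rather than exponentiation, which is why the answer uses POW.

```python
# (X1, Y1, TimeStamp, MACID, X2, Y2, Gate), copied from the sample data.
rows = [
    (5618, 5303, "1 12:22:02", "54:ea:a8:53:5b:eb", 5844, 5377, "C24"),
    (5848, 5046, "1 12:22:02", "54:ea:a8:53:5b:eb", 5844, 5377, "C18"),
    (6094, 5464, "1 12:22:02", "54:ea:a8:53:5b:eb", 5844, 5377, "C17"),
    (6021, 6540, "1 13:09:48", "48:5a:3f:6a:01:b9", 6210, 6801, "C23"),
    (6366, 7036, "1 13:09:48", "48:5a:3f:6a:01:b9", 6210, 6801, "C14"),
    (6366, 7036, "1 13:09:48", "48:5a:3f:6a:01:b9", 6210, 6801, "C13"),
]


def sq_dist(row):
    """Squared distance (x2-x1)^2 + (y2-y1)^2, written with ** (power)."""
    x1, y1, _, _, x2, y2, _ = row
    return (x2 - x1) ** 2 + (y2 - y1) ** 2


def nearest_rows(rows):
    """Keep, per (TimeStamp, MACID), every row that attains the minimum."""
    best = {}  # (timestamp, macid) -> (min distance, rows attaining it)
    for row in rows:
        key = (row[2], row[3])
        d = sq_dist(row)
        if key not in best or d < best[key][0]:
            best[key] = (d, [row])
        elif d == best[key][0]:
            best[key][1].append(row)  # tie: keep both, like the SQL join
    return [r for _, group in best.values() for r in group]


nearest_gates = [r[6] for r in nearest_rows(rows)]
```

On the rows as posted, the second MAC address keeps both C14 and C13, which is the tie described above. If exactly one row per group is required, a tie-breaking rule (for example the smallest gate name) has to be chosen explicitly.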

Rewriting a select query

I have a rather simple (I think) question at hand. The example tables and the result I need are provided below (in reality those tables contain many more columns and much more data; I just left what is relevant). There is also the query, which returns exactly what I need. However, I don't like the rather crude way in which it works (I don't like subqueries in general). The question is: how can I rewrite the query so that it automatically reacts to more columns appearing in TABLE2 in the future? Right now, if a "z" column were added to TABLE2, I would need to modify each query in the code and add one more relevant subquery. I just want the select to read the entire content of TABLE2 and translate the id numbers to the corresponding strings from TABLE1.
TABLE1
-----------------
id |x |
-----------------
567 |AAA |
345 |BBB |
341 |CCC |
827 |DDD |
632 |EEE |
503 |FFF |
945 |GGG |
234 |HHH |
764 |III |
123 |JJJ |
-----------------
TABLE2
-------------------------
id |x |y |
-------------------------
1 |123 |341 |
2 |567 |632 |
3 |345 |945 |
4 |764 |503 |
5 |234 |827 |
-------------------------
THE RESULT I NEED
-----------------
A |B |
-----------------
JJJ |CCC |
AAA |EEE |
BBB |GGG |
III |FFF |
HHH |DDD |
-----------------
The query I have
SELECT
(SELECT `x` FROM `TABLE1` WHERE `TABLE2`.`x` LIKE `TABLE1`.`id` LIMIT 1) as A,
(SELECT `x` FROM `TABLE1` WHERE `TABLE2`.`y` LIKE `TABLE1`.`id` LIMIT 1) as B
FROM `TABLE2` ORDER BY `id` DESC;
You might want to restructure your data model:
Instead of:
-------------------------
id |x |y |
-------------------------
1 |123 |341 |
2 |567 |632 |
3 |345 |945 |
4 |764 |503 |
5 |234 |827 |
-------------------------
You would have:
----------------------
col_id |col |
----------------------
1 |x |
2 |y |
----------------------
---------------------------
id |col_id |col_val |
---------------------------
1 |1 |123 |
1 |2 |341 |
2 |1 |567 |
2 |2 |632 |
etc
---------------------------
Probably not worth the hassle (you would effectively need to pivot when you're accessing multiple columns at a time) but it would allow you to do the query that you want across all current and future columns.
You can't do that with a plain select.
What you can do is create a view with the translated values. You still have to modify the view when the original table changes, but your queries don't have to change.
You can use dynamic SQL statements, but only if you are sure that every column of TABLE2 (apart from id) will have the same type as x and y.
Let me know if you are not sure how to write it.
All the best.
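If the translation is allowed to happen in application code rather than in SQL, reacting to new columns becomes straightforward. The Python sketch below assumes both tables have already been fetched: TABLE1 as an id-to-string dictionary and TABLE2 as a list of row dictionaries. The names `table1`, `table2` and `translate_rows` are invented for the example, and it is a sketch of the idea, not a replacement for the view or dynamic SQL suggestions above.

```python
# TABLE1 as an id -> string lookup.
table1 = {
    567: "AAA", 345: "BBB", 341: "CCC", 827: "DDD", 632: "EEE",
    503: "FFF", 945: "GGG", 234: "HHH", 764: "III", 123: "JJJ",
}

# TABLE2 rows as they might come back from a "SELECT *" with column names.
table2 = [
    {"id": 1, "x": 123, "y": 341},
    {"id": 2, "x": 567, "y": 632},
    {"id": 3, "x": 345, "y": 945},
    {"id": 4, "x": 764, "y": 503},
    {"id": 5, "x": 234, "y": 827},
]


def translate_rows(rows, lookup, key_column="id"):
    """Translate every column except key_column through the lookup table."""
    return [
        # Unknown ids become None instead of raising, like a failed subquery.
        {col: lookup.get(val) for col, val in row.items() if col != key_column}
        for row in rows
    ]


translated = translate_rows(table2, table1)
```

Because the comprehension walks whatever columns each row carries, a future "z" column is translated without touching this code.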

Calculate difference to data from previous row

I have a table which tracks positions over time. This table contains a lot of rows and its format cannot be changed.
Every position is inserted with an id, the actual position and a timestamp (datetime). The position itself is inserted as a concatenated string which can be converted to x/y coordinates using custom functions.
I want to create a procedure that lists all known positions and their corresponding timestamps for a certain foreign key. No problem so far. But I also want to create output columns which indicate the difference in position and time.
To clarify. What I have:
Id | Timestamp | Foreign_Id | POS
----------------------------------------------------------
1 | 2011-02-20 00:00:00 | 2 | PositionAsString
----------------------------------------------------------
2 | 2011-02-20 00:00:05 | 2 | PositionAsString
----------------------------------------------------------
3 | 2011-02-20 00:00:15 | 2 | PositionAsString
----------------------------------------------------------
4 | 2011-02-20 00:00:37 | 2 | PositionAsString
The Coordinates of the Position are available via the functions getX and getY which return float values for X and Y.
The difference in position would be calculated using the Pythagorean theorem as it is only 2D.
What I want
Id | Timestamp | Foreign_Id | POS | DiffPos | Speed
---------------------------------------------------------------------------
1 | 2011-02-20 00:00:00 | 2 | PositionAsString | 0 | 0
---------------------------------------------------------------------------
2 | 2011-02-20 00:00:05 | 2 | PositionAsString | 20 | 4
---------------------------------------------------------------------------
3 | 2011-02-20 00:00:10 | 2 | PositionAsString | 10 | 2
---------------------------------------------------------------------------
4 | 2011-02-20 00:00:20 | 2 | PositionAsString | 10 | 1
So now my issue is how to calculate the difference to the previous row.
It is important that the result set is narrowed down to the foreign key before any inefficient calculations are performed, because the table has many rows.
Thanks for the suggestion; however, as this is complicated and inefficient, I am now using another solution. As I stated above, I am unable to change the format of the source table. What was possible was to modify the trigger that writes to this tracking table so that it computes the differences before inserting the new row.
This will of course not work for existing entries but this is acceptable.
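The per-row calculation itself (distance via the Pythagorean theorem, speed as distance over elapsed seconds) can be sketched in a few lines of Python. Everything concrete here is assumed for illustration: the `track` coordinates are hypothetical stand-ins for what getX and getY would return, chosen so the numbers resemble the "What I want" table, and the rows are taken to be already filtered to one Foreign_Id and sorted by timestamp.

```python
from datetime import datetime
from math import hypot

# (Id, Timestamp, (x, y)); the coordinates are hypothetical, not real data.
track = [
    (1, datetime(2011, 2, 20, 0, 0, 0), (0.0, 0.0)),
    (2, datetime(2011, 2, 20, 0, 0, 5), (12.0, 16.0)),
    (3, datetime(2011, 2, 20, 0, 0, 10), (18.0, 24.0)),
    (4, datetime(2011, 2, 20, 0, 0, 20), (18.0, 34.0)),
]


def with_differences(rows):
    """Return (Id, DiffPos, Speed) per row, comparing each row to the previous."""
    out = []
    prev = None  # (timestamp, x, y) of the previous row, if any
    for row_id, ts, (x, y) in rows:
        if prev is None:
            diff_pos, speed = 0.0, 0.0  # first row has nothing to compare to
        else:
            prev_ts, prev_x, prev_y = prev
            diff_pos = hypot(x - prev_x, y - prev_y)  # 2D Pythagorean distance
            seconds = (ts - prev_ts).total_seconds()
            speed = diff_pos / seconds if seconds > 0 else 0.0
        out.append((row_id, diff_pos, speed))
        prev = (ts, x, y)
    return out


differences = with_differences(track)
```

Only the previous row has to be remembered, which is also what makes the trigger approach workable: at insert time the trigger needs just the latest existing row for the same foreign key.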