Remove duplicate data based on various conditions in SAS

In the following data set I want to remove duplicates based on several conditions:
For Auris disease:
Same id, same condition (Auris): keep only the record with the first date, no matter what the date difference is.
For the other disease conditions (Acino and CRE):
Same id, same condition, two dates: if the date difference is more than 90 days, keep the records with the first date and the last date.
Same id, same condition, three dates or more: if the date difference is more than 90 days, keep the records with the first date and the last date.
Keep all three if the difference between the 1st and 2nd dates is more than 90 days and the difference between the 2nd and 3rd dates is also more than 90 days.
data have;
input Id Disease $ Date :mmddyy10.;
format date mmddyy10.;
datalines;
123 Auris 01/01/2021
123 CRE 09/02/2020
344 CRE 08/06/2019
344 CRE 03/06/2020
344 CRE 03/03/2021
323 CRE 01/06/2019
323 CRE 09/06/2020
323 CRE 09/09/2020
167 Acino 03/06/2020
167 Acino 03/19/2020
167 Acino 09/03/2021
256 Auris 08/05/2020
256 Auris 10/07/2021
317 Acino 12/07/2018
317 Acino 01/03/2018
;;;;
run;
The result should look like this:
123 Auris 01/01/2021
123 CRE 09/02/2020
344 CRE 08/06/2019
344 CRE 03/06/2020
344 CRE 03/03/2021
323 CRE 01/06/2019
323 CRE 09/06/2020
167 Acino 03/06/2020
167 Acino 09/03/2021
256 Auris 08/05/2020
256 Auris 10/07/2021
317 Acino 12/07/2018
Thanks

Related

Find consecutive transactions within 10 minutes

I have a table like this:
user_id order_id create_time payment_amount product
101 10001 2018-04-02 5:26 48000 chair
102 10002 2018-04-02 7:44 25000 sofa
101 10003 2018-04-02 8:34 320000 ac
101 10004 2018-04-02 8:37 180000 water
103 10005 2018-04-02 9:32 21000 chair
102 10006 2018-04-02 9:33 200000 game console
103 10007 2018-04-02 9:36 11000 chair
107 10008 2018-04-02 11:05 10000 sofa
105 10009 2018-04-02 11:06 49000 ac
101 10010 2018-04-02 12:05 1200000 cc
105 10011 2018-04-02 12:12 98000 ac
103 10012 2018-04-02 13:11 85000 insurance
106 10013 2018-04-02 13:11 240000 cable tv
108 10014 2018-04-02 13:15 800000 financing
106 10015 2018-04-02 13:18 210000 phone
My goal is to find which users made consecutive transactions less than 10 minutes apart.
I'm using MySQL.
Based on the format of your dates in the table, you will need to convert them using STR_TO_DATE to use them in a query. If your column is actually a datetime type, and that is just your display code outputting that format, just replace STR_TO_DATE(xxx, '%m/%d/%Y %k:%i') in this query with xxx.
The way to find orders within 10 minutes of each other is to self-join your table on user_id, a different order_id, and the second order's time falling between the first order's time and 10 minutes later:
SELECT t1.user_id, t1.create_time AS order1_time, t2.create_time AS order2_time
FROM transactions t1
JOIN transactions t2 ON t2.user_id = t1.user_id
AND t2.order_id != t1.order_id
AND STR_TO_DATE(t2.create_time, '%m/%d/%Y %k:%i') BETWEEN
STR_TO_DATE(t1.create_time, '%m/%d/%Y %k:%i')
AND STR_TO_DATE(t1.create_time, '%m/%d/%Y %k:%i') + INTERVAL 10 MINUTE
Output:
user_id order1_time order2_time
101 4/2/2018 8:34 4/2/2018 8:37
103 4/2/2018 9:32 4/2/2018 9:36
106 4/2/2018 13:11 4/2/2018 13:18
Demo on dbfiddle
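If you are on MySQL 8.0 or later, you can also avoid the self-join and compare each order with the one right before it using a window function. This is only a sketch, assuming the table is called transactions and create_time is already a DATETIME column (otherwise wrap it in STR_TO_DATE as described above):
SELECT user_id,
       prev_time   AS order1_time,
       create_time AS order2_time
FROM (
    SELECT user_id,
           create_time,
           -- previous order time for the same user, in time order
           LAG(create_time) OVER (PARTITION BY user_id ORDER BY create_time) AS prev_time
    FROM transactions
) AS t
WHERE prev_time IS NOT NULL
  AND create_time <= prev_time + INTERVAL 10 MINUTE;
Unlike the self-join, this only compares strictly consecutive orders for each user, which matches the "consecutive transaction" wording of the question.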
Use this query:
SELECT user_id
FROM `table_name`
WHERE create_time < DATE_SUB(NOW(), INTERVAL 10 MINUTE)
GROUP BY user_id
HAVING COUNT(user_id) > 1

Join Tables Based on Time and ID

I have two tables of time series data that I am trying to query, and I don't know how to do it properly.
The first table is time series data of device measurements. Each device is associated with a source and the data contains an hourly measurement. In this example there are 5 devices (101-105) with data for 5 days (June 1-5).
device_id date_time source_id meas
101 2016-06-01 00:00 ABC 105
101 2016-06-01 01:00 ABC 102
101 2016-06-01 02:00 ABC 103
...
101 2016-06-05 23:00 ABC 107
102 2016-06-01 00:00 XYZ 102
...
105 2016-06-05 23:00 XYZ 104
The second table is time series data of source measurements. Each source has three hourly measurements (meas_1, meas_2 and meas_3).
source_id date_time meas_1 meas_2 meas_3
ABC 2016-06-01 00:00 100 101 102
ABC 2016-06-01 01:00 99 100 105
ABC 2016-06-01 02:00 104 108 109
...
ABC 2016-06-05 23:00 102 109 102
XYZ 2016-06-01 00:00 105 106 103
...
XYZ 2016-06-05 23:00 103 105 101
I am looking for a query to get the data for a specified date range that grabs the device's measurements and its associated source's measurements. This example is the result for querying for device 101 from June 2-4.
device_id date_time d.meas s.meas_1 s.meas_2 s.meas_3
101 2016-06-02 00:00 105 100 101 102
101 2016-06-02 01:00 102 99 100 105
101 2016-06-02 02:00 103 104 108 109
...
101 2016-06-04 23:00 107 102 109 102
The actual data set could get large, with let's say 100,000 devices and 90 days of hourly measurements, so any help on properly indexing the tables would be appreciated. I'm using MySQL.
UPDATE - Solved
Here's the query I used:
SELECT d.device_id, d.date_time, d.meas, s.meas_1, s.meas_2, s.meas_3
FROM devices AS d
JOIN sources AS s
ON d.source_id = s.source_id
  AND d.date_time = s.date_time
  AND d.device_id = '101'
  AND d.date_time >= '2016-06-02 00:00'
  AND d.date_time <= '2016-06-04 23:00'
ORDER BY d.date_time;
For what it's worth, it also worked with the filters in a WHERE clause rather than in the JOIN, but it was slower. Thanks for the help.
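Since the question also asked about indexing: composite indexes on the join and filter columns are the usual starting point for a join like this. A sketch, assuming the devices and sources table and column names from the query above (adjust to your actual schema):
-- supports filtering devices by device_id and a date_time range
CREATE INDEX idx_devices_device_datetime ON devices (device_id, date_time);
-- supports the join lookup into sources on (source_id, date_time)
CREATE INDEX idx_sources_source_datetime ON sources (source_id, date_time);
If (source_id, date_time) uniquely identifies a row in sources, making it the primary key (or a UNIQUE index) serves the same purpose.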

Web scraping data using R

Aim: I am trying to scrape the historical daily stock prices for all companies from the webpage http://www.nepalstock.com/datanepse/previous.php. The following code works; however, it always generates the daily stock prices for the most recent date (Feb 5, 2015) only. In other words, the output is the same irrespective of the date that I enter. I would appreciate it if you could help in this regard.
library(RHTMLForms)
library(RCurl)
library(XML)
url <- "http://www.nepalstock.com/datanepse/previous.php"
forms <- getHTMLFormDescription(url)
# we are interested in the second list with date forms
# forms[[2]]
# HTML Form: http://www.nepalstock.com/datanepse/
# Date: [ ]
get_stock<-createFunction(forms[[2]])
#create sequence of dates from start to end and store it as a list
date_daily<-as.list(seq(as.Date("2011-08-24"), as.Date("2011-08-30"), "days"))
# determine the number of elements in the list
num<-length(date_daily)
daily_1<-lapply(date_daily,function(x){
show(x) #displays the particular date
readHTMLTable(htmlParse(get_stock(Date = x)), which = 7)
})
# 18 tables are returned, of which table 7 is the one we want
# change the colnames
col_name<-c("SN","Traded_Companies","No_of_Transactions","Max_Price","Min_Price","Closing_Price","Total_Share","Amount","Previous_Closing","Difference_Rs.")
daily_2<-lapply(daily_1,setNames,nm=col_name)
Output:
> head(daily_2[[1]],5)
SN Traded_Companies No_of_Transactions Max_Price Min_Price Closing_Price Total_Share Amount
1 1 Agricultural Development Bank Ltd 24 489 471 473 2,868 1,359,038
2 2 Arun Valley Hydropower Development Company Limited 40 365 360 362 8,844 3,199,605
3 3 Alpine Development Bank Limited 11 297 295 295 150 44,350
4 4 Asian Life Insurance Co. Limited 10 1,230 1,215 1,225 898 1,098,452
5 5 Apex Development Bank Ltd. 23 131 125 131 6,033 769,893
Previous_Closing Difference_Rs.
1 480 -7
2 363 -1
3 303 -8
4 1,242 -17
5 132 -1
> tail(daily_2[[1]],5)
SN Traded_Companies No_of_Transactions Max_Price Min_Price Closing_Price Total_Share Amount Previous_Closing
140 140 United Finance Ltd 4 255 242 242 464 115,128 255
141 141 United Insurance Co.(Nepal)Ltd. 3 905 905 905 234 211,770 915
142 142 Vibor Bikas Bank Limited 7 158 152 156 710 109,510 161
143 143 Western Development Bank Limited 35 320 311 313 7,631 2,402,497 318
144 144 Yeti Development Bank Limited 22 139 132 139 14,355 1,921,511 134
Difference_Rs.
140 -13
141 -10
142 -5
143 -5
144 5
Here's one quick approach. Note that the site uses a POST request to send the date to the server.
library(rvest)
library(httr)
page <- "http://www.nepalstock.com/datanepse/previous.php" %>%
POST(body = list(Date = "2015-02-01")) %>%
html()
page %>%
html_node(".dataTable") %>%
html_table(header = TRUE)

Select specific records and change their values

I have a table filled with (Dutch) holidays in the past and future, and I need to change the name and date for a specific holiday. Since we don't have a Queen anymore, but a King, Queensday is no longer a holiday. Now it's Kingsday, AND it is celebrated on a different date.
This is my current table:
68 NL 2014-04-30 Queensday
77 NL 2015-04-30 Queensday
88 NL 2016-04-30 Queensday
97 NL 2017-04-30 Queensday
106 NL 2018-04-30 Queensday
115 NL 2019-04-30 Queensday
124 NL 2020-04-30 Queensday
134 NL 2021-04-30 Queensday
I want to change all records where description='Queensday' to description='Kingsday' AND date = date - 3 days (since it is celebrated each April 27th), but only where the year of the date is greater than 2013.
update <yourTable>
set
description = 'Kingsday',
<yourdateField> = date_sub(<yourdateField>, interval 3 DAY)
where description = 'Queensday'
and year(<yourdateField>) > 2013
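If you want to check which rows the UPDATE will touch before running it, the same predicate works as a SELECT. A sketch, assuming a table named holidays with a date column holiday_date (hypothetical names standing in for your real table and <yourdateField>):
-- preview the rows that would change, together with the shifted date
SELECT description,
       holiday_date,
       DATE_SUB(holiday_date, INTERVAL 3 DAY) AS new_date
FROM holidays
WHERE description = 'Queensday'
  AND YEAR(holiday_date) > 2013;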
bonus (?)
update country
set
politicalSystem = 'democracy',
comment = 'easier to manage holidays'
where
politicalSystem = 'royalty'

Custom sort for an SSRS matrix

I have a matrix report that has four columns and is sorted descending on the last column's values. The problem I have is that when there is a tie I would like to use the value in the prior column, or the one prior to that if there is still a tie. Below is a sample of my output; what I'm after is for Nissan and Renault to be switched. This is the expression I'm currently using in my group sort:
=IIF(Fields!YearSold.Value = MAX(Fields!YearSold.Value),0, Fields!UnitSold.Value)
2009 2010 2011 2012
Make Units Units Units Units
Chevy 1,104 842 811 927
Volvo 1,054 905 792 879
Ford 1,638 923 718 809
Nissan 918 794 725 791
Renault 840 698 759 791
Mazda 722 535 460 621
Lexus 786 590 551 563
You can sort a tablix on multiple columns. Edit the tablix Sort properties, adding the additional columns in order - the tablix will be sorted in that order, starting with the top column.