how to remove duplicates in SAS data step

How to remove duplicates in SAS data step.
data uscpi;
input year month cpi;
datalines;
1990 6 129.9
1990 7 130.4
1990 8 131.6
1990 9 132.7
1991 4 135.2
1991 5 135.6
1991 6 136.0
1991 7 136.2
;
run;
PROC SORT DATA = uscpi OUT = uscpi_dist NODUPKEY;
BY year ;
RUN;
I can do this with a PROC step, but how can I remove the duplicates in a DATA step? Thanks in advance.

You can use the first. and last. automatic variables that SAS creates during BY-group processing. They give you more control over which row is treated as the duplicate.
See the SAS documentation on BY-group processing in the DATA step for details.
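data uscpi_dedupedByYear;
set uscpi_sorted; /* assumes uscpi_sorted is the input data already sorted by year; BY requires sorted input */
by year;
if first.year; /* keep only the first occurrence of each distinct year */
/* if last.year; */ /* alternatively, keep only the last occurrence of each distinct year */
run;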
A lot depends on how your input dataset is sorted. For example, if the input is sorted by year and month, then if first.year; keeps only the earliest month in each year. However, if the input is sorted by year and descending month, then if first.year; keeps the latest month in each year.
This behaviour obviously differs from how nodupkey works.
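For readers who do not have SAS at hand, the effect of sort order on first.year can be sketched in Python. This is only an illustration of the idea, not SAS code: the function name dedupe_by_year and its parameters are made up for this sketch, and the rows are copied from the question's datalines.

```python
from itertools import groupby

# (year, month, cpi) rows copied from the question's datalines
rows = [
    (1990, 6, 129.9), (1990, 7, 130.4), (1990, 8, 131.6), (1990, 9, 132.7),
    (1991, 4, 135.2), (1991, 5, 135.6), (1991, 6, 136.0), (1991, 7, 136.2),
]

def dedupe_by_year(rows, keep="first", descending_month=False):
    """Keep one row per year, mimicking `if first.year;` / `if last.year;`."""
    # Sort by year, then by month (optionally descending), like a prior PROC SORT.
    ordered = sorted(
        rows, key=lambda r: (r[0], -r[1] if descending_month else r[1])
    )
    result = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        result.append(group[0] if keep == "first" else group[-1])
    return result

first_rows = dedupe_by_year(rows)                              # earliest month per year
first_rows_desc = dedupe_by_year(rows, descending_month=True)  # latest month per year
last_rows = dedupe_by_year(rows, keep="last")                  # also latest month per year
```

With ascending months, keeping the first row gives June 1990 and April 1991; with descending months the same rule gives September 1990 and July 1991, which is also what keeping the last row of the ascending sort gives.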

Related

how to get some field of current week data and some of next 7 days data in mysql

Hi everyone, I am trying to write a query that returns the weekly_outstanding column for the current week and the recovery column for the following 7 days.
There are 3 input parameters: team_id, week_start_date, week_end_date.
The expected output is weekly_outstanding for the week given by the input parameters, with recovery taken from the next 7 days.
For example, if I select the dates 2022-03-28 to 2022-04-03 as input, weekly_outstanding should come from that week and recovery from the following week.
SELECT sum(weekly_outstanding) as last_week_os, sum(recovery) FROM `fleet_driver_dash_weekly` WHERE team_id=1 and (week_start_date='2022-03-28' and week_end_date='2022-04-03')

Can you try this? I'm not sure whether it solves your problem; I can modify it if you share some sample data and the expected output.
SELECT sum(weekly_outstanding) as last_week_os
, sum(recovery) over (order by week_start_date asc ROWS BETWEEN CURRENT ROW AND 7 FOLLOWING)
FROM `fleet_driver_dash_weekly` WHERE team_id=1 ;
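Since no sample data was posted, here is one possible reading of the requirement as a runnable sketch. It is a different technique from the window-function query above: conditional aggregation. It makes several assumptions: the table holds one row per team per week, dates are stored as ISO strings, the sample rows are invented, and SQLite (via Python's sqlite3) stands in for MySQL. In MySQL, the date arithmetic would be written with DATE_ADD(..., INTERVAL 7 DAY) instead of SQLite's date(..., '+7 days').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fleet_driver_dash_weekly ("
    "team_id INTEGER, week_start_date TEXT, week_end_date TEXT, "
    "weekly_outstanding REAL, recovery REAL)"
)
# Hypothetical sample rows: one row per team per week.
conn.executemany(
    "INSERT INTO fleet_driver_dash_weekly VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2022-03-28", "2022-04-03", 100.0, 10.0),
        (1, "2022-04-04", "2022-04-10", 200.0, 25.0),
        (1, "2022-04-11", "2022-04-17", 300.0, 40.0),
        (2, "2022-04-04", "2022-04-10", 999.0, 99.0),
    ],
)

# Conditional aggregation: outstanding from the selected week,
# recovery from weeks starting within the 7 days after it ends.
query = """
SELECT
    SUM(CASE WHEN week_start_date = :week_start AND week_end_date = :week_end
             THEN weekly_outstanding END) AS last_week_os,
    SUM(CASE WHEN week_start_date > :week_end
              AND week_start_date <= date(:week_end, '+7 days')
             THEN recovery END) AS next_week_recovery
FROM fleet_driver_dash_weekly
WHERE team_id = :team
"""
params = {"team": 1, "week_start": "2022-03-28", "week_end": "2022-04-03"}
last_week_os, next_week_recovery = conn.execute(query, params).fetchone()
```

With the invented rows above, the outstanding amount comes from the week of 2022-03-28 and the recovery from the week starting 2022-04-04.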

Can I automatically change the dataset for a tablix based on the current month?

I have a report with 7 datasets and 7 tables, all exactly the same except that each dataset has a slightly different where clause:
where datepart(DW,[Start Time]) = 2
The idea is to show data for each Monday this month in one table, each Tuesday in another, and so on. So one table uses the Monday dataset (above), one uses
where datepart(DW,[Start Time]) = 3
for Tuesday, and so on. What I really want is for the report to decide which dataset to use first based on what day of the week the first of the current month was. This month (April) the 1st was a Saturday, so I'd like my leftmost table to use the dataset
where datepart(DW,[Start Time]) = 7
then the next one to use Sunday, and so on. But next month (May) I want it to switch automatically to using the Monday dataset in the first table, as the 1st will be a Monday.
Is this possible?

One possibility is to use a sub-report that uses a single dataset. This dataset would have a where clause that takes a parameter
WHERE DATEPART(DW, [Start Time]) = @dayOfWeek
Replace your current 7 tables with 7 copies of the same sub-report, and set each copy's parameter from the first days of the month, so your first sub-report would have the parameter passed as
=WeekDay(DateSerial(Year(Now()), Month(Now()), 1))
the second sub-report would have the parameter passed as
=WeekDay(DateSerial(Year(Now()), Month(Now()), 2))
and so on...
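To check which parameter each of the seven sub-reports would receive, the calculation can be sketched outside SSRS. The Python below is only an illustration: it assumes the numbering used in the question (1 = Sunday through 7 = Saturday, so Monday is 2), the function names are made up, and 2017 is picked as an example year in which 1 April falls on a Saturday and 1 May on a Monday.

```python
from datetime import date

def sql_dw(d):
    """Map a date to day-of-week numbering where 1 = Sunday ... 7 = Saturday."""
    # isoweekday() gives Monday = 1 ... Sunday = 7, so shift it by one place.
    return d.isoweekday() % 7 + 1

def subreport_parameters(year, month):
    """Day-of-week parameter for each of the 7 sub-reports, left to right."""
    return [sql_dw(date(year, month, day)) for day in range(1, 8)]

april = subreport_parameters(2017, 4)  # starts with 7 (Saturday)
may = subreport_parameters(2017, 5)    # starts with 2 (Monday)
```

Because the first seven days of any month cover all seven weekdays exactly once, every dataset value is used by exactly one sub-report.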

R how to select 150 days with only month and day information

I was able to select the last 150 days from the database when it had a 'year' column, as follows:
data1 = dbGetQuery(conn_data, statement=paste(
  "SELECT *, STR_TO_DATE(CONCAT(yyyy,'-',mm,'-',dd),'%Y-%m-%d') as dt FROM stations_daily_data",
  "WHERE STR_TO_DATE(CONCAT(yyyy,'-',mm,'-',dd),'%Y-%m-%d') >= DATE_SUB(CURDATE(), INTERVAL 150 DAY)"))
But now all the data have been averaged by calendar date and thus only have the columns 'month' and 'day' (no 'year' column), and I am stuck on how to select the last 150 days. Here is a simplified example of the data frame; the original has 17 million rows:
df <- data.frame(ID=c(1:5,50001:50005),mm=c(rep(1,5),rep(12,5)),dd=c(1:5,27:31),value=c(21:30))
Feb 29 can be ignored, since 150 days is a long enough period.
I tried adding a 'year' column so that I could use the code above, but that would be wrong when, say, the current date is at the beginning of a year. Making changes to such a big table in R also runs out of memory. I'm not familiar with database queries: is it possible to do this with a query alone, instead of reading the table into R and then modifying the data frame there? Any suggestion would be appreciated!
EDIT:
The 'year' column is no longer needed since everything has been averaged by date, which means May 5th is now the average over 60 years of May 5ths. Next I would like to select the last 150 days (averaged); the only reason I tried to add a 'year' column was to make that selection easier.
I need to run this every day. If the current month is after June, it would be easy to just use the current year, but if it's February, some of the rows would need current year - 1. This could be done if the data were much smaller, but when I modify the data frame, R fails with an 'out of memory' error. That's why I was wondering whether there is a way to select in a database query, or with R functions, that wouldn't cost much memory. Thanks!

You could write a function that calculates the year from a reference year plus an adjustment based on a cut-off month. Then you could use the order function to order the data.frame by calculated year, month, and day, without inserting the calculated year into the data.frame.
It won't perform well on a 17-million-row dataset, though, since you are still ordering every row.
# some dummy data (not worrying about illegal dates like Feb 31)
set.seed(123)
da <- data.frame(mm=sample(1:12, 20, replace=TRUE),
                 dd=sample(1:31, 20, replace=TRUE))
# function to calculate year from reference year and cut-off month
calc_year <- function(mm_vec, ref_year, cut_month) {
  ref_year + ifelse(mm_vec >= cut_month, 0, -1)
}
# order the data.frame by year, month, and day
# (taking 2014 as ref. year & assuming months before June are from prior year)
da[with(da, order(calc_year(mm_vec=mm, ref_year=2014, cut_month=6), mm, dd)), ]
# if you want just the first 5 rows
da[with(da, order(calc_year(mm_vec=mm, ref_year=2014, cut_month=6), mm, dd)), ][1:5,]
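The same "work out the year from a reference" idea can be taken one step further to do the 150-day selection directly. The Python sketch below is a variation on the answer, not the answer's own method: instead of a fixed cut-off month it uses the current date as the cut-off, assigning each month/day pair its most recent occurrence. The function names, the sample rows, and the reference date are all hypothetical, and Feb 29 is skipped as the question allows.

```python
from datetime import date, timedelta

def most_recent_date(mm, dd, today):
    """Latest date with this month/day that is not after `today`."""
    on_or_before_today = (mm, dd) <= (today.month, today.day)
    year = today.year if on_or_before_today else today.year - 1
    return date(year, mm, dd)

def last_n_days(rows, today, n=150):
    """Keep (mm, dd, value) rows whose most recent occurrence is in the last n days."""
    cutoff = today - timedelta(days=n)
    kept = []
    for mm, dd, value in rows:
        if (mm, dd) == (2, 29):
            continue  # Feb 29 is ignored, as the question allows
        d = most_recent_date(mm, dd, today)
        if d > cutoff:
            kept.append((d, value))
    return sorted(kept)  # oldest first

# Hypothetical rows and reference date
rows = [(1, 1, 21), (12, 31, 30), (6, 1, 5)]
selected = last_n_days(rows, today=date(2015, 2, 10))
```

With a reference date of 10 February 2015, December 31 is assigned to 2014 and January 1 to 2015, so both fall inside the window, while June 1 is assigned to 2014 and falls outside it.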

Get sum of values based on the value of 2 other date related columns

Given the sample data in the screenshot below, would it be possible in MySQL to return a sum of the values in monthly_amount only for the months up to this month? I used a join to pull this data. The 5 left columns are from one table, and the rest are from another.
The issue I'm running into is this: let's say it's April of 2015. I can't just do a sum WHERE goal_year <= 2015 AND month_id_FK <= 4, or else I'll get only those 4 months from both years, when in that scenario I really want all the months from 2014 plus the 4 months from 2015.
I could handle this in PHP, but I wanted to first see if there is a way to do this in MySQL.

Try:
WHERE goal_year*100 + month_id_FK <= 201504
Alternatively:
WHERE
goal_year < 2015 OR
(goal_year = 2015 AND month_id_FK <= 4)
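To see why the naive filter undercounts and why both suggested conditions give the same result, here is a small Python sketch. The rows are invented (a flat amount of 10 per month for 2014 and 2015), and the function names are made up for illustration.

```python
# Hypothetical (goal_year, month_id_FK, monthly_amount) rows covering 2014 and 2015
rows = [(year, month, 10.0) for year in (2014, 2015) for month in range(1, 13)]

def total_through(rows, year, month):
    """Sum amounts for every month up to and including (year, month)."""
    # Tuple comparison is equivalent to: y < year OR (y = year AND m <= month)
    return sum(amount for y, m, amount in rows if (y, m) <= (year, month))

def total_through_key(rows, year, month):
    """Same filter using the single-number key goal_year*100 + month_id_FK."""
    return sum(
        amount for y, m, amount in rows if y * 100 + m <= year * 100 + month
    )

# The condition from the question: only January to April of each year
naive = sum(amount for y, m, amount in rows if y <= 2015 and m <= 4)
correct = total_through(rows, 2015, 4)
correct_key = total_through_key(rows, 2015, 4)
```

The naive filter picks up 8 months (4 from each year), while both corrected filters pick up 16 (all of 2014 plus 4 months of 2015).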

select sum(monthly_amount) from table where goaldate < CURDATE()
This is not the actual query for your table, but if you follow this pattern you will get the answer.
You need the sum of monthly_amount where the date is before the current date, that is, before today,
so you can simply compare the goal date with the current date.

SSRS 2008 month number not displayed in order

I have an SSRS 2008 report that generates month columns along with other data based on year halves. The tablix column group and sort are both set to [Mon]. The first half of the year is generated just fine, but when I run the report for the second half, the months are not displayed in order:
MonthNumber  10       11        12        7     8       9
MonthName    October  November  December  July  August  September
The SQL query generates the following rows, which appear in order of month number.
Mon
7
8
9
10
11
12

I would say that Mon is being treated as a string value, for whatever reason, either in the query or in the dataset definition. You can see in your example that the columns are being sorted as strings: 10 comes before 7 when sorted as text rather than as a number.
You have two options:
The first is to sort by an expression like =CInt(Fields!Mon.Value), i.e. explicitly sorting as an integer, which solves the issue if Mon is being treated as text.
The other option is to make sure that Mon is treated as an integer at the dataset level. Either way should be fine.
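The text-versus-number ordering is easy to reproduce outside SSRS. The short Python sketch below is only an illustration of the sorting behaviour, with int used as the sort key in the role that CInt plays in the report expression.

```python
# Month numbers as they would arrive if the field is typed as text
months_as_text = ["7", "8", "9", "10", "11", "12"]

# Text sort compares character by character, so "10" sorts before "7"
text_order = sorted(months_as_text)

# Converting to integers for the sort key restores calendar order
numeric_order = sorted(months_as_text, key=int)
```

Here text_order reproduces the column order seen in the report (10, 11, 12, 7, 8, 9), and numeric_order gives 7 through 12.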
