What is a by statement really doing in a sas data step? - duplicates

Okay, this seems like a very simple thing, but I can't explain what a BY statement in a SAS data step is really doing. I know when I need to use it, but I am not sure what it does.
In the example below I understand what the automatic variables first.var and last.var mean when they take the values they do. Is the BY statement creating these virtual columns around the variables initial and metal? Does SAS then scan the entire data set once?
data jewelers ;
input id initial $ metal $ ;
datalines;
456 D Gold
456 D Silver
123 L Gold
123 L Copper
123 L PLatinum
567 R Gold
567 R Gold
567 R Gold
345 A Platinum
345 A Silver
345 A Silver
;
proc sort ;
by initial metal ;
run;
data dups;
set jewelers ;
by initial metal ;
if not (first.metal and last.metal) then output;
run;
If I proc print dups, I expect this:
567 R Gold
567 R Gold
567 R Gold
345 A Silver
345 A Silver

As you've acknowledged, SAS is creating the automatic variables first.byvar and last.byvar for each byvar in your by statement. Records read in from a set statement are held in a buffer before SAS moves them to the PDV (Program Data Vector - where data step logic is executed on each row before it is output), so SAS can look ahead to the next record in the buffer to see whether any of the byvars have changed, and set last.byvar = 1 for the row currently in the PDV.
The only difference I can see between what you say you expect and what you get in the dups dataset is the order of the records - because you've sorted by initial and then metal, the A Silver records are sorted ahead of the R Gold records.
If you want to get duplicates across these two variables while preserving the original row order, you'd need to make a note of what the original record order was and sort your duplicates dataset back into that order after your second data step.
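The flag logic described above can be sketched outside SAS. This is a minimal Python illustration, not how SAS is implemented: the helper name by_flags and the tuple layout are made up for the example, and the rows are sorted first, as PROC SORT would leave them.

```python
# Rows as (id, initial, metal), sorted by (initial, metal) like PROC SORT.
rows = sorted(
    [
        (456, "D", "Gold"), (456, "D", "Silver"),
        (123, "L", "Gold"), (123, "L", "Copper"), (123, "L", "Platinum"),
        (567, "R", "Gold"), (567, "R", "Gold"), (567, "R", "Gold"),
        (345, "A", "Platinum"), (345, "A", "Silver"), (345, "A", "Silver"),
    ],
    key=lambda r: (r[1], r[2]),
)


def key_prefix(row, key_indexes, depth):
    """Values of the first `depth` BY variables for one row."""
    return tuple(row[k] for k in key_indexes[:depth])


def by_flags(rows, key_indexes):
    """Yield (row, first, last); first[d] and last[d] mimic FIRST./LAST.
    for the d-th BY variable by comparing with the previous and next row."""
    n = len(rows)
    for pos, row in enumerate(rows):
        prev_row = rows[pos - 1] if pos > 0 else None
        next_row = rows[pos + 1] if pos < n - 1 else None
        first, last = [], []
        for depth in range(1, len(key_indexes) + 1):
            current = key_prefix(row, key_indexes, depth)
            first.append(prev_row is None
                         or key_prefix(prev_row, key_indexes, depth) != current)
            last.append(next_row is None
                        or key_prefix(next_row, key_indexes, depth) != current)
        yield row, first, last


# Same test as the data step: keep rows that are not both first and last
# within their (initial, metal) group.
dups = [row for row, first, last in by_flags(rows, (1, 2))
        if not (first[-1] and last[-1])]
for row in dups:
    print(*row)
```

The A Silver rows come out ahead of the R Gold rows here too, because the input was sorted by initial and then metal.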

SAS is basically comparing the current record to the one before and the one after to calculate the FIRST. and LAST. flags. You can roll your own if you want.
data test ;
set jewelers end=eof;
by initial metal ;
if not eof then
set jewelers (firstobs=2 keep=initial metal
rename=(initial=next_initial metal=next_metal))
;
first_initial = (_n_=1) or (initial ne lag(initial));
last_initial = eof or (initial ne next_initial) ;
first_metal = first_initial or (metal ne lag(metal)) ;
last_metal = last_initial or (metal ne next_metal);
put / (initial metal) (=);
put (first:) (=);
put (last:) (=);
run;


How do I replace multiple strings from one column in MySQL?

I am working on a multi-language database.
Unfortunately, some of the strings were mistakenly
entered into the wrong language column.
I have a glossary for those strings,
but the replacement task is huge (over 3,000 strings).
I cannot finish the work with the query I used before:
UPDATE
tableName
SET
column = REPLACE(column,'BereplaceString','replaceString')
WHERE
locale = "XX" and
description like "%BereplaceString%"
Is it possible to do the whole replacement task in one command?
(I can use Excel to build the REPLACE strings,
but I don't want to execute the command 1,000 times.)
Thanks in advance for helping!
Now:
ID  Column A    Column B
    (Language)  (Description)
1   EN          red pomme
2   FR          apple rouge
3   EN          yellow citron
4   FR          lemon jaune
Expected Outcome:
ID  Column A    Column B
    (Language)  (Description)
1   EN          red apple
2   FR          pomme rouge
3   EN          jaune citron
4   FR          lemon yellow
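One possible approach, sketched here in Python rather than taken from an answer in this thread, is to generate a single UPDATE whose SET clause nests one REPLACE call per glossary entry. The function name build_bulk_replace and the glossary pairs are hypothetical. The %s placeholders assume a MySQL driver that uses that parameter style, and the statement assumes the table has a locale column as in the query above. Nested REPLACE calls are applied from the innermost outward, so entries whose text overlaps can affect each other, and several thousand nested calls may need to be split into batches.

```python
def build_bulk_replace(table, column, locale, glossary):
    """Build one UPDATE statement with a nested REPLACE per glossary pair.

    Returns (sql, params); the values are passed separately so the driver
    handles quoting. Table and column names are assumed to be trusted.
    """
    expr = f"`{column}`"
    params = []
    for wrong, right in glossary:
        # Wrap the expression built so far in one more REPLACE call.
        expr = f"REPLACE({expr}, %s, %s)"
        params.extend([wrong, right])
    sql = f"UPDATE `{table}` SET `{column}` = {expr} WHERE locale = %s"
    params.append(locale)
    return sql, params


# Hypothetical glossary for the EN rows: (misplaced string, correct string).
sql, params = build_bulk_replace(
    "tableName", "column", "EN", [("pomme", "apple"), ("citron", "lemon")]
)
print(sql)
print(params)
```

The resulting pair would be passed to something like cursor.execute(sql, params), once per locale.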

How to perform a many-to-many or (at least) an outer join in SPSS

Usually I use R for my data analysis, but these days I have to use SPSS. I expected data manipulation to get a little more difficult this way, but after my first day I am close to surrendering :D and I would really appreciate some help ...
My problem is the following:
I have two data sets, each with an ID number. Neither data set has unique IDs (in the one that should have unique IDs, there is a duplicated row).
In a perfect world I would like to keep this duplicated row and simply perform a many-to-many join. But I have accepted that I might have to delete this "bad" row (in dataset A) and perform a 1:many join (join dataset B to dataset A, which contains the unique IDs).
If I run the join (and accept that a 1:many join seems not to be possible, only a many:1 join), I lose IDs. If I join dataset A to dataset B, I lose all cases that are not part of dataset B. But I would really like to keep the IDs from both data sets, as in a full join.
Do you know if there is a (reasonably) simple solution to my problem?
Example:
dataset A:
ID  VAL1
1   A
1   B
2   D
3   K
4   A
dataset B:
ID  VAL2
1   g
2   k
4   a
5   c
5   d
5   a
2   x
expected result (best solution):
ID  VAL1  VAL2
1   A     g
1   B     g
2   D     k
3   K     NA
4   A     a
2   D     x
expected result (second best solution):
ID  VAL1  VAL2
1   A     g
2   D     k
3   K     NA
4   A     a
5   NA    c
5   NA    d
5   NA    a
2   D     x
what I get (worst solution):
ID  VAL1  VAL2
1   A     g
2   D     k
4   A     a
5   NA    c
5   NA    d
5   NA    a
2   D     x
From your example it looks like what you need is a full many-to-many join, based on the IDs existing in dataset A. You can get this by creating a full Cartesian product of the two datasets, using dataset A as the first (left) dataset.
The following syntax assumes you have the STATS CARTPROD extension command installed. If you don't, you can see here about installing it.
First I'll recreate your example to demonstrate on:
dataset close all.
data list list/id1 vl1 (2F3) .
begin data
1 232
1 433
2 456
3 246
4 468
end data.
dataset name aaa.
data list list/id2 vl2 (2F3) .
begin data
1 111
2 222
4 333
5 444
5 555
5 666
2 777
3 888
end data.
dataset name bbb.
Now the actual work is fairly simple:
DATASET ACTIVATE aaa.
STATS CARTPROD VAR1=id1 vl1 INPUT2=bbb VAR2=id2 vl2
/SAVE OUTFILE="C:\somepath\yourcartesianproduct.sav".
* The new dataset now contains all possible combinations of rows in the two datasets.
* we will select only the relevant combinations, where the two ID's match.
select if id1=id2.
exe.
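To make the idea concrete outside SPSS, here is a small Python sketch of the same two steps on the same example values: build every combination of rows, then keep only the combinations where the IDs match. The variable names mirror the syntax above; the sketch does not call STATS CARTPROD.

```python
from itertools import product

# (id1, vl1) rows of dataset aaa and (id2, vl2) rows of dataset bbb.
aaa = [(1, 232), (1, 433), (2, 456), (3, 246), (4, 468)]
bbb = [(1, 111), (2, 222), (4, 333), (5, 444),
       (5, 555), (5, 666), (2, 777), (3, 888)]

# Step 1: full Cartesian product, one output row per pair of input rows.
cartesian = [(id1, vl1, id2, vl2)
             for (id1, vl1), (id2, vl2) in product(aaa, bbb)]

# Step 2: keep only the combinations where the two IDs match.
matched = [row for row in cartesian if row[0] == row[2]]

for row in matched:
    print(*row)
```

Note that an ID present in only one of the two datasets (ID 5 here) has no matching combination and drops out after the selection.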

Is there some kind of way to import data that consists of multiple rows?

In RapidMiner the data table I usually see is like this:
Row Age Class
1 19 Adult
2 10 Minor
3 15 Teenager
In the data table above, each row is one complete piece of information.
But how do I input a data table to RapidMiner where more than one row makes up one complete piece of information?
For example:
Row Word Rho Theta Phi
1 Hello 0.9384 0.4943 1.2750
2 Hello 1.2819 0.8238 1.3465
3 Hello 1.3963 0.1758 1.4320
4 Eat 1.3918 0.3883 1.1756
5 Eat 1.4742 0.0526 1.2312
6 Eat 0.6698 0.2548 1.4769
7 Eat 0.3074 1.2214 0.2059
In the data table above, rows 1-3 form one complete piece of information: together, the combinations of rho, theta, and phi in rows 1-3 mean the word hello. The same goes for rows 4-7, which together mean the word eat. To make this clearer, take a look at the table below.
Row Rho Theta Phi Word
----------------------------
1 |0.9384 0.4943 1.2750|
2 |1.2819 0.8238 1.3465| HELLO
3 |1.3963 0.1758 1.4320|
----------------------------
4 |1.3918 0.3883 1.1756|
5 |1.4742 0.0526 1.2312|
6 |0.6698 0.2548 1.4769| EAT
7 |0.3074 1.2214 0.2059|
----------------------------
Again, my problem is: how do I insert this kind of data table into RapidMiner so that it understands that multiple rows make up one complete piece of information? Is there some kind of table like the one below?
Row Word Rho Theta Phi
1 Hello 0.9384 0.4943 1.2750
. Hello 1.2819 0.8238 1.3465
1 Hello 1.3963 0.1758 1.4320
2 Eat 1.4742 0.0526 1.2312
. Eat 0.6698 0.2548 1.4769
. Eat 0.3074 1.2214 0.2059
2 Eat 0.3074 1.2214 0.2059
You can try to use the Pivot operator to group your result by word.
To do so, I would set the group attribute parameter to "Word" and the index parameter to "Row". It's not exactly the same representation, but it may be close enough depending on your use case, as tables in which several rows form one record are not part of RapidMiner's design.
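As a rough illustration of what grouping by word means for this data, here is a Python sketch that collects the (rho, theta, phi) triples belonging to each word. It only shows the grouping idea; it is not RapidMiner code and does not reproduce the exact output of the Pivot operator.

```python
from collections import defaultdict

# (row, word, rho, theta, phi) as in the question's second table.
rows = [
    (1, "Hello", 0.9384, 0.4943, 1.2750),
    (2, "Hello", 1.2819, 0.8238, 1.3465),
    (3, "Hello", 1.3963, 0.1758, 1.4320),
    (4, "Eat", 1.3918, 0.3883, 1.1756),
    (5, "Eat", 1.4742, 0.0526, 1.2312),
    (6, "Eat", 0.6698, 0.2548, 1.4769),
    (7, "Eat", 0.3074, 1.2214, 0.2059),
]

# One entry per word, holding every measurement that belongs to it.
grouped = defaultdict(list)
for _row, word, rho, theta, phi in rows:
    grouped[word].append((rho, theta, phi))

for word, triples in grouped.items():
    print(word, len(triples), triples)
```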

Panel data or time-series data and xt regression

I need help running a simple regression as well as an xt regression for panel data.
The dataset consists of 16 participants on whom daily observations were made.
I would like to look at the difference between pre-test (the first date on which observations were taken) and post-test (the last date on which observations were made) across different variables.
I was also advised to use xtreg, re.
What is this re, and what is its significance?
If the goal is to fit some xt model at the end, you will need the data in long form. I would use:
bysort id (year): keep if inlist(_n,1,_N)
For each id, this puts the data in ascending chronological order, and keeps the first and last observation for each id.
The RE part of your question is off-topic here. Try Statalist or CV SE site, but do augment your questions with details of the data and what you hope to accomplish. These may also reveal that getting rid of the intermediate data is a bad idea.
Edit:
Add this after the part above:
bysort id (year): gen t= _n
reshape wide x year, i(id) j(t)
order id x1 x2 year1 year2
Perhaps this sample code will set you in the direction you seek.
clear
input id year x
1 2001 11
1 2002 12
1 2003 13
1 2004 14
2 2001 21
2 2002 22
2 2003 23
3 1005 35
end
xtset id year
bysort id (year): generate firstx = x[1]
bysort id (year): generate lastx = x[_N]
list, sepby(id)
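For readers who do not use Stata, the firstx and lastx logic above can be sketched in Python with the same sample values: sort by id and year, then take the first and last x within each id. This is only an illustration of the grouping step, not a replacement for the Stata code.

```python
from itertools import groupby
from operator import itemgetter

# (id, year, x) rows, same values as the Stata input above.
data = [
    (1, 2001, 11), (1, 2002, 12), (1, 2003, 13), (1, 2004, 14),
    (2, 2001, 21), (2, 2002, 22), (2, 2003, 23),
    (3, 1005, 35),
]

# Equivalent of "bysort id (year)": order by id, then year within id.
data.sort(key=itemgetter(0, 1))

result = []
for _id, group in groupby(data, key=itemgetter(0)):
    obs = list(group)
    firstx, lastx = obs[0][2], obs[-1][2]  # x[1] and x[_N] within the id
    result.extend((i, year, x, firstx, lastx) for i, year, x in obs)

for row in result:
    print(*row)
```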
With regard to xtreg, re: that fits a random-effects model. See help xtreg for more details, as well as the documentation for xtreg in the Stata Longitudinal-Data/Panel-Data Reference Manual included in your Stata documentation.

implementation of the Gower distance function

I have a matrix (size: 28 columns and 47 rows) with numbers. This matrix has an extra row that contains headers for the columns ("ordinal" and "nominal").
I want to use the Gower distance function on this matrix. The description here says that:
The final dissimilarity between the ith and jth units is obtained as a weighted sum of dissimilarities for each variable:
d(i,j) = sum_k(delta_ijk * d_ijk ) / sum_k( delta_ijk )
In particular, d_ijk represents the distance between the ith and jth unit computed considering the kth variable. It depends on the nature of the variable:
factor or character columns are considered as categorical nominal variables and d_ijk = 0 if x_ik = x_jk, 1 otherwise;
ordered columns are considered as categorical ordinal variables and the values are substituted with the corresponding position index, r_ik, in the factor levels. These position indexes (that are different from the output of the R function rank) are transformed in the following manner:
z_ik = (r_ik - 1)/(max(r_ik) - 1)
These new values, z_ik, are treated as observations of an interval scaled variable.
As far as the weight delta_ijk is concerned:
delta_ijk = 0 if x_ik = NA or x_jk = NA;
delta_ijk = 1 in all the other cases.
I know that there is a gower.dist function, but I must do it that way.
So, for "d_ijk", "delta_ijk" and "z_ik", I tried to make functions, as I didn't find a better way.
I started with "delta_ijk" and I tried this:
Delta=function(i,j){for (i in 1:28){for (j in 1:47){
+{if (MyHeader[i,j]=="nominal")
+ result=0
+{else if (MyHeader[i,j]=="ordinal") result=1}}}}
+;result}
But I got an error. So I got stuck and I can't do the rest.
P.S. Excuse me if I make mistakes, but English is not a language I use very often.
Why do you want to reinvent the wheel, billyt? There are several functions/packages in R that will compute this for you, including daisy() in package cluster, which comes with R.
First things first though, get those "data type" headers out of your data. If this truly is a matrix then character information in this header row will make the whole matrix a character matrix. If it is a data frame, then all columns will likely be factors. What you want to do is code the type of data in each column (component of your data frame) as 'factor' or 'ordered'.
df <- data.frame(A = c("ordinal",1:3), B = c("nominal","A","B","A"),
C = c("nominal",1,2,1))
Which gives this --- note that all are stored as factors because of the extra info.
> head(df)
A B C
1 ordinal nominal nominal
2 1 A 1
3 2 B 2
4 3 A 1
> str(df)
'data.frame': 4 obs. of 3 variables:
$ A: Factor w/ 4 levels "1","2","3","ordinal": 4 1 2 3
$ B: Factor w/ 3 levels "A","B","nominal": 3 1 2 1
$ C: Factor w/ 3 levels "1","2","nominal": 3 1 2 1
If we get rid of the first row and recode into the correct types, we can compute Gower's coefficient easily.
> headers <- df[1,]
> df <- df[-1,]
> DF <- transform(df, A = ordered(A), B = factor(B), C = factor(C))
> ## We've previously shown you how to do this (above line) for lots of columns!
> str(DF)
'data.frame': 3 obs. of 3 variables:
$ A: Ord.factor w/ 3 levels "1"<"2"<"3": 1 2 3
$ B: Factor w/ 2 levels "A","B": 1 2 1
$ C: Factor w/ 2 levels "1","2": 1 2 1
> require(cluster)
> daisy(DF)
Dissimilarities :
2 3
3 0.8333333
4 0.3333333 0.8333333
Metric : mixed ; Types = O, N, N
Number of objects : 3
Which gives the same as gower.dist() (from the StatMatch package) for this data, although in a slightly different format; as.matrix(daisy(DF)) would be equivalent:
> gower.dist(DF)
[,1] [,2] [,3]
[1,] 0.0000000 0.8333333 0.3333333
[2,] 0.8333333 0.0000000 0.8333333
[3,] 0.3333333 0.8333333 0.0000000
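For anyone who does need to work through the formula by hand, here is a minimal Python sketch of the definition quoted in the question, applied to the same three rows. It makes two assumptions that are not spelled out in the quoted text: the ordinal levels are ordered by their natural sort order, and the rescaled z values are compared as an interval-scaled variable, that is, absolute difference divided by the observed range. The function names are made up for the example.

```python
def ordinal_to_z(values):
    """Map ordinal values to z = (r - 1) / (max(r) - 1); None means missing."""
    levels = sorted({v for v in values if v is not None})  # assumed level order
    ranks = [None if v is None else levels.index(v) + 1 for v in values]
    max_rank = max(r for r in ranks if r is not None)
    if max_rank == 1:  # a single level: every observed value maps to 0
        return [None if r is None else 0.0 for r in ranks]
    return [None if r is None else (r - 1) / (max_rank - 1) for r in ranks]


def gower_distance(columns, kinds, i, j):
    """d(i,j) = sum_k(delta_ijk * d_ijk) / sum_k(delta_ijk)."""
    numerator, denominator = 0.0, 0
    for values, kind in zip(columns, kinds):
        if values[i] is None or values[j] is None:
            continue  # delta_ijk = 0 when either value is missing
        if kind == "nominal":
            d = 0.0 if values[i] == values[j] else 1.0
        else:  # "ordinal": compare rescaled positions, divided by their range
            z = ordinal_to_z(values)
            observed = [v for v in z if v is not None]
            span = max(observed) - min(observed)
            d = abs(z[i] - z[j]) / span if span else 0.0
        numerator += d
        denominator += 1
    return numerator / denominator if denominator else float("nan")


# Same three rows as DF above: A is ordinal, B and C are nominal.
columns = [[1, 2, 3], ["A", "B", "A"], [1, 2, 1]]
kinds = ["ordinal", "nominal", "nominal"]

d12 = gower_distance(columns, kinds, 0, 1)
d13 = gower_distance(columns, kinds, 0, 2)
d23 = gower_distance(columns, kinds, 1, 2)
print(round(d12, 7), round(d13, 7), round(d23, 7))
```

For these rows the three values agree with the daisy() and gower.dist() output above (0.8333333, 0.3333333, 0.8333333).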
You say you can't do it this way? Can you explain why not? You seem to be going to some degree of effort to do something that other people have already coded up for you. This isn't homework, is it?
I'm not sure what your logic is doing, but you are putting too many "{" in there for your own good. I generally use the {} pairs to surround the consequent-clause:
Delta=function(i,j){
  for (i in 1:28) {
    for (j in 1:47) {
      if (MyHeader[i,j]=="nominal") {
        result=0
      # the "{" in the next line before else was sabotaging your efforts
      } else if (MyHeader[i,j]=="ordinal") { result=1 }
    }
  }
  # return the value only after both loops have finished
  result
}
Thanks Gavin and DWin for your help. I managed to solve the problem and find the right distance matrix. I used daisy() after I recoded the class of the data and it worked.
P.S. The solution that you suggested in my other topic for changing the class of the columns:
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
didn't work. It changed only the first nominal and ordinal column.
Thanks again for your help.