How to perform a many-to-many or (at least) a outer-join in SPSS

How to perform a many-to-many or (at least) a outer-join in SPSS - many-to-many

usually I use [R] for my data analysis, but these days I have to use SPSS. I was expecting that data manipulation might get a little bit more difficult this way, but after my first day I kind of surrender :D and I really would appreciate some help ...
My problem is the following:
I have two data sets, which have an ID number. Neither data sets have a unique ID (in one data set, which should have unique IDs, there is kind of a duplicated row)
In a perfect world I would like to keep this duplicated row and simply perform a many-to-many-join. But I accepted, that I might have to delete this "bad" row (in dataset A) and perform a 1:many-join (join dataset B to dataset A, which contains the unique IDs).
If I run the join (and accept that it seems not to be possible to run a 1:many, but only a many:1-join), I have the problem, that I lose IDs. If I join dataset A to dataset B I lose all cases, that are not part of dataset B. But I really would like to have both IDs like in a full join or something.
Do you know if there is (kind of) a simple solution to my problem?
Example:
dataset A:
ID
VAL1
1
A
1
B
2
D
3
K
4
A
dataset B:
ID
VAL2
1
g
2
k
4
a
5
c
5
d
5
a
2
x
expected result (best solution):
ID
VAL1
VAL2
1
A
g
1
B
g
2
D
k
3
K
NA
4
A
a
2
D
x
expected result (second best solution):
ID
VAL1
VAL2
1
A
g
2
D
k
3
K
NA
4
A
a
5
NA
c
5
NA
d
5
NA
a
2
D
x
what I get (worst solution):
ID
VAL1
VAL2
1
A
g
2
D
k
4
A
a
5
NA
c
5
NA
d
5
NA
a
2
D
x

From your example It looks like what you need is a full many to many join, based on the ID's existing in dataset A. You can get this by creating a full Cartesian-Product of the two dataset, using dataset A as the first\left dataset.
The following syntax assumes you have the STATS CARTPROD extention command installed. If you don't you can see here about installing it.
First I'll recreate your example to demonstrate on:
dataset close all.
data list list/id1 vl1 (2F3) .
begin data
1 232
1 433
2 456
3 246
4 468
end data.
dataset name aaa.
data list list/id2 vl2 (2F3) .
begin data
1 111
2 222
4 333
5 444
5 555
5 666
2 777
3 888
end data.
dataset name bbb.
Now the actual work is fairly simple:
DATASET ACTIVATE aaa.
STATS CARTPROD VAR1=id1 vl1 INPUT2=bbb VAR2=id2 vl2
/SAVE OUTFILE="C:\somepath\yourcartesianproduct.sav".
* The new dataset now contains all possible combinations of rows in the two datasets.
* we will select only the relevant combinations, where the two ID's match.
select if id1=id2.
exe.

Related

Shared foreign keys without duplication of entries?

Sorry for the beginner question.
I have an Outputs table:
ID
value
0
x
1
y
2
z
And an Inputs table that is linked to the Outputs through the outputsID:
ID
outputsID
name
0
0
A
1
1
B
2
1
C
3
2
B
4
2
C
Assuming that multiple outputs have at least one shared input (in this example outputID 1,3 and 2,4 are the same), is there a way to avoid the duplication of entries in my Inputs table (inputID 3 and 4)?

The 'normal' answer to your question is no. Rows 1 and 2 address output 1, and Rows 3 and 4 address output 2. They aren't duplicates and each reflect something distinct.
So if you are a beginner, I would say you shouldn't want to get rid of these rows.
That said, there are some more advanced techniques. For example, you could have the OutputsID column be an array with multiple values. This is harder, more complex, and non-standard.

Combine multi rows in access - 1 field

I have data in multiple rows, and I need to combine data in similar columns and separate with a semi colon, to end up with one row with grouped by ID.
I have
ID Type
1 A
1 B
1 C
2 D
3 A
3 F
I want results to be
1 A;B;C
2 D
3 A;F
I have limited knowledge of access, but know this should be basic and easy. I appreciate assistance.

How to add dynamic range to database (store the ranges in a table)

Table (CostTitle)
Id_ _costTitle_
1 A
2 B
3 C
4 D
5 E
6 F
A Refers numbers between 0-99
B Refers numbers between 100-199
C Refers numbers between 200-299
D Refers numbers between 300-399
E Refers numbers between 400-499
F Refers numbers between 500-599
costCode will be base on costTitle's refers numbers
Table (CostCode)
Id_ _costTitle_ _costCode_ _costProductTitle_
1 A 12 productX
2 B 111 productY
3 B 142 productZ
4 C 201 productK
5 F 511 productL
6 F 582 productM
I am trying to add product and assign dynamically cost code.
Thanks for advance

I suppose you want to store the ranges in a table. So you need a BEFORE INSERT trigger, which sets new.costTitle. Triggers are explained here:
http://dev.mysql.com/doc/refman/5.6/en/create-trigger.html
MariaDB offers an alternative: dynamic columns. However, because of the limits of this feature, you cannot store the ranges in a different table. You will need to hardcode the ranges in the virtual column definition, which doesn't seem to me a great idea (but you decide, of course).
https://mariadb.com/kb/en/mariadb/virtual-computed-columns/

Efficiently joining over interval ranges in SQL

Suppose I have two tables as follows (data taken from this SO post):
Table d1:
x start end
a 1 3
b 5 11
c 19 22
d 30 39
e 7 25
Table d2:
x pos
a 2
a 3
b 3
b 12
c 20
d 52
e 10
The first row in both tables are column headers. I'd like to extract all the rows in d2 where column x matches with d1 and pos1 falls within (including boundary values) d1's start and end columns. That is, I'd like the result:
x pos start end
a 2 1 3
a 3 1 3
c 20 19 22
e 10 7 25
The way I've seen this done so far is:
SELECT * FROM d1 JOIN d2 USING (x) WHERE pos BETWEEN start AND end
But what is not clear to me is if this operation is done as efficient as it can be (i.e., optimised internally). For example, computing the entire join first is not really a scalable approach IMHO (in terms of both speed and memory).
Are there any other efficient query optimisations (ex: using interval trees) or other algorithms that can handle ranges efficiently (again, in terms of both speed and memory) in SQL that I can make use of? It doesn't matter if it's using SQLite, PostgreSQL, mySQL etc..
What is the most efficient way to perform this operation in SQL?
Thank you very much.

Not sure how it all works out internally, but depending on the situation I would advice to play around with a table that 'rolls out' all the values from d1 and then join on that one. This way the query engine can pinpoint the right record 'exactly' instead of having to find a combination of boundaries that match the value being looked for.
e.g.
x value
a 1
a 2
a 3
b 5
b 6
b 7
b 8
b 9
b 10
b 11
c 19 etc..
given an index on the value column (**), this should be quite a bit faster than joining with the BETWEEN start AND end on the original d1 table IMHO.
Off course, each time you make changes to d1, you'll need to adjust the rolled out table too (trigger?). If this happens frequently you'll spend more time updating the rolled out table than you gained in the first place! Additionally, this might take quite a bit of (disk)space quickly if some of the intervals are really big; and also, this assumes we don't need to look for non-whole numbers (e.g. what if we look for the value 3.14 ?)
(You might consider experimenting with a unique one on (value, x) here...)

KDB: apply dyadic function across two lists

Consider a function F[x;y] that generates a table. I also have two lists; xList:[x1;x2;x3] and yList:[y1;y2;y3]. What is the best way to do a simple comma join of F[x1;y1],F[x1;y2],F[x1;y3],F[x2;y1],..., thereby producing one large table?

You have asked for the cross product of your argument lists, so the correct answer is
raze F ./: xList cross yList

Depending on what you are doing, you might want to look into having your function operate on the entire list of x and the entire list of y and return a table, rather than on each pair and then return a list of tables which has to get razed. The performance impact can be considerable, for example see below
q)g:{x?y} //your core operation
q)//this takes each pair of x,y, performs an operation and returns a table for each
q)//which must then be flattened with raze
q)fm:{flip `x`y`res!(x;y; enlist g[x;y])}
q)//this takes all x, y at once and returns one table
q)f:{flip `x`y`res!(x;y;g'[x;y])}
q)//let's set a seed to compare answers
q)\S 1
q)\ts do[10000;rm:raze fm'[x;y]]
76 2400j
q)\S 1
q)\ts do[10000;r:f[x;y]]
22 2176j
q)rm~r
1b

Setup our example
q)f:{([] total:enlist x+y; x:enlist x; y:enlist y)}
q)x:1 2 3
q)y:4 5 6
Demonstrate F[x1;y1]
q)f[1;4]
total x y
---------
5 1 4
q)f[2;5]
total x y
---------
7 2 5
Use the multi-valent apply operator together with each' to apply to each pair of arguments.
q)raze .'[f;flip (x;y)]
total x y
---------
5 1 4
7 2 5
9 3 6

Another way to achieve it using each-both :
x: 1 2 3
y: 4 5 6
f:{x+y}
f2:{ a:flip x cross y ; f'[a 0;a 1] }
f2[x;y]
5j, 6j, 7j, 6j, 7j, 8j, 7j, 8j, 9j

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

How to perform a many-to-many or (at least) a outer-join in SPSS - many-to-many

Related

Shared foreign keys without duplication of entries?

Combine multi rows in access - 1 field

How to add dynamic range to database (store the ranges in a table)

Efficiently joining over interval ranges in SQL

KDB: apply dyadic function across two lists

Categories

Resources