Unique combinations of variables in Stata

I need help writing Stata code that produces unique combinations of variables. I have 7 variables and need code that gives me every unique combination of them, so that each row is a unique combination of all 7 variables.
An example:
V1: A, B, C
V2: 1, 2, 3
A1 A2 A3, B1 B2 B3, C1 C2 C3
Unique combination of all variables - total 9 combinations.
I have 15000 observations. I have code for this in R, but R fails with a memory error on a dataset this large, so I want to do it in Stata.

It is not especially clear what you want created or done. There is no code here, not even R code showing how what you want is done in R. There is no reproducible example.
You might want to check out egen, group(). (A previous answer to this effect from @Dimitriy V. Masterov, an experienced user of Stata, was twice incorrectly deleted as spam, presumably by people who don't know Stata.)
Alternatively, try installing groups from SSC.
UPDATE: The answer sounds more like fillin. For "unique" read "distinct".

Bit of a late response, but I just stumbled across this today. If I understand the question, something like this should do the trick, although I'm not sure it's easily applied to more complex data, or whether it is even the best way.
* Create Sample Data
clear
set obs 3
gen str var1 = "a" in 1
replace var1="b" in 2
replace var1="c" in 3
gen var2= _n
* Find number of Unique Groupings to set obs
by var1 var2, sort: gen groups=_n==1
keep if groups==1
drop groups
di _N^2
set obs 9
* Create New Variable
forvalues i = 4(3)9 {
forvalues j = 5(3)9 {
forvalues k = 6(3)9 {
replace var1="a" if _n==`i'
replace var1="b" if _n==`j'
replace var1="c" if _n==`k'
}
}
}
sort var1
egen i=seq(), f(1) t(3)
tostring i, replace
gen NewVar=var1+i
list NewVar
+--------+
| NewVar |
|--------|
1. | a1 |
2. | a2 |
3. | a3 |
4. | b1 |
5. | b2 |
|--------|
6. | b3 |
7. | c1 |
8. | c2 |
9. | c3 |
+--------+
Unfortunately, as far as I know there is no easy way to do this; it will require a fair amount of code. That said, I saw another answer or comment that mentioned cross, which could be very useful here. Another command worth checking out is joinby. Even with either of these, though, you will have to split your data into 7 separate datasets, one for each variable you want to 'cross combine'.
Anyway, good luck if you haven't yet found your solution.

If you just want the combinations of those 7 variables that occur in your data, you can do it like this:
keep v1 v2 v3 v4 v5 v6 v7
duplicates drop
list
You will then get the list of unique combinations of those 7 variables. You can save the result under a different name from the original dataset. Make sure you do not save over the original dataset, otherwise you will lose your original data.
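The answers above point at two different readings of the question: the distinct combinations that actually occur in the data (duplicates drop), and every possible combination of the distinct values (what fillin or cross would build). A minimal Python sketch of the difference, assuming a made-up two-variable dataset instead of the seven real variables:

```python
from itertools import product

# Hypothetical sample rows standing in for the dataset (two variables, not seven).
rows = [("A", 1), ("A", 1), ("B", 2), ("C", 3), ("A", 2)]

# Distinct combinations that actually occur (what `duplicates drop` leaves behind).
observed = sorted(set(rows))

# Every possible combination of the distinct values of each variable
# (the full cross that `fillin` or `cross` would build).
levels = [sorted(set(column)) for column in zip(*rows)]
all_combinations = list(product(*levels))

print(len(observed), "observed combinations")
print(len(all_combinations), "possible combinations")
```

With 7 variables the full cross grows multiplicatively with the number of distinct values per variable, which may be why the R attempt ran out of memory.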

What's the best way to retrieve array data from MySQL

I'm storing an object / data structure like this inside a MySQL (actually a MariaDB) database:
{
idx: 7,
a: "content A",
b: "content B",
c: ["entry c1", "entry c2", "entry c3"]
}
And to store it I'm using 2 tables, very similar to the method described in this answer: https://stackoverflow.com/a/17371729/3958875
i.e.
Table 1:
+-----+---+---+
| idx | a | b |
+-----+---+---+
Table 2:
+------------+-------+
| owning_obj | entry |
+------------+-------+
And then made a view that joins them together, so I get this:
+-----+------------+------------+-----------+
| idx | a | b | c |
+-----+------------+------------+-----------+
| 7 | content A1 | content B1 | entry c11 |
| 7 | content A1 | content B1 | entry c21 |
| 7 | content A1 | content B1 | entry c31 |
| 8 | content A2 | content B2 | entry c12 |
| 8 | content A2 | content B2 | entry c22 |
| 8 | content A2 | content B2 | entry c32 |
+-----+------------+------------+-----------+
My question is what is the best way I can get it back to my object form? (e.g. I want an array of the object type specified above of all entries with idx between 5 and 20)
There are 2 ways I can think of, but both seem to be not very efficient.
First, we can send this whole table back to the server, which builds a hashmap keyed on the primary key (or some other unique index), collects the different c values, and rebuilds the objects that way. That means sending a lot of duplicate data and spending more memory and processing time on the server to rebuild. This method also won't scale pleasantly if we have multiple arrays, or arrays within arrays.
The second method is to run multiple queries: filter Table 1 to get the list of idx values you want, then for each idx query Table 2 where owning_obj = current idx. This means sending a lot more queries.
Neither of these options seems very good, so I'm wondering if there is a better way. Currently I'm thinking it can be something that utilizes JSON_OBJECT(), but I'm not sure how.
This seems like a common situation, but I can't seem to find the exact wording to search for to get the answer.
PS: The server interfacing with MySQL/MariaDB is written in Rust, though I don't think that's relevant to this question.

You can use GROUP_CONCAT to combine all the c values into a comma-separated string.
SELECT t1.idx, t1.a, t1.b, GROUP_CONCAT(entry) AS c
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t1.idx = t2.owning_obj
GROUP BY t1.idx
Then explode the string in PHP:
$result_array = [];
while ($row = $result->fetch_assoc()) {
$row['c'] = explode(',', $row['c']);
$result_array[] = $row;
}
However, if the entries can be long, make sure you increase group_concat_max_len.
If you're using MySQL 8.0 you can also use JSON_ARRAYAGG(). This creates a JSON array of the entry values, which you can convert to a PHP array using json_decode(). This is a little safer, since GROUP_CONCAT() will break if any of the values contain a comma. You can change the separator, but you need one that will never appear in any value. Unfortunately, JSON_ARRAYAGG() isn't available in MariaDB before version 10.5.
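For comparison, the first approach from the question (rebuilding the objects from the flat joined rows with a hashmap) is fairly short in practice. A rough Python sketch, assuming the rows arrive as (idx, a, b, c) tuples like the view output; the sample rows and the rebuild name are made up for illustration:

```python
# Hypothetical rows as returned by the joined view: (idx, a, b, c).
rows = [
    (7, "content A1", "content B1", "entry c11"),
    (7, "content A1", "content B1", "entry c21"),
    (8, "content A2", "content B2", "entry c12"),
    (9, "content A3", "content B3", None),  # LEFT JOIN row with no entries
]

def rebuild(rows):
    """Group flat joined rows into one object per idx."""
    objects = {}
    for idx, a, b, entry in rows:
        # Create the object the first time an idx is seen, then reuse it.
        obj = objects.setdefault(idx, {"idx": idx, "a": a, "b": b, "c": []})
        if entry is not None:
            obj["c"].append(entry)
    return list(objects.values())

result = rebuild(rows)
for obj in result:
    print(obj)
```

This still transfers the duplicated a and b values, which is the cost the aggregate functions above avoid.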

Transpose survey response dataset with Open Refine (previously Google Refine)

I’m looking for some help to reshape a survey response dataset, exported as a csv, using Open Refine (previously Google Refine).
Some context on the survey
Collector and responder ID are collected in the background - ID1 ID2
Users select tasks from a long list - T{n}
Users enter a custom task - OT
Users rate the importance of each selected task - R1
Users rate their satisfaction with each selected task - R2
We have a total of 20 tasks at the moment, but this might change.
Current dataset as follows:
ID1 | ID2 | T1 | » | T20 | OT | T1 R1 | » | T20 R1 | OT R1 | T1 R2 | » | T20 R2 | OT R2
123 | 789 |
I’m trying to reshape the dataset to the following format:
ID1 | ID2 | Task | Importance | Satisfaction
Here’s a gist of original and reshaped data sets
Also, I’ve tried to articulate how I want to reshape the data in a drawing, which might help

This can't be done by clicking a single button. You have to perform three "transpose cells across columns into rows" (one for tasks, one for their importance, one for their satisfaction), then three "join multivalued cells", then three "split multivalued cells", and finally use fill down to fill the blanks in the ID columns. A screencast will probably be clearer than my explanations.
You'll find the JSON operations in a comment on your Gist. If your columns have exactly the same names as in the example provided, you can apply them to your project by copying and pasting the JSON into "Undo/Redo -> Apply".

Try the following:
Concatenate all the content for each task using cells['Task1'].value+"|Importance: "+cells['Task Importance 1'].value+"|Satisfaction:"+cells['Task Satisfaction 1'].value. You will need to do that 20 times (once for each group of task columns).
Transpose all columns after Response ID (not included). You can reuse this operation.
Split the cells on the pipe character |.
Finish by renaming columns and cleaning up values with value.replace().
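To make the target shape concrete, here is the same wide-to-long reshape as a small Python sketch outside OpenRefine. It assumes the column naming from the question (T1, "T1 R1", "T1 R2", OT and so on) and that an unselected task is an empty cell; the reshape function and the sample values are made up for illustration.

```python
def reshape(record, task_columns):
    """Turn one wide survey row into one long row per selected task."""
    long_rows = []
    for col in task_columns:
        task = record.get(col, "")
        if not task:  # task not selected by this respondent
            continue
        long_rows.append({
            "ID1": record["ID1"],
            "ID2": record["ID2"],
            "Task": task,
            "Importance": record.get(f"{col} R1", ""),
            "Satisfaction": record.get(f"{col} R2", ""),
        })
    return long_rows

# Hypothetical wide row with two predefined tasks plus the custom task (OT).
wide = {
    "ID1": "123", "ID2": "789",
    "T1": "Pay a bill", "T2": "", "OT": "Book a visit",
    "T1 R1": "5", "T2 R1": "", "OT R1": "3",
    "T1 R2": "4", "T2 R2": "", "OT R2": "2",
}
long_rows = reshape(wide, ["T1", "T2", "OT"])
for row in long_rows:
    print(row)
```

Because the task columns are passed in as a list, the sketch does not depend on there being exactly 20 tasks.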

Sort Array results based on variable Algorithm - MYSQL & PHP

I have an array nested within a PHP while loop that outputs a set of forum posts a number of times. I want to sort the array results based on an algorithm - however I do not want to hardcode the algorithm so I can test different variables at a later date. NB - I'm not looking to sort the items within the array, but rather the final output which when looped will output the array 20+ times.
Currently I have 2 Tables - the Forum table with loads of rows (3000 +):
id | name | date_add | votes | ... |
1 | Test Name | 1234567890 | 2 | ... |
... | ... | ... | ... | ... |
The other table contains the Algorithm variables that I want to pass through to the calculation and has only 1 row:
id | vote_reduction | time_variable | gravity |
1 | 1 | 2 | 1.8 |
The specific algorithm I'm using sorts the information based on how long it has been live (in hours) and how many votes it has; the gravity factor makes it more sensitive to time. In full:
(votes - vote_reduction)/((Hours Live + time_variable) ^ gravity)
So far I've managed to get this far, and something is going wrong but I can't quite figure it out:
SELECT forum.*,
((forum.votes - algorithm.vote_reduction)/POW(((TIMESTAMPDIFF(HOUR, SYSDATE(), forum.date_add)) + algorithm.time_variable),algorithm.gravity)) AS algorithm.al,
forum.name, forum.id
FROM forum as forum
LEFT JOIN algorithm AS algorithm ON (algorithm.id='1')
ORDER BY algorithm.al
Any ideas?

I haven't tested the results of the algorithm, but the query returns a result for al if you just remove algorithm. from algorithm.al. I don't think you can make a column alias that acts like it's part of a table. What confuses me is that you say it's running on your machine: it doesn't run on SQL Fiddle and throws an error.
SELECT forum.*,
((forum.votes - algorithm.vote_reduction)/POW(((TIMESTAMPDIFF(HOUR, SYSDATE(), forum.date_add)) + algorithm.time_variable),algorithm.gravity)) AS al
FROM forum AS forum
LEFT JOIN algorithm AS algorithm ON (algorithm.id='1')
ORDER BY al
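Link to SQL fiddle

There are a few errors in the code, as follows:
Making an alias with the name "algorithm" clashes with a MySQL clause also called ALGORITHM.
The calculation (at least the way it is edited above) creates too many values in the POW clause.
Quoting aliases needs care: single quotes are only tolerated where a column alias is declared, and a single-quoted name in ORDER BY or inside an expression is treated as a string literal, so leave the aliases unquoted (or use backticks).
SYSDATE() and forum.date_add are in different formats, the latter being a Unix timestamp, so convert it with FROM_UNIXTIME() first. TIMESTAMPDIFF() also returns the second date minus the first, so date_add has to come first to get a positive number of hours.
To fix:
-- A column alias cannot be reused inside the same SELECT list, so the hours-live expression is repeated
SELECT forum.*,
TIMESTAMPDIFF(HOUR, FROM_UNIXTIME(forum.date_add), NOW()) AS timedif,
((forum.votes - alg.vote_reduction) / POW(TIMESTAMPDIFF(HOUR, FROM_UNIXTIME(forum.date_add), NOW()) + alg.time_variable, alg.gravity)) AS al
FROM forum AS forum
LEFT JOIN algorithm AS alg ON (alg.id = 1)
ORDER BY al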
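One way to sanity-check what the query should return is to compute the formula from the question by hand for a few posts. A small Python sketch, assuming the parameter values from the algorithm table row (1, 2, 1.8); the sample posts and the highest-score-first ordering are assumptions for illustration:

```python
def score(votes, hours_live, vote_reduction=1, time_variable=2, gravity=1.8):
    """(votes - vote_reduction) / ((hours live + time_variable) ^ gravity)"""
    return (votes - vote_reduction) / (hours_live + time_variable) ** gravity

# Hypothetical posts; hours_live stands in for the TIMESTAMPDIFF result.
posts = [
    {"id": 1, "votes": 2, "hours_live": 0},
    {"id": 2, "votes": 10, "hours_live": 48},
    {"id": 3, "votes": 5, "hours_live": 3},
]

# Highest score first (an assumption; the query above sorts ascending).
ranked = sorted(posts, key=lambda p: score(p["votes"], p["hours_live"]), reverse=True)
for post in ranked:
    print(post["id"], round(score(post["votes"], post["hours_live"]), 4))
```

Keeping the three parameters as arguments mirrors the idea of storing them in a table so they can be tuned without touching the calculation.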

Postgresql 9.2 trigger to separate subfields in a stored string

I have a PostgreSQL 9.2 DB which automatically collects data from various machines.
The DB stores all the data, including the machine id, the firmware, the manufacturer id etc., as well as the actual result data. In one stored field (varchar) there are 5 sub-fields which are separated by the ^ character.
ACT18!!!8246-EN-2.00013151!1^7.00^F5260046959^H1P1O1R1C1Q1L1^1 (Machine 1)
The order of this data seems to vary from one machine to another (e.g. machines 1, 2 and 3). The string above shows the firmware version, in this case "7.00", in sub-field 2. However, another machine sends it in a different sub-field, in this case sub-field 3, where the value is "1":
BACT/ALERT^A.00^1^^ (Machine 2)
I want to store the values "7.00" and "1" in a different field in a separate table using a CREATE TRIGGER t_machine_id AFTER INSERT trigger, where I can choose which sub-field is used depending on the machine the data has come from.
Is split_part the best function for this? Can anyone supply example code that will do this? I can't find anything in the documentation.

You need to (a) split the data using something like regexp_split_to_table then (b) match which parts are which using some criteria, since you don't have field position-order to rely on. Right now I don't see any reliable rule to decide what's the firmware version and what's the machine number; you can't really say where field <> machine_number because if machine 1 had firmware version 1 you'd get no results.
Given dummy data:
CREATE TABLE machine_info(data text, machine_no integer);
INSERT INTO machine_info(data,machine_no) (VALUES
('ACT18!!!8246-EN-2.00013151!1^7.00^F5260046959^H1P1O1R1C1Q1L1^1',1),
('BACT/ALERT^A.00^1^^',2)
);
Something like:
SELECT machine_no, regexp_split_to_table(data,'\^')
FROM machine_info;
will give you a table of split data elements with machine number, but then you need to decide which fields are which:
machine_no | regexp_split_to_table
------------+------------------------------
1 | ACT18!!!8246-EN-2.00013151!1
1 | 7.00
1 | F5260046959
1 | H1P1O1R1C1Q1L1
1 | 1
2 | BACT/ALERT
2 | A.00
2 | 1
2 |
2 |
(10 rows)
You may find the output of substituting regexp_split_to_array more useful, depending on whether you can get any useful info from field order and how you intend to process the data.
regress=# SELECT machine_no, regexp_split_to_array(data,'\^')
FROM machine_info;
machine_no | regexp_split_to_array
------------+------------------------------------------------------------------
1 | {ACT18!!!8246-EN-2.00013151!1,7.00,F5260046959,H1P1O1R1C1Q1L1,1}
2 | {BACT/ALERT,A.00,1,"",""}
(2 rows)
Say there are two firmware versions; version 1 sends code^blah^fwvers^^ and version 2 and higher sends code^fwvers^blah^blah2^machineno. You can then differentiate between the two because you know that version 1 leaves the last two fields blank:
SELECT
machine_no,
CASE WHEN info_arr[4:5] = ARRAY['',''] THEN info_arr[3] ELSE info_arr[2] END AS fw_vers
FROM (
SELECT machine_no, regexp_split_to_array(data,'\^')
FROM machine_info
) string_parts(machine_no, info_arr);
results:
machine_no | fw_vers
------------+---------
1 | 7.00
2 | 1
(2 rows)
Of course, you've only provided two sample rows, so the real matching rules are likely to be more complex. Consider writing an SQL function that takes the array and extracts and returns the desired field(s).
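The matching rule from the example above can be prototyped outside the database before committing it to a trigger. A minimal Python sketch, assuming the same made-up rule (blank sub-fields 4 and 5 mean the version is in sub-field 3, otherwise sub-field 2); the firmware_version name is hypothetical:

```python
def firmware_version(data):
    """Pick the firmware sub-field from a ^-separated string."""
    fields = data.split("^")
    # Rule from the example above: if sub-fields 4 and 5 are blank,
    # the firmware version is in sub-field 3; otherwise it is in sub-field 2.
    if fields[3:5] == ["", ""]:
        return fields[2]
    return fields[1]

samples = {
    1: "ACT18!!!8246-EN-2.00013151!1^7.00^F5260046959^H1P1O1R1C1Q1L1^1",
    2: "BACT/ALERT^A.00^1^^",
}
versions = {machine: firmware_version(data) for machine, data in samples.items()}
print(versions)
```

Running the real data through a prototype like this is a cheap way to find the machines whose strings fit neither pattern.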

MySQL/Oracle stored math formula

Is there any way to apply a math formula from a stored string in Oracle and/or MySQL?
col1 | col2 | formula
---------------------
2 | 2 | col1*col2
2 | 3 | col1+col2
SELECT * from tbl
result:
col1 | col2 | formula
---------------------
2 | 2 | 4
2 | 3 | 5
Edit: each row has a different formula.

I think what you're saying is you want to have the database parse the formula string. For example, for Oracle you could
Add a column to the table to contain the result
Run an update statement which would call a PL/SQL function with the values of the columns in the table and the text of the formula
update {table} set formula_result = fn_calc_result (col1, col2, formula_column);
The PL/SQL function would build a string by replacing "col1", "col2" and so forth with the actual values of those columns. You can do that with regular expressions, as long as the formulas are written consistently.
Then use
execute immediate 'select '||{formula}||' from dual' into v_return;
return v_return;
to calculate the result and return it.
Of course, you could also write your own parser. If you decide to go that way, don't forget to handle operation precedence, parentheses, and so forth.
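If the evaluation can happen in application code instead of the database, a restricted evaluator is one option. The sketch below is in Python rather than PL/SQL and borrows Python's own expression parser instead of a hand-written one, so precedence and parentheses come for free; the evaluate function and the set of allowed operators are assumptions for illustration.

```python
import ast
import operator

# Only these binary operators are allowed in a formula.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(formula, values):
    """Evaluate an arithmetic formula string against a dict of column values."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Name) and node.id in values:
            return values[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        # Anything else (function calls, attribute access, ...) is rejected.
        raise ValueError(f"unsupported expression: {ast.dump(node)}")
    return walk(ast.parse(formula, mode="eval"))

rows = [
    {"col1": 2, "col2": 2, "formula": "col1*col2"},
    {"col1": 2, "col2": 3, "formula": "col1+col2"},
]
results = [evaluate(r["formula"], {"col1": r["col1"], "col2": r["col2"]}) for r in rows]
print(results)
```

Unlike building a SQL string for execute immediate, this walks the parsed expression and rejects anything that is not plain arithmetic on known columns.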
I think you want a virtual column. See here for an excellent article on its setup and use.

You can do it with a PL/SQL routine that is triggered automatically when the data is inserted.
See http://en.wikipedia.org/wiki/PL/SQL
PL/SQL is a procedural language that runs inside the Oracle database itself. It's quite easy to do.