Transpose survey response dataset with OpenRefine (previously Google Refine) - CSV

I’m looking for some help reshaping a survey response dataset, exported as a CSV, using OpenRefine (previously Google Refine).
Some context on the survey:
Collector and respondent IDs are collected in the background - ID1, ID2
Users select tasks from a long list - T{n}
Users can enter a custom task - OT
Users rate the importance of each selected task - R1
Users rate the satisfaction of each selected task - R2
We have a total of 20 tasks at the moment, but this might change.
Current dataset as follows:
ID1 | ID2 | T1 | » | T20 | OT | T1 R1 | » | T20 R1 | OT R1 | T1 R2 | » | T20 R2 | OT R2
123 | 789 |
I’m trying to reshape the dataset to the following format:
ID1 | ID2 | Task | Importance | Satisfaction
Here’s a gist of the original and reshaped datasets.
Also, I’ve tried to articulate how I want to reshape the data in a drawing, which might help.

This can't be done by clicking a single button. You have to perform three "transpose cells across columns into rows" (one for tasks, one for their importance, one for their satisfaction), then three "join multivalued cells", then three "split multivalued cells", and finally use fill down to fill the blanks in the ID columns. A screencast will probably be clearer than my explanations.
You'll find the JSON operations in a comment on your Gist. If your columns have exactly the same names as in the example provided, you can apply them to your project by pasting the JSON into "Undo/Redo -> Apply".

Try the following:
Concatenate all the content for each task using cells['Task1'].value + "|Importance: " + cells['Task Importance 1'].value + "|Satisfaction: " + cells['Task Satisfaction 1'].value. You will need to do that 20 times (once for each group of task columns).
Transpose all columns after Response ID (not included). You can reuse this operation.
Split cells on the pipe character |.
Finish by renaming columns and cleaning up values with value.replace().

Related

mySQL - Reiteratively Count rows that have particular CSV string

2-column MySQL Table:
| id| class |
|---|---------|
| 1 | A,B |
| 2 | B,C,D |
| 3 | C,D,A,G |
| 4 | E,F,G |
| 5 | A,F,G |
| 6 | E,F,G,B |
The requirement is to generate a report which tells, for each individual CSV value in the class column, how many rows contain it.
For example, A is present in 3 rows (with id 1,3,5), and C is present in 2 rows (with id 2,3), and G is in 4 rows (3,4,5,6) so the output report should be
A - 3
B - 3
C - 2
...
...
G - 4
Essentially, column id can be ignored.
The draft that I can think of: first, all the values of the class column need to be picked and split on commas; then create a distinct list of each unique value (A,B,C...); and then count how many rows contain each unique value from that distinct list.
While I know basic SQL queries, this is way too complex for me. I am unable to match it to a CSV split function in MySQL. (I'm new to SQL, so I don't know much.)
An alternative approach that I got working: download the class column values to a file and feed it to a Perl script, which creates a distinct array of A,B,C, then reads the downloaded CSV file again for each element in the distinct array and increments the count, and finally publishes the report. But this is in Perl and would be a separate execution, while the client needs it as an SQL report.
Help will be appreciated.
Thanks
You may try a split-string-into-rows technique to get the individual values, and then use the COUNT function to find the number of occurrences. Specifically check here
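As an illustration, here is a minimal sketch of that approach in plain MySQL, assuming the table is called t with columns id and class, that no row lists more than four values (the numbers subquery must cover the longest list), and that a value never repeats within one row, so COUNT(*) equals the row count:

-- Fan each row out into one row per CSV element, then count rows per element
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.class, ',', n.n), ',', -1) AS class_value,
       COUNT(*) AS row_count
FROM t
JOIN ( SELECT 1 AS n UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 ) n
  ON n.n <= 1 + LENGTH(t.class) - LENGTH(REPLACE(t.class, ',', ''))  -- number of elements in this row
GROUP BY class_value
ORDER BY class_value;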

How to extract relational data from a flat table using SQL?

I have a single flat table containing a list of people, which records their participation in different groups and their activities over time. The table contains the following columns:
- name (first/last)
- e-mail
- secondary e-mail
- group
- event date
+ some other data in a series of columns, relevant to a specific event (meeting, workshop).
I want to extract distinct people from that into a separate table, so that further down the road it could be used for their profiles, giving them a list of what they attended and relevant info. In other words, I would like to have a list of people (profiles) and then link that to a list of groups they are in and then a list of events per group they participated in.
Obviously, the same people appear a number of times:
| Full name | email | secondary email | group | date |
| John Smith | jsmith#someplace.com | | AcOP | 2010-02-12 |
| John Smith | jsmith#gmail.com | jsmith#somplace.com | AcOP | 2010-03-14 |
| John Smith | jsmith#gmail.com | | CbDP | 2010-03-18 |
| John Smith | jsmith#someplace.com | | BDz | 2010-04-02 |
Of course, I would like to roll it into one record for John Smith with both e-mails in the resulting People table. I can't rule out that there might be more records for the same person with e-mails other than those two - I can live with that. To make it more complex, ideally I would like to derive a list of groups, creating a Groups table (possibly with further details on the groups), and then a list of meetings/activities for each group. By linking these I would then have a clean relational model.
Now, the question: is there a way to perform such a transformation of data in SQL? Or do I need to write a procedure (program) that would traverse the database and do it?
The database is in MySQL, though I can also use MS Access (it was given to me in that format).
There is no tool that does this automatically. You will have to write a couple queries (unless you want to write a DTS package or something proprietary). Here's a typical approach:
Write two select statements for the two tables you wish to create-- one for users and one for groups. You may need to use DISTINCT or GROUP BY to ensure you only get one row when the source table contains duplicates.
Run the two select statements and inspect the results for problems. For example, it's possible some users show up with two different email addresses, or some users have the same name and were combined incorrectly. These will need to be cleaned up in order to proceed. There is no great way to do this - it's more or less a manual process requiring expert knowledge of the data.
Write CREATE TABLE scripts based on the two SELECT statements so that you can store the results somewhere.
Use INSERT ... SELECT (or SELECT ... INTO) to populate the tables from your two SELECT statements.
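A minimal sketch of those steps in MySQL, assuming the flat table is called flat with columns full_name, email, and group_name (all of these names are hypothetical):

-- Steps 1/3: a People table, one row per distinct person
CREATE TABLE people (
  person_id INT AUTO_INCREMENT PRIMARY KEY,
  full_name VARCHAR(100),
  email     VARCHAR(100)
);

-- Step 4: GROUP BY collapses the duplicate rows; MIN(email) arbitrarily keeps
-- one address per name, so same-name collisions still need the manual review from step 2
INSERT INTO people (full_name, email)
SELECT full_name, MIN(email)
FROM flat
GROUP BY full_name;

-- Same pattern for Groups ("group" is a reserved word, hence grp)
CREATE TABLE grp (
  group_id   INT AUTO_INCREMENT PRIMARY KEY,
  group_name VARCHAR(50)
);
INSERT INTO grp (group_name)
SELECT DISTINCT group_name FROM flat;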

How to build a matrix that contains counts of matches between tables?

Other than diving in brute force, one query at a time, I'm stumped on a repeatable, efficient way of doing this:
assume I have 4 ticketed events around the country (EventA-2018, EventB-2018, EventC-2018, and EventD-2018)
I now need to present a simple 4x4 table with counts of people who attended X and also attended Y
each event has an associated MySQL table (e.g., event-a-2018-buyers, event-b-2018-buyers, etc.), and each one contains a single column called email representing the buyer.
The resulting table should look something like this:
+------------+-------------+-------------+-------------+-------------+
| | EventA-2018 | EventB-2018 | EventC-2018 | EventD-2018 |
+------------+-------------+-------------+-------------+-------------+
|EventA-2018 | X | a | b | c |
+------------+-------------+-------------+-------------+-------------+
|EventB-2018 | a | X | d | e |
+------------+-------------+-------------+-------------+-------------+
|EventC-2018 | b | d | X | f |
+------------+-------------+-------------+-------------+-------------+
|EventD-2018 | c | e | f | X |
+------------+-------------+-------------+-------------+-------------+
So the top row basically says, "Of the people who bought tickets for EventA-2018, there were a who also bought for EventB-2018, b who also bought for EventC-2018, and c who also bought for EventD-2018".
The diagonal isn't important since it would represent 100% each time.
Out of the 12 remaining cells, I obviously only need to fill in 6 since they are repeated (e.g., a,b,c,d,e,f).
There are actually more than 4 events and each one happens each year, but I'm assuming I can adapt any solutions to expand accordingly.
My current MySQL skills stop just after doing a join on two of the event tables. So I could easily figure out the 6 inner joins I need to run to fill in this matrix and manually build the table, but I'm hoping there is a more programmatic way that will allow me to automate it into a dashboard.
Here is how I would fill in one cell at a time:
SELECT
  COUNT(a.email) AS cell_a
FROM
  ( SELECT DISTINCT email FROM eventa_2018 ) a
INNER JOIN
  ( SELECT DISTINCT email FROM eventb_2018 ) b
  ON a.email = b.email;
SIDE NOTE: A completely different approach I'm considering is to combine all tables into one with only two fields: email and event. Then I could strip out everyone who only attended one event. For the rest, I could create a simpler report showing the counts of people who attended each combination of more than one event (whereas the table above only shows two events at a time). The resulting business case for all of this is to learn where to invest in more cross-promotion of events and to create segments of the most valuable customers.
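That combined two-column table is straightforward to build; here is a hedged sketch using UNION ALL, assuming the table names from the question (normalized to valid identifiers) and calling the result attendance:

CREATE TABLE attendance AS
SELECT DISTINCT email, 'EventA-2018' AS event FROM eventa_2018
UNION ALL
SELECT DISTINCT email, 'EventB-2018' FROM eventb_2018
UNION ALL
SELECT DISTINCT email, 'EventC-2018' FROM eventc_2018
UNION ALL
SELECT DISTINCT email, 'EventD-2018' FROM eventd_2018;
-- DISTINCT inside each branch guards against duplicate buyer rows;
-- UNION ALL keeps the rows from different events separate.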
Not an answer. Too long for a comment.
A normalised schema might look something like this:
| event | year | buyer |
| a | 2018 | joe#amgil.com |
| b | 2018 | kat#plape.com |
Start there. See my comments above, and then get back to us.
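To make that direction concrete, here is a sketch of how the whole matrix falls out of a single query once the data is in one table; it assumes an attendance(email, event) table like the one sketched above:

SELECT x.event AS event_x,
       y.event AS event_y,
       COUNT(DISTINCT x.email) AS bought_both   -- cells a through f in one pass
FROM attendance x
JOIN attendance y
  ON y.email = x.email
 AND y.event > x.event                          -- skips the diagonal and mirrored cells
GROUP BY x.event, y.event;

Adding a fifth event then only means one more UNION ALL branch; this query does not change.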

Crossfilter dimension of joined table without repeating values

I will preface this by saying that I am relatively unfamiliar with both SQL and dc.js. I feel the solution to this is likely pretty simple.
I am currently drawing a group of charts based on a join of two tables that results in a form similar to the following:
Subject | Gender | TestName
------- | ------ | --------
1 | M | Test 1
2 | M | Test 1
1 | M | Test 2
2 | M | Test 2
Essentially, there is a lot of unique data that is repeated by the join because TestName changes per subject. The join is on Subject.
I have one bar chart for Gender, which can be either M or F. I also have a bar chart with a count of each test (TestName) and how many subjects were present in that test. As you can tell from the join, a single subject can be, and often is, a member of more than one test.
My issue is that when I crossfilter this data, the counts for each test are correct (here, 2 each), but my gender information is inflated (4 instead of 2), since what should be a single unique subject is counted every time it appears in this joined data set. I want my charts to display n subjects for Gender, but instead they display n * (number of tests).
Is there a way to chart the correct count per test case but keep my other charts displaying only a maximum count of unique subjects while keeping my crossfilter working?
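In crossfilter itself the usual fix is a custom reduce that counts each subject only once per group. As a language-neutral illustration of the same idea, this SQL sketch (assuming the joined data sits in a table called joined) shows the difference between the inflated and the intended gender counts:

-- Inflated: one count per (Subject, TestName) row, e.g. 4 for M
SELECT Gender, COUNT(*) AS rows_counted
FROM joined
GROUP BY Gender;

-- Intended: each subject counted once, however many tests they appear in
SELECT Gender, COUNT(DISTINCT Subject) AS subjects
FROM joined
GROUP BY Gender;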

Database design and query optimization/general efficiency when joining 6 tables in mySQL

I have 6 tables. These are simplified for this example.
user_items
ID | user_id | item_name | version
-------------------------------------
1 | 123 | test | 1
data
ID | name | version | info
----------------------------
1 | test | 1 | info
data_emails
ID | name | version | email_id
------------------------
1 | test | 1 | 1
2 | test | 1 | 2
emails
ID | email
-------------------
1 | email#address.com
2 | second#email.com
data_ips
ID | name | version | ip_id
----------------------------
1 | test | 1 | 1
2 | test | 1 | 2
ips
ID | ip
--------
1 | 1.2.3.4
2 | 2.3.4.5
What I am looking to achieve is the following.
The user (123) has the item with name 'test'. This is the basic information we need for a given entry.
There is data in our 'data' table, and the current version is 1; as such, the version in our user_items table is also 1. The two tables are linked together by the name and version. The setup is like this because a user could have an item for which we don't have data; likewise, there could be an item for which we have data but which no user owns.
For each item there are also 0 or more emails and IPs associated. These can be the same for many items, so rather than duplicating the actual email varchar over and over, we have the data_emails and data_ips tables, which link to the emails and ips tables respectively via the email_id/ip_id and ID columns.
The emails and IPs are associated with the data version, again through the item name and version number.
My first question is: is this a good, well-optimized database setup?
My next question, and my main one, is about joining this complex data structure.
What I had was:
PHP
- get all the user items
- loop through them and get the most recent data entry (if any)
- if there is one get the respective emails
- get the respective ips
Does that count as 3 queries or essentially infinite depending on the number of user items?
I was made to believe that the above was inefficient and as such I wanted to condense my setup into using one query to get the same data.
I have achieved that with the following code
SELECT user_items.name,GROUP_CONCAT( emails.email SEPARATOR ',' ) as emails, x.ip
FROM user_items
JOIN data AS data ON (data.name = user_items.name AND data.version = user_items.version)
LEFT JOIN data_emails AS data_emails ON (data_emails.name = user_items.name AND data_emails.version = user_items.version)
LEFT JOIN emails AS emails ON (data_emails.email_id = emails.ID)
LEFT JOIN
(SELECT name,version,GROUP_CONCAT( the_ips.ip SEPARATOR ',' ) as ip FROM data_ips
LEFT JOIN ips as the_ips ON data_ips.ip_id = the_ips.ID )
x ON (x.name = data.name AND x.version = user_items.version)
I have done loads of reading and worked tirelessly to get to this point.
This works as I require. This question seeks to clarify: what are the benefits of using this approach instead?
I have had to use a subquery (I believe?) to get the IPs, as previously the results were being multiplied (I believe because of the complex joins). How this subquery works is, I suppose, my main confusion.
Summary of questions.
-Is my database well set up for my usage? Any improvements would be appreciated, and any useful resources to help me expand my knowledge would be great.
-How does the subquery in my SQL actually work? What is the query doing?
-Am I correct to keep using LEFT JOINs? I want to return the user item, with null values to the right where applicable.
-Am I essentially replacing a potentially infinite number of queries with 2? Does this make a REAL difference? Can the above be improved?
-Given that when I update the version of an item in my data table I now also have to update the version in the user_items table, I have a few more update queries to do. Is the tradeoff of this setup worthwhile in practice?
Thanks to anyone who contributes to helping me get a better grasp of this !!
Given your data layout and your objective, the query is correct. If you've only got a small amount of data it shouldn't be a performance problem; that will change quickly as the amount of data grows. However, when you have a large amount of data, there are very few circumstances where you should ever see all of it in one go, which implies that the results will be filtered in some way. Exactly how they are filtered has a huge impact on the structure of the query.
How does the subquery in my sql actually work
Currently it doesn't work properly - there is no GROUP BY
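A minimal fix for that subquery, assuming the intent is one concatenated IP list per (name, version) pair, would be:

(SELECT data_ips.name, data_ips.version,
        GROUP_CONCAT(the_ips.ip SEPARATOR ',') AS ip
 FROM data_ips
 LEFT JOIN ips AS the_ips ON data_ips.ip_id = the_ips.ID
 GROUP BY data_ips.name, data_ips.version   -- one row per data version instead of one row total
) x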
Is the tradeoff off of this setup in practice worthwhile?
No - it implies that your schema is too normalized.