Reshape the dataset into a more relational format (transpose SOME rows and assign them to a data subset) - json

I have a spreadsheet/csv:
Code:,101,Course Description:,"Introduction to Rocket Science",
Student Name,Lecture Hours,Labs Hours,Test Score,Status
John Galt,48,120,4.7,Passed
James Taggart,50,120,4.9,Passed
...
I need to reshape it to the following view:
Code:,Course Description:,Students,Lecture Hours,Labs Hours,Average Test Score,Teaching Staff
101,"Introduction to Rocket Science",John Galt,48,120,4.7,Passed
101,"Introduction to Rocket Science",James Taggart,50,120,4.9,Passed
...
Believe it or not, I can't come up with the right way to do this, even though it seems to be a very primitive transformation. Is there any silver bullet for this?
The original records (csv) have a somewhat json-like structure, so my first approach was to represent the original data as a vector and then transpose it, but in that case my resulting table looks like a sparse matrix: the rows I have transposed are blank in the rest of their values.
Another way I'm considering is to **serialize it into jsons and then de-serialize** it into a new spreadsheet (jsonize()) - but here I'm having problems with merging them properly.
Either way I only have it "half-working".
Can anyone suggest a simple and reliable algorithm for this?
Any language, regex, tools, or code snippets are much appreciated.

Assuming that the pattern you've described here is consistent throughout, there are quite a few different approaches you could take, but in all cases you can basically use the fact that the 'Course' rows start with "Code:", which is never going to be a student name.
You can take advantage of this either by a regular expression find/replace, or within OpenRefine.
Example:
1. Open the file in a text editor that supports regular expressions in find/replace.
2. Search for lines starting with 'Code:' and add additional commas to the start of the row to shift the course data columns to the right, e.g. search for: ^Code: and replace with: ,,,,,Code: (the ^ anchor belongs only in the search pattern, not the replacement).
3. If you now import the file into OpenRefine, you'll have a project with 10 columns (the 10th column is caused by the trailing comma at the end of the course data row).
4. You can now use Transpose (or just rename) on the right-most columns, which contain the course data, while leaving the left-most columns, which contain the student details.
5. Isolate the rows that contain the phrase 'Student Name' in the first column and remove them (via a filter or facet).
6. Move the Course Code/Description columns to the beginning of the project, and use the 'Edit Cells->Fill Down' option on each column to get the values repeated on all the relevant lines.
7. Finally, rename the columns as you want and remove any extraneous columns. (A scripted version of the same idea is sketched below.)
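If you'd rather script it than use OpenRefine, the same fill-down idea is a few lines of Python. A minimal sketch, assuming the input file is named courses.csv and the output reshaped.csv (both names are illustrative), and that course rows always have "Code:" in the first field:
import csv
rows_out = []
course_code = course_desc = None
with open("courses.csv", newline="") as f:
    for row in csv.reader(f):
        if not row or not any(row):
            continue  # skip blank lines
        if row[0] == "Code:":
            # Course row: remember code and description for the student rows below
            course_code, course_desc = row[1], row[3]
        elif row[0] == "Student Name":
            # Per-course column header row: drop it
            continue
        else:
            # Student row: prefix it with the current course's code/description
            rows_out.append([course_code, course_desc] + row)
with open("reshaped.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Code", "Course Description", "Student Name",
                     "Lecture Hours", "Labs Hours", "Test Score", "Status"])
    writer.writerows(rows_out)
The same one-pass idea works in any language: remember the current course while streaming the rows, skip the per-course header rows, and prefix every student row.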

Related

AWS Glue: crawler does not identify metadata when CSV contains string and timestamp/date values

I have come across an issue when a CSV file is the input to a crawler: the crawler doesn't identify the column headers when all the data in the CSV is in string format.
#P1 Headers are displayed as col0, col1, ..., colN.
#P2 The actual column names are treated as data.
#P3 Metadata is wrong (i.e. a column's datatype is shown as string even when the CSV dataset contains date/timestamp values).
If we use a custom CSV classifier, we have to specify the column headers manually.
#P2 is then covered, i.e. the column names are no longer read as data; however,
#P1 remains the same: column headers are still displayed as col0, col1, ..., colN.
There are three things I want to fix to achieve the expected result:
A CSV containing only strings should show the actual column names instead of col0, col1, ..., colN.
The metadata of the generated table should be correct (i.e. date/timestamp, string) once it is crawled by the crawler.
If a custom classifier is used, we have to enter the column header names manually in the classifier, and the result is still not satisfactory. I need a generic solution instead of manual intervention.
I have gone through this document: here
If anyone has already implemented a solution, please help.
I found a solution to one of the points above: the headers, i.e. the first line of the CSV, are displayed by enabling 'Has heading' in the CSV classifier.
However, solutions for the following are still to be figured out:
Metadata of the CSV file is shown as string even if a column contains timestamp/date values; the crawler reads these datatypes as string.
The custom classifier needs manual intervention: I have had to mention all the column names in the classifier. Is there a generic solution?
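For reference, the 'Has heading' option can also be set programmatically. A minimal boto3 sketch, where the classifier name is illustrative and ContainsHeader='PRESENT' is the API equivalent of the console's 'Has heading' option:
import boto3
glue = boto3.client("glue")
# Declare that the first line of matching CSVs is a header row,
# without listing the column names manually ("csv-with-headers" is illustrative)
glue.create_classifier(
    CsvClassifier={
        "Name": "csv-with-headers",
        "Delimiter": ",",
        "ContainsHeader": "PRESENT",
    }
)
The classifier then has to be referenced in the crawler's list of custom classifiers for it to take effect.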
If you are writing the dataframe with pandas' to_csv, then to avoid getting column names as col0, col1 and so on, add the parameter index_label='index', such as:
df.to_csv('output.csv', index_label='index')
(where 'output.csv' is your target file).

Regular expression to pick a row in an html table containing desired text

Sorry, but uhrm, I'd like to use regexp (actually I'd use something else, but I want to do the task within a Matlab function) to pick a single row containing desired keywords within an html table.
I am using Matlab calling function regexpi (case-insensitive version of regexp), which is akin to PHP regex from what I can tell.
Ok, here's a snippet from such an html table to parse:
<tr><td>blu</td><td>value</td></tr><tr><td>findme</td><td>value</td></tr><tr><td>ble</td><td>value</td></tr>
The desired row to pick contains the word "findme".
(added:) The content of other cells and tags in the table could be anything ("value" here is just a dummy) - the important part is the presence of "findme" and that a single line (not more) is caught (or all lines containing "findme", though more than one match is not expected). Any paired name/value table on a Wikipedia page is a good example.
I tinkered with https://regex101.com/ using whatever I could dig up in the Matlab documentation (forward/backward looking, combinations of :, > and ?), but have failed to identify a pattern that will pick just the right row (or all those that contain the keyword "findme"). The following pattern, for instance, will pick the text but not the entire row: <tr[^>]*>[^>]*.*?(findme).*?<\/td
Pattern <tr[^>]*>(.*?findme.*?)<\/tr[^>]*> picks the row but is too greedy and picks preceding rows.
Note that the original task I had set out was to capture entire tables and then parse these, but the Matlab regexp-powered function I found for the task had trouble with nested tables (or I had trouble implementing it for the task).
The question is how to return a row containing desired keywords from an html table, programmatically, within a Matlab function (without calling an external program)? The bonus question is how to solve the nested-table issue, but maybe that's another question.
I suggest you split up the string with strsplit and use contains for the filtering, which is a lot more readable and maintainable than a regex pattern:
htmlString = ['<tr><td>blu</td><td>value</td></tr><tr><td><a ',...
    'href="bla">findme</a></td><td>value</td></tr><tr><td><a ',...
    'href="ble">ble</a></td><td>value</td></tr>'];
keyword = 'findme';
% Split on the row-opening tag, then keep the piece containing the keyword
splitStrings = strsplit(htmlString,'<tr>');
desiredRow = ['<tr>' splitStrings{contains(splitStrings,keyword)}]
The output is:
<tr><td><a href="bla">findme</a></td><td>value</td></tr>
Alternatively you may also combine extractBetween and contains:
allRows = extractBetween(htmlString,'<tr>','</tr>');
desiredRow = ['<tr>' allRows{contains(allRows,keyword)} '</tr>']
If you must use regex:
regexp(htmlString,['<tr><td>[^>]+>' keyword '.*?<\/tr>'],'match')
Try this
%<td>(.*?)%sg
https://regex101.com/r/0Xq0mO/1

How to copy variable values within an SPSS file?

I have three separate SPSS files with information about roughly 7500 hemicolectomy patients. One file contains the information about the hemicolectomies, the second one about other surgeries the patients have had during their lifetime, and the last one contains information about their sick leaves during their lifetime.
I have merged the files into a single SPSS document (idnumber is the common variable), but I ran into a problem filtering out the surgeries and sick leaves that have nothing to do with the hemicolectomy. I'm quite new to SPSS, so the simplest way I could think of doing this is to somehow copy the hemicolectomy info to every case and then just use the date/time calculator to choose which sick leaves and surgeries to discard. Switching to wide format is impractical due to the large number of unrelated surgeries and sick leaves: I'd have thousands of variables.
So basically I'd like to do the following:
IF idnumber = idnumber THEN variable1=variable1 AND variable2=variable2 etc
How would I go about doing this?
All help will be appreciated!
The IF command can only be used with one transformation:
IF [condition] [transformation].
Assuming both of your files are sorted by idnumber:
UPDATE FILE=[master_file_reference]
  /FILE=[secondary_file_reference]
  /BY idnumber.
EXECUTE.
The file references can be made either by dataset name or by full path.
More on the UPDATE command:
https://www.ibm.com/support/knowledgecenter/en/SSLVMB_24.0.0/spss/base/syn_update_examples.html
I can't comment yet, so I'm sorry if I misunderstand the problem. I would've asked for clarification in the comments to the question... here goes...
So you have three sources of data: dates (?) of hemicolectomies, one for each case; dates (?) of other surgeries, multiple for each case; and sick leaves, even more for each case. Is that right?
I'd try solving the problem before matching all three files, by matching the file that contains one observation per patient (presumably the hemicolectomies) to the one with the second-most observations per patient (presumably the other surgeries), using the /TABLE keyword:
MATCH FILES /FILE='surgeries.sav' /TABLE='hemicolectomies.sav'
  /BY idnumber.
EXECUTE.
This will "fill up" the blank cells for each patient with the hemicolectomy data.
Now use the date/time variables to check which surgeries "belong" to the hemicolectomies, reduce your data accordingly, and match it to the sick-leave data using the /TABLE keyword again.
That seems like the easiest solution to me.

Best way to parse a big and intricate JSON file with OpenRefine (or R)

I know how to parse json cells in OpenRefine, but this one is too tricky for me.
I've used an API to extract the calendars of 4730 Airbnb rooms, identified by their IDs.
Here is an example of one Json file: https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day of the year from now until November 2017, I would like to extract the availability of the room (true or false) and its price on that day.
I can't figure out how to parse out this information. I guess it implies a series of nested forEach loops, but I can't find the right way to do this with OpenRefine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionaries, which confuses me.
Any help would be appreciated. If the operation is too difficult in OpenRefine, a solution in R (or Python) would also be fine for me.
Rather than just creating your project as text and working with GREL to parse it out...
The best way is to select the JSON record part that you want to work with using the visual importer wizard for JSON and XML files (you can even use a URL pointing to a JSON file, as in your example). (A video tutorial shows how here: https://www.youtube.com/watch?v=vUxdB-nl0Bw )
Select the JSON part that contains the records you want to parse and work with (this can be any repeating part; just select one of them and OpenRefine will extract all the rest).
Limit the number of data rows to load during creation, or leave the default of all rows.
Click Create Project and you're now in Rows mode. However, if you think that Records mode might be better suited to the content, just import the project again as JSON and select the next outside area of the content, perhaps a larger array that contains a key field, etc. In this example the key field would probably be the date, which is why you would highlight the whole record for a given date. This way OpenRefine will have keys for each record, and Records mode lets you work with them better than Rows mode does.
Feel free to take this example and make it better and even more helpful for all; add it to our Wiki section on How to Use.
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OpenRefine (OR) array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of the days in that month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OR can't store OR arrays directly in a cell - it has to be a string.
Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OR
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days
Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from the example URL above, this gives me 441 rows for the single ID, each containing the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty easy JSON in each cell - so you can extract availability using
value.parseJson().available
etc.
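Since the question says a Python solution would also be fine, here is a minimal Python sketch of the same extraction. The calendar_months, days and available keys come from the GREL expressions above; the date and price field names are assumptions to check against the actual payload:
import json
import urllib.request
# Example URL from the question
url = ("https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20"
       "&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016"
       "&count=12&_format=with_conditions")
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
# Same nesting as the GREL forEach: months, then days within each month
for month in data["calendar_months"]:
    for day in month["days"]:
        # "date" and "price" are assumed field names; "available" is the
        # flag used with value.parseJson().available above
        print(day.get("date"), day.get("available"), day.get("price"))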

What can I do with an inconsistent column delimited text file?

I have a text file that looks something like...
firstname:middle:lastname
firstname:middle:lastname
firstname:lastname
firstname:middle:lastname
firstname:lastname
I would like to be able to eventually use this information in a MySQL database, but since the columns are not consistent I am not sure what to do. Is there any way to resolve this?
If the data you have contains only the above variations, then you can make the following assumptions:
First part is the firstname
Last part is the lastname
Therefore, if using PHP for example, you could use explode to separate the data on the delimiter, which in this case is :.
When looping through each row, assume the last part is the last name, the first part is the first name, and whatever sits between them is the middle name.
You can use count() to find out how many parts the row you are reading inside the loop has; this tells you which part is the last one and whether a middle name is present. (The same logic is sketched below.)
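The answer mentions PHP's explode; the same first/last split is just as short in Python. A minimal sketch, with names.txt as an assumed input filename:
rows = []
with open("names.txt") as f:
    for line in f:
        parts = line.strip().split(":")
        first, last = parts[0], parts[-1]
        # Anything between the first and last parts is the middle name;
        # use None (SQL NULL) when the row has only two columns
        middle = ":".join(parts[1:-1]) if len(parts) > 2 else None
        rows.append((first, middle, last))
The resulting tuples can then be bulk-inserted into a three-column MySQL table where the middle-name column allows NULL.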
If the file is so simple ... the solution is trivial
firstname:middle:lastname
firstname:lastname
if(there are only two columns) { that means we have first and last name }
else { we have first, middle and last name }
If there are more columns, you could maybe resolve the data to the proper columns if you manage to build a priority list (i.e. in what order they could be missing, for example 'last name > first name > middle name') and/or if you can combine that with data-type matching (string/int/double/date)... anyway, you need to gather all your domain knowledge and see if that suffices.