Split csv cell into columns - csv

I'd like to split a csv column containing a dictionary-like structure/values into component columns. For example input/output data, see this spreadsheet. Data will always come in that format ({"key":value,...}), with the number of key value pairs being arbitrary.
Not necessarily looking for a full solution here—more curious what the my options are for parsing data to create the output I want. Open to maybe using python to do some of this.

use in B3:
=INDEX(BYROW(A3:A5, LAMBDA(x, IFNA(HLOOKUP(B2:E2,
TRANSPOSE(SPLIT(TRIM(FLATTEN(SPLIT(REGEXREPLACE(x, "[\}\{"",]", ),
CHAR(10)))), ":")), 2, 0)))))

In your case, you can use formula
=SPLIT(REGEXREPLACE(A1,""".*"":|\{|\}|\n|\r",""),",")
And here the result

Related

AWS glue: crawler does not identify metadata when CSV contains string and timestamp/date values

I have come across one thing when we consider CSV as input to crawler
crawler doesn't identify the columns header when all the data is in string format in CSV.
#P1 Headers are displayed as col0,col1...colN.
#P2 And actual column names are considered as data.
#P3 Metadata (i.e. column datatype is shown as string even the CSV dataset consists of date/timestamp value)
If we are going to consider custom (CSV) classifier then we are manually mentioning the column header.
#P2 will get covered i.e. column names will be removed however
#P1 still remain same. column header will be displayed as col0,col1...colN.
There are 3 things I want to avoid and achieve expected result.
CSV with strings only should show actual column names instead of col0,col1...colN.
Metadata of generated table should show correctly (i.e. date/timestamp, string) once it is crawled by crawler.
If Custom classifier is used, we need to mention column header names manually in classifier, yet result is not satisfactory.
Need generic solution instead of manual interventions.
Have gone through this document: here
If anyone has already implemented the solution, Please help.
I got solution to one of the above points. Headers i.e. first line of CSV is displayed by using 'Has heading' in CSV classifier.
However, Solution for following is yet to figure out.
Metadata of CSV file is shown as string even if column contains timestamp/date value. Crawler is reading these datatypes as string.
Custom classifier needs manual interventions. I have mentioned all column names in classifier. Is there generic solution?
If we are using pd.to_csv to write the dataframe, then to avoid getting column names as col1, col2 and so on, add the parameter
index_label='index' such as:
pd.to_csv(df,index_label='index')

Apache NiFi: Creating new column using a condition

I have asked a similar question. Yet I wasn't able to find a solution for my problem through that approach. I have a csv which looks like this:
studentID,regger,age,number
123,west,12,076392367
456,nort,77,098123124
231,west,33,076346325
I want to add a new column and add values according to the data in the number field.This is the logic.
If the first 4 digits of data in the number column is equal to "0763" then the new column named (status) must be set as INSIDE or if it is any other value its OUTSIDE
As mentioned in the logic the output must look like this:
studentID,regger,age,number,status
123,west,12,076392367,INSIDE
456,nort,77,098123124,OUTSIDE
231,west,33,076346325,INSIDE
My Approach
I tried to achieve this by first duplicating the number column to the status column. And then trying to take the first 4 digits and dealing with it.
Hope you would be able to suggest a way to Nifi Workflow to make this possible.
I used the UpdateRecord processor twice and got the results that you want.
Input
I started with your input data.
studentID,regger,age,number
123,west,12,076392367
456,nort,77,098123124
231,west,33,076346325
Process
First, set the UpdateRecord processor as follows:
Record Reader CSVReader
Record Writer CSVRecordSetWriter
Replacement Value Strategy Record Path Value
/status /number
it will create the new column status with the value of number column.
Second, the first output should go to another UpdateRecord processor with the options
Record Reader CSVReader
Record Writer CSVRecordSetWriter
Replacement Value Strategy Literal Value
/status ${field.value:substring(0,4):equals('0763'):ifElse(${field.value:replace(${field.value},'INSIDE')},${field.value:replace(${field.value},'OUTSIDE')})}
and this will give you the final results.
Be aware that the number column is not an integer column, so you have to set the record reader CSVReader with the option Schema Access Strategy to the Use String Fields From Header.
Output
studentID,regger,age,number,status
123,west,12,076392367,INSIDE
456,nort,77,098123124,OUTSIDE
231,west,33,076346325,INSIDE
You can try below logic :-
SplitText ->
ExtractText Processor ->
RouteOnAttribute(Add condition if first four number is 0763)
-----Match Relation--> ReplaceText(Extracted Attribute from file + "INSIDE") -> PutFile
-----Unmatch Relation--> ReplaceText(Extracted Attribute from file + "OUTSIDE") -> PutFile
Hope this will help you.

Reshape the dataset into more relational format (Transpose SOME rows and assign them to a data subset)

I have a spreadsheet/csv:
Code:,101,Course Description:,"Introduction to Rocket Science",
Student Name,Lecture Hours,Labs Hours,Test Score,Status
John Galt,48,120,4.7,Passed
James Taggart,50,120,4.9,Passed
...
I need to reshape it to the following view:
Code:,Course Description:,Students,Lecture Hours,Labs Hours,Average Test Score,Teaching Staff
101,"Introduction to Rocket Science",John Galt,48,120,4.7,Passed
101,"Introduction to Rocket Science",James Taggart,50,120,4.9,Passed
...
Beleive it or not, can not get the right idea how to do that despite it seems to be very primitive transformation, is there any silver bullet for this?
Original records (csv) have in a way json-like structure so my first approach was to represent the original data as a vector and then transpose it, (but in this case my resulting table looks like sparced matrix - rows I have transpored are blank in the rest of its values)
Another way Im considering - **serialize it into jsons and then de-serialize** into new spreadsheet (jsonize()) - in this case, Im having problems with merging them properly.
In both ways I have it "half-working";
Can anyone suggest simple and reliable algorithm for this;
Any language, RegEx, any tools, code snippets are very appreciated
Assuming that the pattern you've described here is consistent throughout, there are quite a few different approaches you could take I think, but in all cases you basically can use that fact that the 'Course' rows start with "Code:" but that's never going to be a student name.
You can take advantage of this either by a regular expression find/replace, or within OpenRefine.
Example:
Open file in a text editor that supports regular expressions in
find/replace
Search for lines starting with 'Code:' and add additional commas to the start of the row to shift the course data columns to the
right e.g. search for: ^Code: replace with: ,,,,,^Code:
If you now import the file into OpenRefine then you'll have a project with 10 columns (the 10th col is caused by the trailing
comma at the end of the course data row)
You can now use Transpose (or just rename) on the right-most columns which contain the course data, while leaving the left-most
columns which contain the student details
Isolate the rows that contain the phrase 'Student Name' in the first column and remove them (via a filter or facet)
Move the Course Code/Description columns to the beginning of the project, and use the 'Edit Cells->Fill Down' option on each column to get the values repeated on all the relevant lines
Finally rename the columns as you want, remove any extraneous columns

Best way to parse a big and intricated Json file with OpenRefine (or R)

I know how to parse json cells in Open refine, but this one is too tricky for me.
I've used an API to extract the calendar of 4730 AirBNB's rooms, identified by their IDs.
Here is an example of one Json file : https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day of the year from now until november 2017, i would like to extract the availability of this rooms (true or false) and its price at this day.
I can't figure out how to parse out these informations. I guess that it implies a series of nested forEach, but i can't find the right way to do this with Open Refine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionnaries that disrupts me.
Any help would be appreciate. If the operation is too difficult in Open Refine, a solution with R (or Python) would also be fine for me.
Rather than just creating your Project as text, and working with GREL to parse out...
The best way is just select the JSON record part that you want to work with using our visual importer wizard for JSON files and XML files (you can even use a URL pointing to a JSON file as in your example). (A video tutorial shows how here: https://www.youtube.com/watch?v=vUxdB-nl0Bw )
Select the JSON part that contains your records that you want to parse and work with (this can be any repeating part, just select one of them and OpenRefine will extract all the rest)
Limit the amount of data rows that you want to load in during creation, or leave default of all rows.
Click Create Project and now your in Rows mode. However if you think that Records mode might be better suited for context, just import the project again as JSON and then select the next outside area of the content, perhaps a larger array that contains a key field, etc. In the example, the key field would probably be the Date, and why I highlight the whole record for a given date. This way OpenRefine will have Keys for each record and Records mode lets you work with them better than Row mode.
Feel free to take this example and make it better and even more helpful for all , add it to our Wiki section on How to Use
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OR array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of days in the month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OR can't store OR arrays directly in a cell - it has to be a string.
Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OR
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days
Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from example URL above - this gives me 441 rows for the single ID - each contains the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty easy JSON in each cell - so you can extract availability using
value.parseJson().available
etc.

Read second column of csv

I have this CSV data file that looks like
a,b
1,2
3,4
data = readdlm("my/local/path", ',')
however, when I access data[1], I'm only getting a, I thought it supposes to be [a,b]? Doing things like data[1:2] gets me the first column only.
Any idea how can I access the second column?
From the docs for readdlm:
Read a matrix from the source where each line gives one row...
So use data[row,col] syntax to get each element