Extract JSON data with Data Flow - Flatten Transformation

Following a previous question I started working on a Data Flow, with the purpose of flattening a JSON file, created as a result of an Application Insights REST query. You can find an anonymised version here.
My goal is to extract the data in the "rows" array of arrays, but I end up with the data duplicated in a Cartesian manner (the original 18 rows become 324, i.e. 18*18).
I cannot understand what I am doing wrong or if it is an issue with the JSON "rows" array of arrays.
Here is my Data Flow - the Source has the "Document per line" JSON option; "Single documents" raises an [unexpected character "] error, probably due to the unusual formatting in the JSON:
This is the Data Preview in the Source - as you can see, there is only one "tables" node, with 18 elements in the "rows" array:
I tried to Flatten it, but I cannot map the "rows" data to a column; I cannot use something like table.rows[0]:
Also, the rows data gets duplicated - 18 rows for each of the 18 rows output:
I am not sure how to get to the bottom of this - whether it's the JSON format or something I am doing wrong. From my experience, it's probably the latter.

I think this is caused by your special format.
Please try this:
add a rowdata property
flatten your data

@Steve Zhao thank you! But that solution duplicated the data, similar to the original situation:
I did not manage to treat this data as JSON, so I ended up treating it as text that can be manipulated into an array.
So I split the text by "rows:" and retrieve the second part of the split (arrays in Expression Builder start from 1):
Then I split that text into an array:
Which then I can flatten (at last):
From here on I keep splitting these values to get the data I need - I was interested in the first two columns and the fourth.
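For reference, here is a minimal Python sketch of the same text-manipulation idea outside the Expression Builder. The payload below is a made-up Application Insights-style result; only the shape matters:

# Hypothetical payload; field names and values are assumptions.
raw = '{"tables":[{"name":"PrimaryResult","rows":[["a",1],["b",2]]}]}'

# Same idea as the steps above: treat the payload as text, keep
# everything after "rows":, then split it into one chunk per row.
tail = raw.split('"rows":', 1)[1]            # second part of the split
chunks = tail.strip(' []}').split('],[')     # one chunk per row
rows = [chunk.split(',') for chunk in chunks]
print(rows)   # [['"a"', '1'], ['"b"', '2']]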

Related

Foundry Data Connection Rest responses as rows

I'm trying to implement a very simple single GET call, and the response returns some text with a bunch of IDs separated by newlines (like a single-column CSV). I want to save each one as a row in a dataset.
I understand that in general the REST connector saves each response as a new row in an Avro file, which works well for JSON responses, which can then be parsed in code.
However, in my case I need it to just save the response in a txt or csv file, to which I can then apply a schema, getting each ID in its own row. How can I achieve this?
By default, the Data Connection REST connector will place each response from the API as a row in the output dataset. If you know the format type of your response, and it's something that would usually be parsed as one row per newline (CSV, for example), you can try setting outputFileType to the correct format (undefined by default).
For example (for more details see the REST API Plugin documentation):
type: rest-source-adapter2
outputFileType: csv
restCalls:
  - type: magritte-rest-call
    method: GET
    path: '/my/endpoint/file.csv'
If you don't know the format, or the above doesn't work regardless, you'll need to parse the response in transforms to split it into separate rows. This can be done as if the response were a string column; in this case, exploding after splitting on newline (\n) might be useful: F.explode(F.split(F.col("response"), r'\n'))
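For completeness, a minimal self-contained PySpark sketch of that transform step (the in-memory dataset and the column name "response" are assumptions standing in for the connector's output):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical one-row dataset as the REST connector might produce it:
# the whole API response sits in a single string column, "response".
df = spark.createDataFrame([("id1\nid2\nid3",)], ["response"])

# Split the response on newlines and explode into one id per row.
ids = df.select(F.explode(F.split(F.col("response"), r"\n")).alias("id"))
ids.show()   # three rows: id1, id2, id3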

Is there a way to get the column names of a dataframe in PySpark without reading the whole dataset?

I have huge datasets in my HDFS environment, say 500+ datasets, and all of them are around 100M+ rows. I want to get only the column names of each dataset without reading the whole datasets, because that would take too long. My data is JSON-formatted and I'm reading it using the classic Spark JSON reader: spark.read.json('path'). So what's the best way to get the column names without wasting time and memory?
Thanks...
From the official doc:
If the schema parameter is not specified, this function goes through the input once to determine the input schema.
Therefore, the JSON reader will not give you the column names from just the first line.
Still, you can do an extra step first: extract one line, create a dataframe from it, then extract the column names.
One answer could be the following :
Read the data using the spark.read.text('path') method
Limit the number of rows to 1 with the method limit(1), since we just want the first record, whose keys serve as the column names
Convert the DataFrame to an RDD and collect it as a list with the method collect()
Convert the first collected row from a Unicode string to a Python dict (since I'm working with JSON-formatted data)
The keys of that dict are exactly what we are looking for (the column names, as a Python list).
This code worked for me:
from ast import literal_eval

# Read as plain text, keep only the first line, parse it into a dict,
# and take its keys as the column names.
literal_eval(spark.read.text('path').limit(1)
             .rdd.flatMap(lambda x: x)
             .collect()[0]).keys()
The reason it works faster might be that PySpark won't load the whole dataset with all the field structures if you read it using the text format (everything is read as one big string); it's lighter and more efficient for this specific case.
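One caveat with literal_eval: it will fail on JSON literals such as true, false or null, which are not valid Python. A json.loads variant of the same approach (assuming the same one-record-per-line layout and an existing spark session) avoids that:

import json

# Same approach, but parse the first line with the json module,
# which handles true/false/null correctly.
first_line = (spark.read.text('path').limit(1)
              .rdd.flatMap(lambda x: x).collect()[0])
columns = list(json.loads(first_line).keys())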

Pentaho Data Integration - Two flows saving into same JSON output

I'm doing a transformation that has two different flows. At the end of the transformation the two flows converge and save the data into the same JSON output file. Checking a specific column in the result file, the values are strange. They look like the following:
Column
[B#3e8fe299
[B#50b541fb
[B#44b719d4
[B#7dad3c13
[B#6e46a542
[B#170d9515
When I save to different files this doesn't occur; the values stay correct. Does anyone know what could be causing it and how I can solve it?
Thanks.
Looks like you're printing out Java byte array object IDs. Here is a link which shows similar values to yours ([B#...):
Java: Syntax and meaning behind "[B#1ef9157"? Binary/Address?
Can you verify what types your fields are?

What's the difference between json data being encoded or not

What's the purpose (not what it becomes) of doing json_encode on this before putting it into the database
rating: {cleanliness: 3, publicFacility: 1, roomFacility: 2, security: 2}
to become this
rating: "{"cleanliness":3,"publicFacility":1,"roomFacility":2,"security":2}"
I see no point in doing this because I need to json_decode it again before serving it back... can anybody clear this up for me?
Do not store JSON-encoded data in the database. You defeat the whole point of a relational database that way and make searching for values an expensive task. I see in your sample the attributes cleanliness, publicFacility, roomFacility and security. Those should be columns in your database so you can search for something like "all entries with a cleanliness higher than 3".
It works with the JSON column type but it is more expensive than using normal columns.
Edit: Check the use case for your database entry. If you are sure you will never need to search in or order by the encoded attributes, you can store the data encoded as a JSON string. However, if your database supports the JSON column type, you should use that one, because it allows searching in the stored JSON (though it is more expensive than searching in normal columns).
Second point: the second code snippet (with the quotation marks) looks like invalid JSON syntax.
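To make the earlier point about real columns concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names simply mirror the attributes from the question:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE rating (
    cleanliness INTEGER,
    publicFacility INTEGER,
    roomFacility INTEGER,
    security INTEGER)""")
conn.execute("INSERT INTO rating VALUES (3, 1, 2, 2)")
conn.execute("INSERT INTO rating VALUES (4, 2, 5, 3)")

# With real columns this query is cheap and index-friendly; with a
# JSON string in a text column you would have to decode every row first.
for row in conn.execute("SELECT * FROM rating WHERE cleanliness > 3"):
    print(row)   # (4, 2, 5, 3)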

Python: Dump JSON Data Following Custom Format

I'm working on some Python code for my local billiard hall and I'm running into problems with JSON encoding. When I dump my data into a file, I obviously get all the data on a single line. However, I want my data to be dumped into the file following the format that I want. For example (had to use a picture to get the point across): My custom JSON format.
I've looked up questions on custom JSONEncoders, but it seems they all have to do with datatypes that aren't JSON serializable. I never found a solution for my specific need, which is having everything laid out in the manner that I want. Basically, I want each of the list elements to be on a separate row, but all of the dict items to be on the same row. Do I need to write my own custom encoder, or is there some other approach I need to take? Thanks!
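Without seeing the picture, one low-tech way to get "one list element per row, each dict on a single line" is to dump the inner dicts compactly and assemble the outer layout by hand. A minimal sketch with hypothetical data (the "players" key and filename are made up):

import json

data = {"players": [{"name": "a", "score": 1}, {"name": "b", "score": 2}]}

# Dump each dict on its own line (compact), then wrap them in the
# surrounding object/list structure manually.
lines = ",\n".join("    " + json.dumps(item) for item in data["players"])
text = '{\n  "players": [\n' + lines + '\n  ]\n}\n'

with open("out.json", "w") as f:
    f.write(text)
# out.json contains:
# {
#   "players": [
#     {"name": "a", "score": 1},
#     {"name": "b", "score": 2}
#   ]
# }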