Formatting JSON files for SQLContext

I'm experiencing issues loading JSON that depend on the formatting of the input JSON file.
According to the Spark documentation on JSON Datasets, each line of the input file must be a valid JSON object:
"Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail."
So, if I have an input JSON file such as:
{
"Year": "2013",
"First Name": "DAVID",
"County": "KINGS",
"Sex": "M",
"Count": "272"
},
{
"Year": "2013",
"First Name": "JAYDEN",
"County": "KINGS",
"Sex": "M",
"Count": "268"
}
Are there any existing tools or scripts to convert to:
{"Year": "2013","First Name": "DAVID","County": "KINGS","Sex": "M","Count":"272"},
{"Year": "2013","First Name": "JAYDEN","County": "KINGS","Sex": "M","Count": "268"}
where the JSON conforms to "Each line must contain a separate, self-contained valid JSON object"?
If I reformat to the style above, things work as expected. But I made these modifications manually over a few rows; I can't do that for the entire data set, so I'm looking for an existing script or tool.
OR
I could load the data into a JDBC-accessible database if that's a better option. Thoughts?
Thanks in advance

You can simply load the JSON files into an RDD first using sc.wholeTextFiles(), drop the file name from each (filename, content) pair, and then run the SQLContext read on the RDD contents.
e.g.
// Read each whole file as a (filename, content) pair and keep only the content
val jsonRdd = sc.wholeTextFiles("samplefile.json").map(x => x._2)
// Parse the JSON content of each file from the in-memory strings
val jsonDf = sqlContext.read.json(jsonRdd)
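One thing to keep in mind with this approach: wholeTextFiles loads each file as a single in-memory record, so it works best when the individual files fit comfortably in executor memory.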

What if you make it an array by adding square brackets, like this:
[
{
"Year": "2013",
"FName": "DAVID",
"County": "KINGS",
"Sex": "M",
"Count": "272"
},
{
"Year": "2013",
"FName": "JAYDEN",
"County": "KINGS",
"Sex": "M",
"Count": "268"
}
]
If I take your file and add the brackets, I can iterate through it with Node.js and output a file that looks like what you want. The caveat in Node.js is that I can't use a key with a space like "First Name" as a plain variable, so I changed it to FName (bracket notation such as obj["First Name"] would also work).
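If you'd rather not go through Node.js, the same array-wrapping idea works as a small standalone script. Here is a rough sketch in Python (the file names are placeholders; it assumes the input becomes one valid JSON array once wrapped in brackets):

import json

# Wrap the comma-separated objects in brackets so the file parses as one JSON array
with open("input.json") as f:
    records = json.loads("[" + f.read() + "]")

# Write one compact, self-contained JSON object per line, the format Spark expects
with open("output.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

Once converted, the standard line-oriented sqlContext.read.json("output.jsonl") should work directly.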

Related

Add data to a JSON file using Talend

I have the following JSON:
[
{
"date": "29/11/2021",
"Name": "jack"
},
{
"date": "30/11/2021",
"Name": "Adam"
},
{
"date": "27/11/2021",
"Name": "james"
}
]
Using Talend, I want to add 2 lines to each object to get something like:
[
{
"company": "AMA",
"service": "BI",
"date": "29/11/2021",
"Name": "jack"
},
{
"company": "AMA",
"service": "BI",
"date": "30/11/2021",
"Name": "Adam"
},
{
"company": "AMA",
"service": "BI",
"date": "27/11/2021",
"Name": "james"
}
]
Currently I use 3 components (tJSONDocOpen, tFixedFlowInput, tJSONDocOutput), but I can't find the right configuration of the components to get the job done!
If you are not comfortable with JSON, just do these steps:
In the metadata, create a FileJson, then paste it into your job as a tFileInputJson and build the job design and mapping around it.
In your tFileOutputJson, don't forget to change the name of the data block from "Data" to "".
What you need to do here, according to Talend practices, is read your JSON, then extract each object of it, add your properties, and finally rebuild your JSON in a file.
An efficient way to do this is with the tMap component.
The first tFileInputJSON has to specify which properties to read from the JSON by setting your 2 objects in the mapping field.
Then the tMap simply adds 2 columns to your main stream, for example with hard-coded string values. Depending on your needs, this component also offers the possibility of assigning dynamic data to your 2 new columns; it's a powerful tool for manipulating the structure of a data stream.
You will find more information about this component in the official documentation, https://help.talend.com/r/en-US/7.3/tmap/tmap, especially the "tMap scenarios" part.
Note
Instead of using the tMap, if you are comfortable with Java, you can use a tJavaRow instead. With it, you can set up your 2 new columns with whatever Java code you want, as long as you have defined the output schema of the component.
// Pass the incoming fields through unchanged
output_row.Name = input_row.Name;
output_row.date = input_row.date;
// Populate the two new columns defined in the output schema
output_row.company = "AMA";
output_row.service = "BI";

Read first line of a huge JSON file with Spark using PySpark

I'm pretty new to Spark, and to teach myself I have been using small JSON files, which work perfectly. I'm using PySpark with Spark 2.2.1. However, I don't get how to read in a single line of data instead of the entire JSON file. I have been looking for documentation on this, but it seems pretty scarce. I have to process a single large (larger than my RAM) JSON file (a Wikidata dump: https://archive.org/details/wikidata-json-20150316) and want to do this in chunks or line by line. I thought Spark was designed to do just that, but I can't find out how, and when I request the top 5 observations in a naive way I run out of memory. I have tried an RDD:
SparkRDD= spark.read.json("largejson.json").rdd
SparkRDD.take(5)
and a DataFrame:
SparkDF= spark.read.json("largejson.json")
SparkDF.show(5,truncate = False)
So in short:
1) How do I read in just a fraction of a large JSON file? (Show the first 5 entries.)
2) How do I filter a large JSON file line by line to keep just the required results?
Also: I don't want to have to predefine the data schema for this to work.
I must be overlooking something.
Thanks
Edit: With some help I have gotten a look at the first observation, but by itself it is already too huge to post here, so I'll just put a fraction of it here.
[
{
"id": "Q1",
"type": "item",
"aliases": {
"pl": [{
"language": "pl",
"value": "kosmos"
}, {
"language": "pl",
"value": "\\u015bwiat"
}, {
"language": "pl",
"value": "natura"
}, {
"language": "pl",
"value": "uniwersum"
}],
"en": [{
"language": "en",
"value": "cosmos"
}, {
"language": "en",
"value": "The Universe"
}, {
"language": "en",
"value": "Space"
}],
...etc
That's very similar to Select only first line from files under a directory in pyspark.
Hence something like this should work:
def read_firstline(filename):
    with open(filename, 'rb') as f:
        # return a list so flatMap emits the line itself rather than its individual bytes
        return [f.readline()]
# files is a list of filenames
rdd_of_firstlines = sc.parallelize(files).flatMap(read_firstline)
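For the line-by-line filtering part of the question, one option is to read the dump as plain text and clean each line before handing it to the JSON reader. Here is a rough sketch, assuming the dump keeps one entity object per line inside a top-level array (the file name and the trailing-comma handling are assumptions based on that layout):

# Read the dump line by line; Spark streams the lines rather than
# holding the whole file in memory at once
lines = sc.textFile("largejson.json")

# Drop the array brackets and strip trailing commas so that every
# remaining line is a self-contained JSON object
objects = (lines
           .map(lambda l: l.strip().rstrip(","))
           .filter(lambda l: l.startswith("{")))

# Peek at the first few entities without scanning the whole file
print(objects.take(5))

# Or parse everything into a DataFrame (note that schema inference
# will scan all of the data once)
df = spark.read.json(objects)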

String to JSON format in PowerShell

Hey guys, I have a singular string output which I need to convert to JSON:
Policy Name: Default_US1 Id: abc123 Buckets: bucket1,bothplaces
Policy Name: Default_CH1 Id: def456 Buckets: support,ch1,ch2
Policy Name: Default_NY2 Id: ghi789 Buckets: demo,bucket1,test1,test
What SHOULD it look like in JSON format?
[
{"Policy Name": "Default_US1"}, {"Id": "abc123"}, {"Buckets":[ "bucket1","bothplaces"]}
{"Policy Name": "Default_CH1"}, {"Id": "def456"}, {"Buckets":[ "support","ch1","ch2"]}
{"Policy Name": "Default_NY2"}, {"Id": "ghi789"}, {"Buckets":[ "demo","bucket1","test1","test"]}
]
Above is my current attempt, but other than not working, I know instinctively it's missing something(s), and I can't figure out what or how to remedy it.
Directions on how to do it in PowerShell would be a plus, but are not necessary.
I keep trying but messing up, since I know the best test is making ConvertFrom-Json show me normal output.
I do not care much how it ends up looking in the end; I just wish to extract all that data, with JSON being the format of choice. Any VALID JSON result I can work with and manipulate, but first I need a valid JSON conversion.
OK, so you were correct: your current JSON format is ghastly! The mistake you are making is treating each little bit of data as a separate object when there is a natural hierarchy in your data model.
The following structure fits your data model more naturally. However, this is purely based on a cursory examination of the input data you have posted; I know nothing about the data model itself.
[
{
"Name": "Default_US1",
"Id": "abc123",
"Buckets": [
"bucket1",
"bothplaces"
]
},
{
"Name": "Default_CH1",
"Id": "def456",
"Buckets": [
"support",
"ch1",
"ch2"
]
},
{
"Name": "Default_NY2",
"Id": "ghi789",
"Buckets": [
"demo",
"bucket1",
"test1",
"test2"
]
}
]
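Since any valid JSON route will do, here is a rough sketch of the conversion logic in Python rather than PowerShell (the regex and the assumption that individual values contain no spaces are guesses based on the three sample lines):

import json
import re

lines = [
    "Policy Name: Default_US1 Id: abc123 Buckets: bucket1,bothplaces",
    "Policy Name: Default_CH1 Id: def456 Buckets: support,ch1,ch2",
    "Policy Name: Default_NY2 Id: ghi789 Buckets: demo,bucket1,test1,test",
]

policies = []
for line in lines:
    # Pull out the three fields; assumes the values themselves contain no spaces
    m = re.match(r"Policy Name: (\S+) Id: (\S+) Buckets: (\S+)", line)
    if m:
        policies.append({
            "Name": m.group(1),
            "Id": m.group(2),
            "Buckets": m.group(3).split(","),  # comma-separated list -> JSON array
        })

print(json.dumps(policies, indent=2))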

A regex that can remove data from a JSON object

I'd like to be able to selectively remove elements from a JSON schema. Imagine a JSON object that contains a larger but similar array of users, like this:
[{
"users": [{
"firstName": "Nancy",
"socialSecurityNumber": "123-45-6789",
"sex": "Female",
"id": "1234",
"race": "Smith",
"lastName": "Logan"
}, {
"firstName": "Charles",
"socialSecurityNumber": "321-54-9876",
"sex": "Male",
"id": "3456",
"race": "White",
"lastName": "Clifford"
}],
I'd like to strip the socialSecurityNumber element from the JSON using a regex. What would a regex to remove
"socialSecurityNumber": "whatever value",
look like, where the value of the data pair could be any string?
I cannot be certain of the position of the data pair or whether it will have a trailing comma.
Try replacing the following regular expression with the empty string:
"socialSecurityNumber": "(\d|-)+",
It can go wrong if this info is split across 2 lines, or if the SSN is the last user field, because then there will be no comma after it.
Anyway, after the replace operation, check whether any occurrences of the string
"socialSecurityNumber"
remain, to confirm this approach worked. If there are still strings that weren't replaced, then you will need a JSON parser to correctly eliminate this information.
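Since the parser route is the robust one, here is a minimal sketch of it in Python (the file names are placeholders; the recursive walk is an assumption so that nested structures are covered too):

import json

def strip_key(node, key):
    # Recursively remove every occurrence of `key` from dicts in a JSON tree
    if isinstance(node, dict):
        node.pop(key, None)
        for value in node.values():
            strip_key(value, key)
    elif isinstance(node, list):
        for item in node:
            strip_key(item, key)

with open("users.json") as f:
    data = json.load(f)

strip_key(data, "socialSecurityNumber")

with open("users_clean.json", "w") as f:
    json.dump(data, f, indent=2)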

Extracting data from a JSON file

I have a large JSON file that looks similar to the code below. Is there any way I can iterate through each object, look for the field "element_type" (it is not present in all objects in the file, if that matters), and extract or write each object with the same element type to a file? For example, each user would end up in a file called user.json and each book in a file called book.json.
I thought about using JavaScript, but to my knowledge JS can't write to files. I also tried to do it with Linux command-line tools by removing all newlines, inserting a newline after each "}," and then iterating through each line to find the element type and write it to a file. This worked for most of the data; however, where there were objects like the "problem_type" below, it inserted a newline in the middle of the data due to the nested JSON in the "times" element. I've run out of ideas at this point.
{
"data": [
{
"element_type": "user",
"first": "John",
"last": "Doe"
},
{
"element_type": "user",
"first": "Lucy",
"last": "Ball"
},
{
"element_type": "book",
"name": "someBook",
"barcode": "111111"
},
{
"element_type": "book",
"name": "bookTwo",
"barcode": "111111"
},
{
"element_type": "problem_type",
"name": "problem object",
"times": "[{\"start\": \"1230\", \"end\": \"1345\", \"day\": \"T\"}, {\"start\": \"1230\", \"end\": \"1345\", \"day\": \"R\"}]"
}
]
}
I would recommend Java for this purpose. It sounds like you're running on Linux, so it should be a good fit.
You'll have no problems writing to files, and you can use a library like http://json-lib.sourceforge.net/ to gain access to things like JSONArray and JSONObject, which you can easily use to iterate through the data in your JSON, check what's in "element_type", and write to a file accordingly.
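The grouping logic itself looks the same in any language with a JSON library. As an illustration only, here is a sketch in Python rather than the Java/json-lib route recommended above (the input file name is a placeholder):

import json
from collections import defaultdict

with open("input.json") as f:
    data = json.load(f)

# Group objects by their element_type, skipping objects that lack the field
groups = defaultdict(list)
for obj in data["data"]:
    if "element_type" in obj:
        groups[obj["element_type"]].append(obj)

# Write each group to its own file, e.g. user.json, book.json, problem_type.json
for element_type, objects in groups.items():
    with open(element_type + ".json", "w") as f:
        json.dump(objects, f, indent=2)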