Converting json file to a single line - json

I have a huge .json file like the one below. I want to convert that JSON to a dataframe on Spark.
{
"movie": {
"id": 1,
"name": "test"
}
}
When I execute the following code, I get a _corrupt_record error:
val df = sqlContext.read.json("example.json")
df.first()
I later learned that Spark only supports single-line JSON files, like:
{ "movie": { "id": 1, "name": "test test" } }
How can I convert JSON text that spans multiple lines into a single line?
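One way to collapse the file is to parse it and serialize it again without indentation. The following is a minimal sketch in Python rather than Scala, and it assumes the whole document fits in memory, which may not hold for a truly huge file. The helper name to_single_line is made up for illustration. Spark 2.2 and later also have a multiLine read option that may make the conversion unnecessary, but that is worth checking against the Spark version in use.

```python
import json

def to_single_line(pretty_text):
    """Parse pretty-printed JSON and re-serialize it on one line."""
    # json.dumps without an indent argument emits no newlines.
    return json.dumps(json.loads(pretty_text))

pretty = """{
  "movie": {
    "id": 1,
    "name": "test"
  }
}"""

line = to_single_line(pretty)
print(line)  # {"movie": {"id": 1, "name": "test"}}
```

Writing one such line per record gives the one-object-per-line layout that sqlContext.read.json expects.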

Related

Json data as dict for cerberus validations

I am trying to read a JSON file as a string and use the data as a validation schema for Cerberus. I am using a custom function with check_with. The schema works fine if I define it directly in my Python test script.
abc.json
{
"rows": {
"type": "list",
"schema": {
"type": "dict",
"schema": {
"amt": {"type": "integer"},
"amt": {"check_with": util_cls.amt_gt_than}
}
}
}
}
python test code
with open("abc.json") as f:
    s = f.read()
#s = ast.literal_eval(s)
v = Validator()
r = v.validate(json_data, s)
Cerberus requires the variable s to be a dict, but I can't convert the contents of abc.json with json.load because the file is not valid JSON (util_cls.amt_gt_than is an unquoted Python reference). Any ideas on how to convert the string to a dict?
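One possible approach is to keep the file valid JSON by storing the function name as a quoted string, then swapping the string for the real callable after loading. The sketch below is an assumption-heavy illustration: amt_gt_than is a hypothetical stand-in for util_cls.amt_gt_than (its real body is not shown in the question), CHECKS and resolve_checks are made-up names, and the two "amt" entries are merged into one because a JSON parser keeps only the last duplicate key.

```python
import json

def amt_gt_than(field, value, error):
    # Hypothetical stand-in for util_cls.amt_gt_than; the threshold is assumed.
    if value <= 0:
        error(field, "must be greater than 0")

# Registry mapping the names used in the JSON file to real callables.
CHECKS = {"amt_gt_than": amt_gt_than}

def resolve_checks(node):
    """Return a copy of the schema with check_with names replaced by callables."""
    if isinstance(node, dict):
        resolved = {}
        for key, value in node.items():
            if key == "check_with" and isinstance(value, str):
                resolved[key] = CHECKS[value]
            else:
                resolved[key] = resolve_checks(value)
        return resolved
    if isinstance(node, list):
        return [resolve_checks(item) for item in node]
    return node

# Same shape as abc.json, but with the function name quoted so it is valid JSON.
schema_text = """
{
  "rows": {
    "type": "list",
    "schema": {
      "type": "dict",
      "schema": {
        "amt": {"type": "integer", "check_with": "amt_gt_than"}
      }
    }
  }
}
"""

schema = resolve_checks(json.loads(schema_text))
```

The resulting schema is a plain dict, so it should be usable as the schema argument in place of the string s above.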

JSONPaths file: Parse a JSON object contained within a JSON array

I have rows of the following JSON form:
[
{
"id": 1,
"costs": [
{
"blue": 100,
"location":"courts",
"sport": "football"
}
]
}
]
I want to upload this into a redshift table as follows:
id | blue | location | sport
---+------+----------+---------
1  | 100  | courts   | football
The following JSONPaths file is not successful:
{
"jsonpaths": [
"$.id",
"$.costs[0].blue",
"$.costs[0].location",
"$.costs[0].sport"
]
}
Redshift returns the following error code:
err_code: 1216 Invalid JSONPath format: Member is not an object.
How can I change the jsonpaths file to be able to upload the json as desired?
The answer to this is provided by John Rotenstein in the comments. I am just formalizing the answer here.
As shown in the documentation, the input JSON records have to be a newline-delimited sequence of JSON objects, not a single JSON array. The examples show the JSON objects pretty-printed, but typically the input stream of records would be one JSON object per line.
{ "id": 1, "costs": [ { "blue": 100, "location":"courts", "sport": "football" } ] }
{ "id": 2, "costs": [ { "blue": 200, "location":"fields", "sport": "cricket" } ] }
So, technically, the input record stream as a whole is not required to be a valid JSON document; it is a stream of delimited, individually valid JSON objects.
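If the source data is already a JSON array like the one in the question, it can be rewritten into that one-object-per-line layout before loading. This is a minimal sketch in Python that assumes the array fits in memory; array_to_json_lines is an illustrative name, not part of any Redshift tooling.

```python
import json

def array_to_json_lines(array_text):
    """Turn a JSON array of objects into newline-delimited JSON objects."""
    records = json.loads(array_text)
    # json.dumps without indent keeps each record on a single line.
    return "\n".join(json.dumps(record) for record in records)

array_text = """
[
  {"id": 1, "costs": [{"blue": 100, "location": "courts", "sport": "football"}]},
  {"id": 2, "costs": [{"blue": 200, "location": "fields", "sport": "cricket"}]}
]
"""

json_lines = array_to_json_lines(array_text)
print(json_lines)
```

Each output line is then a standalone object, so paths such as $.costs[0].blue are evaluated against an object rather than against the enclosing array.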

Parsing and cleaning text file in Python?

I have a text file that contains raw data. I want to parse and clean that data so that it can be used further. The following is the raw data.
"{\x0A \x22identifier\x22: {\x0A \x22company_code\x22: \x22TSC\x22,\x0A \x22product_type\x22: \x22airtime-ctg\x22,\x0A \x22host_type\x22: \x22android\x22\x0A },\x0A \x22id\x22: {\x0A \x22type\x22: \x22guest\x22,\x0A \x22group\x22: \x22guest\x22,\x0A \x22uuid\x22: \x221a0d4d6e-0c00-11e7-a16f-0242ac110002\x22,\x0A \x22device_id\x22: \x22423e49efa4b8b013\x22\x0A },\x0A \x22stats\x22: [\x0A {\x0A \x22timestamp\x22: \x222017-03-22T03:21:11+0000\x22,\x0A \x22software_id\x22: \x22A-ACTG\x22,\x0A \x22action_id\x22: \x22open_app\x22,\x0A \x22values\x22: {\x0A \x22device_id\x22: \x22423e49efa4b8b013\x22,\x0A \x22language\x22: \x22en\x22\x0A }\x0A }\x0A ]\x0A}"
I want to remove all the hexadecimal escape sequences. I tried parsing the data, storing it in an array, and cleaning it with re.sub(), but it gives the same data back.
for line in f:
    new_data = re.sub(r'[^\x00-\x7f],\x22',r'', line)
    data.append(new_data)
\x0A is the hex code for a newline and \x22 is the hex code for a double quote. After s = <your json string>, print(s) gives
>>> print(s)
{
"identifier": {
"company_code": "TSC",
"product_type": "airtime-ctg",
"host_type": "android"
},
"id": {
"type": "guest",
"group": "guest",
"uuid": "1a0d4d6e-0c00-11e7-a16f-0242ac110002",
"device_id": "423e49efa4b8b013"
},
"stats": [
{
"timestamp": "2017-03-22T03:21:11+0000",
"software_id": "A-ACTG",
"action_id": "open_app",
"values": {
"device_id": "423e49efa4b8b013",
"language": "en"
}
}
]
}
You should parse this with the json module load (from file) or loads (from string) functions. You will get a dict with 2 dicts and a list with a dict.
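If the text file contains the escape sequences literally (a backslash followed by x0A, rather than a real newline), they have to be decoded before json.loads will accept the text. The sketch below assumes that situation, assumes the data is pure ASCII, and assumes each record is wrapped in one pair of outer double quotes as in the sample; parse_escaped_json is a made-up helper name and the sample record is shortened.

```python
import json

def parse_escaped_json(raw_line):
    """Decode literal \\xNN escapes in a line of text, then parse it as JSON."""
    text = raw_line.strip()
    # Drop the outer pair of quotes that wraps the whole record.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    # unicode_escape turns the six characters \x0A into a real newline, etc.
    decoded = text.encode("ascii").decode("unicode_escape")
    return json.loads(decoded)

# Raw string literal, so the backslashes stay in the text as they would in a file.
raw = (r'"{\x0A \x22id\x22: {\x0A \x22type\x22: \x22guest\x22\x0A },'
       r'\x0A \x22stats\x22: [\x0A {\x0A \x22action_id\x22: \x22open_app\x22\x0A }\x0A ]\x0A}"')

record = parse_escaped_json(raw)
print(record["id"]["type"])  # guest
```

After this step the record is an ordinary dict, so fields can be read by key instead of being cleaned with regular expressions.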

Invalid JSON file error while importing JSON in Firebase

I'm trying to import a JSON file (titled 'filename.json') into my Firebase database using 'Import JSON' under 'Database.'
However, I am getting an Invalid JSON file error.
The following is the structure of the JSON that I wish to import. Can you please help me see where I am going wrong?
{
"checklist": "XXX",
"notes": ""
}
{ "checklist": "XXX",
"notes": ""
}
{
"checklist": "XXX",
"notes": ""
}
{
"checklist": "XXX",
"notes": ""
}
Your objects need commas between them. Basically, any line where you've got an } here (except for the last one), throw a comma after it. Then wrap the whole thing in a [] so it's a valid json array.
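For a large file, adding the commas by hand is tedious, so the same repair can be scripted. The following is a rough sketch in Python that reads back-to-back JSON objects and collects them into a list, which json.dumps then writes as a valid array. It relies on json.JSONDecoder.raw_decode from the standard library; the function name concatenated_objects_to_list is illustrative only.

```python
import json

def concatenated_objects_to_list(text):
    """Parse JSON objects that sit back to back with no commas between them."""
    decoder = json.JSONDecoder()
    items = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        items.append(obj)
    return items

broken = """
{
"checklist": "XXX",
"notes": ""
}
{ "checklist": "XXX",
"notes": ""
}
"""

items = concatenated_objects_to_list(broken)
fixed = json.dumps(items, indent=2)  # a single valid JSON array
print(fixed)
```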

Extracting data from a JSON file

I have a large JSON file that looks similar to the code below. Is there any way I can iterate through each object, look for the field "element_type" (it is not present in all objects in the file, if that matters), and write each object with the same element type to a file? For example, each user would end up in a file called user.json and each book in a file called book.json.
I thought about using JavaScript, but to my knowledge JS can't write to files. I also tried to do it with Linux command-line tools by removing all newlines, inserting a newline after each "},", and then iterating through each line to find the element type and write it to a file. This worked for most of the data; however, where there were objects like the "problem_type" below, it inserted a newline in the middle of the data because of the nested JSON in the "times" element. I've run out of ideas at this point.
{
"data": [
{
"element_type": "user",
"first": "John",
"last": "Doe"
},
{
"element_type": "user",
"first": "Lucy",
"last": "Ball"
},
{
"element_type": "book",
"name": "someBook",
"barcode": "111111"
},
{
"element_type": "book",
"name": "bookTwo",
"barcode": "111111"
},
{
"element_type": "problem_type",
"name": "problem object",
"times": "[{\"start\": \"1230\", \"end\": \"1345\", \"day\": \"T\"}, {\"start\": \"1230\", \"end\": \"1345\", \"day\": \"R\"}]"
}
]
}
I would recommend Java for this purpose. It sounds like you're running on Linux so it should be a good fit.
You'll have no problems writing to files, and you can use a library like this - http://json-lib.sourceforge.net/ - to gain access to things like JSONArray and JSONObject. You can easily use these to iterate through the data in your JSON, check what's in "element_type", and write to a file accordingly.
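The same iterate-check-write idea can be sketched with a real JSON parser in any language. The version below uses Python instead of the Java library suggested above, purely as an illustration. It assumes the file fits in memory and that the objects live under a top-level "data" key as in the question; group_by_element_type and write_groups are made-up names, and the sample data is shortened.

```python
import json
import os
import tempfile
from collections import defaultdict

def group_by_element_type(document):
    """Group the objects under "data" by their "element_type" field."""
    groups = defaultdict(list)
    for obj in document["data"]:
        element_type = obj.get("element_type")
        if element_type is not None:  # skip objects without the field
            groups[element_type].append(obj)
    return dict(groups)

def write_groups(groups, out_dir):
    """Write each group to <element_type>.json and return the file paths."""
    paths = {}
    for element_type, objects in groups.items():
        path = os.path.join(out_dir, element_type + ".json")
        with open(path, "w") as f:
            json.dump(objects, f, indent=2)
        paths[element_type] = path
    return paths

sample = {
    "data": [
        {"element_type": "user", "first": "John", "last": "Doe"},
        {"element_type": "user", "first": "Lucy", "last": "Ball"},
        {"element_type": "book", "name": "someBook", "barcode": "111111"},
        {"element_type": "problem_type", "name": "problem object",
         "times": "[{\"start\": \"1230\", \"end\": \"1345\", \"day\": \"T\"}]"},
        {"name": "no element type here"},
    ]
}

groups = group_by_element_type(sample)
out_dir = tempfile.mkdtemp()  # stand-in for a real output directory
paths = write_groups(groups, out_dir)
```

Because the parser treats the "times" value as an ordinary string, the nested JSON inside it is never split, which is the case that broke the newline-based approach.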