Read JSON from a multi-line file in Spark Scala - json

I'm learning spark in Scala. I have a JSON file as follows:
[
    {
        "name": "ali",
        "age": "13",
        "phone": "09123455737",
        "sex": "m"
    },
    {
        "name": "amir",
        "age": "24",
        "phone": "09123475737",
        "sex": "m"
    }
]
and I have just this code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonFile = sqlContext.read.json("path-to-json-file")
I just receive a corrupted_row: String column and nothing else,
but when I put every person (object) on a single line, the code works fine.
How can I read multi-line JSON into a SQLContext in Spark?

You will have to read it into an RDD yourself and then convert it to a Dataset:
spark.read.json(sparkContext.wholeTextFiles(...).values)
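For reference, here is a minimal PySpark sketch of the same approach (not from the original answer; the file name is hypothetical). wholeTextFiles yields (path, content) pairs, and read.json accepts an RDD of JSON strings, splitting a top-level array into one row per element:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each element of the RDD is the full text of one file.
raw = spark.sparkContext.wholeTextFiles("people.json").values()
df = spark.read.json(raw)
df.show()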

This problem is caused by your multi-line JSON rows. By default, spark.read.json expects each row to be on a single line, but this is configurable via the multiLine option (note that options must be set before the json() call):
spark.read.option("multiLine", true).json("path-to-json-file")
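For reference, a hypothetical PySpark equivalent (the multiLine option is available since Spark 2.2):
# Read a file whose JSON records span multiple lines
# (assumes an active SparkSession named `spark`).
df = spark.read.option("multiLine", True).json("path-to-json-file")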

Related

Multiple objects of an array create different columns in the CSV file

Here is my JSON example. When I convert the JSON to a CSV file, it creates different columns for each object of the reviews array, with column names like serial, name.0, rating.0, _id.0, name.1, rating.1, _id.1. How can I convert it to a CSV file where only serial, name, rating, _id are the column names and every object of reviews is put on a different row?
[{
    "serial": "63708940a8d291c502be815f",
    "reviews": [
        {
            "name": "shadman",
            "rating": 4,
            "_id": "6373d4eb50cff661989f3d83"
        },
        {
            "name": "niloy1",
            "rating": 3,
            "_id": "6373d59450cff661989f3db8"
        },
    ],
}]
I am trying to produce the CSV file for use with pandas. If that is not possible, is there any way to solve the problem using the pandas package in Python?
I suggest you use pandas for the CSV export only and process the JSON data by flattening the data structure first, so that the result can then be easily loaded into a pandas DataFrame.
Try:
from collections import defaultdict
from pprint import pprint
import pandas as pd

data_python = [{
    "serial": "63708940a8d291c502be815f",
    "reviews": [
        {
            "name": "shadman",
            "rating": 4,
            "_id": "6373d4eb50cff661989f3d83"
        },
        {
            "name": "niloy1",
            "rating": 3,
            "_id": "6373d59450cff661989f3db8"
        },
    ],
}]

# Flatten: one entry per review, repeating the parent "serial" value.
dct_flat = defaultdict(list)
for dct in data_python:
    for dct_reviews in dct["reviews"]:
        dct_flat['serial'].append(dct['serial'])
        for key, value in dct_reviews.items():
            dct_flat[key].append(value)
#pprint(data_python)
#pprint(dct_flat)
df = pd.DataFrame(dct_flat)
print(df)
df.to_csv("data.csv")
which gives:
                     serial     name  rating                       _id
0  63708940a8d291c502be815f  shadman       4  6373d4eb50cff661989f3d83
1  63708940a8d291c502be815f   niloy1       3  6373d59450cff661989f3db8
and
,serial,name,rating,_id
0,63708940a8d291c502be815f,shadman,4,6373d4eb50cff661989f3d83
1,63708940a8d291c502be815f,niloy1,3,6373d59450cff661989f3db8
as CSV file content.
Notice that the JSON you provided in your question can't be loaded from a file or a string in Python, neither with the json module nor with pandas, because it is not valid JSON (trailing commas are not allowed). See below for corrected, valid JSON data:
valid_json_data = '''\
[{
    "serial": "63708940a8d291c502be815f",
    "reviews": [
        {
            "name": "shadman",
            "rating": 4,
            "_id": "6373d4eb50cff661989f3d83"
        },
        {
            "name": "niloy1",
            "rating": 3,
            "_id": "6373d59450cff661989f3db8"
        }
    ]
}]
'''
and code for loading this data from a JSON file:
import json

json_file = "data.json"
with open(json_file) as f:
    data_json = f.read()
data_python = json.loads(data_json)
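As an aside, pandas ships a helper that performs this flattening directly. A minimal sketch using pd.json_normalize, assuming the corrected data_python list from above:
import pandas as pd

# Each element of "reviews" becomes its own row; the parent "serial"
# value is carried along as an extra column.
df = pd.json_normalize(data_python, record_path="reviews", meta=["serial"])
df.to_csv("data.csv", index=False)
The meta column ends up last, so the column order differs slightly from the defaultdict version, but the rows match.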

Need to relationalise a nested JSON string using Pyspark

I am new to Pyspark and need guidance to perform the following task.
A sample of the data, in the form of a JSON string, is given:
{
    "id": "1234",
    "location": "znd",
    "contact": "{\"phone\": [{\"number\":\"12345\",\"code\":\"111\",\"altno\":\"No\"},{\"number\":\"55656\",\"code\":\"222\",\"altno\":\"Yes\"}]}"
}
This needs to be relationalised as follows; as seen below, one row of input translates to 2 rows:
{"id": "1234", "location": "znd", "number": "12345", "code": "111", "altno": "No"}
{"id": "1234", "location": "znd", "number": "55656", "code": "222", "altno": "Yes"}
I have tried to use the explode function, but as this is a JSON string, explode does not work on it.
I have read the data into a DataFrame and tried to enforce a struct type in order to use explode later, but that does not work either.
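A possible approach (a sketch, not an answer from the original thread): parse the embedded JSON string with from_json and an explicit schema, then explode the resulting array. The DataFrame construction below just reproduces the sample row:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Reproduce the sample input: `contact` is a JSON string, not a struct.
df = spark.createDataFrame(
    [("1234", "znd",
      '{"phone": [{"number":"12345","code":"111","altno":"No"},'
      '{"number":"55656","code":"222","altno":"Yes"}]}')],
    ["id", "location", "contact"])

# Schema of the JSON embedded in the `contact` column.
contact_schema = StructType([
    StructField("phone", ArrayType(StructType([
        StructField("number", StringType()),
        StructField("code", StringType()),
        StructField("altno", StringType()),
    ])))
])

# Parse the string into a struct, then produce one row per phone entry.
result = (df
          .withColumn("contact", from_json(col("contact"), contact_schema))
          .withColumn("phone", explode(col("contact.phone")))
          .select("id", "location", "phone.number", "phone.code", "phone.altno"))
result.show()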

Parsing and cleaning text file in Python?

I have a text file which contains raw data. I want to parse that data and clean it so that it can be used further. The following is the raw data:
"{\x0A \x22identifier\x22: {\x0A \x22company_code\x22: \x22TSC\x22,\x0A \x22product_type\x22: \x22airtime-ctg\x22,\x0A \x22host_type\x22: \x22android\x22\x0A },\x0A \x22id\x22: {\x0A \x22type\x22: \x22guest\x22,\x0A \x22group\x22: \x22guest\x22,\x0A \x22uuid\x22: \x221a0d4d6e-0c00-11e7-a16f-0242ac110002\x22,\x0A \x22device_id\x22: \x22423e49efa4b8b013\x22\x0A },\x0A \x22stats\x22: [\x0A {\x0A \x22timestamp\x22: \x222017-03-22T03:21:11+0000\x22,\x0A \x22software_id\x22: \x22A-ACTG\x22,\x0A \x22action_id\x22: \x22open_app\x22,\x0A \x22values\x22: {\x0A \x22device_id\x22: \x22423e49efa4b8b013\x22,\x0A \x22language\x22: \x22en\x22\x0A }\x0A }\x0A ]\x0A}"
I want to remove all the hexadecimal escape characters. I tried parsing the data, storing it in an array, and cleaning it with re.sub(), but it gives back the same data:
for line in f:
    new_data = re.sub(r'[^\x00-\x7f],\x22', r'', line)
    data.append(new_data)
\x0A is the hex code for newline. After s = <your json string>, print(s) gives
>>> print(s)
{
    "identifier": {
        "company_code": "TSC",
        "product_type": "airtime-ctg",
        "host_type": "android"
    },
    "id": {
        "type": "guest",
        "group": "guest",
        "uuid": "1a0d4d6e-0c00-11e7-a16f-0242ac110002",
        "device_id": "423e49efa4b8b013"
    },
    "stats": [
        {
            "timestamp": "2017-03-22T03:21:11+0000",
            "software_id": "A-ACTG",
            "action_id": "open_app",
            "values": {
                "device_id": "423e49efa4b8b013",
                "language": "en"
            }
        }
    ]
}
You should parse this with the json module's load (from file) or loads (from string) function. You will get a dict containing two dicts and a list containing one dict.
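A minimal sketch of that, assuming the raw file literally contains the backslash escapes shown above (the file name is hypothetical):
import json

with open("rawdata.txt") as f:
    raw = f.read().strip().strip('"')  # drop the surrounding quotes

# Resolve escapes such as \x0A and \x22 into real characters,
# then parse the result as JSON.
s = raw.encode().decode("unicode_escape")
data = json.loads(s)
print(data["identifier"]["company_code"])  # -> TSC
print(data["stats"][0]["action_id"])       # -> open_app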

I can't get data from my JSON file which contains multiple small JSON objects

My problem is that I have a JSON file of small JSON objects, created with Node.js.
I couldn't consume the JSON from that link, and when I tested my JSON file on websites like JSON Formatter I got this error: Multiple JSON root elements.
When I put only one JSON object into the formatter it is valid, but with two objects, as in this example, it is not.
This is my example file with 2 JSON objects:
{"#timestamp":"2017-06-11T00:28:24.112Z","type_instance":"interrupt","plugin":"cpu","logdate":"2017-06-11T00:28:24.112Z","host":"node-2","#version":"1","collectd_type":"percent","value":0}
{"#timestamp":"2017-06-11T00:28:24.112Z","type_instance":"softirq","plugin":"cpu","logdate":"2017-06-11T00:28:24.112Z","host":"node-2","#version":"1","collectd_type":"percent","value":0}
This is not valid JSON format; a JSON document must have a single root, either an object or an array:
[
    {
        "#timestamp": "2017-06-11T00:28:24.112Z",
        "type_instance": "interrupt",
        "plugin": "cpu",
        "logdate": "2017-06-11T00:28:24.112Z",
        "host": "node-2",
        "#version": "1",
        "collectd_type": "percent",
        "value": 0
    },
    {
        "#timestamp": "2017-06-11T00:28:24.112Z",
        "type_instance": "softirq",
        "plugin": "cpu",
        "logdate": "2017-06-11T00:28:24.112Z",
        "host": "node-2",
        "#version": "1",
        "collectd_type": "percent",
        "value": 0
    }
]
If you have the file contents as a string, then split the lines and JSON.parse them one by one:
const data =`{"a": 1}
{"b": 2}`;
const lines = data.split('\n')
const objects = lines.map(JSON.parse);
console.log(objects);

Formatting JSON files for SQLContext

I'm experiencing issues when loading JSON that depend on the formatting of the input JSON file.
According to the Spark documentation on JSON Datasets, each line of the input file must be a valid JSON object:
"Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail."
So, if I have an input JSON file such as:
{
"Year": "2013",
"First Name": "DAVID",
"County": "KINGS",
"Sex": "M",
"Count": "272"
},
{
"Year": "2013",
"First Name": "JAYDEN",
"County": "KINGS",
"Sex": "M",
"Count": "268"
}
Are there any existing tools or scripts to convert to:
{"Year": "2013","First Name": "DAVID","County": "KINGS","Sex": "M","Count":"272"},
{"Year": "2013","First Name": "JAYDEN","County": "KINGS","Sex": "M","Count": "268"}
where the JSON conforms to "Each line must contain a separate, self-contained valid JSON object"
If I format to the style above, things work as expected. But I made these modifications manually on a few rows, and I cannot do this for the entire data set, so I am looking for an existing script or tool.
Or I could load the data into a JDBC-accessible database if that's a better option. Thoughts?
Thanks in advance
You can simply load the JSON files into an RDD first using sc.wholeTextFiles(), drop the file name column, and then run the SQLContext read on the RDD contents, e.g.
val jsonRdd = sc.wholeTextFiles("samplefile.json").map(x => x._2)
val jsonDf = sqlContext.read.json(jsonRdd)
What if you make it an array by adding square brackets, like this:
[
    {
        "Year": "2013",
        "FName": "DAVID",
        "County": "KINGS",
        "Sex": "M",
        "Count": "272"
    },
    {
        "Year": "2013",
        "FName": "JAYDEN",
        "County": "KINGS",
        "Sex": "M",
        "Count": "268"
    }
]
If I take your file and add the brackets, I can iterate through it with Node.js and output a file that looks like what you want. The caveat in Node.js is that I can't have the variable First Name, so I had to change it to FName.
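For the "existing script or tool" part of the question, here is a minimal Python sketch (an addition, not from either answer above) that converts the bracket-wrapped array into the one-object-per-line form Spark expects; the file names are hypothetical:
import json

# Load the bracket-wrapped JSON array ...
with open("samplefile.json") as f:
    records = json.load(f)

# ... and write one compact, self-contained JSON object per line.
with open("samplefile.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")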