I want to read a JSON or XML file in PySpark. My file is split across multiple lines when I use sc.textFile(json or xml).
Input
{
  "employees": [
    {
      "firstName": "John",
      "lastName": "Doe"
    },
    {
      "firstName": "Anna",
      "lastName": "Smith"
    },
    {
      "firstName": "Peter",
      "lastName": "Jones"
    }
  ]
}
It's on multiple lines.
Output
{"employees:[{"firstName:"John",......]}
Every think in one string or one line..
In pyspark
Please help me I am new to spark
If you have access to the parsed dictionary (I'm not familiar with PySpark, but superficially it seems you do), you can use the standard JSON library to "pretty print" it:
>>> import json
>>> my_dict = {'4': 5, '6': 7}
>>> print json.dumps(my_dict, sort_keys=True,
... indent=4, separators=(',', ': '))
{
    "4": 5,
    "6": 7
}
https://docs.python.org/2/library/json.html
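For the single-line output the original question actually asks for, json.dumps without indent emits compact JSON. A minimal PySpark sketch, assuming the file is small enough for sc.wholeTextFiles (which reads each file as a single record, so the multi-line JSON stays intact; the path is hypothetical):
import json

# Each element is a (path, full_file_content) pair
rdd = sc.wholeTextFiles("employees.json")

# Parse each file, then re-serialize it compactly on one line
one_line = rdd.map(lambda kv: json.dumps(json.loads(kv[1]), separators=(",", ":")))
print(one_line.first())
# {"employees":[{"firstName":"John","lastName":"Doe"},...]}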
Related
Data sample:
import pandas as pd
patients_df = pd.read_json('C:/MyWorks/Python/Anal/data_sample.json', orient="records", lines=True)
patients_df.head()
In Python. My JSON data sample:
"data1": {
"id": "myid",
"seatbid": [
{
"bid": [
{
"id": "myid",
"impid": "1",
"price": 0.46328014,
"adm": "adminfo",
"adomain": [
"domain.com"
],
"iurl": "url.com",
"cid": "111",
"crid": "1111",
"cat": [
"CAT-0101"
],
"w": 00,
"h": 00
}
],
"seat": "27"
}
],
"cur": "USD"
},
What I want to do is check whether a "cat" value exists in my very large JSON data.
The "cat" value may or may not exist, and I'm trying to use Python pandas to check for it.
for seatbid in patients_df["win_res"]:
    for bid in seatbid["seatbid"]:
I tried to access the JSON data with a loop like this, but it is not accessed properly.
I simply want to check whether "cat" exists or not.
You can use Python's json library as follows:
import json

patient_data = json.loads(patient_json)  # patient_json is the raw JSON string
if "cat" in patient_data:
    print("Key exists in JSON data")
else:
    print("Key doesn't exist in JSON data")
I'm using Spark 2.4.3 and Scala 2.11.
Below is my current JSON string in a DataFrame column.
I'm trying to store the schema of this JSON string in another column using the schema_of_json function.
But it's throwing the error below. How can I resolve this?
{
  "company": {
    "companyId": "123",
    "companyName": "ABC"
  },
  "customer": {
    "customerDetails": {
      "customerId": "CUST-100",
      "customerName": "CUST-AAA",
      "status": "ACTIVE",
      "phone": {
        "phoneDetails": {
          "home": {
            "phoneno": "666-777-9999"
          },
          "mobile": {
            "phoneno": "333-444-5555"
          }
        }
      }
    },
    "address": {
      "loc": "NORTH",
      "adressDetails": [
        {
          "street": "BBB",
          "city": "YYYYY",
          "province": "AB",
          "country": "US"
        },
        {
          "street": "UUU",
          "city": "GGGGG",
          "province": "NB",
          "country": "US"
        }
      ]
    }
  }
}
Code:
val df = spark.read.textFile("./src/main/resources/json/company.txt")
df.printSchema()
df.show()
root
|-- value: string (nullable = true)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"company":{"companyId":"123","companyName":"ABC"},"customer":{"customerDetails":{"customerId":"CUST-100","customerName":"CUST-AAA","status":"ACTIVE","phone":{"phoneDetails":{"home":{"phoneno":"666-777-9999"},"mobile":{"phoneno":"333-444-5555"}}}},"address":{"loc":"NORTH","adressDetails":[{"street":"BBB","city":"YYYYY","province":"AB","country":"US"},{"street":"UUU","city":"GGGGG","province":"NB","country":"US"}]}}}|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df.withColumn("jsonSchema",schema_of_json(col("value")))
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'schemaofjson(`value`)' due to data type mismatch: The input json should be a string literal and not null; however, got `value`.;;
'Project [value#0, schemaofjson(value#0) AS jsonSchema#10]
+- Project [value#0]
+- Relation[value#0] text
The workaround I found was to pass the column value to the schema_of_json function as shown below.
df.withColumn("jsonSchema",schema_of_json(df.select(col("value")).first.getString(0)))
Courtesy:
Implicit schema discovery on a JSON-formatted Spark DataFrame column
Since SPARK-24709 was introduced, schema_of_json accepts only literal strings. You can extract the schema of the String column in DDL format by calling:
spark.read
  .json(df.select("value").as[String])
  .schema
  .toDDL
If one is looking for a PySpark answer:
import pyspark.sql.functions as F
import pyspark.sql.types as T
import json

def process(json_content):
    if json_content is None:
        return []
    try:
        # Parse the content of the JSON, extract the keys only
        keys = json.loads(json_content).keys()
        return list(keys)
    except Exception as e:
        return [str(e)]

udf_function = F.udf(process, T.ArrayType(T.StringType()))
my_df = my_df.withColumn("schema", udf_function(F.col("json_raw")))
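This UDF only lists the top-level keys. If the full Spark schema is wanted (as in the Scala answer above), the same inference trick works from PySpark; a sketch, assuming df has a string column named value holding the raw JSON:
import pyspark.sql.functions as F

# Re-read the JSON strings through the DataFrameReader so Spark infers the schema
inferred_schema = spark.read.json(df.rdd.map(lambda row: row["value"])).schema
print(inferred_schema.simpleString())

# Parse the column into a proper struct using the inferred schema
parsed_df = df.withColumn("parsed", F.from_json(F.col("value"), inferred_schema))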
I want to parse a JSON file in Spark 2.0 (Scala). Next I want to save the data in a Hive table.
How can I parse the JSON file using Scala?
JSON file example (metadata.json):
{
  "syslog": {
    "month": "Sep",
    "day": "26",
    "time": "23:03:44",
    "host": "cdpcapital.onmicrosoft.com"
  },
  "prefix": {
    "cef_version": "CEF:0",
    "device_vendor": "Microsoft",
    "device_product": "SharePoint Online"
  },
  "extensions": {
    "eventId": "7808891",
    "msg": "ManagedSyncClientAllowed",
    "art": "1506467022378",
    "cat": "SharePoint",
    "act": "ManagedSyncClientAllowed",
    "rt": "1506466717000",
    "requestClientApplication": "Microsoft SkyDriveSync",
    "cs1": "0bdbe027-8f50-4ec3-843f-e27c41a63957",
    "cs1Label": "Organization ID",
    "cs2Label": "Modified Properties",
    "ahost": "cdpdiclog101.cgimss.com",
    "agentZoneURI": "/All Zones",
    "amac": "F0-1F-AF-DA-8F-1B",
    "av": "7.6.0.8009.0"
  }
}
Thanks
You can use something like:
val jsonDf = sparkSession
  .read
  // .option("multiLine", true) if it is not single-line JSON (option available since Spark 2.2)
  .json("resources/json/metadata.json")

jsonDf.printSchema()
jsonDf.registerTempTable("metadata")
More details here: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
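registerTempTable only creates a session-scoped view. To persist the data into an actual Hive table, the write API can be used; a PySpark sketch for those following along in Python, assuming a Hive-enabled session and a hypothetical table name:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-to-hive")
         .enableHiveSupport()  # required so saveAsTable writes to the Hive metastore
         .getOrCreate())

json_df = spark.read.option("multiLine", True).json("resources/json/metadata.json")
json_df.write.mode("overwrite").saveAsTable("metadata")  # hypothetical table name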
I am writing a Python script to extract information from a JSON file. I am printing the title of the book whose author's last name is Marcus. I get the output, but it also raises AttributeError: 'str' object has no attribute 'items'.
import json
from pprint import pprint

with open('bibliography.json.txt', encoding='utf-8') as data_file:
    data = json.load(data_file)

for entry in data['bibliography']['biblioentry']:
    for authors in entry['author']:
        for key, val in authors.items():
            if key == 'lastname' and val == 'Marcus':
                title = entry['title']
                print(title)
The JSON file looks like this:
{
  "bibliography": {
    "biblioentry": [
      {
        "-type": "Journal Article",
        "title": "A brief survey of web data extraction tools",
        "author": [
          {
            "firstname": "Alberto",
            "middlename": "HF",
            "lastname": "Laender"
          },
          {
            "firstname": "Berthier",
            "middlename": "A",
            "lastname": "Ribeiro-Neto"
          },
          {
            "firstname": "Altigran",
            "middlename": "S",
            "lastname": "da Silva"
          },
          {
            "firstname": "Juliana",
            "middlename": "S",
            "lastname": "Teixeira"
          }
        ],
        "details": {
          "journalname": "ACM Sigmod Record",
          "volume": "31",
          "number": "2",
          "pages": "84-93"
        },
        "year": "2002",
        "publisher": "ACM"
      },......
I think it is because part of the JSON is being interpreted as a string. You might want to see this if it helps you:
Extract data from JSON API using Python
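Concretely, this AttributeError usually means some entries store "author" as a single dict rather than a list, so the inner loop iterates over the dict's keys (plain strings). A sketch that normalizes both shapes, under that assumption:
import json

with open('bibliography.json.txt', encoding='utf-8') as data_file:
    data = json.load(data_file)

for entry in data['bibliography']['biblioentry']:
    authors = entry['author']
    if isinstance(authors, dict):
        # A single author arrives as a dict, not a list of dicts
        authors = [authors]
    for author in authors:
        if author.get('lastname') == 'Marcus':
            print(entry['title'])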
Below is my JSON format:
{"copybook": {
"item": {
"storage-length": 1652,
"item": [
{
"storage-length": 40,
"level": "05",
"name": "OBJECT-NAME",
"display-length": 40,
"position": 1,
"picture": "X(40)"
},
{
"storage-length": 8,
"occurs-min": 0,
"level": "05",
"name": "C-TCRMANKEYBOBJ-OFFSET",
"numeric": true,
"display-length": 8,
"position": 861,
"occurs": 99,
"depending-on": "C-TCRMANKEYBOBJ-COUNT",
"picture": "9(8)"
}
],
"level": "01",
"name": "TCRMCONTRACTBOBJ",
"display-length": 1652,
"position": 1
},
"filename": "test.cbl"
}}
How can I parse this JSON and convert it to CSV format? I am using Scala's default JSON parser. The main problem I am facing is that I cannot use a case class to extract the data, because the field names are not the same across the item array.
This format is OK for me; follow this link and paste the JSON to see it: https://konklone.io/json/. Any Scala code is appreciated. I am getting the data below:
implicit val formats = DefaultFormats

val json2 = parse(jsonString, false) \\ "item"
val list = json2.values.asInstanceOf[List[Map[String, String]]]
for (obj <- list) {
  // println(obj.keys)
  // obj.values
  println(obj.toList.mkString(","))
}
(name,OBJECT-NAME),(storage-length,40),(picture,X(40)),(position,1),(display-length,40),(level,05)
(name,C-TCRMANKEYBOBJ-OFFSET),(storage-length,8),(occurs-min,0),(occurs,99),(picture,9(8)),(position,861),(numeric,true),(depending-on,C-TCRMANKEYBOBJ-COUNT),(display-length,8),(level,05)
https://circe.github.io/circe/ may work better for you in terms of traversal. Give it a try and a read.
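If a Python step is acceptable for the conversion itself, pandas handles the uneven field names directly: json_normalize takes the union of all keys as columns and leaves blanks where a record lacks one. A sketch, assuming the JSON above is saved under a hypothetical copybook.json:
import json
import pandas as pd

with open("copybook.json") as f:  # hypothetical filename
    doc = json.load(f)

# The variable-shaped records live in the inner "item" array
items = doc["copybook"]["item"]["item"]

df = pd.json_normalize(items)  # union of keys -> columns, missing keys -> NaN
df.to_csv("copybook.csv", index=False)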