I am new to Spark and Scala. I am trying to parse a nested JSON column from a Spark table. Here is a sneak peek of the table (I only show the first row; the rest all look identical):
doc.show(1)
doc_content object_id object_version
{"id":"lni001","pub_date".... 20220301 7098727
The structure of the "doc_content" column of each row looks like this (Some rows may have more information store inside the 'content' field):
{
"id":"lni001",
"pub_date":"20220301",
"doc_id":"7098727",
"unique_id":"64WP-UI-POLI",
"content":[
{
"c_id":"002",
"p_id":"P02",
"type":"org",
"source":"internet"
},
{
"c_id":"003",
"p_id":"P03",
"type":"org",
"source":"internet"
},
{
"c_id":"005",
"p_id":"K01",
"type":"people",
"source":"news"
}
]
}
I tried to use explode on the "doc_content" column
doc.select(explode($"doc_content") as "doc_content")
.withColumn("id", col("doc_info.id"))
.withColumn("pub_date", col("doc_info.pub_date"))
.withColumn("doc_id", col("doc_info.doc_id"))
.withColumn("unique_id", col("doc_info.unique_id"))
.withColumn("content", col("doc_info.content"))
.withColumn("content", explode($"content"))
.withColumn("c_id", col("content.c_id"))
.withColumn("p_id", col("content.p_id"))
.withColumn("type", col("content.type"))
.withColumn("source", col("content.source"))
.drop(col("doc_content"))
.drop(col("content"))
.show()
but I got this error: org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`doc_content`)' due to data type mismatch: input to function explode should be array or map type, not string;. I am struggling to convert the column into an Array or Map type (probably because I'm new to Scala LOL).
After parsing the "doc_content" column, I want the table look like this.
id pub_date doc_id unique_id c_id p_id type source object_id object_version
lni001 20220301 7098727 64WP-UI-POLI 002 P02 org internet 20220301 7098727
lni001 20220301 7098727 64WP-UI-POLI 003 P03 org internet 20220301 7098727
lni001 20220301 7098727 64WP-UI-POLI 005 K01 people news 20220301 7098727
I am wondering how I can do this; it would be great to get some ideas or approaches. Or maybe a better way than my approach, since I have millions of rows in the Spark table and would like it to run faster.
Thanks!
You can use from_json to parse the JSON string into a structured column, then use explode on an array column to create new rows, which means you should explode doc_content.content rather than doc_content.
Specify the schema used to parse the JSON string:
import org.apache.spark.sql.types._
val schema = new StructType()
.add("id", StringType)
.add("pub_date", StringType)
.add("doc_id", StringType)
.add("unique_id", StringType)
.add("content", ArrayType(MapType(StringType, StringType)))
Then parse the JSON string and explode it:
// imports needed by the snippet (the $ syntax assumes a SparkSession named `spark`)
import org.apache.spark.sql.functions.{col, explode, from_json}
import spark.implicits._

df.select(
$"object_id",
$"object_version",
from_json($"doc_content", schema).alias("doc_content")
).select(
$"object_id",
$"object_version",
col("doc_content.id").alias("id"),
col("doc_content.pub_date").alias("pub_date"),
col("doc_content.doc_id").alias("doc_id"),
col("doc_content.unique_id").alias("unique_id"),
explode(col("doc_content.content")).alias("content")
).select(
$"id",
$"pub_date",
$"doc_id",
$"unique_id",
col("content.c_id").alias("c_id"),
col("content.p_id").alias("p_id"),
col("content.type").alias("type"),
col("content.source").alias("source"),
$"object_id",
$"object_version"
)
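If you'd rather not list every struct field by hand, the parsed struct can also be expanded with the * accessor. The following is only a minimal sketch along the same lines, assuming the same df, column names and schema as above:
// Expand the parsed struct with doc_content.*, then explode the content array
val result = df
  .withColumn("doc_content", from_json(col("doc_content"), schema))
  .select(col("object_id"), col("object_version"), col("doc_content.*"))
  .withColumn("content", explode(col("content")))
  .select(
    col("id"), col("pub_date"), col("doc_id"), col("unique_id"),
    col("content.c_id").alias("c_id"), col("content.p_id").alias("p_id"),
    col("content.type").alias("type"), col("content.source").alias("source"),
    col("object_id"), col("object_version")
  )
result.show()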
I'm very new to Hadoop,
and I'm using Spark with Java.
I have dynamic JSON, for example:
{
"sourceCode":"1234",
"uuid":"df123-....",
"title":"my title"
}{
"myMetaDataEvent": {
"date":"10/10/2010",
},
"myDataEvent": {
"field1": {
"field1Format":"fieldFormat",
"type":"Text",
"value":"field text"
}
}
}
Sometimes I can see only field1 and sometimes I can see field1...field50
And the user may also add or remove fields from this JSON.
I want to insert this dynamic JSON into Hadoop (into a Hive table) from Spark Java code. How can I do it?
I want the user to then be able to run a Hive query, e.g.: select * from MyTable where type="Text"
I have around 100B JSON records per day that I need to insert into Hadoop, so what is the recommended way to do that?
*I've looked at the following: SO Question but that assumes a known JSON schema, which isn't my case.
Thanks
I encountered a somewhat similar problem and was able to resolve it using the approach below (so this might help if you create the schema before you parse the JSON).
For a field with a string data type you can create the schema field like this:
StructField field = DataTypes.createStructField(<name of the field>, DataTypes.StringType, true);
For a field with an integer data type you can create the schema field like this:
StructField field = DataTypes.createStructField(<name of the field>, DataTypes.IntegerType, true);
After you have added all the fields to a List<StructField>, e.g.:
List<StructField> innerField = new ArrayList<StructField>();
// ... field adding logic ...
innerField.add(field1);
innerField.add(field2);
// A value may come as a single instance or as multiple instances in an array; in the array case it needs to be wrapped in an ArrayType.
ArrayType getArrayInnerType = DataTypes.createArrayType(DataTypes.createStructType(innerField));
StructField getArrayField = DataTypes.createStructField(<name of field>, getArrayInnerType,true);
You can then create the schema:
StructType structuredSchema = DataTypes.createStructType(new StructField[]{getArrayField});
Then I read the JSON using the generated schema via the Dataset API:
Dataset<Row> dataRead = sqlContext.read().schema(structuredSchema).json(fileName);
How can I create a DataFrame from JSON data where two fields have the same key but differ only in capitalization? For example,
{"abc1":"some-value", "ABC1":"some-other-value", "abc":"some-value"}
{"abc1":"some-value", "ABC1":"some-other-value", "abc":"some-value1"}
At present, I am getting the following error:
org.apache.spark.sql.AnalysisException: Reference 'ABC1' is ambiguous.
This is how I created the DataFrame:
val df = sqlContext.read.json(inputPath)
I also tried creating an RDD first, where I read every line and changed the key name in the JSON string, and then converted the RDD to a DataFrame. That approach was very slow.
I tried multiple ways, but the same problem remains.
I tried renaming the columns, but the error is still there:
val modDf = df
.withColumnRenamed("MCC", "MCC_CAP")
.withColumnRenamed("MNC", "MNC_CAP")
.withColumnRenamed("MCCMNC", "MCCMNC_CAP")
I also created another DataFrame with renamed columns:
val cols = df.columns.map(line => if (line.startsWith("M"))line.concat("_cap") else line)
val smallDf = df.toDF(cols: _*)
And I tried dropping the duplicate columns:
val capCols = df.columns.filter(line => line.startsWith("M"))
val smallDf = df.drop(capCols: _*)
You can read the file as a text RDD using sparkContext and then use sqlContext to read the RDD as JSON:
sqlContext.read.json(sc.textFile("path to your json file"))
You should then have your DataFrame as:
+----------------+-----------+----------+
|ABC1 |abc |abc1 |
+----------------+-----------+----------+
|some-other-value|some-value |some-value|
|some-other-value|some-value1|some-value|
+----------------+-----------+----------+
The generated DataFrame would still have a defect, though, as duplicate column names (compared case-insensitively) are not allowed in Spark DataFrames.
So I would recommend changing the duplicate column names before you convert the JSON to a DataFrame:
val rdd = sc.textFile("path to your json file").map(jsonLine => jsonLine.replace("\"ABC1\":", "\"AAA\":"))
sqlContext.read.json(rdd)
Now you should have the DataFrame as:
+----------------+-----------+----------+
|AAA |abc |abc1 |
+----------------+-----------+----------+
|some-other-value|some-value |some-value|
|some-other-value|some-value1|some-value|
+----------------+-----------+----------+
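Depending on your Spark version, another option you could experiment with is enabling case-sensitive analysis before reading the JSON, so that ABC1 and abc1 resolve as distinct columns. A rough sketch (note that spark.sql.caseSensitive changes name resolution for the whole session):
// Enable case-sensitive column resolution, then read the JSON as usual
sqlContext.setConf("spark.sql.caseSensitive", "true")
val caseSensitiveDf = sqlContext.read.json("path to your json file")
// ABC1 and abc1 are now treated as different columns
caseSensitiveDf.select("ABC1", "abc1", "abc").show()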
I am new to Spark and want to read a log file and create a DataFrame out of it. My data is half JSON, and I cannot convert it into a DataFrame properly. Below is the first row in the file:
[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}
The first part is plain text and the last part between { } is JSON. I tried a few things: converting it first to an RDD, then map and split, and then converting back to a DataFrame, but I cannot extract the values from the JSON part of the row. Is there a trick to extract the fields in this context?
The final output should look like this:
TimeStamp userid ip artist album song id service
2017-01-06 07:00:01 444444 11.11.111.0 Tears For Fears Songs From The Big Chair Everybody Wants To Rule The World S4555 pandora
You just need to parse out the pieces with a Python function into a tuple, then tell Spark to convert the RDD to a DataFrame. The easiest way to do this is probably a regular expression. For example:
import re
import json
def parse(row):
pattern = ' '.join([
r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\]',
r'userid:(?P<userid>\d+)',
r'(?P<ip>\d+\.\d+\.\d+\.\d+)',
r'(?P<level>\w+)',
r'(?P<json>.+$)'
])
match = re.match(pattern, row)
parsed_json = json.loads(match.group('json'))
return (match.group('ts'), match.group('userid'), match.group('ip'), match.group('level'), parsed_json['artist'], parsed_json['song'], parsed_json['service'])
lines = [
'[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}'
]
rdd = sc.parallelize(lines)
df = rdd.map(parse).toDF(['ts', 'userid', 'ip', 'level', 'artist', 'song', 'service'])
df.show()
This prints
+-------------------+------+-----------+-----+---------------+--------------------+-------+
| ts|userid| ip|level| artist| song|service|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
|2017-01-06 07:00:01|444444|11.11.111.0| info|Tears For Fears|Everybody Wants T...|pandora|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
I ended up using the following, just some parsing utilizing PySpark's power:
from pyspark.sql.types import StructField, StringType, StructType

# r1 holds the raw log lines read as text; each row exposes the line as x.value
parts = r1.map(lambda x: x.value.replace('[','').replace('] ','###')
    .replace(' userid:','###').replace('null','"null"').replace('""','"NA"')
    .replace(' music_info {"artist":"','###').replace('","album":"','###')
    .replace('","song":"','###').replace('","id":"','###')
    .replace('","service":"','###').replace('"}','###').split('###'))
people = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]))

schemaString = "timestamp mac userid_ip artist album song id service"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
df = sqlContext.createDataFrame(people, schema)
With this I got almost what I wanted, and performance was super fast.
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
| timestamp| mac| userid_ip| artist| album| song| id|service|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|2017-01-01 00:00:00|00:00:00:00:00:00|111122 22.235.17...|The United States...| This Is Christmas!|Do You Hear What ...| S1112536|pandora|
|2017-01-01 00:00:00|00:11:11:11:11:11|123123 108.252.2...| NA| Dinner Party Radio| NA| null|pandora|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
I'm using Postgrex in Elixir, and when it returns query results, it returns them in the following struct format:
%{columns: ["id", "email", "name"], command: :select, num_rows: 2, rows: [{1, "me#me.com", "Bobbly Long"}, {6, "email#tts.me", "Woll Smoth"}]}
It should be noted I am using Postgrex directly WITHOUT Ecto.
The columns (table headers) are returned as a collection, but the results (rows) are returned as a list of tuples. (which seems odd, as they could get very large).
I'm trying to find the best way to programmatically create JSON objects for each result in which the JSON key is the column title and the JSON value the corresponding value from the tuple.
I've tried creating maps from both, merging and then serialising to JSON objects but it seems there should be an easier/better way of doing this.
Has anyone dealt with this before? What is the best way of creating a JSON object from a separate collection and tuple?
Something like this should work:
result = Postgrex.query!(...)
Enum.map(result.rows, fn row ->
Enum.zip(result.columns, Tuple.to_list(row))
|> Enum.into(%{})
|> JSON.encode
end)
This will result in a list of JSON objects, where each row in the result set becomes a JSON object.
I'm a new Spark user currently playing around with Spark and some big data, and I have a question related to Spark SQL, or more formally the SchemaRDD. I'm reading a JSON file containing data about some weather forecasts, and I'm not really interested in all of the fields that I have ... I only want 10 fields out of the 50+ fields returned for each record. Is there a way (similar to filter) that I can use to specify the names of the fields that I want to remove in Spark?
Just a small descriptive example: consider a schema "Person" with 3 fields, "Name", "Age", and "Gender", where I'm not interested in the "Age" field and would like to remove it. Can I use Spark somehow to do that? Thanks
If you are using Spark 1.2, you can do the following (using Scala)...
If you already know what fields you want to use, you can construct the schema for these fields and apply this schema to the JSON dataset. Spark SQL will return a SchemaRDD. Then, you can register it and query it as a table. Here is a snippet...
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// The schema is encoded in a string
val schemaString = "name gender"
// Import Spark SQL data types.
import org.apache.spark.sql._
// Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Create the SchemaRDD for your JSON file "people" (every line of this file is a JSON object).
val peopleSchemaRDD = sqlContext.jsonFile("people.txt", schema)
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Only values of name and gender fields will be in the results.
val results = sqlContext.sql("SELECT * FROM people")
When you look at the schema of peopleSchemaRDD (peopleSchemaRDD.printSchema()), you will only see the name and gender fields.
Or, if you want to explore the dataset and determine what fields you want after you see all fields, you can ask Spark SQL to infer the schema for you. Then, you can register the SchemaRDD as a table and use projection to remove unneeded fields. Here is a snippet...
// Spark SQL will infer the schema of the given JSON file.
val peopleSchemaRDD = sqlContext.jsonFile("people.txt")
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Project name and gender field.
sqlContext.sql("SELECT name, gender FROM people")
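As a side note, on later releases where DataFrames replaced SchemaRDD (roughly Spark 1.4+), the same idea is just a projection or a drop on the DataFrame. A rough sketch under that assumption:
// Infer the schema from the JSON file, then keep or drop fields as needed
val peopleDF = sqlContext.read.json("people.txt")
// Keep only the fields you want...
val projected = peopleDF.select("name", "gender")
// ...or drop the one you don't (DataFrame.drop("colName") is available from Spark 1.4)
val withoutAge = peopleDF.drop("age")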
You can specify which fields you would like to have in the SchemaRDD. Below is an example: create a case class with only the fields that you need, read the data into an RDD, then map only the fields that you need (in the same order as the schema is specified in the case class).
Sample Data: People.txt
foo,25,M
bar,24,F
Code:
case class Person(name: String, gender: String)
// createSchemaRDD implicitly converts an RDD of case classes to a SchemaRDD (Spark 1.2)
import sqlContext.createSchemaRDD
val people = sc.textFile("People.txt").map(_.split(",")).map(p => Person(p(0), p(2)))
people.registerTempTable("people")
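For completeness, once the table is registered you can query just those fields; a small usage sketch:
// Only name and gender exist in the schema, so the projection is already done
val results = sqlContext.sql("SELECT name, gender FROM people")
results.collect().foreach(println)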