spark hivecontext working with queries issues

spark hivecontext working with queries issues - json

I'm trying to extract information from JSON files to create tables in Hive.
This is my JSON schema:
root
|-- info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- stations: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- bikes: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- slots: string (nullable = true)
| | | | |-- streetName: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- updateTime: long (nullable = true)
|-- date: string (nullable = true)
|-- numRecords: string (nullable = true)
I'm using this query:
sqlContext.sql("SELECT info.updateTime FROM STATIONS").foreach(println)
This is what I get:
[WrappedArray(1449098169, 1449108553, 1449098468)]
But I don't know how to put this information into a table so that I can use it afterwards from the Hive console.
I tried this:
query.write.save("/home/cloudera/Desktop/select")
It creates something, but I don't know how to use it.
Thanks

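You can do it in several ways, depending on your setup. Both options below assume that sqlContext is a HiveContext (as in the title), so that the table is registered in the Hive metastore and is visible from the Hive console.
First way: create the table in the query itself
sqlContext.sql("create table mytable AS SELECT info.updateTime FROM STATIONS")
// now you can query mytable
Second way: write the DataFrame with saveAsTable()
sqlContext.sql("SELECT info.updateTime FROM STATIONS").write.saveAsTable("othertable")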

Related

Processing puzzle for complex json

I'm new to data processing with PySpark and pandas. I need some guidance on how to process a relatively complex JSON document coming out of PuppetDB.
The schema looks something like this:
root
|-- Hostname: string (nullable = true)
|-- facts-mountpoints: struct (nullable = true)
| |-- /: struct (nullable = true)
| | |-- available: string (nullable = true)
| | |-- available_bytes: long (nullable = true)
| | |-- capacity: string (nullable = true)
| | |-- device: string (nullable = true)
| | |-- filesystem: string (nullable = true)
| | |-- options: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- size: string (nullable = true)
| | |-- size_bytes: long (nullable = true)
| | |-- used: string (nullable = true)
| | |-- used_bytes: long (nullable = true)
| |-- /acfs01: struct (nullable = true)
| | |-- available: string (nullable = true)
| | |-- available_bytes: long (nullable = true)
| | |-- capacity: string (nullable = true)
| | |-- device: string (nullable = true)
| | |-- filesystem: string (nullable = true)
| | |-- options: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- size: string (nullable = true)
| | |-- size_bytes: long (nullable = true)
| | |-- used: string (nullable = true)
| | |-- used_bytes: long (nullable = true)
I use PySpark to create the DataFrame and process the data.
My problem is that each host can have different extra NFS mounts attached, so facts-mountpoints is dynamic and not the same across hosts. I therefore cannot simply flatten/explode it and do the work.
To simplify the problem, I want to filter out the entries with filesystem="nfs" and keep only the standard, non-NFS mounts.
No matter what I tried, I could not find a way to express a filter like the one below to build my columns.
facts-mountpoints.*.filesystem<>'nfs'
Is there a way to filter on a known struct -> unknown struct -> field path with JSON DataFrames?
If that's not possible, maybe I could filter on the mount point names (the second-level struct) instead.
A sample JSON file can be found here:
https://github.com/coskan/stackof/blob/0b29f4f0645e28d3efa297a1c4e949f4a985c639/sample_data.json
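One possible way around the dynamic struct is to reshape the raw JSON in plain Python before it reaches Spark. The sketch below emits one row per (host, mount point) pair and skips NFS mounts, so every row has the same columns no matter how many mounts a host has. The function name, the `excluded` parameter and the sample values are assumptions made for illustration; the real data may use other filesystem labels (for example `nfs4`).

```python
def non_nfs_mount_rows(host_record, excluded=("nfs",)):
    """Turn one host record into flat rows, one per non-excluded mount point."""
    rows = []
    mounts = host_record.get("facts-mountpoints") or {}
    for mount, details in mounts.items():
        details = details or {}
        # Skip mounts whose filesystem is in the excluded set (assumed label: "nfs").
        if details.get("filesystem") in excluded:
            continue
        rows.append({"Hostname": host_record.get("Hostname"),
                     "mountpoint": mount,
                     **details})
    return rows


# Hypothetical record shaped like the schema above (values are made up).
sample = {
    "Hostname": "host01",
    "facts-mountpoints": {
        "/": {"filesystem": "xfs", "size_bytes": 100},
        "/acfs01": {"filesystem": "acfs", "size_bytes": 200},
        "/mnt/share": {"filesystem": "nfs", "size_bytes": 300},
    },
}

rows = non_nfs_mount_rows(sample)
for row in rows:
    print(row["Hostname"], row["mountpoint"], row["filesystem"])
```

The resulting list of flat dictionaries has a fixed shape, which should make it easier to turn into a DataFrame than the original per-host struct.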

scala dataframe column names replace '-' with _ for nested json

I am working with nested JSON in Scala and need to replace the - in column names with _.
Schema of json:
|-- a-type: struct (nullable = true)
| |-- x-Type: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- part: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- x-Type: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- Length: long (nullable = true)
| | | |-- Order: long (nullable = true)
| | | |-- y-Name: string (nullable = true)
| | | |-- Payload-Text: string (nullable = true)
| |-- Date: string (nullable = true)
I am using the code below, which only works at the first level. However, I have to replace - with _ at all levels. Any help is really appreciated.
Code used currently:
// Start from the original DataFrame; this renames top-level columns only.
var scJsonDFCorrectedCols = scJsonDF
scJsonDF.columns.foreach { col =>
  println(col + " after column replace " + col.replaceAll("-", "_"))
  scJsonDFCorrectedCols = scJsonDFCorrectedCols.withColumnRenamed(col, col.replaceAll("-", "_"))
}
I am looking for a dynamic solution, as the JSON documents come in different structures.

One solution I found is to flatten the JSON and then update the column names. I used this gist as a starting point: https://gist.github.com/fahadsiddiqui/d5cff15698f9dc57e2dd7d7052c6cc43 and changed one line from
col(x.toString).as(x.toString.replace(".", "_"))
to
col(x.toString).as(x.toString.replaceAll("-","_").replace(".", "_"))
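If flattening is not wanted, a different approach (not the gist's method) is to rename the keys in the parsed JSON itself, before it is loaded into a DataFrame. The following is a minimal plain-Python sketch of that idea; `rename_keys` and the sample document are hypothetical and only mirror the shape of the schema in the question.

```python
def rename_keys(value, old="-", new="_"):
    """Recursively replace `old` with `new` in every dictionary key."""
    if isinstance(value, dict):
        return {key.replace(old, new): rename_keys(child, old, new)
                for key, child in value.items()}
    if isinstance(value, list):
        return [rename_keys(item, old, new) for item in value]
    # Scalars (strings, numbers, None) are returned unchanged.
    return value


# Hypothetical document shaped like the schema in the question.
sample = {
    "a-type": {
        "x-Type": ["t1"],
        "part": [{"x-Type": ["t2"], "Length": 3, "Order": 1,
                  "y-Name": "n", "Payload-Text": "p"}],
        "Date": "2020-01-01",
    }
}

renamed = rename_keys(sample)
print(renamed)
```

Only keys are rewritten, so string values that happen to contain a hyphen (such as the date above) are left as they are, and the nesting is preserved.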

In PySpark, how do I read a specific JSON attribute that has been loaded to a dataframe?

I am trying to get the value of "__delta" from the following JSON schema, which has been loaded into a DataFrame. How do I do that in PySpark?
root
|-- d: struct (nullable = true)
| |-- __delta: string (nullable = true)
| |-- __next: string (nullable = true)
| |-- results: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- ABRVW: string (nullable = true)
| | | |-- ADRNR: string (nullable = true)
| | | |-- ANRED: string (nullable = true)

With a struct-type JSON object, just select the attribute you want using its dotted path.
df.select("d.__delta")

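How about df.select($"d.__delta")? Note that the $"..." syntax is Scala; in PySpark the equivalent is df.select(col("d.__delta")), with col imported from pyspark.sql.functions.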

what is optimal way to parse following kafka JSON message to pyspark dataframe?

I'm using Spark Structured Streaming to read a Kafka topic and want to convert the following complex JSON (Kafka messages) into a DataFrame with the columns "NAME,ADDRESS,DESCRIPTION,CODE,DEPARTMENT,INFA_OP_TYPE,DTL__CAPXTIMESTAMP".
{
"meta_data": [{"name":{"string":"INFA_SEQUENCE"},"value":
{"string":"2,PWX_GENERIC"},"type":null},
{"name":{"string":"INFA_TABLE_NAME"},"value":{"string":"customers"},"type":null},
{"name":{"string":"INFA_OP_TYPE"},"value":{"string":"INSERT_EVENT"},"type":null},
{"name":{"string":"DTL__CAPXRESTART1"},"value":{"string":"B+IABwAfA"},"type":null},
{"name":{"string":"DTL__CAPXRESTART2"},"value":{"string":"AAABpMwgRDk="},"type":null},
{"name":{"string":"DTL__CAPXUOW"},"value":{"string":"AAMKPgAAqaIABg=="},"type":null},
{"name":{"string":"DTL__CAPXUSER"},"value":null,"type":null},
{"name":{"string":"DTL__CAPXTIMESTAMP"},"value":{"string":"201807310934257270000000"},"type":null},
{"name":{"string":"DTL__CAPXACTION"},"value":{"string":"I"},"type":null}],
"columns":{"array":[{"name":{"string":"NAME"},"value":{"string":"ABCD"},"isPresent":{"boolean":true}},
{"name":{"string":"ADDRESS"},"value":{"string":"123,Bark street"},"isPresent":{"boolean":true}},
{"name":{"string":"DESCRIPTION"},"value":{"string":"Canadian"},"isPresent":{"boolean":true}},
{"name":{"string":"CODE"},"value":{"string":"3_1"},"isPresent":{"boolean":true}},
{"name":{"string":"DEPARTMENT"},"value":{"string":"HR"},"isPresent":{"boolean":true}}
] }
}
I'm able to extract the two JSON objects "meta_data" and "columns", but I'm unable to explode "columns.array".
newJsonObj = events.select(get_json_object(events.value,'$.meta_data').alias('meta_data'),get_json_object(events.value,'$.columns.array').alias('columns'))
I also don't know how to extract values from both JSON objects and create a DataFrame with columns taken from each of them.
-- Schema of the events DataFrame --
root
|-- columns: struct (nullable = true)
| |-- array: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- isPresent: struct (nullable = true)
| | | | |-- boolean: boolean (nullable = true)
| | | |-- name: struct (nullable = true)
| | | | |-- string: string (nullable = true)
| | | |-- value: struct (nullable = true)
| | | | |-- string: string (nullable = true)
|-- meta_data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: struct (nullable = true)
| | | |-- string: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- value: struct (nullable = true)
| | | |-- string: string (nullable = true)
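Both "meta_data" and "columns.array" are lists of name/value pairs, so the core of the problem is a pivot from pairs to named columns. The sketch below shows that pivot in plain Python on a single message; it is not a Structured Streaming solution by itself, though a function like this could in principle be applied per message (for instance inside a UDF). `flatten_message`, `unwrap` and `WANTED` are hypothetical names, and the sample is an abbreviated copy of the message above.

```python
import json

WANTED = ["NAME", "ADDRESS", "DESCRIPTION", "CODE", "DEPARTMENT",
          "INFA_OP_TYPE", "DTL__CAPXTIMESTAMP"]


def unwrap(node, key="string"):
    """Return node[key] for wrappers like {"string": "ABCD"}, else None."""
    return node.get(key) if isinstance(node, dict) else None


def flatten_message(raw, wanted=WANTED):
    """Pivot the name/value pairs of one message into a flat dict."""
    msg = json.loads(raw)
    entries = list((msg.get("columns") or {}).get("array") or [])
    entries += msg.get("meta_data") or []
    values = {unwrap(e.get("name")): unwrap(e.get("value")) for e in entries}
    # Missing or null values come back as None.
    return {name: values.get(name) for name in wanted}


# Abbreviated copy of the message from the question.
sample_message = json.dumps({
    "meta_data": [
        {"name": {"string": "INFA_OP_TYPE"}, "value": {"string": "INSERT_EVENT"}, "type": None},
        {"name": {"string": "DTL__CAPXUSER"}, "value": None, "type": None},
        {"name": {"string": "DTL__CAPXTIMESTAMP"},
         "value": {"string": "201807310934257270000000"}, "type": None},
    ],
    "columns": {"array": [
        {"name": {"string": "NAME"}, "value": {"string": "ABCD"}, "isPresent": {"boolean": True}},
        {"name": {"string": "ADDRESS"}, "value": {"string": "123,Bark street"}, "isPresent": {"boolean": True}},
        {"name": {"string": "DESCRIPTION"}, "value": {"string": "Canadian"}, "isPresent": {"boolean": True}},
        {"name": {"string": "CODE"}, "value": {"string": "3_1"}, "isPresent": {"boolean": True}},
        {"name": {"string": "DEPARTMENT"}, "value": {"string": "HR"}, "isPresent": {"boolean": True}},
    ]},
})

row = flatten_message(sample_message)
print(row)
```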

how to parse the wiki infobox json with scala spark

I am trying to extract data from JSON that I got from the Wikipedia API:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Rajanna&rvsection=0
I was able to print its schema:
scala> data.printSchema
root
|-- batchcomplete: string (nullable = true)
|-- query: struct (nullable = true)
| |-- pages: struct (nullable = true)
| | |-- 28597189: struct (nullable = true)
| | | |-- ns: long (nullable = true)
| | | |-- pageid: long (nullable = true)
| | | |-- revisions: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- *: string (nullable = true)
| | | | | |-- contentformat: string (nullable = true)
| | | | | |-- contentmodel: string (nullable = true)
| | | |-- title: string (nullable = true)
I want to extract the value of the key "*", i.e. this field:
|-- *: string (nullable = true)
Please suggest a solution.
One problem is that the page ID key, shown here as
pages: struct (nullable = true)
| | |-- 28597189: struct (nullable = true)
is different for every title.

First, read the key (28597189) dynamically from the schema, then use it to extract the data from the Spark DataFrame, like below:
val keyName = dataFrame.selectExpr("query.pages.*").schema.fieldNames(0)
println(s"Key Name : $keyName")
This gives you the key dynamically:
Key Name : 28597189
Then use the key to extract the revisions:
var revDf = dataFrame.select(explode(dataFrame(s"query.pages.$keyName.revisions")).as("revision")).select("revision.*")
revDf.printSchema()
Output:
root
|-- *: string (nullable = true)
|-- contentformat: string (nullable = true)
|-- contentmodel: string (nullable = true)
Next, rename the column * to a regular name such as star_column:
revDf = revDf.withColumnRenamed("*", "star_column")
revDf.printSchema()
Output:
root
|-- star_column: string (nullable = true)
|-- contentformat: string (nullable = true)
|-- contentmodel: string (nullable = true)
Once the final DataFrame is ready, call show:
revDf.show()
Output:
+--------------------+-------------+------------+
| star_column|contentformat|contentmodel|
+--------------------+-------------+------------+
|{{EngvarB|date=Se...| text/x-wiki| wikitext|
+--------------------+-------------+------------+
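The same idea, looking up the page key at run time instead of hard-coding it, can be sketched in plain Python on the parsed API response. This is only an illustration of the logic outside Spark: `revision_texts` is a hypothetical helper, and the sample response contains made-up placeholder values in the shape of the schema above.

```python
def revision_texts(response):
    """Collect the "*" field of every revision, whatever the page key is."""
    pages = response.get("query", {}).get("pages", {})
    texts = []
    # Iterate over the pages instead of hard-coding the page ID key.
    for page_key, page in pages.items():
        for revision in page.get("revisions", []):
            texts.append({"page_key": page_key,
                          "star_column": revision.get("*")})
    return texts


# Hypothetical response in the shape of the schema above (placeholder values).
sample_response = {
    "batchcomplete": "",
    "query": {"pages": {"28597189": {
        "ns": 0,
        "pageid": 28597189,
        "title": "Rajanna",
        "revisions": [{"contentformat": "text/x-wiki",
                       "contentmodel": "wikitext",
                       "*": "placeholder wikitext"}],
    }}},
}

texts = revision_texts(sample_response)
print(texts)
```

Because the loop visits every entry under "pages", it also covers responses that contain more than one page.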
