I'm pretty new to Spark Streaming and Scala. I have JSON data and some other random log data coming in from a Kafka topic. I was able to filter out just the JSON data like this:
val messages = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
  .map(_._2)
  .filter(x => x.matches("^[{].*"))
My JSON data looks like this:
{"time":"125573","randomcol":"abchdre","address":{"city":"somecity","zip":"123456"}}
I'm trying to parse the JSON data and put it into a Hive table.
Can someone please point me in the right direction?
Thanks
There are multiple ways to do this.
Create an external Hive table with the required columns, pointing to this data location.
When you create the table, you could use the default JSON SerDe and then use the get_json_object Hive function to load this raw data into a final table. Refer to this for the function details.
OR
You could try the Avro SerDe and specify an Avro schema matching your JSON message to create the Hive table. Refer to this for an Avro SerDe example.
Hope it helps
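To make the parsing step concrete, here is a minimal sketch in plain Python (the logic only, not the Spark/Scala pipeline itself) that parses the sample message above and flattens the nested address struct into the flat columns a Hive table would expect. The field names come from the sample record; the helper name is made up for illustration:

```python
import json

def flatten_record(raw: str) -> dict:
    """Parse one JSON message and flatten the nested 'address'
    struct into top-level columns, as you might do before writing
    the record into a flat Hive table."""
    rec = json.loads(raw)
    addr = rec.pop("address", {}) or {}
    rec["city"] = addr.get("city")
    rec["zip"] = addr.get("zip")
    return rec

raw = '{"time":"125573","randomcol":"abchdre","address":{"city":"somecity","zip":"123456"}}'
print(flatten_record(raw))
# {'time': '125573', 'randomcol': 'abchdre', 'city': 'somecity', 'zip': '123456'}
```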
I need to import JSON data into a WordPress database. I found the correct table in the database, but the example JSON data present in the table is not in normal JSON format.
I need to import:
{"nome": "Pippo","cognome": "Paperino"}
but the example data in the table is:
a:2:{s:4:"nome";s:5:"Pippo";s:7:"cognome";s:8:"Paperino";}
How can I convert my JSON to "WP JSON"?
The data is serialized; that's why it looks weird. You can use maybe_unserialize() in WordPress; this function will unserialize the data if it was serialized.
https://developer.wordpress.org/reference/functions/maybe_unserialize/
Some functions serialize data before saving it in WordPress, and some will also unserialize it when pulling from the DB. So depending on how you save the data and how you later extract it, you might end up with serialized data.
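To see exactly how that serialized form is produced, here is a small Python sketch (a hypothetical helper, not a WordPress function; the canonical route inside PHP is maybe_serialize()/maybe_unserialize()) that converts the JSON object above into PHP's serialized-array format. It covers only a flat object of string keys and values, which is all the example needs:

```python
import json

def php_serialize_assoc(obj: dict) -> str:
    """Serialize a flat dict of strings into PHP's serialized-array
    format, which is how WordPress stores this kind of data.
    Covers string keys/values only."""
    parts = []
    for key, value in obj.items():
        for s in (key, str(value)):
            b = s.encode("utf-8")  # PHP's s:N lengths are byte lengths
            parts.append(f's:{len(b)}:"{s}";')
    return f"a:{len(obj)}:{{{''.join(parts)}}}"

data = json.loads('{"nome": "Pippo","cognome": "Paperino"}')
print(php_serialize_assoc(data))
# a:2:{s:4:"nome";s:5:"Pippo";s:7:"cognome";s:8:"Paperino";}
```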
I've tried loading simple JSON records from a file into Hive tables as shown below. Each JSON record is on a separate line.
{"Industry":"Manufacturing","Phone":null,"Id":null,"type":"Account","Name":"Manufacturing"}
{"Industry":null,"Phone":"(738) 244-5566","Id":null,"type":"Account","Name":"Sales"}
{"Industry":"Government","Phone":null,"Id":null,"type":"Account","Name":"Kansas City Brewery & Co"}
But I couldn't find any SerDe to load an array of comma-separated JSON records into a Hive table. The input is a file containing JSON records as shown below:
[{"Industry":"Manufacturing","Phone":null,"Id":null,"type":"Account","Name":"Manufacturing"},{"Industry":null,"Phone":"(738) 244-5566","Id":null,"type":"Account","Name":"Sales"},{"Industry":"Government","Phone":null,"Id":null,"type":"Account","Name":"Kansas City Brewery & Co"}]
Can someone suggest a SerDe that can parse this JSON file?
Thanks
You can check this SerDe: https://github.com/rcongiu/Hive-JSON-Serde
Another related post: Parse json arrays using HIVE
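If switching SerDes is not an option, another route is to pre-process the file into one record per line before loading, since per-line Hive JSON SerDes expect newline-delimited records. A minimal Python sketch of that conversion (the sample records are shortened from the question):

```python
import json

def array_to_json_lines(text: str) -> str:
    """Turn a file whose entire content is one JSON array into
    newline-delimited JSON: one record per line, the layout that
    per-line Hive JSON SerDes expect."""
    records = json.loads(text)
    return "\n".join(json.dumps(r) for r in records)

src = '[{"Id": 1, "Name": "Sales"}, {"Id": 2, "Name": "Manufacturing"}]'
print(array_to_json_lines(src))
```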
I have JSON files with a total volume of approximately 500 TB. I have loaded the complete set into a Hive data warehouse.
How would I validate or test the data that was loaded into the Hive warehouse? What should my testing strategy be?
The client wants us to validate the JSON data: whether the data loaded into Hive is correct or not. Is anything missing? If so, which field?
Please help.
How is your data being stored in Hive tables?
One option is to create a Hive UDF that receives the JSON string, validates the data, and returns another string with the error message, or an empty string if the JSON string is well formed.
Here is a Hive UDF tutorial: http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html
With the Hive UDF in place, you can execute queries like:
select strjson, validateJson(strjson) from jsonTable where validateJson(strjson) != "";
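The contract such a UDF would implement can be sketched in plain Python (this is the validation logic only; the actual Hive UDF would be written in Java as in the tutorial above):

```python
import json

def validate_json(s: str) -> str:
    """Return an empty string if s is well-formed JSON, otherwise an
    error message -- the same contract as the UDF described above."""
    if s is None:
        return "null input"
    try:
        json.loads(s)
        return ""
    except ValueError as e:
        return str(e)

print(validate_json('{"Industry": "Government"}'))  # well formed -> ""
print(validate_json('{"Industry": }'))              # malformed -> parser error message
```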
The current PostgreSQL version (9.4) supports the json and jsonb data types, as described in http://www.postgresql.org/docs/9.4/static/datatype-json.html
For instance, JSON data stored as jsonb can be queried via SQL query:
SELECT jdoc->'guid', jdoc->'name'
FROM api
WHERE jdoc @> '{"company": "Magnafone"}';
As a Spark user, is it possible to send this query to PostgreSQL via JDBC and receive the result as a DataFrame?
What I have tried so far:
val url = "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar"
val df = sqlContext.load("jdbc",
  Map("url" -> url, "dbtable" -> "mydb", "driver" -> "org.postgresql.Driver"))
df.registerTempTable("table")
sqlContext.sql("SELECT data->'myid' FROM table")
But sqlContext.sql() was unable to understand the data->'myid' part of the SQL.
It is not possible to query json/jsonb fields dynamically from the Spark DataFrame API. Once data is fetched into Spark it is converted to a string and is no longer a queryable structure (see SPARK-7869).
As you've already discovered, you can use the dbtable / table arguments to pass a subquery directly to the source and use it to extract the fields of interest. Pretty much the same rule applies to any non-standard type, calling stored procedures, or any other extensions.
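The subquery-as-dbtable pattern amounts to wrapping the jsonb extraction in a parenthesized, aliased derived table and handing that string to the JDBC source instead of a bare table name. A small Python sketch of building that string (the api table and jdoc columns come from the question; the helper and alias are made up for illustration):

```python
def as_dbtable(select_sql: str, alias: str = "tmp") -> str:
    """Wrap a SELECT in parentheses with an alias -- the form the
    Spark JDBC source accepts in place of a bare table name, so the
    jsonb operators run inside PostgreSQL, not in Spark."""
    return f"({select_sql}) AS {alias}"

subquery = as_dbtable(
    "SELECT jdoc->>'guid' AS guid, jdoc->>'name' AS name "
    "FROM api WHERE jdoc @> '{\"company\": \"Magnafone\"}'"
)
# This string would be passed as the "dbtable" option of the JDBC load;
# Spark then sees plain guid/name text columns.
print(subquery)
```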
Can anyone recommend a code snippet, script, or tool for converting a Hive Map field to a Redshift JSON field?
I have a Hive table that has two Map fields, and I need to move the data to Redshift. I can easily move the data in string format but then lose some functionality. I would prefer to have the Map ported to JSON to maintain the key-value pairs.
Thanks.
You might want to try the to_json UDF:
http://brickhouseconfessions.wordpress.com/2014/02/07/hive-and-json-made-simple/
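If you instead export the table as delimited text, a map column arrives in Hive's default text encoding: entries separated by \x02 and key from value by \x03 (these are Hive's default collection/map delimiters; adjust if your table's ROW FORMAT overrides them). A small Python sketch of converting one such cell into a JSON string Redshift can ingest:

```python
import json

# Hive's default text-format delimiters for a map column:
ENTRY_SEP = "\x02"  # between key-value pairs ("collection items terminated by")
KV_SEP = "\x03"     # between a key and its value ("map keys terminated by")

def hive_map_to_json(raw: str) -> str:
    """Convert one Hive text-encoded map cell into a JSON object string.
    Assumes every entry contains the key-value separator."""
    if not raw:
        return "{}"
    pairs = (entry.split(KV_SEP, 1) for entry in raw.split(ENTRY_SEP))
    return json.dumps({k: v for k, v in pairs})

cell = "color\x03red\x02size\x03XL"
print(hive_map_to_json(cell))
# {"color": "red", "size": "XL"}
```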