Validate JSON data loaded into Hive warehouse

I have JSON files totaling approximately 500 TB, and I have loaded the complete set into a Hive data warehouse.
How would I validate or test the data that was loaded into the Hive warehouse? What should my testing strategy be?
The client wants us to validate the JSON data: whether the data loaded into Hive is correct or not, and whether anything is missing; if so, which field was it?
Please help.

How is your data being stored in Hive tables?
One option is to create a Hive UDF that receives the JSON string, validates the data, and returns another string containing the error message, or an empty string if the JSON is well formed.
Here is a Hive UDF tutorial: http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html
With the Hive UDF in place you can execute queries like:
select strjson, validateJson(strjson) from jsonTable where validateJson(strjson) != "";
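A rough sketch of such a UDF in Scala (the package and class names are hypothetical), assuming a JSON parser such as Jackson is available on the classpath (bundle it into the jar otherwise):
    package com.example.udf
    import org.apache.hadoop.hive.ql.exec.UDF
    import com.fasterxml.jackson.databind.ObjectMapper
    class ValidateJsonUDF extends UDF {
      private val mapper = new ObjectMapper()
      // Returns an empty string when the input parses as JSON,
      // otherwise the parser's error message.
      def evaluate(json: String): String = {
        if (json == null) return "input is null"
        try {
          mapper.readTree(json)
          ""
        } catch {
          case e: Exception => e.getMessage
        }
      }
    }
After building the jar, register it in Hive with ADD JAR and CREATE TEMPORARY FUNCTION validateJson AS 'com.example.udf.ValidateJsonUDF'; then run the query above.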

Related

Copy Activity Data Factory V2 Collection Reference to String type

I am trying to load a JSON file into SQL Server using Data Factory V2. I need to save the collection reference of type array as a string in SQL Server. In the figure below, the 'Field' object consists of multiple fields. I need to store the 'Keywords' object as a string type in SQL Server, mapped to a single column. I am not able to map 'Keywords' or 'Field' to a column.
In Data Factory, we cannot load the 'Keywords' array from the JSON data into each row for one column as a JSON string in SQL Server.
What we did figure out is how to load all of the JSON data into one row in SQL Server, although that is an unconventional approach:
Set the JSON file as a delimited file format.
Set the column and row delimiters to characters that do not exist in the file.
The JSON data will then be treated as a single JSON string and written as such to the SQL database.

JSON format in WordPress database

I need to import JSON data into a WordPress database. I found the correct table in the database, but the example data present in the table is not in a normal JSON format.
I need to import:
{"nome": "Pippo","cognome": "Paperino"}
but the example data in the table is:
a:2:{s:4:"nome";s:5:"Pippo";s:7:"cognome";s:8:"Paperino";}
How can I convert my JSON to this "WP JSON" format?
The data is serialized; that's why it looks odd. (a:2:{...} is the output of PHP's serialize(): an array with two entries, where s:4:"nome" denotes a 4-character string.) You can use maybe_unserialize() in WordPress; this function will unserialize the data if it was serialized.
https://developer.wordpress.org/reference/functions/maybe_unserialize/
Some functions serialize data before saving it in WordPress, and some will also unserialize it when pulling it from the DB. So depending on how you save the data and how you later extract it, you might end up with serialized data.

Ingesting CSV data to MySQL DB in NiFi

I am trying to ingest the data from my CSV file into a MySQL DB. My CSV file has a field called 'MeasurementTime' with a value such as 2018-06-27 11:14.50. My flow is treating that field as a string, and thus PutSQL is giving an error. I am using the same template as per this Template, but not using the InferAvro processor, as I already have a pre-defined schema. This is the website: Website link
How can I pass a datetime field into my MySQL DB with the correct data type rather than as a string? What setting should I change?
Thank you
With PutDatabaseRecord you can avoid this whole chain of transformations and over-engineering. The flow would look like:
GetFile -> PutDatabaseRecord
You need to configure PutDatabaseRecord with its Record Reader property set to a CSVReader, then configure the CSVReader with its Schema Registry set to an AvroSchemaRegistry that provides a valid schema; a sketch of such a schema follows below. You can find a template for a sample flow here.
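The key point for the MeasurementTime column is that the schema registered in the AvroSchemaRegistry should declare it with a timestamp logical type rather than a plain string, and the CSVReader's Timestamp Format property should match the format actually used in the file. A rough sketch of such a schema (the record name and the second field are hypothetical placeholders):
    {
      "type": "record",
      "name": "measurement",
      "fields": [
        { "name": "MeasurementTime",
          "type": { "type": "long", "logicalType": "timestamp-millis" } },
        { "name": "value", "type": ["null", "double"] }
      ]
    }
With the timestamp logical type in place, PutDatabaseRecord can bind MeasurementTime to a DATETIME column instead of passing it through as a string.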

How to parse nested JSON messages from a Kafka topic into Hive

I'm pretty new to Spark Streaming and Scala. I have JSON data and some other random log data coming in from a Kafka topic. I was able to filter out just the JSON data like this:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet).map(_._2).filter (x => x.matches("^[{].*" ))
My JSON data looks like this:
{"time":"125573","randomcol":"abchdre","address":{"city":"somecity","zip":"123456"}}
I'm trying to parse the JSON data and put it into a Hive table.
Can someone please point me in the right direction?
Thanks
There are multiple ways to do this.
You could create an external Hive table with the required columns, pointing at this data location. When you create the table, you could use the default JSON SerDe, then use the get_json_object Hive function to load the raw data into a final table; refer to this for the function details. See the sketch below.
Or you could try the Avro SerDe and specify an Avro schema matching your JSON message when creating the Hive table; refer to this for an Avro SerDe example.
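A minimal sketch of the first option, driven through spark.sql with Hive support enabled (the HDFS path and table names are hypothetical, and the filtered JSON strings are assumed to already be written out, one document per line, under /data/raw_json):
    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder()
      .appName("json-to-hive")
      .enableHiveSupport()
      .getOrCreate()
    // External table with a single string column holding each raw JSON document.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS raw_json (json_str STRING)
        |LOCATION '/data/raw_json'""".stripMargin)
    // Final table with the flattened columns.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS events
        |(event_time STRING, randomcol STRING, city STRING, zip STRING)""".stripMargin)
    // get_json_object pulls fields, including the nested address fields, out of the raw string.
    spark.sql(
      """INSERT INTO TABLE events
        |SELECT get_json_object(json_str, '$.time'),
        |       get_json_object(json_str, '$.randomcol'),
        |       get_json_object(json_str, '$.address.city'),
        |       get_json_object(json_str, '$.address.zip')
        |FROM raw_json""".stripMargin)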
Hope it helps.

Spark SQL on Postgresql JSONB data

The current PostgreSQL version (9.4) supports the json and jsonb data types, as described in http://www.postgresql.org/docs/9.4/static/datatype-json.html
For instance, JSON data stored as jsonb can be queried via a SQL query:
SELECT jdoc->'guid', jdoc->'name'
FROM api
WHERE jdoc @> '{"company": "Magnafone"}';
As a Spark user, is it possible to send this query to PostgreSQL via JDBC and receive the result as a DataFrame?
What I have tried so far:
val url = "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar"
val df = sqlContext.load("jdbc",
Map("url"->url,"dbtable"->"mydb", "driver"->"org.postgresql.Driver"))
df.registerTempTable("table")
sqlContext.sql("SELECT data->'myid' FROM table")
But sqlContext.sql() was unable to understand the data->'myid' part in the SQL.
It is not possible to query json / jsonb fields dynamically through the Spark DataFrame API. Once the data is fetched into Spark it is converted to a string and is no longer a queryable structure (see: SPARK-7869).
As you've already discovered, you can use the dbtable / table argument to pass a subquery directly to the source and use it to extract the fields of interest, as in the sketch below. Pretty much the same rule applies to any non-standard type, to calling stored procedures, or to any other extensions.
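As a minimal sketch (using the api table and jdoc column from the question), the jsonb extraction can be pushed down to PostgreSQL by wrapping it in an aliased subquery and passing that as the dbtable option:
    val url = "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar"
    val pushdown =
      """(SELECT jdoc->>'guid' AS guid, jdoc->>'name' AS name
        |   FROM api
        |  WHERE jdoc @> '{"company": "Magnafone"}') AS api_subset""".stripMargin
    val df = sqlContext.read
      .format("jdbc")
      .option("url", url)
      .option("dbtable", pushdown)
      .option("driver", "org.postgresql.Driver")
      .load()
    df.show()  // guid and name arrive in Spark as plain string columns
PostgreSQL evaluates the ->> and @> operators, so Spark only ever sees ordinary text columns.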