Spark SQL on PostgreSQL JSONB data

The current PostgreSQL version (9.4) supports the json and jsonb data types, as described in http://www.postgresql.org/docs/9.4/static/datatype-json.html
For instance, JSON data stored as jsonb can be queried via a SQL query:
SELECT jdoc->'guid', jdoc->'name'
FROM api
WHERE jdoc @> '{"company": "Magnafone"}';
As a Spark user, is it possible to send this query to PostgreSQL via JDBC and receive the result as a DataFrame?
What I have tried so far:
val url = "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar"
val df = sqlContext.load("jdbc",
Map("url"->url,"dbtable"->"mydb", "driver"->"org.postgresql.Driver"))
df.registerTempTable("table")
sqlContext.sql("SELECT data->'myid' FROM table")
But sqlContext.sql() was unable to understand the data->'myid' part in the SQL.

It is not possible to query json / jsonb fields dynamically from the Spark DataFrame API. Once the data is fetched into Spark it is converted to a string and is no longer a queryable structure (see SPARK-7869).
As you've already discovered, you can use the dbtable / table arguments to pass a subquery directly to the source and use it to extract the fields of interest. Pretty much the same rule applies to any non-standard type, to calling stored procedures, or to any other extensions.
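For example, a minimal sketch based on the code from the question (the table name mytable, the id column, and the tmp alias are illustrative):

// The subquery runs inside PostgreSQL, so the jsonb operators are evaluated there;
// Spark only sees the already-extracted text column.
val url = "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar"
val query = "(SELECT id, data->>'myid' AS myid FROM mytable) AS tmp"
val df = sqlContext.load("jdbc",
  Map("url" -> url, "dbtable" -> query, "driver" -> "org.postgresql.Driver"))
df.registerTempTable("extracted")
sqlContext.sql("SELECT myid FROM extracted").show()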

Related

How to extract tables with data from .sql dumps using Spark?

I have around four *.sql self-contained dumps (about 20 GB each) which I need to convert to datasets in Apache Spark.
I have tried installing a local database using InnoDB and importing the dump, but that seems too slow (it took around 10 hours).
I read the file directly into Spark using:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val sparkSession = SparkSession.builder().appName("sparkSession").getOrCreate()
import sparkSession.implicits._ // needed for .toDF on an RDD[String]

val myQueryFile = sparkSession.sparkContext.textFile("C:/Users/some_db.sql")

// Convert this to an indexed DataFrame so you can parse multi-line create / data statements.
// This will also show you the structure of the SQL dump for your use case.
val myQueryFileDF = myQueryFile.toDF.withColumn("index", monotonically_increasing_id()).withColumnRenamed("value", "text")

// Identify all tables and data in the SQL dump along with their indexes
val tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
val tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))
println("If there is a count mismatch between these values choose a different substring: " + tableStructures.count() + " " + tableStructureEnds.count())

val tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))
The problem is that the dump contains multiple tables, each of which needs to become a dataset. For that I need to understand whether it can be done for even one table. Is there any .sql parser written for Scala Spark?
Is there a faster way of going about it? Can I read it into Hive directly from the self-contained .sql file?
UPDATE 1: I am writing the parser for this based on the input given by Ajay
UPDATE 2: Changing everything to Dataset-based code to use a SQL parser, as suggested
Is there any .sql parser written for Scala Spark?
Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?
The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?
Please note that ParserInterface deals with a single SQL statement at a time so you'd have to somehow parse the entire dumps and find the table definitions and rows.
The ParserInterface is available as sqlParser of a SessionState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface
Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr, or simply the expr standard function. You may also use the SQL parser directly.
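For instance, a quick sketch in spark-shell, continuing the session above (the DDL text is only an illustration):

// Parse a single SQL statement into a logical plan without executing it.
val plan = spark.sessionState.sqlParser.parsePlan(
  "CREATE TABLE users (id INT, name STRING)")
println(plan.numberedTreeString)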
(Shameless plug) You may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.
You need to parse it yourself. It requires the following steps (a rough sketch follows the list):
Create a class for each table.
Load files using textFile.
Filter out all the statements other than insert statements.
Then split the RDD using filter into multiple RDDs based on the table name present in the insert statement.
For each RDD, use map to parse the values present in the insert statement and create an object.
Now convert the RDDs to Datasets.
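A rough sketch of those steps (the users table, its case class, and the very naive INSERT parsing are illustrative; a real dump with multi-row inserts or commas inside quoted strings needs a proper parser):

import org.apache.spark.sql.SparkSession

// Step 1: a class per table (hypothetical two-column users table).
case class User(id: Long, name: String)

val spark = SparkSession.builder().appName("dumpParser").getOrCreate()
import spark.implicits._

// Step 2: load the dump as lines of text.
val lines = spark.sparkContext.textFile("C:/Users/some_db.sql")

// Steps 3 and 4: keep only INSERT statements and split them by target table.
val userInserts = lines.filter(_.startsWith("INSERT INTO `users`"))

// Step 5: parse the values of each insert statement and build an object.
// Assumes one row per statement and no commas inside quoted strings.
val users = userInserts.map { stmt =>
  val values = stmt.substring(stmt.indexOf("(") + 1, stmt.lastIndexOf(")")).split(",")
  User(values(0).trim.toLong, values(1).trim.stripPrefix("'").stripSuffix("'"))
}

// Step 6: convert the RDD to a Dataset.
val usersDS = users.toDS()
usersDS.show()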

Querying hive complex data types like struct in Superset's SQL LAB

I have been using Superset to query an external table through Hive. This table has columns which are mostly Hive complex data types, like struct.
How would I write a query in SQL LAB that does something like below?
SELECT header.guid
FROM table1
WHERE guid = 'xxxx'
where header is of the struct data type and guid is a member of header.
The problem, as far as I can see, is that pyhive maps the struct data types to string, although I'm not sure how to get around that yet.
I got this working by querying Hive through PrestoDB. PrestoDB needed additional Parquet config in its etc/catalog/hive.properties catalog file:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://<hive_url>:9083
hive.parquet-optimized-reader.enabled=true
hive.parquet-predicate-pushdown.enabled=true

How to parse Nested Json messages from Kafka topic to hive

I'm pretty new to Spark Streaming and Scala. I have JSON data and some other random log data coming in from a Kafka topic. I was able to filter out just the JSON data like this:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet).map(_._2).filter(x => x.matches("^[{].*"))
My JSON data looks like this:
{"time":"125573","randomcol":"abchdre","address":{"city":"somecity","zip":"123456"}}
I'm trying to parse the JSON data and put it into a Hive table.
Can someone please point me in the correct direction?
Thanks
There are multiple ways to do this.
Create an external Hive table with the required columns, pointing to this data location.
When you create the table, you could use the default JSON SerDe and then use the get_json_object Hive function to load this raw data into a final table. Refer to this for the function details; a Spark-side sketch using the same function follows after these options.
OR
You could try the Avro SerDe and specify the Avro schema matching your JSON message when creating the Hive table. Refer to this for an Avro SerDe example.
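For reference, the get_json_object function mentioned in the first option is also exposed in Spark SQL, so the same extraction can be done on the Spark side before writing to Hive. A minimal sketch for one batch of the sample record (the target table mydb.parsed_events is made up, and a Hive-enabled SparkSession is assumed):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder()
  .appName("kafkaJsonToHive")
  .enableHiveSupport() // assumes Hive is configured for this cluster
  .getOrCreate()
import spark.implicits._

// "messages" would be the filtered stream from the question; here a single
// batch is simulated with the sample record.
val batch = Seq("""{"time":"125573","randomcol":"abchdre","address":{"city":"somecity","zip":"123456"}}""")

val parsed = batch.toDF("json").select(
  get_json_object($"json", "$.time").as("time"),
  get_json_object($"json", "$.randomcol").as("randomcol"),
  get_json_object($"json", "$.address.city").as("city"),
  get_json_object($"json", "$.address.zip").as("zip"))

parsed.write.mode("append").saveAsTable("mydb.parsed_events")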
Hope it helps.

Validate JSON data loaded into Hive warehouse

I have JSON files with a total volume of approximately 500 TB. I have loaded the complete set into the Hive data warehouse.
How would I validate or test the data that was loaded into the Hive warehouse? What should my testing strategy be?
The client wants us to validate the JSON data: whether the data loaded into Hive is correct or not, whether anything is missing, and if so, in which field.
Please help.
How is your data being stored in Hive tables?
One option is to create a Hive UDF that receives the JSON string, validates the data, and returns another string with the error message, or an empty string if the JSON string is well formed.
Here is a Hive UDF tutorial: http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html
With the Hive UDF in place you can execute queries like:
select strjson, validateJson(strjson) from jsonTable where validateJson(strjson) != "";
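A minimal sketch of such a UDF, written in Scala (the class name ValidateJson is illustrative, and the Hive exec library and Jackson are assumed to be on the classpath):

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text
import com.fasterxml.jackson.databind.ObjectMapper

// Returns an empty string for well-formed JSON, otherwise the parser's error message.
class ValidateJson extends UDF {
  private val mapper = new ObjectMapper()

  def evaluate(json: Text): Text = {
    if (json == null) return new Text("null value")
    try {
      mapper.readTree(json.toString) // throws on malformed JSON
      new Text("")
    } catch {
      case e: Exception => new Text(e.getMessage)
    }
  }
}

After packaging the class into a JAR and adding it to Hive, it would be registered with something like CREATE TEMPORARY FUNCTION validateJson AS 'ValidateJson' before running the query above.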

Apache Drill Query PostgreSQL Json

I am trying to query a jsonb field from PostgreSQL in Drill and read it as if it were coming from a JSON storage type, but am running into trouble. I can convert from text to JSON but cannot seem to query the JSON object. At least I think I can convert to JSON. My goal is to avoid reading through millions of uneven JSON objects from PostgreSQL, and to perform joins and the like with text files such as CSV files and XML files. Is there a way to query the text field as if it were coming from a JSON storage type without writing large files to disk?
The goal is to generate results implicitly, which neither PostgreSQL nor Pentaho does, and to integrate these data sets with others of any format.
Attempt:
SELECT * FROM (SELECT convert_to(json_field,'JSON') as data FROM postgres.mytable) as q1
Sample Result:
[B#7106dd
Attempt to access an existing field that should be present in any JSON object:
SELECT data[field] FROM (SELECT convert_to(json_field,'JSON') as data FROM postgres.mytable) as q1
Result:
null
Attempting to do anything with jsonb results in a Null Pointer Error.