Apache Drill Query PostgreSQL Json - json

I am trying to query a jsonb field in PostgreSQL in drill and read it as if were coming from a json storage type but am running into trouble. I can conver from text to json but cannot seem to query the json object. At least I think I can convert to JSON. My goal is to avoid reading through millions of uneven json objects from PostgreSQL, perform joins and things with text files such as CSV files and XML files. Is there a way to query the text field as if it were coming from a json storage type without writing large files to disk?
The goal is to generate results implicitly which PostgreSQL nor Pentaho do and integrate these data sets with others of any format.
Attempt:
SELECT * FROM (SELECT convert_to(json_field,'JSON') as data FROM postgres.mytable) as q1
Sample Result:
[B#7106dd
Attempt to existing field that should be in any json object:
SELECT data[field] FROM (SELECT convert_to(json_field,'JSON') as data FROM postgres.mytable) as q1
Result:
null
Attempting to do anything with jsonb results in a Null Pointer Error.

Related

JSON format in wordpress database

I need to import JSON data in a WP database. I found the correct table of the database but the example JSON data present in table is not in a normal JSON format.
I need to import:
{"nome": "Pippo","cognome": "Paperino"}
but the example data in the table is:
a:2:{s:4:"nome";s:5:"Pippo";s:7:"cognome";s:8:"Paperino";}
How i can convert my JSON to "WP JSON"?
The data is serialized, that's why it looks weird. You can use maybe_unserialize() in WordPress, this function will unserialize the data if it was serialized.
https://developer.wordpress.org/reference/functions/maybe_unserialize/
Some functions serialize data before saving it in wordpress, and some will also unserialize when pulling from the DB. So depending on how you save it, and how you later extract the data, you might end up with serialized data.

How to avoid serialization of postgres json result in Sequel

In a web server, I'm running a query in Postgres that performs a select json_build_object of an aggregation(json_agg). I'm using Sequel to fetch this data, and I'm getting as a value an instance of Sequel::Postgres::JSONHash. Then, I perform a to_json in order to send it to the web client.
The output of this query is very big and I think that it could be more performant If I could grab the raw json response of postgres directly and send it to the client, instead of parsing it into JSONHash (which is done by Sequel) and then converting it again to json.
How could I do it?
Cast the json_build_object from json to text in the query: SELECT json_build_object(...)::text ...

write a spark Dataset to json with all keys in the schema, including null columns

I am writing a dataset to json using:
ds.coalesce(1).write.format("json").option("nullValue",null).save("project/src/test/resources")
For records that have columns with null values, the json document does not write that key at all.
Is there a way to enforce null value keys to the json output?
This is needed since I use this json to read it onto another dataset (in a test case) and cannot enforce a schema if some documents do not have all the keys in the case class (I am reading it by putting the json file under resources folder and transforming to a dataset via RDD[String], as explained here: https://databaseline.bitbucket.io/a-quickie-on-reading-json-resource-files-in-apache-spark/)
I agree with #philantrovert.
ds.na.fill("")
.coalesce(1)
.write
.format("json")
.save("project/src/test/resources")
Since DataSets are immutable you are not altering the data in ds and you can process it (complete with null values and all) in any following code. You are simply replacing null values with an empty string in the saved file.
Since Pyspark 3, one can use the ignoreNullFields option when writing to a JSON file.
spark_dataframe.write.json(output_path,ignoreNullFields=False)
Pyspark docs: https://spark.apache.org/docs/3.1.1/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.json

Spark SQL on Postgresql JSONB data

The current Postgresql version (9.4) supports json and jsonb data type as described in http://www.postgresql.org/docs/9.4/static/datatype-json.html
For instance, JSON data stored as jsonb can be queried via SQL query:
SELECT jdoc->'guid', jdoc->'name'
FROM api
WHERE jdoc #> '{"company": "Magnafone"}';
As a Sparker user, is it possible to send this query into Postgresql via JDBC and receive the result as DataFrame?
What I have tried so far:
val url = "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar"
val df = sqlContext.load("jdbc",
Map("url"->url,"dbtable"->"mydb", "driver"->"org.postgresql.Driver"))
df.registerTempTable("table")
sqlContext.sql("SELECT data->'myid' FROM table")
But sqlContext.sql() was unable to understand the data->'myid' part in the SQL.
It is not possible to query json / jsonb fields dynamically from Spark DataFrame API. Once data is fetched to Spark it is converted to string and is no longer a queryable structure (see: SPARK-7869).
As you've already discovered you can use dbtable / table arguments to pass a subquery directly to the source and use it to extract fields of interest. Pretty much the same rule applies to any non-standard type, calling stored procedures or any other extensions.

Convert file of JSON objects to Parquet file

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.
Is there any way to do this without first loading the data into Hive, etc and then using one of the Parquet connectors to generate an output file?
Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.
First, you would infer the schema of your JSON:
kite-dataset json-schema sample-file.json -o schema.avsc
Then you can use that file to create a Parquet Hive table:
kite-dataset create mytable --schema schema.avsc --format parquet
And finally, you can load your JSON into the dataset.
kite-dataset json-import sample-file.json mytable
You can also import an entire directly stored in HDFS. In that case, Kite will use a MR job to do the import.
You can actually use Drill itself to create a parquet file from the output of any query.
create table student_parquet as select * from `student.json`;
The above line should be good enough. Drill interprets the types based on the data in the fields. You can substitute your own query and create a parquet file.
To complete the answer of #rahul, you can use drill to do this - but I needed to add more to the query to get it working out of the box with drill.
create table dfs.tmp.`filename.parquet` as select * from dfs.`/tmp/filename.json` t
I needed to give it the storage plugin (dfs) and the "root" config can read from the whole disk and is not writable. But the tmp config (dfs.tmp) is writable and writes to /tmp. So I wrote to there.
But the problem is that if the json is nested or perhaps contains unusual characters, I would get a cryptic
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IndexOutOfBoundsException:
If I have a structure that looks like members: {id:123, name:"joe"} I would have to change the select to
select members.id as members_id, members.name as members_name
or
select members.id as `members.id`, members.name as `members.name`
to get it to work.
I assume the reason is that parquet is a "column" store so you need columns. JSON isn't by default so you need to convert it.
The problem is I have to know my json schema and I have to build the select to include all the possibilities. I'd be happy if some knows a better way to do this.