How to bind dynamic JSON objects to PostgreSQL, using mongodb_fdw?

The Foreign Data Wrapper for MongoDB is pretty awesome! I've gotten it to work using these instructions, apart from:
an object with dynamic fields within it - which PostgreSQL type should I use for this?
{
"key1": some,
...
}
an array of objects - which PostgreSQL type should I use for this? The length of the array may vary, but the objects are uniform in their inner structure.
[ { "a": 1 }, { "a": 2 }, { "a": 3 } ]
I found these slides on JSON capabilities in recent PostgreSQL versions. Neat. But BSON, JSON or JSONB don't seem to be recognized by the FDW as SQL data types.
If I use:
CREATE FOREIGN TABLE t6
(
"aaa.bbb" JSON -- 'bbb' is an array of JSON objects
)
SERVER mongo_server OPTIONS(...);
SELECT "aaa.bbb" AS bbb FROM t6;
I get:
psql:6.sql:152: ERROR: cannot convert bson type to column type
HINT: Column type: 114
The normal types TEXT, FLOAT etc. work.

The EnterpriseDB fork does it, as #pozs pointed out. Just mark your data as the JSON type.
However, the build system is rather bizarre for my taste, and does not really give you the right errors for missing build components (it's obviously Linux-based and simply expects you to have a bunch of tools without properly checking for them).
Here's how I managed to build it on OS X + Homebrew:
$ brew install libtool libbson autoconf automake
$ ./autogen.sh --with-legacy
Note that the --with-meta variant does not provide JSON support, which was the reason I went for this fork anyway.
ref. https://github.com/EnterpriseDB/mongo_fdw/issues/20
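For reference, a minimal sketch of how the foreign table from the question might be declared against the EnterpriseDB fork; the database and collection names in OPTIONS are assumptions, not taken from the original setup:
CREATE FOREIGN TABLE t6
(
    "aaa.bbb" JSON  -- dynamic object / array of objects maps to JSON
)
SERVER mongo_server
OPTIONS (database 'mydb', collection 'aaa');  -- hypothetical names

SELECT "aaa.bbb" AS bbb FROM t6;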

Related

Entity Framework Queries For Complicated JSON Documents (npgsql)

I am handling legacy (old) JSON files that we are now uploading to a database that was built using code-first EF Core (with the JSON elements saved as a jsonb field in a postgresql db, represented as JsonDocument properties in the EF classes). We want to be able to query these massive documents against any of the JSON's many properties. I've been very interested in the excellent docs here https://www.npgsql.org/efcore/mapping/json.html?tabs=data-annotations%2Cpoco, but the problem in our case is that our JSON has incredibly complicated hierarchies.
According to the npgsql/EF doc above, a way to do this for "shallow" json hierarchies would be something like:
myDbContext.MyClass
.Where(e => e.JsonDocumentField.RootElement.GetProperty("FieldToSearch").GetString() == "SearchTerm")
.ToList();
But that only works if the field is directly under the root of the JsonDocument. If the doc is structured like, say,
{"A": {
...
"B": {
...
"C": {
...
"FieldToSearch":
<snip>
Then the above query won't work. There is an alternative, mapping our JSON to an actual POCO model, but this JSON structure (a) may change and (b) is truly massive and would result in some ridiculously complicated objects.
Right now, I'm building SQL strings from field configurations in which I save path strings that locate the fields I want, using PostgreSQL's JSON operators.
Example:
"(JSONDocumentField->'A'->'B'->'C'->>'FieldToSearch')"
and then running that SQL against the DB using
myDbContext.MyClass.FromSqlRaw(sql).ToList();
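For concreteness, the generated statement ends up with roughly this shape (a hypothetical sketch; the table name and the path from the earlier example are assumptions):
SELECT *
FROM "MyClasses"
WHERE "JsonDocumentField"->'A'->'B'->'C'->>'FieldToSearch' = 'SearchTerm';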
This is hacky and I'd much rather do it in a method call. Is there a way to force JsonDocument's GetProperty call to drill down into the hierarchy to find the first/any instance of the property name in question (or another method I'm not aware of)?
Thanks!

Are My Data Better Suited To A CSV Import, Rather Than a JSON Import?

I am trying to force myself to use mongoDB, using the excuse of the "convenience" of it being able to accept JSON data. Of course, it's not as simple as that (it never is!).
At the moment, for this use case, I think I should revert to a traditional CSV import, and possibly a traditional RDBMS (e.g. MariaDB or MySQL). Am I wrong?
I found a possible solution in CSV DATA import to nestable json data, which seems to be a lot of faffing around.
The problem:
I am pulling some data from an online database, which returns data in blocks like this (actually it's all on one line, but I have broken it up to improve readability):
[
[8,1469734163000,50.84516753,0.00021818,2],
[6,1469734163000,50.80342373,0.00021818,2],
[4,1469734163000,50.33066367,0.00021818,2],
[12,1469734164000,40.31650031,0.00021918,2],
[10,1469734164000,11.36652478,0.00021818,2],
[14,1469734165000,52.03905845,0.00021918,2],
[16,1469734168000,57.32,0.00021918,2]
]
According to the command python -mjson.tool this is valid JSON.
But this command barfs
mongoimport --jsonArray --db=bitfinexLendingHistory --collection=fUSD --file=test.json
with
2019-12-31T12:23:42.934+0100 connected to: localhost
2019-12-31T12:23:42.935+0100 Failed: error unmarshaling bytes on document #3: JSON decoder out of sync - data changing underfoot?
2019-12-31T12:23:42.935+0100 imported 0 documents
The named DB and collection already exist.
$ mongo
> use bitfinexLendingHistory
switched to db bitfinexLendingHistory
> db.getCollectionNames()
[ "fUSD" ]
>
I realise that, at this stage, I have no <whatever the mongoDB equivalent of a column header is called in this case> defined, but I suspect the problem above is independent of that.
By wrapping my data above as shown below, I managed to get it imported.
{
"arf":
[
[8,1469734163000,50.84516753,0.00021818,2],
[6,1469734163000,50.80342373,0.00021818,2],
[4,1469734163000,50.33066367,0.00021818,2],
[12,1469734164000,40.31650031,0.00021918,2],
[10,1469734164000,11.36652478,0.00021818,2],
[14,1469734165000,52.03905845,0.00021918,2],
[16,1469734168000,57.32,0.00021918,2]
]
}
Next step is to determine if that is what I want, and if so, work out how to query it.

How to query json file located at s3 using presto

I have a JSON file stored in an Amazon S3 location, and I want to query this file using Presto. How can I achieve this?
Option 1 - Presto on EMR with json_extract built-in function
I am assuming that you have already launched Presto using EMR.
The easiest way to do this would be to use the json_extract function that comes by default with Presto.
So imagine you have a json file on s3 like this:
{"a": "a_value1", "b": { "bb": "bb_value1" }, "c": "c_value1"}
{"a": "a_value2", "b": { "bb": "bb_value2" }, "c": "c_value2"}
{"a": "a_value3", "b": { "bb": "bb_value3" }, "c": "c_value3"}
...
...
Each row represents a json tree object.
So you can simply define a table in Presto with a single field of string type, and then easily query it with json_extract.
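For instance, via the Hive connector such a table might be declared roughly like this (catalog, schema, bucket path and column name are all assumptions):
CREATE TABLE hive.default.my_table (
    json_field varchar
)
WITH (
    format = 'TEXTFILE',                                 -- each line is read as plain text
    external_location = 's3://my-bucket/path/to/json/'   -- hypothetical location
);
The table can then be queried with json_extract: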
SELECT json_extract(json_field, '$.b.bb') as extract
FROM my_table
The result would be something like:
| extract   |
|-----------|
| bb_value1 |
| bb_value2 |
| bb_value3 |
This can be a fast and easy way to read a json file using presto, but unfortunately this doesn't scale well on big json files.
Some presto docs on json_extract: https://prestodb.github.io/docs/current/functions/json.html#json_extract
Option 2 - Presto on EMR with a specific Serde for json files
You can also customize your Presto installation in the bootstrap phase of your EMR cluster by adding custom plugins or SerDe libraries.
So you just have to choose one of the available JSON SerDe libraries (e.g. org.openx.data.jsonserde.JsonSerDe) and follow their guide to define a table that matches the structure of the JSON file.
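As an illustration, for the example file above such a table might be declared through Hive roughly like this (the bucket path and column types are assumptions):
CREATE EXTERNAL TABLE my_json_table (
    a string,
    b struct<bb:string>,
    c string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/path/to/json/';  -- hypothetical location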
You will be able to access the fields of the JSON file in a way similar to json_extract (using the dotted notation), and it should be faster and scale well on big files. Unfortunately this method has two main problems:
1) Defining a table for complex files is like being in hell.
2) You may get internal Java cast exceptions, because the data in the JSON can't always be cast cleanly by the SerDe library.
Option 3 - Athena Built-In JSON Serde
https://docs.aws.amazon.com/athena/latest/ug/json.html
It seems that Athena also has some JSON SerDes built in. I have personally never tried these, but they are managed by AWS, so everything should be easier to set up.
Rather than installing and running your own Presto service, there are some other options you can try:
Amazon Athena is a fully-managed Presto service. You can use it to query large datastores in Amazon S3, including compressed and partitioned data.
Amazon S3 Select allows you to run a query on a single object stored in Amazon S3. This is possibly simpler for your particular use-case.
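As a rough sketch, the SQL expression you would pass to S3 Select for the example JSON-lines object might look like this (field names taken from the earlier sample; the object layout is an assumption):
SELECT s.a, s.b.bb
FROM S3Object s
WHERE s.c = 'c_value1'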

Clickhouse/Kafka: reading a JSON Object type into a field

I have this kind of data in a Kafka Topic:
{..., fields: { "a": "aval", "b": "bval" } }
If I create a Kafka Engine table, I get an error when using a field definition like this:
fields String
because it (correctly) doesn't recognize it as a String:
2018.07.09 17:09:54.362061 [ 27 ] <Error> void DB::StorageKafka::streamThread(): Code: 26, e.displayText() = DB::Exception: Cannot parse JSON string: expected opening quote: (while read the value of key fields): (at row 1)
As ClickHouse does not currently have a Map or JSONObject type, what would be the best way to work with this, given that I don't know the names of the inner fields ("a" or "b" in the example) in advance, so I can't see Nested structures helping?
Apparently, at the moment ClickHouse does not support complex JSON parsing.
From this answer in ClickHouse Github:
Clickhouse uses a quick and dirty JSON parser, which does not know how to read complex deep structures. So it can't skip that field as it does not know where that nested structure ends.
Sorry. :/
So you should preprocess your json with some external tools, or you can contribute to Clickhouse and improve the JSON parser.

Postgres json type inner Query [duplicate]

I am looking for some docs and/or examples for the new JSON functions in PostgreSQL 9.2.
Specifically, given a series of JSON records:
[
{name: "Toby", occupation: "Software Engineer"},
{name: "Zaphod", occupation: "Galactic President"}
]
How would I write the SQL to find a record by name?
In vanilla SQL:
SELECT * from json_data WHERE "name" = "Toby"
The official dev manual is quite sparse:
http://www.postgresql.org/docs/devel/static/datatype-json.html
http://www.postgresql.org/docs/devel/static/functions-json.html
Update I
I've put together a gist detailing what is currently possible with PostgreSQL 9.2.
Using some custom functions, it is possible to do things like:
SELECT id, json_string(data,'name') FROM things
WHERE json_string(data,'name') LIKE 'G%';
Update II
I've now moved my JSON functions into their own project:
PostSQL - a set of functions for transforming PostgreSQL and PL/v8 into a totally awesome JSON document store
Postgres 9.2
I quote Andrew Dunstan on the pgsql-hackers list:
At some stage there will possibly be some json-processing (as opposed
to json-producing) functions, but not in 9.2.
Doesn't prevent him from providing an example implementation in PLV8 that should solve your problem. (Link is dead now, see modern PLV8 instead.)
Postgres 9.3
Offers an arsenal of new functions and operators to add "json-processing".
The manual on new JSON functionality.
The Postgres Wiki on new features in pg 9.3.
The answer to the original question in Postgres 9.3:
SELECT *
FROM json_array_elements(
'[{"name": "Toby", "occupation": "Software Engineer"},
{"name": "Zaphod", "occupation": "Galactic President"} ]'
) AS elem
WHERE elem->>'name' = 'Toby';
Advanced example:
Query combinations with nested array of records in JSON datatype
For bigger tables you may want to add an expression index to increase performance:
Index for finding an element in a JSON array
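A minimal sketch of such an expression index, reusing the things table and data column from Update I above (names assumed):
CREATE INDEX things_name_idx ON things ((data->>'name'));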
Postgres 9.4
Adds jsonb (b for "binary", values are stored as native Postgres types) and yet more functionality for both types. In addition to expression indexes mentioned above, jsonb also supports GIN, btree and hash indexes, GIN being the most potent of these.
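For instance, a GIN index covering a whole jsonb column (again assuming a things table with a jsonb column data) supports the containment and existence operators @>, ?, ?| and ?&:
CREATE INDEX things_data_gin_idx ON things USING gin (data);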
The manual on json and jsonb data types and functions.
The Postgres Wiki on JSONB in pg 9.4
The manual goes as far as suggesting:
In general, most applications should prefer to store JSON data as
jsonb, unless there are quite specialized needs, such as legacy
assumptions about ordering of object keys.
Bold emphasis mine.
Performance benefits from general improvements to GIN indexes.
Postgres 9.5
Completes the set of jsonb functions and operators, and adds more functions to manipulate jsonb in place and for display.
Major good news in the release notes of Postgres 9.5.
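For example, jsonb_set() and jsonb_pretty() are both new in 9.5:
SELECT jsonb_pretty(jsonb_set('{"a": 1}'::jsonb, '{b}', '2'));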
With Postgres 9.3+, just use the -> operator. For example,
SELECT data->'images'->'thumbnail'->'url' AS thumb FROM instagram;
see http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json/ for some nice examples and a tutorial.
With Postgres 9.3, use -> for object access. For example:
seed.rb
se = SmartElement.new
se.data =
{
params:
[
{
type: 1,
code: 1,
value: 2012,
description: 'year of producction'
},
{
type: 1,
code: 2,
value: 30,
description: 'length'
}
]
}
se.save
rails c
SELECT data->'params'->0 as data FROM smart_elements;
returns
data
----------------------------------------------------------------------
{"type":1,"code":1,"value":2012,"description":"year of producction"}
(1 row)
You can continue nesting
SELECT data->'params'->0->'type' as data FROM smart_elements;
returns
data
------
1
(1 row)