I am evaluating whether Postgres would be suited to parsing a number of quite complicated JSON documents to extract and semantically match entities, eventually filling a relational schema with high integrity.
I have found jsonpath very helpful for working with these documents, and I found this article, which suggested that Postgres 11 would have support of sorts. However, the docs do not mention this at all.
My question, then: will support be forthcoming? Also, is this kind of processing suited to Postgres at all? (Possibly use Lucene or MongoDB for the parsing and matching, then import into Postgres relational tables somehow?)
An example of the data could be:
```
{
"event_classes": [
{
"name": "American Football",
"url": "/sportsapi/v2/american-football",
"id": 27
},
{
"name": "Athletics",
"url": "/sportsapi/v2/athletics",
"id": 48
},
{
"name": "Aussie Rules",
"url": "/sportsapi/v2/aussie-rules",
"id": 10000062
},
{
"name": "Badminton",
"url": "/sportsapi/v2/badminton",
"id": 10000069
},
{
"name": "Baseball",
"url": "/sportsapi/v2/baseball",
"id": 5000026
}
]
}
```
SQL/JSON support didn't make it in v11.
It is available from PostgreSQL v12 on.
Your use case is a little vague, but I think that PostgreSQL would be well suited for this kind of processing, particularly if the data should end up in a relational schema.
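Since the end goal is a relational schema, here is a minimal sketch of shredding the sample document from the question into a table with jsonb_to_recordset (the target table is hypothetical, and the document is abbreviated from the question):
```
-- A sketch: shred the "event_classes" array into a relational table.
CREATE TABLE event_class (
    id   bigint PRIMARY KEY,
    name text NOT NULL,
    url  text NOT NULL
);

WITH doc(j) AS (
    VALUES ('{"event_classes": [
        {"name": "American Football", "url": "/sportsapi/v2/american-football", "id": 27},
        {"name": "Athletics", "url": "/sportsapi/v2/athletics", "id": 48}
    ]}'::jsonb)
)
INSERT INTO event_class (id, name, url)
SELECT t.id, t.name, t.url
FROM doc
CROSS JOIN LATERAL jsonb_to_recordset(doc.j -> 'event_classes')
     AS t(id bigint, name text, url text);
```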
As a general point, SQL/JSON path is a more terse way to query a JSONB data structure. It compiles down to the traditional methods of querying JSONB, which makes it a parser feature providing a standard syntax.
IMHO, that standard syntax is substantially better and leaves room for future optimizations; however, any query on JSON can be done with the PostgreSQL operators you've linked to. It's just not always pretty.
Finding out if that array contains {"foo":2} is simple.
WITH t(jsonb) AS ( VALUES ('[{"foo":2, "qux":42},{"bar":2},{"baz":4}]'::jsonb) )
SELECT *
FROM t
WHERE jsonb @> '[{"foo":2}]';
However, it's substantially harder to get the value of qux given the above.
WITH t(jsonb) AS ( VALUES ('[{"foo":2, "qux":42},{"bar":2},{"baz":4}]'::jsonb) )
SELECT e->'qux'
FROM t
CROSS JOIN LATERAL jsonb_array_elements(jsonb) AS a(e)
WHERE t.jsonb @> '[{"foo":2}]'
AND e @> '{"foo":2}';
But that's not the end of the world. It's actually a really nice SQL syntax; it's just not JavaScript. With SQL/JSON path you'll be able to write something like:
SELECT t.json _ '$.[#.foo == 2].qux'
FROM t
WHERE t.json _ '$.[#.foo == 2]';
Here _ stands in for some kind of jsonpath operator. As an aside, you can always write an actual JavaScript stored procedure on the server; it's really dirt simple with PL/v8.
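For the record, the jsonpath syntax that eventually shipped in PostgreSQL 12 is quite close to that sketch. A minimal example against the same data, using the @? existence operator and the jsonb_path_query function (both PostgreSQL 12+):
```
-- PostgreSQL 12+: filter array elements with a jsonpath predicate
-- and extract the "qux" member of the matching element.
WITH t(jsonb) AS ( VALUES ('[{"foo":2, "qux":42},{"bar":2},{"baz":4}]'::jsonb) )
SELECT jsonb_path_query(t.jsonb, '$[*] ? (@.foo == 2).qux')
FROM t
WHERE t.jsonb @? '$[*] ? (@.foo == 2)';
```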
I'm working on an app that generates a potentially very big JSON document; in my tests it reached 8000 rows. This is because it is an aggregation of a year's worth of data, and it is required to display details in the UI.
For example:
"voice1": {
"sum": 24000,
"items": [
{
"price": 2000,
"description": "desc1",
"date": "2021-11-01T00:00:00.000Z",
"info": {
"Id": "85fda619bbdc40369502ec3f792ae644",
"address": "add2",
"images": {
"icon": "img.png",
"banner": null
}
}
},
{
"price": 2000,
"description": "desc1",
"date": "2021-11-01T00:00:00.000Z",
"info": {
"Id": "85fda619bbdc40369502ec3f792ae644",
"address": "add2",
"images": {
"icon": "img.png",
"banner": null
}
}
}
]
},
The point is that I can potentially have 10 voices, each with dozens of items.
I was wondering if you can point me to some best practices, or if you have some tips, because I've got the feeling this can be done better.
It sounds like you are finding out that JSON is a rather verbose format (not as bad as XML, but still quite verbose). If you are worried about the size of messages between server and client, you have a few options:
JSON compresses rather well: most tokens repeat many times, so make sure to gzip (or Snappy) the payload before sending it to clients. This will drastically reduce the size, at some performance cost for inflating/deflating.
The other alternative is not to use JSON for transfer but a more optimized format. One of the best options here is FlatBuffers. It requires you to provide schemas for the data that you are sending, but it is an optimized binary format with minimal overhead. It will also drastically speed up your application, because it removes the need for serialization/deserialization, which takes significant time with JSON. Another popular, but slightly slower, alternative is Protobuf.
The only thing immediately obvious to me is that you would likely want to make a list of voices (like you have for items) rather than voice1, voice2, etc.
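For example, the reshaped structure might look something like this (a sketch based on your sample; the elided parts stay as in your example):
```
{
  "voices": [
    { "name": "voice1", "sum": 24000, "items": [ ... ] },
    { "name": "voice2", "sum": ...,   "items": [ ... ] }
  ]
}
```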
Beyond that, it really just depends on the structure of the data you start with (to create the JSON) and the structure of the data or code at the destination (and possibly also the method of transferring data, if size is a concern). If you're doing a significant amount of processing on either end to encode/decode the JSON, that can suggest there's a simpler way to structure the data. Can you share some additional context or examples of the overall process?
I have a big set of JSON data inside a database. The data wasn't supposed to be queried, so it was stored in a very messy way... This is the structure:
{
"0": {"key": "Developer(s)", "values": ["Capcom"]},
"1": {"key": "Publisher(s)", "values": ["Capcom"]},
"2": {"key": "Producer(s)", "values": ["Tokuro Fujiwara"]},
"3": {"key": "Composer(s)", "values": ["Setsuo Yamamoto"]},
"4": {"key": "Series", "values": ["X-Men"]},
"6": {"key": "Release", "values": ["EU:", " 1995"]},
"7": {"key": "Mode(s)", "values": ["Single-player"]}
}
I need to query inside the DB to check which records have which property (i.e. all records with a "Release" key inside, all that contain the value "Capcom" inside the Developer key, etc.).
Can someone point me in the right direction? I have found only examples with simple structures (i.e. { "key": "value" }); here the key is an index number, and the value is an object with two different keys...
Should I find a way to rewrite all the data, or is there an easier way?
P.S. I'm building a Laravel application on top of this data, so I can also use an Eloquent approach.
Thanks in advance
As @Guy L mentioned in his answer, you can use LIKE or REGEXP, but it will be expensive.
Example:
SELECT * FROM table WHERE json_column LIKE '%"Release"%';
Answering:
Should I find a way to rewrite all the data or there is something easy?
Consider, how often do you have to access this data?
A NoSQL database like MongoDB is a really good fit for data like this; I have been using Mongo and I am happy with it.
You can easily migrate your data to MongoDB and use an Eloquent-like ORM such as https://github.com/jenssegers/laravel-mongodb to communicate with Mongo from your Laravel project.
Hope it helps you arrive at a solution.
Please refer to
https://dev.mysql.com/doc/refman/8.0/en/json-search-functions.html
which describes functions that allow you to search for a specific value at a specific place in the JSON structure.
However, if the data is now supposed to be searchable, I would suggest redesigning/converting the data.
Querying the JSON as a string can be done using the LIKE or REGEXP operators,
but in general, since the strings are long and complex, it's really not recommended.
The best way is to reload the info into proper tables, stored in a relational way.
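For example, a normalized layout could look like this (a rough sketch; table and column names are made up):
```
-- One row per (record, property, value) triple.
CREATE TABLE record_properties (
    record_id  INT          NOT NULL,
    prop_key   VARCHAR(64)  NOT NULL,
    prop_value VARCHAR(255) NOT NULL
);

-- All records that have a "Release" property:
SELECT DISTINCT record_id
FROM record_properties
WHERE prop_key = 'Release';

-- All records with "Capcom" among the "Developer(s)" values:
SELECT DISTINCT record_id
FROM record_properties
WHERE prop_key = 'Developer(s)'
  AND prop_value = 'Capcom';
```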
If your MySQL version is 5.7 or above, you can try some options with its JSON support:
https://dev.mysql.com/doc/refman/8.0/en/json.html#json-paths
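For instance, with the structure above, JSON_SEARCH with a wildcard path can answer both questions (a sketch; the games table and json_column names are assumptions):
```
-- Records that have a "Release" key anywhere in the document:
SELECT *
FROM games
WHERE JSON_SEARCH(json_column, 'one', 'Release', NULL, '$**.key') IS NOT NULL;

-- Records that contain the string "Capcom" in any "values" array:
SELECT *
FROM games
WHERE JSON_SEARCH(json_column, 'one', 'Capcom', NULL, '$**.values') IS NOT NULL;
```
Note that the second query matches "Capcom" under any key; tying the match to one specific key (e.g. only Developer(s)) is where this structure gets painful, which supports the advice to convert the data.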
Thanks for the suggestions about using MySQL's JSON-specific functions. The column is already of JSON type, so I can definitely use them, but unfortunately, even after reading the MySQL documentation, I can't figure out how to solve my problem. Honestly, I'm not a data specialist and I'm kind of confused when dealing with databases, so any examples would be appreciated.
So far I've tried to mess around with queries, but I wasn't able to find a correct "selector" for the keys of my dataset; using '$[0]' returns only the first entry. I'd need some hints on creating the right syntax using JSON_EXTRACT, JSON_CONTAINS, etc.
Thanks everyone; in the end I've decided to fetch all the data and store it again properly.
Background: We have all seen several ways to configure a distributed application. For my purposes, two of them stand out:
1. Have a massive database that all nodes have access to. Each node knows its own identity, and so can perform queries against said database to pull out the configuration information specific to itself.
2. Use tailored (i.e., specific to each node) configuration files (e.g., JSON) so that the nodes do not have to touch a database at all. They simply read the tailored config file and do what it says.
There are pros and cons to each. For my purposes, I would like to explore #2 a little further, but the problem I'm running into is that the JSON files can get pretty big. I'm wondering if anyone knows of a DSL that is well suited to generating these JSON files.
Step-by-step examples to illustrate what I mean:
Suppose I make up this metalanguage that looks like this:
bike.%[0..3](Vehicle)
This would then output the following JSON:
{
"bike.0":
{
"type": "Vehicle"
},
"bike.1":
{
"type": "Vehicle"
},
"bike.2":
{
"type": "Vehicle"
},
"bike.3":
{
"type": "Vehicle"
}
}
The idea is that we've just created 4 bikes, each of which is of type Vehicle.
Going further:
bike[i=0..3](Vehicle)
label: "hello %i"
label.3: "last"
Now what this does is to name the index variable 'i' so that it can be used for the configuration information of each item. The JSON that would be output would be something like this:
{
"bike.0":
{
"type": "Vehicle",
"cfg":
{
"label": "hello 0"
}
},
"bike.1":
{
"type": "Vehicle",
"cfg":
{
"label": "hello 1"
}
},
"bike.2":
{
"type": "Vehicle",
"cfg":
{
"label": "hello.2"
}
},
"bike.3":
{
"type": "Vehicle",
"cfg":
{
"label": "last"
}
}
}
You can see how the last label was overridden, so this is a way to sparsely specify stuff. Is there anything already out there that lets one do this?
Thanks!
Rather than thinking of the metalanguage as a monolithic entity, it might be better to divide it into three parts:
An input specification. You can use a configuration file syntax to hold this specification.
A library or utility that can use print statements and for-loops to generate runtime configuration files. The Apache Velocity template engine comes to mind as something that is suitable for this purpose. I suggest you look at its user guide to get a feel for what it can do.
Some glue code to join together the above two items. In particular, the glue code reads name=value pairs from the input specification, and passes them to the template engine, which uses them as parameters to "instantiate" the template versions of the configuration files that you want to generate.
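As a rough sketch of item 2, a Velocity template expanding the bike example could look like this (VTL, using Velocity 1.7's $foreach.hasNext; in practice the loop bound would come from the input specification rather than being hard-coded):
```
{
#foreach( $i in [0..3] )
  "bike.$i": { "type": "Vehicle" }#if( $foreach.hasNext ),#end
#end
}
```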
My answer to another StackOverflow question provides some more details of the above idea.
Our REST API allows users to add custom schemaless JSON to some of our REST resources, and we need it to be searchable in Elasticsearch. This custom data and its structure can be completely different across resources of the same type.
Consider this example document:
{
"givenName": "Joe",
"username": "joe",
"email": "joe#mailinator.com",
"customData": {
"favoriteColor": "red",
"someObject": {
"someKey": "someValue"
}
}
}
All fields except customData adhere to a schema. customData is always a JSON Object, but all the fields and values within that Object can vary dramatically from resource to resource. There is no guarantee that any given field name or value (or even value type) within customData is the same across any two resources as users can edit these fields however they wish.
What is the best way to support search for this?
We thought a solution would be to just not create any mapping for customData when the index is created, but then it becomes unqueryable (which is contrary to what the ES docs say). This would be the ideal solution if queries on non-mapped properties worked and there were no performance problems with this approach. However, after running multiple tests, we haven't been able to get it to work.
Is this something that needs any special configuration? Or are the docs incorrect? Some clarification as to why it is not working would be greatly appreciated.
Since this is not currently working for us, we’ve thought of a couple alternative solutions:
Reindexing: this would be costly as we would need to reindex every index that contains that document and do so every time a user updates a property with a different value type. Really bad for performance, so this is likely not a real option.
Use multi-match query: we would do this by appending a random string to the customData field name every time there is a change in the customData object. For example, this is what the document being indexed would look like:
{
"givenName": "Joe",
"username": "joe",
"email": "joe#mailinator.com",
"customData_03ae8b95-2496-4c8d-9330-6d2058b1bbb9": {
"favoriteColor": "red",
"someObject": {
"someKey": "someValue"
}
}
}
This means ES would create a new mapping for each 'random' field, and we would use a phrase multi-match query with a "starts with" wildcard for the field names when performing the queries. For example:
curl -XPOST 'eshost:9200/test/_search?pretty' -d '
{
"query": {
"multi_match": {
"query" : "red",
"type" : "phrase",
"fields" : ["customData_*.favoriteColor"]
}
}
}'
This could be a viable solution, but we are concerned that having too many mappings like this could affect performance. Are there any performance repercussions for having too many mappings on an index? Maybe periodic reindexing could alleviate having too many mappings?
This also just feels like a hack and something that should be handled by ES natively. Am I missing something?
Any suggestions about any of this would be much appreciated.
Thanks!
You're correct that Elasticsearch is not truly schemaless. If no mapping is specified, Elasticsearch infers field type primitives based upon the first value it sees for that field. Therefore your non-deterministic customData object can get you in trouble if you first see "favoriteColor": 10 followed by "favoriteColor": "red".
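For example, against a hypothetical index you can see the conflict like this (a sketch; index and type names are made up):
```
# The first document dynamically maps customData.favoriteColor as a number:
curl -XPOST 'eshost:9200/test/user/1' -d '{"customData": {"favoriteColor": 10}}'

# Indexing this one now fails with a mapping/parsing exception,
# because "red" cannot be coerced to the numeric mapping:
curl -XPOST 'eshost:9200/test/user/2' -d '{"customData": {"favoriteColor": "red"}}'
```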
For your requirements, you should take a look at the SIREn Solutions Elasticsearch plugin, which provides a schemaless solution coupled with an advanced query language (using Twig) and a custom Lucene index format to speed up indexing and search operations for non-deterministic data.
Fields with the same mapping will be stored as the same Lucene field in the Lucene index (Elasticsearch shard). Each distinct Lucene field has a separate inverted index (term dictionary and index entries) and separate doc values. Lucene is highly optimized to store values of the same field in a compressed way; using a mapping with a different field for each document prevents Lucene from applying this optimization.
You should use Elasticsearch Nested Document to search efficiently. The underlying technology is Lucene BlockJoin, which indexes parent/child documents as a document block.
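Concretely, that usually means reshaping customData into an array of key/value pairs and mapping it as nested, along these lines (a sketch; index, type, and field names are assumptions, and the exact string field types depend on your Elasticsearch version):
```
# Map customData as nested key/value pairs:
curl -XPUT 'eshost:9200/test/_mapping/user' -d '
{
  "properties": {
    "customData": {
      "type": "nested",
      "properties": {
        "key":   { "type": "string", "index": "not_analyzed" },
        "value": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'

# Find documents whose favoriteColor is red:
curl -XPOST 'eshost:9200/test/_search' -d '
{
  "query": {
    "nested": {
      "path": "customData",
      "query": {
        "bool": {
          "must": [
            { "term": { "customData.key":   "favoriteColor" } },
            { "term": { "customData.value": "red" } }
          ]
        }
      }
    }
  }
}'
```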
I am looking for some docs and/or examples for the new JSON functions in PostgreSQL 9.2.
Specifically, given a series of JSON records:
[
{"name": "Toby", "occupation": "Software Engineer"},
{"name": "Zaphod", "occupation": "Galactic President"}
]
How would I write the SQL to find a record by name?
In vanilla SQL:
SELECT * from json_data WHERE "name" = "Toby"
The official dev manual is quite sparse:
http://www.postgresql.org/docs/devel/static/datatype-json.html
http://www.postgresql.org/docs/devel/static/functions-json.html
Update I
I've put together a gist detailing what is currently possible with PostgreSQL 9.2.
Using some custom functions, it is possible to do things like:
SELECT id, json_string(data,'name') FROM things
WHERE json_string(data,'name') LIKE 'G%';
Update II
I've now moved my JSON functions into their own project:
PostSQL - a set of functions for transforming PostgreSQL and PL/v8 into a totally awesome JSON document store
Postgres 9.2
I quote Andrew Dunstan on the pgsql-hackers list:
At some stage there will possibly be some json-processing (as opposed
to json-producing) functions, but not in 9.2.
Doesn't prevent him from providing an example implementation in PLV8 that should solve your problem. (Link is dead now, see modern PLV8 instead.)
Postgres 9.3
Offers an arsenal of new functions and operators to add "json-processing".
The manual on new JSON functionality.
The Postgres Wiki on new features in pg 9.3.
The answer to the original question in Postgres 9.3:
SELECT *
FROM json_array_elements(
'[{"name": "Toby", "occupation": "Software Engineer"},
{"name": "Zaphod", "occupation": "Galactic President"} ]'
) AS elem
WHERE elem->>'name' = 'Toby';
Advanced example:
Query combinations with nested array of records in JSON datatype
For bigger tables you may want to add an expression index to increase performance:
Index for finding an element in a JSON array
Postgres 9.4
Adds jsonb (b for "binary", values are stored as native Postgres types) and yet more functionality for both types. In addition to expression indexes mentioned above, jsonb also supports GIN, btree and hash indexes, GIN being the most potent of these.
The manual on json and jsonb data types and functions.
The Postgres Wiki on JSONB in pg 9.4
The manual goes as far as suggesting:
In general, most applications should prefer to store JSON data as
jsonb, unless there are quite specialized needs, such as legacy
assumptions about ordering of object keys.
Bold emphasis mine.
Performance benefits from general improvements to GIN indexes.
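For example, a containment query backed by a GIN index (a minimal sketch, reusing the hypothetical things table from above):
```
-- jsonb_path_ops indexes are smaller and faster than the default
-- jsonb_ops, but support only the containment operator @>.
CREATE INDEX things_data_gin ON things USING gin (data jsonb_path_ops);

-- This query can use the index:
SELECT * FROM things WHERE data @> '{"name": "Toby"}';
```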
Postgres 9.5
Completes the jsonb functions and operators, and adds more functions to manipulate jsonb in place and for display.
Major good news in the release notes of Postgres 9.5.
With Postgres 9.3+, just use the -> operator. For example,
SELECT data->'images'->'thumbnail'->'url' AS thumb FROM instagram;
see http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json/ for some nice examples and a tutorial.
With Postgres 9.3, use -> for object access. For example:
seed.rb
se = SmartElement.new
se.data =
{
params:
[
{
type: 1,
code: 1,
value: 2012,
description: 'year of production'
},
{
type: 1,
code: 2,
value: 30,
description: 'length'
}
]
}
se.save
rails c
SELECT data->'params'->0 as data FROM smart_elements;
returns
data
----------------------------------------------------------------------
{"type":1,"code":1,"value":2012,"description":"year of producction"}
(1 row)
You can continue nesting
SELECT data->'params'->0->'type' as data FROM smart_elements;
returns
data
------
1
(1 row)