Searching through JSON data stored in MySQL

I have a big set of JSON data inside a database. The data wasn't supposed to be queried, so it was stored in a very messy way... this is the structure:
{
"0": {"key": "Developer(s)", "values": ["Capcom"]},
"1": {"key": "Publisher(s)", "values": ["Capcom"]},
"2": {"key": "Producer(s)", "values": ["Tokuro Fujiwara"]},
"3": {"key": "Composer(s)", "values": ["Setsuo Yamamoto"]},
"4": {"key": "Series", "values": ["X-Men"]},
"6": {"key": "Release", "values": ["EU:", " 1995"]},
"7": {"key": "Mode(s)", "values": ["Single-player"]}
}
I need to query the db to verify which records have which property (e.g. all records with a "Release" key inside, all that contain the value "Capcom" inside the "Developer(s)" key, etc.)
Can someone point me in the right direction? I found only examples with simple structures (i.e. { "key": "value" }); here the key is an index number, and the value is an object with two members of its own...
Should I find a way to rewrite all the data, or is there something easier?
P.S. I'm building a Laravel application on top of this data, so I can also use an Eloquent approach.
Thanks in advance

As @Guy L mentioned in his answer, you can use LIKE or REGEXP, but it will be expensive.
example:
SELECT * FROM table WHERE json_column LIKE '%"Release"%';
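A REGEXP variant is similar (a sketch; the pattern assumes MySQL's normalized JSON text, which puts a space after the colon):
SELECT * FROM table WHERE json_column REGEXP '"key": "Release"';
Either way, both operators force a full scan of the JSON text, so neither can use an index.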
answering:
Should I find a way to rewrite all the data or there is something easy?
Consider how often you have to access this data.
A NoSQL database like MongoDB is a really good fit for data like this; I have been using Mongo and I am happy with it.
You can easily migrate your data to MongoDB and use an ORM similar to Eloquent, such as https://github.com/jenssegers/laravel-mongodb, to communicate with Mongo from your Laravel project.
Hope it helps you arrive at a solution.

Please refer to
https://dev.mysql.com/doc/refman/8.0/en/json-search-functions.html
which allows you to search for a specific value at a specific place in the JSON structure.
However, I would suggest, if the data is NOW supposed to be searchable, redesigning/converting the data.

Querying JSON as a string can be done using the LIKE or REGEXP operators.
But in general, since the strings are long and complex, it's really not recommended.
The best way is reloading the info into proper tables, stored in an SQL way (see the sketch below).
If your MySQL version is 5.7 or above, you can try some options with its JSON support:
https://dev.mysql.com/doc/refman/8.0/en/json.html#json-paths
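As a sketch of the proper-tables option (the table and column names here are made up for illustration):

CREATE TABLE game_properties (
    game_id    INT          NOT NULL,  -- FK to the main games table
    prop_key   VARCHAR(64)  NOT NULL,  -- e.g. 'Developer(s)', 'Release'
    prop_value VARCHAR(255) NOT NULL,  -- one row per element of "values"
    INDEX idx_key_value (prop_key, prop_value)
);

-- all records with a "Release" key:
SELECT DISTINCT game_id FROM game_properties WHERE prop_key = 'Release';

-- all records with "Capcom" inside the Developer(s) key:
SELECT game_id FROM game_properties
WHERE prop_key = 'Developer(s)' AND prop_value = 'Capcom';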

Thanks for the suggestions about using MySQL's JSON-specific functions. The column is already of JSON type, so I can definitely use them, but unfortunately, even after reading the MySQL documentation, I can't figure out how to solve my problem. Honestly, I'm not a data specialist and I get confused when dealing with databases, so any examples will be appreciated.
So far I've tried to mess around with queries, but I wasn't able to find a correct "selector" for the keys of my dataset; using '$[0]' returns only the first entry. I'd need some hints on creating the right syntax using JSON_EXTRACT, JSON_CONTAINS, etc.
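For reference, queries along these lines should work on the sample structure above (a sketch, assuming MySQL 5.7+ and hypothetical table/column names games and data; the $.* wildcard steps over the numeric top-level keys):

-- all records that have a "Release" entry anywhere in the object
SELECT *
FROM games
WHERE JSON_SEARCH(data, 'one', 'Release', NULL, '$.*.key') IS NOT NULL;

-- all records whose "Developer(s)" entry contains the value "Capcom":
-- JSON_SEARCH returns the path of the matching "key" member (e.g. $."0".key);
-- swapping the trailing .key for .values gives the path of the sibling array,
-- which JSON_CONTAINS can then test for the value.
SELECT *
FROM games
WHERE JSON_CONTAINS(
        data,
        '"Capcom"',
        REPLACE(
          JSON_UNQUOTE(JSON_SEARCH(data, 'one', 'Developer(s)', NULL, '$.*.key')),
          '.key', '.values'));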

Thanks everyone. In the end I decided to fetch all the data and store it again properly.

Related

Pact Consumer / Provider based on data type and not on data value

We are currently using Pact-Broker in our Spring Boot application with really good results for our integration tests.
Our tests using Pact-Broker are based on calling a REST API and comparing the response with the value in our provider, always using JSON format.
Our problem is that the values to compare live in a DB where the data changes quite often, which makes us update the tests really often.
Do you know if it is possible to just validate by the data type?
What we would like to try is to validate that the JSON is properly formed and the data type match, for example, if our REST API gives this output:
[
{
"action": "VIEW",
"id": 1,
"module": "A",
"section": "pendingList",
"state": null
},
{
"action": "VIEW",
"id": 2,
"module": "B",
"section": "finished",
"state": null
}
]
For example, what we would like to validate from the previous output is the following:
The JSON is well formed.
All the key/value pairs exist, based on the model.
The values match a specific data type; for example, that the key action exists in all the entries and contains a string data type.
Do you know if this can be accomplished with Pact-Broker? I searched the documentation but did not find any example of how to do it.
Thanks a lot in advance.
Best regards.
Absolutely! The first two things Pact will always do without any extra work.
What you are talking about is referred to as flexible matching [1]. You don't want to match the value, but the type (or a regex). Given you are using Spring Boot, you may want to look at the various matchers available for Pact JVM [2].
I'm not sure if you meant it, but just for clarity, Pact and Pact Broker are separate things. Pact is the Open Source contract-testing framework, and Pact Broker [3] is a tool to help share and collaborate on those contracts with the team.
[1] https://docs.pact.io/getting_started/matching
[2] https://github.com/DiUS/pact-jvm/tree/master/consumer/pact-jvm-consumer#dsl-matching-methods
[3] https://github.com/pact-foundation/pact_broker/

How to extract all elements in a JSON array using Redshift?

I want to extract, from a JSON which has more JSONs nested inside, all the elements whose title is 'title2'. I have the code working on MySQL but I cannot translate it into Redshift.
JSON structure:
{"master-title": [{"title": "a", "title2": "b"},{"title": "c", "title2: "d", "title3": "e"}], "master-title2": [{"title": "f", "title2": "g", "title3": "h"},{"title": "i", "title2": "j", "title3": "k"}]}
MySQL query (works as desired):
select id
      ,json_extract(myJSON, '$**.title2')
from myTable
MySQL output:
["b", "d", "g", "j"]
My problem is that on Redshift I can only specifically define the path as:
JSON_EXTRACT_PATH_TEXT(myJSON, 'master-title2',0,'title')
So I can only get one element instead of all of them.
Any idea how to evaluate all paths and to get all elements in a JSON array which have the same "title2" using Redshift? (same output as in MySQL)
Thank you in advance.
Redshift has only a very rudimentary set of JSON manipulation functions (basically JSON_EXTRACT_PATH_TEXT and JSON_EXTRACT_ARRAY_ELEMENT_TEXT). It's not enough to deal with schemaless JSON.
Python UDF
If Redshift were my only means of processing data, I would give a Python UDF a try. You can code a function in imperative Python; then, having a column holding your JSON object, just call that function on all elements to do custom extraction.
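A minimal sketch of such a UDF (the function name f_extract_title2 is made up; it assumes the structure from the question):

CREATE OR REPLACE FUNCTION f_extract_title2(j VARCHAR(MAX))
RETURNS VARCHAR(MAX)
IMMUTABLE
AS $$
    # collect every "title2" value from every nested array in the document
    import json
    doc = json.loads(j)
    out = []
    for arr in doc.values():
        for item in arr:
            if 'title2' in item:
                out.append(item['title2'])
    return json.dumps(out)
$$ LANGUAGE plpythonu;

SELECT id, f_extract_title2(myJSON) FROM myTable;
-- e.g. ["b", "d", "g", "j"] (element order may differ)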
Unnesting JSON arrays
The other option would be to really try to understand the schema and implement it using the two JSON functions mentioned before (this SO answer will give you an idea of how to explode/unnest a JSON array in Redshift). Provided your JSON is not arbitrarily nested but follows some patterns, this could work; a sketch follows below.
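For illustration, a sketch for one of the nested arrays, assuming a small helper table seq(i) holding the integers 0..N (a numbers table is a common Redshift idiom):

SELECT t.id,
       JSON_EXTRACT_PATH_TEXT(
           JSON_EXTRACT_ARRAY_ELEMENT_TEXT(
               JSON_EXTRACT_PATH_TEXT(t.myJSON, 'master-title2'), s.i),
           'title2') AS title2
FROM myTable t, seq s
WHERE s.i < JSON_ARRAY_LENGTH(JSON_EXTRACT_PATH_TEXT(t.myJSON, 'master-title2'));
-- repeat with 'master-title' and UNION ALL to cover both top-level arrays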
Regex (better don't)
Another desperate approach would be to try to extract your data with regex; it could work for simple cases, but it's an easy way to shoot yourself in the foot.
Thanks for your answer.
I finally found a solution using Python. I hope it may help some others.
# count the occurrences of "title2" in each row's JSON string
count = [x.count("title2") for x in df['myJSON'].tolist()]

Manipulating (nested) JSON keys and their values using NiFi

I am currently facing an issue where I have to read a JSON file that has mostly the same structure, has about 10k+ lines, and is nested.
I thought about creating my own custom processor which reads the JSON and replaces several matching keys/values with the ones needed. As I am trying to use NiFi, I assume there should be a more comfortable way, since the JSON structure itself is mostly consistent.
I already tried using the ReplaceText processor as well as the JoltTransformJson processor, but I could not figure it out. How can I transform both keys and values if needed? For example, if there is something like this:
{
"id": "test"
},
{
"id": "14"
}
It might be necessary to turn the "id" key into "Number" and map "test" to "3", as I am using different keys/values in my JSON files/database, so they need to fit those. Is there a way of doing so without having to create my own processor?
Regards,
Steve

Schemaless Support for Elastic Search Queries

Our REST API allows users to add custom schemaless JSON to some of our REST resources, and we need it to be searchable in Elasticsearch. This custom data and its structure can be completely different across resources of the same type.
Consider this example document:
{
"givenName": "Joe",
"username": "joe",
"email": "joe#mailinator.com",
"customData": {
"favoriteColor": "red",
"someObject": {
"someKey": "someValue"
}
}
}
All fields except customData adhere to a schema. customData is always a JSON Object, but all the fields and values within that Object can vary dramatically from resource to resource. There is no guarantee that any given field name or value (or even value type) within customData is the same across any two resources as users can edit these fields however they wish.
What is the best way to support search for this?
We thought a solution would be to just not create any mapping for customData when the index is created, but then it becomes unqueryable (which is contrary to what the ES docs say). This would be the ideal solution if queries on non-mapped properties worked and there were no performance problems with this approach. However, after running multiple tests, we haven't been able to get it to work.
Is this something that needs any special configuration? Or are the docs incorrect? Some clarification as to why it is not working would be greatly appreciated.
Since this is not currently working for us, we’ve thought of a couple alternative solutions:
Reindexing: this would be costly, as we would need to reindex every index that contains that document, and do so every time a user updates a property with a different value type. Really bad for performance, so this is likely not a real option.
Use multi-match query: we would do this by appending a random string to the customData field name every time there is a change in the customData object. For example, this is what the document being indexed would look like:
{
"givenName": "Joe",
"username": "joe",
"email": "joe#mailinator.com",
"customData_03ae8b95-2496-4c8d-9330-6d2058b1bbb9": {
"favoriteColor": "red",
"someObject": {
"someKey": "someValue"
}
}
}
This means ES would create a new mapping for each ‘random’ field, and we would use a phrase multi-match query with a "starts with" wildcard for the field names when performing the queries. For example:
curl -XPOST 'eshost:9200/test/_search?pretty' -d '
{
"query": {
"multi_match": {
"query" : "red",
"type" : "phrase",
"fields" : ["customData_*.favoriteColor"]
}
}
}'
This could be a viable solution, but we are concerned that having too many mappings like this could affect performance. Are there any performance repercussions of having too many mappings on an index? Maybe periodic reindexing could alleviate having too many mappings?
This also just feels like a hack and something that should be handled by ES natively. Am I missing something?
Any suggestions about any of this would be much appreciated.
Thanks!
You're correct that Elasticsearch is not truly schemaless. If no mapping is specified, Elasticsearch infers field type primitives based upon the first value it sees for that field. Therefore your non-deterministic customData object can get you into trouble if you first see "favoriteColor": 10 followed by "favoriteColor": "red".
For your requirements, you should take a look at the SIREn Solutions Elasticsearch plugin, which provides a schemaless solution coupled with an advanced query language (using Twig) and a custom Lucene index format to speed up indexing and search operations for non-deterministic data.
Fields with the same mapping will be stored as the same Lucene field in the Lucene index (Elasticsearch shard). Different Lucene fields will have separate inverted indexes (term dictionary and index entries) and separate doc values. Lucene is highly optimized to store documents of the same field in a compressed way. Using a mapping with a different field for each document prevents Lucene from doing its optimization.
You should use Elasticsearch Nested Document to search efficiently. The underlying technology is Lucene BlockJoin, which indexes parent/child documents as a document block.

How to index JSON data in PostgreSQL 9.2?

Does anyone know how to create index on JSON data in PostgreSQL 9.2?
Example data:
[
{"key" : "k1", "value" : "v1"},
{"key" : "k2", "value" : "v2"}
]
Say I want to index on all the keys; how do I do that?
Thanks.
You are much better off using hstore for indexed fields, at least for now.
CREATE INDEX table_name_gin_data ON table_name USING GIN(data);
You can also create GiST indexes if you are interested in full-text search. More info here: http://www.postgresql.org/docs/9.0/static/textsearch-indexes.html
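For instance, a sketch assuming each {"key": ..., "value": ...} element from the question is flattened into one hstore pair per row:

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE table_name (
    id   serial PRIMARY KEY,
    data hstore
);

CREATE INDEX table_name_gin_data ON table_name USING GIN (data);

INSERT INTO table_name (data) VALUES ('k1 => v1, k2 => v2');

-- both predicates below can use the GIN index:
SELECT * FROM table_name WHERE data ? 'k1';          -- rows having key k1
SELECT * FROM table_name WHERE data @> 'k1 => v1';   -- rows having pair k1 => v1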
Currently there are no built-in functions to index JSON directly, but you can do it with a function-based index where the function is written in JavaScript.
See this blog post for details: http://people.planetpostgresql.org/andrew/index.php?/archives/249-Using-PLV8-to-index-JSON.html
There is another blog post which talks about JSON and how it can be used with JavaScript: http://www.postgresonline.com/journal/archives/272-Using-PLV8-to-build-JSON-selectors.html
This question is a little old, but I think the selected answer is not really the ideal one. To index JSON (the property values inside JSON text), we can use expression indexes with PLV8 (as suggested by @a_horse_with_no_name).
Craig Kerstiens does a great job of explaining/demonstrating:
http://www.craigkerstiens.com/2013/05/29/postgres-indexes-expression-or-functional-indexes/
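To make that concrete, a minimal sketch, assuming the plv8 extension is installed and the sample array from the question is stored in a json column named data (the helper name json_value is made up for illustration):

-- returns the "value" member of the array element whose "key" matches
CREATE OR REPLACE FUNCTION json_value(body text, wanted text)
RETURNS text
LANGUAGE plv8 IMMUTABLE STRICT AS $$
    var items = JSON.parse(body);
    for (var i = 0; i < items.length; i++) {
        if (items[i].key === wanted) {
            return items[i].value;
        }
    }
    return null;
$$;

-- expression index on the value stored under key "k1"
CREATE INDEX table_name_k1_idx
    ON table_name ((json_value(data::text, 'k1')));

-- this predicate can now use the index
SELECT * FROM table_name WHERE json_value(data::text, 'k1') = 'v1';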