I am handling legacy JSON files that we are now uploading to a database built with code-first EF Core (the JSON elements are saved as a jsonb field in a PostgreSQL db, represented as JsonDocument properties in the EF classes). We want to be able to query these massive documents against any of the JSON's many properties. I've been very interested in the excellent docs here https://www.npgsql.org/efcore/mapping/json.html?tabs=data-annotations%2Cpoco, but the problem in our case is that our JSON has incredibly complicated hierarchies.
According to the npgsql/EF doc above, a way to do this for "shallow" JSON hierarchies would be something like:
myDbContext.MyClass
    .Where(e => e.JsonDocumentField.RootElement.GetProperty("FieldToSearch").GetString() == "SearchTerm")
    .ToList();
But that only works if FieldToSearch is directly under the root of the JsonDocument. If the doc is structured like, say,
{"A": {
...
"B": {
...
"C": {
...
"FieldToSearch":
<snip>
Then the above query won't work. There is an alternative, mapping our JSON to an actual POCO model, but this JSON structure (a) may change and (b) is truly massive and would result in some ridiculously complicated objects.
Right now, I'm building SQL strings from field configurations, in which I save path strings that locate the fields I want using PostgreSQL's JSON operators.
Example:
"(JSONDocumentField->'A'->'B'->'C'->>'FieldToSearch')"
and then running that SQL against the DB using
myDbContext.MyClass.FromSqlRaw(sql).ToList();
This is hacky and I'd much rather do it in a method call. Is there a way to force JsonDocument's GetProperty call to drill down into the hierarchy to find the first/any instance of the property name in question (or another method I'm not aware of)?
Thanks!
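Edit: the closest I've found to doing this in a single method chain, at least for a known path, is chaining GetProperty calls, which the docs linked above suggest the provider can translate into the same -> / ->> operators. A hedged, untested sketch against the structure above:

myDbContext.MyClass
    .Where(e => e.JsonDocumentField.RootElement
        .GetProperty("A")
        .GetProperty("B")
        .GetProperty("C")
        .GetProperty("FieldToSearch")
        .GetString() == "SearchTerm")
    .ToList();

That still requires knowing the full path up front, though, so it doesn't answer the "find the first/any instance of the property" part of the question.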
Having read this post suggesting that it's sometimes a good trade-off to use JSON operators to return JSON directly from the database, I'm exploring the idea using PostgreSQL and jOOQ.
I'm able to write SQL queries returning a JSON array of JSON objects rather than rows using the following pattern:
select jsonb_pretty(array_to_json(array_agg(row_to_json(r)))::jsonb)
from (
    select [subquery]
) as r;
However, I have failed to translate this pattern to jOOQ.
Any help regarding how to translate a collection of rows (with fields of "usual" SQL types, or already mapped as json(b)) using jOOQ would be appreciated.
SQL Server FOR JSON semantics
That's precisely what the SQL Server FOR JSON clause does, which jOOQ supports and which it can emulate for you on other dialects as well:
ctx.select(T.A, T.B)
   .from(T)
   .forJSON().path()
   .fetch();
PostgreSQL native functionality
If you prefer using native functions directly, you will have to make do with plain SQL templating for now, as some of these functions aren't yet supported by jOOQ, including:
JSONB_PRETTY (no plans of supporting it yet)
ARRAY_TO_JSON (https://github.com/jOOQ/jOOQ/issues/12841)
ROW_TO_JSON (https://github.com/jOOQ/jOOQ/issues/10685)
It seems quite simple to write a utility that does precisely this:
public static ResultQuery<Record1<JSONB>> json(Select<?> subquery) {
    return ctx
        .select(field(
            "jsonb_pretty(array_to_json(array_agg(row_to_json(r)))::jsonb)",
            JSONB.class
        ))
        .from(subquery.asTable("r"));
}
And now, you can execute any query in your desired form:
JSONB result = ctx.fetchValue(json(select(T.A, T.B).from(T)));
Converting between PG arrays and JSON arrays
A note on performance: it seems that you're converting between data types rather often. Specifically, I'd suggest you avoid aggregating a PostgreSQL array and turning that into a JSON array, and instead use JSONB_AGG() directly. I haven't tested this, but the extra data structure seems unnecessary.
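For example, a sketch of the utility above rewritten to aggregate straight into JSONB, still via plain SQL templating (untested, as noted, and assuming the same ctx and static DSL imports as before):

public static ResultQuery<Record1<JSONB>> json(Select<?> subquery) {
    // jsonb_agg() builds the JSONB array directly, skipping the intermediate
    // PostgreSQL array produced by array_agg(...)::jsonb
    return ctx
        .select(field("jsonb_agg(row_to_json(r))", JSONB.class))
        .from(subquery.asTable("r"));
}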
I'm trying to understand what kind of JSON this structure can be:
{s:11:"current_tab";s:7:"content";}
At least I believe that it is JSON. Can anybody help me understand how I can query and work with this?
I found it in a MySQL DB with WordPress, in the wp_postmeta table.
These strings are PHP's serialized data representation, not JSON. That format is used for storing or passing PHP values without losing their type or structure. You will normally find this type of data in plugin or theme configuration, and it's widely used when working with WordPress databases.
Let's say a theme is creating an array for storing a color and a path.
In pure PHP, it looks like:
$settings = array(
    'color' => 'green',
    'path'  => 'https://example.com'
);
When that array is stored in the database, it is converted into the serialized representation and looks like:
a:2:{s:5:"color";s:5:"green";s:4:"path";s:19:"https://example.com";}
The advantage is that the serialized representation can be stored in the database more compactly than the raw PHP structure. The drawback is that the serialized data cannot be changed by a simple search & replace as you would do in a text editor, because the embedded string lengths (the s:5, s:4, ... markers) would no longer match.
You can find more info about this type of data (and also about the PHP functions used to create and retrieve it) by searching for serialized data in WordPress.
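As a minimal sketch of the round trip, using PHP's built-in functions (WordPress itself wraps them in helpers such as maybe_serialize() / maybe_unserialize()):

$settings = array(
    'color' => 'green',
    'path'  => 'https://example.com'
);

$stored   = serialize($settings);   // a:2:{s:5:"color";s:5:"green";...}
$restored = unserialize($stored);   // back to the original array

echo $restored['color'];            // prints "green"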
When integrating Spark and MongoDB, it is possible to provide a sample schema in the form of an object, as described here: https://docs.mongodb.com/spark-connector/master/scala/datasets-and-sql/#sql-declare-schema
As a shortcut, there is sample code showing how one can provide the MongoDB Spark connector with a sample schema:
case class Character(name: String, age: Int)
val explicitDF = MongoSpark.load[Character](sparkSession)
explicitDF.printSchema()
I have a collection which has a constant document structure. I can provide a sample JSON document; however, creating a sample object manually would be impossible (30k properties in a document, 1.5 MB average size). Is there a way for Spark to infer the schema just from that very JSON, circumventing the MongoDB connector's initial sampling, which is quite expensive?
Spark is able to infer the schema even from schemaless sources such as MongoDB. For sources that expose one, such as an RDBMS, it simply executes a query returning nothing but the table columns with their types (SELECT * FROM $table WHERE 1=0).
For the sampling, it'll read all documents unless you specify the configuration option called samplingRatio, like this:
sparkSession.read.option("samplingRatio", 0.1)
With the above, Spark will only read 10% of the data. You can of course set any value you want. But be careful: if your documents have inconsistent schemas (e.g. 50% have a field called "A" and the others don't), the schema deduced by Spark may be incomplete, and in the end you can miss some data.
Some time ago I wrote a post about schema projection, if you're interested: http://www.waitingforcode.com/apache-spark-sql/schema-projection/read
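If you want to avoid the connector's sampling entirely, one hedged idea (untested; the file path is illustrative, and the "mongo" format name depends on your connector version) is to infer the schema once from a single sample JSON file and pass it explicitly:

// Infer the schema once from a one-document JSON file
val sampleSchema = sparkSession.read.json("/path/to/sample.json").schema

// Pass the schema explicitly so the connector can skip its own sampling pass
val df = sparkSession.read
  .schema(sampleSchema)
  .format("mongo")
  .load()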
Using the Play framework with Anorm, I would like to write a Controller method that simply returns the results of an SQL query as JSON. I don't want to bother converting to objects. Also, ideally this code should stream out the JSON as the SQL ResultSet is processed rather than processing the entire SQL result before returning anything to the client.
SQL query
select colA, colB from mytable
JSON Response
[{"colA": "ValueA", "colB": 33}, {"colA": "ValueA2", "colB": 34}, ...]
I would like to express this in Scala code as elegantly and concisely as possible, but the examples I'm finding seem to have a lot of boilerplate (redundant column name definitions). I'm surprised there isn't some kind of SqlResult-to-JsValue conversion in Play or Anorm already.
I realize you may need to define Writes[] or an Enumeratee implementation to achieve this, but once the conversion code is defined, I'd like the code for each method to be nearly as simple as this:
val columns = List("colA", "colB")
db.withConnection { implicit c =>
  Ok(Json.toJson(SQL"select #$columns from mytable"))
}
I'm not clear on the best way to define column names just once and pass them to the SQL query as well as to the JSON conversion code. Maybe if I create some kind of implicit ColumnNames type, the JSON conversion code could access it in the previous example?
Or maybe define my own kind of SqlToJsonAction to achieve even simpler code and have common control over JSON responses?
def getJson = SqlToJsonAction(List("colA", "colB")) { columns =>
  SQL"select #$columns from mytable"
}
The only related StackOverflow question I found was: Convert from MySQL query result or LIST to JSON, which didn't have helpful answers.
I'm a Java developer just learning Scala, so I still have a lot to learn about the language, but I've read through the Anorm, Iteratee/Enumeratee, and Writes docs and numerous blogs on Anorm, and am having trouble figuring out how to set up the necessary helper code so that I can compose my JSON methods this way.
Also, I'm unclear on which approaches allow streaming out the response, and which will iterate the entire SQL ResultSet before responding with anything to the client. According to Anorm's Streaming Results docs, only methods such as fold/foldWhile/withResult and Iteratees stream. Are these the techniques I should use?
Bonus:
In some cases, I'll probably want to map a SQL column name to a different JSON field name. Is there a slick way to do this as well?
Something like this (no idea if this Scala syntax is possible):
def getJson = SqlToJsonAction("colA" -> "jsonColA", "colB", "colC" -> "jsonColC")) { columns =>
SQL"select #$columns from mytable"
}
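For context, here is a hedged sketch of the conversion piece I have in mind (anyToJs and rowsToJson are hypothetical helper names; Row.asList is the raw-values accessor I've seen in the Anorm docs, and I haven't verified this compiles against my Play version). Note that this materializes all rows in memory, so it doesn't stream:

import anorm._
import play.api.libs.json._

// Map a raw JDBC value to a JsValue, with a crude string fallback
def anyToJs(v: Any): JsValue = v match {
  case null                     => JsNull
  case b: Boolean               => JsBoolean(b)
  case n: Int                   => JsNumber(BigDecimal(n))
  case n: Long                  => JsNumber(BigDecimal(n))
  case d: Double                => JsNumber(BigDecimal(d))
  case bd: java.math.BigDecimal => JsNumber(BigDecimal(bd))
  case s: String                => JsString(s)
  case other                    => JsString(other.toString)
}

// Zip the column names with each row's values, in column order
def rowsToJson(columns: List[String], rows: List[Row]): JsValue =
  JsArray(rows.map(r => JsObject(columns.zip(r.asList.map(anyToJs)))))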
I want to save a hash as a packed string in a DB. I get the pack part down OK, but I'm having a problem getting the hash back.
Test hash:
my $hash = {
    test_string   => 'apples,bananas,oranges',
    test_subhash  => { like => 'apples' },
    test_subarray => [ 'red', 'yellow', 'orange' ],
};
I thought maybe I could use JSON::XS as in this example to convert the hash to a JSON string, and then pack the JSON string...
Thoughts on this approach?
Storable is capable of storing Perl structures very precisely. If you need to remember that something is a weak reference, etc, you want Storable. Otherwise, I'd avoid it.
JSON (Cpanel::JSON::XS) and YAML are good choices.
You can have problems if you store something using one version of Storable and try to retrieve it using an earlier version. That means all machines that access the database must have the same version of Storable.
Cpanel::JSON::XS is faster than Storable.
A fast YAML module is probably faster than Storable.
JSON can't store objects, but YAML and Storable can.
JSON and YAML are human readable (well, for some humans).
JSON and YAML are easy to parse and generate in other languages.
Usage:
use Cpanel::JSON::XS qw( encode_json decode_json );

my $for_the_db = encode_json($hash);
my $hash = decode_json($from_the_db);
I don't know what you mean by "packing". The string produced by Cpanel::JSON::XS's encode_json can be stored as-is in a BLOB field, while the string produced by Cpanel::JSON::XS->new->encode can be stored as-is in a TEXT field.
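A minimal sketch of the two forms side by side:

use Cpanel::JSON::XS;

my $octets = encode_json($hash);                    # UTF-8 octets, suits a BLOB field
my $text   = Cpanel::JSON::XS->new->encode($hash);  # character string, suits a TEXT field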
You may want to give the Storable module a whirl.
It can:
store your hash(ref) as a string with freeze
thaw it out at the time of retrieval
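A minimal sketch of that round trip (assuming a column type that can hold binary data):

use Storable qw( freeze thaw );

my $for_the_db = freeze($hash);      # binary string; store it in a BLOB-type column
my $restored   = thaw($from_the_db); # back to the original hashref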
There are a lot of different ways to store a data structure in a scalar and then "restore" it back to its original state. There are advantages and disadvantages to each.
Since you started with JSON, I'll show you an example using it.
use JSON;

my $hash = {
    test_string   => 'apples,bananas,oranges',
    test_subhash  => { like => 'apples' },
    test_subarray => [ 'red', 'yellow', 'orange' ],
};
my $stored = encode_json($hash);
my $restored = decode_json($stored);
Storable, as was already suggested, is also a good idea. But it can be rather quirky. It's great if you just want your own script/system to store and restore the data, but beyond that, it can be a pain in the butt. Even transferring data across different operating systems can cause problems. It was recommended that you use freeze, and for most local applications, that's the right call. If you decide to use Storable for sending data across multiple machines, look at using nfreeze instead.
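For that cross-machine case, a one-line sketch of the nfreeze variant:

use Storable qw( nfreeze thaw );

# nfreeze writes in network byte order, so the frozen string can be thawed
# on a machine with a different architecture
my $portable = nfreeze($hash);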
That being said, there are a ton of encoding methods that can handle "storing" data structures. Look at YAML or XML.
I'm not quite sure what you mean by "convert the hash to a JSON string, and then packing the JSON string". What further "packing" is required? Or did you mean "storing"?
There's a number of alternative methods for storing hashes in a database.
As Zaid suggested, you can use Storable to freeze and thaw your hash. This is likely to be the fastest method (although you should benchmark with the data you're using if speed is critical). But Storable uses a binary format which is not human readable, which means that you will only be able to access this field using Perl.
As you suggested, you can store the hash as a JSON string. JSON has the advantage of being fairly human readable, and there are JSON libraries for most any language, making it easy to access your database field from something other than Perl.
You can also switch to a document-oriented database like CouchDB or MongoDB, but that's a much bigger step.