How to aggregate (min/max etc.) over Django JSONField data?

I'm using Django 1.9 with its built-in JSONField and Postgres 9.4.
In my model's attrs json field I store objects with some values, including numbers. And I need to aggregate over them to find min/max values.
Something like this:
Model.objects.aggregate(min=Min('attrs__my_key'))
Also, it would be useful to extract specific keys:
Model.objects.values_list('attrs__my_key', flat=True)
The above queries fail with
FieldError: "Cannot resolve keyword 'my_key' into field. Join on 'attrs' not permitted."
Is it possible somehow?
Notes:
I know how to make a plain Postgres query to do the job, but am searching specifically for an ORM solution to have the ability to filter etc.
I suppose this can be done with a (relatively) new query expressions/lookups API, but I haven't studied it yet.

From Django 1.11 (which isn't out yet, so this might change) you can use django.contrib.postgres.fields.jsonb.KeyTextTransform instead of RawSQL.
In Django 1.10 you have to copy/paste KeyTransform into your own KeyTextTransform and replace the -> operator with ->> and #> with #>> so that it returns text instead of JSON objects.
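For reference, a minimal sketch of what such a class boils down to. This assumes KeyTransform exposes its operators as class attributes, which is how Django 1.11 defines KeyTextTransform; on 1.10 the operators are hardcoded in as_sql, so you may have to copy that whole method and swap them inline:
from django.db.models import TextField
from django.contrib.postgres.fields.jsonb import KeyTransform

class KeyTextTransform(KeyTransform):
    operator = '->>'          # text-returning variant of ->
    nested_operator = '#>>'   # text-returning variant of #>
    output_field = TextField()
Then the annotation below works as expected: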
Model.objects.annotate(
    val=KeyTextTransform('json_field_key', 'blah__json_field')
).aggregate(min=Min('val'))
You can even include KeyTextTransforms in SearchVectors for full-text search:
Model.objects.annotate(
    search=SearchVector(
        KeyTextTransform('jsonb_text_field_key', 'json_field')
    )
).filter(search='stuff I am searching for')
Remember that you can also index jsonb fields, so consider that based upon your specific workload.
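For illustration, a GIN index can be declared right on the model. A minimal sketch, assuming Django 1.11+ (where GinIndex and Meta.indexes are available); model and field names are placeholders:
from django.contrib.postgres.fields import JSONField
from django.contrib.postgres.indexes import GinIndex
from django.db import models

class Thing(models.Model):
    attrs = JSONField()

    class Meta:
        # GIN index speeds up jsonb containment and key-existence lookups
        indexes = [GinIndex(fields=['attrs'])]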

For those who are interested, I've found a solution (or at least a workaround).
from django.db.models.expressions import RawSQL
Model.objects.annotate(
    val=RawSQL("((attrs->>%s)::numeric)", (json_field_key,))
).aggregate(min=Min('val'))
Note that the attrs->>%s expression becomes something like attrs->>'width' after parameter substitution (note the single quotes). So if you hardcode the key name, you have to insert the quotes yourself or you will get an error.
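For instance, with a hardcoded (hypothetical) width key, the quotes must be written explicitly:
Model.objects.annotate(
    val=RawSQL("((attrs->>'width')::numeric)", [])
).aggregate(min=Min('val'))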
/// A little bit off-topic ///
One more tricky issue, not related to Django itself, still needs to be handled somehow. Since attrs is a JSON field and there are no restrictions on its keys and values, you can (depending on your application logic) end up with non-numeric values in, for example, the width key. In that case you will get a DataError from Postgres when executing the above query (NULL values, meanwhile, are ignored, so those are fine). If you can simply catch the error, then no problem, you're lucky. In my case I needed to ignore invalid values, and the only way to do that is to write a custom Postgres function that suppresses casting errors:
create or replace function safe_cast_to_numeric(text) returns numeric as $$
begin
    return cast($1 as numeric);
exception
    when invalid_text_representation then
        return null;
end;
$$ language plpgsql immutable;
And then use it to cast text to numbers:
Model.objects.annotate(
    val=RawSQL("safe_cast_to_numeric(attrs->>%s)", (json_field_key,))
).aggregate(min=Min('val'))
Thus we get a fairly solid solution for something as dynamic as JSON.

I know this is a bit late (several months), but I came across this post while trying to do the same thing. Managed to do it by:
1) using KeyTextTransform to convert the jsonb value to text,
2) using Cast to convert it to an integer, so that the SUM works:
from django.db.models import IntegerField, Sum
from django.db.models.functions import Cast
# On Django < 3.1, KeyTextTransform lives in django.contrib.postgres.fields.jsonb
from django.contrib.postgres.fields.jsonb import KeyTextTransform

q = myModel.objects.filter(type=9) \
    .annotate(numeric_val=Cast(KeyTextTransform(sum_field, 'data'), IntegerField())) \
    .aggregate(Sum('numeric_val'))
print(q)
where 'data' is the jsonb property, and 'numeric_val' is the name of the variable I create by annotating.
Hope this helps somebody!

It is possible to do this using a Postgres function
https://www.postgresql.org/docs/9.5/functions-json.html
from django.db.models import F, FloatField, Func, Min
from django.db.models.expressions import Value
from django.db.models.functions import Cast

text = Func(F(json_field), Value(json_key), function='jsonb_extract_path_text')
floatfield = Cast(text, FloatField())
Model.objects.aggregate(min=Min(floatfield))
This is much better than using RawSQL because it doesn't break if you build a more complex query, where Django uses aliases and where there are field name collisions. There is so much going on in the ORM that can bite you with hand-written SQL.
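Because this is a regular ORM annotation, it also composes with filtering, which was part of the original question. A small sketch under the same assumptions (json_field and json_key as above; the threshold value is arbitrary):
Model.objects.annotate(
    val=Cast(
        Func(F(json_field), Value(json_key), function='jsonb_extract_path_text'),
        FloatField(),
    )
).filter(val__gte=10).aggregate(min=Min('val'))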

Since Django 3.1 the KeyTextTransform function on a JSON field works for all database backends. It maps to the ->> operator in Postgres.
It can be used to annotate a specific JSON value inside a JSONField onto the queryset results before you aggregate it. A clearer example of how to utilize this:
First we need to annotate the key you want to aggregate. So if you have a Django model with a JSONField named data and the JSON content looks like this:
{
    "age": 43,
    "name": "John"
}
You would annotate the queryset as follows:
from django.db.models import IntegerField
from django.db.models.fields.json import KeyTextTransform
from django.db.models.functions import Cast

qs = Model.objects.annotate(
    age=Cast(KeyTextTransform("age", "data"), IntegerField())
)
The Cast is needed to stay compatible with all database backends.
Now you can aggregate to your liking:
from django.db.models import Avg, IntegerField, Max, Min
from django.db.models.functions import Cast, Round

qs.aggregate(
    min_age=Round(Min("age")),
    max_age=Round(Max("age")),
    avg_age=Cast(Round(Avg("age")), IntegerField()),
)
>>> {'min_age': 25, 'max_age': 82, 'avg_age': 33}

It seems there is no native way to do it.
I worked around it like this:
my_queryset = Product.objects.all() # Or .filter()...
max_val = max(o.my_json_field.get(my_attrib, '') for o in my_queryset)
This is far from marvelous, since it is done at the Python level (and not at the SQL level).
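Note also that mixed value types will break max() in Python 3 (the '' default cannot be compared with numbers). A slightly more defensive sketch, skipping missing and non-numeric values:
values = (o.my_json_field.get(my_attrib) for o in my_queryset)
numeric = [v for v in values if isinstance(v, (int, float))]
max_val = max(numeric) if numeric else None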

from django.db.models import FloatField, Max, Min
from django.db.models.functions import Cast

qs = Model.objects.annotate(
    val=Cast('attrs__key', FloatField())
).aggregate(
    min=Min("val"),
    max=Max("val")
)

Related

How do I update data inside a stringified JSON object in SQL?

So I have three databases: an Oracle one, a SQL Server one, and a Postgres one. I have a table with two columns, name and value, both text. The value is a stringified JSON object. I need to update a nested value.
This is what I currently have:
name: 'MobilePlatform',
value:
'{
    "iosSupported": true,
    "androidSupported": false
}'
I want to add {"enableTwoFactorAuth": false} into it.
In PostgreSQL you should be able to do this:
UPDATE mytable
SET value = jsonb_set(value::jsonb, '{enableTwoFactorAuth}', 'false')
WHERE name = 'MobilePlatform';
In Postgres, the plain concatenation operator || for jsonb could do it:
UPDATE mytable
SET value = value::jsonb || '{"enableTwoFactorAuth":false}'::jsonb
WHERE name = 'MobilePlatform';
If a top-level key "enableTwoFactorAuth" already exists, it is replaced, so it's really an "upsert".
Or use jsonb_set() for manipulating nested values.
The cast back to text happens implicitly as an assignment cast. (The result is in standard format; any insignificant whitespace is effectively removed.)
If the content is valid JSON, the storage type should be json to begin with. In Postgres, jsonb would be preferable as it's easier to manipulate, but that's not directly portable to the other two RDBMS mentioned.
(Or, possibly, a normalized design without JSON altogether.)
For Oracle 21:
update mytable
set json_col = json_transform(
        json_col,
        INSERT '$.value.enableTwoFactorAuth' = 'false'
    )
where json_exists(json_col, '$?(@.name == "MobilePlatform")');
With json_col being a JSON or VARCHAR2|CLOB column with an IS JSON constraint.
(It must be JSON, though, if you want a multivalue index on json_value.name:
create multivalue index ix_json_col_name on mytable t ( t.json_col.name.string() );
)
Two of the databases you are using support a JSON data type, so it doesn't make sense to keep a stringified JSON object in a text column.
Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/21/adjsn/json-in-oracle-database.html
PostgreSQL: https://www.postgresql.org/docs/current/datatype-json.html
Apart from these, MS SQL Server also provides methods to work with JSON data.
MS SQL Server: https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-ver16
Using a JSON type column in any of the above databases would let you use their JSON functions to perform the tasks you are looking for.
If you have to use text only, then you can use REPLACE to add the key-value pair at the end of your JSON:
update dataTable set value = REPLACE(value, '}', ',"enableTwoFactorAuth": false}') where name = 'MobilePlatform';
Here dataTable is the name of the table.
A cleaner and less risky way would be to connect to the db from the application and use JSON methods such as JSON.parse in JavaScript or json.loads in Python. This gives you a JSON object (a dictionary in the case of Python) to work on. You can look for similar methods in other languages as well.
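A minimal sketch of that read-modify-write approach in Python, assuming the Postgres table from the question and psycopg2 as the driver (the connection string is a placeholder):
import json
import psycopg2

conn = psycopg2.connect('dbname=mydb')  # assumption: your connection details
with conn, conn.cursor() as cur:
    cur.execute("SELECT value FROM mytable WHERE name = %s", ('MobilePlatform',))
    data = json.loads(cur.fetchone()[0])   # parse the stringified JSON into a dict
    data['enableTwoFactorAuth'] = False    # add or overwrite the key
    cur.execute("UPDATE mytable SET value = %s WHERE name = %s",
                (json.dumps(data), 'MobilePlatform'))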
But I would suggest, if possible, using JSON columns instead of text to store JSON values wherever you can.

Bookshelf: query builder: how to use like operator with json column?

I'm using Bookshelf with a PostgreSQL database.
information is a column of type json.
I want to retrieve all rows that are like '%pattern%'.
With a SQL query I use:
select * from table where information::text like '%pattern%';
I want to do that with the Bookshelf query builder:
model.query(function(qb) {
    qb.where('information', 'LIKE', '%pattern%')
}).fetch()
But it doesn't work, and I can't find how to do it in the Bookshelf docs.
Any idea?
The tricky part here is that, although you might think of JSON (and JSONB) columns as text, they aren't! So there's no way to do a LIKE comparison on one. Well, there is, but you'd have to convert it to a string first:
SELECT * FROM wombats WHERE information #>> '{}' LIKE '%pattern%';
which is a really terrible idea, please don't do that! As @GMB points out in the comments, JSON is a structured format that is far more powerful. Postgres is great at handling JSON, so just ask it for what you need. Let's say your value is in a JSON property named description:
SELECT * FROM wombats
WHERE (information->'description')::TEXT
LIKE '%pattern%';
Here, even though we've identified the correct property in our JSON object, it comes out with type JSON: we still have to cast it to ::TEXT before comparing it with a string using LIKE. The Bookshelf/Knex version of all this would look like:
model
    .query(function(qb) {
        const keyword = "pattern";
        qb.whereRaw(`(information->'description')::TEXT LIKE '%${keyword}%'`)
    })
    .fetch();
Apparently this part of the raw query cannot be parameterized (in Postgres, at least), so the string substitution in JavaScript is required. This means you should be extra careful about where that string comes from (i.e. only allow a limited subset of characters, or sanitise it before use), as you're bypassing Knex's usual protections.

sqlalchemy/postgres: arithmetic on JSON fields?

I am using sqlalchemy on a postgres database, and I'm trying to do arithmetic in a SELECT on two JSON fields which represent floats. However, I have not figured out how to make this work.
Assume I have properly defined a table called transactions which contains a JSON column called cost_data, and assume that this JSON structure contains two attributes called cost and subtotal which represent float values.
In a SELECT statement, I generate the sum of those two fields as follows:
(cast(transactions.c.cost_data['subtotal'], sqlalchemy.Float) + cast(transactions.c.cost_data['cost'], sqlalchemy.Float)).label('total_cost')
This generates the following SQL fragment ...
CAST((transactions.cost_data -> %(cost_data_6)s) AS FLOAT) + CAST((transactions.cost_data -> %(cost_data_7)s) AS FLOAT) AS total_cost
(where cost_data_6 and cost_data_7 get set to subtotal and cost, respectively).
However, I get the following error:
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) cannot cast type json to double precision
If I remove the casting and do it as follows, it also fails ...
(transactions.c.cost_data['subtotal'] + transactions.c.cost_data['cost']).label('total_cost')
I get this error ...
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) operator does not exist: json || json
LINE 9: ... (transactions.cost_data -> 'subtotal') || (transa...
^
Clearly, this is because the fields come back as JSON values, and the + operator gets translated to the || concatenation operator, which doesn't exist for json.
Also, if I use the Python float operator, it also fails ...
(float(transactions.c.cost_data['subtotal']) + float(transactions.c.cost_data['cost'])).label('total_cost')
The python interpreter doesn't even execute the code, and it gives this error:
TypeError: float() argument must be a string or a number, not 'BinaryExpression'
So how can I perform the addition of those two fields using sqlalchemy?
PS: the following is a typical cost_data column value ...
{"cost":3.99,"subtotal":12.34}
OK. I finally figured it out. I have to pass each reference through the astext operator before applying cast, as follows ...
(
    transactions.c.cost_data['subtotal'].astext.cast(sqlalchemy.Float)
    + transactions.c.cost_data['cost'].astext.cast(sqlalchemy.Float)
).label('total_cost')
The accepted answer works when only using SQLAlchemy with PostgreSQL, but it is not portable to other databases that have JSON types.
If you want to write code that is portable between PostgreSQL, MySQL, and SQLite, then you should cast a JSON column using one of the column's as_*() methods.
OP's answer would look like:
transactions.c.cost_data['subtotal'].as_float()
The methods .as_float(), .as_string(), .as_boolean(), and .as_integer() are available on SQLAlchemy JSON columns since SQLAlchemy version 1.3.11; see the SQLAlchemy documentation for details.
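Applied to the expression from the question, the portable version would look something like this (a sketch, reusing the OP's transactions table and keys):
total_cost = (
    transactions.c.cost_data['subtotal'].as_float()
    + transactions.c.cost_data['cost'].as_float()
).label('total_cost')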

How do I write a query in Django to filter a queryset on the basis of JSON data?

I have a JSON field in my model which stores data like this:
{ "old_val": {"status": value1},
"new_val": {"status": value2}
}
Now I want to refine my select query so that the result contains all those tuples whose JSON field has
["new_val"]["status"] = value2 and ["old_val"]["status"] != value1
How do I write this query in Django?
This depends on which JSONField you are using and on the database. Some JSONFields just save JSON to a text field; if that is the case, you can't access parts of the data in the database and hence can't filter by it. If you are, however, using PostgreSQL 9.3+, then you can use its JSON support and its operators with extra:
Something.objects.extra(where=["data->'new_val'->>'status' = %s"], params=["foo"])
Note that PostgreSQL 9.4 has more operators than 9.3.
You may also take a look at django-pgjson, which encapsulates the use of some of the PostgreSQL JSON operators in custom lookups (new in Django 1.7):
Something.objects.filter(data__at_new_val__at_status="foo")
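For what it's worth, on Django 1.9+ the built-in django.contrib.postgres JSONField (and, from Django 3.1, the cross-backend JSONField) supports nested key lookups natively, so the same query needs neither extra() nor a third-party package. A sketch, with the model and literal values as placeholders:
YourModel.objects.filter(
    data__new_val__status="value2"
).exclude(
    data__old_val__status="value1"
)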
A jsonfield is basically a string, so you have to perform the same queries that you would perform on any StringField.
needed_objects = YourModel.objects.filter(jsonfield__contains={"status": value2}).exclude(jsonfield__contains={"status": value1})
Hope this does the trick for you.

Search in MySQL database - unserialized data

Situation:
I have a user model. The attribute "meta_data" in the db is a "text" type field.
In the model it is serialized by a custom class ( serialize :meta_data, CustomJsonSerializer.new ).
This means that when I have an instance of a user, I can work with meta_data like with a Hash:
User.first.meta_data['username']
Problem:
I need to write a search function which will search users by a given string. I can do it by manually building the search query in Rails, e.g. User.where("email LIKE '%#{string}%'")...
But what about meta_data? Should I search this field with a LIKE statement too? If I do so, it will decrease the relevance of the found records.
For example:
I have 2 users. One of them has the username "patrick", the other one "sergio".
The meta_data in the db will look like this:
1) {username: patrick}
2) {username: sergio}
I want to find sergio, so I enter the search string "ser" => but I get 2 results instead of one: the meta_data string "{uSERname: patrick}" also contains "ser", which makes that record an irrelevant match.
Do you have any idea how to solve it?
That's really the problem with serialized data. In theory, the serialization could be an algorithm that is very unsearchable: it could apply Huffman coding or other compression and store the serialization in binary. You are relying on the assumption that the serialization uses JSON and that your string will still be findable as a substring in the serialization.
The problem you are having, then, is another issue: other data in the serialization can mess up your results.
In general, if you serialize data, you are making a choice not to be searchable.
So a solution would be to add an additional field that you populate in a way that you control. Have a values field and store a pipe (|) delimited value that you can search. So if the data is {firstname: "Patrick", lastname: "Stern"}, your meta_values field might be "Patrick|Stern".
Also, don't use the where method with a string with #{} expansion of input values. That makes it vulnerable to SQL injection. Instead use:
where("meta_values like :pattern", pattern: "%#{string}%")
I know that may not look very different, but ActiveRecord will run the value through sanitizing this way. If someone puts a semicolon in string, ActiveRecord will escape it in the search condition.