I would like to sort by a field in a specific order, let's say 2,4,1,5,3.
In MySQL I could use ORDER BY FIELD(id,2,4,1,5,3).
Is there anything equivalent for ArangoDB?
I think it should be possible to use the POSITION AQL function, which can return the position of an element inside an array:
FOR i IN [ 1, 2, 3, 4, 5 ]                  /* what to iterate over */
  SORT POSITION([ 2, 4, 1, 5, 3 ], i, true) /* order to be returned */
  RETURN i
This will return:
[ 2, 4, 1, 5, 3 ]
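Applied to documents in a collection, the same idea might look like this (the collection name myCollection and the id attribute are just assumptions for illustration):

FOR doc IN myCollection
  /* the third parameter (true) makes POSITION return the array index, -1 if not found */
  SORT POSITION([ 2, 4, 1, 5, 3 ], doc.id, true)
  RETURN doc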
Update: my original answer included the CONTAINS AQL function; however, it should be POSITION!
Unfortunately, there is no direct equivalent for that, at the moment.
However, there are ways to accomplish it yourself.
1) By constructing an AQL query:
The query would run through your sort value array and query the DB for every defined value. Each of those results would then be added to the final output array.
Mind you, this does have a performance penalty, because there is one query for every value. If you define only a few values, I guess it will be tolerable, but if you have to define, for example, tens or hundreds, it will lead to n+1 queries (where n is the number of custom sort values).
The "+1" is the last query, which should fetch all the other values that are not defined in your custom sort array and append them to your output array as well.
That would look like the following snippet, which you can copy into your AQL Editor and run.
Notes for the snippet:
First, I create an array that represents the collection we would query.
Then I set the defined sort values.
After that, the actual AQL statement does its job.
Also, note the FLATTEN function at the outer RETURN statement. It is required because the first loop produces a result array for each defined sort value; these all have to be flattened down to the same level so they can be processed as a single result set (instead of many small nested ones).
/* Define a dummy collection-array to work with */
LET a = [
  {
    "_id": "a/384072353674",
    "_key": "384072353674",
    "_rev": "384073795466",
    "sort": 2
  },
  {
    "_id": "a/384075040650",
    "_key": "384075040650",
    "_rev": "384075827082",
    "sort": 3
  },
  {
    "_id": "a/384077137802",
    "_key": "384077137802",
    "_rev": "384078579594",
    "sort": 4
  },
  {
    "_id": "a/384067504010",
    "_key": "384067504010",
    "_rev": "384069732234",
    "sort": 1
  },
  {
    "_id": "a/384079497098",
    "_key": "384079497098",
    "_rev": "384081004426",
    "sort": 5
  }
]

/* Define the custom sort values */
LET cSort = [5,3,1]

/* Gather the results of each defined sort value query into definedSortResults */
LET definedSortResults = (
  FOR u IN cSort
    LET d = (
      FOR docs IN `a`
        FILTER docs.`sort` == u
        RETURN docs
    )
    RETURN d
)

/* Append the result of the last query (all the non-defined sort values) to definedSortResults in the output array */
LET output = (
  APPEND (definedSortResults, (
    FOR docs IN `a`
      FILTER docs.`sort` NOT IN cSort
      RETURN docs
    )
  )
)

/* Finally FLATTEN and RETURN the output variable */
RETURN FLATTEN(output)
2) A different approach would be to extend AQL with a function written in JavaScript that essentially does the same steps as above.
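A minimal sketch of that idea, assuming ArangoDB 3.x, where user-defined AQL functions are registered from arangosh via the @arangodb/aql/functions module (the CUSTOM::FIELDSORT name is made up for this example):

/* run in arangosh */
var aqlfunctions = require("@arangodb/aql/functions");

aqlfunctions.register("CUSTOM::FIELDSORT", function (value, order) {
  /* return the position of value inside the custom order array;
     unknown values sort after the explicitly ordered ones */
  var pos = order.indexOf(value);
  return pos === -1 ? order.length : pos;
}, true);

In AQL it could then be used much like POSITION above, e.g. SORT CUSTOM::FIELDSORT(doc.sort, [5, 3, 1]).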
Of course, you could also open up a feature request on ArangoDB's GitHub Page, and maybe the nice folks at ArangoDB will consider it for inclusion. :)
Hope that helps
Related
I'm trying to convince PostgreSQL 13 to pull out parts of a JSON field into another field, including a subset of properties within an array based on a discriminator (type) property. For example, given a data field containing:
{
  "id": 1,
  "type": "a",
  "items": [
    { "size": "small", "color": "green" },
    { "size": "large", "color": "white" }
  ]
}
I'm trying to generate new_data like this:
{
  "items": [
    { "size": "small" },
    { "size": "large" }
  ]
}
items can contain any number of entries. I've tried variations of SQL along the lines of:
UPDATE my_table
SET new_data = (
  CASE data->>'type'
    WHEN 'a' THEN
      json_build_object(
        'items', json_agg(json_array_elements(data->'items') - 'color')
      )
    ELSE
      null
  END
);
but I can't seem to get it working. In this case, I get:
ERROR: set-returning functions are not allowed in UPDATE
LINE 6: 'items', json_agg(json_array_elements(data->'items')...
I can get a set of items using json_array_elements(data->'items') and thought I could roll this up into a JSON array using json_agg and remove unwanted keys using the - operator. But now I'm not sure if what I'm trying to do is possible. I'm guessing it's a case of PEBCAK. I've got about a dozen different types each with slightly different rules for how new_data should look, which is why I'm trying to fit the value for new_data into a type-based CASE statement.
Any tips, hints, or suggestions would be greatly appreciated.
One way is to handle the set json_array_elements() returns in a subquery.
UPDATE my_table
SET new_data = CASE
                 WHEN data->>'type' = 'a' THEN
                   (SELECT json_build_object('items',
                                             json_agg(jae.item::jsonb - 'color'))
                    FROM json_array_elements(data->'items') jae(item))
               END;
db<>fiddle
Also note that - isn't defined for json, only for jsonb. So unless your columns are actually jsonb, you need a cast. And you don't need an explicit ... ELSE NULL ... in a CASE expression; NULL is already the default value if no other value is specified in an ELSE branch.
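For instance, if the columns really are jsonb (an assumption here), the jsonb_* variants let you drop the casts entirely; a sketch of the same update:

UPDATE my_table
SET new_data = CASE
                 WHEN data->>'type' = 'a' THEN
                   (SELECT jsonb_build_object('items',
                                              jsonb_agg(jae.item - 'color'))
                    FROM jsonb_array_elements(data->'items') AS jae(item))
               END;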
A sample of the JSON is as shown below:
{
  "AN": {
    "dates": {
      "2020-03-26": {
        "delta": {
          "confirmed": 1
        },
        "total": {
          "confirmed": 1
        }
      }
    }
  },
  "KA": {
    "dates": {
      "2020-03-09": {
        "delta": {
          "confirmed": 1
        },
        "total": {
          "confirmed": 1
        }
      },
      "2020-03-10": {
        "delta": {
          "confirmed": 3
        },
        "total": {
          "confirmed": 4
        }
      }
    }
  }
}
I would like to load it into a DataFrame, such that the state names (AN, KA) are represented as Row names, and the dates and nested entries are present as Columns.
Any tips to achieve this would be very much appreciated. [I am aware of json_normalize, but I haven't figured out how to make it work yet.]
The output I am expecting is roughly as shown below:
Can you update your post with the DataFrame you have in mind? It'll be easier to understand what you want.
Also, sometimes it's better to reshape your data if you can't make it work the way it is now.
Update:
Following your update here's what you can do.
You need to reshape your data; as I said, when you can't achieve what you want, it is best to look at the problem from another point of view. For instance (and from the sample you shared), the 'dates' key is meaningless, as the other keys are already dates and there are no other keys at the same level.
A way to achieve what you want would be to use a MultiIndex; it'll help you group your data the way you want. To use it you can, for instance, create all the indices you need and store the associated values in a dictionary.
Example :
If the only index you have is ('2020-03-26', 'delta', 'confirmed'), you should have values = {'AN': [1], 'KA': None}
Then you only need to create your DataFrame and transpose it.
I gave it a quick try and came up with a piece of code that should work. If you're looking for performance I don't think this will do the trick.
import pandas as pd
from copy import deepcopy

# d is the sample you shared
index = [[], [], []]
values = {}

# Get all the dates
dates = [date for c in d.keys() for date in d[c]['dates'].keys()]

for country in d.keys():
    # For each country we create an array containing all 6 values for each date
    # (missing values as None)
    values[country] = []
    for date in dates:
        if date in d[country]['dates']:
            for method in ['delta', 'total']:
                for step in ['confirmed', 'recovered', 'tested']:
                    # Incrementing indices
                    index[0].append(date)
                    index[1].append(method)
                    index[2].append(step)
                    if step in d[country]['dates'][date][method]:
                        values[country].append(deepcopy(d[country]['dates'][date][method][step]))
                    else:
                        values[country].append(None)
        # When country does not have a date fill with None
        else:
            for method in ['delta', 'total']:
                for step in ['confirmed', 'recovered', 'tested']:
                    index[0].append(date)
                    index[1].append(method)
                    index[2].append(step)
                    values[country].append(None)

# Removing duplicates introduced because we added the indices n_countries times
# (3 is the number of steps, 2 is the number of methods)
number_of_rows = 3 * 2 * len(dates)
index[0] = index[0][:number_of_rows]
index[1] = index[1][:number_of_rows]
index[2] = index[2][:number_of_rows]

df = pd.DataFrame(values, index=index).T
Here is what I have for the transposed data frame of my output:
Hope this can help you
You clearly need to reshape your JSON data before loading it into a DataFrame.
Have you tried loading your JSON as a dict?
dataframe = pd.DataFrame.from_dict(JsonDict, orient="index")
The “orient” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
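For example, combining that with pd.json_normalize (a sketch, assuming pandas 1.x and that d holds the nested sample from the question) could look like this:

import pandas as pd

# Sketch: flatten each state's "dates" sub-dict into one row,
# then use from_dict(orient="index") so the state names become row labels.
flat = {state: pd.json_normalize(payload["dates"]).iloc[0]
        for state, payload in d.items()}
df = pd.DataFrame.from_dict(flat, orient="index")

# Columns come out as dotted paths like "2020-03-26.delta.confirmed";
# splitting them gives date / delta-total / confirmed levels as a MultiIndex.
df.columns = pd.MultiIndex.from_tuples([tuple(c.split(".")) for c in df.columns])
print(df)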
I need to compare duplicate IPs in a JSON by the date field and remove the entries with the older date.
Ex:
[
  {
    "IP": "10.0.0.20",
    "Date": "2019-09-14T20:00:11.543-03:00"
  },
  {
    "IP": "10.0.0.10",
    "Date": "2019-09-17T15:45:16.943-03:00"
  },
  {
    "IP": "10.0.0.10",
    "Date": "2019-09-18T15:45:16.943-03:00"
  }
]
The output of the operation needs to be like this:
[
  {
    "IP": "10.0.0.20",
    "Date": "2019-09-14T20:00:11.543-03:00"
  },
  {
    "IP": "10.0.0.10",
    "Date": "2019-09-18T15:45:16.943-03:00"
  }
]
For simplicity's sake, I'll assume the order of the data doesn't matter.
First, if your data isn't already in Python, you can use json.load or json.loads to convert it into a Python object, following the straightforward type mappings.
Then your problem has three parts: comparing date strings as dates, finding the maximum element of a list by that date, and performing this process for each distinct IP address. For these purposes, you can use two of Python's built-in functions and two from the standard library.
Python's built-in max and sorted functions (as well as list.sort) support a (keyword-only) key argument, which uses a function to determine the value to compare by. For example, max(d1, d2, key=lambda x: x[0]) compares the data by the first index of each (like d1[0] < d2[0]), and returns whichever of d1 and d2 produced the larger key.
To allow that type of comparison between dates, you can use the datetime.datetime class. If your dates are all in the format specified by datetime.datetime.fromisoformat, you can use that function to turn your date strings into datetimes, which can then be compared to each other. Using that in a function that extracts the dates from the dictionaries gives you the key function you need.
def extract_date(item):
    return datetime.datetime.fromisoformat(item['Date'])
Those functions allow you to choose the object from the list with the largest date, but not to keep separate values for different IP addresses. To do that, you can use itertools.groupby, which takes a key function and puts the elements of the input into separate outputs based on that key. However, there are two things you might need to watch out for with groupby:
It only groups elements that are next to each other. For example, if you give it [3, 3, 2, 2, 3], it will group two 3s, then two 2s, then one 3, rather than grouping all three 3s together.
It returns an iterator of key, iterator pairs, so you have to collect the results yourself. The best way to do that may depend on your application, but a basic approach is nested iterations:
for key, values in groupby(data, key_function):
    for value in values:
        print(key, value)
With the functions I've mentioned above, it should be relatively straightforward to assemble an answer to your problem.
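For instance, a minimal sketch putting those pieces together (assuming Python 3.7+ for fromisoformat, with the list from the question already loaded as data) might look like:

import datetime
from itertools import groupby

data = [
    {"IP": "10.0.0.20", "Date": "2019-09-14T20:00:11.543-03:00"},
    {"IP": "10.0.0.10", "Date": "2019-09-17T15:45:16.943-03:00"},
    {"IP": "10.0.0.10", "Date": "2019-09-18T15:45:16.943-03:00"},
]

def extract_date(item):
    return datetime.datetime.fromisoformat(item['Date'])

# groupby only groups adjacent elements, so sort by IP first
data.sort(key=lambda item: item['IP'])

# For each IP, keep only the entry with the latest date
latest = [max(group, key=extract_date)
          for _, group in groupby(data, key=lambda item: item['IP'])]
print(latest)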
SQL Server 2017.
Table OrderData has column DataProperties where JSON is stored. JSON example stored there:
{
  "Input": {
    "OrderId": "abc",
    "Data": [
      {
        "Key": "Files",
        "Value": [
          "test.txt",
          "whatever.jpg"
        ]
      },
      {
        "Key": "Other",
        "Value": [
          "a"
        ]
      }
    ]
  }
}
So it's an object with an Input object, which has a Data array of key/value pairs: objects with a Key string and a Value array of strings.
And my problem: I need to query for rows based on the values in Files in the example JSON, with a simple LIKE that matches %text%.
This query works:
SELECT TOP 10 *
FROM OrderData CROSS APPLY OPENJSON(DataProperties,'$.Input.Data') dat
WHERE JSON_VALUE(dat.value, '$.Key') = 'Files' and dat.[key] = 0
AND JSON_QUERY(dat.value, '$.Value') LIKE '%2%'
Problem is that this query is very slow, unsurprisingly.
How to make it faster?
I cannot create computed column with JSON_VALUE, because I need to filter in an array.
I cannot create computed column with JSON_QUERY on "$.Input.Data" or "$.Input.Data[0].Values" - because I need specific array item in this array with Key == "Files".
I've searched, but it seems that you cannot create computed column that also filters data, like with this attempt:
ALTER TABLE OrderData
ADD aaaTest AS (SELECT JSON_QUERY(dat.value, '$.Value')
                FROM OPENJSON(DataProperties,'$.Input.Data') dat
                WHERE JSON_VALUE(dat.value, '$.Key') = 'Files' and dat.[key] = 0);
Error: Subqueries are not allowed in this context. Only scalar expressions are allowed.
What are my options?
Add Files column with an index and use INSERT/UPDATE triggers that populate this column on inserts/updates?
Create a view that "computes" this column? Can't add index, will still be slow
So far only option 1 has some merit, but I don't like triggers; maybe there's another option?
You might try something along these lines:
Attention: I've added a 2 to test.txt (making it test2.txt) to fulfill your filter, and I've renamed both keys to the plural "Values":
DECLARE @mockupTable TABLE(ID INT IDENTITY, DataProperties NVARCHAR(MAX));
INSERT INTO @mockupTable VALUES
(N'{
  "Input": {
    "OrderId": "abc",
    "Data": [
      {
        "Key": "Files",
        "Values": [
          "test2.txt",
          "whatever.jpg"
        ]
      },
      {
        "Key": "Other",
        "Values": [
          "a"
        ]
      }
    ]
  }
}');
The query
SELECT TOP 10 *
FROM @mockupTable t
CROSS APPLY OPENJSON(t.DataProperties,'$.Input.Data')
            WITH([Key] NVARCHAR(100)
                ,[Values] NVARCHAR(MAX) AS JSON) dat
WHERE dat.[Key] = 'Files'
  AND dat.[Values] LIKE '%2%';
The main difference is the WITH-clause, which is used to return the properties inside an object in a typed way and side-by-side (similar to a naked OPENJSON with a PIVOT for all columns - but much better). This avoids expensive JSON methods in your WHERE...
Hint: As we return the Values column with NVARCHAR(MAX) AS JSON, we can continue with the nested array and might proceed with something like this:
SELECT TOP 10 *
FROM @mockupTable t
CROSS APPLY OPENJSON(t.DataProperties,'$.Input.Data')
            WITH([Key] NVARCHAR(100)
                ,[Values] NVARCHAR(MAX) AS JSON) dat
WHERE dat.[Key] = 'Files'
  --we read the array again with `OPENJSON`:
  AND 'test2.txt' IN (SELECT [Value] FROM OPENJSON(dat.[Values]));
You might use one more CROSS APPLY to add the array's values and filter this at the WHERE directly.
SELECT TOP 10 *
FROM @mockupTable t
CROSS APPLY OPENJSON(t.DataProperties,'$.Input.Data')
            WITH([Key] NVARCHAR(100)
                ,[Values] NVARCHAR(MAX) AS JSON) dat
CROSS APPLY OPENJSON(dat.[Values]) vals
WHERE dat.[Key] = 'Files'
  AND vals.[Value] = 'test2.txt'
Just check it out...
This is an old question, but I would like to revisit it. There isn't any mention of how the source table is actually constructed in terms of indexing. If the original author is still around, can you confirm/deny what indexing strategy you used? For performant JSON document queries, I've found that a table using the COLUMNSTORE indexing strategy yields very performant JSON queries even with large amounts of data.
https://learn.microsoft.com/en-us/sql/relational-databases/json/store-json-documents-in-sql-tables?view=sql-server-ver15 has an example of different indexing techniques. For my personal solution I've been using COLUMNSTORE, albeit on a limited NVARCHAR document size. It's fast enough for any purposes I have, even under millions of rows of decently sized JSON documents.
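For reference, the pattern that article describes boils down to something like the sketch below. The table and column names here are made up rather than taken from the question, and on SQL Server 2017 NVARCHAR(MAX) is also allowed in a clustered columnstore; the capped length just mirrors the comment above.

CREATE TABLE OrderDataColumnstore
(
    Id             BIGINT IDENTITY PRIMARY KEY NONCLUSTERED,
    DataProperties NVARCHAR(4000),
    INDEX cci CLUSTERED COLUMNSTORE
);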
I'm currently trying to do a bit of complex N1QL for a project I'm working on. Theoretically I could do all of this processing in multiple N1QL calls and by parsing the results each time, but if possible I'd like for this to be contained in one call.
What I would like to do is:
filter all documents that contain a "dataSync.test.id" field with more than 1 id
Read back all other ids in that list
Use that list to get other documents containing those ids
Get the "dataSync.test._channels" field for those documents (optionally a filter by docType might help parsing)
This would probably return a list of "dataSync.test._channels"
Is this possible in N1QL? It looks like it might be, but I can't get the syntax right.
My data structures look a little like
{
  "dataSync": {
    "test": {
      "_channels": [
        "RP"
      ],
      "id": [
        "dataSync_user_1015",
        "dataSync_user_1010",
        "dataSync_user_1005"
      ],
      "_lastUpdatedBy": "TEST"
    }
  },
  ...
}
{
  "dataSync": {
    "test": {
      "_channels": [
        "RSD"
      ],
      "id": [
        "dataSync_user_1010"
      ],
      "_lastUpdatedBy": "TEST"
    }
  },
  ...
}
Yes, I think you can do all of these.
The initial set of IDs with filtering can be retrieved as a subquery, and then you can get the subsequent documents by joins.
SELECT fulldoc
FROM (SELECT META().id AS dockey FROM doc WHERE a = 1) AS mydoc
INNER JOIN doc fulldoc ON KEYS mydoc.dockey;
There are optimizations that can be done here. Try the sequencing first to ensure you get the job done.