Using logical operators in a JSON path file - json

Is there any functionality to use logical operators in the JSONPaths file used by the COPY command?
For example, I have JSON records that can contain a key named either Desc or Description, so the JSON would be something like this:
{
  "Desc": "Hello",
  "City": "City1",
  "Age": "21"
}
{
  "Description": "World",
  "City": "City2",
  "Age": "25"
}
I'm using the COPY command to pull the data from the JSON above into my table in Redshift. The table has a column named description_data, which should store the value of either Desc or Description. So I want my JSONPaths file to identify the key using an "OR" condition.
This is the JSONPaths file that I'm currently using:
{
  "jsonpaths": [
    "$['Desc']",
    "$['City']",
    "$['Age']"
  ]
}
This works fine.
What I'm trying to do is the following (this is where I'm unsure whether there is any syntax or functionality to achieve the objective):
{
  "jsonpaths": [
    "$['Desc']" or "$['Description']",
    "$['City']",
    "$['Age']"
  ]
}

No, Redshift doesn't support this.
You can issue two COPY commands, one with Desc and another with Description, to load the data into two temporary tables. After that, you can merge the two into your final table.
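A minimal sketch of that approach, assuming hypothetical table names, S3 paths, an IAM role, and two JSONPaths files that differ only in the first path element ($['Desc'] vs $['Description']):

-- Staging tables mirroring the target table's columns.
CREATE TEMP TABLE staging_desc (LIKE my_table);
CREATE TEMP TABLE staging_description (LIKE my_table);

-- Load the same files twice, once per JSONPaths file.
COPY staging_desc
FROM 's3://my-bucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
JSON 's3://my-bucket/jsonpaths_desc.json';

COPY staging_description
FROM 's3://my-bucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
JSON 's3://my-bucket/jsonpaths_description.json';

-- Merge into the final table; records that lack the targeted key should load
-- NULL for description_data (verify against your data), so filter them out here.
INSERT INTO my_table (description_data, city, age)
SELECT description_data, city, age FROM staging_desc WHERE description_data IS NOT NULL
UNION ALL
SELECT description_data, city, age FROM staging_description WHERE description_data IS NOT NULL;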

Related

How to merge two collections keeping the document with highest timestamp in MongoDB

I'm creating a MongoDB client for a Go application, using the MongoDB Go Driver. In particular, I have two databases with one collection each. These collections can be modified asynchronously by different clients, so I need to periodically synchronize them, keeping the most recently edited document among those with the same id field.
The two databases are stored on different hosts, so I need to export the collection from one host using mongoexport and import it into the other host using mongoimport.
I already tried using mongoimport --collection=myColl --mode=merge, but this doesn't fit my goal because it simply overrides the conflicting documents in myColl with the imported ones.
My idea is to import the JSON into a temp collection, but I don't know how to compare the timestamps during the aggregation/merge process.
My collections are structured like this. Any ideas?
Collection 1
{"_id":"K1","value":"VAL1","timest":{"$date":"2021-09-26T09:05:09.942Z"}}
{"_id":"K2","value":"VAL2","timest":{"$date":"2021-09-26T09:05:10.234Z"}}
Collection 2
{"_id":"K2","value":"VAL3","timest":{"$date":"2021-09-26T09:15:09.942Z"}}
{"_id":"K3","value":"VAL4","timest":{"$date":"2021-09-26T09:15:10.234Z"}}
Desired Behaviour
Conflict
{"_id":"K2","value":"VAL2","timest":{"$date":"2021-09-26T09:05:10.234Z"}}
{"_id":"K2","value":"VAL3","timest":{"$date":"2021-09-26T09:15:09.942Z"}}[LATEST]
Output
{"_id":"K1","value":"VAL1","timest":{"$date":"2021-09-26T09:05:09.942Z"}}
{"_id":"K2","value":"VAL3","timest":{"$date":"2021-09-26T09:15:09.942Z"}}
{"_id":"K3","value":"VAL4","timest":{"$date":"2021-09-26T09:15:10.234Z"}}
You can use $merge.
The below merges testdb1.coll into testdb2.coll based on the same _id,
and keeps the document with the latest date. If the _id is not found, the document is inserted.
Data in
testdb1.coll
[{"_id" "K2","value" "VAL3","timest" (date "2021-09-26T09:15:09.942Z")}
{"_id" "K3","value" "VAL4","timest" (date "2021-09-26T09:15:10.234Z")}]
testdb2.coll
[{"_id" "K1","value" "VAL1","timest" (date "2021-09-26T09:05:09.942Z")}
{"_id" "K2","value" "VAL2","timest" (date "2021-09-26T09:05:10.234Z")}]
Results
testdb2.coll (after the merge)
{"_id": "K1", "value": "VAL1", "timest": {"$toDate": "2021-09-26T09:05:09.942Z"}}
{"_id": "K2", "value": "VAL3", "timest": {"$toDate": "2021-09-26T09:15:09.942Z"}}
{"_id": "K3", "value": "VAL4", "timest": {"$toDate": "2021-09-26T09:15:10.234Z"}}
Query
(instead of $let you could use $$new)
client.db("testdb1").collection("coll").aggregate(
[
{
"$merge": {
"into": {
"db": "testdb2",
"coll": "coll"
},
"on": [
"_id"
],
"let": {
"p_ROOT": "$$ROOT"
},
"whenMatched": [
{
"$replaceRoot": {
"newRoot": {
"$cond": [
{
"$gt": [
"$$p_ROOT.timest",
"$timest"
]
},
"$$p_ROOT",
"$$ROOT"
]
}
}
}
],
"whenNotMatched": "insert"
}
}
])
You can also do the following in an aggregation pipeline (a sketch is shown below):
use $unionWith to combine the 2 collections
$sort to order them by timest
use $first to get the latest document
use $replaceRoot to get the final form you want
Here is the Mongo playground for your reference.
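A minimal sketch of that pipeline in mongosh, assuming both collections live in the same database under the hypothetical names coll1 and coll2:
db.coll1.aggregate([
  { $unionWith: { coll: "coll2" } },                       // combine the two collections
  { $sort: { timest: -1 } },                               // newest documents first
  { $group: { _id: "$_id", doc: { $first: "$$ROOT" } } },  // keep the latest document per _id
  { $replaceRoot: { newRoot: "$doc" } }                    // restore the original document shape
])
The result can then be written to a target collection, for example with a trailing $out or $merge stage.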

What JSON format does STRIP_OUTER_ARRAY support?

I have a file composed of a single array containing multiple records.
{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}
I load it with the following code:
create or replace stage stage.test
  url='azure://xxx/xxx'
  credentials=(azure_sas_token='xxx');
create table if not exists stage.client_test (json_data variant not null);
copy into stage.client_test
  from @stage.test/client_test.json
  file_format = (type = 'JSON' strip_outer_array = true);
Snowflake imports the entire file as one row.
I would like the COPY INTO command to remove the outer array structure and load the records into separate table rows.
When I load larger files, I hit the size limit for variant and get the error Error parsing JSON: document is too large, max size 16777216 bytes.
If you can import the file into Snowflake, into a single row, then you can use LATERAL FLATTEN on the Client field to generate one row per element in the array.
Here's a blog post on LATERAL and FLATTEN (or you could look them up in the Snowflake docs):
https://support.snowflake.net/s/article/How-To-Lateral-Join-Tutorial
If the format of the file is, as specified, a single object with a single property that contains an array with 500 MB worth of elements in it, then perhaps importing it will still work -- and if it does, LATERAL FLATTEN is exactly what you want. That form is not particularly great for data processing, though; you might want to use a text-processing script to massage the data if needed.
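A minimal sketch of that query, assuming the file was loaded into a single row of stage.client_test(json_data variant) as in the question:
SELECT
    c.value:ClientNo::integer  AS client_no,
    c.value:ClientName::string AS client_name
FROM stage.client_test t,
     LATERAL FLATTEN(input => t.json_data:Client) c;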
RECOMMENDATION #1:
The problem with your JSON is that it doesn't have an outer array. It has a single outer object containing a property with an inner array.
If you can fix the JSON, that would be the best solution, and then STRIP_OUTER_ARRAY will work as you expected.
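For illustration (abridged to one business per client), the reshaped file would have the array itself as the top-level element, which STRIP_OUTER_ARRAY can then peel off into one row per client:
[
  { "ClientNo": 1, "ClientName": "Alpha", "ClientBusiness": [ { "BusinessNo": 1, "IndustryCode": "12345" } ] },
  { "ClientNo": 2, "ClientName": "Bravo", "ClientBusiness": [ { "BusinessNo": 1, "IndustryCode": "34567" } ] }
]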
You could also try to recompose the JSON (an ugly business) after reading it line by line with:
CREATE OR REPLACE TABLE X (CLIENT VARCHAR);
COPY INTO X FROM (SELECT $1 CLIENT FROM @My_Stage/Client.json);
User Response to Recommendation #1:
Thank you. So from what I gather, COPY with STRIP_OUTER_ARRAY can handle a file starting and ending with square brackets, and parse the file as if they were not there.
The real files don't have line breaks, so I can't read the file line by line. I will see if the source system can change the export.
RECOMMENDATION #2:
Also, if you would like to see what the JSON parser does, you can experiment with the code below; I have parsed JSON in the COPY command using similar code. Working with your JSON data in a small project can help you shape the COPY command to work as intended.
CREATE OR REPLACE TABLE SAMPLE_JSON
(ID INTEGER,
DATA VARIANT
);
INSERT INTO SAMPLE_JSON(ID,DATA)
SELECT
1,parse_json('{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}');
SELECT
C.value:ClientNo AS ClientNo
,C.value:ClientName::STRING AS ClientName
,ClientBusiness.value:BusinessNo::Integer AS BusinessNo
,ClientBusiness.value:IndustryCode::Integer AS IndustryCode
from SAMPLE_JSON f
,table(flatten( f.DATA,'Client' )) C
,table(flatten(c.value:ClientBusiness,'')) ClientBusiness;
User Response to Recommendation #2:
Thank you for the parse_json example!
Trouble is, the real files are sometimes 500 MB, so the parse_json function chokes.
Follow-up on Recommendation #2:
The JSON needs to be in the NDJSON (http://ndjson.org/) format; otherwise, large files like these become impossible to parse.
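For illustration, an NDJSON version of the sample data would hold one complete client object per line, so COPY loads each line as its own row instead of one oversized VARIANT:
{"ClientNo": 1, "ClientName": "Alpha", "ClientBusiness": [{"BusinessNo": 1, "IndustryCode": "12345"}, {"BusinessNo": 2, "IndustryCode": "23456"}]}
{"ClientNo": 2, "ClientName": "Bravo", "ClientBusiness": [{"BusinessNo": 1, "IndustryCode": "34567"}, {"BusinessNo": 2, "IndustryCode": "45678"}]}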
Hope the above helps others running into similar questions!

Couchbase Index and N1QL Query

I have created a new bucket, FooBar, on my Couchbase server.
I have a JSON document which is a list of objects with some properties, and it is in my Couchbase bucket as follows:
[
{
"Venue": "Venue1",
"Country": "AU",
"Locale": "QLD"
},
{
"Venue": "Venue2",
"Country": "AU",
"Locale": "NSW"
},
{
"Venue": "Venue3",
"Country": "AU",
"Locale": "NSW"
}
]
How do I get the Couchbase query to return a list of locations when using a N1QL query?
For instance, SELECT * FROM FooBar WHERE Locale = 'QLD'
Please let me know of any indexes I would need to create as well. Additionally, how can I return only results where the object is of type Location, and not, say, another object which may have the Locale property?
Chud
PS: I have also created some indexes; however, I would like an unbiased answer on how to achieve this.
Typically you would store these as separate documents, rather than in a single document as an array of objects, which is how the data is currently shown.
Since you can mix document structures, the usual pattern to distinguish them is to have something like a 'type' field. ('type' is in no way special, just the most common choice.)
So your example would look like:
{
"Venue": "Venue1",
"Country": "AU",
"Locale": "QLD",
"type": "location"
}
...
{
"Venue": "Venue3",
"Country": "AU",
"Locale": "NSW",
"type": "location"
}
where each JSON object would be a separate document with a unique document ID. (If you have some predefined data you want to load, look at cbimport for how to add it to your database. There are a few different formats for doing it. You can also have it generate document IDs for you.)
Then, what @vsr wrote is correct. You'd create an index on the Locale field; that will be optimal for the query you want. Note you could also create an index on every document with CREATE INDEX ix1 ON FooBar(Locale);. In this simple case it doesn't really make a difference. Read about the query Explain feature of the admin console for help using it to understand and optimize queries.
Finally, the query @vsr wrote is also correct:
SELECT * FROM FooBar WHERE type = "Location" AND Locale = "QLD";
CREATE INDEX ix1 ON FooBar(Locale);
https://dzone.com/articles/designing-index-for-query-in-couchbase-n1ql
CREATE INDEX ix1 ON FooBar(Locale) WHERE type = "Location";
SELECT * FROM FooBar WHERE type = "Location" AND Locale = "QLD";
If it is an array and the field name is list:
CREATE INDEX ix1 ON FooBar(DISTINCT ARRAY v.Locale FOR v IN list END) WHERE type = "Location";
SELECT * FROM FooBar WHERE type = "Location" AND ANY v IN list SATISFIES v.Locale = "QLD" END;
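For illustration, that array variant assumes a document shaped roughly like this (a hypothetical single document with a type field and the venues kept in a list array):
{
  "type": "Location",
  "list": [
    { "Venue": "Venue1", "Country": "AU", "Locale": "QLD" },
    { "Venue": "Venue2", "Country": "AU", "Locale": "NSW" }
  ]
}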

How to display 'c' array values alone from the given JSON document below using MongoDB?

I am a newbie to MongoDB. I am experimenting with the various ways of extracting fields from a document inside a collection.
With the JSON document below, I am finding it difficult to extract the data according to my needs:
{
"_id":1,
"dependencies":{
"a":[
"hello",
"hi"
],
"b":[
"Hmmm"
],
"c":[
"Vanilla",
"Strawberry",
"Pista"
],
"d":[
"Carrot",
"Cauliflower",
"Potato",
"Cabbage"
]
},
"productid":"25",
"date":"Thu Jul 30 11:36:49 PDT 2015"
}
I need to display the following output:
c:[
"Vanilla",
"Strawberry",
"Pista"
]
Can anyone please help me in solving it?
MongoDB aggregation comes to the rescue to get the result you are looking for.
$project passes along the documents with only the specified fields to the next stage in the pipeline. The specified fields can be existing fields from the input documents or newly computed fields.
db.collection.aggregate( [
{ $project :
{ c: "$dependencies.c", _id : 0 }
}
]).pretty();
As per the output you require, we just need to project (display) the field "dependencies.c", so we create a new field "c" and assign the value of "dependencies.c" to it.
Also, by default the "_id" field is displayed along with the result. Since you don't need it, we suppress it by assigning "_id": <0 or false>, so that it will not appear in the output.
The above query will fetch the result below:
"c" : [
"Vanilla",
"Strawberry",
"Pista"
]

How to add a nested JSON object to a Lucene index

I need a little help regarding Lucene index files; I thought maybe some of you could help me out.
I have JSON like this:
[
  {
    "Id": 4476,
    "UrlName": null,
    "PhoneData": [
      {
        "PhoneType": "O",
        "PhoneNumber": "0065898"
      },
      {
        "PhoneType": "F",
        "PhoneNumber": "0065898"
      }
    ],
    "Contact": [],
    "Services": [
      {
        "ServiceId": 10,
        "ServiceGroup": 2
      },
      {
        "ServiceId": 20,
        "ServiceGroup": 1
      }
    ]
  }
]
Adding the first two fields is relatively easy:
// add lucene fields mapped to db fields
doc.Add(new Field("Id", sampleData.Id.Value.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("UrlName", sampleData.UrlName.Value ?? "null" , Field.Store.YES, Field.Index.ANALYZED));
But how can I add PhoneData and Services to the index so they can be connected to the unique Id?
For indexing JSON objects I would go this way:
Store the whole value under a payload field, named for example $json. This field would be stored but not indexed.
For each (indexable) property (maybe nested), create an indexable field whose name is an XPath-like expression identifying the property, for example PhoneData.PhoneType.
If it is OK that all nested properties are indexed, then it's simple: just iterate over all of them, generating these indexable fields (see the sketch after this list).
But if you don't want to index all of them (a more realistic case), knowing which properties are indexable is another problem; in this case you could:
Accept from the client the path expressions of the index fields to be created when storing the document, or
Put JSON Schema into play to describe your data (assuming your JSON records have a common schema), and extend it with a custom property that would allow you to tag which properties are indexable.
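A minimal sketch of the "index every nested property" case, reusing the Lucene.NET Field classes from the question. Newtonsoft.Json is an assumption for walking the JSON, and the helper name and field names are hypothetical:
using Lucene.Net.Documents;
using Newtonsoft.Json.Linq;

static void AddJsonToDoc(Document doc, JToken token, string path)
{
    switch (token)
    {
        case JObject obj:
            // Recurse into each property, building a dotted path such as PhoneData.PhoneType.
            foreach (JProperty prop in obj.Properties())
                AddJsonToDoc(doc, prop.Value, path == "" ? prop.Name : path + "." + prop.Name);
            break;
        case JArray arr:
            // Array elements share the same field name, so multiple values end up under one field.
            foreach (JToken item in arr)
                AddJsonToDoc(doc, item, path);
            break;
        default:
            // Leaf value: index it under its path.
            doc.Add(new Field(path, token.ToString(), Field.Store.NO, Field.Index.ANALYZED));
            break;
    }
}

// Usage sketch: store the raw JSON unindexed under a payload field, then index each nested property by path.
// doc.Add(new Field("$json", rawJson, Field.Store.YES, Field.Index.NO));
// AddJsonToDoc(doc, JToken.Parse(rawJson), "");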
I have created a library that does this (and much more) which may help you.
You can check it at https://github.com/brutusin/flea-db