Upserting subdocuments as a list with mutate_in - Couchbase

Would really appreciate your help here.
I am trying to mutate sub-documents using mutate_in, defining an SD operation with a path for each element that I want to put in.
The code looks like this:
cb.mutate_in('1_2',
             SD.upsert('strat.a', 76543, create_parents=True),
             SD.upsert('strat.b', "true", create_parents=True),
             SD.upsert('strat.c', 30, create_parents=True),
             SD.upsert('strat.d', -25, create_parents=True),
             SD.upsert('strat.e', "C", create_parents=True),
             SD.upsert('strat.f', 7, create_parents=True))
This is the result it produces:
{
    "strat": {
        "a": 76543,
        "b": "true",
        "c": 30,
        "d": -25,
        "e": "C",
        "f": 7
    }
}
What I am looking for is this (notice that I am putting it in a list here):
{
    "Strat": [
        {
            "a": "12345",
            "b": 7,
            "c": "C",
            "d": "true",
            "e": -25,
            "f": 25
        }
    ]
}
Once I am able to achieve the above as a list, I should be able to generate multiple sub-documents like the following.
Notice that now I have 2 elements in this list, both having different values:
{
    "Strat": [
        {
            "a": "12345",
            "b": 7,
            "c": "C",
            "d": "true",
            "e": -25,
            "f": 25
        },
        {
            "a": "34567",
            "b": 8,
            "c": "F",
            "d": "false",
            "e": -30,
            "f": 40
        }
    ]
}
QUESTIONS:
Can I achieve this with mutate_in and SD.upsert?
If so, how can I change the code I provided at the beginning of this thread to put things in a list?
Once I can put things in a list, how do I keep upsert from overwriting the record, and instead append another record to that list?
Your help is highly appreciated. Thank you!
Looking forward to some direction from all the Couchbase gurus out there.

You can create a Python Dictionary representing the new JSON Object, and append it to the array (create_parents=True says "create the array if it doesn't already exist").
I am not fluent in Python, and have no idea if this compiles... so let's call it pseudocode :-)
cb.mutate_in('1_2',
             SD.array_append('strat',
                             {
                                 "a": "12345",
                                 "b": 7,
                                 "c": "C",
                                 "d": "true",
                                 "e": -25,
                                 "f": 25
                             },
                             create_parents=True),
             SD.array_append('strat',
                             {
                                 # some other values...
                             },
                             create_parents=True))
You can append up to 16 items at a time this way (subdoc batches are limited to 16 operations).
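Cleaning that up, a complete runnable sketch against the Python SDK 2.x might look like the following. The connection string, credentials, bucket name, and document key are assumptions for illustration:

# Hedged sketch: Python SDK 2.x style. Connection details, credentials,
# bucket name, and the document key '1_2' are placeholders.
from couchbase.cluster import Cluster, PasswordAuthenticator
import couchbase.subdocument as SD

cluster = Cluster('couchbase://localhost')
cluster.authenticate(PasswordAuthenticator('username', 'password'))
cb = cluster.open_bucket('default')

new_entry = {
    "a": "12345",
    "b": 7,
    "c": "C",
    "d": "true",
    "e": -25,
    "f": 25,
}

# array_append adds the dict as a single element at the end of the
# 'strat' array; create_parents=True builds the array on first use.
# Calling this again with a different dict appends a second element
# instead of overwriting the first.
cb.mutate_in('1_2', SD.array_append('strat', new_entry, create_parents=True))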

Related

What JSON format does STRIP_OUTER_ARRAY support?

I have a file composed of a single array containing multiple records.
{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}
I load it with the following code:
create or replace stage stage.test
url='azure://xxx/xxx'
credentials=(azure_sas_token='xxx');
create table if not exists stage.client_test (json_data variant not null);
copy into stage.client_test
from @stage.test/client_test.json
file_format = (type = 'JSON' strip_outer_array = true);
Snowflake imports the entire file as one row.
I would like the COPY INTO command to remove the outer array structure and load the records into separate table rows.
When I load larger files, I hit the size limit for variant and get the error Error parsing JSON: document is too large, max size 16777216 bytes.
If you can import the file into Snowflake, into a single row, then you can use LATERAL FLATTEN on the Client field to generate one row per element in the array.
Here's a blog post on LATERAL and FLATTEN (or you could look them up in the snowflake docs):
https://support.snowflake.net/s/article/How-To-Lateral-Join-Tutorial
If the format of the file is, as specified, a single object with a single property that contains an array with 500 MB worth of elements in it, then perhaps importing it will still work -- if that works, then LATERAL FLATTEN is exactly what you want. But that form is not particularly great for data processing. You might want to use some text processing script to massage the data if that's needed.
RECOMMENDATION #1:
The problem with your JSON is that it doesn't have an outer array. It has a single outer object containing a property with an inner array.
If you can fix the JSON, that would be the best solution, and then STRIP_OUTER_ARRAY will work as you expected.
You could also try to recompose the JSON (an ugly business) after reading it line by line with:
CREATE OR REPLACE TABLE X (CLIENT VARCHAR);
COPY INTO X FROM (SELECT $1 CLIENT FROM @My_Stage/Client.json);
User Response to Recommendation #1:
Thank you. So from what I gather, COPY with STRIP_OUTER_ARRAY can handle a file starting and ending with square brackets, and parse the file as if they were not there.
The real files don't have line breaks, so I can't read the file line by line. I will see if the source system can change the export.
RECOMMENDATION #2:
Also, if you would like to see what the JSON parser does, you can experiment using this code; I have parsed JSON in the COPY command using similar code. Working with your JSON data in a small project can help you shape the COPY command to work as intended.
CREATE OR REPLACE TABLE SAMPLE_JSON
(ID INTEGER,
DATA VARIANT
);
INSERT INTO SAMPLE_JSON(ID,DATA)
SELECT
1,parse_json('{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}');
SELECT
C.value:ClientNo AS ClientNo
,C.value:ClientName::STRING AS ClientName
,ClientBusiness.value:BusinessNo::Integer AS BusinessNo
,ClientBusiness.value:IndustryCode::Integer AS IndustryCode
from SAMPLE_JSON f
,table(flatten( f.DATA,'Client' )) C
,table(flatten(c.value:ClientBusiness,'')) ClientBusiness;
User Response to Recommendation #2:
Thank you for the parse_json example!
Trouble is, the real files are sometimes 500 MB, so the parse_json function chokes.
Follow-up on Recommendation #2:
The JSON needs to be in the NDJSON (http://ndjson.org/) format. Otherwise, large files become impossible to parse: a single JSON document runs into the 16 MB VARIANT limit mentioned above.
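For completeness, a minimal conversion sketch in Python, assuming the third-party ijson streaming parser is available (so a 500 MB file never has to fit in memory); the file names are placeholders:

import json
import ijson  # third-party streaming JSON parser

# Stream the 'Client' array element by element and emit NDJSON,
# one record per line, without loading the whole file into memory.
with open('client.json', 'rb') as src, open('client.ndjson', 'w') as dst:
    for record in ijson.items(src, 'Client.item'):
        # ijson yields numbers as Decimal; default=float keeps them numeric.
        dst.write(json.dumps(record, default=float) + '\n')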
Hope the above helps others running into similar questions!

Elasticsearch - return records having a field falling in the given range

A subset of my data looks like:
{
    "6220": {
        "abstract": "We investigate the two-dimensional $\\mathcal{N}=(2,2)$ supersymmetric\nYang-Mills (SYM) theory on the discretized curved space (polyhedra). We first\nrevisit that the number of supersymmetries of the continuum $\\mathcal{N}=(2,2)$\nSYM theory on any curved manifold can be enhanced at least to two by\nintroducing an appropriate $U(1)$ gauge background associated with the\n$U(1)_{V}$ symmetry. We then show that the generalized Sugino model on the\ndiscretized curved space, which was proposed in our previous work, can be\nidentified to the discretization of this SUSY enhanced theory, where one of the\nsupersymmetries remains and the other is broken but restored in the continuum\nlimit. We find that the $U(1)_{A}$ anomaly exists also in the discretized\ntheory as a result of an unbalance of the number of the fermions proportional\nto the Euler characteristics of the polyhedra. We then study this model by\nusing the numerical Monte-Carlo simulation. We propose a novel phase-quench\nmethod called \"anomaly-phase-quenched approximation\" with respect to the\n$U(1)_A$ anomaly. We numerically show that the Ward-Takahashi (WT) identity\nassociated with the remaining supersymmetry is realized by adopting this\napproximation. We figure out the relation between the sign (phase) problem and\npseudo-zero-modes of the Dirac operator. We also show that the divergent\nbehavior of the scalar one-point function gets milder as the genus of the\nbackground increases. These are the first numerical observations for the\nsupersymmetric lattice model on the curved space with generic topologies.",
        "arxiv_id": "1607.01260",
        "authors": [
            "Kamata Syo",
            "Matsuura So",
            "Misumi Tatsuhiro",
            "Ohta Kazutoshi"
        ],
        "categories": [
            "hep-th",
            "hep-lat"
        ],
        "created": "2016-07-05 00:00:00",
        "doi": "10.1093\/ptep\/ptw153",
        "primary_category": "physics",
        "title": "Anomaly and Sign problem in $\\mathcal{N}=(2,2)$ SYM on Polyhedra :\n Numerical Analysis",
        "updated": 1473724800000
    },
    "407": {
        "abstract": "In this paper, we use the methods of subriemannian geometry to study the dual\nfoliation of the singular Riemannian foliation induced by isometric Lie group\nactions on a complete Riemannian manifold M. We show that under some\nconditions, the dual foliation has only one leaf.",
        "arxiv_id": "1408.0060",
        "authors": [
            "Shi Yi"
        ],
        "categories": [
            "math.DG"
        ],
        "created": "2014-07-31 00:00:00",
        "doi": null,
        "primary_category": "math",
        "title": "The dual foliation of some singular Riemannian foliations",
        "updated": 1483574400000
    }
}
I need to find all records whose created field falls in a given range.
I tried:
GET _search
{
"query": {
"range" : {
"created" : {
"gte": "2012-01-01",
"lte": "2018-01-01",
"format": "yyyy-MM-dd"
}
}
}
}
I got:
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 13,
"successful" : 13,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
I expected a couple of hits for that query. I have verified that there are nearly 500 documents in my ES index.
What am I doing wrong here?
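One thing worth checking (an assumption, since the index mapping isn't shown): whether created was actually indexed as a date field. If it was auto-mapped as text, or if the documents were ingested in the wrapper shape shown above (so the real field paths are 6220.created and 407.created rather than created), a date range query on created will return zero hits. The mapping can be inspected with (my_index is a placeholder for the real index name):
GET my_index/_mapping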

Regex Return First Match

I have a weather file from which I would like to extract the first value for "air_temp" recorded in the JSON. The HTTP retriever I'm using only supports regex for extraction (I know it is not the best method).
I've shortened the JSON file to 2 data entries for simplicity - there are usually 100.
{
"observations": {
"notice": [
{
"copyright": "Copyright Commonwealth of Australia 2017, Bureau of Meteorology. For more information see: http://www.bom.gov.au/other/copyright.shtml http://www.bom.gov.au/other/disclaimer.shtml",
"copyright_url": "http://www.bom.gov.au/other/copyright.shtml",
"disclaimer_url": "http://www.bom.gov.au/other/disclaimer.shtml",
"feedback_url": "http://www.bom.gov.au/other/feedback"
}
],
"header": [
{
"refresh_message": "Issued at 12:11 pm EST Tuesday 11 July 2017",
"ID": "IDN60901",
"main_ID": "IDN60902",
"name": "Canberra",
"state_time_zone": "NSW",
"time_zone": "EST",
"product_name": "Capital City Observations",
"state": "Aust Capital Territory"
}
],
"data": [
{
"sort_order": 0,
"wmo": 94926,
"name": "Canberra",
"history_product": "IDN60903",
"local_date_time": "11/12:00pm",
"local_date_time_full": "20170711120000",
"aifstime_utc": "20170711020000",
"lat": -35.3,
"lon": 149.2,
"apparent_t": 5.7,
"cloud": "Mostly clear",
"cloud_base_m": 1050,
"cloud_oktas": 1,
"cloud_type_id": 8,
"cloud_type": "Cumulus",
"delta_t": 3.6,
"gust_kmh": 11,
"gust_kt": 6,
"air_temp": 9.0,
"dewpt": 0.2,
"press": 1032.7,
"press_qnh": 1031.3,
"press_msl": 1032.7,
"press_tend": "-",
"rain_trace": "0.0",
"rel_hum": 54,
"sea_state": "-",
"swell_dir_worded": "-",
"swell_height": null,
"swell_period": null,
"vis_km": "10",
"weather": "-",
"wind_dir": "WNW",
"wind_spd_kmh": 7,
"wind_spd_kt": 4
},
{
"sort_order": 1,
"wmo": 94926,
"name": "Canberra",
"history_product": "IDN60903",
"local_date_time": "11/11:30am",
"local_date_time_full": "20170711113000",
"aifstime_utc": "20170711013000",
"lat": -35.3,
"lon": 149.2,
"apparent_t": 4.6,
"cloud": "Mostly clear",
"cloud_base_m": 900,
"cloud_oktas": 1,
"cloud_type_id": 8,
"cloud_type": "Cumulus",
"delta_t": 2.9,
"gust_kmh": 9,
"gust_kt": 5,
"air_temp": 7.3,
"dewpt": 0.1,
"press": 1033.1,
"press_qnh": 1031.7,
"press_msl": 1033.1,
"press_tend": "-",
"rain_trace": "0.0",
"rel_hum": 60,
"sea_state": "-",
"swell_dir_worded": "-",
"swell_height": null,
"swell_period": null,
"vis_km": "10",
"weather": "-",
"wind_dir": "NW",
"wind_spd_kmh": 4,
"wind_spd_kt": 2
}
]
}
}
The regex expression I am currently using is: .*air_temp": (\d+).* but this is returning 9 and 7.3 (entries 1 and 2). Could someone suggest a way to only return the first value?
I have tried using a lazy quantifier group, but have had no luck.
This regex will help you. But I think you should capture and extract the first match using the features of the programming language you are using.
.*air_temp": (\d{1,3}\.\d{0,3})[\s\S]*?},
To understand the regex better: take a look at this.
Update
The above solution works if you have only two data entries. For more than two entries, use this one:
header[\s\S]*?"air_temp": (\d{1,3}\.\d{0,3})
Here we match the word header first and then match anything in a non-greedy way. After that, we match our expected pattern; thus we get the first match. Play with it here on regex101.
To capture negative numbers, we need to check whether a - character exists. We do this with ?, which indicates zero or one occurrence of the preceding element.
So the regex becomes,
header[\s\S]*?"air_temp": (-?\d{1,3}\.\d{0,3}) Demo
But the use of \K without the global flag (in another answer given by mickmackusa) is more efficient. To detect negative numbers, the modified version of that regex is
air_temp": \K-?\d{1,2}\.\d{1,2} demo.
Here {1,2} means 1 to 2 occurrences of the previous character; the general form is {min,max}.
I do not know which language you are using, but the behavior you describe looks like the difference between using and not using the global flag.
If the global flag is not set, only the first result is returned. If the global flag is set on your regex, it will iterate through the input, returning all possible results. You can test this easily using Regex101: https://regex101.com/r/x1bwg2/1
The lazy/greediness of the pattern has no bearing on whether the global flag is used.
If \K is allowed in your coding language, use this: Demo
/air_temp": \K[\d.]+/ (117steps) this will be highly efficient in searching your very large JSON text.
If no \K is allowed, you can use a capture group: (Demo)
/air_temp": ([\d.]+)/ this will still move with decent speed through your JSON text
Notice that there is no global flag at the end of the pattern, so after one match, the regex engine stops searching.
Update:
For "less literal" matches (but it shouldn't matter if your source is reliable), you could use:
Extended character class to include -:
/air_temp": \K[\d.-]+/ #still 117 steps
or change to negated character class and match everything that isn't a , (because the value always terminates with a comma):
/air_temp": \K[^,]+/ #still 117 steps
For a very strict match (if you are looking for a pattern that means you have ZERO confidence in the input data)...
It appears that your data doesn't go beyond one decimal place, temps between 0 and 1 have a leading 0 before the decimal, and I don't think you need to worry about temps in the hundreds (right?), so you could use:
/air_temp": \K-?[1-9]?\d(?:\.\d)? #200steps
Explanation:
Optional negative sign
Optional tens digit
Required ones digit
Optional decimal which must be followed by a digit
Accuracy Test Demo
Real Data Demo
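Since both answers suggest letting the host language take just the first match, here is a minimal sketch of that in Python (the question doesn't name a language, so the file name and the choice of Python are illustrative):

import re

# Read the JSON text; 'observations.json' is a placeholder file name.
with open('observations.json') as f:
    text = f.read()

# re.search scans until the first match and then stops, so there is no
# global-flag issue; the group allows an optional sign, digits, and a dot.
match = re.search(r'"air_temp": (-?[\d.]+)', text)
if match:
    print(match.group(1))  # prints 9.0 for the sample above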

Retrieve json field value with importJSON

I have a problem with Google Spreadsheets. I am trying to import a value from a link (which returns JSON), but it does not seem to work.
I tried this:
https://medium.com/@paulgambill/how-to-import-json-data-into-google-spreadsheets-in-less-than-5-minutes-a3fede1a014a#.pb26xo98x
The link returns JSON like this:
{
"data": [
{
"time": "2016-10-16T07:00:00+0000",
"value": "249.884067074"
}
],
"summary": {
"name": "Custom Events",
"period": "daily",
"since": "2016-10-17T00:00:00+0000",
"until": "2016-10-17T00:00:00+0000"
}
}
How can I extract the value from the data field?
I tried like this:
=ImportJSON(myUrl, "/data[0]/value", "noInherit,noTruncate,rawHeaders")
According to a comment on the project page, there is a fix that should be applied manually:
Chris says:
November 4, 2014 at 11:35 pm (UTC -4)
Trevor,
I was able to fix this problem by making a minor change to the
ParseData_ function. I changed line 286 in version 1.2.1 to:
if (i >= 0 && data[state.rowIndex]) {
and it seems to have addressed the issue.
Thank you!
CR

MongoDB find() to return the sub document when a (field,value) is matched

This is a single collection containing 2 JSON documents. I am searching for a particular field:value pair, and the entire subdocument must be returned in case of a match (that particular subdocument must be returned out of the 2 subdocuments in the following collection). Thanks in advance.
{
"clinical_study": {
"#rank": "379",
"#comment": [],
"required_header": {
"download_date": "ClinicalTrials.gov processed this data on March 18, 2015",
"link_text": "Link to the current ClinicalTrials.gov record.",
"url": "http://clinicaltrials.gov/show/NCT00000738"
},
"id_info": {
"org_study_id": "ACTG 162",
"secondary_id": "11137",
"nct_id": "NCT00000738"
},
"brief_title": "Randomized, Double-Blind, Placebo-Controlled Trial of Nimodipine for the Neurological Manifestations of HIV-1",
"official_title": "Randomized, Double-Blind, Placebo-Controlled Trial of Nimodipine for the Neurological Manifestations of HIV-1",
}
{
"clinical_study": {
"#rank": "381",
"#comment": [],
"required_header": {
"download_date": "ClinicalTrials.gov processed this data on March 18, 2015",
"link_text": "Link to the current ClinicalTrials.gov record.",
"url": "http://clinicaltrials.gov/show/NCT00001292"
},
"id_info": {
"org_study_id": "920106",
"secondary_id": "92-C-0106",
"nct_id": "NCT00001292"
},
"brief_title": "Study of Scaling Disorders and Other Inherited Skin Diseases",
"official_title": "Clinical and Genetic Studies of the Scaling Disorders and Other Selected Genodermatoses",
}
Your example documents are malformed - right now both clinical_study keys are part of the same object, and that object is missing a closing }. I assume you want them to be two separate documents, although you call them subdocuments. It doesn't make sense to have them be subdocuments of a document if they are both named under the same key. You cannot save the document that way, and in the mongo shell it will silently replace the first instance of the key with the second:
> var x = { "a" : 1, "a" : 2 }
> x
{ "a" : 2 }
If you just want to return the clinical_study part of the document when you match on clinical_study.#rank, use projection:
db.test.find({ "clinical_study.#rank" : "379" }, { "clinical_study" : 1, "_id" : 0 })
If instead you meant for the clinical_study documents to be elements of an array inside a larger document, then use $. Here, clinical_study is now the name of an array field which has as its elements the two values of the clinical_study key in your non-documents:
db.test.find({ "clinical_study.#rank" : "379" }, { "_id" : 0, "clinical_study.$" : 1 })