I have inherited a project in which an Avro file is being consumed by Snowflake. The schema of the Avro file is as follows:
{
"name": "TableName",
"namespace": "sqlserver",
"type": "record",
"fields": [
{
"name": "hAccount",
"type": "string"
},
{
"name": "hTableName",
"type": "string"
},
{
"name": "hRawJSON",
"type": "string"
}
]
}
The hRawJSON field is itself a blob of JSON. The previous developer declared it as type string, and this is where I believe the problem lies.
The application takes a JSON object (the JSON is variable, so I never know its contents) and populates the hRawJSON field in the Avro record. But the value contains escape characters for the double quotes in the string:
hAccount:"H11122"
hTableName:"Departments"
hRawJSON:"{\"DepartmentID\":1,\"ModelID\":0,\"Description\":\"P Medicines\",\"Margin\":\"3.300000000000000e+001\",\"UCSVATRateID\":0,\"References\":719,\"HeadOfficeID\":1,\"DividendID\":0}"
As a result the JSON blob is staged into Snowflake as a VARIANT field but still retains the escape characters:
(Snowflake screenshot showing the staged VARIANT value with the escape characters retained)
This means that when querying the JSON data I constantly have to use this:
PARSE_JSON(RAW_FILE:hRawJSON):DepartmentID
I can't help feeling that the string field type in the Avro schema is causing the issue and that a different type should be used. I've tried Record, but without fields it's unusable. Doc didn't work either.
The other alternative is that this behavior is correct, and that when moving hRawJSON from staging into "proper" tables I should use something like:
INSERT INTO DATA.PUBLIC.DEPARTMENTS
SELECT
RAW_FILE:hAccount::VARCHAR(4) as Account,
PARSE_JSON(RAW_FILE:hRawJSON) as JsonRaw
FROM DATA.STAGING.AVRO_RAW WHERE RAW_FILE:hTableName::STRING = 'Department';
So if this is the correct approach and I'm overthinking things, I'd appreciate guidance.
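For what it's worth, the escaping is exactly what you get when a JSON object is serialized to a string and that string is then serialized again as part of the surrounding record. A minimal Python sketch (plain json module, no Avro or Snowflake involved; the field values are copied from the example above) reproduces the effect:
import json

# The "variable" payload the application receives.
inner = {"DepartmentID": 1, "Description": "P Medicines"}

# Because the Avro field is declared as string, the producer has to
# serialize the payload first; the record then carries text, not JSON.
record = {
    "hAccount": "H11122",
    "hTableName": "Departments",
    "hRawJSON": json.dumps(inner),
}

# Rendering the whole record as JSON (roughly what ends up in the staged
# VARIANT) escapes the inner double quotes.
print(json.dumps(record))
# {"hAccount": "H11122", ..., "hRawJSON": "{\"DepartmentID\": 1, \"Description\": \"P Medicines\"}"}
So with a string-typed Avro field the escaping is expected, and parsing once with PARSE_JSON when moving from staging into the final tables (as in the INSERT above) is a reasonable way to keep it out of downstream queries.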
I am currently implementing some test automation that uses a JSON POST to a REST API to initialize the test data in the SUT. Most of the fields I can edit without issue using information I found in another thread: Json handling in ROBOT.
However, one of the sets of information I am editing is a dictionary of metadata.
{
"title": "Test Auotmation Post 2018-03-06T16:12:02Z",
"content": "dummy text",
"excerpt": "Post made by automation for testing purposes.",
"name": "QA User",
"status": "publish",
"date": "2018-03-06T16:12:02Z",
"primary_section": "Entertainment",
"taxonomy": {
"section": [
"Entertainment"
]
},
"coauthors": [
{
"name": "QA User - CoAuthor",
"meta": {
"Title": "QA Engineer",
"Organization": "That One Place"
}
}
],
"post_meta": [
{
"key": "credit",
"value": "QA Engineer"
},
{
"key": "pub_date",
"value": "2018-03-06T16:12:02Z"
},
{
"key": "last_update",
"value": "2018-03-06T16:12:02Z"
},
{
"key": "source",
"value": "wordpress"
}
]
}
Is it possible to use the Set to Dictionary Keyword on a dictionary inside a dictionary? I would like to be able to edit the value of the pub_date and last_update inside of post_meta, specifically.
The most straightforward way would be to use the Evaluate keyword and update the sub-dict value inside it; since Evaluate only accepts Python expressions, not assignments, call the dict's .update() method. Presuming you are working with a dictionary called ${value}:
Evaluate    $value['post_meta'][1].update({'value': 'your new value here'})
I won't get into how to find the index of the post_meta list that has the 'key' with value 'pub_date', as that's not part of your question.
Is it possible to use the Set to Dictionary Keyword on a dictionary inside a dictionary?
Yes, it's possible.
However, because post_meta is a list rather than a dictionary, you will have to write some code to iterate over all of the values of post_meta until you find one with the key you want to update.
You could do this in python quite simply. You could also write a keyword in robot to do that for you. Here's an example:
*** Keywords ***
Set list element by key
    [Arguments]    ${data}    ${target_key}    ${new_value}
    :FOR    ${item}    IN    @{data}
    \    run keyword if    '''${item['key']}''' == '''${target_key}'''
    \    ...    set to dictionary    ${item}    value=${new_value}
    [return]    ${data}
Assuming you have a variable named ${data} that contains the original JSON data as a string, you could call this keyword like the following:
${JSON}=    evaluate    json.loads('''${data}''')    json
set list element by key    ${JSON['post_meta']}    pub_date    yesterday
set list element by key    ${JSON['post_meta']}    last_update    today
You will then have a python object in ${JSON} with the modified values.
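For completeness, here is a rough plain-Python equivalent of the same keyword, in case you would rather do the manipulation in a small library file; the payload.json file name and the new date values are placeholders:
import json

def set_list_element_by_key(items, target_key, new_value):
    # items is a list of {"key": ..., "value": ...} dicts, e.g. post_meta.
    for item in items:
        if item.get("key") == target_key:
            item["value"] = new_value
    return items

with open("payload.json") as fh:          # placeholder file name
    data = json.load(fh)

set_list_element_by_key(data["post_meta"], "pub_date", "2018-03-07T09:00:00Z")
set_list_element_by_key(data["post_meta"], "last_update", "2018-03-07T09:00:00Z")
print(json.dumps(data, indent=2))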
Why is egrep not giving me all the matching entries?
This is my simple JSON blob:
[nukaNUKA#dev-machine csv]$ cat jsonfile.json
{"number": 303,"projectName": "giga","queueId":8881,"result":"SUCCESS"}
This is my pattern file (so that I don't scare the editor):
[nukaNUKA#dev-machine csv]$ cat egrep-pattern.txt
\"number\":.*\"projectName
\"projectName\":.*,\"queueId
\"queueId\":.*,\"result
\"result\":\".*$
This is egrep/grep command for individual searches, which works!:
[nukaNUKA#dev-machine csv]$ egrep -o "\"number\":.*\"projectName" jsonfile.json
"number": 303,"projectName
[nukaNUKA#dev-machine csv]$ egrep -o "\"projectName\":.*,\"queueId" jsonfile.json
"projectName": "giga","queueId
[nukaNUKA#dev-machine csv]$ egrep -o "\"queueId\":.*,\"result" jsonfile.json
"queueId":8881,"result
[nukaNUKA#dev-machine csv]$ egrep -o "\"result\":\".*$" jsonfile.json
"result":"SUCCESS"}
So why didn't this work? I don't wear glasses, yes.
[nukaNUKA#dev-machine csv]$ egrep -o "\"number\":.*\"projectName|\"projectName\":.*,\"queueId|\"queueId\":.*,\"result|\"result\":\".*$" jsonfile.json
"number": 303,"projectName
"queueId":8881,"result
[nukaNUKA#dev-machine csv]$ egrep -o -f egrep-pattern.txt jsonfile.json
"number": 303,"projectName
"queueId":8881,"result
[nukaNUKA#dev-machine csv]$
I have a complex nested JSON blob, and because everything is unstructured it seems I can't use JQ or JSONV or a Python script: the data I'm looking for is stored in arrays of one-entry dictionaries (key/value pairs) that reuse the same key names, for example { "parameters": [ { "name": "jobname", "value": "shenzi" }, { "name": "pipelineVersion", "value": "1.2.3.4" }, ...and so on... ] }, and the index of jobname, pipelineVersion, or similar parameter names is not the same index[X] location in every JSON entry I have.
Worst case, I could add conditional checks to see whether the key at each index matches jobname etc. and then grab the fields I'm looking for, but there are hundreds of such fields I want to grab and I don't want to hard-code them if possible.
I thought that, as each JSON entry is on its own line, I could simply write some cool patterns (ugly, I know); then I wouldn't need the conditional code and could just use bash/sed/tr/cut to get what I need, but egrep -f or -o ... didn't work, as shown above.
Below is a sample JSON blob object (from one Jenkins job). There are JSON blob entries from different Jenkins build jobs (each having a different JSON structure, parameters, etc.) in a single JenkinsJobsBuild collection in MongoDB.
{
"_id": {
"$oid": "5120349es967yhsdfs907c4f"
},
"actions": [
{
"causes": [
{
"shortDescription": "Started by an SCM change"
}
]
},
{
},
{
"oneClickDeployPossible": false,
"oneClickDeployReady": false,
"oneClickDeployValid": false
},
{
},
{
},
{
},
{
"cspec": "element * ...\/MyProject_latest_int\/LATESTnelement * ...\/MyProject_integration\/LATESTnelement \/vobs\/some_vob\/gigi \/main\/myproject_integration\/MyProject_Slot_0_maint_int\/LATESTnelement * ...\/myproject_integration\/LATESTnelement \/vobs\/some_vob \/main\/LATEST",
"latestBlsOnConfiguredStream": null,
"stream": null
},
{
},
{
"parameters": [
{
"name": "CLEARCASE_VIEWTAG",
"value": "jenkins_MyProject_latest"
},
{
"name": "BUILD_DEBUG",
"value": false
},
{
"name": "CLEAN_BUILD",
"value": true
},
{
"name": "BASEVERSION",
"value": "7.4.1"
},
{
"name": "ARTIFACTID",
"value": "lowercaseprojectname"
},
{
"name": "SYSTEM",
"value": "myprojectSystem"
},
{
"name": "LOT",
"value": "02"
},
{
"name": "PIPENUMBER",
"value": "7.4.1.303"
}
]
},
{
},
{
},
{
"parameters": [
{
"name": "DESCRIPTION_SETTER_DESCRIPTION",
"value": "lowercaseprojectname_V7.4.1.303"
}
]
},
{
},
{
},
{
},
{
}
],
"artifacts": [
],
"building": false,
"builtOn": "servername",
"changeSet": {
"items": [
{
"affectedPaths": [
"vobs\/some_vob\/myproject\/apps\/app1\/Java\/test\/src\/com\/giga\/highlevelproject\/myproject\/schedule\/validation\/SomeActivityTest.java"
],
"author": {
"absoluteUrl": "http:\/\/11.22.33.44:8080\/user\/hitj1620",
"fullName": "name1, name2 A"
},
"commitId": null,
"date": {
"$numberLong": "1489439532000"
},
"dateStr": "13\/03\/2017 21:12:12",
"elements": [
{
"action": "create version",
"editType": "edit",
"file": "vobs\/some_vob\/myproject\/apps\/app1\/Java\/test\/src\/com\/giga\/highlevelproject\/myproject\/schedule\/validation\/SomeActivityTest.java",
"operation": "checkin",
"version": "\/main\/MyProject_latest_int\/2"
}
],
"msg": "",
"timestamp": -1,
"user": "user111"
}
],
"kind": null
},
"culprits": [
{
"absoluteUrl": "http:\/\/11.22.33.44:8080\/user\/nuka1620",
"fullName": "nuka, Chuck"
}
],
"description": "lowercaseprojectname_V7.4.1.303",
"displayName": "#303",
"duration": 525758,
"estimatedDuration": 306374,
"executor": null,
"fullDisplayName": "MyProject \u00bb MyProject-build #303",
"highlevelproject_metrics_source_url": "http:\/\/11.22.33.44:8080\/job\/MyProject\/job\/MyProject-build\/303\/\/api\/json",
"id": "303",
"keepLog": false,
"number": 303,
"projectName": "MyProject-build",
"queueId": 8201,
"result": "SUCCESS",
"timeToRepair": null,
"timestamp": {
"$numberLong": "1489439650307"
},
"url": "http:\/\/11.22.33.44:8080\/job\/MyProject\/job\/MyProject-build\/303\/"
}
When the regexes are in a file, you don't have to escape double quotes; you don't have to fight to get your double quotes past the shell.
"number":.*"projectName
"projectName":.*,"queueId
"queueId":.*,"result
"result":".*$
When that's fixed, I get:
$ egrep -o -f egrep-pattern.txt jsonfile.json
"number": 303,"projectName
"queueId":8881,"result
$
The trouble now is, I think, that you've consumed the projectName with the first pattern, so the others don't get a chance to match it. Change the patterns to read up to a comma and you can get better results:
"number":[^,]*
"projectName":[^,]*
"queueId":[^,]*
"result":".*$
yields:
"number": 303
"projectName": "giga"
"queueId":8881
"result":"SUCESS"}
You could try to be more delicate, but you rapidly reach a point where a JSON-aware tool becomes more sensible. Commas in a string value would mess up the modified regexes, for example. (So, if the project name was "Giga, if not Tera", you'd have problems.)
Matching more general JSON name:value notation
As long as you're looking for simple "key":"quoted value" objects, you can use the following grep -E (aka egrep) command:
grep -Eoe '"[^"]+":"((\\(["\\/bfnrt]|u[0-9a-fA-F]{4}))|[^"])*"' data
Given the JSON-like data (in the file called data):
{"key1":"value","key2":"value2 with \"quoted\" text","key3":"value3 with \\ and \/ and \f and \uA32D embedded"}
that script produces:
"key1":"value"
"key2":"value2 with \"quoted\" text"
"key3":"value3 with \\ and \/ and \f and \uA32D embedded"
You can upgrade it to handle almost any valid JSON "key":value by using:
grep -Eoe '"[^"]+":(("((\\(["\\/bfnrt]|u[0-9a-fA-F]{4}))|[^"])*")|true|false|null|(-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][-+]?[0-9]+)?))' data
With a new data file containing:
{"key1":"value","key2":"value2 with \"quoted\" text"}
{"key3":"value3 with \\ and \/ and \f and \uA32D embedded"}
{"key4":false,"key5":true,"key6":null,"key7":7,"key8":0,"key9":0.123E-23}
{"key10":10,"key11":3.14159,"key12":0.876,"key13":-543.123}
the script produces:
"key1":"value"
"key2":"value2 with \"quoted\" text"
"key3":"value3 with \\ and \/ and \f and \uA32D embedded"
"key4":false
"key5":true
"key6":null
"key7":7
"key8":0
"key9":0.123E-23
"key10":10
"key11":3.14159
"key12":0.876
"key13":-543.123
You can follow the railroad diagrams in the outline JSON specification at http://json.org to see how I created the regex.
It could be enhanced by the judicious addition of [[:space:]]* in places where spaces are permitted but not required — before the key string, before the colon, after the colon (you could add it after the value too, but you probably don't want that).
Another simplification I've made is that the key doesn't allow for the various escape sequences that the value string does; you could repeat that escape handling for the key if you need it.
And, of course, this only works for 'leaf' name:value pairs; if a value is itself an object {…} or an array […], this doesn't handle the value as a whole.
However, this just goes to emphasize that it gets very messy very quickly and you would be better off using a special-purpose JSON query tool. One such tool is jq, as mentioned in a comment to the main query.
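To illustrate, for the "parameters" arrays described in the question, a JSON-aware approach really is only a few lines. Here is a rough Python sketch, assuming one JSON document per line as described; the builds.json file name is an assumption, and the parameter names printed at the end are just examples taken from the sample blob:
import json

# Collapse every "parameters" list found under "actions" into a single
# name -> value mapping per build, so array positions no longer matter.
with open("builds.json") as fh:
    for line in fh:
        if not line.strip():
            continue
        build = json.loads(line)
        params = {}
        for action in build.get("actions", []):
            for p in action.get("parameters", []):
                params[p["name"]] = p["value"]
        print(build.get("number"), build.get("result"),
              params.get("ARTIFACTID"), params.get("PIPENUMBER"))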
The complex JSON blob I had was from Jenkins (i.e. a Jenkins job's REST API data), stored in a MongoDB database.
To grab it from MongoDB, I successfully used the mongoexport command to generate the JSON blobs (not as a JSON array and not pretty-printed):
#!/bin/bash
server=localhost
db=database_Jenkins                 # database named in the mongo connection string below
exportDir=~/mongoDB_fetch           # assumed output directory; adjust as needed
collectionFile=collections.txt

## Generate a file listing all collections in the Jenkins database in MongoDB.
( set -x
mongo "mongoDbServer.company.com/database_Jenkins" --eval "rs.slaveOk();db.getCollectionNames()" --quiet > ${collectionFile}
)

## Create one JSON file per collection (strip the [ ] " , characters around the collection names first).
for collection in $(cat ${collectionFile} | sed -e 's/[][",]/ /g')
do
    mongoexport --host ${server} --db ${db} --collection "${collection}" --out ${exportDir}/${collection}.json
    ## mongoexport --host ${server} --db ${db} --collection "${collection}" --type=csv --fieldFile ~/mongoDB_fetch/get_these_csv_fields.txt --out ${exportDir}/${collection}.csv ## This didn't work with nested fields; the fieldFile just contained one field name per line in a particular xyz.IndexNumber.yyy format.
done
I tried the built-in mongoexport command's --type=csv with -f fields to catch topfield.0.subField, field2, field3.7.parameters.7, and so on; nothing worked.
PS: The number after the . is how you specify array indexes if you are going to create a CSV file with the (mandatory) fields option of the mongoexport command.
As my JSON structure was completely unstructured (Jenkins version bumps/upgrades happened in the past, so the data about a job does not all have the same structure), I tried this final sed trick, since the JSON data for each entry was on its own line.
The sed command below gives you all keys and their values (in key=value format), one per line, at the leaf level of almost any JSON blob, or at least of the Jenkins JSON blob. Once you have this, you can feed the output to a temporary file, read the value part (after the = sign), and build your CSV accordingly. Yes, you have to sort it so that the CSV fields stay in a fixed order and values end up under the right column/header. I computed the field names from the temporary key=value output of all the different collection JSON files, then read each temporary collection file and appended the values into the final CSV under the respective header/field/column.
OK, this is a weird solution, but at least it's a solution, and it's a one-liner.
cat myJenkinsJob.json | sed "s/{}//g;s/,,*/,/g;s/},\"/\n/g;s/},{/\n/g;s/\([^\"a-zA-Z]\),\"/\1\n/g;s/:\[{/\n/g;s/\"name\":\"//g;s/\",\"value//g;s/,\"/\n/g;s/\":\"*/=/g;s/\"//g;s/[\[}\]]//g;s/[{}]//g;s/\$[a-zA-Z][a-zA-Z]*=//g"|grep "=" | sed "s/,$//"|egrep -v "=-|=$|=\[|^_class="
Tweak the sed part a little to suit your own data if your JSON blob shows funny characters that you don't want. The order of the sed operations above is important. I'm also excluding some redundant variables that I don't need at this time (for example, the JSON blob contained _class="..." values), which I filter out via egrep -v after the last | pipe.
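For comparison, the same leaf-level key=value flattening can be done without sed. Here is a small Python sketch (reading one JSON document per line on stdin; the dotted key names are just one possible naming convention):
import json
import sys

def leaves(node, prefix=""):
    # Yield (dotted.key, value) pairs for every leaf of a nested JSON value.
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaves(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from leaves(value, f"{prefix}{index}.")
    else:
        yield prefix.rstrip("."), node

for line in sys.stdin:
    if line.strip():
        for key, value in leaves(json.loads(line)):
            print(f"{key}={value}")
Run it as, for example, python flatten_json.py < myJenkinsJob.json; the output can then be sorted and fed into the same CSV-building step described above.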
I'm using Solr in order to index my data.
Through Solr's UI, in the Schema window, I added two fields: word and messageid.
Then I made the following POST request:
curl -X POST -H "Content-Type: application/json" 'http://localhost:8983/solr/messenger/update.json/docs' --data-binary '{"word":"hello","messageid":"23523"}'
I received the following JSON:
{
"responseHeader": {
"status": 0,
"QTime": 55
}
}
When I go to the Query window in the UI and execute a query without parameters, I get the following JSON:
{
"responseHeader": {
"status": 0,
"QTime": 0,
"params": {
"q": "*:*",
"indent": "on",
"wt": "json",
"_": "1488911768817"
}
},
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": "92db6722-d10d-447a-b5b1-13ad9b70b3e2",
"_src_": "{\"word\":\"hello\",\"messageid\":\"23523\"}",
"_version_": 1561232739042066432
}
]
}
}
Shouldn't my JSON look more like the following?
//More Code
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": "92db6722-d10d-447a-b5b1-13ad9b70b3e2",
"word": "hello",
"messageid": "23523",
"_version_": 1561232739042066432
}
//More Code
In order to be able to filter on those fields later on using query parameters?
It turns out you were using the so-called 'custom JSON indexing' approach, which is described here. You can tweak it as described in the wiki in order to extract the desired fields. Here is an excerpt for your reference:
split: Defines the path at which to split the input JSON into multiple Solr documents and is required if you have multiple documents in a single JSON file. If the entire JSON makes a single solr document, the path must be “/”. It is possible to pass multiple split paths by separating them with a pipe (|) example : split=/|/foo|/foo/bar . If one path is a child of another, they automatically become a child document
f: This is a multivalued mapping parameter. The format of the parameter is target-field-name:json-path. The json-path is required. The target-field-name is the Solr document field name, and is optional. If not specified, it is automatically derived from the input JSON.The default target field name is the fully qualified name of the field. Wildcards can be used here, see the section Wildcards below for more information.
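As a rough sketch of how those two parameters could be applied here (using the Python requests library; the /update/json/docs path is the custom JSON indexing handler from the Solr docs, and the field mappings simply mirror the two schema fields in the question):
import requests

params = {
    "split": "/",                                   # the whole JSON object is one document
    "f": ["word:/word", "messageid:/messageid"],    # target-field-name:json-path
    "commit": "true",
}
doc = {"word": "hello", "messageid": "23523"}

resp = requests.post(
    "http://localhost:8983/solr/messenger/update/json/docs",
    params=params,
    json=doc,
)
print(resp.json())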
But I would recommend the standard approach of indexing documents, which is the good old update command from here. It would look more like:
curl 'http://localhost:8983/solr/messenger/update?commit=true' --data-binary '[{"word":"hello","messageid":"23523"}]' -H 'Content-type:application/json'
I am trying to index a JSON field in Elasticsearch. I have given it an explicit mapping saying the field should be treated as a string and not as JSON; indexing is also not required for it, so there is no need to analyze it. The mapping for it is the following:
"json_field": {
"type": "string",
"index": "no"
},
Still, at indexing time this field gets analyzed, and because of that I am getting a MapperParsingException.
In short: how can we store JSON as a string in Elasticsearch without it being analyzed?
Finally got it. If you want to store JSON as a string without analyzing it, the mapping should be like this:
"json_field": {
"type": "object",
"enabled" : false
},
The enabled flag allows you to disable parsing and indexing of a named object completely. This is handy when a portion of the JSON document contains arbitrary JSON that should not be indexed, nor added to the mapping.
Update - From ES version 7.12 "enabled" has been changed to "index".
As of today ElasticSearch 7.12 "enabled" is now "index".
So mapping should be like this:
"json_field": {
"type": "object",
"index" : false
},
Solution
Set "enabled": false for the field.
curl -X PUT "localhost:9200/{{INDEX-NAME}}/_mapping/doc" -H 'Content-Type: application/json' -d'
{
"properties" : {
"json_field" : {
"type" : "object",
"enabled": false
}
}
}'
Note: This cannot be applied to an existing field. Either set it in the mapping when the index is created, or add it as a new field.
Explanation
The enabled setting, which can be applied only to the top-level mapping definition and to object fields, causes Elasticsearch to skip parsing of the contents of the field entirely. The JSON can still be retrieved from the _source field, but it is not searchable or stored in any other way:
Ref: Elasticsearch Docs
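To see the behaviour end to end, here is a rough sketch using the Python requests library against a hypothetical index called my-index (the field names are made up for illustration, and the index is assumed to already have the "enabled": false object mapping for json_field shown above):
import requests

base = "http://localhost:9200/my-index"   # hypothetical index name

# Index a document whose json_field holds arbitrary JSON; with
# "enabled": false the object is kept verbatim in _source but never
# parsed or indexed.  refresh=true makes it visible to the searches below.
doc = {"title": "demo", "json_field": {"anything": {"goes": [1, 2, 3]}}}
requests.post(f"{base}/_doc", params={"refresh": "true"}, json=doc)

# The raw JSON comes back untouched in _source.
res = requests.post(f"{base}/_search", json={"query": {"match_all": {}}}).json()
print(res["hits"]["hits"][0]["_source"]["json_field"])

# A match query on a path inside json_field finds nothing, because the
# field was never indexed.
res = requests.post(
    f"{base}/_search",
    json={"query": {"match": {"json_field.anything.goes": 1}}},
).json()
print(res["hits"]["total"])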