Indexing a JSON file using Solr

I am able to index simple JSON using Solr, but for complex JSON with nested structures like the one below I get an error. I am using the following curl command to index the JSON file:
curl 'https://localhost:8983/solr/json_collection/update?commit=true' --data-binary @/home/mic.json -H 'Content-type:application/json'
Error:
Error - {"responseHeader":{"status":400,"QTime":12},"error":{"metadata":["error-class","org.apache.solr.common.SolrException"],"msg":"Error parsing JSON field value. Unexpected OBJECT_START","code":400}}
JSON:
[
  {
    "PART I, ITEM 1. BUSINESS": {
      "GENERAL": {
        "Our vision": {
          "text": [
            "Microsoft world."
          ]
        },
        "The ambitions that drive us": {
          "text": [
            "To carry ambitions:",
            "* Create more personal computing."
          ],
          "Create more personal computing": {
            "text": [
              "We strive available. website."
            ]
          }
        }
      },
      "ITEM 1A. RISK FACTORS": "Our opk."
    }
  }
]

Your JSON seems to be erroneous. Whether you send a single object or an array of objects, the JSON should follow the basic flat conventions Solr expects.
For a single object, the syntax should be:
{ "key": "value" }
For array values, the syntax can be:
{
  "key1": ["value1", "value2", ...],
  "key2": ["value12", "value22", ...]
}
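If you need to keep the nested content, one workaround is to flatten it into the flat key/value shape shown above before posting. A minimal sketch in Python, assuming path-style field names such as "PART I, ITEM 1. BUSINESS/GENERAL/Our vision/text" are acceptable in your schema (the output file name is arbitrary):
import json

def flatten(obj, prefix=""):
    """Recursively collapse nested objects into "a/b/c"-style keys."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value  # leaves (strings, lists of strings) stay as-is
    return flat

with open("/home/mic.json") as f:
    docs = [flatten(doc) for doc in json.load(f)]

# Write the flat documents back out; this file can then be posted
# with the same curl command as above.
with open("/home/mic_flat.json", "w") as f:
    json.dump(docs, f, indent=2)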

Related

How to filter only JSON from TEXT and JSON mixed format in logstash

We have input coming from one of the applications in TEXT + JSON format, like the one below:
<12>1 2022-10-18T10:48:40.163Z 7VLX5D8 ERAServer 14016 - - {"event_type":"FilteredWebsites_Event","ipv4":"192.168.0.1","hostname":"9krkvs1","source_uuid":"11160173-r3bc-46cd-9f4e-99f66fc0a4eb","occured":"18-Oct-2022 10:48:37","severity":"Warning","event":"An attempt to connect to URL","target_address":"172.66.43.217","target_address_type":"IPv4","scanner_id":"HTTP filter","action_taken":"Blocked","handled":true,"object_uri":"https://free4pc.org","hash":"0E9ACB02118FBF52B28C3570D47D82AFB82EB58C","username":"CKFCVS1\\some.name","processname":"C:\\Users\\some.name\\AppData\\Local\\Programs\\Opera\\opera.exe","rule_id":"Blocked by internal blacklist"}
That is, <12>1 2022-10-18T10:48:40.163Z 7VLX5D8 ERAServer 14016 - - is the TEXT part and the rest is JSON.
The TEXT part is always similar; only the date and time differ, so it would be fine even to drop the TEXT part entirely.
The JSON part is random, but it contains useful information.
Currently, in Kibana, the logs appear in the message field, but the individual fields do not appear because the JSON is not parsed properly.
We verified that pushing ONLY the required JSON part (by placing it in the file manually) gives us the required output in Kibana.
So our question is how to achieve this through Logstash filters/grok.
Update:
@Val - We already have the configuration below:
input {
  syslog {
    port => 5044
    codec => json
  }
}
But the output in Kibana appears as a single unparsed message field (screenshot omitted), and we want the JSON keys to show up as separate fields.
Even though syslog seems like an appealing way of shipping data, it is a big mess in terms of standardization, and everyone has a different way of shipping data. The Logstash syslog input only supports RFC3164, and your log format doesn't match that standard.
You can still bypass the normal RFC3164 parsing by providing your own grok pattern, as shown below:
input {
  syslog {
    port => 5044
    grok_pattern => "<%{POSINT:priority_key}>%{POSINT:version} %{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:[observer][hostname]} %{WORD:[observer][name]} %{WORD:[process][id]} - - %{GREEDYDATA:[event][original]}"
  }
}
filter {
  json {
    source => "[event][original]"
  }
}
output {
  stdout { codec => json }
}
Running Logstash with the above config, your sample log line gets parsed as this:
{
  "@timestamp": "2022-10-18T10:48:40.163Z",
  "@version": "1",
  "action_taken": "Blocked",
  "event": "An attempt to connect to URL",
  "event_type": "FilteredWebsites_Event",
  "facility": 0,
  "facility_label": "kernel",
  "handled": true,
  "hash": "0E9ACB02118FBF52B28C3570D47D82AFB82EB58C",
  "host": "0:0:0:0:0:0:0:1",
  "hostname": "9krkvs1",
  "ipv4": "192.168.0.1",
  "message": "<12>1 2022-10-18T10:48:40.163Z 7VLX5D8 ERAServer 14016 - - {\"event_type\":\"FilteredWebsites_Event\",\"ipv4\":\"192.168.0.1\",\"hostname\":\"9krkvs1\",\"source_uuid\":\"11160173-r3bc-46cd-9f4e-99f66fc0a4eb\",\"occured\":\"18-Oct-2022 10:48:37\",\"severity\":\"Warning\",\"event\":\"An attempt to connect to URL\",\"target_address\":\"172.66.43.217\",\"target_address_type\":\"IPv4\",\"scanner_id\":\"HTTP filter\",\"action_taken\":\"Blocked\",\"handled\":true,\"object_uri\":\"https://free4pc.org\",\"hash\":\"0E9ACB02118FBF52B28C3570D47D82AFB82EB58C\",\"username\":\"CKFCVS1\\\\some.name\",\"processname\":\"C:\\\\Users\\\\some.name\\\\AppData\\\\Local\\\\Programs\\\\Opera\\\\opera.exe\",\"rule_id\":\"Blocked by internal blacklist\"}\n",
  "object_uri": "https://free4pc.org",
  "observer": {
    "hostname": "7VLX5D8",
    "name": "ERAServer"
  },
  "occured": "18-Oct-2022 10:48:37",
  "priority": 0,
  "priority_key": "12",
  "process": {
    "id": "14016"
  },
  "processname": "C:\\Users\\some.name\\AppData\\Local\\Programs\\Opera\\opera.exe",
  "rule_id": "Blocked by internal blacklist",
  "scanner_id": "HTTP filter",
  "severity": "Warning",
  "severity_label": "Emergency",
  "source_uuid": "11160173-r3bc-46cd-9f4e-99f66fc0a4eb",
  "target_address": "172.66.43.217",
  "target_address_type": "IPv4",
  "timestamp": "2022-10-18T10:48:40.163Z",
  "username": "CKFCVS1\\some.name",
  "version": "1"
}
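If you want to sanity-check the separation outside Logstash, here is a rough Python equivalent of the grok pattern plus json filter above (the regex group names are assumptions mirroring the grok captures):
import json
import re

# Header layout assumed: <pri>ver timestamp host app pid - - payload
HEADER = re.compile(
    r"<(?P<priority_key>\d+)>(?P<version>\d+) (?P<timestamp>\S+) "
    r"(?P<observer_hostname>\S+) (?P<observer_name>\S+) (?P<process_id>\d+) - - "
    r"(?P<event_original>.*)"
)

line = '<12>1 2022-10-18T10:48:40.163Z 7VLX5D8 ERAServer 14016 - - {"event_type":"FilteredWebsites_Event","handled":true}'
match = HEADER.match(line)
if match:
    fields = match.groupdict()
    # Same role as the json filter: expand the payload into separate fields
    fields.update(json.loads(fields.pop("event_original")))
    print(fields)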

rename invalid keys from JSON

I have the following flow in NiFi, and the JSON has 1000+ objects in it.
InvokeHTTP -> SplitJson -> PutMongo
The flow works fine until I receive keys in the JSON with "." in the name, e.g. "spark.databricks.acl.dfAclsEnabled".
My current solution is not optimal: I have jotted down the bad keys and use multiple ReplaceText processors to replace "." with "_". I am not using regex; I am using string-literal find/replace. So each time PutMongo fails, I insert a new ReplaceText processor.
This is not maintainable. I am wondering if I can use JOLT for this? A couple of notes regarding the input JSON:
1) There is no set structure; the only thing that is confirmed is that everything will be in the events array, but the event objects themselves are free-form.
2) Maximum list size = 1000.
3) It is third-party JSON, so I can't ask for a change in format.
Also, keys with "." can appear at any level, so I am looking for a JOLT spec that can cleanse and rename them at all levels.
{
  "events": [
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1531896847915,
      "type": "EDITED",
      "details": {
        "previous_attributes": {
          "cluster_name": "Kylo",
          "spark_version": "4.1.x-scala2.11",
          "spark_conf": {
            "spark.databricks.acl.dfAclsEnabled": "true",
            "spark.databricks.repl.allowedLanguages": "python,sql"
          },
          "node_type_id": "Standard_DS3_v2",
          "driver_node_type_id": "Standard_DS3_v2",
          "autotermination_minutes": 10,
          "enable_elastic_disk": true,
          "cluster_source": "UI"
        },
        "attributes": {
          "cluster_name": "Kylo",
          "spark_version": "4.1.x-scala2.11",
          "node_type_id": "Standard_DS3_v2",
          "driver_node_type_id": "Standard_DS3_v2",
          "autotermination_minutes": 10,
          "enable_elastic_disk": true,
          "cluster_source": "UI"
        },
        "previous_cluster_size": {
          "autoscale": {
            "min_workers": 1,
            "max_workers": 8
          }
        },
        "cluster_size": {
          "autoscale": {
            "min_workers": 1,
            "max_workers": 8
          }
        },
        "user": ""
      }
    },
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1535540053785,
      "type": "TERMINATING",
      "details": {
        "reason": {
          "code": "INACTIVITY",
          "parameters": {
            "inactivity_duration_min": "15"
          }
        }
      }
    },
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1535537117300,
      "type": "EXPANDED_DISK",
      "details": {
        "previous_disk_size": 29454626816,
        "disk_size": 136828809216,
        "free_space": 17151311872,
        "instance_id": "6cea5c332af94d7f85aff23e5d8cea37"
      }
    }
  ]
}
I created a template using ReplaceText and RouteOnContent to perform this task. The loop is required because the regex only replaces the first "." in each JSON key on every pass. You might be able to refine this to perform all substitutions in a single pass, but after fuzzing the regex with look-ahead and look-behind groups for a few minutes, re-routing was faster. I verified this works with the JSON you provided, and also with JSON where the keys and values are on different lines (with the : on either line):
...
"spark_conf": {
"spark.databricks.acl.dfAclsEnabled":
"true",
"spark.databricks.repl.allowedLanguages"
: "python,sql"
},
...
You could also use an ExecuteScript processor with Groovy to ingest the JSON, quickly filter all JSON keys that contain ., perform a collect operation to do the replacement, and re-insert the keys in the JSON data if you want a single processor to do this in a single pass.
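For reference, here is the key-cleansing logic that the ExecuteScript approach would implement, sketched in Python rather than Groovy (the input file name is hypothetical):
import json

def clean_keys(node):
    """Rename keys containing "." to use "_", at every nesting level."""
    if isinstance(node, dict):
        return {key.replace(".", "_"): clean_keys(value) for key, value in node.items()}
    if isinstance(node, list):
        return [clean_keys(item) for item in node]
    return node

with open("events.json") as f:  # hypothetical input file
    cleaned = clean_keys(json.load(f))

# "spark.databricks.acl.dfAclsEnabled" -> "spark_databricks_acl_dfAclsEnabled"
print(json.dumps(cleaned, indent=2))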

How to edit a json dictionary in Robot Framework

I am currently implementing some test automation that uses a JSON POST to a REST API to initialize the test data in the SUT. Most of the fields I have no issue editing, using information I found in another thread: Json handling in ROBOT.
However, one of the sets of information I am editing is a dictionary of meta data.
{
  "title": "Test Automation Post 2018-03-06T16:12:02Z",
  "content": "dummy text",
  "excerpt": "Post made by automation for testing purposes.",
  "name": "QA User",
  "status": "publish",
  "date": "2018-03-06T16:12:02Z",
  "primary_section": "Entertainment",
  "taxonomy": {
    "section": [
      "Entertainment"
    ]
  },
  "coauthors": [
    {
      "name": "QA User - CoAuthor",
      "meta": {
        "Title": "QA Engineer",
        "Organization": "That One Place"
      }
    }
  ],
  "post_meta": [
    {
      "key": "credit",
      "value": "QA Engineer"
    },
    {
      "key": "pub_date",
      "value": "2018-03-06T16:12:02Z"
    },
    {
      "key": "last_update",
      "value": "2018-03-06T16:12:02Z"
    },
    {
      "key": "source",
      "value": "wordpress"
    }
  ]
}
Is it possible to use the Set to Dictionary Keyword on a dictionary inside a dictionary? I would like to be able to edit the value of the pub_date and last_update inside of post_meta, specifically.
The most straightforward way would be to use the Evaluate keyword, and set the sub-dict value in it. Presuming you are working with a dictionary that's called ${value}:
Evaluate    $value['post_meta'][1]['value'] = 'your new value here'
I won't get into how to find the index of the post_meta list that has the 'key' with value 'pub_date', as that's not part of your question.
Is it possible to use the Set to Dictionary Keyword on a dictionary inside a dictionary?
Yes, it's possible.
However, because post_meta is a list rather than a dictionary, you will have to write some code to iterate over all of the values of post_meta until you find one with the key you want to update.
You could do this in python quite simply. You could also write a keyword in robot to do that for you. Here's an example:
*** Keywords ***
Set list element by key
    [Arguments]    ${data}    ${target_key}    ${new_value}
    :FOR    ${item}    IN    @{data}
    \    run keyword if    '''${item['key']}''' == '''${target_key}'''
    \    ...    set to dictionary    ${item}    value=${new_value}
    [Return]    ${data}
Assuming you have a variable named ${data} that contains the original JSON data as a string, you could call this keyword like the following:
${JSON}=    evaluate    json.loads('''${data}''')    json
set list element by key    ${JSON['post_meta']}    pub_date    yesterday
set list element by key    ${JSON['post_meta']}    last_update    today
You will then have a python object in ${JSON} with the modified values.
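For comparison, here is the same lookup as a plain Python helper (the file name and new values are placeholders), which you could also expose to Robot Framework as a library keyword:
import json

def set_post_meta(data, target_key, new_value):
    """Find the post_meta entry whose "key" matches and update its "value"."""
    for item in data["post_meta"]:
        if item["key"] == target_key:
            item["value"] = new_value
            return data
    raise KeyError(f"no post_meta entry with key {target_key!r}")

with open("post.json") as f:  # placeholder file holding the JSON payload
    payload = json.load(f)

set_post_meta(payload, "pub_date", "2018-03-07T00:00:00Z")
set_post_meta(payload, "last_update", "2018-03-07T00:00:00Z")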

Invalid JSON file error while importing JSON in Firebase

I'm trying to import a JSON file (titled 'filename.json') into my Firebase database using 'Import JSON' under 'Database.'
However, I am getting an Invalid JSON file error.
The following is the structure of the JSON I wish to import. Can you please help me with where I am going wrong with this:
{
"checklist": "XXX",
"notes": ""
}
{ "checklist": "XXX",
"notes": ""
}
{
"checklist": "XXX",
"notes": ""
}
{
"checklist": "XXX",
"notes": ""
}
Your objects need commas between them. Basically, on any line where you've got a } here (except for the last one), put a comma after it. Then wrap the whole thing in [] so it's a valid JSON array.
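Applied to your snippet, the importable file would look like this:
[
  {
    "checklist": "XXX",
    "notes": ""
  },
  {
    "checklist": "XXX",
    "notes": ""
  },
  {
    "checklist": "XXX",
    "notes": ""
  },
  {
    "checklist": "XXX",
    "notes": ""
  }
]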

JSON returned by Solr

I'm using Solr in order to index my data.
Through Solr's UI I added two fields, word and messageid, in the Schema window.
After that I made the following POST request:
curl -X POST -H "Content-Type: application/json" 'http://localhost:8983/solr/messenger/update/json/docs' --data-binary '{"word":"hello","messageid":"23523"}'
I received the following JSON:
{
  "responseHeader": {
    "status": 0,
    "QTime": 55
  }
}
When I go to the Query window in the Admin UI and execute a query without parameters, I get the following JSON:
{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "q": "*:*",
      "indent": "on",
      "wt": "json",
      "_": "1488911768817"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "92db6722-d10d-447a-b5b1-13ad9b70b3e2",
        "_src_": "{\"word\":\"hello\",\"messageid\":\"23523\"}",
        "_version_": 1561232739042066432
      }
    ]
  }
}
Shouldn't my JSON appear more like the following one?
//More Code
"response": {
  "numFound": 1,
  "start": 0,
  "docs": [
    {
      "id": "92db6722-d10d-447a-b5b1-13ad9b70b3e2",
      "word": "hello",
      "messageid": "23523",
      "_version_": 1561232739042066432
    }
//More Code
So that, later on, I would be able to filter on those fields using query parameters (screenshot omitted)?
It turns out you were using the so-called 'custom JSON indexing' approach, which is described in the Solr reference guide. You can tweak it as described there in order to extract the desired fields. Here is an excerpt for your reference:
split: Defines the path at which to split the input JSON into multiple Solr documents and is required if you have multiple documents in a single JSON file. If the entire JSON makes a single solr document, the path must be “/”. It is possible to pass multiple split paths by separating them with a pipe (|) example : split=/|/foo|/foo/bar . If one path is a child of another, they automatically become a child document
f: This is a multivalued mapping parameter. The format of the parameter is target-field-name:json-path. The json-path is required. The target-field-name is the Solr document field name, and is optional. If not specified, it is automatically derived from the input JSON. The default target field name is the fully qualified name of the field. Wildcards can be used here, see the section Wildcards below for more information.
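For example, sticking with the custom JSON handler, those two parameters can be combined like this (a sketch using Python's requests; the two f mappings are assumptions based on the fields in your example):
import requests

resp = requests.post(
    "http://localhost:8983/solr/messenger/update/json/docs",
    params={
        "split": "/",  # the whole JSON body is a single Solr document
        "f": ["word:/word", "messageid:/messageid"],  # explicit field mappings
        "commit": "true",
    },
    json={"word": "hello", "messageid": "23523"},
)
print(resp.json())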
But I would recommend using the standard approach of indexing documents, which is the good old update command. It would look more like:
curl 'http://localhost:8983/solr/messenger/update?commit=true' --data-binary '{"word":"hello","messageid":"23523"}' -H 'Content-type:application/json'