JSON returned by Solr

I'm using Solr to index my data.
Through Solr's Admin UI I added two fields, word and messageid, in the Schema screen.
Then I made the following POST request:
curl -X POST -H "Content-Type: application/json" 'http://localhost:8983/solr/messenger/update/json/docs' --data-binary '{"word":"hello","messageid":"23523"}'
I received the following JSON:
{
    "responseHeader": {
        "status": 0,
        "QTime": 55
    }
}
When I go to the Query screen in the Admin UI and execute a query without parameters, I get the following JSON:
{
    "responseHeader": {
        "status": 0,
        "QTime": 0,
        "params": {
            "q": "*:*",
            "indent": "on",
            "wt": "json",
            "_": "1488911768817"
        }
    },
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [
            {
                "id": "92db6722-d10d-447a-b5b1-13ad9b70b3e2",
                "_src_": "{\"word\":\"hello\",\"messageid\":\"23523\"}",
                "_version_": 1561232739042066432
            }
        ]
    }
}
Shouldn't my JSON look more like the following,
//More Code
"response": {
    "numFound": 1,
    "start": 0,
    "docs": [
        {
            "id": "92db6722-d10d-447a-b5b1-13ad9b70b3e2",
            "word": "hello",
            "messageid": "23523",
            "_version_": 1561232739042066432
        }
//More Code
so that later on I can filter on those fields using query parameters?

It turns out you were using the so-called 'custom JSON indexing' approach, which is described here. You can tweak it as described in the wiki in order to extract the desired fields. Here is an excerpt for your reference:
split: Defines the path at which to split the input JSON into multiple Solr documents and is required if you have multiple documents in a single JSON file. If the entire JSON makes a single Solr document, the path must be "/". It is possible to pass multiple split paths by separating them with a pipe (|), for example: split=/|/foo|/foo/bar. If one path is a child of another, they automatically become a child document.
f: This is a multivalued mapping parameter. The format of the parameter is target-field-name:json-path. The json-path is required. The target-field-name is the Solr document field name, and is optional. If not specified, it is automatically derived from the input JSON. The default target field name is the fully qualified name of the field. Wildcards can be used here; see the section Wildcards below for more information.
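For instance, to pull your two fields out of the _src_ blob into real fields, something like the following should work (a sketch, untested; it maps the word and messageid paths explicitly):
curl 'http://localhost:8983/solr/messenger/update/json/docs?split=/&f=word:/word&f=messageid:/messageid&commit=true' -H 'Content-type:application/json' --data-binary '{"word":"hello","messageid":"23523"}'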
But I would recommend the standard approach of indexing documents, which is the good old update command from here. So it would look more like:
curl 'http://localhost:8983/solr/messenger/update?commit=true' --data-binary '[{"word":"hello","messageid":"23523"}]' -H 'Content-type:application/json'
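With the standard handler each key is indexed as its own field, so the q=*:* query from the Query screen should then return documents shaped like the one you expected, roughly (a sketch; id and _version_ are generated by Solr as before):
{
    "id": "...",
    "word": "hello",
    "messageid": "23523",
    "_version_": ...
}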

Related

rename invalid keys from JSON

I have the following flow in NiFi, and the JSON has 1000+ objects in it:
InvokeHTTP -> SplitJson -> PutMongo
The flow works fine until I receive some keys in the JSON with "." in the name, e.g. "spark.databricks.acl.dfAclsEnabled".
My current solution is not optimal. I have jotted down the bad keys and am using multiple ReplaceText processors to replace "." with "_". I am not using regex; I am using string-literal find/replace. So each time I get a failure in the PutMongo processor, I insert a new ReplaceText processor.
This is not maintainable, and I am wondering if I can use JOLT for this. A couple of notes regarding the input JSON:
1) There is no set structure; the only thing that is confirmed is that everything will be in an events array, but the event object itself is free-form.
2) The maximum list size is 1000.
3) It is third-party JSON, so I can't ask for a change in the format.
Also, a key with "." can appear at any level, so I am looking for a JOLT spec that can cleanse and rename keys at all levels.
{
    "events": [
        {
            "cluster_id": "0717-035521-puny598",
            "timestamp": 1531896847915,
            "type": "EDITED",
            "details": {
                "previous_attributes": {
                    "cluster_name": "Kylo",
                    "spark_version": "4.1.x-scala2.11",
                    "spark_conf": {
                        "spark.databricks.acl.dfAclsEnabled": "true",
                        "spark.databricks.repl.allowedLanguages": "python,sql"
                    },
                    "node_type_id": "Standard_DS3_v2",
                    "driver_node_type_id": "Standard_DS3_v2",
                    "autotermination_minutes": 10,
                    "enable_elastic_disk": true,
                    "cluster_source": "UI"
                },
                "attributes": {
                    "cluster_name": "Kylo",
                    "spark_version": "4.1.x-scala2.11",
                    "node_type_id": "Standard_DS3_v2",
                    "driver_node_type_id": "Standard_DS3_v2",
                    "autotermination_minutes": 10,
                    "enable_elastic_disk": true,
                    "cluster_source": "UI"
                },
                "previous_cluster_size": {
                    "autoscale": {
                        "min_workers": 1,
                        "max_workers": 8
                    }
                },
                "cluster_size": {
                    "autoscale": {
                        "min_workers": 1,
                        "max_workers": 8
                    }
                },
                "user": ""
            }
        },
        {
            "cluster_id": "0717-035521-puny598",
            "timestamp": 1535540053785,
            "type": "TERMINATING",
            "details": {
                "reason": {
                    "code": "INACTIVITY",
                    "parameters": {
                        "inactivity_duration_min": "15"
                    }
                }
            }
        },
        {
            "cluster_id": "0717-035521-puny598",
            "timestamp": 1535537117300,
            "type": "EXPANDED_DISK",
            "details": {
                "previous_disk_size": 29454626816,
                "disk_size": 136828809216,
                "free_space": 17151311872,
                "instance_id": "6cea5c332af94d7f85aff23e5d8cea37"
            }
        }
    ]
}
I created a template using ReplaceText and RouteOnContent to perform this task. The loop is required because the regex only replaces the first . in each JSON key on each pass. You might be able to refine this to perform all substitutions in a single pass, but after fuzzing the regex with look-ahead and look-behind groups for a few minutes, re-routing was faster. I verified this works with the JSON you provided, and also with JSON where the keys and values are on different lines (with the : on either line):
...
"spark_conf": {
    "spark.databricks.acl.dfAclsEnabled":
    "true",
    "spark.databricks.repl.allowedLanguages"
    : "python,sql"
},
...
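For reference, the ReplaceText configuration uses a regex along these lines (a hypothetical sketch, not the exact expression from the template; it renames only the first . in each quoted key, which is why the RouteOnContent loop re-runs it until no dotted keys remain):
Search Value: ("[^".]*)\.([^"]*"\s*:)
Replacement Value: $1_$2
The \s* lets the match span a line break between the key and its colon, which covers the reformatted JSON shown above.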
You could also use an ExecuteScript processor with Groovy to ingest the JSON, quickly filter all JSON keys that contain ., perform a collect operation to do the replacement, and re-insert the keys in the JSON data if you want a single processor to do this in a single pass.

SOLR post json file Default fieldtype

I have a POSTAL_CODE field in my JSON file. If I try importing that data into Solr using Solr's post tool, the field type is set to 'plongs', which is not suitable for data like "108-0023". Because of that, the data import throws an error. Is there a workaround for this kind of issue?
Edit:
Sample data which you might use to check it:
{
    "id": "1",
    "POSTAL_CODE": "1982"
},
{
    "id": "2",
    "POSTAL_CODE": "1947"
},
{
    "id": "3",
    "POSTAL_CODE": "19473"
},
{
    "id": "4",
    "POSTAL_CODE": "19471"
},
{
    "id": "5",
    "POSTAL_CODE": "1947-123"
}
In the above sample, I don't understand why 'id' is not considered 'plongs' or 'pints' but only 'POSTAL_CODE' has that issue. If the first element had POSTAL_CODE as, say, "1947-145", then the field type would be taken as 'text_general'. Generally, if the value is in double quotes (i.e., "Data": "123"), shouldn't it be considered a string value?
Remove the collection, create it anew, and before you index anything, define a field POSTAL_CODE in your schema with type string. Solr will then index any incoming data on this field without guessing and use the string type instead, which means the value is indexed as-is.
Copied and adapted from https://lucene.apache.org/solr/guide/7_0/schema-api.html, but untested:
curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field": {
        "name": "POSTAL_CODE",
        "type": "string",
        "stored": true
    }
}' http://localhost:8983/solr/yourcollectionhere/schema
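Once the field exists, re-index your data, for instance with the post tool (the file name here is hypothetical):
bin/post -c yourcollectionhere data.json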
I tried to import the data by creating a raw JSON document with the POSTAL_CODE field. Below is my JSON; my Solr version is 7.2.1:
{"array": [1,2,3],"boolean": true,"color": "#82b92c","null": null,"number": 123,"POSTAL_CODE": "108-0023"}
It is indexed as a Text Field in Solr. The command I triggered to index the data is as below:
bin/post -c gettingstarted test.json
Could you please provide the sample data and the version of Solr on which you are facing this issue?

How to Retrieve and Query JSON type fields in Apache Solr 6.5

My goal is to retrieve JSON-type fields from a Solr index and also perform search queries on such fields.
I have the following documents in my Solr index, using the auto-generated schema from Solr's schemaless feature.
POST http://localhost:8983/solr/test1/update?commitWithin=1000
[
    {"id": "1", "type_s": "book", "title_t": "The Way of Kings", "author_s": "Brandon Sanderson",
     "miscinfo": {"provider": "orielly", "site": "US"}},
    {"id": "2", "type_s": "book", "title_t": "The Game of Thrones", "author_s": "James Sanderson",
     "miscinfo": {"provider": "pacman", "site": "US"}}
]
I see the JSON values are stored as strings in the schema field type, as seen in the output of the following:
GET http://localhost:8983/solr/test1/schema/fields
{
    "name": "miscinfo",
    "type": "strings"
}
I had tried using srcField as mentioned in this post. However, a query to retrieve the JSON type returns an empty response. Below are the GET requests used for the same:
GET http://localhost:8983/solr/test1/select?q=1&fl=miscinfo&wt=json
GET http://localhost:8983/solr/test1/select?q=1&fl=miscinfo,source_s:[json]&wt=json
Also, search queries for values inside JSON-type fields return an empty response:
http://localhost:8983/solr/test1/select?q=pacman&wt=json
{
    "responseHeader": {
        "status": 0,
        "QTime": 0,
        "params": {
            "q": "pacman",
            "json": "",
            "wt": "json"
        }
    },
    "response": {
        "numFound": 0,
        "start": 0,
        "docs": []
    }
}
Please help in searching object types in Solr.
Have you checked this: https://cwiki.apache.org/confluence/display/solr/Response+Writers
JSON Response Writer: A very commonly used Response Writer is the JsonResponseWriter, which formats output in JavaScript Object Notation (JSON), a lightweight data interchange format specified in RFC 4627. Setting the wt parameter to json invokes this Response Writer. Here is a sample response for a simple query like q=id:VS1GB400C3&wt=json:
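The guide's sample itself is not reproduced above, but it has the standard response shape, abridged here as a sketch:
{
    "responseHeader": {
        "status": 0,
        "QTime": 1,
        "params": {
            "q": "id:VS1GB400C3",
            "wt": "json"
        }
    },
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [
            {"id": "VS1GB400C3", ...}
        ]
    }
}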

indexing json file using solr

I am able to index simple JSON using Solr, but for complex JSON with nested structures like the one below, I am getting an error. I am using the following curl command to index the JSON file:
curl 'https://localhost:8983/solr/json_collection/update?commit=true' --data-binary @/home/mic.json -H 'Content-type:application/json'
Error:
Error - {"responseHeader":{"status":400,"QTime":12},"error":{"metadata":["error-class","org.apache.solr.common.SolrException"],"msg":"Error parsing JSON field value. Unexpected OBJECT_START","code":400}}
JSON:
[
    {
        "PART I, ITEM 1. BUSINESS": {
            "GENERAL": {
                "Our vision": {
                    "text": [
                        "Microsoft world."
                    ]
                },
                "The ambitions that drive us": {
                    "text": [
                        "To carry ambitions:",
                        "* Create more personal computing."
                    ],
                    "Create more personal computing": {
                        "text": [
                            "We strive available. website."
                        ]
                    }
                }
            },
            "ITEM 1A. RISK FACTORS": "Our opk."
        }
    }
]
Your JSON seems to be erroneous. In either case, a single object or an array of JSON, your JSON should follow the basic conventions.
In the case of a single object, the syntax should be:
{ "key": "value" }
In the case of an array of JSON values, the syntax can be:
{
    "key1": ["value1", "value2", ...],
    "key2": ["value12", "value22", ...]
}
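In particular, Solr's standard /update handler expects flat documents, which is why it reports "Unexpected OBJECT_START" when it hits a nested object. A payload it can index would look something like this (a sketch with made-up field names, flattening your sections into plain fields):
[
    {"id": "1", "section_s": "PART I, ITEM 1. BUSINESS - GENERAL - Our vision", "text_t": "Microsoft world."},
    {"id": "2", "section_s": "ITEM 1A. RISK FACTORS", "text_t": "Our opk."}
]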

Mapping format on elasticsearch

I'm trying to upload a JSON document to my server via Elasticsearch, but I wanted to map it before I upload it, and I keep getting a search phase execution exception error.
The JSON data looks like this:
{"geometry":{"type":"Point","coordinates":[-73.20266100000001,45.573647]},"properties":{"persistent_id":"XVCPFsbsqB7h4PrxEtCU3w==","timestamp":1408216040000,"tower_id":"10.48.66.178"}}
So far I've tried this as my mapping. I'm not sure what I am doing wrong...
curl –XPUT 'http://localhost:9200/carrier/_search?q=coordinates?pretty=true' -d'
{ “geometry”: {
“type” : {“type” : “string”},
“coordinates” : {“type” : “geo_point”}
},
“properties” : {
“persistent_id” : {“type” : “string”},
“timestamp”: { “type” : “long”},
“tower_id” : {“type” : “string”}
}'
There are a few problems here. First of all, you need to use a put mapping request instead of a search request. The body of the request has to start with the name of the type, followed by the list of properties (fields) that you add. The second problem is that you probably copied the example from some documentation where all ASCII quotes (") were replaced with their fancy Unicode versions (“ and ”), and the dash in front of the XPUT parameter looks like an n-dash (–) instead of a normal dash (-). You need to replace all fancy quotes and dashes with their ASCII versions. So, all together, the working statement should look like this (assuming doc as your document type):
curl -XPUT 'http://localhost:9200/carrier/doc/_mapping' -d '{
    "doc": {
        "properties": {
            "geometry": {
                "properties": {
                    "type": {
                        "type": "string"
                    },
                    "coordinates": {
                        "type": "geo_point"
                    }
                }
            },
            "properties": {
                "properties": {
                    "persistent_id": {
                        "type": "string"
                    },
                    "timestamp": {
                        "type": "long"
                    },
                    "tower_id": {
                        "type": "string"
                    }
                }
            }
        }
    }
}'
Then you can add a document like this:
curl -XPUT 'http://localhost:9200/carrier/doc/1' -d '{"geometry":{"type":"Point","coordinates":[-73.20266100000001,45.573647]},"properties":{"persistent_id":"XVCPFsbsqB7h4PrxEtCU3w==","timestamp":1408216040000,"tower_id":"10.48.66.178"}}'
Please note that in order to add the mapping you might need to delete and recreate the index if you already tried to add documents to this index and the mapping was already created.
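A minimal sketch of that reset (note: deleting the index removes any data already in it):
curl -XDELETE 'http://localhost:9200/carrier'
curl -XPUT 'http://localhost:9200/carrier'
After recreating the index, re-run the mapping request above and then index your documents.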
This is because you're using the _search endpoint in order to install your mapping.
You have to use the _mapping endpoint instead, like this:
curl -XPUT 'http://localhost:9200/carrier/_mapping/geometry' -d '{
...your mapping...
}'