How can I bulk index a JSON file into Elasticsearch?

I am working through the "Shakespeare" tutorial for Elasticsearch, indexing the complete works of Shakespeare. The Elasticsearch instance is set up on AWS, like so:
{
  "_shards" : {
    "total" : 40,
    "successful" : 20,
    "failed" : 0
  },
  "_all" : {
    "primaries" : { },
    "total" : { }
  },
  "indices" : {
    "rentalyield" : {
      "primaries" : { },
      "total" : { }
    },
    "yield" : {
      "primaries" : { },
      "total" : { }
    },
    "r" : {
      "primaries" : { },
      "total" : { }
    },
    "shakespeare" : {
      "primaries" : { },
      "total" : { }
    }
  }
}
I have created a mapping as follows:
curl -XPUT http://localhost:9200/shakespeare -d '
{
  "mappings" : {
    "_default_" : {
      "properties" : {
        "speaker" : { "type": "string", "index" : "not_analyzed" },
        "play_name" : { "type": "string", "index" : "not_analyzed" },
        "line_id" : { "type" : "integer" },
        "speech_number" : { "type" : "integer" }
      }
    }
  }
}
';
The shakespeare.json file is on my local machine and also on S3 on AWS, so I have tried:
curl -XPUT localhost:9200/_bulk --data-binary @shakespeare.json
which fails with this error message:
Warning: Couldn't read data from file "shakespeare.json", this makes an empty
Warning: POST.
{"error":{"root_cause":[{"type":"parse_exception","reason":"Failed to derive xcontent"}],"type":"parse_exception","reason":"Failed to derive xcontent"},"status":400}
and I have also tried:
curl -XPOST localhost:9200/shakespeare/type/_bulk?pretty=true --data-binary @https://s3-us-west-2.amazonaws.com/richjson/shakespeare.json
which produces the same error.
How can I do this correctly?
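For reference, a minimal form of the bulk request that usually works (a sketch, assuming shakespeare.json sits in the directory you run curl from and is already in bulk/NDJSON format, i.e. alternating action and document lines) is:

curl -s -XPOST 'localhost:9200/shakespeare/_bulk?pretty' \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary @shakespeare.json

The warning in the output above means curl could not open shakespeare.json in the working directory, which then results in an empty request body. The @ prefix tells curl to read the body from a local file, so it cannot point at an S3 URL directly; the file has to be downloaded first (for example with curl -O) and then posted from disk.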

Related

Index a JSON file into Elasticsearch: command/mapping errors

I'm new to ELK and I want to import a JSON file into Elasticsearch. This is my file:
{
"news":{
"1":{
"_score":1.0,
"_index":"newsvit",
"_source":{
"content":" \u0641\u0647\u06cc\u0645\u0647 \u062d\u0633\u0646\u200c\u0645\u06cc\u0631\u06cc: \u0627\u06af\u0631\u0686\u0647 \u062f\u0631 \u0647\u06cc\u0627\u0647\u0648\u06cc \u0627\u0646\u062a\u062e\u0627\u0628\u0627\u062a \u0631\u06cc\u0627\u0633\u062a \u062c\u0645\u0647\u0648\u0631\u06cc\u060c \u0645\u0648\u0636\u0648\u0639\u06cc \u0645\u0627\u0646\u0646\u062f \u0645\u0639\u0631\u0641\u06cc \u06a9\u0627\u0646\u062f\u06cc\u062f\u0627\u0647\u0627\u06cc \u0634\u0648\u0631\u0627\u06cc \u0634\u0647\u0631 \u062f\u0631 \u062d\u0627\u0634\u06cc\u0647 \u0642\u0631\u0627\u0631 \u06af\u0631\u0641\u062a\u0647\u060c \u0627\u0645\u0627 \u0627\u0645\u0633\u0627\u0644 \u0628\u0647 \u0639\u0646\u0648\u0627\u0646 \u067e\u0646\u062c\u0645\u06cc\u0646 \u062f\u0648\u0631\u0647 \u0627\u0646\u062a\u062e\u0627\u0628 \u0627\u0639\u0636\u0627\u06cc \u0634\u0648\u0631\u0627\u06cc \u0634\u0647\u0631\u060c \u0627\u06cc\u0646 \u0631\u0648\u06cc\u062f\u0627\u062f \u0628\u0647 \u0646\u0633\u0628\u062a \u062f\u0648\u0631\u0647\u200c\u0647\u0627\u06cc \u0642\u0628\u0644\u060c \u0628\u06cc\u0634\u062a\u0631 \u0645\u0648\u0631\u062f \u062a\u0648\u062c\u0647 \u0648\u0627\u0642\u0639 \u0634\u062f\u0647. \u0627\u06cc\u0646 \u0627\u0642\u0628\u0627\u0644\u060c \u0686\u0647 \u0627\u0632 \u0633\u0648\u06cc \u0686\u0647\u0631\u0647\u200c\u0647\u0627\u06cc \u0645\u0637\u0631\u062d \u0628\u0631\u0627\u06cc \u062b\u0628\u062a \u0646\u0627\u0645 \u0648 \u0686\u0647 \u0627\u0632 \u0633\u0648\u06cc \u0645\u0631\u062f\u0645 \u0628\u0631\u0627\u06cc \u0645\u0634\u0627\u0631\u06a9\u062a \u062f\u0631 \u0627\u06cc\u0646 \u0631\u0648\u06cc\u062f\u0627\u062f\u060c \u0639\u0644\u062a\u200c\u0647\u0627\u06cc \u06af\u0648\u0646\u0627\u06af\u0648\u0646\u06cc \u0645\u06cc\u200c\u062a\u0648\u0627\u0646\u062f \u062f\u0627\u0634\u062a\u0647 \u0628\u0627\u0634\u062f \u06a9\u0647 \u062a\u0648\u062c\u0647 \u0628\u0647 \u0622\u0646\u060c \u0645\u06cc\u200c\u062a\u0648\u0627\u0646\u062f \u0631\u0627\u0647\u06af\u0634\u0627\u06cc \u0627\u0639\u0636\u0627\u06cc \u0631\u06",
"lead":"\u062c\u0627\u0645\u0639\u0647 > \u0634\u0647\u0631\u06cc - \u0645\u06cc\u0632\u06af\u0631\u062f\u06cc \u062f\u0631\u0628\u0627\u0631\u0647 \u0639\u0645\u0644\u06a9\u0631\u062f \u062f\u0648\u0631\u0647\u200c\u0647\u0627\u06cc \u06af\u0630\u0634\u062a\u0647 \u0634\u0648\u0631\u0627\u06cc \u0634\u0647\u0631\u060c \u0622\u0646\u0686\u0647 \u0627\u0639\u0636\u0627\u06cc \u062c\u062f\u06cc\u062f \u0628\u0627\u06cc\u062f \u0645\u062f \u0646\u0638\u0631 \u062f\u0627\u0634\u062a\u0647 \u0628\u0627\u0634\u0646\u062f \u0648 \u0647\u0645\u0686\u0646\u06cc\u0646 \u0645\u0627\u0647\u06cc\u062a \u0633\u06cc\u0627\u0633\u06cc \u0628\u0648\u062f\u0646 \u06cc\u0627 \u0646\u0628\u0648\u062f\u0646 \u0634\u0648\u0631\u0627\u06cc \u0634\u0647\u0631.",
"agency":"13",
"date_created":1494518193,
"url":"http://www.khabaronline.ir/(X(1)S(bud4wg3ebzbxv51mj45iwjtp))/detail/663749/society/urban",
"image":"uploads/2017/05/11/1589793661.jpg",
"category":"15"
},
"_type":"news",
"_id":"2981643"
},
"2": {
...
Based on what I have learnt, I first tried to create a mapping for it in the Dev Tools console of Kibana. I want to be able to query and search this file based on the fields in _source, such as category, id and so on. This is my mapping:
PUT /main-news-test-data
{
  "mappings": {
    "properties": {
      "_score": { "type": "integer" },
      "_index": { "type": "keyword" },
      "_type": { "type": "keyword" },
      "_id": { "type": "keyword" }
    },
    "_source": {
      "properties": {
        "content": { "type": "text" },
        "title": { "type": "text" },
        "lead": { "type": "text" },
        "agency": { "type": "keyword" },
        "date_created": { "type": "date" },
        "url": { "type": "keyword" },
        "image": { "type": "keyword" },
        "category": { "type": "keyword" }
      }
    }
  }
}
HEAD main-news-test-data
GET /main-news-test-data/_search?q=*
but when I run this in Dev Tools I receive this error:
{
"error" : {
"root_cause" : [
{
"type" : "mapper_parsing_exception",
"reason" : "Mapping definition for [_source] has unsupported parameters: [properties : {image={type=keyword}, agency={type=keyword}, date_created={type=date}, title={type=text}, category={type=keyword}, content={type=text}, lead={type=text}, url={type=keyword}}]"
}
],
"type" : "mapper_parsing_exception",
"reason" : "Failed to parse mapping [_doc]: Mapping definition for [_source] has unsupported parameters: [properties : {image={type=keyword}, agency={type=keyword}, date_created={type=date}, title={type=text}, category={type=keyword}, content={type=text}, lead={type=text}, url={type=keyword}}]",
"caused_by" : {
"type" : "mapper_parsing_exception",
"reason" : "Mapping definition for [_source] has unsupported parameters: [properties : {image={type=keyword}, agency={type=keyword}, date_created={type=date}, title={type=text}, category={type=keyword}, content={type=text}, lead={type=text}, url={type=keyword}}]"
}
},
"status" : 400
}
I also tried to index my file into Elasticsearch using this PowerShell command afterwards:
Invoke-RestMethod "http://localhost:9200/main-news-test-data/doc/_bulk?pretty" -Method Post -ContentType 'application/x-ndjson' -InFile "test.json"
but again I get this error from PowerShell:
Invoke-RestMethod : {
"error" : {
"root_cause" : [
{
"type" : "json_e_o_f_exception",
"reason" : "Unexpected end-of-input: expected close marker for Object (start marker at [Source:
(org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 1, column: 1])\n at
[Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 2, column: 1]"
}
],
"type" : "json_e_o_f_exception",
"reason" : "Unexpected end-of-input: expected close marker for Object (start marker at [Source:
(org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 1, column: 1])\n at
[Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 2, column: 1]"
},
"status" : 400
}
So what should I do? How do I import a JSON file into Elasticsearch so that it is queryable by its fields?
From what I read, I can say that your mapping is strange. Just put:
PUT /main-news-test-data
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "title": { "type": "text" },
      "lead": { "type": "text" },
      "agency": { "type": "keyword" },
      "date_created": { "type": "date" },
      "url": { "type": "keyword" },
      "image": { "type": "keyword" },
      "category": { "type": "keyword" }
    }
  }
}
Your JSON is also wrong: the bulk API does not take a single valid JSON document, it takes newline-delimited JSON. A file for the _bulk API will look like this:
{ "index" : { "_index" : "main-news-test-data", "_id" : "1" } }
{ "field1" : "value1" }
{ "index" : { "_index" : "main-news-test-data", "_id" : "2" } }
{ "field1" : "value2" }
Please also note that "_score": 1.0 has no reason to be in your request, and that _type is deprecated (if you use 7.0+, _type can only be _doc and should be ignored).
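For example, the first document from your news file would become two lines in the bulk file (a sketch; the long text values are shortened here, and the original _id is reused as the document id):

{ "index" : { "_index" : "main-news-test-data", "_id" : "2981643" } }
{ "content" : "...", "lead" : "...", "agency" : "13", "date_created" : 1494518193, "url" : "http://www.khabaronline.ir/...", "image" : "uploads/2017/05/11/1589793661.jpg", "category" : "15" }

One more caution: the sample date_created values look like Unix timestamps in seconds, while a plain "type": "date" field expects millisecond epochs or ISO dates by default, so that field may also need "format": "epoch_second" in the mapping.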

Creating Multiple QueueConfigurations in CloudFormation

I'm currently trying to write multiple QueueConfigurations into my CloudFormation template. Each is an SQS queue that is triggered when an object is created to a specified prefix. Here's what I have so far:
{
  "Resources": {
    "S3Bucket": {
      "Type" : "AWS::S3::Bucket",
      "Properties" : {
        "BucketName" : { "Ref" : "paramBucketName" },
        "LoggingConfiguration" : {
          "DestinationBucketName" : "test-bucket",
          "LogFilePrefix" : { "Fn::Join": [ "", [ { "Ref": "paramBucketName" }, "/" ] ] }
        },
        "NotificationConfiguration" : {
          "QueueConfigurations" : [{
            "Id" : "1",
            "Event" : "s3:ObjectCreated:*",
            "Filter" : {
              "S3Key" : {
                "Rules" : {
                  "Name" : "prefix",
                  "Value" : "folder1/"
                }
              }
            },
            "Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-cdc_feeder_prod_hvr_dev"
          }],
          "QueueConfigurations" : [{
            "Id" : "2",
            "Event" : "s3:ObjectCreated:*",
            "Filter" : {
              "S3Key" : {
                "Rules" : {
                  "Name" : "prefix",
                  "Value" : "folder2/"
                }
              }
            },
            "Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-latency_hvr_dev"
          }]
        }
      }
    }
  }
}
I've encountered an error saying "Encountered unsupported property Id". I thought that by defining the Id, I would be able to avoid the "Duplicate object key" error.
Does anyone know how to create multiple triggers in a single CloudFormation template? Thanks for the help in advance.
It should be structured like the example below. There should be only one QueueConfigurations attribute, which contains all the queue configurations within it. Also, Id is not a valid property.
{
  "Resources": {
    "S3Bucket": {
      "Type" : "AWS::S3::Bucket",
      "Properties" : {
        "BucketName" : { "Ref" : "paramBucketName" },
        "LoggingConfiguration" : {
          "DestinationBucketName" : "test-bucket",
          "LogFilePrefix" : { "Fn::Join": [ "", [ { "Ref": "paramBucketName" }, "/" ] ] }
        },
        "NotificationConfiguration" : {
          "QueueConfigurations" : [{
            "Event" : "s3:ObjectCreated:*",
            "Filter" : {
              "S3Key" : {
                "Rules" : {
                  "Name" : "prefix",
                  "Value" : "folder1/"
                }
              }
            },
            "Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-cdc_feeder_prod_hvr_dev"
          },
          {
            "Event" : "s3:ObjectCreated:*",
            "Filter" : {
              "S3Key" : {
                "Rules" : {
                  "Name" : "prefix",
                  "Value" : "folder2/"
                }
              }
            },
            "Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-latency_hvr_dev"
          }]
        }
      }
    }
  }
}
There is more information about QueueConfiguration in the documentation.

The `type` is not updating when creating or updating index

I'm new to Kibana and Elasticsearch. I have a task to migrate data from our production site to staging. Currently, I have written some simple code for creating an index.
I have successfully created the index, but upon comparing it with the production site, the type declared as date became text on my new site. We have noticed that all types are being converted to text, and we are not sure whether this is because we are using a new version of Kibana.
Here it is on the production site...
"authorizationDate": {
"type": "date",
"ignore_malformed": true,
"format": "yyyy/MM/dd||yyyy-MM-dd"
},
This is how I implemented it on the staging site...
POST /orders/_doc/1
{
  "order": {
    "properties": {
      "authorization": {
        "authorizationDate": {
          "type": "date",
          "ignore_malformed": true,
          "format": "yyyy/MM/dd||yyyy-MM-dd"
        }
      }
    }
  }
}
Upon checking...
GET orders?pretty
The output of the orders mapping...
"mappings" : {
"properties" : {
"order" : {
"properties" : {
"properties" : {
"properties" : {
"authorization" : {
"properties" : {
"authorizationDate" : {
"properties" : {
"format" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"ignore_malformed" : {
"type" : "boolean"
},
"type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
}
}
}
},
The type became text instead of date, and the date format is not recorded.
Thanks in advance.
POST /orders/_doc/1 -- this will create a new index named orders with a default, dynamically inferred mapping.
When you run the above, Elasticsearch treats "order", "properties" and "ignore_malformed" as document fields, not as parts of a mapping definition; hence you can see the multiple nested properties objects in the output mapping below.
"properties" : {
"properties" : {
"properties" : {
To define the mapping explicitly, you should first run:
PUT orders ---> creates a new index named orders
{
  "mappings": {
    "properties": {
      "authorization": {
        "type": "object",   ---> should be object/nested; this was not present in your question
        "properties": {
          "authorizationDate": {
            "type": "date",
            "ignore_malformed": true,
            "format": "yyyy/MM/dd||yyyy-MM-dd"
          }
        }
      }
    }
  }
}
Then adding a new doc using
POST /orders/_doc/1
{
"authorization":{
"authorizationDate":"2019-01-01"
}
}
will give the data below:
[
{
"_index" : "orders12",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"authorization" : {
"authorizationDate" : "2019-01-01"
}
}
}
]
Link for the Mapping documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
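As a quick check, you can fetch the mapping back after the explicit PUT (assuming the index name orders used above):

GET orders/_mapping

and the relevant fragment should now come back as a real date field rather than nested text properties:

"authorizationDate" : {
  "type" : "date",
  "ignore_malformed" : true,
  "format" : "yyyy/MM/dd||yyyy-MM-dd"
}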

Conversion from SQL to Elasticsearch query

I want to convert the following SQL query to an Elasticsearch JSON query:
select count(distinct(fk_id)),city_id from table
where status1 != "xyz" and status2 = "abc" and
cr_date >="date1" and cr_date<="date2" group by city_id
Also, is there any way of writing nested queries (like SQL subqueries) in Elasticsearch?
select * from table where status in (select status from table2)
The first query can be translated like this in the Elasticsearch query DSL:
curl -XPOST localhost:9200/table/_search -d '{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "status2": "abc"
              }
            },
            {
              "range": {
                "cr_date": {
                  "gt": "date1",   <--- don't forget to change the date
                  "lt": "date2"    <--- don't forget to change the date
                }
              }
            }
          ],
          "must_not": [
            {
              "term": {
                "status1": "xyz"
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "by_cities": {
      "terms": {
        "field": "city_id"
      },
      "aggs": {
        "fk_count": {
          "cardinality": {
            "field": "fk_id"
          }
        }
      }
    }
  }
}'
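Note that the filtered query used above was removed in Elasticsearch 5.0. On newer versions, roughly the same request can be written with a bool query and a filter clause (a sketch; date1/date2 are still placeholders to replace with real dates):

curl -XPOST localhost:9200/table/_search -H 'Content-Type: application/json' -d '{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "status2": "abc" } },
        { "range": { "cr_date": { "gte": "date1", "lte": "date2" } } }
      ],
      "must_not": [
        { "term": { "status1": "xyz" } }
      ]
    }
  },
  "aggs": {
    "by_cities": {
      "terms": { "field": "city_id" },
      "aggs": {
        "fk_count": { "cardinality": { "field": "fk_id" } }
      }
    }
  }
}'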
Using the SQL API in Elasticsearch, we can write SQL queries and also translate them into Elasticsearch queries:
POST /_sql/translate
{
"query": "SELECT * FROM customer where address.Street='JanaChaitanya Layout' and Name='Pavan Kumar'"
}
The response for this is:
{
  "size" : 1000,
  "query" : {
    "bool" : {
      "must" : [
        {
          "term" : {
            "address.Street.keyword" : {
              "value" : "JanaChaitanya Layout",
              "boost" : 1.0
            }
          }
        },
        {
          "term" : {
            "Name.keyword" : {
              "value" : "Pavan Kumar",
              "boost" : 1.0
            }
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "_source" : {
    "includes" : [
      "Name",
      "address.Area",
      "address.Street"
    ],
    "excludes" : [ ]
  },
  "docvalue_fields" : [
    {
      "field" : "Age"
    }
  ],
  "sort" : [
    {
      "_doc" : {
        "order" : "asc"
      }
    }
  ]
}
Now we can use this result to query Elasticsearch.
For further details, please go through this article:
https://xyzcoder.github.io/elasticsearch/2019/06/25/making-use-of-sql-rest-api-in-elastic-search-to-write-queries-easily.html
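If the goal is just to run the query rather than inspect its translation, the same SQL can also be executed directly against the _sql endpoint (a sketch; format=txt returns a simple tabular result):

POST /_sql?format=txt
{
  "query": "SELECT * FROM customer WHERE address.Street='JanaChaitanya Layout' AND Name='Pavan Kumar'"
}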

Sub-records in Avro with Morphlines

I'm trying to convert JSON into Avro using the kite-sdk morphline module. After playing around I'm able to convert the JSON into Avro using a simple schema (no complex data types).
Then I took it one step further and modified the Avro schema as displayed below (subrec.avsc). As you can see, the schema consists of a subrecord.
As soon as I tried to convert the JSON to Avro using the morphlines.conf and the subrec.avsc, it failed.
Somehow the JSON paths "/record_type[]/alert/action" are not translated by the toAvro function.
The morphlines.conf
morphlines : [
{
id : morphline1
importCommands : ["org.kitesdk.**"]
commands : [
# Read the JSON blob
{ readJson: {} }
{ logError { format : "record: {}", args : ["#{}"] } }
# Extract JSON
{ extractJsonPaths { flatten: false, paths: {
"/record_type[]/alert/action" : /alert/action,
"/record_type[]/alert/signature_id" : /alert/signature_id,
"/record_type[]/alert/signature" : /alert/signature,
"/record_type[]/alert/category" : /alert/category,
"/record_type[]/alert/severity" : /alert/severity
} } }
{ logError { format : "EXTRACTED THIS : {}", args : ["#{}"] } }
{ extractJsonPaths { flatten: false, paths: {
timestamp : /timestamp,
event_type : /event_type,
source_ip : /src_ip,
source_port : /src_port,
destination_ip : /dest_ip,
destination_port : /dest_port,
protocol : /proto,
} } }
# Create Avro according to schema
{ logError { format : "WE GO TO AVRO"} }
{ toAvro { schemaFile : /etc/flume/conf/conf.empty/subrec.avsc } }
# Create Avro container
{ logError { format : "WE GO TO BINARY"} }
{ writeAvroToByteArray { format: containerlessBinary } }
{ logError { format : "DONE!!!"} }
]
}
]
And the subrec.avsc
{
"type" : "record",
"name" : "Event",
"fields" : [ {
"name" : "timestamp",
"type" : "string"
}, {
"name" : "event_type",
"type" : "string"
}, {
"name" : "source_ip",
"type" : "string"
}, {
"name" : "source_port",
"type" : "int"
}, {
"name" : "destination_ip",
"type" : "string"
}, {
"name" : "destination_port",
"type" : "int"
}, {
"name" : "protocol",
"type" : "string"
}, {
"name": "record_type",
"type" : ["null", {
"name" : "alert",
"type" : "record",
"fields" : [ {
"name" : "action",
"type" : "string"
}, {
"name" : "signature_id",
"type" : "int"
}, {
"name" : "signature",
"type" : "string"
}, {
"name" : "category",
"type" : "string"
}, {
"name" : "severity",
"type" : "int"
}
] } ]
} ]
}
On { logError { format : "EXTRACTED THIS : {}", args : ["#{}"] } } I get the following output:
[{
/record_type[]/alert/action = [allowed],
/record_type[]/alert/category = [],
/record_type[]/alert/severity = [3],
/record_type[]/alert/signature = [GeoIP from NL, Netherlands],
/record_type[]/alert/signature_id = [88006],
_attachment_body = [{
"timestamp": "2015-03-23T07:42:01.303046",
"event_type": "alert",
"src_ip": "1.1.1.1",
"src_port": 18192,
"dest_ip": "46.231.41.166",
"dest_port": 62004,
"proto": "TCP",
"alert": {
"action": "allowed",
"gid": "1",
"signature_id": "88006",
"rev": "1",
"signature" : "GeoIP from NL, Netherlands ",
"category" : ""
"severity" : "3"
}
}],
_attachment_mimetype=[json/java + memory],
basename = [simple_eve.json]
}]
UPDATE 2017-06-22
You MUST populate the data in the structure in order for this to work, by using addValues or setValues:
{
addValues {
micDefaultHeader : [
{
eventTimestampString : "2017-06-22 18:18:36"
}
]
}
}
After debugging the sources of the morphline toAvro command, it appears that the record is the first object to be evaluated, no matter what you put in your mappings structure.
The solution is quite simple, but unfortunately finding it took a little extra time: Eclipse, running the Flume agent in debug mode, cloning the source code, and lots of coffee.
Here it goes.
My schema:
{
"type" : "record",
"name" : "co_lowbalance_event",
"namespace" : "co.tigo.billing.cboss.lowBalance",
"fields" : [ {
"name" : "dummyValue",
"type" : "string",
"default" : "dummy"
}, {
"name" : "micDefaultHeader",
"type" : {
"type" : "record",
"name" : "mic_default_header_v_1_0",
"namespace" : "com.millicom.schemas.root.struct",
"doc" : "standard millicom header definition",
"fields" : [ {
"name" : "eventTimestampString",
"type" : "string",
"default" : "12345678910"
} ]
}
} ]
}
My morphlines file:
morphlines : [
{
id : convertJsonToAvro
importCommands : ["org.kitesdk.**"]
commands : [
{
readJson {
outputClass : java.util.Map
}
}
{
addValues {
micDefaultHeader : [{}]
}
}
{
logDebug { format : "my record: {}", args : ["#{}"] }
}
{
toAvro {
schemaFile : /home/asarubbi/Development/test/co_lowbalance_event.avsc
mappings : {
"micDefaultHeader" : micDefaultHeader
"micDefaultHeader/eventTimestampString" : eventTimestampString
}
}
}
{
writeAvroToByteArray {
format : containerlessJSON
codec : null
}
}
]
}
]
The magic lies here:
{
addValues {
micDefaultHeader : [{}]
}
}
and in the mappings:
mappings : {
"micDefaultHeader" : micDefaultHeader
"micDefaultHeader/eventTimestampString" : eventTimestampString
}
Explanation:
Inside the code, the first field name that is evaluated is micDefaultHeader, of type RECORD. As there is no way to specify a default value for a RECORD (which is logically correct), the toAvro code evaluates it, does not find any value configured in mappings, and therefore fails, as it (wrongly) detects that the record is empty when it shouldn't be.
However, taking a look at the code, you can see that it requires a Map object containing no values in order to satisfy the parser and continue to the next element.
So we add a map object using addValues and fill it with an empty map ([{}]). Note that this must match the name of the record that is producing the empty value; in my case, "micDefaultHeader".
Feel free to comment if you have a better solution, as this looks like a "dirty fix".