Upload CSV data to Elasticsearch without Logstash

I uploaded CSV data into Elasticsearch using the machine-learning approach described here.
This created an index and a pipeline with a csv processor. The import was successful.
What is the corresponding curl command line to upload CSV data into Elasticsearch, assuming the index is called iislog and the pipeline iislog-pipeline?

The csv ingest processor will only work on a JSON document that contains a field with CSV data. You cannot throw raw CSV data at it using curl.
The CSV to JSON transformation happens in Kibana (when you drop the raw CSV file in the browser window), and only then does Kibana send the JSON-ified CSV to Elasticsearch.
If your CSV looks like this:
column1,column2,column3
1,2,3
4,5,6
7,8,9
Kibana will transform each data line (skipping the header row) into its own document:
{"message": "1,2,3"}
{"message": "4,5,6"}
{"message": "7,8,9"}
And then Kibana will send each of those raw CSV/JSON documents to your iislog index through the iislog-pipeline ingest pipeline. The pipeline looks like this:
{
  "description" : "Ingest pipeline created by file structure finder",
  "processors" : [
    {
      "csv" : {
        "field" : "message",
        "target_fields" : [
          "column1",
          "column2",
          "column3"
        ],
        "ignore_missing" : false
      }
    },
    {
      "remove" : {
        "field" : "message"
      }
    }
  ]
}
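You can inspect the pipeline the importer created yourself, assuming Elasticsearch is listening on localhost:9200:
curl -XGET 'localhost:9200/_ingest/pipeline/iislog-pipeline?pretty'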
In the end, the documents will look like this in your index:
{"column1": 1, "column2": 2, "column3": 3}
{"column1": 4, "column2": 5, "column3": 6}
{"column1": 7, "column2": 8, "column3": 9}
That's the way it works. So if you want to use curl, you need to do Kibana's pre-parsing job yourself: wrap each CSV row in a message field and send it through the ingest pipeline, which will map the columns for you (sending the already-parsed documents without the pipeline works too, but then the pipeline is pointless):
curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/iislog/_doc?pipeline=iislog-pipeline' -d '{"message": "1,2,3"}'
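If you have many rows, a rough sketch of the same idea with the Bulk API (assuming Elasticsearch on localhost:9200 and a file iislog.csv whose first line is the header; the quoting here is deliberately simplistic and does not escape embedded double quotes):

# Wrap every data row in a {"message": "..."} document and build a bulk body.
tail -n +2 iislog.csv | while read -r line; do
  printf '{"index":{}}\n{"message":"%s"}\n' "$line"
done > bulk.ndjson

# Send everything through the ingest pipeline in one request.
curl -H 'Content-Type: application/x-ndjson' \
     -XPOST 'localhost:9200/iislog/_bulk?pipeline=iislog-pipeline' \
     --data-binary @bulk.ndjson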

There is another approach to inserting CSV into Elasticsearch using an ingest pipeline, described here: https://www.elastic.co/de/blog/indexing-csv-elasticsearch-ingest-node
In essence, it wraps each line in a JSON document and grok-parses it so that the CSV columns end up mapped to specific document fields.
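The rough shape of such a pipeline, adapted to the three-column example above, might be the following (a sketch, not the exact pipeline from the blog post; the pipeline name is made up):

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/_ingest/pipeline/csv-grok-pipeline' -d '
{
  "description": "Map CSV columns to fields with grok (sketch)",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{NUMBER:column1},%{NUMBER:column2},%{NUMBER:column3}"]
      }
    },
    { "remove": { "field": "message" } }
  ]
}'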

Related

Nifi - edit json flow file content

I am using NiFi 1.6.0. I am trying to copy an S3 file into Redshift. The JSON file on S3 looks like this:
[
  {
    "a": 1,
    "b": 2
  },
  {
    "a": 3,
    "b": 4
  }
]
However, this gives an error (Invalid JSONPath format: Member is not an object.) because of the '[' and ']' in the file (https://stackoverflow.com/a/45348425).
I need to convert the JSON from the above format to a format that looks like this:
{
  "a": 1,
  "b": 2
}
{
  "a": 3,
  "b": 4
}
So basically what I am trying to do is remove the '[' and ']' and replace '},\n' with '}\n'.
The file has over 14 million rows (113 MB in size).
How can I achieve this using NiFi?
You can use ReplaceText. Check this question; it is very similar to your problem. First replace the brackets with an empty string, then strip the trailing commas using the LiteralReplace strategy.
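Outside of NiFi, just to illustrate the reshaping those two ReplaceText steps perform (assuming the pretty-printed layout shown above, with '[', ']' and '},' each on their own line), the same transformation can be sketched on the command line:

# Drop the lines holding the enclosing brackets and strip the trailing comma after each object.
sed -e '/^\[$/d' -e '/^\]$/d' -e 's/},$/}/' input.json > flattened.json

# Or, independent of the exact formatting: emit one compact JSON object per line with jq.
jq -c '.[]' input.json > flattened.json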

Import json to MongoDB that is not supported

I have a JSON file that I'm trying to import into MongoDB.
Compass says it is invalid JSON, so I used the mongoimport command;
it did import it, but everything ended up in one row.
How can I make it so the following JSON format is imported while using the id as the main id instead of the auto-generated ObjectId?
{
  "id": {
    "value1": "value"
  },
  "id": {
    "value1": "value",
    "value2": "value"
  }
}
("id" is a string with the value of the actual id so the json doesnt actually have "id" there)
I guess one way to solve this is to fully reformat my json file to the correct format but I have a lot of records in the json and would like to keep it this way.
edit:
I have reformatted my jsons and everything works.
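For reference, one way to do that reformatting without editing the file by hand is to emit one document per top-level key (using the key as _id) and pipe the result into mongoimport; a rough sketch, where the database and collection names are placeholders:

# Turn {"<id>": {...}, "<id>": {...}} into one document per line with the key as _id, then import.
jq -c 'to_entries[] | {_id: .key} + .value' input.json \
  | mongoimport --db mydb --collection mycoll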

JSON returned by Solr

I'm using Solr in order to index my data.
Through Solr's UI I added, in the Schema window, two fields: word and messageid.
Then I made the following POST request:
curl -X POST -H "Content-Type: application/json" 'http://localhost:8983/solr/messenger/update/json/docs' --data-binary '{"word":"hello","messageid":"23523"}'
I received the following JSON:
{
  "responseHeader": {
    "status": 0,
    "QTime": 55
  }
}
When I go to the Query window in the admin UI and execute a query without parameters, I get the following JSON:
{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "q": "*:*",
      "indent": "on",
      "wt": "json",
      "_": "1488911768817"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "92db6722-d10d-447a-b5b1-13ad9b70b3e2",
        "_src_": "{\"word\":\"hello\",\"messageid\":\"23523\"}",
        "_version_": 1561232739042066432
      }
    ]
  }
}
Shouldn't my JSON appear more like the following?
//More Code
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": "92db6722-d10d-447a-b5b1-13ad9b70b3e2",
"word": "hello",
"messageid": "23523",
"_version_": 1561232739042066432
}
//More Code
So that later on I can filter on those fields using query parameters?
It turns out you were using the so-called 'custom JSON indexing' approach, which is described here. You can tweak it as described in the wiki in order to extract the desired fields. Here is an excerpt for your reference:
split: Defines the path at which to split the input JSON into multiple Solr documents and is required if you have multiple documents in a single JSON file. If the entire JSON makes a single Solr document, the path must be “/”. It is possible to pass multiple split paths by separating them with a pipe (|), for example: split=/|/foo|/foo/bar. If one path is a child of another, they automatically become a child document.
f: This is a multivalued mapping parameter. The format of the parameter is target-field-name:json-path. The json-path is required. The target-field-name is the Solr document field name, and is optional. If not specified, it is automatically derived from the input JSON. The default target field name is the fully qualified name of the field. Wildcards can be used here, see the section Wildcards below for more information.
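Applied to this question, mapping the two fields with that endpoint could look roughly like this (a sketch based on the excerpt above; the json-paths are assumed to match the flat document):

curl 'http://localhost:8983/solr/messenger/update/json/docs?split=/&f=word:/word&f=messageid:/messageid&commit=true' \
     -H 'Content-Type: application/json' \
     --data-binary '{"word":"hello","messageid":"23523"}'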
But I would recommend using the standard approach to indexing documents, which is the good old update command from here. It would look more like this:
curl 'http://localhost:8983/solr/messenger/update?commit=true' -H 'Content-Type: application/json' --data-binary '[{"word":"hello","messageid":"23523"}]'
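Once the documents are indexed with their own fields, filtering on them works the way you wanted, for example (a sketch, assuming the same core):

curl 'http://localhost:8983/solr/messenger/select?q=word:hello&wt=json'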

Insert JSON file into Elasticsearch + Kibana

I am currently working on an anomaly detection project based on Elasticsearch and Kibana. Recently I converted a CSV file to JSON and tried to import this data into Elasticsearch via Postman using the Bulk API. Unfortunately, all of the queries failed.
Then I found this topic: Import/Index a JSON file into Elasticsearch
and tried the following approach:
curl -XPOST 'http://localhost:9200/yahoodata/a4benchmark/4' --data-binary @Anomaly1.json
The answer I got:
{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed
to parse"}],"type":"mapper_parsing_exception","reason":"failed to
parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor
detection can only be called on some xcontent bytes or compressed
xcontent bytes"}},"status":400}
The data I am trying to insert has the following structure (Anomaly1.json):
[
  {
    "timestamps": 11,
    "value": 1,
    "anomaly": 1
  },
  {
    "timestamps": 1112,
    "value": 211,
    "anomaly": 0
  },
  {
    "timestamps": 2,
    "value": 1,
    "anomaly": 0
  }
]
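For reference, the Bulk API expects newline-delimited action/document pairs rather than a plain JSON array, so a request in that format would look roughly like this (same index and data as above, Elasticsearch assumed on localhost:9200):

# Each document is preceded by an action line; the body must end with a newline.
curl -XPOST 'http://localhost:9200/yahoodata/a4benchmark/_bulk' \
     -H 'Content-Type: application/x-ndjson' \
     --data-binary $'{"index":{}}\n{"timestamps": 11, "value": 1, "anomaly": 1}\n{"index":{}}\n{"timestamps": 1112, "value": 211, "anomaly": 0}\n{"index":{}}\n{"timestamps": 2, "value": 1, "anomaly": 0}\n'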

How should I filter multiple fields with the same name in logstash?

I'm putting tsung logs into ElasticSearch (ES) so that I can filter, visualize and compare results using Kibana.
I'm using logstash and its JSON parsing filter to push tsung logs in JSON format to ES.
Tsung logs are a bit complicated (IMO), with arrays nested inside arrays, multi-line events, and several fields sharing the same name, such as "value" in my example below.
I would like to transform this event:
{
  "stats": [
    {
      "timestamp": 1317413861,
      "samples": [
        { "name": "users", "value": 0, "max": 1 },
        { "name": "users_count", "value": 1, "total": 1 },
        { "name": "finish_users_count", "value": 1, "total": 1 }
      ]
    }
  ]
}
into this:
{"timestamp": 1317413861},{"users_value":0},{"users_max":1},{"users_count_value":1},{"users_count_total":1},{"finish_users_count_value":1},{"finish_users_count_total":1}
Since the entire tsung log file is forwarded to logstash at the end of a performance test campaign, I'm thinking about using a regex to remove carriage returns and the unneeded stats and samples array wrappers before sending the events to logstash, in order to simplify things a little bit.
And then I would use these kinds of JSON filter options:
add_field => {"%{name}_value" => "%{value}"}
add_field => {"%{name}_max" => "%{max}"}
add_field => {"%{name}_total" => "%{total}"}
But how should I handle the fact that there are, for instance, several value fields in one event? What is the best thing to do?
Thanks for your help.
Feels like the ruby{} filter would be needed here. Loop across the entries in the 'samples' field, and construct your own fields based on the name/value/total/max.
There are examples of this type of behavior elsewhere on SO.