Importing and updating data in Elasticsearch (CSV)

We have an existing search function that involves data across multiple tables in SQL Server. This causes a heavy load on our DB, so I'm trying to find a better way to search through this data (it doesn't change very often). I have been working with Logstash and Elasticsearch for about a week using an import containing 1.2 million records. My question is essentially, "how do I update existing documents using my 'primary key'"?
CSV data file (pipe delimited) looks like this:
369|90045|123 ABC ST|LOS ANGELES|CA
368|90045|PVKA0010|LA|CA
367|90012|20000 Venice Boulvd|Los Angeles|CA
365|90045|ABC ST 123|LOS ANGELES|CA
363|90045|ADHOCTESTPROPERTY|DALES|CA
My logstash config looks like this:
input {
  stdin {
    type => "stdin-type"
  }

  file {
    path => ["C:/Data/sample/*"]
    start_position => "beginning"
  }
}

filter {
  csv {
    columns => ["property_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}

output {
  elasticsearch {
    embedded => true
    index => "samples4"
    index_type => "sample"
  }
}
A document in Elasticsearch then looks like this:
{
  "_index": "samples4",
  "_type": "sample",
  "_id": "64Dc0_1eQ3uSln_k-4X26A",
  "_score": 1.4054651,
  "_source": {
    "message": [
      "369|90045|123 ABC ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-11T22:58:38.365Z",
    "host": "[host]",
    "path": "C:/Data/sample/sample.csv",
    "property_id": "369",
    "postal_code": "90045",
    "address_1": "123 ABC ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
I would like the unique ID in the _id field to be replaced with the value of property_id. The idea is that subsequent data files would contain updates. I don't need to keep previous versions, and there wouldn't be a case where we added or removed keys from a document.
The document_id setting for the elasticsearch output doesn't put that field's value into _id (it just put in the literal string "property_id", so only one document was stored/updated). I know I'm missing something here. Am I just taking the wrong approach?
EDIT: WORKING!
Using @rutter's suggestion, I've updated the output config to this:
output {
  elasticsearch {
    embedded => true
    index => "samples6"
    index_type => "sample"
    document_id => "%{property_id}"
  }
}
Now documents are updated as expected when new files are dropped into the data folder, and _id and property_id hold the same value:
{
  "_index": "samples6",
  "_type": "sample",
  "_id": "351",
  "_score": 1,
  "_source": {
    "message": [
      "351|90045|Easy as 123 ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-12T16:12:52.102Z",
    "host": "TXDFWL3474",
    "path": "C:/Data/sample/sample_update_3.csv",
    "property_id": "351",
    "postal_code": "90045",
    "address_1": "Easy as 123 ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
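As a quick sanity check, you can fetch a document directly by its new _id. A minimal sketch, assuming Elasticsearch is reachable on localhost:9200 and Python with the requests library is available:

import requests

# Fetch the document whose _id is the property_id value 351
r = requests.get('http://localhost:9200/samples6/sample/351')
print(r.json()['_source']['address_1'])  # -> "Easy as 123 ST"

Re-running an import with the same property_id values should overwrite these documents in place rather than creating new ones.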

Converting from comment:
You can overwrite a document by sending another document with the same ID... but that might be tricky with your previous data, since you'll get randomized IDs by default.
You can set an ID using the output plugin's document_id field, but it takes a literal string, not a field name. To use a field's contents, you could use an sprintf format string, such as %{property_id}.
Something like this, for example:
output {
  elasticsearch {
    ... other settings ...
    document_id => "%{property_id}"
  }
}

Disclaimer: I'm the author of ESL.
You can use elasticsearch_loader to load pipe-separated (PSV) files into Elasticsearch.
To set the _id field, use --id-field=property_id. For instance:
elasticsearch_loader --index=myindex --type=mytype --id-field=property_id csv --delimiter='|' filename.csv

Have you tried changing the config to this:
filter {
  csv {
    columns => ["_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}
By naming the property_id column _id, it should get used during indexing.

Related

Talend - Need to extract data from JSON (JSON array) and load it to Oracle DB

I have a Talend job that receives JSON (format below) from a route. I need to extract data from the JSON and load it into an Oracle DB table.
JSON format:
{
  "data": [
    {
      "name": "FRSC-01",
      "recordnum": "01",
      "Expense1": "100",
      "Expense2": "7265",
      "Expense3": "9000"
    },
    {
      "name": "FRSC-02",
      "recordnum": "",
      "Expense1": "200",
      "Expense2": "6000",
      "Expense3": "9000"
    },
    {
      "name": "FRSC-03",
      "recordnum": "03",
      "Expense1": "200",
      "Expense2": "7000",
      "Expense3": "8000"
    }
  ]
}
You can use the tExtractJsonFields component to extract data from your JSON.
Define a schema with the columns you want from the JSON (name, recordnum, Expense1, Expense2, Expense3), set the loop JSONPath query to "$.data[*]", and then set each column's JSONPath expression like so:
name => "name"
recordnum => "recordnum"
...
And then just use a tMap to map the columns to your target table in the tOracleOutput component.
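For reference, the loop path and the per-column expressions behave like this minimal Python sketch (a hand-rolled illustration of what tExtractJsonFields does, not Talend code; the file name payload.json is an assumption):

import json

# Loop JSONPath "$.data[*]": iterate over each element of the data array,
# then read each schema column relative to the current element.
with open('payload.json') as f:
    doc = json.load(f)

rows = []
for item in doc['data']:
    rows.append({
        'name': item['name'],            # column expression: "name"
        'recordnum': item['recordnum'],  # column expression: "recordnum"
        'Expense1': item['Expense1'],
        'Expense2': item['Expense2'],
        'Expense3': item['Expense3'],
    })

print(rows)  # one row per array element, ready for tMap/tOracleOutput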

Pentaho Kettle: How to dynamically fetch JSON file columns

Background: I work for a company that sells passes. Every order placed by a customer contains N passes.
Issue: I have JSON event-transaction files arriving in an S3 bucket on a daily basis from DocumentDB (MongoDB). Each JSON file is associated with the relevant type of event (insert, modify, or delete) for every document key (an order, in my case). The example below illustrates an "insert" event that came through to the S3 bucket:
{
  "_id": {
    "_data": "11111111111111"
  },
  "operationType": "insert",
  "clusterTime": {
    "$timestamp": {
      "t": 11111111,
      "i": 1
    }
  },
  "ns": {
    "db": "abc",
    "coll": "abc"
  },
  "documentKey": {
    "_id": {
      "$uuid": "abcabcabcabcabcabc"
    }
  },
  "fullDocument": {
    "_id": {
      "$uuid": "abcabcabcabcabcabc"
    },
    "orderNumber": "1234567",
    "externalOrderId": "12345678",
    "orderDateTime": "2020-09-11T08:06:26Z[UTC]",
    "attraction": "abc",
    "entryDate": {
      "$date": "2020-09-13"
    },
    "entryTime": {
      "$date": "04000000"
    },
    "requestId": "abc",
    "ticketUrl": "abc",
    "tickets": [
      {
        "passId": "1111111",
        "externalTicketId": "1234567"
      },
      {
        "passId": "222222222",
        "externalTicketId": "122442492"
      }
    ],
    "_class": "abc"
  }
}
As we see above, every JSON file might contain N passes, and every pass is, in turn, associated with an external ticket id, which is a different column. I want to use Pentaho Kettle to read these JSON files and load the data into the DW.

I am aware of the JSON Input step and the Row Normalizer step, which could transpose the "PassID 1", "PassID 2", "PassID 3"... "PassID N" columns into one "Pass" column, and I would have to apply similar logic to the "External ticket id" column. The problem with that approach is that it is quite static: I need to tell Pentaho in advance, in the JSON Input step, how many passes are coming. But what if tomorrow I have an order with 10 different passes? How can I do this dynamically to ensure the job will not break?
If you want a tabular output like

TicketUrl   Pass            ExternalTicketID
----------  --------------  ----------------
abc         PassID1Value1   ExTicketIDvalue1
abc         PassID1Value2   ExTicketIDvalue2
abc         PassID1Value3   ExTicketIDvalue3

and want to make the incoming values dynamic based on the JSON input file, you can download this transformation: Updated Link
I found everything works dynamically in the JSON Input step.
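Conceptually, the dynamic behaviour you want is just a loop over the tickets array itself rather than a fixed set of PassID columns. A minimal Python sketch of the flattening (not a Kettle transformation; the file name event.json is an assumption):

import json

with open('event.json') as f:
    event = json.load(f)

full = event['fullDocument']

# One output row per ticket, however many tickets the order contains
for ticket in full['tickets']:
    print(full['ticketUrl'], ticket['passId'], ticket['externalTicketId'])

In Kettle, pointing the JSON Input step's field paths at $.fullDocument.tickets[*].passId and $.fullDocument.tickets[*].externalTicketId should give the same row-per-ticket expansion without knowing the count in advance.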

Is it possible to combine two Json with Ruby in MongoDB?

I have to insert documents into my Mongo database using plain Ruby (not Rails; I'm editing in Notepad++), and many documents are duplicates with some modifications.
I want to write a script that reads a JSON file and imports it into MongoDB, checking that each document has no duplicate in the database; if there is a duplicate that contains any additional information, I want to merge it in. For example:
Document 1
{
  "Name": "Lila",
  "Files": [
    { "Name": "File1", "Date": "05-11-2017" },
    { "Name": "File2", "Date": "26-03-2018" }
  ]
}
Document 2
{
  "Name": "Lila",
  "Files": [
    { "Name": "File3", "Date": "26-03-2018" }
  ]
}
Combine them to have:
{
  "Name": "Lila",
  "Files": [
    { "Name": "File1", "Date": "05-11-2017" },
    { "Name": "File2", "Date": "26-03-2018" },
    { "Name": "File3", "Date": "26-03-2018" }
  ]
}
I found that this is possible in the mongo shell thanks to the $mergeObjects aggregation accumulator, but it does not seem to exist in Ruby.
You can use all the operators in Ruby, too. You need to get the underlying collection object first.
require 'mongo'

# Legacy (1.x) Ruby driver API, matching the documentation linked below
db = Mongo::Connection.new.db("mydb")
coll = db.collection('posts')

coll.aggregate([
  {"$project" => {"last_name" => 1, "first_name" => 1}},
  {"$match"   => {"last_name" => "Jones"}}
])
This is an example pipeline; you can pass the same aggregation pipeline that worked for you in the mongo shell to aggregate.
For more information, refer to the MongoDB Ruby driver documentation:
http://www.rubydoc.info/gems/mongo/1.8.2/Mongo%2FCollection%3Aaggregate

Search in an Elasticsearch index created using a JSON file

I pushed a JSON file (as shown below) to ES using the following code:
import requests

with open('test.json', 'rb') as payload:
    headers = {'content-type': 'application/json'}
    r = requests.post('http://localhost:9200/test_nest_json/1',
                      data=payload, verify=False, headers=headers)
{
  "data": [
    {
      "keyword": "abc",
      "lists": [
        {
          "item_val": "some_val"
        }
      ],
      "another_key": "some_key"
    },
    {
      "keyword": "xyz",
      "lists": [
        {
          "item_val": "another_val"
        }
      ],
      "another_key": "pqr"
    }
  ]
}
I tried updating the mappings and used a term query, but it still returns all the documents. I am not able to query for only one keyword, such as "data.keyword" = "abc", using a term query.
Looks like you are having a problem with nested objects:
https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html
The reason for this cross-object matching, as discussed in Arrays of Inner Objects, is that our beautifully structured JSON document is flattened into a simple key-value format in the index.
So the effective document stored looks like this:
{
  "data.keyword": [ "abc", "xyz" ],
  "data.another_key": [ "some_key", "pqr" ]
}
Which means the query you posted will match any document, as long as at least one of the inner objects contains the queried keyword. I recommend reading the link above for clarification.
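If you need matches to stay within a single inner object, the usual fix is to map the array as nested and search it with a nested query. A hedged sketch using Python and requests (the index name nested_demo is an assumption, and the typeless mapping syntax shown is for Elasticsearch 7+):

import requests

# Map "data" as nested so each array element is indexed as its own hidden document
mapping = {
    'mappings': {
        'properties': {
            'data': {'type': 'nested'}
        }
    }
}
requests.put('http://localhost:9200/nested_demo', json=mapping)

# A nested query then matches within one inner object at a time
query = {
    'query': {
        'nested': {
            'path': 'data',
            'query': {'term': {'data.keyword': 'abc'}}
        }
    }
}
r = requests.post('http://localhost:9200/nested_demo/_search', json=query)
print(r.json()['hits'])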
This is what worked for me:

import json
import uuid

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.refresh(index="test-index")
with open('abc.json', 'rb') as payload:
    json_data = json.load(payload)

for item in json_data["data"]:
    res = es.index(index="sample-index", doc_type='pdf',
                   id=str(uuid.uuid4()), body=item)
I am parsing the JSON, extracting the array items one by one, and pushing each to Elasticsearch.
{
  "keyword": "abc",
  "lists": [
    {
      "item_val": "some_val"
    }
  ],
  "another_key": "some_key"
}
Still looking for an optimised solution.
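One common optimisation is to replace the per-document loop with a single bulk request. A sketch assuming the official elasticsearch Python client (the index name sample-index is carried over from above):

import json
import uuid

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

with open('abc.json') as f:
    items = json.load(f)['data']

# One bulk call instead of one HTTP round trip per array item
actions = (
    {'_index': 'sample-index', '_id': str(uuid.uuid4()), '_source': item}
    for item in items
)
helpers.bulk(es, actions)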

Obtain a different JSON object structure in AngularJS

I'm working with AngularJS.
In this part of the project, my goal is to obtain a JSON structure after filling a form with some particular values.
Here's the fiddle of my simple form: Fiddle
With the form I will query KairosDB, my NoSQL database, using a JSON object. The form is structured in this way:
a Name
a certain Number of Tags, with Tag Id ("ch" for example) and tag value ("932" for example)
a certain Number of Aggregators to manipulate data coming from DB
Start Timestamp and End Timestamp (now they are static and only included in the final JSON Object)
After filling in this form, my code produces, for example, this JSON object:
{
  "metrics": [
    {
      "tags": [
        {
          "id": "ch",
          "value": "932"
        },
        {
          "id": "ch",
          "value": "931"
        }
      ],
      "aggregators": {
        "name": "sum",
        "sampling": [
          {
            "value": "1",
            "unit": "milliseconds",
            "type": "SUM"
          }
        ]
      }
    }
  ],
  "cache_time": 0,
  "start_absolute": 123,
  "end_absolute": 1234
}
Unfortunately, KairosDB accepts a different structure. As you can see below, the tag id "ch" doesn't have an "id" key before it, and tag values coming from the same tag id are grouped together:
{
  "metrics": [
    {
      "tags": {
        "ch": [
          "932",
          "931"
        ]
      },
      "name": "AIENR",
      "aggregators": [
        {
          "name": "sum",
          "sampling": {
            "value": "1",
            "unit": "milliseconds"
          }
        }
      ]
    }
  ],
  "cache_time": 0,
  "start_absolute": 1367359200000,
  "end_absolute": 1386025200000
}
My question is: is there a way to obtain a JSON structure like the one accepted by KairosDB with an AngularJS form? Thanks to everyone.
I've seen this topic, which is the closest one to mine, but it isn't in AngularJS.
Personally, I'd do the refactoring work in the backend: have whatever server interface sends and receives the data do the manipulation. Otherwise you'll end up needing to refactor your data inside Angular anywhere you want to use that dataset, whereas doing it in the backend puts it in a single access point.
Of course, you could do it in Angular: just replace userString in the submitData method with a copy of the array, replace the tags section with data in the new format, and likewise refactor the returned result to the correct format when you get a reply.
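For what it's worth, the tags reshaping itself is plain data manipulation, independent of the framework. A minimal sketch of the grouping step in Python (the JavaScript inside submitData would mirror it line for line):

from collections import defaultdict

# Group tag values by tag id:
# [{"id": "ch", "value": "932"}, {"id": "ch", "value": "931"}] -> {"ch": ["932", "931"]}
form_tags = [
    {'id': 'ch', 'value': '932'},
    {'id': 'ch', 'value': '931'},
]

grouped = defaultdict(list)
for tag in form_tags:
    grouped[tag['id']].append(tag['value'])

print(dict(grouped))  # {'ch': ['932', '931']}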