Is it possible to combine two JSON documents with Ruby in MongoDB?

I have to insert documents into my Mongo database using Ruby (not Rails, just plain Ruby written in Notepad++); many of the documents are duplicates with some modifications.
I want to write a script which reads a JSON file and imports it into MongoDB, checking that each document does not already have a duplicate in the database; if there is a duplicate, I want to combine the two whenever the new one contains any additional information.
For example:
Document 1
{
  "Name": "Lila",
  "Files": [
    { "Name": "File1", "Date": "05-11-2017" },
    { "Name": "File2", "Date": "26-03-2018" }
  ]
}
Document 2
{
  "Name": "Lila",
  "Files": [
    { "Name": "File3", "Date": "26-03-2018" }
  ]
}
Combine them to have:
{
  "Name": "Lila",
  "Files": [
    { "Name": "File1", "Date": "05-11-2017" },
    { "Name": "File2", "Date": "26-03-2018" },
    { "Name": "File3", "Date": "26-03-2018" }
  ]
}
I found that this is possible in the mongo shell thanks to the $mergeObjects aggregation operator, but it does not seem to exist in Ruby.

You can use all of the operators in Ruby, too. You need to get the underlying collection object first.
require 'mongo'

# Legacy mongo gem (1.x) API, matching the driver docs linked below.
db   = Mongo::Connection.new.db("mydb")
coll = db.collection('posts')

coll.aggregate([
  {"$project" => {"last_name" => 1, "first_name" => 1}},
  {"$match"   => {"last_name" => "Jones"}}
])
This is just an example pipeline; you can pass the same aggregation pipeline that worked for you in the mongo shell straight to aggregate.
For more information, refer to the MongoDB Ruby driver documentation:
http://www.rubydoc.info/gems/mongo/1.8.2/Mongo%2FCollection%3Aaggregate
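For the merge itself, here is a minimal sketch under the assumption that you are on the current mongo gem (2.x). It matches on "Name" and uses an upsert with $addToSet/$each, so an existing document only gains the Files entries it does not already contain; the host, database, collection, and file names are placeholders.

require 'mongo'
require 'json'

# NOTE: hypothetical host, database, collection, and file names; adjust to your setup.
client = Mongo::Client.new(['127.0.0.1:27017'], database: 'mydb')
coll   = client[:people]

# The JSON file is assumed to contain an array of documents.
docs = JSON.parse(File.read('import.json'))

docs.each do |doc|
  # Match on "Name". $addToSet with $each appends only the Files entries that
  # are not already present (exact duplicates are skipped); upsert: true
  # inserts the whole document when no match exists yet.
  coll.update_one(
    { 'Name' => doc['Name'] },
    { '$addToSet' => { 'Files' => { '$each' => doc['Files'] || [] } } },
    upsert: true
  )
end

Note that $addToSet only skips entries that are exactly equal; if two entries share a file name but differ in date, both are kept, so add your own comparison logic if you need stricter deduplication.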

Related

Talend - Need to extract data from JSON (JSON array) and load it into Oracle DB

I have a Talend Job that receives JSON (format below) from a route. I need to extract the data from the JSON and load it into an Oracle DB table.
JSON format:
{
  "data": [
    {
      "name": "FRSC-01",
      "recordnum": "01",
      "Expense1": "100",
      "Expense2": "7265",
      "Expense3": "9000"
    },
    {
      "name": "FRSC-02",
      "recordnum": "",
      "Expense1": "200",
      "Expense2": "6000",
      "Expense3": "9000"
    },
    {
      "name": "FRSC-03",
      "recordnum": "03",
      "Expense1": "200",
      "Expense2": "7000",
      "Expense3": "8000"
    }
  ]
}
You can use the tExtractJsonFields component to extract the data from your JSON.
Define a schema with the columns you want from the JSON (name, recordNum, Expense1, Expense2, Expense3), set the loop JSONPath query to "$.data[*]", and then set the JSONPath expression for each column like so:
name => "name"
recordNum => "recordnum"
...
And then just use a tMap to map the columns to your target table in the tOracleOutput component.

Importing CSV File in Elasticsearch

I am new to Elasticsearch. I tried to import a CSV file by following the guide, and the import succeeded: it created an index with the documents in it.
What I found, though, is that in every document _id contains a random unique value. I want _id to take its value from a field in the CSV file (the file I am importing contains a unique id for every row), via a query or any other way, and I do not know how to do that.
It is not explained in the docs either. A sample document from the Elasticsearch index is shown below:
{
  "_index" : "sample_index",
  "_type" : "_doc",
  "_id" : "nGHXgngBpB_Kjkqcxfj",
  "_score" : 1.0,
  "_source" : {
    "categoryid" : "34128b58-9148-11eb-a8b3-0242ac130003",
    "categoryname" : "Blogs",
    "isdeleted" : "False"
  }
}
When I add an ingest pipeline with the following processor,
{
  "set": {
    "field": "_id",
    "value": "{{categoryid}}"
  }
}
it throws an error.
You can achieve this by modifying the ingest pipeline used to ingest your CSV file.
In the Ingest pipeline area (Advanced section), simply add the following processor at the end of the pipeline and the document ID will be set accordingly:
...
{
  "set": {
    "field": "_id",
    "value": "{{categoryid}}"
  }
}
I added the following processor in the ingest pipeline section and it works:
{
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "{{categoryid}}"
      }
    }
  ]
}
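If you prefer to register the pipeline outside the Kibana upload UI, a sketch along these lines with the elasticsearch Ruby gem (an assumption; any client or plain curl works the same way) creates the pipeline and indexes a document through it. The pipeline and index names are placeholders.

require 'elasticsearch'

# NOTE: hypothetical cluster URL and pipeline/index names; adjust as needed.
client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# Register an ingest pipeline that copies the "categoryid" column into _id.
client.ingest.put_pipeline(
  id: 'categoryid-as-id',
  body: {
    description: 'Use the categoryid field as the document id',
    processors: [
      { set: { field: '_id', value: '{{categoryid}}' } }
    ]
  }
)

# Any document indexed through the pipeline gets its _id from categoryid.
client.index(
  index: 'sample_index',
  pipeline: 'categoryid-as-id',
  body: {
    categoryid: '34128b58-9148-11eb-a8b3-0242ac130003',
    categoryname: 'Blogs',
    isdeleted: 'False'
  }
)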

Is there any way to preserve order while generating a patch for JSON files?

I am new to JSON tooling, specifically JSON Patch.
I have a scenario where I need to work out the difference between two versions of a JSON file for the same object; for that I am using json-patch-master.
Unfortunately, the generated patch interprets the change differently, i.e. in a different order, so I am getting unexpected/invalid results.
Could anyone tell me how to preserve the order while generating a JSON Patch?
Here is the actual example.
Original JSON file:
[ {
"name" : "name1",
"roolNo" : "1"
}, {
"name" : "name2",
"roolNo" : "2"
}, {
"name" : "name3",
"roolNo" : "3"
}, {
"name" : "name4",
"roolNo" : "4"
} ]
Modified/new JSON file, i.e. the 2nd node of the original file removed:
[ {
"name" : "name1",
"roolNo" : "1"
}, {
"name" : "name3",
"roolNo" : "3"
}, {
"name" : "name4",
"roolNo" : "4"
} ]
Patch/diff generated:
[ {"op":"remove","path":"/3"},
{"op":"replace","path":"/1/name","value":"name3"},
{"op":"replace","path":"/1/roolNo","value":"3"},
{"op":"replace","path":"/2/name","value":"name4"},
{"op":"replace","path":"/2/roolNo","value":"4"}]
Every time I generate the diff/patch it gives different path/diff results.
Moreover, the interpretation is different, i.e. the order is not preserved.
Is there any way to get the expected result, i.e. [ {"op":"remove","path":"/1"} ]? In other words, can the patch/diff be generated following some order so that I get what is expected?
How do I handle this kind of scenario?
Please help me. Thank you so much.
~Shyam
We are currently working on this issue in Starcounter-Jack/JSON-Patch.
It seems to work nicely with native Array.observe: http://jsfiddle.net/tomalec/p4s7aw96/.
Try the Starcounter-Jack/JSON-Patch issues/65_ArrayObserve branch;
we will release it as a new version once the shim and performance have been checked.
Feel free to add your comments on the JSON-Patch issue board.

JSON array in MongoDB

I want to get objects according to an ID they hold in an array, from a JSON document stored in MongoDB.
I tried a lot of ways to get them, with no success:
db.collection.find({"Id":"2"})
db.collection.find({"Messages.Id":"2"})
db.collection.find({"Messages":{$elemMatch:{"Id":"2"}}})
db.collection.find({"Messages.Id":{$elemMatch:{"Id":"2"}}})
{
  "Messages" : [
    {
      "text" : "aaa",
      "Id" : [ "1", "2" ]
    },
    {
      "texts" : "bbb",
      "Id" : [ "1", "3" ]
    }
  ]
}
Even though that is how it's supposed to be done according to the MongoDB documentation.
So I thought something was wrong with my JSON design (I tried changing it, but that didn't help either).
Can anyone suggest a good design, or a query that will return the objects with a certain ID?
UPDATE:
For example, if I request the ID 2 in the query, I want only the first message, in its entirety, to be returned (I don't mind if the Id field is not displayed):
{
"text":"aaa",
"Id":["1","2"]
}
To return only the single matching array element, you will need to use the positional operator ($):
db.collection.find({"Messages.Id": "2"}, {"Messages.$": 1, _id: 0})
For finding multiple matches, you would use the aggregation pipeline:
db.collection.aggregate([
{ $unwind: "$Messages" },
{ $match: {"Messages.Id": "1"}},
{ $group: { _id: null, messages: { $push: "$Messages"}}}
])
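For reference, the same two queries can be issued from Ruby with the mongo gem (2.x API assumed; the connection details and collection name are placeholders).

require 'mongo'

# NOTE: hypothetical connection details and collection name.
client = Mongo::Client.new(['127.0.0.1:27017'], database: 'mydb')
coll   = client[:collection]

# Single match: positional projection returns only the first matching element.
coll.find('Messages.Id' => '2')
    .projection('Messages.$' => 1, '_id' => 0)
    .each { |doc| puts doc }

# Multiple matches: unwind, filter, and regroup the matching messages.
coll.aggregate([
  { '$unwind' => '$Messages' },
  { '$match'  => { 'Messages.Id' => '2' } },
  { '$group'  => { '_id' => nil, 'messages' => { '$push' => '$Messages' } } }
]).each { |doc| puts doc }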

Importing and updating data in Elasticsearch

We have an existing search function that involves data across multiple tables in SQL Server. This causes a heavy load on our DB, so I'm trying to find a better way to search through this data (it doesn't change very often). I have been working with Logstash and Elasticsearch for about a week using an import containing 1.2 million records. My question is essentially, "how do I update existing documents using my 'primary key'"?
CSV data file (pipe delimited) looks like this:
369|90045|123 ABC ST|LOS ANGELES|CA
368|90045|PVKA0010|LA|CA
367|90012|20000 Venice Boulvd|Los Angeles|CA
365|90045|ABC ST 123|LOS ANGELES|CA
363|90045|ADHOCTESTPROPERTY|DALES|CA
My logstash config looks like this:
input {
  stdin {
    type => "stdin-type"
  }
  file {
    path => ["C:/Data/sample/*"]
    start_position => "beginning"
  }
}
filter {
  csv {
    columns => ["property_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}
output {
  elasticsearch {
    embedded => true
    index => "samples4"
    index_type => "sample"
  }
}
A document in Elasticsearch then looks like this:
{
  "_index": "samples4",
  "_type": "sample",
  "_id": "64Dc0_1eQ3uSln_k-4X26A",
  "_score": 1.4054651,
  "_source": {
    "message": [
      "369|90045|123 ABC ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-11T22:58:38.365Z",
    "host": "[host]",
    "path": "C:/Data/sample/sample.csv",
    "property_id": "369",
    "postal_code": "90045",
    "address_1": "123 ABC ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
I would like the random unique ID in the _id field to be replaced with the value of property_id. The idea is that subsequent data files would contain updates. I don't need to keep previous versions, and there wouldn't be a case where we add or remove keys from a document.
The document_id setting for the elasticsearch output doesn't put that field's value into _id (it just put in the literal string "property_id" and only stored/updated one document). I know I'm missing something here. Am I just taking the wrong approach?
EDIT: WORKING!
Using #rutter's suggestion, I've updated the output config to this:
output {
  elasticsearch {
    embedded => true
    index => "samples6"
    index_type => "sample"
    document_id => "%{property_id}"
  }
}
Now documents are updated when I drop new files into the data folder, as expected. _id and property_id have the same value.
{
  "_index": "samples6",
  "_type": "sample",
  "_id": "351",
  "_score": 1,
  "_source": {
    "message": [
      "351|90045|Easy as 123 ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-12T16:12:52.102Z",
    "host": "TXDFWL3474",
    "path": "C:/Data/sample/sample_update_3.csv",
    "property_id": "351",
    "postal_code": "90045",
    "address_1": "Easy as 123 ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
Converting from comment:
You can overwrite a document by sending another document with the same ID... but that might be tricky with your previous data, since you'll get randomized IDs by default.
You can set an ID using the output plugin's document_id field, but it takes a literal string, not a field name. To use a field's contents, you could use an sprintf format string, such as %{property_id}.
Something like this, for example:
output {
  elasticsearch {
    ... other settings...
    document_id => "%{property_id}"
  }
}
Disclaimer: I'm the author of ESL.
You can use elasticsearch_loader to load PSV files into Elasticsearch.
In order to set the _id field you can use --id-field=property_id.
For instance:
elasticsearch_loader --index=myindex --type=mytype --id-field=property_id csv --delimiter='|' filename.csv
Have you tried changing the config to this:
filter {
  csv {
    columns => ["_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}
By naming the property_id column _id, it should get used during indexing.