Elasticsearch document update taking too long when retrieved by alias - json

Here is the situation.
1) There is an existing document (let's say the index is baseball-a)
2) baseball-a, baseball-b, and baseball-c are aliased to baseball
3) Update a document in baseball-a
POST /baseball-a/1/_update?pretty
{
"doc": { "my_name": "Casey at the bat2"}
}
4) Now if I do a GET baseball-a/1/, everything is updated
5) But if I do a search
POST /baseball/_search?pretty
{
"query": { "match": { "id": "1" } }
}
then the document that is returned has the old my_name of "Casey at the bat" (missing the '2'), but 15 minutes later the update shows up... how do I fix this or speed it up?

I think I figured it out. Basically you need to look at the refresh_interval for the alias by doing
GET /baseball/_settings
Mine was set to -1 and it should be set to either 1s or 5s
Additionally after I manually ran this command
POST /baseball/_refresh
it also worked, but that is just a hassle... let Elasticsearch do it for you automatically. Now if I could only figure out why I can't set the refresh interval correctly (see: Updating ElasticSearch interval_refresh when aliased).
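For reference, refresh_interval is an index setting, so you may need to apply it to the concrete indices rather than to the alias itself. A minimal request (assuming a 1s interval is acceptable for your write load) would be:
PUT /baseball-a/_settings
{
"index": { "refresh_interval": "1s" }
}
You could repeat that for baseball-b and baseball-c, or target all of them at once with PUT /baseball-*/_settings.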

Related

How to parse in Google Sheets a nested JSON structure with fallback when the data isn't available?

I'm getting Yahoo Finance data as a JSON file (via the YahooFinancials python API) and I would like to be able to parse the data in a smart way to feed my Google Sheet.
For this example, I'm interested in getting the "cash" variable under the "date" nested structure. But as you'll see, sometimes there is no "cash" variable under the first date, so I would like the script/formula to go and get the "cash" variable that's under the second date structure.
Here is sample 1 of JSON code:
{ "balanceSheetHistoryQuarterly": {
"ABBV": [
{
"2018-12-31": {
"totalStockholderEquity": -2921000000,
"netTangibleAssets": -45264000000
}
},
{
"2018-09-30": {
"intangibleAssets": 26625000000,
"capitalSurplus": 14680000000,
"totalLiab": 69085000000,
"totalStockholderEquity": -2921000000,
"otherCurrentLiab": 378000000,
"totalAssets": 66164000000,
"commonStock": 18000000,
"otherCurrentAssets": 112000000,
"retainedEarnings": 6789000000,
"otherLiab": 16511000000,
"goodWill": 15718000000,
"treasuryStock": -24408000000,
"otherAssets": 943000000,
"cash": 8015000000,
"totalCurrentLiabilities": 15387000000,
"shortLongTermDebt": 1026000000,
"otherStockholderEquity": -2559000000,
"propertyPlantEquipment": 2950000000,
"totalCurrentAssets": 18465000000,
"longTermInvestments": 1463000000,
"netTangibleAssets": -45264000000,
"shortTermInvestments": 770000000,
"netReceivables": 5780000000,
"longTermDebt": 37187000000,
"inventory": 1786000000,
"accountsPayable": 10981000000
}
},
{
"2018-06-30": {
"intangibleAssets": 26903000000,
"capitalSurplus": 14596000000,
"totalLiab": 65016000000,
"totalStockholderEquity": -3375000000,
"otherCurrentLiab": 350000000,
"totalAssets": 61641000000,
"commonStock": 18000000,
"otherCurrentAssets": 128000000,
"retainedEarnings": 5495000000,
"otherLiab": 16576000000,
"goodWill": 15692000000,
"treasuryStock": -23484000000,
"otherAssets": 909000000,
"cash": 3547000000,
"totalCurrentLiabilities": 17224000000,
"shortLongTermDebt": 3026000000,
"otherStockholderEquity": -2639000000,
"propertyPlantEquipment": 2787000000,
"totalCurrentAssets": 13845000000,
"longTermInvestments": 1505000000,
"netTangibleAssets": -45970000000,
"shortTermInvestments": 196000000,
"netReceivables": 5793000000,
"longTermDebt": 31216000000,
"inventory": 1580000000,
"accountsPayable": 10337000000
}
},
{
"2018-03-31": {
"intangibleAssets": 27230000000,
"capitalSurplus": 14519000000,
"totalLiab": 65789000000,
"totalStockholderEquity": 3553000000,
"otherCurrentLiab": 125000000,
"totalAssets": 69342000000,
"commonStock": 18000000,
"otherCurrentAssets": 17000000,
"retainedEarnings": 4977000000,
"otherLiab": 17250000000,
"goodWill": 15880000000,
"treasuryStock": -15961000000,
"otherAssets": 903000000,
"cash": 9007000000,
"totalCurrentLiabilities": 17058000000,
"shortLongTermDebt": 6024000000,
"otherStockholderEquity": -2630000000,
"propertyPlantEquipment": 2828000000,
"totalCurrentAssets": 20444000000,
"longTermInvestments": 2057000000,
"netTangibleAssets": -39557000000,
"shortTermInvestments": 467000000,
"netReceivables": 5841000000,
"longTermDebt": 31481000000,
"inventory": 1738000000,
"accountsPayable": 10542000000
}
}
]
}
}
The first date structure (under 2018-12-31) doesn't contain the cash variable. So I would like the Google sheet to go and search for the same data in 2018-09-30 and if not available go and search in 2018-06-30.
OR just scan the nested structure dates and fetch the first "cash" occurrence that will be found.
Basically, I would like to know how to skip the name of the date variable (i.e. 2018-12-31), as it doesn't really matter, and just make the formula look for the first available "cash" variable.
Main questions recap
How to skip mentioning an exact nested level name and scan what's inside?
How to keep scanning until you find the desired variable with a value that is not "null" (this can happen)?
What would be the entire formula to achieve the following logic: Scan the JSON file until you find the value > if no value found, fallback to this IMPORTXML function that calls an external API.
Let me know if you need more context about the issue and thanks in advance for your help :)
EDIT: this is the IMPORTJSON formula I use in the cell of the spreadsheet right now.
=ImportJSON("https://api.myjson.com/bins/8mxvi", "/financial/balanceSheetHistoryQuarterly/ABBV/2018-31-12/cash", "noHeaders")
Obviously, this one returns an error as there is nothing under that date. The link is also the actual JSON I'm using right now.
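One text-based approach (a rough sketch, not a real JSON parser): pull the raw JSON into the sheet as plain text, split it into one line per attribute, FILTER the lines that contain "cash", and REGEXEXTRACT the number after the colon. The first formula below works on JSON already pasted into A1; the second pulls it straight from the URL and returns the first match: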
=REGEXEXTRACT(FILTER(
TRANSPOSE(SPLIT(SUBSTITUTE(A1, ","&CHAR(10), "×"), "×")),
ISNUMBER(SEARCH("*"&"cash"&"*",
TRANSPOSE(SPLIT(SUBSTITUTE(A1, ","&CHAR(10), "×"), "×"))))), ": (.+)")
=INDEX(ARRAYFORMULA(SUBSTITUTE(REGEXEXTRACT(FILTER(TRANSPOSE(SPLIT(SUBSTITUTE(
TRANSPOSE(IMPORTDATA("https://api.myjson.com/bins/8mxvi")), ","&CHAR(10), "×"), "×")),
ISNUMBER(SEARCH("*"&"cash"&"*", TRANSPOSE(SPLIT(SUBSTITUTE(
TRANSPOSE(IMPORTDATA("https://api.myjson.com/bins/8mxvi")), ","&CHAR(10), "×"), "×"))))),
":(.+)"), ",", "")), 1, 1)

Solr/Lucene update query deletes attribute from data

I am running into an issue with an attribute missing in the data after running an update query.
I run a select query like this:
curl "http://localhost:8080/solr/collection1/select?q=title%3AHans+head:true&fl=title,uid,articleId,missing_Attribute,my_otherAttribute&wt=json&indent=true"
It returns an article:
{
"title":"Hans",
"uid":"18_UNIQUEID_123",
"articleId":"123123123",
"missing_Attribute":"M",
}
So missing_Attribute = "M" and my_otherAttribute is not present yet, which is fine.
Then I run an update query on this document using:
curl http://localhost:8080/solr/collection1/update?commit=true --data-binary @MyUpdate.json -H 'Content-type:application/json'
with MyUpdate.json as:
[
{
"uid": "18_UNIQUEID_123",
"my_otherAttribute": {
"set": "12"
}
}
]
And run the select query again, results in:
{
"title":"Hans",
"uid":"18_UNIQUEID_123",
"articleId":"123123123",
"my_otherAttribute":"12",
}
my_otherAttribute = 12 but missing_Attribute is gone!
Why is missing_Attribute gone when I update my_otherAttribute?
Why does it not affect any of the other fields?
To answer my own question, the answer is:
https://wiki.apache.org/solr/Atomic_Updates
The issue I face is that I want to make a partial (atomic) update of a document. I am using Solr 4.10, so in theory it would work, but only if the schema supports it: atomic updates require every field to be stored, so Solr can rebuild the whole document from stored values when it re-indexes. Ours does not store all fields, and that is why attributes disappear.
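For illustration only (field names taken from the question, the types are assumptions), the fields involved would need to be declared as stored in schema.xml, roughly like this:
<field name="missing_Attribute" type="string" indexed="true" stored="true"/>
<field name="my_otherAttribute" type="string" indexed="true" stored="true"/>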

Make dynamic name text field in Postman

I'm using Postman to make REST API calls to a server. I want to make the name field dynamic so I can run the request with a unique name every time.
{
"location":
{
"name": "Testuser2", // this should be unique, eg. Testuser3, Testuser4, etc
"branding_domain_id": "52f9f8e2-72b7-0029-2dfa-84729e59dfee",
"parent_id": "52f9f8e2-731f-b2e1-2dfa-e901218d03d9"
}
}
In Postman you want to use Dynamic Variables.
The JSON you post would look like this:
{
"location":
{
"name": "{{$guid}}",
"branding_domain_id": "52f9f8e2-72b7-0029-2dfa-84729e59dfee",
"parent_id": "52f9f8e2-731f-b2e1-2dfa-e901218d03d9"
}
}
Note that this will give you a GUID (you also have the option to use ints or timestamps) and I'm not currently aware of a way to inject strings (say, from a test file or a data generation utility).
In Postman you can use a random integer which ranges from 0 to 1000; in your data you can use it like this:
{
"location":
{
"name": "Testuser{{$randomInt}}",
"branding_domain_id": "52f9f8e2-72b7-0029-2dfa-84729e59dfee",
"parent_id": "52f9f8e2-731f-b2e1-2dfa-e901218d03d9"
}
}
Just my five cents on this matter: when using randomInt there is always a chance that the number will eventually already be present in the DB, which can cause issues.
Solution (for me at least) is to use $timestamp instead.
Example:
{
"username": "test{{$timestamp}}",
"password": "test"
}
For anyone who's about to downvote me: this post was made before the discussion in the comments with the OP (see below). I'm leaving it in place so that the OP's comment which eventually described what he needs isn't removed from the question.
From what I understand you're looking for, here's a basic solution. It's assuming that:
you're developing some kind of script where you need test data
the name field should be unique each time it's run
If your question was more specific then I'd be able to give you a more specific answer, but this is the best I can do from what's there right now.
var counter = location.hash ? parseInt(location.hash.slice(1)) : 1; // get a unique counter from the URL
var unique_name = 'Testuser' + counter; // create a unique name
location.hash = ++counter; // increase the counter by 1
You can forcibly change the counter by looking in the address bar and changing the URL from ending in #1 to #5, etc.
You can then use the variable name when you build your object:
var locationPayload = { // a distinct name, so it doesn't clash with window.location used above
name: unique_name,
branding_domain_id: 'however-you-currently-get-it',
parent_id: 'however-you-currently-get-it'
};
Add the code below in the Pre-request Script tab:
var myUUID = require('uuid').v4();
pm.environment.set('myUUID', myUUID);
and use {{myUUID}} wherever you want, like
"name": "{{myUUID}}"
It will generate a random unique GUID for every request.
var uuid = require('uuid');
pm.globals.set('unique_name', 'testuser' + uuid.v4());
Add the above code to the Pre-request Script tab.
This way you can reuse the unique name for subsequent API calls.
Dynamic variables like randomInt or guid are dynamic, i.e. you do not know what was sent in the request. There is no way to refer to the value again unless it is sent back in the response; even if you store it in a variable, it will still be dynamic.
Another way is:
var allowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
var shuffled_unique_str = allowed.split('').sort(function(){return 0.5-Math.random()}).join('');
Courtesy: refer to this link for more options.
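To actually reference that string in a request body, one option (just a sketch; the variable name is an assumption) is to store part of it from the same pre-request script:
// store the first 8 shuffled characters so the body can use {{unique_name}}
pm.environment.set('unique_name', 'Testuser_' + shuffled_unique_str.slice(0, 8));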

Using addToSet inside an array with MongoDB

I'm trying to track daily stats for an individual.
I'm having a hard time adding a new day inside "history", and I could also use a pointer on updating "walkingSteps" as new data comes in.
My schema looks like:
{
"_id": {
"$oid": "50db246ce4b0fe4923f08e48"
},
"history": [
{
"_id": {
"$oid": "50db2316e4b0fe4923f08e12"
},
"date": {
"$date": "2012-12-24T15:26:15.321Z"
},
"walkingSteps": 10,
"goalStatus": 1
},
{
"_id": {
"$oid": "50db2316e4b0fe4923f08e13"
},
"date": {
"$date": "2012-12-25T15:26:15.321Z"
},
"walkingSteps": 5,
"goalStatus": 0
},
{
"_id": {
"$oid": "50db2316e4b0fe4923f08e14"
},
"date": {
"$date": "2012-12-26T15:26:15.321Z"
},
"walkingSteps": 8,
"goalStatus": 0
}
]
}
db.history.update( ? )
I've been browsing (and attempting) the mongodb documentation but they don't quite break it all the way down to dummies like myself... I couldn't quite translate their examples to my setup.
Thanks for any help.
E = noob trying to learn programming
Adding a day:
user = {_id: ObjectId("50db246ce4b0fe4923f08e48")}
day = {_id: ObjectId(), date: ISODate("2013-01-07"), walkingSteps:0, goalStatus: 0}
db.users.update(user, {$addToSet: {history:day}})
Updating walkingSteps:
user = ObjectId("50db246ce4b0fe4923f08e48")
day = ObjectId("50db2316e4b0fe4923f08e13") // second day in your example
query = {_id: user, 'history._id': day}
db.users.update(query, {$set: {"history.$.walkingSteps": 6}})
This uses the $ positional operator.
It might be easier to have a separate history collection though.
[Edit] On the separate collections:
Adding days grows the document in size and it might need to be relocated on the disk. This can lead to performance issues and fragmentation.
Deleting days won't shrink the document size on disk.
It makes querying easier/straightforward (e.g. searching for a period of time)
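For illustration only (the collection and field names are assumptions), a separate per-day collection could look like this, and updating steps then needs no positional operator:
day = {userId: ObjectId("50db246ce4b0fe4923f08e48"), date: ISODate("2013-01-07"), walkingSteps: 0, goalStatus: 0}
db.daily_history.insert(day)
// incrementing later is a plain update on that day's document
db.daily_history.update({userId: ObjectId("50db246ce4b0fe4923f08e48"), date: ISODate("2013-01-07")}, {$inc: {walkingSteps: 1}})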
Even though @Justin Case gives the right answer, he doesn't explain a few things in it extremely well.
You will notice, first of all, that he drops the time resolution on dates and stores merely the date instead of date and time, like so:
day = {_id: ObjectId(), date: ISODate("2013-01-07"), walkingSteps:0, goalStatus: 0}
This means that all your dates will have 00:00:00 for their time instead of the exact time you are using atm. This increases the ease of querying per day so you can do something like:
db.col.update(
{"_id": ObjectId("50db246ce4b0fe4923f08e48"),
"history.date": ISODate("2013-01-07")},
{$inc: {"history.$.walkingSteps":0}}
)
and other similar queries.
This also makes $addToSet actually enforce its rules; however, since the data in this sub-document could change (i.e. walkingSteps will increment), $addToSet will not work well here anyway.
This is something I would change from the ticked answer. I would probably use $push or something else instead since $addToSet is heavier and won't really do anything useful here.
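For example, the 'adding a day' update from the accepted answer looks almost identical with $push (a sketch reusing the values from above):
db.users.update(
{_id: ObjectId("50db246ce4b0fe4923f08e48")},
{$push: {history: {_id: ObjectId(), date: ISODate("2013-01-07"), walkingSteps: 0, goalStatus: 0}}}
)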
The reason for a separate history collection in my view would be what you said earlier with:
Yes, the amount of history items for that day.
So this array contains a set of days, which is fine, but it sounds like the figure that you wish to derive walkingSteps from, a set of history items, should be in another collection, and you set walkingSteps according to the count of items in that other collection for today:
db.history_items.find({date: ISODate("2013-01-07")}).count();
Referring to the MongoDB Manual, $ is the positional operator which identifies an element in an array field to update without explicitly specifying the position of the element in the array. The positional $ operator, when used with the update() method, acts as a placeholder for the first match of the update query selector.
So, if you issue a command to update your collection like this:
db.history.update(
{ someCriterion: someValue },
{ $push: { "history":
{ "_id": { "$oid": "50db2316e4b0fe4923f08e12" },
"date": { "$date": "2012-12-24T15:26:15.321Z" },
"walkingSteps": 10,
"goalStatus": 1
}
} }
)
MongoDB might try to identify $oid and $date as some kind of positional parameters. $ is also part of atomic operators like $set and $push. So it is better to avoid using this special character in field names in MongoDB.

Alternative to preventing duplicates in importing CSV to CouchDB

I have 2 big CSV files with millions of rows.
Because those 2 CSVs are from MySQL, I want to merge those 2 tables into one document in CouchDB.
What is the most efficient way to do this?
My current method is:
import 1st CSV
import 2nd CSV
To prevent duplication, the program searches the document with the key for each row. After the row is found, the document is updated with the columns from the 2nd CSV.
The problem is, it really takes a long time to search for each row.
While importing the 2nd CSV, it updates 30 documents/second and I have about 7 million rows. By rough calculation, it will take about 64 hours to complete the whole import.
Thank You
It sounds like you have a "primary key" that you know from the row (or you can compute it from the row). That is ideal as the document _id.
The problem is, you get 409 Conflict if you try to add the 2nd CSV data but there was already a document with the same _id. Is that correct? (If so, please correct me so I can fix the answer.)
I think there is a good answer for you:
Use _bulk_docs to import everything, then fix the conflicts.
Begin with a clean database.
Use the Bulk document API to insert all the rows from the 1st and then the 2nd CSV set, as many as possible per HTTP query, e.g. 1,000 at a time. (Bulk docs are much faster than inserting one-by-one.)
Always add "all_or_nothing": true in your _bulk_docs POST data. That will guarantee that every insertion will be successful (assuming no disasters such as power loss or full HD).
When you are done, some documents will be conflicted, which means you inserted twice for the same _id value. That is no problem. Simply follow this procedure to merge the two versions:
For each _id that has conflicts, fetch it from couch by GET /db/the_doc_id?conflicts=true.
Merge all the values from the conflicting versions into a new final version of the document.
Commit the final, merged document into CouchDB and delete the conflicted revisions. See the CouchDB Definitive Guide section on conflict resolution. (You can use _bulk_docs to speed this up too.)
Example
Hopefully this will clarify a bit. Note, I installed the *manage_couchdb* couchapp from http://github.com/iriscouch/manage_couchdb. It has a simple view to show conflicts.
$ curl -XPUT -Hcontent-type:application/json localhost:5984/db
{"ok":true}
$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
, "first_value": "This is the first value"
}
, { "_id": "some_id"
, "second_value": "The second value is here"
}
]
}
[{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"},{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"}]
$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":2,"offset":0,"rows":[
{"id":"some_id","key":["some_id","1-0cb8fd1fd7801b94bcd2f365ce4812ba"],"value":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba"},"doc":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba","first_value":"This is the first value"}},
{"id":"some_id","key":["some_id","1-d1b74e67eee657f42e27614613936993"],"value":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993"},"doc":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993","second_value":"The second value is here"}}
]}
$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
, "_rev": "1-0cb8fd1fd7801b94bcd2f365ce4812ba"
, "first_value": "This is the first value"
, "second_value": "The second value is here"
}
, { "_id": "some_id"
, "_rev": "1-d1b74e67eee657f42e27614613936993"
, "_deleted": true
}
]
}
[{"id":"some_id","rev":"2-df5b9dc55e40805d7f74d1675af29c1a"},{"id":"some_id","rev":"2-123aab97613f9b621e154c1d5aa1371b"}]
$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":0,"offset":0,"rows":[]}
$ curl localhost:5984/db/some_id?conflicts=true\&include_docs=true
{"_id":"some_id","_rev":"2-df5b9dc55e40805d7f74d1675af29c1a","first_value":"This is the first value","second_value":"The second value is here"}
The final two commands show that there are no conflicts, and the "merged" document is now served as "some_id".
Another option is simply to do what you are doing already, but use the bulk document API to get a performance boost.
For each batch of documents:
POST to /db/_all_docs?include_docs=true with a body like this:
{ "keys": [ "some_id_1"
, "some_id_2"
, "some_id_3"
]
}
Build your _bulk_docs update depending on the results you get.
Doc already exists, you must update it: {"key":"some_id_1", "doc": {"existing":"data"}}
Doc does not exist, you must create it: {"key":"some_id_2", "error":"not_found"}
POST to /db/_bulk_docs with a body like this:
{ "docs": [ { "_id": "some_id_1"
, "_rev": "the _rev from the previous query"
, "existing": "data"
, "perhaps": "some more data I merged in"
}
, { "_id": "some_id_2"
, "brand": "new data, since this is the first doc creation"
}
]
}
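If your import program happens to be in JavaScript, the check-then-update loop could be batched roughly like this (a sketch only; the database URL, row shape, and merge rule are assumptions, and it assumes Node 18+ for the built-in fetch):
// Sketch: merge one batch of CSV rows into CouchDB via _all_docs + _bulk_docs.
const DB = 'http://localhost:5984/db';

async function mergeBatch(rows) { // rows: [{_id: 'some_id_1', ...columns}, ...]
  // 1) Look up which ids already exist (and their _rev) in one request.
  const lookup = await fetch(DB + '/_all_docs?include_docs=true', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ keys: rows.map(r => r._id) })
  }).then(res => res.json());

  // 2) Merge: keep the existing doc's fields (including _rev) and overlay the new columns.
  const docs = rows.map((row, i) => {
    const existing = lookup.rows[i].doc; // undefined when the id was not found
    return existing ? { ...existing, ...row } : row;
  });

  // 3) Write the whole batch back in one request.
  return fetch(DB + '/_bulk_docs', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ docs })
  }).then(res => res.json()); // check the result for per-document "error" entries
}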