Parse large JSON results from a REST API

I am facing the problem of parsing large JSON results from a REST endpoint (Elasticsearch).
Aside from the fact that the design of the system has its flaws, I am wondering whether there is another way to do the parsing.
The REST response contains 10k objects in a JSON array. I am using the native JSON mapper of Elasticsearch and Jsoniter. Both lack performance and slow down the application. The request duration rises to 10-15 seconds.
I will push for a change of the interface, but the big result list will remain for the next 6 months.
Could anyone give me advice on how to speed up parsing performance with Elasticsearch?

Profile everything.
Is Elasticsearch slow in generating the response?
If you perform the query with curl, redirect the output to a file, and time it, what fraction of your app's total request time does that take?
Are you running it locally? You might be dropping packets or being throttled by low bandwidth over the network.
Is the performance hit purely from decoding the response?
How long does it take to decode the same blob of JSON with Jsoniter once it is loaded into memory from a static file?
Have you considered chunking your query (see the sketch below)?
What about spinning the parsing off as a separate process and immediately returning to the event loop?
There are lots of options and not enough detail in your question to be able to give solid advice.
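To illustrate the chunking idea, here is a minimal sketch (TypeScript, against the plain Elasticsearch REST search API) that pages through the hits with search_after instead of pulling all 10k objects in one response. The endpoint URL, index name, and the unique "id" sort field are placeholders; adapt them to your cluster, and the same pattern applies in your Java client.

    // Minimal sketch: page through an Elasticsearch result set with search_after
    // instead of decoding one huge 10k-element JSON array.
    // ES_URL and the "id" sort field are assumptions -- adjust to your index.
    const ES_URL = "http://localhost:9200/my-index/_search";

    async function fetchInPages(pageSize = 1000): Promise<unknown[]> {
      const docs: unknown[] = [];
      let searchAfter: unknown[] | undefined;

      while (true) {
        const body: Record<string, unknown> = {
          size: pageSize,
          sort: [{ id: "asc" }],          // must be a unique, sortable field
          query: { match_all: {} },
        };
        if (searchAfter) body.search_after = searchAfter;

        const res = await fetch(ES_URL, {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify(body),
        });
        const hits = (await res.json()).hits.hits as { _source: unknown; sort: unknown[] }[];

        if (hits.length === 0) break;              // no more pages
        docs.push(...hits.map((h) => h._source));
        searchAfter = hits[hits.length - 1].sort;  // resume after the last hit
      }
      return docs;
    }

Smaller pages keep each response quick to decode, and you can start processing the first page while the later ones are still being fetched.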

Related

Options for storing and retrieving small objects (node.js), is a database necessary?

I am in the process of building my first live Node.js web app. It contains a form that accepts data regarding my client's current stock. When submitted, an object is created and saved to an array of current stock. This stock is then permanently displayed on their website until the entry is modified or deleted.
It is unlikely that there will ever be more than 20 objects stored at any time, and these will only be updated perhaps once a week. I am not sure if it is necessary to use MongoDB to store these, or whether there could be a simpler, more appropriate alternative. Perhaps the objects could be stored in a JSON file instead? Or would this have too big an impact on page load times?
You could potentially store the data in a JSON file, or even in a cache of sorts such as Redis, but I still think MongoDB would be your best bet for a live site.
Storing something in a JSON file is not scalable, so if you end up storing a lot more data than originally planned (this often happens) you may find you run out of storage on your server's hard drive. Also, if you end up scaling and putting your app behind a load balancer, you will need to make sure there are matching copies of that JSON file on each server. Furthermore, it is easy to run into race conditions when updating a JSON file: if two processes try to update the file at the same time, you will potentially lose data. Technically speaking, a JSON file would work, but it's not recommended.
Storing in memory (i.e. Redis) has a similar limitation in that the data is only available on that one server. Also, the data is not persistent, so if your server restarted for whatever reason, you'd lose what was stored in memory.
For all intents and purposes, MongoDB is your best bet.
The only way to know for sure is to run a load test. But since you probably read HTML and JS files from the file system when serving web pages anyway, the extra load of reading a few JSON files shouldn't be a problem.
If you want to go the simpler route, i.e. a JSON file, use the NeDB API, which is plenty fast as well.
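For reference, a minimal sketch of what that could look like with NeDB; the "stock.db" file name and the document shape are made up for the example:

    // Minimal NeDB sketch: a single-file embedded store for a handful of records.
    // "stock.db" and the document fields are assumptions for the example.
    import Datastore from "nedb";

    const db = new Datastore({ filename: "stock.db", autoload: true });

    // Insert a new stock entry submitted from the form.
    db.insert({ name: "Widget", quantity: 12 }, (err) => {
      if (err) console.error(err);
    });

    // Read everything back when rendering the page.
    db.find({}, (err, items) => {
      if (err) console.error(err);
      else console.log(items);
    });

NeDB keeps the dataset in memory and persists it to a single append-only file, so for ~20 records the read/write cost should be negligible.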

HTTP POST Transmission suggestion

I'm building a system which requires an Arduino board to send data to the server.
The requirements/constraints of the app are:
The server must receive the data and store it in a MySQL database.
A web application is used to graph and plot historical data.
Data consumption is critical
Web application must also be able to plot data in real time.
So far, the system is working fine; however, optimization is required.
The current adopted steps are:
Accumulate data on the Arduino board for 10 seconds.
Send the data to the server via POST, with the body containing an XML string representing the 10 records.
The server parses the received XML and stores the values in the database.
This approach is good for historical data, but not for real-time monitoring.
My question is: is there a difference between
accumulating the data and sending it as one XML payload, and
sending the data every second?
In terms of data consumption, is sending a POST request every second too much?
Thanks
EDIT: Can anybody provide a mathematical formula comparing the two approaches in terms of data consumption?
For the data consumption question, you need to figure out how much each POST costs you given your cell phone plan. I don't know of a single mathematical formula, but you can easily test it and work it out; a rough calculation is sketched below.
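As a back-of-the-envelope comparison (TypeScript, with made-up header and record sizes; substitute figures measured from your own requests): every POST pays a fixed overhead for the request line, headers and TCP/IP framing, so one request carrying 10 records costs far less than 10 requests carrying one record each.

    // Back-of-the-envelope data use: batched vs. per-second POSTs.
    // HTTP_OVERHEAD_BYTES and BYTES_PER_RECORD are assumptions -- measure your
    // own requests (sniffer or server logs) and plug in the real numbers.
    const HTTP_OVERHEAD_BYTES = 400; // request line + headers + framing, roughly
    const BYTES_PER_RECORD = 60;     // one XML-encoded reading, roughly

    function bytesPerHour(recordsPerPost: number, postsPerHour: number): number {
      return postsPerHour * (HTTP_OVERHEAD_BYTES + recordsPerPost * BYTES_PER_RECORD);
    }

    // One POST every 10 s carrying 10 records vs. one POST per second with 1 record.
    const batched = bytesPerHour(10, 360);   // 360 * (400 + 600) = 360 000 B/h
    const perSecond = bytesPerHour(1, 3600); // 3600 * (400 + 60) = 1 656 000 B/h
    console.log({ batched, perSecond });

With these made-up numbers, the per-second strategy uses roughly 4.6 times as much data; the exact ratio depends entirely on your real header and payload sizes.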
However, using 3G (or even WiFi, for that matter), power consumption will be an issue, especially if your circuit runs on a battery; each POST bursts around 1.5 amps, and that's too much for sending data every second.
But again, why would you send data every second?
Real time doesn't mean sending data every second; it means being at least as fast as the system you are monitoring.
For example, if you are sending temperatures, a temperature doesn't change from 0° to 100° in one second, so all those POSTs would be a waste of power and data.
You need to know how fast the parameters in your system change and adapt your POST rate accordingly; a sketch of that idea follows below.
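As a rough illustration of adapting the POST rate to how fast the values change, here is a small TypeScript sketch of a send-on-change policy: transmit only when the reading moves by more than a threshold, or when a maximum silence interval has elapsed. THRESHOLD, MAX_SILENCE_MS and sendToServer() are placeholders; the same logic ports directly to the Arduino sketch.

    // Send-on-change sketch: POST only when the value changes meaningfully,
    // or when MAX_SILENCE_MS has elapsed, whichever comes first.
    // THRESHOLD, MAX_SILENCE_MS and sendToServer() are assumptions for the example.
    const THRESHOLD = 0.5;          // e.g. half a degree
    const MAX_SILENCE_MS = 60_000;  // heartbeat so the server knows we're alive

    let lastSentValue: number | null = null;
    let lastSentAt = 0;

    async function sendToServer(value: number): Promise<void> {
      // Placeholder for the real POST to your endpoint.
      console.log(`POST value=${value}`);
    }

    export async function report(value: number): Promise<void> {
      const now = Date.now();
      const changedEnough =
        lastSentValue === null || Math.abs(value - lastSentValue) >= THRESHOLD;
      const tooQuiet = now - lastSentAt >= MAX_SILENCE_MS;

      if (changedEnough || tooQuiet) {
        await sendToServer(value);
        lastSentValue = value;
        lastSentAt = now;
      }
    }

This keeps historical coverage (via the heartbeat) while cutting both data and power use when nothing interesting is happening.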

Storing Elasticsearch query results in a Django session

I am currently in a development team that has implemented a search app using Flask-WhooshAlchemy. Admittedly, we did not think this completely through.
The greatest problem we face is being unable to store query results in a Flask session without serializing the data set first. The '__QueryObject' returned by Whoosh can be JSON-serialized using Marshmallow. We have gone down this route and, yes, we are able to store and manipulate the retrieved data, but at a cost: initial searches take a very long time (at least 30 seconds for larger result sets, due to serialization). For the time being, we are stuck having to re-query any time something changes (changes that shouldn't require a fresh search, such as switching between result views or changing the number of results per page). Adding insult to injury, Whoosh is probably not scalable for our purposes; Elasticsearch seems a better contender.
In short:
How can we store Elasticsearch query results in a Django session so that we can manipulate those results?
Any other guidance will be greatly appreciated.
In case anyone cares: we finally got everything up and running, and yes, it is possible to store Elasticsearch query results in a Django session.

Live chat application using Node.js, Socket.IO and a JSON file

I am developing a live chat application using Node.js, Socket.IO and a JSON file. I am using the JSON file to read and write the chat data. Now I am stuck on one issue: when I do stress testing, i.e. push continuous messages into the JSON file, the JSON format becomes invalid and my application crashes. Although I am using forever.js, which should keep the application up, it still crashes.
Does anybody have an idea what is going wrong?
Thanks in advance for any help.
It is highly recommended that you reconsider your approach to persisting data on disk.
Among other things, one really big issue is that you will likely lose data. If we both read the file at the same time - {"foo":"bar"} - and we both make a change, and you save yours before mine, my change will overwrite yours, because I started from the same contents as you and never re-read the file after your save.
What you are possibly seeing now with an append-only approach is that we are both adding bits and pieces without regard to valid JSON structure (e.g. {"fo"bao":r":"ba"for"o"} from writing {"foo":"bar"} twice, interleaved).
Disk I/O is also pretty slow, even with an SSD. Memory is where it's at.
As recommended, you may want to consider MongoDB, MySQL, or something similar. This may be a decent use case for Couchbase, an in-memory key/value store based on memcache that persists data to disk as soon as possible. It is extremely JSON friendly (it is actually largely built around JSON), offers great map/reduce support for querying data, is easy to scale to multiple servers, and has a Node.js module.
That would let you migrate your existing data storage routine into a database very easily. It also provides CAS support, which protects you from the data-loss scenario outlined earlier.
At minimum, though, you should maintain an in-memory object that you flush to disk every so often to limit data loss (see the sketch below). However, that only works well with a single server, and then you're likely back to needing a database.
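A minimal sketch of that "in-memory object, flushed to disk periodically" approach, keeping the file valid by always rewriting it atomically (write to a temp file, then rename). The chat.json file name and the Message shape are assumptions:

    // Keep chat messages in memory and persist them periodically.
    // The full array is serialized to a temp file which is then renamed over
    // the real file, so readers never see a half-written JSON document.
    // "chat.json" and the Message shape are assumptions for the example.
    import { promises as fs } from "node:fs";

    interface Message { user: string; text: string; at: number; }

    const FILE = "chat.json";
    const messages: Message[] = [];

    export function addMessage(user: string, text: string): void {
      messages.push({ user, text, at: Date.now() });
    }

    async function flushToDisk(): Promise<void> {
      const tmp = `${FILE}.tmp`;
      await fs.writeFile(tmp, JSON.stringify(messages));
      await fs.rename(tmp, FILE); // atomic on the same filesystem
    }

    // Flush every few seconds instead of on every message.
    setInterval(() => {
      flushToDisk().catch((err) => console.error("flush failed", err));
    }, 5000);

Anything written since the last flush is still lost on a crash, which is why a real database remains the better long-term answer.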

Optimal way to "mirror" JSON data from a third-party to a Meteor Collection

We have a Meteor-based system that polls for data from a third-party REST API, loops through the retrieved data, and inserts or updates each record in a Meteor collection.
But then it hit me: What happens when an entry is deleted from the data of the third-party?
One would say: insert/update the data, then loop through the collection and find out which records aren't in the fetched data. True, that's one way of doing it.
Another would be to clear the collection, and rewrite everything from the fetched data.
But with thousands of entries (currently at 1500+ records, will potentially explode), both seem to be very slow and CPU consuming.
What is the most optimal procedure for mirroring data from a JS object into a Meteor/Mongo collection in such a way that items deleted from the source data are also deleted from the collection?
I think code is irrelevant here since this could be applicable to other languages that can do a similar feat.
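Independent of Meteor, one common pattern for the mirroring itself is: upsert every fetched record keyed on the third party's ID, then delete anything whose ID was not in the fetch. A minimal sketch with the Node MongoDB driver follows; the connection string, db/collection names, and the externalId field are assumptions, and inside Meteor the same two steps should map onto the collection's upsert/remove calls.

    // Mirror a fetched array into a Mongo collection:
    // upsert everything that was fetched, then prune whatever was not.
    // The URI, db/collection names, and the externalId field are assumptions.
    import { MongoClient } from "mongodb";

    interface ApiRecord { id: string; [key: string]: unknown; }

    export async function mirror(fetched: ApiRecord[]): Promise<void> {
      const client = new MongoClient("mongodb://localhost:27017");
      try {
        await client.connect();
        const col = client.db("app").collection("items");

        // 1) Upsert every fetched record, keyed on the third party's id.
        if (fetched.length > 0) {
          await col.bulkWrite(
            fetched.map((rec) => ({
              updateOne: {
                filter: { externalId: rec.id },
                update: { $set: { externalId: rec.id, ...rec } },
                upsert: true,
              },
            }))
          );
        }

        // 2) Remove anything no longer present upstream.
        //    Note: an empty fetch will clear the whole collection.
        const ids = fetched.map((rec) => rec.id);
        await col.deleteMany({ externalId: { $nin: ids } });
      } finally {
        await client.close();
      }
    }

This avoids wiping and rebuilding the collection on every poll, and the delete step only touches records that actually disappeared upstream.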
For this kind of usage, try something more optimized. The Meteor guys are working on letting Meteor act as a sort of MongoDB replica-set member to get and set data.
For the moment there is Smart Collections, which uses MongoDB's oplog to significantly boost performance. It can work in a one-size-fits-all scenario without optimizing for specifics; there are benchmarks that show this.
When Meteor 1.0 comes out, I think they'll have optimized their own MongoDB driver.
I think this may help with thousands of entries. If you're changing thousands of documents every second, you need something closer to MongoDB itself. Meteor employs lots of caching techniques which aren't optimal for this; I think it polls the database every 5 seconds to refresh its cache.
Smart Collections: http://meteorhacks.com/introducing-smart-collections.html
Please do let me know if it helps; I'm interested to hear whether it's useful in this scenario.
If this doesn't work, Redis might be helpful too, since everything is stored in memory. I'm not sure what your use case is, but if you don't need persistence, Redis would squeeze out more performance than Mongo.