Storing elasticsearch query result in Django session - json

I am currently in a development team that has implemented a search app using Flask-WhooshAlchemy. Admittedly, we did not think this completely through.
The greatest problem we face is being unable to store query results into a Flask session without serializing the data set first. The '__QueryObject' being returned via Whoosh can be JSON serialized using Marshmallow. We have gone through this route and, yes, we are able to store and manipulate the retrieved data, but at a tradeoff: initial searches will take a very long time (at least 30 seconds for larger result sets, due to serialization). For the time being, we are currently stuck with having to re-query anytime there are changes to the data set (changes that shouldn't require a fresh search, such as switching between result views and changing the number of results per page). Adding insult to injury, whoosh is probably not scalable for our purposes; Elasticsearch seems a better contender.
In short:
How can we store elasticsearch query results in a Django session so that we may be able to manipulate these results?
Any other guidance will be greatly appreciated.

If anyone cares, we finally got everything up and running and yes, it is possible to store elasticsearch query results in a Django session.

Related

Parse large Jsons from Rest-API

i am facing the problem of parsing large json-results from a rest-endpoint (elasticsearch).
besides the design of the system has got its flaws, I am wondering whether there is another way to do the parsing.
The rest-response contains 10k Object in Json-Array. I am using the native Json-mapper of elasticsearch and Jsoniter. Both lack performance and slow down the application. The request duration raises up to 10-15 sec.
I will encourage a change of the interface but the big result list will remain for the next 6 month.
Could anyone give me an advice what to do to speed up the performance with elasticsearch?
Profile everything.
Is Elasticsearch slow in generating the response?
If you perform the query with Curl, redirect the output to a file, and time it, what fraction of your app's time taken does that take?
Are you running it locally? You might be dropping packets/being throttled by low bandwidth over the network.
Is the performance hit is purely decoding the response?
How long does it take to decode the same blob of JSON using Jsoniter once loaded into memory from a static file?
Have you considered chunking your query?
What about spinning it off as a separate process and immediately returning to the event loop?
There are lots of options and not enough detail in your question to be able to give solid advice.

Find function in j2ee web application

i'm developing a little market in a web application and i have to implement the search function. Now, i know i can use MATCH function in mysql or i can add some libraries (like apache lucene) but that's not the point of my doubt. I'm thinking about managing the set of results i get from the search function (a servlet will do this), cause not all the results should be send to client at one time, so i would like to separate them in some pages. I want to know what is more efficient to do, if i should prefer to do the search in db for every page the client calls or if i should save the result set in a managed bean and access them while the client request a new page of results. Thx (i hope my english is enough understandable)
The question you should be asking is "how many results can you store in memory"? If you have a small dataset, by all means, sure but you will have to define what "small dataset means". This will help as you call the database once and filter on your result in memory (which is faster).
Alternative approach, for larger/huge dataset, you will want to request to the database on every user page request. The problem here is that you call the database on each call, so you will have to have an optimised search query that will bring results in small chunks (SQL LIMIT clause). If you only want to hit the database once and filter the result in "memory", you will have to slot in a caching layer in between your application and your database. That way, the results are cached and you filter on the cached result. The cache will sit on a different JVM as not to share your memory heap space.
There is no silver bullet here. You can only answer this based on your non-functional requirements.
I hope this helps.

Optimal way to "mirror" JSON data from a third-party to a Meteor Collection

We have a Meteor-based system that basically polls for data from a third-party REST API, loops through the retrieved data, inserts or updates each record to a Meteor collection.
But then it hit me: What happens when an entry is deleted from the data of the third-party?
One would say insert/update the data, and then loop through the collection and find out which one isn't in the fetched data. True, that's one way of doing it.
Another would be to clear the collection, and rewrite everything from the fetched data.
But with thousands of entries (currently at 1500+ records, will potentially explode), both seem to be very slow and CPU consuming.
What is the most optimal procedure to mirror data from JS Object to a Meteor/Mongo collection in such a way that deleted items from the data are also deleted on the collection?.
I think code is irrelevant here since this could be applicable to other languages that can do a similar feat.
For this kind of usage try using something that's more optimized. The meteor guys are working on using meteor as a sort of replica mongodb set to get/and set data.
For the moment there is Smart-Collections that uses mongodb's oplog to significantly boost performance. It could work in a sort of one size fits all scenario without optimizing for specifics. There are benchmarks that show this.
When Meteor 1.0 comes out I think they'll have optimized their own mongodb driver.
I think this may help with thousands of entries. If you're changing thousands of documents every second you need to get something closer to mongodb. Meteor employs lots of caching techniques which aren't too optimal for this. I think it polls the database every 5 seconds to refresh its cache.
Smart Collections: http://meteorhacks.com/introducing-smart-collections.html
Please do let know if it helps I'm interested to know if its useful in this scenario.
If this doesn't work, redis might be helpful too since everything is stored in-memory. Not sure what your use case is but if you don't need persistence redis would squeeze out more performance than mongo.

What would be the best DB cache to use for this application?

I am about 70% of the way through developing a web application which contains what is essentially a largeish datatable of around 50,000 rows.
The app itself is a filtering app providing various different ways of filtering this table such as range filtering by number, drag and drop filtering that ultimately performs regexp filtering, live text searching and i could go on and on.
Due to this I coded my MySQL queries in a modular fashion so that the actual query itself is put together dynamically dependant on the type of filtering happening.
At the moment each filtering action (in total) takes between 250-350ms on average. For example:-
The user grabs one end of a visual slider, drags it inwards, when he/she lets go a range filtering query is dynamically put together by my PHP code and the results are returned as a JSON response. The total time from the user letting go of the slider until the user has recieved all data and the table is redrawn is between 250-350ms on average.
I am concerned with scaleability further down the line as users can be expected to perform a huge number of the filtering actions in a short space of time in order to retrieve the data they are looking for.
I have toyed with trying to do some fancy cache expiry work with memcached but couldn't get it to play ball correctly with my dynamically generated queries. Although everything would cache correctly I was having trouble expiring the cache when the query changes and keeping the data relevent. I am however extremely inexperienced with memcached. My first few attempts have led me to believe that memcached isn't the right tool for this job (due to the highly dynamic nature of the queries. Although this app could ultimately see very high concurrent usage.
So... My question really is, are there any caching mechanisms/layers that I can add to this sort of application that would reduce hits on the server? Bearing in mind the dynamic queries.
Or... If memcached is the best tool for the job, and I am missing a piece of the puzzle with my early attempts, can you provide some information or guidance on using memcached with an application of this sort?
Huge thanks to all who respond.
EDIT: I should mention that the database is MySQL. The siite itself is running on Apache with an nginx proxy. But this question is related purely to speeding up and reducing the database hits, of which there are many.
I should also add that the quoted 250-350ms roundtrip time is fully remote. As in from a remote computer accessing the website. The time includes DNS lookup, Data retrieval etc.
If I understand your question correctly, you're essentially asking for a way to reduce the number of queries against the database eventhough there will be very few exactly the same queries.
You essentially have three choices:
Live with having a large amount of queries against your database, optimise the database with appropriate indexes and normalise the data as far as you can. Make sure to avoid normal performance pitfalls in your query building (lots of ORs in ON-clauses or WHERE-clauses for instance). Provide views for mashup queries, etc.
Cache the generic queries in memcached or similar, that is, without some or all filters. And apply the filters in the application layer.
Implement a search index server, like SOLR.
I would recommend you do the first though. A roundtrip time of 250~300 ms sounds a bit high even for complex queries and it sounds like you have a lot to gain by just improving what you already have at this stage.
For much higher workloads, I'd suggest solution number 3, it will help you achieve what you are trying to do while being a champ at handling lots of different queries.
Use Memcache and set the key to be the filtering query or some unique key based on the filter. Ideally you would write your application to expire the key as new data is added.
You can only make good use of caches when you occasionally run the same query.
A good way to work with memcache caches is to define a key that matches the function that calls it. For example, if the model named UserModel has a method getUser($userID), you could cache all users as USER_id. For more advanced functions (Model2::largerFunction($arg1, $arg2)) you can simply use MODEL2_arg1_arg2 - this will make it easy to avoid namespace conflicts.
For fulltext searches, use a search indexer such as Sphinx or Apache Lucene. They improve your queries a LOT (I was able to do a fulltext search on a 10 million record table on a 1.6 GHz atom processor, in less than 500 ms).

Solr vs. MySQL performance for autocomplete

In one of our applications, we need to hold some plain tabular data and we need to be able to perform user-side autocompletion on one of the columns.
The initial solution we came up with, was to couple MySQL with Solr to achieve this (MySQL to hold data and Solr to just hold the tokenized column and return ids as result). But something unpleasant happened recently (developers started storing some of the data in Solr, because the MySQL table and the operations done on it are nothing that Solr can not provide) and we thought maybe we could merge them together and eliminate one of the two.
So we had to either: (1) move all the data to Solr (2) use MySQL for autocompletion
(1) sounded terrible so I gave it a shot with (2), I started with loading that single column's data into MySQL, disabled all caches on both MySQL and Solr, wrote a tiny webapp that is able to perform very similar queries [1] on both databases, and fired up a few JMeter scenarios against both in a local and similar environment. The results show a 2.5-3.5x advantage for Solr, however, I think the results may be totally wrong and fault prone.
So, what would you suggest for:
Correctly benchmarking these two systems, I believe you need to
provide similar[to MySQL] environment to JVM.
Designing this system.
Thanks for any leads.
[1] SELECT column FROM table WHERE column LIKE 'USER-INPUT%' on MySQL and column:"USER-INPUT" on Solr.
I recently moved a website over from getting its data from the database (postgres) to getting all data from Solr. Unbelievable difference in speed. We also have autocomplete for Australian suburbs (about 15K of them) and it finds them in a couple of milliseconds, so the ajax auto-complete (we used jQuery) reacts almost instantly.
All updates are done against the original database, but our site is a mostly-read site. We used triggers to fire events when records were updated and that spawns a reindex into Solr of the record.
The other big speed improvement was pre-caching data required to render the items - ie we denormalize data and pre-calculate lots of stuff at Solr indexing time so the rendering is easy for the web guys and super fast.
Another advantage is that we can put our site into read-only mode if the database needs to be taken offline for some reason - we just fall back to Solr. At least the site doesn't go fully down.
I would recommend using Solr as much as possible, for both speed and scalability.