Find function in j2ee web application - mysql

I'm developing a small marketplace web application and I have to implement the search function. I know I can use MySQL's MATCH function or add a library such as Apache Lucene, but that's not the point of my question. I'm thinking about how to manage the set of results I get from the search (a servlet will handle this), because not all the results should be sent to the client at once, so I would like to split them into pages. Which is more efficient: running the search in the database for every page the client requests, or saving the result set in a managed bean and reading from it whenever the client requests a new page of results? Thanks (I hope my English is understandable enough).

The question you should be asking is "how many results can I store in memory?" If you have a small dataset, then by all means keep it in memory, but you will have to define what "small dataset" means. This approach helps because you call the database once and then filter the results in memory, which is faster.
The alternative approach, for large or huge datasets, is to query the database on every page request. The drawback is that you hit the database on each call, so you will need an optimised search query that returns results in small chunks (SQL LIMIT clause). If you only want to hit the database once and filter the results in memory, you will have to slot a caching layer in between your application and your database. That way the results are cached and you filter on the cached results. The cache will sit on a different JVM so that it does not share your application's heap space.
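To make the "small chunks" idea concrete, here is a minimal sketch of LIMIT/OFFSET paging. It is written in Python with the standard sqlite3 module as a stand-in for MySQL, purely so it runs anywhere; the products table, the fetch_page helper and the page size are assumptions made for illustration, and a servlet would issue the same kind of query through JDBC.

```python
import sqlite3

PAGE_SIZE = 10  # hypothetical page size


def fetch_page(conn, term, page, page_size=PAGE_SIZE):
    """Return one page of matching product names (pages are numbered from 1)."""
    offset = (page - 1) * page_size
    cur = conn.execute(
        "SELECT name FROM products WHERE name LIKE ? "
        "ORDER BY name LIMIT ? OFFSET ?",
        (f"%{term}%", page_size, offset),
    )
    return [row[0] for row in cur.fetchall()]


# Demo data: an in-memory database holding 25 matching rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany(
    "INSERT INTO products (name) VALUES (?)",
    [(f"widget {i:02d}",) for i in range(25)],
)

first_page = fetch_page(conn, "widget", 1)  # rows 0-9
last_page = fetch_page(conn, "widget", 3)   # rows 20-24, a short final page
```

The ORDER BY matters: without a stable ordering, consecutive pages can overlap or skip rows.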
There is no silver bullet here. You can only answer this based on your non-functional requirements.
I hope this helps.

Related

Handling Very Large JSON Dataset?

I have a scenario wherein a frontend app calls a backend DB (APP -> API Gateway -> SpringBoot -> DB) with a JSON request. The backend returns a very large dataset (>50000 rows), with the response being ~10 MB in size.
My frontend app is highly responsive and mission critical, and we are seeing performance issues where the frontend app stops responding or times out. What would be the best design to resolve this issue, considering the following?
The DB query can't be normalized any further.
The SpringBoot code has a cache built in.
No data can be left out, due to its intrinsic nature.
Multiple calls cannot be made, as the data is needed in the first call itself.
Can any cache be built in-between frontend and backend?
Thanks.
Sounds like this is a generated report from a search. If these rows need to stay associated with each other, I'd assign the search an id and store the results on the server. Then pull the data for this id as needed on the frontend. You should never have to send 50,000 rows to the client in one go... Paginate the data and pull as needed if you have to. If you don't want to paginate, how much data can they display on a single screen? You can pull more data from the server based on where they scroll on the page. You should only need to return the count of the rows to the frontend, and maybe 100 rows of data. This would allow you to show a scrollbar with the right height. When they scroll to a certain position within the data, you can pull the corresponding offset from the server for that particular search id. Even if you could return all 50,000+ rows in one go, it doesn't sound very friendly to the end user's device to have to load that much data into memory for a functional page.
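The search-id idea can be sketched roughly as follows. This is Python for brevity, and everything in it (the in-memory dictionary, run_search, fetch_window, the window size of 100) is a hypothetical illustration rather than any framework's API; a real service would keep the results on disk or in a cache and would expire old searches.

```python
import uuid

# Hypothetical server-side store; in practice this could be disk or a cache.
_results_by_search_id = {}


def run_search(rows):
    """Keep the full result set on the server; return only an id and the row count."""
    stored = list(rows)
    search_id = uuid.uuid4().hex
    _results_by_search_id[search_id] = stored
    return {"search_id": search_id, "total": len(stored)}


def fetch_window(search_id, offset, limit=100):
    """Return the slice of stored rows that matches the client's scroll position."""
    rows = _results_by_search_id[search_id]
    return rows[offset:offset + limit]


# Demo: 50,000 rows stay on the server, the client pulls 100 at a time.
summary = run_search({"row": i} for i in range(50_000))
window = fetch_window(summary["search_id"], offset=20_000)
```

The total count is enough for the frontend to size its scrollbar, and each scroll position maps to one fetch_window call.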
This is a sign of a flawed frontend that should be redone.
10 MB is huge and can be inconsiderate to your users, especially if there's a high probability of mobile use.
If possible, it would be best to collect this data on the backend, probably put it onto disk, and then provide only the necessary data to the frontend as it's needed. As the map needs more data, you would make further calls to the backend.
If this isn't possible, you could load this data with the client-side bundle. If the data doesn't update too frequently, you can even cache it on the frontend. This would at least prevent the user from needing to fetch it repeatedly.

Storing elasticsearch query result in Django session

I am currently in a development team that has implemented a search app using Flask-WhooshAlchemy. Admittedly, we did not think this completely through.
The greatest problem we face is being unable to store query results into a Flask session without serializing the data set first. The '__QueryObject' being returned via Whoosh can be JSON serialized using Marshmallow. We have gone through this route and, yes, we are able to store and manipulate the retrieved data, but at a tradeoff: initial searches will take a very long time (at least 30 seconds for larger result sets, due to serialization). For the time being, we are currently stuck with having to re-query anytime there are changes to the data set (changes that shouldn't require a fresh search, such as switching between result views and changing the number of results per page). Adding insult to injury, whoosh is probably not scalable for our purposes; Elasticsearch seems a better contender.
In short:
How can we store elasticsearch query results in a Django session so that we may be able to manipulate these results?
Any other guidance will be greatly appreciated.
If anyone cares, we finally got everything up and running and yes, it is possible to store elasticsearch query results in a Django session.

UI Autocomplete : Make multiple ajax requests or load all data at once for a list of locations in a city?

I have a text box in my application which allows a user to select a location with the help of UI autocomplete. There are around 10,000 valid locations out of which the user must select one. There are two implementations for the autocomplete functionality:
Fetch the list of locations when the page loads for the first time and iterate over the array in JavaScript to find matching items on every keystroke.
Make Ajax requests on every keystroke, as searching in MySQL (the DB being used) may be much faster.
Performance-wise, which one is better?
An initial test shows that loading the data at once is the better approach from a performance point of view. However, this test was done on an MBP, where JavaScript processing is quite fast. I'm not sure whether this technique is still the better one on machines with low processing power, such as lower-end Android phones, old systems etc.
Your question revolves around which is quicker, processing over 10,000 rows in the browser, or sending a request to a remote server to return the smaller result set. An interesting problem that depends on context and environment at runtime. Sending to the remote server incurs network delay mostly, with small amounts of server overhead.
So you have two variables in the performance equation, processing speed of the client and network latency. There is also a third variable, volume of data, but this is constant 10k in your question.
If both client browser and network are fast, use whatever you prefer.
If the network is faster, use the remote server approach, although be careful not to overload the server with thousands of little requests.
If the client is faster, probably use the local approach. (see below)
If both are slow, then you probably need to choose either one, or spend lots of time and effort optimizing this.
Both being slow can easily happen: my phone browser on 3G falls into this category. Network latency for a random Ajax request is around 200 ms, and it performs poorly for some JavaScript too.
As user-perceived performance is all that really matters, you could preload the first N values for each letter as variables in the initial page load, then use these for the first keystroke results. This buys you a few ms.
If you go with the server approach, you can always send the requested result AND a few values for each possible next keystroke. This overlaps with what users see and makes it appear snappier on slow networks. E.g.
Client --> request 'ch'
Server responds with a few results for each potential next letter
'cha' = ...
'chb' = ...
Etc
This of course requires some specialized javascript to alternate between Ajax requests and using cached results from previous requests to prefill the selection.
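A rough sketch of the server side of that look-ahead response, in Python. The location list, the respond function and the lookahead count are made up for illustration; a real server would run the equivalent query against the database.

```python
from collections import defaultdict

# Hypothetical data set standing in for the 10k locations.
LOCATIONS = sorted(
    ["chandler", "charlotte", "chatham", "chester", "chicago", "chico", "dallas"]
)


def matches(prefix, limit):
    """Names starting with the prefix, capped at `limit`."""
    return [name for name in LOCATIONS if name.startswith(prefix)][:limit]


def respond(prefix, limit=10, lookahead=2):
    """Return matches for the prefix plus a few matches per possible next letter."""
    next_letters = defaultdict(list)
    for name in LOCATIONS:
        if name.startswith(prefix) and len(name) > len(prefix):
            bucket = next_letters[prefix + name[len(prefix)]]
            if len(bucket) < lookahead:
                bucket.append(name)
    return {"results": matches(prefix, limit), "next": dict(next_letters)}


response = respond("ch")
# response["next"] holds up to two names each for "cha", "che" and "chi",
# which the client can show immediately while the next request is in flight.
```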
If you are going with the local client searching through all 10k records, then make sure the server returns the records in sorted order. If your autocomplete scanning is able to use 'starting with' selection rather than 'contains' (e.g. typing RO will match Rotorua but not Paeroa), then you can greatly reduce processing time by using binary search techniques (http://en.wikipedia.org/wiki/Binary_search_algorithm), and I'm sure there are lots of SO answers in this area.
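For what it's worth, the 'starting with' lookup on a sorted list can be sketched like this. It is shown in Python with the standard bisect module; the same two binary searches can be written by hand in JavaScript. The sample names and the prefix_matches helper are assumptions for illustration, and names are lowercased so that typing RO matches Rotorua.

```python
from bisect import bisect_left

# Sorted once, ideally by the server; lowercased for case-insensitive matching.
NAMES = sorted(
    name.lower() for name in ["Paeroa", "Rotorua", "Rolleston", "Auckland", "Ross"]
)


def prefix_matches(sorted_names, prefix):
    """Return every name starting with the prefix using two binary searches."""
    prefix = prefix.lower()
    lo = bisect_left(sorted_names, prefix)
    # Appending the highest code point gives an upper bound for the prefix range.
    hi = bisect_left(sorted_names, prefix + "\U0010ffff")
    return sorted_names[lo:hi]


found = prefix_matches(NAMES, "RO")
```

Each keystroke then costs two O(log n) searches plus a slice, instead of a scan over all 10k entries.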
If there is no advantage for querying the backend every time, don't do it.
What could be an advantage of querying the backend all the time? If the amount of data returned by the initial call is too heavy (bandwidth, JavaScript processing time to prepare it, total time), a partial request on every keystroke could be the smarter option.

What would be the best DB cache to use for this application?

I am about 70% of the way through developing a web application which contains what is essentially a largeish datatable of around 50,000 rows.
The app itself is a filtering app providing various ways of filtering this table, such as range filtering by number, drag-and-drop filtering that ultimately performs regexp filtering, live text searching, and I could go on and on.
Because of this I coded my MySQL queries in a modular fashion, so that the actual query is put together dynamically depending on the type of filtering happening.
At the moment each filtering action (in total) takes between 250-350ms on average. For example:
The user grabs one end of a visual slider and drags it inwards; when he/she lets go, a range filtering query is dynamically put together by my PHP code and the results are returned as a JSON response. The total time from the user letting go of the slider until the user has received all the data and the table is redrawn is between 250-350ms on average.
I am concerned about scalability further down the line, as users can be expected to perform a huge number of filtering actions in a short space of time in order to retrieve the data they are looking for.
I have toyed with trying to do some fancy cache expiry work with memcached but couldn't get it to play ball correctly with my dynamically generated queries. Although everything would cache correctly, I was having trouble expiring the cache when the query changes and keeping the data relevant. I am, however, extremely inexperienced with memcached. My first few attempts have led me to believe that memcached isn't the right tool for this job (due to the highly dynamic nature of the queries), although this app could ultimately see very high concurrent usage.
So... My question really is, are there any caching mechanisms/layers that I can add to this sort of application that would reduce hits on the server? Bearing in mind the dynamic queries.
Or... If memcached is the best tool for the job, and I am missing a piece of the puzzle with my early attempts, can you provide some information or guidance on using memcached with an application of this sort?
Huge thanks to all who respond.
EDIT: I should mention that the database is MySQL. The site itself is running on Apache with an nginx proxy. But this question is purely about speeding up and reducing the database hits, of which there are many.
I should also add that the quoted 250-350ms roundtrip time is fully remote. As in from a remote computer accessing the website. The time includes DNS lookup, Data retrieval etc.
If I understand your question correctly, you're essentially asking for a way to reduce the number of queries against the database even though very few of the queries will be exactly the same.
You essentially have three choices:
Live with having a large number of queries against your database, optimise the database with appropriate indexes and normalise the data as far as you can. Make sure to avoid the usual performance pitfalls in your query building (lots of ORs in ON-clauses or WHERE-clauses, for instance). Provide views for mashup queries, etc.
Cache the generic queries in memcached or similar, that is, without some or all of the filters, and apply the filters in the application layer.
Implement a search index server, like SOLR.
I would recommend you do the first though. A roundtrip time of 250~300 ms sounds a bit high even for complex queries and it sounds like you have a lot to gain by just improving what you already have at this stage.
For much higher workloads, I'd suggest solution number 3, it will help you achieve what you are trying to do while being a champ at handling lots of different queries.
Use Memcache and set the key to be the filtering query or some unique key based on the filter. Ideally you would write your application to expire the key as new data is added.
You can only make good use of caches when you occasionally run the same query.
A good way to work with memcache caches is to define a key that matches the function that calls it. For example, if the model named UserModel has a method getUser($userID), you could cache all users as USER_id. For more advanced functions (Model2::largerFunction($arg1, $arg2)) you can simply use MODEL2_arg1_arg2 - this will make it easy to avoid namespace conflicts.
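The naming scheme above could be sketched as a small decorator. This is Python rather than PHP, the dict again stands in for a memcache client, and cached and get_user are hypothetical names. Be aware that joining arguments with "_" can collide if the arguments themselves contain underscores.

```python
import functools

cache = {}  # stand-in for a memcache client (assumption)


def cached(prefix):
    """Cache a function's results under keys like PREFIX_arg1_arg2."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            key = "_".join([prefix, *map(str, args)])
            if key not in cache:
                cache[key] = func(*args)
            return cache[key]
        return wrapper
    return decorator


@cached("USER")
def get_user(user_id):
    # Hypothetical placeholder for the real database lookup.
    return {"id": user_id, "name": f"user{user_id}"}


user = get_user(42)  # stored under the key "USER_42"
```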
For fulltext searches, use a search indexer such as Sphinx or Apache Lucene. They improve your queries a LOT (I was able to do a fulltext search on a 10 million record table on a 1.6 GHz atom processor, in less than 500 ms).

Searching text in database: caching database records into domain logic V/S using MySQL full-text search?

I am developing a layered web app. In brief it has:
UI: html, javascript and jquery
Domain logic: Java and servlets
Business logic: MySQL
I have large amounts of records in the database containing info about books. In addition the application will be used by a lot of users at the same time.
I want to enable users to enter a book's "name" in a search text field, say "book1", and display a drop-down list using jQuery autocomplete.
The records in the database are not updatable since they will never change.
Considering solid design patterns, which is better (performance- and speed-wise):
Preloading these database records into a cache object in the domain logic and letting the users search (query) them from this object? Or querying the database directly using something like MySQL full-text search?
If using MySQL full-text search, I am concerned about having lots of calls to the database by many users at the same time.
As for preloading into a cache object, I am not sure if this is generally a good software practice. Does anyone recommend it? Should I put a timer on how long records remain cached in memory?
Which of these 2 methods is preferable? Are there other, better methods for such scenarios?
I found a solution and hope this answer will help the ones who are dealing with a similar situation:
I will be using a software engineering design pattern, which I used frequently in the past, called Identity Map.
Only this time, since the records are not updatable (i.e. not changeable), I will be using only the caching functionality of the identity map. So I will be loading the records from the database into this identity map object once the server starts. This way users will query them directly from the domain layer, which is faster and means fewer calls to the database.
One issue to consider is what happens when new records are added by the administration. For this situation I will be using another design pattern called "observer pattern" (you can learn about it in this book).
UPDATE:
In case you are dealing with a similar situation, a good idea is to use MySQL indexing. I indexed the "book's name" column to enable faster loading into the cache object, because in my case the cache object contains only book names: when using the search field in the UI, the user is only concerned with a book's name. PS: other book details are loaded only when the user clicks on a book's name in the drop-down list, and at that point you might want to use another identity map holding the details. The reasoning behind this architecture is that, logically speaking, you will never have all the books (with their details) in your database searched (loaded) by all the users of your application at the same time. So, to minimize server memory and bandwidth usage, you accept the cost of loading the whole column of book names into memory for faster searching of all available book names, BUT their details (i.e. more space in memory) are loaded only when a user needs them and are kept in another identity map, to be reused by any other user looking up the same book. In my opinion this minimizes memory usage on the server and reduces calls to the database for details that have already been fetched by other users.
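As a rough illustration of that two-level arrangement (shown in Python for brevity, though the app described here is Java): names are loaded once at startup, and details go into a second identity map on first request. BookCatalog and the two loader functions are hypothetical, the loaders stand in for the real database queries, and thread safety is not handled.

```python
class BookCatalog:
    """Names are preloaded once; details are loaded lazily and then shared."""

    def __init__(self, load_names, load_details):
        self._names = sorted(load_names())  # one query at server startup
        self._load_details = load_details
        self._details = {}                  # identity map: name -> details

    def suggest(self, prefix, limit=10):
        """Autocomplete against the in-memory names, no database call."""
        prefix = prefix.lower()
        return [n for n in self._names if n.lower().startswith(prefix)][:limit]

    def details(self, name):
        """Hit the database only the first time a given book is requested."""
        if name not in self._details:
            self._details[name] = self._load_details(name)
        return self._details[name]


detail_loads = []  # records which books actually triggered a details query


def load_names():
    return ["Dune", "Emma", "Dracula"]  # made-up sample data


def load_details(name):
    detail_loads.append(name)
    return {"name": name, "pages": 100}  # made-up details


catalog = BookCatalog(load_names, load_details)
suggestions = catalog.suggest("d")
first = catalog.details("Dune")
second = catalog.details("Dune")  # served from the identity map
```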