I am working in Rails and would like to add dynamic charts to my app. My thinking was that, given the data is in flat files, a stored procedure could be created in the MySQL db to query the data based on the parameters a user wishes to see, e.g. sum users grouped by the activity they do (y-axis) across the months of the year (x-axis). The stored procedure would then be called from Rails and return the result to be built into a chart.
However, it has been noted that stored procedures are slow and may therefore kill the user experience if there are long loading times just to render a chart.
For what I am trying to do, is this the best way to proceed?
stored procedures are slow
It should be noted that the stored procedure approach does not spend time shuttling data back and forth between your SQL server and Rails, so a stored procedure should actually execute much faster than the traditional bunch of ActiveRecord queries.
may kill the user experience if there are long loading times
Depending on what the loading times and user queries turn out to be, there are a variety of approaches to improve the user experience:
You could render a page with some placeholder text like "Generating data, please hold on for a while" and make the results of the analytical query appear a bit later using a technique like AJAX or server-sent events. Report generation could be processed in the background using Delayed Job or Sidekiq (see the sketch below). It's also possible to show something like a progress bar to let the user know that their request has not been abandoned.
You could take advantage of caching data for common queries, so that each query is performed just once in a while. It's possible to choose different caching periods (e.g. update the cached statistics once an hour or once a day) based on how fresh your chart needs to be. Take a look at the Rails caching guide.
You could also be interested in some data mining research, e.g. a multi-dimensional online analytical processing (OLAP) approach. Using OLAP cubes to represent the data would be slightly harder to follow, but could improve the user experience by opening up multidimensional data analysis.
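To make the first two ideas concrete, here is a minimal Rails sketch combining a background job with caching. The worker class, the cache key and the monthly_activity_stats stored procedure are invented for the example, so adapt the names to your own schema:

    # app/workers/chart_data_worker.rb
    # Runs the heavy aggregation once in the background and caches the rows.
    class ChartDataWorker
      include Sidekiq::Worker

      def perform(year)
        conn = ActiveRecord::Base.connection
        rows = conn.select_all(
          "CALL monthly_activity_stats(#{conn.quote(year)})"
        ).to_a
        # Cache for an hour; tune expires_in to how fresh the chart must be.
        Rails.cache.write(["chart_data", year], rows, expires_in: 1.hour)
      end
    end

    # app/controllers/charts_controller.rb
    class ChartsController < ApplicationController
      def show
        @rows = Rails.cache.read(["chart_data", params[:year]])
        return if @rows # data is ready, render the chart

        # Not ready yet: kick off generation and render a "please hold on"
        # page that polls via AJAX until the cache is populated.
        ChartDataWorker.perform_async(params[:year])
        render :generating
      end
    end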
Related
I'm developing a little marketplace in a web application and I have to implement the search function. Now, I know I can use the MATCH function in MySQL or add a library (like Apache Lucene), but that's not the point of my doubt. I'm thinking about how to manage the set of results I get from the search function (a servlet will do this), because not all of the results should be sent to the client at once, so I would like to split them across several pages. I want to know which is more efficient: should I run the search against the db for every page the client requests, or should I save the result set in a managed bean and read from it as the client requests each new page of results? Thanks (I hope my English is understandable enough).
The question you should be asking is "how many results can you store in memory?" If you have a small dataset then by all means, sure, but you will have to define what "small dataset" means. This helps because you call the database once and filter on the result in memory (which is faster).
The alternative approach, for larger/huge datasets, is to query the database on every user page request. The problem here is that you hit the database on each call, so you will need an optimised search query that brings back results in small chunks (the SQL LIMIT clause). If you only want to hit the database once and filter the result "in memory", you will have to slot a caching layer in between your application and your database. That way the results are cached and you filter on the cached result. The cache would sit in a different JVM so as not to share your memory heap space.
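As a rough sketch of the query-per-page option (written in Ruby with the mysql2 gem rather than a servlet, purely to illustrate the LIMIT/OFFSET idea; the products table, its columns and the FULLTEXT index are assumptions):

    require "mysql2"

    PAGE_SIZE = 20 # assumed page size

    # Fetch exactly one page per request instead of holding the whole
    # result set in server memory between requests.
    def search_page(client, term, page)
      offset = (page - 1) * PAGE_SIZE
      sql = "SELECT id, title, price FROM products " \
            "WHERE MATCH(title, description) AGAINST ('#{client.escape(term)}' IN BOOLEAN MODE) " \
            "ORDER BY id LIMIT #{PAGE_SIZE} OFFSET #{offset}"
      client.query(sql).to_a
    end

    # client = Mysql2::Client.new(host: "localhost", username: "app", database: "market")
    # page_two = search_page(client, "vintage lamp", 2)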
There is no silver bullet here. You can only answer this based on your non-functional requirements.
I hope this helps.
I am working on a project using ZK Framework, Hibernate, Spring and Mysql.
I need to generate some charts from the MySQL database, but after calculating the number of objects needed to compute the values for those charts, I found it to be more than 1,400 objects, with the same number of queries and transactions.
So I thought of using stored procedures in MySQL to calculate those values and save them in separate tables (using an architecture close to a data warehouse), and then having my web application just read the values from those tables and display them as charts.
I want to know, in terms of speed and performance, which of these methods is better?
And thank you
No way to tell, really, without many more details. However:
What you want to do is called Denormalisation. This is a recognised technique for speeding up reporting and making it easier. (If it doesn't do that, your denormalisation has failed!) When it works it has the following advantages:
Reports run faster
Report code is easier to write
On the other hand:
Report data is out of date, containing only data as of the time you last did the calculations
An extreme form of doing this is to take the OLTP database (a standard database) and export it into an Analysis database (aka a Cube or an OLAP database).
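As a rough sketch of what the denormalised (summary-table) route can look like, here the expensive aggregation runs once on a schedule and the charts read only the small summary table. This is shown as a small Ruby script driving plain SQL; the table and column names are invented for illustration:

    require "mysql2"

    # Refresh a denormalised reporting table once a night (cron, Quartz, etc.).
    # The 1,400+ object reads collapse into a single INSERT ... SELECT.
    def refresh_activity_summary(client)
      client.query("TRUNCATE TABLE activity_summary")
      client.query(<<-SQL)
        INSERT INTO activity_summary (activity, month, user_count)
        SELECT a.name,
               DATE_FORMAT(e.performed_at, '%Y-%m'),
               COUNT(DISTINCT e.user_id)
        FROM events e
        JOIN activities a ON a.id = e.activity_id
        GROUP BY a.name, DATE_FORMAT(e.performed_at, '%Y-%m')
      SQL
    end

    # The chart query is then trivial:
    #   SELECT activity, month, user_count FROM activity_summary WHERE month = '2012-05'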
One of the problems of Denormalisation is that a) it is usually a significant effort, b) it adds extra code which adds complexity and thus increases support costs, and c) it might not make enough (or any) difference. Because of this, it is usual not to do it until you know you have a problem. This will happen when you have done your reports on the basic database and have found that they either are too difficult to write and/or run too slowly. I would strongly suggest that only when you reach that point do you go for Denormalisation.
There can be times when you don't need to wait for that point, but I've only seen one such example in over 25 years of development; and that decision was helped along by Management's desire to use an OLAP database for political purposes.
I am about 70% of the way through developing a web application which contains what is essentially a largish data table of around 50,000 rows.
The app itself is a filtering app providing various different ways of filtering this table, such as range filtering by number, drag-and-drop filtering that ultimately performs regexp filtering, live text searching, and I could go on and on.
Due to this I coded my MySQL queries in a modular fashion so that the actual query is put together dynamically depending on the type of filtering being performed.
At the moment each filtering action (in total) takes between 250-350ms on average. For example:
The user grabs one end of a visual slider and drags it inwards; when he/she lets go, a range filtering query is dynamically put together by my PHP code and the results are returned as a JSON response. The total time from the user letting go of the slider until the user has received all the data and the table is redrawn is between 250-350ms on average.
I am concerned about scalability further down the line, as users can be expected to perform a huge number of these filtering actions in a short space of time in order to retrieve the data they are looking for.
I have toyed with trying to do some fancy cache-expiry work with memcached but couldn't get it to play ball correctly with my dynamically generated queries. Although everything would cache correctly, I was having trouble expiring the cache when the query changed and keeping the data relevant. I am, however, extremely inexperienced with memcached, and my first few attempts have led me to believe that memcached isn't the right tool for this job (due to the highly dynamic nature of the queries), although this app could ultimately see very high concurrent usage.
So... My question really is, are there any caching mechanisms/layers that I can add to this sort of application that would reduce hits on the server? Bearing in mind the dynamic queries.
Or... If memcached is the best tool for the job, and I am missing a piece of the puzzle with my early attempts, can you provide some information or guidance on using memcached with an application of this sort?
Huge thanks to all who respond.
EDIT: I should mention that the database is MySQL. The site itself is running on Apache with an nginx proxy. But this question is related purely to speeding up and reducing the database hits, of which there are many.
I should also add that the quoted 250-350ms round-trip time is fully remote, as in from a remote computer accessing the website. The time includes DNS lookup, data retrieval, etc.
If I understand your question correctly, you're essentially asking for a way to reduce the number of queries against the database even though there will be very few queries that are exactly the same.
You essentially have three choices:
Live with having a large amount of queries against your database, optimise the database with appropriate indexes and normalise the data as far as you can. Make sure to avoid normal performance pitfalls in your query building (lots of ORs in ON-clauses or WHERE-clauses for instance). Provide views for mashup queries, etc.
Cache the generic queries (that is, without some or all of the filters) in memcached or similar, and apply the filters in the application layer (see the sketch after this list).
Implement a search index server, like SOLR.
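A minimal sketch of option 2, caching one broad result set and filtering it in the application layer (shown in Ruby with the Dalli memcached client; the items table, key name and TTL are assumptions):

    require "dalli"

    CACHE = Dalli::Client.new("localhost:11211")

    # One broad, unfiltered query is cached; every filter combination then
    # works against the same cached entry instead of hitting MySQL again.
    def base_rows(db)
      CACHE.fetch("datatable:all:v1", 300) do # 5-minute TTL
        db.query("SELECT id, name, price, category FROM items").to_a
      end
    end

    # Cheap filters (range, regexp, text) are applied in the app layer.
    # Note: ~50,000 rows may exceed memcached's default 1 MB item limit,
    # so you may need to raise it or cache the data in chunks.
    def filtered_rows(db, min_price:, max_price:, pattern: nil)
      rows = base_rows(db).select { |r| r["price"].between?(min_price, max_price) }
      rows = rows.select { |r| r["name"] =~ Regexp.new(pattern) } if pattern
      rows
    end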
I would recommend you do the first, though. A round-trip time of 250-300 ms sounds a bit high even for complex queries, and it sounds like you have a lot to gain by just improving what you already have at this stage.
For much higher workloads, I'd suggest solution number 3, it will help you achieve what you are trying to do while being a champ at handling lots of different queries.
Use Memcache and set the key to be the filtering query or some unique key based on the filter. Ideally you would write your application to expire the key as new data is added.
You can only make good use of caches when you occasionally run the same query.
A good way to work with memcache caches is to define a key that matches the function that calls it. For example, if the model named UserModel has a method getUser($userID), you could cache all users as USER_id. For more advanced functions (Model2::largerFunction($arg1, $arg2)) you can simply use MODEL2_arg1_arg2 - this will make it easy to avoid namespace conflicts.
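For instance, a tiny helper along those lines (the model and method names are placeholders):

    # Builds keys like "USERMODEL_getUser_42" or "MODEL2_largerFunction_a_b",
    # mirroring the model/method/arguments that produced the cached value.
    def cache_key_for(model, method, *args)
      ([model.to_s.upcase, method] + args).join("_")
    end

    # value = memcache.get(cache_key_for("UserModel", "getUser", 42))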
For fulltext searches, use a search indexer such as Sphinx or Apache Lucene. They improve your queries a LOT (I was able to do a fulltext search on a 10-million-record table on a 1.6 GHz Atom processor in less than 500 ms).
I have a table with a few relational columns and one XML column which sometimes holds a fairly large chunk of data. I also have a simple webservice which uses the database. I need to be able to report on things like all the instances of a certain element within the XML column, a list of all the distinct values for a certain element, things like that.
I was able to get a list of all the distinct values for an element, but didn't get much further than that. I ended up writing incredibly complex T-SQL code to do something that seems pretty simple in C#: go through all the rows in this table, and apply this ( XPath | XQuery | XSLT ) to the XML column. I can filter on the relational columns to reduce the amount of data, but this is still a lot of data for some of the queries.
My plan was to embed an assembly in SQL Server (I'm using 2008 SP2) and have it create an indexed view on the fly for a given query (I'd have other logic to clean this view up). This would allow me to keep the network traffic down, and possibly also allow me to use tools like Excel and MSRS reports as a cheap user interface, but I'm seeing a lot of people saying "just use application logic rather than SQL assemblies". (I could be barking entirely up the wrong tree here, I guess).
Grabbing the big chunk of data to the web service and doing the processing there would have benefits as well - I'm less constrained by the SQL Server environment (since I don't live inside it) and my setup process is easier. But it does mean I'm bringing a lot of data over the network, storing it in memory while I process it, then throwing some of it away.
Any advice here would be appreciated.
Thanks
Edit:
Thanks guys, you've all been a big help. The issue was that we were generating a row in the table for each file, each file could have multiple results, and we were doing this each time we ran a particular build job. I wanted to flatten this out into a table view.
Each execution of this build job checked thousands of files for several attributes, and in some cases each of these tests was generating thousands of results (MSIVAL tests were the worst culprit).
The answer (duh!) is to flatten it out before it goes into the database! Based on your feedback, I decided to try creating a row for each result for each test on each file, with the XML holding only the details of that one result - this made the query much simpler. Of course, we now have hundreds of thousands of rows each time we run this tool, but the performance is much better. I now have a view which creates a flattened version of one of the classes of results emitted by the build job - this returns >200,000 rows and takes <5 seconds, compared to around 3 minutes for the equivalent (complicated) query before I went the flatter route, and between 10 and 30 minutes for the XML file processing of the old (non-database) version.
I now have some issues with the number of times I connect, but I have an idea of how to fix that.
Thanks again! +1's all round
I suggest using the standard XML tools in T-SQL (http://msdn.microsoft.com/en-us/library/ms189075.aspx). If you don't wish to use these, I would recommend processing the XML on another machine.
SQLCLR is perfect for smaller functions, but with the restrictions on the usable methods it tends to become an exercise in frustration once you are trying to do more advanced things.
What you're asking about is really a huge balancing act and it totally depends on several factors. First, what's the current load on your database? If you're running this on a database that is already under heavy load, you're probably going to want to do this parsing on the web service. XML shredding and querying is an incredibly expensive procedure in SQL Server, especially if you're doing it on un-indexed columns that don't have a schema defined for them. Schemas and indexes help with this processing overhead, but they can't eliminate the fact that XML parsing isn't cheap. Secondly, the amount of data you're working with. It's entirely possible that you just have too much data to push over the network. Depending on the location of your servers and the amount of data, you could face insurmountable problems here.
Finally, what are the relative specs of your machines? If your web service machine has low memory, it's going to be thrashing data in and out of virtual memory trying to parse the XML which will destroy your performance. Maybe you're not running the most powerful database hardware and shredding XML is going to be performance prohibitive for the CPU you've got on your database machine.
At the end of the day, the only way to really know is to try both ways and figure out what makes sense for you. Doing the development on your web service machine will almost undoubtedly be easier, as LINQ to XML is a more elegant way of parsing through XML than XQuery shoehorned into T-SQL. My inclination, given the information you provided in your question, is that T-SQL is going to perform better for you in the long run, because you're doing XML parsing on every row, or at least most rows, in the database for reporting purposes. Pushing that kind of information over the network is just ugly. That said, if performance isn't that important, there's something to be said for taking the easier and more maintainable route of doing all the parsing on the application server.
As I said in a previous post, our Rails app has to interface with an E-A-V type of table in a third-party application that we're pulling data from. I had created a View to make the data normal but it is taking way too long to run. We had one of our offshore PHP developers create a stored procedure to help speed it up.
Now we run into the issue that we need to call this stored procedure from the Rails app, as well as provide searching and filtering. The view could do this because Rails was treating it as a traditional Rails model. How could I do this with the stored proc? Would we need to write custom searching and ordering (we were using Searchlogic)? Management is incapable of understanding the drawbacks of using a stored proc from Rails; all they say is that the current method is taking too long to load the data and needs to be fixed, but searching and filtering are critical functions.
EDIT I posted the code for this query here: Optimizing a strange MySQL Query. What is funny is that when I run this query in a GUI (Navicat) it runs in about 5 seconds, but on the web page it takes over a minute to run; the view is complicated for reasons I outline in the original post but I would think that MySQL optimizes and caches views like SQL Server does (or rather, how I read that SQL Server does) to improve performance.
You can call stored procedures from Rails, but you are going to lose most of the benefits of ActiveRecord, as the standard generated SQL will not work. You can use the native database connection and call it, but it's going to be a leaky abstraction. You may want to consider DataMapper.
Looking back at your last question, I would get the DBA to create a trigger to build a more relational structure from the data. The trigger would insert the EAV data into a table, which is the only way I know of to do materialized views in MySQL. This way you only pay a small incremental cost in the background on insert, and the application can run normally.
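A hedged sketch of that trigger-maintained table, wrapped in a Rails migration; the eav_values source table, its columns and the flat_attributes target are invented, so map them onto the real E-A-V schema:

    class CreateFlatAttributes < ActiveRecord::Migration
      def self.up
        # Flattened copy of the E-A-V rows that ActiveRecord/Searchlogic can
        # treat as an ordinary table.
        create_table :flat_attributes do |t|
          t.integer :entity_id
          t.string  :name
          t.string  :value
        end
        add_index :flat_attributes, [:entity_id, :name]

        # MySQL has no materialized views, so a trigger keeps the copy in sync
        # on insert (you would want UPDATE/DELETE triggers as well).
        execute <<-SQL
          CREATE TRIGGER eav_values_after_insert AFTER INSERT ON eav_values
          FOR EACH ROW
            INSERT INTO flat_attributes (entity_id, name, value)
            VALUES (NEW.entity_id, NEW.attribute_name, NEW.value)
        SQL
      end

      def self.down
        execute "DROP TRIGGER IF EXISTS eav_values_after_insert"
        drop_table :flat_attributes
      end
    end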
Anyway...
ActiveRecord::Base.connection.execute("CALL SP_name(#{ActiveRecord::Base.connection.quote(param1)}, #{ActiveRecord::Base.connection.quote(param2)}, ... )")
But there's an open ticket out there on Lighthouse indicating this approach may not work without changing some of the parameters used for the connection.
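Assuming you do get the connection configured, here is a hedged sketch of reading the procedure's rows back into something you can filter and chart; the final clean-up call only applies to the mysql2 adapter and may require the multi-statements/multi-results connection flag to be enabled:

    conn = ActiveRecord::Base.connection

    # select_all behaves like execute but hands back rows you can filter and
    # sort in Ruby before building the response.
    rows = conn.select_all(
      "CALL SP_name(#{conn.quote(param1)}, #{conn.quote(param2)})"
    ).to_a

    # A CALL also returns a trailing status packet; discard it before the next
    # query or the connection can end up "out of sync".
    conn.raw_connection.abandon_results! if conn.raw_connection.respond_to?(:abandon_results!)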