I am developing a fairly big enterprise-level data analysis application based on Flex 4. I usually need to filter datagrids based on the user's selection, which currently requires running a query against my database. I am wondering if there is any way to filter grid data without an SQL query? That would take very little time, whereas the query is causing a 2-3 minute delay now.
If you are using ArrayCollection (or another implementation of ICollectionView), take a look at the ICollectionView.filterFunction property. You can set it to what you need after user interaction and call ICollectionView.refresh() - all associated grids should then automatically show the filtered data.
There are many ways to do this in ActionScript. However, since you use Flex, let's rely on the framework. The feature you are looking for is filterFunction (see the docs):
Given a data object such as {name:"Jo", type:"employee"}, you can filter employees with:
myArrayCollection.filterFunction = function(data:Object):Boolean {
    return data.type == "employee";
};
myArrayCollection.refresh();
Your data grid should then be updated accordingly.
Of course, depending on the number of items being present in your list, this might run in a blink of an eye or be horribly slow =)
I have a general question: how would one read from and write to the same dataset, e.g. to implement something like a caching mechanism? Naively, this would create a cycle in the dependency graph and hence would not be allowed?
What I want to do is something like:
if key not in cache:
    value = some_long_running_computation(key)
    cache[key] = value
return cache[key]
or an equivalent logic with PySpark dataframes.
I thought about incremental transforms, but they do not really fit this case since they do not allow checking whether a key exists in the cache, so you would always run your code under the brittle assumption that the cache is "complete" after the incremental transform.
Any ideas?
Thanks
Tobias
There is the ability to access the previous view of an output dataset from within the transform. In Python, when using a dataframe-based transform decorator, it is done like so:
previous_output = foundry_output.dataframe("previous")
(you can also provide a schema.)
and in java like so:
foundryOutput.getExistingOutputDataFrame().get()
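To make the Python side concrete, here is a rough sketch (not a drop-in implementation) of a cache-style transform that only computes values for keys missing from the previous output. The dataset paths, the key/value column names, the placeholder computation, and the transforms.api import are assumptions; depending on your setup, reading the "previous" view may also require the incremental decorator:

from pyspark.sql import functions as F, types as T
from transforms.api import transform, Input, Output   # assumed standard transforms API

@transform(
    out=Output("/project/key_value_cache"),      # hypothetical output path
    keys=Input("/project/keys_to_compute"),      # hypothetical input path
)
def compute(out, keys):
    # Schema of the cache; passing it lets the very first run (when no
    # previous output exists yet) come back as an empty dataframe.
    schema = T.StructType([
        T.StructField("key", T.StringType()),
        T.StructField("value", T.StringType()),
    ])
    previous = out.dataframe("previous", schema)

    # Keys requested by the input that are not cached yet.
    missing = keys.dataframe().select("key").join(previous, on="key", how="left_anti")

    # Stand-in for some_long_running_computation(key).
    computed = missing.withColumn("value", F.upper(F.col("key")))

    # Write the old cache plus the newly computed rows back out.
    out.write_dataframe(previous.unionByName(computed))

Each run rewrites the full output (old rows carried forward plus the newly computed ones), which keeps the logic simple at the cost of a full write.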
However, I would encourage using this only when it's absolutely essential. There's a huge benefit to keeping your pipelines fully "repeatable"/"stateless" so that you can snapshot and recompute them at any time and still get the same results.
Once you introduce statefulness into your pipeline, certain things like adding a column to your output dataset become much harder, since you will have to write something akin to a database migration.
That being said, it's fine to use when you really need it, and ideally keep the impact of the added complexity small.
I'm not so sure about the correct approach when using MySQL. When I started on a big website, I used to load articles with all their info using one function, load_articles. I loaded all articles that were supposed to be displayed in some way.
However, often only some of the article info was used, for example only the title, or only the image icon... So I created an object-oriented model where Article uses MySQL to fetch the properties when needed, using __get and ArrayAccess. This results in a higher number of queries in general, but reduces the amount of data fetched from MySQL.
Of course, the ideal approach would be to buffer the "data needed" and then send one query. But if this is too complicated for me, where should I aim?
Bulk fetch all data that may be needed and discard the unnecessary data - reducing the number of queries
Lazy-load the individual properties as they're needed when generating the page - fetching little data with many queries
If the second is better, should I go as far as avoiding SELECT * and instead have multiple selects for individual properties as they are needed?
First of all, the answer totally depends on how your webpage is loaded, what the user requirements are, and what your SLAs are.
Suppose your page has 5 elements on it; then your two solutions will behave as described below.
Fetch bulk data, store it locally and load from it
This is a good approach when your user needs to see all the data at once or something very computational is required on the user's end. Even in this case, fetch only the required attributes; never use SELECT *, which is always the worst choice.
Check your network bandwidth while transferring data, and if possible use a CDN if you have many images or a lot of static data.
Fetch only the base data first, then fetch more data according to user requirements
This is a good approach when your user generally wants to see only the first section of the webpage, or will be happy just to see at least that first section on screen within a second.
You can then gradually load/fetch more data as the user scrolls down and performs some activity.
This way you save memory on the app server and the CPU cycles spent processing bulk data. This approach also keeps the user engaged by showing something very quickly and then continuing to load.
That was all about page-loading SLAs. Both options suit different conditions (nowadays the second is more commonly preferred).
Coming to slow SQL queries, you need to normalize the database and use proper indexes wherever required. Use optimal SQL queries to ensure only the required data is fetched, and efficiently.
If you have something which cannot be normalized further and is getting complex, then you can look at NoSQL options.
Applying these techniques efficiently will help you achieve your desired performance.
I hope I have cleared up your confusion a bit.
I recently asked a question about many-to-many relationships and how they can be used to calculate intersections, which was answered well. Now, there is another nice-to-have requirement for our cube: to extend that to more data. The general question remains: how many orders contain both product x and y?
However, the measure groups are now much larger, currently about 1.4 billion rows. I tried to implement this using the method described in the other post, with several hidden cross-referenced measure groups. However, this is simply too much for our hardware; the cube is reaching sizes close to 0.5 TB, and queries take several minutes to complete.
Now I would try another option: can I access our relational database in a calculated measure? It seems I can, using UDFs as described in this article. I could write a function in C# that queries our relational database and returns all the orders that contain the products chosen by the user. But in order to do that, I need to supply all the dimensional data the user has selected to the UDF. I also need the UDF to return the calculated value so it can be output as the result of the calculated member. Is that possible? If yes, how? The example Microsoft provides only includes a small deterministic string function as the UDF.
Here are my own results:
It seems to be possible, though with limitations. The class Microsoft.AnalysisServices.AdomdServer.Context can provide you with the CurrentMember of each hierarchy; however, this does not work with Excel-style subselects. It either contains a single member or the All member.
Another option is to get the MDX query using the DMV query SELECT * FROM $System.DISCOVER_SESSIONS. There is a column on that view which contains the last MDX query for a given session. However, in order not to overwrite your own last query, you must not use the current connection but open a new one. The session id can be obtained through Microsoft.AnalysisServices.AdomdServer.Context.CurrentConnection.SessionID.
The second approach is OK for our use case. It does not allow you to handle axes, since the UDF has cell scope but you don't know which cell you are in. If any of you knows anything about that last bit, please tell me. Thanks!
I have an application that allows users to filter applicants based on a very large set of criteria. The criteria are each represented by boolean columns spanning multiple tables in the database. Instead of using ActiveRecord models I thought it was best to use pure SQL and put the bulk of the work in the database. In order to do this I have to construct a rather complex SQL query based on the criteria that the users selected and then run it through AR on the db. Is there a better way to do this? I want to maximize performance while also having maintainable and non-brittle code. Any help would be greatly appreciated.
As #hazzit said, it is difficult to answer without more details, but here are my two cents on this. Raw SQL is usually needed to perform complex operations like aggregates, calculations, etc. However, when it comes to search / filtering features, I often find using raw SQL overkill and not quite maintainable.
The key question here is: can you break down your problem into multiple independent filters?
If the answer is yes, then you should leverage the power of ActiveRecord and Arel. I often find myself implementing something like this in my model:
scope :a_scope, ->{ where something: true }
scope :another_scope, ->( option ){ where an_option: option }
scope :using_arel, ->{ joins(:assoc).where Assoc.arel_table[:some_field].not_eq "foo" }
# cue a bunch of scopes
def self.search( options = {} )
output = relation
relation = relation.a_scope if options[:an_option]
relation = relation.another_scope( options[:another_option] ) unless options[:flag]
# add logic as you need it
end
The beauty of this solution is that you declare a clean interface into which you can directly pour all the params from your checkboxes and fields, and that returns a relation. Breaking the query into multiple, reusable scopes helps keep the thing readable and maintainable; using a search class method ties it all together and allows thorough documentation... And all in all, using Arel helps secure the app against injections.
As a side note, this does not prevent you from using raw SQL, as long as the query can be isolated inside a scope.
If this method is not suited to your needs, there's another option: use a full-fledged search / filtering solution like Sunspot. This uses another store, separate from your db, that indexes defined parts of your data for easy and performant search.
It is hard to answer this question fully without knowing more details, but I'll try anyway.
While databases are bad at quite a few things, they are very good at filtering data, especially when it comes to high volumes.
If you do the filtering in Ruby on Rails (or just about any other programming language), the system will have to retrieve all of the unfiltered data from the database, which will cause tons of disk I/O and network (or interprocess) traffic. It then has to go through all those unfiltered results in memory, which may be quite a burden on RAM and CPU.
If you do the filtering in the database, there is a pretty good chance that most of the records will never actually be retrieved from disk, handed over to RoR, or filtered there. The main reason indexes even exist is for the sole purpose of avoiding expensive operations in order to speed things up. (Yes, they also help maintain data integrity.)
To make this work, however, you may need to help the database a bit to do its job efficiently. You will have to create indexes matching your filtering criteria, and you may have to look into performance issues with certain types of queries (how to avoid temporary tables and such). However, it is definitely worth it.
That said, there actually are a few types of queries that a given database is not good at doing. They are few and far between, but they do exist. In those cases, an implementation in RoR might be the better way to go. Even without knowing more about your scenario, I'd say it's a pretty safe bet that your queries are not among those.
In an attempt to reduce round-trips to the database, I was hoping to 'preload' child collections for a parent object. My hope was that if I loaded the objects that make up the child collection into the DataContext cache, Linq2SQL would use those objects instead of going to the database.
For example, assume I have a Person object with two child collections: Children and Cars.
I thought this might work:
var children = from p in dbc.Person
               select p.Children;
var cars = from p in dbc.Person
           select p.Cars;
var people = from p in dbc.Person
             select p;

var dummy1 = children.ToList();
var dummy2 = cars.ToList();

foreach (var person in people)
{
    Debug.WriteLine(person.Children.Count);
}
But instead, I'm still getting one trip to the database for every call to person.Children.Count, even though all the children are already loaded in the DataContext.
Ultimately, what I'm looking for is a way to load a full object graph (with multiple child collections) in as few trips to the database as possible.
I'm aware of the DataLoadOptions class (and LoadWith), but it's limited to one child collection, and that's a problem for me.
I wouldn't mind using a stored procedure if I was still able to build up the full object graph (and still have a Linq2SQL object, of course) without a lot of extra manipulation of the objects.
I don't think what you require is directly possible with LINQ-SQL, or even SQL, in the way that you're expecting.
When you consider how SQL works, a single one-to-many relationship can easily be flattened by an inner/left join. After this, if you want another set of objects included, you could theoretically write another left join which would bring back all the rows in the other table. But imagine the output SQL would provide; it's not easily workable back into an object graph. ORMs will typically start to query per row to provide this data. Write multiple-level DataLoadOptions using LINQ-SQL and you will start to see this.
Also, the impact on performance with many LEFT JOINs will easily outweigh the perceived benefits of single querying.
Consider your example. You are fetching ALL of the rows back from these two tables. This poses two problems further down the line:
The tables may get much more data in them than you expect. Pulling back thousands of rows in SQL all at once may not be good for performance. You then expect LINQ-SQL to search the lists for matching objects. Depending on use, I imagine SQL can provide this data faster.
The behavior you expect is hidden, and to me it would be unusual if running a SELECT on a large table could potentially bloat the memory usage of the app by caching all of the rows. Maybe they could include it as an option, or you could provide your own extension methods.
Solution
I would personally start to look at caching the data outside of LINQ-SQL, and then retrieve objects from the ASP.NET cache / memory cache / your cache provider, if the full object graph is deemed expensive to create.
One level of DataLoadOptions plus relying on the built-in entity relationships + some manual caching will probably save many a headache.
I have come across this particular problem a few times and haven't thought of anything else yet.