Help me optimize an ActiveRecord object with too many attributes - mysql

I'm working on a app which ties to a legacy database. The primary model is based on a stupidly large 100+ column table. I don't know too much about the inner-workings of ActiveRecord but it seems to me that any request on this model is slowing down because it's creating objects with 100+ attributes. Let's call this SlowModel.
Rendering pages with this model sometimes take 17 seconds on my dev computer. Straight up mysql queries only take ~ 0.5 - 1 second.
I've managed to speed up one portion of the app by using a MySQL view that selects a subset of fields (20 or so). We'll call this QuickModel. Using views is OK but isn't the most portable solution.
I will likely continue to try and add this QuickModel into other parts of the site but I was wondering if anyone had other ideas in speeding up the original object. For instance, is there a way to specify in the model what columns activerecord should just ignore and avoid building? Maybe there are specific column types (:text??) that cause bloat in ActiveRecord objects.
Assume that columns have proper indices.

You can specify which columns are returned in the model lookup using the :select option of the ActiveRecord lookup:
SlowModel.all(:select => 'id, col1, col2, col3')
...will load instances of SlowModel with only the specified columns populated.

How about having a completely new QuickModel that sits to its own table... and a QuickModel has_one SlowModel?
You can use SQL to move the most-necessary data into the QuickModel table and only refer to the SlowModel using my_quick_model.slow_model when necessary.
Alternatively, you can add a "select" to the default scope (you can google "rails default scope" for more). By default it'll only fetch the reduced set - but you can ask for all attributes by passing :select => "*" if necessary.

Along the lines of what Winfield is saying, you may want to take a look at using an attribute tracker like SlimScrooge. The tracker attempts to fetch only the data that you're using, which reduces overhead. It attempts to automatically do what Winfield is suggesting.
Example from the Readme:
# 1st request, sql is unchanged but columns accesses are recorded
Brochure Load SlimScrooged 1st time (27.1ms) SELECT * FROM `brochures` WHERE (expires_at IS NULL)
# 2nd request, only fetch columns that were used the first time
Brochure Load SlimScrooged (4.5ms) SELECT `brochures`.expires_at,`brochures`.operator_id,`brochures`.id FROM `brochures` WHERE (expires_at IS NULL)
# 2nd request, later in code we need another column which causes a reload of all remaining columns
Brochure Reload SlimScrooged (0.6ms) `brochures`.name,`brochures`.comment,`brochures`.image_height,`brochures`.id, `brochures`.tel,`brochures`.long_comment,`brochures`.image_name,`brochures`.image_width FROM `brochures` WHERE `brochures`.id IN ('5646','5476','4562','3456','4567','7355')
# 3rd request
Brochure Load SlimScrooged (4.5ms) SELECT `brochures`.expires_at,`brochures`.operator_id,`brochures`.name, `brochures`.id FROM `brochures` WHERE (expires_at IS NULL)

Related

Is there a way to store database modifications with a versioning feature (for eventual versions comparaison)?

I'm working on a project where users could upload excel files into a MySQL database. Those files are the main source of our data as they come directly from the contractors working with the company. They contain a large number of rows (23000 on average for each file) and 100 columns for each row!
The problem I am facing currently is that the same file could be changed by someone (either the contractor or the company) and when re-uploading it, my system should detect changes, update the actual data, and save the action (The fact that the cell went from a value to another value :: oldValue -> newValue) so we can go back and run a versions comparison (e.g 3 re-uploads === 3 versions). (oldValue Version1 VS newValue Version5)
I developed a tiny mechanism for saving the changes => I have a table to save Imports data (each time a user import a file a new row will be inserted in this table) and another table for saving the actual changes
Versioning data
I save the id of the row that have some changes, as well as the id and the table where the actual data was modified (Uploading a file results in a insertion in multiple tables, so whenever a change occurs, I need to know in which table that happened). I also save the new value and the old value which is gonna help me with restoring the "archives data".
To restore a version : SELECT * FROM 'Archive' WHERE idImport = ${versionNumber}
To restore a version for one row : SELECT * FROM 'Archive' WHERE idImport = ${versionNumber} and rowId = ${rowId}
To restore all version for one row : SELECT * FROM 'Archive' WHERE rowId = ${rowId}
To restore version for one table : SELECT * FROM 'Archine' WHERE tableName = ${table}
Etc.
Now with this structure, I'm struggling to restore a version or to run a comparaison between two versions, which makes think that I've came up with a wrong approach since it makes it hard to do the job! I am trying to know if anyone had done this before or what a good approach would look like?
Cases when things get really messy :
The rows that have changed in a version might not have changed in the other version (I am working on a time machine to search in other versions when this happens)
The rows have changed in both versions but not the same fields. (Say we have a user table, the data of the user with id 15 have changed in 2nd and 5th upload, great! Now for the second version only the name was changed, but for the fifth version his address was changed! When comparing these two versions, we will run into a problem constrcuting our data array. name went from "some"-> NULL (Name was never null. No name changes in 5th version) and address went from NULL -> "some' is which obviously wrong).
My actual approach (php)
<?php
//Join records sets and Compare them
foreach ($firstRecord as $frecord) {
//Retrieve first record fields that have changed
$fFields = $frecord->fieldName;
//Check if the same record have changed in the second version as well
$sId = array_search($frecord->idRecord, $secondRecord);
if($sId) {
$srecord = $secondRecord[$sId];
//Retrieve straversee fields that have changed
$sFields = $srecord->fieldName;
//Compare the two records fields
foreach ($fFields as $fField) {
$sfId = array_search($fField, $sFields);
//The same field for the same record was changed in both version (perfect case)
if($sfId) {
$sField = $sFields[$sfId];
$deltaRow[$fField]["oldValue"] = $frecord->deltaValue;
$deltaRow[$fField]["newValue"] = $srecord->deltaValue;
//Delete the checked field from the second version traversee to avoid re-checking
unset($sField[$sfId]);
}
//The changed field in V1 was not found in V2 -> Lookup for a value
else {
$deltaRow[$fField]["oldValue"] = $frecord->deltaValue;
$deltaRow[$fField]["newValue"] = $this->valueLookUp();
}
}
$dataArray[] = $deltaRow;
//Delete the checked record from the second version set to avoid re-checking
unset($secondRecord[$srecord]);
}
I don't know how to deal with that, as I said I m working on a value lookup algorithm so when no data found in a version I will try to find it in the versions between theses two so I can construct my data array. I would be very happy if anyone could give some hints, ideas, improvements so I can go futher with that.
Thank you!
Is there a way to store database modifications with a versioning feature (for eventual versions comparaison [sic!])?
What constitutes versioning depends on the database itself and how you make use of it.
As far as a relational database is concerned (e.g. MariaDB), this boils down to the so called Normal Form which is in numbers.
On Database Normalization: 5th Normal Form and Beyond you can find the following guidance:
Beyond 5th normal form you enter the heady realms of domain key normal form, a kind of theoretical ideal. Its practical use to a database designer os [sic!] similar to that of infinity to a bookkeeper - i.e. it exists in theory but is not going to be used in practice. Even the most demanding owner is not going to expect that of the bookkeeper!
One strategy to step into these realms is to reach the 5th normal form first (do this just in theory, by going through all the normal forms, and study database normalization).
Additionally you can construe versioning outside and additional to the database itself, e.g. by creating your own versioning system. Reading about what you can do with normalization will help you to find better ways to decide on how to structure and handle the database data for your versioning needs.
However, as written it depends on what you want and need. So no straight forward "code" answer can be given to such a general question.

SSIS consolidate and concatenate multiple rows into single rows without using SQL

I am trying to accomplish something that is pretty easy to do in SQL, but seemingly very challenging to do in SSIS without using SQL. Basically, I need to consolidate and concatenate a field of a many-to-one relationship.
Given entities: [Contract Item] (many) to (one) [Account]
There is a field [ari_productsummary] that contains the product listed on the Contract Item entity. We want to write that value to the Account as [ari_activecontractitems]. However, an Account may have more than one Contract Item record associated to it, in which case, we want to concatenate those values. We also only want the distinct values to be concatenated (distinct rows already solved within my data flow).
This can be accomplished by writing to a temporary table, and then using a query or view to obtain the summarized results as followed. I created a SQL table called TESTTABLE that contains the [ari_productsummary] from the Contract Item entity along with the referring [accountid] to map it back to Account. I then wrote the following query as a view:
SELECT distinct accountid,
(SELECT TT2.ari_productsummary + '; '
FROM TESTTABLE TT2
WHERE TT2.accountid = TT.accountid
FOR XML PATH ('')
) AS 'ari_activecontractitems'
FROM TESTTABLE TT
Executing that Query provides me the results that I want, which I can then use for importing into the Account entity as shown below:
But how do I do this in a SSIS dataflow without writing to a SQL table as a temporary placeholder for the data?? I want to do the entire process inside one dataflow container, without using a temporary SQL table/view. The whole summarization process needs to be done on the fly:
Does anyone have a solution that doesn't require a temporary SQL table/view/query, but is contained entirely within the data flow?
I am using VS 2017 and the KingswaySoft Dynamic CRM 365 ETL toolset to develop my solution/package.
Spit balling here as I don't Dynamics nor do I have the custom components.
Data Flow 1 - Contract aggregation
The purpose of this data flow is to replicate your logic in the elegant query you provided and shove that into a Cache Connection Manager (see Notes for 2008+ at the end)
KingswaySoft Dynamics Source -> Script Task -> Cache Transform
If you want to keep the sort in there, do it before the script task. The implementation I'll take with the Script Task is that it's fully blocking - that is all the rows must arrive before it can send any on. Tasks like the Merge Join are only partially blocking because the requirement of sorted data means that once you no longer have a match for the current item, you can send it on down the pipeline.
The Script Task is going to be asynchronous transformation. You'll have two output columns, your key accountid and your new derived column of ari_activecontractitems. That column will might need to be big - you'll know your data best but if it's a blob type in Dynamics (> 4k unicode or > 8k ascii characters) then you'll have to define the data type as DT_TEXT/DT_NTEXT
As inputs, you'll select accountid and ari_productsummary from your source.
The code should be pretty easy. We're going to accumulate the inbound data into a Dictionary.
// member variable
Dictionary<string, List<string>> accumulator;
The PreProcess method, we'll tack this in there to initialize our variable
// initialize in PreProcess method
accumulator = new Dictionary<string, List<string>>();
In the OnBufferRowSent (name approx)
// simulate the inbound queue
// row_id would be something like Rows.row_id
if (!accumulator.ContainsKey(row_id))
{
// Create an empty dictionary for our list
accumulator.Add(row_id, new List<string>());
}
// add it if we don't have it
if (!accumulator[row_id].Contains(invoice))
{
accumulator[row_id].Add(invoice);
}
Once you get the signal sent of no more data available, that's when you start buffering output data. The auto generated code will have placeholders for all this.
// This is how we shove data out the pipe
foreach(var kvp in accumulator)
{
// approximately thus
OutputBuffer1.AddRow();
OutputBuffer1.row_id = kvp.Key;
OutputBuffer1.ari_productsummary = string.Join("; ", kvp.Value);
}
We have an upcoming release that comes with a component that does exactly what you are trying to achieve without the need of writing custom code. The feature is currently under preview, please reach out to us for private access to the feature. You can find our contact information on our website.
UPDATE - June 5, 2020, we have made the components available for public access at https://www.kingswaysoft.com/products/ssis-productivity-pack/ as a result of our 2020 Release Wave 1. We have two components available that serve this kind of purpose. The Composition component will take input values and transform into a composite value in a SSIS column. The Decomposition component does the opposite, it would take an input value and split it into multiple rows using either delimiter-based text splitting or XML/JSON array splitting.

Rows count of Couchbase view subset

When I query some view in Couchbase I get the response that has following structure:
{
"total_rows":100,
"rows":[...]
}
'total_rows' is very useful property that I can use for paging.
But lets say I select only a subset of view using 'start_key' and 'end_key' and of course I don't know how big this subset will be. 'total_rows' is still the same number (as I understand it's just total of whole view). Is there any easy way to know how many rows was selected in subset?
You can use the in-built reduce function _count to get the total count of your query.
Just add _count as reduce function for your view. After that, you will need to make two calls to couchbase:
In one call, you'll set the query param reduce=true (along with either group=true or group_level=n, depending upon how you're sending your key(s)). This will give you the total count of your filtered rows.
In the other call, you'll disable the reduce function with reduce=false because you now need the actual rows.
You can find more details about map and reduce at http://docs.couchbase.com/admin/admin/Views/views-writing.html
You can just use an array count/total/length in whatever language you are using.
For example in PHP:
$result = $cb->view("dev_beer", "beer_by_name", array('startkey' => 'O', 'endkey'=>'P'));
echo "total = >>".count($result["rows"])
If you're actually wanting to paginate your data then you should use limit and skip:
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-querying-pagination.html
If you have to paginate the view in the efficient way, you actually don't need to specify both start and the end.
Generally it is possible to use startkey/startkey_id and limit. In this case the limit will tell you that the page won't be bigger than known size.
Both cases are described in CouchDB book: http://guide.couchdb.org/draft/recipes.html#pagination
Here is how it works:
Request rows_per_page + 1 rows from the view
Display rows_per_page rows, store + 1 row as next_startkey and next_startkey_docid
As page information, keep startkey and next_startkey
Use the next_* values to create the next link, and use the others to create the previous link

sfPropelPager reduce queries

i'm working in a symfony project and using sfPropelPager to show a paged list of elements.
The problem is that with a great amount of data to list (i.e. thousands of registers) it makes a query to the database for each page to show!!!! That means about 100 extra queries in my case, and that is unacceptable.
Showing some of my code: the function that returns the pager object
$pager = new sfPropelPager('MyTable',sfConfig::get('sfPropelPagerLines'));
$c = new Criteria();
$c->add('my_table_field',$value);
$c->addDescendingOrderByColumn('date');
$pager->setCriteria($c);
$pager->init();
return $pager;
So, please, if you know a way to get all the results with only one query, it would be a great solution for my problem. Otherwise i must implement that list with an ajax call for every page the user wants to see
Thank you very much for your time.
I'm not sure to get your problem but, anyway, avoid the use of Criteria. Try to make queries with the ModelCriteria API: http://www.propelorm.org/reference/model-criteria.html.
For each paginated page, a query to the database will be done, this is the standard behavior for all pagers I know. If it's related to related objects (assuming you want to display information from relations), you may want to create a query that links those objects before to paginate, that way you'll get one query per page for all your data to display.
Read this doc for instance: http://www.propelorm.org/documentation/03-basic-crud.html#query_termination_methods
At last i did'nt get a solution for the problem, i had to implement the list via AJAX call, calling to a function that returns the requested page, so at the load of the page, no query for this list is slowing the user experience.
Thank you anyway to help me :)

nested sql queries in rails

I have the following query
#initial_matches = Listing.find_by_sql(["SELECT * FROM listings WHERE industry = ?", current_user.industry])
Is there a way I can run another SQL query on the selection from the above query using a each do? I want to run geokit calculations to eliminate certain listings that are outside of a specified distance...
Your question is slightly confusing. Do you want to use each..do (ruby) to do the filtering. Or do you want to use a sql query. Here is how you can let the ruby process do the filtering
refined list = #initial_matches.map { |listing|
listing.out_of_bounds? ? nil : listing
}.comact
If you wanted to use sql you could simply add additional sql (maybe a sub-select) it into your Listing.find_by_sql call.
If you want to do as you say in your comment.
WHERE location1.distance_from(location2, :units=>:miles)
You are mixing ruby (location1.distance_from(location2, :units=>:miles)) and sql (WHERE X > 50). This is difficult, but not impossible.
However, if you have to do the distance calculation in ruby already, why not do the filtering there as well. So in the spirit of my first example.
listing2 = some_location_to_filter_by
#refined_list = #initial_matches.map { |listing|
listing.distance_from(listing2) > 50 ? nil : listing
}.compact
This will iterate over all listings, keeping only those that are further than 50 from some predetermined listing.
EDIT: If this logic is done in the controller you need to assign to #refined_list instead of refined_list since only controller instance variables (as opposed to local ones) are accessible to the view.
In short, no. This is because after the initial query, you are not left with a relational table or view, you are left with an array of activerecord objects. So any processing to be done after the initial query has to be in the format of ruby and activerecord, not sql.