Rails massive data upload and nested records - mysql

I have to load a lot of data into MySQL (~100 million records!). Some records already exist, some have to be created. I also have to create some nested resources for each record.
I know the activerecord-import gem, but as far as I know it can't handle nested records (or only with ugly workarounds). The issue is that I don't know the IDs of the nested records' parents before they are created, and creating them in single queries takes time.
So let's say there is a model called Post that can have many Comments. My current code looks like this:
Post.transaction do
  import_posts.each do |import_post|
    post = Post.find_or_initialize_by(somevalue: import_post['somevalue'])
    post.text = import_post['text']
    import_post['comments'].each do |import_comment|
      comment = post.comments.find_or_initialize_by(someothervalue: import_comment['someothervalue'])
      comment.text = import_comment['text']
    end
    post.save(validate: false) # Don't need validation - saves some time
  end
end
This is just an example and it works, but it's far from 'damn fast'. Are there any ideas on how to speed up the data upload? Am I going about this completely wrong?
I'm working with Rails 5 and Ruby 2.4.
Thanks in advance!
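One common way around the "IDs are unknown until the parent exists" problem is to work in two phases per batch: look up all existing posts for the batch in one query, bulk-insert the missing posts, look the keys up again to learn the new IDs, and only then build the comment rows. The sketch below shows only the in-memory bookkeeping in plain Ruby. The helper names (partition_posts, comment_rows) are made up for this illustration, and the database calls are described in comments as assumptions rather than executed.

```ruby
# Sketch only: per-batch bookkeeping for a two-phase import.
# `existing_ids` stands in for the result of ONE lookup query per batch,
# for example Post.where(somevalue: keys).pluck(:somevalue, :id).to_h
# (assumed here, not executed).

# Splits a batch into posts that already exist and posts that must be inserted.
def partition_posts(import_posts, existing_ids)
  import_posts.partition { |post| existing_ids.key?(post['somevalue']) }
end

# Builds flat comment rows once every post in the batch has a known id.
def comment_rows(import_posts, ids_by_key)
  import_posts.flat_map do |post|
    post_id = ids_by_key.fetch(post['somevalue'])
    post.fetch('comments', []).map do |comment|
      { post_id: post_id,
        someothervalue: comment['someothervalue'],
        text: comment['text'] }
    end
  end
end

import_posts = [
  { 'somevalue' => 'a', 'text' => 'first',
    'comments' => [{ 'someothervalue' => 'x', 'text' => 'c1' }] },
  { 'somevalue' => 'b', 'text' => 'second',
    'comments' => [{ 'someothervalue' => 'y', 'text' => 'c2' },
                   { 'someothervalue' => 'z', 'text' => 'c3' }] }
]

existing_ids = { 'a' => 1 } # pretend only post 'a' is already stored
to_update, to_insert = partition_posts(import_posts, existing_ids)

# Here `to_insert` would go to the database as one multi-row INSERT, and the
# keys would be looked up again. We pretend the database assigned id 2 to 'b'.
ids_after_insert = existing_ids.merge('b' => 2)

rows = comment_rows(import_posts, ids_after_insert)
```

The resulting rows can then be written with one multi-row statement per batch (MySQL's INSERT ... ON DUPLICATE KEY UPDATE covers the "exists or not" case for the comments), so the number of queries grows with the number of batches rather than with the number of records.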

Related

What is the least resource intensive to perform "findManyOrCreate" in Laravel eloquent?

I have an array of post codes coming from an input:
$postCodes = collect(["BK13TV", "BK14TV", "BK15TV", "BK16TV"]);
In my database I already have two of the post codes - "BK13TV", "BK16TV".
So I would like to run something like this:
$postCodeModels = PostCode::findManyOrCreate($postCodes->map(function ($postCode) {
    return ['code' => $postCode];
}));
My initial approach was to load all the post codes, then diff them against the postCodes from the input like so:
PostCode::createMany($postCodes->diff(PostCode::all()->pluck('code')))
However, this means that I am loading the entire content of the post_codes table, which just seems wrong.
In the ideal case, this would return all post code models matching the passed post codes and would also create entries for post codes that did not exist in the database.
First, retrieve the existing post codes:
$existingPostCodes = PostCode::whereIn('code', $postCodes)->get();
Then find all the post codes in the input that are not yet stored in the database:
$newPostCodes = $postCodes->diff($existingPostCodes->pluck('code'));
Next, insert the missing post codes from $newPostCodes, ideally in a single bulk insert so the total stays at three queries.
Finally, retrieve all the post codes, both existing and newly created, as models:
$postCodeModels = PostCode::whereIn('code', $postCodes)->get();
Admittedly, this still takes three queries, but it does eliminate the crazy stuff of loading an entire table's worth of data.
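To make the query count concrete, here is the same select, diff, insert, select sequence as a plain Ruby sketch (Ruby rather than PHP, and against an in-memory stand-in instead of a real table). FakePostCodeTable, where_in, insert_all and find_many_or_create are hypothetical names invented for this illustration; they are not Eloquent or ActiveRecord APIs.

```ruby
require 'set'

# Hypothetical in-memory stand-in for the post_codes table.
# It counts every call that would be one round trip to the database.
class FakePostCodeTable
  attr_reader :queries

  def initialize(codes)
    @codes = Set.new(codes)
    @queries = 0
  end

  # Stands in for: SELECT ... WHERE code IN (...)
  def where_in(codes)
    @queries += 1
    codes.select { |code| @codes.include?(code) }
  end

  # Stands in for one multi-row INSERT.
  def insert_all(codes)
    @queries += 1
    @codes.merge(codes)
  end
end

def find_many_or_create(table, codes)
  existing = table.where_in(codes)                  # query 1: what is already there
  missing = codes - existing                        # diff in memory, no query
  table.insert_all(missing) unless missing.empty?   # query 2: create the rest
  table.where_in(codes)                             # query 3: load everything as "models"
end

table = FakePostCodeTable.new(%w[BK13TV BK16TV])
result = find_many_or_create(table, %w[BK13TV BK14TV BK15TV BK16TV])
```

Only the requested codes ever travel between the application and the table, which is the point of replacing the all() call with whereIn.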

Not like on Array field

Here is the problem I'm stuck with:
I'm using Rails 4 & MySQL.
I have a Message model which has one sender and one recipient.
I want to be able to archive messages, but if the sender archives a message, the recipient should still be able to access it until he archives it too.
I've serialized a field:
serialize :archived_by, Array
which contains the users who archived the message,
but I can't figure out how to query with it.
Message.where("archived_by like ?", [1].to_yaml)
works well, returning messages archived by User '1'
Message.where.not("archived_by like ?", [1].to_yaml)
won't work, returning nothing
I would like to find something other than a classic many-to-many ...
Thanks!
UPDATE
I finally decided to add 2 fields, one for the sender & one for the recipient, to know who archived the message. If someone knows the proper way to do this, tell us :)
If you are using PostgreSQL, you could query the serialized information.
As described in the answer to "Searching serialized data, using active record", the downside of serialize, at least under MySQL, is that you bypass the native database abstraction.
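A short plain-Ruby illustration of why matching on the serialized string is fragile: a LIKE pattern without % wildcards behaves essentially as an equality test on the whole column, and the YAML text changes as soon as a second user archives the message. The variable names below are made up for this sketch.

```ruby
require 'yaml'

# What ends up in the archived_by column for two different messages.
archived_by_one  = [1].to_yaml     # "---\n- 1\n"
archived_by_both = [1, 2].to_yaml  # same start, plus another "- 2" line

pattern = [1].to_yaml

# Without wildcards, the comparison is effectively whole-string equality:
matches_one  = (archived_by_one == pattern)   # message archived only by user 1
matches_both = (archived_by_both == pattern)  # message archived by users 1 and 2

# Deserializing first gives the answer that was actually wanted,
# but this happens in Ruby, after the rows have been loaded.
really_archived_by_1 = YAML.load(archived_by_both).include?(1)
```

So the first query finds messages archived by user 1 and nobody else, which is one more reason the two-column solution from the update (or a join table) is easier to query.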

Finding a specific value out of an Array (RoR / MySQL)

I am trying to find a specific value inside an array. The array is composed from data in MySQL database and looks like:
info = [#<Info1: bla, Info2: blo>, #<Info1: bli, Info2: ble>]
Now I want to get every Info1's value from it, but I do not know how.
The array was formed by calling
info = Info.find(:all)
Can anyone help me?
I am using Rails 2.2.2 (don't ask, can't do anything about it) and Ruby 1.8.
Edit: More details
Info is a database table, where Info1 and Info2 are the columns. Calling it with info = Info.find(:all) returns the array above.
What I have tried so far involves going through the array with each, but so far no luck.
Most of what I have tried, like
a.grep(/^info1/)
and
info.select { |i| i.name == "info1" }
returns empty arrays.
Edit
Never mind, I found the answer. I was overthinking it. The answer is
info.each do |object|
  puts object.info2
end
What's your selection criteria? You can do something like
info.select{|i| i.name == 'hello' }
and you will get all the Info objects with name = 'hello'.
But I would prefer to change the query, if you can, to filter them in the database query directly.
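If the goal is to collect every Info1 value rather than print it, map does that in one step. In the sketch below a Struct stands in for the ActiveRecord model, so the snippet runs without a database; the attribute names info1 and info2 follow the question, and the sample values are the ones shown there.

```ruby
# A Struct stands in for the Info model (assumption for this sketch).
Info = Struct.new(:info1, :info2)

info = [Info.new('bla', 'blo'), Info.new('bli', 'ble')]

# map returns a new array with one value per record.
info1_values = info.map { |record| record.info1 }

# select keeps whole records that match a condition on a column VALUE
# (not on the column name, which is why the attempts in the question came back empty).
matching = info.select { |record| record.info1 == 'bla' }
```

The explicit block form is used instead of map(&:info1) because the question targets Ruby 1.8.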

sfPropelPager reduce queries

I'm working on a symfony project and using sfPropelPager to show a paged list of elements.
The problem is that with a large amount of data to list (i.e. thousands of records) it makes a query to the database for each page to show! That means about 100 extra queries in my case, and that is unacceptable.
Showing some of my code: the function that returns the pager object
$pager = new sfPropelPager('MyTable',sfConfig::get('sfPropelPagerLines'));
$c = new Criteria();
$c->add('my_table_field',$value);
$c->addDescendingOrderByColumn('date');
$pager->setCriteria($c);
$pager->init();
return $pager;
So, please, if you know a way to get all the results with only one query, that would be a great solution to my problem. Otherwise I must implement the list with an AJAX call for every page the user wants to see.
Thank you very much for your time.
I'm not sure I understand your problem but, anyway, avoid using Criteria. Try to build queries with the ModelCriteria API: http://www.propelorm.org/reference/model-criteria.html.
A query to the database is made for each paginated page; this is the standard behavior for all pagers I know. If the extra queries come from related objects (assuming you want to display information from relations), you may want to create a query that joins those objects before paginating; that way you'll get one query per page for all the data you display.
Read this doc for instance: http://www.propelorm.org/documentation/03-basic-crud.html#query_termination_methods
In the end I didn't find a solution to the problem. I had to implement the list via an AJAX call to a function that returns the requested page, so that on page load no query for this list slows down the user experience.
Thank you anyway for helping me :)
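For reference, "one query per page" boils down to translating a page number into an OFFSET/LIMIT pair, which is the same whether the page is requested on load or through AJAX. A minimal sketch in plain Ruby (not Propel code; page_window is a hypothetical helper, and the 25 lines per page and 2400 rows are example numbers, not values from the question):

```ruby
# Translates a 1-based page number into the window a pager would query.
def page_window(page, per_page, total)
  page_count = (total + per_page - 1) / per_page   # integer ceiling
  page = [[page, 1].max, [page_count, 1].max].min  # clamp to a valid page
  { offset: (page - 1) * per_page, limit: per_page, page_count: page_count }
end

window = page_window(3, 25, 2400)
```

Only the rows inside that window need to be fetched for a given page, which is why a pager normally issues a new query whenever another page is shown.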

How IQueryables are dealt with in ASP.NET MVC Views?

I have some tables in a MySQL database that hold records from a sensor. One of the features of the system I'm developing is to display these records to the web user, so I used the ADO.NET Entity Data Model to create an ORM, used Linq to SQL to get the data from the database, and stored it in a ViewModel I designed, so I can display it using the MVCContrib Grid Helper:
public IQueryable<TrendSignalRecord> GetTrends()
{
    var dataContext = new SmgerEntities();
    var trendSignalRecords = from e in dataContext.TrendSignalRecords
                             select e;
    return trendSignalRecords;
}

public IQueryable<TrendRecordViewModel> GetTrendsProjected()
{
    var projectedTrendRecords = from t in GetTrends()
                                select new TrendRecordViewModel
                                {
                                    TrendID = t.ID,
                                    TrendName = t.TrendSignalSetting.Name,
                                    GeneratingUnitID = t.TrendSignalSetting.TrendSetting.GeneratingUnit_ID,
                                    //{...}
                                    Unit = t.TrendSignalSetting.Unit
                                };
    return projectedTrendRecords;
}
I call the GetTrendsProjected method and then use Linq to SQL to select only the records I want. It works fine in my development scenario, but when I test it in a real scenario, where the number of records is far greater (around a million records), it stops working.
I added some debug messages to test it, and everything works fine until it reaches the return View() statement, where it simply stops and throws a MySQLException: Timeout expired. That left me wondering whether the data I send to the page is retrieved by the page itself (that is, it only queries the database for the displayed items when the page itself needs them, or something like that).
All of my other pages use the same set of tools (MVCContrib Grid Helper, ADO.NET, Linq to SQL, MySQL), and everything else works all right.
You absolutely should paginate your data set before executing your query if you have millions of records. This can be done using the .Skip and .Take extension methods, and they should be called before any query runs against your database.
Trying to fetch millions of records from a database without pagination would very likely cause a timeout at best.
Well, assuming the information in this blog is correct, the .AsPagination method requires you to sort your data by a particular column. It's possible that doing an OrderBy on a table with millions of records is simply a time-consuming operation that times out.
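The symptom in the question (everything runs until return View()) is consistent with deferred execution: an IQueryable only describes a query, and the database is hit when something enumerates it, which is typically the view. The sketch below is a rough analogy in Ruby rather than C#: a lazy enumerator stands in for the IQueryable, and drop/first stand in for .Skip/.Take. The names fetched and records are made up for this illustration.

```ruby
fetched = 0 # counts how many "rows" were actually materialized

# Building the pipeline does no work yet, much like composing LINQ operators.
records = (1..1_000_000).lazy.drop(20).map do |id|
  fetched += 1
  { id: id }
end

fetched_before = fetched # still zero: nothing has been enumerated

# Enumerating is what triggers the work, and only for the requested page.
page = records.first(10)
page_ids = page.map { |row| row[:id] }
```

Applying the page window before enumeration keeps the work proportional to one page instead of the whole table, which is the effect .Skip and .Take have when they are part of the query that reaches the database.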