Rails: best way to add a huge number of records - MySQL

I've got to add about 25000 records to the database at once in Rails, and I have to validate them, too.
Here is what I have for now:
# controller create action
def create
  emails = params[:emails][:list].split("\r\n")
  @created_count = 0
  @rejected_count = 0
  inserts = []
  emails.each do |email|
    @email = Email.new(:email => email)
    if @email.valid?
      @created_count += 1
      inserts.push "('#{email}', '#{Date.today}', '#{Date.today}')"
    else
      @rejected_count += 1
    end
  end
  return if emails.empty?
  sql = "INSERT INTO `emails` (`email`, `updated_at`, `created_at`) VALUES #{inserts.join(", ")}"
  Email.connection.execute(sql) unless inserts.empty?
  redirect_to new_email_path, :notice => "Successfully created #{@created_count} emails, rejected #{@rejected_count}"
end
It's VERY slow right now; there's no way to add that many records because the request times out.
Any ideas? I'm using MySQL.

Three things come to mind:
You can help yourself with proper tools like zdennis/activerecord-import or jsuchal/activerecord-fast-import. The problem with your example is that you also instantiate 25000 ActiveRecord objects. If you tell activerecord-import not to use validations, it will not create new objects (see activerecord-import/wiki/Benchmarks); there is a sketch after this list.
Importing tens of thousands of rows into a relational database will never be super fast, so it should be done asynchronously via a background process. There are tools for that too, like DelayedJob and more: https://www.ruby-toolbox.com/
Move the code that belongs in the model out of the controller(TM).
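For illustration, a minimal sketch of the activerecord-import approach, assuming the gem is installed and the Email model from the question:

# One multi-row INSERT instead of 25000 single inserts; validate: false also
# skips instantiating an ActiveRecord object per row.
emails  = params[:emails][:list].split("\r\n")
columns = [:email]
values  = emails.map { |e| [e] }
Email.import columns, values, validate: false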
And after that, you need to rethink the flow of this part of the application. If you're doing background processing from a controller action like create, you cannot simply return HTTP 201 or HTTP 200. You need to return a quick HTTP 202 Accepted and provide a link to another representation where the user can check the status of their request (do we have a success response yet? how many emails failed?), since it is now being processed in the background.
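A hypothetical sketch of that flow (EmailImport and ImportEmailsJob are invented names for illustration, not from the question):

def create
  # persist the raw list so the background job can pick it up (hypothetical model)
  import = EmailImport.create!(raw_list: params[:emails][:list])
  ImportEmailsJob.perform_later(import.id)             # hypothetical ActiveJob
  # 202 Accepted plus a URL the client can poll for progress
  head :accepted, location: email_import_path(import)
end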
It can sound a bit complicated, and it is, which is a sign that maybe you shouldn't do it like that. Why do you have to add 25000 records in one request? What's the background?

Why don't you create a rake task for the work? The following link explains it pretty well.
http://www.ultrasaurus.com/sarahblog/2009/12/creating-a-custom-rake-task/
In a nutshell, once you write your rake task, you can kick off the work by:
rake member:load_emails
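A minimal sketch of such a task, assuming activerecord-import and a newline-separated input file (the path argument is an illustrative addition):

# lib/tasks/member.rake
namespace :member do
  desc "Bulk-load emails from a newline-separated file"
  task :load_emails, [:path] => :environment do |_t, args|
    emails = File.readlines(args[:path], chomp: true)
    # Same bulk-insert idea as above: one multi-row INSERT, no per-row objects
    Email.import [:email], emails.map { |e| [e] }, validate: false
  end
end

Invoked as rake member:load_emails[/tmp/emails.txt].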

If speed is your concern, I'd attack the problem from a different angle.
Create a table that copies the structure of your emails table; let it be emails_copy. Don't copy indexes and constraints.
Import the 25k records into it using your database's fast import tools; consult your DB docs (for MySQL, LOAD DATA INFILE is the usual tool). You will have to prepare the input file, but it's way faster, and I suppose you already have the data in some text or tabular form.
Create indexes and constraints for emails_copy to mimic the emails table. Constraint violations, if any, will surface; fix them.
Validate the data inside the table. It may take a few raw SQL statements to check for severe errors. You don't have to validate emails for anything but a very simple format anyway, and maybe all of that validation could be done against the text you'll use for import.
insert into emails select * from emails_copy to put the emails into the production table. Well, you might have to play with it a bit to get the autoincrement IDs right.
Once you're positive that the process succeeded, drop table emails_copy.
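Putting the steps together, a minimal SQL sketch of the flow, assuming MySQL and the three columns used in the question (the input file path is an assumption):

-- staging table: same columns, no indexes or constraints yet
CREATE TABLE emails_copy (
  email      VARCHAR(255),
  created_at DATETIME,
  updated_at DATETIME
);

-- fast bulk load; the file path is illustrative
LOAD DATA LOCAL INFILE '/tmp/emails.txt'
  INTO TABLE emails_copy (email)
  SET created_at = NOW(), updated_at = NOW();

-- now add the constraints; duplicate emails will surface here
ALTER TABLE emails_copy ADD UNIQUE KEY uniq_email (email);

-- move into production, letting AUTO_INCREMENT assign fresh ids
INSERT INTO emails (email, created_at, updated_at)
SELECT email, created_at, updated_at FROM emails_copy;

DROP TABLE emails_copy;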

Django admin - model visible to superuser, not staff user

I am aware of syncdb and makemigrations, but we are not allowed to run them in the production environment.
We recently had a couple of tables created on production. As expected, the tables were not visible in the admin for any user.
After that, we executed the two queries below manually against the production SQL server (I ran the migration on my local machine and used SHOW CREATE TABLE to fetch the raw SQL):
django_content_type
INSERT INTO django_content_type(name, app_label, model)
values ('linked_urls',"urls", 'linked_urls');
auth_permission
INSERT INTO auth_permission (name, content_type_id, codename)
values
('Can add linked_urls Table', (SELECT id FROM django_content_type where model='linked_urls' limit 1) ,'add_linked_urls'),
('Can change linked_urls Table', (SELECT id FROM django_content_type where model='linked_urls' limit 1) ,'change_linked_urls'),
('Can delete linked_urls Table', (SELECT id FROM django_content_type where model='linked_urls' limit 1) ,'delete_linked_urls');
Now this model is visible to the superuser, who can also grant access to it to staff users, but staff users can't see it.
Is there some other table entry that needs to be added?
Or is there any other way to solve this problem without syncdb or migrations?
We recently had a couple of tables created on production.
I can read what you wrote there in two ways.
First way: you created tables with SQL statements, for which there are no corresponding models in Django. If this is the case, no amount of fiddling with content types and permissions will make Django suddenly use the tables. You need to create models for the tables. Maybe they'll be unmanaged, but they need to exist.
Second way: the corresponding models in Django do exist; you just manually created tables for them, so that's not a problem. What I'd do in this case is run the following code; explanations follow after the code:
from django.contrib.contenttypes.management import update_contenttypes
from django.apps import apps as configured_apps
from django.contrib.auth.management import create_permissions

for app in configured_apps.get_app_configs():
    update_contenttypes(app, interactive=True, verbosity=0)

for app in configured_apps.get_app_configs():
    create_permissions(app, verbosity=0)
What the code above does is essentially perform the work that Django performs after it runs migrations. When a migration runs, Django creates tables as needed; when it is done, it calls update_contenttypes, which scans the models defined in the project and adds whatever is missing to the django_content_type table. Then it calls create_permissions to update auth_permission with the add/change/delete permissions that need adding. I've used the code above to force permissions to be created early during a migration. It is useful if I have a data migration, for instance, that creates groups that need to refer to the new permissions.
So, finally I found a solution. I did a lot of debugging on Django, and apparently the function below (in django.contrib.auth.backends) does the job of providing permissions.
def _get_permissions(self, user_obj, obj, from_name):
    """
    Returns the permissions of `user_obj` from `from_name`. `from_name` can
    be either "group" or "user" to return permissions from
    `_get_group_permissions` or `_get_user_permissions` respectively.
    """
    if not user_obj.is_active or user_obj.is_anonymous() or obj is not None:
        return set()

    perm_cache_name = '_%s_perm_cache' % from_name
    if not hasattr(user_obj, perm_cache_name):
        if user_obj.is_superuser:
            perms = Permission.objects.all()
        else:
            perms = getattr(self, '_get_%s_permissions' % from_name)(user_obj)
        perms = perms.values_list('content_type__app_label', 'codename').order_by()
        setattr(user_obj, perm_cache_name, set("%s.%s" % (ct, name) for ct, name in perms))
    return getattr(user_obj, perm_cache_name)
So what was the issue?
The issue lay in this query:
INSERT INTO django_content_type(name, app_label, model)
values ('linked_urls',"urls", 'linked_urls');
It looks fine at first glance, but the query actually executed was:
-- notice the capitalization here; it looked so trivial that I didn't even
-- bother to look into it until I realized what was happening internally
INSERT INTO django_content_type(name, app_label, model)
values ('Linked_Urls',"urls", 'Linked_Urls');
So Django, internally, when running migrate, ensures everything is stored in lower case, and this was the problem!
I ran a separate query to lower-case all the previous inserts, and voilà!
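For completeness, a hypothetical fix-up query in the spirit of what is described above (the app_label value is illustrative):

-- lower-case the previously inserted content-type rows
UPDATE django_content_type
SET model = LOWER(model)
WHERE app_label = 'urls';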

Does Statement.RETURN_GENERATED_KEYS generate any extra round trip to fetch the newly created identifier?

JDBC allows us to fetch the value of a primary key that is automatically generated by the database (e.g. IDENTITY, AUTO_INCREMENT) using the following syntax:
PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?)",
    Statement.RETURN_GENERATED_KEYS
);
ps.setString(1, "Post 1");   // illustrative value
ps.executeUpdate();

ResultSet resultSet = ps.getGeneratedKeys();
while (resultSet.next()) {
    LOGGER.info("Generated identifier: {}", resultSet.getLong(1));
}
I'm interested in whether the Oracle, SQL Server, PostgreSQL, or MySQL driver uses a separate round trip to fetch the identifier, or whether there is a single round trip that executes the insert and fetches the ResultSet automatically.
It depends on the database and driver.
Although you didn't ask for it, I will answer for Firebird ;). In Firebird/Jaybird the retrieval itself doesn't require extra roundtrips, but using Statement.RETURN_GENERATED_KEYS or the integer array version will require three extra roundtrips (prepare, execute, fetch) to determine the columns to request (I still need to build a form of caching for it). Using the version with a String array will not require extra roundtrips (I would love to have RETURNING * like in PostgreSQL...).
In PostgreSQL with PgJDBC there is no extra round-trip to fetch generated keys.
It sends a Parse/Describe/Bind/Execute message series followed by a Sync, then reads the results including the returned result-set. There's only one client/server round-trip required because the protocol pipelines requests.
However, batches that could otherwise be streamed to the server may sometimes be broken up into smaller chunks or run one by one if generated keys are requested. To avoid this, use the String[] form where you name the columns you want returned, and name only columns of fixed-width data types like integer. This only matters for batches, and it's due to a design problem in PgJDBC.
(I posted a patch to add batch pipelining support in libpq that doesn't have that limitation, it'll do one client/server round trip for arbitrary sized batches with arbitrary-sized results, including returning keys.)
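A quick sketch of the String[] column-name form mentioned above (the id column name is an assumption about the schema):

PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?)",
    new String[] { "id" });   // only the named, fixed-width column is returned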
MySQL receives the generated key(s) automatically in the OK packet of the protocol in response to executing a statement. There is no communication overhead when requesting generated keys.
In my opinion, even for such a trivial thing, a single approach that works across all database systems will fail. The only pragmatic solution is (by analogy to Hibernate) to find the best working solution for each target RDBMS and call it a dialect of your one-for-all solution :)
Here is the information for Oracle.
I'm using a sequence to generate the key; the same behavior is observed for an IDENTITY column.
create table auto_pk (
  id  number,
  pad varchar2(100)
);
This works and uses only one round trip:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX')",
    Statement.RETURN_GENERATED_KEYS)
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getGeneratedKeys()
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getString(1)
}
But unfortunately you get the ROWID as a result, not the generated key.
How is it implemented internally? You can see it if you activate a 10046 trace (by the way, this is also the best way to see how many round trips were performed):
PARSING IN CURSOR
insert into auto_pk values(auto_pk_seq.nextval, 'XXX')
RETURNING ROWID INTO :1
END OF STMT
So you see that the JDBC 3.0 standard is implemented, but you don't get the result you asked for; under the covers, the RETURNING clause is used.
The right approach to get the generated key in Oracle is therefore:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX') returning id into ?")
stmt.registerReturnParameter(1, Types.INTEGER)   // Oracle-specific extension (OraclePreparedStatement)
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getReturnResultSet()    // Oracle-specific
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getLong(1)
}
Note:
Oracle Release 12.1.0.2.0
To activate the 10046 trace, use:
con.createStatement().execute "alter session set events '10046 trace name context forever, level 12'"
con.createStatement().execute "ALTER SESSION SET tracefile_identifier = my_identifier"
Depending on frameworks or libraries to do things that are perfectly possible in plain SQL is bad design IMHO, especially when working against a defined DBMS. (Statement.RETURN_GENERATED_KEYS is relatively innocuous, although it apparently does raise a question for you; but where frameworks are built on separate entities and do all sorts of joins and filters in code, or have custom-built transaction isolation logic, things get inefficient and messy very quickly.)
Why not simply:
PreparedStatement ps= connection.prepareStatement(
"INSERT INTO post (title) VALUES (?) RETURNING id");
Single trip, defined result.
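For completeness, a minimal sketch of how that statement would be used from JDBC (PostgreSQL syntax; the id column name is an assumption):

PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?) RETURNING id");
ps.setString(1, "Hello");
// With PgJDBC, an INSERT ... RETURNING is executed with executeQuery,
// and the generated id comes back in the ordinary result set.
try (ResultSet rs = ps.executeQuery()) {
    while (rs.next()) {
        long id = rs.getLong(1);
    }
}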

Use MySQL Stored Procedure to check for malicious code

I'm attempting to write a stored procedure in MySQL that will take a single parameter and then check that parameter for any text that contains 'DROP', 'INSERT', 'UPDATE', 'TRUNCATE', etc., pretty much anything that isn't a SELECT statement. I know it's not ideal, but unfortunately the SELECT statement is being built client-side, and this is just an added level of security on the server to prevent some kind of man-in-the-middle change.
I've tried several ways of accomplishing this, but it's not working for me. I've come up with things similar to this:
CREATE PROCEDURE `myDatabase`.`execQuery` (IN INC_query TEXT)
BEGIN
  # check to see if the incoming SQL query contains INSERT, DROP, TRUNCATE,
  # or UPDATE as an added measure of security
  IF (
      SELECT LOCATE(LOWER(INC_query),'drop') OR
      SELECT LOCATE(LOWER(INC_query),'truncate') OR
      SELECT LOCATE(LOWER(INC_query),'insert') OR
      SELECT LOCATE(LOWER(INC_query),'update') OR
      SELECT LOCATE(LOWER(INC_query),'set')
      >= 1)
  THEN
    SET @command = INC_query;
    PREPARE statement FROM @command;
    EXECUTE statement;
  ELSE
    SELECT * FROM database.otherTable; # just a generic output to know the procedure executed correctly; will be removed later. Purely testing.
  END IF;
END
Even if it contains any of my "filterable" words, it still executes the query. Any help would be appreciated, and if there's a better way of doing this, I'm all ears.
What if you have a column called updated_at or settings? You can't possibly expect this to work as you intend. This kind of technique is the reason there are so many references to "clbuttic" on the web.
You're really going to make a mess of things if you go down this road.
The only reasonable way to approach this is to send in the parameters for the kind of query you want to construct, then construct the query in your application using a vetted white list of allowed terms. An example expressed in JSON:
{
  "select" : {
    "table" : "users",
    "columns" : [ "id", "name", "DROP TABLE users", "SUM(date)", "password_hash" ],
    "joins" : {
      "orders" : [ "users.id", "orders.user_id" ]
    }
  }
}
You just need to create a query constructor that emits this kind of thing, and another that converts it back into a valid query. You might want to list only particular columns for querying, as certain columns might be secret or internal only, not to be disclosed, like password_hash in this example.
You could also allow for patterns like (SUM|MIN|MAX|AVG)\((\w+)\) to capture specific grouping operations or JOIN conditions. It depends on how far you want to take this.
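A minimal Python sketch of the whitelist idea (table and column names are illustrative; a real builder would also handle joins and the aggregate patterns mentioned above):

# Vetted whitelist: only these tables/columns may appear in generated SQL.
ALLOWED = {
    "users": {"id", "name", "created_at"},  # password_hash deliberately absent
}

def build_select(spec):
    table = spec["table"]
    if table not in ALLOWED:
        raise ValueError("table not allowed: %r" % table)
    # Anything not on the whitelist is dropped, hostile or not.
    cols = [c for c in spec["columns"] if c in ALLOWED[table]]
    if not cols:
        raise ValueError("no allowed columns requested")
    return "SELECT %s FROM `%s`" % (", ".join("`%s`" % c for c in cols), table)

# build_select({"table": "users", "columns": ["id", "DROP TABLE users"]})
# => "SELECT `id` FROM `users`"  (the hostile entry is silently discarded)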

CodeIgniter record won't insert

I'm using CI for the first time and I'm smashing my head against this seemingly simple issue: my query won't insert the record.
In an attempt to debug the problem, the insert code has been simplified, but I'm still getting no joy.
Essentially, I'm using:
$data = array('post_post' => $this->input->post('ask_question'));
$this->db->insert('posts', $data);
I'm getting no errors (although that's possibly due to having disabled them in config/database.php because of another CI-related trauma :-$).
I've used
echo $this->db->last_query();
to get the generated query, shown below:
INSERT INTO `posts` (`post_post`) VALUES ('some text')
I have pasted this query into phpMyAdmin and it inserts no problem. I've even tried using $this->db->query() to run the output query above 'manually', but again, the record will not insert.
The schema of the DB table 'posts' is simply two columns, post_id and post_post.
Please, any pointers on what's going on here would be greatly appreciated... thanks.
OK... solved, after much messing with CI.
Got it to work by setting the persistent connection to false:
$db['default']['pconnect'] = FALSE;
sigh
Things generally look OK; everything you have said suggests that it should work. My first instinct would be to check that what you're inserting is compatible with your SQL field.
Just a cool CI feature; I'd suggest you take a look at the CI Database Transaction class. Transactions allow you to wrap your query/queries inside a transaction, which can be rolled back on failure, and can also make error handling easier:
$this->db->trans_start();
$this->db->query('INSERT INTO posts ...etc ');
$this->db->trans_complete();
if ($this->db->trans_status() === FALSE)
{
// generate an error... or use the log_message() function to log your error
}
Alternatively, one thing you can do is put your INSERT SQL statement into $this->db->query(your_query_here) instead of calling insert(). There is a CI feature called Query Binding which will also auto-escape the data array you pass; see the sketch below.
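A small sketch of query binding, reusing the posts table from the question:

// Each ? is replaced, in order, by the corresponding escaped array value.
$sql = "INSERT INTO posts (post_post) VALUES (?)";
$this->db->query($sql, array($this->input->post('ask_question')));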
Let me know how it goes, and hope this helps!

Processing data with Perl: SELECT ... FOR UPDATE usage with MySQL

I have a table that stores data that needs to be processed; the table has id, status, and data columns. I'm currently selecting id, data where status equals a given number. I'm then doing an update immediately after the select, changing the status number so that the row won't be selected again.
My program is multithreaded, and sometimes two threads grab the same id because they query the table at nearly the same time. I looked into SELECT ... FOR UPDATE; however, I either wrote the query wrong or I'm not understanding what it is used for.
My goal is to find a way of grabbing the id, data that I need and setting the status so that no other thread tries to grab and process the same data. Here is the code I tried. (I wrote it all together for show purposes here; my prepares are set at the beginning of the program so as not to re-prepare every time it's run, just in case anyone was concerned about that.)
my $select = $db->prepare("SELECT _id, data FROM `TestTable` WHERE _status = 4 LIMIT ? FOR UPDATE") or die $DBI::errstr;
if ($select->execute($limit))
{
    while ($data = $select->fetchrow_hashref())
    {
        my $update_status = $db->prepare("UPDATE `TestTable` SET _status = ?, data = ? WHERE _id = ?");
        $update_status->execute(10, "", $data->{_id});
        push(@array_hash, $data);
    }
}
When I run this with multiple threads, I get many duplicate inserts when I try to insert after processing my transaction data.
I'm not terribly familiar with MySQL, and in the research I've done I haven't found anything that really cleared this up for me.
Thanks.
As a sanity check, are you using InnoDB? MyISAM has zero transactional support, aside from faking it with full table locking.
I don't see where you're starting a transaction. MySQL's autocommit option is on by default, so starting a transaction and later committing would be necessary unless you turned off autocommit.
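For illustration, a minimal untested sketch of SELECT ... FOR UPDATE inside an explicit transaction, reusing the question's table and column names:

# FOR UPDATE only locks rows inside a transaction, so autocommit must be off.
$db->begin_work;   # DBI: start a transaction (assumes autocommit was on)
my $select = $db->prepare(
    "SELECT _id, data FROM `TestTable` WHERE _status = 4 LIMIT ? FOR UPDATE");
$select->execute($limit);
while (my $row = $select->fetchrow_hashref) {
    my $update = $db->prepare("UPDATE `TestTable` SET _status = ? WHERE _id = ?");
    $update->execute(10, $row->{_id});   # the row stays locked until commit
    push(@array_hash, $row);
}
$db->commit;   # releases the row locks; other threads now see _status = 10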
It looks like you simply rely on the database locking mechanisms. I googled perl dbi locking and found this:
$dbh->do("LOCK TABLES foo WRITE, bar READ");
$sth->prepare("SELECT x,y,z FROM bar");
$sth2->prepare("INSERT INTO foo SET a = ?");
while (#ary = $sth->fetchrow_array()) {
$sth2->$execute($ary[0]);
}
$sth2->finish();
$sth->finish();
$dbh->do("UNLOCK TABLES");
Not really saying GIYF, as I am also fairly novice at both MySQL and DBI, but perhaps you can find other answers that way.
Another option might be as follows, and this only works if you control all the code accessing the data. You can create a lock column in the table. When your code accesses the table, it does (pseudocode):
if row.lock != 1
    row.lock = 1
    read row
    update row
    row.lock = 0
    next
else
    sleep 1
    redo
Again, though, this trusts that all users/scripts that access this data will agree to follow this policy. If you cannot ensure that, then this won't work; a race-free variant is sketched below.
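One wrinkle: the check-then-set above is racy if two workers test the lock at the same time. A hedged Perl sketch of a race-free variant that claims the row with a single atomic UPDATE (the lock_flag and worker columns are illustrative assumptions, not from the question):

# Atomically claim one unprocessed row; UPDATE ... WHERE lock_flag = 0 can
# succeed for only one session per row, so two workers never claim the same row.
my $claimed = $db->do(
    "UPDATE `TestTable` SET lock_flag = 1, worker = ? WHERE lock_flag = 0 AND _status = 4 LIMIT 1",
    undef, $$);   # $$ = this process's PID, used as a worker tag
if ($claimed && $claimed > 0) {
    my $row = $db->selectrow_hashref(
        "SELECT _id, data FROM `TestTable` WHERE lock_flag = 1 AND worker = ?",
        undef, $$);
    # ... process $row, then set _status and clear lock_flag ...
}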
Anyway, that's all the knowledge I have on the topic. Good luck!