Ruby on Rails: populate fake data extremely quickly - mysql

I imagine it could be very easy, for lazy people like me, to populate a db with fake data using just one rake (terminal) command.
I know about Faker, Populator and others, but all of them, as far as I can see, require writing some (primitive, but still) code to make the data more human friendly (defining the type of random data directly and manually: emails, names, prices and so on).
That makes sense in most cases, but in my case it would be enough to fill MySQL varchar fields with any string, text fields with any long text, int fields with numbers, and so on.
Any suggestions?

If speed is your aim, you should do two things:
Use an in-memory database for your tests until you get to acceptance testing. In other words, consider something like SQLite for your integration tests (some might say unit tests) rather than MySQL.
Use Factory Girl to generate your fake data. Apparently the data created by tools like that makes more sense than you'd prefer, but it is weird to me that you care about that. Regardless, it is a lot faster to use existing tools than to write code that generates gibberish just because you don't want data that looks "too good."
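For illustration, a minimal factory_girl sketch; the User model and its name/email/age columns are assumptions, not something from the question:

# spec/factories/users.rb (or any file factory_girl loads)
FactoryGirl.define do
  factory :user do
    sequence(:email) { |n| "user#{n}@example.com" }
    name "John Doe"
    age  { rand(18..80) }
  end
end

# then, e.g. in db/seeds.rb or a rake task:
FactoryGirl.create_list(:user, 100)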

Some example code that shows how to fill every table with generic data:
SKIP_COLUMNS = %w(id created_at updated_at)
RECORDS_COUNT = 10

# random data to fill
int     = rand(1..100)
varchar = 'lorem'
text    = 'big lorem'

# get models: map table names to model class names, dropping any
# table that has no matching constant
models = ActiveRecord::Base.connection.tables.collect { |t| t.underscore.singularize.camelize }
models = models.select { |m| m.constantize rescue nil }

# fill in data: build each record with every supported column
# populated, then save it
models.map(&:constantize).each do |model|
  RECORDS_COUNT.times do
    record = model.new
    model.columns_hash.each do |name, column|
      next if SKIP_COLUMNS.include?(name)
      case column.type
      when :integer then record.send("#{name}=", int)
      when :string  then record.send("#{name}=", varchar)
      when :text    then record.send("#{name}=", text)
      end
    end
    record.save!
  end
end
You can put that into a rake task.
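For instance, a sketch of such a task; the namespace, task name, and file path are arbitrary:

# lib/tasks/fake_data.rake
namespace :db do
  desc 'Fill every table with placeholder data'
  task :populate_fake => :environment do
    # paste the snippet above here; depending on :environment loads
    # the app so the models and the DB connection are available
  end
end

Then run it with rake db:populate_fake.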

Related

Django: Is there a way to efficiently bulk get_or_create()

I need to import a database (given in JSON format) of papers and authors.
The database is very large (194 million entries) so I am forced to use Django's bulk_create() method.
To load the authors for the first time I use the following script:
def load_authors(paper_json_entries: List[Dict[str, any]]):
    authors: List[Author] = []
    for paper_json in paper_json_entries:
        for author_json in paper_json['authors']:
            # len != 0 is needed as a few authors don't have an id
            if len(author_json['ids']) and not Author.objects.filter(author_id=author_json['ids'][0]).exists():
                authors.append(Author(author_id=author_json['ids'][0], name=author_json['name']))
    Author.objects.bulk_create(set(authors))
However, this is much too slow.
The bottleneck is this query:
and not Author.objects.filter(author_id=author_json['ids'][0]).exists():
Unfortunately I have to make this query, because of course one author can write multiple papers and otherwise there will be a key conflict.
Is there a way to implement something like the normal get_or_create() efficiently with bulk_create?
To avoid creating entries with existing unique keys, you can enable the ignore_conflicts parameter:
def load_authors(paper_json_entries: List[Dict[str, any]]):
    Author.objects.bulk_create(
        (
            Author(author_id=author_json['ids'][0], name=author_json['name'])
            for paper_json in paper_json_entries
            for author_json in paper_json['authors']
        ),
        ignore_conflicts=True
    )

Does Statement.RETURN_GENERATED_KEYS generate any extra round trip to fetch the newly created identifier?

JDBC allows us to fetch the value of a primary key that is automatically generated by the database (e.g. IDENTITY, AUTO_INCREMENT) using the following syntax:
PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?)",
    Statement.RETURN_GENERATED_KEYS
);
ps.setString(1, "Post 1");
ps.executeUpdate();

ResultSet resultSet = ps.getGeneratedKeys();
while (resultSet.next()) {
    LOGGER.info("Generated identifier: {}", resultSet.getLong(1));
}
I'm interested in whether the Oracle, SQL Server, PostgreSQL, or MySQL driver uses a separate round trip to fetch the identifier, or whether there is a single round trip that executes the insert and fetches the ResultSet automatically.
It depends on the database and driver.
Although you didn't ask for it, I will answer for Firebird ;). In Firebird/Jaybird the retrieval itself doesn't require extra roundtrips, but using Statement.RETURN_GENERATED_KEYS or the integer array version will require three extra roundtrips (prepare, execute, fetch) to determine the columns to request (I still need to build a form of caching for it). Using the version with a String array will not require extra roundtrips (I would love to have RETURNING * like in PostgreSQL...).
In PostgreSQL with PgJDBC there is no extra round-trip to fetch generated keys.
It sends a Parse/Describe/Bind/Execute message series followed by a Sync, then reads the results including the returned result-set. There's only one client/server round-trip required because the protocol pipelines requests.
However, sometimes batches that could otherwise be streamed to the server may be broken up into smaller chunks or run one by one if generated keys are requested. To avoid this, use the String[] form where you name the columns you want returned, and name only columns of fixed-width data types like integer. This only matters for batches, and it's due to a design problem in PgJDBC.
(I posted a patch to add batch pipelining support in libpq that doesn't have that limitation, it'll do one client/server round trip for arbitrary sized batches with arbitrary-sized results, including returning keys.)
MySQL receives the generated key(s) automatically in the OK packet of the protocol in response to executing a statement. There is no communication overhead when requesting generated keys.
In my opinion, even for such a trivial thing, a single approach that works across all database systems will fail. The only pragmatic solution is, in analogy to Hibernate, to find the best working solution for each target RDBMS and call it a dialect of your one-for-all solution :)
Here is the information for Oracle. I'm using a sequence to generate the key; the same behavior is observed for an IDENTITY column.
create table auto_pk (
  id  number,
  pad varchar2(100)
);
This works and uses only one round trip:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX')",
    Statement.RETURN_GENERATED_KEYS)
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getGeneratedKeys()
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getString(1)
}
But unfortunately you get the ROWID as a result, not the generated key.
How is it implemented internally? You can see it if you activate a 10046 trace (by the way, this is also the best way to see how many round trips were performed):
PARSING IN CURSOR
insert into auto_pk values(auto_pk_seq.nextval, 'XXX')
RETURNING ROWID INTO :1
END OF STMT
So you see the JDBC 3.0 standard is implemented, but you don't get the requested result. Under the covers, the RETURNING clause is used.
The right approach to get the generated key in Oracle is therefore:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX') returning id into ?")
stmt.registerReturnParameter(1, Types.INTEGER)  // Oracle JDBC extension
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getReturnResultSet()   // Oracle JDBC extension
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getLong(1)
}
Note:
Oracle Release 12.1.0.2.0
To activate the 10046 trace use
con.createStatement().execute "alter session set events '10046 trace name context forever, level 12'"
con.createStatement().execute "ALTER SESSION SET tracefile_identifier = my_identifier"
Depending on frameworks or libraries to do things that are perfectly possible in plain SQL is bad design IMHO, especially when working against a specific DBMS. (Statement.RETURN_GENERATED_KEYS is relatively innocuous, although it apparently does raise a question for you; but where frameworks are built on separate entities and do all sorts of joins and filters in code, or have custom-built transaction isolation logic, things get inefficient and messy very quickly.)
Why not simply:
PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?) RETURNING id");
Single trip, defined result.

Rails best way to add a huge number of records

I've got to add something like 25000 records to the database at once in Rails.
I have to validate them, too.
Here is what I have for now:
# controller create action
def create
  emails = params[:emails][:list].split("\r\n")
  @created_count  = 0
  @rejected_count = 0
  inserts = []
  emails.each do |email|
    @email = Email.new(:email => email)
    if @email.valid?
      @created_count += 1
      inserts.push "('#{email}', '#{Date.today}', '#{Date.today}')"
    else
      @rejected_count += 1
    end
  end
  return if emails.empty?
  sql = "INSERT INTO `emails` (`email`, `updated_at`, `created_at`) VALUES #{inserts.join(", ")}"
  Email.connection.execute(sql) unless inserts.empty?
  redirect_to new_email_path, :notice => "Successfully created #{@created_count} emails, rejected #{@rejected_count}"
end
It's VERY slow now; there's no way to add that number of records because of the timeout.
Any ideas? I'm using MySQL.
Three things come to mind:
Use proper tools like zdennis/activerecord-import or jsuchal/activerecord-fast-import (see the sketch after this answer). The problem with your example is that you will also create 25000 ActiveRecord objects. If you tell activerecord-import not to use validations, it will not create new objects (activerecord-import/wiki/Benchmarks).
Importing tens of thousands of rows into a relational database will never be super fast; it should be done asynchronously via a background process. There are also tools for that, like DelayedJob and more: https://www.ruby-toolbox.com/
Move the code that belongs to the model out of the controller.
And after that, you need to rethink the flow of this part of the application. If you're using background processing inside a controller action like create, you cannot simply return HTTP 201 or HTTP 200. You need to return a quick HTTP 202 Accepted and provide a link to another representation where the user can check the status of their request (is there a success response yet? how many emails failed?), since it is now being processed in the background.
It can sound a bit complicated, and it is, which is a sign that maybe you shouldn't do it like that. Why do you have to add 25000 records in one request? What's the background?
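For illustration, a minimal sketch of the first point, assuming the activerecord-import gem and the asker's Email model; it would replace the loop in the create action:

# Gemfile: gem 'activerecord-import'
emails  = params[:emails][:list].split("\r\n")
records = emails.map { |e| Email.new(:email => e) }

# :validate => true runs the model validations per record;
# pass false for the fastest single-statement insert
Email.import records, :validate => true

activerecord-import collapses this into one (or a few) multi-row INSERT statements instead of 25000 separate ones.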
Why don't you create a rake task for the work? The following link explains it pretty well.
http://www.ultrasaurus.com/sarahblog/2009/12/creating-a-custom-rake-task/
In a nutshell, once you write your rake task, you can kick off the work by:
rake member:load_emails
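A sketch of what such a task might look like; the file path, task name, and the emails.txt source are assumptions for illustration:

# lib/tasks/member.rake
namespace :member do
  desc 'Load emails from a newline-separated file'
  task :load_emails => :environment do
    File.foreach('emails.txt') do |line|
      Email.create(:email => line.strip)
    end
  end
end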
If speed is your concern, I'd attack the problem from a different angle.
1. Create a table that copies the structure of your emails table; let it be emails_copy. Don't copy indexes and constraints.
2. Import the 25k records into it using your database's fast import tools. Consult your DB docs or see e.g. this answer for MySQL. You will have to prepare the input file, but it's way faster; I suppose you already have the data in some text or tabular form.
3. Create indexes and constraints for emails_copy to mimic the emails table. Constraint violations, if any, will surface; fix them.
4. Validate the data inside the table. It may take a few raw SQL statements to check for severe errors. You don't have to validate emails for anything but a very simple format anyway. Maybe all your validation could be done against the text you'll use for import.
5. Run insert into emails select * from emails_copy to put the emails into the production table (you might have to play with it a bit to get auto-increment IDs right). A sketch of steps 2 and 5 follows below.
6. Once you're positive that the process succeeded, drop table emails_copy.
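A rough sketch of steps 2 and 5 from Ruby; the file path and the LOAD DATA options are assumptions, and LOCAL infile must be enabled on both the MySQL client and server:

conn = ActiveRecord::Base.connection

# step 2: bulk-load the prepared file into the staging table
conn.execute(<<-SQL)
  LOAD DATA LOCAL INFILE '/tmp/emails.txt'
  INTO TABLE emails_copy (email)
  SET created_at = NOW(), updated_at = NOW()
SQL

# step 5: move the rows into the production table
conn.execute("INSERT INTO emails (email, created_at, updated_at) SELECT email, created_at, updated_at FROM emails_copy")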

How to instruct Rails to generate the correct SQL on uniqueness validation when case insensitive

Assume Rails 3 with a MySQL DB with case-insensitive collation.
What's the story:
Rails allows you to validate an attribute of a Model with the "uniqueness" validator. BUT the default comparison is CASE SENSITIVE according to Rails documentation.
Which means that on validation it executes SQL like the following:
SELECT 1 FROM `Users` WHERE (`Users`.`email` = BINARY 'FOO@email.com') LIMIT 1
This works completely wrong for me, since my DB has CI collation. It will consider 'FOO@email.com' valid even if there is another user with 'foo@email.com' already in the Users table. In other words, if the user of the application tries to create a new User with the email 'FOO@email.com', this would be completely VALID (by default) for Rails, and an INSERT would be sent to the db. If you do not happen to have a unique index on email, you are boomed: the row will be inserted without problem. If you do happen to have a unique index, an exception will be thrown.
Ok. Rails says: since your DB has case insensitive collation, carry out a case insensitive uniqueness validation.
How is this done? It says that you can override the default comparison sensitivity by setting ":case_sensitive => false" on the particular attribute's validator. On validation, it creates the following SQL:
SELECT 1 FROM `Users` WHERE (LOWER(`Users`.`email`) = LOWER('FOO@email.com')) LIMIT 1
which is a DISASTER on a Users table that you have designed with a unique index on the email field, because it DOES NOT USE the index; it does a full table scan.
I now see that the LOWER functions in the SQL are inserted by the UniquenessValidator of ActiveRecord (file uniqueness.rb, module ActiveRecord, module Validations, class UniquenessValidator). Here is the piece of code that does this:
if value.nil? || (options[:case_sensitive] || !column.text?)
  sql = "#{sql_attribute} #{operator}"
else
  sql = "LOWER(#{sql_attribute}) = LOWER(?)"
end
So the question goes to Rails/ActiveRecord and not to the MySQL adapter.
QUESTION: Is there a way to tell Rails to pass the requirement about uniqueness validation case sensitivity to MySQL adapter and not be 'clever' about it to alter the query? OR
QUESTION REPHRASED FOR CLARIFICATION: Is there another way to implement uniqueness validation on an attribute (PLEASE, CAREFUL...I AM NOT TALKING ABOUT e-mail ONLY, e-mail was given as an example) with case sensitivity OFF and with generation of a query that will use a simple unique index on the corresponding column?
These two questions are equivalent. I hope that now, I make myself more clear in order to get more accurate answers.
Validate uniqueness without regard to case
If you want to stick to storing email in upper or lower case then you can use the following to enforce uniqueness regardless of case:
validates_uniqueness_of :email, case_sensitive: false
(Also see this question:
Rails "validates_uniqueness_of" Case Sensitivity)
Remove the issue of case altogether
Rather than doing a case insensitive match, why not downcase the email before validating (and therefore also before saving):
before_validation {self.email = email.downcase}
Since case is irrelevant to email, this will simplify everything that you do as well, and will head off any future comparisons or database searches you might be doing.
I have searched around, and the only answer that I find acceptable, to my knowledge today, is to create a validation method that does the correct query and checks. In other words, stop using :uniqueness => true and do something like the following:
class User
  validate :email_uniqueness

  protected

  def email_uniqueness
    entries = User.where('email = ?', email)
    if entries.count >= 2 || (entries.count == 1 && (new_record? || entries.first.id != self.id))
      errors[:email] << _('already taken')
    end
  end
end
This will definitely use my index on email and works both on create and update (or at least it does up to the point that I have tested that ok).
After asking on the RubyOnRails Core Google group
I have taken the following answer from the RubyOnRails Core Google Group: Rails fixes this problem in 3.2. Read this:
https://github.com/rails/rails/commit/c90e5ce779dbf9bd0ee53b68aee9fde2997be123
Workaround
If you want a case-insensitive comparison do:
SELECT 1 FROM Users WHERE (Users.email LIKE 'FOO@email.com') LIMIT 1;
LIKE without wildcards always works like a case-insensitive =.
= can be either case sensitive or case-insensitive depending on various factors (casting, charset...)
Starting with http://guides.rubyonrails.org/active_record_querying.html#finding-by-sql, then adding the input from @Johan and @PanayotisMatsinopoulos, plus http://guides.rubyonrails.org/active_record_validations_callbacks.html#custom-methods and http://www.w3schools.com/sql/sql_like.asp, we have this:
class User < ActiveRecord::Base
  validate :email_uniqueness

  protected

  def email_uniqueness
    like_emails = User.where("email LIKE ?", email)
    if like_emails.count >= 2 ||
       (like_emails.count == 1 && (new_record? || like_emails.first.id != self.id))
      errors[:email] << _('already taken')
    end
  end
end
validates :email, uniqueness: {case_sensitive: false}
Works like a charm in Rails 4.1.0.rc2
;)
After fighting with the MySQL binary modifier, I found a way to remove that modifier from all queries that compare fields (not limited to uniqueness validation, but including it).
First: why is that binary modifier added? Because by default MySQL compares fields in a case-insensitive way.
Second: should I care? I have always designed my systems on the assumption that string comparisons are case-insensitive, so to me this is a desired feature. Be warned if you haven't.
This is where the binary modifier is added:
https://github.com/rails/rails/blob/ee291b9b41a959e557b7732100d1ec3f27aae4f8/activerecord/lib/active_record/connection_adapters/abstract_mysql_adapter.rb#L545
def case_sensitive_modifier(node)
  Arel::Nodes::Bin.new(node)
end
So I override this. I create an initializer (in config/initializers) named "mysql-case-sensitive-override.rb" with this code:
# mysql-case-sensitive-override.rb
class ActiveRecord::ConnectionAdapters::AbstractMysqlAdapter < ActiveRecord::ConnectionAdapters::AbstractAdapter
  def case_sensitive_modifier(node)
    node
  end
end
And that's it. No more binary modifier on my queries :D
Please notice that this does not explain why the "{case_sensitive: false}" option of the validator doesn't work, and it does not solve that. It changes the default-and-unoverridable case-sensitive behavior to a default-and-unoverridable case-insensitive behavior. I must insist: this also changes any comparison that actually uses the binary modifier for case-sensitive behavior (I hope).

Multiple word searching with Ruby, and MySQL

I'm trying to accomplish multiple word searching in a quotes database using Ruby, ActiveRecord, and MySQL. The way I did it is shown below, and it is working, but I would like to know if there is a better way to do it.
# receives a string, splits it into an array of words, creates the
# 'conditions' query, and sends it to ActiveRecord
def search
  query = params[:query].strip.split if params[:query]
  like = "quote LIKE "
  conditions = ""
  query.each do |word|
    conditions += (like + "'%#{word}%'")
    conditions += " AND " unless query.last == word
  end
  @quotes = Quote.all(:conditions => conditions)
end
I would like to know if there is a better way to compose this 'conditions' string. I also tried string interpolation, e.g. using the * operator, but ended up needing even more string processing. Thanks in advance.
First, I strongly encourage you to move the model's logic into the model. Instead of putting the search logic in the controller, create a #search method in your Quote model.
class Quote
  def self.search(query)
    ...
  end
end
and your controller becomes
# receives a string and delegates to the model, which splits it into
# words, creates the 'conditions' query, and sends it to ActiveRecord
def search
  @quotes = Quote.search(params[:query])
end
Now, back to the original problem. Your existing search logic makes one very bad mistake: it directly interpolates values, opening your code to SQL injection. Assuming you use Rails 3, you can take advantage of the new #where syntax.
class Quote
  def self.search(query)
    words = query.to_s.strip.split
    words.inject(scoped) do |combined_scope, word|
      combined_scope.where("quote LIKE ?", "%#{word}%")
    end
  end
end
It's a little bit of an advanced topic. If you want to understand what the combined_scope + inject does, I recommend you read the article The Skinny on Scopes.
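For illustration, roughly what a call to that scope generates; the exact SQL shown is an assumption based on Rails 3 behavior with the MySQL adapter:

Quote.search("foo bar").to_sql
# => SELECT `quotes`.* FROM `quotes`
#    WHERE (quote LIKE '%foo%') AND (quote LIKE '%bar%')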
MySQL fulltext search was not working for me, so the best way to do this:
class Quote
  def self.search_by_quote(query)
    words = query.to_s.strip.split
    # one placeholder per word so values are escaped (avoids SQL injection)
    sql = words.map { "quote LIKE ?" }.join(" AND ")
    self.where(sql, *words.map { |word| "%#{word}%" })
  end
end
The better way to do it would be to implement full text searching. You can do this in MySQL but I would highly recommend Solr. There are many resources online for implementing Solr within rails but I would recommend Sunspot as an entrance point.
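A minimal Sunspot sketch, assuming the sunspot_rails gem and a running Solr instance; none of this is from the original answer:

class Quote < ActiveRecord::Base
  searchable do
    text :quote
  end
end

# querying; results are ranked by relevance rather than requiring
# every word to match, as the LIKE approach does
search  = Quote.search { fulltext params[:query] }
@quotes = search.results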
Create a FULLTEXT index in MySQL. With that, you can leave the string processing to MySQL.
Example: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
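For illustration, a sketch of that route from Rails; the migration and index names are assumptions, and note that before MySQL 5.6 FULLTEXT indexes require MyISAM tables:

class AddFulltextIndexToQuotes < ActiveRecord::Migration
  def up
    execute "CREATE FULLTEXT INDEX index_quotes_on_quote ON quotes (quote)"
  end

  def down
    execute "DROP INDEX index_quotes_on_quote ON quotes"
  end
end

# then searching leans on MySQL instead of string processing in Ruby:
Quote.where("MATCH(quote) AGAINST (?)", params[:query])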