How to import a large CSV file into MySQL in a Rails application?

I'm implementing CSV import into MySQL in a Rails application. I use CSV.parse to read the CSV file line by line and import the rows into the database. This works well.
But when I deploy to Heroku, each request times out after 30 seconds. If the import takes longer than that, Heroku raises a request timeout error (H12). Can anyone help me find the best way to import a large CSV file? Right now I only import small CSV files of around 70 users; I want to import large files of 500-1000 users. Here is the code:
Import controller:
i = 0
CSV.foreach(params[:file].path, :headers => true) do |row|
  i = i + 1
  if i == 1
    @company = Company.find_or_create_by!(name: row[0])
  end
  @users = User.find_by(email: row[1])
  if @users
    if @company.id == @users.employee.company_id
      render :status => 401, :json => { :message => "Error" }
      return
    else
      render :status => 401, :json => { :message => "Error" }
      return
    end
  else
    # User
    # Generate password
    password = row[2]
    user = User.new(email: row[1])
    user.password = password.downcase
    user.normal_password = password.downcase
    user.skip_confirmation!
    user.save!
    obj = {
      'small' => 'https://' + ENV['AWS_S3_BUCKET'] + '.s3.amazonaws.com/images/' + 'default-profile-pic_30x30.png',
      'medium' => 'https://' + ENV['AWS_S3_BUCKET'] + '.s3.amazonaws.com/images/' + 'default-profile-pic_40x40.png'
    }
    employee = Employee.new(user_id: user.id)
    employee.update_attributes(name: row[3], job_title: row[5], gender: row[9], job_location: row[10], group_name: row[11], is_admin: to_bool(row[13]),
                               is_manager: to_bool(row[14]), is_reviewee: to_bool(row[6]), admin_target: row[7], admin_view_target: row[12], department: row[8],
                               company_id: @company.id, avatar: obj.to_json)
    employee.save!
  end
end
I have tried the gems 'activerecord-import' and 'fastercsv', but 'activerecord-import' did not work, and 'fastercsv' does not work with Ruby 2.0 and Rails 4.0.

Doing this in a controller seems a bit much to me, especially since it's blocking. Have you given any thought to throwing it in a background job?
If I were you I'd:
Upload the file
Parse it in the background as a rake task
Also, have a look at: https://github.com/tilo/smarter_csv
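For instance, here is a minimal sketch of the background-job approach with Sidekiq. The worker name and file handling are assumptions, not from the question; on Heroku the uploaded file must first be stored somewhere the worker can reach (such as S3), since dyno filesystems are ephemeral.
# Hypothetical worker: app/workers/import_users_worker.rb
class ImportUsersWorker
  include Sidekiq::Worker

  def perform(file_path)
    # Move the per-row logic from the controller into here.
    CSV.foreach(file_path, headers: true) do |row|
      # ... create company, users, and employees as in the question ...
    end
  end
end

# In the controller: enqueue and return immediately instead of blocking.
ImportUsersWorker.perform_async(stored_file_path)
render status: 202, json: { message: "Import started" }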

Process your CSV in the background, using products such as delayed_job, sidekiq, or resque. If it fits your use case, you can even do this using guard or cron.

It seems that these lines
if i == 1
  @company = Company.find_or_create_by!(name: row[0])
end
@users = User.find_by(email: row[1])
take up a lot of your 30-second window, since they issue database queries on every row.
I would suggest converting your routine into a Heroku background process using resque or delayed_job, or splitting the routine into n requests, if the code above cannot otherwise be optimized.
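For example, a rough sketch (assuming the CSV layout from the question) that hoists those lookups out of the loop, so each becomes a single query:
rows = CSV.read(params[:file].path, headers: true)
# One query for the company instead of a conditional inside the loop.
company = Company.find_or_create_by!(name: rows.first[0])
# One query for all existing emails instead of a find_by per row.
existing = User.where(email: rows.map { |r| r[1] }).pluck(:email)
rows.each do |row|
  next if existing.include?(row[1])
  # ... build the user and employee as before ...
end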
Hope this helps.

Related

Angr for an HTB challenge, no solution found

I'm new to RE. I'm trying to solve a HackTheBox challenge called RAuth with angr. I understand how to analyze and solve this challenge differently, without angr, but I really want to understand what is wrong with my angr script, or what makes angr unsuitable for this (and similar) cases.
The application is easy, after it starts it's requesting the password from stdin, and exits if the password is incorrect:
>:~/challenges/rauth$ ./rauth
Welcome to secure login portal!
Enter the password to access the system:
wqeqwwqewqewqeqwewqeqweqweqweqweqwewqeqwe
You entered a wrong password!
When I run my script it works for around 15 minutes and dies with one of two errors:
"Killed"
"IndexError: tuple index out of range"
My angr script:
#!/usr/bin/env python
# coding: utf-8
import angr
import claripy
import time
import sys

def is_successful(st):
    output = st.posix.dumps(sys.stdout.fileno())
    if b'Successfully Authenticated' in output:
        return True
    return False

def should_abort(state):
    output = state.posix.dumps(sys.stdout.fileno())
    return b'You entered a wrong password!' in output

def main(round):
    print("Checking " + str(round))
    p = angr.Project('rauth', auto_load_libs=False)
    flag_chars = [claripy.BVS('flag_%d' % i, 8) for i in range(round)]
    flag = claripy.Concat(*flag_chars + [claripy.BVV(b'\n')])
    st = p.factory.full_init_state(
        args=['./rauth'],
        add_options=angr.options.unicorn,
        stdin=angr.SimPackets(name='stdin', content=[(flag, 32)])
        #stdin=flag,
    )
    st.options.add(angr.options.SYMBOL_FILL_UNCONSTRAINED_MEMORY)
    st.options.add(angr.options.SYMBOL_FILL_UNCONSTRAINED_REGISTERS)
    for k in flag_chars:
        st.solver.add(k >= ord(" "))
        st.solver.add(k <= ord("~"))
    sm = p.factory.simulation_manager(st)
    #sm.explore(find=is_successful, avoid=should_abort)
    sm.explore(find=lambda s: b"Successfully" in s.posix.dumps(1),
               avoid=lambda s: b"wrong" in s.posix.dumps(1))
    if len(sm.found) > 0:
        for solution_state in sm.found:  # was ex.found, but ex is undefined
            print("[>>] {!r}".format(solution_state.solver.eval(flag, cast_to=bytes)))
    else:
        print("[>>] no solution found :(")

if __name__ == "__main__":
    print(main(32))
Am I using angr in a case where it's not applicable? What am I missing?
I also tried playing with the options: removing unicorn, enabling auto_load_libs, using or disabling the lambdas, etc. No success.

Can't read JSON file in Ruby on Rails

I am new to Ruby on Rails and I want to read data from a JSON file in a specified directory, but I constantly get an error in chap3 (the action that reads the file):
Errno::ENOENT in TopController#chap3. No such file or directory @ rb_sysopen - links.json.
In the browser console, I get the message
Failed to load resource: the server responded with a status of 500 (Internal Server Error)
How can I fix that?
Code:
require "json"
class TopController < ApplicationController
def index
#message = "おはようございます!"
end
def chap3
data = File.read('links.json')
datahash = JSON.parse(data)
puts datahash.keys
end
def getName
render plain: "名前は、#{params[:name]}"
end
def database
#members = Member.all
end
end
JSON file:
{ "data": [
{"link1": "http://localhost:3000/chap3/a.html"},
{"link2": "http://localhost:3000/chap3/b.html"},
{"link3": "http://localhost:3000/chap3/c.html"},
{"link4": "http://localhost:3000/chap3/d.html"},
{"link5": "http://localhost:3000/chap3/e.html"},
{"link6": "http://localhost:3000/chap3/f.html"},
{"link7": "http://localhost:3000/chap3/g.html"}]}
I would change these two lines in the controller
data = File.read('links.json')
datahash = JSON.parse(data)
to
datahash = JSON.parse(Rails.root.join('app/controllers/links.json').read)
Note: I would consider moving this kind of configuration file into the /config folder and creating a simple Ruby class to handle it. Additionally, you might want to consider paths instead of URLs with a host, because localhost:3000 works in the development environment, but in production you will need to return non-localhost URLs anyway.
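A minimal sketch of that suggestion (the file location and class name are assumptions, not from the question):
# config/links.json (moved out of app/controllers/)
# app/models/links_config.rb -- hypothetical helper class
require "json"

class LinksConfig
  def self.data
    @data ||= JSON.parse(Rails.root.join("config", "links.json").read)
  end
end

# In the controller:
def chap3
  puts LinksConfig.data.keys
end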
To use the contents of the file in the controller:
@data = File.read("#{Rails.root}/app/controllers/links.json")

Updating one table of MySQL with multiple processes via pymysql

Actually, I am trying to update one table with multiple processes via pymysql; each process reads a CSV file split from a huge one in order to speed things up. But I get the Lock wait timeout exceeded; try restarting transaction exception when I run the script. After searching posts on this site, I found one that mentioned MySQL's built-in LOAD DATA INFILE, but it gave no details. How can I do this with pymysql to reach my goal?
---------------------------first edit----------------------------------------
Here's the job method:
import codecs
import csv
import time

import pymysql

def importprogram(path, name):
    begin = time.time()
    print('begin to import program' + name + ' info.')
    # "c:\\sometest.csv"
    file = open(path, mode='rb')
    csvfile = csv.reader(codecs.iterdecode(file, 'utf-8'))
    connection = None
    try:
        connection = pymysql.connect(host='a host', user='someuser', password='somepsd', db='mydb',
                                     cursorclass=pymysql.cursors.DictCursor)
        count = 1
        with connection.cursor() as cursor:
            sql = '''update sometable set Acolumn='{guid}' where someid='{pid}';'''
            next(csvfile, None)
            for line in csvfile:
                try:
                    count = count + 1
                    if ''.join(line).strip():
                        command = sql.format(guid=line[2], pid=line[1])
                        cursor.execute(command)
                    if count % 1000 == 0:
                        print('program' + name + ' cursor execute', count)
                except csv.Error:
                    print('program csv.Error:', count)
                    continue
                except IndexError:
                    print('program IndexError:', count)
                    continue
                except StopIteration:
                    break
    except Exception as e:
        print('program' + name, str(e))
    finally:
        if connection:  # guard: connect() may have failed, leaving connection as None
            connection.commit()
            connection.close()
        file.close()
        print('program' + name + ' info done. time cost:', time.time() - begin)
And the multi-processing method:
import multiprocessing as mp

def multiproccess():
    pool = mp.Pool(3)
    results = []
    paths = ['C:\\testfile01.csv', 'C:\\testfile02.csv', 'C:\\testfile03.csv']
    name = 1
    for path in paths:
        results.append(pool.apply_async(importprogram, args=(path, str(name))))
        name = name + 1
    print([result.get() for result in results])  # .get() also re-raises worker exceptions
    pool.close()
    pool.join()
And the main method:
if __name__ == '__main__':
    multiproccess()
I am new to Python. Where does the code, or the approach itself, go wrong? Should I use a single process to do the reading and importing instead?
Your issue is that you are exceeding the time the server allows a statement to wait for its locks, so the query is automatically timing out.
In my experience: adjust the wait timeout to something like 6000 seconds, combine the files into one CSV, and just leave the data to import. Also, I would recommend running the query directly in MySQL rather than from Python.
The way I usually import CSV data from Python into MySQL is with the INSERT ... VALUES ... method, and I only do so when some kind of manipulation of the data is required (i.e. inserting different rows into different tables).
I like your approach and understand your thinking, but in reality there is no need. The benefit of the INSERT ... VALUES ... method is that you won't run into any timeout issues.
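Since the question asks specifically about LOAD DATA INFILE, here is a rough sketch of that route from pymysql: load the CSV into a staging table, then apply one set-based UPDATE. The staging column names are placeholders, and local_infile must be enabled on both the client and the server.
import pymysql

connection = pymysql.connect(host='a host', user='someuser', password='somepsd',
                             db='mydb', local_infile=True)
try:
    with connection.cursor() as cursor:
        # Staging table shaped like the CSV (placeholder column names).
        cursor.execute("CREATE TEMPORARY TABLE staging "
                       "(col0 TEXT, someid VARCHAR(64), guid VARCHAR(64))")
        cursor.execute("""
            LOAD DATA LOCAL INFILE 'c:/sometest.csv'
            INTO TABLE staging
            FIELDS TERMINATED BY ','
            LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
        """)
        # One set-based UPDATE instead of one UPDATE per CSV row.
        cursor.execute("""
            UPDATE sometable t
            JOIN staging s ON s.someid = t.someid
            SET t.Acolumn = s.guid
        """)
    connection.commit()
finally:
    connection.close()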

Ruby: Handling different JSON response that is not what is expected

I searched online and read through the documentation, but have not been able to find an answer. I am fairly new to this, and as part of learning Ruby I wanted to make the script below.
The Script essentially does a Carrier Lookup on a list of numbers that are provided through a CSV file. The CSV file has just one row with the column header "number".
Everything runs fine UNTIL the API gives me an output that is different from the others. In this example, it tells me that one of the numbers in my file is not a valid US number. This then causes my script to stop running.
I am looking to see if there is a way to either ignore it (I read about begin and end blocks, but was not able to get them to work) or, ideally, either create a separate file with those errors or just put the data into the main file.
Any help would be much appreciated. Thank you.
Ruby Code:
require 'csv'
require 'uri'
require 'net/http'
require 'json'
number = 0
CSV.foreach('data1.csv', headers: true) do |row|
  number = row['number'].to_i
  uri = URI("https://api.message360.com/api/v3/carrier/lookup.json?PhoneNumber=#{number}")
  req = Net::HTTP::Post.new(uri)
  req.basic_auth 'XXX', 'XXX'
  res = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => true) { |http|
    http.request(req)
  }
  json = JSON.parse(res.body)
  new = json["Message360"]["Carrier"].values
  CSV.open("new.csv", "ab") do |csv|
    csv << new
  end
end
File Data:
number
5556667777
9998887777
Example of a good response (shown as the parsed Ruby hash):
{"Message360"=>{"ResponseStatus"=>1, "Carrier"=>{"ApiVersion"=>"3", "CarrierSid"=>"XXX", "AccountSid"=>"XXX", "PhoneNumber"=>"+19495554444", "Network"=>"Cellco Partnership dba Verizon Wireless - CA", "Wireless"=>"true", "ZipCode"=>"92604", "City"=>"Irvine", "Price"=>0.0003, "Status"=>"success", "DateCreated"=>"2018-05-15 23:05:15"}}}
The response that causes Script to stop:
{
  "Message360": {
    "ResponseStatus": 0,
    "Errors": {
      "Error": [
        {
          "Code": "ER-M360-CAR-111",
          "Message": "Allowed Only Valid E164 North American Numbers.",
          "MoreInfo": []
        }
      ]
    }
  }
}
It would appear you can just check json["Message360"]["ResponseStatus"] first; 0 or 1 indicates failure or success.
I'd probably also add a rescue to help catch any other errors (malformed JSON, a network issue, etc.):
CSV.foreach('data1.csv', headers: true) do |row|
  number = row['number'].to_i
  ...
  json = JSON.parse(res.body)
  if json["Message360"]["ResponseStatus"] == 1
    new = json["Message360"]["Carrier"].values
    CSV.open("new.csv", "ab") do |csv|
      csv << new
    end
  else
    # handle bad response
  end
rescue StandardError => e
  # request failed for some reason, log e and the number?
end
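If you'd rather collect the failures in a separate file, as you mentioned, the "handle bad response" branch could append to an errors CSV (the file name is an assumption):
# Inside the else branch above:
CSV.open("errors.csv", "ab") do |csv|
  message = json.dig("Message360", "Errors", "Error", 0, "Message")
  csv << [number, message]
end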

Inserting data from CSV into MySQL DB is very slow

I'm trying to insert data from a CSV file into a MySQL DB using Ruby, and it's very slow. Note that this is not a Rails application, just a stand-alone Ruby script.
Here is my code:
def add_record(data1, data2, time)
  date = DateTime.strptime(time, "%m/%d/%y %H:%M")
  <my table>.create(data1: data1, data2: data2, time: date)
end

def parse_file(file)
  path = @folder + "\\" + file
  CSV.foreach(path, {headers: :first_row}) do |line|
    add_record(line[4], line[5], line[0])
  end
end

def analyze_data()
  Dir.foreach @folder do |file|
    next if file == '.' or file == '..'
    parse_file file
  end
end
And my connection:
@connection = ActiveRecord::Base.establish_connection(
  :adapter  => "mysql2",
  :host     => "localhost",
  :database => <db>,
  :username => "root",
  :password => <pw>
)
Any help appreciated.
Use LOAD DATA INFILE.
Here is a nice article on performance and strategies titled Testing the Fastest Way to Import a Table into MySQL. Don't let the MySQL version in the title or in the article scare you away. Jumping to the bottom and picking up some conclusions:
The fastest way you can import a table into MySQL without using raw files is the LOAD DATA syntax. Use parallelization for InnoDB for better results, and remember to tune basic parameters like your transaction log size and buffer pool. Careful programming and importing can make a >2-hour problem become a 2-minute process. You can temporarily disable some security features for extra performance.
You might just find your times greatly reduced.
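As a rough sketch of what that could look like through this script's existing connection: the file path, table name, and column order below are assumptions based on the line[0], line[4], line[5] usage above, and the mysql2 connection must be created with local_infile: true.
# my_table is a placeholder for <my table>; CSV columns beyond the three
# used are skipped via throwaway user variables.
conn = ActiveRecord::Base.connection.raw_connection
conn.query(<<-SQL)
  LOAD DATA LOCAL INFILE '/path/to/data.csv'
  INTO TABLE my_table
  FIELDS TERMINATED BY ','
  IGNORE 1 LINES
  (@time_raw, @skip1, @skip2, @skip3, @data1, @data2)
  SET data1 = @data1,
      data2 = @data2,
      time  = STR_TO_DATE(@time_raw, '%m/%d/%y %H:%i')
SQL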
Use the zdennis/activerecord-import gem. You can insert tons of records quickly.
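A minimal sketch of that, adapted to the script above (Record is a hypothetical stand-in for <my table>, and path is the same CSV path as before):
require 'csv'
require 'activerecord-import'

rows = []
CSV.foreach(path, headers: :first_row) do |line|
  rows << [line[4], line[5], DateTime.strptime(line[0], "%m/%d/%y %H:%M")]
end
# One multi-row INSERT instead of one INSERT per record.
Record.import [:data1, :data2, :time], rows, validate: false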