We have a table with 13M rows. Its name and surname fields are nil by default. When we try to push some data, it stops running after 1.2M queries. We loop over 10k rows at a time because of a RAM issue.
The algorithm is:
$i = 0
until $i > 13000 do
  b = Tahsil.where("NO < ?", (10000 * ($i + 1))).offset(10000 * $i)
  b.each do |a|
    a.name = Generator('name')
    a.surname = Generator('surname')
    a.save
  end
  $i += 1
end
Ruby on Rails has some methods built in that you might want to use:
Tahsil.find_each do |tahsil|
  tahsil.update(name: Generator('name'), surname: Generator('surname'))
end
find_each iterates through all records in batches (with a default batch size of 1000). update updates a record.
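If skipping validations and callbacks is acceptable for this one-off backfill (an assumption), a variant like the sketch below also lets you tune the batch size; Generator is the helper from the question:
# A sketch, not the only way to do it: update_columns writes straight to the
# database, bypassing validations and callbacks, which cuts per-record overhead.
Tahsil.find_each(batch_size: 1000) do |tahsil|
  tahsil.update_columns(name: Generator('name'), surname: Generator('surname'))
end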
I have a data frame in PySpark like the one below:
df = spark.createDataFrame(
    [
        ('2021-10-01', 'A', 25),
        ('2021-10-02', 'B', 24),
        ('2021-10-03', 'C', 20),
        ('2021-10-04', 'D', 21),
        ('2021-10-05', 'E', 20),
        ('2021-10-06', 'F', 22),
        ('2021-10-07', 'G', 23),
        ('2021-10-08', 'H', 24)
    ],
    ("RUN_DATE", "NAME", "VALUE"))
Now, using this data frame, I want to update a table in MySQL.
# query to run should be similar to this
update_query = "UPDATE DB.TABLE SET DATE = '2021-10-01', VALUE = 25 WHERE NAME = 'A'"
# mysql_conn is a function which I use to connect to MySQL from PySpark and run queries
# Invoking the function
mysql_conn(host, user_name, password, update_query)
Now, when I invoke the mysql_conn function with these parameters, the query runs successfully and the record gets updated in the MySQL table.
Now I want to run the update statement for all the records in the data frame.
For each NAME it has to pick the RUN_DATE and VALUE, substitute them into update_query, and call mysql_conn.
I think we need a for loop, but I'm not sure how to proceed.
Instead of iterating through the DataFrame with a for loop, it would be better to distribute the workload across the partitions using foreachPartition. Moreover, since you are writing a custom query, instead of executing one query per row it would be more efficient to execute a single batch operation per partition, which reduces round trips, latency, and concurrent connections. E.g.
def update_db(rows):
    temp_table_query = ""
    for row in rows:
        if len(temp_table_query) > 0:
            temp_table_query = temp_table_query + " UNION ALL "
        temp_table_query = temp_table_query + " SELECT '%s' as RUNDATE, '%s' as NAME, %d as VALUE " % (row.RUN_DATE, row.NAME, row.VALUE)
    if len(temp_table_query) == 0:
        # nothing to update for an empty partition
        return
    update_query = """
        UPDATE DBTABLE
        INNER JOIN (
            %s
        ) new_records ON DBTABLE.NAME = new_records.NAME
        SET
            DBTABLE.DATE = new_records.RUNDATE,
            DBTABLE.VALUE = new_records.VALUE
    """ % (temp_table_query)
    mysql_conn(host, user_name, password, update_query)

df.foreachPartition(update_db)
Let me know if this works for you.
I am running a task to import around 1 million orders. I am looping through the data to update it to the values for the new database, and it works fine on my local computer with 8 GB of RAM.
However, when I upload it to my AWS t2.medium instance, it runs through the first 500 thousand rows, but towards the end I start maxing out my memory once it actually starts creating the orders that don't exist yet. I am porting a MySQL database to Postgres.
Am I missing something obvious here?
require 'mysql2' # or require 'pg'
require 'active_record'

def legacy_database
  @client ||= Mysql2::Client.new(Rails.configuration.database_configuration['legacy_production'])
end

desc "import legacy orders"
task orders: :environment do
  orders = legacy_database.query("SELECT * FROM oc_order")
  # init progress bar
  progressbar = ProgressBar.create(:total => orders.count, :format => "%E, \e[0;34m%t: |%B|\e[0m")
  orders.each do |order|
    if [1, 2, 13, 14].include? order['order_status_id']
      payment_method = "wx"
      if order['paid_by'] == "Alipay"
        payment_method = "ap"
      elsif order['paid_by'] == "UnionPay"
        payment_method = "up"
      end
      user_id = User.where(import_id: order['customer_id']).first
      if user_id
        user_id = user_id.id
      end
      order = Order.create(
        # id: order['order_id'],
        import_id: order['order_id'],
        # user_id: order['customer_id'],
        user_id: user_id,
        receiver_name: order['payment_firstname'],
        receiver_address: order['payment_address_1'],
        created_at: order['date_added'],
        updated_at: order['date_modified'],
        paid_by: payment_method,
        order_num: order['order_id']
      )
      # increment progress bar on each save
      progressbar.increment
    end
  end
end
I assume this line, orders = legacy_database.query("SELECT * FROM oc_order"), loads the entire table into memory, which is very inefficient.
You need to iterate over the table in batches. In ActiveRecord there is the find_each method for that, but since you don't use ActiveRecord for the legacy table, you may want to implement your own batch querying using LIMIT and OFFSET.
In order to handle memory efficiently, you can run the MySQL query in batches, as suggested by nattfodd.
There are two ways to achieve it, as per the MySQL documentation:
SELECT * FROM oc_order LIMIT 5,10;
or
SELECT * FROM oc_order LIMIT 10 OFFSET 5;
Both of the queries will return rows 6-15.
You can decide on an offset of your choice and run the queries in a loop until your orders object is empty.
Let us assume you handle 1000 orders at a time; then you'll have something like this:
batch_size = 1000
offset = 0

loop do
  orders = legacy_database.query("SELECT * FROM oc_order LIMIT #{batch_size} OFFSET #{offset}")
  break unless orders.present?

  offset += batch_size

  orders.each do |order|
    ... # your logic of creating new model objects
  end
end
It is also advisable to run your code in production with proper error handling:
begin
  ... # main logic
rescue => e
  ... # handle the error
ensure
  ... # cleanup that should always run
end
Disabling row caching while iterating over the orders collection should reduce the memory consumption:
orders.each(cache_rows: false) do |order|
  # ... create the new Order as before
end
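If legacy_database returns a Mysql2::Client, as in the task above, you can also ask mysql2 to stream the rows instead of buffering the whole result set; a minimal sketch:
# Sketch: :stream fetches rows lazily from the server instead of loading the
# whole result set, and cache_rows: false stops mysql2 from keeping rows it
# has already yielded.
orders = legacy_database.query("SELECT * FROM oc_order",
                               stream: true, cache_rows: false)
orders.each do |order|
  # ... build and save the new Order as in the original task
end
Note that a streamed result can only be iterated once and its size isn't known up front, so the progress bar total would need a separate SELECT COUNT(*) query.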
There is a gem that helps us do this, called activerecord-import.
bulk_orders = []
orders.each do |order|
  bulk_orders << Order.new(
    # id: order['order_id'],
    import_id: order['order_id'],
    # user_id: order['customer_id'],
    user_id: user_id,
    receiver_name: order['payment_firstname'],
    receiver_address: order['payment_address_1'],
    created_at: order['date_added'],
    updated_at: order['date_modified'],
    paid_by: payment_method,
    order_num: order['order_id']
  )
end
Order.import bulk_orders, validate: false
Order.import inserts all of the collected records with a single INSERT statement.
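If building a million Order objects at once is itself too much for memory, the rows can be imported in slices; a sketch, assuming the same attribute mapping as in the task above:
# Sketch: 1000 rows per INSERT is an arbitrary choice; tune it to taste.
orders.each_slice(1000) do |slice|
  bulk_orders = slice.map do |order|
    # build the Order with the same attributes as in the block above
    Order.new(import_id: order['order_id'], order_num: order['order_id'])
  end
  Order.import bulk_orders, validate: false
end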
I want to select X records from the database (in a PHP script), then sleep 60 seconds, then continue with the next 60 results...
SO:
SELECT * FROM TABLE WHERE A = 'B' LIMIT 60
SELECT SLEEP(60);
....
SELECT * FROM TABLE WHERE A = 'B' LIMIT X   -- where X is the next 60 results, then
SELECT SLEEP(60);
AND etc...
How can I achieve this?
There is no such thing as "the next 60 records". SQL tables represent unordered sets. Without an order by, a SQL statement can return a result set in any order -- and even in different orders on different executions.
Hence, you first need something to guarantee the ordering; that is, an ORDER BY with keys that uniquely identify each row.
You can then use offset/limit to accomplish what you want. Or, you could put the code into a stored procedure and use a while loop. Or, you could do this on the application side.
In PHP:
<?php
// obtain the database connection, there's a heap of examples on the net, assuming you're using a library like mysqlite
$offset = 0;
while (true) {
if ($offset == 0) {
$res = $db->query('SELECT * FROM TABLE WHERE A = 'B' LIMIT 60');
} else {
$res = $db->query('SELECT * FROM TABLE WHERE A = 'B' LIMIT ' . $offset . ',60');
}
$rows = $db->fetch_assoc($res);
sleep(60);
if ($offset >= $some_arbitrary_number) {
break;
}
$offset += 60;
}
What you're doing is gradually incrementing the offset by 60 until you reach your limit. The easiest way to do it is a while loop with true as the condition, breaking when you reach your stop condition.
Trying to convert Visual FoxPro code to a set-based MySQL query. The following is the code segment from FoxPro:
lnFound = 0
IF lnFound = 0 .AND. rcResult = "ALL" .AND. PcOpOrIp = "OP"
  SELECT PFile
  LcTag = ORDER()
  SET ORDER TO TAG PtcntlNm
  =SEEK(LcPatientNo)
  SCAN WHILE PtcntlNm = LcPatientNo
    IF GcMResult <= "0"
      GcMResult = "1-7MAT-PTC"
    ENDIF
    IF MONTH(cSRa.Fromdate) = MONTH(pFile.Fromdate) ;
       .AND. pFile.ThruDate >= cSRa.ThruDate
      ** Check From/Thru Date against pFile
      IF (ABS(cSRa.totalchrg) = (pFile.BDeduct+pFile.Deduct+pFile.Coinsur)) .OR. cSRa.Tchrgs = (pFile.BDeduct+pFile.Deduct+pFile.Coinsur) .OR. (ABS(cSRa.totalchrg) = pFile.Total .OR. cSRa.Tchrgs = pFile.Total)
        IF lnFound = 0
          gcRecid = recid
          gcmResult = rcResult
        ENDIF
        lnFound = lnFound + 1
        gcUNrECID = gcunRecid + IIF(EMPTY(gCUNreCID), Recid, [,] + recid)
      ENDIF
    ENDIF
  ENDSCAN
  SELECT PFile
  SET ORDER TO &LcTag
ENDIF
I have a table named pfile which I'm trying to join with another table named csra. The main aim of this is to set the record id (gcrecid) based on the conditions of the three nested IF statements. After the gcrecid variable is set, the lnfound variable is set to one, so the third IF statement's condition is false from the second iteration onwards.
Here is the MySQL statement I came up with, and as you can see I'm not able to completely convert the code in an efficient manner.
UPDATE csra AS cs
JOIN p051331s AS p ON cs.patientno = p.ptcntlnm
SET cs.recid = p.recid
  , cs.mcsult = "ALL"
  , cs.lnfound = '"1"'
WHERE cs.provider = '051331'
  AND cs.lnfound = "0"
  AND cs.RECID IS NULL
  AND month(cs.fromdate) = month(p.fromdate)
  AND p.thrudate >= cs.ThruDate
  AND ABS(cs.totalchrg) = (p.bdeduct + p.deduct + p.coinsur)
   OR cs.tchrgs = (p.bdeduct + p.deduct + p.coinsur)
   OR ABS(cs.totalchrg) = p.total
   OR cs.tchrgs = p.total;
Any lead in this regard will be much appreciated, as I've been working on this procedure for a couple of days with no noticeable results.
According to this partial VFP code (which is not clear about the variables it uses), there is no code to be converted to a set-based form at all. The corresponding MySQL or MS SQL (or any other SQL backend) code would simply be "nothing", i.e. this would be equivalent:
-- Hello to mySQL or MS SQL
PS: In your attempt to convert this to an UPDATE, the inner join with csra is wrong. It is not joined in the VFP code; the csra values are constant (unless a relation on the fields has been set), pointing only to the "current row" values in csra. You would want to turn them into parameters, as with the rest of the memory variables (it is not clear from the code which ones are memory variables).
I'm trying to check 2.5-second intervals for records and add an object to an array based on the count. This way works but it's far too slow. Thanks.
@tweets = Tweet.last(3000)
first_time = @tweets.first.created_at
last_time = @tweets.last.created_at
while first_time < last_time
  group = @tweets.where(created_at: (first_time)..(first_time + 2.5.seconds)).count
  if group == 0 || nil
    puts "0 or nil"
    first_id + 1
    array << {tweets: 0}
  else
    first_id += group
    array << {tweets: group}
  end
  first_time += 2.5.seconds
end
return array.to_json
end
What you really need is the group_by method on the records you've retrieved:
grouped = @tweets.group_by do |tweet|
  # Convert from timestamp to 2.5s interval number
  (tweet.created_at.to_f / 2.5).to_i
end
That returns a hash with the key being the time interval, and the values being an array of tweets.
What you're doing in your example probably has the effect of making thousands of queries. Always watch log/development.log to see what's going on in the background.
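If you then want the array of per-interval counts from the question, you can map the grouped hash; a sketch (intervals with no tweets won't appear as keys, so zero entries would have to be filled in separately):
counts = grouped.sort.map { |_interval, tweets| { tweets: tweets.size } }
counts.to_json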