Batch insert in ScalikeJDBC is slow on a remote computer - MySQL

I am trying to insert into a table in batches of 100 (I heard that's the best batch size to use with MySQL). I use Scala 2.10.4 with sbt 0.13.6, and the JDBC framework I am using is ScalikeJDBC with HikariCP. My connection settings look like this:
val dataSource: DataSource = {
  val ds = new HikariDataSource()
  ds.setDataSourceClassName("com.mysql.jdbc.jdbc2.optional.MysqlDataSource")
  ds.addDataSourceProperty("url", "jdbc:mysql://" + org.Server.GlobalSettings.DB.mySQLIP + ":3306?rewriteBatchedStatements=true")
  ds.addDataSourceProperty("autoCommit", "false")
  ds.addDataSourceProperty("user", "someUser")
  ds.addDataSourceProperty("password", "not my password")
  ds
}
ConnectionPool.add('review, new DataSourceConnectionPool(dataSource))
The insert code:
try {
  implicit val session = AutoSession
  val paramList: scala.collection.mutable.ListBuffer[Seq[(Symbol, Any)]] = scala.collection.mutable.ListBuffer[Seq[(Symbol, Any)]]()
.
.
.
  for (rev <- reviews) {
    paramList += Seq[(Symbol, Any)](
      'review_id -> rev.review_idx,
      'text -> rev.text,
      'category_id -> rev.category_id,
      'aspect_id -> aspectId,
      'not_aspect -> noAspect /*0*/ ,
      'certainty_aspect -> rev.certainty_aspect,
      'sentiment -> rev.sentiment,
      'sentiment_grade -> rev.certainty_sentiment,
      'stars -> rev.stars
    )
  }
.
.
.
  try {
    if (paramList != null && paramList.length > 0) {
      val result = NamedDB('review) localTx { implicit session =>
        sql"""INSERT INTO `MasterFlow`.`classifier_results`
              (
                `review_id`,
                `text`,
                `category_id`,
                `aspect_id`,
                `not_aspect`,
                `certainty_aspect`,
                `sentiment`,
                `sentiment_grade`,
                `stars`)
              VALUES
              ( {review_id}, {text}, {category_id}, {aspect_id},
                {not_aspect}, {certainty_aspect}, {sentiment}, {sentiment_grade}, {stars})
           """
          .batchByName(paramList.toIndexedSeq: _*)/*.__resultOfEnsuring*/
          .apply()
      }
Each time I insert a batch it takes 15 seconds. My logs:
29/10/2014 14:03:36 - DEBUG[Hikari Housekeeping Timer (pool HikariPool-0)] HikariPool - Before cleanup pool stats HikariPool-0 (total=10, inUse=1, avail=9, waiting=0)
29/10/2014 14:03:36 - DEBUG[Hikari Housekeeping Timer (pool HikariPool-0)] HikariPool - After cleanup pool stats HikariPool-0 (total=10, inUse=1, avail=9, waiting=0)
29/10/2014 14:03:46 - DEBUG[default-akka.actor.default-dispatcher-3] StatementExecutor$$anon$1 - SQL execution completed
[SQL Execution]
INSERT INTO `MasterFlow`.`classifier_results` ( `review_id`, `text`, `category_id`, `aspect_id`, `not_aspect`, `certainty_aspect`, `sentiment`, `sentiment_grade`, `stars`) VALUES ( ...can't show this....);
INSERT INTO `MasterFlow`.`classifier_results` ( `review_id`, `text`, `category_id`, `aspect_id`, `not_aspect`, `certainty_aspect`, `sentiment`, `sentiment_grade`, `stars`) VALUES ( ...can't show this....);
.
.
.
INSERT INTO `MasterFlow`.`classifier_results` ( `review_id`, `text`, `category_id`, `aspect_id`, `not_aspect`, `certainty_aspect`, `sentiment`, `sentiment_grade`, `stars`) VALUES ( ...can't show this....);
... (total: 100 times); (15466 ms)
[Stack Trace]
...
logic.DB.ClassifierJsonToDB$$anonfun$1.apply(ClassifierJsonToDB.scala:119)
logic.DB.ClassifierJsonToDB$$anonfun$1.apply(ClassifierJsonToDB.scala:96)
scalikejdbc.DBConnection$$anonfun$_localTx$1$1.apply(DBConnection.scala:252)
scala.util.control.Exception$Catch.apply(Exception.scala:102)
scalikejdbc.DBConnection$class._localTx$1(DBConnection.scala:250)
scalikejdbc.DBConnection$$anonfun$localTx$1.apply(DBConnection.scala:257)
scalikejdbc.DBConnection$$anonfun$localTx$1.apply(DBConnection.scala:257)
scalikejdbc.LoanPattern$class.using(LoanPattern.scala:33)
scalikejdbc.NamedDB.using(NamedDB.scala:32)
scalikejdbc.DBConnection$class.localTx(DBConnection.scala:257)
scalikejdbc.NamedDB.localTx(NamedDB.scala:32)
logic.DB.ClassifierJsonToDB$.insertBulk(ClassifierJsonToDB.scala:96)
logic.DB.ClassifierJsonToDB$$anonfun$bulkInsert$1.apply(ClassifierJsonToDB.scala:176)
logic.DB.ClassifierJsonToDB$$anonfun$bulkInsert$1.apply(ClassifierJsonToDB.scala:167)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
...
When I run it on the server that hosts the MySQL database it runs fast. What can I do to make it run faster from a remote computer?

In case anyone needs this: I had a similar problem batch-inserting 10,000 records into MySQL with ScalikeJDBC, and it was solved by setting rewriteBatchedStatements to true in the JDBC URL ("jdbc:mysql://host:3306/db?rewriteBatchedStatements=true"). It reduced the batch insert time from 40 seconds to 1 second!
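For reference, a minimal sketch of where that flag goes in the HikariCP + ScalikeJDBC setup from the question (host, credentials and the MasterFlow database name are placeholders here; note that, as in the URL above, the database name sits in the path before the query string):

import javax.sql.DataSource
import com.zaxxer.hikari.HikariDataSource
import scalikejdbc.{ConnectionPool, DataSourceConnectionPool}

val dataSource: DataSource = {
  val ds = new HikariDataSource()
  ds.setDataSourceClassName("com.mysql.jdbc.jdbc2.optional.MysqlDataSource")
  // rewriteBatchedStatements=true lets Connector/J rewrite a JDBC batch
  // into multi-row INSERTs on the wire instead of one round trip per row.
  ds.addDataSourceProperty("url",
    "jdbc:mysql://host:3306/MasterFlow?rewriteBatchedStatements=true")
  ds.addDataSourceProperty("user", "someUser")
  ds.addDataSourceProperty("password", "secret")
  ds
}
ConnectionPool.add('review, new DataSourceConnectionPool(dataSource))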

I guess this is not an issue of ScalikeJDBC or HikariCP. You should investigate the network environment between your machine and the MySQL server.
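One way to check (a hedged sketch against the same 'review pool from the question): time a few trivial statements and see what one round trip costs. If each single-row INSERT pays one network round trip, 100 round trips at roughly 150 ms each would account for about 15 seconds.

import scalikejdbc._

// Rough per-statement round-trip time to the remote MySQL server.
val samples = 10
val start = System.nanoTime()
NamedDB('review) readOnly { implicit session =>
  (1 to samples).foreach { _ =>
    sql"SELECT 1".map(_.int(1)).single.apply()
  }
}
val avgMs = (System.nanoTime() - start) / 1e6 / samples
println(f"average round trip: $avgMs%.1f ms")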

Related

Why does FastAPI take upwards of 10 minutes to insert 100,000 rows into a SQL database

I've tried using SQLAlchemy, as well as raw mysql.connector here, but committing an insert into a SQL database from FastAPI takes forever.
I wanted to make sure it wasn't just my DB, so I tried it on a local script and it ran in a couple seconds.
How can I work with FastAPI to make this query possible?
Thanks!
@router.post('/')
def postStockData(data: List[pydanticModels.StockPrices], raw_db=Depends(get_raw_db)):
    cursor = raw_db[0]
    cnxn = raw_db[1]
    # i = 0
    # for row in data:
    #     if i % 10 == 0:
    #         print(i)
    #         db.flush()
    #     i += 1
    #     db_pricing = models.StockPricing(**row.dict())
    #     db.add(db_pricing)
    # db.commit()
    SQL = "INSERT INTO " + models.StockPricing.__tablename__ + " VALUES (%s, %s, %s)"
    print(SQL)
    valsToInsert = []
    for row in data:
        rowD = row.dict()
        valsToInsert.append((rowD['date'], rowD['symbol'], rowD['value']))
    cursor.executemany(SQL, valsToInsert)
    cnxn.commit()
    return {'message': 'Pricing Updated'}
You are killing performance because you are using an "RBAR" (row by agonizing row) approach, which is not suitable for an RDBMS...
You use a loop and execute an SQL INSERT of only one row...
When the RDBMS receives a query, the sequence of execution is the following:
authenticate the user that issued the query
parse the string to verify the syntax
look up metadata (tables, columns, datatypes...)
check which operations on these tables and columns the user is granted
create an execution plan to sequence all the operations needed for the query
set up locks for concurrency
execute the query (inserting only 1 row)
return an error or an OK message
Every step consumes time... and you are paying for all these steps 100,000 times because of your loop.
Usually when inserting many rows into a table, there is just one query to run, even if the INSERT concerns 10,000,000,000 rows from a file!
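To make the set-based idea concrete, here is a rough sketch (written in Scala with ScalikeJDBC to match the main question on this page, since the same principle applies to any client; the stock_pricing table and its columns are made up for illustration): all rows go out in a single multi-row INSERT, so the parse/plan/lock cycle above is paid once per batch instead of once per row.

import scalikejdbc._

case class Price(date: String, symbol: String, value: Double)

// One statement for the whole batch: VALUES (?, ?, ?), (?, ?, ?), ...
def insertAll(rows: Seq[Price])(implicit session: DBSession): Int = {
  if (rows.isEmpty) 0
  else {
    val placeholders = rows.map(_ => "(?, ?, ?)").mkString(", ")
    val params: Seq[Any] = rows.flatMap(r => Seq(r.date, r.symbol, r.value))
    SQL("INSERT INTO stock_pricing (price_date, symbol, price_value) VALUES " + placeholders)
      .bind(params: _*)
      .update.apply()
  }
}

For very large batches, chunk the rows (for example rows.grouped(1000)) so a single statement does not exceed max_allowed_packet.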

Batch insert into MySQL with Spring Boot + MyBatis is much slower in Linux than in Windows. What could have caused it?

When I use MyBatis to insert 100,000 records into a table in MySQL:
1. it takes about 14s when I run the application (Spring Boot + MyBatis) on Windows (my PC, 16G + i7),
2. but it takes 1244s when I run the same application on CentOS 7 (production env, 4-core + 8G ECS server).
They both connect to the same MySQL server (which also runs on CentOS 7).
The network connection is better from CentOS 7 (production env).
CPU performance is almost the same (I have tested).
The application is simple and takes only 1G of memory when running.
Libraries versions in my application:
openjdk version "1.8.0_212",
Spring boot 2.1.6 ,
spring-boot-starter-tomcat-2.1.6.RELEASE.jar ,
spring-jdbc-5.0.7.RELEASE.jar ,
druid-1.1.19.jar ,
mybatis-3.5.2.jar ,
mybatis-spring-2.0.2.jar ,
mybatis-spring-boot-starter-2.1.0.jar ,
mybatis-spring-boot-autoconfigure-2.1.0.jar ,
mysql-connector-java-5.1.38.jar ,
Anyone know the reason?
Thanks in advance.
==============================
Insert By Foreach (max_allowed_packet has been set to 200M):
<insert id="insertBatch" parameterType="java.util.List" useGeneratedKeys="false">
    insert into table_product
        (id,
         code,
         status,
         type,
         create_time,
         update_time)
    values
    <foreach collection="products" item="product" index="index" separator=",">
        (#{product.id},
         #{product.code},
         #{product.status},
         #{product.type},
         #{product.createTime},
         #{product.updateTime})
    </foreach>
</insert>
==================================
Insert By ExecutorType.BATCH:
public void batchInsert(List<Product> products) {
    SqlSession session = sqlSessionTemplate.getSqlSessionFactory().openSession(ExecutorType.BATCH, false);
    BatchTableDao batchTableDao = session.getMapper(BatchTableDao.class);
    try {
        int i = 0;
        for (Product product : products) {
            batchTableDao.insert(product);
            if (i % 1000 == 0 || i == products.size() - 1) {
                session.flushStatements();
                session.clearCache();
            }
            i++;
        }
        session.commit();
    } catch (Exception e) {
        Log.warn("error : " + e.getMessage());
    } finally {
        session.close();
    }
}
===================================
On Windows 10, it takes about 14s to insert 100,000 rows using 'foreach'.
And it takes about 2500s to insert 100,000 rows using 'ExecutorType.BATCH', which is too slow to accept.
First of all, how about changing to the version specified in the official document? And I think you need to know more about the code.

Updating one MySQL table with multiple processes via pymysql

Actually, I am trying to update one table with multiple processes via pymysql, and each process reads a CSV file split from a huge one in order to speed things up. But I get the "Lock wait timeout exceeded; try restarting transaction" exception when I run the script. After searching posts on this site, I found one that mentioned setting up or using the built-in LOAD DATA INFILE, but it gave no details. How can I do this with pymysql to reach my aim?
---------------------------first edit----------------------------------------
Here's the job method:
def importprogram(path, name):
    begin = time.time()
    print('begin to import program' + name + ' info.')
    # "c:\\sometest.csv"
    file = open(path, mode='rb')
    csvfile = csv.reader(codecs.iterdecode(file, 'utf-8'))
    connection = None
    try:
        connection = pymysql.connect(host='a host', user='someuser', password='somepsd', db='mydb',
                                     cursorclass=pymysql.cursors.DictCursor)
        count = 1
        with connection.cursor() as cursor:
            sql = '''update sometable set Acolumn='{guid}' where someid='{pid}';'''
            next(csvfile, None)
            for line in csvfile:
                try:
                    count = count + 1
                    if ''.join(line).strip():
                        command = sql.format(guid=line[2], pid=line[1])
                        cursor.execute(command)
                    if count % 1000 == 0:
                        print('program' + name + ' cursor execute', count)
                except csv.Error:
                    print('program csv.Error:', count)
                    continue
                except IndexError:
                    print('program IndexError:', count)
                    continue
                except StopIteration:
                    break
    except Exception as e:
        print('program' + name, str(e))
    finally:
        connection.commit()
        connection.close()
        file.close()
        print('program' + name + ' info done.time cost:', time.time()-begin)
And the multi-processing method:
import multiprocessing as mp

def multiproccess():
    pool = mp.Pool(3)
    results = []
    paths = ['C:\\testfile01.csv', 'C:\\testfile02.csv', 'C:\\testfile03.csv']
    name = 1
    for path in paths:
        results.append(pool.apply_async(importprogram, args=(path, str(name))))
        name = name + 1
    print(result.get() for result in results)
    pool.close()
    pool.join()
And the main method:
if __name__ == '__main__':
    multiproccess()
I am new to Python. Where does my code, or the approach itself, go wrong? Should I use only a single process to do the data reading and importing?
Your issue is that you are exceeding the time allowed for a response to be fetched from the server, so the client is automatically timing out.
In my experience, adjust the wait timeout to something like 6000 seconds, combine the files into one CSV, and just leave the data to import. Also, I would recommend running the query directly from MySQL rather than from Python.
The way I usually import CSV data from Python to MySQL is through the INSERT ... VALUES ... method, and I only do so when some kind of manipulation of the data is required (i.e. inserting different rows into different tables).
I like your approach and understand your thinking but in reality there is no need. The benefit to the INSERT ... VALUES ... method is that you won't run into any timeout issue.

Empty result for SQL query in RStudio Server

I'm trying to get data from a MySQL DB into RStudio Server. My code looks like this:
mydb = dbConnect(MySQL(), user='user', password='password', dbname='dbname', host='localhost')
query <- stri_paste('select sellings.updated_at AS Up_Date, concat(item_parameters.title, " ", ad_attributes.int_value) AS Class, CONCAT(geos.name, " ", geos.kind) AS place, geos.lon, geos.lat, sellings.price AS price, ((geo_routes.distance*2/1000 + 100)) AS delivery_cost FROM sellings, users, item_parameters, ad_attributes, geos, geo_routes WHERE users.encrypted_password!="" && item_parameters.title="Класс" && sellings.price IS NOT NULL && ad_attributes.int_value IS NOT NULL AND users.id=sellings.user_id AND item_parameters.id=ad_attributes.item_parameter_id AND sellings.id = ad_attributes.ad_id AND sellings.geo_guid = geos.guid AND geos.routable_guid = geo_routes.src_guid AND geo_routes.distance = (SELECT geo_routes.distance FROM geo_routes, geos WHERE geos.guid = sellings.geo_guid AND geo_routes.src_guid = geos.routable_guid AND geo_routes.dst_guid = (SELECT geos.routable_guid FROM geos WHERE geos.name = "Воронеж" && geos.kind = "г")) ORDER BY Up_Date;')
rs = dbGetQuery(mydb, query)
And I get an empty dataframe. But when I do the same with my local DB everything is OK. The query takes a pretty long time, about 3 minutes, but it works properly. Moreover, the same query works fine from the MySQL command line. On the server, it takes about 4 seconds. The server OS is Debian 7, the local machine's OS is Windows 8. Any idea?
Sometimes when querying from the command line, the default schema has been set by a previous command. That setting doesn't carry over to R, so the exact same query that works from the command line might not work in an R session. Maybe check the dbname.
Insert the below statements in your SQL query
SET NOCOUNT ON
SET ANSI_WARNINGS OFF
It worked for me

Storing objects in memcached and still loading them from the database

I have a problem with memcached.
I have the following code:
/**
 * Load the char object
 * @param char_id id char
 * @return $char object
 */
function get_info( $char_id )
{
    $cache = Cache::instance();
    $cachetag = Kohana::config( 'medeur.environment' ) . '-charinfo_' . $char_id . '_obj' ;
    kohana::log('debug', "-> Getting $cachetag from CACHE..." );
    $char = $cache -> get( $cachetag );
    if ( is_null( $char ) )
    {
        kohana::log('debug', "-> Getting $cachetag from DB.");
        $char = ORM::factory('character', $char_id );
        if ( !$char -> loaded )
            $char = null;
        $cache -> set( $cachetag, $char, 3600 );
    }
    return $char;
}
I see in the logfile that the object $char is taken from the cache:
2012-12-08 18:24:07 +01:00 --- debug: -> Getting test-global_adminmessage from CACHE...
2012-12-08 18:24:07 +01:00 --- debug: -> Getting test-charinfo_1_obj from CACHE...
However, I keep seeing in the profiler table that I am still hitting the database:
SELECT `characters`.* FROM (`characters`) WHERE `characters`.`id` = 1 ORDER BY `characters`.`id` ASC LIMIT 0, 1
Why? In this case, memcached would be useless...
Your "Getting nnnn from CACHE..." logging statement will always show up, regardless of whether or not you actually retrieve anything from the cache. Consider moving it into an else statement after the large if block.
if (is_null($char)) {
    ....
}
else {
    kohana::log('debug', "-> Got $cachetag from CACHE..." );
}
I checked with the guys at Kohana. The Kohana 2.x ORM class is not cacheable. It is cacheable in framework version 3.x.