ElasticSearch, Logstash, MySQL: how do I speed up a large import?

I'm trying to import a large (~30M row) MySQL database to ElasticSearch. Cool cool, there's a logstash tool that looks like it's built for this sort of thing; its JDBC plugin will let me connect right to the database and slurp up the rows real fast.
However! When I try it, it bombs with java.lang.OutOfMemoryError. Okay. It's probably trying to batch up too many rows or something. So I add jdbc_fetch_size => 1000 to my configuration. No dice, still out of memory. Okay, maybe that option doesn't work, or doesn't do what I think?
So I try adding jdbc_paging_enabled => true and jdbc_page_size => 10000 to my config. Success! It starts adding rows in batches of 10k to my index.
But it slows down. At first I'm running 100k rows/minute; by the time I'm at 2M rows, however, I'm at maybe a tenth of that. And no surprise; I'm pretty sure this is using LIMIT and OFFSET, and using huge OFFSETs in queries is real slow, so I'm dealing with an O(n^2) kind of thing here.
I'd really like to just run the whole big query and let the cursor iterate over the resultset, but it looks like that isn't working for some reason. If I had more control over the query, I could swap the LIMIT/OFFSET thing out for a WHERE id BETWEEN val1 AND val2 kind of thing, but I can't see where I could get in to do that.
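(For the record, the range-based version I had in mind would be something along these lines, with a hypothetical table name and id bounds supplied by whatever drives the batching:)

-- first batch; the next one would be WHERE id BETWEEN 10001 AND 20000, and so on
SELECT *
FROM my_table
WHERE id BETWEEN 1 AND 10000;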
Any suggestions on how I can not crash, but still run at a reasonable speed?

Okay! After searching the issues on the logstash-input-jdbc GitHub page for "Memory" I found this revelation:
It seems that an additional parameter ?useCursorFetch=true needs to be added to the connection string of mysql 5.x.
It turns out that the MySQL JDBC client does not use a cursor for fetching rows by default because of some reason, and the logstash client doesn't warn you that it's not able to use a cursor to iterate over the resultset even though you've set a jdbc_fetch_size because of some other reason.
The obvious way to know about this, of course, would have been to have carefully read the MySQL Connector/J documentation which does mention that cursors are off by default, though not why.
Anyhow, I added useCursorFetch=true to the connection string, kicked jdbc_paging_enabled to the curb, and imported 26M rows into my index in 2.5 hours, on an aging MacBook Pro with 8GB of memory.
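A sketch of what the final jdbc input block can look like; the host, credentials, driver path, and table name here are placeholders rather than anything from the original setup:

input {
  jdbc {
    # useCursorFetch=true is the important part; without it the driver buffers the whole result set in memory
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb?useCursorFetch=true"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_fetch_size => 1000
    statement => "SELECT * FROM my_table"
  }
}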
Thanks to github user axhiao for the helpful comment!

Related

Self Hosted mysql server on azure vm hangs on select where DB is ~100mb

I'm doing a select from 3 joined tables on MySQL Server 5.6 running on an Azure instance with InnoDB set to 2GB. I had 14GB of RAM and a 2-core server, and I just doubled the RAM and cores hoping this would have a positive effect on my select, but it didn't.
The 3 tables I'm selecting from are 90MB, 15MB and 3MB.
I don't believe I'm doing anything crazy in my request, where I select a few booleans, but I'm seeing that this select hangs the server pretty badly and I can't get my data. I do see traffic increasing to something like 500MB/s via MySQL Workbench, but I can't figure out what to do with this.
Is there anything I can do to get my SQL queries working? I don't mind waiting 5 minutes to get that data, but I need to figure out how to get it.
==================== UPDATE ===============================
I was able to get it done by cloning the 90MB table and filling it with a filtered copy of the original table. It ended up being ~15MB, and then I just did a select over all 3 tables, joining them via ids. Now the request completes in a tenth of a second.
What did I do wrong in the first place? I feel like there is a way to increase the size of some packets to get such queries to work? Any suggestions on what I should google?
Just FYI, my select query looked like this
SELECT
    text_field1,
    text_field2,
    text_field3, ...
    text_field12
FROM
    db.major_links, db.businesses, db.emails
WHERE bool1 = 1
  AND bool2 = 1
  AND text_field IS NOT NULL OR text_field != ''
  AND db.businesses.major_id = major_links.id
  AND db.businesses.id = emails.biz_id;
So bool1, bool2 and the text_field I'm filtering on are the fields from that 90MB table.
I know this might be a bit late, but I have some suggestions.
First, take a look at max_allowed_packet in your my.ini file. On Windows this is usually found here:
C:\ProgramData\MySQL\MySQL Server 5.6
This controls the maximum packet size, and it usually causes errors in large queries if it isn't set correctly. I have mine set to 100M.
Here is some documentation for you:
Official documentation
In addition, I've seen slow queries when there are a lot of conditions in the WHERE clause, and here you have several. Make sure you have indexes, and compound indexes, on the columns in your WHERE clause, especially the ones involved in the joins.
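A minimal sketch of both suggestions, reusing the table and column names from the query above and assuming bool1/bool2 live on the businesses table (verify against the actual schema before running anything):

-- Raise the packet limit at runtime (new connections pick it up);
-- the equivalent my.ini setting is max_allowed_packet=100M under [mysqld]
SET GLOBAL max_allowed_packet = 100 * 1024 * 1024;

-- Indexes on the join and filter columns
CREATE INDEX idx_businesses_major_id ON businesses (major_id);
CREATE INDEX idx_emails_biz_id ON emails (biz_id);
CREATE INDEX idx_businesses_bools ON businesses (bool1, bool2);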

Best way to process large database with Laravel

The database
I'm working with a database that has pretty big tables, and it's causing me problems. One table in particular has more than 120k rows.
What I'm doing with it
I'm looping over this table in a MakeAverage.php file to merge its rows down into about 1k rows in a new table in my database.
What doesn't work
Laravel doesn't allow me to process it all at once, even if I try DB::disableQueryLog() or a take(1000) limit, for example. It returns a blank page every time, even though error reporting was enabled (kind of like this). Also, I had no Laravel log file for this; I had to look in my php_error.log (I'm using MAMP) to realize that it was actually a memory_limit problem.
What I did
I increased the amount of memory before executing my code by using ini_set('memory_limit', '512M'). (It's bad practice, I should do it in php.ini.)
What happened?
It worked! However, Laravel then threw an error because the page didn't finish loading within 30s due to the large amount of data.
What I will do
After spending some time on this issue and looking at other people having similar problems (see: Laravel forum, 19453595, 18775510 and 12443321), I thought that maybe PHP isn't the solution.
Since I'm only creating a Table B from the average values of Table A, I believe SQL is going to fit my needs best, as it's clearly faster than PHP for that type of operation (see: 6449072), and I can use functions such as SUM, AVG, COUNT and GROUP BY (Reference).
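A minimal sketch of that approach, with hypothetical table and column names (a source table_a with a group_id and a numeric value column):

-- Build Table B directly in the database from the averages of Table A
CREATE TABLE table_b AS
SELECT group_id,
       AVG(value) AS avg_value,
       COUNT(*)   AS row_count
FROM table_a
GROUP BY group_id;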

Mysql query fast only first time run

I have a MySQL SELECT query which is fast (<0.1 sec), but only the first time I run it. It joins 3 tables together (using indices) and has a relatively simple WHERE statement. When I run it by hand in phpMyAdmin (always changing numbers in the WHERE so that it isn't cached) it is always fast, but when I have PHP run several copies of it in a row, the first one is fast and the others hang for ~400 sec. My only guess is that somehow MySQL is running out of memory for the connection and then has to do expensive paging.
My general question is how I can fix this behavior, but my specific questions are: without actually closing and restarting the connection, how can I make these queries coming from PHP be seen as separate, just like the queries coming from phpMyAdmin? How can I tell MySQL to flush any memory when the request is done? And does this sound like a memory issue to you?
Well, I found the answer, at least in my case, and I'm putting it here for anyone in the future who runs into a similar issue. The query I was running returned a lot of results, and MySQL's query cache was causing a lot of overhead. When you run a query, MySQL will save it and its output so that it can answer future identical requests quickly. All I had to do was add SQL_NO_CACHE to the SELECT and the speed was back to normal. Just watch out if your incoming query is large or the results are very large, because it can take considerable resources for MySQL to decide when to kick things out of the cache.
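In case it helps anyone, the hint goes directly after the SELECT keyword; a made-up example (tables and columns here are placeholders, not the real query):

-- Only the SQL_NO_CACHE placement matters here
SELECT SQL_NO_CACHE t1.col_a, t2.col_b
FROM table_one t1
JOIN table_two t2 ON t2.t1_id = t1.id
WHERE t1.created_at > '2015-01-01';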

Hibernate spring hangs

I'm working on a Hibernate/Spring/MySQL app. Sometimes when I call getHibernateTemplate().get(class, id) I can see a bunch of HQL in the logs and the application hangs; I have to kill Tomcat. This method reads through a 3,000-line file, and there should be 18 of these files, so I've been thinking I've probably been looking at this wrong. I need help checking this at the database level, but I don't know how to approach it. Maybe my database can't take so many hits so fast.
I've looked in phpMyAdmin in the section with information about execution times, and I see red values for:
Innodb_buffer_pool_reads 165
Handler_read_rnd 40
Handler_read_rnd_next 713 k
Created_tmp_disk_tables 8
Opened_tables 30
Can I somehow set the application to treat the database more gently?
How can I check if this is the issue?
Update
I put a
Thread.sleep(2000);
at the end of each cycle and it made the same number of calls (18), so I guess this won't be the reason? Can I discard this approach?
This is a different view of this question
Hibernate hangs or throws lazy initialization no session or session was closed
trying something different.
Update 2
Could it be the buffered reader reading the file? The file is 44KB. I tried the class from this method:
http://code.hammerpig.com/how-to-read-really-large-files-in-java.html
but it did not work.
Update 1 -- never use a sleep or anything slow within a transaction. A transaction has to be closed as fast as possible, because it can block other database operations (what exactly gets blocked depends on the isolation level).
I do not really understand how the database is related to the files in your use case. But if the process works for the first file and becomes slow later on, then the problem may be the Hibernate Session (too many objects); in that case, start a new Transaction/Hibernate Session for each file.
I rewrote the program so that I load the information directly into the database using MySQL's LOAD DATA INFILE. It works very fast. Then I updated rows, changing the fields I needed, also with SQL queries. I think there was simply too much information to manage at the same time through memory and abstractions.
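A minimal sketch of that kind of load, assuming a tab-separated file and a hypothetical records table (the file path, table and column names are all placeholders):

-- Bulk-load the file straight into the table
LOAD DATA INFILE '/path/to/file.txt'
INTO TABLE records
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(field1, field2, field3);

-- Follow-up fix-ups can then be done with plain SQL, e.g.:
UPDATE records SET field3 = TRIM(field3) WHERE field3 IS NOT NULL;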

Randomly long DB queries/Memcache fetches in production env

I'm having trouble diagnosing a problem on my Ubuntu Scalr/EC2 production environment.
The trouble is that, apparently at random, database queries and/or memcache fetches will take MUCH longer than they should. I've seen a simple select statement take 130ms or a Memcache fetch take 65ms! It can happen a handful of times per request, causing some requests to take twice as long as they should.
To diagnose the problem, I wrote a very simple script which will just connect to the MySql server and run a query.
require 'mysql'
mysql = Mysql.init
mysql.real_connect('', '', '', '')
max = 0
100.times do
  start = Time.now
  mysql.query('select * from navigables limit 1')
  stop = Time.now
  total = stop - start
  max = total if total > max
end
puts "Max Time: #{max * 1000}"
mysql.close
This script consistently returned a really high max time, so I eliminated Rails as the source of the problem. I also wrote the same thing in Python to rule out Ruby, and indeed the Python version took an inordinate amount of time as well!
Both MySQL and Memcache are on their own boxes, so I considered network latency, but pings and traceroutes look normal.
Also, running the queries/fetches directly on the respective machines returns the expected times, and I'm running the same gem versions on my staging machine without this issue.
I'm really stumped on this one... any thoughts on something I could try to diagnose this?
Thanks
My only thought is that it might be disk?
MySQL uses the query cache to store a SELECT together with its result. That could explain the constant speed you are getting from continuous selecting. Try EXPLAIN-ing the query to see if you are using indexes.
I don't see why memcache would be a problem (unless it's crashing and restarting?). Check your server logs for suspicious service failures.
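For the EXPLAIN suggestion, a quick check against the test query from the script above would be:

EXPLAIN SELECT * FROM navigables LIMIT 1;
-- Look at the key and rows columns: key shows which index (if any) is used,
-- and rows is the optimizer's estimate of how many rows will be examined.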