Speed up select distinct process from very large table - mysql

I want to use SELECT DISTINCT on a single column to extract data from a very large MyISAM table with ~300 million rows (~12.3 GiB in size; the SELECT DISTINCT should yield ~100k observations, so much smaller than 1 GiB).
The problem is that this query takes 10+ hours to run. I actually don't know how long it takes, because I've never let it finish.
My query is as follows:
create table codebook(
    symbol varchar(16) not null);
create index IDXcodebook on codebook(symbol);
insert into codebook
select distinct(symbol) from bigboytable;
I've tried adding an index on bigboytable(symbol) to speed up the process, but that indexing statement has been running for 15+ hours with no end in sight.
I've also tried:
SELECT symbol FROM bigboytable GROUP BY symbol
But I get
Error Code: 2013. Lost connection to MySQL server during query
In fact, if any query, in this project or in other projects, is "too complicated", I get Error Code 2013 after only ~1-6+ hours, depending.
Other settings are:
Migration connection timeout: 3600 seconds; DBMS connection read timeout: skipped; DBMS connection keep-alive interval: 5 seconds; SSH BufferSize: 10240 bytes; SSH connect, read/write, and command timeouts: 500 seconds.
Any suggestions? I might work with Python's MySQL packages if that would speed things up; Workbench is very slow. I need this data ASAP for a large project, but I don't need the 300+ million observations from bigboytable.
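One pattern I'm considering (a sketch only; it assumes bigboytable has an integer primary key named id, which may not match my actual table definition) is to deduplicate through a UNIQUE key and chunk the scan by primary-key ranges, so that no single statement runs long enough to hit the connection timeout:

create table codebook(
    symbol varchar(16) not null,
    unique key IDXcodebook (symbol));

-- repeat for successive id ranges (1-50M, 50M-100M, ...) until the whole table is covered
insert ignore into codebook(symbol)
select symbol from bigboytable
where id between 1 and 50000000;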
Edit: I attach my bigboytable definition and EXPLAIN output here.

Related

MySQL keeps losing connection when trying to make a query

I have a table with the following contents in MySQL:
I want to query a DATETIME column called 'trade_time' with a WHERE clause as follows:
SELECT * FROM tick_data.AAPL
WHERE trade_time between '2021-01-01 09:30:00' and '2021-01-01 16:00:00';
What I'm getting is a 2013 error: lost connection to MySQL server after about 30 seconds.
I'm pretty new to SQL, so I might be doing something wrong here; surely such a simple query shouldn't take longer than 30 seconds?
The data has 298M rows, which is huge; I was under the impression that MySQL should handle this kind of operation.
The table has just 3 columns: trade_time, price, and volume. I just want to query data by dates and times in a reasonable time for further processing in Python.
Thanks for any advice.
EDIT: I've raised the timeout limit in MySQL Workbench to 5 minutes. The query described above took 291 seconds to run, just to get 1 day of data. Is there some way I can speed up the performance?
298M rows is a lot to go through. I can definitely see that taking more than 30 seconds, but not much more. The first thing I would do is change your default disconnection time limit; personally, I always set mine to around 300 seconds (5 minutes). If you're using MySQL Workbench, that can be done via this method: MySQL Workbench: How to keep the connection alive
Also, check whether the trade_time column has an index on it. Indexing the columns you query most often is a good strategy for making queries faster.
SHOW INDEX FROM tablename;
Look to see if trade_time is in the list. If not, you can create an index like so:
CREATE INDEX dateTime ON tablename (trade_time);
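Once the index exists, a quick way to confirm the optimizer actually uses it is to run EXPLAIN on the query from the question and check that the new index shows up in the key column (just a sanity check):

EXPLAIN SELECT * FROM tick_data.AAPL
WHERE trade_time BETWEEN '2021-01-01 09:30:00' AND '2021-01-01 16:00:00';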

MySQL lost connection -> operation completed?

I have a table with >19M rows that I want to create a subtable of (I'm breaking the table into several smaller tables). So I'm doing a CREATE TABLE new_table (SELECT ... FROM big_table). I run the query in MySQL Workbench.
The query takes a really long time to execute, so eventually I get a "Lost connection to MySQL server" message. However, after a few minutes the new table is there, and it seems to contain all the data that was supposed to be copied over (I'm doing a GROUP BY, so I cannot just check that the number of rows is equal in both tables).
My question is: Am I guaranteed that the query is completed even though I lose connection to the database? Or could MySQL interrupt the query midway and still leave a table with incomplete data?
Am I guaranteed that the query is completed even though I lose connection to the database?
No. There are several reasons other than a connection timeout to get lost-connection errors. The server might crash due to used-up disk space or a hardware fault. An administrator might have terminated your session.
"Guarantee" is a strong word in the world of database management, because other people's data is at stake. You should not assume that any query ran correctly to completion unless it ended gracefully.
If you're asking because an overnight query failed and you don't want to repeat it, you can inspect the table with checks like COUNT(*) to convince yourself it completed. But please don't rely on this kind of hackery in production with other people's data.
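For example (a sketch with placeholder names, since the actual table and GROUP BY column weren't posted), comparing the number of groups in the source with the row count of the copy is a quick plausibility check:

-- placeholder names: big_table, new_table, group_col
SELECT COUNT(DISTINCT group_col) FROM big_table;  -- expected number of groups in the copy
SELECT COUNT(*) FROM new_table;                   -- should match if the copy finished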

SQL query on MySQL taking three seconds longer with no changes to the database or to the SQL query

I have been asked to diagnose why a query looking something like this
SELECT COUNT(*) AS count
FROM users
WHERE first_digit BETWEEN 500 AND 1500
AND second_digit BETWEEN 5000 AND 45000;
went from taking around 0.3 seconds to execute to suddenly taking over 3 seconds. The system is MySQL running on Ubuntu.
The table is not sorted and contains about 1.5M rows. After I added a composite index I got the execution time down to about 0.2 seconds again; however, this does not explain the root cause of why the execution time suddenly increased roughly tenfold.
How can I begin to investigate the cause of this?
Since your SQL query has not changed, and I interpret your description as saying the data set has not changed/grown, I suggest you take a look at the following areas, in order:
1) Have you removed the index and run your SQL query again?
2) Other access to the database. Are other applications or users running heavy queries on the same database? Large data transfers, in particular to and from the database server in question?
A factor of 10 slowdown? A likely cause is going from entirely cached to not cached.
Please show us SHOW CREATE TABLE, EXPLAIN SELECT, the RAM size, and the value of innodb_buffer_pool_size. And how big (GB) is the table?
Also, did someone happen to do a dump, ALTER TABLE, or OPTIMIZE TABLE just before the slowdown?
The above info will either show what caused caching to fail, or show the need for more RAM.
INDEX(first_digit, second_digit) (in either order) will be "covering" for that query; this will be faster than without any index.
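For concreteness, a sketch of those checks and of the suggested index, using the table and columns from the question (the index name idx_digits is made up):

SHOW CREATE TABLE users;
EXPLAIN SELECT COUNT(*) AS count
FROM users
WHERE first_digit BETWEEN 500 AND 1500
  AND second_digit BETWEEN 5000 AND 45000;
SELECT @@innodb_buffer_pool_size;
ALTER TABLE users ADD INDEX idx_digits (first_digit, second_digit);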

Self Hosted mysql server on azure vm hangs on select where DB is ~100mb

I'm doing a SELECT from 3 joined tables on a MySQL 5.6 server running on an Azure instance with the InnoDB buffer pool set to 2 GB. I used to have a 14 GB RAM, 2-core server and I just doubled the RAM and cores hoping this would help my SELECT, but it didn't.
The 3 tables I'm selecting from are 90 MB, 15 MB, and 3 MB.
I don't believe I'm doing anything crazy in my query, where I select a few booleans, yet this SELECT is hanging the server pretty badly and I can't get my data. I do see traffic increasing to something like 500 MB/s in MySQL Workbench, but I can't figure out what to do with this.
Is there anything I can do to get my SQL queries working? I don't mind waiting 5 minutes to get that data, but I need to figure out how to get it.
==================== UPDATE ===============================
I was able to get it done by cloning the 90 MB table and filling it with a filtered copy of the original table. It ended up being ~15 MB; then I just selected from all 3 tables, joining them via ids. Now the request completes in 1/10 of a second.
What did I do wrong in the first place? I feel like there is a way to increase some packet sizes to get such queries to work. Any suggestions on what I should google?
Just FYI, my select query looked like this:
SELECT
text_field1,
text_field2,
text_field3 ,..
text_field12
FROM
db.major_links,db.businesses, db.emails
where bool1=1
and bool2=1
and text_field is not null or text_field!=''
and db.businesses.major_id=major_links.id
and db.businesses.id=emails.biz_id;
So bool1, bool2, and the text field I'm filtering on are the fields from that 90 MB table.
I know this might be a bit late, but I have some suggestions.
First, take a look at the max_allowed_packet setting in your my.ini file. This is usually found here on Windows:
C:\ProgramData\MySQL\MySQL Server 5.6
This controls the packet size and usually causes errors in large queries if it isn't set correctly. I have mine set to 100M.
Here is some documentation for you:
Official documentation
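If editing my.ini isn't convenient, the current value can also be checked and raised at runtime (a sketch; SET GLOBAL needs the SUPER privilege and only affects connections opened afterwards):

SHOW VARIABLES LIKE 'max_allowed_packet';
SET GLOBAL max_allowed_packet = 100 * 1024 * 1024;  -- 100M, matching the my.ini value above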
In addition, I've seen slow queries when there are a lot of conditions in the WHERE clause, and here you have several. Make sure you have indexes and compound indexes on the columns in your WHERE clause, especially the ones used for the joins.
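A sketch of what that could look like for the posted query (index names are made up; the join columns come from the WHERE clause, and the bool1/bool2 index belongs on whichever of the three tables actually holds those columns, which the question doesn't make explicit):

-- index the join keys named in the posted query
ALTER TABLE businesses ADD INDEX idx_businesses_major_id (major_id);
ALTER TABLE emails ADD INDEX idx_emails_biz_id (biz_id);
-- and the filter columns, here assumed to live on major_links (the 90 MB table)
ALTER TABLE major_links ADD INDEX idx_major_links_flags (bool1, bool2);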

Slow INSERT INTO SELECT statement – not repeatable

Short Version:
An INSERT INTO ... SELECT into a table with a fulltext index, with the same data, sometimes takes 25 seconds and sometimes 2,500 seconds. We have no idea where this huge gap comes from.
Long Version:
I have a problem with a cronjob which imports new data into an import table and copies this data via an INSERT INTO ... SELECT statement to the production tables. I split the tables because of the time an update takes with a MySQL fulltext index; it seems to be faster to insert the data into that table with one INSERT INTO ... SELECT than with many single insert statements.
The cron to import new data runs every 5 minutes. There is a function that checks whether an instance of the cron is already running, to disallow parallel runs of the script. Usually there are about 500 new records with every cron call.
At night, between 1 and 2 am, there is a lot more new data (about 5,000-15,000 new records) and the cron runs much longer than 5 minutes.
When the cron was running long in the night and I was tracking the performance of these queries, I found that the INSERT INTO ... SELECT statement is very (!) slow: to copy about 15,000 new records (with a file size of about 30 MB), the query takes more than 2,500 seconds!
The query is:
INSERT IGNORE INTO mentiondata
SELECT * FROM mentionimport
WHERE id <= 1203780;
I profiled the query, with the following result:
2012-10-31 06:52:06 Queryprofile: {
"starting":"0.000036",
"checking permissions":"0.000003",
"Opening tables":"0.000132",
"System lock":"0.000003",
"Table lock":"0.000007",
"init":"0.000041",
"optimizing":"0.000007",
"statistics":"0.000023",
"preparing":"0.000005",
"executing":"0.000002",
"Sending data":"999.999999",
"end":"0.000017",
"query end":"0.000005",
"freeing items":"1.458159",
"logging slow query":"0.000050",
"cleaning up":"0.000007"}
In the process list the "Sending data" state showed over 2,500 seconds; in the profile it is just 999.999999. Maybe that is the profiler's limit; whatever.
The really strange thing is: when I try to reproduce the problem by deleting the records from the fulltext table (DELETE FROM mentiondata WHERE id >= 1203780;) and starting the copy process manually, it takes only about 25 seconds!
So I don't get it and I really need help! I don't understand why there is such a performance difference for the same query. I checked the MySQL process list while the cron copy statement was running; there are no other queries locking tables or anything like that. There is just the single copy query in the process list, in "Sending data" for more than 2,500 seconds. There is no other cron or task influencing the performance of the server. It seems that the MySQL server slows down every night, or that the SQL query takes an extremely long time when the connection was opened long before the INSERT statement runs (the connection is opened earlier to fill the import tables it copies from).
Are there any status variables I can check to see why MySQL is so slow? Is there a way to find out why these queries are so slow? Here are some server variables for info:
bulk_insert_buffer_size: 268435456
key_buffer_size: 536870912
query_cache_size: 536870912
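For reference, these values and the per-query profile quoted above can be read with the generic commands below (a sketch; nothing here is specific to this server, and SHOW PROFILE is available in MySQL versions of this era):

SHOW VARIABLES LIKE 'bulk_insert_buffer_size';
SHOW GLOBAL STATUS LIKE 'Key%';   -- key buffer activity (relevant if the fulltext table is MyISAM)
SET profiling = 1;                -- then run the INSERT ... SELECT
SHOW PROFILES;
SHOW PROFILE FOR QUERY 1;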
Thanks for any help!
Timo