I am trying to load a simple but fairly large table from a MySQL database into MATLAB (the table is about 16,000,000 x 18).
The MySQL database size is 2.6 GB, and my Windows machine has 32 GB of memory, so in principle, memory should not be a problem.
I tried to load the data via a simple fetch:
curs = exec(conn, 'SELECT * FROM mydb.large_table');
curs = fetch(curs);
data = curs.Data;
I also tried to use the select function, but in both instances data is simply returned as 0.
As there are no error messages, and as it does not seem that I am even close to any MATLAB or memory size restrictions, I am at a loss to understand what is going wrong.
Any help would be greatly appreciated.
Edit:
Did some further checks:
Tried to pre-allocate memory for the table by pulling in the first row of the table and then replicating it 20 million times. No problem there.
Was able to pull in 2.5 million rows; 6 million failed again.
Watching the memory consumption (Windows Task Manager), I noticed that once the exec statement had pulled all the data from the database onto the local machine, the fetch command started to eat up memory. For 6 million rows, the memory in use first increased to the full available 32 GB, then dropped to 2 GB (where it stayed), but the committed memory went up to a staggering 125 GB!
I have absolutely no idea what is going on here
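For reference, a minimal sketch of this kind of chunked fetch (assuming the Database Toolbox fetch(curs, rowLimit) form and the default cell-array DataReturnFormat; the batch size is arbitrary):

curs = exec(conn, 'SELECT * FROM mydb.large_table');
batchSize = 500000;                     % arbitrary chunk size
data = {};
curs = fetch(curs, batchSize);          % pull the first batch
while ~strcmp(curs.Data{1}, 'No Data')  % cursor reports 'No Data' when exhausted
    data = [data; curs.Data];           % append the current batch
    curs = fetch(curs, batchSize);      % pull the next batch
end
close(curs);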
Related
My DB has around 15 tables, each with about 40 columns and 10,000 rows.
Most of it with VARCHAR, some indexes and foreign keys.
Sometimes I need to reconstruct my database (design flaw, working on it), which takes about 40 seconds locally. Now I'm trying to do the same on an AWS RDS MySQL 5.7 instance, but it takes forever, something like 40-50 minutes. The last time I had to do this same process it took no more than 5 minutes, still way more than the local 40 seconds, but I was happy with it.
My internet speed is at about 35 Mbps Download / 5 Mbps Upload.
I know it's not fast, but it's consistent, and it hasn't changed since my last rebuild.
I enabled General Logs, but all I can see are the INSERT queries, occasionally some "SELECT 1".
I do have some room for improvement in my code, but still, going from 40 seconds to 50 minutes suggests that something else is going on.
Any ideas on how to diagnose and find the bottleneck?
Thanks
--
Additional relevant information:
It is a micro instance from AWS, and all of the relevant monitoring indicators are basically flat: CPU at 4%, Free Storage Space at 20,000 MB, Freeable Memory at 200 MB, Write IOPS at around 2.5. The server runs MySQL 5.7.25 with 1 vCPU, 1 GB of RAM and 20 GB of SSD. This is the same as 3 months ago, when I last rebuilt the database.
SHOW GLOBAL STATUS: https://pastebin.com/jSrAzYZP
SHOW GLOBAL VARIABLES: https://pastebin.com/YxD7dVhR
SHOW ENGINE INNODB STATUS: https://pastebin.com/r5wffB5t
SHOW PROCESS LIST: https://pastebin.com/kWwiyGwf
SELECT * FROM information_schema...: https://pastebin.com/eXGBmetP
I haven't made any big changes to the server configuration, except enabling logs, maxing out max_allowed_packet, and saving logs to file.
In my backend I have a Flask app running. When it receives the API call, it takes a bunch of pickled objects and adds them all to the database by appending Flask-SQLAlchemy model instances to a list and then running db.session.add_all(entries), trying to get a bulk operation. The code is the same for both localhost and my remote server.
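Roughly, the pattern is the following (a simplified sketch; Entry is a placeholder model name, not the actual schema):

entries = []
for payload in unpickled_objects:      # the pickled objects received by the API call, already loaded
    entries.append(Entry(**payload))   # one Flask-SQLAlchemy model instance per object (placeholder model)
db.session.add_all(entries)            # queue everything in one session
db.session.commit()                    # flush and commit in a single transaction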
It does get slower on three specific tables, most of them with VARCHAR columns, but nothing different from my last inserts. It seems odd that the problem would be the data or the way the code is structured; at least it doesn't seem reasonable that this would turn 20 seconds (localhost) into 40 minutes (hosted server), especially when the rest of the tables behave mostly the same.
Enable the slow log, set long_query_time=0, run your code, then put the resulting log through mysqldumpslow.
Establish which queries contribute most to slowness and take it from there.
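A minimal sketch of that workflow (the log path is a placeholder; on RDS these variables are normally changed through the DB parameter group rather than SET GLOBAL):

SET GLOBAL slow_query_log = 'ON';   -- enable the slow log
SET GLOBAL long_query_time = 0;     -- log every statement
-- run the rebuild, then summarize the log (shell), sorted by total time:
-- mysqldumpslow -s t /path/to/slow-query.log | head -20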
Compare the config between your old server and your new one.
Also, are they the same version of MySQL? 5.6, 5.7 and 8.0 can produce very different execution plans (with 5.6 usually coming up with the sane one if they differ).
Rate Per Second = RPS
Suggestions to consider for your AWS RDS Parameter group
thread_cache_size=24 # from 8 to reduce threads_created count
innodb_io_capacity=1900 # from 200 to enable more use of SSD IOPS capacity
read_rnd_buffer_size=128K # from 512K to reduce handler_read_rnd_next RPS of 21
query_cache_size=0 # from 1M since you have QC turned off with query_cache_type=OFF
Determine why com_flush is running 13 times per hour and get it stopped to avoid table open thrashing.
I found that after migrating to RDS, all my database indexes were gone! They weren't migrated along with the schema and data. Make sure your indexes are there.
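A quick way to verify (schema and table names are placeholders):

SHOW INDEX FROM mydb.mytable;
-- or count index entries per table across the whole schema:
SELECT TABLE_NAME, COUNT(*) AS index_entries
FROM information_schema.statistics
WHERE TABLE_SCHEMA = 'mydb'
GROUP BY TABLE_NAME;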
Also, MySQL query cache is OFF by default in RDS. This won't help the performance of your initial query, but it may speed things up in general.
You can set query_cache_type to 1 and define a value for query_cache_size. I also changed thread_cache_size from 8 to 24 and innodb_io_capacity from 200 to 1900; I don't know if that helps you.
Also, creating AWS DB Parameter Groups helped me a lot with configuring and tuning DB variables. You can read more here:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html
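For example, parameters in a custom group can be changed from the AWS CLI like this (the group name is a placeholder):

aws rds modify-db-parameter-group \
    --db-parameter-group-name my-mysql57-params \
    --parameters "ParameterName=thread_cache_size,ParameterValue=24,ApplyMethod=immediate" \
                 "ParameterName=innodb_io_capacity,ParameterValue=1900,ApplyMethod=immediate"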
I'm bulk-loading a 15 GB CSV file (30 million rows) into a MySQL 8 database.
Problem: the task takes about 20 minutes, with an approximate throughput of 15-20 MB/s, while the hard drive is capable of transferring files at 150 MB/s.
I have a 20 GB RAM disk, which holds my CSV. I import as follows:
mysqlimport --user="root" --password="pass" --local --use-threads=8 mytable /tmp/mydata.csv
This uses LOAD DATA under the hood.
My target table does not have any indexes, but it has approximately 100 columns (I cannot change this).
What is strange: I tried tweaking several config parameters as follows in /etc/mysql/my.cnf, but they did not give any significant improvement:
log_bin=OFF
skip-log-bin
innodb_buffer_pool_size=20G
tmp_table_size=20G
max_heap_table_size=20G
innodb_log_buffer_size=4M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_autoinc_lock_mode=2
Question: does LOAD DATA / mysqlimport respect those config changes, or does it bypass them? And did I even use the correct configuration file?
At least a SELECT on the variables shows they are correctly loaded by the MySQL server. For example, SHOW VARIABLES LIKE 'innodb_doublewrite' shows OFF.
Anyway, how could I improve import speed further? Or is my database the bottleneck, with no way to overcome the 15-20 MB/s threshold?
Update:
Interestingly, if I import my CSV from the hard drive instead of from the RAM disk, performance is almost the same (just a little bit better, but never over 25 MB/s). I also tested the same number of rows, but with only a few (5) columns, and there I get about 80 MB/s. So clearly the number of columns is the bottleneck? But why do more columns slow down this process?
The MySQL/MariaDB engine has little parallelization when making bulk inserts; it can use only one CPU core per LOAD DATA statement. If you monitor CPU utilization during the load, you will probably see that one core is fully utilized, and it can produce only so much output data, thus leaving disk throughput underutilized.
The most recent version of MySQL has a new parallel load feature: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-parallel-table.html. It looks promising but probably hasn't received much feedback yet. I'm not sure it would help in your case.
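For reference, a sketch of how that utility is invoked from MySQL Shell in JavaScript mode (file, schema, and table names are placeholders; I have not benchmarked it against a plain LOAD DATA here):

util.importTable("/tmp/mydata.csv", {
    schema: "mydb",
    table: "mytable",
    dialect: "csv-unix",   // comma-separated fields, LF line endings
    threads: 8             // split the file into chunks and load them in parallel
});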
I saw various checklists on the internet that recommended higher values for the following config parameters: log_buffer_size, log_file_size, write_io_threads, bulk_insert_buffer_size. But the benefits were not very pronounced when I performed comparison tests (maybe 10-20% faster than just having innodb_buffer_pool_size large enough).
This could be normal. Let's walk through what is being done:
The csv file is being read from a RAM disk, so no IOPs are being used.
Are you using InnoDB? If so, the data is going into the buffer_pool. As blocks are being built there, they are being marked 'dirty' for eventual flushing to disk.
Since the buffer_pool is large, but probably not as large as the table will become, some of the blocks will need to be flushed before it finishes reading all the data.
After all the data is read, and the table is finished, the dirty blocks will gradually be flushed to disk.
If you had non-unique indexes, they would similarly be written in a delayed manner to disk (cf 'Change buffering'). The change_buffer, by default occupies 25% of the buffer_pool.
How large is the resulting table? It may be significantly larger, or even smaller, than the 15GB of the csv file.
How much time did it take to bring the csv file into the ram disk? I proffer that that was wasted time and it should have been read from disk while doing the LOAD DATA; that I/O can be overlapped.
Please provide SHOW GLOBAL VARIABLES LIKE 'innodb%'; there are several others that may be relevant.
More
These are terrible:
tmp_table_size=20G
max_heap_table_size=20G
If you have a complex query, 20 GB could be allocated in RAM, possibly multiple times! Keep those to under 1% of RAM.
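For example (assuming roughly 32 GB of RAM on this machine, which is not stated above), something like:

tmp_table_size=256M
max_heap_table_size=256M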
If copying the csv from hard disk to ram disk runs slowly, I would suspect the validity of 150 MB/s.
If you are loading the table once every 6 hours, and it takes 1/3 of an hour to perform, I don't see the urgency of making it faster. OTOH, there may be something worth looking into. If that 20 minutes is downtime due to the table being locked, that can be easily eliminated:
CREATE TABLE t LIKE real_table;
LOAD DATA INFILE ... INTO TABLE t ...; -- not blocking anyone
RENAME TABLE real_table TO old, t TO real_table; -- atomic; fast
DROP TABLE old;
I'm trying to create a database with data collected from Google n-grams. It's actually a lot of data, but after the creation of the CSV files the insertion was pretty fast. The problem is that, immediately after the insertion, the neo4j-import tool indexes the data, and this step is taking too much time. It's been more than an hour and it looks like it has reached 10% progress.
Nodes
[*>:9.85 MB/s---------------|PROPERTIES(2)====|NODE:198.36 MB--|LABE|v:22.63 MB/s-------------] 25M
Done in 4m 54s 828ms
Prepare node index
[*SORT:295.94 MB-------------------------------------------------------------------------------] 26M
This is the console info atm. Does anyone have a suggestion about what to do to speed up this process?
Thank you. (:
Indexing takes a long time depending on the number of nodes. I tried indexing with 10 million nodes and it took around 35 minutes, but you can still try these settings:
Increase your page cache size, which is set in the '/var/lib/neo4j/conf/neo4j.properties' file (on my Ubuntu system). Edit the following line:
dbms.pagecache.memory=4g
and allocate a size according to your RAM; here, 4g means 4 GB of space. Also, you can try changing the Java memory size, which is set in neo4j-wrapper.conf:
wrapper.java.initmemory=1024
wrapper.java.maxmemory=1024
You can also read the Neo4j documentation on this: http://neo4j.com/docs/stable/configuration-io-examples.html
We have a MySQL master database which replicates to a MySQL slave. We were experiencing issues where MySQL was showing a high number of writes (but not an increased number of queries being run) for a short period of time (a few hours). We are trying to investigate the cause.
Normally our binary logs are 1 GB in file size but during the period that we were experiencing these issues, the log files jumped to 8.5 GB.
When I run mysqlbinlog --short-form BINARYLOG.0000 on one of the 8.5 GB binary logs, it only returns 196 KB of queries and data. When I run mysqlbinlog --short-form on a normal binary log (1 GB), it returns around 8,500 KB worth of queries and database activity. That doesn't make sense: the file has 7 GB more data, yet it returns less output than a 1 GB binary log file.
I see lots of these statements with very sequential timestamps, but I'm not sure if that's related to the problem, because they appear in both the normal period and the period when we experienced these issues.
SET TIMESTAMP=1391452372/*!*/;COMMIT/*!*/;
SET TIMESTAMP=1391452372/*!*/;BEGIN/*!*/;COMMIT/*!*/;
SET TIMESTAMP=1391452372/*!*/;BEGIN/*!*/;COMMIT/*!*/;
SET TIMESTAMP=1391452372/*!*/;BEGIN/*!*/;COMMIT/*!*/;
How can I determine what caused those binary logs to balloon in size, which also caused high writes, so much so that it took the server offline at points, almost like a DDoS attack would?
How could mysqlbinlog return so much less data, even though the binary log file itself had 7 GB more? What can I do to identify the difference between a normal period, where the binary logs are 1 GB, and the period where we had issues with the 8.5 GB binary log? Thank you for any help you can provide.
Bill
I would guess that your log contains some form of LOAD DATA [LOCAL] INFILE commands and the data files associated with them. These commands do not generate much SQL output as their data is written to a temporary file by mysqlbinlog during processing. Can you check if the output contains any such LOAD DATA commands?
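One way to check (the --local-load option only controls where mysqlbinlog writes the extracted data files):

mkdir -p /tmp/binlog-data
mysqlbinlog --local-load=/tmp/binlog-data BINARYLOG.0000 | grep -ci "LOAD DATA"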
I am using Cassandra version 1.0.6... I have around 1 million JSON objects of 5 KB each to be inserted into Cassandra. As the inserts go on, the memory consumption of Cassandra also goes up until it stabilizes at a certain point. After some inserts (around 200,000-300,000), the ruby client gives me a "`recv_batch_mutate': CassandraThrift::TimedOutException" exception.
I have also tried inserting 1 KB JSON objects more than a million times. This doesn't give any exception. In this experiment I also plotted a graph of the time taken per batch of 50,000 inserts versus the number of such batches. I found that the time taken per batch rises sharply after some iterations and then suddenly falls. This could be due to garbage collection done by the JVM. But the same doesn't happen while inserting the 5 KB objects a million times.
What might be the problem? Some of the configuration options I am using:
System:
8 GB RAM, 4 cores
Cassandra configuration:
concurrent_writes: 64
memtable_flush_writers: 4
memtable_flush_queue_size: 8
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 30
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: true
Do I need to make any changes to the configuration? Is it related to JVM heap space or to garbage collection?
You can increase the RPC timeout to a larger value in the Cassandra config file; look for rpc_timeout_in_ms. But you should really look into the connection handling in your ruby client.
# Time to wait for a reply from other nodes before failing the command
rpc_timeout_in_ms: 10000
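On the client side, if you are using the classic Thrift-based cassandra gem, the per-request timeout can usually be raised when constructing the client; the option names below are assumptions, so check your client version's documentation:

require 'cassandra'   # the Thrift-based gem assumed here
client = Cassandra.new('MyKeyspace', '127.0.0.1:9160',
                       :timeout => 30,   # seconds per request (assumed option name)
                       :retries => 3)    # retry transient failures (assumed option name)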