MySQL InnoDB optimisation - mysql

I'm having some trouble understanding InnoDB usage - we have a Drupal-based DB (5:1 read:write) running on MySQL (Server version: 5.1.41-3ubuntu12.10-log (Ubuntu)). Our current InnoDB data/index sizing is:
Current InnoDB index space = 196 M
Current InnoDB data space = 475 M
Looking around on the web and reading books like 'High Performance MySQL' suggests sizing the buffer pool about 10% larger than the data - I set the buffer pool to (data + index) + 10% and noticed that it was at 100% usage... even increasing it above that, to 896 MB, still leaves it at 100% (even though the data + indexes are only ~671 MB).
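For reference, a quick way to measure those data and index figures (a hedged sketch against information_schema; the numbers are approximate and ignore free space inside the tablespace):
-- Approximate InnoDB data and index footprint, in MB:
SELECT SUM(data_length)  / 1024 / 1024 AS data_mb,
       SUM(index_length) / 1024 / 1024 AS index_mb
FROM information_schema.tables
WHERE engine = 'InnoDB';
-- innodb_buffer_pool_size then goes in my.cnf, roughly 10% above data_mb + index_mb;
-- on 5.1 the setting is not dynamic, so the server has to be restarted afterwards.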
I've attached the output of the InnoDB section of mysqlreport below. 'Free' pages at 1 also seems to suggest a major problem. innodb_flush_method is set to its default - I will investigate setting this to O_DIRECT, but I want to sort out this issue first.
__ InnoDB Buffer Pool __________________________________________________
Usage 895.98M of 896.00M %Used: 100.00
Read hit 100.00%
Pages
Free 1 %Total: 0.00
Data 55.96k 97.59 %Drty: 0.01
Misc 1383 2.41
Latched 0 0.00
Reads 405.96M 1.2k/s
From file 15.60k 0.0/s 0.00
Ahead Rnd 211 0.0/s
Ahead Sql 1028 0.0/s
Writes 29.10M 87.3/s
Flushes 597.58k 1.8/s
Wait Free 0 0/s
__ InnoDB Lock _________________________________________________________
Waits 66 0.0/s
Current 0
Time acquiring
Total 3890 ms
Average 58 ms
Max 3377 ms
__ InnoDB Data, Pages, Rows ____________________________________________
Data
Reads 21.51k 0.1/s
Writes 666.48k 2.0/s
fsync 324.11k 1.0/s
Pending
Reads 0
Writes 0
fsync 0
Pages
Created 84.16k 0.3/s
Read 59.35k 0.2/s
Written 597.58k 1.8/s
Rows
Deleted 19.13k 0.1/s
Inserted 6.13M 18.4/s
Read 196.84M 590.6/s
Updated 139.69k 0.4/s
Any help on this would be greatly appreciated.
Thanks!

Related

libvirt: use of hugepages on NUMA system

The machine has 4 NUMA nodes and is booted with the kernel boot parameter default_hugepagesz=1G. I start the VM with libvirt/virsh, and I can see that qemu launches with -m 65536 ... -mem-prealloc -mem-path /mnt/hugepages/libvirt/qemu, i.e. it starts the virtual machine with 64GB of memory and requests that the guest memory be allocated from a temporarily created file in /mnt/hugepages/libvirt/qemu:
% fgrep Huge /proc/meminfo
AnonHugePages: 270336 kB
ShmemHugePages: 0 kB
HugePages_Total: 113
HugePages_Free: 49
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 118489088 kB
%
% numastat -cm -p `pidof qemu-system-x86_64`
Per-node process memory usage (in MBs) for PID 3365 (qemu-system-x86)
Node 0 Node 1 Node 2 Node 3 Total
------ ------ ------ ------ -----
Huge 29696 7168 0 28672 65536
Heap 0 0 0 31 31
Stack 0 0 0 0 0
Private 4 9 4 305 322
------- ------ ------ ------ ------ -----
Total 29700 7177 4 29008 65889
...
Node 0 Node 1 Node 2 Node 3 Total
------ ------ ------ ------ ------
MemTotal 128748 129017 129017 129004 515785
MemFree 98732 97339 100060 95848 391979
MemUsed 30016 31678 28957 33156 123807
...
AnonHugePages 0 4 0 260 264
HugePages_Total 29696 28672 28672 28672 115712
HugePages_Free 0 21504 28672 0 50176
HugePages_Surp 0 0 0 0 0
%
This output confirms that the host's 512GB of memory is split equally across the NUMA nodes, and that hugepages are also distributed roughly equally across the nodes.
The question is: how does qemu (or kvm?) determine how many hugepages to allocate? Note that the libvirt XML has the following directive:
<memoryBacking>
<hugepages/>
<locked/>
</memoryBacking>
However, it is unclear from https://libvirt.org/formatdomain.html#memory-tuning what the defaults for hugepage allocation are, and on which nodes the pages end up. Is it possible to have all memory for the VM allocated from node 0? What is the right way of doing this?
UPDATE
Since my VM workload is actually pinned to a set of cores on a single NUMA node (node 0) using the <vcpupin> element, I thought it would be a good idea to force QEMU to allocate memory from the same NUMA node:
<numatune>
<memory mode="strict" nodeset="0"/>
</numatune>
However, this didn't work; qemu reported an error in its log:
os_mem_prealloc insufficient free host memory pages available to allocate guest ram
Does it mean it fails to find free huge pages on NUMA node 0?
If you use a plain <hugepages/> element, then libvirt will configure QEMU to allocate from the default huge page pool. Given your 'default_hugepagesz=1G', that should mean that QEMU allocates 1 GB sized pages. QEMU will allocate as many as are needed to satisfy the requested RAM size. Given your configuration, these huge pages can potentially be allocated from any NUMA node.
With more advanced libvirt configuration it is possible to request allocation of a specific size of huge page, and pick them from specific NUMA nodes. The latter is only really needed if you are also locking CPUs to a specific host NUMA node.
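As an illustration only, a hedged sketch of that more advanced configuration based on the formatdomain documentation - it assumes the guest also defines a NUMA cell 0, since the nodeset in <page> refers to guest NUMA cells, not host nodes:
<memoryBacking>
  <hugepages>
    <!-- request 1 GiB pages for guest NUMA cell 0 -->
    <page size="1" unit="G" nodeset="0"/>
  </hugepages>
  <locked/>
</memoryBacking>
<numatune>
  <!-- bind the guest memory to host NUMA node 0 -->
  <memory mode="strict" nodeset="0"/>
  <memnode cellid="0" mode="strict" nodeset="0"/>
</numatune>
Combined with <vcpupin> on the same host node, this keeps both vCPUs and RAM on node 0 - provided that node actually has enough free 1 GiB pages.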
Does it mean it fails to find free huge pages on NUMA node 0?
Yes, it does.
numastat -m can be used to find out how many huge pages there are in total, and how many are free, on each node.

Spark CSV GZip to Parquet?

I am using Spark 2.3.1 PySpark (AWS EMR)
I am getting memory errors:
Container killed by YARN for exceeding memory limits
Consider boosting spark.yarn.executor.memoryOverhead
I have an input of 160 files, each approx 350-400 MBytes; each file is CSV in gzip format.
To read the csv.gz files (with a wildcard) I use this PySpark:
dfgz = spark.read.load("s3://mybucket/yyyymm=201708/datafile_*.csv.gz",
format="csv", sep="^", inferSchema="false", header="false", multiLine="true", quote="^", nullValue="~", schema="id string,....")
To save the data frame I use this (PySpark)
(dfgz
.write
.partitionBy("yyyymm")
.mode("overwrite")
.format("parquet")
.option("path", "s3://mybucket/mytable_parquet")
.saveAsTable("data_test.mytable")
)
One line of code to save all 160 files.
I tried this with 1 file and it works fine.
Total size for all 160 files (csv.gzip) is about 64 GBytes.
Each file, unzipped to pure CSV, is approx 3.5 GBytes. I am assuming Spark may unzip each file in RAM and then convert it to Parquet in RAM?
I want to convert each csv.gz file to Parquet format, i.e. I want 160 Parquet files as output (ideally).
The task runs for a while and seems to create 1 Parquet file for each csv.gz file. After some time it always fails with a YARN memory error.
I tried various settings for executor memory and memoryOverhead and none of them made any difference - the job always fails. I tried memoryOverhead values of 1-8 GB and executor memory of 8 GB.
Apart from manually breaking the 160-file input up into many smaller workloads, what else can I do?
Do I need a Spark cluster with a total RAM capacity of much greater than 64 GB?
I use 4 slave nodes, each with 8 CPUs and 16 GB of RAM, plus one master with 4 CPUs and 8 GB of RAM.
That is (with overhead) less than the 64 GB of input gzip CSV I am trying to process, but the files are evenly sized at 350-400 MBytes, so I don't understand why Spark is throwing memory errors: it should easily be able to process these one file at a time per executor, discard it and move on to the next file. It does not appear to work this way. I suspect it is trying to load all the input csv.gz files into memory, but I have no way of knowing (I am still new to Spark 2.3.1).
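One way to check that suspicion (a hedged sketch; gzip is not splittable, so Spark normally gives each .csv.gz file its own partition and a single task must hold the whole decompressed file, roughly 3.5 GB here):
# After the read shown above, count the input partitions:
num_parts = dfgz.rdd.getNumPartitions()
print("input partitions:", num_parts)  # typically one partition per .csv.gz file
If that is the case, the per-task memory requirement is driven by the decompressed file size rather than the 350-400 MB on disk, which would explain the executor memory pressure.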
Late update: I managed to get it to work with the following memory config:
4 slave nodes, each 8 CPU and 16 GB of RAM
1 master node, 4 CPU and 8 GB of RAM:
spark maximizeResourceAllocation false
spark-defaults spark.driver.memoryOverhead 1g
spark-defaults spark.executor.memoryOverhead 2g
spark-defaults spark.executor.instances 8
spark-defaults spark.executor.cores 3
spark-defaults spark.default.parallelism 48
spark-defaults spark.driver.memory 6g
spark-defaults spark.executor.memory 6g
Needless to say, I cannot explain why this config worked!
Also, this took 2+ hours to process 64 GB of gzip data, which seems slow even for a small 4+1 node cluster with a total of 32+4 CPUs and 64+8 GB of RAM. Perhaps S3 was the bottleneck...
FWIW, I just did not expect to have to micro-manage a cluster for memory, disk I/O or CPU allocation.
Update 2:
I just ran another load on the same cluster with the same config - a smaller load of 129 files of the same sizes - and it failed with the same YARN memory errors.
I am very disappointed with Spark 2.3.1 memory management.
Thank you for any guidance

MySQL crash after enormous row locks

I'm using MySQL 5.7.14 x64 on Windows Server 2008 R2
Sometimes (at random times during the day) MySQL crashes with this stack trace:
11:44:40 UTC - mysqld got exception 0x80000003 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
key_buffer_size=8388608
read_buffer_size=65536
max_used_connections=369
max_threads=2800
thread_count=263
connection_count=263
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 3195125 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x2ee2b72b0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
13fe1bad2 mysqld.exe!my_sigabrt_handler()[my_thr_init.c:449]
1401c7979 mysqld.exe!raise()[winsig.c:587]
1401c6870 mysqld.exe!abort()[abort.c:82]
13ff1dd38 mysqld.exe!ut_dbg_assertion_failed()[ut0dbg.cc:67]
13ff1df51 mysqld.exe!ib::fatal::~fatal()[ut0ut.cc:916]
13ff0e008 mysqld.exe!buf_LRU_check_size_of_non_data_objects()[buf0lru.cc:1219]
13ff0f4ab mysqld.exe!buf_LRU_get_free_block()[buf0lru.cc:1303]
1400305cb mysqld.exe!buf_block_alloc()[buf0buf.cc:557]
13ff3767e mysqld.exe!mem_heap_create_block_func()[mem0mem.cc:319]
13ff37499 mysqld.exe!mem_heap_add_block()[mem0mem.cc:408]
13ffd87f4 mysqld.exe!RecLock::lock_alloc()[lock0lock.cc:1441]
13ffd795c mysqld.exe!RecLock::create()[lock0lock.cc:1534]
13ffd73a6 mysqld.exe!RecLock::add_to_waitq()[lock0lock.cc:1735]
13ffdcaaa mysqld.exe!lock_rec_lock_slow()[lock0lock.cc:2007]
13ffdc6ce mysqld.exe!lock_rec_lock()[lock0lock.cc:2081]
13ffd8cc7 mysqld.exe!lock_clust_rec_read_check_and_lock()[lock0lock.cc:6307]
140076fe3 mysqld.exe!row_ins_set_shared_rec_lock()[row0ins.cc:1502]
140072927 mysqld.exe!row_ins_check_foreign_constraint()[row0ins.cc:1739]
140072de8 mysqld.exe!row_ins_check_foreign_constraints()[row0ins.cc:1932]
140075d69 mysqld.exe!row_ins_sec_index_entry()[row0ins.cc:3356]
1400758a6 mysqld.exe!row_ins_index_entry_step()[row0ins.cc:3583]
140071b30 mysqld.exe!row_ins()[row0ins.cc:3721]
14007755a mysqld.exe!row_ins_step()[row0ins.cc:3907]
13ffaad50 mysqld.exe!row_insert_for_mysql_using_ins_graph()[row0mysql.cc:1735]
13fe7a7d3 mysqld.exe!ha_innobase::write_row()[ha_innodb.cc:7489]
13f6e5531 mysqld.exe!handler::ha_write_row()[handler.cc:7891]
13f8e54de mysqld.exe!write_record()[sql_insert.cc:1860]
13f8e916a mysqld.exe!read_sep_field()[sql_load.cc:1222]
13f8e7af4 mysqld.exe!mysql_load()[sql_load.cc:563]
13f716e86 mysqld.exe!mysql_execute_command()[sql_parse.cc:3649]
13f7194b3 mysqld.exe!mysql_parse()[sql_parse.cc:5565]
13f71267d mysqld.exe!dispatch_command()[sql_parse.cc:1430]
13f71368a mysqld.exe!do_command()[sql_parse.cc:997]
13f6d82bc mysqld.exe!handle_connection()[connection_handler_per_thread.cc:300]
140105122 mysqld.exe!pfs_spawn_thread()[pfs.cc:2191]
13fe1b93b mysqld.exe!win_thread_start()[my_thread.c:38]
1401c73ef mysqld.exe!_callthreadstartex()[threadex.c:376]
1401c763a mysqld.exe!_threadstartex()[threadex.c:354]
772859bd kernel32.dll!BaseThreadInitThunk()
773ba2e1 ntdll.dll!RtlUserThreadStart()
At this time only 2 transactions were active:
---TRANSACTION 1111758443, ACTIVE 565 sec
mysql tables in use 7, locked 7
7527 lock struct(s), heap size 876752, 721803 row lock(s), undo log entries 379321
MySQL thread id 166068, OS thread handle 1508, query id 112695582 localhost converter Waiting for table level lock
delete from pl
using
import_k2b_product_links ipl inner join k2b_products pSource on ipl.src_product = pSource.article and pSource.account_id = 22
inner join k2b_products pDest on ipl.dst_product = pDest.article and pDest.account_id = 22
inner join k2b_product_links pl on pl.src_product_id = pSource.id and pl.dst_product = pDest.id
where ipl.action = 1
---TRANSACTION 1111759716, ACTIVE 496 sec inserting, thread declared inside InnoDB 1
mysql tables in use 4, locked 4
7 lock struct(s), heap size 1304535248, 102060778 row lock(s), undo log entries 1
MySQL thread id 19436, OS thread handle 11664, query id 112301161 localhost exchange_central
LOAD DATA INFILE 'd:/kdm/temp/webCentral/ufrd1uwx.v2r'
INTO TABLE k2b_orders
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(id_status, dt, account_id, sms_sended, params, update_ts, exported, id_editor, dt_offset, device_id, gen, changer_device_id, total, creator_device_id, id, dt_server, device_category_id, original_params, order_num, sended, editor_comment, admin_comment)
I don't understand why transaction 1111758443 is waiting for a table-level lock.
And why does transaction 1111759716 hold 102,060,778 row locks while it loads just one row from the external file, as shown by 'undo log entries 1'?
What should I investigate to find the reason for these enormous locks and the crash?
Thanks!
Two things make me think that the crash is not the 'real' problem.
Both queries in the log show 'huge' times, such as ACTIVE 565 sec.
And these are all quite large:
max_used_connections=369
max_threads=2800
thread_count=263
connection_count=263
When there are hundreds of threads simultaneously active, InnoDB stumbles over itself. Throughput stalls, and latency goes through the roof.
One cure is to avoid so many connections. This is sometimes best done at the client. What is the client? For example, Apache has MaxClients. A dozen Apaches, each with MaxClients = 50 would be trying to open 600 connections. Probably one Apache cannot effectively handle 50 threads at once. Lower that number.
Are there any VIEWs deceiving us?
Another thing to do is to pursue the table-level lock. Let's see SHOW CREATE TABLE for the tables involved. Check for appropriate indexes:
import_k2b_product_links: INDEX(action, ...)
k2b_products: INDEX(account_id, src_product) -- in either order
k2b_products: INDEX(account_id, dest_product) -- in either order
k2b_product_links: INDEX(src_product_id, dest_product_id) -- or PK, see below
Is k2b_product_links a many:many mapping table? If so, get rid of the id auto_increment, as discussed here.
The index suggestions, if useful, could speed up the DELETE, thereby cutting down on possible contention.
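A hedged sketch of those suggestions as DDL - the column names are taken from the DELETE query above (the joins on k2b_products are against account_id and article) and may not match the real schema exactly, so verify against SHOW CREATE TABLE first:
-- Sketch only; adjust column names to the actual schema
ALTER TABLE import_k2b_product_links ADD INDEX idx_action_src_dst (action, src_product, dst_product);
ALTER TABLE k2b_products ADD INDEX idx_account_article (account_id, article);
ALTER TABLE k2b_product_links ADD INDEX idx_src_dst (src_product_id, dst_product);
If an index turns out to be redundant with an existing one (or with the primary key), skip it.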

How does atopsar calculate HDD load?

I have an HDD (sda) dedicated to the MySQL InnoDB log files (ib_logfile0, ib_logfile1), and atopsar shows a big load on this HDD.
atopsar -d 60:
13:02:10 disk busy read/s KB/read writ/s KB/writ avque avserv _dsk
13:03:10 sda 59% 4.4 4.0 45.3 6.2 1.0 11.88 ms
13:04:10 sda 60% 4.5 4.0 45.6 6.1 1.0 11.98 ms
13:05:10 sda 58% 4.2 4.0 44.7 6.0 1.0 11.94 ms
dstat -tdD total,sda 60:
----system---- -dsk/total----dsk/sda--
time | read writ: read writ
24-09 13:11:24| 23k 912k:9689B 391k
24-09 13:12:24| 33k 971k: 16k 270k
24-09 13:13:24| 16k 893k: 14k 235k
24-09 13:14:24| 18k 963k: 16k 254k
pt-ioprofile --cell sizes:
total pread read pwrite write fsync open close lseek fcntl filename
905728 0 0 905728 0 0 0 0 0 0 /var/mysqllog/mysql/ib_logfile0
200-400 KB per second does not seem like much to show busy > 50%, especially considering that the only files on the HDD are the MySQL InnoDB log files and (from the InnoDB blog):
The redo log files are used in a circular fashion. This means that the redo logs are written from the beginning to end of first redo log file, then it is continued to be written into the next log file, and so on till it reaches the last redo log file. Once the last redo log file has been written, then redo logs are again written from the first redo log file.
So the question is: why is the load so big? Is this really the physical capability of the HDD?
It seems the busy figure is calculated as (all requests, read + write) * avserv / 1000. For the first atopsar line the calculation is: (4.4 + 45.3) * 11.88 / 1000 = 0.59.
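A quick check of that formula against all three atopsar lines above (a sketch; the values are copied from the output):
# busy ~= (reads/s + writes/s) * average service time in ms / 1000
for reads, writes, avserv in [(4.4, 45.3, 11.88), (4.5, 45.6, 11.98), (4.2, 44.7, 11.94)]:
    busy = (reads + writes) * avserv / 1000.0
    print(f"{busy:.0%}")  # 59%, 60%, 58% - matching the busy column
In other words, the busy percentage is just utilisation: requests per second multiplied by the average time the disk spends servicing each request.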

Creating an index takes too long

About 2 months ago, I imported EnWikipedia data (http://dumps.wikimedia.org/enwiki/20120211/) into MySQL.
Since the import finished, I have been creating indexes on the tables of the EnWikipedia database for about 2 months.
Now I have reached the point of creating the index on "pagelinks".
However, it seems to take an infinite amount of time to get past that point.
Therefore, I checked the estimated time remaining to see whether my intuition was correct.
As a result, the estimated time remaining was 60 days (assuming that I create the index on "pagelinks" again from the beginning).
My EnWikipedia database has 7 tables:
"categorylinks"(records: 60 mil, size: 23.5 GiB),
"langlinks"(records: 15 mil, size: 1.5 GiB),
"page"(records: 26 mil, size 4.9 GiB),
"pagelinks"(records: 630 mil, size: 56.4 GiB),
"redirect"(records: 6 mil, size: 327.8 MiB),
"revision"(records: 26 mil, size: 4.6 GiB) and "text"(records: 26 mil, size: 60.8 GiB).
My server is:
Linux version 2.6.32-5-amd64 (Debian 2.6.32-39), 16 GB of memory, 2.39 GHz Intel 4-core CPU.
Is it a common phenomenon for index creation to take so many days?
Does anyone have a good solution for creating the index more quickly?
Thanks in advance!
P.S.: I performed the following operations to check the time remaining.
Reference (sorry, the following page is written in Japanese): http://d.hatena.ne.jp/sh2/20110615
1st. I counted the records in "pagelinks".
mysql> select count(*) from pagelinks;
+-----------+
| count(*) |
+-----------+
| 632047759 |
+-----------+
1 row in set (1 hour 25 min 26.18 sec)
2nd. I measured how many records are added per minute.
getHandler_write.sh
#!/bin/bash
while true
do
cat <<_EOF_
SHOW GLOBAL STATUS LIKE 'Handler_write';
_EOF_
sleep 60
done | mysql -u root -p -N
command
$ sh getHandler_write.sh
Enter password:
Handler_write 1289808074
Handler_write 1289814597
Handler_write 1289822748
Handler_write 1289829789
Handler_write 1289836322
Handler_write 1289844916
Handler_write 1289852226
3rd. I computed the insertion speed.
According to the result of step 2, the insertion speed is
7233 records/minute
4th. Then the time remaining is
(632047759/7233)/60/24 = 60 days
Those are pretty big tables, so I'd expect the indexing to be pretty slow - 630 million records is a LOT of data to index. One thing to look at is partitioning: with data sets that large, and without correctly partitioned tables, performance will be slow. Here are some useful links: using partitioning on slow indexes.
You could also try looking at the buffer size settings for building the indexes (the default is 8 MB), as for a table your size that is going to slow you down a fair bit: buffer size documentation.
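A hedged sketch of the buffer-size suggestion, assuming the dump tables are MyISAM (the usual engine for these Wikipedia imports) - the 8 MB default mentioned above matches myisam_sort_buffer_size, the buffer MyISAM uses when it builds an index by sorting; the example index columns are assumptions taken from the MediaWiki schema:
-- Raise the sort buffer for this session before rebuilding the index:
SET SESSION myisam_sort_buffer_size = 256 * 1024 * 1024;  -- 256 MB instead of the 8 MB default
-- Then build the index (example only; use the index definition from the dump's own DDL):
ALTER TABLE pagelinks ADD INDEX pl_namespace (pl_namespace, pl_title, pl_from);
If the tables are InnoDB instead, different settings apply, so check the table engine first with SHOW TABLE STATUS.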