How to speed up loading an XML file into MySQL

I've got a 2 GB XML file that I want to load into a single table in MySQL.
The number of records/rows is ~140,000, but the default behavior of the LOAD XML statement in MySQL seems to depart from linear time.
Cutting the data into smaller pieces, I get the following performance (the table was dropped between each LOAD):
All runs reported Deleted: 0, Skipped: 0, Warnings: 0
 5000 row(s) affected  Records: 5000     4.852 sec
10000 row(s) affected  Records: 10000   20.670 sec
15000 row(s) affected  Records: 15000   80.294 sec
20000 row(s) affected  Records: 20000  202.474 sec
The XML is well formed. I've tried:
SET FOREIGN_KEY_CHECKS=0;
SET UNIQUE_CHECKS=0;
What can I do to load it in a reasonable time that doesn't involve cutting it into a dozen pieces?

Try removing the indexes before the load, then rebuilding them afterward.
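A minimal sketch of that approach (the table, index, and column names here are placeholders, since the question doesn't show the schema):

-- Drop secondary indexes so the load only has to maintain the clustered index
ALTER TABLE xmltable DROP INDEX idx_col1;

-- ROWS IDENTIFIED BY should name the element that wraps each record
LOAD XML LOCAL INFILE '/path/to/data.xml'
    INTO TABLE xmltable
    ROWS IDENTIFIED BY '<row>';

-- Rebuild the index in a single pass after the load
ALTER TABLE xmltable ADD INDEX idx_col1 (col1);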

MySQL load NULL data

The document I want to upload is organized this way:
student#ubuntuM0151:~/PEC3$ head -5 Attributes.txt
GTEX-1117F-0003-SM-58Q7G Blood Whole Blood
GTEX-1117F-0003-SM-5DWSB Blood Whole Blood
GTEX-1117F-0003-SM-6WBT7 Blood Whole Blood
GTEX-1117F-0011-R10a-SM-AHZ7F Brain Brain - Frontal Cortex (BA9)
GTEX-1117F-0011-R10b-SM-CYKQ8 Brain Brain - Frontal Cortex (BA9)
I create a table to upload it to:
mysql> CREATE TABLE attributes
-> (sampID VARCHAR(200) NOT NULL,
-> muestra VARCHAR(200) NOT NULL,
-> cantidad FLOAT,
-> PRIMARY KEY(muestra),
-> FOREIGN KEY(sampID) REFERENCES agesex(SUBJID));
Query OK, 0 rows affected (0.11 sec)
mysql> LOAD DATA LOCAL INFILE 'Attributes.txt' INTO TABLE attributes
-> FIELDS TERMINATED BY "\t" LINES TERMINATED BY "\n"
-> ;
Query OK, 0 rows affected, 65535 warnings (2.43 sec)
Records: 22951 Deleted: 0 Skipped: 22951 Warnings: 68853
I can't understand what is going on.
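One way to start diagnosing this is to inspect the warnings in the same session, immediately after the LOAD (a sketch; the LIMIT is arbitrary):

-- Warnings are per-session and reset by later statements,
-- so this has to run right after the LOAD DATA
SHOW WARNINGS LIMIT 10;

Note also that with LOCAL the default duplicate-key handling is IGNORE, and the sample rows repeat values like 'Blood' in muestra, the PRIMARY KEY column, so duplicate-key skips would be expected here.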

MySQL - max_binlog_cache_size vs binlog_cache_size

There is quite a lot of confusion in the description of these variables in the official MySQL documentation.
According to it, max_binlog_cache_size means:
If a transaction requires more than this many bytes of memory, the
server generates a Multi-statement transaction required more than
'max_binlog_cache_size' bytes of storage error.
max_binlog_cache_size sets the size for the transaction cache only
and binlog_cache_size means:
The size of the cache to hold changes to the binary log during a
transaction.
binlog_cache_size sets the size for the transaction cache only
Reading the documentation, I see no difference between the two. There is also something very confusing in the documentation, like:
In MySQL 5.7, the visibility to sessions of max_binlog_cache_size
matches that of the binlog_cache_size system variable; in other words,
changing its value affects only new sessions that are started after
the value is changed.
When I query the server variables, both show up. I have a MySQL 5.6 and a MySQL 5.7. All I need to know is which variable I should configure for which server.
binlog_cache_size for MySQL 5.6 and max_binlog_cache_size for MySQL 5.7?
There are also the related, equally confusing variables max_binlog_stmt_cache_size and binlog_stmt_cache_size.
Both variables can be configured in both versions, but they have different meanings. The definitions in the manual and in the built-in help are confusing; there is a much better explanation here: http://dev.mysql.com/doc/refman/5.6/en/binary-log.html
binlog_cache_size defines the maximum amount of memory that the buffer can use. If a transaction grows above this value, it uses a temporary disk file. Note that the buffer is allocated per connection.
max_binlog_cache_size defines the maximum total size of a transaction. If a transaction grows above this value, it fails.
Below is a simple demonstration of the difference.
Setup:
MariaDB [test]> select @@binlog_cache_size, @@max_binlog_cache_size, @@binlog_format;
+---------------------+-------------------------+-----------------+
| @@binlog_cache_size | @@max_binlog_cache_size | @@binlog_format |
+---------------------+-------------------------+-----------------+
|               32768 |                   65536 | ROW             |
+---------------------+-------------------------+-----------------+
1 row in set (0.01 sec)
MariaDB [test]> show create table t1 \G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
`a` text
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
1. Transaction size is below @@binlog_cache_size
(transaction succeeds, uses the cache, does not use the disk)
MariaDB [test]> flush status;
Query OK, 0 rows affected (0.00 sec)
MariaDB [test]> begin;
Query OK, 0 rows affected (0.00 sec)
MariaDB [test]> insert into t1 values (repeat('a',20000));
Query OK, 1 row affected (0.01 sec)
MariaDB [test]> insert into t1 values (repeat('a',10000));
Query OK, 1 row affected (0.04 sec)
MariaDB [test]> commit;
Query OK, 0 rows affected (0.05 sec)
MariaDB [test]> show status like 'Binlog_cache%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Binlog_cache_disk_use | 0     |
| Binlog_cache_use      | 1     |
+-----------------------+-------+
2 rows in set (0.01 sec)
2. Transaction size is above @@binlog_cache_size, but below @@max_binlog_cache_size
(transaction uses the cache, and the cache uses the disk)
MariaDB [test]> flush status;
Query OK, 0 rows affected (0.00 sec)
MariaDB [test]> begin;
Query OK, 0 rows affected (0.00 sec)
MariaDB [test]> insert into t1 values (repeat('a',20000));
Query OK, 1 row affected (0.10 sec)
MariaDB [test]> insert into t1 values (repeat('a',20000));
Query OK, 1 row affected (0.10 sec)
MariaDB [test]> commit;
Query OK, 0 rows affected (0.03 sec)
MariaDB [test]> show status like 'Binlog_cache%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Binlog_cache_disk_use | 1     |
| Binlog_cache_use      | 1     |
+-----------------------+-------+
2 rows in set (0.01 sec)
3. Transaction size exceeds @@max_binlog_cache_size
(transaction fails)
MariaDB [test]> flush status;
Query OK, 0 rows affected (0.00 sec)
MariaDB [test]> begin;
Query OK, 0 rows affected (0.00 sec)
MariaDB [test]> insert into t1 values (repeat('a',20000));
Query OK, 1 row affected (0.12 sec)
MariaDB [test]> insert into t1 values (repeat('a',20000));
Query OK, 1 row affected (0.15 sec)
MariaDB [test]> insert into t1 values (repeat('a',20000));
Query OK, 1 row affected (0.12 sec)
MariaDB [test]> insert into t1 values (repeat('a',20000));
ERROR 1197 (HY000): Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage; increase this mysqld variable and try again
So, if your transactions are big but you don't have too many connections, you might want to increase @@binlog_cache_size to avoid excessive disk writes.
If you have many concurrent connections, be careful: connections can end up trying to allocate too much memory for the caches simultaneously.
If you want to make sure that transactions don't grow too big, you might want to limit @@max_binlog_cache_size.
@@binlog_stmt_cache_size and @@max_binlog_stmt_cache_size should work in a similar way; the difference is that the binlog_cache values apply to transactional updates, and the binlog_stmt_cache values to non-transactional updates.
While experimenting, note that the values are not 100% precise; there are some hidden subtleties with initially allocated sizes. It shouldn't matter for practical purposes, but it can be confusing when you play with low values.
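If you end up tuning these, a minimal my.cnf sketch (the values are purely illustrative, not recommendations):

[mysqld]
# Per-connection in-memory buffer for a transaction's binlog events;
# a transaction that outgrows it spills to a temporary disk file
binlog_cache_size = 1M

# Hard cap per transaction; exceeding it raises error 1197
max_binlog_cache_size = 512M

# The same pair for non-transactional updates
binlog_stmt_cache_size = 1M
max_binlog_stmt_cache_size = 512M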

MySQL Update Query, Rows Matched But Not Changed

Why will this not update 501 records? What's wrong with my query?
MariaDB [contacts]> UPDATE history h, phone_corrections t SET h.contact = t.new_nmbr WHERE h.contact = t.old_nmbr;
Query OK, 0 rows affected (0.03 sec)
Rows matched: 501 Changed: 0 Warnings: 0
MariaDB [contacts]>
FIXED! There were several records where the old_nmbr field had the same value as the new_nmbr field. Sorry for the post.
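For anyone hitting the same thing, a quick check against the tables from the question (a sketch): matched-but-unchanged rows are exactly those where the replacement value equals the current value.

-- Count matched rows where the UPDATE would be a no-op
SELECT COUNT(*)
FROM history h
JOIN phone_corrections t ON h.contact = t.old_nmbr
WHERE t.new_nmbr = h.contact;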

Which log file to check for warnings in MariaDB?

I issued an update command which gave the following output:
Query OK, 1 row affected, 2 warnings (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 2
So I was looking for the log file that might have the details of the warnings reported for the above command.
Note: I had exited from the session, so SHOW WARNINGS returns an empty set.
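Warnings are held per session and are cleared as subsequent statements run, so by default they never reach a log file. If the statement is safe to re-run, a sketch of recovering them:

-- Re-issue the UPDATE in a new session, then, before running anything else:
SHOW WARNINGS;

-- Or just the count:
SHOW COUNT(*) WARNINGS;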

Why isn't all the data getting loaded into my MySQL table?

So I have a file of Twitter data that looks like this:
Robert_Aderholt^&^&^2013-06-12 18:32:02^&^&^RT #financialcmte: In 2012, the Obama Admin published 1,172 new regulations totaling 79,000 pages. 57 were expected to have costs of at...
Robert_Aderholt^&^&^2013-06-12 13:42:09^&^&^The Administration's idea of a 'recovery' is 4 million fewer private sector jobs than the average post WWII recovery http://t.co/gSVW0Q8MYK
Robert_Aderholt^&^&^2013-06-11 13:51:17^&^&^As manufacturing jobs continue to decrease, its time to open new markets #4Jobs http://t.co/X2Mswr1i43
(The ^&^&^ sequences are separators; I chose that separator because it's unlikely to occur in any of the tweets.)
This file is 90663 lines long (I checked with "wc -l tweets_parsed-6-12.csv").
However, when I load them into the table, I only get a table with 40456 entries:
mysql> source ../code/tweets2tables.sql;
Query OK, 0 rows affected (0.03 sec)
Query OK, 0 rows affected (0.08 sec)
Query OK, 40456 rows affected, 2962 warnings (0.81 sec)
Records: 40456 Deleted: 0 Skipped: 0 Warnings: 2962
mysql> SELECT COUNT(*) FROM tweets;
+----------+
| COUNT(*) |
+----------+
|    40456 |
+----------+
1 row in set (0.02 sec)
Why is that? I deleted all lines that didn't contain ^&^&^, so I didn't think there was any funny business going on with the data, but I could be wrong.
My loading code is
DROP TABLE IF EXISTS tweets;
CREATE TABLE tweets (
twitter_id VARCHAR(20),
post_date DATETIME,
body VARCHAR(140)
);
LOAD DATA
LOCAL INFILE 'tweets_parsed-6-12.csv'
INTO TABLE tweets
FIELDS TERMINATED BY '^&^&^'
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(twitter_id, post_date, body);
The lines that weren't loaded probably contained the " character. If you specify that your fields are enclosed by ", then quotes inside a field have to be escaped by doubling them, like this: "".
Adding the OPTIONALLY keyword before ENCLOSED BY may help.
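A sketch of the adjusted load statement; and since the fields in this file are never actually wrapped in quotes, dropping the enclosure clause entirely is another option that also copes with tweets that begin with a " character:

-- Per the suggestion above: treat the quote as an optional enclosure
LOAD DATA
LOCAL INFILE 'tweets_parsed-6-12.csv'
INTO TABLE tweets
FIELDS TERMINATED BY '^&^&^' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(twitter_id, post_date, body);

-- Alternatively, so that a leading " in a tweet is read literally:
-- FIELDS TERMINATED BY '^&^&^'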