I would like to know the progress of indexing my primary key column, "URL", which has around 1 million rows: has it finished, or is it still in progress? I would like the indexing to run in memory to reduce HDD I/O. More importantly, I want the whole process to take no more than 10 minutes, because I need to delete all the data and insert new data every day.
innodb_buffer_pool_size = 3G
MySQL table structure (mytable):
URL (primary key) VARCHAR(255)
filename VARCHAR(200)
mytable: 1,065,380 rows, InnoDB, latin1_swedish_ci, 1 GiB
Space usage
Data 404 MiB
Index 660 MiB
Total 1 GiB
Row Statistics
Format dynamic
Collation latin1_swedish_ci
Creation Nov 18, 2017 at 02:35 PM
How much RAM do you have? For a 3G buffer_pool, I hope you have at least 6GB of RAM. If not, the buffer_pool is configured so big that you are probably swapping. This is a serious performance killer.
If you already have PRIMARY KEY(URL), there is no reason to add INDEX(URL). The PK is already an index, and UNIQUE.
What query are you hoping to speed up?
I downloaded the enwiki-latest-pagelinks.sql.gz dump from dumps.wikimedia.org/enwiki/latest/.
I unpacked the file; its uncompressed size is 37G.
The table structure is this:
SHOW CREATE TABLE wp_dump.pagelinks;
CREATE TABLE `pagelinks` (
`pl_from` int(8) unsigned NOT NULL DEFAULT '0',
`pl_namespace` int(11) NOT NULL DEFAULT '0',
`pl_title` varbinary(255) NOT NULL DEFAULT '',
`pl_from_namespace` int(11) NOT NULL DEFAULT '0',
UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary
I imported the table into a new, empty database:
mysql -D wp_dump -u root -p < enwiki-latest-pagelinks.sql
The computer I am running the task on has 16G of RAM and the MySQL database is located on an SSD, so I assumed that, despite the table's size, the import would not take too long.
However, the task has been running for over a day and is still going. No other processes are accessing MySQL and there is no other workload on the computer.
The database file itself is now 79G.
ls -lh
-rw-r----- 1 mysql mysql 65 May 11 17:40 db.opt
-rw-r----- 1 mysql mysql 8,6K May 12 07:06 pagelinks.frm
-rw-r----- 1 mysql mysql 79G May 13 16:59 pagelinks.ibd
The table now has over 500 million rows.
SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'wp_dump';
+------------+------------+
| table_name | table_rows |
+------------+------------+
| pagelinks | 520919860 |
+------------+------------+
I am wondering:
Is the enwiki-latest-pagelinks.sql really over 79G large?
Does pagelinks really contain over 500 million rows?
Does it really take that long to import the pagelinks table?
Can you provide some metrics the expected table size and the row amount, please?
Update: 14 May 2017:
insert still running; pagelinks.ibd file now 130G; number of rows now almost 700 million
Update: 16 May 2017:
insert still running; pagelinks.ibd file now 204G; number of rows now over 1.2 billion
I calculated the rows inserted per second over the last two days:
rows/sec = 3236
Also, each INSERT statement in the SQL script contains many thousands of rows (head -41 enwiki-latest-pagelinks.sql | tail -1 | grep -o "(" | wc -l gives 30471).
So, my follow-up / modified questions:
Are the number of rows and the ibd file size to be expected, given the SQL file size of 37G and the table structure (as listed above)?
Is rows/sec = 3236 a good value (meaning that it takes a few days to load the table)?
What may be the limiting speed factor / how can I speed up the import?
Disable the indexes (and calculate them after the insert)?
Optimize transactions (commit (nothing set in script) / autocommit (now ON))?
Optimize variable settings (e.g. innodb_buffer_pool_size, now 134217728)?
#Sim Betren: I am currently importing the same table and get about 7700 rows/s, which works out to roughly 665,000,000 rows a day. Probably the most important thing is to get the InnoDB settings right:
https://dba.stackexchange.com/questions/83125/mysql-any-way-to-import-a-huge-32-gb-sql-dump-faster
innodb_buffer_pool_size = 4G
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
innodb_write_io_threads = 16
innodb_flush_log_at_trx_commit = 0
Those settings work well. From what I have read and tried, InnoDB loves high memory settings. Ideally, one would use a 16 GB or even 32 GB machine and increase these settings even more. But I got 7700 rows/s on a modest setup that is almost 10 years old:
Intel Q6700 quad
8 GB DDR2 memory
I combined that 10-year-old hardware with a 2017-model 500 GB SSD, which is dedicated to the job and handles both the reading and the writing. The reason for using the older hardware is that the SSD is the most important part of the setup (because of IOPS), and older hardware saved me a bit of money. However, the hardware is limited to 8 GB of DDR2. I reckon a newer dedicated machine with 32 GB or 64 GB of memory could really fly.
Software setup:
Linux Mint 64bit
MySQL Server 5.7.18 for Ubuntu
MySQL Workbench for importing
I also tried this on Windows 10 and the speed is almost the same on both. So you could try Windows too.
Note: I did try changing the engine to MyISAM. MyISAM can be pretty fast, also around 8000 rows/sec or more, but the import always ended up corrupted for some reason, so I would stick with InnoDB.
Update 17-06-2017:
Finished the import. The table "pagelinks" is about 214 GB with 1200 million rows. About 112 GB is raw data and 102 GB is indexes. The original uncompressed file was about 37 GB.
It took about 2 days and 6 hours to import. Avg speed = 5350 rows/second. With high-end equipment (lots of memory, preferably 64 GB or more) and optimal settings, it can probably be done faster. But I let it run on a dedicated machine 24/7 and I wasn't in a hurry, so 2 days seems OK.
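As a rough cross-check of the figures in this thread, here is a small Python sketch that converts an insert rate into an import time. The row count (about 1.2 billion) and the rates (3236, 5350 and 7700 rows/s) are taken from the posts above; the function names are only illustrative. All inputs are rounded, so the results will only approximately match the reported 2 days and 6 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_day(rows_per_second):
    """Rows inserted in 24 hours at a constant rate."""
    return rows_per_second * SECONDS_PER_DAY


def import_days(total_rows, rows_per_second):
    """Days needed to insert total_rows at a constant rate."""
    return total_rows / rows_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    total_rows = 1_200_000_000  # approximate final row count reported above
    for rate in (3236, 5350, 7700):
        print(f"{rate} rows/s -> {import_days(total_rows, rate):.1f} days")
```

At these rates the import comes out at roughly 4.3, 2.6 and 1.8 days respectively, so a multi-day import is in line with the row count rather than a sign that something is broken.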
Update 18-06-2017:
I also imported "page.sql", because it contains the names that belong to the IDs. The uncompressed file is about 5 GB and the import took 1 hour. That seems quick: the pagelinks file is about 37 GB, which is 7x bigger than "page.sql", yet it takes 50x longer to import. So there are a few possible reasons why "pagelinks" takes so long: (A) it probably doesn't fit in memory, (B) the table structure, with a lot of data per insert, (C) settings. But most likely it's memory.
Conclusion: try to get a PC with 32 GB or 64 GB of memory, maybe even more. And use SSDs that can keep up with that memory, 500 GB or more. The SSD is more important than memory, so try that first.
#Sim Betren:
I am posting a completely new answer because I have found a different solution: splitting up the file is probably the best approach. As discussed in the other answer, InnoDB works best when the data it is working on fits in memory. The delays start when it has to go to disk. The pagelinks file is 37 GB, which is simply too big to fit comfortably in memory on most machines. Maybe a $1000+ dedicated machine with plenty of memory can do it, but most desktops can't. So here is what you can do:
The plan is to split the file. The first thing to do is to separate the SQL structure from the data.
There are probably better ways to do that, but the program I found was this:
SqlDumpSplitter2
That dump splitter program may be old, but it worked on pagelinks. It is Windows-only, though. I simply told it to split the unpacked 37 GB file into 37 chunks of 1 GB and it dutifully did. I checked the data and it seems to be fine. You could also use 74 chunks of 500 MB.
Importing each 1 GB file takes maybe 10 to 20 minutes.
Total time: about 1 to 2 hours for splitting the 37 GB file and about 6 to 12 hours for importing. That easily beats the previous answer I gave.
When importing, use the same big-data settings as in the previous answer, and try to find a machine with a lot of memory, preferably 16 GB or 32 GB.
The most important point is that it doesn't matter much how you split it. Just split the file any way you can, then rebuild the table by re-creating the structure and loading the data separately. This way the import could go down from 2 days to maybe as little as a few hours. On a big dedicated machine it can probably be done in just 1 to 6 hours.
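If you are not on Windows, the splitting step can be approximated with a few lines of Python. This is only a minimal sketch, not what SqlDumpSplitter2 does internally: it assumes every statement sits on a single line (as the long INSERT lines in the pagelinks dump do) and only ever cuts at line ends. The function name, the chunk file names and the 1 GB default are made up for illustration.

```python
import os


def split_dump(path, out_dir, max_bytes=1_000_000_000):
    """Split a SQL dump into numbered chunk files, cutting only at line ends.

    Assumption: each statement is on one line, so no statement is cut in half.
    A single line longer than max_bytes still goes into one chunk, whole.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    written = 0
    with open(path, "rb") as src:
        for line in src:
            # Open a new chunk if none is open yet, or if this line would
            # push a non-empty chunk over the size limit.
            need_new = out is None or (written > 0 and written + len(line) > max_bytes)
            if need_new:
                if out is not None:
                    out.close()
                chunk_path = os.path.join(out_dir, f"chunk_{len(chunk_paths):04d}.sql")
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "wb")
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return chunk_paths
```

The CREATE TABLE statement at the top of the dump spans several lines, but it is tiny compared with the chunk size, so it ends up intact in the first chunk; import that chunk first and the rest in order.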
37GB of data --> 79GB of InnoDB table seems reasonable...
Title: 2 quotes and 1 comma --> 1 byte for length
Ints: several bytes, plus comma --> 4 bytes for INT (regardless of the (...) after INT). See MEDIUMINT.
20-30 bytes overhead per row
20-40% overhead for BTrees.
UNIQUE index becomes PRIMARY KEY and clusters with data --> very little overhead.
Other two indexes: Each is virtually the same size as the data. This more than allows for the increased size.
Adding it all together, I would expect the table to be more than 120GB. So, there are probably some details missing. A guess: The dump is one row per INSERT, instead of the less verbose many-rows-per-INSERT.
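The rules of thumb above can be turned into a back-of-the-envelope calculator. This is only a sketch: the average title length is not given anywhere in this thread and is a pure assumption, the per-row overhead and BTree overhead are midpoints of the ranges quoted above, and the function name is hypothetical.

```python
INT_BYTES = 4


def estimate_table_bytes(rows, avg_title_bytes, row_overhead=25,
                         btree_overhead=0.30, secondary_indexes=2):
    """Rough InnoDB size for the pagelinks layout, using the rules of thumb above.

    Returns (data_bytes, total_bytes). Each secondary index is counted as
    being the same size as the clustered data.
    """
    # Three INT columns, the title plus 1 length byte, plus per-row overhead.
    row_bytes = 3 * INT_BYTES + avg_title_bytes + 1 + row_overhead
    data_bytes = rows * row_bytes * (1 + btree_overhead)
    total_bytes = data_bytes * (1 + secondary_indexes)
    return data_bytes, total_bytes


if __name__ == "__main__":
    # 520,919,860 is the row count from the question; 20 bytes per title is a guess.
    data, total = estimate_table_bytes(520_919_860, avg_title_bytes=20)
    print(f"data ~{data / 1e9:.0f} GB, total ~{total / 1e9:.0f} GB")
```

Because the result scales linearly with the row count, rerunning it with a later row count gives a proportionally larger estimate.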
As for performance, it all depends on the SELECTs. Set innodb_buffer_pool_size to somewhere around 11G. This may work efficiently enough for caching the 79G.
More
Change UNIQUE to PRIMARY, for clarity and because InnoDB really needs a PK.
Check the source data. Is it in (pl_from,pl_namespace,pl_title) order? If not, can you sort the file before loading? If you can, that, alone, should significantly help the speed.
128MB for the buffer_pool is also significantly hampering progress.
I noticed that a MySQL server is running at 100% CPU, and the "kernel time" (I'm not sure what that means) is unusually high, at about 70%.
There are many connections on this server (around 400) and some active queries (about 40). Would that explain this behavior? Is something wrong, or is this expected?
Edit:
As suggested by a comment, I checked the 'handler_read%' variables:
show global status like 'handler_read%'. Here are the results:
Handler_read_first 248684
Handler_read_key 3081370400
Handler_read_last 83333
Handler_read_next 3520958058
Handler_read_prev 330
Handler_read_rnd 2210158755
Handler_read_rnd_deleted 60107588
Handler_read_rnd_next 929907565
The complete show status and show variables result is here:
https://www.dropbox.com/s/98pnd1rzgfp4jtf/server_status.txt?dl=0
https://www.dropbox.com/s/rh0m8np0mosx6tp/server_variables.txt?dl=0
The high values for handler_read_rnd* indicate that your tables are not properly indexed or that your queries are not written to take advantage of the indexes you have.
Table scans use more CPU because of syscall overhead and context switches.
Before changing parameters or investing money in hardware, I would suggest optimizing your database:
Activate the slow query log for a limited time (it can grow very fast); you can additionally set the parameters log_queries_not_using_indexes and min_examined_row_limit.
Analyze the queries in the slow query log with EXPLAIN or EXPLAIN EXTENDED.
If the problem occurs on a production server, replicate the content to a test system first.
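To put a number on "high", here is a quick Python sketch that computes what share of all handler reads in the question went through the Handler_read_rnd* counters. The counter values are copied from the question; the function name is only illustrative, and the ratio is a rough indicator, not an official metric.

```python
handler_reads = {
    "Handler_read_first": 248684,
    "Handler_read_key": 3081370400,
    "Handler_read_last": 83333,
    "Handler_read_next": 3520958058,
    "Handler_read_prev": 330,
    "Handler_read_rnd": 2210158755,
    "Handler_read_rnd_deleted": 60107588,
    "Handler_read_rnd_next": 929907565,
}


def rnd_share(counters):
    """Fraction of all handler reads counted under Handler_read_rnd*."""
    total = sum(counters.values())
    rnd = sum(value for name, value in counters.items()
              if name.startswith("Handler_read_rnd"))
    return rnd / total


if __name__ == "__main__":
    print(f"Handler_read_rnd* share: {rnd_share(handler_reads):.1%}")
```

For the values posted, roughly a third of all row reads are not going through an index lookup or index range scan.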
A number of settings are too high or too low...
tmp_table_size and max_heap_table_size are 16G -- This is disastrous! Each connection might need one or more of these. Lower it to 1% of RAM.
There are a large number of Com_show_fields -- complain to the 3rd party vendor.
Large number for Created_tmp_disk_tables -- this usually means poorly indexed or designed queries.
Select_scan / Com_select = 77% -- Missing lots of indexes?
Threads_running = 229 -- they are probably tripping over each other.
FLUSH STATUS was run recently, so some STATUS values are not useful.
table_open_cache is 256 -- There are some indications that a bigger number would be good. Try 1500.
key_buffer_size is only 1% of RAM; raise it to 20%.
Still, ... High CPU means poor indexes and/or poorly designed queries. Let's see some of them, together with SHOW CREATE TABLE.
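To see why 16G for tmp_table_size is dangerous, here is a small worst-case calculation in Python. The 400 connections come from the question; the 64 GiB of RAM is a made-up figure (the server's RAM is not stated), and the function names are illustrative. In-memory temp tables grow only as needed, so this is an upper bound, not what the server allocates up front.

```python
GIB = 1024 ** 3


def worst_case_tmp_bytes(connections, tmp_table_size, tables_per_connection=1):
    """Upper bound on memory used if every connection fills its temp table(s)."""
    return connections * tables_per_connection * tmp_table_size


def suggested_tmp_table_size(ram_bytes, fraction=0.01):
    """The '1% of RAM' rule of thumb from the list above."""
    return int(ram_bytes * fraction)


if __name__ == "__main__":
    worst = worst_case_tmp_bytes(connections=400, tmp_table_size=16 * GIB)
    print(f"worst case: {worst / GIB:.0f} GiB")
    ram = 64 * GIB  # hypothetical; substitute the server's real RAM
    print(f"suggested tmp_table_size: {suggested_tmp_table_size(ram) / 1024 ** 2:.0f} MiB")
```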
I have the below data that I am caching in Python now:
id timestamp data-string
The data-string size is ~87 bytes. Storing this optimally in Python (using a dict, with the timestamp prepended to the data-string with a delimiter), the RAM cost per entry comes to ~198 bytes. This is quite big for the size of the cache I need.
I would like to try storing the same data in a MySQL table, to see if I can save on RAM. For that, I store it as:
id timestamp data-string
4B 4B
<---- PK ---->
I understand that MySQL will load the index of the InnoDB table (that's what I have now) into RAM. Therefore, the id (unique), the timestamp and a pointer to the data-string will reside in RAM.
How do I calculate the complete RAM usage (ie including the meta-data) for the B+Tree of MySQL only for this new table?
There are so many variables, padding, etc, that it is impractical to estimate how much disk space an InnoDB BTree will consume. The 2x you quote is pretty good. The buffer_pool is a cache in RAM, so you can't say that the BTree will consume as much RAM space as disk did. Caching is on 16KB blocks.
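For a very rough feel of the numbers, the 2x rule of thumb can be applied to the row layout in the question (4-byte id, 4-byte timestamp, ~87-byte data-string). This Python sketch is only an approximation under that assumption; the row count is hypothetical, and the result is the on-disk BTree size rounded up to 16KB blocks, which is an upper bound on what the buffer_pool would hold for this table.

```python
import math

PAGE_BYTES = 16 * 1024  # InnoDB caches in 16KB blocks


def innodb_estimate_bytes(rows, raw_row_bytes, overhead_factor=2.0):
    """Raw data size times a rule-of-thumb factor, rounded up to whole pages."""
    pages = math.ceil(rows * raw_row_bytes * overhead_factor / PAGE_BYTES)
    return pages * PAGE_BYTES


if __name__ == "__main__":
    raw_row = 4 + 4 + 87   # id + timestamp + data-string, from the question
    rows = 10_000_000      # hypothetical cache size
    total = innodb_estimate_bytes(rows, raw_row)
    print(f"~{total / rows:.0f} bytes per row, ~{total / 1e9:.1f} GB in total")
```

With the 2x factor this comes to about 190 bytes per row, which is not much below the ~198 bytes per entry measured for the Python dict, so the RAM saving only materializes if part of the table can stay on disk.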
(#ec5 has good info on the current size, on disk, of the index(es).)
I have an InnoDB table with over 140 million rows, taking 26GB. I would like to drop this table, but it is taking far too long, probably days. How can I speed this up?
processor: Intel® Xeon® E3-1220 4 Cores x 3.1 GHz (3.4 Turbo Boost)
ram: 12 GB DDR3 ECC
If you don't want to check referential integrity while deleting records, use
TRUNCATE TABLE table_name
More in MySQL documentation.
This may depend on your innodb_file_per_table setting. You may find Performance problem with Innodb and DROP TABLE relevant:
If you’re not using innodb_file_per_table you’re not affected because
in this case tablespace does not need to be dropped so pages from
Buffer pool LRU do not need to be discarded. The raw performance of
dropping tables I measured on the test server was 10 times better with
single tablespace, though this will vary a lot depending on buffer
pool size. MyISAM tables creating/dropping was several times faster
still.
Also check Slow DROP TABLE
This solved my problem. Who would have thought it was because of integrity checks:
https://dba.stackexchange.com/questions/53515/drop-table-on-a-huge-innodb-table/
For every day, we construct mappings from tweet user id to the list of ids of the tweets made by that user. The storage engine we are using is Percona XtraDB "5.1.63-rel13.4 Percona Server (GPL), 13.4, Revision 443"
We are unsatisfied with the maximum throughput in terms of row inserts per second. Our maximum throughput when processing tweets with XtraDB is around 6000 ~ 8000 tweets per second. (For example, if we had to rebuild the data for some day from scratch, we would have to wait for almost a day.)
For the most part we can keep up in close to real time with the full volume of Twitter data (which is roughly 4000 ~ 5000 tweets per second).
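The "almost a day" figure follows directly from these rates, as this small Python sketch shows. It uses only the numbers quoted above (feed rate of 4000 to 5000 tweets/s, processing rate of 6000 to 8000 tweets/s); the function name is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rebuild_hours(feed_rate, process_rate, feed_seconds=SECONDS_PER_DAY):
    """Hours needed to reprocess one day of feed at a given processing rate."""
    tweets = feed_rate * feed_seconds
    return tweets / process_rate / 3600


if __name__ == "__main__":
    best = rebuild_hours(feed_rate=4000, process_rate=8000)
    worst = rebuild_hours(feed_rate=5000, process_rate=6000)
    print(f"rebuilding one day takes between {best:.0f} and {worst:.0f} hours")
```

That is between 12 and 20 hours per day of data, so any speed-up on the insert side translates almost linearly into rebuild time.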
We have narrowed down the bottleneck of our application to the MySQL InnoDB inserts. In our application, we read the feed from disk and parse it with jackson (which happens at about 30,000 tweets per second). The application then processes the tweets in batches. We partition the authors of these tweets into 8 groups (simple partitioning by user id modulo 8). A table is allocated for each group, and 1 thread is allocated to write the data to that table. Every day roughly 26 million unique users generate these tweets, so each table has roughly 4 million rows.
For a group of users, we use a single transaction for the read and the update. The group size is tunable at runtime. We have tried various sizes from 8 ~ 64000 and have determined 256 to be a good batch size.
The schema of our table is
CREATE TABLE `2012_07_12_g0` (
`userid` bigint(20) NOT NULL,
`tweetId` longblob,
PRIMARY KEY (`userid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
where tweetId is the compressed list of tweet ids long integers, compressed with Google snappy
Each thread uses
SELECT userid, tweetId FROM <tablename> WHERE userid IN (....)
to resolve the userids and read back the data, and the threads use
INSERT INTO <tablename> (userid,tweetId) VALUES (...) ON DUPLICATE KEY UPDATE tweetId=VALUES(tweetId)
to update the rows with new tweetids.
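For readers who want to try the same pattern, here is a minimal Python sketch of one batched write: a single multi-row INSERT ... ON DUPLICATE KEY UPDATE per batch of 256 users. The application described above is written in Java, so this is an illustration of the batching idea, not the poster's code. The helper names are made up, and the %s placeholders assume a Python MySQL driver that uses that parameter style.

```python
def batches(items, size=256):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_upsert(table, batch_size):
    """Build one multi-row upsert statement with batch_size (userid, tweetId) pairs.

    The table name is interpolated directly, so it must come from trusted code.
    """
    placeholders = ", ".join(["(%s, %s)"] * batch_size)
    return (
        f"INSERT INTO `{table}` (userid, tweetId) VALUES {placeholders} "
        "ON DUPLICATE KEY UPDATE tweetId = VALUES(tweetId)"
    )


def flatten(rows):
    """Turn [(userid, blob), ...] into the flat parameter list the statement expects."""
    return [value for row in rows for value in row]
```

Each batch would then be executed with something like cursor.execute(build_upsert(table, len(batch)), flatten(batch)) inside one transaction, which keeps the number of round trips and commits per user low.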
We have tried setting various XtraDB parameters
innodb_log_buffer_size = 4M
innodb_flush_log_at_trx_commit = 2
innodb_max_dirty_pages_pct = 80
innodb_flush_method = O_DIRECT
innodb_doublewrite = 0
innodb_use_purge_thread = 1
innodb_thread_concurrency = 32
innodb_write_io_threads = 8
innodb_read_io_threads = 8
#innodb_io_capacity = 20000
#innodb_adaptive_flushing = 1
#innodb_flush_neighbor_pages = 0
The table size for each day is roughly 8G for all tables, and InnoDB is given 24GB to work with.
We are using:
6-disk (crucial m4 SSD, 512 GB, 000F firmware) software RAID5.
Mysql innodb data, table space on the SSD partition
ext4 mount with noatime,nodiratime,commit=60
centos 6.2
sun jdk 1.6.30
Any tips for making our insert go faster would be greatly appreciated, thanks.
InnoDB is given 24GB
Do you mean that this is the innodb_buffer_pool_size? You didn't say how much memory you have or which CPUs you are using. If it is, then you should probably be using a larger innodb_log_buffer_size. What is your setting for innodb_log_file_size? It should probably be in the region of 96MB.
innodb_write_io_threads = 8
ISTR that ext3 has some concurrency problems with multiple writers - but I don't know about ext4
Have you tried changing innodb_flush_method?
Which I/O scheduler are you using (in the absence of a smart disk controller, usually deadline is fastest, sometimes CFQ)?
Switching off the ext4 barriers will help with throughput, but it's a bit more risky, so make sure you've got checksums enabled in JBD2. Similarly, setting innodb_flush_log_at_trx_commit=0 should give a significant increase, but it is also more risky.
Since you're obviously not bothered about maintaining your data in a relational format, then you might consider using a noSQL database.
My initial suggestions would be:
As you don't have a RAID card with memory, you may want to comment out the innodb_flush_method = O_DIRECT line to let the system cache writes.
As you have disabled the doublewrite buffer, you could also set innodb_flush_log_at_trx_commit to 0, which would be faster than 2.
Set innodb_log_buffer_size to cover at least one second of writes (approx. 12MB for 30K tweets).
If you use binary logs, make sure you have sync_binlog = 0.
On the hardware side, I would strongly suggest trying a RAID card with at least 256MB of RAM and a battery backup unit (BBU) to improve write speed. There are RAID cards on the market that support SSDs.
Hope this helps. Please let me know how it goes.