Issues with Wikipedia dump table pagelinks - MySQL

I downloaded the enwiki-latest-pagelinks.sql.gz dump from dumps.wikimedia.org/enwiki/latest/.
I unpacked the file; its uncompressed size is 37G.
The table structure is this:
SHOW CREATE TABLE wp_dump.pagelinks;
CREATE TABLE `pagelinks` (
`pl_from` int(8) unsigned NOT NULL DEFAULT '0',
`pl_namespace` int(11) NOT NULL DEFAULT '0',
`pl_title` varbinary(255) NOT NULL DEFAULT '',
`pl_from_namespace` int(11) NOT NULL DEFAULT '0',
UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary
I imported the table into a new, empty database:
mysql -D wp_dump -u root -p < enwiki-latest-pagelinks.sql
The computer I am running the task on has 16G of RAM and the MySQL database is located on an SSD, so I assumed that, despite the table's size, the import would not take too long.
However, the import has now been running for over a day and is still not finished. There are no other processes accessing MySQL and there is no other workload on the computer.
The database file itself is now 79G.
ls -lh
-rw-r----- 1 mysql mysql 65 May 11 17:40 db.opt
-rw-r----- 1 mysql mysql 8,6K May 12 07:06 pagelinks.frm
-rw-r----- 1 mysql mysql 79G May 13 16:59 pagelinks.ibd
The table now has over 500 million rows.
SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'wp_dump';
+------------+------------+
| table_name | table_rows |
+------------+------------+
| pagelinks | 520919860 |
+------------+------------+
I am wondering:
Is the pagelinks table created from enwiki-latest-pagelinks.sql really over 79G large?
Does pagelinks really contain over 500 million rows?
Does it really take that long to import the pagelinks table?
Can you provide some figures for the expected table size and row count, please?
Update, May 14th, 2017:
The insert is still running; the pagelinks.ibd file is now 130G; the number of rows is now almost 700 million.
Update, May 16th, 2017:
The insert is still running; the pagelinks.ibd file is now 204G; the number of rows is now over 1.2 billion.
I calculated the rows inserted per second over the last two days:
rows/sec = 3236
Also, each INSERT statement in the SQL script inserts many thousands of rows (head -41 enwiki-latest-pagelinks.sql | tail -1 | grep -o "(" | wc -l gives 30471).
So, my follow-up / modified questions:
Is the number of rows and the ibd file size to be expected, given the SQL file size of 37G and the table structure (as listed above)?
Is rows/sec = 3236 a good value (i.e., does importing the table simply take a few days)?
What may be the limiting factor, and how can I speed up the import? For example (a sketch of what I mean follows after these questions):
Disable the indexes (and rebuild them after the insert)?
Optimize transactions (explicit COMMITs, nothing is set in the script, versus autocommit, which is currently ON)?
Optimize variable settings (e.g. innodb_buffer_pool_size, currently 134217728)?
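A minimal sketch of the session-level changes I mean, following the MySQL manual's bulk-loading advice for InnoDB (whether this actually helps at this scale is part of what I am asking):
-- in the mysql command-line client; sketch only
SET autocommit = 0;          -- one large transaction instead of a commit per statement
SET unique_checks = 0;       -- skip secondary unique checks during the load
SET foreign_key_checks = 0;  -- no foreign keys here, but cheap to switch off
SOURCE enwiki-latest-pagelinks.sql;
COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;
-- Note: ALTER TABLE ... DISABLE KEYS only affects MyISAM; for InnoDB, "disabling
-- the indexes" would mean dropping the secondary indexes and re-adding them afterwards.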

#Sim Betren: I am currently importing the same table and can get about 7700 rows/s, which means about 600,000,000 rows a day. Probably the most important thing is to get the right InnoDB settings:
https://dba.stackexchange.com/questions/83125/mysql-any-way-to-import-a-huge-32-gb-sql-dump-faster
innodb_buffer_pool_size = 4G
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
innodb_write_io_threads = 16
innodb_flush_log_at_trx_commit = 0
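A quick sanity check, assuming the values above were put under [mysqld] in my.cnf and the server was restarted, is to confirm MySQL actually picked them up:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';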
Those settings work well. From what I read and tried, InnoDB loves generous memory settings. Ideally, one would use a 16GB or even 32GB machine and increase these settings even more. But I got 7700 rows/s on a modest setup that is almost 10 years old:
Intel Q6700 quad
8 Gb DDR2 memory
I combined that 10-year-old hardware with a 2017-model 500GB SSD, which is dedicated to the job and handles both the reading and the writing. The reason for using the older hardware is that the SSD is the most important part of the setup (because of IOPS). Plus, by using older hardware I saved a bit of money. However, the hardware is limited to 8GB of DDR2. A newer dedicated machine with 32GB or 64GB of internal memory could really fly, I reckon.
Software setup:
Linux Mint 64bit
MySQL Server 5.7.18 for Ubuntu
MySQL Workbench for importing
I also tried this on Windows 10 and the speed is almost the same on both. So you could try Windows too.
Note: I did try changing the engine to MyISAM. MyISAM can be pretty fast too, around 8000 rows/sec or more, but the import always became corrupted for some reason. So I would stick to InnoDB.
Update 17-06-2017:
Finished the import. The table "pagelinks" is about 214GB with 1200 million rows; about 112GB is raw data and 102GB is indexes. The original uncompressed file was about 37GB.
It took about 2 days and 6 hours to import, an average speed of 5350 rows/second. With high-end equipment (lots of memory, preferably 64GB or more) and optimal settings, it can probably be done faster. But I let it run on a dedicated machine 24/7 and I wasn't in a hurry, so 2 days seems OK.
Update 18-06-2017:
Also imported "page.sql" because this contains the names connected to the ID's. The uncompressed file is about 5Gb, import took 1 hour. Which seems quick: the pagelink file is about 37Gb which is 7x bigger than "page.sql". Yet takes 50x longer to import. So, there are a few reasons why "pagelinks" takes so long: (A) probably because it doesn't fit in memory (B) The table structure, many data per insert (C) Settings. But most likely it's memory.
Conclusion: try get a PC with 32Gb or 64Gb internal memory. Maybe even more. And use SSD's that can keep up with that memory, 500Gb or more. The SSD is more important than memory so try that first.

#Sim Betren:
I want to post a completely new answer, since I have discovered a new solution. Splitting up the file is probably the best answer. As discussed in the other answer, InnoDB works best when the entire model fits in memory; the delays start when it needs to swap to disk. The pagelinks file is 37GB, and that is simply too big for most machines to fit easily into memory. Maybe a $1000+ dedicated machine with endless memory can do it, but most desktops can't. So here is what you can do:
The plan is to split the file. The first thing to do is to separate the SQL structure from the data.
There are probably better ways to do that, but a program I found is:
SqlDumpSplitter2
That dump splitter program may be old, but it worked on pagelinks. It is Windows-only, though. I simply told it to split the unpacked 37GB file into 37 chunks of 1GB and it dutifully did. I checked the data and it seems to be working. You could also use 74 chunks of 500MB.
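On Linux, a rough equivalent of the splitting step (a sketch only, untested on the full dump; it relies on the MediaWiki dumps keeping each extended INSERT on a single line, and the file names are just examples):
# everything before the first INSERT is the structure (CREATE TABLE etc.); GNU sed
sed '/^INSERT INTO/Q' enwiki-latest-pagelinks.sql > pagelinks_structure.sql
# the INSERT lines, cut into ~1GB pieces without breaking a line (GNU split)
grep '^INSERT INTO' enwiki-latest-pagelinks.sql | split -C 1G - pagelinks_part_
# load the structure once, then the pieces one by one
mysql -D wp_dump -u root -p < pagelinks_structure.sql
for f in pagelinks_part_*; do mysql -D wp_dump -u root -p < "$f"; done
(A ~/.my.cnf with a [client] section avoids the password prompt for every piece.)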
The import of each file takes maybe 10 to 20 minutes per 1GB.
Total time: about 1 to 2 hours for splitting the 37GB file and about 6 to 12 hours for importing. That easily beats the previous answer I gave.
When importing, use the same big-data settings as in the previous answer, and try to find a machine with plenty of memory, 16GB or 32GB preferred.
What matters most here is that it doesn't really matter much how you split the file. Just split it any way you can, then rebuild the table by re-creating the structure and loading the data separately. This way the import could be down from 2 days to maybe as little as a few hours. Given a big dedicated machine, it can probably be done in just 1 to 6 hours.

37GB of data --> 79GB of InnoDB table seems reasonable...
Title: 2 quotes and 1 comma in the dump --> 1 byte for the length.
Ints: several bytes in the dump, plus a comma --> 4 bytes for INT (regardless of the (...) after INT); see MEDIUMINT.
20-30 bytes overhead per row
20-40% overhead for BTrees.
UNIQUE index becomes PRIMARY KEY and clusters with data --> very little overhead.
Other two indexes: Each is virtually the same size as the data. This more than accounts for the increased size.
Adding it all together, I would expect the table to be more than 120GB. So, there are probably some details missing. A guess: The dump is one row per INSERT, instead of the less verbose many-rows-per-INSERT.
As for performance, it all depends on the SELECTs. Set innodb_buffer_pool_size to somewhere around 11G. That may be enough to cache the 79G table efficiently.
More
Change the UNIQUE KEY to a PRIMARY KEY, for clarity and because InnoDB really needs an explicit PK.
Check the source data. Is it in (pl_from, pl_namespace, pl_title) order? If not, can you sort the file before loading? If you can, that alone should significantly help the speed.
128MB for the buffer_pool is also significantly hampering progress.
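On MySQL 5.7 the buffer pool can even be resized without a restart; on older versions, set it under [mysqld] in my.cnf and restart. A sketch with the ~11G suggested above:
SET GLOBAL innodb_buffer_pool_size = 11 * 1024 * 1024 * 1024;  -- ~11G
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';                 -- confirm it took effect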

Related

MySQL RAM requirement for 22 billion records SELECT query

I have a table which is expected to have 22 billion records per year. How much RAM will be required if each record is around 4 KB of data?
The table is expected to need around 8 TB of storage.
[update]
There are no join queries involved; I just need the SELECT queries to be executed efficiently.
I have found that there is no general rule of thumb for how much RAM you need for a given number of records in MySQL.
The first factor to look at is the design of the database itself. This is one of the most impactful factors of all: if your database is poorly designed, throwing RAM at it isn't going to fix the problem.
Another factor is how the data is going to be accessed. For example, if a specific row is being accessed by 100 people via SELECT * FROM table WHERE column = value, you could get away with a tiny amount of RAM, as you would just use query caching.
It MAY (not always) be a good idea to keep your entire database in RAM so it can be read more quickly (depending on the total size of the database). For example, if your database is 100GB in size, then 128GB of RAM should be sufficient to cover it plus overheads such as the OS and other factors.
On one system I support, about 224GB of Oracle CDR records are loaded daily for a network operator. On another system, about 20 lakh (2 million) rows are retrieved daily from a SQL database.
You can use 128 GB if you are using one server; if you are using a load balancer, you can use 62 GB on every PC.

Heavy writing to InnoDB

We are constructing, for every day, mappings from tweet user id to the list of tweet ids of tweets made by that user. The storage engine we are using is Percona xtraDB "5.1.63-rel13.4 Percona Server (GPL), 13.4, Revision 443"
We are unsatisfied with the maximum throughput in terms of row inserts per second. Our maximum throughput for processing tweets with XtraDB is around 6000~8000 tweets per second (for example, if we had to rebuild the data for some day from scratch, we would have to wait almost a day).
For the most part we are able to do this realtime enough with the full amount of twitter data (which is roughly 4000 ~ 5000 tweets per second).
We have narrowed down the bottleneck of our application to the MySQL InnoDB insert. In our application, we read the feed from disk and parse it with Jackson (which happens at about 30,000 tweets per second). Our application then proceeds in batches of tweets. For the set of authors that generate these tweets, we partition them into 8 groups (simple partitioning by user id modulo 8). A table is allocated for each group and 1 thread is allocated to write the data to that table. Every day there are roughly 26 million unique users that generate these tweets, and therefore each table has roughly 4 million rows.
For a group of users, we use only one transaction for the read and update. The group size is runtime-tunable. We have tried various sizes from 8 to 64000, and we have determined 256 to be a good batch size.
The schema of our table is:
CREATE TABLE `2012_07_12_g0` (
  `userid` bigint(20) NOT NULL,
  `tweetId` longblob,
  PRIMARY KEY (`userid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
where tweetId is the list of tweet ids (long integers), compressed with Google Snappy.
Each thread uses
Select userid,tweetId from <tablename> where userid IN (....)
to read back the existing data for those userids, and the threads use
INSERT INTO <tablename> (userid,tweetId) VALUES (...) ON DUPLICATE KEY UPDATE tweetId=VALUES(tweetId)
to update the rows with new tweetids.
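Put together, each batch described above looks roughly like this (table name from the schema; the ids and hex blobs are placeholders, and the merging of new tweet ids into the Snappy-compressed blob happens in the application between the two statements):
START TRANSACTION;
SELECT userid, tweetId FROM `2012_07_12_g0`
WHERE userid IN (101, 102, 103 /* ... up to 256 ids */);
-- application decompresses, appends the new tweet ids, re-compresses ...
INSERT INTO `2012_07_12_g0` (userid, tweetId)
VALUES (101, x'AABB'), (102, x'CCDD') /* ... */
ON DUPLICATE KEY UPDATE tweetId = VALUES(tweetId);
COMMIT;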
We have tried setting various XtraDB parameters
innodb_log_buffer_size = 4M
innodb_flush_log_at_trx_commit = 2
innodb_max_dirty_pages_pct = 80
innodb_flush_method = O_DIRECT
innodb_doublewrite = 0
innodb_use_purge_thread = 1
innodb_thread_concurrency = 32
innodb_write_io_threads = 8
innodb_read_io_threads = 8
#innodb_io_capacity = 20000
#innodb_adaptive_flushing = 1
#innodb_flush_neighbor_pages = 0
The table size for each day is roughly 8G for all tables, and InnoDB is given 24GB to work with.
We are using:
6-disk (crucial m4 SSD, 512 GB, 000F firmware) software RAID5.
Mysql innodb data, table space on the SSD partition
ext4 mount with noatime,nodiratime,commit=60
centos 6.2
sun jdk 1.6.30
Any tips for making our insert go faster would be greatly appreciated, thanks.
InnoDB is given 24GB
Do you mean this is the innodb_buffer_pool_size? (You didn't say how much memory you have, nor what CPUs you are using.) If so, then you should probably also be using a larger innodb_log_buffer_size. What is your setting for innodb_log_file_size? It should probably be in the region of 96MB.
innodb_write_io_threads = 8
ISTR that ext3 has some concurrency problems with multiple writers - but I don't know about ext4
Have you tried changing innodb_flush_method?
Which I/O scheduler are you using (in the absence of a smart disk controller, usually deadline is fastest, sometimes CFQ)?
Switching off the ext4 barriers will help with throughput; it's a bit more risky, so make sure you've got checksums enabled in JBD2. Similarly, setting innodb_flush_log_at_trx_commit=0 should give a significant increase, but is riskier.
Since you're obviously not bothered about maintaining your data in a relational format, you might consider using a NoSQL database.
My initial suggestions would be:
As you don't have a RAID card with write-back cache memory, you may want to comment out the innodb_flush_method = O_DIRECT line to let the system cache writes.
As you disabled the doublewrite buffer, you could also set innodb_flush_log_at_trx_commit to 0, which would be faster than 2.
Set innodb_log_buffer_size to cover at least one second of writes (approximately 12MB for 30K tweets).
In case you use binary logs, make sure you have sync_binlog = 0 (see the note below on which of these are dynamic).
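Of those, only some can be applied without a restart; a rough sketch of what is dynamic versus what needs my.cnf:
SET GLOBAL innodb_flush_log_at_trx_commit = 0;  -- dynamic
SET GLOBAL sync_binlog = 0;                     -- dynamic (only matters if binary logging is on)
-- innodb_flush_method and innodb_log_buffer_size are not dynamic:
-- change them in my.cnf under [mysqld] and restart the server.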
On the hardware side, I would strongly suggest trying a RAID card with at least 256MB of RAM and a battery backup unit (BBU) to improve write speed. There are RAID cards on the market that support SSDs.
Hope this helps. Please let me know how it goes.

MySQL Huge table select performance

I currently have a table with 10 million rows and need to increase the performance drastically.
I have thought about dividing this one table into 20 smaller tables of 500k rows each, but I could not get an increase in performance.
I have created 4 indexes on 4 columns and converted all the columns to INTs, and I have another column that is a bit.
My basic query is select primary from from mytable where column1 = int and bitcolumn = b'1'. This is still very slow; is there anything I can do to increase the performance?
Server Spec
32GB of memory, 2TB of storage, the standard ini file, and an AMD Phenom II X6 1090T processor.
In addition to giving the MySQL server more memory to play with, remove unnecessary indexes and make sure you have an index on column1 (in your case). Add a LIMIT clause to the SQL if possible.
Download this (on your server):
MySQLTuner.pl
Install it, run it and see what it says; even better, paste the output here.
There is not enough information to reliably diagnose the issue, but you state that you're using "the default" my.cnf / my.ini file on a system with 32G of memory.
From the MySQL Documentation the following pre-configured files are shipped:
Small: System has <64MB memory, and MySQL is not used often.
Medium: System has at least 64MB memory
Large: System has at least 512MB memory and the server will run mainly MySQL.
Huge: System has at least 1GB memory and the server will run mainly MySQL.
Heavy: System has at least 4GB memory and the server will run mainly MySQL.
Best case, you're using a configuration file that utilizes 1/8th of the memory on your system (that would be the "Heavy" file, which as far as I recall is not the default one; I think the default is Medium or perhaps Large).
I suggest editing your my.cnf file appropriately.
There are several areas of MySQL for which the memory allocation can be tweaked to maximize performance for your particular case. You can post your my.cnf / my.ini file here for more specific advice. You can also use MySQL Tuner to get some automated advice.
I did something that made a big difference in the query time, but it may not be useful for all cases, just in mine.
I have a huge table (about 2,350,000 records), but I know in advance the range of ids I need to work with, so I added the condition WHERE id > '2300000'. As I said, this is my case, but it may help others.
So the full query is:
SELECT primary from mytable where id > '2300000' AND column1 = int AND bitcolumn = b'1'
The query time was 2~3 seconds and now it is less than 0.01.
First of all, your query
select primary from from mytable where column1 = int and bitcolumn = b'1'
has some errors, like two FROM clauses. Second, splitting the table and adding unnecessary indexes never helps performance. Some tips to follow:
1) Use a composite index if you repeatedly query some columns together. But take precautions, because in a composite index the order in which the columns are placed matters a lot (an example follows after these tips).
2) The primary key is more helpful if it is on an INT column.
3) Read some articles on indexes and optimization; there are many, search on Google.
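For point 1, a sketch using the columns from the question (the index name is made up; with both conditions being equality tests, the more selective column, column1, goes first, since the bit column has only two values):
ALTER TABLE mytable ADD INDEX idx_column1_bitcolumn (column1, bitcolumn);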

MySQL Cluster is much slower than InnoDB

I have a denormalized table product with about 6 million rows (~ 2GB) mainly for lookups. Fields include price, color, unitprice, weight, ...
I have BTREE indexes on color etc. Query conditions are dynamically generated from the Web, such as
select count(*)
from product
where color = 1 and price > 5 and price < 100 and weight > 30 ... etc
and
select *
from product
where color = 2 and price > 35 and unitprice < 110
order by weight
limit 25;
I used to use InnoDB and tried MEMORY tables, and switched to NDB hoping more concurrent queries could be handled faster. I have 2 tables with the same schema, indexes, and data: one is InnoDB while the other is NDB. But the results are very disappointing: for the queries mentioned above, InnoDB is about 50 times faster than NDB, roughly 0.8 seconds vs 40 seconds. For this test I was running only a single SELECT query repeatedly. Both the InnoDB and NDB queries are using the same index on color.
I am using mysql-5.1.47 ndb-7.1.5 on a dual Xeon 5506 (8 cores total) with 32GB memory, running CentOS 5. I set up 2 NDB data nodes, one MGM node and one MySQL node on the same box. For each node I allocated about 9GB of memory, and also tried MaxNoOfExecutionThreads=8, LockPagesInMainMemory, LockExecuteThreadToCPU and many other config parameters, but no luck. While NDB is running the query, my peak CPU load is only about 200%, i.e., only 2 out of 8 cores are busy; most of the time it is around 100%. I was using ndbmtd, and verified in the data node log that the LQH threads were indeed spawned.
I also tried EXPLAIN and profiling; they just show that "Sending data" consumes most of the time. I also went through some MySQL Cluster tuning documents available online, but they were not very helpful in my case.
Can anybody shed some light on this? Is there a better way to tune an NDB database? I appreciate it!
You need to pick the right storage engine for your application.
MyISAM -- read frequently / write infrequently. Ideal for data lookups in big tables. Does reasonably well with complex indexes and is quite good for batch reloads.
MEMORY -- good for fast access to relatively small and simple tables.
InnoDB -- good for transaction processing. Also good for a mixed read / write workload.
NDB -- relatively less mature. Good for fault tolerance.
The MySQL server is not inherently multiprocessor software, so adding cores isn't necessarily going to jack up performance. A good host for MySQL is a decent two-core system with plenty of RAM and the fastest disk IO channels and disks you can afford. Do NOT put your MySQL data files on a networked or shared file system, unless you don't care about query performance.
If you're running on Linux, issue these two commands (on the machine running the MySQL server) to see whether you're burning all your CPU or all your disk IO:
sar -u 1 10
sar -d 1 10
Your application sounds like a candidate for MyISAM. It sounds like you have plenty of hardware. In that case you can build a master server and an automatically replicated slave server, but you may be fine with just one server, which will be easier to maintain.
Edit: It's eight years later and this answer is now basically obsolete.

MySQL index creation is slow (on EC2)

I have a fairly simple table:
CREATE TABLE requestparams (
  requestid varchar(64) NOT NULL,
  requestString text
) ENGINE=MyISAM;
After populating the table with "LOAD DATA", I am changing the schema and making "requestid" the primary key.
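The schema change is presumably something like this (the exact statement is not shown in the post):
ALTER TABLE requestparams ADD PRIMARY KEY (requestid);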
There are 11 million rows in this table and the data size is less than 2GB (the size of the MYD file). The index file size is around 600M at the end of the process.
Creating the index takes less than 20 minutes on my laptop. But when I ran the process on Amazon's EC2 (medium instance), it took more than 12 hours. During the entire process the disk was extremely busy, with IO wait (as seen by top) between 40-100%. The CPU was mostly idle. I don't think the disks on EC2 are that slow.
On the MySQL mailing list, some had suggested changing the server variables myisam_sort_buffer_size and myisam_max_sort_file_size. I set them to 512MB and 4GB respectively, but index creation was equally slow. In fact, MySQL's memory usage rarely went beyond 40M.
How do I fix this?
Solution: Increasing "key_buffer_size" helped. I set this value to 1GB and the process completed in 4 minutes. Make sure you verify the new settings with the mysqladmin variables command.
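Roughly what that looks like (key_buffer_size can be set at runtime; put it in my.cnf under [mysqld] to make it permanent):
SET GLOBAL key_buffer_size = 1024 * 1024 * 1024;   -- 1GB, as above
SHOW VARIABLES LIKE 'key_buffer_size';             -- or: mysqladmin variables | grep key_buffer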
The suggestions you received are correct. I suspect the changes you made didn't actually take effect. Try making the changes in my.cnf and restart the server.