MySQL, Two billion rows of data, read only, performance optimisations?

I have a set of integer data. The first value is 0 and the last is 47055833459. There are two billion of these numbers from the first to the last, and they will never change or be added to. The only insert into the MySQL table will be the initial load of this data. From then on, it will only be read from.
I predict the size of the database table to be roughly 20 GB. I plan on having two columns:
id, data
id will be the primary key, an auto-incremented unsigned INT, and data will be an unsigned BIGINT.
What will be the best way of optimising this data for read only with those two columns? I have looked at the other questions which are similar but they all take into account write speeds and ever increasing tables. The host I am using does not support MySQL partitioning so unfortunately this is not an option at the moment. If it turns out that partitioning is the only way, then I will reconsider a new host.
The table will only ever be accessed by the id column so there does not need to be an index on the data column.
So to summarise, what is the best way of handling a table with 2 billion rows with two columns, without partitioning, optimised for reads, in MySQL?

Assuming you are using InnoDB, you should simply:
CREATE TABLE T (
ID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
DATA BIGINT UNSIGNED
);
This will effectively create one big B-Tree and nothing else, and retrieving a row by ID can be done in a single index seek [1]. Take a look at "Understanding InnoDB clustered indexes" for more info.
[1] Without any table heap access; in fact, there is no heap at all.
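As a small illustration (the literal id below is arbitrary), a lookup by ID then only has to descend that one B-Tree:
SELECT DATA FROM T WHERE ID = 1234567;
-- EXPLAIN should report type=const, key=PRIMARY and a single examined row for this kind of query.
EXPLAIN SELECT DATA FROM T WHERE ID = 1234567;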

Define your table like so.
CREATE TABLE `lkup` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`data` BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (`id`, `data`)
)
The compound primary key will consume disk space, but will make lookups very fast; your queries can be satisfied just by reading the index (which is known as a covering index).
And, do OPTIMIZE TABLE lkup when you're finished loading your static data into it. That may take a while, but it will pay off big at runtime.
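A hedged sketch of that load-then-optimize sequence, assuming the data arrives as a one-value-per-line file (the file path is hypothetical):
-- Bulk-load the static data; id is filled in by AUTO_INCREMENT.
LOAD DATA INFILE '/tmp/data.txt' INTO TABLE lkup (data);
-- One-off rebuild after the load; slow, but it pays off at read time.
OPTIMIZE TABLE lkup;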

Related

Choosing a string primary key vs using a join table with corresponding integer id

I have a large sensor data count table, say SENSORS_COUNT, with a string SID referring to another table, SENSOR_DEFINITIONS, which has the same primary key SID. As there are millions of data points, the index on the string primary key becomes 1) bloated and 2) slow. The total number of sensors is pretty small (< 2000).
I can think of 3 different ways of making the queries faster:
Using a join table which translates the string key into a corresponding integer key, and referring to that with joins in all queries
Loading the string/integer translation as a hash in program memory and referring to that within the code
Using an index on the string primary id (which would be slower than an integer, though)
I'm trying to build a system for a variety of sensors which may have different types of string ids (but the same schema). What would be the best way to go about it?
EDIT 1: This is the schema. And yes (thanks for the correction), in the SENSORS_COUNT table, SID is not a primary key.
TABLE: SENSOR_DEFINITIONS (2000 records)
SID : VARCHAR(20), PRIMARY KEY
SNAME: VARCHAR(50)
TABLE: SENSORS_COUNT (N million records)
SID: VARCHAR(20)
DATETIME: TIMESTAMP
VALUE: INTEGER
For "large" tables, normalization becomes more important. Especially when the table is too big to be cached.
So, I agree with the choice of using a SMALLINT UNSIGNED (2 bytes, 0..64K) for the 2000 sensor names, not a VARCHAR(...).
Without seeing (1) the SHOW CREATE TABLE and (2) some critical SELECTs, it is hard to give further advice.
Probably, a "composite" PRIMARY KEY would be better than an AUTO_INCREMENT. It might be (sensor_id, datetime), but it would help to see the selects first.
Do not have two tables with the same schema (without a good reason).
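To make the shape concrete, a rough sketch under those assumptions (the sensor_id name and the (sensor_id, datetime) key are guesses pending the actual SELECTs):
CREATE TABLE SENSOR_DEFINITIONS (
sensor_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
SID VARCHAR(20) NOT NULL,
SNAME VARCHAR(50),
PRIMARY KEY (sensor_id),
UNIQUE KEY (SID)
) ENGINE=InnoDB;
CREATE TABLE SENSORS_COUNT (
sensor_id SMALLINT UNSIGNED NOT NULL,
`datetime` TIMESTAMP NOT NULL,
`value` INT NOT NULL,
PRIMARY KEY (sensor_id, `datetime`)  -- assumes at most one reading per sensor per timestamp
) ENGINE=InnoDB;
-- Typical query: resolve the string SID once, then work with the small integer key.
SELECT c.`datetime`, c.`value`
FROM SENSORS_COUNT c
JOIN SENSOR_DEFINITIONS d ON d.sensor_id = c.sensor_id
WHERE d.SID = 'some-sensor'
AND c.`datetime` >= '2014-07-01';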

Creating an auxiliary table to improve performance on a large MySQL table?

I have a client who has asked me to tune his MySQL database in order to implement some new features and to improve the performance of an already existing web app.
The biggest table (~90 GB) has over 200M rows and is growing at periodic intervals (one row per visit to any of the websites he owns). With continuous INSERTs, each SELECT query performed from the backend page takes a while to complete, as the indexes are regenerated each time.
I've done a simulation on my own server, switching from BTREE indexes to HASH indexes. Neither SELECTs nor INSERTs are running any faster. The table uses MyISAM as the storage engine. There are only INSERTs and SELECTs, no UPDATEs or DELETEs.
I've come up with the idea of creating an auxiliary table, updated together with each INSERT, to speed up every SELECT query coming from the backend. I know this is bad practice, but I'm sure the performance will improve for the statistics page.
I'm not a database performance expert, as you may have noticed... Is there a better approach for this?
By the way, from phpMyAdmin I've seen that most indexes on the table have a cardinality of 0. In my simulation, this didn't happen. I'm not sure why this is happening.
Thanks a lot.
1st update: I've just learned that hash indexes aren't available for the MyISAM engine.
2nd update: OK. Here's the table schema.
CREATE TABLE `visits` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`datetime` int(8) NOT NULL,
`webmaster_id` char(18) NOT NULL,
`country` char(2) NOT NULL,
`connection` varchar(15) NOT NULL,
`device` varchar(15) NOT NULL,
`provider` varchar(100) NOT NULL,
`ip_address` varchar(15) NOT NULL,
`url` varchar(300) NOT NULL,
`user_agent` varchar(300) NOT NULL,
PRIMARY KEY (`id`),
KEY `datetime` (`datetime`),
KEY `webmaster_id` (`webmaster_id`),
KEY `country` (`country`),
KEY `connection` (`connection`),
KEY `device` (`device`),
KEY `provider` (`provider`)
) ENGINE=InnoDB;
So, instead of performing queries like select count(*) from visits where datetime=20140715 and device="ios", wouldn't it be better to fetch this from select count from visits_stats where datetime=20140715 and device="ios"?
INSERTs are, as said, much more frequent than SELECTs, but my client wants to improve the performance of the backend used to retrieve aggregated data. Using my approach, each visit would imply one INSERT and one INSERT/UPDATE (or REPLACE) which would increment one or more counters (I haven't decided the schema for the visits_stats table yet, the above query was just an example).
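For illustration, one possible shape for visits_stats and the per-visit counter bump might be (names here are placeholders, since the schema is not settled):
CREATE TABLE visits_stats (
`datetime` int(8) NOT NULL,
`device` varchar(15) NOT NULL,
`visit_count` int unsigned NOT NULL DEFAULT 0,
PRIMARY KEY (`datetime`, `device`)
) ENGINE=InnoDB;
-- Executed alongside each INSERT into visits: create or increment the counter.
INSERT INTO visits_stats (`datetime`, `device`, `visit_count`)
VALUES (20140715, 'ios', 1)
ON DUPLICATE KEY UPDATE `visit_count` = `visit_count` + 1;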
Apart from this, I've decided to replace some of the fields with their appropriate IDs from a foreign table. So far, data is stored in strings like connection=cable, device=android, and so on. I'm not sure how this would affect performance.
Thanks again.
Edit: I said before not to use partitions, but Bill is right that the approach he described would work. Your only concern would be if you tried to select across all 101 partitions; then the whole thing would come to a standstill. If you don't intend to do this, then partitioning would solve the problem. Fix your indexes first, though.
Your primary problem is that MyISAM is not the best engine; neither is InnoDB. TokuDB would be your best bet, but you'd have to install it on the server.
Now, you need to prune your indexes. This is the major reason for the slowness. Remove every index that isn't part of common SELECT statements, and add a multi-column index covering exactly what is requested in the WHERE of your SELECT statements.
So (in addition to your primary key) you want a single multi-column index on (datetime, device), according to your posted SELECT statement.
If you change to TokuDB the inserts will be much faster, if you stick with MyISAM then you could speed the whole thing up by using INSERT DELAYED instead of INSERT. The only issue with this is that the inserts will not be live, but will be added whenever MySQL decides there is not too much load.
Alternatively, if the above still does not help, your final option would be to use two tables: one table that you SELECT from, and another that you INSERT into. Once a day or so you would then copy the insert table to the select table. Though this means the data in your select table could be up to 24 hours old.
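A rough sketch of that daily copy, assuming hypothetical visits_write (insert side) and visits_read (select side) tables with the same structure:
-- Record the read side's high-water mark, then append only the newer rows.
SELECT COALESCE(MAX(id), 0) INTO @last_copied FROM visits_read;
INSERT INTO visits_read
SELECT * FROM visits_write WHERE id > @last_copied;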
Other than that you would have to completely change the table structure, for which I can't tell you how to do because it depends on what you are using it for exactly, or use something other than MySQL for this. However, my above optimizations should work.
I would suggest looking into partitioning. You have to add datetime to the primary key to make that work, because of a limitation of MySQL: every primary or unique key must include the column by which you partition the table.
Also make the index on datetime into a compound index on (datetime, device). This will be a covering index for the query you showed, so the query can get its answer from the index alone, without having to touch table rows.
CREATE TABLE `visits` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`datetime` int(8) NOT NULL,
`webmaster_id` char(18) NOT NULL,
`country` char(2) NOT NULL,
`connection` varchar(15) NOT NULL,
`device` varchar(15) NOT NULL,
`provider` varchar(100) NOT NULL,
`ip_address` varchar(15) NOT NULL,
`url` varchar(300) NOT NULL,
`user_agent` varchar(300) NOT NULL,
PRIMARY KEY (`id`, `datetime`), -- compound primary key is necessary in this case
KEY `datetime` (`datetime`,`device`), -- compound index for the SELECT
KEY `webmaster_id` (`webmaster_id`),
KEY `country` (`country`),
KEY `connection` (`connection`),
KEY `device` (`device`),
KEY `provider` (`provider`)
) ENGINE=InnoDB
PARTITION BY HASH(datetime) PARTITIONS 101;
So when you query for select count(*) from visits where datetime=20140715 and device='ios', your query is only scanning one partition, with about 1% of the rows in the table. Then within that partition, it narrows down even further using the index.
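If you want to verify the pruning, EXPLAIN PARTITIONS (the pre-5.7 syntax; later versions show the partitions column with plain EXPLAIN) should list only a single partition for that query:
EXPLAIN PARTITIONS
SELECT COUNT(*) FROM visits WHERE `datetime` = 20140715 AND device = 'ios';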
Inserts should also improve, because they are updating much smaller indexes.
I use a prime number when doing hash partitioning, to help the partitions remain more evenly filled in case the dates inserted follow a regular pattern.
Converting a 90GB table to partitioning is going to take a long time. You can use pt-online-schema-change to avoid blocking your application.
You can even make more partitions if you want, in theory up to 1024 in MySQL 5.5 and 8192 in MySQL 5.6. Although with thousands of partitions, you may run into different bottlenecks, like the number of open files.
P.S.: HASH indexes are not supported by either MyISAM or InnoDB. HASH indexes are only supported by the MEMORY and NDB storage engines.
You are facing what is nowadays called a Big Data querying / Big Data handling problem. There are many solutions available for handling big data; unfortunately, none of them is easy to implement. You generally need a team to structure Big Data to fit your needs. Some of the solutions are outlined below.
1. Big Table
Google uses this technique to create very wide tables with thousands of columns (to minimize the number of records vertically). You have to analyze your data, partition it on the basis of similarity, and then tag those groups with appropriate names. Queries then have to be analyzed first by some algorithm to decide which column space has to be queried. Not simple.
2. Distribute the database across multiple machines
The Hadoop file system is an open-source Apache project created specifically for storing and querying big data. In the early days space was the issue and systems were only capable of processing small data; now space is not an issue, and even small organizations have terabytes of data stored locally. But those terabytes of data cannot be processed in one go on one machine; even a giant machine can take days to run an aggregate operation. That is why Hadoop exists.
If you are an individual, you are definitely in trouble: you will need resources for this painful task. But you can use the essence of these techniques without employing these technologies.
You are free to give these techniques a try; just study articles about handling big data. Plain relational database queries are not going to work in your case.

A simple INSERT query on InnoDB taking too much time

I have this simple query:
INSERT IGNORE INTO beststat (bestid,period,rawView) VALUES ( 4510724 , 201205 , 1 )
On the table:
CREATE TABLE `beststat` (
`bestid` int(11) unsigned NOT NULL,
`period` mediumint(8) unsigned NOT NULL,
`view` mediumint(8) unsigned NOT NULL DEFAULT '0',
`rawView` mediumint(8) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`bestid`,`period`)
) ENGINE=InnoDB AUTO_INCREMENT=2020577 DEFAULT CHARSET=utf8
And it takes 1 sec to complete.
Side note: it doesn't always take 1 sec. Sometimes it finishes in as little as 0.05 sec, but often it takes 1 sec.
This table (beststat) currently has ~500,000 records and its size is 40MB. I have 4GB RAM and innodb_buffer_pool_size = 104,857,600, with MySQL 5.1.49-3.
This is the only InnoDB table in my database (others are MyISAM)
ANALYZE TABLE beststat shows: OK
Maybe there is something wrong with InnoDB settings?
I ran some simulations about 3 years ago as part of an evaluation project for a customer. They had a requirement to be able to search a table to which data is constantly being added, and they wanted the results to be up to date to within a minute.
InnoDB showed much better results in the beginning, but deteriorated quickly (well before 1 million records), until I removed all indexes (including the primary key). At that point InnoDB became superior to MyISAM for inserts/updates. (I had much worse hardware than you, executing tests only on my laptop.)
Conclusion: inserts will always suffer if you have indexes, especially unique ones.
I would suggest the following optimizations:
Remove all indexes from your beststat table and use it as a simple dump.
If you really need these unique indexes, consider some programmatic solution (like remembering the max bestid at all times, insisting that the new record is above that number, and immediately increasing it). But do you really need so many unique fields? They all sound to me like plain indexes.
Have a background thread move new records from InnoDB to another table (which can be MyISAM) where they would be indexed.
Consider dropping indexes temporarily and re-indexing the table after the bulk update (sketched below), possibly switching between two tables so that querying is never interrupted.
These are theoretical solutions, I admit, but it's the best I can offer given your question.
Oh, and if your table is planned to grow to many millions, consider a NoSQL solution.
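A hedged sketch of the drop-then-reindex idea from the list above, using the key from the posted beststat table:
-- Before the bulk load: drop the unique structure so inserts become plain appends.
ALTER TABLE beststat DROP PRIMARY KEY;
-- ... run the bulk inserts here ...
-- After the load: rebuild the key in one pass (this is also where any duplicates would surface).
ALTER TABLE beststat ADD PRIMARY KEY (bestid, period);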
So you have two unique indexes on the table. Your primary key is an autonumber. Since it is not really part of the data (you add it as you insert), it is what you would call an artificial primary key. You also have a unique index on bestid and period. If bestid and period are supposed to be unique, they would be a good candidate for the primary key.
InnoDB stores the table either as a tree or as a heap. If you don't define a primary key on an InnoDB table it behaves like a heap; if you define a primary key, the table is stored as a tree on disk. So in your case the tree is stored on disk based on the autonumber key. When you create the second index, it actually creates a second tree on disk with the bestid and period values; that index does not contain the other columns of the table, only bestid, period, and your primary key value.
OK, so now you insert the data. The first thing MySQL does is ensure the unique index stays unique, so it reads the index to see whether you are trying to insert a duplicate value. This is where the slowdown comes into play: it first has to check uniqueness, then, if the row passes the test, write the data, and then also insert the bestid, period, and primary key value into the unique index. The total operation is: 1 read of the index to check the value, 1 insert of the row into the table, 1 insert of bestid and period into the index, for three operations in all. If you removed the autonumber and used only the unique index as the primary key, it would read the table to check uniqueness and then insert into the table: two operations instead of three. So you do 33% less work by removing the redundant autonumber.
I hope this is clear as I am typing from my Android and autocorrect keeps on changing innodb to inborn. Wish I was at a computer.
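To make the two shapes concrete, a hedged sketch (the first table is hypothetical; the posted schema already matches the second):
-- Shape the answer warns about: an artificial autonumber plus a separate unique key,
-- i.e. two B-Trees to maintain on every insert.
CREATE TABLE beststat_with_autonumber (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`bestid` int(11) unsigned NOT NULL,
`period` mediumint(8) unsigned NOT NULL,
`rawView` mediumint(8) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `bestid_period` (`bestid`,`period`)
) ENGINE=InnoDB;
-- Shape the answer recommends: the natural key is the clustered primary key,
-- so the uniqueness check and the row insert touch the same B-Tree.
CREATE TABLE beststat_natural_key (
`bestid` int(11) unsigned NOT NULL,
`period` mediumint(8) unsigned NOT NULL,
`rawView` mediumint(8) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`bestid`,`period`)
) ENGINE=InnoDB;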

MySQL table with a two-column unique key

I have a table, called tablen.
It has this structure:
CREATE TABLE `tablen` (
`a` int(11) unsigned not null,
`b` int(11) unsigned not null,
unique key(a,b)
)
This table has one use: I hit it with a row of data. If it is a new unique row not already in the table, it gets added; otherwise I get back an error code.
The main thing I guess is speed. I don't have the facility to stress test the setup at the moment and so...
What would you say is the best format for this table?
InnoDB or MyISAM?
If you have a lot of inserts and updates, go for InnoDB, because it has row-level locking. MyISAM has table-level locking, which means the whole table gets locked when a record is inserted (queuing all other inserts). If you have far more selects than inserts/updates, then use MyISAM, which is usually faster there (if you also don't care about foreign keys).
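For the insert-if-new pattern in the question, a short sketch of the two usual ways to use that unique key (the literal values are arbitrary):
-- Option 1: plain INSERT; a duplicate (a, b) pair fails with error 1062 (ER_DUP_ENTRY) for the application to catch.
INSERT INTO tablen (a, b) VALUES (1, 2);
-- Option 2: INSERT IGNORE; duplicates are silently skipped and the client checks the affected-row count.
INSERT IGNORE INTO tablen (a, b) VALUES (1, 2);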

Table with 80 million records and adding an index takes more than 18 hours (or forever)! Now what?

A short recap of what happened. I am working with 71 million records (not much compared to the billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my needs. My table structure is:
CREATE TABLE `IPAddresses` (
`id` int(11) unsigned NOT NULL auto_increment,
`ipaddress` bigint(20) unsigned default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM;
And I added the 71 million records and then did a:
ALTER TABLE IPAddresses ADD INDEX(ipaddress);
It's been 14 hours and the operation is still not completed. Upon Googling, I found that there is a well-known approach for solving this problem - Partitioning. I understand that I need to partition my table now based on the ipaddress but can I do this without recreating the entire table? I mean, through an ALTER statement? If yes, there was one requirement saying that the column to be partitioned on should be a primary key. I will be using the id of this ipaddress in constructing a different table so ipaddress is not my primary key. How do I partition my table given this scenario?
OK, it turns out that this problem was more than just a simple create-a-table, index-it-and-forget problem :) Here's what I did, just in case someone else faces the same problem (I have used the example of IP addresses, but it works for other data types too):
Problem: Your table has millions of entries and you need to add an index really fast
Use case: Consider storing millions of IP addresses in a lookup table. Adding the IP addresses is not a big problem, but creating an index on them takes more than 14 hours.
Solution: Partition your table using MySQL's Partitioning strategy
Case #1: When the table you want is not yet created
CREATE TABLE IPADDRESSES(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ipaddress BIGINT UNSIGNED,
PRIMARY KEY(id, ipaddress)
) ENGINE=MYISAM
PARTITION BY HASH(ipaddress)
PARTITIONS 20;
Case #2: When the table you want is already created.
There seems to be a way to use ALTER TABLE to do this but I have not yet figured out a proper solution for this. Instead, there is a slightly inefficient solution:
CREATE TABLE IPADDRESSES_TEMP(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ipaddress BIGINT UNSIGNED,
PRIMARY KEY(id)
) ENGINE=MYISAM;
Insert your IP addresses into this table. And then create the actual table with partitions:
CREATE TABLE IPADDRESSES(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ipaddress BIGINT UNSIGNED,
PRIMARY KEY(id, ipaddress)
) ENGINE=MYISAM
PARTITION BY HASH(ipaddress)
PARTITIONS 20;
And then finally
INSERT INTO IPADDRESSES(ipaddress) SELECT ipaddress FROM IPADDRESSES_TEMP;
DROP TABLE IPADDRESSES_TEMP;
ALTER TABLE IPADDRESSES ADD INDEX(ipaddress);
And there you go... indexing on the new table took me about 2 hours on a 3.2GHz machine with 1GB RAM :) Hope this helps.
Creating indexes with MySQL is slow, but not that slow. With 71 million records, it should take a couple of minutes, not 14 hours. Possible problems are:
you have not configured sort buffer sizes and other configuration options
look here: http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_myisam_sort_buffer_size
If you try to generate a 1GB index with an 8MB sort buffer it's going to take lots of passes. But if the buffer is larger than your CPU cache it will get slower. So you have to test and see what works best (a session-level sketch follows at the end of this answer).
someone has a lock on the table
your IO system sucks
your server is swapping
etc
As usual, check iostat, vmstat, logs, etc. Issue a LOCK TABLE on your table to check whether someone has a lock on it.
FYI on my 64-bit desktop creating an index on 10M random BIGINTs takes 17s...
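Following up on the sort buffer point above, a hedged sketch of raising it just for the session that runs the ALTER (the 256M figure is only an example and assumes a MyISAM table):
-- Give the index build a larger in-memory sort area for this session only; tune to available RAM.
SET SESSION myisam_sort_buffer_size = 256 * 1024 * 1024;
ALTER TABLE IPAddresses ADD INDEX (ipaddress);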
I had a problem where I wanted to speed up my query by adding an index. The table only had about 300,000 records, but it also took way too long. When I checked the MySQL server processes, it turned out that the query I was trying to optimize was still running in the background. Four times! After I killed those queries, indexing was done in a jiffy. Perhaps the same problem applies to your situation.
You are using MyISAM, which is being deprecated soon. An alternative would be InnoDB.
"InnoDB is a transaction-safe (ACID compliant) storage engine for MySQL that has commit, rollback, and crash-recovery capabilities to protect user data. InnoDB row-level locking (without escalation to coarser granularity locks) and Oracle-style consistent nonlocking reads increase multi-user concurrency and performance. InnoDB stores user data in clustered indexes to reduce I/O for common queries based on primary keys. To maintain data integrity, InnoDB also supports FOREIGN KEY referential-integrity constraints. You can freely mix InnoDB tables with tables from other MySQL storage engines, even within the same statement."
http://dev.mysql.com/doc/refman/5.0/en/innodb.html
According to http://dev.mysql.com/tech-resources/articles/storage-engine/part_1.html, you should be able to switch between different engines with a simple ALTER command, which gives you some flexibility. It also states that each table in your DB can be configured independently.
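A hedged sketch of that engine switch on the table from the question; expect it to rebuild the whole table, so it will take a while at this size:
-- Rebuilds the table and all of its indexes under the new storage engine.
ALTER TABLE IPAddresses ENGINE = InnoDB;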
You have already inserted 71 million records into your table. Now, if you want to create partitions on the primary key column of your table, you can use the ALTER TABLE option. An example is given for your reference.
CREATE TABLE t1 (
id INT,
year_col INT
);
ALTER TABLE t1
PARTITION BY HASH(id)
PARTITIONS 8;