How to index monotonically increasing data in a table? - mysql

I have a table with a monotonically increasing field that I want to put into an index. However, the best practices guide says to not put monotonically increasing data into a non-interleaved index. When I try putting the data into an interleaved index, I can't interleave an index in its parent table.
In other words, I want the Cloud Spanner equivalent of this MySQL schema.
CREATE TABLE `my_table` (
`id` bigint(20) unsigned NOT NULL,
`monotonically_increasing` int(10) unsigned DEFAULT '0',
PRIMARY KEY (`id`),
KEY `index_name` (`monotonically_increasing`)
)

It really depends on the rate at which you'll be writing monotonically increasing/decreasing values.
Small write loads
I don't know the exact range of writes per second a Spanner server can handle before you'll hotspot (and it depends on your data), but if you are writing < 500 rows per second you should be okay with this pattern. It's only an issue if your write load is higher than a single Spanner server can comfortably handle by itself.
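For reference, a minimal hedged sketch of the straightforward Cloud Spanner equivalent of the MySQL schema in the question, i.e. a plain, non-interleaved secondary index (names taken from the question):

CREATE TABLE my_table (
  id INT64 NOT NULL,
  monotonically_increasing INT64
) PRIMARY KEY (id);

CREATE INDEX index_name ON my_table (monotonically_increasing);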
Large write loads
If your write rate is higher, or relatively unbounded (e.g. it scales up with your system's or site's popularity), then you'll need to look at alternatives. Which alternative is best really depends on your exact use case and which trade-offs you're willing to take.
One generic approach is to manually shard the index. Let's say for example you know your peak write load will be 1740 inserts per second. Using the approx 500 writes per server number from before, we would be able to avoid hotspotting if we could shard this load over 4 Spanner servers (435 writes/second each).
The INT64 type in Cloud Spanner allows for a maximum value of 9,223,372,036,854,775,807. One example way to shard is by adding random(0,3) * 1,000,000,000,000,000,000 to each value. This splits the index key range into 4 ranges that can be served by 4 Spanner servers. The downside is that you'll need to do 4 queries and merge the results on the client side after masking out the x,000,000,000,000,000,000 offset.
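As a hedged sketch of the read side (column names from the question's schema, offsets are the illustrative values above), you would run one query per shard range and merge client-side, e.g. for shard 2:

-- shard 2 covers values with the 2,000,000,000,000,000,000 offset
SELECT id, monotonically_increasing - 2000000000000000000 AS real_value
FROM my_table
WHERE monotonically_increasing BETWEEN 2000000000000000000 AND 2999999999999999999
ORDER BY monotonically_increasing DESC
LIMIT 10;

Repeat with the offsets for shards 0, 1 and 3, then merge and re-sort the four result sets in the application.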
Note: Interleaving is when data/indexes from one table are interleaved with data from another table. You cannot interleave with only one table.

Related

Storing a 100k by 100k array in MySQL

I need to store a massive, fixed size square array in MySQL. The values of the array are just INTs but they need to be accessed and modified fairly quickly.
So here is what I am thinking:
Just use 1 column for the primary key and translate the 2D array's indexes into single-dimensional indexes.
So if the 2d array is n by n => 2dArray[i][j] = 1dArray[n*(i-1)+j]
This translates the problem into storing a massive 1D array in the database.
Then use another column for the values.
Make every entry in the array a row.
However, I'm not very familiar with the internal workings of MySQL.
100k*100k makes 10 billion data points, which is more than what 32 bits can get you so I can't use INT as a primary key. And researching stackoverflow, some people have experienced performance issues with using BIGINT as primary key.
In this case where I'm only storing INTs, would the performance of MySQL drop as the number of rows increases?
Or if I were to scatter the data over multiple tables on the same server, could that improve performance? Right now, it looks like I won't have access to multiple machines, so I can't really cluster the data.
I'm completely flexible about every idea I've listed above and open to suggestions (except not using MySQL because I've kind of committed to that!)
As for your concern that BIGINT or adding more rows decreases performance, of course that's true. You will have 10 billion rows, that's going to require a big table and a lot of RAM. It will take some attention to the queries you need to run against this dataset to decide on the best storage method.
Instead, I would recommend using two columns for the primary key. Developers often overlook the possibility of a compound primary key.
Then you can use INT for both primary key columns if you want to.
CREATE TABLE MyTable (
array_index1 INT NOT NULL,
array_index2 INT NOT NULL,
datum WHATEVER_TYPE NOT NULL,
PRIMARY KEY (array_index1, array_index2)
);
Note that a compound index like this means that if you search on the second column without an equality condition on the first column, the search won't use the index. So you need a secondary index if you want to support that.
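For example, a secondary index on the second column alone (index name is illustrative) could be added like this:

ALTER TABLE MyTable ADD INDEX idx_array_index2 (array_index2);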
100,000 columns is not supported by MySQL. MySQL has limits of 4096 columns and of 65,535 bytes per row (not counting BLOB/TEXT columns).
Storing the data in multiple tables is possible, but will probably make your queries terribly awkward.
You could also look into using table PARTITIONING, but this is not as useful as it sounds.

Selection of Primary Key for distributed databases

I am implementing an application in which there will be one Oracle 11g database and multiple other MySQL databases. All databases will be synchronized with each other at least every 30 minutes. Initially I thought of using a GUID/UUID as the primary key, but then I came across its cons in InnoDB and got a little worried.
I just want my primary key to be unique with good performance, which means I am certainly looking at indexing. Please suggest what I should use as my primary key. It is pertinent to mention that my MySQL database will be running on a modest Intel Core i3 and I expect to have a million records on it, whereas Oracle will run on a server, which is not an issue.
UUID/GUID has the problem of being "random". This leads to difficulty in caching data. The "next" UUID could be anywhere in the table/index. If the entire data (or index) is not small enough to fit in cache, then it will probably incur a disk hit.
If you need to generate ids in multiple servers, perhaps the best way is to have a two-part id. The first part is a small number representing the source of the id, and the second part is some form of sequence.
That could be implemented either as two fields: PRIMARY KEY (machine, seq) or as the combination of the values in a single number. Example: Machine 1 has ids starting with 1000000000; machine 2 has ids starting with 2000000000; etc. (You would, of course, have to carefully design the numbers to avoid running out of space for either part.)
INSERTs would be hitting one "hot spot" per machine. If the SELECTs tend to fetch "recent" rows, then they would also be hitting hot spots, not the entire table.
In MySQL, the compound PK could be:
seq ... AUTO_INCREMENT,
machine TINYINT UNSIGNED NOT NULL,
PRIMARY KEY(machine, seq),
INDEX(seq)
Yes, that is sufficient to make the auto_increment work.
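Fleshed out, a minimal sketch of such a table might look like this (the table name and payload column are made up for illustration):

CREATE TABLE distributed_ids (
  seq     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  machine TINYINT UNSIGNED NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (machine, seq),
  INDEX (seq)  -- keeps the AUTO_INCREMENT column as the first column of an index, as InnoDB requires
) ENGINE=InnoDB;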
In MySQL, the single-column PK would require some form of sequence simulation.

Keeping video viewing statistics breakdown by video time in a database

I need to keep a number of statistics about the videos being watched, and one of them is what parts of the video are being watched most. The design I came up with is to split the video into 256 intervals and keep the floating-point number of views for each of them. I receive the data as a number of intervals the user watched continuously. The problem is how to store them. There are two solutions I see.
Row per every video segment
Let's have a database table like this:
CREATE TABLE `video_heatmap` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`video_id` int(11) NOT NULL,
`position` tinyint(3) unsigned NOT NULL,
`views` float NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_lookup` (`video_id`,`position`)
) ENGINE=MyISAM
Then, whenever we have to process a number of views, make sure the respective database rows exist and add appropriate values to the views column. I found out it's a lot faster if the existence of the rows is taken care of first (SELECT COUNT(*) of the rows for a given video and INSERT IGNORE any that are lacking), and then a number of update queries are used like this:
UPDATE video_heatmap
SET views = views + ?
WHERE video_id = ? AND position >= ? AND position < ?
This seems, however, a little bloated. The other solution I came up with is
Row per video, update in transactions
A table will look (sort of) like this:
CREATE TABLE video (
id INT NOT NULL AUTO_INCREMENT,
heatmap BINARY (4 * 256) NOT NULL,
...
) ENGINE=InnoDB
Then, upon every time a view needs to be stored, it will be done in a transaction with consistent snapshot, in a sequence like this:
If the video doesn't exist in the database, it is created.
The row is retrieved, and heatmap (an array of floats stored in binary form) is converted into a form more friendly for processing (in PHP).
Values in the array are increased appropriately and the array is converted back.
Row is changed via UPDATE query.
So far the advantages can be summed up like this:
First approach
Stores data as floats, not as some magical binary array.
Doesn't require transaction support, so doesn't require InnoDB, and we're using MyISAM for everything at the moment, so there won't be any need to mix storage engines. (only applies in my specific situation)
Doesn't require a transaction WITH CONSISTENT SNAPSHOT. I don't know what the performance penalties of those are.
I already implemented it and it works. (only applies in my specific situation)
Second approach
Uses a lot less storage space (the first approach stores the video ID 256 times and a position for every segment of the video, not to mention the primary key).
Should scale better, because of InnoDB's per-row locking as opposed to MyISAM's table locking.
Might generally work faster because there are a lot less requests being made.
Easier to implement in code (although the other one is already implemented).
So, what should I do? If it wasn't for the rest of our system using MyISAM consistently, I'd go with the second approach, but currently I'm leaning to the first one. But maybe there are some reasons to favour one approach or another?
Second approach looks tempting at first sight, but it makes queries like "how many views for segment x of video y" unable to use an index on video.heatmap. Not sure if this is a real-life concern for you though. Also, you would have to parse back and forth the entire array every time you need data for one segment only.
But first and foremost, your second solution is hackish (though interesting nonetheless). I wouldn't recommend denormalising your database until you face an actual performance issue.
Also, try populating the video_heatmap table in advance with views = 0 as soon as a video is inserted (a trigger can help).
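A hedged sketch of such a trigger, assuming a video table with an id primary key (names are illustrative):

DELIMITER //
CREATE TRIGGER video_heatmap_init
AFTER INSERT ON video
FOR EACH ROW
BEGIN
  DECLARE pos INT DEFAULT 0;
  -- pre-create the 256 heatmap rows for the new video with zero views
  WHILE pos < 256 DO
    INSERT IGNORE INTO video_heatmap (video_id, position, views)
    VALUES (NEW.id, pos, 0);
    SET pos = pos + 1;
  END WHILE;
END//
DELIMITER ;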
If space is really a concern, remove your surrogate key video_heatmap.id and instead make (video_id, position) the primary key (then get rid of the superfluous UNIQUE constraint). But this shouldn't come into the equation: 256 x 12 bytes per video (a rough row length with 3 numeric columns, plus some for the index) is only an extra 3 KB per video!
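A hedged sketch of that change, done in one ALTER so dropping the column also drops the old primary key with it:

ALTER TABLE video_heatmap
  DROP COLUMN id,
  DROP KEY idx_lookup,
  ADD PRIMARY KEY (video_id, position);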
Finally, nothing prevents you from switching your current table to InnoDB and leverage its row-level locking capability.
Please note I fail to understand why views cannot be an UNSIGNED INT. I would recommend changing this type.

How to handle a large amount of data in a specific table of a database

I am working on a project where I constantly insert rows into a table, and within a few days this table is going to be very big. That raised a question I can't find the answer to:
What is going to happen when I have more rows in that table than BIGINT allows, knowing that I have an id column (which is an INT)? Can my database (MySQL) handle that properly? How do big companies handle that kind of problem, and joins on big tables?
I don't know if there are short answers to that kind of problem, but any lead towards solving my question would be welcome!
You would run out of storage before you run out of BIGINT primary key sequence.
Unsigned BIGINT can represent a range of 0 to 18,446,744,073,709,551,615. Even if you had a table with a single column that held the primary key of BIGINT type (8 bytes), you would consume (18,446,744,073,709,551,615×8)÷1,024^4 = 134,217,728 terabytes of storage.
Also maximum size of tables in MySQL is 256 terabytes for MyISAM and 64 terabytes for InnoDB, so really you're limited to 256×1,024^4÷8 = 35 trillion rows.
Oracle supports NUMBER(38) (takes 20 bytes) as largest possible PK, 0 to 1e38. However having a 20 byte primary key is useless because maximum table size in Oracle is 4*32 = 128 terabytes (at 32K block size).
Regarding the numeric data types:
If the column is a primary key, you will not be able to insert more rows once it reaches the maximum of its type.
If it is not a primary key, the value is truncated to the maximum that can be represented in that data type.
You should change the id column to BIGINT as well if you need to perform joins on it.
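For example (table name is illustrative, assuming id is an auto-increment primary key):

ALTER TABLE my_table MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;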
You can use a UUID to replace the integer primary key (as some big companies do); take note that a UUID is a string, so your field will no longer be numeric.
That is one of the big problems of every website with LOTS of users. Think about Facebook, how many requests do they get every second? How many servers do they have to store all the data? If they have many servers, how do they separate the data across the servers? If they separate the data across the servers, how would they be able to call normal SQL queries on multiple servers and then join the results? And so on. Now to avoid complicating things for you by answering all these questions (which will most probably make you give up :-)), I would suggest using Google AppEngine. It is a bit difficult at the beginning, but once you get used to it you will appreciate the time you spent learning it.
If you only have a database and you don't have many requests, and your concern is just the storage, then you should consider moving to MSSQL or, better as far as I know, Oracle.
Hope that helps.
To put BIGINT even more into perspective: if you were inserting rows non-stop at 1 row per millisecond (1,000 rows per second), you would insert 31,536,000,000 rows per year.
With BIGINT topping out at 18,446,744,073,709,551,615, you would be good for roughly 585 million years at that rate.
You could make your bigint unsigned, giving you 18,446,744,073,709,551,615 available IDs
Big companies handle it by using DB2 or Oracle

Database table with 3.5 million entries - how can we improve performance?

We have a MySQL table with about 3.5 million IP entries.
The structure:
CREATE TABLE IF NOT EXISTS `geoip_blocks` (
`uid` int(11) NOT NULL auto_increment,
`pid` int(11) NOT NULL,
`startipnum` int(12) unsigned NOT NULL,
`endipnum` int(12) unsigned NOT NULL,
`locid` int(11) NOT NULL,
PRIMARY KEY (`uid`),
KEY `startipnum` (`startipnum`),
KEY `endipnum` (`endipnum`)
) TYPE=MyISAM AUTO_INCREMENT=3538967 ;
The problem: A query takes more than 3 seconds.
SELECT uid FROM `geoip_blocks` WHERE 1406658569 BETWEEN geoip_blocks.startipnum AND geoip_blocks.endipnum LIMIT 1
- about 3 seconds
SELECT uid FROM `geoip_blocks` WHERE startipnum < 1406658569 and endipnum > 1406658569 limit 1
- no gain, about 3 seconds
How can this be improved?
The solution to this is to grab a BTREE/ISAM library and use that (like BerkelyDB). Using ISAM this is a trivial task.
Using ISAM, you would set your start key to the number, do a "Find Next", (to find the block GREATER or equal to your number), and if it wasn't equal, you'd "findPrevious" and check that block. 3-4 disk hits, shazam, done in a blink.
Well, it's A solution.
The problem that is happening here is that SQL, without a "sufficiently smart optimizer", does horribly on this kind of query.
For example, your query:
SELECT uid FROM `geoip_blocks` WHERE startipnum < 1406658569 and endipnum > 1406658569 limit 1
It's going to "look at" ALL of the rows that are "less than" 1406658569. ALL of them, then it's going to scan them looking for ALL of the rows that match the 2nd criteria.
With a 3.5m row table, assuming "average" (i.e. it hits the middle), welcome to a 1.75m row scan. Even worse, an index scan. Ideally MySQL will "give up" and just table scan, as that's faster.
Clearly, this is not what you want.
#Andomar's solution is basically forcing you to "block" the data space, via the "network" criteria, effectively breaking your table into 256 pieces. So, instead of scanning 1.75m rows, you get to scan roughly 6,800 rows, a marked improvement at the cost of breaking your blocks up on network boundaries.
There is nothing wrong with range queries in SQL.
SELECT * FROM table WHERE id between X and Y
is a, typically, fast query, as the optimizer can readily delimit the range of rows using the index.
But that's not your query, because you are not doing a simple range on your main indexed column in this case (startipnum).
If you "know" that your block sizes are within a certain range (i.e. none of your blocks, EVER, have more than, say, 1000 ips), then you can block your query by adding "WHERE startipnum between {ipnum - 1000} and {ipnum + 1000}". That's not really different than the network blocking that was proposed, but here you don't have to maintain it as much. Of course, you can learn this with:
SELECT max(endipnum - startipnum) FROM table
to get an idea what your largest range is.
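Putting that together, and assuming the largest block turned out to span fewer than 65,536 addresses (an illustrative figure you would verify with the query above), the bounded query could look like:

SELECT uid
FROM geoip_blocks
WHERE startipnum BETWEEN 1406658569 - 65536 AND 1406658569
  AND endipnum >= 1406658569
LIMIT 1;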
Another option, which I've seen but never used, and which is actually, well, perfect for this, is to look at MySQL's Spatial Extensions, since that's what this really is.
This is designed more for GIS applications, but you ARE searching for something in ranges, and that's a lot of what GIS apps do. So, that may well be a solution for you as well.
Your startipnum and endipnum should be in a combined index. MySQL generally can't combine two separate single-column indexes effectively for a query like this.
I'm not sure about the syntax, but something like
KEY (startipnum, endipnum)
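Concretely, something along these lines (the index name is illustrative):

ALTER TABLE geoip_blocks ADD KEY idx_ip_range (startipnum, endipnum);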
It looks like you're trying to find the range that an IP address belongs to. The problem is that MySQL can't make the best use of an index for the BETWEEN operation. Indexes work better with an = operation.
One way you can add an = operation to your query is by adding the network part of the address to the table. For your example:
numeric 1406658569
ip address 83.215.232.9
class A with 8 bit network part
network part = 83
With an index on (networkpart, startipnum, endipnum, uid) a query like this will become very fast:
SELECT uid
FROM `geoip_blocks`
WHERE networkpart = 83
AND 1406658569 BETWEEN startipnum AND endipnum
In case one geoip block spans multiple network classes, you'd have to split it in one row per network class.
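A hedged sketch of the schema change this implies (column and index names are illustrative; the first octet is startipnum DIV 2^24):

ALTER TABLE geoip_blocks
  ADD COLUMN networkpart TINYINT UNSIGNED NOT NULL DEFAULT 0,
  ADD KEY idx_network (networkpart, startipnum, endipnum, uid);

-- backfill the network part for existing rows (16777216 = 2^24)
UPDATE geoip_blocks SET networkpart = startipnum DIV 16777216;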
Based on information from your question I am assuming that what you are doing is an implementation of the GeoIP® product from MaxMind®. I downloaded the free version of the GeoIP® data, loaded it into a MySQL database and did a couple of quick experiments.
With an index on startipnum the query execution time ranged from 0.15 to 0.25 seconds. Creating a composite index on startipnum and endipnum did not change the query performance. This leads me to believe that your performance issues are due to insufficient hardware, improper MySQL tuning, or both. The server I ran my tests on had 8G of RAM which is considerably more than would be needed to get this same performance as the index file was only 28M.
My recommendation is one of the two following options.
Spend some time tuning your MySQL server. The MySQL online documentation would be a reasonable starting point. http://dev.mysql.com/doc/refman/5.0/en/optimizing-the-server.html An internet search will turn up a large volume of books, forums, articles, etc. if the MySQL documentation is not sufficient.
If my assumption is correct that you are using the GeoIP® product, then a second option would be to use the binary file format provided by MaxMind®. The custom file format has been optimized for speed, memory usage, and database size. APIs to access the data are provided for a number of languages. http://www.maxmind.com/app/api
As an aside, the two queries you presented are not equivalent. The BETWEEN operator is inclusive; the second query would need to use <= and >= operators to be equivalent to the query that used BETWEEN.
Maybe you would like to have a look at partitioning the table. This feature has been available since MySQL 5.1; since you do not specify which version you are using, this might not work for you if you are stuck with an older one.
As the possible value range for IP addresses is known - at least for IPv4 - you could break down the table into multiple partitions of equal size (or maybe even not equal if your data is not evenly distributed). With that MySQL could very easily skip large parts of the table, speeding up the scan if it is still required.
See MySQL manual on partitioning for the available options and syntax.
Thanks for all your comments, I really appreciate it.
For now we ended up using a caching mechanism, and we have reduced those expensive queries.