I need to store a massive, fixed-size square array in MySQL. The values of the array are just INTs, but they need to be accessed and modified fairly quickly.
So here is what I am thinking:
Just use one column for the primary key and translate the 2D array's indexes into single-dimensional indexes.
So if the 2D array is n by n (with 1-based i and j) => 2dArray[i][j] = 1dArray[n*(i-1)+j]
This translates the problem into storing a massive 1D array in the database.
Then use another column for the values.
Make every entry in the array a row.
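A rough sketch of that layout (the table and column names are just placeholders): for an n-by-n array with 1-based i and j, the key is n*(i-1)+j, so with n = 100,000 the element [2][3] maps to key 100,003, and the key space needs more than 32 bits.
CREATE TABLE array_1d (
idx BIGINT UNSIGNED NOT NULL, -- n*(i-1)+j
val INT NOT NULL,
PRIMARY KEY (idx)
);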
However, I'm not very familiar with the internal workings of MySQL.
100k*100k makes 10 billion data points, which is more than 32 bits can get you, so I can't use INT as a primary key. And from researching Stack Overflow, it seems some people have experienced performance issues with using BIGINT as a primary key.
In this case where I'm only storing INTs, would the performance of MySQL drop as the number of rows increases?
Or if I were to scatter the data over multiple tables on the same server, could that improve performance? Right now, it looks like I won't have access to multiple machines, so I can't really cluster the data.
I'm completely flexible about every idea I've listed above and open to suggestions (except not using MySQL because I've kind of committed to that!)
As for your concern that BIGINT or adding more rows decreases performance: of course that's true. You will have 10 billion rows; that's going to require a big table and a lot of RAM. It will take some attention to the queries you need to run against this dataset to decide on the best storage method.
I would probably recommend using two columns for the primary key. Developers often overlook the possibility of a compound primary key.
Then you can use INT for both primary key columns if you want to.
CREATE TABLE MyTable (
array_index1 INT NOT NULL,
array_index2 INT NOT NULL,
datum WHATEVER_TYPE NOT NULL,
PRIMARY KEY (array_index1, array_index2)
);
Note that a compound index like this means that if you search on the second column without an equality condition on the first column, the search won't use the index. So you need a secondary index if you want to support that.
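For instance, a sketch of such a secondary index on the table above (the index name is just a placeholder):
CREATE INDEX idx_array_index2 ON MyTable (array_index2);
With that in place, lookups that filter only on array_index2 can use an index instead of scanning the whole table.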
A table with 100,000 columns is not supported by MySQL. MySQL has limits of 4096 columns per table and 65,535 bytes per row (not counting BLOB/TEXT columns).
Storing the data in multiple tables is possible, but will probably make your queries terribly awkward.
You could also look into using table PARTITIONING, but this is not as useful as it sounds.
What would be the performance penalty of using strings as primary keys instead of bigints etc.? String comparison is much more expensive than integer comparison, but on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.
An application that I work on uses strings as primary keys in several tables (MySQL). It is not trivial to change this, and I'd like to know what can be gained performance wise to justify the work.
on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.
The DB needs to maintain a B-Tree (or a similar structure) with the keys, in a way that keeps them ordered.
If the key were hashed and the hash stored in the B-Tree, that would be fine for rapidly checking the uniqueness of the key -- the key could still be looked up efficiently. But you would not be able to search efficiently for ranges of data (e.g. with LIKE), because the B-Tree would no longer be ordered according to the string value.
So I think most DBs really do store the string in the B-Tree, which can (1) take more space than numeric values and (2) require the B-Tree to be rebalanced if keys are inserted in arbitrary order (there is no notion of an ever-increasing value as with a numeric PK).
The penalty in practice can range from insignificant to huge. It all depends on the usage, the number of rows, the average size of the string key, the queries which join table, etc.
In our product we use varchar(32) for primary keys (GUIDs) and we haven't run into performance issues because of this. Our product is a web site under extreme load, and it is critical that it stays stable.
We use SQL Server 2005.
Edit: In our biggest tables we have more than 3,000,000 records, with lots of inserts and selects against them. I think that in general the benefit of migrating to an int key would be very low, but the problems during migration very high.
One thing to watch out for is page splits (I know this can happen in SQL Server - probably the same in MySQL).
Primary keys are physically ordered (by default they are the clustered index). By using an auto-increment integer you guarantee that each time you insert you are inserting the next number up, so there is no need for the db to reorder the keys. If you use strings, however, the PK you insert may need to be placed in the middle of the other keys to maintain the PK order. That process of reordering the PKs on insert (page splits) can get expensive.
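A minimal sketch of the usual workaround, assuming MySQL/InnoDB and made-up names: keep an auto-increment surrogate as the clustered primary key so inserts always append, and give the string its own unique secondary index.
CREATE TABLE orders (
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
order_code VARCHAR(32) NOT NULL, -- the string that would otherwise be the PK
PRIMARY KEY (id),
UNIQUE KEY uq_order_code (order_code)
) ENGINE=InnoDB;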
It depends on several factors: the RDBMS, the number of indexes involving those columns, and so on, but in general it will be more efficient to use ints, followed by bigints.
Any performance gains depend on usage, so without concrete examples of table schema and query workload it is hard to say.
Unless it makes sense in the domain (I'm thinking of something unique like a social security number), a surrogate integer key is a good choice; referring objects do not need to have their FK reference updated when the referenced object changes.
I have a table with a monotonically increasing field that I want to put into an index. However, the best practices guide says to not put monotonically increasing data into a non-interleaved index. When I try putting the data into an interleaved index, I can't interleave an index in its parent table.
In other words, I want the Cloud Spanner equivalent of this MySQL schema.
CREATE TABLE `my_table` (
`id` bigint(20) unsigned NOT NULL,
`monotonically_increasing` int(10) unsigned DEFAULT '0',
PRIMARY KEY (`id`),
KEY `index_name` (`monotonically_increasing`)
)
It really depends on the rate at which you'll be writing monotonically increasing/decreasing values.
Small write loads
I don't know the exact range of writes per second a Spanner server can handle before you'll hotspot (and it depends on your data), but if you are writing < 500 rows per second you should be okay with this pattern. It's only an issue if your write load is higher than a single Spanner server can comfortably handle by itself.
Large write loads
If your write rate is larger, or relatively unbounded (e.g. it scales with your system's/site's popularity), then you'll need to look at alternatives. These alternatives really depend on your exact use case to work out which trade-offs you're willing to take.
One generic approach is to manually shard the index. Let's say for example you know your peak write load will be 1740 inserts per second. Using the approx 500 writes per server number from before, we would be able to avoid hotspotting if we could shard this load over 4 Spanner servers (435 writes/second each).
The INT64 type in Cloud Spanner allows a maximum value of 9,223,372,036,854,775,807. One example way to shard is by adding random(0,3)*1,000,000,000,000,000,000 to each value. This will split the index key range into 4 ranges that can be served by 4 Spanner servers. The downside is you'll need to do 4 queries and merge the results on the client side after masking out x,000,000,000,000,000,000.
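To illustrate the read side of that sharding scheme, a sketch using the schema from the question (not tuned Spanner SQL, just the idea): writes add shard * 1,000,000,000,000,000,000 to the stored value for a randomly chosen shard 0-3, and reads issue one range query per shard, unmasking and merging client-side.
-- Example query for shard 2; repeat with offsets 0, 1e18 and 3e18 and merge the results.
SELECT id, monotonically_increasing - 2000000000000000000 AS original_value
FROM my_table
WHERE monotonically_increasing BETWEEN 2000000000000000000 AND 2999999999999999999;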
Note: Interleaving is when data/indexes from one table are interleaved with data from another table. You cannot interleave with only one table.
A developer of mine was making an application and came up with the following schema
purchase_order int(25)
sales_number int(12)
fulfillment_number int(12)
purchase_order is the index in this table. (There are other fields, but they are not relevant to this issue.) purchase_order is a concatenation of sales_number + fulfillment_number.
Instead I proposed an auto-incrementing id field.
The current format could be essentially 12-15 digits long and effectively random (though always unique, as sales_number + fulfillment_number would always be unique).
My question here is:
if I have 3 rows each with a random but unique ID, i.e. 983903004, 238839309, 288430274, vs. three rows with the IDs 1, 2, 3, is there a performance hit?
As an aside, my other argument (for those interested) was that the schema makes little sense on the grounds of data redundancy (you can easily do a SELECT CONCAT(sales_number, fulfillment_number)... rather than storing the two columns together in a third).
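To make that concrete, roughly what I proposed instead (just a sketch; the column types are guesses based on the sizes above):
CREATE TABLE purchase_orders (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
sales_number BIGINT UNSIGNED NOT NULL,
fulfillment_number BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY uq_sale_fulfillment (sales_number, fulfillment_number)
);
-- purchase_order can be derived on demand instead of being stored:
SELECT CONCAT(sales_number, fulfillment_number) AS purchase_order FROM purchase_orders;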
The problem as I see it is not bigint vs int (an auto-increment column can be bigint as well; there is nothing wrong with that) but the random value for the primary key. If you use the InnoDB engine, the primary key is at the same time a clustered key, which defines the physical order of the data. Inserting random values can potentially cause more page splits and, as a result, greater fragmentation, which in turn slows down not only insert/update queries but also selects.
Your argument about concatenating makes sense, but executing CONCAT also has its cost (unfortunately, MySQL doesn't support calculated persistent columns, so in some cases it's OK to store the result of the concatenation in a separate column).
AFAIK integers are stored and compared as integers so the comparisons should take the same length of time.
Concatenating two ints (32bit) into one bigint (64bit) may have a performance hit that is hardware dependent.
Having incremental IDs will put records that were created around the same time near each other on the disk. This might make some queries faster, if this is the primary key on InnoDB or for the index these IDs are used in.
Incremental records can sometimes be inserted a little bit quicker; test to see.
You'll need to make sure that the random ID is unique, so you'll need an extra lookup.
I don't know if these points are material for your application.
I am working on a project where I constantly insert rows into a table, and within a few days this table is going to be very big. I came up with a question I can't find the answer to:
what is going to happen when I have more rows in that table than 'bigint' can count, knowing that I have an 'id' column (which is an int)? Can my database (MySQL) handle that properly? How do big companies handle that kind of problem, and joins on big tables?
I don't know if there are short answers to this kind of problem, but any lead would be welcome!
You would run out of storage before you run out of BIGINT primary key sequence.
Unsigned BIGINT can represent a range of 0 to 18,446,744,073,709,551,615. Even if you had a table with a single column that held the primary key of BIGINT type (8 bytes), you would consume (18,446,744,073,709,551,615×8)÷1,024^4 = 134,217,728 terabytes of storage.
Also, the maximum table size in MySQL is 256 terabytes for MyISAM and 64 terabytes for InnoDB, so really you're limited to 256×1,024^4÷8 ≈ 35 trillion rows.
Oracle supports NUMBER(38) (which takes 20 bytes) as the largest possible PK, covering 0 to about 1e38. However, having a 20-byte primary key is pointless, because the maximum table size in Oracle is 4 billion blocks × 32 KB = 128 terabytes (at a 32K block size).
If this column is the primary key, you will not be able to insert more rows once it overflows.
If it is not a primary key, the value is truncated to the maximum value that data type can represent.
You should change the id column to bigint as well if you need to perform joins against it.
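If the id column is indeed heading for its limit, a sketch of the fix (assuming the table is called my_table and id is an AUTO_INCREMENT primary key; the statement rebuilds the table, so expect it to take a long time on a big one):
ALTER TABLE my_table MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
-- Any columns in other tables that join against id should be widened to BIGINT as well.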
You can use a UUID to replace the integer primary key (this is what some big companies do);
take note that a UUID is a string, so your field will no longer be numeric.
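A minimal sketch of that approach in MySQL (the table is hypothetical; UUID() returns a 36-character string):
CREATE TABLE events (
id CHAR(36) NOT NULL,
payload VARCHAR(255),
PRIMARY KEY (id)
);
INSERT INTO events (id, payload) VALUES (UUID(), 'example');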
That is one of the big problems of every website with LOTS of users. Think about Facebook, how many requests do they get every second? How many servers do they have to store all the data? If they have many servers, how do they separate the data across the servers? If they separate the data across the servers, how would they be able to call normal SQL queries on multiple servers and then join the results? And so on. Now to avoid complicating things for you by answering all these questions (which will most probably make you give up :-)), I would suggest using Google AppEngine. It is a bit difficult at the beginning, but once you get used to it you will appreciate the time you spent learning it.
If you only have a database, you don't have many requests, and your concern is just storage, then you should consider moving to MSSQL or - better, as far as I know - Oracle.
Hope that helps.
To put BIGINT even more into perspective, if you were inserting rows non-stop at 1 row per millisecond (1000 rows per second), you would insert 31,536,000,000 rows per year.
With BIGINT at 18,446,744,073,709,551,615 you would be good for roughly 585 million years.
You could make your bigint unsigned, giving you 18,446,744,073,709,551,615 available IDs
Big companies handle it by using DB2 or Oracle
I am in the process of designing a database for high-volume data, and I was wondering what datatype to use for the primary keys?
There will be table partitioning, and the database will ultimately be clustered and will have hot failover to alternative datacentres.
EDIT
Tables - think of a chat system for multiple time periods and multiple things to chat about, with multiple users chatting about each time period and thing.
Exponential growth is what I am thinking about - i.e. something could generate billions of rows in a small time period, before we could change the database or a DBA could do DBA things.
Mark - I share your concern about GUIDs - I don't like coding with GUIDs flying about.
With just the little bit of info you've provided, I would recommend using a BigInt, which would take you up to 9,223,372,036,854,775,807, a number you're not likely to ever exceed. (Don't start with an INT and think you can easily change it to a BigInt when you exceed 2 billion rows. It's possible (I've done it), but it can take an extremely long time and involve significant system disruption.)
Kimberly Tripp has an excellent series of blog articles (GUIDs as PRIMARY KEYs and/or the clustering key and The Clustered Index Debate Continues) on the issue of creating clustered indexes and choosing the primary key (related issues, but not always exactly the same). Her recommendation is that a clustered index/primary key should be:
Unique (otherwise useless as a key)
Narrow (the key is used in all non-clustered indexes, and in foreign-key relationships)
Static (you don't want to have to change all related records)
Always Increasing (so new records always get added to the end of the table, and don't have to be inserted in the middle)
If you use a BigInt as an increasing identity as your key and your clustered index, that should satisfy all four of these requirements.
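For example, a minimal sketch in SQL Server (the table and column names are made up):
CREATE TABLE ChatMessage (
MessageId BIGINT IDENTITY(1,1) NOT NULL,
Body NVARCHAR(4000) NOT NULL,
CONSTRAINT PK_ChatMessage PRIMARY KEY CLUSTERED (MessageId)
);
-- MessageId is unique, narrow (8 bytes), static, and always increasing.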
Edit: Kimberly's article I mentioned above (GUIDs as PRIMARY KEYs and/or the clustering key) talks about why a (client generated) GUID is a bad choice for a clustering key:
But, a GUID that is not sequential - like one that has its values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) - can be a horribly bad choice - primarily because of the fragmentation that it creates in the base table, but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity - which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 2^63 - 1 rows.
SQL Server has a function called NEWSEQUENTIALID() that allows you to generate sequential GUIDs that avoid the fragmentation issue, but they still have the problem of being unnecessarily wide.
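For reference, a sketch of how that is typically wired up (NEWSEQUENTIALID() can only appear in a DEFAULT constraint; the names here are placeholders):
CREATE TABLE Example (
Id UNIQUEIDENTIFIER NOT NULL CONSTRAINT DF_Example_Id DEFAULT NEWSEQUENTIALID(),
Payload NVARCHAR(100) NULL,
CONSTRAINT PK_Example PRIMARY KEY CLUSTERED (Id)
);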
You can always go for int, but taking into account your partitioning/clustering I'd suggest you look into uniqueidentifier, which will give you globally unique keys.
int tends to be the norm unless you need massive volume of data, and has the advantage of working with IDENTITY etc; Guid has some advantages if you want the numbers to be un-guessable or exportable, but if you use a Guid (unless you generate it yourself as "combed") you should ensure it is non-clustered (the index, that is; not the farm), as it won't be incremental.
I think that int will be very good for it.
The range of INTEGER is -2,147,483,648 to 2,147,483,647.
You can also use UniqueIdentifier (GUID), but in that case keep in mind:
the table row size limit in MSSQL;
storage + memory (imagine you have tables with 10,000,000 rows and growing);
flexibility: there are T-SQL operators available for INT like >, <, =, etc.;
and GUID is not optimized for ORDER BY/GROUP BY queries, or for range queries in general.