I have a database with the following structure.
How can I determine how much storage space a number of such entries will take, given that:
I'll run this on a MySQL server (InnoDB). The int columns will hold small values (1-30 at most), except one, which will hold a value between 1 and 400.
There will be 40 entries produced every day.
The MySQL manual has a section on data type storage requirements. Since you are using numbers and dates, which are stored at fixed length, it is easy to estimate the storage space: each INT column requires 4 bytes (even if you store the value 1 in it), and the DATE column requires 3 bytes.
You may reduce the storage requirements by using smaller integer types.
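To make the arithmetic concrete, here is a small sketch (the schema here is assumed for illustration -- five integer columns plus one DATE; adjust the counts to the real table):

```python
# Rough per-row size for fixed-length types. Assumed schema: five integer
# columns plus one DATE -- change these to match the actual table.
INT_BYTES = 4        # INT is always 4 bytes, whatever value it holds
SMALLINT_BYTES = 2   # SMALLINT (2 bytes) covers the 1-400 column comfortably
DATE_BYTES = 3

def row_size(n_int_cols, int_bytes=INT_BYTES):
    return n_int_cols * int_bytes + DATE_BYTES

rows_per_year = 365 * 40                 # 40 entries per day
print(row_size(5))                       # bytes/row with plain INT
print(row_size(5, SMALLINT_BYTES))       # bytes/row with SMALLINT
print(rows_per_year * row_size(5))       # data bytes/year, excluding page overhead
```

Note this counts only the column data; InnoDB adds per-row and per-page overhead on top.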
What you have described comes in the "tiny" category. After adding the 40 entries for the first day, you will have a few hundred bytes, occupying one 16KB InnoDB block.
After a year, you might have a megabyte -- still "tiny" as database tables go.
Switch to SMALLINT, add UNSIGNED if appropriate; that will cut it down noticeably. Follow the various links in the comments and other answers; they give you more insight.
Say I have a db table with 5 fields.
id bigint(20)
name varchar(255)
place varchar(255)
DOB date
about TEXT
The maximum row size in MySQL is 65,535 bytes, shared among all columns in the table except TEXT/BLOB columns.
I need a query which will give me the maximum row size in bytes for this table so that I can check if the max size exceeds 65535. The charset used is utf8mb4. I need something like this
(20*8 + 255*4 + 255*4 + 3) = 2203; the size of the about field is ignored since its type is TEXT.
It is much simpler -- Write and run the CREATE TABLE statement. It will spit at you if it is too large.
Meanwhile, here are some of the flaws in your attempt:
BIGINT is 8 bytes; the (20) is useless information.
Each VARCHAR needs 1 or 2 bytes for its length field.
TEXT takes some amount of space toward the 64K limit.
There is a bunch of "overhead" for each column and the row, so it is really impractical to try to compute the length.
Note also -- when the row gets too large, some of the VARCHARs may be treated like TEXT. That is, they won't necessarily count toward the 64K limit. However, they will leave a 20-byte pointer in their place.
If you are tempted to have a table with too many too-big columns, describe the table to us; we can suggest improvements, such as:
BIGINT is rarely needed; use a smaller int type.
Don't blindly use (255); it has several drawbacks. Analyze your data and pick a more realistic limit.
Be aware that there are four ROW_FORMATs. DYNAMIC is probably optimal for you. (Old versions of MySQL did not have DYNAMIC.)
Parallel tables (Two tables with the same PK)
Normalization (Replacing a common string with a smaller INT.)
Don't use a bunch of columns as an "array"; make another table. (Example: 3 phone number columns.)
Have you hit any INDEX limits yet? It smells like you might get bitten by that, too. Write up the CREATE TABLE, including tentative indexes. We'll chew it up and advise you.
We have to ingest and store 150 billion records in our MySQL InnoDB database. One field in particular, a VARCHAR, is taking up a lot of space. Its characteristics:
Can be NULL
Highly duplicated but we can't de-dupe because it increases ingestion time exponentially
Average length is about 75 characters
It has to have an index as it will have to join with another table
We don't need to store it in human readable format but we need to be able to match it to another table which would have to have the same format for this column
I've tried the following:
Compressing the table, this helps with space but dramatically increases ingestion time, so I'm not sure compression is going to work for us
Tried hashing to SHA2 which reduced the string length to 56, which gives us reasonable space saving but just not quite enough. Also I'm not sure SHA2 will generate unique values for this sort of data
Was thinking about MD5, which would further reduce the string length to probably the right level, but not sure again whether MD5 is strong enough to generate unique values to be able to match with another table
A hash function like MD5 produces a 128-bit hash in a string of 32 hex characters, but you can use UNHEX() to cut that in half to 16 binary characters, and store the result in a column of type BINARY(16). See my answer to What data type to use for hashed password field and what length?
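For illustration, the same UNHEX(MD5(...)) result can be produced client-side; this Python sketch shows the 16-byte value you would store in a BINARY(16) column:

```python
import hashlib

def md5_binary16(s: str) -> bytes:
    """Equivalent of MySQL's UNHEX(MD5(s)): 16 raw bytes for a BINARY(16) column."""
    return hashlib.md5(s.encode("utf-8")).digest()

h = md5_binary16("some 75-character router address...")
print(len(h))    # 16 bytes, versus 32 for the hex string MD5() returns
print(h.hex())   # the hex form, if you ever need to compare against MySQL's MD5()
```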
MD5 has 2^128 distinct hashes, or 340,282,366,920,938,463,463,374,607,431,768,211,456. The chances of two different strings resulting in a collision are reasonably low, even if you have 15 billion distinct inputs. See How many random elements before MD5 produces collisions? If you're still concerned, use SHA1 or SHA2.
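If you want to sanity-check those collision odds yourself, the standard birthday-bound approximation is a one-liner (a sketch; this is the usual n²/2^(b+1) estimate for n items in a b-bit hash):

```python
# Birthday-bound approximation: P(any collision) ~= n^2 / 2^(b+1)
# for n random inputs hashed into b bits.
def collision_probability(n, bits):
    return n * n / 2 ** (bits + 1)

p = collision_probability(15_000_000_000, 128)
print(p)   # vanishingly small for 15 billion distinct inputs into MD5
```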
I'm a bit puzzled by your attempts to use a hash function, though. You must not care what the original string is, since you must understand that hashing is not reversible. That is, you can't get the original string from a hash.
I like the answer from #Data Mechanics, that you should enumerate the unique string inputs in a lookup table, and use a BIGINT primary key (an INT has only 4+ billion values, so it isn't large enough for 15 billion rows).
I understand what you mean that you'd have to look up the strings to get the primary key. What you'll have to do is write your own program to do this data input. Your program will do the following:
Create an in-memory hash table to map strings to integer primary keys.
Read a line of your input
If the hash table does not yet have an entry for the input, insert that string into the lookup table and fetch the generated insert id. Store this as a new entry in your hash table, with the string as the key and the insert id as the value of that entry.
Otherwise the hash table does have an entry already, and just read the primary key bigint from the hash table.
Insert the bigint into your real data table, as a foreign key, along with other data you want to load.
Loop to step 2.
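The loop above could be sketched like this (FakeDB is a stand-in for the real INSERT statements and LAST_INSERT_ID(); replace it with actual database calls):

```python
class FakeDB:
    """Stand-in for the real database; replace these methods with real INSERTs."""
    def __init__(self):
        self.lookup, self.facts = {}, []
    def insert_lookup(self, s):
        self.lookup[s] = len(self.lookup) + 1   # simulates AUTO_INCREMENT id
        return self.lookup[s]
    def insert_fact(self, pk, rest):
        self.facts.append((pk, rest))

def load(lines, db):
    string_to_id = {}                              # step 1: in-memory hash table
    for line in lines:                             # step 2: read a line of input
        s, *rest = line.rstrip("\n").split("\t")
        if s not in string_to_id:                  # step 3: new string ->
            string_to_id[s] = db.insert_lookup(s)  #   insert, keep the insert id
        db.insert_fact(string_to_id[s], rest)      # steps 4-5: bigint FK + data
                                                   # step 6: loop

db = FakeDB()
load(["a\tx", "b\ty", "a\tz"], db)
print(db.facts)   # the duplicate "a" reuses id 1
```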
Unfortunately it would take over 1 TB of memory to hold a HashMap of 15 billion entries, even if you MD5 the string before using it as the key of your HashMap.
So I would recommend putting the full collection of mappings into a database table, and keeping a subset of it in memory. That means adding an extra step around step 3 above: if the in-memory HashMap doesn't have an entry for your string, first check the database. If it's in the database, load it into the HashMap. If it isn't, then insert it into the database and then into the HashMap.
You might be interested in using a class like LruHashMap. It's a HashMap with a maximum size (which you choose according to how much memory you can dedicate to it). If you put a new element when it's full, it kicks out the least recently referenced element. I found an implementation of this in Apache Lucene, but there are other implementations too. Just Google for it.
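If you don't want to hunt for a library, a minimal Python analogue of that LruHashMap idea (the Lucene one is Java) can be built on OrderedDict:

```python
from collections import OrderedDict

class LruMap:
    """Bounded map that evicts the least-recently-used entry when full --
    a small Python analogue of the LruHashMap idea, not Lucene's class."""
    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._d = OrderedDict()

    def get(self, key, default=None):
        if key in self._d:
            self._d.move_to_end(key)        # mark as recently used
            return self._d[key]
        return default

    def put(self, key, value):
        self._d[key] = value
        self._d.move_to_end(key)
        if len(self._d) > self.maxsize:
            self._d.popitem(last=False)     # evict least recently used

cache = LruMap(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")           # touch "a" so "b" becomes least recently used
cache.put("c", 3)        # evicts "b"
print(cache.get("b"))    # None
print(cache.get("a"))    # 1
```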
Is the varchar ordinary text? Such text is typically compressible about 3:1. Compressing just the one field may get it down to 25-30 bytes. Then use something like VARBINARY(99).
INT (4 bytes) is not big enough for normalizing 15 billion distinct values, so you need something bigger. BIGINT takes 8 bytes. BINARY(5) and DECIMAL(11,0) are 5 bytes each, but are messier to deal with.
But you are concerned by the normalization speed. I would be more concerned by the ingestion speed, especially if you need to index this column!
How long does it take to build the table? You haven't said what the schema is; I'll guess that you can put 100 rows in an InnoDB block. I'll say you are using SSDs and can get 10K IOPs. 1.5B blocks / 10K blocks/sec = 150K seconds = 2 days. This assumes no index other than an ordered PRIMARY KEY. (If it is not ordered, then you will be jumping around the table, and you will need a lot more IOPs; change the estimate to 6 months.)
The index on the column will effectively be a table 150 billion 'rows' -- it will take several terabytes just for the index BTree. You can either index the field as you insert the rows, or you can build the index later.
Building index as you insert, even with the benefit of InnoDB's "change buffer", will eventually slow down to not much faster than 1 disk hit per row inserted. Are you using SSDs? (Spinning drives are rated about 10ms/hit.) Let's say you can get 10K hits (inserts) per second. That works out to 15M seconds, which is 6 months.
Building the index after loading the entire table... This effectively builds a file with 150 billion lines, sorts it, then constructs the index in order. This may take a week, not months. But... It will require enough disk space for a second copy of the table (probably more) during the index-building.
So, maybe we can do the normalization in a similar way? But wait. You said the column was so big that you can't even get the table loaded? So we have to compress or normalize that column?
How will the load be done?
Multiple LOAD DATA calls (probably best)? Single-row INSERTs (change "2 days" to "2 weeks" at least)? Multi-row INSERTs (100-1000 is good)?
autocommit? Short transactions? One huge transaction (this is deadly)? (Recommend 1K-10K rows per COMMIT.)
Single threaded (perhaps cannot go fast enough)? Multi-threaded (other issues)?
My discussion of high-speed-ingestion.
Or will the table be MyISAM? The disk footprint will be significantly smaller. Most of my other comments still apply.
Back to MD5/SHA2. Building the normalization table, assuming it is much bigger than can be cached in RAM, will be a killer, too, no matter how you do it. But, let's get some of the other details ironed out first.
See also TokuDB (available with newer versions of MariaDB) for good high-speed ingestion and indexing. TokuDB will slow down some for your table size, whereas InnoDB/MyISAM will slow to a crawl, as I already explained. TokuDB also compresses automatically; some say by 10x. I don't have any speed or space estimates, but I see TokuDB as very promising.
Plan B
It seems that the real problem is in compressing or normalizing the 'router address'. To recap: Of the 150 billion rows, there are about 15 billion distinct values, plus a small percentage of NULLs. The strings average 75 bytes. Compressing may be ineffective because of the nature of the strings. So, let's focus on normalizing.
The id needs to be at least 5 bytes (to handle 15B distinct values); the string averages 75 bytes. (I assume that is bytes, not characters.) Add on some overhead for BTree, etc, and the total ends up somewhere around 2TB.
I assume the router addresses are rather random during the load of the table, so lookups for the 'next' address to insert is a random lookup in the ever-growing index BTree. Once the index grows past what can fit in the buffer_pool (less than 768GB), I/O will be needed more and more often. By the end of the load, approximately 3 out of 4 rows inserted will have to wait for a read from that index BTree to check for the row already existing. We are looking at a load time of months, even with SSDs.
So, what can be done? Consider the following. Hash the address with MD5 and UNHEX it - 16 bytes. Leave that in the table. Meanwhile write a file with the hex value of the md5, plus the router address - 150B lines (skipping the NULLs). Sort, with deduplication, the file. (Sort on the md5.) Build the normalization table from the sorted file (15B lines).
Result: The load is reasonably fast (but complex). The router address is not 75 bytes (nor 5 bytes), but 16. The normalization table exists and works.
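A sketch of that pipeline (in-memory here for illustration; in production the "write a file, sort with dedup" steps would be an external sort over 150B lines):

```python
# Plan B sketch: keep the 16-byte MD5 in the fact table, spill
# (md5hex, address) pairs, then sort + dedupe to build the normalization table.
import hashlib

def plan_b(addresses):
    pairs = []
    for addr in addresses:
        if addr is None:
            continue                                   # skip the NULLs
        digest = hashlib.md5(addr.encode()).digest()   # 16 bytes, stays in fact table
        pairs.append((digest.hex(), addr))             # line destined for the spill file
    # stands in for `sort -u` on the md5 column of the spill file
    return sorted(set(pairs))

rows = plan_b(["router-a", "router-b", "router-a", None, "router-b"])
print(len(rows))   # 2 distinct addresses survive deduplication
```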
You state it's highly duplicated?
My first thought would be to create another table with the actual varchar value and an integer primary key pointing to it.
The existing table can then simply hold a foreign-key reference to that value (and additionally be efficiently indexable).
tl;dr: Is assigning row IDs of {unixtimestamp}{randomdigits} (such as 1308022796123456) as a BIGINT a good idea if I don't want to deal with UUIDs?
Just wondering if anyone has some insight into any performance or other technical considerations / limitations in regards to IDs / PRIMARY KEYs assigned to database records across multiple servers.
My PHP+MySQL application runs on multiple servers, and the data needs to be able to be merged. So I've outgrown the standard sequential / auto_increment integer method of identifying rows.
My research into a solution brought me to the concept of using UUIDs / GUIDs. However the need to alter my code to deal with converting UUID strings to binary values in MySQL seems like a bit of a pain/work. I don't want to store the UUIDs as VARCHAR for storage and performance reasons.
Another possible annoyance of UUIDs stored in a binary column is that row IDs aren't obvious when looking at the data in phpMyAdmin - I could be wrong about this though - but straight numbers seem a lot simpler overall anyway and are universal across any kind of database system with no conversion required.
As a middle ground I came up with the idea of making my ID columns a BIGINT, and assigning IDs using the current unix timestamp followed by 6 random digits. So let's say my random number came out to be 123456; my generated ID today would come out as: 1308022796123456
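In code, the scheme is just concatenating the timestamp with a zero-filled random number (a Python sketch; the PHP version would be analogous):

```python
import random, time

def make_id(now=None, rand=None):
    """BIGINT-style id: unix timestamp followed by 6 zero-filled random digits."""
    ts = int(time.time()) if now is None else now
    r = random.randrange(1_000_000) if rand is None else rand
    return int(f"{ts}{r:06d}")

print(make_id(now=1308022796, rand=123456))   # the example id from the question
print(make_id() < 2**63)                      # comfortably fits a signed BIGINT
```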
A roughly one-in-a-million chance of a conflict between rows created within the same second is fine with me. I'm not doing any sort of mass row creation quickly.
One issue I've read about with randomly generated UUIDs is that they're bad for indexes, as the values are not sequential (they're spread out all over the place). The UUID() function in MySQL addresses this by generating the first part of the UUID from the current timestamp. Therefore I've copied that idea of having the unix timestamp at the start of my BIGINT. Will my indexes be slow?
Pros of my BIGINT idea:
Gives me the multi-server/merging advantages of UUIDs
Requires very little change to my application code (everything is already programmed to handle integers for IDs)
Half the storage of a UUID (8 bytes vs 16 bytes)
Cons:
??? - Please let me know if you can think of any.
Some follow up questions to go along with this:
Should I use more or less than 6 random digits at the end? Will it make a difference to index performance?
Is one of these methods any "randomer"?: getting PHP to generate 6 digits and concatenating them together, vs. getting PHP to generate a number in the 1-999999 range and then zero-filling to ensure 6 digits.
Thanks for any tips. Sorry about the wall of text.
I have run into this very problem in my professional life. We used timestamp + random number and ran into serious issues when our applications scaled up (more clients, more servers, more requests). Granted, we (stupidly) used only 4 digits, then changed to 6, but you would be surprised how often the errors still happened.
Over a long enough period of time, you are guaranteed to get duplicate key errors. Our application is mission critical, and therefore even the smallest chance it could fail due to inherently random behavior was unacceptable. We started using UUIDs to avoid this issue, and carefully managed their creation.
Using UUIDs, your index size will increase, and a larger index will result in poorer performance (perhaps unnoticeable, but poorer nonetheless). However, MySQL can store a UUID compactly in a BINARY(16) column (never use a varchar as a primary key!) and can handle indexing, searching, etc. pretty efficiently even compared to bigint. The biggest performance hit to your index is almost always the number of rows indexed, rather than the size of the item being indexed (unless you want to index on a longtext or something ridiculous like that).
To answer your question: bigint (with random numbers attached) will be OK if you do not plan on scaling your application/service significantly. If your code can handle the change without much alteration and your application will not explode if a duplicate key error occurs, go with it. Otherwise, bite the bullet and go for the more substantial option.
You can always implement a larger change later, like switching to an entirely different backend (which we are now facing... :P)
You can manually change the autonumber starting number.
ALTER TABLE foo AUTO_INCREMENT = ####
An unsigned int can store up to 4,294,967,295; let's round it down to 4,290,000,000.
Use the first 3 digits for the server serial number, and the final 7 digits for the row id.
This gives you up to 430 servers (including 000), and up to 10 million IDs for each server.
So for server #172 you manually change the autonumber to start at 1,720,000,000, then let it assign IDs sequentially.
If you think you might have more servers, but less IDs per server, then adjust it to 4 digits per server and 6 for the ID (i.e. up to 1 million IDs).
You can also split the number using binary digits instead of decimal digits (perhaps 10 binary digits per server, and 22 for the ID. So, for example, server 76 starts at 2^22*76 = 318,767,104 and ends at 322,961,407).
For that matter you don't even need a clear split. Take 4,294,967,295 divide it by the maximum number of servers you think you will ever have, and that's your spacing.
You could use a bigint if you think you need more identifiers, but that's a seriously huge number.
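The offset arithmetic for both splits is trivial to script (a sketch; decimal_start produces the value you would feed to ALTER TABLE ... AUTO_INCREMENT):

```python
# Decimal split: 3 digits of server number + 7 digits of row id.
def decimal_start(server, id_digits=7):
    return server * 10 ** id_digits

# Binary split: server number in the high bits, row id in the low bits.
def binary_range(server, id_bits=22):
    start = server << id_bits
    return start, start + (1 << id_bits) - 1

print(decimal_start(172))    # starting AUTO_INCREMENT for server #172
print(binary_range(76))      # (start, end) for server 76 with a 10/22-bit split
```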
Use the GUID as a unique index, but also calculate a 64-bit (BIGINT) hash of the GUID, store that in a separate NOT UNIQUE column, and index it. To retrieve, query for a match to both columns - the 64-bit index should make this efficient.
What's good about this is that the hash:
a. Doesn't have to be unique.
b. Is likely to be well-distributed.
The cost: extra 8-byte column and its index.
If you want to use the timestamp method then do this:
Give each server a number; to that, append the process ID of the application that is doing the insert (or the thread ID) (in PHP it's getmypid()); then append how long that process has been alive/active (in PHP it's getrusage()); and finally add a counter that starts at 0 at the start of each script invocation (i.e. each insert within the same script adds one to it).
Also, you don't need to store the full unix timestamp - most of those digits are for saying it's year 2011, and not year 1970. So if you can't get a number saying how long the process has been alive, then at least subtract a fixed timestamp representing today - that way you'll need far fewer digits.
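A sketch of this recipe in Python rather than PHP (os.getpid() standing in for getmypid(), and the fixed-timestamp fallback used in place of process uptime; the 2011 epoch constant is illustrative):

```python
# server number + process id + seconds-since-a-recent-epoch + per-invocation counter
import itertools, os, time

EPOCH_2011 = 1293840000          # 2011-01-01 UTC: shaves the "it's not 1970" digits
_counter = itertools.count()     # restarts at 0 for each script invocation

def composite_id(server_no):
    seconds = int(time.time()) - EPOCH_2011
    return f"{server_no:03d}{os.getpid():07d}{seconds:09d}{next(_counter):04d}"

a, b = composite_id(7), composite_id(7)
print(a != b)    # the counter keeps two inserts within one process distinct
```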
How does MySQL store a varchar field? Can I assume that the following pattern represents sensible storage sizes:
1,2,4,8,16,32,64,128,255 (max)
A clarification via example. Let's say I have a varchar field of 20 characters. When creating this field, does MySQL basically reserve space for 32 bytes (not sure if they are bytes or not) but only allow 20 to be entered?
I guess I am worried about optimising disk space for a massive table.
To answer the question: on disk MySQL uses 1 byte plus the size of the data actually stored in the field (so if the column was declared varchar(45) and the field was "FooBar", it would use 7 bytes on disk; with a multibyte character set, multibyte characters take correspondingly more). So however you declare your columns, it won't make a difference on the storage end (you stated you are worried about disk optimization for a massive table). However, it does make a difference in queries, as VARCHARs are converted to CHARs when MySQL makes a temporary table (SORT, ORDER, etc.), and the more records you can fit into a single page, the less memory you use and the faster your table scans will be.
MySQL stores a varchar field as a variable length record, with either a one-byte or a two-byte prefix to indicate the record size.
Having a pattern of storage sizes doesn't really make any difference to how MySQL will function when dealing with variable length record storage. The length specified in a varchar(x) declaration will simply determine the maximum length of the data that can be stored. Basically, a varchar(16) is no different disk-wise than a varchar(128).
This manual page has a more detailed explanation.
Edit: With regards to your updated question, the answer is still the same. A varchar field will only use up as much space on disk as the data you store in it (plus a one or two byte overhead). So it doesn't matter if you have a varchar(16) or a varchar(128), if you store a 10-character string in it, you're only going to use 10 bytes (plus 1 or 2) of disk space.
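As a sketch of that rule (the 1-vs-2-byte length prefix depends on the column's maximum byte length, not on the stored value; bytes_per_char is 1 for latin1, 4 for utf8mb4):

```python
# On-disk bytes for a VARCHAR value: a length prefix plus the encoded data.
def varchar_disk_bytes(value, declared_chars, bytes_per_char=1):
    max_bytes = declared_chars * bytes_per_char
    prefix = 1 if max_bytes <= 255 else 2          # 1-byte prefix up to 255 max bytes
    return prefix + len(value.encode("utf-8"))

print(varchar_disk_bytes("FooBar", 45))        # same result for varchar(16) or varchar(128)
print(varchar_disk_bytes("FooBar", 128, 4))    # utf8mb4 pushes max bytes past 255 -> 2-byte prefix
```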
I'm using mySQL to set up a database of stock options. There are about 330,000 rows (each row is 1 option). I'm new to SQL so I'm trying to decide on the field types for things like option symbol (varies from 4 to 5 characters), stock symbol (varies from 1 to 5 characters), company name (varies from 5 to 60 characters).
I want to optimize for speed: both for creating the database (which happens every 5 minutes as new price data comes out -- I don't have a real-time data feed, but it's near real-time in that I get a new text file with 330,000 rows delivered to me every 5 minutes; this new data completely replaces the prior data), and for lookup speed (there will be a web-based front end where many users can run ad hoc queries).
If I'm not concerned about space (since the db lifetime is 5 minutes, and each row contains maybe 300 bytes, so maybe 100MBs for the whole thing) then what is the fastest way to structure the fields?
Same question for numeric fields, actually: Is there a performance difference between int(11) and int(7)? Does one length work better than another for queries and sorting?
Thanks!
In MyISAM, there is some benefit to making fixed-width records. VARCHAR is variable width. CHAR is fixed-width. If your rows have only fixed-width data types, then the whole row is fixed-width, and MySQL gains some advantage calculating the space requirements and offset of rows in that table. That said, the advantage may be small and it's hardly worth a possible tiny gain that is outweighed by other costs (such as cache efficiency) from having fixed-width, padded CHAR columns where VARCHAR would store more compactly.
The breakpoint where it becomes more efficient depends on your application, and this is not something that can be answered except by you testing both solutions and using the one that works best for your data under your application's usage.
Regarding INT(7) versus INT(11), this is irrelevant to storage or performance. It is a common misunderstanding that MySQL's argument to the INT type has anything to do with size of the data -- it doesn't. MySQL's INT data type is always 32 bits. The argument in parentheses refers to how many digits to pad if you display the value with ZEROFILL. E.g. INT(7) will display 0001234 where INT(11) will display 00000001234. But this padding only happens as the value is displayed, not during storage or math calculation.
If the actual data in a field can vary a lot in size, varchar is better because it leads to smaller records, and smaller records mean a faster DB (more records can fit into cache, smaller indexes, etc.). For the same reason, using smaller ints is better if you need maximum speed.
OTOH, if the variance is small, e.g. a field has a maximum of 20 chars, and most records actually are nearly 20 chars long, then char is better because it allows some additional optimizations by the DB. However, this really only matters if it's true for ALL the fields in a table, because then you have fixed-size records. If speed is your main concern, it might even be worth it to move any non-fixed-size fields into a separate table, if you have queries that use only the fixed-size fields (or if you only have shotgun queries).
In the end, it's hard to generalize because a lot depends on the access patterns of your actual app.
Given your system constraints I would suggest a varchar since anything you do with the data will have to accommodate whatever padding you put in place to make use of a fixed-width char. This means more code somewhere which is more to debug, and more potential for errors. That being said:
The major bottleneck in your application is due to dropping and recreating your database every five minutes. You're not going to get much performance benefit out of microenhancements like choosing char over varchar. I believe you have some more serious architectural problems to address instead. – Princess
I agree with the above comment. You have bigger fish to fry in your architecture before you can afford to worry about the difference between a char and varchar. For one, if you have a web user attempting to run an ad hoc query and the database is in the process of being recreated, you are going to get errors (i.e. "database doesn't exist" or simply "timed out" type issues).
I would suggest that instead you build (at the least) a quote table for the most recent quote data (with a time stamp), a ticker symbol table and a history table. Your web users would query against the ticker table to get the most recent data. If a symbol comes over in your 5-minute file that doesn't exist, it's simple enough to have the import script create it before posting the new info to the quote table. All others get updated and queries default to the current day's data.
I would definitely not recreate the database each time. Instead I would do the following:
read in the update/snapshot file and create some object based on each row.
for each row get the symbol/option name (unique) and set that in the database
If it were me I would also have an in memory cache of all the symbols and the current price data.
Price data is never an int; use DECIMAL or store it as characters.
The company name is not unique per row, since there are many options for a particular company. Move it to its own table with an index, and save space by storing just the company's id.
As someone else also pointed out - your web clients do not need to have to hit the actual database and do a query - you can probably just hit your cache. (though that really depends on what tables and data you expose to your clients and what data they want)
Having query access for other users is also a reason NOT to keep removing and creating a database.
Also remember that creating databases is subject to whatever actual database implementation you use. If you ever port from MySQL to, say, PostgreSQL, you will discover the very unpleasant fact that creating databases in PostgreSQL is a comparatively slow operation -- orders of magnitude slower than reading and writing table rows, for instance.
It looks like there is an application design problem to address first, before you optimize for performance choosing proper data types.