MySQL: Many Indexes, but on small fields

I have a BOOK table consisting of 15 columns, most of which are small integers (INT(1) for various ratings, and INT(4) or INT(5) elsewhere).
The table is meant to be used for dynamic search with filters on a website. To speed things up, I created indexes on almost every INT column (10-11 indexes in total). I don't have most of the data in the table yet, but will I run into memory trouble once the table grows huge?
My question in general: does an index on a small integer column require comparatively more memory than I would expect?

It's a lot easier to shrink the datatypes before you have a zillion rows in the table.
INT UNSIGNED takes 4 bytes and allows numbers from 0 to about 4 billion.
TINYINT UNSIGNED takes 1 byte and allows values 0..255. So, if you have a billion-row table, changing an INT to TINYINT would shrink the disk footprint by 3GB, plus another 3GB if it is also in an index. (This is a simplification; hope you get the idea.)
SMALLINT UNSIGNED takes 2 bytes, allowing 0..65535. That is probably what you want instead of INT(4) and maybe INT(5)?
The (5) means nothing (except when used with ZEROFILL).
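To illustrate, shrinking the datatypes might look roughly like this (the column names here are invented; substitute your actual columns):

ALTER TABLE BOOK
    MODIFY COLUMN rating_stars TINYINT UNSIGNED NOT NULL,    -- 1 byte, 0..255
    MODIFY COLUMN publish_year SMALLINT UNSIGNED NOT NULL,   -- 2 bytes, 0..65535
    MODIFY COLUMN page_count SMALLINT UNSIGNED NOT NULL;     -- instead of INT(4)/INT(5)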
Your table will probably be 1/3 data and 2/3 index. This ratio is abnormal, but not "bad".
Instead of 10-11 single-column indexes, I recommend you make about that many 2-column indexes. This will speed up more of your queries.
You need to get a feel for the traffic -- what columns do people usually filter on? And how do they filter? That is, a=7 versus a>7.
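For instance (the column names are hypothetical; pick pairs based on the filters you actually see in the traffic), 2-column indexes might look like:

ALTER TABLE BOOK
    ADD INDEX idx_genre_rating (genre_id, overall_rating),
    ADD INDEX idx_genre_year (genre_id, publish_year);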
Once you have some likely SELECTs, study my Cookbook to see how to optimize the indexes. After that, come back with SHOW CREATE TABLE and the SELECTs; I may suggest further tweaks.
I would not hesitate to build a table like yours with a billion rows, even if I did not have enough RAM to cache it all.


Which one is faster for getting a row: a primary key that carries numbers, or one that carries characters?

ID (INT(11), PRIMARY KEY, AUTO_INCREMENT)    TITLE
1                                            ...
2                                            ...
3                                            ...
4                                            ...
5                                            ...
... up to 10 million rows
ID (CHAR(32), PRIMARY KEY)                   TITLE
a4a0FCBbE614497581da84454f806FbA             ...
40D553d006EF43f4b8ef3BcE6B08a542             ...
781DB409A5Db478f90B2486caBaAdfF2             ...
fD07F0a9780B4928bBBdbb1723298F92             ...
828Ef8A6eF244926A15a43400084da5D             ...
... up to 10 million rows
If I want to get a specific row from the first table, roughly how much time will it take? And the same for the second table: roughly how much time will it take?
Will a primary key that carries numbers be found faster than one that carries characters?
I do not want to use auto-increment with INT like in the first table because of this problem.
UUIDs and MD5s and other hashes suck because of the "randomness" and lack of "locality of reference", not because of being characters instead of numeric.
You could convert those to BINARY(16), thereby making them half as big.
10M INT         =  40MB = ~600 entries per 16KB block
10M CHAR(32)    = 320MB = ~300 entries per block
10M VARCHAR(32) = 330MB = ~300 entries per block
10M BINARY(16)  = 160MB = ~450 entries per block
Add that much more for each secondary key in that table.
Add again for each other table that references that PK (eg, FOREIGN KEY).
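If you go the BINARY(16) route, the conversion might look roughly like this (table and column names are placeholders; the existing key is assumed to be a 32-character hex string):

ALTER TABLE t ADD COLUMN id_bin BINARY(16);
UPDATE t SET id_bin = UNHEX(id);    -- packs 32 hex characters into 16 bytes
-- ...then repoint any FOREIGN KEYs and indexes, drop the CHAR(32) column,
-- and promote id_bin to the PRIMARY KEY. To display it again: SELECT HEX(id_bin) ...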
Let's look at the B+Tree that is the structure of the PK and the secondary indexes. In a 16KB block, some number of entries can be placed; I have estimated them above. (Yes, the 'overhead' is much more than for an INT.) For INT, the BTree for 10M rows will probably be 3 levels deep. Ditto for the others. (As the table grows, VARCHAR would move to 4 levels before the others.)
So, I conclude, there is little or no difference in how many BTree blocks are needed to do your "point query".
Summary of how much slower a string is than an INT:
BTree depth -- little or none
Cachability of index blocks -- some; not huge
CPU time to compare numbers vs strings -- some; not huge
Use of a fancy COLLATION -- some; not huge
Overall -- not enough difference to worry about.
What I will argue for in some cases is whether you need a fabricated PK. In 2/3 of the tables I build, I find that there is a 'natural' PK -- some column(s) that is, by the business logic, naturally UNIQUE and NOT NULL. These are the two main qualifications (in MySQL) for a PRIMARY KEY. In some situations the speedup afforded by a "natural PK" can be more than a factor of 2.
A Many-to-many mapping table is an excellent (and common) example of such.
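A minimal sketch of such a table (the names are illustrative only):

CREATE TABLE book_author (
    book_id INT UNSIGNED NOT NULL,
    author_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (book_id, author_id),    -- the 'natural' PK: UNIQUE and NOT NULL
    INDEX (author_id, book_id)           -- for going the other direction
) ENGINE=InnoDB;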
It is impossible to tell the exact times needed to retrieve a specific record, because it depends on lots of factors.
In general, numeric values take less storage space, so scanning the index requires fewer I/O operations, and is therefore usually faster.
However, in this specific case the key in the second table looks like the hexadecimal representation of a large number. You could probably store it as a binary value to save storage space.
On top of that, numeric values are generally not affected by various database and column settings, while strings are (collation, for example), which can also add some processing time while querying.
The real question is what the purpose of that representation is. 10 million values easily fit in an INT; what is the need for a key that can store far more (a 32-character hexadecimal value)?
As long as you stay within the range of the numeric type and there is no other requirement than being able to store that many distinct values, I would go with an integer.
The 'problem' you mention in the question is usually not a problem. In most cases there is no need to avoid gaps in the identifiers. In fact, in lots of systems gaps occur naturally during normal operation. You most probably won't reassign records to other IDs when a record is deleted from the middle of the table.
Unless the ID has a semantic meaning (it should not), I would just go with AUTO_INCREMENT; there is no need to reinvent the wheel.

Storing a 100k by 100k array in MySQL

I need to store a massive, fixed size square array in MySQL. The values of the array are just INTs but they need to be accessed and modified fairly quickly.
So here is what I am thinking:
Just use one column for the primary key and translate the 2D array indexes into single-dimensional indexes.
So if the 2d array is n by n => 2dArray[i][j] = 1dArray[n*(i-1)+j]
This translates the problem into storing a massive 1D array in the database.
Then use another column for the values.
Make every entry in the array a row.
However, I'm not very familiar with the internal workings of MySQL.
100k*100k makes 10 billion data points, which is more than 32 bits can hold, so I can't use INT as a primary key. And from researching Stack Overflow, some people have experienced performance issues using BIGINT as a primary key.
In this case where I'm only storing INTs, would the performance of MySQL drop as the number of rows increases?
Or if I were to scatter the data over multiple tables on the same server, could that improve performance? Right now, it looks like I won't have access to multiple machines, so I can't really cluster the data.
I'm completely flexible about every idea I've listed above and open to suggestions (except not using MySQL because I've kind of committed to that!)
As for your concern that BIGINT or adding more rows decreases performance, of course that's true. You will have 10 billion rows, that's going to require a big table and a lot of RAM. It will take some attention to the queries you need to run against this dataset to decide on the best storage method.
I would probably recommend using two columns for the primary key. Developers often overlook the possibility of a compound primary key.
Then you can use INT for both primary key columns if you want to.
CREATE TABLE MyTable (
    array_index1 INT NOT NULL,
    array_index2 INT NOT NULL,
    datum WHATEVER_TYPE NOT NULL,
    PRIMARY KEY (array_index1, array_index2)
);
Note that a compound index like this means that if you search on the second column without an equality condition on the first column, the search won't use the index. So you need a secondary index if you want to support that.
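For example, assuming the table above, the extra index might look like this (the index name is made up). Note that with InnoDB the primary-key columns are implicitly appended to every secondary index, so this alone is enough to locate a full (i, j) cell even when the query leads with the second subscript:

ALTER TABLE MyTable ADD INDEX idx_index2 (array_index2);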
100,000 columns is not supported by MySQL. MySQL has limits of 4096 columns and of 65,535 bytes per row (not counting BLOB/TEXT columns).
Storing the data in multiple tables is possible, but will probably make your queries terribly awkward.
You could also look into using table PARTITIONING, but this is not as useful as it sounds.

Use MariaDB to store futures market data (large number of records)

For futures market data, we need at least 1,000,000 records each day; each record has fewer than 10 fields of a few characters each. I chose MariaDB 5.5 on CentOS 7; the engine is InnoDB. my.cnf has the following configuration:
[server]
innodb_file_per_table=1
innodb_flush_log_at_trx_commit=2
innodb_buffer_pool_size=2G
innodb_log_file_size=256M
innodb_log_buffer_size=8M
bulk_insert_buffer_size=256M
When I insert records it is not very fast, but it is acceptable. However, when I export data, it is very slow once the InnoDB table grows larger than a few GB.
The fields are like: id, bid, ask, time, xx, xx; id is AUTO_INCREMENT and is the key. My query SQL is like the following:
select * from table where instrument="xx" and time >= "xx" and time <= "xx"
Any advice on how to speed up SELECT performance? Thanks!
To tailor to the SELECT, make the table InnoDB and set up the clustered PRIMARY KEY so that the desired rows are consecutive. This is likely to slow down the INSERT process, but that is not an issue -- 12 inserts/second is easily handled.
But let me digress for a moment -- do the 1M rows come in all at once? Or are the trickling in over 7 hours? Or what? If all at once, sort the data according to the PK before doing the massive LOAD DATA.
Your query begs for PRIMARY KEY(instrument, time). But a PK must be unique; is that combination unique? If not, then another column (id?) should be tacked onto the end to make it unique.
Note that if it is unique, then you don't need an AUTO_INCREMENT; get rid of it. For such large tables, minimizing the number of indexes is critical, not just for performance, but for even being able to survive.
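A sketch of what that could look like (the table name, column types, and the use of a numeric instrument id are assumptions on my part):

CREATE TABLE ticks (
    instrument_id SMALLINT UNSIGNED NOT NULL,   -- assumes the instrument is normalized to an id, as suggested below
    time DATETIME NOT NULL,
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    bid DECIMAL(10,4) NOT NULL,
    ask DECIMAL(10,4) NOT NULL,
    PRIMARY KEY (instrument_id, time, id),      -- clusters the rows your SELECT wants
    INDEX (id)                                  -- all that AUTO_INCREMENT needs
) ENGINE=InnoDB;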
Other things to do...
Normalize the instrument. That is, have a table of instruments and map each one to an id, probably SMALLINT UNSIGNED (2 bytes) if there are under 65K of them; a rough sketch follows after these notes. See my blog for more discussion of normalizing as you ingest.
Shrink any fields you can -- FLOAT (4 bytes) is tempting, but it has round-off errors. DECIMAL is tricky because you need to worry about penny stocks at one extreme and BRK-A at the other.
Look at the rest of the queries to make sure this change in PK does not hurt them.
Set innodb_buffer_pool_size to about 70% of available RAM (assuming you have more than 4GB of RAM).
If you do have to keep id as an AUTO_INCREMENT, then add INDEX(id); that is all that is needed to keep A_I happy.
Use CHARACTER SET ascii unless you need utf8 somewhere.
Volume can exceed 4 billion in rare cases; ponder what to do.
Fetching 10K rows in PK order will take only seconds.
FULLTEXT is not useful for this application.
PARTITIONing is not likely to be useful; we can revisit it if you care to share the rest of the queries. On the other hand, if you will be deleting 'old' data, then PARTITIONing is an excellent idea. See my partition blog.
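A rough sketch of the normalization mentioned above (table, column, and symbol names are illustrative):

CREATE TABLE instruments (
    instrument_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    symbol VARCHAR(20) CHARACTER SET ascii NOT NULL,
    PRIMARY KEY (instrument_id),
    UNIQUE KEY (symbol)
) ENGINE=InnoDB;

-- During ingestion, look up (or create) the id for a symbol, then store only the id:
INSERT INTO instruments (symbol) VALUES ('IF2301')
    ON DUPLICATE KEY UPDATE instrument_id = LAST_INSERT_ID(instrument_id);
-- LAST_INSERT_ID() now returns the id to store in the tick table.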

Proper database design for Big Data

I have a huge number of tables, one for each country. I want multiple comment-related fields for each so that users can make comments on my website. I might also have a few more fields, like the date the comment was created and the user_id of the commenter. And I might need to add other fields in the future, for example company_support_comment/support_rating, company_professionalism_comment.
Let's say I have 1 million companies in one table and 100 comments per company. Then I get lots of comments just for one country; it will easily exceed 2 billion.
Unsigned BIGINT can support 18,446,744,073,709,551,615, so we could have that many comments in one table. Unsigned INT gives us 4.2+ billion, which won't be enough in one table.
However, imagine querying a table with 4 billion records: how long would that take? I might not be able to retrieve the comments efficiently, and it would put a huge load on the database. Given that, in practice one table probably can't be done.
Multiple tables might also be bad, unless we just use JSON data...
Actually, I'm not sure now. I need a proper solution for my database design. I am using MySQL now.
Your question goes in the wrong direction, in my view.
Start with your database design. That means go with bigints to start with if you are concerned about it (because converting from int to bigint is a pain if you get that wrong). Build a good, normalized schema. Then figure out how to make it fast.
In your case, PostgreSQL may be a better option than MySQL because your query is going to likely be against secondary indexes. These are more expensive on MySQL with InnoDB than PostgreSQL, because with MySQL, you have to traverse the primary key index to retrieve the row. This means, effectively, traversing two btree indexes to get the rows you are looking for. Probably not the end of the world, but if performance is your primary concern that may be a cost you don't want to pay. While MySQL covering indexes are a little more useful in some cases, I don't think they help you here since you are interested, really, in text fields which you probably are not directly indexing.
In PostgreSQL, you have a btree index which then gives you a series of (page, tuple) pairs, which allow you to look up the data efficiently with random access. This would be a win with such a large table, and my experience is that PostgreSQL can perform very well on large tables (tables spanning, say, 2-3TB in size with their indexes).
However, assuming you stick with MySQL, careful attention to indexing will likely get you where you need to go. Remember you are only pulling up 100 comments for a company and traversing an index has O(log n) complexity so it isn't really that bad. The biggest issue is traversing the pkey index for each of the rows retrieved but even that should be manageable.
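For example (table and column names here are hypothetical), an index on (company_id, created_at) lets MySQL jump straight to one company's comments and read them in date order:

ALTER TABLE comments ADD INDEX idx_company_created (company_id, created_at);

SELECT comment_text, created_at
    FROM comments
    WHERE company_id = 12345
    ORDER BY created_at DESC
    LIMIT 100;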
4 billion records in one table is not a big deal for a NoSQL database. Even for a traditional database such as MySQL, if you build your secondary indexes correctly, searching them will be quick (traversing a B-tree-like data structure takes O(log n) disk visits).
And for faster access, you need a front-end cache for your hot data, such as Redis or memcached.
Given your current situation, where you are not sure what fields will be needed, the only choice is a NoSQL solution, since the fields (columns) can then be added in the future as they are needed.
(From a MySQL perspective...)
One table for companies; INT UNSIGNED will do. One table for comments; BIGINT UNSIGNED may be necessary. You won't fetch hundreds of comments for display at once, will you? Unless you take care of the data layout, 100 comments could easily be 100 random disk hits, which (on a cheap disk) would take 1 second.
You must have indexes (this mostly rules out NoSQL); otherwise searching for records would be painfully slow.
CREATE TABLE Comments (
    comment_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    company_id INT UNSIGNED NOT NULL,
    ts TIMESTAMP,
    ...
    PRIMARY KEY(company_id, comment_id, ts),  -- to get clustering and ordering
    INDEX(comment_id)                         -- to keep AUTO_INCREMENT happy
    ...
) ENGINE=InnoDB;
If you paginate the display of the comments, use the tips in "remember where you left off". That will make fetching comments about as efficient as possible.
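A sketch of that pattern against the Comments table above: instead of OFFSET, the client passes back the last comment_id it has seen, so each page is a short range scan of the clustered PK (the company_id and comment_id values are just examples):

-- First page for company 123:
SELECT comment_id, ts
    FROM Comments
    WHERE company_id = 123
    ORDER BY comment_id
    LIMIT 20;

-- Next page: continue after the last comment_id shown (say 456):
SELECT comment_id, ts
    FROM Comments
    WHERE company_id = 123
      AND comment_id > 456
    ORDER BY comment_id
    LIMIT 20;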
As for Log(n): with about 100 items per node, a billion rows needs only 5 levels of BTree. This is small enough to essentially ignore when worrying about timing. Comments will be a terabyte or more? And your RAM will be significantly less than that? Then you will generally have the non-leaf nodes cached, but the leaf nodes (where the data is) not cached. Several comment rows may be stored consecutively in a leaf node. Hence, fewer than 100 disk hits to get 100 comments for display.
(Note: When the data is much bigger than RAM, 'performance' degenerates into 'counting the disk hits'.)
Well, you mentioned comments. What about other queries?
As for "company_support_comment/support_rating..." -- The simplest would be to add a new table(s) when you need to add those 'columns'. The basic Company data is relatively bulky and static; ratings are relative small but frequently changing. (Again, I am 'counting the disk hits'.)

MYSQL autoincrement a column or just have an integer, difference?

If I have a column, set as primary index, and set as INT.
If I don't set it as auto-increment and just insert random integers which are unique into it, does that slow down future queries compared to auto-incrementing?
Does it speed things up if I run OPTIMIZE on a table with its primary and only index as INT? (assuming only 2 columns, and second column is just some INT value)
(The main worry is the upper limit on the auto-increment, as there are lots of adds and deletes in my table.)
If I don't set it as auto-increment and just insert random integers which are unique into it, does that slow it down compared to auto-incrementing?
In MyISAM it will in fact speed things up (marginally).
In InnoDB, this may slow the INSERT operations down due to page splits.
This of course implies that your numbers are really unique.
Does it speed things up if I optimise a table with its primary and only index as INT? (assuming only 2 columns, and second column is just some INT value)
AUTO_INCREMENT and INT may be used together.
OPTIMIZE TABLE will compact your table and indexes, freeing the space left by deleted rows and page splits. If you have had lots of DELETE operations on the table, or INSERTs out of order (as in your solution with random numbers), this will help.
It will also bring the logical and physical order of the index pages into consistency with each other which will speed up full scans or ranged queries on PK (PK BETWEEN val1 AND val2), but will hardly matter for random seeks.
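For instance (the table name is a placeholder):

OPTIMIZE TABLE my_table;
-- For InnoDB, this is mapped to ALTER TABLE ... FORCE, which rebuilds the table and its
-- indexes, reclaiming the space left by deletes and out-of-order inserts.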
(the main worry is the upper limit on the auto-increment, as there are lots of adds and deletes in my table)
BIGINT UNSIGNED (which can also be used with AUTO_INCREMENT) can hold values up to 18,446,744,073,709,551,615.
The upper limit for autoincremented integers is 18446744073709551615:
http://dev.mysql.com/doc/refman/5.1/en/numeric-types.html
Are you really hitting that limit? If you are, then letting MySQL add one to the previous number is an algorithm that can hardly be improved upon.
The upper limit on AUTO_INCREMENT is the upper limit of the numeric type of the column. Even with INT UNSIGNED this can take a while to hit; with BIGINT it is going to be very hard to reach (and seriously, what kind of app are you building where 4 extra bytes per row are too much?). So, if you're going to hit that limit, you'll hit it with auto-increment or without it.
Also, although not having AUTO_INCREMENT will speed your inserts up a tiny bit, I'm willing to bet that any code you write to generate a unique integer instead will slow things down more than the auto-increment would (generating random non-conflicting numbers gets progressively harder as your table fills up).
In other words, IMNSHO this looks like premature optimization: it will not contribute significantly to faster code (if at all), but it will make the code less maintainable (as the PK will need to be generated explicitly instead of the database taking care of it).