I have a database that is filling up rapidly: we are talking about 10-20k rows per day.
What is the limit of an ID with the autoincrement option? If the ID is created as INTEGER, is the maximum value 2,147,483,647 for unsigned values?
But what happens when autoincrement goes above this? Does it all collapse? What would be the solution then?
I am sure that a lot of people have big databases, and I would like to hear from them.
Thank you.
If you are worried about it growing out of bounds too quickly, I would define the PK as BIGINT UNSIGNED. That gives you a max value of 18,446,744,073,709,551,615, which should be sufficient.
                    | Min. (inclusive)           | Max. (inclusive)
-----------------------------------------------------------------------------
INT Signed (+|-)    | -2,147,483,648             | 2,147,483,647
-----------------------------------------------------------------------------
INT Unsigned (+)    | 0                          | 4,294,967,295
-----------------------------------------------------------------------------
BIGINT Signed (+|-) | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807
-----------------------------------------------------------------------------
BIGINT Unsigned (+) | 0                          | 18,446,744,073,709,551,615
MySQL reference.
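To put these limits next to the 10-20k rows per day mentioned in the question, here is a rough calculation in Python. It is only a sketch: it assumes IDs start at 1, that the insert rate stays constant, and that no IDs are lost to deletes, rollbacks or failed inserts (real tables usually have gaps, so the counter runs out somewhat sooner). The function name is made up for illustration.

```python
# Largest value each MySQL integer type can hold.
MAX_VALUES = {
    "INT signed": 2**31 - 1,
    "INT unsigned": 2**32 - 1,
    "BIGINT signed": 2**63 - 1,
    "BIGINT unsigned": 2**64 - 1,
}


def years_until_exhausted(max_id, rows_per_day):
    """Years until an auto-increment counter reaches max_id at a constant rate.

    Assumes one ID is consumed per inserted row and nothing else.
    """
    return max_id / rows_per_day / 365.25


for name, max_id in MAX_VALUES.items():
    years = years_until_exhausted(max_id, rows_per_day=20_000)
    print(f"{name:16} max {max_id:>26,}  ~{years:,.0f} years at 20k rows/day")
```

At this insert rate even a signed INT lasts for centuries, so the choice of type matters mostly if the rate grows by several orders of magnitude.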
If you have a MySQL table with an ID column (INT UNSIGNED) with auto_increment, and the counter has already reached 4,294,967,295, then when you try to insert one more record, the ID of the new record is automatically set to the maximum, which is 4,294,967,295. If the column is the primary key, that value already exists, so the insert fails with the MySQL error message Duplicate entry '4294967295' for key 'PRIMARY'.
Two possible solutions:
Easy approach: Extend the limit by setting the ID to BIGINT UNSIGNED, just like Dan Armstrong said. This does not make it unbreakable, though, and performance might be affected when the table gets really large.
Harder approach: Use partitioning, which is a little more complicated but gives better performance and, in practice, no database limit (your only limit is the size of your physical hard disk). Twitter (and similar huge websites) use this approach for their millions of tweets (records) per day!
I have a MySQL database and need to store up to 25 recommendations for each user (when the user visits the site). Here is my simple table that holds the userid, recommendation and rank for the recommendation:
userid | recommendation | rank
1 | movie_A | 1
1 | movie_X | 2
...
10 | movie_B | 1
10 | movie_A | 2
....
I expect about 10M users, which combined with 25 recommendations each would result in 250M rows. Is there a better way to design a user-recommendation table?
Thanks!
Is your requirement only to retrieve the 25 recommendations and send them to a UI layer for consumption?
If that is the case, the system that computes the recommendations can build a JSON document and store it against the userid. MySQL supports a JSON data type.
This might not be a good approach if you want to perform search queries on the JSON document.
250 million rows isn't unreasonable in a simple table like this:
CREATE TABLE UserMovieRecommendations (
  user_id INT UNSIGNED NOT NULL,
  movie_id INT UNSIGNED NOT NULL,
  -- RANK is a reserved word in MySQL 8.0, so the column name is quoted
  `rank` TINYINT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id, movie_id, `rank`),
  FOREIGN KEY (user_id) REFERENCES Users(user_id),
  FOREIGN KEY (movie_id) REFERENCES Movies(movie_id)
);
That's 9 bytes per row (4 + 4 + 1), so only about 2GB.
25 * 10,000,000 * 9 bytes = 2,250,000,000 bytes, or about 2.1GB.
Perhaps double that to account for indexes and so on. Still not hard to imagine a MySQL server configured to hold the entire data set in RAM. And it's probably not necessary to hold all the data in RAM, since not all 10 million users will be viewing their data at once.
You might never reach 10 million users, but if you do, I expect that you will be using a server with plenty of memory to handle this.
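The estimate above can be reproduced with a few lines of Python. This is only a back-of-envelope sketch: it counts raw column bytes, takes the user and recommendation counts from the question, and uses the rough "double it" allowance for indexes rather than any real InnoDB overhead figure.

```python
# Raw column sizes: INT UNSIGNED is 4 bytes, TINYINT UNSIGNED is 1 byte.
COLUMN_BYTES = {"user_id": 4, "movie_id": 4, "rank": 1}

users = 10_000_000
recs_per_user = 25

row_bytes = sum(COLUMN_BYTES.values())
rows = users * recs_per_user
raw_bytes = rows * row_bytes
raw_gib = raw_bytes / 2**30

# Rough allowance for indexes and row overhead, as suggested in the answer.
doubled_gib = 2 * raw_gib

print(f"{rows:,} rows x {row_bytes} bytes = {raw_bytes:,} bytes ({raw_gib:.1f} GiB)")
print(f"with the 2x allowance: about {doubled_gib:.1f} GiB")
```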
If I have a large table of floating-point numbers, can it help reading speed if I add a column that holds the integer part of each float? If that integer column is indexed, then when I need to select all the floats that start with a certain integer, will it "filter out" the values that are surely not needed?
For example, if there are 10,000 numbers, 5,000 of which begin with 14 (14.232, 14.666, etc.), is there an SQL statement that selects faster if I add the integer column?
id | number | int_value |
1 | 11.232 | 11 |
2 | 30.114 | 30 |
3 | 14.888 | 14 |
.. | .. | .. |
3005 | 14.332 | 14 |
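You can create a nonclustered index on the number column itself, and when selecting from the table you can filter with the LIKE operator. No additional column is needed. Be aware, though, that LIKE compares the number as a string, so the pattern '14%' also matches values such as 140.5, and the index may not be used for it: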
Select * from mytable
where number like '14%'
First of all: Do you have performance issues? If not then why worry?
Then: You need to store decimals, but you are sometimes only interested in the integer part. Yes?
So you have one or more queries of the type
where number >= 14 and number < 15
or
where truncate(number, 0) = 14
Do you already have indexes on the number? E.g.
create index idx on mytable(number);
The first mentioned WHERE clause would probably benefit from it. The second doesn't, because when you invoke a function on the column, the DBMS doesn't see the relation to the index anymore. This shows it can make a difference how you write the query.
If the first WHERE clause is still too slow in spite of the index, you can create a computed column (ALTER TABLE mytable ADD numint int GENERATED ALWAYS AS (truncate(number, 0)) STORED), index that, and access it instead of the number column in your query. But I doubt that would speed things up noticeably.
As to your example:
if there are 10,000 numbers, 5000 of which begin with 14
This is not called a large table, but a small one. And as you'd want half of the records anyway, the DBMS would simply read all records sequentially and look at the number. It doesn't make a difference whether it looks at an integer or a decimal number. (Well, some nanoseconds maybe, but nothing you would notice.)
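The point about wrapping the column in a function can be seen in a query plan. The sketch below uses Python's built-in sqlite3 module as a stand-in, purely so that it runs without a server. SQLite is not MySQL: CAST(number AS INTEGER) replaces truncate(number, 0), the extra label column and the test data are made up, and the plan text differs from MySQL's EXPLAIN output. The idea is the same, though: the range condition can search the index, while the function call forces a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, number REAL, label TEXT)")
conn.execute("CREATE INDEX idx ON mytable(number)")
conn.executemany(
    "INSERT INTO mytable (number, label) VALUES (?, ?)",
    [(i / 7.0, f"row{i}") for i in range(1000)],
)


def plan(where_clause):
    """Return the first line of SQLite's query plan for the given WHERE clause."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM mytable WHERE " + where_clause
    ).fetchall()
    return rows[0][3]  # the human-readable 'detail' column


def count(where_clause):
    """Number of rows matching the given WHERE clause."""
    return conn.execute(
        "SELECT COUNT(*) FROM mytable WHERE " + where_clause
    ).fetchone()[0]


range_where = "number >= 14 AND number < 15"
func_where = "CAST(number AS INTEGER) = 14"

range_plan = plan(range_where)  # a SEARCH that uses the index on number
func_plan = plan(func_where)    # a SCAN, the index range cannot be used

print("range condition:", range_plan)
print("function on column:", func_plan)
print("same rows either way:", count(range_where) == count(func_where))
```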
I have the following scenario:
A form with many checkboxes, around 100.
I have 2 ideas on how to save them in database:
1. Multicolumn
I create a table looking like this:
id | box1 | box2 | ... | box100 | updated| created
id: int
box1: bit(1)
SELECT * FROM table WHERE box1 = 1 AND box22 = 1 ...
2. Single data column
Table is simply:
id | data | updated | created
data: varchar(100)
SELECT * FROM table WHERE data LIKE '_______1___ ... ____1____1'
where data looks like 0001100101010......01 each character representing if value was checked or not.
Considering that the table will have 200k+ rows, which is a more scalable solution?
3. Single data column of type JSON
I have no good information about this yet.
Or...
4. A few SETs
5. A few INTs
These are much more compact: about 8 checkboxes per byte.
They are a bit messy to set/test.
Since they are limited to 64 bits, you would need more than one SET or INT. I recommend grouping the bits in some logical way, based on the app.
Be aware of FIND_IN_SET().
Be aware of (1 << $n) for creating the value 2^n.
Be aware of | and & Operators.
Which of the 5 is best? That depends on the queries you need to run -- for searching (if necessary?), for inserting, for updating (if necessary?), and for selecting.
An example: For INTs, WHERE (bits & 0x2C08) = 0x2C08 would simultaneously check for 4 flags being 'ON'. That constant could either be constructed in app code, or written as ((1<<13) | (1<<11) | (1<<10) | (1<<3)) for bits 3, 10, 11, 13. Meanwhile, the other flags are ignored. If you also need all the other flags to be 'OFF', the test would be WHERE bits ^ 0x2C08 = 0 (which is the same as bits = 0x2C08). If either of these kinds of test is your main activity, then Choice 5 is probably the best for both performance and space, though it is somewhat cryptic to read.
When adding another option, SET requires an ALTER TABLE. INT usually has some spare bits (TINYINT UNSIGNED has 8 bits, ... BIGINT UNSIGNED has 64). So, about one time in 8, you would need an ALTER to get a bigger INT or add another INT. Deleting an option: I suggest just abandoning that SET element or bit of the INT.
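To make the bit arithmetic concrete, here is a small Python sketch of the same two tests done in app code. The helper names and the sample rows are made up for illustration; the operators (<<, |, &, ^) behave the same way as in the MySQL expressions above.

```python
def mask_for(bit_positions):
    """Build an integer with the given bit positions set, e.g. [3, 10] -> (1<<3) | (1<<10)."""
    mask = 0
    for n in bit_positions:
        mask |= 1 << n
    return mask


def has_all(bits, mask):
    """True if every flag in mask is ON in bits; other flags are ignored."""
    return (bits & mask) == mask


def is_exactly(bits, mask):
    """True if the flags in mask are ON and every other flag is OFF."""
    return (bits ^ mask) == 0


REQUIRED = mask_for([3, 10, 11, 13])
print(hex(REQUIRED))

# Hypothetical stored values for three rows.
rows = {
    "exact match": REQUIRED,
    "required plus bit 0": REQUIRED | 1,
    "missing bit 3": REQUIRED & ~(1 << 3),
}
for name, bits in rows.items():
    print(f"{name:20} has_all={has_all(bits, REQUIRED)} is_exactly={is_exactly(bits, REQUIRED)}")
```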
I want to show a user only the content he has not viewed yet.
I considered storing a string containing the ids of the items a user has viewed, separated by ',', but I would not know the possible length of the string.
The alternative I could find was to store it like a log, in a table like this:
user_id | item_id
1 | 1
2 | 2
1 | 2
Which approach is better for around ten thousand users and thousands of items?
A table of pairs like that would be at most about 10M rows (10K users x 1K items). That is "medium sized" as tables go.
Have
PRIMARY KEY(user_id, item_id),
INDEX(item_id, user_id)
And, if you are not going past 10K and 1K, consider using SMALLINT UNSIGNED (up to 64K in 2 bytes). Or, to be more conservative, MEDIUMINT UNSIGNED (up to 16M in 3 bytes).
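For the original goal of showing only unviewed content, the pair table can be queried with an anti-join. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The items table, the table name viewed, and the helper function are assumptions for illustration; the NOT EXISTS query itself is standard SQL that MySQL accepts as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (item_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE viewed (
    user_id INTEGER NOT NULL,
    item_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, item_id)
);
CREATE INDEX viewed_item_user ON viewed (item_id, user_id);
INSERT INTO items VALUES (1, 'a'), (2, 'b'), (3, 'c');
-- Same sample pairs as in the question.
INSERT INTO viewed VALUES (1, 1), (2, 2), (1, 2);
""")


def unviewed_items(user_id):
    """Return the ids of items that have no (user_id, item_id) row in viewed."""
    rows = conn.execute(
        """
        SELECT i.item_id
        FROM items AS i
        WHERE NOT EXISTS (
            SELECT 1 FROM viewed AS v
            WHERE v.user_id = ? AND v.item_id = i.item_id
        )
        ORDER BY i.item_id
        """,
        (user_id,),
    ).fetchall()
    return [r[0] for r in rows]


for uid in (1, 2, 3):
    print(uid, unviewed_items(uid))
```

The lookup inside NOT EXISTS is by (user_id, item_id), which is exactly the primary key, so each probe is a single index lookup.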
I am working on an e-shop which sells products only via loans. I display 10 products per page in any category, and each product has 3 different price tags for 3 different loan types. Everything went pretty well during testing and query execution time was fine, but today, when I transferred the changes to the production server, the site "collapsed" in about 2 minutes. The query that selects the loan types sometimes hangs for ~10 seconds, and this happens so frequently that the server can't keep up and the site is extremely slow. The table that stores the data has approximately 2 million records, and each select looks like this:
SELECT *
FROM products_loans
WHERE KOD IN("X17/Q30-10", "X17/12", "X17/5-24")
AND 369.27 BETWEEN CENA_OD AND CENA_DO;
There are 3 loan types, and the price needs to be in the range between CENA_OD and CENA_DO, thus 3 rows are returned.
But since I need to display 10 products per page, I need to run it through a modified select using OR, since I didn't find any other solution to this. I have asked about it here, but got no answer. As mentioned in the referenced post, this has to be done separately since there is no column that could be used in a join (except of course price and code, but that ended very, very badly). Here is the SHOW CREATE TABLE output; KOD and CENA_OD/CENA_DO are indexed via INDEX.
CREATE TABLE `products_loans` (
`KOEF_ID` bigint(20) NOT NULL,
`KOD` varchar(30) NOT NULL,
`AKONTACIA` int(11) NOT NULL,
`POCET_SPLATOK` int(11) NOT NULL,
`koeficient` decimal(10,2) NOT NULL default '0.00',
`CENA_OD` decimal(10,2) default NULL,
`CENA_DO` decimal(10,2) default NULL,
`PREDAJNA_CENA` decimal(10,2) default NULL,
`AKONTACIA_SUMA` decimal(10,2) default NULL,
`TYP_VYHODY` varchar(4) default NULL,
`stage` smallint(6) NOT NULL default '1',
PRIMARY KEY (`KOEF_ID`),
KEY `CENA_OD` (`CENA_OD`),
KEY `CENA_DO` (`CENA_DO`),
KEY `KOD` (`KOD`),
KEY `stage` (`stage`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Also, selecting all loan types and filtering them later in PHP doesn't work well, since each type has over 50k records and the select takes too much time as well...
Any ideas about improving the speed are appreciated.
Edit:
Here is the explain
+----+-------------+----------------+-------+---------------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+-------+---------------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | products_loans | range | CENA_OD,CENA_DO,KOD | KOD | 92 | NULL | 190158 | Using where |
+----+-------------+----------------+-------+---------------------+------+---------+------+--------+-------------+
I have tried the combined index and it improved the performance on the test server from 0.44 sec to 0.06 sec. I can't access the production server from home though, so I will have to try it tomorrow.
Your issue is that you are searching for intervals which contain a point (rather than the more normal query of all points in an interval). These queries do not work well with the standard B-tree index, so instead you need to use an R-Tree index. Unfortunately MySQL doesn't allow you to select an R-Tree index on a column, but you can get the desired index by changing your column type to GEOMETRY and using the geometric functions to check if the interval contains the point.
See Quassnoi's article Adjacency list vs. nested sets: MySQL where he explains this in more detail. The use case is different, but the techniques involved are the same. Here's an extract from the relevant part of the article:
There is also a certain class of tasks that require searching for all ranges containing a known value:
Searching for an IP address in the IP range ban list
Searching for a given date within a date range
and several others. These tasks can be improved by using R-Tree capabilities of MySQL.
Try to refactor your query like this:
SELECT * FROM products_loans
WHERE KOD IN ('X17/Q30-10', 'X17/12', 'X17/5-24')
AND CENA_OD <= 369.27
AND CENA_DO >= 369.27;
(MySQL is not very smart when choosing indexes) and check the performance.
The next thing to try is adding a combined key: (KOD, CENA_OD, CENA_DO).
And the next major thing to try is refactoring your database to separate products from prices. This should really help.
PS: you can also migrate to PostgreSQL, which is smarter than MySQL at choosing the right indexes.
MySQL generally uses only one index per table in a query. If you always look up entries by these 3 columns, then depending on the actual data (range) in the columns, one of the following could very well add a serious amount of performance:
ALTER TABLE products_loans ADD INDEX(KOD, CENA_OD, CENA_DO);
ALTER TABLE products_loans ADD INDEX(CENA_OD, CENA_DO, KOD);
Notice that the order of the columns matters! If that doesn't improve performance, give us the EXPLAIN output of the query.