Is it possible to use the LOCATE() function on a TEXT column, or is there any alternative to it for TEXT fields?
The thing is, we have LARGE VARCHARs (65 KB) that we use to track subscriptions: we append subscription_ids into one long string in the VARCHAR.
This string can hold up to 5000 subscription_ids in one row, and we use LOCATE to see if a user is subscribed,
i.e. whether a subscription_id is found inside the VARCHAR string.
The problem is that we plan to have more than 500,000 rows like this, and it seems this can have a big impact on performance.
So we decided to move to TEXT instead, but now there is a problem with indexing and with how to LOCATE sub-text inside a TEXT column.
Billions of subscriptions? Please show an abbreviated example of a TEXT value. Have you tried FIND_IN_SET()?
Is one TEXT field showing up to 5000 subscriptions for one user? Or is it the other way -- up to 5K users for one magazine?
In any case, it would be better to have a table with 2 columns:
CREATE TABLE user_sub (
user_id INT UNSIGNED NOT NULL,
sub_id INT UNSIGNED NOT NULL,
PRIMARY KEY(user_id, sub_id),
INDEX(sub_id, user_id)
) ENGINE=InnoDB;
The two composite indexes let you very efficiently find the 5K subscriptions for a user or the 500K users for a sub.
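For illustration, these are the kinds of lookups that design enables (123 and 456 are placeholder ids, not values from the question):
-- Is user 123 subscribed to sub 456? (point lookup on the PRIMARY KEY)
SELECT 1 FROM user_sub WHERE user_id = 123 AND sub_id = 456;
-- All subscriptions for user 123 (uses the PRIMARY KEY)
SELECT sub_id FROM user_sub WHERE user_id = 123;
-- All users for sub 456 (uses INDEX(sub_id, user_id))
SELECT user_id FROM user_sub WHERE sub_id = 456;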
Shrink the less-500K id to MEDIUMINT UNSIGNED (16M limit instead of 4 billion; 3 bytes each instead of 4).
Shrink the less-5K id to SMALLINT UNSIGNED (64K limit instead of 4B; 2 bytes each instead of 4).
If you desire, you can use GROUP_CONCAT() to reconstruct the comma-separated list. Be sure to change group_concat_max_len to a suitably large number (the default is only 1024 bytes).
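A minimal sketch of that reconstruction, assuming the user_sub table above (123 is a placeholder user_id):
SET SESSION group_concat_max_len = 1000000;
SELECT GROUP_CONCAT(sub_id ORDER BY sub_id) AS sub_list
FROM user_sub
WHERE user_id = 123;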
Related
I am coding an app where users can read and write comments.
When the number of comments exceeds a certain limit, a "Load more comments" button is displayed and the offset of loaded comments is stored.
I update this offset whenever the user writes or deletes their own comments, so that no duplicates are loaded and no comments are left out.
But I forgot about the case where the database changes because other users add or delete comments.
The offset method therefore seems unreliable. Is there any way to solve this problem, maybe by saving the id of the last comment and using that as some kind of "offset"?
The WHERE clause in my query is like:
WHERE x = ? ORDER BY y = ? (neither x nor y is the ID, y is not unique)
You can do this using a timestamp column or possibly even the primary key itself depending on how you've set that up. Here is an example of using the primary key if it is an AUTO_INCREMENT integer.
CREATE TABLE `comments` (
`comment_id` int NOT NULL AUTO_INCREMENT,
`thread_id` int NOT NULL,
`comment` text,
PRIMARY KEY (`comment_id`),
FOREIGN KEY (`thread_id`) REFERENCES `threads` (`thread_id`)
);
In that table definition, you have an AUTO_INCREMENT int primary key. You also have a thread_id that is a foreign key to a threads table. Finally, you have the comment itself in comment.
When you first load the page for some thread you'd do the following:
SELECT comment_id, comment
FROM comments
WHERE thread_id = 123
ORDER BY comment_id
LIMIT 10;
This means you'd select 10 comments ordered by their int PK for your given thread (123 in this case). Now, when you display this, you need to somehow save the largest comment_id. Say in this case it is 10. Then, have the "Load more comments" button pass this largest comment_id to the server when it is clicked. The server will now execute the following:
SELECT comment_id, comment
FROM comments
WHERE thread_id = 123 AND comment_id > 10 -- 10 is the value you passed in as your largest previously loaded comment_id
ORDER BY comment_id
LIMIT 10;
Now you have a set of ten more comments where you know that none of the comments can possibly be duplicates of your previously displayed comments, and you will never skip over any comments because they're always ordered by ascending int keys.
If you now look back to the query you used for loading the initial set of comments, you'll see that it's pretty much the same as the one for loading additional comments, so you can actually use the same query for both. When you load the comments initially just pass 0 as the largest comment_id.
You can do the same thing using a timestamp column as well if you don't have a primary key that works like this, and you don't want to change it to work like this either. You'd simply order the results by the timestamp column, and then pass the timestamp of the last loaded comment to your "Load more comments" function. In order to avoid skipping comments posted at the same time, you can use a timestamp with six digits of fractional second precision. Create the timestamp column as TIMESTAMP(6). Your timestamps will then be recorded as things like 2014-09-08 17:51:04.123456, where the last six digits after the second are the fraction of a second. With this much precision, it's extremely unlikely that you have comments recorded exactly at the same time.
Of course you could still have two or more comments recorded at the same exact timestamp, but it's unlikely. This makes the AUTO_INCREMENT int a slightly better solution. One final option is to use a time-based UUID because they include a mechanism to ensure uniqueness by slightly adjusting the value when things occur at the same microsecond. They are also still ordered by time. The problem with this is that MySQL does not have very good support for UUIDs.
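A rough sketch of that timestamp variant, assuming a created_at TIMESTAMP(6) column is added to the comments table above (the column name and the literal timestamp are illustrative):
ALTER TABLE comments
  ADD COLUMN created_at TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6);

SELECT comment_id, comment, created_at
FROM comments
WHERE thread_id = 123
  AND created_at > '2014-09-08 17:51:04.123456'  -- timestamp of the last comment already shown
ORDER BY created_at
LIMIT 10;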
I had a table with 3 columns and 3,600K rows, using MySQL as a key-value store.
The first column, id, was VARCHAR(8) and set as the primary key. The 2nd and 3rd columns were MEDIUMTEXT. When calling SELECT * FROM table WHERE id=00000, MySQL took anywhere from 54 seconds to 3 minutes.
For testing I created a table with columns VARCHAR(8)-VARCHAR(5)-VARCHAR(5), with data randomly generated from numpy.random.randint. A SELECT took 3 seconds without a primary key. With the same random data but columns VARCHAR(8)-MEDIUMTEXT-MEDIUMTEXT, the SELECT took 15 seconds without a primary key. (Note: in the second test, the 2nd and 3rd columns actually contained very short text like '65535', but were created as MEDIUMTEXT.)
My question is: how can I achieve similar performance on my real data? (or, is it impossible?)
If you use
SELECT * FROM `table` WHERE id=00000
instead of
SELECT * FROM `table` WHERE id='00000'
you are looking for all strings that are equal to the integer 0, so MySQL will have to check all rows, because '0', '0000' and even ' 0' will all be cast to the integer 0. So your primary key on id will not help and you will end up with a slow full table scan. Even if you don't store values that way, MySQL doesn't know that.
The best option is, as all comments and answers pointed out, to change the datatype to int:
alter table `table` modify id int;
This will only work if your ids, cast as integers, are unique (so you don't have e.g. '0' and '00' in your table).
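One way to check that before altering (a sketch, using `table` and id as in the question):
SELECT CAST(id AS UNSIGNED) AS id_int, COUNT(*) AS cnt
FROM `table`
GROUP BY CAST(id AS UNSIGNED)
HAVING COUNT(*) > 1;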
If you have any foreign keys that reference id, you have to drop them first and, before recreating them, change the datatype in the other columns too.
If you store your values in a known format (e.g. no leading zeros, or padded with 0s up to a length of 8), the second best option is to use that exact format in your query and include the quotes so the value is not cast to an integer. If you e.g. always pad to 8 digits, use
SELECT * FROM `table` WHERE id='00000000';
If you never pad with zeros, still add the quotes:
SELECT * FROM `table` WHERE id='0';
With both options, MySQL can use your primary key and you will get your result in milliseconds.
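You can verify the difference with EXPLAIN; the exact output varies by version, but roughly:
EXPLAIN SELECT * FROM `table` WHERE id = 00000;      -- typically type: ALL (full table scan)
EXPLAIN SELECT * FROM `table` WHERE id = '00000000'; -- typically type: const, using PRIMARY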
If your id column contains only numbers, define it as INT, because INT will give you better performance (it is faster).
Make the column in your table (the one defined as the key) an integer and retry. First check performance by running a test within your DB (Workbench or the simple command line); you should get a better result.
Then, and only if needed (I doubt it, though), modify your Python to convert from integer to string (and/or vice versa) when referencing the key column.
I have a table storing phone numbers with 800M rows.
Field            Type                    Null  Key  Extra
region_code_id   smallint(4) unsigned    YES
local_number     mediumint(7) unsigned   YES
region_id        smallint(4) unsigned    YES
operator_id      smallint(4) unsigned    YES
id               int(10) unsigned        NO    PRI  auto_increment
I need to find numbers.id where region_code_id = 119 and local_number = 1234567:
select * from numbers where numbers.region_code_id = 119 and numbers.local_number = 1234567;
This query takes over 600 seconds to execute.
How can I improve it?
UPDATE
Thanks for the answers. I understand I need an index on these columns; I will try this as soon as I get a server with more SSD space (right now I only have 1 GB of SSD free). How can I find out how much space the index will occupy?
Consider adding an INDEX on the columns you use in the WHERE clause.
Start with:
ALTER TABLE `numbers`
ADD INDEX `region_code_id_local_number`
(`region_code_id`, `local_number`);
Note: it can take some time for the index to build.
Before and after the change, execute an explain plan to compare:
EXPLAIN EXTENDED select * from numbers where numbers.region_code_id = 119 and numbers.local_number = 1234567;
References:
How MySQL uses indexes
For this query:
select *
from numbers
where numbers.region_code_id = 119 and
numbers.local_number = 1234567;
You want an index on numbers(region_code_id, local_number) or numbers(local_number, region_code_id). The order of the columns doesn't matter because the conditions are equality for both columns.
create index idx_numbers_region_local on numbers(region_code_id, local_number);
I agree that INDEX(region_code_id, local_number) (in either order) is mandatory for this problem, but I am sticking my nose in to carry it a step further. Isn't that pair "unique"? Or do you have duplicate numbers in the table? If it is unique, then get rid of id and make that pair PRIMARY KEY(region_code_id, local_number). The table will possibly be smaller after the change.
Back to your question of "how big". How big is the table now? Perhaps 40GB? A secondary index (as originally proposed) would probably add about 20GB. And you would need 20-60GB of free disk space to perform the ALTER. This depends on whether adding the index can be done "inplace" in that version.
Changing the PK (as I suggest) would result in a little less than 40GB for the table. It will take 40GB of free space to perform the ALTER.
In general (and pessimistically), plan on an ALTER needing both the original table and the new table sitting on disk at one time. That includes full copies of the data and index(es).
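To get a rough idea of the current table and index sizes (InnoDB's numbers in information_schema are estimates):
SELECT table_name,
       ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'numbers';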
(A side question: Are you sure local_number is limited to 7 digits everywhere?)
Another approach to the question... For calculating the size of a table or index in InnoDB, add up the datatype sizes (3 bytes for MEDIUMINT, some average for VARCHAR, etc). Then multiply by the number of rows. Then multiply by 4; this will give you the approximate disk space needed. (Usually 2-3 is sufficient for the last multiplier.)
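For example, applying that to the table in the question (ignoring row overhead): 2 + 3 + 2 + 2 + 4 = 13 bytes per row; 13 bytes times 800M rows is roughly 10.4 GB; times 4 is roughly 42 GB, which is in the same ballpark as the ~40GB estimate above.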
When changing the PK, do it in one step:
ALTER TABLE foo
DROP PRIMARY KEY,
ADD PRIMARY KEY(region_code_id, local_number);
Changing the PK cannot be done "inplace".
Edit (mostly for other readers)
#berap points out that id is needed for other purposes. Hence, dropping id and switching the PK is not an option.
However, this is sometimes an option (perhaps not in this case):
ALTER TABLE foo
DROP PRIMARY KEY,
ADD PRIMARY KEY(region_code_id, local_number),
ADD INDEX(id);
Notes:
The id AUTO_INCREMENT will continue to work even with just the INDEX.
The SELECT in question will be more efficient because it is the PK.
SELECT .. WHERE id = ... will be less efficient because id is a secondary key.
The table will be the same size either way; the secondary key would also be the same size either way -- because every secondary key contains the PK columns, too. (This note is InnoDB-specific.)
I am looking into storing a "large" amount of data and not sure what the best solution is, so any help would be most appreciated. The structure of the data is
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data, e.g. rows (1,2,3) and columns (5,10,1000)
2) Needs to be scalable: columns will be added every month, but the number of rows is fixed.
My understanding is that often its best to store as:
id| row_number| column_number| value
but this would create 4,950,000,000 entries? I have tried storing it as rows and columns as-is in MySQL, but it is very slow at subsetting the data.
Thanks!
Build the giant matrix table
As N.B. said in comments, there's no cleaner way than using one mysql row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on (colNum, rowNum), or only a non-unique INDEX on colNum, if you often access the matrix by column (because the PRIMARY KEY is on (`rowNum`, `colNum`), note the order, so it will be inefficient when it comes to selecting a whole column).
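For example, fetching the kind of subset mentioned in the question (rows 1-3 and columns 5, 10, 1000) is a straightforward primary-key-driven lookup; a sketch:
SELECT rowNum, colNum, `value`
FROM matrix
WHERE rowNum IN (1, 2, 3)
  AND colNum IN (5, 10, 1000);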
You'll probably need more than 200 GB to store the 450,000 x 11,000 values, including indexes.
Inserting data may be slow (because there are two indexes to rebuild, and 450,000 entries [1 per row] to add when adding a column).
Edits should be very fast, as the index wouldn't change and value is of fixed size.
If you access same subsets (rows + cols) often, maybe you can use PARTITIONing of the table if you need something "faster" than what mysql provides by default.
After years of experience (2021 edit)
Re-reading myself years later, I would say the "cache" ideas are totally dumb, as it's MySQL's role to handle this sort of caching (it should actually already be in the InnoDB buffer pool).
A better idea would be, if the matrix is full of zeroes, not to store the zero values and to treat 0 as the "default" in the client code. That way, you can lighten the storage (if needed: MySQL should actually be pretty fast at responding to queries even on such a 5-billion-row table).
Another thing, if storage is an issue, is to use a single ID to identify both row and column: you say the number of rows is fixed (450,000), so you may replace (row, col) with a single value (id = 450000*col + row), though it needs BIGINT, so it is maybe not better than 2 columns.
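A sketch of that single-id variant, using a hypothetical matrix_flat table (the table name and layout are illustrative, not part of the original design):
-- id = 450000 * colNum + rowNum, assuming rows are numbered 0..449999
SELECT `value`
FROM matrix_flat              -- hypothetical table: (id BIGINT PRIMARY KEY, value INT)
WHERE id = 450000 * 1000 + 3; -- colNum = 1000, rowNum = 3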
Don't do what's described below: don't reinvent the MySQL cache
Add a cache (actually no)
Since you said you add values and don't seem to edit matrix values, a cache can speed up frequently requested rows/columns.
If you often read the same rows/columns, you can cache their result in another table (same structure to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be empty at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column/row:
SELECT the row/column from that caching table
If the SELECT returns an empty or partial result (no data returned, or not enough data to match the expected row/column numbers), then do the SELECT on the matrix table
Save the result of that SELECT from the matrix table into cachedPartialMatrix
If the caching matrix gets too big, clear it (the bigger the cached matrix is, the slower it becomes)
Smarter cache (actually, no)
You can make it even smarter with a third table to count how many times a selection is done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When you do a request on your matrix (one may use TRIGGERS) for the Nth-row or Kth-column, increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove some old values from the cache (take care: if you remove the Nth column from cache entries because its `lastDate` is old enough, you may break the cache for some other entries), or to regularly clear the cache and only leave the recently selected values.
Funny thing I've found about MySQL: it has a 3-byte numeric type, MEDIUMINT, whose range is from -8388608 to 8388607. It seems strange to me. Sizes of numeric types are chosen for better performance, so I thought data should be aligned to a machine word or double word. And if we need restriction rules for numeric ranges, they should be external to the datatype. For example:
CREATE TABLE ... (
id INT RANGE(0, 500) PRIMARY KEY
)
So, does anyone know why 3 bytes? Is there any reason?
The reason is so that if you have a number that falls within a 3 byte range, you don't waste space by storing it using 4 bytes.
When you have twenty billion rows, it matters.
The alignment issue you mentioned applies mostly to data in RAM. Nothing forces MySQL to use 3 bytes to store that type as it processes it.
This might have a small advantage in using disk cache more efficiently though.
We frequently use TINYINT, SMALLINT, and MEDIUMINT for very significant space savings. Keep in mind, it makes your indexes that much smaller.
This effect is magnified when you have really small join tables, like:
id1 smallint unsigned not null,
id2 mediumint unsigned not null,
primary key (id1, id2)
And then you have hundreds of millions or billions of records.
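As a rough back-of-the-envelope example (ignoring row and page overhead): with 1 billion rows, the (SMALLINT + MEDIUMINT) primary key above takes 2 + 3 = 5 bytes per row versus 4 + 4 = 8 bytes for two INTs, saving roughly 3 GB in the clustered index, and roughly the same again in every secondary index, since in InnoDB secondary indexes also carry the PK columns.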