How to store a UUID as a number in MySQL?

Based on the answer to the question UUID performance in MySQL, the person who answered suggests storing the UUID as a number and not as a string. I'm not sure how that can be done. Could anyone suggest something? How would my Ruby code deal with that?

If I understand correctly, you're using UUIDs in your primary key column? People will say that a regular (integer) primary key will be faster, but there's another way using MySQL's dark side. In fact, MySQL is faster with binary than with anything else when indexes are required.
Since a UUID is 128 bits and is written as hexadecimal, it's very easy to compact and store the UUID.
First, in your programming language, remove the dashes:
From 110E8400-E29B-11D4-A716-446655440000 to 110E8400E29B11D4A716446655440000.
Now it's 32 chars (like an MD5 hash, which this also works with).
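If you'd rather strip the dashes on the SQL side, MySQL's REPLACE() function works too (a minimal sketch):
SELECT REPLACE('110E8400-E29B-11D4-A716-446655440000', '-', '');
-- returns '110E8400E29B11D4A716446655440000'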
Since a single BINARY byte in MySQL is 8 bits, BINARY(16) matches the size of a UUID (8 * 16 = 128).
You can insert using:
INSERT INTO Table (FieldBin) VALUES (UNHEX('110E8400E29B11D4A716446655440000'))
and query using:
SELECT HEX(FieldBin) AS FieldBin FROM Table
Now, in your programming language, re-insert the dashes at positions 9, 14, 19 and 24 to reconstruct your original UUID. If the dash positions can vary, you could store that info in a second field.
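As an aside, if you are on MySQL 8.0 or newer, the built-in UUID_TO_BIN() and BIN_TO_UUID() functions handle both conversions, dashes included, so you can skip the manual string handling:
INSERT INTO Table (FieldBin) VALUES (UUID_TO_BIN('110e8400-e29b-11d4-a716-446655440000'));
SELECT BIN_TO_UUID(FieldBin) AS FieldBin FROM Table;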
Full example:
CREATE TABLE `test_table` (
`field_binary` BINARY( 16 ) NULL ,
PRIMARY KEY ( `field_binary` )
) ENGINE = INNODB ;
INSERT INTO `test_table` (
`field_binary`
)
VALUES (
UNHEX( '110E8400E29B11D4A716446655440000' )
);
SELECT HEX(field_binary) AS field_binary FROM `test_table`
If you want to use this technique with any hex string, always use length / 2 for the field length. So for SHA-512 the field would be BINARY(64), since its hex encoding is 128 characters long.
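For instance, a sketch for SHA-512 (the table and column names here are made up; SHA2() is MySQL's built-in hash function):
CREATE TABLE `hash_table` (
`hash_bin` BINARY(64) NOT NULL, -- 128 hex chars / 2
PRIMARY KEY (`hash_bin`)
) ENGINE = INNODB;
INSERT INTO `hash_table` (`hash_bin`) VALUES (UNHEX(SHA2('some long document text', 512)));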

I don't think it's a good idea to use a binary column.
Let's say that you want to query some value:
SELECT HEX(field_binary) AS field_binary FROM `test_table`
If we are returning several values, then we are calling the HEX function several times.
However, the main problem is this one:
SELECT * FROM `test_table`
where field_binary=UNHEX('110E8400E29B11D4A716446655440000')
And using a function inside the WHERE clause can prevent MySQL from using the index.
Also
SELECT * FROM `test_table`
where field_binary=x'skdsdfk5rtirfdcv##*#(&##$9'
could lead to many problems (an invalid hex literal like this is an error, not just an empty result).

Related

Primary key cannot hold more than 767 or 1000 characters

I'm trying to create a table where one of the columns will hold more than 5000 characters, and I don't want any value in this column to be repeated, so I made it the primary key to keep a value from being saved again when it already exists.
But the problem is that when I try to create this column as column_name VARCHAR(5500) PRIMARY KEY, it gives me this error:
Specified key was too long; max key length is 767 bytes
I searched a lot and found that the InnoDB engine accepts only 767 bytes as the maximum key length and MyISAM accepts 1000 bytes, but this doesn't help me because this column may hold more than 5000 characters.
What I'm looking for is a way to create a column in which no value can be repeated and which accepts many characters.
CREATE TABLE data_table (
date_time VARCHAR(100),
message VARCHAR(5500) PRIMARY KEY
) ENGINE = MYISAM CHARACTER SET latin1
You have hit a fundamental limitation. Sadly, no amount of negotiation or hacking will find you a way to make an index as long as you need. Therefore, a unique index is not a solution to your problem of preventing duplicate text strings.
Many people store a hash of long text fields along with the text.
SHA-256 is a decent choice for a hash. The issue with hashes is the chance of a hash collision. That is, it is possible that two different text strings will generate the exact same hash. With SHA-256 or larger hashes, that chance is very low indeed.
If you work with SHA-256, you need a column defined like this. (32 bytes is the same as 256 bits, of course.)
text_hash BINARY(32)
Then when you go to insert text you can do this.
INSERT INTO tbl (text, text_hash) VALUES (?, UNHEX(SHA2(?, 256)));
If you make text_hash a unique index, you'll have a way of preventing duplicates: the attempt throws an error. Something like this:
CREATE UNIQUE INDEX no_text_dups_please ON tbl(text_hash);
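For illustration, inserting the same text twice (the exact error message may vary by version):
INSERT INTO tbl (text, text_hash) VALUES ('hello', UNHEX(SHA2('hello', 256)));
INSERT INTO tbl (text, text_hash) VALUES ('hello', UNHEX(SHA2('hello', 256)));
-- second statement fails: ERROR 1062, Duplicate entry for key 'no_text_dups_please'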
Need: "[one] column will hold characters (more than 5000 characters) and I don't want any row for this column to be repeated"
PRIMARY KEY adds a UNIQUE constraint on the field(s) specified, but if you don't need the column as a primary key, use only UNIQUE. That said, I would not recommend a UNIQUE constraint on a large text column.
I would recommend checking the uniqueness of your data by computing and storing hashes of your texts.
Sure, a hash is one way. (I think the latest MariaDB has a technique for doing that by magic!) Here's another approach:
For many reasons, you should switch from MyISAM to InnoDB, but I will ignore that for this Q&A.
CREATE TABLE data_table (
date_time VARCHAR(100),
message VARCHAR(5500),
INDEX(message(100))
) CHARACTER SET utf8mb4 -- since you might get non-English text, including Emoji
(The "100" is a tradeoff between speed and space.)
But you will have to do an extra test:
SELECT 1 FROM data_table WHERE message = ?
If you get something back, you have a dup -- take action. Else do an INSERT.
Oops, I do need to insist on InnoDB -- at least if you could have conflicting connections inserting the same message:
BEGIN;
SELECT 1 FROM data_table WHERE message = ? FOR UPDATE;
-- if the SELECT returned a row, handle the dup and don't COMMIT
INSERT INTO data_table (date_time, message) VALUES (?, ?);
COMMIT;
You might want to hide all that inside a Stored Procedure.
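Something along these lines (a sketch only; the procedure name and the duplicate handling are placeholders):
DELIMITER //
CREATE PROCEDURE insert_message(IN p_date_time VARCHAR(100), IN p_message VARCHAR(5500))
BEGIN
DECLARE dup INT DEFAULT 0;
START TRANSACTION;
-- lock matching rows so a concurrent connection can't insert the same message
SELECT COUNT(*) INTO dup FROM data_table WHERE message = p_message FOR UPDATE;
IF dup = 0 THEN
INSERT INTO data_table (date_time, message) VALUES (p_date_time, p_message);
COMMIT;
ELSE
ROLLBACK; -- duplicate: discard
END IF;
END //
DELIMITER ;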

MySQL: indexing part of a column

I need to search a medium-sized MySQL table (about 15 million records).
My query searches for a value ending with another value, for example:
SELECT * FROM {tableName} WHERE {column} LIKE '%{value}'
{value} is always 7 characters length.
{column} is sometimes 8 characters length (otherwise it is 7).
Is there a way to improve performance on my search?
Clearly a regular index is not an option.
I could save the {column} values in reverse order in another column and index that column, but I'm looking to avoid this solution.
{value} is always 7 characters length
Your data is not normalized. Fixing this is the way to fix the problem; anything else is a hack. Having said that, I accept it is not always practical to repair damage done in the past by dummies.
However, the most appropriate hack depends on a whole lot of information you've not told us about:
how frequently you will run the query
what the format of the composite data is
but I'm looking to avoid this solution.
Why? It's a reasonable way to address the problem. The only downside is that you need to maintain the new attribute. Given that this data domain appears in different attributes across multiple tables (another normalization violation), it would make more sense to implement the index in a separate EAV relation; you just need to add triggers on the original table to keep it in sync with your existing code base. Every solution I can think of will likely require a similar fix.
Here's a simplified example (no multiple attributes) to get you started:
CREATE TABLE lookup (
table_name VARCHAR(18) NOT NULL,
record_id INT NOT NULL, /* or whatever */
suffix VARCHAR(7),
PRIMARY KEY (table_name, record_id),
INDEX (suffix, table_name, record_id)
);
CREATE TRIGGER insert_suffix AFTER INSERT ON yourtable
FOR EACH ROW
REPLACE INTO lookup (table_name, record_id, suffix)
VALUES ('yourtable', NEW.id
, RIGHT(NEW.attribute, 7)
);
CREATE TRIGGER update_suffix AFTER UPDATE ON yourtable
FOR EACH ROW
REPLACE INTO lookup (table_name, record_id, suffix)
VALUES ('yourtable', NEW.id
, RIGHT(NEW.attribute, 7)
);
CREATE TRIGGER delete_suffix AFTER DELETE ON yourtable
FOR EACH ROW
DELETE FROM lookup WHERE table_name='yourtable' AND record_id=OLD.id
;
If you have a set number of options for the first character, then you can use IN. For instance:
where column in ('{value}', '0{value}', '1{value}', . . . )
This allows MySQL to use an index on the column.
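For example, if the optional extra leading character can only be a digit (an assumption just for illustration):
SELECT * FROM {tableName}
WHERE {column} IN ('1234567',
'01234567', '11234567', '21234567', '31234567', '41234567',
'51234567', '61234567', '71234567', '81234567', '91234567');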
Unfortunately, with a wildcard at the beginning of the pattern, it is hard to use an index. Is it possible to store the first character in another column?

MySQL performance issue on ~3 million rows containing MEDIUMTEXT?

I had a table with 3 columns and 3,600K rows, using MySQL as a key-value store.
The first column, id, was VARCHAR(8) and set as the primary key. The 2nd and 3rd columns were MEDIUMTEXT. Calling SELECT * FROM table WHERE id=00000 took MySQL something like 54 seconds to 3 minutes.
For testing, I created a table of VARCHAR(8)-VARCHAR(5)-VARCHAR(5) with data casually generated from numpy.random.randint. SELECT took 3 seconds without a primary key. With the same random data as VARCHAR(8)-MEDIUMTEXT-MEDIUMTEXT, the SELECT took 15 seconds without a primary key. (Note: in the second test, the 2nd and 3rd columns actually contained very short text like '65535', but were created as MEDIUMTEXT.)
My question is: how can I achieve similar performance on my real data? (or, is it impossible?)
If you use
SELECT * FROM `table` WHERE id=00000
instead of
SELECT * FROM `table` WHERE id='00000'
you are looking for all strings that are equal to the integer 0, so MySQL will have to check all rows, because '0', '0000' and even ' 0' are all cast to the integer 0. So your primary key on id will not help, and you will end up with a slow full table scan. Even if you don't store values that way, MySQL doesn't know that.
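You can see this casting behaviour directly:
SELECT '0' = 0, '0000' = 0, ' 0' = 0;
-- all three comparisons return 1 (true)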
The best option is, as all comments and answers pointed out, to change the datatype to int:
alter table `table` modify id int;
This will only work if your ids are unique once cast to integers (so you don't have e.g. '0' and '00' in your table).
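A quick way to check for such collisions beforehand (a sketch, reusing the table name from the question):
SELECT CAST(id AS UNSIGNED) AS id_int, COUNT(*) AS n
FROM `table`
GROUP BY id_int
HAVING n > 1;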
If you have any foreign keys referencing id, you have to drop them first and, before recreating them, change the datatype of the referencing columns too.
If you store your values in a known format (e.g. no leading zeros, or zero-filled up to a length of 8), the second-best option is to use exactly that format in your query, and keep the quotes so the value is not cast to an integer. If you e.g. always zero-fill to 8 digits, use
SELECT * FROM `table` WHERE id='00000000';
If you never add any zeros, still add the quotes:
SELECT * FROM `table` WHERE id='0';
With both options, MySQL can use your primary key and you will get your result in milliseconds.
If your id column contains only numbers, define it as INT, because INT will give you better performance (it is faster).
Make the key column in your table an integer and retry. Check the performance first by running a test within your DB (Workbench or the plain command line); you should get a better result.
Then, and only if needed (I doubt it, though), modify your Python code to convert between integer and string when referencing the key column.

Design of a MySQL database for a large amount of large matrix data

I am looking into storing a "large" amount of data and am not sure what the best solution is, so any help would be most appreciated. The structure of the data is:
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data, e.g. rows (1, 2, 3) and columns (5, 10, 1000)
2) Needs to be scalable: I will be adding columns every month, but the number of rows is fixed.
My understanding is that it's often best to store this as:
id| row_number| column_number| value
but this would create 4,950,000,000 entries? I have tried storing it as plain rows and columns in MySQL, but it is very slow at subsetting the data.
Thanks!
Build the giant matrix table
As N.B. said in the comments, there's no cleaner way than using one MySQL row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on (colNum, rowNum), or only a non-unique INDEX on colNum if you often access the matrix by column (because the PRIMARY KEY is on (`rowNum`, `colNum`), note the order, so it will be inefficient when it comes to selecting a whole column).
You'll probably need more than 200 GB to store the 450,000 x 11,000 values, including indexes.
Inserting data may be slow (because there are two indexes to rebuild, and 450,000 entries [1 per row] to add whenever a column is added).
Edits should be very fast, as the index wouldn't change and the value is of fixed size.
If you access the same subsets (rows + cols) often, maybe you can use PARTITIONing of the table if you need something "faster" than what MySQL provides by default.
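For instance, a hypothetical hash partitioning by column number (the partition count of 16 is arbitrary; this is allowed because colNum is part of the primary key):
ALTER TABLE `matrix` PARTITION BY HASH(colNum) PARTITIONS 16;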
After years of experience (2021 edit)
Re-reading myself years later, I would say the "cache" ideas are totally dumb, as it's MySQL's role to handle this sort of caching (it should actually already be in the InnoDB buffer pool).
A better idea, if the matrix is full of zeroes, is to not store the zero values and treat 0 as the "default" in the client code. That way, you may lighten the storage (if needed: MySQL should actually be pretty fast responding to queries even on such a 5-billion-row table).
Another thing, if storage is an issue, is to use a single ID to identify both row and col: you say the number of rows is fixed (450,000), so you may replace (row, col) with a single value (id = 450000 * col + row), though it needs a BIGINT, so it's maybe not better than 2 columns.
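A sketch of that encoding (the table name is made up; 450,000 x 11,000 is about 4.95 billion, which exceeds the unsigned INT range, hence BIGINT):
CREATE TABLE `matrix_flat` (
`id` BIGINT UNSIGNED NOT NULL, -- id = 450000 * col + row
`value` INT NOT NULL,
PRIMARY KEY (`id`)
) ENGINE = INNODB;
SELECT value FROM matrix_flat WHERE id = 450000 * 1000 + 3; -- col 1000, row 3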
Don't do what follows below: don't reinvent MySQL's cache.
Add a cache (actually no)
Since you said you add values and don't seem to edit existing matrix values, a cache can speed up frequently requested rows/columns.
If you often read the same rows/columns, you can cache their result in another table (same structure to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be empty at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column / row:
SELECT the row/column from that caching table
If the SELECT returns an empty or partial result (no data returned, or not enough data to match the expected row/column count), then do the SELECT on the matrix table
Save the result of that SELECT into cachedPartialMatrix
If the caching matrix gets too big, clear it (the bigger the cached matrix is, the slower it becomes)
Smarter cache (actually, no)
You can make it even smarter with a third table to count how many times a selection is done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When you do a request on your matrix for the Nth row or Kth column (one may use TRIGGERS for this), increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove some old values from the cache (take care: if you remove the Nth column from cache entries because its `lastDate` is old enough, you may break other entries' cache) or to regularly clear the cache and keep only the recently selected values.

MySQL processing large bit fields

My goal is to store a 256-bit object (this is actually a bitfield) in MySQL and be able to do some bitwise operations and comparisons on it.
Ideally, I would use BIT(256) for the type but MySQL limits bitfields to 64 bits.
My proposed solution was to use the binary string type BINARY(32) for this field; I can store my objects, but there is no way I can operate on them.
My table structure is
CREATE TABLE `test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`bin` binary(32) NOT NULL,
PRIMARY KEY (`id`)
)
but then the query
SELECT
bit_count( bin ) AS fullBin,
bit_count( substring( bin, 1, 4 ) ) AS partialBin
FROM test
always returns 0, as it converts neither my binary string nor the substring into a number for bit_count to operate on.
I am looking for a way to extract parts of the binary string as BIGINT or some other type that I can operate on (I only need bitwise AND and bit_count() operations).
Performance wise, I would prefer a solution that does not involve creating strings and parsing them.
I would also accept any proposal for storing my data as another type, but the obvious solution of splitting my bin column into 4 columns of type BIT(64) is not an option, as I must preserve the table naming structure.
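For what it's worth, a hedged sketch of two possible directions: on MySQL 8.0+, the bit functions (&, |, ^, BIT_COUNT()) accept binary string arguments of equal length, so the BINARY(32) column can be masked and counted directly; on older versions, 8-byte pieces can be extracted as BIGINT via HEX() and CONV(). Untested, and the mask value is made up:
-- MySQL 8.0+: bit operations on equal-length binary strings
SELECT
BIT_COUNT(bin) AS fullBin,
BIT_COUNT(bin & UNHEX(CONCAT('FFFFFFFF', REPEAT('00', 28)))) AS partialBin
FROM test;
-- Older versions: extract an 8-byte piece as a BIGINT (SUBSTRING is 1-indexed)
SELECT CAST(CONV(HEX(SUBSTRING(bin, 1, 8)), 16, 10) AS UNSIGNED) AS firstWord
FROM test;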