This question is tough, but what I am looking at doing is querying binary data to check occurrences. I can't use full-text search and I'm not sure it would help anyhow, but say I have a string in the database like 00100 (but 256 characters long) and a user searches the database for 00101. Is there any way to find all of the rows that have a 1 in the 3rd position? Also, is there a way to do this with multiple position lookups (e.g. a 1 in the 3rd and 5th positions)?
I ask because I am trying to take five pieces of data and put them in one row of the database rather than in five different rows. Each binary digit is a boolean "occurrence" of an object, so 1 or 0.
Update:
Schema
`media_id` int(9) unsigned NOT NULL,
`256_hash` text NOT NULL,
`sequence` int(11) unsigned NOT NULL
I should have included this earlier, but the actual hash strings really are 256 characters long. I'm assuming this is going to be a problem in the long run because it's not indexable.
Sample records
media_id palette_hash sequence
1 00000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000 1464423415
2 00000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000 1464423415
You can use the REGEXP comparison.
All rows with "1" in the third position:
# The pattern reads: "anything twice, then 1, then anything"
SELECT * FROM rows WHERE (column REGEXP '^.{2}1.*$')
All rows with "1" in the third and fifth position:
# The pattern reads: "anything twice, then 1, then anything once, then 1, then anything"
SELECT * FROM rows WHERE (column REGEXP '^.{2}1.1.*$')
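Since MySQL's REGEXP is not anchored by default and matches anywhere in the string, the trailing .*$ can be dropped. A compact sketch of both queries against the schema above, assuming the table is named media (a placeholder):
# "1" in the 3rd position
SELECT * FROM media WHERE `256_hash` REGEXP '^.{2}1';
# "1" in the 3rd and 5th positions
SELECT * FROM media WHERE `256_hash` REGEXP '^.{2}1.1';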
We have to store a "file ID" in a multi-million-row table. The format is a Brazilian state abbreviation (e.g. PA for Para, BA for Bahia, SP for Sao Paulo, RJ for Rio de Janeiro, and so on) plus a "scope" built from a two-digit year (e.g. 19 for 2019) and month, resulting in values like 'PA1908'.
As said before, the table has multi-million rows, and every month we have to compare its data with an external data source. If the external data source is more up to date than our table, we must replace entire STATE-YEAR-MONTH records, so the file ID exists just to be a parameter in the query's WHERE clause in order to select rows to delete.
In the first modeling version, I split the file ID into two columns: fileid_state as CHAR(2) with a hash index, and fileid_scope as SMALLINT. But I'm not sure this is the only way to achieve acceptable performance; maybe using just one column named file_id with a CHAR(6) datatype and a hash index could perform as well as the first version. Any suggestions on which of the two methods is best, or another way to store the file ID so rows can be selected for deletion as fast as possible?
Remember, it's kind of hard for me to benchmark the methods because we have almost 1 billion rows on limited hardware.
Q1: Datatype: First ask yourself what will be done with the string:
Do you ever need to look at just the 'state' part? The 'year' part? The 'month' part? If you answer "yes" to any of those, then you should probably store the parts in 2 or 3 columns: state CHAR(2) CHARACTER SET ascii, plus TINYINT UNSIGNED or SMALLINT UNSIGNED for the numeric part(s).
If not, then simply use CHAR(6) CHARACTER SET ascii. If needed, this can be INDEXed, either by itself or together with other column(s) in a "composite" index. Please provide the UPDATE and SELECT statements that may need this index; we will critique.
There is no "hash" indexing in InnoDB, only BTree.
"select rows for deleting as fast as possible" -- What percentage of the table will be deleted? If, for example, you will DELETE FROM tbl WHERE sym = 'PA1908', and it is only a small part of the table, then INDEX(sym) works optimally.
I say "ascii" so that you avoid the space/processing needed for utf8, etc.
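A minimal sketch of the single-column option, assuming a hypothetical table name file_records:
# Hypothetical table name; CHAR(6) ascii keeps the index key small
ALTER TABLE file_records
    MODIFY file_id CHAR(6) CHARACTER SET ascii NOT NULL,
    ADD INDEX idx_file_id (file_id);
# The monthly replacement delete can then hit the index directly
DELETE FROM file_records WHERE file_id = 'PA1908';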
Q2: "more up to date than our table, we must replace entire STATE-YEAR-MONTH records" -- Please elaborate on what happens here.
In MySQL 5.7, a table is defined as shown below:
CREATE TABLE `person` (
`person_id` bigint(20) NOT NULL AUTO_INCREMENT,
`name` varchar(64) DEFAULT NULL,
PRIMARY KEY (`person_id`),
KEY `ix_name` (`name`)
) ENGINE=InnoDB CHARSET=utf8
Then we prepared two records for testing; the values of the name field (VARCHAR type) are
123456789123456789
1
respectively.
Case 1
select * from person where name = 123456789123456789-1;
Note that we are using a number instead of a string inside the WHERE clause. The record with name 123456789123456789 was returned, and it seemed that the -1 at the end was ignored!
Furthermore, we added another record with name = 123456789123456788, and this time the above select returned two records: both 123456789123456789 and 123456789123456788.
The output looks so strange!
Case 2
select * from person where name = 123456789123456789-123456789123456788;
We get the record with name 1, and in this case it seems that the - acts as a minus operator.
Why is the behavior of - so different in the two cases?
I can't immediately tell you what the type of 123456789123456789-1 is, but for the comparison operation we're almost certainly falling through most of the more "normal" data type conversion rules for MySQL and ending up at:
In all other cases, the arguments are compared as floating-point (real) numbers.
Because one of the arguments for the comparison (name) is a string type and the other is numeric, nothing else matches. So both get converted to floats, and float types don't have many digits of precision. Certainly fewer than the 18 required to represent 123456789123456789 and 123456789123456788 as two different numbers.
Look here:
SELECT person_id, name, name + 0.0, 123456789123456789-1 + 0.0, name = 123456789123456789-1
FROM person
ORDER BY person_id;
Perhaps, before evaluating name = 123456789123456789-1, MySQL converts both name and 123456789123456789-1 to DOUBLE, as I showed in the select. So some digits are lost.
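A sketch that shows the precision loss directly (adding 0e0 forces a DOUBLE conversion), and the quoted comparison that stays exact:
# Returns 1: both 18-digit strings collapse to the same DOUBLE
SELECT '123456789123456789' + 0e0 = '123456789123456788' + 0e0;
# String-to-string comparison is exact and returns only the intended row
SELECT * FROM person WHERE name = '123456789123456788';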
I had a table with 3 columns and 3600K rows. Using MySQL as a key-value store.
The first column id was VARCHAR(8) and set as the primary key. The 2nd and 3rd columns were MEDIUMTEXT. When calling SELECT * FROM table WHERE id=00000, MySQL took anywhere from 54 sec to ~3 minutes.
For testing I created a table containing VARCHAR(8)-VARCHAR(5)-VARCHAR(5), with data randomly generated from numpy.random.randint. SELECT took 3 sec without a primary key. With the same random data but VARCHAR(8)-MEDIUMTEXT-MEDIUMTEXT, the SELECT took 15 sec without a primary key. (Note: in the second test, the 2nd and 3rd columns actually contained very short text like '65535', but were created as MEDIUMTEXT.)
My question is: how can I achieve similar performance on my real data? (or, is it impossible?)
If you use
SELECT * FROM `table` WHERE id=00000
instead of
SELECT * FROM `table` WHERE id='00000'
you are looking for all strings that are equal to the integer 0, so MySQL will have to check all rows, because '0', '0000' and even ' 0' will all be cast to the integer 0. So your primary key on id will not help and you will end up with a slow full table scan. Even if you don't store values that way, MySQL doesn't know that.
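A quick way to see this casting behavior; each comparison below evaluates to 1:
# All true: each string is converted to the number 0 before comparing
SELECT '0' = 0, '0000' = 0, ' 0' = 0;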
The best option is, as all comments and answers pointed out, to change the datatype to int:
alter table `table` modify id int;
This will only work if your ids cast as integers are unique (so you don't have e.g. '0' and '00' in your table).
If you have any foreign keys that reference id, you have to drop them first and, before recreating them, change the datatype in the other columns too.
If there is a known format you store your values in (e.g. no leading zeros, or zero-padded up to a length of 8), the second-best option is to use this exact format in your query, and include the quotes so the value is not cast to integer. If you e.g. always pad to 8 digits, use
SELECT * FROM `table` WHERE id='00000000';
If you never add any leading zeros, still add the quotes:
SELECT * FROM `table` WHERE id='0';
With both options, MySQL can use your primary key and you will get your result in milliseconds.
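To confirm the difference, you can compare EXPLAIN output for both forms (a sketch against the table above):
# Quoted literal: the PRIMARY key is used (type: const)
EXPLAIN SELECT * FROM `table` WHERE id = '00000000';
# Bare number: MySQL falls back to a full table scan (type: ALL)
EXPLAIN SELECT * FROM `table` WHERE id = 00000;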
If your id column contains only numbers, define it as INT, because INT will give you better performance (comparisons are faster).
Make the column in your table (the one defined as the key) an integer and retry. First check performance by running a test within your DB (Workbench or the simple command line). You should get a better result.
Then, and only if needed (I doubt it, though), modify your Python to convert from integer to string (and/or vice versa) when referencing the key column.
I am working on a project that was started by someone else. In the db, instead of using a separate table, the developer opted to save the 1-to-many relationships in a single table as comma-separated values. The table structure is like this:
CREATE TABLE pages(
pageid INT(6) AUTO_INCREMENT PRIMARY KEY,
newsid INT(6),
pages VARCHAR(30)
);
How can I search for the value 1 in the pages column? I have identified a few patterns that may appear, but have not been able to create a solution for them.
If I am searching for 1, the following patterns should be handled:
1, match
11 shouldn't match
11, shouldn't match
,1, match
,1 match
1 match
21 shouldn't match
21, shouldn't match
I have been thinking about this for some time, but no solution has come up. I don't think a normal %LIKE% can be used here.
Sample sql on sqlfiddle
Also, I need to search for multiple values, like 1, 7 and 3.
Use FIND_IN_SET().
Example:
SELECT * FROM pages WHERE FIND_IN_SET('1', pages)
From the documentation:
FIND_IN_SET(str,strlist)
Returns a value in the range of 1 to N if the string str is in the string list strlist consisting of N substrings. A string list is a string composed of substrings separated by “,” characters. If the first argument is a constant string and the second is a column of type SET, the FIND_IN_SET() function is optimized to use bit arithmetic. Returns 0 if str is not in strlist or if strlist is the empty string. Returns NULL if either argument is NULL. This function does not work properly if the first argument contains a comma (“,”) character.
(highlighting added)
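For the multiple-value search mentioned in the question, one sketch is to combine calls; use AND if every value must be present, or OR if any one of them suffices:
# Rows whose pages list contains all of 1, 7 and 3
SELECT * FROM pages
WHERE FIND_IN_SET('1', pages)
  AND FIND_IN_SET('7', pages)
  AND FIND_IN_SET('3', pages);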
With MySQL I often overlook some options like 'signed/unsigned' ints and 'allow null' but I'm wondering if these details could slow a web application down.
Are there any notable performance differences in these situations?
using a low/high range of Integer primary key
5000 rows with ids from 1 to 5000
5000 rows with ids from 20001 to 25000
Integer PK incrementing uniformly vs non-uniformly.
5000 rows with ids from 1 to 5000
5000 rows with ids scattered from 1 to 30000
Setting an Integer PK as unsigned vs. signed
example: where the gain in range of unsigned isn't actually needed
Setting a default value for a field (any type) vs. no default
example: update a row and all field data is given
Allow Null vs deny Null
example: updating a row and all field data is given
I'm using MySQL, but this is more of a general question.
From my understanding of B-trees (that's how relational databases are usually implemented, right?), these things should not make any difference. All you need is a fast comparison function on your key, and it usually doesn't matter what range of integers you use (unless you exceed the machine word size).
Of course, for keys, a uniform default value or allowing null doesn't make much sense. In all non-key fields, allowing null or providing default values should not have any significant impact.
5000 rows is almost nothing for a database. They normally use large B-trees for indexes, so they don't care much about the distribution of primary keys.
Generally, whether to use the other options should be based on what you need from the database application. They can't significantly affect the performance. So use a default value when you want a default value, and use a NOT NULL constraint when you don't want the column to be NULL.
If you have database performance issues, you should look for more important problems, like missing indexes, slow queries that can be rewritten efficiently, and making sure that the database has accurate statistics about the data so it can use indexes the right way (although this is an admin task).
using a low/high range of Integer primary key
* 5000 rows with ids from 1 to 5000
* 5000 rows with ids from 20001 to 25000
Does not make any difference.
Integer PK incrementing uniformly vs non-uniformly.
* 5000 rows with ids from 1 to 5000
* 5000 rows with ids scattered from 1 to 30000
If the distribution is uniform, this makes no difference.
Uniform distribution may help to build a more efficient random sampling query, like the one described in this article on my blog:
PostgreSQL 8.4: sampling random rows
It's the distribution that matters, not the bounds: 1, 11, 21, 31 is OK; 1, 2, 3, 31 is not.
Setting an Integer PK as unsigned vs. signed
* example: where the gain in range of unsigned isn't actually needed
If you declare PRIMARY KEY as UNSIGNED, MySQL can optimize out predicates like id >= -1
Setting a default value for a field (any type) vs. no default
* example: update a row and all field data is given
No difference.
Allow Null vs deny Null
* example: updating a row and all field data is given
Nullable columns are one byte larger: the index key for an INT NOT NULL is 4 bytes long, while that for an INT NULL is 5 bytes long.
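The extra byte shows up in the key_len column of EXPLAIN; a small sketch, assuming a throwaway test table:
CREATE TABLE t (
  a INT NOT NULL,
  b INT NULL,
  KEY idx_a (a),
  KEY idx_b (b)
);
# key_len reports 4 for idx_a but 5 for idx_b (the extra NULL-flag byte)
EXPLAIN SELECT a FROM t WHERE a = 1;
EXPLAIN SELECT b FROM t WHERE b = 1;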