Determine whether field in MySql indexed column is numeric using index semantics

I have a MySql table where I want to get the count of rows where a given VARCHAR column has a numeric value (convertible to number, you know). Right now, I'm doing a simple REGEXP check on this field. Since this table is very large, I'm using a series of indexes to REGEXP as few rows as possible.
But this VARCHAR column is also indexed. Is there a clever hack of the MySql indexing algorithm that I can exploit to scan even fewer rows? :-/ This is an InnoDB table.

You may not like this, as you are probably already trying to avoid it, but rather than trying some clever trick, when I have had situations like this I add an additional numeric column that stores the varchar's value (updated using a trigger), and query on that.
But, there is a way I can see to do it (though I have never had a reason to do this in production), which is to exploit the fact that indexing will put the values in order, such that all that begin with a number are sequenced together.
Assuming a table like this:
CREATE TABLE `test_1` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`text_or_number` varchar(255),
PRIMARY KEY (`id`),
KEY `test_1_idx` (`text_or_number`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
You can select only values starting with numbers by exploiting the order of utf8 characters - http://en.wikipedia.org/wiki/UTF-8#Examples
The character immediately before "0" in utf-8 is "/", and the character immediately after "9" is ":", so a range between the two narrows the scan to values that start with a number:
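select cast(text_or_number as unsigned)
from test_1
where text_or_number < ':'
and text_or_number > '/'
and text_or_number regexp '^[0-9]+$';
The range can still contain values that start with a number but do not end with one (and longer values that begin with "/"), which is why I have added the regexp clause. A cast(...) > 0 test would not be enough here: cast('12abc' as unsigned) quietly yields 12, and a genuine '0' would be dropped. MySQL does not promise to evaluate where clauses in the order written, but the two range conditions can be resolved with a range scan on test_1_idx, so the regexp should only run against the subset of rows that fall inside the range.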
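The range-then-filter idea can be modelled outside MySQL. The sketch below is only an illustration: a sorted Python list stands in for the index, the sample values are made up, and Python compares by code point, which may differ from the collation your column uses.

```python
import bisect
import re

# Hypothetical column values; a sorted list stands in for the index.
index_keys = sorted(["abc", "12", "007", "12abc", "/tmp", ":x", "9", "zebra", "0"])

# Range scan: keys strictly greater than "/" and strictly less than ":".
lo = bisect.bisect_right(index_keys, "/")
hi = bisect.bisect_left(index_keys, ":")
candidates = index_keys[lo:hi]

# Only the candidates need the more expensive all-digits check.
numeric = [key for key in candidates if re.fullmatch(r"[0-9]+", key)]

print(candidates)
print(numeric)
```

Note that the range alone lets "/tmp" and "12abc" through, which is why a second filter is still needed.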

Related

MySql Indexing part of a column

I need to search a medium sized MySql table (about 15 million records).
My query searches for a value ending with another value, for example:
SELECT * FROM {tableName} WHERE {column} LIKE '%{value}'
{value} is always 7 characters length.
{column} is sometimes 8 characters length (otherwise it is 7).
Is there a way to improve performance on my search?
Clearly an index is not an option.
I could save {column} values in reverse order in another column and index that column, but I'm looking to avoid this solution.
{value} is always 7 characters length
Your data is not normalized. Fixing this is the way to fix the problem. Anything else is a hack. Having said that, I accept it is not always practical to repair damage done in the past by dummies.
However the most appropriate hack depends on a whole lot of information you've not told us about.
how frequently you will run the query
what the format of the composite data is
but I'm looking to avoid this solution.
Why? It's a reasonable way to address the problem. The only downside is that you need to maintain the new attribute. Given that this data domain appears in different attributes in multiple tables (another normalization violation), it would make more sense to implement the index in a separate, EAV relation, and you only need to add triggers on the original table to keep it in sync, so your existing code base keeps working. Every solution I can think of will likely require a similar fix.
Here's a simplified example (no multiple attributes) to get you started:
CREATE TABLE lookup (
table_name VARCHAR(18) NOT NULL,
record_id INT NOT NULL, /* or whatever */
suffix VARCHAR(7),
PRIMARY KEY (table_name, record_id),
INDEX (suffix, table_name, record_id)
);
-- Trigger names must be unique, so each event gets its own name.
CREATE TRIGGER insert_suffix AFTER INSERT ON yourtable
FOR EACH ROW
REPLACE INTO lookup (table_name, record_id, suffix)
VALUES ('yourtable', NEW.id, RIGHT(NEW.attribute, 7));
CREATE TRIGGER update_suffix AFTER UPDATE ON yourtable
FOR EACH ROW
REPLACE INTO lookup (table_name, record_id, suffix)
VALUES ('yourtable', NEW.id, RIGHT(NEW.attribute, 7));
CREATE TRIGGER delete_suffix AFTER DELETE ON yourtable
FOR EACH ROW
DELETE FROM lookup WHERE table_name='yourtable' AND record_id=OLD.id;
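To see why a lookup keyed on the last 7 characters turns the suffix search into an exact match, here is a small Python model. It is a sketch only: the rows are invented, and a dictionary stands in for the indexed suffix column.

```python
from collections import defaultdict

SUFFIX_LEN = 7  # the question says {value} is always 7 characters long


def build_suffix_lookup(rows):
    """Map the last SUFFIX_LEN characters of each value to the matching row ids."""
    lookup = defaultdict(list)
    for record_id, attribute in rows:
        lookup[attribute[-SUFFIX_LEN:]].append(record_id)
    return lookup


# Hypothetical rows: (id, attribute), with 7- or 8-character values.
rows = [(1, "1234567"), (2, "01234567"), (3, "7654321"), (4, "91234567")]
lookup = build_suffix_lookup(rows)

print(lookup["1234567"])
```

The search for values ending in "1234567" becomes a single key lookup instead of a scan over every row.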
If you have a set number of options for the first character, then you can use in. For instance:
where column in ('{value}', '0{value}', '1{value}', . . . )
This allows MySQL to use an index on the column.
Unfortunately, with a wildcard at the beginning of the pattern, it is hard to use an index. Is it possible to store the first character in another column?
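If the optional first character comes from a small known set, the IN list can be generated rather than typed by hand. The sketch below assumes, purely for illustration, that the extra character is a digit; the function name is made up.

```python
import string


def candidate_values(value, first_chars=string.digits):
    """Return the bare value plus the value prefixed by each possible first character."""
    return [value] + [char + value for char in first_chars]


candidates = candidate_values("1234567")
print(len(candidates))
print(candidates[:3])
```

Each candidate is a complete value, so every comparison in the IN list is an exact match that an index on the column can serve.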

mySQL valid bit - alternatives?

Currently, I have a mySQL table with columns that looks something like this:
run_date DATE
name VARCHAR(10)
load INTEGER
sys_time TIME
rec_time TIME
valid TINYINT
The column valid is essentially a valid bit, 1 if this row is the latest value for this (run_date,name) pair, and 0 if not. To make insertions simpler, I wrote a stored procedure that first runs an UPDATE table_name SET valid = 0 WHERE run_date = X AND name = Y command, then inserts the new row.
The table reads are in such a way that I usually use only the valid = 1 rows, but I can't discard the invalid rows. Obviously, this schema also has no primary key.
Is there a better way to structure this data or the valid bit, so that I can speed up both inserts and searches? A bunch of indexes on different orders of columns gets large.
In all of the suggestions below, get rid of valid and the UPDATE of it. That is not scalable.
Plan A: At SELECT time, use 'groupwise max' code to locate the latest run_date, hence the "valid" entry.
Plan B: Have two tables and change both when inserting: history, with PRIMARY KEY(name, run_date) and a simple INSERT statement; current, with PRIMARY KEY(name) and INSERT ... ON DUPLICATE KEY UPDATE. The "usual" SELECTs need only touch current.
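Plan B can be pictured with two in-memory structures. This is only a model: a list stands in for the history table, a dictionary keyed by name stands in for the current table, and the sample rows are invented. The uniqueness constraint on history is not modelled.

```python
from datetime import date

history = []  # stands in for the history table (append-only)
current = {}  # stands in for the current table, keyed by name


def insert_row(name, run_date, load):
    row = {"name": name, "run_date": run_date, "load": load}
    history.append(row)  # plain INSERT
    current[name] = row  # INSERT ... ON DUPLICATE KEY UPDATE


insert_row("alpha", date(2012, 1, 1), 10)
insert_row("beta", date(2012, 1, 1), 7)
insert_row("alpha", date(2012, 1, 2), 12)

print(len(history))
print(current["alpha"]["load"])
```

No existing row is ever flagged as invalid: the old values simply stay in history while current is overwritten.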
Another issue: TIME is limited to 838:59:59 and is intended to mean 'time of day', not 'elapsed time'. For the latter, use INT UNSIGNED (or some variant of INT). For formatting, you can use sec_to_time(). For example sec_to_time(3601) -> 01:00:01.
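The seconds-to-HH:MM:SS arithmetic is easy to check by hand. The Python function below is a stand-in written for this illustration (it is not MySQL's implementation) and handles non-negative whole seconds only.

```python
def sec_to_time(seconds):
    """Format a non-negative number of whole seconds as HH:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(sec_to_time(3601))
print(sec_to_time(3020399))
```

3601 seconds is one hour and one second, and 3020399 seconds is exactly the 838:59:59 ceiling.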

MySQL performance issue on ~3million rows containing MEDIUMTEXT?

I had a table with 3 columns and 3600K rows. Using MySQL as a key-value store.
The first column id was VARCHAR(8) and set as primary key. The 2nd and 3rd columns were MEDIUMTEXT. When calling SELECT * FROM table WHERE id=00000, MySQL took anywhere from 54 sec to 3 minutes.
For testing I created a table containing VARCHAR(8)-VARCHAR(5)-VARCHAR(5) where the data was randomly generated with numpy.random.randint. SELECT takes 3 sec without a primary key. With the same random data in VARCHAR(8)-MEDIUMTEXT-MEDIUMTEXT, SELECT took 15 sec without a primary key. (Note: in the second test, the 2nd and 3rd columns actually contained very short text like '65535', but were created as MEDIUMTEXT.)
My question is: how can I achieve similar performance on my real data? (or, is it impossible?)
If you use
SELECT * FROM `table` WHERE id=00000
instead of
SELECT * FROM `table` WHERE id='00000'
you are looking for all strings that are equal to the integer 0, so MySQL will have to check all rows, because '0', '0000' and even ' 0' will all be cast to integer 0. So your primary key on id will not help and you will end up with a slow full table scan. Even if you don't store values that way, MySQL doesn't know that.
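A rough Python model shows how many different strings compare equal to the number 0. The helper below is a simplification invented for this example (it only looks at a leading integer part), not MySQL's actual conversion routine.

```python
import re


def leading_number(text):
    """Rough model of string-to-number comparison: use the leading integer part, else 0."""
    match = re.match(r"\s*([+-]?\d+)", text)
    return int(match.group(1)) if match else 0


ids = ["0", "0000", " 0", "00000", "7", "abc"]
matches = [value for value in ids if leading_number(value) == 0]

print(matches)
```

The matching strings are scattered across string sort order, so an index ordered by the string values cannot locate them with a single lookup.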
The best option is, as all comments and answers pointed out, to change the datatype to int:
alter table `table` modify id int;
This will only work if your ids cast to integer are unique (so you don't have e.g. '0' and '00' in your table).
If you have any foreign keys that reference id, you have to drop them first and, before recreating them, change the datatype in the other columns too.
If you store your values in a known format (e.g. no leading zeros, or padded with 0s up to a length of 8), the second best option is to use this exact format in your query, and include the ' so the value is not cast to integer. If you e.g. always pad with 0 to 8 digits, use
SELECT * FROM `table` WHERE id='00000000';
If you never add any zeros, still add the ':
SELECT * FROM `table` WHERE id='0';
With both options, MySQL can use your primary key and you will get your result in milliseconds.
If your id column contains only numbers, define it as int, because int will give you better performance (it is faster).
Make the column in your table (the one defined as key) integer and retry. First check performance by running a test within your DB (workbench or simple command line). You should get a better result.
Then, and only if needed (I doubt it though), modify your Python to convert from integer to string (and/or vice versa) when referencing the key column.

How to optimize this query in a large database?

Query
SELECT id FROM `user_tmp`
WHERE `code` = '9s5xs1sy'
AND `go` NOT REGEXP 'http://www.xxxx.example.com/aflam/|http://xx.example.com|http://www.xxxxx..example.com/aflam/|http://www.xxxxxx.example.com/v/|http://www.xxxxxx.example.com/vb/'
AND check='done'
AND `dataip` <1319992460
ORDER BY id DESC
LIMIT 50
MySQL returns:
Showing rows 0 - 29 ( 50 total, Query took 21.3102 sec) [id: 2622270 - 2602288]
if i remove
AND dataip <1319992460
MySQL returns
Showing rows 0 - 29 ( 50 total, Query took 0.0859 sec) [id: 3637556 - 3627005]
and if no data, MySQL returns
MySQL returned an empty result set (i.e. zero rows). ( Query took 21.7332 sec )
Explain plan:
SQL query: Explain SELECT * FROM `user_tmp` WHERE `code` = '93mhco3s5y' AND `too` NOT REGEXP 'http://www.10neen.com/aflam/|http://3ltool.com|http://www.10neen.com/aflam/|http://www.10neen.com/v/|http://www.m1-w3d.com/vb/' and checkopen='2010' and `dataip` <1319992460 ORDER BY id DESC LIMIT 50;
id  select_type  table     type   possible_keys  key      key_len  ref   rows  Extra
1   SIMPLE       user_tmp  index  NULL           PRIMARY  4        NULL  50    Using where
Example of the database used
CREATE TABLE IF NOT EXISTS user_tmp (
id int(9) NOT NULL AUTO_INCREMENT,
ip text NOT NULL,
dataip bigint(20) NOT NULL,
ref text NOT NULL,
click int(20) NOT NULL,
code text NOT NULL,
too text NOT NULL,
name text NOT NULL,
checkopen text NOT NULL,
contry text NOT NULL,
vOperation text NOT NULL,
vBrowser text NOT NULL,
iconOperation text NOT NULL,
iconBrowser text NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=4653425 ;
--
-- Dumping data for table user_tmp
INSERT INTO `user_tmp` (`id`, `ip`, `dataip`, `ref`, `click`, `code`, `too`, `name`, `checkopen`, `contry`, `vOperation`, `vBrowser`, `iconOperation`, `iconBrowser`) VALUES
(1, '54.125.78.84', 1319506641, 'http://xxxx.example.com/vb/showthread.php%D8%AA%D8%AD%D9%85%D9%8A%D9%84-%D8%A7%D8%BA%D9%86%D9%8A%D8%A9-%D8%A7%D9%84%D8%A8%D9%88%D9%85-giovanni-marradi-lovers-rendezvous-3cd-1999-a-155712.html', 0, '4mxxxxx5', 'http://www.xxx.example.com/aflam/', 'xxxxe', '2010', 'US', 'Linux', 'Chrome 12.0.742 ', 'linux.png', 'chrome.png');
I want to know the correct way to write the query and optimize the database.
You don't have any indexes besides the primary key. You need to create indexes on the fields you use in your WHERE clause. Whether you need to index a single field or a combination of several fields depends on the other SELECTs you will be running against that table.
Keep in mind that REGEXP cannot use indexes at all, and LIKE can use an index only when the pattern does not begin with a wildcard (so LIKE 'a%' can use an index, but LIKE '%a' cannot). Range comparisons (<, >) can use a B-tree index, but the optimizer may skip it when the range matches a large share of the table, and in a multi-field index no field after the range field is used for the lookup.
So start with the equality fields, code and check. I suppose many rows will have the same value for check, so I would begin the index with the code field. Multi-field indexes can be used only in the order in which they are defined...
Imagine index created for fields code, check. This index can be used in your query (where the WHERE clause contains both fields), also in the query with only code field, but not in query with only check field.
Is it important to ORDER BY id? If not, leave it out, it will prevent the sort pass and your query will finish faster.
I will assume you are using mysql <= 5.1
The answers above fall into two basic categories:
1. You are using the wrong column type
2. You need indexes
I will deal with each as both are relevant for performance which is ultimately what I take your questions to be about:
Column Types
The difference between bigint/int or int/char for the dataip question is basically not relevant to your issue. The fundamental issue has more to do with index strategy. However, when considering performance holistically, the fact that you are using MyISAM as your engine for this table leads me to ask if you really need "text" column types. If you have short (say, less than 255) character columns, then making them fixed length columns will most likely increase performance. Keep in mind that if any one column remains variable length (varchar, text, etc.) the row format stays dynamic, so it is not worth changing any of them.
Vertical Partitioning
The fact to keep in mind here is that even though you are only requesting the id column, from the standpoint of disk IO and memory you are getting the entire row back. Since so many of the columns are text, this could mean a massive amount of data. Any of these columns that are not used for lookups of users or are not often accessed could be moved into another table where the foreign key has a unique key placed on it, keeping the relationship 1:1.
Index Strategy
Most likely the problem is simply indexing, as noted above. Adding the "AND dataip <1319992460" condition causes the slowdown because, with no usable index, MySQL has to read far more rows (up to a full table scan) before it finds 50 that match.
As stated above, placing all the columns in the where clause in a single, composite index will help. The order of the equality columns in the index will not matter for this query so long as all of them appear in the where clause, but the column used in the range condition (dataip) should come after them.
However, the order could matter a great deal for other queries. A quick example would be an index made of (colA, colB). A query with "where colA = 'foo'" will use this index. But a query with "where colB = 'bar'" will not because colB is not the left most column in the index definition. So, if you have other queries that use these columns in some combination it is worth minimizing the number of indexes created on the table. This is b/c every index increases the cost of a write and uses disk space. Writes are expensive b/c of necessary disk activity. Don't make them more expensive.
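The leftmost-column rule can be written down as a few lines of Python. This is a simplified model made up for illustration (the function name and the rule that a range condition ends the usable prefix are assumptions of the sketch, not output from the MySQL optimizer).

```python
def usable_index_columns(index_columns, equality_columns, range_columns=()):
    """Return the leading index columns a query could use for lookup."""
    used = []
    for column in index_columns:
        if column in equality_columns:
            used.append(column)  # equality: keep walking the index definition
        elif column in range_columns:
            used.append(column)  # range: usable, but nothing after it is
            break
        else:
            break  # a gap in the prefix ends index use
    return used


print(usable_index_columns(("colA", "colB"), {"colA"}))
print(usable_index_columns(("colA", "colB"), {"colB"}))
print(usable_index_columns(("code", "check", "dataip"), {"code", "check"}, {"dataip"}))
print(usable_index_columns(("dataip", "code", "check"), {"code", "check"}, {"dataip"}))
```

In this model, putting the range column first leaves the two equality columns unused for the lookup, which is the reason to place it last.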
You need to add index like this:
ALTER TABLE `user_tmp` ADD INDEX(`dataip`);
And if your column 'dataip' contains only unique values you can add unique key like this:
ALTER TABLE `user_tmp` ADD UNIQUE(`dataip`);
Keep in mind that adding an index can take a long time on a big table, so don't do it on a production server without testing.
Index the fields of your where clause. The order in which the conditions are written in the where clause does not matter; what matters is that the query constrains the leftmost fields of the index, otherwise the index is not used.
Does dataip really need to be a bigint? According to the MySQL documentation, the signed range is -9223372036854775808 to 9223372036854775807 (it is a 64-bit number).
You need to choose the right column type for the job, and add the right type of index too. Else these queries will take forever.

MySQL indexes: how do they work?

I'm a complete newbie with MySQL indexes. I have several MyISAM tables on MySQL 5.0x having utf8 charsets and collations with 100k+ records each. The primary keys are generally integer. Many columns on each table may have duplicate values.
I need to quickly count, sum, average, or otherwise perform custom calculations on any number of fields in each table or joined on any number of others.
I found this page giving an overview of MySQL index usage: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html, but I'm still not sure I'm using indexes right. Just when I think I've made the perfect index out of a collection of fields I want to calculate against, I get the "index must be under 1000 bytes" error.
Can anyone explain how to most efficiently create and use indexes to speed up queries?
Caveat: upgrading Mysql is not possible in this case. Using Navicat Light for db administration, but this app isn't required.
When you create an index on a column or columns in MySQL table, the database is creating a data structure called a B-tree (assuming you use the default index setting), for which the key of each record is a concatenation of the values in the indexed columns.
For example, let's say you have a table that is defined like:
CREATE TABLE mytable (
id int unsigned auto_increment,
column_a char(32) not null default '',
column_b int unsigned not null default 0,
column_c varchar(512),
column_d varchar(512),
PRIMARY KEY (id)
) ENGINE=MyISAM;
Then let's give it some data:
INSERT INTO mytable VALUES (1, 'hello', 2, null, null);
INSERT INTO mytable VALUES (2, 'hello', 3, 'hi', 'there');
INSERT INTO mytable VALUES (3, 'how', 4, 'are', 'you?');
INSERT INTO mytable VALUES (4, 'foo', 5, '', 'bar');
Now suppose you decide to add a key to column_a and column_b like:
ALTER TABLE mytable ADD KEY (column_a, column_b);
The database is going to create the aforementioned B-tree, which will have four keys in it, one for each row:
hello-2
hello-3
how-4
foo-5
When you perform a search that references the column_a column, or that references the column_a AND column_b columns, the database will be able to use this index to narrow the record set it has to examine. Let's say you have a query like:
SELECT ... FROM mytable WHERE column_a = 'hello';
Even though the above query does not specify a value for the column_b column, it can still take advantage of our index by looking for all keys that begin with "hello". For the same reason, if you had a query like:
SELECT ... FROM mytable WHERE column_b = '2';
This query would NOT be able to use our index, because it would have to parse the index keys themselves to try to determine which keys' second value matches '2', which is terribly inefficient.
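A sorted list of tuples makes a reasonable toy model of this index. The sketch below is an illustration only, using the four sample rows from above, and is not how MySQL stores its B-tree.

```python
import bisect

# (column_a, column_b) pairs from the sample rows, kept in sorted order like index keys.
index = sorted([("hello", 2), ("hello", 3), ("how", 4), ("foo", 5)])


def lookup_by_column_a(value):
    """Binary-search to the first key starting with value, then read forward."""
    start = bisect.bisect_left(index, (value,))
    end = start
    while end < len(index) and index[end][0] == value:
        end += 1
    return index[start:end]


def lookup_by_column_b(value):
    """No ordering on column_b alone, so every key has to be inspected."""
    return [key for key in index if key[1] == value]


print(lookup_by_column_a("hello"))
print(lookup_by_column_b(2))
```

The first lookup jumps straight to the "hello" keys, while the second has no choice but to walk the whole list.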
Now, let's address your original question of the maximum length. Suppose we try to create an index spanning all four non-PK columns in this table:
ALTER TABLE mytable ADD KEY (column_a, column_b, column_c, column_d);
You will get an error:
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes
In this case our column lengths are 32, 10, 512, and 512, which in a single-byte-per-character situation is 1066, which is above the limit of 1000. Suppose that it DID work; you would be creating the following keys:
hello-2-
hello-3-hi-there
how-4-are-you?
foo-5--bar
Now, suppose that you had values in column_c and column_d that were very long -- 512 characters each. Even in a basic single-byte character set, your keys would now be over 1000 bytes in length, which is what MySQL is complaining about. It gets even worse with multibyte character sets, where seemingly "small" columns can still push the keys over the limit.
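The arithmetic can be checked quickly. The sketch below follows the simplified accounting used in this answer (character columns times bytes per character, plus 10 for the integer column); the real on-disk key format has further details that are not modelled here.

```python
MYISAM_LIMIT = 1000  # bytes, taken from the error message above


def key_bytes(char_lengths, other_bytes, bytes_per_char):
    """Worst-case key size: character columns scale with the character set."""
    return sum(length * bytes_per_char for length in char_lengths) + other_bytes


single_byte = key_bytes([32, 512, 512], 10, 1)
three_byte = key_bytes([32, 512, 512], 10, 3)  # e.g. utf8 at up to 3 bytes per character

print(single_byte, single_byte > MYISAM_LIMIT)
print(three_byte, three_byte > MYISAM_LIMIT)
```

With a three-byte character set even the char(32) column alone accounts for 96 bytes of the budget.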
If you MUST use a large compound key, one solution is to use InnoDB tables rather than the default MyISAM tables. InnoDB supports a larger key length (its internal limit is 3500 bytes, although MySQL caps the usable length lower, at 3072 bytes in most versions) -- you can do this by using ENGINE=InnoDB instead of ENGINE=MyISAM in the declaration above. However, generally speaking, if you are using long keys there is probably something wrong with your table design.
Remember that single-column indexes often provide more utility than multi-column indexes. You want to use a multi-column index when you are going to often/always take advantage of it by specifying all of the necessary criteria in your queries. Also, as others have mentioned, do NOT index every column of a table, since each index adds storage overhead to your database. You want to limit your indexes to the columns that are frequently used by queries, and if it seems like you need too many, you should probably think about breaking your tables up into more logical components.
Indexes generally aren't well suited for custom calculations where the user is able to construct their own queries. Typically you choose the indexes to match the specific queries you intend to run, using EXPLAIN to see if the index is being used.
In the case that you have absolutely no idea what queries might be performed it is generally best to create one index per column - and not one index covering all columns.
If you have a good idea of what queries might be run often you could create an extra index for those specific queries. You can also add indexes later if your users complain that certain types of queries run too slow.
Also, indexes generally aren't that useful for calculating counts, sums and averages since these types of calculations require looking at every row.
It sounds like you are trying to put too many fields into your index. The limit is probably the number of bytes it takes to encode all the fields.
The index is used in looking up the records, so you want to choose the fields which you are "WHERE"ing on. In choosing between those fields, you want to choose the ones that will narrow the results the quickest.
As an example, a filter on Male/Female will usually not help much because you are only going to save about 50% of the time. However, a filter on State may be useful because you'll break down into many more categories. However, if almost everybody in the database is in a single state then that won't work.
Remember that indexes are for sorting and finding rows.
The error message you got sounds like it is talking about the 1000 byte Prefix Limit for MyISAM table indexes. From http://dev.mysql.com/doc/refman/5.0/en/create-index.html:
The statement shown here creates an index using the first 10 characters of the name column:
CREATE INDEX part_of_name ON customer (name(10));
If names in the column usually differ in the first 10 characters, this index should not be much slower than an index created from the entire name column. Also, using column prefixes for indexes can make the index file much smaller, which could save a lot of disk space and might also speed up INSERT operations.
Prefix support and lengths of prefixes (where supported) are storage engine dependent. For example, a prefix can be up to 1000 bytes long for MyISAM tables, and 767 bytes for InnoDB tables.
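Before choosing a prefix length it can help to measure how many distinct prefixes your data actually has. The sketch below is a hypothetical helper with made-up names, not part of MySQL; a ratio near 1.0 means the prefix distinguishes rows almost as well as the full column.

```python
def prefix_selectivity(values, prefix_len):
    """Fraction of rows that remain distinguishable using only the first prefix_len characters."""
    distinct_prefixes = {value[:prefix_len] for value in values}
    return len(distinct_prefixes) / len(values)


# Hypothetical sample of a name column.
names = ["Anderson", "Andersen", "Andrews", "Baker", "Bakerson", "Clark"]

for length in (3, 6, 10):
    print(length, prefix_selectivity(names, length))
```

On real data you would run the same comparison over a sample of the column and pick the shortest prefix whose ratio is close to that of the full values.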
Maybe you can try a FULLTEXT index for problematic columns.