Comparing strings up to column length (using index) - mysql

Basically what I want to do is to reverse the column LIKE 'string%' behavior. Consider the following table:
CREATE TABLE texts (
id int not null,
txt varchar(30) not null,
primary key(id),
key `txt_idx` (txt)
) engine=InnoDB;
INSERT INTO texts VALUES(1, 'abcd');
According to B-Tree Index Characteristics, the following query will utilize the txt_idx index:
SELECT txt FROM texts WHERE txt LIKE 'abc%';
Now I want somewhat different behavior. I want the 'abcd' row to be returned when queried for 'abcde'. At the moment I'm stuck with this query:
SELECT txt FROM texts WHERE 'abcde' LIKE CONCAT(txt, '%');
Obviously (confirmed by EXPLAIN) it does not utilize any index, but my intuition tells me it should be possible to compare a particular value against the index up to the indexed value's length (just like strncmp does).
The main reason for this is my huge table with domain entries. I want to select both "example.org" and "something.example.org" (but not "else.example.org") when querying for "www.something.example.org". Splitting and performing multiple queries or applying OR clauses unfortunately seems too slow for me.

The only thing I can think of is to convert it to the equivalent IN test:
WHERE txt IN ('a', 'ab', 'abc', 'abcd', 'abcde')
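Applied to the domain case from the question, the same idea means building the list of candidate ancestor domains on the application side (an assumption about how the list would be produced, e.g. by splitting the lookup value on dots) and handing it to an indexed IN lookup:

SELECT txt FROM texts
WHERE txt IN ('www.something.example.org',
              'something.example.org',
              'example.org');

Each element of the IN list is a plain equality comparison, so the txt_idx index can satisfy it.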

Related

Mysql: LIKE CONCAT Replacement --> less performance heavy

So I have a SELECT statement that compares the content of the table_1 column "table_1_content" with the content of another column (table_2_content) in table_2, where the content of "table_2_content" can be found anywhere in "table_1_content":
$select = "SELECT * FROM table_1, table_2 WHERE `table_1_content` LIKE CONCAT('%', table_2_content, '%')";
$result = mysqli_query($con, $select);
My problem is that LIKE CONCAT is pretty performance heavy.
Is there another way to search through two columns from different tables, so that no full table scan is performed every time the query is executed?
The LIKE in totally free text format (% at the start and at the end of the search string) is the performance-heavy part. Is the wildcard at the start of the string necessary? If so, you might have to consider pre-processing the data in a different way so that the search can use a single wildcard or no wildcard at all. Depending on the data, this can be done by splitting the string on a delimiter and storing the pieces in separate rows, after which much faster comparisons are possible and indexes can be used.
To put the data into multiple rows, we assume a usable separator (there can be more than one; the code just gets longer):
CREATE TABLE baseinfo (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
  -- some other columns
);
CREATE TABLE explodedstring (
  id INT NOT NULL,
  str VARCHAR(200),
  FOREIGN KEY (id) REFERENCES baseinfo(id)
);
CREATE PROCEDURE explodestring(id INT, fullstr VARCHAR(4000))
BEGIN
  -- many examples already exist on SO of how to do this
END;
The procedure would take as input your key from the original data (id in this case), and the original string.
The output of the procedure would end up in a secondary table explodedstring against which you now could run a normal select (add some index for performance). The resulting ids would tell you which record would match.
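A minimal sketch of such a procedure, assuming a comma as the delimiter (the loop body is not from the original answer, which deliberately left it out; the p_id parameter renames id to avoid ambiguity with the column):

DELIMITER //
CREATE PROCEDURE explodestring(p_id INT, fullstr VARCHAR(4000))
BEGIN
  DECLARE part VARCHAR(200);
  -- Split fullstr on commas and store one fragment per row in explodedstring.
  WHILE LENGTH(fullstr) > 0 DO
    SET part = SUBSTRING_INDEX(fullstr, ',', 1);   -- text before the first comma
    INSERT INTO explodedstring (id, str) VALUES (p_id, TRIM(part));
    IF LOCATE(',', fullstr) > 0 THEN
      SET fullstr = SUBSTRING(fullstr, LOCATE(',', fullstr) + 1);   -- drop the consumed fragment and its comma
    ELSE
      SET fullstr = '';   -- last fragment handled, stop the loop
    END IF;
  END WHILE;
END //
DELIMITER ;

After the data is exploded, a plain equality or LIKE 'value%' lookup against explodedstring.str (with an index on str) replaces the double-wildcard scan.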

mySQL valid bit - alternatives?

Currently, I have a mySQL table with columns that look something like this:
run_date DATE
name VARCHAR(10)
load INTEGER
sys_time TIME
rec_time TIME
valid TINYINT
The column valid is essentially a valid bit, 1 if this row is the latest value for this (run_date,name) pair, and 0 if not. To make insertions simpler, I wrote a stored procedure that first runs an UPDATE table_name SET valid = 0 WHERE run_date = X AND name = Y command, then inserts the new row.
The table reads are in such a way that I usually use only the valid = 1 rows, but I can't discard the invalid rows. Obviously, this schema also has no primary key.
Is there a better way to structure this data or the valid bit, so that I can speed up both inserts and searches? A bunch of indexes on different orders of columns gets large.
In all of the suggestions below, get rid of valid and the UPDATE of it. That is not scalable.
Plan A: At SELECT time, use 'groupwise max' code to locate the latest run_date, hence the "valid" entry.
Plan B: Have two tables and change both when inserting: history, with PRIMARY KEY(name, run_date) and a simple INSERT statement; current, with PRIMARY KEY(name) and INSERT ... ON DUPLICATE KEY UPDATE. The "usual" SELECTs need only touch current.
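A minimal sketch of Plan B with the columns from the question (the exact column list and example values are assumptions; `load` and `current` are backtick-quoted because they clash with keywords):

CREATE TABLE history (
  name     VARCHAR(10) NOT NULL,
  run_date DATE NOT NULL,
  `load`   INTEGER,
  sys_time TIME,
  rec_time TIME,
  PRIMARY KEY (name, run_date)
);
CREATE TABLE `current` (
  name     VARCHAR(10) NOT NULL,
  run_date DATE NOT NULL,
  `load`   INTEGER,
  sys_time TIME,
  rec_time TIME,
  PRIMARY KEY (name)
);
-- Every insert touches both tables; `current` always holds the latest row per name.
INSERT INTO history (name, run_date, `load`, sys_time, rec_time)
VALUES ('job_a', '2024-01-01', 42, '01:02:03', '01:02:04');
INSERT INTO `current` (name, run_date, `load`, sys_time, rec_time)
VALUES ('job_a', '2024-01-01', 42, '01:02:03', '01:02:04')
ON DUPLICATE KEY UPDATE
  run_date = VALUES(run_date),
  `load`   = VALUES(`load`),
  sys_time = VALUES(sys_time),
  rec_time = VALUES(rec_time);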
Another issue: TIME is limited to 838:59:59 and is intended to mean 'time of day', not 'elapsed time'. For the latter, use INT UNSIGNED (or some variant of INT). For formatting, you can use SEC_TO_TIME(). For example, SEC_TO_TIME(3601) -> 01:00:01.

MySQL performance issue on ~3million rows containing MEDIUMTEXT?

I had a table with 3 columns and 3600K rows, using MySQL as a key-value store.
The first column, id, was VARCHAR(8) and set as the primary key. The 2nd and 3rd columns were MEDIUMTEXT. Calling SELECT * FROM table WHERE id=00000 took MySQL anywhere from 54 sec to 3 minutes.
For testing, I created a table with columns VARCHAR(8)-VARCHAR(5)-VARCHAR(5), with data randomly generated from numpy.random.randint. A SELECT takes 3 sec without a primary key. With the same random data as VARCHAR(8)-MEDIUMTEXT-MEDIUMTEXT, the SELECT took 15 sec without a primary key. (Note: in the second test, the 2nd and 3rd columns actually contained very short text like '65535', but were created as MEDIUMTEXT.)
My question is: how can I achieve similar performance on my real data? (or, is it impossible?)
If you use
SELECT * FROM `table` WHERE id=00000
instead of
SELECT * FROM `table` WHERE id='00000'
you are looking for all strings that are equal to the integer 0, so MySQL will have to check all rows, because '0', '0000' and even ' 0' will all be cast to the integer 0. So your primary key on id will not help and you will end up with a slow full table scan. Even if you don't store values that way, MySQL doesn't know that.
The best option is, as all comments and answers pointed out, to change the datatype to int:
alter table `table` modify id int;
This will only work if your ids, cast to integer, are unique (so you don't have e.g. '0' and '00' in your table).
If you have any foreign keys that reference id, you have to drop them first and, before recreating them, change the datatype in the referencing columns too.
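A hedged sketch of that sequence, with a hypothetical child table (orders) and constraint name (fk_orders_table):

-- Drop the referencing constraint, change both column types, then recreate the constraint.
ALTER TABLE orders DROP FOREIGN KEY fk_orders_table;
ALTER TABLE orders MODIFY table_id INT;
ALTER TABLE `table` MODIFY id INT;
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_table FOREIGN KEY (table_id) REFERENCES `table`(id);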
If you have a known format in which you store your values (e.g. no leading zeros, or zero-padded up to a length of 8), the second best option is to use this exact format in your query and include the quotes so the value is not cast to an integer. If you e.g. always pad with zeros to 8 digits, use
SELECT * FROM `table` WHERE id='00000000';
If you never add any leading zeros, still add the quotes:
SELECT * FROM `table` WHERE id='0';
With both options, MySQL can use your primary key and you will get your result in milliseconds.
If your id column contains only numbers, define it as INT, because INT will give you better performance (it is faster).
Make the column in your table (the one defined as the key) an integer and retry. First check performance by running a test within your DB (Workbench or the simple command line); you should get a better result.
Then, and only if needed (I doubt it though), modify your Python to convert between integer and string when referencing the key column.

How to do a CONTAINS() on two columns of Full Text Index Search SQL

I have a table (MyTable) with the following columns:
Col1: NameID VARCHAR(50) PRIMARY KEY NOT NULL
Col2: Address VARCHAR(255)
Data Example:
Name: '1 24'
Address: '1234 Main St.'
and I created a full text index on the table after making the catalog using default params.
How can I achieve the following query:
SELECT * FROM MyTable
WHERE CONTAINS(NameID, '1')
AND CONTAINS(Address, 'Main St.');
But my query is returning no results, which doesn't make sense because this does work:
SELECT * FROM MyTable
WHERE CONTAINS(Address, 'Main St.');
and so does this:
SELECT * FROM MyTable
WHERE CONTAINS(Address, 'Main St.')
AND NameID LIKE '1%'
but this also doesn't work:
SELECT * FROM MyTable
WHERE CONTAINS(NameID, '1');
Why can't I query on the indexed, primary key column (Name) when I selected this column to be included with
the Address column when setting up the Full Text Index?
Thanks in advance!
Since the NameID field is of type varchar, full-text will handle the indexing just fine.
The reason CONTAINS(NameID, '1') does not return any search results is that '1' (and other such small numbers) are regarded as noise words by full-text search and filtered out at indexing time.
To get a list of the stop words, run the following query -
select * from sys.fulltext_system_stopwords where language_id = 1033;
You need to turn off or modify the stop list.
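For instance, assuming the full-text index lives on dbo.MyTable, detaching the stoplist entirely might look like the following (a syntax sketch, not tested against the OP's setup):

-- Stop filtering noise/stop words such as '1' out of the full-text index.
-- Unless WITH NO POPULATION is added, the index is repopulated after this change.
ALTER FULLTEXT INDEX ON dbo.MyTable SET STOPLIST OFF;

Alternatively, create a custom stoplist without the offending words and attach it with SET STOPLIST your_stoplist_name.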
I think the biggest problem here (and I edited my question to reflect this) is that I've got integers representing the primary key's name, which the CONTAINS() function on the full text catalog is not compatible with. This is unfortunate, and I'm still searching for a full text alternative that works with catalogs of integers.

How to optimize this query on a large database?

Query
SELECT id FROM `user_tmp`
WHERE `code` = '9s5xs1sy'
AND `go` NOT REGEXP 'http://www.xxxx.example.com/aflam/|http://xx.example.com|http://www.xxxxx..example.com/aflam/|http://www.xxxxxx.example.com/v/|http://www.xxxxxx.example.com/vb/'
AND check='done'
AND `dataip` <1319992460
ORDER BY id DESC
LIMIT 50
MySQL returns:
Showing rows 0 - 29 ( 50 total, Query took 21.3102 sec) [id: 2622270 - 2602288]
If I remove
AND dataip <1319992460
MySQL returns
Showing rows 0 - 29 ( 50 total, Query took 0.0859 sec) [id: 3637556 - 3627005]
and if there is no matching data, MySQL returns
MySQL returned an empty result set (i.e. zero rows). ( Query took 21.7332 sec )
Explain plan:
SQL query: Explain SELECT * FROM `user_tmp` WHERE `code` = '93mhco3s5y' AND `too` NOT REGEXP 'http://www.10neen.com/aflam/|http://3ltool.com|http://www.10neen.com/aflam/|http://www.10neen.com/v/|http://www.m1-w3d.com/vb/' and checkopen='2010' and `dataip` <1319992460 ORDER BY id DESC LIMIT 50;
Rows: 1
id  select_type  table     type   possible_keys  key      key_len  ref   rows  Extra
1   SIMPLE       user_tmp  index  NULL           PRIMARY  4        NULL  50    Using where
Example of the database used
CREATE TABLE IF NOT EXISTS user_tmp (
  id int(9) NOT NULL AUTO_INCREMENT,
  ip text NOT NULL,
  dataip bigint(20) NOT NULL,
  ref text NOT NULL,
  click int(20) NOT NULL,
  code text NOT NULL,
  too text NOT NULL,
  name text NOT NULL,
  checkopen text NOT NULL,
  contry text NOT NULL,
  vOperation text NOT NULL,
  vBrowser text NOT NULL,
  iconOperation text NOT NULL,
  iconBrowser text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=4653425;
--
-- Dumping data for table user_tmp
INSERT INTO `user_tmp` (`id`, `ip`, `dataip`, `ref`, `click`, `code`, `too`, `name`, `checkopen`, `contry`, `vOperation`, `vBrowser`, `iconOperation`, `iconBrowser`) VALUES
(1, '54.125.78.84', 1319506641, 'http://xxxx.example.com/vb/showthread.php%D8%AA%D8%AD%D9%85%D9%8A%D9%84-%D8%A7%D8%BA%D9%86%D9%8A%D8%A9-%D8%A7%D9%84%D8%A8%D9%88%D9%85-giovanni-marradi-lovers-rendezvous-3cd-1999-a-155712.html', 0, '4mxxxxx5', 'http://www.xxx.example.com/aflam/', 'xxxxe', '2010', 'US', 'Linux', 'Chrome 12.0.742 ', 'linux.png', 'chrome.png');
I want to know the correct way to write this query and optimize the database.
You don't have any indexes besides the primary key. You need to create indexes on the fields you use in your WHERE clause. Whether you need to index only one field or a combination of several fields depends on the other SELECTs you will be running against that table.
Keep in mind that REGEXP cannot use indexes at all; LIKE can use an index only when the pattern does not begin with a wildcard (so LIKE 'a%' can use an index, but LIKE '%a' cannot); and a range comparison (< or >) can use an index only up to that column, so it is of limited help here.
So you are left with the code and check fields. I suppose many rows will have the same value for check, so I would begin the index with the code field. Multi-column indexes can be used only in the order in which the columns are defined...
Imagine an index created on the fields (code, check). This index can be used in your query (whose WHERE clause contains both fields), and also in a query with only the code field, but not in a query with only the check field.
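A sketch of that composite index against the schema shown above; the prefix lengths are assumptions, required because both columns are TEXT (the query calls the second field check, while the table definition calls it checkopen):

ALTER TABLE user_tmp ADD INDEX idx_code_checkopen (code(20), checkopen(10));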
Is it important to ORDER BY id? If not, leave it out, it will prevent the sort pass and your query will finish faster.
I will assume you are using MySQL <= 5.1.
The answers above fall into two basic categories:
1. You are using the wrong column type
2. You need indexes
I will deal with each, as both are relevant for performance, which is ultimately what I take your question to be about:
Column Types
The difference between BIGINT/INT or INT/CHAR for the dataip question is basically not relevant to your issue; the fundamental issue has more to do with index strategy. However, when considering performance holistically, the fact that you are using MyISAM as the engine for this table leads me to ask whether you really need TEXT column types. If you have short character columns (say, less than 255 characters), then making them fixed-length columns will most likely increase performance. Keep in mind that if any one column is of variable length (VARCHAR, TEXT, etc.), then it is not worth changing any of them.
Vertical Partitioning
The fact to keep in mind here is that even though you are only requesting the id column from the standpoint of disk IO and memory you are getting the entire row back. Since so many of the rows are text, this could mean a massive amount of data. Any of these rows that are not used for lookups of users or are not often accessed could be moved into another table where the foreign key has a unique key placed on it keeping the relationship 1:1.
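A hedged sketch of that split; which columns to move out is an assumption based on the table definition above:

-- Secondary table holding the wide, rarely-filtered text columns, 1:1 with user_tmp via id.
CREATE TABLE user_tmp_details (
  id int(9) NOT NULL,
  ref text NOT NULL,
  name text NOT NULL,
  vOperation text NOT NULL,
  vBrowser text NOT NULL,
  iconOperation text NOT NULL,
  iconBrowser text NOT NULL,
  PRIMARY KEY (id)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

user_tmp would then keep only id plus the columns used in WHERE clauses, and the detail columns would be fetched with a join on id only when they are actually needed.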
Index Strategy
Most likely the problem is simply indexing, as noted above. The reason adding the "AND dataip < 1319992460" condition causes your current situation is that it forces a full table scan.
As stated above, placing all the columns in the WHERE clause into a single composite index will help. The order of the columns in the index will not matter so long as all of them appear in the WHERE clause.
However, the order could matter a great deal for other queries. A quick example would be an index made of (colA, colB). A query with "where colA = 'foo'" will use this index, but a query with "where colB = 'bar'" will not, because colB is not the leftmost column in the index definition. So, if you have other queries that use these columns in some combination, it is worth minimizing the number of indexes created on the table. This is because every index increases the cost of a write and uses disk space. Writes are expensive because of the necessary disk activity. Don't make them more expensive.
You need to add an index like this:
ALTER TABLE `user_tmp` ADD INDEX(`dataip`);
And if your column 'dataip' contains only unique values, you can add a unique key like this:
ALTER TABLE `user_tmp` ADD UNIQUE(`dataip`);
Keep in mind that adding an index can take a long time on a big table, so don't do it on a production server without testing.
You need to create an index covering the fields used in your WHERE clause, keeping in mind that the column order within the index determines which queries can use it; without such an index, the query cannot use an index at all.
Does dataip really need to be a BIGINT? According to MySQL, the signed range is -9223372036854775808 to 9223372036854775807 (it is a 64-bit number).
You need to choose the right column type for the job, and add the right type of index too; otherwise these queries will take forever.