MySQL InnoDB performance optimization and indexing

I have 2 databases and I need to link information between two big tables (more than 3M entries each, continuously growing).
The 1st database has a table 'pages' that stores various information about web pages, and includes the URL of each one. The column 'URL' is a varchar(512) and has no index.
The 2nd database has a table 'urlHops' defined as:
CREATE TABLE urlHops (
dest varchar(512) NOT NULL,
src varchar(512) DEFAULT NULL,
timestamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
KEY dest_key (dest),
KEY src_key (src)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Now, I need basically to issue (efficiently) queries like this:
select p.id,p.URL from db1.pages p, db2.urlHops u where u.src=p.URL and u.dest=?
At first, I thought of adding an index on pages(URL). But it's a very long column, and I already issue a lot of INSERTs and UPDATEs on that table (far more than the number of SELECTs I would run using this index).
Other possible solutions I thought are:
-adding a column to pages, storing the md5 hash of the URL and indexing it; this way I could do queries using the md5 of the URL, with the advantage of an index on a smaller column.
-adding another table that contains only page id and page URL, indexing both columns. But this is maybe a waste of space, having only the advantage of not slowing down the inserts and updates I execute on 'pages'.
I don't want to slow down the inserts and updates, but at the same time I would be able to do the queries on the URL efficiently. Any advice?
My primary concern is performance; if needed, wasting some disk space is not a problem.
Thank you, regards
Davide

The MD5 hash suggestion you had is very good - it's documented in High Performance MySQL 2nd Ed. There are a couple of tricks to get it to work:
CREATE TABLE urls (
id INT UNSIGNED NOT NULL primary key auto_increment,
url varchar(255) not null,
url_crc32 INT UNSIGNED not null,
INDEX (url_crc32)
);
Select queries have to look like this:
SELECT * FROM urls WHERE url='http://stackoverflow.com' AND url_crc32=crc32('http://stackoverflow.com');
The url_crc32 condition is designed to use the index; including url in the WHERE clause prevents false matches from hash collisions.
I'd probably recommend crc32 over md5. There will be a few more collisions, but you have a higher chance of fitting the whole index in memory.
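For completeness, here is a minimal sketch of maintaining and querying the hash column, assuming the urls table above; a trigger or a generated column could keep url_crc32 in sync if you don't want to compute it in application code:
INSERT INTO urls (url, url_crc32)
VALUES ('http://stackoverflow.com', CRC32('http://stackoverflow.com'));
-- the hash narrows the lookup via the index, the string comparison filters out collisions
SELECT id, url FROM urls
WHERE url_crc32 = CRC32('http://stackoverflow.com')
AND url = 'http://stackoverflow.com';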

If pages to URLs is a 1-to-1 relationship and that table has a unique id (primary key?), you could store that id value in the src and dest fields in the urlHops table instead of the full URL.
This would make indexing and joins much more efficient.

I would create a page_url table that has auto-inc integer primary key, and your URL value. Then update Pages and urlHops to use page_url.id.
Your urlHops would become (dest int,src int,...)
Your Pages table would replace url with pageid.
Index page_url.url field, and you should be good to go.
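A rough sketch of that layout, with illustrative names that are not from the original schema:
CREATE TABLE page_url (
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
url VARCHAR(512) NOT NULL,
UNIQUE KEY uq_url (url)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
-- urlHops stores integer ids (dest INT UNSIGNED, src INT UNSIGNED) with the same keys,
-- and pages replaces URL with a pageid column referencing page_url.id
SELECT p.id, pu.url
FROM urlHops u
JOIN pages p ON p.pageid = u.src
JOIN page_url pu ON pu.id = p.pageid
WHERE u.dest = (SELECT id FROM page_url WHERE url = ?);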

Related

How to save varchar field for query performance in mysql innodb

I made the following sample table.
CREATE TABLE `User` (
`userId` INT(11) NOT NULL AUTO_INCREMENT,
`userKey` VARCHAR(20) NOT NULL,
PRIMARY KEY (`userId`),
UNIQUE KEY (`userKey`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
I want faster query speed on the userKey field.
SELECT * FROM `User` WHERE userKey='KEY123'
In order to do that, what else should I consider besides the index?
For example, when saving the userKey, would using a sortable value such as a date as a prefix be beneficial to the search speed?
Or, if I save the userKey as a hash value, will that improve it?
If neither of those helps, is the index alone enough for the above query?
You've declared userKey as UNIQUE, so MySQL will automatically create an index for you. That should generally be enough, given that you are searching by "exact match" or by "starting with" type of queries.
If you consider storing hash values, you would only be able to perform "exact match" queries. However, for short strings that may turn out to be not really worth it. Also, with hashes, you need to consider the risk of a collision. If you use a smaller hash value (say 4 bytes) that risk increases. Those are just some guidelines. It's not possible to provide a definitive solution without delving into more details of your case.
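If you do want to experiment with the hash idea anyway, a sketch might look like this (the column and index names are illustrative, and the generated column assumes MySQL 5.7+; a plain column maintained by the application works the same way):
ALTER TABLE `User`
ADD COLUMN `userKeyCrc` INT UNSIGNED AS (CRC32(`userKey`)) STORED,
ADD INDEX `idx_userKeyCrc` (`userKeyCrc`);
-- the hash narrows the search via the index, the string comparison filters out collisions
SELECT * FROM `User`
WHERE `userKeyCrc` = CRC32('KEY123')
AND `userKey` = 'KEY123';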
One way to speed up the table is to get rid of it.
Building a mapping from an under-20-character string to a 4-byte INT does not save much space. Getting rid of the lookup saves a little time.
If you need to do the 'right thing' and have one copy of each string to avoid redundancy, then what you have is probably the best possible.
If there are more columns, then there may be a slight improvement.
But first, answer this: do you do the lookup by UserKey more often than by UserId?
I give a resounding NO to a Hash. (Such might be necessary if the strings were too big to index.)
A BTree is very efficient at doing a "point query" (FROM User WHERE UserId = 123 or FROM User WHERE UserKey = 'xyz'). Adding a date or hash would only make it slower.
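As a quick sanity check (a sketch, not from the original answer), EXPLAIN should confirm a single-row lookup on the unique index:
EXPLAIN SELECT * FROM `User` WHERE userKey = 'KEY123';
-- expect access type "const", key = the UNIQUE index on userKey, rows = 1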

Implementing a composite index

I've been reading about how a composite index can improve performance but am still unclear on a few things. I have an InnoDB database that has over 20 million entries with 8 data points each. Its performance has dropped substantially in the past few months. The server has 6 cores with 4 GB of memory, which will be increased soon, but there's no indication on the server that I'm running low on memory. InnoDB settings have been changed in my.cnf to:
innodb_buffer_pool_size = 1000M
innodb_log_file_size = 147M
These settings have helped in the past. So, my understanding is that many factors can contribute to the performance decrease, including the fact that I originally had no indexing at all. Indexing methods are predicated on the type of queries that are run. So, this is my table:
CREATE TABLE `cdr_records` (
`dateTimeOrigination` int(11) DEFAULT NULL,
`callingPartyNumber` varchar(50) DEFAULT NULL,
`originalCalledPartyNumber` varchar(50) DEFAULT NULL,
`finalCalledPartyNumber` varchar(50) DEFAULT NULL,
`pkid` varchar(50) NOT NULL DEFAULT '',
`duration` int(11) DEFAULT NULL,
`destDeviceName` varchar(50) DEFAULT NULL,
PRIMARY KEY (`pkid`),
KEY `dateTimeOrigination` (`dateTimeOrigination`),
KEY `callingPartyNumber` (`callingPartyNumber`),
KEY `originalCalledPartyNumber` (`originalCalledPartyNumber`),
KEY `finalCalledPartyNumber` (`finalCalledPartyNumber`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
So, typically, a query will take a value and search callingPartyNumber, originalCalledPartyNumber, and finalCalledPartyNumber to find any entries related to it in the table. So it wouldn't make much sense to use individual indexes as illustrated above, because I typically don't run single-column queries like that. However, I have another job in the evenings that is basically:
select * from cdr_records;
In this case, it sounds like it would be a good idea to have another composite index with all columns in it. Am I understanding this correctly?
The benefit of a composite index comes when you need to select/sort/group based on multiple columns together, in the same order as the index.
I remember there was a very good example with a phone book analogy I read somewhere. As the names in a phone book are ordered alphabetically it is very easy for you to sort through them and find the one you need based on the letters of the name from left to right. You can imagine that is a composite index of the letters in the names.
If the names were ordered only by the first letter and subsequent letters were chaotic (single column index) you would have to go through all records after you find the first letter, which will take a lot of time.
With a composite index, you can start from left to right and very easily find the record you are looking for. This is also the reason why you can't use, for example, only the second or third column of a composite index: you need the previous ones in order for it to work. Imagine trying to find all names whose third letter is "a" in the phone book; it would be a nightmare, and you would need a separate index just for that, which is exactly what you have to do if you want to use a column from a composite index without the columns before it.
Bear in mind that the phone book example assumes that each letter of the names is a separate column, that could be a little confusing.
The other great strength of composite indexes is unique composite indexes, which allow you to apply stricter logical constraints on your data, which is very handy when you need it. This has nothing to do with performance, but I thought it was worth mentioning.
In your question, your SQL has no criteria, so no index will be used. It is always recommended to use EXPLAIN to see what is going on; you can never be sure!
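As a sketch of that check (the number searched for here is made up), EXPLAIN on the kind of lookup the question describes shows whether the existing single-column indexes are being combined:
EXPLAIN SELECT *
FROM cdr_records
WHERE callingPartyNumber = '5551234'
OR originalCalledPartyNumber = '5551234'
OR finalCalledPartyNumber = '5551234';
-- with the three single-column indexes already defined, the plan can use an
-- index_merge union; a composite index over all columns would not help an OR
-- across different columns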
No, it's not a good idea to create a composite index over all fields.
Which fields you put in one or more indexes depends on your queries.
Note:
MySQL generally uses only one index per table in a query, and it can use a composite index only if the fields are used from the left side onward.
You do not need to use all of the fields.
Example:
if you have an index x on the fields name, street, number, then this index will be used when you query (in the WHERE clause) on
name or
name and street or
name, street and number
but not if you search only on
street or
street and number.
To find out whether your index works well with your query, put EXPLAIN before the query and you can see which indexes it uses.
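A minimal sketch of that leftmost-prefix rule, using an illustrative table that is not from the question:
CREATE TABLE address_book (
name VARCHAR(50),
street VARCHAR(50),
number INT,
KEY x (name, street, number)
) ENGINE=InnoDB;
-- can use index x (leftmost prefix present):
EXPLAIN SELECT * FROM address_book WHERE name = 'Smith';
EXPLAIN SELECT * FROM address_book WHERE name = 'Smith' AND street = 'Main St';
-- cannot use index x (leftmost column missing):
EXPLAIN SELECT * FROM address_book WHERE street = 'Main St' AND number = 7;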

Optimizing MySQL Table Structure and impact of row size

One of my database tables has grown quite large, to the point where I think it is impacting the performance on my site (it is definitely making backups a lot slower).
It has ~13,000,000 rows and is 4.2 GiB in size, of which 1.2 GiB is data.
The structure looks like this:
CREATE TABLE IF NOT EXISTS `t1` (
`id` int(10) unsigned NOT NULL,
`int2` int(10) unsigned NOT NULL,
`int3` int(10) unsigned NOT NULL,
`int4` int(10) unsigned NOT NULL,
`char1` varchar(255) NOT NULL,
`int5` int(10) NOT NULL,
`char2` varchar(1024) DEFAULT NULL,
`char3` varchar(1024) NOT NULL,
PRIMARY KEY (`id`,`int2`,`int3`,`int4`),
KEY `key1` (`id`,`int2`,`char1`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Common operations in this table are insert and selects, rows are never updated and rarely deleted. int2 is a running version number, which means usually only the rows with the highest value of int2 for that id are selected.
I have been thinking of several ways of optimizing this and I was wondering which one would be the best one to pursue:
char1 (which is in the index) actually only contains about 40,000 different strings. I could move the strings into a second table (idchar -> char) and then just save the id in my main table, at the cost of an additional id lookup step during inserts and selects.
char2 and char3 are often empty. I could move them to a separate table that I would then do a LEFT JOIN on in selects.
Even if char2 and char3 contain data they are usually shorter than 1024 chars. I could probably shorten these to ~200.
Which one of these do you think is the most promising? Does decreasing the row size (either by making char1 into an integer or by removing/resizing columns) in MySQL InnoDB tables actually have a big impact on performance?
Thanks
There are several options. From what you say, moving char1 to another table seems quite reasonable. The additional lookup could, under some circumstances, even be faster than storing the raw data in the tables. (This occurs when the repeated values cause the table to be larger than necessary, especially when the larger table might be larger than available memory.) And, this would save space both in the data table and the corresponding index.
The exact impact on performance is hard to say, without understanding much more about your system and the query load.
Moving char2 and char3 to another table will have minimal impact. The overhead of the link to the other table would eat up any gains in space. You could save a couple of bytes per record by storing them as varchar(255) rather than varchar(1024).
If you have a natural partitioning key, then partitioning is definitely an option, particularly for reducing the time for backups. This is very handy for a transaction-style table, where records are inserted and never or very rarely modified. If, on the other hand, the records contain customer records and any could be modified at any time, then you would still need to back up all the partitions.
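A rough sketch of the char1 lookup, with illustrative names (char1_values, char1_id) that are not part of the original schema:
CREATE TABLE char1_values (
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
char1 VARCHAR(255) NOT NULL,
UNIQUE KEY uq_char1 (char1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- t1 would then carry char1_id INT UNSIGNED instead of char1, and key1 becomes
-- (id, int2, char1_id), which is far smaller than indexing the full string
SELECT t.id, t.int2, c.char1
FROM t1 t
JOIN char1_values c ON c.id = t.char1_id
WHERE t.id = 42;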
There are several factors that could affect the performance of your DB. Partitioning is definitely the best option, but it cannot always be done. If you are searching char1 before an insert, then partitioning can be a problem because you have to search all of the partitions for the key. You must analyze how the data is generated and, most importantly, how you query this table. That is the key, so you should post the queries you run against this table. In the case of char2 and char3, moving them to another table won't make any difference. You should also mention the physical distribution of your data. Are you using a single data file? Are the data files on the same physical disk as the OS? Give more details so we can give you more help.

MySQL query very slow because of BLOB field (that can't be moved in another table)

I am developing a PyQt application based on a MySQL database. The database contains some recorded electrical signals, and all the information describing these signals (sampling rate, date of recording, etc.).
To give an idea, one database contains between 10,000 and 100,000 rows, and the total size is >10 GB. All these data are stored on a dedicated server. In fact, most of the data is the signal itself, which is in a BLOB field called analogsignal.signal (see below).
here is the architecture of the database : http://packages.python.org/OpenElectrophy/_images/simple_diagram1.png
I can't change it (I can add columns and indexes, but I can not move or delete existing columns).
In the software, I need to list all the analogsignal columns (id, name, channel, t_start, sampling_rate), except analogsignal.signal, which is loaded later via analogsignal.id. So I'm doing the following query:
SELECT block.id, block.datetime, segment.id, analogsignal.id, analogsignal.name, analogsignal.channel, analogsignal.sampling_rate, block.fileOrigin, block.info
FROM segment, block, analogsignal
WHERE block.id=segment.id_block
AND segment.id=analogsignal.id_segment
ORDER BY analogsignal.id
The problem is, my queries are very slow (>10 min if the request is not in cache) because of the presence of the analogsignal.signal column. If I understand correctly what's happening, the table is read row by row, including analogsignal.signal, even though analogsignal.signal is not in the SELECT list.
Does anyone have an idea how to optimize the database or the query without moving the BLOB into another table (which I agree would be more logical, but I do not control this point)?
Thank you!
Here's the CREATE TABLE command for the AnalogSignal table (pulled/formatted from comment)
CREATE TABLE analogsignal
( id int(11) NOT NULL AUTO_INCREMENT,
id_segment int(11) DEFAULT NULL,
id_recordingpoint int(11) DEFAULT NULL,
name text,
channel int(11) DEFAULT NULL,
t_start float DEFAULT NULL,
sampling_rate float DEFAULT NULL,
signal_shape varchar(128) DEFAULT NULL,
signal_dtype varchar(128) DEFAULT NULL,
signal_blob longblob,
Tag text,
PRIMARY KEY (id),
KEY ix_analogsignal_id_recordingpoint (id_recordingpoint),
KEY ix_analogsignal_id_segment (id_segment)
) ENGINE=MyISAM AUTO_INCREMENT=34798 DEFAULT CHARSET=latin1 ;
EDIT: Problem solved, here are the key points:
-I had to add a multiple-column index, of type INDEX, on all the SELECT fields in the analogsignal table
-The columns of TEXT type blocked the use of the index. I converted these TEXT fields to VARCHAR(xx). For this I used this simple command:
SELECT MAX(LENGTH(field_to_query)) FROM table_to_query
to check the maximum existing text length before conversion, to be sure that I would not lose any data
ALTER TABLE table_to_query CHANGE field_to_query field_to_query VARCHAR(24)
I first used VARCHAR(8000), but with this setting VARCHAR behaved like a TEXT field and indexing didn't work. No such problem with VARCHAR(24). If I'm right, the total key length (all indexed fields included) must not exceed 1000 bytes
Then I indexed all the columns as said above, with no size parameter in the index
Finally, using a better query structure (thank you DRapp) also improved the query.
I went from 215 s to 0.016 s for the query, with no cache...
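Putting the steps from the edit together, the changes might look roughly like this (the VARCHAR size is the one the poster settled on; check MAX(LENGTH(...)) on your own data first, and the index mirrors the one suggested in the answer below):
-- shrink the indexed TEXT column so it fits in a key
ALTER TABLE analogsignal CHANGE name name VARCHAR(24);
-- covering index over the columns the SELECT needs, so the query can be
-- answered from the index without reading past the blob in each row
ALTER TABLE analogsignal
ADD INDEX ix_select_fields (id, id_segment, name, channel, sampling_rate);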
In addition to trying to shrink your "blob" column requirements by putting the data in an external physical file and just storing the path\file name in the corresponding record, I would try the following as an alternative...
I would reverse the query and put your AnalogSignal table first, as it is the basis of the ORDER BY clause, and work backwards to the blocks. Also, to prevent having to read every literal row of data, build a compound index on all columns you want in your output. It makes for a larger index, but then the query will pull the values directly from the index instead of reading back to the actual rows of data.
create index KeyDataOnly on AnalogSignal ( id, id_segment, name, channel, sampling_rate )
SELECT STRAIGHT_JOIN
block.id,
block.datetime,
segment.id,
analogsignal.id,
analogsignal.name,
analogsignal.channel,
analogsignal.sampling_rate,
block.fileOrigin,
block.info
FROM
analogsignal
JOIN Segment
on analogsignal.id_segment = segment.id
JOIN block
on segment.id_block = block.id
ORDER BY
analogsignal.id
If you cannot delete the BLOB column, do you have to fill it? You could add a column for storing the path/to/filename of your signal and then put all your signal files in the appropriate directory(s). Once that's done, set your BLOB field values to null.
It's probably breaking the spirit of the restrictions you're under. But arbitrary restrictions often need to be circumvented.
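A sketch of that workaround, assuming an illustrative column name (signal_path) and that the signal files have already been written out to disk:
ALTER TABLE analogsignal ADD COLUMN signal_path VARCHAR(255) DEFAULT NULL;
-- once the files exist and signal_path is populated, free the blob storage
UPDATE analogsignal SET signal_blob = NULL;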
So, according to the comments, I'm sure your problem is caused by the MyISAM storage engine and how it stores the data. toxicate20 is right. MySQL has to skip over those big blobs anyway, which is not efficient. You can change the storage engine to InnoDB, which will help a lot with this problem. InnoDB will only read the blob data if you explicitly ask for it in the SELECT ... part.
ALTER TABLE analogsignal ENGINE=InnoDB;
This will take a while but helps a lot in performance. You can read more about InnoDB file formats here:
http://dev.mysql.com/doc/innodb/1.1/en/innodb-row-format-antelope.html
http://dev.mysql.com/doc/innodb/1.1/en/innodb-row-format-dynamic.html
Disclaimer: If you use fulltext search (MATCH ... AGAINST http://dev.mysql.com/doc/refman/5.5/en//fulltext-search.html) on any of the columns in the table you cannot change it to InnoDB.
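A quick way to check for that before converting (a sketch using information_schema, with the table name from the question):
SELECT INDEX_NAME, COLUMN_NAME
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = DATABASE()
AND TABLE_NAME = 'analogsignal'
AND INDEX_TYPE = 'FULLTEXT';
-- no rows returned means it is safe to switch the engine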
As the analog signal column is pretty large, the query will take a long time because it has to skip over it (or jump over it, if you see it metaphorically) when doing a select. What I would do is the following: instead of having a blob in the database, generate binary files via
$fh = fopen("analogfile.spec", 'w') or die("can't open file");
$data = $yourAnalogDataFromSomewhere;
fwrite($fh, $data);
fclose($fh);
The filename could be derived from the ID of the row, for example. Instead of the blob you would just store the file path within your server directory structure.
This way your query will run very fast as it does not have to skip the big chunks of data in the blob column.

best way to store url in mysql for a read&write-intensive application

What is the best way to store url in mysql effectively for a read&write-intensive application?
I will be storing over 500,000 web addresses (all starting with either http:// or https://; no other protocols), and saving the whole URL (http://example.com/path/?variable=a) into one column seems largely redundant because the same domain name and path will be saved to MySQL multiple times.
So, initially, I thought of breaking them down (i.e. domain, path, and variables, etc) to get rid of redundancy. But I saw some posts saying that it's not recommended. Any idea on this?
Also, the application often has to retrieve URLs without primary keys, meaning it has to search by the URL text itself. The URL can be indexed, but I'm wondering how much performance difference there would be between storing the whole URL and a broken-down URL if they are all indexed under InnoDB (no full-text indexing).
A broken-down URL would require extra steps to recombine the parts. It would also mean retrieving data from four different tables (protocol, domain, path, variables), but it makes the stored data in each row shorter and there would be fewer rows in each table. Would this possibly speed up the process?
I have dealt with this extensively, and my general philosophy is to use the frequency of use method. It's cumbersome, but it lets you run some great analytics on the data:
CREATE TABLE URL (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
DomainPath integer unsigned NOT NULL,
QueryString text
) Engine=MyISAM;
CREATE TABLE DomainPath (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
Domain integer unsigned NOT NULL,
Path text,
UNIQUE (Domain, Path(255))  -- a TEXT column in a key needs a prefix length
) Engine=MyISAM;
CREATE TABLE Domain (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
Protocol tinyint NOT NULL,
Domain varchar(64),
Port smallint NULL,
UNIQUE (Protocol,Domain,Port)
) Engine=MyISAM;
As a general rule, you'll have similar Paths on a single Domain, but different QueryStrings for each path.
I originally designed this to have all parts indexed in a single table (Protocol, Domain, Path, Query String), but I think the above is less space-intensive and lends itself better to getting good analytics out of it.
text tends to be slow, so you can change "Path" to a varchar after some use. Most servers die after about 1K for a URL, but I've seen some large ones and would err on the side of not losing data.
Your retrieval query is cumbersome, but if you abstract it away in your code, no issue:
SELECT CONCAT(
IF(D.Protocol=0,'http://','https://'),
D.Domain,
IF(D.Port IS NULL,'',CONCAT(':',D.Port)),
'/', DP.Path,
IF(U.QueryString IS NULL,'',CONCAT('?',U.QueryString))
)
FROM URL U
INNER JOIN DomainPath DP ON U.DomainPath=DP.ID
INNER JOIN Domain D on DP.Domain=D.ID
WHERE U.ID=$DesiredID;
Store a port number if it's non-standard (non-80 for http, non-443 for https); otherwise store it as NULL to signify it shouldn't be included. (You can add the logic in MySQL, but it gets much uglier.)
I would always (or never) strip the "/" from the Path as well as the "?" from the QueryString for space savings. The only loss would be the ability to distinguish between
http://www.example.com/
http://www.example.com/?
If that distinction is important, then I would change tack to never strip them and just include them. Technically,
http://www.example.com
http://www.example.com/
Are the same, so stripping the Path slash is always OK.
So, to parse:
http://www.example.com/my/path/to/my/file.php?id=412&crsource=google+adwords
We would use something like parse_url in PHP to produce:
array(
[scheme] => 'http',
[host] => 'www.example.com',
[path] => '/my/path/to/my/file.php',
[query] => 'id=412&crsource=google+adwords',
)
You would then check/insert (with appropriate locks, not shown):
SELECT D.ID FROM Domain D
WHERE
D.Protocol=0
AND D.Domain='www.example.com'
AND D.Port IS NULL
(if it doesn't exist)
INSERT INTO Domain (
Protocol, Domain, Port
) VALUES (
0, 'www.example.com', NULL
);
We then have our $DomainID going forward...
Then insert into DomainPath:
SELECT DP.ID FROM DomainPath DP WHERE
DP.Domain=$DomainID AND Path='/my/path/to/my/file.php';
(if it doesn't exist, insert it similarly)
We then have our $DomainPathID going forward...
SELECT U.ID FROM URL U
WHERE
DomainPath=$DomainPathID
AND QueryString='id=412&crsource=google+adwords'
and insert if necessary.
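A sketch of that "insert if necessary" step for the URL table (not atomic without the locks mentioned earlier; the variable names follow the answer's convention):
INSERT INTO URL (DomainPath, QueryString)
SELECT @DomainPathID, 'id=412&crsource=google+adwords'
FROM DUAL
WHERE NOT EXISTS (
SELECT 1 FROM URL
WHERE DomainPath = @DomainPathID
AND QueryString = 'id=412&crsource=google+adwords'
);
-- then fetch the id either way
SELECT ID INTO @URLID
FROM URL
WHERE DomainPath = @DomainPathID
AND QueryString = 'id=412&crsource=google+adwords';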
Now, an important note: the above scheme will be slow for high-performance sites. You should modify everything to use a hash of some sort to speed up SELECTs. In short, the technique is like:
CREATE TABLE Foo (
ID integer unsigned PRIMARY KEY NOT NULL AUTO_INCREMENT,
Hash varbinary(16) NOT NULL,
Content text,
INDEX (Hash)
) Engine=MyISAM;
SELECT ID FROM Foo WHERE Hash=UNHEX(MD5('id=412&crsource=google+adwords'));
I deliberately left the hash out of the schema above to keep it simple, but comparing one TEXT value to another in selects is slow and breaks for really long query strings. Don't use a fixed-length index either, because that will also break. For arbitrary-length strings where accuracy matters, a hash's small failure rate is acceptable.
Finally, if you can, do the MD5 hash client side to save sending large blobs to the server just to do the MD5 operation. Most modern languages support MD5 built-in:
SELECT ID FROM Foo WHERE Hash=UNHEX('82fd4bcf8b686cffe81e937c43b5bfeb');
But I digress.
It really depends on what you want to do with the data. If you're doing statistics with the URLs, e.g. to see which domains are the most popular, then it would make sense to break them down that way. But if you are just storing them and accessing each URL in its entirety, there's no reason to split them up.
I've seen some people hash long strings (e.g. with md5) and search against that; there might be a performance gain for URLs, but I'm not certain by how much (it helps more for long text).
Whatever you do - don't forget to always use ints as primary keys as much as possible as those are the quickest lookups.
If you really want to split your URLs, you may want to consider keeping separate tables in order to not lock your table (in InnoDB it doesn't matter, as the table doesn't get locked up), but with separate tables you can just use foreign/primary keys/ints to reference the rows you need.
A good read is friendfeed's blog entry - that might give you some ideas as well: