Suggestions for string lookup optimizations - MySQL

We have the MySQL table below with about 75,000 entries. Each entry represents a symbol in the system for which further data can be retrieved. The table is queried for autocomplete purposes: a user looks up a symbol, which is matched against either the symbol's name or its tags (a semicolon-separated list of strings). When the user selects the correct symbol, the relevant data is fetched. Here is the table's description:
CREATE TABLE `symbols` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(512) NOT NULL,
`tags` varchar(512) DEFAULT NULL,
`type` enum('1','2','3','4','5','6','7','8','9') NOT NULL,
`popularity` int(11) DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `uc_symbol_name` (`type`,`name`),
KEY `name_idx` (`name`),
KEY `type_popularity_idx` (`type`,`popularity`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The above table is stored, alongside copious amounts of data, on a backend machine which serves this data over a JSON API. Currently, our frontend JavaScript code queries the backend server directly via AJAX to do the autocomplete. Instead, to speed things up, we want to create a locally cached version of the symbols table on the server the frontend is served from (the frontend is written in Django). This is possible since the table contains under 100,000 symbols and only gets updated about once a minute. Furthermore, it would allow us to implement better matching algorithms like Levenshtein distance.
How would you go about creating this kind of cached symbol table? Obviously the lookup will have to happen in code (probably Python), but how would you store the data, and how would you sync it once a minute? We have a Redis server running on the Django frontend server, but that introduces the question of persistence... Any thoughts are very welcome!

Just use a simple hash table, together with a "last updated time". Every time you do a lookup in the hash, check the "last updated" time. If it is more than a minute in the past, dump the data in the hash and reload it from the DB. Of course you have to make sure to avoid race conditions...
There are other alternatives, but this is the simplest way and will be the easiest to code correctly. If you find it unacceptable that one request per minute pays the extra latency of a big DB operation, you can come up with something a bit more complicated (such as running the DB refresh asynchronously, on a background thread). To prepare for that eventuality, encapsulate this code in a class. (Then, if it's too slow, you can play with the implementation without affecting any of your other code.)
Of course, there are other things you could do if you need more performance. You could add an updated_time column to your DB records, and then only load the ones which have been updated since last time. Will this actually make things faster? If it does, will the difference be big enough to even matter? There's no way to know unless you try it out. That's why it's better to try the simpler solution first and see whether it reaches your performance goals.
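If you do go the incremental route, a rough sketch of the schema change and refresh query could look like this (the column and index names are only illustrative, and this assumes a MySQL version that allows ON UPDATE CURRENT_TIMESTAMP on the column):
ALTER TABLE `symbols`
  ADD COLUMN `updated_time` TIMESTAMP NOT NULL
    DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  ADD KEY `updated_time_idx` (`updated_time`);

-- On each refresh cycle, pull only the rows changed since the timestamp the
-- cache recorded on its previous sync (bound as a parameter by the application).
SELECT `id`, `name`, `tags`, `type`, `popularity`
FROM `symbols`
WHERE `updated_time` >= ?;
The cache would then merge those rows into its in-memory structure instead of rebuilding it from scratch every minute (note this won't pick up deleted rows, so an occasional full reload is still useful).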

Related

Keeping video viewing statistics breakdown by video time in a database

I need to keep a number of statistics about the videos being watched, and one of them is what parts of the video are being watched most. The design I came up with is to split the video into 256 intervals and keep the floating-point number of views for each of them. I receive the data as a number of intervals the user watched continuously. The problem is how to store them. There are two solutions I see.
Row per every video segment
Let's have a database table like this:
CREATE TABLE `video_heatmap` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`video_id` int(11) NOT NULL,
`position` tinyint(3) unsigned NOT NULL,
`views` float NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_lookup` (`video_id`,`position`)
) ENGINE=MyISAM
Then, whenever we have to process a number of views, make sure the respective database rows exist and add the appropriate values to the views column. I found it's a lot faster if the existence of the rows is taken care of first (SELECT COUNT(*) of rows for a given video and INSERT IGNORE if any are missing), and then a series of update queries like this is used:
UPDATE video_heatmap
SET views = views + ?
WHERE video_id = ? AND position >= ? AND position < ?
This seems, however, a little bloated. The other solution I came up with is
Row per video, update in transactions
A table will look (sort of) like this:
CREATE TABLE video (
id INT NOT NULL AUTO_INCREMENT,
heatmap VARBINARY(1024) NOT NULL, -- 4 bytes * 256 segments
...
) ENGINE=InnoDB
Then, every time a view needs to be stored, it is done in a transaction with a consistent snapshot, in a sequence like this (a rough SQL sketch of the sequence follows the list):
If the video doesn't exist in the database, it is created.
The row is retrieved, and heatmap, an array of floats stored in binary form, is converted into a form more friendly for processing (in PHP).
Values in the array are increased appropriately and the array is converted back.
The row is written back via an UPDATE query.
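In SQL terms, assuming the packing/unpacking happens in PHP and the placeholder is bound by the application, the sequence is roughly:
START TRANSACTION WITH CONSISTENT SNAPSHOT;

-- (If the video row doesn't exist yet, it is created first.)

-- Read the packed array of 256 floats (id 42 is just an example).
SELECT heatmap FROM video WHERE id = 42;

-- (PHP unpacks the floats, adds the watched intervals, and packs them again.)

UPDATE video SET heatmap = ? WHERE id = 42;

COMMIT;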
So far the advantages can be summed up like this:
First approach
Stores data as floats, not as some magical binary array.
Doesn't require transaction support, so doesn't require InnoDB, and we're using MyISAM for everything at the moment, so there won't be any need to mix storage engines. (only applies in my specific situation)
Doesn't require a transaction WITH CONSISTENT SNAPSHOT. I don't know what the performance penalties of those are.
I already implemented it and it works. (only applies in my specific situation)
Second approach
Uses a lot less storage space (the first approach stores the video ID 256 times and a position for every segment of the video, not to mention the primary key).
Should scale better, because of InnoDB's per-row locking as opposed to MyISAM's table locking.
Might generally work faster because far fewer queries are being made.
Easier to implement in code (although the other one is already implemented).
So, what should I do? If it wasn't for the rest of our system using MyISAM consistently, I'd go with the second approach, but currently I'm leaning to the first one. But maybe there are some reasons to favour one approach or another?
The second approach looks tempting at first sight, but it makes queries like "how many views for segment x of video y" unable to use an index on video.heatmap. Not sure if this is a real-life concern for you, though. Also, you would have to parse the entire array back and forth every time you need data for one segment only.
But first and foremost, your second solution is hackish (but interesting nonetheless). I wouldn't recommend denormalising your database until you face an actual performance issue.
Also, try populating the video_heatmap table in advance with views = 0 as soon as a video is inserted (a trigger can help).
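A rough sketch of such a trigger, assuming the parent table is called video (the trigger and table names are illustrative):
DELIMITER //
CREATE TRIGGER trg_video_heatmap_init
AFTER INSERT ON video
FOR EACH ROW
BEGIN
  DECLARE pos INT DEFAULT 0;
  -- Pre-create all 256 segment rows with zero views for the new video.
  WHILE pos < 256 DO
    INSERT IGNORE INTO video_heatmap (video_id, position, views)
    VALUES (NEW.id, pos, 0);
    SET pos = pos + 1;
  END WHILE;
END//
DELIMITER ;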
If space is really a concern, remove your surrogate key video_heatmap.id and instead make (video_id, position) the primary key (then get rid of the superfluous UNIQUE constraint). But this shouldn't come into the equation: 256 x 12 bytes per video (a rough row length with 3 numeric columns; okay, add some for the index) is only an extra 3 KB per video!
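That slimmed-down version of the first-approach table would look roughly like this:
CREATE TABLE `video_heatmap` (
`video_id` int(11) NOT NULL,
`position` tinyint(3) unsigned NOT NULL,
`views` float NOT NULL DEFAULT 0,
PRIMARY KEY (`video_id`,`position`)
) ENGINE=MyISAM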
Finally, nothing prevents you from switching your current table to InnoDB and leveraging its row-level locking capability.
Please note I fail to understand why views cannot be an UNSIGNED INT. I would recommend changing this type.

MySQL query very slow because of BLOB field (that can't be moved to another table)

I am developing PyQt software based on a MySQL database. The database contains some recorded electrical signals and all the information describing these signals (sampling rate, date of recording, etc.).
To give an idea, one database contains between 10,000 and 100,000 rows, and the total size is >10 GB. All these data are stored on a dedicated server. In fact, most of the data is the signal itself, which is in a BLOB field called analogsignal.signal (see below).
Here is the architecture of the database: http://packages.python.org/OpenElectrophy/_images/simple_diagram1.png
I can't change it (I can add columns and indexes, but I can not move or delete existing columns).
In the software, I need to list all the analogsignal columns (id, name, channel, t_start, sampling_rate) except analogsignal.signal, which is fetched later via analogsignal.id. So I'm doing the following query:
SELECT block.id, block.datetime, segment.id, analogsignal.id, analogsignal.name, analogsignal.channel, analogsignal.sampling_rate, block.fileOrigin, block.info
FROM segment, block, analogsignal
WHERE block.id=segment.id_block
AND segment.id=analogsignal.id_segment
ORDER BY analogsignal.id
The problem is, my queries are very slow (>10 min if the request is not in cache) because of the presence of the analogsignal.signal column. If I understand correctly what's happening, the table is read line by line, including analogsignal.signal, even though analogsignal.signal is not in the SELECT list.
Does anyone have an idea how to optimize the database or the query without moving the BLOB to another table (which I agree would be more logical, but I do not control this point)?
Thank you!
Here's the CREATE TABLE command for the AnalogSignal table (pulled/formatted from comment)
CREATE TABLE analogsignal
( id int(11) NOT NULL AUTO_INCREMENT,
id_segment int(11) DEFAULT NULL,
id_recordingpoint int(11) DEFAULT NULL,
name text,
channel int(11) DEFAULT NULL,
t_start float DEFAULT NULL,
sampling_rate float DEFAULT NULL,
signal_shape varchar(128) DEFAULT NULL,
signal_dtype varchar(128) DEFAULT NULL,
signal_blob longblob,
Tag text,
PRIMARY KEY (id),
KEY ix_analogsignal_id_recordingpoint (id_recordingpoint),
KEY ix_analogsignal_id_segment (id_segment)
) ENGINE=MyISAM AUTO_INCREMENT=34798 DEFAULT CHARSET=latin1 ;
EDIT: Problem solved, here are the key points:
- I had to add a multiple-column index (type INDEX) on all the SELECT fields in the analogsignal table.
- The columns of TEXT type blocked the use of the index. I converted these TEXT fields to VARCHAR(xx). For this I used this simple command:
SELECT MAX(LENGTH(field_to_query)) FROM table_to_query
to check the maximum existing text length before conversion, to be sure that I would not lose any data, then:
ALTER TABLE table_to_query CHANGE field_to_query field_to_query VARCHAR(24)
I first used VARCHAR(8000), but with this setting VARCHAR behaved like a TEXT field and indexing didn't work. No such problem with VARCHAR(24). If I'm right, the total key length (all fields included) must not exceed 1000 bytes.
- Then I indexed all the columns as said above, with no prefix length in the index.
- Finally, using a better query structure (thank you DRapp) also improved the query.
The query went from 215 s to 0.016 s, with no cache...
In addition to trying to shrink your "blob" column requirements by putting the data in an external physical file and just storing the path/filename in the corresponding record, I would try the following as an alternative...
I would reverse the query and put your AnalogSignal table first, since it is the basis of the ORDER BY clause, and join back out to the blocks. Also, to avoid having to read every literal row of data, if you build a compound index on all the columns you want in your output, it makes for a larger index, but then the query can pull the values directly from the index instead of reading back through the actual rows of data.
create index KeyDataOnly on AnalogSignal ( id, id_segment, name, channel, sampling_rate )
SELECT STRAIGHT_JOIN
block.id,
block.datetime,
segment.id,
analogsignal.id,
analogsignal.name,
analogsignal.channel,
analogsignal.sampling_rate,
block.fileOrigin,
block.info
FROM
analogsignal
JOIN Segment
on analogsignal.id_segment = segment.id
JOIN block
on segment.id_block = block.id
ORDER BY
analogsignal.id
If you cannot delete the BLOB column, do you have to fill it? You could add a column for storing the path/to/filename of your signal and then put all your signal files in the appropriate directory(s). Once that's done, set your BLOB field values to null.
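For example (the column name, file naming and path here are only an illustration of the idea):
-- Add a place to record where the signal now lives on disk.
ALTER TABLE analogsignal ADD COLUMN signal_path varchar(255) DEFAULT NULL;

-- After the application has exported a row's blob to a file (named after the
-- row id, say), record the path and release the blob storage.
UPDATE analogsignal
SET signal_path = CONCAT('/data/signals/', id, '.bin'),
    signal_blob = NULL
WHERE id = 42;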
It's probably breaking the spirit of the restrictions you're under. But arbitrary restrictions often need to be circumvented.
So, according to the comments, I'm sure your problem is caused by the MyISAM storage engine and the way it stores data. toxicate20 is right: MySQL has to skip over those big blobs anyway, which is not efficient. You can change the storage engine to InnoDB, which will help a lot with this problem. It will only read the blob data if you explicitly ask for it in the SELECT ... part.
ALTER TABLE analogsignal ENGINE=InnoDB;
This will take a while but helps performance a lot. You can read more about InnoDB file formats here:
http://dev.mysql.com/doc/innodb/1.1/en/innodb-row-format-antelope.html
http://dev.mysql.com/doc/innodb/1.1/en/innodb-row-format-dynamic.html
Disclaimer: If you use fulltext search (MATCH ... AGAINST http://dev.mysql.com/doc/refman/5.5/en//fulltext-search.html) on any of the columns in the table you cannot change it to InnoDB.
As the analog signal column is pretty large, the query will take a long time because it has to skip over (or jump over, if you see it metaphorically) the blobs when doing a select query. What I would do is the following: instead of having a blob in the database, generate binary files via
$fh = fopen("analogfile.spec", 'w') or die("can't open file");
$data = $yourAnalogDataFromSomewhere;
fwrite($fh, $data);
fclose($fh);
The filename would be given by the ID of the row, for example. Instead of the blob you would just store the file path within your server's directory structure.
This way your query will run very fast, as it does not have to skip the big chunks of data in the blob column.

MySQL: UNIQUE text field using additional HASH field

In my MySQL DB I have a table defined like:
CREATE TABLE `mytablex_cs` (
`id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`tag` varchar(6) COLLATE utf8_bin NOT NULL DEFAULT '',
`value` text COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`),
KEY `kt` (`tag`),
KEY `kv` (`value`(200))
) ENGINE=MyISAM AUTO_INCREMENT=7 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I need to implement a UNIQUE constraint (key) on the value field.
I know that it is not yet possible to define a unique index on the entire value of a blob or text field, but there is a ticket(?) open to implement such a feature (see this page), where it has been suggested to create a unique key using a hash, as is already implemented for other fields.
Now I would like to use a similar approach adding to the table another field that will contain the hash and creating a unique key on this field.
I had a look at possible ways to create this hash and, since I would like to avoid collisions (I need to insert several million entries), it seems that the RIPEMD-160 algorithm is the best one, even though a quick search gave me several similar solutions that use SHA256 or even SHA1 and MD5.
I totally lack knowledge of cryptography, so what are the downsides of choosing this approach?
Another question I have is: which algorithm is currently used by MySQL to create the hash?
Let's look at your requirements:
You need to ensure that the value field is unique. The value field is a text column, and due to its nature there is no way to create a unique index on the whole value (for now). So using an extra field which holds a hash of the value is your only real option here.
Advantages to this approach:
Easy to calculate the hash.
It is extremely rare for two different values to produce the same hash, so your hash values are almost guaranteed to be unique.
Hashes are normally a numeric value (expressed as hexadecimal) that can be indexed efficiently.
Hashes won't take up a lot of space; different hashing functions return hashes of different lengths, so play around with the algorithms and test them to find one that suits your needs.
Disadvantages of this approach:
An extra field to cater for during INSERTs and UPDATEs, i.e. there is more work to do.
If you already have data in the table and this is in production, you will have to update the current data, and hopefully you don't have duplicates already. It will also take time to run the update. Thus it might be tricky to apply the change to an already working system.
Hashing functions are CPU intensive and can have a noticeable impact on CPU usage.
I assume you understand what a hash function does and conceptually how it works.
You can find a list of cryptographic functions here: http://dev.mysql.com/doc/refman/5.5/en//encryption-functions.html
As far as I know, MySQL supports the MD5, SHA, SHA1 and SHA2 hashing functions. Most if not all of these should be sufficient for just hashing. Some functions like MD5 have issues when used in cryptographic applications, i.e. when used in PKI as a signature algorithm etc. However, those issues should not matter much when you use such a function to create a unique value, as it is not really being applied in a cryptographic context here.
To use the MySQL hashing functions you can try the following examples:
SELECT MD5('1234')
SELECT SHA('1234')
SELECT SHA1('1234')
SELECT SHA2('1234',224);
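Applied to your table, the extra hash column could be sketched like this (SHA1 shown here; the column and key names are just examples):
ALTER TABLE `mytablex_cs`
  ADD COLUMN `value_hash` char(40) DEFAULT NULL;

-- Backfill the hash for existing rows before enforcing uniqueness.
UPDATE `mytablex_cs` SET `value_hash` = SHA1(`value`);

ALTER TABLE `mytablex_cs`
  ADD UNIQUE KEY `uk_value_hash` (`value_hash`);

-- New rows then supply the hash alongside the value.
INSERT INTO `mytablex_cs` (`tag`, `value`, `value_hash`)
VALUES ('tag01', 'some long text value', SHA1('some long text value'));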
As with everything new, you should try all the approaches and find the one that will be most successful in your case.

Prevent duplicate rows in MySQL without unique index/constraint?

I'm writing an application that needs to deal with millions of URLs. It also needs to do retrieval by URL.
My table currently looks like this:
CREATE TABLE Pages (
id bigint(20) unsigned NOT NULL,
url varchar(4096) COLLATE utf8_unicode_ci NOT NULL,
url_crc int(11) NOT NULL,
PRIMARY KEY (id),
KEY url_crc (url_crc)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The idea behind this structure is to do look ups by a CRC32 hash of the URL, since a b-tree index would be very inefficient on URLs which tend to have common prefixes (InnoDB doesn't support hash indexes). Duplicate results from the CRC32 are filtered by a comparison with the full URL. A sample retrieval query would look like this:
SELECT id
FROM Pages
WHERE url_crc = 2842100667
AND url = 'example.com/page.html';
The problem I have is avoiding duplicate entries being inserted. The application will always check the database for an existing entry before inserting a new one, but it's likely in my application that multiple queries for the same new URL will be made concurrently and duplicate CRC32s and URLs will be entered.
I don't want to create a unique index on url as it will be gigantic. I also don't want to write lock the table on every insert since that will destroy concurrent insert performance. Is there an efficient way to solve this issue?
Edit: To go into a bit more detail about the usage, it's a real-time table for looking up content in response to a URL. By looking up the URL, I can find the internal id for the URL, and then use that to find content for a page. New URLs are added to the system all the time, and I have no idea what those URLs will be beforehand. When new URLs are referenced, they will likely be slammed by simultaneous requests referencing the same URLs, perhaps hundreds per second, which is why I'm concerned about the race condition when adding new content. The results need to be immediate and there can't be read lag (subsecond lag is okay).
To start, only a few thousand new URLs will be added per day, but the system will need to handle many times that before we have time to move to a more scalable solution next year.
One other issue with just using a unique index on url is that the length of the URLs can exceed the maximum length of a unique index. Even if I drop the CRC32 trick, it doesn't solve the issue of preventing duplicate URLs.
Have you actually benchmarked and found the btree to be a problem? I sense premature optimization.
Second, if you're worried about the start of all the strings being the same, one answer is to index your URL reversed—last character first. I don't think MySQL can do that natively, but you could reverse the data in your app before storing it. Or just don't use MySQL.
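If you wanted to experiment with that idea, one way to keep a reversed copy inside MySQL itself is a sketch like this (the extra column, prefix length and index name are made up for illustration; new rows would also need to populate url_rev, in the application or via a trigger):
ALTER TABLE Pages
  ADD COLUMN url_rev varchar(4096) COLLATE utf8_unicode_ci NOT NULL DEFAULT '';

UPDATE Pages SET url_rev = REVERSE(url);

-- The full column is too long to index, so index a prefix of the reversed value;
-- reversing puts the distinctive end of the URL at the front of the key.
ALTER TABLE Pages ADD KEY idx_url_rev (url_rev(255));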
Have you considered creating a UNIQUE INDEX (url_crc, url)? It may be 'gigantic', but with the number of collisions you're going to get using CRC32, it will probably help the performance of your page retrieval function while also preventing duplicate URLs.
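A sketch of that suggestion (a prefix on url is used here because of index key-length limits, so uniqueness is enforced on the CRC plus the first part of the URL rather than the full string):
ALTER TABLE Pages
  ADD UNIQUE KEY uk_url_crc_url (url_crc, url(255));

-- With the unique key in place, concurrent inserts of the same URL no longer race
-- (the id literal is only for illustration, since the table has no AUTO_INCREMENT):
INSERT IGNORE INTO Pages (id, url, url_crc)
VALUES (42, 'example.com/page.html', CRC32('example.com/page.html'));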
Another thing to consider would be allowing duplicates to be inserted, and removing them nightly (or whenever traffic is low) with a script.
In addition to your Pages table, create 3 additional tables with the same columns (PagesInsertA, PagesInsertB and PagesInsertC). When inserting URLs, check against Pages for an existing entry, and if it's not there, insert the URL into PagesInsertA. You can either use a unique index on that smaller table, or include a step to remove duplicates later (discussed below). At the end of your rotation time (perhaps one minute; see the discussion below for constraints), switch to inserting new URLs into PagesInsertB, then perform the following steps on PagesInsertA:
Remove duplicates (if you weren't using a unique index).
Remove any entries that duplicate entries in PagesInsertC (that table will be empty the first time around, but not the second).
Add the entries from PagesInsertA to Pages.
Empty PagesInsertC.
At the end of the second period, switch to inserting new URLs into PagesInsertC. Perform the steps discussed above on PagesInsertB (only difference is that you will remove entries also found in PagesInsertA and empty PagesInsertA at the end). Continue the rotation of the table the new URLs are inserted into (A -> B -> C -> A -> ...).
A minimum of 3 insert tables is needed because there will be a lag between switching URL insertion to a new insert table and inserting the cleaned-up rows from the previous insert table into Pages. I used 1 minute as the time between switches in this example, but you can make that time smaller as long as the insertion from PagesInsertA into Pages and the emptying of PagesInsertC (for example) occur before the switch from inserting new URLs into PagesInsertB to PagesInsertC.

best way to store url in mysql for a read&write-intensive application

What is the best way to store url in mysql effectively for a read&write-intensive application?
I will be storing over 500,000 web addresses (all starting with either http:// or https://; no other protocols), and saving the whole URL (http://example.com/path/?variable=a) in one column seems largely redundant, because the same domain name and path will be saved to MySQL multiple times.
So, initially, I thought of breaking them down (i.e. domain, path, variables, etc.) to get rid of the redundancy. But I saw some posts saying that it's not recommended. Any ideas on this?
Also, the application often has to retrieve URLs without primary keys, meaning it has to search text to retrieve a URL. The URL can be indexed, but I'm wondering how much performance difference there would be between storing the whole URL and the broken-down URL if they are all indexed under InnoDB (no full-text indexing).
A broken-down URL has to go through the extra steps of recombining its parts. It would also mean retrieving data 4 times from different tables (protocol, domain, path, variable), but it also makes the stored data in each row shorter and there would be fewer rows in each table. Would this possibly speed up the process?
I have dealt with this extensively, and my general philosophy is to use the frequency of use method. It's cumbersome, but it lets you run some great analytics on the data:
CREATE TABLE URL (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
DomainPath integer unsigned NOT NULL,
QueryString text
) Engine=MyISAM;
CREATE TABLE DomainPath (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
Domain integer unsigned NOT NULL,
Path text,
UNIQUE (Domain, Path(255))
) Engine=MyISAM;
CREATE TABLE Domain (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
Protocol tinyint NOT NULL,
Domain varchar(64),
Port smallint NULL,
UNIQUE (Protocol,Domain,Port)
) Engine=MyISAM;
As a general rule, you'll have similar Paths on a single Domain, but different QueryStrings for each path.
I originally designed this to have all parts indexed in a single table (Protocol, Domain, Path, Query String), but I think the above is less space-intensive and lends itself better to getting good data out of it.
text tends to be slow, so you can change "Path" to a varchar after some use. Most servers die after about 1K for a URL, but I've seen some larger ones and would err on the side of not losing data.
Your retrieval query is cumbersome, but if you abstract it away in your code, no issue:
SELECT CONCAT(
IF(D.Protocol=0,'http://','https://'),
D.Domain,
IF(D.Port IS NULL,'',CONCAT(':',D.Port)),
'/', DP.Path,
IF(U.QueryString IS NULL,'',CONCAT('?',U.QueryString))
)
FROM URL U
INNER JOIN DomainPath DP ON U.DomainPath=DP.ID
INNER JOIN Domain D on DP.Domain=D.ID
WHERE U.ID=$DesiredID;
Store a port number only if it's non-standard (non-80 for http, non-443 for https); otherwise store it as NULL to signify it shouldn't be included. (You can push that logic into MySQL, but it gets much uglier.)
I would always (or never) strip the "/" from the Path as well as the "?" from the QueryString for space savings. The only loss would be the ability to distinguish between
http://www.example.com/
http://www.example.com/?
If that distinction is important, then I would change tack and never strip it, always including it. Technically,
http://www.example.com
http://www.example.com/
are the same, so stripping the trailing Path slash is always OK.
So, to parse:
http://www.example.com/my/path/to/my/file.php?id=412&crsource=google+adwords
We would use something like parse_url in PHP to produce:
array(
[scheme] => 'http',
[host] => 'www.example.com',
[path] => '/my/path/to/my/file.php',
[query] => 'id=412&crsource=google+adwords',
)
You would then check/insert (with appropriate locks, not shown):
SELECT D.ID FROM Domain D
WHERE
D.Protocol=0
AND D.Domain='www.example.com'
AND D.Port IS NULL
(if it doesn't exist)
INSERT INTO Domain (
Protocol, Domain, Port
) VALUES (
0, 'www.example.com', NULL
);
We then have our $DomainID going forward...
Then insert into DomainPath:
SELECT DP.ID FROM DomainPath DP WHERE
DP.Domain=$DomainID AND Path='/my/path/to/my/file.php';
(if it doesn't exist, insert it similarly)
We then have our $DomainPathID going forward...
SELECT U.ID FROM URL
WHERE
DomainPath=$DomainPathID
AND QueryString='id=412&crsource=google+adwords'
and insert if necessary.
Now, importantly, let me note that the above scheme will be slow for high-performance sites. You should modify everything to use a hash of some sort to speed up SELECTs. In short, the technique looks like this:
CREATE TABLE Foo (
ID integer unsigned PRIMARY KEY NOT NULL AUTO_INCREMENT,
Hash varbinary(16) NOT NULL,
Content text
) Engine=MyISAM;
SELECT ID FROM Foo WHERE Hash=UNHEX(MD5('id=412&crsource=google+adwords'));
I deliberately left it out above to keep things simple, but comparing one TEXT value to another in a SELECT is slow, and breaks down for really long query strings. Don't use a fixed-length index either, because that will also break. For arbitrary-length strings where accuracy matters, the hash's failure rate is acceptable.
Finally, if you can, do the MD5 hashing client-side to save sending large blobs to the server just for the MD5 operation. Most modern languages support MD5 built in:
SELECT ID FROM Foo WHERE Hash=UNHEX('82fd4bcf8b686cffe81e937c43b5bfeb');
But I digress.
It really depends on what you want to do with the data. If you're doing statistics with the URLs, e.g. to see which domains are the most popular, then it would be feasible to break them down like that. But if you are just storing them and accessing each URL in its entirety, there's no reason to split them up.
I've seen some people hash long strings (e.g. with MD5) and search against that; there might be a performance gain for URLs, but I'm not certain by how much (it helps more for lots of text).
Whatever you do, don't forget to always use ints as primary keys as much as possible, as those are the quickest lookups.
If you really want to split your URLs, you may want to consider keeping separate tables in order not to lock your table (with InnoDB it doesn't matter, as the whole table doesn't get locked), but with separate tables you can just use foreign/primary keys/ints to reference the rows you need.
A good read is friendfeed's blog entry - that might give you some ideas as well: