Prevent duplicate rows in MySQL without unique index/constraint? - mysql

I'm writing an application that needs to deal with millions of URLs. It also needs to do retrieval by URL.
My table currently looks like this:
CREATE TABLE Pages (
id bigint(20) unsigned NOT NULL,
url varchar(4096) COLLATE utf8_unicode_ci NOT NULL,
url_crc int(11) NOT NULL,
PRIMARY KEY (id),
KEY url_crc (url_crc)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The idea behind this structure is to do lookups by a CRC32 hash of the URL, since a b-tree index would be very inefficient on URLs, which tend to have common prefixes (InnoDB doesn't support hash indexes). Duplicate results from the CRC32 are filtered out by a comparison with the full URL. A sample retrieval query would look like this:
SELECT id
FROM Pages
WHERE url_crc = 2842100667
AND url = 'example.com/page.html';
The problem I have is avoiding duplicate entries being inserted. The application will always check the database for an existing entry before inserting a new one, but it's likely in my application that multiple queries for the same new URL will be made concurrently and duplicate CRC32s and URLs will be entered.
I don't want to create a unique index on url as it will be gigantic. I also don't want to write lock the table on every insert since that will destroy concurrent insert performance. Is there an efficient way to solve this issue?
Edit: To go into a bit more detail about the usage, it's a real-time table for looking up content in response to a URL. By looking up the URL, I can find the internal id for it, and then use that to find the content for a page. New URLs are added to the system all the time, and I have no idea what those URLs will be beforehand. When new URLs are referenced, they will likely be slammed by simultaneous requests referencing the same URLs, perhaps hundreds per second, which is why I'm concerned about the race condition when adding new content. The results need to be immediate and there can't be read lag (subsecond lag is okay).
To start, only a few thousand new URLs will be added per day, but the system will need to handle many times that before we have time to move to a more scalable solution next year.
One other issue with just using a unique index on url is that the length of the URLs can exceed the maximum length of a unique index. Even if I drop the CRC32 trick, it doesn't solve the issue of preventing duplicate URLs.

Have you actually benchmarked and found the btree to be a problem? I sense premature optimization.
Second, if you're worried about the start of all the strings being the same, one answer is to index your URL reversed—last character first. I don't think MySQL can do that natively, but you could reverse the data in your app before storing it. Or just don't use MySQL.
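If you wanted to experiment with that idea, a minimal sketch might look like the following; url_rev is a hypothetical extra column that your application fills with the reversed URL (MySQL's REVERSE() function can also do it at insert time). A prefix index on the reversed string stays selective because the varied ends of the URLs now come first.
ALTER TABLE Pages
ADD COLUMN url_rev VARCHAR(4096) COLLATE utf8_unicode_ci NOT NULL,  -- filled with REVERSE(url) by the application
ADD KEY url_rev_idx (url_rev(255));  -- prefix index; 255 utf8 chars stays under the key length limit
-- lookups then compare against the reversed value:
SELECT id FROM Pages WHERE url_rev = REVERSE('example.com/page.html');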

Have you considered creating a UNIQUE INDEX (url_crc, url) ? It may be 'gigantic', but with the number of collisions you're going to get using CRC32, it will probably help the performance of your page retrieval function while also preventing duplicate urls.
Another thing to consider would be allowing duplicates to be inserted, and removing them nightly (or whenever traffic is low) with a script.
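A nightly clean-up of that kind can be a single self-join DELETE, sketched here on the schema from the question, keeping the lowest id as the canonical row:
DELETE p2
FROM Pages p1
JOIN Pages p2
ON p1.url_crc = p2.url_crc
AND p1.url = p2.url
AND p1.id < p2.id;
Anything that cached the id of a removed duplicate would need to cope with it disappearing, so this works best if duplicates are short-lived.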

In addition to your Pages table, create 3 additional tables with the same columns (PagesInsertA, PagesInsertB and PagesInsertC). When inserting URLs, check Pages for an existing entry, and if it's not there, insert the URL into PagesInsertA. You can either use a unique index on that smaller table, or include a step to remove duplicates later (discussed below). At the end of your rotation time (perhaps one minute; see the discussion below for constraints), switch to inserting new URLs into PagesInsertB, and perform the following steps on PagesInsertA:
Remove duplicates (if you weren't using a unique index).
Remove any entries that duplicate entries in PagesInsertC (that table will be empty the first time around, but not the second).
Add the entries from PagesInsertA to Pages.
Empty PagesInsertC.
At the end of the second period, switch to inserting new URLs into PagesInsertC. Perform the steps discussed above on PagesInsertB (the only difference is that you will remove entries also found in PagesInsertA, and empty PagesInsertA at the end). Continue rotating the table the new URLs are inserted into (A -> B -> C -> A -> ...).
A minimum of 3 insert tables is needed because there will be a lag between switching URL insertion to a new insert table and inserting the cleaned-up rows from the previous insert table into Pages. I used 1 minute as the time between switches in this example, but you can make that time smaller as long as the insertion from PagesInsertA to Pages and the emptying of PagesInsertC (for example) occur before the switch between inserting new URLs into PagesInsertB and PagesInsertC.
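A rough SQL sketch of one rotation step (the PagesInsertA -> Pages flush described above, assuming no unique index on the insert table):
-- 1. remove duplicates inside PagesInsertA, keeping the lowest id
DELETE a2
FROM PagesInsertA a1
JOIN PagesInsertA a2
ON a1.url_crc = a2.url_crc AND a1.url = a2.url AND a1.id < a2.id;
-- 2. remove entries already captured in the table flushed two rotations ago
DELETE a
FROM PagesInsertA a
JOIN PagesInsertC c
ON a.url_crc = c.url_crc AND a.url = c.url;
-- 3. move the cleaned rows into Pages (PagesInsertA keeps them until it is emptied in the C role)
INSERT INTO Pages (id, url, url_crc)
SELECT id, url, url_crc FROM PagesInsertA;
-- 4. empty the oldest insert table
TRUNCATE TABLE PagesInsertC;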

Related

Duplicate table fields vs indexing only

I have a huge and very busy table (a few thousand INSERTs per second). The table stores login logs; it has a bigint ID which is not generated by MySQL but rather by a pseudorandom generator on the MySQL client.
Simply put, the table has loginlog_id, client_id, tons,of,other,columns,with,details,about,session....
I have a few indexes on this table, such as PRIMARY KEY (loginlog_id) and INDEX (client_id).
In some other part of our system I need to fetch client_id based on loginlog_id. This does not happen that often (just a few hundred SELECT client_id FROM loginlogs WHERE loginlog_id=XXXXXX per second). The loginlogs table is read by various other scripts now and then, and different columns are needed each time. But the most frequent read is for sure the above mentioned get client_id by loginlog_id.
My question is: should I create another table loginlogs_clientids and duplicate loginlog_id, client_id in there (this means another few thousand INSERTs, as for every loginlogs INSERT I get this new one), or should I be happy with InnoDB handling my lookups by PRIMARY KEY efficiently?
We have tons of RAM (128GB, most of which is used by MySQL). The load on MySQL is between 40% and 350% CPU (we have a 12-core CPU). When I tried to use the new table, I did not see any difference. But I am asking for the future: if our usage grows even more, what is the suggested approach? Duplicate or index?
Thanks!
No.
Looking up table data for a single row using the primary key is extremely efficient, and will take the same time for both tables.
Exceptions to that might be very large row sizes (e.g. 8KB+), or a client_id that is e.g. a varchar stored off-page, in which case you might need to read an additional data block, which at least theoretically could cost you some milliseconds.
Even if this strategy would have an advantage, you would not actually do it by creating a new table, but by adding an index (loginlog_id, client_id) to your original table. InnoDB stores everything, including the actual data, in an index structure, so that adding an index is basically the same as adding a new table with the same columns, but without (you) having the problem of synchronizing those two "tables".
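If you ever did want that covering structure, a minimal sketch (using the column names from the question; the index name is just an example) would be a secondary index rather than a second table:
ALTER TABLE loginlogs
ADD INDEX idx_loginlog_client (loginlog_id, client_id);
-- SELECT client_id FROM loginlogs WHERE loginlog_id = ? can then be answered
-- entirely from this index, without touching the clustered row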
Having a structure with a smaller row size can have some advantages for ranged scans, e.g. MySQL will evaluate select count(*) from tablename using the smallest index of the table, as it has to read fewer bytes. You already have such a small index (on client_id), so even in that regard, adding such an additional table/index shouldn't have an effect. If you have any range scan on the primary key (which is probably unlikely for pseudorandom data), you may want to consider this, or keep it in mind for when you do.

How can one prevent mysql insertion within a particular timestamp interval?

I have a system whereby users can input data into a mysql table from many sites across the globe.
The data is posted via AJAX to my table without issues, but I would like to improve my insertion code to prevent insertion if the timestamp is within some interval. This would weed out duplicate rows in my table.
Before you get angry -> I do understand I can set a primary key to certain columns and prevent duplicate insertion.
In my use case, I need to allow duplication of the numeric data when it genuinely comes from separate submissions -> this is valid in my case. I would like to leverage the timestamp to weed out obvious double insertions where the variables were submitted twice by accident.
I have tried to disable the button for 1-2 seconds, but this hasn't solved the problem entirely.
If I have columns: weight, height, country and the timestamp, I'd like to somehow check if there is an insert within n seconds of the timestamp where the post includes data that matches these variables. That would tell me there is an accidental duplication from a user and I shouldn't insert it into the database.
I'm not too familiar with MYSQL, so I was hoping to get some guidance here.
Thanks.
There are different solutions, depending on the specifics of your case:
If you need to apply some rule that validates the new row using values inside the row itself, a CHECK constraint will do. Consider, though, that MySQL only enforces CHECK constraints starting in version 8.0.16 (if I remember correctly); before that they are parsed but ignored.
If you want to enforce a rule in relation to other rows, you can serialize the insertions into a queue. The consumer of the queue will validate the insertions one by one and accept or reject them. Consider that serialization is not a good option for a massive level of insertions, since it produces a bottleneck (this may be your case, since you say insertions come from across the globe).
Alternatively, you can use optimistic insertion: always perform the insert with an intermediate status such as "waiting for validation". Other process(es) then validate the row. If all is good, the row is approved; if not, a compensation procedure is executed, à la microservices.
Which one is your case?
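For illustration, a guarded insert along the lines the question describes could be sketched like this (the table name submissions and the created_at column are assumptions based on the question; note the check itself is still racy under heavy concurrency unless insertions are serialized as described above):
INSERT INTO submissions (weight, height, country, created_at)
SELECT 80.5, 178.2, 'CA', NOW()
FROM DUAL
WHERE NOT EXISTS (
SELECT 1 FROM submissions
WHERE weight = 80.5
AND height = 178.2
AND country = 'CA'
AND created_at > NOW() - INTERVAL 5 SECOND  -- "n seconds" window from the question
);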

MySQL optimizing by splitting table into two

I currently have a table called map_tiles that will eventually have around a couple hundred thousand rows. Each row in this table represents an individual tile on a world map of my game. Currently, the table structure is like so:
id int(11) PRIMARY KEY
city_id int(11)
type varchar(20)
x int(11) INDEX KEY
y int(11) INDEX KEY
level int(11)
I also want to be able to store a stringified JSON object that will contain information regarding that specific tile. Since I could have 100,000+ rows, I want to optimize my queries and table design to get the best performance possible.
So here is my scenario: A player loads a position, say at 50,50 on the worldmap. We will load all tiles within a 25 tile radius of the player's coordinate. So, we will have to run a WHERE query on this table of hundreds of thousands of rows in my map_tiles table.
So, would adding another field of type text called data to the existing table give better performance? My concern is that it would slow down the main query.
Or, would it be worth it to create a separate table called map_tiles_data, that just has the structure like so:
tile_id int(11) PRIMARY KEY
data text
And I could run the main query that finds the tiles within the radius of the player from the map_tiles, and then do a UNION ALL possibly that just pulls the JSON stringified data from the second table?
EDIT: Sorry, I should have clarified. The second table, if used, would not have a row for each corresponding tile in the map_tiles table. A row will only be added if data is to be stored for a map tile. So by default there will be 0 rows in the map_tiles_data table, while there could be 100,000+ rows in the map_tiles table. When a player does x action, the game will add a row to map_tiles_data.
No need to store data in a separate table. You can use the same table, but you have to use the InnoDB plugin, set innodb_file_format=barracuda and, as data is going to be text, use ROW_FORMAT=DYNAMIC (or COMPRESSED).
InnoDB will then store the text outside the row page, so having data in the same table is more efficient than having it in a separate table (you can avoid joins and foreign keys). Also add an index on x and y, since all your queries are based on the location.
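A sketch of what that could look like (the exact steps depend on your MySQL version: innodb_file_format and the Barracuda setting only apply to older 5.5/5.6 servers and were removed later, innodb_file_per_table must be on, and the index name here is just an example):
SET GLOBAL innodb_file_per_table = 1;
SET GLOBAL innodb_file_format = Barracuda;  -- only needed on older servers still defaulting to Antelope
ALTER TABLE map_tiles
ENGINE = InnoDB,
ROW_FORMAT = DYNAMIC,
ADD COLUMN data TEXT NULL,
ADD INDEX idx_xy (x, y);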
Useful reading:
Innodb Plugin in “Barracuda” format and ROW_FORMAT=DYNAMIC. In this format Innodb stores either whole blob on the row page or only 20 bytes BLOB pointer giving preference to smaller columns to be stored on the page, which is reasonable as you can store more of them. BLOBs can have prefix index but this no more requires column prefix to be stored on the page – you can build prefix indexes on blobs which are often stored outside the page.
COMPRESSED row format is similar to DYNAMIC when it comes to handling blobs and will use the same strategy storing BLOBs completely off page. It however will always compress blobs which do not fit to the row page, even if KEY_BLOCK_SIZE is not specified and compression for normal data and index pages is not enabled.
Don't think that I am referring only to BLOBs. From storage prospective BLOB, TEXT as well as long VARCHAR are handled same way by Innodb.
Ref: https://www.percona.com/blog/2010/02/09/blob-storage-in-innodb/
The issue of storing data in one table or two tables is not really your main issue. The issue is getting the neighboring tiles. I'll return to that in a moment.
JSON can be a convenient format for flexibly storing attribute/value pairs. However, it is not so useful for accessing the data in the database. You might want to consider a hybrid form. This suggests another table, because you might want to occasionally add or remove columns.
Another consideration is maintaining history. You may want history on the JSON component, but you don't need that for the rest of the data. This suggests using a separate table.
As for optimizing the WHERE, I think you have three choices. The first is your current approach, which is not reasonable.
The second is to have a third table that contains all the neighbors within a given distance (one row per tile and per neighboring tile). Unfortunately, this method doesn't allow you to easily vary the radius, which might be desirable.
The best solution is to use a GIS solution. You can investigate MySQL's support for geographic data types in the MySQL documentation.
Where you store your JSON really won't matter much. The main performance problem you face is the fact that your WHERE clause will not be able to make use of any indexes (because you're ultimately doing a greater / less than query rather than a fixed query). A hundred thousand rows isn't that many, so performance from this naive solution might be acceptable for your use case; ideally you should use the geospatial types supported by MySQL.
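For reference, the radius lookup in the question is essentially a bounding-box query; with the existing indexes on x and y (or a composite (x, y) index as suggested above), MySQL can at least range-scan on one coordinate rather than reading every row. A sketch for the 50,50 example with a 25-tile radius:
SELECT id, city_id, type, x, y, level
FROM map_tiles
WHERE x BETWEEN 50 - 25 AND 50 + 25
AND y BETWEEN 50 - 25 AND 50 + 25;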

Keeping video viewing statistics breakdown by video time in a database

I need to keep a number of statistics about the videos being watched, and one of them is what parts of the video are being watched most. The design I came up with is to split the video into 256 intervals and keep the floating-point number of views for each of them. I receive the data as a number of intervals the user watched continuously. The problem is how to store them. There are two solutions I see.
Row per every video segment
Let's have a database table like this:
CREATE TABLE `video_heatmap` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`video_id` int(11) NOT NULL,
`position` tinyint(3) unsigned NOT NULL,
`views` float NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_lookup` (`video_id`,`position`)
) ENGINE=MyISAM
Then, whenever we have to process a number of views, make sure the respective database rows exist and add appropriate values to the views column. I found out it's a lot faster if the existence of the rows is taken care of first (SELECT COUNT(*) of rows for a given video and INSERT IGNORE if they are lacking), and then a number of UPDATE queries are used like this:
UPDATE video_heatmap
SET views = views + ?
WHERE video_id = ? AND position >= ? AND position < ?
This seems, however, a little bloated. The other solution I came up with is
Row per video, update in transactions
A table will look (sort of) like this:
CREATE TABLE video (
id INT NOT NULL AUTO_INCREMENT,
heatmap BINARY(1024) NOT NULL, -- 4 bytes * 256 segments
...
) ENGINE=InnoDB
Then, upon every time a view needs to be stored, it will be done in a transaction with consistent snapshot, in a sequence like this:
If the video doesn't exist in the database, it is created.
A row is retrieved; heatmap, an array of floats stored in binary form, is converted into a form more friendly for processing (in PHP).
Values in the array are increased appropriately and the array is converted back.
Row is changed via UPDATE query.
So far the advantages can be summed up like this:
First approach
Stores data as floats, not as some magical binary array.
Doesn't require transaction support, so doesn't require InnoDB, and we're using MyISAM for everything at the moment, so there won't be any need to mix storage engines. (only applies in my specific situation)
Doesn't require a transaction WITH CONSISTENT SNAPSHOT. I don't know what the performance penalties of those are.
I already implemented it and it works. (only applies in my specific situation)
Second approach
Uses a lot less storage space (the first approach stores the video ID 256 times and a position for every segment of the video, not to mention the primary key).
Should scale better, because of InnoDB's per-row locking as opposed to MyISAM's table locking.
Might generally work faster because there are a lot less requests being made.
Easier to implement in code (although the other one is already implemented).
So, what should I do? If it wasn't for the rest of our system using MyISAM consistently, I'd go with the second approach, but currently I'm leaning to the first one. But maybe there are some reasons to favour one approach or another?
Second approach looks tempting at first sight, but it makes queries like "how many views for segment x of video y" unable to use an index on video.heatmap. Not sure if this is a real-life concern for you though. Also, you would have to parse back and forth the entire array every time you need data for one segment only.
But first and foremost, your second solution is hackish (but interesting nonetheless). I wouldn't recommend denormalising your database until you face an actual performance issue.
Also, try populating the video_heatmap table in advance with views = 0 as soon as a video is inserted (a trigger can help).
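A minimal sketch of such a trigger, assuming the parent table is called video and a one-off helper table heatmap_positions holds the values 0 through 255 (both names are assumptions):
CREATE TABLE heatmap_positions (
position TINYINT UNSIGNED NOT NULL PRIMARY KEY
);  -- fill once with the values 0..255
CREATE TRIGGER video_heatmap_init
AFTER INSERT ON video
FOR EACH ROW
INSERT INTO video_heatmap (video_id, position, views)
SELECT NEW.id, p.position, 0
FROM heatmap_positions p;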
If space is really a concern, remove your surrogate key video_heatmap.id and instead make (video_id, position) the primary key (then get rid of the superfluous UNIQUE constraint). But this shouldn't come into the equation: 256 x 12 bytes per video (rough row length with 3 numeric columns, okay, add some for the index) is only an extra 3 kB per video!
Finally, nothing prevents you from switching your current table to InnoDB and leverage its row-level locking capability.
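Put together, the last two suggestions would leave the table looking roughly like this (a sketch of the end state, not a migration script):
CREATE TABLE `video_heatmap` (
`video_id` int(11) NOT NULL,
`position` tinyint(3) unsigned NOT NULL,
`views` float NOT NULL,
PRIMARY KEY (`video_id`,`position`)
) ENGINE=InnoDB;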
Please note I fail to understand why views cannot be an UNSIGNED INT. I would recommend changing that type.

Suggestions for string lookup optimizations

We have the below MySQL database table with about 75,000 entries. Each entry in the table represents a symbol in the system for which further data could be retrieved. This table is queried for autocomplete purposes - a user looks up a symbol, which is then matched to either the symbol's name or to its tags (semicolon separated list of strings). When the user selects the correct symbol, relevant data is fetched. Here is the table's description:
CREATE TABLE `symbols` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(512) NOT NULL,
`tags` varchar(512) DEFAULT NULL,
`type` enum('1','2','3','4','5','6','7','8','9') NOT NULL,
`popularity` int(11) DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `uc_symbol_name` (`type`,`name`),
KEY `symbol_idx` (`name`),
KEY `type_popularity_idx` (`type`,`popularity`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The above table is stored, alongside copious amounts of data, on a backend machine which serves this data over a JSON API. Currently, our frontend JavaScript code queries the backend server directly via AJAX in order to do the autocomplete. Instead, to speed things up, we want to create a local cached version of the symbols table on the server from which the frontend is served (the frontend is written in Django). This is possible since the table contains under 100,000 symbols, and because the table only gets updated about once a minute. Furthermore, it will allow us to implement better matching algorithms like Levenshtein distance.
How would you go about creating this type of cached symbol table? Obviously the lookup will have to happen in code (probably Python), but how would you store the data, and how would you sync it once a minute? We have a Redis server running on the django frontend server, but that introduces the question of persistence... Any thoughts are very welcome!
Just use a simple hash table, together with a "last updated time". Every time you do a lookup in the hash, check the "last updated" time. If it is more than a minute in the past, dump the data in the hash and reload it from the DB. Of course you have to make sure to avoid race conditions...
There are other alternatives but this is the simplest way and will be easiest to code correctly. If you find that hitting one of your transactions per minute with the extra latency of a big DB operation is not acceptable, you can come up with something a bit more complicated (such as running the DB operations asynchronously, on a background thread). To prepare for that eventuality, encapsulate this code in a class. (Then if it's too slow, you can play with the implementation without affecting any of your other code.)
Of course, there are other things you could do if you need more performance. You could add an updated_time column to your DB records, and then only load the ones which have been updated since last time. Will this actually make things faster? If it does, will the difference be big enough to even matter? There's no way to know unless you try it out. That's why it's better to try the simpler solution first and see if it reaches your performance goals.
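If you do go down the incremental route, a sketch of the schema change and the refresh query (the updated_time column name and the placeholder are assumptions):
ALTER TABLE symbols
ADD COLUMN updated_time TIMESTAMP NOT NULL
DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
ADD INDEX updated_time_idx (updated_time);
SELECT id, name, tags, type, popularity
FROM symbols
WHERE updated_time > ?;  -- the time of the previous cache refresh, tracked by the cache itself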