Keeping video viewing statistics broken down by video time in a database - mysql

I need to keep a number of statistics about the videos being watched, and one of them is what parts of the video are being watched most. The design I came up with is to split the video into 256 intervals and keep the floating-point number of views for each of them. I receive the data as a number of intervals the user watched continuously. The problem is how to store them. There are two solutions I see.
Row per every video segment
Let's have a database table like this:
CREATE TABLE `video_heatmap` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`video_id` int(11) NOT NULL,
`position` tinyint(3) unsigned NOT NULL,
`views` float NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_lookup` (`video_id`,`position`)
) ENGINE=MyISAM
Then, whenever we have to process a number of views, we make sure the respective database rows exist and add the appropriate values to the views column. I found out it's a lot faster if the existence of the rows is taken care of first (SELECT COUNT(*) of rows for a given video and INSERT IGNORE if they are lacking; see the sketch after the query below), and then a number of UPDATE queries like this is used:
UPDATE video_heatmap
SET views = views + ?
WHERE video_id = ? AND position >= ? AND position < ?
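For reference, the pre-population statement could look roughly like this (only a sketch; the 256 value tuples would be generated by the application code):
INSERT IGNORE INTO video_heatmap (video_id, position, views)
VALUES (?, 0, 0), (?, 1, 0), /* ... */ (?, 255, 0);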
This seems, however, a little bloated. The other solution I came up with is
Row per video, update in transactions
A table will look (sort of) like this:
CREATE TABLE video (
id INT NOT NULL AUTO_INCREMENT,
heatmap BINARY(1024) NOT NULL, -- 4 bytes * 256 segments
...
) ENGINE=InnoDB
Then, every time a view needs to be stored, it is done in a transaction with a consistent snapshot, in a sequence like this (sketched below):
If the video doesn't exist in the database, it is created.
A row is retrieved, heatmap, an array of floats stored in the binary form, is converted into a form more friendly for processing (in PHP).
Values in the array are increased appropriately and the array is converted back.
Row is changed via UPDATE query.
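Put together, the write path for one view could look roughly like this (only a sketch: the FOR UPDATE lock and the zero-filled default are my additions, and the float packing/unpacking happens in PHP, e.g. with pack()/unpack()):
START TRANSACTION WITH CONSISTENT SNAPSHOT;
INSERT IGNORE INTO video (id, heatmap) VALUES (?, REPEAT(CHAR(0), 1024)); -- create the row if missing
SELECT heatmap FROM video WHERE id = ? FOR UPDATE; -- lock the row against concurrent writers
-- in PHP: unpack the 256 floats, add the watched intervals, pack the array back into binary
UPDATE video SET heatmap = ? WHERE id = ?;
COMMIT;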
So far the advantages can be summed up like this:
First approach
Stores data as floats, not as some magical binary array.
Doesn't require transaction support, so doesn't require InnoDB, and we're using MyISAM for everything at the moment, so there won't be any need to mix storage engines. (only applies in my specific situation)
Doesn't require a transaction WITH CONSISTENT SNAPSHOT. I don't know what the performance penalties of those are.
I already implemented it and it works. (only applies in my specific situation)
Second approach
Uses a lot less storage space (the first approach stores the video ID 256 times and a position for every segment of the video, not to mention the primary key).
Should scale better, because of InnoDB's per-row locking as opposed to MyISAM's table locking.
Might generally work faster because there are a lot less requests being made.
Easier to implement in code (although the other one is already implemented).
So, what should I do? If it weren't for the rest of our system using MyISAM consistently, I'd go with the second approach, but currently I'm leaning towards the first one. But maybe there are some reasons to favour one approach or the other?

Second approach looks tempting at first sight, but it makes queries like "how many views for segment x of video y" unable to use an index on video.heatmap. Not sure if this is a real-life concern for you though. Also, you would have to parse back and forth the entire array every time you need data for one segment only.
But first and foremost, your second solution is hackish (but interesting nonetheless). I wouldn't recommend denormalising your database until you face an actual performance issue.
Also, try populating the video_heatmap table in advance with views = 0 as soon as a video is inserted (a trigger can help).
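For example, a trigger along these lines could handle that (a sketch; it assumes a video table whose primary key is the id referenced by video_heatmap.video_id):
DELIMITER //
CREATE TRIGGER trg_init_video_heatmap
AFTER INSERT ON video
FOR EACH ROW
BEGIN
  DECLARE pos INT DEFAULT 0;
  WHILE pos < 256 DO
    INSERT IGNORE INTO video_heatmap (video_id, position, views) VALUES (NEW.id, pos, 0);
    SET pos = pos + 1;
  END WHILE;
END//
DELIMITER ;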
If space is really a concern, remove your surrogate key video_heatmap.id and instead make (video_id, position) the primary key (then get rid of the superfluous UNIQUE constraint). But this shouldn't come into the equation. 256 x 12 bytes per video (rough row length with 3 numeric columns, okay add some for the index) is only an extra 3kb per video!
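That change could be expressed like this (an untested sketch against the schema from the question):
ALTER TABLE video_heatmap
  DROP PRIMARY KEY,
  DROP COLUMN id,
  DROP INDEX idx_lookup,
  ADD PRIMARY KEY (video_id, position);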
Finally, nothing prevents you from switching your current table to InnoDB and leverage its row-level locking capability.
Please note I fail to understand why views cannot be an UNSIGNED INT. I would recommend changing this type.

Related

How to index monotonically increasing data in a table?

I have a table with a monotonically increasing field that I want to put into an index. However, the best practices guide says to not put monotonically increasing data into a non-interleaved index. When I try putting the data into an interleaved index, I can't interleave an index in its parent table.
In other words, I want the Cloud Spanner equivalent of this MySQL schema.
CREATE TABLE `my_table` (
`id` bigint(20) unsigned NOT NULL,
`monotonically_increasing` int(10) unsigned DEFAULT '0',
PRIMARY KEY (`id`),
KEY `index_name` (`monotonically_increasing`)
)
It really depends on the rate at which you'll be writing monotonically increasing/decreasing values.
Small write loads
I don't know the exact range of writes per second a Spanner server can handle before you'll hotspot (and it depends on your data), but if you are writing < 500 rows per second you should be okay with this pattern. It's only an issue if your write load is higher than a single Spanner server can comfortably handle by itself.
Large write loads
If your write rate is larger, or relatively unbounded (e.g. it scales up with your system's/site's popularity), then you'll need to look at alternatives. Which alternative fits really depends on your exact use case and which trade-offs you're willing to take.
One generic approach is to manually shard the index. Let's say for example you know your peak write load will be 1740 inserts per second. Using the approx 500 writes per server number from before, we would be able to avoid hotspotting if we could shard this load over 4 Spanner servers (435 writes/second each).
Using the INT64 type in Cloud Spanner allows for a maximum value of 9,223,372,036,854,775,807. One example way to shard is by adding random(0,3)*1,000,000,000,000,000,000 to each value. This will split the index key range into 4 ranges that can be served by 4 Spanner servers. The down-side is you'll need to do 4 queries and merge the results on the client side after masking out the x,000,000,000,000,000,000 offset.
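To illustrate, reading back under this scheme means one range query per shard plus stripping the offset, something like the following (written in MySQL-flavoured SQL for familiarity; Spanner's dialect and your column names will differ):
SELECT id, MOD(monotonically_increasing, 1000000000000000000) AS original_value
FROM my_table
WHERE monotonically_increasing >= 2 * 1000000000000000000  -- shard 2 of shards 0..3
  AND monotonically_increasing <  3 * 1000000000000000000;
-- repeat for shards 0, 1 and 3, then merge the four result sets in the client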
Note: Interleaving is when data/indexes from one table are interleaved with data from another table. You cannot interleave with only one table.

MySQL optimizing by splitting table into two

I currently have a table called map_tiles that will eventually have around a couple hundred thousand rows. Each row in this table represents an individual tile on a world map of my game. Currently, the table structure is like so:
id int(11) PRIMARY KEY
city_id int(11)
type varchar(20)
x int(11) INDEX KEY
y int(11) INDEX KEY
level int(11)
I also want to be able to store a stringified JSON object that will contain information regarding that specific tile. Since I could have 100,000+ rows, I want to optimize my queries and table design to get the best performance possible.
So here is my scenario: A player loads a position, say at 50,50 on the worldmap. We will load all tiles within a 25 tile radius of the player's coordinate. So, we will have to run a WHERE query on this table of hundreds of thousands of rows in my map_tiles table.
So, would adding another field of type text called data to the existing table still give good performance, or would it slow down the main query?
Or, would it be worth it to create a separate table called map_tiles_data, that just has the structure like so:
tile_id int(11) PRIMARY KEY
data text
And I could run the main query that finds the tiles within the radius of the player from the map_tiles, and then do a UNION ALL possibly that just pulls the JSON stringified data from the second table?
EDIT: Sorry, I should have clarified. The second table, if used, would not have a row for each corresponding tile in the map_tiles table. A row will only be added if data is to be stored for a map tile. So by default there will be 0 rows in the map_tiles_data table, while there could be 100,000+ rows in the map_tiles table. When a player does x action, the game will add a row to map_tiles_data.
There is no need to store data in a separate table; you can use the same table. But you have to use the InnoDB plugin and set innodb_file_format=Barracuda, and since data is going to be TEXT, use ROW_FORMAT=DYNAMIC (or COMPRESSED).
InnoDB will store the text outside the row page, so having data in the same table is more efficient than having it in a separate table (you can avoid joins and foreign keys). Also add an index on x and y, as all your queries are based on the location.
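A rough version of that change might be the following (a sketch; adjust names, and verify the server actually supports the Barracuda file format before running it):
SET GLOBAL innodb_file_format = Barracuda;  -- needed on older servers / the InnoDB plugin
ALTER TABLE map_tiles
  ENGINE = InnoDB,
  ROW_FORMAT = DYNAMIC,
  ADD COLUMN data TEXT NULL,
  ADD INDEX idx_xy (x, y);  -- composite index, if one is not already there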
Useful reading:
Innodb Plugin in “Barracuda” format and ROW_FORMAT=DYNAMIC. In this format Innodb stores either whole blob on the row page or only 20 bytes BLOB pointer giving preference to smaller columns to be stored on the page, which is reasonable as you can store more of them. BLOBs can have prefix index but this no more requires column prefix to be stored on the page – you can build prefix indexes on blobs which are often stored outside the page.
COMPRESSED row format is similar to DYNAMIC when it comes to handling blobs and will use the same strategy storing BLOBs completely off page. It however will always compress blobs which do not fit to the row page, even if KEY_BLOCK_SIZE is not specified and compression for normal data and index pages is not enabled.
Don't think that I am referring only to BLOBs. From storage prospective BLOB, TEXT as well as long VARCHAR are handled same way by Innodb.
Ref: https://www.percona.com/blog/2010/02/09/blob-storage-in-innodb/
The issue of storing data in one table or two tables is not really your main issue. The issue is getting the neighboring tiles. I'll return to that in a moment.
JSON can be a convenient format for flexibly storing attribute/value pairs. However, it is not so useful for accessing the data in the database. You might want to consider a hybrid form. This suggests another table, because you might want to occasionally add or remove columns.
Another consideration is maintaining history. You may want history on the JSON component, but you don't need that for the rest of the data. This suggests using a separate table.
As for optimizing the WHERE. I think you have three choices. The first is your current approach, which is not reasonable.
The second is to have a third table that contains all the neighbors within a given distance (one row per tile and per neighboring tile). Unfortunately, this method doesn't allow you to easily vary the radius, which might be desirable.
The best solution is to use a GIS solution. You can investigate MySQL's support for geographic data types.
Where you store your JSON really won't matter much. The main performance problem you face is the fact that your WHERE clause will not be able to make use of any indexes (because you're ultimately doing a greater / less than query rather than a fixed query). A hundred thousand rows isn't that many, so performance from this naive solution might be acceptable for your use case; ideally you should use the geospatial types supported by MySQL.
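For concreteness, the naive radius query the question describes would look something like this (a sketch; the 25-tile "radius" is approximated with a square bounding box, and map_tiles_data is the optional second table):
SELECT t.id, t.x, t.y, t.level, d.data
FROM map_tiles AS t
LEFT JOIN map_tiles_data AS d ON d.tile_id = t.id
WHERE t.x BETWEEN 50 - 25 AND 50 + 25
  AND t.y BETWEEN 50 - 25 AND 50 + 25;
A composite index on (x, y) helps with the range on x, but the y condition still has to be filtered afterwards, which is the limitation discussed above.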

MySQL: Which is smaller, storing 2 sets of similar data in 1 table vs 2 tables (+indexes)?

I was asked to optimize (size-wise) statistics system for a certain site and I noticed that they store 2 sets of stat data in a single table. Those sets are product displays on search lists and visits on product pages. Each row has a product id, stat date, stat count and stat flag columns. The flag column indicates if it's a search list display or page visit stat. Stats are stored per day and product id, stat date (actually combined with product ids and stat types) and stat count have indexes.
I was wondering if it's better (size-wise) to store those two sets as separate tables or keep them as a single one. I presume that the part which would make a difference would be the flag column (let's say it's a 1-byte TINYINT) and the indexes. I'm especially interested in how the space taken by indexes would change in the 2-table scenario. The table in question already has a few million records.
I'll probably do some tests when I have more time, but I was wondering if someone had already challenged a similar problem.
Ordinarily, if two kinds of observations are conformable, it's best to keep them in a single table. By "conformable," I mean that their basic data is the same.
It seems that your observations are indeed conformable.
Why is this?
First, you can add more conformable observations trivially easily. For example, you could add sales to search-list and product-page views, by adding a new value to the flag column.
Second, you can report quite easily on combinations of the kinds of observations. If you separate these things into different tables, you'll be doing UNIONs or JOINs when you want to get them back together.
Third, when indexing is done correctly the access times are basically the same.
Fourth, the difference in disk space usage is small. You need indexes in either case.
Fifth, the difference in disk space cost is trivial. You have several million rows, or in other words, a dozen or so gigabytes. The highest-quality Amazon Web Services storage costs about US$ 1.00 per year per gigabyte. That's less than what heating your office will cost for the day you spend refactoring this stuff. Let it be.
Finally I got a moment to conduct a test. It was just a small-scale test with 12k and 48k records.
The table that stored both types of data had the following structure:
CREATE TABLE IF NOT EXISTS `stat_test` (
`id_off` int(11) NOT NULL,
`stat_date` date NOT NULL,
`stat_count` int(11) NOT NULL,
`stat_type` tinyint(11) NOT NULL,
PRIMARY KEY (`id_off`,`stat_date`,`stat_type`),
KEY `id_off` (`id_off`),
KEY `stat_count` (`stat_count`)
) ENGINE=InnoDB DEFAULT CHARSET=latin2;
The other two tables had this structure:
CREATE TABLE IF NOT EXISTS `stat_test_other` (
`id_off` int(11) NOT NULL,
`stat_date` date NOT NULL,
`stat_count` int(11) NOT NULL,
PRIMARY KEY (`id_off`,`stat_date`),
KEY `id_off` (`id_off`),
KEY `stat_count` (`stat_count`)
) ENGINE=InnoDB DEFAULT CHARSET=latin2;
In the case of 12k records, the 2 separate tables were actually slightly bigger than the one storing everything, but in the case of 48k records, the two tables were smaller, and by a noticeable amount.
In the end I didn't split the data into two tables to solve my initial space problem. I managed to considerably reduce the size of the database by removing the redundant id_off index and adjusting the data types (in most cases unsigned smallint was more than enough to store all the values I needed). Note that originally stat_type was also of type int, and for this column unsigned tinyint was enough. All in all this reduced the size of the database from 1.5GB to 600MB (and my limit was just 2GB for the database). Another advantage of this solution was the fact that I didn't have to modify a single line of code to make everything work (since the site was written by someone else, I didn't have to spend hours trying to understand the source code).
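The kind of change described would look roughly like this against the test table above (an illustrative sketch, not the exact statements used):
ALTER TABLE stat_test
  DROP INDEX id_off,  -- redundant: it duplicates the leftmost column of the primary key
  MODIFY id_off SMALLINT UNSIGNED NOT NULL,
  MODIFY stat_count SMALLINT UNSIGNED NOT NULL,
  MODIFY stat_type TINYINT UNSIGNED NOT NULL;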

Need correct database structure to reduce the size

I want to design my database correctly. Maybe someone could help me with that.
I have a device which writes every 3s around 100 keys/values to a table.
Someone suggested to store it like this:
^ timestamp ^ key1 ^ key2 ^ [...] ^ key150 ^
| 12/06/12 | null | 2243466 | [...] | null |
But I think that's completely wrong and not dynamic, because I could have many null values.
So I tried to do my best and designed it how I learned it at school:
http://ondras.zarovi.cz/sql/demo/?keyword=tempidi
The problem here is that I write the timestamp for every value, which means that within 100 values it is always the same, and that produces a large amount of data.
Could someone give me a hint how to reduce the database size? Am I basically correct with my ERM?
I wouldn't worry so much about the database size. Your bigger problem is maintenance and flexibility.
Here's what I would do. First, define and fill this table with possible keys your device can write:
tblDataKey
(
ID int primary key (auto-increment - not sure how mysql does this)
Name varchar(32)
)
Next define a 'data event' table:
tblEvent
(
ID int primary key (auto-inc)
TimeStamp
...anything else you need - device ID's? ...
)
Then match events with keys and their values:
tblEventData
(
EventID INT FK-to-tblEvent
KeyID INT FK-to-tblDataKey
DataValue varchar(???)
)
Now, every however-many-seconds your data comes in, you create a single entry in tblEvent and multiple entries in tblEventData with key/values as needed. Not every event needs every key, and you can expand the number of keys in the future.
This really shines in that space isn't wasted and you can easily do queries for events with specific data keys and values. Where this kind of structure falls down is when you need to produce 'cross-tab-like' tables of events and data items. You'll have to decide if that's a problem or not.
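A minimal MySQL rendering of those three tables might look like this (the types, lengths and engine are assumptions):
CREATE TABLE tblDataKey (
  ID   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  Name VARCHAR(32)  NOT NULL
) ENGINE=InnoDB;

CREATE TABLE tblEvent (
  ID          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `TimeStamp` DATETIME     NOT NULL
  -- ...device IDs or anything else you need...
) ENGINE=InnoDB;

CREATE TABLE tblEventData (
  EventID   INT UNSIGNED NOT NULL,
  KeyID     INT UNSIGNED NOT NULL,
  DataValue VARCHAR(255),
  PRIMARY KEY (EventID, KeyID),
  FOREIGN KEY (EventID) REFERENCES tblEvent (ID),
  FOREIGN KEY (KeyID)   REFERENCES tblDataKey (ID)
) ENGINE=InnoDB;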
If you must implement a key-value store in MySQL, it doesn't make any sense to make it more complicated than this.
create table key_value_store (
run_time datetime not null,
key_name varchar(15) not null,
key_value varchar(15) not null,
primary key (run_time, key_name)
);
If the average length of both your keys and values is 10 bytes, you're looking at about 86 million rows and 2.5GB per month, and you don't need any joins. If all your values (column key_value) are either integers or floats, you can change the data type and reduce space a little more.
One of the main problems with implementing key-value stores in SQL is that, unless all values are the same data type, you have to use something like varchar(n) for all values. You lose type safety and declarative constraints. (You can't check that the value for key3 is between 1 and 15, while the value for key7 is between 0 and 3.)
Is this feasible?
This kind of structure (known as "EAV"--Google that) is a well-known table design anti-pattern. Part of the problem is that you're essentially storing columns as rows. (You're storing column names in key_value_store.key_name.) If you ever have to write out data in the format of a normal table, you'll discover three things.
It's hard to write queries to output the right format.
It takes forever to run. If you have to write hundreds of columns, it might never run to completion.
You'll wish you had much faster hardware. Much, much faster hardware.
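To make that concrete, getting the data back into a normal wide-table shape means a pivot query like this (a sketch with placeholder key names), with one conditional aggregate per output column:
SELECT run_time,
       MAX(CASE WHEN key_name = 'key1' THEN key_value END) AS key1,
       MAX(CASE WHEN key_name = 'key2' THEN key_value END) AS key2
       -- ...and so on, once per key you want as a column...
FROM key_value_store
GROUP BY run_time;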
What I look for
Opportunities to group keys into logical tables. This has to do with the first design, and it might not apply to you. It sounds like your application is basically storing a log file, and you don't know which keys will have values on each run.
Opportunities to reduce the number of rows. I'd ask, "Can we write less often?" So I'd be looking at writing to the database every 5 or 6 seconds instead of every 3 seconds, assuming that means I'm writing fewer rows. (The real goal is fewer rows, not fewer writes.)
The right platform. PostgreSQL 9.2 might be a better choice for this. Version 9.2 has index-only scans, and it has an hstore module that implements a key-value store.
Test before you decide
If I were in your shoes, I'd build this table in both MySQL and PostgreSQL. I'd load each with about a million rows of random-ish data. Then I'd try some queries and reports on each. (Reports are important.) Measure the performance. Increase the load to 10 million rows, retune the server and the dbms, and run the same queries and reports again. Measure again.
Repeat with 100 million rows. Quit when you're confident. Expect all this to take a couple of days.

Prevent duplicate rows in MySQL without unique index/constraint?

I'm writing an application that needs to deal with millions of URLs. It also needs to do retrieval by URL.
My table currently looks like this:
CREATE TABLE Pages (
id bigint(20) unsigned NOT NULL,
url varchar(4096) COLLATE utf8_unicode_ci NOT NULL,
url_crc int(11) NOT NULL,
PRIMARY KEY (id),
KEY url_crc (url_crc)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The idea behind this structure is to do look ups by a CRC32 hash of the URL, since a b-tree index would be very inefficient on URLs which tend to have common prefixes (InnoDB doesn't support hash indexes). Duplicate results from the CRC32 are filtered by a comparison with the full URL. A sample retrieval query would look like this:
SELECT id
FROM Pages
WHERE url_crc = 2842100667
AND url = 'example.com/page.html';
The problem I have is avoiding duplicate entries being inserted. The application will always check the database for an existing entry before inserting a new one, but it's likely in my application that multiple queries for the same new URL will be made concurrently and duplicate CRC32s and URLs will be entered.
I don't want to create a unique index on url as it will be gigantic. I also don't want to write lock the table on every insert since that will destroy concurrent insert performance. Is there an efficient way to solve this issue?
Edit: To go into a bit more detail about the usage, it's a real-time table for looking up content in response to a URL. By looking up the URL, I can find the internal id for the URL, and then use that to find content for a page. New URLs are added to the system all the time, and I have no idea what those URLs will be beforehand. When new URLs are referenced, they will likely be slammed by simultaneous requests referencing the same URLs, perhaps hundreds per second, which is why I'm concerned about the race condition when adding new content. The results need to be immediate and there can't be read lag (subsecond lag is okay).
To start, new URLs will be added only a few thousand per day, but the system will need to handle many times that before we have time to move to a more scalable solution next year.
One other issue with just using a unique index on url is that the length of the URLs can exceed the maximum length of a unique index. Even if I drop the CRC32 trick, it doesn't solve the issue of preventing duplicate URLs.
Have you actually benchmarked and found the btree to be a problem? I sense premature optimization.
Second, if you're worried about the start of all the strings being the same, one answer is to index your URL reversed—last character first. I don't think MySQL can do that natively, but you could reverse the data in your app before storing it. Or just don't use MySQL.
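If you want to experiment with that, one way to approximate it in MySQL (my sketch, not part of the suggestion above; the column and index names are made up) is to keep the reversed URL in its own indexed column:
ALTER TABLE Pages
  ADD COLUMN url_rev VARCHAR(4096) COLLATE utf8_unicode_ci NOT NULL,
  ADD INDEX idx_url_rev (url_rev(255));  -- 255 utf8 chars stays under InnoDB's 767-byte prefix limit
-- on insert, fill it in the application or with: ... SET url_rev = REVERSE(url) ...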
Have you considered creating a UNIQUE INDEX (url_crc, url) ? It may be 'gigantic', but with the number of collisions you're going to get using CRC32, it will probably help the performance of your page retrieval function while also preventing duplicate urls.
Another thing to consider would be allowing duplicates to be inserted, and removing them nightly (or whenever traffic is low) with a script.
In addition to your Pages table, create 3 additional tables with the same columns (PagesInsertA, PagesInsertB and PagesInsertC). When inserting URLs, check against Pages for an existing entry, and if it's not there, insert the URL into PagesInsertA. You can either use a unique index on that smaller table, or include a step to remove duplicates later (discussed below). At the end of your rotation time (perhaps one minute, see discussion below for constraints), switch to inserting new URLs into PagesInsertB, and perform the following steps on PagesInsertA:
Remove duplicates (if you weren't using a unique index).
Remove any entries that duplicate entries in PagesInsertC (that table will be empty the first time around, but not the second).
Add the entries from PagesInsertA to Pages.
Empty PagesInsertC.
At the end of the second period, switch to inserting new URLs into PagesInsertC. Perform the steps discussed above on PagesInsertB (only difference is that you will remove entries also found in PagesInsertA and empty PagesInsertA at the end). Continue the rotation of the table the new URLs are inserted into (A -> B -> C -> A -> ...).
A minimum of 3 insert tables is needed because there will be a lag between switching URL insertion to a new insert table and inserting the cleaned-up rows from the previous insert table into Pages. I used 1 minute as the time between switches in this example, but you can make that time smaller as long as the insertion from PagesInsertA to Pages and the emptying of PagesInsertC (for example) occur before the switch between inserting new URLs into PagesInsertB and PagesInsertC.
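For illustration, one rotation step (say, right after switching inserts from PagesInsertA to PagesInsertB) might be sketched as follows, assuming the insert tables share the Pages columns:
DELETE a FROM PagesInsertA AS a
JOIN PagesInsertC AS c ON c.url_crc = a.url_crc AND c.url = a.url;  -- drop rows already handled last cycle
-- (internal de-duplication of PagesInsertA omitted; a unique index on the small table covers it)

INSERT INTO Pages (id, url, url_crc)
SELECT id, url, url_crc FROM PagesInsertA;  -- move the cleaned-up rows into the main table

TRUNCATE TABLE PagesInsertC;  -- empty the table that was merged two cycles ago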