Database design for users and top 25 recommendations - MySQL

I have a MySQL database and need to store up to 25 recommendations for each user (shown when the user visits the site). Here is my simple table, which holds the user ID, the recommendation, and the rank of the recommendation:
userid | recommendation | rank
1 | movie_A | 1
1 | movie_X | 2
...
10 | movie_B | 1
10 | movie_A | 2
....
I expect about 10M users, which combined with 25 recommendations each would result in 250M rows. Is there a better way to design a user-recommendation table?
Thanks!

Is your requirement only to retrieve the 25 recommendations and send them to a UI layer for consumption?
If that is the case, the system that computes the recommendations can build a JSON document and store it against the user ID. MySQL has support for a JSON datatype.
This might not be a good approach if you want to run search queries against the JSON document.
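For illustration, a minimal sketch of that approach (the table and column names are hypothetical, and the JSON type needs MySQL 5.7.8 or later):
-- Hypothetical table: one row per user, the whole list stored as a JSON document.
CREATE TABLE user_recommendations (
  user_id INT UNSIGNED NOT NULL PRIMARY KEY,
  recommendations JSON NOT NULL
);

-- The recommender overwrites the document in a single statement.
INSERT INTO user_recommendations (user_id, recommendations)
VALUES (1, '[{"rank": 1, "movie": "movie_A"}, {"rank": 2, "movie": "movie_X"}]')
ON DUPLICATE KEY UPDATE recommendations = VALUES(recommendations);

-- The UI layer reads the whole document back in one lookup.
SELECT recommendations FROM user_recommendations WHERE user_id = 1;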

250 million rows isn't unreasonable in a simple table like this:
CREATE TABLE UserMovieRecommendations (
  user_id INT UNSIGNED NOT NULL,
  movie_id INT UNSIGNED NOT NULL,
  rank TINYINT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id, movie_id, rank),
  FOREIGN KEY (user_id) REFERENCES Users(user_id),
  FOREIGN KEY (movie_id) REFERENCES Movies(movie_id)
);
That's 9 bytes per row, so only about 2 GB:
25 * 10,000,000 * 9 bytes = 2,250,000,000 bytes, or about 2.1 GB.
Perhaps double that to account for indexes and so on. Still not hard to imagine a MySQL server configured to hold the entire data set in RAM. And it's probably not necessary to hold all the data in RAM, since not all 10 million users will be viewing their data at once.
You might never reach 10 million users, but if you do, I expect that you will be using a server with plenty of memory to handle this.
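For completeness, a hedged example of the read path (the user_id value is a placeholder): fetching one user's list touches at most 25 rows via the primary-key prefix on user_id, and sorting those 25 rows by rank is trivial.
-- `rank` is a reserved word in MySQL 8.0.2+, hence the backticks.
SELECT movie_id, `rank`
FROM UserMovieRecommendations
WHERE user_id = 12345
ORDER BY `rank`;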

Related

MySQL database design for tracking whether a user has seen an item

I want to show a user only the content he has not viewed yet.
I considered storing a string of the item IDs a user has viewed, separated by ',', but then I won't know the possible length of the string.
The alternative I could find was to store it like a log, in a table such as:
user_id | item_id
1 | 1
2 | 2
1 | 2
Which approach would be better for around ten thousand users and thousands of items?
A table of pairs like that would be only 10M rows. That is "medium sized" as tables go.
Have
PRIMARY KEY(user_id, item_id),
INDEX(item_id, user_id)
And, if you are not going past 10K users and 1K items, consider using SMALLINT UNSIGNED (up to 64K in 2 bytes). Or, to be more conservative, MEDIUMINT UNSIGNED (up to 16M in 3 bytes).
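Putting those suggestions together, a minimal sketch (the table name and the existing items table are assumptions):
-- Hypothetical table of (user, item) "seen" pairs with the suggested keys
-- and the smaller integer type (MEDIUMINT UNSIGNED: up to ~16M in 3 bytes).
CREATE TABLE user_item_seen (
  user_id MEDIUMINT UNSIGNED NOT NULL,
  item_id MEDIUMINT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id, item_id),   -- "what has this user seen?"
  INDEX (item_id, user_id)          -- "which users have seen this item?"
) ENGINE=InnoDB;

-- Content user 42 has NOT viewed yet (assumes an items table exists):
SELECT i.item_id
FROM items AS i
LEFT JOIN user_item_seen AS s
       ON s.item_id = i.item_id AND s.user_id = 42
WHERE s.user_id IS NULL;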

Optimizing SQL Query from a Big Table Ordered by Timestamp

We have a big table with the following table structure:
CREATE TABLE `location_data` (
  `id` int(20) NOT NULL AUTO_INCREMENT,
  `dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `device_sn` char(30) NOT NULL,
  `data` char(20) NOT NULL,
  `gps_date` datetime NOT NULL,
  `lat` double(30,10) DEFAULT NULL,
  `lng` double(30,10) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `dt` (`dt`),
  KEY `data` (`data`),
  KEY `device_sn` (`device_sn`,`data`,`dt`),
  KEY `device_sn_2` (`device_sn`,`dt`)
) ENGINE=MyISAM AUTO_INCREMENT=721453698 DEFAULT CHARSET=latin1
We frequently run queries such as the following:
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' ORDER BY dt DESC LIMIT 1;
or
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' AND dt >= '2014-01-01 00:00:00' AND dt <= '2014-01-01 23:00:00' ORDER BY dt DESC;
We have been optimizing this in a few ways:
By adding an index and using FORCE INDEX on device_sn.
By separating the table into multiple tables based on the date (e.g. location_data_20140101), pre-checking whether there is data for a given date, and then querying that particular table alone. These tables are created by cron once a day, and the rows in location_data for that date are deleted.
The table location_data is HIGH WRITE and LOW READ.
However, at times the query still runs really slowly. I wonder if there are other methods or ways to restructure the data so that we can read rows sequentially by date for a given device_sn.
Any tips are more than welcome.
EXPLAIN STATEMENT 1ST QUERY:
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| 1 | SIMPLE | location_dat | ref | data,device_sn,device_sn_2 | device_sn | 50 | const,const | 1 | Using where |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
EXPLAIN STATEMENT 2nd QUERY:
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | test_udp_new | range | dt,data,device_sn,device_sn_2 | dt | 4 | NULL | 1 | Using where |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
The index device_sn (device_sn,data,dt) is good. MySQL should use it without any need for FORCE INDEX. You can verify that by running EXPLAIN SELECT ...
However, your table is MyISAM, which only supports table-level locks. If the table is write-heavy, it may be slow. I would suggest converting it to InnoDB.
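A minimal sketch of that conversion (it rewrites the whole table, so on ~720M rows expect a long run time and roughly the table's size in free disk space):
-- Rebuilds location_data with the InnoDB storage engine.
ALTER TABLE location_data ENGINE=InnoDB;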
OK, I'll provide the info that I know; it might not answer your question, but it could provide some insight.
There are certain differences between InnoDB and MyISAM. Forget about full-text indexing or spatial indexes; the huge difference is in how they operate.
InnoDB has several great features compared to MyISAM.
First off, it can cache the data set it works with in RAM. This is why database servers come with a lot of RAM - so that I/O operations can be served quickly. For example, an index scan is faster if the indexes are in RAM rather than on disk, because finding data on a hard disk is several orders of magnitude slower than doing it in RAM. The same applies to full table scans.
The variable that controls this for InnoDB is called innodb_buffer_pool_size. By default it's 8 MB, if I am not mistaken. I personally set this value high, sometimes even up to 90% of available RAM. Usually, once this value is tuned, people see incredible speed gains.
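For example (the 8 GB figure below is a placeholder; pick a value that fits your server's RAM):
-- Check the current buffer pool size (in bytes).
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Dynamic from MySQL 5.7.5 onward; older versions need
-- innodb_buffer_pool_size = 8G under [mysqld] in my.cnf plus a restart.
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;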
The other thing is that InnoDB is a transactional engine. That means it will tell you whether a write to disk succeeded or failed, and that answer will be 100% correct. MyISAM won't do that, because it doesn't force the OS to make the hard drive commit data permanently. That's why records are sometimes lost with MyISAM: it thinks the data is written because the OS said it was, when in reality the OS was optimizing the write and the drive's buffer could lose the data before it was ever written down. The OS uses the drive's buffers to gather larger chunks of data and then flushes them in a single I/O, so you don't have control over how the data is actually written.
With InnoDB you can start a transaction, execute, say, 100 INSERT queries, and then commit. That effectively forces the hard drive to flush all 100 inserts at once, using one I/O. If each INSERT is 4 KB, 100 of them is 400 KB. That means you use 400 KB of your disk's bandwidth with one I/O operation, and the remaining I/O capacity is available for other uses. This is how inserts are optimized.
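A minimal sketch of that batching pattern against the table above (the values are placeholders):
-- Group many inserts into one transaction so InnoDB flushes them together
-- at COMMIT instead of syncing for every single row.
START TRANSACTION;
INSERT INTO location_data (device_sn, data, gps_date, lat, lng)
VALUES ('SN001', 'location', '2014-01-01 00:00:01', 48.1486000000, 17.1077000000);
INSERT INTO location_data (device_sn, data, gps_date, lat, lng)
VALUES ('SN002', 'location', '2014-01-01 00:00:02', 48.1486100000, 17.1077100000);
-- ... up to a few hundred rows per batch ...
COMMIT;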
Next are indexes with low cardinality - cardinality is the number of distinct values in an indexed column, so for a unique column such as the primary key it equals the row count, which is the highest possible value. A low-cardinality index is one on a column with only a few distinct values, such as a yes/no flag. If an index's cardinality is too low, MySQL will prefer a full table scan - it's MUCH quicker. Also, forcing an index that MySQL doesn't want to use can (and probably will) slow things down: with an indexed lookup MySQL processes records one by one, whereas with a table scan it can read multiple records at once and skip that per-record work. If those records were written sequentially on a mechanical disk, further optimizations are possible.
TL;DR:
use InnoDB on a server where you can allocate sufficient RAM
set the value of innodb_buffer_pool_size large enough so you can allocate more resources for faster querying
use an SSD if possible
try to wrap multiple INSERTs into a transaction so you make better use of your hard drive's bandwidth and I/O
avoid indexing columns that have a low unique-value count compared to the row count - they just waste space (though there are exceptions)

How can I handle a billion records effectively?

I have a performance issue when handling a billion records with a SELECT query. I have a table like this:
CREATE TABLE `temp_content_closure2` (
  `parent_label` varchar(2000) DEFAULT NULL,
  `parent_code_id` bigint(20) NOT NULL,
  `parent_depth` bigint(20) NOT NULL DEFAULT '0',
  `content_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  KEY `code_content` (`parent_code_id`,`content_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY KEY (parent_depth)
PARTITIONS 20 */
I used partitioning, which should improve performance by subdividing the table, but it is not useful in my case. Here is a sample of the data in this table:
+----------------+----------------+--------------+------------+
| parent_label | parent_code_id | parent_depth | content_id |
+----------------+----------------+--------------+------------+
| Taxonomy | 20000 | 0 | 447 |
| Taxonomy | 20000 | 0 | 2286 |
| Taxonomy | 20000 | 0 | 3422 |
| Taxonomy | 20000 | 0 | 5916 |
+----------------+----------------+--------------+------------+
Here content_id is unique with respect to parent_depth, so I used parent_depth as the partitioning key. At every depth I have 2,577,833 rows to handle, so partitioning does not help here. I got the idea from some websites to use the ARCHIVE storage engine, but it does full table scans and cannot use an index on SELECT. Basically, 99% of what I run against this table is SELECT queries, and the table grows every day. I am currently on MySQL 5.0.1. I have also thought about moving to a NoSQL database, but is there any way to handle this in MySQL? If you are suggesting NoSQL, should I use Cassandra or Accumulo?
Add an index like this:
ALTER TABLE temp_content_closure2 ADD INDEX content_id (content_id);
You can also add multiple indexes if you have more specific SELECT criteria which will also speed things up.
Multiple and single indexes
Overall though, if you have a table like this that's growing so fast, then you should probably be looking at restructuring your SQL design.
Check out "Big Data" solutions as well.
With that size and volume of data, you'd need to either set up a sharded MySQL installation across a cluster of machines (Facebook and Twitter have stored massive amounts of data on sharded MySQL setups, so it is possible), or alternatively use a Bigtable-style solution that natively distributes the data amongst the nodes of a cluster - Cassandra and HBase are the most popular alternatives here. You must realize that a billion records on a single machine will hit almost every limit of the system: I/O first, followed by memory, followed by CPU. It's simply not feasible.
If you do go the Bigtable way, Cassandra will be the quickest to set up and test. However, if you anticipate map-reduce-style analytic needs, then HBase is more tightly integrated with the Hadoop ecosystem and should work out well. Performance-wise, they are neck and neck, so take your pick.

What is the biggest ID number that auto-increment can produce in MySQL?

I have a database that is rapidly being filled with data - we're talking 10-20k rows per day.
What is the limit of an ID with the auto-increment option? If the ID is created as INTEGER, is the max value 2,147,483,647 (for signed values)?
But what happens when auto-increment goes above this? Does it all collapse? What would the solution be then?
I am sure that a lot of people have big databases, and I would like to hear from them.
Thank you.
If you are worried about it growing out of bounds too quickly, I would set the PK as an UNSIGNED BIGINT. That gives you a max value of 18446744073709551615, which should be sufficient.
                    | Min. (inclusive)           | Max. (inclusive)
--------------------+----------------------------+----------------------------
INT Signed (+|-)    | -2,147,483,648             | 2,147,483,647
INT Unsigned (+)    | 0                          | 4,294,967,295
BIGINT Signed (+|-) | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807
BIGINT Unsigned (+) | 0                          | 18,446,744,073,709,551,615
MySQL reference.
If you have a MySQL table with an ID column (INT UNSIGNED) with auto_increment and the table already has 4,294,967,295 records, then when you try to insert one more record, the new record's ID is clamped to the maximum, 4,294,967,295, and you get the MySQL error Duplicate entry '4294967295' for key 'PRIMARY' - otherwise you would have duplicate IDs in a column that is set as the primary key.
2 Possible Solutions:
Easy approach: extend the limit by setting the ID to BIGINT UNSIGNED, just like Dan Armstrong said (see the first sketch below). This doesn't mean it's unbreakable, though, and performance might be affected when the table gets really large.
Harder approach: use partitioning (see the second sketch below), which is a little more complicated, but gives better performance and effectively no database limit (your only limit is the size of your physical hard disk). Twitter (and similarly huge websites) use this kind of approach for their millions of tweets (records) per day!
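First sketch - widening the key (my_table and id are placeholders; the ALTER rebuilds the table, so expect downtime on a big one):
-- Widen the auto-increment primary key to BIGINT UNSIGNED
-- (new maximum: 18,446,744,073,709,551,615).
ALTER TABLE my_table
  MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;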
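Second sketch - RANGE partitioning on the ID (again with placeholder names; note that in MySQL every unique key, including the primary key, must include the partitioning column):
-- Split my_table into chunks by ID range; further partitions can be added
-- later with ALTER TABLE ... ADD PARTITION as the ID grows.
ALTER TABLE my_table
PARTITION BY RANGE (id) (
  PARTITION p0 VALUES LESS THAN (1000000000),
  PARTITION p1 VALUES LESS THAN (2000000000),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);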

MySQL select - improve performance

I am working on an e-shop which sells products only via loans. I display 10 products per page in any category, and each product has 3 different price tags - 3 different loan types. Everything went pretty well during testing: query execution time was perfect. But today, when I transferred the changes to the production server, the site "collapsed" in about 2 minutes. The query used to select loan types sometimes hangs for ~10 seconds, and it happens frequently, so the server can't keep up and it's hella slow. The table that stores the data has approximately 2 million records, and each select looks like this:
SELECT *
FROM products_loans
WHERE KOD IN("X17/Q30-10", "X17/12", "X17/5-24")
AND 369.27 BETWEEN CENA_OD AND CENA_DO;
There are 3 loan types, and the price needs to be in the range between CENA_OD and CENA_DO, so 3 rows are returned.
But since I need to display 10 products per page, I need to run this through a modified select using OR, since I didn't find any other solution. I have asked about it here, but got no answer. As mentioned in the referenced post, this has to be done separately, since there is no column that could be used in a join (except of course price and code, but that ended very, very badly). Here is the SHOW CREATE TABLE; KOD and CENA_OD/CENA_DO are indexed via INDEX.
CREATE TABLE `products_loans` (
  `KOEF_ID` bigint(20) NOT NULL,
  `KOD` varchar(30) NOT NULL,
  `AKONTACIA` int(11) NOT NULL,
  `POCET_SPLATOK` int(11) NOT NULL,
  `koeficient` decimal(10,2) NOT NULL default '0.00',
  `CENA_OD` decimal(10,2) default NULL,
  `CENA_DO` decimal(10,2) default NULL,
  `PREDAJNA_CENA` decimal(10,2) default NULL,
  `AKONTACIA_SUMA` decimal(10,2) default NULL,
  `TYP_VYHODY` varchar(4) default NULL,
  `stage` smallint(6) NOT NULL default '1',
  PRIMARY KEY (`KOEF_ID`),
  KEY `CENA_OD` (`CENA_OD`),
  KEY `CENA_DO` (`CENA_DO`),
  KEY `KOD` (`KOD`),
  KEY `stage` (`stage`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Also, selecting all loan types and filtering them later through PHP doesn't work well either, since each type has over 50k records and that select takes too much time as well...
Any ideas for improving the speed are appreciated.
Edit:
Here is the explain
+----+-------------+----------------+-------+---------------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+-------+---------------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | products_loans | range | CENA_OD,CENA_DO,KOD | KOD | 92 | NULL | 190158 | Using where |
+----+-------------+----------------+-------+---------------------+------+---------+------+--------+-------------+
I have tried the combined index and it improved performance on the test server from 0.44 sec to 0.06 sec. I can't access the production server from home though, so I will have to try it tomorrow.
Your issue is that you are searching for intervals which contain a point (rather than the more usual query for all points within an interval). Such queries do not work well with a standard B-tree index, so you need an R-tree index instead. Unfortunately MySQL doesn't let you create an R-tree index on an ordinary column, but you can get the desired index by changing your column type to GEOMETRY and using the geometric functions to check whether the interval contains the point.
See Quassnoi's article Adjacency list vs. nested sets: MySQL where he explains this in more detail. The use case is different, but the techniques involved are the same. Here's an extract from the relevant part of the article:
There is also a certain class of tasks that require searching for all ranges containing a known value:
Searching for an IP address in the IP range ban list
Searching for a given date within a date range
and several others. These tasks can be improved by using R-Tree capabilities of MySQL.
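A hedged sketch of that technique, using a new helper column (price_range is introduced here purely for illustration; SPATIAL indexes require MyISAM before MySQL 5.7, while InnoDB supports them from 5.7 onward):
-- Encode each (CENA_OD, CENA_DO) interval as a geometry whose bounding box
-- spans the price range, then index it with an R-tree (SPATIAL) index.
ALTER TABLE products_loans ADD COLUMN price_range GEOMETRY;

UPDATE products_loans
   SET price_range = LineString(Point(CENA_OD, -1), Point(CENA_DO, 1));

ALTER TABLE products_loans
  MODIFY price_range GEOMETRY NOT NULL,
  ADD SPATIAL INDEX (price_range);

-- "Which price ranges contain 369.27?" becomes a bounding-box containment test.
SELECT *
  FROM products_loans
 WHERE KOD IN ('X17/Q30-10', 'X17/12', 'X17/5-24')
   AND MBRContains(price_range, Point(369.27, 0));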
Try refactoring your query like this (note that 369.27 BETWEEN CENA_OD AND CENA_DO expands to CENA_OD <= 369.27 AND CENA_DO >= 369.27):
SELECT * FROM products_loans
WHERE KOD IN("X17/Q30-10", "X17/12", "X17/5-24")
AND CENA_OD <= 369.27
AND CENA_DO >= 369.27;
(MySQL is not very smart when choosing indexes) and check the performance.
The next thing to try is adding a combined key: (KOD, CENA_OD, CENA_DO).
And the next major step is to refactor your schema so that products are separated from prices. That should really help.
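A hedged sketch of what such a separation might look like (all table and column names below are hypothetical guesses at the intent, not a prescribed design):
-- One row per product, with the loan/price bands in a child table,
-- instead of repeating product data for every price band.
CREATE TABLE products (
  product_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  kod        VARCHAR(30) NOT NULL,
  UNIQUE KEY (kod)
) ENGINE=InnoDB;

CREATE TABLE product_loan_prices (
  product_id INT UNSIGNED NOT NULL,
  cena_od    DECIMAL(10,2) NOT NULL,
  cena_do    DECIMAL(10,2) NOT NULL,
  koeficient DECIMAL(10,2) NOT NULL,
  PRIMARY KEY (product_id, cena_od, cena_do),
  FOREIGN KEY (product_id) REFERENCES products(product_id)
) ENGINE=InnoDB;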
PS: you can also migrate to postgresql, it's smarter than mysql when choosing right indexes.
MySQL can generally use only one index per table in a query. If you always look rows up by those 3 columns, then depending on the actual data (ranges) in the columns, one of the following could very well add a serious amount of performance:
ALTER TABLE products_loans ADD INDEX(KOD, CENA_OD, CENA_DO);
ALTER TABLE products_loans ADD INDEX(CENA_OD, CENA_DO, KOD);
Notice that the order of the columns matters! If that doesn't improve performance, give us the EXPLAIN output of the query.