Database for 'who viewed this item also viewed..' - mysql

I want to create feature 'who viewed this item also viewed' like Amazon or Ebay. I'm deciding between MySql and non-relational database like MongoDB.
Edit: It seems to be straightforward to implement this feature in MySql. My guess is creating 'viewed' table in which userId, itemId, and time of viewing are saved. So, when trying to recommend off of a current item a user is looking at, I would Sub = (SELECT userId FROM viewed WHERE itemId == currentItemId) Then, SELECT itemId FROM viewed INNER JOIN Sub on viewed.userId = Sub.userId
Wouldn't this be too much for 100,000 users who viewed 100 pages this month?
For non-relational database, I don't feel it is right to have User to embed all users or Item to embed all Users. So, I'm thinking to have each User holds a list of itemIds he looked at and each Item holds a list of userIds seen by. And I'm not sure what to do next. Am I on the right path here?
If not, could you suggest a good way to implement this feature in non-relational database? And, does this suggestion have advantage in speed compared to MySql?

Initial Response
It seems to be straightforward to implement this feature in MySql by just calling JOIN on Item and User table.
Yes.
But, how fast or slow the database call will be to gather entire viewing history of 100,000 users at once?
How long is a piece of string ?
That depends on the standards and quality of your Relational Database implementation. If you have ID fields on all your files, it won't have Relational integrity, power, or speed, it will have 1970's ISAM Record Filing System speeds.
On a Sybase ASE server, on a small Unix box, a SELECT of similar intent on a table (not a file) with 16 billion rows returns 100 rows in 12 milliseconds.
For non-relational database, I don't feel it is right to have User to embed all users or Item to embed all Users. So, I'm thinking to have each User holds a list of item ids he looked at and each Item holds a list of user ids seen by.
I can't answer re MangoDb.
But for a Relational Database, that is how we implement it.
with one great difference: the two lists are implemented in a single table
each row is a single fact viewed [sorry] from two sides (the fact that an User has viewed an Item, is one and the same fact that an Item has been viewed by an User)
So it appears to be Relational thinking ... implemented Mango-style, which requires 100% data and table duplication. I have no idea whether that is good or bad in MongoDb, in the sense that it could well be what is required for the thing to "perform". Ugly as sin.
And I'm not sure what to do next. Am I on the right path here?
Right for Relational (as long as you use one table for the two "lists"). Ask a more specific question if you do not understand this point.
If not, could you suggest a good way to implement this feature in non-relational database? And, does this suggestion have advantage in speed compared to MySql?
Sorry, I can't answer that.
But it would be unlikely that a non-relational DB can store and retrieve info that is classic Relational, faster than a semi-relational Record Filing System such as MySQL. All things being equal, of course. A real SQL platform would be faster still.
Response to Comments
First you had:
So, I'm thinking to have each User holds a list of item ids he looked at and each Item holds a list of user ids seen by.
That is two lists. That is not good, because the second list is a 100% duplication of the first.
Now you have (edited in the Question, and in the new comments):
I didn't fully understand what you meant by 'use one table for the two list'. My interpretation is create 'viewed' table in which userId, itemId, and time of viewing are saved.
That is good, you now have one list.
Just to be clear about the database we are discussing, let me erect a model, and have you confirm it.
User Item Data Model
If you are not used to the standard Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Refer to the IDEF1X Notation.
So, when trying to recommend off of a current item a user is looking at, I would Sub = (SELECT userId FROM viewed WHERE itemId == currentItemId). Then, SELECT itemId FROM viewed INNER JOIN Sub on viewed.userId = Sub.userId. Is this what you mean?
I did make a declaration and caution about the table, but I didn't give any directions regarding non-SQL coding, so no.
I would never suggest doing something in two steps, that can be done in a single step. SQL has its problems, but difficulty in obtaining information from a set of Relational tables (ie. a derived relation) using a single SELECT is definitely not one of them.
SUB is not SQL. Although I can guess at what it does, I may well be wrong, therefore I cannot comment on that code.
Against the model I have supplied, on an ISO/IEC/ANSI Standard SQL platform, I would use:
SELECT DISTINCT ItemId -- Items viewed by ...
FROM UserItem
WHERE UserId = (
SELECT UserId -- Users who viewed Item
FROM UserItem
WHERE ItemId = #CurrentItemId
)
You will have to translate that into the non-SQL that your platform requires.
Wouldn't it be too much for 100,000 users who viewed 100 pages this month? Sorry for long question.
I have already answered that question in my initial response. Please read again.
You are trying to solve a performance problem that you do not yet have. That is not possible, given the laws of physics, the dependencies, our inability to reverse the chronology; etc. Therefore I recommend that you cease that activity.
Meanwhile, back at the farm, the cows need to be fed. Design the database first, then code the app, then if, and only if, there are performance problems, you can address them. IT Professionals can make scientific estimates, but I cannot give you a tutorial here in SO.
10,000,000 page views per month. You have not stated the no of Items, so the large figure is scary as hell. if you inform me as to how many Items; Users; Average Items viewed per session; and the duration (eg. month) you wish to cover, I can give you more specific advice.
As I understand it, an User views 1 (one) Item. As a selling-up feature, you want the system to identify the list of Items people "who viewed this item also viewed ...". That would appear to be a small fraction of 10,000,000 views. You do have an index on each table, yes ? So the non-SQL program you are using will not read 10,000,000 views to find that fraction, it will navigate the index, and read only the pages that contain that fraction.
Some of the non-SQLs need a second index to perform what real SQL platforms perform with one index. I have given that second index in the model.
While I appreciate that it was alright that a full definition was not provided for the file you described, up to now, since I am providing a model, I have to provide a complete and correct one, not a partial one.
Since Users view Items more than once, I have given a table that allows that, and tracks the Number of Views, and the Date Last Viewed. It is one row per User::Item, ever. If you would like a table that supports one row per User::Item view, please ask, I will provide.
From where I sit, on the basis of facts established thus far, the 10,000,000 figure is not concern.

This probably depends more on how you implement this feature than on the type of database used.
If you just store a lot of viewing history (like, "user x looked at item y"), you'd have to check out the users who viewed an item, and then all the items those users looked at. That can all be done on a single database table. However may end up with very large result sets.
It may be easier to use a graph structure of "connected" items that is continually updated during runtime and then easily queried.

Related

MySql multiple tables vs single table: performance

Imagine the following categories: bars, places to eat, shops, etc...
Each of these categories will have some common fields, such as
id, name, address and geolocation (Lat and Lng position).
I am strongly in doubt whether I should create one table that combines these different categories, or whether I should split it up into separate tables (1 table per category).
My thought is that retrieving places based on category and geolocation for each category table separately would be faster in both retrieval and update, certainly when the number of places per category increases.
From this approach, I would go for 1 table per category.
BUT there is a supplementary requirement. Every place will have an owner (user), and one user could own multiple places. So this means that I would either:
Need a many-to-many table that connects the user table with the central giant table;
Need one many-to-many table for each of the categories.
Second BUT the situation is, that when the user logs in, all places that this user owns will be returned from a single query, ie. the id + name, assuming that a user could be the owner of multiple places.
From this point of view, the second option seems a very bad idea, because it appears that I would need to create queries to scan through each of the tables.
I realize that I can use indexes to speed up long table scans, yet I am still uncertain about the performance when the number of places would increase dramatically, and there are currently about 8 different categories.
What do you consider the most optimal solution, based on the options that I proposed (or do you see a better option I'm missing out on?).
I should point out that the web application will not often mix up categories, although bars will be able to create events as well.
The answer to this question is invaluable to me, because it will define the fundaments for the further development of my application.
Let me know if anything in my question is not clear to you.
Thank you.
When the data is that similar, a single table is usually best. Consider also that in the future you may add a new category - it would be much easier to just update a category field to include the new category than build a new table, new queries, and modify all the existing code.
As far as speed goes, indexes will make it irrelevant. Add a category field, index it, and speed won't be an issue.

Is is necessary to link or join tables in MySQL?

I've created many databases before, but I have never linked two tables together. I've tried looking around, but cannot find WHY one would need to link two or more tables together.
There is a good tutorial here that goes over database relationships, but does not explain why they would be needed. He just simply says that they are.
Are they truly necessary? I understand that (in his example) all orders have a customer, and so one would link the orders table to the customers table, but I just don't see why this would be absolutely necessary. I can (and have) created shopping carts and other complex databases that work just fine without creating any table relationships.
I've just started playing around with MySQL Workbench v6.0 for a new project that has a fairly large and complex database, and so I'm wondering if I am losing anything by creating the entire project without relationships?
NOTE: Please let me know if this question is too general or off topic, and I will change it. I understand that a lot can be said about this topic, and so I'm really just looking to know if I am opening myself up to any security issues or significant performance issues by not using relationships. Please be specific in your response; "Yes you are opening yourself up to performance issues" is useless and not helpful for myself, nor for anyone else looking at this thread at a later date. Please include details and specifics in your response.
Thank you in advance!
As Sam D points out in the comments, entire books can be written about database design and why having tables with relationships can make a lot of sense.
That said, theoretically, you lose absolutely no expressive/computational power by just putting everything in the same table. The primary arguments against doing so likely deal with performance and maintenance issues that might arise.
The answer revolves around granularity, space consumption, speed, and detail.
Inherently different types of data will be more granular than others, as items can always be rolled up to a larger umbrella. For a chain of stores, items sold can be rolled up into transactions, transactions can be rolled up into register batches, register batches can be rolled up to store sales, store sales can be rolled up to company sales. The two options then are:
Store the data at the lowest grain in a single table
Store the data in separate tables that are dedicated to purpose
In the first case, there would be a lot of redundant data, as each item sold at location 3 of 430 would have store, date, batch, transaction, and item information. That redundant data takes up a large volume of space, when you could very easily create separated tables for their unique purpose.
In this example, lets say there were a thousand transactions a day totaling a million items sold from that one store. By creating separate tables you would have:
Stores = 430 records
Registers = 10 records
Transactions = 1000 records
Items sold = 1000000 records
I'm sure your asking where the space savings comes in ... it is in the detail for each record. The store table has names, address, phone, etc. The register has number, purchase date, manager who reconciles, etc. Transactions have customer, date, time, amount, tax, etc. If these values were duplicated for every record over a single table it would be a massive redundancy of data adding up to far more space consumption than would occur just by linking a field in one table (transaction id) to a field in another table (item id) to show that relationship.
Additionally, the amount of space consumed, as well as the size of the overall table, inversely impacts the speed of you querying that data. By keeping tables small and capitalizing on the relationship identifiers to link between them, you can greatly increase the response time. Every time the query engine needs to find a value, it traverses the table until it finds it (that is a grave oversimplification, but not untrue), so the larger and broader the table the longer the seek time. These problems do not exist with insignificant volumes of data, but for organizations that deal with millions, billions, trillions of records (I work for one of them) storing everything in a single table would make the application unusable.
There is so very, very much more on this topic, but hopefully this gives a bit more insight.
Short answer: In a relational database like MySQL Yes. Check this out about referential integrity http://databases.about.com/cs/administration/g/refintegrity.htm
That does not mean that you have to use relational databases for your project. In fact the trend is to use Non-Relational databases (NoSQL), like MongoDB to achieve same results with better performance. More about RDBMS vs NoSQL http://www.zdnet.com/rdbms-vs-nosql-how-do-you-pick-7000020803/
I think that with this example you will understand better:
Let's we want to create on-line store. We have at minimum Users, Payments and Events (events about the pages where the user navigates or other actions). In this scenario we want to link in a secure and relational way the Users with the Payments. We do not want a Payment to be lost or assigned to another User. So we can use a RDBMS like MySQL to create the tables Users and Payments and linked the with proper Foreign Keys. However for the events, we are going to be a lot of them per users (maybe millions) and we need to track them in a fast way without killing the relation database. In that case a No-SQL database like MongoDB makes totally sense.
To sum up to can use an hybrid of SQL and NO-SQL, but either if you use one, the other or both kind of solutions, do it properly.

Information Architecture & Retreival - Prioritizing queries

I've got an application that displays data to the user based on a relevancy score. There are 5 to 7 different types of information that I can display (e.g. Users Tags, Friends Tags, Recommended Tags, Popular Tags, etc.) Each information type would be a separate sql query.
Then I have an algorithm that ranks each type by how relevant it is. The algorithm is based on several factors including how long its been since an action was taken on a particular type, how important one information type is to another, how often one type has been shown, etc.
Once they are ranked, I show them to the users in a feed, similar to Facebook.
My question is a simple one. I need the data before I can run it through the ranking algorithm, so whats the most efficient way to pull only the data I need from the database.
Currently I pull the top 5 instances of each information type, and then rank those. Each piece of data gets a relevancy score, and if I don't have enough results that reach a certain relevancy threshold, I go back to the database for the next 5 of each.
The problem with this approach is that I risk pulling too many of one story type that I never use, and I have to keep going back to the database if I don't get what I need the first time.
I have thought about a massive sql query that incorporates all info types & algorithm, which could work, but that would really be a huge query, and them I'm having mysql do so much processing, and I'm of the general mind set that Mysql should do the data retrieval and my programming language (php) should do the processing stuff.
There has to be a better way! I'm sure there is a scholarly article somewhere, but I haven't been able to find it.
Thanks Stack Overflow
im assuming that by information type you mean (Users Tags, Friends Tags etc); i would recommend rather than fetching your data again again n again against a particular fixed threshold, change your algorithm a little. Try assigning weights to each information type, even if you get a few records of a low priority type, you dont have to fetch it again.

What is the most efficient way to store a list in a relational database?

I have read many strong statements here and elsewhere on the subject of storing arrays in mysql. The rules of normalization seem to suggest its a bad idea and searching within the stored array fosters inelegant code. HOWEVER, for the application I am working on it seems like a reasonable solution to store an array in a field. I'm sure that is what everyone wrongly thinks in this position but I can't figure out a better way. Here is the setup:
I have a series of tables that store registered students, courses they can take and their performance on each course. All are "normalized" to avoid duplication and errors. I want to be able to generate a "myCourses" section so after login the student sees courses they are eligible for and courses they have taken but are free to review. The approach that comes to mind is two arrays; my_eligible_courses and my_completed_courses. On registration, the student is given a set of courses for which they are eligible. This could be stored as rows where there are multiple occurrences of studentid, one for each course they can take:
student1 course 1
student1 course 2
student1 course n
The table could then be queried for all of student 1's eligible courses and displayed as a list when the student logs in.
Alternately, studentid could be a primary key and in a column "eligible_courses" there would be an array (course 1,course 2, course n).
There is a table for student performance, to record every course taken and metrics associated with student performance. It will be queried to report on student performance, quality of course etc but this table will grow quite large. I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they login just to give them a list of completed courses.
One other complication is that the set of courses a student is eligible is variable and expanding as new courses are developed, which to me seems to suggest that generating a set of new columns for each new course is a bad idea-for example, new course_name, pretest_score, posttest_score, time_to_complete, ... Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
So to restate the question, is it better to store "inelegant" arrayed list of eligible and completed courses in a registered student table or dynamically generate these lists?
I'm guessing this is still too vague but any discussion of db design that gives an example of an inelegant array vs a restructured schema would be appreciated.
You should feel confident that if you have indexes on your tables for the appropriate columns, querying for my_completed_courses will be pretty snappy.
When your table grows to the point that you notice slowdown, you can configure your MySQL server with appropriate memory allocation settings so that it can keep more data cached in memory. Or you could look into that now.
In response to the edit you made about adding new courses: Don't add a new column for each course. Don't add a new table for each course. Create a table for courses, and add rows for each course.
You should then be able to join your tables together on indexed columns to generate the list of data you need.
This is a bad idea for two obvious reasons:
DBMS can't enforce proper referentialX (and possibly domain) integrity and relying on application-level integrity is almost always a bad idea.
While the database will be able to answer the query: "based on given student, give me courses", you won't be able to (efficiently) go in the opposite direction, should you ever need to.
X What's to stop a buggy application from storing a non-existent ID in array? Or deleting a course that is still referenced by students? Even if your application is careful about course deletion, there is no way to do it efficiently - you'll need a full table scan to examine all arrays.
Why are you even trying this? A link (aka. junction) table would solve these problems, for a moderate cost of some additional storage space.
If you are really concerned about storage space, you could even switch the DBMS and use one that supports leading-edge index compression (such as Oracle).
I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they login just to give them a list of completed courses.
Databases are very good at querying humongous amounts of data. In this case, if you use the clustering properly, the DBMS will be able to get this data in very few I/O operations, meaning very fast. Did you perform any actual benchmarks? Have you measured any actual performance problem?
Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
Generating a new table may be justified in case it will have different columns. But, that doesn't sound like what you are trying to do.
It seems to me that you simply need:
CHECK (
(COMPLETED = 0 AND (performance fields) IS NULL)
OR (COMPLETED = 1 AND (performance fields) IS NOT NULL)
)
When a student enrolls into course, insert a row in STUDENT_COURSE, set COMPLETED to 0 and leave performance fields NULL.
When the student completed the course, set COMPLETED to 1 and fill the performance fields.
(BTW, you could even omit COMPLETED altogether and just rely on testing the performance fields for NULL.)
InnoDB tables are clustered, which means that rows in STUDENT_COURSE belonging to the same student are stored physically close together, which means that getting courses of the given student is extremely fast.
If you need to go in the opposite direction (get students of a given course), add an index on same fields but in opposite order: {COURSE_ID, STUDENT_ID}. You might even consider covering in this case.
Since we are talking about small number of rows, leaving COMPLETED unindexed is just fine. If you are really concerned about that, you can even do something like:
The COMPLETED_STUDENT_COURSE is a B-Tree only for completed courses (and essentially a subset of STUDENT_COURSE which is a B-Tree for all enrolled courses).
Here are a few thoughts that I believe may assist you in making a good decision.
Generally, it is a rule to use correctly normalised tables. But there can be exceptions to this. Perhaps your project may be such.
Most of the time, new developers tend to focus on getting the data into a DB. They get stuck when it comes to retrieving it for a specific purpose. So given both cases of arrays vs. relational tables, ask your self if either method serves your purpose. For example, if you wanted to list the courses of student X, your array method is just fine. This is because you can retrieve it by the primary key like a student ID. But if you wanted to know how many students are on course A, the array method will be a horrible way to go.
Then again, the above point would depend on your data volume as well. For example, if you only have about a hundred students, you'll probably not notice a difference in performance. But if you're looking at several thousand records and you have a big list of courses for students, the array approach is not the way to go.
Benchmark. This is the best way for you to find out your answer. You can use MySQL's explain or just time it using your program that executes the queries. Try each method with your standard volume of data and see which one works best. For example, in the recent past, MySQL was boasting about their strength of the ISAM engine. Then I had to work on a large application that involved millions of records. And here, I noticed that each time a new record came in, Indexes had to be rebuilt. So now we had to bend the rules. Likewise, you'd better do your tests with the correct volumes of data and make a better decision.
But do not take this example as a rule. Rather, go by the standards of normalisation and only bend the rules for exceptions.

Forum Schema: should the "Topics" table countain topic_starter_Id? Or is it redundant information?

I'm creating a forum app in php and have a question regarding database design:
I can get all the posts for a specific topic.All the posts have an auto_increment identity column as well as a timestamp.
Assuming I want to know who the topic starter was, which is the best solution?
Get all the posts for the topic and order by timestamp. But what happens if someone immediately replies to the topic. Then I have the first two posts with the same timestamp(unlikely but possible). I can't know who the first one was. This is also normalized but becomes expensive after the table grows.
Get all the posts for the topic and order by post_id. This is an auto_increment column. Can I be guaranteed that the database will use an index id by insertion order? Will a post inserted later always have a higher id than previous rows? What if I delete a post? Would my database reuse the post_id later? This is mysql I'm using.
The easiest way off course is to simply add a field to the Topics table with the topic_starter_id and be done with it. But it is not normalized. I believe this is also the most efficient method after topic and post tables grow to millions of rows.
What is your opinion?
Zed's comment is pretty much spot on.
You generally want to achieve normalization, but denormalization can save potentially expensive queries.
In my experience writing forum software (five years commercially, five years as a hobby), this particular case calls for denormalization to save the single query. It's perfectly sane and acceptable to store both the first user's display name and id, as well as the last user's display name and id, just so long as the code that adds posts to topics always updates the record. You want one and only one code path here.
I must somewhat disagree with Charles on the fact that the only way to save on performance is to de-normalize to avoid an extra query.
To be more specific, there's an optimization that would work without denormalization (and attendant headaches of data maintenance/integrity), but ONLY if the user base is sufficiently small (let's say <1000 users, for the sake of argument - depends on your scale. Our apps use this approach with 10k+ mappings).
Namely, you have your application layer (code running on web server), retrieve the list of users into a proper cache (e.g. having data expiration facilities). Then, when you need to print first/last user's name, look it up in a cache on server side.
This avoids an extra query for every page view (as you need to only retrieve the full user list ONCE per N page views, when cache expires or when user data is updated which should cause cache expiration).
It adds a wee bit of CPU time and memory usage on web server, but in Yet Another Holy War (e.g. spend more resources on DB side or app server side) I'm firmly on the "don't waste DB resources" camp, seeing how scaling up DB is vastly harder than scaling up a web or app server.
And yes, if that (or other equally tricky) optimization is not feasible, I agree with Charles and Zed that you have a trade-off between normalization (less headaches related to data integrity) and performance gain (one less table to join in some queries). Since I'm an agnostic in that particular Holy War, I just go with what gives better marginal benefits (e.g. how much performance loss vs. how much cost/risk from de-normalization)