Optimizing a MySQL query to fetch "unseen" entries per user

The title is rather vague, but I couldn't come up with anything clearer.
Long story short, we're building a mobile app connected to a Node.js server that talks to a MySQL database. Pretty common setup. We have multiple connected users who can upload "moments" to our servers. Each moment can be seen only once by every other user.
As soon as a user x sees another user y's moment, x can never see that particular moment of y's again. A bit like Snapchat, except a moment goes from one user to many users instead of one to one. Moments are also ordered by distance according to the current user's location.
Now, I'm looking for an intelligent way of fetching only the "unseen" moments from the database. For now, we're using a relational table between Users and Moments.
Say a user (ID = 20) sees a moment (ID = 30320); we then insert the pair (20, 30320) into this table. I know, this is hardly scalable and probably a terrible idea.
I thought about checking the last-seen date and only fetching moments newer than that date, but since moments are ordered by distance before being ordered by date, it is possible to see a moment that is 3 minutes old followed by one that is 30 seconds old.
Is there a more clever way of doing this, or am I doomed to use a relationship table between Moments and Users, and join to it when querying?
Thanks a lot.
EDIT -
This logic uses 3 tables in total:
Users
Moments
MomentSeen
MomentSeen only records which user has seen which moment, and when. Since the moments aren't ordered by date, I can't simply fetch all the moments uploaded after the last seen one.
EDIT -
I just realized the mobile app Tinder must use similar logic for which user "liked" which other user. Since you can't go back in time and see a user twice, they probably use a query very similar to what I'm looking for.
Considering how many users they have, and that users are ordered by distance and some other unknown criteria, there must be a cleverer way of doing things than a "UserSawUser" relational table.
EDIT
I can't provide the entire database structure so I'll just leave the important tables and some of their fields.
Users {
UserID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
}
Moments {
MomentID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
UploaderID INT UNSIGNED, /* FK to UserID */
TimeUploaded DATE /* usually NOW() at insertion */
}
MomentSeen {
/* Both are FK to Users and Moments */
MomentID INT UNSIGNED,
UserID INT UNSIGNED
}
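
For reference, a fuller MomentSeen along those lines might look like this (the SeenAt column name and the composite primary key are just a sketch, not the exact definition):
CREATE TABLE MomentSeen (
    MomentID INT UNSIGNED NOT NULL,   /* FK to Moments */
    UserID   INT UNSIGNED NOT NULL,   /* FK to Users */
    SeenAt   DATETIME NOT NULL,       /* the "when" mentioned above */
    PRIMARY KEY (UserID, MomentID)    /* one row per user/moment pair */
);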

You can consider implementing a Bloom filter. It is widely used to reduce disk seeks and improve performance.
Medium uses one to check whether a user has already read a post.
More details here:
https://medium.com/the-story/what-are-bloom-filters-1ec2a50c68ff
https://en.wikipedia.org/wiki/Bloom_filter

Do not use one table per user. Do have a single table for the moments.
You seem to have two conflicting orderings for "moments": 'distance' and 'unseen'; which is it?
If it is 'unseen', are the 'moments' numbered chronologically? This implies that each user has a last_moment_seen -- all Moments before then have been seen; all after that have not been seen. So...
SELECT ...
    FROM Moments
    WHERE MomentID > ( SELECT last_moment_seen
                         FROM Users  WHERE UserID = $user_id );
would get all the moments not yet seen for a given user.
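If you go that route, the application also has to advance the pointer after showing moments. A minimal sketch, assuming a last_moment_seen column on Users (not shown in the posted schema):
UPDATE Users
    SET last_moment_seen = 30320      -- highest MomentID just shown
    WHERE UserID = 20
      AND last_moment_seen < 30320;   -- never move the pointer backwards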
Munch on that for a while; then come back for more suggestions.
Edit
This should give you the Moments not yet seen. You can then order them as you see fit.
SELECT m....
    FROM Moments m
    LEFT JOIN MomentSeen ms  ON ms.MomentID = m.MomentID
                            AND ms.UserID = $user_id   -- only this user's "seen" rows
    WHERE ms.MomentID IS NULL
    ORDER BY ...
    LIMIT 1   -- if desired
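
Whichever ordering you use, record each moment as "seen" once it has been displayed, so it drops out of the next query. Using the table names from the question (a minimal sketch):
INSERT IGNORE INTO MomentSeen (MomentID, UserID)
    VALUES (30320, 20);     -- IGNORE only matters if there is a unique key on (UserID, MomentID)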

Why hesitate to use a join?
Have you tried filling your database with dummy data -- millions of rows -- so you can measure the performance impact on your system?
Using joins is not such a bad idea and is often faster than a single table, if done right.
You should probably do some research on database design for reference.
For instance, ordering a table is done using an index.
You can also have more than one index on a table, and each index can combine several columns.
Design them by analyzing the queries most often run against the table.
A handy recipe: create an index covering the columns used as join keys, another index for each common combination of WHERE parameters, and an index matching each ORDER BY run against that table (ascending/descending order matters).
So don't be shy about adding another column or index to suit your needs; a sketch follows below.
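For example, against the schema in the question, that recipe might translate into something like the following (the exact column choices are assumptions -- tailor them to the queries you actually run):
ALTER TABLE MomentSeen ADD PRIMARY KEY (UserID, MomentID);               -- join/lookup key
CREATE INDEX idx_seen_moment    ON MomentSeen (MomentID);                -- anti-join driven from the Moments side
CREATE INDEX idx_moments_time   ON Moments (TimeUploaded);               -- ORDER BY date
CREATE INDEX idx_moments_upldr  ON Moments (UploaderID, TimeUploaded);   -- per-uploader queries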
If you're talking about scalability, you should also consider tuning the database engine,
e.g. allowing for a larger key space by using BIGINT keys.
A database cluster setup also requires in-depth analysis, because auto-increment keys are problematic in a multi-master setup.
If you want to squeeze more performance out of the system, consider designing the whole database to be partition-friendly from the very start. That will require serious analysis of your business rules. A partition-friendly design means choosing a suitable set of columns as the partitioning key and splitting the data physically (remember to set innodb_file_per_table = 1 in the MySQL config, otherwise the benefit of table partitioning is lost).
If not done right, however, partitioning will not benefit you at all.
https://dev.mysql.com/doc/refman/5.5/en/partitioning.html
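As a rough illustration only (whether it helps depends entirely on your access patterns), hash-partitioning the seen-table by user could look like this, provided UserID is part of every unique key on the table:
ALTER TABLE MomentSeen
    PARTITION BY HASH (UserID)
    PARTITIONS 16;            -- the number of partitions here is arbitrary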


Seeking a performant solution for accessing unique MySQL entries

I know very little about MySQL (or web development in general). I'm a Unity game dev and I've got a situation where users (of a region the size of which I haven't decided yet, possibly globally) can submit entries to an online database. The users must be able to then locate their entry at any time.
For this reason, I've generated a GUID from .NET (System.Guid.NewGuid()) and am storing that in the database entry. This works for me! However... I'm no expert, but my gut tells me that looking up a complex string in what could be a gargantuan table might have terrible performance.
That said, it doesn't seem like anything other than a globally unique identifier will solve my problem. Is there a more elegant solution that I'm not seeing, or a way to mitigate against any issues this design pattern might create?
Thanks!
Make sure you define the GUID column as the primary key in the MySQL table. That will cause MySQL to create an index on it, which will enable MySQL to quickly find a row given the GUID. The table might be gargantuan but (assuming a regular B-tree index) the time required for a lookup will increase logarithmically relative to the size of the table. In other words, if it requires 2 reads to find a row in a 1,000-row table, finding a row in a 1,000,000-row table will only require 2 more reads, not 1,000 times as many.
As long as you have defined the primary key, the performance should be good. This is what the database is designed to do.
Obviously there are limits to everything. If you have a billion users and they're submitting thousands of these entries every second, then maybe a regular indexed MySQL table won't be sufficient. But I wouldn't go looking for some exotic solution before you even have a problem.
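To make that concrete, a minimal sketch (the table and column names are made up; storing the GUID as CHAR(36) is the simplest option, BINARY(16) is a common space optimization):
CREATE TABLE entries (
    guid    CHAR(36) NOT NULL,   -- e.g. '3f2504e0-4f89-11d3-9a0c-0305e82c3301'
    payload TEXT,
    PRIMARY KEY (guid)           -- the PK is the B-tree index used for lookups
) ENGINE=InnoDB;

SELECT * FROM entries WHERE guid = '3f2504e0-4f89-11d3-9a0c-0305e82c3301';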
If you have a key of the row you want, and you have an index on that key, then this query will take less than a second, even if the table has a billion rows:
SELECT ... FROM t WHERE id = 1234.
The index in question might be the PRIMARY KEY, or it could be a secondary key.
GUIDs/UUIDs should be used only if you need to manufacture unique ids in multiple clients without asking the database for an id. If you do use such, be aware that GUIDs perform poorly if the table is bigger than RAM.

Lots of small mysql table or one big table

I have a forum where I have properties like ->
follow, voteup, votedown, report, favorite, view, etc. for each thread, answer, and comment.
Which approach will be faster and better performance-wise?
I am expecting billions of favorites, views, etc. ... just like YouTube.
Approach One
Make one big table counter
counter_id | user_id | object_id | object_type | property
where object_type is thread, comment, or answer, with the respective id from the threads, comments, or answers tables,
and property is follow, voteup, votedown, report, etc.
Approach Two
Make individual tables for follow, views, report, etc.
views
view_id | user_id | object_id | object_type
follows
follow_id | user_id | object_id | object_type
There is no single answer to this; it's quite subjective.
Most commonly it's best to consider the use cases for your design. Think carefully about what these fields will be used for before you add them to any table. And don't think that you have to add a numeric primary key ("ID") to every table. A table for tracking follows is just fine with only the fields user id | object id | object type and all three fields contained in the primary key.
It's unlikely your code will ever face performance constraints like YouTube's or even Stack Overflow's. If it does, you will most likely have remodelled the database by then.
However for the sake of the exercise consider where and how data is to be used...
I would have separate tables as follows
Follow
User feeds probably need their own table, as most commonly they get hit from anywhere (a bit like a global inbox). The follow should also have some flag or timestamp to mark changes, so that it's very easy to evaluate what has changed since the last time the user was online.
This is because a user needs to see what they've followed as some sort of feed, and others need to see how many people have followed. But others don't need to see who else has followed.
Vote up, Vote down
That's just a vote and a +/- flag. Do denormalize this... That is, store BOTH a user's individual votes in a table and a count of votes against the object in a field on the object's table. That way you only ever check a single user's vote (their own) for a page view. The counts are retrieved from the same row containing the content.
Again: a user needs to see what they've up/down voted. You need to check they're not voting twice. What matters is the final count. So checking an object with a million up votes should not have to hit a million rows -- just one.
Pro tip: some database engines perform badly if you constantly update rows with large content. So consider a "meta-data" table for all objects, which stores counts such as this. That leaves the metadata free to update frequently even if the content doesn't.
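A minimal sketch of that pattern (all names here are illustrative, not taken from the question):
CREATE TABLE object_meta (           -- one narrow row of counters per object
    object_type TINYINT  NOT NULL,
    object_id   INT      NOT NULL,
    up_votes    INT      NOT NULL DEFAULT 0,
    down_votes  INT      NOT NULL DEFAULT 0,
    PRIMARY KEY (object_type, object_id)
);

-- assumes a votes(user_id, object_type, object_id, vote) table as described above
INSERT INTO votes (user_id, object_type, object_id, vote) VALUES (20, 1, 555, +1);
UPDATE object_meta SET up_votes = up_votes + 1
    WHERE object_type = 1 AND object_id = 555;   -- bump the narrow counter row, not the content row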
Favorite
Own table again. user id | object id | object type. If you want to display number of favourites to the public then keep a count of this against the object, don't do a select count(*) every page view.
View
Why even store this? Keep a count against the object. If you're going to store a history then make sure you put a timestamp against it and purge it regularly. You don't need to store what a user was looking at six months ago.
As a general observation all of these are separate tables with the exception of up and down votes.
You should denormalize the counts to reduce the quantity of data your server needs to access to determine a page view. Most commonly a page view should be the fastest thing. Any form of update can be a little slower.
Where I mention for favourites and others that they don't need an additional primary key field, what I mean is that they do have a primary key, just not an additional surrogate field. For example, favourites could be:
CREATE TABLE favourites (
user INT,
object_type INT,
object_id INT,
PRIMARY KEY (user, object_type, object_id)
)
There's simply no reason to have a favorite_id field.
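Using it would look roughly like this (the favourite_count column on the content table is an assumption, following the denormalized-count advice above):
INSERT IGNORE INTO favourites (user, object_type, object_id) VALUES (20, 1, 555);
-- in application code, only run the count update if the INSERT actually added a row
UPDATE threads SET favourite_count = favourite_count + 1 WHERE thread_id = 555;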
Answer, Part 1: Plan on redesigning as you go.
The best advice I can give you is to plan for change. What you design for the first million will not work for 30 million. The 30-million design will not survive to a billion. Whatever you do after reading this thread may last you through 30K rows.
Why is this? Well, partially because you will not be able to do it in a single machine. Don't shard your database now, but keep in the back of your mind that you will need to shard it. At that point, much of what worked on a single machine will either not work on multiple machines, or will be too slow to work. So you will have to redesign.
Let me point out another aspect of 1 billion rows. Think how fast you have to do INSERTs to grow a table to 1B rows in 1 year. It's over 30 per second. That's not bad, until you factor in the spikes you will get.
And what will happen when your second billion won't fit on the disk you have laid out?
Anyone who grows to a billion rows has to learn as he goes. The textbooks don't go there; the manuals don't go there; only the salesmen go there, but they don't stick around after the check clears. Look at YouTube (etc) -- almost nothing is "off the shelf".
And think of how many smart designers you will need to hire to get to 1 billion.
It is painful to add a column to a billion-row table, so (1) plan ahead, and (2) design a way to make changes without major outages.
Answer, Part 2: Some tips
Here are some of my comments on the ideas bounced around, and some tips from someone who has dealt with a billion-row, sharded system (not YouTube, but something similar).
Normalize vs denormalize: My motto: "Normalize, but don't overnormalize." You'll see what I mean after you have done some of it.
One table vs many: Two tables with essentially identical CREATE TABLEs should usually be a single table. (Sharding, of course, violates that.) OTOH, if you need thousands of UPDATE...view_count = view_count + 1 per second, it won't survive to a billion. However, it might survive to a million; then plan for change.
Minimize the size of datatypes -- Using a MEDIUMINT instead of an INT for one column saves a gigabyte.
Do not paginate using OFFSET and LIMIT. (I have a blog on a workaround; a keyset-style sketch follows these tips.)
Batch INSERTs where possible.
Use InnoDB; you don't want to wait hours for a REPAIR to finish on a MyISAM table.
The simple task of getting a unique ID for the 'next' item can be a huge problem in a sharded system. Wait until you are closer to needing sharding before redesigning that part. Do not use UUIDs for a billion-row table; they will perform poorly. So don't even think about UUIDs now; you will have to throw them away.
Long before you hit 1 billion, you will have nightmares about the one machine crashing. Think about replication, HA, etc, early. It is painful to set up such after you have big tables.
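A keyset-style sketch of the pagination tip above (table and column names are made up): instead of skipping N rows with OFFSET, remember the last id the client saw and seek past it through the index.
SELECT id, title
    FROM posts
    WHERE id > 100000      -- last id from the previous page
    ORDER BY id
    LIMIT 20;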

MySQL Performance: Single table or multiple tables for large datasets

I am building an app to support 200,000+ registered users, and want to add an addressbook functionality for each user to import their own contacts (e.g. name, address, email, etc). Each user will have c.150 different contacts, with 10-15 fields for each record.
My question is simple: given the volume of users and the number of contacts for each user, is it better to create individual tables for each user's addressbook, or one single table with a user_id lookup for that associated user account?
If you could explain why from a performance perspective, that would be much appreciated.
UPDATE: Specifications
In response to questions in comments, here are the specifications: I will be hosting the database on AWS RDS (http://aws.amazon.com/rds). It will primarily be a heavy read load, rather than write. When write is accessed, it will be a balance between INSERT and UPDATE, with few deletes. Imagine the number of times you view vs edit your own addressbook.
Thanks
Specific answer in response to specifications
One table for contacts' data, with an indexed foreign key column back to the user. Finding a particular user's contacts will require about 3 seeks, a relatively small number. Use an SSD if seeks are bottlenecking you.
If your 15 columns have 100 bytes each, and you have 150 of those rows per user, then your maximum data transfer per user is of the order of 256k. I would design the application to show only the contact data required up front (say the top 3 most useful contact points -- name, email, phone), then pull more specifics when requested for particular contacts. In the (presumably) rare cases when you need all contacts' info (e.g. export to CSV), consider SELECT INTO OUTFILE if you have that access. vCard output would be less performant: you'd need to get all the data, then stuff it into the right format. If you need vCard often, consider writing the vCard out when the database is updated (a caching approach).
If performance requirements are still not met, consider partitioning on the user id.
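A minimal sketch of that layout (field names and sizes are placeholders):
CREATE TABLE contacts (
    contact_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED NOT NULL,     -- FK back to the owning user
    name       VARCHAR(255),
    email      VARCHAR(255),
    phone      VARCHAR(50),
    -- ... remaining contact fields ...
    INDEX idx_user (user_id)              -- makes "fetch this user's addressbook" an index range scan
) ENGINE=InnoDB;

SELECT name, email, phone FROM contacts WHERE user_id = 42;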
General answer
Design your schema around KISS and your performance requirements, while documenting the scalability plan.
In this particular situation, the volume of data does not strike me as extreme, so I would lean (KISS) toward one table. However, it's not clear to me what kind of queries you will be making -- JOIN is the usual performance hog, not a straight SELECT. Also not clear to me is your SELECT/UPDATE mix. If it's read-heavy and by user, a single table will do it.
Anyway, if after implementation you find the performance requirements aren't met, I would suggest you consider scaling by faster hardware, different engine (eg MyISAM vs. InnoDB -- know what the differences are for your particular MySQL version!), materialized views, or partitioning (eg around the first letter of the corresponding username -- presuming you have one).
Have a single table, but partition it by the first letter of the user's last name: all last names starting with A go into one partition, all names starting with B into another, and so on.
You could also do some amount of profiling to find the right distribution key.
I'm not a DBA, but I suggest you properly normalize the database, add indexes, etc., and not bugger it up to meet a possibly nonexistent performance issue. If possible, have a DBA review your schema. I don't think 200,000 users is excessive. All 200,000 users are not likely to hit the update button in the same x milliseconds it takes to process one person's input. Only a few will be logged in at any time, and most of them will be filling out data or staring at existing data on the web page rather than hitting that update button. If by chance a bunch of them do hit it at the same time, there will probably be a performance wait rather than a crash. Here is a rough layout for your schema (mileage may vary):
User
long userID primary key
String firstName
String lastName
Contact
long contactID primary key
long userID foreign key
String firstName
String lastName
Address
long addressID primary key
long contactID foreign key

Performance suggestions for a MySQL table definition

I am concerned about the performance of a database table I have for storing data related to a customer survey application.
I have a database table storing customer responses from a survey. Since the survey questions change according to the customer, I thought that instead of defining the table schema with one column per question id, I would define it as follows:
customerdata(customerid varchar,
             partkey varchar,
             questionkey varchar,
             value varchar,
             version int,
             lastupdate timestamp)
Where:
partkey: the shortcode of the part (part1, part2, ...)
questionkey: the shortcode of the question,
e.g. age, gender, etc.
Since some customers fill in the survey twice, three times, etc., I have added the version column.
With this design, customerid, partkey, questionkey, and version together form the primary key.
I am concerned about the performance of such a design. Should I define the other key columns as indexes? Would that help? So far, for 30 customers, I have 7000 records. I expect a maximum of 300-500 customers. What do you think?
Sounds like a pretty small database. I doubt you'll have performance issues but if you detect any when querying on partkey, questionkey, or version later on you can always add one or more indexes to solve the problem at that time. There's no need to solve a performance problem you don't have and probably never will have.
Performance issues will arise only if you have to perform time-sensitive queries that don't use the customerid field as the primary filter. I suspect you'll have some queries like that (when you want to aggregate data across customers) but I doubt they'll be time-sensitive enough to be impacted by the one second or less response time I would expect to see from such a small collection of data. If they are, add the index(es) then.
Also, note that a table has only a single PRIMARY KEY. That key can use more than one column, so you can say that columns customerid, partkey, questionkey, and version are part of the PRIMARY KEY, but you can't say they're all "primary keys".
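To make the composite-key point concrete, the definition might look something like this (the column types are assumptions based on the description above):
CREATE TABLE customerdata (
    customerid  VARCHAR(64)  NOT NULL,
    partkey     VARCHAR(32)  NOT NULL,
    questionkey VARCHAR(32)  NOT NULL,
    value       VARCHAR(255),
    version     INT          NOT NULL,
    lastupdate  TIMESTAMP,
    PRIMARY KEY (customerid, partkey, questionkey, version)   -- one PRIMARY KEY spanning four columns
);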
Row-count-wise, I have seen MySQL databases with over 100,000 rows run just fine, so you should be okay.
It's a different story if you run complicated queries, though; that depends more on database design than on row count.

When is it a good idea to move columns off a main table into an auxiliary table?

Say I have a table like this:
create table users (
user_id int not null auto_increment,
username varchar,
joined_at datetime,
bio text,
favorite_color varchar,
favorite_band varchar
....
);
Say that over time, more and more columns -- like favorite_animal, favorite_city, etc. -- get added to this table.
Eventually, there are like 20 or more columns.
At this point, I'm feeling like I want to move columns to a separate user_profiles table so I can do select * from users without returning a large number of usually irrelevant columns (like favorite_color). And when I do need to query by favorite_color, I can just do something like this:
select * from users inner join user_profiles using (user_id) where user_profiles.favorite_color = 'red';
Is moving columns off the main table into an "auxiliary" table a good idea?
Or is it better to keep all the columns in the users table, and always be explicit about the columns I want to return? E.g.
select user_id, username, last_logged_in_at, etc. etc. from users;
What performance considerations are involved here?
Don't use an auxiliary table if it's going to contain a collection of miscellaneous fields with no conceptual cohesion.
Do use a separate table if you can come up with a good conceptual grouping of a number of fields e.g. an Address table.
Of course, your application has its own performance and normalisation needs, and you should only apply this advice with proper respect to your own situation.
I would say that the best option is to have properly normalized tables, and also to only ask for the columns you need.
A user profile table might not be a bad idea, if it is structured well to provide data integrity and simple enhancement/modification later. Only you can truly know your requirements.
One thing that no one else has mentioned is that it is often a good idea to have an auxiliary table if the row size of the main table would get too large. Read about the row size limits of your specific database in the documentation. There are often performance benefits to having less-wide tables and moving the fields you don't use as often off to a separate table. If you choose to create an auxiliary table with a one-to-one relationship, make sure to set up the PK/FK relationship to maintain data integrity, and set a unique index or constraint on the FK field to maintain the one-to-one relationship.
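A minimal sketch of that one-to-one setup (the column names just mirror the question's example; the constraints are the point):
CREATE TABLE user_profiles (
    user_id        INT NOT NULL PRIMARY KEY,   -- the PK doubles as the unique constraint for 1:1
    bio            TEXT,
    favorite_color VARCHAR(50),
    favorite_band  VARCHAR(100),
    CONSTRAINT fk_user_profiles_user
        FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;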
And to go along with everyone else, I cannot stress strongly enough how bad it is to ever use select * in production queries. You save a few seconds of development time, but you create a performance problem and make the application less maintainable (yes, less -- you should not willy-nilly return things you need in the database but don't want to show in the application; with select * you will break insert statements that use selects, and show users things you don't want them to see).
Try not to get in the habit of using SELECT * FROM ... If your application becomes large, and you query the users table for different things in different parts of your application, then when you do add favorite_animal you are more likely to break some spot that uses SELECT *. Or at the least, that place is now getting unused fields that slow it down.
Select the data you need specifically. It self-documents to the next person exactly what you're trying to do with that code.
Don't de-normalize unless you have good reason to.
Adding a favorite column every other day, every time a user has a new favorite, is a maintenance headache at best. I would seriously consider creating a table to hold favorite values in your case. I'm pretty sure I wouldn't just keep adding a new column all the time.
The general guideline that applies to this (called normalization) is that tables are grouped by distinct entities/objects/concepts, and that each column (field) in a table should describe some aspect of that entity.
In your example, it seems that favorite_color describes (or belongs to) the user. Sometimes it is a good idea to move data to a second table: when it becomes clear that the data actually describes a second entity. For example: you start your database collecting user_id, name, email, and zip_code. Then at some point, the CEO decides he would also like to collect the street_address. At this point a new entity has been formed, and you could conceptually view your data as two tables:
user: userid, name, email
address: streetaddress, city, state, zip, userid (as a foreign key)
So, to sum it up: the real challenge is to decide what data describes the main entity of the table, and what, if any, other entity exists.
Here is a great example of normalization that helped me understand it better
When there is no other reason (e.g. the normal forms for databases), you should not do it. You don't save any space, as the data must still be stored; instead you waste more, since you need another index to access it.
It is always better (though it may require more maintenance if schemas change) to fetch only the columns you need.
This will result in lower memory usage by both MySQL and your client application, and reduced query times as the amount of data transferred is reduced. You'll see a benefit whether this is over a network or not.
Here's a rule of thumb: if adding a column to an existing table would require making it nullable (after data has been migrated etc) then instead create a new table with all NOT NULL columns (with a foreign key reference to the original table, of course).
You should not rely on using SELECT * for a variety of reasons (google it).