MySQL: Duplicated Data vs. More Queries

Please take into consideration this is a MySQL Question for Web Development.
Currently I'm designing the database structure for a user authentication system, and I came across a question that I can't figure out myself:
Is it better to have duplicated data instead of making more queries?
Here's a little background, currently my users table looks something like this (pseudo-code):
id mediumint
username varchar(15)
password varchar(100)
email varchar(80)
status tinyint(1) <- is the user banned?
language varchar(100)
private_message_counter mediumint
notify_email tinyint(1)
Extra columns
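Spelled out as actual DDL, that table would look roughly like this (a sketch only; the key and default choices are assumptions, not part of the original design):

CREATE TABLE users (
    id                      MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    username                VARCHAR(15)  NOT NULL,
    password                VARCHAR(100) NOT NULL,
    email                   VARCHAR(80)  NOT NULL,
    status                  TINYINT(1)   NOT NULL DEFAULT 0,  -- is the user banned?
    language                VARCHAR(100) NOT NULL,
    private_message_counter MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
    notify_email            TINYINT(1)   NOT NULL DEFAULT 1,
    PRIMARY KEY (id),
    UNIQUE KEY uq_users_username (username),   -- assumed; not stated in the question
    UNIQUE KEY uq_users_email (email)
);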
I'm trying to put all the "most used" columns into the users table, to avoid extra queries. For example:
With indicators on the users table:
- User logged on? (query sessions)
- Get user data (query users)
- Get user permissions (query permissions)
Without indicators:
- User logged on? (query sessions)
- Is the user banned? (query bans)
- Get user data (query users)
- Get user permissions (query permissions)
- Get private message information (query private_messages)
One little "problem" is that the users table ends with a lot of rows. It's obvious also that I'll need to run more checks to prevent data mismatch, but isn't the improvement way better?
Note: my website has around 14,500 simultaneous users connected, so I need to know whether this will improve things or do the complete opposite.
Any opinions or recommendations are welcomed.

Very, very rarely is data duplication the right answer. We talk about normalization so often for a reason.
Typically you only duplicate data in an offline data-warehouse situation where you're dealing with tens of millions of rows and the processing time for aggregation is too long. In an online system, the risk of data falling out of sync is almost always too great for any perceived gains from duplicating data. A few extra queries will not kill you.

Is there an actual performance issue that a clever UNION statement doesn't get around?
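For example, several of the per-request lookups could be folded into a single round trip (with joins rather than a UNION, but the same idea). A rough sketch, assuming hypothetical table and column names:

SELECT u.id, u.username, u.language,
       (b.user_id IS NOT NULL) AS is_banned,
       p.permission_mask,
       pm.unread_count
FROM sessions s
JOIN users u            ON u.id = s.user_id
LEFT JOIN bans b        ON b.user_id = u.id
LEFT JOIN permissions p ON p.user_id = u.id
LEFT JOIN (
    SELECT to_user_id, COUNT(*) AS unread_count
    FROM private_messages
    WHERE is_read = 0
    GROUP BY to_user_id
) pm ON pm.to_user_id = u.id
WHERE s.session_id = 'abc123';

Whether one combined statement beats several indexed point lookups is something only measurement on your own data will tell; the point is simply that keeping the data normalized does not have to mean more round trips.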
I wouldn't overload tables to gain short-term performance. The bigger your tables get, the more uptime you'll sacrifice (it happened to me). You might need multiple layers of caching in your application. (Some combination of memcached for banned state and materialized views for sessions + permissions, maybe?)
I suggest running tests to see what your results look like after scaling the number of rows in your tables by an order of magnitude, with 100,000 simultaneous users. Your architecture might benefit from partitioning tables between current/frequent users and less frequent users. Or follow the notion of having servers that deal with sessions, and servers that deal with canonical data.
In my project, only about 0.01% of my users are logged in at one time. If you have 1% of your users logged in, you're scaling into the million-row range. I would suggest considering how to maintain your uptime requirements and some basic performance requirements. Table repairs, optimizations, backups: these operations won't be cheap and are tricky in a multi-master architecture. (Thought about table partitioning?)
Update (and repair) operations are less expensive when performed on smaller tables. Not only are they less likely to evict large groups of cached queries from the MySQL query cache, they also maintain better key cache performance. If your users table is updated frequently, you should separate the frequently updated columns from the rarely updated ones (see the sketch below). Your key cache hit rate will improve, and so will your query cache hit rate.
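A sketch of that split; which columns count as "hot" is an assumption about the workload, not something from the question:

-- Frequently updated flags and counters moved into their own 1:1 table
CREATE TABLE user_counters (
    user_id                 MEDIUMINT UNSIGNED NOT NULL,
    status                  TINYINT(1) NOT NULL DEFAULT 0,           -- banned flag
    private_message_counter MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id)
);
-- username, password, email, language, notify_email stay in the (now rarely-updated) users table.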
If you're actually planning on growing this application, you will have to deal with more queries every day, no matter what. If your application suffers when the query rate merely doubles, something is wrong. In my experience, duplicating data into the users table (primary to your data model) is going to make it harder to enforce the use of the other data tables -- and that might be very hard to get away from.

Related

How big does DB need to become for PHP/mySQL to be noticeably slow?

I'm wondering if there are any developers out there who have experience doing this:
How many rows can you have (in a table with one column, the primary key) before queries become slow? Is it something I don't need to worry about? I probably won't ever have more than 5000 or so. I don't have enough experience to know.
This answer is based more on your title, "how big does DB need to become for PHP/mySQL to be noticeably slow".
PHP...
If you are fetching one row, then all the time is in MySQL. If you are fetching an entire table, then a lot of the cost is in PHP. 5000 rows is not bad. A million rows would be slow and probably run out of RAM.
MySQL...
SELECT * FROM tbl WHERE primary_key = 'constant';
will be 'instantaneous' even if there are a billion rows in the table.
SELECT * FROM tbl WHERE non_indexed_column = 'constant';
will begin to be slow at 5000 rows, will take "too long" at a million rows, and take hours or days if you have a billion rows.
SELECT * FROM tbl WHERE indexed_uuid = 'constant';
Although this is 'instantaneous', it becomes I/O bound as the table grows. UUIDs are very random, hence caching is an issue. When the index is small enough to fit in RAM, this performs well. As the index gets bigger, the statement gets slower. More on the evils of UUIDs.
In other words, you deleted important information -- namely that UUIDs are involved.
Why have a 1-column table for checking the UUID? Why not have an index on some other table and use that for the check? Performance will be virtually the same (good if cacheable, I/O-bound if not).
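A sketch of that idea, using a hypothetical devices table that already owns the UUID column (names and the example value are placeholders):

-- Index the UUID column on the table that already owns it
ALTER TABLE devices ADD INDEX idx_devices_uuid (uuid);

-- The existence check is then a single indexed lookup
SELECT 1 FROM devices WHERE uuid = '0f8fad5b-d9cb-469f-a165-70867728950e' LIMIT 1;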
If it's indexed (and it should be; as the only field in the table, it should be the primary key), you have a massive amount of breathing room as far as database capacity goes. It's based on your server and software configuration, technically, but even MySQL is not going to break a sweat on a single table with a single UUID per row and 5,000 rows.
On the PHP end, it also depends on your server resources and the capacity of your connection compared to the bandwidth your requests are occupying... but if you're working on a table that's unlikely to store more than 5,000ish records, it seems unlikely the associated PHP is going to have capacity problems (provided your clients are making a sane number of requests).
A couple notes, though:
You seem to be implying the client (web client? Mobile client? Gas station client?) will be generating the UUID, which is then validated on the server... this isn't really authentication, or any assurance of authenticity. It's not clear what your use case is, but you may want to step back, define your goals for this token, and look around for people who have solved the same problem. If you're generating something like an API key or authentication token, a simple shared UUID likely is not the best way to go about it. In any case, this should happen over SSL, otherwise you're trivially handing out your credentials to anyone who's listening to your requests.
Less importantly, depending on whether or not your token needs to be persistent across sessions, which is also something I'm not clear on from your description, you may not need database persistence for it. An in-memory store like redis might be something worth looking into, though again, I don't get the impression performance is going to be a bottleneck based on your estimate.
Given what you've said, it's impossible to estimate capacity, but it strikes me as unlikely that you're going to run into any kind of bottleneck from your technology stack.

500000 user DB is being rather slow

I have a database with the following structure:
username,email,ip,hash,salt
Currently we have around 600,000 users in this database.
Users are complaining that querying this database is rather slow.
In our tests, we found that it takes around 1.15 seconds to retrieve a user record.
This test is based on the following query:
SELECT * FROM users WHERE email = 'test@mail.com'
I'm no expert in database management. I know how to get by using it like a dictionary, but I have no idea about database optimization.
I was hoping I could get some help. Ideally, we'd be able to query the DB like this in under a second on even 10 million users.
Does anyone have any suggestions for optimizing simple queries like this? I'm open to anything right now, even restructuring the database if there's a more logical way to do it, because right now the rows are just stored in the order the users registered.
MySQL has two important facilities for improving performance. For your type of query, 500,000 rows or 10,000,000 rows is just not a big deal. Although other technologies such as NoSQL can perform the same operations, applications such as yours typically rely on the ACID properties of databases. A relational database is probably the right solution.
The first facility -- as mentioned elsewhere -- is indexes. In your case:
create index idx_users_email on users(email);
An index will incur a very small amount of overhead for insert and delete operations. However, with the index, looking up a row should go down to well under 0.1 seconds -- even with concurrent queries.
Depending on the other queries you are running, other indexes may be appropriate.
The second important capability is partitioning the tables. This is not necessary for a users table. However, it can be quite useful for transactions and other types of data.
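For illustration, a sketch of range-partitioning a hypothetical transactions table by year (table, columns, and ranges are all assumptions, not from the question):

CREATE TABLE transactions (
    id         BIGINT UNSIGNED NOT NULL,
    user_id    INT UNSIGNED    NOT NULL,
    amount     DECIMAL(10,2)   NOT NULL,
    created_at DATETIME        NOT NULL,
    PRIMARY KEY (id, created_at)   -- MySQL requires the partition column in every unique key
)
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2013 VALUES LESS THAN (2014),
    PARTITION p2014 VALUES LESS THAN (2015),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);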
You could add an index, as already mentioned in the comments, but one thought presents itself: you are currently retrieving ALL the info for that row. It would be more efficient to target the query so it only retrieves the information you actually need, such as:
SELECT username FROM users WHERE email = 'test@mail.com';
Also, you should investigate PDO and bound parameters for security.

What Are Good Solutions for a Database Table that Gets Too Long?

I will describe a problem using a specific scenario:
Imagine that you create a website to which users can register,
and after they register, they can send Private Messages to each other.
This website enables every user to maintain his own Friends list,
and also maintain a Blocked Users list, from which he prefers not to get messages.
Now the problem:
Imagine this website getting to several million users,
and let's also assume that every user has about 10 Friends in the Friends table, and 10 Blocked Users in the Blocked Users table.
The Friends list Table, and the Blocked Users table, will become very long,
but worse than that, every time someone wants to send a message to another person "X",
we need to go over the whole Blocked Users table and look for the records that user "X" defined - the people he blocked.
This "scanning" of a long database table, each time a message is sent from one user to another, seems quite inefficient to me.
So I have 2 questions about it:
What are possible solutions for this problem?
I am not afraid of long database tables,
but I am afraid of database tables that contain data for so many users,
which means that the whole table needs to be scanned every time, just to pull out a few records from it for that specific user.
A specific solution that I have in my mind, and that I would like to ask about:
One solution that I have in mind for this problem is that every user who registers to the website will have his own "mini-database" dynamically (and programmatically) created for him;
that way the Friends table and the Blocked Users table will contain only records for him.
This makes scanning those tables very easy, because all the records are his.
Does this idea exist in Databases like MS-SQL Server, or MySQL? And If yes, is it a good solution for the described problem?
(each user will have his own small database created for him, and of course there is also the main (common) database for all other data that is not user specific)
Thank you all
I would wait on the partitioning and on creating mini-database idea. Is your database installed with the data, log and temp files on different RAID drives? Do you have clustered indexes on the tables and indexes on the search and join columns?
Have you tried reading the query plans to see how and where the slowdowns are occurring? Don't just add memory or try advanced features blindly before doing the basics.
Creating separate databases will become a maintenance nightmare and it will be challenging to do the type of queries (for all users....) that you will probably like to do in the future.
Partitioning is a wonderful feature of SQL Server and while in 2014 you can have thousands of partitions you probably (unless you put each partition on a separate drive) won't see the big performance bump you are looking for.
SQL Server has very fast response times for tables, especially tables with tens of millions of rows (in your case, the user table). Don't let the main table get too wide and the response time will stay extremely fast.
Right off the bat my first thought is this:
https://msdn.microsoft.com/en-us/library/ms188730.aspx
Partitioning can allow you to break it up into more manageable pieces and in a way that can be scalable. There will be some choices you have to make about how you break it up, but I believe this is the right path for you.
In regards to table scanning if you have proper indexing you should be getting seeks in your queries. You will want to look at execution plans to know for sure on this though.
As for having mini-DB for each user that is sort of what you can accomplish with partitioning.
Mini-Database for each user is a definite no-go zone.
On a side note: a separate table holding just two columns, UserID and BlockedUserID, both INT and with the correct indexes, is an approach you cannot go wrong with, if you write your queries sensibly :) (see the sketch below)
Look into table partitioning; a well-normalized database with decent indexes will also help.
Also, if you can afford an Enterprise licence, table partitioning combined with the table schema described above will make for a very good, query-friendly database schema.
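A minimal sketch of that block-list table and the check at message-send time (MySQL-flavored; the names and example ids are assumptions):

CREATE TABLE blocked_users (
    user_id         INT NOT NULL,   -- the user who did the blocking
    blocked_user_id INT NOT NULL,   -- the user being blocked
    PRIMARY KEY (user_id, blocked_user_id)
);

-- "Has recipient 7 blocked sender 42?" -- a point lookup on the composite index, not a table scan.
SELECT COUNT(*) AS is_blocked
FROM blocked_users
WHERE user_id = 7
  AND blocked_user_id = 42;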
I did this once for a social network system. Maybe you can take another look at your normalization. At the time I had a [Relationship] table and it just had:
UserAId Int
UserBId Int
RelationshipFlag Smallint
With 1 million users, each with about 10 "friends", that table had 10 million rows. Not a problem, since we put indexes on the columns and it could retrieve the list of all users "related" to a specific UserA in no time.
Take a good look at your schema and your indexes; if they are OK, your DB will have no problem handling it.
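A sketch of that table in MySQL-flavored DDL, with the indexes mentioned above (index names and the example id are assumptions):

CREATE TABLE Relationship (
    UserAId          INT      NOT NULL,
    UserBId          INT      NOT NULL,
    RelationshipFlag SMALLINT NOT NULL,   -- e.g. friend vs. blocked (meaning assumed)
    PRIMARY KEY (UserAId, UserBId),
    INDEX idx_relationship_userb (UserBId)   -- for lookups in the other direction
);

-- All users related to a given UserA: an index range read, even at 10 million rows.
SELECT UserBId, RelationshipFlag
FROM Relationship
WHERE UserAId = 12345;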
Edit
I agree with @M.Ali:
Mini-Database for each user is a definite no-go zone.
IMHO you are fine if you stick with the basics and implement them the right way.

COUNT(*) WHERE vs. SELECT(*) WHERE performance

I am building a forum and I am trying to count all of the posts submitted by each user. Should I use COUNT(*) WHERE user_id = $user_id, or would it be faster if I kept a record of how many posts each user has each time he made a post and used a SELECT query to find it?
How much of a performance difference would this make? Would there be any difference between using InnoDB and MyISAM storage engines for this?
If you keep a record of how many posts a user has made, it will definitely be faster.
If you have an index on the user field of the posts table, you will also get decent query speeds. But it will start to hurt once your posts table is big enough. If you are planning to scale, then I would definitely recommend keeping a record of each user's post count in a dedicated field.
Storing precalculated values is a common, simple, but very efficient sort of optimization.
So just add a column holding the number of posts each user has made and maintain it with triggers or from your application.
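A minimal trigger-based sketch, assuming a users.post_count column and a posts table with a user_id column (the names are assumptions):

-- Keep the denormalized counter in step with inserts and deletes
CREATE TRIGGER posts_after_insert
AFTER INSERT ON posts
FOR EACH ROW
    UPDATE users SET post_count = post_count + 1 WHERE id = NEW.user_id;

CREATE TRIGGER posts_after_delete
AFTER DELETE ON posts
FOR EACH ROW
    UPDATE users SET post_count = post_count - 1 WHERE id = OLD.user_id;

-- Reading the count is then a primary-key lookup
SELECT post_count FROM users WHERE id = 42;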
The performance difference is:
With COUNT(*) you will always have an index lookup plus the counting of the results.
With the additional field you'll have an index lookup plus the return of a number that already holds the answer.
And there will be no significant difference between MyISAM and InnoDB in this case.
Store the post count. It seems that this is a scalability question, regardless of the storage engine. Would you recalculate the count each time the user submitted a post, or would you run a job to take care of this load somewhere outside of the webserver sphere? What is your post volume? What kind of load can your server(s) handle? I really don't think the storage engine will be the point of failure. I say store the value.
If you have the proper index on user_id, then COUNT(user_id) is trivial.
It's also the correct approach, semantically.
This is really one of those 'trade-off' questions.
Realistically, if your 'Posts' table has an index on the 'UserID' column and you truly only want to return the number of posts per user, then a query based on this column should perform perfectly well.
If you had another table, 'UserPosts' for example, yes, it would be quicker to query that table, but the real question is whether your 'Posts' table is really so large that you can't just query it for this count. The trade-off between the two approaches is obviously this:
1) With a separate audit table, there is overhead when adding or updating a post.
2) Without a separate audit table, there is overhead in querying the table directly.
My gut instinct is always to design a system that records the data in a sensibly normalised fashion. I NEVER create tables just because it might be quicker to GET some data for reporting purposes. I would only create them if the need arose and it was essential to incorporate them; then I would do it.
At the end of the day, I think that unless your 'posts' table is ridiculously large (i.e. more than a few million records), there should be no problem in querying it for a per-user count, presuming it is indexed correctly, i.e. with an index on the 'UserID' column.
If you're using this information purely for display purposes (i.e. user jonny has posted 73 times), then it's easy enough to get the info out of the DB once, cache it, and then update the cache when or if a change is detected.
Performance on post or performance on count? From a data-purist perspective, a recorded count is not the same as an actual count. You can watch the front door to an auditorium, add the people that come in and subtract those that leave, but what if some sneak in the back door? What if you bulk-delete a problem topic? If you record the count, then every post is slowed down by having to calculate and record it. For me data integrity is everything and I will COUNT(*) every time. I just did a test on a table with 31 million rows for a COUNT(*) on an indexed column where the value had 424,887 rows: 1.4 seconds (on my P4 2 GB development machine, as I intentionally under-power my development server so I get punished for slow queries; on the production 8-core 16 GB server that count is less than 0.1 second). You can never guard your data from unexpected changes or errors in your program logic. COUNT(*) is the count, and it is fast. If COUNT(*) is slow, you are going to have performance issues in other queries too.
There are a whole pile of trade-offs, so no one can give you the right answer. But here's an approach no one else has mentioned:
You could use the "SELECT ... WHERE" query, but cache the result in a higher layer (memcached, for example), so your code would look like:
key = 'article-count-%d' % user_id
count = memcache.get(key)
if count is None:
    # table and column names assumed; use a bound parameter rather than string concatenation
    count = database.execute('SELECT COUNT(*) FROM posts WHERE user_id = %s', (user_id,))
    memcache.set(key, count)
And when a user makes a new post, you would also need:
memcache.delete('article-count-%d' % user_id)
This will work best when the article count is used often but updated rarely. It combines the advantage of efficient caching with the advantage of a normalized database. But it is not a good solution if the article count is needed only rarely (in which case, is optimisation necessary at all?). Another unsuitable case is when someone's article count is needed often, but it is almost always a different person's.
A further advantage of an approach like this is that you don't need to add the caching now. You can use the simplest database design and, if it turns out to be important to cache this data, add the caching later (without needing to change your schema).
More generally: you don't need to cache in your database. You could also put a cache "around" your database. Something I have done with Java is to use caching at the iBATIS level, for example.

Forum Schema: should the "Topics" table contain topic_starter_Id? Or is it redundant information?

I'm creating a forum app in php and have a question regarding database design:
I can get all the posts for a specific topic. All the posts have an auto_increment identity column as well as a timestamp.
Assuming I want to know who the topic starter was, which is the best solution?
Get all the posts for the topic and order by timestamp. But what happens if someone immediately replies to the topic? Then I have the first two posts with the same timestamp (unlikely, but possible) and I can't know which one was first. This is also normalized, but becomes expensive after the table grows.
Get all the posts for the topic and order by post_id. This is an auto_increment column. Can I be guaranteed that the database will assign ids in insertion order? Will a post inserted later always have a higher id than previous rows? What if I delete a post -- would the database reuse that post_id later? I'm using MySQL.
The easiest way, of course, is to simply add a field to the Topics table with the topic_starter_id and be done with it. It is not normalized, but I believe it is also the most efficient method once the topic and post tables grow to millions of rows.
What is your opinion?
Zed's comment is pretty much spot on.
You generally want to achieve normalization, but denormalization can save potentially expensive queries.
In my experience writing forum software (five years commercially, five years as a hobby), this particular case calls for denormalization to save the single query. It's perfectly sane and acceptable to store both the first user's display name and id, as well as the last user's display name and id, just so long as the code that adds posts to topics always updates the record. You want one and only one code path here.
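For illustration, a denormalized topics table along those lines might look roughly like this (MySQL-flavored; column names and lengths are assumptions, not from the post):

CREATE TABLE topics (
    topic_id           INT UNSIGNED NOT NULL AUTO_INCREMENT,
    title              VARCHAR(255) NOT NULL,
    topic_starter_id   INT UNSIGNED NOT NULL,   -- denormalized: first poster
    topic_starter_name VARCHAR(50)  NOT NULL,
    last_poster_id     INT UNSIGNED NOT NULL,   -- denormalized: most recent poster
    last_poster_name   VARCHAR(50)  NOT NULL,
    last_post_at       DATETIME     NOT NULL,
    PRIMARY KEY (topic_id)
);

-- The single code path that adds a post updates the topic row in the same transaction:
UPDATE topics
SET last_poster_id = 42, last_poster_name = 'jonny', last_post_at = NOW()
WHERE topic_id = 123;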
I must somewhat disagree with Charles that the only way to save on performance is to denormalize to avoid an extra query.
To be more specific, there's an optimization that would work without denormalization (and attendant headaches of data maintenance/integrity), but ONLY if the user base is sufficiently small (let's say <1000 users, for the sake of argument - depends on your scale. Our apps use this approach with 10k+ mappings).
Namely, have your application layer (the code running on the web server) retrieve the list of users into a proper cache (one with data-expiration facilities). Then, when you need to print the first/last user's name, look it up in the cache on the server side.
This avoids an extra query for every page view (as you need to only retrieve the full user list ONCE per N page views, when cache expires or when user data is updated which should cause cache expiration).
It adds a wee bit of CPU time and memory usage on web server, but in Yet Another Holy War (e.g. spend more resources on DB side or app server side) I'm firmly on the "don't waste DB resources" camp, seeing how scaling up DB is vastly harder than scaling up a web or app server.
And yes, if that (or other equally tricky) optimization is not feasible, I agree with Charles and Zed that you have a trade-off between normalization (less headaches related to data integrity) and performance gain (one less table to join in some queries). Since I'm an agnostic in that particular Holy War, I just go with what gives better marginal benefits (e.g. how much performance loss vs. how much cost/risk from de-normalization)