In a news feed app, our customer requires us to store the ids of all articles read by a user.
We decided to create a single table for this, but from a performance point of view, which of the following is the better approach?
1. Have one row per user with two fields, user_id and article_ids. Each time a user reads an article, append the id to the article_ids text using UPDATE and CONCAT (we might end up with a huge amount of data in one column).
2. Have many rows with two columns, user_id and article_id. Each time a user reads an article, insert the article_id along with the user_id as a new record (we might end up with too many rows).
Or, if there is a better way, suggestions are very welcome.
With the second approach, you can keep track of other things your client might ask for going forward:
First open/read/visit time.
Total number of opens/reads/visits.
Last open/read/visit time.
With this approach, you can also add an index on article_id later if required.
Note: As @Arjan said in his answer, with proper indexing there is no such thing as too many rows.
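As a rough sketch of how the row-per-read layout gives you those extra figures: the snippet below uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name article_reads, the read_at column, and the sample data are all assumptions made for illustration. It logs one row per read and derives first read, last read, and read count with plain aggregates.

```python
import sqlite3


def build_read_log():
    """Create a hypothetical read log with one row per read event."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE article_reads (
            user_id    INTEGER NOT NULL,
            article_id INTEGER NOT NULL,
            read_at    TEXT    NOT NULL  -- timestamp stored as text for this sketch
        )
        """
    )
    # Index that serves lookups by user, and by user plus article.
    conn.execute(
        "CREATE INDEX idx_reads_user_article ON article_reads (user_id, article_id)"
    )
    conn.executemany(
        "INSERT INTO article_reads (user_id, article_id, read_at) VALUES (?, ?, ?)",
        [
            (1, 10, "2024-01-01 08:00:00"),
            (1, 10, "2024-01-03 09:30:00"),
            (1, 11, "2024-01-02 12:00:00"),
            (2, 10, "2024-01-05 18:45:00"),
        ],
    )
    return conn


def read_stats(conn, user_id):
    """Return (article_id, first_read, last_read, read_count) rows for one user."""
    return conn.execute(
        """
        SELECT article_id, MIN(read_at), MAX(read_at), COUNT(*)
        FROM article_reads
        WHERE user_id = ?
        GROUP BY article_id
        ORDER BY article_id
        """,
        (user_id,),
    ).fetchall()


conn = build_read_log()
stats = read_stats(conn, 1)
for row in stats:
    print(row)
```

None of this is available from a concatenated id string without parsing it in application code first.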
Many records, one for each user_id and article_id combination. That's much easier to update (just insert a row, no need to apply logic) and also allows you to get information about articles when you want to list which ones a user has read. You can use a join and retrieve the correct information from the database at once, instead of having to convert a string to ids and then go back to the database to get the additional data.
With proper indexes there's not really such a thing as too many rows.
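To make the join point concrete, here is a minimal sketch using Python's sqlite3 module in place of MySQL. The tables articles and user_articles, their columns, and the helper names are hypothetical. The composite primary key doubles as the index and keeps one row per user and article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE user_articles (
        user_id    INTEGER NOT NULL,
        article_id INTEGER NOT NULL REFERENCES articles (id),
        PRIMARY KEY (user_id, article_id)  -- one row per combination
    );
    INSERT INTO articles (id, title) VALUES
        (10, 'Indexing basics'),
        (11, 'Joins explained'),
        (12, 'Unread story');
    """
)


def mark_read(conn, user_id, article_id):
    """Record a read; a repeated read of the same article is ignored."""
    # INSERT OR IGNORE is SQLite syntax; MySQL spells this INSERT IGNORE.
    conn.execute(
        "INSERT OR IGNORE INTO user_articles (user_id, article_id) VALUES (?, ?)",
        (user_id, article_id),
    )


def titles_read_by(conn, user_id):
    """Fetch article titles in one query instead of splitting a string of ids."""
    rows = conn.execute(
        """
        SELECT a.title
        FROM user_articles AS ua
        JOIN articles AS a ON a.id = ua.article_id
        WHERE ua.user_id = ?
        ORDER BY a.id
        """,
        (user_id,),
    )
    return [title for (title,) in rows]


mark_read(conn, 1, 10)
mark_read(conn, 1, 11)
mark_read(conn, 1, 10)  # second read of article 10 adds no row
titles = titles_read_by(conn, 1)
print(titles)
```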
Try to split them as much as possible. Performance improves a lot, because each query only has to pick small pieces out of your database. If you go for the first option, you have to split the text on a delimiter to get the information you want. That is more challenging to program, and if a user has a bad internet connection, the application would be very slow.
Every time a user searches, I join 8 tables to get the maximum result: tables like tags, location, author, links, etc.
Is it better to create a new field, store all this information in that field, and just run a MATCH query?
The negative side is: duplicate data, makes updating an article more difficult.
Not really. In fact, this is commonly done for large databases, precisely to limit JOINs in queries.
Concerning the negative side you mentioned, it really is duplication, but the benefit it brings to queries is worth it (for large databases specifically). An extra column doesn't take enough space to be worried about.
As for the updates, it is as simple as creating a trigger that updates the column with the new values on each insert/update on the parent table.
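A small sketch of such a trigger setup, written against SQLite through Python's sqlite3 module (MySQL trigger syntax differs somewhat). The authors and articles tables, the denormalized author_name column, and the trigger names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE articles (
        id          INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        author_id   INTEGER NOT NULL REFERENCES authors (id),
        author_name TEXT  -- denormalized copy, maintained by the triggers below
    );

    -- Fill the copy when an article is inserted.
    CREATE TRIGGER articles_fill_author AFTER INSERT ON articles
    BEGIN
        UPDATE articles
        SET author_name = (SELECT name FROM authors WHERE id = NEW.author_id)
        WHERE id = NEW.id;
    END;

    -- Propagate a rename on the parent table to every copy.
    CREATE TRIGGER authors_sync_name AFTER UPDATE OF name ON authors
    BEGIN
        UPDATE articles SET author_name = NEW.name WHERE author_id = NEW.id;
    END;
    """
)

conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ann')")
conn.execute("INSERT INTO articles (id, title, author_id) VALUES (100, 'Hello', 1)")
before = conn.execute("SELECT author_name FROM articles WHERE id = 100").fetchone()[0]

conn.execute("UPDATE authors SET name = 'Ann Lee' WHERE id = 1")
after = conn.execute("SELECT author_name FROM articles WHERE id = 100").fetchone()[0]
print(before, "->", after)
```

Searches can then read author_name straight from articles without joining authors, at the cost of the extra write on every rename.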
I have 2 tables which I join very often. To simplify this, the join gives back a range of IDs that I use in another (complex) query as part of an IN.
So I do this join all the time to get back specific IDs.
To be clear, the query is not horribly slow. It takes around 2 mins. But since I call this query over a web page, the delay is noticeable.
As a concrete example, let's say that the tables I am joining are a Supplier table and a table that contains the warehouses each supplier equipped on specific dates. Essentially I get the IDs of suppliers that serviced specific warehouses on specific dates.
The query itself cannot be improved, since it is a simple join between 2 tables that are indexed, but the date range complicates things.
I had the following idea, which I am not sure makes sense.
Since the data I am querying (especially for previous dates) does not change, what if I created another table that has as its primary key the columns in my WHERE clause and as its value the list of IDs (comma separated)?
This way it is a simple SELECT of 1 row.
I.e. this way I "pre-store" the supplier ids I need.
I understand that this is not even first normal form, but does it make sense? Is there another approach?
It makes sense as a denormalized design to speed up that specific type of query you have.
Though if your date range changes, couldn't it result in a different set of id's?
The other approach would be to really treat the denormalized entries like entries in a key/value cache like memcached or redis. Store the real data in normalized tables, and periodically update the cached, denormalized form.
Re your comments:
Yes, generally storing a list of id's in a string is against relational database design. See my answer to Is storing a delimited list in a database column really that bad?
But on the other hand, denormalization is justified in certain cases, for example as an optimization for a query you run frequently.
Just be aware of the downsides of denormalization: risk of data integrity failure, poor performance for other queries, limiting the ability to update data easily, etc.
In the absence of knowing a lot more about your application it's impossible to say whether this is the right approach - but to collect and consider that volume of information goes way beyond the scope of a question here.
Essentially I get the IDs of suppliers that serviced specific warehouses at specific dates.
While it's far from clear why you actually need 2 tables here, or whether denormalizing the data would make the resulting query faster, one thing of note is that your data is unlikely to change after capture, so maintaining the current structure along with a materialized view would have minimal overhead. You first need to test the query performance by putting the sub-query results into a properly indexed table. If you get a significant performance benefit, then you need to think about how you maintain the new table: can you replace one of the existing tables with a view on the new table, or do you keep both original tables and populate the new table by batch or by triggers?
It's not hard to try it out and see what works, and you'll get a far better answer than anyone here can give you.
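One way to try this out is sketched below with Python's sqlite3 module standing in for MySQL. The schema is guessed from the question, so suppliers, services, supplier_lookup, the active flag, and the sample rows are all hypothetical. The join result is copied by batch into a lookup table whose primary key matches the WHERE columns, one row per id rather than a comma-separated list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE suppliers (
        id     INTEGER PRIMARY KEY,
        active INTEGER NOT NULL
    );
    CREATE TABLE services (
        supplier_id  INTEGER NOT NULL REFERENCES suppliers (id),
        warehouse_id INTEGER NOT NULL,
        service_date TEXT    NOT NULL
    );
    INSERT INTO suppliers (id, active) VALUES (1, 1), (2, 1), (3, 0);
    INSERT INTO services (supplier_id, warehouse_id, service_date) VALUES
        (1, 7, '2024-01-10'),
        (2, 7, '2024-01-15'),
        (3, 7, '2024-01-12'),
        (1, 8, '2024-01-11'),
        (2, 7, '2024-02-01');

    -- Precomputed join result; the key order matches the WHERE clause.
    CREATE TABLE supplier_lookup (
        warehouse_id INTEGER NOT NULL,
        service_date TEXT    NOT NULL,
        supplier_id  INTEGER NOT NULL,
        PRIMARY KEY (warehouse_id, service_date, supplier_id)
    );
    """
)


def refresh_lookup(conn):
    """Batch job: rebuild the lookup table from the normalized tables."""
    with conn:  # one transaction, so readers never see a half-filled table
        conn.execute("DELETE FROM supplier_lookup")
        conn.execute(
            """
            INSERT INTO supplier_lookup (warehouse_id, service_date, supplier_id)
            SELECT DISTINCT sv.warehouse_id, sv.service_date, s.id
            FROM suppliers AS s
            JOIN services AS sv ON sv.supplier_id = s.id
            WHERE s.active = 1
            """
        )


def supplier_ids(conn, warehouse_id, start, end):
    """Single-table range scan that replaces the repeated join."""
    rows = conn.execute(
        """
        SELECT DISTINCT supplier_id
        FROM supplier_lookup
        WHERE warehouse_id = ? AND service_date BETWEEN ? AND ?
        ORDER BY supplier_id
        """,
        (warehouse_id, start, end),
    )
    return [supplier_id for (supplier_id,) in rows]


refresh_lookup(conn)
ids = supplier_ids(conn, 7, "2024-01-01", "2024-01-31")
print(ids)
```

Whether this beats the original join is exactly what needs measuring on real data; the sketch only shows the mechanics of the batch-populated table.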
Let's say I would like to store votes to polls in mysql database.
As far as I know I have two options:
1. Create one table (let's say votes) with fields like poll_id, user_id, selected_option_id, vote_date and so on.
2. Create a new database for votes (let's say votes_base) and add a table to it for each poll, with the id of the poll in the table name, let's say poll[id of the poll].
The problem with the first option is that the table will become big very soon. Let's say I have 1000 polls and each poll has 1000 votes: that's already a million records in the table. I don't know how much that will cost in performance.
The problem with the second option is I'm not sure if this is the correct solution from the programming rules point of view. But I'm sure with this option it will be (much?) faster to find all votes to some poll.
Or maybe there is a better option?
Your first option is the better option. It is structurally more sound. Millions of rows in a table are no problem for MySQL. A new table per poll is an antipattern.
EDIT for first comment:
Even with a billion or more votes, MySQL should handle it. Indexes are the key here. What is the difference between one database with 100 copies of the same table and one table with 100 times the rows?
Technically, the second option works as well. Sometimes it might be even better. But we frequently see this:
Instead of one table, users, with 10 columns
Make 100 tables, users_uk, users_us, ... depending on where the users are from.
Great, no? Works, yes? Well it does, until you want to select all the male users, or join the users table onto another table. You'll have a huge UNION coming, and you won't even know the tables beforehand.
One big users table, with the appropriate indexes, is better. If it gets too big for your liking (or your disk), you can start with PARTITIONING: you still have the benefit of one table, but the partitions can be stored in different locations.
Now, with your polls, these kinds of queries might not happen. In that case, one big InnoDB table or thousands of small tables might both work, but the first option is a lot easier to program and has no drawbacks compared to the second. Why choose the second option?
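The UNION problem described above can be sketched as follows, using Python's sqlite3 module rather than MySQL. The tables users_uk and users_us come from the example in the answer; the gender and country columns, the sample rows, and the helper names are assumptions. With split tables the code has to discover the table names at run time and build the query text; with one table it is a single indexed query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Split layout: one table per country.
    CREATE TABLE users_uk (id INTEGER PRIMARY KEY, gender TEXT NOT NULL);
    CREATE TABLE users_us (id INTEGER PRIMARY KEY, gender TEXT NOT NULL);
    INSERT INTO users_uk (gender) VALUES ('m'), ('f');
    INSERT INTO users_us (gender) VALUES ('m'), ('m');

    -- Single layout: country is just another column.
    CREATE TABLE users (
        id      INTEGER PRIMARY KEY,
        country TEXT NOT NULL,
        gender  TEXT NOT NULL
    );
    CREATE INDEX idx_users_gender ON users (gender);
    INSERT INTO users (country, gender) VALUES
        ('uk', 'm'), ('uk', 'f'), ('us', 'm'), ('us', 'm');
    """
)


def count_male_split(conn):
    """Look up the per-country tables, then glue a UNION ALL together."""
    names = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name GLOB 'users_*' ORDER BY name"
        )
    ]
    # Table names cannot be bound as parameters, so the SQL text is built by
    # hand. The names come from the catalog here, not from user input.
    union = " UNION ALL ".join(f'SELECT gender FROM "{name}"' for name in names)
    query = f"SELECT COUNT(*) FROM ({union}) WHERE gender = 'm'"
    return conn.execute(query).fetchone()[0]


def count_male_single(conn):
    """The same question against the single table."""
    return conn.execute(
        "SELECT COUNT(*) FROM users WHERE gender = ?", ("m",)
    ).fetchone()[0]


split_count = count_male_split(conn)
single_count = count_male_single(conn)
print(split_count, single_count)
```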
The first option is the better one, no doubt. Just be sure to define INDEXes on the fields you will use to search the data (such as poll_id, for sure) and you will not experience performance issues. MySQL is a DBMS perfectly capable of handling that number of rows. Do not worry.
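A quick way to check that such an index is actually used is sketched here with Python's sqlite3 module as a stand-in; in MySQL you would inspect the plan with EXPLAIN instead of SQLite's EXPLAIN QUERY PLAN. The column list follows the question, while the index name idx_votes_poll and the generated data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE votes (
        poll_id            INTEGER NOT NULL,
        user_id            INTEGER NOT NULL,
        selected_option_id INTEGER NOT NULL,
        vote_date          TEXT    NOT NULL
    )
    """
)
conn.execute("CREATE INDEX idx_votes_poll ON votes (poll_id)")

# 100 polls with 100 votes each: 10,000 rows of sample data.
conn.executemany(
    "INSERT INTO votes (poll_id, user_id, selected_option_id, vote_date) "
    "VALUES (?, ?, ?, ?)",
    (
        (poll, user, user % 3, "2024-01-01")
        for poll in range(1, 101)
        for user in range(1, 101)
    ),
)

votes_for_poll = conn.execute(
    "SELECT COUNT(*) FROM votes WHERE poll_id = ?", (42,)
).fetchone()[0]

# The last column of each plan row is the human-readable description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM votes WHERE poll_id = ?", (42,)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)

print(votes_for_poll)
print(plan_text)
```

If the plan text mentions the index rather than a full table scan, finding the votes for one poll does not require reading the other polls' rows.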
The first option is better. And you can archive old rows after a while if you are not going to use them often.
I am working on a project using MySQL and PHP. I will have many (hundreds to thousands, possibly) users, and each user will have many (several thousand) entries relating to him/her. I was initially thinking of sticking all of the entries into one table, and having one of the columns be the user ID which the entry corresponds to, but this table would become huge, and likely hard to manage. I'd need to query the table frequently to get the entries which correspond to a particular user ID, and this may take a while. However, I would rarely need to query data that doesn't share a user ID.
I am now thinking about making a table for every user ID (something like "table1" for userID one, for example), and then just querying the individual tables. However, having thousands of tables sounds like a bad idea as well.
Which would you recommend? Or is there a better solution I haven't thought of? (I hope my question made sense!)
The only valid way of doing that is having everything in one table. MySQL is not made for such extreme usage.
I suggest you keep all the entries in one table, each entry having a UserID. And don't forget to put an index on that field.
Thinking about multiple tables may seem reasonable, but if you do it that way, your queries will actually take more time and the data will use more disk space, because each table adds overhead.
Go with one table; it is the only valid way. Splitting is not an option: you will just fragment your data and give yourself a hard time when you want to, for example, do a backup.
Just an additional comment: I have seen 20GB tables many times, but I have never seen a database with more than 100 tables.
I already saw a few forums with this question but they do not answer one thing I want to know. I'll explain first my topic:
I have a system where each log of multiple users are entered to the database (ex. User1 logged in, User2 logged in, User1 entered User management, User2 changed password, etc). So I would be expecting 100 to 200 entries per user per day. Right now, I'm doing it in a single table and to view it, I just have to filter out using UserID.
My question is, which is more efficient? Should I use one single table or create a table per user?
I am worried that if I use a single table, the system might have some difficulty filtering thousands of entries. I've read some pros and cons using multiple tables and a single table especially concerning updating the table(s).
I also want to know which one saves more space? multiple table or single table?
As long as you use indexes on the fields you're selecting from, you shouldn't have any speed problems (although indexes slow writes, so too many are a bad thing). A table with a few thousand entries is nothing to MySQL (or any other database engine).
The overhead of creating thousands of tables is much worse -- say you want to make a change to the fields in your user table -- now you'd have to change thousands of tables.
A table we regularly search for a single record at work has about 150,000 rows, and because the field we search on is indexed, the search takes a very small fraction of a second.
If you're selecting those records without using the primary key, create an index on the field you use to select like this:
CREATE INDEX my_column_name ON my_table(my_column_name);
That's the most basic form. To learn more about it, check here
I would go with a single table. With an index on userId, you should be able to scale easily to millions of rows with little issue.
A table per user might be more efficient, but it's generally poor design. The problem with a table per user is it makes it difficult to answer other kinds of questions like "who was in user management yesterday?" or "how many people have changed their passwords?"
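For illustration, the two questions quoted above become one-line queries against a single log table. This sketch uses Python's sqlite3 module in place of MySQL; the user_log table, its columns, the action labels, and the dates are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_log (
        id        INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL,
        action    TEXT    NOT NULL,
        logged_at TEXT    NOT NULL
    );
    CREATE INDEX idx_user_log_user ON user_log (user_id);
    INSERT INTO user_log (user_id, action, logged_at) VALUES
        (1, 'login',           '2024-03-01 08:00:00'),
        (2, 'login',           '2024-03-01 08:05:00'),
        (1, 'user_management', '2024-03-01 08:10:00'),
        (2, 'change_password', '2024-03-01 08:20:00'),
        (3, 'change_password', '2024-03-02 09:00:00'),
        (3, 'user_management', '2024-03-02 09:15:00');
    """
)


def users_with_action_on(conn, action, day):
    """Who performed this action on the given day (YYYY-MM-DD)?"""
    rows = conn.execute(
        """
        SELECT DISTINCT user_id
        FROM user_log
        WHERE action = ? AND date(logged_at) = ?
        ORDER BY user_id
        """,
        (action, day),
    )
    return [user_id for (user_id,) in rows]


def count_users_with_action(conn, action):
    """How many distinct users ever performed this action?"""
    return conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM user_log WHERE action = ?",
        (action,),
    ).fetchone()[0]


def entries_for_user(conn, user_id):
    """The everyday per-user filter, served by the index on user_id."""
    return conn.execute(
        "SELECT action FROM user_log WHERE user_id = ? ORDER BY logged_at",
        (user_id,),
    ).fetchall()


in_management = users_with_action_on(conn, "user_management", "2024-03-01")
password_changers = count_users_with_action(conn, "change_password")
user_one = entries_for_user(conn, 1)
print(in_management, password_changers, user_one)
```

With a table per user, the first two functions would have to visit every user's table.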
As for storage space used - I would say a table per user would probably use a little more space, but the difference between the two options should be quite small.
I would go with just 1 table. I certainly wouldn't want to create a new table every time a user is added to the system. The number of entries you mention for each day is really not that much data.
Also, create an index on the user column of your table to improve query times.
Definitely a single table. Having tables created dynamically for entities that are created by the application does not scale. Also, you would need to create your queries with variable tables names, something which makes things difficult to debug and maintain.
If you have an index on the user id you use for filtering it's not a big deal for a db to work through millions of lines.
Any database worth its salt will handle a single table containing all that user information without breaking a sweat. A single table is definitely the right way to do it.
If you used multiple tables, you'd need to create a new table every time a new user registered. You'd need to create a new statement object for each user you queried. It would be a complete mess.
I would go for the single table as well. You might want to go for multiple tables when you want to serve multiple customers with different sets of users (multi-tenancy).
Otherwise if you go for multiple tables, take a look at this refactoring tool: http://www.liquibase.org/. You can do schema modifications on the fly.
I guess that if you use proper indexing, for example, the single-table solution can perform well enough (and the maintenance will be much simpler).
A single table keeps the PHP prepared statements that handle $_POST and $_GET input simple. I think a single table will be fine for small to medium platforms. In summary, a few tables are preferable to many.
Multiple tables will not cause much havoc either, but a single table is still the best choice.