For very large tables, indexing can help a lot. But what is the solution for too many small tables in a database?
What if I have a large DB that contains too many tables? How can I make queries fast, given that indexes only speed up queries within a single table?
Let's talk through a real example.
On stackoverflow.com, suppose there is a table called "questions", having id, date, and votes, and then there exists a separate table for each id in the questions table (named after the numeric id, e.g. "q-45588"). Now it's easy to index the "questions" table, but what about the many child tables for each question id (which may contain ids, answer 1, answer 2, answer 3, comment 1, comment 2, ... votes, down votes, dates, flags, and so many other things)?
This is what happens in typical accounting software, i.e. a debtors account table holding the ids of all debtors, with a separate table existing for each of those ids (holding further details of that debtor).
Or is it a design problem?
*Update:*
Some people might say to do it all in 3 or 4 tables (which may end up with trillions of rows), e.g. a questions table, an answers table, a comments table, and a users table.
Here's an example of a modified Stack Overflow design:
Category of thread (info):
- Question
- Discussion
Category of thread response (info):
- A: answer
- C: comment
Threads (a table):
- id (key)
- thread id number (long data type)
- status (active, normal, closed (visible but not editable), deleted, flagged, etc.)
- type (question / discussion)
- votes up
- votes down
- count of views
- tag 1
- tag 2
- tag 3
- subject
- body
- maker id
- datetime stamp of creation
- datetime stamp of last activity
- answer count
- comment count
Thread (a table per thread, whose name is the thread id (long data type) from the Threads table):
- id (key)
- response text
- response type (A: answer / C: comment)
- vote up
- vote down
- abuse count
Typically, indexes are meant to make searching faster by providing an ordered structure to search within. In a very small table, since searching should be fast to begin with, they might not make much sense. Your best bet is to try with and without indexes, and measure accordingly.
That being said, if your small tables all have exactly the same structure, it might make more sense (from an RDBMS point of view, anyway) to merge them into a single entity.
What you have there is a design problem. Having multiple tables with the same columns should set off alarm bells immediately -- having multiple tables with the same unique key should as well.
In the example you give you should have a single child table.
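For instance, the per-thread tables from the question's update could collapse into one responses table keyed by thread id. A minimal sketch, assuming MySQL, a parent threads table with an id primary key, and reusing the question's column names (table and index names are illustrative):

CREATE TABLE responses (
    id            BIGINT AUTO_INCREMENT PRIMARY KEY,
    thread_id     BIGINT NOT NULL,               -- which thread this row belongs to
    response_type ENUM('answer', 'comment') NOT NULL,
    response_text TEXT,
    votes_up      INT NOT NULL DEFAULT 0,
    votes_down    INT NOT NULL DEFAULT 0,
    abuse_count   INT NOT NULL DEFAULT 0,
    INDEX idx_thread (thread_id),                -- fetches all responses for a thread in one indexed scan
    FOREIGN KEY (thread_id) REFERENCES threads (id)
);

One indexed query then replaces the per-question table lookup entirely.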
Now, in some cases you might have a table with one or more distinct values that represent a large proportion of the table rows. For example, let's say that you have sales for 50 customers, but one of them is responsible for 40% of the total sales records, with the rest distributed evenly among the other customers. Accessing the smaller customers' data through an index on customer_id makes sense, but it does not for the large customer. In that case you might look at partitioning the table to place the large customer's records in one child table and the other records in another, both being related to a master table (see http://www.postgresql.org/docs/9.2/static/ddl-partitioning.html).
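In PostgreSQL, a sketch of that partitioning using the declarative syntax (PostgreSQL 11+, newer than the inheritance approach in the linked 9.2 docs; the table and the customer value 42 are made up):

CREATE TABLE sales (
    sale_id     bigint,
    customer_id integer,
    amount      numeric
) PARTITION BY LIST (customer_id);

-- one partition for the dominant customer, one for everyone else
CREATE TABLE sales_big_customer    PARTITION OF sales FOR VALUES IN (42);
CREATE TABLE sales_other_customers PARTITION OF sales DEFAULT;

Queries against sales are then routed to the right partition automatically.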
However in general, and for your initial design, you should be using a single non-partitioned table for these child records.
Maybe this document can help you.
http://dev.mysql.com/doc/refman/5.0/en/table-cache.html
Actually, MySQL and other RDBMSs focus on handling one big table well, not many tables. If you need to handle an extremely large number of tables, you should consider NoSQL solutions.
Related
I need to create a table where each user (approx. 60 atm) would have a defined task for each day. Right now the database has one column for each user, with the task name in it (which is bad in my opinion, as each new user would require changing the table's schema), plus a "date" column.
A solution would be to have a "user" column and add a "task" column, but that would mean there would be 60 (number of current users) rows per day.
I don't really know what's best in this case.
Should I use more columns or more rows?
They're two completely different things, so this comparison doesn't make much sense...
Right now the database has one column for each user
Bad idea. Full stop. A user is a record of data, not a structural element of the database itself. For example, a table of users might contain columns like Username, Email, RegistrationDate, etc. It would not be a single row of data in which you add a column for each new user.
This would be a nightmare to maintain, would render things like Foreign Keys useless (and, honestly, render the entire concept of a relational database useless), would reach resource limits very quickly, etc.
Each record of information is a row, not a column (or table). In this case, each row in your table is a "User Task". It defines (or has a Foreign Key to) a User and defines (or has a Foreign Key to) a Task.
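A minimal sketch of that shape, assuming MySQL (the user_task name, its columns, and the referenced users and tasks tables are all hypothetical):

CREATE TABLE user_task (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    user_id   INT  NOT NULL,   -- who the task is for
    task_id   INT  NOT NULL,   -- what the task is
    task_date DATE NOT NULL,   -- which day it applies to
    FOREIGN KEY (user_id) REFERENCES users (id),
    FOREIGN KEY (task_id) REFERENCES tasks (id)
);

Adding a user is then a plain INSERT into users; the schema never changes.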
but that would mean there would be 60 (number of current users) rows per day
If the number of records in the table starts to become a problem, you can start looking into things like sharding and partitioning, archiving old data, etc. You've got time though, because "dozens of records per day" is sustainable for thousands of years. (And by then I imagine the hardware will be at least twice as good as it is today.)
Right now the database has one column for each user with the task name in it (which is bad in my opinion as each new user would need to change the schema of the table)
You're right, this is very bad. Using one column for the user, one for the task, and one for the date will be much better.
60 rows per day is not much. That means 21,900 rows per year and 219,000 rows in ten years. MySQL is able to handle millions of rows in a table.
If you have two indexes, one for user and one for the date, searching for data will be fast enough.
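Assuming a table like the hypothetical user_task(user_id, task_id, task_date) sketched above, those two indexes would be:

CREATE INDEX idx_user ON user_task (user_id);    -- lookups by user
CREATE INDEX idx_date ON user_task (task_date);  -- lookups by day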
Knowing nothing else about your database or schema, why not create a dimension table to store your users and a fact table to track your task details?
That way you can more easily add new users and the tasks table would continue to grow as new facts are added. It would also be very easy to denormalize this model for query and/or reporting purposes.
Adding columns is a nuisance and can be slow. Instead, have a table with columns (user, task, etc.).
Even "60 rows per second" is not a problem. 600/second might be.
See the tag [pivot-table] for how to turn rows into columns for output display.
I have a MySQL/MariaDB database where posts are stored. Each post has some statistical counters such as the number of times the post has been viewed for the current day, the total number of views, number of likes and dislikes.
For now, I plan to have all of the counter columns updated in real-time every time an action happens - a post gets a view, a like or a dislike. That means that the post_stats table will get updated all the time while the posts table will rarely be updated and will only be read most of the time.
The table schema is as follows:
posts(post_id, author_id, title, slug, content, created_at, updated_at)
post_stats(post_id, total_views, total_views_today, total_likes, total_dislikes)
The two tables are connected with a post_id foreign key. Currently, both tables use InnoDB. The data from both tables will be always queried together to be able to show a post with its counters, so this means there will be an INNER JOIN used all the time. The stats are updated right after reading them (every page view).
My questions are:
1. For best performance as the tables grow, should I combine the two tables into one, since the columns in post_stats are directly related to the post entries? Or should I keep the counter/summary table separate from the main posts table?
2. For best performance as the tables grow, should I use MyISAM for the posts table? I can imagine that MyISAM is more efficient at reads while InnoDB is better at inserts.
This problem is general for this database and also applies to other tables in it, such as users (counters such as the total views of their posts, the total number of comments written by them, the total number of posts written by them, etc.) and categories (the number of posts in that category, etc.).
Edit 1: The views per day counters are reset once daily at midnight with a cron job.
Edit 2: One reason for having posts and post_stats as two tables is concerns about caching.
For low traffic, KISS -- Keep the counters in the main post table. (I assume you have ruled this out.)
For high traffic, keep the counters in a separate table. But let's do the "today's" counters differently. (This is what you want to discuss.)
For very high traffic, gather up counts so that you can do less than 1 Update per click/view/like. ("Summary Tables" is beyond the scope of this question.)
Let's study total_views_today. Do you have to do a big "reset" every midnight? That is (or will become) too costly, so let's try to avoid it.
- Have only total_views in the table.
- At midnight, copy the table into another table. (SELECT is faster and less invasive than the UPDATE needed to reset the values.) Do this copy by building a new table, then RENAME TABLE to move it into place.
- Compute total_views_today by subtracting the corresponding values in the two tables.
That leaves you with
post_stats(post_id, total_views, total_likes, total_dislikes)
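A sketch of that nightly job in MySQL (the snapshot table post_stats_midnight is a made-up name, and it must be created once before the first swap):

-- build the new snapshot, then swap it into place atomically
CREATE TABLE post_stats_new LIKE post_stats;
INSERT INTO post_stats_new SELECT * FROM post_stats;
RENAME TABLE post_stats_midnight TO post_stats_old,
             post_stats_new      TO post_stats_midnight;
DROP TABLE post_stats_old;

-- "views today" becomes a subtraction; no midnight UPDATE needed
SELECT s.post_id,
       s.total_views - COALESCE(m.total_views, 0) AS total_views_today
FROM post_stats AS s
LEFT JOIN post_stats_midnight AS m USING (post_id);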
For "high traffic, it is fine to do
UPDATE post_stats SET ... = ... + 1 WHERE post_id = ...;
at the moment needed (for each counter).
But there is a potential problem: you can't increment a counter if the row does not exist. That is best solved by creating a row of zeros at the same time the post is created. (Otherwise, see IODKU: INSERT ... ON DUPLICATE KEY UPDATE.)
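For reference, a sketch of the IODKU variant, which creates the row on first view and increments it afterwards in a single statement (the ? stands for the post id):

INSERT INTO post_stats (post_id, total_views, total_likes, total_dislikes)
VALUES (?, 1, 0, 0)
ON DUPLICATE KEY UPDATE total_views = total_views + 1;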
(I may come back if I think of more.)
Using a basic star schema, I have been told that a fact table would have at least as many rows as the product of the number of rows in each dimension.
For example, 3 products, 5 promotions, and 10 stores would mean that the fact table should have at least 150 records, regardless of whether or not a product actually had every promotion or exists in every store. Specifically, null values would exist where, for example, a product does not have a specific promotion, and so on.
Can someone please provide an academic source that supports this, or at the least just confirm the idea?
The reason why I am asking this is that my understanding tells me this would create a MASSIVE amount of useless data in the fact table.
Thanks!
Hi thanks for the replies. I consulted my lecturer and he actually found a page reference for me: "...Take a very simplistic example of 3 products, 5 customers, 30 days, and 10 sales representatives represented as row in the dimension tables. Even in this example, the number of fact table rows will be 4500, very large in comparison with the dimension table rows..." (Ponniah, P., 2009. Data warehousing: Fundamentals for IT professionals, 2nd Edition. John Wiley & Sons, Inc., New Jersey. p. 237)
However, the author goes on to say that: "We have said that a single row in the fact table relates to a particular product, a specific calendar date, a specific customer, and an individual sales representative. In other words, for a particular product, a specific calendar date, a specific customer, and an individual sales representative, there is a corresponding row in the fact table. What happens when the date represents a closed holiday and no orders are received and processed? The fact table rows for such dates will not have values for the measures. Also there could be other combinations of dimension table attributes, values for which the fact table rows will have null measures. Do we need to keep such rows with nulls measures in the fact table? There is no need for this. Therefore it is important to realize this type of sparse data and understand that the fact table could have gaps."
In short, you guys seem to be correct, thanks!
Of course not. I suggest you ask your source to clarify this claim; it sounds as if there is a misunderstanding somewhere.
And what if you add a time dimension..?
Also, it is not even possible to have null values as keys where, e.g., promotions are missing, because the purpose of the key is to point to a dimensional value, which a null value does not do.
The dimension values are there to support whatever facts you have, not the other way around.
This may relate to a specific kind of fact table: the pattern that Ralph Kimball terms a Periodic Snapshot Fact Table. That is where the fact table repeats an entire population of rows for each point in time. IMO the usefulness of that approach is extremely limited.
A Snapshot Fact Table does not implicitly require that the fact table be the product of its dimensions, but it does pose the potential problem of what the correct population of each snapshot should be. The cross product of dimensions is one way to do it, I suppose.
I am developing a forum in PHP and MySQL, and I want to make it as efficient as I can.
I have made these two tables:
tbl_threads
tbl_comments
Now, the problem is that there is a like and dislike button under each comment. I have to store the user_name of whoever clicked the Like or Dislike button, along with the comment_id. I have made a column user_likes and a column user_dislikes in tbl_comments to store comma-separated user_names. But on this forum I have read that this is not an efficient way; I have been advised to create a third table to store the likes and dislikes, and to make my database design comply with 1NF.
But the problem is, if I make a third table tbl_user_opinion with two fields like this:
1. comment_id
2. type (like or dislike)
will I have to run as many SQL queries as there are comments on my page to get the like and dislike data for each comment? Won't that be inefficient? I think there is some confusion on my part here. Can someone clarify this?
You have a relational schema like this: [diagram omitted]
There are two ways to solve this. The first one, the "clean" one, is to build your "like" table and do count(*)'s on the appropriate column.
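Using the tbl_user_opinion table proposed in the question, a single query can tally every comment on the page at once, so you do not need one query per comment. A sketch (SUM over a comparison relies on MySQL treating true as 1 and false as 0):

SELECT comment_id,
       SUM(type = 'like')    AS likes,
       SUM(type = 'dislike') AS dislikes
FROM tbl_user_opinion
WHERE comment_id IN (1, 2, 3)   -- the ids of the comments on the current page
GROUP BY comment_id;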
The second one would be to store a counter in each comment, indicating how many ups and downs it has received.
If you want to check whether a specific user has voted on a comment, you only have to check one entry, which you can easily handle as its own query and merge the two result sets outside of your database (for this, use a query returning the comment_id and the kind of vote the user has cast in a specific thread).
Your approach with a comma-separated list does not perform well, because you cannot parse it without extra logic or a huge amount of string parsing. If you have a database, use it!
("One piece of information, one record!")
The comma-separated list violates the principle of atomicity, and therefore 1NF. You'll have a hard time maintaining referential integrity and, for the most part, querying as well.
Here is one way to do it in a normalized fashion:
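Since the original diagram is not reproduced here, this is a reconstruction of the model that the query and footnote below assume, with the vote tables keyed (and therefore clustered, on engines like InnoDB) on {COMMENT_ID, USER_ID}:

CREATE TABLE COMMENT (
    COMMENT_ID INT PRIMARY KEY
    -- <other COMMENT fields>
);

CREATE TABLE UP_VOTE (
    COMMENT_ID INT NOT NULL,
    USER_ID    INT NOT NULL,
    PRIMARY KEY (COMMENT_ID, USER_ID),
    FOREIGN KEY (COMMENT_ID) REFERENCES COMMENT (COMMENT_ID)
);

CREATE TABLE DOWN_VOTE (
    COMMENT_ID INT NOT NULL,
    USER_ID    INT NOT NULL,
    PRIMARY KEY (COMMENT_ID, USER_ID),
    FOREIGN KEY (COMMENT_ID) REFERENCES COMMENT (COMMENT_ID)
);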
This is very clustering-friendly: it groups up-votes belonging to the same comment physically close together (ditto for down-votes), making the following query rather efficient:
SELECT
COMMENT.COMMENT_ID,
<other COMMENT fields>,
COUNT(DISTINCT UP_VOTE.USER_ID) - COUNT(DISTINCT DOWN_VOTE.USER_ID) SCORE
FROM COMMENT
LEFT JOIN UP_VOTE
ON COMMENT.COMMENT_ID = UP_VOTE.COMMENT_ID
LEFT JOIN DOWN_VOTE
ON COMMENT.COMMENT_ID = DOWN_VOTE.COMMENT_ID
WHERE
COMMENT.COMMENT_ID = <whatever>
GROUP BY
COMMENT.COMMENT_ID,
<other COMMENT fields>;
Please measure on realistic amounts of data whether that works fast enough for you. If not, denormalize the model and cache the total score in the COMMENT table, keeping it current through triggers every time a new row is inserted into or deleted from the *_VOTE tables.
If you also need to get which comments a particular user voted on, you'll need indexes on *_VOTE {USER_ID, COMMENT_ID}, i.e. the reverse of the primary/clustering key above.1
1 This is one of the reasons why I didn't go with just one VOTE table containing an additional field that can be either 1 (for up-vote) or -1 (for down-vote): it's less efficient to cover with secondary indexes.
Using MySQL, I have a table of users, a table of matches (updated with the actual result), and a table called users_picks. (At first it's always going to be 10 football matches per gameweek per league, because there's only one league as of now, but more leagues will come along eventually, and some of them have only 8 matches per gameweek.)
In the users_picks table, should I store each 'pick' (by which I mean both a 'hometeam score' and an 'awayteam score') in a different row, or have all 10 picks in one single row? Both options would have an FK for user and gameweek. All picks in one row would mean columns with appended numbers, like this:
Option 1: [pick_id, user_id, league_id, gameweek_id, match1_hometeam_score, match1_awayteam_score, match2_hometeam_score, match2_awayteam_score ... etc]
That option doesn't quite fill me with joy and looks a bit stupid, especially since there are going to be lots of potential NULLs in the DB. The second option would eventually mean millions of rows, but would look like this:
Option 2: [pick_id, user_id, league_id, gameweek_id, match_id, hometeam_score, awayteam_score]
What's the best practice? And would it be a PITA to do all sorts of statistics using the second option? E.g. calculating how many matches a user has predicted correctly in a specific round, how many all-time correct hits, etc.
If I'm not making much sense, I'll be happy to elaborate. I just want my table design to be good from the start, so I won't have a huge headache in a couple of months.
Thanks in advance.
The second choice is much better than the first. This is called database normalisation and makes querying easier, not harder. I would suggest reading the linked article, and the related descriptions of the various "normal forms", and aiming for a 3rd Normal Form data structure as a minimum.
To see the flaw in your first option, imagine if there were to be included later a new league with 11 matches. Or 400.
You should read up about database normalization.
When you have a 1:n relation, like in your case one team having many matches, you would create two tables. One table "teams" and a second table "matches" where each row includes the ID of the team which played the match.
In the same manner you should also have separate tables for users, picks and leagues.
Option two is better, provided you INDEX your table properly, since (as you indicate) it will grow quite large. The pick_id is the primary key, but you should also create an INDEX on the user_id field, as the most common query will likely be:
SELECT * FROM `users_picks` WHERE `user_id` = ?;
to get all the picks for a given user.
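So, in addition to the primary key, something like this (MySQL syntax, using the column names from option 2):

ALTER TABLE users_picks ADD INDEX idx_user (user_id);

A composite index on (user_id, gameweek_id) could serve the per-round statistics queries as well.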