I would like to achieve something like what you see on Facebook:
- Posting a status
- Commenting on a status
- Liking a status (likes on comments are not implemented yet)
My table structure is like this:
Posts      Users      Comments   Likes
-------    -------    --------   -------
ID         ID         ID         ID
UserID     Username   PostID     PostID
Content               UserID     UserID
Date                  Content
                      Date
So at the moment, when someone opens the main page, the system shows the 10 latest posts. My query uses LEFT JOIN on these tables.
If, for example, there are 10 posts without any comments or likes, the query returns 10 records.
But for each comment or like, my query returns an additional record (row) with NULL values in the columns that do not apply.
In the end, just to retrieve 10 posts, my query returns at least 50 rows (if each post has some comments and likes).
I was wondering whether that will cause problems in the future, and whether I would be better off using multiple queries and collecting all the results into an array, like this:
1. Select the 10 last posts
2. Save the IDs into an array and all the data into a global array
3. Loop over the array and build a prepared query for the comments, something like:
SELECT * FROM COMMENTS WHERE PostID IN (1, 2, 3, 4, 5, 6,...)
4. Save the result into the global array
5. Repeat for the Likes table
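The five steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the table and column names follow the structure shown earlier, and the function name fetch_feed and the sample rows are assumptions made for the example.

```python
import sqlite3

def fetch_feed(conn, limit=10):
    """Return the newest posts, each with its comments and like count."""
    cur = conn.cursor()
    # Step 1: the newest posts, joined to their authors.
    cur.execute(
        "SELECT p.ID, u.Username, p.Content, p.Date "
        "FROM Posts p JOIN Users u ON u.ID = p.UserID "
        "ORDER BY p.Date DESC, p.ID DESC LIMIT ?", (limit,))
    # Step 2: index the posts by ID (dicts keep insertion order).
    posts = {row[0]: {"id": row[0], "author": row[1], "content": row[2],
                      "date": row[3], "comments": [], "likes": 0}
             for row in cur.fetchall()}
    if not posts:
        return []
    ids = list(posts)
    marks = ",".join("?" * len(ids))  # one placeholder per post ID
    # Steps 3-4: all comments for those posts in a single query.
    cur.execute(
        "SELECT c.PostID, u.Username, c.Content FROM Comments c "
        "JOIN Users u ON u.ID = c.UserID "
        f"WHERE c.PostID IN ({marks}) ORDER BY c.Date, c.ID", ids)
    for post_id, author, content in cur.fetchall():
        posts[post_id]["comments"].append({"author": author, "content": content})
    # Step 5: the same idea for likes, aggregated per post.
    cur.execute(
        f"SELECT PostID, COUNT(*) FROM Likes WHERE PostID IN ({marks}) "
        "GROUP BY PostID", ids)
    for post_id, like_count in cur.fetchall():
        posts[post_id]["likes"] = like_count
    return list(posts.values())

# Hypothetical sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (ID INTEGER PRIMARY KEY, Username TEXT);
CREATE TABLE Posts (ID INTEGER PRIMARY KEY, UserID INTEGER, Content TEXT, Date TEXT);
CREATE TABLE Comments (ID INTEGER PRIMARY KEY, PostID INTEGER, UserID INTEGER,
                       Content TEXT, Date TEXT);
CREATE TABLE Likes (ID INTEGER PRIMARY KEY, PostID INTEGER, UserID INTEGER);
INSERT INTO Users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO Posts VALUES (1, 1, 'first', '2024-01-01'), (2, 2, 'second', '2024-01-02');
INSERT INTO Comments VALUES (1, 1, 2, 'nice', '2024-01-03'), (2, 1, 1, 'thanks', '2024-01-04');
INSERT INTO Likes VALUES (1, 1, 2), (2, 2, 1), (3, 2, 2);
""")
feed = fetch_feed(conn)
```

Whatever the number of comments and likes, this issues exactly three queries per page, and each post's data is transferred only once.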
I hope my explanation was clear enough :) Thank you
Doing one 50-row query reduces the overhead of communicating with the server; on the other hand, it adds processing after the rows are retrieved.
It really depends on the overall solution.
However, unless the application is performance-critical with the server being the bottleneck, I would go with 10 result sets, one per row, probably using some class/widget/object to display the post on the page.
I'm not an expert, but if I understand correctly, your options are:
A) A single mega query that returns a lot of NULLs and repeated values.
[Note: By "all" I mean all the ones you are interested in.]
B) Three queries: one for all posts, one for all comments, and one for all likes (all joined with the users table), after which you can process them into objects, structs or dictionaries in whatever language you are using to query the database.
I would go with the second because it is easier, the increase from one query to three seems benign, and the design is probably more flexible.
What I would prefer NOT to do is one query per post. That would probably become a problem sooner rather than later, and much sooner than A or B.
Related
I have a quite complex view backed by two queries (a view within a view): one selects users with related data, and the other selects orders with related data. Both have filters. I now have an issue and am looking for a proper, decent solution with good performance, because the queries involve a lot of data and relationships.
Assume I have:
Query 1 - Selects user data, with some left joins to other tables; the conditions depend on the provided parameters.
Query 2 - Selects orders for the users from Query 1, with many joins; the conditions depend on parameters.
I display the data from both queries in one view (users, their data, orders, and some order data), and now I want to implement a pager. It has to display the proper number of users given the filters from both Query 1 and Query 2. The issue is that I can't simply apply a limit to either query, because the other one has filters as well, so some of the limited users may end up not being displayed at all.
So I guess there are two ways. One is to run those queries in a loop and collect data until I have the required number of results.
The other is to merge the two queries into one, but then I get many rows per user, so I can't set a page limit that returns results for a specific number of users, for example 30, because the results look like user 1 => order 1, user 1 => order 2. Is there any way to get a specific number of unique results based on the user ID or something similar?
Let me know if you have any questions.
Sample data would make this clearer. I am unable to understand the whole requirement in your question. Would you be able to create some sample data and share it with us? If you are dealing with a lot of data, avoid loops, as they will just make performance worse.
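Without the real schema, one loop-free way to page by user rather than by row can only be sketched under assumptions. Below, the tables users(id, name, active) and orders(id, user_id, status), the filters, and the function page_of_users are all hypothetical, and SQLite (through Python's sqlite3 module) stands in for the actual database. The idea is to apply both sets of filters inside a derived table that selects and limits user IDs, and only then join the orders.

```python
import sqlite3

def page_of_users(conn, status, page, per_page=30):
    """Return {user_id: [order_id, ...]} for one page of matching users."""
    sql = """
        SELECT u.id, o.id
        FROM (
            -- Page over users, not rows: filter on both sides, then LIMIT.
            SELECT users.id AS id
            FROM users
            WHERE users.active = 1
              AND EXISTS (SELECT 1 FROM orders
                          WHERE orders.user_id = users.id
                            AND orders.status = ?)
            ORDER BY users.id
            LIMIT ? OFFSET ?
        ) AS u
        JOIN orders o ON o.user_id = u.id AND o.status = ?
        ORDER BY u.id, o.id
    """
    params = (status, per_page, page * per_page, status)
    result = {}
    for user_id, order_id in conn.execute(sql, params):
        result.setdefault(user_id, []).append(order_id)
    return result

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, active INTEGER);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT);
INSERT INTO users VALUES (1,'u1',1), (2,'u2',1), (3,'u3',1), (4,'u4',0), (5,'u5',1);
INSERT INTO orders VALUES (1,1,'paid'), (2,1,'paid'), (3,2,'open'), (4,3,'paid'),
                          (5,3,'paid'), (6,3,'paid'), (7,4,'paid'), (8,5,'paid');
""")
first_page = page_of_users(conn, "paid", page=0, per_page=2)
second_page = page_of_users(conn, "paid", page=1, per_page=2)
```

Each page then contains at most per_page distinct users, however many order rows each user contributes.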
What I want to achieve:
I am developing a website with a catalog of products.
This is the normalized (simplified) model of the entities related to my question:
So some product features exist (like size and type in this example), each of which has a predefined set of values (e.g. sizes 1, 2 and 3 exist, and type may be 1, 2 or 3; these sets do not have to be equal, it is just an example).
The relationship between Product and each feature is "many-to-many": different values of one feature do not exclude each other.
My task is to build a form that lets the user filter search results based on product features. Example screenshot:
Multiple checked values of one feature are combined using "AND" logic, so if sizes One and Three are checked, I need all products that have both sizes (they may have any other sizes too, that doesn't matter, but the selected ones must be present).
The number next to each feature value is the number of products that would be returned if the user checked this value right now. So it is effectively the number of products satisfying the filter "current active filter + this one value".
When the user checks/unchecks any value, the counters must be updated to reflect the new "current filter".
Problem:
The real use case is: ~200k products, ~6 features with ~5-15 values each.
My COUNT queries (especially with a decent number of selected options) are too slow, and to render the form I need as many of these counts as there are values across all filters. In total that gives an unacceptable response time.
What I have tried:
Query to retrieve results:
select * from products p, product_size ps
where p.id = ps.product_id
and (ps.size_id IN (1, 2, 3, 5))
group by p.id
having count(p.id) = 4;
(This selects the products that have sizes 1, 2, 3 and 5 at the same time.)
It completes in ~0.360 sec on 120k products, and takes almost the same time with COUNT wrapped around it. This query also does not support more than one feature (though I could place the values of all features in one table).
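As a runnable illustration of the GROUP BY / HAVING approach above, here is a small sketch using Python's sqlite3 module in place of MySQL. It is a variation rather than the exact query: it reads only the product_size link table (returning product IDs) and uses COUNT(DISTINCT size_id) so that duplicate link rows cannot inflate the count. The function name and the sample rows are assumptions.

```python
import sqlite3

def products_with_all_sizes(conn, size_ids):
    """Return IDs of products that have every size in size_ids."""
    wanted = sorted(set(size_ids))
    if not wanted:
        raise ValueError("size_ids must not be empty")
    marks = ",".join("?" * len(wanted))  # one placeholder per size ID
    rows = conn.execute(
        f"SELECT product_id FROM product_size WHERE size_id IN ({marks}) "
        "GROUP BY product_id "
        "HAVING COUNT(DISTINCT size_id) = ? "  # all requested sizes present
        "ORDER BY product_id",
        (*wanted, len(wanted)))
    return [row[0] for row in rows]

# Hypothetical sample data: product 2 lacks sizes 3 and 5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_size (product_id INTEGER, size_id INTEGER)")
conn.executemany(
    "INSERT INTO product_size VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (1, 5),
     (2, 1), (2, 2),
     (3, 1), (3, 2), (3, 3), (3, 5), (3, 7)])
matching = products_with_all_sizes(conn, [1, 2, 3, 5])
```

Products with extra sizes (product 3 here) still match, which is the "division with a remainder" behaviour the question asks for.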
Another query to retrieve the same set:
SELECT ps1.product_id
FROM product_size AS ps1, (SELECT id FROM size AS s1 WHERE id IN (1, 2, 3, 5)) AS t
WHERE ps1.size_id = t.id
GROUP BY ps1.product_id
HAVING COUNT(ps1.size_id) = (SELECT COUNT(id) FROM (SELECT id FROM size AS s2 WHERE id IN (1, 2, 3, 5)) AS t2);
It completes in ~0.230 sec (the same when wrapped in COUNT) and does not support multiple features either.
It is a modified version of a query I found here: https://www.simple-talk.com/sql/t-sql-programming/divided-we-stand-the-sql-of-relational-division/ (second query in the "Division with a Remainder" part).
Alternative schema:
A denormalized model, where each feature value is a boolean column in the Products table.
The query is obvious here:
select * from products
where `size_1` = 1 and `size_2` = 1
and `size_3` = 1 and `size_5` = 1;
Weird and harder to maintain in the application's code, but it completes in ~0.056 sec when COUNT-ing.
None of these methods is acceptable per se, because multiplied ~30 times (to populate all the counters in the form) they give an inadequate response time.
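With the denormalized boolean columns, the ~30 separate COUNT queries could in principle be collapsed into one, because under "AND" logic the counter for a value is just the number of currently matching products that also have that value set. The sketch below assumes this schema and uses conditional aggregation (one SUM per feature column); the column list FEATURE_COLUMNS, the function facet_counts and the sample rows are hypothetical, and SQLite stands in for MySQL.

```python
import sqlite3

# Hypothetical feature columns; identifiers cannot be bound as parameters,
# so they are checked against this whitelist before being put into SQL.
FEATURE_COLUMNS = ("size_1", "size_2", "size_3", "type_1", "type_2")

def facet_counts(conn, active):
    """Return {column: count of products matching `active` plus that column}."""
    unknown = set(active) - set(FEATURE_COLUMNS)
    if unknown:
        raise ValueError(f"unknown feature columns: {sorted(unknown)}")
    # One SUM per counter; COALESCE covers the case of zero matching rows.
    sums = ", ".join(f"COALESCE(SUM({col}), 0)" for col in FEATURE_COLUMNS)
    # The current filter: every checked value must be present (AND logic).
    where = " AND ".join(f"{col} = 1" for col in active) or "1 = 1"
    row = conn.execute(f"SELECT {sums} FROM products WHERE {where}").fetchone()
    return dict(zip(FEATURE_COLUMNS, row))

# Hypothetical sample data: four products, five boolean feature columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, size_1 INTEGER, "
    "size_2 INTEGER, size_3 INTEGER, type_1 INTEGER, type_2 INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, 1, 0, 1, 0),
     (2, 1, 0, 1, 0, 1),
     (3, 1, 1, 1, 1, 1),
     (4, 0, 1, 0, 0, 1)])
no_filter = facet_counts(conn, [])
with_sizes = facet_counts(conn, ["size_1", "size_2"])
```

Arbitrary conditions such as a price range could be added to the same WHERE clause, so all counters would still come from a single scan of the filtered rows. Whether that scan is fast enough on ~200k products would have to be measured.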
Caching and precomputing
Data in the DB is going to be updated only a few times a day (maybe even just 2), so I could probably precompute the counts for all filter combinations when the data is updated (I haven't measured the time needed, to be honest). But that is not going to work either: the search form has fields with arbitrary values (like min/max price and text search by product name), which I can't precompute for.
Load counters in form dynamically
Render the form, but fetch the numbers through AJAX, so the user would see the page first and then, after quite a long wait, the numbers. This is my last resort, but it seems like poor quality of service to me (it may be worse than no counters at all).
I am stuck. Any hints? Maybe I am not seeing some bigger picture? I would be very glad of any advice.
UPDATE: if we forget about the counters, what is the efficient and commonly used way (query) to just retrieve results with such filters (or what am I doing wrong)? It is equivalent to the "find posts with all requested tags" model. I suspect it can be faster than my 0.230 sec (query #2), considering the small (?) number of rows for MySQL.
You can:
1. Create one table that stores all possible combinations (product_id <> size_id <> type_id).
2. Update this table whenever an admin changes a product from the backend (assuming there is backend management).
3. On the frontend, use this table for filters instead of the product tables, and extract the product IDs once the filter query is fired.
4. Once you have the list of product IDs for the result, fetch the actual data using those IDs.
I have used this before and it worked for me. You can create the table first and try running a query to check the response time.
Hope this helps.
Is it a good idea to store like count in the following format?
like table:
u_id | post_id | user_id
And count(u_id) of a post?
What if there were thousands of likes for each post? The like table is going to be filled with billions of rows after a few months.
What are other efficient ways to do so?
In short, the answer is: yes, it is OK (to store a row for each like by any user on any post).
But I want to split this into several questions:
Q. Is there another way than count(u_id)? Or, more precisely, than:
SELECT COUNT(u_id) FROM likes WHERE post_id = ?
A. Why not? You can store the count in your posts table and increment/decrement it every time a user likes/unlikes the post. You can set up a trigger (or a stored procedure) to automate this. Then, to get the counter, you just need:
SELECT counter FROM posts WHERE post_id = ?
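A minimal sketch of such a trigger-maintained counter is shown below, using SQLite through Python's sqlite3 module so that it runs as-is. MySQL trigger syntax differs slightly (it requires FOR EACH ROW, for example). The trigger names and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY,
                    counter INTEGER NOT NULL DEFAULT 0);
CREATE TABLE likes (u_id INTEGER PRIMARY KEY,
                    post_id INTEGER NOT NULL,
                    user_id INTEGER NOT NULL);

-- Keep posts.counter in step with the rows in likes.
CREATE TRIGGER likes_after_insert AFTER INSERT ON likes
BEGIN
    UPDATE posts SET counter = counter + 1 WHERE post_id = NEW.post_id;
END;
CREATE TRIGGER likes_after_delete AFTER DELETE ON likes
BEGIN
    UPDATE posts SET counter = counter - 1 WHERE post_id = OLD.post_id;
END;
""")

conn.execute("INSERT INTO posts (post_id) VALUES (1), (2)")
conn.executemany("INSERT INTO likes (post_id, user_id) VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 10)])
# User 10 removes their like from post 1.
conn.execute("DELETE FROM likes WHERE post_id = 1 AND user_id = 10")

# Cheap read path: no COUNT over the likes table.
counters = dict(conn.execute("SELECT post_id, counter FROM posts"))
# The slower COUNT-based result, for comparison.
counted = dict(conn.execute(
    "SELECT post_id, COUNT(u_id) FROM likes GROUP BY post_id"))
```

The counter and the COUNT query should always agree. The trade-off is a slightly more expensive write in exchange for a constant-time read.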
If you like the previous Q/A and think it is a good idea, I have the next question:
Q. Why do we need likes table then?
A. That depends on your application design and requirements. Judging by the column set you posted (u_id, post_id, user_id; I would even add a timestamp column), your requirement is to store which user liked which post. That means you can detect whether a user has already liked a post and refuse multiple likes. If you don't care about multiple likes, a historical timeline, or stats, you can drop the likes table.
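One common way to refuse multiple likes is to let the database enforce it with a unique constraint on the (post_id, user_id) pair. The sketch below assumes the likes columns from the question plus the suggested timestamp. The column name created_at and the helper like() are hypothetical, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE likes (
    u_id       INTEGER PRIMARY KEY,
    post_id    INTEGER NOT NULL,
    user_id    INTEGER NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (post_id, user_id)  -- one like per user per post
)""")

def like(conn, post_id, user_id):
    """Record a like; return False if this user already liked the post."""
    try:
        conn.execute("INSERT INTO likes (post_id, user_id) VALUES (?, ?)",
                     (post_id, user_id))
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a duplicate like.
        return False

first = like(conn, 1, 10)    # new like
second = like(conn, 1, 10)   # same user, same post
other = like(conn, 2, 10)    # same user, different post
total = conn.execute("SELECT COUNT(*) FROM likes").fetchone()[0]
```

The same unique index also makes the "has this user liked this post?" lookup cheap.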
Last question I see here:
Q. The likes table is going to be filled with billions of rows after a few months, isn't it?
A. I wish you that success, but IMHO you are 99% wrong. To get just 1M records you need 1000 active users (which is a very, very good number for a personal startup; are you building the whole app with no architect or designer involved?), and EVERY one of those users would have to like EVERY one of 1000 posts, if you have that many.
My point here is: fortunately, you have plenty of time until your database becomes big enough to hurt your application. Until your table reaches 10-20M records, you do not need to worry about size and performance.
I'm building a kind of forum, so I need to display posts. Each post has comments and tags assigned to it, and each post is assigned to a user; also, each comment is assigned to a user. So what I need to fetch is: a post, its comments and their authors' usernames, its tags, and the user the post is assigned to. A displayed post looks something like this:
post_title (submit_time)
tag1, tag2, tag3
user_name
comment_text
user_name
comment_text
user_name
etc.
The problem is that relationships between posts, users, comments and tags are all different. Posts to users have an N:1 relationship (multiple posts for one user), posts to comments have a 1:N relationship (multiple comments for one post), posts to tags have m:n relationship (any number of tags for any number of posts).
I have devised a complex query with a lot of LEFT JOINs that allows me to fetch all data for each object, but it returns a lot of duplicate rows (for example, if there are 5 comments, data about the post author will be fetched 5 times; it gets even worse with tags). This doesn't seem very rational. Also, it still makes me do another query for each comment to find its author's user_name.
I'm a bit of a newbie with MySQL, so I really have no idea how such problem should be tackled.
The question is: what is the best way to fetch such data? Should I make one large query, or a lot of small ones (fetch comments and tags with distinct queries for each post)?
Please comment if my situation is unclear: I will do my best to clarify it.
Sometimes it IS ok to use multiple queries to fetch data. In your case, you CAN use a monolithic big query to fetch a user's data, all their posts, and all the comments, but as you say - there's a lot of repeated data. If you end up throwing away most of a query's results because it's repeated data, then it's a very good candidate for splitting up:
1 query to fetch the user info
1 query to fetch the user's posts
1 query to fetch the posts' comments
The end result is that you fetch the user's info only once. Assuming a particular user has 20 posts with around 50 comments in total, your monolithic single query would return about 50 rows (one per comment), each repeating the user details and the details of its post. That is 50 copies of the user details and 50 copies of post details for only 20 posts, causing you to throw away 49 + 30 = 79 records' worth of user/post data.
By comparison, at the cost of doing 3 queries, you're fetching only 1+20+50 = 71 rows of data, none of which is redundant.
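The row counts can be checked with a small experiment. The sketch below assumes one user with 20 posts and 50 comments in total (the reading under which 1+20+50 = 71 holds), spread so that every post has at least one comment. The schema is hypothetical, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
""")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.executemany("INSERT INTO posts VALUES (?, 1, ?)",
                 [(i, f"post {i}") for i in range(1, 21)])
# 50 comments spread round-robin over the 20 posts (2 or 3 per post).
conn.executemany("INSERT INTO comments VALUES (?, ?, ?)",
                 [(i + 1, i % 20 + 1, f"comment {i + 1}") for i in range(50)])

# Monolithic query: user and post columns are repeated on every row.
joined = conn.execute("""
    SELECT u.id, u.name, p.id, p.title, c.id, c.body
    FROM users u
    JOIN posts p ON p.user_id = u.id
    LEFT JOIN comments c ON c.post_id = p.id
    WHERE u.id = 1""").fetchall()

# Split approach: three narrow result sets with nothing repeated.
user = conn.execute("SELECT id, name FROM users WHERE id = 1").fetchall()
posts = conn.execute("SELECT id, title FROM posts WHERE user_id = 1").fetchall()
comments = conn.execute("""
    SELECT c.id, c.post_id, c.body
    FROM comments c JOIN posts p ON p.id = c.post_id
    WHERE p.user_id = 1""").fetchall()

redundant_user_copies = len(joined) - len(user)
redundant_post_copies = len(joined) - len(posts)
split_total = len(user) + len(posts) + len(comments)
```

The split approach returns more rows in total than the join, but each row is narrow and nothing is transferred twice. The saving grows with the width of the user and post columns.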
I have searched for a solution for this problem, but haven't found it (yet), probably because I don't quite know how to explain it properly myself. If it is posted somewhere already, please let me know.
What I have is three databases that are related to each other: main, pieces & groups. Basically, the main database contains the most elementary/most used information from a post, and the pieces database contains data that is associated with that post. The groups database contains all of the (long) names of the groups a post in the main database can be 'posted in'. A post can be posted in multiple groups simultaneously. When a new post is added to my site, I check the pieces to see if there are any duplicates (check if the post has been posted already). In order to make the search for duplicates more effective, I only check the pieces that are posted in the same group(s).
Hopefully you're still with me, cause here's where it starts to get really confusing I think (let me know if I need to specify things more clearly): right now, both the main and the pieces database contain the full name of the group(s) (basically I'm not using the groups database at all). What I want to do is replace the names of those groups with their associated IDs from the groups database. For example, I want to change this:
from:
MAIN_table:
id | group_posted_in
--------|---------------------------
1 | group_1, group_5
2 | group_15, group_75
3 | group_1, group_215
GROUPS_table:
id | group_name
--------|---------------------------
1 | group_1
2 | group_2
3 | group_3
etc...
into:
MAIN_table:
id | group_posted_in
--------|---------------------------
1 | 1,5
2 | 15,75
3 | 1,215
Or something similar to this. However, this format specifically causes issues, as the following query will return all of the rows (from the example) instead of just the one I need:
SELECT * FROM main_table WHERE group_posted_in LIKE '%5%'
I either have to change the query to something like this:
...WHERE group_posted_in = '5' OR group_posted_in LIKE '5,%' OR group_posted_in LIKE '%,5,%' OR group_posted_in LIKE '%,5'
Or I have to change the database structure from Comma Separated Values to something like this: [15][75]. The accompanying query would be simpler, but it somehow seems like a cumbersome solution to me. Additionally, (simple) joins will not be easy/ possible at all. It will always require me to run a separate query to fetch the names of the groups--whether a user searches for posts in a specific group (in which case, I first have to run a query to fetch the id's, then to search for the associated posts), or whether it is to display them (first the posts, then another query to match the groups).
So, in conclusion: I suppose I know there is a solution to this problem, but my gut tells me that it is not the right/ best way to do it. So, I suppose the question that ties this post together is:
What is the correct method to connect the group database to the others?
For a many-to-many relationship, you need to create a joining table. Rather than storing a list of groups in a single column, you should split that column out into multiple rows in a separate table. This will allow you to perform set based functions on them and will significantly speed up the database, as well as making it more robust and error proof.
Main
    MainID | ...
Group
    GroupID | GroupName
GroupsInMain
    GroupsInMainID | MainID (FK) | GroupID (FK)
So, for MainID 1, you would have GroupsInMain records:
1,1,1
2,1,5
This associates groups 1 and 5 with MainID 1
FK in this case means a Foreign Key (i.e. a reference to a primary key in another table). You'd probably also want to add a unique constraint to GroupsInMain on MainID and GroupID, since you'd never want the same values for the pairing to show up more than once.
Your query would then be:
-- Group is a reserved word in MySQL, so the table name is quoted with backticks
select GroupsInMain.MainID, `Group`.GroupName
from `Group`
join GroupsInMain on `Group`.GroupID = GroupsInMain.GroupID
where `Group`.GroupID = 5
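To make the joining-table design concrete, here is a small runnable sketch using the example data from the question (posts 1, 2 and 3 with their groups). It uses SQLite through Python's sqlite3 module rather than MySQL, so the reserved table name is quoted with double quotes instead of backticks. The helper functions posts_in_group and group_names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Main (MainID INTEGER PRIMARY KEY);
CREATE TABLE "Group" (GroupID INTEGER PRIMARY KEY, GroupName TEXT NOT NULL);
CREATE TABLE GroupsInMain (
    GroupsInMainID INTEGER PRIMARY KEY,
    MainID  INTEGER NOT NULL REFERENCES Main(MainID),
    GroupID INTEGER NOT NULL REFERENCES "Group"(GroupID),
    UNIQUE (MainID, GroupID)  -- each pairing may appear only once
);
""")
conn.executemany("INSERT INTO Main (MainID) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany('INSERT INTO "Group" VALUES (?, ?)',
                 [(g, f"group_{g}") for g in (1, 5, 15, 75, 215)])
# One row per (post, group) pair instead of a comma-separated column.
conn.executemany("INSERT INTO GroupsInMain (MainID, GroupID) VALUES (?, ?)",
                 [(1, 1), (1, 5), (2, 15), (2, 75), (3, 1), (3, 215)])

def posts_in_group(conn, group_id):
    """IDs of all posts that were posted in the given group."""
    rows = conn.execute(
        "SELECT MainID FROM GroupsInMain WHERE GroupID = ? ORDER BY MainID",
        (group_id,))
    return [row[0] for row in rows]

def group_names(conn, main_id):
    """Names of all groups a post was posted in, via a plain join."""
    rows = conn.execute(
        'SELECT g.GroupName FROM GroupsInMain gm '
        'JOIN "Group" g ON g.GroupID = gm.GroupID '
        'WHERE gm.MainID = ? ORDER BY g.GroupID', (main_id,))
    return [row[0] for row in rows]

in_group_5 = posts_in_group(conn, 5)
names_for_3 = group_names(conn, 3)
```

An exact match on GroupID = 5 no longer picks up groups 15, 75 or 215, which was the problem with the comma-separated column.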