MySQL - how to optimize query to count votes - mysql

Just after some opinions on the best way to achieve the following outcome:
I would like to store in my MySQL database products which can be voted on by users (each vote is worth +1). I also want to be able to see how many times in total a user has voted.
To my simple mind, the following table structure would be ideal:
table: product table: user table: user_product_vote
+----+-------------+ +----+-------------+ +----+------------+---------+
| id | product | | id | username | | id | product_id | user_id |
+----+-------------+ +----+-------------+ +----+------------+---------+
| 1 | bananas | | 1 | matthew | | 1 | 1 | 2 |
| 2 | apples | | 2 | mark | | 2 | 2 | 2 |
| .. | .. | | .. | .. | | .. | .. | .. |
This way I can do a COUNT of the user_product_vote table for each product or user.
For example, when I want to look up bananas and the number of votes to show on a web page I could perform the following query:
SELECT p.product AS product, COUNT(v.id) AS votes
FROM product p
LEFT JOIN user_product_vote v ON p.id = v.product_id
WHERE p.id = 1
If my site became hugely successful (we can all dream) and I had thousands of users voting on thousands of products, I fear that performing such a COUNT with every page view would be highly inefficient in terms of server resources.
A simpler approach would be to have a 'votes' column in the product table that is incremented each time a vote is added.
table: product
+----+-------------+-------+
| id | product | votes |
+----+-------------+-------+
| 1 | bananas | 2 |
| 2 | apples | 5 |
| .. | .. | .. |
While this is more resource friendly, I lose data (e.g. I can no longer prevent a person from voting twice, as there is no record of their voting activity).
My questions are:
i) Am I being overly worried about server resources, and should I just stick with the three-table option? (i.e. do I need to have more faith in the ability of the database to handle large queries?)
ii) Is there a more efficient way of achieving the outcome without losing information?

You can never be too worried about resources. When you first start building an application you should always have resources, space, speed etc. in mind; if your site's traffic grows dramatically and you never built with resources in mind, you start running into problems.
As for the vote system, personally I would keep the votes like so:
table: product table: user table: user_product_vote
+----+-------------+ +----+-------------+ +----+------------+---------+
| id | product | | id | username | | id | product_id | user_id |
+----+-------------+ +----+-------------+ +----+------------+---------+
| 1 | bananas | | 1 | matthew | | 1 | 1 | 2 |
| 2 | apples | | 2 | mark | | 2 | 2 | 2 |
| .. | .. | | .. | .. | | .. | .. | .. |
Reasons:
Firstly, user_product_vote does not contain text, blobs etc.; it is purely integers, so it takes up fewer resources anyway.
Secondly, it opens the door to new features in your application, such as total votes in the last 24 hours, the highest-rated product over the past 24 hours, etc.
Take this example for instance:
table: user_product_vote
+----+------------+---------+-----------+------+
| id | product_id | user_id | vote_type | time |
+----+------------+---------+-----------+------+
| 1 | 1 | 2 | product |224.. |
| 2 | 2 | 2 | page |218.. |
| .. | .. | .. | .. | .. |
And a simple query:
SELECT COUNT(id) AS total FROM user_product_vote WHERE vote_type = 'product' AND time BETWEEN(....)
Another thing: if a user voted at 1AM and then tried to vote again at 2PM, you can easily check when they last voted and whether they should be allowed to vote again.
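The "when did they last vote" check can be sketched as follows. This is only an illustration, not the answerer's code: it uses Python's built-in sqlite3 module as a stand-in for MySQL, stores time as a Unix timestamp, and the 24-hour cooldown and the function names try_vote and votes_since are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE user_product_vote ("
    " id INTEGER PRIMARY KEY,"
    " product_id INTEGER NOT NULL,"
    " user_id INTEGER NOT NULL,"
    " time INTEGER NOT NULL)"  # Unix timestamp in seconds
)

COOLDOWN_SECONDS = 24 * 60 * 60  # hypothetical rule: one vote per product per day


def try_vote(user_id, product_id, now):
    """Record a vote unless this user voted on this product within the cooldown."""
    last = conn.execute(
        "SELECT MAX(time) FROM user_product_vote"
        " WHERE user_id = ? AND product_id = ?",
        (user_id, product_id),
    ).fetchone()[0]
    if last is not None and now - last < COOLDOWN_SECONDS:
        return False  # voted too recently
    conn.execute(
        "INSERT INTO user_product_vote (product_id, user_id, time) VALUES (?, ?, ?)",
        (product_id, user_id, now),
    )
    return True


def votes_since(product_id, since):
    """Count votes for a product cast at or after the given timestamp."""
    return conn.execute(
        "SELECT COUNT(*) FROM user_product_vote WHERE product_id = ? AND time >= ?",
        (product_id, since),
    ).fetchone()[0]
```

Neither query is possible with only a 'votes' counter column, since the counter keeps no record of who voted or when.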
There are so many opportunities that you will be missing if you stick with your incremental example.
As for your COUNT(), no matter how much you optimize your queries, it will not make much difference at large scale.
With an extremely large user base, resource usage is addressed at a different level: load balancers, server settings, Apache, caching and so on. There's only so much you can do with your queries.

If my site became hugely successful (we can all dream) and I had thousands of users voting on thousands of products, I fear that performing such a COUNT with every page view would be highly inefficient in terms of server resources.
Don't waste your time solving imaginary problems. MySQL is perfectly able to process thousands of records in fractions of a second; this is what databases are for. A clean and simple database and code structure is far more important than the mythical "optimization" that no one needs.

Why not mix and match both? Simply keep the final counts in the product and user tables, so that you don't have to count every time, and keep the votes table, so that there is no double posting.
Edit:
To explain a bit further: the product and user tables will each have a column called "votes". Every time the insert into user_product_vote succeeds, increment the relevant user and product records. This avoids duplicate votes, and you won't have to run the count query every time either.
Edit:
Also, I am assuming that you have created a unique index on (product_id, user_id). In that case any duplicate attempt will fail automatically and you won't have to check the table before inserting. You just have to make sure the insert query ran and that you got a valid value for the "id" in the form of insert_id.
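A minimal sketch of this hybrid approach, assuming Python's built-in sqlite3 module as a stand-in for MySQL (the function name vote is made up for the example, and only the product counter is shown; a user counter would be incremented the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
CREATE TABLE product (
    id INTEGER PRIMARY KEY,
    product TEXT,
    votes INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE user_product_vote (
    id INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL,
    UNIQUE (product_id, user_id)  -- the unique index that rejects double votes
);
INSERT INTO product (id, product) VALUES (1, 'bananas'), (2, 'apples');
""")


def vote(user_id, product_id):
    """Insert the vote and bump the counter in one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO user_product_vote (product_id, user_id) VALUES (?, ?)",
                (product_id, user_id),
            )
            conn.execute(
                "UPDATE product SET votes = votes + 1 WHERE id = ?",
                (product_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate (product_id, user_id): the counter is untouched
```

Doing both statements in one transaction means the counter cannot drift from the vote table if the second statement fails.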

You have to balance the desire for your site to perform quickly (in which the second schema would be best) and the ability to count votes for specific users and prevent double voting (for which I would choose the first schema). Because you are only using integer columns for the user_product_vote table, I don't see how performance could suffer too much. Many-to-many relationships are common, as you have implemented with user_product_vote. If you do want to count votes for specific users and prevent double voting, a user_product_vote is the only clean way I can think of implementing it, as any other could result in sparse records, duplicate records, and all kinds of bad things.

You don't want to update the product table directly with an aggregate every time someone votes - this will lock product rows which will then affect other queries which are using products.
Assuming that not all product queries need to include the votes column, you could keep a separate product_votes table which would retain the running totals, and keep your user_product_vote table as a means to enforce your per-product user voting business rules and for auditing.
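One way this separate totals table might look, again using Python's sqlite3 as a stand-in for MySQL. The column layout of product_votes and the helper names record_vote and totals_match_audit are assumptions for illustration; the second helper shows how the audit table can be used to verify the running totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
CREATE TABLE user_product_vote (
    product_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL,
    PRIMARY KEY (product_id, user_id)  -- one vote per user per product
);
CREATE TABLE product_votes (           -- running totals, kept apart from product
    product_id INTEGER PRIMARY KEY,
    votes INTEGER NOT NULL DEFAULT 0
);
""")


def record_vote(user_id, product_id):
    """Store the vote and update the running total; raises on a duplicate vote."""
    with conn:  # one transaction: all three statements or none
        conn.execute(
            "INSERT INTO user_product_vote (product_id, user_id) VALUES (?, ?)",
            (product_id, user_id),
        )
        # Make sure a totals row exists, then increment it.
        conn.execute(
            "INSERT OR IGNORE INTO product_votes (product_id, votes) VALUES (?, 0)",
            (product_id,),
        )
        conn.execute(
            "UPDATE product_votes SET votes = votes + 1 WHERE product_id = ?",
            (product_id,),
        )


def totals_match_audit():
    """Compare the running totals with a fresh COUNT over the audit table."""
    running = dict(conn.execute("SELECT product_id, votes FROM product_votes"))
    counted = dict(
        conn.execute(
            "SELECT product_id, COUNT(*) FROM user_product_vote GROUP BY product_id"
        )
    )
    return running == counted
```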

Related

MS Access help needed forming a specific report

I have a table with a column for agent names and a column for each of the skills those agents could possibly have. Each skill the agent is assigned shows a 1 in the field under that skill.
Columns look like this:
+---------+----------+----------+----------+
| Name | 'Skill1' | 'Skill2' | 'Skill3' |
+---------+----------+----------+----------+
| John | 1 | | 1 |
| Sam | 1 | 1 | |
| Roberta | 1 | | 1 |
+---------+----------+----------+----------+
I would like to make a query that returns a list of all agent names that have a 1 for each particular skill. The query would return something like this:
+-----------+
| Skill 1 |
+-----------+
| John |
| Sam |
| Roberta |
+-----------+
Additionally I would like to be able to query a single name and retrieve all skills that agent has (all rows the Name column has a 1 in) like this:
+-----------+
| John |
+-----------+
| Skill 1 |
| Skill 3 |
+-----------+
I've done this in Excel using an index but I'm new to Access and not sure how to complete this task.
Thanks in advance.
One of the reasons you are finding this task difficult is that your database is not normalised, so you are working against MS Access, not with it.
Consequently, whilst a solution is still possible with the current data, the resulting queries will be painful to construct and will either be full of multiple messy IIf statements, or several union queries performing the same operations over and over again, one for each 'skill'.
Then, if you ever wish to add another Skill to the database, all of your queries have to be rewritten!
Whereas, if your database was normalised (as Gustav has suggested in the comments), the task would be a simple one-liner; and what's more, if you add a new skill later on, your queries will automatically output the results as if the skill had always been there.
Your data has a many-to-many relationship: an agent may have many skills, and a skill may be known by many agents.
As such, the most appropriate way to represent this relationship is using a junction table.
Hence, you would have a table of Agents such as:
tblAgents
+-----+-----------+----------+------------+
| ID | FirstName | LastName | DOB |
+-----+-----------+----------+------------+
| 1 | John | Smith | 1970-01-01 |
| ... | ... | ... | ... |
+-----+-----------+----------+------------+
This would only contain information unique to each agent, i.e. minimising the repeated information between records in the table.
You would then have a table of possible Skills, such as:
tblSkills
+-----+---------+---------------------+
| ID | Name | Description |
+-----+---------+---------------------+
| 1 | Skill 1 | Skill 1 Description |
| 2 | Skill 2 | Skill 2 Description |
| ... | ... | ... |
+-----+---------+---------------------+
Finally, you would have a junction table linking Agents to Skills, e.g.:
tblAgentSkills
+----+----------+----------+
| ID | Agent_ID | Skill_ID |
+----+----------+----------+
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 2 | 1 |
| 4 | 3 | 2 |
+----+----------+----------+
Now, say you want to find out which agents have Skill 1, the query is simple:
select Agent_ID from tblAgentSkills where Skill_ID = 1
What if you want to find out the skills known by an agent? Equally as simple:
select Skill_ID from tblAgentSkills where Agent_ID = 1
Of course, these queries will merely return the ID fields as present in the junction table - but since the ID uniquely identifies a record in the tblAgents or tblSkills tables, such ID is all you need to retrieve any other required information:
select
tblAgents.FirstName,
tblAgents.LastName
from
tblAgentSkills inner join tblAgents on
tblAgentSkills.Agent_ID = tblAgents.ID
where
tblAgentSkills.Skill_ID = 1
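To see the junction-table queries return names rather than IDs, here is a small runnable sketch. It uses Python's sqlite3 instead of Access, so the SQL dialect differs slightly; the surnames for Sam and Roberta are invented sample data, and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Access here
conn.executescript("""
CREATE TABLE tblAgents (ID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE tblSkills (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE tblAgentSkills (
    ID INTEGER PRIMARY KEY,
    Agent_ID INTEGER REFERENCES tblAgents(ID),
    Skill_ID INTEGER REFERENCES tblSkills(ID)
);
-- Surnames other than Smith are made-up sample data.
INSERT INTO tblAgents VALUES (1, 'John', 'Smith'), (2, 'Sam', 'Jones'), (3, 'Roberta', 'Brown');
INSERT INTO tblSkills VALUES (1, 'Skill 1'), (2, 'Skill 2');
INSERT INTO tblAgentSkills VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1), (4, 3, 2);
""")


def agents_with_skill(skill_name):
    """First names of all agents linked to the named skill."""
    rows = conn.execute(
        "SELECT a.FirstName FROM tblAgentSkills AS s"
        " INNER JOIN tblAgents AS a ON s.Agent_ID = a.ID"
        " INNER JOIN tblSkills AS k ON s.Skill_ID = k.ID"
        " WHERE k.Name = ? ORDER BY a.ID",
        (skill_name,),
    )
    return [row[0] for row in rows]


def skills_of_agent(first_name):
    """Names of all skills linked to the named agent."""
    rows = conn.execute(
        "SELECT k.Name FROM tblAgentSkills AS s"
        " INNER JOIN tblAgents AS a ON s.Agent_ID = a.ID"
        " INNER JOIN tblSkills AS k ON s.Skill_ID = k.ID"
        " WHERE a.FirstName = ? ORDER BY k.ID",
        (first_name,),
    )
    return [row[0] for row in rows]
```

Adding a third skill only means inserting rows into tblSkills and tblAgentSkills; neither function needs to change.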
To get all agents with Skill1, open the query designer and build the query there. This will generate the following SQL:
SELECT Skills.AgentName
FROM Skills
WHERE (((Skills.Skill1)=1));
If you adjust the names you can also paste this query into the sql pane of the designer to get the query you want.
To get all the skills an agent has I chose a parameterized query. Open the query designer and create a new query.
When you run this query it will ask you for the name of the agent. Make sure to type the agent name exactly. Here is the resulting SQL:
SELECT Skills.AgentName, Skills.Skill1, Skills.Skill2, Skills.Skill3
FROM Skills
WHERE (((Skills.AgentName)=[Agent]));
If you continue working with this query I would improve the table design by breaking your table into a skills table, an agents table, and a skills&agents table. Then link the skills and agents tables to the skills&agents table in a many-to-many relationship. The query to get all of an agent's skills would then look like this in the designer:

How should I design my database for different user has different devices?

I am confused about database design.
This is my requirement: there are different users, and a user can have one or more devices. Some users will have thousands of devices, even ten thousand. A device has many operation records; each day adds 10-20 operation records per device.
I plan to create a user table that records userId and passWord. Each user would get their own device list table that records deviceNum and devicState. Each device would get its own operation table that holds its operation records.
So my database will have a great many tables. I expect more than one hundred thousand devices. Should I create one hundred thousand device tables?
From what you've said, you need only 3 tables:
Users - user_id, username, pass, data, etc...
Devices - device_id, user_id (owner), device_data....
OpRecords - record_id, device_id, record_data_stuff...
All users go to the first table with additional fields you need for user data like address, phone, last login time, etc.
All devices are registered in the Devices table, with each device associated with the user who owns it through user_id. You can add as many columns as needed, like device type, device name, parameters, work hours and so on...
All records go to a single table and record is associated with the device that did the job (device_id). You can make your additional columns here to write record-specific data like start_time, end_time, and so on...
You don't need to have a separate table for the record list of each device. When you need to show the records of a single device just make SELECT with device_id = X and you will receive the records for your device only.
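A rough sketch of the three-table layout, assuming Python's sqlite3 as a stand-in for the real database. The start_time column, the index name idx_oprecords_device_time, and the sample volumes (100 devices with 10 records each) are assumptions chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the real database
conn.executescript("""
CREATE TABLE Users (user_id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE Devices (
    device_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES Users(user_id)
);
CREATE TABLE OpRecords (
    record_id INTEGER PRIMARY KEY,
    device_id INTEGER NOT NULL REFERENCES Devices(device_id),
    start_time INTEGER NOT NULL  -- hypothetical record-specific column
);
-- One index serves every device; no per-device table is needed.
CREATE INDEX idx_oprecords_device_time ON OpRecords (device_id, start_time);
""")

# Sample data: one user owning 100 devices, 10 records per device.
conn.execute("INSERT INTO Users VALUES (1, 'jon')")
conn.executemany("INSERT INTO Devices VALUES (?, 1)", [(d,) for d in range(1, 101)])
conn.executemany(
    "INSERT INTO OpRecords (device_id, start_time) VALUES (?, ?)",
    [(d, t) for d in range(1, 101) for t in range(10)],
)
conn.commit()


def records_for_device(device_id):
    """All operation records of one device, oldest first."""
    return conn.execute(
        "SELECT record_id, start_time FROM OpRecords"
        " WHERE device_id = ? ORDER BY start_time",
        (device_id,),
    ).fetchall()
```

With the index on (device_id, start_time), the database can jump straight to one device's rows instead of scanning the whole table, which is what a table per device would otherwise be trying to achieve.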
UPDATE:
ten thousand devices, each making ten operation records a day, for ten days
That makes about 1 million records in 10 days. It is a challenge, but I don't think splitting this data into different tables will give you better performance.
You should keep the number of columns in the records table to a minimum. Store only the minimum required information there. Try not to use strings and, if possible, use only integers. If you have date/time fields, use TIMESTAMP, as it is the closest to an integer. Create indexes on all columns you will use to search data: device_id, time (if you have such a column) and so on. This will increase disk usage, but I think it will improve performance.
Consider whether you can back up your data to an external file after a given period; for example, after a month you would free 3 million records.
I would suggest using a layout like this:
User
+----+----------+----------+-------+
| id | username | password | email |
+----+----------+----------+-------+
| 1 | jon | xxxx | xxxx |
| 2 | melissa | xxxx | xxxx |
+----+----------+----------+-------+
Device
+----+---------+--------+
| id | user_id | name |
+----+---------+--------+
| 1 | 1 | iPad |
| 2 | 1 | iPhone |
| 3 | 1 | Laptop |
| 4 | 2 | Laptop |
+----+---------+--------+
Operation
+----+-----------+-----+-------+
| id | device_id | key | value |
+----+-----------+-----+-------+
| 1 | 1 | x | y |
| 2 | 1 | y | z |
| 3 | 2 | x | x |
| 4 | 2 | z | y |
| 5 | 3 | x | y |
+----+-----------+-----+-------+
I hope that I got your question right.

The correct way of storing millions of Many-To-Many relations (pivot)

What I have
I have the next schema:
users table:
| id | name | ... | created_at | updated_at |
groups table:
| id | name | ... | created_at | updated_at |
messages table:
| id | text | ... | created_at | updated_at |
user_messages table (Pivot):
| user_id | message_id | sent_at |
user_groups table (Pivot):
| user_id | group_id | joined_at |
For now project is using only MySQL database.
Problem
Storing many-to-many relations this way is the traditional approach and it is okay. But in this case I am a little confused.
Groups can include unlimited users and a group can send 10-1000 messages (or more) per day. For example, let's take some basic numbers (not the millions that are also possible in real life): 1 group, 10000 users, 100 messages per day. The relation row count per day: 10000 x 100 = 1000000. That is one million rows per day for one group, and there can be thousands of groups.
One question that comes up at once: why do you need a pivot for messages? The answer: a "targeted" sending option is required (for sending a message only to x users, not to all).
Question
My question is: "What is the correct way to store this kind of data?"
Maybe I need to use another database system, or maybe these numbers are not a problem for MySQL.

Manage popularity trends of database records

I need to create a system to order some articles by their popularity, like a trend.
I have this table:
| Id | Title | View |
| 1 | aaa | 232 |
| 2 | bbb | 132 |
| 3 | ccc | 629 |
This way I can easily order by number of views, but what if I want to show the articles that were popular in the most recent period (not yet defined), rather than articles that have a lot of views but are no longer visited? Is there a technique for this? Do I have to track all visits?
You could have a daily_views/hourly_views table, according to your needs, with these columns:
ID, startTime, endTime, number_of_views
and INSERT/UPDATE that table every time you have a new view. That way you don't have to insert a record for each view and you can have queries for different time periods.
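A possible shape for the hourly variant, sketched with Python's sqlite3. Several details are assumptions rather than part of the answer: the article_id column, Unix timestamps, dropping endTime because it is always start_time plus one hour, and the function names record_view and trending.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the real database
conn.execute(
    "CREATE TABLE hourly_views ("
    " article_id INTEGER NOT NULL,"
    " start_time INTEGER NOT NULL,"  # start of the hour, Unix seconds
    " number_of_views INTEGER NOT NULL DEFAULT 0,"
    " PRIMARY KEY (article_id, start_time))"
)

HOUR = 3600


def record_view(article_id, now):
    """Add one view to the article's bucket for the current hour."""
    bucket = now - now % HOUR  # round down to the start of the hour
    with conn:
        # Create the bucket row if it does not exist yet, then increment it.
        conn.execute(
            "INSERT OR IGNORE INTO hourly_views (article_id, start_time) VALUES (?, ?)",
            (article_id, bucket),
        )
        conn.execute(
            "UPDATE hourly_views SET number_of_views = number_of_views + 1"
            " WHERE article_id = ? AND start_time = ?",
            (article_id, bucket),
        )


def trending(since, limit=10):
    """Articles ordered by views in buckets starting at or after `since`."""
    return conn.execute(
        "SELECT article_id, SUM(number_of_views) AS views FROM hourly_views"
        " WHERE start_time >= ? GROUP BY article_id"
        " ORDER BY views DESC, article_id LIMIT ?",
        (since, limit),
    ).fetchall()
```

The table grows by at most one row per article per hour, however many views arrive, and the `since` argument selects the period used for the ranking.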

MySQL LEFT JOIN vs. 2 separate queries (Performance)

I have two tables:
++++++++++++++++++++++++++++++++++++
| Games |
++++++++++++++++++++++++++++++++++++
| ID | Name | Description |
++++++++++++++++++++++++++++++++++++
| 1 | Game 1 | A game description |
| 2 | Game 2 | And another |
| 3 | Game 3 | And another |
| .. | ... | ... |
++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++
| GameReviews |
+++++++++++++++++++++++++++++++++++++++
| ID |GameID| Review |
+++++++++++++++++++++++++++++++++++++++
| 1 | 1 |Review for game 1 |
| 2 | 1 |Another review for game 1|
| 3 | 1 |And another |
| .. | ... | ... |
+++++++++++++++++++++++++++++++++++++++
Option 1:
SELECT
Games.ID,
Games.Name,
Games.Description,
GameReviews.ID,
GameReviews.Review
FROM
GameReviews
LEFT JOIN
Games
ON
Games.ID = GameReviews.GameID
WHERE
Games.ID=?
Option 2:
SELECT
ID,
Name,
Description
FROM
Games
WHERE
ID=?
and then
SELECT
ID,
Review
FROM
GameReviews
WHERE
GameID=?
Obviously query 1 would be "simpler" where it is less code to write, and the other would seem to logically be "easier" on the database as it only queries the Games table once. The question is when it really gets down to it is there really a difference in performance and efficiency?
The vast majority of the time option 1 would be the way to go. The performance difference between the two would not be measurable until you have a lot of data. Keep it simple.
Your example is also fairly basic. At scale, performance issues can start revealing themselves based on what fields are being filtered, joined and pulled. The ideal scenario is to only pull data that exists in indexes (particularly with InnoDB). That usually is not possible, but a strategy is to pull the actual data you need at the last possible moment. Which is sort of what option 2 would be doing.
At extreme scale, you don't want to do any joins in the database at all. Your "joins" would happen in code, minimizing data sent over the network. Go with option 1 until you start having performance issues, which may never happen.
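The two options can be compared side by side in a small sketch. It runs against Python's sqlite3 rather than MySQL, so it says nothing about real performance; it only shows what each option returns. The function names option_1 and option_2 are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
CREATE TABLE Games (ID INTEGER PRIMARY KEY, Name TEXT, Description TEXT);
CREATE TABLE GameReviews (
    ID INTEGER PRIMARY KEY,
    GameID INTEGER REFERENCES Games(ID),
    Review TEXT
);
INSERT INTO Games VALUES (1, 'Game 1', 'A game description'), (2, 'Game 2', 'And another');
INSERT INTO GameReviews VALUES
    (1, 1, 'Review for game 1'),
    (2, 1, 'Another review for game 1'),
    (3, 1, 'And another');
""")


def option_1(game_id):
    """One round trip: the game columns are repeated on every review row."""
    return conn.execute(
        "SELECT Games.ID, Games.Name, Games.Description,"
        " GameReviews.ID, GameReviews.Review"
        " FROM GameReviews LEFT JOIN Games ON Games.ID = GameReviews.GameID"
        " WHERE Games.ID = ? ORDER BY GameReviews.ID",
        (game_id,),
    ).fetchall()


def option_2(game_id):
    """Two round trips: the game row is fetched once and combined in code."""
    game = conn.execute(
        "SELECT ID, Name, Description FROM Games WHERE ID = ?", (game_id,)
    ).fetchone()
    reviews = conn.execute(
        "SELECT ID, Review FROM GameReviews WHERE GameID = ? ORDER BY ID",
        (game_id,),
    ).fetchall()
    return game, reviews
```

One behavioural difference is worth noting: because the join in option 1 starts from GameReviews, a game with no reviews produces no rows at all, while option 2 still returns the game with an empty review list.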
Go with option 1; that is exactly what RDBMSes are optimized for.
And it is always better to hit a database once from the client than to hit it multiple times.
I don't believe that you will ever have so many games and reviews that it will make sense to go with option 2.