I'm working on a political application for a client, and I've got two database tables. One has all the registered voters in a certain city, and the other has all the votes for each registered voter. Combined, these two tables number well over 7 million records. This site is coded on CakePHP.
I'm already narrowing the results by a number of criteria, but I also need to filter based on the percentage of elections a given voter has voted in since they registered. I have all the votes, the year they registered, and the fact that there are 3 elections every 4 years. I've tried doing a subquery to filter the results, but it took far too long: 10 minutes to return 10 records. I have to do this with a join somehow, but I'm not at all versed in joins.
This is basically what I need to do:
SELECT * FROM voters
WHERE (number of votes voter has) >= (((year(now())-(registration_year) ) * 3/4)
* (percentage needed))
All of that is pretty straightforward. The trick is counting the votes the voter has from the votes table. Any ideas?
Either create another table, or extend your first table (the one containing voter information, but not their votes) with two columns -- #votes and registrationAge. Then you can update this table by scanning the 'votes' table once -- every time you encounter a vote, just increase the count.
I wouldn't try to calculate this as part of your query.
In a case where this info will only change 3 times in four years, I'd add the voted % field to the voter table and calculate it once after each election. Then you can simply filter by the field.
You can add a vote_count field to the voters table and do an UPDATE with a count on it. You might want to do it in a straight SQL query: Aggregate function in an SQL update query?
Also, I'm not sure if MySQL is smart enough to optimize this, but don't use year(now()): you can either get that value in PHP, or just hard-code it each time you run (you probably don't need to run it too often).
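A minimal sketch of the "scan the votes table once and cache the count" idea, using Python's sqlite3 as a stand-in for MySQL. The table and column names (voters, votes, vote_count, voterid) are assumptions based on the thread:

```python
import sqlite3

# Toy schema: names are assumptions based on the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE voters (id INTEGER PRIMARY KEY, vote_count INTEGER DEFAULT 0);
    CREATE TABLE votes (voterid INTEGER, election_year INTEGER);
    INSERT INTO voters (id) VALUES (1), (2);
    INSERT INTO votes VALUES (1, 2008), (1, 2010), (2, 2010);
""")

# One aggregate UPDATE sets each voter's cached count via a correlated subquery.
conn.execute("""
    UPDATE voters
    SET vote_count = (SELECT COUNT(*) FROM votes WHERE votes.voterid = voters.id)
""")

print(conn.execute("SELECT id, vote_count FROM voters ORDER BY id").fetchall())
# [(1, 2), (2, 1)]
```

After this runs once (or after each election), the percentage filter can use the cheap `vote_count` column directly instead of counting 7 million rows per query.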
How about this:
SELECT voters.* FROM voters
LEFT JOIN (SELECT COUNT(voterid) AS votes, voterid AS id FROM votes GROUP BY voterid) AS a
ON voters.id = a.id
WHERE a.votes >= (((year(now()) - voters.registration_year) * 3/4) * percentage)
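Here is a runnable sketch of that derived-table join, using sqlite3 for illustration (note the subquery needs a GROUP BY voterid to count per voter). The schema, the hard-coded year, and the 0.5 threshold are assumptions for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE voters (id INTEGER PRIMARY KEY, registration_year INTEGER);
    CREATE TABLE votes (voterid INTEGER);
    INSERT INTO voters VALUES (1, 2000), (2, 2000);
    INSERT INTO votes VALUES (1), (1), (1), (1), (1), (1), (2);
""")

percentage = 0.5  # keep voters who voted in at least half their elections
rows = conn.execute("""
    SELECT voters.id FROM voters
    LEFT JOIN (SELECT voterid AS id, COUNT(*) AS votes
               FROM votes GROUP BY voterid) AS a
      ON voters.id = a.id
    WHERE a.votes >= ((2012 - voters.registration_year) * 3.0 / 4) * ?
""", (percentage,)).fetchall()
print(rows)  # [(1,)]
```

Voter 1 registered 12 years ago (9 elections, threshold 4.5) and has 6 votes, so they pass; voter 2 has only 1 vote and is filtered out. The year is hard-coded per the earlier advice about avoiding year(now()).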
I would recommend creating a view, then modeling your view to fetch the data.
Related
I have a table of transactions that records the person who made the purchase. I want the number of people who have had more than one transaction. The part I got stuck at is: how do I specify that Member must match at least twice (i.e. two or more transactions)?
I figured it'd be something along the lines of
SELECT COUNT(*) FROM `table` WHERE COUNT(`Member`)>2
but I realize that isn't a proper usage of the second count.
To further clarify: I want the result to be a single row that contains the number of users that this condition matches. So I don't want it to return how many times it matches per user or anything like that.
You need to use GROUP BY and HAVING.
SELECT COUNT(*) totalMember
FROM
(
SELECT Member
FROM `table`
GROUP BY Member
HAVING COUNT(Member) >= 2
) a
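A runnable sketch of that GROUP BY / HAVING pattern, using sqlite3 for illustration; the table and Member column follow the question, and the HAVING threshold counts members with two or more rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE `table` (Member TEXT);
    INSERT INTO `table` VALUES
        ('alice'), ('alice'), ('bob'), ('carol'), ('carol'), ('carol');
""")

# Inner query: members appearing at least twice; outer query: how many such members.
(total,) = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT Member FROM `table`
        GROUP BY Member
        HAVING COUNT(Member) >= 2
    ) a
""").fetchone()
print(total)  # 2 (alice and carol; bob has only one transaction)
```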
So I have 2 tables, one called user, and one called user_favorite. user_favorite stores an itemId and userId, for storing the items that the user has favorited. I'm simply trying to locate the users who don't have a record in user_favorite, so I can find those users who haven't favorited anything yet.
For testing purposes, I have 6001 records in user and 6001 in user_favorite, so there's just one record who doesn't have any favorites.
Here's my query:
SELECT u.* FROM user u
JOIN user_favorite fav ON u.id != fav.userId
ORDER BY id DESC
Here the id in the last statement is not ambiguous; it refers to the id from the user table. I have a PK index on u.id and an index on fav.userId.
When I run this query, my computer just becomes unresponsive and completely freezes, with no output ever being given. I have 2gb RAM, not a great computer, but I think it should be able to handle a query like this with 6k records easily.
Both tables are in MyISAM, could that be the issue? Would switching to INNODB fix it?
Let's first discuss what your query (as written) is doing. Because of the != in the ON clause, you are joining every user record with every one of the other users' favorites. So your query is going to produce something like 36 million rows. This is not going to give you the answer you want, and it explains why your computer is unhappy.
How should you write the query? There are three main patterns you can use. I think this is a pretty good explanation: http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/ and discusses performance specifically in the context of mysql. And it shows you how to look at and read an execution plan, which is critical to optimizing queries.
Change your query to something like this:
select * from User
where not exists (select * from user_favorite where User.id = user_favorite.userId)
Let me know how it goes.
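A runnable sketch of the NOT EXISTS anti-join (and the equivalent LEFT JOIN ... IS NULL form from the linked article), using sqlite3 for illustration; the schema mirrors the question's user and user_favorite tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY);
    CREATE TABLE user_favorite (userId INTEGER, itemId INTEGER);
    INSERT INTO user (id) VALUES (1), (2), (3);
    INSERT INTO user_favorite VALUES (1, 10), (2, 20), (2, 21);
""")

# NOT EXISTS: keep users with no matching favorite row.
no_favorites = conn.execute("""
    SELECT id FROM user
    WHERE NOT EXISTS (SELECT 1 FROM user_favorite
                      WHERE user.id = user_favorite.userId)
""").fetchall()
print(no_favorites)  # [(3,)]

# Equivalent LEFT JOIN ... IS NULL anti-join form.
also = conn.execute("""
    SELECT u.id FROM user u
    LEFT JOIN user_favorite f ON u.id = f.userId
    WHERE f.userId IS NULL
""").fetchall()
print(also)  # [(3,)]
```

Both forms touch each user once rather than producing the 36-million-row cross product the != join creates.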
A join on A != B means that every record of A is joined with every record of B in which the IDs aren't equal.
In other words, instead of producing 6,000 rows, you're producing approximately 36 million (6000 * 6001) rows of output, which all have to be collected, then sorted...
SELECT DISTINCT `Stock`.`ProductNumber`,`Stock`.`Description`,`TComponent_Status`.`component`, `TComponent_Status`.`certificate`,`TComponent_Status`.`status`,`TComponent_Status`.`date_created`
FROM Stock , TBOM , TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically table TBOM HAS :
24,588,820 rows
The query is ridiculously slow; I'm not too sure what I can do to make it better. I have indexed all the other tables in the query, but TBOM has a few duplicates in the columns so I can't even run that command. I'm a little baffled.
To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but it is a good start.
Also, remove the DISTINCT if you don't absolutely need it.
The only thing I can really think of is having an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields in the query, the engine won't be required to go to the full data row of each stock record: both parts are in the index, so it can use that. Additionally, you are doing DISTINCT, so having the index available to help optimize the DISTINCT should also help.
Now, the other issue: time. Since you are joining with DISTINCT from stock to product to product status, you are asking for all 24 million TBOM items (assuming a bill of materials), and since each BOM component could have multiple status rows created, you are getting every BOM row for EVERY component change.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
Stock.ProductNumber,
Stock.Description,
JustThese.component,
JustThese.certificate,
JustThese.`status`,
JustThese.date_created
FROM
( select DISTINCT
TCS.Component,
TCS.Certificate,
TCS.`status`,
TCS.date_created
from
TComponent_Status TCS
where
TCS.date_created >= 'some date you want to limit based upon' ) as JustThese
JOIN TBOM
on JustThese.Component = TBOM.Component
JOIN Stock
on TBOM.Product = Stock.ProductNumber
If this is the case, I would ensure an index on the component status table, something like
( date_created, component, certificate, status ) as the index. This way, the WHERE clause would be optimized, and DISTINCT would be too, since the pieces are already part of the index.
But, as you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 or 1,000 entries in your result set. Multiply that across 24 million rows, and it's definitely not going to look good.
I have a comments table with almost 2 million rows. We receive roughly 500 new comments per day. Each comment is assigned to a specific ID. I want to grab the most popular "discussions" based on that specific ID.
I have an index on the ID column.
What is best practice? Do I just group by this ID and then sort by the ID who has the most comments? Is this most efficient for a table this size?
Do I just group by this ID and then sort by the ID who has the most comments?
That's pretty much simply how I would do it. Let's just assume you want to retrieve the top 50:
SELECT id
FROM comments
GROUP BY id
ORDER BY COUNT(1) DESC
LIMIT 50
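A runnable sketch of this grouping, using sqlite3 for illustration; here `id` is the discussion ID each comment is attached to, per the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (id INTEGER);  -- id = the discussion the comment belongs to
    INSERT INTO comments VALUES (7), (7), (7), (5), (5), (9);
""")

# Group by discussion, order by comment count, keep the top N (2 for this demo).
top = conn.execute("""
    SELECT id FROM comments
    GROUP BY id
    ORDER BY COUNT(1) DESC
    LIMIT 2
""").fetchall()
print(top)  # [(7,), (5,)]
```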
If your users are executing this query quite frequently in your application and you're finding that it's not running quite as fast as you'd like, one way you could optimize it is to store the result of the above query in a separate table (topdiscussions), and have a script or cron job that runs every five minutes or so to update that table.
Then in your application, just have your users select from the topdiscussions table so that they only need to select from 50 rows rather than 2 million.
The downside of this of course being that the selection will no longer be in real-time, but rather out of sync by up to five minutes or however often you want to update the table. How real-time you actually need it to be depends on the requirements of your system.
Edit: As per your comments to this answer, I know a little more about your schema and requirements. The following query retrieves the discussions that are the most active within the past day:
SELECT a.id, etc...
FROM discussions a
INNER JOIN comments b ON
a.id = b.discussion_id AND
b.date_posted > NOW() - INTERVAL 1 DAY
GROUP BY a.id
ORDER BY COUNT(1) DESC
LIMIT 50
I don't know your field names, but that's the general idea.
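A runnable sketch of that "most active in the past day" query, using sqlite3 for illustration; the discussions/comments column names mirror the answer, and sqlite's `datetime('now', '-1 day')` stands in for MySQL's `NOW() - INTERVAL 1 DAY`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE discussions (id INTEGER PRIMARY KEY);
    CREATE TABLE comments (discussion_id INTEGER, date_posted TEXT);
    INSERT INTO discussions (id) VALUES (1), (2);
    INSERT INTO comments VALUES
        (1, datetime('now', '-2 hours')),
        (1, datetime('now', '-3 hours')),
        (2, datetime('now', '-2 days'));   -- too old to count
""")

# Only comments from the last day contribute to a discussion's rank.
active = conn.execute("""
    SELECT a.id FROM discussions a
    INNER JOIN comments b
        ON a.id = b.discussion_id
       AND b.date_posted > datetime('now', '-1 day')
    GROUP BY a.id
    ORDER BY COUNT(1) DESC
    LIMIT 50
""").fetchall()
print(active)  # [(1,)]
```

Discussion 2's only comment is two days old, so the date condition in the join excludes it entirely.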
If I understand your question, the ID indicates the discussion to which a comment is attached. So, first you would need some notion of most popular.
1) Initialize a "Comment total" table by counting up comments by ID and setting a column called 'delta' to 0.
2) Periodically
2.1) Count the comments by ID
2.2) Subtract the old count from the new count and store the value into the delta column.
2.3) Replace the count of comments with the new count.
3) Select the 10 'hottest' discussions by selecting 10 rows from the comment total table in order of descending delta.
Now the rest is trivial. That's just the comments whose discussion ID matches the ones you found in step 3.
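The periodic steps above can be sketched with sqlite3; the comment_total table and its column names are assumptions for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (id INTEGER);  -- id = discussion id
    CREATE TABLE comment_total (id INTEGER PRIMARY KEY, total INTEGER, delta INTEGER);
    INSERT INTO comments VALUES (1), (1), (2);
    -- Step 1: initialize totals with delta = 0.
    INSERT INTO comment_total
        SELECT id, COUNT(*), 0 FROM comments GROUP BY id;
""")

# New comments arrive between runs (three more on discussion 1).
conn.executemany("INSERT INTO comments VALUES (?)", [(1,), (1,), (1,)])

# Step 2: delta = new count - old count, then store the new count.
# (SET expressions see the pre-update row, so `total` here is the old value.)
conn.execute("""
    UPDATE comment_total SET
        delta = (SELECT COUNT(*) FROM comments WHERE comments.id = comment_total.id) - total,
        total = (SELECT COUNT(*) FROM comments WHERE comments.id = comment_total.id)
""")

# Step 3: hottest discussions by descending delta.
print(conn.execute(
    "SELECT id, delta FROM comment_total ORDER BY delta DESC LIMIT 10").fetchall())
# [(1, 3), (2, 0)]
```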
I have a table with more than 1 million records. The problem is that the query takes too much time, like 5 minutes. The ORDER BY is my problem, but I need the expression in the ORDER BY to get the most popular videos, and because of the expression I can't create an index on it.
How can I resolve this problem?
Thanks.
SELECT DISTINCT
`v`.`id`,`v`.`url`, `v`.`title`, `v`.`hits`, `v`.`created`, ROUND((r.likes*100)/(r.likes+r.dislikes),0) AS `vote`
FROM
`videos` AS `v`
INNER JOIN
`votes` AS `r` ON v.id = r.id_video
ORDER BY
(v.hits+((r.likes-r.dislikes)*(r.likes-r.dislikes))/2*v.hits)/DATEDIFF(NOW(),v.created) DESC
Does the most popular have to be calculated every time? I doubt the answer is yes. Some operations will take a long time to run no matter how efficient your query is.
Also bear in mind you have 1 million now, you might have 10 million in the next few months. So the query might work now but not in a month, the solution needs to be scalable.
I would make a job run every couple of hours to calculate and store this information in a different table. This might not be the answer you are looking for, but I just had to say it.
What I have done in the past is to create a voting system based on Integers.
Nothing will outperform integers.
The voting system table has 2 Columns:
ProductID
VoteCount (INT)
The votecount stores all the votes that are submitted.
Like = +1
Unlike = -1
Create an Index in the vote table based on ID.
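A minimal sketch of that integer counter with sqlite3; the ProductID/VoteCount names follow the answer, and the `product_votes` table name and `vote` helper are assumptions for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_votes (ProductID INTEGER PRIMARY KEY, VoteCount INTEGER DEFAULT 0)")
conn.execute("INSERT INTO product_votes (ProductID) VALUES (42)")

def vote(product_id, delta):
    # Like = +1, Unlike = -1: a single integer increment, no recounting of rows.
    conn.execute(
        "UPDATE product_votes SET VoteCount = VoteCount + ? WHERE ProductID = ?",
        (delta, product_id))

vote(42, +1)
vote(42, +1)
vote(42, -1)
(count,) = conn.execute(
    "SELECT VoteCount FROM product_votes WHERE ProductID = 42").fetchone()
print(count)  # 1
```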
You have two alternatives to improve this:
1) create a new column with the needed value pre-calculated
2) create a second table that holds the video's primary key and the result of the calculation.
This could be a calculated column (in the first case), or you could modify your app or add triggers to keep it in sync (you'd need to manually load it the first time, and later let your program keep it updated).
If you use the second option, your key could be composed of the finalRating plus the primary key of the videos table. This way your searches would be hugely improved.
Have you tried moving the arithmetic of the ORDER BY into your SELECT, and then ordering by the virtual column, such as:
SELECT (col1+col2) AS a
FROM TABLE
ORDER BY a
Arithmetic on sort is expensive.
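A runnable sketch of sorting by a computed alias, using sqlite3 for illustration; `col1`/`col2` are the answer's placeholder names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (col1 INTEGER, col2 INTEGER);
    INSERT INTO t VALUES (3, 4), (1, 1), (2, 5);
""")

# Compute the expression once in the SELECT list and sort by its alias.
rows = conn.execute("""
    SELECT (col1 + col2) AS a
    FROM t
    ORDER BY a
""").fetchall()
print(rows)  # [(2,), (7,), (7,)]
```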