Recount column in MySQL based on rows in secondary table

I took over a database with two tables; let's call them entries and comments.
The entries table contains a column named comment_count, which holds the number of rows in comments whose blogentry_id points at that row in entries.
Lately this count has become terribly out of sync due to version switching of the codebase. I need help building a query to run in phpMyAdmin to sync these numbers again. There are around 8,000 rows in entries and around 80,000 in comments, so running a sync query shouldn't be a problem.
Structure:
entries contains:
id | comment_count | etc
comments contains
id | blogentry_id | etc
The only way I can think of is to loop over each entry in the entries table with PHP and update each row individually, but that seems extremely fragile compared to a pure SQL solution.
I'd appreciate any help!

INSERT INTO entries (id, comment_count)
SELECT blogentry_id, COUNT(*) AS cnt
FROM comments
GROUP BY blogentry_id
ON DUPLICATE KEY UPDATE comment_count = VALUES(comment_count)

Note that the ON DUPLICATE KEY UPDATE clause cannot refer to the SELECT alias cnt directly; VALUES(comment_count) picks up the value the SELECT would have inserted. Also, this only touches entries that have at least one comment, so rows whose comments were all deleted keep their stale count.

I think a pure SQL solution would involve using a subquery to gather the counts from the comments table, with the entries table as the driver. Something like the following should "loop" over the entries table and, for each row, run the subquery (that may be the incorrect terminology) and update the comment count to the corresponding count from the auxiliary table. Hope that helps!
UPDATE entries ent
SET comment_count =
    (SELECT COUNT(*)
     FROM comments cmt
     WHERE cmt.blogentry_id = ent.id)

Related

MySQL: Group by query optimization

I've got a table of the following schema:
+----+--------+----------------------------+----------------------------+
| id | amount | created_timestamp | updated_timestamp |
+----+--------+----------------------------+----------------------------+
| 1 | 1.00 | 2018-01-09 12:42:38.973222 | 2018-01-09 12:42:38.973222 |
+----+--------+----------------------------+----------------------------+
Here, for id = 1, there could be multiple amount entries. I want to extract the last added entry and its corresponding amount, grouped by id.
I've written a working query with an inner join on the self table as below:
SELECT t1.id,
t1.amount,
t1.created_timestamp,
t1.updated_timestamp
FROM transactions AS t1
INNER JOIN (SELECT id,
Max(updated_timestamp) AS last_transaction_time
FROM transactions
GROUP BY id) AS latest_transactions
ON latest_transactions.id = t1.id
AND latest_transactions.last_transaction_time =
t1.updated_timestamp;
I think the inner join is overkill and this can be replaced with a more optimized/efficient query. I've written the following query with WHERE, GROUP BY, and HAVING, but it isn't working. Can anyone help?
select id, any_value(`updated_timestamp`), any_value(amount) from transactions group by `id` having max(`updated_timestamp`);
There are two (good) options when performing a query like this in MySQL. You have already tried one option. Here is the other:
SELECT t1.id,
t1.amount,
t1.created_timestamp,
t1.updated_timestamp
FROM transactions AS t1
LEFT OUTER JOIN transactions later_transactions
ON later_transactions.id = t1.id
AND later_transactions.updated_timestamp > t1.updated_timestamp
WHERE later_transactions.id IS NULL
These methods are the ones in the documentation, and also the ones I use in my work basically every day. Which one is most efficient depends on a variety of factors, but usually, if one is slow the other will be fast.
Also, as Strawberry points out in the comments, you need a composite index on (id, updated_timestamp). Having separate indexes on id and updated_timestamp is not equivalent.
Why a composite index?
Be aware that an index is just a copy of the data that is in the table. In many respects, it works the same as a table does. So, creating an index is creating a copy of the table's data that the RDBMS can use to query the table's information in a more efficient manner.
An index on just updated_timestamp will create a copy of the data that contains updated_timestamp as the first column, and that data will be sorted. It will also include a hidden row ID value (that will work as a primary key) in each of those index rows, so that it can use that to look up the full rows in the actual table.
How does that help in this query (either version)? If we wanted just the latest (or earliest) updated_timestamp overall, it would help, since it can just check the first or last record in the index. But since we want the latest for each id, this index is useless.
What about just an index on id? Here we have a copy of the id column, sorted by the id column, with the row ID attached to each row in the index.
How does this help this query? It doesn't, because the index doesn't include the updated_timestamp column at all, so MySQL won't even consider using it.
Now, consider a composite index: (id,updated_timestamp).
This creates a copy of the data with the id column first, sorted, and then the second column updated_timestamp is also included, and it is also sorted within each id.
This is the same way that a phone book (if people still use those things as something more than paperweights) is sorted by last name and then first name.
Because the rows are sorted in this way, MySQL can look, for each id, at just the last record of a given id. It knows that that record contains the highest updated_timestamp value, because of how the index is defined.
So, it only has to look up one row for each id that exists. That is fast. Further explanation into why would take up a lot more space, but you can research it yourself if you like, by just looking into B-Trees. Suffice to say, finding the first (or last) record is easy.
Try the following:
ALTER TABLE transactions
ADD INDEX `LatestTransaction` (`id`,`updated_timestamp`)
And then see whether your original query or my alternate query is faster. Likely both will be faster than having no index. As your table grows or your select statement changes, it may affect which of these queries is faster, but the index is going to provide the biggest performance boost regardless of which version of the query you use.
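To check that the new index is actually being used, you can run EXPLAIN on the grouped subquery (the exact plan output varies by MySQL version, so treat this as a sketch):

```sql
-- With the composite (id, updated_timestamp) index in place, this grouped
-- subquery can be answered from the index alone; EXPLAIN should report the
-- index under "key" and something like "Using index" (or "Using index for
-- group-by") in the Extra column.
EXPLAIN
SELECT id, MAX(updated_timestamp) AS last_transaction_time
FROM transactions
GROUP BY id;
```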

MySQL using current results for subquery. Correlated subquery?

I have a table that contains the following fields
system_id
partner_id
uptime
I'm trying to get output that shows:
system_id, uptime, partner_id, partner_uptime
So for every row that comes back from an initial select-all, I need to check whether the partner id is in the table and retrieve its uptime value. It's simple enough to do in Excel, but with 2M+ records it could take a while!
Can someone please help construct a basic query for this?
Thanks
You can use a simple self-join query here, assuming partner_id references system_id (your_table stands in for your real table name):
select t.system_id, t.uptime, t1.partner_id, t1.uptime as partner_uptime
from your_table t join your_table t1 on t.system_id = t1.partner_id
where -- your condition

Set sequential number in mysql table only where rows have same value

I have a table in which a new entry gets the number 0 and a status of unpublished. Users can publish or unpublish rows. When they do, each published row should get a consecutive number, while the unpublished rows are skipped. Like this:
status | number
=============================
unpublished | 0
published | 1
unpublished | 0
unpublished | 0
published | 2
published | 3
unpublished | 0
published | 4
Right now I use:
mysql_query("update albums
join (SELECT #i:=0) t
SET id =(#i:=#i+1)");
when a user publishes something, but that will add consecutive numbers to all rows.
I need something like the above, but with some sort of WHERE status = 'published' condition in it, and I don't know how.
What solution should I look into?
Many thanks,
Sam
Try an IF in the UPDATE statement:-
UPDATE albums
JOIN (SELECT #i:=0) t
SET id = IF(status='published', #i:=#i+1, 0)
However, I suspect this is not going to number the rows consistently in the way you want without an ORDER BY clause (UPDATE does support an ORDER BY clause, but not here).
EDIT - further info as requested:-
Albums is a MySQL table. The UPDATE query in MySQL does support the ORDER BY clause (to update records in a particular order), but only for single-table updates. In this query a subquery is joined to the albums table (i.e. JOIN (SELECT #i:=0) t); even though this is not actually a table, MySQL seems to regard it as one and so will not allow an ORDER BY clause in this UPDATE.
However, #i is a user-defined variable and can be initialised by a separate SQL statement. If your query was two statements:-
SET #i:=0;
UPDATE albums
SET id = IF(status='published', #i:=#i+1, 0)
ORDER BY albums.insert_date;
then that should do it (note, I have just assumed a random column name of insert_date to order the records by).
However, many MySQL APIs do not support multiple statements in a single query. As #i is a user variable, it is tied to the database connection. So if you issue one query to initialise it and then a second query (on the same connection) to perform the UPDATE, it should work.
If you are using PHP with MySQLi then you can use mysqli_multi_query() to perform both in one go.
Kickstart's answer is almost what I need, except that when an entry gets published it should always get the highest number. Right now it follows the database order, which is by date. Should I integrate an ORDER BY, or is there a different solution?
Thanks

How to grab most popular rows in table?

I have a comments table with almost 2 million rows. We receive roughly 500 new comments per day. Each comment is assigned to a specific ID, and I want to grab the most popular "discussions" based on that ID.
I have an index on the ID column.
What is best practice? Do I just group by this ID and then sort by the ID that has the most comments? Is that efficient for a table this size?
Do I just group by this ID and then sort by the ID who has the most comments?
That's pretty much how I would do it. Let's assume you want to retrieve the top 50:
SELECT id
FROM comments
GROUP BY id
ORDER BY COUNT(1) DESC
LIMIT 50
If your users are executing this query quite frequently in your application and you're finding that it's not running quite as fast as you'd like, one way you could optimize it is to store the result of the above query in a separate table (topdiscussions), and perhaps have a script or cron that runs intermittently every five minutes or so which would update that table.
Then in your application, just have your users select from the topdiscussions table so that they only need to select from 50 rows rather than 2 million.
The downside of this of course being that the selection will no longer be in real-time, but rather out of sync by up to five minutes or however often you want to update the table. How real-time you actually need it to be depends on the requirements of your system.
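A minimal sketch of that summary-table idea, assuming placeholder names (topdiscussions, comments, id) rather than your real schema:

```sql
-- One-time setup: a tiny table holding only the current top 50
CREATE TABLE topdiscussions (
    id            INT NOT NULL PRIMARY KEY,
    comment_count INT NOT NULL
);

-- Refresh step, run from a cron job every five minutes or so
TRUNCATE TABLE topdiscussions;
INSERT INTO topdiscussions (id, comment_count)
SELECT id, COUNT(1)
FROM comments
GROUP BY id
ORDER BY COUNT(1) DESC
LIMIT 50;
```

Your application then reads from topdiscussions instead of aggregating the 2 million-row table on every request.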
Edit: As per your comments to this answer, I know a little more about your schema and requirements. The following query retrieves the discussions that are the most active within the past day:
SELECT a.id, etc...
FROM discussions a
INNER JOIN comments b ON
a.id = b.discussion_id AND
b.date_posted > NOW() - INTERVAL 1 DAY
GROUP BY a.id
ORDER BY COUNT(1) DESC
LIMIT 50
I don't know your field names, but that's the general idea.
If I understand your question, the ID indicates the discussion to which a comment is attached. So, first you would need some notion of most popular.
1) Initialize a "Comment total" table by counting up comments by ID and setting a column called 'delta' to 0.
2) Periodically
2.1) Count the comments by ID
2.2) Subtract the old count from the new count and store the value into the delta column.
2.3) Replace the count of comments with the new count.
3) Select the 10 'hottest' discussions by selecting 10 rows from comment total in order of descending delta.
Now the rest is trivial. That's just the comments whose discussion ID matches the ones you found in step 3.
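The steps above might look like this in SQL; comment_totals, total, and delta are hypothetical names, and this sketch ignores discussions that first appear between two runs:

```sql
-- 1) Initialize: one row per discussion ID, delta starts at 0
CREATE TABLE comment_totals (
    id    INT NOT NULL PRIMARY KEY,
    total INT NOT NULL,
    delta INT NOT NULL DEFAULT 0
);
INSERT INTO comment_totals (id, total, delta)
SELECT id, COUNT(*), 0
FROM comments
GROUP BY id;

-- 2) Periodically: recount, store new-minus-old in delta,
--    then replace the old count with the new one
UPDATE comment_totals ct
JOIN (SELECT id, COUNT(*) AS new_total
      FROM comments
      GROUP BY id) c ON c.id = ct.id
SET ct.delta = c.new_total - ct.total,
    ct.total = c.new_total;

-- 3) The ten 'hottest' discussions
SELECT id
FROM comment_totals
ORDER BY delta DESC
LIMIT 10;
```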

Removing duplicate data from many rows in mysql?

I am a web developer, so my knowledge of manipulating mass data is lacking.
A coworker is looking for a solution to our data problems. We have a table of about 400k rows with company names listed.
Whoever designed this didn't realize there needed to be some kind of unique identifier for a company, so there are duplicate entries for company names.
What method would one use to match all these records up based on company name, and delete the duplicates based on some kind of criteria (another column)?
I was thinking of writing a script to do this in PHP, but I have a hard time believing my script would be able to execute while making comparisons between so many rows. Any advice?
Answer:
Answer origin
delete from table1
USING table1, table1 as vtable
WHERE (table1.ID > vtable.ID)
AND (table1.field_name = vtable.field_name)
Here you tell mysql that there is a table1.
Then you tell it that you will use table1 and a virtual table with the values of table1.
This will let mysql not compare a record with itself!
Here you tell it that there shouldn’t be records with the same field_name.
The way I've done this in the past is to write a query that returns only the set I want (usually using DISTINCT + a subquery to determine the right record based on other values), and insert that into a different table. You can then delete the old table and rename the new one to the old name.
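A sketch of that copy-and-rename approach, assuming companies has a NAME column plus some column col whose maximum picks the row to keep; all names here are illustrative:

```sql
-- Build a clean copy with one row per company name, keeping the row
-- with the highest col (ties on col would still leave duplicates)
CREATE TABLE companies_clean LIKE companies;

INSERT INTO companies_clean
SELECT c.*
FROM companies c
JOIN (SELECT NAME, MAX(col) AS col
      FROM companies
      GROUP BY NAME) keep
  ON keep.NAME = c.NAME AND keep.col = c.col;

-- Swap the tables once the copy has been sanity-checked
RENAME TABLE companies TO companies_old, companies_clean TO companies;
DROP TABLE companies_old;
```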
To find the list of companies with duplicates in your table, you can use a query like this:
SELECT NAME
FROM companies
GROUP BY NAME
HAVING COUNT(*) > 1
And the following will delete all duplicates except the row containing the max value in the col column:
DELETE del
FROM companies AS del
INNER JOIN (
SELECT NAME, MAX(col) AS col
FROM companies
GROUP BY NAME
HAVING COUNT(*) > 1
) AS sub
ON del.NAME = sub.NAME AND del.col <> sub.col