Script to combine multiple MySQL records into one via summing - mysql

I'm a MySQL newbie, but I'm sure there must be a way to do this. I've been looking through StackOverflow for quite a while, though, and haven't found it yet.
I have a MySQL table that is generated from a multi-reducer Hadoop MapReduce job which is analyzing log files. The table is being used in the database that supports a Ruby-on-Rails app, and it looks like this:
+----+-----+------+---------+-----------+
| id | src | dest | time    | requests  |
+----+-----+------+---------+-----------+
|  0 | abc | xyz  | 1000000 | 200000000 |
|  1 | def | uvw  |      10 |       300 |
|  2 | abc | xyz  |  100000 |    200000 |
|  3 | def | xyz  |    1000 |     40000 |
|  4 | abc | uvw  |     100 |      5000 |
|  5 | def | xyz  |   10000 |    100000 |
+----+-----+------+---------+-----------+
I'm trying to coalesce the rows that have the same src and dest, summing their time and requests values, but I just can't figure out how to do it even after searching through the MySQL 5.1 documentation.
I'm trying to write a script which I could run and obtain something like this at the end (neither the order of the rows nor the id column is important):
+----+-----+------+---------+-----------+
| id | src | dest | time    | requests  |
+----+-----+------+---------+-----------+
|  6 | abc | xyz  | 1100000 | 200200000 |
|  7 | def | uvw  |      10 |       300 |
|  8 | abc | uvw  |     100 |      5000 |
|  9 | def | xyz  |   11000 |    140000 |
+----+-----+------+---------+-----------+
Any ideas on how I could figure this out?

You can't really combine the rows in a single table -- at least not easily. That would require both updates and deletes.
So, just create another table:
create table summary_t as
select src, dest, sum(time) as time, sum(requests) as requests
from t
group by src, dest;
If you really want this to go back into the original table, then use a temporary table and re-insert the data:
create temporary table summary_t as
select src, dest, sum(time) as time, sum(requests) as requests
from t
group by src, dest;
truncate table t;
insert into t(src, dest, time, requests)
select src, dest, time, requests
from summary_t;
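If the summarized data should replace the original atomically, another option (a sketch, not from the original answer) is to build a regular, non-temporary summary table and swap it in with RENAME TABLE, which avoids the window where t sits empty between the TRUNCATE and the re-insert:
create table t_new as
select src, dest, sum(time) as time, sum(requests) as requests
from t
group by src, dest;
rename table t to t_old, t_new to t;  -- both renames happen atomically
drop table t_old;
Note that CREATE TABLE ... AS won't carry over the id column or indexes from t; since the question says id doesn't matter, that may be acceptable here.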
However, having said all that, you should just add another step to your map-reduce application to do that final summary.

GROUP BY with the SUM aggregate should work:
select src, dest, sum(`time`) as `time`, sum(requests) as requests
from yourtable
group by src, dest

Check if this suits your needs: create a table with the columns src and dest as the primary key, plus fields like totaltime and totalrequests.
Create an AFTER INSERT trigger on the existing table which updates the other table's totaltime and totalrequests with (old + new), using src and dest as the key in the WHERE condition. A sketch of that idea follows.
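Here is a minimal sketch of that trigger approach (the summary table, trigger name, and column types are illustrative assumptions, not from the post):
create table summary (
  src varchar(16) not null,
  dest varchar(16) not null,
  totaltime bigint not null default 0,
  totalrequests bigint not null default 0,
  primary key (src, dest)
);

delimiter //
create trigger t_after_insert
after insert on t
for each row
begin
  -- upsert: fold the new row's values into the running totals for (src, dest)
  insert into summary (src, dest, totaltime, totalrequests)
  values (new.src, new.dest, new.time, new.requests)
  on duplicate key update
    totaltime = totaltime + new.time,
    totalrequests = totalrequests + new.requests;
end//
delimiter ;
With the primary key on (src, dest), INSERT ... ON DUPLICATE KEY UPDATE does the (old + new) bookkeeping in one statement instead of a separate UPDATE with a WHERE clause.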

Related

Optimal way to get data set columns from multiple tables in MySQL

I am trying to fine-tune a query which runs on an application dashboard.
I have a master table and a few transaction tables. I have to make some calculations on the transaction tables and show the output along with a few columns from the master table.
I tried a join, which worked, but the query is not fast enough for the application (40 seconds for 1k records).
I am trying a subquery instead, but maybe I am making a mistake somewhere.
Dummy details below.
Master table:
| id | name  |
|  1 | Cell1 |
|  2 | Cell2 |
|  3 | Cell3 |
|  4 | Cell4 |
Transaction table 1: Session1
| id | TotalMarks |
| 1 | 21 |
| 1 | 21 |
| 2 | 23 |
| 3 | 24 |
Transaction table 2: Session2
| id | TotalMarks |
| 1 | 22 |
| 2 | 28 |
| 4 | 25 |
| 4 | 29 |
The result I want is like:
| id | Name  | ObtainMarksSession1 | ObtainMarksSession2 |
|  1 | Cell1 |                  42 |                  22 |
I have already checked indexes, but an index won't help here because I am using an aggregate function.
Join query:
Select m.id, m.name, sum(s1.TotalMarks) ObtainMarksSession1, sum(s2.TotalMarks) ObtainMarksSession2
from master m
join session1 s1 on m.id = s1.id and s1.id is not null
join session2 s2 on m.id = s2.id and s2.id is not null
group by m.id, m.name;
Subquery sample:
Select id, sum(TotalMarks) ObtainMarksSession1
from session1
where id is not null
group by id;
I got the result from the other table the same way, but now I am unable to merge both outputs. These single-query outputs are very fast.
I need to know how to merge the results and also get the name from the master table. Any other suggestions to make this query fast are also welcome.
P.S. id is not a primary key in the transaction tables, so NULL values are possible.
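One way to merge the two fast per-session aggregates (a sketch against the dummy tables above, not a tested answer) is to turn each one into a derived table and LEFT JOIN both onto master. Because each transaction table is aggregated down to one row per id before the join, the row multiplication that inflates the sums in the original two-way join cannot occur:
select m.id, m.name,
       coalesce(s1.total, 0) as ObtainMarksSession1,
       coalesce(s2.total, 0) as ObtainMarksSession2
from master m
left join (select id, sum(TotalMarks) as total
           from session1
           where id is not null
           group by id) s1 on s1.id = m.id
left join (select id, sum(TotalMarks) as total
           from session2
           where id is not null
           group by id) s2 on s2.id = m.id;
The LEFT JOINs keep master rows that appear in only one session (such as id 3 and id 4 above), with COALESCE turning the missing total into 0.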

Select value from table sorted by a certain order from another table

I want to select value from table sorted by a certain order.
I have a table called test that looks like this:
+----------+-----------+----------+
| date     | code      | value    |
+----------+-----------+----------+
| 20050104 | 000005.SZ | -6359.19 |
| 20050104 | 600601.SH | -7876.34 |
| 20050104 | 600602.SH | -25693.3 |
| 20050104 | 600651.SH | NULL     |
| 20050104 | 600652.SH | -15309.9 |
| ...      | ...       | ...      |
| 20050105 | 000005.SZ | -4276.28 |
| 20050105 | 600601.SH | -3214.56 |
| ...      | ...       | ...      |
| 20170405 | 000005.SZ | 23978.13 |
| 20170405 | 600601.SH | 32212.54 |
+----------+-----------+----------+
Right now I want to select only one date, say date = 20050104, and then sort the data by a certain order (the order that each stock was listed in the stock market).
I have another table called stock_code which stores the correct order:
+---------+-----------+
| code_id | code |
+---------+-----------+
| 1 | 000002.SZ |
| 2 | 000004.SZ |
| 3 | 600656.SH |
| 4 | 600651.SH |
| 5 | 600652.SH |
| 6 | 600653.SH |
| 7 | 600654.SH |
| 8 | 600602.SH |
| 9 | 600601.SH |
| 10 | 000005.SZ |
...
I want to sort the selected data by stock_code.code_id, but I don't want to use a join because it takes too much time. Any thoughts?
I tried to use FIELD(), but it gives me an error; please tell me how to correct it, or give me an even better idea.
select * from test
where date = 20050104 and code in (select code from stock_code order by code)
order by field(code, (select code from stock_code order by code));
Error Code: 1242. Subquery returns more than 1 row
You told us that you don't want to join because it takes too much time, but the following join query is probably the best option here:
SELECT t.*
FROM test t
INNER JOIN stock_code sc
ON t.code = sc.code
WHERE t.date = '20050104'
ORDER BY sc.code_id
If this really runs slowly, then you should check to make sure you have indices set up on the appropriate columns. In this case, indices on the code columns of both tables, as well as an index on test.date, should be very helpful:
ALTER TABLE test ADD INDEX code_idx (code);
ALTER TABLE test ADD INDEX date_idx (date);
ALTER TABLE stock_code ADD INDEX code_idx (code);
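For what it's worth, the original FIELD() attempt fails with error 1242 because FIELD() needs a list of scalar arguments, while (select code from stock_code order by code) hands it an entire column. If the join really must be avoided, a hedged alternative (untested here, not from the answer above) is a correlated scalar subquery in the ORDER BY, which looks up one code_id per row:
select t.*
from test t
where t.date = 20050104
order by (select sc.code_id
          from stock_code sc
          where sc.code = t.code);
With the index on stock_code.code suggested above, each lookup is a single index hit, though the join version will usually still get the faster plan.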

MySQL- List inside of a column

What I'm trying to do is create a table that will keep track of users who report a comment on a website. Right now, I have a table that looks something like this:
id    | num_reports | users
------+-------------+------
12345 | 1           |
12489 | 4           |
For this table, I'd like id to be unique and num_reports to keep incrementing starting at 1. But for users, I'm getting confused because I'd like to keep a record of the user_ids who created a report, and I'm unsure of how to store multiple user_ids.
I thought of doing something like
id  | user_id
----+--------
123 | 567
123 | 689
and in this case, you would just count the number of rows with id being duplicated and user_id being unique, but this just seemed inefficient.
I've been looking around, and it looks like the correct way would be creating another table, but how does that allow me to store multiple user_ids?
That's the right way to do it. Here is what you should have:
USERS                      COMMENTS
+---------+------+         +------------+---------+------------+---------------------+
| id_user | name |         | id_comment | id_user | id_article | date                |
+---------+------+         +------------+---------+------------+---------------------+
| 171     | Joe  |         | 245        | 171     | 24         | 2015-03-22 10:12:00 |
| 180     | Jack |         | 1245       | 180     | 68         | 2015-03-23 23:01:19 |
| ...     | ...  |         | ...        | ...     | ...        | ...                 |
+---------+------+         +------------+---------+------------+---------------------+
COMMENT_REPORTS
+-----------+------------+---------+---------------------+
| id_report | id_comment | id_user | date                |
+-----------+------------+---------+---------------------+
| 1         | 245        | 171     | 2015-03-24 16:11:15 |
| 2         | 654        | 180     | 2015-03-24 18:13:42 |
| 3         | 1245       | 180     | 2015-03-24 18:34:01 |
| 4         | 1245       | 456     | 2015-03-25 09:58:10 |
| ...       | ...        | ...     | ...                 |
+-----------+------------+---------+---------------------+
You will then be able to get:
# Every report made by a user
SELECT *
FROM comment_reports
WHERE id_user = 180

# Every report related to a comment
SELECT *
FROM comment_reports
WHERE id_comment = 1245

# Every report made today
SELECT *
FROM comment_reports
WHERE date >= CURDATE()

# The number of reports related to a user's comments
SELECT c.id_user AS User, COUNT(cr.id_report) AS Reported
FROM comment_reports cr
JOIN comments c ON (cr.id_comment = c.id_comment)
WHERE c.id_user = 180
GROUP BY c.id_user
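A possible DDL sketch for the link table above (column types and key names are assumptions; adjust to match your schema):
CREATE TABLE comment_reports (
  id_report  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  id_comment INT UNSIGNED NOT NULL,
  id_user    INT UNSIGNED NOT NULL,
  date       DATETIME NOT NULL,
  PRIMARY KEY (id_report),
  -- one report per user per comment
  UNIQUE KEY uq_comment_user (id_comment, id_user)
);
The unique key on (id_comment, id_user) stops the same user from reporting one comment twice, and the num_reports value from the original design becomes a simple COUNT(*) ... GROUP BY id_comment instead of a stored counter.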
Are you building a data warehouse? Normally the number of reports for a website is not stored. It is calculated on the fly with COUNT(*) grouped by website_id on the table where the reports themselves are saved. There you can record the user who made each report, and then you can get the total number of reports, the total per user, and so on.
However, with a solution like that, you have no option other than to create a separate link table for storing report <-> user links.
You can find users by their unique id; thanks to auto-increment, a user's id is always unique and is never overwritten.

Efficient MySQL Table Setup and Queries

Suppose I have the following database setup (a simplified version from what I actually have):
Table: news_posting (500,000+ entries)
| ------------------------------------------------------------- |
| posting_id | name      | is_active | released_date | token     |
| 1          | posting_1 | 1         | 2013-01-10    | 123       |
| 2          | posting_2 | 1         | 2013-01-11    | 124       |
| 3          | posting_3 | 0         | 2013-01-12    | 125       |
| ------------------------------------------------------------- |
PRIMARY posting_id
INDEX sorting ON (is_active, released_date, token)
Table: news_category (500 entries)
| -------------------------- |
| category_id | name         |
| 1           | category_1   |
| 2           | category_2   |
| 3           | category_3   |
| -------------------------- |
PRIMARY category_id
Table: news_cat_match (1,000,000+ entries)
| -------------------------- |
| category_id | posting_id   |
| 1           | 1            |
| 2           | 1            |
| 3           | 1            |
| 2           | 2            |
| 3           | 2            |
| 1           | 3            |
| 2           | 3            |
| -------------------------- |
UNIQUE idx (category_id, posting_id)
My task is as follows. I must get a list of 50 latest news postings (at some offset) that are active, that are before today's date, and that are in one of the 20 or so categories that are specified in the request. Before I choose the 50 news postings to return, I must sort the appropriate news postings by token in descending order. My query is currently similar to the following:
SELECT DISTINCT posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
With just one specified category_id the query does not involve a filesort and is reasonably fast because it does not have to process removal of duplicate results. However, calling EXPLAIN on the above query that has multiple category_id's returns a table that says that there is filesort to be done. And, the query is extremely slow on my data set.
Is there any way to optimize the table setup and/or the query?
I was able to get the above query to run even faster than with a single-value category list version by rewriting it as follows:
SELECT posting_id
FROM news_posting np
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
AND EXISTS (
    SELECT ncm.posting_id
    FROM news_cat_match ncm
    WHERE ncm.posting_id = np.posting_id
    AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
    LIMIT 1
)
ORDER BY np.token DESC LIMIT 50
This now takes under a second on my data set.
The sad part is that this is even faster than when just one category_id is specified: with 20 categories the subset of matching news items is bigger than with one, so the scan down the token order fills its 50 results more quickly.
Now my next question is whether this can be optimized for cases when a category has only a few news items that are spread out in time.
The following is still pretty slow on my development machine. Although it's fast enough on the production server, I would like to optimize this if possible.
SELECT DISTINCT posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id = 1)
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
Does anyone have any further suggestions?
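One experiment that might help in the sparse-category case (an assumption on my part, not verified against this schema): the existing sorting index puts released_date before token, so the range predicate on released_date prevents the index from also delivering rows in token order, which is what forces the filesort. An index that puts token right after is_active lets MySQL walk tokens in descending order, check released_date per row, and stop after 50 matches:
-- hypothetical index; the trailing released_date column just lets the
-- range check be answered from the index itself
ALTER TABLE news_posting ADD INDEX active_token (is_active, token, released_date);
The trade-off is that for a very sparse category the scan may touch many non-matching postings before collecting 50 rows, so it is worth comparing EXPLAIN output for both extremes.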

MySQL Multi Duplicate Record Merging

A previous DBA managed a non-relational table with 2.4M entries, all with unique IDs. However, there are duplicate records with different data in each record, for example:
+----+---------+--------------+-------+------------+-------------+
| id | Name    | Address      | Phone | Email      | LastVisited |
+----+---------+--------------+-------+------------+-------------+
| 1  | bob     | 12 Some Road | 02456 |            |             |
| 2  | bobby   |              | 02456 | bob#domain |             |
| 3  | bob     | 12 Some Rd   | 02456 |            | 2010-07-13  |
| 4  | sir bob |              | 02456 |            |             |
| 5  | bob     | 12SomeRoad   | 02456 |            |             |
| 6  | mr bob  |              | 02456 |            |             |
| 7  | robert  |              | 02456 |            |             |
+----+---------+--------------+-------+------------+-------------+
This isn't the exact table (the real table has 32 columns); it's just to illustrate.
I know how to identify the duplicates; in this case I'm using the phone number. I've extracted the duplicates into a separate table, and there are 730k entries in total.
What would be the most efficient way of merging these records (and flagging the unneeded records for deletion)?
I've looked at using UPDATE with INNER JOINs, but several WHERE clauses are needed, because I want to update the first record with data from subsequent records, where a subsequent record has additional data the former record does not.
I've looked at third-party software such as Fuzzy Dups, but I'd like a pure MySQL option if possible.
The end goal is that I'd be left with something like:
+----+------+--------------+-------+------------+-------------+
| id | Name | Address      | Phone | Email      | LastVisited |
+----+------+--------------+-------+------------+-------------+
| 1  | bob  | 12 Some Road | 02456 | bob#domain | 2010-07-13  |
+----+------+--------------+-------+------------+-------------+
Should I be looking at looping in a stored procedure/function, or is there something really easy I've missed?
You have to create a PROCEDURE, but before that, create your own temp_table:
Insert into temp_table (column1, column2, ...)
select column1, column2, ... from myTable group by phoneNumber;
You have to create the above-mentioned physical table so that you can run a cursor over it.
Then create a PROCEDURE (myPROC) that:
1. Opens a cursor on temp_table and, for each row, fetches the phoneNumber and id into local variables (L_id, L_phoneNum).
2. Fills a second working table, similar_tempTable, with that group's duplicates:
Insert into similar_tempTable (column1, column2, ...)
select column1, column2, ... from myTable where phoneNumber = L_phoneNum;
3. Extracts the values of each column you want from similar_tempTable, updates them into the row of myTable where id = L_id, and deletes the remaining duplicate rows from myTable.
4. Truncates similar_tempTable after every iteration of the cursor.
Hope this helps.
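If the cursor route feels heavy, a set-based sketch can do it in one pass (assumptions: the extracted duplicates live in a table called dupes, Phone defines a duplicate group, and "merge" means keeping any non-empty value per column; MAX(NULLIF(col, '')) is a stand-in for whatever per-column merge rule actually applies across the real 32 columns):
CREATE TABLE merged AS
SELECT MIN(id)                  AS id,          -- keep the lowest id as the survivor
       MAX(NULLIF(Name, ''))    AS Name,        -- any non-empty name in the group
       MAX(NULLIF(Address, '')) AS Address,
       Phone,
       MAX(NULLIF(Email, ''))   AS Email,
       MAX(LastVisited)         AS LastVisited  -- most recent visit wins
FROM dupes
GROUP BY Phone;
When several records hold different non-empty values for the same column (bob vs. sir bob vs. robert), MAX() just picks the alphabetically greatest one, so a real merge would still need explicit precedence rules for those columns.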