I have a visitor counter; the table looks like this:
id | country_code | datetime            | Browser | ...
---+--------------+---------------------+---------+----
1  | FR           | 2014-06-20 05:00:28 | FireFox |
2  | US           | 2014-06-20 05:00:28 | Chrome  |
3  | ZW           | 2014-06-20 05:00:28 | IE      |
I want to count how many visitors I had in (for example) a given hour on a certain day.
The query looks like this:
SELECT HOUR(datetime), COUNT(*) as hits
FROM counter_table WHERE datetime >= CURDATE()
GROUP BY HOUR(datetime) WITH ROLLUP
No problem with this query.
But I also want to count how many visitors I got from each country.
I tried things like GROUP BY HOUR(datetime), country_code WITH ROLLUP (I do not need an hourly ROLLUP per country, I only need the hourly ROLLUP for the hits) and JOIN queries, but I can't find a good solution.
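The two-level grouping I tried looked roughly like this (it gives one row per hour and country, plus the ROLLUP rows I do not actually need):
SELECT HOUR(datetime), country_code, COUNT(*) AS hits
FROM counter_table
WHERE datetime >= CURDATE()
GROUP BY HOUR(datetime), country_code WITH ROLLUP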
The best thing I could come up with was something like this:
SUM(IF(country_code = "AF", 1,0)) AS Afghanistan,
...
SUM(IF(country_code = "ZW", 1,0)) AS Zimbabwe
But the problem is that there are roughly 250 country codes. I am not sure whether such a long query would be good for performance. Unfortunately, performance is very important in this case because the table is huge. Otherwise, this solution provides exactly what I want.
Maybe there is another database better suited than MySQL for this kind of problem?
Lately I have been learning how to use SQL to process data. Normally I would use Python for that, but SQL is required for my classes, and I still very much struggle with using it comfortably in more complicated scenarios.
What I want to achieve is the same result as in the following screenshot in Excel:
[Screenshot: the behaviour in Excel that I want to implement in SQL]
The formula I used in Excel:
=SUMIF(B$2:B2;B2;C$2:C2)
Sample of the table:
> select * from orders limit 5;
+------------+---------------+---------+
| ID | clientID | tonnage |
+------------+---------------+---------+
| 2005-01-01 | 872-13-44-365 | 10 |
| 2005-01-04 | 369-43-03-176 | 2 |
| 2005-01-05 | 408-24-90-350 | 2 |
| 2005-01-10 | 944-16-93-033 | 5 |
| 2005-01-11 | 645-32-78-780 | 14 |
+------------+---------------+---------+
The implementation is supposed to return results similar to the following GROUP BY query:
select
orders.clientID as ID,
sum(orders.tonnage) as Tonnage
from orders
group by orders.clientID;
That is, return how much each client has purchased, but at the same time I want it to return each step of the addition as a separate record.
For instance:
Client A bought 350 in the first order and then 231 in the second one. In that case the query would return something like this:
client A - 350 - 350 // first order
client A - 231 - 581 // second order
[Screenshot: example of how it would look in Excel]
I have already tried to use something like:
select
orders.clientID as ID,
sum(case when orders.clientID = <ID> then orders.tonnage end)
from orders;
But I got stuck quickly, since I would need to somehow dynamically change this <ID> and store its value in some kind of temporary variable, and I can't really figure out how to implement such a thing in SQL.
You can use a window function for the running sum.
In your case, use it like this:
select id, clientID, sum(tonnage) over (partition by clientID order by id) tonnageRunning
from orders
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=13a8c2d46b5ac22c5c120ac937bd6e7a
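If you are on a MySQL version older than 8.0 (no window functions), a correlated subquery gives the same running total. A rough, untested sketch against the same orders table, assuming ID is unique per order:
SELECT o.ID, o.clientID,
       (SELECT SUM(o2.tonnage)
        FROM orders o2
        WHERE o2.clientID = o.clientID
          AND o2.ID <= o.ID) AS tonnageRunning
FROM orders o
ORDER BY o.clientID, o.ID;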
I have a tbl_remit where I need to get the last remittance.
I'm developing a system wherein I need to get the potential collection of each Employer using the Employer's last remittance x 12. Ideally, Employers should remit once every month, but there are cases where an Employer remits again for the same month for an additional employee that is newly hired. The MySQL statement that I used was this:
SELECT Employer, MAX(AP_From) as AP_From,
MAX(AP_To) as AP_To,
MAX(Amount) as Last_Remittance,
(MAX(Amount) *12) AS LastRemit_x12
FROM view_remit
GROUP BY PEN
Contents of view_remit:
| RemitNo. | Employer | ap_from    | ap_to      | amount |
| 1        | 1        | 2016-01-01 | 2016-01-31 | 2000   |
| 2        | 1        | 2016-02-01 | 2016-02-28 | 2000   |
| 3        | 1        | 2016-03-01 | 2016-03-31 | 2000   |
| 4        | 1        | 2016-03-01 | 2016-03-31 | 400    |
By using that statement, I ended up getting the wrong potential collection.
What I've got:
400 - Last_Remittance
4800 - LastRemit_x12 (potential collection)
What I need to get:
2400 - Last_Remittance
28800 - LastRemit_x12 (potential collection)
Any help is greatly appreciated. I don't have a team on this project; this may be a novice question to some, but to me it's really a complex puzzle. Thank you in advance.
You want to filter the data down to the last time period, so think WHERE rather than GROUP BY. Then aggregate by employer.
Here is one method:
SELECT Employer, MAX(AP_From) as AP_From, MAX(AP_To) as AP_To,
SUM(Amount) as Last_Remittance,
(SUM(Amount) * 12) AS LastRemit_x12
FROM view_remit vr
WHERE vr.ap_from = (SELECT MAX(vr2.ap_from)
FROM view_remit vr2
WHERE vr2.Employer = vr.Employer
)
GROUP BY Employer;
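With the sample rows above, the subquery picks ap_from = 2016-03-01 for employer 1, so remittances 3 and 4 are summed: 2000 + 400 = 2400, and 2400 * 12 = 28800, which matches the expected result.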
EDIT:
For performance, you want an index on view_remit(Employer, ap_from). Of course, that assumes view_remit is really a table, which may be unlikely.
If you want to improve performance, you'll need to understand the view.
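If view_remit does turn out to be a base table, the index could be created like this (the index name here is arbitrary); if it is a view, the index belongs on the corresponding columns of the underlying table instead:
CREATE INDEX idx_employer_apfrom ON view_remit (Employer, ap_from);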
I'm a bit of a newbie at SQL and I don't really understand what to do here, so any help is really appreciated. I have a table full of readings from different readers; there are about 500,000 of them, so I can't do this by hand.
I received the table without the difference in it. I managed to calculate it, but there's a bit of a problem there...
It looks a bit like this:
reader_id | date | reading | difference
1 | 01-01-2013 | 205 | 0
1 | 02-01-2013 | 210 | 5
1 | 03-01-2013 | 213 | 3
... | ... | ... | ...
1 | 31-12-2013 | 2451 | 4
2 | 01-01-2013 | 8543 | 6092
2 | 02-01-2013 | 8548 | 5
reader_id and date form the primary key. The combination is unique.
How can I make sure the difference is not calculated when the previous row belongs to a different reader_id?
When querying my data with a query like this one, the data get skewed by the incorrect difference between the two reader_ids:
SELECT AVG(difference), reader_id FROM table GROUP BY reader_id
For "I just want to get the average difference for each reader", your query is perfectly good. I think you got something wrong in your difference calculation, though. The first value for reader_id = 2, 6092, is the difference between the last reading of reader 1 and the first reading of reader 2, which does not make sense. If I'm not mistaken, the difference value is the current day's reading minus the previous day's reading; therefore you should set the difference value of the first reading of each reader to 0.
You can do this with the following query:
UPDATE table t
INNER JOIN (SELECT reader_id, MIN(date) AS first_day
            FROM table
            GROUP BY reader_id) AS tmp
        ON tmp.reader_id = t.reader_id AND tmp.first_day = t.date
SET t.difference = 0
Then
SELECT AVG(difference), reader_id FROM table GROUP BY reader_id
will do what you expect.
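Alternatively, on MySQL 8.0 or later you could recompute the whole difference column per reader in one pass with a window function, so the first row of each reader naturally gets 0. A sketch, assuming the table is actually called readings (the name "table" above is just a placeholder):
UPDATE readings r
JOIN (
    SELECT reader_id, date,
           -- first row per reader has no previous reading, so COALESCE makes its difference 0
           reading - COALESCE(
               LAG(reading) OVER (PARTITION BY reader_id ORDER BY date),
               reading
           ) AS diff
    FROM readings
) d ON d.reader_id = r.reader_id AND d.date = r.date
SET r.difference = d.diff;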
If you simply want the average difference, you can use the following query:
SELECT
    reader_id,
    (MAX(reading) - MIN(reading)) / COUNT(*) AS average_difference
FROM table
GROUP BY reader_id
ORDER BY reader_id;
It works on the logic that the total difference for a given reader_id should be equal to MAX(reading) - MIN(reading), assuming the readings only ever increase.
In my application I have an association between two entities, employees and workgroups.
This association usually changes over time, so in my DB I have something like:
employees
| EMPLOYEE_ID | NAME |
| ... | ... |
workgroups
| GROUP_ID | NAME |
| ... | ... |
employees_workgroups
| EMPLOYEE_ID | GROUP_ID | DATE |
| ... | ... | ... |
So suppose I have an association between employee 1 and group 1, valid from 2014-01-01 on.
When a new association is created, for example from 2014-02-01 on, the old one is no longer valid.
This structure for the associative table is a bit problematic for queries, but I would rather avoid adding an END_DATE field to the table, because it would be a redundant value and would also require an insert plus an update (or an update of two rows) every time an association changes.
So do you have any ideas for a more practical design to solve my problem? Or is my current structure already the better approach?
You have what is called a slowly changing dimension. That means that you need dates in the employees_workgroups table in order to find the right workgroup at the right time for a set of employees.
The best way to handle this is to have two dates, which I often call effdate and enddate, on each row. This greatly simplifies queries where you are trying to find the workgroup at a particular point in time. With this structure, such a query might look like:
select ew.*
from employees_workgroups ew
where MYDATE between effdate and enddate;
Now consider getting the same results using only one date per row. It might be something like this:
select ew.*
from employees_workgroups ew join
     (select employee_id, max(date) as maxdate
      from employees_workgroups
      where date <= MYDATE
      group by employee_id
     ) rec
     on ew.employee_id = rec.employee_id and ew.date = rec.maxdate;
The expense of doing an update along with the insert is minimal compared to the complexity this will introduce in the queries.
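To make that trade-off concrete, the insert + update mentioned above might look roughly like this when employee 1 moves to a new workgroup on 2014-02-01 (a sketch; the effdate/enddate columns follow the suggestion above, and the far-future enddate '9999-12-31' for the currently open row is an assumed convention):
-- close the currently open association (enddate becomes the day before the new effdate)
UPDATE employees_workgroups
SET enddate = '2014-01-31'
WHERE employee_id = 1
  AND enddate = '9999-12-31';

-- open the new association
INSERT INTO employees_workgroups (employee_id, group_id, effdate, enddate)
VALUES (1, 2, '2014-02-01', '9999-12-31');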
I have a table (pretty big one) with lots of columns, two of them being "post" and "user".
For a given "post", I want to know which "user" posted the most.
I was first thinking about getting all the entries WHERE (post='wanted_post') and then throwing a PHP hack at it to find which "user" value occurs most often, but given the large size of my table and my limited knowledge of MySQL's subtleties, I am looking for a pure-MySQL way to get this value (basically, the "user" id that posted the most on a given "post").
Is it possible? Or should I fall back on the hybrid SQL-PHP solution?
Thanks,
Cystack
It sounds like this is what you want... am I missing something?
SELECT user
FROM myTable
WHERE post='wanted_post'
GROUP BY user
ORDER BY COUNT(*) DESC
LIMIT 1;
EDIT: Explanation of what this query does:
Hopefully the first three lines make sense to anyone familiar with SQL. It's the last three lines that do the fun stuff.
GROUP BY user -- This collapses rows with identical values in the user column. If this was the last line in the query, we might expect output something like this:
+-------+
| user |
+-------+
| bob |
| alice |
| joe |
ORDER BY COUNT(*) DESC -- COUNT(*) is an aggregate function that works along with the previous GROUP BY clause. It tallies all of the rows that are "collapsed" by the GROUP BY for each user. It might be easier to understand what it's doing with a slightly modified statement and its potential output:
SELECT user,COUNT(*)
FROM myTable
WHERE post='wanted_post'
GROUP BY user;
+-------+-------+
| user | count |
+-------+-------+
| bob | 3 |
| alice | 1 |
| joe | 8 |
This is showing the number of posts per user.
However, it's not strictly necessary to actually output the value of the aggregate function in this case; we can just use it for the ordering and never actually output the data. (Of course, if you want to know how many posts your top poster posted, you may want to include it in your output as well.)
The DESC keyword tells the database to sort in descending order, rather than the default of ascending order.
Naturally, the sorted output would look something like this (assuming we leave the COUNT(*) in the SELECT list):
+-------+-------+
| user | count |
+-------+-------+
| joe | 8 |
| bob | 3 |
| alice | 1 |
LIMIT 1 -- This is probably the easiest to understand, as it just limits how many rows are returned. Since we're sorting the list from most-posts to fewest-posts, and we only want the top poster, we just need the first result. If you wanted the top 3 posters, you might instead use LIMIT 3.