I am having trouble developing queries on the fly for our clients, and I sometimes find myself asking: "Would it be better to start with a subset of the data I know I'm looking for, then import it into a program like Excel and process it there using features such as pivot tables?"
One instance in particular I am struggling with is the following example:
I have an online member enrollment system. For simplicity's sake, let's assume the data captured is: member ID, sign-up date, referral code, and state.
A sample member table may look like the following:
MemberID | Date | Ref | USState
=====================================
1 | 2011-01-01 | abc | AL
2 | 2011-01-02 | bcd | AR
3 | 2011-01-03 | cde | CA
4 | 2011-02-01 | abc | TX
and so on....
Ultimately, the types of queries I want to build and run against this data set include:
"Show me a list of all referral codes and the number of sign ups they had by each month in a single result set".
For example:
Ref | 2011-01 | 2011-02 | 2011-03 | 2011-04
==============================================
abc | 1 | 1 | 0 | 0
bcd | 1 | 0 | 0 | 0
cde | 1 | 0 | 0 | 0
To be honest, I have no idea how to build this type of query in MySQL (I imagine that if it can be done, it would require a lot of code: joins, subqueries, and unions).
Similarly, another sample query might be how many members signed up in each state, by month:
USState | 2011-01 | 2011-02 | 2011-03 | 2011-04
==============================================
AL | 1 | 0 | 0 | 0
AR | 1 | 0 | 0 | 0
CA | 1 | 0 | 0 | 0
TX | 0 | 1 | 0 | 0
I suppose my question is twofold:
1) Is it in fact best to build these out with the necessary data from within a MySQL GUI such as Navicat, or to import the entire subset of data into Excel and work from there?
2) If I were to go the MySQL route, what is the proper way to build the subsets of data in the examples above? Note that the queries could become far more complex, such as "Show how many sign-ups came in for each particular month, by each state, and grouped by each agent as well" (each agent has 50 possible rows).
Thank you so much for your assistance ahead of time.
I am a proponent of doing this kind of querying on the server side, at least to get just the data you need.
You should create a time-periods table. It can get as complex as you desire, going down to days even.
id | year | month | monthstart | monthend
==========================================
1 | 2011 | 1 | 2011-01-01 | 2011-01-31
...
This gives you almost limitless ability to group and query data in all sorts of interesting ways.
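A minimal sketch of the DDL, assuming the table is named months to match the query below (populate it once from a script or a numbers table):
create table months (
  id int not null auto_increment primary key,
  year int not null,
  month int not null,
  monthstart date not null,
  monthend date not null
);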
Getting the data for the original referral counts by month query you mentioned would be quite simple...
-- one output row per referral code per month that had sign-ups
select a.Ref, b.year, b.month, count(*) as referralcount
from myTable a
join months b on a.Date between b.monthstart and b.monthend
group by a.Ref, b.year, b.month
order by a.Ref, b.year, b.month
The result set would be in rows like ref = abc, year = 2011, month = 1, referralcount = 1, as opposed to a column for every month. I am assuming that since importing a larger set of data into Excel was an option, changing the layout of this data wouldn't be difficult.
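If you do want one column per month directly from MySQL, conditional aggregation is the usual trick. A minimal sketch, hardcoding the four months from the example (in practice you would generate the column list in your application code):
select a.Ref,
       sum(case when b.year = 2011 and b.month = 1 then 1 else 0 end) as `2011-01`,
       sum(case when b.year = 2011 and b.month = 2 then 1 else 0 end) as `2011-02`,
       sum(case when b.year = 2011 and b.month = 3 then 1 else 0 end) as `2011-03`,
       sum(case when b.year = 2011 and b.month = 4 then 1 else 0 end) as `2011-04`
from myTable a
join months b on a.Date between b.monthstart and b.monthend
group by a.Ref
order by a.Ref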
Check out this previous answer that goes into a little more detail about the concept with different examples: SQL query for Figuring counts by month
I work on an Excel-based application that deals with multi-dimensional time series data, and have recently been working on implementing predefined pivot table spreadsheets, so I know exactly what you're thinking. I'm a big proponent of giving users tools rather than writing individual reports or a whole query language for them to use. You can create pivot tables on the fly that connect to the database, and it's not that hard. Andrew Whitechapel has a great example here. But you will also need to launch that in Excel or set up a basic Excel VSTO program, which is fairly easy to do in Visual Studio 2010. (microsoft.com/vsto)
Another thing: don't feel like you have to create ridiculously complex queries. Every join you add will slow down a relational database. I discovered years ago that breaking the work into multi-step queries using temp tables is, in most cases, clearer, faster, and easier to write and support.
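As a minimal sketch of that multi-step style, reusing the member table from the first example (myTable is the name assumed by the earlier query):
-- step 1: aggregate into a small intermediate table
create temporary table tmp_signups as
select Ref, date_format(`Date`, '%Y-%m') as yrmonth, count(*) as signups
from myTable
group by Ref, date_format(`Date`, '%Y-%m');

-- step 2: later steps join or pivot against the much smaller table
select * from tmp_signups order by Ref, yrmonth;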
I have a tbl_remit where I need to get the last remittance.
I'm developing a system wherein I need to get the potential collection of each Employer, using the Employer's last remittance x 12. Ideally, Employers should remit once every month, but there are cases where an Employer remits again for the same month because of a newly hired employee. The MySQL statement I used was this:
SELECT Employer, MAX(AP_From) as AP_From,
       MAX(AP_To) as AP_To,
       MAX(Amount) as Last_Remittance,
       (MAX(Amount) * 12) AS LastRemit_x12
FROM view_remit
GROUP BY Employer
The rows in view_remit look like this:
| RemitNo | Employer | ap_from    | ap_to      | amount |
| 1       | 1        | 2016-01-01 | 2016-01-31 | 2000   |
| 2       | 1        | 2016-02-01 | 2016-02-28 | 2000   |
| 3       | 1        | 2016-03-01 | 2016-03-31 | 2000   |
| 4       | 1        | 2016-03-01 | 2016-03-31 | 400    |
By using that statement, I ended up getting the wrong potential collection.
What I've got:
400 - Last_Remittance
4800 - LastRemit_x12 (potential collection)
What I need to get:
2400 - Last_Remittance
28800 - LastRemit_x12 (potential collection)
Any help is greatly appreciated. I don't have a team on this project; this may be a novice question to some, but to me it's a real puzzle. Thank you in advance.
You want to filter the data down to the last time period, so think WHERE rather than GROUP BY. Then you want to aggregate by employer.
Here is one method:
SELECT Employer, MAX(AP_From) as AP_From, MAX(AP_To) as AP_To,
       SUM(Amount) as Last_Remittance,
       (SUM(Amount) * 12) AS LastRemit_x12
FROM view_remit vr
WHERE vr.ap_from = (SELECT MAX(vr2.ap_from)  -- keep only each employer's latest period
                    FROM view_remit vr2
                    WHERE vr2.Employer = vr.Employer
                   )
GROUP BY Employer;
EDIT:
For performance, you want an index on view_remit(Employer, ap_from). Of course, that assumes view_remit is really a table... which may be unlikely.
If you want to improve performance, you'll need to understand the view.
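For illustration only: if the data behind view_remit lived in a plain table (called remit_base here purely as a placeholder), the index would be created like this:
CREATE INDEX idx_employer_apfrom ON remit_base (Employer, ap_from);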
I have two MySQL tables like below:
table_category
-----------------
id | name | type
1 | A | Cloth
2 | B | Fashion
3 | C | Electronics
4 | D | Electronics
table_product
------------------
id | cat_cloth | cat_fashion | cat_electronics
1 | 1 | 2 | 3
2 | NULL | 2 | 4
Here cat_cloth, cat_fashion, and cat_electronics are IDs from table_category.
It would be better to have another table for category type, but I need a quick solution for now.
I want to get a list of categories with the total number of products in each. I wrote the following query:
SELECT table_category.*, COUNT(table_product.id) as count
FROM table_category
LEFT JOIN table_product ON table_category.id = table_product.cat_cloth
    OR table_category.id = table_product.cat_fashion
    OR table_category.id = table_product.cat_electronics
GROUP BY table_category.id
ORDER BY table_category.id ASC
Question: The SQL I wrote works, but I have more than 14K categories and 50K products, and it runs very slowly. I added indexes for the cat_* IDs, but there was no improvement. How can I optimize this query?
The query currently takes 3-4 minutes to process the volume of data I mentioned, and I want to reduce the execution time.
Best Regards
As far as I can tell, every OR, whether in the ON or the WHERE clause, is very expensive. It may sound simplistic, but I would recommend making three separate small SELECTs combined with UNION ALL.
We do this for similar problems in both MySQL and PostgreSQL, and in some cases where we hit "resources exceeded" we had to do it for BigQuery as well. It means more code to write and maintain, but it certainly works, and it produces results much faster than many ORs.
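A hedged sketch of that rewrite for the query above; each branch joins on exactly one cat_* column (so each can use its own index), and the outer query sums the three partial counts:
SELECT id, name, type, SUM(cnt) AS count
FROM (
    SELECT c.id, c.name, c.type, COUNT(p.id) AS cnt
    FROM table_category c
    LEFT JOIN table_product p ON c.id = p.cat_cloth
    GROUP BY c.id, c.name, c.type
  UNION ALL
    SELECT c.id, c.name, c.type, COUNT(p.id)
    FROM table_category c
    LEFT JOIN table_product p ON c.id = p.cat_fashion
    GROUP BY c.id, c.name, c.type
  UNION ALL
    SELECT c.id, c.name, c.type, COUNT(p.id)
    FROM table_category c
    LEFT JOIN table_product p ON c.id = p.cat_electronics
    GROUP BY c.id, c.name, c.type
) t
GROUP BY id, name, type
ORDER BY id;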
I see questions all the time (and perhaps a Meta post could be made about how to handle them) that follow along the lines of:
Get the count of [some field] for [some object].
Where the problem usually lies in:
SELECT myField, COUNT(*)
FROM myTable
GROUP BY myField;
Which will not return rows that have 0 counts, so it typically involves performing an outer join back to the table to get those counts.
Is there a name for this type of procedure? Is it still just simply Aggregation? The reason I wonder if it's different, is because it involves using a join to aggregate data that doesn't exist in the table.
Also, I have heard of special types of aggregation such as conditional aggregation, so I thought there might be a term [slang, or standard] for this type of operation.
Edit: to explain what I meant by data that 'doesn't exist', consider a users table like this:
| id | name |
+----+-------+
| 1 | John |
| 2 | Bob |
| 3 | Sandy |
| 4 | Time |
And a login table like this:
| user_id | loginTime |
+---------+-----------+
| 1 | 01:43:44 |
| 1 | 02:43:44 |
| 3 | 03:43:44 |
| 3 | 04:43:44 |
| 3 | 05:43:44 |
| 4 | 06:43:44 |
If I want to get the total number of log ins for each user, I could do the following:
SELECT u.id, COUNT(*) AS numLogins
FROM users u
JOIN login l ON l.user_id = u.id
GROUP BY u.id;
However, this won't return a row for user 2 unless I use an outer join and the coalesce function. What is the name for this type of operation?
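For reference, the outer-join version I mean looks something like this; counting the nullable joined column (rather than *) makes unmatched users come out as 0, so COALESCE isn't even strictly needed:
SELECT u.id, COUNT(l.user_id) AS numLogins
FROM users u
LEFT JOIN login l ON l.user_id = u.id
GROUP BY u.id;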
The issue you describe (very well) is typically referred to as sparse data.
There's all sort of "how-to" suggestions/patterns for getting those "zero" counts returned out of sparse data.
The terms I've heard refer to density, and to getting the data into a dense form: getting the data "densified" (is that even a real word?) and referring to that process as "densification" (I don't think that's a real word either).
I believe I ran across those terms in the Oracle Data Warehousing Guide (Oracle documentation). Other vendors may use different vernacular. I don't know that there's any official standard term.
EDIT
Reference: Oracle "Data Warehousing and Business Intelligence" http://docs.oracle.com/cd/B28359_01/server.111/b28313/analysis.htm#i1014934
Table lists
id | user_id | name
1 | 3 | ListA
2 | 3 | ListB
Table celebrities
id | user_id | list_id | celebrity_code
1 | 3 | 1 | AA000297
2 | 3 | 1 | AA000068
3 | 3 | 2 | AA000214
4 | 3 | 2 | AA000348
I am looking for a JSON object like this:
[
{id:1, name:'ListA', celebrities:[{celebrity_code:AA000297},{celebrity_code:AA000068}]},
{id:2, name:'ListB', celebrities:[{celebrity_code:AA000214},{celebrity_code:AA000348}]}
]
Moved this to an answer since the details were getting long, and I thought the additional references would be useful to future readers.
Since you are using MySQL, check out GROUP_CONCAT. To get your object, you will want to GROUP_CONCAT on a CONCATenated string. If you could live with a schema more like {id:2, name:'ListB', celebrity_codes:['AA000214','AA000348']} you'll have a simpler query. If you make a SQLfiddle of your basic schema (basically your create tables plus the inserts of the above sample data), someone might even write it for you. :-)
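As a hedged sketch of that simpler shape, using the lists and celebrities tables from the question:
SELECT l.id, l.name,
       CONCAT('[',
              GROUP_CONCAT(CONCAT('''', c.celebrity_code, '''') ORDER BY c.id),
              ']') AS celebrity_codes
FROM lists l
JOIN celebrities c ON c.list_id = l.id
GROUP BY l.id, l.name;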
To be clear, while GROUP_CONCAT can do this, if you are trying to generate more than a fairly simple schema, the code gets pretty messy, and it starts making more and more sense to move the work into your application layer, both for code maintenance and for performance and scalability.
Also note that SQLite supports GROUP_CONCAT; for other databases:
Postgres users should look at string_agg
SQL Server users should check out this project on CodePlex.
Oracle users can use MODEL, as illustrated here.
I'm struggling to design an efficient automated task to clean up a reputation points table, similar to SO I suppose.
If a user reads an article, comments on an article, and/or shares an article, I give the member some reputation points. If a member does all three of these, for example, there would be three separate rows in the table. When showing a member's points, I simply use a SUM query to total all points for that member.
Now, with a million active members, with high reputation, there are many, many rows in my table and would somehow like to clean them up. Using a Cron Job, I would like to merge all reputation rows for each member, older than 3-months, into one row. For example:
user  | repTask               | repPoints | repDate
------+-----------------------+-----------+--------------------
10001 | Commented on article  | 5         | 2012-11-12 08:40:32
10001 | Read an article       | 2         | 2012-06-12 12:32:01
10001 | Shared an article     | 10        | 2012-06-04 17:39:44
10001 | Read an article       | 2         | 2012-05-19 01:04:11
Would become:
user  | repTask               | repPoints | repDate
------+-----------------------+-----------+--------------------
10001 | Commented on article  | 5         | 2012-11-12 08:40:32
10001 | (merged points)       | 14        | Now()
Or (merging months):
user  | repTask               | repPoints | repDate
------+-----------------------+-----------+--------------------
10001 | Commented on article  | 5         | 2012-11-12 08:40:32
10001 | (Merged for 06/2012)  | 12        | Now()
10001 | (Merged for 05/2012)  | 2         | Now()
Anything older than 3 months is considered legitimate; anything more recent may still need to be revoked in case of cheating, which is why I chose 3 months.
First of all, is this a good idea? I'm trying to avoid having, say, hundreds of millions of rows in 3 years' time. If merging points is not a good idea, is there a better way to store the data as it comes in? I obviously cannot change what has already been recorded, but I could do better for the future.
If it is a good idea, I'm struggling to come up with an efficient query to modify the data. I'm not looking for exact code, but it would be extremely helpful if somebody could describe a suitable query that merges all points older than 3 months for each user, either into a single row or into separate rows per month (a sketch of the per-month variant follows below).
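A minimal sketch of the per-month variant, assuming the rows live in a table called reputation with the columns shown above (the table name is a placeholder; run both statements inside one transaction from the cron job):
-- collapse each user's pre-cutoff rows into one row per calendar month
INSERT INTO reputation (`user`, repTask, repPoints, repDate)
SELECT `user`,
       CONCAT('(Merged for ', DATE_FORMAT(repDate, '%m/%Y'), ')'),
       SUM(repPoints),
       NOW()
FROM reputation
WHERE repDate < CURDATE() - INTERVAL 3 MONTH
  AND repTask NOT LIKE '(Merged%'      -- don't re-merge already merged rows
GROUP BY `user`, DATE_FORMAT(repDate, '%m/%Y');

-- then remove the detail rows that were just merged
DELETE FROM reputation
WHERE repDate < CURDATE() - INTERVAL 3 MONTH
  AND repTask NOT LIKE '(Merged%';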
You can do it that way, with cron jobs, but how about this:
Create a trigger or procedure so that any time a point is added, it updates a total column in the users table, and any time a point is revoked, it subtracts from that column.
This way, no matter how many millions or billions of rows are in the points table, you don't have to query them to get the total points. You could even have separate columns for months or years. Also, since you're not deleting any rows, you can go back and retroactively revoke a point from, say, a year ago if needed.
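A hedged sketch of the triggers, assuming the same placeholder reputation table as above and a rep_total column added to users:
-- keep the running total in sync on every insert
CREATE TRIGGER trg_rep_insert
AFTER INSERT ON reputation
FOR EACH ROW
  UPDATE users SET rep_total = rep_total + NEW.repPoints
  WHERE users.id = NEW.`user`;

-- and on every revocation (row delete)
CREATE TRIGGER trg_rep_delete
AFTER DELETE ON reputation
FOR EACH ROW
  UPDATE users SET rep_total = rep_total - OLD.repPoints
  WHERE users.id = OLD.`user`;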