I have to produce a pretty complex report. Without a doubt someone will ask me for the individual records that make up the counts, totals etc.
I'm new to MySQL. I have tried it and found that I can store SQL statements as data of type TEXT without any real problems.
Being able to do it, however, raises the main question:
Is this a good idea? Is there anything here that will hit me with a "gotcha"?
(It's always what you don't know that gets you, never what you do know.)
The report will return Sales Count: 5
I would love to store "SELECT * FROM Sales WHERE Producer_ID = 34" so I can quickly get to the producer 34 records that make up the Sales Count of 5.
Storing SQL queries as data in a SQL DB is sometimes labelled dynamic SQL queries. It is prone to injection issues, and should not be undertaken lightly. I recommend not doing so unless you have a very good reason to do so.
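If the underlying need is just drill-down, one alternative is to store the filter values rather than the SQL text, and keep the actual query in your application or a view. A rough sketch (the table and column names here are made up):

-- Store what identifies the detail rows, not the SQL itself
CREATE TABLE Report_Drilldowns (
    Report_Line_ID INT PRIMARY KEY,
    Producer_ID    INT NOT NULL
);

-- Drill down to the underlying sales without executing stored query text
SELECT s.*
FROM Sales s
JOIN Report_Drilldowns d ON d.Producer_ID = s.Producer_ID
WHERE d.Report_Line_ID = 1;

This keeps the query itself out of the data, so there is nothing stored that could be tampered with and later executed.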
It's my first time working with databases so I spent a bunch of hours reading and watching videos. The data I am analyzing is a limited set of marathon data, and the goal is to produce statistics on each runner.
I am looking for advice and suggestions on my database design as well as how I might go about producing statistics. Please see this image for my proposed design:
Basically, I'm thinking there's a many-to-many relationship between Races and Runners: there are multiple runners in a race, and a runner can have run multiple races. Therefore, I have the bridge table called Race_Results to store the time and age for a given runner in a given race.
The Statistics table is what I'm looking to get to in the end. The items in the image are just some examples of things I may want to calculate.
So my questions are:
Does this design make sense? What improvements might you make?
What kinds of SQL queries would be used to calculate these statistics? Would I have to make some other tables in between - for example, to find the percentage of the time a runner finished within 10 minutes of first place, would I have to first make a table of all runner data for that race and then do some queries, or is there a better way? Any links I should check out for more on calculating these sorts of statistics?
Should I possibly be using Python or another language to get these statistics instead of SQL? My understanding is that SQL can potentially cut a few hundred lines of Python code down to one line, so I thought I'd give it a shot with SQL.
Thanks!
I think your design is fine, though Race_Results.Age is redundant - watch out if you update a runner's DOB or a race date.
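If you do drop the stored Age, a minimal sketch of deriving it on the fly instead (assuming a DOB column on Runners and a Date column on Races, which are not shown in the question):

SELECT rr.Runner_ID,
       rr.Race_ID,
       TIMESTAMPDIFF(YEAR, ru.DOB, ra.Date) AS Age_At_Race
FROM Race_Results rr
JOIN Runners ru ON ru.Runner_ID = rr.Runner_ID
JOIN Races   ra ON ra.Race_ID   = rr.Race_ID;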
It should be reasonably easy to create views for each of your statistics. For example:
CREATE VIEW Best_Times AS
SELECT Race_ID, MIN(Time) AS Time
FROM Race_Results
GROUP BY Race_ID;
CREATE VIEW Within_10_Minutes AS
SELECT rr.*
FROM Race_Results rr
JOIN Best_Times b
ON rr.Race_ID = b.Race_ID AND rr.Time <= ADDTIME(b.Time, '00:10:00'); -- ADDTIME works on TIME values; DATE_ADD does not accept a plain TIME
SELECT
rr.Runner_ID,
COUNT(*) AS Number_of_races,
COUNT(w.Runner_ID) * 100 / COUNT(*) AS `% Within 10 minutes of 1st place`
FROM Race_Results rr
LEFT JOIN Within_10_Minutes w
ON rr.Race_ID = w.Race_ID AND rr.Runner_ID = w.Runner_ID
GROUP BY rr.Runner_ID;
1) The design of your 3 tables Races, Race_Results and Runners makes perfect sense. Nothing to improve here. The statistics are something different. If you manage to write those probably slightly complicated queries in a way that lets them be used in a view, you should do that and avoid saving statistics that need to be recalculated each day. Calculating something like this on the fly whenever it is needed is better than saving it, as long as the performance is sufficient.
2) If you were using Oracle or MSSQL, I'd say you would be fine with some aggregate functions and common table expressions. In MySQL, you will have to use GROUP BY and subqueries. That makes the whole approach a bit more complicated, but it is totally feasible.
If you ask for a specific metric in a comment, I might be able to suggest some code, though my expertise is more in Oracle and MSSQL.
3) If you can, put your code in the database. In this way, you avoid frequent context switches between your programming language and the database. This approach usually is the fastest in all database systems.
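As a rough illustration of point 3, one statistic could live in the database as a view, for example (a sketch against the Race_Results table from the question; the column types are assumed):

CREATE VIEW Runner_Average_Times AS
SELECT Runner_ID,
       COUNT(*) AS Races_Run,
       SEC_TO_TIME(AVG(TIME_TO_SEC(Time))) AS Avg_Time  -- assumes Time is a TIME column
FROM Race_Results
GROUP BY Runner_ID;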
I'm working on my first SSRS report and I haven't been able to find general guidelines as to how to create reports. Specifically, I would like to know what the general approach is when aggregate data is needed on a report. For example, let's say I need to show the following in my report:
Pancakes: 34
Eggs: 56
Bacon: 73
I have several more rows like the above that need to show aggregate data. I'm currently grouping the whole row by type, and in each cell I'm showing a count as follows: [Count(Status)].
My report is already taking 45+ seconds to run. Is it generally preferable to do aggregation like this in the query? Or does this depend on the amount of data being returned? Any pointers are greatly appreciated. Thanks!
As with all SQL answers: it depends.
But generally, do your aggregation in SQL. SQL Server is much better at performing aggregation than the report layer. Also, bringing back fewer rows will reduce your data transfer and the amount of data which SSRS needs to process. Usually you would only want to do the aggregation at the report layer if there are other constraints which make doing it in the SQL query impossible, or if doing so would make the report more difficult to maintain in the future. (There's certainly something to be said for sacrificing a bit of performance in the name of maintainability.) One case would be when you need to display all of the data and returning two datasets is either too complicated or actually slows down the performance of the report.
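For example, instead of returning detail rows and grouping them in the report with [Count(Status)], a pre-aggregated dataset query along these lines (the table and column names here are made up) returns one row per type:

SELECT ItemType,
       COUNT(Status) AS ItemCount
FROM ReportItems
GROUP BY ItemType;

The report then only has to render the numbers rather than count thousands of detail rows.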
As a side note, if your report is taking 45+ seconds to run then likely your SQL is not optimized very well or your report is doing a lot of complicated calculations. The more work you can put back on the SQL server the better your performance will be. SQL Server is made for crunching numbers and doing aggregations so certainly let it do what it does best when you can.
YMMV, so always do performance testing for different methods to see what works best.
This is a simple question even though the title sounds complicated.
Let's say I'm storing data from a bunch of applications into one central database/ data warehouse. This is data at a pretty fine level -- say, daily summaries of various metrics.
HOWEVER, I know in the front-end I will be frequently displaying weekly and monthly aggregates of this data as well.
One idea would be to have a scripting language do this for me after querying the SQL database, but that seems like it could be horribly inefficient.
The second idea would be to have views in the database that represent business weeks and months -- this might be the best way to do it.
But my final idea is -- couldn't a SQL client simply run a query that aggregates all the daily data into weeks (or months) and store them in a separate table? The advantage of this is that it would reduce querying time of any user, since all the query work is done before a website or button is even loaded/ pushed. Even with a view, I guess that aggregation calculation would have to be done as soon as the view was queried.
The only downside to aggregating the weekly/monthly data, say, once a day (instead of every time the website is loaded) is that it won't be fully up to date and may reflect inconsistencies.
I'm not really an expert when it comes to this bigger-picture stuff -- anyone have any thoughts? Thanks!
It depends on the user experience you're trying to create.
Is the user base expecting to watch monthly aggregates with one finger on the F5 key when watching this month's statistics? To cover this scenario, you might want a view whose criteria present a window always relative to getdate(). Keep in mind that good indexing strategies and query design should mitigate the impact of this sort of approach to nearly nothing.
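A rough sketch of such a rolling view, with made-up table and column names (MySQL syntax shown; getdate() above is the SQL Server equivalent of CURDATE()):

CREATE VIEW month_to_date_metrics AS
SELECT metric_name, SUM(metric_value) AS month_to_date
FROM daily_metrics
WHERE metric_date >= DATE_FORMAT(CURDATE(), '%Y-%m-01')  -- first day of the current month
GROUP BY metric_name;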
Is the user expecting informational data that doesn't include today's data? You might see better performance from a nightly job that does the aggregation into a new table.
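A minimal sketch of that nightly job (again with made-up names, MySQL syntax), run from cron or a scheduled event:

-- Rebuild the weekly rollup from the daily detail each night
DELETE FROM weekly_metrics;
INSERT INTO weekly_metrics (year_week, metric_name, metric_value)
SELECT YEARWEEK(metric_date), metric_name, SUM(metric_value)
FROM daily_metrics
GROUP BY YEARWEEK(metric_date), metric_name;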
Of all the scenarios, though, I would not recommend manual aggregation. Down that road lie unexpected bugs and exceptions that a good SQL statement would have handled for you. Aggregates are a big part of every DBMS; let the database handle them and spend your effort on the rest of your application.
I have a database table with around 2400 'items'. A user will have any arbitrary combination of items from the 2400 item set. Each item's details then need to be looked up and displayed to the user. What is the most efficient way to get all of the item details? I can think of three methods:
Select all 2400 items and parse the required details client-side.
Select the specific items with a SELECT which could be a very long SQL string (0-2400 ids)?
Select each item one at a time (too many connections)?
I'm not clued up on SQL efficiency so any guidance would help. It may help to know this is a web app heavily AJAX based.
Edit: On average a user will select ~150 items and very rarely more than 400-500.
The best method is to return the data you want from the database in a single query:
select i.*
from items i
where i.itemID in (<list of ids>);
MySQL queries can be quite large (see here), so I wouldn't worry about being able to pass in the values.
However, if your users have so many items, I would suggest storing them in the database first and then doing a join to get the additional information.
If the users never/rarely select more than ~50 elements, then I agree with Gordon's answer.
If it is really plausible that they might select up to 2400 items, you'll probably be better off by inserting the selected ids into a holding table and then joining with that.
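A hedged sketch of that holding-table approach (the temporary table name and example ids are made up):

CREATE TEMPORARY TABLE selected_items (itemID INT PRIMARY KEY);
INSERT INTO selected_items (itemID) VALUES (12), (345), (678);  -- the user's current selection

SELECT i.*
FROM items i
JOIN selected_items s ON s.itemID = i.itemID;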
However, a more thorough answer can be found here - which I found through this answer.
He concludes that:
We see that for a large list of parameters, passing them in a temporary table is much faster than as a constant list, while for small lists performance is almost the same.
'Small' and 'large' are hardly static, but dependent upon your hardware - so you should test. My guess would be that with an average of 150 elements in your IN-list, you will see the temp table win.
(If you do test, please come back here and say what is the fastest in your setup.)
2400 items is nothing. I have a MySQL database with hundreds of thousands of rows and relations, and it works perfectly, just by optimizing the queries.
What you must do is see how long the execution time is for each SQL query. You can then optimize each query on its own, trying different queries and measuring the execution time.
You can use, e.g., MySQL Workbench, Sequel Pro, Microsoft SQL Server Management Studio or other software for building queries. You can also add indexes to your tables, which can improve queries as well.
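For example, if your queries filter on a hypothetical category column, an index like this could help (the column name is made up):

CREATE INDEX idx_items_category ON items (category);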
If you need to scale your database up, you can use software like http://hadoop.apache.org
Also a great thing to mention is NoSQL (Not Only SQL). It's a non-relational kind of database, which can handle dynamic attributes and is built for handling large amounts of data.
As you mention, you could use AJAX. But that only helps the load time of the page and the stress on your web server (not the SQL server). Just ask if you want more info or a more in-depth explanation.
I find that when trying to construct complex MySQL joins and groups between many tables I usually run into strife and have to spend a lot of 'trial and error' time to get the result I want.
I was wondering how other people approach the problems. Do you isolate the smaller blocks of data at the end of the branches and get these working first? Or do you start with what you want to return and just start linking tables on as you need them?
Also wondering if there are any good books or sites about approaching the problem.
I don't work in MySQL, but I do frequently write extremely complex SQL, and here's how I approach it.
First, there is no substitute whatsoever for thoroughly understanding your database structure.
Next I try to break up the task into chunks.
For instance, suppose I'm writing a report concerning the details of a meeting (the company I work for does meeting planning). I will need to know the meeting name and sales rep, the meeting venue and dates, the people who attended, and the speaker information.
First I determine which of the tables will have the information for each field in the report. Now I know what I will have to join together, but not exactly how as yet.
So first I write a query to get the meetings I want. This is the basis for all the rest of the report, so I start there. Now the rest of the report can probably be done in any order, although I prefer to work through the parts that should have one-to-one relationships first, so next I'll add the joins and the fields that will get me all the sales-rep-associated information.
Suppose I only want one rep per meeting (if there are multiple reps, I only want the main one) so I check to make sure that I'm still returning the same number of records as when I just had meeting information. If not I look at my joins and decide which one is giving me more records than I need. In this case it might be the address table as we are storing multiple address for the rep. I then adjust the query to get only one. This may be easy (you may have a field that indicates the specific unique address you want and so only need to add a where condition) or you may need to do some grouping and aggregate functions to get what you want.
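A rough sketch of the two fixes described above, with entirely made-up meeting-planning table and column names:

-- Variant 1: a where condition on a flag that marks the one address you want
SELECT m.Meeting_ID, a.Street, a.City
FROM Meetings m
JOIN Rep_Addresses a ON a.Rep_ID = m.Rep_ID AND a.Is_Primary = 1;

-- Variant 2: grouping and an aggregate collapse the duplicates instead
SELECT m.Meeting_ID, MIN(a.Street) AS Street
FROM Meetings m
JOIN Rep_Addresses a ON a.Rep_ID = m.Rep_ID
GROUP BY m.Meeting_ID;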
Then I go on to the next chunk (working first through all the chunks that should have a 1-1 relationship to the central data, in this case the meeting). Run the query and check the data after each addition.
Finally I move to those records which might have a one-many relationship and add them. Again I run the query and check the data. For instance, I might check the raw data for a particular meeting and make sure what my query is returning is exactly what I expect to see.
Suppose in one of these additions of a join I find the number of distinct meetings has dropped. Oops, then there is no data in one of the tables I just added and I need to change that to a left join.
Another time I may find too many records returned. Then I look to see if my where clause needs more filtering info or if I need to use an aggregate function to get the data I need. Sometimes I will add other fields to the report temporarily to see if I can spot what is causing the duplicated data. This helps me know what needs to be adjusted.
The real key is to work slowly, understand your data model and check the data after every new chunk is added to make sure it is returning the results the way you think they should be.
Sometimes, if I'm returning a lot of data, I will temporarily put an additional where clause on the query to restrict it to a few items I can easily check. I also strongly suggest the use of order by because it will help you see if you are getting duplicated records.
Well, the best approach to breaking down your MySQL query is to run the EXPLAIN command, and to look at the MySQL documentation on optimization with EXPLAIN.
MySQL provides some great free GUI tools as well; the MySQL Query Browser is what you need to use.
Running EXPLAIN breaks down how MySQL interprets your query and displays its complexity. It might take some time to decode the output, but that's another question in itself.
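For example, you simply prefix the query you are tuning with EXPLAIN (the query below is made up):

EXPLAIN
SELECT o.*
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE c.country = 'DK';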
As for a good book I would recommend: High Performance MySQL: Optimization, Backups, Replication, and More
I haven't used them myself so can't comment on their effectiveness, but perhaps a GUI based query builder such as dbForge or Code Factory might help?
And while the use of Venn diagrams to think about MySQL joins doesn't necessarily help with the SQL, they can help visualise the data you are trying to pull back (see Jeff Atwood's post).