What is the efficiency or other difference between these two queries? - mysql

I downloaded the Yelp dataset and loaded it into MySQL, since the datasets I have been working with so far have been too small to force me to think about efficiency. I am trying to unlearn, or at least become aware of, bad SQL habits that will cause problems with larger datasets.
Here are two ways of pulling exactly the same information out of the database:
USE yelp_db;
SELECT name, hours FROM business
LEFT JOIN hours
ON business.id = hours.business_id;
-- time taken 0.0025 sec, 776071 rows returned
SELECT name, hours FROM
(SELECT name, id FROM business) AS b
LEFT JOIN
(SELECT hours, business_id FROM hours) AS h
ON b.id = h.business_id;
-- time taken 0.0015 sec, 776071 rows returned
Here is a sample of the output:
John's Chinese BBQ Restaurant NULL
Primal Brewery Monday|16:00-22:00
Primal Brewery Tuesday|16:00-22:00
Primal Brewery Friday|12:00-23:00
The first method takes 3 lines but appears to be slightly slower than the second method which takes 5 lines.
Is either of these methods preferred in terms of efficiency or elegance and if so why?

The first method is preferred for both performance and elegance -- your results notwithstanding.
I'm a little suspicious about the timings. I would expect more than a millisecond or two to return close to a million rows.
In any case, most versions of MySQL (the most recent may be an exception) materialize subqueries in the FROM clause. This adds an extra layer of writes and reads to the query, and it can also prevent the use of indexes.
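You can check what the optimizer actually did with EXPLAIN. Here is a sketch against your two tables; on 5.7+ the derived tables may get merged, in which case the two plans will match:
EXPLAIN
SELECT name, hours FROM
    (SELECT name, id FROM business) AS b
LEFT JOIN
    (SELECT hours, business_id FROM hours) AS h
ON b.id = h.business_id;
-- Rows with select_type = DERIVED mean the subqueries were materialized into
-- temporary tables. On MySQL 5.7+ the optimizer can merge such derived tables,
-- and the plan then looks the same as the plain JOIN's.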
As for elegance, unnecessary subqueries do nothing for "elegance". This might be a matter of opinion, but I'm guessing it is a pretty widespread one.

Just to expand on #GordanLinoff's excellent answer with why you might see this difference.
If you ran them in the order shown, simple caching of the data by the first query could explain the timing. This caching can happen in many places, all the way down to the disk controllers.
The only way to test with useful results is to run many iterations and average the results after clearing all the caches.
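For what it's worth, the one cache you can bypass from SQL itself on MySQL 5.x is the query cache (it was removed entirely in 8.0); OS and controller caches have to be cleared outside the server. A rough sketch:
SELECT SQL_NO_CACHE name, hours
FROM business
LEFT JOIN hours ON business.id = hours.business_id;
-- Repeat this a number of times and average the wall-clock times; discard the
-- first (cold) run or clear the OS caches between runs for a fair comparison.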

Related

COUNT(*) vs manual tracking of counters?

I have a table with approx. 70000 entries. It holds information about brands, models and categories of goods. The user can query them using any combination of those, and the displayed counter of goods matching the criteria has to be updated according to his selection.
I do this using a query like
SELECT model, COUNT(*) AS count FROM table $model_where
GROUP BY model
ORDER BY count DESC
where $model_where depends on what the other conditions were. My boss asked me to redo these queries to use a special counter table, because he believes they are slowing the whole process down, but a benchmark I ran suggests otherwise; sample output:
The code took: 0 wallclock secs (0.02 usr + 0.00 sys = 0.02 CPU)
This measures the whole routine from the start until the data is sent to the user, so you can see it's really fast.
I have done some research on this matter, but I still haven't seen a definitive answer as to when to use COUNT(*) versus counter tables. Who is right? I'm not persuaded we actually need manual tracking of this, but maybe I just don't know enough.
Depending on your specific case, this might or might not be premature optimization.
If next week your tables will be 100x bigger, it might not be; otherwise it is.
Also, your boss should take into consideration that you and everybody else will have to make sure the counters are updated whenever an INSERT or DELETE happens on the counted records. There are frameworks which do that automatically (Ruby on Rails's ActiveRecord comes to mind), but if you're not using one of them, there are about a gazillion ways you can end up with wrong counters in the DB.
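To make the maintenance burden concrete, here is a rough sketch of what the manual tracking tends to look like with triggers -- the table and column names (goods, model, model_counts) are made up for illustration:
CREATE TABLE model_counts (
    model VARCHAR(100) PRIMARY KEY,
    cnt   INT NOT NULL
);
CREATE TRIGGER goods_after_insert AFTER INSERT ON goods
FOR EACH ROW
    INSERT INTO model_counts (model, cnt) VALUES (NEW.model, 1)
    ON DUPLICATE KEY UPDATE cnt = cnt + 1;
CREATE TRIGGER goods_after_delete AFTER DELETE ON goods
FOR EACH ROW
    UPDATE model_counts SET cnt = cnt - 1 WHERE model = OLD.model;
-- Miss a single code path (bulk loads, UPDATEs that change the model, manual
-- fixes) and the counters silently drift away from what COUNT(*) would return.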

SQL Database design for statistical analysis of many-to-many relationship

It's my first time working with databases so I spent a bunch of hours reading and watching videos. The data I am analyzing is a limited set of marathon data, and the goal is to produce statistics on each runner.
I am looking for advice and suggestions on my database design as well as how I might go about producing statistics. Please see this image for my proposed design:
Basically, I'm thinking there's a many-to-many relationship between Races and Runners: there are multiple runners in a race, and a runner can have run multiple races. Therefore, I have the bridge table called Race_Results to store the time and age for a given runner in a given race.
The Statistics table is what I'm looking to get to in the end. In the image are just some random things I may want to calculate.
So my questions are:
Does this design make sense? What improvements might you make?
What kinds of SQL queries would be used to calculate these statistics? Would I have to make some other tables in between - for example, to find the percentage of the time a runner finished within 10 minutes of first place, would I have to first make a table of all runner data for that race and then do some queries, or is there a better way? Any links I should check out for more on calculating these sorts of statistics?
Should I possibly be using Python or another language to get these statistics instead of SQL? My understanding is that SQL has the potential to cut a few hundred lines of Python code down to one line, so I thought I'd give it a shot with SQL.
Thanks!
I think your design is fine, though Race_Results.Age is redundant - watch out if you update a runner's DOB or a race date.
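If you drop that column, the age can be derived whenever you need it. A sketch, guessing at the column names in your image (Runners.DOB and Races.Race_Date):
SELECT rr.Runner_ID, rr.Race_ID,
       TIMESTAMPDIFF(YEAR, ru.DOB, ra.Race_Date) AS Age_At_Race
FROM Race_Results rr
JOIN Runners ru ON ru.Runner_ID = rr.Runner_ID
JOIN Races   ra ON ra.Race_ID   = rr.Race_ID;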
It should be reasonably easy to create views for each of your statistics. For example:
CREATE VIEW Best_Times AS
SELECT Race_ID, MIN(Time) AS Time
FROM Race_Results
GROUP BY Race_ID;
CREATE VIEW Within_10_Minutes AS
SELECT rr.*
FROM Race_Results rr
JOIN Best_Times b
ON rr.Race_ID = b.Race_ID AND rr.Time <= DATE_ADD(b.Time, INTERVAL 10 MINUTE);
SELECT
rr.Runner_ID,
COUNT(*) AS Number_of_races,
COUNT(w.Runner_ID) * 100 / COUNT(*) AS `% Within 10 minutes of 1st place`
FROM Race_Results rr
LEFT JOIN Within_10_Minutes w
ON rr.Race_ID = w.Race_ID AND rr.Runner_ID = w.Runner_ID
GROUP BY rr.Runner_ID
1) The design of your 3 tables Races, Race_Results and Runners makes perfect sense. Nothing to improve here. The statistics are something different. If you manage to write those (probably slightly complicated) queries in a way that lets them be used in a view, you should do that and avoid storing statistics that would need to be recalculated every day. Calculating something like this on the fly whenever it is needed is better than saving it, as long as the performance is sufficient.
2) If you were using Oracle or MSSQL, I'd say you would be fine with some aggregate functions and common table expressions. In MySQL, you will have to use GROUP BY and subqueries, which makes the whole approach a bit more complicated, but it is entirely feasible.
If you ask for a specific metric in a comment, I might be able to suggest some code, though my expertise is more in Oracle and MSSQL.
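Just as an illustration, a "races won per runner" statistic can be done with nothing more than GROUP BY and a subquery against the schema above (ties count as a win for everyone sharing the best time):
SELECT rr.Runner_ID, COUNT(*) AS Wins
FROM Race_Results rr
JOIN (SELECT Race_ID, MIN(Time) AS Win_Time
      FROM Race_Results
      GROUP BY Race_ID) w
  ON w.Race_ID = rr.Race_ID
 AND rr.Time = w.Win_Time
GROUP BY rr.Runner_ID;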
3) If you can, put your code in the database. That way you avoid frequent context switches between your programming language and the database. This approach is usually the fastest in all database systems.
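As a minimal sketch of what that could look like here (the procedure name is just illustrative), a statistics query can live in a view or a stored procedure and the application only calls it:
DELIMITER //
CREATE PROCEDURE runner_statistics()
BEGIN
    SELECT Runner_ID,
           COUNT(*)  AS Races_Run,
           MIN(Time) AS Personal_Best
    FROM Race_Results
    GROUP BY Runner_ID;
END //
DELIMITER ;
-- The application then just runs: CALL runner_statistics();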

T-SQL Optimized conditional join

Hey guys it's Brian from OMDbAPI.com
I hit a little speed bump when trying to use a single query for both Movie and Episode data. I recently started collecting additional Episode details in a separate table (only two new columns have been added: Season # and Episode #). I put them in a separate table because those columns would be NULL in my main table 90% of the time, while the other columns apply to both movies and episodes (title, rating, release date, plot, etc.).
So I'm trying to use a single query to return Movie data, but if the ID has type = 'episode', also return the additional fields from the other table. The problem is that I don't know the ID is an episode until it's queried, and the fewer calls to the database (and the smaller the execution plan) the better, as this is called hundreds of times per second (currently 25+ million requests a day).
I created a small SQL Fiddle of what I'm trying to achieve.
My question is: what is the best method, with the least performance cost, to show these fields if it's an episode and to suppress them completely if not? Is dynamic SQL my only option? Thanks.
Supposing that each Movie row is associated with at most one Episode row, you are certain to get the best query plans by putting the episode data in the Movie table instead of in a separate one. That avoids having to determine during query execution whether to look at the episode data, and it also avoids any need for a JOIN when you do need it.
Having the 90% NULL episode data in your Movie table will cost you some space, and therefore it will have some performance impact, but I'm inclined to think that the resulting simpler query plans will offset that cost.
JOINing the tables every time is your next best bet, I think. That still gives you reasonably simple query plans, and it looks for performance gains by keeping the Movie rows smaller. Still, as a general rule, the fewer JOINs you perform, the faster your queries will run.
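For the JOIN route, the single query is just a LEFT JOIN; the episode columns simply come back as NULL for plain movies. A sketch with guessed table and column names (Movie, EpisodeDetails, MovieID), since I can't see your Fiddle from here:
SELECT m.ID, m.Title, m.Type, m.Released, m.Plot,
       e.Season, e.Episode        -- NULL unless m.Type = 'episode'
FROM dbo.Movie AS m
LEFT JOIN dbo.EpisodeDetails AS e
       ON e.MovieID = m.ID
WHERE m.ID = @RequestedID;        -- @RequestedID = the id supplied per request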

What is the optimal solution, use Inner Join or multiple queries?

What is the optimal solution, use Inner Join or multiple queries?
something like this:
SELECT * FROM brands
INNER JOIN cars
ON brands.id = cars.brand_id
or like this:
SELECT * FROM brands
... (then, looping over each returned brand row) ...
SELECT * FROM cars WHERE brand_id = [row(brands.id)]
Generally speaking, one query is better, but there are some caveats. For example, older versions of SQL Server had a great decrease in performance if you did more than seven joins. The answer really depends on the database engine, version, query, schema, fields, etc., so we can't say for sure which is better. Always look into minimizing the number of queries when possible, without going overboard and creating result sets that are crazy or impossible to maintain.
This is a very subjective question but remember that each time you call the database there's significant overhead.
Almost without exception the optimum is to issue as few commands as possible and pull out all the data you'll need. However, for practical reasons this clearly may not always be possible.
Generally speaking, if a database is well maintained, one query is quicker than two. If it's not, you need to look at your data/indices and determine why.
A final point: you're hinting in your second example that you'd load the brands and then issue a command to get all the cars for each brand. This is without a doubt your worst option, as it doesn't issue 2 commands - it issues N+1, where N is the number of brands you have... 100 brands means 101 DB hits!
Your two queries are not exactly the same.
The first returns all fields from brands and cars in one row. The second returns two different result sets that need to be combined together.
In general, it is better to do as many operations in the database as possible. The database is more efficient for processing large amounts of data. And, it generally reduces the amount of data being brought back to the client.
That said, there are a few circumstances where a single query brings back more data than multiple queries would. For instance, in your example, if you have one brand record with 100 columns and 10,000 car records with three columns, then the two-query method is probably faster: you only bring back the columns from the brands table once rather than 10,000 times.
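In that situation the two-query pattern would look something like this (the id and the car columns are just placeholders):
SELECT * FROM brands WHERE id = 42;                      -- the wide brand row, fetched once
SELECT model, year, price FROM cars WHERE brand_id = 42; -- only the narrow car columns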
Such cases where multiple queries are better are few and far between. In general, it is better to do the processing in the database. If performance needs to be improved, then in a few rare cases you might be able to break up queries and improve performance.
In general, use the first query. Why? Because total query time is not just the execution time of the query itself; it also includes overheads such as:
Creating connection overhead
Network data sending overhead
Closing (handling) connection overhead
Depending on the situation, some of these overheads may or may not apply. For example, if you're using a persistent connection, you won't pay the connection overhead. But in the common case that's not true, so it does apply, and the creating/maintaining/closing connection overhead is a very significant part. Imagine that this overhead is only 1% of the total query time (in a real situation it will be much more), and that you have, let's say, 1,000,000 rows at roughly 100 cars per brand. Then the first query pays that overhead only once, while the second pays it 1,000,000 / 100 = 10,000 times, once per brand. Just think about how slow that will be.
Besides, the INNER JOIN will also be done using a key, if one exists, so in terms of the query itself the speed will be nearly the same. So I highly recommend using the INNER JOIN option.
Breaking a complex query into simple queries can be useful in very specific cases. One example is the IN subquery. If you use WHERE id IN (subquery), where (subquery) is some SQL, MySQL will treat it as an = ANY subquery and will not use a key for it, even if the subquery results in a narrow list of ids. And yes, splitting it into two queries can make sense, because WHERE id IN (static list) works differently: MySQL will use a range index scan for it (strange, but true, because for an IN (static list) statement IN is treated as a comparison operator rather than an = ANY subquery qualifier). This part isn't directly about your case, but it shows that cases do exist where splitting processing away from the DBMS is useful for performance.
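To illustrate with the tables from your question (the country column is made up): on the affected MySQL versions the first form runs as an = ANY subquery and may ignore the index on cars.brand_id, while the second, with the ids fetched beforehand, gets a range scan:
SELECT * FROM cars
WHERE brand_id IN (SELECT id FROM brands WHERE country = 'DE');  -- executed as = ANY
SELECT * FROM cars
WHERE brand_id IN (1, 7, 42);                                    -- static list, range scan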
One query is better, because up to about 90% of the expense of executing a query is in the overheads:
communication traffic to/from database
syntax checking
authority checking
access plan calculation by optimizer
logging
locking (even read-only requires a lock)
lots of other stuff too
Do all that just once for one query, or do it all n times for n queries, but get the same data.

Doing SUM() and GROUP BY over millions of rows on mysql

I have this query which only runs once per request.
SELECT SUM(numberColumn) AS total, groupColumn
FROM myTable
WHERE dateColumn < ? AND categoryColumn = ?
GROUP BY groupColumn
HAVING total > 0
myTable has less than a dozen columns and can grow up to 5 million rows, but will more likely hold about 2 million in production. All columns used in the query are numbers, except for dateColumn, and there are indexes on dateColumn and categoryColumn.
Would it be reasonable to expect this query to run in under 5 seconds with 5 million rows on most modern servers, if the database is properly optimized?
The reason I'm asking is that we don't have 5 million rows of data and won't even hit 2 million within the next few years, so if the query doesn't run in under 5 seconds by then, it will be hard to know where the problem lies. Would it be because the query is not suitable for a large table, because the database isn't optimized, or because the server isn't powerful enough? Basically, I'd like to know whether using SUM() and GROUP BY over a large table is reasonable.
Thanks.
As people suggested in the comments under your question, the easiest way to verify is to generate random data and test the query execution time. Please note that using a clustered index on dateColumn can significantly change execution times, because with a "<" condition only a contiguous subset of the data on disk needs to be read in order to calculate the sums.
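If you want to generate the test data in pure SQL, here is a minimal sketch (the value ranges are arbitrary and the column types are guessed from your description: all numeric except dateColumn):
CREATE TABLE digits (d INT PRIMARY KEY);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
-- Cross-joining the 10-row helper table six times yields 10^6 random rows.
INSERT INTO myTable (numberColumn, groupColumn, categoryColumn, dateColumn)
SELECT FLOOR(RAND() * 1000),
       FLOOR(RAND() * 50),
       FLOOR(RAND() * 10),
       NOW() - INTERVAL FLOOR(RAND() * 1000) DAY
FROM digits a, digits b, digits c, digits d, digits e, digits f;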
If you are at the beginning of the development process, I'd suggest concentrating not on the structure of the table and indexes that collect the data, but rather on what you expect to need to retrieve from the table in the future. I can share my own experience of presenting a website administrator with web usage statistics. I had several webpages being requested from the server, each of them falling into one or more "categories". My first approach was to collect each request in a log table with some indexes, but the table grew much larger than I had at first estimated. :-) Because the statistics were analyzed in fixed groups (weekly, monthly, and yearly), I decided to create an additional table that aggregated requests into predefined week/month/year groups. Each request incremented the relevant columns, which corresponded to my "categories". This broke some normalization rules, but allowed me to calculate the statistics in the blink of an eye.
An important question is the dateColumn < ? condition. I am guessing it is filtering out records that are out of date. It doesn't really matter how many records there are in the table; what matters is how many records this condition cuts the set down to.
Having aggressive filtering by date combined with partitioning the table by date can give you amazing performance on ridiculously large tables.
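A hedged sketch of what that partitioning could look like -- note that MySQL requires the partitioning column to be part of every unique key on the table, so the exact DDL depends on your keys:
ALTER TABLE myTable
PARTITION BY RANGE (TO_DAYS(dateColumn)) (
    PARTITION p2014 VALUES LESS THAN (TO_DAYS('2015-01-01')),
    PARTITION p2015 VALUES LESS THAN (TO_DAYS('2016-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
-- With dateColumn < ? in the WHERE clause, MySQL can prune partitions that
-- cannot contain matching rows.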
As a side note, if you are not expecting to hit this much data for many years to come, don't bother solving it now. Your business requirements may change a dozen times by then, together with the architecture, DB layout, design, and implementation details. Planning ahead is great, but sometimes you want to deliver a good enough solution as soon as possible and handle the future painful issues in the next release.