T-SQL Optimized conditional join - sql-server-2008

Hey guys it's Brian from OMDbAPI.com
I hit a little speed bump when trying to use a single query for both Movie and Episode data. I recently started collecting additional Episode details in a separate table (only two new columns, Season # and Episode #). I put them in a separate table because those columns would be NULL in my main table 90% of the time, but the other columns do apply to both movies and episodes (title, rating, release date, plot, etc.).
So I'm trying to use a single query for returning Movie data, but if the ID has type = 'episode', also return the additional fields from the other table. The problem is I don't know that an ID is an episode until it's queried, and the fewer calls to the database (and the smaller the execution plan) the better, as this is called hundreds of times per second (currently 25+ million requests a day).
I created a small SQL Fiddle of what I'm trying to achieve.
My question is what is the best method with the least performance cost to show these fields if it's an episode and completely suppress them if not? Is Dynamic SQL my only option? Thanks.

Supposing that each Movie row is associated with at most one Episode row, you are certain to get the best query plans by putting the episode data in the Movie table instead of in a separate one. That avoids having to determine during query execution whether to look at the episode data, and it also avoids any need for a JOIN when you do need it.
Having the 90% NULL episode data in your Movie table will cost you some space, and therefore it will have some performance impact, but I'm inclined to think that the resulting simpler query plans will offset that cost.
JOINing the tables every time is your next best bet, I think. That gives you reasonably simple query plans, and looks for performance gains through reducing the size of the Movie data. Still, as a general rule, the fewer JOINs you perform, the faster your queries will run.
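For what it's worth, the JOIN approach can stay a single static query: a LEFT JOIN simply returns NULL for the episode columns on movie rows rather than hiding them. A minimal sketch, with guessed table and column names (the join key in particular is an assumption):

SELECT m.ID, m.Title, m.Rating, m.Released, m.Plot, m.Type,
       e.Season, e.Episode        -- NULL for anything that isn't an episode
FROM   Movies AS m
LEFT JOIN Episodes AS e
       ON e.MovieID = m.ID
WHERE  m.ID = @ID;

If the caller genuinely must not see the Season/Episode columns for movies, that is a presentation concern better handled in the application layer than with dynamic SQL.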

Related

Database too large - store as a row or serialise data?

I have a Quiz App that consists of many Modules containing Questions. Each question has many Categories (many-to-many). Every time a quiz is completed, the user's score is sent to the Scores table. (I've attached an entity-relation diagram for clarification purposes.)
I have been thinking of breaking down the user scores according to categories (i.e. a user when completing a quiz will get an overall quiz score along with score for each category).
However, if each quiz consists of at least 30 questions, there could be around 15-20 categories per quiz. So if one user completes a quiz, it would create a minimum of 15-20 rows in the scores table. With multiple users, the Scores table would get really big really fast.
I assume this would affect the performance of retrieving data from the Scores table. For example, if I wanted to calculate the average score for a user for a specific category.
Does anyone have a better suggestion for how I can still be able to store scores based on categories?
I thought about serialising the JSON data, but of course, this has its limitations.
The DB should be able to handle millions of rows and there is nothing inherently wrong with your design. A few things I would suggest:
Put indexes on the following columns (or combinations of them): user id, exam id (which I assume is what you call scorable id), exam type (scorable type?), and creation date (a sketch follows after these suggestions).
As your table grows, partition it. Potential candidates are creation-date buckets (by year, or year/month, would probably work well), or, if students are in particular classes, class buckets.
As your table grows even more, you could move the partitions to different disks (how you partition the data will be even more crucial here, because if a query has to span too many partitions you may end up hurting performance instead of helping).
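As a rough illustration of the index suggestion, with made-up table and column names (adjust to whatever your schema actually uses):

CREATE INDEX idx_scores_user_exam
    ON scores (user_id, scorable_id, scorable_type, created_at);

CREATE INDEX idx_scores_created_at
    ON scores (created_at);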
Beyond that, another suggestion would be to break the scores table into two tables, score and scoreDetail. The score table would contain top-level data like user id, exam id, overall score, etc., while the child table would contain the scores by category (philosophy, etc.). I would bet 80% of the time people only care about the top score anyway. This way you only reach out to the bigger table when someone wants the details of their score on a particular exam.
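A rough sketch of that split, again with hypothetical names and MySQL-style syntax:

CREATE TABLE score (
    score_id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id       INT UNSIGNED NOT NULL,
    scorable_id   INT UNSIGNED NOT NULL,
    scorable_type VARCHAR(50)  NOT NULL,
    overall_score DECIMAL(5,2) NOT NULL,
    created_at    DATETIME     NOT NULL
);

CREATE TABLE score_detail (
    score_id       INT UNSIGNED NOT NULL,
    category_id    INT UNSIGNED NOT NULL,
    category_score DECIMAL(5,2) NOT NULL,
    PRIMARY KEY (score_id, category_id),
    FOREIGN KEY (score_id) REFERENCES score (score_id)
);

A request for a user's history then touches only score; score_detail is joined in only when the per-category breakdown is actually needed.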
Finally, you probably want to have the score by category in rows rather than columns to make it easier to do analysis and aggregations, but this is not necessarily a performance booster and really depends on how you plan to use the data.
In the end though, the best optimizations really depend on how you plan to use your data. I would suggest just creating a random data set that represents a few years worth of data and play with that.
I doubt that serialization would give you a significant benefit.
I would even dare to say that you'd kind of limit the power of a database by doing so.
Relational databases are designed to store a lot of rows in their tables, and they also usually use their own compression algorithms, so you should be fine.
Additionally, you would need to deserialize every time you want to read from your table. That would eliminate the possibility of using SQL statements for sorting, filtering, JOINing, etc.
So in the end you will probably cause yourself more trouble by serializing than by simply storing the rows.

Performing Calculations SQL

I am trying to take information from one MySQL table, perform a bunch of calculations on this data, and then put the results in a second MySQL table. What would be the best way of doing this (i.e. in MySQL itself, using python, etc.)?
My apologies for the vagueness, I'll try to be more specific. Table 1 has every meal that every person in my class eats, so each meal is a primary key, and other columns include the person and the number of calories. The primary key for Table 2 is the person, and another column is the percentage of total calories this person has eaten, out of the calories of the entire class. Another column is the percentage of total calories of this person's gender in the class. Every day, I want to take the new eating information, and use it to update the percentages in Table 2. (Thanks for the help!)
Assuming the calculations can be done in SQL (and percentages are definitely doable), you have some choices.
The first, and academically correct, choice is not to store this in a table at all. One of the principles of normalization is that you don't store duplicate or calculated values - instead, you calculate them as you need them.
This isn't just an academic concern - it avoids many silly bugs and anomalies, and it means your data is always up to date - you don't have to wait for your calculation query to run before you can use the data.
If the calculation is non-trivial and/or an essential part of the business domain, common practice is to create a database view, which behaves like a table when queried, but is actually calculated on the fly. This means that the business logic is encapsulated in the view, rather than repeated in multiple queries. You can go further, with materialized views etc. - but the basic principle is the same.
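For example, the percentage calculations from the question could live in a view along these lines. This is a minimal sketch that assumes a single meals table with person_id, gender and calories columns; all names here are hypothetical and would need adapting to the real schema:

CREATE VIEW calorie_shares AS
SELECT person_id,
       100 * SUM(calories) / (SELECT SUM(calories) FROM meals) AS pct_of_class,
       100 * SUM(calories) /
             (SELECT SUM(m2.calories) FROM meals m2 WHERE m2.gender = meals.gender)
           AS pct_of_gender
FROM meals
GROUP BY person_id, gender;

Querying the view each day gives up-to-date percentages without maintaining a second table at all.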
In some cases, where the volume of data is huge, or the calculations are time consuming, or you have calculations that are very hard to embed in a single SQL statement, it's common to create "aggregate tables" - this is what you are suggesting. You can populate these tables either by (scheduled) queries, or by using database triggers.
However, aggregate tables are a last resort - they make the solution much harder to maintain and debug - if the data is wrong, you don't have a single query to debug, you've got to follow the chain of logic all the way through.
Assuming you are in a class of a few dozen people, and are reporting on fewer than 10 million meals, any modern RDBMS can calculate this report in milliseconds - there's really no need to store it in an aggregate table.
A possible solution could be that you create a View or a Materialized View with the complex SELECT query behind it.
A materialized view in particular could be a good fit, as you wrote that you would like these results re-queried/refreshed every day.
If you need to do more advanced operations on those tables, you could create a Stored procedure and call it when you need its data.
Note: unlike with, say, a temporary table, you can't do further work with a procedure's result set (e.g. you can't call it from a SELECT to JOIN against its results).
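A minimal sketch of that workaround, with hypothetical names (in the mysql command-line client you would also wrap the procedure definition in DELIMITER statements):

CREATE PROCEDURE refresh_calorie_totals()
BEGIN
    -- Rebuild a session-scoped temporary table holding the aggregated results
    DROP TEMPORARY TABLE IF EXISTS tmp_calorie_totals;
    CREATE TEMPORARY TABLE tmp_calorie_totals AS
        SELECT person_id, SUM(calories) AS total_calories
        FROM meals
        GROUP BY person_id;
END;

After CALL refresh_calorie_totals(); the temporary table can be SELECTed from and JOINed like any other table for the rest of the session.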

Doing SUM() and GROUP BY over millions of rows on mysql

I have this query which only runs once per request.
SELECT SUM(numberColumn) AS total, groupColumn
FROM myTable
WHERE dateColumn < ? AND categoryColumn = ?
GROUP BY groupColumn
HAVING total > 0
myTable has less than a dozen columns and can grow up to 5 million rows, though more likely about 2 million in production. All columns used in the query are numbers, except for dateColumn, and there are indexes on dateColumn and categoryColumn.
Would it be reasonble to expect this query to run in under 5 seconds with 5 million rows on most modern servers if the database is properly optimized?
The reason I'm asking is that we don't have 5 million rows of data yet, and we won't even hit 2 million within the next few years; if the query doesn't run in under 5 seconds then, it will be hard to know where the problem lies. Would it be because the query is not suitable for a large table, or because the database isn't optimized, or because the server isn't powerful enough? Basically, I'd like to know whether using SUM() and GROUP BY over a large table is reasonable.
Thanks.
As people in comments under your question suggested, the easiest way to verify is to generate random data and test query execution time. Please note that using a clustered index on dateColumn can significantly change execution times, because with a "<" condition only a contiguous subset of the data on disk has to be read in order to calculate the sums.
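One thing worth testing on that generated data is a covering index that matches the query (column names taken from the query above; whether it helps enough is something only the measurement will tell you):

ALTER TABLE myTable
    ADD INDEX idx_category_date (categoryColumn, dateColumn, groupColumn, numberColumn);

With the equality column first and the range column second, MySQL can read just the matching index range, and because the index also contains groupColumn and numberColumn it never has to touch the table rows.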
If you are at the beginning of the development process, I'd suggest concentrating not on the structure of the table and indexes that collect the data, but rather on what you expect to need to retrieve from the table in the future.

I can share my own experience with presenting a website administrator with web usage statistics. I had several webpages being requested from the server, each of them falling into one or more "categories". My first approach was to collect each request in a log table with some indexes, but the table grew much larger than I had first estimated. :-) Because the statistics were analyzed in fixed groups (weekly, monthly, and yearly), I decided to create an additional table that aggregated requests into predefined week/month/year groups. Each request incremented the relevant columns - the columns corresponded to my "categories". This broke some normalization rules, but allowed me to calculate statistics in the blink of an eye.
An important question is the dateColumn < ? condition. I am guessing it filters out records that are out of date. It doesn't really matter how many records there are in the table; what matters is how many records this condition cuts the data down to.
Having aggressive filtering by date combined with partitioning the table by date can give you amazing performance on ridiculously large tables.
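A rough sketch of date partitioning in MySQL, assuming range partitioning on the query's dateColumn (note that MySQL requires the partitioning column to be part of every unique key on the table, so the primary key may need adjusting first):

ALTER TABLE myTable
    PARTITION BY RANGE (TO_DAYS(dateColumn)) (
        PARTITION p2013 VALUES LESS THAN (TO_DAYS('2014-01-01')),
        PARTITION p2014 VALUES LESS THAN (TO_DAYS('2015-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

With a dateColumn < ? predicate, the optimizer can prune partitions and scan only the date ranges that can possibly match.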
As a side note, if you are not expecting to hit this much data for many years to come, don't bother solving it now. Your business requirements may change a dozen times by then, together with the architecture, DB layout, design and implementation details. Planning ahead is great, but sometimes you want to ship a good-enough solution as soon as possible and handle the painful issues in a later release.

MySQL: normalization, is this a valid exception?

We have 10 years of archived sports data, spread across separate databases.
We're trying to consolidate all the data into a single database. Since we'll be handling 10X the number of records, I'm trying to make schema redesign changes now to avoid a potential performance hit.
One change entails breaking up the team roster table into two tables: a players table that stores fixed data (playerID, firstName, lastName, birthDate, etc.), and a new roster table that stores variable data about a player (yearInSchool, jerseyNumber, position, height, weight, etc.). This will allow us to, among other things, create four-year career aggregate views of player stats.
Fair enough, makes sense, but then again, when I look at queries that tally, for example, a player's aggregate scoring stats, I have to join on both the players and roster tables, in addition to the scoring and schedule tables, in order to get all the information needed.
Where I'm considering denormalizing is with player first and last name. If I store player first and last name in the roster table, then I can omit the player table from the equation for stat queries, which I'm assuming will be a big performance win given that total record count per table will be over 100K (i.e. most query joins will be across tables that each contain at least 100K records, and up to, for now, 300K).
So, where to draw the line with denormalization in this case? I assume duplicating first, last name is OK. Generally I enjoy non-duplication/integrity of data, but I suspect site visitors enjoy performance more!
First thought is, are you sure you've exhausted tuning options to get good SELECT performance without denormalising here?
I'm very much with you in the sense of "no sacred cows" and denormalise when necessary, but this sounds like a case where it shouldn't be too hard to get decent performance.
Of course you guys have done your own exploration; if you've ruled that out, then my personal opinion is that it's acceptable, yeah.
One issue - what happens if a player's name changes? Can it do so in your system? Would you use a transaction to update all roster details in a single COMMIT operation? For a historical-records DB this could be totally irrelevant, mind you.
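If you do duplicate the name into roster, a rename would look something like this (a sketch using the column names from the question; it assumes InnoDB, since MyISAM tables ignore transactions):

START TRANSACTION;
UPDATE players SET firstName = 'New', lastName = 'Name' WHERE playerID = 42;
UPDATE roster  SET firstName = 'New', lastName = 'Name' WHERE playerID = 42;
COMMIT;

That way the two copies of the name can never be observed out of sync.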

Right design for MySQL database

I want to build a MySQL database for storing the ranking of a game every 1h.
Since this database will become quite large in a short time, I figured it's important to have a proper design. Therefore, some advice would be greatly appreciated.
In order to keep it as small as possible, I decided to log only the first 1500 positions of the ranking. Every ranking of a player holds the following values:
ranking position, playername, location, coordinates, alliance, race, level1, level2, points1, points2, points3, points4, points5, points6, date/time
My approach was to simply grab all values for each of the top 1500 players every hour with a PHP script and insert them into MySQL as one row per player. So the table will grow by 36,000 rows every day. I will have a second script that deletes every row older than 28 days; otherwise the database would get insanely huge. Both scripts will run as cron jobs.
The following queries will be performed on this data:
The most important one is simply the query for a certain name. It should return all stats for the player for every hour as an array.
The second is a query that should return all players who didn't gain points1 during a certain time period, measured from the latest entry. This should return a list of players that didn't gain points (for the last 24h, for example).
The third is a query that should list all players who lost a certain amount of points2 or more in a certain time period, measured from the latest entry.
The queries shouldn't take a lifetime, so I thought I should probably index playernames, points1 and points2.
Is my approach to this acceptable or will I run into a performance/handling disaster? Is there maybe a better way of doing this?
Here is where you risk a performance problem:
Your indexes will speed up your reads, but will considerably slow down your writes, especially since your DB will have over 1 million rows in that one table at any given time. Since your writes happen via cron, you should be okay as long as you insert your 1500 rows in batches rather than making one round trip to the DB for every row. I'd also look into query compiling so that you save that overhead as well.
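A sketch of what a batched insert from the PHP script could look like (the table and column names here are made up for illustration):

INSERT INTO ranking
    (player_id, points1, points2, captured_at)
VALUES
    (101, 5000, 4200, '2013-06-01 12:00:00'),
    (102, 4990, 4100, '2013-06-01 12:00:00'),
    (103, 4950, 4000, '2013-06-01 12:00:00');

One statement like this per hour (or a handful of them) is far cheaper than 1500 separate round trips.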
Ranhiru Cooray is correct: you should only store data like the player name once in the DB. Create a players table and use the primary key to reference the player in your ranking table. The same goes for location, alliance and race. I'm guessing those are more or less enumerated values that you can store in other tables to normalize your design, and they can be returned in your results with the appropriate JOINs. Normalizing your data will reduce the amount of redundant information in your database, which will decrease its size and increase its performance.
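A rough sketch of that normalization, with made-up names and only a subset of the columns (it assumes InnoDB for the foreign key):

CREATE TABLE players (
    player_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    location_id INT UNSIGNED,
    alliance_id INT UNSIGNED,
    race_id     INT UNSIGNED
);

CREATE TABLE ranking (
    player_id   INT UNSIGNED NOT NULL,
    captured_at DATETIME     NOT NULL,
    points1     INT NOT NULL,
    points2     INT NOT NULL,
    -- coordinates, level1/level2, points3..points6, etc. omitted for brevity
    PRIMARY KEY (player_id, captured_at),
    FOREIGN KEY (player_id) REFERENCES players (player_id)
);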
Your design may also be flawed regarding the ranking position. Can that not be calculated by the DB when you select your rows? If not, can it be done in PHP? It's the same as with invoice tables: you never store the invoice total because it is redundant; the items/pricing/etc. can be used to calculate the order totals.
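For example, using the tables sketched above, the position within one hourly snapshot can be derived at query time rather than stored (one approach that works on MySQL versions without window functions; the timestamp is just a placeholder):

SELECT r.player_id,
       r.points1,
       (SELECT COUNT(*) + 1
          FROM ranking r2
         WHERE r2.captured_at = r.captured_at
           AND r2.points1 > r.points1) AS position
FROM ranking r
WHERE r.captured_at = '2013-06-01 12:00:00'
ORDER BY position;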
With all the adding/deleting, I'd be sure to run OPTIMIZE TABLE frequently and keep good backups. MySQL tables, if using MyISAM, can become corrupted easily in high write/delete scenarios. InnoDB tends to fare a little better in those situations.
Those are some things to think about. Hope it helps.