Efficient Query, Table Bridge/Indexing and Structure - MySQL

In my phpMyAdmin database I have 9 tables, 8 of which are relevant to me at the moment. I would like my queries to execute quickly, but I am not sure the design/structure of the tables is the most efficient. Any suggestions on merging a table with another or creating another bridge table? Also, I am struggling to build a query that will display bridged results from the following tables: semester, schedule, office_hours, faculty, section, major_minor, major_class_br, class.
TABLE STRUCTURE
Basic query that shows class details:
SELECT class_name, class_caption, class_credit_hours, class_description
FROM class

Here's a start:
SELECT *
FROM schedule
INNER JOIN semester ON schedule.semester_id = semester.id
INNER JOIN office_hours ON office_hours.id = schedule.???
It's not clear how office_hours correlates with schedule.
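If office_hours hangs off faculty rather than schedule, the join would go through the faculty table instead. A minimal sketch, assuming hypothetical faculty_id columns on schedule and office_hours:
SELECT *
FROM schedule
INNER JOIN semester ON schedule.semester_id = semester.id
INNER JOIN faculty ON faculty.id = schedule.faculty_id           -- assumed column
INNER JOIN office_hours ON office_hours.faculty_id = faculty.id  -- assumed column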

'queries to be executed quickly enough'
INSERT or SELECT?
If you want your INSERT/UPDATE to be fast, normalise it to the nth degree
If you want your SELECT to be fast, denormalise it (which of course makes INSERTs and UPDATEs complicated and slow)
If your main table (schedule?) has < 10,000 records and it's a decent RDBMS then it's probably running as fast as it can.
Normally the performance tuning process involves
Identifying a workload (what queries do I usually run)
Getting a baseline (how long do they take to run)
Tuning (adding indexes, changing design)
Repeat
So we would really need to have an idea of what kind of queries are performing slowly, or alternatively what kind of growth you expect in which tables.
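For the baseline step, EXPLAIN plus simple timing is usually enough. A sketch against the tables above (the join and filter columns are assumptions):
-- Shows which indexes, if any, the optimizer will use for a typical join
EXPLAIN
SELECT schedule.*
FROM schedule
INNER JOIN semester ON schedule.semester_id = semester.id
WHERE semester.id = 1;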
I'm not sure what you mean by 'bridge results'. What happens when you build a query that joins the tables as per the physical diagram? An error, or unexpected results?

Related

Is the performance of joining two one-to-one tables on a single-node database the same as a pure select on an equivalent denormalized table?

There are two big (millions of records) one-to-one tables:
course
prerequisite with a foreign key reference to the course table
in a single-node relational MySQL database. A join is needed to list the full description of all the courses.
An alternative is to have only one single table to contain both the course and prerequisite data in the same database.
Question: is the performance of the join query still slower than that of a simple select query without a join on the single denormalized table, despite the fact that they are on the same single-node MySQL database?
It's true that denormalization is often done to shorten the work to look up one record with its associated details. This usually means the query responds in less time.
But denormalization improves one query at the expense of other queries against the same data. Making one query faster will often make other queries slower. For example, what if you want to query the set of courses that have a given prerequisite?
It's also a risk when you use denormalization that you create data anomalies. For example, if you change a course name, you would also need to update all the places where it is named as a prerequisite. If you forget one, then you'll have a weird scenario where the obsolete name for a course is still used in some places.
How will you know you found them all? How much work in the form of extra queries will you have to do to double-check that you have no anomalies? Do those types of extra queries count toward making your database slower on average?
The purpose of normalizing a database is not performance. It's avoiding data anomalies, which reduces your work in other ways.
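To make the trade-off concrete: with the normalized schema, the 'courses that have a given prerequisite' question from above is a plain join. A sketch, with hypothetical column names:
SELECT c.course_id, c.title
FROM course AS c
JOIN prerequisite AS p ON p.course_id = c.course_id
WHERE p.prereq_course_id = 101;  -- every course that requires course 101
On the denormalized table, answering the same question means scanning (or separately indexing) a column that was only ever meant to be read alongside its parent row.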

Searching in all the tables of a mysql database

I have a mysql (mariadb) database with numerous tables and all the tables have the same structure.
For the sake of simplicity, let's assume the structure is as below.
UserID - Varchar (primary)
Email - Varchar (indexed)
Is it possible to query all the tables together for the Email field?
Edit: I have not finalized the db design yet; I could put all the data in a single table. But I am afraid that a large table will slow down operations, and if it crashes, it will be painful to restore. Thoughts?
I have read some answers that suggested dumping all data together in a temporary table, but that is not an option for me.
MySQL Workbench or phpMyAdmin is not useful either; I am looking for a SQL query, not a frontend search technique.
There's no concise way in SQL to say this sort of thing.
SELECT a,b,c FROM <<<all tables>>> WHERE b LIKE 'whatever%'
If you know all your table names in advance, you can write a query like this.
SELECT a,b,c FROM table1 WHERE b LIKE 'whatever%'
UNION ALL
SELECT a,b,c FROM table2 WHERE b LIKE 'whatever%'
UNION ALL
SELECT a,b,c FROM table3 WHERE b LIKE 'whatever%'
UNION ALL
SELECT a,b,c FROM table4 WHERE b LIKE 'whatever%'
...
Or you can create a view like this.
CREATE VIEW everything AS
SELECT * FROM table1
UNION ALL
SELECT * FROM table2
UNION ALL
SELECT * FROM table3
UNION ALL
SELECT * FROM table4
...
Then use
SELECT a,b,c FROM everything WHERE b LIKE 'whatever%'
If you don't know the names of all the tables in advance, you can retrieve them from MySQL's information_schema and write a program to create a query like one of my suggestions. If you decide to do that and need help, please ask another question.
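For example, you can have MySQL generate the UNION ALL text from information_schema and run it as a prepared statement. A sketch; the database name and columns are placeholders:
-- Build one SELECT per table and glue them together with UNION ALL
SET SESSION group_concat_max_len = 1000000;  -- the default of 1024 is too small
SELECT GROUP_CONCAT(
         CONCAT('SELECT a, b, c FROM `', table_name, '` WHERE b LIKE ''whatever%''')
         SEPARATOR ' UNION ALL ')
INTO @sql
FROM information_schema.tables
WHERE table_schema = 'your_database';
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;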
These sorts of queries will, unfortunately, always be significantly slower than querying just one table. Why? MySQL must repeat the overhead of running the query on each table, and a single index is faster to use than multiple indexes on different tables.
Pro tip: Try to design your databases so you don't add tables when you add users (or customers or whatever).
Edit: You may be tempted to use multiple tables for query-performance reasons. With respect, please don't do that. Correct indexing will almost always give you better query performance than searching multiple tables. For what it's worth, a "huge" table for MySQL, one which challenges its capabilities, usually has at least a hundred million rows. Truly. Hundreds of thousands of rows are in its performance sweet spot, as long as they're indexed correctly. Here's a good reference about that, one of many: https://use-the-index-luke.com/
Another reason to avoid a design where you routinely create new tables in production: it's a pain in the neck to maintain and optimize databases with large numbers of tables. Six months from now, as your database scales up, you'll almost certainly need to add indexes to help speed up some slow queries. If you have to add many indexes, you, or your successor, won't like it.
You may also be tempted to use multiple tables to make your database more resilient to crashes. With respect, it doesn't work that way. Crashes are rare, and catastrophic unrecoverable crashes are vanishingly rare on reliable hardware. And a crash can corrupt multiple tables anyway. (Crash resilience comes from decent backups.)
Keep in mind that MySQL has been in development for over a quarter-century (as have the other RDBMSs). Thousands of programmer years have gone into making it fast and resilient. You may as well leverage all that work, because you can't outsmart it. I know this because I've tried and failed.
Keep your database simple. Spend your time (your only irreplaceable asset) making your application excellent so you actually get millions of users.

SQL Optimization: how to JOIN a table with itself

I'm trying to optimize a SQL query and I am not sure if further optimization is possible.
Here's my query:
SELECT someColumns
FROM (((smaller_table))) AS a
INNER JOIN (((smaller_table))) AS b
ON a.someColumnProperty = b.someColumnProperty
...the problem with this approach is that my table has half a trillion records in it. In my query, you'll notice (((smaller_table))). I wrote that as an abbreviation for a SELECT statement being run on MY_VERY_LARGE_TABLE to reduce its size.
(((smaller_table))) appears twice, and the code within is exactly the same both times. There's no reason for me to run the same sub-query twice. This table is several TB and I shouldn't scan through it twice just to get the same results.
Do you have any suggestions on how I can NOT run the exact same reduction twice? I tried replacing the INNER JOIN line with INNER JOIN a AS b but got an "unrecognized table a" warning. Is there any way to store the value of a so I can reuse it?
Thoughts:
Make sure there is an index on userid and dayid.
I would ask you to define better what it is you are trying to find out.
Examples:
What is the busiest time of the week?
Who are the top 25 people who come to the gym the most often?
Who are the top 25 people who utilize the gym the most? (This is different from the one above because maybe I have a user that comes 5 times a month but stays 5 hours per session, vs a user that comes 30 times a month and stays 0.5 hours per session.)
Maybe laying out all the days horizontally (day1, day2, day3, ...) would be better visually for finding what you are looking for. You could easily put this into Excel or LibreOffice and color the days that are populated to get a visual "picture" of people who come consecutively.
It might be interesting to run this for multiple months to see what the seasonality looks like.
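A sketch of that horizontal (pivoted) layout using conditional aggregation, assuming a hypothetical visits table with userid and dayid columns:
SELECT userid,
       MAX(dayid = 1) AS day1,  -- 1 if the user visited on day 1, else 0
       MAX(dayid = 2) AS day2,
       MAX(dayid = 3) AS day3
FROM visits
GROUP BY userid;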
Alas, CTEs are not available in MySQL (prior to 8.0). The rough equivalent is
CREATE TABLE tmp (
INDEX(someColumnProperty)
)
SELECT ...;
But...
You can't use CREATE TEMPORARY TABLE because such a table can't be referenced twice in the same query. (No, I don't know why.)
Adding the INDEX (or PK or ...) during the CREATE (or afterwards) provides the very necessary key for doing the self join.
You still need to worry about DROPping the table (or otherwise dealing with it).
The choice of ENGINE for tmp depends on a number of factors. If you are sure it will be "small" and has no TEXT/BLOB, then MEMORY may be optimal.
In a Replication topology, there are additional considerations.
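Putting that together, the self join then runs against tmp rather than the multi-TB table. A sketch, reusing the column names from the query above:
CREATE TABLE tmp (
  INDEX(someColumnProperty)
)
SELECT someColumnProperty, someColumns
FROM MY_VERY_LARGE_TABLE;  -- plus whatever filtering (((smaller_table))) was doing

SELECT a.someColumns, b.someColumns
FROM tmp AS a
INNER JOIN tmp AS b ON a.someColumnProperty = b.someColumnProperty;

DROP TABLE tmp;  -- don't forget this step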

MySQL temporary indexes for user-defined queries

I am building an analytics platform where users can create reports and such against a MySQL database. Some of the tables in this database are pretty huge (billions of rows), so for all of the features so far I have indexes built to speed up each query.
However, the next feature is to add the ability for a user to define their own query so that they can analyze data in ways that we haven't pre-defined. They have full read permission to the relevant database, so basically any SELECT query is a valid query for them to enter. This creates problems, however, if a query is defined that filters or joins on a column we haven't currently indexed - sometimes to the point of taking over a minute for a simple query to execute - something as basic as:
SELECT tbl1.a, tbl2.b, SUM(tbl3.c)
FROM
tbl1
JOIN tbl2 ON tbl1.id = tbl2.id
JOIN tbl3 ON tbl1.id = tbl3.id
WHERE
tbl1.d > 0
GROUP BY
tbl1.a, tbl1.b, tbl3.c, tbl1.d
Now, assume that we've only created indexes on columns not appearing in this query so far. Also, we don't want too many indexes slowing down inserts, updates, and deletes (otherwise the simple solution would be to build an index on every column accessible by the users).
My question is, what is the best way to handle this? Currently, I'm thinking that we should scan the query, build indexes on anything appearing in a WHERE or JOIN that isn't already indexed, execute the query, and then drop the indexes that were built afterwards. However, the main things I'm unsure about are a) is there already some best practice for this sort of use case that I don't know about? and b) would the overhead of building these indexes be enough that it would negate any performance gains the indexes provide?
If this strategy doesn't work, the next option I can see working is to collect statistics on what types of queries the users run, and have some regular job periodically check what commonly used columns are missing indexes and create them.
If using MyISAM, then performing an ALTER statement on large tables (billions of rows) in order to add an index will take a considerable amount of time, probably far longer than the 1 minute you've quoted for the statement above (and you'll need another ALTER to drop the index afterwards). During that time, the table will be locked, meaning other users can't execute their own queries.
If your tables use the InnoDB engine and you're running MySQL 5.1+, then CREATE / DROP index statements shouldn't lock the table, but it still may take some time to execute.
There's a good rundown of the history of ALTER TABLE [here][1].
I'd also suggest that automated query analysis to identify and build indexes would be quite difficult to get right. For example, what about cases such as selecting by foo.a but ordering by foo.b? This kind of query often needs a covering index over multiple columns, otherwise you may find your server tries a filesort on a huge result set, which can cause big problems.
Giving your users an "explain query" option would be a good first step. If they know enough SQL to perform custom queries then they should be able to analyse EXPLAIN in order to best execute their query (or at least realise that a given query will take ages).
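For instance, the query from the question can simply be prefixed with EXPLAIN, so the user (or your application) sees which tables would be read without a usable index before the query is actually run:
EXPLAIN
SELECT tbl1.a, tbl2.b, SUM(tbl3.c)
FROM tbl1
JOIN tbl2 ON tbl1.id = tbl2.id
JOIN tbl3 ON tbl1.id = tbl3.id
WHERE tbl1.d > 0
GROUP BY tbl1.a, tbl1.b, tbl3.c, tbl1.d;
-- rows with type = ALL and key = NULL indicate full table scans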
So, going further with my idea, I propose you segment your data into well-identified views. You used abstract names so I can't reuse your business model, but I'll use a hypothetical example.
Say you have 3 tables:
customer (gender, social category, date of birth, ...)
invoice (date, amount, ...)
product (price, date of creation, ...)
you would create a sort of materialized view for specific segments. It's like adding a business layer on top of the very bottom data-representation layer.
For example, we could identify the following segments:
seniors having at least 2 invoices
invoices of 2013 with more than 1 product
How to do that? And how to do that efficiently? Regular views won't help your problem because they will have poor explain plans on random queries. What we need is a real physical representation of these segments. We could do something like this:
CREATE TABLE MV_SENIORS_WITH_2_INVOICES AS
SELECT ... /* select from the existing tables */
;
/* add indexes: */
ALTER TABLE MV_SENIORS_WITH_2_INVOICES ADD CONSTRAINT...
... etc.
So now, your users just have to query MV_SENIORS_WITH_2_INVOICES instead of the original tables. Since there are fewer records, and probably more targeted indexes, the performance will be better.
We're done! Oh wait, no :-)
We need to refresh this data, a bit like a FAST REFRESH in Oracle. MySQL does not have (as far as I know; someone correct me?) a similar system, so we have to create triggers for that.
CREATE TRIGGER ... AFTER INSERT ON `seniors`
... /* insert the data into MV_SENIORS_WITH_2_INVOICES if it matches the segment */
END;
Now we're done!
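A slightly more concrete sketch of the first segment, with entirely hypothetical columns on the customer and invoice tables:
CREATE TABLE MV_SENIORS_WITH_2_INVOICES AS
SELECT c.id AS customer_id, c.date_of_birth, COUNT(i.id) AS invoice_count
FROM customer AS c
JOIN invoice AS i ON i.customer_id = c.id
WHERE c.date_of_birth <= '1958-12-31'  -- the "senior" cut-off here is an assumption
GROUP BY c.id, c.date_of_birth
HAVING COUNT(i.id) >= 2;

ALTER TABLE MV_SENIORS_WITH_2_INVOICES ADD PRIMARY KEY (customer_id);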

What is a good way to denormalize a mysql database?

I have a large database of normalized order data that is becoming very slow to query for reporting. Many of the queries that I use in reports join five or six tables and are having to examine tens or hundreds of thousands of lines.
There are lots of queries and most have been optimized as much as possible to reduce server load and increase speed. I think it's time to start keeping a copy of the data in a denormalized format.
Any ideas on an approach? Should I start with a couple of my worst queries and go from there?
I know more about MSSQL than MySQL, but I don't think the number of joins or the number of rows you are talking about should cause you too many problems with the correct indexes in place. Have you analyzed the query plan to see if you are missing any?
http://dev.mysql.com/doc/refman/5.0/en/explain.html
That being said, once you are satisfied with your indexes and have exhausted all other avenues, denormalization might be the right answer. If you just have one or two queries that are problems, a manual approach is probably appropriate, whereas some sort of data warehousing tool might be better for creating a platform to develop data cubes.
Here's a site I found that touches on the subject:
http://www.meansandends.com/mysql-data-warehouse/?link_body%2Fbody=%7Bincl%3AAggregation%7D
Here's a simple technique you can use to keep the denormalization simple, if you're just doing a few queries at a time (and I'm not replacing your OLTP tables, just creating a new one for reporting purposes). Let's say you have this query in your application:
select a.name, b.address from tbla a
join tblb b on b.fk_a_id = a.id where a.id=1
You could create a denormalized table and populate with almost the same query:
create table tbl_ab (a_id, a_name, b_address);
-- (types elided)
Notice the underscores match the table aliases you use
insert tbl_ab select a.id, a.name, b.address from tbla a
join tblb b on b.fk_a_id = a.id
-- no where clause because you want everything
Then to fix your app to use the new denormalized table, switch the dots for underscores.
select a_name as name, b_address as address
from tbl_ab where a_id = 1;
For huge queries this can save a lot of time and makes it clear where the data came from, and you can re-use the queries you already have.
Remember, I'm only advocating this as a last resort. I bet there are a few indexes that would help you. And when you denormalize, don't forget to account for the extra space on your disks, and figure out when you will run the query to populate the new tables. This should probably be at night, or whenever activity is low. And the data in that table, of course, will never be exactly up to date.
[Yet another edit] Don't forget that the new tables you create need to be indexed too! The good part is that you can index to your heart's content and not worry about update lock contention, since aside from your bulk insert the table will only see selects.
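For the tbl_ab example above, that just means indexing the columns your reports filter on, for instance:
ALTER TABLE tbl_ab
  ADD INDEX idx_a_id (a_id),
  ADD INDEX idx_b_address (b_address);  -- only if reports actually filter on it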
MySQL 5 does support views, which may be helpful in this scenario. It sounds like you've already done a lot of optimizing, but if not you can use MySQL's EXPLAIN syntax to see what indexes are actually being used and what is slowing down your queries.
As far as going about denormalizing data (whether you're using views or just duplicating data in a more efficient manner), I think starting with the slowest queries and working your way through is a good approach to take.
I know this is a bit tangential, but have you tried seeing if there are more indexes you can add?
I don't have a lot of DB background, but I am working with databases a lot recently, and I've been finding that a lot of the queries can be improved just by adding indexes.
We are using DB2, and there is a command called db2expln and db2advis, the first will indicate whether table scans vs index scans are being used, and the second will recommend indexes you can add to improve performance. I'm sure MySQL has similar tools...
Anyways, if this is something you haven't considered yet, it has been helping a lot with me... but if you've already gone this route, then I guess it's not what you are looking for.
Another possibility is a "materialized view" (or as they call it in DB2), which lets you specify a table that is essentially built of parts from multiple tables. Thus, rather than normalizing the actual columns, you could provide this view to access the data... but I don't know if this has severe performance impacts on inserts/updates/deletes (but if it is "materialized", then it should help with selects since the values are physically stored separately).
In line with some of the other comments, I would definitely have a look at your indexing.
One thing I discovered earlier this year on our MySQL databases was the power of composite indexes. For example, if you are reporting on order numbers over date ranges, a composite index on the order number and order date columns could help. I believe MySQL can only use one index per table in a query, so if you just had separate indexes on the order number and order date it would have to decide on just one of them to use. Using the EXPLAIN command can help determine this.
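A sketch of such a composite index, using a hypothetical orders table and column names:
ALTER TABLE orders
  ADD INDEX idx_order_no_date (order_number, order_date);

-- A query filtering on both columns can then be satisfied by the one composite index
SELECT order_number, order_date, total
FROM orders
WHERE order_number = 12345
  AND order_date BETWEEN '2013-01-01' AND '2013-12-31';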
To give an indication of the performance with good indexes (including numerous composite indexes), I can run queries joining 3 tables in our database and get almost instant results in most cases. For more complex reporting most of the queries run in under 10 seconds. These 3 tables have 33 million, 110 million and 140 million rows respectively. Note that we had also already denormalised these slightly to speed up our most common query on the database.
More information regarding your tables and the types of reporting queries may allow further suggestions.
For MySQL I like this talk: Real World Web: Performance & Scalability, MySQL Edition. This contains a lot of different pieces of advice for getting more speed out of MySQL.
You might also want to consider selecting into a temporary table and then performing queries on that temporary table. This would avoid the need to rejoin your tables for every single query you issue (assuming that you can use the temporary table for numerous queries, of course). This basically gives you denormalized data, but if you are only doing select calls, there's no concern about data consistency.
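A sketch of that pattern, with hypothetical orders/customers table names:
-- Join once, then run any number of reporting queries against the result
CREATE TEMPORARY TABLE report_base AS
SELECT o.id, o.order_date, o.total, c.region
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id;

SELECT region, SUM(total) FROM report_base GROUP BY region;
SELECT order_date, COUNT(*) FROM report_base GROUP BY order_date;

DROP TEMPORARY TABLE report_base;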
Further to my previous answer, another approach we have taken in some situations is to store key reporting data in separate summary tables. There are certain reporting queries which are just going to be slow even after denormalising and optimisations and we found that creating a table and storing running totals or summary information throughout the month as it came in made the end of month reporting much quicker as well.
We found this approach easy to implement as it didn't break anything that was already working - it's just additional database inserts at certain points.
I've been toying with composite indexes and have seen some real benefits... maybe I'll set up some tests to see if that can save me here, at least for a little longer.