MySQL JOIN vs LIKE - faster selects?

Okay, so first of all let me tell you a little about what I'm trying to do. During my studies I wrote a little webservice in PHP that calculates how similar movies are to each other based on some measurable attributes like length, actors, directors, writers, genres etc. The data I used for this was basically a collection of data acquired from omdbapi.com.
I still have that database, but it is technically just a SINGLE table that contains all the information for each movie. This means that for each movie all the above-mentioned parameters are stored as comma-separated strings. Therefore I have so far used a query that covers all of these things with LIKE statements. The query can become quite large, as I will pretty much query for every parameter within the table, sometimes with 5 different LIKE statements for different actors, and the same for directors and writers. Back when I last used this, it took about 30 to 60 seconds to enter a single movie and receive a list of 15 similar ones.
Now I have started my first job, and to keep learning in my free time I want to work on my own website. Because I have no real concept of what I want to do with it, I thought I'd get out my old "movie finder" again and use it differently this time.
Now, to challenge myself, I want the whole thing to be faster. Understand that the data is NEVER changed, only read. It is also not "really" relational, as actor names and such are just strings and have no real entry anywhere else, which essentially means that two people with the same name will be treated as the same actor.
Now here comes my actual question:
Assuming I want my SELECT queries to run faster, would it make sense to run a script that splits the comma-separated strings into extra tables (these are n-to-m relations, see my attempt below) and then JOIN all these tables (there will be 8 or more), or will using LIKE as I currently do be about the same speed? The ONLY thing I am trying to achieve is faster SELECT queries, as there is nothing else to really do with the data.
This is what I currently have. Keep in mind that I would still have to create tables for the relation between movies and each of these attribute tables. After doing that, I could remove the columns from the movie table and would end up having to join a lot of tables in EACH query. The only real advantage I can see here is that it would be easier to create an index on the individual tables, rather than one (or a few) covering the one big movie table.
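A rough sketch of the kind of split I mean, showing just the actors attribute (the table and column names here are made up for illustration, this is not my actual schema):

    CREATE TABLE movie (
        movie_id INT PRIMARY KEY,
        title    VARCHAR(255),
        length   INT
    );

    CREATE TABLE actor (
        actor_id INT PRIMARY KEY,
        name     VARCHAR(255) NOT NULL,
        UNIQUE KEY uq_actor_name (name)
    );

    CREATE TABLE movie_actor (
        movie_id INT NOT NULL,
        actor_id INT NOT NULL,
        PRIMARY KEY (movie_id, actor_id),
        KEY idx_actor_movie (actor_id, movie_id)  -- start from an actor, find the movies
    );

    -- movies sharing at least one of the given actors, instead of several LIKE '%...%' scans
    SELECT m.movie_id, m.title, COUNT(*) AS shared_actors
    FROM actor a
    JOIN movie_actor ma ON ma.actor_id = a.actor_id
    JOIN movie m ON m.movie_id = ma.movie_id
    WHERE a.name IN ('Actor One', 'Actor Two', 'Actor Three')
    GROUP BY m.movie_id, m.title
    ORDER BY shared_actors DESC
    LIMIT 15;

The point of the exercise would be that an equality/IN lookup on actor.name can use an index, whereas LIKE '%name%' on a comma-separated column always has to scan the whole column.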
I hope all of this even makes sense to you. I appreciate any answer, short or long; like I said, this is mostly for self-study and as such I don't have/need a real business model.

I don't understand what you currently have. It seems that you only showed the size of the tables but not their internal structure. You need to separate the data into separate tables using normalization rules and then add the correct indexes. Indexes will make your queries very fast. What does the sizing above your query mean? Have you ever run EXPLAIN ANALYZE for your queries? Please also post the query itself; I cannot guess your query from the result. There are a lot of optimization videos on YouTube.
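For example, something along these lines will show whether the LIKE filters can use an index at all (table and column names are placeholders; EXPLAIN ANALYZE needs MySQL 8.0.18 or newer, older versions only have plain EXPLAIN):

    -- shows the chosen access method; type = ALL means a full table scan
    EXPLAIN
    SELECT * FROM movie WHERE actors LIKE '%Tom Hanks%';

    -- MySQL 8.0.18+: the plan plus measured per-step timings
    EXPLAIN ANALYZE
    SELECT * FROM movie WHERE actors LIKE '%Tom Hanks%';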

Related

Should I split up a complex query into one to filter results and one to gather data?

I'm designing a central search function in a PHP web application. It is focused around a single table and each result is exactly one unique ID out of that table. Unfortunately there are a few dozen tables related to this central one, most of them being 1:n relations. Even more unfortunate, I need to join quite a few of them. A couple to gather the necessary data for displaying the results, and a couple to filter according to the search criteria.
I have been mainly relying on a single query to do this. It has a lot of joins in it and, as there should be exactly one result displayed per ID, it also works with rather complex subqueries and GROUP BY clauses. It also gets sorted according to a user-set sort method, and there's pagination in play as well, done via LIMIT.
Anyways, this query has become insanely complex and while I nicely build it up in PHP it is a PITA to change or debug. I have thus been considering another approach, and I'm wondering just how bad (or not?) this is for performance before I actually develop it. The idea is as follows:
run one less complex query that only filters according to the search parameters. This means fewer joins, and I can completely ignore GROUP BY and similar constructs; I will just "SELECT DISTINCT item_id" on this and get a list of IDs
then run another query, this time only joining in the tables I need to display the results (only about 1/4 of the current total joins) using ... WHERE item_id IN (....), passing the list of "valid" IDs gathered in the first query (sketched below).
Note: Obviously the IN () could actually contain the first query in full instead of relying on PHP to build up a comma-separated list.
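A rough sketch of what I mean, with made-up table names:

    -- query 1: filter only, no display joins, no GROUP BY
    SELECT DISTINCT i.item_id
    FROM item i
    JOIN item_tag it ON it.item_id = i.item_id
    WHERE it.tag_id = 42
      AND i.status = 'active';

    -- query 2: display joins only, restricted to the IDs found above
    SELECT i.item_id, i.name, c.name AS category
    FROM item i
    JOIN category c ON c.category_id = i.category_id
    WHERE i.item_id IN (1, 5, 9, 23)  -- list built in PHP, or the first query inlined here
    ORDER BY i.name
    LIMIT 20;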
How bad will the IN be performance-wise? And how much will it hurt me that I cannot LIMIT the first query at all? I'm also wondering if this is a common approach, or if there are more intelligent ways to do it. I'd be thankful for any input on this :)
Note to clarify: We're not talking about a few simple joins here. There is even (simple) hierarchical data in there where I need to compare the search parameter against not only the items own data but also against its parent's data. In no other project I've ever worked on have I encountered a query close to this complexity. And before you even say it, yes, the data itself has this inherent complexity, which is why the data model is complex too.
My experience has shown that using the WHERE IN (...) approach tends to be slower. I'd go with the joins, but make sure you're joining on the smallest dataset possible first. Reduce the main table down first, then join onto that. Make sure your most complex joins are saved for the end to minimize the rows that have to be searched. Try to join on indexes wherever possible to improve speed, and ditch wildcards in JOINs where possible.
But I agree with Andomar, if you have the time build both and measure.
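To make that measurable, the same filter can usually be written both ways (hypothetical tables here), so you can benchmark one against the other; note that newer MySQL versions will often rewrite the IN subquery as a semi-join anyway, which is another reason to measure rather than guess:

    -- IN with a subquery
    SELECT i.item_id, i.name
    FROM item i
    WHERE i.item_id IN (SELECT it.item_id FROM item_tag it WHERE it.tag_id = 42);

    -- equivalent JOIN
    SELECT DISTINCT i.item_id, i.name
    FROM item i
    JOIN item_tag it ON it.item_id = i.item_id
    WHERE it.tag_id = 42;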

Complex SQL String Comparison

I'm merging two databases for a client. In an ideal world, I'd simply use the unique ID to join them, but in this case the newer table has different IDs.
So I have to join the tables on another column. For this I need to use a complex LIKE statement to join on the Title field. But... they have changed the titles of some rows, which breaks the join on those rows.
How can I write a complex LIKE statement to connect slightly different titles?
For instance:
Table 1 Title = Freezer/Pantry Storage Basket
Table 2 Title = Deep Freezer/Pantry Storage Basket
or
Table 1 Title = Buddeez Bread Buddy
Table 2 Title = Buddeez Bread Buddy Bread Dispenser
Again, there are hundreds of rows with titles only slightly different, but inconsistently different.
Thanks!
UPDATE:
How far can MySQL Full-Text Search get me? Looks similar to Shark's suggestion in SQL Server.
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
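Something like this is what I'm imagining; it needs a FULLTEXT index (MyISAM-only in 5.0, also InnoDB from 5.6), the table name here is made up, and since AGAINST() only takes a constant string it would have to be driven per-title from application code rather than used directly as a join condition:

    ALTER TABLE table2 ADD FULLTEXT INDEX ft_title (Title);

    -- candidate matches for one known title, best match first
    SELECT Title,
           MATCH(Title) AGAINST ('Freezer/Pantry Storage Basket') AS score
    FROM table2
    WHERE MATCH(Title) AGAINST ('Freezer/Pantry Storage Basket')
    ORDER BY score DESC
    LIMIT 5;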
Do it in stages. First get all the ones that match exactly out of the way, so that you are only working with the exceptions. Your mind is far better than the computer at finding things that are 'like' each other, so scan over the data, look for similarities, and write SQL statements that cover the specific cases you see until you have it narrowed down as much as possible.
You will have better results if you 'help' the computer in stages like this than if you try to develop a big routine to cover all cases at once.
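A sketch of those stages in SQL, with assumed table and column names:

    -- stage 1: exact matches between the two tables
    SELECT t1.id AS old_id, t2.id AS new_id
    FROM table1 t1
    JOIN table2 t2 ON t2.Title = t1.Title;

    -- stage 2 (run only on rows stage 1 did not match):
    -- one title is contained inside the other, e.g. "Deep ..." or "... Bread Dispenser"
    SELECT t1.id AS old_id, t2.id AS new_id
    FROM table1 t1
    JOIN table2 t2
      ON t2.Title LIKE CONCAT('%', t1.Title, '%')
      OR t1.Title LIKE CONCAT('%', t2.Title, '%');

Whatever stage 2 still leaves unmatched is the pile you look at by hand.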
Of course there are certainly APIs out there that do this already (such as the one Google uses to guess your search phrase before you finish it), but whether any are freely available I don't know. It certainly wouldn't hurt to search for one, though.
It's fairly difficult to describe 'only slightly different' in a way that a computer would understand. I suggest choosing a group of criteria that can be considered either most common or most important and working around that. I am not sure what those criteria should be, though, since I have only a vague idea of what the data set looks like.

Normalisation vs Performance: benefits/issues of removing linking tables in (this) schema?

Generally I like to keep my database as clean and expandable as possible.
However, after doing some tests, I realised that whilst this is usually the best way to do it, when dealing with large datasets it's a lot slower than what I refer to as the "dirty" approach to the problem.
Basically, let's say I have a table of objects. These objects belong to certain people. One object may belong to one person, whilst others belong to more than one. My initial thought was, as I always do, to create an objects table for my objects, a people table for my people, and then an object_to_people linking table.
However, joining the objects and linking table in order to get all objects a person is assigned to can take up to 3 seconds (that's based on around 400k records, but only 1 link per object). Yes, I also set up indexes etc. to try to speed things up.
If I instead remove the people and linking tables, put the people in the objects table as columns, and use 1/0 to flag whether each person is assigned to that object, then without joining the two large tables I see a speed of around 0.3 to 0.7 seconds (it varies greatly).
To begin with, we only need 2 people, but I don't want to be too restrictive if I can help it. I know I can use caching and whatnot to improve the end-user timings, but is there any reason why using columns rather than link tables would be considered a really bad idea?
I have a similar setup.
My join table has 17,000,000 rows. My "person" table has 8,400,000 rows, and my "objects" table has 300,000 rows.
I have queries with multiple joins on my join table, and unions of results that return tens of thousands of rows, and they take less than 1 second to run (50-400 ms).
I think your first layout could be fine, but you probably need to focus on your indexes and queries.
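A link-table layout along those lines (names adapted to your example), with composite indexes covering both directions of the join, would be something like:

    CREATE TABLE object_to_people (
        object_id INT NOT NULL,
        person_id INT NOT NULL,
        PRIMARY KEY (object_id, person_id),           -- "who is assigned to this object?"
        KEY idx_person_object (person_id, object_id)  -- "which objects does this person have?"
    ) ENGINE=InnoDB;

    -- all objects assigned to one person; both lookups can be satisfied from indexes
    SELECT o.*
    FROM object_to_people op
    JOIN objects o ON o.object_id = op.object_id
    WHERE op.person_id = 123;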
but is there any reason why using columns rather than link tables would be considered a really bad idea?
I would say it is a really bad idea if you value scalability more than the performance you gained.
I would say it is a really good idea if you value the performance you gained more than scalability.
If it is true that one object can simultaneously belong to more than one person... then keep the link table.
Also, in MySQL an ALTER TABLE on a huge table can take extremely long to execute, so adding new people (i.e. new columns) from the application won't be possible in a reasonable time.

Is it better to return one big query or a few smaller ones?

I'm using MySQL to store video game data. I have tables for titles, platforms, tags, badges, reviews, developers, publishers, etc...
When someone is viewing a game, is it best to have one query that returns all the data associated with a game, or is it better to use several queries? Intuitively, since we have reviews, it seems pointless to include them in the same query since they'll need to be paginated. But there are other situations where I'm unsure whether to break the query down or use two queries...
I'm a bit worried about performance, since I'm now joining the following tables to games: developers, publishers, metatags, badges, titles, genres, subgenres, classifications... To grab game badges (from games_badges, which is many-to-many to the games table and many-to-many to the badges table) I can either do another join or run a separate query, and I'm unsure which is best.
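To make it concrete, these are the two options I'm weighing for the badges (simplified, column names assumed):

    -- option A: one more join in the big query
    -- (the game columns repeat once per badge row)
    SELECT g.id, g.title, b.name AS badge
    FROM games g
    LEFT JOIN games_badges gb ON gb.game_id = g.id
    LEFT JOIN badges b ON b.id = gb.badge_id
    WHERE g.id = 42;

    -- option B: keep the big query as it is and fetch badges separately
    SELECT b.name
    FROM games_badges gb
    JOIN badges b ON b.id = gb.badge_id
    WHERE gb.game_id = 42;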
It is significantly faster to use one query than multiple queries, because the startup of a query and the calculation of the query plan are themselves costly, and running multiple queries in a row slows the server down more each time. Obviously you should only fetch the data that you actually need, but fewer queries is always better.
So if you are going to show 20 games on a page, you can speed up the query (still using only one query) with a LIMIT clause and only run that query again later when they get to the next page. That or you can just make them wait for the query to complete and have all of the data there at once. One big wait or several little waits.
tl;dr use as few queries as possible.
There is no panacea.
Always try to get only necessary data.
There is no general answer as to whether one big query or several small queries is better. Each case is unique; to answer this question you should profile your application and examine the queries' EXPLAIN output.
This is generally a processing problem.
If making one query would imply retrieving thousands of entries, call several queries to have MySQL do the processing (sums, etc.).
If making multiple queries involves making tens or hundreds of them, then call a single query.
Obviously you're always facing both of these, since neither is a go-to option if you're asking the question, so the choices really are:
Pick the one you can take the hit on
Cache or mitigate it as much as you can so that you take a hit very rarely
Try to insert preprocessed data in the database to help you process the current data
Do the processing as part of a cron and have the application only retrieve the data
Take a few steps back and explore other possible approaches that don't require the processing

Which is more efficient: Multiple MySQL tables or one large table?

I store various user details in my MySQL database. Originally it was set up in various tables, meaning data is linked by UserId and output via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far; they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship, which was the main reason for wanting to denormalise them.
Are there going to be issues if the table spans 100+ columns when a large portion of the cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) If different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kinds of authority to different people for different parts of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and granting authorization on them appropriately.)
(c) For moving data to different places, especially during development, it may make sense to use separate tables, resulting in smaller file sizes.
(d) A smaller footprint may give comfort while you develop applications against a specific part of the data for a single entity.
(e) It is a possibility that what you thought of as single-valued data may turn out to be multi-valued in the future. E.g. credit limit is a single-value field as of now, but tomorrow you may decide to store the values as (date from, date to, credit value). Split tables would come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help some queries (those that do lots of JOINs) run faster, at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation, which may take dozens of times as long as pure record scanning.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in the other tables, then the increase in table scan time can outweigh the benefit of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the user's interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding the previous answers about normalization, it must be said that the database normalization rules completely disregard performance and only look at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine them all into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data in them just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page on database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are likely to remain empty
If, for example, a user doesn't have any interests, then if you normalize you simply won't have a row in the interests table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) requiring no joins, calculations etc. that reports could point to. These were then used in conjunction with a SQL Server Agent to run the job at certain intervals (e.g. a weekly view of some stats would run once a week and so on).
Why not use the same approach WordPress does: have a users table with the basic user information that everyone has, and then add a "user_meta" table that can hold basically any key/value pair associated with the user ID. So if you need to find all the meta information for the user, you can just add that to your query. You also don't always have to run the extra query if it isn't needed for things like logging in. The benefit of this approach is that it leaves your table open to adding new features to your users, such as storing their Twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all metadata, and you limit it to only one association instead of 50.
WordPress specifically does this to allow features to be added via plugins, which makes your project more scalable and means it will not require a complete database overhaul if you need to add a new feature.
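A minimal sketch of that layout (the names are illustrative, not WordPress's actual schema):

    CREATE TABLE users (
        user_id  INT AUTO_INCREMENT PRIMARY KEY,
        username VARCHAR(64)  NOT NULL,
        email    VARCHAR(255) NOT NULL
    );

    CREATE TABLE user_meta (
        user_id    INT NOT NULL,
        meta_key   VARCHAR(64) NOT NULL,
        meta_value TEXT,
        PRIMARY KEY (user_id, meta_key)
    );

    -- adding a new "feature" is just a new key, no ALTER TABLE needed
    INSERT INTO user_meta (user_id, meta_key, meta_value)
    VALUES (1, 'twitter_handle', '@example');

    -- fetch all meta for a user only when you actually need it
    SELECT meta_key, meta_value FROM user_meta WHERE user_id = 1;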
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level on normalization is best suited for your needs depends on your demands.
If you have one table with good index that would probably be faster. But on the other hand probably more difficult to maintain.
To me it looks like you could skip user_details, as it is probably a 1-to-1 relation with users.
But the rest probably have a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for 1:1 relationship _for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
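For example, borrowing the table names from the question (column names assumed):

    -- rarely-present stats kept in their own 1:1 ("1:rarely") table
    SELECT u.user_id,
           u.username,
           COALESCE(s.hits, 0) AS hits  -- users with no user_stats row show 0 instead of NULL
    FROM users u
    LEFT JOIN user_stats s ON s.user_id = u.user_id
    WHERE u.user_id = 1;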
Bottom Line: It depends.
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then has to work harder to continue filtering on columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB), which means that fetching them may involve one or more extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.