Ranking search results - mysql

I have a tutoring website with a search feature. I want tutors to appear on the list according to several weighted criteria, including whether or not they are subscription holders, if they have submitted a profile photo, if they have included a lot of information about themselves, etc...
Basically, I have a lot of criteria by which I would like to weigh their rank.
Instead of writing a complicated SQL query with multiple ORDER BYs (if this is even possible), I was thinking of creating a table (maybe a temporary one), that assigns numerical values based on several criteria to come up with a final search rank.
I'm not entirely sure about how to go about this, or if this is a good idea, so I would like to know what the community thinks about a) this method, and b) possible ways of implementing this in SQL.

I would add a field to one of the existing tables that more or less represents each tutor's "weight" for sorting purposes. I would then populate that column with a database procedure that runs every so often (you could maintain a queue so it only touches records that have been updated, or simply run it over all records). That way, the search query can pull back the data and order by one column instead of several.
You could also use a view. It really depends on whether you want the number crunching done ahead of time by the procedure or by the database every time you pull data (for a search feature, and for speed's sake, I'd suggest the database procedure).
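A minimal sketch of that idea, assuming a tutors table with is_subscriber, photo_url and bio columns plus a precalculated search_weight column (all names and weight values here are hypothetical, not from the question):

    -- Hypothetical: tutors(id, name, subject, is_subscriber, photo_url, bio, search_weight)
    -- Recompute the precalculated weight; run this from a scheduled procedure or event,
    -- or restrict it to recently updated rows via a queue table.
    UPDATE tutors
    SET search_weight =
          (is_subscriber = 1) * 50                       -- subscription holders rank higher
        + (photo_url IS NOT NULL) * 20                   -- has a profile photo
        + (CHAR_LENGTH(COALESCE(bio, '')) > 500) * 10;   -- wrote a substantial profile

    -- The search itself then orders by a single column:
    SELECT id, name
    FROM tutors
    WHERE subject = 'maths'
    ORDER BY search_weight DESC, name;

MySQL treats the boolean comparisons as 0 or 1, so the weighted sum stays a single expression that is cheap to recompute.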

Related

Performing Calculations SQL

I am trying to take information from one MySQL table, perform a bunch of calculations on this data, and then put the results in a second MySQL table. What would be the best way of doing this (i.e. in MySQL itself, using python, etc.)?
My apologies for the vagueness, I'll try to be more specific. Table 1 has every meal that every person in my class eats, so each meal is a primary key, and other columns include the person and the number of calories. The primary key for Table 2 is the person, and another column is the percentage of total calories this person has eaten, out of the calories of the entire class. Another column is the percentage of total calories of this person's gender in the class. Every day, I want to take the new eating information, and use it to update the percentages in Table 2. (Thanks for the help!)
Assuming the calculations can be done in SQL (and percentages are definitely doable), you have some choices.
The first, and academically correct, choice is not to store this in a table at all. One of the principles of normalization is that you don't store duplicate or calculated values; instead, you calculate them as you need them.
This isn't just an academic concern - it avoids many silly bugs and anomalies, and it means your data is always up to date - you don't have to wait for your calculation query to run before you can use the data.
If the calculation is non-trivial and/or an essential part of the business domain, common practice is to create a database view, which behaves like a table when queried, but is actually calculated on the fly. This means that the business logic is encapsulated in the view, rather than repeated in multiple queries. You can go further, with materialized views etc. - but the basic principle is the same.
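For the calorie example, such a view might look like the following sketch; the table and column names (meals, people, calories, gender) are assumptions rather than the asker's actual schema:

    -- Assumed tables: meals(meal_id, person_id, calories) and people(person_id, gender).
    CREATE VIEW calorie_shares AS
    SELECT p.person_id,
           SUM(m.calories) AS person_calories,
           SUM(m.calories) * 100.0
             / (SELECT SUM(calories) FROM meals)            AS pct_of_class,
           SUM(m.calories) * 100.0
             / (SELECT SUM(m2.calories)
                FROM meals m2
                JOIN people p2 ON p2.person_id = m2.person_id
                WHERE p2.gender = p.gender)                 AS pct_of_gender
    FROM meals m
    JOIN people p ON p.person_id = m.person_id
    GROUP BY p.person_id, p.gender;

    -- Always up to date, no refresh step needed:
    SELECT * FROM calorie_shares WHERE person_id = 7;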
In some cases, where the volume of data is huge, or the calculations are time consuming, or you have calculations that are very hard to embed in a single SQL statement, it's common to create "aggregate tables" - this is what you are suggesting. You can populate these tables either by (scheduled) queries, or by using database triggers.
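If you do end up with an aggregate table, a minimal sketch (reusing the hypothetical view above, refreshed by a nightly cron job or a MySQL EVENT) could be:

    -- Hypothetical aggregate table, refreshed on a schedule:
    CREATE TABLE calorie_shares_agg (
      person_id     INT PRIMARY KEY,
      pct_of_class  DECIMAL(7,4),
      pct_of_gender DECIMAL(7,4)
    );

    REPLACE INTO calorie_shares_agg (person_id, pct_of_class, pct_of_gender)
    SELECT person_id, pct_of_class, pct_of_gender
    FROM calorie_shares;   -- the view sketched above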
However, aggregate tables are a last resort - they make the solution much harder to maintain and debug - if the data is wrong, you don't have a single query to debug, you've got to follow the chain of logic all the way through.
Assuming you are in a class of a few dozen people, and are reporting on less than 10 years' worth of meals, any modern RDBMS can calculate this report in milliseconds - there's really no need to store it in an aggregate table.
A possible solution is to create a view, or a materialized view, with the complex SELECT query behind it.
A materialized view (which in MySQL you would have to emulate with a regularly refreshed table) could be a good fit here, since you wrote that you would like these results re-queried/refreshed every day.
If you need to do more advanced operations on those tables, you could create a stored procedure and call it whenever you need its data.
Note: you can't do further work with a procedure's result set (e.g. you can't call it from a SELECT to join against it); you'd have to write the results into, say, a temporary table first.
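A hedged sketch of that temporary-table workaround, again using the hypothetical names from the earlier sketches:

    -- The procedure writes its results into a temporary table, which the rest of
    -- the session can then query or join against.
    DELIMITER //
    CREATE PROCEDURE refresh_calorie_shares()
    BEGIN
      DROP TEMPORARY TABLE IF EXISTS tmp_calorie_shares;
      CREATE TEMPORARY TABLE tmp_calorie_shares AS
        SELECT person_id, pct_of_class, pct_of_gender
        FROM calorie_shares;            -- the view sketched earlier
    END //
    DELIMITER ;

    CALL refresh_calorie_shares();
    SELECT p.person_id, p.gender, t.pct_of_class
    FROM tmp_calorie_shares t
    JOIN people p ON p.person_id = t.person_id;

Temporary tables are session-scoped rather than procedure-scoped in MySQL, so the table created inside the procedure is still available to the SELECT that follows the CALL.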

storing rows order in mysql

I need to give my script's admin page the ability to change the display order of rows.
For that, there is a default order for newly added rows (they go to the end of the list), and the admin should be able to change the position of a specific row.
I'm planning to treat the rows like a doubly linked list so that I can re-position them.
Is it OK to use the linked-list method for saving the display position of MySQL rows?
Is there a better method?
Should I use a separate table to store the order, or is it OK to add next and prev columns to the original table?
Is it then possible to use MySQL's ORDER BY with this method?
Edit: I also thought of using spaced order codes (e.g. 0, 100, 200, ...), but the gaps can eventually run out.
I think you'll be better off just storing the ordering position in a dedicated field, instead of trying to implement a linked list.
The issue with the linked list is that it requires some sort of list traversal to "reconstruct" the order before you can display it to the user. Normally, you'd employ a recursive query to do that, but unfortunately MySQL (before version 8.0) doesn't support recursive queries, so you'll either need to fiddle with stored procedures, or end up making a database round trip for each and every list node.
All in all, just updating the order field of several rows from time to time (when you need to reorder) is probably cheaper than traversing the list every time (when you need to display it), especially if you mostly move rows by small distances. And if you introduce gaps (as you already mentioned), the number of rows you actually need to update falls dramatically, at the price of some extra complexity.
You may also be able to piggy-back the order field onto the clustering mechanism offered by InnoDB (for example, by making the order field part of the primary key so that rows are stored physically in display order).
YMMV, of course, but I'd advise benchmarking the simple order field approach on representative amounts of data before attempting to implement anything more sophisticated...
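For reference, a minimal sketch of the simple order-field approach with gaps (the table and column names are made up):

    -- Hypothetical: a display_order column with gaps; new rows go to the end.
    ALTER TABLE items ADD COLUMN display_order INT NOT NULL DEFAULT 0;

    -- Append a new row at the end, leaving a gap of 100 before the next one:
    INSERT INTO items (name, display_order)
    SELECT 'new item', COALESCE(MAX(display_order), 0) + 100 FROM items;

    -- The admin moves row 42 between the rows at positions 300 and 400:
    UPDATE items SET display_order = 350 WHERE id = 42;

    -- Displaying is then a plain (indexable) sort:
    SELECT * FROM items ORDER BY display_order;

    -- If the gaps in some region run out, renumber the whole list
    -- in an occasional maintenance job.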

Should I split up a complex query into one to filter results and one to gather data?

I'm designing a central search function in a PHP web application. It is focused around a single table and each result is exactly one unique ID out of that table. Unfortunately there are a few dozen tables related to this central one, most of them being 1:n relations. Even more unfortunate, I need to join quite a few of them. A couple to gather the necessary data for displaying the results, and a couple to filter according to the search criteria.
I have been mainly relying on a single query to do this. It has a lot of joins and, as exactly one result should be displayed per ID, it also uses rather complex subqueries and GROUP BY clauses. The results are sorted according to a user-selected sort method, and pagination is handled with LIMIT.
Anyways, this query has become insanely complex and while I nicely build it up in PHP it is a PITA to change or debug. I have thus been considering another approach, and I'm wondering just how bad (or not?) this is for performance before I actually develop it. The idea is as follows:
run one less complex query that only filters according to the search parameters. This means fewer joins and I can completely ignore GROUP BY and similar constructs; I will just "SELECT DISTINCT item_id" on this and get a list of IDs
then run another query, this time only joining in the tables I need to display the results (only about 1/4 of the current total joins) using ... WHERE item_id IN (....), passing the list of "valid" IDs gathered in the first query.
Note: Obviously the IN () could actually contain the first query in full, instead of relying on PHP to build up a comma-separated list (see the sketch below).
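Roughly, the two steps would look like this; the table names (items, item_tags, owners) are invented for illustration and are not my real schema:

    -- Query 1: filter only - no display joins, no GROUP BY.
    SELECT DISTINCT i.item_id
    FROM items i
    JOIN item_tags t  ON t.item_id = i.item_id
    JOIN items parent ON parent.item_id = i.parent_id
    WHERE t.tag = 'foo'
      AND (i.region = 'EU' OR parent.region = 'EU');

    -- Query 2: gather display data for the surviving IDs only.
    SELECT i.item_id, i.title, o.name AS owner
    FROM items i
    JOIN owners o ON o.owner_id = i.owner_id
    WHERE i.item_id IN (1, 5, 42)   -- ID list built in PHP from query 1,
                                    -- or embed query 1 here as a subquery
    ORDER BY i.title
    LIMIT 0, 25;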
How bad will the IN be performance-wise? And how much will it possibly hurt me that I can not LIMIT the first query at all? I'm also wondering if this is a common approach to this or if there are more intelligent ways to do it. I'd be thankful for any input on this :)
Note to clarify: We're not talking about a few simple joins here. There is even (simple) hierarchical data in there where I need to compare the search parameter against not only the items own data but also against its parent's data. In no other project I've ever worked on have I encountered a query close to this complexity. And before you even say it, yes, the data itself has this inherent complexity, which is why the data model is complex too.
My experience has shown that the WHERE IN (...) approach tends to be slower. I'd go with the joins, but make sure you're joining on the smallest dataset possible first: reduce the main table down, then join onto that. Save your most complex joins for the end, to minimize the number of rows they have to search. Try to join on indexed columns wherever possible to improve speed, and avoid wildcards in join conditions where you can.
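One way to keep it a single statement while still filtering early is to join against the filter query as a derived table instead of using IN(); this is only a sketch, reusing the invented names from the question's example:

    SELECT i.item_id, i.title, o.name AS owner
    FROM (
          -- the "filter only" query, reduced to distinct IDs
          SELECT DISTINCT i.item_id
          FROM items i
          JOIN item_tags t  ON t.item_id = i.item_id
          JOIN items parent ON parent.item_id = i.parent_id
          WHERE t.tag = 'foo'
            AND (i.region = 'EU' OR parent.region = 'EU')
         ) AS matches
    JOIN items  i ON i.item_id  = matches.item_id
    JOIN owners o ON o.owner_id = i.owner_id
    ORDER BY i.title
    LIMIT 0, 25;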
But I agree with Andomar, if you have the time build both and measure.

MySQL: Complex queries or tracking/counter fields

I'm just thinking about MySQL database design and there are often situations where
A particular action is or is not carried out and consequently data is or is not stored in the database
Whether or not a user undertook a particular action is displayed statistically
An example of this would be:
A user does or does not fill out a survey. If they do fill out a survey, the data they provide is stored in the database. The total number of users who filled out the survey is displayed.
Now, in order to get the number of users who filled out the survey, we could either
create a field of type BOOL which is set to TRUE on survey completion; we then calculate the number of users who completed the survey using a simple COUNT(*) WHERE field = TRUE
calculate the number of users who filled out the survey from the data they provided, by joining the users and survey results tables and grouping on the user (both options are sketched below)
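Roughly, the two options look like this; the table and column names are illustrative, not an actual schema:

    -- Option 1: denormalized flag on the users table.
    SELECT COUNT(*) AS completed
    FROM users
    WHERE completed_survey = TRUE;

    -- Option 2: derive the number from the survey data itself,
    -- joining back to users and deduplicating on the user.
    SELECT COUNT(DISTINCT u.id) AS completed
    FROM users u
    JOIN survey_results s ON s.user_id = u.id;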
This isn't a particularly complex example, but there are cases where, without the BOOL flag, queries can become very complex and expensive. Yet the flag is an almost unnecessary addition to the database tables - we use it only for convenience - and it means we have to ensure that we UPDATE all user flags at the relevant time, as well as storing the user data.
What would be your approach to this kind of problem? For smaller applications, I'll usually just write complex queries and cache their results (occasionally using views to make things more manageable). But in larger applications, with potentially many joins, I might be tempted to flag the users with an action field so that reads are simpler and cheaper.
The best solution is an indexed view (SQL Server terminology) or a materialized view (Oracle terminology) or a materialized query table (DB2 terminology). All those solutions keep the data up to date in real time. No maintenance.
When your platform doesn't support those kinds of database objects, you have to resort to using a table, along with all the other things necessary to keep the data right. You can keep the data right with
triggers
cron jobs
If you use triggers, you should probably also run a periodic cron job to make sure the data stored matches the data calculated.
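A minimal sketch of the trigger approach, with a reconciliation query for the periodic cron job (all object names here are invented for illustration):

    -- A summary table kept up to date by a trigger.
    CREATE TABLE survey_stats (
      stat_name  VARCHAR(50) PRIMARY KEY,
      stat_value INT NOT NULL
    );
    INSERT INTO survey_stats VALUES ('surveys_submitted', 0);

    DELIMITER //
    CREATE TRIGGER survey_results_ai
    AFTER INSERT ON survey_results
    FOR EACH ROW
    BEGIN
      UPDATE survey_stats
      SET stat_value = stat_value + 1
      WHERE stat_name = 'surveys_submitted';
    END //
    DELIMITER ;

    -- Periodic cron-job reconciliation, in case the trigger and the raw data drift:
    UPDATE survey_stats
    SET stat_value = (SELECT COUNT(*) FROM survey_results)
    WHERE stat_name = 'surveys_submitted';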
It helps that, in the real world, most of these kinds of requirements really don't have to be up to date in real time. These kinds of numbers usually support management decisions; a lag of even a day is often acceptable. (In other words, it sometimes helps to think of it as a data warehouse problem or as a report rather than as an OLTP problem.) I've had to negotiate these kinds of requirements many times. I've never had anyone refuse to accept a two-hour update cycle. (But that's certainly application-dependent.)
"calculate the number of users . . . by joining the users and survey results tables and grouping on the user"
If you can join the users and the survey results tables, then the survey results table must have a user identifier, right? If that's right, you don't need to join those two tables to determine the number of users who completed a survey.
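That is, assuming the survey results table has a user_id column, a single query on that table is enough:

    SELECT COUNT(DISTINCT user_id) AS users_who_completed_survey
    FROM survey_results;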
What you are describing is called a "denormalized view", i.e. a table that contains results which can be computed from other data already in the database. The reason to do this is indeed performance, whether to do this or not depends on the cost of (re-)generating the data, the effort in your code required to keep it coherent, and the extra amount of database space to store the computed values.

Best approach to construct complex MySQL joins and groups?

I find that when trying to construct complex MySQL joins and groups between many tables I usually run into strife and have to spend a lot of 'trial and error' time to get the result I want.
I was wondering how other people approach the problems. Do you isolate the smaller blocks of data at the end of the branches and get these working first? Or do you start with what you want to return and just start linking tables on as you need them?
Also wondering if there are any good books or sites about approaching the problem.
I don't work in MySQL but I do frequently write extremely complex SQL, and here's how I approach it.
First, there is no substitute whatsoever for thoroughly understanding your database structure.
Next I try to break up the task into chunks.
For instance, suppose I'm writing a report concerning the details of a meeting (the company I work for does meeting planning). I will need to know the meeting name and sales rep, the meeting venue and dates, the people who attended, and the speaker information.
First I determine which of the tables will have the information for each field in the report. Now I know what I will have to join together, but not exactly how as yet.
So first I write a query to get the meetings I want. This is the basis for all the rest of the report, so I start there. Now the rest of the report can probably be done in any order, although I prefer to work through the parts that should have one-to-one relationships first, so next I'll add the joins and the fields that will get me all the sales-rep-associated information.
Suppose I only want one rep per meeting (if there are multiple reps, I only want the main one) so I check to make sure that I'm still returning the same number of records as when I just had meeting information. If not I look at my joins and decide which one is giving me more records than I need. In this case it might be the address table as we are storing multiple address for the rep. I then adjust the query to get only one. This may be easy (you may have a field that indicates the specific unique address you want and so only need to add a where condition) or you may need to do some grouping and aggregate functions to get what you want.
Then I go on to the next chunk (working first through all the chunks that should have a one-to-one relationship to the central data, in this case the meeting). Run the query and check the data after each addition.
Finally I move to those records which might have a one-many relationship and add them. Again I run the query and check the data. For instance, I might check the raw data for a particular meeting and make sure what my query is returning is exactly what I expect to see.
Suppose in one of these additions of a join I find the number of distinct meetings has dropped. Oops, then there is no data in one of the tables I just added and I need to change that to a left join.
Another time I may find too many records returned. Then I look to see if my WHERE clause needs more filtering, or if I need to use an aggregate function to get the data I need. Sometimes I will add other fields to the report temporarily to see what is causing the duplicated data. This helps me know what needs to be adjusted.
The real key is to work slowly, understand your data model and check the data after every new chunk is added to make sure it is returning the results the way you think they should be.
Sometimes, if I'm returning a lot of data, I will temporarily put an additional WHERE clause on the query to restrict it to a few items I can easily check. I also strongly suggest using ORDER BY, because it will help you see if you are getting duplicated records.
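As a made-up illustration of that "add one chunk, re-check the count" workflow, using the meeting example (all table names invented):

    -- Step 1: the base set. Note the count; later steps must not change it.
    SELECT COUNT(DISTINCT m.meeting_id)
    FROM meetings m
    WHERE m.start_date >= '2014-01-01';

    -- Step 2: add a chunk that should be one-to-one (the sales rep).
    -- Re-run the count; it should be unchanged.
    SELECT COUNT(DISTINCT m.meeting_id)
    FROM meetings m
    JOIN sales_reps r ON r.rep_id = m.rep_id
    WHERE m.start_date >= '2014-01-01';

    -- Step 3: a chunk that may have no matching rows (speakers). If the count
    -- drops, switch to a LEFT JOIN to keep meetings with no speaker.
    SELECT COUNT(DISTINCT m.meeting_id)
    FROM meetings m
    JOIN sales_reps r    ON r.rep_id = m.rep_id
    LEFT JOIN speakers s ON s.meeting_id = m.meeting_id
    WHERE m.start_date >= '2014-01-01';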
Well, the best way to break down your MySQL query is to run the EXPLAIN command, and to read the MySQL documentation on query optimization with EXPLAIN.
MySQL provides some great free GUI tools as well, the MySQL Query Browser is what you need to use.
Running the EXPLAIN command breaks down how MySQL interprets your query and how it will execute each part. It might take some time to decode the output, but that's another question in itself.
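For example (using the invented tables from the sketch above), prefix any SELECT with EXPLAIN to see the join order, access type and usable indexes for each table:

    EXPLAIN
    SELECT m.meeting_id, r.name
    FROM meetings m
    JOIN sales_reps r ON r.rep_id = m.rep_id
    WHERE m.start_date >= '2014-01-01';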
As for a good book I would recommend: High Performance MySQL: Optimization, Backups, Replication, and More
I haven't used them myself so can't comment on their effectiveness, but perhaps a GUI based query builder such as dbForge or Code Factory might help?
And while the use of Venn diagrams to think about MySQL joins doesn't necessarily help with the SQL, they can help visualise the data you are trying to pull back (see Jeff Atwood's post).