Most efficient way to select lots of values from database - mysql

I have a database table with around 2400 'items'. A user will have any arbitrary combination of items from the 2400 item set. Each item's details then need to be looked up and displayed to the user. What is the most efficient way to get all of the item details? I can think of three methods:
Select all 2400 items and parse the required details client-side.
Select the specific items with a SELECT which could be a very long SQL string (0-2400 ids)?
Select each item one at a time (too many connections)?
I'm not clued up on SQL efficiency, so any guidance would help. It may help to know this is a heavily AJAX-based web app.
Edit: On average a user will select ~150 items and very rarely more than 400-500.

The best method is to return the data you want from the database in a single query:
select i.*
from items i
where i.itemID in (<list of ids>);
MySQL queries can be quite long (the limit is governed by the max_allowed_packet setting; see here), so I wouldn't worry about being able to pass in the values.
However, if your users have so many items, I would suggest storing them in the database first and then doing a join to get the additional information.

If the users never/rarely select more than ~50 elements, then I agree with Gordon's answer.
If it is really plausible that they might select up to 2400 items, you'll probably be better off by inserting the selected ids into a holding table and then joining with that.
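A minimal sketch of that approach (the holding table name is just a placeholder; items/itemID match the query above):
-- per-session holding table for the selected ids
create temporary table selected_items (itemID int primary key);

-- insert the user's selection, typically in batches from the application
insert into selected_items (itemID) values (17), (42), (108);

-- join against the holding table instead of passing a long IN-list
select i.*
from items i
join selected_items s on s.itemID = i.itemID;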
However, a more thorough answer can be found here - which I found through this answer.
He concludes that:
We see that for a large list of parameters, passing them in a temporary table is much faster than as a constant list, while for small lists performance is almost the same.
'Small' and 'large' are hardly static, but dependent upon your hardware - so you should test. My guess would be that with an average of 150 elements in your IN-list, you will see the temp table win.
(If you do test, please come back here and say what is the fastest in your setup.)

2400 items is nothing. I have MySQL databases with hundreds of thousands of rows and plenty of relations, and they work perfectly, just by optimizing the queries.
What you should do is measure the execution time of each SQL query. You can then optimize each query on its own, trying different variants and comparing their execution times.
You can use a tool such as MySQL Workbench, Sequel Pro, or Microsoft SQL Server Management Studio to build and test queries. You can also add indexes to your tables, which can speed queries up as well.
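For example, assuming an items table where you filter on a category column (the names here are illustrative, not from the question):
-- the query being measured
select * from items where category = 'weapon';

-- add an index on the filtered column, then re-run the query and compare execution times
alter table items add index idx_items_category (category);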
If you need to scale your database up, you can use software like Hadoop (http://hadoop.apache.org).
NoSQL ("Not Only SQL") databases are also worth mentioning. They are non-relational, can handle dynamic attributes, and are built for handling large amounts of data.
As you mention, you could use AJAX, but that only helps the page load time and the load on your web server (not the SQL server). Just ask if you want more info or a more in-depth explanation.

Related

Is filtered selection faster than fetching all the rows and then filtering

So I want to create a table in the frontend where I will list every single user. The thing is that the tables are relational and I have to get data from multiple tables in order to fulfill my goal.
Now here comes my question (keep in mind I have a MySQL database):
Which method is better in the long run:
Generate joined queries that fetch all the data from each table where a user has any information (it outputs ~80 columns per row and only 15 of them are needed)
Fetch the data that I need with multiple queries and then just "stick" the values together and output them (15 columns and all of them are needed, but I have to do extra work)
I would suggest you go for a third option:
Generate joined queries that fetch only the 15 columns your front end needs. That would be the most efficient way.
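A rough sketch of what that could look like (the table and column names below are invented, since the real schema isn't shown):
-- select only the columns the front end actually displays
select u.id,
       u.username,
       u.email,
       p.city,
       o.last_order_date
from users u
left join profiles p on p.user_id = u.id
left join order_summaries o on o.user_id = u.id;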
If you are facing challenges joining the tables, you can share the table structures with sample data and the desired output here, along with your query, and we can try to help you achieve your goal.
This is a bit long for a comment.
I don't understand your first option. Why would you be selecting columns that you don't need? If there are 15 columns that you specifically want, then select those columns and nothing else.
In general, it is faster to have the database do most of the work. It can take advantage of its optimizer to produce the best execution plan that it can.
From experience with MySQL servers on embedded hardware:
If the hardware can do it and has enough resources, let the database server run its course, since it can use its optimizer.
But if the server hardware lags on some front, transport all the data to the client and let it run JavaScript over the returned data.
The same goes for the bandwidth of the internet connection: if it is slow, you want to transport fewer rows, because the user will notice the delay. Even old smartphones have plenty of CPU power and can handle with ease whatever you throw at them.
Basically, there is no simple answer; you have to check the server hardware and the bandwidth usually available, and then program the solution that works best.
A simple Rule of Thumb:
Fewer round-trips to the database server is usually the faster alternative.

SQL and fuzzy comparison

Let's assume we have a table of People (name, surname, address, SSN, etc).
We want to find all rows that are "very similar" to specified person A.
I would like to implement some kind of fuzzy-logic comparison of A against all rows from the People table. There will be several fuzzy inference rules working separately on several columns (e.g. 3 fuzzy rules for name, 2 rules on surname, 5 rules on address).
The question is: which of the following two approaches would be better, and why?
Implement all fuzzy rules as stored procedures and use one heavy SELECT statement to return all rows that are "very similar" to A. This approach may include using Soundex, similarity metrics, etc.
Implement one or more simpler SELECT statements that return less accurate results, "rather similar" to A, and then fuzzy-compare A with all returned rows (outside the database) to get the "very similar" rows. The fuzzy comparison would then be implemented in my favorite programming language.
The People table should have up to 500k rows, and I would like to make about 500-1000 queries like this a day. I use MySQL (but this is still open to reconsideration).
I don't really think there is a definitive answer because it depends on information not available in the question. Anyway, too long for a comment.
DBMSes are good at retrieving information according to indexes. It does not make sense to have a database server wasting time on heavy computations unless it is dedicated to this specific purpose (as answered by @Adrian).
Therefore, your client application should delegate to the DBMS the retrieval of information required by the rules.
If the computations are minor, everything could be done on the server. Otherwise, offload it to the client system.
The disadvantage of the second approach lies in the amount of data traveling from the server to the client and the number of connections to establish. So, typically, it is a compromise between computation on the server and data transfer, a balance to be struck depending on the specifics of the fuzzy rules.
Edit: I've seen in a comment that you are almost sure to have to implement the code in the client. In that case, you should consider an additional criterion, code locality, for maintenance purposes, i.e., try to have all code that is related together, not spreading it between systems (and languages).
I would say you're best off using simple selects to get the closest matches you can without hammering the database, then do the heavy lifting in your application layer. The reason I would suggest this solution is scalability: if you do your heavy lifting in the application layer, your problem is a perfect use case for a map-reduce-style solution wherein you can distribute the processing of similarities across nodes and get your results back much faster than if you put it through the database; plus, this way, you're not locking up your database and slowing down any other operations that may be going on at the same time.
Since you're still considering which DB to use: PostgreSQL has the fuzzystrmatch module, which provides Levenshtein and Soundex functions. You might also want to look at the pg_trgm module as described here. Maybe you could also put an index on the column using soundex() so you won't have to calculate it every time.
But you seem to be optimizing prematurely, so my advice would be to test using PostgreSQL first and then decide whether you need to optimize at all; the numbers you provided really don't seem like a lot, considering you have almost two minutes to run each query.
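A rough sketch of the PostgreSQL side (the people/surname names come from the question; the constant 'Smith' is just an example value):
-- enable the module once per database
create extension if not exists fuzzystrmatch;

-- narrow candidates by Soundex, then rank the survivors by edit distance
select *
from people
where soundex(surname) = soundex('Smith')
order by levenshtein(surname, 'Smith');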
An option I'd consider is to add a column to the People table that holds the Soundex value of the person's name.
I've done joins using:
Select [Column]
From People P
Inner Join TableA A On Soundex(A.ComparisonColumn) = P.SoundexColumn
That'll return anything in TableA that has the same Soundex value as the People table's Soundex column.
I haven't used that kind of query on tables that size, but I see no issues with trying it. You can also index that SoundexColumn to help with performance.
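For example, in MySQL that could look like this (SoundexColumn is the added column; name comes from the question's schema, and the column size is a guess):
-- add and populate the precomputed column (keep it updated when names change)
alter table People add column SoundexColumn varchar(255);
update People set SoundexColumn = soundex(name);

-- index it so the join above can use an index lookup instead of a full scan
alter table People add index idx_people_soundex (SoundexColumn);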

Should I split up a complex query into one to filter results and one to gather data?

I'm designing a central search function in a PHP web application. It is focused around a single table and each result is exactly one unique ID out of that table. Unfortunately there are a few dozen tables related to this central one, most of them being 1:n relations. Even more unfortunate, I need to join quite a few of them. A couple to gather the necessary data for displaying the results, and a couple to filter according to the search criteria.
I have been mainly relying on a single query to do this. It has a lot of joins and, as there should be exactly one result displayed per ID, it also uses rather complex subqueries and GROUP BY clauses. It also gets sorted according to a user-set sort method, and pagination is handled with LIMIT as well.
Anyways, this query has become insanely complex and while I nicely build it up in PHP it is a PITA to change or debug. I have thus been considering another approach, and I'm wondering just how bad (or not?) this is for performance before I actually develop it. The idea is as follows:
run one less complex query that only filters according to the search parameters. This means fewer joins, and I can completely ignore GROUP BY and similar constructs; I will just "SELECT DISTINCT item_id" on this and get a list of IDs
then run another query, this time only joining in the tables I need to display the results (only about 1/4 of the current total joins) using ... WHERE item_id IN (....), passing the list of "valid" IDs gathered in the first query.
Note: obviously the IN () could actually contain the first query in full instead of relying on PHP to build up a comma-separated list (see the sketch below).
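A sketch of that variant, with the filter part inlined as a subquery (the table and column names are placeholders standing in for the real filter and display joins):
select i.item_id, i.title, c.name as category
from items i
join categories c on c.id = i.category_id
where i.item_id in (
    -- the "filter only" query: fewer joins, no GROUP BY needed
    select distinct f.item_id
    from items f
    join item_attributes a on a.item_id = f.item_id
    where a.attribute = 'search term'
)
order by i.title
limit 0, 25;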
How bad will the IN be performance-wise? And how much will it possibly hurt me that I can not LIMIT the first query at all? I'm also wondering if this is a common approach to this or if there are more intelligent ways to do it. I'd be thankful for any input on this :)
Note to clarify: We're not talking about a few simple joins here. There is even (simple) hierarchical data in there where I need to compare the search parameter against not only the items own data but also against its parent's data. In no other project I've ever worked on have I encountered a query close to this complexity. And before you even say it, yes, the data itself has this inherent complexity, which is why the data model is complex too.
My experience has shown that using the WHERE IN(...) approach tends to be slower. I'd go with the joins, but make sure you're joining on the smallest dataset possible first. Reduce down the simple main table, then join onto that. Make sure your most complex joins are saved to the end to minimize the rows required to search. Try to join on indexes wherever possible to improve speed, and ditch wildcards in JOINS where possible.
But I agree with Andomar, if you have the time build both and measure.

Is it better to return one big query or a few smaller ones?

I'm using MySQL to store video game data. I have tables for titles, platforms, tags, badges, reviews, developers, publishers, etc...
When someone is viewing a game, is it best to have one query that returns all the data associated with a game, or is it better to use several queries? Intuitively, since we have reviews, it seems pointless to include them in the same query since they'll need to be paginated. But there are other situations where I'm unsure whether to break the query down or use two queries...
I'm a bit worried about performance, since I'm now joining the following tables to games: developers, publishers, metatags, badges, titles, genres, subgenres, classifications... To grab a game's badges (from games_badges, which is many-to-many to the games table and many-to-many to the badges table) I can either do another join or run a separate query, and I'm unsure which is best...
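For the badges specifically, the extra join would look roughly like this (the column names are guesses, since only the table names are given):
select g.id, g.title, b.name as badge_name
from games g
join games_badges gb on gb.game_id = g.id
join badges b on b.id = gb.badge_id
where g.id = 123;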
It is significantly faster to use one query than to use multiple queries, because starting a query and calculating the query plan are themselves costly, and running multiple queries in a row adds that overhead each time. Obviously you should only get the data that you actually need, but fewer queries is always better.
So if you are going to show 20 games on a page, you can speed up the query (still using only one query) with a LIMIT clause and only run that query again later when they get to the next page. That or you can just make them wait for the query to complete and have all of the data there at once. One big wait or several little waits.
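For example, a paged list of 20 games at a time (the column names are illustrative):
-- page 1
select id, title from games order by title limit 20 offset 0;

-- page 2, run only when the user asks for it
select id, title from games order by title limit 20 offset 20;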
tl;dr use as few queries as possible.
There is no panacea.
Always try to get only necessary data.
There is no universal answer as to whether one big query or several small queries is better. Each case is unique, and to answer this question you should profile your application and examine the queries' EXPLAIN output.
This is generally a processing problem.
If a single query would mean retrieving thousands of entries and processing them yourself, issue several queries instead and have MySQL do the processing (sums, etc.); see the sketch after the list below.
If using multiple queries would mean making tens or hundreds of them, then issue a single query.
Obviously you're always facing both of these, since neither is an obvious go-to option if you're asking the question, so the choices really are:
Pick the one you can take the hit on
Cache or mitigate it as much as you can so that you take a hit very rarely
Try to insert preprocessed data in the database to help you process the current data
Do the processing as part of a cron and have the application only retrieve the data
Take a few steps back and explore other possible approaches that don't require the processing
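As an illustration of letting MySQL do the processing rather than retrieving thousands of raw rows (the orders table and its columns are invented for the example):
-- one aggregated row per customer instead of every order row
select customer_id, sum(amount) as total_spent, count(*) as order_count
from orders
group by customer_id;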

Best approach to construct complex MySQL joins and groups?

I find that when trying to construct complex MySQL joins and groups between many tables I usually run into strife and have to spend a lot of 'trial and error' time to get the result I want.
I was wondering how other people approach the problems. Do you isolate the smaller blocks of data at the end of the branches and get these working first? Or do you start with what you want to return and just start linking tables on as you need them?
Also wondering if there are any good books or sites about approaching the problem.
I don't work in MySQL, but I do frequently write extremely complex SQL, and here's how I approach it.
First, there is no substitute whatsoever for thoroughly understanding your database structure.
Next I try to break up the task into chunks.
For instance, suppose I'm writing a report concerning the details of a meeting (the company I work for does meeting planning). I will need to know the meeting name and sales rep, the meeting venue and dates, the people who attended, and the speaker information.
First I determine which of the tables will have the information for each field in the report. Now I know what I will have to join together, but not exactly how as yet.
So first I write a query to get the meetings I want. This is the basis for all the rest of the report, so I start there. Now the rest of the report can probably be done in any order, although I prefer to work through the parts that should have one-to-one relationships first, so next I'll add the joins and the fields that will get me all the sales-rep-associated information.
Suppose I only want one rep per meeting (if there are multiple reps, I only want the main one), so I check to make sure that I'm still returning the same number of records as when I just had the meeting information. If not, I look at my joins and decide which one is giving me more records than I need. In this case it might be the address table, as we are storing multiple addresses for the rep. I then adjust the query to get only one. This may be easy (you may have a field that indicates the specific unique address you want, and so only need to add a WHERE condition) or you may need to do some grouping and aggregate functions to get what you want.
Then I go on to the next chunk (working first through all the chunks that should have a one-to-one relationship to the central data, in this case the meeting). Run the query and check the data after each addition.
Finally I move to those records which might have a one-many relationship and add them. Again I run the query and check the data. For instance, I might check the raw data for a particular meeting and make sure what my query is returning is exactly what I expect to see.
Suppose in one of these additions of a join I find the number of distinct meetings has dropped. Oops, then there is no data in one of the tables I just added and I need to change that to a left join.
Another time I may find too many records returned. Then I look to see if my WHERE clause needs more filtering, or if I need to use an aggregate function to get the data I need. Sometimes I will temporarily add other fields to the report to see what is causing the duplicated data. This helps me know what needs to be adjusted.
The real key is to work slowly, understand your data model and check the data after every new chunk is added to make sure it is returning the results the way you think they should be.
Sometimes, if I'm returning a lot of data, I will temporarily put an additional WHERE clause on the query to restrict it to a few items I can easily check. I also strongly suggest the use of ORDER BY, because it will help you see if you are getting duplicated records.
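In SQL terms, the progression might look something like this (the meeting/rep/attendee tables and the date filter are invented to match the example):
-- step 1: the base set of meetings; note the row count
select m.meeting_id, m.meeting_name
from meetings m
where m.start_date >= '2023-01-01';

-- step 2: add the one-to-one sales rep info; the row count should stay the same
select m.meeting_id, m.meeting_name, r.rep_name
from meetings m
join reps r on r.rep_id = m.rep_id
where m.start_date >= '2023-01-01';

-- step 3: a one-to-many join (attendees) can multiply rows, so aggregate or filter
select m.meeting_id, m.meeting_name, r.rep_name, count(a.attendee_id) as attendee_count
from meetings m
join reps r on r.rep_id = m.rep_id
left join attendees a on a.meeting_id = m.meeting_id
where m.start_date >= '2023-01-01'
group by m.meeting_id, m.meeting_name, r.rep_name;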
Well the best approach to break down your MySQL query is to run the EXPLAIN command as well as looking at the MySQL documentation for Optimization with the EXPLAIN command.
MySQL provides some great free GUI tools as well; the MySQL Query Browser is what you need to use.
Running the EXPLAIN command will break down how MySQL interprets your query and display its complexity. It might take some time to decode the output, but that's another question in itself.
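For example, prefixing a query with EXPLAIN shows the join order, the indexes MySQL considered and chose, and a rough row estimate per table (the tables here are invented placeholders):
explain
select m.meeting_name, r.rep_name
from meetings m
join reps r on r.rep_id = m.rep_id
where m.start_date >= '2023-01-01';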
As for a good book I would recommend: High Performance MySQL: Optimization, Backups, Replication, and More
I haven't used them myself so can't comment on their effectiveness, but perhaps a GUI based query builder such as dbForge or Code Factory might help?
And while the use of Venn diagrams to think about MySQL joins doesn't necessarily help with the SQL, they can help visualise the data you are trying to pull back (see Jeff Atwood's post).