Optimal MySQL table schema for given use case

I have two tables - books and images. The books table has many columns, including id (primary key), name (which is not unique), releasedate, etc. The images table has two columns: id (which is not unique, i.e. one book id may have multiple images associated with it, and we need all of those images; this column has a non-unique index) and poster (which is the unique primary key; all images lie in the same bucket, hence they cannot have duplicate names). My requirement is, given a book name, to find all images associated with it (along with the year of release and the bucketname for each image, the bucketname being just a number in this case).
I am running this query:
select books.id,poster,bucketname,year(releasedate) from books
inner join images where images.bookId = books.id and books.name = "<name>";
A sample result set may look like this:
As you can see there are two results matching - one with id 2 and year 1989, having 5 images, other one with id 261009, year 2013 and one image.
The problem is, the query is extremely slow. It takes around .14 seconds from MySQL console itself, under zero load (in production there may be several concurrent requests and they may be queued, leading to further delay), which is unacceptable for autocomplete. Can anyone tell me how to optimize the query by adding correct indices/keys to the tables? If it is not possible from MySQL, suggestions regarding a proper Redis schema would be useful as well.
Edit: Approx no. of rows in images - 480k, in books - 285k. In future, autocomplete will show result for book authors as well as book names, hence the query will need to expand to take into account a separate table authors where each author will have an id and name, just like a book.

For optimal performance, you want suitable covering indexes available. For example:
... on `books` (`name`,`id`,`releasedate`)
... on `images` (`bookid`,`poster`,`bucketname`)
We want name as the leading column in the index, because of the equality predicate in the WHERE clause. We want id and releasedate also included in the index to make it a "covering index", so the query can be satisfied from the index, without a need to visit pages of the underlying table to retrieve values.
We want bookid as the leading column because of the reference in the ON clause. Again, having poster and bucketname available right in the index makes it a "covering" index.
Use EXPLAIN to see the query execution plan.
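As a minimal sketch of the two covering indexes and the EXPLAIN check: the index names `books_name_cover` and `images_bookid_cover` and the sample book title are made up for illustration, and the column names are assumed to match the ones used in the query.

```sql
-- Hypothetical index names; columns as referenced in the question's query.
CREATE INDEX books_name_cover    ON books  (name, id, releasedate);
CREATE INDEX images_bookid_cover ON images (bookid, poster, bucketname);

-- Inspect the plan for the lookup by book name (sample title is a placeholder).
EXPLAIN
SELECT b.id, i.poster, i.bucketname, YEAR(b.releasedate)
  FROM books b
  JOIN images i ON i.bookid = b.id
 WHERE b.name = 'Some Book Title';
```

If both indexes are picked up, the `key` column of the EXPLAIN output should name them, and `Using index` in the `Extra` column indicates that a table is being read from the index alone.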
Also, note that the inner join operation won't return a row from books if a matching row in images is not found. If we want to return a row from books even when no image is available, we could use an outer join.
I'd write the query like this:
SELECT b.id
     , i.poster
     , i.bucketname
     , YEAR(b.releasedate)
  FROM books b
  LEFT
  JOIN images i
    ON i.bookid = b.id
 WHERE b.name = ?

Related

MySQL join 2 tables on non-unique column and with timestamp conditions

I have 2 MySQL Tables: "parts_revisions" and "categories_revisions". My goal is to use the revisions data in these tables to create a log that lists out all the changes made to parts and categories. Listing the changes to "parts" in one single SQL statement has proven tricky though! Here is the situation:
All entries of each table have "timestamp" columns.
Every parts_revisions entry has a "categoryId" that basically links it to the categories_revisions table. (Every part is a child of a parent category.)
All I want to do is list out all the parts_revisions, but use the human-friendly "name" column from the categories_revisions table based on the categoryId column in parts_revisions. This will make the log more readable.
The trick is that, because there are usually multiple revisions for each category within the categories_revisions table, I cannot do just one big ol' join on the categoryId column to get the name. The categoryId column is non-unique, and "name"s may vary. What I have to do is get the latest categories_revisions entry that has a timestamp that is no later than the timestamp of the parts_revisions entry. In other words, we want to get the appropriate category name that was in use AT THE TIME the part revision was made.

Not sure if this matches your table structure, but here's a go at it. It's a bit of an ugly subquery inside a subquery, and I'm guessing it won't be terribly efficient:
select part_name,
       category,
       (select name
          from categories_revisions
         where categories_revisions.match_id = parts_revisions.category
           and categories_revisions.timestamp = (select MAX(categories_revisions.timestamp)
                                                   from categories_revisions
                                                  where categories_revisions.match_id = parts_revisions.category
                                                    and categories_revisions.timestamp <= parts_revisions.timestamp)) as name
  from parts_revisions;
http://sqlfiddle.com/#!2/da74e/1/0
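The same "latest category revision at or before the part revision" lookup can also be sketched with a single correlated subquery using ORDER BY ... LIMIT 1. This is an alternative formulation rather than the answer's own query, and it assumes the same guessed column names (`match_id`, `category`, `part_name`). It uses `<=` so that a category revision made at the same instant still counts, matching "no later than" in the question. The index name is hypothetical.

```sql
-- Hypothetical index so each lookup can seek by category and walk timestamps backwards.
CREATE INDEX cr_match_ts ON categories_revisions (match_id, timestamp);

SELECT pr.part_name,
       pr.category,
       (SELECT cr.name
          FROM categories_revisions cr
         WHERE cr.match_id  = pr.category
           AND cr.timestamp <= pr.timestamp
         ORDER BY cr.timestamp DESC
         LIMIT 1) AS name
  FROM parts_revisions pr;
```

A part revision that predates every revision of its category gets NULL for `name`, the same as with the nested MAX() version.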

Search a many to many relationship with a wild card, performance issues

I am building a database for an app and I am testing performance issues on a larger data set. I generated about 250,000 location records. Each location can be assigned to many categories and a category can be assigned to many locations. My data-set has 2-4 categories assigned to each location.
I want to allow the user to search for locations by filtering which categories should be allowed using a wild card search. So maybe I want to match all categories with the word "red" in it. So if I type red, now it shows all locations which have a category title that has "red" in it. In addition, I would like to wildcard search the location title with that same string.
I wrote up a query which works but performance is awful in large data-sets. Essentially I am using inner queries which is fine if my limit is set and I find results quick (around .05ms). If I don't find any results right away, it looks like it goes through the whole database and the query takes around 9-10 seconds.
Here is a simplified layout of my database:
locations: id | title | address
categories: id | title
locations_categories: id | location_id | category_id
Here is the query I currently am using:
SELECT `id`,`title`,`address`
FROM (`locations`)
WHERE title LIKE '%string%'
AND id IN (
    SELECT location_id
    FROM locations_categories
    JOIN categories ON categories.id = locations_categories.category_id
    WHERE categories.title LIKE '%string%')

First of all, your main query just uses the value of the subquery, so it can be rewritten:
SELECT location_id
FROM locations_categories
JOIN categories ON categories.id = locations_categories.category_id
WHERE categories.title LIKE '%string%'
But I'd propose splitting this query into several simpler ones, since JOINs can be slow for big datasets. The first one will get the necessary category IDs (with paging):
SELECT id
FROM categories
WHERE title LIKE '%string%' LIMIT <start>, <step>
Then you can get locations_categories:
SELECT location_id FROM locations_categories WHERE category_id IN (...)
And you'll use the location IDs you've got to retrieve corresponding records:
SELECT * FROM locations WHERE id IN (...)
These 3 queries combined will be much faster than your original one.
Also, make sure your title column is indexed, since it can be the bottleneck. But since you have a wildcard at the start of the search term, you'll have to use a FULLTEXT index here.
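A rough sketch of what the FULLTEXT route could look like for the category lookup. The index name `ft_categories_title` and the search term are made up, and this assumes a storage engine with FULLTEXT support (MyISAM, or InnoDB from MySQL 5.6 on). Note that a FULLTEXT search matches whole words or word prefixes, not arbitrary substrings, so it is not a drop-in replacement for LIKE '%string%'.

```sql
-- Hypothetical index name.
ALTER TABLE categories ADD FULLTEXT INDEX ft_categories_title (title);

-- Boolean mode with a trailing * matches words that start with "red".
SELECT id
  FROM categories
 WHERE MATCH(title) AGAINST('red*' IN BOOLEAN MODE)
 LIMIT 0, 20;
```

Very short words may be skipped entirely depending on the server's minimum word length settings (`ft_min_word_len` for MyISAM, `innodb_ft_min_token_size` for InnoDB), so it is worth checking those before relying on three-letter terms.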

Your explain plan will confirm (or disprove) this but I suspect that your issue is that the leading % in the clauses
WHERE categories.title LIKE '%string%'
and
WHERE title LIKE '%string%'
forces full table scans. To address this often requires some knowledge of the domain and application in question.
The simple approach is to only search for 'starts with'. Others include full text searching, function based indexes, having a 'grouping table' that presorts and lists the relevant records for known searches.
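The "starts with" approach might look like the following sketch. The index names and the search term are hypothetical, and it assumes the titles are plain VARCHAR columns that can carry an ordinary B-tree index.

```sql
-- Hypothetical index names; ordinary B-tree indexes on the searched columns.
CREATE INDEX idx_categories_title ON categories (title);
CREATE INDEX idx_locations_title  ON locations  (title);

-- With no leading wildcard the pattern is a prefix, so the index can be used for a range scan.
EXPLAIN
SELECT id
  FROM categories
 WHERE title LIKE 'red%';
```

Comparing this EXPLAIN output with the one for LIKE '%red%' should show the difference directly: a `range` access type on the index for the prefix search versus a scan of every row (or every index entry) for the leading wildcard.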

SQL Query to populate table based on PK of Main Table being joined

Here is my Database structure (basic relations):
I'm attempting to formulate a one-line query that will populate the clients_ID, Job_id, tech_id, & Part_id and return back all the work orders present. Nothing more nothing less.
Thus far I've struggled to generate this Query:
SELECT cli.client_name, tech.tech_name, job.Job_Name, w.wo_id, w.time_started, w.part_id, w.job_id, w.tech_id, w.clients_id, part.Part_name
FROM work_orders as w, technicians as tech, clients as cli, job_types as job, parts_list as part
LEFT JOIN technicians as techy ON tech_id = techy.tech_name
LEFT JOIN parts_list party ON part.part_id = party.Part_Name
LEFT JOIN job_types joby ON job_id = joby.Job_Name
LEFT JOIN clients cliy ON clients_id = cliy.client_name
Apparently, once all the joining happens it does not even populate the correct foreign key values according to their reference.
[some values came out as the actual foreign key id, not even corresponding value.]
It just goes on about 20-30 times depending on largest row of a table that I have (one of the above).
I only have two work orders created, so ideally it should return just TWO records, with the columns and fields holding the correct information. What could I be doing wrong? I haven't been with MySQL too long but am learning as much as I can.

Your join conditions are wrong. Join on tech_id = tech_id, not tech_id = tech_name. Looks like you do this for all your joins, so they all need to be fixed.
I really don't follow the text of your question, so I am basing my answer solely on your query.
Edit
Replying to your comment here. You said you want to "load up" the tech name column. I assume you mean you want tech name to be part of your result set.
The SELECT part of the query is what determines the columns that are in the result set. As long as the table where the column lives is referenced in the FROM/JOIN clauses, you can SELECT any column from that table.
Think of a JOIN statement as a way to "look up" a value in one table based on a value in another table. This is a very simplified definition, but it's a good way to start thinking about it. You want tech name in your result set, so you look it up in the Technicians table, which is where it lives. However, you want to look it up by a value that you have in the Work Orders table. The key (which is actually called a foreign key) that you have in the Work Orders table that relates it to the Technicians table is the tech_id. You use the tech_id to look up the related row in the Technicians table, and by doing so can include any column in that table in your result set.
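Put together, the lookup described above could be sketched as follows. The foreign key columns on work_orders come from the question's SELECT list. The primary key names on the lookup tables (`tech_id`, `part_id`, `job_id`, `clients_id`) are assumptions, since the schema diagram is not shown.

```sql
-- One row per work order; each LEFT JOIN looks up a name by its id.
SELECT w.wo_id,
       w.time_started,
       cli.client_name,
       tech.tech_name,
       job.Job_Name,
       part.Part_name
  FROM work_orders AS w
  LEFT JOIN clients     AS cli  ON cli.clients_id = w.clients_id  -- assumed PK name
  LEFT JOIN technicians AS tech ON tech.tech_id   = w.tech_id
  LEFT JOIN job_types   AS job  ON job.job_id     = w.job_id
  LEFT JOIN parts_list  AS part ON part.part_id   = w.part_id;
```

Only work_orders appears in the FROM list here. Listing several tables separated by commas without join conditions produces every combination of their rows, which would explain getting 20-30 rows back when only two work orders exist.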

Best structure for tables with more than 10000 columns

I am applying a group of data mining algorithms to a dataset comprised of a set of customers along with a large number of descriptive attributes that summarize various aspects of their past behavior. There are more than 10,000 attributes, each stored as a column in a table with the customer id as the primary key. For several reasons, it is necessary to pre-compute these attributes rather than calculating them on the fly. I generally try to select customers with a specified attribute set. The algorithms can combine any arbitrary number of these attributes together in a single SELECT statement and join the required tables. All the tables have the same number of rows (one per customer).
I am wondering what's the best way to structure these tables of attributes. Is it better to group the attributes into tables of 20-30 columns, requiring more joins on average but fewer columns per SELECT, or have tables with the maximum number of columns to minimize the number of joins, but having potentially all 10K columns joined at once?
I also thought of using one giant 3-column customerID-attribute-value table and storing all the info there, but it would be harder to structure the "select all customers with these attributes"-type query that I need.
I'm using MySQL 5.0+, but I assume this is a general SQL-ish question.

From my experience, using tables with 10,000 columns is a very, very bad idea. What if this number increases in the future?
If there are a lot of attributes, you shouldn't use horizontally scaled tables (with a large number of columns). You should create a new attributes table and place the attribute values in it, then connect this table to the main entity table with a many-to-one relationship.
A second option may be to use a NoSQL system (like MongoDB).

As @odiszapc said, you have to use a meta-model structure, for instance:
CREATE TABLE customer(ID INT NOT NULL PRIMARY KEY, NAME VARCHAR(64));
CREATE TABLE customer_attribute(ID INT NOT NULL, ID_CUSTOMER INT NOT NULL, NAME VARCHAR(64), VALUE VARCHAR(1024));
Return basic information for a given customer:
SELECT * FROM customer WHERE name='John';
Return customer(s) matching certain attributes:
SELECT c.*
FROM customer c
INNER JOIN customer_attribute a1 ON a1.id_customer = c.id
    AND a1.name = 'address'
    AND a1.value = '1078, c/ los gatos madrileños'
INNER JOIN customer_attribute a2 ON a2.id_customer = c.id
    AND a2.name = 'age'
    AND a2.value = '27'
Your generator should generate the inner joins on the fly.
Proper indexes on the tables should allow all this engine to go relatively fast (if we assume 10k attributes per customer, and 10k customers, that's actually pretty much a challenge...)
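One possible reading of "proper indexes" for the customer_attribute table defined above is sketched below. The index names and the 64-character prefix on VALUE are assumptions chosen to keep the index key reasonably small, not something the answer specifies.

```sql
-- Hypothetical index: find rows by attribute name and value, then read the customer id.
CREATE INDEX idx_attr_name_value
    ON customer_attribute (`NAME`, `VALUE`(64), ID_CUSTOMER);

-- Hypothetical index: list all attributes of one customer.
CREATE INDEX idx_attr_customer
    ON customer_attribute (ID_CUSTOMER, `NAME`);
```

With the first index each generated INNER JOIN can locate its (name, value) pair by an index seek rather than scanning the whole attribute table. Values longer than the prefix are still compared in full against the row.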

10,000 columns is a lot. The SELECT statement will be very long and messy if you don't use *. I think you can narrow the attributes down to the most useful and meaningful ones, eliminating the others.

Problem with Mysql index when using "OR" statement

Is there any way to create a functioning index for this query and get rid of "filesort"?
SELECT id, title FROM recipes use index (topcat) where
(topcat='$cid' or topcat2='$cid' or topcat3='$cid')
and approved='1' ORDER BY id DESC limit 0,10;
I created index "topcat" (columns: topcat1+topcat2+topcat3+approved+id) but still get "Using where; Using filesort".
I can create one more column, let's say "all_topcats", to store the topcat numbers as a list - 1,5,7 - and then run the query "... where $cid IN ()...". But the problem is that in this case the "all_topcats" column will be "varchar" while the "approved" and "id" columns are int, and the index will not be used anyway.
Any ideas? Thanks.

You might improve performance for that query if you reordered the columns in the index:
approved, topcat1, topcat2, topcat3, id
It would be useful to know what the table looks like and why you have three columns named like that. It might be easier to organise a good query if you had a subsidiary table to store the topcat values, with a link back to the main table, but without knowing why you have it set up like that it's hard to know whether that would be sensible.
Can you post the CREATE TABLE?
Edit in response to user message
Your table doesn't sound like it's well-designed. The following design would be better: Add two new tables, Category and Category_Recipe (a cross-referencing table). Category will contain a list of your categories and Category_Recipe will contain two columns, one a foreign key to Category and one a foreign key to the existing Recipe table. A row of Category_Recipe is a statement "this recipe is in this category". You will then be able to very simply write a query that will search for recipes in a given category. You also have the ability to put a recipe in arbitrarily many categories, rather than being limited to 3. Look up "database normalisation" and "foreign keys".
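A minimal sketch of the two tables and the resulting query, assuming the existing table is named `recipes` as in the question, that `recipes.id` is an INT primary key, and that the tables use an engine that enforces foreign keys (such as InnoDB). The column names and sizes are made up for illustration.

```sql
CREATE TABLE Category (
    id    INT NOT NULL PRIMARY KEY,
    title VARCHAR(64) NOT NULL
);

-- One row per "this recipe is in this category" statement.
CREATE TABLE Category_Recipe (
    category_id INT NOT NULL,
    recipe_id   INT NOT NULL,
    PRIMARY KEY (category_id, recipe_id),
    FOREIGN KEY (category_id) REFERENCES Category (id),
    FOREIGN KEY (recipe_id)   REFERENCES recipes (id)
);

-- Latest approved recipes in one category; no OR across three columns.
SELECT r.id, r.title
  FROM Category_Recipe cr
  JOIN recipes r ON r.id = cr.recipe_id
 WHERE cr.category_id = ?
   AND r.approved = '1'
 ORDER BY r.id DESC
 LIMIT 0, 10;
```

Because the primary key of Category_Recipe starts with category_id and continues with recipe_id, the rows for one category are already stored in recipe id order, which gives the optimizer a chance to avoid the filesort. Running EXPLAIN on the real data is the way to confirm it.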
