I have a database containing different information from our website. One table (called Raw_Pages) contains a list of every page on our site and the path to it (along with other fields of course). Another table (called Paths) contains a list of various branches of the site that are owned by different departments.
I'm trying to run a query to basically find all pages on the site that do not fall under one of the branches specified.
table Raw_Pages
+-------------------------+--------------+
| Field | Type |
+-------------------------+--------------+
| ID | int(11) |
| Path | varchar(500) |
| Title | varchar(255) |
+-------------------------+--------------+
table Paths
+----------+--------------+
| Field | Type |
+----------+--------------+
| ID | int(11) |
| Path | varchar(255) |
+----------+--------------+
We currently have 64,002 pages I'm checking against 757 paths (All departments have multiple branches due to different ones for different file types). I'm also planning to do a similar query for files, of which we have 306,625 and pulls from the same list of 757 paths. Yes, our site is a giant mess.
From what I can tell, a LEFT JOIN is what would work best for me with a wildcard on the right side. I am a novice at code so I could be far off.
SELECT * FROM Raw_Pages LEFT JOIN Paths ON Raw_Pages.path LIKE CONCAT(Paths.Path,'%') WHERE Paths.ID IS NULL
I'm honestly not sure if the above code works or not since it just freezes phpMyAdmin when I try it. I'm assuming something is wrong in it, or there is a better way.
Thank you!
If you have an index on Paths(Path), you might be able to do:
select rp.*
from raw_pages rp
where not exists (select 1
from paths p
where p.path <= rp.path and
p.path > concat(rp.path, '(')
);
It is possible for the subquery to use an index. I'm not sure it will.
If the value of the path field is identical in the two tables you could use:
SELECT *
FROM Raw_Pages AS R
LEFT JOIN Paths AS P ON (R.path=P.path)
WHERE R.ID IS NULL
If it matches only the name of the page or a piece of the route
SELECT *
FROM Raw_Pages AS R
LEFT JOIN Paths AS P ON (R.path LIKE CONCAT('%',P.path,'%'))
WHERE R.ID IS NULL
You can check this page to verify the type of query you need
It is good practice to index the path fields in both tables so that the query is faster due to the number of records
Related
I'm attempting to take an existing application and re-architect the schema to support new customer requests and fix several outstanding issues (mostly around our current schema being heavily denormalized). In doing so, I've reached an interesting problem which at first glance seems to have a simple solution, but I can't seem to find the function I'm looking for.
The application is a media organization tool.
Our Old Schema:
Our old schema had separate models for "Groups", "Subgroups", and "Videos". A Group could have many Subgroups (one-to-many) and a Subgroup could have many Videos (one-to-many).
There were certain fields that were shared among Groups, Subgroups, and Videos. For instance, the Google Analytics ID to be used when the Video was embedded on a page. Whenever we displayed the embed page we would first look if the value was set on the Video. If not, we checked its Subgroup. If not, we checked its Group. The query looked roughly like so (I wish this were the real query, but unfortunately our application was written over many years by many junior developers, so the truth is much more painful):
SELECT
v.id,
COALESCE(v.google_analytics_id, sg.google_analytics_id, g.google_analytics_id) as google_analytics_id
FROM
Videos v
LEFT JOIN Subgroups sg ON sg.id = v.subgroup_id
LEFT JOIN Groups g ON g.id = sg.group_id
Pretty straight-forward. Now the issue we've run into is that customers want to be able to nest groups arbitrarily deep, and our schema clearly only allows for 2 levels (and, in fact, necessitates two levels - even if you only want one)
New Schema (First Pass):
As a first pass, I knew we'd want a basic tree structure for the Groups, so I came up with this:
CREATE TABLE Groups (
id INT PRIMARY KEY,
name VARCHAR(255),
parent_id INT,
ga_id VARCHAR(20)
)
We can then easily nest up to N levels deep with N joins like so:
SELECT
v.id,
COALESCE(v.ga_id, g1.ga_id, g2.ga_id, g3.ga_id, ...) as ga_id
FROM
Videos v
LEFT JOIN Groups g1 ON g1.id = v.group_id
LEFT JOIN Groups g2 ON g2.id = g1.parent_id
LEFT JOIN Groups g3 ON g3.id = g2.parent_id
...
There's obvious flaws with this approach: We don't know how many parents there will be so we don't know how many times we should JOIN, forcing us to implement a "max depth". Then even with a max depth, if a person only has a single level of groups we still perform multiple JOINs because our queries can't know how deep they need to go. MySQL offers recursive queries, but while looking into if that was the right option I found a smarter schema that produced the same results
New Schema (Take 2):
Looking into better ways to handle a tree structure, I learned about Adjacency Lists (my prior solution), Nested Sets, Materialized Paths, and Closure Tables. Other than Adjacency Lists (which depend on JOINs to grab the entire tree structure and so produces a single row with multiple columns per node on the tree), the other three solutions all return multiple rows for each node on the tree
I ended up going with a Closure Table solution like so:
CREATE TABLE Groups (
id INT PRIMARY KEY,
name VARCHAR(255),
ga_id VARCHAR(20)
)
CREATE TABLE Group_Closure (
ancestor_id INT,
descendant_id INT,
PRIMARY KEY (ancestor_id, descendant_id)
)
Now given a Video I can get all of its parents like so:
SELECT
v.id,
v.ga_id,
g.id,
g.ga_id
FROM
Videos v
JOIN Group_Closure gc ON v.group_id = gc.descendant
JOIN Groups g ON g.id = gc.ancestor;
This returns each group in the hierarchy as a separate row:
+------+---------+------+---------+
| v.id | v.ga_id | g.id | g.ga_id |
+------+---------+------+---------+
| 1 | abc123 | 2 | new_val |
| 1 | abc123 | 1 | default |
| 2 | NULL | 4 | xyz987 |
| 2 | NULL | 3 | NULL |
| 2 | NULL | 1 | default |
| 3 | NULL | 3 | NULL |
| 3 | NULL | 1 | default |
+------+---------+------+---------+
What I wish to do now is somehow achieve the same result I would have expected from using COALESCE on multiple self-joined Group tables: a single value for ga_id based on whichever node is "lowest" in the tree
Because I have multiple rows per Video, I suspect that this can be accomplished using GROUP BY and some kind of aggregate function:
SELECT
v.id,
COALESCE(v.ga_id, FIRST_NON_NULL(g.ga_id))
FROM
Videos v
JOIN Group_Closure gc ON v.group_id = gc.descendant
JOIN Groups g ON g.id = gc.ancestor
GROUP BY v.id, v.ga_id;
Note that because (ancestor, descendant) is my primary key, I believe the order of the group closure table can be guaranteed to always come back the same - meaning if I put the lowest node first, it will be the first row in the resulting query... If my understanding of this is incorrect, please let me know.
If you were to stick with an adjacency list, you could use a recursive CTE. This one traverses up from each video id value until it finds a non-NULL ga_id:
WITH RECURSIVE CTE AS (
SELECT id, ga_id, group_id
FROM videos
UNION ALL
SELECT CTE.id, COALESCE(CTE.ga_id, g.ga_id), g.parent_id
FROM `groups` g
JOIN CTE ON g.id = CTE.group_id AND CTE.ga_id IS NULL
)
SELECT id, ga_id
FROM CTE
WHERE ga_id IS NOT NULL
For my attempt to reconstruct your data from your question, this yields:
id ga_id
1 abc123
2 xyz987
3 default
Demo on dbfiddle
I have come across this problem and I've tried to solve it few days now.
Let's say I have following tables
properties
-----------------------------------------
| id | address | building_material |
-----------------------------------------
| 1 | Street 1 | 1 |
-----------------------------------------
| 2 | Street 2 | 2 |
-----------------------------------------
building_materials
-----------------------------
| id | building_material |
-----------------------------
| 1 | Wood |
-----------------------------
| 2 | Stone |
-----------------------------
Now. I would like to provide an API where you could send a request and ask for every property that has building material of wood. Like this:
myapi.com/properties?building_material=Wood
So I would like to query database like this (I want to return the string value of building_material not the numeric value):
SELECT p.id, p.address, bm.building_material
FROM properties as p
JOIN building_materials as bm ON (p.building_material = bm.id)
WHERE building_material = "Wood"
But this will give me an error
Column 'building_material' in where clause is ambiguous
Also if I want to get property with id of 1.
SELECT p.id, p.address, bm.building_material
FROM properties as p
JOIN building_materials as bm ON (p.building_material = bm.id)
WHERE id = 1
Column 'id' in where clause is ambiguous
I understand that the error means that I have same column name in two tables and I don't specify which id I want like p.id.
Problem is I don't know how many query parametes API user is going to send and I would like to avoid looping through them and changing id to p.id and building_material to bm.building_material. Also I don't want that user has to send request to the API like this
myapi.com/properties?bm.building_material=Wood
I've thought about changing the properties table building_material to fk_building_material and changing properties table id to property_id.
I just don't like the idea that on client side I would then have to refer property's building material as fk_building_material. Is this a valid method to solve this problem or what would be the correct way of designing these tables?
The query mentions two tables, so all the columns in both tables are "on the table" for use anywhere in the query.
In one table building_material is an "id" for linking to the other table; in the other table, it is a string. While this is possible, it is confusing to the reader. And to the parser. To resolve the confusion, you must qualify building_material with which one you want; that is done with a table alias (or table) in front (as you did in all other places).
There are two ids are all ambiguous. But this is the "convention" used by table designers. So, it is OK for an id in one table to be different than the id in the other table. (p.id refers to one thing in one table; bm.id refers to another in another table.)
SELECT p.id, p.address, bm.building_material
FROM properties as p
JOIN building_materials as bm ON (p.building_material = bm.id)
WHERE bm.building_material = "Wood" -- Note "bm."
I'm passing through the following situation and have not found a good solution to this problem. I am going through a optimization of a API so am looking for fastest possible solution.
The following description is not exactly what I am doing, but I think it represents the problem well.
Let's say I have a table of products:
+----+----------+
| id | name |
+----+----------+
| 1 | product1 |
| 2 | product2 |
+----+----------+
And I have a table of attachments to each product, separate by language:
+----+----------+------------+-----------------------+
| id | language | product_id | attachment_url |
+----+----------+------------+-----------------------+
| 1 | bb | 1 | image1_bb.jpg |
| 1 | en | 1 | image1_en.jpg |
| 1 | pt | 1 | image1_pt.jpg |
| 2 | bb | 1 | image2_bb.jpg |
| 2 | pt | 1 | image2_pt.jpg |
+----+----------+------------+-----------------------+
What I intend to do is to get the correct attachment according to the language selected on the request. As you can see above, I can have several attachments to each product. We use Babel (bb) as a generic language, so every time I don't have a attachment to the right language, I should get the babel version. Is also important to consider that the Primary Key of the attachments table is a composite of id + language.
So, supposing I try to get all the data in pt, my first option to create a SQL query was:
SELECT p.id, p.name,
GROUP_CONCAT( '{',a.id,',',a.attachment_url, '}' ) as attachments_list
FROM products p
LEFT JOIN attachments a
ON (a.product_id=p.id AND (a.language='pt' OR a.language='bb'))
The problem is that, with this query I always get the bb data and I only want to get it when there is no attachment on the right language.
I already tried to do a subquery changing attachments for:
(SELECT * FROM attachments GROUP BY id ORDER BY id ASC, language DESC)
but it doubles the time of the request.
I also tried using DISTINCT inside the GROUP_CONCAT, but it only works if the whole result of each row is equal, so it does not work for me.
Does anyone knows any other solution that I can apply directly into the query?
EDIT:
Combining the answers of #Vulcronos and #Barmar made the final solution at least 2x faster than the one I first suggested.
Just to add some context, for anybody else who is looking for it. I am using Phalcon. Because of it, I had a lot of trouble putting the pieces together, as Phalcon PHQL does not support subqueries, nor a lot of the other stuff I had to use.
For my scenario, where I had to deliver approximatelly 1.2MB of JSON content, with more than 2100 objects, using custom queries made the total request time up to 3x faster than Phalcon native relations management methods (hasMany(), hasManyToMany(), etc.) and 10x faster than my original solution (which used a lot the find() method).
Try doing two joins instead of one:
SELECT p.id, p.name,
GROUP_CONCAT( '{',COALESCE(a.id, b.id),',',COALESCE(a.attachment_url, b.attachment_url), '}' ) as attachments_list
FROM products p
LEFT JOIN attachments a
ON (a.product_id=p.id AND a.language='pt')
LEFT JOIN attachments b
ON (a.product_id=p.id AND a.language='bb')
and then using COALESCE to return b instead of a if a doesn't exist. You can also do it with a subselect if the above doesn't work.
OR conditions tend to make queries slow, because it's hard to optimize them with indexes. Try joining separately using the two different languages.
SELECT p.id, p.name,
IFNULL(apt.attachment_url, abb.attachment_url) AS attachment_url
FROM products AS p
JOIN attachments AS abb ON abb.product_id = p.id
LEFT JOIN attachments AS apt ON alang.product_id = p.id AND apt.language = 'pt'
WHERE abb.language = 'bb'
This assumes that all products have a bb attachment, while pt is optional.
I left out the join of Product, because it's not relevant for this problem. It's only needed to include the product name in the resultset.
SELECT a.product_id, a.id, a.attachment_url FROM attachments a
WHERE a.language = ?
OR (a.language = 'bb'
AND NOT EXISTS
(SELECT * FROM attachments
WHERE language = ?
AND id = a.id
AND product_id = a.product_id));
Notes: problems like this usually have many possible solutions. This is not necessarily the most efficient one.
I have a remarks table which can be linked to any number of other items in a system, in the case of this example we'll use bookings, enquiries and referrals.
Thus in the remarks table we have columns
remark_id | datetime | text | booking_id | enquiry_id | referral_id
1 | 2014-06-28 | abc | 0 | 8 | 0
2 | 2014-06-27 | def | 3 | 0 | 0
2 | 2014-05-31 | ghi | 0 | 0 | 10
Etc...
Each of the item tables will have a field called name. Thus when I want to select a remark the likelihood is I'll need this name.
I'd like to achieve this with a single query, getting a 2d array as follows:
['remark_id'=>1, 'datetime'=>'2014-06-28', 'text'=>'abc', 'name'=>'Harold']
However the query I'd expect to use would be
SELECT r.remark_id,r.datetime,r.text
,b.name AS book,rr.name AS referral,e.name AS enquiry
FROM remarks AS r
LEFT JOIN bookings AS b ON b.book_id=r.book_id
LEFT JOIN referrals AS rr ON rr.referral_id=r.referral_id
LEFT JOIN enquiries AS e ON e.enquiry_id=r.enquiry_id
Leaving me with the output
['remark_id'=>1, 'datetime'=>'2014-06-28', 'text'=>'abc', 'book'=>'Harold', 'referral'=>'', 'enquiry'=>'']
And more processing to do before or during rendering it to a view.
Is there a way to write a query such that it would fill a field from the first NOT NULL string it encountered in one of the joined tables?
Please only suggest using a different database system if you know that MySQL doesn't provide any way to do what I'm asking. If it's the case it can't be done there's no business sense in rewriting the system anyway, but I'd like to ask!
Two ways I can think of:
use UNION:
SELECT remark_id, datetime, text, name
FROM remarks
JOIN bookings ON (remarks.book_id = bookings.book_id)
UNION
SELECT remark_id, datetime, text, name
FROM remarks
JOIN referrals ON (remarks.referral_id = referrals.referral_id)
UNION
SELECT remark_id, datetime, text, name
FROM remarks
JOIN enquiries ON (remarks.enquiry_id = enquiries.enquiry_id)</code>
use IFNULL (probably much slower):
SELECT r.remark_id,r.datetime,r.text,
IFNULL(b.name,IFNULL(rr.name,e.name)) AS name
FROM remarks AS r
LEFT JOIN bookings AS b ON b.book_id=r.book_id
LEFT JOIN referrals AS rr ON rr.referral_id=r.referral_id
LEFT JOIN enquiries AS e ON e.enquiry_id=r.enquiry_id</code>
Variant 2 is really much slower because of the LEFT JOINs.
Also, generally I would not recommend using 0 as value for non-existent links, rather use NULL. This will allow MySQL to speed up the join.
one way to achieve this is with nested if statements:
if(b.name is not null, b.name, if(rr.name is not null, rr.name, e.name)) as name
one drawback is that this gives an implicit priority to books? not sure if that would be an issue.
perhaps the main drawback, though, is that this is kind of "magical" and has goofy syntax so it might be more clear to just handle those cases in the controller after all.
Seems quite messy that you have multiple unused columns for each entry, unless I'm not understanding correctly. If you add more tables, you'd have to adjust each of the views so that it would filter out the new table.
I'd be tempted to redesign your structure so that each of the tables has a remarkgroup_id column, then add the following remark table
remark_id, remarkgroup_id, date, message
This would clean up the extra unused columns and allow you to use simple joining logic.
I've got this tag system for tagging blog entries and such. The tags are in one table, containing only a tag name and a primary key. Then I have another table with objects that are using the tags.
It could look something like this:
_________________________________
| tags |
--------------------------------|
| id | name |
|-------------------------------|
| 1 | Scuba diving |
| 2 | Dancing |
---------------------------------
_________________________________
| tag_objects |
--------------------------------|
| id | tag | object |
|-------------------------------|
| 1 | 2 | 13 |
| 2 | 2 | 18 |
| 3 | 1 | 24 |
---------------------------------
Now, what I need to accomplish is to to add a column to the tags table, called "occurrences" or something. For each tag in tags, occurrences should be set to the number of times that tag is used in tag_objects.
So basically something like (obviously pseudo-code):
foreach(tags):
UPDATE tags
SET occurrences = (SELECT COUNT(id)
FROM tag_objects
WHERE tag = tags.id);
When people create new posts and stuff in the future, I'll just have a trigger to update the count, but I have a couple of thousand rows already that I need to count first. I don't know how to do this, so any assistance would be appreciated.
The easiest way to do this, without any extra tables, would be:
First add the extra field:
mysql> alter table tags add occurs int
default 0;
Then just update this new field with the number of occurences.
mysql> update tags left join (select tag,
count(id) as cnt from tag_objects
group by tag) as subq on
tags.id=subq.tag set
occurs=coalesce(subq.cnt,0);
Note the use of the left join to ensure all tags are counted, even the unused ones. The coalesce-function will convert NULL to 0.
You have done a good work, your query must work.
But, this will result in awful performance. I advise you to recreate a table :
CREATE TABLE newTags AS
SELECT t.id, t.name, COUNT(*) AS occurrences
FROM tags t
INNER JOIN tag_objects to
ON to.tag = tags.id
GROUP BY t.id, t.name
This will be very fast.
Unless you really need to denormalize your data, you should stay away from that. Counting on indexed columns is usually very fast. I am a big fan of clean and normalized data ;-)
I would generally not want to store computed values in columns on the database - it's messy, can easily get out of sync, and offends the deities of normalization.
However, if you really must have a database entity with the count, rather than calculating on the fly, I'd create a view (http://dev.mysql.com/doc/refman/5.0/en/create-view.html) which stores the pre-computed value, using the SQL provided by Scorpio
CREATE view tag_occurences AS
SELECT t.id, t.name,
COUNT(*) AS occurrences
FROM tags t
INNER JOIN tag_objects to
ON to.tag = tags.id
GROUP BY t.id, t.name
I think you will gain better performance if you will be incrementing and decrementing the value of occurrences on table tag_objects insert/delete trigger.
Your psuedeo code will work exactly as written (without the foreach loop). At least it would in oracle, I assume MySQL lets you use a correlated subquery as the value too.
For the inserting of new rows you could use a query like:
INSERT INTO tags VALUES(x,y,z,1) ON DUPLICATE KEY UPDATE occurrences = occurrences+1;
I didn't check the syntax, but something like that.