Counting instances in table - mysql

I've got this tag system for tagging blog entries and such. The tags are in one table, containing only a tag name and a primary key. Then I have another table with objects that are using the tags.
It could look something like this:
tags
+----+--------------+
| id | name         |
+----+--------------+
| 1  | Scuba diving |
| 2  | Dancing      |
+----+--------------+
tag_objects
+----+-----+--------+
| id | tag | object |
+----+-----+--------+
| 1  | 2   | 13     |
| 2  | 2   | 18     |
| 3  | 1   | 24     |
+----+-----+--------+
Now, what I need to accomplish is to add a column to the tags table, called "occurrences" or something. For each tag in tags, occurrences should be set to the number of times that tag is used in tag_objects.
So basically something like (obviously pseudo-code):
foreach(tags):
UPDATE tags
SET occurrences = (SELECT COUNT(id)
FROM tag_objects
WHERE tag = tags.id);
When people create new posts and stuff in the future, I'll just have a trigger to update the count, but I have a couple of thousand rows already that I need to count first. I don't know how to do this, so any assistance would be appreciated.

The easiest way to do this, without any extra tables, would be:
First add the extra field:
mysql> alter table tags add occurs int default 0;
Then just update this new field with the number of occurrences.
mysql> update tags
       left join (select tag, count(id) as cnt
                  from tag_objects
                  group by tag) as subq
         on tags.id = subq.tag
       set occurs = coalesce(subq.cnt, 0);
Note the use of the left join to ensure all tags are counted, even the unused ones. The coalesce-function will convert NULL to 0.

You have done good work, and your query should work.
But it will result in awful performance. I advise you to recreate the table instead:
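CREATE TABLE newTags AS
SELECT t.id, t.name, COUNT(*) AS occurrences
FROM tags t
INNER JOIN tag_objects tobj -- "to" is a reserved word in MySQL, so use another alias
ON tobj.tag = t.id -- tags is aliased as t, so it must be referenced as t
GROUP BY t.id, t.name;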
This will be very fast.

Unless you really need to denormalize your data, you should stay away from that. Counting on indexed columns is usually very fast. I am a big fan of clean and normalized data ;-)

I would generally not want to store computed values in columns on the database - it's messy, can easily get out of sync, and offends the deities of normalization.
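However, if you really must have a database entity with the count, rather than calculating it in each query yourself, I'd create a view (http://dev.mysql.com/doc/refman/5.0/en/create-view.html) that exposes the count, using the SQL provided by Scorpio. Note that a MySQL view does not store the value; the count is computed whenever the view is queried, so it cannot get out of sync: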
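CREATE VIEW tag_occurences AS
SELECT t.id, t.name,
COUNT(*) AS occurrences
FROM tags t
INNER JOIN tag_objects tobj -- "to" is a reserved word in MySQL, so use another alias
ON tobj.tag = t.id -- tags is aliased as t, so it must be referenced as t
GROUP BY t.id, t.name;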

I think you will get better performance by incrementing and decrementing the value of occurrences in insert/delete triggers on the tag_objects table.
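The trigger idea above could be sketched roughly as follows. This is only an illustration, not a tested MySQL script: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the trigger names (tag_objects_ai, tag_objects_ad) are made up. MySQL's CREATE TRIGGER additionally requires FOR EACH ROW, so the statements would need adapting.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the trigger bodies are the point here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT, occurrences INTEGER DEFAULT 0);
CREATE TABLE tag_objects (id INTEGER PRIMARY KEY, tag INTEGER, object INTEGER);

-- Hypothetical trigger name: bump the counter when a tag is attached.
CREATE TRIGGER tag_objects_ai AFTER INSERT ON tag_objects
BEGIN
    UPDATE tags SET occurrences = occurrences + 1 WHERE id = NEW.tag;
END;

-- Hypothetical trigger name: decrement the counter when a tag is detached.
CREATE TRIGGER tag_objects_ad AFTER DELETE ON tag_objects
BEGIN
    UPDATE tags SET occurrences = occurrences - 1 WHERE id = OLD.tag;
END;

INSERT INTO tags (id, name) VALUES (1, 'Scuba diving'), (2, 'Dancing');
INSERT INTO tag_objects (tag, object) VALUES (2, 13), (2, 18), (1, 24);
DELETE FROM tag_objects WHERE object = 18;
""")

# Read back the counters that the triggers maintained.
trigger_counts = dict(conn.execute("SELECT name, occurrences FROM tags"))
print(trigger_counts)
```

Note that triggers only keep the counter correct from the moment they are created; rows that already exist still need a one-off count.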

Your pseudo-code will work exactly as written (without the foreach loop). At least it would in Oracle; I assume MySQL lets you use a correlated subquery as the value too.
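As a quick check of that claim, here is a minimal sketch that runs the question's UPDATE without the loop. It uses Python's sqlite3 module as a stand-in for MySQL (an assumption; the statement itself is plain SQL), and the extra 'Unused' tag is invented to show that tags without objects end up at 0.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL for this single-statement backfill.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT, occurrences INTEGER DEFAULT 0);
CREATE TABLE tag_objects (id INTEGER PRIMARY KEY, tag INTEGER, object INTEGER);
-- 'Unused' is a hypothetical tag with no rows in tag_objects.
INSERT INTO tags (id, name) VALUES (1, 'Scuba diving'), (2, 'Dancing'), (3, 'Unused');
INSERT INTO tag_objects (id, tag, object) VALUES (1, 2, 13), (2, 2, 18), (3, 1, 24);
""")

# The correlated subquery is evaluated once per row of tags.
conn.execute("""
UPDATE tags
SET occurrences = (SELECT COUNT(id) FROM tag_objects WHERE tag = tags.id)
""")

backfilled = dict(conn.execute("SELECT name, occurrences FROM tags"))
print(backfilled)
```

Because COUNT returns 0 rather than NULL when no rows match, no COALESCE is needed with this form.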

For the inserting of new rows you could use a query like:
INSERT INTO tags VALUES(x,y,z,1) ON DUPLICATE KEY UPDATE occurrences = occurrences+1;
I didn't check the syntax, but something like that.

How do you batch SELECT statements when you can't rely on the IDs to be in literal order?

What I mean by literal order is that, although the IDs are auto-increment, business logic can leave gaps, so 8 might come right after 4 when 5 should have been there. That is to say, when a row is deleted, the IDs are not re-indexed.
This is how my rows look (table name is wp_posts):
+-----+-------------+----+--+--+--+
| ID | post_author | .. | | | |
+-----+-------------+----+--+--+--+
| 4 | .. | | | | |
+-----+-------------+----+--+--+--+
| 8 | .. | | | | |
+-----+-------------+----+--+--+--+
| 124 | .. | | | | |
+-----+-------------+----+--+--+--+
| 672 | .. | | | | |
+-----+-------------+----+--+--+--+
| 673 | .. | | | | |
+-----+-------------+----+--+--+--+
| 674 | .. | | | | |
+-----+-------------+----+--+--+--+
ID is an int that has the auto-increment characteristic, but when a post is deleted, there is no re-assignment of IDs. It will just simply get deleted and because it's auto-increment, you can still assume that, vertically, the items that come after the one you're looking at are always bigger than the ones before.
I'm querying for ID: SELECT ID FROM wp_posts to get a list of all the IDs I need. Now, it just so happens that I need to batch all of this, using AJAX requests because once I retrieve the IDs, I need to operate on them.
Thing is, I don't really understand how to pass my data back to AJAX. What LIMIT does is, if I provide 2 arguments, such as: SELECT ID FROM wp_posts LIMIT 1,3, it'll return back 4,8,124 because it looks at row number. But what do I do on the next call? Yes, the first call always starts with 1, but once I need to launch the second AJAX request to perform yet another SELECT, how do I know where I should start? In my case, I'd want to start again at 4, so, my second query would be SELECT ID FROM wp_posts LIMIT 4, 7 and so on.
Do I really need to send that counter (even if I can automate it, since, you see, it's an increment of 3) back?
Is there no way for SQL to handle this automatically?
You have many confusions in your question. Let me try to clear up some basic ones.
First, the auto-incremented key is the primary key for the table. You do not need to worry about gaps. In fact, the key should basically be meaningless. It fulfills the following:
It is guaranteed to be unique.
It is guaranteed to be in insertion order.
Gaps are allowed and of no concern. There is no re-indexing. Re-indexing is a bad idea because:
Primary keys uniquely identify each row and this mapping should be consistent across time.
Primary keys are used in other tables to refer to values, so re-indexing would either invalidate those relationships or require massive changes to many tables.
Re-indexing presupposes that the value means something, when it doesn't.
Second, a query such as:
SELECT ID
FROM wp_posts
LIMIT 1, 3;
This can return any three rows. Why? Because you have not specified an ORDER BY, and SQL result sets without ORDER BY are unordered. There are no guarantees. So you should always be in the habit of using an ORDER BY.
Third, if you want to essentially "page" through results, then use the OFFSET feature in LIMIT (as you have above):
SELECT ID
FROM wp_posts
ORDER BY ID
LIMIT @offset, 3;
This will allow you to reset the @offset value and go to whichever rows you want.
First query:
SELECT ID FROM wp_posts ORDER BY ID LIMIT 3
This returns 4,8,124 as you said. In your client, save the largest ID value in a variable.
Subsequent queries:
SELECT ID FROM wp_posts WHERE ID > ? ORDER BY ID LIMIT 3
Send a parameter into this query using the greatest ID value from the previous result. It's still in a variable.
This also helps make the query faster, because it doesn't have to skip all those initial rows every time. Paging through a large dataset using LIMIT/OFFSET is pretty inefficient. SQL has to actually read all those rows even though it's not going to return them.
But if you use WHERE ID > ? then SQL can efficiently start the scan in the right place, on the first row that would be included in the result.
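The WHERE ID > ? approach above can be sketched end to end like this. It is a minimal illustration, assuming Python's sqlite3 module as a stand-in for MySQL; fetch_batch and the starting value 0 are hypothetical names and choices, and in the real setup last_id would travel with each AJAX request.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the IDs are the ones from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_author TEXT)")
conn.executemany("INSERT INTO wp_posts (ID) VALUES (?)",
                 [(i,) for i in (4, 8, 124, 672, 673, 674)])


def fetch_batch(last_id, size=3):
    """Return the next `size` IDs greater than last_id, in ascending order."""
    rows = conn.execute(
        "SELECT ID FROM wp_posts WHERE ID > ? ORDER BY ID LIMIT ?",
        (last_id, size),
    )
    return [row[0] for row in rows]


batches = []
last_id = 0  # assumes IDs are positive, so 0 sorts before every row
while True:
    batch = fetch_batch(last_id)
    if not batch:
        break
    batches.append(batch)
    last_id = batch[-1]  # the client would send this value with its next request

print(batches)
```

The only state the client has to carry is the last ID it saw, and gaps in the ID sequence do not matter.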
It seems you want to return the first three rows of your query, ordered by the ID values that currently exist (whatever they are after all DML statements have been applied to the table wp_posts).
Then consider using an auxiliary iteration variable @i to generate a gap-free sequence of integers starting from 1 (1, 2, 3, ...):
select t.*
from
(
select @i := @i + 1 as rownum, t1.*
from tab t1
join (select @i := 0) t2
) t
order by rownum
limit 0,3;

ER_NON_UNIQ_ERROR and how to design tables correctly

I have come across this problem and have been trying to solve it for a few days now.
Let's say I have following tables
properties
-----------------------------------------
| id | address | building_material |
-----------------------------------------
| 1 | Street 1 | 1 |
-----------------------------------------
| 2 | Street 2 | 2 |
-----------------------------------------
building_materials
-----------------------------
| id | building_material |
-----------------------------
| 1 | Wood |
-----------------------------
| 2 | Stone |
-----------------------------
Now. I would like to provide an API where you could send a request and ask for every property that has building material of wood. Like this:
myapi.com/properties?building_material=Wood
So I would like to query database like this (I want to return the string value of building_material not the numeric value):
SELECT p.id, p.address, bm.building_material
FROM properties as p
JOIN building_materials as bm ON (p.building_material = bm.id)
WHERE building_material = "Wood"
But this will give me an error
Column 'building_material' in where clause is ambiguous
Also if I want to get property with id of 1.
SELECT p.id, p.address, bm.building_material
FROM properties as p
JOIN building_materials as bm ON (p.building_material = bm.id)
WHERE id = 1
Column 'id' in where clause is ambiguous
I understand that the error means that I have same column name in two tables and I don't specify which id I want like p.id.
The problem is that I don't know how many query parameters the API user is going to send, and I would like to avoid looping through them and changing id to p.id and building_material to bm.building_material. I also don't want the user to have to send a request to the API like this:
myapi.com/properties?bm.building_material=Wood
I've thought about changing the properties table building_material to fk_building_material and changing properties table id to property_id.
I just don't like the idea that on client side I would then have to refer property's building material as fk_building_material. Is this a valid method to solve this problem or what would be the correct way of designing these tables?
The query mentions two tables, so all the columns in both tables are "on the table" for use anywhere in the query.
In one table building_material is an "id" for linking to the other table; in the other table, it is a string. While this is possible, it is confusing to the reader. And to the parser. To resolve the confusion, you must qualify building_material with which one you want; that is done with a table alias (or table) in front (as you did in all other places).
The two id columns are also ambiguous. Naming the key id in every table is a common convention among table designers, so it is OK for the id in one table to mean something different from the id in the other table. (p.id refers to one thing in one table; bm.id refers to another thing in another table.)
SELECT p.id, p.address, bm.building_material
FROM properties as p
JOIN building_materials as bm ON (p.building_material = bm.id)
WHERE bm.building_material = "Wood" -- Note "bm."
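To see the difference between the unqualified and the qualified column, here is a small sketch using the question's two tables. It assumes Python's sqlite3 module as a stand-in for MySQL; SQLite rejects the unqualified name with its own "ambiguous column name" error, which is analogous to (but not the same message as) MySQL's ER_NON_UNIQ_ERROR.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table contents come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE building_materials (id INTEGER PRIMARY KEY, building_material TEXT);
CREATE TABLE properties (id INTEGER PRIMARY KEY, address TEXT, building_material INTEGER);
INSERT INTO building_materials VALUES (1, 'Wood'), (2, 'Stone');
INSERT INTO properties VALUES (1, 'Street 1', 1), (2, 'Street 2', 2);
""")

base = """
SELECT p.id, p.address, bm.building_material
FROM properties AS p
JOIN building_materials AS bm ON p.building_material = bm.id
"""

# Unqualified column: the parser cannot tell which table is meant.
try:
    conn.execute(base + "WHERE building_material = ?", ("Wood",))
    unqualified_error = None
except sqlite3.OperationalError as exc:
    unqualified_error = str(exc)

# Qualified with the alias: unambiguous.
wood_rows = conn.execute(
    base + "WHERE bm.building_material = ?", ("Wood",)
).fetchall()

print(unqualified_error)
print(wood_rows)
```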

2 inner joins between same 2 tables

I am trying to select columns from 2 tables,
The INNER JOIN conditions are $table1.idaction_url=$table2.idaction AND $table1.idaction_name=$table2.idaction.
However, the query below produces no output. It seems like the INNER JOIN can only take one condition: if I use AND to include both conditions, as shown in the query below, there won't be any output. Please look at the picture below. Please advise.
$mysql=("SELECT conv(hex($table1.idvisitor), 16, 10) as visitorId,
$table1.server_time, $table1.idaction_url,
$table1.time_spent_ref_action,$table2.name,
$table2.type, $table1.idaction_name, $table2.idaction
FROM $table1
INNER JOIN $table2
ON $table1.idaction_url=$table2.idaction
AND $table1.idaction_name=$table2.idaction
WHERE conv(hex(idvisitor), 16, 10)='".$id."'
ORDER BY server_time DESC");
Short answer:
You need to use two separate inner joins, not only a single join.
E.g.
SELECT `actionurls`.`name` AS `actionUrl`, `actionnames`.`name` AS `actionName`
FROM `table1`
INNER JOIN `table2` AS `actionurls` ON `table1`.`idaction_url` = `actionurls`.`idaction`
INNER JOIN `table2` AS `actionnames` ON `table1`.`idaction_name` = `actionnames`.`idaction`
(Modify this query with any additional fields you want to select).
In depth: INNER JOIN, when done on a value unique to the second table (the table joined to the first in this operation) will only ever fetch one row. What you want to do is fetch data from the other table twice, into the same row, reading the select part of the statement.
INNER JOIN table2 ON [comparison] will, for each row selected from table1, grab any rows from table2 for which [comparison] is TRUE, then copy the row from table1 N times, where N is the number of rows found in table2. If N = 0, then the row is skipped. In our case N = 1, so an INNER JOIN of idaction_name in table1 to idaction in table2, for example, will allow you to select all the action names.
In order to get the action urls as well we have to INNER JOIN a second time. Now you can't join the same table twice normally, as SQL won't know which of the two joined tables is meant when you type table2.name in the first part of your query. This would be ambiguous if both had the same name. There's a solution for this, table aliases.
The output (of my answer above) is going to be something like:
+-----+------------------------+-------------------------+
| Row | actionUrl | actionName |
+-----+------------------------+-------------------------+
| 1 | unx.co.jp/ | UNIX | Kumamoto Home |
| 2 | unx.co.jp/profile.html | UNIX | Kumamoto Profile |
| ... | ... | ... |
+-----+------------------------+-------------------------+
While if you used only a single join, you would get this kind of output (using OR):
+-----+-------------------------+
| Row | actionUrl |
+-----+-------------------------+
| 1 | unx.co.jp/ |
| 2 | UNIX | Kumamoto Home |
| 3 | unx.co.jp/profile.html |
| 4 | UNIX | Kumamoto Profile |
| ... | ... |
+-----+-------------------------+
Using AND and a single join, you only get output if idaction_name == idaction_url is TRUE. This is not the case, so there's no output.
If you want to know more about how to use JOINS, consult the manual about them.
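The behaviour described above can be reproduced with a few rows. This is a minimal sketch, assuming Python's sqlite3 module as a stand-in for MySQL; the idaction numbers and type values are invented sample data, and only the names are taken from the example output.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; ids and types are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table2 (idaction INTEGER PRIMARY KEY, name TEXT, type INTEGER);
CREATE TABLE table1 (id INTEGER PRIMARY KEY, idaction_url INTEGER, idaction_name INTEGER);
INSERT INTO table2 VALUES
    (1, 'unx.co.jp/', 1),
    (2, 'UNIX | Kumamoto Home', 4),
    (3, 'unx.co.jp/profile.html', 1),
    (4, 'UNIX | Kumamoto Profile', 4);
INSERT INTO table1 VALUES (1, 1, 2), (2, 3, 4);
""")

# One join with AND: a table2 row would have to match both ids at once.
single_join = conn.execute("""
SELECT table2.name
FROM table1
INNER JOIN table2
    ON table1.idaction_url = table2.idaction
   AND table1.idaction_name = table2.idaction
""").fetchall()

# Two joins to the same table under different aliases.
double_join = conn.execute("""
SELECT actionurls.name, actionnames.name
FROM table1
INNER JOIN table2 AS actionurls  ON table1.idaction_url  = actionurls.idaction
INNER JOIN table2 AS actionnames ON table1.idaction_name = actionnames.idaction
ORDER BY table1.id
""").fetchall()

print(single_join)
print(double_join)
```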
Sidenote
Also, I can't help but notice you're using variables (e.g. $table1) that store the names of the tables. Do you make sure that those values do not contain user input? And, if they do, do you at least whitelist a list of tables that users can access? You may have some security issues with this.
INNER JOIN does not put any restriction on number of conditions it can have.
Zero resulting rows means that there are no rows satisfying both conditions simultaneously.
Make sure you are joining on the correct columns. Try going step by step to identify where the data is lost.

MySQL queries, selecting field from one of many databases

I have a remarks table which can be linked to any number of other items in a system, in the case of this example we'll use bookings, enquiries and referrals.
Thus in the remarks table we have columns
remark_id | datetime | text | booking_id | enquiry_id | referral_id
1 | 2014-06-28 | abc | 0 | 8 | 0
2 | 2014-06-27 | def | 3 | 0 | 0
2 | 2014-05-31 | ghi | 0 | 0 | 10
Etc...
Each of the item tables will have a field called name. Thus when I want to select a remark the likelihood is I'll need this name.
I'd like to achieve this with a single query, getting a 2d array as follows:
['remark_id'=>1, 'datetime'=>'2014-06-28', 'text'=>'abc', 'name'=>'Harold']
However the query I'd expect to use would be
SELECT r.remark_id,r.datetime,r.text
,b.name AS book,rr.name AS referral,e.name AS enquiry
FROM remarks AS r
LEFT JOIN bookings AS b ON b.book_id=r.book_id
LEFT JOIN referrals AS rr ON rr.referral_id=r.referral_id
LEFT JOIN enquiries AS e ON e.enquiry_id=r.enquiry_id
Leaving me with the output
['remark_id'=>1, 'datetime'=>'2014-06-28', 'text'=>'abc', 'book'=>'Harold', 'referral'=>'', 'enquiry'=>'']
And more processing to do before or during rendering it to a view.
Is there a way to write a query such that it would fill a field from the first NOT NULL string it encountered in one of the joined tables?
Please only suggest using a different database system if you know that MySQL doesn't provide any way to do what I'm asking. If it's the case it can't be done there's no business sense in rewriting the system anyway, but I'd like to ask!
Two ways I can think of:
use UNION:
SELECT remark_id, datetime, text, name
FROM remarks
JOIN bookings ON (remarks.book_id = bookings.book_id)
UNION
SELECT remark_id, datetime, text, name
FROM remarks
JOIN referrals ON (remarks.referral_id = referrals.referral_id)
UNION
SELECT remark_id, datetime, text, name
FROM remarks
JOIN enquiries ON (remarks.enquiry_id = enquiries.enquiry_id)
use IFNULL (probably much slower):
SELECT r.remark_id,r.datetime,r.text,
IFNULL(b.name,IFNULL(rr.name,e.name)) AS name
FROM remarks AS r
LEFT JOIN bookings AS b ON b.book_id=r.book_id
LEFT JOIN referrals AS rr ON rr.referral_id=r.referral_id
LEFT JOIN enquiries AS e ON e.enquiry_id=r.enquiry_id
Variant 2 is really much slower because of the LEFT JOINs.
Also, generally I would not recommend using 0 as value for non-existent links, rather use NULL. This will allow MySQL to speed up the join.
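For illustration, the second variant could look like the sketch below, with NULL used for missing links as suggested. It assumes Python's sqlite3 module as a stand-in for MySQL, writes the nested IFNULL calls as the equivalent COALESCE(b.name, rr.name, e.name), and uses booking_id as the join column (the question's table header) rather than book_id. The names 'Maude' and 'Edith' are invented sample data.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; COALESCE exists in both.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bookings  (booking_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE enquiries (enquiry_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE referrals (referral_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE remarks (
    remark_id INTEGER PRIMARY KEY, datetime TEXT, text TEXT,
    booking_id INTEGER, enquiry_id INTEGER, referral_id INTEGER
);
INSERT INTO bookings  VALUES (3, 'Harold');
INSERT INTO enquiries VALUES (8, 'Maude');   -- hypothetical name
INSERT INTO referrals VALUES (10, 'Edith');  -- hypothetical name
-- NULL (not 0) marks "no link", so the LEFT JOINs simply find no match.
INSERT INTO remarks VALUES
    (1, '2014-06-28', 'abc', NULL, 8, NULL),
    (2, '2014-06-27', 'def', 3, NULL, NULL),
    (3, '2014-05-31', 'ghi', NULL, NULL, 10);
""")

# COALESCE returns the first non-NULL argument, left to right.
named_remarks = conn.execute("""
SELECT r.remark_id, r.text, COALESCE(b.name, rr.name, e.name) AS name
FROM remarks AS r
LEFT JOIN bookings  AS b  ON b.booking_id   = r.booking_id
LEFT JOIN referrals AS rr ON rr.referral_id = r.referral_id
LEFT JOIN enquiries AS e  ON e.enquiry_id   = r.enquiry_id
ORDER BY r.remark_id
""").fetchall()

print(named_remarks)
```

The argument order fixes the priority: a remark linked to both a booking and an enquiry would report the booking's name.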
One way to achieve this is with nested IF expressions:
if(b.name is not null, b.name, if(rr.name is not null, rr.name, e.name)) as name
One drawback is that this gives an implicit priority to bookings; I'm not sure whether that would be an issue.
Perhaps the main drawback, though, is that this is kind of "magical" and has goofy syntax, so it might be clearer to just handle those cases in the controller after all.
Seems quite messy that you have multiple unused columns for each entry, unless I'm not understanding correctly. If you add more tables, you'd have to adjust each of the views so that it would filter out the new table.
I'd be tempted to redesign your structure so that each of the tables has a remarkgroup_id column, then add the following remark table
remark_id, remarkgroup_id, date, message
This would clean up the extra unused columns and allow you to use simple joining logic.

Store multiple values in a single cell instead of in different rows

Is there a way I can store multiple values in a single cell instead of different rows, and search for them?
Can I do:
pId | available
1 | US,UK,CA,SE
2 | US,SE
Instead of:
pId | available
1 | US
1 | UK
1 | CA
1 | SE
Then do:
select pId from table where available = 'US'
You can do that, but it makes the query inefficient. You can look for a substring in the field, but that means that the query can't make use of any index, which is a big performance issue when you have many rows in your table.
This is how you would use it in your special case with two character codes:
select pId from table where find_in_set('US', available)
Keeping the values in separate records makes every operation where you use the values, like filtering and joining, more efficient.
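A minimal sketch of the separate-records layout follows, assuming Python's sqlite3 module as a stand-in for MySQL. The table name product_availability and the index name idx_available are hypothetical; the rows are the ones from the question.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; names below are made up for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per (product, country) pair instead of a comma-separated list.
CREATE TABLE product_availability (
    pId INTEGER NOT NULL,
    available TEXT NOT NULL,
    PRIMARY KEY (pId, available)
);
-- An index on the country code lets equality lookups avoid a full scan.
CREATE INDEX idx_available ON product_availability (available);
INSERT INTO product_availability VALUES
    (1, 'US'), (1, 'UK'), (1, 'CA'), (1, 'SE'),
    (2, 'US'), (2, 'SE');
""")


def products_available_in(country):
    """Return the product ids available in the given country, ascending."""
    rows = conn.execute(
        "SELECT pId FROM product_availability WHERE available = ? ORDER BY pId",
        (country,),
    )
    return [row[0] for row in rows]


us_products = products_available_in("US")
ca_products = products_available_in("CA")
print(us_products, ca_products)
```

With this layout the original query from the question, where available = 'US', works as written.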
You can use the LIKE operator to get the result:
Select pid from table where available like '%US%'