I have two tables that make up a full-text index of article content for search purposes. One table simply associates a primary key with a word, whereas the other records the article the word occurred in and its location in the document. A single word can appear many times in the same document at different locations, so the same word id can occur several times in the word_location table.
Here are the structures:
words:
id bigint
word tinytext
word_location:
id bigint(20)
wordid bigint(20)
location int(11)
article_id int(11)
What I need to write is a query that will find the count of occurrences of each word for any one article. I need to preserve a zero value for wordids that don't appear at all, so I assume this needs to be a left join. However, whenever I add a where clause to limit the results to one article, any wordids that don't appear at all are not included in the result set.
I have tried:
select words.id, COUNT(word_location.wordid) as appears from words left join word_location on words.id = word_location.wordid where article_id = %s GROUP BY words.id
But this query does not return zeros for words that don't appear at all.
How can I modify this left join?
Thanks in advance!
EDIT:
Here is an example data set and the result sets for the different queries.
Example article content:
Bob's Restaurant is one of the finest restaurants in greater
County where you can enjoy the finest Turkish Cuisine.
So the vocabulary table, after the application has excluded stop words, will have rows for Bob, Restaurant, finest, greater, county, enjoy, Turkish, and cuisine. (I'm using this actual article since it's the first in the set, so the ids start from integer 1.)
The query provided by @Mark Bannister produces this result set:
wordid - word - occurrences:
128 clifton 0
1 bob's 2
2 restaurant 2
3 one 1
4 finest 3
5 restaurants 2
6 greater 1
9 county 1
12 enjoy 3
13 turkish 6
14 cuisine 1
The result set is correct per se, but id 128 doesn't appear in the document at all and is the only row in the result set with an occurrence count of 0. The goal is to have the entire vocabulary returned (roughly 2500 different words), each with its number of occurrences in the document.
My original problematic query from before the edit above actually returned the same result set, but without ANY zero-occurrence rows at all.
You need to include your article selection in your join condition:
select words.id, COUNT(word_location.wordid) as appears
from words
left join word_location on words.id = word_location.wordid and word_location.article_id = ?
GROUP BY words.id
Including the restriction on article_id in the WHERE clause effectively turns your left join back into an inner join.
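To see the difference concretely, here is a minimal sketch that runs both variants side by side. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the three words and their locations are made-up sample data, not the asker's real tables.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE word_location (
        id INTEGER PRIMARY KEY, wordid INTEGER, location INTEGER, article_id INTEGER
    );
    INSERT INTO words (id, word) VALUES (1, 'bob'), (2, 'finest'), (3, 'clifton');
    -- 'clifton' only occurs in article 2, never in article 1
    INSERT INTO word_location (wordid, location, article_id) VALUES
        (1, 0, 1), (2, 5, 1), (2, 14, 1), (3, 2, 2);
""")

# Filter in WHERE: rows without a match in article 1 are discarded after the join.
filter_in_where = conn.execute("""
    SELECT words.id, COUNT(word_location.wordid) AS appears
    FROM words
    LEFT JOIN word_location ON words.id = word_location.wordid
    WHERE word_location.article_id = ?
    GROUP BY words.id
    ORDER BY words.id
""", (1,)).fetchall()

# Filter in ON: unmatched words survive with NULLs, which COUNT() reports as 0.
filter_in_on = conn.execute("""
    SELECT words.id, COUNT(word_location.wordid) AS appears
    FROM words
    LEFT JOIN word_location
           ON words.id = word_location.wordid AND word_location.article_id = ?
    GROUP BY words.id
    ORDER BY words.id
""", (1,)).fetchall()

print(filter_in_where)  # [(1, 1), (2, 2)]
print(filter_in_on)     # [(1, 1), (2, 2), (3, 0)]
```

Only the second query keeps word 3 with a count of 0, which is the behaviour the question asks for.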
I would use a subselect instead of a join.
SELECT words.id,
(SELECT count(*) FROM word_location WHERE word_location.wordid = words.id AND word_location.article_id = ?) as appears
FROM words
Bit of a guess this one, but I think COUNT() is just disregarding your nulls, not COUNTing them and arriving at 0. (NULL + NULL != 0)
Look at the IFNULL() function, you might be able to do something like:
COUNT(IFNULL(word_location.wordid, 0))
(Disclaimer - I am more used to Oracle's NVL(, ) function hence this is a little speculative!)
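A quick check of this guess, again sketched with Python's sqlite3 as a stand-in for MySQL and with made-up rows: COUNT(column) does skip NULLs, but for an unmatched word that already yields 0, whereas wrapping the column in IFNULL() makes the NULL row count as 1.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE word_location (wordid INTEGER, article_id INTEGER);
    INSERT INTO words (id, word) VALUES (1, 'bob'), (2, 'clifton');
    -- 'bob' appears twice in article 1, 'clifton' never appears
    INSERT INTO word_location (wordid, article_id) VALUES (1, 1), (1, 1);
""")

counts = conn.execute("""
    SELECT words.id,
           COUNT(word_location.wordid)            AS plain_count,
           COUNT(IFNULL(word_location.wordid, 0)) AS ifnull_count
    FROM words
    LEFT JOIN word_location
           ON words.id = word_location.wordid AND word_location.article_id = 1
    GROUP BY words.id
    ORDER BY words.id
""").fetchall()

print(counts)  # [(1, 2, 2), (2, 0, 1)]
```

So in this sketch the plain COUNT() is the one that returns the desired zero, and the missing rows in the question come from the WHERE clause rather than from how COUNT() treats NULL.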
I got into an argument with a professor today when we ran into the following problem. Say we want to build a movie quiz, where each question is a "Choose from four answers..." type of game. We then build our questions based on information queried from our database. One of the questions reads as follows:
Who directed the movie X...?
We would then query the database from our Movies table, that is described as follows
Field        Type          Null  Key  Default  Extra
id           int(11)       NO    PRI  NULL     auto_increment
title        varchar(100)  NO         NULL
year         int(11)       NO         NULL
director     varchar(100)  NO         NULL
banner_url   varchar(200)  YES        NULL
trailer_url  varchar(200)  YES        NULL
Now, here's where my question lies. I believe we should be able to query the DB once and limit our request to 4 results. From these 4 results, we randomly select one to be the correct answer, while the other 3 are the incorrect answers (NOTE: this would be done offline).
Here was the query I came up with:
SELECT DISTINCT title, director
FROM movies
ORDER BY RAND()
LIMIT 4;
However, my professor argued that the two SQL keywords DISTINCT and LIMIT are NOT safe enough to prevent us from getting possible duplicates. Furthermore, he brought up the edge case of "What if we only had one director in our movies table?" He therefore concluded that we must use two queries: the first to get our correct answer, and the second to get our incorrect answers.
If we could guarantee our table has more than one director, thereby eliminating the edge case my professor presented, wouldn't my query produce successful results every time? I've run the query about 10-20 times, and each run produced exactly the results I wanted. Therefore, I'm struggling to find further evidence to pick the 2-query approach over the 1-query approach.
EDIT - I believe my question may have failed to address the point. The two answers are relying on the movie title being known prior to our query. However, we are not sure what movie will fill the question "Who directed ..?" I was hoping to query the DB for 4 random results, then pick from the 4 random results on the Java side of our code to decide the "correct" answer, insert said movie's title into the question, and produce the 4 possible answers to the question.
I think you need a query like this:
SELECT title, director, CASE WHEN title = :title THEN 1 ELSE 0 END As isAnswer
FROM movies
GROUP BY title, director
ORDER BY
CASE WHEN title = :title THEN 0 ELSE 1 END,
RAND()
LIMIT 4;
And remember that the first row is the answer.
I believe your professor is partially correct on this... Your query in itself may coincidentally work, but that is probably because you have a small sample of movies, so the movie in question happens to land in the result set. Take, for example, 1000 movies and 47 directors, where the director of your chosen movie "X" made only 3 of the 1000 movies in the list. How likely is it that 4 random rows will include that director?
Sha's answer is very close in that it guarantees 4 results while floating the director of movie "X" to the top, but that version has extra stuff that is not applicable. You only want directors' names, not which movies they made. You would then order that result by rand() to ensure the final order is randomized.
select
pq.*
from
( select
m.director,
max( case when m.title = cTitleOfMovieYouWant then 1 else 0 end )
as FinalAnswer
from
movies m
group by
m.director
order by
max( case when m.title = cTitleOfMovieYouWant then 1 else 0 end ) DESC,
RAND()
limit
4 ) pq
order by
RAND()
So, the inner query only cares about a director and a flag saying whether or not they directed the movie in question. The MAX( case/when ) is important: suppose director "Joe" directed 5 movies, only one of which was the movie desired. You would not want Joe to appear once as the valid director and once as not the director. For the 1 matching movie the flag is set to 1, and all of his other movies that are NOT "X" get a flag of 0, so taking the MAX keeps the overall flag for the director at 1.
Now, since there is only one director for a given movie, ordering by the same MAX( case/when ) in DESCENDING order forces this director to the top of the list, with a random order for all the others.
Once that result of 4 records is returned, the outer query orders IT by RAND(), thus changing the final order.
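The grouping trick described above can be tried out with a small sketch. It uses Python's sqlite3 as a stand-in for MySQL (so RANDOM() replaces RAND()), the function name pick_directors and the six sample movies are made up for illustration, and the movie title is passed as a bound parameter.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; the movies are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, director TEXT);
    -- Joe directed two movies, only one of which is the quiz movie 'X'
    INSERT INTO movies (title, director) VALUES
        ('X', 'Joe'), ('Y', 'Joe'), ('Z', 'Ann'),
        ('Q', 'Raj'), ('R', 'Lee'), ('S', 'Kim');
""")

def pick_directors(conn, title):
    """Return 4 (director, is_answer) rows: the right director plus 3 others."""
    return conn.execute("""
        SELECT pq.director, pq.is_answer
        FROM (
            SELECT m.director,
                   MAX(CASE WHEN m.title = ? THEN 1 ELSE 0 END) AS is_answer
            FROM movies m
            GROUP BY m.director               -- one row per director
            ORDER BY is_answer DESC, RANDOM() -- correct director first
            LIMIT 4
        ) pq
        ORDER BY RANDOM()                     -- shuffle the final four
    """, (title,)).fetchall()

choices = pick_directors(conn, "X")
print(choices)
```

Because the grouping is by director, Joe can appear only once, and his flag is 1 even though he also directed 'Y'.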
It gets messy because one director may direct multiple titles.
Try these two steps:
SELECT @correct := director FROM movies WHERE title = :title; -- first get the correct answer
( SELECT DISTINCT director
FROM movies
WHERE director != @correct
ORDER BY RAND() LIMIT 3 ) -- 3 other directors
UNION ALL
( SELECT @correct ) -- and the correct answer
ORDER BY RAND(); -- shuffle the 4 answers
Within the big subquery:
WHERE is done entirely before...
DISTINCT happens before...
ORDER BY happens before...
LIMIT
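The same two-step idea can also be driven from application code. The following is only a sketch: it uses Python's sqlite3 as a stand-in for MySQL (RANDOM() instead of RAND()), the helper name quiz_answers and the sample movies are hypothetical, and the final shuffle happens in Python rather than in SQL.

```python
import random
import sqlite3

# In-memory SQLite database standing in for MySQL; the movies are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, director TEXT);
    INSERT INTO movies (title, director) VALUES
        ('X', 'Joe'), ('Y', 'Joe'), ('Z', 'Ann'),
        ('Q', 'Raj'), ('R', 'Lee'), ('S', 'Kim');
""")

def quiz_answers(conn, title):
    """Return (correct_director, four_shuffled_choices) for the given title."""
    # Step 1: look up the correct answer.
    row = conn.execute(
        "SELECT director FROM movies WHERE title = ?", (title,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no movie titled {title!r}")
    correct = row[0]

    # Step 2: three other directors; WHERE filters before DISTINCT and LIMIT apply.
    wrong = [r[0] for r in conn.execute(
        """SELECT DISTINCT director FROM movies
           WHERE director != ?
           ORDER BY RANDOM() LIMIT 3""",
        (correct,),
    )]

    answers = wrong + [correct]
    random.shuffle(answers)  # shuffle the 4 answers
    return correct, answers

correct, answers = quiz_answers(conn, "X")
print(correct, answers)
```

If the table holds fewer than four distinct directors, this sketch simply returns fewer choices, which is the edge case the professor raised.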
I'm trying to pull a list of IDs from a table Company where the first 6 characters of the ID are the same. Our application creates a company ID by taking the first 3 characters of the company name and the first 3 characters of the city. Because of that, over time we have ended up with company IDs that share the same first 6 characters, followed by a sequential number...
I was thinking of using something like LIKE:
Select companyID, companyName from Company Where
substring(companyID,1,6)+'%' like substring(companyID,1,6)+'%'
Basically I'm trying to get all company IDs where the first 6 characters match. The result set should show just the top company ID (the first one created) and the company name. I'm not expecting a ton of results, so I can then use the IDs returned to find the IDs below them.
I'm thinking it could maybe also be done using HAVING, keeping only the groups where more than one ID shares the same first 6 characters, i.e. HAVING Count(*)>1??
Not really sure what the syntax would be...
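One possible shape for the HAVING idea is sketched below. It runs on Python's sqlite3 rather than SQL Server, so it uses substr() where SQL Server would use SUBSTRING() or LEFT(); the company rows are made up, and taking MIN(companyID) as "the first one created" is an assumption about how the sequential suffixes sort.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Company (companyID TEXT PRIMARY KEY, companyName TEXT);
    INSERT INTO Company (companyID, companyName) VALUES
        ('ACMDAL',  'Acme Freight'),
        ('ACMDAL1', 'Acme Tools'),
        ('ACMDAL2', 'Acme Mills'),
        ('BOBHOU',  'Bob''s Diner');
""")

# Group on the 6-character prefix and keep only prefixes shared by 2+ IDs.
dupes = conn.execute("""
    SELECT substr(companyID, 1, 6) AS prefix,
           MIN(companyID)          AS first_id,  -- assumed to be the oldest ID
           COUNT(*)                AS total
    FROM Company
    GROUP BY substr(companyID, 1, 6)
    HAVING COUNT(*) > 1
    ORDER BY prefix
""").fetchall()

print(dupes)  # [('ACMDAL', 'ACMDAL', 3)]
```

The company name for each first_id could then be fetched by joining this result back to Company on companyID.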
SELECT distinct c1.CompanyID, c1.CompanyName, c2.CompanyID, c2.CompanyName
FROM dbo.Company c1
JOIN dbo.Company c2
ON SUBSTRING(c1.CompanyName,1,6) = SUBSTRING(c2.CompanyName,1,6)
AND c1.CompanyID < c2.CompanyID
order by c1.CompanyName, c2.CompanyName
SELECT c1.CompanyID, c1.CompanyName, c2.CompanyID, c2.CompanyName
FROM dbo.Company c1
INNER JOIN dbo.Company c2
ON SUBSTRING(c1.CompanyName,1,6) + '%' LIKE SUBSTRING(c2.CompanyName,1,6) + '%'
AND c1.CompanyID <> c2.CompanyID
If this is something that you envision doing frequently, I'd add a computed column to the table that has a definition of substring(CompanyName, 1, 6). You can then index it and make this efficient. As it is, it will have to scan all the entries and calculate the substring on the fly. With the computed column, you amortize the substring calculation up front and at least have a chance at an efficient query.
After trying Blam's script, I made a few slight changes and got better results. His script was returning more results than there are rows in the table and it was pretty slow; I think that's because of the company_name column. I got rid of it and wrote it like this:
select distinct c1.cmp_id, count(substring(c2.cmp_id,1,6)) as TotalCount
from company c1
join company c2 on substring(c1.cmp_id,1,6)=substring(c2.cmp_id,1,6)
group by c1.cmp_id
order by c1.cmp_id asc
This still returns all the table records, but at least I can see the total count when the first 6 characters are listed more than once. Also, it ran in only 1 second, so that's a plus. Thanks again for your input guys, always appreciated!
Note: You can find the previous question and its answer here. Deeper testing showed that the previous answer is incorrect: Writing a Complex MySQL Query
I have 3 tables.
Table Words_Learned contains all the words known by a user, and the order in which the words were learned. It has 3 columns: 1) word ID, 2) user ID and 3) the order in which the word was learned.
Table Article contains the articles. It has 3 columns: 1) article ID, 2) unique word count and 3) article contents.
Table Words contains a list of all unique words contained in each article. It has 2 columns: 1) word ID and 2) article ID.
The database diagram is as below.
Now, using this database and using "only" MySQL, I need to do the work below.
Given a user ID, it should get a list of all words known by this user, sorted in the reverse of the order in which they were learned. In other words, the most recently learned words will be at the top of the list.
Let’s say that a query on a user ID shows that they’ve memorized the following 3 words, and we track the order in which they’ve learned the words.
Octopus - 3
Dog - 2
Spoon - 1
First we get a list of all articles containing the word Octopus, and then do the calculation using table Words on just those articles. Calculation means if that article contains more than 10 words that do not appear in the user’s vocabulary list (pulled from table words_learned), then it is excluded from the listing.
Then, we do a query for all records that contain dog, but DO NOT contain “octopus”
Then, we do a query for all records that contain spoon, but DO NOT contain the words Octopus or Dog
We keep repeating this process until we've found 100 records that meet these criteria.
To achieve this process, I did the below (Please visit the SQLFiddle link to see the table structures, test data and my query)
http://sqlfiddle.com/#!2/48dae/1
In my query, you can see the generated results, and they are invalid. With a "proper query", the result should be:
Level 1
Level 1
Level 1
Level 2
Level 2
Level 2
Level 3
Level 3
Here is some pseudocode for better understanding.
Do while articles found < 100
{
for each ($X as known words, in order that those words were learned)
{
Select all articles that contain the word $X, where the 1) article has not been included in any previous loops, and 2)where the count of "unknown" words is less than 10.
Keep these articles in order.
}
}
select * from (
select a.idArticle, a.content, max(`order`) max_order
from words_learned wl
join words w on w.idwords = wl.idwords
join article a on a.idArticle = w.idArticle
where wl.userId = 4
group by a.idArticle
) a
left join (
select count(*) unknown_count, w2.idArticle from words w2
left join words_learned wl2 on wl2.idwords = w2.idwords
and wl2.userId = 4
where wl2.idwords is null
group by w2.idArticle
) unknown_counts on unknown_counts.idArticle = a.idArticle
where unknown_count is null or unknown_count < 10
order by max_order desc
limit 100
http://sqlfiddle.com/#!2/6944b/9
The first derived table selects unique articles a given user knows one or more words from as well as the maximum order value of those words. The maximum order value is used to sort the final results so that articles containing high order words appear first.
The second derived table counts the number of words a given user doesn't know for each article. This table is used to exclude any articles that contain 10 or more words the user doesn't know.
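To make the two derived tables easier to follow, here is a small sketch of the same query shape running on Python's sqlite3 instead of MySQL. The table and column names follow the query above, but the schema is reduced to the columns the query touches, the four articles and their words are made up, and the outer alias is renamed to known for readability.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words_learned (idwords INTEGER, userId INTEGER, "order" INTEGER);
    CREATE TABLE article (idArticle INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE words (idwords INTEGER, idArticle INTEGER);
    INSERT INTO words_learned VALUES (1, 4, 1), (2, 4, 2);  -- user 4 knows words 1, 2
    INSERT INTO article VALUES (1, 'a1'), (2, 'a2'), (3, 'a3'), (4, 'a4');
""")
article_words = (
    [(1, 1)] + [(w, 1) for w in range(100, 112)]  # article 1: 1 known, 12 unknown
    + [(2, 2), (100, 2), (101, 2)]                # article 2: 1 known, 2 unknown
    + [(1, 3)]                                    # article 3: only a known word
    + [(100, 4)]                                  # article 4: no known words
)
conn.executemany("INSERT INTO words VALUES (?, ?)", article_words)

rows = conn.execute("""
    SELECT known.idArticle, known.max_order, unknown_counts.unknown_count
    FROM (
        -- articles containing at least one known word, with the latest-learned order
        SELECT a.idArticle, MAX(wl."order") AS max_order
        FROM words_learned wl
        JOIN words w   ON w.idwords = wl.idwords
        JOIN article a ON a.idArticle = w.idArticle
        WHERE wl.userId = ?
        GROUP BY a.idArticle
    ) known
    LEFT JOIN (
        -- per article, how many words the user has not learned
        SELECT COUNT(*) AS unknown_count, w2.idArticle
        FROM words w2
        LEFT JOIN words_learned wl2
               ON wl2.idwords = w2.idwords AND wl2.userId = ?
        WHERE wl2.idwords IS NULL
        GROUP BY w2.idArticle
    ) unknown_counts ON unknown_counts.idArticle = known.idArticle
    WHERE unknown_counts.unknown_count IS NULL OR unknown_counts.unknown_count < 10
    ORDER BY known.max_order DESC
    LIMIT 100
""", (4, 4)).fetchall()

print(rows)  # [(2, 2, 2), (3, 1, None)]
```

Article 1 is dropped for having 12 unknown words, article 4 never enters the first derived table, and article 2 sorts ahead of article 3 because it contains the more recently learned word.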
I have inherited a database in which a person table has a field called authorised_areas. The front end allows the user to choose multiple entries from a pick list (populated with values from the description field of the area table) and then sets the value of authorised_areas to a comma-delimited list. I am migrating this to a MySQL database and, while I'm at it, I would like to improve the database integrity by removing the authorised_areas field from the person table and creating a many-to-many table person_area which would just hold pairs of person-area keys. There are several hundred person records, so I would like to find a way to do this efficiently using a few MySQL statements, rather than individual insert or update statements.
Just to clarify, the current structure is something like:
person
id name authorised_areas
1 Joe room12, room153, 2nd floor office
2 Anna room12, room17
area
id description
1 room12
2 room17
3 room153
4 2nd floor office
...but what I would like is:
person
id name
1 Joe
2 Anna
area
id description
1 room12
2 room17
3 room153
4 2nd floor office
person_area
person_id area_id
1 1
1 3
1 4
2 1
2 2
There is no reference to the area id in the person table (and some text values in the lists are not exactly the same as the description in the area table), so this would need to be done by text or pattern matching. Would I be better off just writing some PHP code to split the strings, find the matches and insert the appropriate values into the many-to-many table?
I'd be surprised if I were the first person to have to do this, but a Google search didn't turn up anything useful (perhaps I didn't use the appropriate search terms?). If anyone could offer some suggestions for a way to do this efficiently, I would very much appreciate it.
While it is possible to do this in SQL, I would suggest that as a one-off job it would probably be quicker to knock up a PHP script (or one in your favourite scripting language) to do it with multiple inserts.
If you must do it in a single statement then have a table of integers (0 to 9, cross join against itself to get as big a range as you need) and join this against your original table, using string functions to get the Xth comma and from that each of the values for each row.
Possible, and I have done it but mainly to show that having a delimited field is not a good idea. It would likely be FAR quicker to knock up a script with multiple inserts.
You could base an insert on something like this SELECT (although this also comes up with a blank line for each person as well as the relevant ones, and will only cope with up to 1000 authorised areas per person)
SELECT z.id, z.name, x.an_authorised_area
FROM person z
LEFT OUTER JOIN (
SELECT DISTINCT a.id, SUBSTRING_INDEX( SUBSTRING_INDEX( authorised_areas, ",", b.ournumbers ) , ",", -1 ) AS an_authorised_area
FROM person a, (
SELECT hundreds.i *100 + tens.i *10 + units.i AS ournumbers
FROM integers AS hundreds
CROSS JOIN integers AS tens
CROSS JOIN integers AS units
)b
)x ON z.id = x.id
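For comparison, the scripted route mentioned above might look roughly like this. It is a sketch in Python with sqlite3 rather than PHP with MySQL, it uses the sample rows from the question, and the case-insensitive, whitespace-trimmed matching is an assumption about how loosely the stored text should match area descriptions; anything that still fails to match is collected for manual review instead of being guessed.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL, loaded with the question's sample.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, authorised_areas TEXT);
    CREATE TABLE area (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE person_area (
        person_id INTEGER, area_id INTEGER, PRIMARY KEY (person_id, area_id)
    );
    INSERT INTO person VALUES
        (1, 'Joe',  'room12, room153, 2nd floor office'),
        (2, 'Anna', 'room12, room17');
    INSERT INTO area VALUES
        (1, 'room12'), (2, 'room17'), (3, 'room153'), (4, '2nd floor office');
""")

# Lookup from normalised description to area id (normalisation is an assumption).
area_ids = {
    description.strip().lower(): area_id
    for area_id, description in conn.execute("SELECT id, description FROM area")
}

unmatched = []  # (person_id, raw text) pairs that need a human decision
people = conn.execute("SELECT id, authorised_areas FROM person").fetchall()
for person_id, areas in people:
    for raw in (areas or "").split(","):
        key = raw.strip().lower()
        if not key:
            continue  # skip empty fragments such as a trailing comma
        area_id = area_ids.get(key)
        if area_id is None:
            unmatched.append((person_id, raw.strip()))
        else:
            conn.execute(
                "INSERT OR IGNORE INTO person_area (person_id, area_id) VALUES (?, ?)",
                (person_id, area_id),
            )
conn.commit()

person_area = conn.execute(
    "SELECT person_id, area_id FROM person_area ORDER BY person_id, area_id"
).fetchall()
print(person_area)  # [(1, 1), (1, 3), (1, 4), (2, 1), (2, 2)]
print(unmatched)    # []
```

Once person_area looks right and unmatched is empty, the authorised_areas column could be dropped from person.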
I don't think this is a duplicate posting because I've looked around and this seems a bit more specific than what's already been asked (but I could be wrong).
I have 4 tables, and one of them is just a lookup table:
SELECT exercises.id as exid, name, sets, reps, type, movement, categories.id
FROM exercises
INNER JOIN exercisecategory ON exercises.id = exerciseid
INNER JOIN categories ON categoryid = categories.id
INNER JOIN workoutcategory ON workoutid = workoutcategory.id
WHERE (workoutcategory.id = '$workouttypeid')
AND rand_id > UNIX_TIMESTAMP()
ORDER BY rand_id ASC LIMIT 6;
exercises table contains a list of exercise names, sets, reps, and an id
categories table contains an id, musclegroup, and type of movement
workoutcategory table contains an id, and a more specific motion (ie: upper body push, or upper body pull)
exercisecategory table is the lookup table that contains (and matches the id's) for exerciseid, categoryid, and workoutid
I've also added a column, rand_id, to the exercises table that gets a random number when the row is entered in the database. This number is then updated only for the specified category when it is called, and the results are sorted in ascending order so the top 6 listings are displayed. This generates a nice random selection for me. (Found that solution elsewhere here on SO.)
This works fine for generating 6 random exercises from a specific top level category. But I'd like to drill down further. Here's an example...
select all rows inside categoryid 4
then, still within the category 4 results, find all that have movementid 2, and then find one entry with typeid 1, then another with typeid 2, etc.
TL;DR: Basically there are a few levels of categories and I'm looking to select a few from here and a few from there, all within this top level. I'm thinking this could be executed with more than one query, but I'm not sure how... In the end I'm looking to end up with one array of the randomized entries.
Sorry for the long read, it's the best explanation I've got.
Just realized I never came back to this posting...
I ended up using several MySQL queries within a switch, based on what is needed during the request. Worked out perfectly.
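A rough idea of what "several queries" can look like is sketched below. It is not the code that was actually used: it runs on Python's sqlite3 instead of MySQL, it flattens the lookup tables into hypothetical categoryid, movementid and typeid columns on exercises, and the helper name pick_one_per_type and the sample rows are made up.

```python
import sqlite3

# In-memory SQLite database; the flattened schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exercises (
        id INTEGER PRIMARY KEY, name TEXT,
        categoryid INTEGER, movementid INTEGER, typeid INTEGER
    );
    INSERT INTO exercises (id, name, categoryid, movementid, typeid) VALUES
        (1, 'bench press',    4, 2, 1),
        (2, 'push-up',        4, 2, 1),
        (3, 'overhead press', 4, 2, 2),
        (4, 'dip',            4, 2, 2),
        (5, 'row',            4, 3, 1),
        (6, 'squat',          5, 2, 1);
""")

def pick_one_per_type(conn, category_id, movement_id, type_ids):
    """Run one small query per type id and collect one random row from each."""
    picks = []
    for type_id in type_ids:
        row = conn.execute(
            """SELECT id, name, typeid FROM exercises
               WHERE categoryid = ? AND movementid = ? AND typeid = ?
               ORDER BY RANDOM() LIMIT 1""",
            (category_id, movement_id, type_id),
        ).fetchone()
        if row is not None:  # a type with no exercises is skipped
            picks.append(row)
    return picks

picks = pick_one_per_type(conn, 4, 2, [1, 2, 3])
print(picks)
```

Each query stays inside category 4 and movement 2, so the combined list is one array of randomized entries with at most one exercise per type.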