MySQL Sort Alphabetically but Ignore "The" - mysql

I have a MySQL database that has a table with book data in it. One of the columns in the table is called "title". Some of the titles begin with the word "the" and some do not.
Example:
"The Book Title One"
"Book Title Two"
"Book Title Three"
I need to pull these out of the database in alphabetical order, but I need to ignore the "the" at the beginning of the titles that start with it.
Does SQL (specifically MySQL) provide a way to do this in the query?

Use a CASE WHEN expression to check whether the column value starts with 'The ' and, if it does, return the title without it. This gives you a new column that you can then use for the sort order.
select title, case when title like 'The %' then trim(substr(title from 4)) else title end as title2 from tablename order by title2;
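A minimal sketch of the same derived-column idea, driven from Python against an in-memory SQLite database. SQLite stands in for MySQL here only so the example is self-contained (an assumption, as are the `books` table name and the `sorted_titles` helper). `substr(title, 5)` skips the four characters of 'The '.

```python
import sqlite3

def sorted_titles(titles):
    """Return titles in alphabetical order, ignoring a leading 'The '."""
    conn = sqlite3.connect(":memory:")  # SQLite used as a stand-in for MySQL
    conn.execute("CREATE TABLE books (title TEXT)")  # hypothetical table
    conn.executemany("INSERT INTO books (title) VALUES (?)",
                     [(t,) for t in titles])
    rows = conn.execute(
        """
        SELECT title,
               CASE WHEN title LIKE 'The %' THEN trim(substr(title, 5))
                    ELSE title
               END AS title2          -- derived column used only for sorting
        FROM books
        ORDER BY title2
        """
    ).fetchall()
    conn.close()
    return [title for title, _ in rows]
```

The original title is still what gets returned; `title2` only decides the order.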

You can use a CASE expression in the ORDER BY and then use REGEXP or LIKE to match strings that start with the words you would like to remove.
In the example below I find all titles that begin with a, an, or the followed by a space, remove everything up to that space, and trim away any additional white space (you might have two or more spaces following an instance of the).
SELECT *
FROM books
ORDER BY
CASE
WHEN title REGEXP '^(A|An|The)[[:space:]]' = 1 THEN
TRIM(SUBSTR(title , INSTR(title ,' ')))
ELSE title
END ;

Wrap the sort expression in UPPER() unless you are sure that you will NEVER EVER have a typo (such as a lowercase letter where an uppercase one belongs):
select *
from books b
order by UPPER(LTRIM(Replace(b.Title, 'The', '')))
Without UPPER, a case-sensitive sort puts all the uppercase initials first and then all the lowercase ones.
For example, this is ascending order:
Have a Great Day
Wild west
Zorro
aZtec fries are hotter
alfred goes shopping
bart is small
will i am not
Adapted from AJP's answer.

I've seen some convoluted answers here, which I tried, but they were either just wrong (didn't work) or unsafe (replaced every occurrence of 'the'). I believe the solution below is simple, unless I'm getting it wrong or not considering edge cases (sincerely, no sarcasm intended).
... ORDER BY SUBSTRING(UPPER(fieldname), IF(fieldname LIKE 'The %', 5, 1))
As stated elsewhere, UPPER just prevents ASCII sorting, which orders B before a (note the case difference).
There's no need for a CASE expression when there is only one condition; IF() will do.
I'm using MySQL 5.6, and its string functions are 1-indexed, in contrast to, say, PHP, where strings are 0-indexed (this caught me out). I've tested the above on my dataset and it works.

select *
from books b
order by LTRIM(Replace(b.Title, 'The', ''))
Please note that this will remove 'The' from the title no matter where it appears, so use SUBSTRING to check only the first 3 characters.
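To make the warning concrete, here is a small sketch comparing the two approaches. It runs the SQL through an in-memory SQLite database as a stand-in for MySQL (an assumption; the helper names are hypothetical). The second helper checks the first four characters, 'The' plus the following space, so that words such as "Theory" are left alone.

```python
import sqlite3

def strip_everywhere(title):
    """Mimic LTRIM(REPLACE(title, 'The', '')): removes EVERY 'The'."""
    conn = sqlite3.connect(":memory:")
    (result,) = conn.execute(
        "SELECT ltrim(replace(:t, 'The', ''))", {"t": title}
    ).fetchone()
    conn.close()
    return result

def strip_leading(title):
    """Remove 'The ' only when it is the first four characters."""
    conn = sqlite3.connect(":memory:")
    (result,) = conn.execute(
        "SELECT CASE WHEN substr(:t, 1, 4) = 'The ' "
        "THEN substr(:t, 5) ELSE :t END",
        {"t": title},
    ).fetchone()
    conn.close()
    return result
```

For a title such as "The Theory of Theatre", the first helper also mangles "Theory" and "Theatre", while the second only drops the leading article.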

Simply:
SELECT Title
FROM book
ORDER BY IF(Title LIKE "The %", substr(Title, 5), Title);
Explanation:
We use the IF function to strip the "The" (if present) from the beginning of the string before returning the string to the ORDER BY clause. For more complex alphabetization rules we could create a user-defined function and place that in the ORDER BY clause instead. Then you would have ...ORDER BY MyFunction(Title).
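As a rough illustration of the user-defined-function idea, the sketch below registers a Python function with SQLite and calls it from the ORDER BY clause. SQLite is only a stand-in here; in MySQL you would write the function with CREATE FUNCTION instead. The rule set in `ARTICLES` and the helper names are assumptions made for the example.

```python
import sqlite3

ARTICLES = ("the ", "an ", "a ")  # hypothetical alphabetization rules

def sort_key(title):
    """Lower-case the title and drop one leading article, if present."""
    lowered = title.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            return lowered[len(article):].lstrip()
    return lowered

def titles_in_order(titles):
    conn = sqlite3.connect(":memory:")
    # Expose sort_key to SQL under the name used in the answer above.
    conn.create_function("MyFunction", 1, sort_key)
    conn.execute("CREATE TABLE book (Title TEXT)")
    conn.executemany("INSERT INTO book (Title) VALUES (?)",
                     [(t,) for t in titles])
    rows = conn.execute(
        "SELECT Title FROM book ORDER BY MyFunction(Title)"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Keeping the rules in one function means the query itself does not change when the rules do.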

Related

MySQL query to remove hyphenated duplicates

I am taking the MySQL class by Duke on Coursera. Week two refers to messy data, so I figured I would ask my question here. My question is how to match an entry in a table row against another entry that would be identical except that it was entered with a hyphen, i.e. "Golden Retriever Mix" is the same instance as "Golden Retriever- Mix". When I run a SELECT DISTINCT statement I do not want it to return both results. The catch is that we cannot just remove all hyphens from the column, because we still want them in entries such as "Golden Retriever-Airedale Terrier Mix". What would a query for this look like? The example code that returns both "Golden Retriever Mix" and "Golden Retriever- Mix" is below.
SELECT DISTINCT breed,
TRIM(LEADING '-' FROM breed)
FROM dogs
ORDER BY TRIM(LEADING '-' FROM breed) LIMIT 1000, 1000;
I am thinking I need an IF/THEN statement that says
IF(REPLACE(breed,'-','') = breed)
THEN DELETE breed;
Obviously this is not correct syntax; the correct syntax is what I am looking for.
I think what you are looking for is the Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance).
It measures the difference between two strings; for example, comparing "Test" and "Test1" gives 1, because one letter has to be added.
You could use the suggested procedures from
How to add levenshtein function in mysql? or Levenshtein: MySQL + PHP
This will bring up not only the entries with an extra "-" but also the ones with misspellings. You can then filter your result data by the calculated distance.
If you do not want this because of performance issues, you can still use TRIM or REPLACE to strip the symbol and compare the result with the other string.
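For reference, here is a minimal Python sketch of the Levenshtein distance itself, using the standard dynamic-programming formulation. It is not the MySQL procedure from the linked answers, only an illustration of what the distance measures; the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

With this definition, "Golden Retriever Mix" and "Golden Retriever- Mix" are at distance 1, so a small threshold on the distance would treat them as the same breed.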
You're almost there; all you need to do is get rid of the plain breed column in your select clause and replace TRIM() with REPLACE()
SELECT DISTINCT REPLACE(breed, '-', ' ')
FROM dogs
TRIM(LEADING...) would remove the hyphens at the beginning of the string, but what you want to show is the distinct values of breed considering hyphens as spaces.
Edit
I was assuming the two strings were "Golden Retriever Mix" and "Golden Retriever-Mix", but if there's actually a space after the hyphen ("Golden Retriever- Mix"), you can use REPLACE(breed, '-', '') instead
Edit 2
After the clarification in your comment, I think what you need is a GROUP BY clause
SELECT MIN(breed)
FROM dogs
GROUP BY REPLACE(breed, '-', ' ')
Any string with a hyphen will be considered higher in value than the same string with a space instead, so when both are present this query will return the one with the space. If there is only one of them, it will be returned as is.
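A small sketch of the GROUP BY approach, run against an in-memory SQLite database as a stand-in for MySQL (an assumption, as is the `distinct_breeds` helper). It uses the "Golden Retriever-Mix" spelling that this answer originally assumed; a variant with a space after the hyphen would need a different normalization in the GROUP BY expression.

```python
import sqlite3

def distinct_breeds(breeds):
    """Collapse breeds that differ only by hyphen-versus-space."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dogs (breed TEXT)")
    conn.executemany("INSERT INTO dogs (breed) VALUES (?)",
                     [(b,) for b in breeds])
    rows = conn.execute(
        """
        SELECT MIN(breed) AS kept      -- a space sorts before a hyphen
        FROM dogs
        GROUP BY REPLACE(breed, '-', ' ')
        ORDER BY kept
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Because the hyphen is only replaced inside the grouping key, a breed such as "Golden Retriever-Airedale Terrier Mix" keeps its hyphen in the output.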

MySQL regex matching rows containing two or more spaces in a row

I am trying to write a MySQL statement which finds and returns the book registrations that contain 2 or more spaces in a row.
The statement below is wrong.
SELECT * FROM book WHERE titles REGEXP '[:space]{2,}';
Since the 2 spaces already meet your condition, you really do not need to check if there are more than 2. Moreover, if you need to match a regular ASCII space (decimal code 32), you do not need a REGEXP operator, you can safely use
SELECT * FROM book WHERE titles LIKE '%  %';
LIKE is preferred in all cases where you can use it instead of REGEXP (see MySQL | REGEXP VS Like)
When you need to match numerous whitespace symbols, you can use WHERE titles REGEXP '[[:space:]]{2}' (it will match [ \t\r\n\v\f]), and if you only plan to match tabs and spaces, use WHERE titles REGEXP '[[:blank:]]{2}'. For more details, see POSIX Bracket Expressions.
Note that [:class_name:] should only be used inside a character class (i.e. inside another pair of [...]); otherwise, it is not recognized.
Your POSIX class must be written as
SELECT * FROM book WHERE titles REGEXP '[[:space:]]{2,}';
There is no need for the comma:
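To see the difference between the two classes outside the database, here is a hedged Python sketch. Python's `re` module does not support POSIX bracket expressions, so the classes are spelled out by hand using the character sets listed above; the pattern and function names are assumptions.

```python
import re

# Hand-written equivalents of the POSIX classes, since Python's re module
# has no [[:space:]] or [[:blank:]].
SPACE_RUN = re.compile(r"[ \t\r\n\v\f]{2}")  # like '[[:space:]]{2}'
BLANK_RUN = re.compile(r"[ \t]{2}")          # like '[[:blank:]]{2}'

def has_space_run(title):
    """True if the title contains two whitespace characters in a row."""
    return SPACE_RUN.search(title) is not None

def has_blank_run(title):
    """True if the title contains two spaces or tabs in a row."""
    return BLANK_RUN.search(title) is not None
```

A space followed by a newline counts as a run for the first pattern but not for the second, which only looks at spaces and tabs.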
SELECT * FROM book WHERE titles REGEXP '[[:space:]]{2}';
You may also use [[:blank:]]
SELECT * FROM book WHERE titles REGEXP '[[:blank:]]{2}';
If you mean just the space character: REGEXP '  '. Or you could use LIKE "%  %", which would be faster. (Note: there are 2 blanks in those.)
Otherwise, see http://dev.mysql.com/doc/refman/5.6/en/regexp.html for blank and space.

Sphinx match by first letter

I need a simple explanation of why my queries fail to bring back the results I need.
Sphinx 2.0.8-id64-release (r3831)
Here is what I have in sphinx.conf:
SELECT
trackid,
title,
artistname,
SUBSTRING(REPLACE(TRIM(`artist_name`), 'the ', ''),1,3) AS artistname_init
....
sql_field_string = title
sql_field_string = artistname
sql_field_string = artistname_init
Additional settings:
docinfo = extern
charset_type = utf-8
min_prefix_len = 1
enable_star = 1
expand_keywords= 0
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z, A..Z->a..z, a..z
The source query works, and I index my data without problems. However, I am failing to make Sphinx bring back any sensible results. I am using SphinxQL to query.
Example:
select
artistname, artistname_init from myindex
WHERE MATCH('@artistname_init ^t*')
GROUP BY artistname ORDER BY artistname_init ASC limit 0,10;
brings back nothing related to the query.
I've tried everything I could think of, like:
MATCH('@artistname_init ^t*')
MATCH('@artistname_init[1] t')
MATCH('@artistname_init ^t$')
Can anyone please point out where my mistake is and perhaps give me a query that will work for my case?
My target is to get results that follow this sorting order:
B (Single letter)
B-T (Single letter + non-alphabet sign after)
B as Blue (Single letter + space after)
Baccara (First letter of single word)
Bad Religion (First letter of several words)
The B (not counting "The ")
The B.Y.Z (Single letter + non-alphabet sign after not counting "The ")
The B 2 B (Single letter + space after not counting "The ")
The Boyzz (First letter of single word not counting "The ")
The Blue Boy (First letter of several words not counting "The ")
Or close to it.
There are a lot of moving parts in what you're trying to do, but I can at least answer the title portion of it. Sphinx offers field-level ranking factors to let you customize the WEIGHT() function – it should be much easier to order the matches the way you want, rather than trying to actually filter out entries that matched the query later than the 1st or 2nd word.
Here's an example, which will return all results with a word starting with "b", sorted by how early that word appears:
SELECT id, artistname, WEIGHT()
FROM myindex
WHERE MATCH('(@artistname (b*))')
ORDER BY WEIGHT() DESC
LIMIT 10
OPTION ranker=expr('sum(100 - min_hit_pos)');
If you want to filter out other cases like "Several other words then B", I think I'd suggest doing that in your application. For example, if the fourth result has the keyword in the 3rd word, only return the first 3 results. That, or actually create a new field in Sphinx without the leading "The", and then add a numeric attribute to the index to show that a word was removed (you can use numeric attributes in your ranker expressions).
As for ranking "B-t" more highly than "Bat", I'm not sure that's possible without somehow changing Sphinx's concept of alphabetical order. You could try diving into the source code? ;)
One last note. For this particular kind of query, MySQL (I say MySQL because it's the common way of sourcing a Sphinx index) may actually work just as well. If you strip the leading "The", a B-tree index (which MySQL uses) is a perfectly good way of searching if you're sure you only want results where the query matches the beginning of the field. Sphinx's inverted indexes are kind of overkill for that sort of thing.

Mysql SELECT query on name ignoring the first word if it is "the", "a", "an" etc

I've been trying (without success) to construct a MySQL query which will select a group of records with a "title" field starting with a single alphabetical character, but ignoring the first word if it's "The", "An" or "A". I've found plenty of examples that do this for the ORDER BY part of the query, but it's the initial WHERE part that I need to do it for, as the order is irrelevant if the correct records haven't been found. Using
WHERE title LIKE "R%"
will just give me titles that have this as the very first letter (e.g. "Robin Hood") but won't match "The Red House". I think I need some kind of REGEX, but I can't seem to get it to work.
So for example, given the following movie titles,
Road House
The Return of the King
Mamma Mia
Argo
Titanic
A River Runs Through it
Selecting movie titles that start with "R" would return the following:
The Return of the King
A River Runs Through it
Road House
(other fields omitted)
The easiest way is to programmatically expand the query to something like
SELECT
...
WHERE
title LIKE 'R%'
OR title LIKE 'The R%'
OR title LIKE 'A R%'
OR title LIKE 'An R%'
...
This should perform better than a REGEX, as it will be able to use an index, which a REGEX never will.
BTW: The canonical way to do this is to store the article in a separate field.
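A sketch of what "programmatically expand the query" could look like, using Python with an in-memory SQLite database as a stand-in for MySQL. The `movies` table, the `ARTICLES` tuple, and the helper names are assumptions for illustration; the patterns are passed as bound parameters rather than pasted into the SQL.

```python
import sqlite3

ARTICLES = ("The", "A", "An")  # leading words to ignore

def demo_connection():
    """Build a throwaway table holding the example movie titles."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (title TEXT)")
    conn.executemany(
        "INSERT INTO movies (title) VALUES (?)",
        [("Road House",), ("The Return of the King",), ("Mamma Mia",),
         ("Argo",), ("Titanic",), ("A River Runs Through it",)],
    )
    return conn

def titles_starting_with(conn, letter):
    """Expand one letter into 'R%', 'The R%', 'A R%', 'An R%' and query."""
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError("expected a single alphabetical character")
    patterns = [f"{letter}%"] + [f"{article} {letter}%" for article in ARTICLES]
    where = " OR ".join("title LIKE ?" for _ in patterns)
    sql = f"SELECT title FROM movies WHERE {where} ORDER BY title"
    return [row[0] for row in conn.execute(sql, patterns)]
```

Note that a plain `LIKE 'T%'` or `LIKE 'A%'` still matches titles that begin with the articles themselves, so those letters may need extra handling.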
This regex should suit your needs:
WHERE title REGEXP '^(The |An? )?R.*$'
But as @EugenRieck noticed, since you probably use an index on the title column, you would be better off using the WHERE... OR... clauses.
To add to the above suggestions (both of which I agree with), for the sort you would probably need to use a CASE to derive a field for the ORDER BY clause.
SELECT somefield,
CASE
WHEN title LIKE 'R%' THEN title
WHEN title LIKE 'The R%' THEN SUBSTRING(title FROM 5)
WHEN title LIKE 'A R%' THEN SUBSTRING(title FROM 3)
WHEN title LIKE 'An R%' THEN SUBSTRING(title FROM 4)
ELSE title
END AS SortTitle
FROM sometable
ORDER BY SortTitle

Text manipulation of strings containing "The "

I have a table in a MySQL database which contains data like this;
ID text
1 Action Jackson
2 The impaler
3 The chubby conquistador
4 Cornholer
I want to display them in alphabetical order minus the leading "The ". This is what I've come up with which works.
SELECT ID, CASE LEFT(l.text, 4) WHEN "The " THEN CONCAT(RIGHT(l.text, LENGTH(l.text) - 4), ", The") ELSE l.text END AS "word"
FROM list l
This solution seems a little clunky, does anyone have a more elegant answer?
I think this is what you are looking for:
SELECT ID,
text
FROM list l
ORDER BY TRIM(LEADING 'The ' FROM text);
If you can at all, I would think of restructuring your data a bit. It's hundreds of times better to rely on MySQL indexes and proper sorting instead of doing it dynamically like this.
How about adding a field that drops the 'The ', and sort on that? You could make sure that this secondary field is always correct with a few triggers.
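One way the extra-field idea could look, sketched with Python and an in-memory SQLite database. The trigger syntax shown is SQLite's, used here only so the example runs on its own; MySQL triggers are written differently. The `sort_text` column, the trigger names, and the helpers are all assumptions.

```python
import sqlite3

# SQLite syntax; the column, index, and trigger names are hypothetical.
SCHEMA = """
CREATE TABLE list (
    id        INTEGER PRIMARY KEY,
    text      TEXT NOT NULL,
    sort_text TEXT COLLATE NOCASE   -- text without the leading 'The '
);
CREATE INDEX list_sort_text ON list (sort_text);

CREATE TRIGGER list_after_insert AFTER INSERT ON list
BEGIN
    UPDATE list
    SET sort_text = CASE WHEN NEW.text LIKE 'The %'
                         THEN substr(NEW.text, 5) ELSE NEW.text END
    WHERE id = NEW.id;
END;

CREATE TRIGGER list_after_update AFTER UPDATE OF text ON list
BEGIN
    UPDATE list
    SET sort_text = CASE WHEN NEW.text LIKE 'The %'
                         THEN substr(NEW.text, 5) ELSE NEW.text END
    WHERE id = NEW.id;
END;
"""

def build_list(texts):
    """Create the table and insert rows; triggers fill in sort_text."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO list (text) VALUES (?)",
                     [(t,) for t in texts])
    return conn

def ordered_texts(conn):
    """Sort on the precomputed column instead of computing it per query."""
    return [row[0] for row in
            conn.execute("SELECT text FROM list ORDER BY sort_text")]
```

The sort key is computed once per write, so the ORDER BY can use the index on `sort_text` rather than evaluating a string expression for every row.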
SELECT TRIM(LEADING 'The ' FROM text) as word
FROM list
ORDER BY TRIM(LEADING 'The ' FROM text)