MySQL query to remove hyphenated duplicates

I am taking the MySQL class by Duke on Coursera. Week two refers to messy data, so I figured I would ask my question here. How can I treat an entry as a duplicate of another when the only difference is a stray hyphen? For example, "Golden Retriever Mix" is the same instance as "Golden Retriever- Mix", and when I run a SELECT DISTINCT statement I do not want it to return both. The catch is that we cannot simply remove all hyphens from the column, because we still want them in entries such as "Golden Retriever-Airedale Terrier Mix". What would a query for this look like? The example code that returns both "Golden Retriever Mix" and "Golden Retriever- Mix" is below.
SELECT DISTINCT breed,
TRIM(LEADING '-' FROM breed)
FROM dogs
ORDER BY TRIM(LEADING '-' FROM breed) LIMIT 1000, 1000;
I am thinking I need an IF/THEN statement that says
IF(REPLACE(breed,'-','') = breed)
THEN DELETE breed;
Obviously this is not correct syntax; the correct syntax is what I am looking for.

I think what you are looking for is the Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance).
It measures the difference between two strings; for example, comparing "Test" and "Test1" gives 1 because one letter was added.
You could use the suggested procedures from
How to add levenshtein function in mysql? or Levenshtein: MySQL + PHP
This will bring up not only the entries with a stray "-" but also those with misspellings. You can then filter the results by the calculated distance.
If you do not want this because of performance concerns, you can still use TRIM or REPLACE to strip the symbol and compare the result with the other string.

You're almost there. All you need to do is drop the plain breed column from your select clause and replace TRIM() with REPLACE():
SELECT DISTINCT REPLACE(breed, '-', ' ')
FROM dogs
TRIM(LEADING...) would remove the hyphens at the beginning of the string, but what you want to show is the distinct values of breed considering hyphens as spaces.
Edit
I was assuming the two strings were "Golden Retriever Mix" and "Golden Retriever-Mix", but if there is actually a space after the hyphen ("Golden Retriever- Mix"), you can use REPLACE(breed, '-', '') instead.
Edit 2
After the clarification in your comment, I think what you need is a GROUP BY clause:
SELECT MIN(breed)
FROM dogs
GROUP BY REPLACE(breed, '-', ' ')
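Any string with a hyphen sorts higher than the same string with a space in that position, so when both exist this query returns the one with the space. If only one of them exists, it is returned as is. As in the first edit, if the hyphen is followed by a space ("Golden Retriever- Mix"), group by REPLACE(breed, '-', '') so that the two spellings produce the same key.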
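To cover both spellings mentioned in the edits ("Golden Retriever-Mix" and "Golden Retriever- Mix") in a single pass, the grouping key can normalize "- " before "-". This is only a sketch, assuming the dogs table and breed column from the question; the nested REPLACE is one possible normalization, not the only one.

```sql
-- Sketch: treat "- " and "-" as a single space for grouping purposes only.
-- The value shown is still one of the original breed strings.
SELECT MIN(breed) AS breed
FROM dogs
GROUP BY REPLACE(REPLACE(breed, '- ', ' '), '-', ' ');
```

A breed such as "Golden Retriever-Airedale Terrier Mix" keeps its hyphen in the output unless a variant spelled with a space in that position also exists, in which case MIN() picks the variant with the space.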

Related

Why ' ' was used in select statement

Came across this code today:
SELECT 'Overall' as Main,
wave,
country,
catg,
'' AS hw,
SUM(0) AS headwinds_sum
....
....
Can someone explain what ' ' in the above stands for?
It's not a typo, as it is repeated in multiple instances.
It is not a typo, and no text is missing.
'' as hw
adds a column named hw to your select query of type varchar that contains empty strings.
Depending on how you process the resultset afterwards this can make sense.
The empty literal is used to return an empty column in a result set. People occasionally do this to match column counts in INSERT ... SELECT statements, or when exporting data to Excel files where a standard set of column names is wanted, for example to capture audit recommendations on the data.
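A common case where the placeholder is needed is a UNION, because every branch must return the same number of columns. The sketch below is hypothetical: the tables sales_detail and sales_total and the headwinds column are made up for illustration, and only the column names wave, country, catg and hw come from the question.

```sql
-- Hypothetical: sales_detail has an hw column, sales_total does not.
SELECT wave, country, catg, hw, SUM(headwinds) AS headwinds_sum
FROM sales_detail
GROUP BY wave, country, catg, hw

UNION ALL

-- '' AS hw pads the second branch so both branches have five columns.
SELECT wave, country, catg, '' AS hw, SUM(0) AS headwinds_sum
FROM sales_total
GROUP BY wave, country, catg;
```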

Why isn't MySQL REGEXP filtering out these values?

So I'm trying to find what "special characters" have been used in my customer names. I'm updating this query as I go to find them all one by one, but it's still showing all customers with a - despite my trying to exclude that in the query.
Here's the query I'm using:
SELECT * FROM customer WHERE name REGEXP "[^\da-zA-Z\ \.\&\-\(\)\,]+";
This customer (and many others with a dash) are still showing in the query results:
Test-able Software Ltd
What am I missing? Based on that regexp, shouldn't that one be excluded from the query results?
Testing it on https://regex101.com/r/AMOwaj/1 shows there is no match.
Edit - So I want to FIND any which have characters other than the ones in the regex character set. Not exclude any which do have these characters.
Your code checks if the string contains any character that does not belong to the character class, while you want to ensure that none does belong to it.
You can use ^ and $ to check the whole string at once:
SELECT * FROM customer WHERE name REGEXP '^[^\da-zA-Z .&\-(),]+$';
This would probably be simpler expressed with NOT, and without negating the character class:
SELECT * FROM customer WHERE name NOT REGEXP '[\da-zA-Z .&\-(),]';
Note that you don't need to escape all the characters within the character class, except probably for -.
Use [0-9] or [[:digit:]] to match digits irrespective of MySQL version.
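Place the hyphen where it cannot form part of a range, for example at the end of the character class. In a MySQL string literal a backslash before an ordinary character is normally dropped, so \&\-\( most likely reaches the regex engine as the range &-(, which does not contain the hyphen itself.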
Fix the expression as
SELECT * FROM customer WHERE name REGEXP "[^0-9a-zA-Z .&(),-]+";
If the entire text should match this pattern, enclose with ^ / $:
SELECT * FROM customer WHERE name REGEXP "^[^0-9a-zA-Z .&(),-]+$";
A - implies a range unless it comes first in the class (or, in a negated class, right after the ^).
So use
"[^-0-9a-zA-Z .&(),]"
I removed the + at the end because you don't really care how many; this way it will stop after finding one.
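A quick way to check a pattern before running it against the whole table is to test it on literal strings. The sketch below uses the pattern with the hyphen at the end of the class and the customer name from the question; the second name and the column aliases are made up for illustration.

```sql
-- The hyphen is last in the class, so it cannot form a range.
-- The first name contains only allowed characters and should return 0;
-- the second contains '#', which is outside the class, and should return 1.
SELECT 'Test-able Software Ltd' REGEXP '[^0-9a-zA-Z .&(),-]' AS has_special_1,
       'Acme #1 Ltd'            REGEXP '[^0-9a-zA-Z .&(),-]' AS has_special_2;
```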

How to skip the inverted commas when selecting a column from MySQL

There is a table with fields name, age, city and state. Now I need to select rows based on city name. The value of the column city is surrounded with ", for example "LA".
How can I write a SELECT statement that gets data based on city?
\" is the escape combination for double quotes:
SELECT * FROM mytable WHERE city = '\"LA\"';
See MySQL documentation "String Literals".
Just a suggestion: if you are collecting and storing the information yourself (that is, you control the input), try to clean the data before storing it so that it is easier to query.
If the input has quotes and white space, remove them before inserting the values into the table. Do this in application code, or use MySQL's TRIM() and REPLACE() to strip the characters that would make a query hard to build, and store the resulting value.
Of course, if you do not control the input data, the other answers apply, and the challenge for the programmer is to work out the different input possibilities and handle each of them.
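As a sketch of the TRIM() idea, the quotes can be stripped either at query time or once in the stored data. The table name mytable is taken from the earlier answer; whether a one-off UPDATE is acceptable depends on your data, so treat the second statement as an assumption rather than a recommendation.

```sql
-- Query-time: compare against the value with surrounding double quotes removed.
SELECT *
FROM mytable
WHERE TRIM(BOTH '"' FROM city) = 'LA';

-- One-off cleanup: store the value without quotes or surrounding spaces.
UPDATE mytable
SET city = TRIM(TRIM(BOTH '"' FROM city))
WHERE city LIKE '"%"';
```

Note that wrapping the column in a function in the WHERE clause generally prevents MySQL from using an ordinary index on city, which is one more reason to clean the stored values when that is possible.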
SELECT *
FROM table1
WHERE city LIKE '%LA%'
or
SELECT *
FROM table1
WHERE city REGEXP '^[[:space:]]*LA[[:space:]]*$'

MySQL Sort Alphabetically but Ignore "The"

I have MySQL database that has a table with book data in it. One of the columns in the table is called "title". Some of the titles begin the word "the" and some do not.
Example:
"The Book Title One"
"Book Title Two"
"Book Title Three"
I need to pull these out of the database in alphabetical order, but I need to ignore the "the" in the beginning of the titles that start with it.
Does SQL (specifically MySQL) provide a way to do this in the query?
Use a CASE WHEN to check whether the column value starts with "The" and, if it does, return the title without it. This gives a new column that you can then use for the sort order.
select title, case when title like 'The %' then trim(substr(title from 4)) else title end as title2 from tablename order by title2;
You can use a CASE expression in the ORDER BY and then use REGEXP or LIKE to match strings that start with the words you would like to remove.
In the example below I find all titles that begin with A, An, or The followed by a space, remove everything up to the first space, and trim away additional white space (you might have two or more spaces following the article).
SELECT *
FROM books
ORDER BY
CASE
WHEN title REGEXP '^(A|An|The)[[:space:]]' = 1 THEN
TRIM(SUBSTR(title , INSTR(title ,' ')))
ELSE title
END ;
If you are not sure that you will NEVER EVER have a typo (lowercase used instead of uppercase), sort on the uppercased value:
select *
from books b
order by UPPER(LTRIM(Replace(b.Title, 'The', '')))
Otherwise the sort will list all titles starting with an uppercase letter first, and then all those starting with a lowercase letter.
For example, this is ascending order:
Have a Great Day
Wild west
Zorro
aZtec fries are hotter
alfred goes shopping
bart is small
will i am not
adapted from AJP's answer
I've seen some convoluted answers here which I tried but were just wrong (didn't work) or unsafe (replaced every occurrence of 'the'). The solution I believe to be easy, or maybe I'm getting it wrong or not considering edge cases (sincerely, no sarcasm intended).
... ORDER BY SUBSTRING(UPPER(fieldname), IF(fieldname LIKE 'The %', 5, 1))
As stated elsewhere, UPPER just prevents ASCII sorting, which orders B before a (note the case difference).
There's no need for a CASE expression when there is only one condition; IF() will do.
I'm using MySQL 5.6, and the string functions are 1-indexed, in contrast to, say, PHP, where strings are 0-indexed (this caught me out). I've tested the above on my dataset and it works.
select *
from books b
order by LTRIM(Replace(b.Title, 'The', ''))
Please note this will remove "The" from the title no matter where it occurs, so use SUBSTRING to check only the first three characters.
Simply:
SELECT Title
FROM book
ORDER BY IF(Title LIKE "The %", substr(Title, 5), Title);
Explanation:
We use the IF function to strip the "The" (if present) from the beginning of the string before returning the string to the ORDER BY clause. For more complex alphabetization rules we could create a user-defined function and place that in the ORDER BY clause instead. Then you would have ...ORDER BY MyFunction(Title).
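The user-defined function mentioned above might look roughly like the following. This is a sketch under assumptions: the name sort_title, the VARCHAR(255) length, and the choice of articles (The, An, A) are all made up for illustration, and DELIMITER is a directive of the mysql command-line client rather than SQL itself.

```sql
DELIMITER //

CREATE FUNCTION sort_title(title VARCHAR(255))
RETURNS VARCHAR(255)
DETERMINISTIC
NO SQL
BEGIN
    -- Strip one leading article followed by a space, if present.
    IF title LIKE 'The %' THEN
        RETURN TRIM(SUBSTRING(title, 5));
    ELSEIF title LIKE 'An %' THEN
        RETURN TRIM(SUBSTRING(title, 4));
    ELSEIF title LIKE 'A %' THEN
        RETURN TRIM(SUBSTRING(title, 3));
    END IF;
    RETURN title;
END //

DELIMITER ;

-- Usage: sort by the stripped title, display the original.
SELECT Title
FROM book
ORDER BY sort_title(Title);
```

Keeping the rules in one function means new articles can be added in a single place, at the cost of a function call per row during the sort.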

MySQL sort by name

Is it possible to sort a column alphabetically while ignoring certain words, e.g. 'The'?
e.g.
A normal query would return
string 1
string 3
string 4
the string 2
I would like to return
string 1
the string 2
string 3
string 4
Is this possible?
EDIT
Please note I am looking to replace multiple words like The, A, etc... Can this be done?
You can try
SELECT id, text FROM table ORDER BY TRIM(REPLACE(LOWER(text), 'the ', ''))
but note that it will be very slow for large datasets as it has to recompute the new string for every row.
IMO you're better off with a separate column with an index on it.
For multiple stopwords just keep nesting REPLACE calls. :)
This removes every occurrence of "The " (not only a leading one), as an example:
SELECT *
FROM YourTable
ORDER BY REPLACE(Val,'The ', '')
Yes, it should be possible to use expressions in the ORDER BY clause:
SELECT * FROM yourTable ORDER BY REPLACE(yourField, "the ", "")
I have a music listing that is well over 75,000 records and I had encountered a similar situation. I wrote a PHP script that checked for all strings that began with 'A ', 'An ' or 'The ' and cut that part off the string. I also converted all uppercase letters to lowercase and stored the resulting string in a new column. After setting an index on that column, I was done.
Obviously you display the initial column but sort by the newly-created indexed column. I get results in a second or so now.
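The same precomputed-column approach can also be sketched in SQL alone, without a PHP script. The table and column names (songs, name, sort_name) and the index name are hypothetical, and the VARCHAR length is an assumption.

```sql
-- Add a column that holds the normalized sort key.
ALTER TABLE songs ADD COLUMN sort_name VARCHAR(255);

-- Fill it: drop one leading article and lowercase the rest.
UPDATE songs
SET sort_name = LOWER(
    CASE
        WHEN name LIKE 'The %' THEN SUBSTRING(name, 5)
        WHEN name LIKE 'An %'  THEN SUBSTRING(name, 4)
        WHEN name LIKE 'A %'   THEN SUBSTRING(name, 3)
        ELSE name
    END);

-- Index the key so ORDER BY does not recompute anything per row.
CREATE INDEX idx_songs_sort_name ON songs (sort_name);

-- Display the original column, sort by the precomputed one.
SELECT name
FROM songs
ORDER BY sort_name;
```

The key has to be kept in sync when rows are inserted or renamed, for example by running the same expression in the application or in a trigger.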