Matching websites in MySQL via regexp - mysql

I have two columns: "website 1" and "website 2" and I want to test for equality. The issue is that there are differences in format such as:
www.example.com http://www.example.com/
example2.co.uk/ https://www.example2.co.uk/Default.aspx
I would like to be able to match the domain only. Can this be done in MySQL given a regexp such as:
(http://|https://)?(www.)?[A-z]*(.com|.co.uk|.us|.org|.net|.mobi)

This can be done without a regular expression.
Try something like this to extract only the domain suffix:
SELECT
RIGHT(t1.website1, INSTR(REVERSE(t1.website1), '.')),
RIGHT(t1.website2, INSTR(REVERSE(t1.website2), '.'))
FROM table1 t1
Then you can check for equality on the domain.
You may also have to strip everything from the / that follows the host name to the end of the string in your select. Otherwise .aspx would be treated as the domain suffix in the example above.
You will also need an extra column that counts the number of dots, since some domains have two-part suffixes such as .co.uk or .co.jp,
like this
SELECT CHAR_LENGTH(t1.website1) - CHAR_LENGTH(REPLACE(t1.website1, '.', '')) AS AmountOfDotsInString
FROM table1 t1
This gives the number of dots in the string. Run the count only after everything following the / has been removed.
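The normalization steps described above (drop the scheme, drop everything after the host, ignore a leading www., then compare) can also be sketched outside the database. This is a minimal Python illustration, not the answerer's method; the helper names normalize_domain and dot_count are hypothetical, and treating "www." as insignificant is an assumption.

```python
import re


def normalize_domain(url: str) -> str:
    """Reduce a URL-like string to a bare host name for comparison."""
    host = url.strip().lower()
    # Drop a leading scheme such as http:// or https://
    host = re.sub(r"^[a-z][a-z0-9+.-]*://", "", host)
    # Keep only what precedes the first '/', '?' or '#'
    host = re.split(r"[/?#]", host, maxsplit=1)[0]
    # Drop an optional port, then a leading "www." (assumed insignificant)
    host = host.split(":", 1)[0]
    if host.startswith("www."):
        host = host[4:]
    return host


def dot_count(host: str) -> int:
    """Count the dots in an already-normalized host name."""
    return host.count(".")


pairs = [
    ("www.example.com", "http://www.example.com/"),
    ("example2.co.uk/", "https://www.example2.co.uk/Default.aspx"),
]
for left, right in pairs:
    same = normalize_domain(left) == normalize_domain(right)
    print(normalize_domain(left), dot_count(normalize_domain(left)), same)
```

Both sample pairs from the question compare as equal once normalized; the dot count is only computed after the path has been removed, as the answer recommends.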

Related

Matching content in two different columns-MySQL

Let's say I have a list of person names and a list of social media URLs (that might or might not contain a portion of the person names).
I'm trying to see whether the full name is not contained in the list of URLs I have. I don't think a "not like" would work here (because the URL has plenty of other characters that would throw back a result), but I can't think of any other way to address this. Any tips? The closest I could find was from this:
Matching partial words in two different columns
But I'm unsure if that applies here.
Just use SELECT * FROM yourtable WHERE url LIKE '%name%'. The % wildcard matches any sequence of characters, including whitespace. Then just check whether the query returned any rows.
From mysql doc:
% matches any number of characters, even zero characters.
mysql> SELECT 'David!' LIKE 'David_';
-> 1
mysql> SELECT 'David!' LIKE '%D%v%';
-> 1
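The wildcard rules quoted from the manual can be mimicked outside the database. The following is a minimal Python sketch, assuming case-insensitive matching (as with a _ci collation) and ignoring LIKE escape sequences; like_match is a hypothetical helper, not part of MySQL.

```python
import re


def like_match(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE: % is any run of characters, _ is exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    regex = "".join(parts)
    # The whole value must match the pattern, as with LIKE
    return re.fullmatch(regex, value, flags=re.IGNORECASE | re.DOTALL) is not None


print(like_match("David!", "David_"))
print(like_match("David!", "%D%v%"))
print(like_match("website.com/peterjohnson", "peter"))
print(like_match("website.com/peterjohnson", "%peter%"))
```

The third call shows why a pattern without wildcards behaves like a whole-string comparison: the pattern has to cover the entire value, not just part of it.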
So let's say these are the URLs in your list:
website.com/peterjohnson
website.com/jackmiller
website.com/robertjenkins
Then if you would do:
SELECT * FROM urls WHERE url LIKE '%peter%'
It would return 1 row.
You can also use NOT LIKE so you will get all the rows not containing the name.
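To find names that appear in none of the URLs, the same containment check can be applied per name. Here is a rough Python sketch, assuming that a full name shows up in a URL lowercased and with its spaces removed (as in the sample URLs above); names_missing_from_urls and the sample names are hypothetical.

```python
def names_missing_from_urls(names, urls):
    """Return the names whose compacted, lowercased form occurs in no URL."""
    lowered_urls = [url.lower() for url in urls]
    missing = []
    for name in names:
        # "Peter Johnson" -> "peterjohnson"
        key = "".join(name.lower().split())
        if not any(key in url for url in lowered_urls):
            missing.append(name)
    return missing


urls = [
    "website.com/peterjohnson",
    "website.com/jackmiller",
    "website.com/robertjenkins",
]
names = ["Peter Johnson", "Jack Miller", "Mary Smith"]
print(names_missing_from_urls(names, urls))
```

With this sample data only "Mary Smith" is reported, since the other two names are contained in at least one URL.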

MySQL different counts between "where =" and "where like"

1. select count(*) from tableX where code = "XYZ";
2. select count(*) from tableX where code like "%XYZ";
Result for query 1 is 18734. <== Not Correct
Result for query 2 is 93003. <== Correct
We know that query 2's count is correct based on independent verification.
We expect these two queries to return exactly the same count, because we know that no rows in tableX have a code with extra characters before "XYZ", so the wildcard at the beginning shouldn't affect the query.
Why would these queries produce different counts?
We have already researched the differences between "=" comparison and "like" string comparison, but based on all our verification checks, we still don't understand why this would give us different counts.
We have confirmed the following:
There are no leading or trailing characters in the "code" field
There are no hidden characters (tried all found here: How can I find non-ASCII characters in MySQL?)
The collation is "utf8_unicode_ci"
We are using MySQL version 5.5.40-0ubuntu0.12.04.1.
Try this in order to get your answer:
SELECT code
FROM tableX
WHERE code LIKE "%XYZ"
AND code <> "XYZ"
LIMIT 10
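My guess is that some of your codes end with a lowercase xyz. Note, however, that under a case-insensitive collation such as utf8_unicode_ci both LIKE and = ignore case, so a case difference alone would explain the gap only if the two comparisons were evaluated under different collations. The query above will show which rows actually differ.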
where code = "XYZ" requires the whole value to match, whereas where code LIKE "%XYZ" also matches values that have other characters before XYZ. In your case an extra space or other stray character may be throwing off the count. Consider trimming and normalizing case before comparing, like this:
where UPPER(TRIM(code)) = 'XYZ';
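Once the suspicious rows have been pulled out with the diagnostic query, it can help to classify how each value differs from the target. The sketch below is an illustration in Python under its own string semantics, which are not the same as MySQL's collation rules; explain_mismatch and the sample values are hypothetical.

```python
def explain_mismatch(code: str, target: str = "XYZ"):
    """List the ways in which code differs from an exact match on target."""
    reasons = []
    stripped = code.strip()
    if stripped != code:
        reasons.append("surrounding whitespace")
    if stripped.lower() == target.lower():
        if stripped != target:
            reasons.append("case difference")
    elif stripped.lower().endswith(target.lower()):
        reasons.append("extra leading characters")
    else:
        reasons.append("does not end with target")
    return reasons


for sample in ["XYZ", "xyz", "XYZ ", "AXYZ"]:
    # repr() makes trailing spaces and control characters visible
    print(repr(sample), explain_mismatch(sample))
```

An empty list means the value is an exact match; anything else points at the kind of difference the two answers speculate about.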
We restarted the server that the database resides on, we re-ran the queries, and now they all are producing the expected, correct results...
We'll have to look into possibilities for why this "fixed" the issue.

Any way to filter Access records containing a '#'?

I have an Access database that is being used for a website. In the DB there is a field for image file names that is used to display images on the site. In some cases the person responsible for gathering the images started using the # character in the image file names when saving them, and this is causing the images not to show on the website.
Is there any way to filter out just the records where the image field contains the '#' character?
Everything I've tried has Access treating it like a wildcard and picking up any number.
As mentioned in a comment, you can select rows which have # contained in your image_file_name field by checking whether the field is LIKE '*[#]*'
However, since you want to filter out those rows, target the inverse of that pattern match ... NOT LIKE '*[#]*'
A query like this would work within an Access session with default settings:
SELECT y.*
FROM YourTable AS y
WHERE y.image_file_name Not Like '*[#]*';
However, since you're using the Access db to feed a website, you may be using ADO/OleDb to connect to the db file. If that is the case, use % instead of * as the wildcard character:
WHERE y.image_file_name Not Like '%[#]%';
Or you could use Alike instead of Like. In that situation, the wildcard should always be % and the query will work correctly whether you're running it from within or outside of Access:
WHERE y.image_file_name Not ALike '%[#]%';
A totally different approach is to use InStr to find the position of # within image_file_name and select the rows where InStr returns zero:
SELECT y.*
FROM YourTable AS y
WHERE InStr(1, y.image_file_name, '#') = 0;
If you also want rows where image_file_name is Null, you can add that condition to the WHERE clause with OR.
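The reason the bare # "picks up any number" is that in Access Like patterns # stands for a single digit, while [#] is a character class containing the literal character. A small Python sketch of that translation, assuming the default Access wildcards (* and ?) and handling only simple bracket lists (no ranges or negation); access_like is a hypothetical helper:

```python
import re


def access_like(value: str, pattern: str) -> bool:
    """Approximate Access Like: * any run, ? one char, # one digit, [..] literal set."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        elif ch == "#":
            out.append("[0-9]")
        elif ch == "[":
            end = pattern.find("]", i + 1)
            if end <= i + 1:
                # Unclosed or empty bracket: treat '[' as a literal
                out.append(re.escape(ch))
            else:
                # Every character inside the brackets is taken literally
                out.append("[" + re.escape(pattern[i + 1:end]) + "]")
                i = end
        else:
            out.append(re.escape(ch))
        i += 1
    regex = "".join(out)
    return re.fullmatch(regex, value, flags=re.IGNORECASE | re.DOTALL) is not None


print(access_like("photo1.jpg", "*#*"))     # digit wildcard matches the "1"
print(access_like("photo1.jpg", "*[#]*"))   # no literal '#' in the name
print(access_like("photo#1.jpg", "*[#]*"))  # literal '#' found
```

Negating the last check corresponds to the Not Like '*[#]*' filter from the answer.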

Select all from DB where the content has defined

I am trying to retrieve a list of database records which have specific 'interest codes' inside the 'custom_fields' column. So for example right now there are 100 records, and I need the Name, Email and Interest Code from each of those records.
I've tried with the following statement:
SELECT * FROM `subscribers` WHERE list = '27' AND custom_fields LIKE 'CV'
But with no luck, the response was:
MySQL returned an empty result set (i.e. zero rows). ( Query took 0.0003 sec )
You can see in this screenshot that at least two rows have 'CV' inside custom_fields. Within the database it's not called 'Interest Code', but that is what these values are, which is why I refer to them that way.
You need to enclose your "search string" inside some wildcards:
select * from subscribers where list=27 and custom_fields like '%CV%';
The % wildcard means "zero or more characters at this position". The "_" wildcard means "exactly one character at this position". Please read the reference manual on the topic. Also, you may want to read about regular expressions in MySQL for more complex string comparisons.
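The difference between the two patterns can be reproduced in a throwaway database. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption: SQLite's LIKE uses the same % and _ wildcards); the rows and the format of custom_fields are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscribers (name TEXT, email TEXT, list INTEGER, custom_fields TEXT)"
)
# Hypothetical sample rows; the real custom_fields format is not known
conn.executemany(
    "INSERT INTO subscribers VALUES (?, ?, ?, ?)",
    [
        ("Ann", "ann@example.com", 27, "CV, London"),
        ("Bob", "bob@example.com", 27, "Leeds, CV"),
        ("Cy", "cy@example.com", 27, "Jobs, Leeds"),
    ],
)

# Without wildcards the pattern must equal the whole column value
exact_rows = conn.execute(
    "SELECT name FROM subscribers WHERE list = 27 AND custom_fields LIKE 'CV'"
).fetchall()

# With wildcards the pattern may match anywhere inside the value
wildcard_rows = conn.execute(
    "SELECT name, email, custom_fields FROM subscribers "
    "WHERE list = 27 AND custom_fields LIKE '%CV%' ORDER BY name"
).fetchall()

print(exact_rows)
print(wildcard_rows)
conn.close()
```

The first query returns no rows, mirroring the empty result set in the question, while the second returns the two rows whose custom_fields contain CV.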

Format list of urls in mysql

I have a list of a million or so URLs in a MySQL table.
I need to cleanse the data (extract domains) so I can be confident about DISTINCT type queries.
The data comes in several different formats:
www.domain.tld
domain.tld
http://domain.tld
https://vhost.domain.tld
domain.tld/
There are invalid domains and empty data.
Ideally I'd like to do something along the lines of : -
UPDATE table1 SET domain = website REGEXP '^(https?://)?[a-zA-Z0-9\\.\\-]+(/|$|\\?)'
domain being a new empty field, website being the original url.
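You can't use REGEXP like that in MySQL as is: it only tests for a match and returns 1 or 0, not the matched text. Apparently there are some UDFs that implement regex extraction and replacement, and MySQL 8.0 and later include built-in REGEXP_SUBSTR() and REGEXP_REPLACE() functions. See: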
How to do a regular expression replace in MySQL?
https://launchpad.net/mysql-udf-regexp
http://www.mysqludf.org/lib_mysqludf_preg/
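If the cleanup can run outside the database, the extraction the question is after can be sketched in a few lines of Python. This is an illustration only, assuming the rows are read into Python and written back separately; extract_domain is a hypothetical helper, the pattern is adapted from the one in the question, and stripping a leading "www." is an added assumption.

```python
import re
from typing import Optional

# Optional scheme, then a run of host characters, ended by '/', '?' or end of string
HOST_RE = re.compile(r"^(?:https?://)?([a-zA-Z0-9.-]+)(?:/|$|\?)")


def extract_domain(website: Optional[str]) -> Optional[str]:
    """Return the host part of a URL-like value, or None for empty/invalid input."""
    if not website:
        return None
    match = HOST_RE.match(website.strip().lower())
    if match is None:
        return None
    host = match.group(1)
    # Assumption: www.domain.tld and domain.tld should count as the same domain
    return host[4:] if host.startswith("www.") else host


samples = [
    "www.domain.tld",
    "domain.tld",
    "http://domain.tld",
    "https://vhost.domain.tld",
    "domain.tld/",
    "",
    "not a url",
]
domains = [extract_domain(s) for s in samples]
distinct = sorted({d for d in domains if d is not None})
print(distinct)
```

The pattern is deliberately loose (it does not validate the TLD), so values it rejects come back as None and can be reviewed by hand.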