I'm trying to clean up string data. TRIM removes only leading and trailing spaces, and REPLACE(col_name, " ", "") removes every space, including the single spaces I want to keep. I need a solution that collapses each run of extra spaces into one and produces the expected output.
Sample data:
7136 South Yale #300 Tulsa,;Oklahoma
428 NW 10th St. OKC
2903 W. Britton Road OKC
Expected output:
7136 South Yale #300 Tulsa,;Oklahoma
428 NW 10th St. OKC
2903 W. Britton Road OKC
I use MySQL 5.7.
On MySQL 8+ we could simply use REGEXP_REPLACE to replace \s{2,} with a single space. On 5.7 this is a bit harder. Assuming that each address has at most one block of two or more unwanted spaces, we can use substring operations:
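SELECT address,
       -- SPACE(2) is a two-space string; it marks the start of a block of extra spaces
       CASE WHEN address LIKE CONCAT('%', SPACE(2), '%')
            THEN CONCAT(SUBSTRING(address, 1, INSTR(address, SPACE(2)) - 1), ' ',
                        LTRIM(SUBSTRING(address, INSTR(address, SPACE(2)))))
            ELSE address END AS output
FROM yourTable;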
The above logic uses an INSTR() trick to find the start of the block of two or more spaces. It generates the output address by piecing together the two substrings on either side of this block, with excess spaces removed.
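If an address can contain more than one block of extra spaces, the single-block assumption above no longer holds. A common workaround on MySQL 5.7 is a chain of three REPLACE calls. The sketch below assumes the marker characters < and > never occur in the data, and the inline sample row is made up for illustration (SPACE(n) is used only to make the extra spaces visible).

```sql
-- Step 1: turn every space into the marker pair '<>'.
-- Step 2: delete every '><', which collapses a run of pairs into a single '<>'.
-- Step 3: turn the remaining '<>' back into one space.
SELECT address,
       REPLACE(REPLACE(REPLACE(address, ' ', '<>'), '><', ''), '<>', ' ') AS output
FROM (
    -- Hypothetical sample row with two separate blocks of extra spaces
    SELECT CONCAT('428', SPACE(3), 'NW 10th', SPACE(2), 'St. OKC') AS address
) AS t;
-- output: 428 NW 10th St. OKC
```

Unlike the CASE expression, this handles any number of blocks in one pass. On MySQL 8.0+, REGEXP_REPLACE(address, ' {2,}', ' ') should give the same result without the marker assumption.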
Related
I'm quite new to regular expressions and I'm not getting what I expect when using regex in MySQL. I tested these expressions at "https://regexr.com/", which gives me the results I expect. The query below returns 3 columns:
one_or_more: I'm expecting to get 6, but I'm getting 0. Doesn't "\s+" mean one or more?
zero_or_more: I'm expecting 6, but I'm getting 7. If "\s*" means zero or more, shouldn't the match start one character earlier to include the whitespace?
zero_or_once: I'm expecting 6, but I'm getting 7. If "\s?" means zero or one, shouldn't the match start one character earlier to include the whitespace?
SELECT
# 0, 6
REGEXP_INSTR("Birch Street, Boston, MA 02131, United States", "\s+street") one_or_more,
# 7, 6
REGEXP_INSTR("Birch Street, Boston, MA 02131, United States", "\s*street") zero_or_more,
# 7, 6
REGEXP_INSTR("Birch Street, Boston, MA 02131, United States", "\s?street") zero_or_once
FROM
DUAL;
Any help is appreciated. Thank you. Paul
You need to use a double backslash (\\); with it you'll get the expected results, i.e.:
REGEXP_INSTR("Birch Street, Boston, MA 02131, United States", "\\s+street")
To use a literal instance of a special character in a regular
expression, precede it by two backslash (\) characters. The MySQL
parser interprets one of the backslashes, and the regular expression
library interprets the other.
https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax
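To see why the single-backslash patterns behave as they do, it can help to select the pattern strings themselves. This is a small sketch assuming MySQL 8.0+ (for REGEXP_INSTR), the default sql_mode (without NO_BACKSLASH_ESCAPES), and a case-insensitive default collation; the column aliases are made up.

```sql
SELECT
    '\s+street'  AS single_backslash,  -- the parser drops the backslash: s+street
    '\\s+street' AS double_backslash,  -- the regex library receives: \s+street
    -- 's*street' can match zero 's' characters, so the match starts at the 'S' of 'Street'
    REGEXP_INSTR('Birch Street, Boston, MA 02131, United States', '\s*street')  AS zero_or_more_single,
    -- '\s+street' needs at least one whitespace character, so the match starts at the space
    REGEXP_INSTR('Birch Street, Boston, MA 02131, United States', '\\s+street') AS one_or_more_double;
-- zero_or_more_single: 7, one_or_more_double: 6
```

With a single backslash the quantifier applies to the letter s, which also explains the 0 for "\s+street": the pattern s+street needs an extra s in front of "street" and finds none.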
String1 = "Widgets Inc. is the largest widgets producer in the world. It's much bigger than McWidgets Inc."
String2 = "Fidgets Inc is the second largest fidgets producer. It's just behind McFidgets Inc. The CEO of this company loves synergy."
String3 = "Glorious Gagets Co. is considered blah blah jdfglmdslgmldfg."
For all of the above scenarios, I would like to reliably select the first sentence only. [EDIT]: note that there are no real patterns in the sentences. I would use:
SUBSTRING_INDEX(string, '. ', 1)
However this would cause issues with the first and third string, as they sometimes have a '.' after the name, and sometimes not.
My thought was to use something like SUBSTRING_INDEX(string, '. [A-Z]', 1), and essentially tell it to look for the first '.' which is followed by a space and then any capital letter (i.e. the start of the next sentence), but my SQL-fu is not strong enough yet to figure out how to do that.
Any help would be appreciated!
When you have a fixed pattern, you can use LOCATE to find its index and then use SUBSTRING to cut the string there. For the starting point you need a regular expression if you don't want to use functions or stored procedures, which you would also need for more complex patterns.
CREATE TABLE table1 (tex varchar(200));
INSERT INTO table1 VALUES ("Widgets Inc. is the largest widgets producer in the world. It's much bigger than McWidgets Inc.")
,("Fidgets Inc is the largest fidgets producer in the world. It's much bigger than McFidgets Inc.");
SELECT SUBSTRING(tex,REGEXP_INSTR(tex, '[A-Z]'),LOCATE('producer in the world.',tex)+ 21) FROM table1;
| SUBSTRING(tex,REGEXP_INSTR(tex, '[A-Z]'),LOCATE('producer in the world.',tex)+ 21) |
| :--------------------------------------------------------------------------------- |
| Widgets Inc. is the largest widgets producer in the world. |
| Fidgets Inc is the largest fidgets producer in the world. |
OK, it looks like I have a workaround, in the absence of a way to actually identify sentences in the requested manner, i.e. by somehow including a capital-letter check in the substring parameter.
I found a list of abbreviations that contain a period (e.g. Co., Inc., Ltd.) and hardcoded replacements without the period (Co, Inc, Ltd), then did the substring as normal. Not ideal, but it works.
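The capital-letter check asked for in the question can be sketched with REGEXP_INSTR, assuming MySQL 8.0+ is available. The match type 'c' forces case-sensitive matching so that [A-Z] only matches capitals; the column alias s and the inline sample rows are for illustration only.

```sql
SELECT s,
       CASE WHEN REGEXP_INSTR(s, '\\. [A-Z]', 1, 1, 0, 'c') > 0
            -- The match starts at the period, so LEFT keeps everything up to and including it
            THEN LEFT(s, REGEXP_INSTR(s, '\\. [A-Z]', 1, 1, 0, 'c'))
            -- No "period, space, capital" sequence: treat the whole string as one sentence
            ELSE s END AS first_sentence
FROM (
    SELECT 'Widgets Inc. is the largest widgets producer in the world. It''s much bigger than McWidgets Inc.' AS s
    UNION ALL
    SELECT 'Glorious Gagets Co. is considered blah blah jdfglmdslgmldfg.'
) AS t;
```

This still cuts too early when an abbreviation is followed by a capitalized word (for example "Inc. The"), so the abbreviation list remains useful for such cases.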
I have a MySQL query to find 10 digit phone numbers that start with +1
SELECT blah
FROM table
WHERE phone REGEXP '^\\+[1]{1}[0-9]{10}$'
How can I filter this REGEXP further to only match certain 3-digit area codes (i.e. international 10-digit phone numbers that share the US number format)?
I tried using the IN clause, i.e. IN('+1809%','+1416%'), but ended up with a syntax error:
WHERE phone REGEXP'^\\+[1]{1}[0-9]{10}$' IN('+1809%','+1416%')
You may use a grouping construct with an alternation operator here, like
REGEXP '^\\+1(809|416)[0-9]{7}$'
             ^^^^^^^^^
Just subtract 3 from 10 to match the trailing digits. Note that in MySQL versions prior to 8.x, you cannot use non-capturing groups, you may only use capturing ones.
Also, the [1]{1} pattern is equivalent to 1: each pattern is matched exactly once by default (i.e. {1} is always redundant), and it makes little sense to use a character class [...] with a single symbol inside. A character class is meant for two or more symbols, or for avoiding the need to escape some symbols, but 1 does not have to be escaped as it is a word character, so the square brackets are redundant here.
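A quick way to check the pattern is to run it against a few literal values. The phone numbers below are made up for illustration; the query itself should also work on MySQL 5.7, since it only uses a capturing group.

```sql
SELECT phone,
       phone REGEXP '^\\+1(809|416)[0-9]{7}$' AS is_match
FROM (
    SELECT '+18095551234' AS phone       -- area code 809: is_match = 1
    UNION ALL SELECT '+14165551234'      -- area code 416: is_match = 1
    UNION ALL SELECT '+12125551234'      -- area code 212 is not in the group: is_match = 0
    UNION ALL SELECT '+1809555123'       -- only 6 digits after the area code: is_match = 0
) AS t;
```

Further area codes can be added inside the group, separated by |.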
I have 2 scenarios
Scenario 1:
abc Ins Services,
123 Pine St Fl 23
San Francisco, CA, USA
Scenario 2:
abc Ins Services,
#4567
123 Pine St Fl 23
San Francisco, CA, USA
All fields are dynamic and I used Trim in every expression, but white space still appears as shown in scenario 1. I don't want this white space.
The space that you're seeing there isn't just a space character, it's a line return. Line returns can be stored in the strings in your database as part of the address, and they are hard to see when you preview results in a program like SSMS. The Trim function only removes spaces. Line returns are usually made up of ASCII characters 13 (carriage return) and 10 (line feed). To remove them you can use the Replace function like so:
=REPLACE(REPLACE(<string to search in>, CHR(13), ""), CHR(10), "")
This allows you to add your own line returns where you actually want them.
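If the line returns are to be stripped in the dataset query rather than in the report expression, the same nested REPLACE can be sketched in SQL. This assumes SQL Server 2012 or later (suggested by the mention of SSMS, and needed for CONCAT); the column names and the sample value are made up.

```sql
-- CHAR(13) is a carriage return, CHAR(10) is a line feed
SELECT REPLACE(REPLACE(raw_value, CHAR(13), ''), CHAR(10), '') AS cleaned_value
FROM (
    -- Hypothetical field value that ends with a stored line return
    SELECT CONCAT('#4567', CHAR(13), CHAR(10)) AS raw_value
) AS t;
```

Cleaning the value in the query keeps the report expressions simpler, at the cost of changing the data for every consumer of that dataset.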
I'm trying to load a CSV file that has 21 columns and 240 rows through phpMyAdmin. The most common error message is:
"invalid column count on line 1" (using CSV import)
though when using LOAD DATA, I get:
"error: #1083 – Field separator argument is not what is expected; check the manual"
Columns separated with ,
Columns enclosed with "
Columns escaped with \
Lines terminated with auto (though I've tried \r, \n, \r\n and any combination of the 3)
I have also tried escaping the quotes and commas but it seems to not do anything.
This is the first row of the data:
Denis,NULL,Wirtz,"221Maryland Hall 3400 North Charles Street\, Baltimore\, MD 21236",,410-516-7006,410-516-5528,wirtz#jhu.edu,Theophilus Halley Smoot Professor,NULL,NULL,"K.L. Yap\, S.I. Fraley\, M.M. Thiaville\, N. Jinawath\, K. Nakayama\, J.-L. Wang\, T.-L. Wang\, D. Wirtz\, and I.-M. Shih\, ÒNAC1 is an actin-binding protein that is essential for effective cytokinesis in cancer cellsÓ\, Cancer Research 72: 4085_4096 (2012).D.H. Kim\, S.B. Khatau\, Y. Feng\, S. Walcott\, S.X. Sun\, G.D. Longmore\, and D. Wirtz\, ÒActin cap associated focal adhesions and their distinct role in cellular mechanosensingÓ\, Scientific Reports (Nature) 2:555-568 (2012).S.I. Fraley\, Y. Feng\, G.D. Longmore\, and D. Wirtz\, ""Dimensional and temporal controls of cell migration by zyxin and binding partners in three-dimensional matrix""\, Nature Communications 3:719-731 (2012)P.-H. Wu\, C.M. Hale\, J.S.H. Lee\, Y. Tseng\, and D. Wirtz\, ÒHigh-throughput ballistic injection nanorheology (htBIN) to measure cell mechanicsÓ\, Nature Protocols 7: 155_170 (2012)C.M. Hale\, W.-C. Chen\, S.B. Khatau\, B.R. Daniels\, J.S.H. Lee\, and D. Wirtz\, ÒSMRT analysis of MTOC and nuclear positioning reveals the role of EB1 and LIC1 in single-cell polarizationÓ\, Journal of Cell Science124: 4267-4285 (2011).D. Wirtz\, K. Konstantopoulos\, and P.C. Searson\, ÒPhysics of cancer: the role of physical interactions and mechanical forces in cancer metastasisÓ\, Nature Reviews Cancer 11: 512-522 (2011)",,NULL,http://www.jhu.edu/chembe/faculty-template/DenisWirtz.jpg,Department of Chemical and Biomolecular Engineering,NULL,Whiting School of Engineering,"Postdoctoral\, Physics\, Biophysics. ESPCI\, Paris. 1993 - 1994Ph.D.\, Cemical Engineering. Stanford University. 1993M.S.\, Chemical Engineering. Stanford University. 1989B.S.\, Physics Engineering. Free University of Brussels. 1983-1988",Johns_Hopkins_University
Any help is greatly appreciated.
There are backslashes in front of commas inside double-quoted strings. If your utility treats those as escaped commas, they will function as column separators, and you will get the wrong number of columns:
"221Maryland Hall 3400 North Charles Street\, Baltimore\, MD 21236"
Again, the doubled double-quotes "" are usually a way to escape a single double-quote within a string. But if the parsing reads it as a string terminator, that's another way it can throw off your column count.
I have seen Excel mess up exported data in many fascinating ways, but this one is new to me.
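For the LOAD DATA route, error #1083 is commonly raised when one of the separator arguments is not an accepted value, for example an enclosing or escape argument longer than one character. A minimal sketch with each option spelled out, assuming a hypothetical file path, a hypothetical 21-column table named faculty, Windows-style line endings, and a server that permits LOCAL loading:

```sql
LOAD DATA LOCAL INFILE '/path/to/faculty.csv'  -- hypothetical path
INTO TABLE faculty                             -- hypothetical table with 21 columns
FIELDS TERMINATED BY ','
       OPTIONALLY ENCLOSED BY '"'              -- a single character
       ESCAPED BY '\\'                         -- a single backslash, written doubled in the literal
LINES TERMINATED BY '\r\n';                    -- assumption; use '\n' for Unix-style files
```

With these settings, MySQL is documented to drop the escape character in front of a character that has no special escape meaning, so \, inside a quoted field should be read as a plain comma, and a doubled "" inside a quoted field as one double-quote.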