I need to clean up a database where one of the columns (TOTAL_AREA) has some characters on some of the entries (not all of them)
Such as 5000㎡
I need to clean all the fields that have this entry to show only 5000
How can I do it with SQL? I looked at TRIM but couldn't find a way to select all entries that have a character after the number and then TRIM it
Any help would be appreciated
Thanks
This is pretty easy. MySQL does implicit conversion, ignoring characters after the digits. So, you can do:
select (col + 0)
For your example, this will return 5000.
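To make the "ignore everything after the digits" behaviour concrete, here is a rough Python sketch of it. The helper name leading_number is made up for illustration, and it only handles an integer prefix, so treat it as an approximation of the cast rather than MySQL's exact rules.

```python
import re

def leading_number(value: str) -> int:
    """Approximate the implicit cast: keep the leading integer, ignore the rest."""
    # Optional whitespace and sign, then as many ASCII digits as are present.
    match = re.match(r"\s*([+-]?[0-9]+)", value)
    # A string with no leading digits is treated as 0 in this sketch.
    return int(match.group(1)) if match else 0

print(leading_number("5000㎡"))
```

Only the prefix counts: a value such as '12abc34' would give 12 here, not 1234.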
Assuming you want to get rid of all characters that are not digits, you can use REGEXP_REPLACE, e.g.
create or replace table x(s string);
insert into x values
('111'),
('abc234xyz'),
('5000㎡'),
('9000㎡以上');
select s, regexp_replace(s, '[^\\d]*(\\d+)[^\\d]*', '\\1') from x;
-----------+--------------------------------------------------+
S | REGEXP_REPLACE(S, '[^\\D]*(\\D+)[^\\D]*', '\\1') |
-----------+--------------------------------------------------+
111 | 111 |
abc234xyz | 234 |
5000㎡ | 5000 |
9000㎡以上 | 9000 |
-----------+--------------------------------------------------+
What we do there is match a sequence of 0-or-more non-digit characters, followed by 1-or-more digit characters, and again 0-or-more non-digit characters, and produce only the middle sequence.
Note, that you can use a different regexp depending what characters exactly you want to keep/remove.
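If it helps to experiment with the pattern outside the database, the same substitution can be sketched in Python with the sample values from above. The function names are hypothetical, and Python's re module is not the database's regex engine, so edge cases may differ.

```python
import re

def keep_digits(s: str) -> str:
    # Same idea as the SQL: non-digits, a captured digit run, non-digits;
    # each match is replaced by the captured digits only.
    return re.sub(r"[^\d]*(\d+)[^\d]*", r"\1", s)

def strip_non_digits(s: str) -> str:
    # A simpler variant: delete every character that is not a digit.
    return re.sub(r"\D", "", s)

for sample in ["111", "abc234xyz", "5000㎡", "9000㎡以上"]:
    print(sample, "->", keep_digits(sample))
```

Both variants give the same result on these samples; which one you want depends on what should happen to strings that contain no digits at all (the first leaves them unchanged, the second returns an empty string).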
I came across an old post and tried the code with a project that I am working on, and it worked, but I am still confused as to why, could anyone here please unpack the logic behind the code here? I am specifically referring to this fiddle.
I understand substring_index, but not sure what "numbers" does, as well as the char length calculations.
Thanks in advance.
The numbers table is a way to create an ad hoc table that consists of sequential integers.
mysql> SELECT 1 n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4;
+---+
| n |
+---+
| 1 |
| 2 |
| 3 |
| 4 |
+---+
These numbers are used to extract the N'th word from the comma-separated string. It's just a guess that 4 is enough to account for the number of words in the string.
The CHAR_LENGTH() expression is a tricky way to count the words in the comma-separated string. The number of commas determines the number of words. So if you compare the length of the string to the length of that string with commas removed, it tells you the number of commas, and therefore the number of words.
mysql> set @string = 'a,b,c,d,e,f';
mysql> select char_length(@string) - char_length(replace(@string, ',', '')) + 1 as word_count;
+------------+
| word_count |
+------------+
| 6 |
+------------+
Confusing code like this is one of the many reasons it's a bad idea to store data in comma-separated strings.
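For anyone who finds the SQL hard to follow, the two pieces can be sketched in Python. The fiddle itself is not shown here, so the nested SUBSTRING_INDEX call is an assumption about what the query does, and substring_index below is a hand-written stand-in for the MySQL function, not a real library call.

```python
def word_count(s: str) -> int:
    # Length difference after removing commas = number of commas; words = commas + 1.
    return len(s) - len(s.replace(",", "")) + 1

def substring_index(s: str, delim: str, count: int) -> str:
    # Stand-in for MySQL SUBSTRING_INDEX: everything before the count-th
    # delimiter (count > 0) or after the count-th from the end (count < 0).
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

def nth_word(s: str, n: int) -> str:
    # Cut the string after the n-th word, then keep only the last word of that cut.
    return substring_index(substring_index(s, ",", n), ",", -1)

text = "a,b,c,d,e,f"
# range(1, word_count + 1) plays the role of the numbers table.
print([nth_word(text, n) for n in range(1, word_count(text) + 1)])
```

The range in the last line is the reason the numbers table has to be long enough: with only 4 numbers, the fifth and sixth words would never be extracted.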
I've this record in a Mysql table:
ADDRESS
----------------------------------
sdasd 4354 ciao 12345 sdsdsa asfds
I would like to match all chars from the beginning up to and including the first occurrence of a 5-digit word.
In this case, using REGEXP_REPLACE, I would like to remove the substring matched and return sdsdsa asfds.
What I've tried to do is this:
SELECT REGEXP_REPLACE(ADDRESS, '^.+\b\d{5}\b.','') FROM `mytable`
The regexp seems to work when I test it in this snippet, and I cannot understand why it doesn't work in MySQL.
MySQL supports POSIX regex, which doesn't support Perl-like shorthands such as \b, \d, etc.
This regex should work for you:
SELECT REGEXP_REPLACE
('sdasd 4354 ciao 12345 sdsdsa asfds', '^.+[[:<:]][0-9]{5}[[:blank:]]+', '') as val;
+--------------+
| val |
+--------------+
| sdsdsa asfds |
+--------------+
RegEx Details:
^.+: Match 1 or more of any characters at the start (greedy)
[[:<:]]: Match a word boundary (zero width)
[0-9]{5}: Match exactly 5 digits
[[:blank:]]+: Match 1 or more of whitespaces (tab or space)
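To check the pieces listed above outside the database, here is a rough Python equivalent. Python has no [[:<:]] or [[:blank:]], so \b and [ \t] are swapped in as approximations, and the function name is only for illustration.

```python
import re

def drop_through_five_digits(address: str) -> str:
    # ^.+        greedy run of any characters from the start
    # \b         word boundary (either edge in Python; [[:<:]] is start-of-word only)
    # [0-9]{5}   exactly five digits
    # [ \t]+     one or more spaces or tabs, standing in for [[:blank:]]+
    return re.sub(r"^.+\b[0-9]{5}[ \t]+", "", address)

print(drop_through_five_digits("sdasd 4354 ciao 12345 sdsdsa asfds"))
```

Because ^.+ is greedy, an address with several 5-digit words would be cut after the last one, not the first.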
I have a large table (approximately 600,000 rows).
I want rows with the value "word's" to be returned when I search for "words".
Special characters should be completely ignored.
TABLE:
| column_value |
| ------------- |
| word's |
| hello |
| world |
QUERY: select * from table where column_value like '%words%'
RESULTS:
| column_value |
| ------------- |
| word's |
I want rows that contain special characters to be returned too, with the special characters ignored during matching.
Can you please help me achieve this with a fast runtime?
You can use replace to remove the "special" character prior to matching.
SELECT *
FROM table
WHERE replace(column_value, '''', '') LIKE '%words%';
Nest the replace() calls for other characters.
Or you can try it with regular expressions.
SELECT *
FROM table
WHERE column_value REGEXP 'w[^a-zA-Z]*o[^a-zA-Z]*r[^a-zA-Z]*d[^a-zA-Z]*s';
[^a-zA-Z]* matches zero or more characters that are not ASCII letters (a to z or A to Z), so the pattern also matches your search word with any non-letters between its letters.
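Writing that pattern by hand gets tedious for longer search words, so here is a small Python sketch that builds it from the word. The helper names are hypothetical, and the matching is done with Python's re module only to show the idea; in practice you would pass the generated pattern string to the REGEXP query.

```python
import re

def loose_pattern(word: str) -> str:
    # Put "any run of non-letters" between every pair of adjacent letters.
    return "[^a-zA-Z]*".join(re.escape(ch) for ch in word)

def loose_match(word: str, value: str) -> bool:
    return re.search(loose_pattern(word), value) is not None

print(loose_pattern("words"))
for value in ["word's", "hello", "world"]:
    print(value, loose_match("words", value))
```

Note that neither this nor the nested replace() approach can use an ordinary index, so on 600,000 rows both will scan the whole table.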
Or have a look at the options that full-text search offers. Maybe that can help too.
You must add an index on your column_value.
MySQL doc
I have a table in which I added english dictionary words. Now I have some records that seem to be duplicates but the length of the string differs.
for example 'aaron' is repeated twice in my table, but when I use this query:
select id, word, char_length(word) from my_table;
I get the following back:
id | word | char_length
7 | aaron | 5
12 | aaron | 6
How can the char_length change for the same word? What can I do to remove one word which exceeds length by 1?
It's likely @Vatev is on the right track with his comment.
Try these two queries:
1. SELECT * FROM my_table WHERE word = 'aaron';
2. SELECT * FROM my_table WHERE word like '%aaron%';
The first will only match rows where word is exactly aaron while the second will match rows which contain aaron anywhere. If there are rows with extra content, like whitespace, they would show up in the second, but not the first.
One possible way to clean up these duplicates would be to run the following:
DELETE FROM my_table WHERE TRIM(word) != word;
But don't run that blindly - it will delete all rows with extra whitespace, even if there isn't a matching "correct" entry.
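To show what "not blindly" could look like, here is a Python sketch of the safer rule: only flag a padded row when a clean copy of the same word also exists. The trailing space on row 12 is an assumption (the actual extra character in the table is unknown), and padded_duplicates is a made-up helper.

```python
# Hypothetical rows mirroring the question: id 12 carries one extra trailing space.
rows = [(7, "aaron"), (12, "aaron ")]

def padded_duplicates(rows):
    # Words that are already clean (no leading or trailing spaces).
    clean = {word for _, word in rows if word == word.strip(" ")}
    # Flag padded rows only if their trimmed form exists as a clean row.
    return [row_id for row_id, word in rows
            if word != word.strip(" ") and word.strip(" ") in clean]

print([len(word) for _, word in rows])
print(padded_duplicates(rows))
```

A padded row without a clean twin is left alone here; for those you would probably want an UPDATE that trims the value instead of a DELETE.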
I've got a database with UTF-8 characters in it, which are improperly displayed. I figured that I could use UNHEX(HEX(column)) != column condition to know what fields have UTF-8 characters in them. The results are rather interesting:
id | content | HEX(content) | UNHEX(HEX(content)) LIKE '%c299%' | UNHEX(HEX(content)) LIKE '%FFF%' | UNHEX(HEX(content))
49829102 | | C299 | 0 | 0 | c299
874625485 | FFF | 464646 | 0 | 1 | FFF
How is this possible and, possibly, how can I find the row with this character in it?
-- edit(2): since my edit has been removed (probably when JamWaffles was fixing my beautiful data table), here it is again: as editor strips out UTF-8 characters, the content in first row is \uc299 (if that's not clear ;) )
-- edit(3): I've figured out what the issue is - the actual representation of UNHEX(HEX(content)) is WRONG - to display my multibyte character I had to do the following: SELECT UNHEX(SUBSTR(HEX(content),1)). Sadly UNHEX(C299) doesn't work as UNHEX(C2)+UNHEX(99) so it's back to the drawing board.
There are two ways to determine if a string contains UTF-8 specific characters. The first is to see if the string has values outside the ASCII character set:
SELECT _utf8 'amńbcd' REGEXP '[^[.NUL.]-[.DEL.]]';
The second is to compare the binary and character lengths:
SELECT LENGTH(_utf8 'amńbcd') <> CHAR_LENGTH(_utf8 'amńbcd');
Both return TRUE.
See http://sqlfiddle.com/#!2/d41d8/9811
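The same two checks can be sketched in Python, which may help to see why they work. The function names are made up for illustration, and this assumes the text is stored as UTF-8.

```python
def has_non_ascii(text: str) -> bool:
    # First check: any character outside the ASCII range (NUL to DEL).
    return not text.isascii()

def has_multibyte(text: str) -> bool:
    # Second check: the byte length differs from the character length.
    return len(text.encode("utf-8")) != len(text)

sample = "amńbcd"
print(has_non_ascii(sample), has_multibyte(sample))

# The bytes C2 99 from the question form a single two-byte UTF-8 character.
decoded = bytes.fromhex("C299").decode("utf-8")
print(len(decoded), has_multibyte(decoded))
```

This also hints at the problem in the question: C2 99 is one character made of two bytes, so searching for the text 'c299' inside the content cannot find it.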