select name from emp_profile;
Result:
tom#rj6.com
In the above result, how can I determine whether the value has trailing spaces or not?
RTRIM() removes trailing spaces.
If RTRIM(name) varies from name, there are trailing spaces in the field.
Related functions are LTRIM() (trims leading spaces) and TRIM() (trims both sides).
As a side note, I would recommend removing trailing spaces (and other invalid data) during input time on application level, not in the database.
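If name is a CHAR field, it will not have trailing spaces when it is read back; as far as I can ascertain, VARCHAR fields do keep trailing spaces.
An easy way to check for trailing whitespace is to compare the length of the value against the length of the value after rtrim(). If the two differ, the value has trailing whitespace.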
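The length comparison can be tried out quickly. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for the real database, the table is created on the fly, and the second row (ann#rj6.com followed by two spaces) is made-up test data, not something from the question.

```python
import sqlite3

def names_with_trailing_spaces(conn):
    """Return names whose length changes once trailing spaces are trimmed."""
    rows = conn.execute(
        "SELECT name FROM emp_profile "
        "WHERE length(name) <> length(rtrim(name))"
    ).fetchall()
    return [row[0] for row in rows]

# In-memory stand-in for the real emp_profile table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_profile (name TEXT)")
conn.executemany(
    "INSERT INTO emp_profile (name) VALUES (?)",
    [("tom#rj6.com",), ("ann#rj6.com  ",)],  # second value is hypothetical
)

flagged = names_with_trailing_spaces(conn)
print(flagged)
```

Note that length functions differ between engines. SQL Server's LEN(), for example, ignores trailing spaces, so there the comparison would have to use DATALENGTH() or a different test.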
I have a comma delimited text file where one of the columns (appropriately) has text encased with double quotes. There are also many instances of double quotes within the content of this particular column. I've used the following to remove many of the double quotes, replacing them with single quotes (excluding any double quotes next to a comma).
(?<!^)(?<![,])"(?![,])(?!$)
How do I isolate/replace the double quote after [fine,] without removing the "good" double quotes?
column1,"he's doing 'fine," says Tom, but nothing specific. Blah, blah, blah", column3
Here is another example of "good" double quotes that I don't want to remove (where the first two columns are blank/empty)
,,"This is text I need",
Assuming that double quotes only occur in one column, I suggest a two-step approach. First, change all double quotes in the file to single quotes, using a simple replace all. Next, change the first and last single quotes on each line back to double quotes. This can be done in one regex: replace ^([^\r\n']*)'(.*)'([^\r\n']*)$ with \1"\2"\3.
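Here is a rough sketch of the two-step approach in Python rather than in an editor's search-and-replace dialog. The function name is made up for illustration, and the pattern is written with the ^ anchor outside the capture groups so that the trailing group can actually match.

```python
import re

# First and last single quote on a line, with quote-free text on either side.
OUTER_QUOTES = re.compile(r"^([^\r\n']*)'(.*)'([^\r\n']*)$")

def normalize_quotes(line: str) -> str:
    """Keep only the outermost double quotes; inner ones become single quotes."""
    # Step 1: turn every double quote into a single quote.
    line = line.replace('"', "'")
    # Step 2: turn the first and last single quote back into double quotes.
    return OUTER_QUOTES.sub(r'\1"\2"\3', line)

sample = ('column1,"he\'s doing \'fine," says Tom, but nothing specific. '
          'Blah, blah, blah", column3')
print(normalize_quotes(sample))
print(normalize_quotes(',,"This is text I need",'))
```

A line with no quotes is returned unchanged. A line with a single stray quote does not match the pattern, so that quote stays a single quote.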
If single quotes occur in other columns and these should not be altered, then a three-step approach can be used. Choose a character that does not occur anywhere in the text. Change all double quotes to that character; I will use ! as an example. As above, change the first and last ! on each line to double quotes. This can be done in one regex: replace ^([^\r\n!]*)!(.*)!([^\r\n!]*)$ with \1"\2"\3. Finally, change all the remaining ! to single quotes. If you cannot find an unused character, then you can use a longer string that is not in the file instead, perhaps something like _<<abc>>_ instead of the !.
Struggled with this a bit, but based on your question, there might be a possible solution. If you only have one column which has unescaped quotes or commas, you might be able to count the commas in front of that column and the commas after that column then strip all the quotes and commas between them. If you have multiple columns with unescaped characters, this might be harder.
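The comma-counting idea could look roughly like this in Python. It is a sketch under the assumption that exactly one column is messy and that you know how many clean columns sit before and after it; the function and parameter names are invented for the example.

```python
def split_messy_row(line: str, cols_before: int, cols_after: int) -> list:
    """Split a row whose single messy column holds unescaped quotes and commas.

    Assumes the line contains at least cols_before + cols_after commas.
    """
    # Peel the clean leading columns off from the left.
    head = line.split(",", cols_before)
    leading, rest = head[:cols_before], head[cols_before]
    # Peel the clean trailing columns off from the right.
    tail = rest.rsplit(",", cols_after)
    middle, trailing = tail[0], tail[1:]
    # Strip all quotes and commas from what is left in between.
    middle = middle.replace('"', "").replace(",", "")
    return leading + [middle] + trailing

row = split_messy_row(
    'column1,"he\'s doing \'fine," says Tom, but nothing specific. '
    'Blah, blah, blah", column3',
    cols_before=1,
    cols_after=1,
)
print(row)
```

Dropping the commas changes the text of the messy column, so keep them (remove the second replace call) if the wording matters more than a clean delimiter count.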
Not familiar with Notepad++, but reading other answers I assume there is a way to use regex. If so, you can use this one:
(?<!^|",)"(?!,"|$)
I am using the following code to export my data frame to csv:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
Note that I use delimiter="\t", as I don't want to add additional quotation marks around each field. However, when I checked the output csv file, there are still some fields which are enclosed by quotation marks. e.g.
abcdABCDAAbbcd ....
1234_3456ABCD ...
"-12345678AbCd" ...
It seems that the quotation mark appears when the leading character of a field is "-". Why is this happening and is there a way to avoid this? Thanks!
You are not using all the options provided by the CSV writer. It has a quoteMode parameter, which takes one of four values (descriptions from the org.apache.commons.csv documentation):
ALL - quotes all fields
MINIMAL (default) - quotes fields which contain special characters such as a delimiter, quotes character or any of the characters in line separator
NON_NUMERIC - quotes all non-numeric fields
NONE - never quotes fields
If you want to avoid quoting, the last option looks like a good choice, doesn't it?
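To get a feel for what the four modes mean, here is a small illustration using Python's standard csv module, which has analogous quoting constants. This is not the Spark writer and it will not reproduce the quoting of the "-" field seen above; it only shows how such modes typically differ. The sample row is made up.

```python
import csv
import io

def render(row, quoting):
    """Write one tab-delimited row with the given quoting mode."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter="\t",
        quoting=quoting,
        escapechar="\\",      # needed by QUOTE_NONE if a field holds a delimiter
        lineterminator="\n",
    )
    writer.writerow(row)
    return buf.getvalue()

row = ["abcd", "-12345678AbCd", 42]  # hypothetical sample data

quoted_all = render(row, csv.QUOTE_ALL)            # like ALL
quoted_minimal = render(row, csv.QUOTE_MINIMAL)    # like MINIMAL
quoted_nonnum = render(row, csv.QUOTE_NONNUMERIC)  # like NON_NUMERIC
quoted_none = render(row, csv.QUOTE_NONE)          # like NONE

for text in (quoted_all, quoted_minimal, quoted_nonnum, quoted_none):
    print(repr(text))
```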
It happens occasionally that users erroneously enter text with a trailing space in a text column, which is hard to spot visually. This can later cause problems when this text field has to be matched against another where the trailing space is not present. Is it possible in MySQL to enforce that a text string cannot contain a certain character (a space in this case)?
Thankful for feedback!
There are a number of ways to achieve what you want:
Check the user input in your application and reject it if it contains a space. If your primary worry is the quality of the user inputs, then this is probably the best way to do this.
You can remove spaces (or just leading / trailing spaces) from the user input either in the application logic or using SQL.
If you opt for removing all spaces from the user input in SQL, then use the replace() function. If you just want to remove the leading and trailing spaces, then use the trim() function.
Using MySQL functions, a simple way is based on trim(). Comparing the length before and after trimming shows how many spaces were removed:
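On the application side, the three options above might look like the following sketch. The function names are hypothetical, and only the plain space character is handled (tabs and other whitespace are left alone), matching the question.

```python
def reject_spaces(value: str) -> str:
    """Option 1: refuse input that contains any space."""
    if " " in value:
        raise ValueError("value must not contain spaces")
    return value

def trim_spaces(value: str) -> str:
    """Option 2: drop leading and trailing spaces only."""
    return value.strip(" ")

def remove_spaces(value: str) -> str:
    """Option 3: drop every space, wherever it occurs."""
    return value.replace(" ", "")

print(repr(trim_spaces(" try with trim ")))
print(repr(remove_spaces(" try with trim ")))
```

Rejecting the input gives the user a chance to correct it, while trimming silently changes what was typed; which one fits depends on the application.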
select
  trim(' try with trim '),
  length(trim(' try with trim ')),
  length(' try with trim ')
from dual;
I have a database table with a primary key called PremiseID.
Its MySQL column definition is CHAR(10).
The data that goes into the column is always 10 digits, which is either a 9-digit number followed by a space, like '113091000 ' or a 9-digit number followed by a letter, like '113091000A'.
I've tried writing one of these values into a table in a test MySQL database table t1. It has three columns
mainid integer
parentid integer
premiseid char(10)
If I insert a row that has the following values: 1,1,'113091000 ' and try to read the row back, the '113091000 ' value is truncated, so it reads '113091000'; that is, the space is removed. If I insert a number like '113091000A', that value is retained.
How can I get the CHAR(10) field to retain the space character?
I have a programmatic way around this problem. It would be to take the len('113091000'), see that it's nine characters, and conclude that a length of 9 implies there is a space suffix for that number.
To quote from the MySQL reference:
The length of a CHAR column is fixed to the length that you declare when you create the table. The length can be any value from 0 to 255. When CHAR values are stored, they are right-padded with spaces to the specified length. When CHAR values are retrieved, trailing spaces are removed.
So there's no way around it. If you're using MySQL 5.0.3 or greater, then using VARCHAR is probably the best way to go (the overhead is only 1 extra byte):
VARCHAR values are not padded when they are stored. Handling of trailing spaces is version-dependent. As of MySQL 5.0.3, trailing spaces are retained when values are stored and retrieved, in conformance with standard SQL. Before MySQL 5.0.3, trailing spaces are removed from values when they are stored into a VARCHAR column; this means that the spaces also are absent from retrieved values.
If you're using MySQL < 5.0.3, then I think you just have to check returned lengths, or use a character other than a space.
Probably the most portable solution would be to just use CHAR and check the returned length.
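Checking the returned length and restoring the space in application code could be as small as the helper below. It is a sketch: the function name and the width constant are invented for the example, and it assumes the value comes back as a Python string.

```python
PREMISE_ID_WIDTH = 10  # matches the CHAR(10) column definition

def restore_padding(value: str, width: int = PREMISE_ID_WIDTH) -> str:
    """Right-pad a value read from a CHAR column back to its declared width."""
    if len(value) > width:
        raise ValueError(f"value longer than {width} characters: {value!r}")
    # ljust() appends spaces only when the value is shorter than width.
    return value.ljust(width)

print(repr(restore_padding("113091000")))   # nine digits, space restored
print(repr(restore_padding("113091000A")))  # already ten characters
```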
Q: How can I get the CHAR(10) field to retain the space character?
Actually, that space is retained and stored. It's the retrieval of the value that's removing the spaces. (The removal of the trailing spaces on returned values is a documented "feature".)
One option (as a workaround) is to modify your SQL query to append trailing spaces to the returned value, e.g.
SELECT RPAD(premiseid,10,' ') AS premiseid FROM t1
That will return your value as a character string with a length of 10 characters, padded with spaces if the value is shorter than 10 characters, or truncated to 10 characters if it's longer.
A standard CHAR(10) column will always have trailing spaces to pad out the string to the required length of 10 characters. As such, any deliberate trailing spaces will be blended in with the padding and, typically, stripped when the value is retrieved.
If possible, convert to a VARCHAR(10) column if you want to preserve the trailing spaces. You can do this with the ALTER TABLE statement.
Though Gordon's answer may still be right on its own, there is a solution on MySQL versions later than those mentioned there.
In your code run SET sql_mode = 'PAD_CHAR_TO_FULL_LENGTH';
With this session setting, CHAR(10) values are returned padded to their full length. VARCHAR values are not padded; they only come back with trailing spaces if those spaces were stored in the first place. If you don't need the spaces, you can always rtrim().
From the CSV spec (RFC 4180), Spaces are considered part of a field and should not be ignored. Obviously if the field contains double quotes it should retain the spaces inside the quotes.
My question is, what about spaces outside of the double quotes? The only way I can see this happening is if the tool that generated the CSV didn't do it properly.
Example: one, "two" ,three
Should the space before and after "two" be included?
That cell is invalid - to properly code that row it should be:
one," ""two"" ",three
Double quotes inside a field must also be escaped (as a doubled double quote), since the double quote is the character that encloses the field. If you don't want to preserve the quotes around two, the problem with the row is the spaces before and after the quotes: in RFC 4180 the quotes of a quoted field sit directly next to the delimiters. Note that RFC 4180 allows any field to be quoted; quotes are only required when the content contains commas, double quotes, or line breaks, so one,"two",three would be valid.
If I were in your case, I would err on the side of leniency.
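As an illustration of a lenient reader, here is how Python's standard csv module treats both forms of the row with its default settings. Other parsers may well behave differently, so take this as one data point rather than as what the spec requires. The helper names are made up.

```python
import csv
import io

def parse(line: str) -> list:
    """Parse a single CSV line with the csv module's default dialect."""
    return next(csv.reader([line]))

def encode(fields: list) -> str:
    """Write one row with the default (minimal) quoting rules."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()

# The quote is not the first character of the field, so it is kept literally.
sloppy = parse('one, "two" ,three')
# The properly encoded row decodes to the same three fields.
proper = parse('one," ""two"" ",three')
# Writing those fields back out yields the properly encoded form.
encoded = encode(["one", ' "two" ', "three"])

print(sloppy, proper, repr(encoded))
```

In other words, this particular reader keeps both the spaces and the quotes as field content, which matches the "spaces are part of the field" reading of the spec.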
I dealt with this using BULK INSERT and BCP format files, where it is tricky to account for the quote and comma delimiting. In the event that there could be variation, say with a , " delimiter, we used the lowest common delimiter (the comma in your example) and then stripped out what wasn't needed, such as all the double quotes.
But it could also be that your source data was only comma delimited and this was the actual contents of that field. Either way, I would toss out the quotes when loading the field, in whatever method was appropriate.