I have a comma delimited text file where one of the columns (appropriately) has text encased with double quotes. There are also many instances of double quotes within the content of this particular column. I've used the following to remove many of the double quotes, replacing them with single quotes (excluding any double quotes next to a comma).
(?<!^)(?<![,])"(?![,])(?!$)
How do I isolate/replace the double quote after [fine,] without removing the "good" double quotes?
column1,"he's doing 'fine," says Tom, but nothing specific. Blah, blah, blah", column3
Here is another example of "good" double quotes that I don't want to remove (where the first two columns are blank/empty)
,,"This is text I need",
Assuming that double quotes only occur in one column then I suggest a two-step approach. First change all double quotes in the file to single quotes, using a simple replace all. Next change the first and last single quotes back to double quotes. This can be done in one regex, replace (^[^\r\n']*)'(.*)'(^[^\r\n']*)$ with \1"\2"\3.
If single quotes occur in other columns and see should not be altered then a three-step approach can be used. Choose a character that does not occur anywhere in the text. Change all double quotes to that character, I will use ! as an example. As above, change the first and last ! to double quotes. This can be done in one regex, replace (^[^\r\n']*)!(.*)!(^[^\r\n']*)$ with \1"\2"\3. Finally change all the ! to single quotes. If you cannot find an unused character then you can use a longer string that is not in the file instead, perhaps something like _<<abc>>_ instead of the !.
Struggled with this a bit, but based on your question, there might be a possible solution. If you only have one column which has unescaped quotes or commas, you might be able to count the commas in front of that column and the commas after that column then strip all the quotes and commas between them. If you have multiple columns with unescaped characters, this might be harder.
Not familiar with Notepad++, but reading other answers I assume there is a way to use regex. If so, you can use this one:
(?<!^|",)"(?!,"|$)
Related
I need to be able to replace a single character as long as it's not a specific combination of other characters. For example, if I find a single ' in my TEXT value, I need to escape it; but if it's already escaped (\') I need to ignore it.
There could also be more complex variations such as \"\ that I simply want to replace with -.
I also need to see if there are \" entries, but make sure they aren't already \\" (double escaped).
I'm guessing the solution is REGEXP_REPLACE, but I don't know how to properly tell it to NOT replace a matching pattern if another pattern exists...
Thanks!
For example
SELECT "hello"'jacky'"hi" as value from dual;
the result is hellojackyhi
But it confuse me that the use of this pattern "a"+'b'+"c".
What does this pattern exactly mean?It's a use of double quotes and single quote, is it mean I can always combine 3 string using this pattern "a"+'b'+"c"?
MySQL has a "feature" where it concatenates strings that are adjacent (and separated by a space):
select 'a' 'b' 'c'
---> abc
This works for both single quotes and double quotes. Of course, double quotes might also be a column name.
So, this is a short-cut for string concatenation. I strongly recommend that you use CONCAT() instead, so the intention is clear.
I am using the following code to export my data frame to csv:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
Note that I use delimiter="\t", as I don't want to add additional quotation marks around each field. However, when I checked the output csv file, there are still some fields which are enclosed by quotation marks. e.g.
abcdABCDAAbbcd ....
1234_3456ABCD ...
"-12345678AbCd" ...
It seems that the quotation mark appears when the leading character of a field is "-". Why is this happening and is there a way to avoid this? Thanks!
You don't use all the options provided by the CSV writer. It has quoteMode parameter which takes one of the four values (descriptions from the org.apache.commons.csv documentation:
ALL - quotes all fields
MINIMAL (default) - quotes fields which contain special characters such as a delimiter, quotes character or any of the characters in line separator
NON_NUMERIC - quotes all non-numeric fields
NONE - never quotes fields
If want to avoid quoting the last options looks a good choice, doesn't it?
Is there any way to perform a SQL injection when single quotes are escaped by two single quotes? I know the MySQL server is using this specific technique to prevent against an attack. I'm trying to log in as a specific user but all of the common injections I've tried for the password have not worked successfully (i.e. ' or '1'='1, ' or ' 1=1, etc.).
No, and yes.
There's no way to have an unsafe values "breakout" of literal values that are enclosed in single quotes, if the value being supplied is "escaped" by preceding single quotes by with an additional single quote.
That is, assuming that your statement is guaranteeing that string literals are enclosed in quotes, as part of the "static" SQL text.
example perl-ish/php-ish
$sql = "... WHERE t.foo = '" . $safe_value . "' ... ";
^ ^
I've underscored here that the single quotes enclosing the literal are part of the SQL text. If $safe_value has been "escaped" by preceding each single quote in the "unsafe" value with another single value to make it "safe"...
$unsafe_value $safe_value
------------- ------------
I'm going I''m going
'she''s' ''she''''s''
1'='1 -- 1''=''1 --
As long as the escaping is handled properly, that we guarantee that potentially unsafe values are are run through the escaping, then including single quotes in data values is not a viable way to "breakout" of a literal with the SQL text.
That's the "no" part of the answer.
The "yes" part of the answer.
One of the biggest problems is making sure this is done EVERYWHERE, and that a mistake has not been made somewhere, assuming that a potentially unsafe string is "safe", and is not escaped. (For example, assuming that values pulled from a database table are "safe", and not escaping them before including them in SQL text.)
Also, the single quote trick is not the only avenue for SQL injection. The code could still be vulnerable.
Firstly, if we're not careful about other parts of the statement, like the single quotes enclosing string literals. Or, if for example, the code were to run the $sql through some other function, before it gets submitted to the database:
$sql = some_other_function($sql);
The return from some_other_function could potentially return SQL text that was in fact vulnerable. (As a ridiculous example, some_other_function might replace all occurrences of two consecutive single quotes with a single single quote. DOH!)
Also, with the vast number of possible unicode characters, if we're ever running through a characterset translation, there's also a possibility that some unicode character could get mapped to a single quote character. I don't have any specific example of that, but dollars to donuts that somewhere, in that plethora of multibyte encodings, there's some unicode character somewhere that will get translated to a single quote in some target.
There's a default character in the target for unmapped characters in the source, and that's usually a question mark (or a white question mark in a black diamond.) It would be a huge problem if the default character in the target (for unmapped characters in the source) was a single quote.
Bottom line: escaping unsafe strings by replacing single quotes with two single quotes goes a long ways towards mediating (mitigating?) SQL injection vulnerabilities. But in and of itself, it doesn't guarantee that code is not vulnerable in some other way.
if the input accepts unicode and is implicitly converted to ascii in the database (not as uncommon as it sounds) then an attacker can simply substitute ʻ or ʼ (0x02BB or 0x02BC) in place of single tick to get around the escaping mechanism and the implicit conversion will map those characters to single ticks (at least that's the case in SQL Server)
From the CSV spec (RFC 4180), Spaces are considered part of a field and should not be ignored. Obviously if the field contains double quotes it should retain the spaces inside the quotes.
My question is, what about spaces outside of the double quotes? The only way I can see this happening is if the tool that generated the CSV didn't do it properly.
Example: one, "two" ,three
Should the space before and after "two" be included?
That cell is invalid - to properly code that row it should be:
one," ""two"" ",three
Double quotes must also be escaped (as double-double quote) since they are used as the escape sequence. If you don't want to preserve the quotes around two, technically there are two things invalid about the row - (1) the spaces before and after the quotes and (2) the fact that there are quotes around the cell but nothing to be escaped. CSV demands that there can only be quotes around the cell if there are commas or quotes inside the content of the cell.
If I were in your case, I would err on the side of leniency.
I dealt with this using BULK INSERT and BCP format files, which is tricky to account for the quote and comma delimiting. In the event that there could be variation, say with a , " delimiter We used the lowest common delimiter, so the comma in your example, then stripped out what wasn't needed like all the double quotes.
But it could also be that your source data was only comma delimited and this was the actual contents of that field. Either way, I would toss out the quotes when loading the field, in whatever method was appropriate.