I am using opencsv 2.3 and it does not appear to handle escape characters the way I expect. I need to be able to handle an escaped separator in a CSV file that does not use quoting characters.
Sample test code:
CSVReader reader = new CSVReader(new FileReader("D:/Temp/test.csv"), ',', '"', '\\');
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
for (String string : nextLine) {
System.out.println("Field [" + string + "].");
}
}
and the csv file:
first field,second\,field
and the output:
Field [first field].
Field [second].
Field [field].
Note that if I change the csv to
first field,"second\,field"
then I get the output I am after:
Field [first field].
Field [second,field].
However, in my case I do not have the option of modifying the source CSV.
Unfortunately it looks like opencsv does not support escaping of separator characters unless they're in quotes. The following method (taken from opencsv's source) is called when an escape character is encountered.
protected boolean isNextCharacterEscapable(String nextLine, boolean inQuotes, int i) {
return inQuotes // we are in quotes, therefore there can be escaped quotes in here.
&& nextLine.length() > (i + 1) // there is indeed another character to check.
&& (nextLine.charAt(i + 1) == quotechar || nextLine.charAt(i + 1) == this.escape);
}
As you can see, this method only returns true when the parser is inside quotes and the character following the escape character is a quote character or another escape character. You could patch the library to change this, but in its current form it won't let you do what you're trying to do.
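If patching the library is not an option, one workaround is to split such lines by hand. The following is only a minimal sketch under the assumption that the file never uses quoting and that an escape character always protects exactly the next character; `EscapedSplit` and `split` are hypothetical names, not part of opencsv.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of opencsv.
final class EscapedSplit {

    // Splits an unquoted line on `separator`; `escape` followed by any
    // character yields that character literally (including the separator).
    static List<String> split(String line, char separator, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                // Keep the escaped character and skip over it.
                current.append(line.charAt(++i));
            } else if (c == separator) {
                // Unescaped separator: the current field is complete.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        for (String field : split("first field,second\\,field", ',', '\\')) {
            System.out.println("Field [" + field + "].");
        }
    }
}
```

With the sample line from the question, this prints `Field [first field].` and `Field [second,field].`. It does not handle quoted fields or multi-line records, so it is only suitable when the input really is quote-free.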
I just want to format a decimal number for output to a simple CSV formatted file.
I feel like I'm stupid, but I can't find a way to do it without leading zeroes or spaces. Of course I can simply trim the leading spaces, but there has to be a proper way to just format it like that, isn't there?
Example
define variable test as decimal.
define variable testString as character.
test = 12.3456.
testString = string(test, '>>>>>9.99').
message '"' + testString + '"' view-as alert-box. /* " 12.35" */
I tried using >>>>>9.99 and zzzzz9.99 for the number format, but both format the string with leading spaces. I actually have no idea what the difference is between using > and z.
The SUBSTITUTE() function will do what you describe wanting:
define variable c as character no-undo.
c = substitute( "&1", 1.23 ).
display "[" + c + "]".
(Toss in a TRUNCATE( 1.2345, 2 ) if you really only want 2 decimal places.)
Actually, this also works:
string( truncate( 1.2345, 2 )).
If you are creating a CSV file you might want to think about using EXPORT. EXPORT format removes leading spaces and omits decorations like ",". The SUBSTITUTE() function basically uses EXPORT format to make its substitutions. The STRING() function uses EXPORT format when no other format is specified.
The EXPORT statement will format your data for you. Here is an example:
DEFINE VARIABLE test AS DECIMAL NO-UNDO.
DEFINE VARIABLE testRound AS DECIMAL NO-UNDO.
DEFINE VARIABLE testString AS CHARACTER NO-UNDO.
test = 12.3456.
testRound = ROUND(test, 2).
testString = STRING(test).
OUTPUT TO VALUE("test.csv").
EXPORT DELIMITER "," test testRound testString.
OUTPUT CLOSE.
Here is the output:
12.3456,12.35,"12.3456"
The EXPORT statement's default delimiter is a space so you have to specify a comma for your CSV file. Since the test and testRound variables are decimals, they are not in quotes in the output. testString is character so it is in quotes.
I have a set of strings that's JSONish, but totally JSON uncompliant. It's also kind of CSV, but values themselves sometimes have commas.
The strings look like this:
ATTRIBUTE: Value of this attribute, ATTRIBUTE2: Another value, but this one has a comma in it, ATTRIBUTE3:, another value...
The only two patterns I can see that would mostly work are that the attribute names are in caps and followed by a : and space. After the first attribute, the pattern is , name-in-caps : space.
The data is stored in Redshift, so I was going to see if I can use regex to resolve this, but my regex knowledge is limited. Where would I start?
If not, I'll resort to python hacking.
What you're describing would be something like:
^([A-Z\d]+?): (.*?), ([A-Z\d]+?): (.*?), ([A-Z\d]+?): (.*)$
Note that this assumes your third attribute value doesn't really start with a comma, and that your attribute names can contain numbers.
If we take this apart:
[A-Z\d] Capital letters and numbers
+?: As many as needed, up to the first :
(.*?), a space, then as many characters as needed up to a comma and a space
^ and $ The beginning and the end of the string, respectively
And the rest is a repetition of that pattern.
The ( ) are just meant to identify your capture sections, in this case, they don't impact directly the match.
Here's a working example
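For readers who want to try the pattern outside Redshift, here is a small Java sketch. It assumes exactly three attributes, as the regex above does, and uses a sample string in which the third value does not start with a comma; the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the three-attribute pattern from the answer.
final class AttributeRegexDemo {

    private static final Pattern THREE_ATTRIBUTES = Pattern.compile(
        "^([A-Z\\d]+?): (.*?), ([A-Z\\d]+?): (.*?), ([A-Z\\d]+?): (.*)$");

    // Returns name -> value pairs in input order, or an empty map if the
    // input does not have the expected three-attribute shape.
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = THREE_ATTRIBUTES.matcher(input);
        if (m.matches()) {
            // Groups 1, 3, 5 are names; groups 2, 4, 6 are the values.
            for (int g = 1; g <= 5; g += 2) {
                result.put(m.group(g), m.group(g + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "ATTRIBUTE: Value of this attribute, "
            + "ATTRIBUTE2: Another value, but this one has a comma in it, "
            + "ATTRIBUTE3: another value";
        parse(sample).forEach((name, value) ->
            System.out.println(name + " -> " + value));
    }
}
```

The lazy `(.*?)` groups are what let the second value keep its embedded comma: the match only stops at a comma and space that are followed by an all-caps name and a colon.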
Often regex is not the right tool to use when it seems like it is.
Read this thoughtful post for details: https://softwareengineering.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems
When a simpler scheme will do, use it! Here is one scheme that would successfully parse the structure as long as colons only occur between attributes and values, and not in them:
Code
static void Main(string[] args)
{
string data = "ATTRIBUTE: Value of this attribute,ATTRIBUTE2: Another value, but this one has a comma in it,ATTRIBUTE3:, another value,value1,ATTRIBUTE4:end of file";
Console.WriteLine();
Console.WriteLine("As an String");
Console.WriteLine();
Console.WriteLine(data);
string[] arr = data.Split(new[] { ":" }, StringSplitOptions.None);
Dictionary<string, string> attributeNameToValue = new Dictionary<string, string>();
Console.WriteLine();
Console.WriteLine("As an Array Split on ':'");
Console.WriteLine();
Console.WriteLine("{\"" + String.Join("\",\"", arr) + "\"}");
string currentAttribute = null;
string currentValue = null;
for (int i = 0; i < arr.Length; i++)
{
if (i == 0)
{
// The first element only has the first attribute name
currentAttribute = arr[i].Trim();
}
else if (i == arr.Length - 1)
{
// The last element only has the final value
attributeNameToValue[currentAttribute] = arr[i].Trim();
}
else
{
int indexOfLastComma = arr[i].LastIndexOf(",");
currentValue = arr[i].Substring(0, indexOfLastComma).Trim();
string nextAttribute = arr[i].Substring(indexOfLastComma + 1).Trim();
attributeNameToValue[currentAttribute] = currentValue;
currentAttribute = nextAttribute;
}
}
Console.WriteLine();
Console.WriteLine("As a Dictionary");
Console.WriteLine();
foreach (string key in attributeNameToValue.Keys)
{
Console.WriteLine(key + " : " + attributeNameToValue[key]);
}
}
Output:
As an String
ATTRIBUTE: Value of this attribute,ATTRIBUTE2: Another value, but this one has a comma in it,ATTRIBUTE3:, another value,value1,ATTRIBUTE4:end of file
As an Array Split on ':'
{"ATTRIBUTE"," Value of this attribute,ATTRIBUTE2"," Another value, but this one has a comma in it,ATTRIBUTE3",", another value,value1,ATTRIBUTE4","end of file"}
As a Dictionary
ATTRIBUTE : Value of this attribute
ATTRIBUTE2 : Another value, but this one has a comma in it
ATTRIBUTE3 : , another value,value1
ATTRIBUTE4 : end of file
I am inserting arbitrary binary data into a mysql database MEDIUMBLOB using the below code. I am writing the same data to a file, from the same program. I then create a file from the DB contents:
select data from table where tag=95 order by date, time into outfile "dbout";
I then compare the output written directly to the file to the output in dbout. There are escape (0x5c, '\') characters before some bytes in the dbout file (e.g. before 0x00). This garbles the output from the database. My understanding was that by using a MEDIUMBLOB and prepared statements, I could avoid this problem. Initially I was using mysql_real_escape_string with a regular INSERT, and having the problem. Nothing seems to be fixing this.
void
insertdb(int16_t *data, size_t size, size_t nmemb)
{
int16_t *fwbuf; // I have also tried this as char *fwbuf
unsigned long i;
struct tm *info;
time_t rawtime;
char dbuf[12];
char tbuf[12];
if(fwinitialized==0){
fwbuf = malloc(CHUNK_SZ);
fwinitialized = 1;
}
if(fwindex + (nmemb*size) + 1 >= CHUNK_SZ || do_exit == 1){
MYSQL_STMT *stmt = mysql_stmt_init(con);
MYSQL_BIND param[1];
time(&rawtime);
info = localtime(&rawtime);
snprintf(dbuf, sizeof(dbuf), "%d-%02d-%02d", 1900+info->tm_year, 1+info->tm_mon, info->tm_mday);
snprintf(tbuf, sizeof(tbuf), "%02d:%02d:%02d", info->tm_hour, info->tm_min, info->tm_sec);
char *tmp = "INSERT INTO %s (date, time, tag, data) VALUES ('%s', '%s', %d, ?)";
int len = strlen(tmp)+strlen(db_mon_table)+strlen(dbuf)+strlen(tbuf)+MAX_TAG_LEN+1;
char *sql = (char *) malloc(len);
int sqllen = snprintf(sql, len, tmp, db_mon_table, dbuf, tbuf, tag);
if(mysql_stmt_prepare(stmt, sql, strlen(sql)) != 0){
printf("Unable to create session: mysql_stmt_prepare()\n");
exit(1);
}
memset(param, 0, sizeof(param));
param[0].buffer_type = MYSQL_TYPE_MEDIUM_BLOB;
param[0].buffer = fwbuf;
param[0].is_unsigned = 0;
param[0].is_null = 0;
param[0].length = &fwindex;
if(mysql_stmt_bind_param(stmt, param) != 0){
printf("Unable to create session: mysql_stmt_bind_param()\n");
exit(1);
}
if(mysql_stmt_execute(stmt) != 0){
printf("Unable to execute session: mysql_stmt_execute()\n");
exit(1);
}
printf("closing\n");
mysql_stmt_close(stmt);
free(sql);
fwindex = 0;
} else {
memcpy((void *) fwbuf+fwindex, (void *) data, nmemb*size);
fwindex += (nmemb*size);
}
}
So, why the escape characters in the database? I have tried a couple of combinations of hex/unhex in the program and when creating the file from mysql. That didn't seem to help either. Isn't inserting arbitrary binary data into a database a common thing with a well-defined solution?
P.S. - Is it ok to have prepared statements that open, insert, and close like this, or are prepared statements generally for looping and inserting a bunch of data before closing?
PPS - Maybe this is important to the problem: When I try to use UNHEX like this:
select unhex(data) from table where tag=95 order by date, time into outfile "dbout";
the output is very short (less than a dozen bytes, truncated for some reason).
As a MEDIUMBLOB can contain any character (even an ASCII NUL), MySQL normally escapes the output so you can tell where fields end. You can control this using ESCAPED BY. The documentation is here. Below is an excerpt. According to the last paragraph of the excerpt, you can entirely disable escaping. I have never tried that, for the reason given in its last sentence.
FIELDS ESCAPED BY controls how to write special characters. If the FIELDS ESCAPED BY character is not empty, it is used when necessary to avoid ambiguity as a prefix that precedes following characters on output:
The FIELDS ESCAPED BY character
The FIELDS [OPTIONALLY] ENCLOSED BY character
The first character of the FIELDS TERMINATED BY and LINES TERMINATED BY values
ASCII NUL (the zero-valued byte; what is actually written following the escape character is ASCII "0", not a zero-valued byte)
The FIELDS TERMINATED BY, ENCLOSED BY, ESCAPED BY, or LINES TERMINATED BY characters must be escaped so that you can read the file back in reliably. ASCII NUL is escaped to make it easier to view with some pagers.
The resulting file does not have to conform to SQL syntax, so nothing else need be escaped.
If the FIELDS ESCAPED BY character is empty, no characters are escaped and NULL is output as NULL, not \N. It is probably not a good idea to specify an empty escape character, particularly if field values in your data contain any of the characters in the list just given.
A better strategy (if you only need one blob in the output file) is SELECT INTO ... DUMPFILE, documented on the same page, per the below:
If you use INTO DUMPFILE instead of INTO OUTFILE, MySQL writes only one row into the file, without any column or line termination and without performing any escape processing. This is useful if you want to store a BLOB value in a file.
I have a line like this in my CSV:
"Samsung U600 24"","10000003409","1","10000003427"
Quote next to 24 is used to express inches, while the quote just next to that quote closes the field. I'm reading the line with fgetcsv but the parser makes a mistake and reads the value as:
Samsung U600 24",10000003409"
I tried putting a backslash before the inches quote, but then I just get a backslash in the name:
Samsung U600 24\"
Is there a way to properly escape this in the CSV, so that the value would be Samsung U600 24" , or do I have to regex it in the processor?
Use 2 quotes:
"Samsung U600 24"""
Double quotes are not the only characters that can need care: single quotes ('), backslashes (\) and NUL (the NULL byte) can too.
Use fputcsv() to write and fgetcsv() to read, which take care of all of this.
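To illustrate the reading side of the doubled-quote rule outside PHP, here is a minimal Java sketch of a single-line parser. It is an illustration only: `QuotedCsvLine` is a hypothetical name, the separator and quote characters are hard-coded as `,` and `"`, and fields that span several lines are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-line parser illustrating the doubled-quote rule.
final class QuotedCsvLine {

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        // Two quotes in a row inside a quoted field mean one literal quote.
                        current.append('"');
                        i++;
                    } else {
                        // A single quote closes the field.
                        inQuotes = false;
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"Samsung U600 24\"\"\",\"10000003409\",\"1\",\"10000003427\"";
        for (String field : parse(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

For the corrected line, the first field comes back as `Samsung U600 24"` and the remaining three fields are the plain numbers.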
Here is a helper I have written in Java for the writing side:
public class CSVUtil {
public static String addQuote(
String pValue) {
if (pValue == null) {
return null;
} else {
if (pValue.contains("\"")) {
pValue = pValue.replace("\"", "\"\"");
}
if (pValue.contains(",")
|| pValue.contains("\n")
|| pValue.contains("'")
|| pValue.contains("\\")
|| pValue.contains("\"")) {
return "\"" + pValue + "\"";
}
}
return pValue;
}
public static void main(String[] args) {
System.out.println("ab\nc" + "|||" + CSVUtil.addQuote("ab\nc"));
System.out.println("a,bc" + "|||" + CSVUtil.addQuote("a,bc"));
System.out.println("a,\"bc" + "|||" + CSVUtil.addQuote("a,\"bc"));
System.out.println("a,\"\"bc" + "|||" + CSVUtil.addQuote("a,\"\"bc"));
System.out.println("\"a,\"\"bc\"" + "|||" + CSVUtil.addQuote("\"a,\"\"bc\""));
System.out.println("\"a,\"\"bc" + "|||" + CSVUtil.addQuote("\"a,\"\"bc"));
System.out.println("a,\"\"bc\"" + "|||" + CSVUtil.addQuote("a,\"\"bc\""));
}
}
Since no one has mentioned the way I usually do it, I'll just type this down. When there's a tricky string, I don't even bother escaping it.
What I do is just base64_encode and base64_decode, that is, encode the value to Base64 before writing the CSV line and when I want to read it, decode.
For your example assuming it's PHP:
$csvLine = [base64_encode('Samsung U600 24"'),"10000003409","1","10000003427"];
And when I want to take the value, I do the opposite.
$value = base64_decode($csvLine[0]);
I just don't like to go through the pain.
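The same idea carries over to other languages. As a rough Java sketch using the standard `java.util.Base64` class (the `Base64Field` wrapper and its method names are invented for this example, and UTF-8 is an assumption about the data):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical wrapper: Base64-encode a tricky field before writing it to CSV.
final class Base64Field {

    static String encode(String value) {
        return Base64.getEncoder().encodeToString(value.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The encoded field contains no commas or quotes, so a plain join is safe here.
        String line = String.join(",",
            encode("Samsung U600 24\""), "10000003409", "1", "10000003427");
        System.out.println(line);
        System.out.println(decode(line.split(",")[0]));
    }
}
```

The trade-off is that the encoded column is no longer human-readable, and every consumer of the file has to know which columns to decode.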
I know this is an old post, but here's how I solved it (along with converting null values to empty string) in C# using an extension method.
Create a static class with something like the following:
/// <summary>
/// Wraps value in quotes if necessary and converts nulls to empty string
/// </summary>
/// <param name="value"></param>
/// <returns>String ready for use in CSV output</returns>
public static string Q(this string value)
{
if (value == null)
{
return string.Empty;
}
if (value.Contains(",") || value.Contains("\"") || value.Contains("'") || value.Contains("\\"))
{
// Double any embedded quotes so the field is still valid once it is wrapped
return "\"" + value.Replace("\"", "\"\"") + "\"";
}
return value;
}
Then for each string you're writing to CSV, instead of:
stringBuilder.Append( WhateverVariable );
You just do:
stringBuilder.Append( WhateverVariable.Q() );
If a value contains a comma, a newline character or a double quote, then the string must be enclosed in double quotes. E.g: "Newline char in this field \n".
You can use the online tool below to escape double quotes and commas.
https://www.freeformatter.com/csv-escape.html#ad-output
For example:
select * from tablename where fields like "%string "hi" %";
Error:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'hi" "' at line 1
How do I build this query?
The information provided in this answer can lead to insecure programming practices.
The information provided here depends highly on MySQL configuration, including (but not limited to) the program version, the database client and character-encoding used.
See http://dev.mysql.com/doc/refman/5.0/en/string-literals.html
MySQL recognizes the following escape sequences.
\0 An ASCII NUL (0x00) character.
\' A single quote (“'”) character.
\" A double quote (“"”) character.
\b A backspace character.
\n A newline (linefeed) character.
\r A carriage return character.
\t A tab character.
\Z ASCII 26 (Control-Z). See note following the table.
\\ A backslash (“\”) character.
\% A “%” character. See note following the table.
\_ A “_” character. See note following the table.
So you need
select * from tablename where fields like "%string \"hi\" %";
Although as Bill Karwin notes below, using double quotes for string delimiters isn't standard SQL, so it's good practice to use single quotes. This simplifies things:
select * from tablename where fields like '%string "hi" %';
I've developed my own MySQL escape method in Java (if useful for anyone).
See class code below.
Warning: wrong if NO_BACKSLASH_ESCAPES SQL mode is enabled.
private static final HashMap<String,String> sqlTokens;
private static Pattern sqlTokenPattern;
static
{
//MySQL escape sequences: http://dev.mysql.com/doc/refman/5.1/en/string-syntax.html
String[][] search_regex_replacement = new String[][]
{
//search string search regex sql replacement regex
{ "\u0000" , "\\x00" , "\\\\0" },
{ "'" , "'" , "\\\\'" },
{ "\"" , "\"" , "\\\\\"" },
{ "\b" , "\\x08" , "\\\\b" },
{ "\n" , "\\n" , "\\\\n" },
{ "\r" , "\\r" , "\\\\r" },
{ "\t" , "\\t" , "\\\\t" },
{ "\u001A" , "\\x1A" , "\\\\Z" },
{ "\\" , "\\\\" , "\\\\\\\\" }
};
sqlTokens = new HashMap<String,String>();
String patternStr = "";
for (String[] srr : search_regex_replacement)
{
sqlTokens.put(srr[0], srr[2]);
patternStr += (patternStr.isEmpty() ? "" : "|") + srr[1];
}
sqlTokenPattern = Pattern.compile('(' + patternStr + ')');
}
public static String escape(String s)
{
Matcher matcher = sqlTokenPattern.matcher(s);
StringBuffer sb = new StringBuffer();
while(matcher.find())
{
matcher.appendReplacement(sb, sqlTokens.get(matcher.group(1)));
}
matcher.appendTail(sb);
return sb.toString();
}
You should use single-quotes for string delimiters. The single-quote is the standard SQL string delimiter, and double-quotes are identifier delimiters (so you can use special words or characters in the names of tables or columns).
In MySQL, double-quotes work (nonstandardly) as a string delimiter by default (unless you set ANSI SQL mode). If you ever use another brand of SQL database, you'll benefit from getting into the habit of using quotes standardly.
Another handy benefit of using single-quotes is that the literal double-quote characters within your string don't need to be escaped:
select * from tablename where fields like '%string "hi" %';
MySQL has the string function QUOTE(), which should solve the problem.
For strings like that, for me the most comfortable way to do it is doubling the ' or ", as explained in the MySQL manual:
There are several ways to include quote characters within a string:
A “'” inside a string quoted with “'” may be written as “''”.
A “"” inside a string quoted with “"” may be written as “""”.
Precede the quote character by an escape character (“\”).
A “'” inside a string quoted with “"” needs no special treatment and need not be doubled or escaped. In the same way, “"” inside a string quoted with “'” needs no special treatment.
It is from http://dev.mysql.com/doc/refman/5.0/en/string-literals.html.
You can use mysql_real_escape_string. mysql_real_escape_string() does not escape % and _, so you should escape MySQL wildcards (% and _) separately.
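As an illustration of escaping the wildcards separately, here is a small Java sketch. It assumes the default LIKE escape character (backslash), and `LikeEscaper` is a hypothetical helper name. It only neutralises `%`, `_` and the backslash itself inside the search term; it does not replace string quoting or parameter binding.

```java
// Hypothetical helper: makes a user-supplied term literal inside a LIKE pattern.
final class LikeEscaper {

    // Prefixes %, _ and \ with a backslash (assumed to be the LIKE escape character).
    static String escapeLikeWildcards(String term) {
        StringBuilder sb = new StringBuilder(term.length() + 8);
        for (char c : term.toCharArray()) {
            if (c == '\\' || c == '%' || c == '_') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The surrounding % signs are real wildcards; the ones in the term are not.
        String pattern = "%" + escapeLikeWildcards("100%_done") + "%";
        System.out.println(pattern);
    }
}
```

The resulting pattern would then be passed to the query as a bound parameter (or through the usual string escaping), so that a `%` typed by the user matches a literal percent sign rather than any sequence of characters.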
To test inserting double quotes in MySQL from the terminal, you can do the following:
TableName(Name,DString) - > Schema
insert into TableName values("Name","My QQDoubleQuotedStringQQ");
After inserting the value, you can update it in the database to contain double quotes or single quotes:
update TableName set DString = replace(DString, "QQ", "\"");
If you're using a variable when searching in a string, mysql_real_escape_string() is good for you. Just my suggestion:
$char = "and way's 'hihi'";
$myvar = mysql_real_escape_string($char);
select * from tablename where fields like "%string $myvar %";