CSV record with a comma in it being inferred as separate fields

So I am using Rust's csv crate to read a CSV file, and I have run into an issue. I have a workaround, but it doesn't sit well with me, because it is either just a hack or there is something I don't know.
Issue: In the records themselves, some values contain a comma (and they are expected to), and my code breaks on those with CSV error: record 3 (line: 4, byte: 73): found record with 4 fields, but the previous record has 3 fields. That is accurate: the previous record is 1970, 17, "Bloody Mama" and the next one, the one it breaks at, is 1970, 73, "Hi, Mom!". As you can see, it has a comma in the third field. I have been able to work around it with the flexible flag (.flexible(true)), which essentially tells the Reader that the number of fields may vary between records; in other words, the field in question is still being split into two separate fields, and the reader just stops complaining about it. I wanted to know whether that can be avoided somehow, or whether it is something I have to live with, because I would rather the reader did not wrongly treat my value as two different values.
Here's the code for it:
let mut rdr: csv::Reader<fs::File> = csv::ReaderBuilder::new()
    .quoting(false)
    .trim(csv::Trim::All)
    .flexible(true) // With this enabled, it works/prints fine.
    .from_path(file)?;

If there are values wrapped in quotes (") in your data, why are you setting the quoting parameter to false?
According to the documentation for quoting:
Enable or disable quoting. This is enabled by default, but it may be disabled. When disabled, quotes are not treated specially.
So I think
.quoting(false)
should be changed to
.quoting(true)
Either that, or just leave it out entirely, because true/enabled is the default.
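As a minimal sketch of that (reading from an in-memory string rather than your file, and assuming the quoted field starts right after the delimiter), leaving quoting at its default keeps "Hi, Mom!" together as a single field:

use csv::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = "1970,17,\"Bloody Mama\"\n1970,73,\"Hi, Mom!\"\n";
    // No .quoting(false): quoting is enabled by default, so the comma
    // inside "Hi, Mom!" does not start a new field.
    let mut rdr = ReaderBuilder::new()
        .has_headers(false)
        .from_reader(data.as_bytes());
    for result in rdr.records() {
        let record = result?;
        println!("{} fields: {:?}", record.len(), record); // 3 fields per record
    }
    Ok(())
}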

Related

Apache Nifi: Replacing values in a column using Update Record Processor

I have a csv, which looks like this:
name,code,age
Himsara,9877,12
John,9437721,16
Razor,232,45
I have to replace the values in the code column according to some regular expressions. My logic is shown in the Scala snippet below.
if(str.trim.length == 9 && str.startsWith("369")){"PROB"}
else if(str.trim.length < 8){"SHORT"}
else if(str.trim.startsWith("94")){"LOCAL"}
else{"INT"}
I used an UpdateRecord processor to replace the data in the code column. I added a property called /code which contains the value:
${field.value:replaceFirst('^[0-9]{1,8}$','SHORT'):replaceFirst('[94]\w+','OFF_NET')}
This works for replacing codes:
- with length less than 8, with "SHORT"
- starting with 94, with "LOCAL"
I am unable to find a way to replace data in the code column when it is equal to 8 digits AND when it starts with 0. Also, how can I replace the data if it doesn't fall into any of the conditions mentioned above (the situation where the data should be replaced with INT)?
I hope you can suggest a workflow or a value to add to the property in UpdateRecord to make the above two replacements happen.
There are length and startsWith functions.
${field.value:length():lt(8):ifElse(
'SHORT', ${field.value:startsWith(94):ifElse(
'LOCAL', ${field.value:length():equals(9):and(${field.value:startsWith(369)}):ifElse(
'PROB', 'INT'
)})})}
I have put in the line breaks to make the functions easier to recognize, but they should be removed.
By the way, does the INT mean some string value to replace with? Sorry for the confusion.
Well, if you want to use regular expressions only, you can try the code below.
${field.value
:replaceFirst('[0-9]{1,8}', 'SHORT')
:replaceFirst('[94]\w+', 'OFF_NET')
:replaceFirst('369[0-9]{6}', 'PROB')
:replace(${field.value}, 'INT')
}

Univocity parser - false delimiter autodetection when too little information given

I set the parser to detect the delimiters automatically
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
I have only a single record: 47W2E2qxPs, http://usda.gov/mattis.html
What I got:
code: 47W2E2qxPshttp url: //usda.gov/mattis.html
I expected the delimiter to be , and not :, so my expected result would be 47W2E2qxPs and http://usda.gov/mattis.html.
Could I fix it in an elegant way?
Author of the library here. The detection process is a heuristic that uses statistics collected from multiple rows of part of your input. Therefore it depends a lot on the size of the input.
Its purpose is to handle situations where you can't easily determine the correct CSV format - such as when users upload random files to you. Don't use the detection process if you already know what the correct delimiter is.
In your case, one row of data is absolutely not enough to reliably detect the delimiter, especially when multiple candidate symbols are present. There is little you can do about it except test what the detected delimiter was before continuing:
parser.beginParsing(new File("/path/to/your.csv"));
CsvFormat format = parser.getDetectedFormat();
//check if the format is sane.
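char delimiter = format.getDelimiter(); // e.g. verify the guess is ',' and fall back to a known format if it is not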
The next version (2.6.0) will include more options to assist the heuristic such as providing a set of allowed characters to be used as delimiters - which will probably help in your case.

How to remove string that is no longer needed in MySQL database?

This seemed like it should be very simple to do, yet I've not been able to find an answer after weeks of looking.
I'm trying to remove strings that are no longer needed. Regex_replace sounds perfect but is not available in MySQL.
In MySQL how would I accomplish changing this:
[quote=ABC;xxxxxx]
to this:
[quote=ABC]
The issues are:
- this can appear anywhere in a text blob
- the xxxxxx can only be numeric but may be 6, 7 or 8 characters long
- not adding/removing any rows, just rewriting the contents of one column on one row at a time.
Thanks.
I don't think you really need REGEX_Replace (though it would make things easier of course).
Assuming that the example you presented is a real reflection of what you have:
Your starting point is the string [quote=<something>;, meaning that you can start by searching for [quote=,
Once you find it, you need to search for ; and after that for ],
Once you have found both, you know what to extract and where to start the next search (if the pattern you mentioned can appear more than once within a single blob).
Did I understand you correctly?
EDIT
This paradigm is aimed at converting all instances of [quote=ABC;xxxxxx] to [quote=ABC] under the following assumptions:
The pattern can appear any number of times within the input string,
The length of xxxxxx is not fixed,
The resulting string (after removing all the appearances of ;xxxxxx) should replace the value in the table,
Performance is not an issue since either this is going to be a one-time job (through the whole table) or it will run every time on a single string (e.g. before INSERTing a new record).
Some MySQL functions that will be used:
INSTR: Searches within a string for the first appearance of a sub-string and returns the position (offset) where the sub-string was found,
SUBSTR: Returns a substring from a string (several ways to use it),
CONCAT: Concatenates two or more strings.
The guidelines presented here apply for the manipulation of a single INPUT string. If this needs to be used over, say, a whole table, simply get the strings into a CURSOR and loop.
Here are the steps:
Declare five INT local variables to serve as indices and the total input string length, say l_Start, l_UpTo, l_Total_Length, l_temp1 and l_temp2, setting the initial values l_Start = 1 and l_Total_Length = LENGTH(INPUT_String),
Declare a string variable into which you will copy the "cleaned" result and initialize it as '', say l_Output_str; also declare a temporary string to hold the value of 'ABC', say l_Quote,
Start an infinite loop (you will set the exit condition within it; see below),
Exit the loop if l_Start >= l_Total_Length (this is one of the two exit points from the loop),
Find the first location of '[quote=' within the input string starting from l_Start,
If the returned value is 0 (i.e. the substring was not found), concatenate the current contents of l_Output_str with whatever remains of the input string from position l_Start (e.g. SET l_Output_str = CONCAT(l_Output_str, SUBSTR(INPUT_String, l_Start)); ) and exit the loop (the second exit point),
Search the input string for the ; symbol starting from l_Start + 7 (i.e. the length of [quote=) and save the position in l_temp1,
Search the input string for the ] symbol starting from l_Start + 7 + l_temp1 and save the position in l_temp2,
Append the found result to the output string with SET l_Output_str = CONCAT(l_Output_str, '[quote=', SUBSTR(INPUT_String, l_Start + 7, l_temp2 - l_temp1), ']');,
Set l_Start = l_Start + 7 + l_temp2 + 1;
End of loop.
Notes:
As I neither wrote the code nor tested it, it is possible that I'm not setting the indices correctly; you will need to perform detailed tests to get it working as needed;
The above IS the method I suggested;
If the input string is very long (many MBs), you might observe poor performance (i.e. it might take a few seconds to complete) because of the concatenations. There are some steps that can be taken to improve performance, but let's get this working first and then, if needed, tackle the performance issues.
Hope that the above is clear and comprehensive.
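If it helps to see that loop as ordinary code rather than prose, here is a minimal illustrative sketch of the same find/extract/concatenate logic. It is written in Rust rather than as a MySQL stored procedure, the function name strip_quote_ids is made up, and it assumes ASCII input; it is only meant to make the index arithmetic concrete.

// Illustrative only: the search-and-concatenate loop described above,
// written as a plain function instead of a MySQL stored procedure.
fn strip_quote_ids(input: &str) -> String {
    let mut output = String::new();
    let mut start = 0; // plays the role of l_Start in the steps above
    while start < input.len() {
        match input[start..].find("[quote=") {
            // No more "[quote=" tags: copy the rest and stop.
            None => {
                output.push_str(&input[start..]);
                break;
            }
            Some(rel) => {
                let tag = start + rel;
                // Copy everything up to and including "[quote=".
                output.push_str(&input[start..tag + 7]);
                let rest = &input[tag + 7..];
                match (rest.find(';'), rest.find(']')) {
                    // "[quote=NAME;12345678]": keep NAME, drop ";12345678".
                    (Some(semi), Some(close)) if semi < close => {
                        output.push_str(&rest[..semi]);
                        output.push(']');
                        start = tag + 7 + close + 1;
                    }
                    // No ";number" inside this tag: leave it as it is.
                    _ => start = tag + 7,
                }
            }
        }
    }
    output
}

fn main() {
    let cleaned = strip_quote_ids("[quote=ABC;1234567]some text[/quote]");
    assert_eq!(cleaned, "[quote=ABC]some text[/quote]");
    println!("{}", cleaned);
}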

Can I read the rest of the line after a positive value of IOSTAT?

I have a file with 13 columns and 41 lines consisting of the coefficients for the Joback method for 41 different groups. Some of the values are non-existent, though, and the table lists them as "X". I saved the table as a .csv and in my code read the file into an array. An excerpt of two lines from the .csv (the second one contains non-existent coefficients) looks like this:
48.84,11.74,0.0169,0.0074,9.0,123.34,163.16,453.0,1124.0,-31.1,0.227,-0.00032,0.000000146
X,74.6,0.0255,-0.0099,X,23.61,X,797.0,X,X,X,X,X
What I've tried is to define an array to hold each IOSTAT value, so I can tell whether an "X" was read (that is, whether IOSTAT is positive):
DO I = 1, 41
  READ(25,*,IOSTAT=ReadStatus(I,J)) (JobackCoeff(I,J), J = 1, 13)
END DO
The problem, I've found, is that if the first value of the line to be read is "X", producing a positive value of ReadStatus, then the rest of the values of that line are not read correctly.
My intent was to use the ReadStatus array to produce an error message if JobackCoeff(I,J) caused a read error, therefore pinpointing the "X"s.
Can I force the program to keep reading a line after there is a reading error? Or is there a better way of doing this?
As soon as an error occurs during execution of the input statement, processing of the input list terminates. Further, all variables specified in the input list become undefined. So the short answer to your first question is: no, there is no way to keep reading a line after a reading error.
We come, then, to the usual answer when more complicated input processing is required: read the line into a character variable and process that. I won't write complete code for you (mostly because it isn't clear exactly what is required), but when you have a character variable you may find the index intrinsic useful. With this you can locate Xs (with repeated calls on substrings to find all of them on a line).
Alternatively, if you provide an explicit format (rather than relying on list-directed (fmt=*) input) you may be able to do something with non-advancing input (advance='no' in the read statement). However, as soon as an error condition comes about then the position of the file becomes indeterminate: you'll also have to handle this. It's probably much simpler to process the line-as-a-character-variable.
An outline of the concept (without declarations or robustness) is given below.
read(iunit, '(A)') line
idx = 1
do i=1, 13
  read(line(idx:), *, iostat=iostat) x(i)
  if (iostat.gt.0) then
    ! A positive iostat means a conversion error, i.e. the X.
    print '("Column ",I0," has an X")', i
    x(i) = -HUGE(0.) ! Recall x(i) was left undefined
  end if
  idx = idx + INDEX(line(idx:), ',')
end do
An alternative, long used by many Fortran programmers, and programmers in other languages, would be to use an editor of some sort (I like sed) and modify the file by changing all the Xs to NaNs. Your compiler has to provide support for IEEE NaNs for this to work (most of the current crop in widespread use do) and they will correctly interpret NAN in the input file as a real number with value NaN.
This approach has the benefit, compared with the already accepted (and perfectly good) answer, of not requiring clever programming in Fortran to parse input lines containing mixed entries. Use an editor for string processing, use Fortran for reading numbers.

Problems caused by Spaces in Field and Control Names

Obviously, including spaces in table names, field names and control names is a bad idea (since the space is used as a separator in practically all the imperative programming languages in commercial use today*). You have to surround those controls and fields with [], etc. I'm trying to demonstrate another one of the problems with spaces to someone else.
I seem to recall that there is a situation that can arise where, because a field name has a space in it (e.g., "Foo ID") and a control based on it is also called "Foo ID", you can end up accidentally referencing the underlying field instead of the control.
e.g., you update Foo ID from empty to "hello world" and then you need to check the value for null before the record is saved; something like "me.[Foo ID]" returns Null instead of "Hello World"
How can I duplicate this unexpected behaviour?
(* - Lisp, Prolog and APL aren't imperative programming languages)
Since the control name can't have spaces in it when you reference it in code using the default controls property of the form (i.e., Me.Foo_ID), the spaces are replaced with underscores. So in your example, Me.Foo_ID would refer to the control, but Me![Foo ID] would refer to the underlying field. (Even this statement appears incorrect on further consideration: Me![Foo ID] almost certainly refers to the control named "Foo ID".)
As David Fenton rightly points out, the control itself can be named with spaces, and it can be referenced in code with spaces as Me.Controls("Foo ID") or Me![Foo ID], since those forms handle spaces properly. But if you want to use the shorthand, you'll need to add the underscore: Me.Foo_ID.
In that case Me.Foo_ID would return "Hello World" before the record is saved (while the form is Dirty), but Me![Foo ID] would return Null.
EDIT: After some testing I have not been able to actually reproduce the odd behavior you are after (using several different combinations).
Thanks to David Fenton for setting me straight (please let me know if I'm still off somewhere).