Remove Speech Marks from file where Text Qualifier is " - ssis

Introduction
We have a pretty standard way of importing .txt and .csv into our data warehouse using SSIS.
Our .txt/.csv files are produced with speech marks (double quotes) as text qualifiers, so a typical file may look like this:
"0001","025",1,"01/01/19","28/12/18",4,"ST","SMITH,JOHN","15/01/19"
"0002","807",1,"01/01/19","29/12/18",3,"ST","JONES,JOY","06/02/19"
"0003","160",1,"01/01/19","29/12/18",3,"ST","LEWIS,HANNAH","18/01/19"
We have set all our SSIS packages to strip out the speech marks by setting Text Qualifier = "
Problem
However, as some of our data entry is done manually, speech marks are sometimes used - particularly in free-text fields such as NAME, where people have nicknames/aliases. This causes errors in our SSIS loading.
An example of a problematic row would be:
"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"
My question
Is there a way to somehow strip out these problematic speech marks (i.e. the speech marks surrounding "STAN"), so that the column would be treated as MOORE,STANLEY STAN?
If there was a way within SSIS to do this, great. If not, we are open to other ideas outside of SSIS.
Solution needs to be scalable as we have hundreds of SSIS packages where this problem can occur.

I have a few suggestions:
I know Excel has a setting that says something like "Treat Consecutive Delimiters as one."
Change your delimiter to something else, like a pipe (the vertical bar character above the backslash). You can distinguish qualifiers from quote marks that are meant to be included in the resulting value, because a genuine text qualifier either immediately precedes or immediately follows a comma; a quote character anywhere else is not a qualifier.
If you do not need to pass the data through any T-SQL, you might want to replace non-qualifier quotes with single quotes or, depending on the final output, maybe the HTML entity (&quot;) instead; a rough sketch of that idea follows.
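Assuming the rule above (a double quote only counts as a qualifier when it touches a comma, or the start or end of the line), a minimal pre-processing sketch in Python, run over each line before SSIS picks the file up, might look like the following. The clean_line helper is purely illustrative, not anything built into SSIS:

def clean_line(line: str) -> str:
    line = line.rstrip("\r\n")
    out = []
    for i, ch in enumerate(line):
        if ch != '"':
            out.append(ch)
            continue
        # A quote counts as a qualifier only when it touches a comma
        # (the start and end of the line behave like field boundaries).
        prev_ch = line[i - 1] if i > 0 else ","
        next_ch = line[i + 1] if i + 1 < len(line) else ","
        if prev_ch == "," or next_ch == ",":
            out.append(ch)      # genuine qualifier: leave it for SSIS to strip
        else:
            out.append("'")     # embedded quote: swap it for a single quote
    return "".join(out)

row = '"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"'
print(clean_line(row))
# "0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY 'STAN'","12/04/19"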
Hope this helps,
Joey

Related

Delimiters in BIML

I am migrating to Biml for SSIS integration of flat, Excel and CSV files, and I want to know all the possible delimiters and text qualifiers we can use, since the documentation doesn't say much about it.
Basically, there are 3 options to choose from:
The “official” ENUM
Allowed values here are:
– CRLF
– CR
– LF
– Semicolon
– Comma
– Tab
– VerticalBar
– UnitSeparator
Use the Hex Code
If you know the ASCII Code of your qualifier, you can use it starting with “x” and ending with “”.
A ” would be described by “x0022” for example.
Use the actual character (HTML encoded or escaped)
If you want to (for example) define a ” as your qualifier, you can do so. Just make sure, depending on how you use it, to either encode or escape it. When defining it as an actual Biml property, it has to be encoded.

Godot/gdscript strings with escape characters from database

I'm making a dialogue system in gdscript and am struggling with escape characters, specifically '\n'.
I'm using CastleDB as, although not perfect, it has allowed me to have almost everything stored in data and will allow the person doing the writing for the game to do everything outside the engine, without me having to copy and paste stuff in.
I've hit a stumbling block with escape characters. A single text entry in CastleDB doesn't support spaces, and '\n' within the string prints to '\n', not a space, in the dialog box.
I've tried using the format string function with 'some text here {space} some more text', with the space referencing a string consisting of just \n. This still prints \n. If I feed some constant string with \n in the middle directly into the function which displays the dialog text, it adds a space so I'm not really sure what is going on here.
I don't have a computer science background (I've done some C up until pointers, at which point I decided to return later).
Is there something going on in the background with my string in gdscript? It prints out just like you would expect a string to, apart from ignoring my escape characters.
Could it be something to do with the fact that it comes in as a JSON? As far as I'm aware, even if a string is chopped up and reassembled, it should still just behave like a string...?!
Anyway, I haven't included any code because I don't know what code you'd need to see. I'm hoping it's something simple that because I'm teaching myself as I go I just wasn't aware of, but can post code if it helps.
Thanks,
James
Escape sequences are a way of getting around issues with syntax. When you type a string in most programming languages, it starts with " and ends with another ". And it needs to stay on one line. Simple, right?
What if you want to put an actual " in your string? Or a new line? We need some way of telling the compiler, "hey, we want to insert a newline here, even though we can't use an actual newline character". So we use a \ to let the compiler know that the next character is part of an escape sequence.
This causes another problem: What if we literally want to put a backslash in a string? That's where the double backslash comes from: \\ is the escape sequence for \, since \ by itself has a special meaning.
But CastleDB (apparently, I'm not familiar with it) doesn't recognize escape sequences, so when you type \n it thinks you literally want \ followed by n. When it converts this to JSON, it inserts the \\ because JSON does recognize escape sequences.
GDScript also recognizes escape sequences, so print("Hello\nworld!") prints
Hello
world!
You could try input_string.replace("\\n", "\n") to replace the \n escape sequences.
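Not GDScript, but here is a small sketch of what is probably happening, written in Python because its JSON handling behaves the same way, and assuming a literal backslash followed by n was typed into CastleDB:

import json

typed = r"Hello\nworld"              # what was typed: a real backslash, then "n"
print(typed)                         # Hello\nworld  (no line break)

as_json = json.dumps({"text": typed})
print(as_json)                       # {"text": "Hello\\nworld"}  - JSON escapes the backslash

loaded = json.loads(as_json)["text"]
print(loaded == typed)               # True - still backslash + n, not a newline

fixed = loaded.replace("\\n", "\n")  # the replace suggested above
print(fixed)                         # Hello
                                     # world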
I've solved this by looking at the way CastleDB data is stored on the project's github page.
For some reason "\n" was stored as "\\n" behind the scenes. Now that I know why it was printing weirdly I can change it, even though it feels like a messy solution!
To add even more weirdness to this whole backslash business, stack overflow displays a double backslash as a single backslash so I have to write \ \ \n minus the spaces to get \\n...
I'm sure there must be a reason, but it eludes me.

CSV standard regarding end of a row

I am writing a CSV parser and I want it to comply with this standard. It states:
Each record is located on a separate line, delimited by a line break (CRLF)
How should I handle rows ending with only a CR or LF character? Should I treat them as literals and pass them through to the field, interpret them as a row end, or deem the file malformed?
I guess the most flexible solution would be to accept either type of line ending, but I am trying to figure out what the standard says.
What do you think about it?
You should certainly not treat them as malformed, because there can be different line endings on Linux, Windows and Mac for example.
It's better to support them all.
Also, fields can have newlines in them as well, if they are properly quoted. So you'll need to check for that too.
For example:
123,"test on 2
lines",456
is a valid csv row.
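As a quick illustration of both points, Python's csv module is one example of a reader that takes this lenient approach: it accepts \r\n, \r or \n as record terminators and keeps line breaks that sit inside quoted fields:

import csv
import io

# Three different record terminators plus a quoted field containing a newline.
data = '123,"test on 2\nlines",456\r\n789,abc,def\r111,xyz,222\n'

for row in csv.reader(io.StringIO(data, newline="")):
    print(row)
# ['123', 'test on 2\nlines', '456']
# ['789', 'abc', 'def']
# ['111', 'xyz', '222']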

csv parsing, quotes as characters

I have a csv file that contains fields such as this:
""#33CCFF"
I would imagine that should be the text value:
"#33CCFF
But both excel and open office calc will display:
#33CCFF"
What rule of csv am I missing?
When Excel parses the value, it does not first remove the outer quotes and then read what's in between them. Even if it did, what would it do with the remaining " in front of #? It cannot show this as you expect ("#33CCFF), because for it to appear like that, the double quote should have been escaped by duplicating it. (That might be the 'csv' rule you are missing.)
Excel reads the value from left to right, interpreting "" as a single ". It then reads on, finds an unexpected double quote at the end, 'panics' and simply displays it.
The person/application creating the csv file made the mistake of adding encapsulation without escaping the encapsulation character within the values. Now your data is malformed: when using encapsulation, the value should be """#33CCFF", and when not using encapsulation, it should be "#33CCFF.
This might clarify some things
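For comparison, here is how a parser that follows the usual CSV escaping rule reads the correctly encoded value versus the malformed one from the question (sketched with Python's csv module; the behaviour mirrors what Excel shows):

import csv

# Correct encoding: the embedded quote is doubled, so the field round-trips
# with its leading quote intact.
print(next(csv.reader(['"""#33CCFF"'])))   # ['"#33CCFF']

# The malformed field from the question: "" reads as a single quote, and the
# stray quote at the end is taken literally - the same result Excel displays.
print(next(csv.reader(['""#33CCFF"'])))    # ['#33CCFF"']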

MS Access transfer text throws unparseable errors when quotes and apostrophes are together

A section of what I'm parsing is this
E,"1"".""0""1","1""/""1""1","3""4""5""6","6""5""4""3",'1"'1"'1"'1","1""1""/"
The parse always stops at '1"'1"'1"'1" right on the first quotation. Nothing after that is put into the table.
It's being imported using the transfer text macro from a txt file. I have tried using both text and memo types for the specification and it's still failing. Is there a workaround for this?
Edit: Yay, setting the text qualifier to none fixes it!
Splitting up your input string into what I believe you think the fields should be:
E,"1"".""0""1","1""/""1""1","3""4""5""6","6""5""4""3",'1"'1"'1"'1","1""1""/"
Even if the second-last field is not impossible to parse (heck, I did it and I'm not really all that clever...), I'm not surprised that Access chokes on it. In my experience with Microsoft Office apps and CSV files the rule is:
Text fields need to be enclosed in double-quotes if they contain commas or double quotes.
So, one might expect 1,O'Rourke,2 to pass, or maybe even 1,'thing,2, but 1,'abc"xyz,2? Not likely.
The workaround would be to fix the input file, e.g., by running it through a pre-processor to fix up the quoting.
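One possible sketch of such a pre-processor, building on the "text qualifier = none" discovery above: read the raw file with quote processing switched off, then let a CSV writer re-quote every field properly before Access imports it. The file names here are made up, and this assumes the quote characters really are meant to stay in the data:

import csv

with open("raw_input.txt", newline="") as src, \
     open("fixed_input.txt", "w", newline="") as dst:
    # QUOTE_NONE on the reader: split on commas only and keep quotes as data,
    # which mirrors importing with the text qualifier set to none.
    reader = csv.reader(src, quoting=csv.QUOTE_NONE)
    # QUOTE_ALL on the writer: wrap every field and double any embedded quotes.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for row in reader:
        writer.writerow(row)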