H2 DB CSVREAD does not translate line breaks - csv

I would like to use the H2 database embedded in my application to create an in-memory database populated from a CSV file, so I have used the CSVREAD function.
Everything works well except for one pesky problem: it doesn't seem to recognize line breaks. It translates \n literally as the two characters \ and n.
The docs say that the default escape character is the quotation mark, but if I try to use the quotation mark to escape anything other than another quotation mark, it simply ends the record there.
Is it possible to put text with line breaks into a CSV file and have H2's CSVREAD interpret it correctly somehow? Thanks!

After much experimentation, I found that putting literal line breaks in the middle of the quoted text is interpreted correctly.
For example, if you want a record with multiple lines, try this in your csv:
"column 1", "column 2", "column 3 multi-line
still part of column 3
yet still part of column 3"
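
For anyone wanting to reproduce this, here is a rough sketch of loading such a file from an embedded H2 connection. The file name data.csv, the column names and the table name are just placeholders, and the H2 driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CsvReadDemo {
    public static void main(String[] args) throws Exception {
        // In-memory H2 database; data.csv is assumed to contain the quoted
        // multi-line record shown above (no header row, hence the column list).
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE t AS SELECT * FROM CSVREAD('data.csv', 'C1,C2,C3')");
            try (ResultSet rs = st.executeQuery("SELECT C3 FROM t")) {
                while (rs.next()) {
                    // With a file like the one above, the line breaks survive.
                    System.out.println(rs.getString("C3"));
                }
            }
        }
    }
}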

Godot/gdscript strings with escape characters from database

I'm making a dialogue system in gdscript and am struggling with escape characters, specifically '\n'.
I'm using CastleDB because, although not perfect, it lets me keep almost everything stored as data and lets the person doing the writing for the game do everything outside the engine, without me having to copy and paste stuff in.
I've hit a stumbling block with escape characters. A single text entry in CastleDB doesn't support spaces, and '\n' within the string prints as '\n', not a space, in the dialog box.
I've tried using the format string function with 'some text here {space} some more text', with {space} referencing a string consisting of just \n. This still prints \n. If I feed a constant string with \n in the middle directly into the function that displays the dialog text, it adds a space, so I'm not really sure what is going on here.
I don't have a computer science background (I've done some C up until pointers, at which point I decided to return later).
Is there something going on in the background with my string in gdscript? It prints out just like you would expect a string to, apart from ignoring my escape characters.
Could it be something to do with the fact that it comes in as JSON? As far as I'm aware, even if a string is chopped up and reassembled, it should still just behave like a string...?!
Anyway, I haven't included any code because I don't know what code you'd need to see. I'm hoping it's something simple that, because I'm teaching myself as I go, I just wasn't aware of, but I can post code if it helps.
Thanks,
James
Escape sequences are a way of getting around issues with syntax. When you type a string in most programming languages, it starts with " and ends with another ". And it needs to stay on one line. Simple, right?
What if you want to put an actual " in your string? Or a new line? We need some way of telling the compiler, "hey, we want to insert a newline here, even though we can't use an actual newline character". So we use a \ to let the compiler know that the next character is part of an escape sequence.
This causes another problem: What if we literally want to put a backslash in a string? That's where the double backslash comes from: \\ is the escape sequence for \, since \ by itself has a special meaning.
But CastleDB (apparently, I'm not familiar with it) doesn't recognize escape sequences, so when you type \n it thinks you literally want \ followed by n. When it converts this to JSON, it inserts the \\ because JSON does recognize escape sequences.
GDScript also recognizes escape sequences, so print("Hello\nworld!") prints
Hello
world!
You could try input_string.replace("\\n", "\n") to replace the \n escape sequences.
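Not GDScript, but the same idea sketched in Java purely to illustrate the layers involved; the string literal below stands in for whatever comes back after JSON decoding:

public class EscapeDemo {
    public static void main(String[] args) {
        // What the JSON decoding hands over: a literal backslash followed by 'n',
        // because the source stored "\\n" rather than a real newline.
        String fromJson = "Hello\\nworld";
        System.out.println(fromJson);                    // prints: Hello\nworld

        // Turn the two-character sequence into an actual newline.
        String display = fromJson.replace("\\n", "\n");
        System.out.println(display);                     // prints Hello and world on two lines
    }
}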
I've solved this by looking at the way CastleDB data is stored on the project's github page.
For some reason "\n" was stored as "\\n" behind the scenes. Now that I know why it was printing weirdly I can change it, even though it feels like a messy solution!
To add even more weirdness to this whole backslash business, Stack Overflow displays a double backslash as a single backslash, so I have to write \ \ \n minus the spaces to get \\n...
I'm sure there must be a reason, but it eludes me.

Remove Speech Marks from file where Text Qualifier is "

Introduction
We have a pretty standard way of importing .txt and .csv into our data warehouse using SSIS.
Our txt/csvs are produced with speech marks as text qualifiers, so a typical file may look like this:
"0001","025",1,"01/01/19","28/12/18",4,"ST","SMITH,JOHN","15/01/19"
"0002","807",1,"01/01/19","29/12/18",3,"ST","JONES,JOY","06/02/19"
"0003","160",1,"01/01/19","29/12/18",3,"ST","LEWIS,HANNAH","18/01/19"
We have set all our SSIS packages to strip out the speech marks by setting Text Qualifier = "
Problem
However, as some of our data entry is done manually, speech marks are sometimes used, particularly in free-text fields such as NAME where people have nicknames/aliases. This causes errors in our SSIS loading.
An example of a problematic row would be:
"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"
My question
Is there a way to somehow strip out these problematic speech marks, i.e. the speech marks surrounding "STAN", so that the column would be treated as MOORE,STANLEY STAN?
If there was a way within SSIS to do this, great. If not, we are open to other ideas outside of SSIS.
Solution needs to be scalable as we have hundreds of SSIS packages where this problem can occur.
I have a few suggestions:
I know Excel has a setting that says something like "Treat Consecutive Delimiters as one."
Change your delimiter to something else, like a pipe (|). You can distinguish text-qualifier quotes from quote marks that are meant to be included in the resulting value because a qualifying quote either immediately precedes or immediately follows a comma (or sits at the very start or end of the line). A quote character anywhere else is not a qualifier.
If you do not need to pass the data through any T-SQL, you might want to replace the non-qualifier quotes with single quotes or, depending on the final output, maybe the HTML entity (&quot;) instead.
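If you end up pre-processing the files before SSIS picks them up, that rule is easy to automate. Here is a rough sketch in Java; the file names are placeholders, and it assumes an embedded quote never sits directly next to a comma:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class QuoteScrubber {
    public static void main(String[] args) throws IOException {
        // Placeholder paths; point these at the real input and output files.
        Path in = Paths.get("input.csv");
        Path out = Paths.get("cleaned.csv");

        // A quote only counts as a text qualifier when it sits at the start or end
        // of the line, or immediately next to a comma. Any other quote is stripped.
        String nonQualifierQuote = "(?<!^)(?<!,)\"(?!,)(?!$)";

        List<String> cleaned = Files.readAllLines(in).stream()
                .map(line -> line.replaceAll(nonQualifierQuote, ""))
                .collect(Collectors.toList());
        Files.write(out, cleaned);
    }
}

With the example row above, "MOORE,STANLEY "STAN"" becomes "MOORE,STANLEY STAN" while all the surrounding qualifiers stay intact.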
Hope this helps,
Joey

How to get rid of &nbsp; in AA

I am reading data (a combination of letters and numbers) from an Excel sheet and putting it into a text field in the target application, where the input should yield a unique item from a database.
However, there (sometimes) is a whitespace character behind the data in the Excel cell, which results in a "no data found" when it is entered into the search field in the target application. The whitespace does not seem to be a space though, since I am unable to trim it AA-internally. I guess it is a &nbsp; (or some similar HTML special character).
Edit: confirmed to be a &nbsp; by now.
Q: How can I get rid of such characters AA-internally?
Tried: Neither (a) Trim, (b) Replace " " -> "" (a normal space), nor (c) Replace "&nbsp;" -> "" works.
Workaround: I am currently checking the length of the data provided: if it is longer than 10 chars, I only take the leftmost 10 chars. This works here, since it is a business rule for the data I am working with, but I am still interested in a real solution, since there may be upcoming cases where no business rule will help me out.
AA Version: 11.3.1
Thankful for input...
Okay, since it's a non-breaking space character, you can replace it using a regex in the Replace command.
Find: \u00a0
Options: Regular Expression.
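Outside of AA, the same cleanup can be reproduced in plain Java just to show why an ordinary trim does not catch the character; the sample value here is made up:

public class NbspDemo {
    public static void main(String[] args) {
        String fromExcel = "ABC1234567\u00A0";          // made-up value with a trailing non-breaking space

        System.out.println(fromExcel.trim().length());  // 11, because trim() does not remove U+00A0

        // Strip the non-breaking space explicitly (same idea as the regex replace above).
        String cleaned = fromExcel.replace("\u00A0", "");
        System.out.println(cleaned.length());           // 10
    }
}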
Got rid of it using Replace Command with RegEx ticked:
[^a-z A-Z 0-9]

Weka and CSV files

I'm currently trying to import some data into Weka. Currently the data is in a CSV file and consists of a numerical ID and then some string data (tweets). I'm getting an error where it is reading "Wrong number of values, Read 1, expected 2 Token[EOL], line 17". I'm using quotes as my enclosure characters for the string data. I understand that something (presumably an EOL character?) is causing Weka to incorrectly separate some of the string data into multiple entries on the same line, but I'm not sure how to fix the EOL token problem.
My data set can be viewed here. The current data set is on Sheet 2:
https://docs.google.com/spreadsheets/d/1Yclu0t4ITFWn6itYBsVtkGalmP9BPaWFFP6U6jAeLMU/edit?usp=sharing
The text file itself may be found here:
https://drive.google.com/file/d/0B433FqC3TscQQkRxZklQclA3Z3M/view?usp=sharing
The error is now on the 3rd line, with the same message. The only newline character there is the one at the end of the line denoting a new entry, so I'm not sure why it's having issues.
In its datasets, Weka treats a newline character as an indication of the end of an instance. Your line 17 is actually a multi-line tweet, which confuses Weka. You can either use
a RegEx to get rid of the newline characters in every single tweet, or
clean the tweets while downloading them, so that no newline characters remain in them.
Unfortunately, Weka does not have a mechanism to get rid of this problem by itself (as far as I know).
EDIT
Okay, here are some other things that need to be fixed (according to your EDITS in the question):
Replace ' with \'
Replace ` with \`
Many tweets contain quotes inside quotes. The inside double quotes (") should be replaced by \"
If you put your tweets inside double quotes, then your header should be id, "text"
Some tweets contain two consecutive double quotes; get rid of them or replace them with \".
I cannot say exactly where, because I lost track, but I think some tweets still contain newlines in them (or at least one tweet does).
These are just a few things that I noticed. There might be more. Time will tell.
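As an illustration of the escaping steps above, a cleanup along these lines could be applied while writing each tweet out. This is only a sketch, and cleanTweet is a made-up helper name:

public class TweetCleaner {
    // Hypothetical helper applying the fixes listed above before the tweet
    // is written out as the second CSV column for Weka.
    static String cleanTweet(String tweet) {
        String t = tweet.replace("\r", " ").replace("\n", " "); // no embedded line breaks
        t = t.replace("\"", "\\\"");    // inner double quotes -> \"
        t = t.replace("'", "\\'");      // apostrophes -> \'
        t = t.replace("`", "\\`");      // grave accents -> \`
        return "\"" + t + "\"";         // wrap the whole tweet in double quotes
    }

    public static void main(String[] args) {
        System.out.println("1," + cleanTweet("She said \"hi\"\nto me"));
        // prints: 1,"She said \"hi\" to me"
    }
}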

csv parsing, quotes as characters

I have a csv file that contains fields such as this:
""#33CCFF"
I would imagine that should be the text value:
"#33CCFF
But both excel and open office calc will display:
#33CCFF"
What rule of csv am I missing?
When Excel parses the value, it does not first remove the outer quotes and then read what's in between them. Even if it did, what would it do with the remaining " in front of #? It cannot show "#33CCFF as you expect, because for it to appear like that, the double quote should have been escaped by doubling it. (That might be the 'csv' rule you are missing.)
Excel reads the value from left to right, interpreting "" as a single ". It then reads on, finds an unexpected double quote at the end, 'panics', and simply displays it.
The person/application creating the csv file made the mistake of adding encapsulation without escaping the encapsulation character within the values, so now your data is malformed. When using encapsulation, the value should be """#33CCFF", and when not using encapsulation, it should be "#33CCFF.
This might clarify some things
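To spell the rule out: a writer that encapsulates values has to double any embedded quote. Here is a tiny sketch of that (quoteCsv is just an illustrative name):

public class CsvQuoting {
    // Wrap a value in quotes and escape embedded quotes by doubling them.
    static String quoteCsv(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // prints """#33CCFF" -- what the file should have contained for the value "#33CCFF
        System.out.println(quoteCsv("\"#33CCFF"));
    }
}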