How do I deal with commas/tabs that are part of the data in CSV/TSV in MarkLogic

I am trying to load a CSV file that has commas as part of the data into MarkLogic using RecordLoader. The data loads, but MarkLogic treats the commas that are part of the data as delimiters. I tried escaping the commas with backslashes, but that didn't work and the data remains dirty with the backslashes. I thought about replacing the data commas with another symbol so that I could change them back to commas after loading, but I don't know whether there is a way to modify the data after it is loaded, and I would have to reposition the XML tags line by line.
How can I load a CSV/TSV file and keep the commas/tabs that are part of the data as part of the data and not as delimiters?
Thanks in advance.

RecordLoader's DelimitedDataLoader doesn't support any escaping today. If you want to add it as a patch, https://github.com/marklogic/recordloader/blob/master/src/java/com/marklogic/recordloader/xcc/DelimitedDataLoader.java#L102 is the place to start looking at the code.

Although you asked about RecordLoader, you could also use the MarkLogic Content Pump (mlcp). See Creating Documents from Delimited Text Files.
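For illustration, a delimited-text import with mlcp might look roughly like the command below (host, port, credentials and paths are placeholders; check the mlcp documentation for the exact options in your version). mlcp's delimited-text parser should treat double-quoted values as single values, so commas inside quoted fields stay part of the data:
mlcp.sh import -host localhost -port 8000 \
    -username admin -password admin \
    -input_file_path /data/my-data.csv \
    -input_file_type delimited_text \
    -delimiter ","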

Related

make FPE algorithm of DLP in GCP refrain from using a specific set of characters when encoding

I was trying to de-identify one column of a CSV file using the FPE algorithm of the DLP service in GCP. My CSV file uses a comma (,) as the delimiter, but in the encoded output for some rows a comma (,) appears in the data, which causes trouble when re-identifying. Is there any way to tell the FPE algorithm to refrain from using a specific set of characters when encoding?
https://github.com/googleapis/googleapis/blob/01d4201e2620da2084d2151522c25cf49dda9da3/google/privacy/dlp/v2/dlp.proto#L852
If you inspect it as a structured file using ByteContentItem.CSV, the transformations will apply over the cell values rather than over the raw text; this will avoid having the commas get caught up in findings.
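A rough sketch of that with the Python google-cloud-dlp client (the project ID and file name are placeholders, deidentify_config stands for the FPE transformation you already configured, and attribute names can vary between client versions):
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

with open("input.csv", "rb") as f:
    csv_bytes = f.read()

# Declaring the payload as CSV makes DLP treat it as a table, so the
# transformation is applied per cell rather than over the raw text.
item = dlp_v2.ContentItem(
    byte_item=dlp_v2.ByteContentItem(
        type_=dlp_v2.ByteContentItem.BytesType.CSV,
        data=csv_bytes,
    )
)

response = dlp.deidentify_content(
    request={
        "parent": "projects/my-project",
        "deidentify_config": deidentify_config,  # your existing FPE config
        "item": item,
    }
)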

Load a CSV onto Apache Beam where there is a comma in some of the fields

I am loading a CSV into Apache Beam, but the CSV I am loading has commas in the fields. It looks like this:
ID, Name
1, Barack Obama
2, Barry, Bonds
How can I go about fixing this issue?
This is not specific to Beam, but a general problem with CSV: it's unclear whether the line 2, Barry, Bonds should be parsed as ID="2, Barry", Name="Bonds" or as ID="2", Name="Barry, Bonds".
If you can use some context (e.g. ID is always an integer, or only one field could possibly contain commas), you could solve this by reading the file as text line by line and parsing each line into separate fields with a custom DoFn (assuming the values don't also contain embedded newlines).
Generally, non-separating commas should be inside quotes in well-formed CSV, which makes this much more tractable (e.g. it would just work with the Beam Dataframes read_csv.)
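A minimal sketch of the custom-DoFn approach, assuming ID is always an integer and only Name can contain commas (the file name and step labels are placeholders):
import apache_beam as beam

class ParseRow(beam.DoFn):
    def process(self, line):
        # Split only on the first comma: everything before it is the ID,
        # everything after it (commas included) is the Name.
        id_part, _, name = line.partition(',')
        yield {'ID': int(id_part.strip()), 'Name': name.strip()}

with beam.Pipeline() as p:
    rows = (p
            | 'Read' >> beam.io.ReadFromText('people.csv', skip_header_lines=1)
            | 'Parse' >> beam.ParDo(ParseRow()))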

Java.io.IOException: wrong number of values (WEKA CSV to ARFF)

I am currently working on a data mining project in Weka using a dataset I found myself. The only issue is that converting my file from CSV format into ARFF format is causing problems.
java.io.IOException: wrong number of values. Read 2, expected 5, Read Token[EOL], line 3
This is the error I am getting. I have browsed around online looking for similar issues and have tried removing all quotes and special characters that throw this exception. Every place I looked told me to remove special characters and I believe there are none left. The link to my dataset is here : https://docs.google.com/spreadsheets/d/1xqEe7MZE9SdKB_yvFSgWeSVYuDrq0b31Eu5oECNbGH0/edit#gid=1736568367&vpid=A1
These are the first three lines of my file; the first line holds the attribute names, and the fields are separated by commas:
Inequality Adjusted HPI Rank,Sub Region,Inequality Adjusted Life Expectancy,Inquality Adjusted Well being,Footprint
,Inequality adjusted HPI
1,1,73.1,6.9,2.5,48.2
2,6,65.17333333,5.487667631,1.390974448,45.97489063
If you open your file with a text editor, you will see that Footprint has quotes around it. Delete the quotes and you are good to go!
Weka is normally not very good at reading CSV files that include special characters, and ARFF files are normally easier to use. Therefore, in such cases, the easiest way is to convert your CSV file to an ARFF file using R (the "RWeka" and "foreign" libraries can handle this conversion).
There is also another possibility: when I was creating my CSV file, the header had a different number of elements than the rest of the data. So check the header as well!

When COPYing to Redshift, how to deal with special characters in a CSV?

I am using a COPY with ACCEPTINVCHARS to load a CSV into Amazon Redshift.
Unfortunately I get errors like
Missing newline: Unexpected character 0x69 found at location 129
However, if I try to use the ESCAPE option as well, I get the exception
CSV is not compatible with ESCAPE
What am I supposed to do in order to COPY this into Redshift? I'm fine if the chars get replaced with ? or whatever.
Ignore the header, as the header values might not be of the same data type as your fields.
Use IGNOREHEADER AS.
Refer to this forum thread for more details:
https://forums.aws.amazon.com/thread.jspa?messageID=557452
For future generations: "CSV is not compatible with ESCAPE" is probably right, but you don't actually need the CSV keyword to load CSV, so it's worth trying to remove the CSV keyword from your COPY command.
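As a rough sketch (table, bucket, and IAM role are placeholders), the command without the CSV keyword but with ESCAPE, ACCEPTINVCHARS and IGNOREHEADER might look like:
COPY my_table
FROM 's3://my-bucket/my-data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role'
DELIMITER ','
IGNOREHEADER AS 1
ACCEPTINVCHARS
ESCAPE;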

how to use ascii character for quote in COPY in cqlsh

I am uploading data from a big .csv file into Cassandra using COPY in cqlsh.
I am using cassandra 1.2 and CQL 3.0.
However since " is part of my data I have to use some other character for uploading my data, I need to use any extended ASCII characters. I tried various approaches but fails.
The following works, but need to use an extended ascii characters for my purpose..
copy (<columnnames>) from <filename> where deleimiter='|' and quote = '"';
copy (<columnnames>) from <filename> where deleimiter='|' and quote = '~';
When I give quote='ß', I get the error below:
:"quotechar" must be an 1-character string
Please advise on how I can use an extended ASCII character for the quote parameter.
Thanks in advance
A note on the COPY documentation page suggests that for bulk loading (like in your case), the json2sstable utility should be used. You can then load the sstables to your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV. JSON will not have any problem handling all characters from ASCII table.
I had a similar problem, and inspected the source code of cqlsh (it's a python script). In my case, I was generating the csv with python, so it was a matter of finding the right python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from python, it's just a matter of using the csv module with:
import csv

writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
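For example (hypothetical values), a field containing the quote character is then escaped with a backslash instead of being doubled, which matches the cqlsh defaults shown above:
writer.writerow(['1', 'a value with "quotes" and, commas'])
# should write: 1,"a value with \"quotes\" and, commas"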
Hope this helps, even if you are not using python.