Make the FPE algorithm of DLP in GCP refrain from using a specific set of characters when encoding (CSV)

I was trying to de-identify one column of a CSV file using the FPE algorithm of the DLP service in GCP. My CSV file uses a comma (,) as the delimiter, but for some rows the encoded data itself contains a comma, which causes trouble when re-identifying the data. Is there any way to tell the FPE algorithm to refrain from using a specific set of characters when encoding?

https://github.com/googleapis/googleapis/blob/01d4201e2620da2084d2151522c25cf49dda9da3/google/privacy/dlp/v2/dlp.proto#L852
If you inspect it as a structured file using ByteContentItem.CSV, the transformations are applied to the individual cell values. This keeps the commas from getting caught up in findings.
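To illustrate the idea of transforming cell values rather than raw text, here is a minimal sketch using only Python's standard csv module. It does not call the DLP API: `transform_column` and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in that deliberately produces a comma, not real FPE. The point is only that when the file is parsed and written as CSV, a comma inside a cell gets quoted and does not break the columns.

```python
import csv
import io


def transform_column(csv_text, column, transform):
    """Apply `transform` to every cell of one column and return new CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = transform(row[column])
        writer.writerow(row)  # cells containing the delimiter are quoted here
    return out.getvalue()


def fake_encode(value):
    # Hypothetical stand-in for the real transformation (NOT actual FPE).
    # It emits a comma on purpose to show that quoting keeps the cell intact.
    return value[::-1] + ","


encoded = transform_column("id,name\n1,alice\n2,bob\n", "name", fake_encode)
# Parsing the result again yields the encoded cells, commas included.
rows = list(csv.DictReader(io.StringIO(encoded, newline="")))
```

Reading the output back with a CSV parser, instead of splitting lines on commas, is what makes the later re-identification step see whole cells.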

Related

Java.io.IOException: wrong number of values (WEKA CSV to ARFF)

I am currently working on a data mining project in Weka, using a dataset I found myself. The only issue is that converting my file from CSV format to ARFF format fails.
java.io.IOException: wrong number of values. Read 2, expected 5, Read Token[EOL], line 3
This is the error I am getting. I have browsed around online looking for similar issues and have tried removing all quotes and special characters that throw this exception. Every place I looked told me to remove special characters and I believe there are none left. The link to my dataset is here : https://docs.google.com/spreadsheets/d/1xqEe7MZE9SdKB_yvFSgWeSVYuDrq0b31Eu5oECNbGH0/edit#gid=1736568367&vpid=A1
These are the first three lines of my file. The first holds the attribute names, and the fields are separated by commas.
Inequality Adjusted HPI Rank,Sub Region,Inequality Adjusted Life Expectancy,Inquality Adjusted Well being,Footprint
,Inequality adjusted HPI
1,1,73.1,6.9,2.5,48.2
2,6,65.17333333,5.487667631,1.390974448,45.97489063
If you open your file with a text editor, you will see that Footprint has quotes around it. Delete the quotes and you are good to go!
Weka is generally not good at reading CSV files that include special characters, and ARFF files are usually easier to work with. In such cases, the easiest way is to convert your CSV file to an ARFF file using R (the "RWeka" and "foreign" packages can handle this conversion).
There is also another possibility. In my case, the header of the CSV file I was creating had a different number of elements than the data rows. So check the header as well!
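A quick way to check for the header/row mismatch described above is to count the fields per record before handing the file to Weka. The following is a minimal sketch in Python, not a Weka feature; `find_bad_rows` is a hypothetical helper name. It takes the header's field count as the expected value and reports every record that differs.

```python
import csv
import io


def find_bad_rows(csv_text):
    """Return (expected_count, [(line_number, field_count), ...]) for mismatched rows."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the number of source lines read so far,
            # which stays accurate even when quoted fields span lines.
            bad.append((reader.line_num, len(row)))
    return expected, bad


expected, bad = find_bad_rows("a,b,c\n1,2,3\n4,5\n")
print("expected fields:", expected, "mismatched rows:", bad)
```

If the header itself is the odd one out, almost every data row will be reported, which is a strong hint to look at the first line.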

When COPYing to Redshift, how to deal with special characters in a CSV?

I am using a COPY with ACCEPTINVCHARS to load a CSV into Amazon Redshift.
Unfortunately I get errors like
Missing newline: Unexpected character 0x69 found at location 129
However, if I try to use the ESCAPE option as well, I get the exception
CSV is not compatible with ESCAPE
What am I supposed to do in order to COPY this into Redshift? I'm fine if the chars get replaced with ? or whatever.
Skip the header row, since the header values might not match the data types of your columns.
Use IGNOREHEADER [AS] number_rows for this.
Refer to this forum thread for more details:
https://forums.aws.amazon.com/thread.jspa?messageID=557452
For future generations, "CSV is not compatible with ESCAPE" is probably right but you don't actually need the CSV keyword to load CSV, so it's worth trying to remove the CSV keyword from your copy command.

How do I deal with commas/tabs that are part of the data in CSV/TSV in MarkLogic

I am trying to load a CSV file that has commas as part of the data into MarkLogic using RecordLoader. The data loads, but MarkLogic treats the commas that are part of the data as delimiters. I tried to escape the commas with backslashes, but that didn't work and the data ends up dirty with the backslashes in it. I thought about replacing the data commas with other symbols so that I can change them back to commas after loading, but I don't know if there is a way to modify the data after it is loaded, and I would have to reposition the XML tags line by line.
How can I load a CSV/TSV file and keep the commas/tabs that are part of the data as part of the data and not as delimiters?
Thanks in advance.
RecordLoader's DelimitedDataLoader doesn't support any escaping today. If you want to add it as a patch, https://github.com/marklogic/recordloader/blob/master/src/java/com/marklogic/recordloader/xcc/DelimitedDataLoader.java#L102 is the place to start looking at the code.
Although you asked about RecordLoader, you could also use the MarkLogic Content Pump. See Creating Documents from Delimited Text Files.

how to use ascii character for quote in COPY in cqlsh

I am uploading data from a big .csv file into Cassandra using COPY in cqlsh.
I am using cassandra 1.2 and CQL 3.0.
However, since " is part of my data, I have to use some other character as the quote character, and I need it to be an extended ASCII character. I tried various approaches, but they all fail.
The following commands work, but I need to use an extended ASCII character for my purpose:
copy <tablename> (<columnnames>) from '<filename>' with delimiter='|' and quote = '"';
copy <tablename> (<columnnames>) from '<filename>' with delimiter='|' and quote = '~';
When I give quote='ß', I get the error below:
:"quotechar" must be an 1-character string
Please advise on how I can use an extended ASCII character for the quote parameter.
Thanks in advance
A note on the COPY documentation page suggests that for bulk loading (like in your case), the json2sstable utility should be used. You can then load the sstables to your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV. JSON will not have any problem handling all characters from ASCII table.
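As for the error in the question itself, a likely explanation (an assumption, based on cqlsh being a Python script that passes the quote character to the csv module as a byte string) is that 'ß' is one character but two bytes when the terminal uses UTF-8, so it fails the "1-character string" check. The sketch below only shows the byte lengths; `byte_length` is a hypothetical helper name.

```python
def byte_length(ch, encoding):
    """Number of bytes a single character occupies in the given encoding."""
    return len(ch.encode(encoding))


for ch in ["~", "ß"]:
    print(repr(ch),
          "UTF-8 bytes:", byte_length(ch, "utf-8"),
          "Latin-1 bytes:", byte_length(ch, "latin-1"))
```

Under that assumption, any character outside plain 7-bit ASCII would hit the same limit in a UTF-8 terminal.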
I had a similar problem, and inspected the source code of cqlsh (it's a python script). In my case, I was generating the csv with python, so it was a matter of finding the right python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from python, it's just a matter of using the csv module with:
writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
Hope this helps, even if you are not using python.
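For readers on Python 3, here is a minimal sketch of what those dialect settings do to data that contains double quotes. It assumes the same four settings quoted from cqlsh above and nothing else about cqlsh's behaviour; `write_rows` and `read_rows` are hypothetical helper names. With doublequote=False and an escapechar set, quotes inside a field are escaped with a backslash instead of being doubled.

```python
import csv
import io

# Same settings as the cqlsh defaults quoted above.
csv_dialect_defaults = dict(delimiter=",", doublequote=False,
                            escapechar="\\", quotechar='"')


def write_rows(rows):
    """Serialise rows to CSV text using the dialect settings above."""
    buf = io.StringIO(newline="")
    csv.writer(buf, **csv_dialect_defaults).writerows(rows)
    return buf.getvalue()


def read_rows(text):
    """Parse CSV text back into rows using the same settings."""
    return list(csv.reader(io.StringIO(text, newline=""), **csv_dialect_defaults))


rows = [["1", 'He said "hi"'], ["2", "a, b"]]
text = write_rows(rows)
print(text)
```

Reading the text back with the same settings returns the original rows, so the quotes and commas in the data survive the round trip.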

Using EmEditor saving a Unicode file to another format distorts/changes the format. Solution?

There is a MySQL backup file which is a huge file - about 3 GB. There is one table that has a LONGBLOB column that stores JPEG image data.
The file imports successfully if done from MySQL Workbench - Data Import/Restore.
I need to open this file and extract the first few lines (about two rows of INSERTs of the table with the image data) so that I can test if another program can import this data into another MySQL database.
I tried opening the file with EmEditor (which is good at opening large files), copying everything up to the end of the first INSERT statement of the script (up to about line 25, because the table in question is the first table in the backup script), and pasting the selection into a new file.
Here is the problem:
this messes up the encoding (even though I save as utf8). I noticed it when I imported (restored) the new file into a MySQL database, again using MySQL Workbench: the restore completes without errors, but the JPEG images in the blob column are now destroyed/corrupted.
My guess is that the encoding is different between the original file and new file.
EmEditor does not show the encoding of the original file. There is an option to detect it, and it detects 'UTF8 Unsigned', but when saving I choose UTF8. I also tried saving as ANSI, ISO8859 (Windows default), etc., but the result is the same every time.
Do you have any solution for this particular problem? ie I want to only cut the first few lines of the huge backup file and save to a new file keeping the encoding the same, so that the images (blobs) are not changed. Is there any way this can be done with EmEditor (ie do I have the wrong approach [ie Cut-Paste]?) Is there any specialized software that can do this? How can I diagnose what is going wrong here?
Thanks for any responses.
"this messes up the encoding (even though I save as utf8)"
UTF-8 is not a good choice for arbitrary binary data. There are many sequences of high-bytes which are not valid in UTF-8, so you will mangle them at some point during the load-alter-save process.
If you load the file using an encoding that maps every single byte to a unique character, and re-save the file using that same encoding, you should preserve the original content(*). ISO-8859-1 is the encoding usually chosen for this purpose, since it simply maps each byte 0..0xFF to the Unicode code point with the same number.
(*: assuming the editor is binary-safe with regard to other tricky points like nulls, \n/\r and other control characters... I believe EmEditor can be.)
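The difference between the two encodings can be checked with a few lines of Python. This is a minimal sketch, independent of EmEditor; `roundtrip` is a hypothetical helper that decodes bytes to text and encodes them back, replacing anything the encoding cannot represent, which is roughly what a lenient editor does on load and save.

```python
def roundtrip(data, encoding):
    """Decode bytes to text and encode them back, replacing undecodable input."""
    text = data.decode(encoding, errors="replace")
    return text.encode(encoding, errors="replace")


# Every possible byte value once, as a stand-in for arbitrary binary data.
blob = bytes(range(256))

latin1_result = roundtrip(blob, "latin-1")
utf8_result = roundtrip(blob, "utf-8")

print("Latin-1 preserved the bytes:", latin1_result == blob)
print("UTF-8 preserved the bytes:", utf8_result == blob)
```

ISO-8859-1 (Latin-1) returns the bytes unchanged, while the UTF-8 round trip replaces the invalid sequences and so alters the data.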
When opening the original file in EmEditor, try selecting Binary (ASCII View) as the encoding. Binary (ASCII View) will, as bobince said, map each byte to a unique character and preserve that mapping when you save the file. I think this should fix your problem.
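If a script is an option, the first lines can also be copied without any text decoding at all. This is a different technique from the editor approach above: a minimal Python sketch that opens both files in binary mode, with `copy_head` as a hypothetical helper name. It assumes that "lines" means runs of bytes ending in a raw 0x0A byte, so a blob containing an unescaped newline byte would be counted as more than one line.

```python
def copy_head(src_path, dst_path, max_lines):
    """Copy the first `max_lines` lines of a file byte-for-byte; return bytes written."""
    written = 0
    # Binary mode: no decoding, no newline translation, so blobs stay intact.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for index, line in enumerate(src):
            if index >= max_lines:
                break
            dst.write(line)
            written += len(line)
    return written
```

For example, `copy_head("backup.sql", "sample.sql", 25)` would copy the first 25 lines; both file names here are placeholders.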