LOAD XML LOCAL INFILE ampersand issue - MySQL

I want to import XML data which contains ampersands into MySQL.
The import stops running once it encounters a row containing a raw ampersand (&). Admittedly this is not correct XML but that is what I am working with.
I have tried replacing the raw ampersands with &amp; - this appears in the database as the literal text &amp; (not the equivalent ASCII ampersand).
I have tried replacing the raw ampersands with \& - this stops the import routine from running further.
Can you suggest how I can get the raw ampersand into the database using LOAD XML LOCAL INFILE?
Sample raw XML follows:
<?xml version="1.0" ?>
<REPORT>
<CLA>
<PLOT>R&S</PLOT>
<VAL>100.10</VAL>
</CLA>
<CLA>
<PLOT>G&N</PLOT>
<VAL>200.20</VAL>
</CLA>
</REPORT>

"Admittedly this is not correct xml but that is what I am working with."
No, it's not that it's incorrect XML. It is not XML at all because it is not well-formed.
You have two ways forward:
Fix the data by treating it as text to turn it into XML (replace each bare & with &amp;); a minimal sketch follows below.
Load the data into the database using a non-XML data type.
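For the first option, a quick preprocessing pass can be done in any language; here is a minimal sketch in Python (the file names are made up, and it assumes the file contains no entities yet, so every & is a bare one). MySQL decodes the &amp; entity while parsing the XML, so the PLOT column should end up containing a plain &.
with open("report.xml", "r", encoding="utf-8") as src:
    text = src.read()

# Escape every bare ampersand so the file becomes well-formed XML
# before running LOAD XML LOCAL INFILE (assumes no entities exist yet).
with open("report_fixed.xml", "w", encoding="utf-8") as dst:
    dst.write(text.replace("&", "&amp;"))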

Related

How to import CSV in KNIME and ignore the quote marks

I have a csv file with data like this:
"Column1; Column2; Column3"
"ValueA; ValueB; ValueC"
"ValueD; ValueE; ValueF"
When I import it using the 'CSV Reader' node, it interprets the quote marks as content.
I need the data to be imported without the quotation marks, though (reformatting it afterwards does not feel like a clean way of doing this, and the node then interprets the data formats incorrectly).
The setting of the node is as follows: https://i.stack.imgur.com/FJC1k.png
How can I deal with this?
In the configuration dialog, add " as the Quote Char.
#FlipForties Hi,
As a heavy KNIME user, I would recommend trying to load your data via the File Reader node instead. It's much more flexible than the CSV Reader node, and you should be able to load your data as-is without issues. I made a test data set and it looks OK upon load.

Dump Chinese data into a JSON file

I am running into a problem while dumping Chinese data (non-Latin language data) into a JSON file.
I am trying to store a list in a JSON file with the following code:
with open("file_name.json","w",encoding="utf8") as file:
json.dump(edits,file)
It dumps without any errors.
When I view the file, it looks like this:
[{sentence: \u5979\u7d30\u5c0f\u8072\u5c0d\u6211\u8aaa\uff1a\u300c\u6211\u501f\u4f60\u4e00\u679d\u925b\u7b46\u3002\u300d}...]
I also tried it without the encoding option:
with open("file_name.json","w") as file:
json.dump(edits,file)
My question is: why does my JSON file look like this, and how can I dump it so the file contains the Chinese strings instead of Unicode escapes?
Any help would be appreciated. Thanks :)
Check out the docs for json.dump.
Specifically, it has a parameter, ensure_ascii, which when set to False makes the function output non-ASCII characters without escaping them.
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
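For example, a minimal sketch using the sentence from your output (the edits list here is just sample data standing in for yours):
import json

# Sample data standing in for the asker's `edits` list
edits = [{"sentence": "她細小聲對我說：「我借你一枝鉛筆。」"}]

with open("file_name.json", "w", encoding="utf8") as file:
    # ensure_ascii=False writes the Chinese characters as-is
    # instead of \uXXXX escape sequences
    json.dump(edits, file, ensure_ascii=False)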

What character encoding is this in the JSON file?

I downloaded a 6 GB gz file from Open Library and extracted it on my Ubuntu machine, which turned it into a 40 GB txt file. When inspecting the start of the file using head, I find this string:
"name": "Mawlu\u0304d Qa\u0304sim Na\u0304yit Bulqa\u0304sim"
What encoding is this? Is it possible to get something that is human readable or does it look like it will require the data source to be exported correctly again?
It's standard escaping of Unicode characters in a JavaScript literal string.
The string is Mawlūd Qāsim Nāyit Bulqāsim.
This is plain JSON encoding. Your JSON parser will translate the \uNNNN references to Unicode characters. See also: json_encode function: special characters
It looks like Unicode:
http://www.charbase.com/0304-unicode-combining-macron
U+0304: COMBINING MACRON
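As a quick check, any JSON parser turns those escapes back into the readable characters. A minimal sketch in Python, wrapping the snippet from the question in a small JSON object:
import json

# The escaped snippet from the question, kept verbatim inside a raw string
raw = r'{"name": "Mawlu\u0304d Qa\u0304sim Na\u0304yit Bulqa\u0304sim"}'

record = json.loads(raw)
# json.loads decodes the \uNNNN escapes (U+0304 is the combining macron)
print(record["name"])  # Mawlūd Qāsim Nāyit Bulqāsim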

How do I deal with commas/tabs that are part of the data in CSV/TSV in MarkLogic

I am trying to load a CSV file that has commas as part of the data into MarkLogic using RecordLoader. The data loads, but MarkLogic treats the commas that are part of the data as delimiters. I tried to escape the commas with backslashes, but that didn't work and the data remains dirty with the backslashes. I thought about replacing the data commas with other symbols so that I can change them back to commas after loading, but I don't know if there is a way to modify the data after it is loaded, and I would have to reposition the XML tags line by line.
How can I load a CSV/TSV file and keep the commas/tabs that are part of the data as part of the data rather than as delimiters?
Thanks in advance.
RecordLoader's DelimitedDataLoader doesn't support any escaping today. If you want to add it as a patch, https://github.com/marklogic/recordloader/blob/master/src/java/com/marklogic/recordloader/xcc/DelimitedDataLoader.java#L102 is the place to start looking at the code.
Although you asked about RecordLoader, you could also use the MarkLogic Content Pump. See Creating Documents from Delimited Text Files.
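If patching RecordLoader is not an option, another route is to turn each row into a small XML document yourself before loading, letting Python's csv module honour the quoting so embedded commas stay part of the data. A minimal sketch (file and element names are made up, and it assumes the header values are usable as XML element names):
import csv
import xml.etree.ElementTree as ET

# Hypothetical input file; csv.reader keeps commas inside quoted fields as data
with open("data.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for i, row in enumerate(reader):
        record = ET.Element("record")
        for name, value in zip(header, row):
            # Assumes each header value is a valid XML element name
            ET.SubElement(record, name).text = value
        # One XML document per row, ready to load into MarkLogic
        ET.ElementTree(record).write("record-%d.xml" % i, encoding="utf-8", xml_declaration=True)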

How to use an ASCII character for quote in COPY in cqlsh

I am uploading data from a big .csv file into Cassandra using COPY in cqlsh.
I am using cassandra 1.2 and CQL 3.0.
However, since " is part of my data, I have to use some other character as the quote character; I need to use an extended ASCII character. I tried various approaches but they fail.
The following work, but I need to use an extended ASCII character for my purpose:
copy (<columnnames>) from <filename> where deleimiter='|' and quote = '"';
copy (<columnnames>) from <filename> where deleimiter='|' and quote = '~';
When I give quote='ß', I get the error below:
:"quotechar" must be an 1-character string
Please advise on how I can use an extended ASCII character for the quote parameter.
Thanks in advance
A note on the COPY documentation page suggests that for bulk loading (as in your case), the json2sstable utility should be used. You can then load the sstables into your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV. JSON will not have any problem handling any character from the ASCII table.
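Reading the CSV and writing JSON is easy in Python; note that the exact JSON layout json2sstable expects depends on your column family definition, so treat the following only as a generic sketch with hypothetical file names:
import csv
import json

# Hypothetical file names; the csv module handles embedded quotes and delimiters
with open("data.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="|"))

# Reshape `rows` into whatever structure json2sstable expects for your
# column family before writing the file out.
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False)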
I had a similar problem and inspected the source code of cqlsh (it's a Python script). In my case, I was generating the CSV with Python, so it was a matter of finding the right Python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from Python, it's just a matter of using the csv module with:
writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
Hope this helps, even if you are not using python.
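For instance, a made-up row whose second value contains both a comma and a double quote would be written like this; the dialect wraps the value in quotes and backslash-escapes the embedded quote, which matches what cqlsh's COPY parser expects:
import csv

csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f, **csv_dialect_defaults)
    # Made-up row; the second value contains both a comma and a double quote
    writer.writerow(["id-1", 'He said "hello", twice'])
    # Written as: id-1,"He said \"hello\", twice"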