Encode String in XQuery - csv

I wrote a function in my web application based on eXist-db to export some XML elements to CSV with XQuery. Everything works fine, but some of my elements contain umlauts like ü, ä or ß, and these are displayed the wrong way in my CSV. I tried to encode the content with fn:normalize-unicode, but this did not help.
Here is a minimal example of my code:
let $input =
    <root>
        <number>1234</number>
        <name>Aufmaß</name>
    </root>
let $csv := string-join(
    for $ta in $input
    return concat($ta/number/text(), fn:normalize-unicode($ta/name/text())))
let $csv-ueber-string := concat($csv-ueber, string-join($massnahmen, $nl))
let $set-content-type := response:set-header('Content-Type', 'text/csv')
let $set-accept := response:set-header('Accept', 'text/csv')
let $set-file-name := response:set-header('Content-Disposition', 'attachment; filename="export.csv"')
return response:stream($csv, '')

It's very unlikely indeed that there's anything wrong with your query, or that there's anything you can do in your query to correct this.
The problem is likely to be either
(a) the input data being passed to your query is in a different character encoding from what the query processor thinks it is
(b) the output data from your query is in a different character encoding from what the recipient of the output thinks it is.
A quick glance at your query suggests that it doesn't actually have any external input other than the query source code itself. But the source code is one of the inputs, and that's a possible source of error. A good way to eliminate this possibility might be to see what happens if you replace
<name>Aufmaß</name>
by
<name>Aufma{codepoints-to-string(223)}</name>
If that solves the problem, then your query source text is not in the encoding that the query compiler thinks it is.
The other possibility is that the problem is on the output side, and frankly, this seems more likely. You seem to be producing an HTTP response stream as output, and constructing the HTTP headers yourself. I don't see any evidence that you are setting any particular encoding in the HTTP response headers. The response:stream() function is vendor-specific and I'm not familiar with its details, but I suspect that you need to ensure it encodes the content in UTF-8 and that the HTTP headers say it is in UTF-8; this may be by extra parameters to the function, or by external configuration options.
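For instance (just a sketch of the idea; whether response:stream() honours it depends on your eXist version and configuration), you could make the charset explicit in the Content-Type header you are already setting:
let $set-content-type := response:set-header('Content-Type', 'text/csv; charset=UTF-8')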

As you might expect, eXist is serializing the CSV as Unicode (UTF-8). But when you open the resulting export.csv file directly in Excel (i.e., via File > Open), Excel tries its best to guess the encoding of the CSV file. CSV files lack any way of declaring their encoding, so applications may well guess wrong, as it sounds like Excel did in your case. On my computer, Excel guesses wrong too, mangling Aufmaß into AufmaÃŸ. Here's the way to force Excel to use the encoding of a UTF-8 encoded CSV file such as the one produced by your query:
In Excel, start a new spreadsheet via File > New
Select File > Import to bring up a series of dialogs that let you specify how to import the CSV file.
In the first dialog, select "CSV file" as the type of file.
In the next dialog, titled "Text Import Wizard - Step 1 of 3", select "Unicode (UTF-8)" as the "File origin". (At least these are the titles/order in my copy of MS Excel for Mac 2016.)
Proceed through the remainder of the dialogs, keeping the default values.
Excel will then place the contents of your export.csv in the new spreadsheet.
Lastly, let me provide the following query, which I used to test and confirm that the CSV file produced by eXist does open as expected when following the directions above. The query is essentially the same as yours but fixes some problems that prevented me from running it directly. I saved this query at /db/csv-test.xq and called it via http://localhost:8080/exist/rest/db/csv-test.xq:
xquery version "3.1";
let $input :=
    <root>
        <number>1234</number>
        <name>Aufmaß</name>
    </root>
let $cell-separator := ","
let $column-headings := $input/*/name()
let $header-row := string-join($column-headings, $cell-separator)
let $body-row := string-join($input/*/string(), $cell-separator)
let $newline := '
'
let $csv := string-join(($header-row, $body-row), $newline)
return
    response:stream-binary(
        util:string-to-binary($csv),
        "text/csv",
        "export.csv"
    )

Related

Reading a .dat file in Julia, issues with variable delimiter spacing

I am having issues reading a .dat file into a DataFrame. I think the issue is with the delimiter. I have included a screenshot of what the data in the file looks like below. My best guess is that it is tab-delimited between columns and newline-delimited between rows. I have tried reading in the data with the following commands:
df = CSV.File("FORCECHAIN00046.dat"; header=false) |> DataFrame!
df = CSV.File("FORCECHAIN00046.dat"; header=false, delim = ' ') |> DataFrame!
My result either way is just a DataFrame with only one column, with all the data from each column concatenated into one string. I even tried to specify the types with the following code:
df = CSV.File("FORCECHAIN00046.dat"; types=[Float64,Float64,Float64,Float64,
Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64]) |> DataFrame!
And I received the following warning:
┌ Warning: 2; something went wrong trying to determine row positions for multithreading; it'd be very helpful if you could open an issue at https://github.com/JuliaData/CSV.jl/issues so package authors can investigate
I can work around this by uploading it into Google Sheets and then downloading a CSV, but I would like to find a way to make the original .dat file work.
Part of the issue here is that .dat is not a proper file format—it's just something that seems to be written out in a somewhat human-readable format with columns of numbers separated by variable numbers of spaces so that the numbers line up when you look at them in an editor. Google Sheets has a lot of clever tricks built in to "do what you want" for all kinds of ill-defined data files, so I'm not too surprised that it manages to parse this. The CSV package on the other hand supports using a single character as a delimiter or even a multi-character string, but not a variable number of spaces like this.
Possible solutions:
if the files aren't too big, you could easily roll your own parser that splits each line and then builds a matrix (see the sketch after this list)
you can also pre-process the file turning multiple spaces into single spaces
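A minimal sketch of the first option (untested against your data, and it assumes every line holds the same number of numeric columns):
function dat2matrix(dat_path::AbstractString)
    # Split each non-empty line on whitespace and parse the pieces as Float64.
    rows = [parse.(Float64, split(line)) for line in eachline(dat_path) if !isempty(strip(line))]
    # Stack the row vectors into a matrix (one matrix row per line of the file).
    return permutedims(reduce(hcat, rows))
end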
Pre-processing is probably the easiest way to do this, and here's some Julia code (untested since you didn't provide test data) that will open your file and convert it to a more reasonable format:
function dat2csv(dat_path::AbstractString, csv_path::AbstractString)
    open(csv_path, write=true) do io
        for line in eachline(dat_path)
            join(io, split(line), ',')
            println(io)
        end
    end
    return csv_path
end

function dat2csv(dat_path::AbstractString)
    base, ext = splitext(dat_path)
    ext == ".dat" ||
        throw(ArgumentError("file name doesn't end with `.dat`"))
    return dat2csv(dat_path, "$base.csv")
end
You would call this function as dat2csv("FORCECHAIN00046.dat") and it would create the file FORCECHAIN00046.csv, which would be a proper CSV file using commas as delimiters. That won't work well if the files contain any values with commas in them, but it looks like they are just numbers, in which case it should be fine. So you can use this function to convert the files to proper CSV and then load that file with the CSV package.
A little explanation of the code:
the two-argument dat2csv method opens csv_path for writing and then calls eachline on dat_path to read one line from it at a time
eachline strips any trailing newline from each line, so each line will be a bunch of numbers separated by whitespace, possibly with some leading and/or trailing whitespace
split(line) does the default splitting of line which splits it on whitespace, dropping any empty values—this leaves just the non-whitespace entries as strings in an array
join(io, split(line), ',') joins the strings in the array together, separated by the , character and writes that to the io write handle for csv_path
println(io) writes a newline after that—otherwise everything would just end up on a single very long line
the one-argument dat2csv method calls splitext to split the file name into a base name and an extension, checking that the extension is the expected .dat and calling the two-argument version with the .dat replaced by .csv
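As a usage sketch (assuming a current CSV.jl/DataFrames.jl, where DataFrame replaces the deprecated DataFrame! sink):
using CSV, DataFrames

# Convert the .dat file once, then read the resulting CSV as usual.
csv_path = dat2csv("FORCECHAIN00046.dat")
df = CSV.File(csv_path; header=false) |> DataFrame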
Try using the readdlm function from the DelimitedFiles standard library, and convert to a DataFrame afterwards:
using DelimitedFiles, DataFrames
df = DataFrame(readdlm("FORCECHAIN00046.dat"), :auto)

Dump Chinese data into a json file

I am running into a problem while dumping Chinese data (non-Latin-language data) into a JSON file.
I am trying to store a list in a JSON file with the following code:
with open("file_name.json","w",encoding="utf8") as file:
json.dump(edits,file)
It dumps without any errors.
But when I view the file, it looks like this:
[{sentence: \u5979\u7d30\u5c0f\u8072\u5c0d\u6211\u8aaa\uff1a\u300c\u6211\u501f\u4f60\u4e00\u679d\u925b\u7b46\u3002\u300d}...]
I also tried it without the encoding option:
with open("file_name.json","w") as file:
json.dump(edits,file)
My question is: why does my JSON file look like this, and how can I dump it so that the file contains the Chinese characters themselves instead of Unicode escape sequences?
Any help would be appreciated. Thanks :)
Check out the docs for json.dump.
Specifically, it has a switch ensure_ascii that if set to False should make the function not escape the characters.
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
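In other words, something like this (a minimal sketch reusing the names from your question):
import json

# ensure_ascii=False writes the Chinese characters directly instead of \uXXXX escapes.
with open("file_name.json", "w", encoding="utf8") as file:
    json.dump(edits, file, ensure_ascii=False)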

Python String replacement in a file each time gives different result

I have a JSON file and I want to do some replacements in it. I've written some code; it works, but the results are inconsistent.
This is where the replacement gets done.
replacements1 = {builtTelefon:'Isim', builtIlce:'Isim', builtAdres:'Isim', builtIsim:'Isim'}
replacements3 = {builtYesterdayTelefon:'Isim', builtYesterdayIlce:'Isim', builtYesterdayAdres:'Isim', builtYesterdayIsim:'Isim'}

with open('veri3.json', encoding='utf-8') as infile, open('veri2.json', 'w') as outfile:
    for line in infile:
        for src, target in replacements1.items():
            line = line.replace(src, target)
        for src, target in replacements3.items():
            line = line.replace(src, target)
        outfile.write(line)
Here are some examples of what builtAdres and builtYesterdayAdres look like:
01 Temmuz 2018 Pazar.1
30 Haziran 2018 Cumartesi.1
I run this on my data, but it results in a different output each time. Please check the screenshot below, because I don't know how else to describe it.
It is the very same code, run on the same input every time, yet the outcome differs between runs.
What it should do is test the entire file against 01 Temmuz 2018 Pazar and, wherever it finds it, replace it with the string Isim without touching anything else. On a second pass it checks for anything matching 30 Haziran 2018 Cumartesi and replaces that with the string Isim too.
What's causing this?
Example files for re-testing:
pastebin - veri3.json
pastebin - code.py
I think you have just one problem: you're trying to use "Isim" as a key name multiple times within the same object, and this will botch the JSON.
The reason why you might be "getting different results" might have to do with the client you're using to display the JSON. I think that if you look at the raw data, the JSON should have been fully altered (I ran your script and it seems to be altered). However, the client will not handle the repeated keys well, and will display each object as best it can.
In fact, I'm not sure how you get "Isim.1", "Isim.2" as keys, since you actually use "Isim" for all of them. The client must be trying to cope with the duplication there.
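A quick illustration of why duplicate keys are a problem (this is standard json module behaviour, not specific to your data):
import json

# When the same key appears twice in one object, json.loads keeps only the last value.
print(json.loads('{"Isim": "a", "Isim": "b"}'))  # {'Isim': 'b'}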
Try this code, where I use "Isim.1", "Isim.2" etc.:
replacements1 = {builtTelefon:'Isim.3', builtIlce:'Isim.2', builtAdres:'Isim.1', builtIsim:'Isim'}
replacements3 = {builtYesterdayTelefon:'Isim.3', builtYesterdayIlce:'Isim.2', builtYesterdayAdres:'Isim.1', builtYesterdayIsim:'Isim'}
I think you should be able to have all the keys displayed now.
Oh and PS: to use your code with my locale I had to change line 124 to specify 'utf-8' as encoding for the outfile:
with open('veri3.json', encoding='utf-8') as infile, open('veri2.json', 'w', encoding='utf-8') as outfile:

Changing The Delimiter to CTRL+A in Python CSV Module

I'm trying to write a CSV file with the delimiter Ctrl+A. I'm eventually going to have to write the file to Hadoop, and I'm unable to use a standard delimiter.
Currently I'm trying this:
writer = csv.writer(f, delimiter = "\u0001")
for item in aList:
    writer.writerow(item)
f.close()
However, the output doesn't appear to be written correctly when opened in Excel...
Some rows are condensed into one block, while others have one field in the first block and then the rest condensed into the second block, etc.
Is the error where I'm setting up the writer object, or am I just not familiar with separating files this way?
You can try using the non-printing "group separator" character, which can be represented in Python code as '\035' (octal for ASCII 29).
see http://www.asciitable.com/index/asciifull.gif for some other nonprinting characters if you need more.
It may be helpful to include more context about why you want to use nonstandard delimiter. And whether Excel parsing of the file is necessary, or just a quick check to see if the file might be parsed properly by the target system, Hadoop.
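For what it's worth, here is a minimal sketch of both variants, Ctrl+A and the group separator, with placeholder data; note that Excel will not guess either delimiter on its own, so check the output in a text editor or in the target system instead:
import csv

aList = [["a", "b", "c"], ["1", "2", "3"]]  # placeholder data

# Ctrl+A ('\x01', the same character as "\u0001") as the delimiter, as in the question.
with open("out_ctrl_a.csv", "w", newline="") as f:
    csv.writer(f, delimiter="\x01").writerows(aList)

# Non-printing "group separator" (ASCII 29, octal '\035') as suggested above.
with open("out_gs.csv", "w", newline="") as f:
    csv.writer(f, delimiter="\035").writerows(aList)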

PHP: creating CSV file with windows encoding

I am creating CSV files with PHP. To write the data into my CSV file, I use the PHP function fputcsv.
This is the issue:
I can open the created file normally in Excel. But I can't import the file into a shop system (in this case Shopware). It says something like "the data could not be read".
And here is the interesting part:
If I open the created file, choose "Save as" and select "CSV (comma delimited)" as the type, this file can be imported into Shopware. I read something about the PHP function mb_convert_encoding, which I used to encode the data, but it could not fix the problem.
I will be very glad if you can help me.
Thanks.
Thanks for your input.
I solved this problem by replacing fputcsv with fwrite. Then I just needed to add "\r\n" (thanks wmil) to the end of each line, and the generated file can be read by Shopware.
Apparently the fputcsv function uses \n rather than \r\n as the end-of-line character.
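A minimal sketch of that fwrite-based approach (the column layout and quoting rules here are placeholders; adjust them to whatever Shopware expects):
<?php
$rows = [
    ['number', 'name'],    // placeholder data
    ['1234', 'Example'],
];

$fh = fopen('export.csv', 'w');
foreach ($rows as $row) {
    // Quote each field, escape embedded quotes, and end the line with \r\n.
    $fields = array_map(function ($v) {
        return '"' . str_replace('"', '""', $v) . '"';
    }, $row);
    fwrite($fh, implode(',', $fields) . "\r\n");
}
fclose($fh);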
I think you cannot set the encoding using fputcsv. However, fputcsv respects the locale setting, which you can change with setlocale.
Maybe you could send your file directly to the user's browser and set the content type and charset with the header() function.
This can't be answered without knowing more about your system. Most likely it has nothing to do with character encoding. It's probably a problem with the wrong number of columns or incorrect column headers.
If it is a character encoding issue, your best bet is:
$new_str = mb_convert_encoding($str, 'Windows-1252', 'auto');
Also end lines with \r\n, not just \n.
If that doesn't work you'll need to check the software docs.