How to read a non-UTF-8 encoded CSV file? - csv

With the csv crate and the latest Rust version 1.31.0, I want to read CSV files with ANSI (Windows-1252) encoding as easily as UTF-8 ones.
Things I have tried (with no luck) after reading the whole file into a Vec<u8>:
CString
OsString
Indeed, in my company we have a lot of ANSI-encoded CSV files.
Also, if possible, I would like not to load the entire file into a Vec<u8> but to read it line by line (CRLF endings), as many of the files are big (50 MB or more…).
In the file Cargo.toml, I have this dependency:
[dependencies]
csv = "1"
test.csv consists of the following content, saved with Windows-1252 encoding:
Café;au;lait
Café;au;lait
The code in main.rs file:
extern crate csv;

use std::error::Error;
use std::fs::File;
use std::io::BufReader;
use std::path::Path;
use std::process;

fn example() -> Result<(), Box<Error>> {
    let file_name = r"test.csv";
    let file_handle = File::open(Path::new(file_name))?;
    let reader = BufReader::new(file_handle);
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_reader(reader);
    // println!("ANSI");
    // for result in rdr.byte_records() {
    //     let record = result?;
    //     println!("{:?}", record);
    // }
    println!("UTF-8");
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

fn main() {
    if let Err(err) = example() {
        println!("error running example: {}", err);
        process::exit(1);
    }
}
The output is:
UTF-8
error running example: CSV parse error: record 0 (line 1, field: 0, byte: 0): invalid utf-8: invalid UTF-8 in field 0 near byte index 3
error: process didn't exit successfully: `target\debug\test-csv.exe` (exit code: 1)
When using rdr.byte_records() (uncommenting the relevant part of code), the output is:
ANSI
ByteRecord(["Caf\\xe9", "au", "lait"])

I suspect this question is underspecified. In particular, it's not clear why your use of the ByteRecord API is insufficient. In the csv crate, byte records exist for exactly this kind of case, where your CSV data isn't strictly UTF-8 but is in an alternative, ASCII-compatible encoding such as Windows-1252. (An ASCII-compatible encoding is an encoding of which ASCII is a subset. Windows-1252 and UTF-8 are both ASCII compatible; UTF-16 is not.) Your code sample above shows that you're using byte records, but doesn't explain why this is insufficient.
With that said, if your goal is to get your data into Rust's string data type (String/&str), then your only option is to transcode the contents of your CSV data from Windows-1252 to UTF-8. This is necessary because Rust's string data type uses UTF-8 for its in-memory representation. You cannot have a Rust String/&str that is Windows-1252 encoded because Windows-1252 is not a subset of UTF-8.
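For illustration, a minimal in-memory sketch of that transcoding step with the encoding_rs crate (the to_utf8 helper name is made up; a full streaming program follows below):

use encoding_rs::WINDOWS_1252;

// Decode a Windows-1252 byte buffer into a Rust String (UTF-8).
fn to_utf8(bytes: &[u8]) -> String {
    // `decode` returns a Cow<str>: borrowed if the input was pure ASCII,
    // owned if any bytes actually had to be transcoded. Windows-1252
    // assigns a character to every byte value, so decoding does not fail.
    let (cow, _encoding_used, _had_errors) = WINDOWS_1252.decode(bytes);
    cow.into_owned()
}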
Other comments have recommended the use of the encoding crate. However, I would instead recommend encoding_rs, if your use case aligns with the use cases addressed by the Encoding Standard, which is specifically geared towards the web. Fortunately, I believe such an alignment exists.
In order to satisfy your criteria for reading CSV data in a streaming fashion without first loading the entire contents into memory, you need to use a wrapper around the encoding_rs crate that implements streaming decoding for you. The encoding_rs_io crate provides this for you. (It's used inside of ripgrep to do fast streaming decoding before searching UTF-8.)
Here is an example program that puts all of the above together, using Rust 2018:
use std::fs::File;
use std::process;

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() {
    if let Err(err) = try_main() {
        eprintln!("{}", err);
        process::exit(1);
    }
}

fn try_main() -> csv::Result<()> {
    let file = File::open("test.csv")?;
    let transcoded = DecodeReaderBytesBuilder::new()
        .encoding(Some(WINDOWS_1252))
        .build(file);
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_reader(transcoded);
    for result in rdr.records() {
        let r = result?;
        println!("{:?}", r);
    }
    Ok(())
}
with the Cargo.toml:
[package]
name = "so53826986"
version = "0.1.0"
edition = "2018"

[dependencies]
csv = "1"
encoding_rs = "0.8.13"
encoding_rs_io = "0.1.3"
And the output:
$ cargo run --release
Compiling so53826986 v0.1.0 (/tmp/so53826986)
Finished release [optimized] target(s) in 0.63s
Running `target/release/so53826986`
StringRecord(["Café", "au", "lait"])
In particular, if you swap out rdr.records() for rdr.byte_records(), then we can see more clearly what happened:
$ cargo run --release
Compiling so53826986 v0.1.0 (/tmp/so53826986)
Finished release [optimized] target(s) in 0.61s
Running `target/release/so53826986`
ByteRecord(["Caf\\xc3\\xa9", "au", "lait"])
Namely, your input contained Caf\xE9, but the byte record now contains Caf\xC3\xA9. This is a result of translating the Windows-1252 codepoint value of 233 (encoded as its literal byte, \xE9) to U+00E9 LATIN SMALL LETTER E WITH ACUTE, which is UTF-8 encoded as \xC3\xA9.
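If you want to check that byte-level claim yourself, a two-line sketch:

fn main() {
    // 0xE9 is 'é' in Windows-1252; UTF-8 encodes the same character as two bytes.
    assert_eq!("é".as_bytes(), b"\xC3\xA9");
    println!("ok");
}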

Related

Load json file bundled within the executable [duplicate]

I'm working on a small web application in Go that's meant to be used as a tool on a developer's machine to help debug their applications/web services. The interface to the program is a web page that includes not only the HTML but some JavaScript (for functionality), images, and CSS (for styling). I'm planning on open-sourcing this application, so users should be able to run a Makefile, and all the resources will go where they need to go. However, I'd also like to be able to simply distribute an executable with as few files/dependencies as possible. Is there a good way to bundle the HTML/CSS/JS with the executable, so users only have to download and worry about one file?
Right now, in my app, serving a static file looks a little like this:
// called via http.ListenAndServe
func switchboard(w http.ResponseWriter, r *http.Request) {
	// snipped dynamic routing...

	// look for static resource
	uri := r.URL.RequestURI()
	if fp, err := os.Open("static" + uri); err == nil {
		defer fp.Close()
		staticHandler(w, r, fp)
		return
	}

	// snipped blackhole route
}
So it's pretty simple: if the requested file exists in my static directory, invoke the handler, which simply opens the file and tries to set a good Content-Type before serving. My thought was that there's no reason this needs to be based on the real filesystem: if there were compiled resources, I could simply index them by request URI and serve them as such.
Let me know if there's not a good way to do this or I'm barking up the wrong tree by trying to do this. I just figured the end-user would appreciate as few files as possible to manage.
If there are more appropriate tags than go, please feel free to add them or let me know.
Starting with Go 1.16 the go tool has support for embedding static files directly in the executable binary.
You have to import the embed package, and use the //go:embed directive to mark what files you want to embed and into which variable you want to store them.
3 ways to embed a hello.txt file into the executable:
import "embed"
//go:embed hello.txt
var s string
print(s)
//go:embed hello.txt
var b []byte
print(string(b))
//go:embed hello.txt
var f embed.FS
data, _ := f.ReadFile("hello.txt")
print(string(data))
Using the embed.FS type for the variable you can even include multiple files into a variable that will provide a simple file-system interface:
// content holds our static web server content.
//go:embed image/* template/*
//go:embed html/index.html
var content embed.FS
The net/http package has support to serve files from a value of embed.FS using http.FS(), like this:
http.Handle("/static/", http.StripPrefix("/static/", http.FileServer(http.FS(content))))
The template packages can also parse embedded templates, using the text/template.ParseFS() and html/template.ParseFS() functions and the text/template.Template.ParseFS() and html/template.Template.ParseFS() methods:
template.ParseFS(content, "*.tmpl")
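For example, a minimal sketch tying the two together (the static/ and templates/ directory names and the index.tmpl template are assumptions for illustration):

package main

import (
	"embed"
	"html/template"
	"net/http"
)

//go:embed static templates
var content embed.FS

func main() {
	// Embedded paths keep their directory prefix ("static/..."),
	// so no StripPrefix is needed for this URL layout.
	http.Handle("/static/", http.FileServer(http.FS(content)))

	// Parse all embedded templates in one call.
	tmpl := template.Must(template.ParseFS(content, "templates/*.tmpl"))

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		tmpl.ExecuteTemplate(w, "index.tmpl", nil) // render the assumed index.tmpl
	})

	http.ListenAndServe(":8080", nil)
}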
The rest of this answer lists your old options (prior to Go 1.16).
Embedding Text Files
If we're talking about text files, they can easily be embedded in the source code itself. Just use the back quotes to declare the string literal like this:
const html = `
<html>
<body>Example embedded HTML content.</body>
</html>
`
// Sending it:
w.Write([]byte(html)) // w is an io.Writer
Optimization tip:
Since most of the time you will only need to write the resource to an io.Writer, you can also store the result of a []byte conversion:
var html = []byte(`
<html><body>Example...</body></html>
`)
// Sending it:
w.Write(html) // w is an io.Writer
The only thing you have to be careful about is that raw string literals cannot contain the back quote character (`). Raw string literals cannot contain escape sequences (unlike interpreted string literals), so if the text you want to embed does contain back quotes, you have to break the raw string literal and concatenate the back quotes as interpreted string literals, as in this example:
var html = `<p>This is a back quote followed by a dot: ` + "`" + `.</p>`
Performance is not affected, as these concatenations will be executed by the compiler.
Embedding Binary Files
Storing as a byte slice
For binary files (e.g. images) the most compact (regarding the resulting native binary) and most efficient approach is to have the content of the file as a []byte in your source code. This can be generated by third-party tools/libraries like go-bindata.
If you don't want to use a 3rd party library for this, here's a simple code snippet that reads a binary file, and outputs Go source code that declares a variable of type []byte that will be initialized with the exact content of the file:
imgdata, err := ioutil.ReadFile("someimage.png")
if err != nil {
	panic(err)
}

fmt.Print("var imgdata = []byte{")
for i, v := range imgdata {
	if i > 0 {
		fmt.Print(", ")
	}
	fmt.Print(v)
}
fmt.Println("}")
Example output if the file contained the bytes 0 through 15 (try it on the Go Playground):
var imgdata = []byte{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
Storing as base64 string
If the file is not "too large" (most images/icons qualify), there are other viable options too. You can convert the content of the file to a Base64 string and store that in your source code. On application startup (func init()) or when needed, you can decode it to the original []byte content. Go has nice support for Base64 encoding in the encoding/base64 package.
Converting a (binary) file to base64 string is as simple as:
data, err := ioutil.ReadFile("someimage.png")
if err != nil {
	panic(err)
}
fmt.Println(base64.StdEncoding.EncodeToString(data))
Store the resulting base64 string in your source code, e.g. as a const.
Decoding it is just one function call:
const imgBase64 = "<insert base64 string here>"
data, err := base64.StdEncoding.DecodeString(imgBase64) // data is of type []byte
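A sketch of the startup-decoding pattern mentioned above; imgBase64 here is just base64("hello") so the example stays self-contained and runnable:

package main

import (
	"encoding/base64"
	"fmt"
)

// In real code this constant would hold the generated base64 of your file.
const imgBase64 = "aGVsbG8="

var imgdata []byte

// Decode the embedded data once, at application startup.
func init() {
	var err error
	imgdata, err = base64.StdEncoding.DecodeString(imgBase64)
	if err != nil {
		panic(err) // corrupt embedded data is a programmer error
	}
}

func main() {
	fmt.Printf("%s\n", imgdata) // prints: hello
}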
Storing as quoted string
Storing the quoted string literal of the binary data is more efficient than base64, although it may be longer in the source code. We can obtain the quoted form of any string using the strconv.Quote() function:
data, err := ioutil.ReadFile("someimage.png")
if err != nil {
	panic(err)
}
fmt.Println(strconv.Quote(string(data)))
For binary data containing the byte values 0 through 63, this is what the output looks like (try it on the Go Playground):
"\x00\x01\x02\x03\x04\x05\x06\a\b\t\n\v\f\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !\"#$%&'()*+,-./0123456789:;<=>?"
(Note that strconv.Quote() appends and prepends a quotation mark to it.)
You can directly use this quoted string in your source code, for example:
const imgdata = "\x00\x01\x02\x03\x04\x05\x06\a\b\t\n\v\f\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !\"#$%&'()*+,-./0123456789:;<=>?"
It is ready to use, no need to decode it; the unquoting is done by the Go compiler, at compile time.
You may also store it as a byte slice should you need it like that:
var imgdata = []byte("\x00\x01\x02\x03\x04\x05\x06\a\b\t\n\v\f\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !\"#$%&'()*+,-./0123456789:;<=>?")
The go-bindata package looks like it might be what you're interested in.
https://github.com/go-bindata/go-bindata
It will allow you to convert any static file into a function call that can be embedded in your code and will return a byte slice of the file content when called.
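For instance, after running go-bindata over a static/ directory, the generated bindata.go exposes an Asset function; a usage sketch (the asset path is an assumption):

// Generated with: go-bindata static/...
// bindata.go provides, among others: func Asset(name string) ([]byte, error)
func serveIndex(w http.ResponseWriter, r *http.Request) {
	data, err := Asset("static/index.html") // look up the embedded file
	if err != nil {
		http.NotFound(w, r) // no such asset was compiled in
		return
	}
	w.Write(data)
}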
Bundle React application
For example, say you have build output from React like the following:
build/favicon.ico
build/index.html
build/asset-manifest.json
build/static/css/**
build/static/js/**
build/manifest.json
When you use go:embed like this, it will serve the content at http://localhost:port/build/index.html, which is not what we want (the /build prefix is unexpected).
//go:embed build/*
var static embed.FS

// ...

http.Handle("/", http.FileServer(http.FS(static)))
In fact, we will need to take one more step to make it work as expected, by using fs.Sub:
package main

import (
	"embed"
	"io/fs"
	"log"
	"net/http"
)

//go:embed build/*
var static embed.FS

func main() {
	contentStatic, _ := fs.Sub(static, "build")
	http.Handle("/", http.FileServer(http.FS(contentStatic)))
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
Now, http://localhost:8080 should serve your web application as expected.
Credit to Amit Mittal.
Note: go:embed requires go 1.16 or higher.
There is also a more exotic way: I use a Maven plugin to build Go projects, and it allows using the JCP preprocessor to embed binary blocks and text files into the sources. In that case the code just looks like the line below (and an example can be found here):
var imageArray = []uint8{/*$binfile("./image.png","uint8[]")$*/}
As a popular alternative to go-bindata mentioned in another answer, mjibson/esc also embeds arbitrary files, but handles directory trees particularly conveniently.

How to iterate / stream a gzip file (containing a single csv)?

How to iterate over a gzipped file which contains a single text file (csv)?
Searching crates.io I found flate2 which has the following code example for decompression:
extern crate flate2;

use std::io::prelude::*;
use flate2::read::GzDecoder;

fn main() {
    let mut d = GzDecoder::new("...".as_bytes()).unwrap();
    let mut s = String::new();
    d.read_to_string(&mut s).unwrap();
    println!("{}", s);
}
How to stream a gzip csv file?
For streaming I/O operations Rust has the Read and Write traits. To iterate over input by lines you usually want the BufRead trait, which you can always get by wrapping a Read implementation in BufReader::new.
flate2 already operates with these traits; GzDecoder implements Read, and GzDecoder::new takes anything that implements Read.
Example decoding stdin (this doesn't work well on the playground, of course):
extern crate flate2;

use std::io;
use std::io::prelude::*;
use flate2::read::GzDecoder;

fn main() {
    let stdin = io::stdin();
    let stdin = stdin.lock(); // or just open any normal file
    let d = GzDecoder::new(stdin).expect("couldn't decode gzip stream");
    for line in io::BufReader::new(d).lines() {
        println!("{}", line.unwrap());
    }
}
You can then decode your lines with your usual ("without gzip") logic; perhaps make it generic by taking any input implementing BufRead.
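A small sketch of that generic approach (the function name is made up):

use std::io::BufRead;

// Accepts any buffered reader: a BufReader around a GzDecoder,
// a plain file, locked stdin, ...
fn process_lines<R: BufRead>(input: R) -> std::io::Result<()> {
    for line in input.lines() {
        let line = line?;
        // ... your usual ("without gzip") per-line logic here ...
        println!("{}", line);
    }
    Ok(())
}

With the example above you would call it as process_lines(io::BufReader::new(d)), and for an uncompressed file as process_lines(io::BufReader::new(File::open("data.csv")?)).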

Why does json.Encoder add an extra line?

json.Encoder seems to behave slightly differently from json.Marshal. Specifically, it adds a newline at the end of the encoded value. Any idea why that is? It looks like a bug to me.
package main

import "fmt"
import "encoding/json"
import "bytes"

func main() {
	var v string
	v = "hello"
	buf := bytes.NewBuffer(nil)
	json.NewEncoder(buf).Encode(v)
	b, _ := json.Marshal(&v)
	fmt.Printf("%q, %q", buf.Bytes(), b)
}
This outputs
"\"hello\"\n", "\"hello\""
Try it in the Playground
Because they explicitly add a newline character in Encoder.Encode. Here's the source code of that func, whose documentation comment even states that it adds a newline character:
https://golang.org/src/encoding/json/stream.go?s=4272:4319
// Encode writes the JSON encoding of v to the stream,
// followed by a newline character.
//
// See the documentation for Marshal for details about the
// conversion of Go values to JSON.
func (enc *Encoder) Encode(v interface{}) error {
	if enc.err != nil {
		return enc.err
	}
	e := newEncodeState()
	err := e.marshal(v)
	if err != nil {
		return err
	}

	// Terminate each value with a newline.
	// This makes the output look a little nicer
	// when debugging, and some kind of space
	// is required if the encoded value was a number,
	// so that the reader knows there aren't more
	// digits coming.
	e.WriteByte('\n')

	if _, err = enc.w.Write(e.Bytes()); err != nil {
		enc.err = err
	}
	encodeStatePool.Put(e)
	return err
}
Now, why did the Go developers do it, other than to "make the output look a little nicer"? One answer:
Streaming
The Go json Encoder is optimized for streaming (e.g. MB/GB/PB of JSON data). When streaming, it is typical to need a way to delimit where one value in the stream ends. In the case of Encoder.Encode(), that is a \n newline character. Sure, you can certainly write to a buffer. But you can also write to an io.Writer which would stream the block of v.
This is opposed to json.Marshal, which is generally discouraged if your input is from an untrusted source of unknown size (e.g. an ajax POST method to your web service - what if someone posts a 100MB json file?). Also, json.Marshal produces one final, complete set of JSON - e.g. you wouldn't expect to concatenate a few hundred Marshal results together. You'd use Encoder.Encode() for that: build a large set and write to the buffer, stream, file, io.Writer, etc.
Whenever in doubt whether something is a bug, I always look up the source - that's one of the advantages of Go: its source and compiler are just pure Go. Within [n]vim I use \gb to open the source definition in a browser with my .vimrc settings.
You can erase the newline by seeking backward in the stream:
f, _ := os.OpenFile(fname, ...)
encoder := json.NewEncoder(f)
encoder.Encode(v)
f.Seek(-1, 1) // step back over the trailing '\n' (whence 1 = io.SeekCurrent)
f.WriteString("other data ...")
They should let the user control this strange behavior, for example via:
a build option to disable it
Encoder.SetEOF(eof string)
Encoder.SetIndent(prefix, indent, eof string)
The Encoder writes a stream of documents. The extra whitespace terminates a JSON document in the stream.
A terminator is required for stream readers. Consider a stream containing these JSON documents: 1, 2, 3. Without the extra whitespace, the data on the wire is the sequence of bytes 123. This is a single JSON document with the number 123, not three documents.
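A short sketch of the reading side makes that concrete: json.Decoder consumes such a stream one document at a time, and the whitespace is what keeps the three numbers apart:

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

func main() {
	// Three JSON documents, terminated the way Encoder.Encode writes them.
	dec := json.NewDecoder(strings.NewReader("1\n2\n3\n"))
	for {
		var v int
		err := dec.Decode(&v)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Println(v) // prints 1, then 2, then 3
	}
}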

Read json file in and write without indentation

The following code takes a folder of JSON files (saved with indentation), opens each one, parses the content as JSON, and writes the serialized result to a new file.
The same task in Python works, so it is not the data. But here is the Rust version:
extern crate rustc_serialize;

use rustc_serialize::json;
use std::io::Read;
use std::fs::read_dir;
use std::fs::File;
use std::io::Write;
use std::io;
use std::str;

fn write_data(filepath: &str, data: json::Json) -> io::Result<()> {
    let mut ofile = try!(File::create(filepath));
    let encoded: String = json::encode(&data).unwrap();
    try!(ofile.write(encoded.as_bytes()));
    Ok(())
}

fn main() {
    let root = "/Users/bling/github/data/".to_string();
    let folder_path = root + &"texts";
    let paths = read_dir(folder_path).unwrap();
    for path in paths {
        let input_filename = format!("{}", path.unwrap().path().display());
        let output_filename = str::replace(&input_filename, "texts", "texts2");
        let mut data = String::new();
        let mut f = File::open(input_filename).unwrap();
        f.read_to_string(&mut data).unwrap();
        let json = json::Json::from_str(&data).unwrap();
        write_data(&output_filename, json).unwrap();
    }
}
Do you spot an error in my code, or did I get some language concepts wrong? Is the rustc-serialize crate used incorrectly? In the end it does not work as expected, never mind outperforming Python.
± % cargo run --release --verbose
Fresh rustc-serialize v0.3.16
Fresh fileprocessing v0.1.0 (file:///Users/bling/github/rust/fileprocessing)
Running `target/release/fileprocessing`
thread '<main>' panicked at 'called `Result::unwrap()` on an `Err` value: SyntaxError("unescaped control character in string", 759, 55)', ../src/libcore/result.rs:736
Process didn't exit successfully: `target/release/fileprocessing` (exit code: 101)
Why does it throw an error? Is my JSON serialization done wrong?
Can I get the object it fails on? What about encoding?
…is the code right, or is there something obviously wrong that more experience would catch?
Wild guess: if the same input file can be parsed by other JSON parsers (e.g. in Python), you may be hitting a rustc-serialize bug that was fixed in https://github.com/rust-lang-nursery/rustc-serialize/pull/142. Try updating?
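Separately, to the question "can I get the object it fails on": a small sketch that replaces the unwrap in the loop with a match, so the failing file is at least named (this assumes the earlier line is changed to File::open(&input_filename) so the name is not moved):

let json = match json::Json::from_str(&data) {
    Ok(json) => json,
    Err(e) => {
        println!("failed to parse {}: {:?}", input_filename, e);
        continue; // skip this file and keep going
    }
};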

Remove invalid UTF-8 characters from a string

I get this on json.Marshal of a list of strings:
json: invalid UTF-8 in string: "...ole\xc5\"
The reason is obvious, but how can I delete/replace such strings in Go? I've been reading the docs on the unicode and unicode/utf8 packages and there seems to be no obvious/quick way to do it.
In Python, for example, you have methods where the invalid characters can be deleted, replaced by a specified character, or a strict setting which raises an exception on invalid chars. How can I do the equivalent thing in Go?
UPDATE: I meant the reason for getting an exception (panic?) - an illegal char in what json.Marshal expects to be a valid UTF-8 string.
(How the illegal byte sequence got into that string is not important; the usual ways - bugs, file corruption, other programs that do not conform to Unicode, etc.)
In Go 1.13+, you can do this:
strings.ToValidUTF8("a\xc5z", "")
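In full, as a tiny runnable sketch:

package main

import (
	"fmt"
	"strings"
)

func main() {
	// Replace every run of invalid bytes with the empty string.
	fmt.Println(strings.ToValidUTF8("a\xc5z", "")) // az
}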
In Go 1.11+, it's also very easy to do the same using the Map function and utf8.RuneError like this:
fixUtf := func(r rune) rune {
	if r == utf8.RuneError {
		return -1
	}
	return r
}
fmt.Println(strings.Map(fixUtf, "a\xc5z"))
fmt.Println(strings.Map(fixUtf, "posic�o"))
Output:
az
posico
Playground: Here. (Note that strings.Map cannot distinguish an invalid byte from a legitimate U+FFFD character already present in the string, so this approach removes those as well.)
For example,
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "a\xc5z"
	fmt.Printf("%q\n", s)
	if !utf8.ValidString(s) {
		v := make([]rune, 0, len(s))
		for i, r := range s {
			if r == utf8.RuneError {
				_, size := utf8.DecodeRuneInString(s[i:])
				if size == 1 {
					continue
				}
			}
			v = append(v, r)
		}
		s = string(v)
	}
	fmt.Printf("%q\n", s)
}
Output:
"a\xc5z"
"az"
Unicode Standard
FAQ - UTF-8, UTF-16, UTF-32 & BOM
Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?
A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed by a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.
A conformant process must not interpret illegal or ill-formed byte
sequences as characters, however, it may take error recovery actions.
No conformant process may use irregular byte sequences to encode
out-of-band information.
Another way to do this, according to this answer, could be
s = string([]rune(s))
Example:
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "...ole\xc5"
	fmt.Println(s, utf8.Valid([]byte(s)))
	// Output: ...ole� false

	s = string([]rune(s))
	fmt.Println(s, utf8.Valid([]byte(s)))
	// Output: ...ole� true
}
Even though the result doesn't look "pretty", it nevertheless converts the string into valid UTF-8: the []rune conversion replaces each invalid byte with the replacement character U+FFFD, which is itself a valid character.