How to iterate over a gzipped file which contains a single text file (CSV)?
Searching crates.io I found flate2, which has the following code example for decompression:
use std::io::prelude::*;
use flate2::read::GzDecoder;

fn main() {
    // With flate2 1.0, GzDecoder::new no longer returns a Result;
    // decoding errors surface when reading instead.
    let mut d = GzDecoder::new("...".as_bytes());
    let mut s = String::new();
    d.read_to_string(&mut s).unwrap();
    println!("{}", s);
}
How to stream a gzipped CSV file?
For streaming I/O operations Rust has the Read and Write traits. To iterate over input by lines you usually want the BufRead trait, which you can always get by wrapping a Read implementation in BufReader::new.
flate2 already operates with these traits; GzDecoder implements Read, and GzDecoder::new takes anything that implements Read.
Example decoding stdin (doesn't work well on playground of course):
use std::io;
use std::io::prelude::*;
use flate2::read::GzDecoder;

fn main() {
    let stdin = io::stdin();
    let stdin = stdin.lock(); // or just open any normal file
    // With flate2 1.0, GzDecoder::new returns the decoder directly;
    // a corrupt stream surfaces as an error on the first read instead.
    let d = GzDecoder::new(stdin);
    for line in io::BufReader::new(d).lines() {
        println!("{}", line.unwrap());
    }
}
You can then decode your lines with your usual ("without gzip") logic; perhaps make it generic by taking any input implementing BufRead.
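For example, a minimal sketch of that generic approach (process_lines is a hypothetical name, and the naive split is only a stand-in for real CSV parsing):
use std::io::BufRead;

// Hypothetical helper: works on any buffered reader, gzipped or not.
fn process_lines<R: BufRead>(reader: R) -> std::io::Result<()> {
    for line in reader.lines() {
        let line = line?;
        // Stand-in for real CSV handling; use the csv crate for quoting, escaping, etc.
        let fields: Vec<&str> = line.split(';').collect();
        println!("{:?}", fields);
    }
    Ok(())
}
You could call this as process_lines(io::BufReader::new(d)) for the gzipped case, or with any plain file wrapped in a BufReader.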
Related
How to write a vector to a JSON file in Rust?
Code:
use std::fs::File;
use std::io::prelude::*;
let vec1 = vec![1.0,2.0,2.1,0.6];
let mut file = File::create("results.json").unwrap();
let data = serde_json::to_string(&vec1).unwrap();
file.write(&data);
Error:
mismatched types
expected reference `&[u8]`
found reference `&std::string::String`rustc(E0308)
Instead of writing the data to an in-memory string first, you can also write it directly to the file:
use std::fs::File;
use std::io::{BufWriter, Write};
fn main() -> std::io::Result<()> {
    let vec = vec![1, 2, 3];
    let file = File::create("a")?;
    let mut writer = BufWriter::new(file);
    serde_json::to_writer(&mut writer, &vec)?;
    writer.flush()?;
    Ok(())
}
This approach has a lower memory footprint and is generally preferred. Note that you should use buffered writes, since serialization could otherwise result in many small writes to the file, which would severely reduce performance.
If you need to write the data to memory first for some reason, I suggest using serde_json::to_vec() instead of serde_json::to_string(), since that function will give you a Vec<u8> immediately.
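As a sketch of that in-memory variant (using the file name from the question; write_all is used instead of write to avoid partial writes):
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let vec = vec![1.0, 2.0, 2.1, 0.6];
    // to_vec() yields Vec<u8> directly, so no &String -> &[u8] conversion is needed.
    let data = serde_json::to_vec(&vec).expect("serialization failed");
    let mut file = File::create("results.json")?;
    file.write_all(&data)?;
    Ok(())
}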
I'd like to read multiple JSON objects from a file/reader in Rust, one at a time. Unfortunately serde_json::from_reader(...) just reads until end-of-file; there doesn't seem to be any way to use it to read a single object or to lazily iterate over the objects.
Is there any way to do this? Using serde_json would be ideal, but if there's a different library I'd be willing to use that instead.
At the moment I'm putting each object on a separate line and parsing them individually, but I would really prefer not to need to do this.
Example of desired use (serde_json::iter_from_reader is hypothetical; it's the API I'm looking for)
main.rs
use serde_json;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let stdin = std::io::stdin();
    let stdin = stdin.lock();
    for item in serde_json::iter_from_reader(stdin) {
        println!("Got {:?}", item);
    }
    Ok(())
}
in.txt
{"foo": ["bar", "baz"]} 1 2 [] 4 5 6
example session
Got Object({"foo": Array([String("bar"), String("baz")])})
Got Number(1)
Got Number(2)
Got Array([])
Got Number(4)
Got Number(5)
Got Number(6)
This was a pain when I wanted to do it in Python, but fortunately in Rust this is a directly-supported feature of the de-facto-standard serde_json crate! It isn't exposed as a single convenience function, but we just need to create a serde_json::Deserializer reading from our file/reader, then use its .into_iter() method to get a StreamDeserializer iterator yielding Results containing serde_json::Value JSON values.
use serde_json; // 1.0.39
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let stdin = std::io::stdin();
    let stdin = stdin.lock();
    let deserializer = serde_json::Deserializer::from_reader(stdin);
    let iterator = deserializer.into_iter::<serde_json::Value>();
    for item in iterator {
        println!("Got {:?}", item?);
    }
    Ok(())
}
One thing to be aware of: if a syntax error is encountered, the iterator will start to produce an infinite sequence of error results and never move on. You need to make sure you handle the errors inside of the loop, or the loop will never end. In the snippet above, we do this by using the ? question mark operator to break the loop and return the first serde_json::Result::Err from our function.
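The same iterator also works for concrete types, not just serde_json::Value. A minimal sketch, assuming serde's derive feature and a hypothetical Point type:
use serde::Deserialize;

// Hypothetical record type; into_iter::<T>() works for any T: Deserialize.
#[derive(Debug, Deserialize)]
struct Point {
    x: f64,
    y: f64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = r#"{"x":1.0,"y":2.0} {"x":3.0,"y":4.0}"#;
    let deserializer = serde_json::Deserializer::from_str(data);
    for point in deserializer.into_iter::<Point>() {
        println!("Got {:?}", point?);
    }
    Ok(())
}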
With the csv crate and the latest Rust version 1.31.0, I want to read CSV files with ANSI (Windows-1252) encoding as easily as UTF-8 ones.
Things I have tried (with no luck) after reading the whole file into a Vec<u8>:
CString
OsString
Indeed, in my company we have a lot of ANSI-encoded CSV files.
Also, if possible, I would like to read line by line (CRLF endings) rather than loading the entire file into a Vec<u8>, as many of the files are big (50 MB or more…).
In the file Cargo.toml, I have this dependency:
[dependencies]
csv = "1"
test.csv consists of the following content, saved with Windows-1252 encoding:
Café;au;lait
Café;au;lait
The code in main.rs file:
extern crate csv;
use std::error::Error;
use std::fs::File;
use std::io::BufReader;
use std::path::Path;
use std::process;
fn example() -> Result<(), Box<Error>> {
    let file_name = r"test.csv";
    let file_handle = File::open(Path::new(file_name))?;
    let reader = BufReader::new(file_handle);
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_reader(reader);
    // println!("ANSI");
    // for result in rdr.byte_records() {
    //     let record = result?;
    //     println!("{:?}", record);
    // }
    println!("UTF-8");
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

fn main() {
    if let Err(err) = example() {
        println!("error running example: {}", err);
        process::exit(1);
    }
}
The output is:
UTF-8
error running example: CSV parse error: record 0 (line 1, field: 0, byte: 0): invalid utf-8: invalid UTF-8 in field 0 near byte index 3
error: process didn't exit successfully: `target\debug\test-csv.exe` (exit code: 1)
When using rdr.byte_records() (uncommenting the relevant part of code), the output is:
ANSI
ByteRecord(["Caf\\xe9", "au", "lait"])
I suspect this question is underspecified. In particular, it's not clear why your use of the ByteRecord API is insufficient. In the csv crate, byte records exist for exactly cases like this, where your CSV data isn't strictly UTF-8 but is in an alternative encoding, such as Windows-1252, that is ASCII compatible. (An ASCII-compatible encoding is an encoding in which ASCII is a subset. Windows-1252 and UTF-8 are both ASCII compatible. UTF-16 is not.) Your code sample above shows that you're using byte records, but doesn't explain why this is insufficient.
With that said, if your goal is to get your data into Rust's string data type (String/&str), then your only option is to transcode the contents of your CSV data from Windows-1252 to UTF-8. This is necessary because Rust's string data type uses UTF-8 for its in-memory representation. You cannot have a Rust String/&str that is Windows-1252 encoded because Windows-1252 is not a subset of UTF-8.
Other comments have recommended the use of the encoding crate. However, I would instead recommend the use of encoding_rs, if your use case aligns with the same use cases solved by the Encoding Standard, which is specifically geared towards the web. Fortunately, I believe such an alignment exists.
In order to satisfy your criteria for reading CSV data in a streaming fashion without first loading the entire contents into memory, you need to use a wrapper around the encoding_rs crate that implements streaming decoding for you. The encoding_rs_io crate provides this for you. (It's used inside of ripgrep to do fast streaming decoding before searching UTF-8.)
Here is an example program that puts all of the above together, using Rust 2018:
use std::fs::File;
use std::process;
use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;
fn main() {
    if let Err(err) = try_main() {
        eprintln!("{}", err);
        process::exit(1);
    }
}

fn try_main() -> csv::Result<()> {
    let file = File::open("test.csv")?;
    let transcoded = DecodeReaderBytesBuilder::new()
        .encoding(Some(WINDOWS_1252))
        .build(file);
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_reader(transcoded);
    for result in rdr.records() {
        let r = result?;
        println!("{:?}", r);
    }
    Ok(())
}
with the Cargo.toml:
[package]
name = "so53826986"
version = "0.1.0"
edition = "2018"
[dependencies]
csv = "1"
encoding_rs = "0.8.13"
encoding_rs_io = "0.1.3"
And the output:
$ cargo run --release
Compiling so53826986 v0.1.0 (/tmp/so53826986)
Finished release [optimized] target(s) in 0.63s
Running `target/release/so53826986`
StringRecord(["Café", "au", "lait"])
In particular, if you swap out rdr.records() for rdr.byte_records(), then we can see more clearly what happened:
$ cargo run --release
Compiling so53826986 v0.1.0 (/tmp/so53826986)
Finished release [optimized] target(s) in 0.61s
Running `target/release/so53826986`
ByteRecord(["Caf\\xc3\\xa9", "au", "lait"])
Namely, your input contained Caf\xE9, but the byte record now contains Caf\xC3\xA9. This is a result of translating the Windows-1252 codepoint value of 233 (encoded as its literal byte, \xE9) to U+00E9 LATIN SMALL LETTER E WITH ACUTE, which is UTF-8 encoded as \xC3\xA9.
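A tiny sketch of that translation in isolation, using encoding_rs's decode directly (not part of the CSV pipeline above, just to illustrate the byte-level change):
use encoding_rs::WINDOWS_1252;

fn main() {
    // \xE9 in Windows-1252 is U+00E9, which UTF-8 encodes as \xC3\xA9.
    let (decoded, _, had_errors) = WINDOWS_1252.decode(b"Caf\xE9");
    assert!(!had_errors);
    assert_eq!(decoded.as_bytes(), b"Caf\xC3\xA9");
    println!("{}", decoded); // Café
}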
The idea is to get a cursor from Mongo and serialize the result set to JSON in a string. I have working code:
extern crate bson;
extern crate mongodb;
use mongodb::db::ThreadedDatabase;
use mongodb::{Client, ThreadedClient};
extern crate serde;
extern crate serde_json;
fn main() {
    let client =
        Client::connect("localhost", 27017).expect("Failed to initialize standalone client.");
    let coll = client.db("foo").collection("bar");
    let cursor = coll.find(None, None).ok().expect("Failed to execute find.");
    let docs: Vec<_> = cursor.map(|doc| doc.unwrap()).collect();
    let serialized = serde_json::to_string(&docs).unwrap();
    println!("{}", serialized);
}
Is there a better way to do this? If not I will close this thread.
This is the sort of situation that serde-transcode was made for. It converts directly between serde formats: it takes a Deserializer and a Serializer, then directly calls the corresponding serialize method for each deserialized item. Conceptually this is a bit similar to using serde_json::Value as an intermediate format, but it may include some extra type information if available in the input format.
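Since the BSON deserializer discussed below is not public, here is a minimal self-contained sketch of serde-transcode with JSON on both sides, just to show the mechanics:
use std::io::Write;

fn main() {
    // Transcode pretty-printed JSON to compact JSON without building
    // an intermediate serde_json::Value.
    let input = r#"{ "foo": [1, 2, 3] }"#;
    let mut deserializer = serde_json::Deserializer::from_str(input);
    let mut buffer = Vec::new();
    {
        let mut serializer = serde_json::Serializer::new(&mut buffer);
        serde_transcode::transcode(&mut deserializer, &mut serializer).unwrap();
    }
    std::io::stdout().write_all(&buffer).unwrap();
}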
Unfortunately, the bson crate does not expose bson::de::raw::Deserializer or bson::ser::raw::Serializer, so this is not currently possible. If you look in the documentation, the exported Deserializer and Serializer actually refer to different structs, which handle the conversion to and from the Bson enum.
If bson::de::raw::Deserializer were public, then this code would have the desired effect. Hopefully this will be helpful to anyone who has a similar problem (or anyone who wants this enough to raise an issue on their repository).
let mut buffer = Vec::new();
// Manually add array separators, because the proper way requires going
// through DeserializeSeed and that is a whole other topic.
buffer.push(b'[');
while cursor.advance().await? {
    let bytes = cursor.current().as_bytes();
    // Create the deserializer and serializer. Note that serde_json's
    // Serializer implements serde::Serializer through &mut, so it is
    // passed by mutable reference.
    let deserializer = bson::de::raw::Deserializer::new(bytes, false);
    let mut serializer = serde_json::Serializer::new(&mut buffer);
    // Transcode between the formats
    serde_transcode::transcode(deserializer, &mut serializer).unwrap();
    // Manually add an array separator
    buffer.push(b',');
}
// Remove the trailing comma and add the closing bracket
if buffer.len() > 1 {
    buffer.pop();
}
buffer.push(']');
// Do something with the result
println!("{}", String::from_utf8(buffer).unwrap());
The following code takes a folder of JSON files (saved with indentation), opens each one, gets the content, serializes it back to JSON, and writes it to a new file.
The same task works in Python, so it is not the data. But here is the Rust version:
extern crate rustc_serialize;
use rustc_serialize::json;
use std::io::Read;
use std::fs::read_dir;
use std::fs::File;
use std::io::Write;
use std::io;
use std::str;
fn write_data(filepath: &str, data: json::Json) -> io::Result<()> {
    let mut ofile = try!(File::create(filepath));
    let encoded: String = json::encode(&data).unwrap();
    try!(ofile.write(encoded.as_bytes()));
    Ok(())
}

fn main() {
    let root = "/Users/bling/github/data/".to_string();
    let folder_path = root + &"texts";
    let paths = read_dir(folder_path).unwrap();
    for path in paths {
        let input_filename = format!("{}", path.unwrap().path().display());
        let output_filename = str::replace(&input_filename, "texts", "texts2");
        let mut data = String::new();
        let mut f = File::open(input_filename).unwrap();
        f.read_to_string(&mut data).unwrap();
        let json = json::Json::from_str(&data).unwrap();
        write_data(&output_filename, json).unwrap();
    }
}
Do you spot an error in my code, or did I get some language concepts wrong? Is the rustc-serialize crate used wrongly? In the end it does not work as expected, namely to outperform Python.
± % cargo run --release --verbose
Fresh rustc-serialize v0.3.16
Fresh fileprocessing v0.1.0 (file:///Users/bling/github/rust/fileprocessing)
Running `target/release/fileprocessing`
thread '<main>' panicked at 'called `Result::unwrap()` on an `Err` value: SyntaxError("unescaped control character in string", 759, 55)', ../src/libcore/result.rs:736
Process didn't exit successfully: `target/release/fileprocessing` (exit code: 101)
Why does it throw an error? Is my JSON serializing done wrong?
Can I get the object it fails on? What about encoding?
...is the code right or is there something obvious wrong with some more experience?
Wild guess: if the same input file can be parsed by other JSON parsers (e.g. in Python), you may be hitting a rustc-serialize bug that was fixed in https://github.com/rust-lang-nursery/rustc-serialize/pull/142. Try to update?
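Not part of the original answer, but for comparison, a sketch of the same pipeline on the modern serde_json crate (rustc-serialize is deprecated; the paths are taken from the question):
use std::fs::{self, File};
use std::io::{BufReader, BufWriter, Write};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    for entry in fs::read_dir("/Users/bling/github/data/texts")? {
        let input = entry?.path();
        let output = input.to_string_lossy().replace("texts", "texts2");
        // Parse each file into a serde_json::Value and re-serialize it compactly.
        let json: serde_json::Value =
            serde_json::from_reader(BufReader::new(File::open(&input)?))?;
        let mut writer = BufWriter::new(File::create(&output)?);
        serde_json::to_writer(&mut writer, &json)?;
        writer.flush()?;
    }
    Ok(())
}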