When I read a CSV file that includes Chinese characters using the csv crate, I get an error.
extern crate csv;

use std::thread;

fn main() {
    let mut rdr =
        csv::Reader::from_file("C:\\Users\\Desktop\\test.csv").unwrap().has_headers(false);
    for record in rdr.decode() {
        let (a, b): (String, String) = record.unwrap();
        println!("a:{},b:{}", a, b);
    }
    thread::sleep_ms(500000);
}
The error:
Running `target\release\rust_Work.exe`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Decode("Could not convert bytes \'FromUtf8Error { bytes: [208, 213, 195, 251], error: Utf8Error { valid_up_to: 0 } }\' to UTF-8.")', ../src/libcore\result.rs:788
note: Run with `RUST_BACKTRACE=1` for a backtrace.
error: Process didn't exit successfully: `target\release\rust_Work.exe` (exit code: 101)
test.csv:
姓名 性别 年纪 分数 等级
小二 男 12 88 良好
小三 男 13 89 良好
小四 男 14 91 优秀
I'm not sure what could be done to make the error message more clear:
Decode("Could not convert bytes 'FromUtf8Error { bytes: [208, 213, 195, 251], error: Utf8Error { valid_up_to: 0 } }' to UTF-8.")
FromUtf8Error is documented in the standard library, and the text of the error says "Could not convert bytes to UTF-8" (although there's some extra detail in the middle).
Simply put, your data isn't in UTF-8 and it must be. That's all that the Rust standard library (and thus most libraries) really deal with. You will need to figure out what encoding it is in and then find some way of converting from that to UTF-8. There may be a crate to help with either of those cases.
Perhaps even better, you can save the file as UTF-8 from the beginning. Sadly, it's relatively common for people to hit this issue when using Excel, because Excel does not have a way to easily export UTF-8 CSV files. It always writes a CSV file in the system locale encoding.
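To see concretely why those exact bytes fail, here is a stdlib-only sketch; the byte values are copied from the panic message above (they are likely GB-encoded Chinese text, which is not valid UTF-8):

```rust
fn main() {
    // The exact bytes from the panic message; they are not valid UTF-8.
    let bytes = vec![208u8, 213, 195, 251];

    // Strict conversion fails immediately, matching `valid_up_to: 0`.
    let err = String::from_utf8(bytes.clone()).unwrap_err();
    println!("valid_up_to = {}", err.utf8_error().valid_up_to()); // 0

    // A lossy conversion "succeeds" but replaces the data with U+FFFD,
    // destroying it, which is why transcoding is the real fix.
    println!("lossy = {:?}", String::from_utf8_lossy(&bytes));
}
```

This demonstrates the failure, not the fix; the fix is to decode from the source encoding (e.g. GB18030) into a String first, as shown elsewhere in this thread.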
I found a way to solve it. Thanks, all.
extern crate csv;
extern crate rustc_serialize;
extern crate encoding;

use encoding::{Encoding, DecoderTrap};
use encoding::all::GB18030;
use std::fs::File;
use std::io::prelude::*;

fn main() {
    let path = "C:\\Users\\Desktop\\test.csv";
    let mut f = File::open(path).expect("cannot open file");
    let mut bytes: Vec<u8> = Vec::new();
    f.read_to_end(&mut bytes).expect("cannot read file");
    // Decode the GB18030 bytes into a UTF-8 String before handing it to csv.
    let mut chars = String::new();
    GB18030.decode_to(&bytes, DecoderTrap::Ignore, &mut chars).expect("cannot decode");
    let mut rdr = csv::Reader::from_string(chars).has_headers(true);
    for row in rdr.decode() {
        let (x, y, r): (String, String, String) = row.unwrap();
        println!("({}, {}): {:?}", x, y, r);
    }
}
Part 1: Read Unicode (Chinese or not) characters:
The easiest way to achieve your goal is to use the read_to_string function that mutates the String you pass to it, appending the Unicode content of your file to that passed String:
use std::io::prelude::*;
use std::fs::File;

fn main() {
    let mut f = File::open("file.txt").unwrap();
    let mut buffer = String::new();
    f.read_to_string(&mut buffer).unwrap();
    println!("{}", buffer);
}
Part 2: Parse a CSV file, its delimiter being a ',':
extern crate regex;

use regex::Regex;
use std::io::prelude::*;
use std::fs::File;

fn main() {
    let mut f = File::open("file.txt").unwrap();
    let mut buffer = String::new();
    let delimiter = ",";
    f.read_to_string(&mut buffer).unwrap();
    let modified_buffer = buffer.replace("\n", delimiter);
    // Build a regex that matches three delimiter-separated fields.
    let mut regex_str = "([^".to_string();
    regex_str.push_str(delimiter);
    regex_str.push_str("]+)");
    let mut final_part = "".to_string();
    final_part.push_str(delimiter);
    final_part.push_str("?");
    regex_str.push_str(&final_part);
    let regex_str_copy = regex_str.clone();
    regex_str.push_str(&regex_str_copy);
    regex_str.push_str(&regex_str_copy);
    let re = Regex::new(&regex_str).unwrap();
    for cap in re.captures_iter(&modified_buffer) {
        let (s1, s2, dist): (String, String, usize) =
            (cap[1].to_string(), cap[2].to_string(), cap[3].parse::<usize>().unwrap());
        println!("({}, {}): {}", s1, s2, dist);
    }
}
I have this function to add logs to a file:
let log_datas f file (datas : 'a list) =
  let oc = open_out file.file_name in
  List.iter (fun x -> Printf.fprintf oc "%s" @@ f x) datas;
  close_out oc

let () = let f = string_of_int in log_datas f {file_name = "log"} [1; 2]
Which works.
I tried to make it accept a string list as its argument by default:
let log_datas ?(f : 'a -> string = fun x -> x ^ "\n") file (datas : 'a list) =
  let oc = open_out file.file_name in
  List.iter (fun x -> Printf.fprintf oc "%s" @@ f x) datas;
  close_out oc
but when I try
let () = let f = string_of_int in log_datas ~f {file_name="log"} [1;2]
I get a type error
23 | let () = let f = string_of_int in log_datas ~f {file_name="log"} [1;2]
^
Error: This expression has type int -> string
but an expression was expected of type string -> string
Type int is not compatible with type string
An obvious solution would be to write two functions, one with no f argument and one with an f argument. But I was wondering: is there any other workaround?
No, it is not possible: you have to pass both parameters explicitly to keep the function polymorphic. Basically, your example could be distilled to,
let log ?(to_string = string_of_int) data =
  print_endline (to_string data)
If OCaml would keep it polymorphic then the following would be allowed,
log "hello"
and string_of_int "hello" is not well-typed.
So you have to keep both parameters required, e.g.,
let log to_string data =
  print_endline (to_string data)
I would also suggest looking into the Format module and defining your own polymorphic function that uses format specification to define how data of different types are written, e.g.,
let log fmt =
  Format.kasprintf print_endline fmt
Substitute print_endline with your own logging facility. The log function can then be used like printf, e.g.,
log "%s %d" "hello" 42
My program reads CSV files using the csv crate into Vec<Vec<String>>, where the outer vector represents rows, and the inner separates rows into columns.
use std::{time, thread::{sleep, park}};

fn main() {
    different_scope();
    println!("Parked");
    park();
}

fn different_scope() {
    println!("Reading csv");
    let _data = read_csv("data.csv");
    println!("Sleeping");
    sleep(time::Duration::from_secs(4));
    println!("Going out of scope");
}

fn read_csv(path: &str) -> Vec<Vec<String>> {
    let mut rdr = csv::Reader::from_path(path).unwrap();
    rdr.records()
        .map(|row| {
            row.unwrap()
                .iter()
                .map(|column| column.to_string())
                .collect()
        })
        .collect()
}
I'm looking at RAM usage with htop and this uses 2.5GB of memory to read a 250MB CSV file.
Here's the contents of cat /proc/<my pid>/status
Name: (name)
Umask: 0002
State: S (sleeping)
Tgid: 18349
Ngid: 0
Pid: 18349
PPid: 18311
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 1000 1000
FDSize: 256
Groups: 4 24 27 30 46 118 128 133 1000
NStgid: 18349
NSpid: 18349
NSpgid: 18349
NSsid: 18311
VmPeak: 2748152 kB
VmSize: 2354932 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 2580156 kB
VmRSS: 2345944 kB
RssAnon: 2343900 kB
RssFile: 2044 kB
RssShmem: 0 kB
VmData: 2343884 kB
VmStk: 136 kB
VmExe: 304 kB
VmLib: 2332 kB
VmPTE: 4648 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 1
SigQ: 0/127783
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000180000440
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: ffffffff
Cpus_allowed_list: 0-31
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 9
nonvoluntary_ctxt_switches: 293
When I drop the variable, it frees the correct amount (approx. 250MB), but there's still 2.2GB left. I'm unable to read more than 2-3GB before all my memory is used and the process is killed (cargo prints "Killed").
How do I free the excess memory while the CSV is being read?
I need to process every line, but in this case I don't need to hold all this data at once, but what if I did?
I asked a related question and I was pointed to What is Rust strategy to uncommit and return memory to the operating system? which was helpful in understanding the problem, but I don't know how to solve it.
My understanding is that I should switch my program to a different memory allocator, but brute-forcing through all the allocators I can find feels like an ignorant approach.
For questions about memory, it's good to develop a technique where you quantify your memory usage. You can do this by examining your representation. In this case, that's Vec<Vec<String>>. In particular, if you have a 250MB CSV file that is represented as a sequence of a sequence of fields, then it is not necessarily the case that you'll only use 250MB of memory. You need to consider the overhead of your representation.
For a Vec<Vec<String>>, we can dismiss the overhead of the outer Vec<...>, as it will (in your program) live on the stack and not the heap. It is the inner Vec<String> values that live on the heap.
So if your CSV file has M records and each record has N fields, then there will be M instances of Vec<String> and M * N instances of String. The overhead of both a Vec<T> and a String is 3 * sizeof(word), with one word being the pointer to the data, another word being the length and yet another being the capacity. (That's 24 bytes for a 64-bit target.) So your total overhead for a 64-bit target is (M * 24) + (M * N * 24).
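As a sanity check on those constants, std::mem::size_of reports the three-word headers directly; this sketch then evaluates the overhead formula for an illustrative M and N (the numbers are made up, not tied to any particular file):

```rust
use std::mem::size_of;

fn main() {
    // On a 64-bit target, Vec<T> and String headers are both three words:
    // pointer + length + capacity.
    assert_eq!(size_of::<Vec<String>>(), 24);
    assert_eq!(size_of::<String>(), 24);

    // (M * 24) + (M * N * 24) for an illustrative M records x N fields.
    let (m, n) = (1_000_000u64, 7u64);
    let overhead = m * size_of::<Vec<String>>() as u64
        + m * n * size_of::<String>() as u64;
    println!("header overhead: {} bytes", overhead); // 192000000
}
```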
Let's test this experimentally. Since you didn't share your CSV input (you really should in the future), I'll bring my own. It's 145MB, has M=3,173,958 records with N=7 fields per record. So the total overhead for your representation is (3173958 * 24) + (3173958 * 7 * 24) = 609,399,936 bytes, or 609 MB. Let's test that with a real program:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let rdr = csv::Reader::from_path(input_path)?;
    let mut records: Vec<Vec<String>> = vec![];
    for result in rdr.into_records() {
        let mut record: Vec<String> = vec![];
        for column in result?.iter() {
            record.push(column.to_string());
        }
        records.push(record);
    }
    println!("{}", records.len());
    Ok(())
}
(I've added some unnecessary type annotations in a couple places to make the code a little clearer, particularly with respect to our representation.) So let's run this program (whose only dependency is csv = "1" in my Cargo.toml):
$ echo $TIMEFMT
real %*E user %*U sys %*S maxmem %M MB faults %F
$ cargo b --release
$ time ./target/release/csvmem /m/sets/csv/pop/worldcitiespop-nice.csv
3173958
real 1.542
user 1.236
sys 0.296
maxmem 1287 MB
faults 0
The time utility here reports peak memory usage, which is actually a bit higher than what we might expect: 609 + 145 = 754MB. I don't know enough about allocators to reason through the difference completely; it could be that the system allocator I'm using allocates bigger chunks than are actually needed.

Let's make our representation a bit more efficient by using a Box<str> instead of String. We sacrifice the ability to grow the string, but in exchange we save 8 bytes of overhead per field. So our new overhead calculation is (3173958 * 24) + (3173958 * 7 * 16) = 431,658,288 bytes, or 431MB, for a difference of 609 - 431 = 178MB. Let's test the new representation and see what our delta is:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let rdr = csv::Reader::from_path(input_path)?;
    let mut records: Vec<Vec<Box<str>>> = vec![];
    for result in rdr.into_records() {
        let mut record: Vec<Box<str>> = vec![];
        for column in result?.iter() {
            record.push(column.to_string().into());
        }
        records.push(record);
    }
    println!("{}", records.len());
    Ok(())
}
And to compile and run:
$ cargo b --release
$ time ./target/release/csvmem /m/sets/csv/pop/worldcitiespop-nice.csv
3173958
real 1.459
user 1.183
sys 0.266
maxmem 1093 MB
faults 0
for a total delta of 194MB, which is pretty close to our guess.
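The per-field saving is visible directly in the type sizes: Box<str> is a two-word fat pointer (pointer + length) with no capacity word, which you can confirm on a 64-bit target:

```rust
use std::mem::size_of;

fn main() {
    // String: pointer + length + capacity; Box<str>: pointer + length.
    assert_eq!(size_of::<String>(), 24);
    assert_eq!(size_of::<Box<str>>(), 16);

    // The conversion also sheds any spare capacity in the allocation.
    let field: Box<str> = String::from("field").into();
    assert_eq!(&*field, "field");
}
```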
We can optimize the representation even further by using a Vec<Box<[Box<str>]>>:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let rdr = csv::Reader::from_path(input_path)?;
    let mut records: Vec<Box<[Box<str>]>> = vec![];
    for result in rdr.into_records() {
        let mut record: Vec<Box<str>> = vec![];
        for column in result?.iter() {
            record.push(column.to_string().into());
        }
        records.push(record.into());
    }
    println!("{}", records.len());
    Ok(())
}
That gives a peak memory usage of 1069 MB. So not much of a savings.
However, the best thing we can do is use a csv::StringRecord:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let rdr = csv::Reader::from_path(input_path)?;
    let mut records = vec![];
    for result in rdr.into_records() {
        let record = result?;
        records.push(record);
    }
    println!("{}", records.len());
    Ok(())
}
And that gives a peak memory usage of 727MB. The secret is that a StringRecord stores fields inline without that second layer of indirection. It ends up saving quite a bit!
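The inline-storage idea can be sketched without the csv crate: keep one flat buffer of field bytes plus a list of end offsets, instead of one heap allocation per field. This is only an illustration of the technique, not StringRecord's actual layout:

```rust
// One allocation for all field bytes, one for the offsets, rather than
// a separate String per field.
struct FlatRecord {
    buf: String,
    ends: Vec<usize>,
}

impl FlatRecord {
    fn from_fields(fields: &[&str]) -> FlatRecord {
        let mut buf = String::new();
        let mut ends = Vec::new();
        for f in fields {
            buf.push_str(f);
            ends.push(buf.len());
        }
        FlatRecord { buf, ends }
    }

    // Fields are returned as slices into the shared buffer.
    fn get(&self, i: usize) -> &str {
        let start = if i == 0 { 0 } else { self.ends[i - 1] };
        &self.buf[start..self.ends[i]]
    }
}

fn main() {
    let rec = FlatRecord::from_fields(&["a", "bb", "ccc"]);
    assert_eq!(rec.get(1), "bb");
    println!("{} fields, 2 heap allocations", rec.ends.len());
}
```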
Of course, if you don't need to store all of the records in memory at once, then you shouldn't. And the CSV crate supports that just fine:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let mut count = 0;
    let rdr = csv::Reader::from_path(input_path)?;
    for result in rdr.into_records() {
        let _ = result?;
        count += 1;
    }
    println!("{}", count);
    Ok(())
}
And that program's peak memory usage is only 9MB, as you'd expect of a streaming implementation. (Technically, you can use no heap memory at all if you drop down and use the csv-core crate.)
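The same streaming shape can be sketched with just the standard library; this naive version counts newline-delimited records one at a time and never holds more than the current line (unlike the csv crate, it does not handle quoting or embedded newlines):

```rust
use std::io::{BufRead, Cursor};

// Counts records one line at a time; each line is dropped as soon as
// it is processed, so memory use stays flat regardless of input size.
fn count_records<R: BufRead>(reader: R) -> usize {
    reader.lines().filter_map(Result::ok).count()
}

fn main() {
    let data = "name,score\nalice,88\nbob,91\n";
    println!("{} lines (incl. header)", count_records(Cursor::new(data))); // 3
}
```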
I have this code but it doesn't compile:
use rand::Rng;
use std::io;

fn main() {
    println!("Guess the number!");
    let secret_number = rand::thread_rng().gen_range(0, 101);
    println!("The secret number is: {}", secret_number);
    println!("Please input your guess.");
    let mut guess = String::new();
    io::stdin()
        .read_line(&mut guess)
        .expect("Failed to read line");
    println!("You guessed: {}", guess);
}
Compile error:
error[E0061]: this function takes 1 argument but 2 arguments were supplied
--> src/main.rs:7:44
|
7 | let secret_number = rand::thread_rng().gen_range(0, 101);
| ^^^^^^^^^ - --- supplied 2 arguments
| |
| expected 1 argument
The gen_range method expects a single Range argument, not two i32 arguments, so change:
let secret_number = rand::thread_rng().gen_range(0, 101);
to:
let secret_number = rand::thread_rng().gen_range(0..101);
And it will compile and work. Note: the method signature was updated in version 0.8.0 of the rand crate; in all prior versions of the crate your original code should work as-is.
Run cargo update; it will list the version resolved for rand. Then update the version in Cargo.toml under dependencies:
[dependencies]
rand = "0.8.3"
Hopefully this solves the problem.
I have the following Haskell to call my WCF Service:
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}

module Main where

import Data.Aeson
import Data.Dynamic
import Data.Aeson.Lens
import Data.ByteString.Lazy as BS
import GHC.Generics
import Network.Wreq
import Control.Lens

data Point = Point { x :: Int, y :: Int } deriving (Generic, Show)

instance ToJSON Point
instance FromJSON Point

data Rectangle = Rectangle { width :: Int, height :: Int, point :: Point } deriving (Generic, Show)

instance ToJSON Rectangle
instance FromJSON Rectangle

main = do
  let p = Point 1 2
  let r = Rectangle 10 20 p
  let url = "http://localhost:8000/Rectangle"
  let opts = defaults & header "Content-Type" .~ ["application/json"]
  r <- postWith opts url (encode r)
  let returnData = r ^? responseBody
  case (decode returnData) of
    Nothing -> BS.putStrLn "Error decoding JSON"
    Just json -> BS.putStrLn $ show $ decode json
The output in this case is:
Just "{\"height\":20,\"point\":{\"x\":1,\"y\":2},\"width\":10}"
I already tried it with fromJSON:
print $ fromJSON returnData
and got this error:
Couldn't match expected type `Value'
with actual type `Maybe
bytestring-0.10.6.0:Data.ByteString.Lazy.Internal.ByteString'
In the first argument of `fromJSON', namely `returnData'
In the second argument of `($)', namely `fromJSON returnData'
Failed, modules loaded: none.
My question is now how to convert this JSON string back to an object of type "Rectangle"?
EDIT 1: I changed my code due to Janos Potecki's answer and get now the following error:
Couldn't match type `[Char]' with `ByteString'
Expected type: ByteString
Actual type: String
In the second argument of `($)', namely `show $ decode json'
In the expression: BS.putStrLn $ show $ decode json
In a case alternative:
Just json -> BS.putStrLn $ show $ decode json
Failed, modules loaded: none.
EDIT 2: I changed it to:
main = do
  let point = Point 1 2
  let rectangle = Rectangle 10 20 point
  let url = "http://localhost:8000/Rectangle/Move/100,200"
  let opts = defaults & header "Content-Type" .~ ["application/json"]
  r <- postWith opts url (encode rectangle)
  let returnData = (r ^? responseBody) >>= decode
  case returnData of
    Nothing -> BS.putStrLn "Error decoding JSON"
    Just json -> BS.putStrLn json
and now I get:
No instance for (FromJSON ByteString)
arising from a use of `decode'
In the second argument of `(>>=)', namely `decode'
In the expression: (r ^? responseBody) >>= decode
In an equation for `returnData':
returnData = (r ^? responseBody) >>= decode
working solution
r' <- asJSON =<< postWith opts url (encode rectangle) :: IO Res
case r' of
  Nothing -> print "Error decoding JSON"
  Just x -> print x
For performance I'd suggest you add the following to your instance ToJSON:
instance ToJSON Point where
  toEncoding = genericToEncoding defaultOptions
and the same for Rectangle
I'm able to parse my csv file using the following code from Data.Csv:
valuesToList :: Foo -> (Int, Int)
valuesToList (Foo a b) = (a, b)

loadMyData :: IO ()
loadMyData = do
  csvData <- BL.readFile "mydata.csv"
  case decodeByName csvData of
    Left err -> putStrLn err
    Right (_, v) -> print $ V.toList $ V.map valuesToList v
When I run this I get the correct output on the screen. The problem is that I'm not clear how to create a pure function so I can consume the contents of the list, as in:
let l = loadMyData
where l is of type [(Int, Int)]. I'm guessing it's because I'm in the IO monad and I'm doing something hopelessly silly...
I'm doing something hopelessly silly...
Yes, but worry not!
loadMyData = BL.readFile "mydata.csv"

processMyData :: BL.ByteString -> String
processMyData csvData =
  case decodeByName csvData of
    Left err -> err
    Right (_, v) -> show $ V.toList $ V.map valuesToList v

main = do
  csv <- loadMyData
  let output = processMyData csv
  print output
This way you separate the pure, "let" part from the impure loading part. This isn't Code Review (if you asked there, I could probably elaborate more), but I'd type the processing as BL.ByteString -> Either String [(Int, Int)] or something, and keep the failure information in the type system:
processMyData csvData =
  case decodeByName csvData of
    Left err -> Left err
    Right (_, v) -> Right $ V.toList $ V.map valuesToList v
And that in turn could simply be (pardon me if I make a mistake somewhere, I'm doing it live):
processMyData = fmap (V.toList . V.map valuesToList . snd) . decodeByName
That should work because of how the Functor instance for Either is constructed.
Oh, and also use Control.Applicative for bonus style points:
main = do
  output <- processMyData <$> loadMyData
  print output
(in case you don't understand that example, (<$>) is infix fmap, but that's not strictly necessary; hence bonus)