My program reads CSV files using the csv crate into a Vec<Vec<String>>, where the outer vector represents the rows and each inner vector holds the columns of a row.
use std::{time, thread::{sleep, park}};
use csv;
fn main() {
different_scope();
println!("Parked");
park();
}
fn different_scope() {
println!("Reading csv");
let _data = read_csv("data.csv");
println!("Sleeping");
sleep(time::Duration::from_secs(4));
println!("Going out of scope");
}
fn read_csv(path: &str) -> Vec<Vec<String>> {
let mut rdr = csv::Reader::from_path(path).unwrap();
return rdr
.records()
.map(|row| {
row
.unwrap()
.iter()
.map(|column| column.to_string())
.collect()
})
.collect();
}
I'm looking at RAM usage with htop and this uses 2.5GB of memory to read a 250MB CSV file.
Here are the contents of cat /proc/<my pid>/status:
Name: (name)
Umask: 0002
State: S (sleeping)
Tgid: 18349
Ngid: 0
Pid: 18349
PPid: 18311
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 1000 1000
FDSize: 256
Groups: 4 24 27 30 46 118 128 133 1000
NStgid: 18349
NSpid: 18349
NSpgid: 18349
NSsid: 18311
VmPeak: 2748152 kB
VmSize: 2354932 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 2580156 kB
VmRSS: 2345944 kB
RssAnon: 2343900 kB
RssFile: 2044 kB
RssShmem: 0 kB
VmData: 2343884 kB
VmStk: 136 kB
VmExe: 304 kB
VmLib: 2332 kB
VmPTE: 4648 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 1
SigQ: 0/127783
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000180000440
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: ffffffff
Cpus_allowed_list: 0-31
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 9
nonvoluntary_ctxt_switches: 293
When I drop the variable, it frees the correct amount (approx. 250MB), but there's still 2.2GB left. I'm unable to read more than 2-3GB before all my memory is used and the process is killed (cargo prints "Killed").
How do I free the excess memory while the CSV is being read?
I need to process every line. In this case I don't need to hold all of this data at once, but what if I did?
I asked a related question and was pointed to What is Rust strategy to uncommit and return memory to the operating system?, which was helpful in understanding the problem, but I don't know how to solve it.
My understanding is that I should switch my crate to a different memory allocator, but brute-forcing through all the allocators I can find feels like an ignorant approach.
For questions about memory, it's good to develop a technique where you quantify your memory usage. You can do this by examining your representation. In this case, that's Vec<Vec<String>>. In particular, if you have a 250MB CSV file that is represented as a sequence of a sequence of fields, then it is not necessarily the case that you'll only use 250MB of memory. You need to consider the overhead of your representation.
For a Vec<Vec<String>>, we can dismiss the overhead of the outer Vec<...>, since there is only one of it and it will (in your program) live on the stack and not the heap. It is the inner Vec<String> values that end up on the heap.
So if your CSV file has M records and each record has N fields, then there will be M instances of Vec<String> and M * N instances of String. The overhead of both a Vec<T> and a String is 3 * sizeof(word), with one word being the pointer to the data, another word being the length and yet another being the capacity. (That's 24 bytes for a 64-bit target.) So your total overhead for a 64-bit target is (M * 24) + (M * N * 24).
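If you'd rather not take those sizes on faith, here is a small sketch (not part of the original answer; the record and field counts are made up) that asserts the three-word headers and evaluates the overhead formula on a 64-bit target:
use std::mem::size_of;
fn main() {
    // Both a Vec<T> and a String are pointer + length + capacity: 3 words = 24 bytes.
    assert_eq!(size_of::<Vec<String>>(), 24);
    assert_eq!(size_of::<String>(), 24);
    // Overhead of a Vec<Vec<String>> holding m records of n fields each,
    // not counting the actual field bytes on the heap.
    let overhead = |m: u64, n: u64| m * 24 + m * n * 24;
    println!("{} bytes", overhead(1_000_000, 10)); // 264,000,000
}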
Let's test this experimentally. Since you didn't share your CSV input (you really should in the future), I'll bring my own. It's 145MB, has M=3,173,958 records with N=7 fields per record. So the total overhead for your representation is (3173958 * 24) + (3173958 * 7 * 24) = 609,399,936 bytes, or 609 MB. Let's test that with a real program:
fn main() -> Result<(), Box<dyn std::error::Error>> {
let input_path = match std::env::args_os().nth(1) {
Some(p) => p,
None => {
eprintln!("Usage: csvmem <path>");
std::process::exit(1);
}
};
let rdr = csv::Reader::from_path(input_path)?;
let mut records: Vec<Vec<String>> = vec![];
for result in rdr.into_records() {
let mut record: Vec<String> = vec![];
for column in result?.iter() {
record.push(column.to_string());
}
records.push(record);
}
println!("{}", records.len());
Ok(())
}
(I've added some unnecessary type annotations in a couple places to make the code a little clearer, particularly with respect to our representation.) So let's run this program (whose only dependency is csv = "1" in my Cargo.toml):
$ echo $TIMEFMT
real %*E user %*U sys %*S maxmem %M MB faults %F
$ cargo b --release
$ time ./target/release/csvmem /m/sets/csv/pop/worldcitiespop-nice.csv
3173958
real 1.542
user 1.236
sys 0.296
maxmem 1287 MB
faults 0
The time utility here reports peak memory usage, which is actually a bit higher than what we might expect it to be: 609 + 145 = 754MB. I don't quite know enough about allocators to reason through the difference completely. It could be that the system allocator I'm using allocates bigger chunks than what is actually needed.
Let's make our representation a bit more efficient by using a Box<str> instead of String. We sacrifice the ability to expand the string, but in exchange, we save 8 bytes of overhead per field. So our new overhead calculation is (3173958 * 24) + (3173958 * 7 * 16) = 431,658,288 bytes or 431MB, for a difference of 609 - 431 = 178MB. So let's test our new representation and see what our delta is:
fn main() -> Result<(), Box<dyn std::error::Error>> {
let input_path = match std::env::args_os().nth(1) {
Some(p) => p,
None => {
eprintln!("Usage: csvmem <path>");
std::process::exit(1);
}
};
let rdr = csv::Reader::from_path(input_path)?;
let mut records: Vec<Vec<Box<str>>> = vec![];
for result in rdr.into_records() {
let mut record: Vec<Box<str>> = vec![];
for column in result?.iter() {
record.push(column.to_string().into());
}
records.push(record);
}
println!("{}", records.len());
Ok(())
}
And to compile and run:
$ cargo b --release
$ time ./target/release/csvmem /m/sets/csv/pop/worldcitiespop-nice.csv
3173958
real 1.459
user 1.183
sys 0.266
maxmem 1093 MB
faults 0
for a total delta of 194MB, which is pretty close to our guess of 178MB.
We can optimize the representation even further by using a Vec<Box<[Box<str>]>>:
fn main() -> Result<(), Box<dyn std::error::Error>> {
let input_path = match std::env::args_os().nth(1) {
Some(p) => p,
None => {
eprintln!("Usage: csvmem <path>");
std::process::exit(1);
}
};
let rdr = csv::Reader::from_path(input_path)?;
let mut records: Vec<Box<[Box<str>]>> = vec![];
for result in rdr.into_records() {
let mut record: Vec<Box<str>> = vec![];
for column in result?.iter() {
record.push(column.to_string().into());
}
records.push(record.into());
}
println!("{}", records.len());
Ok(())
}
That gives a peak memory usage of 1069 MB. So not much of a savings.
However, the best thing we can do is use a csv::StringRecord:
fn main() -> Result<(), Box<dyn std::error::Error>> {
let input_path = match std::env::args_os().nth(1) {
Some(p) => p,
None => {
eprintln!("Usage: csvmem <path>");
std::process::exit(1);
}
};
let rdr = csv::Reader::from_path(input_path)?;
let mut records = vec![];
for result in rdr.into_records() {
let record = result?;
records.push(record);
}
println!("{}", records.len());
Ok(())
}
And that gives a peak memory usage of 727MB. The secret is that a StringRecord stores fields inline without that second layer of indirection. It ends up saving quite a bit!
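You still get ordinary &str fields out of a StringRecord, so switching representations doesn't change how you consume the data. A minimal sketch (the helper name is mine, not from the csv docs):
fn print_first_two(record: &csv::StringRecord) {
    // get() returns Option<&str> borrowed from the record's single buffer;
    // iter() walks all of the fields if you need them.
    if let (Some(a), Some(b)) = (record.get(0), record.get(1)) {
        println!("{} {}", a, b);
    }
}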
Of course, if you don't need to store all of the records in memory at once, then you shouldn't. And the CSV crate supports that just fine:
fn main() -> Result<(), Box<dyn std::error::Error>> {
let input_path = match std::env::args_os().nth(1) {
Some(p) => p,
None => {
eprintln!("Usage: csvmem <path>");
std::process::exit(1);
}
};
let mut count = 0;
let rdr = csv::Reader::from_path(input_path)?;
for result in rdr.into_records() {
let _ = result?;
count += 1;
}
println!("{}", count);
Ok(())
}
And that program's peak memory usage is only 9MB, as you'd expect of a streaming implementation. (Technically, you can use no heap memory at all if you drop down and use the csv-core crate.)
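And if you're streaming but want to avoid the per-row allocations of the records() iterator, the csv crate lets you reuse a single record, along the lines of the crate's performance docs. A sketch, assuming the same command-line shape as the programs above:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = std::env::args_os().nth(1).expect("Usage: csvmem <path>");
    let mut rdr = csv::Reader::from_path(path)?;
    // One record, reused for every row: its internal buffer is recycled,
    // so steady-state heap usage stays tiny.
    let mut record = csv::StringRecord::new();
    let mut count = 0u64;
    while rdr.read_record(&mut record)? {
        // process `record` here
        count += 1;
    }
    println!("{}", count);
    Ok(())
}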
I have a function to calculate the inverse sum of a number
let inverseSum n =
let rec sI n acc =
match n with
| 1 -> acc
| _ -> sI (n - 1) ((1.0 /. float n) +. acc)
in sI n 1.0;;
For example, inverseSum 2 -> 1/2 + 1 = 3/2 = 1.5
I test the function with 2 and 5, and it's okay:
inverseSum 2;;
inverseSum 5;;
inverseSum 2;;
- : float = 1.5
inverseSum 5;;
- : float = 2.28333333333333321
For the moment, no problem.
After that, I initialize a list which contains all numbers between 1 and 10000 ([1;…;10000])
let initList = List.init 10000 (fun n -> n + 1);;
no problem.
I write a function that replaces each element of the list with the inverse sum of that element
(e.g. [1;2;3] -> [inverseSum 1; inverseSum 2; inverseSum 3])
let rec invSumLst lst =
match lst with
| [] -> []
| h::t -> (inverseSum h) :: invSumLst t;;
and I use it on the list initList:
let invInit = invSumLst initList;;
So far so good, but I start to have doubts from this stage on.
I select the elements of invInit that are strictly less than 5.0:
let listLess5 = List.filter (fun n -> n < 5.0) invInit;;
And I compute the sum of these elements using fold_left:
let foldLess5 = List.fold_left (+.) 0.0 listLess5;;
I redo the last two steps with floats greater than or equal to 5.0
let moreEg5 = List.filter (fun n -> n >= 5.0) invInit;;
let foldMore5 = List.fold_left (+.) 0.0 moreEg5;;
Finally, I sum all the numbers of the list:
let foldInvInit = List.fold_left (+.) 0.0 invInit;;
but at the end, when I try to calculate the absolute error between (the sum of the numbers less than 5 plus the sum of those greater than or equal to 5) and the sum of all the elements of the list, the result is surprising:
Float.abs ((foldLess5 +. foldMore5) -. foldInvInit);;
Printf.printf "%f\n" (Float.abs ((foldLess5 +. foldMore5) -. foldInvInit));;
Printf.printf "%b\n" ((foldLess5+.foldMore5) = foldInvInit);;
returns:
let foldMore5 = List.fold_left (+.) 0.0 moreEg5;;
val foldMore5 : float = 87553.6762998474733
let foldInvInit = List.fold_left (+.) 0.0 invInit;;
val foldInvInit : float = 87885.8479664799379
Float.abs ((foldLess5 +. foldMore5) -. foldInvInit);;
- : float = 1.45519152283668518e-11
Printf.printf "%f\n" (Float.abs ((foldLess5 +. foldMore5) -. foldInvInit));;
0.000000
- : unit = ()
Printf.printf "%b\n" ((foldLess5+.foldMore5) = foldInvInit);;
false
- : unit = ()
It's probably a rounding problem, but I would like to know exactly where the error comes from.
Here I am using the interpreter, so I can see the error "1.45519152283668518e-11".
But if I used a compiler like ocamlpro, I would just get 0.000000 and false in the terminal and I wouldn't understand anything.
So I would just like to know whether the problem comes from one of the functions in the code, or from rounding done by the Printf.printf function, which printed the result in non-scientific notation.
OCaml is showing you the actual results of the operations you performed. The difference between the two sums is caused by the finite precision of floating values.
When adding up a list of large-ish numbers, by the time you reach the end of the list the running total is large enough that the lowest-order bits of the newly added values simply can't be represented. For example, once the total is around 87000, adjacent 64-bit floats are about 1.5e-11 apart, so each further addition can round away an error on the order of 10^-11. But when adding a list of small-ish numbers, fewer bits are lost.
A system that shows foldLess5 +. foldMore5 as equal to foldInvInit is most likely lying to you for your convenience.
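To make the effect concrete, here is a tiny stand-alone demonstration (written in Rust purely for illustration; the behaviour is a property of IEEE-754 doubles, not of either language):
fn main() {
    let big = 1.0e16_f64;
    let small = 1.0_f64;
    // Near 1e16 the spacing between adjacent doubles is 2, so adding 1 is lost entirely.
    assert_eq!(big + small, big);
    // Grouping the same additions differently gives a different result:
    let left = (big + small) + small;  // 10000000000000000
    let right = big + (small + small); // 10000000000000002
    assert_ne!(left, right);
    println!("{} vs {}", left, right);
}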
I have numbers with several hundred decimal digits stored in binary form in arrays (approx. 1000 bits, around 10^300). How can I print these numbers in decimal form? The standard way to convert from binary to decimal with div 10 and mod 10 is hard to implement. I'm using C/C++.
create an array of integer64 with length = 1000
calculate the decimal value of the bits one by one
save them in the array
add the array rows
For example: 25 = 11001 in binary. The array should contain the following values: [16, 8, 0, 0, 1]
2^4 * 1 = 16
2^3 * 1 = 8
2^2 * 0 = 0
2^1 * 0 = 0
2^0 * 1 = 1
Adding the rows (carrying between digits) gives 25.
This method should work without mod or div (tested in python3 up to 2^1000).
Sorry, my first answer was not that understandable, so I programmed the converter in Rust (the closest language to C that I know). I hope that makes it a bit clearer.
use rand::Rng;
fn pow_(num: i64, exp: i64) -> (f64, i64) {
let mut e: i64 = 0;
let mut current = num as f64;
for _ in 0..(exp - 1) {
current *= num as f64;
if current / 10.0 > 1.0{
current /= 10.0;
e += 1;
}
}
(current, e)
}
fn rand_bit_array(l: usize) -> Vec<bool>{
let mut ba = Vec::with_capacity(l);
let mut rng = rand::thread_rng();
ba.push(true);
for _ in 1..l{
ba.push(rng.gen_bool(0.5));
}
ba
}
fn add(ba: Vec<bool>){
println!("{:?}", ba);
let (mut out, e) = pow_(2, ba.len() as i64);
// round to the first 30 numbers
for i in 1..30{
if ba[i] {
let dec = pow_(2, ba.len() as i64 - i as i64);
out += dec.0 / 10.0_f64.powf((e - dec.1) as f64);
}
}
println!("Result: {} * 10 ^ {}", out, e);
}
fn main() {
println!("Result: {} * 10 ^ {}", pow_(2, 1000).0, pow_(2, 1000).1);
add(rand_bit_array(1000));
}
Explanation:
pow_ - power up to 2^1000 easily
rand_bit_array - generates a random bit array
add - calculates the decimal number
Example Output:
[true, true, false, false, true, true, true, false, false, false, ...] (1000 Elements)
Result: 1.7247795190143262 * 10 ^ 301
I have written an application in Haskell that does the following:
Recursively list a directory,
Parse the JSON files from the directory list,
Look for matching key-value pairs, and
Return filenames where matches have been found.
My first version of this application was the simplest, naive version I could write, but I noticed that space usage seemed to increase monotonically.
As a result, I switched to conduit, and now my primary functionality looks like this:
conduitFilesFilter :: ProjectFilter -> Path Abs Dir -> IO [Path Abs File]
conduitFilesFilter projFilter dirname' = do
(_, allFiles) <- listDirRecur dirname'
C.runConduit $
C.yieldMany allFiles
.| C.filterMC (filterMatchingFile projFilter)
.| C.sinkList
Now my application has bounded memory usage, but it's still quite slow. From this, I have two questions.
1)
I used stack new to generate the skeleton for this application, and by default it uses the GHC options -threaded -rtsopts -with-rtsopts=-N.
The surprising thing (to me) is that the application uses all processors available to it (about 40 in the target machine) when I actually go to run it. However, I didn't write any part of the application to be run in parallel (I considered it, actually).
What's running in parallel?
2)
Additionally, most of the JSON files are really large (10 MB) and there are probably 500k of them to be traversed. This means my program is very slow as a result of all the Aeson decoding. My idea was to run my filterMatchingFile part in parallel, but looking at the stm-conduit library, I can't see an obvious way to run this middle step in parallel across a handful of processors.
Can anyone suggest a way to smartly parallelize my function above using stm-conduit or some other means?
Edit
I realized that I could break up my readFile -> decodeObject -> runFilterFunction into separate parts of the conduit and then I could use stm-conduit there with a bounded channel. Maybe I'll give it a shot...
I ran my application with +RTS -s (I reconfigured it to -N4) and I see the following:
115,961,554,600 bytes allocated in the heap
35,870,639,768 bytes copied during GC
56,467,720 bytes maximum residency (681 sample(s))
1,283,008 bytes maximum slop
145 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 108716 colls, 108716 par 76.915s 20.571s 0.0002s 0.0266s
Gen 1 681 colls, 680 par 0.530s 0.147s 0.0002s 0.0009s
Parallel GC work balance: 14.99% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.007s elapsed)
MUT time 34.813s ( 42.938s elapsed)
GC time 77.445s ( 20.718s elapsed)
EXIT time 0.000s ( 0.010s elapsed)
Total time 112.260s ( 63.672s elapsed)
Alloc rate 3,330,960,996 bytes per MUT second
Productivity 31.0% of total user, 67.5% of total elapsed
gc_alloc_block_sync: 188614
whitehole_spin: 0
gen[0].sync: 33
gen[1].sync: 811204
From your program description, there is no reason for it to have increasing memory usage. I think it was an accidental memory leak from a missed lazy computation. This can easily be detected with heap profiling: https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/profiling.html#hp2ps-rendering-heap-profiles-to-postscript. Another possible reason is that the runtime does not release all memory back to the OS. Up to some threshold, it will keep holding memory proportional to the largest file processed. This may look like a memory leak if tracked through the process's RSS size.
The -A32m option increases the nursery size. It lets your program allocate more memory before a garbage collection is triggered. The stats show that very little memory is retained during GC, so the less often it happens, the more time the program spends doing actual work.
Prompted by Michael Snoyman on Haskell Cafe, who pointed out that my first version was not truly taking advantage of Conduit's streaming capabilities, I rewrote my Conduit version of the application (without using stm-conduit). This was a large improvement: my first Conduit version was operating over all the data at once and I didn't realize it.
I also increased the nursery size and this increased my productivity by doing garbage collection less frequently.
My revised function ended up looking like this:
module Search where
import Conduit ((.|))
import qualified Conduit as C
import Control.Monad
import Control.Monad.IO.Class (MonadIO, liftIO)
import Control.Monad.Trans.Resource (MonadResource)
import qualified Data.ByteString as B
import Data.List (isPrefixOf)
import Data.Maybe (fromJust, isJust)
import System.Path.NameManip (guess_dotdot, absolute_path)
import System.FilePath (addTrailingPathSeparator, normalise)
import System.Directory (getHomeDirectory)
import Filters
sourceFilesFilter :: (MonadResource m, MonadIO m) => ProjectFilter -> FilePath -> C.ConduitM () String m ()
sourceFilesFilter projFilter dirname' =
C.sourceDirectoryDeep False dirname'
.| parseProject projFilter
parseProject :: (MonadResource m, MonadIO m) => ProjectFilter -> C.ConduitM FilePath String m ()
parseProject (ProjectFilter filterFunc) = do
C.awaitForever go
where
go path' = do
bytes <- liftIO $ B.readFile path'
let isProj = validProject bytes
when (isJust isProj) $ do
let proj' = fromJust isProj
when (filterFunc proj') $ C.yield path'
My main just runs the conduit and prints those that pass the filter:
mainStreamingConduit :: IO ()
mainStreamingConduit = do
options <- getRecord "Search JSON Files"
let filterFunc = makeProjectFilter options
searchDir <- absolutize (searchPath options)
itExists <- doesDirectoryExist searchDir
case itExists of
False -> putStrLn "Search Directory does not exist" >> exitWith (ExitFailure 1)
True -> C.runConduitRes $ sourceFilesFilter filterFunc searchDir .| C.mapM_ (liftIO . putStrLn)
I run it like this (without the stats, typically):
stack exec search-json -- --searchPath $FILES --name NAME +RTS -s -A32m -n4m
Without increasing nursery size, I get a productivity around 30%. With the above, however, it looks like this:
72,308,248,744 bytes allocated in the heap
733,911,752 bytes copied during GC
7,410,520 bytes maximum residency (8 sample(s))
863,480 bytes maximum slop
187 MB total memory in use (27 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 580 colls, 580 par 2.731s 0.772s 0.0013s 0.0105s
Gen 1 8 colls, 7 par 0.163s 0.044s 0.0055s 0.0109s
Parallel GC work balance: 35.12% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.006s elapsed)
MUT time 26.155s ( 31.602s elapsed)
GC time 2.894s ( 0.816s elapsed)
EXIT time -0.003s ( 0.008s elapsed)
Total time 29.048s ( 32.432s elapsed)
Alloc rate 2,764,643,665 bytes per MUT second
Productivity 90.0% of total user, 97.5% of total elapsed
gc_alloc_block_sync: 3494
whitehole_spin: 0
gen[0].sync: 15527
gen[1].sync: 177
I'd still like to figure out how to parallelize the filterProj . parseJson . readFile part, but for now I'm satisfied with this.
I figured out how to run this application using stm-conduit with some help from the Haskell wiki on parallelism and a Stack Overflow answer that talks about waiting for threads to end before main exits.
The way it works is that I create a channel that holds all of the filenames to be operated on. Then, I fork a bunch of threads that each runs a Conduit with the filepath-channel as a Source. I track all of the child threads and wait for them to finish.
Maybe this solution will be useful for someone else?
Not all of my lower-level filter functions are present, but the gist of it is that I have a Conduit that tests some JSON. If it passes, then it yields the FilePath.
Here's my main in entirety:
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Conduit ((.|))
import qualified Conduit as C
import Control.Concurrent
import Control.Monad (forM_)
import Control.Monad.IO.Class (liftIO)
import Control.Concurrent.STM
import Control.Monad.Trans.Resource (register)
import qualified Data.Conduit.TMChan as STMChan
import Data.Maybe (isJust, fromJust)
import qualified Data.Text as T
import Options.Generic
import System.Directory (doesDirectoryExist)
import System.Exit
import Search
data Commands =
Commands { searchPath :: String
, par :: Maybe Int
, project :: Maybe T.Text
, revision :: Maybe T.Text
} deriving (Generic, Show)
instance ParseRecord Commands
makeProjectFilter :: Commands -> ProjectFilter
makeProjectFilter options =
let stdFilts = StdProjectFilters
(ProjName <$> project options)
(Revision <$> revision options)
in makeProjectFilters stdFilts
main :: IO ()
main = do
options <- getRecord "Search JSON Files"
-- Would user like to run in parallel?
let runner = if isJust $ par options
then mainSTMConduit (fromJust $ par options)
else mainStreamingConduit
-- necessary things to search files: search path, filters to use, search dir exists
let filterFunc = makeProjectFilter options
searchDir <- absolutize (searchPath options)
itExists <- doesDirectoryExist searchDir
-- Run it if it exists
case itExists of
False -> putStrLn "Search Directory does not exist" >> exitWith (ExitFailure 1)
True -> runner filterFunc searchDir
-- Single-threaded version with bounded memory usage
mainStreamingConduit :: ProjectFilter -> FilePath -> IO ()
mainStreamingConduit filterFunc searchDir = do
C.runConduitRes $
sourceFilesFilter filterFunc searchDir .| C.mapM_C (liftIO . putStrLn)
-- Multiple-threaded version of this program using channels from `stm-conduit`
mainSTMConduit :: Int -> ProjectFilter -> FilePath -> IO ()
mainSTMConduit nrWorkers filterFunc searchDir = do
children <- newMVar []
inChan <- atomically $ STMChan.newTBMChan 16
_ <- forkIO . C.runResourceT $ do
_ <- register $ atomically $ STMChan.closeTBMChan inChan
C.runConduitRes $ C.sourceDirectoryDeep False searchDir .| STMChan.sinkTBMChan inChan True
forM_ [1..nrWorkers] (\_ -> forkChild children $ runConduitChan inChan filterFunc)
waitForChildren children
return ()
runConduitChan :: STMChan.TBMChan FilePath -> ProjectFilter -> IO ()
runConduitChan inChan filterFunc = do
C.runConduitRes $
STMChan.sourceTBMChan inChan
.| parseProject filterFunc
.| C.mapM_C (liftIO . putStrLn)
waitForChildren :: MVar [MVar ()] -> IO ()
waitForChildren children = do
cs <- takeMVar children
case cs of
[] -> return ()
m:ms -> do
putMVar children ms
takeMVar m
waitForChildren children
forkChild :: MVar [MVar ()] -> IO () -> IO ThreadId
forkChild children io = do
mvar <- newEmptyMVar
childs <- takeMVar children
putMVar children (mvar:childs)
forkFinally io (\_ -> putMVar mvar ())
Note: I'm using stm-conduit 3.0.0 with conduit 1.12.1, which is why I needed to include the boolean argument:
STMChan.sinkTBMChan inChan True
In version 4.0.0 of stm-conduit, this function automatically closes the channel, so the boolean argument has been removed.
When I read a CSV file that includes Chinese characters using the csv crate, I get an error.
extern crate csv;
use std::thread;
fn main() {
let mut rdr =
csv::Reader::from_file("C:\\Users\\Desktop\\test.csv").unwrap().has_headers(false);
for record in rdr.decode() {
let (a, b): (String, String) = record.unwrap();
println!("a:{},b:{}", a, b);
}
thread::sleep_ms(500000);
}
The error:
Running `target\release\rust_Work.exe`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Decode("Could not convert bytes \'FromUtf8Error { bytes: [208, 213, 195, 251], error: Utf8Error { va
lid_up_to: 0 } }\' to UTF-8.")', ../src/libcore\result.rs:788
note: Run with `RUST_BACKTRACE=1` for a backtrace.
error: Process didn't exit successfully: `target\release\rust_Work.exe` (exit code: 101)
test.csv:
1. 姓名 性别 年纪 分数 等级
2. 小二 男 12 88 良好
3. 小三 男 13 89 良好
4. 小四 男 14 91 优秀
I'm not sure what could be done to make the error message more clear:
Decode("Could not convert bytes 'FromUtf8Error { bytes: [208, 213, 195, 251], error: Utf8Error { valid_up_to: 0 } }' to UTF-8.")
FromUtf8Error is documented in the standard library, and the text of the error says "Could not convert bytes to UTF-8" (although there's some extra detail in the middle).
Simply put, your data isn't in UTF-8 and it must be. That's all that the Rust standard library (and thus most libraries) really deal with. You will need to figure out what encoding it is in and then find some way of converting from that to UTF-8. There may be a crate to help with either of those cases.
Perhaps even better, you can save the file as UTF-8 from the beginning. Sadly, it's relatively common for people to hit this issue when using Excel, because Excel does not have a way to easily export UTF-8 CSV files. It always writes a CSV file in the system locale encoding.
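If you cannot re-export the file as UTF-8, one option is to transcode it yourself before handing it to the csv crate. A sketch using the encoding_rs crate (my choice, not something the csv crate requires; any transcoding crate will do, and GB18030 is only a guess at the real source encoding):
use std::fs;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bytes = fs::read("test.csv")?;
    // Decode from GB18030 into UTF-8; `had_errors` flags lossy replacements.
    let (utf8, _encoding, had_errors) = encoding_rs::GB18030.decode(&bytes);
    if had_errors {
        eprintln!("warning: some bytes could not be decoded");
    }
    // Feed the transcoded text to the csv crate as an in-memory reader.
    let mut rdr = csv::Reader::from_reader(utf8.as_bytes());
    for record in rdr.records() {
        println!("{:?}", record?);
    }
    Ok(())
}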
I have a way to solve it. Thanks all.
extern crate csv;
extern crate rustc_serialize;
extern crate encoding;
use encoding::{Encoding, EncoderTrap, DecoderTrap};
use encoding::all::{GB18030};
use std::io::prelude::*;
use std::fs::File;
fn main() {
let path = "C:\\Users\\Desktop\\test.csv";
let mut f = File::open(path).expect("cannot open file");
let mut reader: Vec<u8> = Vec::new();
f.read_to_end(&mut reader).expect("can not read file");
let mut chars = String::new();
GB18030.decode_to(&mut reader, DecoderTrap::Ignore, &mut chars);
let mut rdr = csv::Reader::from_string(chars).has_headers(true);
for row in rdr.decode() {
let (x, y, r): (String, String, String) = row.unwrap();
println!("({}, {}): {:?}", x, y, r);
}
}
output:
Part 1: Read Unicode (Chinese or not) characters:
The easiest way to achieve your goal is to use the read_to_string function that mutates the String you pass to it, appending the Unicode content of your file to that passed String:
use std::io::prelude::*;
use std::fs::File;
fn main() {
let mut f = File::open("file.txt").unwrap();
let mut buffer = String::new();
f.read_to_string(&mut buffer).unwrap();
println!("{}", buffer)
}
Part 2: Parse a CSV file, its delimiter being a ',':
extern crate regex;
use regex::Regex;
use std::io::prelude::*;
use std::fs::File;
fn main() {
let mut f = File::open("file.txt").unwrap();
let mut buffer = String::new();
let delimiter = ",";
f.read_to_string(&mut buffer).unwrap();
let modified_buffer = buffer.replace("\n", delimiter);
let mut regex_str = "([^".to_string();
regex_str.push_str(delimiter);
regex_str.push_str("]+)");
let mut final_part = "".to_string();
final_part.push_str(delimiter);
final_part.push_str("?");
regex_str.push_str(&final_part);
let regex_str_copy = regex_str.clone();
regex_str.push_str(&regex_str_copy);
regex_str.push_str(&regex_str_copy);
let re = Regex::new(&regex_str).unwrap();
for cap in re.captures_iter(&modified_buffer) {
let (s1, s2, dist): (String, String, usize) =
(cap[1].to_string(), cap[2].to_string(), cap[3].parse::<usize>().unwrap());
println!("({}, {}): {}", s1, s2, dist);
}
}
Sample input and output here
I've compared the Scala version
(BigInt(1) to BigInt(50000)).reduce(_ * _)
to the Python version
reduce(lambda x,y: x*y, range(1,50000))
and it turns out that the Scala version takes about 10 times longer than the Python version.
I'm guessing a big difference is that Python can use its native long type instead of creating a new BigInt object for each number. But is there a workaround in Scala?
The fact that your Scala code creates 50,000 BigInt objects is unlikely to be making much of a difference here. A bigger issue is the multiplication algorithm—Python's long uses Karatsuba multiplication and Java's BigInteger (which BigInt just wraps) doesn't.
The easiest workaround is probably to switch to a better arbitrary precision math library, like JScience's:
import org.jscience.mathematics.number.LargeInteger
(1 to 50000).foldLeft(LargeInteger.ONE)(_ times _)
This is faster than the Python solution on my machine.
Update: I've written some quick benchmarking code using Caliper in response to Luigi Plingi's answer, which gives the following results on my (quad core) machine:
benchmark ms linear runtime
BigIntFoldLeft 4774 ==============================
BigIntFold 4739 =============================
BigIntReduce 4769 =============================
BigIntFoldLeftPar 4642 =============================
BigIntFoldPar 500 ===
BigIntReducePar 499 ===
LargeIntegerFoldLeft 3042 ===================
LargeIntegerFold 3003 ==================
LargeIntegerReduce 3018 ==================
LargeIntegerFoldLeftPar 3038 ===================
LargeIntegerFoldPar 246 =
LargeIntegerReducePar 260 =
I don't see the difference between reduce and fold that he does, but the moral is clear: if you can use Scala 2.9's parallel collections, they'll give you a huge improvement, but switching to LargeInteger helps as well.
Python on my machine:
import time

def func():
    start = time.clock()
    reduce(lambda x, y: x * y, range(1, 50000))
    end = time.clock()
    t = (end - start) * 1000
    print t
gives 1219 ms
Scala:
def timed[T](f: => T) = {
val t0 = System.currentTimeMillis
val r = f
val t1 = System.currentTimeMillis
println("Took: "+(t1 - t0)+" ms")
r
}
timed { (BigInt(1) to BigInt(50000)).reduce(_ * _) }
4251 ms
timed { (BigInt(1) to BigInt(50000)).fold(BigInt(1))(_ * _) }
4224 ms
timed { (BigInt(1) to BigInt(50000)).par.reduce(_ * _) }
2083 ms
timed { (BigInt(1) to BigInt(50000)).par.fold(BigInt(1))(_ * _) }
689 ms
// using org.jscience.mathematics.number.LargeInteger from Travis's answer
timed { val a = (1 to 50000).foldLeft(LargeInteger.ONE)(_ times _) }
3327 ms
timed { val a = (1 to 50000).map(LargeInteger.valueOf(_)).par.fold(
LargeInteger.ONE)(_ times _) }
361 ms
The 689 ms and 361 ms figures were measured after a few warmup runs. Both started at about 1000 ms, but they seem to warm up by different amounts: the parallel collections warm up significantly more than the non-parallel ones, whose times did not drop much from their first runs.
The .par (meaning, use parallel collections) seemed to speed up fold more than reduce. I only have 2 cores, but a greater number of cores should see a bigger performance gain.
So, experimentally, the way to optimize this function is
a) Use fold rather than reduce
b) Use parallel collections
Update:
Inspired by the observation that breaking the calculation down into smaller chunks speeds things up, I managed to get the following to run in 215 ms on my machine, which is a 40% improvement on the standard parallelized algorithm. (Using BigInt, it takes 615 ms.) Also, it doesn't use parallel collections, but somehow uses 90% CPU (unlike the BigInt version).
import org.jscience.mathematics.number.LargeInteger
def fact(n: Int) = {
def loop(seq: Seq[LargeInteger]): LargeInteger = seq.length match {
case 0 => throw new IllegalArgumentException
case 1 => seq.head
case _ => loop {
val (a, b) = seq.splitAt(seq.length / 2)
a.zipAll(b, LargeInteger.ONE, LargeInteger.ONE).map(i => i._1 times i._2)
}
}
loop((1 to n).map(LargeInteger.valueOf(_)).toIndexedSeq)
}
Another trick here could be to try both reduceLeft and reduceRight to see which is faster. On your example, I get a much faster execution with reduceRight:
scala> timed { (BigInt(1) to BigInt(50000)).reduceLeft(_ * _) }
Took: 4605 ms
scala> timed { (BigInt(1) to BigInt(50000)).reduceRight(_ * _) }
Took: 2004 ms
The same difference shows up between foldLeft and foldRight. I guess it matters which side of the tree you start reducing from :)
The most efficient way to calculate a factorial in Scala is to use a divide-and-conquer strategy:
def fact(n: Int): BigInt = rangeProduct(1, n)
private def rangeProduct(n1: Long, n2: Long): BigInt = n2 - n1 match {
case 0 => BigInt(n1)
case 1 => BigInt(n1 * n2)
case 2 => BigInt(n1 * (n1 + 1)) * n2
case 3 => BigInt(n1 * (n1 + 1)) * ((n2 - 1) * n2)
case _ =>
val nm = (n1 + n2) >> 1
rangeProduct(n1, nm) * rangeProduct(nm + 1, n2)
}
Also, to get more speed, use the latest version of the JDK and the following JVM options:
-server -XX:+TieredCompilation
Below are the results for an Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz (max 3.50GHz), 12 GB DDR3-1333 RAM, Windows 7 SP1, Oracle JDK 1.8.0_25-b18 64-bit:
(BigInt(1) to BigInt(100000)).product took: 3,806 ms with 26.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduce(_ * _) took: 3,728 ms with 25.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceLeft(_ * _) took: 3,510 ms with 25.1 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceRight(_ * _) took: 4,056 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).fold(BigInt(1))(_ * _) took: 3,697 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.product took: 406 ms with 66.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduce(_ * _) took: 296 ms with 71.1 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceLeft(_ * _) took: 3,495 ms with 25.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceRight(_ * _) took: 3,900 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.fold(BigInt(1))(_ * _) took: 327 ms with 56.1 % of CPU usage
fact(100000) took: 203 ms with 28.3 % of CPU usage
By the way, to improve the efficiency of factorial calculation for numbers greater than 20000, use the following implementation of the Schönhage-Strassen algorithm, or wait until it is merged into JDK 9 and Scala is able to use it.