How to read and process a pipe delimited file in Rust? - csv

I want to read a pipe delimited file, process the data, and generate a result in CSV format.
Input file data
A|1|Pass
B|2|Fail
A|3|Fail
C|6|Pass
A|8|Pass
B|10|Fail
C|25|Pass
A|12|Fail
C|26|Pass
C|26|Fail
I want to group by column 1 and column 3 and compute the sum of column 2 within each group.
I'm stuck on how to keep track of the records so that I can apply the group-by:
use std::fs::File;
use std::io::{BufReader};
use std::io::{BufRead};
use std::collections::HashMap;
fn say_hello(id: &str, value: i32, no_change : i32) {
if no_change == 101 {
let mut data = HashMap::new();
}
if value == 0 {
if data.contains_key(id) {
for (key, value) in &data {
if value.is_empty() {
}
}
} else {
data.insert(id,"");
}
} else if value == 2 {
if data.contains_key(id) {
for (key, value) in &data {
if value.is_empty() {
} else {
}
}
} else {
data.insert(id,"");
}
}
}
fn main() {
let f = File::open("sample2.txt").expect("Unable to open file");
let br = BufReader::new(f);
let mut no_change = 101;
for line in br.lines() {
let mut index = 0;
for value in line.unwrap().split('|') {
say_hello(&value,index,no_change);
index = index + 1;
}
}
}
I'm expecting a result like:
name,result,num
A,Fail,15
A,Pass,9
B,Fail,12
C,Fail,26
C,Pass,57
Is there any specific technique to read a pipe-delimited file and process the data like above? Python's pandas accomplished this requirement but I want to do it in Rust.

As was mentioned, use the csv crate to do the heavy lifting of parsing the file. Then it's just a matter of grouping each row by using a BTreeMap which also helpfully performs sorting. The entry API helps efficiently insert into the BTreeMap.
extern crate csv;
extern crate rustc_serialize;
use std::fs::File;
use std::collections::BTreeMap;
#[derive(Debug, RustcDecodable)]
struct Record {
name: String,
value: i32,
passed: String,
}
fn main() {
let file = File::open("input").expect("Couldn't open input");
let mut csv_file = csv::Reader::from_reader(file).delimiter(b'|').has_headers(false);
let mut sums = BTreeMap::new();
for record in csv_file.decode() {
let record: Record = record.expect("Could not parse input file");
let key = (record.name, record.passed);
*sums.entry(key).or_insert(0) += record.value;
}
println!("name,result,num");
for ((name, passed), sum) in sums {
println!("{},{},{}", name, passed, sum);
}
}
You'll note that the output is correct:
name,result,num
A,Fail,15
A,Pass,9
B,Fail,12
C,Fail,26
C,Pass,57
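The grouping step described above can also be sketched with the standard library alone, which may help if the csv crate version at hand differs from the one used in this answer. This is only a minimal sketch: the helper names `group_sums` and `to_csv` are made up for illustration, malformed lines are silently skipped (an assumption, not something the question specifies), and quoted fields are not handled the way a real CSV parser would handle them.

```rust
use std::collections::BTreeMap;

/// Sums column 2 for every (column 1, column 3) pair in pipe-delimited text.
/// `group_sums` is a hypothetical helper name, not part of any crate.
fn group_sums(input: &str) -> BTreeMap<(String, String), i32> {
    let mut sums = BTreeMap::new();
    for line in input.lines() {
        let fields: Vec<&str> = line.trim().split('|').collect();
        // Assumption of this sketch: blank or malformed lines are skipped.
        if fields.len() != 3 {
            continue;
        }
        let value: i32 = match fields[1].parse() {
            Ok(v) => v,
            Err(_) => continue,
        };
        let key = (fields[0].to_string(), fields[2].to_string());
        // The entry API inserts 0 the first time a key is seen, then adds to it.
        *sums.entry(key).or_insert(0) += value;
    }
    sums
}

/// Renders the grouped sums as CSV text; BTreeMap iteration is sorted by key.
fn to_csv(sums: &BTreeMap<(String, String), i32>) -> String {
    let mut out = String::from("name,result,num\n");
    for ((name, result), sum) in sums {
        out.push_str(&format!("{},{},{}\n", name, result, sum));
    }
    out
}

fn main() {
    let data = "A|1|Pass\nB|2|Fail\nA|3|Fail\nC|6|Pass\nA|8|Pass\n\
                B|10|Fail\nC|25|Pass\nA|12|Fail\nC|26|Pass\nC|26|Fail";
    print!("{}", to_csv(&group_sums(data)));
}
```

Because the key is a `(name, result)` tuple, the BTreeMap orders rows by name first and result second, which is why no separate sort is needed.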

I'd suggest something like this:
use std::str;
use std::collections::HashMap;
use std::io::{BufReader, BufRead, Cursor};
fn main() {
let data = "
A|1|Pass
B|2|Fail
A|3|Fail
C|6|Pass
A|8|Pass
B|10|Fail
C|25|Pass
A|12|Fail
C|26|Pass
C|26|Fail";
let lines = BufReader::new(Cursor::new(data))
.lines()
.flat_map(Result::ok)
.flat_map(parse_line);
for ((fa, fb), s) in group(lines) {
println!("{}|{}|{}", fa, fb, s);
}
}
type ParsedLine = ((String, String), usize);
fn parse_line(line: String) -> Option<ParsedLine> {
let mut fields = line
.split('|')
.map(str::trim);
if let (Some(fa), Some(fb), Some(fc)) = (fields.next(), fields.next(), fields.next()) {
fb.parse()
.ok()
.map(|v| ((fa.to_string(), fc.to_string()), v))
} else {
None
}
}
fn group<I>(input: I) -> Vec<ParsedLine> where I: Iterator<Item = ParsedLine> {
let mut table = HashMap::new();
for (k, v) in input {
let sum = table.entry(k).or_insert(0);
*sum += v;
}
let mut output: Vec<_> = table
.into_iter()
.collect();
output.sort_by(|a, b| a.0.cmp(&b.0));
output
}
Here a HashMap is used for grouping entries and then results are moved to a Vec for sorting.

Related

How to speed up Rust function to search through a large JSON file

I have a Rust function that searches through a large JSON file (about 1,080,000 lines). It currently takes about 1 second to search the file. The data in the file mostly looks like this:
{"although":false,"radio":2056538449,"hide":1713884795,"hello":1222349560.787047,"brain":903780409.0046091,"heard":-1165604870.8374772}
How would I be able to increase the performance of this function?
Here is my Main.rs file.
use std::collections::VecDeque;
use std::fs::File;
use std::io::BufWriter;
use std::io::{BufRead, BufReader, Write};
fn search(filename: &str, search_line: &str) -> Result<VecDeque<u32>, std::io::Error> {
let file = File::open(filename)?;
let mut reader = BufReader::with_capacity(2048 * 2048, file);
let mut line_numbers = VecDeque::new();
let mut line_number = 0;
let start = std::time::Instant::now();
loop {
line_number += 1;
let mut line = String::new();
let n = reader.read_line(&mut line)?;
if n == 0 {
break;
}
if line.trim() == search_line {
line_numbers.push_back(line_number);
println!(
"Matching line found on line number {}: {}",
line_number, line
);
break;
}
}
let elapsed = start.elapsed();
println!("Elapsed time: {:?}", elapsed);
if line_numbers.is_empty() {
println!("No lines found that match the given criteria");
}
Ok(line_numbers)
}
fn main() {
let database = "Test.json";
if let Err(e) = search(database, r#"{"08934":420696969}"#) {
println!("Error reading file: {}", e);
}
}
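One adjustment that is often tried for a loop like this is to reuse a single line buffer instead of allocating a new String on every iteration. The sketch below assumes that per-line allocation is a meaningful part of the cost, which has not been measured against the file above; `find_line` is a hypothetical name, and the function is generic over `BufRead` only so that it can be exercised without a file on disk. Whether the program is built with `cargo build --release` usually matters as well.

```rust
use std::io::{BufRead, Cursor};

/// Returns the 1-based number of the first line equal to `needle`, if any.
/// `find_line` is a hypothetical helper, not taken from the question.
fn find_line<R: BufRead>(mut reader: R, needle: &str) -> std::io::Result<Option<u32>> {
    // One buffer is allocated up front and reused for every line.
    let mut line = String::new();
    let mut line_number = 0u32;
    loop {
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            return Ok(None); // end of input, nothing matched
        }
        line_number += 1;
        if line.trim() == needle {
            return Ok(Some(line_number));
        }
    }
}

fn main() {
    // In-memory stand-in for Test.json.
    let data = "{\"although\":false}\n{\"08934\":420696969}\n";
    let hit = find_line(Cursor::new(data), r#"{"08934":420696969}"#).unwrap();
    println!("{:?}", hit);
}
```

For a real file, the same function could be called with `BufReader::new(File::open(filename)?)` as the reader.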

Programmatically creating an iterator from mapping functions in Rust

I'm trying to write code like the following, but where I apply f1 and f2 some variable number of times:
#![feature(impl_trait_in_bindings)]
fn f1(c: char) -> impl IntoIterator<Item = char> {
vec!['A', c]
}
fn f2(c: char) -> impl IntoIterator<Item = char> {
vec!['C', 'D', c]
}
fn main() {
let x = vec!['X', 'X', 'X'];
let v: impl Iterator<Item = char> = x.into_iter();
let v = v.flat_map(f1);
let v = v.flat_map(f2);
println!("Force evaluation of five elements: {:?}", v.take(5).collect::<Vec<_>>());
}
I'd like to replace the let v = ... lines with a loop that iteratively reassigns v, like
let mut v: impl Iterator<Item = char> = x.into_iter();
for i in 0..f1Times {
v = v.flat_map(f1);
}
for i in 0..f2Times {
v = v.flat_map(f2);
}
... e.g. for my use case I may have several functions and I won't know which ones (or how many times) to apply ahead of time. I'd like the result to be an iterator that I can take only a limited number of items from, and I'd like to avoid invoking any functions that aren't needed to generate those items.
I can't get the types to work. For instance with the let mut block I proposed above I get:
mismatched types
expected opaque type `impl Iterator`
found struct `FlatMap<impl Iterator, impl IntoIterator, fn(char) -> impl IntoIterator {f1}>`
Is there a good way to build up this sort of iterator programmatically?
I've found this pattern works, but still don't know if it's idiomatic or recommended...
#![feature(impl_trait_in_bindings)]
fn f1(c: char) -> impl Iterator<Item = char> {
Box::new(vec!['A', c].into_iter())
}
fn f2(c: char) -> impl Iterator<Item = char> {
Box::new(vec!['C', 'D', c].into_iter())
}
fn main() {
let x = vec!['X', 'X', 'X'];
let mut v: Box<dyn Iterator<Item = char>> = Box::new(x.into_iter());
let f1_ntimes = 2;
for _i in 0..f1_ntimes {
v = Box::new(v.into_iter().flat_map(f1));
}
let f2_ntimes = 2;
for _i in 0..f2_ntimes {
v = Box::new(v.into_iter().flat_map(f2));
}
println!("Force evaluation of five elements: {:?}", v.take(5).collect::<Vec<_>>());
}
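For the case where the set of functions is only known at run time, the same boxing pattern can be driven by a slice of function pointers. This is a sketch on stable Rust, not a statement of what is idiomatic: it assumes every step has the signature `fn(char) -> Vec<char>`, and `Step` and `apply_steps` are names invented here for illustration.

```rust
fn f1(c: char) -> Vec<char> {
    vec!['A', c]
}

fn f2(c: char) -> Vec<char> {
    vec!['C', 'D', c]
}

/// Hypothetical alias: every step maps one char to a short sequence of chars.
type Step = fn(char) -> Vec<char>;

/// Wraps the starting values in one lazy flat_map layer per step.
fn apply_steps(start: Vec<char>, steps: &[Step]) -> Box<dyn Iterator<Item = char>> {
    let mut v: Box<dyn Iterator<Item = char>> = Box::new(start.into_iter());
    for &step in steps {
        // Each layer is boxed so that `v` keeps a single type across iterations.
        v = Box::new(v.flat_map(step));
    }
    v
}

fn main() {
    // The list of steps could just as well be built at run time.
    let steps: [Step; 4] = [f1, f1, f2, f2];
    let first_five: Vec<char> = apply_steps(vec!['X', 'X', 'X'], &steps)
        .take(5)
        .collect();
    println!("Force evaluation of five elements: {:?}", first_five);
}
```

Because `flat_map` is lazy, building the chain does not call any step; the functions run only as `take(5)` pulls items through the layers.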

Load dataframe from json given headers and rowSet

I am trying to use the polars Rust library to create dataframes from JSON fetched from stats.nba.com (example json). The best example I could find for creating a dataframe from JSON was from the docs, but I'm not sure how to load a serde_json::Value into a Cursor and pass it into the JsonReader. Below is my code to load everything into Vecs and then create the Series and DataFrame, but is there a better way?
fn load_dataframe(&self) -> Result<()> {
let endpoint_json = self.send_request().unwrap();
let result_sets = endpoint_json["resultSets"].as_array().unwrap();
for data_set in result_sets {
let data_set_values = data_set["rowSet"].as_array().unwrap();
let data_set_headers = data_set["headers"].as_array().unwrap();
let mut headers_to_values: HashMap<&str, Vec<&Value>> = HashMap::new();
for (pos, row) in data_set_values.iter().enumerate() {
if pos == 0 {
init_columns(&mut headers_to_values, row, data_set_headers);
} else {
insert_row_values(&mut headers_to_values, row, data_set_headers);
}
}
let mut df_series: Vec<Series> = Vec::new();
for (col_name, json_values) in headers_to_values {
if json_values.is_empty() { continue; }
let first_val = json_values[0];
if first_val.is_null() { continue; }
if first_val.is_i64() {
let typed_data = json_values.iter().map(|&v| v.as_i64().unwrap_or(0)).collect::<Vec<i64>>();
df_series.push(Series::new(col_name, typed_data));
} else if first_val.is_f64() {
let typed_data = json_values.iter().map(|&v| v.as_f64().unwrap_or(0.0)).collect::<Vec<f64>>();
df_series.push(Series::new(col_name, typed_data));
} else {
let typed_data = json_values.iter().map(|&v| v.as_str().unwrap_or("")).collect::<Vec<&str>>();
df_series.push(Series::new(col_name, typed_data));
}
}
let data_set_name = data_set["name"].as_str().unwrap();
let df = DataFrame::new(df_series)?;
println!("{} \n{:?}", data_set_name, df);
}
Ok(())
}
fn init_columns<'a>(headers_to_values: &mut HashMap<&'a str, Vec<&'a Value>>, first_row: &'a Value, headers: &'a Vec<Value>) -> () {
let first_row_array = first_row.as_array().unwrap();
for (pos, col_val) in first_row_array.iter().enumerate() {
let col_name = headers[pos].as_str().unwrap();
headers_to_values.insert(col_name, vec![col_val]);
}
}
fn insert_row_values<'a>(headers_to_values: &mut HashMap<&'a str, Vec<&'a Value>>, row: &'a Value, headers: &'a Vec<Value>) -> () {
let row_array = row.as_array().unwrap();
for (pos, col_val) in row_array.iter().enumerate() {
let col_name = headers[pos].as_str().unwrap();
let series_values = headers_to_values.get_mut(col_name).unwrap();
series_values.push(col_val);
}
}

Reading CSV with list valued columns in rust

I'm trying to use csv and serde to read a mixed-delimiter csv-type file in rust, but I'm having a hard time seeing how to use these libraries to accomplish it. Each line looks roughly like:
value1|value2|subvalue1,subvalue2,subvalue3|value4
and would de-serialize to a struct that looks like:
struct Line {
value1:u64,
value2:u64,
value3:Vec<u64>,
value4:u64,
}
Any guidance on how to tell the library that there are two different delimiters and that one of the columns has this nested structure?
OK, I'm still a beginner in Rust, so I can't guarantee that this is good. I suspect it could be done more efficiently, but I do have a solution that works:
use csv::{ReaderBuilder};
use serde::{Deserialize, Deserializer};
use serde::de::Error;
use std::error::Error as StdError;
#[derive(Debug, Deserialize)]
pub struct ListType {
values: Vec<u8>,
}
fn deserialize_list<'de, D>(deserializer: D) -> Result<ListType , D::Error>
where D: Deserializer<'de> {
let buf: &str = Deserialize::deserialize(deserializer)?;
let mut rdr = ReaderBuilder::new()
.delimiter(b',')
.has_headers(false)
.from_reader(buf.as_bytes());
let mut iter = rdr.deserialize();
if let Some(result) = iter.next() {
let record: ListType = result.map_err(D::Error::custom)?;
return Ok(record)
} else {
return Err("error").map_err(D::Error::custom)
}
}
#[derive(Debug, Deserialize)]
struct Line {
value1:u64,
value2:u64,
#[serde(deserialize_with = "deserialize_list")]
value3:ListType,
value4:u64,
}
fn read_line(line: &str) -> Result<Line, Box<dyn StdError>> {
// Each call parses a single data line, so there is no header row to skip.
let mut rdr = ReaderBuilder::new()
.delimiter(b'|')
.has_headers(false)
.from_reader(line.as_bytes());
let mut iter = rdr.deserialize();
if let Some(result) = iter.next() {
let record: Line = result?;
Ok(record)
} else {
Err(From::from("error"))
}
}
[EDIT]
I found that the above solution was intolerably slow, but I was able to make performance acceptable by manually deserializing the nested type into a fixed-size array:
#[derive(Debug, Deserialize)]
pub struct ListType {
values: [Option<u8>; 8],
}
fn deserialize_farray<'de, D>(deserializer: D) -> Result<ListType, D::Error>
where
D: Deserializer<'de>,
{
let buf: &str = Deserialize::deserialize(deserializer)?;
let split = buf.split(',');
let mut dest = ListType {
values: [None; 8],
};
let mut ind: usize = 0;
for tok in split {
if tok == "" {
break;
}
match tok.parse::<u8>() {
Ok(val) => {
dest.values[ind] = Some(val);
}
Err(e) => {
return Err(e).map_err(D::Error::custom);
}
}
ind += 1;
}
return Ok(dest);
}
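If pulling in csv and serde is not a requirement, the two-delimiter format from the question can also be split by hand. The following is a rough sketch under the assumption that fields never contain quoted or escaped `|` or `,` characters; `parse_u64` and `parse_line` are hypothetical helpers, and errors are reported as plain strings for brevity.

```rust
#[derive(Debug, PartialEq)]
struct Line {
    value1: u64,
    value2: u64,
    value3: Vec<u64>,
    value4: u64,
}

/// Parses one number, describing the offending text on failure.
fn parse_u64(s: &str) -> Result<u64, String> {
    s.trim()
        .parse::<u64>()
        .map_err(|e| format!("invalid number {:?}: {}", s, e))
}

/// Splits on '|' for the outer fields and on ',' for the nested list.
fn parse_line(line: &str) -> Result<Line, String> {
    let fields: Vec<&str> = line.trim().split('|').collect();
    if fields.len() != 4 {
        return Err(format!("expected 4 fields, found {}", fields.len()));
    }
    let value3 = fields[2]
        .split(',')
        .filter(|s| !s.trim().is_empty()) // an empty list column yields an empty Vec
        .map(parse_u64)
        .collect::<Result<Vec<u64>, String>>()?;
    Ok(Line {
        value1: parse_u64(fields[0])?,
        value2: parse_u64(fields[1])?,
        value3,
        value4: parse_u64(fields[3])?,
    })
}

fn main() {
    println!("{:?}", parse_line("1|2|3,4,5|6"));
}
```

Unlike the fixed-size array above, this keeps the `Vec<u64>` from the question, so lists of any length are accepted at the cost of one allocation per line.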

Omit values that are Option::None when encoding JSON with rustc_serialize

I have a struct that I want to encode to JSON. This struct contains a field with type Option<i32>. Let's say
extern crate rustc_serialize;
use rustc_serialize::json;
#[derive(RustcEncodable)]
struct TestStruct {
test: Option<i32>
}
fn main() {
let object = TestStruct {
test: None
};
let obj = json::encode(&object).unwrap();
println!("{}", obj);
}
This will give me the output
{"test": null}
Is there a convenient way to omit Option fields with value None? In this case I would like to have the output
{}
If someone arrives here with the same question as I did: the serde_with crate (a companion to serde) now has a skip_serializing_none attribute that does exactly that.
https://docs.rs/serde_with/1.8.0/serde_with/attr.skip_serializing_none.html
It doesn't seem to be possible using the struct alone, so I converted the struct into a string and then converted that into a JSON object. This method requires that all Option fields have the same type. If you need different types in the struct, I'd recommend turning them into strings first.
field_vec and field_name_vec have to be filled with all fields at compile time, as I couldn't find a way to get the field values and field names at run time without knowing them in advance.
extern crate rustc_serialize;
use rustc_serialize::json::Json;
fn main() {
#[derive(RustcEncodable)]
struct TestStruct {
test: Option<i32>
}
impl TestStruct {
fn to_json(&self) -> String {
let mut json_string = String::new();
json_string.push('{');
let field_vec = vec![self.test];
// Field names are listed by hand, in the same order as field_vec.
let field_name_vec = vec!["test"];
let mut previous_field = false;
let mut count = 0;
for field in field_vec {
if previous_field {
json_string.push(',');
}
match field {
Some(value) => {
let opt_name = field_name_vec[count];
json_string.push('"');
json_string.push_str(opt_name);
json_string.push('"');
json_string.push(':');
json_string.push_str(&value.to_string());
previous_field = true;
},
None => {},
}
count += 1;
}
json_string.push('}');
json_string
}
}
let object = TestStruct {
test: None
};
let object2 = TestStruct {
test: Some(42)
};
let obj = Json::from_str(&object.to_json()).unwrap();
let obj2 = Json::from_str(&object2.to_json()).unwrap();
println!("{:?}", obj);
println!("{:?}", obj2);
}
To omit Option<T> fields, you can create an implementation of the Encodable trait (instead of using #[derive(RustcEncodable)]).
Here I updated your example to do this.
extern crate rustc_serialize;
use rustc_serialize::json::{ToJson, Json};
use rustc_serialize::{Encodable,json};
use std::collections::BTreeMap;
#[derive(PartialEq, RustcDecodable)]
struct TestStruct {
test: Option<i32>
}
impl Encodable for TestStruct {
fn encode<S: rustc_serialize::Encoder>(&self, s: &mut S) -> Result<(), S::Error> {
self.to_json().encode(s)
}
}
impl ToJson for TestStruct {
fn to_json(&self) -> Json {
let mut d = BTreeMap::new();
match self.test {
Some(value) => { d.insert("test".to_string(), value.to_json()); },
None => {},
}
Json::Object(d)
}
}
fn main() {
let object = TestStruct {
test: None
};
let obj = json::encode(&object).unwrap();
println!("{}", obj);
let decoded: TestStruct = json::decode(&obj).unwrap();
assert!(decoded==object);
}
It would be nice to implement a custom #[derive] macro which does this automatically for Option fields, as this would eliminate the need for such custom implementations of Encodable.