Reading CSV with list-valued columns in Rust

I'm trying to use csv and serde to read a mixed-delimiter, CSV-like file in Rust, but I'm having a hard time seeing how to use these libraries to accomplish it. Each line looks roughly like:
value1|value2|subvalue1,subvalue2,subvalue3|value4
and would de-serialize to a struct that looks like:
struct Line {
    value1: u64,
    value2: u64,
    value3: Vec<u64>,
    value4: u64,
}
Any guidance on how to tell the library that there are two different delimiters and that one of the columns has this nested structure?

OK, I'm still a beginner in Rust, so I can't guarantee that this is any good (I suspect it could be done more efficiently), but I do have a solution that works:
use csv::ReaderBuilder;
use serde::{Deserialize, Deserializer};
use serde::de::Error;
use std::error::Error as StdError;

#[derive(Debug, Deserialize)]
pub struct ListType {
    values: Vec<u8>,
}

fn deserialize_list<'de, D>(deserializer: D) -> Result<ListType, D::Error>
where
    D: Deserializer<'de>,
{
    let buf: &str = Deserialize::deserialize(deserializer)?;
    // Parse the comma-delimited sub-field with a nested csv reader.
    let mut rdr = ReaderBuilder::new()
        .delimiter(b',')
        .has_headers(false)
        .from_reader(buf.as_bytes());
    let mut iter = rdr.deserialize();
    if let Some(result) = iter.next() {
        let record: ListType = result.map_err(D::Error::custom)?;
        Ok(record)
    } else {
        Err("error").map_err(D::Error::custom)
    }
}
#[derive(Debug, Deserialize)]
struct Line {
    value1: u64,
    value2: u64,
    #[serde(deserialize_with = "deserialize_list")]
    value3: ListType,
    value4: u64,
}
fn read_line(line: &str) -> Result<Line, Box<dyn StdError>> {
    let mut rdr = ReaderBuilder::new()
        .delimiter(b'|')
        .has_headers(false) // a lone data line has no header row
        .from_reader(line.as_bytes());
    let mut iter = rdr.deserialize();
    if let Some(result) = iter.next() {
        let record: Line = result?;
        Ok(record)
    } else {
        Err(From::from("error"))
    }
}
[EDIT]
I found that the above solution was intolerably slow, but I was able to make performance acceptable by manually deserializing the nested type into a fixed-size array:
#[derive(Debug, Deserialize)]
pub struct ListType {
    values: [Option<u8>; 8],
}

fn deserialize_farray<'de, D>(deserializer: D) -> Result<ListType, D::Error>
where
    D: Deserializer<'de>,
{
    let buf: &str = Deserialize::deserialize(deserializer)?;
    let split = buf.split(',');
    let mut dest = ListType {
        values: [None; 8],
    };
    let mut ind: usize = 0;
    for tok in split {
        if tok.is_empty() {
            break;
        }
        match tok.parse::<u8>() {
            Ok(val) => {
                dest.values[ind] = Some(val);
            }
            Err(e) => {
                return Err(e).map_err(D::Error::custom);
            }
        }
        ind += 1;
    }
    Ok(dest)
}
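For reference, the same split-and-parse idea also works for the original Vec-valued field without spinning up a nested csv reader per field, which is where most of the cost of the first version goes. A minimal end-to-end sketch, assuming the pipe-delimited layout from the question (the sample line is illustrative):
use csv::ReaderBuilder;
use serde::de::Error;
use serde::{Deserialize, Deserializer};

#[derive(Debug, Deserialize)]
struct Line {
    value1: u64,
    value2: u64,
    #[serde(deserialize_with = "deserialize_list")]
    value3: Vec<u64>,
    value4: u64,
}

// Split the comma-delimited sub-field and parse each token directly,
// instead of building a nested csv reader for every field.
fn deserialize_list<'de, D>(deserializer: D) -> Result<Vec<u64>, D::Error>
where
    D: Deserializer<'de>,
{
    let buf: &str = Deserialize::deserialize(deserializer)?;
    buf.split(',')
        .map(|tok| tok.parse::<u64>().map_err(D::Error::custom))
        .collect()
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = "1|2|3,4,5|6";
    let mut rdr = ReaderBuilder::new()
        .delimiter(b'|')
        .has_headers(false)
        .from_reader(data.as_bytes());
    for result in rdr.deserialize() {
        let line: Line = result?;
        // Line { value1: 1, value2: 2, value3: [3, 4, 5], value4: 6 }
        println!("{:?}", line);
    }
    Ok(())
}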

Related

Load dataframe from json given headers and rowSet

I am trying to use the polars rust library to create dataframes from json fetched from stats.nba.com (example json). The best example I could find for creating a dataframe from json was from the docs, but I'm not sure how to load a serde_json::Value into a Cursor and pass it into the JsonReader. Below is my code, which loads everything into Vecs and then creates the Series and DataFrame; is there a better way?
fn load_dataframe(&self) -> Result<()> {
    let endpoint_json = self.send_request().unwrap();
    let result_sets = endpoint_json["resultSets"].as_array().unwrap();
    for data_set in result_sets {
        let data_set_values = data_set["rowSet"].as_array().unwrap();
        let data_set_headers = data_set["headers"].as_array().unwrap();
        let mut headers_to_values: HashMap<&str, Vec<&Value>> = HashMap::new();
        for (pos, row) in data_set_values.iter().enumerate() {
            if pos == 0 {
                init_columns(&mut headers_to_values, row, data_set_headers);
            } else {
                insert_row_values(&mut headers_to_values, row, data_set_headers);
            }
        }
        let mut df_series: Vec<Series> = Vec::new();
        for (col_name, json_values) in headers_to_values {
            if json_values.is_empty() { continue; }
            let first_val = json_values[0];
            if first_val.is_null() { continue; }
            if first_val.is_i64() {
                let typed_data = json_values.iter().map(|&v| v.as_i64().unwrap_or(0)).collect::<Vec<i64>>();
                df_series.push(Series::new(col_name, typed_data));
            } else if first_val.is_f64() {
                let typed_data = json_values.iter().map(|&v| v.as_f64().unwrap_or(0.0)).collect::<Vec<f64>>();
                df_series.push(Series::new(col_name, typed_data));
            } else {
                let typed_data = json_values.iter().map(|&v| v.as_str().unwrap_or("")).collect::<Vec<&str>>();
                df_series.push(Series::new(col_name, typed_data));
            }
        }
        let data_set_name = data_set["name"].as_str().unwrap();
        let df = DataFrame::new(df_series)?;
        println!("{} \n{:?}", data_set_name, df);
    }
    Ok(())
}
fn init_columns<'a>(headers_to_values: &mut HashMap<&'a str, Vec<&'a Value>>, first_row: &'a Value, headers: &'a Vec<Value>) {
    let first_row_array = first_row.as_array().unwrap();
    for (pos, col_val) in first_row_array.iter().enumerate() {
        let col_name = headers[pos].as_str().unwrap();
        headers_to_values.insert(col_name, vec![col_val]);
    }
}

fn insert_row_values<'a>(headers_to_values: &mut HashMap<&'a str, Vec<&'a Value>>, row: &'a Value, headers: &'a Vec<Value>) {
    let row_array = row.as_array().unwrap();
    for (pos, col_val) in row_array.iter().enumerate() {
        let col_name = headers[pos].as_str().unwrap();
        let series_values = headers_to_values.get_mut(col_name).unwrap();
        series_values.push(col_val);
    }
}
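On the narrower question of getting a serde_json::Value into a Cursor for JsonReader: the reader just needs an in-memory source it can read from, so serializing the Value back to bytes and wrapping them in a Cursor is enough. A sketch, assuming a polars build with the "json" feature enabled; note that JsonReader expects an array of objects, so the headers/rowSet layout above would still need reshaping into that form first:
use std::io::Cursor;
use polars::prelude::*; // assumes the "json" feature is enabled
use serde_json::{json, Value};

fn dataframe_from_value(value: &Value) -> Result<DataFrame, Box<dyn std::error::Error>> {
    // Serialize the in-memory Value to bytes and hand JsonReader a Cursor over them.
    let bytes = serde_json::to_vec(value)?;
    let df = JsonReader::new(Cursor::new(bytes)).finish()?;
    Ok(df)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // An array of objects: the shape JsonReader can load directly.
    let value = json!([
        {"PLAYER_ID": 1, "PTS": 25.5},
        {"PLAYER_ID": 2, "PTS": 17.0}
    ]);
    let df = dataframe_from_value(&value)?;
    println!("{:?}", df);
    Ok(())
}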

How to extend a BiMap with a Vec<String> in Rust?

How can I push this
["a", "bb", "c", "dd", "e", "ff", "g", "hh"] //type is Vec<String>
to this
fn list_codes() -> BiMap<char, &'static str>
{
    let letters = BiMap::<char, &str>::new();
    [('a', "cl01"),
     ('b', "cl02"),
     ('c', "cl03"),
     ('d', "cl04")]
        .into_iter()
        .collect()
}
so that the result looks like this:
fn list_codes() -> BiMap<char, &'static str>
{
    [('a', "cl01"),
     ('b', "cl02"),
     ('c', "cl03"),
     ('d', "cl04"),
     ('a', "bb"),
     ('c', "dd"),
     ('e', "ff"),
     ('g', "hh")]
        .into_iter()
        .collect()
}
Actually the logic is simple: the program takes a list from a csv file and imports it into a BiMap. After the import, it uses this BiMap to encode/decode text.
EXTRAS:
The rest of code is here:
//-------------------------//
#![allow(non_snake_case)]
//-------------------------//
#![allow(dead_code)]
//-------------------------//
#![allow(unused_variables)]
//-------------------------//
#![allow(unused_imports)]
//-------------------------//
// #![allow(inactive_code)]
//-------------------------//
use std::error::Error as stdErr;
use std::io::Error as ioErr;
use std::net::ToSocketAddrs;
use std::process;
use std::env;
use std::fs;
use std::array;
use std::io;
use std::slice::SliceIndex;
use csv::Error as csvErr;
use csv::Reader;
use csv::StringRecord;
use bimap::BiMap;
use directories::ProjectDirs;
use serde::Deserialize;

//#[cfg(doc)] #[doc = include_str!("../Changelog.md")] pub mod _changelog{}

#[derive(Deserialize, Debug)]
struct Config {
    seperator: String,
    list_file_path: String,
}

#[derive(Debug, Deserialize)]
struct Record<'a> {
    letter: char,
    code: &'a str,
}

fn find_and_read_config_file() -> Result<String, ioErr>
{
    if let Some(proj_dirs) = ProjectDirs::from
    (
        "dev",
        "rusty-bois",
        "test-config-parser",
    )
    {
        let config_file_name = "settings.toml";
        let config_path = proj_dirs.config_dir();
        let config_file = fs::read_to_string(config_path.join(config_file_name));
        config_file
    }
    else
    {
        Err(ioErr::new(io::ErrorKind::NotFound, "no"))
    }
}

fn parse_toml() -> String
{
    let config_file = find_and_read_config_file();
    let config: Config = match config_file
    {
        Ok(file) => toml::from_str(&file).unwrap(),
        Err(_) => Config
        {
            seperator: "x".to_string(),
            list_file_path: "rdl".to_string(),
        },
    };
    let result = format!("{}\n{}", config.list_file_path, config.seperator);
    return result;
}

fn reformat_parse_toml() -> Vec<String>
{
    let mut vec_values = Vec::new();
    let mut i = 0;
    for values in parse_toml().split("\n")
    {
        vec_values.insert(i, values.to_string());
        i += 1;
    }
    vec_values
}
fn read_and_parse_csv() -> Result<Vec<StringRecord>, csvErr>
{
    let config_vars = reformat_parse_toml();
    let csv_file_path = &config_vars[0];
    let mut rdr = Reader::from_path(csv_file_path)?;
    rdr.records().collect()
}

fn reformat_read_and_parse_csv() -> Vec<String>
{
    let csv_records = read_and_parse_csv();
    let mut csv_records_as_list = Vec::new();
    let mut i = 0;
    for a in csv_records.iter()
    {
        for b in a
        {
            for c in b
            {
                csv_records_as_list.insert(i, c.to_string());
                i += 1
            }
        }
    }
    csv_records_as_list
}

fn input() -> String
{
    let mut input = String::new();
    match io::stdin().read_line(&mut input) {
        Ok(_) => {
            return input.to_string();
        },
        Err(e) => {
            return e.to_string();
        }
    }
}

fn list_codes() -> BiMap<char, &'static str>
{
    let letters = BiMap::<char, &str>::new();
    [('a', "cl01"),
     ('b', "cl02"),
     ('c', "cl03"),
     ('d', "cl04")]
        .into_iter()
        .collect()
}
fn coder(flag: &str) -> String
{
    let letters = list_codes();
    let mut result_raw = String::new();
    let result = String::new();
    let config_vars = reformat_parse_toml();
    let split_char = &config_vars[1];
    if flag == "e"
    {
        println!("Enter the text that you want to encrypt:");
        let ipt = input().trim_end().to_string();
        let ipt_char = ipt.chars();
        for letter in ipt_char
        {
            result_raw = format!("{}{}{}", result_raw, split_char, letters.get_by_left(&letter).unwrap());
        }
        let result = &result_raw[1..];
        return result.to_string();
    }
    else if flag == "d"
    {
        println!("Enter the text that you want to decrypt:");
        let ipt = input().trim_end().to_string();
        let ipt_char = ipt.chars();
        for code in ipt.split(split_char)
        {
            result_raw = format!("{} {}", result_raw, letters.get_by_right(code).unwrap());
        }
        let decoded = result_raw;
        return decoded;
    }
    else
    {
        return "Error while decode/encode the input".to_string();
    }
}

fn coder_from_file(flag: &str, path: &str) -> String
{
    let letters = list_codes();
    let mut result_raw = String::new();
    let result = String::new();
    let contents = fs::read_to_string(path)
        .expect("Something went wrong reading the file");
    let config_vars = reformat_parse_toml();
    let split_char = &config_vars[1];
    if flag == "ef"
    {
        for letter in contents.chars()
        {
            result_raw = format!("{}{}{}", result_raw, split_char, letters.get_by_left(&letter).unwrap());
        }
        let result = &result_raw[1..];
        return result.to_string();
    }
    else if flag == "df"
    {
        for code in contents.replace("\n", "xpl01").split(split_char)
        {
            // You might want to have a look at the `String.push_str()` function to avoid creating a new string every time
            result_raw = format!("{}{}", result_raw, letters.get_by_right(code).unwrap());
        }
        let result = result_raw;
        return result;
    }
    else
    {
        return "Error while decode/encode the input".to_string();
    }
}
fn coder_options() -> String
{
    let args: Vec<String> = env::args().collect();
    let mode = &args[1];
    let config_vars = reformat_parse_toml();
    let split_char = &config_vars[1];
    let mut m_opt = String::new();
    if mode == "-d"
    {
        m_opt = coder(&"d");
    }
    else if mode == "-e"
    {
        m_opt = coder(&"e");
    }
    else if mode == "-ef"
    {
        let filename = &args[2];
        m_opt = coder_from_file(&"ef", &filename);
    }
    else if mode == "-df"
    {
        let filename = &args[2];
        m_opt = coder_from_file(&"df", &filename);
    }
    else
    {
        println!("You picked the wrong flag. Please select a valid one.")
    }
    return m_opt;
}

fn main()
{
    let coder_options = coder_options();
    println!("{}", coder_options)
}
Here is Cargo.toml
[package]
name = "Coder"
version = "0.1.0"
edition = "2021"
authors = ["Huso112"]
license= "GPL-3.0"
description = "This program encrypt your datas with special codes."
repository=""
[[bin]]
path = "src/Coder.rs"
name = "Coder"
[dependencies]
bimap = "~0.6.1"
directories = "~4.0.1"
serde = {version="~1.0.133", features=["derive"]}
toml = "~0.5.8"
csv = "~1.1.6"
Here is settings.toml file
seperator = "z"
list_file_path = "/home/hoovy/.config/test-config-parser/lang-table.csv"
Here is lang-table.csv file
letter,code
a,bb
c,dd
e,ff
g,hh
Boring way:
let mut i = 0;
while i < vec_values.len() - 1 {
    let key_str = &vec_values[i];
    // `letters` needs to own its values here, i.e. a BiMap<char, String>.
    let value = vec_values[i + 1].clone();
    let key = match key_str.chars().nth(0) {
        Some(key) => key,
        None => {
            println!("error: empty key");
            break;
        }
    };
    letters.insert(key, value);
    i += 2;
}
A version without clone() if you don't need vec_values after this, and can tear it apart:
if vec_values.len() % 2 != 0 {
    println!("error: missing a last value");
    return;
}
while !vec_values.is_empty() {
    // Note: ok to unwrap, because we are sure that vec_values has an even number of values
    let value = vec_values.pop().unwrap();
    let key_str = vec_values.pop().unwrap();
    let key = match key_str.chars().nth(0) {
        Some(key) => key,
        None => {
            println!("error: empty key");
            break;
        }
    };
    letters.insert(key, value);
}
There's also a "smart" way using Iterator methods, but I'll refrain from advising it here.

Serde Deserialize into one of multiple structs?

Is there a nice way to tentatively deserialize a JSON into different structs? I couldn't find anything in the docs, and unfortunately the structs have no "tag" to differentiate them, as in How to conditionally deserialize JSON to two different variants of an enum?
So far my approach has been like this:
use aws_lambda_events::event::{
    firehose::KinesisFirehoseEvent, kinesis::KinesisEvent,
    kinesis_analytics::KinesisAnalyticsOutputDeliveryEvent,
};
use lambda::{lambda, Context};
use serde_json::Value;

type Error = Box<dyn std::error::Error + Send + Sync + 'static>;

enum MultipleKinesisEvent {
    KinesisEvent(KinesisEvent),
    KinesisFirehoseEvent(KinesisFirehoseEvent),
    KinesisAnalyticsOutputDeliveryEvent(KinesisAnalyticsOutputDeliveryEvent),
    None,
}

#[lambda]
#[tokio::main]
async fn main(event: Value, _: Context) -> Result<String, Error> {
    let multi_kinesis_event = if let Ok(e) = serde_json::from_value::<KinesisEvent>(event.clone()) {
        MultipleKinesisEvent::KinesisEvent(e)
    } else if let Ok(e) = serde_json::from_value::<KinesisFirehoseEvent>(event.clone()) {
        MultipleKinesisEvent::KinesisFirehoseEvent(e)
    } else if let Ok(e) = serde_json::from_value::<KinesisAnalyticsOutputDeliveryEvent>(event) {
        MultipleKinesisEvent::KinesisAnalyticsOutputDeliveryEvent(e)
    } else {
        MultipleKinesisEvent::None
    };
    // code below is just sample
    let s = match multi_kinesis_event {
        MultipleKinesisEvent::KinesisEvent(_) => "Kinesis Data Stream!",
        MultipleKinesisEvent::KinesisFirehoseEvent(_) => "Kinesis Firehose!",
        MultipleKinesisEvent::KinesisAnalyticsOutputDeliveryEvent(_) => "Kinesis Analytics!",
        MultipleKinesisEvent::None => "Not Kinesis!",
    };
    Ok(s.to_owned())
}
You should use the #[serde(untagged)] attribute.
use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize, Debug)]
struct KinesisFirehoseEvent {
    x: i32,
    y: i32
}

#[derive(Serialize, Deserialize, Debug)]
struct KinesisEvent(i32);

#[derive(Serialize, Deserialize, Debug)]
#[serde(untagged)]
enum MultipleKinesisEvent {
    KinesisEvent(KinesisEvent),
    KinesisFirehoseEvent(KinesisFirehoseEvent),
    None,
}

fn main() {
    let event = MultipleKinesisEvent::KinesisFirehoseEvent(KinesisFirehoseEvent { x: 1, y: 2 });
    // Convert the Event to a JSON string.
    let serialized = serde_json::to_string(&event).unwrap();
    // Prints serialized = {"x":1,"y":2}
    println!("serialized = {}", serialized);
    // Convert the JSON string back to a MultipleKinesisEvent.
    // Since it is untagged, serde tries each variant in order until one matches.
    let deserialized: MultipleKinesisEvent = serde_json::from_str(&serialized).unwrap();
    // Prints deserialized = KinesisFirehoseEvent(KinesisFirehoseEvent { x: 1, y: 2 })
    println!("deserialized = {:?}", deserialized);
}
See it in the playground.
Docs for: Untagged
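One caveat worth keeping in mind with #[serde(untagged)]: serde tries the variants in declaration order and picks the first one that deserializes successfully, and since serde ignores unknown fields by default, a "narrower" struct listed first can shadow a wider one. A small sketch of the pitfall (hypothetical types):
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Narrow { x: i32 }

#[derive(Deserialize, Debug)]
struct Wide { x: i32, y: i32 }

#[derive(Deserialize, Debug)]
#[serde(untagged)]
enum Event {
    // Narrow is tried first; unknown fields are ignored by default,
    // so it also matches JSON that was meant for Wide.
    Narrow(Narrow),
    Wide(Wide),
}

fn main() {
    let e: Event = serde_json::from_str(r#"{"x": 1, "y": 2}"#).unwrap();
    // Prints Narrow(Narrow { x: 1 }), not Wide.
    println!("{:?}", e);
}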

How to read and process a pipe delimited file in Rust?

I want to read a pipe delimited file, process the data, and generate a result in CSV format.
Input file data
A|1|Pass
B|2|Fail
A|3|Fail
C|6|Pass
A|8|Pass
B|10|Fail
C|25|Pass
A|12|Fail
C|26|Pass
C|26|Fail
I want to apply a group-by on column 1 and column 3 and compute the sum of column 2 for each group.
I'm stuck on how to maintain the records so that I can apply the group-by:
use std::fs::File;
use std::io::{BufReader};
use std::io::{BufRead};
use std::collections::HashMap;

fn say_hello(id: &str, value: i32, no_change: i32) {
    if no_change == 101 {
        let mut data = HashMap::new();
    }
    if value == 0 {
        if data.contains_key(id) {
            for (key, value) in &data {
                if value.is_empty() {
                }
            }
        } else {
            data.insert(id, "");
        }
    } else if value == 2 {
        if data.contains_key(id) {
            for (key, value) in &data {
                if value.is_empty() {
                } else {
                }
            }
        } else {
            data.insert(id, "");
        }
    }
}

fn main() {
    let f = File::open("sample2.txt").expect("Unable to open file");
    let br = BufReader::new(f);
    let mut no_change = 101;
    for line in br.lines() {
        let mut index = 0;
        for value in line.unwrap().split('|') {
            say_hello(&value, index, no_change);
            index = index + 1;
        }
    }
}
I'm expecting a result like:
name,result,num
A,Fail,15
A,Pass,9
B,Fail,12
C,Fail,26
C,Pass,57
Is there any specific technique for reading a pipe-delimited file and processing the data like this? Python's pandas handles this requirement, but I want to do it in Rust.
As was mentioned, use the csv crate to do the heavy lifting of parsing the file. Then it's just a matter of grouping each row by using a BTreeMap which also helpfully performs sorting. The entry API helps efficiently insert into the BTreeMap.
extern crate csv;
extern crate rustc_serialize;

use std::fs::File;
use std::collections::BTreeMap;

#[derive(Debug, RustcDecodable)]
struct Record {
    name: String,
    value: i32,
    passed: String,
}

fn main() {
    let file = File::open("input").expect("Couldn't open input");
    let mut csv_file = csv::Reader::from_reader(file).delimiter(b'|').has_headers(false);
    let mut sums = BTreeMap::new();
    for record in csv_file.decode() {
        let record: Record = record.expect("Could not parse input file");
        let key = (record.name, record.passed);
        *sums.entry(key).or_insert(0) += record.value;
    }
    println!("name,result,num");
    for ((name, passed), sum) in sums {
        println!("{},{},{}", name, passed, sum);
    }
}
You'll note that the output is correct:
name,result,num
A,Fail,15
A,Pass,9
B,Fail,12
C,Fail,26
C,Pass,57
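For reference, with the post-1.0 csv crate and serde the same approach might look something like this sketch (field names match the answer above; with has_headers(false), csv matches struct fields by position):
use std::collections::BTreeMap;
use std::error::Error;
use std::fs::File;

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Record {
    name: String,
    value: i32,
    passed: String,
}

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("input")?;
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b'|')
        .has_headers(false)
        .from_reader(file);

    // Group by (name, passed); BTreeMap keeps the keys sorted for the output.
    let mut sums: BTreeMap<(String, String), i32> = BTreeMap::new();
    for result in rdr.deserialize() {
        let record: Record = result?;
        *sums.entry((record.name, record.passed)).or_insert(0) += record.value;
    }

    println!("name,result,num");
    for ((name, passed), sum) in sums {
        println!("{},{},{}", name, passed, sum);
    }
    Ok(())
}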
I'd suggest something like this:
use std::str;
use std::collections::HashMap;
use std::io::{BufReader, BufRead, Cursor};

fn main() {
    let data = "
A|1|Pass
B|2|Fail
A|3|Fail
C|6|Pass
A|8|Pass
B|10|Fail
C|25|Pass
A|12|Fail
C|26|Pass
C|26|Fail";

    let lines = BufReader::new(Cursor::new(data))
        .lines()
        .flat_map(Result::ok)
        .flat_map(parse_line);

    for ((fa, fb), s) in group(lines) {
        println!("{}|{}|{}", fa, fb, s);
    }
}

type ParsedLine = ((String, String), usize);

fn parse_line(line: String) -> Option<ParsedLine> {
    let mut fields = line
        .split('|')
        .map(str::trim);
    if let (Some(fa), Some(fb), Some(fc)) = (fields.next(), fields.next(), fields.next()) {
        fb.parse()
            .ok()
            .map(|v| ((fa.to_string(), fc.to_string()), v))
    } else {
        None
    }
}

fn group<I>(input: I) -> Vec<ParsedLine> where I: Iterator<Item = ParsedLine> {
    let mut table = HashMap::new();
    for (k, v) in input {
        let mut sum = table.entry(k).or_insert(0);
        *sum += v;
    }
    let mut output: Vec<_> = table
        .into_iter()
        .collect();
    output.sort_by(|a, b| a.0.cmp(&b.0));
    output
}
playground link
Here a HashMap is used for grouping entries and then results are moved to a Vec for sorting.

Omit values that are Option::None when encoding JSON with rustc_serialize

I have a struct that I want to encode to JSON. This struct contains a field with type Option<i32>. Let's say
extern crate rustc_serialize;

use rustc_serialize::json;

#[derive(RustcEncodable)]
struct TestStruct {
    test: Option<i32>
}

fn main() {
    let object = TestStruct {
        test: None
    };
    let obj = json::encode(&object).unwrap();
    println!("{}", obj);
}
This will give me the output
{"test": null}
Is there a convenient way to omit Option fields with value None? In this case I would like to have the output
{}
If someone arrives here with the same question, as I did: serde_with now has a skip_serializing_none attribute to do exactly that.
https://docs.rs/serde_with/1.8.0/serde_with/attr.skip_serializing_none.html
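Usage might look like this (a sketch; note that plain serde can also do this per field with #[serde(skip_serializing_if = "Option::is_none")]):
use serde::Serialize;
use serde_with::skip_serializing_none;

// The attribute must come before the derive so it can rewrite the fields.
#[skip_serializing_none]
#[derive(Serialize)]
struct TestStruct {
    test: Option<i32>,
}

fn main() {
    let object = TestStruct { test: None };
    // Prints {} because `test` is None.
    println!("{}", serde_json::to_string(&object).unwrap());
}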
It doesn't seem to be possible purely from a struct, so I converted the struct into a string and then converted that into a JSON object. This method requires that all Option fields have the same type; if you need variable types in the struct, I'd recommend turning them into Strings first.
field_vec and field_name_vec have to be filled with all fields at compile time, as I couldn't find a way in Rust to get the field values and field names at run time without knowing them.
extern crate rustc_serialize;

use rustc_serialize::json::Json;

fn main() {
    #[derive(RustcEncodable)]
    struct TestStruct {
        test: Option<i32>
    }

    impl TestStruct {
        fn to_json(&self) -> String {
            let mut json_string = String::new();
            json_string.push('{');
            let field_vec = vec![self.test];
            let field_name_vec = vec![stringify!(self.test)];
            let mut previous_field = false;
            let mut count = 0;
            for field in field_vec {
                if previous_field {
                    json_string.push(',');
                }
                match field {
                    Some(value) => {
                        let opt_name = field_name_vec[count].split(". ").collect::<Vec<&str>>()[1];
                        json_string.push('"');
                        json_string.push_str(opt_name);
                        json_string.push('"');
                        json_string.push(':');
                        json_string.push_str(&value.to_string());
                        previous_field = true;
                    },
                    None => {},
                }
                count += 1;
            }
            json_string.push('}');
            json_string
        }
    }

    let object = TestStruct {
        test: None
    };
    let object2 = TestStruct {
        test: Some(42)
    };
    let obj = Json::from_str(&object.to_json()).unwrap();
    let obj2 = Json::from_str(&object2.to_json()).unwrap();
    println!("{:?}", obj);
    println!("{:?}", obj2);
}
To omit Option<T> fields, you can create an implementation of the Encodable trait (instead of using #[derive(RustcEncodable)]).
Here I updated your example to do this.
extern crate rustc_serialize;

use rustc_serialize::json::{ToJson, Json};
use rustc_serialize::{Encodable, json};
use std::collections::BTreeMap;

#[derive(PartialEq, RustcDecodable)]
struct TestStruct {
    test: Option<i32>
}

impl Encodable for TestStruct {
    fn encode<S: rustc_serialize::Encoder>(&self, s: &mut S) -> Result<(), S::Error> {
        self.to_json().encode(s)
    }
}

impl ToJson for TestStruct {
    fn to_json(&self) -> Json {
        let mut d = BTreeMap::new();
        match self.test {
            Some(value) => { d.insert("test".to_string(), value.to_json()); },
            None => {},
        }
        Json::Object(d)
    }
}

fn main() {
    let object = TestStruct {
        test: None
    };
    let obj = json::encode(&object).unwrap();
    println!("{}", obj);
    let decoded: TestStruct = json::decode(&obj).unwrap();
    assert!(decoded == object);
}
It would be nice to implement a custom #[derive] macro which does this automatically for Option fields, as this would eliminate the need for such custom implementations of Encodable.