How to deserialize csv based on line format [closed] - csv

I have a csv without headers that can have lines in these three following formats:
char,int,int,string,int
char,int,string
char
The first character defines the format and is one of the values A, B, or C, respectively. Does anyone know a way to deserialize each line based on its format?

Just keep it simple. You can always parse it manually.
use std::io::{self, BufRead, Error, ErrorKind};

pub enum CsvLine {
    A(i32, i32, String, i32),
    B(i32, String),
    C,
}

pub fn read_lines<R: BufRead>(reader: &mut R) -> io::Result<Vec<CsvLine>> {
    let mut lines = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        // Split line by commas
        let items: Vec<&str> = trimmed.split(',').collect();
        match items[0] {
            "A" => {
                lines.push(CsvLine::A(
                    items[1].parse::<i32>().map_err(|e| Error::new(ErrorKind::Other, e))?,
                    items[2].parse::<i32>().map_err(|e| Error::new(ErrorKind::Other, e))?,
                    items[3].to_string(),
                    items[4].parse::<i32>().map_err(|e| Error::new(ErrorKind::Other, e))?,
                ));
            }
            "B" => {
                lines.push(CsvLine::B(
                    items[1].parse::<i32>().map_err(|e| Error::new(ErrorKind::Other, e))?,
                    items[2].to_string(),
                ));
            }
            "C" => lines.push(CsvLine::C),
            x => panic!("Unexpected string {:?} in first column!", x),
        }
    }
    Ok(lines)
}
Calling this function would look something like this:
use std::fs::File;
use std::io::BufReader;
let file = File::open("path/to/data.csv").unwrap();
let mut reader = BufReader::new(file);
let lines: Vec<CsvLine> = read_lines(&mut reader).unwrap();
But you may want to keep in mind that I didn't bother to handle a couple edge cases. It may panic if there are not enough items to satisfy the requirements and it makes no attempt to parse more complex strings. For example, "\"quoted strings\"" and "\"string, with, commas\"" would likely cause issues.
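The "not enough items" panic mentioned above can be avoided by matching on the whole slice of fields instead of indexing into it. The following is only a minimal sketch: `parse_line` and `parse_int` are hypothetical helper names, errors are plain `String`s for brevity, and quoted fields are still not handled.

```rust
#[derive(Debug, PartialEq)]
pub enum CsvLine {
    A(i32, i32, String, i32),
    B(i32, String),
    C,
}

// Hypothetical helper: parse one integer field, reporting the offending text.
fn parse_int(field: &str) -> Result<i32, String> {
    field
        .trim()
        .parse::<i32>()
        .map_err(|e| format!("invalid integer {:?}: {}", field, e))
}

// Hypothetical helper: parse a single line without ever indexing out of bounds.
// A line only matches when the tag AND the number of fields are both right.
pub fn parse_line(line: &str) -> Result<CsvLine, String> {
    let items: Vec<&str> = line.trim().split(',').collect();
    match items.as_slice() {
        ["A", a, b, s, c] => Ok(CsvLine::A(
            parse_int(a)?,
            parse_int(b)?,
            s.to_string(),
            parse_int(c)?,
        )),
        ["B", a, s] => Ok(CsvLine::B(parse_int(a)?, s.to_string())),
        ["C"] => Ok(CsvLine::C),
        // Unknown tag, wrong field count, or an empty line.
        other => Err(format!("unrecognized line: {:?}", other)),
    }
}
```

With this shape, a short line such as `A,1,2` yields an `Err` instead of a panic, and the caller decides whether to skip the line or abort.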

Related

Issue printing header using Rust's CSV crate

Here is my setup:
I am reading a csv file, the path to which is passed into the built exe as an argument, and I am using the crate Clap for it.
It all reads the file with no problem, but I am having trouble printing the headers.
I'd like to be able to print the headers without the quotes, but when I print it, only the first header/column gets printed without them, and the remaining ones do not.
Here's what I mean:
This is the part of the code that prints the header:
let mut rdr = csv::Reader::from_path(file)?;
let column_names = rdr.headers();
println!("{}", match column_names {
Ok(v) => v.as_slice(),
Err(_) => "Error!"
});
With this, this is what the output is:
warning: `csv_reader` (bin "csv_reader") generated 2 warnings
Finished release [optimized] target(s) in 0.13s
Running `target\release\csv_reader.exe -f C:\nkhl\Projects\dataset\hw_25000.csv`
Index "Height(Inches)" "Weight(Pounds)"
()
As you can see, Index does not get printed with the quotes, which is how I'd like the others to be printed. Printing with Debug marker enabled, I get this:
let mut rdr = csv::Reader::from_path(file)?;
let column_names = rdr.headers();
println!("{:?}", match column_names {
Ok(v) => v.as_slice(),
Err(_) => "Error!"
});
warning: `csv_reader` (bin "csv_reader") generated 2 warnings
Finished release [optimized] target(s) in 1.92s
Running `target\release\csv_reader.exe -f C:\nkhl\Projects\dataset\hw_25000.csv`
"Index \"Height(Inches)\" \"Weight(Pounds)\""
()
The CSV can be found here: https://people.sc.fsu.edu/~jburkardt/data/csv/hw_25000.csv
This is how it looks:
"Index", "Height(Inches)", "Weight(Pounds)"
1, 65.78331, 112.9925
2, 71.51521, 136.4873
3, 69.39874, 153.0269
I hope I am doing something utterly silly, but for the life of me, I am unable to figure it out.
Your CSV data contains extraneous spaces after the commas. Because of that, Rust's csv crate treats the quotes around Height(Inches) as part of the header text rather than as quoting characters.
Unfortunately the lack of standardization around csv makes both interpretations valid.
You can use trim to get rid of the extra spaces:
let data: &[u8] = include_bytes!("file.csv");
let mut rdr = csv::ReaderBuilder::new().trim(csv::Trim::All).from_reader(data);
But csv does the unquoting before it applies the trim, so this still leaves you with the same problem.
You can additionally disable quoting to at least get the same behaviour on all columns:
let mut rdr = csv::ReaderBuilder::new().quoting(false).trim(csv::Trim::All).from_reader(data);
If you somehow can remove the spaces from your csv file it works just fine:
fn main() {
let data: &[u8] = br#""Index","Height(Inches)","Weight(Pounds)"
1,65.78331,112.9925
2,71.51521,136.4873
3,69.39874,153.0269"#;
let mut rdr = csv::Reader::from_reader(data);
let hd = rdr.headers().unwrap();
println!("{}", hd.as_slice());
// prints `IndexHeight(Inches)Weight(Pounds)` without any `"`
}
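If editing the file by hand is not practical, the spaces could also be stripped in code before the data reaches the csv reader. The sketch below uses only the standard library; `strip_spaces_after_commas` is a hypothetical helper name, and it assumes that a double quote is the only quoting character and that whitespace directly after an unquoted comma is never significant.

```rust
// Hypothetical helper: drop spaces and tabs that directly follow a comma,
// but only outside of quoted fields, so quoted content is left untouched.
fn strip_spaces_after_commas(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_quotes = false;
    let mut after_comma = false;
    for ch in input.chars() {
        match ch {
            '"' => {
                // Entering or leaving a quoted field ("" toggles twice).
                in_quotes = !in_quotes;
                after_comma = false;
                out.push(ch);
            }
            ',' if !in_quotes => {
                after_comma = true;
                out.push(ch);
            }
            // Skip padding right after a delimiter.
            ' ' | '\t' if after_comma && !in_quotes => {}
            _ => {
                after_comma = false;
                out.push(ch);
            }
        }
    }
    out
}
```

The cleaned string can then be handed to the reader as bytes, for example with `csv::Reader::from_reader(cleaned.as_bytes())`.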

Rust: Read dataframe in polars from mysql

Problem
How to read a dataframe in polars from mysql.
Docs are silent on the issue. Currently probably there is only support for parquet, json, ipc, etc, and no direct support for sql as mentioned here.
Regardless, what would be an appropriate way to read in the data using libraries like sqlx or mysql?
Current Approach
Currently I am following this approach as provided in this answer:
1. Read in a Vec<Struct> using sqlx
2. Convert it into a tuple of vecs (Vec<T>, Vec<T>) using the code below
3. Convert (Vec<T>, Vec<T>) into (Series, Series)
4. Create a dataframe using DataFrame::new(vec![s0, s1]), where s0 and s1 are Series
struct A(u8, i8);
fn main() {
let v = vec![A(1, 4), A(2, 6), A(3, 5)];
let result = v.into_iter()
.fold((vec![], vec![]), |(mut u, mut i), item| {
u.push(item.0);
i.push(item.1);
(u, i)
});
dbg!(result);
// `result` is just a tuple of vectors
// let (unsigneds, signeds): (Vec<u8>, Vec<i8>) = result;
}
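The same row-to-column split can be written with the standard library's `Iterator::unzip`, which avoids the manual `fold`. This is only a sketch using the two-field struct from the snippet above; `split_columns` is a hypothetical name.

```rust
struct A(u8, i8);

// Hypothetical helper: turn a vector of rows into one vector per column.
fn split_columns(rows: Vec<A>) -> (Vec<u8>, Vec<i8>) {
    rows.into_iter()
        // Map each row to a pair so that `unzip` can collect both sides.
        .map(|row| (row.0, row.1))
        .unzip()
}
```

`unzip` only works on pairs, so a struct with more than two fields would need nested tuples or the `fold` approach shown above.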
Would this help?
let schema = Arc::new(Schema::new(vec![
Field::new("country", DataType::Int64, false),
Field::new("count", DataType::Int64, false),
]));
let datas = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int64Array::from(vec![1, 1, 2])),
Arc::new(Int64Array::from(vec![1, 2, 3])),
],
)
.unwrap();
let mut df = DataFrame::try_from(datas)?;
Same answer as in this question, seems quite duplicate IMO.
You could use the builders for that or collect from iterators. Collecting from iterators is often fast, but in this case it requires you to loop the Vec<Country> twice, so you should benchmark.
Below is an example function for both the solutions shown.
use polars::prelude::*;
struct Country {
country: String,
count: i64,
}
fn example_1(values: &[Country]) -> (Series, Series) {
let ca_country: Utf8Chunked = values.iter().map(|v| &*v.country).collect();
let ca_count: NoNull<Int64Chunked> = values.iter().map(|v| v.count).collect();
let mut s_country: Series = ca_country.into();
let mut s_count: Series = ca_count.into_inner().into();
s_country.rename("country");
s_count.rename("count");
(s_count, s_country)
}
fn example_2(values: &[Country]) -> (Series, Series) {
let mut country_builder = Utf8ChunkedBuilder::new("country", values.len(), values.len() * 5);
let mut count_builder = PrimitiveChunkedBuilder::<Int64Type>::new("count", values.len());
values.iter().for_each(|v| {
country_builder.append_value(&v.country);
count_builder.append_value(v.count)
});
(
count_builder.finish().into(),
country_builder.finish().into(),
)
}
Once you've got the Series, you can use DataFrame::new(columns) where columns: Vec<Series> to create a DataFrame.
Btw, if you want maximum performance, I really recommend connector-x. It has got polars and arrow integration and has got insane performance.

Is it possible to parse a text file using Rust's csv crate?

I have a text file with multiple lines. Is it possible to use Rust's csv crate to parse it such that each line is parsed into a different record?
I've tried specifying b'\n' as the field delimiter and left the record terminator as the default. The issue I'm having is that lines can sometimes end with \r\n and sometimes with just \n.
This however raises the UnequalLengths error unless the flexible option is specified because apparently new lines take precedence over field delimiters, so the code below:
use csv::{ByteRecord, Reader as CsvReader, ReaderBuilder, Terminator};
fn main() {
let data = "foo,foo2\r\nbar,bar2\nbaz\r\n";
let mut reader = ReaderBuilder::new()
.delimiter(b'\n')
.has_headers(false)
.flexible(true)
.from_reader(data.as_bytes());
let mut record = ByteRecord::new();
loop {
match reader.read_byte_record(&mut record) {
Ok(true) => {},
Ok(false) => { break },
Err(csv_error) => {
println!("{}", csv_error);
break;
}
}
println!("fields: {}", record.len());
for field in record.iter() {
println!("{:?}", ::std::str::from_utf8(&field))
}
}
}
Will print:
fields: 1
Ok("foo,foo2")
fields: 2
Ok("bar,bar2")
Ok("baz")
I would like the string to be parsed into 3 records with one field each, so the expected output would be:
fields: 1
Ok("foo,foo2")
fields: 1
Ok("bar,bar2")
fields: 1
Ok("baz")
Is it possible to tweak the CSV reader somehow to obtain that behavior?
Conceptually I'd like the field delimiter to be None, but it seems that the delimiter must be a single u8 value.
I guess I'll re-post my comment as the answer. More succinctly, as the author of the csv crate, I'd say the answer to your question is "no."
Firstly, it's not clear to me why you're trying to use a csv parser for this task at all. As the comments indicate, it's likely that your question is under-specified. Nevertheless, it seems more prudent to just write your own parser.
Secondly, setting both the delimiter and the terminator to the same thing is probably a condition in which the csv reader should panic or return an error. It doesn't really make sense from the perspective of the parser, and its behavior is likely unspecified.
Finally, it seems to me like your desired output indicates that you should just iterate over the lines in your input. It should give you exactly the output you want, as it handles both \n and \r\n.
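A minimal sketch of that last suggestion, using only the standard library: `str::lines` splits on both `\n` and `\r\n` and does not yield a trailing empty line. The name `split_records` is hypothetical.

```rust
// Hypothetical helper: treat every line of the input as one record
// with a single field, regardless of whether it ends in \n or \r\n.
fn split_records(input: &str) -> Vec<&str> {
    input.lines().collect()
}
```

For the sample data in the question, `split_records("foo,foo2\r\nbar,bar2\nbaz\r\n")` returns the three records `foo,foo2`, `bar,bar2`, and `baz`. When reading from a file, `BufRead::lines` gives the same line handling on a buffered reader.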

Replace Quotation in List of Lists R

I am trying to get a JSON response from an API:
test <- GET(url, add_headers(`api_key` = key))
content(test, 'parsed')
When I run content(test, 'parsed'), I get the following error:
# Error: lexical error: invalid string in json text. .Note: Final passage of the "fiscal cliff bill" on January 1
I think this is because of the double quotations. How can I either replace the double quotes or if this is not the problem, how can I fix this issue?
Thanks!
So I had run into a similar problem before, and I had intended to write a quick function to use Jeroen's fix to try to repair the JSON. Since I intended to do it anyway, here's a quick hack attempt.
NB: repairing a structured format like this is speculative at best and most certainly prone to errors. The good news is that I tried to keep this specific enough so that it will not produce false results: it'll either fix what it knows it can, or fail. The "unit-testing" really needs to check other corner-cases. If you find something that this does not fix (and should) or that this breaks (gasp!), please comment!
fix_json_quotes <- function(s) {
if (length(s) != 1) {
warning("the argument has length > 1 and only the first element will be used")
s <- s[[1]]
}
stopifnot(is.character(s))
val <- jsonlite::validate(s)
while (! val) {
ind <- attr(val, "offset") - 1
snew <- gsub("(.*)(['\"])([[:space:],]*)$", "\\1\\\\\\2\\3", substr(s, 1, ind))
if (snew != substr(s, 1, ind)) {
s <- paste0(snew, substr(s, ind + 1, nchar(s)))
} else {
break
}
val <- jsonlite::validate(s)
}
if (! val) {
# still not validating
stop("unable to fix quotes")
}
return(s)
}
Some sample data, unit-testing if you will (testthat is not required for use of the function):
library(testthat)
lst <- list(a="final \"cliff bill\" on")
json <- as.character(jsonlite::toJSON(lst))
json
# [1] "{\"a\":[\"final \\\"cliff bill\\\" on\"]}"
Okay, there should be no change:
expect_equal(json, fix_json_quotes(json))
Some bad data:
# un-escape the double quotes
badlst <- "{\"a\":[\"final \"cliff bill\" on\"]}"
expect_error(jsonlite::fromJSON(badlst))
expect_equal(json, fix_json_quotes(badlst))
PS: this looks specifically for double-quotes, nothing more. However, I believe that there are related errors that this might also be able to fix. I "left room" for this, in the second group within the regex (([\"])); for example, if single-quotes could also cause a problem, then the group could be changed to be ([\"']). I don't know if it's useful or even necessary.

With Boost.spirit, is there any way to pass additional argument to attribute constructor?

Maybe a noob question, I've a piece of code like this:
struct S {
S() {...}
S(int v) {
// ...
}
};
qi::rule<const char*, S(), boost::spirit::ascii::space_type> ip=qi::int_parser<S()>();
qi::rule<const char*, std::vector<S>(), boost::spirit::ascii::space_type> parser %= ip % ',';
...
Rules above can work, but the code breaks if S constructors require additional parameters, such as:
struct S {
S(T t) {...}
S(T t, int v) {
// ...
}
};
I've spent days looking for a solution, but no luck so far.
Can anyone help?
There is no direct way, but you can probably explicitly initialize things:
qi::rule<It, optional<S>(), Skipper> myrule;
myrule %=
qi::eps [ _val = phoenix::construct<S>(42) ] >>
int_parser<S()>;
However, since you are returning it from the int_parser, my intuition says that default-initialization should be appropriate (or perhaps the type S doesn't have a single, clear, responsibility?).
Edit
In response to the comment, it looks like you want this:
T someTvalue;
myrule = qi::int_
[ qi::_val = phx::construct<S>(someTvalue, qi::_1) ];
Or, if someTvalue is a variable outside the grammar constructor, and may change value during execution of the parser (and it lives long enough!), you could do
myrule = qi::int_
[ qi::_val = phx::construct<S>(phx::ref(someTvalue), qi::_1) ];
Hope that helps