How to load an irregular CSV file using Rust - csv

I would like to load the following CSV file, whose first two lines follow a different format from the lines after them.
test.csv
[YEAR],2022,[Q],1,
[TEST],mid-term,[GRADE],3,
FirstName,LastName,Score,
AA,aaa,97,
BB,bbbb,15,
CC,cccc,66,
DD,ddd,73,
EE,eeeee,42,
FF,fffff,52,
GG,ggg,64,
HH,h,86,
II,iii,88,
JJ,jjjj,72,
However, I get the following error, which I think is caused by the difference in format. How can I fix this error and load the CSV file the way I want?
error message
StringRecord(["[YEAR]", "2022", "[Q]", "1", ""])
StringRecord(["[TEST]", "mid-term", "[GRADE]", "3", ""])
Error: Error(UnequalLengths { pos: Some(Position { byte: 47, line: 2, record: 2 }), expected_len: 5, len: 4 })
error: process didn't exit successfully: `target\debug\read_csv.exe` (exit code: 1)
main.rs
use csv::Error;
use csv::ReaderBuilder;
use encoding_rs;
use std::fs;
fn main() -> Result<(), Error> {
let path = "./test.csv";
let file = fs::read(path).unwrap();
let (res, _, _) = encoding_rs::SHIFT_JIS.decode(&file);
let mut reader = ReaderBuilder::new()
.has_headers(false)
.from_reader(res.as_bytes());
for result in reader.records() {
let record = result?;
println!("{:?}", record)
}
Ok(())
}
Version
cargo = "1.62.0"
rustc = "1.62.0"
csv = "1.1.6"
encoding_rs = "0.8.31"

I was able to fix this error with the flexible method of ReaderBuilder, which lets records have different numbers of fields.
use csv::Error;
use csv::ReaderBuilder;
use encoding_rs;
use std::fs;
fn main() -> Result<(), Error> {
let path = "./test.csv";
let file = fs::read(path).unwrap();
let (res, _, _) = encoding_rs::SHIFT_JIS.decode(&file);
let mut reader = ReaderBuilder::new()
.flexible(true) // added: accept records with differing field counts
.has_headers(false)
.from_reader(res.as_bytes());
for result in reader.records() {
let record = result?;
println!("{:?}", record)
}
Ok(())
}
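With flexible(true) every line is accepted, but the first two rows are still metadata rather than table data. One possible way to keep them apart is to peel them off before the CSV reader sees the rest. The sketch below uses only the standard library; split_metadata and parse_metadata_line are hypothetical helper names, and it assumes the metadata lines contain no quoted fields or embedded commas.

```rust
use std::collections::HashMap;

/// Hypothetical helper: parses a metadata line such as `[YEAR],2022,[Q],1,`
/// into key/value pairs. Assumes no quoted fields or embedded commas.
fn parse_metadata_line(line: &str, out: &mut HashMap<String, String>) {
    // Drop the trailing comma so it does not produce an empty last field.
    let fields: Vec<&str> = line.trim_end().trim_end_matches(',').split(',').collect();
    for pair in fields.chunks(2) {
        if let [key, value] = pair {
            // `[YEAR]` becomes `YEAR`.
            let key = key.trim_start_matches('[').trim_end_matches(']');
            out.insert(key.to_string(), value.to_string());
        }
    }
}

/// Hypothetical helper: splits the decoded file into (metadata, table text).
/// The table text starts at the line after the first `metadata_lines` lines.
fn split_metadata(text: &str, metadata_lines: usize) -> (HashMap<String, String>, String) {
    let mut metadata = HashMap::new();
    let mut lines = text.lines();
    for line in lines.by_ref().take(metadata_lines) {
        parse_metadata_line(line, &mut metadata);
    }
    let table: Vec<&str> = lines.collect();
    (metadata, table.join("\n"))
}
```

The returned table text could then be passed to ReaderBuilder::new().from_reader(table.as_bytes()), leaving has_headers at its default so that the FirstName,LastName,Score line is treated as the header row.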

Related

How to speed up Rust function to search through a large JSON file

I have a Rust function that searches through a large JSON file (about 1,080,000 lines). It currently takes about 1 second to search the file. The data in the file mostly looks like this:
{"although":false,"radio":2056538449,"hide":1713884795,"hello":1222349560.787047,"brain":903780409.0046091,"heard":-1165604870.8374772}
How would I be able to increase the performance of this function?
Here is my Main.rs file.
use std::collections::VecDeque;
use std::fs::File;
use std::io::BufWriter;
use std::io::{BufRead, BufReader, Write};
fn search(filename: &str, search_line: &str) -> Result<VecDeque<u32>, std::io::Error> {
let file = File::open(filename)?;
let mut reader = BufReader::with_capacity(2048 * 2048, file);
let mut line_numbers = VecDeque::new();
let mut line_number = 0;
let start = std::time::Instant::now();
loop {
line_number += 1;
let mut line = String::new();
let n = reader.read_line(&mut line)?;
if n == 0 {
break;
}
if line.trim() == search_line {
line_numbers.push_back(line_number);
println!(
"Matching line found on line number {}: {}",
line_number, line
);
break;
}
}
let elapsed = start.elapsed();
println!("Elapsed time: {:?}", elapsed);
if line_numbers.is_empty() {
println!("No lines found that match the given criteria");
}
Ok(line_numbers)
}
fn main() {
let database = "Test.json";
if let Err(e) = search(database, r#"{"08934":420696969}"#) {
println!("Error reading file: {}", e);
}
}
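Two costs stand out in the loop above: a new String is allocated for every line, and read_line validates each line as UTF-8. Below is a minimal sketch of one way to avoid both, assuming a byte-for-byte match after trimming ASCII whitespace is good enough. The names find_line and trim_ascii_ws are hypothetical, and whether this is measurably faster on a given file would need to be benchmarked in a release build.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: returns the 1-based number of the first line that
/// equals `needle` after trimming ASCII whitespace, or None if none matches.
fn find_line<R: BufRead>(mut reader: R, needle: &str) -> io::Result<Option<u32>> {
    let needle = needle.as_bytes();
    let mut buf = Vec::new(); // one buffer, reused for every line
    let mut line_number = 0u32;
    loop {
        buf.clear();
        // read_until works on raw bytes, so no UTF-8 validation is done.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            return Ok(None); // end of input
        }
        line_number += 1;
        if trim_ascii_ws(&buf) == needle {
            return Ok(Some(line_number));
        }
    }
}

/// Strips leading and trailing ASCII whitespace from a byte slice.
fn trim_ascii_ws(mut bytes: &[u8]) -> &[u8] {
    while let [first, rest @ ..] = bytes {
        if first.is_ascii_whitespace() {
            bytes = rest;
        } else {
            break;
        }
    }
    while let [rest @ .., last] = bytes {
        if last.is_ascii_whitespace() {
            bytes = rest;
        } else {
            break;
        }
    }
    bytes
}
```

To use it with a file, the reader could be the same BufReader::with_capacity(2048 * 2048, file) as in the original function.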

How can I return a record from a CSV file using the byte position of line?

I have a 172 MB assets.csv file with a million rows and 16 columns. I would like to read a record from it by offset (byte, line, or record number). In the code below, I use the byte offset.
I have stored the required positions (the byte value of record.position()) in assets_index.csv, and I would like to read a particular line of assets.csv using the saved offset.
I am able to get the output, but I feel there must be a better way to read from a CSV file based on byte position.
Please advise. I am new to programming and to Rust, and have learned a lot from the tutorials.
The assets.csv is of this format:
asset_id,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation
1000001,2015,10000,2016,10000,2017,10000,2018,10000,2019,10000,2020,10000,2021,10000,2022,10000,2023,10000,2024,10000,2025,10000,2026,10000,2027,10000,2028,10000,2029,10000
I used another function to get the Position { byte: 172999933, line: 1000000, record: 999999 }.
The assets_index.csv is of this format:
asset_id,offset_inbytes
1999999,172999933
use std::{error::Error, io};
fn read_from_position() -> Result<(), Box<dyn Error>> {
let asset_pos = 172999933 as u64;
let file_path = "assets.csv";
let mut rdr = csv::ReaderBuilder::new()
.flexible(true)
.from_path(file_path)?;
let mut wtr = csv::Writer::from_writer(io::stdout());
let mut record = csv::ByteRecord::new();
while rdr.read_byte_record(&mut record)? {
let pos = &record.position().expect("position of record");
if pos.byte() == asset_pos
{
wtr.write_record(&record)?;
break;
}
}
wtr.flush()?;
Ok(())
}
$ time ./target/release/testcsv
1999999,2015,10000,2016,10000,2017,10000,2018,10000,2019,10000,2020,10000,2021,10000,2022,10000,2023,10000,2024,10000,2025,10000,2026,10000,2027,10000,2028,10000,2029,10000
Time elapsed in readcsv() is: 239.290125ms
./target/release/testcsv 0.22s user 0.02s system 99% cpu 0.245 total
Instead of using from_path, you can use from_reader with a File and seek in that file before creating the csv::Reader:
use std::{error::Error, fs, io::{self, Seek}};
fn read_from_position() -> Result<(), Box<dyn Error>> {
let asset_pos = 0x116 as u64; // offset to only record in example
let file_path = "assets.csv";
let mut f = fs::File::open(file_path)?;
f.seek(io::SeekFrom::Start(asset_pos))?;
let mut rdr = csv::ReaderBuilder::new()
.flexible(true)
// Edit: as noted by @BurntSushi5, headers have to be disabled here;
// otherwise the first record after the seek would be consumed as a header row.
.has_headers(false)
.from_reader(f);
let mut wtr = csv::Writer::from_writer(io::stdout());
let mut record = csv::ByteRecord::new();
rdr.read_byte_record(&mut record)?;
wtr.write_record(&record)?;
wtr.flush()?;
Ok(())
}
Then the first record read will be the one you're looking for.
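The same seek-then-read idea can be tried without the csv crate when only the raw line is needed. This is a minimal std-only sketch, assuming the stored offset points at the first byte of a record and that no field contains a quoted line break. The name line_at_offset is a hypothetical helper.

```rust
use std::io::{self, BufRead, BufReader, Read, Seek, SeekFrom};

/// Hypothetical helper: returns the raw line that starts at byte `offset`,
/// without its trailing line terminator.
fn line_at_offset<R: Read + Seek>(mut source: R, offset: u64) -> io::Result<String> {
    // Jump straight to the record instead of scanning from the start.
    source.seek(SeekFrom::Start(offset))?;
    let mut line = String::new();
    BufReader::new(source).read_line(&mut line)?;
    Ok(line
        .trim_end_matches(|c: char| c == '\r' || c == '\n')
        .to_string())
}
```

With a real file, the source would be fs::File::open("assets.csv")? and the offset would come from assets_index.csv. If fields can contain quoted line breaks, the csv-based version above is the safer choice.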

Parse json in rust with reqwest and serde_json

I am trying to retrieve and parse a JSON file using reqwest.
I used this question as a starting point but it doesn't work with my API.
The error:
Error: reqwest::Error { kind: Decode, source: Error("expected value", line: 1, column: 1) }
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let resp = reqwest::get("https://tse.ir/json/MarketWatch/data_7.json")
.await?
.json::<serde_json::Value>()
.await?;
println!("{:#?}", resp);
Ok(())
}
The API works fine with other languages. Thanks for your help.
Cargo.toml:
[package]
name = "rust_workspace"
version = "0.1.0"
edition = "2021"
[dependencies]
serde_json = "1.0"
serde = { version = "1.0", features = ["derive"] }
reqwest = { version = "0.11", features = ["json", "blocking"] }
tokio = { version = "1", features = ["full"] }
bytes = "1"
This error is often seen when the response does not carry the header
"content-type":"application/json"
Even if the body is valid JSON, you can still get this error.
To work around it, read the body as text and parse it with serde_json:
let text_response = reqwest::get("...").await?.text().await?;
let resp: serde_json::Value = serde_json::from_str(&text_response)?;
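The position in the error (line 1, column 1) means that serde_json rejected the very first character of the body, so it can help to check what the body actually starts with. Below is a small std-only diagnostic sketch. The name describe_body_start is a hypothetical helper, and the categories it reports are illustrative, not exhaustive.

```rust
/// Hypothetical diagnostic: gives a rough description of what a response
/// body starts with. The categories are illustrative, not exhaustive.
fn describe_body_start(body: &[u8]) -> &'static str {
    match body {
        [] => "empty body",
        // EF BB BF is the UTF-8 byte order mark.
        [0xEF, 0xBB, 0xBF, ..] => "UTF-8 byte order mark before the content",
        // 1F 8B is the gzip magic number.
        [0x1F, 0x8B, ..] => "gzip-compressed data",
        [b'<', ..] => "markup such as HTML or XML",
        // Characters that can begin a JSON document, plus leading whitespace.
        [b'{' | b'[' | b'"' | b'-' | b'0'..=b'9' | b't' | b'f' | b'n'
            | b' ' | b'\t' | b'\n' | b'\r', ..] => "plausibly JSON",
        _ => "something else",
    }
}
```

The bytes to inspect could be obtained with .bytes() on the reqwest response in place of .json().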

Data before the CSV header: how can I deal with it when using serde in Rust?

Before the CSV header (Time,Ampl), the file contains some 'invalid' data.
The CSV looks like this:
LECROYWS3024,13568,Waveform
Segments,1,SegmentSize,100002
Segment,TrigTime,TimeSinceSegment1
#1,01-Apr-2021 16:49:34,0
Time,Ampl
-2.510018e-005,0
-2.509968e-005,0
-2.509918e-005,0
-2.509868e-005,0
-2.509818e-005,0
...
When I build and run the executable, the following error occurs:
CSV deserialize error: record 1 (line: 1, byte: 29): missing field Time
How can I deal with the invalid data with serde or other crates? Thanks!
use std::error::Error;
use std::io;
use std::process;
use serde::Deserialize;
#[derive(Debug, Deserialize)]
struct Record {
Time: Option<f32>,
Ampl:Option<f32>,
}
...
fn example() -> Result<(), Box<dyn Error>> {
let mut rdr = csv::Reader::from_path("foo.csv")?;
for result in rdr.deserialize() {
let record: Record = result?;
let x0= match record.Time{
Some(x)=> x,
None=> 0.0,
};
...
}
Ok(())
}
fn main() {
if let Err(err) = example() {
println!("error running example: {}", err);
process::exit(1);
}
}
The csv crate provides a custom deserializer for this: csv::invalid_option.
You can apply it with a serde attribute on a field of your struct:
#[derive(Debug, Deserialize)]
struct Record {
Time: Option<f32>,
#[serde(deserialize_with = "csv::invalid_option")]
Ampl:Option<f32>,
}
This converts invalid data in that field to None.
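Separately from invalid_option, the lines that precede the Time,Ampl header can be skipped before the CSV reader is created. This is a minimal sketch using only the standard library. The name skip_to_header is a hypothetical helper, and it assumes the header line appears exactly as Time,Ampl.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: consumes lines from `reader` up to and including the
/// first line equal to `header` (ignoring trailing whitespace).
/// Returns Ok(true) if the header was found; the reader is then positioned
/// at the first data row. Returns Ok(false) if the input ended first.
fn skip_to_header<R: BufRead>(reader: &mut R, header: &str) -> io::Result<bool> {
    let mut line = String::new();
    loop {
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            return Ok(false); // end of input, header never seen
        }
        if line.trim_end() == header {
            return Ok(true);
        }
    }
}
```

After it returns true, the same reader could be passed to csv::ReaderBuilder::new().has_headers(false).from_reader(reader). With headers disabled, the csv crate should match the struct fields by position rather than by name, so the field order in Record would need to follow the column order.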

How to append to an existing CSV file?

As an example, each time the code below is run, the previous test.csv file is overwritten with a new one. How can I append to test.csv instead of overwriting it?
extern crate csv;
use std::error::Error;
use std::process;
fn run() -> Result<(), Box<dyn Error>> {
let file_path = std::path::Path::new("test.csv");
let mut wtr = csv::Writer::from_path(file_path).unwrap();
wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;
wtr.flush()?;
Ok(())
}
fn main() {
if let Err(err) = run() {
println!("{}", err);
process::exit(1);
}
}
Will the append solution work if the file does not yet exist?
The csv crate provides Writer::from_writer, so you can use anything that implements Write. For a File, this answer to "What is the best variant for appending a new line in a text file?" shows a solution:
Using OpenOptions::append is the clearest way to append to a file
let mut file = OpenOptions::new()
.write(true)
.append(true)
.open("test.csv")
.unwrap();
let mut wtr = csv::Writer::from_writer(file);
Will the append solution work if the file does not yet exist?
Just add create(true) to the OpenOptions:
let mut file = OpenOptions::new()
.write(true)
.create(true)
.append(true)
.open("test.csv")
.unwrap();
let mut wtr = csv::Writer::from_writer(file);
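When appending, the header row should normally be written only once. One possible way to decide this is to check whether the file is still empty right after opening it. The sketch below is std-only, and open_for_append is a hypothetical helper name.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

/// Hypothetical helper: opens `path` for appending, creating it if needed.
/// The returned flag is true when the file is empty, which suggests that
/// the header row has not been written yet.
fn open_for_append(path: &Path) -> io::Result<(File, bool)> {
    let file = OpenOptions::new()
        .create(true) // create the file if it does not exist
        .append(true) // every write goes to the end of the file
        .open(path)?;
    let is_empty = file.metadata()?.len() == 0;
    Ok((file, is_empty))
}
```

The returned File can be handed to csv::Writer::from_writer(file), and the header record written only when the flag is true.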