How can I return a record from a CSV file using the byte position of a line?

I have a 172 MB assets.csv file with a million rows and 16 columns. I would like to read it using an offset in bytes/lines/records; in the code below I am using the byte value.
I have stored the required positions (from record.position().byte()) in assets_index.csv, and I would like to read a particular line in assets.csv using the saved offset.
I am able to get an output, but I feel there must be a better way to read from a CSV file based on byte position.
Please advise. I am new to programming and to Rust, and I have learned a lot from the tutorials.
The assets.csv is of this format:
asset_id,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation
1000001,2015,10000,2016,10000,2017,10000,2018,10000,2019,10000,2020,10000,2021,10000,2022,10000,2023,10000,2024,10000,2025,10000,2026,10000,2027,10000,2028,10000,2029,10000
I used another function to get the Position { byte: 172999933, line: 1000000, record: 999999 }.
The assets_index.csv is of this format:
asset_id,offset_inbytes
1999999,172999933
use std::{error::Error, io};

fn read_from_position() -> Result<(), Box<dyn Error>> {
    let asset_pos: u64 = 172_999_933;
    let file_path = "assets.csv";
    let mut rdr = csv::ReaderBuilder::new()
        .flexible(true)
        .from_path(file_path)?;
    let mut wtr = csv::Writer::from_writer(io::stdout());
    let mut record = csv::ByteRecord::new();
    while rdr.read_byte_record(&mut record)? {
        let pos = record.position().expect("position of record");
        if pos.byte() == asset_pos {
            wtr.write_record(&record)?;
            break;
        }
    }
    wtr.flush()?;
    Ok(())
}
$ time ./target/release/testcsv
1999999,2015,10000,2016,10000,2017,10000,2018,10000,2019,10000,2020,10000,2021,10000,2022,10000,2023,10000,2024,10000,2025,10000,2026,10000,2027,10000,2028,10000,2029,10000
Time elapsed in readcsv() is: 239.290125ms
./target/release/testcsv 0.22s user 0.02s system 99% cpu 0.245 total

Instead of using from_path you can use from_reader with a File, and seek in that file before creating the csv::Reader:
use std::{error::Error, fs, io::{self, Seek}};

fn read_from_position() -> Result<(), Box<dyn Error>> {
    let asset_pos: u64 = 0x116; // offset to the only record in this example
    let file_path = "assets.csv";
    let mut f = fs::File::open(file_path)?;
    f.seek(io::SeekFrom::Start(asset_pos))?;
    let mut rdr = csv::ReaderBuilder::new()
        .flexible(true)
        // edit: as noted by @BurntSushi5, we have to disable headers here.
        .has_headers(false)
        .from_reader(f);
    let mut wtr = csv::Writer::from_writer(io::stdout());
    let mut record = csv::ByteRecord::new();
    rdr.read_byte_record(&mut record)?;
    wtr.write_record(&record)?;
    wtr.flush()?;
    Ok(())
}
Then the first record read will be the one you're looking for.
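For completeness, here is a minimal sketch of how such an offset index could be generated in the first place with one sequential scan, assuming asset_id is always the first field (the file names match the question; everything else is illustrative):
use std::error::Error;

fn build_index() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_path("assets.csv")?;
    let mut wtr = csv::Writer::from_path("assets_index.csv")?;
    wtr.write_record(&["asset_id", "offset_inbytes"])?;
    let mut record = csv::ByteRecord::new();
    while rdr.read_byte_record(&mut record)? {
        // position() reports where this record starts in the file.
        let pos = record.position().expect("position of record").byte();
        let asset_id = String::from_utf8_lossy(&record[0]).to_string();
        wtr.write_record(&[asset_id, pos.to_string()])?;
    }
    wtr.flush()?;
    Ok(())
}
Paying for one full scan up front is what makes the later seek-based lookups cheap.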

Related

How to speed up Rust function to search through a large JSON file

I have a Rust function that searches through a large JSON file (about 1,080,000 lines). It currently takes about one second to search through the file; the data in this file mostly looks like this:
{"although":false,"radio":2056538449,"hide":1713884795,"hello":1222349560.787047,"brain":903780409.0046091,"heard":-1165604870.8374772}
How can I increase the performance of this function?
Here is my main.rs file.
use std::collections::VecDeque;
use std::fs::File;
use std::io::BufWriter;
use std::io::{BufRead, BufReader, Write};

fn search(filename: &str, search_line: &str) -> Result<VecDeque<u32>, std::io::Error> {
    let file = File::open(filename)?;
    let mut reader = BufReader::with_capacity(2048 * 2048, file);
    let mut line_numbers = VecDeque::new();
    let mut line_number = 0;
    let start = std::time::Instant::now();
    loop {
        line_number += 1;
        let mut line = String::new();
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break;
        }
        if line.trim() == search_line {
            line_numbers.push_back(line_number);
            println!(
                "Matching line found on line number {}: {}",
                line_number, line
            );
            break;
        }
    }
    let elapsed = start.elapsed();
    println!("Elapsed time: {:?}", elapsed);
    if line_numbers.is_empty() {
        println!("No lines found that match the given criteria");
    }
    Ok(line_numbers)
}

fn main() {
    let database = "Test.json";
    if let Err(e) = search(database, r#"{"08934":420696969}"#) {
        println!("Error reading file: {}", e);
    }
}
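One obvious cost in the loop above is allocating a fresh String for every line. A minimal sketch of the same search that reuses a single buffer instead (not benchmarked, so treat the speedup as an assumption):
use std::fs::File;
use std::io::{BufRead, BufReader};

fn search_fast(filename: &str, search_line: &str) -> std::io::Result<Option<u32>> {
    let file = File::open(filename)?;
    let mut reader = BufReader::with_capacity(1 << 20, file);
    let mut line = String::new();
    let mut line_number = 0;
    loop {
        line_number += 1;
        line.clear(); // reuse the allocation from the previous iteration
        if reader.read_line(&mut line)? == 0 {
            return Ok(None); // reached EOF without a match
        }
        if line.trim_end() == search_line {
            return Ok(Some(line_number));
        }
    }
}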

How to load an irregular CSV file using Rust

I would like to load the following CSV file, in which the first two lines follow a different layout from the third line onward.
test.csv
[YEAR],2022,[Q],1,
[TEST],mid-term,[GRADE],3,
FirstName,LastName,Score,
AA,aaa,97,
BB,bbbb,15,
CC,cccc,66,
DD,ddd,73,
EE,eeeee,42,
FF,fffff,52,
GG,ggg,64,
HH,h,86,
II,iii,88,
JJ,jjjj,72,
However, I get the following error, which I think is caused by the difference in layout between the two parts of the file. How do I correct this error and load the CSV file as I want?
error message
StringRecord(["[YEAR]", "2022", "[Q]", "1", ""])
StringRecord(["[TEST]", "mid-term", "[GRADE]", "3", ""])
Error: Error(UnequalLengths { pos: Some(Position { byte: 47, line: 2, record: 2 }), expected_len: 5, len: 4 })
error: process didn't exit successfully: `target\debug\read_csv.exe` (exit code: 1)
main.rs
use csv::Error;
use csv::ReaderBuilder;
use encoding_rs;
use std::fs;

fn main() -> Result<(), Error> {
    let path = "./test.csv";
    let file = fs::read(path).unwrap();
    let (res, _, _) = encoding_rs::SHIFT_JIS.decode(&file);
    let mut reader = ReaderBuilder::new()
        .has_headers(false)
        .from_reader(res.as_bytes());
    for result in reader.records() {
        let record = result?;
        println!("{:?}", record)
    }
    Ok(())
}
Version
cargo = "1.62.0"
rustc = "1.62.0"
csv = "1.1.6"
encoding_rs = "0.8.31"
I was able to correct this error by using the flexible method:
use csv::Error;
use csv::ReaderBuilder;
use encoding_rs;
use std::fs;

fn main() -> Result<(), Error> {
    let path = "./test.csv";
    let file = fs::read(path).unwrap();
    let (res, _, _) = encoding_rs::SHIFT_JIS.decode(&file);
    let mut reader = ReaderBuilder::new()
        .flexible(true) // added: allow records with different numbers of fields
        .has_headers(false)
        .from_reader(res.as_bytes());
    for result in reader.records() {
        let record = result?;
        println!("{:?}", record)
    }
    Ok(())
}
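If the goal is then to treat the first two lines as metadata and the remaining lines as a regular table, one possible follow-up (a sketch; the splitting logic and inline data are just illustrative) is to collect all records flexibly and slice them by index:
use csv::ReaderBuilder;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let data = "\
[YEAR],2022,[Q],1,
[TEST],mid-term,[GRADE],3,
FirstName,LastName,Score,
AA,aaa,97,
BB,bbbb,15,
";
    let mut reader = ReaderBuilder::new()
        .flexible(true)
        .has_headers(false)
        .from_reader(data.as_bytes());
    let records: Vec<csv::StringRecord> = reader.records().collect::<Result<_, _>>()?;
    // The first two records are metadata, the third is the header row,
    // and everything after that is score data.
    let (meta, rows) = records.split_at(2);
    println!("metadata: {:?}", meta);
    for row in &rows[1..] {
        println!("{} {} scored {}", &row[0], &row[1], &row[2]);
    }
    Ok(())
}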

Rust Read CSV without header

How does one read a CSV without a header in Rust? I've searched through the docs and gone through about 15 examples, each of which is subtly not what I'm looking for.
Consider how easy Python makes it:
csv.DictReader(f, fieldnames=['city'])
How do you do this in Rust?
Current attempt:
use std::fs::File;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct CityRow {
    city: String,
    pop: u32,
}

fn doit() -> zip::result::ZipResult<()> {
    let filename = "cities.csv";
    let mut zip = zip::ZipArchive::new(File::open(filename).unwrap())?;
    let file = zip.by_index(0).unwrap();
    println!("Filename: {}", file.name());
    let mut reader = csv::Reader::from_reader(Box::new(file));
    reader.set_headers(csv::StringRecord::from(vec!["city", "pop"]));
    for record in reader.records() {
        // let record: CityRow = record.unwrap();
        // let record = record?;
        println!("{:?}", record);
    }
    Ok(())
}
Use a ReaderBuilder, and call ReaderBuilder::has_headers to disable header parsing. You can then use StringRecord::deserialize to extract and print each record, skipping the first header row:
let mut reader = csv::ReaderBuilder::new()
    .has_headers(false)
    .from_reader(Box::new(file));
let headers = csv::StringRecord::from(vec!["city", "pop"]);
for record in reader.records().skip(1) {
    let record: CityRow = record.unwrap().deserialize(Some(&headers)).unwrap();
    println!("{:?}", record);
}
(playground)
@smitop's answer didn't totally make sense to me when looking at the underlying code, since the library appears to assume headers will exist by default. This means the below should actually work directly, and I found it did:
let mut reader = csv::Reader::from_reader(data.as_bytes());
for record in reader.deserialize() {
    let record: CityRow = record.unwrap();
    println!("{:?}", record);
}
I checked through the variants in this playground.
For what it's worth, it turned out in my case I had accidentally left a code path in that was reading my csv as a plain file, which is why I had seen headers read as a row. (Oops.)
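One more note on the answers above: when a file genuinely has no header row, records can also be deserialized positionally into tuples, so no field names are needed at all. A small self-contained sketch (the inline data is made up):
use csv::ReaderBuilder;

fn main() -> Result<(), csv::Error> {
    let data = "boston,617594\nchicago,2695598\n";
    let mut reader = ReaderBuilder::new()
        .has_headers(false) // treat every row as data
        .from_reader(data.as_bytes());
    for record in reader.deserialize() {
        // Tuples deserialize by position, so no header names are required.
        let (city, pop): (String, u32) = record?;
        println!("{}: {}", city, pop);
    }
    Ok(())
}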

Idiomatic way to parse TSV file (ASCII)

I need to parse files containing tab-separated numbers, and I know there will always be exactly two per line. Since my files can be as heavy as a few gigabytes, I wondered whether my current parsing method is correct. It looks like I could make the map faster, considering I have a fixed size, but I couldn't find out how.
use std::io::{self, prelude::*, BufReader};

type Record = (u32, u32);

fn read(content: &[u8]) -> io::Result<Vec<Record>> {
    Ok(BufReader::new(content)
        .lines()
        .map(|line| {
            let nums: Vec<u32> = line
                .unwrap()
                .split('\t')
                .map(|s| s.parse::<u32>().unwrap())
                .collect();
            (nums[0], nums[1])
        })
        .collect::<Vec<Record>>())
}

fn main() -> io::Result<()> {
    let content = "1\t1\n\
                   2\t2\n";
    let records = read(content.as_bytes())?;
    assert_eq!(records.len(), 2);
    assert_eq!(records[0], (1, 1));
    assert_eq!(records[1], (2, 2));
    Ok(())
}
Playground
If your entries are only numbers, then we can avoid the inner Vec allocation within the map like so:
use std::io::{self, prelude::*, BufReader};

type Record = (u32, u32);

fn read(content: &[u8]) -> io::Result<Vec<Record>> {
    Ok(BufReader::new(content)
        .lines()
        .map(|line| {
            let line = line.unwrap();
            let mut pair = line.split('\t').map(|s| s.parse::<u32>().unwrap());
            (pair.next().unwrap(), pair.next().unwrap())
        })
        .collect::<Vec<Record>>())
}

fn main() -> io::Result<()> {
    let content = "1\t1\n\
                   2\t2\n";
    let records = read(content.as_bytes())?;
    assert_eq!(records.len(), 2);
    assert_eq!(records[0], (1, 1));
    assert_eq!(records[1], (2, 2));
    Ok(())
}
You may want to add better error handling :)
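Following up on that suggestion, here is a sketch of what more robust error handling could look like: malformed input becomes an io::Error instead of a panic (the error messages are placeholders):
use std::io::{self, prelude::*, BufReader};

type Record = (u32, u32);

fn read(content: &[u8]) -> io::Result<Vec<Record>> {
    let bad = |msg: &str| io::Error::new(io::ErrorKind::InvalidData, msg.to_string());
    BufReader::new(content)
        .lines()
        .map(|line| {
            let line = line?;
            let mut pair = line
                .split('\t')
                .map(|s| s.parse::<u32>().map_err(|e| bad(&e.to_string())));
            // Report missing fields instead of panicking on unwrap.
            let a = pair.next().ok_or_else(|| bad("missing first field"))??;
            let b = pair.next().ok_or_else(|| bad("missing second field"))??;
            Ok((a, b))
        })
        .collect()
}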

How to append to an existing CSV file?

As an example, each time the code below is run, the previous test.csv file is overwritten with a new one. How can I append to test.csv instead of overwriting it?
extern crate csv;

use std::error::Error;
use std::process;

fn run() -> Result<(), Box<dyn Error>> {
    let file_path = std::path::Path::new("test.csv");
    let mut wtr = csv::Writer::from_path(file_path).unwrap();
    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
    wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
    wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
    wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;
    wtr.flush()?;
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
Will the append solution work if the file does not yet exist?
The csv crate provides Writer::from_writer, so you can use anything that implements Write. When using a File, this answer from "What is the best variant for appending a new line in a text file?" shows a solution:
Using OpenOptions::append is the clearest way to append to a file
use std::fs::OpenOptions;

let file = OpenOptions::new()
    .write(true)
    .append(true)
    .open("test.csv")
    .unwrap();
let mut wtr = csv::Writer::from_writer(file);
Will the append solution work if the file does not yet exist?
Just add create(true) to the OpenOptions:
let file = OpenOptions::new()
    .write(true)
    .create(true)
    .append(true)
    .open("test.csv")
    .unwrap();
let mut wtr = csv::Writer::from_writer(file);
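A caveat worth keeping in mind: the writer cannot tell that an existing file already has a header row, so writing the header unconditionally would duplicate it on every append. A small sketch of guarding against that (the append_row helper is made up for illustration):
use std::error::Error;
use std::fs::OpenOptions;
use std::path::Path;

// Hypothetical helper: append one record, writing the header only when
// the file is created for the first time.
fn append_row(path: &str, row: &[&str]) -> Result<(), Box<dyn Error>> {
    let needs_header = !Path::new(path).exists();
    let file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(path)?;
    let mut wtr = csv::Writer::from_writer(file);
    if needs_header {
        wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
    }
    wtr.write_record(row)?;
    wtr.flush()?;
    Ok(())
}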