Editing a .csv file with a batch file [closed] - csv

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
This is my first question on here. I work as a meteorologist and have some coding experience, though it is far from professionally taught. Basically, I have a .csv file from a weather station that gives me data that is too detailed (65.66 degrees and similar values). What I want to do is automate, via a script file, a way to open the .csv file and round off the values that are too detailed: take a temp from 65.66 to 66 (rounding up for anything above .5 and down for anything below), or take a pressure of 29.8889 and make it 29.89 using the same rounding rules. Is this possible? If so, how should I go about it? Again, keep in mind that my coding skills for batch files are not the strongest.
Any help would be much appreciated.
Thanks,

I agree with the comments above. Math in batch is limited to integers, and won't work well for the manipulations you want.
I'd use PowerShell. Besides easily handling floating point math, it also has built-in methods for objectifying CSV data (as well as XML and other types of structured data). Take the following hypothetical CSV data contained within weather.csv:
date,time,temp,pressure,wx
20160525,12:30,65.66,30.1288,GHCND:US1TNWS0001
20160525,13:00,67.42,30.3942,GHCND:US1TNWS0001
20160525,13:30,68.92,31.0187,GHCND:US1TNWS0001
20160525,14:00,70.23,30.4523,GHCND:US1TNWS0001
20160525,14:30,70.85,29.8889,GHCND:US1TNWS0001
20160525,15:00,69.87,28.7384,GHCND:US1TNWS0001
The first thing you want to do is import that data as an object (using import-csv), then round the numbers as desired: temp rounded to a whole number, and pressure rounded to a precision of 2 decimal places. Rounding to a whole number is easy. Just recast the data as an integer, and it will be rounded automatically. (One caveat: the [int] cast rounds exact halves to the nearest even number, so 70.5 becomes 70, not 71.) Rounding the pressure column is pretty easy as well if you invoke the .NET [math]::round() method.
# grab CSV data as a hierarchical object
$csv = import-csv weather.csv
# for each row of the CSV data...
$csv | foreach-object {
    # recast the "temp" property as an integer
    $_.temp = [int]$_.temp
    # round the "pressure" property to a precision of 2 decimal places
    $_.pressure = [math]::round($_.pressure, 2)
}
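For comparison, here is a minimal Python sketch of the same rounding step. It assumes you want exact halves to always round up, which is why it uses the decimal module with ROUND_HALF_UP instead of the built-in round(). The function names round_half_up and round_weather_csv are made up for this illustration, and the sample rows are taken from the hypothetical weather.csv above.

```python
import csv
import io
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value: str, places: int) -> str:
    """Round a numeric string to `places` decimals, with exact halves going up."""
    step = Decimal(1).scaleb(-places)  # places=2 -> Decimal("0.01")
    return str(Decimal(value).quantize(step, rounding=ROUND_HALF_UP))


def round_weather_csv(text: str) -> str:
    """Return CSV text with temp as a whole number and pressure at 2 decimals."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["temp"] = round_half_up(row["temp"], 0)
        row["pressure"] = round_half_up(row["pressure"], 2)
        writer.writerow(row)
    return out.getvalue()


# Two rows from the hypothetical weather.csv, held in memory for the demo
sample = (
    "date,time,temp,pressure,wx\n"
    "20160525,12:30,65.66,30.1288,GHCND:US1TNWS0001\n"
    "20160525,14:30,70.85,29.8889,GHCND:US1TNWS0001\n"
)
print(round_weather_csv(sample))
```

To run this against a real file, you would read the file's text in and write the returned text back out; the column names temp and pressure would need to match your station's headers.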
Now pretend you want to display the temperature, barometric pressure, and weather station name where "date" = 20160525 and "time" = 14:30.
$row = $csv | where-object { ($_.date -eq 20160525) -and ($_.time -eq "14:30") }
$row | select-object pressure,temp,wx | format-table
Assuming "pressure" started with a value of 29.8889 and "temp" had a value of 70.85, then the output would be:
pressure temp wx
-------- ---- --
29.89 71 GHCND:US1TNWS0001
If the CSV data had had multiple rows with the same date and time values (perhaps measurements from different weather stations), then the table would display with multiple rows.
And if you wanted to export that to a new csv file, just replace the format-table cmdlet with export-csv outfile.csv
$row | select-object pressure,temp,wx | export-csv outfile.csv
Handy as a pocket on a shirt, right?
Now, pretend you want to display the human-readable station names rather than NOAA's designations. Make a hash table.
$stations = @{
    "GHCND:US1TNWS0001" = "GRAY 1.5 E TN US"
    "GHCND:US1TNWS0003" = "GRAY 1.9 SSE TN US"
    "GHCND:US1TNWS0016" = "GRAY 1.3 S TN US"
    "GHCND:US1TNWS0018" = "JOHNSON CITY 5.9 NW TN US"
}
Now you can add a "station" property to your "row" object.
$row = $row | select *,"station"
$row.station = $stations[$row.wx]
And now if you do this:
$row | select-object pressure,temp,station | format-table
Your console shows this:
pressure temp station
-------- ---- -------
29.89 71 GRAY 1.5 E TN US
For extra credit, say you want to export this row data to JSON (for a web page or something). That's slightly more complicated, but not impossibly so.
add-type -AssemblyName System.Web.Extensions
$JSON = new-object Web.Script.Serialization.JavaScriptSerializer
# convert $row from a PSCustomObject to a more generic hash table
$obj = @{}
# the % sign in the next line is shorthand for "foreach-object"
$row.psobject.properties | %{
    $obj[$_.Name] = $_.Value
}
# Now, stringify the row and display the result
$JSON.Serialize($obj)
The output of that should be similar to this:
{"station":"GRAY 1.5 E TN US","wx":"GHCND:US1TNWS0001","temp":71,"date":"20160525","pressure":29.89,"time":"14:30"}
... and you can redirect it to a .json file by using > or pipe it into the out-file cmdlet.
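If you end up doing this step in Python instead, the station lookup and the JSON export might look roughly like the sketch below. It is only an illustration: the row is typed in by hand to match the 14:30 example, and falling back to the raw code for unknown stations is my own assumption, not something the PowerShell version does.

```python
import json

# Station code -> readable name (two entries copied from the hash table above)
stations = {
    "GHCND:US1TNWS0001": "GRAY 1.5 E TN US",
    "GHCND:US1TNWS0003": "GRAY 1.9 SSE TN US",
}

# One already-rounded row, typed in by hand to match the 14:30 example
row = {"date": "20160525", "time": "14:30", "temp": 71,
       "pressure": 29.89, "wx": "GHCND:US1TNWS0001"}

# Add the readable name; unknown stations keep their raw code (an assumption)
row["station"] = stations.get(row["wx"], row["wx"])

# Stringify the row; write `text` to a .json file if you want to keep it
text = json.dumps(row)
print(text)
```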

DOS batch scripting is, by far, not the best place to edit text files. However, it is possible. I will include sample, incomplete DOS batch code at the bottom of this post to demonstrate the point. I recommend you focus on Excel (no coding needed) or Python.
Excel - You don't need to code at all with Excel. Open the csv file. Let's say you have 66.667 in cell B12. In cell C12 enter a formula using the round function (code below). You can also teach yourself some Visual Basic for Applications, but for this simple task that is overkill. When done, if you save in csv format, you will lose your formulae and only have data. Consider saving as xlsx or xlsm.
Visual Basic Script - You can run VBScript on your machine with cscript.exe (or wscript.exe), which is part of Windows. But if using VBScript, you might as well use VBA in Excel. It is almost identical.
Python - A very high-level language with built-in libraries that make editing a csv file super easy. I recommend Anaconda (a Python suite) from continuum.io, but you can find the generic Python at python.org as well. Anaconda comes prepackaged with lots of helpful libraries. For csv editing, you will likely want to use the pandas library. You can find plenty of short videos on YouTube.
Excel
Say you have 66.667 in cell B12. Set the formula in C12 to...
"=ROUND(B12,0)" to round to integer
"=ROUND(B12,1)" to round to one decimal place
As you copy and paste, Excel will attempt to intelligently update the formulas for you.
Python
import pandas as pd
# load csv file to memory. Name your columns using "names=[]"
df = pd.read_csv("C:/temp/weather.csv", names=["city", "temperature", "date"])
# round the temperature column and store the result back in the frame
df["temperature"] = df["temperature"].round()
df.to_csv('newfile.csv', index=False)  # export to a new csv file
df.to_excel('newfile.xlsx', index=False)  # or export to an excel file instead (needs openpyxl)
DOS Batch
A batch script for this is much, much harder. I will not write the whole program, because it is not a great solution, but the rough sketch below will give you a taste. Compared to using Python or Excel, it is extremely complex.
Because I don't recommend this method, I didn't take the time to debug this code.
@echo off
setlocal ENABLEDELAYEDEXPANSION
rem prep our new file for output. Let's write the header row.
>newfile.csv echo col1,col2,col3
rem read the existing text file line by line
rem since it is csv, we will parse on comma
rem skip lines starting with semi-colon
for /F "eol=; tokens=1,2* delims=," %%I in (input_file.txt) do (
    set "col1=%%I"
    set "col2=%%J"
    set "col3=%%K"
    rem truncate col2 to 1 decimal place
    for /F "tokens=1,2 delims=." %%A in ("!col2!") do (
        set "integer=%%A"
        set "decimal=%%B"
        set "decimal=!decimal:~0,1!"
        rem or, you can use an if statement to round up or down
        rem Now, put the integer and decimal together again and
        rem redefine the value for col2.
        set "col2=!integer!.!decimal!"
    )
    rem write output to a new csv file
    rem a single redirect would overwrite newfile.csv on every pass.
    rem We don't want that, since we are in a loop.
    rem a double redirect appends to newfile.csv, perfect!
    >>newfile.csv echo !col1!,!col2!,!col3!
)
rem open csv file in default application
start "" newfile.csv

Related

How to split a large CSV file with no code?

I have a CSV file which has nearly 22M records. I want to split this into multiple CSV files so that I can use it further.
I tried to open it using Excel (I tried the Transform Data option as well), Notepad++, and Notepad, but all of them give me an error.
When I explored the options, I found that we can split the file using some coding methodologies like Java, Python, etc. I am not very familiar with coding and want to know if there is any option to split the file without using any coding process. Also, since the file has client-sensitive data, I don't want to download or use any external tools.
Any help would be much appreciated.
I know you're concerned about security of sensitive data, and that makes you want to avoid external tools (even a nominally trusted tool like Google Big Query... unless your data is medical in nature).
I know you don't want a custom solution w/Python, but I don't understand why that is—this is a big problem, and CSVs can be tricky to handle.
Maybe your CSV is a "simple one" where there are no embedded line breaks, and the quoting is minimal. But if it isn't, you're going to want a tool that's meant for CSV.
And because the file is so big, I don't see how you can do it without code. Even if you could load it into a trusted tool, how would you process the 22M records?
I look forward to seeing what else the community has to offer you.
The best I can think of based on my experience is exactly what you said you don't want.
It's a small-ish Python script that uses its CSV library to correctly read in your large file and write out several smaller files. If you don't trust this, or me, maybe find someone you do trust who can read this and assure you it won't compromise your sensitive data.
#!/usr/bin/env python3
import csv
MAX_ROWS = 22_000
# The name of your input
INPUT_CSV = 'big.csv'
# The "base name" of all new sub-CSVs, a counter will be added after the '-':
# e.g., new-1.csv, new-2.csv, etc...
NEW_BASE = 'new-'
# This function will be called over-and-over to make a new CSV file
def make_new_csv(i, header=None):
    # New name
    new_name = f'{NEW_BASE}{i}.csv'
    # Create a new file from that name
    new_f = open(new_name, 'w', newline='')
    # Creates a "writer", a dedicated object for writing "rows" of CSV data
    writer = csv.writer(new_f)
    if header:
        writer.writerow(header)
    return new_f, writer
# Open your input CSV
with open(INPUT_CSV, newline='') as in_f:
    # Like the "writer", dedicated to reading CSV data
    reader = csv.reader(in_f)
    your_header = next(reader)  # see note below about "header"
    # Give your new files unique, and sequential names: e.g., new-1.csv, new-2.csv, etc...
    new_i = 1
    # Make first new file and writer
    new_f, writer = make_new_csv(new_i, your_header)
    # Loop over all input rows, and count how many
    # records have been written for each "new file"
    new_rows = 0
    for row in reader:
        if new_rows == MAX_ROWS:
            new_f.close()  # This file is full, close it and...
            new_i += 1
            new_f, writer = make_new_csv(new_i, your_header)  # get a new file and writer
            new_rows = 0  # Reset row counter
        writer.writerow(row)
        new_rows += 1
# All done reading input rows, close last file
new_f.close()
There's also a fantastic tool I use daily for processing large CSVs, also with sensitive client contact and personally identifying information, GoCSV.
Its split command is exactly what you need:
Split a CSV into multiple files.
Usage:
gocsv split --max-rows N [--filename-base FILENAME] FILE
I'd recommend downloading it for your platform, unzipping it, putting a sample file with non-sensitive information in that folder and trying it out:
gocsv split --max-rows 1000 --filename-base New sample.csv
would end up creating a number of smaller CSVs, New-1.csv, New-2.csv, etc..., each with a header and no more than 1000 rows.

Replace headers on export-csv from selectable list using powershell

I'm fairly new to PowerShell and I have given myself a bit of a challenge which I think should be possible; I'm just not sure about the best way around it.
We have a user who has a large number of columns in a CSV (can vary from 20-50); rows can vary between 1 and 10,000. The data is, say, ClientName,Address1,Address2,Postcode etc. (although these can vary wildly depending on the source of the data, which comes from external companies). This needs importing into a system using a pre-built routine which looks at the file and needs the database column headers as the CSV headers, so say ClientDisplay,Ad_Line1,Ad_Line2,PCode etc.
I was thinking along the lines of a generic PowerShell 'mapping' form which could read the headers from ExternalSource.csv and from either a DatabaseHeaders.csv or a direct SQL query lookup, and display them as columns in a form. You would then highlight one from each column and click a 'link' button. Once you have been through all the columns in ExternalSource.csv, a 'generate csv' button would take the mapped headers and append the correct data columns from ExternalSource.csv.
Am I barking up the wrong tree completely trying to use PowerShell? At the moment it's a very time-consuming process, so I'm just trying to make life easier for users.
Any advice appreciated..
Thanks,
Jon
You can use the Select-Object cmdlet with dynamic columns to shape the data into the form you need.
Something like:
Import-Csv -Path 'source.csv' |
    Select-Object Id, Name, @{ Name='Ad_Line1'; Expression={ $_.Address1 } } |
    Export-Csv -Path 'target.csv'
In this example, the code @{ Name='Ad_Line1'; Expression={ $_.Address1 } } is a dynamic column (a calculated property) that creates a column named Ad_Line1 with the value of Address1.
It is possible to read the column mappings from a file, you will have to write some code to read the file, select the properties and create the format.
A very simple solution could be to read the Select-Object part from another script file, so you can differentiate that part for each import.
A (simple, naive, slow) solution could look like this (untested code):
# read input file ($input is a reserved automatic variable, so use another name)
$rows = Import-Csv -Path $inputFile
# read source, target name columns from mapping file
$mappings = Import-Csv -Path $mappingFile | Select Source, Target
# apply transformations
$transformed = $rows
foreach($mapping in $mappings) {
    # collect the data, add an extra column for each mapping
    $transformed = $transformed | Select-Object *, @{ Name = $mapping.Target; Expression = { $_.$($mapping.Source) } }
}
#export transformed data
$transformed | Export-Csv -Path $outputFile
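The mapping idea is not tied to PowerShell. As a rough sketch of the same approach in Python, assuming the mapping is held as a simple source-to-target dictionary (it could just as well be read from a mapping file), something like the following keeps only the mapped columns and writes them under the database header names. The function name rename_headers and the sample client row are invented for this illustration.

```python
import csv
import io


def rename_headers(source_text: str, mapping: dict) -> str:
    """Keep only the mapped columns and write them under their target names."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(mapping.values()), lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Build the output row by looking up each source column
        writer.writerow({target: row[source] for source, target in mapping.items()})
    return out.getvalue()


# Source header -> database header (names taken from the question)
mapping = {"ClientName": "ClientDisplay", "Address1": "Ad_Line1", "Postcode": "PCode"}
# Invented sample data standing in for ExternalSource.csv
source = "ClientName,Address1,Address2,Postcode\nAcme Ltd,1 High St,Unit 4,AB1 2CD\n"
result = rename_headers(source, mapping)
print(result)
```

Unlike the Select-Object loop, this version drops the unmapped source columns instead of carrying them along, which may or may not be what the import routine expects.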
Alternatively, it is possible to convert the data into XML with Import-Csv | Export-CliXml, apply an XSLT template to the XML to perform the transformation, and save the XML objects as CSV again with Import-CliXml | Export-Csv.
See this blog by Scott Hanselman on how you can use XSLT with PowerShell.

splitting CSV file by columns

I have a really huge CSV file. There are about 1700 columns and 40000 rows like below:
x,y,z,x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1700 more)...,x1700
0,0,0,a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1700 more)...,a1700
1,1,1,b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1700 more)...,b1700
// (about 40000 more rows below)
I need to split this CSV file into multiple files, each containing fewer columns, like:
# file1.csv
x,y,z
0,0,0
1,1,1
... (about 40000 more rows below)
# file2.csv
x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1000 more)...,x1000
a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1000 more)...,a1000
b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1000 more)...,b1000
// (about 40000 more rows below)
# file3.csv
x1001,x1002,x1003,x1004,x1005,...(about 700 more)...,x1700
a1001,a1002,a1003,a1004,a1005,...(about 700 more)...,a1700
b1001,b1002,b1003,b1004,b1005,...(about 700 more)...,b1700
// (about 40000 more rows below)
Is there any program or library doing this?
I've googled for it , but programs that I found only split a file by rows not by columns.
Or which language could I use to do this efficiently?
I can use R, shell script, Python, C/C++, Java
A one-line solution for your example data and desired output:
cut -d, -f -3 huge.csv > file1.csv
cut -d, -f 4-1003 huge.csv > file2.csv
cut -d, -f 1004- huge.csv > file3.csv
The cut program is available on most POSIX platforms and is part of GNU Core Utilities. There is also a Windows version.
Update in Python, since the OP asked for a program in an acceptable language:
# python 3
import csv
import fileinput
output_specifications = (  # csv file name, selector function
    ('file1.csv', slice(3)),
    ('file2.csv', slice(3, 1003)),
    ('file3.csv', slice(1003, 1703)),
)
output_row_writers = [
    (
        # text mode with newline='' is what the csv module expects in python 3
        csv.writer(open(file_name, 'w', newline=''), quoting=csv.QUOTE_MINIMAL).writerow,
        selector,
    ) for file_name, selector in output_specifications
]
reader = csv.reader(fileinput.input())
for row in reader:
    for row_writer, selector in output_row_writers:
        row_writer(row[selector])
This works with the sample data given and can be called with the input.csv as an argument or by piping from stdin.
Use a small Python script like:
fin = 'file_in.csv'
fout1 = 'file_out1.csv'
fout1_fd = open(fout1,'w')
...
lines = []
with open(fin) as fin_fd:
    lines = fin_fd.read().split('\n')
for l in lines:
    l_arr = l.split(',')
    fout1_fd.write(','.join(l_arr[0:3]))
    fout1_fd.write('\n')
    ...
...
fout1_fd.close()
...
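One caution with splitting each line on ',' by hand: it breaks as soon as a field contains a quoted comma. A hedged sketch of the same column split using the csv module is below. The function name split_columns and the tiny in-memory sample are made up for the illustration; the (start, stop) pairs play the role of the column ranges in the question.

```python
import csv
import io


def split_columns(text: str, ranges: list) -> list:
    """Split CSV text by column; returns one CSV string per (start, stop) range."""
    outputs = [io.StringIO() for _ in ranges]
    writers = [csv.writer(out, lineterminator="\n") for out in outputs]
    for row in csv.reader(io.StringIO(text)):
        # Send each slice of the row to its own writer
        for writer, (start, stop) in zip(writers, ranges):
            writer.writerow(row[start:stop])
    return [out.getvalue() for out in outputs]


# Tiny stand-in for the huge file; note the quoted comma in "a,1"
sample = 'x,y,z,x1,x2\n0,0,0,"a,1",a2\n1,1,1,b1,b2\n'
first, second = split_columns(sample, [(0, 3), (3, None)])
print(first)
print(second)
```

For the real file you would stream rows from disk and write to open files rather than strings, but the slicing logic would stay the same.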
You can open the file in Microsoft Excel, delete the extra columns, save as csv for file #1. Repeat the same procedure for the other 2 tables.
I usually use OpenOffice (or Microsoft Excel, if you are on Windows) to do that without writing any program: open the file, change it, and save it. The following two links show how to do that.
https://superuser.com/questions/407082/easiest-way-to-open-csv-with-commas-in-excel
http://office.microsoft.com/en-us/excel-help/import-or-export-text-txt-or-csv-files-HP010099725.aspx

How to merge multiple csv files into 1 SAS file

I just started using SAS 3 days ago and I need to merge ~50 csv files into 1 SAS dataset.
The 50 csv files have multiple variables with only 1 variable in common i.e. "region_id"
I've used SAS enterprise guide drag and drop functionalities to do this but it was too manual and took me half a day to upload and merge 47 csv files into 1 SAS file.
I was wondering whether anyone has a more intelligent way of doing this using base SAS?
Any advice and tips appreciated!
Thank you!
Example filenames:
2011Census_B01_AUST_short
2011Census_B02A_AUST_short
2011Census_B02B_AUST_short
2011Census_B03_AUST_short
.
.
2011Census_xx_AUST_short
I have more than 50 csv files to upload and merge.
The number and type of variables in the csv file varies in each csv file. However, all csv files have 1 common variable = "region_id"
Example variables:
region_id, Tot_P_M, Tot_P_F, Tot_P_P, Age_0_4_yr_F etc...
First, we'll need an automated way to import. The below simple macro takes the location of the file and the name of the file as inputs, and outputs a dataset to the work directory. (I'd use the concatenate function in Excel to create the SAS code 50 times). Also, we are sorting it to make the merge easier later.
%macro importcsv(location=,filename=);
    /* SAS dataset names cannot start with a digit, hence the underscore prefix */
    proc import datafile="&location./&filename..csv"
        out=_&filename.
        dbms=csv
        replace;
        getnames=yes;
    run;
    proc sort data=_&filename.; by region_id; run;
%mend;
%importcsv(location = C:/Desktop,filename = 2011Census_B01_AUST_short)
.
.
.
Then simply merge all of the data together again. I added ellipses simply because I didn't want to write it out 50 times.
data merged;
merge dataseta datasetb datasetc ... datasetax;
by region_id;
run;
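If it helps to see the logic of the merge step outside SAS, here is a rough Python sketch of a merge on region_id. It is not a SAS replacement, just an illustration under a few assumptions: every file has a region_id column, a region missing from a file gets empty strings for that file's columns, and rows are sorted by the key as text. The function name merge_on_key and the two tiny sample files are invented.

```python
import csv
import io


def merge_on_key(csv_texts: list, key: str = "region_id") -> list:
    """Merge several CSV texts into one list of rows, matched on `key`."""
    merged = {}      # key value -> combined row
    columns = [key]  # every column name seen so far, key first
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        columns += [name for name in reader.fieldnames if name not in columns]
        for row in reader:
            merged.setdefault(row[key], {}).update(row)
    # Fill columns a region never had with an empty string; sort by the key
    return [{name: merged[k].get(name, "") for name in columns} for k in sorted(merged)]


# Invented stand-ins for two of the census files
b01 = "region_id,Tot_P_M,Tot_P_F\n101,10,12\n102,20,21\n"
b02 = "region_id,Age_0_4_yr_F\n101,3\n103,5\n"
rows = merge_on_key([b01, b02])
for row in rows:
    print(row)
```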
Hope this helps.

Python 3 code to read CSV file, manipulate then create new file....works, but looking for improvements

This is my first ever post here. I am trying to learn a bit of Python. Using Python 3 and numpy.
I did a few tutorials, then decided to dive in and try a little project I might find useful at work, as that's a good way to learn for me.
I have written a program that reads in data from a CSV file which has a few rows of headers, I then want to extract certain columns from that file based on the header names, then output that back to a new csv file in a particular format.
The program I have works fine and does what I want, but as I'm a newbie I would like some tips as to how I can improve my code.
My main data file (csv) is about 57 columns long and about 36 rows deep so not big.
It works fine, but looking for advice & improvements.
import csv
import numpy as np
#make some arrays..at least I think thats what this does
A=[]
B=[]
keep_headers=[]
#open the main data csv file 'map.csv'...need to check what 'r' means
input_file = open('map.csv','r')
#read the contents of the file into 'data'
data=csv.reader(input_file, delimiter=',')
#skip the first 2 header rows as they are junk
next(data)
next(data)
#read in the next line as the 'header'
headers = next(data)
#Now read in the numeric data (float) from the main csv file 'map.csv'
A=np.genfromtxt('map.csv',delimiter=',',dtype='float',skiprows=5)
#Get the length of a column in A
Alen=len(A[:,0])
#now read the column header values I want to keep from 'keepheader.csv'
keep_headers=np.genfromtxt('keepheader.csv',delimiter=',',dtype='unicode_')
#Get the length of keep headers....i.e. how many headers I'm keeping.
head_len=len(keep_headers)
#Now loop round extracting all the columns with the keep header titles and
#append them to array B
i=0
while i < head_len:
    #use index to find the appropriate column number.
    item_num=headers.index(keep_headers[i])
    i=i+1
    #append the selected column to array B
    B=np.append(B,A[:,item_num])
#now reshape the B array
B=np.reshape(B,(head_len,36))
#now transpose it as thats the format I want.
B=np.transpose(B)
#save the array B back to a new csv file called 'cmap.csv'
np.savetxt('cmap.csv',B,fmt='%.3f',delimiter=",")
Thanks.
You can greatly simplify your code by using more of numpy's capabilities.
import numpy as np
A = np.loadtxt('stack.txt',skiprows=2,delimiter=',',dtype=str)
keep_headers=np.loadtxt('keepheader.csv',delimiter=',',dtype=str)
headers = A[0,:]
# boolean mask: True for every column whose header is in keep_headers
cols_to_keep = np.isin( headers, keep_headers )
B = A[1:,cols_to_keep].astype(float)
np.savetxt('cmap.csv',B,fmt='%.3f',delimiter=",")
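One thing to be aware of: a boolean mask keeps the columns in the order they appear in the file, not in the order listed in keepheader.csv, whereas the loop in the question followed the keep order. If that order matters, looking up each header's index is one way to preserve it. The sketch below uses only the standard library to stay self-contained; the function name select_columns, the junk_rows parameter, and the sample text are assumptions made for the example.

```python
import csv
import io


def select_columns(text: str, keep: list, junk_rows: int = 2) -> list:
    """Return the data rows as floats, with columns in the order given by `keep`."""
    rows = list(csv.reader(io.StringIO(text)))[junk_rows:]
    headers, data = rows[0], rows[1:]
    indexes = [headers.index(name) for name in keep]  # ValueError if a name is missing
    return [[float(row[i]) for i in indexes] for row in data]


# Two junk rows, a header row, then numeric data (invented sample)
sample = "junk\njunk\nt,u,v\n1.0,2.0,3.0\n4.0,5.0,6.0\n"
print(select_columns(sample, ["v", "t"]))
```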