I've looked around and couldn't find an answer, so here is my question.
I'm looking for a way to automate converting the content of a CSV file into a different layout for machine learning purposes. A single line of the file looks like this:
0, 0, 0, -2.3145, 5.567...... 65, 65, 125, 70.
(516 columns)
And trying to change it to this:
0,
0,
-2.3145,
5.567
....
65,
65,
125,
70.
(516 rows)
So basically I'm transposing the data from horizontal to vertical (a single row to a single column).
This is easy to do in Excel, but I have 4000+ CSV files, so doing it by hand takes a lot of time.
On top of that, I have to take the first 512 rows and store them in a CSV file in one folder, and store the last 4 rows in a CSV file in another folder, with both files having the same name.
Eg:
features(folder)
1.CSV
2.CSV
.....
4000+.CSV
labels(folder)
1.CSV
2.CSV
.....
4000+.CSV
Any suggestions on how I can speed things up? I tried writing my own program but I'm stumped on changing the data from a row to a column. I've only managed to split the single CSV file into its 4000+ pieces.
EDIT:
I've tested this by reading the csv rows into a list and then writing slices of the list to new csv files. The code looks like this:
import csv
with open('FFTMIM16_512L1H1S0D0_1194.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)
print(your_list[0:512])
print(your_list[512:516])
print(your_list)
with open('test.csv', 'w', newline = '') as fa:
    writer = csv.writer(fa)
    writer.writerows(your_list[0:511])
with open('test1.csv', 'w', newline = '') as fb:
    writer = csv.writer(fb)
    writer.writerows(your_list[512:516])
It works, but I still need to run it in a loop. One thing I don't understand: if I save the values from 0 to 512 to test.csv, it shows 512 rows, but when I store 513 to 516 in test1.csv, it shows only three rows instead of the four I need. Changing the fb slice to 512:516 works, which doesn't make sense to me, because the value at 512 in test.csv is 0 while in test1.csv it is 69. Why is that? As I understand it, array indices start at 0 and run up to the position I specify. Or is that not the case in Python?
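For reference, Python slices are half-open: a[start:stop] includes index start and excludes index stop. A minimal sketch of what that means for the numbers in this question, using a stand-in list (the name values and its contents are made up for illustration):

```python
# Stand-in for the 516 rows read from the CSV (hypothetical data).
values = list(range(516))

features = values[0:512]      # indices 0..511   -> 512 items
labels = values[512:516]      # indices 512..515 -> 4 items
short = values[0:511]         # indices 0..510   -> only 511 items
off_by_one = values[513:516]  # indices 513..515 -> only 3 items

print(len(features), len(labels), len(short), len(off_by_one))
# prints: 512 4 511 3
```

So index 512 is the first item of the second slice, not the last item of the first one, and the two slices 0:512 and 512:516 neither overlap nor leave a gap.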
EDIT 2:
My new code is as follows:
import csv
import os
import glob
#import itertools
directory = input("INPUT FOLDER: ")
output1 = input("FEATURES FOLDER: ")
output2 = input("LABELS FOLDER: ")
in_files = os.path.join(directory, '*.csv')
for in_file in glob.glob(in_files):
    with open(in_file) as input_file:
        reader = csv.reader(input_file)
        your_list = list(reader)
    filename = os.path.splitext(os.path.basename(in_file))[0] + '.csv'
    with open(os.path.join(output1, filename), 'w', newline='') as output_file1:
        writer = csv.writer(output_file1)
        writer.writerow(your_list[0:512])
    with open(os.path.join(output2, filename), 'w', newline='' ) as output_file2:
        writer = csv.writer(output_file2)
        writer.writerow(your_list[512:516])
It produces the output I wanted, but now it also stores apostrophes and brackets, e.g. ['0.0'], ['2.321223']. How do I remove these?
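A likely cause of the brackets and apostrophes is the difference between writerow and writerows. The sketch below writes the same list of rows both ways into in-memory buffers so the two results can be compared; the sample values are taken from the question, and the buffer names wrong and right are made up for illustration:

```python
import csv
import io

# Rows as csv.reader returns them: each row is a list of strings.
rows = [['0.0'], ['2.321223'], ['65']]

# writerow treats its argument as ONE row, so each inner list becomes a
# single cell and is written as its str() form, brackets and quotes included.
wrong = io.StringIO()
csv.writer(wrong).writerow(rows)

# writerows treats its argument as a sequence of rows: one line per inner list.
right = io.StringIO()
csv.writer(right).writerows(rows)

print(repr(wrong.getvalue()))  # "['0.0'],['2.321223'],['65']\r\n"
print(repr(right.getvalue()))  # '0.0\r\n2.321223\r\n65\r\n'
```

If that is what is happening here, switching the two writer.writerow(...) calls to writer.writerows(...) should give one plain value per line.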
I don't understand why you can't do it programmatically if you already have your 4000+ pieces. Can't you just write every value on a new line?
In my opinion the easiest way, though not an automatic one, would be an editor like Notepad++.
There you can replace "," with "\r\n", or if you want to keep the commas, replace "," with ",\r\n".
If you want it automated, I don't see a way that doesn't involve programming.
By the way, if you use Python with numpy/scipy you can just use the .transpose() function.
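The row-to-column flip can also be sketched with only the standard library. Note that this swaps in zip(*rows) for the numpy .transpose() call mentioned above; the function name row_to_column and the sample line are assumptions for illustration:

```python
import csv
import io

def row_to_column(text):
    """Turn CSV text holding one wide row into a list of one-cell rows."""
    # skipinitialspace drops the blank that follows each comma in the input.
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    # zip(*rows) swaps rows and columns; with a single input row,
    # every output row holds exactly one value.
    return [list(col) for col in zip(*rows)]

column = row_to_column("0, 0, -2.3145, 5.567, 65, 125, 70\n")
print(column)
# [['0'], ['0'], ['-2.3145'], ['5.567'], ['65'], ['125'], ['70']]
```

The result is already in the shape csv.writer(...).writerows(...) expects, so it can be sliced into a features part and a labels part and written out directly.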
Edit in response to your comment:
What do you mean by "split from the first to the 512"? If you want parts of size 512, it would be something like:
new_array = []
temp_array = []
k = 0
for num in your_array:
    temp_array.append(num)
    k += 1
    if k % 512 == 0:
        new_array.append(temp_array)
        k = 0
        temp_array = []
# append the last block, which might not be 512 sized
if len(temp_array) > 0:
    new_array.append(temp_array)
# Save arrays
for i in range(len(new_array)):
    saveToCsv(array=new_array[i], name="csv_" + str(i))
Your new_array would now be an array filled with 512-sized arrays.
There might be mistakes here, as I did not test the code. To save, you only need a function saveToCsv(array, name) that writes an array to a file.
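The same chunking can be sketched more compactly with list slicing. Both function names below are hypothetical, and save_to_csv is only one guess at what the saving helper described above could look like (one value per row, written with the csv module):

```python
import csv

def chunk(values, size=512):
    """Split a list into consecutive blocks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def save_to_csv(block, name):
    """Hypothetical helper: write one value per row to '<name>.csv'."""
    with open(name + '.csv', 'w', newline='') as f:
        csv.writer(f).writerows([value] for value in block)

blocks = chunk(list(range(1030)))
print([len(b) for b in blocks])  # prints [512, 512, 6]
```

Saving would then be a loop such as: for i, block in enumerate(blocks): save_to_csv(block, "csv_" + str(i)).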
I have a csv file of 150500 rows and I want to split it into multiple files containing 500 rows (entries) each.
I'm using Jupyter and I know how to open and read the file. However, I don't know how to specify an output_path for the new files created by splitting the big one.
I found this code online, but since I don't know what my output_path should be, I don't know how to use it. Moreover, I don't understand how the input file is specified for this block of code.
import os
def split(filehandler, delimiter=',', row_limit=1000,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    import csv
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    current_out_writer = csv.writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)  # Python 3; reader.next() only exists in Python 2
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_writer = csv.writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)
My file is named DataSet2.csv and it is in the same folder as the ipynb notebook I'm running in Jupyter.
number_of_small_files = 301
lines_per_small_file = 500
largeFile = open('large.csv', 'r')
header = largeFile.readline()
for i in range(number_of_small_files):
    smallFile = open(str(i) + '_small.csv', 'w')
    smallFile.write(header) # This line copies the header to all small files
    for x in range(lines_per_small_file):
        line = largeFile.readline()
        smallFile.write(line)
    smallFile.close()
largeFile.close()
This will create many small files in the same directory. About 301 of them. They will be named from 0_small.csv to 300_small.csv.
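The loop above needs the number of output files to be known in advance. A minimal sketch of a variant that stops by itself when the input runs out, using itertools.islice; the function name split_csv and its parameters are assumptions, and like the code above it splits on physical lines, so it assumes no quoted field contains a line break:

```python
import itertools
import os

def split_csv(path, lines_per_file=500, out_dir='.'):
    """Split `path` into numbered files, repeating the header in each one."""
    written = []
    with open(path, 'r', newline='') as large:
        header = large.readline()
        for i in itertools.count():
            # islice pulls at most lines_per_file lines without reading ahead.
            chunk = list(itertools.islice(large, lines_per_file))
            if not chunk:
                break  # input exhausted, so no empty trailing file is created
            out_path = os.path.join(out_dir, f'{i}_small.csv')
            with open(out_path, 'w', newline='') as small:
                small.write(header)
                small.writelines(chunk)
            written.append(out_path)
    return written
```

For the file in the question this could be called as split_csv('DataSet2.csv', 500); the out_dir argument plays the role of the output path asked about, with '.' meaning the notebook's current working directory.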
Using standard unix utilities:
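tail -n +2 DataSet2.csv | split -l 500 --additional-suffix=.csv - output_
This pipeline strips the first line (the header) off the original file with 'tail -n +2' and then splits the rest into 500-line chunks, written to files whose names start with 'output_' and end with '.csv'. The lone '-' tells split to read from standard input; without it, split would treat 'output_' as the input file. The --additional-suffix option is specific to GNU split.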
I have a question regarding data post-processing in Octave or MATLAB.
I have files exported from Fluent like the one below:
"Surface Integral Report"
Mass-Weighted Average
Static Temperature (k)
crossplane-x-0.001 1242.9402
crossplane-x-0.025 1243.0017
crossplane-x-0.050 1243.2036
crossplane-x-0.075 1243.5321
crossplane-x-0.100 1243.9176
I want to use Octave/MATLAB for post-processing.
My idea is to read the file line by line and either save only the lines containing "crossplane-x-" to a new file or store the data from those lines directly in a matrix. Since I have many similar files, I could then make plots just by referring to their titles.
But I have trouble identifying the lines that contain the string "crossplane-x-". I am trying to do something like this:
clear, clean, clc;
% open a file and read line by line
fid = fopen ("h20H22_alongHGpath_temp.dat");
% save full lines into a new file if only chars inside
txtread = fgetl (fid)
num_of_lines = fskipl(fid, Inf);
char = 'crossplane-x-'
for i=1:num_of_lines,
  if char in fgetl(fid)
    [x, nx] = fscanf(fid);
    print x
  endif
endfor
fclose (fid);
Could anybody shed some light on this issue? Am I using the right functions? Thank you.
Here's a quick way for your specific file:
>> S = fileread("myfile.dat"); % collect file contents into string
>> C = strsplit(S, "crossplane-x-"); % first cell is the header, rest is data
>> M = str2num (strcat (C{2:end})) % concatenate datastrings, convert to numbers
M =
1.0000e-03 1.2429e+03
2.5000e-02 1.2430e+03
5.0000e-02 1.2432e+03
7.5000e-02 1.2435e+03
1.0000e-01 1.2439e+03
I have a function that takes a file as input, prints certain statistics, and also copies the file to a file name provided by the user. Here is my current code:
def copy_file(option):
    infile_name = input("Please enter the name of the file to copy: ")
    infile = open(infile_name, 'r')
    outfile_name = input("Please enter the name of the new copy: ")
    outfile = open(outfile_name, 'w')
    slist = infile.readlines()
    if option == 'statistics':
        for line in infile:
            outfile.write(line)
        infile.close()
        outfile.close()
        result = []
        blank_count = slist.count('\n')
        for item in slist:
            result.append(len(item))
        print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
            len(slist), blank_count, sum(result)/len(slist), (sum(result)-blank_count)/(len(slist)-blank_count)))
copy_file('statistics')
It prints the statistics of the file correctly; however, the copy it makes of the file is empty. If I remove the readlines() part and the statistics part, the function seems to make a copy of the file correctly. How can I correct my code so that it does both? It's a minor problem but I can't seem to get it.
The reason the file is blank is that
slist = infile.readlines()
is reading the entire contents of the file, so when it gets to
for line in infile:
there is nothing left to read and it just closes the newly truncated (mode w) file leaving you with a blank file.
I think the answer here is to change your for line in infile: to for line in slist:
def copy_file(option):
    infile_name= input("Please enter the name of the file to copy: ")
    infile = open(infile_name, 'r')
    outfile_name = input("Please enter the name of the new copy: ")
    outfile = open(outfile_name, 'w')
    slist = infile.readlines()
    if option == 'statistics':
        for line in slist:
            outfile.write(line)
        infile.close()
        outfile.close()
        result = []
        blank_count = slist.count('\n')
        for item in slist:
            result.append(len(item))
        print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
            len(slist), blank_count, sum(result)/len(slist), (sum(result)-blank_count)/(len(slist)-blank_count)))
copy_file('statistics')
Having said all that, consider whether it's worth using your own copy routine rather than shutil.copy. It is generally better to delegate the task to your OS, as it will be quicker and probably safer (thanks to NightShadeQueen for the reminder)!
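A minimal sketch of that suggestion, with the copy handed to shutil.copy and the statistics computed separately. The function name copy_with_stats and its return value are assumptions made for illustration, not part of the original code:

```python
import shutil

def copy_with_stats(src, dst):
    """Copy src to dst with shutil, then return line statistics for src."""
    shutil.copy(src, dst)  # the copy itself is delegated to the library
    with open(src, 'r') as infile:
        lines = infile.readlines()
    blank = lines.count('\n')                       # lines holding only a newline
    total_chars = sum(len(line) for line in lines)  # newlines are counted too
    return len(lines), blank, total_chars
```

Because the copy no longer depends on the file object used for the statistics, the order of reading and writing stops mattering.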
I am trying to create a function that takes a filename and returns a 2-tuple with the number of non-empty lines in that program and the sum of the lengths of all those lines. Here is my current program:
def code_metric(file):
    with open(file, 'r') as f:
        lines = len(list(filter(lambda x: x.strip(), f)))
        num_chars = sum(map(lambda l: len(re.sub('\s', '', l)), f))
    return(lines, num_chars)
The result I get if I do:
if __name__=="__main__":
    print(code_metric('cmtest.py'))
is
(3, 0)
when it should be:
(3,85)
Also, is there a better way of finding the sum of the line lengths using the functionals map, filter, and reduce? I did it for the first part but couldn't figure out the second half. I'm kind of new to Python, so any help would be great.
Here is the test file called cmtest.py:
import prompt,math
x = prompt.for_int('Enter x')
print(x,'!=',math.factorial(x),sep='')
First line has 18 characters (including white space)
Second line has 29 characters
Third line has 38 characters
[(1, 18), (1, 29), (1, 38)]
The total is 85 characters including whitespace. I apologize, I misread the problem. The length total for each line should include the whitespace as well.
A fairly simple approach is to build a generator to strip trailing whitespace, then enumerate over that (with a start value of 1) filtering out blank lines, and summing the length of each line in turn, eg:
def code_metric(filename):
    line_count = char_count = 0
    with open(filename) as fin:
        stripped = (line.rstrip() for line in fin)
        for line_count, line in enumerate(filter(None, stripped), 1):
            char_count += len(line)
    return line_count, char_count
print(code_metric('cmtest.py'))
# (3, 85)
To count the lines, maybe this code is cleaner:
with open(file) as f:
    lines = len(f.readlines())
For the second part of your program, if you intend to count only non-whitespace characters, then every whitespace character, including '\t' and '\n', has to be removed. In that case:
with open(file) as f:
    num_chars = len(re.sub(r'\s', '', f.read()))
Some people have advised you to do both things in one loop. That is fine, but if you keep them separate you can turn them into different functions, which makes them more reusable. Unless you are handling huge files (or executing this code millions of times), it shouldn't matter in terms of performance.
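Since the question asks specifically about map, filter, and reduce, here is a minimal sketch in that style. It assumes, as in the clarified problem statement, that trailing whitespace is dropped and everything else on a non-empty line is counted. The key difference from the original attempt is that the lines are read into a list once, because a file object can only be iterated a single time:

```python
from functools import reduce

def code_metric(filename):
    """Return (non-empty line count, total length of those lines)."""
    with open(filename) as f:
        # Materialise the stripped, non-empty lines once; a second pass over
        # the file object itself would find nothing left to read.
        lines = list(filter(None, map(str.rstrip, f)))
    num_chars = reduce(lambda total, line: total + len(line), lines, 0)
    return len(lines), num_chars
```

For a file containing the three lines of cmtest.py shown above (18, 29 and 38 characters), this returns (3, 85).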
I have a script that opens and modifies a text file containing personnel info and a lunch account balance. My script takes the text file, removes the quotes, and writes only the rows that contain the values D, F or R in column 8. It writes this filtered data to two files: a csv import file called lunchimport.csv for a separate program, and a temporary csv file to be used for further filtering. The second stage of the script uses the temporary csv file to generate two additional csv files. One file, negativebal.csv, contains only rows with a negative value in column 14. The other file, lowbal.csv, contains rows with a value between 0 and 5 in column 14.
My issue is that I can't get the script to filter "between" values properly. When I use the code below to write only rows with a column 14 value between 0 and 5, nothing is filtered out. If I use values between 0 and 1.99 it works. With any upper bound greater than 1.99, the code doesn't filter anything:
if row[13] > "0" and row[13] < "1.99":
    lowwriter.writerow([row[0], row[13]])
I have pasted my entire code below. I do use a lot of temp files to accomplish my tasks. There is probably a better way, but I'm just interested in getting my filters to work properly.
import os
import csv
infile = open("\\\\comalexsrv\\export\\update.txt", "r")
outfile1 = open("casttemp1.csv", "w")
infile2 = open("casttemp1.csv", "r")
outfile2 = open("casttemp2.csv", "w")
infile3 = open("casttemp2.csv", "r")
outfile3 = open("casttemp3.csv", "w")
infile4 = open("casttemp3.csv", "r")
inowcsv = open(r"F:\zbennett\Lunch_Imports\lunchimport.csv", "w")
negcastcsv = open("\\\\tcdc\\inow_transfer$\\negativebal.csv", "w")
lowcastcsv = open("\\\\tcdc\\inow_transfer$\\lowbal.csv", "w")
# Remove quotes in update.txt, write to outfile1(casttemp1.csv)
string = infile.read()
outfile1.write(string.replace("\"", ''))
# Open infile2(casttemp1.csv), write rows with D,F,R in column 8 to outfile2(casttemp2.csv)
# Open infile2(casttemp1.csv), write rows with D,F,R in column 8 to inowcsv(F:\zbennett\Lunch_Imports\lunchimport.csv)
# Open infile2(casttemp1.csv), write rows with D,R in column 8 to outfile3(casttemp3.csv)
tempwriter = csv.writer(outfile2, delimiter=',', lineterminator= '\n')
importwriter = csv.writer(inowcsv, delimiter=',', lineterminator= '\n')
lowtemp = csv.writer(outfile3, delimiter=',', lineterminator= '\n')
for row in csv.reader(infile2, delimiter=','):
    if row[7] == "D":
        tempwriter.writerow(row)
        importwriter.writerow(row)
        lowtemp.writerow(row)
    if row[7] == "F":
        tempwriter.writerow(row)
        importwriter.writerow(row)
    if row[7] == "R":
        tempwriter.writerow(row)
        importwriter.writerow(row)
        lowtemp.writerow(row)
# Open infile3(casttemp2.csv), write columns 1,14 for rows with less than 0 in column 14 to negcastcsv(\\tcdc\inow_transfer$\negativebal.csv)
negwriter = csv.writer(negcastcsv, delimiter=',', lineterminator= '\n')
for row in csv.reader(infile3, delimiter=','):
    if row[13] < "0":
        negwriter.writerow([row[0], row[13]])
# Open infile4(casttemp3.csv), write columns 1,14 for rows with column 14 greater than 0 and less than 1.99 to lowcastcsv(\\tcdc\inow_transfer$\lowbal.csv)
lowwriter = csv.writer(lowcastcsv, delimiter=',', lineterminator= '\n')
for row in csv.reader(infile4, delimiter=','):
    if row[13] > "0" and row[13] < "1.99":
        lowwriter.writerow([row[0], row[13]])
infile.close()
outfile1.close()
infile2.close()
outfile2.close()
inowcsv.close()
outfile3.close()
infile3.close()
infile4.close()
negcastcsv.close()
lowcastcsv.close()
# Delete the temporary casttemp files
os.remove("casttemp1.csv")
os.remove("casttemp2.csv")
os.remove("casttemp3.csv")
The comparison is being done on strings, when you probably want a numeric comparison:
if 0. < float(row[13]) < 1.99:
    lowwriter.writerow([row[0], row[13]])
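To illustrate the difference, here is a minimal sketch. The helper in_range is a made-up name, and treating blank or non-numeric cells as out of range is an assumption about how such rows should be handled:

```python
def in_range(value, low, high):
    """True when the text `value` is a number strictly between low and high."""
    try:
        number = float(value)
    except ValueError:
        return False  # blank or non-numeric cell: treated as out of range
    return low < number < high

# Strings are compared character by character, so "10.50" sorts before "5".
print("10.50" < "5")            # True  (lexicographic)
print(in_range("10.50", 0, 5))  # False (numeric)
print(in_range("3.25", 0, 5))   # True
```

With a helper like this, the filter in the script could read: if in_range(row[13], 0, 5): lowwriter.writerow([row[0], row[13]]).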