How to read fixed-width data? - csv

The data looks like:
212253820000025000.00000002500.00000000375.00111120211105202117
212456960000000750.00000000075.00000000011.25111120211102202117
212387470000010000.00000001000.00000000150.00111120211105202117
and I need to add separators, like:
21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17
The file is large, nearly 20,000 rows. Is there any way to do this?

This question is generally about reading "fixed width data".
If you're stuck with this data, you'll need to parse it line by line, then column by column. I'll show you how to do this with Python.
First off, the columns you counted off in the comment do not match your sample output; you seem to have omitted the last column, which is 2 characters wide.
You'll need accurate column widths to perform the task. I took your sample data and counted the columns for you and got these numbers:
8, 13, 12, 12, 8, 8, 2
So, we'll read the input data line by line, and for every line we'll:
- Read 8 chars and save them as a column, then 13 chars and save them as a column, then 12 chars, etc., till we've read all the specified column widths
- As we move through the line, keep track of our position with the variables beg and end, denoting where a column begins (inclusive) and where it ends (exclusive)
- Make the end of one column the beginning of the next, and so on down the line
- Store those columns in a list (array) that is the new row
- At the end of the line, save the new row to a list of all the rows
- Then repeat the process for the next line
Here's how this looks in Python:
import pprint

col_widths = [8, 13, 12, 12, 8, 8, 2]

all_rows = []
with open("data.txt") as in_file:
    for line in in_file:
        row = []
        beg = 0
        for width in col_widths:
            end = beg + width
            col = line[beg:end]
            row.append(col)
            beg = end
        all_rows.append(row)

pprint.pprint(all_rows, width=100)
all_rows is just a list of lists of text:
[['21225382', '0000025000.00', '000002500.00', '000000375.00', '11112021', '11052021', '17'],
['21245696', '0000000750.00', '000000075.00', '000000011.25', '11112021', '11022021', '17'],
['21238747', '0000010000.00', '000001000.00', '000000150.00', '11112021', '11052021', '17']]
With this approach, if you miscounted the column widths or the number of columns, you can easily modify col_widths to match your data.
From here we'll use Python's CSV module to make sure the CSV file is written correctly:
import csv

with open("data.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerows(all_rows)
and my data.csv file looks like:
21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17
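As an aside, if you happen to have pandas available, its read_fwf function can do the slicing and the CSV writing in a couple of lines. This is only a sketch using the widths counted above; dtype=str keeps the leading zeros intact:

import pandas as pd

# pandas parses the fixed-width file directly given the column widths.
# header=None because the data has no header row.
col_widths = [8, 13, 12, 12, 8, 8, 2]
df = pd.read_fwf("data.txt", widths=col_widths, header=None, dtype=str)
df.to_csv("data.csv", index=False, header=False)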

If you have access to the command-line tool awk, you can convert your data like this:
- substr() gives a portion of the string $0, which is the entire line
- you start at char 1 and take the width of your first column, 8
- for the next substr(), you again use $0, start at 9 (1 + 8 from the previous substr), and take the second column's width, 13
- repeat for each column, starting each substr() at "the start of the previous column plus the previous column's width"
#!/bin/sh
# Col_widths = [8, 13, 12, 12, 8, 8, 2]
awk '{print substr($0,1,8) "," substr($0,9,13) "," substr($0,22,12) "," substr($0,34,12) "," substr($0,46,8) "," substr($0,54,8) "," substr($0,62,2)}' data.txt > data.csv

Related

Gnuplot one-liner to generate a titled line for each row in a CSV file

I've been trying to figure out gnuplot but haven't been getting anywhere, for seemingly two reasons: my lack of understanding of gnuplot's set commands, and the layout of my data file. I've decided the best option is to ask for help.
The hope is to get this gnuplot command into a one-liner.
Example rows from my CSV data file (MyData.csv):
_TitleRow1_,15.21,15.21,...could be more, could be less
_TitleRow2_,16.27,16.27,101,55.12,...could be more, could be less
_TitleRow3_,16.19,16.19,20.8,...could be more, could be less
...(over 100 rows)
Each row of MyData.csv always has a string in the first column (the title), followed by an undetermined number of decimal values. (Each row gets appended to periodically, so the column count has to be treated as open-ended.)
What I'd like to happen is to generate a line graph showing a line for each row in the csv, using the first column as a row title, and the following numbers generating the actual line.
This is what I'm trying:
gnuplot -e 'set datafile separator ","; set key autotitle columnhead; plot "MyData.csv"'
Which results in:
set datafile separator ","; set key autotitle columnhead; plot "MyData.csv"
^
line 0: Bad data on line 2 of file MyData.csv
This looks like an amazing tool and I'm looking forward to learning more about it. Thanks in advance for any hints/assistance!
Your datafile format is very unfortunate for gnuplot, which prefers data in columns.
That said, you can also plot rows (not straightforward in gnuplot, but see an example here). This requires a strict matrix, and the problem with your data is the variable column count.
Strictly speaking, your CSV is not a "correct" CSV: a CSV should have the same number of columns in every row, i.e. a row with less data than the longest row should be padded with empty fields (,,,) as needed. That's basically what the script below does.
With this you can plot rows with the option matrix (check help matrix). You will get some warnings, warning: matrix contains missing or undefined values, which you can ignore.
Alternatively, you could transpose your data (with a variable column count this is maybe not straightforward; see the Python sketch below). Maybe there are external tools which can do it easily. With gnuplot only it will be a bit cumbersome, and you would first have to fill your shorter rows as in the example below.
Maybe there is a simpler and better gnuplot-only solution which I am currently not aware of.
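If an external tool is acceptable, the padding and transposing takes only a few lines of Python. Here is a sketch assuming the MyData.csv layout from the question; after transposing, each original row becomes an ordinary column, and the first transposed row carries the titles:

import csv

# Pad every row with blanks to the longest row's width, then transpose,
# so each original row becomes a column that gnuplot can plot directly.
with open("MyData.csv", newline="") as f:
    rows = list(csv.reader(f))

width = max(len(r) for r in rows)
padded = [r + [""] * (width - len(r)) for r in rows]

with open("MyData_T.csv", "w", newline="") as f:
    csv.writer(f).writerows(zip(*padded))

The transposed file could then be plotted column-wise, with the titles picked up by set key autotitle columnhead as in the question's one-liner.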
Data: SO73099645.dat
_TitleRow1_, 1.2, 1.3
_TitleRow2_, 2.2, 2.3, 2.4, 2.5
_TitleRow3_, 3.2, 3.3, 3.4
Script:
### plotting rows with variable columns
reset session
FILE = "SO73099645.dat"
getColumns(s) = (sum [i=1:strlen(s)] (s[i:i] eq ',') ? 1 : 0) + 1
set datafile separator "\t"
colCount = 0
myNaNs = myHeaders = ''
stats FILE u (rowCount=$0+1, c=getColumns(strcol(1)), c>colCount ? colCount=c : 0) nooutput
do for [i=1:colCount] {myNaNs=myNaNs.',NaN' }
set table $Data
plot FILE u (s=strcol(1),c=getColumns(s),s.myNaNs[1:(colCount-c)*4]) w table
unset table
set datafile separator ","
stats FILE u (myHeaders=sprintf('%s "%s"',myHeaders,strcol(1))) nooutput
myHeader(n) = word(myHeaders,n)
set key noenhanced
plot for [row=0:rowCount-1] $Data matrix u 1:3 every ::1:row::row w lp pt 7 ti myHeader(row+1)
### end of script
As "one-liner":
FILE = "SO73099645.dat"; getColumns(s) = (sum [i=1:strlen(s)] (s[i:i] eq ',') ? 1 : 0) + 1; set datafile separator "\t"; colCount = 0; myNaNs = myHeaders = ''; stats FILE u (rowCount=$0+1, c=getColumns(strcol(1)), c>colCount ? colCount=c : 0) nooutput; do for [i=1:colCount] {myNaNs=myNaNs.',NaN' }; set table $Data; plot FILE u (s=strcol(1),c=getColumns(s),s.myNaNs[1:(colCount-c)*4]) w table; unset table; set datafile separator ","; stats FILE u (myHeaders=sprintf('%s "%s"',myHeaders,strcol(1))) nooutput; myHeader(n) = word(myHeaders,n); set key noenhanced; plot for [row=0:rowCount-1] $Data matrix u 1:3 every ::1:row::row w lp pt 7 ti myHeader(row+1)
Result: (plot omitted: one line with points per data row, each keyed by its row title)

Automating a process for multiple CSV files

I've been looking around and couldn't find the answer so here it is.
I'm trying to find a way to automate changing the content of a CSV file into something else for machine-learning purposes. I have the content of a single line like this:
0, 0, 0, -2.3145, 5.567...... 65, 65, 125, 70.
(516 columns)
And trying to change it to this:
0,
0,
-2.3145,
5.567
....
65,
65,
125,
70.
(516 rows)
So basically I am transposing the data from horizontal to vertical (a single row to a single column).
It's easily done using Excel, but the problem is that I have 4000+ CSV files, so it takes a lot of time.
On top of that, I have to store the first 512 rows in a CSV in one folder and the last 4 rows in a CSV in another folder, with both files keeping the same name.
Eg:
features(folder)
1.CSV
2.CSV
.....
4000+.CSV
labels(folder)
1.CSV
2.CSV
.....
4000+.CSV
Any suggestions on how I can speed things up? I tried writing my own program, but I'm stumped on changing it from row to column. I've only managed to split the single CSV file into its 4000+ pieces.
EDIT:
I've tested putting the CSV rows into an array and then storing the array into the CSV; the code looks like this:
import csv

with open('FFTMIM16_512L1H1S0D0_1194.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list[0:512])
print(your_list[512:516])
print(your_list)

with open('test.csv', 'w', newline='') as fa:
    writer = csv.writer(fa)
    writer.writerows(your_list[0:511])

with open('test1.csv', 'w', newline='') as fb:
    writer = csv.writer(fb)
    writer.writerows(your_list[512:516])
It works, but I just need to run it in a loop. One thing I don't understand: if I save the values from 0 to 512 to test.csv, it shows 512 rows, but when I store 513 to 516 to test1.csv, it only shows three of the four rows I need. Changing the fb slice to 512:516 works, which doesn't make sense to me, because the value at 512 in test.csv is 0 while in test1.csv it is 69. Why is that? From what I understand of array indexing, it starts from 0 and runs to the number I give it. Or is that not the case in Python?
EDIT 2:
My new code is as follows:
import csv
import os
import glob
#import itertools

directory = input("INPUT FOLDER: ")
output1 = input("FEATURES FOLDER: ")
output2 = input("LABELS FOLDER: ")

in_files = os.path.join(directory, '*.csv')
for in_file in glob.glob(in_files):
    with open(in_file) as input_file:
        reader = csv.reader(input_file)
        your_list = list(reader)
    filename = os.path.splitext(os.path.basename(in_file))[0] + '.csv'
    with open(os.path.join(output1, filename), 'w', newline='') as output_file1:
        writer = csv.writer(output_file1)
        writer.writerow(your_list[0:512])
    with open(os.path.join(output2, filename), 'w', newline='') as output_file2:
        writer = csv.writer(output_file2)
        writer.writerow(your_list[512:516])
It shows the output I wanted, but now it stores apostrophes and brackets too, e.g. ['0.0'], ['2.321223']. How do I remove these?
I don't understand why you can't do it programmatically if you already have your 4000+ pieces; just write every piece on a new line.
In my opinion the easiest way, though not automatic, would be an editor like Notepad++.
There you can replace "," with "\r\n", or, if you want to keep the comma, replace it with ",\r\n".
If you want it automated, I don't see a non-programmatic way.
By the way, if you use Python with numpy/scipy, you can just use the .transpose() function.
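For instance, a minimal numpy sketch of that idea (note that a single 1-D row needs reshape rather than transpose, since .transpose() is a no-op on 1-D arrays; the file names are just examples):

import numpy as np

row = np.loadtxt("1.CSV", delimiter=",")  # the single 516-value row
col = row.reshape(-1, 1)                  # turn it into a 516x1 column

# The first 512 values go to the features folder, the last 4 to labels.
np.savetxt("features/1.CSV", col[:512], fmt="%g", delimiter=",")
np.savetxt("labels/1.CSV", col[512:], fmt="%g", delimiter=",")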
*Edit, replying to your comment:
What do you mean by "split from the first to the 512"? If you want parts of size 512, it would be something like:
new_array = []
temp_array = []
k = 0
for num in your_array:
    temp_array.append(num)
    k += 1
    if k % 512 == 0:
        new_array.append(temp_array)
        k = 0
        temp_array = []

# append the last block, which might not be 512-sized
if len(temp_array) > 0:
    new_array.append(temp_array)

# save the arrays
for i in range(len(new_array)):
    saveToCsv(array=new_array[i], name="csv_" + str(i))
Your new_array would now be a list filled with 512-sized lists.
There might be mistakes here, I did not test the code. To save, you only need a function saveToCsv(array, name) which saves an array into a file.
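A minimal sketch of such a saveToCsv helper could look like this (my own assumption of its shape, not code from the answer); it writes one value per line, so the result is a single-column CSV:

import csv

def saveToCsv(array, name):
    # Write each value on its own line, producing a single-column CSV.
    with open(name + ".csv", "w", newline="") as f:
        csv.writer(f).writerows([[value] for value in array])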

How to check each value is greater or less than zero in csv file using python?

I want to check each value in one column and, according to the value, write a label (trend) in the next column. For example, if a value is greater than, equal to, or less than zero, the label "positive", "same", or "negative" respectively should be written in the next column.
My input file looks like this:
Weightage /// column name
0.000555
0.002333
0
-0.22222
And I want my output file to look like this:
Weightage Labels // column name
0.000555 positive
0.002333 positive
0 same
-0.22222 negative
Can anyone help me?
The code is:
print(results)
for r in results:
    if r > 0:
        print("test")
        label = "positive"
        print(label)
    elif r == 0.0:
        label = "equal"
        print(label)
    else:
        print("nothing")
My problem is with 'r' in the for loop. The error that occurs is:
Traceback (most recent call last):
File "C:\Python34\col.py", line 23, in <module>
if r >0:
TypeError: unorderable types: tuple() > int()
At first glance, it looks like you are confusing rows and columns. I suggest using more explicit names; it helps to avoid confusion. Also, do not compare strings to numeric types like integers: it gives surprising results in Python 2, and in Python 3 it is an error.
for row in results:
    column = row[0]        # The first column of this row.
    value = float(column)  # The csv module returns strings, so we should
                           # turn them into floats for numeric comparison.
    if value > 0:
        print("positive")
    elif value < 0:
        print("negative")
    else:
        print("zero")
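To actually write the labels into a second column, as the question asks, a hedged sketch could look like this (assuming an input file named weightage.csv whose first row is the Weightage header shown in the question):

import csv

with open("weightage.csv", newline="") as fin, \
     open("weightage_labeled.csv", "w", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    writer.writerow(next(reader) + ["Labels"])  # extend the header row
    for row in reader:
        value = float(row[0])
        label = "positive" if value > 0 else "negative" if value < 0 else "same"
        writer.writerow(row + [label])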

counting non-empty lines and sum of lengths of those lines in python

I am trying to create a function that takes a filename and returns a 2-tuple of the number of non-empty lines in that file and the sum of the lengths of those lines. Here is my current program:
import re

def code_metric(file):
    with open(file, 'r') as f:
        lines = len(list(filter(lambda x: x.strip(), f)))
        num_chars = sum(map(lambda l: len(re.sub(r'\s', '', l)), f))
        return (lines, num_chars)
The result I get if I do:
if __name__ == "__main__":
    print(code_metric('cmtest.py'))
is
(3, 0)
when it should be:
(3, 85)
Also, is there a better way of finding the sum of the lengths of the lines using the functionals map, filter, and reduce? I did it for the first part but couldn't figure out the second half. I am kinda new to Python, so any help would be great.
Here is the test file called cmtest.py:
import prompt,math
x = prompt.for_int('Enter x')
print(x,'!=',math.factorial(x),sep='')
The first line has 18 characters (including white space), the second line has 29 characters, and the third line has 38 characters:
[(1, 18), (1, 29), (1, 38)]
The total is 85 characters including white space. I apologize, I mis-read the problem: the length total for each line should include the whitespace as well.
A fairly simple approach is to build a generator to strip trailing whitespace, then enumerate over that (with a start value of 1) filtering out blank lines, and summing the length of each line in turn, eg:
def code_metric(filename):
    line_count = char_count = 0
    with open(filename) as fin:
        stripped = (line.rstrip() for line in fin)
        for line_count, line in enumerate(filter(None, stripped), 1):
            char_count += len(line)
    return line_count, char_count

print(code_metric('cmtest.py'))
# (3, 85)
In order to count lines, maybe this code is cleaner:
with open(file) as f:
    lines = len(f.readlines())
For the second part of your program, if you intend to count only non-whitespace characters, remember that '\t' and '\n' have to be removed as well. If that's the case:
with open(file) as f:
    num_chars = len(re.sub(r'\s', '', f.read()))
Some people have advised you to do both things in one loop. That is fine, but if you keep them separate you can turn them into different functions and get more reusability out of them. Unless you are handling huge files (or executing this code millions of times), it shouldn't matter in terms of performance.
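Since the question asks specifically about map/filter/reduce, here is a sketch of the same metric in that style. Reading the file into a list once also avoids the bug in the original code, where the second pass over f hit an already-exhausted iterator and summed to 0:

from functools import reduce

def code_metric(filename):
    with open(filename) as f:
        lines = list(filter(str.strip, f))  # keep only non-blank lines
    # Sum the line lengths with trailing newlines stripped.
    return len(lines), reduce(lambda acc, line: acc + len(line.rstrip()), lines, 0)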

Scientific data

I want to import data from a corrupted CSV file. It contains scientific numbers, and it's a big data set with about 300,000 rows and 27 columns. When I import it using
Import["data.csv","HeaderLines"->1]
the data format is string. So I change it to a data-table format with
StringSplit[ToString[data[[#]]], ";"] & /@
  Range[Dimensions[Import["data.csv"]][[1]]]
and I need to use the first column to analyse the data. But the problem is that this column contains scientific numbers stored as strings! I want to convert them to numbers. I used this command:
ToExpression[Internal`StringToDouble[fdata[[All, 1]][[#]]]] & /@ Range[291407];
But it takes hours to run! Do you have any idea how I can do this without wasting so much time?
You could try the following:
(* read the first 5 rows *)
d = ReadList["data.csv", Table[Number, {27}], 5]
(* read the rows 100 to 150 *)
s = OpenRead["data.csv"];
Skip[s, Record, 99]
d = ReadList[s, Table[Number, {27}], 51]
Close[s]
And d[[All,1]] will get you the first column.