How to check whether each value is greater or less than zero in a CSV file using Python?

I want to check each value in one column and write a label (trend) for it in the next column. For example, depending on whether the value is greater than, equal to, or less than zero, the label positive, same, or negative should be written in the next column.
My input file looks like this:
Weightage /// column name
0.000555
0.002333
0
-0.22222
And I want my output file to look like this:
Weightage Labels // column name
0.000555 positive
0.002333 positive
0 same
-0.22222 negative
Can anyone help me?
The code is:
print(results)
for r in results:
    if r > 0:
        print("test")
        label = "positive"
        print(label)
    elif r == 0.0:
        label = "equal"
        print(label)
    else:
        print("nothing")
The problem is with r in the for loop.
The following error occurs:
Traceback (most recent call last):
File "C:\Python34\col.py", line 23, in <module>
if r >0:
TypeError: unorderable types: tuple() > int()

At first glance, it looks like you are confusing rows and columns: each r in results is a whole row (a tuple), not a single value, which is why the comparison with an integer fails. I suggest using more explicit names; it helps to avoid confusion. Also, do not compare strings to numeric types like integers. In Python 2 it gives surprising results; in Python 3 it is an error.
for row in results:
    column = row[0]  # The first column of this row.
    value = float(column)  # The csv module returns strings, so we should
                           # turn them into floats for numeric comparison.
    if value > 0:
        print("positive")
    elif value < 0:
        print("negative")
    else:
        print("zero")
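To write the labels into a second column, as the question asks, something like the following sketch might work. It assumes a single-column input with a Weightage header; the label_for helper and the in-memory sample (used here instead of real file names) are illustrative and not part of the original post.

```python
import csv
import io


def label_for(value):
    """Return 'positive', 'negative' or 'same' for a numeric value."""
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "same"


# In-memory stand-in for the input file (assumed layout: header, one value per row).
sample = io.StringIO("Weightage\n0.000555\n0.002333\n0\n-0.22222\n")
reader = csv.reader(sample)
header = next(reader)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header + ["Labels"])
for row in reader:
    # Keep the original text of the value and append its label.
    writer.writerow([row[0], label_for(float(row[0]))])

print(out.getvalue())
```

With real files, the two StringIO objects would be replaced by open(..., newline="") handles for the input and output paths.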

Related

How to read fixed-width data?

The data looks like this:
212253820000025000.00000002500.00000000375.00111120211105202117
212456960000000750.00000000075.00000000011.25111120211102202117
212387470000010000.00000001000.00000000150.00111120211105202117
I need to add separators, like this:
21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17
The file is long, nearly 20,000 rows. Is there any way to do this?
This question is generally about reading "fixed width data".
If you're stuck with this data, you'll need to parse it line by line then column by column. I'll show you how to do this with Python.
First off, the columns you counted off in the comment do not match your sample output. You seem to have omitted the last column, which is 2 characters wide.
You'll need accurate column widths to perform the task. I took your sample data and counted the columns for you and got these numbers:
8, 13, 12, 12, 8, 8, 2
So, we'll read the input data line by line, and for every line we'll:
- Read 8 chars and save them as a column, then 13 chars and save them as a column, then 12 chars, and so on, until we've read all the specified column widths
- As we move through the line, keep track of our position with the variables beg and end to denote where a column begins (inclusive) and where it ends (exclusive)
- Make the end of the first column the beginning of the next, and so on down the line
- Store those columns in a list (array) that is the new row
- At the end of the line, save the new row to a list of all the rows
- Then repeat the process for the next line
Here's how this looks in Python:
import pprint
Col_widths = [8, 13, 12, 12, 8, 8, 2]
all_rows = []
with open("data.txt") as in_file:
    for line in in_file:
        row = []
        beg = 0
        for width in Col_widths:
            end = beg + width
            col = line[beg:end]
            row.append(col)
            beg = end
        all_rows.append(row)
pprint.pprint(all_rows, width=100)
all_rows is just a list of lists of text:
[['21225382', '0000025000.00', '000002500.00', '000000375.00', '11112021', '11052021', '17'],
['21245696', '0000000750.00', '000000075.00', '000000011.25', '11112021', '11022021', '17'],
['21238747', '0000010000.00', '000001000.00', '000000150.00', '11112021', '11052021', '17']]
With this approach, if you miscounted a column width or the number of columns, you can easily modify Col_widths to match your data.
From here we'll use Python's csv module to make sure the CSV file is written correctly:
import csv
with open("data.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerows(all_rows)
and my data.csv file looks like:
21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17
If you have access to the command-line tool awk, you can fix your data with the script at the end of this answer. In it:
- substr() gives a portion of the string $0, which is the entire line
- you start at char 1, then specify the width of your first column, 8
- for the next substr(), you again use $0, start at 9 (1+8 from the last substr), and give it the second column's width, 13
- and repeat for each column, starting at "the start of the last column plus the last column's width"
#!/bin/sh
# Col_widths = [8, 13, 12, 12, 8, 8, 2]
awk '{print substr($0,1,8) "," substr($0,9,13) "," substr($0,22,12) "," substr($0,34,12) "," substr($0,46,8) "," substr($0,54,8) "," substr($0,62,2)}' data.txt > data.csv
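The start positions in the awk command can also be derived from the widths rather than counted by hand. The following is only a sketch of that arithmetic in Python; the names col_widths, starts, and awk_program are illustrative, and the generated string is meant to match the command above.

```python
from itertools import accumulate

col_widths = [8, 13, 12, 12, 8, 8, 2]

# awk's substr() is 1-based: the first column starts at 1, and every later
# column starts at 1 plus the running total of the widths before it.
starts = [1] + [1 + total for total in accumulate(col_widths)][:-1]

fields = ' "," '.join(
    f"substr($0,{start},{width})" for start, width in zip(starts, col_widths)
)
awk_program = "{print " + fields + "}"

print(starts)
print(awk_program)
```

If a width changes, only col_widths needs editing; the offsets follow from it.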

Removing data from a JSON file based on their values

I have written a script to parse some BLAST files from different samples. As I wanted to know which genes all the samples have in common, I created a list and a dictionary to count them. I have also produced a JSON file from the dictionary. Now I want to remove the genes whose counts are less than 100 (the number of samples), either from the dictionary or from the JSON file, but I don't know how.
This is part of the code:
###to produce a dictionary with the genes, and their repetitions
for extracted_gene in matches:
    if extracted_gene in matches_counts:
        matches_counts[extracted_gene]+=1
    else:
        matches_counts[extracted_gene]=1
print matches_counts #check point
#if matches_counts[extracted_gene]==100:
    #print extracted_gene
#to convert a dictionary into a txt file and format it with json
with open('my_gene_extraction_trial.txt', 'w') as file:
    json.dump(matches_counts,file, sort_keys=True, indent=2, separators=(',',':'))
print 'Parsing has finished'
I have tried different ways to do so:
a) ignoring the else statement, but then it gives me an empty dict
b) trying to print only the ones whose value is 100, but nothing gets printed
c) I read the documentation about json, but I can only see how to delete elements by objects, not by values.
Can anyone help me with this issue, please? This is driving me mad!
This is what it should look like:
# matches (list) and matches_counts (dict) already defined
for extracted_gene in matches:
    if extracted_gene in matches_counts:
        matches_counts[extracted_gene] += 1
    else:
        matches_counts[extracted_gene] = 1
print(matches_counts)  # check point
# Create a copy of the dict of matches to remove items from,
# so the dict being iterated over is never modified
counts_100 = matches_counts.copy()
for extracted_gene in matches_counts:
    if matches_counts[extracted_gene] < 100:
        del counts_100[extracted_gene]
print(counts_100)
Let me know if you still get errors.
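An alternative that avoids copying and deleting is to build the filtered dict directly with a comprehension. This is a minimal sketch assuming Python 3; the sample gene names, the n_samples variable, and the use of collections.Counter for counting are illustrative choices, not part of the original script.

```python
import json
from collections import Counter

# Hypothetical data: gene_a appears in all 100 samples, gene_b in only 40.
matches = ["gene_a"] * 100 + ["gene_b"] * 40
n_samples = 100

# Counter does the same counting as the if/else loop in the question.
matches_counts = Counter(matches)

# Keep only the genes seen at least n_samples times.
common = {gene: n for gene, n in matches_counts.items() if n >= n_samples}

# Serialize the filtered dict; json.dump(common, file, ...) would write it to a file.
text = json.dumps(common, sort_keys=True, indent=2)
print(text)
```

Filtering before dumping means the JSON file never contains the unwanted genes, so nothing has to be removed from it afterwards.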

csv empty strings handling and values appending

With a csv of ~50 rows (stars) and ~30 columns (name, magnitudes and distance), that has some empty string values (''), I am trying to do two things in which all the help so far hasn't been useful. (1) I need to parse empty strings as 0.0, so I can (2) append each row in a list of lists (what I called s).
In other words:
- s is a list of stars (each one has all its parameters)
- d is a particular parameter for all the stars (distance), which I obtain correctly.
The big issue is with s. My attempt:
with open('stars.csv', 'r') as mycsv:
    csv_stars = csv.reader(mycsv)
    next(csv_stars) #skip header
    stars = list(csv_stars)
s = [] # star
d = [] # distances
for row in stars:
    row[row==''] = '0'
    s.append(float(row)) #stars
    d.append(arcsec*AU*float(row[30]))
I can't think of a better syntax, and so I get the error
s.append(float(row)) # stars
TypeError: float() argument must be a string or a number
From s I would obtain later the magnitudes for all the stars, separately. But first things first...
@cwasdwa, please look at the code below; it should give you an idea. There may well be a better way. This solution is based on what I understood from your code.
with open('stars.csv', 'r') as mycsv:
    csv_stars = csv.reader(mycsv)
    next(csv_stars) #skip header
    stars = list(csv_stars)
s = [] # star
d = [] # distances
for row in stars:
    newRow = [] # new row in which every '' is replaced by 0.0
    for x in row:
        if x == '':
            newRow.append(0.0)
        else:
            newRow.append(x) # non-empty values are kept as strings
    s.append(newRow) #stars
    if row[30] == '':
        value = 0.0
    else:
        value = row[30]
    d.append(arcsec*AU*float(value)) # arcsec and AU come from your own code
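If the numeric cells should end up as floats rather than strings, a per-cell conversion helper is one option. The sketch below is an assumption-laden illustration: to_float is a hypothetical helper, the three-column sample with made-up values stands in for the real ~30-column stars.csv, and the distance column index (2 here) would be 30 in the original file.

```python
import csv
import io


def to_float(cell):
    """Empty string -> 0.0, numeric text -> float, anything else unchanged."""
    if cell == "":
        return 0.0
    try:
        return float(cell)
    except ValueError:
        return cell  # e.g. the star's name


# Made-up sample standing in for stars.csv (name, one magnitude, distance).
sample = io.StringIO("name,mag,dist\nVega,0.03,\nSirius,,2.64\n")
reader = csv.reader(sample)
next(reader)  # skip header

s = [[to_float(cell) for cell in row] for row in reader]  # one list per star
d = [row[2] for row in s]  # the distance column, already numeric

print(s)
print(d)
```

Note that float() is applied cell by cell; calling it on a whole row is what raised the TypeError in the question.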

counting non-empty lines and sum of lengths of those lines in python

I am trying to create a function that takes a filename and returns a 2-tuple with the number of non-empty lines in that program and the sum of the lengths of all those lines. Here is my current program:
def code_metric(file):
    with open(file, 'r') as f:
        lines = len(list(filter(lambda x: x.strip(), f)))
        num_chars = sum(map(lambda l: len(re.sub('\s', '', l)), f))
        return(lines, num_chars)
The result I get if I do:
if __name__=="__main__":
print(code_metric('cmtest.py'))
is
(3, 0)
when it should be:
(3,85)
Also, is there a better way of finding the sum of the lengths of the lines using the functionals map, filter, and reduce? I did it for the first part but couldn't figure out the second half. I am kind of new to Python, so any help would be great.
Here is the test file called cmtest.py:
import prompt,math
x = prompt.for_int('Enter x')
print(x,'!=',math.factorial(x),sep='')
First line has 18 characters (including white space)
Second line has 29 characters
Third line has 38 characters
[(1, 18), (1, 29), (1, 38)]
The total length is 85 characters, including whitespace. I apologize, I misread the problem. The length total for each line should include the whitespace as well.
A fairly simple approach is to build a generator to strip trailing whitespace, then enumerate over that (with a start value of 1) filtering out blank lines, and summing the length of each line in turn, eg:
def code_metric(filename):
    line_count = char_count = 0
    with open(filename) as fin:
        stripped = (line.rstrip() for line in fin)
        for line_count, line in enumerate(filter(None, stripped), 1):
            char_count += len(line)
    return line_count, char_count
print(code_metric('cmtest.py'))
# (3, 85)
To count lines, maybe this code is cleaner (note that it counts every line, including empty ones):
with open(file) as f:
    lines = len(f.readlines())
For the second part of your program, if you intend to count only non-whitespace characters, you can strip all whitespace (including '\t' and '\n') from the whole file at once:
with open(file) as f:
    num_chars = len(re.sub(r'\s', '', f.read()))
Some people have advised you to do both things in one loop. That is fine, but if you keep them separate you can turn them into different functions, which makes them more reusable. Unless you are handling huge files (or executing this code millions of times), it shouldn't matter in terms of performance.
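One likely reason the function in the question returns (3, 0): a file object is an iterator, so once filter has consumed it to count the lines, the second pass with map finds nothing left to read. A possible sketch with map, filter, and reduce that reads the lines only once follows. It takes any iterable of lines; the in-memory sample (with one extra blank line added) is an assumption standing in for cmtest.py.

```python
import io
from functools import reduce


def code_metric(lines):
    """Return (number of non-empty lines, total length of those lines)."""
    # Strip trailing whitespace and drop empty lines, materializing the
    # result once so it can be used for both the count and the sum.
    non_empty = list(filter(None, map(str.rstrip, lines)))
    total = reduce(lambda acc, n: acc + n, map(len, non_empty), 0)
    return len(non_empty), total


sample = io.StringIO(
    "import prompt,math\n"
    "x = prompt.for_int('Enter x')\n"
    "\n"
    "print(x,'!=',math.factorial(x),sep='')\n"
)
print(code_metric(sample))
```

For a real file, the function could be called as code_metric(f) inside a with open(...) as f block.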

Rpy2 - Select Results and Output to CSV File

I'm currently doing Cox Proportional Hazards Modeling using Rpy2 - I imagine my question will cover other functions and the results from calling them as well though.
After I run the function, I have a variable which contains the results from the function, in the form of a vector. I have tried explicitly converting this to a DataFrame (resultsDataFrame = DataFrame(resultVector)). There are no errors returned when doing this. However, when I do resultsDataFrame.to_csvfile(filename) I get the following error:
Traceback (most recent call last):
File "<pyshell#171>", line 1, in <module>
modelFrame.to_csvfile('/Users/fortylashes/Documents/Matthews_Research/Cox_PH/ResultOutput_Exp1.csv')
File "/Library/Python/2.7/site-packages/rpy2/robjects/vectors.py", line 1031, in to_csvfile
'col.names': col_names, 'qmethod': qmethod, 'append': append})
RRuntimeError: Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class "coxph" to a data.frame
Furthermore, when I simply do:
for result in resultVector:
print (result)
I get an extremely long list of results- including information on each entry in the dataset used in the model, for each variable (so 9,000 records x 9 variables = 81,000 unneeded results). The results I really need are at the bottom of this vector and look like this:
                   coef  exp(coef)  se(coef)       z        p
age_age6574   -0.057775      0.944   0.05469  -1.056  2.9e-01
age_age75plus -0.020795      0.979   0.04891  -0.425  6.7e-01
sex_female    -0.005304      0.995   0.03961  -0.134  8.9e-01
stage_late    -0.261609      0.770   0.04527  -5.779  7.5e-09
access        -0.000494      1.000   0.00069  -0.715  4.7e-01
Likelihood ratio test=36.6 on 5 df, p=7.31e-07 n= 9752, number of events= 2601
*NOTE: There were several more variables for which data was reported in the initial results (the 9,000 x 9 that I was talking about) but weren't actually used in the model.
I was wondering if there was a way to explicitly get this data, put it in one long ordered row, and then output it to a csv file?
::::UPDATE::::
When I call theModel.names I get a list of the various measures which can be called by numerical index:
[1] "coefficients" "var" "loglik"
[4] "score" "iter" "linear.predictors"
[7] "residuals" "means" "concordance"
[10] "method" "n" "nevent"
[13] "terms" "assign" "wald.test"
[16] "y" "formula" "call"
From this I can get the coefficients, which can then be exponentiated. However, I have not found the p-value, the z-score, or the likelihood ratio test, which I will need.
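The z and p columns in the table above look consistent with a standard Wald test, in which z is the coefficient divided by its standard error and p is the two-sided normal tail probability. The sketch below shows that arithmetic in plain Python, using the first row of the table as input. It is an assumption, not something confirmed by the post, that the standard errors are the square roots of the diagonal of the "var" entry; wald_stats is a hypothetical helper, and no rpy2 calls are shown.

```python
import math


def wald_stats(coef, se):
    """Return (exp(coef), z, two-sided p) for one coefficient."""
    z = coef / se
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)),
    # which equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return math.exp(coef), z, p


# First row of the table: age_age6574, coef and se(coef) as printed.
hazard_ratio, z, p = wald_stats(-0.057775, 0.05469)
print(round(hazard_ratio, 3), round(z, 3), round(p, 2))
```

The same function can be applied to each coefficient and the results collected into one row for csv.writer.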