csv empty strings handling and values appending

I have a CSV of ~50 rows (stars) and ~30 columns (name, magnitudes and distance) that contains some empty string values (''). I am trying to do two things, and none of the help I have found so far has been useful: (1) parse empty strings as 0.0, so that I can (2) append each row to a list of lists (which I call s).
In other words:
- s is a list of stars (each one has all its parameters)
- d is a particular parameter for all the stars (distance), which I obtain correctly.
The big issue is with s. My attempt:
with open('stars.csv', 'r') as mycsv:
    csv_stars = csv.reader(mycsv)
    next(csv_stars)  # skip header
    stars = list(csv_stars)
s = []  # stars
d = []  # distances
for row in stars:
    row[row==''] = '0'
    s.append(float(row))  # stars
    d.append(arcsec*AU*float(row[30]))
I can't think of a better syntax, and I get this error:
    s.append(float(row)) # stars
TypeError: float() argument must be a string or a number
Later I want to obtain the magnitudes for all the stars, separately, from s. But first things first...

Please look at the code below; it should give you an idea. There may well be a better way. This solution is based on what I understood from your code.
import csv

with open('stars.csv', 'r') as mycsv:
    csv_stars = csv.reader(mycsv)
    next(csv_stars)  # skip header
    stars = list(csv_stars)
s = []  # stars
d = []  # distances
for row in stars:
    newRow = []  # new row in which every '' is replaced by 0.0
    for x in row:
        if x == '':
            newRow.append(0.0)
        else:
            newRow.append(x)  # non-empty values are kept as strings
    s.append(newRow)  # stars
    # the distance column may be empty too, so treat '' as 0.0
    if row[30] == '':
        value = 0.0
    else:
        value = row[30]
    # arcsec and AU are constants defined elsewhere in the asker's script
    d.append(arcsec*AU*float(value))
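If the numeric columns should end up as floats rather than strings, the conversion can be moved into a small helper. The following is only a sketch under assumptions: the helper name to_number is hypothetical, the three-column sample data is made up to stand in for stars.csv, and the arcsec and AU factors are left out because their values are not shown in the question.

```python
import csv
import io


def to_number(cell):
    """Return 0.0 for an empty cell, a float for numeric text, else the text unchanged."""
    if cell == '':
        return 0.0
    try:
        return float(cell)
    except ValueError:
        return cell  # non-numeric text, e.g. the star's name


# Made-up sample standing in for stars.csv (name, magnitude, distance).
sample = io.StringIO("name,mag,dist\nSirius,-1.46,2.64\nVega,,7.68\n")

reader = csv.reader(sample)
next(reader)  # skip header
s = [[to_number(cell) for cell in row] for row in reader]
d = [row[2] for row in s]  # distance column; the question uses index 30
```

With the real file, the io.StringIO object would be replaced by the open file handle, and each distance would still need to be multiplied by arcsec*AU as in the original loop.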

Related

Plotly Express: Prevent bars from stacking when Y-axis categories have the same name

I'm new to plotly.
Working with:
Ubuntu 20.04
Python 3.8.10
plotly==5.10.0
I'm building a comparative graph using a horizontal bar chart: different instruments measuring the same chemical compounds. I want to be able to do an at-a-glance, head-to-head comparison of the measured values across all machines.
The problem is that if a compound has the same name on different instruments, Plotly stacks the data bars into a single bar with segment markers. I very much want each bar to appear individually. Is there a way to prevent Plotly Express from automatically stacking the common bars?
Example code:
gobardata = []
for blended_name in _df[:20].blended_name:  # should always be unique
    ##################################
    # Unaltered compound names
    compound_names = [str(c) for c in _df[_df.blended_name == blended_name]["injcompound_name"].tolist()]
    # Random number added to end of compound_names to make every string unique
    # compound_names = ["{} ({})".format(str(c),random.randint(0, 1000)) for c in _df[_df.blended_name == blended_name]["injcompound_name"].tolist()]
    ##################################
    deltas = _df[_df.blended_name == blended_name]["delta_rettime"].to_list()
    gobardata.append(
        go.Bar(
            name = blended_name,
            x = deltas,
            y = compound_names,
            orientation='h',
        ))
fig = go.Figure(data = gobardata)
fig.update_traces(width=1)
fig.update_layout(
    bargap=1,
    bargroupgap=.1,
    xaxis_title="Delta Retention Time (Expected - actual)",
    yaxis_title="Instrument name(Injection ID)"
)
fig.show()
What I'm getting (using actual, but repeated, compound names): [image not included]
What I want (adding random text to each compound name to make it unique): [image not included]

OK. Figured it out. This is probably pretty kludgy, but it consistently works.
Basically...
Use go.FigureWidget...
...with make_subplots having a common x-axis...
...controlling the height of each subplot based on the number of bars.
Every bar in each subplot is added as an individual trace...
...using a dictionary matching each bar name to a common color.
The y-axis labels for each subplot are a list containing the machine name as [0], followed by blank placeholders ('') so that the length of the y-axis list matches the number of bars.
Finally, the legend is manipulated manually so that each bar name appears only once.
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Get lists of total data
all_compounds = list(_df.injcompound_name.unique())
blended_names = list(_df.blended_name.unique())
#################################################################
# The heights of each subplot have to be set when fig is created.
# fig has to be created before adding traces.
# So, create a list of dfs, and use these to calculate the subplot heights
dfs = []
subplot_height_multiplier = 20
subplot_heights = []
for blended_name in blended_names:
    df = _df[(_df.blended_name == blended_name)]#[["delta_rettime", "injcompound_name"]]
    dfs.append(df)
    subplot_heights.append(df.shape[0] * subplot_height_multiplier)
chart_height = sum(subplot_heights)  # Prep for the height of the overall chart.
chart_width = 1000
# Make the figure
fig = make_subplots(
    rows=len(blended_names),
    cols=1,
    row_heights = subplot_heights,
    shared_xaxes=True,
)
# Create the color dictionary to match a color to each compound
# (CSS_chart_color_list() is a helper defined elsewhere in the author's code, not shown here)
_CSS_color = CSS_chart_color_list()
colors = {}
for compound in all_compounds:
    try:
        colors[compound] = _CSS_color.pop()
    except IndexError:
        # Ran out of colors, so refill the list and reuse
        _CSS_color = CSS_chart_color_list()
        colors[compound] = _CSS_color.pop()
rowcount = 1
for df in dfs:
    # Add bars individually to each subplot
    bars = []
    for label, labeldf in df.groupby('injcompound_name'):
        fig.add_trace(
            go.Bar(x = labeldf.delta_rettime,
                   y = [labeldf.blended_name.iloc[0]]+[""]*(len(labeldf.delta_rettime)-1),
                   name = label,
                   marker = {'color': colors[label]},
                   orientation = 'h',
                   ),
            row=rowcount,
            col=1,
        )
    rowcount += 1
# Set figure to FigureWidget
fig = go.FigureWidget(fig)
# Adding individual traces creates redundancies in the legend.
# This removes redundancies from the legend
names = set()
fig.for_each_trace(
    lambda trace:
        trace.update(showlegend=False)
        if (trace.name in names) else names.add(trace.name))
fig.update_layout(
    height=chart_height,
    width=chart_width,
    title_text="∆ of observed RT to expected RT",
    showlegend = True,
)
fig.show()
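The color dictionary step can also be written without the try/except refill. This is a sketch of an alternative, not the answer's own method: it uses itertools.cycle to wrap around the palette, the function name assign_colors is hypothetical, and the three-color palette and compound names are made up for illustration.

```python
from itertools import cycle


def assign_colors(names, palette):
    """Map each name to a palette color, wrapping around when the palette runs out."""
    return dict(zip(names, cycle(palette)))


palette = ["crimson", "steelblue", "seagreen"]  # hypothetical palette
compounds = ["A", "B", "C", "D"]                # hypothetical compound names
colors = assign_colors(compounds, palette)
```

Unlike the pop()-based loop, this hands out colors from the front of the palette, so the same compound may get a different color than it does in the answer's code.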

get_coherence : C_V method gets an error but U_Mass works

I'm using the following code to check the coherence value. The code below works when I set the coherence type to "u_mass", but if I try to compute "c_v", an IndexError occurs.
Text preprocessing done beforehand:
import spacy
from gensim import corpora
from gensim.utils import simple_preprocess

# Remove Stopwords, Form Bigrams, Trigrams and Lemmatization
# (stop_words, bigram_mod, trigram_mod and data_words are defined earlier, not shown)
def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
    texts = [bigram_mod[doc] for doc in texts]
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []
    nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # remove stopwords once more after lemmatization
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]
    ## Remove numbers, but not words that contain numbers.
    texts_out = [[word for word in simple_preprocess(str(doc)) if not word.isdigit()] for doc in texts_out]
    ## Remove words of three characters or fewer.
    texts_out = [[word for word in simple_preprocess(str(doc)) if len(word) > 3] for doc in texts_out]
    return texts_out

data_ready = process_words(data_words)
# Create Dictionary
id2word = corpora.Dictionary(data_ready)
#dictionary.filter_extremes(no_below=10, no_above=0.2) #filter out tokens
# Create Corpus: Term Document Frequency
corpus = [id2word.doc2bow(text) for text in data_ready]
# View: the corpus produced above is a mapping of (word_id, word_frequency).
print(corpus[:1])
print('Number of unique tokens: %d' % len(id2word))
print('Number of documents: %d' % len(corpus))
The output is:
[[(0, 1), (1, 1), (2, 1), (3, 1)]]
Number of unique tokens: 6558
Number of documents: 23141
Now I set a base model:
from gensim.models import CoherenceModel, LdaModel

## set a base model
num_topics = 5
chunksize = 100
passes = 10
iterations = 100
eval_every = 1
lda_model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize,
                     alpha='auto', eta='auto',
                     iterations=iterations, num_topics=num_topics,
                     passes=passes, eval_every=eval_every)
The last step is where the problem occurs:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_ready, dictionary=id2word, coherence="c_v")
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Here is the error:
IndexError: index 0 is out of bounds for axis 0 with size 0
If I change coherence to 'u_mass', however, the code above runs successfully. I don't understand why. How can I fix it?

!pip install gensim==4.1.0
It seems that downgrading solves the problem.

Just in case anyone else runs into the same issue:
Apparently the error described here persists in gensim 4.2.0. Downgrading to 4.1.0 worked well for me.

How to not exceed the maximum number of fonts when generating XLS spreadsheets

I am taking several comma delimited CSV files and using them to generate an XLS spreadsheet in which the names of the files become separate tabs in the spreadsheet. The code I have produces the results I want except for when opening the spreadsheet I get the following warning: "Some text formatting may have changed in this file because the maximum number of fonts was exceeded. It may help to close other documents and try again." I am pretty sure that the problem arises from the code trying to change the format of cells beyond the 65536 row limit, but I'm not sure how to limit the row changes. I need no more than a few hundred rows across four columns.
import csv, glob, xlwt, sys, os
csvFiles = os.path.join(LogFileFolder, "*")
wb = xlwt.Workbook()
colNames = ['iNFADS_FAC','CAT','Crosswalk_FAC','FAC']
for filename in glob.glob(csvFiles):
    (f_path, f_name) = os.path.split(filename)
    (f_short_name, f_extension) = os.path.splitext(f_name)
    ws = wb.add_sheet(f_short_name)
    with open(filename, 'rb') as csvf:
        csvReader = csv.reader(csvf)
        for rowx, row in enumerate(csvReader):
            for colx, value in enumerate(row):
                if value in colNames:
                    ws.write(rowx, colx, value, xlwt.easyxf(
                        "border: top medium, right medium, bottom double, left medium;"
                        "font: bold on; pattern: pattern solid, fore_color pale_blue;"
                        "align: vert centre, horiz centre"))
                elif value not in colNames:
                    ws.write(rowx, colx, float(value),
                             xlwt.easyxf("align: vert centre, horiz centre"))
                    ##This second "xlwt.easyxf(align...)" part is the offending section of the code, if
                    ##I remove just that part then the problem goes away. Is there a way to keep
                    ##it within the 65536 limit here?
                else:
                    pass
wb.set_active_sheet = 1
outXLS = os.path.join(LogFileFolder, "FAC-CAT Code Changes.xls")
wb.save(outXLS)

I wish to thank John Machin at Google Group 'python-excel' for answering my question. Apparently, the solution is to move the easyxf portion to a variable earlier in the script and then just call it whenever needed. So the script should read:
import csv, glob, os, xlwt

csvFiles = os.path.join(LogFileFolder, "*")
wb = xlwt.Workbook()
# Build each style once and reuse it for every cell.
headerStyle = xlwt.easyxf("border: top medium, right medium, bottom double,"
    "left medium; font: bold on; pattern: pattern solid, fore_color pale_blue;"
    "align: vert centre, horiz centre")
valueStyle = xlwt.easyxf("align: vert centre, horiz centre")
colNames = ['iNFADS_FAC','CAT','Crosswalk_FAC','FAC']
for filename in glob.glob(csvFiles):
    (f_path, f_name) = os.path.split(filename)
    (f_short_name, f_extension) = os.path.splitext(f_name)
    ws = wb.add_sheet(f_short_name)  # one tab per CSV file
    with open(filename, 'rb') as csvf:
        csvReader = csv.reader(csvf)
        for rowx, row in enumerate(csvReader):
            for colx, value in enumerate(row):
                if value in colNames:
                    ws.col(colx).width = 256 * 15
                    ws.write(rowx, colx, value, headerStyle)
                elif value not in colNames:
                    ws.write(rowx, colx, float(value), valueStyle)
                else:
                    pass
# set_active_sheet is a method, so it is called rather than assigned to
wb.set_active_sheet(1)
outXLS = os.path.join(LogFileFolder, "FAC-CAT Code Changes.xls")
wb.save(outXLS)

How to check each value is greater or less than zero in csv file using python?

I want to check each value in one column and write a label (trend) for it in the next column. Depending on whether the value is greater than, equal to, or less than zero, the label should be positive, same, or negative.
My input file looks like this:
Weightage        (column name)
0.000555
0.002333
0
-0.22222
And I want my output file to look like this:
Weightage    Labels        (column names)
0.000555     positive
0.002333     positive
0            same
-0.22222     negative
Can anyone help me?
The code is:
print (results)
for r in results:
    if r >0:
        print("test")
        label = "positive"
        print(label)
    elif r == 0.0:
        label = "equal"
        print(label)
    else:
        print("nothing")
I have a problem with 'r' in the for loop.
This error occurs:
Traceback (most recent call last):
  File "C:\Python34\col.py", line 23, in <module>
    if r >0:
TypeError: unorderable types: tuple() > int()

At first glance, it looks like you are confusing rows and columns: each element of results is a whole row (a tuple), not a single value. I suggest using more explicit names, which helps to avoid this kind of confusion. Also, do not compare strings to numeric types like integers. It gives surprising results in Python 2, and in Python 3 it is an error.
for row in results:
    column = row[0]  # The first column of this row.
    value = float(column)  # The csv module returns strings, so we should
                           # turn them into floats for numeric comparison.
    if value > 0:
        print("positive")
    elif value < 0:
        print("negative")
    else:
        print("zero")
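To get the labels into a second column rather than printing them, the same comparison can feed a csv.writer. This is a minimal sketch under assumptions: the helper name label_for is hypothetical, and in-memory io.StringIO objects stand in for the real input and output files, whose names are not given in the question.

```python
import csv
import io


def label_for(value):
    """Return the trend label for a numeric value."""
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "same"


# In-memory stand-in for the input file shown in the question.
source = io.StringIO("Weightage\n0.000555\n0.002333\n0\n-0.22222\n")
reader = csv.reader(source)
header = next(reader)  # keep the header so it can be extended

out = io.StringIO()  # stand-in for the output file
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header + ["Labels"])
for row in reader:
    # csv returns strings, so convert before comparing
    writer.writerow(row + [label_for(float(row[0]))])

result = out.getvalue()
```

With real files, source and out would be replaced by file objects opened with newline='' as the csv module documentation recommends.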

Egg dropping in worst case

I have been trying to write an algorithm to compute the maximum number of trials required in the worst case of the egg dropping problem. Here is my Python code:
def eggDrop(n,k):
    eggFloor=[ [0 for i in range(k+1) ] ]* (n+1)
    for i in range(1, n+1):
        eggFloor[i][1] = 1
        eggFloor[i][0] = 0
    for j in range(1, k+1):
        eggFloor[1][j] = j
    for i in range (2, n+1):
        for j in range (2, k+1):
            eggFloor[i][j] = 'infinity'
            for x in range (1, j + 1):
                res = 1 + max(eggFloor[i-1][x-1], eggFloor[i][j-x])
                if res < eggFloor[i][j]:
                    eggFloor[i][j] = res
    return eggFloor[n][k]
print eggDrop(2, 100)
The code outputs 7 for 2 eggs and 100 floors, but the answer should be 14. I don't know what mistake I have made in the code. What is the problem?

The problem is in this line:
    eggFloor=[ [0 for i in range(k+1) ] ]* (n+1)
You want this to create a list containing (n+1) lists of (k+1) zeroes. What the * (n+1) does is slightly different: it creates a list containing (n+1) references to the same inner list.
This is an important distinction, because when you start modifying entries in the list, say
    eggFloor[i][1] = 1
this actually changes element [1] of all of the rows, not just the ith one.
To instead create separate lists that can be modified independently, you want something like:
    eggFloor=[ [0 for i in range(k+1) ] for j in range(n+1) ]
With this modification, the program returns 14 as expected.
(To debug this, it would have been a good idea to write a function that prints out the eggFloor array and call it at various points in your program, so you can compare it with what you were expecting. It would soon become pretty clear what was going on!)
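The aliasing effect is easy to see on a tiny grid. The sketch below is written for Python 3, whereas the question's code is Python 2; the names shared, independent and egg_drop are illustrative, and float('inf') semantics are replaced by a plain min() over the candidate floors instead of the 'infinity' string, which cannot be compared with integers in Python 3.

```python
# Multiplying the outer list repeats a reference to ONE inner list.
shared = [[0] * 3] * 2
shared[0][1] = 1          # also visible through shared[1]

# A comprehension builds a fresh inner list for every row.
independent = [[0] * 3 for _ in range(2)]
independent[0][1] = 1     # only row 0 changes


def egg_drop(eggs, floors):
    """Worst-case number of trials needed with the given eggs and floors."""
    table = [[0] * (floors + 1) for _ in range(eggs + 1)]
    for f in range(1, floors + 1):
        table[1][f] = f  # with one egg, every floor must be tried in turn
    for e in range(2, eggs + 1):
        for f in range(1, floors + 1):
            # Drop from floor x: the egg breaks (e-1 eggs, x-1 floors below)
            # or survives (e eggs, f-x floors above); take the worse case.
            table[e][f] = min(
                1 + max(table[e - 1][x - 1], table[e][f - x])
                for x in range(1, f + 1)
            )
    return table[eggs][floors]


print(shared)       # [[0, 1, 0], [0, 1, 0]]
print(independent)  # [[0, 1, 0], [0, 0, 0]]
print(egg_drop(2, 100))  # 14
```

Printing the two small grids side by side is the kind of check the answer suggests: the unexpected 1 in the second row of shared shows immediately that the rows are not independent.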
