Plotly Express: Prevent bars from stacking when Y-axis catgories have the same name - bar-chart

I'm new to plotly.
Working with:
Ubuntu 20.04
Python 3.8.10
plotly==5.10.0
I'm doing a comparative graph using a horizontal bar chart. Different instruments measuring the same chemical compounds. I want to be able to do an at-a-glance, head-to-head comparison if the measured value amongst all machines.
The problem is; if the compound has the same name amongst the different instruments - Plotly stacks the data bars into a single bar with segment markers. I very much want each bar to appear individually. Is there a way to prevent Plotly Express from automatically stacking the common bars??
Examples:
CODE
gobardata = []
for blended_name in _df[:20].blended_name: # should always be unique
##################################
# Unaltered compound names
compound_names = [str(c) for c in _df[_df.blended_name == blended_name]["injcompound_name"].tolist()]
# Random number added to end of compound_names to make every string unique
# compound_names = ["{} ({})".format(str(c),random.randint(0, 1000)) for c in _df[_df.blended_name == blended_name]["injcompound_name"].tolist()]
##################################
deltas = _df[_df.blended_name == blended_name]["delta_rettime"].to_list()
gobardata.append(
go.Bar(
name = blended_name,
x = deltas,
y = compound_names,
orientation='h',
))
fig = go.Figure(data = gobardata)
fig.update_traces(width=1)
fig.update_layout(
bargap=1,
bargroupgap=.1,
xaxis_title="Delta Retention Time (Expected - actual)",
yaxis_title="Instrument name(Injection ID)"
)
fig.show()
What I'm getting (Using actual, but repeated, compound names)
What I want (Adding random text to each compound name to make it unique)

OK. Figured it out. This is probably pretty klugy, but it consistently works.
Basically...
Use go.FigureWidget...
...with make_subplots having a common x-axis...
...controlling the height of each subplot based on number of bars.
Every bar in each subplot is added as an individual trace...
...using a dictionary matching bar name to a common color.
The y-axis labels for each subplot is a list containing the machine name as [0], and then blank placeholders ('') so the length of the y-axis list matches the number of bars.
And manually manipulating the legend so each bar name appears only once.
# Get lists of total data
all_compounds = list(_df.injcompound_name.unique())
blended_names = list(_df.blended_name.unique())
#################################################################
# The heights of each subplot have to be set when fig is created.
# fig has to be created before adding traces.
# So, create a list of dfs, and use these to calculate the subplot heights
dfs = []
subplot_height_multiplier = 20
subplot_heights = []
for blended_name in blended_names:
df = _df[(_df.blended_name == blended_name)]#[["delta_rettime", "injcompound_name"]]
dfs.append(df)
subplot_heights.append(df.shape[0] * subplot_height_multiplier)
chart_height = sum(subplot_heights) # Prep for the height of the overall chart.
chart_width = 1000
# Make the figure
fig = make_subplots(
rows=len(blended_names),
cols=1,
row_heights = subplot_heights,
shared_xaxes=True,
)
# Create the color dictionary to match a color to each compound
_CSS_color = CSS_chart_color_list()
colors = {}
for compound in all_compounds:
try: colors[compound] = _CSS_color.pop()
except IndexError:
# Probably ran out of colors, so just reuse
_CSS_color = CSS_color.copy()
colors[compound] = _CSS_color.pop()
rowcount = 1
for df in dfs:
# Add bars individually to each subplot
bars = []
for label, labeldf in df.groupby('injcompound_name'):
fig.add_trace(
go.Bar(x = labeldf.delta_rettime,
y = [labeldf.blended_name.iloc[0]]+[""]*(len(labeldf.delta_rettime)-1),
name = label,
marker = {'color': colors[label]},
orientation = 'h',
),
row=rowcount,
col=1,
)
rowcount += 1
# Set figure to FigureWidget
fig = go.FigureWidget(fig)
# Adding individual traces creates redundancies in the legend.
# This removes redundancies from the legend
names = set()
fig.for_each_trace(
lambda trace:
trace.update(showlegend=False)
if (trace.name in names) else names.add(trace.name))
fig.update_layout(
height=chart_height,
width=chart_width,
title_text="∆ of observed RT to expected RT",
showlegend = True,
)
fig.show()

Related

get_coherence : C_V method gets an error but U_Mass works

I'm using the following code to check the coherence value. The problem is code below works well when I change the coherence type into "u_mass", but if I want to compute "c_v", an Index error occure.
Previous text process:
# Remove Stopwords, Form Bigrams, Trigrams and Lemmatization
def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
texts = [bigram_mod[doc] for doc in texts]
texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
texts_out = []
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
for sent in texts:
doc = nlp(" ".join(sent))
texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
# remove stopwords once more after lemmatization
texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]
## Remove numbers, but not words that contain numbers.
texts_out = [[word for word in simple_preprocess(str(doc)) if not word.isdigit()] for doc in texts_out]
## Remove words that are only one character.
texts_out = [[word for word in simple_preprocess(str(doc)) if len(word) > 3] for doc in texts_out]
return texts_out
data_ready = process_words(data_words)
# Create Dictionary
id2word = corpora.Dictionary(data_ready)
#dictionary.filter_extremes(no_below=10, no_above=0.2) #filter out tokens
# Create Corpus: Term Document Frequency
corpus = [id2word.doc2bow(text) for text in data_ready]
# View:the produced corpus shown above is a mapping of (word_id, word_frequency).
print(corpus[:1])
print('Number of unique tokens: %d' % len(id2word))
print('Number of documents: %d' % len(corpus))
The output is :
[[(0, 1), (1, 1), (2, 1), (3, 1)]]
Number of unique tokens: 6558
Number of documents: 23141
Now I set a base model:
## set a base model
num_topics = 5
chunksize = 100
passes = 10
iterations = 100
eval_every = 1
lda_model = LdaModel(corpus=corpus,id2word=id2word, chunksize=chunksize, \
alpha='auto', eta='auto', \
iterations=iterations, num_topics=num_topics, \
passes=passes, eval_every=eval_every)
The last step is where the problem occurs:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_ready, dictionary=id2word, coherence="c_v")
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Here is the error:
IndexError: index 0 is out of bounds for axis 0 with size 0
If I change coherence into 'u_mass', however, the code ablove can compute successfully. I don't understand why and how to fix it?
!pip install gensim==4.1.0
It seems that downgrade solves everything.
Just in case anyone else runs into the same issue.
Apparently the error described here persist in gensim 4.2.0. Downgrading to 4.1.0 worked well for me.

Networkx pyvis: change color of nodes

I have a dataframe that has source: person 1, target: person 2 and in_rewards_program : binary.
I created a network using the pyvis package"
got_net = Network(notebook=True, height="750px", width="100%")
# got_net = Network(notebook=True, height="750px", width="100%", bgcolor="#222222", font_color="white")
# set the physics layout of the network
got_net.barnes_hut()
got_data = df
sources = got_data['source']
targets = got_data['target']
# create graph using pviz network
edge_data = zip(sources, targets)
for e in edge_data:
src = e[0]
dst = e[1]
#add nodes and edges to the graph
got_net.add_node(src, src, title=src)
got_net.add_node(dst, dst, title=dst)
got_net.add_edge(src, dst)
neighbor_map = got_net.get_adj_list()
# add neighbor data to node hover data
for node in got_net.nodes:
node["title"] += " Neighbors:<br>" + "<br>".join(neighbor_map[node["id"]])
node["value"] = len(neighbor_map[node["id"]]) # this value attrribute for the node affects node size
got_net.show("test.html")
I want to add the functionality where the nodes are different colors based on the value in in_rewards_program. If the source node has 0 then make the node red and if the source node had 1 then make it blue. I am not sure how to do this.
There is not much information to know more about your data but based on your code I can assume that you can zip "source" and "target" columns with "in_rewards_program" column and make a conditional statement before adding the nodes so that it will change the node color based on the reward value. According to pyvis documentation, you can pass a color parameter with add_node method:
got_net = Network(notebook=True, height="750px", width="100%")
# set the physics layout of the network
got_net.barnes_hut()
sources = df['source']
targets = df['target']
rewards = df['in_rewards_program']
# create graph using pviz network
edge_data = zip(sources, targets, rewards)
for src, dst, reward in edge_data:
#add nodes and edges to the graph
if reward == 0:
got_net.add_node(src, src, title=src, color='red')
if reward == 1:
got_net.add_node(dst, dst, title=dst, color='blue')
got_net.add_edge(src, dst)

How to automatically crop an .OBJ 3D model to a bounding box?

In the now obsoleted Autodesk ReCap API it was possible to specify a "bounding box" around the scene to be generated from images.
In the resulting models, any vertices outside the bounding box were discarded, and any volumes that extended beyond the bounding box were truncated to have faces at the box boundaries.
I am now using Autodesk's Forge Reality Capture API which replaced ReCap. Apparently, This new API does not allow the user to specify a bounding box.
So I am now searching for a program that takes an .OBJ file and a specified bounding box as input, and outputs a file of just the vertices and faces within this bounding box.
Given that there is no way to specify the bounding box in Reality Capture API, I created this python program. It is crude, in that it only discards faces that have vertices that are outside the bounding box. And it actually does discards nondestructively, only by commenting them out in the output OBJ file. This allows you to uncomment them and then use a different bounding box.
This may not be what you need if you truly want to remove all relevant v, vn, vt, vp and f lines that are outside the bounding box, because the OBJ file size remains mostly unchanged. But for my particular needs, keeping all the records and just using comments was preferable.
# obj3Dcrop.py
# (c) Scott L. McGregor, Dec 2019
# License: free for all non commercial uses. Contact author for any other uses.
# Changes and Enhancements must be shared with author, and be subject to same use terms
# TL;DR: This program uses a bounding box, and "crops" faces and vertices from a
# Wavefront .OBJ format file, created by Autodesk Forge Reality Capture API
# if one of the vertices in a face is not within the bounds of the box.
#
# METHOD
# 1) All lines other than "v" vertex definitions and "f" faces definitions
# are copied UNCHANGED from the input .OBJ file to an output .OBJ file.
# 2) All "v" vertex definition lines have their (x, y, z) positions tested to see if:
# minX < x < maxX and minY < y < maxY and minZ < z < maxZ ?
# If TRUE, we want to keep this vertex in the new OBJ, so we
# store its IMPLICIT ORDINAL position in the file in a dictionary called v_keepers.
# If FALSE, we will use its absence from the v_keepers file as a way to identify
# faces that contain it and drop them. All "v" lines are also copied unchanged to the
# output file.
# 3) All "f" lines (face definitions) are inspected to verify that all 3 vertices in the face
# are in the v_keepers list. If they are, the f line is output unchanged.
# 4) Any "f" line that refers to a vertex that was cropped, is prefixed by "# CROPPED: "
# in the output file. Lines beginning # are treated as comments, and ignored in future
# processing.
# KNOWN LIMITATIONS: This program generates models in which the outside of bound faces
# have been removed. The vertices that were found outside the bounding box, are still in the
# OBJ file, but they are now disconnected and therefore ignored in later processing.
# The "f" lines for faces with vertices outside the bounding box are also still in the
# output file, but now commented out, so they don't process. Because this is non-destructive.
# we can easily change our bounding box later, uncomment cropped lines and reprocess.
#
# This might be an incomplete solution for some potential users. For such users
# a more complete program would delete unneeded v, vn, vt and vp lines when the v vertex
# that they refer to is dropped. But note that this requires renumbering all references to these
# vertice definitions in the "f" face definition lines. Such a more complete solution would also
# DISCARD all 'f' lines with any vertices that are out of bounds, instead of making them copies.
# Such a rewritten .OBJ file would be var more compact, but changing the bounding box would require
# saving the pre-cropped original.
# QUIRK: The OBJ file format defines v, vn, vt, vp and f elements by their
# IMPLICIT ordinal occurrence in the file, with each element type maintaining
# its OWN separate sequence. It then references those definitions EXPLICITLY in
# f face definitions. So deleting (or commenting out) element references requires
# appropriate rewriting of all the"f"" lines tracking all the new implicit positions.
# Such rewriting is not particularly hard to do, but it is one more place to make
# a mistake, and could make the algorithm more complicated to understand.
# This program doesn't bother, because all further processing of the output
# OBJ file ignores unreferenced v, vn, vt and vp elements.
#
# Saving all lines rather than deleting them to save space is a tradeoff involving considerations of
# Undo capability, compute cycles, compute space (unreferenced lines) and maintenance complexity choice.
# It is left to the motivated programmer to add this complexity if needed.
import sys
#bounding_box = sys.argv[1] # should be in the only string passsed (maxX, maxY, maxZ, minX, minY, minZ)
bounding_box = [10, 10, 10, -10, -10, 1]
maxX = bounding_box[0]
maxY = bounding_box[1]
maxZ = bounding_box[2]
minX = bounding_box[3]
minY = bounding_box[4]
minZ = bounding_box[5]
v_keepers = dict() # keeps track of which vertices are within the bounding box
kept_vertices = 0
discarded_vertices = 0
kept_faces = 0
discarded_faces = 0
discarded_lines = 0
kept_lines = 0
obj_file = open('sample.obj','r')
new_obj_file = open('cropped.obj','w')
# the number of the next "v" vertex lines to process.
original_v_number = 1 # the number of the next "v" vertex lines to process.
new_v_number = 1 # the new ordinal position of this vertex if out of bounds vertices were discarded.
for line in obj_file:
line_elements = line.split()
# Python doesn't have a SWITCH statement, but we only have three cases, so we'll just use cascading if stmts
if line_elements[0] != "f": # if it isn't an "f" type line (face definition)
if line_elements[0] != "v": # and it isn't an "v" type line either (vertex definition)
# ************************ PROCESS ALL NON V AND NON F LINE TYPES ******************
# then we just copy it unchanged from the input OBJ to the output OBJ
new_obj_file.write(line)
kept_lines = kept_lines + 1
else: # then line_elements[0] == "v":
# ************************ PROCESS VERTICES ****************************************
# a "v" line looks like this:
# f x y z ...
x = float(line_elements[1])
y = float(line_elements[2])
z = float(line_elements[3])
if minX < x < maxX and minY < y < maxY and minZ < z < maxZ:
# if vertex is within the bounding box, we include it in the new OBJ file
new_obj_file.write(line)
v_keepers[str(original_v_number)] = str(new_v_number)
new_v_number = new_v_number + 1
kept_vertices = kept_vertices +1
kept_lines = kept_lines + 1
else: # if vertex is NOT in the bounding box
new_obj_file.write(line)
discarded_vertices = discarded_vertices +1
discarded_lines = discarded_lines + 1
original_v_number = original_v_number + 1
else: # line_elements[0] == "f":
# ************************ PROCESS FACES ****************************************
# a "f" line looks like this:
# f v1/vt1/vn1 v2/vt2/vn2 v3/vt3/vn3 ...
# We need to delete any face lines where ANY of the 3 vertices v1, v2 or v3 are NOT in v_keepers.
v = ["", "", ""]
# Note that v1, v2 and v3 are the first "/" separated elements within each line element.
for i in range(0,3):
v[i] = line_elements[i+1].split('/')[0]
# now we can check if EACH of these 3 vertices are in v_keepers.
# for each f line, we need to determine if all 3 vertices are in the v_keepers list
if v[0] in v_keepers and v[1] in v_keepers and v[2] in v_keepers:
new_obj_file.write(line)
kept_lines = kept_lines + 1
kept_faces = kept_faces +1
else: # at least one of the vertices in this face has been deleted, so we need to delete the face too.
discarded_lines = discarded_lines + 1
discarded_faces = discarded_faces +1
new_obj_file.write("# CROPPED "+line)
# end of line processing loop
obj_file.close()
new_obj_file.close()
print ("kept vertices: ", kept_vertices ,"discarded vertices: ", discarded_vertices)
print ("kept faces: ", kept_faces, "discarded faces: ", discarded_faces)
print ("kept lines: ", kept_lines, "discarded lines: ", discarded_lines)
Unfortunately, (at least for now) there is no way to specify the bounding box in Reality Capture API.

Number of observations when using plm with first differences

I have a simple issue after running a regression with panel data using plm with a dataset that resembles the one below:
dataset <- data.frame(id = rep(c(1,2,3,4,5), 2),
time = rep(c(0,1), each = 5),
group = rep(c(0,1,0,0,1), 2),
Y = runif(10,0,1))
model <-plm(Y ~ time*group, method = 'fd', effect = 'twoways', data = dataset,
index = c('id', 'time'))
summary(model)
stargazer(model)
As you can see, both the model summary and the table displayed by stargazer would say that my number of observations is 10. However, is it not more correct to say that N = 5, since I have taken away the time element after with the first differences?
You are right about the number of observations. However, your code does not what you want it to do (a first differenced model).
If you want a first differenced model, switch the argument method to model (and delete argument effect because it does not make sense for a first differenced model):
model <-plm(Y ~ time*group, model = 'fd', data = dataset,
index = c('id', 'time'))
summary(model)
## Oneway (individual) effect First-Difference Model
##
## Call:
## plm(formula = Y ~ time * group, data = dataset, model = "fd",
## index = c("id", "time"))
##
## Balanced Panel: n = 5, T = 2, N = 10
## Observations used in estimation: 5
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.3067240 -0.0012185 0.0012185 0.1367080 0.1700160
## [...]
In the summary output, you can see the number of observations in your original data (N=10) and the number of observations used in the FD model (5).

csv empty strings handling and values appending

With a csv of ~50 rows (stars) and ~30 columns (name, magnitudes and distance), that has some empty string values (''), I am trying to do two things in which all the help so far hasn't been useful. (1) I need to parse empty strings as 0.0, so I can (2) append each row in a list of lists (what I called s).
In other words:
- s is a list of stars (each one has all its parameters)
- d is a particular parameter for all the stars (distance), which I obtain correctly.
Big issue is with s. My try:
with open('stars.csv', 'r') as mycsv:
csv_stars = csv.reader(mycsv)
next(csv_stars) #skip header
stars = list(csv_stars)
s = [] # star
d = [] # distances
for row in stars:
row[row==''] = '0'
s.append(float(row)) #stars
d.append(arcsec*AU*float(row[30]))
I can't think of a better syntax, and so I get the error
s.append(float(row)) # stars
TypeError: float() argument must be a string or a number
From s I would obtain later the magnitudes for all the stars, separately. But first things first...
#cwasdwa Please look at below code. it will give you an idea. I am sure there might be better way. This solution is based on what I have understood from your code.
with open('stars.csv', 'r') as mycsv:
csv_stars = csv.reader(mycsv)
next(csv_stars) #skip header
stars = list(csv_stars)
s = [] # star
d = [] # distances
for row in stars:
newRow = [] #create new row array to convert all '' to 0.0
for x in row:
if x =='':
newRow.append(0.0)
else:
newRow.append(x)
s.append(newRow) #stars
if row[30] == '':
value = 0.0
else:
value = row[30]
d.append(arcsec*AU*float(value))