Straight line appearing in seaborn barplot

I have a straight line appearing over one of the bars in my bar plot. This is what it looks like, and my code is below:
sns.barplot(x='Police force sent NRM referral for Crime Recording', y='Total', data=dfForces1)
sns.despine()
plt.xticks(rotation=90)
import matplotlib.ticker as ticker

ax_data = sns.barplot(x='Police force sent NRM referral for Crime Recording', y='Total', data=dfForces1)  # change as per how you are plotting, just for an example
ax_data.yaxis.set_major_locator(ticker.MultipleLocator(40))  # tick frequency of 40; change 40 to the tick frequency you want
ax_data.yaxis.set_major_formatter(ticker.ScalarFormatter())
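For reference, here is a minimal, self-contained sketch of the same tick-locator idea with a made-up stand-in for dfForces1 (the data values below are placeholders, not from the original question):

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for dfForces1
df = pd.DataFrame({
    'Police force sent NRM referral for Crime Recording': ['Force A', 'Force B', 'Force C'],
    'Total': [120, 85, 240],
})

ax = sns.barplot(x='Police force sent NRM referral for Crime Recording', y='Total', data=df)
sns.despine()
plt.xticks(rotation=90)

# Put a y tick every 40 units and show them as plain numbers
ax.yaxis.set_major_locator(ticker.MultipleLocator(40))
ax.yaxis.set_major_formatter(ticker.ScalarFormatter())

plt.tight_layout()
plt.show()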


Plotly Express: Prevent bars from stacking when Y-axis categories have the same name

I'm new to plotly.
Working with:
Ubuntu 20.04
Python 3.8.10
plotly==5.10.0
I'm doing a comparative graph using a horizontal bar chart: different instruments measuring the same chemical compounds. I want to be able to do an at-a-glance, head-to-head comparison of the measured value across all machines.
The problem is: if a compound has the same name across the different instruments, Plotly stacks the data bars into a single bar with segment markers. I very much want each bar to appear individually. Is there a way to prevent Plotly Express from automatically stacking the common bars?
Examples:
CODE
gobardata = []
for blended_name in _df[:20].blended_name:  # should always be unique
    ##################################
    # Unaltered compound names
    compound_names = [str(c) for c in _df[_df.blended_name == blended_name]["injcompound_name"].tolist()]
    # Random number added to end of compound_names to make every string unique
    # compound_names = ["{} ({})".format(str(c), random.randint(0, 1000)) for c in _df[_df.blended_name == blended_name]["injcompound_name"].tolist()]
    ##################################
    deltas = _df[_df.blended_name == blended_name]["delta_rettime"].to_list()
    gobardata.append(
        go.Bar(
            name=blended_name,
            x=deltas,
            y=compound_names,
            orientation='h',
        ))

fig = go.Figure(data=gobardata)
fig.update_traces(width=1)
fig.update_layout(
    bargap=1,
    bargroupgap=.1,
    xaxis_title="Delta Retention Time (Expected - actual)",
    yaxis_title="Instrument name (Injection ID)"
)
fig.show()
What I'm getting (Using actual, but repeated, compound names)
What I want (Adding random text to each compound name to make it unique)
OK. Figured it out. This is probably pretty kludgy, but it consistently works.
Basically...
Use go.FigureWidget...
...with make_subplots having a common x-axis...
...controlling the height of each subplot based on the number of bars.
Every bar in each subplot is added as an individual trace...
...using a dictionary matching each bar name to a common color.
The y-axis labels for each subplot are a list containing the machine name at index [0], followed by blank placeholders ('') so the length of the y-axis list matches the number of bars.
And the legend is manually manipulated so each bar name appears only once.
# Get lists of total data
all_compounds = list(_df.injcompound_name.unique())
blended_names = list(_df.blended_name.unique())

#################################################################
# The heights of each subplot have to be set when fig is created.
# fig has to be created before adding traces.
# So, create a list of dfs, and use these to calculate the subplot heights
dfs = []
subplot_height_multiplier = 20
subplot_heights = []
for blended_name in blended_names:
    df = _df[(_df.blended_name == blended_name)]  # [["delta_rettime", "injcompound_name"]]
    dfs.append(df)
    subplot_heights.append(df.shape[0] * subplot_height_multiplier)
chart_height = sum(subplot_heights)  # Prep for the height of the overall chart.
chart_width = 1000

# Make the figure
fig = make_subplots(
    rows=len(blended_names),
    cols=1,
    row_heights=subplot_heights,
    shared_xaxes=True,
)

# Create the color dictionary to match a color to each compound
_CSS_color = CSS_chart_color_list()
colors = {}
for compound in all_compounds:
    try:
        colors[compound] = _CSS_color.pop()
    except IndexError:
        # Probably ran out of colors, so refill the list and reuse
        _CSS_color = CSS_chart_color_list()
        colors[compound] = _CSS_color.pop()

rowcount = 1
for df in dfs:
    # Add bars individually to each subplot
    for label, labeldf in df.groupby('injcompound_name'):
        fig.add_trace(
            go.Bar(x=labeldf.delta_rettime,
                   y=[labeldf.blended_name.iloc[0]] + [""] * (len(labeldf.delta_rettime) - 1),
                   name=label,
                   marker={'color': colors[label]},
                   orientation='h',
                   ),
            row=rowcount,
            col=1,
        )
    rowcount += 1

# Set figure to FigureWidget
fig = go.FigureWidget(fig)

# Adding individual traces creates redundancies in the legend.
# This removes redundancies from the legend
names = set()
fig.for_each_trace(
    lambda trace:
        trace.update(showlegend=False)
        if (trace.name in names) else names.add(trace.name))

fig.update_layout(
    height=chart_height,
    width=chart_width,
    title_text="∆ of observed RT to expected RT",
    showlegend=True,
)
fig.show()

When using the frame skipping wrapper for OpenAI Gym, what is the purpose of the np.max line?

I'm implementing the following wrapper used commonly in OpenAI's Gym for Frame Skipping. It can be found in dqn/atari_wrappers.py
I'm very confused about the following line:
max_frame = np.max(np.stack(self._obs_buffer), axis=0)
I have added comments throughout the code for the parts I understand and to aid anyone who may be able to help.
np.stack(self._obs_buffer) stacks the two states in _obs_buffer.
np.max returns the maximum along axis 0.
But what I don't understand is why we're doing this or what it's really doing.
import gym
import numpy as np
from collections import deque


class MaxAndSkipEnv(gym.Wrapper):
    """Return only every 4th frame"""

    def __init__(self, env=None, skip=4):
        super(MaxAndSkipEnv, self).__init__(env)
        # Initialise a double-ended queue that can store a maximum of two states
        self._obs_buffer = deque(maxlen=2)
        # _skip = 4
        self._skip = skip

    def _step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            # Take a step
            obs, reward, done, info = self.env.step(action)
            # Append the new state to the double-ended queue buffer
            self._obs_buffer.append(obs)
            # Update the total reward by summing the (reward obtained from the
            # step taken) + (the current total reward)
            total_reward += reward
            # If the game ends, break the for loop
            if done:
                break
        max_frame = np.max(np.stack(self._obs_buffer), axis=0)
        return max_frame, total_reward, done, info
At the end of the for loop, self._obs_buffer holds the last two frames. Those two frames are then max-pooled over (a pixel-wise maximum), producing an observation that contains some temporal information. In Atari emulation some sprites are only drawn on every other frame, so taking the element-wise maximum of consecutive frames keeps objects visible that would otherwise flicker in and out of individual frames.
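As a toy illustration of what that np.max call computes (made-up 2x3 "frames", not real Atari observations):

import numpy as np

# Two consecutive "frames"; imagine a sprite that is only drawn in one of them
frame1 = np.array([[0, 0, 255],
                   [0, 0,   0]])
frame2 = np.array([[0, 0,   0],
                   [0, 128, 0]])

# Stack to shape (2, H, W) and take the element-wise maximum over the frame axis
max_frame = np.max(np.stack([frame1, frame2]), axis=0)
print(max_frame)
# [[  0   0 255]
#  [  0 128   0]]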

Calculating the average of a column in csv per hour

I have a csv file that contains data in the following format.
Layer   relative_time   Ht       BSs   Vge    Temp   Message
57986   2:52:46         0.00m    87    15.4   None   CMSG
20729   0:23:02         45.06m   82    11.6   None   BMSG
20729   0:44:17         45.06m   81    11.6   None   AMSG
I want to read in this csv file and calculate the average BSs for every hour. My csv file is quite large, about 2000 values. However, the values are not evenly distributed across the hours. For example,
I have 237 samples from hour 3 and only 4 samples from hour 6. I should also mention that the BSs can be collected from multiple sources. The value always ranges from 20-100. Because of this it is giving a skewed result. For each hour I am calculating the sum of BSs for that hour divided by the number of samples in that hour.
The primary purpose is to understand how BSs evolves over time.
But what is the common approach to this problem? Is this where people apply normalization? It would be great if someone could explain how to apply normalization in such a situation.
The code I am using for my processing is shown below. I believe the code below is correct.
# This 2x24 matrix will contain the number of values recorded per hour
hours_no_values = [[0 for i in range(24)] for j in range(2)]
# This 2x24 matrix will contain the mean BSs stats per hour
mean_bss_stats = [[0 for i in range(24)] for j in range(2)]

with open(PREFINAL_OUTPUT_FILE) as fin, open(FINAL_OUTPUT_FILE, "w", newline='') as f:
    reader = csv.reader(fin, delimiter=",")
    writer = csv.writer(f)
    header = next(reader)  # <--- Pop header out
    writer.writerow([header[0], header[1], header[2], header[3], header[4], header[5], header[6]])  # <--- Write header
    sortedlist = sorted(reader, key=lambda row: datetime.datetime.strptime(row[1], "%H:%M:%S"), reverse=True)
    print(sortedlist)
    for item in sortedlist:
        rel_time = datetime.datetime.strptime(item[1], "%H:%M:%S")
        if rel_time.hour not in hours_no_values[0]:
            print('item[6] {}'.format(item[6]))
            if 'MAN' in item[6]:
                print('Hour found {}'.format(rel_time.hour))
                hours_no_values[0][rel_time.hour] = rel_time.hour
                mean_bss_stats[0][rel_time.hour] = rel_time.hour
                mean_bss_stats[1][rel_time.hour] += int(item[3])
                hours_no_values[1][rel_time.hour] += 1
            else:
                pass
        else:
            if 'MAN' in item[6]:
                print('Hour Previous {}'.format(rel_time.hour))
                mean_bss_stats[1][rel_time.hour] += int(item[3])
                hours_no_values[1][rel_time.hour] += 1
            else:
                pass

for i in range(0, 24):
    if hours_no_values[1][i] != 0:
        mean_bss_stats[1][i] = mean_bss_stats[1][i] / hours_no_values[1][i]
    else:
        mean_bss_stats[1][i] = 0

pprint.pprint('mean bss stats {} \n hour_no_values {} \n'.format(mean_bss_stats, hours_no_values))
The number of values for each hour is as follows, for hours starting from 0 to 23.
[31, 117, 85, 237, 3, 67, 11, 4, 57, 0, 5, 21, 2, 5, 10, 8, 29, 7, 14, 3, 1, 1, 0, 0]
You could do it with pandas, using groupby and aggregating the appropriate column:
import pandas as pd
import numpy as np
df = pd.read_csv("your_file")
df.groupby('hour')['BSs'].aggregate(np.mean)
If you don't have that column in the initial dataframe you could add it:
df['hour'] = your_hour_data
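For example, assuming the file keeps the relative_time and BSs columns shown in the question (a sketch; adjust the names if yours differ), the hour column can be derived like this:

import pandas as pd

# Column names follow the sample data in the question; adjust to your file
df = pd.read_csv("your_file.csv")

# Parse the "H:MM:SS" strings and keep only the hour component
df['hour'] = pd.to_datetime(df['relative_time'], format='%H:%M:%S').dt.hour

# Mean BSs per hour
print(df.groupby('hour')['BSs'].mean())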
numpy.mean - calculates the mean of the array.
Compute the arithmetic mean along the specified axis.
pandas.groupby
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns
From pandas docs:
By “group by” we are referring to a process involving one or more of the following steps
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
Aggregation: computing a summary statistic (or statistics) about each group.
Some examples:
Compute group sums or means
Compute group sizes / counts
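To make the split-apply-combine idea concrete, here are a couple of those operations on the same hypothetical hour/BSs columns from the sketch above:

import pandas as pd

df = pd.read_csv("your_file.csv")  # same hypothetical file as above
df['hour'] = pd.to_datetime(df['relative_time'], format='%H:%M:%S').dt.hour

print(df.groupby('hour')['BSs'].sum())    # group sums: total BSs per hour
print(df.groupby('hour')['BSs'].size())   # group sizes: number of samples per hour
print(df.groupby('hour')['BSs'].mean())   # group means: average BSs per hour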

Counting non-empty lines and sum of lengths of those lines in Python

I am trying to create a function that takes a filename and returns a 2-tuple with the number of non-empty lines in that file, and the sum of the lengths of all those lines. Here is my current program:
import re

def code_metric(file):
    with open(file, 'r') as f:
        lines = len(list(filter(lambda x: x.strip(), f)))
        num_chars = sum(map(lambda l: len(re.sub(r'\s', '', l)), f))
    return (lines, num_chars)
The result I get if I do:
if __name__ == "__main__":
    print(code_metric('cmtest.py'))
is
(3, 0)
when it should be:
(3,85)
Also, is there a better way of finding the sum of the lengths of the lines using the functionals map, filter, and reduce? I did it for the first part but couldn't figure out the second half. I am kinda new to Python so any help would be great.
Here is the test file called cmtest.py:
import prompt,math
x = prompt.for_int('Enter x')
print(x,'!=',math.factorial(x),sep='')
First line has 18 characters (including white space)
Second line has 29 characters
Third line has 38 characters
[(1, 18), (1, 29), (1, 38)]
The total is 85 characters, including whitespace. I apologize, I misread the problem. The length total for each line should include the whitespace as well.
A fairly simple approach is to build a generator to strip trailing whitespace, then enumerate over that (with a start value of 1) filtering out blank lines, and summing the length of each line in turn, eg:
def code_metric(filename):
    line_count = char_count = 0
    with open(filename) as fin:
        stripped = (line.rstrip() for line in fin)
        for line_count, line in enumerate(filter(None, stripped), 1):
            char_count += len(line)
    return line_count, char_count

print(code_metric('cmtest.py'))
# (3, 85)
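Since the question also asks about map/filter/reduce, here is a sketch (my own illustration, not from the original answers) using functools.reduce. Note that the original code returns (3, 0) because the first filter() pass exhausts the file iterator, so the later map() over f sees nothing; reading the lines once avoids that:

from functools import reduce

def code_metric(filename):
    with open(filename) as f:
        non_empty = list(filter(str.strip, f))
    # Fold a (line_count, char_count) pair over the non-empty lines
    return reduce(lambda acc, line: (acc[0] + 1, acc[1] + len(line.rstrip())),
                  non_empty, (0, 0))

print(code_metric('cmtest.py'))
# (3, 85)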
In order to count lines, maybe this code is cleaner:
with open(file) as f:
    lines = len(f.readlines())
For the second part of your program, if you intend to count only non-empty characters, then you forgot to remove '\t' and '\n'. If that's the case:
with open(file) as f:
    num_chars = len(re.sub(r'\s', '', f.read()))
Some people have advised you to do both things in one loop. That is fine, but if you keep them separated you can turn them into different functions and get more reusability out of them that way. Unless you are handling huge files (or executing this code millions of times), it shouldn't matter in terms of performance.

How do I feed in my own data into PyAlgoTrade?

I'm trying to use PyAlgoTrade's event profiler.
However I don't want to use data from yahoo!finance, I want to use my own, but I can't figure out how to parse in the CSV. It is in this format:
Timestamp      Low   Open    Close        High         BTC_vol      USD_vol       [8]       [9]
2013-11-23 00  800   860     847.666666   886.876543   853.833333   6195.334452   5248330   0
2013-11-24 00  745   847.5   815.01       860          831.255      10785.94131   8680720   0
The complete CSV is here
I want to do something like:
def main(plot):
    instruments = ["AA", "AES", "AIG"]
    feed = yahoofinance.build_feed(instruments, 2008, 2009, ".")
Then replace yahoofinance.build_feed(instruments, 2008, 2009, ".") with my CSV
I tried:
import csv

with open('FinexBTCDaily.csv', 'rb') as csvfile:
    data = csv.reader(csvfile)

def main(plot):
    feed = data
I suggest creating your own RowParser and Feed, which is much easier than it sounds; have a look here: yahoofeed
This also allows you to work with intraday data and cleanup the data if needed, like your timestamp.
Another possibility, of course, would be to parse your file and save it, so it looks like a yahoo feed. In your case, you would have to adapt the columns and the Timestamp.
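A rough sketch of that second option with pandas, reshaping the question's columns into the Date/Open/High/Low/Close/Volume/Adj Close layout a yahoo-style feed expects (the source column names below are guessed from the sample rows above, so adapt them to the real file):

import pandas as pd

# Source column names are assumed from the sample data in the question
cols = ['Date', 'Hour', 'Low', 'Open', 'Close', 'High', 'BTC_vol', 'USD_vol', 'col8', 'col9']
src = pd.read_csv('FinexBTCDaily.csv', names=cols, header=0, sep=None, engine='python')  # sniff the delimiter

# Rearrange into the Date, Open, High, Low, Close, Volume, Adj Close layout
out = pd.DataFrame({
    'Date': src['Date'],
    'Open': src['Open'],
    'High': src['High'],
    'Low': src['Low'],
    'Close': src['Close'],
    'Volume': src['BTC_vol'],
    'Adj Close': src['Close'],  # no adjusted close in the source, so reuse Close
})
out.to_csv('FinexBTCDaily_yahoo.csv', index=False)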
Step A: follow PyAlgoTrade doc on GenericBarFeed class
On this link see the addBarsFromCSV() in CSV section of the BarFeed class in v0.16
On this link see the addBarsFromCSV() in CSV section of the BarFeed class in v0.17
Note
- The CSV file must have the column names in the first row.
- It is ok if the Adj Close column is empty.
- When working with multiple instruments:
--- If all the instruments loaded are in the same timezone, then the timezone parameter may not be specified.
--- If any of the instruments loaded are in different timezones, then the timezone parameter should be set.
addBarsFromCSV( instrument, path, timezone = None )
Loads bars for a given instrument from a CSV formatted file. The instrument gets registered in the bar feed.
Parameters:
(string) instrument – Instrument identifier.
(string) path – The path to the CSV file.
(pytz) timezone – The timezone to use to localize bars. Check pyalgotrade.marketsession.
Next:
A BarFeed loads bars from CSV files that have the following format:
Date Time, Open, High, Low, Close, Volume, Adj Close
2013-01-01 13:59:00,13.51001,13.56,13.51,13.56789,273.88014126,13.51001
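Once the file is in that layout, a minimal sketch of loading it with the generic CSV bar feed (the instrument name and file path below are placeholders; GenericBarFeed / addBarsFromCSV are the documented calls referenced in Step A):

from pyalgotrade import bar
from pyalgotrade.barfeed import csvfeed

# Daily bars; pick a different Frequency for intraday data
feed = csvfeed.GenericBarFeed(bar.Frequency.DAY)
feed.addBarsFromCSV("btc", "FinexBTCDaily_yahoo.csv")
# The feed can then be passed to a strategy or to the event profiler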
Step B: implement a documented CSV-file pre-formatting
Your CSV data will need a bit of sanitising (before it can be used in PyAlgoTrade methods), however it is doable and you can create an easy transformer either by hand or with the powerful numpy.genfromtxt() lambda-based converters facility.
This sample code is intended for illustration purposes, so you can see the power of converters for your own transformations right away, as CSV structures differ.
with open(getCsvFileNAME(...), "r") as aFH:
    numpy.genfromtxt(aFH,
                     skip_header=1,      # Ref. pyalgotrade
                     delimiter=",",
                     # Example rows being converted:
                     # 2011.08.30,12:00,1791.20,1792.60,1787.60,1789.60,835
                     # 2011.08.30,13:00,1789.70,1794.30,1788.70,1792.60,550
                     # 2011.08.30,14:00,1792.70,1816.70,1790.20,1812.10,1222
                     # 2011.08.30,15:00,1812.20,1831.50,1811.90,1824.70,2373
                     # 2011.08.30,16:00,1824.80,1828.10,1813.70,1817.90,2215
                     converters={
                         # column 0: "YYYY.MM.DD" -> matplotlib date number, as float (1.0, ...)
                         0: lambda aString: mPlotDATEs.date2num(datetime.datetime.strptime(aString, "%Y.%m.%d")),
                         # column 1: "HH:MM" -> fraction of a day, as float in [0.0, 1.0)
                         1: lambda aString: (int(aString[0:2]) * 60 + int(aString[3:])) / 60. / 24.,
                     }
                     )
You can use pyalgotrade.barfeed.addBarsFromSequence with list comprehension to feed in data from CSV row by row/bar by bar. Basically you create a bar from each row, pass OHLCV as init parameters and extra columns with additional data in a dictionary. You can try something like this (with all the required imports):
data = pd.DataFrame(index=pd.date_range(start='2021-11-01', end='2021-11-05'),
                    columns=['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
                             'ExtraCol1', 'ExtraCol3', 'ExtraCol4', 'ExtraCol5'],
                    data=np.random.rand(5, 10))

feed = yahoofeed.Feed()
feed.addBarsFromSequence('instrumentID', data.index.map(lambda i:
    BasicBar(
        i,
        data.loc[i, 'Open'],
        data.loc[i, 'High'],
        data.loc[i, 'Low'],
        data.loc[i, 'Close'],
        data.loc[i, 'Volume'],
        data.loc[i, 'Adj Close'],
        Frequency.DAY,
        data.loc[i, 'ExtraCol1':].to_dict())
    ).values)
The input data frame was created with random values to make this example easier to reproduce, but the part where the bars are added to the feed should work the same for data frames from CSVs, provided that valid column names are used.