I am trying to solve a problem which contains bi-LSTM and CRF, while fitting the model, i am facing this issue ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list). Below is the structure of the dataframe.
Columns named "CompanyId" that contains integer. "Name" that contains string. "TableTypeCode" that is a string that is constant and is same as "BS". and final column named "BlockName". I want to train a model using bidirectional lstm and crf . Input being "CompanyId", "Name", and "TableTypeCode" and should predict "BlockName".
import numpy as np
import pandas as pd
df=pd.read_excel("data.xlsx")
from keras.layers import TimeDistributed
from keras.layers import Dense
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed, Bidirectional
from keras.models import Model
!pip install tensorflow-addons==0.16.1
import tensorflow_addons as tfa
X = df[['CompanyId', 'Name', 'TableTypeCode']]
y = df['BlockName']
# Preprocess the data
# One-hot encode the 'CompanyId' and 'TableTypeCode' columns
X = pd.get_dummies(X, columns=['CompanyId', 'TableTypeCode'])
# Tokenize the 'Name' column
X['Name'] = X['Name'].apply(str)
tokenizer = Tokenizer()
X['Name'] = X['Name'].apply(lambda x: x.split())
X['Name'] = tokenizer.texts_to_sequences(X['Name'])
# Encode the target column
encoder = LabelEncoder()
y = encoder.fit_transform(y)
y = to_categorical(y)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
n_classes = df['BlockName'].nunique()
# Define the model architecture
input_ = Input(shape=(X.shape[1],))
embedding = Embedding(input_dim=X.shape[1], output_dim=50)(input_)
lstm = Bidirectional(LSTM(units=100))(embedding)
output = Dense(n_classes, activation='softmax')(lstm)
model = Model(input_, output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(X_train, y_train)
There was no issue till the last line of code. Help me fix this and train my model.
How do I download a trained model as a pickle file using Streamlit Download Button?
You can use io.BytesIO to store the pickled data inside bytes in RAM. Then, give these bytes as data argument in the st.download_button function.
import io
import pickle
import streamlit as st
def create_model():
"""Create an sklearn model so that we will have
something interesting to pickle.
Example taken from here:
https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
"""
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
dataframe = pd.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test_size = 0.33
seed = 7
X_train, _, Y_train, _ = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
# Fit the model on training set
model = LogisticRegression()
model.fit(X_train, Y_train)
return model
def pickle_model(model):
"""Pickle the model inside bytes. In our case, it is the "same" as
storing a file, but in RAM.
"""
f = io.BytesIO()
pickle.dump(model, f)
return f
st.title("My .pkl downloader")
model = create_model()
data = pickle_model(model)
st.download_button("Download .pkl file", data=data, file_name="my-pickled-model.pkl")
I actually want to estimate a normalizing flow with a mixture of gaussians as the base distribution, so I'm sort of stuck with torch. However you can reproduce my error in my code by just estimating a mixture of Gaussian model in torch. My code is below:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as datasets
import torch
from torch import nn
from torch import optim
import torch.distributions as D
num_layers = 8
weights = torch.ones(8,requires_grad=True).to(device)
means = torch.tensor(np.random.randn(8,2),requires_grad=True).to(device)#torch.randn(8,2,requires_grad=True).to(device)
stdevs = torch.tensor(np.abs(np.random.randn(8,2)),requires_grad=True).to(device)
mix = D.Categorical(weights)
comp = D.Independent(D.Normal(means,stdevs), 1)
gmm = D.MixtureSameFamily(mix, comp)
num_iter = 10001#30001
num_iter2 = 200001
loss_max1 = 100
for i in range(num_iter):
x = torch.randn(5000,2)#this can be an arbitrary x samples
loss2 = -gmm.log_prob(x).mean()#-densityflow.log_prob(inputs=x).mean()
optimizer1.zero_grad()
loss2.backward()
optimizer1.step()
The error I get is:
0
8.089411823514835
Traceback (most recent call last):
File "/home/cameron/AnacondaProjects/gmm.py", line 183, in <module>
loss2.backward()
File "/home/cameron/anaconda3/envs/torch/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/cameron/anaconda3/envs/torch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
After as you see the model runs for 1 iteration.
There is ordering problem in your code, since you create Gaussian mixture model outside of training loop, then when calculate the loss the Gaussian mixture model will try to use the initial value of the parameters that you set when you define the model, but the optimizer1.step() already modify that value so even you set loss2.backward(retain_graph=True) there will still be the error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
Solution to this problem is simply create new Gaussian mixture model whenever you update the parameters, example code running as expected:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as datasets
import torch
from torch import nn
from torch import optim
import torch.distributions as D
num_layers = 8
weights = torch.ones(8,requires_grad=True)
means = torch.tensor(np.random.randn(8,2),requires_grad=True)
stdevs = torch.tensor(np.abs(np.random.randn(8,2)),requires_grad=True)
parameters = [weights, means, stdevs]
optimizer1 = optim.SGD(parameters, lr=0.001, momentum=0.9)
num_iter = 10001
for i in range(num_iter):
mix = D.Categorical(weights)
comp = D.Independent(D.Normal(means,stdevs), 1)
gmm = D.MixtureSameFamily(mix, comp)
optimizer1.zero_grad()
x = torch.randn(5000,2)#this can be an arbitrary x samples
loss2 = -gmm.log_prob(x).mean()#-densityflow.log_prob(inputs=x).mean()
loss2.backward()
optimizer1.step()
print(i, loss2)
I have ASCII data and i need to cluster the data using HDBSCAN.
I got the lables but i don't know how to print the output cluster results i.e unique and segregated results from hdbscan.
snippet:
import hdbscan
import numpy as np
datafile = "ascii.txt"
data = np.loadtxt(datafile, dtype = np.uint8)
clusterer = hdbscan.HDBSCAN(min_cluster_size = 20)
clusterer.fit(data)
print (np.unique(clusterer.labels_, return_counts = True))
You can use Pandas to read the file and then print out the cluster labels along with the dataset you have as the input. Try something like:
import pandas as pd
df = pd.read_csv("ascii.txt")
clusterer = hdbscan.HDBSCAN().fit_predict(df.ColumnName)
df_pd = pd.DataFrame({'Datapoints:' df.ColumnName, 'Cluster Labels:' clusterer)
import hdbscan
import numpy as np
datafile = "ascii.txt"
data = np.loadtxt(datafile, dtype = np.uint8)
Modified_data=pd.DataFrame(data)
clusterer = hdbscan.HDBSCAN(min_cluster_size = 20)
clusterer.fit(Modified_data)
Modified_data['Clusters']=clusterer.labels_
Now Modified_data returns a pandas dataframe where you have a column named "Clusters" and cluster corresponding to each instance will be specified in the Clusters column.
You can manipulate this dataframe as per your requirement
I would like create a daily candlestick plot from data i downloaded from yahoo using pandas. I'm having trouble figuring out how to use the candlestick matplotlib function in this context.
Here is the code:
#The following example, downloads stock data from Yahoo and plots it.
from pandas.io.data import get_data_yahoo
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots, draw
from matplotlib.finance import candlestick
symbol = "GOOG"
data = get_data_yahoo(symbol, start = '2013-9-01', end = '2013-10-23')[['Open','Close','High','Low','Volume']]
ax = subplots()
candlestick(ax,data['Open'],data['High'],data['Low'],data['Close'])
Thanks
Andrew.
Using bokeh:
import io
from math import pi
import pandas as pd
from bokeh.plotting import figure, show, output_file
df = pd.read_csv(
io.BytesIO(
b'''Date,Open,High,Low,Close
2016-06-01,69.6,70.2,69.44,69.76
2016-06-02,70.0,70.15,69.45,69.54
2016-06-03,69.51,70.48,68.62,68.91
2016-06-04,69.51,70.48,68.62,68.91
2016-06-05,69.51,70.48,68.62,68.91
2016-06-06,70.49,71.44,69.84,70.11
2016-06-07,70.11,70.11,68.0,68.35'''
)
)
df["Date"] = pd.to_datetime(df["Date"])
inc = df.Close > df.Open
dec = df.Open > df.Close
w = 12*60*60*1000
TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=1000, title
= "Candlestick")
p.xaxis.major_label_orientation = pi/4
p.grid.grid_line_alpha=0.3
p.segment(df.Date, df.High, df.Date, df.Low, color="black")
p.vbar(df.Date[inc], w, df.Open[inc], df.Close[inc], fill_color="#D5E1DD", line_color="black")
p.vbar(df.Date[dec], w, df.Open[dec], df.Close[dec], fill_color="#F2583E", line_color="black")
output_file("candlestick.html", title="candlestick.py example")
show(p)
Code above forked from here:
http://docs.bokeh.org/en/latest/docs/gallery/candlestick.html
I have no reputation to comment #randall-goodwin answer, but for pandas 0.16.2 line:
# convert the datetime64 column in the dataframe to 'float days'
data.Date = mdates.date2num(data.Date)
must be:
data.Date = mdates.date2num(data.Date.dt.to_pydatetime())
because matplotlib does not support the numpy datetime64 dtype
I stumbled across a great pastebin entry: http://pastebin.com/ne7Fjdiq that does this well. I too was having trouble getting the calling syntax right. It usually revolves around transforming your data in simple ways to get the function to work right. My issue was with the datetime. There must be something in my format data. Once I replaced the Date series with range(maxdata) then it worked.
data = pandas.read_csv('data.csv', parse_dates={'Timestamp': ['Date', 'Time']}, index_col='Timestamp')
ticks = data.ix[:, ['Price', 'Volume']]
bars = ticks.Price.resample('1min', how='ohlc')
barsa = bars.fillna(method='ffill')
fig = plt.figure()
fig.subplots_adjust(bottom=0.1)
ax = fig.add_subplot(111)
plt.title("Candlestick chart")
volume = ticks.Volume.resample('1min', how='sum')
value = ticks.prod(axis=1).resample('1min', how='sum')
vwap = value / volume
Date = range(len(barsa))
#Date = matplotlib.dates.date2num(barsa.index)#
DOCHLV = zip(Date , barsa.open, barsa.close, barsa.high, barsa.low, volume)
matplotlib.finance.candlestick(ax, DOCHLV, width=0.6, colorup='g', colordown='r', alpha=1.0)
plt.show()
Here is the solution:
from pandas.io.data import get_data_yahoo
import matplotlib.pyplot as plt
from matplotlib import dates as mdates
from matplotlib import ticker as mticker
from matplotlib.finance import candlestick_ohlc
import datetime as dt
symbol = "GOOG"
data = get_data_yahoo(symbol, start = '2014-9-01', end = '2015-10-23')
data.reset_index(inplace=True)
data['Date']=mdates.date2num(data['Date'].astype(dt.date))
fig = plt.figure()
ax1 = plt.subplot2grid((1,1),(0,0))
plt.ylabel('Price')
ax1.xaxis.set_major_locator(mticker.MaxNLocator(6))
ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
candlestick_ohlc(ax1,data.values,width=0.2)
Found this question when I too was looking how to use candlestick with a pandas dataframe returned from one of the DataReader services like get_data_yahoo. I eventually figured it out. One of the keys was this other question, answered by Wes McKinney and RJRyV. Here is that link:
Pandas convert dataframe to array of tuples
The key was to read the candlestick.py function definition to determine how it expected to receive the data. The date needed to be converted first, then the entire dataframe needed to be converted to an array of tuples.
Here is the final code that worked for me. Maybe there is some other Candlestick chart out there somewhere that works directly on a pandas dataframe returned from one of the stock quote services. That would be very nice.
# Imports
from pandas.io.data import get_data_yahoo
from datetime import datetime, timedelta
import matplotlib.dates as mdates
from matplotlib.pyplot import subplots, draw
from matplotlib.finance import candlestick
import matplotlib.pyplot as plt
# get the data on a symbol (gets last 1 year)
symbol = "TSLA"
data = get_data_yahoo(symbol, datetime.now() - timedelta(days=365))
# drop the date index from the dateframe
data.reset_index(inplace = True)
# convert the datetime64 column in the dataframe to 'float days'
data.Date = mdates.date2num(data.Date)
# make an array of tuples in the specific order needed
dataAr = [tuple(x) for x in data[['Date', 'Open', 'Close', 'High', 'Low']].to_records(index=False)]
# construct and show the plot
fig = plt.figure()
ax1 = plt.subplot(1,1,1)
candlestick(ax1, dataAr)
plt.show()