How to get element-wise intersection from two Series in python pandas

My question is about python pandas.
I have two Series whose elements are strings, and for each row I want the characters the two strings have in common.
To simplify, I've combined the two Series and the expected result in one DataFrame:
import pandas as pd
import numpy as np
my_df = pd.DataFrame([['ab', 'bz', 'b'], ['cd', 'ct', 'c'], ['ef', 'ka', np.nan]], columns=['sr_1', 'sr_2', 'intersection'])
Any ideas for this?

This is what you can do:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'sr1': ['ab', 'cd', 'ef'],
                    'sr2': ['bz', 'ct', 'ka']})
df1['intersection'] = df1.apply(lambda x: set(x.sr1) & set(x.sr2), axis=1)
df1['intersection'] = df1.intersection.apply(lambda x: list(x)[0] if len(x)>0 else np.nan)
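The resulting intersection column contains 'b', 'c' and NaN, which matches the expected column in the question.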
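If a row can share more than one character, taking `list(x)[0]` keeps an arbitrary one. Here is a minimal sketch of a variant that keeps all shared characters, assuming that is what you want; `common_chars` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

sr_1 = pd.Series(["ab", "cd", "ef"])
sr_2 = pd.Series(["bz", "ct", "ka"])

def common_chars(a, b):
    """Return the characters shared by a and b as a sorted string, or NaN if none."""
    shared = sorted(set(a) & set(b))
    return "".join(shared) if shared else np.nan

# Pair the two Series row by row and keep the original index.
intersection = pd.Series(
    [common_chars(a, b) for a, b in zip(sr_1, sr_2)],
    index=sr_1.index,
)
print(intersection.tolist())
```

Sorting the shared characters makes the result deterministic, since set ordering is not guaranteed.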

Related

Regarding Implementation of Gradient Descent for Polynomial Regression

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
from numpy.linalg import inv
import seaborn as sns
url = r'C:\Users\pchan\kc_house_train_data.csv'
df = pd.read_csv(url,index_col=0)
features_1 = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
x=df.filter(features_1)
x = np.c_[np.ones((x.shape[0], 1)), x]
x=pd.DataFrame(x)
y=df.filter(['price'])
y=y.reset_index(drop=True)
x_new=x.T
y.rename(columns = {'price':0}, inplace = True)
w=pd.DataFrame([0]*(x_new.shape[0]))
cost=[]
i=0
a=0.00001
while(i<50):
    temp=x.T@(y-x@w)
    w=w+(a*temp)
    i+=1
print(w)
from sklearn.linear_model import LinearRegression
reg=LinearRegression().fit(x,y)
res=reg.coef_
print(res)
w_closed=np.linalg.inv(x.T@x) @ x.T @ y
print(w_closed)
The closed-form solution and LinearRegression from sklearn both give the correct weights,
but the gradient descent approach (in matrix notation) does not.
What's wrong with the gradient descent approach here?
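One possible cause, which is an assumption since the data file is not available here, is that the raw features sit on very different scales and the gradient is summed rather than averaged, so a fixed step size can overshoot and diverge. Below is a minimal sketch on synthetic data (the columns only stand in for something like sqft_living and bedrooms) that standardises the features, averages the gradient, and compares the result with a least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for two features on very different scales (hypothetical data).
raw = np.column_stack([
    rng.normal(2000.0, 800.0, n),
    rng.normal(3.0, 1.0, n),
])
y = 50_000.0 + 150.0 * raw[:, 0] + 10_000.0 * raw[:, 1] + rng.normal(0.0, 1_000.0, n)

# Standardise the features so that a single learning rate suits every column.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
X = np.c_[np.ones(n), scaled]

w = np.zeros(X.shape[1])
learning_rate = 0.1
for _ in range(2000):
    # Gradient of the halved mean squared error; dividing by n keeps the step size stable.
    gradient = X.T @ (X @ w - y) / n
    w -= learning_rate * gradient

# Reference solution from least squares on the same scaled design matrix.
w_closed = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w, w_closed))
```

Note that the weights here refer to the standardised features, so they are not directly comparable with weights fitted on the raw columns.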

Facing the issue while fitting my model (bi-lstm + crf). ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list)

I am trying to solve a problem with a bi-LSTM and CRF model. While fitting the model, I get this error: ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list). Below is the structure of the dataframe.
It has four columns: "CompanyId", which contains integers; "Name", which contains strings; "TableTypeCode", a string that is constant and always equal to "BS"; and a final column named "BlockName". I want to train a model using a bidirectional LSTM and CRF. The inputs are "CompanyId", "Name", and "TableTypeCode", and the model should predict "BlockName".
import numpy as np
import pandas as pd
df=pd.read_excel("data.xlsx")
from keras.layers import TimeDistributed
from keras.layers import Dense
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed, Bidirectional
from keras.models import Model
!pip install tensorflow-addons==0.16.1
import tensorflow_addons as tfa
X = df[['CompanyId', 'Name', 'TableTypeCode']]
y = df['BlockName']
# Preprocess the data
# One-hot encode the 'CompanyId' and 'TableTypeCode' columns
X = pd.get_dummies(X, columns=['CompanyId', 'TableTypeCode'])
# Tokenize the 'Name' column
X['Name'] = X['Name'].apply(str)
tokenizer = Tokenizer()
X['Name'] = X['Name'].apply(lambda x: x.split())
X['Name'] = tokenizer.texts_to_sequences(X['Name'])
# Encode the target column
encoder = LabelEncoder()
y = encoder.fit_transform(y)
y = to_categorical(y)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
n_classes = df['BlockName'].nunique()
# Define the model architecture
input_ = Input(shape=(X.shape[1],))
embedding = Embedding(input_dim=X.shape[1], output_dim=50)(input_)
lstm = Bidirectional(LSTM(units=100))(embedding)
output = Dense(n_classes, activation='softmax')(lstm)
model = Model(input_, output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(X_train, y_train)
There was no issue until the last line of code. Please help me fix this and train my model.
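This error usually shows up when a column holds Python lists (here the tokenized 'Name' column), so the array handed to the model has dtype object instead of numbers. The sketch below reproduces that situation with plain NumPy and pads the lists to a fixed length; `pad_to_length` is a hypothetical helper written for illustration. Keras ships its own `pad_sequences` utility for the same job, which pads on the left by default, whereas this helper pads on the right.

```python
import numpy as np

# Token-id lists of different lengths, like a tokenized text column.
sequences = [[4, 7], [1], [9, 2, 5]]

# A ragged collection can only be stored as an object array of lists.
ragged = np.array(sequences, dtype=object)
print(ragged.dtype, ragged.shape)

def pad_to_length(seqs, length, fill=0):
    """Right-pad (or truncate) each sequence to `length`, returning a 2-D integer array."""
    padded = np.full((len(seqs), length), fill, dtype=np.int64)
    for row, seq in enumerate(seqs):
        trimmed = seq[:length]
        padded[row, :len(trimmed)] = trimmed
    return padded

# The padded result is a regular numeric array.
dense = pad_to_length(sequences, length=4)
print(dense.dtype, dense.shape)
```

In the question's code the padded 'Name' array would also need to be fed to the model separately from (or concatenated with) the one-hot columns, and the Tokenizer would need `fit_on_texts` called before `texts_to_sequences`.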

How to reduce Geojson size for repeated geometries (like timestamped data) in Pandas

I have a geopandas data frame that contains the following values and geometries:
Date, Value, Region Name, Geometry
2022-01-01, 10, ABC, Point((194 34),(121,23))
2022-02-01, 12, ABC, Point((194 34),(121,23))
2022-02-01, 13, DEF, Point((195 35),(123,24))
Roughly equivalent Python code:
import pandas as pd
import geopandas
import matplotlib.pyplot as plt
from shapely.geometry import Point
d = pd.DataFrame({'RegionName': ['ABC', 'ABC','DEF'],'Date': ['2021-01-01', '2021-02-01','2021-01-01'], 'Values': [10,11,12], 'Latitude': [-34.58, -34.58, -33.45], 'Longitude': [-58.66, -58.66, -70.66]})
# build the geometry and set the CRS in a single call
gdf = geopandas.GeoDataFrame(d, geometry=geopandas.points_from_xy(d.Longitude, d.Latitude), crs="EPSG:4326")
How can I save this data to a JSON/GeoJSON file while reducing the file size, by attaching the non-repeating data (e.g. date and value) to the repeated value (e.g. geometry)?
Something like this:
[
---Region name:
-----ABC
-----Date:
--------2022-01-01
--------2022-02-01
-----Value:
--------10
--------12
-----Geometry
--------Polygon((194 34),(121,23))
---Region name:
-----DEF
-----Date:
--------2022-02-01
-----Value:
--------13
-----Geometry
--------Polygon((194 34),(121,23))
]
Requirement:
This file needs to be consumed by mapbox/leaflet/or any other similar tool
I was able to solve this. First, take the distinct rows of the repeated columns (call this A, e.g. geometry). Then build lists of the non-repeated ones (call this B, e.g. date and value), merge B with A, and finally do the JSON conversion.
Python code:
import pandas as pd
import geopandas
import matplotlib.pyplot as plt
from shapely.geometry import Point
d = pd.DataFrame({'RegionName': ['ABC', 'ABC','DEF'],'Date': ['2021-01-01', '2021-02-01','2021-01-01'], 'Values': [10,11,12], 'Latitude': [-34.58, -34.58, -33.45], 'Longitude': [-58.66, -58.66, -70.66]})
# build the geometry and set the CRS in a single call
gdf = geopandas.GeoDataFrame(d, geometry=geopandas.points_from_xy(d.Longitude, d.Latitude), crs="EPSG:4326")
# create a unique list of static data (one row per region)
df_dis_test = pd.DataFrame({'RegionName': ['ABC', 'DEF'],'Latitude': [-34.58, -33.45], 'Longitude': [-58.66, -70.66]})
gdfdf_dis_test = geopandas.GeoDataFrame(df_dis_test, geometry=geopandas.points_from_xy(df_dis_test['Longitude'], df_dis_test['Latitude']), crs="EPSG:4326")
# collect the dates per region, then attach the static data
dgrp = d.groupby(['RegionName']).agg({'Date': lambda x: ','.join(x)})
result = dgrp.merge(gdfdf_dis_test, how="inner", on="RegionName")
# collect the values per region and attach them as well
dgrpval = d.groupby(['RegionName']).agg({'Values': lambda x: list(x)})
result2 = result.merge(dgrpval, how="inner", on="RegionName")
result2 = result2.rename(columns={'geometry_x':'geometry'})
result2Gpd = geopandas.GeoDataFrame(result2, crs="EPSG:4326")#.drop(['geometry_y'],axis=1)
with open('Result2.geojson', 'w') as f:
    f.write(result2Gpd.to_json(sort_keys=True, default=str))
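The same grouping idea can be sketched without geopandas, building the GeoJSON FeatureCollection by hand with the standard json module. This is only an illustration under the assumption that every region has a single point; the property names simply reuse the column names from the example data.

```python
import json
import pandas as pd

d = pd.DataFrame({
    "RegionName": ["ABC", "ABC", "DEF"],
    "Date": ["2021-01-01", "2021-02-01", "2021-01-01"],
    "Values": [10, 11, 12],
    "Latitude": [-34.58, -34.58, -33.45],
    "Longitude": [-58.66, -58.66, -70.66],
})

features = []
# One feature per distinct (region, point); the changing columns become lists.
for (name, lat, lon), group in d.groupby(["RegionName", "Latitude", "Longitude"]):
    features.append({
        "type": "Feature",
        # GeoJSON stores coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [float(lon), float(lat)]},
        "properties": {
            "RegionName": name,
            "Date": group["Date"].tolist(),
            "Values": [int(v) for v in group["Values"]],
        },
    })

collection = {"type": "FeatureCollection", "features": features}
geojson_text = json.dumps(collection)
print(len(features))
```

Each geometry is written once, so the file grows with the number of regions plus the number of observations rather than repeating the geometry for every row.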

Expected an indented block Exception

After running this code, I get this exception, and I couldn't find where to fix it:
import networkx as nx
from networkx.algorithms import bipartite
import numpy as np
from pandas import DataFrame, concat
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import ast
import csv
import sys
def plot_degree_dist(G):
in_degrees = G.in_degree()
in_degrees=dict(in_degrees)
in_values = sorted(set(in_degrees.values()))
in_hist = [in_degrees.values().count(x) for x in in_values]
plt.figure()
plt.grid(True)
plt.loglog(in_values, in_hist, 'ro-')
plt.plot(out_values, out_hist, 'bv-')
plt.legend(['In-degree', 'Out-degree'])
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.title('network of places in Cambridge')
#plt.xlim([0, 2*10**2])
I expect to get a proper graph, but all I get is this error:
File "<ipython-input-32-f89b896484d7>", line 2
in_degrees = G.in_degree()
^
IndentationError: expected an indented block
Python relies on proper indentation to identify function blocks. This code should work:
import networkx as nx
from networkx.algorithms import bipartite
import numpy as np
from pandas import DataFrame, concat
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import ast
import csv
import sys
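def plot_degree_dist(G):
    # Everything that belongs to the function is indented one level.
    in_degrees = G.in_degree()
    in_degrees = dict(in_degrees)
    in_values = sorted(set(in_degrees.values()))
    # dict.values() has no count() in Python 3, so convert it to a list first.
    in_hist = [list(in_degrees.values()).count(x) for x in in_values]
    # Out-degree histogram, built the same way (assumes G is a directed graph).
    out_degrees = dict(G.out_degree())
    out_values = sorted(set(out_degrees.values()))
    out_hist = [list(out_degrees.values()).count(x) for x in out_values]
    plt.figure()
    plt.grid(True)
    plt.loglog(in_values, in_hist, 'ro-')
    plt.plot(out_values, out_hist, 'bv-')
    plt.legend(['In-degree', 'Out-degree'])
    plt.xlabel('Degree')
    plt.ylabel('Number of nodes')
    plt.title('network of places in Cambridge')
    #plt.xlim([0, 2*10**2])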
Basically, just indent the function body by 2 or 4 spaces, as your style requires.
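As a small illustration of the same rule, here is a sketch of the histogram step on its own, with the function body and the docstring indented by four spaces. It uses collections.Counter instead of repeated count() calls; `degree_counts` and the sample degrees are hypothetical names and data, not part of networkx.

```python
from collections import Counter

def degree_counts(degrees):
    """Return (sorted distinct degrees, number of nodes with each degree)."""
    counts = Counter(degrees.values())
    values = sorted(counts)
    return values, [counts[v] for v in values]

# Hypothetical in-degrees keyed by node name.
in_degrees = {"a": 1, "b": 2, "c": 1, "d": 0}
in_values, in_hist = degree_counts(in_degrees)
print(in_values, in_hist)
```

The two returned lists can be passed straight to plt.loglog or plt.plot as in the function above.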

Plotting candlestick data from a dataframe in Python

I would like to create a daily candlestick plot from data I downloaded from Yahoo using pandas. I'm having trouble figuring out how to use the candlestick matplotlib function in this context.
Here is the code:
#The following example, downloads stock data from Yahoo and plots it.
from pandas.io.data import get_data_yahoo
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots, draw
from matplotlib.finance import candlestick
symbol = "GOOG"
data = get_data_yahoo(symbol, start = '2013-9-01', end = '2013-10-23')[['Open','Close','High','Low','Volume']]
ax = subplots()
candlestick(ax,data['Open'],data['High'],data['Low'],data['Close'])
Thanks
Andrew.
Using bokeh:
import io
from math import pi
import pandas as pd
from bokeh.plotting import figure, show, output_file
df = pd.read_csv(
    io.BytesIO(
        b'''Date,Open,High,Low,Close
2016-06-01,69.6,70.2,69.44,69.76
2016-06-02,70.0,70.15,69.45,69.54
2016-06-03,69.51,70.48,68.62,68.91
2016-06-04,69.51,70.48,68.62,68.91
2016-06-05,69.51,70.48,68.62,68.91
2016-06-06,70.49,71.44,69.84,70.11
2016-06-07,70.11,70.11,68.0,68.35'''
    )
)
df["Date"] = pd.to_datetime(df["Date"])
inc = df.Close > df.Open
dec = df.Open > df.Close
w = 12*60*60*1000
TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=1000,
           title="Candlestick")
p.xaxis.major_label_orientation = pi/4
p.grid.grid_line_alpha=0.3
p.segment(df.Date, df.High, df.Date, df.Low, color="black")
p.vbar(df.Date[inc], w, df.Open[inc], df.Close[inc], fill_color="#D5E1DD", line_color="black")
p.vbar(df.Date[dec], w, df.Open[dec], df.Close[dec], fill_color="#F2583E", line_color="black")
output_file("candlestick.html", title="candlestick.py example")
show(p)
Code above forked from here:
http://docs.bokeh.org/en/latest/docs/gallery/candlestick.html
I don't have enough reputation to comment on @randall-goodwin's answer, but for pandas 0.16.2 the line:
# convert the datetime64 column in the dataframe to 'float days'
data.Date = mdates.date2num(data.Date)
must be:
data.Date = mdates.date2num(data.Date.dt.to_pydatetime())
because matplotlib does not support the numpy datetime64 dtype
I stumbled across a great pastebin entry, http://pastebin.com/ne7Fjdiq, that does this well. I too was having trouble getting the calling syntax right. It usually comes down to transforming your data in simple ways so that the function accepts it. My issue was with the datetime: there must be something wrong with the format of my date data. Once I replaced the Date series with range(maxdata), it worked.
data = pandas.read_csv('data.csv', parse_dates={'Timestamp': ['Date', 'Time']}, index_col='Timestamp')
ticks = data.ix[:, ['Price', 'Volume']]
bars = ticks.Price.resample('1min', how='ohlc')
barsa = bars.fillna(method='ffill')
fig = plt.figure()
fig.subplots_adjust(bottom=0.1)
ax = fig.add_subplot(111)
plt.title("Candlestick chart")
volume = ticks.Volume.resample('1min', how='sum')
value = ticks.prod(axis=1).resample('1min', how='sum')
vwap = value / volume
Date = range(len(barsa))
#Date = matplotlib.dates.date2num(barsa.index)#
DOCHLV = zip(Date , barsa.open, barsa.close, barsa.high, barsa.low, volume)
matplotlib.finance.candlestick(ax, DOCHLV, width=0.6, colorup='g', colordown='r', alpha=1.0)
plt.show()
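The `how=` argument of resample, `.ix` and `fillna(method=...)` used above are no longer available in recent pandas releases. Here is a minimal sketch of the same bar-building step written with method chaining, on a few made-up ticks (the prices and volumes are invented for illustration).

```python
import pandas as pd

# Six hypothetical ticks, 20 seconds apart, spanning two one-minute bars.
index = pd.date_range("2024-01-01 09:30:00", periods=6, freq=pd.Timedelta(seconds=20))
ticks = pd.DataFrame(
    {"Price": [10.0, 10.5, 9.8, 10.2, 10.4, 10.1], "Volume": [100, 50, 75, 20, 30, 60]},
    index=index,
)

# Open/high/low/close per minute; forward-fill minutes that had no ticks.
bars = ticks["Price"].resample("1min").ohlc().ffill()
# Traded volume and volume-weighted average price per minute.
volume = ticks["Volume"].resample("1min").sum()
value = ticks.prod(axis=1).resample("1min").sum()
vwap = value / volume
print(bars)
```

The resulting bars frame has the columns open, high, low and close, so it can be zipped into tuples in whatever order the plotting function expects.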
Here is the solution:
from pandas.io.data import get_data_yahoo
import matplotlib.pyplot as plt
from matplotlib import dates as mdates
from matplotlib import ticker as mticker
from matplotlib.finance import candlestick_ohlc
import datetime as dt
symbol = "GOOG"
data = get_data_yahoo(symbol, start = '2014-9-01', end = '2015-10-23')
data.reset_index(inplace=True)
data['Date']=mdates.date2num(data['Date'].astype(dt.date))
fig = plt.figure()
ax1 = plt.subplot2grid((1,1),(0,0))
plt.ylabel('Price')
ax1.xaxis.set_major_locator(mticker.MaxNLocator(6))
ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
candlestick_ohlc(ax1,data.values,width=0.2)
I found this question when I too was looking for how to use candlestick with a pandas dataframe returned from one of the DataReader services like get_data_yahoo. I eventually figured it out. One of the keys was this other question, answered by Wes McKinney and RJRyV. Here is that link:
Pandas convert dataframe to array of tuples
The key was to read the candlestick.py function definition to determine how it expected to receive the data. The date needed to be converted first, then the entire dataframe needed to be converted to an array of tuples.
Here is the final code that worked for me. Maybe there is some other candlestick chart function out there that works directly on a pandas dataframe returned from one of the stock quote services. That would be very nice.
# Imports
from pandas.io.data import get_data_yahoo
from datetime import datetime, timedelta
import matplotlib.dates as mdates
from matplotlib.pyplot import subplots, draw
from matplotlib.finance import candlestick
import matplotlib.pyplot as plt
# get the data on a symbol (gets last 1 year)
symbol = "TSLA"
data = get_data_yahoo(symbol, datetime.now() - timedelta(days=365))
# drop the date index from the dataframe
data.reset_index(inplace = True)
# convert the datetime64 column in the dataframe to 'float days'
data.Date = mdates.date2num(data.Date)
# make an array of tuples in the specific order needed
dataAr = [tuple(x) for x in data[['Date', 'Open', 'Close', 'High', 'Low']].to_records(index=False)]
# construct and show the plot
fig = plt.figure()
ax1 = plt.subplot(1,1,1)
candlestick(ax1, dataAr)
plt.show()
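The two conversion steps described above (dates to matplotlib's float day numbers, then rows to tuples) can be tried on their own without downloading anything. This is a sketch on a small hand-made frame; it uses the (date, open, high, low, close) order, which is an assumption you should check against whichever candlestick function you call, since the older candlestick shown above takes open, close, high, low.

```python
import matplotlib.dates as mdates
import pandas as pd

# A small hypothetical frame shaped like the downloaded quotes after reset_index().
data = pd.DataFrame({
    "Date": pd.to_datetime(["2016-06-01", "2016-06-02", "2016-06-03"]),
    "Open": [69.6, 70.0, 69.51],
    "High": [70.2, 70.15, 70.48],
    "Low": [69.44, 69.45, 68.62],
    "Close": [69.76, 69.54, 68.91],
})

# Convert each timestamp to a Python datetime, then to matplotlib's float day numbers.
date_numbers = mdates.date2num([ts.to_pydatetime() for ts in data["Date"]])

# Build one tuple per row in the order the plotting function is assumed to expect.
quotes = [
    (float(d), o, h, l, c)
    for d, o, h, l, c in zip(date_numbers, data["Open"], data["High"], data["Low"], data["Close"])
]
print(quotes[0])
```

Consecutive daily rows differ by exactly 1.0 in the first tuple element, which is a quick way to check that the date conversion worked.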