Understanding Mutual Information on Titanic Dataset

I was reading about Mutual Information on the Kaggle courses: https://www.kaggle.com/code/ryanholbrook/mutual-information
After that I tried it out on the Titanic Competition dataset and encountered some weird behaviour.
I will post the code further below.
I ranked all the features with mutual information and received the following output:
PassengerId 0.665912
Name 0.665912
Ticket 0.572496
Cabin 0.165236
Sex 0.150870
Fare 0.141621
Age 0.066269
Pclass 0.058107
SibSp 0.023197
Embarked 0.016668
Parch 0.016366
According to the documentation
Mutual information (MI) [1] between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
From my point of view, at least PassengerId should be independent, as should Name, because I used factorize() on all object columns, which leaves me with 100% unique values for both the Id and the Names. There are 891 rows in total in the training dataset.
# number of unique values for top 2 MI
print(X_mi["PassengerId"].nunique())
print(X_mi["Name"].nunique())
891
891
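As an extra check, here is a small sketch (not from the Kaggle notebook) comparing what the discrete estimator reports for a purely synthetic all-unique column with the entropy of the target. As far as I understand, mutual_info_classif falls back to sklearn.metrics.mutual_info_score for discrete-discrete pairs; y_mi is defined in the code further below.
# Sketch: what does the discrete estimator give for a column that is unique
# per row, compared with the entropy of the target?
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

y = y_mi.to_numpy()                      # 0/1 Survived labels (defined below)
fake_id = np.arange(len(y))              # stand-in for PassengerId / factorized Name

print(mutual_info_score(fake_id, y))     # MI in nats for the all-unique column
print(entropy(np.bincount(y) / len(y)))  # entropy of the target, also in nats
# Both print roughly the same value (~0.666 on the Titanic training set).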
My question is: how does this happen? And why do PassengerId and Name, with all unique values, score even higher than, let's say, Age or Sex?
I followed the Kaggle course at the link above. The only difference should be that I used
from sklearn.feature_selection import mutual_info_classif
instead of
from sklearn.feature_selection import mutual_info_regression
because my target is discrete.
Here is the relevant code:
import pandas as pd

X_train_full = pd.read_csv("/kaggle/input/titanic/train.csv")
X_test_full = pd.read_csv("/kaggle/input/titanic/test.csv")
X_mi = X_train_full.copy()
y_mi = X_mi.pop("Survived")
# Label encoding for categoricals
for colname in X_mi.select_dtypes("object"):
    X_mi[colname], _ = X_mi[colname].factorize()
# fill all NaN values of age with mean and cast type to int
X_mi["Age"] = X_mi["Age"].transform(lambda age: age.fillna(age.mean()))
X_mi["Age"] = X_mi["Age"].transform(lambda age: age.astype("int"))
# cast fare type to int
X_mi["Fare"] = X_mi["Fare"].transform(lambda fare: fare.astype("int"))
# All discrete features should now have integer dtypes (double-check this before using MI!)
discrete_features = X_mi.dtypes == int
from sklearn.feature_selection import mutual_info_classif
def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores
mi_scores = make_mi_scores(X_mi, y_mi, discrete_features)
print(mi_scores) # show features with their MI scores
Any explanation or suggestions as to what I might have done wrong?
From a data-analytics point of view I might have made some mistakes, but how does a completely unrelated feature like PassengerId score so high, and higher than all the others?
Thank you :)

Related

Dropping duplicates in a pyarrow table?

Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source, so it is possible to have duplicates for a single ID, one for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times).
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

dataset = ds.dataset(history, filesystem=filesystem, partitioning=partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)

# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset with the legacy writer so I can use partition_filename_cb,
# giving one file per date_id (our visualization tool connects to these files directly), e.g.
# container/dataset/date_id=20210127/20210127.parquet
pq.write_to_dataset(table, final, filesystem=filesystem, partition_cols=["date_id"],
                    use_legacy_dataset=True,
                    partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionalities, though this one isn't here yet. My approach now would be:
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full((len(table)), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
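For illustration, a quick usage sketch on a toy table (the function keeps the first row seen for each value):
# Toy example: deduplicate on the "ID" column using the function above
import pyarrow as pa

table = pa.table({"ID": [1, 1, 2, 3, 3, 3], "v": ["a", "b", "c", "d", "e", "f"]})
deduped = drop_duplicates(table, "ID")
print(deduped.to_pydict())
# {'ID': [1, 2, 3], 'v': ['a', 'c', 'd']}  (first occurrence of each ID kept)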
//end edit
I saw your question because I had a similar one, and I solved it for my own work. (Due to IP issues I can't post the whole code, but I'll try to answer as well as I can; I've never done this before.)
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np
array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
    pass  # do stuff with each unique value and its count
I used value_counts to find the unique values and how many of them there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more than 1 I selected either the first or last one by using numpy boolean arrays as a mask again.
After iterating, it was as simple as pa.concat_tables(tables).
Warning: this is a slow process. If you need something quick and dirty, try the "unique" option (also in the same link I provided).
Edit/extra: you can make it a bit faster and less memory-intensive by keeping a single numpy boolean mask while iterating over the dictionary, and then at the end returning table.filter(mask=boolean_mask) (sketched below).
I don't know how to calculate the speed though...
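A rough sketch of that mask-accumulating variant (the column name and the keep-first choice are assumptions; it reuses value_counts from above):
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates_keep_first(table: pa.Table, column_name: str) -> pa.Table:
    values = np.array(table[column_name])       # key column as a numpy array
    keep = np.full(len(table), False)           # one boolean mask, filled in place
    for dct in pc.value_counts(table[column_name]).to_pylist():
        idx = np.nonzero(values == dct['values'])[0]  # rows carrying this value
        keep[idx[0]] = True                     # keep only the first occurrence
    return table.filter(mask=keep)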
Edit 2:
(Sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    column_array = table.column(col_name)
    mask_x = np.full((table.shape[0]), False)
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
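If the goal is the latest record per ID (keep the last occurrence rather than the first), a small variant of the same idea (a sketch; it assumes the table is already sorted by the update timestamp):
import numpy as np
import pyarrow as pa

def drop_duplicates_keep_last(table: pa.Table, col_name: str) -> pa.Table:
    column_array = np.array(table.column(col_name))
    # np.unique on the reversed array returns first occurrences in reverse order,
    # i.e. the last occurrence of each value in the original row order
    _, rev_indices = np.unique(column_array[::-1], return_index=True)
    last_indices = len(column_array) - 1 - rev_indices
    mask = np.full((table.shape[0]), False)
    mask[last_indices] = True
    return table.filter(mask=mask)
In newer PyArrow versions you could sort first with something like table.sort_by([("updated_at", "ascending")]) (column name is a placeholder) and then apply this.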
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't do combine_chunks(): there is a bug, and Arrow seemingly cannot combine chunked arrays if their size is too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# Build a global row index across all chunks of the ID column
a = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
c = np.array([np.arange(x) for x in a])
a = ([0] + a)[:-1]
c = pa.chunked_array(c + np.cumsum(a))
tb3 = tb3.set_column(tb3.shape[1], 'index', c)
# Keep the smallest (first) row index for each ID, then filter the table to those rows
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found DuckDB can give better performance on the group-by. Changing the last two lines above into the following gives a 2x speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))

Does sim_slopes take into account random intercept in model?

I am using sim_slopes to test whether the slopes are different for different levels of a variable:
sim_slopes(model6, pred = Score, modx = Age, mod2 = Status, johnson_neyman = TRUE)
The model I am basing this on is a linear mixed-effects model fitted with lmer from the lme4 package. I was wondering whether the above code takes into account the fact that the data is longitudinal and that the model uses random intercepts for each participant.

Get empty prediction with Facebook Prophet

Following the basic steps to create a Prophet model and forecast:
m = Prophet(daily_seasonality=True)
m.fit(data)
forecast = m.make_future_dataframe(periods=2)
forecast.tail().T
the result is as follows (no yhat value?).
The data passed in to fit the model has two columns (date and value).
Not sure what I have missed out here.
I managed to get it to work by creating a new dataframe:
df_p = pd.DataFrame({'ds': d.index, 'y': d.values})
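For reference, here is the usual end-to-end flow (a sketch; make_future_dataframe only produces the ds column, and it is m.predict that fills in yhat; d is assumed to be the original date-indexed series from the question):
import pandas as pd
from prophet import Prophet  # older installs: from fbprophet import Prophet

# d is assumed to be the original date-indexed series
df_p = pd.DataFrame({'ds': d.index, 'y': d.values})

m = Prophet(daily_seasonality=True)
m.fit(df_p)

future = m.make_future_dataframe(periods=2)   # only contains the 'ds' column
forecast = m.predict(future)                  # adds yhat, yhat_lower, yhat_upper, ...
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())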

Maximo/GIS spatial query

I have a work order in Maximo 7.6.1.1:
The WO has LatitudeY and LongitudeX coordinates in the Service Address tab.
The WO has a custom zone field.
And there is a feature class (polygons) in a separate GIS database.
I want to do a spatial query to return an attribute from the polygon record that the WO intersects and use it to populate the zone field in the WO.
How can I do this?
Related keyword: Maximo Spatial
Doing this live in Maximo with an automation script is possible, or by writing custom code into Spatial (more challenging). You want to use the /MapServer/identify tool and post the geometry (x, y), the coordinate system, and the layer you want to query (the Identify window of the map service's REST endpoint).
You will have to format the geometry object correctly and test your post from that window. I usually grab the request from the network section of the developer tools once I get it to work, change the output format to json, and use it in my code.
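For illustration, here is roughly what such a request could look like from Python (a sketch only; the hostname, layer id, spatial reference, extent, and field name below are placeholders to replace with values from your own map service):
import requests

# Hypothetical identify request against an ArcGIS REST map service (all values are placeholders)
url = "https://hostname/arcgis/rest/services/Example/Zones/MapServer/identify"
params = {
    "geometry": "666666.03,4444444.70",           # x,y of the work order
    "geometryType": "esriGeometryPoint",
    "sr": 26917,                                  # spatial reference of the coordinates
    "layers": "all:15",                           # the zone layer to query
    "mapExtent": "600000,4000000,700000,5000000", # required by identify
    "imageDisplay": "600,550,96",                 # required by identify: width,height,dpi
    "tolerance": 1,
    "returnGeometry": "false",
    "f": "json",                                  # output format set to JSON
}
response = requests.get(url, params=params, timeout=30)
attributes = response.json()["results"][0]["attributes"]
print(attributes.get("ROADS_ZONE"))               # the zone attribute to copy into Maximo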
You may actually not need to touch your Maximo environment at all. How about just using a trigger on your work orders table? That trigger can then automatically fill the zone ID from a simple select statement that matches x and y with the zones in the zones table. Here is how that could look.
This assumes that your work orders are in a table like this:
create table work_orders (
  wo_id   number primary key,
  x       number,
  y       number,
  zone_id number
);
and the zones in a table like this:
create table zones (
  zone_id number primary key,
  shape   st_geometry
);
Then the trigger would look like this:
create or replace trigger work_orders_fill_zone
before insert or update of x, y on work_orders
for each row
begin
  select zone_id
    into :new.zone_id
    from zones
   where sde.st_contains (shape, sde.st_point (:new.x, :new.y, 4326)) = 1;
end;
/
Some assumptions:
The x and y columns contain coordinates in WGS84 longitude/latitude (not in some projection or some other long/lat coordinate system)
Zones don't overlap: a work order point is therefore always in one and only one zone. If not, the query may return multiple results, which you then need to handle.
Zones fully cover the territory your work orders can take place in. If a work order location can be outside all your zones, then you also need to handle that (the query would return no result).
The x and y columns are always filled. If they are optional, then you also need to handle that case (set zone_id to NULL if either x or y is NULL)
After that, each time a new work order is inserted in the work_orders table, the zone_id column will be automatically updated.
You can initialize zone_id in your existing work orders with a simple update:
update work_orders set x=x, y=y;
This will make the trigger run for each row in the table ... It may take some time to complete if the table is large.
Adapt the code in the Library Scripts section of Maximo 76 Scripting Features (pdf):
# What the script does:
# 1. Takes the X and Y coordinates of a work order in Maximo
# 2. Generates a URL from the coordinates
# 3. Executes the URL via a separate script/library (LIBHTTPCLIENT)
# 4. Performs a spatial query in an ESRI REST feature service (a separate GIS system)
# 5. Returns JSON text to Maximo with the attributes of the zone that the work
#    order intersected
# 6. Parses the zone number from the JSON text
# 7. Inserts the zone number into the work order record
from psdi.mbo import MboConstants
from java.util import HashMap
from com.ibm.json.java import JSONObject
field_to_update = "ZONE"
gis_field_name = "ROADS_ZONE"
def get_coords():
    """
    Get the y and x coordinates (UTM projection) from the WOSERVICEADDRESS table
    via the SERVICEADDRESS system relationship.
    The datatype of the LatitudeY and LongitudeX fields is decimal.
    """
    laty = mbo.getDouble("SERVICEADDRESS.LatitudeY")
    longx = mbo.getDouble("SERVICEADDRESS.LongitudeX")
    # Test values
    # laty = 4444444.7001941890
    # longx = 666666.0312127020
    return laty, longx
def is_latlong_valid(laty, longx):
    # Verify that the numbers are legitimate UTM coordinates
    return (4000000 <= laty <= 5000000 and
            600000 <= longx <= 700000)
def make_url(laty, longx, gis_field_name):
    """
    Assembles the URL (including the longx and the laty).
    Note: The coordinates are flipped in the URL.
    """
    url = (
        "http://hostname.port"
        "/arcgis/rest/services/Example"
        "/Zones/MapServer/15/query?"
        "geometry={0}%2C{1}&"
        "geometryType=esriGeometryPoint&"
        "spatialRel=esriSpatialRelIntersects&"
        "outFields={2}&"
        "returnGeometry=false&"
        "f=pjson"
    ).format(longx, laty, gis_field_name)
    return url
def fetch_zone(url):
    # Get the JSON text from the feature service (the JSON text contains the zone value)
    ctx = HashMap()
    ctx.put("url", url)
    service.invokeScript("LIBHTTPCLIENT", ctx)
    json_text = str(ctx.get("response"))
    # Parse the zone value from the JSON text
    obj = JSONObject.parse(json_text)
    parsed_val = obj.get("features")[0].get("attributes").get(gis_field_name)
    return parsed_val
try:
    laty, longx = get_coords()
    if not is_latlong_valid(laty, longx):
        service.log('Invalid coordinates')
    else:
        url = make_url(laty, longx, gis_field_name)
        zone = fetch_zone(url)
        # Insert the zone value into the zone field in the work order
        mbo.setValue(field_to_update, zone, MboConstants.NOACCESSCHECK)
        service.log(zone)
except:
    # If the script fails, then set the field value to null
    mbo.setValue(field_to_update, None, MboConstants.NOACCESSCHECK)
    service.log("An exception occurred")
LIBHTTPCLIENT: (a reusable Jython library script)
from psdi.iface.router import HTTPHandler
from java.util import HashMap
from java.lang import String
handler = HTTPHandler()
map = HashMap()
map.put("URL", url)
map.put("HTTPMETHOD", "GET")
responseBytes = handler.invoke(map, None)
response = String(responseBytes, "utf-8")

Django save() does not work

I'm writing an elections application. In the process, I've defined an Election model and a Candidate model.
Note: I'm using Django version 1.3.7, Python 2.7.1.
One of Election's methods,
Election.count_first_place(self)
is intended to count the number of first place votes each candidate receives and update the candidates' numVotes attribute. But for some reason they all stay at zero, no matter the ballots.
Note: I'm implementing STV so each ballot contains an array(ballot.voteArray) of Candidates in order of most preferred (position zero) to least preferred (position n). I've implemented this list with a PickledObjectField (see link).
models.py
class Candidate(models.Model):
    election = models.ForeignKey("Election")
    numVotes = models.FloatField(blank=True)

class Ballot(models.Model):
    election = models.ForeignKey("Election", related_name="ballot_set")
    voteArray = PickledObjectField(null=True, blank=True)

class Election(models.Model):
    position = models.CharField(max_length=50)
    candidates = models.ManyToManyField(Candidate, related_name="elections_in", null=True, blank=True)

    def count_first_place(self):
        # retrieve all of the ballots cast in this election
        ballots = Ballot.objects.filter(election=self)
        for ballot in ballots.all():
            # the first element of a ballot's voteArray is a Candidate object
            first_place_choice = ballot.voteArray[0]
            first_place_choice.numVotes += 1
            first_place_choice.save()
            ballot.save()
        self.save()
Here is what happens when I run a test:
Note: I realize that I am saving way more often than is necessary. Just being absolutely sure while I test this thing that it saves when it needs to.
elec = Election(position="Student Body President")
elec.save()
j = Candidate(election=elec,numVotes=0)
j.save()
e = Candidate(election=elec,numVotes=0)
e.save()
b = Candidate(election=elec,numVotes=0)
b.save()
elec.candidates.add(j,e,b)
elec.save()
ballot1 = Ballot(election=elec,voteArray=[j,e,b])
ballot1.save()
ballot2 = Ballot(election=elec,voteArray=[j,b,e])
ballot2.save()
ballot3 = Ballot(election=elec,voteArray=[e,b,j])
ballot3.save()
So after this bit, j has two first-place votes, and e has one. But when I run
elec.count_first_place()
j still has zero votes, as do e and b.
What's up with that????
This is a very strange table structure. Pickling other model instances is a very bad idea: the pickled versions will not update when their database rows do. Really you should be storing an array of candidate IDs, or even better create a many-to-many relationship from Ballot to Candidate with a through table indicating position.
But I think your problem is simpler than that. You say that the objects still have zero votes: that is because you have not updated those particular instances. Again, there is no direct relationship between a Django instance and the database row, other than on loading and saving. You'll need to reload the objects from the database to see any updates.
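For example, a sketch of both points (BallotEntry is a hypothetical through model, not something from the question):
# 1. Reload from the database to see updated counts; the in-memory objects
#    j, e, b are not refreshed by saves made on other (unpickled) copies.
elec.count_first_place()
j = Candidate.objects.get(pk=j.pk)
print j.numVotes  # reflects whatever was actually written to the database

# 2. A cleaner structure: store ranked choices in a through model instead of
#    pickling Candidate instances (BallotEntry is hypothetical).
class BallotEntry(models.Model):
    ballot = models.ForeignKey("Ballot")
    candidate = models.ForeignKey("Candidate")
    position = models.PositiveIntegerField()  # 0 = first choice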