I installed the dataframe package in Octave and would like to programmatically assert that the package version is at least 1.2.0. Does Octave provide a way to check a package version programmatically?
Version = ver('dataframe')
% Version =
%   scalar structure containing the fields:
%     Name = dataframe
%     Version = 1.2.0
%     Release = [](0x0)
%     Date = 2017-08-14
Obviously Version.Version is still a string, but you can process that further, e.g. with strsplit, to obtain the major-minor-patch numbers.
strsplit( Version.Version, '.' )
% ans =
% {
%   [1,1] = 1
%   [1,2] = 2
%   [1,3] = 0
% }
Alternatively, you can use
Out = pkg('list', 'dataframe')
which also contains a 'version' field, as well as some extra information.
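To complete the original goal of asserting a minimum version, Octave's built-in compare_versions can do the comparison directly. A minimal sketch, assuming the package is installed so ver returns a populated struct:
Version = ver('dataframe');
% compare_versions returns true if the relation holds between the two
% version strings, e.g. '>=' for "at least".
assert(compare_versions(Version.Version, '1.2.0', '>='), ...
       'dataframe package is older than 1.2.0');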
I want to shard an Arrow Dataset. To achieve that, I'd like to use a monotonically increasing field and implement the sharding operation as the following filter, which I can use in a pyarrow Scanner: pc.field('id') % num_shards == shard_id
Any ideas on how to do this using PyArrow compute API?
Although there is not yet a modulo function, there is a bit_wise_and function which can achieve the same thing when the number of shards is a power of two (x & (n - 1) equals x % n for power-of-two n):
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc
arr = pa.array(range(100))
tab = pa.Table.from_arrays([arr], names=['x'])
my_filter = pc.bit_wise_and(pc.field('x'), 7) == 0  # equivalent to x % 8 == 0
filtered = ds.dataset(tab).to_table(filter=my_filter)
print(filtered)
# pyarrow.Table
# x: int64
# ----
# x: [[0,8,16,24,32,...,64,72,80,88,96]]
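For an arbitrary shard_id (not just shard 0) the same trick generalizes. A minimal sketch, assuming num_shards is a power of two, since the bitwise identity only holds then:
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

num_shards = 8  # must be a power of two for the bitwise trick
shard_id = 3    # which shard this worker should read

tab = pa.Table.from_arrays([pa.array(range(100))], names=['id'])

# id & (num_shards - 1) equals id % num_shards for power-of-two num_shards.
shard_filter = pc.bit_wise_and(pc.field('id'), num_shards - 1) == shard_id
shard = ds.dataset(tab).to_table(filter=shard_filter)
print(shard.column('id'))  # 3, 11, 19, 27, ..., 99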
I am building a multi-class image classifier.
There is a debugging trick of overfitting on a single batch to check whether there are any deeper bugs in the program.
How can I design the code so that this check is portable across problems?
One arduous and inelegant way is to build a holdout train/test folder for a small batch, where the test set consists of two distributions, seen and unseen data; if the model performs well on the seen data and poorly on the unseen data, we can conclude that the network has no deeper structural bug.
But this does not seem like a smart or portable approach, since it has to be rebuilt for every problem.
Currently, I have a dataset class where I am partitioning the data into train/dev/test as follows:
import pandas as pd
from sklearn.model_selection import train_test_split


def split_equal_into_val_test(csv_file=None, stratify_colname='y',
                              frac_train=0.6, frac_val=0.15, frac_test=0.25,
                              random_state=None):
    """
    Split a Pandas dataframe into three subsets (train, val, and test).

    Follows the fractional ratios provided by the user, where the val and
    test sets have the same number of each class while the train set keeps
    the remaining examples.

    Parameters
    ----------
    csv_file : str
        Input data csv file to be passed.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be the label.
    frac_train : float
    frac_val : float
    frac_test : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    """
    df = pd.read_csv(csv_file).iloc[:, 1:]
    if frac_train + frac_val + frac_test != 1.0:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))
    if stratify_colname not in df.columns:
        raise ValueError('%s is not a column in the dataframe' %
                         stratify_colname)
    df_input = df
    no_of_classes = 4
    sfact = int((0.1 * len(df)) / no_of_classes)
    # Shuffle the data frame.
    df_input = df_input.sample(frac=1, random_state=random_state)
    df_temp_1 = df_input[df_input['labels'] == 1][:sfact]
    df_temp_2 = df_input[df_input['labels'] == 2][:sfact]
    df_temp_3 = df_input[df_input['labels'] == 3][:sfact]
    df_temp_4 = df_input[df_input['labels'] == 4][:sfact]
    dev_test_df = pd.concat([df_temp_1, df_temp_2, df_temp_3, df_temp_4])
    dev_test_y = dev_test_df['labels']
    # Split the temp dataframe into val and test dataframes.
    df_val, df_test, dev_Y, test_Y = train_test_split(
        dev_test_df, dev_test_y,
        stratify=dev_test_y,
        test_size=0.5,
        random_state=random_state,
    )
    df_train = df[~df['img'].isin(dev_test_df['img'])]
    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)
    return df_train, df_val, df_test
def train_val_to_ids(train, val, test, stratify_columns='labels'):  # noqa
    """
    Convert the stratified splits into dictionaries: partition['train_set'] / partition['val_set'] and labels.

    To generate the parallel loading code according to
    https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel

    Parameters
    ----------
    train, val, test : pd.DataFrame
        The three splits produced above.
    stratify_columns : str
        The label column.

    Returns
    -------
    partition, labels :
        partition dictionary containing train and validation ids and label
        dictionary containing ids and their labels.  # noqa
    """
    train_list, val_list, test_list = train['img'].to_list(), val['img'].to_list(), test['img'].to_list()  # noqa
    partition = {"train_set": train_list,
                 "val_set": val_list,
                 }
    labels = dict(zip(train.img, train.labels))
    labels.update(dict(zip(val.img, val.labels)))
    return partition, labels
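For context, the partition and labels dictionaries then feed a torch Dataset following the Stanford pattern linked above. A minimal sketch, with a hypothetical placeholder in place of real image loading:
import torch
from torch.utils.data import Dataset

class ImageDataset(Dataset):
    """Hypothetical Dataset wired to partition/labels, per the Stanford pattern."""

    def __init__(self, list_ids, labels):
        self.list_ids = list_ids  # e.g. partition['train_set']
        self.labels = labels      # maps image id -> label

    def __len__(self):
        return len(self.list_ids)

    def __getitem__(self, index):
        img_id = self.list_ids[index]
        x = torch.randn(3, 224, 224)  # placeholder; replace with real image loading
        y = self.labels[img_id]
        return x, y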
P.S. I know about PyTorch Lightning and that it has an overfit feature which can be used easily, but I don't want to move to PyTorch Lightning.
I don't know how portable it will be, but a trick that I use is to modify the __len__ function in the Dataset.
If I modify it from
def __len__(self):
    return len(self.data_list)
to
def __len__(self):
    return 20
it will only expose the first 20 elements of the dataset (regardless of shuffle, since the sampler only draws indices 0-19). You only need to change one line of code and the rest should work just fine, so I think it's pretty neat.
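A variant of the same idea that doesn't require editing the Dataset class is to wrap it in torch.utils.data.Subset. A minimal sketch, with a stand-in TensorDataset in place of your real dataset:
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Stand-in data; replace with your own Dataset instance.
full_dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                             torch.randint(0, 4, (1000,)))

batch_size = 20
# Restrict the dataset to its first batch_size indices without touching
# the Dataset class itself, so the trick ports across projects.
overfit_set = Subset(full_dataset, range(batch_size))
loader = DataLoader(overfit_set, batch_size=batch_size, shuffle=True)

# Every epoch now iterates over the same single batch; if the model cannot
# drive the training loss toward zero here, something deeper is wrong.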
I'm trying to calculate Adamic-Adar similarity for a network which has two types of nodes. I'm only interested in calculating similarity between nodes which have outgoing connections. Nodes with incoming connections are a kind of connector, and I'm not interested in them.
Data size and characteristics:
> summary(g)
IGRAPH DNW- 3852 24478 --
+ attr: name (v/c), weight (e/n)
Prototype code in Python 2.7:
import glob
import os

import pandas as pd
from igraph import *

os.chdir("data/")

for file in glob.glob("*.graphml"):
    print(file)
    g = Graph.Read_GraphML(file)
    indegree = Graph.degree(g, mode="in")
    g['indegree'] = indegree
    dev = g.vs.select(indegree == 0)
    m = Graph.similarity_inverse_log_weighted(dev.subgraph())
    df = pd.melt(m)
    df.to_csv(file.split("_only.graphml")[0] + "_similarity.csv", sep=',')
There is something wrong with this code, because dev is of length 1, and m is 0.0, so it doesn't work as expected.
Hint
I have working code in R, but it seems I'm unable to rewrite it in Python (which I'm doing for the sake of performance; the networks are huge). Here it is:
# make sure g is your network
indegree <- degree(g, mode="in")
V(g)$indegree <- indegree
dev <- V(g)[indegree==0]
m <- similarity.invlogweighted(g, dev)
x.m <- melt(m)
colnames(x.m) <- c("dev1", "dev2", "value")
x.m <- x.m[x.m$value > 0, ]
write.csv(x.m, file = sub(".csv", "_similarity.csv", filename))
You are assigning the in-degrees as a graph attribute, not as a vertex attribute, so you cannot reasonably call g.vs.select() later on. You need this instead:
indegree = g.degree(mode="in")
g.vs["indegree"] = indegree
dev = g.vs.select(indegree=0)
But actually, you could simply write this:
dev = g.vs.select(_indegree=0)
This works because of how the select method works:
Attribute names inferred from keyword arguments are treated specially if they start with an underscore (_). These are not real attributes but refer to specific properties of the vertices, e.g., its degree. The rule is as follows: if an attribute name starts with an underscore, the rest of the name is interpreted as a method of the Graph object. This method is called with the vertex sequence as its first argument (all others left at default values) and vertices are filtered according to the value returned by the method.
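Putting it together, the corrected loop might look like this. A sketch, using python-igraph's similarity_inverse_log_weighted(vertices=...) to mirror the R call similarity.invlogweighted(g, dev), so the similarity is computed against the full graph rather than on a subgraph that has lost the connector nodes:
import glob
import pandas as pd
from igraph import Graph

for file in glob.glob("*.graphml"):
    g = Graph.Read_GraphML(file)
    g.vs["indegree"] = g.degree(mode="in")
    dev = g.vs.select(_indegree=0)
    # Rows of m are the selected vertices, columns are all vertices.
    m = g.similarity_inverse_log_weighted(vertices=dev)
    df = pd.DataFrame(m).melt(var_name="dev2", value_name="value")
    df = df[df["value"] > 0]
    df.to_csv(file.split("_only.graphml")[0] + "_similarity.csv", sep=',')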
I have this simple bit of code I can't get to work. Any advice?
a = 2
b = 4
c = input()
d = 0
d = c + 5
print(d)
Say I input a, so 2; I should get 7, but I don't. This is Python 3, using Wing IDE 101 (ver. 5). I get this as my error output:
Traceback (most recent call last):
  File "", line 1, in
builtins.NameError: name 'a' is not defined
Are you sure you are using Python 3? In Python 2.x you can do it by explicitly evaluating a string expression with the eval() function:
c = eval(raw_input()) # Python 2.7
c = eval(input()) # Python 3.x
In Python 3.x, input() returns the input as a string and won't raise that NameError. It will raise a TypeError instead, because you cannot add a str and an int that way.
You can also just try c = raw_input() (Python 2).
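Since eval() on raw user input will execute arbitrary expressions, a safer alternative is to look the name up explicitly. A minimal sketch, not from the answers above:
a = 2
b = 4
values = {'a': a, 'b': b}  # map the names the user may type to their values
c = values[input()]        # dictionary lookup instead of eval()
d = c + 5
print(d)                   # entering "a" prints 7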
I have a table in SQLAlchemy with a column IPaddress that is stored in decimal (integer) format.
How can I change it to show a dotted format instead?
Get the decimal value from the DB and do the conversion in Python code. Install the IPy module and do the following:
from IPy import IP
ip_dotted = str(IP(ip_dec))
If you are using Python 3.3+, there's ipaddress in the stdlib:
from ipaddress import ip_address
ip_dotted = str(ip_address(ip_dec))
If the RDBMS in use is PostgreSQL, the conversion can be done at the DB level:
from sqlalchemy.sql import cast, func
from sqlalchemy.dialects.postgresql import INET
session.query(func.host(cast('0.0.0.0', INET) + SomeTable.ip_dec)).scalar()
Finally, you can do the calculation yourself, so there won't be any dependencies on specific libraries, Python versions, or RDBMSes:
segments = []
# Extract each byte of the address, from the most significant (256**3) down.
for n in range(3, -1, -1):
    p = 256 ** n
    segments.append(str(ip_dec // p))
    ip_dec %= p
ip_dotted = '.'.join(segments)
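If you want the dotted form available directly on model instances, one option is a hybrid property. A sketch assuming a hypothetical Host model, SQLAlchemy 1.4+, and Python 3.3+ for ipaddress:
from ipaddress import ip_address

from sqlalchemy import Column, Integer
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Host(Base):
    __tablename__ = 'hosts'  # hypothetical table storing decimal IPs
    id = Column(Integer, primary_key=True)
    ip_dec = Column(Integer, nullable=False)

    @hybrid_property
    def ip_dotted(self):
        # Convert the stored integer to dotted form on attribute access.
        return str(ip_address(self.ip_dec))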