SQLAlchemy: change a decimal IP address to dotted format

I have a table in SQLAlchemy with an IPaddress column that is stored in decimal format.
How can I change it to show a dotted format instead?

Get the decimal value from the DB and do the conversion in Python. Install the IPy module and do the following:
from IPy import IP
ip_dotted = str(IP(ip_dec))
If you're on Python 3.3 or later, there's ipaddress in the stdlib:
from ipaddress import ip_address
ip_dotted = str(ip_address(ip_dec))
If the RDBMS in use is PostgreSQL, the conversion can be done at the DB level:
from sqlalchemy.sql import cast, func
from sqlalchemy.dialects.postgresql import INET
session.query(func.host(cast('0.0.0.0', INET) + SomeTable.ip_dec)).scalar()
Finally, you can do the calculation yourself, so there won't be any dependencies on specific libraries, Python versions, or RDBMSs:
segments = []
# Extract each octet, from the most significant (256**3) to the least.
for n in range(3, -1, -1):
    p = 256 ** n
    segments.append(str(ip_dec // p))
    ip_dec %= p
ip_dotted = '.'.join(segments)
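Wrapped in a small helper, with a quick sanity check (3232235777 is the decimal form of 192.168.1.1, i.e. 192*256**3 + 168*256**2 + 1*256 + 1):
def dec_to_dotted(ip_dec):
    segments = []
    for n in range(3, -1, -1):
        p = 256 ** n
        segments.append(str(ip_dec // p))
        ip_dec %= p
    return '.'.join(segments)

print(dec_to_dotted(3232235777))  # prints '192.168.1.1'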

Related

How to implement modulo operation using PyArrow Expression API so that I can use it in filter?

I want to shard an Arrow Dataset. To achieve that, I'd like to use a monotonically increasing field and implement the sharding operation as the following filter, which I can use in a pyarrow Scanner: pc.field('id') % num_shards == shard_id
Any ideas on how to do this using PyArrow compute API?
Although there is not yet a modulo function, there is a bit_wise_and function which can achieve the same thing whenever the divisor is a power of two (x % 2**k == x & (2**k - 1)):
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc
arr = pa.array(range(100))
tab = pa.Table.from_arrays([arr], names=['x'])
my_filter = pc.bit_wise_and(pc.field('x'), 7) == 0
filtered = ds.dataset(tab).to_table(filter=my_filter)
print(filtered)
# pyarrow.Table
# x: int64
# ----
# x: [[0,8,16,24,32,...,64,72,80,88,96]]
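Applied to the sharding filter from the question, assuming num_shards is a power of two (id, num_shards and shard_id are the names from the question):
my_filter = pc.bit_wise_and(pc.field('id'), num_shards - 1) == shard_id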

How can I solve "empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType)" in MushroomRL?

I'm using MushroomRL for a deep reinforcement learning project, with a graph representation as the RL environment, where the number of nodes represents the number of actions. In my neural network the input is a single value, e.g. tensor([[5.]]), and the output Q has one entry per node, which is ten, e.g. tensor([[5972.4927, 8562.3330, 7443.6479, 7326.1587, 6615.2090, 6617.3145, 6911.8672, 8233.7930, 6821.0093, 7000.1182]]). MushroomRL is a new framework to me, and this is the code:
if __name__ == '__main__':
    from mushroom_rl.core import Core
    from mushroom_rl.algorithms.value import TrueOnlineSARSALambda
    from mushroom_rl.policy import EpsGreedy
    from mushroom_rl.features import Features
    from mushroom_rl.features.tiles import Tiles
    from mushroom_rl.utils.dataset import compute_J
    from mushroom_rl.utils.parameters import LinearParameter, Parameter
    from mushroom_rl.approximators.parametric import TorchApproximator
    from mushroom_rl.algorithms.value import DQN

    # Set the seed
    np.random.seed(1)

    # Create the toy environment with default parameters
    # mdp = Environment.make('graph_env')
    mdp = graph_env()

    # Using an epsilon-greedy policy
    epsilon = Parameter(value=0.1)
    pi = EpsGreedy(epsilon=epsilon)

    # Policy
    epsilon = LinearParameter(value=1.,
                              threshold_value=.1,
                              n=1000000)
    epsilon_test = Parameter(value=.05)
    epsilon_random = Parameter(value=1)
    pi = EpsGreedy(epsilon=epsilon_random)

    approximator_params = dict(
        network=Network,
        input_shape=(1,),
        output_shape=(1,),
        n_actions=mdp.info.action_space.n,
        optimizer=optimizer,
        loss=F.mse_loss
    )
    approximator = TorchApproximator

    algorithm_params = dict(
        batch_size=32,
        target_update_frequency=target_update_frequency // train_frequency,
        replay_memory=True,
        initial_replay_size=initial_replay_size,
        max_replay_size=max_replay_size
    )
    agent = DQN(mdp.info, pi, approximator,
                approximator_params=approximator_params,
                **algorithm_params)

    # Algorithm
    core = Core(agent, mdp)

    # RUN
    # Fill replay memory with random dataset
    print_epoch(0)
    core.learn(n_steps=initial_replay_size, n_steps_per_fit=initial_replay_size)

    # Evaluate initial policy
    pi.set_epsilon(epsilon_test)
    # mdp.set_episode_end(False)
    dataset = core.evaluate(n_steps=test_samples)
    scores.append(get_stats(dataset))

    for n_epoch in range(1, max_steps // evaluation_frequency + 1):
        print_epoch(n_epoch)
        print('- Learning:')
        # learning step
        pi.set_epsilon(epsilon)
        mdp.set_episode_end(True)
        core.learn(n_steps=evaluation_frequency,
                   n_steps_per_fit=train_frequency)
        print('- Evaluation:')
        # evaluation step
        pi.set_epsilon(epsilon_test)
        mdp.set_episode_end(False)
        dataset = core.evaluate(n_steps=test_samples)
        scores.append(get_stats(dataset))
It gives me this error when I run the code:
TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
* (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
* (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
I believe the problem is in this part of the code. Can anyone help me fix it?

How can I extract information quickly from 130,000+ JSON files located in S3?

I have an S3 bucket with over 130k JSON files, and I need to compute numbers based on the data in those files (for example, counting the gender of speakers). I am currently using the S3 paginator and json.loads to read each file and extract information from it, but it takes a very long time to process such a large number of files (2-3 files per second). How can I speed up the process? Please provide working code examples if possible. Thank you.
Here is some of my code:
import json
import boto3

client = boto3.client('s3')
s3 = boto3.resource('s3')  # create the resource once, not per object

paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name', StartAfter='')
for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            content_object = s3.Bucket('bucket-name').Object(str(keyString))
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            x = json_content['dict-name']
To use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear whether your 2-3 files per second is just the read or includes part of the number crunching; nonetheless, multiprocessing will speed this up dramatically. The gist is to read all the files in (as dataframes), concatenate them, and then do your analysis.
For my purposes, I run this on spot instances that have lots of vCPUs and memory. I've found that network-optimized instances (like c5n - look for the n) and the inf1 instances (for machine learning) are much faster at reading/writing than the T or M instance types, as examples.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. The multiprocessing is orders of magnitude faster than a single process.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import json
from utils_file_handling import *

ufh = file_utilities()  # instantiate the class functions - see below (second file)

bucket = 'your-bucket'
prefix = 'your-prefix/here/'  # if you don't have a prefix, pass '' (empty string, or the function will fail)

# define multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as pool:
        df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15)
        pool.close()
        pool.join()
    return df_list

# create your master keys list upfront; you can loop through all or slice the list to test
keys_list = ufh.get_keys_from_prefix(bucket, prefix)
# keys_list = keys_list[0:2000]  # as an example
num_proc = os.cpu_count()  # tells you how many processors your machine has; the function above defaults to 4 unless given
df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc)  # collect dataframes for each file
df_new = pd.concat(df_list, sort=False)
df_new = df_new.reset_index(drop=True)
# do your analysis on the dataframe
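As a hypothetical example of the analysis step, assuming each JSON record ends up with a gender column (your actual field names will differ):
gender_counts = df_new['gender'].value_counts()
print(gender_counts)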
File 2: class functions
# utils_file_handling.py
# create this in a separate file; name it as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd

# define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')

class file_utilities:
    """file handling functions"""

    def get_keys_from_prefix(self, bucket, prefix):
        '''gets list of keys for given bucket and prefix'''
        keys_list = []
        paginator = s3sr.meta.client.get_paginator('list_objects_v2')
        # use Delimiter to limit search to that level of hierarchy
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
            keys = [content['Key'] for content in page.get('Contents')]
            print('keys in page: ', len(keys))
            keys_list.extend(keys)
        return keys_list

    def read_json_file_from_s3(self, bucket, key):
        """reads a json file from s3 and returns its body as a string"""
        obj = s3sc.get_object(Bucket=bucket, Key=key)
        data = obj['Body'].read().decode('utf-8')
        return data

    # you may need to tweak this for your ['dict-name'] example; I think I have it correct
    def reader_json(self, bucket, key):
        '''returns a dataframe built from one json file'''
        return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])

Conversion of binary lidar data (.bin) to point cloud data (.pcd) format

I have lidar data collected with a Velodyne-128 in .bin format. I need to convert it into .pcd format. I use NVIDIA DriveWorks for data manipulation, but there is no tool to convert lidar binary data into pcd.
Thus, is there a method to convert binary lidar data into pcd format?
Here is a snippet for converting lidar data (in .bin format) into .pcd format, assuming each point is stored as four float32 values (x, y, z, intensity):
import struct
import numpy as np
import open3d as o3d

size_float = 4
list_pcd = []
with open("lidar_velodyne64.bin", "rb") as f:
    byte = f.read(size_float * 4)
    while byte:
        x, y, z, intensity = struct.unpack("ffff", byte)
        list_pcd.append([x, y, z])
        byte = f.read(size_float * 4)
np_pcd = np.asarray(list_pcd)
pcd = o3d.geometry.PointCloud()
v3d = o3d.utility.Vector3dVector
pcd.points = v3d(np_pcd)
Following the Open3D I/O examples, you can then save it to a file as below:
import open3d as o3d
o3d.io.write_point_cloud("copy_of_fragment.pcd", pcd)
Although all the other solutions are probably correct, I was looking for a simple numpy-only version. Here it is:
import numpy as np
import open3d as o3d
# Load binary point cloud
bin_pcd = np.fromfile("lidar_velodyne64.bin", dtype=np.float32)
# Reshape and drop reflection values
points = bin_pcd.reshape((-1, 4))[:, 0:3]
# Convert to Open3D point cloud
o3d_pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
# Save to whatever format you like
o3d.io.write_point_cloud("pointcloud.pcd", o3d_pcd)
I found this code on GitHub (reference given below):
import numpy as np
import struct
import open3d as o3d

def convert_kitti_bin_to_pcd(binFilePath):
    size_float = 4
    list_pcd = []
    with open(binFilePath, "rb") as f:
        byte = f.read(size_float * 4)
        while byte:
            x, y, z, intensity = struct.unpack("ffff", byte)
            list_pcd.append([x, y, z])
            byte = f.read(size_float * 4)
    np_pcd = np.asarray(list_pcd)
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np_pcd)
    return pcd
Reference: https://gist.githubusercontent.com/HTLife/e8f1c4ff1737710e34258ef965b48344/raw/76e15821e7cd45cac672dbb1f14d577dc9be5ff8/convert_kitti_bin_to_pcd.py
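A quick usage sketch, reusing the file name from the earlier answers and saving the result with Open3D:
pcd = convert_kitti_bin_to_pcd("lidar_velodyne64.bin")
o3d.io.write_point_cloud("lidar_velodyne64.pcd", pcd)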

Calculate inverse log-weighted similarity in a bimodal network, igraph in Python

I'm trying to calculate Adamic-Adar similarity for a network which has two types of nodes. I'm only interested in calculating similarity between nodes that have outgoing connections. Nodes with incoming connections are a kind of connector, and I'm not interested in them.
Data size and characteristics:
> summary(g)
IGRAPH DNW- 3852 24478 --
+ attr: name (v/c), weight (e/n)
Prototype code in Python 2.7:
import glob
import os
import pandas as pd
from igraph import *

os.chdir("data/")
for file in glob.glob("*.graphml"):
    print(file)
    g = Graph.Read_GraphML(file)
    indegree = Graph.degree(g, mode="in")
    g['indegree'] = indegree
    dev = g.vs.select(indegree == 0)
    m = Graph.similarity_inverse_log_weighted(dev.subgraph())
    df = pd.melt(m)
    df.to_csv(file.split("_only.graphml")[0] + "_similarity.csv", sep=',')
There is something wrong with this code, because dev is of length 1, and m is 0.0, so it doesn't work as expected.
Hint: I have working code in R, but it seems I'm unable to rewrite it in Python (which I'm doing for the sake of performance; the networks are huge). Here it is:
# make sure g is your network
indegree <- degree(g, mode="in")
V(g)$indegree <- indegree
dev <- V(g)[indegree==0]
m <- similarity.invlogweighted(g, dev)
x.m <- melt(m)
colnames(x.m) <- c("dev1", "dev2", "value")
x.m <- x.m[x.m$value > 0, ]
write.csv(x.m, file = sub(".csv", "_similarity.csv", filename))
You are assigning the in-degrees as a graph attribute, not as a vertex attribute, so you cannot reasonably call g.vs.select() later on. You need this instead:
indegree = g.degree(mode="in")
g.vs["indegree"] = indegree
dev = g.vs.select(indegree=0)
But actually, you could simply write this:
dev = g.vs.select(_indegree=0)
This works because of how the select method works:
Attribute names inferred from keyword arguments are treated specially
if they start with an underscore (_). These are not real attributes
but refer to specific properties of the vertices, e.g., its degree.
The rule is as follows: if an attribute name starts with an underscore,
the rest of the name is interpreted as a method of the Graph object.
This method is called with the vertex sequence as its first argument
(all others left at default values) and vertices are filtered
according to the value returned by the method.
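Putting it together with the original loop, a sketch of the corrected prototype (passing the selected vertices to the similarity method mirrors the R code; treat the exact keyword arguments as assumptions to check against your igraph version):
for file in glob.glob("*.graphml"):
    g = Graph.Read_GraphML(file)
    dev = g.vs.select(_indegree=0)
    # compute similarity on the full graph, restricted to the selected rows,
    # instead of on a subgraph with the connector nodes removed
    m = g.similarity_inverse_log_weighted(vertices=dev.indices, mode="all")
    df = pd.melt(pd.DataFrame(m))
    df.to_csv(file.split("_only.graphml")[0] + "_similarity.csv", sep=',')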