How to run a function in another process using Cython (and not interacting with Python)? [Included Python code example] - cython

What is the best way to replicate the below behavior in Cython (without having to interact with Python)? Assume that the function which will be passed into the new process is a cdef function.
import time
from multiprocessing import Process

def func1(n):
    while True:
        # do some work (different from func2)
        time.sleep(n)

def func2(n):
    while True:
        # do some other work (different from func1)
        time.sleep(n)

p1 = Process(target=func1, args=(1,))
p1.start()
p2 = Process(target=func2, args=(1,))
p2.start()
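One possible direction (a sketch, not a canonical answer): on a POSIX platform the process can be forked at the C level and each child can run a nogil cdef function, so no Python-level multiprocessing machinery is involved. The fork, sleep and _exit declarations below come from the posix.unistd definitions shipped with Cython; the run_workers name and the 1-second interval are illustrative only.
# cython: language_level=3
# Sketch: fork() at the C level and run nogil cdef workers in the children.
from posix.unistd cimport fork, sleep, _exit

cdef void func1(unsigned int n) nogil:
    while True:
        # do some work (different from func2)
        sleep(n)

cdef void func2(unsigned int n) nogil:
    while True:
        # do some other work (different from func1)
        sleep(n)

cpdef run_workers():
    if fork() == 0:  # first child
        with nogil:
            func1(1)
        _exit(0)
    if fork() == 0:  # second child
        with nogil:
            func2(1)
        _exit(0)
    # parent returns; waitpid() and error handling omitted for brevity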

Related

How to write a Numba function used both in CPU mode and in CUDA device mode?

I want to write a Numba function used both in CPU mode and in CUDA device mode. Of course, I can write two identical functions with and without the cuda.jit decorator. For example:
from numba import cuda, njit
@njit("i4(i4, i4)")
def func_cpu(a, b):
    return a + b

@cuda.jit("i4(i4, i4)", device=True)
def func_gpu(a, b):
    return a + b
But it is ugly from a software-engineering point of view. Is there a more elegant way, i.e., combining the code in one function?
A decorator is essentially a function that takes a function as input and returns an (often modified) function as output. The addition of arguments and keyword arguments, as done with Numba, makes it slightly more complicated (internally), but you can think of it as a nested function where the outer one returns the actual decorator.
So instead of using it as a decorator like you do now (with the @), you can just call it like any other function and capture the output. The output will then be a callable function as well.
This allows writing your function in pure Python and then applying as many "decorators" to it as you'd like. For example:
from numba import cuda, njit
def func_py(a, b):
    return a + b

func_njit = njit("i4(i4, i4)")(func_py)
func_gpu = cuda.jit("i4(i4, i4)", device=True)(func_py)

assert func_py(4, 3) == func_njit(4, 3)
assert func_py(4, 3) == func_gpu(4, 3)

Add a TensorBoard metric from my PettingZoo environment

I'm using TensorBoard to see the progress of the PettingZoo environment that my agents are playing. I can see the reward go up with time, which is good, but I'd like to add other metrics that are specific to my environment, i.e. I'd like TensorBoard to show me more charts with my metrics and how they improve over time.
The only way I could figure out how to do that was by inserting a few lines into the learn method of OnPolicyAlgorithm that's part of SB3. This works and I got the charts I wanted:
(The two bottom charts are the ones I added.)
But obviously editing library code isn't a good practice. I should make the modifications in my own code, not in the libraries. Is there currently a more elegant way to add a metric from my PettingZoo environment into TensorBoard?
You can add a callback to add your own logs. See the example below. In this case the callback is called every step. There are other callbacks that you can use depending on your use case.
import numpy as np

from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import BaseCallback

model = SAC("MlpPolicy", "Pendulum-v1", tensorboard_log="/tmp/sac/", verbose=1)


class TensorboardCallback(BaseCallback):
    """
    Custom callback for plotting additional values in tensorboard.
    """

    def __init__(self, verbose=0):
        super(TensorboardCallback, self).__init__(verbose)

    def _on_step(self) -> bool:
        # Log scalar value (here a random variable)
        value = np.random.random()
        self.logger.record('random_value', value)
        return True


model.learn(50000, callback=TensorboardCallback())
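If the metric you care about lives on the environment itself, the same callback mechanism can pull it off the vectorized env that SB3 trains on. A rough sketch, assuming the (wrapped) PettingZoo environment exposes an episode_score attribute; that attribute name and the env/episode_score tag are purely illustrative:
import numpy as np
from stable_baselines3.common.callbacks import BaseCallback


class EnvMetricCallback(BaseCallback):
    """Log a custom attribute exposed by the training environment."""

    def _on_step(self) -> bool:
        # self.training_env is the VecEnv being trained on;
        # get_attr returns one value per sub-environment
        values = self.training_env.get_attr("episode_score")  # hypothetical attribute
        self.logger.record("env/episode_score", float(np.mean(values)))
        return True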

Palantir Foundry: how to allow a dynamic number of inputs in compute (Code repository)

I have a folder where I will upload one file every month. The file will have the same format every month.
First problem:
The idea is to concatenate all the files in this folder into one file. Currently I am hardcoding the filenames (filename[0], filename[1], filename[2], ...), but imagine later I will have 50 files: should I explicitly add them to the transform_df decorator? Is there any other method to handle this?
Second problem:
Currently I have, let's say, 4 files (2021_07, 2021_08, 2021_09, 2021_10) and I want to avoid changing the code whenever I add the file with the 2021_12 data.
If I add input_5 = Input(path_to_2021_12_do_not_exists), the code will not run and will give an error.
How can I implement the code for future files and let it ignore an input that does not exist yet, without manually adding a new value to my code each month?
Thank you
# from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output
from pyspark.sql.functions import to_date, year, col
from pyspark.sql.types import StringType
from myproject.datasets import utils
from pyspark.sql import DataFrame
from functools import reduce

input_dir = '/Company/Project_name/'
prefix_filename = 'DataInput1_'
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10', '2021_11', '2021_12']
filenames = [input_dir + prefix_filename + suffixe for suffixe in suffixes]


@transform_df(
    Output("/Company/Project_name/Data/clean/File_concat"),
    input_1=Input(filenames[0]),
    input_2=Input(filenames[1]),
    input_3=Input(filenames[2]),
    input_4=Input(filenames[3]),
)
def compute(input_1, input_2, input_3, input_4):
    input_dfs = [input_1, input_2, input_3, input_4]
    dfs = []

    def transformation_input(df):
        # some transformation
        return df

    for input_df in input_dfs:
        dfs.append(transformation_input(input_df))

    dfs = reduce(DataFrame.unionByName, dfs)
    return dfs
This question comes up a lot; the simple answer is that you don't. Defining datasets and executing a build on them are two different steps executed at different stages.
Whenever you commit your code and run the checks, your overall Python code is executed during the renderSchrinkwrap stage, except for the compute part. This allows Foundry to discover what datasets exist and publish them.
Publishing involves creating your dataset; whatever is inside your compute function is published into the jobspec of the dataset, so Foundry knows what code to execute whenever you run a build.
Once you hit build on the dataset, Foundry will only pick up whatever is on the jobspec and execute it. Any other code has already run during your checks, and it has run just once.
So any dynamic input/output would require you to re-run the checks on your repo, which means that some code change would have had to happen, since checks are part of the CI process, not part of the build.
Taking a step back, assuming each of your input files has the same schema, Foundry would expect you to have all of those files in the same dataset as append transactions.
This might not be possible though if, for instance, the only indication of the "year" of the data is embedded in the filename, but your sample code indicates that you expect all these datasets to have the same schema and to union together easily.
You can do this manually through the Dataset Preview - just use the Upload File button or drag-and-drop the new file into the Preview window - or, if it's an "end user" workflow, with a File Upload Widget in a Workshop app. You may need to coordinate with your Foundry support team if this widget isn't available.
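To make that concrete, here is a rough sketch of what the transform could look like once every monthly file has been appended to a single input dataset; the DataInput1_all_months path is purely illustrative:
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Company/Project_name/Data/clean/File_concat"),
    source=Input("/Company/Project_name/DataInput1_all_months"),  # hypothetical dataset
)
def compute(source):
    # all appended transactions are exposed as one dataframe,
    # so no per-file inputs are needed
    return source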
A bit late to the post, but for anyone who is interested in an answer to most of the question: dynamically determining file names from within a folder is not doable, although having some level of dynamic input is possible as follows:
# from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from pyspark.sql.functions import to_date, year, col
from pyspark.sql.types import StringType
from myproject.datasets import utils
from pyspark.sql import DataFrame
# from functools import reduce
from transforms.verbs.dataframes import union_many  # use this instead of reduce

input_dir = '/Company/Project_name/'
prefix_filename = 'DataInput1_'
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10', '2021_11', '2021_12']
filenames = [input_dir + prefix_filename + suffixe for suffixe in suffixes]
inputs = {'input{}'.format(index): Input(filename) for (index, filename) in enumerate(filenames)}


@transform(
    output=Output("/Company/Project_name/Data/clean/File_concat"),
    **inputs
)
def compute(output, **kwargs):
    # Extract dataframes from input datasets
    input_dfs = [dataset_df.dataframe() for dataset_name, dataset_df in kwargs.items()]
    dfs = []

    def transformation_input(df):
        # some transformation
        return df

    for input_df in input_dfs:
        dfs.append(transformation_input(input_df))

    # dfs = reduce(DataFrame.unionByName, dfs)
    unioned_dfs = union_many(*dfs)
    # with the transform decorator the result is written to the output explicitly
    output.write_dataframe(unioned_dfs)
A couple of points:
Created a dynamic input dict.
That dict is read into the transform using **kwargs.
Using the transform decorator, not transform_df, we can extract the dataframes.
(Not in the question) Combine multiple dataframes using the union_many function from the transforms.verbs library.

Pytest: Create reusable code in conftest.py

Due to the way pytest works, it is not possible (or recommended) to import other modules in a pytest module. Instead, one should edit the conftest.py file.
Several times I have been put in a situation where I need to share constants/functions across several test modules, and fixtures fail to be as practical as functions. Even if they can be indirectly parametrized with the indirect parameter, there are still situations where it is not possible, or not simple, to use this approach.
For constants, I am in the following situation, here is an extract of my conftest.py:
import pytest

TARGET_NAME_1 = 'MY_OP4510'
TARGET_NAME_2 = 'MY_ML605'
TARGET_NAME_3 = 'TARGET_WITH_CHILD'
CONFIG_FILE_NAME = 'config.ini'


@pytest.fixture()
def target_name_1():
    """This fixture returns a target name"""
    return TARGET_NAME_1


@pytest.fixture()
def target_name_2():
    """This fixture returns a target name"""
    return TARGET_NAME_2


@pytest.fixture()
def target_name_3():
    """This fixture returns a target name"""
    return TARGET_NAME_3


@pytest.fixture()
def target_config_path():
    """This fixture returns the config path"""
    return CONFIG_FILE_NAME
Every time I have to add a constant, I have to add a fixture. Also, this increases the number of parameters the test functions will receive (even if in this case I could use the autouse parameter; for some other fixtures that actually execute code, I do not necessarily want to auto-use them, as they could prevent other test cases from working).
I am looking for a way to simplify this code; would you have a good pattern/implementation to suggest?
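Not an authoritative answer, but one common simplification (a sketch; the constants fixture name is arbitrary) is to group the constants in a single namespace and expose them through one fixture:
from types import SimpleNamespace

import pytest

CONSTANTS = SimpleNamespace(
    TARGET_NAME_1='MY_OP4510',
    TARGET_NAME_2='MY_ML605',
    TARGET_NAME_3='TARGET_WITH_CHILD',
    CONFIG_FILE_NAME='config.ini',
)


@pytest.fixture()
def constants():
    """Single fixture exposing all shared constants."""
    return CONSTANTS


# in a test module:
# def test_target(constants):
#     assert constants.TARGET_NAME_1 == 'MY_OP4510'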

How to embed IPython 0.12 so that it inherits the namespace of the caller?

EDIT: I isolated a real minimal example which does not work (it is part of more complex code); the culprit is the inputhook part after all:
def foo():
    exec 'a=123' in globals()
    from IPython.frontend.terminal.embed import InteractiveShellEmbed
    ipshell = InteractiveShellEmbed()
    ipshell()

# without inputhook, 'a' is found just fine
import IPython.lib.inputhook
IPython.lib.inputhook.enable_gui(gui='qt4')

foo()
Running with 0.12:
In [1]: a
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
/tmp/<ipython-input-1-60b725f10c9c> in <module>()
----> 1 a
NameError: name 'a' is not defined
What would be the way around this?
The problem is due to this call to InteractiveShell.instance() in the qt integration, when called before IPython is initialized. If this is called before your embedded shell is created, then some assumptions are not met. The fix is to create your embedded shell object first, then you shouldn't have any issue. And you can retrieve the same object from anywhere else in your code by simply calling InteractiveShellEmbed.instance().
This version should work just fine, by creating the InteractiveShellEmbed instance first:
from IPython.frontend.terminal.embed import InteractiveShellEmbed

# create ipshell *before* calling enable_gui
# it is important that you use instance(), instead of the class
# constructor, so that it creates the global InteractiveShell singleton
ipshell = InteractiveShellEmbed.instance()

import IPython.lib.inputhook
IPython.lib.inputhook.enable_gui(gui='tk')

def foo():
    # without inputhook, 'a' is found just fine
    exec 'a=123' in globals()
    # all calls to instance() will always return the same object
    ipshell = InteractiveShellEmbed.instance()
    ipshell()
foo()