How to modify and fetch from map in Cython? [duplicate]

I was wondering if this was possible to iterate through a map directly in Cython code, ie, in the .pyx.
Here is my example:
import cython
cimport cython
from libcpp.map cimport map as mapcpp

def it_through_map(dict mymap_of_int_int):
    # python dict to map
    cdef mapcpp[int, int] mymap_in = mymap_of_int_int
    cdef mapcpp[int, int].iterator it = mymap_in.begin()
    while(it != mymap_in.end()):
        # let's pretend here I just want to print the key and the value
        print(it.first)   # Not working
        print(it.second)  # Not working
        it ++             # Not working
This does not compile: Object of type 'iterator' has no attribute 'first'
I have used the map container in C++ before, but for this code I am trying to stick to Cython/Python. Is it possible here?
Resolved by DavidW
Here is a working version of the code, following DavidW's answer:
import cython
cimport cython
from libcpp.map cimport map as mapcpp
from cython.operator cimport dereference, postincrement

def it_through_map(dict mymap_of_int_int):
    # python dict to map
    cdef mapcpp[int, int] mymap_in = mymap_of_int_int
    cdef mapcpp[int, int].iterator it = mymap_in.begin()
    while(it != mymap_in.end()):
        # let's pretend here I just want to print the key and the value
        print(dereference(it).first)   # print the key
        print(dereference(it).second)  # print the associated value
        postincrement(it)              # increment the iterator to the next element

The map iterator doesn't have members first and second. Instead it has an operator* which returns a pair reference. In C++ you can use it->first to do this in one go, but that syntax doesn't work in Cython (and it isn't intelligent enough to decide to use -> instead of . by itself in this case).
Instead you use cython.operator.dereference:
from cython.operator cimport dereference
# ...
print(dereference(it).first)
Similarly, it++ can be done with cython.operator.postincrement


How can I get my dataset's name in code repository

When combining multiple datasets in Python in a Code Repository, I want to put the dataset name in the first column, but I couldn't figure out how to do it by accessing its path:
@transform_df(
    Output("/folder/folder1/datasets/mydatset"),
    df1=Input("A"),
    df2=Input("B"),
)
def compute(df1, df2):
    print(list(filter(os.path.isfile, os.listdir())))
How can I get my dataset name from within a transform?
This is not possible using the @transform_df decorator. However, it is possible using the more powerful @transform decorator.
API Documentation for @transform
Using @transform will cause your function arguments to become of type TransformInput rather than dataframes directly; TransformInput has a path property. Note that you will also need to reference and write to the output dataset manually when using @transform.
For example:
@transform(
    out=Output("/path/to/my/output"),
    inp1=Input("/path/to/my/input1"),
    inp2=Input("/path/to/my/input2"),
)
def compute(out, inp1, inp2):
    # Add columns containing dataset paths.
    # (Assumes the usual `from pyspark.sql import functions as F` import and
    # a union helper such as `union_many`, e.g. from transforms.verbs.dataframes.)
    df1 = inp1.dataframe().withColumn("dataset_path", F.lit(inp1.path))
    df2 = inp2.dataframe().withColumn("dataset_path", F.lit(inp2.path))
    # For example.
    result = union_many(df1, df2, how="strict")
    # Write output manually
    out.write_dataframe(result)
However note that a dataset's path is an unstable identifier. If someone were to move or rename these inputs, it could cause unintended behaviour in your pipeline.
For this reason, for a production pipeline I would generally recommend using a more stable identifier. Either a manually chosen hard-coded one (in this case you can use @transform_df again):
@transform_df(
    Output("/path/to/my/output"),
    df1=Input("/path/to/my/input1"),
    df2=Input("/path/to/my/input2"),
)
def compute(df1, df2):
    df1 = df1.withColumn("input_dataset", F.lit("input_1"))
    df2 = df2.withColumn("input_dataset", F.lit("input_2"))
    # ...etc
or the dataset's RID, using inp1.rid instead of inp1.path.
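For instance, in the @transform example above this is a one-token change (sketch):
df1 = inp1.dataframe().withColumn("dataset_rid", F.lit(inp1.rid))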
Note that if you have a large number of inputs, all of these methods can be made neater using python's varargs syntax and comprehensions:
# Using path or rid
@transform(
    out=Output("/path/to/my/output"),
    inp1=Input("/path/to/my/input1"),
    inp2=Input("/path/to/my/input2"),
    # and many more...
)
def compute(out, **inps):
    # Add columns containing dataset rids (or paths).
    dfs = [
        inp.dataframe().withColumn("dataset_rid", F.lit(inp.rid))
        for key, inp in inps.items()
    ]
    # For example
    result = union_many(*dfs, how="strict")
    out.write_dataframe(result)
# Using manual keys, we can reuse the argument names as keys.
@transform_df(
    Output("/path/to/my/output"),
    df1=Input("/path/to/my/input1"),
    df2=Input("/path/to/my/input2"),
    # and many more...
)
def compute(**dfs):
    # Add columns containing dataset keys.
    dfs = [
        df.withColumn("dataset_key", F.lit(key))
        for key, df in dfs.items()
    ]
    # For example
    return union_many(*dfs, how="strict")

Automation of a Python file in bash [duplicate]

In Python, how can we find out the command line arguments that were provided for a script, and process them?
For some more specific examples, see Implementing a "[command] [action] [parameter]" style command-line interfaces? and How do I format positional argument help using Python's optparse?.
import sys
print("\n".join(sys.argv))
sys.argv is a list that contains all the arguments passed to the script on the command line. sys.argv[0] is the script name.
Basically,
import sys
print(sys.argv[1:])
The canonical solution in the standard library is argparse (docs):
Here is an example:
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("-f", "--file", dest="filename",
                    help="write report to FILE", metavar="FILE")
parser.add_argument("-q", "--quiet",
                    action="store_false", dest="verbose", default=True,
                    help="don't print status messages to stdout")
args = parser.parse_args()
argparse supports (among other things):
Multiple options in any order.
Short and long options.
Default values.
Generation of a usage help message.
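For example, the parser built above generates a help message roughly like this (the script name is hypothetical, and the exact layout varies across Python versions):
$ python report.py --help
usage: report.py [-h] [-f FILE] [-q]

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  write report to FILE
  -q, --quiet           don't print status messages to stdout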
Just going around evangelizing for argparse, which is better for these reasons... essentially:
(copied from the link)
The argparse module can handle positional and optional arguments, while optparse can handle only optional arguments.
argparse isn't dogmatic about what your command line interface should look like: options like -file or /file are supported, as are required options. optparse refuses to support these features, preferring purity over practicality.
argparse produces more informative usage messages, including command-line usage determined from your arguments, and help messages for both positional and optional arguments. The optparse module requires you to write your own usage string, and has no way to display help for positional arguments.
argparse supports actions that consume a variable number of command-line args, while optparse requires that the exact number of arguments (e.g. 1, 2, or 3) be known in advance.
argparse supports parsers that dispatch to sub-commands, while optparse requires setting allow_interspersed_args and doing the parser dispatch manually.
And my personal favorite:
argparse allows the type and action parameters to add_argument() to be specified with simple callables, while optparse requires hacking class attributes like STORE_ACTIONS or CHECK_METHODS to get proper argument checking.
There is also the argparse stdlib module (an "improvement" on stdlib's optparse module). Example from the introduction to argparse:
# script.py
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'integers', metavar='int', type=int, choices=range(10),
        nargs='+', help='an integer in the range 0..9')
    parser.add_argument(
        '--sum', dest='accumulate', action='store_const', const=sum,
        default=max, help='sum the integers (default: find the max)')
    args = parser.parse_args()
    print(args.accumulate(args.integers))
Usage:
$ script.py 1 2 3 4
4
$ script.py --sum 1 2 3 4
10
If you need something fast and not very flexible
main.py:
import sys
first_name = sys.argv[1]
last_name = sys.argv[2]
print("Hello " + first_name + " " + last_name)
Then run python main.py James Smith
to produce the following output:
Hello James Smith
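Note that this crashes with an IndexError when fewer than two arguments are supplied; a minimal guard (an addition, not part of the original snippet):
import sys

if len(sys.argv) < 3:
    sys.exit("usage: main.py FIRST_NAME LAST_NAME")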
The docopt library is really slick. It builds an argument dict from the usage string for your app.
Eg from the docopt readme:
"""Naval Fate.
Usage:
naval_fate.py ship new <name>...
naval_fate.py ship <name> move <x> <y> [--speed=<kn>]
naval_fate.py ship shoot <x> <y>
naval_fate.py mine (set|remove) <x> <y> [--moored | --drifting]
naval_fate.py (-h | --help)
naval_fate.py --version
Options:
-h --help Show this screen.
--version Show version.
--speed=<kn> Speed in knots [default: 10].
--moored Moored (anchored) mine.
--drifting Drifting mine.
"""
from docopt import docopt
if __name__ == '__main__':
arguments = docopt(__doc__, version='Naval Fate 2.0')
print(arguments)
One way to do it is using sys.argv. This will print the script name as the first argument and all the other parameters that you pass to it.
import sys

for arg in sys.argv:
    print(arg)

# set default args as -h, if no args:
if len(sys.argv) == 1:
    sys.argv[1:] = ["-h"]
I use optparse myself, but really like the direction Simon Willison is taking with his recently introduced optfunc library. It works by:
"introspecting a function
definition (including its arguments
and their default values) and using
that to construct a command line
argument parser."
So, for example, this function definition:
def geocode(s, api_key='', geocoder='google', list_geocoders=False):
is turned into this optparse help text:
Options:
  -h, --help            show this help message and exit
  -l, --list-geocoders
  -a API_KEY, --api-key=API_KEY
  -g GEOCODER, --geocoder=GEOCODER
I like getopt from stdlib, e.g.:
import getopt
import sys

try:
    opts, args = getopt.getopt(sys.argv[1:], 'h', ['help'])
except getopt.GetoptError as err:
    usage(err)  # usage() is assumed to be defined elsewhere

for opt, arg in opts:
    if opt in ('-h', '--help'):
        usage()

if len(args) != 1:
    usage("specify thing...")
Lately I have been wrapping something similar to this to make things less verbose (e.g. making "-h" implicit).
As the optparse docs say: "The optparse module is deprecated and will not be developed further; development will continue with the argparse module."
Pocoo's click is more intuitive, requires less boilerplate, and is at least as powerful as argparse.
The only weakness I've encountered so far is that you can't do much customization to help pages, but that usually isn't a requirement and docopt seems like the clear choice when it is.
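A minimal sketch of what click code looks like (assumes pip install click; the command and option names here are illustrative, not from the original answer):
import click

@click.command()
@click.option("--count", default=1, help="Number of greetings.")
@click.argument("name")
def hello(count, name):
    """Greet NAME the given number of times."""
    for _ in range(count):
        click.echo(f"Hello, {name}!")

if __name__ == "__main__":
    hello()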
import argparse

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('integers', metavar='N', type=int, nargs='+',
                    help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const',
                    const=sum, default=max,
                    help='sum the integers (default: find the max)')
args = parser.parse_args()
print(args.accumulate(args.integers))
Assuming the Python code above is saved into a file called prog.py
$ python prog.py -h
Ref-link: https://docs.python.org/3.3/library/argparse.html
You may be interested in a little Python module I wrote to make handling of command line arguments even easier (open source and free to use) - Commando
Yet another option is argh. It builds on argparse, and lets you write things like:
import argh

# declaring:
def echo(text):
    "Returns given word as is."
    return text

def greet(name, greeting='Hello'):
    "Greets the user with given name. The greeting is customizable."
    return greeting + ', ' + name

# assembling:
parser = argh.ArghParser()
parser.add_commands([echo, greet])

# dispatching:
if __name__ == '__main__':
    parser.dispatch()
It will automatically generate help and so on, and you can use decorators to provide extra guidance on how the arg-parsing should work.
I recommend looking at docopt as a simple alternative to these others.
docopt is a new project that works by parsing your --help usage message rather than requiring you to implement everything yourself. You just have to put your usage message in the POSIX format.
Also, with Python 3 you might find it convenient to use Extended Iterable Unpacking to handle optional positional arguments without additional dependencies:
import sys

try:
    _, arg1, arg2, arg3, *_ = sys.argv + [None] * 2
except ValueError:
    print("Not enough arguments", file=sys.stderr)  # an unhandled exception traceback is meaningful enough, too
    exit(-1)
The above argv unpack makes arg2 and arg3 "optional": if they are not specified in argv, they will be None, while if the first is not specified, a ValueError will be thrown:
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    _, arg1, arg2, arg3, *_ = sys.argv + [None] * 2
ValueError: not enough values to unpack (expected at least 4, got 3)
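When enough values are present, the trailing names are simply filled in; an illustrative run (the simulated argv is my own):
import sys

sys.argv = ["test.py", "a", "b"]  # simulate a command line, for illustration
_, arg1, arg2, arg3, *_ = sys.argv + [None] * 2
print(arg1, arg2, arg3)  # -> a b None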
My solution is entrypoint2. Example:
from entrypoint2 import entrypoint

@entrypoint
def add(file, quiet=True):
    ''' This function writes report.
    :param file: write report to FILE
    :param quiet: don't print status messages to stdout
    '''
    print(file, quiet)
help text:
usage: report.py [-h] [-q] [--debug] file

This function writes report.

positional arguments:
  file         write report to FILE

optional arguments:
  -h, --help   show this help message and exit
  -q, --quiet  don't print status messages to stdout
  --debug      set logging level to DEBUG
import sys

# Command line arguments are stored in sys.argv
# print(sys.argv[1:])
# I used the slice [1:] to print all the elements except the first,
# because the first element of sys.argv is the program name.
# So the first argument is sys.argv[1], the second is sys.argv[2], etc.
print("File name: " + sys.argv[0])
print("Arguments:")
for i in sys.argv[1:]:
    print(i)
Let's name this file command_line.py and let's run it:
C:\Users\simone> python command_line.py arg1 arg2 arg3 ecc
File name: command_line.py
Arguments:
arg1
arg2
arg3
ecc
Now let's write a simple program, sum.py:
import sys

try:
    print(sum(map(float, sys.argv[1:])))
except ValueError:
    print("An error has occurred")
Result:
C:\Users\simone> python sum.py 10 4 6 3
23
This handles simple switches, value switches with optional alternative flags.
import sys

# [IN] argv - array of args
# [IN] switch - switch to seek
# [IN] val - expecting value
# [IN] alt - switch alternative
# returns value, or True if val not expected
def parse_cmd(argv, switch, val=None, alt=None):
    for idx, x in enumerate(argv):
        if x == switch or x == alt:
            if val:
                if len(argv) > (idx + 1):
                    if not argv[idx + 1].startswith('-'):
                        return argv[idx + 1]
            else:
                return True

# expecting a value for -i
i = parse_cmd(sys.argv[1:], "-i", True, "--input")
# no value needed for -p
p = parse_cmd(sys.argv[1:], "-p")
Several of our biotechnology clients have posed these two questions recently:
How can we execute a Python script as a command?
How can we pass input values to a Python script when it is executed as a command?
I have included a Python script below which I believe answers both questions. Let's assume the following Python script is saved in the file test.py:
#
# ----------------------------------------------------------------------
#
# file name: test.py
#
# input values: data  - location of data to be processed
#               date  - date data were delivered for processing
#               study - name of the study where data originated
#               logs  - location where log files should be written
#
# macOS usage:
#
#   python3 test.py "/Users/lawrence/data" "20220518" "XYZ123" "/Users/lawrence/logs"
#
# Windows usage:
#
#   python test.py "D:\data" "20220518" "XYZ123" "D:\logs"
#
# ----------------------------------------------------------------------
#
# import needed modules...
#
import sys
import datetime

def main(argv):
    #
    # print message that process is starting...
    #
    print("test process starting at", datetime.datetime.now().strftime("%Y%m%d %H:%M"))
    #
    # set local values from input values...
    #
    data = sys.argv[1]
    date = sys.argv[2]
    study = sys.argv[3]
    logs = sys.argv[4]
    #
    # print input arguments...
    #
    print("data value is", data)
    print("date value is", date)
    print("study value is", study)
    print("logs value is", logs)
    #
    # print message that process is ending...
    #
    print("test process ending at", datetime.datetime.now().strftime("%Y%m%d %H:%M"))

#
# call main() to begin processing...
#
if __name__ == '__main__':
    main(sys.argv)
The script can be executed on a macOS computer in a Terminal shell as shown below and the results will be printed to standard output (be sure the current directory includes the test.py file):
$ python3 test.py "/Users/lawrence/data" "20220518" "XYZ123" "/Users/lawrence/logs"
test process starting at 20220518 16:51
data value is /Users/lawrence/data
date value is 20220518
study value is XYZ123
logs value is /Users/lawrence/logs
test process ending at 20220518 16:51
The script can also be executed on a Windows computer in a Command Prompt as shown below and the results will be printed to standard output (be sure the current directory includes the test.py file):
D:\scripts>python test.py "D:\data" "20220518" "XYZ123" "D:\logs"
test process starting at 20220518 17:20
data value is D:\data
date value is 20220518
study value is XYZ123
logs value is D:\logs
test process ending at 20220518 17:20
This script answers both questions posed above and is a good starting point for developing scripts that will be executed as commands with input values.
Reason for the new answer:
The existing answers cover multiple options. The standard option is argparse; a few answers give examples from its documentation, and one answer explains its advantages. But none of them explain the answer clearly enough for the OP's actual question, at least for newbies.
An example of argparse:
import argparse
def load_config(conf_file):
pass
if __name__ == '__main__':
parser = argparse.ArgumentParser()
//Specifies one argument from the command line
//You can have any number of arguments like this
parser.add_argument("conf_file", help="configuration file for the application")
args = parser.parse_args()
config = load_config(args.conf_file)
The above program expects a config file as an argument. If you provide it, it will execute happily. If not, it will print the following:
usage: test.py [-h] conf_file
test.py: error: the following arguments are required: conf_file
You can also mark an argument as optional.
You can specify the expected type for the argument using the type key:
parser.add_argument("age", type=int, help="age of the person")
You can specify a default value for an argument with the default key.
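For example (the option name here is illustrative):
parser.add_argument("--retries", type=int, default=3,
                    help="number of retries (default: 3)")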
This document will help you to understand it to an extent.

How to handle a large JSON file in PyTorch?

I am working on a time series problem. Different training time series are stored in a large JSON file of about 30 GB. In TensorFlow I know how to use TFRecords. Is there a similar way in PyTorch?
I suppose IterableDataset (docs) is what you need, because:
you probably want to traverse files without random access;
the number of samples in the JSON files is not pre-computed.
I've made a minimal usage example with the assumption that every line of the dataset file is a JSON object itself, but you can change the logic.
import json
from torch.utils.data import DataLoader, IterableDataset

class JsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        for json_file in self.files:
            with open(json_file) as f:
                for sample_line in f:
                    sample = json.loads(sample_line)
                    yield sample['x'], sample['time'], ...

...

dataset = JsonDataset(['data/1.json', 'data/2.json', ...])
dataloader = DataLoader(dataset, batch_size=32)

for batch in dataloader:
    y = model(batch)
Generally, you do not need to change/overload the default data.DataLoader.
What you should look into is how to create a custom data.Dataset.
Once you have your own Dataset that knows how to extract items one by one from the json file, you feed it to the "vanilla" data.DataLoader and all the batching/multi-processing etc. is done for you based on the dataset you provide.
If, for example, you have a folder with several json files, each containing several examples, you can have a Dataset along these lines (the parts the original answer sketched as pseudocode are filled in here, assuming each file holds a single JSON list of examples):
import bisect
import glob
import json
import os

from torch.utils import data

class MyJsonsDataset(data.Dataset):
    def __init__(self, jfolder):
        super(MyJsonsDataset, self).__init__()
        self.filenames = []  # keep track of the jfiles you need to load
        self.cumulative_sizes = [0]  # keep track of number of examples viewed so far
        # Assumption: each json file holds a list of examples.
        for jsonfile in sorted(glob.glob(os.path.join(jfolder, "*.json"))):
            self.filenames.append(jsonfile)
            with open(jsonfile) as f:
                l = len(json.load(f))  # number of examples in jsonfile
            self.cumulative_sizes.append(self.cumulative_sizes[-1] + l)
        # discard the first element
        self.cumulative_sizes.pop(0)

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, idx):
        # first you need to know which of the files holds the idx example
        jfile_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        if jfile_idx == 0:
            sample_idx = idx
        else:
            sample_idx = idx - self.cumulative_sizes[jfile_idx - 1]
        # now retrieve the `sample_idx` example from self.filenames[jfile_idx]
        with open(self.filenames[jfile_idx]) as f:
            return json.load(f)[sample_idx]
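For completeness, a usage sketch under the same assumptions (the folder name is hypothetical):
from torch.utils.data import DataLoader

dataset = MyJsonsDataset("data/jsons")
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    ...  # feed the batch to your model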

GHC API: Find all functions (and their types) in scope

I'm trying to list (print) all functions (and their types) that are in scope for a given module.
For example, I have this module:
{-# LANGUAGE NoImplicitPrelude #-}
module Reverse where
import Prelude ((==), String)
myString :: String
myString = "string"
I'm trying to use the GHC API (8.0.2), but I can't seem to find where the information that I'm looking for is stored.
I have managed to find all functions in scope (but not their types) with this code:
import Data.IORef
import DynFlags
import GHC
import GHC.LanguageExtensions
import GHC.Paths (libdir)
import HscTypes
import NameEnv
import OccName
import Outputable
import RdrName
import TcRnTypes
main = runGhc (Just libdir) $ do
  dflags <- getSessionDynFlags
  let compdflags =
        (foldl xopt_set dflags [Cpp, ImplicitPrelude, MagicHash])
  setSessionDynFlags compdflags
  target <- guessTarget "Reverse.hs" Nothing
  setTargets [target]
  load LoadAllTargets
  modSum <- getModSummary $ mkModuleName "Reverse"
  parsedModule <- parseModule modSum
  tmod <- typecheckModule parsedModule
  let (tcenv, moddets) = tm_internals_ tmod
  printO $ map (map gre_name) $ occEnvElts $ tcg_rdr_env tcenv
printO
  :: (GhcMonad m, Outputable a)
  => a -> m ()
printO a = do
  dfs <- getProgramDynFlags
  liftIO $ putStrLn $ showPpr dfs a
I get this output:
[[String], [==], [myString]]
Of course, this is only half of the way to the data I need.
The GHC API is rather confusing: you have to get used to a lot of abbreviations, type synonyms and a style-heterogeneous codebase. But finding the name and type of everything in scope should be possible; otherwise GHC couldn't tell you that a function is not in scope when you make a typo.
Indeed, once you have type-checked a module, all the relevant information is available.
First you need all the Names of your functions, which you can get with this code:
parsedModule <- parseModule modSum
tmod <- typecheckModule parsedModule
let (tcenv, moddets) = tm_internals_ tmod
let names = concatMap (map gre_name) $ occEnvElts $ tcg_rdr_env tcenv
Then you need to lookup the Names to get TyThings with lookupName.
You'll get a Maybe TyThing (Nothing if the Name is not found), and when the name refers to a function, the TyThing will be AnId i where i is the thing you're looking for.
An Id is just a name with a type.
You can then get the type with varType.
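Putting those steps together, a sketch against the same GHC 8.0 module names as above (you may additionally need to import varType from Var; non-Id things such as type constructors are skipped):
printNamesAndTypes :: GhcMonad m => [Name] -> m ()
printNamesAndTypes names = do
  -- look up each Name and print its type
  things <- mapM lookupName names
  sequence_
    [ printO (name, varType i)
    | (name, Just (AnId i)) <- zip names things ]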
You could argue that all these types make this problem a lot harder, but they've made it possible for me to figure out what I need to do without looking into the code and with no documentation.

Defining module variables from functions

I've finally been getting into Python, and have noticed something strange that works in Java but not in Python.
When I type the following:
fn = "" # Local filename storage.
def read(filename):
fn = filename
return open(filename, 'r').read()
My flake8 linter for Atom gives me the following error:
F841 - local variable 'fn' is assigned to but never used.
I'm assuming this means that the variable is being defined at the def level, not the module level, which is what I intend. Please correct me if I'm wrong.
I've searched Google, with multiple wordings, but can't seem to word it in a way that the correct results display...
Any ideas on how I can be able to achieve module-level variable definitions from the function-level?
If you want to declare fn as a global (module-level) variable, use the global statement.
def read(filename):
    global fn  # <-----
    fn = filename
    return open(filename, 'r').read()
BTW, the trailing ; (if you are carrying it over from Java) is unnecessary in Python. Don't use it.
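With that change, calling the function rebinds the module-level name (a quick check; the file name is hypothetical):
read("notes.txt")  # hypothetical file
print(fn)          # -> notes.txt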
You can set a module level variable from the function by doing:
import sys

def read(filename):
    module = sys.modules[__name__]
    setattr(module, 'fn', filename)
    return open(filename, 'r').read()
However, it's a very strange thing to need. Consider changing your architecture.
UPD: Let's consider an example:
# module1
# uncomment it to fix NameError and AttributeError
# some_var = ''

def foo(val):
    global some_var
    some_var = val

# module2
from module1 import *

print(some_var)  # raises NameError: name 'some_var' is not defined
foo('bar')
print(some_var)  # still raises NameError: name 'some_var' is not defined

# module3
import module1

print(module1.some_var)  # raises AttributeError: 'module' object has no attribute 'some_var'
module1.foo('bar')
print(module1.some_var)  # prints 'bar' even without the some_var = '' definition in module1
So, it's not so obvious how global behaves during the import process. I think that manually doing setattr(module, 'attr_name', value) during the read() call is clearer.