How can I implement a Fisher-Yates shuffle in Scala without side effects? - scalaz

I want to implement the Fisher-Yates algorithm (an in-place array shuffle) without side effects by using an STArray for the local mutation effects, and a functional random number generator
type RNG[A] = State[Seed,A]
to produce the random integers needed by the algorithm.
I have a method def intInRange(max: Int): RNG[Int] which I can use to produce a random Int in [0,max).
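(For reference, a minimal sketch of what such a generator might look like — the Seed representation and the LCG constants below are illustrative assumptions, not my actual implementation:)
case class Seed(value: Long) {
  def next: Seed = Seed((value * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1))
}

def intInRange(max: Int): RNG[Int] =
  State { seed =>
    val s2 = seed.next
    (s2, ((s2.value >>> 16) % max).toInt)  // non-negative value in [0, max)
  }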
From Wikipedia:
To shuffle an array a of n elements (indices 0..n-1):
  for i from n − 1 downto 1 do
    j ← random integer such that 0 ≤ j ≤ i
    exchange a[j] and a[i]
I suppose I need to stack State with ST somehow, but this is confusing to me. Do I need a StateT[ST[S, ?], Seed, A] for some S? Do I have to rewrite RNG to use StateT as well?
(Edit) I don't want to involve IO, and I don't want to substitute Vector for STArray because the shuffle wouldn't be performed in-place.
I know there is a Haskell implementation here, but I'm not currently capable of understanding and porting this to Scalaz. But maybe you can? :)
Thanks in advance.

You have lots of options. One simple (but not very principled) one would be just to lift both the Rng and ST operations into IO and then work with them together there. Another would be to use both an STRef[Long] and an STArray in the same ST. Another would be to use a State[(Long, Vector[A]), ?].
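For what it's worth, here is a rough sketch of that last option, where the seed and the vector travel together as a single piece of state. The LCG step is just an illustrative stand-in for your RNG, not part of the question:
def lcgSwap[A](i: Int): State[(Long, Vector[A]), Unit] = State { sv =>
  val seed = sv._1 * 0x5DEECE66DL + 0xBL             // step the seed
  val j = ((seed >>> 16) % (i + 1)).toInt            // j in [0, i]
  val v = sv._2
  ((seed, v.updated(i, v(j)).updated(j, v(i))), ())  // exchange a(i) and a(j)
}

def shufflePair[A](n: Int): State[(Long, Vector[A]), Unit] =
  (n - 1 to 1 by -1).foldLeft(State.state[(Long, Vector[A]), Unit](())) {
    (acc, i) => acc.flatMap(_ => lcgSwap[A](i))
  }
Running shufflePair(xs.size).exec((seed0, xs)) then gives you back the final seed together with the shuffled vector.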
You could also use a StateT[State[Long, ?], Vector[A], ?] but that would be kind of pointless. You could probably use a StateT (for the RNG state) over an ST (for the array), but again, I don't really see the point.
It's possible to do this pretty cleanly without side effects with just Rng, though. For example, using NICTA's RNG library:
import com.nicta.rng._, scalaz._, Scalaz._
def shuffle[A](xs: Vector[A]): Rng[Vector[A]] =
  (xs.size - 1 to 1 by -1).toVector.traverseU(
    i => Rng.chooseint(0, i).map((i, _))
  ).map {
    _.foldLeft(xs) {
      case (v, (i, j)) =>
        val tmp = v(i)
        v.updated(i, v(j)).updated(j, tmp)
    }
  }
Here you just pick all your swap operations in the Rng monad, and then fold over them with your collection as the accumulator, swapping as you go.

Here is a more or less direct translation from the Haskell version you linked that uses a mutable STArray. The Scalaz STArray doesn't have an exact equivalent of the listArray function, so I've made one up. Otherwise, it's a straightforward transliteration:
import scalaz._
import scalaz.effect.{ST, STArray}
import ST._
import State._
import syntax.traverse._
import std.list._
def shuffle[A: Manifest](xs: List[A]): RNG[List[A]] = {
  def newArray[S](n: Int, as: List[A]): ST[S, STArray[S, A]] =
    if (n <= 0) newArr(0, null.asInstanceOf[A])
    else for {
      r <- newArr[S, A](n, as.head)
      _ <- r.fill((_, a: A) => a, as.zipWithIndex.map(_.swap))
    } yield r
  for {
    seed <- get[Seed]
    n = xs.length
    r <- runST(new Forall[({type λ[σ] = ST[σ, RNG[List[A]]]})#λ] {
      def apply[S] = for {
        g <- newVar[S](seed)
        randomRST = (lo: Int, hi: Int) => for {
          p <- g.read.map(intInRange(hi - lo).apply)
          (a, sp) = p
          _ <- g.write(sp)
        } yield a + lo
        ar <- newArray[S](n, xs)
        xsp <- Range(0, n).toList.traverseU { i => for {
          j <- randomRST(i, n)
          vi <- ar read i
          vj <- ar read j
          _ <- ar.write(j, vi)
        } yield vj }
        genp <- g.read
      } yield put(genp).map(_ => xsp)
    })
  } yield r
}
Although the asymptotics of using a mutable array might be good, do note that the constant factors of the ST monad in Scala are quite large. You may be better off just doing this in a monolithic block using regular mutable arrays. The overall shuffle function remains pure because all of your mutable state is local.
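For example, here is a sketch of that monolithic approach (a sketch only: it assumes your RNG[A] = State[Seed, A] and your intInRange, and a scalaz State whose run returns a (state, value) pair):
def shuffleLocal[A](xs: Vector[A]): RNG[Vector[A]] = State { seed0 =>
  val buf = xs.toBuffer  // mutable, but never escapes this function
  var seed = seed0
  var i = buf.size - 1
  while (i >= 1) {
    val p = intInRange(i + 1).run(seed)  // assumes run: Seed => (Seed, Int)
    seed = p._1
    val j = p._2                         // j in [0, i]
    val tmp = buf(i); buf(i) = buf(j); buf(j) = tmp
    i -= 1
  }
  (seed, buf.toVector)
}
The function is observably pure: the buffer and the loop variables are allocated, mutated, and discarded entirely within a single call.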

This is almost the same as Travis' solution; the only difference is that it uses the State monad to apply the exchanges. I wanted to find a minimal set of imports, but I finally gave up:
import com.nicta.rng.Rng
import scalaz._
import Scalaz._
object FisherYatesShuffle {
  def randomJ(i: Int): Rng[Int] = Rng.chooseint(0, i)

  type Exchange = (Int, Int)

  def applyExchange[A](exchange: Exchange)(l: Vector[A]): Vector[A] = {
    val (i, j) = exchange
    val vi = l(i)
    l.updated(i, l(j)).updated(j, vi)
  }

  def stApplyExchange[A](exchange: Exchange): State[Vector[A], Unit] =
    State.modify(applyExchange(exchange))

  def shuffle[A](l: Vector[A]): Rng[Vector[A]] = {
    val rngExchanges: Rng[Vector[Exchange]] =
      (l.length - 1 to 1 by -1).toVector.traverseU { i =>
        for {
          j <- randomJ(i)
        } yield (i, j)
      }
    for {
      exchanges <- rngExchanges
    } yield exchanges.traverseU(stApplyExchange[A]).exec(l)
  }
}

Related

numba gpu: how to calculate the max relative error for two arrays?

I want to calculate the maximum relative error between two arrays. The pure numpy code is:
# a1, a2 are the two arrays
np.abs(1 - a2/a1).max()
How can I use numba.cuda to accelerate the above code?
My idea so far:
@cuda.jit
def calculate(a1, a2):
    start = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    grid = cuda.gridDim.x*cuda.blockDim.x
    for id in range(start, a1.size, grid):
        r = abs(1-a2[id]/a1[id])

ca1 = cuda.to_device(a1)
ca2 = cuda.to_device(a2)
But how can I compare the r values between different threads?
One possible method to do this is to write your own shared memory parallel reduction.
As indicated in the comments, another possible method is to use numba's built-in reduce decorator.
Here is an example demonstrating both:
$ cat t79.py
from numba import cuda, float32, vectorize
import numpy as np
from numpy import random

# values of 0..10 are legal here
TPBP2 = 9
TPB = 2**TPBP2
TPBH = TPB//2
ds = 4096

# method 1: standard cuda parallel max-finding reduction
@cuda.jit
def max_error(a1, a2, err):
    s = cuda.shared.array(shape=(TPB), dtype=float32)
    x = cuda.grid(1)
    st = cuda.gridsize(1)
    tx = cuda.threadIdx.x
    s[tx] = 0
    cuda.syncthreads()
    for i in range(x, a1.size, st):
        s[tx] = max(s[tx], abs(1-a2[i]/a1[i]))
    mid = TPBH
    for i in range(TPBP2):
        cuda.syncthreads()
        if tx < mid:
            s[tx] = max(s[tx], s[tx+mid])
        mid >>= 1
    if tx == 0:
        err[cuda.blockIdx.x] = s[0]

# data
# for best performance we should choose blocks based on GPU occupancy
# but for demonstration since we don't know the GPU:
blocks = (ds+TPB-1)//TPB
a1 = np.random.rand(ds).astype(np.float32)
a1 += 1
a2 = np.random.rand(ds).astype(np.float32)
err = np.zeros(blocks).astype(np.float32)
# Start the kernel
max_error[blocks, TPB](a1, a2, err)
# we could perform another stage of GPU reduction here, but for simplicity:
my_err = np.max(err)
print(my_err)

# method 2: using numba features
@vectorize(['float32(float32,float32)'], target='cuda')
def my_error(a1, a2):
    return abs(1-a2/a1)

@cuda.reduce
def max_reduce(a, b):
    return max(a, b)

r = my_error(a1, a2)
my_err = max_reduce(r)
print(my_err)
$ python t79.py
0.9999707
0.9999707
$

How to call a cdef method

I'd like to call my cdef methods and improve the speed of my program as much as possible. I do not want to use cpdef (I explain why below). Ultimately, I'd like to access cdef methods (some of which return void) that are members of my Cython extensions.
I tried following this example, which gives me the impression that I can call a cdef function by making a Python (def) wrapper for it.
I can't reproduce these results, so I tried a different problem for myself (summing all the numbers from 0 to n).
Of course, I'm looking at the documentation, which says
The directive cpdef makes two versions of the method available; one fast for use from Cython and one slower for use from Python.
and later (emphasis mine),
This does slightly more than providing a python wrapper for a cdef method: unlike a cdef method, a cpdef method is fully overridable by methods and instance attributes in Python subclasses. It adds a little calling overhead compared to a cdef method.
So how does one use a cdef function without the extra calling overhead of a cpdef function?
With the code at the end of this question, I get the following results:
def/cdef:
273.04207632583245
def/cpdef:
304.4114626176919
cpdef/cdef:
0.8969507060538783
Somehow, cpdef is faster than cdef. For n < 100, I can occasionally get cpdef/cdef > 1, but it's rare. I think it has to do with wrapping the cdef function in a def function. This is what the example I link to does, but they claim better performance from using cdef than from using cpdef.
I'm pretty sure this is not how you wrap a cdef function while avoiding the additional overhead (the source of which is not clearly documented) of a cpdef.
And now, the code:
setup.py
from setuptools import setup, Extension
from Cython.Build import cythonize

pkg_name = "tmp"

compile_args = ['-std=c++17']

cy_foo = Extension(
    name=pkg_name + '.core.cy_foo',
    sources=[
        pkg_name + '/core/cy_foo.pyx',
    ],
    language='c++',
    extra_compile_args=compile_args,
)

setup(
    name=pkg_name,
    ext_modules=cythonize(cy_foo,
                          annotate=True,
                          build_dir='build'),
    packages=[
        pkg_name,
        pkg_name + '.core',
    ],
)
foo.py
def foo_def(n):
    sum = 0
    for i in range(n):
        sum += i
    return sum
cy_foo.pyx
def foo_cdef(n):
    return foo_cy(n)

cdef int foo_cy(int n):
    cdef int sum = 0
    cdef int i = 0
    for i in range(n):
        sum += i
    return sum

cpdef int foo_cpdef(int n):
    cdef int sum = 0
    cdef int i = 0
    for i in range(n):
        sum += i
    return sum
test.py
import timeit
from tmp.core.foo import foo_def
from tmp.core.cy_foo import foo_cdef
from tmp.core.cy_foo import foo_cpdef

n = 10000

# Python call
start_time = timeit.default_timer()
a = foo_def(n)
pyTime = timeit.default_timer() - start_time

# Call Python wrapper for C function
start_time = timeit.default_timer()
b = foo_cdef(n)
cTime = timeit.default_timer() - start_time

# Call cpdef function, which does more than wrap a cdef function (whatever that means)
start_time = timeit.default_timer()
c = foo_cpdef(n)
cpTime = timeit.default_timer() - start_time

print("def/cdef:")
print(pyTime/cTime)
print("def/cpdef:")
print(pyTime/cpTime)
print("cpdef/cdef:")
print(cpTime/cTime)
The reason for your seemingly anomalous result is that you aren't calling the cdef function foo_cy directly, but instead the def function foo_cdef wrapping it.
When you wrap it inside another def, you are indeed still calling a Python function. However, you should be able to reach results similar to the cpdef version.
Here is what you could do: as in the cpdef version, give types for both the input and the output:
def foo_cdef(int n):
    cdef int val = 0
    val = foo_cy(n)
    return val
This should give results similar to cpdef, but you are still calling a Python function. If you want to call the C function directly, you would have to go through ctypes and call it from there.
As for the benchmarking: written this way, it considers only a single run, which can fluctuate a lot due to other OS tasks and timer resolution. It is better to use the built-in timeit method to average over a number of iterations:
# Python call
pyTime = timeit.timeit('foo_def(n)', globals=globals(), number=10000)
# Call Python wrapper for C function
cTime = timeit.timeit('foo_cdef(n)', globals=globals(), number=10000)
# Call cpdef function, which does more than wrap a cdef function (whatever that means)
cpTime = timeit.timeit('foo_cpdef(n)', globals=globals(), number=10000)
output:
def/cdef:
154.0166154428522
def/cpdef:
154.22669848136132
cpdef/cdef:
0.9986378296327566
This way you get consistent results, and the ratio stays close to 1 whether Cython itself generates the wrapper (cpdef) or we explicitly wrap the cdef function in a Python function.

How do I find the index of a particular element within a List in DAML?

Say I have a List that looks like this:
let identifiers = ["ABC123", "DEF456", "GHI789"]
I want to know the index of the element "DEF456". What's the recommended way to accomplish this?
In DAML 1.2 you can use the elemIndex : Eq a => a -> [a] -> Optional Int function from the DA.List standard library module, like so:
daml 1.2
module MyModule where

import DA.List

indexOfElement = scenario do
  let identifiers = ["ABC123", "DEF456", "GHI789"]
      index : Optional Int = elemIndex "DEF456" identifiers
  assert $ index == Some 1
  return index
The findIndex function in the Base.List module of the standard library does what you want.
daml 1.0
module FindIndex where

import Base.List
import Base.Maybe

test foo : Scenario {} = scenario
  let
    identifiers = ["ABC123", "DEF456", "GHI789"]
    index: Maybe Integer = findIndex ((==) "DEF456") identifiers
  assert $ index == Just 1
Under the hood, most list manipulation in DAML, including findIndex, is implemented using foldr and foldl.
-- Returns the index of the first element in the list satisfying the predicate, or M.Nothing if there is no such element.
def findIndex (f: a -> Bool) (xs: List a) : Maybe Integer =
  headMay (findIndices f xs)

-- Returns the indices of all elements satisfying the predicate, in ascending order.
def findIndices (f: a -> Bool) (xs: List a) =
  let work acc x =
        let i = fst acc
        let is = snd acc
        tuple (i + 1) (if f x then cons i is else is)
  reverse (snd (foldl work (tuple 0 nil) xs))

How to avoid append when computing the intersection of two lists?

I am writing a function to compute the intersection of two sorted arrays (which may contain duplicates). For example, if the input is [0, 3, 7, 7, 7, 9, 12] and [2, 7, 7, 8, 12], the output should be [7, 7, 12].
Here is my code:
cimport cython

@cython.wraparound(False)
@cython.cdivision(True)
@cython.boundscheck(False)
def sorting(int[:] A, int[:] B):
    cdef Py_ssize_t i = 0
    cdef Py_ssize_t j = 0
    cdef int lenA = A.shape[0]
    cdef int lenB = B.shape[0]
    intersect = []
    while (i < lenA and j < lenB):
        if A[i] == B[j]:
            intersect.append(A[i])
            i += 1
            j += 1
        elif A[i] > B[j]:
            j += 1
        elif A[i] < B[j]:
            i += 1
    return intersect
As you will see, I use a list to store the answers and append to add the answers as they arrive. I am happy to return a python or numpy array if that will speed things up.
How can I avoid append to speed up the cython?
For this kind of thing you usually want to pre-allocate the array (it's basically free to shrink it later). In this case it can't be longer than the shortest of your input arrays, so that gives you a starting size:
cdef int[::1] intersect = np.empty(A.shape[0] if A.shape[0] < B.shape[0] else B.shape[0], dtype=np.intc)  # assumes "import numpy as np"
You then just keep a running count of the index you're at in that array (say k), so the append is replaced by:
intersect[k] = A[i]
k += 1
At the end you can either return the memoryview intersect[:k] or convert it to a numpy array with np.asarray(intersect[:k]).
As an aside: I'd remove the Cython directive @cython.cdivision(True) since you aren't doing any division. I believe you should be thinking about whether these directives are useful and if they apply to your code rather than blindly copying them in out of habit.

map over json data with haskell

I'm just learning Haskell, and it seems like all is good; even the scary monads are not a big deal for me. But I can't get to the really practical stuff at all.
The first practical task I chose for Haskell is the following:
Given a JSON document describing some binary file's format, parse that file.
The JSON has a deeply nested structure: lists of associative lists (dictionaries) of lists, etc., with numbers or strings as the endpoints.
So first of all I want to be able to map over those endpoints (to have a functor for the JSON data), in particular to convert some strings to numbers. It would also be nice to be able to fold over all those endpoints.
I came up with some Python code easily, but I can't get anywhere with Haskell.
So what are your suggestions for implementing this in Haskell? It would be really nice to hear advice for solutions that use libraries to the greatest extent possible rather than handwriting everything from scratch.
Thanks in advance!
Edit: here is an example of what I have in Python.
Some helper functions:
islist = lambda l: isinstance(l, collections.Iterable) and not isinstance(l, (str, bytes))
isdict = lambda d: isinstance(d, collections.Mapping)
isiter = lambda i: islist(i) or isdict(i)

def iterable(d):
    if isdict(d):
        i = d.items()
    elif islist(d):
        i = enumerate(d)
    else:
        raise ValueError
    return i
An iterator over nested JSON data:
def nested_iter(nested, f = lambda *args: None):
    for key, value in iterable(nested):
        if not isiter(value):
            f(nested, key)
            yield key, value
        else:
            yield from nested_iter(value, f)
Now I can substitute some numbers with lists of keys:
def list_from_num(d, k):
    if type(d[k]) == int:
        d[k] = [k]*d[k]

list(nested_iter(typedef, list_from_num))
Or I can substitute some strings with other nested data under the same key name:
def nest_dicts(defs, d, k):
    if d[k] in defs.keys():
        d[k] = deepcopy(defs[d[k]])
        if isiter(d[k]):
            list(nested_iter(d[k], partial(nest_dicts, defs)))

list(nested_iter(typedef, partial(nest_dicts, typedef)))
Or I can just flatten the data:
list(nested_iter(d))
Parsing the binary format is a bit more involved, but it is nothing more than passing one more function to the iterator.
Well, this is my solution. It uses Control.Lens, Data.Aeson.Lens, and Control.Lens.Plated.
One can use transform from Uniplate or Lens.Plated to transform values.
For example, to substitute each number with a list of key values whose length is that number:
n2k :: T.Text -> Value -> Value  -- import qualified Data.Text as T
n2k s (Number x)
  | isInteger x = case toBoundedInteger x of
      Just n -> Array (V.replicate n (String s))  -- import qualified Data.Vector as V
      _      -> Number x
  | otherwise = Number x
n2k _ v = v

f (Object o) = Object $ imap n2k o  -- imap from Data.Map.Lens
f x = x

j2 = transform f j  -- transform JSON j using function f
To substitute a string with the data stored under the same key:
-- o is the hashmap in which we look for keys whose values replace the strings
h (String s) = fromMaybe (String s) (H.lookup s o)  -- import qualified Data.HashMap.Lazy as H
h x = x

j2 = transform h j
And to just get all the numbers into a list:
l = [x | Number x <- universe j]