Functional Programming: Does a list only contain unique items? - language-agnostic

I have an unsorted list and want to know whether all items in it are unique.
My naive approach would be:
import scala.collection.immutable.HashSet
val l = List(1, 2, 3, 4, 3)
def isUniqueList(l: List[Int]) = (HashSet.empty[Int] ++ l).size == l.size
Basically, I'm checking whether a Set containing all elements of the list has the same size (since an item appearing twice in the original list will only appear once in the set), but I'm not sure whether this is the ideal solution for this problem.
Edit:
I benchmarked the 3 most popular solutions: l == l.distinct, l.size == l.distinct.size, and Alexey's HashSet-based solution.
Each function was run 1000 times with a unique list of 10 items, a unique list of 10000 items, and the same lists with the item in the third quarter copied to the middle of the list. Before each run, each function was called 1000 times to warm up the JIT, and the whole benchmark was run 5 times before the times were taken with System.currentTimeMillis.
The machine was a C2D P8400 (2.26 GHz) with 3GB RAM; the Java version was the OpenJDK 64-bit server VM (1.6.0_20), and the JVM args were -Xmx1536M -Xms512M.
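A minimal sketch of such a warm-up-and-measure harness (an assumption for illustration, not the exact code used for the numbers below):
def bench(label: String, runs: Int)(f: => Unit): Unit = {
  for (_ <- 1 to 1000) f // warm up the JIT before timing
  val start = System.currentTimeMillis
  for (_ <- 1 to runs) f
  println(label + ": " + (System.currentTimeMillis - start) + " ms")
}
val small = (1 to 10).toList
bench("l.size == l.distinct.size", 1000)(small.size == small.distinct.size)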
The results (times in ms; the four numbers correspond to the four configurations in the order described above: unique/10, unique/10000, duplicate/10, duplicate/10000):
l.size == l.distinct.size: (3, 5471, 2, 6492)
l == l.distinct: (3, 5601, 2, 6054)
Alexey's HashSet: (2, 1590, 3, 781)
The results with larger objects (Strings from 1KB to 5KB):
l.size == l.distinct.size: (4, 5566, 7, 6506)
l == l.distinct: (4, 5926, 3, 6075)
Alexey's HashSet: (2, 2341, 3, 784)
The solution using HashSets is definitely the fastest, and as Alexey already pointed out, using .size doesn't make a major difference.

Here is the fastest purely functional solution I can think of:
import scala.annotation.tailrec
import scala.collection.immutable.HashSet

def isUniqueList[T](l: List[T]): Boolean = isUniqueList1(l, HashSet.empty[T])

@tailrec
def isUniqueList1[T](l: List[T], s: Set[T]): Boolean = l match {
  case Nil => true
  case h :: t => if (s(h)) false else isUniqueList1(t, s + h)
}
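For example, with the list from the question:
isUniqueList(List(1, 2, 3, 4, 3)) // false
isUniqueList(List(1, 2, 3, 4)) // true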
This should be faster, but it uses a mutable data structure (based on the distinct implementation given by Vasil Remeniuk):
import scala.collection.mutable

def isUniqueList[T](l: List[T]): Boolean = {
  val seen = mutable.HashSet[T]()
  for (x <- l) {
    if (seen(x)) {
      return false
    } else {
      seen += x
    }
  }
  true
}
And here is the simplest (equivalent to yours):
def isUniqueList[T](l: List[T]) = l.toSet.size == l.size

I would simply use the distinct method:
scala> val l = List(1,2,3,4,3)
l: List[Int] = List(1, 2, 3, 4, 3)
scala> l.distinct.size == l.size
res2: Boolean = false
ADD: The standard distinct implementation (from scala.collection.SeqLike) uses a mutable HashSet to find duplicate elements:
def distinct: Repr = {
  val b = newBuilder
  val seen = mutable.HashSet[A]()
  for (x <- this) {
    if (!seen(x)) {
      b += x
      seen += x
    }
  }
  b.result
}

A more efficient method would be to attempt to find a dupe; this returns as soon as one is found:
def hasDupe[A](l: List[A]): Boolean = {
  var dupes: Set[A] = Set.empty // must be fresh for each call
  def isDupe(a: A) = if (dupes(a)) true else { dupes += a; false }
  // then
  l exists isDupe
}

what's the best way to turn a Right[List] into a List
I parse a JSON string like so:
val parsed_states = io.circe.parser.decode[List[List[String]]](source)
And that will create a value equivalent to this:
val example_data = Right(List(List("NAME", "state"), List("Alabama", "01"), List("Alaska", "02"), List("Arizona", "04")))
I'm trying to grok Right, Left, Either and implement the best way to get a list of StateName, StateValue pairs out of that list above.
I see that any of these ways will give me what I need (while dropping the header):
val parsed_states = example_data.toSeq(0).tail
val parsed_states = example_data.getOrElse(<ProbUseNoneHere>).iterator.to(Seq).tail
val parsed_states = example_data.getOrElse(<ProbUseNoneHere>).asInstanceOf[Seq[List[String]]].tail
I guess I'm wondering whether I should prefer one of these based on the possible upstream behavior of io.circe.parser.decode, or whether I'm overthinking this. I'm new to the Right/Left/Either paradigm and am not finding many helpful examples.
in reply to #slouc
Trying to connect the dots from your answer as they apply to this use case. So something like this?
def blackBox: String => Either[Exception, List[List[String]]] = (url: String) => {
  if (url == "passalong")
    Right(List(List("NAME", "state"), List("Alabama", "01"), List("Alaska", "02"), List("Arizona", "04")))
  else
    Left(new Exception(s"This didn't work bc blackbox didn't parse ${url}"))
}
//val seed = "passalong"
val seed = "notgonnawork"
val xx: Either[Exception, List[List[String]]] = blackBox(seed)
def ff(i: List[List[String]]) = i.tail
val yy = xx.map(ff)
val zz = xx.fold(
  _ => throw new Exception("<need info here>"),
  i => i.tail)
The trick is in not getting state name / state value pairs out of the Either. They should be kept inside. If you want to, you can transform the Either type into something else (e.g. an Option by discarding whatever you possibly had on the left side), but don't destroy the effect. Something should be there to show that decoding could have failed; it can be an Either, Option, Try, etc. Eventually you will process left and right case accordingly, but this should happen as late as possible.
Let's take the following trivial example:
val x: Either[String, Int] = Right(42)
def f(i: Int) = i + 1
You might argue that you need to get the 42 out of the Right so that you can pass it to f. But that's not correct. Let's rewrite the example:
val x: Either[String, Int] = someFunction()
Now what? We have no idea whether we have a Left or a Right in value x, so we can't "get it out". Which integer would you obtain in case it's a Left? (if you really do have an integer value to use in that case, that's fair enough, and I will address that use case a bit later)
What you need to do instead is keep the effect (in this case Either), and you need to continue working in the context of that effect. It's there to show that there was a point in your program (in this case someFunction(), or decoding in your original question) that might have gone wrong.
So if you want to apply f to your potential integer, you need to map the effect with it (we can do that because Either is a functor, but that's a detail which probably exceeds the scope of this answer):
val x: Either[String, Int] = Right(42)
def f(i: Int) = i + 1
val y = x.map(value => f(value)) // Right(43)
val y = x.map(f) // shorter, point-free notation
and
val x: Either[String, Int] = someFunction()
def f(i: Int) = i + 1
// either a Left with some String, or a Right with some integer increased by 1
val y = x.map(f)
Then, at the very end of the chain of computations, you can handle the Left and Right cases; for example, if you were processing an HTTP request, then in case of Left you might return a 500, and in case of Right return a 200.
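Applied to the blackBox example above, the very end of the program might look something like this (a sketch; the println calls stand in for whatever real response handling you need):
val xx: Either[Exception, List[List[String]]] = blackBox(seed)
xx.map(_.tail) match { // Either is right-biased in Scala 2.12+
  case Left(err) => println("decoding failed: " + err.getMessage) // e.g. respond with 500
  case Right(rows) => rows.foreach(println) // e.g. respond with 200
}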
To address the default-value use case mentioned earlier: if you really want to get rid of the Either and resolve the Left case into some value (e.g. 0), you can use fold:
def f(i: Int) = i + 1
// if x is a Left, then z = 0
// if x is a Right(i), then z = i + 1
val z = x.fold(_ => 0, i => i + 1)
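Applied back to the original question, the same pattern gives you the data rows directly, falling back to an empty list if decoding failed (a sketch using example_data from above):
val rows: List[List[String]] = example_data.fold(_ => Nil, _.tail)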

Shortest path performance in Graphx with Spark

I am creating a graph from gz-compressed JSON files of edge and vertex records.
I have put the files in a dropbox folder here
I load and map these JSON records to create the vertex and edge types required by GraphX like this:
val vertices_raw = sqlContext.read.json("path/vertices.json.gz")
val vertices = vertices_raw.rdd.map(row=> ((row.getAs[String]("toid").stripPrefix("osgb").toLong),row.getAs[Long]("index")))
val verticesRDD: RDD[(VertexId, Long)] = vertices
val edges_raw = sqlContext.read.json("path/edges.json.gz")
val edgesRDD = edges_raw.rdd.map(row=>(Edge(row.getAs[String]("positiveNode").stripPrefix("osgb").toLong, row.getAs[String]("negativeNode").stripPrefix("osgb").toLong, row.getAs[Double]("length"))))
val my_graph: Graph[(Long),Double] = Graph.apply(verticesRDD, edgesRDD).partitionBy(PartitionStrategy.RandomVertexCut)
I then use this Dijkstra implementation I found to compute a shortest path between two vertices:
def dijkstra[VD](g: Graph[VD, Double], origin: VertexId) = {
  // vertex state: (visited, tentative distance, path so far)
  var g2 = g.mapVertices(
    (vid, vd) => (false, if (vid == origin) 0 else Double.MaxValue, List[VertexId]())
  )
  for (i <- 1L to g.vertices.count - 1) {
    // pick the unvisited vertex with the smallest tentative distance
    val currentVertexId: VertexId = g2.vertices
      .filter(!_._2._1)
      .fold((0L, (false, Double.MaxValue, List[VertexId]())))(
        (a, b) => if (a._2._2 < b._2._2) a else b
      )._1
    // relax all edges leaving the current vertex
    val newDistances: VertexRDD[(Double, List[VertexId])] =
      g2.aggregateMessages[(Double, List[VertexId])](
        ctx => if (ctx.srcId == currentVertexId) {
          ctx.sendToDst((ctx.srcAttr._2 + ctx.attr, ctx.srcAttr._3 :+ ctx.srcId))
        },
        (a, b) => if (a._1 < b._1) a else b
      )
    // merge the new tentative distances and paths back into the graph
    g2 = g2.outerJoinVertices(newDistances)((vid, vd, newSum) => {
      val newSumVal = newSum.getOrElse((Double.MaxValue, List[VertexId]()))
      (
        vd._1 || vid == currentVertexId,
        math.min(vd._2, newSumVal._1),
        if (vd._2 < newSumVal._1) vd._3 else newSumVal._2
      )
    })
  }
  // attach (distance, path) to the original vertex attributes,
  // dropping the visited flag via productIterator.toList.tail
  g.outerJoinVertices(g2.vertices)((vid, vd, dist) =>
    (vd, dist.getOrElse((false, Double.MaxValue, List[VertexId]()))
      .productIterator.toList.tail)
  )
}
I take two random vertex IDs:
val v1 = 4000000028222916L
val v2 = 4000000031019012L
and compute the path between them:
val results = dijkstra(my_graph, v1).vertices.map(_._2).collect
I am unable to compute this locally on my laptop without getting a StackOverflowError. I can see that it is using 3 of the 4 available cores. With the igraph library in Python, I can load this graph and compute 10 shortest paths per second on exactly the same graph. Is this an inefficient means of computing paths? At scale, on multiple nodes, the paths do compute (no StackOverflowError), but each path computation still takes 30-40 seconds.
As you can read on the python-igraph GitHub page:
"It is intended to be as powerful (ie. fast) as possible to enable the analysis of large graphs."
To explain why this takes 4000x longer on apache-spark than in local Python, you can take a look at "A deep dive into performance bottlenecks" with Spark PMC member Kay Ousterhout, which suggests it is probably due to a bottleneck:
... beginning with the idea that network and disk I/O are major bottlenecks ...
You may not need to store your data in memory, because the job may not get that much faster; that is, even if you moved the serialized, compressed data from disk to memory, the job might not speed up much...
You may also find some information here and here, but in the end the best method is to benchmark your code to find out where the bottleneck is.
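A minimal way to time the path computation itself, assuming my_graph, dijkstra and v1 from the question are in scope (a sketch, not a full profiling setup):
def time[A](label: String)(block: => A): A = {
  val start = System.nanoTime()
  val result = block
  println(label + " took " + (System.nanoTime() - start) / 1e6 + " ms")
  result
}

val results = time("dijkstra")(dijkstra(my_graph, v1).vertices.map(_._2).collect())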

all empty Immutable.Lists are identical

Feature or bug: All empty Immutable.Lists test as identical.
For example:
var xxx = new Immutable.List();
var yyy = new Immutable.List();
xxx === yyy; // true
var zzz = yyy.push(1).pop();
zzz === yyy; // true
zzz = Immutable.fromJS([]);
xxx === zzz; // true
I can see why you might want to consider all empty lists as identical, but I also have use cases where just because 2 lists are empty doesn't imply that they are the same. As a counter-example, if I create two Immutable.Lists with the same contents, they do not test as identical.
Is there a way to tell 2 empty Lists apart?
Do you think this is a bug?
I am not an expert in Immutable.js nor a contributor, but I will try to write down some thoughts on immutable data structures.
Immutable.js is optimized for performance, especially for equality checks. The idea behind immutability is that with immutable data structures we can determine whether a structure changed simply by comparing references. The first thing the implementation of Immutable.is does is check whether the references are equal:
export function is(valueA, valueB) {
  if (valueA === valueB || (valueA !== valueA && valueB !== valueB)) {
    return true;
  }
  // ...
}
If they are the same, we can always assume that the structure is still the same, since it is immutable. Making all empty lists equal increases performance here.
What we also need to understand is that immutable data structures are widely used in functional programming languages (Immutable.js is actually inspired by Scala and Clojure). Functional programming languages are much closer to mathematics than other languages. Mathematics is value-based: there are no instances in mathematics. We would assume that a set containing the elements 1, 2 and 3 is equal to any other set containing 1, 2 and 3. The empty set is always equal to the empty set, and the empty list equals the empty list.
Some interesting results from the Scala REPL:
scala> List(1, 2, 3) == List(1, 2, 3)
res0: Boolean = true
scala> Nil == Nil
res1: Boolean = true
scala> 1 :: 2 :: 3 :: Nil == List(1, 2, 3)
res2: Boolean = true
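The same distinction exists in Scala between reference identity (eq) and value equality (==); Nil is a singleton object, so all empty lists really are the same instance:
scala> List(1, 2, 3) eq List(1, 2, 3)
res3: Boolean = false
scala> Nil eq Nil
res4: Boolean = true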

Python: Return multiple variables from a function

Not so much a question as a valuable observation for people using Python.
Unlike the majority of other programming languages, you can return multiple variables from a function without dealing with objects, lists, etc.
Simply put:
return ReturnValue1, ReturnValue2, ReturnValue3
to return however many you wish.
and to retrieve them:
ReturnValue1, ReturnValue2, ReturnValue3 = functionName(parameters)
But remember to unpack them in order, just as you would assign positional parameters to a function.
Cheers
As I am not able to just comment on your "Question", I have to put this into an answer:
To be precise, the return value will be a tuple. So technically you are not returning multiple values but a single instance of the class tuple containing those exact values. This gives you quite a lot of different ways to receive those values:
def f():
    return 1, 2, 3

one, two, three = f()  # one = 1, two = 2, three = 3
all_three_values = f()  # all_three_values = (1, 2, 3)
a, *b = f()  # a = 1, b = [2, 3]
assert isinstance(all_three_values, tuple)  # True
This may help you:
lst = ['a', 's', 'd', 'f', 'w', 'e']

def list2variables(lst):
    di = {}
    for i in range(len(lst)):  # capture index to form dict
        di[lst[i]] = lst[i]
    for k, v in di.items():  # expose elements as global variables
        globals()[k] = v
    return globals()  # return so that it can be used later
A small change would let the function take a dict as an argument too.

Scala: Self-Recursive val in function [duplicate]

Why can't I define a variable recursively in a code block?
scala> {
| val test: Stream[Int] = 1 #:: test
| }
<console>:9: error: forward reference extends over definition of value test
val test: Stream[Int] = 1 #:: test
^
scala> val test: Stream[Int] = 1 #:: test
test: Stream[Int] = Stream(1, ?)
The lazy keyword solves this problem, but I can't understand why it works outside a code block yet throws a compilation error inside one.
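For reference, the lazy variant compiles fine inside a block:
scala> {
     |   lazy val test: Stream[Int] = 1 #:: test
     |   test.take(3).toList
     | }
res0: List[Int] = List(1, 1, 1)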
Note that in the REPL
scala> val something = "a value"
is evaluated more or less as follows:
object REPL$1 {
  val something = "a value"
}
import REPL$1._
So, any val (or def, etc.) is a member of an internal REPL helper object.
Now the point is that classes (and objects) allow forward references on their members:
object ForwardTest {
  def x = y // val x would also compile but with a more confusing result
  val y = 2
}
ForwardTest.x == 2
This is not true for vals inside a block. In a block, everything must be defined in linear order; the vals are no longer members but plain local values. The following does not compile either:
def plainMethod = { // could as well be a simple block
  def x = y
  val y = 2
  x
}
<console>: error: forward reference extends over definition of value y
         def x = y
                 ^
It is not recursion which makes the difference. The difference is that classes and objects allow forward references, whereas blocks do not.
I'll add that when you write:
object O {
  val x = y
  val y = 0
}
You are actually writing this:
object O {
  val x = this.y
  val y = 0
}
That little this is what is missing when you declare this stuff inside a definition.
The reason for this behavior lies in the different val initialization times. If you type val x = 5 directly into the REPL, x becomes a member of an object, and object fields can be initialized with a default value (null, 0, 0.0, false) before their initializer runs. In contrast, values in a block cannot be initialized with default values.
This leads to different behavior:
scala> class X { val x = y+1; val y = 10 }
defined class X
scala> (new X).x
res17: Int = 1
scala> { val x = y+1; val y = 10; x } // compiles only with 2.9.0
res20: Int = 11
In Scala 2.10 the last example does not compile anymore. In 2.9.0 the values are reordered by the compiler to get it to compile. There is a bug report which describes the different initialization times.
I'd like to add that a Scala worksheet in the Eclipse-based Scala IDE (v4.0.0) does not behave like the REPL in this respect, even though one might expect it to (e.g. https://github.com/scala-ide/scala-worksheet/wiki/Getting-Started says "Worksheets are like a REPL session on steroids"). Instead, a worksheet behaves like the definition of one long method, so forward-referencing val definitions (including recursive val definitions) in a worksheet must be made members of some object or class.
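For example, wrapping the recursive definition in an object makes it compile in a worksheet (Holder is an arbitrary name chosen for illustration):
object Holder {
  val test: Stream[Int] = 1 #:: test // self-reference is legal on object members
}
Holder.test.take(3).toList // List(1, 1, 1)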