Shortest path performance in Graphx with Spark - json

I am creating a graph from a gz compressed json file of edge and vertices type.
I have put the files in a dropbox folder here
I load and map these json records to create the vertices and edge types required by graphx like this:
val vertices_raw = sqlContext.read.json("path/vertices.json.gz")
val vertices = vertices_raw.rdd.map(row=> ((row.getAs[String]("toid").stripPrefix("osgb").toLong),row.getAs[Long]("index")))
val verticesRDD: RDD[(VertexId, Long)] = vertices
val edges_raw = sqlContext.read.json("path/edges.json.gz")
val edgesRDD = edges_raw.rdd.map(row=>(Edge(row.getAs[String]("positiveNode").stripPrefix("osgb").toLong, row.getAs[String]("negativeNode").stripPrefix("osgb").toLong, row.getAs[Double]("length"))))
val my_graph: Graph[(Long),Double] = Graph.apply(verticesRDD, edgesRDD).partitionBy(PartitionStrategy.RandomVertexCut)
I then use this dijkstra implementation I found to compute a shortest path between two vertices:
def dijkstra[VD](g: Graph[VD, Double], origin: VertexId) = {
var g2 = g.mapVertices(
(vid, vd) => (false, if (vid == origin) 0 else Double.MaxValue, List[VertexId]())
)
for (i <- 1L to g.vertices.count - 1) {
val currentVertexId: VertexId = g2.vertices.filter(!_._2._1)
.fold((0L, (false, Double.MaxValue, List[VertexId]())))(
(a, b) => if (a._2._2 < b._2._2) a else b)
._1
val newDistances: VertexRDD[(Double, List[VertexId])] =
g2.aggregateMessages[(Double, List[VertexId])](
ctx => if (ctx.srcId == currentVertexId) {
ctx.sendToDst((ctx.srcAttr._2 + ctx.attr, ctx.srcAttr._3 :+ ctx.srcId))
},
(a, b) => if (a._1 < b._1) a else b
)
g2 = g2.outerJoinVertices(newDistances)((vid, vd, newSum) => {
val newSumVal = newSum.getOrElse((Double.MaxValue, List[VertexId]()))
(
vd._1 || vid == currentVertexId,
math.min(vd._2, newSumVal._1),
if (vd._2 < newSumVal._1) vd._3 else newSumVal._2
)
})
}
g.outerJoinVertices(g2.vertices)((vid, vd, dist) =>
(vd, dist.getOrElse((false, Double.MaxValue, List[VertexId]()))
.productIterator.toList.tail
))
}
I take two random vertex id's:
val v1 = 4000000028222916L
val v2 = 4000000031019012L
and compute the path between them:
val results = dijkstra(my_graph, v1).vertices.map(_._2).collect
I am unable to compute this locally on my laptop without getting a stackoverflow error. I can see that it is using 3 out of 4 cores available. I can load this graph and compute shortest 10 paths per second with the igraph library in Python on exactly the same graph. Is this an inefficient means of computing paths? At scale, on multiple nodes the paths will compute (no stackoverflow error) but it is still 30/40seconds per path computation.

As you can read on the python-igraph github
"It is intended to be as powerful (ie. fast) as possible to enable the
analysis of large graphs."
In order to explain why it is taking 4000x more time on apache-spark than on local python, you may take a look here (A deep dive into performance bottlenecks with Spark PMC member Kay Ousterhout.) to see that it is probably due to a bottleneck:
... beginning with the idea that network and disk I/O are major bottlenecks ...
You may not need to store your data in-memory because the job may not get that much faster. This is saying that if you moved the serialized compressed data from on-disk to in-memory...
you may also see here & here some informations , but best final method is to benchmark your code to know where the bottleneck is.

Related

Inconsistent behaviour when attempting to write Dataframe to CSV in Apache Spark

I'm trying to output the optimal hyperparameters for a decision tree classifier I trained using Spark's MLlib to a csv file using Dataframes and spark-csv. Here's a snippet of my code:
// Split the data into training and test sets (10% held out for testing)
val Array(trainingData, testData) = assembledData.randomSplit(Array(0.9, 0.1))
// Define cross validation with a hyperparameter grid
val crossval = new CrossValidator()
.setEstimator(classifier)
.setEstimatorParamMaps(paramGrid)
.setEvaluator(new BinaryClassificationEvaluator)
.setNumFolds(10)
// Train model
val model = crossval.fit(trainingData)
// Find best hyperparameter combination and create an RDD
val bestModel = model.bestModel
val hyperparamList = new ListBuffer[(String, String)]()
bestModel.extractParamMap().toSeq.foreach(pair => {
val hyperparam: Tuple2[String,String] = (pair.param.name,pair.value.toString)
hyperparamList += hyperparam
})
val hyperparameters = sqlContext.sparkContext.parallelize(hyperparamList.toSeq)
// Print the best hyperparameters
println(bestModel.extractParamMap().toSeq.foreach(pair => {
println(s"${pair.param.parent} ${pair.param.name}")
println(pair.value)
}))
// Define csv path to output results
var csvPath: String = "/root/results/decision-tree"
val hyperparametersPath: String = csvPath+"/hyperparameters"
val hyperparametersFile: File = new File(hyperparametersPath)
val results = (hyperparameters, hyperparametersPath, hyperparametersFile)
// Convert RDD to Dataframe and write it as csv
val dfToSave = spark.createDataFrame(results._1.map(x => Row(x._1, x._2)))
dfToSave.write.format("csv").mode("overwrite").save(results._2)
// Stop spark session
spark.stop()
After finishing a Spark job, I can see the part-00*... and _SUCCESS files inside the path as expected. However, though there are 13 hyperparameters total in this case (confirmed by printing them on screen), cat-ing the csv files shows not every hyperparameter was written to csv:
user#master:~$ cat /root/results/decision-tree/hyperparameters/part*.csv
checkpointInterval,10
featuresCol,features
maxDepth,5
minInstancesPerNode,1
Also, the hyperparameters that do get written change in every execution. This is executed on a HDFS-based Spark cluster with 1 master and 3 workers that have exactly the same hardware. Could it be a race condition? If so, how can I solve it?
Thanks in advance.
I think I figured it out. I expected dfTosave.write.format("csv")save(path) to write everything to the master node, but since the tasks are distributed to all workers, each worker saves its part of the hyperparameters to a local CSV in its filesystem. Because in my case the master node is also a worker, I can see its part of the hyperparameters. The "inconsistent behaviour" (i.e. seeing different parts in each execution) is caused by whatever algorithm Spark uses for distributing partitions among workers.
My solution will be to collect the CSVs from all workers using something like scp or rsync to build the full results.

How to get the Index of Max element in UInt Vec , Chisel

I'm trying to get the index of the Max element in a UInt vector.
My code looks like this
val pwr = Vec.tabulate(N) {i => energyMeters(i).io.pwr}
val maxPwr = pwr.indexOf(pwr.max)
However this code generate compilation error:
No implicit Ordering Defined for Chisel.UInt.
val maxPwr = pwr.indexOf(pwr.max)
^
I understand that I probably need to implement the max function , can someone give an example how this should be done ?
Edit:
I also tried this:
val pwr = Vec.tabulate(N) {i => energyMeters(i).io.pwr}
val maxPwr = pwr reduceLeft {(x,y) => Mux(x > y,x,y)}
val maxPwridx = pwr.indexOf(maxPwr)
But it fails on elaboration , when I tried to cast maxPwridx to UInt.
I've ended up with this workaround:
val pwr = Vec.tabulate(N) {i => energyMeters(i).io.pwr}
val maxPwr = pwr reduceLeft {(x,y) => Mux(x > y,x,y)}
val maxPwridx = pwr.indexWhere((x : UInt => x === maxPwr))
Chisel's Vec extends Scala's Seq. This means that a Vec has both dynamic access hardware methods that will allow you to generate hardware to search for something in a Vec (e.g., indexWhere, onlyIndexWhere, lastIndexWhere) as well as all the methods available to normal Scala sequences (e.g., indexOf).
For the purposes of doing hardware operations, you want to use the former (as you found in your last edit---which looks great!) as opposed to the latter.
To get some handle on this, the screenshot below shows the Chisel 3.3.0-RC1 API documentation for VecLike, filtered to excluded inherited methods. Notable here are indexWhere, onlyIndexWhere, lastIndexWhere, exists, forall, and contains:
And the documentation for Vec. The only interesting method here would be reduceTree:

Error while passing values using peekpoketester

I am trying to pass some random integers (which I have stored in an array) to my hardware as an Input through the poke method in peekpoketester. But I am getting this error:
chisel3.internal.ChiselException: Error: Not in a UserModule. Likely cause: Missed Module() wrap, bare chisel API call, or attempting to construct hardware inside a BlackBox.
What could be the reason? I don't think I need a module wrap here as this is not hardware.
class TesterSimple (dut: DeviceUnderTest)(parameter1 : Int)(parameter2 : Int) extends
PeekPokeTester (dut) {
var x = Array[Int](parameter1)
var y = Array[Int](parameter2)
var z = 1
poke(dut.io.IP1, z.asUInt)
for(i <- 0 until parameter1){poke(dut.io.IP2(i), x(i).asUInt)}
for(j <- 0 until parameter2){poke(dut.io.IP3(j), y(j).asUInt)}
}
object TesterSimple extends App {
implicit val parameter1 = 2
implicit val parameter2 = 2
chisel3.iotesters.Driver (() => DeviceUnderTest(parameter1 :Int, parameter2 :Int)) { c =>
new TesterSimple (c)(parameter1, parameter2)}
}
I'd suggest a couple of things.
Main problem, I think you are not initializing your arrays properly
Try using Array.fill or Array.tabulate to create and initialize arrays
val rand = scala.util.Random
var x = Array.fill(parameter1)(rand.nextInt(100))
var y = Array.fill(parameter2)(rand.nextInt(100))
You don't need the .asUInt in the poke, it accepts Ints or BigInts
When defining hardware constants, use .U instead of .asUInt, the latter is a way of casting other chisel types, it does work but it a backward compatibility thing.
It's better to not start variables or methods with capital letters
I suggest us class DutName(val parameter1: Int, val parameter2: Int) or class DutName(val parameter1: Int)(val parameter2: Int) if you prefer.
This will allow to use the dut's paremeters when you are writing your test.
E.g. for(i <- 0 until dut.parameter1){poke(dut.io.IP2(i), x(i))}
This will save you have to duplicate parameter objects on your DUT and your Tester
Good luck!
Could you also share your DUT?
I believe the most likely case is your DUT does not extend Module

How are reactive streams used in Slick for inserting data

In Slick's documentation examples for using Reactive Streams are presented just for reading data as a means of a DatabasePublisher. But what happens when you want to use your database as a Sink and backpreasure based on your insertion rate?
I've looked for equivalent DatabaseSubscriber but it doesn't exist. So the question is, if I have a Source, say:
val source = Source(0 to 100)
how can I crete a Sink with Slick that writes those values into a table with schema:
create table NumberTable (value INT)
Serial Inserts
The easiest way would be to do inserts within a Sink.foreach.
Assuming you've used the schema code generation and further assuming your table is named "NumberTable"
//Tables file was auto-generated by the schema code generation
import Tables.{Numbertable, NumbertableRow}
val numberTableDB = Database forConfig "NumberTableConfig"
We can write a function that does the insertion
def insertIntoDb(num : Int) =
numberTableDB run (Numbertable += NumbertableRow(num))
And that function can be placed in the Sink
val insertSink = Sink[Int] foreach insertIntoDb
Source(0 to 100) runWith insertSink
Batched Inserts
You could further extend the Sink methodology by batching N inserts at a time:
def batchInsertIntoDb(nums : Seq[Int]) =
numberTableDB run (Numbertable ++= nums.map(NumbertableRow.apply))
val batchInsertSink = Sink[Seq[Int]] foreach batchInsertIntoDb
This batched Sink can be fed by a Flow which does the batch grouping:
val batchSize = 10
Source(0 to 100).via(Flow[Int].grouped(batchSize))
.runWith(batchInsertSink)
Although you can use a Sink.foreach to achieve this (as mentioned by Ramon) it is safer and likely faster (by running the inserts in parallel) to use the mapAsync Flow. The problem you will face with using Sink.foreach is that it does not have a return value. Inserting into a database via slicks db.run method returns a Future which will then escape out of the steams returned Future[Done] which completes as soon as the Sink.foreach finishes.
implicit val system = ActorSystem("system")
implicit val materializer = ActorMaterializer()
class Numbers(tag: Tag) extends Table[Int](tag, "NumberTable") {
def value = column[Int]("value")
def * = value
}
val numbers = TableQuery[Numbers]
val db = Database.forConfig("postgres")
Await.result(db.run(numbers.schema.create), Duration.Inf)
val streamFuture: Future[Done] = Source(0 to 100)
.runWith(Sink.foreach[Int] { (i: Int) =>
db.run(numbers += i).foreach(_ => println(s"stream 1 insert $i done"))
})
Await.result(streamFuture, Duration.Inf)
println("stream 1 done")
//// sample 1 output: ////
// stream 1 insert 1 done
// ...
// stream 1 insert 99 done
// stream 1 done <-- stream Future[Done] returned before inserts finished
// stream 1 insert 100 done
On the other hand the def mapAsync[T](parallelism: Int)(f: Out ⇒ Future[T]) Flow allows you to run the inserts in parallel via the parallelism paramerter and accepts a function from the upstream out value to a future of some type. This matches our i => db.run(numbers += i) function. The great thing about this Flow is that it then feeds the result of these Futures downstream.
val streamFuture2: Future[Done] = Source(0 to 100)
.mapAsync(1) { (i: Int) =>
db.run(numbers += i).map { r => println(s"stream 2 insert $i done"); r }
}
.runWith(Sink.ignore)
Await.result(streamFuture2, Duration.Inf)
println("stream 2 done")
//// sample 2 output: ////
// stream 2 insert 1 done
// ...
// stream 2 insert 100 done
// stream 1 done <-- stream Future[Done] returned after inserts finished
To prove the point you can even return a real result from the stream rather than a Future[Done] (With Done representing Unit). This stream will also add a higher parallelism value and batching for extra performance. *
val streamFuture3: Future[Int] = Source(0 to 100)
.via(Flow[Int].grouped(10)) // Batch in size 10
.mapAsync(2)((ints: Seq[Int]) => db.run(numbers ++= ints).map(_.getOrElse(0))) // Insert batches in parallel, return insert count
.runWith(Sink.fold(0)(_+_)) // count all inserts and return total
val rowsInserted = Await.result(streamFuture3, Duration.Inf)
println(s"stream 3 done, inserted $rowsInserted rows")
// sample 3 output:
// stream 3 done, inserted 101 rows
Note: You probably won't see better performance for such a small data set, but when I was dealing with a 1.7M insert I was able to get the best performance on my machine with a batch size of 1000 and parallelism value of 8, locally with postgresql. This was about twice as good as not running in parallel. As always when dealing with performance your results may vary and you should measure for yourself.
I find Alpakka's documentation to be excellent and it's DSL to be really easy to work with reactive streams.
This is the docs for Slick: https://doc.akka.io/docs/alpakka/current/slick.html
Inserting example:
Source(0 to 100)
.runWith(
// add an optional first argument to specify the parallelism factor (Int)
Slick.sink(value => sqlu"INSERT INTO NumberTable VALUES(${value})")
)

Functional Programming: Does a list only contain unique items?

I'm having an unsorted list and want to know, whether all items in it are unique.
My naive approach would be val l = List(1,2,3,4,3)
def isUniqueList(l: List[Int]) = (new HashSet()++l).size == l.size
Basically, I'm checking whether a Set containing all elements of the list has the same size (since an item appearing twice in the original list will only appear once in the set), but I'm not sure whether this is the ideal solution for this problem.
Edit:
I benchmarked the 3 most popular solutions, l==l.distinct, l.size==l.distinct.size and Alexey's HashSet-based solution.
Each function was run 1000 times with a unique list of 10 items, a unique list of 10000 items and the same lists with one item appearing in the third quarter copied to the middle of the list. Before each run, each function got called 1000 times to warm up the JIT, the whole benchmark was run 5 times before the times were taken with System.currentTimeMillis.
The machine was a C2D P8400 (2.26 GHz) with 3GB RAM, the java version was the OpenJDK 64bit server VM (1.6.0.20). The java args were -Xmx1536M -Xms512M
The results:
l.size==l.distinct.size (3, 5471, 2, 6492)
l==l.distinct (3, 5601, 2, 6054)
Alexey's HashSet (2, 1590, 3, 781)
The results with larger objects (Strings from 1KB to 5KB):
l.size==l.distinct.size MutableList(4, 5566, 7, 6506)
l==l.distinct MutableList(4, 5926, 3, 6075)
Alexey's HashSet MutableList(2, 2341, 3, 784)
The solution using HashSets is definitely the fastest, and as he already pointed out using .size doesn't make a major difference.
Here is the fastest purely functional solution I can think of:
def isUniqueList(l: List[T]) = isUniqueList1(l, new HashSet[T])
#tailrec
def isUniqueList1(l: List[T], s: Set[T]) = l match {
case Nil => true
case (h :: t) => if (s(h)) false else isUniqueList1(t, s + h)
}
This should be faster, but uses mutable data structure (based on the distinct implementation given by Vasil Remeniuk):
def isUniqueList(l: List[T]): Boolean = {
val seen = mutable.HashSet[A]()
for (x <- this) {
if (seen(x)) {
return false
}
else {
seen += x
}
}
true
}
And here is the simplest (equivalent to yours):
def isUniqueList(l: List[T]) = l.toSet.size == l.size
I would simply use distinct method:
scala> val l = List(1,2,3,4,3)
l: List[Int] = List(1, 2, 3, 4, 3)
scala> l.distinct.size == l.size
res2: Boolean = false
ADD: Standard distinct implementation (from scala.collection.SeqLike) uses mutable HashSet, to find duplicate elements:
def distinct: Repr = {
val b = newBuilder
val seen = mutable.HashSet[A]()
for (x <- this) {
if (!seen(x)) {
b += x
seen += x
}
}
b.result
}
A more efficient method would be to attempt to find a dupe; this would return more quickly if one were found:
var dupes : Set[A] = Set.empty
def isDupe(a : A) = if (dupes(a)) true else { dupes += a; false }
//then
l exists isDupe