I have a RDD[Map[String,Int]] where the keys of the maps are the column names. Each map is incomplete and to know the column names I would need to union all the keys. Is there a way to avoid this collect operation to know all the keys and use just once rdd.saveAsTextFile(..) to get the csv?
For example, say I have an RDD with two elements (scala notation):
Map("a"->1, "b"->2)
Map("b"->1, "c"->3)
I would like to end up with this csv:
a,b,c
1,2,0
0,1,3
Scala solutions are better but any other Spark-compatible language would do.
EDIT:
I can try to solve my problem from another direction also. Let's say I somehow know all the columns in the beginning, but I want to get rid of columns that have 0 value in all maps. So the problem becomes, I know that the keys are ("a", "b", "c") and from this:
Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)
I need to write the csv:
a,b
1,2
3,1
Would it be possible to do this with only one collect?
If you're statement is: "every new element in my RDD may add a new column name I have not seen so far", the answer is obviously can't avoid a full scan. But you don't need to collect all elements on the driver.
You could use aggregate to only collect column names. This method takes two functions, one is to insert a single element into the resulting collection, and another one to merge results from two different partitions.
rdd.aggregate(Set.empty[String])( {(s, m) => s union m.keySet }, { (s1, s2) => s1 union s2 })
You will get back a set of all column names in the RDD. In a second scan you can print the CSV file.
Scala and any other supported language
You can use spark-csv
First lets find all present columns:
val cols = sc.broadcast(rdd.flatMap(_.keys).distinct().collect())
Create RDD[Row]:
val rows = rdd.map {
row => { Row.fromSeq(cols.value.map { row.getOrElse(_, 0) })}
}
Prepare schema:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(
cols.value.map(field => StructField(field, IntegerType, true)))
Convert RDD[Row] to Data Frame:
val df = sqlContext.createDataFrame(rows, schema)
Write results:
// Spark 1.4+, for other versions see spark-csv docs
df.write.format("com.databricks.spark.csv").save("mycsv.csv")
You can do pretty much the same thing using other supported languages.
Python
If you use Python and final data fits in a driver memory you can use Pandas through toPandas() method:
rdd = sc.parallelize([{'a': 1, 'b': 2}, {'b': 1, 'c': 3}])
cols = sc.broadcast(rdd.flatMap(lambda row: row.keys()).distinct().collect())
df = sqlContext.createDataFrame(
rdd.map(lambda row: {k: row.get(k, 0) for k in cols.value}))
df.toPandas().save('mycsv.csv')
or directly:
import pandas as pd
pd.DataFrame(rdd.collect()).fillna(0).save('mycsv.csv')
Edit
One possible way to the second collect is to use accumulators to either build a set of all column names or to count these where you found zeros and use this information to map over rows and remove unnecessary columns or to add zeros.
It is possible but inefficient and feels like cheating. The only situation when it makes some sense is when number of zeros is very low, but I guess it is not the case here.
object ColsSetParam extends AccumulatorParam[Set[String]] {
def zero(initialValue: Set[String]): Set[String] = {
Set.empty[String]
}
def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = {
s1 ++ s2
}
}
val colSetAccum = sc.accumulator(Set.empty[String])(ColsSetParam)
rdd.foreach { colSetAccum += _.keys.toSet }
or
// We assume you know this upfront
val allColnames = sc.broadcast(Set("a", "b", "c"))
object ZeroColsParam extends AccumulatorParam[Map[String, Int]] {
def zero(initialValue: Map[String, Int]): Map[String, Int] = {
Map.empty[String, Int]
}
def addInPlace(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] = {
val keys = m1.keys ++ m2.keys
keys.map(
(k: String) => (k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0)))).toMap
}
}
val accum = sc.accumulator(Map.empty[String, Int])(ZeroColsParam)
rdd.foreach { row =>
// If allColnames.value -- row.keys.toSet is empty we can avoid this part
accum += (allColnames.value -- row.keys.toSet).map(x => (x -> 1)).toMap
}
Related
I'm trying to create multiple DataFrames from the two lists below,
val paths = ListBuffer("s3://abc_xyz_tableA.json",
"s3://def_xyz_tableA.json",
"s3://abc_xyz_tableB.json",
"s3://def_xyz_tableB.json",
"s3://abc_xyz_tableC.json",....)
val tableNames = ListBuffer("tableA","tableB","tableC","tableD",....)
I want to create different dataframes using the table names by bringing all the common table name ending s3 paths together as they have the unique schema.
so for example if the tables and paths related to it are brought together then -
"tableADF" will have all the data from these paths "s3://abc_xyz_tableA.json", "s3://def_xyz_tableA.json" as they have "tableA" in the path
"tableBDF" will have all the data from these paths "s3://abc_xyz_tableB.json", "s3://def_xyz_tableB.json" as they have "tableB" in the path
and so on there can be many tableNames and Paths
I'm trying different approaches but not successful yet.
Any leads in achieving the desired solution will be of great help. Thanks!
using input_file_name() udf, you can filter based on the file names to get the dataframe for each file/file patterns
import org.apache.spark.sql.functions._
import spark.implicits._
var df = spark.read.format("json").load("s3://data/*.json")
df = df.withColumn(
"input_file", input_file_name()
)
val tableADF= df.filter($"input_file".endsWith("tableA.json"))
val tableBDF= df.filter($"input_file".endsWith("tableB.json"))
If the file post fix name list is pretty long then you an use something as below,
Also find the code explanation inline
import org.apache.spark.sql.functions._
object DFByFileName {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
//Load your JSON data
var df = spark.read.format("json").load("s3://data/*.json")
//Add a column with file name
df = df.withColumn(
"input_file", (input_file_name())
)
//Extract unique file postfix from the file names in a List
val fileGroupList = df.select("input_file").map(row => {
val fileName = row.getString(0)
val index1 = fileName.lastIndexOf("_")
val index2 = fileName.lastIndexOf(".")
fileName.substring(index1 + 1, index2)
}).collect()
//Iterate file group name to map of (fileGroup -> Dataframe of file group)
fileGroupList.map(fileGroupName => {
df.filter($"input_file".endsWith(s"${fileGroupName}.json"))
//perform dataframe operations
})
}
}
Check below code & Final result type is
scala.collection.immutable.Map[String,org.apache.spark.sql.DataFrame] = Map(tableBDF -> [...], tableADF -> [...], tableCDF -> [...]) where ... is your column list.
paths
.map(path => (s"${path.split("_").last.split("\\.json").head}DF",path)) // parsing file names and extracting table name and path into tuple
.groupBy(_._1) // grouping paths based same table name
.map(p => (p._1 -> p._2.map(_._2))).par // combining paths for same table into list and also .par function to execute subsequent steps in Parallel
.map(mp => {
(
mp._1, // table name
mp._2.par // For same DF multiple Files load parallel.
.map(spark.read.json(_)) // loading files s3
.reduce(_ union _) // union if same table has multiple files.
)
}
)
I have a large CSV( > 500 MB), which I take into a spark RDD, and I want to store it to a large Map[String, Array[Long]].
The CSV has multiple columns but I require only two for the time being. The first and second column, and is of the form:
A 12312 [some_value] ....
B 123123[some_value] ....
A 1222 [some_value] ....
C 1231 [some_value] ....
I want my map to basically group by the string and store an array of long
so, for the above case, my map would be:
{"A": [12312, 1222], "B": 123123, "C":1231}
But since this map would be huge, I can't simply do this directly. tsca
I take the CSV in a sql.dataframe
My code so far(Looks incorrect though):
def getMap(df: sql.DataFrame, sc: SparkContext): RDD[Map[String, Array[Long]]] = {
var records = sc.emptyRDD[Map[String, Array[Long]]]
val rows: RDD[Row] = df.rdd
rows.foreachPartition( iter => {
iter.foreach(x =>
if(records.contains(x.get(0).toString)){
val arr = temp_map.getOrElse()
records = records + (x.get(0).toString -> (temp_map.getOrElse(x.get(0).toString) :+ x.get(1).toString.toLong))
}
else{
val arr = new Array[Long](1)
arr(0) = x.get(1).toString.toLong
records = records + (x.get(0).toString -> arr)
}
)
})
}
Thanks in advance!
If I understood your question correctly then
You could groupBy first column and collect_list for the second column
import org.apache.spark.sql.functions._
val newDF = df.groupBy("column1").agg(collect_list("column2"))
newDF.show(faslse)
val rdd = newDF.rdd.map(r => (r.getString(0), r.getAs[List[Long]](1)))
This will give you RDD[(String, List[Long])] where the string will be unique
a relative newbie to spark, hbase, and scala here.
I have json (stored as byte arrays) in hbase cells in the same column family but across several thousand column qualifiers. Example (simplified):
Table name: 'Events'
rowkey: rk1
column family: cf1
column qualifier: cq1, cell data (in bytes): {"id":1, "event":"standing"}
column qualifier: cq2, cell data (in bytes): {"id":2, "event":"sitting"}
etc.
Using scala, I can read rows by specifying a timerange
val scan = new Scan()
val start = 1460542400
val end = 1462801600
val hbaseContext = new HBaseContext(sc, conf)
val getRdd = hbaseContext.hbaseRDD(TableName.valueOf("Events"), scan)
If I try to load up my hbase rdd (getRdd) into a dataframe (after converting the byte arrays into string etc.), it only reads the first cell in every row (in the example above, I would only get "standing".
this code only loads up a single cell for every row returned
val resultsString = getRdd.map(s=>Bytes.toString(s._2.value()))
val resultsDf = sqlContext.read.json(resultsString)
In order to get every cell I have to iterate as below.
val jsonRDD = getRdd.map(
row => {
val str = new StringBuilder
str.append("[")
val it = row._2.listCells().iterator()
while (it.hasNext) {
val cell = it.next()
val cellstring = Bytes.toString(CellUtil.cloneValue(cell))
str.append(cellstring)
if (it.hasNext()) {
str.append(",")
}
}
str.append("]")
str.toString()
}
)
val hbaseDataSet = sqlContext.read.json(jsonRDD)
I need to add the square brackets and the commas so its properly formatted json for the dataframe to read it.
Questions:
Is there a more elegant way to construct the json i.e. some parser that takes in the individual json strings and concatenates them together so its properly formed json?
Is there a better capability to flatten hbase cells so i dont need to iterate?
For the jsonRdd, the closure that is computed should include the str local variable, so the task executing this code on a node should not be missing the "[", "]" or ",". i.e i wont get parser errors once i run this on the cluster instead of local[*]
Finally, is it better to just create a pair RDD from the json or use data frames to perform simple things like counts? Is there some way to measure the efficiency and performance of one vs. the other?
thank you
It may be simple question, but I am new to Scala and not able to find the proper solution
I am trying to create a JSON object from the Option values. Will check if the value is not empty then create the Json obj, if the value is None I don't want to create the json object. With out else, default else is Unit which will fail to create Json obj
Json.obj(if(position.nonEmpty) ("position" -> position.get),
if(place.nonEmpty) ("place" -> place.get),
if(country.nonEmpty) ("country" -> country.get))
Need to put the If condition so that the final json string to look like
{
"position": "M2",
"place": "place",
"country": "country"
}
val obj = for {
p <- position
o <- otherOption
...
} yield Json.obj(
"position" -> p,
"other" -> o)
Will only yield a Some of Json Object if all options are defined. Otherwise None
Option is a monad and there are few convenient ways for using it.
First, if you want to extract value you should use map or flatMap and getOrElse methods:
val res = position.map(value => Json.obj("position" -> value)).getOrElse(null)
Another way is to keep Option of another type and use it latter:
val jsonOption = position.map(value => Json.obj("position" -> value))
After you can use it in for comprehension with another options or perform another mutations without extracting:
for (positionJson <- jsonOption; xJson <- xJsonOption) yield positionJson.toString + xJson.toString
jsonOption.map(_.toString).foreach(print(_))
And always try to avoid pattern matching on monads.
I think I may be missing something fundamental from the list-json xpath architecture. The smoothest way I've been able to extract and traverse a list is shown below. Can someone please show me a better technique:
class Example {
#Test
def traverseJsonArray() {
def myOperation(kid:JObject) = println("kid="+kid)
val json = JsonParser.parse("""
{ "kids":[
{"name":"bob","age":3},
{"name":"angie","age":5},
]}
""")
val list = ( json \\ "kids" ).children(0).children
for ( kid <- list ) myOperation(kid.asInstanceOf[JObject])
}
}
If at all possible you should upgrade to Lift JSON 2.3-M1 (http://www.scala-tools.org/repo-releases/net/liftweb/lift-json_2.8.1/2.3-M1/). It contains two important improvements, the other affecting the path expressions.
With 2.3 the path expressions never return JFields, instead the values of JFields are returned directly. After that your example would look like:
val list = (json \ "kids").children
for ( kid <- list ) myOperation(kid.asInstanceOf[JObject])
Lift JSON provides several styles to parse values from JSON: path expressions, query comprehensions and case class extractions. It is possible to mix and match these styles and to get the best results we often do. For completeness sake I'll give you some variations of the above example to get a better intuition of these different styles.
// Collect all JObjects from 'kids' array and iterate
val JArray(kids) = json \ "kids"
kids collect { case kid: JObject => kid } foreach myOperation
// Yield the JObjects from 'kids' array and iterate over yielded list
(for (kid#JObject(_) <- json \ "kids") yield kid) foreach myOperation
// Extract the values of 'kids' array as JObjects
implicit val formats = DefaultFormats
(json \ "kids").extract[List[JObject]] foreach myOperation
// Extract the values of 'kids' array as case classes
case class Kid(name: String, age: Int)
(json \ "kids").extract[List[Kid]] foreach println
// Query the JSON with a query comprehension
val ks = for {
JArray(kids) <- json
kid#JObject(_) <- kids
} yield kid
ks foreach myOperation