Related
Thanks to the great help from Tenfour04, I've got wonderful code for handling CSV files.
However, I am in trouble like followings.
How to call these functions?
How to initialize 2-dimensional array variables?
Below is the code that finally worked.
MainActivity.kt
package com.surlofia.csv_tenfour04_1
import androidx.appcompat.app.AppCompatActivity
import android.os.Bundle
import java.io.File
import java.io.IOException
import com.surlofia.csv_tenfour04_1.databinding.ActivityMainBinding
var chk_Q_Num: MutableList<Int> = mutableListOf (
0,
1, 2, 3, 4, 5,
6, 7, 8, 9, 10,
11, 12, 13, 14, 15,
16, 17, 18, 19, 20,
)
var chk_Q_State: MutableList<String> = mutableListOf (
"z",
"a", "b", "c", "d", "e",
"f", "g", "h", "i", "j"
)
class MainActivity : AppCompatActivity() {
private lateinit var binding: ActivityMainBinding
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
// setContentView(R.layout.activity_main)
binding = ActivityMainBinding.inflate(layoutInflater)
val view = binding.root
setContentView(view)
// Load saved data at game startup. It will be invalid if performed by other activities.
val filePath = filesDir.path + "/chk_Q.csv"
val file = File(filePath)
binding.fileExists.text = isFileExists(file).toString()
if (isFileExists(file)) {
val csvIN = file.readAsCSV()
for (i in 0 .. 10) {
chk_Q_Num[i] = csvIN[i][0].toInt()
chk_Q_State[i] = csvIN[i][1]
}
}
// Game Program Run
val csvOUT = mutableListOf(
mutableListOf("0","OK"),
mutableListOf("1","OK"),
mutableListOf("2","OK"),
mutableListOf("3","Not yet"),
mutableListOf("4","Not yet"),
mutableListOf("5","Not yet"),
mutableListOf("6","Not yet"),
mutableListOf("7","Not yet"),
mutableListOf("8","Not yet"),
mutableListOf("9","Not yet"),
mutableListOf("10","Not yet")
)
var tempString = ""
for (i in 0 .. 10) {
csvOUT[i][0] = chk_Q_Num[i].toString()
csvOUT[i][1] = "OK"
tempString = tempString + csvOUT[i][0] + "-->" + csvOUT[i][1] + "\n"
}
binding.readFile.text = tempString
// and save Data
file.writeAsCSV(csvOUT)
}
// https://www.techiedelight.com/ja/check-if-a-file-exists-in-kotlin/
private fun isFileExists(file: File): Boolean {
return file.exists() && !file.isDirectory
}
#Throws(IOException::class)
fun File.readAsCSV(): List<List<String>> {
val splitLines = mutableListOf<List<String>>()
forEachLine {
splitLines += it.split(", ")
}
return splitLines
}
#Throws(IOException::class)
fun File.writeAsCSV(values: List<List<String>>) {
val csv = values.joinToString("\n") { line -> line.joinToString(", ") }
writeText(csv)
}
}
chk_Q.csv
0,0
1,OK
2,OK
3,Not yet
4,Not yet
5,Not yet
6,Not yet
7,Not yet
8,Not yet
9,Not yet
10,Not yet
1. How to call these functions?
The code below seems work well.
Did I call these funtions in right way?
Or are there better ways to achieve this?
read
if (isFileExists(file)) {
val csvIN = file.readAsCSV()
for (i in 0 .. 10) {
chk_Q_Num[i] = csvIN[i][0].toInt()
chk_Q_State[i] = csvIN[i][1]
}
}
write
file.writeAsCSV(csvOUT)
2. How to initialize 2-dimensional array variables?
val csvOUT = mutableListOf(
mutableListOf("0","OK"),
mutableListOf("1","OK"),
mutableListOf("2","OK"),
mutableListOf("3","Not yet"),
mutableListOf("4","Not yet"),
mutableListOf("5","Not yet"),
mutableListOf("6","Not yet"),
mutableListOf("7","Not yet"),
mutableListOf("8","Not yet"),
mutableListOf("9","Not yet"),
mutableListOf("10","Not yet")
)
I would like to know the clever way to use a for loop instead of writing specific values one by one.
For example, something like bellow.
val csvOUT = mutableListOf(mutableListOf())
for (i in 0 .. 10) {
csvOUT[i][0] = i
csvOUT[i][1] = "OK"
}
But this gave me the following error message:
Not enough information to infer type variable T
It would be great if you could provide an example of how to execute this for beginners.
----- Added on June 15, 2022. -----
[Question 1]
Regarding initialization, I got an error "keep stopping" when I executed the following code.
The application is forced to terminate.
Why is this?
val csvOUT: MutableList<MutableList<String>> = mutableListOf(mutableListOf())
for (i in 0 .. 10) {
csvOUT[i][0] = "$i"
csvOUT[i][1] = "OK"
}
[Error Message]
java.lang.RuntimeException: Unable to start activity ComponentInfo{com.surlofia.csv_endzeit_01/com.surlofia.csv_endzeit_01.MainActivity}: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
In my opinion there are basically two parts to your question. First you need an understanding of the Kotlin type system including generics. Secondly you want some knowledge about approaches to the problem at hand.
type-system and generics
The function mutableListOf you're using is generic and thus needs a single type parameter T, as can be seen by definition its taken from the documentation:
fun <T> mutableListOf(): MutableList<T>
Most of the time the Kotlin compiler is quite good at type-inference, that is guessing the type used based on the context. For example, I do not need to provide a type explicitly in the following example, because the Kotlin compiler can infer the type from the usage context.
val listWithInts = mutableListOf(3, 7)
The infered type is MutableList<Int>.
However, sometimes this might not be what one desires. For example, I might want to allow null values in my list above. To achieve this, I have to tell the compiler that it should not only allow Int values to the list but also null values, widening the type from Int to Int?. I can achieve this in at least two ways.
providing a generic type parameter
val listWithNullableInts = mutableListOf<Int?>(3, 7)
defining the expected return type explicitly
val listWithNullableInts: MutableList<Int?> = mutableListOf(3, 7)
In your case the compiler does NOT have enough information to infer the type from the usage context. Thus you either have to provide it that context, e.g. by passing values of a specific type to the function or using one of the two options named above.
initialization of multidimensional arrays
There are questions and answers on creating multi-dimensional arrays in Kotlin on StackOverflow already.
One solution to your problem at hand might be the following.
val csvOUT: MutableList<MutableList<String>> = mutableListOf(mutableListOf())
for (i in 0 .. 10) {
csvOUT[i][0] = "$i"
csvOUT[i][1] = "OK"
}
You help the Kotlin compiler by defining the expected return type explicitly and then add the values as Strings to your 2D list.
If the dimensions are fixed, you might want to use fixed-size Arrays instead.
val csvArray = Array(11) { index -> arrayOf("$index", "OK") }
In both solutions you convert the Int index to a String however.
If the only information you want to store for each level is a String, you might as well use a simple List<String and use the index of each entry as the level number, e.g.:
val csvOut = List(11) { "OK" }
val levelThree = csvOut[2] // first index of List is 0
This would also work with more complicated data structures instead of Strings. You simply would have to adjust your fun File.writeAsCSV(values: List<List<String>>) to accept a different type as the values parameter.
Assume a simple data class you might end up with something along the lines of:
data class LevelState(val state: String, val timeBeaten: Instant?)
val levelState = List(11) { LevelState("OK", Instant.now()) }
fun File.writeAsCSV(values: List<LevelState>) {
val csvString = values
.mapIndexed { index, levelState -> "$index, ${levelState.state}, ${levelState.timeBeaten}" }
.joinToString("\n")
writeText(csvString)
}
If you prefer a more "classical" imperative approach, you can populate your 2-dimensional Array / List using a loop like for in.
val list: MutableList<MutableList<String>> = mutableListOf() // list is now []
for (i in 0..10) {
val innerList: MutableList<String> = mutableListOf()
innerList.add("$i")
innerList.add("OK")
innerList.add("${Instant.now()}")
list.add(innerList)
// list is after first iteration [ ["0", "OK", "2022-06-15T07:03:14.315Z"] ]
}
The syntax listName[index] = value is just syntactic sugar for the operator overload of the set operator, see the documentation on MutableList for example.
You cannot access an index, that has not been populated before, e.g. during the List's initialization or by using add; or else you're greeted with a IndexOutOfBoundsException.
If you want to use the set operator, one option is to use a pre-populated Array as such:
val array: Array<Array<String>>> = Array(11) {
Array(3) { "default" }
} // array is [ ["default, "default", "default"], ...]
array[1][2] = "myValue"
However, I wouldn't recommend this approach, as it might lead to left over, potentially invalid initial data, in case one misses to replace a value.
The long title contain also a mini-exaple because I couldn't explain well what I'm trying to do. Nonethless, the similar questions windows led me to various implementation. But since I read multiple times that it's a bad design, I would like to ask if what I'm trying to do is a bad design rather asking how to do it. For this reason I will try to explain my use case with a minial functional code.
Suppose I have a two classes, each of them with their own parameters:
class MyClass1:
def __init__(self,param1=1,param2=2):
self.param1=param1
self.param2=param2
class MyClass2:
def __init__(self,param3=3,param4=4):
self.param3=param3
self.param4=param4
I want to print param1...param4 as a string (i.e. "param1"..."param4") and not its value (i.e.=1...4).
Why? Two reasons in my case:
I have a GUI where the user is asked to select one of of the class
type (Myclass1, Myclass2) and then it's asked to insert the values
for the parameters of that class. The GUI then must show the
parameter names ("param1", "param2" if MyClass1 was chosen) as a
label with the Entry Widget to get the value. Now, suppose the
number of MyClass and parameter is very high, like 10 classes and 20
parameters per class. In order to minimize the written code and to
make it flexible (add or remove parameters from classes without
modifying the GUI code) I would like to cycle all the parameter of
Myclass and for each of them create the relative widget, thus I need
the paramx names under the form od string. The real application I'm
working on is even more complex, like parameter are inside other
objects of classes, but I used the simpliest example. One solution
would be to define every parameter as an object where
param1.name="param1" and param1.value=1. Thus in the GUI I would
print param1.name. But this lead to a specifi problem of my
implementation, that's reason 2:
MyClass1..MyClassN will be at some point printed in a JSON. The JSON
will be a huge file, and also since it's a complex tree (the example
is simple) I want to make it as simple as possibile. To explain why
I don't like to solution above, suppose this situation:
class MyClass1:
def init(self,param1,param2,combinations=[]):
self.param1=param1
self.param2=param2
self.combinations=combinations
Supposse param1 and param2 are now list of variable size, and
combination is a list where each element is composed by all the
combination of param1 and param2, and generate an output from some
sort of calculation. Each element of the list combinations is an
object SingleCombination,for example (metacode):
param1=[1,2] param2=[5,6] SingleCombination.param1=1
SingleCombination.param2=5 SingleCombination.output=1*5
MyInst1.combinations.append(SingleCombination).
In my case I will further incapsulated param1,param2 in a object
called parameters, so every condition will hace a nice tree with
only two object, parameters and output, and expanding parameters
node will show all the parameters with their value.
If I use JSON pickle to generate a JSON from the situation above, it
is nicely displayed since the name of the node will be the name of
the varaible ("param1", "param2" as strings in the JSON). But if I
do the trick at the end of situation (1), creating an object of
paramN as paramN.name and paramN.value, the JSON tree will become
ugly but especially huge, because if I have a big number of
condition, every paramN contains 2 sub-element. I wrote the
situation and displayed with a JSON Viewer, see the attached immage
I could pre processing the data structure before creating the JSON,
the problem is that I use the JSON to recreate the data structure in
another session of the program, so I need all the pieces of the data
structure to be in the JSON.
So, from my requirements, it seems that the workround to avoid print the variable names creates some side effect on the JSON visualization that I don't know how to solve without changing the logic of my program...
If you use dataclasses, getting the field names is pretty straightforward:
from dataclasses import dataclass, fields
#dataclass
class MyClass1:
first:int = 4
>>> fields(MyClass1)
(Field(name='first',type=<class 'int'>,default=4,...),)
This way, you can iterate over the class fields and ask your user to fill them. Note the field has a type, which you could use to eg ask the user for several values, as in your example.
You could add functions to extract programatically the param names (_show_inputs below ) from the class and values from instances (_json below ):
def blossom(cls):
"""decorate a class with `_json` (classmethod) and `_show_inputs` (bound)"""
def _json(self):
return json.dumps(self, cls=DataClassEncoder)
def _show_inputs(cls):
return {
field.name: field.type.__name__
for field in fields(cls)
}
cls._json = _json
cls._show_inputs = classmethod(_show_inputs)
return cls
NOTE 1: There's actually no need to decorate the classes with blossom. You could just use its internal functions programatically.
Using a custom json encoder to dump the dataclass objects, including properties:
import json
class DataClassPropEncoder(json.JSONEncoder): # https://stackoverflow.com/a/51286749/7814595
def default(self, o):
if is_dataclass(o):
cls = type(o)
# inject instance properties
props = {
name: getattr(o, name)
for name, value in cls.__dict__.items() if isinstance(value, property)
}
return {
**props,
**asdict(o)
}
return super().default(o)
Finally, wrap the computations inside properties so they are
serialized as well when using the decorated class. Full code example:
from dataclasses import asdict
from dataclasses import dataclass
from dataclasses import fields
from dataclasses import is_dataclass
import json
from itertools import product
from typing import List
class DataClassPropEncoder(json.JSONEncoder): # https://stackoverflow.com/a/51286749/7814595
def default(self, o):
if is_dataclass(o):
cls = type(o)
props = {
name: getattr(o, name)
for name, value in cls.__dict__.items() if isinstance(value, property)
}
return {
**props,
**asdict(o)
}
return super().default(o)
def blossom(cls):
def _json(self):
return json.dumps(self, cls=DataClassEncoder)
def _show_inputs(cls):
return {
field.name: field.type.__name__
for field in fields(cls)
}
cls._json = _json
cls._show_inputs = classmethod(_show_inputs)
return cls
#blossom
#dataclass
class MyClass1:
param1:int
param2:int
#blossom
#dataclass
class MyClass2:
param3: List[str]
param4: List[int]
def _compute_single(self, values): # TODO: implmement this
return values[0]*values[1]
#property
def combinations(self):
# TODO: cache if used more than once
# TODO: combinations might explode
field_names = []
field_values = []
cls = type(self)
for field in fields(cls):
field_names.append(field.name)
field_values.append(getattr(self, field.name))
results = []
for values in product(*field_values):
result = {
**{
field_names[idx]: value
for idx, value in enumerate(values)
},
"output": self._compute_single(values)
}
results.append(result)
return results
>>> print(f"MyClass1:\n{MyClass1._show_inputs()}")
MyClass1:
{'param1': 'int', 'param2': 'int'}
>>> print(f"MyClass2:\n{MyClass2._show_inputs()}")
MyClass2:
{'param3': 'List', 'param4': 'List'}
>>> obj_1 = MyClass1(3,4)
>>> print(f"obj_1:\n{obj_1._json()}")
obj_1:
{"param1": 3, "param2": 4}
>>> obj_2 = MyClass2(["first", "second"],[4,2])._json()
>>> print(f"obj_2:\n{obj_2._json()}")
obj_2:
{"combinations": [{"param3": "first", "param4": 4, "output": "firstfirstfirstfirst"}, {"param3": "first", "param4": 2, "output": "firstfirst"}, {"param3": "second", "param4": 4, "output": "secondsecondsecondsecond"}, {"param3": "second", "param4": 2, "output": "secondsecond"}], "param3": ["first", "second"], "param4": [4, 2]}
NOTE 2: If you need to perform several computations per class, it might be a good idea to abstract away the pattern in the combinations property to avoid repeating code.
NOTE 3: If you need access to the properties several times and not ust once, you might want to consider caching their values to avoid re-computation.
Once you have an instance of MyClass / MyClass2, you can call vars() or vars().keys() and it will give you the attributes as a str. Unlike dir, it will not show all the builtin attributes/methods starting with __.
class MyClass2:
def __init__(self,param3=3,param4=4):
self.param3=param3
self.param4=param4
instance_of_myclass2 = MyClass2(param3="what", param4="ever")
print(vars(instance_of_myclass2))
{'param3': 'what', 'param4': 'ever'}
print(vars(instance_of_myclass2).keys())
dict_keys(['param3', 'param4'])
dir(instance_of_myclass2)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'param3', 'param4']
[I ran into the issues that prompted this question and my previous question at the same time, but decided the two questions deserve to be separate.]
The docs describe using destructuring assignment with my and our variables, but don't mention whether it can be used with has variables. But Raku is consistent enough that I decided to try, and it appears to work:
class C { has $.a; has $.b }
class D { has ($.a, $.b) }
C.new: :a<foo>; # OUTPUT: «C.new(a => "foo", b => Any)»
D.new: :a<foo>; # OUTPUT: «D.new(a => "foo", b => Any)»
However, this form seems to break attribute defaults:
class C { has $.a; has $.b = 42 }
class D { has ($.a, $.b = 42) }
C.new: :a<foo>; # OUTPUT: «C.new(a => "foo", b => 42)»
D.new: :a<foo>; # OUTPUT: «C.new(a => "foo", b => Any)»
Additionally, flipping the position of the default provides an error message that might provide some insight into what is going on (though not enough for me to understand if the above behavior is correct).
class D { has ($.a = 42, $.b) }
# OUTPUT:
===SORRY!=== Error while compiling:
Cannot put required parameter $.b after optional parameters
So, a few questions: is destructuring assignment even supposed to work with has? If so, is the behavior with default values correct/is there a way to assign default values when using destucturing assignment?
(I really hope that destructuring assignment is supported with has and can be made to work with default values; even though it might seem like a niche feature for someone using classes for true OO, it's very handy for someone writing more functional code who wants to use a class as a slightly-more-typesafe Hash with fixed keys. Being able to write things like class User { has (Str $.first-name, Str $.last-name, Int $.age) } is very helpful for that sort of code)
This is currently a known bug in Rakudo. The intended behavior is for has to support list assignment, which would make syntax very much like that shown in the question work.
I am not sure if the supported syntax will be:
class D { has ($.a, $.b = 42) }
D.new: :a<foo>; # OUTPUT: «D.new(a => "foo", b => 42)»
or
class D { has ($.a, $.b) = (Any, 42) }
D.new: :a<foo>; # OUTPUT: «D.new(a => "foo", b => 42)»
but, either way, there will be a way to use a single has to declare multiple attributes while also providing default values for those attributes.
The current expectation is that this bug will get resolved sometime after the RakuAST branch is merged.
For example, you might have function with a complicated signature and varargs:
fun complicated(easy: Boolean = false, hard: Boolean = true, vararg numbers: Int)
It would make sense that you should be able to call this function like so:
complicated(numbers = 1, 2, 3, 4, 5)
Unfortunately the compiler doesn't allow this.
Is it possible to use named arguments for varargs? Are there any clever workarounds?
To pass a named argument to a vararg parameter, use the spread operator:
complicated(numbers = *intArrayOf(1, 2, 3, 4, 5))
It can be worked around by moving optional arguments after the vararg:
fun complicated(vararg numbers: Int, easy: Boolean = false, hard: Boolean = true) = {}
Then it can be called like this:
complicated(1, 2, 3, 4, 5)
complicated(1, 2, 3, hard = true)
complicated(1, easy = true)
Note that trailing optional params need to be always passed with name.
This won't compile:
complicated(1, 2, 3, 4, true, true) // compile error
Another option is to spare vararg sugar for explicit array param:
fun complicated(easy: Boolean = false, hard: Boolean = true, numbers: IntArray) = {}
complicated(numbers = intArrayOf(1, 2, 3, 4, 5))
Kotlin Docs says clearly that:
Variable number of arguments (Varargs)
A parameter of a function (normally the last one) may be marked with
vararg modifier:
fun <T> asList(vararg ts: T): List<T> {
val result = ArrayList<T>()
for (t in ts) // ts is an Array
result.add(t)
return result
}
allowing a variable number of arguments to be passed to the function:
val list = asList(1, 2, 3)
Inside a function a vararg-parameter of type T is visible as an
array of T, i.e. the ts variable in the example above has type
Array<out T>.
Only one parameter may be marked as vararg. If a vararg parameter
is not the last one in the list, values for the following parameters
can be passed using the named argument syntax, or, if the parameter
has a function type, by passing a lambda outside parentheses.
When we call a vararg-function, we can pass arguments one-by-one,
e.g. asList(1, 2, 3), or, if we already have an array and want to
pass its contents to the function, we use the spread operator
(prefix the array with *):
val a = arrayOf(1, 2, 3)
val list = asList(-1, 0, *a, 4)
From: https://kotlinlang.org/docs/reference/functions.html#variable-number-of-arguments-varargs
To resume, you can make it using spread operator so it would look like:
complicated(numbers = *intArrayOf(1, 2, 3, 4, 5))
Hope it will help
The vararg parameter can be anywhere in the list of parameters. See example below of how it may be called with different set of parameters. BTW any call can also provide lambda after closed parenthesis.
fun varargs(
first: Double = 0.0,
second: String = "2nd",
vararg varargs: Int,
third: String = "3rd",
lambda: ()->Unit = {}
) {
...
}
fun main(args: Array<String>) {
val list = intArrayOf(1, 2, 3)
varargs(1.0, "...", *list, third="third")
varargs(1.0, "...", *list)
varargs(1.0, varargs= *list, third="third")
varargs(varargs= *list, third="third")
varargs(varargs= *list, third="third", second="...")
varargs(varargs= *list, second="...")
varargs(1.0, "...", 1, 2, 3, third="third")
varargs(1.0, "...", 1, 2, 3)
varargs(1.0)
varargs(1.0, "...", third="third")
varargs(1.0, third="third")
varargs(third="third")
}
I have a RDD[Map[String,Int]] where the keys of the maps are the column names. Each map is incomplete and to know the column names I would need to union all the keys. Is there a way to avoid this collect operation to know all the keys and use just once rdd.saveAsTextFile(..) to get the csv?
For example, say I have an RDD with two elements (scala notation):
Map("a"->1, "b"->2)
Map("b"->1, "c"->3)
I would like to end up with this csv:
a,b,c
1,2,0
0,1,3
Scala solutions are better but any other Spark-compatible language would do.
EDIT:
I can try to solve my problem from another direction also. Let's say I somehow know all the columns in the beginning, but I want to get rid of columns that have 0 value in all maps. So the problem becomes, I know that the keys are ("a", "b", "c") and from this:
Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)
I need to write the csv:
a,b
1,2
3,1
Would it be possible to do this with only one collect?
If you're statement is: "every new element in my RDD may add a new column name I have not seen so far", the answer is obviously can't avoid a full scan. But you don't need to collect all elements on the driver.
You could use aggregate to only collect column names. This method takes two functions, one is to insert a single element into the resulting collection, and another one to merge results from two different partitions.
rdd.aggregate(Set.empty[String])( {(s, m) => s union m.keySet }, { (s1, s2) => s1 union s2 })
You will get back a set of all column names in the RDD. In a second scan you can print the CSV file.
Scala and any other supported language
You can use spark-csv
First lets find all present columns:
val cols = sc.broadcast(rdd.flatMap(_.keys).distinct().collect())
Create RDD[Row]:
val rows = rdd.map {
row => { Row.fromSeq(cols.value.map { row.getOrElse(_, 0) })}
}
Prepare schema:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(
cols.value.map(field => StructField(field, IntegerType, true)))
Convert RDD[Row] to Data Frame:
val df = sqlContext.createDataFrame(rows, schema)
Write results:
// Spark 1.4+, for other versions see spark-csv docs
df.write.format("com.databricks.spark.csv").save("mycsv.csv")
You can do pretty much the same thing using other supported languages.
Python
If you use Python and final data fits in a driver memory you can use Pandas through toPandas() method:
rdd = sc.parallelize([{'a': 1, 'b': 2}, {'b': 1, 'c': 3}])
cols = sc.broadcast(rdd.flatMap(lambda row: row.keys()).distinct().collect())
df = sqlContext.createDataFrame(
rdd.map(lambda row: {k: row.get(k, 0) for k in cols.value}))
df.toPandas().save('mycsv.csv')
or directly:
import pandas as pd
pd.DataFrame(rdd.collect()).fillna(0).save('mycsv.csv')
Edit
One possible way to the second collect is to use accumulators to either build a set of all column names or to count these where you found zeros and use this information to map over rows and remove unnecessary columns or to add zeros.
It is possible but inefficient and feels like cheating. The only situation when it makes some sense is when number of zeros is very low, but I guess it is not the case here.
object ColsSetParam extends AccumulatorParam[Set[String]] {
def zero(initialValue: Set[String]): Set[String] = {
Set.empty[String]
}
def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = {
s1 ++ s2
}
}
val colSetAccum = sc.accumulator(Set.empty[String])(ColsSetParam)
rdd.foreach { colSetAccum += _.keys.toSet }
or
// We assume you know this upfront
val allColnames = sc.broadcast(Set("a", "b", "c"))
object ZeroColsParam extends AccumulatorParam[Map[String, Int]] {
def zero(initialValue: Map[String, Int]): Map[String, Int] = {
Map.empty[String, Int]
}
def addInPlace(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] = {
val keys = m1.keys ++ m2.keys
keys.map(
(k: String) => (k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0)))).toMap
}
}
val accum = sc.accumulator(Map.empty[String, Int])(ZeroColsParam)
rdd.foreach { row =>
// If allColnames.value -- row.keys.toSet is empty we can avoid this part
accum += (allColnames.value -- row.keys.toSet).map(x => (x -> 1)).toMap
}