How can I create an empty dataset from a PySpark schema in Palantir Foundry?

I have a PySpark schema that describes columns and their types for a dataset (which I could write by hand, or get from an existing dataset by going to the 'Columns' tab, then 'Copy PySpark schema').
I want an empty dataset with this schema, for example one that could be used as the backing dataset for a writeback-only ontology object. How can I create this in Foundry?

To do this in Python, use the Spark session from the transform context to create an empty DataFrame with the schema, for example:
from pyspark.sql import types as T
from transforms.api import transform_df, configure, Output

SCHEMA = T.StructType([
    T.StructField('entity_name', T.StringType()),
    T.StructField('thing_value', T.IntegerType()),
    T.StructField('created_at', T.TimestampType()),
])


# Given there is no work to do, save on compute by running it on the driver
@configure(profile=["KUBERNETES_NO_EXECUTORS_SMALL"])
@transform_df(
    Output("/some/dataset/path/or/rid"),
)
def compute(ctx):
    return ctx.spark_session.createDataFrame([], schema=SCHEMA)

To do this in Java, you can create a transform using the Spark session on the TransformContext:
package myproject.datasets;

import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.Output;
import com.palantir.transforms.lang.java.api.TransformProfiles;
import com.palantir.transforms.lang.java.api.TransformContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.*;

import java.util.List;

public final class MyTransform {

    private static final StructType SCHEMA = new StructType()
            .add(new StructField("entity_name", DataTypes.StringType, true, Metadata.empty()))
            .add(new StructField("thing_value", DataTypes.IntegerType, true, Metadata.empty()))
            .add(new StructField("created_at", DataTypes.TimestampType, true, Metadata.empty()));

    @Compute
    // Given there is no work to do, save on compute by running it on the driver
    @TransformProfiles({ "KUBERNETES_NO_EXECUTORS_SMALL" })
    @Output("/some/dataset/path/or/rid")
    public Dataset<Row> myComputeFunction(TransformContext context) {
        return context.sparkSession().createDataFrame(List.of(), SCHEMA);
    }
}

Related

Dart Getting error: Undefined name 'csvCodec'. (undefined_identifier) when using csv package

I am getting this error:
error: Undefined name 'csvCodec'. (undefined_identifier at [easy_csv] example\exa.dart:10)
I implemented the decoder example from the Dart csv package like this:
import 'dart:async';
import 'dart:convert';
import 'dart:io';
import 'package:csv/csv.dart';
main() async {
  final input = new File('foo.csv').openRead();
  final fields =
      await input.transform(utf8.decoder).transform(csvCodec.decoder).toList();
}
The issue is now solved.
As of Dart 2, csv can no longer be a codec, but the corresponding documentation had not been removed.
This change is now reflected in the documentation.
Example code for reading a csv file and printing its rows, as of the latest version (4.0.3):
import 'dart:async';
import 'dart:convert';
import 'dart:io';
import 'package:csv/csv.dart';
main() async {
  // TODO Change file_name
  String file_name = 'foo.csv';
  final input = File(file_name).openRead();
  // Every csv row is converted to a list of values.
  // Unquoted strings looking like numbers (integers and doubles) are by default converted to ints or doubles.
  final fields = await input
      .transform(utf8.decoder)
      .transform(new CsvToListConverter())
      .toList();
  print(fields);
}

Get a Json format for a Seq of a generic type

I have an abstract class with a generic type which gets a Json format for that generic type from its subclass. But the abstract class also needs a Json format of a sequence of that type. Is there any way in Scala to get a Json format of a sequence of things based only on the format of those things?
I'm using the Play Json framework.
Here's an example that doesn't follow my case exactly but provides a good indication of what I want to achieve:
package scalatest
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import java.util.UUID
import scala.util.control.NonFatal
import play.api.libs.json.Format
import play.api.libs.json.Json
object Banana {

  def main(args: Array[String]): Unit = {
    val f: Format[Seq[Banana]] = getSeqFormat(Json.format[Banana])
  }

  def getSeqFormat[T](format: Format[T]): Format[Seq[T]] = {
    ??? // TODO implement
  }
}

case class Banana(color: String)
If you're just trying to serialize bananas into JSON objects, then the only thing you need to do is define the implicit JSON format for Banana; the others (such as the Seq format) are built into Play:
import play.api.libs.json.Json

case class Banana(color: String)

object Banana {
  implicit val jsonFormat = Json.writes[Banana]
}

object PlayJsonTest extends App {
  val bananas = Seq(Banana("yellow"), Banana("green"))
  println(Json.toJson(bananas)) // [{"color":"yellow"},{"color":"green"}]
}
This also works for other types because the Json#toJson method is defined as follows:
// Give me an implicit `Writes[T]` and I know how to serialize it
def toJson[T](o: T)(implicit tjs: Writes[T]): JsValue = tjs.writes(o)
The defaults are used implicitly, and they include formats for most of the collections. You can find them here.
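If you still want to materialize a Format[Seq[T]] explicitly (for example, to fill in the getSeqFormat stub from the question), a minimal sketch that simply assembles play-json's built-in collection implicits, assuming an implicit Format[T] is in scope, could look like this:

import play.api.libs.json.{Format, Reads, Writes}

// play-json already derives Reads[Seq[T]] and Writes[Seq[T]] from an implicit
// Format[T], so a Format[Seq[T]] can be assembled from those implicits.
def getSeqFormat[T](implicit format: Format[T]): Format[Seq[T]] =
  Format(implicitly[Reads[Seq[T]]], implicitly[Writes[Seq[T]]])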
I hope that helps you.

Scala Spark - creating nested json output from simple dataframe

Thanks for getting back. But the problem I am facing is with writing those structs into nested JSON. Somehow 'toJSON' is not working and just skips the nested fields, always resulting in a flat structure. How can I write nested JSON to HDFS?
You should create the struct fields from the fields which have to be nested together.
Below is a working example:
Assume you have employee data in CSV format containing the company name, employee name and department name, and you want to list all the employees per department per company in JSON format. Below is the code for the same.
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;

import scala.collection.mutable.WrappedArray;

public class JsonExample {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession
                .builder()
                .appName("JsonExample")
                .master("local")
                .getOrCreate();

        // read the csv file
        Dataset<Row> employees = sparkSession.read().option("header", "true").csv("/tmp/data/emp.csv");
        // create the temp view
        employees.createOrReplaceTempView("employees");
        // First, group the employees based on company AND department
        sparkSession.sql("select company,department,collect_list(name) as department_employees from employees group by company,department").createOrReplaceTempView("employees");
        /* Now create a struct with the built-in struct() SQL function.
         * The struct will contain the department and its list of employees.
         */
        sparkSession.sql("select company,collect_list(struct(department,department_employees)) as department_info from employees group by company").toJSON().show(false);
    }
}
You can find the same example on my blog:
http://baahu.in/spark-how-to-generate-nested-json-using-dataset/
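Since the question asks about Scala and writing to HDFS, a rough equivalent sketch using the DataFrame API directly is shown below; the CSV layout (company, department, name) and the input/output paths are assumptions matching the Java example above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

val spark = SparkSession.builder().appName("NestedJsonExample").master("local[*]").getOrCreate()

// Assumed CSV layout: company, department, name
val employees = spark.read.option("header", "true").csv("/tmp/data/emp.csv")

val nested = employees
  .groupBy("company", "department")
  .agg(collect_list("name").as("department_employees"))
  .groupBy("company")
  .agg(collect_list(struct("department", "department_employees")).as("department_info"))

// write.json keeps the nested structure; the path can be an HDFS location
nested.write.json("/tmp/output/nested_employees")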

How to insert json fixture data in Play Specification tests?

I have a Scala Play 2.2.2 application and as part of my Specification tests I would like to insert some fixture data for testing preferably in json format. For the tests I use the usual in-memory H2 database. How can I accomplish this? I have searched all the documentation but there is no mention to this anywhere.
Note that I would prefer not to build my own flavor of fixture implementation via the Global. There should be a non-hacky way to do this, right?
AFAIK there is no built-in stuff to do this, ala Rails, and it's hard to imagine what the devs could do without making Play Scala much more opinionated about the way persistence should be handled (which I'd personally consider a negative.)
I also use H2 for testing and employ plain SQL fixtures in a resource file and load them before tests using a couple of (fairly simple) helpers:
package object helpers {

  import java.io.File
  import java.sql.CallableStatement

  import org.specs2.execute.{Result, AsResult}
  import org.specs2.mutable.Around
  import org.specs2.specification.Scope

  import play.api.db.DB
  import play.api.test.FakeApplication
  import play.api.test.Helpers._

  /**
   * Load a file containing SQL statements into the DB.
   */
  private def loadSqlResource(resource: String)(implicit app: FakeApplication) = DB.withConnection { conn =>
    val file = new File(getClass.getClassLoader.getResource(resource).toURI)
    val path = file.getAbsolutePath
    val statement: CallableStatement = conn.prepareCall(s"RUNSCRIPT FROM '$path'")
    statement.execute()
    conn.commit()
  }

  /**
   * Run a spec after loading the given resource name as SQL fixtures.
   */
  abstract class WithSqlFixtures(val resource: String, val app: FakeApplication = FakeApplication()) extends Around with Scope {

    implicit def implicitApp = app

    override def around[T: AsResult](t: => T): Result = {
      running(app) {
        loadSqlResource(resource)
        AsResult.effectively(t)
      }
    }
  }
}
Then, in your actual spec, you can do something like this:
package models

import helpers.WithSqlFixtures
import play.api.test.PlaySpecification

class MyModelSpec extends PlaySpecification {

  "My model" should {
    "locate items correctly" in new WithSqlFixtures("model-fixtures.sql") {
      MyModel.findAll().size must beGreaterThan(0)
    }
  }
}
Note: this specs2 stuff could probably be better.
Obviously if you really need JSON you'll have to add extra machinery to deserialise your models and persist them in the database (often in your app you'll be doing these things anyway, in which case that might be relatively trivial.)
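For example, a minimal sketch of a generic JSON fixture loader that could sit alongside loadSqlResource in the helpers package object above; the resource name, model type and persistence function in the usage comment (MyModel, MyModel.insert) are hypothetical stand-ins, not part of the original answer:

import play.api.libs.json.{Json, Reads}

/**
 * Hypothetical JSON fixture loader: parse a JSON array resource into model
 * instances and hand each one to a persistence function you supply.
 */
private def loadJsonResource[A: Reads](resource: String)(persist: A => Unit): Unit = {
  val source = scala.io.Source.fromURL(getClass.getClassLoader.getResource(resource))
  try {
    Json.parse(source.mkString).as[Seq[A]].foreach(persist)
  } finally {
    source.close()
  }
}

// e.g. loadJsonResource[MyModel]("model-fixtures.json")(MyModel.insert)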
You'll also need:
- Some evolutions to establish your DB schema in conf/evolutions/default
- The evolution plugin enabled, which will build your schema when the FakeApplication starts up
- The appropriate H2 DB config (see the sketch below)
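For the last point, a minimal sketch assuming the inMemoryDatabase helper from play.api.test.Helpers, with the resulting settings passed to the FakeApplication used by WithSqlFixtures:

import play.api.test.FakeApplication
import play.api.test.Helpers.inMemoryDatabase

// Run the test app against an in-memory H2 database; evolutions build the schema on startup.
val testApp = FakeApplication(additionalConfiguration = inMemoryDatabase())

// e.g. new WithSqlFixtures("model-fixtures.sql", testApp)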

Rendering JSON with Play! and Scala

I have a simple question regarding rendering a JSON object from a Scala class. Why do I have to implement a serializer/deserializer (reads, writes)?
I have the following case class:
case class User(firstname:String, lastname:String, age:Int)
And in my controller:
val milo:User = new User("Sam","Fisher",23);
Json.toJson(milo);
I get a compilation error: No Json deserializer found for type models.User. Try to implement an implicit Writes or Format for this type.
In my previous project I had to implement a reader/writer object in the class for it to work, and I find it very annoying.
object UserWebsite {
  implicit object UserWebsiteReads extends Format[UserWebsite] {
    def reads(json: JsValue) = UserWebsite(
      (json \ "email").as[String],
      (json \ "url").as[String],
      (json \ "imageurl").as[String])
    def writes(ts: UserWebsite) = JsObject(Seq(
      "email" -> JsString(ts.email),
      "url" -> JsString(ts.url),
      "imageurl" -> JsString(ts.imageurl)))
  }
}
I really recommend upgrading to Play 2.1-RC1, because there JSON writers/readers are very simple to define (more details here).
But to help you avoid some errors, here is a hint about the imports:
- use these imports only! (notice that json.Reads is not included)
import play.api.libs.json._
import play.api.libs.functional.syntax._
import play.api.libs.json.Writes._
and you only have to write this code to write/read your class to/from JSON (of course you will have User instead of Address):
implicit val addressWrites = Json.writes[Address]
implicit val addressReads = Json.reads[Address]
Now, they will be used automatically:
Example of write:
Ok(Json.toJson(entities.map(s => Json.toJson(s))))
Example of read (here I do a POST that creates an entity by reading JSON from the request body); please notice addressReads being used here:
def create = Action(parse.json) { request =>
  request.body.validate(addressReads).map { entity =>
    Addresses.insert(entity)
    Ok(RestResponses.toJson(RestResponse(OK, "Successfully created a new entity.")))
  }.recoverTotal { _ =>
    BadRequest(RestResponses.toJson(RestResponse(BAD_REQUEST, "Unable to transform JSON body to entity.")))
  }
}
In conclusion, they tried (and succeeded) to make things very simple regarding JSON.
If you are using Play 2.0.x you can do:
import com.codahale.jerkson.Json._
generate(milo)
generate uses reflection to do it.
In play 2.1 you can use Json.writes to create a macro for that implicit object you had to create. No runtime reflection needed!
import play.api.libs.json._
import play.api.libs.functional.syntax._
implicit val userWrites = Json.writes[User]
Json.toJson(milo)
I have been using Jerkson (which is basically a wrapper around Jackson) in my project to convert objects to JSON strings.
The simplest way to do that is:
import com.codahale.jerkson.Json._
...
generate(milo)
...
If you need to configure the ObjectMapper (e.g. adding a custom serializer/deserializer, configuring the output format, etc.), you can do it by creating an object which extends the com.codahale.jerkson.Json class.
package utils

import org.codehaus.jackson.map._
import org.codehaus.jackson.{Version, JsonGenerator, JsonParser}
import com.codahale.jerkson.Json
import org.codehaus.jackson.map.module.SimpleModule
import org.codehaus.jackson.map.annotate.JsonSerialize

object CustomJson extends Json {

  val module = new SimpleModule("CustomSerializer", Version.unknownVersion())

  // --- (SERIALIZERS) ---
  // Example:
  // module.addSerializer(classOf[Enumeration#Value], EnumerationSerializer)

  // --- (DESERIALIZERS) ---
  // Example:
  // module.addDeserializer(classOf[MyEnumType], new EnumerationDeserializer[MyEnumType](MyEnumTypes))

  mapper.setSerializationInclusion(JsonSerialize.Inclusion.NON_NULL)
  mapper.setSerializationConfig(mapper.getSerializationConfig.without(SerializationConfig.Feature.WRITE_NULL_MAP_VALUES))
  mapper.registerModule(module)
}
To use it in your code:
import utils.CustomJson._
...
generate(milo)
...
In fact, this is very simple. First, import:
import play.api.libs.json._
Thanks to the fact that User is a case class, you can automatically create a Json Writes using Json.writes[]:
val milo:User = new User("Millad","Dagdoni",23)
implicit val userImplicitWrites = Json.writes[User]
Json.toJson(milo)
I haven't found it in the docs, but here is the link to the api: http://www.playframework.com/documentation/2.2.x/api/scala/index.html#play.api.libs.json.Json$
In your case, I'd use the Json.format macro.
import play.api.libs.json._
implicit val userFormat = Json.format[User]
val milo = new User("Sam", "Fisher", 23)
val json = Json.toJson(milo)