How can I create an empty dataset using SparkContext in Code Workbook in Palantir Foundry?

How do I create a bare minimum PySpark DataFrame in a Palantir Foundry Code Workbook?
To do this in a Code Repository I'd use:
my_df = ctx.spark_session.createDataFrame([('1',)], ["a"])

Code Workbook injects a global spark variable as the Spark session, rather than providing a transform context ctx. You can use it in a Python transform ('New Transform' > 'Python Code'):
def my_dataframe():
    return spark.createDataFrame([('1',)], ["a"])
Or with a defined schema:
from pyspark.sql import types as T
from datetime import datetime

SCHEMA = T.StructType([
    T.StructField('entity_name', T.StringType()),
    T.StructField('thing_value', T.IntegerType()),
    T.StructField('created_at', T.TimestampType()),
])

def my_dataframe():
    return spark.createDataFrame([("Name", 3, datetime.now())], SCHEMA)
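If you need a truly empty dataset, as the question title asks, the same global spark session accepts an empty list of rows together with an explicit schema; a minimal sketch reusing the SCHEMA defined above:
def my_empty_dataframe():
    # An empty list of rows plus the explicit SCHEMA yields a zero-row DataFrame
    return spark.createDataFrame([], SCHEMA)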

Related

How to export all data from an Elasticsearch index to a file in JSON format with the _id field specified?

I'm new to both Spark and Scala. I'm trying to read all data from a particular index in Elasticsearch into an RDD and use this data to write to MongoDB.
I'm loading the Elasticsearch data into an esJsonRDD, and when I print the RDD contents it is in the following format:
(1765770532,{"FirstName":"ABC","LastName":"DEF","Zipcode":"36905","City":"PortAdam","StateCode":"AR"})
Expected format:
{"_id":"1765770532","FirstName":"ABC","LastName":"DEF","Zipcode":"36905","City":"PortAdam","StateCode":"AR"}
How can I get the output from Elasticsearch formatted this way? Any help would be appreciated. My code is below.
object readFromES {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("readFromES")
      .set("es.nodes", Config.ES_NODES)
      .set("es.nodes.wan.only", Config.ES_NODES_WAN_ONLY)
      .set("es.net.http.auth.user", Config.ES_NET_HTTP_AUTH_USER)
      .set("es.net.http.auth.pass", Config.ES_NET_HTTP_AUTH_PASS)
      .set("es.net.ssl", Config.ES_NET_SSL)
      .set("es.output.json", "true")
    val sc = new SparkContext(conf)
    val RDD = EsSpark.esJsonRDD(sc, "userdata/user")
    //RDD.coalesce(1).saveAsTextFile(args(0))
    RDD.take(5).foreach(println)
  }
}
I would like the RDD output to be written to a file in the following JSON format (one line per doc):
{"_id":"1765770532","FirstName":"ABC","LastName":"DEF","Zipcode":"36905","City":"PortAdam","StateCode":"AR"}
{"_id":"1765770533","FirstName":"DEF","LastName":"DEF","Zipcode":"35525","City":"PortWinchestor","StateCode":"AI"}
"_id" is part of the metadata; to access it you should add .config("es.read.metadata", true) to the config.
Then you can access it in two ways. You can use
val RDD = EsSpark.esJsonRDD(sc, "userdata/user")
and manually add the _id field into the JSON.
Or, the easier way is to read it as a DataFrame:
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("userdata/user")
  .withColumn("_id", $"_metadata".getItem("_id"))
  .drop("_metadata")

// Write as JSON to a file
df.write.json("output folder")
Here spark is the SparkSession created as:
val spark = SparkSession.builder().master("local[*]").appName("Test")
  .config("spark.es.nodes", "host")
  .config("spark.es.port", "ports")
  .config("spark.es.nodes.wan.only", "true")
  .config("es.read.metadata", true) // for enabling metadata
  .getOrCreate()
Hope this helps

Is there a way to save JSON with flask_sqlalchemy on an SQLite backend?

I am trying to save data in the form of JSON (returned as the result of a POST request):
def get_data(...):
    ...
    try:
        _r = requests.post(
            _url_list,
            headers=_headers
        )
        return _r.json()
    except Exception as ee:
        print('Could not get data: {}'.format(ee))
        return None
into a table in an SQLite database as the backend:
def add_to_flight_data(_data):
    if _data:
        try:
            new_record = FlightData(data=_data)
            db.session.add(new_record)
            db.session.commit()
            print('Data inserted to DB!')
            return "Success"
        except Exception as e:
            print('Data NOT inserted to DB! {}'.format(e))
            pass
This is my simple Flask code:
import os
import time
import auth
import json
import requests
import datetime
from flask import Flask
from flask_marshmallow import Marshmallow
from flask_sqlalchemy import SQLAlchemy
# from safrs.safrs_types import JSONType

project_dir = os.path.dirname(os.path.abspath(__file__))
database_file = "sqlite:///{}".format(os.path.join(project_dir, "2w.sqlite"))

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = database_file
db = SQLAlchemy(app)
ma = Marshmallow(app)

class FlightData(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    created = db.Column(db.DateTime, server_default=db.func.now())
    json_data = db.Column(db.JSONType, default={})

    def __init__(self, data):
        self.data = data
It seems like there is no option to save JSON in SQLite:
json_data = db.Column(db.JSONType, default={})
Please advise.
Thanks.
I believe you should be using db.JSON, not db.JSONType, as there is no such column type in SQLAlchemy.
Regardless of that, SQLite has no JSON data type, so SQLAlchemy won't be able to map columns of type db.JSON onto anything. According to the documentation, only PostgreSQL and some MySQL versions are supported. There is support for JSON in SQLite through the JSON1 extension, but SQLAlchemy will not be able to make use of it.
Your best bet, then, is to declare the column as db.Text and use json.dumps() to serialize the data on write. Alternatively, modify your get_data() function to check for a JSON response (check the Content-Type header, or try calling _r.json() and catching exceptions) and then return _r.content, which will already be a JSON string.
Use json.loads() to read the data back from the db.
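A minimal sketch of the db.Text approach, assuming the db object and FlightData model from the question above (the as_dict helper is illustrative, not part of the original code):
import json

class FlightData(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    created = db.Column(db.DateTime, server_default=db.func.now())
    json_data = db.Column(db.Text, default='{}')  # JSON kept as serialized text

    def __init__(self, data):
        # Serialize the Python object to a JSON string on write
        self.json_data = json.dumps(data)

    def as_dict(self):
        # Deserialize the stored string back into Python objects on read
        return json.loads(self.json_data)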

Use a pre-defined schema when reading JSON in PySpark

Currently, if I want to read JSON with PySpark, I either use the inferred schema or I have to define my schema manually as a StructType.
Is it possible to use a file as a reference for the schema?
You can indeed use a file to define your schema. For example, for the following schema:
TICKET:string
TRANSFERRED:string
ACCOUNT:integer
you can use this code to import it:
import csv
from collections import OrderedDict
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = OrderedDict()
with open(r'schema.txt') as csvfile:
    schemareader = csv.reader(csvfile, delimiter=':')
    for row in schemareader:
        schema[row[0]] = row[1]
and then you can use it to create your StructType schema on the fly:
mapping = {"string": StringType, "integer": IntegerType}

schema = StructType([
    StructField(k, mapping.get(v.lower())(), True) for (k, v) in schema.items()
])
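Once built, the StructType can be passed straight to the DataFrame reader; a minimal sketch, assuming an existing SparkSession named spark and a placeholder input path data.json:
# Read the JSON file with the schema built from schema.txt instead of inferring it
df = spark.read.schema(schema).json("data.json")
df.printSchema()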
You may have to create a more complex schema file for a JSON input; however, please note that you can't use a JSON file to define your schema, as the order of the columns is not guaranteed when parsing JSON.

Not able to read streaming files using Spark structured streaming

I have a set of CSV files which need to be read through Spark Structured Streaming. After creating a DataFrame I need to load it into a Hive table.
When a file is already present before running the code through spark-submit, the data is loaded into Hive successfully. But when I add new CSV files at runtime, they are not inserted into Hive at all.
Code is:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode

val spark = SparkSession.builder()
  .appName("Spark SQL Example")
  .config("hive.metastore.uris", "thrift://hostname:port")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()

spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._

val df = spark.readStream.option("header", true).csv("file:///folder path/")
val query = df.writeStream.queryName("tab").format("memory").outputMode(OutputMode.Append()).start()
spark.sql("insert into hivetab select * from tab").show()
query.awaitTermination()
Am I missing something here?
Any suggestions would be helpful.
Thanks

How to save JSON data fetched from URL in PySpark?

I have fetched some JSON data from an API:
import urllib2
test=urllib2.urlopen('url')
print test
How can I save it as a table or DataFrame? I am using Spark 2.0.
This is how I succeeded in importing JSON data from the web into a DataFrame:
from pyspark.sql import SparkSession, functions as F
from urllib.request import urlopen
spark = SparkSession.builder.getOrCreate()
url = 'https://web.url'
jsonData = urlopen(url).read().decode('utf-8')
rdd = spark.sparkContext.parallelize([jsonData])
df = spark.read.json(rdd)
For this you can do some research and try using sqlContext. Here is sample code:
>>> df2 = sqlContext.jsonRDD(test)
>>> df2.first()
Moreover, visit the link below and check the docs for more details:
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
Adding to Rakesh Kumar's answer, the way to do it in Spark 2.0 is:
http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#data-sources
As an example, the following creates a DataFrame based on the content of a JSON file:
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.
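If you do have a regular multi-line JSON document rather than JSON Lines, Spark 2.2 and later can read it with the multiLine option; a minimal sketch, with the file path as a placeholder:
# Treat the whole file as one JSON document instead of one document per line
df = spark.read.option("multiLine", "true").json("examples/src/main/resources/people_multiline.json")
df.show()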
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Project").getOrCreate()

# Click on "Raw" and then copy the URL
zip_url = "https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json"
spark.sparkContext.addFile(zip_url)

zip_df = spark.read.json("file://" + SparkFiles.get("zipcodes.json"))