I know how to read a CSV with PySpark, but I'm having a lot of trouble loading it with the correct schema. My CSV has 3 columns, where the first and the second are strings, but the third is a list of dicts. I am not able to load this last column.
I tried with
schema = StructType([
    StructField("_id", StringType()),
    StructField("text", StringType()),
    StructField("links", ArrayType(elementType=MapType(StringType(), StringType())))
])
but it raises an error. Using inferSchema doesn't work either.
You need to set inferSchema="true". If that causes problems, read everything as a string, and then use ast.literal_eval() from the ast module to convert each string to a dict.
You can use this function:
def read_csv_spark(spark, file_path):
    """
    :param spark: SparkSession or SQLContext
    :param file_path: Path to the file
    :return: Spark DataFrame
    """
    df = (
        spark.read.format("com.databricks.spark.csv")
        .options(header="true", inferSchema="true")
        .load(file_path)
    )
    return df
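If the links column then comes in as a plain string, a minimal sketch of the literal_eval approach could look like this (it assumes the column is named links, as in your schema, and that the dict values are strings):
import ast

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, MapType, StringType

# Parse the stringified list of dicts into an array of maps.
parse_links = udf(lambda s: ast.literal_eval(s) if s else None,
                  ArrayType(MapType(StringType(), StringType())))

df = df.withColumn("links", parse_links("links"))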
I'm writing a Python Transform and need to get the SparkSession so I can construct a DataFrame.
How should I do this?
You can pass the transform context (ctx) as an argument to the transform, which can then be used to obtain the SparkSession.
from pyspark.sql.types import StringType, StructField, StructType
from transforms.api import transform, Output


@transform(
    output=Output('/path/to/first/output/dataset'),
)
def my_compute_function(ctx, output):
    # type: (TransformContext, TransformOutput) -> None

    # In this example, the Spark session is used to create an empty data frame.
    columns = [
        StructField("col_a", StringType(), True)
    ]
    empty_df = ctx.spark_session.createDataFrame([], schema=StructType(columns))
    output.write_dataframe(empty_df)
This example can also be found in the Foundry documentation here: https://www.palantir.com/docs/foundry/transforms-python/transforms-python-api/#transform
I am trying to bring JIRA data into Foundry using an external API. When it comes in via Magritte, the data gets stored in AVRO and there is a column called response. The response column has data that looks like this:
[{"id":"customfield_5","name":"test","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["cf[5]","test"],"schema":{"type":"user","custom":"com.atlassian.jira.plugin.system.customfieldtypes:userpicker","customId":5}},{"id":"customfield_2","name":"test2","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["test2","cf[2]"],"schema":{"type":"option","custom":"com.atlassian.jira.plugin.system.customfieldtypes:select","customId":2}}]
Because this imports as AVRO, the Foundry documentation on converting this kind of data doesn't work. How can I convert this data into individual columns and rows?
Here is the code that I've attempted to use:
from transforms.api import transform_df, Input, Output
from pyspark import SparkContext as sc
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
import json
import pyspark.sql.types as T


@transform_df(
    Output("json output"),
    json_raw=Input("json input"),
)
def my_compute_function(json_raw, ctx):
    sqlContext = SQLContext(sc)

    source = json_raw.select('response').collect()  # noqa

    # Read the list into data frame
    df = sqlContext.read.json(sc.parallelize(source))

    json_schema = T.StructType([
        T.StructField("id", T.StringType(), False),
        T.StructField("name", T.StringType(), False),
        T.StructField("custom", T.StringType(), False),
        T.StructField("orderable", T.StringType(), False),
        T.StructField("navigable", T.StringType(), False),
        T.StructField("searchable", T.StringType(), False),
        T.StructField("clauseNames", T.StringType(), False),
        T.StructField("schema", T.StringType(), False)
    ])

    udf_parse_json = udf(lambda str: parse_json(str), json_schema)

    df_new = df.select(udf_parse_json(df.response).alias("response"))

    return df_new


# Function to convert JSON array string to a list
def parse_json(array_str):
    json_obj = json.loads(array_str)
    for item in json_obj:
        yield (item["a"], item["b"])
Parsing JSON in a string column into a struct column (and then into separate columns) can easily be done using the F.from_json function.
In your case, you need to do:
df = df.withColumn("response_parsed", F.from_json("response", json_schema))
Then you can do this or similar to get the contents into different columns:
df = df.select("response_parsed.*")
However, this won't work as-is because your schema is incorrect: each row actually contains a list of JSON structs, not just one, so you need to wrap the whole schema in T.ArrayType(your_schema). You'll also need to do an F.explode before selecting, to get each array element into its own row.
An additional useful function is F.get_json_object, which allows you to extract one JSON object from a JSON string.
Using a UDF like you've done could work, but UDFs are generally much less performant than native Spark functions.
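Putting that together, a minimal sketch could look like this (it only covers a subset of the fields from your sample; adjust the element schema to match your data):
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Schema for a single element of the JSON array (subset of fields for illustration).
element_schema = T.StructType([
    T.StructField("id", T.StringType()),
    T.StructField("name", T.StringType()),
    T.StructField("custom", T.BooleanType()),
    T.StructField("clauseNames", T.ArrayType(T.StringType())),
])

# The response column holds an array of objects, so wrap the element schema in ArrayType.
df = df.withColumn("response_parsed", F.from_json("response", T.ArrayType(element_schema)))

# One row per array element, then one column per struct field.
df = df.withColumn("response_parsed", F.explode("response_parsed"))
df = df.select("response_parsed.*")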
Additionally, all the AVRO file format does in this case is to merge multiple json files into one big file, with each file in its own row, so the example under "Rest API Plugin" - "Processing JSON in Foundry" should work as long as you skip the 'put this schema on the raw dataset' step.
I used the magritte-rest connector to walk through the paged results from search:
type: rest-source-adapter2
restCalls:
  - type: magritte-paging-inc-param-call
    method: GET
    path: search
    paramToIncrease: startAt
    increaseBy: 50
    initValue: 0
    parameters:
      startAt: '{%startAt%}'
    extractor:
      - type: json
        assign:
          issues: /issues
        allowNull: true
    condition:
      type: magritte-rest-non-empty-condition
      var: issues
    maxIterationsAllowed: 4096
cacheToDisk: false
oneFilePerResponse: false
That yielded a dataset with a response column holding each page's raw JSON.
Once I had that, the following code expanded and parsed the returned JSON issues into a properly-typed DataFrame, with fields holding the inner structure of each issue as a (very complex) struct:
import json

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import explode


def issues_enumerated(All_Issues_Paged):

    def generate_issue_row(input_row: Row) -> Row:
        """
        Generates a Row holding each response's issue array as a single array record.
        """
        d = input_row.asDict()
        resp_json = d['response']
        resp_obj = json.loads(resp_json)
        issues = list(map(json.dumps, resp_obj['issues']))
        return Row(issues=issues)

    # Obtain the active Spark session.
    spark = SparkSession.builder.getOrCreate()

    # array-per-record
    unexploded_df = All_Issues_Paged.rdd.map(generate_issue_row).toDF()

    # row-per-record
    row_per_record_df = unexploded_df.select(explode(unexploded_df.issues))

    # raw JSON string per-record RDD
    issue_json_strings_rdd = row_per_record_df.rdd.map(lambda _: _.col)

    # JSON object dataframe
    issues_df = spark.read.json(issue_json_strings_rdd)
    issues_df.printSchema()

    return issues_df
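If you want to wire this into a transform, a minimal sketch could look like the following (the dataset paths are hypothetical):
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/path/to/issues_enumerated"),                   # hypothetical output dataset
    All_Issues_Paged=Input("/path/to/all_issues_paged"),    # hypothetical input dataset
)
def compute(All_Issues_Paged):
    return issues_enumerated(All_Issues_Paged)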
I have a CSV file with a header that has to be read through Spark (2.0.0, Scala 2.11.8) as a dataframe.
Sample csv data:
Item,No. of items,Place
abc,5,xxx
def,6,yyy
ghi,7,zzz
.........
I'm facing a problem when I try to read this CSV data in Spark as a dataframe, because the header contains a column (No. of items) with the special character ".".
The code with which I try to read the CSV data is:
val spark = SparkSession.builder().appName("SparkExample").getOrCreate()
import spark.implicits._
val df = spark.read.option("header", "true").csv("file:///INPUT_FILENAME")
Error I'm facing:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to resolve No. of items given [Item,No. of items,Place];
If I remove the "." from the header, I won't get any error. I even tried escaping the character, but it escapes all the "." characters, even in the data.
Is there any way to escape the special character "." only in the CSV header using Spark code?
@Pooja Nayak, not sure if this was solved; answering in the interest of the community.
sc: SparkContext
spark: SparkSession
sqlContext: SQLContext
// Read the raw file from localFS as-is.
val rdd_raw = sc.textFile("file:///home/xxxx/sample.csv")
// Drop the first line in first partition because it is the header.
val rdd = rdd_raw.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
// A function to create schema dynamically.
def schemaCreator(header: String): StructType = {
  StructType(header
    .split(",")
    .map(field => StructField(field.trim, StringType, true))
  )
}
// Create the schema for the csv that was read and store it.
val csvSchema: StructType = schemaCreator(rdd_raw.first)
// As the input is CSV, split it at "," and trim away the whitespaces.
val rdd_curated = rdd.map(x => x.split(",").map(y => y.trim)).map(xy => Row(xy:_*))
// Create the DF from the RDD.
val df = sqlContext.createDataFrame(rdd_curated, csvSchema)
Imports that are necessary:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark._
Here is an example that works with pyspark; hopefully the same approach will work for you with the corresponding language syntax.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

file = r'C:\Users\e5543130\Desktop\sampleCSV2.csv'
conf = SparkConf().setAppName('FICBOutputGenerator')
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
df = sqlContext.read.options(delimiter=",", header="true").csv("cars.csv")  # without the deprecated API
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("cars.csv")
I have a Pandas DataFrame with two columns – one with the filename and one with the hour in which it was generated:
File Hour
F1 1
F1 2
F2 1
F3 1
I am trying to convert it to a JSON file with the following format:
{"File":"F1","Hour":"1"}
{"File":"F1","Hour":"2"}
{"File":"F2","Hour":"1"}
{"File":"F3","Hour":"1"}
When I use the command DataFrame.to_json(orient = "records"), I get the records in the below format:
[{"File":"F1","Hour":"1"},
{"File":"F1","Hour":"2"},
{"File":"F2","Hour":"1"},
{"File":"F3","Hour":"1"}]
I'm just wondering whether there is an option to get the JSON file in the desired format. Any help would be appreciated.
The output that you get from df.to_json is a string, so you can simply slice off the surrounding brackets and replace the separating commas with newlines:
out = df.to_json(orient='records')[1:-1].replace('},{', '}\n{')
To write the output to a text file, you could do:
with open('file_name.txt', 'w') as f:
    f.write(out)
In newer versions of pandas (0.20.0+, I believe), this can be done directly:
df.to_json('temp.json', orient='records', lines=True)
Direct compression is also possible:
df.to_json('temp.json.gz', orient='records', lines=True, compression='gzip')
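As a quick round-trip check, the line-delimited file can be read back with the matching options:
import pandas as pd

df_roundtrip = pd.read_json('temp.json', orient='records', lines=True)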
I think what the OP is looking for is:
with open('temp.json', 'w') as f:
    f.write(df.to_json(orient='records', lines=True))
This should do the trick.
Use this to convert a pandas DataFrame to a list of dictionaries:
import json
json_list = json.loads(json.dumps(list(DataFrame.T.to_dict().values())))
Try this one:
json.dumps(json.loads(df.to_json(orient="records")))
Convert the DataFrame to a list of dictionaries:
list_dict = []
for index, row in list(df.iterrows()):
    list_dict.append(dict(row))
Save to a file:
with open("output.json", "w") as f:
    f.write("\n".join(str(item) for item in list_dict))
To transform a DataFrame into real JSON (not a string), I use:
from io import StringIO
import json

buff = StringIO()
# df is your DataFrame
df.to_json(path_or_buf=buff, orient='records')
dfJson = json.loads(buff.getvalue())
Instead of using dataframe.to_json(orient="records"),
use dataframe.to_json(orient="index").
This converts the dataframe into JSON as a dict of dicts, in the form {index -> {column -> value}}.
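For the sample data above, that orientation would produce something like:
{"0":{"File":"F1","Hour":"1"},"1":{"File":"F1","Hour":"2"},"2":{"File":"F2","Hour":"1"},"3":{"File":"F3","Hour":"1"}}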
Here is a small utility class that converts JSON to a DataFrame and back; hope you find it helpful.
# -*- coding: utf-8 -*-
from pandas import json_normalize  # on older pandas: from pandas.io.json import json_normalize


class DFConverter:

    # Converts the input JSON to a DataFrame
    def convertToDF(self, dfJSON):
        return json_normalize(dfJSON)

    # Converts the input DataFrame to JSON
    def convertToJSON(self, df):
        resultJSON = df.to_json(orient='records')
        return resultJSON
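For example, with some hypothetical records:
converter = DFConverter()

records = [{"File": "F1", "Hour": "1"}, {"File": "F2", "Hour": "1"}]
df = converter.convertToDF(records)   # DataFrame with columns File and Hour
print(converter.convertToJSON(df))    # [{"File":"F1","Hour":"1"},{"File":"F2","Hour":"1"}]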
I have a JSON file, nodes, that looks like this:
[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]
I am able to read and manipulate this record with Python.
I am trying to read this file in scala through the spark-shell.
From this tutorial, I can see that it is possible to read json via sqlContext.read.json
val vfile = sqlContext.read.json("path/to/file/nodes.json")
However, this results in a corrupt_record error:
vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident it is sound, uncorrupted JSON.
Spark expects the "JSON Lines" format rather than a typical JSON file, but we can tell Spark to read typical JSON by specifying:
val df = spark.read.option("multiline", "true").json("<file>")
Spark cannot read a top-level JSON array into records, so you have to pass:
{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}
As it's described in the tutorial you're referring to:
Let's begin by loading a JSON file, where each line is a JSON object
The reasoning is quite simple. Spark expects you to pass a file with many JSON entities (one entity per line), so it can distribute their processing (per entity, roughly speaking).
To shed more light on it, here is a quote from the official docs:
Note that the file that is offered as a json file is not a typical
JSON file. Each line must contain a separate, self-contained valid
JSON object. As a consequence, a regular multi-line JSON file will
most often fail.
This format is called JSONL. Basically it's an alternative to CSV.
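Since you can already read the file with Python, one quick way to rewrite it as JSON Lines is a small script like this (a sketch; the output file name is just a placeholder):
import json

with open("nodes.json") as src, open("nodes.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")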
To read the multi-line JSON as a DataFrame:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)
Reading large files in this manner is not recommended; from the wholeTextFiles docs:
Small files are preferred, large file is also allowable, but may cause bad performance.
I ran into the same problem. I used the SparkContext and Spark SQL with the same configuration:
val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("Simple Application")

val sc = new SparkContext(conf)

val spark = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()
Then, using the Spark context, I read the whole JSON file (JSON is the path to the file):
val jsonRDD = sc.wholeTextFiles(JSON).map(x => x._2)
You can create a schema for future selects, filters, etc.:
val schema = StructType(List(
  StructField("toid", StringType, nullable = true),
  StructField("point", ArrayType(DoubleType), nullable = true),
  StructField("index", DoubleType, nullable = true)
))
Create a DataFrame using Spark SQL:
var df: DataFrame = spark.read.schema(schema).json(jsonRDD).toDF()
For testing use show and printSchema:
df.show()
df.printSchema()
sbt build file:
name := "spark-single"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2"
libraryDependencies +="org.apache.spark" %% "spark-sql" % "2.0.2"