Unable to read a local JSON file using spark-submit

I am reading a JSON file in Scala Spark using:
val df = spark.read.json(properties.getProperty("jsonFilePath"))
This code works when I run it from my IDE (IntelliJ), but when I execute it with spark-submit, all I get is the message below:
INFO FileScanRDD: Reading File path: file:///Users/path/to/json/file.json, range: 0-8822, partition values: [empty row]
I am not able to process the JSON data due to this. Any idea what could be happening here?
Here is my spark-submit command:
${SPARK_HOME_DIR}/bin/spark-submit --master local ${SCALA_JAR_LOC}/<JARNAME>.jar
I also tried passing the file to spark-submit with the --files option, without success. Any help is appreciated.

spark.read.json reads data from a filesystem supported by Hadoop. If the JSON file is packaged inside the jar that contains your code, you can use the getClass.getResourceAsStream Java API to read it from the job's classpath.
To read the JSON file from your classpath:
def read(file: String): String = {
  // Load the file from the classpath (e.g. from src/main/resources)
  val stream = getClass.getResourceAsStream(s"/$file")
  // Collapse the file into a single whitespace-normalized JSON string
  scala.io.Source.fromInputStream(stream)
    .getLines
    .toList
    .mkString(" ")
    .trim
    .replaceAll("\\s+", " ")
}
Since you want that JSON as a DataFrame in your code, you might have to convert the String to an RDD and then to a single-record DataFrame. Note that read already prepends the leading slash, so pass the resource path without one:
val inputAsRDD = sparkCtxt.parallelize(List(read("path/within/resources/folder")))
val df = sparkSession.read.json(inputAsRDD)

Related

Read from kafka then writeStream to json file, but only found one message in HDFS json file

I just set up a single-node Hadoop/Kafka/Spark demo environment. In PySpark, I read Kafka messages with .readStream and write them with .writeStream to a JSON file in Hadoop. The weird thing is that under the Hadoop "output/test" directory I can find a created JSON file, but it contains only one message. New messages from Kafka never update that file, yet I want all messages from Kafka to be stored in one JSON file.
I have tried the console sink (writeStream.format("console")) and the Kafka sink (writeStream.format("kafka")), and both worked normally.
Any suggestions or comments? Sample code is below.
schema = StructType([StructField("stock_name", StringType(), True),
                     StructField("stock_value", DoubleType(), True),
                     StructField("timestamp", LongType(), True)])

line = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "127.0.1.1:9092") \
    .option("subscribe", "fakestock") \
    .option("startingOffsets", "earliest") \
    .load() \
    .selectExpr("CAST(value AS STRING)")

df = line.select(functions.from_json(functions.col("value").cast("string"), schema).alias("parse_value")) \
    .select("parse_value.stock_name", "parse_value.stock_value", "parse_value.timestamp")

query = df.writeStream \
    .format("json") \
    .option("checkpointLocation", "output/checkpoint") \
    .option("path", "output/test") \
    .start()
It's not possible to store all records in one file. Spark periodically polls batches of data as a Kafka consumer, then writes those batches as unique files.
Without knowing how many records are in the topic to begin with, it's hard to say how many records should be in the output path, but your code looks okay. Parquet is a more recommended output format than JSON, however (a short sketch follows below).
It is also worth mentioning that Kafka Connect has an HDFS plugin that only requires writing a config file, with no Spark parsing code.
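For reference, a minimal sketch of the Parquet suggestion, reusing the same df, checkpoint location, and output path from the question; only the sink format changes:

# Same streaming DataFrame as above; only the sink format changes.
query = df.writeStream \
    .format("parquet") \
    .option("checkpointLocation", "output/checkpoint") \
    .option("path", "output/test") \
    .start()

# Keep the driver alive so the stream keeps writing micro-batches.
query.awaitTermination()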

Any library that can help me create a JSON file with dummy records

I am looking for a library (in Java) that can help me generate a dummy JSON file to test my code. For example, the JSON file could contain random user profile data: name, address, zipcode.
I searched Stack Overflow and found the following link: How to generate JSON string in Java?
The suggested library, https://github.com/DiUS/java-faker, seems useful, but because of security constraints I cannot use it. Are there any other recommendations?
You could use, for instance, Python's Faker, like this:
#!/usr/bin/env python3
from json import dumps
from faker import Faker

fake = Faker()

def user():
    return dict(
        name=fake.name(),
        address=fake.address(),
        bio=fake.text()
    )

print('[')
try:
    while True:
        print(dumps(user()))
        print(',')
except KeyboardInterrupt:
    # XXX: json array can not end with a comma
    print(dumps(user()))
    print(']')
You can run it like this:
python3 fake_user.py > users.json
Press Ctrl+C to stop it when the file is big enough.
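If you would rather produce a fixed-size, well-formed file than stop the script by hand, a variant can build the list in memory and dump it once, which avoids the trailing-comma handling entirely. This is only a sketch, still assuming the Python Faker package; the record count and the zipcode field are arbitrary choices:

#!/usr/bin/env python3
from json import dumps
from faker import Faker

fake = Faker()

# Build a fixed number of fake user profiles, then emit one valid JSON array.
users = [
    dict(name=fake.name(), address=fake.address(), zipcode=fake.zipcode())
    for _ in range(1000)
]

with open("users.json", "w") as f:
    f.write(dumps(users, indent=2))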

Spark: Streaming json to parquet

How to convert json to parquet in streaming with Spark?
Actually I have to ssh from a server, receive a big JSON file, convert it to Parquet, and upload it to Hadoop.
Is there a way to do this in a pipelined way?
They are backup files, so I have a directory with a predefined number of files whose size does not change over time.
Something like:
scp host /dev/stdout | spark-submit myprogram.py | hadoop /dir/
edit:
Actually I'm working on this:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="Test")
sqlContext = SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

# Since I couldn't get the stdio, went for a named pipe:
with open("mypipe", "r") as o:
    while True:
        line = o.readline()
        print "Processing: " + line
        lineRDD = sc.parallelize([line])
        df = sqlContext.jsonRDD(lineRDD)
        # Create and append
        df.write.parquet("file:///home/user/spark/test", mode="append")
        print "Done."
This is working fine, but the resulting Parquet is very large (280 kB for a 4-line, 2-column JSON). Any improvements?
If anyone is interested, I managed to resolve this using the .pipe() method.
https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe
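For reference, a minimal sketch of what an RDD.pipe-based flow can look like; the input records, the "cat" command, and the output path are placeholders, not the poster's actual setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PipeSketch").getOrCreate()
sc = spark.sparkContext

# Placeholder input: pretend these JSON lines came from the remote source.
records = sc.parallelize(['{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}'])

# pipe() feeds each element to the external command's stdin, one per line,
# and returns the command's stdout lines as a new RDD of strings.
# "cat" is just a pass-through stand-in for the real fetch/convert command.
json_lines = records.pipe("cat")

# Parse the JSON strings and append them as Parquet.
df = spark.read.json(json_lines)
df.write.mode("append").parquet("file:///home/user/spark/test")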

How to avoid generating crc files and SUCCESS files while saving a DataFrame?

I am using the following code to save a Spark DataFrame to a JSON file:
unzipJSON.write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
The output is:
part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
_SUCCESS
._SUCCESS.crc
How do I generate a single JSON file and not a file per line?
How can I avoid the *crc files?
How can I avoid the SUCCESS file?
If you want a single file, you need to coalesce to a single partition before calling write:
unzipJSON.coalesce(1).write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
Personally, I find it rather annoying that the number of output files depends on the number of partitions you have before calling write - especially if you do a write with partitionBy - but as far as I know there is currently no other way.
I don't know of a way to disable the .crc files, but you can disable the _SUCCESS file by setting the following on the Hadoop configuration of the Spark context:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Note that you may also want to disable generation of the metadata files with:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
Apparently, generating the metadata files takes some time (see this blog post), but they aren't actually that important (according to this). Personally, I always disable them and have had no issues.
A small update to the above answer: to disable the crc and _SUCCESS files, simply set the property on the Spark session as follows (example):
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
To ignore .crc files on .write:
val hadoopConf = spark.sparkContext.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
fs.setWriteChecksum(false)
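A PySpark variant of the same idea is sketched below; it goes through the JVM gateway, and the _jsc/_jvm handles are internal PySpark attributes rather than a public API, so treat this as an assumption about your PySpark version:

# Disable _SUCCESS markers and .crc checksum files from PySpark.
# _jsc and _jvm are internal PySpark handles, not public API.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
fs.setWriteChecksum(False)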

Ways to parse JSON using KornShell

I have working code that parses JSON output in KornShell by treating it as a string of characters. The issue is that the vendor keeps changing the position of the field I am interested in. I understand that JSON can be parsed by key-value pairs.
Is there something out there that can do this? I am interested in a specific field and would like to use it to check the status of another REST API call.
My sample json output is like this:
JSONDATA value :
{
  "status": "success",
  "job-execution-id": 396805,
  "job-execution-user": "flexapp",
  "job-execution-trigger": "RESTAPI"
}
I would need the job-execution-id value to monitor this job through the rest of the script.
I am using the following command to parse it:
RUNJOB=$(print ${DATA} |cut -f3 -d':'|cut -f1 -d','| tr -d [:blank:]) >> ${LOGDIR}/${LOGFILE}
The problem with this is that it relies on fields being delimited by :, and the field position has been known to change between vendor releases.
So I am trying to see if there is a utility out there that would always give me the key-value pair "job-execution-id": 396805, no matter where it appears in the JSON output.
I started looking at jsawk, but it requires the js interpreter to be installed on our machines, which I don't want. Any hint on which RPM I would need to solve this?
I am using RHEL5.5.
Any help is greatly appreciated.
The ast-open project has libdss (and a dss wrapper) which supposedly could be used with ksh. Documentation is sparse and is limited to a few messages on the ast-user mailing list.
The regression tests for libdss contain some json and xml examples.
I'll try to find more info.
Python is included by default with CentOS, so one thing you could do is pass your JSON string to a Python script and use Python's JSON parser. You can then grab the value written out by the script. An example you could modify to meet your needs is below.
Note that by specifying other dictionary keys in the Python script you can get any of the values you need without having to worry about the order changing.
Python script:
# get_job_execution_id.py
# The try/except is because you'll probably have Python 2.4 on CentOS 5.5,
# and the straight "import json" statement won't work unless you have Python 2.6+.
try:
    import json
except:
    import simplejson as json

import sys

json_data = sys.argv[1]
data = json.loads(json_data)
job_execution_id = data['job-execution-id']
sys.stdout.write(str(job_execution_id))
KornShell script that executes it:
#!/bin/ksh
# get_job_execution_id.sh
JSON_DATA='{"status":"success","job-execution-id":396805,"job-execution-user":"flexapp","job-execution-trigger":"RESTAPI"}'
EXECUTION_ID=`python get_job_execution_id.py "$JSON_DATA"`
echo $EXECUTION_ID
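A small variant of the helper (my own sketch, not part of the original answer) reads the JSON from stdin instead of a command-line argument, so the ksh side can simply pipe ${DATA} into it and avoid quoting issues:

# get_job_execution_id_stdin.py  (hypothetical stdin-based variant)
try:
    import json
except ImportError:
    import simplejson as json

import sys

# Read the whole JSON document from stdin and print only the id.
data = json.loads(sys.stdin.read())
sys.stdout.write(str(data['job-execution-id']))

It can then be called from the existing script as: print ${DATA} | python get_job_execution_id_stdin.py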