I don't want to build a GeoMesa DataStore; I just want to use the GeoMesa Spark Core/SQL module to do some spatial analysis on Spark. My data sources are GeoJSON files on HDFS. However, I have to create a SpatialRDD through a SpatialRDDProvider.
There is a Converter RDD Provider example in the GeoMesa documentation:
import com.typesafe.config.ConfigFactory
import org.apache.hadoop.conf.Configuration
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark
val exampleConf = ConfigFactory.load("example.conf").root().render()
val params = Map(
  "geomesa.converter" -> exampleConf,
  "geomesa.converter.inputs" -> "example.csv",
  "geomesa.sft" -> "phrase:String,dtg:Date,geom:Point:srid=4326",
  "geomesa.sft.name" -> "example")
val query = new Query("example")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
I can choose GeoMesa's JSON converter to create the SpatialRDD. However, it seems necessary to declare all field names and types in the geomesa.sft parameter and in a converter config file. If I have many GeoJSON files, doing this one by one by hand is obviously very inconvenient.
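For example, for each file I would have to hand-write a converter config roughly along these lines (only a sketch; the field names, JSON paths, and id function here are made up):
geomesa.converters.geojson-example = {
  type     = "json"
  id-field = "uuid()"
  fields = [
    { name = "phrase", json-type = "string",   path = "$.properties.phrase" }
    { name = "geom",   json-type = "geometry", path = "$.geometry",         transform = "point($0)" }
  ]
}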
Is there any way that Geomesa Converter can infer the field names and types from the file?
Yes, GeoMesa can infer both the type and the converter. For Scala/Java, see this unit test. Alternatively, the GeoMesa CLI tools can be used ahead of time to persist the inferred type and converter to reusable files, using the convert command (type inference is described in the linked ingest command).
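The persisted files can then be fed back into the same params map used above. This is only a sketch under two assumptions: that the CLI wrote the inferred definitions to geojson-sft.conf and geojson-converter.conf (hypothetical file names), and that the geomesa.sft parameter accepts a rendered TypeSafe config just as geomesa.converter does:
import com.typesafe.config.ConfigFactory
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark

// hypothetical file names for the SFT and converter definitions persisted by the CLI
val sftConf = ConfigFactory.parseFile(new File("geojson-sft.conf")).root().render()
val converterConf = ConfigFactory.parseFile(new File("geojson-converter.conf")).root().render()

val params = Map(
  "geomesa.converter" -> converterConf,
  "geomesa.converter.inputs" -> "hdfs:///data/example.geojson", // hypothetical input path
  "geomesa.sft" -> sftConf,
  "geomesa.sft.name" -> "example")

val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, new Query("example"))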
Related
I am trying to read data from Kafka using Structured Streaming. The data received from Kafka is in JSON format.
My code is as follows; in it I use the from_json function to convert the JSON to a DataFrame for further processing.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema: StructType = new StructType()
  .add("time", LongType)
  .add("id", LongType)
  .add("properties", new StructType()
    .add("$app_version", StringType)
    .
    .
  )
val df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "...")
  .load()
  .selectExpr("CAST(value AS STRING) as value")
  .select(from_json(col("value"), schema))
My problem is that if new fields are added, I can't stop the Spark program to add them to the schema manually. How can I parse these fields dynamically? I tried schema_of_json(), but it only takes the first row to infer the field types, and it is not suitable for JSON data with multi-level nested structures.
My problem is that if new fields are added, I can't stop the Spark program to add them to the schema manually; how can I parse these fields dynamically?
It is not possible in Spark Structured Streaming (or even Spark SQL) out of the box. There are a couple of solutions though.
Changing Schema in Code and Resuming Streaming Query
You simply have to stop your streaming query, change the code to match the current schema, and resume it. It is possible in Spark Structured Streaming with data sources that support resuming from checkpoint. Kafka data source does support it.
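For this to work, the query needs to be started with a checkpoint location so that, after you redeploy the changed code, it resumes from the same Kafka offsets. A minimal sketch (the sink format and paths are placeholders):
val query = df
  .writeStream
  .format("parquet")                                 // placeholder sink
  .option("path", "/data/out")                       // placeholder output path
  .option("checkpointLocation", "/data/checkpoints") // offsets are recovered from here on restart
  .start()

// stop the query, change the schema in the code, redeploy;
// the restarted query picks up from the checkpointed Kafka offsets
query.stop()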
User-Defined Function (UDF)
You could write a user-defined function (UDF) that would do this dynamic JSON parsing for you. That's also among the easiest options.
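For instance, here is a sketch of such a UDF that flattens arbitrarily nested JSON into a MapType(StringType, StringType) column, so new fields simply show up as new map keys without a schema change (Jackson is used for parsing; the column name follows the snippet above):
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import org.apache.spark.sql.functions.{col, udf}
import scala.collection.JavaConverters._

// Flatten nested JSON objects into "dotted.path" -> value pairs
val flattenJson = udf { json: String =>
  val mapper = new ObjectMapper()
  def flatten(node: JsonNode, prefix: String): Map[String, String] =
    if (node.isObject) {
      node.fields().asScala.flatMap { entry =>
        val key = if (prefix.isEmpty) entry.getKey else s"$prefix.${entry.getKey}"
        flatten(entry.getValue, key)
      }.toMap
    } else {
      Map(prefix -> node.toString) // leaves (and arrays) are kept as their JSON text
    }
  flatten(mapper.readTree(json), "")
}

// val parsed = df.select(flattenJson(col("value")).as("fields"))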
New Data Source (MicroBatchReader)
Another option is to create an extension to the built-in Kafka data source that would do the dynamic JSON parsing (similarly to Kafka deserializers). That requires a bit more development, but is certainly doable.
I have downloaded an xxxx.json.lz4 file from https://censys.io/; however, when I try to read the file using the following line, I get no data out (a count of 0).
metadata_lz4 = spark.read.json("s3n://file.json.lz4")
It returns no results, although decompressing the file manually works fine and the decompressed data can be imported into Spark.
I have also tried
val metadata_lz4_2 = spark.sparkContext.newAPIHadoopFile("s3n://file.json.lz4", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
This also returns no results.
I have multiple of these files, each around 100 GB, so I am really keen not to have to decompress each one manually.
Any ideas?
According to this open issue, the Spark LZ4 decompressor uses a different specification than the standard LZ4 decompressor. Hence, until this issue is resolved in Apache Spark, you won't be able to use Spark's LZ4 codec to decompress standard LZ4-compressed files.
I don't think our Lz4Codec implementation actually uses the FRAME
specification (http://cyan4973.github.io/lz4/lz4_Frame_format.html)
when creating text based files. It seems it was added in as a codec
for use inside block compression formats such as
SequenceFiles/HFiles/etc., but wasn't oriented towards Text files from
the looks of it, or was introduced at a time when there was no FRAME
specification of LZ4.
Therefore, fundamentally, we are not interoperable with the lz4
utility. The difference is very similar to the GPLExtras' LzoCodec vs.
LzopCodec, the former is just the data compressing algorithm, but the
latter is an actual framed format interoperable with lzop CLI utility.
To make ourselves interoperable, we'll need to introduce a new frame
wrapping codec such as LZ4FrameCodec, and users could use that when
they want to decompress or compress text data produced/readable by
lz4/lz4cat CLI utilities.
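Until such a frame-aware codec exists, one workaround is to bypass Spark's codec machinery and decompress the frames yourself. A rough Scala sketch, assuming the lz4-java library (which provides LZ4FrameInputStream) is on the classpath; note that it materializes each file's lines in memory, so it only suits files that fit on an executor:
import net.jpountz.lz4.LZ4FrameInputStream
import scala.io.Source

// Read the .lz4 files as raw binary and undo the LZ4 frame format manually (path is a placeholder)
val jsonLines = sc.binaryFiles("s3n://bucket/path/*.json.lz4")
  .flatMap { case (_, stream) =>
    val in = new LZ4FrameInputStream(stream.open())
    try Source.fromInputStream(in, "UTF-8").getLines().toList
    finally in.close()
  }

import spark.implicits._
val df = spark.read.json(jsonLines.toDS())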
I managed to parse LZ4-compressed files in PySpark this way:
import lz4.frame
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext
list_paths = ['/my/file.json.lz4', '/my/beautiful/file.json.lz4']
rdd = sc.binaryFiles(','.join(list_paths))
df = rdd.map(lambda x: lz4.frame.decompress(x[1])).map(lambda x: str(x)).map(lambda x: (x, )).toDF()
This is usually enough for non-complex objects. But if the compressed JSON you are parsing has nested structures, then it is necessary to do extra cleaning on the parsed text before calling F.from_json():
schema = spark.read.json("/my/uncompressed_file.json").schema
df = df.select(F.regexp_replace(F.regexp_replace(F.regexp_replace(F.regexp_replace(F.regexp_replace("_1", "None", "null"), "False", "false"), "True", "true"), "b'", ""), "'", "").alias("json_notation"))
result_df = df.select(F.from_json("json_notation", schema))
where /my/uncompressed_file.json is the /my/file.json.lz4 that you have previously decompressed (alternatively, if the schema is not too complex, you can provide it manually and it will work just as well).
I found some code which can parse a JSON document, convert it to BSON and then insert it. But this code is implemented using Java classes through Casbah. I could not find a corresponding implementation in Scala.
The MongoDB Scala driver documentation also says: "In general, you should only need the org.mongodb.scala and org.bson namespaces in your code. You should not need to import from the com.mongodb namespace as there are equivalent type aliases and companion objects in the Scala driver. The main exception is for advanced configuration via the builders in MongoClientSettings which is considered to be for advanced users."
If you look at the code below and note the imports, they use com.mongodb classes. I can use this code in Scala and make it work, but I want to know if there is a Scala implementation out there to insert JSON into MongoDB.
import com.mongodb.DBObject
import com.mongodb.casbah.MongoClient
import com.mongodb.casbah.MongoClientURI
import com.mongodb.util.JSON
val jsonString = """{"card_id" : 75893645814809,"cust_id": 1008,"card_info": {"card_type" : "Travel Card","credit_limit": 126839},"card_dates" : [{"date":"1997-09-09" },{"date":"2007-09-07" }]}"""
val dbObject: DBObject = JSON.parse(jsonString).asInstanceOf[DBObject]
val mongo = MongoClient(MongoClientURI("mongodb://127.0.0.1:27017"))
val buffer = new java.util.ArrayList[DBObject]()
buffer.add(dbObject)
mongo.getDB("yourDBName").getCollection("yourCollectionName").insert(buffer)
buffer.clear()
Reference : Scala code to insert JSON string to mongo DB
I found a few links online which suggest using different JSON parser libraries, but none of them seems straightforward, even though the ~5 lines of code above can insert a JSON document from Java/Casbah. I would like to achieve something similarly simple in Scala.
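In case it helps, here is a sketch of how the same insert might look with the official MongoDB Scala driver (org.mongodb.scala); the database and collection names are placeholders, and it assumes the driver's Document can wrap a BsonDocument parsed from the JSON string:
import org.bson.BsonDocument
import org.mongodb.scala._
import scala.concurrent.Await
import scala.concurrent.duration._

val jsonString = """{"card_id" : 75893645814809, "cust_id": 1008}"""

// Parse the JSON into BSON with the driver's own bson library, no com.mongodb.util.JSON needed
val doc: Document = Document(BsonDocument.parse(jsonString))

val client = MongoClient("mongodb://127.0.0.1:27017")
val collection = client.getDatabase("yourDBName").getCollection("yourCollectionName")

// The Scala driver is asynchronous; block here only for the sake of the example
Await.result(collection.insertOne(doc).toFuture(), 10.seconds)
client.close()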
I am using PySpark. I have a list of gzipped JSON files on S3 which I have to access, transform, and then export in Parquet to S3. Each JSON file contains around 100k lines, so parallelizing within a file won't make much sense (but I am open to it); however, there are around 5k files, and those I have to parallelize over. My approach is: pass the JSON file list to the script -> run parallelize on the list -> run map (this is where I am getting blocked). How do I access and transform the JSON, create a DataFrame out of the transformed JSON, and dump it as Parquet onto S3?
To read JSON in a distributed fashion, you will need to parallelize your keys, as you mention. To do this while reading from S3, you'll need to use boto3. Below is a skeleton sketch of how to do so. You'll likely need to modify distributedJsonRead to fit your use case.
import boto3
import json
from pyspark.sql import Row

def distributedJsonRead(s3Key):
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    return Row(**contents)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.map(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
Edit: to address the 1:1 mapping of input files to output files
Later on, it's likely that having a merged parquet data set would be easier to work with. But if this is the way you need to do it, you could try something like this
for k in keyList:
    rawtext = sqlContext.read.json(k)  # or whichever method you need to use to read in the data
    outpath = k[:-4] + 'parquet'
    rawtext.write.parquet(outpath)
I don't think you will be able to parallelize these operations if you want a 1:1 mapping of JSON to Parquet files. Spark's read/write functionality is designed to be called from the driver, and it needs access to sc and sqlContext. This is another reason why having one Parquet directory is likely the way to go.
I need to validate the schema of certain JSON input I receive. I am not clear about how to go about the whole thing, but this is what I have gathered so far:
I need to prepare a schema for all sorts of input using something like http://json-schema.org/implementations.html
Then I need a validator like https://github.com/fge/json-schema-validator
I need to give the JSON input and the schema to the validator and get the result.
However, my question is that I need a jar of the json-schema-validator (https://github.com/fge/json-schema-validator) which I can import and use, and I am not clear on how to use it: I don't know the format it accepts, the classes and methods needed, etc.
How good is the validator's support for Scala?
I would not go through the pain of manually collecting the jars of the json-schema-validator (done that, not fun). It is better to use a build tool for that (like Maven, sbt, Gradle or Ivy).
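For example, with sbt the dependency can be pulled in with a single line (the version here is an assumption; check Maven Central for the current one):
// build.sbt
libraryDependencies += "com.github.fge" % "json-schema-validator" % "2.2.6"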
In case you want to use it in an OSGi environment, you might need to use a different (probably not up-to-date) version.
Usage:
val factory: JsonSchemaFactory = JsonSchemaFactory.getDefault
val validator: JsonValidator = factory.getValidator
val schemaJson: com.fasterxml.jackson.databind.JsonNode = yourJsonSchemaInJackson2Format
val report: ProcessingReport = validator.validate(schemaJson, yourJsonInJackson2Format)
//check your report.
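To get the schema and instance into Jackson2 format, one option (a sketch; the JSON strings here are made up) is the JsonLoader helper that ships with the validator's dependencies:
import com.github.fge.jackson.JsonLoader

// Made-up schema and document, just to show the wiring
val schemaNode = JsonLoader.fromString("""{ "type": "object", "required": ["id"] }""")
val instanceNode = JsonLoader.fromString("""{ "id": 42 }""")

val report = validator.validate(schemaNode, instanceNode)
if (!report.isSuccess) println(report)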
PS.: In case you want to collect the dependencies manually, you can go through the dependencies transitively starting on this page.
There is Orderly, a liftweb-json-based JSON validator implementation for Scala:
import com.nparry.orderly._
import net.liftweb.json.JsonAST._
val orderly = Orderly("integer {0,100};")
val noProblems = orderly.validate(JInt(50))
val notAllowed = orderly.validate(JInt(200))
Use net.liftweb.json.parse(s: String): JValue to obtain a JValue from a String.
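For instance, a small sketch (the schema and document here are made up):
import com.nparry.orderly._
import net.liftweb.json.parse

val personSchema = Orderly("object { string name; integer age; };")
val violations = personSchema.validate(parse("""{ "name": "Ada", "age": 36 }"""))
// an empty violation list means the document conforms to the schema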
I noticed that orderly4jvm does not support the latest JSON Schema version 4, which causes problems if you want to use it to generate the JSON schema.