Parsing a JSON column in a TSV file to a Spark RDD

I'm trying to port an existing Python (PySpark) script to Scala in an effort to improve performance.
I'm stuck on something embarrassingly basic, though: how do I parse a JSON column in Scala?
Here is the Python version:
# Each row in file is tab separated, example:
# 2015-10-10 149775392 {"url": "http://example.com", "id": 149775392, "segments": {"completed_segments": [321, 4322, 126]}}
import json

action_files = sc.textFile("s3://my-s3-bucket/2015/10/10/")
actions = (action_files
    .map(lambda row: json.loads(row.split('\t')[-1]))
    .filter(lambda a: a.get('url') != None and a.get('segments') != None and a.get('segments').get('completed_segments') != None)
    .map(lambda a: (a['url'], {"url": a['url'], "action_id": a["id"], "completed_segments": a["segments"]["completed_segments"]}))
    .partitionBy(100)
    .persist())
Basically, I'm just trying to parse the JSON column and then transform it into a simplified version that I can process further in Spark SQL.
As a new Scala user, I'm finding that there are dozens of JSON parsing libraries for this simple task. It doesn't look like there is one in the stdlib. From what I've read so far, it looks like the language's strong typing is what makes this simple task a bit of a chore.
I'd appreciate any push in the right direction!
PS. By the way, if I'm missing something obvious that is making the PySpark version crawl, I'd love to hear about it! I'm porting a Pig script from Hadoop/MR, and the runtime went from 17 minutes with MR to over five and a half hours on Spark! I'm guessing it is serialization overhead to and from Python...

If your goal is to pass the data to Spark SQL anyway and you're sure that you don't have malformed fields (I don't see any exception handling in your code), I wouldn't bother with parsing manually at all:
// assumes: import sqlContext.implicits._ (for the $"..." column syntax)
val raw = sqlContext.read.json(action_files.flatMap(_.split("\t").takeRight(1)))
val df = raw
    .withColumn("completed_segments", $"segments.completed_segments")
    .where($"url".isNotNull && $"completed_segments".isNotNull)
    .select($"url", $"id".alias("action_id"), $"completed_segments")
Regarding your Python code:
don't use != to compare to None. The correct way is to use is / is not. It is semantically correct (None is a singleton) and significantly faster. See also PEP 8.
don't duplicate data unless you have to. Emitting url twice means higher memory usage and more network traffic later on.
if you plan to use Spark SQL, the check for missing values can be performed on a DataFrame, the same as in Scala. I would also persist the DataFrame, not the RDD.
On a side note, I am rather skeptical about serialization being the real problem here. There is some overhead, but its real impact shouldn't be anywhere near what you've described.
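If you do end up needing to parse the JSON column by hand (for example to stay at the RDD level), here is a minimal sketch using json4s, which, as far as I remember, Spark already pulls in as a transitive dependency. It mirrors the extraction in your Python version and silently drops rows that don't parse or don't have the expected shape:
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import scala.util.Try

val actions = action_files.flatMap { row =>
    val rawJson = row.split("\t").last
    // Drop the row if the JSON doesn't parse or any expected field is missing
    Try(parse(rawJson)).toOption.flatMap { js =>
        (js \ "url", js \ "id", js \ "segments" \ "completed_segments") match {
            case (JString(url), JInt(id), JArray(segments)) =>
                Some((url, (url, id.toLong, segments.collect { case JInt(s) => s.toInt })))
            case _ => None
        }
    }
}
From there you can partitionBy and persist exactly as in the Python pipeline.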

Related

Are My Data Better Suited To A CSV Import, Rather Than a JSON Import?

I am trying to force myself to use MongoDB, using the excuse of the "convenience" of it being able to accept JSON data. Of course, it's not as simple as that (it never is!).
At the moment, for this use case, I think I should revert to a traditional CSV import, and possibly a traditional RDBMS (e.g. MariaDB or MySQL). Am I wrong?
I found a possible solution in CSV DATA import to nestable json data, which seems to be a lot of faffing around.
The problem:
I am pulling some data from an online database, which returns data in blocks like this (actually it's all on one line, but I have broken it up to improve readability):
[
[8,1469734163000,50.84516753,0.00021818,2],
[6,1469734163000,50.80342373,0.00021818,2],
[4,1469734163000,50.33066367,0.00021818,2],
[12,1469734164000,40.31650031,0.00021918,2],
[10,1469734164000,11.36652478,0.00021818,2],
[14,1469734165000,52.03905845,0.00021918,2],
[16,1469734168000,57.32,0.00021918,2]
]
According to the command python -mjson.tool this is valid JSON.
But this command barfs
mongoimport --jsonArray --db=bitfinexLendingHistory --collection=fUSD --file=test.json
with
2019-12-31T12:23:42.934+0100 connected to: localhost
2019-12-31T12:23:42.935+0100 Failed: error unmarshaling bytes on document #3: JSON decoder out of sync - data changing underfoot?
2019-12-31T12:23:42.935+0100 imported 0 documents
The named DB and collection already exist.
$ mongo
> use bitfinexLendingHistory
switched to db bitfinexLendingHistory
> db.getCollectionNames()
[ "fUSD" ]
>
I realise that, at this stage, I have no <whatever the MongoDB equivalent of a column header is called in this case> defined, but I suspect the problem above is independent of that.
By wrapping my data above as shown below, I managed to get the data imported.
{
"arf":
[
[8,1469734163000,50.84516753,0.00021818,2],
[6,1469734163000,50.80342373,0.00021818,2],
[4,1469734163000,50.33066367,0.00021818,2],
[12,1469734164000,40.31650031,0.00021918,2],
[10,1469734164000,11.36652478,0.00021818,2],
[14,1469734165000,52.03905845,0.00021918,2],
[16,1469734168000,57.32,0.00021918,2]
]
}
Next step is to determine if that is what I want, and if so, work out how to query it.

How to capture incorrect (corrupt) JSON records in (Py)Spark Structured Streaming?

I have an Azure Event Hub, which is streaming data (in JSON format).
I read it as a Spark dataframe and parse the incoming "body" with from_json(col("body"), schema), where schema is pre-defined. In code, it looks like:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import *
schema = StructType().add(...) # define the incoming JSON schema
df_stream_input = (spark
    .readStream
    .format("eventhubs")
    .options(**ehConfInput)
    .load()
    .select(from_json(col("body").cast("string"), schema))
)
And now, if there is some inconsistency between the incoming JSON's schema and the defined schema (e.g. the source Event Hub starts sending data in a new format without notice), the from_json() function will not throw an error; instead, it will put NULL into the fields that are present in my schema definition but not in the JSON the Event Hub sends.
I want to capture this information and log it somewhere (Spark's log4j, Azure Monitor, warning email, ...).
My question is: what is the best way to achieve this?
Some of my thoughts:
The first thing I can think of is a UDF which checks for the NULLs and raises an exception if there is any problem. However, I believe it is not possible to send logs to log4j from PySpark there, as the "spark" context cannot be accessed within the UDF (on the workers), and one would want to use the default:
log4jLogger = sc._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger('PySpark Logger')
The second thing I can think of is to use the foreach/foreachBatch functions and put this check logic there.
But both of these approaches feel too custom; I was hoping that Spark had something built-in for these purposes.
tl;dr You have to do this check logic yourself using the foreach or foreachBatch operators.
It turns out I was mistaken in thinking that the columnNameOfCorruptRecord option could be an answer. It will not work.
Firstly, it won't work due to this:
case _: BadRecordException => null
And secondly, due to this, which simply disables any other parsing mode (incl. PERMISSIVE, which seems to be used alongside the columnNameOfCorruptRecord option):
new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
In other words, your only option is to use the 2nd item in your list, i.e. foreach or foreachBatch, and handle corrupted records yourself.
A solution could use from_json while keeping the initial body column. Any record with incorrect JSON would end up with the result column being null, and foreach* would catch it, e.g.
def handleCorruptRecords(row):
    # if row.json is None, the body could not be parsed against the schema
    # handle it here (log it, raise an alert, ...)
    pass

df_stream_input = (spark
    .readStream
    .format("eventhubs")
    .options(**ehConfInput)
    .load()
    .select("body", from_json(col("body").cast("string"), schema).alias("json"))
)

query = df_stream_input.writeStream.foreach(handleCorruptRecords).start()

How to process values in CSV format in streaming queries over Kafka source?

I'm new to Structured Streaming, and I'd like to know whether there is a way to specify the Kafka value's schema like we do in normal structured streaming jobs. The format of the Kafka value is a 50+ field syslog-like CSV, and splitting it manually is painfully slow.
Here's the relevant part of my code (see the full gist here):
spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "myserver:9092")
    .option("subscribe", "mytopic")
    .load()
    .select(split('value, """\^""") as "raw")
    .select(ColumnExplode('raw, schema.size): _*) // flatten WrappedArray
    .toDF(schema.fieldNames: _*)                  // apply column names
    .select(fieldsWithTypeFix: _*)                // cast column types from string
    .select(schema.fieldNames.map(col): _*)       // re-order columns, as defined in schema
    .writeStream.format("console").start()
With no further operations, I can only achieve roughly 10 MB/s throughput on a 24-core, 128 GB server. Would it help if I converted the syslog to JSON first? In that case I could use from_json with a schema, and maybe it would be faster.
is there a way to specify Kafka value's schema like what we do in normal structured streaming jobs.
No. The so-called output schema of the kafka external data source is fixed and cannot ever be changed. See this line.
Would it help if I convert the syslog to JSON in prior? In that case I can use from_json with schema, and maybe it will be faster.
I don't think so. I'd even say that CSV is a simpler text format than JSON (as there's simply a single separator usually).
Using the split standard function is the way to go, and I think you can hardly get better performance, since all it has to do is split a row and take every element to build the final output.
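For illustration, here is a bare-bones sketch of that split-then-cast pattern. The column names, positions, and types below (ts, host, level) are just placeholders for the example, not your real 50+ fields:
import org.apache.spark.sql.functions._
import spark.implicits._

spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "myserver:9092")
    .option("subscribe", "mytopic")
    .load()
    .select(split($"value".cast("string"), """\^""") as "raw")
    .select(
        $"raw".getItem(0).cast("timestamp") as "ts", // hypothetical column
        $"raw".getItem(1) as "host",                 // hypothetical column
        $"raw".getItem(2).cast("int") as "level")    // hypothetical column
    .writeStream.format("console").start()
A value that fails a cast simply becomes null, which is also a cheap way to spot malformed rows.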

What is the most elegant way to stream the results of an SQL query out as JSON?

Using the Play framework with Anorm, I would like to write a Controller method that simply returns the results of an SQL query as JSON. I don't want to bother converting to objects. Also, ideally this code should stream out the JSON as the SQL ResultSet is processed rather than processing the entire SQL result before returning anything to the client.
select colA, colB from mytable
JSON Response
[{"colA": "ValueA", "colB": 33}, {"colA": "ValueA2", "colB": 34}, ...]
I would like to express this in Scala code as elegantly and concisely as possible, but the examples I'm finding seem to have a lot of boilerplate (redundant column name definitions). I'm surprised there isn't some kind of SqlResult to JsValue conversion in Play or Anorm already.
I realize you may need to define Writes[] or an Enumeratee implementation to achieve this, but once the conversion code is defined, I'd like the code for each method to be nearly as simple as this:
val columns = List("colA", "colB")
db.withConnection { implicit c =>
Ok(Json.toJson(SQL"select #$columns from mytable"))
}
I'm not clear on the best way to define column names just once and pass it to the SQL query as well as JSON conversion code. Maybe if I create some kind of implicit ColumnNames type, the JSON conversion code could access it in the previous example?
Or maybe define my own kind of SqlToJsonAction to achieve even simpler code and have common control over JSON responses?
def getJson = SqlToJsonAction(List("colA", "colB")) { columns =>
SQL"select #$columns from mytable"
}
The only related Stack Overflow question I found was Convert from MySQL query result or LIST to JSON, which didn't have helpful answers.
I'm a Java developer just learning Scala, so I still have a lot to learn about the language, but I've read through the Anorm, Iteratee/Enumeratee, and Writes docs and numerous blog posts on Anorm, and am having trouble figuring out how to set up the necessary helper code so that I can compose my JSON methods this way.
Also, I'm unclear on which approaches allow streaming out the response, and which will iterate the entire SQL ResultSet before responding with anything to the client. According to Anorm Streaming Results, only methods such as fold/foldWhile/withResult and Iteratees stream. Are these the techniques I should use?
Bonus:
In some cases, I'll probably want to map a SQL column name to a different JSON field name. Is there a slick way to do this as well?
Something like this (no idea if this Scala syntax is possible):
def getJson = SqlToJsonAction("colA" -> "jsonColA", "colB", "colC" -> "jsonColC")) { columns =>
SQL"select #$columns from mytable"
}
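A minimal, non-streaming sketch of the kind of conversion involved is below. It assumes Play JSON, an Anorm row parser, and that this lives inside a Play controller with the same injected db as above; it buffers the whole result and still names each column in the parser, so it doesn't yet meet the streaming or boilerplate goals:
import anorm._
import anorm.SqlParser._
import play.api.libs.json._

def getJson = Action {
    db.withConnection { implicit c =>
        // One parser entry per column; Json.obj does the JsValue conversion.
        val rowToJson = (str("colA") ~ int("colB")).map {
            case a ~ b => Json.obj("colA" -> a, "colB" -> b)
        }
        val rows = SQL"select colA, colB from mytable".as(rowToJson.*)
        Ok(Json.toJson(rows))
    }
}
For the streaming goal you would still need one of the fold/withResult-based approaches from Anorm Streaming Results.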

Read a Text File into R

I apologize if this has been asked previously, but I haven't been able to find an example online or elsewhere.
I have a very dirty data file in a text file (it may be JSON). I want to analyze the data in R, and since I am still new to the language, I want to read in the raw data and manipulate it as needed from there.
How would I go about reading JSON from a text file on my machine? Additionally, if it isn't JSON, how can I read in the raw data as is (not parsed into columns, etc.) so I can go ahead and figure out how to parse it as needed?
Thanks in advance!
Use the rjson package. In particular, look at the fromJSON function in the documentation.
If you want further pointers, then search for rjson at the R Bloggers website.
If you want to use the packages related to JSON in R, there are a number of other posts on SO answering this. I presume you searched on JSON [r] already on this site, plenty of info there.
If you just want to read in the text file line by line and process later on, then you can use either scan() or readLines(). They appear to do the same thing, but there's an important difference between them.
scan() lets you define what kind of objects you want to find, how many, and so on. Read the help file for more info. You can use scan to read in every word/number/sign as an element of a vector, using e.g. scan(filename, ""). You can also use specific delimiters to separate the data. See also the examples in the help files.
To read line by line, you use readLines(filename) or scan(filename, "", sep="\n"). Either gives you a vector with the lines of the file as elements. This again allows you to do custom processing of the text. Then again, if you really have to do this often, you might want to consider doing it in Perl.
If your file is in JSON format, you may try the packages jsonlite, RJSONIO, or rjson. All three packages provide the function fromJSON.
To install a package, use the install.packages function. For example:
install.packages("jsonlite")
Once the package is installed, you can load it using the library function.
library(jsonlite)
Generally, line-delimited JSON has one object per line, so you need to read the file line by line and collect the objects. For example:
con <- file('myBigJsonFile.json')
open(con)
objects <- list()
index <- 1
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    objects[[index]] <- fromJSON(line)
    index <- index + 1
}
close(con)
After that, you have all the data in the objects variable. With that variable you may extract the information you want.