Decoding a String with escaped special characters in Scala issue - json

I have a multi-line JSON file with records that contain special characters encoded as hexadecimals. Here is an example of a single JSON record:
{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}
This record is supposed to be {"value":"ıarines Bintıç Ramuçlar"}, i.e. the '"' character is replaced with the corresponding hexadecimal escape \x22, and other special Unicode characters are replaced with one or two hexadecimal escapes (for instance \xC3\xA7 encodes ç, etc.)
I need to convert similar Strings into regular Unicode Strings in Scala, so that when printed they produce {"value":"ıarines Bintıç Ramuçlar"} without hexadecimals.
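For instance, the two bytes 0xC3 0xA7 decoded together as UTF-8 give ç (a quick REPL check of the idea, not a solution):
scala> new String(Array(0xC3.toByte, 0xA7.toByte), "UTF-8")
res0: String = ç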
In Python I can easily decode these records with a line of code:
>>> a = "{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"
>>> a.decode("utf-8")
u'{"value":"\u0131arines Bint\u0131\xe7 Ramu\xe7lar"}'
>>> print a.decode("utf-8")
{"value":"ıarines Bintıç Ramuçlar"}
But in Scala I can't find a way to decode it. I unsuccessfully tried to convert it like this:
scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(new String(a.getBytes(), "UTF-8"))
{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}
I also tried URLDecoder, as I found in a solution for a similar problem (but with URLs):
scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(java.net.URLDecoder.decode(a.replace("\\x", "%"), "UTF-8"))
{"value":"ıarines Bintıç Ramuçlar"}
It produced the desired result for this example, but it seems unsafe for generic text fields, since it is designed to work with URLs and requires replacing all \x with % in the string.
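For instance, a field value that happens to contain a literal percent sign (a hypothetical example, not from my data) breaks this approach:
// a hypothetical field value containing a literal '%'
java.net.URLDecoder.decode("100% sure".replace("\\x", "%"), "UTF-8")
// throws java.lang.IllegalArgumentException because '%' is treated as the start of an escape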
Does Scala have some better way to deal with this issue?
I am new to Scala and will be thankful for any help
UPDATE:
I have made a custom solution with javax.xml.bind.DatatypeConverter.parseHexBinary. It works for now, but it seems cumbersome and not at all elegant. I think there should be a simpler way to do this.
Here is the code:
import javax.xml.bind.DatatypeConverter
import scala.annotation.tailrec
import scala.util.matching.Regex

def decodeHexChars(string: String): String = {
  val regexHex: Regex = """\A\\[xX]([0-9a-fA-F]{1,2})(.*)""".r

  def purgeBuffer(buffer: String, acc: List[Char]): List[Char] = {
    if (buffer.isEmpty) acc
    else new String(DatatypeConverter.parseHexBinary(buffer)).reverse.toList ::: acc
  }

  @tailrec
  def traverse(s: String, acc: List[Char], buffer: String): String = s match {
    case "" =>
      val accUpdated = purgeBuffer(buffer, acc)
      accUpdated.foldRight("")((str, b) => b + str)
    case regexHex(chars, suffix) =>
      traverse(suffix, acc, buffer + chars)
    case _ =>
      val accUpdated = purgeBuffer(buffer, acc)
      traverse(s.tail, s.head :: accUpdated, "")
  }

  traverse(string, Nil, "")
}

Each \x?? encodes one byte: \x22 encodes " and \x5C encodes \. But in UTF-8 some characters are encoded using multiple bytes, so you need to transform \xC4\xB1 into the ı symbol, and so on.
replaceAllIn is really nice, but it might eat your backslashes. So, if you don't use group references (like $1) in the replacement string, quoteReplacement is the recommended way to escape the \ and $ symbols.
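For example (a minimal illustration of the escaping issue, not part of the solution below):
scala> "Q".r.replaceAllIn("a Q b", """\x22""")
res0: String = a x22 b

scala> "Q".r.replaceAllIn("a Q b", util.matching.Regex.quoteReplacement("""\x22"""))
res1: String = a \x22 b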
/** "22" -> 34, "AA" -> -86 */
def hex2byte(hex: String) = Integer.parseInt(hex, 16).toByte
/** decode strings like \x22 or \xC4\xB1\xC3\xA7 to specified encoding */
def decodeHexadecimals(str: String, encoding: String="UTF-8") =
new String(str.split("""\\x""").tail.map(hex2byte), encoding)
/** fix weird strings */
def replaceHexadecimals(str: String, encoding: String="UTF-8") =
"""(\\x[\dA-F]{2})+""".r.replaceAllIn(str, m =>
util.matching.Regex.quoteReplacement(
decodeHexadecimals(m.group(0), encoding)))
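A quick check with the sample string from the question should then give the decoded record:
scala> replaceHexadecimals("""{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}""")
res0: String = {"value":"ıarines Bintıç Ramuçlar"}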
P.S. Does anyone know the difference between java.util.regex.Matcher.quoteReplacement and scala.util.matching.Regex.quoteReplacement?

The problem is that this encoding is really specific to Python (I think). Something like this might work:
import scala.util.matching.Regex

val s = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
// match whole runs of \xNN escapes so that multi-byte UTF-8 sequences are decoded together
"""(\\x[A-F0-9]{2})+""".r.replaceAllIn(s, (x: Regex.Match) =>
  Regex.quoteReplacement(
    new String(x.group(0).split("""\\x""").tail.map(Integer.parseInt(_, 16).toByte), "UTF-8"))
)

Related

Scala - How to get the Max value of a Json field?

I'm trying to get the maximum value of a MetricId field from a JSON String. However I'm getting a java.lang.UnsupportedOperationException: empty.max for the below String:
[{"MetricName":"name1","DateParsed":"2019-11-20 05:39:00","MetricId":"7855","isValid":"true"},
{"MetricName":"name2","DateParsed":"2019-05-22 17:45:00","MetricId":"1295","isValid":"false"}]
Here is how I've implemented a method for finding the Max value:
val metricIdRegex = """"MetricId"\s*:\s*(\d+)""".r

def maxMetricId(jsonString: String): String = {
  metricIdRegex.findAllIn(jsonString).map({
    case metricIdRegex(id) => id.toInt
  }).max.toString
}
val maxId: String = maxMetricId(metricsString)
I'm expecting to get "7855" as the max MetricId.
What could be wrong with the method? I suspect that it could be a problem with the regex.
You could also use json4s, which is quite popular and used by many other Scala libraries:
import org.json4s._
import org.json4s.jackson.JsonMethods._

val data = """[{"MetricName":"name1","DateParsed":"2019-11-20 05:39:00","MetricId":"7855","isValid":"true"},
{"MetricName":"name2","DateParsed":"2019-05-22 17:45:00","MetricId":"1295","isValid":"false"}]"""

// parse data into a JValue
val parsed = parse(data)
// go through the parsed variable and extract MetricId into a string list, then cast every item to int
val maxMetricId = (parsed \ "MetricId" \\ classOf[JString]).map(_.toInt).max
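Plugged into the shape of the original maxMetricId method, that might look like this (a sketch using the same imports as above):
def maxMetricId(jsonString: String): String =
  (parse(jsonString) \ "MetricId" \\ classOf[JString]).map(_.toInt).max.toString

maxMetricId(data) // should give "7855"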
Let me show an example of how this can be done efficiently with a JSON parser, without holding the whole JSON input and parsed data in memory.
Add dependencies to your build.sbt:
libraryDependencies ++= Seq(
  "com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-core" % "2.0.2" % Compile,
  "com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-macros" % "2.0.2" % Provided // required only in compile-time
)
Add the imports, define a data structure for the repeating part of your JSON array that should be parsed out, and derive a codec for it. Then open an input stream and scan it with the provided handler function, which reduces all parsed metrics to the maximum value:
import com.github.plokhotnyuk.jsoniter_scala.macros._
import com.github.plokhotnyuk.jsoniter_scala.core._
import java.io.ByteArrayInputStream
import java.io.InputStream

case class Metric(@stringified MetricId: Int)

implicit val codec: JsonValueCodec[Metric] = JsonCodecMaker.make(CodecMakerConfig)

val in: InputStream = new ByteArrayInputStream( // <- replace it by FileInputStream
  """[{"MetricName":"name1","DateParsed":"2019-11-20 05:39:00","MetricId":"7855","isValid":"true"},
{"MetricName":"name2","DateParsed":"2019-05-22 17:45:00","MetricId":"1295","isValid":"false"}]""".getBytes("UTF-8"))

try {
  var max = -1
  scanJsonArrayFromStream[Metric](in) { m: Metric =>
    max = Math.max(max, m.MetricId)
    true
  }
  println(max)
} finally in.close()
And this code should print 7855.

AWS Athena export array of structs to JSON

I've got an Athena table where some fields have a fairly complex nested format. The backing records in S3 are JSON. Along these lines (but we have several more levels of nesting):
CREATE EXTERNAL TABLE IF NOT EXISTS test (
  timestamp double,
  stats array<struct<time:double, mean:double, var:double>>,
  dets array<struct<coords: array<double>, header:struct<frame:int, seq:int, name:string>>>,
  pos struct<x:double, y:double, theta:double>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json'='true')
LOCATION 's3://test-bucket/test-folder/'
Now we need to be able to query the data and import the results into Python for analysis. Because of security restrictions I can't connect directly to Athena; I need to be able to give someone the query and then they will give me the CSV results.
If we just do a straight select * we get back the struct/array columns in a format that isn't quite JSON.
Here's a sample input file entry:
{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}
And example output:
select * from test
"timestamp","stats","dets","pos"
"1.520640777666096E9","[{time=15.0, mean=45.23, var=0.31}, {time=19.0, mean=17.315, var=2.612}]","[{coords=[2.4, 1.7, 0.3], header={frame=1, seq=1, name=hello}}]","{x=5.0, y=1.4, theta=0.04}"
I was hoping to get those nested fields exported in a more convenient format - getting them in JSON would be great.
Unfortunately it seems that cast to JSON only works for maps, not structs, because it just flattens everything into arrays:
SELECT timestamp, cast(stats as JSON) as stats, cast(dets as JSON) as dets, cast(pos as JSON) as pos FROM "sampledb"."test"
"timestamp","stats","dets","pos"
"1.520640777666096E9","[[15.0,45.23,0.31],[19.0,17.315,2.612]]","[[[2.4,1.7,0.3],[1,1,""hello""]]]","[5.0,1.4,0.04]"
Is there a good way to convert to JSON (or another easy-to-import format) or should I just go ahead and do a custom parsing function?
I have skimmed through all the documentation and unfortunately there seems to be no way to do this as of now. The only possible workaround is converting a struct to JSON when querying Athena by selecting individual fields:
SELECT
  my_field,
  my_field.a,
  my_field.b,
  my_field.c.d,
  my_field.c.e
FROM
  my_table
Or I would convert the data to JSON using post-processing. The script below shows how:
#!/usr/bin/env python
import io
import re

pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)
pattern3 = re.compile(r'\\"', re.I)

with io.open("test.csv") as f:
    headers = list(map(lambda f: f.strip(), f.readline().split(",")))
    for line in f.readlines():
        orig_line = line
        data = []
        for i, l in enumerate(line.split('","')):
            data.append(headers[i] + ":" + re.sub('^"|"$', "", l))
        line = "{" + ','.join(data) + "}"
        line = pattern1.sub(r'"\1":', line)
        line = pattern2.sub(r':"\1"', line)
        print(line)
The output on your input data is
{"timestamp":1.520640777666096E9,"stats":[{"time":15.0, "mean":45.23, "var":0.31}, {"time":19.0, "mean":17.315, "var":2.612}],"dets":[{"coords":[2.4, 1.7, 0.3], "header":{"frame":1, "seq":1, "name":"hello"}}],"pos":{"x":5.0, "y":1.4, "theta":0.04}
}
This is valid JSON.
The Python code from @tarun almost got me there, but I had to modify it in several ways due to my data. In particular, I have:
JSON structures saved in Athena as strings
Strings that contain multiple words and therefore need to be between double quotes; some of them contain "[]" and "{}" symbols.
Here is the code that worked for me; hopefully it will be useful for others:
#!/usr/bin/env python
import io
import re, sys

pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)
pattern3 = re.compile(r'\\"', re.I)

with io.open(sys.argv[1]) as f:
    headers = list(map(lambda f: f.strip(), f.readline().split(",")))
    print(headers)
    for line in f.readlines():
        orig_line = line
        # save the double quote cases, which mean there is a string with quotes inside
        line = re.sub('""', "#", orig_line)
        data = []
        for i, l in enumerate(line.split('","')):
            item = re.sub('^"|"$', "", l.rstrip())
            if (item[0] == "{" and item[-1] == "}") or (item[0] == "[" and item[-1] == "]"):
                data.append(headers[i] + ":" + item)
            else:  # we have a string
                data.append(headers[i] + ": \"" + item + "\"")
        line = "{" + ','.join(data) + "}"
        line = pattern1.sub(r'"\1":', line)
        line = pattern2.sub(r':"\1"', line)
        # restate the double quotes to single ones, once inside the json
        line = re.sub("#", '"', line)
        print(line)
This method does not modify the query; it works by post-processing. For JavaScript/Node.js we can use the npm package athena-struct-parser.
Detailed answer with an example:
https://stackoverflow.com/a/67899845/6662952
Reference - https://www.npmjs.com/package/athena-struct-parser
I used a simple approach to get around the struct -> JSON Athena limitation: I created a second table where the JSON columns were saved as raw strings. Using Presto JSON and array functions I was able to query the data and return the valid JSON string to my program:
--Array transform functions too
select
  json_extract_scalar(dd, '$.timestamp') as timestamp,
  transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.time')) as arr_stats_time,
  transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.mean')) as arr_stats_mean,
  transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.var')) as arr_stats_var
from
  (select '{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}' as dd);
I know the query will take longer to execute but there are ways to optimize.
I worked around this by creating a second table using the same S3 location, but changed the field's data type to string. The resulting CSV then had the string that Athena pulled from the object in the JSON file and I was able to parse the result.
I also had to adjust the code from @tarun, because I had more complex data and nested structures. Here is the solution I've got; I hope it helps:
import re
import json
import numpy as np
import pandas as pd  # needed for pd.read_csv below

pattern1 = re.compile(r'(?<=[{,\[])\s*([^{}\[\],"=]+)=')
pattern2 = re.compile(r':([^{}\[\],"]+|()(?![{\[]))')
pattern3 = re.compile(r'"null"')

def convert_metadata_to_json(value):
    if type(value) is str:
        value = pattern1.sub('"\\1":', value)
        value = pattern2.sub(': "\\1"', value)
        value = pattern3.sub('null', value)
    elif np.isnan(value):
        return None
    return json.loads(value)

df = pd.read_csv('test.csv')
df['metadata_json'] = df.metadata.apply(convert_metadata_to_json)

Python 3 Tornado bytes and JSON issues

I stumbled into Python 3, and specifically into the Tornado framework.
My task was to integrate Facebook authentication, and I used the examples from here:
https://github.com/tornadoweb/tornado/tree/master/demos/facebook
So the point is that user is a dictionary with bytes data.
class AuthLoginHandler(BaseHandler, tornado.auth.FacebookGraphMixin):
    @tornado.web.asynchronous
    def get(self):
        ....

    def _on_auth(self, user):
        if not user:
            raise tornado.web.HTTPError(500, "Facebook auth failed")
        self.set_secure_cookie("fbdemo_user", tornado.escape.json_encode(user))
        self.redirect(self.get_argument("next", "/"))
_on_auth always produces this Error: b'token or sesion_expire data here' is not JSON serializable
I've come up with a few solutions found on Stack Overflow:
Fix the data before encoding it:
import collections.abc

def convert(data):
    '''
    Converts bytes data into unicode strings, so this can be encoded into JSON
    '''
    if isinstance(data, str):
        return str(data)
    elif isinstance(data, bytes):
        return data.decode('utf-8')
    elif isinstance(data, collections.abc.Mapping):
        return dict(map(convert, data.items()))
    elif isinstance(data, collections.abc.Iterable):
        return type(data)(map(convert, data))
    else:
        return data

# ... and somewhere in the code
tornado.escape.json_encode(convert(user))
# ... and somewhere in the code
tornado.escape.json_encode(convert(user))
And the next one is to extend the JSON encoder itself:
import json

class JSONEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, bytes):
            return o.decode('utf-8')
        return json.JSONEncoder.default(self, o)
Now the question: why are there such issues with data where type(data) == <class 'bytes'>, and am I doing it right?
Thank you
Better late than never.
Python 3 json encoder does not accept byte strings. Tornado provides a method to_basestring which can be used to overcome this problem.
Here's what the source doc says about the issue:
In python2, byte and unicode strings are mostly interchangeable, so functions that deal with a user-supplied argument in combination with ascii string constants can use either and should return the type the user supplied. In python3, the two types are not interchangeable, so this method is needed to convert byte strings to unicode.
Usage:
tornado.escape.to_basestring(value)

Forcing Python json module to work with ASCII

I'm using json.dump() and json.load() to save/read a dictionary of strings to/from disk. The issue is that I can't have any of the strings in unicode. They seem to be in unicode no matter how I set the parameters to dump/load (including ensure_ascii and encoding).
If you are just dealing with simple JSON objects, you can use the following:
def ascii_encode_dict(data):
    ascii_encode = lambda x: x.encode('ascii')
    return dict(map(ascii_encode, pair) for pair in data.items())

json.loads(json_data, object_hook=ascii_encode_dict)
Here is an example of how it works:
>>> json_data = '{"foo": "bar", "bar": "baz"}'
>>> json.loads(json_data) # old call gives unicode
{u'foo': u'bar', u'bar': u'baz'}
>>> json.loads(json_data, object_hook=ascii_encode_dict) # new call gives str
{'foo': 'bar', 'bar': 'baz'}
This answer works for a more complex JSON structure, and gives some nice explanation on the object_hook parameter. There is also another answer there that recursively takes the result of a json.loads() call and converts all of the Unicode strings to byte strings.
And if the json object is a mix of datatypes, not only unicode strings, you can use this expression:
def ascii_encode_dict(data):
    ascii_encode = lambda x: x.encode('ascii') if isinstance(x, unicode) else x
    return dict(map(ascii_encode, pair) for pair in data.items())

Parsing JSON with Scala lift

I am trying to parse a JSON string with special characters in its attribute names (dots).
This is what I'm trying:
// Json parser objects
case class SolrDoc(`rdf.about`: String, `dc.title`: List[String],
                   `dc.creator`: List[String], `dc.dateCopyrighted`: List[Int],
                   `dc.publisher`: List[String], `dc.type`: String)

case class SolrResponse(numFound: String, start: String, docs: List[SolrDoc])

val req = url("http://localhost:8983/solr/select") <<? Map("q" -> q)
var search_result = http(req ># { json => (json \ "response") })
var response = search_result.extract[SolrResponse]
Even though my json string contains values for all the fields this is the error I'm getting:
Message: net.liftweb.json.MappingException: No usable value for docs
No usable value for rdf$u002Eabout
Did not find value which can be converted into java.lang.String
I suspect that it has something to do with the dots in the names, but so far I did not manage to make it work.
Thanks!
This is an extract from my LiftProject.scala file:
"net.databinder" % "dispatch-http_2.8.1" % "0.8.6",
"net.databinder" % "dispatch-http-json_2.8.1" % "0.8.6",
"net.databinder" % "dispatch-lift-json_2.8.1" % "0.8.6"
Dots in names should not be a problem. This is with lift-json-2.4-M4:
scala> val json = """ {"first.name":"joe"} """
scala> parse(json).extract[Person]
res0: Person = Person(joe)
Where
case class Person(`first.name`: String)