Parsing json string with escaped quotes in spark scala [duplicate]

This question already has answers here:
Parse a String in Scala [closed]
(2 answers)
Closed 1 year ago.
I am trying to parse the string given below as JSON using Scala, but haven't been able to because of the escaped quotes \" occurring in the fields.
{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r <RegistrationInfo>\r <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r </RegistrationInfo>\r <Triggers>\r <TimeTrigger>\r <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r <Enabled>true</Enabled>\r </TimeTrigger>\r </Triggers>\r <Settings>\r <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r <AllowHardTerminate>true</AllowHardTerminate>\r <StartWhenAvailable>false</StartWhenAvailable>\r <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r <IdleSettings>\r <Duration>PT10M</Duration>\r <WaitTimeout>PT1H</WaitTimeout>\r <StopOnIdleEnd>true</StopOnIdleEnd>\r <RestartOnIdle>false</RestartOnIdle>\r </IdleSettings>\r <AllowStartOnDemand>true</AllowStartOnDemand>\r <Enabled>true</Enabled>\r <Hidden>false</Hidden>\r <RunOnlyIfIdle>false</RunOnlyIfIdle>\r <WakeToRun>false</WakeToRun>\r <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r <Priority>7</Priority>\r </Settings>\r <Actions Context=\"Author\">\r <Exec>\r <Command>%systemroot%\\system32\\usoclient.exe</Command>\r <Arguments>StartUWork</Arguments>\r </Exec>\r </Actions>\r <Principals>\r <Principal id=\"Author\">\r <UserId>S-1-5-18</UserId>\r <RunLevel>LeastPrivilege</RunLevel>\r </Principal>\r </Principals>\r </Task>\"}
So far, I have tried spark.read.json and the net.liftweb library, but to no avail.
Any kind of help is highly appreciated.

The JSON output that you are getting is not valid as-is: the XML content in the TaskContent field has tags with quoted attributes, and I think that is what is causing the issue. The idea I have is to remove the double quotes from the XML attribute values and then parse. You could replace those double quotes with a placeholder, and once you have TaskContent as a DataFrame column you can replace the placeholder again to recover the original content.
This might not be the perfect or most efficient answer, but if this is how you receive the JSON and its structure stays the same, you could do something like the below:
Convert the JSON that you have to a string.
Do some replaceAll operations on the string to make it valid JSON.
Read the JSON into a DataFrame.
//Source data copied from Question
val json = """{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r <RegistrationInfo>\r <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r </RegistrationInfo>\r <Triggers>\r <TimeTrigger>\r <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r <Enabled>true</Enabled>\r </TimeTrigger>\r </Triggers>\r <Settings>\r <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r <AllowHardTerminate>true</AllowHardTerminate>\r <StartWhenAvailable>false</StartWhenAvailable>\r <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r <IdleSettings>\r <Duration>PT10M</Duration>\r <WaitTimeout>PT1H</WaitTimeout>\r <StopOnIdleEnd>true</StopOnIdleEnd>\r <RestartOnIdle>false</RestartOnIdle>\r </IdleSettings>\r <AllowStartOnDemand>true</AllowStartOnDemand>\r <Enabled>true</Enabled>\r <Hidden>false</Hidden>\r <RunOnlyIfIdle>false</RunOnlyIfIdle>\r <WakeToRun>false</WakeToRun>\r <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r <Priority>7</Priority>\r </Settings>\r <Actions Context=\"Author\">\r <Exec>\r <Command>%systemroot%\\system32\\usoclient.exe</Command>\r <Arguments>StartUWork</Arguments>\r </Exec>\r </Actions>\r <Principals>\r <Principal id=\"Author\">\r <UserId>S-1-5-18</UserId>\r <RunLevel>LeastPrivilege</RunLevel>\r </Principal>\r </Principals>\r </Task>\"}"""
//Modifying the json to make it valid
val modifiedJson = json
  .replaceAll("\\\\\\\\","#")                    // protect the \\ pairs with a placeholder
  .replaceAll("\\\\r","")                        // drop the literal \r sequences
  .replaceAll("\\\\","")                         // strip the remaining escaping backslashes
  .replaceAll(" ","").replaceAll(" ","")         // remove whitespace (duplicate call kept from the original)
  .replaceAll("> <","><")                        // join adjacent XML tags
  .replaceAll("=\"","=").replaceAll("\">",">")   // drop the quotes around XML attribute values
  .replaceAll("#","\\\\\\\\")                    // restore the protected backslashes
  .replaceAll("1.0\"","1.0").replaceAll("UTF-16\"?","UTF-16").replaceAll("1.2\"","1.2") // clean leftover quotes
//creating a Dataset out of the json String (spark.implicits._ supplies the String encoder)
import spark.implicits._
val ds = spark.createDataset(modifiedJson :: Nil)
//reading the dataset as json
val df = spark.read.json(ds)
You can do some optimization for it to work more efficiently, but this is how I made it work.

You can replace the escaped quotes \" in the JSON with plain " (except for those inside the XML content) using the regexp_replace function, then read the result into a DataFrame. The pattern below only rewrites the \" pairs that sit at JSON key/value boundaries (preceded by : [ , or { and followed by : , ] or }), so the escaped quotes inside the XML stay untouched:
val jsonString = """
{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r <RegistrationInfo>\r <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r </RegistrationInfo>\r <Triggers>\r <TimeTrigger>\r <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r <Enabled>true</Enabled>\r </TimeTrigger>\r </Triggers>\r <Settings>\r <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r <AllowHardTerminate>true</AllowHardTerminate>\r <StartWhenAvailable>false</StartWhenAvailable>\r <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r <IdleSettings>\r <Duration>PT10M</Duration>\r <WaitTimeout>PT1H</WaitTimeout>\r <StopOnIdleEnd>true</StopOnIdleEnd>\r <RestartOnIdle>false</RestartOnIdle>\r </IdleSettings>\r <AllowStartOnDemand>true</AllowStartOnDemand>\r <Enabled>true</Enabled>\r <Hidden>false</Hidden>\r <RunOnlyIfIdle>false</RunOnlyIfIdle>\r <WakeToRun>false</WakeToRun>\r <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r <Priority>7</Priority>\r </Settings>\r <Actions Context=\"Author\">\r <Exec>\r <Command>%systemroot%\\system32\\usoclient.exe</Command>\r <Arguments>StartUWork</Arguments>\r </Exec>\r </Actions>\r <Principals>\r <Principal id=\"Author\">\r <UserId>S-1-5-18</UserId>\r <RunLevel>LeastPrivilege</RunLevel>\r </Principal>\r </Principals>\r </Task>\"}
"""
import spark.implicits._ // for toDS and the $ column syntax
import org.apache.spark.sql.functions.regexp_replace

val df = spark.read.json(
  Seq(jsonString).toDS
    .withColumn("value", regexp_replace($"value", """([:\[,{]\s*)\\"(.*?)\\"(?=\s*[:,\]}])""", "$1\"$2\""))
    .as[String]
)
df.show
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
//| Category| Channel| Computer|EventID|EventRecordID| Provider|SubjectDomainName|SubjectLogonId|SubjectUserName|SubjectUserSid| TaskContent| TaskName| TimeCreated|Version|
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
//|AUDIT_SUCCESS|Security|WIN-10.atfdetonat...| 4698| 12956650|Microsoft-Windows...| ATFDETONATE| 0x3e7| WIN-10$| S-1-5-18|<?xml version="1....|\Microsoft\Window...|2021-01-09T04:29:...| 1|
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+

Related

How to parse json and replace a value in a nested json?

I have the below json:
{
"status":"success",
"data":{
"_id":"ABCD",
"CNTL":{"XMN Version":"R3.1.0"},
"OMN":{"dree":["ANY"]},
"os0":{
"Enable":true,"Service Reference":"","Name":"",
"TD ex":["a0.c985.c0"],
"pn ex":["s0.c100.c0"],"i ex":{},"US Denta Treatment":"copy","US Denta Value":0,"DP":{"Remote ID":"","cir ID":"","Sub Options":"","etp Number":54469},"pe":{"Remote ID":"","cir ID":""},"rd":{"can Identifier":"","can pt ID":"","uno":"Default"},"Filter":{"pv":"pass","pv6":"pass","ep":"pass","pe":"pass"},"sc":"Max","dc":"","st Limit":2046,"dm":false},
"os1":{
"Enable":false,"Service Reference":"","Name":"",
"TD ex":[],
"pn ex":[],"i ex":{},"US Denta Treatment":"copy","US Denta Value":0,"DP":{"Remote ID":"","cir ID":"","Sub Options":"","etp Number":54469},"pe":{"Remote ID":"","cir ID":""},"rd":{"can Identifier":"","can pt ID":"","uno":"Default"},"Filter":{"pv":"pass","pv6":"pass","ep":"pass","pe":"pass"},"sc":"Max","dc":"","st Limit":2046,"dm":false},
"ONM":{
"ONM-ALARM-XMN":"Default","Auto Boot Mode":false,"XMN Change Count":0,"CVID":0,"FW Bank Files":[],"FW Bank":[],"FW Bank Ptr":65535,"pn Max Frame Size":2000,"Realtime Stats":false,"Reset Count":0,"SRV-XMN":"Unmodified","Service Config Once":false,"Service Config pts":[],"Skip ot":false,"Name":"","Location":"","dree":"","Picture":"","Tag":"","PHY Delay":0,"Labels":[],"ex":"From OMN","st Age":60,"Laser TX Disable Time":0,"Laser TX Disable Count":0,"Clear st Count":0,"MIB Reset Count":0,"Expected ID":"ANY","Create Date":"2023-02-15 22:41:14.422681"},
"SRV-XMN Values":{},
"nc":{"Name":"ABCD"},
"Alarm History":{
"Alarm IDs":[],"Ack Count":0,"Ack Operator":"","Purge Count":0},"h FW Upgrade":{"wsize":64,"Backoff Divisor":2,"Backoff Delay":5,"Max Retries":4,"End Download Timeout":0},"Epn FW Upgrade":{"Final Ack Timeout":60},
"UNI-x 1":{"Max Frame Size":2000,"Duplex":"Auto","Speed":"Auto","lb":false,"Enable":true,"bd Rate Limit":200000,"st Limit":100,"lb Type":"PHY","Clear st Count":0,"ex":"Off","pc":false},
"UNI-x 2":{"Max Frame Size":2000,"Duplex":"Auto","Speed":"Auto","lb":false,"Enable":true,"bd Rate Limit":200000,"st Limit":100,"lb Type":"PHY","Clear st Count":0,"ex":"Off","pc":false},
"UNI-POTS 1":{"Enable":true},"UNI-POTS 2":{"Enable":true}}
}
All I am trying to do is replace one small value in this super-complicated JSON: the value of os0's "TD ex" key, from ["a0.c985.c0"] to ["a0.c995.c0"].
Is FreeMarker the best way to do this? I need to change only one value. Can this be done through regex, or should I use Gson?
I can replace the value like this:
JsonObject jsonObject = new JsonParser().parse(inputJson).getAsJsonObject();
JsonElement jsonElement = jsonObject.get("data").getAsJsonObject().get("os0").getAsJsonObject().get("TD ex");
String str = jsonElement.getAsString();
System.out.println(str);
String[] strs = str.split("\\.");
String replaced = strs[0] + "." + strs[1].replaceAll("\\d+", "201") + "." + strs[2];
System.out.println(replaced);
How to put it back and create the json?
FreeMarker is a template engine, so it's not the tool for this. Load the JSON with a real JSON parser library (like Jackson or Gson) into a node tree, change the value in that tree, and then use the same library to generate JSON from the node tree. Also, always avoid manipulating JSON with regular expressions: JSON (like most practical languages) can describe the same value in many ways, so writing a truly correct regular expression is impractical.
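For illustration, a minimal sketch of that node-tree round trip, calling Gson from Scala (this assumes Gson 2.8.6+ on the classpath and that inputJson holds the string above; Jackson would follow the same parse-mutate-serialize pattern):
import com.google.gson.{JsonArray, JsonParser}

// Parse into a mutable node tree (JsonParser.parseString requires Gson 2.8.6+).
val root = JsonParser.parseString(inputJson).getAsJsonObject
val os0  = root.getAsJsonObject("data").getAsJsonObject("os0")

// "TD ex" holds a JSON array, not a string, so swap in a new array value.
val tdEx = new JsonArray()
tdEx.add("a0.c995.c0")
os0.add("TD ex", tdEx) // add() overwrites the existing member

// Serialize the modified tree back to a JSON string.
val updatedJson = root.toString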

Json string written to Kafka using Spark is not converted properly on reading

I read a .csv file into a data frame, and I want to write the data to a Kafka topic. The code is the following:
df = spark.read.format("csv").option("header", "true").load(f'{file_location}')
kafka_df = df.selectExpr("to_json(struct(*)) AS value").selectExpr("CAST(value AS STRING)")
kafka_df.show(truncate=False)
And the data frame looks like this:
value
"{""id"":""d215e9f1-4d0c-42da-8f65-1f4ae72077b3"",""latitude"":""-63.571457254062715"",""longitude"":""-155.7055842710919""}"
"{""id"":""ca3d75b3-86e3-438f-b74f-c690e875ba52"",""latitude"":""-53.36506636464281"",""longitude"":""30.069167069917597""}"
"{""id"":""29e66862-9248-4af7-9126-6880ceb3b45f"",""latitude"":""-23.767505281795835"",""longitude"":""174.593140405442""}"
"{""id"":""451a7e21-6d5e-42c3-85a8-13c740a058a9"",""latitude"":""13.02054867061598"",""longitude"":""20.328402498420786""}"
"{""id"":""09d6c11d-7aae-4d17-8cd8-183157794893"",""latitude"":""-81.48976715040848"",""longitude"":""1.1995769642056189""}"
"{""id"":""393e8760-ef40-482a-a039-d263af3379ba"",""latitude"":""-71.73949722379649"",""longitude"":""112.59922770487054""}"
"{""id"":""d6db8fcf-ee83-41cf-9ec2-5c2909c18534"",""latitude"":""-4.034680969008576"",""longitude"":""60.59645511854336""}"
After writing it to Kafka, I want to read it back and transform the binary data from the "value" column into a JSON string, but the result is that the value contains only the id, not the whole string. Any idea why?
from pyspark.sql import functions as F
df = consume_from_event_hub(topic, bootstrap_servers, config, consumer_group)
string_df = df.select(F.col("value").cast("string"))
string_df.display()
value
794541bc-30e6-4c16-9cd0-3c5c8995a3a4
20ea5b50-0baa-47e3-b921-f9a3ac8873e2
598d2fc1-c919-4498-9226-dd5749d92fc5
86cd5b2b-1c57-466a-a3c8-721811ab6959
807de968-c070-4b8b-86f6-00a865474c35
e708789c-e877-44b8-9504-86fd9a20ef91
9133a888-2e8d-4a5a-87ce-4a53e63b67fc
cd5e3e0d-8b02-45ee-8634-7e056d49bf3b
The CSV format is this:
id,latitude,longitude
bd6d98e1-d1da-4f41-94ba-8dbd8c8fce42,-86.06318155350924,-108.14300138138589
c39e84c6-8d7b-4cc5-b925-68a5ea406d52,74.20752175171859,-129.9453606091319
011e5fb8-6ab7-4ee9-97bb-acafc2c71e15,19.302250885973592,-103.2154291337162
You need to remove selectExpr("CAST(value AS STRING)"), since to_json already returns a string column:
from pyspark.sql.functions import col, to_json, struct
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(f'{file_location}')
kafka_df = df.select(to_json(struct(col("*"))).alias("value"))
kafka_df.show(truncate=False)
I'm not sure what's wrong with the consumer. That should have worked, unless consume_from_event_hub does something specific that extracts only the id column.
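For reference, a bare-bones consumer read against a plain Kafka source, sketched in Scala (bootstrapServers and topic are placeholders standing in for whatever consume_from_event_hub configures):
// Batch-read the topic; Kafka delivers the value column as binary.
val raw = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers) // placeholder
  .option("subscribe", topic)                          // placeholder
  .option("startingOffsets", "earliest")
  .load()

// Casting value to string should recover the full JSON payload, not just the id.
val stringDf = raw.selectExpr("CAST(value AS STRING) AS value")
stringDf.show(truncate = false)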

Removing extra "" from JSON values using Scala

I have been trying to clean my JSON object using Scala, but I am not able to remove the extra "" from my JSON values.
Example: "LAST_NM":"SMITH "LIBBY" MARY"
The extra quotes inside my string are creating the problem.
Here is the code that I am using to clean my JSON file:
val readjson = sparkSession.sparkContext.textFile("dev.json")
val json=readjson.map(element=>element.replace("\"\":\"\"","\":\"")
.replace("\"\",\"\"","\",\"")
.replace("\"\":","\":")
.replace(",\"\"",",\"")
.replace("\"{\"\"","{\"")
.replace("\"\"}\"","\"}")
.replaceAll("\\u0009"," "))
.saveAsTextFile("JSON")
Here is my json string that I want to clean (whitespace added for readability):
{
"SEQ_NO":597216,
"PROV_DEMOG_SK":597216,
"PROV_ID":"QMP000003371283",
"FRST_NM":"",
"LAST_NM":"SMITH "LIBBY" MARY",
"FUL_NM":"",
"GENDR_CD":"",
"PROV_NPI":"",
"PROV_STAT":"Incomplete",
"PROV_TY":"03",
"DT_OF_BRTH":"",
"PROFPROFL_DESGTN":"",
"ETL_LAST_UPDT_DT_TM":"2020-04-28 11:43:31.000000",
"PROV_CLSFTN_CD":"A",
"SRC_DATA_KEY":50,
"OPRN_CD":"I",
"REC_SET":"F"
}
What should I add to my code to remove the extra "" from the LAST_NM value of my JSON string?
Check the code below (here df is the raw JSON read as a Dataset[String]; dropping every double quote that has a space next to it removes the inner quotes while keeping the field delimiters, which have no adjacent spaces):
df.map(_.replaceAll(" \""," ").replaceAll("\" "," ")).show(false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"SEQ_NO":597216,"PROV_DEMOG_SK":597216,"PROV_ID":"QMP000003371283","FRST_NM":"","LAST_NM":"SMITH LIBBY MARY","FUL_NM":"","GENDR_CD":"","PROV_NPI":"","PROV_STAT":"Incomplete","PROV_TY":"03","DT_OF_BRTH":"","PROFPROFL_DESGTN":"","ETL_LAST_UPDT_DT_TM":"2020-04-28 11:43:31.000000","PROV_CLSFTN_CD":"A","SRC_DATA_KEY":50,"OPRN_CD":"I","REC_SET":"F"}|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
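Putting it together with the original textFile-style read, the whole pipeline might look like this sketch (assuming spark is the active SparkSession and dev.json holds one JSON object per line):
import spark.implicits._

// Read raw lines, strip the space-adjacent quotes, then parse as JSON.
val raw     = spark.read.textFile("dev.json") // Dataset[String]
val cleaned = raw.map(_.replaceAll(" \"", " ").replaceAll("\" ", " "))
val parsed  = spark.read.json(cleaned)
parsed.select("PROV_ID", "LAST_NM").show(false)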

Parse JSON into U-SQL then convert to csv

I'm trying to convert some telemetry data that is in JSON format into CSV format, then write it out to a file, using U-SQL.
The problem is that some of the JSON key names have periods in them, so when I do the SELECT operation, U-SQL does not recognize them. When I check the output file, all that I am seeing is the values for "p1". How can I reference the JSON key names in the script so that they are recognized? Thanks in advance for any help!
Code:
REFERENCE ASSEMBLY MATSDevDB.[Newtonsoft.Json];
REFERENCE ASSEMBLY MATSDevDB.[Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@jsonDocuments =
    EXTRACT jsonString string
    FROM @"adl://xxxx.azuredatalakestore.net/xxxx/{*}/{*}/{*}/telemetry_{*}.json"
    USING Extractors.Tsv(quoting:false);
@jsonify =
    SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(jsonString) AS json
    FROM @jsonDocuments;
@columnized = SELECT
json["EventInfo.Source"] AS EventInfoSource,
json["EventInfo.InitId"] AS EventInfoInitId,
json["EventInfo.Sequence"] AS EventInfoSequence,
json["EventInfo.Name"] AS EventInfoName,
json["EventInfo.Time"] AS EventInfoTime,
json["EventInfo.SdkVersion"] AS EventInfoSdkVersion,
json["AppInfo.Language"] AS AppInfoLanguage,
json["UserInfo.Language"] AS UserInfoLanguage,
json["DeviceInfo.BrowserName"] AS DeviceInfoBrowswerName,
json["DeviceInfo.BrowserVersion"] AS BrowswerVersion,
json["DeviceInfo.OsName"] AS DeviceInfoOsName,
json["DeviceInfo.OsVersion"] AS DeviceInfoOsVersion,
json["DeviceInfo.Id"] AS DeviceInfoId,
json["p1"] AS p1,
json["PipelineInfo.AccountId"] AS PipelineInfoAccountId,
json["PipelineInfo.IngestionTime"] AS PipelineInfoIngestionTime,
json["PipelineInfo.ClientIp"] AS PipelineInfoClientIp,
json["PipelineInfo.ClientCountry"] AS PipelineInfoClientCountry,
json["PipelineInfo.IngestionPath"] AS PipelineInfoIngestionPath,
json["AppInfo.Id"] AS AppInfoId,
json["EventInfo.Id"] AS EventInfoId,
json["EventInfo.BaseType"] AS EventInfoBaseType,
json["EventINfo.IngestionTime"] AS EventINfoIngestionTime
FROM @jsonify;
OUTPUT @columnized
TO "adl://xxxx.azuredatalakestore.net/poc/TestResult.csv"
USING Outputters.Csv(quoting : false);
JSON:
{"EventInfo.Source":"JS_default_source","EventInfo.Sequence":"1","EventInfo.Name":"daysofweek","EventInfo.Time":"2018-01-25T21:09:36.779Z","EventInfo.SdkVersion":"ACT-Web-JS-2.6.0","AppInfo.Language":"en","UserInfo.Language":"en-US","UserInfo.TimeZone":"-08:00","DeviceInfo.BrowserName":"Chrome","DeviceInfo.BrowserVersion":"63.0.3239.132","DeviceInfo.OsName":"Mac OS X","DeviceInfo.OsVersion":"10","p1":"V1","PipelineInfo.IngestionTime":"2018-01-25T21:09:33.9930000Z","PipelineInfo.ClientCountry":"CA","PipelineInfo.IngestionPath":"FastPath","EventInfo.BaseType":"custom","EventInfo.IngestionTime":"2018-01-25T21:09:33.9930000Z"}
I got this to work with single quotes and single square brackets, e.g.:
@columnized = SELECT
json["['EventInfo.Source']"] AS EventInfoSource,
...
Full code:
@columnized = SELECT
json["['EventInfo.Source']"] AS EventInfoSource,
json["['EventInfo.InitId']"] AS EventInfoInitId,
json["['EventInfo.Sequence']"] AS EventInfoSequence,
json["['EventInfo.Name']"] AS EventInfoName,
json["['EventInfo.Time']"] AS EventInfoTime,
json["['EventInfo.SdkVersion']"] AS EventInfoSdkVersion,
json["['AppInfo.Language']"] AS AppInfoLanguage,
json["['UserInfo.Language']"] AS UserInfoLanguage,
json["['DeviceInfo.BrowserName']"] AS DeviceInfoBrowswerName,
json["['DeviceInfo.BrowserVersion']"] AS BrowswerVersion,
json["['DeviceInfo.OsName']"] AS DeviceInfoOsName,
json["['DeviceInfo.OsVersion']"] AS DeviceInfoOsVersion,
json["['DeviceInfo.Id']"] AS DeviceInfoId,
json["p1"] AS p1,
json["['PipelineInfo.AccountId']"] AS PipelineInfoAccountId,
json["['PipelineInfo.IngestionTime']"] AS PipelineInfoIngestionTime,
json["['PipelineInfo.ClientIp']"] AS PipelineInfoClientIp,
json["['PipelineInfo.ClientCountry']"] AS PipelineInfoClientCountry,
json["['PipelineInfo.IngestionPath']"] AS PipelineInfoIngestionPath,
json["['AppInfo.Id']"] AS AppInfoId,
json["['EventInfo.Id']"] AS EventInfoId,
json["['EventInfo.BaseType']"] AS EventInfoBaseType,
json["['EventINfo.IngestionTime']"] AS EventINfoIngestionTime
FROM @jsonify;

How can I flatten HBase cells so I can process the resulting JSON using a Spark RDD or DataFrame in Scala?

A relative newbie to Spark, HBase, and Scala here.
I have JSON (stored as byte arrays) in HBase cells, in the same column family but across several thousand column qualifiers. Example (simplified):
Table name: 'Events'
rowkey: rk1
column family: cf1
column qualifier: cq1, cell data (in bytes): {"id":1, "event":"standing"}
column qualifier: cq2, cell data (in bytes): {"id":2, "event":"sitting"}
etc.
Using Scala, I can read rows by specifying a time range:
val scan = new Scan()
val start = 1460542400
val end = 1462801600
scan.setTimeRange(start, end) // assumed: the original snippet defines start/end but never applies them to the scan
val hbaseContext = new HBaseContext(sc, conf)
val getRdd = hbaseContext.hbaseRDD(TableName.valueOf("Events"), scan)
If I try to load my HBase RDD (getRdd) into a DataFrame (after converting the byte arrays into strings, etc.), it only reads the first cell in every row (in the example above, I would only get "standing").
This code only loads a single cell for every row returned:
val resultsString = getRdd.map(s=>Bytes.toString(s._2.value()))
val resultsDf = sqlContext.read.json(resultsString)
In order to get every cell, I have to iterate as below.
val jsonRDD = getRdd.map(
  row => {
    val str = new StringBuilder
    str.append("[")
    val it = row._2.listCells().iterator()
    while (it.hasNext) {
      val cell = it.next()
      val cellstring = Bytes.toString(CellUtil.cloneValue(cell))
      str.append(cellstring)
      if (it.hasNext()) {
        str.append(",")
      }
    }
    str.append("]")
    str.toString()
  }
)
val hbaseDataSet = sqlContext.read.json(jsonRDD)
I need to add the square brackets and the commas so it's properly formed JSON for the DataFrame to read.
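(As an aside on question 1 below: the same bracket-and-comma assembly can be written more concisely with Scala collections; a sketch against the same getRdd:)
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.util.Bytes

// Equivalent to the StringBuilder loop above, via mkString.
val jsonRDD = getRdd.map { case (_, result) =>
  result.listCells().asScala
    .map(cell => Bytes.toString(CellUtil.cloneValue(cell)))
    .mkString("[", ",", "]")
}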
Questions:
Is there a more elegant way to construct the JSON, i.e., some parser that takes the individual JSON strings and concatenates them so the result is properly formed JSON?
Is there a better way to flatten the HBase cells so I don't need to iterate?
For the jsonRDD, the computed closure should include the str local variable, so a task executing this code on a node should not be missing the "[", "]", or ","; i.e., I won't get parser errors once I run this on the cluster instead of local[*]. Is that correct?
Finally, is it better to just create a pair RDD from the JSON or use DataFrames to perform simple things like counts? Is there some way to measure the efficiency and performance of one vs. the other?
Thank you.