Removing extra "" from JSON values using Scala

I have been trying to clean my JSON object using Scala, but I am not able to remove the extra "" from my
JSON value,
for example "LAST_NM":"SMITH "LIBBY" MARY".
The extra inverted commas (quotes) inside my string are causing problems.
Here is the code I am using to clean my JSON file:
val readjson = sparkSession.sparkContext.textFile("dev.json")
val json = readjson.map(element => element.replace("\"\":\"\"", "\":\"")
    .replace("\"\",\"\"", "\",\"")
    .replace("\"\":", "\":")
    .replace(",\"\"", ",\"")
    .replace("\"{\"\"", "{\"")
    .replace("\"\"}\"", "\"}")
    .replaceAll("\\u0009", " "))
  .saveAsTextFile("JSON")
Here is my json string that I want to clean (whitespace added for readability):
{
"SEQ_NO":597216,
"PROV_DEMOG_SK":597216,
"PROV_ID":"QMP000003371283",
"FRST_NM":"",
"LAST_NM":"SMITH "LIBBY" MARY",
"FUL_NM":"",
"GENDR_CD":"",
"PROV_NPI":"",
"PROV_STAT":"Incomplete",
"PROV_TY":"03",
"DT_OF_BRTH":"",
"PROFPROFL_DESGTN":"",
"ETL_LAST_UPDT_DT_TM":"2020-04-28 11:43:31.000000",
"PROV_CLSFTN_CD":"A",
"SRC_DATA_KEY":50,
"OPRN_CD":"I",
"REC_SET":"F"
}
What should I add to my code to remove the extra "" from the LAST_NM value of my JSON string?

Try the code below:
df.map(_.replaceAll(" \""," ").replaceAll("\" "," ")).show(false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"SEQ_NO":597216,"PROV_DEMOG_SK":597216,"PROV_ID":"QMP000003371283","FRST_NM":"","LAST_NM":"SMITH LIBBY MARY","FUL_NM":"","GENDR_CD":"","PROV_NPI":"","PROV_STAT":"Incomplete","PROV_TY":"03","DT_OF_BRTH":"","PROFPROFL_DESGTN":"","ETL_LAST_UPDT_DT_TM":"2020-04-28 11:43:31.000000","PROV_CLSFTN_CD":"A","SRC_DATA_KEY":50,"OPRN_CD":"I","REC_SET":"F"}|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
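Note that the replaceAll calls above drop every quote that touches a space, which fits this record but would also alter a legitimate value that starts or ends with a space. As a rough alternative, here is a minimal sketch in Python (used purely for illustration, not the asker's Scala) that escapes the stray quotes instead of deleting them. It assumes compact, one-record-per-line JSON in which every structural quote touches {, [, a comma or a colon. STRAY_QUOTE and escape_stray_quotes are hypothetical names, and the sample record is a shortened copy of the one in the question.

```python
import json
import re

# Heuristic (an assumption, not a general rule): a quote is structural when
# it has {, [, comma or colon directly on its left, or colon, comma, } or ]
# directly on its right. Every other quote is treated as a stray inner quote.
STRAY_QUOTE = re.compile(r'(?<![{\[,:])"(?![:,}\]])')


def escape_stray_quotes(line):
    """Backslash-escape quotes that appear inside a string value."""
    return STRAY_QUOTE.sub(r'\\"', line)


raw = ('{"PROV_ID":"QMP000003371283","FRST_NM":"",'
       '"LAST_NM":"SMITH "LIBBY" MARY","SRC_DATA_KEY":50}')

record = json.loads(escape_stray_quotes(raw))
print(record["LAST_NM"])  # prints: SMITH "LIBBY" MARY
```

A value that itself contains a quote next to a comma or colon would defeat this heuristic, so fixing whatever produces the file remains the safer option.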

Related

How to parse json and replace a value in a nested json?

I have the below json:
{
"status":"success",
"data":{
"_id":"ABCD",
"CNTL":{"XMN Version":"R3.1.0"},
"OMN":{"dree":["ANY"]},
"os0":{
"Enable":true,"Service Reference":"","Name":"",
"TD ex":["a0.c985.c0"],
"pn ex":["s0.c100.c0"],"i ex":{},"US Denta Treatment":"copy","US Denta Value":0,"DP":{"Remote ID":"","cir ID":"","Sub Options":"","etp Number":54469},"pe":{"Remote ID":"","cir ID":""},"rd":{"can Identifier":"","can pt ID":"","uno":"Default"},"Filter":{"pv":"pass","pv6":"pass","ep":"pass","pe":"pass"},"sc":"Max","dc":"","st Limit":2046,"dm":false},
"os1":{
"Enable":false,"Service Reference":"","Name":"",
"TD ex":[],
"pn ex":[],"i ex":{},"US Denta Treatment":"copy","US Denta Value":0,"DP":{"Remote ID":"","cir ID":"","Sub Options":"","etp Number":54469},"pe":{"Remote ID":"","cir ID":""},"rd":{"can Identifier":"","can pt ID":"","uno":"Default"},"Filter":{"pv":"pass","pv6":"pass","ep":"pass","pe":"pass"},"sc":"Max","dc":"","st Limit":2046,"dm":false},
"ONM":{
"ONM-ALARM-XMN":"Default","Auto Boot Mode":false,"XMN Change Count":0,"CVID":0,"FW Bank Files":[],"FW Bank":[],"FW Bank Ptr":65535,"pn Max Frame Size":2000,"Realtime Stats":false,"Reset Count":0,"SRV-XMN":"Unmodified","Service Config Once":false,"Service Config pts":[],"Skip ot":false,"Name":"","Location":"","dree":"","Picture":"","Tag":"","PHY Delay":0,"Labels":[],"ex":"From OMN","st Age":60,"Laser TX Disable Time":0,"Laser TX Disable Count":0,"Clear st Count":0,"MIB Reset Count":0,"Expected ID":"ANY","Create Date":"2023-02-15 22:41:14.422681"},
"SRV-XMN Values":{},
"nc":{"Name":"ABCD"},
"Alarm History":{
"Alarm IDs":[],"Ack Count":0,"Ack Operator":"","Purge Count":0},"h FW Upgrade":{"wsize":64,"Backoff Divisor":2,"Backoff Delay":5,"Max Retries":4,"End Download Timeout":0},"Epn FW Upgrade":{"Final Ack Timeout":60},
"UNI-x 1":{"Max Frame Size":2000,"Duplex":"Auto","Speed":"Auto","lb":false,"Enable":true,"bd Rate Limit":200000,"st Limit":100,"lb Type":"PHY","Clear st Count":0,"ex":"Off","pc":false},
"UNI-x 2":{"Max Frame Size":2000,"Duplex":"Auto","Speed":"Auto","lb":false,"Enable":true,"bd Rate Limit":200000,"st Limit":100,"lb Type":"PHY","Clear st Count":0,"ex":"Off","pc":false},
"UNI-POTS 1":{"Enable":true},"UNI-POTS 2":{"Enable":true}}
}
All I am trying to do is replace one small value in this very complicated JSON: the value of "TD ex" under "os0", from ["a0.c985.c0"] to ["a0.c995.c0"].
Is FreeMarker the best way to do this? I need to change only one value. Can this be done with a regex, or should I use Gson?
I can replace the value like this:
JsonObject jsonObject = new JsonParser().parse(inputJson).getAsJsonObject();
JsonElement jsonElement = jsonObject.get("data").getAsJsonObject().get("os0").getAsJsonObject().get("TD ex");
String str = jsonElement.getAsString();
System.out.println(str);
String[] strs = str.split("\\.");
String replaced = strs[0] + "." + strs[1].replaceAll("\\d+", "201") + "." + strs[2];
System.out.println(replaced);
How do I put the value back and regenerate the JSON?
FreeMarker is a template engine, so it's not the tool for this. Load the JSON into a node tree with a real JSON parser library (such as Jackson or Gson), change the value in that tree, and then use the same library to generate JSON from the tree. Also, avoid manipulating JSON with regular expressions: JSON (like most practical languages) can describe the same value in many ways, so writing a truly correct regular expression is impractical.
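The parse, modify, serialize cycle described above can be sketched as follows. This is only an illustration using Python's standard json module rather than Gson or Jackson, and the document is a heavily trimmed stand-in for the one in the question; the same three steps should carry over to either Java library.

```python
import json

# Trimmed stand-in for the question's document (assumption: only the
# fields needed to show the technique are kept).
input_json = ('{"status":"success","data":{"_id":"ABCD",'
              '"os0":{"Enable":true,"TD ex":["a0.c985.c0"]}}}')

# 1. Parse the text into a tree of dicts and lists.
doc = json.loads(input_json)

# 2. Change the one value in place.
td_ex = doc["data"]["os0"]["TD ex"]
parts = td_ex[0].split(".")
parts[1] = "c995"
td_ex[0] = ".".join(parts)

# 3. Serialize the whole tree back to JSON text.
output_json = json.dumps(doc, separators=(",", ":"))
print(output_json)
```

Everything that was not touched survives the round trip, which is the main advantage over editing the text with a regex.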

Parsing a JSON string with escaped quotes in Spark Scala

I am trying to parse the string given below as a JSON using Scala, but haven't been able to do so because of the escaped quotes \" occurring in fields.
{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r <RegistrationInfo>\r <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r </RegistrationInfo>\r <Triggers>\r <TimeTrigger>\r <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r <Enabled>true</Enabled>\r </TimeTrigger>\r </Triggers>\r <Settings>\r <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r <AllowHardTerminate>true</AllowHardTerminate>\r <StartWhenAvailable>false</StartWhenAvailable>\r <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r <IdleSettings>\r <Duration>PT10M</Duration>\r <WaitTimeout>PT1H</WaitTimeout>\r <StopOnIdleEnd>true</StopOnIdleEnd>\r <RestartOnIdle>false</RestartOnIdle>\r </IdleSettings>\r <AllowStartOnDemand>true</AllowStartOnDemand>\r <Enabled>true</Enabled>\r <Hidden>false</Hidden>\r <RunOnlyIfIdle>false</RunOnlyIfIdle>\r <WakeToRun>false</WakeToRun>\r <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r <Priority>7</Priority>\r </Settings>\r <Actions Context=\"Author\">\r <Exec>\r <Command>%systemroot%\\system32\\usoclient.exe</Command>\r <Arguments>StartUWork</Arguments>\r </Exec>\r </Actions>\r <Principals>\r <Principal id=\"Author\">\r 
<UserId>S-1-5-18</UserId>\r <RunLevel>LeastPrivilege</RunLevel>\r </Principal>\r </Principals>\r </Task>\"}
So far, I have tried spark.read.json and the net.liftweb library, but to no avail.
Any kind of help is highly appreciated.
The JSON you are getting may not be valid. If it is valid, the TaskContent element holds XML whose tags carry quoted attributes, and I think that is what causes the issue. My idea is to remove the double quotes from the XML attribute values and then parse. Alternatively, you could replace those double quotes with a placeholder value and, once TaskContent is a DataFrame column, replace the placeholder again to recover the original content.
This might not be the perfect or most efficient answer, but given how you receive the JSON, and provided its structure stays the same, you could do the following:
1. Convert the JSON that you have to a string.
2. Do some replaceAll operations on the string to turn it into valid JSON.
3. Read the JSON into a DataFrame.
//Source data copied from Question
val json = """{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r <RegistrationInfo>\r <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r </RegistrationInfo>\r <Triggers>\r <TimeTrigger>\r <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r <Enabled>true</Enabled>\r </TimeTrigger>\r </Triggers>\r <Settings>\r <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r <AllowHardTerminate>true</AllowHardTerminate>\r <StartWhenAvailable>false</StartWhenAvailable>\r <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r <IdleSettings>\r <Duration>PT10M</Duration>\r <WaitTimeout>PT1H</WaitTimeout>\r <StopOnIdleEnd>true</StopOnIdleEnd>\r <RestartOnIdle>false</RestartOnIdle>\r </IdleSettings>\r <AllowStartOnDemand>true</AllowStartOnDemand>\r <Enabled>true</Enabled>\r <Hidden>false</Hidden>\r <RunOnlyIfIdle>false</RunOnlyIfIdle>\r <WakeToRun>false</WakeToRun>\r <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r <Priority>7</Priority>\r </Settings>\r <Actions Context=\"Author\">\r <Exec>\r <Command>%systemroot%\\system32\\usoclient.exe</Command>\r <Arguments>StartUWork</Arguments>\r </Exec>\r </Actions>\r <Principals>\r <Principal 
id=\"Author\">\r <UserId>S-1-5-18</UserId>\r <RunLevel>LeastPrivilege</RunLevel>\r </Principal>\r </Principals>\r </Task>\"}"""
//Modifying json to make it valid
val modifiedJson = json.replaceAll("\\\\\\\\","#").replaceAll("\\\\r","").replaceAll("\\\\","").replaceAll(" ","").replaceAll(" ","").replaceAll("> <","><").replaceAll("=\"","=").replaceAll("\">",">").replaceAll("#","\\\\\\\\").replaceAll("1.0\"","1.0").replaceAll("UTF-16\"?","UTF-16").replaceAll("1.2\"","1.2")
//creating a dataset out of json String (createDataset needs the implicit String encoder)
import spark.implicits._
val ds = spark.createDataset(modifiedJson :: Nil)
//reading the dataset as json
val df = spark.read.json(ds)
You could optimize this to run more efficiently, but this is how I made it work.
You can replace the escaped quotes \" in the JSON with " (except for those inside the XML content) using the regexp_replace function, then read the result into a DataFrame:
val jsonString = """
{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r <RegistrationInfo>\r <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r </RegistrationInfo>\r <Triggers>\r <TimeTrigger>\r <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r <Enabled>true</Enabled>\r </TimeTrigger>\r </Triggers>\r <Settings>\r <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r <AllowHardTerminate>true</AllowHardTerminate>\r <StartWhenAvailable>false</StartWhenAvailable>\r <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r <IdleSettings>\r <Duration>PT10M</Duration>\r <WaitTimeout>PT1H</WaitTimeout>\r <StopOnIdleEnd>true</StopOnIdleEnd>\r <RestartOnIdle>false</RestartOnIdle>\r </IdleSettings>\r <AllowStartOnDemand>true</AllowStartOnDemand>\r <Enabled>true</Enabled>\r <Hidden>false</Hidden>\r <RunOnlyIfIdle>false</RunOnlyIfIdle>\r <WakeToRun>false</WakeToRun>\r <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r <Priority>7</Priority>\r </Settings>\r <Actions Context=\"Author\">\r <Exec>\r <Command>%systemroot%\\system32\\usoclient.exe</Command>\r <Arguments>StartUWork</Arguments>\r </Exec>\r </Actions>\r <Principals>\r <Principal id=\"Author\">\r 
<UserId>S-1-5-18</UserId>\r <RunLevel>LeastPrivilege</RunLevel>\r </Principal>\r </Principals>\r </Task>\"}
"""
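// assumes a SparkSession named spark is in scope
import org.apache.spark.sql.functions.regexp_replace
import spark.implicits._

val df = spark.read.json(
  Seq(jsonString).toDS
    .withColumn("value", regexp_replace($"value", """([:\[,{]\s*)\\"(.*?)\\"(?=\s*[:,\]}])""", "$1\"$2\""))
    .as[String]
)
df.show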
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
//| Category| Channel| Computer|EventID|EventRecordID| Provider|SubjectDomainName|SubjectLogonId|SubjectUserName|SubjectUserSid| TaskContent| TaskName| TimeCreated|Version|
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
//|AUDIT_SUCCESS|Security|WIN-10.atfdetonat...| 4698| 12956650|Microsoft-Windows...| ATFDETONATE| 0x3e7| WIN-10$| S-1-5-18|<?xml version="1....|\Microsoft\Window...|2021-01-09T04:29:...| 1|
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
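To see what that regular expression does without a Spark session, here is a small sketch of the same pattern applied with Python's re module. The input is a made-up, much shorter event (an assumption for illustration), and OUTER_QUOTES is a hypothetical name. Only the \" pairs that sit next to JSON structure are unescaped; those inside the XML stay escaped, so the result parses.

```python
import json
import re

# Same pattern as in the answer: an escaped quote that follows : [ , or {,
# lazily matched up to an escaped quote that precedes : , ] or }.
OUTER_QUOTES = re.compile(r'([:\[,{]\s*)\\"(.*?)\\"(?=\s*[:,\]}])')

# Raw string literal, so the backslashes are literal characters, as in the log line.
raw = r'{\"EventID\":\"4698\",\"TaskContent\":\"<Task version=\"1.2\">\r</Task>\"}'

fixed = OUTER_QUOTES.sub(r'\1"\2"', raw)
event = json.loads(fixed)

print(event["EventID"])
print(event["TaskContent"])
```

The lazy match relies on no XML attribute value being directly followed by a comma, colon, ] or }, which holds for the sample in the question but is not guaranteed in general.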

python RegEx json Double quotes

I scraped some JSON with a spider, but it is malformed: in each name:value pair, the name is missing its double quotes. Like this:
{ listInfo:[{title:'it is title',url:'http://test.com',imgurl:'http://test.com',imgurl2:'',abstract:'',source:'',pubtime:'2019-05-22 11:47:24'},{title:'xx',url:'http://test.com/1.htm',imgurl:'http://test.com',imgurl2:'',abstract:'',source:'',pubtime:'2019-05-22 07:54:46'}]}
I want to add double quotes around each name, and I need to exclude strings such as http, { and so on:
{ "listInfo":[{"title":'it is "title"',"url":'http://test.com',...
I tried this, but it does not work:
#(.*?)\:(.*?)\n'
pattern = re.compile(r'^((?![http]\").)*\:(.*?)\n', re.MULTILINE )
content = content.replace(pattern.search(content).group(1),'\"'+pattern.search(content).group(1).strip()+'\"')
I also tried the approach from
"How to add double quotes to the dictionary?":
import re

content = '''
{ listInfo:[{title:'it is title',url:'http://test.com',
imgurl:'http://test.com',imgurl2:'',abstract:'',source:'',
pubtime:'2019-05-22 11:47:24'},{title:'xx',url:'http://test.com/1.htm',
imgurl:'http://test.com',imgurl2:'',abstract:'',source:'',pubtime:'2019-05-22 07:54:46'}]}
'''
# dict_str = lambda data : re.sub(r'(\w+):\s*(-?\d[\d/.]*)',r'"\1": "\2"',data)
dict_str = lambda data : re.sub(r'(\w+):(.*?)\n',r'"\1": "\2"',data)
for i in [content] :
    var1=dict_str(i)
    print(var1)
The result looks like this:
{ "listInfo": "[{title:'it is title',url:'http://test.com',""imgurl": "'http://test.com',imgurl2:'',abstract:'',source:'',""pubtime": "'2019-05-22 11:47:24'},{title:'xx',url:'http://test.com/1.htm',""imgurl": "'http://test.com',imgurl2:'',abstract:'',source:'',pubtime:'2019-05-22 07:54:46'}]}"
Does anyone know how to write the regex?
Thanks!
I ended up solving it with a cruder method, replacing each key by hand:
script = script.replace('abstract','\"abstract\"')
...
:(
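For what it's worth, here is one possible way to avoid listing every key by hand. It is a minimal sketch, not a general parser for JavaScript-style objects. A single regex matches either a whole single-quoted value or a bare name followed by a colon, so names are only quoted outside of values and the http in a URL is never touched. TOKEN and to_json_token are hypothetical names, the input is a shortened copy of the scraped text, and escaped quotes inside values are not handled.

```python
import json
import re

# Alternative 1: a complete single-quoted string (group 1 is its content).
# Alternative 2: a bare name that is directly followed by a colon (group 2).
# Scanning left to right, a quoted value is consumed as one token, so a
# colon inside it (as in http://...) can never be mistaken for a key.
TOKEN = re.compile(r"'((?:[^'\\]|\\.)*)'|([A-Za-z_]\w*)(?=\s*:)")


def to_json_token(match):
    if match.group(2) is not None:
        return '"' + match.group(2) + '"'   # bare key -> "key"
    return json.dumps(match.group(1))       # 'value' -> "value"


content = ("{ listInfo:[{title:'it is title',url:'http://test.com',"
           "imgurl2:'',pubtime:'2019-05-22 11:47:24'}]}")

data = json.loads(TOKEN.sub(to_json_token, content))
print(data["listInfo"][0]["url"])
```

Using json.dumps for the values means that a double quote inside a single-quoted value comes out properly escaped.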

How to convert JSON Array String to Set in java

I have a JSON array string like [1,2]. I want to convert this into a Set. How can I do it in Java 8?
This is my code:
String languageStr = rs.getString("languages");
jobseeker.setLanguageIds(StringUtils.isEmpty(languageStr) ? null
: Arrays.stream(languageStr.split(","))
.map(Integer::parseInt)
.collect(Collectors.toSet()));
I get this error:
java.lang.NumberFormatException: For input string: " 2"
The space in the JSON array is the problem. Is there any solution?
This is my code after the changes:
String languageStr = rs.getString("languages");
String languages=languageStr.substring(1,languageStr.length()-1);
jobseeker.setLanguageIds(StringUtils.isEmpty(languages) ? null
: Arrays.stream(languages.split(","))
.map(String::trim)
.map(Integer::parseInt)
.collect(Collectors.toSet()));
Can I get the same output some other way, without using these two steps?
languages=languageStr.substring(1,languageStr.length()-1);
.map(String::trim)
You can use the trim method to remove leading and trailing whitespace before parsing each element as an Integer.
So your code becomes:
Arrays.stream(languageStr.split(","))
.map(String::trim)
.map(Integer::parseInt)
I finally found a solution.
I changed the code like this:
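// Assumes mapper is a Jackson ObjectMapper and typeFactory is mapper.getTypeFactory().
String languageStr = rs.getString("languages");
// Check for null/empty before parsing, so readValue never sees a null string.
Set<Integer> languages = StringUtils.isEmpty(languageStr) ? null
        : mapper.readValue(languageStr, typeFactory.constructCollectionType(Set.class, Integer.class));
jobseeker.setLanguageIds(languages);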
Using a TypeToken and the Google Gson library, you should be able to do that like below:
String languageJsonStr = rs.getString("languages");
Set<Integer> myLanguageSet = new Gson().fromJson(languageJsonStr, new TypeToken<HashSet<Integer>>(){}.getType());
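The idea shared by the Jackson and Gson answers is to let a JSON parser handle the brackets and whitespace instead of splitting the string by hand. As a language-neutral illustration, here is the same idea as a tiny Python sketch; parse_id_set is a hypothetical helper name, and returning None for an empty input mirrors the null branch in the question.

```python
import json


def parse_id_set(language_str):
    """Parse a JSON array string such as "[1, 2]" into a set of ints."""
    if not language_str:
        return None  # mirrors the null branch in the question
    return set(json.loads(language_str))


print(parse_id_set("[1, 2]"))
```

No substring or trim step is needed, because the parser already ignores whitespace between array elements.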

Indicating a space in Groovy Property

I'm new to Groovy. I am using the Groovy CSV parser to read in some data, but have run into an issue because some of the CSV file headers have spaces. Is there a way to specify a space character in a Groovy property? Example:
def csv = new File("data.csv").text
def data = new CsvParser().parse(csv)
for(line in data) {
println "$line.First Name $line.Last Name"
}
CSV:
Last Name, First Name
Smith,John
Jones,Sally
This fails due to the extra space characters. (Yes, I could change the CSV file, but that's a last resort.)
Try:
println "${line.'First Name'} ${line.'Last Name'}"
Or, an alternative syntax:
println "${line[ 'First Name' ]} ${line[ 'Last Name' ]}"
You can use the [] operator to look up the property:
"${line['First Name']}"