I have a small Spark 3.x cluster setup. I have read some data, and after transformations I have to save it as JSON. The problem I am facing is that, in array-type columns, Spark adds extra double quotes when the data is written as a JSON file.
Sample data-frame data
I am saving this dataframe as JSON with the following command:
df.write.json("Documents/abc")
The saved output is as follows
Finally, the schema info is as follows
The elements of the string array contain double quotes within the data, e.g. the first element is "Saddar Cantt, Lahore Punjab Pakistan" (with quotes) instead of Saddar Cantt, Lahore Punjab Pakistan. You can remove the extra double quotes from the strings before writing the JSON, using transform together with replace:
df.withColumn("ADDRESS", F.expr("""transform(ADDRESS, a -> replace(a, '"'))""")) \
.write.json("Documents/abc")
If we enforce a schema before writing the dataframe as JSON, I believe we can work around such issues without having to replace or change any characters.
Without schema ->
df.show()
id | address
1 | ['Saddar Cantt, Lahore Punjab Pakistan', 'Shahpur']
df.write.json("path")
{"id":"1","address":["'Saddar Cantt, Lahore Punjab Pakistan'","'Shahpur'"]}
With schema ->
df = df.withColumn('address', F.col('address').cast(StringType()))
df = df.withColumn('address', F.from_json(F.col('address'), ArrayType(StringType())))
df.write.json("path")
{"id":"1","address":["Saddar Cantt, Lahore Punjab Pakistan","Shahpur"]}
from_json only takes a string as input, hence we need to cast the array to a string first.
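Putting it together, a minimal sketch of this workaround (using a hypothetical dataframe that mirrors the data above) could look like this:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: the array elements still carry literal double quotes.
df = spark.createDataFrame(
    [("1", ['"Saddar Cantt, Lahore Punjab Pakistan"', '"Shahpur"'])],
    ["id", "address"],
)

# Casting array<string> to string yields e.g. ["Saddar Cantt, ...", "Shahpur"],
# which from_json can parse back into a clean array without the extra quotes.
df = df.withColumn("address", F.col("address").cast(StringType()))
df = df.withColumn("address", F.from_json(F.col("address"), ArrayType(StringType())))
df.write.mode("overwrite").json("path")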
I'm new to Python 3 and I am working with large JSON objects. I have a large JSON payload that has extra characters coming in between two JSON objects, i.e. between the closing brace of one and the opening brace of the next.
For example:
{"id":"121324343", "name":"foobar"}3$£_$£rvcfddkgga£($(>..bu&^783 { "id":"343554353", "name":"ABCXYZ"}'
These extra characters could be anything: alphanumeric, special characters, or other ASCII. They appear multiple times in this large JSON and can be of any length. I'm trying to use a regex to identify that pattern so I can remove them, but my regex doesn't seem to work. Here is the regex I used:
(^}\n[a-zA-Z0-9]+{$)
Is there a way of identifying such a pattern using regex in Python?
You can select the dictionary data based on named capture groups. As a bonus, this will also ignore any { or } within the extra chars.
The following pattern works on the provided data:
"\"id\"\:\"(?P<id>\d+?)\"[,\s]+\"name\"\:\"(?P<name>[ \w]+)\""
Example
import re
from pprint import pprint
string = \
"""
{"id":"121324343", "name":"foobar"}3$£_$£rvcfdd{}kgga£($(>..bu&^783 { "id":"343554353", "name":"ABC XYZ"}'
"""
pattern = re.compile(pattern=r"\"id\"\:\"(?P<id>\d+?)\"[,\s]+\"name\"\:\"(?P<name>[ \w]+)\"")  # raw string avoids invalid-escape warnings
pprint([match.groupdict() for match in pattern.finditer(string=string)])
Output
[{'id': '121324343', 'name': 'foobar'}, {'id': '343554353', 'name': 'ABC XYZ'}]
Test it out yourself: https://regex101.com/r/82BqbE/1
Notes
For this example I assume the following:
id only contains integer digits.
name is a string that can contain the characters [a-zA-Z0-9_ ] (this includes spaces and underscores).
Assuming the whole JSON is on a single line, and there are no }{ inside the fields themselves, this should be enough:
In [1]: import re
In [2]: x = """{"id":"121324343", "name":"foobar"}3$£_$£rvcfddkgga£($(>..bu&^783 { "id":"343554353", "name":"ABCXYZ"}"""
In [3]: print(re.sub(r'(?<=})[^}{]+(?={)', "\n", x))
{"id":"121324343", "name":"foobar"}
{ "id":"343554353", "name":"ABCXYZ"}
You can check the regex here https://regex101.com/r/leIoqE/1
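If the end goal is Python dicts rather than a cleaned-up string, a small follow-up (reusing the same sample string) can split on the inserted newline and json.loads each piece:
import json
import re

x = """{"id":"121324343", "name":"foobar"}3$£_$£rvcfddkgga£($(>..bu&^783 { "id":"343554353", "name":"ABCXYZ"}"""

# Replace the junk between } and { with a newline, then parse each object.
cleaned = re.sub(r'(?<=})[^}{]+(?={)', "\n", x)
objects = [json.loads(part) for part in cleaned.splitlines() if part.strip()]
print(objects)
# [{'id': '121324343', 'name': 'foobar'}, {'id': '343554353', 'name': 'ABCXYZ'}]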
I am trying to parse the string given below as a JSON using Scala, but haven't been able to do so because of the escaped quotes \" occurring in fields.
{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r <RegistrationInfo>\r <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r </RegistrationInfo>\r <Triggers>\r <TimeTrigger>\r <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r <Enabled>true</Enabled>\r </TimeTrigger>\r </Triggers>\r <Settings>\r <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r <AllowHardTerminate>true</AllowHardTerminate>\r <StartWhenAvailable>false</StartWhenAvailable>\r <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r <IdleSettings>\r <Duration>PT10M</Duration>\r <WaitTimeout>PT1H</WaitTimeout>\r <StopOnIdleEnd>true</StopOnIdleEnd>\r <RestartOnIdle>false</RestartOnIdle>\r </IdleSettings>\r <AllowStartOnDemand>true</AllowStartOnDemand>\r <Enabled>true</Enabled>\r <Hidden>false</Hidden>\r <RunOnlyIfIdle>false</RunOnlyIfIdle>\r <WakeToRun>false</WakeToRun>\r <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r <Priority>7</Priority>\r </Settings>\r <Actions Context=\"Author\">\r <Exec>\r <Command>%systemroot%\\system32\\usoclient.exe</Command>\r <Arguments>StartUWork</Arguments>\r </Exec>\r </Actions>\r <Principals>\r <Principal id=\"Author\">\r <UserId>S-1-5-18</UserId>\r <RunLevel>LeastPrivilege</RunLevel>\r </Principal>\r </Principals>\r </Task>\"}
So far, I have tried spark.read.json and the net.liftweb library, but to no avail.
Any kind of help is highly appreciated.
The JSON output that you are getting might not be valid JSON, or, if it is valid, the XML content in the TaskContent element (with its quoted tag attributes) is probably what is causing the issue. The idea I have is to remove the double quotes from the XML attribute values and then parse. You could replace those double quotes with some specific placeholder value, and once you have TaskContent as a dataframe column you can replace it back to get the original content.
This might not be the perfect or most efficient answer, but based on how you are getting the JSON, and assuming its structure stays the same, you could do something like the following:
Convert the JSON that you have to a string.
Do some replaceAll operations on the string to make it look like valid JSON.
Read the JSON into a DataFrame.
//Source data copied from Question
val json = """{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r <RegistrationInfo>\r <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r </RegistrationInfo>\r <Triggers>\r <TimeTrigger>\r <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r <Enabled>true</Enabled>\r </TimeTrigger>\r </Triggers>\r <Settings>\r <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r <AllowHardTerminate>true</AllowHardTerminate>\r <StartWhenAvailable>false</StartWhenAvailable>\r <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r <IdleSettings>\r <Duration>PT10M</Duration>\r <WaitTimeout>PT1H</WaitTimeout>\r <StopOnIdleEnd>true</StopOnIdleEnd>\r <RestartOnIdle>false</RestartOnIdle>\r </IdleSettings>\r <AllowStartOnDemand>true</AllowStartOnDemand>\r <Enabled>true</Enabled>\r <Hidden>false</Hidden>\r <RunOnlyIfIdle>false</RunOnlyIfIdle>\r <WakeToRun>false</WakeToRun>\r <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r <Priority>7</Priority>\r </Settings>\r <Actions Context=\"Author\">\r <Exec>\r <Command>%systemroot%\\system32\\usoclient.exe</Command>\r <Arguments>StartUWork</Arguments>\r </Exec>\r </Actions>\r <Principals>\r <Principal id=\"Author\">\r <UserId>S-1-5-18</UserId>\r <RunLevel>LeastPrivilege</RunLevel>\r </Principal>\r </Principals>\r </Task>\"}"""
//Modifying json to make it valid
val modifiedJson = json.replaceAll("\\\\\\\\","#").replaceAll("\\\\r","").replaceAll("\\\\","").replaceAll(" ","").replaceAll(" ","").replaceAll("> <","><").replaceAll("=\"","=").replaceAll("\">",">").replaceAll("#","\\\\\\\\").replaceAll("1.0\"","1.0").replaceAll("UTF-16\"?","UTF-16").replaceAll("1.2\"","1.2")
//creating a dataset out of json String
val ds = spark.createDataset(modifiedJson :: Nil)
//reading the dataset as json
val df = spark.read.json(ds)
You can see the output below.
You could optimize this to work more efficiently, but this is how I made it work.
You can replace the escaped quotes \" in the JSON with " (except for those inside the XML content) using the regexp_replace function, then read the result into a dataframe:
val jsonString = """
{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r <RegistrationInfo>\r <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r </RegistrationInfo>\r <Triggers>\r <TimeTrigger>\r <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r <Enabled>true</Enabled>\r </TimeTrigger>\r </Triggers>\r <Settings>\r <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r <AllowHardTerminate>true</AllowHardTerminate>\r <StartWhenAvailable>false</StartWhenAvailable>\r <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r <IdleSettings>\r <Duration>PT10M</Duration>\r <WaitTimeout>PT1H</WaitTimeout>\r <StopOnIdleEnd>true</StopOnIdleEnd>\r <RestartOnIdle>false</RestartOnIdle>\r </IdleSettings>\r <AllowStartOnDemand>true</AllowStartOnDemand>\r <Enabled>true</Enabled>\r <Hidden>false</Hidden>\r <RunOnlyIfIdle>false</RunOnlyIfIdle>\r <WakeToRun>false</WakeToRun>\r <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r <Priority>7</Priority>\r </Settings>\r <Actions Context=\"Author\">\r <Exec>\r <Command>%systemroot%\\system32\\usoclient.exe</Command>\r <Arguments>StartUWork</Arguments>\r </Exec>\r </Actions>\r <Principals>\r <Principal id=\"Author\">\r <UserId>S-1-5-18</UserId>\r <RunLevel>LeastPrivilege</RunLevel>\r </Principal>\r </Principals>\r </Task>\"}
"""
val df = spark.read.json(
Seq(jsonString).toDS
.withColumn("value", regexp_replace($"value", """([:\[,{]\s*)\\"(.*?)\\"(?=\s*[:,\]}])""", "$1\"$2\""))
.as[String]
)
df.show
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
//| Category| Channel| Computer|EventID|EventRecordID| Provider|SubjectDomainName|SubjectLogonId|SubjectUserName|SubjectUserSid| TaskContent| TaskName| TimeCreated|Version|
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
//|AUDIT_SUCCESS|Security|WIN-10.atfdetonat...| 4698| 12956650|Microsoft-Windows...| ATFDETONATE| 0x3e7| WIN-10$| S-1-5-18|<?xml version="1....|\Microsoft\Window...|2021-01-09T04:29:...| 1|
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
I'd like to parse the JSON output from an IEX Cloud stock quote query: https://cloud.iexapis.com/stable/stock/aapl/quote?token=YOUR_TOKEN_HERE
I have tried to use Regex101 to solve the issue:
https://regex101.com/r/y8i01T/1/
Here is the regex expression that I tried:
"([^"]+)":"?([^",\s]+)
Here is an example of an IEX Cloud stock quote output for Apple:
{
"symbol":"AAPL",
"companyName":"Apple, Inc.",
"calculationPrice":"close",
"open":204.86,
"openTime":1556285400914,
"close":204.3,
"closeTime":1556308800303,
"high":205,
"low":202.12,
"latestPrice":204.3,
"latestSource":"Close",
"latestTime":"April 26, 2019",
"latestUpdate":1556308800303,
"latestVolume":18604306,
"iexRealtimePrice":204.34,
"iexRealtimeSize":48,
"iexLastUpdated":1556308799763,
"delayedPrice":204.3,
"delayedPriceTime":1556308800303,
"extendedPrice":204.46,
"extendedChange":0.16,
"extendedChangePercent":0.00078,
"extendedPriceTime":1556310657637,
"previousClose":205.28,
"change":-0.98,
"changePercent":-0.00477,
"iexMarketPercent":0.030716437366704246,
"iexVolume":571458,
"avgTotalVolume":27717780,
"iexBidPrice":0,
"iexBidSize":0,
"iexAskPrice":0,
"iexAskSize":0,
"marketCap":963331704000,
"peRatio":16.65,
"week52High":233.47,
"week52Low":142,
"ytdChange":0.29512900000000003
}
I want to save the key-value pairs from the JSON response without quotes around the keys, capturing each value starting after the colon (:). I need to exclude the quotes around text values and the comma at the end of each line, and also handle the last key-value pair, which has no trailing comma.
For example, "peRatio":16.65, should have the key equal to peRatio and the value equal to 16.65. Or another example, "changePercent":-0.00477, should have a key equal to changePercent and a value of -0.00477. If it's a text such as "companyName":"Apple, Inc.",, it should have a key equal to companyName and a value equal to Apple, Inc.
Also, the last JSON key-value entry: "ytdChange":0.29512900000000003 does not have a comma and that needs to be accounted for.
You most likely do not need to parse your data using regex. However, if you wish or have to do so, perhaps to practice regular expressions, you can define a few boundaries in your expression.
This regex might help you do that; it divides your input JSON values into three categories: string values, numeric values, and the last value with no trailing comma:
"([^"]+)":("(.+)"|(.+))(,{1}|\n\})
Then, you can use the \n} boundary for the last value, the " boundary for your string values, and no boundary for numeric values.
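A rough Python sketch of applying that pattern (payload here is a hypothetical, shortened copy of the quote above; group 3 holds quoted string values and group 4 holds bare numeric values):
import re

payload = '''{
"symbol":"AAPL",
"companyName":"Apple, Inc.",
"peRatio":16.65,
"changePercent":-0.00477,
"ytdChange":0.29512900000000003
}'''

pattern = re.compile(r'"([^"]+)":("(.+)"|(.+))(,{1}|\n\})')
quote = {}
for m in pattern.finditer(payload):
    # group 3 is set for quoted (string) values, group 4 for bare numeric values
    quote[m.group(1)] = m.group(3) if m.group(3) is not None else m.group(4)
print(quote)
# Note: every value comes out as a string; json.loads(payload) would give typed values directly.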
I am new to Spark, and want to read a log file and create a dataframe out of it. My data is half JSON, and I cannot convert it into a dataframe properly. Below is the first row in the file:
[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}
The first part is plain text and the last part between { } is JSON. I tried a few things: converting it first to an RDD, then map and split, then converting back to a DataFrame, but I cannot extract the values from the JSON part of the row. Is there a trick to extract the fields in this context?
The final output will be like:
TimeStamp userid ip artist album song id service
2017-01-06 07:00:01 444444 11.11.111.0 Tears For Fears Songs From The Big Chair Everybody Wants To Rule The World S4555 pandora
You just need to parse out the pieces with a Python UDF into a tuple, then tell Spark to convert the RDD to a dataframe. The easiest way to do this is probably a regular expression. For example:
import re
import json
def parse(row):
    pattern = ' '.join([
        r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\]',
        r'userid:(?P<userid>\d+)',
        r'(?P<ip>\d+\.\d+\.\d+\.\d+)',
        r'(?P<level>\w+)',
        r'(?P<json>.+$)'
    ])
    match = re.match(pattern, row)
    parsed_json = json.loads(match.group('json'))
    return (match.group('ts'), match.group('userid'), match.group('ip'),
            match.group('level'), parsed_json['artist'], parsed_json['song'],
            parsed_json['service'])
lines = [
'[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}'
]
rdd = sc.parallelize(lines)
df = rdd.map(parse).toDF(['ts', 'userid', 'ip', 'level', 'artist', 'song', 'service'])
df.show()
This prints
+-------------------+------+-----------+-----+---------------+--------------------+-------+
| ts|userid| ip|level| artist| song|service|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
|2017-01-06 07:00:01|444444|11.11.111.0| info|Tears For Fears|Everybody Wants T...|pandora|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
I have used the following, just some parsing utilizing PySpark's power:
from pyspark.sql.types import StructField, StringType

parts = r1.map(lambda x: x.value.replace('[', '').replace('] ', '###')
               .replace(' userid:', '###').replace('null', '"null"').replace('""', '"NA"')
               .replace(' music_info {"artist":"', '###').replace('","album":"', '###')
               .replace('","song":"', '###').replace('","id":"', '###')
               .replace('","service":"', '###').replace('"}', '###').split('###'))
people = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]))
schemaString = "timestamp mac userid_ip artist album song id service"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
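The snippet stops just before the last step. Assuming the spark session and the people and fields variables from above, the DataFrame was presumably built along these lines:
from pyspark.sql.types import StructType

# Hypothetical completion (not shown in the original post): assemble the schema
# from `fields` and turn the RDD of tuples into a DataFrame.
schema = StructType(fields)
df = spark.createDataFrame(people, schema)
df.show()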
With this I got almost what I want, and performance was super fast.
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
| timestamp| mac| userid_ip| artist| album| song| id|service|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|2017-01-01 00:00:00|00:00:00:00:00:00|111122 22.235.17...|The United States...| This Is Christmas!|Do You Hear What ...| S1112536|pandora|
|2017-01-01 00:00:00|00:11:11:11:11:11|123123 108.252.2...| NA| Dinner Party Radio| NA| null|pandora|
In Flink, parsing a CSV file using readCsvFile raises an exception when encountering a field containing quotes like "Fazenda São José ""OB"" Airport":
org.apache.flink.api.common.io.ParseException: Line could not be parsed: '191,"SDOB","small_airport","Fazenda São José ""OB"" Airport",-21.425199508666992,-46.75429916381836,2585,"SA","BR","BR-SP","Tapiratiba","no","SDOB",,"SDOB",,,'
I've found in this mailing list thread and this JIRA issue that quotes inside a field should be escaped with the \ character, but I don't have control over the data to modify it. Is there a way to work around this?
I've also tried using ignoreInvalidLines() (which is the less preferable solution) but it gave me the following error:
08:49:05,737 INFO org.apache.flink.api.common.io.LocatableInputSplitAssigner - Assigning remote split to host localhost
08:49:05,765 ERROR org.apache.flink.runtime.operators.BatchTask - Error in task code: CHAIN DataSource (at main(Job.java:53) (org.apache.flink.api.java.io.TupleCsvInputFormat)) -> Map (Map at main(Job.java:54)) -> Combine(SUM(1), at main(Job.java:56) (2/8)
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.flink.api.common.io.GenericCsvInputFormat.skipFields(GenericCsvInputFormat.java:443)
at org.apache.flink.api.common.io.GenericCsvInputFormat.parseRecord(GenericCsvInputFormat.java:412)
at org.apache.flink.api.java.io.CsvInputFormat.readRecord(CsvInputFormat.java:111)
at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:454)
at org.apache.flink.api.java.io.CsvInputFormat.nextRecord(CsvInputFormat.java:79)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:176)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Here is my code:
DataSet<Tuple2<String, Integer>> csvInput = env.readCsvFile("resources/airports.csv")
.ignoreFirstLine()
.ignoreInvalidLines()
.parseQuotedStrings('"')
.includeFields("100000001")
.types(String.class, String.class)
.map((Tuple2<String, String> value) -> new Tuple2<>(value.f1, 1))
.groupBy(0)
.sum(1);
If you cannot change the input data, then you should turn off parseQuotedStrings(). This will simply look for the next field delimiter and return everything in between as a string (including the quotation marks). Then you can remove the leading and trailing quotation marks in a subsequent map operation.