KQL | How do I extract or check for data in a long string with many quotation marks?

I'm a super newbie to KQL and to data in general.
I'm working with a data column with long strings like this:
"data": {"stageID":1670839857060,"entities":[{"entity":{"key":"BearKnight","owner":0,"id":"[2|1]"},"levels":{"main":4,"star":1,"ShieldWall.main":4,"ShieldWall.enhance":0,"ShieldThrow.main":4,"ShieldThrow.enhance":0}},{"entity":{"key":"DryadHealer","owner":0,"id":"[3|1]"},"levels":{"main":5,"star":1,"HealingTouch.main":5,"HealingTouch.enhance":0,"CuringTouch.main":5,"CuringTouch.enhance":0}},{"entity":{"key":"HumanKnight","owner":1,"id":"[4|1]"},"levels":{"main":4,"star":0,"BullRush.main":4,"BullRush.enhance":0,"FinishingStrike.main":4,"FinishingStrike.enhance":0,"SwordThrow.main":4,"SwordThrow.enhance":0,"StrongAttack.main":0,"StrongAttack.enhance":0}},
I need to get a list of the HeroNames that appear as [ "key":"HeroName","owner":0 ] but not as [ "key":"HeroName","owner":1 ].
I've been trying the extract_all and has_any functions, but I can't work with the data while it has all the quotation marks. Can I parse this somehow and remove them?
My ideal output would be a list of hero names who have owner:0.
For example, for the top string the ideal output is: "BearKnight","DryadHealer"

print txt = 'data: {"stageID":1670839857060,"entities":[{"entity":{"key":"BearKnight","owner":0,"id":"[2|1]"},"levels":{"main":4,"star":1,"ShieldWall.main":4,"ShieldWall.enhance":0,"ShieldThrow.main":4,"ShieldThrow.enhance":0}},{"entity":{"key":"DryadHealer","owner":0,"id":"[3|1]"},"levels":{"main":5,"star":1,"HealingTouch.main":5,"HealingTouch.enhance":0,"CuringTouch.main":5,"CuringTouch.enhance":0}},{"entity":{"key":"HumanKnight","owner":1,"id":"[4|1]"},"levels":{"main":4,"star":0,"BullRush.main":4,"BullRush.enhance":0,"FinishingStrike.main":4,"FinishingStrike.enhance":0,"SwordThrow.main":4,"SwordThrow.enhance":0,"StrongAttack.main":0,"StrongAttack.enhance":0}}]}'
| parse txt with * ": " doc
| mv-apply e = parse_json(doc).entities on (where e.entity.owner == 0 | summarize HeroNames = make_list(e.entity.key))
| project-away txt, doc
HeroNames
["BearKnight","DryadHealer"]
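For anyone who wants to sanity-check the pipeline outside Kusto, the same logic can be sketched in plain Python (illustrative only; the `txt` sample below is a trimmed version of the string above, and the split mirrors what `parse txt with * ": " doc` does):

```python
import json

txt = ('data: {"stageID":1670839857060,"entities":['
       '{"entity":{"key":"BearKnight","owner":0,"id":"[2|1]"},"levels":{"main":4}},'
       '{"entity":{"key":"DryadHealer","owner":0,"id":"[3|1]"},"levels":{"main":5}},'
       '{"entity":{"key":"HumanKnight","owner":1,"id":"[4|1]"},"levels":{"main":4}}]}')

# Equivalent of `parse txt with * ": " doc`: keep everything after the first ": ".
doc = txt.split(": ", 1)[1]

# Equivalent of the mv-apply + where + make_list pipeline: keep keys whose owner is 0.
heroes = [e["entity"]["key"] for e in json.loads(doc)["entities"]
          if e["entity"]["owner"] == 0]
print(heroes)  # ['BearKnight', 'DryadHealer']
```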

Related

Convert string column to json and parse in pyspark

My dataframe looks like
| ID | Notes                              |
|----|------------------------------------|
| 1  | '{"Country":"USA","Count":"1000"}' |
| 2  | {"Country":"USA","Count":"1000"}   |
ID : int
Notes : string
When I use from_json to parse the Notes column, it gives all NULL values.
I need help parsing this Notes column into separate columns in PySpark.
When you use the from_json() function, make sure the column value is exactly JSON/dictionary content in string format. In the sample data you've given, the Notes value with ID=1 is not valid JSON on its own: it is a JSON string wrapped in additional single quotes. That is why from_json() returns NULL values. You can parse the column like this:
from pyspark.sql.functions import from_json
from pyspark.sql.types import MapType, StringType

df = df.withColumn("Notes", from_json(df.Notes, MapType(StringType(), StringType())))
You need to make the entire Notes column consistent: JSON/dictionary content as a plain string and nothing more, since the extra quoting is the root cause of the issue. The following is the corrected input format:
| ID | Notes                            |
|----|----------------------------------|
| 1  | {"Country":"USA","Count":"1000"} |
| 2  | {"Country":"USA","Count":"1000"} |
To parse the Notes column values into columns in PySpark, you can simply use the json_tuple() function (no need for from_json()). It extracts elements from a JSON column (in string format) and returns them as new columns.
from pyspark.sql.functions import col, json_tuple

df = df.select(col("id"), json_tuple(col("Notes"), "Country", "Count")) \
       .toDF("id", "Country", "Count")
df.show()
Output:
+---+-------+-----+
| id|Country|Count|
+---+-------+-----+
|  1|    USA| 1000|
|  2|    USA| 1000|
+---+-------+-----+
NOTE: json_tuple() also returns null if the column value is not in the correct format (make sure the column values are json/dictionary as a string without additional quotes).
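The root cause is easy to reproduce outside Spark. A minimal plain-Python sketch (the `parse_notes` helper is hypothetical, just for illustration) shows that the extra wrapping quotes break parsing until they are stripped:

```python
import json

# Row 1's Notes value, as pasted: a JSON string wrapped in extra single quotes.
notes_bad = "'{\"Country\":\"USA\",\"Count\":\"1000\"}'"
# Row 2's Notes value: a clean JSON string.
notes_good = '{"Country":"USA","Count":"1000"}'

def parse_notes(s):
    # Hypothetical helper: strip one layer of wrapping single quotes, then parse.
    s = s.strip()
    if s.startswith("'") and s.endswith("'"):
        s = s[1:-1]
    return json.loads(s)

print(parse_notes(notes_bad) == parse_notes(notes_good))  # True
```

Passing `notes_bad` to `json.loads` directly raises a `JSONDecodeError`, which is the same failure mode `from_json()` surfaces as NULL.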

How to retrieve a json value based on a string key

I have json data that looks like this:
{
"deploy:success": 2,
"deploy:RTX:success": 1,
"deploy:BLX:success": 1,
"deploy:RTX:BigTop:success": 1,
"deploy:BLX:BigTop:success": 1,
"deploy:RTX:BigTop:xxx:success": 1,
"deploy:BLX:BigTop:yyy:success": 1
}
Where each new :<field> tacked on makes it more specific. Say a key with the format "deploy:RTX:success" is for a specific site RTX. I was planning on using a filter to show only the site-specific counts.
eval column_name=if($site_token$ = "", "deploy:success", "deploy:$site_token$:success")
Then rename the derived column:
rename column_name deploy
But rename looks for actual values in that first argument, not just a column name. I can't figure out how to get the values associated with that column for the life of me.
index=cloud_aws namespace=my namespace=Stats protov3=*
| spath input=protov3
| eval column_name=if("$site_token$" = "", "deploy:success", "deploy:$site_token$:success")
| rename column_name AS "deploy"
What have I done incorrectly?
It's not clear what the final result is supposed to be. If the result when $site_token$ is empty should be "deploy:success" then just use "deploy" as the target of the eval.
index=cloud_aws namespace=my namespace=Stats protov3=*
| spath input=protov3
| eval deploy=if("$site_token$" = "", "deploy:success", "deploy:$site_token$:success")
OTOH, if the result when $site_token$ is empty should be "2" then use the existing query with single quotes in the eval. Single quotes tell Splunk to treat the enclosed text as a field name rather than a literal string (which is what double quotes do).
index=cloud_aws namespace=my namespace=Stats protov3=*
| spath input=protov3
| eval deploy=if("$site_token$" = "", 'deploy:success', 'deploy:$site_token$:success')
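Conceptually, the single-quoted `eval` is a dynamic key lookup: build the key string from the token, then read the field with that name. A plain-Python sketch of the same idea (illustrative only — the `deploy_count` helper is made up here, and Splunk's field semantics differ):

```python
import json

# A trimmed version of the event data from the question.
event = json.loads("""{
  "deploy:success": 2,
  "deploy:RTX:success": 1,
  "deploy:BLX:success": 1
}""")

def deploy_count(event, site_token=""):
    # Mirrors: eval deploy=if("$site_token$"="", 'deploy:success', 'deploy:$site_token$:success')
    key = "deploy:success" if site_token == "" else f"deploy:{site_token}:success"
    return event[key]

print(deploy_count(event))         # 2  (no site token -> overall count)
print(deploy_count(event, "RTX"))  # 1  (site-specific count)
```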

How do I search for a string in this JSON with Python

My JSON file looks something like:
{
"generator": {
"name": "Xfer Records Serum",
....
},
"generator": {
"name": "Lennar Digital Sylenth1",
....
}
}
I ask the user for a search term, and the input is searched for in the name key only; all matching results are returned. That means if I input just 's', both of the entries above should be returned. Please also explain how to return all the object names which are generators. The simpler the method, the better for me. I use the json library, but another library is not a problem if one is required.
Before switching to JSON I tried XML but it did not work.
If your goal is just to search all name properties, this will do the trick:
import re

def search_names(term, lines):
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + term + r'.*)",?$', re.I)
    return [x.group(1) for x in [name_search.search(y) for y in lines] if x]

with open('path/to/your.json') as f:
    lines = f.readlines()

print(search_names('s', lines))
which would return both names you listed in your example.
The way the search_names() function works is that it builds a regular expression matching any line that starts with "name": (with varying amounts of whitespace), followed by your search term with any other characters around it, terminated by a closing quote, an optional comma, and the end of the string. It then applies that expression to each line from the file. Finally, it filters out any non-matching lines and returns the value of the name property (the capture-group contents) for each match.
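As a quick check, the helper can be exercised on in-memory lines instead of a file. This sketch restates the function with `re.escape()` added, so regex metacharacters in the search term are treated literally (a small hardening, not in the original answer):

```python
import re

def search_names(term, lines):
    # Escape the user-supplied term so characters like '.' or '*' match literally.
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + re.escape(term) + r'.*)",?$', re.I)
    return [m.group(1) for m in (name_search.search(line) for line in lines) if m]

# Lines mirroring the (corrected) JSON from the question.
lines = [
    '  "generator": {',
    '    "name": "Xfer Records Serum",',
    '  },',
    '  "generator": {',
    '    "name": "Lennar Digital Sylenth1",',
    '  }',
]
print(search_names('s', lines))  # ['Xfer Records Serum', 'Lennar Digital Sylenth1']
```

Note that the regex approach is used here because the pasted data has duplicate "generator" keys, which json.loads would silently collapse to a single entry.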

Extract fields from log file where data is stored half json and half plain text

I am new to Spark, and I want to read a log file and create a dataframe out of it. My data is half JSON, and I cannot convert it into a dataframe properly. Below is the first row in the file:
[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}
The first part is plain text and the last part between { } is JSON. I tried a few things: converting it first to an RDD, then map and split, then converting back to a DataFrame, but I cannot extract the values from the JSON part of the row. Is there a trick to extract fields in this context?
The final output will look like:
TimeStamp userid ip artist album song id service
2017-01-06 07:00:01 444444 11.11.111.0 Tears For Fears Songs From The Big Chair Everybody Wants To Rule The World S4555 pandora
You just need to parse out the pieces with a Python UDF into a tuple, then tell Spark to convert the RDD to a DataFrame. The easiest way to do this is probably a regular expression. For example:
import re
import json
def parse(row):
    pattern = ' '.join([
        r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\]',
        r'userid:(?P<userid>\d+)',
        r'(?P<ip>\d+\.\d+\.\d+\.\d+)',
        r'(?P<level>\w+)',
        r'(?P<json>.+$)'
    ])
    match = re.match(pattern, row)
    parsed_json = json.loads(match.group('json'))
    return (match.group('ts'), match.group('userid'), match.group('ip'),
            match.group('level'), parsed_json['artist'], parsed_json['song'],
            parsed_json['service'])

lines = [
    '[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}'
]
rdd = sc.parallelize(lines)
df = rdd.map(parse).toDF(['ts', 'userid', 'ip', 'level', 'artist', 'song', 'service'])
df.show()
This prints
+-------------------+------+-----------+-----+---------------+--------------------+-------+
| ts|userid| ip|level| artist| song|service|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
|2017-01-06 07:00:01|444444|11.11.111.0| info|Tears For Fears|Everybody Wants T...|pandora|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
I ended up using the following instead — just some parsing utilizing PySpark's power:
from pyspark.sql.types import StructField, StringType

parts = r1.map(lambda x: x.value.replace('[', '').replace('] ', '###')
               .replace(' userid:', '###').replace('null', '"null"').replace('""', '"NA"')
               .replace(' music_info {"artist":"', '###').replace('","album":"', '###')
               .replace('","song":"', '###').replace('","id":"', '###')
               .replace('","service":"', '###').split('###'))
people = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]))
schemaString = "timestamp mac userid_ip artist album song id service"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
With this I got almost what I want, and performance was super fast.
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
| timestamp| mac| userid_ip| artist| album| song| id|service|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|2017-01-01 00:00:00|00:00:00:00:00:00|111122 22.235.17...|The United States...| This Is Christmas!|Do You Hear What ...| S1112536|pandora|
|2017-01-01 00:00:00|00:11:11:11:11:11|123123 108.252.2...| NA| Dinner Party Radio| NA| null|pandora|
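Before wiring the regex UDF from the first answer into Spark, it can be verified in plain Python with no SparkContext at all:

```python
import re
import json

def parse(row):
    # Same pattern as the answer above: timestamp, userid, ip, level, then raw JSON.
    pattern = ' '.join([
        r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\]',
        r'userid:(?P<userid>\d+)',
        r'(?P<ip>\d+\.\d+\.\d+\.\d+)',
        r'(?P<level>\w+)',
        r'(?P<json>.+$)'
    ])
    match = re.match(pattern, row)
    parsed_json = json.loads(match.group('json'))
    # Note: 'album' is parsed but not returned, matching the answer's schema.
    return (match.group('ts'), match.group('userid'), match.group('ip'),
            match.group('level'), parsed_json['artist'], parsed_json['song'],
            parsed_json['service'])

row = ('[2017-01-06 07:00:01] userid:444444 11.11.111.0 info '
       '{"artist":"Tears For Fears","album":"Songs From The Big Chair",'
       '"song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}')
print(parse(row))
```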

Custom Function to Extract Data from JSON API Based on Column Values in Excel VBA

I've an excel workbook that looks something like this:
/-------------------------------------\
| Lat | Long | Area |
|-------------------------------------|
| 5.3 | 103.8 | AREA_NAME |
\-------------------------------------/
I also have a JSON api with a url of the following structure:
https://example.com/api?token=TOKEN&lat=X.X&lng=X.X
that returns a JSON object with the following structure:
{ "Area": "AREA_NAME", "OTHERS": "Other_details"}
I tried to implement a VBA function that will help me to extract AREA_NAME. However, I keep getting syntax errors. I don't know where I am going wrong.
Function get_p()
Source = Json.Document (Web.Contents("https://example.com/api?token=TOKEN&lat=5.3&lng=103.8"))
name = Source[Area]
get_p = Name
End Function
I intentionally hardcoded the lat and long value for development purposes. Eventually, I want the function to accept lat and long as parameters. I got the first line of the function from PowerQuery Editor.
Where am I going wrong? How to do this properly in VBA? Or is there a simpler way using PowerQuery?