Dependency Parsing in spaCy - nltk

I want to extract verb-noun pairs from my text using dependency parsing.
I did this:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
document = nlp('appoint department heads or managers and assign or delegate responsibilities to them')

print("{:<15} | {:<8} | {:<15} | {:<20}".format('Token', 'Relation', 'Head', 'Children'))
print("-" * 70)
for token in document:
    print("{:<15} | {:<8} | {:<15} | {:<20}".format(
        token.text, token.dep_, token.head.text, str([child for child in token.children])))

displacy.render(document, style='dep', jupyter=True)
Can you help me write a cleaner one?
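A minimal sketch of one way to get verb-noun pairs out of the parse, assuming the en_core_web_sm model; the dobj relation and the conjunct handling below are just the cases relevant to this particular sentence:

import spacy

nlp = spacy.load('en_core_web_sm')
document = nlp('appoint department heads or managers and assign or delegate responsibilities to them')

pairs = []
for token in document:
    # a direct object hangs off its verb via the 'dobj' relation
    if token.dep_ == 'dobj' and token.head.pos_ == 'VERB':
        pairs.append((token.head.text, token.text))
        # a noun conjoined with the object ("heads or managers") shares the verb
        for conj in token.conjuncts:
            pairs.append((token.head.text, conj.text))

print(pairs)
# e.g. [('appoint', 'heads'), ('appoint', 'managers'), ('delegate', 'responsibilities')]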

Related

Pass default 0 value to missing field in json log search in Sumo Logic

I am trying to parse AWS ECR scan JSON logs to get a vulnerabilities table report using the query given below in Sumo Logic. The issue is that aws.ecr sends the fields CRITICAL or HIGH only when such findings exist; otherwise it omits those fields. How can I default the CRITICAL field to 0 when CRITICAL is not found in the JSON logs?
I tried using isNull, isEmpty, and isBlank, but it seems I am missing something; please share your advice. Thanks in advance.
_source="aws_ecr_events_test"
| json field=message "detail.repository-name" as repository_name
| json field=message "detail.image-tags" as tags
| json field=message "time" as last_scan
| json field=message "detail.finding-severity-counts.CRITICAL" as CRITICAL
| if(isNull("detail.finding-severity-counts.CRITICAL"), 0, CRITICAL) as CRITICAL
| json field=message "detail.finding-severity-counts.HIGH" as HIGH
| json field=message "detail.finding-severity-counts.MEDIUM" as MEDIUM
| json field=message "detail.finding-severity-counts.INFORMATIONAL" as INFORMATIONAL
| json field=message "detail.finding-severity-counts.LOW" as LOW
| json field=message "detail.finding-severity-counts.UNDEFINED" as UNDEFINED
| json field=message "detail.image-digest" as image_digest
| json field=message "detail.scan-status" as scan_status
| count by repository_name, tags, image_digest, scan_status, last_scan, CRITICAL, HIGH, MEDIUM, LOW, INFORMATIONAL, UNDEFINED
Example log:
detail:{finding-severity-counts:{LOW:1,HIGH:1}}
I think you're on the right track, but you need a nodrop at the end of the parse line; otherwise Sumo Logic will just drop the records that don't match the json parse statement. Also, isNull should be applied to the parsed field alias, not to the JSON path string:
...
| json field=message "detail.finding-severity-counts.CRITICAL" as CRITICAL nodrop
| if(isNull(CRITICAL), 0, CRITICAL) as CRITICAL
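Putting it together, every severity count that can be absent needs the same treatment; a sketch of the pattern (repeat for the remaining fields):

| json field=message "detail.finding-severity-counts.CRITICAL" as CRITICAL nodrop
| if(isNull(CRITICAL), 0, CRITICAL) as CRITICAL
| json field=message "detail.finding-severity-counts.HIGH" as HIGH nodrop
| if(isNull(HIGH), 0, HIGH) as HIGH
// ...and likewise for MEDIUM, LOW, INFORMATIONAL, and UNDEFINED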

JSON in spark sql dataframe [duplicate]

I asked this question a while back for Python, but now I need to do the same thing in PySpark.
I have a dataframe (df) like so:
|cust_id|address    |store_id|email        |sales_channel|category|
-------------------------------------------------------------------
|1234567|123 Main St|10SjtT  |idk#gmail.com|ecom         |direct  |
|4567345|345 Main St|10SjtT  |101#gmail.com|instore      |direct  |
|1569457|876 Main St|51FstT  |404#gmail.com|ecom         |direct  |
and I would like to combine the last 4 fields into one metadata field that is a JSON object, like so:
|cust_id|address |metadata |
-------------------------------------------------------------------------------------------------------------------
|1234567|123 Main St|{'store_id':'10SjtT', 'email':'idk#gmail.com','sales_channel':'ecom', 'category':'direct'} |
|4567345|345 Main St|{'store_id':'10SjtT', 'email':'101#gmail.com','sales_channel':'instore', 'category':'direct'}|
|1569457|876 Main St|{'store_id':'51FstT', 'email':'404#gmail.com','sales_channel':'ecom', 'category':'direct'} |
Here's the code I used to do this in python:
cols = [
    'store_id',
    'category',
    'sales_channel',
    'email'
]
df1 = df.copy()
df1['metadata'] = df1[cols].to_dict(orient='records')
df1 = df1.drop(columns=cols)
but I would like to translate this to PySpark code to work with a Spark dataframe; I do NOT want to use pandas in Spark.
Use the to_json function to create the JSON object!
Example:
from pyspark.sql.functions import *
#sample data
df=spark.createDataFrame([('1234567','123 Main St','10SjtT','idk#gmail.com','ecom','direct')],['cust_id','address','store_id','email','sales_channel','category'])
df.select("cust_id","address",to_json(struct("store_id","category","sales_channel","email")).alias("metadata")).show(10,False)
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","category":"direct","sales_channel":"ecom","email":"idk#gmail.com"}|
+-------+-----------+----------------------------------------------------------------------------------------+
to_json by passing a list of columns:
ll=['store_id','email','sales_channel','category']
df.withColumn("metadata", to_json(struct([x for x in ll]))).drop(*ll).show()
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","email":"idk#gmail.com","sales_channel":"ecom","category":"direct"}|
+-------+-----------+----------------------------------------------------------------------------------------+
@Shu gives a good answer; here's a variant that works out slightly better for my use case. I'm going from Kafka -> Spark -> Kafka, and this one-liner does exactly what I want. The struct(*) will pack up all the fields in the dataframe.
# Packup the fields in preparation for sending to Kafka sink
kafka_df = df.selectExpr('cast(id as string) as key', 'to_json(struct(*)) as value')
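For completeness, pushing that dataframe out to the Kafka sink is then just the standard batch write; a sketch, where the broker address and topic name are placeholders:

# hypothetical broker and topic; adjust to your cluster
kafka_df.write \
    .format('kafka') \
    .option('kafka.bootstrap.servers', 'localhost:9092') \
    .option('topic', 'customer-metadata') \
    .save()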

Karate does not display a response after a POST request with status 201 [duplicate]

This question already has an answer here: Support passing from Scenario Outline to JSON file (1 answer). Closed 2 years ago.
I am struggling with the following test, which is usually pretty easy...
Feature: Testing Env Create Feature

Scenario Outline: Create works as intended
    Given url "http://localhost:10000/api/envs"
    And request {"name": <Name>, "gcpProjectName": <GcpProjectName>, "url": <Url>}
    When method POST
    Then status 201
    And match response contains {"id": #string, "name": <Name>, "gcpProjectName": <GcpProjectName>, "url": <Url>}

    Examples:
        | Name     | GcpProjectName                | Url              |
        | tests    | D-COO-ContinuousCollaboration | https://fake.com |
        | approval | Q-COO-ContinuousCollaboration | https://fake.com |
        | demo     | P-COO-ContinuousCollaboration | https://fake.com |
        | prod     | P-COO-ContinuousCollaboration | https://fake.com |
I am supposed to get a response summarizing my POST request. I get one successfully using curl, Postman, or even Swagger, but it does not appear with Karate:
[failed features:
src.test.features.envtest.env-create: [1.1:13] env-create.feature:9 - path: $, actual: '', expected: '{"id":"#string","name":"tests","gcpProjectName":"D-COO-ContinuousCollaboration","url":"https://fake.com"}', reason: not a sub-string
Does anyone know what is happening?
Thanks for your help.
Just add quotes around string substitutions:
And request {"name": "<Name>", "gcpProjectName": "<GcpProjectName>", "url": "<Url>" }

Trying to understand the number of ParseErrors in html5lib-tests

I was looking at the following test case in html5lib-tests:
{"description":"<!DOCTYPE\\u0008",
"input":"<!DOCTYPE\u0008",
"output":["ParseError", "ParseError", "ParseError",
["DOCTYPE", "\u0008", null, null, false]]},
State                      | Input char | Actions
--------------------------------------------------------------------------------------------
Data State                 | "<"        | -> TagOpenState
TagOpenState               | "!"        | -> MarkupDeclarationOpenState
MarkupDeclarationOpenState | "DOCTYPE"  | -> DOCTYPE state
DOCTYPE state              | "\u0008"   | Parse error; -> before DOCTYPE name state (reconsume)
before DOCTYPE name state  | "\u0008"   | DOCTYPE(name = "\u0008"); -> DOCTYPE name state
DOCTYPE name state         | EOF        | Parse error. Set force quirks on. Emit DOCTYPE. -> Data state.
Data state                 | EOF        | Emit EOF.
I'm wondering where those three errors come from. I can only account for two, so I assume I'm making an error in my logic somewhere.
The one you're missing is the one from the "Preprocessing the input stream" section:
Any occurrences of any characters in the ranges U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).
This causes a parse error before the U+0008 character ever reaches the tokenizer. Given that the tokenizer is defined as reading from the input stream, the tokenizer tests assume the input stream has had its normal preprocessing applied.
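A quick way to observe all three errors empirically, assuming the Python html5lib package (the parser collects every parse error on its errors attribute):

import html5lib

# parse the test input; U+0008 is flagged by the input-stream
# preprocessing before tokenization even begins
parser = html5lib.HTMLParser()
parser.parse("<!DOCTYPE\u0008")

# each entry is a (position, error-code, datavars) tuple
for error in parser.errors:
    print(error)
print(len(parser.errors))  # expect 3, matching the test above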

Full path of the current Tcl script

Is there a way to get the full path of the currently executing Tcl script?
In PHP it would be: __FILE__
Depending on what you mean by "currently executing TCL script", you might actually seek info script, or possibly even info nameofexecutable or something more esoteric.
The correct way to retrieve the name of the file that the current statement resides in (a true equivalent to PHP/C++'s __FILE__) is this:
set thisFile [ dict get [ info frame 0 ] file ]
Pseudocode (how it works):
set thisFile <value> : sets variable thisFile to value
dict get <dict> file : returns the file value from a dict
info frame <#> : returns a dict with information about the frame at the specified stack level (#), and 0 will return the most recent stack frame
NOTICE: See end of post for more information on info frame.
In this case, the file value returned from info frame is already normalized, so file normalize <path> is not needed.
The difference between info script and info frame mainly matters for Tcl packages. If info script is used in a Tcl file that was loaded via package require <name>, it returns the path of the script currently being executed, not the name of the Tcl file that actually contains the info script command. The info frame example given here, however, correctly returns the name of the file that contains the command.
If you want the name of the script currently being evaluated, then:
set sourcedScript [ info script ]
If you want the name of the script (or interpreter) that was initially invoked, then:
set scriptAtInvocation $::argv0
If you want the name of the executable that was initially invoked, then:
set exeAtInvocation [ info nameofexecutable ]
UPDATE - Details about: info frame
Here is what a stack trace looks like within Tcl. The frame_index column shows what info frame $frame_index returns for values from 0 through [ info frame ].
Calling info frame [ info frame ] is functionally equivalent to info frame 0, but using 0 is of course faster.
There are actually only [ info frame ] stack frames, indexed 1 through [ info frame ], and 0 behaves like [ info frame ]. In this example you can see that 0 and 5 (which is [ info frame ]) are the same:
frame_index: 0 | type = source | proc = ::stacktrace | line = 26 | level = 0 | file = /tcltest/stacktrace.tcl | cmd = info frame $frame_counter
frame_index: 1 | type = source | line = 6 | level = 4 | file = /tcltest/main.tcl | cmd = a
frame_index: 2 | type = source | proc = ::a | line = 2 | level = 3 | file = /tcltest/a.tcl | cmd = b
frame_index: 3 | type = source | proc = ::b | line = 2 | level = 2 | file = /tcltest/b.tcl | cmd = c
frame_index: 4 | type = source | proc = ::c | line = 5 | level = 1 | file = /tcltest/c.tcl | cmd = stacktrace
frame_index: 5 | type = source | proc = ::stacktrace | line = 26 | level = 0 | file = /tcltest/stacktrace.tcl | cmd = info frame $frame_counter
See:
https://github.com/Xilinx/XilinxTclStore/blob/master/tclapp/xilinx/profiler/app.tcl#L273
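A minimal sketch of a stacktrace helper that prints a listing like the one above; keys that are absent from some frames (such as proc) are simply skipped:

proc stacktrace {} {
    # walk every frame from 0 (innermost) up to [info frame]
    for {set i 0} {$i <= [info frame]} {incr i} {
        set fr [info frame $i]
        set parts [list "frame_index: $i"]
        foreach key {type proc line level file cmd} {
            if {[dict exists $fr $key]} {
                lappend parts "$key = [dict get $fr $key]"
            }
        }
        puts [join $parts " | "]
    }
}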
You want $argv0
You can use [file normalize] to get the fully normalized name, too.
file normalize $argv0
file normalize [info nameofexecutable]
Seconds after I posted my question ... lindex $argv 0 is a good starting point ;-)