Getting my re.findall to accept urls with a # symbol - json

Right now I have this line of Python code:
urls = re.findall(r"(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+", str(field))
I use this to search for a keyword in URLs; however, it doesn't correctly parse URLs that include a #. An example link I am trying to parse is
https://partalert.net/product.html?v=51421546#asin=B08KH7RL89&price=&smid=A3P5ROKL5A1OLE&tag=partalert-21&timestamp=00%3A17+UTC+%281.3.2021%29&title=Gigabyte+GeForce+RTX+3080+VISION+OC+10GB+Graphics+Card&tld=.co.uk
However the parsing excludes the hashtag and everything after it:
https://partalert.net/product.html?v=51421546

I managed to solve this; I needed to add a few symbols to the character classes. Here is the working regex: r"(?:(?:https?|ftp)://)?[\w/\-?=%.#&+]+\.[\w/\-?=%.#&+]+"
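As a quick check of the fix, here is a minimal sketch comparing the two patterns on the example link. The names OLD_PATTERN, NEW_PATTERN and text are made up for illustration, and the words around the link in text are an assumption standing in for whatever str(field) contains.

```python
import re

# Pattern from the question: '#', '&' and '+' are not in the character classes.
OLD_PATTERN = r"(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+"
# Pattern from the fix: the same classes with '#', '&' and '+' added.
NEW_PATTERN = r"(?:(?:https?|ftp)://)?[\w/\-?=%.#&+]+\.[\w/\-?=%.#&+]+"

url = ("https://partalert.net/product.html?v=51421546#asin=B08KH7RL89&price="
       "&smid=A3P5ROKL5A1OLE&tag=partalert-21&timestamp=00%3A17+UTC+%281.3.2021%29"
       "&title=Gigabyte+GeForce+RTX+3080+VISION+OC+10GB+Graphics+Card&tld=.co.uk")
# Hypothetical surrounding text; only the link itself comes from the question.
text = "in stock " + url + " today"

old_matches = re.findall(OLD_PATTERN, text)
new_matches = re.findall(NEW_PATTERN, text)

print(old_matches[0])  # stops right before the '#'
print(new_matches)     # the whole link as a single match
```

Note that inside a character class an unescaped - between two characters defines a range, so keeping it escaped (or placing it last) matters as much as adding the new symbols.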

Related

Replacing a string in HTML file in Python

I'm trying to replace a string stored in a list with an HTML tag in a file by doing:
links=[http://hexagon-dashboard-gbc-01/vboard/latest?regs=3281546<!--V68NUR-->]
str1="""%s<!--V68NUR-->"""%(vboard['V68N']['perf.tl'],vboard['V68N']['perf.tl'])
with open(html_file,'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(links[0],str1))
But I get the following error:
TypeError: replace() argument 1 must be str, not Tag.
What am I missing? Please help me with the modification I have to do.
Updated:
From what you posted, I suppose you are treating an HTML file as plain text and performing string replacement on it.
The replace() function only works when both of its arguments are strings.
The reason you got an error is that links[0] is not a string but a tag.
If you manage to get links like this (note the single quotes)
links=['http://hexagon-dashboard-gbc-01/vboard/latest?regs=3281546<!--V68NUR-->']
then
content.replace(links[0],str1)
would not produce any errors.
To edit HTML files, you could also use an HTML parser instead.
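The point about argument types can be sketched in isolation. The FakeTag class below is a hypothetical stand-in for whatever Tag object your parser returns (it is not a real library class), and the content and URL are invented for the example; the assumption is that calling str() on the real tag yields its markup.

```python
class FakeTag:
    """Hypothetical stand-in for a parser's Tag object; not a real library class."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


content = "<td>http://example.invalid/latest?regs=1<!--V68NUR--></td>"
links = [FakeTag("http://example.invalid/latest?regs=1<!--V68NUR-->")]
str1 = "NEW<!--V68NUR-->"

# Passing the object itself fails, because str.replace() only accepts strings.
try:
    content.replace(links[0], str1)
    raised = False
except TypeError:
    raised = True

# Converting it to a string first makes the replacement work.
updated = content.replace(str(links[0]), str1)
print(raised, updated)
```

Whether str(tag) reproduces the exact text found in the file depends on the parser, so it is worth printing both before relying on the replacement.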

TCL: file normalize gives unexpected output

I have the following line of code:
file normalize [string map {\\ /} $file]
The string map operation makes the line work for paths containing backslashes instead of forward slashes (as is the case on Windows).
For some values of $file (let's say it's "/path/to/my/file") I get output similar to:
/path/to/"/path/to/my/file/"
This doesn't happen for all paths but I'm unable to figure out what causes it. There are no symbolic links in the path.
Is there something I'm doing wrong, or is there an alternative to file normalize that I could try?
My Tcl version is 8.5.
UPDATE:
On further investigation I see that the string map is not making any difference. The output of file normalize itself is coming with that extra text before the desired text. Also, the extra text seems to be from a previous run of the code.
UPDATE 2: It was because of the quotation marks in the input to file normalize
Most likely the path has backslashes where it shouldn't have them.
% file normalize {"/path/to/some/file"}
/path/to/"/path/to/some/file"
% file normalize \"/path/to/some/file\"
/path/to/"/path/to/some/file"
Perhaps some pathname handling code escaped special characters for some reason and left the path in a mangled state.
I would try to keep the pathname pristine and when it needs to be changed for display or other processing, make a copy of it first.
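The same effect can be mimicked outside Tcl. The sketch below is a rough Python analogue, not Tcl's actual implementation: the helper normalize and the working directory /path/to are assumptions chosen for illustration. It shows why a leading quotation mark makes the path look relative, so it ends up appended to the current directory.

```python
import posixpath


def normalize(path, cwd="/path/to"):
    """Hypothetical helper: resolve a relative path against cwd, then tidy it."""
    return posixpath.normpath(posixpath.join(cwd, path))


quoted = '"/path/to/some/file"'

# The first character is a quote, not '/', so the path is treated as relative.
mangled = normalize(quoted)

# Stripping the stray quotes first leaves an absolute path that is kept as is.
clean = normalize(quoted.strip('"'))

print(mangled)
print(clean)
```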

Escaping symbols in Gatling jsonpath

We're using Gatling jsonpath in scala to parse our JSON, and are using it like so as per the docs:
val jsonSample = (new ObjectMapper).readValue("""{"#a":"A","#b":"B"}""", classOf[Object])
JsonPath.query("$.#a", jsonSample).right.map(_.toVector)
However, this code fails, and we get an error message about "string matching regex '[$_\d... etc etc }]* expected, but # found".
I've tried using backslashes, but these do not work and give the same error message. Does anyone know how to escape the # symbol?
It's worth noting I also tried the solution with hex on this page, but it doesn't work for the above. How do you escape the # symbol in jsonpath?
Thanks!
Turns out using bracket notation instead of dot notation fixes this:
JsonPath.query("$['#a']", jsonSample).right.map(_.toVector)

I need a good regex for HTML file parsing in ruby

Here is a Ruby question, guys. I need to parse an HTML file and catch URLs and emails, but I can't come up with a proper regex. I have tried 100+ regexes, and every time I catch something else along with the URL.
File.open("/Desktop/file.html").each_line do |line|
  if line.split("href=\"") =~ /???/
    puts line
  end
end
# I can use line.split("href=\"") so each new line will start with url =>
(https://www.facebook.com/students">
The question is: what regex can I use to catch everything from https to the end of the URL, which ends with (")? There could be one or more samples of the same URL, so {1,2} is needed.
Try this
file = File.open('filename_path')
links = file.read().scan(/href=\"(?<url>.*?)\"/)
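This gives you the links in an array (with a capture group, scan returns each match as a one-element array holding the captured URL).
It also works if you remove ?<url> from the pattern above; it is just a named capture group.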
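For comparison, the same non-greedy capture can be sketched in Python. The html string below is invented sample input, not taken from the question's file, and a regex like this is only a rough tool: it will miss attributes written with single quotes or unusual spacing.

```python
import re

# Invented sample markup standing in for the contents of the HTML file.
html = (
    '<a href="https://www.facebook.com/students">Students</a>\n'
    '<a href="mailto:someone@example.com">Mail</a>\n'
)

# .*? is non-greedy, so each match stops at the first closing quote.
links = re.findall(r'href="(.*?)"', html)
print(links)
```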

How to embed HTML string syntax in CoffeeScript using VIM?

I have looked at how to embed HTML syntax in JavaScript string from HTML syntax highlighting in javascript strings in vim.
However, when I use CoffeeScript I cannot get the same thing working by editing the coffee.vim syntax file in a similar way. I got recursion errors saying that including html.vim makes the syntax too deeply nested.
I have some HTML template in CoffeeScript like the following:
angular.module('m', [])
  .directive(
    'myDirective'
    [
      ->
        template: """
          <div>
            <div>This is <b>bold</b> text</div>
            <div><i>This should be italic.</i></div>
          </div>
        """
    ]
  )
How do I get the template HTML syntax in CoffeeScript string properly highlighted in VIM?
I would proceed as follows:
Find out the syntax groups that should be highlighted as pure html would be. Add html syntax highlighting to these groups.
To find the valid syntax group under the cursor you can follow the instructions here.
In your example the syntax group of interest is coffeeHereDoc.
To add html highlighting to this group execute the following commands
unlet b:current_syntax
syntax include @HTML syntax/html.vim
syn region HtmlEmbeddedInCoffeeScript start="" end=""
\ contains=@HTML containedin=coffeeHereDoc
Since Vim complains about recursion if you add these lines to coffee.vim, I would go with an autocommand:
function! Coffee_syntax()
  if !empty(b:current_syntax)
    unlet b:current_syntax
  endif
  syn include @HTML syntax/html.vim
  syn region HtmlEmbeddedInCoffeeScript start="" end="" contains=@HTML
  \ containedin=coffeeHereDoc
endfunction
autocmd BufEnter *.coffee call Coffee_syntax()
I was also running into various issues while trying to get this to work. After some experimentation, here's what I came up with. Just create .vim/after/syntax/coffee.vim with the following contents:
unlet b:current_syntax
syntax include @HTML $VIMRUNTIME/syntax/html.vim
syntax region coffeeHtmlString matchgroup=coffeeHeredoc
\ start=+'''\\(\\_\\s*<\\w\\)\\@=+ end=+\\(\\w>\\_\\s*\\)\\@<='''+
\ contains=@HTML
syn sync minlines=300
The unlet b:current_syntax line disables the current syntax matching and lets the HTML syntax definition take over for matching regions.
Using an absolute path for the html.vim inclusion avoids the recursion problem (described more below).
The region definition matches heredoc strings that look like they contain HTML. Specifically, the start pattern looks for three single quotes followed by something that looks like the beginning of an HTML tag (there can be whitespace between the two), and the end pattern looks for the end of an HTML tag followed by three single quotes. Heredoc strings that don't look like they contain HTML are still matched using the coffeeHeredoc pattern. This works because this syntax file is being loaded after the syntax definitions from the coffeescript plugin, so we get a chance to make the more specific match (a heredoc containing HTML) before the more general match (the coffeeHeredoc region) happens.
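To make the matching rule concrete, here is a small Python sketch of the same idea: only heredocs whose content starts with something tag-like and ends with a closing > are picked out. This is an illustration, not the Vim pattern itself. Python's re module requires fixed-width lookbehind, so the end condition is written as a consuming group rather than a lookbehind, and the names HTML_HEREDOC and source are invented for the example.

```python
import re

# Opening ''' must be followed (after optional whitespace) by '<' and a word
# character; the body must end with a word character and '>' before the
# closing '''.
HTML_HEREDOC = re.compile(r"'''(?=\s*<\w)(.*?\w>)\s*'''", re.DOTALL)

# Invented CoffeeScript-like sample with one plain and one HTML heredoc.
source = """
plain = '''
  just some text
'''
template = '''
  <div>
    <b>bold</b>
  </div>
'''
"""

matches = [m.group(1).strip() for m in HTML_HEREDOC.finditer(source)]
print(len(matches))
print(matches[0])
```

The plain heredoc is skipped because nothing tag-like follows its opening quotes, which mirrors how the more general coffeeHeredoc region would still handle it in Vim.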
The syn sync minlines=300 widens the matching region. My embedded HTML strings sometimes stretched over 50 lines, and Vim's syntax highlighter would get confused about how the string should be highlighted. For complete surety you could use syn sync fromstart, but for large files this could theoretically be slow (I didn't try it).
The recursion problem originally experienced by @heartbreaker was caused by the html.vim script that comes with the vim-coffeescript plugin (I'm assuming that was being used). That plugin's html.vim file includes its coffee.vim syntax file to add CoffeeScript highlighting to HTML files. Using a relative syntax include, a la
syntax include @HTML syntax/html.vim
you get all the syntax/html.vim files in VIM's runtime path, including the one from the coffeescript plugin (which includes coffee.vim, hence the recursion). Using an absolute path will restrict you to only getting the particular syntax file you specify, but this seems like a reasonable tradeoff since the HTML one would embed in a coffeescript string is likely fairly simple.