Extracting character sequence containing word - html

I've got an HTML string containing special character sequences looking like this:
[start_tag attr="value"][/end_tag]
I want to extract one of these sequences that contains a specific attribute, e.g.:
[my_image_tag image_id="12345" attr2="..." ...]
From the above example, I want to extract the whole thing, square brackets included, by matching on just one of the attributes and its value, in this case image_id="12345".
I tried using a regex, but it gives me the whole line, whereas I need only the part of the line that matches the specific value mentioned above.

Something like this should work:
my_string = '<h1>Heading1</h1>some text soem tex some text [some_tag attrs][/some_tag]some text some text [some_tag image_id="12345"] some text'
search_attrs = %w(image_id foo bar)
found = my_string =~ /(\[[^\]]*(#{search_attrs.join('|')})="[^"\]]*"[^\]]*\])/ && $1
# => "[some_tag image_id=\"12345\"]"
For a specific attribute id and value, you can simplify it like so:
found = my_string =~ /(\[[^\]]* image_id="12345"[^\]]*\])/ && $1
# => "[some_tag image_id=\"12345\"]"
It works by expanding the primary capture group to everything you're looking for.
However, this assumes you only need to extract one such attribute.
It also assumes that you don't care if the string crosses through any HTML tag boundaries. If you cared about that, then you'd need to first hash out the legal boundaries using an HTML parser, then search within those results.
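If you need every matching tag rather than just the first one, String#scan is one way to go. This is only a minimal sketch: the helper name find_bracket_tags and the sample string are made up for illustration, and like the regexes above it ignores HTML tag boundaries.

```ruby
# Hypothetical helper: returns every [tag ...] sequence that carries attr="value".
def find_bracket_tags(text, attr, value)
  # Regexp.escape keeps special characters in the attribute name/value literal.
  pattern = /\[[^\]]*\s#{Regexp.escape(attr)}="#{Regexp.escape(value)}"[^\]]*\]/
  text.scan(pattern) # no capture groups, so scan returns the full matches
end

sample = 'text [some_tag attrs][/some_tag] text [some_tag image_id="12345"] ' \
         'text [other_tag image_id="12345" size="large"]'
matches = find_bracket_tags(sample, 'image_id', '12345')
puts matches
```

Because [^\]]* cannot cross a closing bracket, each match stays inside a single pair of square brackets.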

Related

Split a paragraph to separate one or two digit numbers with PowerShell

I'm attempting to parse and format some text from an HTML file into Word. I'm doing this by capturing each paragraph into an array and then writing it into the word document one paragraph at a time. However, there are superscripted references sprinkled throughout the text. I'm looking for a way to superscript these references in the new Word file and thought I would use regex and split to make this work. Here is an example paragraph:
$p = "This is an example sentence.1 The number is a reference note that should be superscripted and can be one or two digits long."
Here is the code I tried to split and select the digit(s):
[regex]::Split($p,"(\d{1,2})")
This works for single and double digits. However, if there are more than two digits, it still splits it, but moves the extra numbers to the next line. Like so:
This is an example sentence.
10
0
The number is a reference note that should be superscripted and can be one or two digits long.
This is important because there are sometimes larger numbers (3-10 digits) in the text that I don't want to split on. My goal is to take a block of text with reference note numbers and separate out the notes so I can perform formatting functions on them when I write it out to the Word file. Something like this (untested):
$paragraphs | % {
$a = @([regex]::Split($_,"(\d{1,2})"))
$a | % {
$text = $_
if ($text -match "(\d{1,2})")
{
$objSelection.Font.SuperScript = 1
$objSelection.TypeText("$text")
$objSelection.Font.SuperScript = 0
}
Else
{
$objSelection.Style="Normal"
$objSelection.TypeText("$text")
}
}
$text = "`v"
$objSelection.TypeText("$text")
$objSelection.TypeParagraph()
}
EDIT:
The following regex works when I test it with the above loop in its own script:
"(?<![\d\s])(\d{1,2})(?!\d)"
However, when I run it in the parent script, I get the following error:
Cannot find an overload for "Split" and the argument count: "2"
$a = [regex]::Split($_,"(?<![\d\s])(\d{1,2})(?!\d)")
How would I go about troubleshooting this error?
You may use
[regex]::Split($p,"(?<![\d\s])(\d{1,2})(?!\d)\s*")
It only matches and captures one or two digits that are neither followed nor preceded by another digit, and not preceded by any whitespace char. Any trailing whitespace is matched with \s* and is thus removed from the items that are added into the resulting array.
Details
(?<![\d\s]) - a negative lookbehind that fails the match if, immediately to the left of the current position, there is a digit or a whitespace
(\d{1,2}) - Group 1: one or two digits
(?!\d) - that cannot be followed with another digit (it is a negative lookahead that fails the match if its pattern matches immediately to the right of the current location)
\s* - 0+ whitespaces.
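To see what the pattern does to the pieces, here is a small sketch in Ruby rather than PowerShell (Ruby's String#split, like [regex]::Split, keeps captured groups in the result). The constant name NOTE_SPLIT and the sample sentences are assumptions made for the example.

```ruby
# Same pattern as above: 1-2 digits not preceded by a digit or whitespace,
# not followed by a digit, plus any trailing whitespace.
NOTE_SPLIT = /(?<![\d\s])(\d{1,2})(?!\d)\s*/

paragraph = "This is an example sentence.1 The number is a reference note."
pieces = paragraph.split(NOTE_SPLIT)

pieces.each do |piece|
  # A piece that is nothing but 1-2 digits is a reference note.
  kind = piece.match?(/\A\d{1,2}\z/) ? 'superscript' : 'normal'
  puts "#{kind}: #{piece}"
end

# Longer numbers, and numbers preceded by a space, are left alone.
untouched = "See item 100 or call 5551234.".split(NOTE_SPLIT)
long_note = "An example sentence.100 Next sentence.".split(NOTE_SPLIT)
```

The captured digits come back as their own array element, which is what lets the loop format them separately.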

Extracting content of HTML tag with specific attribute

Using regular expressions, I need to extract a multiline content of a tag, which has specific id value. How can I do this?
This is what I currently have:
<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>
The problem with this is this sample:
<div id="1">test</div><div id="2">test</div>
If I want to replace id="2" using this regexp (with ${value} = 2), the whole string would get matched. This is because from the tag opening to closing I match everything until id is found, which is wrong.
How can I do this?
A fairly simple way is to use
Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>
Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/
Use the variable in place of 2.
The content will be in group 1.
Change (.|\n) to [^>] so it won't match the > that ends the tag. Then it can't match across different divs.
<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>
Also, instead of using (.|\n)* to match across multiple lines, use the s modifier to the regexp. This makes . match any character, including newlines.
However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.
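For what it's worth, here is how the [^>] version might look in Ruby, as a sketch only. The helper name div_content and the sample HTML are invented for the example; note that in Ruby the "dot matches newline" option is spelled /m. It still breaks on nested divs, which is one reason a DOM parser is the safer choice.

```ruby
# Hypothetical helper: returns the inner text of the first <div> with the given id,
# or nil when there is no such div. Does not handle nested <div> elements.
def div_content(html, id)
  # [^>]* cannot run past the end of the opening tag, so the id must belong
  # to this <div>. The /m flag lets . match newlines in the content.
  pattern = /<div\b[^>]*\bid="#{Regexp.escape(id)}"[^>]*>(.*?)<\/div>/m
  match = html.match(pattern)
  match && match[1]
end

html = %(<div id="1">test one</div><div id="2">line a\nline b</div>)
second = div_content(html, '2')
puts second
```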

Remove/strip specific Html tag and replace using NotePad++

Here is my text:
<h3>#6</h2>
Is he eating a lemon?
</div>
I have a few of them in my articles the #number is always different also the text is always different.
I want to make this out of it:
<h3>#6 Is he eating a lemon?</h3>
I tried it via regex in notepad++ but I am still very new to this:
My Search:
<h3>.*?</h2>\r\n.*?\r\n\r\n</div>
Now it is always selecting the right part of the text.
What does my replace command need to look like to get an output like the one above?
You should modify your original regex to capture the text you want in groups, like this:
<h3>(.*?)</h2>\r\n(.*?)\r\n\r\n</div>
    (   )         (   )
//    ^             ^    These are your capture groups
You can then access these groups with the \1 and \2 tokens respectively.
So your replace pattern would look like:
<h3>\1 \2</h3>
Your search could be <h3>(.*)<\/h2>\r\n(.*)\r\n\r\n<\/div>
and the replace is <h3>$1 $2</h3>, where $1 and $2 represent the strings captured in the parentheses.
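The same capture-and-backreference idea outside Notepad++, sketched in Ruby with the sample text from the question (the variable names are just for the example):

```ruby
# Sample text from the question, with Windows line endings.
text = "<h3>#6</h2>\r\nIs he eating a lemon?\r\n\r\n</div>"

# Group 1 captures the heading text, group 2 the line that follows it.
search = /<h3>(.*?)<\/h2>\r\n(.*?)\r\n\r\n<\/div>/

# \1 and \2 in the single-quoted replacement refer back to the two groups.
fixed = text.gsub(search, '<h3>\1 \2</h3>')
puts fixed
```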

What is regex expression for a string after I use nokogiri to scrape

I have this string and it is in an html document of 100 other names that are formatted the same:
<li>Physical education sed<span class="meta"><ul><li>15184745922</li></ul></span>
</li>
And I want to save 'Physical education sed' under a name column and '15184745922' under a number column.
I was wondering how do you do this in Ruby.
In nokogiri I can get only the li's by doing this:
puts page.css("ul li").text
but then it comes out all as one word: "Physical education sed15184745922"
I was thinking regex is the way to go but I am stumped with that.
I did split it on the li
full_contact = page.css("ul li")[22]
split_contact_on_li = full_contact.to_s.split(/(\W|^)li(\W|$)/).map(&:to_sym)
puts split_contact_on_li
and I get this
<
>
Physical education sed<span class="meta"><ul>
<
>
15184745922<
/
>
</ul></span>
<
/
>
The same number of lines will be shown for each contact_info and the name is always the third line before the span class and the number is always the 6th line.
There are cases where there might be an email address on the 6th line instead, but not often.
So should I match the second and the third angle bracket and pull the information up to the third and fourth bracket, then shove it into an array called name and number?
You shouldn't use a regex to parse XHTML, since a regex can easily mess things up; you should use an HTML parser instead. However, if you want to use a regex, you can use one like this:
<li>(.*?)<.*?<li>(.*?)<
The idea behind this regex is to use capturing groups (using parentheses) to capture the content you want. So, for your sample input the match information is:
MATCH 1
Group 1. [4-26] `Physical education sed`
Group 2. [53-64] `15184745922`
For example:
#!/usr/bin/env ruby
string = "<li>Physical education sed<span class=\"meta\"><ul><li>15184745922</li></ul></span></li>"
one, two = string.match(/<li>(.*?)<.*?<li>(.*?)</i).captures
p one #=> "Physical education sed"
p two #=> "15184745922"
Why don't you just do a regex on the string "physical education sed15184745922"? You can match on the first digit, and get back the number and the preceding text.
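A minimal sketch of that idea in Ruby, assuming the text is always a name immediately followed by a run of digits. The helper name split_contact is made up, and an entry that ends in an email address instead of a number simply returns nil.

```ruby
# Hypothetical helper: splits "Name12345" into ["Name", "12345"].
# Returns nil when the text does not end in a run of digits.
def split_contact(text)
  # \D+? lazily takes the non-digit name; \d+ takes the digits up to the end.
  match = text.strip.match(/\A(\D+?)\s*(\d+)\z/)
  match && match.captures
end

name, number = split_contact("Physical education sed15184745922\n")
puts name
puts number
```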
I don't know how to use Ruby, but if I understand your question correctly I would take advantage of the gsub function (or Ruby's equivalent). It might not be the prettiest approach, but since we just want the text in one variable and the numbers in another, we can just replace the characters we don't want with empty values.
v1 = page.css('ul li').text
v2 = gsub('\\d', '', v1)
v3 = gsub('\\D', '', v1)
v1 gets the full text value, v2 removes all digits, leaving the name, and v3 removes all non-digit characters, leaving the number, giving us two new variables to put wherever we please. (In R the backslash has to be doubled inside the string.)
Again, I don't know how to use Ruby, but in R I know that I could get all of the values from the page using the xpath you provided ("ul li") into a vector and then loop across the vector performing the above steps on each element. I'm not sure if that adequately answers your question, but hopefully the gsub function gets you closer to what you want.
You need to use your HTML parser (Nokogiri) and regular expressions together. First, use Nokogiri to traverse down to the first parent node that contains all the text you need, then regex the text to get what you need.
Also, consider using .xpath instead of .css; it provides much more functionality to search and scrape for just what you want. Given your example, you could do it like so:
page.xpath("//span[@class='meta']/parent::li").map do |i|
i.text.scan(/^([a-z\s]+)(\d+)$/i).flatten
end
#=> [['Physical education sed', '15184745922'], ['the next string', '1234567890'], ...]
And now you have a two-dimensional array you can iterate over and save each pair.
This bit of xpath business: "//span[@class='meta']/parent::li" is doing what .css can't do, returning the parent node that has the text and specific children nodes you want to scrape against.

Find a string between html tags in Powershell

I'm trying to write a Powershell script that will pull out a string between two HTML tags within an HTML file. I don't know what the value will be, but I know what tags need to be searched. Additionally, I know that the tags do not always appear at the start of a line (i.e., they can be in the middle of a line of text). Finally, I also know that the tags and the string between them will never break across a line.
I have the path of the file stored in a variable
$filePath = "C:\Path\file.html"
I'm trying to find any value between <h6> and </h6> and store those values in an array.
Try
$myarray = gc $filepath |
% { [regex]::matches( $_ , '(?<=<h6>\s+)(.*?)(?=\s+</h6>)' ) } |
select -expa value
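This is meant to drop the whitespace between the tags and the text. Note that \s+ needs at least one whitespace character on each side to match.
If your tags sit directly against the text, or if you want to keep those spaces in the result, remove \s+ from the regex pattern.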
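For comparison, here is the same extraction sketched in Ruby. Ruby's regex engine rejects a variable-length lookbehind such as (?<=<h6>\s+), so this swaps in a plain capture group with \s* on both sides, which also copes with zero or several surrounding spaces. The helper name h6_values and the sample lines are assumptions for the example.

```ruby
# Hypothetical helper: collects the trimmed text of every <h6>...</h6> pair,
# assuming (as in the question) that a pair never spans more than one line.
def h6_values(lines)
  lines.flat_map do |line|
    # scan returns one array per match (one entry per group); flatten unwraps it.
    line.scan(/<h6>\s*(.*?)\s*<\/h6>/).flatten
  end
end

lines = [
  "<p>intro</p><h6> First </h6> text <h6>Second</h6>",
  "no heading on this line",
  "  <h6>  Third</h6>"
]
values = h6_values(lines)
puts values
```

To read from a file instead, something like h6_values(File.readlines(path)) should work the same way.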