Need help forming a regular expression in Perl - HTML

I need some suggestions on parsing HTML content. I need to extract the id of each <a> tag inside a div and store it in a specific variable. I tried to write a regular expression for this, but it picks up the ids of the <a> tags in every div. I need to store only the ids of the <a> tags that are inside one specific div.
The HTML content is
<div class="m_categories" id="part_one">
<ul>
<li>-
aaa
</li>
<li>-
bbb
</li>
.
.
.
</div>
<div class="m_categories hidden" id="part_two">
<ul>
<li>-
ccc
</li>
<li>-
ddd
</li>
<li>-
eee
</li>
.
.
</div>
I need some suggestions. Thanks in advance.
Update: the regex I have used:
if($content=~m/sel_cat " id="([^<]*?)"/is){}
while($content=~m/sel_cat " id="([^<]*?)"/igs){}

You should really look into HTML::Parser rather than trying to use a regex to extract bits of HTML.
One way to use it to extract the id attribute from each div tag would be:
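use strict;
use warnings;
use HTML::Parser;

# This parser only looks at opening tags
sub start_handler {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq 'div') {        # is it a div element?
        if ($attr->{id}) {          # does the div have an id?
            print "div id found: ", $attr->{id}, "\n";
        }
    }
}
my $html = read_html_somehow() or die $!;   # placeholder: load the HTML however you like
my $p = HTML::Parser->new(api_version => 3);
# The argspec names the arguments the parser passes to the handler
$p->handler( start => \&start_handler, 'self,tagname,attr,attrseq,text' );
$p->parse($html);
$p->eof;   # signal the end of the document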
This is a lot more robust and flexible than a regex-based approach.

There are many great HTML parsers around. I like the Mojo suite, which lets me use CSS selectors to get part of the DOM:
use Mojo::DOM;
use feature 'say';
my $dom = Mojo::DOM->new($html_content);
say for $dom->find('a.sel_cat')->map('all_text')->each;
# Or, more explicitly:
# say $_->all_text for $dom->find('a.sel_cat')->each;
Output:
aaa
bbb
ccc
ddd
eee
Or for the IDs:
say for $dom->find('a.sel_cat')->map(attr => 'id')->each;
# Or, more explicitly:
# say $_->attr('id') for $dom->find('a.sel_cat')->each;
Output:
sel_cat_10018
sel_cat_10007
sel_cat_10016
sel_cat_10011
sel_cat_10025
If you only want those ids in the part_two div, use the selector #part_two a.sel_cat.
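The same idea of scoping the search to one div can be sketched without any third-party module. The following is only an illustration in Python using the standard library's html.parser, not the Perl approach from the answers above. The sample markup is an assumption, because the anchors in the question's HTML are not shown, and the names AnchorIdCollector and anchor_ids are made up for this example.

```python
from html.parser import HTMLParser


class AnchorIdCollector(HTMLParser):
    """Collect the id of every <a> nested inside the div with a given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0   # div nesting depth inside the target div (0 = outside)
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth:
                self.depth += 1          # nested div inside the target
            elif attrs.get("id") == self.div_id:
                self.depth = 1           # entering the target div
        elif tag == "a" and self.depth and "id" in attrs:
            self.ids.append(attrs["id"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


# Hypothetical markup: the anchors are assumed, not the asker's exact HTML.
SAMPLE = """
<div class="m_categories" id="part_one">
  <ul><li>- <a class="sel_cat" id="sel_cat_10018">aaa</a></li></ul>
</div>
<div class="m_categories hidden" id="part_two">
  <ul>
    <li>- <a class="sel_cat" id="sel_cat_10016">ccc</a></li>
    <li>- <a class="sel_cat" id="sel_cat_10011">ddd</a></li>
  </ul>
</div>
"""


def anchor_ids(html, div_id):
    """Return the ids of all <a> tags inside the div whose id is div_id."""
    collector = AnchorIdCollector(div_id)
    collector.feed(html)
    collector.close()
    return collector.ids


print(anchor_ids(SAMPLE, "part_two"))
```

Tracking the div depth is what keeps anchors from other divs out of the result, which is the part the original regex could not express.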

Related

How to generate a different HTML format from a Nokogiri collection in Ruby

I'm working on a script that migrates a current HTML page and transforms it into a different HTML layout. I can get the information from the document using Nokogiri and XPath.
The problem is how to traverse the nodes retrieved with a loop in a similar fashion to how an array and a hash are traversed to generate the layout that I need.
This is a sample of the original layout that I am trying to convert:
<ul id="nav">
<li>Link 1 </li>
<li>
Link 2
<ul>
<li>Sublink 1</li>
<li>Sublink 2</li>
</ul>
</li>
</ul>
This code is what I have tried so far. The problem is when it loops through the collection set it outputs all of the nodes in the new HTML tag each time a pass is made through the collection, rather than only outputting information at the current index.
require 'nokogiri'
source_file = Nokogiri.XML(open("navigation.inc"))
source_file = Nokogiri.XML(source_file.to_s.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))
navigation = ""
if source_file.xpath("//ul[@id = 'nav']").length > 0
  navcontain = source_file.xpath("//ul[@id = 'nav']/li")
  navcontain.each do | child |
    if child.xpath("//li and count(*) = 2")
      navigation = navigation + "<details>"
      child.xpath("//li/ul").each do | children |
        navigation = navigation + child.xpath("//li/a").to_s
      end #end child loop
      navigation = navigation + "</details>"
    else
      navigation = navigation + source_file.xpath("//ul[@id = 'nav']/li/a").to_s
    end #end conditional check
  end #end initial loop
end #end length check
puts navigation
This is an example of what the code above is currently doing:
<div id="nav">
<details>
Link 1
Link 2
Sublink 1
Sublink 2
</details>
<details>
Link 1
Link 2
Sublink 1
Sublink 2
</details>
</div>
The format that I want after the transformation is:
<div id="nav">
Link 1
<details>
<summary>
Link 2
</summary>
Sublink 1
Sublink 2
</details>
</div>
I believe part of the code works correctly as I can identify the total number of single and second-level link structures. I haven't figured out how to translate the data to the final version I need.
The code you posted doesn't produce the output you posted. The code actually produces this:
Link 1
Link 2
<details>
Link 1
Link 2
Sublink 1
Sublink 2
</details>
I guess you don't want Link 1 and Link 2 in the <details> section.
There's a problem with how you are using XPath selectors:
child.xpath("//li/ul")
searches starting at the root of the document, not at the child element. Instead you need to use:
child.xpath(".//li/ul")
if you want to search starting at the child element.
Here's the cleaned up code that should produce the output you need:
require 'nokogiri'
source_file = Nokogiri.XML(File.read("navigation.inc").encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))
navigation = ""
if source_file.xpath("//ul[@id = 'nav']").length > 0
  navcontain = source_file.xpath("//ul[@id = 'nav']/li")
  navcontain.each do |child|
    if child.xpath(".//li and count(*) = 2")
      navigation += "<details>"
      child.xpath(".//ul/li/a").each do |grandchild|
        navigation += grandchild.to_s
      end
      navigation = navigation + "</details>"
    else
      # not sure how that's supposed to work based on your input file example
      navigation = navigation + source_file.xpath("//ul[@id = 'nav']/li/a").to_s
    end
  end
end
puts navigation
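For comparison, here is a rough sketch of the whole transformation with every lookup made relative to the current list item. It is written in Python with the standard library's xml.etree.ElementTree rather than Nokogiri, so it illustrates the idea and is not a drop-in replacement. The <a> elements and their href values are assumptions about the original markup, and convert and link are names invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the <a> elements are an assumption about the original markup.
SOURCE = """
<ul id="nav">
  <li><a href="#1">Link 1</a></li>
  <li>
    <a href="#2">Link 2</a>
    <ul>
      <li><a href="#s1">Sublink 1</a></li>
      <li><a href="#s2">Sublink 2</a></li>
    </ul>
  </li>
</ul>
"""


def link(a):
    """Serialize one <a> element, dropping the whitespace that follows it."""
    return ET.tostring(a, encoding="unicode").strip()


def convert(xml_text):
    nav = ET.fromstring(xml_text)
    out = ['<div id="nav">']
    for li in nav.findall("./li"):        # top-level items only
        top = li.find("./a")              # this item's own link
        subs = li.findall("./ul/li/a")    # links in this item's submenu only
        if subs:
            out.append("<details>")
            out.append("<summary>" + link(top) + "</summary>")
            out.extend(link(a) for a in subs)
            out.append("</details>")
        else:
            out.append(link(top))
    out.append("</div>")
    return "\n".join(out)


print(convert(SOURCE))
```

Because each path starts with "./", a lookup never escapes the item being processed, so links from other items cannot leak into a <details> block. The sketch assumes every top-level item contains a direct <a> child.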

HTML::Element not finding all elements

I have this snippet of html:
<li class="result-row" data="2">
<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2018-12-04 09:21" title="Tue 04 Dec 09:21:50 AM">Dec 4</time>
Link Text
and this perl code (not production, so no quality comments are necessary)
my $root = $tree->elementify();
my @rows = $root->look_down('class', 'result-row');
my $item = $rows[0];
say $item->dump;
my $date = $item->look_down('class', 'result-date');
say $date;
my $title = $item->look_down('class', 'result-title hdrlnk');
All outputs are as I expected except $date isn't defined.
When I look at the $item->dump, it looks like the time element doesn't show up in the output. Here's a snippet of the output from $item->dump where I would expect to see a <time...> element. All it shows is the text from the time element.
<li class="result-row" data="2"> @0.1.9.3.2.0
  <a class="result-image gallery empty" href="https://localhost/1.html"> @0.1.9.3.2.0.0
  <p class="result-info"> @0.1.9.3.2.0.1
    <span class="icon icon-star" role="button"> @0.1.9.3.2.0.1.0
      " "
      <span class="screen-reader-text"> @0.1.9.3.2.0.1.0.1
        "favorite this post"
      " "
    " Dec 4 "
    <a class="result-title hdrlnk" data="2" href="https://localhost/1.html"> @0.1.9.3.2.0.1.2
      "Link Text..."
    " "
...
I've not used HTML::Element before. I rtfmed and didn't see any tag exclusions and I did a search of the package code for tags white/black lists (which wouldn't make sense, but neither does leaving out the time tag).
Does anyone know why the time element is not showing up in the dump and any search for it turns up nothing?
As an fyi, the rest of the code searches and finds elements without issue, it just appears to be the time tag that's missing.
HTML::TreeBuilder does not support HTML5 tags. Consider Mojo::DOM as an alternative that keeps up with the living HTML standard. I can't show how your whole code would look with Mojo::DOM since you've only shown a piece, but the Mojo::DOM equivalent of look_down is find (returns a Mojo::Collection arrayref) or at (returns the first element found or undef), both taking a CSS selector.

selenium, xpath: How to select a node within node?

I have a webpage that has a structure like this:
<div class="l_post j_l_post l_post_bright "...>
...
<div class="j_lzl_c_b_a core_reply_content">
<li class="lzl_single_post j_lzl_s_p first_no_border" ...>
<div class="lzl_cnt">
content
</div>
</li>
<li class="lzl_single_post j_lzl_s_p first_no_border" ...>
...
</li>
</div>
</div>
<div class="l_post j_l_post l_post_bright "...>
...(contain content, same as above)
</div>
...
Currently I could select all the content in one step like this:
for i in driver.find_elements_by_xpath('//*[@class="lzl_cnt"]'):
    print(i.text)
But as you can see, the webpage consists of repetitive blocks (<div class="l_post j_l_post l_post_bright "...>...</div>) that contain the content I need. I therefore want to get the content of each block separately, along with other information that differs between the blocks. Moreover, I want the contents of each <li class="lzl_single_post"...> kept separate so that they are easier to process later. I tried this:
items = []
# get each block
for sel in driver.find_elements_by_xpath('//div[@class="l_post j_l_post l_post_bright "]'):
    name = sel.find_element_by_css_selector('.d_name').text
    try:
        content = sel.find_element_by_css_selector('.j_d_post_content').text
    except Exception:
        content = ''
    try:
        reply = []
        # get each post within the specific block
        for i in sel.find_elements_by_xpath('//*[@class="lzl_cnt"]'):
            reply.append(i.text)
    except Exception:
        reply = []
    items.append({'name': name, 'content': content, 'reply': reply})
But the result shows that I am getting all the replies on the webpage every time the outer for-loop runs instead of a set of replies for each individual block that I wanted
Any suggestions?
Just add . (the context node) to the start of the XPath:
sel.find_elements_by_xpath('.//*[@class="lzl_cnt"]')
Note that //*[@class="lzl_cnt"] means all nodes in the DOM with the class name "lzl_cnt", while .//*[@class="lzl_cnt"] means all nodes with the class name "lzl_cnt" that are descendants of sel.
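The difference between a document-wide search and a context-relative one can be tried without a browser. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in for Selenium, so it is an analogy rather than Selenium code: ElementTree only accepts relative paths on an element, so the document-wide case is shown by searching from the root. The page structure is a simplified, hypothetical version of the one in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page: two blocks, each with its own replies.
PAGE = """
<root>
  <div class="l_post">
    <div class="lzl_cnt">reply A1</div>
    <div class="lzl_cnt">reply A2</div>
  </div>
  <div class="l_post">
    <div class="lzl_cnt">reply B1</div>
  </div>
</root>
"""

root = ET.fromstring(PAGE)

# Searching from the document root returns every reply on the page.
all_replies = [e.text for e in root.findall(".//*[@class='lzl_cnt']")]

# Searching from each block returns only that block's replies.
per_block = [
    [e.text for e in block.findall(".//*[@class='lzl_cnt']")]
    for block in root.findall("./div[@class='l_post']")
]

print(all_replies)
print(per_block)
```

The second list is grouped per block, which is the shape the items list in the question is meant to have.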

How to remove trailing </li> tags from wp_nav_menu walker

I am trying to migrate my website to WordPress and I am stuck on a problem: when I use the example given by RCV (Wordpress change header navigation list items to div), I get trailing </li> tags and I can't figure out how to remove them. This is the output I get:
<div class="top-left home">
<div class="frame1">
<span class="click"></span>
</div></li>
<div class="frame2"><h1 class="fittext1">Text<br/>Text<br/>Text</h1></div></li>
<div class="frame3"><span class="click"></span>
<h3 class="fittext3 bottomfull">text<span class="rightfull">></span></h3>
</div></li>
</div>
Any help would be most appreciated.
You're getting the trailing li tag because you've added the start_el method to your custom walker but you haven't added the end_el so it's using the default. Add the following to your custom walker class.
function end_el( &$output, $item, $depth = 0, $args = array() ) {
    $output .= "</div>\n";
}
You will then need to remove the closing div tag from your start_el. Alternatively you could set end_el to $output .= ""; but this isn't recommended.

How can I extract information from an HTML file using Perl regular expressions?

I have two files, XML and an HTML and need to extract data from these on certain patterns.
My XML file is pretty well formatted and I can use readline to read a line and search data between tags.
if($line =~ /\<tag1\>$varvalue\<\/tag1\>/)
However, my HTML file has some of the worst code I have seen, and it looks like this:
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
<div class="address">
<i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
</div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011" title="Dream House" onmouseover="mB(event, 771204354);" >**Dream House**</a>
<span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>
<div class="times">
**1:00 PM,**
</div>
Now from this file I need to pick data which is shown in bold.
I can use Perl regular expressions to search for data in this file.
RegEx match open tags except XHTML self-contained tags
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Using regular expressions to parse HTML: why not?
When you are done reading those come back :)
Edit: to actually solve your problem, take a look at this module:
http://perlmeme.org/tutorials/html_parser.html
A sample that parses an HTML file:
#!/usr/local/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('C:\Users\Stefanos\workspace\HTML_Parser_Test\test.html');
my @divs = $tree->find('div');   # all div elements in the document
$tree->delete;                   # free the tree once you are done with it
In this example I just used your tags as the main body of an .html file. The divs are stored in the @divs array. Since I have no idea which text you want to find, because ** is not an element, I can't help you further.
P.S. I had never used this module before and put this together in 5 minutes, so it is not hard to parse the HTML file and find whatever you want.
A regex to match a specific tag and store its contents in $1:
if ($subject =~ m!<tagname[^>]*>(.*?)</tagname>!s) {
    # Successful match
}
You will soon run into the limitations of this approach when you have nested elements, though.
Replace tagname with the actual tag, e.g. in your case i, a, span or div. For div, however, the match ends at the first closing </div>, so for a div that contains another div you get truncated contents, which is not what you want.
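The nesting limitation is easy to reproduce. Here is a small Python sketch of the same pattern, using the re module with re.S playing the role of Perl's /s modifier. The input string is made up for the illustration.

```python
import re

# Hypothetical input: an outer div that contains an inner div.
HTML = '<div class="outer"><div class="inner">text</div> tail</div>'

# Same pattern as above, with "div" substituted for tagname.
pattern = re.compile(r"<div[^>]*>(.*?)</div>", re.S)

match = pattern.search(HTML)
captured = match.group(1)

# The non-greedy match stops at the FIRST </div>, which closes the inner div,
# so the capture is cut short and " tail" is lost.
print(captured)
```

Switching to a greedy .* does not help either, since it would then run to the last </div> in the whole document. This is why the answers here recommend a real parser.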
Parsing XML and HTML using regular expressions is a fool's errand. There are many simple to use Perl modules for parsing HTML. Here is something using HTML::TokeParser::Simple. I've omitted the code to associate movies and showtimes with theaters (because I have no intention of building an appropriate input file):
#!/usr/bin/env perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);
my @theaters;
while (my $div = $parser->get_tag('div')) {
    my $class = $div->get_attr('class');
    next unless defined($class) and $class eq 'theater';
    my %record;
    $record{theater} = $parser->get_text('/a');
    $record{address} = $parser->get_text('/i');
    s{(?:^\s+)|(?:\s+\z)}{} for values %record;
    push @theaters, \%record;
}
use YAML;
print Dump \@theaters;
__DATA__
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
<div class="address">
<i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
</div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011" title="Dream House" onmouseover="mB(event, 771204354);" >**Dream House**</a>
<span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>
<div class="times">
**1:00 PM,**
</div>
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**Some other theater*</a></h2>
<div class="address">
<i>**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**</i>
</div>
</div>
Output:
[sinan@macardy]:~/tmp> ./tt.pl
---
- address: '**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**'
  theater: '**University Village 3**'
- address: '**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**'
  theater: '**Some other theater*'
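For readers who want to experiment with the same token-driven idea outside Perl, here is a rough equivalent in Python using only the standard library's html.parser. It is a sketch under assumptions, not a port of HTML::TokeParser::Simple: the class name TheaterParser and the helper parse_theaters are invented for this example, and the sample input is a trimmed copy of the question's markup without the ** markers.

```python
from html.parser import HTMLParser


class TheaterParser(HTMLParser):
    """Collect the name and address from each <div class="theater"> block."""

    def __init__(self):
        super().__init__()
        self.records = []
        self.depth = 0      # div nesting depth inside a theater block
        self.field = None   # record key that incoming text belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                self.depth += 1
            elif dict(attrs).get("class") == "theater":
                self.depth = 1
                self.records.append({"theater": "", "address": ""})
        elif self.depth and tag == "a":
            self.field = "theater"
        elif self.depth and tag == "i":
            self.field = "address"

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag in ("a", "i"):
            self.field = None

    def handle_data(self, data):
        if self.field:
            self.records[-1][self.field] += data


SAMPLE = """
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >University Village 3</a></h2>
<div class="address">
<i>3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321</i>
</div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011" title="Dream House">Dream House</a>
</div>
"""


def parse_theaters(html):
    """Return one dict per theater block, with surrounding whitespace trimmed."""
    parser = TheaterParser()
    parser.feed(html)
    parser.close()
    return [{k: v.strip() for k, v in r.items()} for r in parser.records]


print(parse_theaters(SAMPLE))
```

As in the Perl version, the movie link in the mtitle div is ignored because it falls outside any theater block.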