Python Selenium Getting link_text from an anchor that is inline - html

In Selenium, how do I correctly write an XPath or CSS selector that would
parse HTML such as
<div class="fxg-rte " style="color:;" data-emptytext="Rich Text">
<p>United States | English
| Español</p>
<p>China | English
| 简体中文</p>
<p>Mexico | English
| Español</p>
<p>India | English</p>
<p>Canada | English
| Français</p>
</div>
to do the following:
Find any <p> element that contains the text "United States"
then within the element find any link_text that has "English"
then click that link.
So specifically I want to look at link_text only within tags that meet a
given criteria.

Try the XPath below:
//div[@data-emptytext='Rich Text']//p
There will be several p tags, so you may have to use find_elements instead of find_element.
In code, something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome; any WebDriver works
driver.maximize_window()
driver.get("https://www.fedex.com/global/choose-location.html")
# Every <p> inside the first rich-text block on the page.
paragraphs = driver.find_elements(By.XPATH, "(//div[contains(@class, 'richtext parbase section')])[1]/descendant::p")
for p in paragraphs:
    inner_html = p.get_attribute('innerHTML')
    print(inner_html)
    if "United States" in inner_html:
        print("matched")
        # The leading "." keeps the search inside this <p>; a path starting with "//" would search the whole page.
        lang_href = p.find_element(By.XPATH, ".//a[contains(., 'English')]")
        lang_href.click()
        break
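The same lookup can probably also be written as a single XPath, which avoids the Python loop. The sketch below is one way to build such an expression; link_xpath and click_language_link are hypothetical helper names (not part of Selenium), and the builder assumes neither argument contains a single quote.

```python
def link_xpath(paragraph_text, link_text):
    """Build an XPath for an <a> containing link_text inside a <p> containing paragraph_text."""
    return (
        f"//p[contains(normalize-space(.), '{paragraph_text}')]"
        f"//a[contains(normalize-space(.), '{link_text}')]"
    )


def click_language_link(driver, paragraph_text, link_text):
    """Click the first matching link; driver is an already-open Selenium WebDriver."""
    # Imported here so the XPath builder above can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    driver.find_element(By.XPATH, link_xpath(paragraph_text, link_text)).click()


expression = link_xpath("United States", "English")
print(expression)
```

With a driver already on the page, click_language_link(driver, "United States", "English") would click the English link in the United States paragraph, assuming the links are descendants of the matching p element.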

Retrieve all names from html tags using BeautifulSoup

I managed to set up Beautiful Soup and find the tags that I needed. How do I extract all the names in the tags?
tags = soup.find_all("a")
print(tags)
After running the above code, I got the following output
[Alfred the Great, <a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" title="Elizabeth I of England">Queen Elizabeth I</a>, Family tree of Scottish monarchs, Kenneth MacAlpin]
How do I retrieve the names (Alfred the Great, Queen Elizabeth I, Kenneth MacAlpin, etc.)? Do I need to use a regular expression? Using .string gave me an error.
You can iterate over the tags and use tag.get('title') to get the title value.
Some other ways to do the same:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
No need to use re. You can grab all the names by iterating over all the a tags and then reading the title attribute, or calling get_text() or .find(text=True).
html='''
<html>
<body>
<a href="/wiki/Alfred_the_Great" title="Alfred the Great">
Alfred the Great
</a>
,
<a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" title="Elizabeth I of England">
Queen Elizabeth I
</a>
,
<a href="/wiki/Family_tree_of_Scottish_monarchs" title="Family tree of Scottish monarchs">
Family tree of Scottish monarchs
</a>
,
<a href="/wiki/Kenneth_MacAlpin" title="Kenneth MacAlpin">
Kenneth MacAlpin
</a>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
#print(soup.prettify())
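for name in soup.find_all('a'):
    # get_text() returns the visible link text, e.g. "Queen Elizabeth I"
    txt = name.get_text(strip=True)
    # OR read the title attribute, which gives "Elizabeth I of England" for the second link:
    # txt = name.get('title')
    print(txt)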
Output:
Alfred the Great
Queen Elizabeth I
Family tree of Scottish monarchs
Kenneth MacAlpin

Why is contains(text(), "string" ) not working in XPath?

I have written this expression //*[contains(text(), "Brand:" )] for the below HTML code.
<div class="info-product mt-3">
<h3>Informazioni prodotto</h3>
Brand: <span class="brand_title font-weight-bold text-uppercase">Ava</span><br> SKU: 8002910009960<br> Peso Lordo: 0.471 kg <br> Dimensioni: 44.00 × 145.00 × 153.00 mm<br>
<p class="mt-2">
AVA BUCATO A MANO E2 GR.380</p>
</div>
The XPath that I have written is not working. I want to select the node that contains the text Brand:. Can someone tell me my mistake?
Your XPath,
//*[contains(text(), "Brand:")]
in XPath 1.0 will select all elements whose first text node child contains a "Brand:" substring. In XPath 2.0 it is an error to call contains() with a sequence of more than one item as the first argument.
This XPath,
//*[text()[contains(., "Brand:")]]
will select all elements with a text node child whose string value contains a "Brand:" substring.
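To see why the two expressions differ, it may help to list the div's text-node children. The sketch below does not run XPath at all; it emulates the two tests in plain Python with the standard library's ElementTree. The helper names are hypothetical, and the snippet is a shortened, well-formed version of the question's HTML (with <br/> instead of <br>) so that the XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the question's markup (an assumption for this sketch).
SNIPPET = """<div class="info-product mt-3">
<h3>Informazioni prodotto</h3>
Brand: <span class="brand_title">Ava</span><br/> SKU: 8002910009960<br/>
<p class="mt-2">AVA BUCATO A MANO E2 GR.380</p>
</div>"""


def text_nodes(elem):
    """Return the element's direct text-node children in document order."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t is not None]


def first_text_contains(elem, needle):
    """Emulate contains(text(), needle) in XPath 1.0: only the first text node is tested."""
    nodes = text_nodes(elem)
    return bool(nodes) and needle in nodes[0]


def any_text_contains(elem, needle):
    """Emulate text()[contains(., needle)]: every text node is tested."""
    return any(needle in t for t in text_nodes(elem))


div = ET.fromstring(SNIPPET)
print(first_text_contains(div, "Brand:"))  # False
print(any_text_contains(div, "Brand:"))    # True
```

The first text node of the div is just the line break before the h3 element, which is why the original expression finds nothing there, while "Brand:" sits in the text node that follows the h3.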
See also
XPath 1.0 vs 2.0+ different contains() behavior explanation
Testing text() nodes vs string values in XPath

How do I make my interpolation binding bold in my HTML using Angular?

I have the following code where I need to make the profile.userId bold:
<p class="profile__last-login" *ngIf="profile.lastLoggedIn">
{{'intranet.profile.dashboard.lastLoggedIn' | messageBundle: profile.userId + ',' + (profile.lastLoggedIn | date: 'MM/dd/yyy h:mma') }}
</p>
When displayed on the page, it should say "User (username), last logged in on MM/dd/yyy h:mma" with the username in bold, but I'm not sure how to style the profile.userId within a binding?
You could split the text field. You do not need to keep the binding all together. However, I'm not quite sure what's happening with your pipes so this may not work. My guess is that you should simplify your pipes so that you can break up the text portion as below:
<p class="profile__last-login" *ngIf="profile.lastLoggedIn">
<b>{{'intranet.profile.dashboard.lastLoggedIn' | messageBundle: profile.userId }} </b>
, {{ (profile.lastLoggedIn | date: 'MM/dd/yyy h:mma') }}
</p>

MediaWiki: Forcing new line in templates

I'm going to standardize some picture galleries at some non-public wiki using pure templates. The legacy wiki picture/thumbnail galleries are specified with a lot of boilerplate code (it renders a gallery of pictures with thumbnails underneath):
<center>
<gallery widths="120px" heights="170px" perrow="5">
Image:Pic1.jpg|<center>1</center>
Image:Pic2.jpg|<center>2</center>
Image:Pic3.jpg|<center>3</center>
Image:Pic4.jpg|<center>4</center>
Image:Pic5.jpg|<center>5</center>
Image:Pic6.jpg|<center>6</center>
Image:Pic7.jpg|<center>7</center>
</gallery>
</center>
This is scary. There is an idea of re-implementing the above code with the following template:
{{Photos
| Picture1.jpg = 1
| Picture2.jpg = 2
| Picture3.jpg = 3
| Picture4.jpg = 4
| Picture5.jpg = 5
| Picture6.jpg = 6
| Picture7.jpg = 7
|}}
The template is mostly as follows:
... var definitions, etc ...
<center>
{{#tag:gallery
| {{#forargs: | K | V |
Image:{{#var: K}} {{!}} <center>'' {{#var: V}} ''</center>
}}
| widths = {{#var:WIDTHS}}
| heights = {{#var:HEIGHTS}}
| perrow = {{#var:PERROW}}
}}
</center>
But the problem is that only the first image is rendered, and everything else (Picture 2 ... Picture 7) is rendered under the first image's thumbnail. I suspect the reason is a missing newline character, so the gallery tag may be rendered like this, producing a wrong one-picture gallery:
<gallery widths="120px" heights="170px" perrow="5">
Image:Pic1.jpg|<center>1</center>Image:Pic2.jpg|<center>2</center>Image:Pic3.jpg|<center>3</center>...
It's only an assumption, but I think it is well founded. So the question is:
is there any way of forcing a line break so that the <gallery> tag is rendered as expected?
You can force a newline by adding <nowiki /> like this:
{{#tag:gallery
| {{#forargs: | K | V |<nowiki />
Image:{{#var: K}} {{!}} <center>'' {{#var: V}} ''</center>
}}
| widths = {{#var:WIDTHS}}
| heights = {{#var:HEIGHTS}}
| perrow = {{#var:PERROW}}
}}

How can I extract information from an HTML file using Perl regular expressions?

I have two files, XML and an HTML and need to extract data from these on certain patterns.
My XML file is pretty well formatted and I can use readline to read a line and search data between tags.
if($line =~ /\<tag1\>$varvalue\<\/tag1\>/)
However, my HTML file has some of the worst code I have seen, and it looks like this:
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
<div class="address">
<i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
</div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011" title="Dream House" onmouseover="mB(event, 771204354);" >**Dream House**</a>
<span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>
<div class="times">
**1:00 PM,**
</div>
Now from this file I need to pick data which is shown in bold.
I can use Perl regular expression to search data from this file.
RegEx match open tags except XHTML self-contained tags
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Using regular expressions to parse HTML: why not?
When you are done reading those come back :)
Edit: to actually solve your problem, take a look at this module:
http://perlmeme.org/tutorials/html_parser.html
A sample that parses an HTML file:
#!/usr/local/bin/perl
use HTML::TreeBuilder;
$tree = HTML::TreeBuilder->new;
$tree->parse_file('C:\Users\Stefanos\workspace\HTML_Parser_Test\test.html');
@divs = $tree->find('div');
$tree->delete;
In this example I just used your tags as the main body of an .html file. The divs are stored in the @divs array. Since ** is not an element, I have no idea which text you want to find, so I can't help you further.
P.S. I had never used this module before and put this together in 5 minutes, so it is not hard to parse the HTML file and find whatever you want.
A regex to match a specific tag and store its contents in $1:
if ($subject =~ m!<tagname[^>]*>(.*?)</tagname>!s) {
# Successful match
}
You will soon run into the limitations of this approach when you have nested elements, though.
Replace tagname with the actual tag, e.g. in your case i, a, span, or div. For div, however, the match stops at the first closing </div>, so you capture the start of a nested div rather than the whole element, which is not what you want.
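The nesting limitation is easy to reproduce. The sketch below is a Python translation of the pattern above (re.S plays the role of Perl's /s modifier), run against a made-up two-level div; the sample markup is an assumption for illustration only.

```python
import re

# Hypothetical markup with one div nested inside another.
html = '<div class="outer"><div class="inner">text</div> tail</div>'

# Same idea as the Perl pattern: opening tag, lazy capture, first closing tag.
pattern = re.compile(r"<div[^>]*>(.*?)</div>", re.S)

match = pattern.search(html)
captured = match.group(1)
print(captured)  # <div class="inner">text
```

The lazy capture ends at the first closing </div>, so the result contains the inner div's opening tag but neither its closing tag nor the rest of the outer element.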
Parsing XML and HTML using regular expressions is a fool's errand. There are many simple to use Perl modules for parsing HTML. Here is something using HTML::TokeParser::Simple. I've omitted the code to associate movies and showtimes with theaters (because I have no intention of building an appropriate input file):
#!/usr/bin/env perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);
my @theaters;
while (my $div = $parser->get_tag('div')) {
    my $class = $div->get_attr('class');
    next unless defined($class) and $class eq 'theater';
    my %record;
    $record{theater} = $parser->get_text('/a');
    $record{address} = $parser->get_text('/i');
    s{(?:^\s+)|(?:\s+\z)}{} for values %record;
    push @theaters, \%record;
}
use YAML;
print Dump \@theaters;
__DATA__
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
<div class="address">
<i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
</div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011" title="Dream House" onmouseover="mB(event, 771204354);" >**Dream House**</a>
<span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>
<div class="times">
**1:00 PM,**
</div>
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**Some other theater*</a></h2>
<div class="address">
<i>**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**</i>
</div>
</div>
Output:
[sinan@macardy]:~/tmp> ./tt.pl
---
- address: '**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**'
  theater: '**University Village 3**'
- address: '**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**'
  theater: '**Some other theater*'