How to extract a specific value from HTML using XPath

The HTML looks like this:
<img alt="Papa's Cupcakeria To Go!" src="" data-old-hires="" class="a-dynamic-image a-stretch-vertical" id="landingImage" data-a-dynamic-image="{&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L.png&quot;:[512,512],&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L._SX425_.png&quot;:[425,425],&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L._SX466_.png&quot;:[466,466],&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L._SY450_.png&quot;:[450,450],&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L._SY355_.png&quot;:[355,355]}" style="max-width:512px;max-height:512px;">
I want to get "https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L.png". Right now I'm using
extract_item(hxs.xpath("//img[@id='landingImage']/@data-a-dynamic-image"))
but that returns the entire value of the attribute.
How can I get only the first URL?

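If you just want the first URL:
# Full attribute value; this assumes it still contains literal &quot; entities
full_content = extract_item(hxs.xpath("//img[@id='landingImage']/@data-a-dynamic-image"))
# Splitting on ";" leaves the first URL in element 1, followed by "&quot"
list_contents = full_content.split(";")
first_image = list_contents[1].replace("&quot", "")
print(first_image)
You can also extract the URL with a regular expression.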
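Since the attribute value is a JSON object whose keys are the image URLs, another option is to parse it instead of splitting on ";". The following is only a minimal sketch: the helper name first_image_url is hypothetical, and it assumes the value you extracted is either entity-encoded (with &quot;) or already decoded, as in the sample above.

```python
import html
import json


def first_image_url(attr_value: str) -> str:
    """Return the first URL key of a data-a-dynamic-image attribute value."""
    # Decode entities such as &quot; in case the value was extracted raw.
    decoded = html.unescape(attr_value)
    # The value is a JSON object mapping image URL -> [width, height].
    sizes_by_url = json.loads(decoded)
    # json.loads preserves key order, so the first key is the first URL.
    return next(iter(sizes_by_url))


# Shortened sample taken from the attribute shown in the question.
raw = (
    '{&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L.png&quot;:[512,512],'
    '&quot;https://images-na.ssl-images-amazon.com/images/I/814vdYZK17L._SX425_.png&quot;:[425,425]}'
)
print(first_image_url(raw))
```

Parsing the JSON avoids depending on where the ";" characters happen to fall, and it gives you the sizes as well if you need them later.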

Related

How to turn img links into a simple code (HTML and CSS)

I want to refer to all my img links by a simple word or code in HTML and CSS.
Example:
//Not like this
<img src="https://img1.com">
<img src="https://img2.com">
<img src="https://img3.com">
//I want to do something a little bit more like this instead
value01 = https://img1.com
value02 = https://img2.com
value03 = https://img3.com
<img src="value01">
<img src="value02">
<img src="value03">
I don't know what to do; I am new to HTML and CSS.
I don't think you can do this in plain HTML, because the <img> tag simply embeds the image named by its src attribute and HTML has no variables. You could perhaps generate the markup with Python. Otherwise, write each image out directly:
<b>
<img src="value1.jpg" alt="Value1">
</b>
Source:
img tag html
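To illustrate the Python suggestion, here is a rough sketch that generates the markup from named values using the standard library's string.Template. The names IMAGE_URLS and PAGE are made up for this example, and the URLs are the placeholders from the question.

```python
from string import Template

# Hypothetical short names mapped to the URLs from the question.
IMAGE_URLS = {
    "value01": "https://img1.com",
    "value02": "https://img2.com",
    "value03": "https://img3.com",
}

# $value01 etc. are placeholders that Template.substitute fills in.
PAGE = Template(
    '<img src="$value01">\n'
    '<img src="$value02">\n'
    '<img src="$value03">\n'
)

html_page = PAGE.substitute(IMAGE_URLS)
print(html_page)
```

You would then save html_page to an .html file or serve it; the browser only ever sees the finished <img> tags.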
There are two ways I can think of.
Note: the first option only works if you save the page with a .php extension.
Method 1: create a separate PHP file and store the image links in variables, like this:
<?php
$img = 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Python-logo-notext.svg/1200px-Python-logo-notext.svg.png';
?>
Next, include this file in the main page (index):
<?php
include "./page/images.php";
?>
<html>
<img src="<?php echo $img; ?>" alt="">
</html>
Method 2: save the image in a folder that is easy to target.
Create a folder inside the folder that holds your main page.
For example, I created a folder called img in the same folder as my index.html and saved the image there with a short name.
To access that image, I would reference it like this:
<img src="img/img.png" alt="">

How do I search for an attribute using BeautifulSoup?

I am trying to scrape a page that contains the following HTML.
<div class="FeedCard urn:publicid:ap.org:db2b278b7e4f9fea9a2df48b8508ed14 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
<div class="FeedCard urn:publicid:ap.org:2f23aa3df0f2f6916ad458785dd52c59 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
As you can see, "FeedCard " is something they have in common. Therefore, I am trying to use a regular expression in conjunction with BeautifulSoup. Here is the code I've tried.
pattern = r"\AFeedCard"
for card in soup.find('div', 'class'==re.compile(pattern)):
    print(card)
    print('**********')
I'm expecting it to give me each one of the divs from above, with the asterisks separating them. Instead, it gives me the entire HTML of the page in a single instance.
Thank you,
There's no need for a regular expression here. Just use a CSS selector or the BS4 API:
from bs4 import BeautifulSoup
html = """\
<div class="FeedCard urn:publicid:ap.org:db2b278b7e4f9fea9a2df48b8508ed14 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
Item 1
</div>
<div class="FeedCard urn:publicid:ap.org:2f23aa3df0f2f6916ad458785dd52c59 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
Item 2
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for card in soup.select(".FeedCard"):
    print(card.text.strip())
Prints:
Item 1
Item 2
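As for why the original attempt misbehaved: 'class'==re.compile(pattern) is an ordinary Python comparison that evaluates to False before BeautifulSoup ever sees it, so the regular expression is never used as a filter. In addition, find returns only the first matching element, and looping over a single element walks its children. If you do want a regular expression, a sketch along these lines should work; the urn values are shortened and the third div is added purely for illustration.

```python
import re

from bs4 import BeautifulSoup

# Shortened sample markup; the "Other" div is a hypothetical non-match.
html = """\
<div class="FeedCard urn:publicid:ap.org:db2b Component-wireStory-0-2-116">Item 1</div>
<div class="FeedCard urn:publicid:ap.org:2f23 Component-wireStory-0-2-116">Item 2</div>
<div class="Other">Item 3</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Comparing a str with a compiled pattern is just False.
print("class" == re.compile(r"\AFeedCard"))

# find_all returns every match; class_ takes the pattern as a keyword argument.
cards = soup.find_all("div", class_=re.compile(r"\AFeedCard"))
texts = [card.get_text(strip=True) for card in cards]
print(texts)
```

Note that the pattern is tested against the individual class names, so it would also match a class such as "FeedCardLarge"; the CSS selector above is the stricter choice.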

Cannot get a tag even though it appears in the HTML

I am working on a scraping job using BeautifulSoup and its find methods. I fetch the page and parse it as follows:
import requests
from bs4 import BeautifulSoup
result = requests.get('https://wuzzuf.net/jobs/p/xgUqkfYngXZL-Senior-Python-Developer-Remote---Part-Time-Cairo-Egypt?o=2&l=sp&t=sj&a=python|search-v3|hpb')
# print(result.status_code)
soup1 = BeautifulSoup(result.content, "html5lib")
sections = soup1.find('section', class_="css-3kx5e2")
divs = sections.find_all('div')
spans = sections.find_all('span')
span = divs[3].find('span', class_='css-47jx3m')
divs[3]
I get the following:
<div class="css-rcl8e5"><span class="css-wn0avc">Salary<!-- -->:</span></div>
However, the original HTML is:
<div class="css-rcl8e5"><span class="css-wn0avc">Salary<!-- -->:</span>
<span class="css-47jx3m"><span class="css-8il94u">Confidential, Hourly Based</span>
</span>
</div>
I need to get the span with class="css-8il94u", which has the text 'Confidential, Hourly Based', but it does not appear in the parsed result.
Thanks.

HTML data update for XML column with new value in SQL Server

I have some experience using XQuery to update XML data, and I tried to apply the same logic to HTML data in SQL Server.
But it is not working as expected.
For example, I have an XML column value (actually HTML data) as below.
DECLARE @template xml = '<div>
<div id="divHeader">Congratulation<div id="Salutation">ravi</div></div><br/>
<div>From now you are a part of the Company<div id="cmpnyUserDetails"></div></div><br/>
<div id="clickSection">Please Click Here to Access Your New Features</div>
</div>'
I would like to change the HTML value of the div with ID "Salutation" to "New Value" and append the href value to a valid link using XQuery.
SET @template.modify('replace value of (//div[id=("Salutation")]/text())[1] with "New Value"')
SELECT @template AS data
But it's not working.
Can someone please suggest to me how to make it happen?
Thanks a ton in advance,
Ravi.
You were close. Notice the @id vs. your id: without the @, the predicate looks for a child element named id rather than the id attribute.
Example:
SET @template.modify('replace value of (//div[@id=("Salutation")]/text())[1] with "New Value"')
SELECT @template AS data
Returns
<div>
<div id="divHeader">Congratulation<div id="Salutation">New Value</div></div>
<br />
<div>From now you are a part of the Company<div id="cmpnyUserDetails" /></div>
<br />
<div id="clickSection">Please Click Here to Access Your New Features</div>
</div>
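The difference between id and @id in a predicate is not specific to SQL Server. As a rough illustration outside the database, the sketch below uses Python's xml.etree.ElementTree, which supports a small XPath subset, on the same template. It is not how SQL Server evaluates XQuery; it only shows that id refers to a child element while @id refers to the attribute.

```python
import xml.etree.ElementTree as ET

template = """<div>
<div id="divHeader">Congratulation<div id="Salutation">ravi</div></div><br/>
<div>From now you are a part of the Company<div id="cmpnyUserDetails"></div></div><br/>
<div id="clickSection">Please Click Here to Access Your New Features</div>
</div>"""

root = ET.fromstring(template)

# [id='...'] looks for a child element named <id>; no div has one.
by_child = root.findall(".//div[id='Salutation']")
print(len(by_child))

# [@id='...'] tests the id attribute and finds the intended div.
by_attr = root.findall(".//div[@id='Salutation']")
print(by_attr[0].text)

# Replace the text, mirroring the "replace value of" statement.
by_attr[0].text = "New Value"
updated = ET.tostring(root, encoding="unicode")
print(updated)
```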

XPath: find a specific link in a page

I'm trying to get the "email to a friend" link from this page using XPath.
http://www.guardian.co.uk/education/2009/oct/14/30000-miss-university-place
The link itself is wrapped up in tags like this:
<li><a class="rollover sendlink" href="http://www.guardian.co.uk/email/354237257" title="Opens an email form" name="&lid={pageToolbox}{Email a friend}&lpos={pageToolbox}{2}"><img src="http://static.guim.co.uk/static/80163/common/images/icon_email-friend.gif" alt="" class="trail-icon" /><span>Send to a friend</span></a></li>
I'm using this for my query, but it's not quite right.
$links = $xpath->query("//a/span[text()='Send to a friend']/@href");
You're trying to get the href of the span there. I think you want:
$links = $xpath->query("//a[span/text()='Send to a friend']/@href");
You need to use something like this (since href is an attribute of a):
$links = $xpath->query("//a[span/text()='Send to a friend']/@href");
The href is an attribute of the anchor, hence you need:
$links = $xpath->query("//a[span[text()='Send to a friend']]/@href");
Try this:
$links = $xpath->query("//a[span='Send to a friend']/@href");
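The same idea, keeping the anchor as the selected node and using the span only as a filter, can be sketched in Python with xml.etree.ElementTree. This is an illustration under assumptions: the fragment is a simplified copy of the markup from the question (the name attribute is left out because its raw "&" is not valid XML), and ElementTree's XPath subset has no /@href step, so the attribute is read with get instead.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the <li> from the question.
fragment = """
<li><a class="rollover sendlink" href="http://www.guardian.co.uk/email/354237257" title="Opens an email form"><img src="http://static.guim.co.uk/static/80163/common/images/icon_email-friend.gif" alt="" class="trail-icon" /><span>Send to a friend</span></a></li>
"""

root = ET.fromstring(fragment)

# Selecting the span first lands on an element that has no href attribute.
span = root.find(".//a/span")
print(span.get("href"))

# Filtering the anchor by its span child keeps the context on <a>.
anchor = root.find(".//a[span='Send to a friend']")
href = anchor.get("href")
print(href)
```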