I am writing an app for my school newspaper, which is run completely online through WordPress. I am using Hpple to parse the HTML. Given the following:
</div>
<div id="fs-2" class="fs">
<div id="fsh-2" class="fsh">
<div id="fdh-2" class="fdh">Hit or ‘Mis’: Les Miserables Review<br> by *******</div>
<img src="http://www.mabearnews.com/wp-content/uploads/2012/12/les-miserables-2012-wallpapers-les-miserables-2012-movie-32697313-1280-800-600x375.jpg" id="fph-2" class="fph" />
What XPath query string would return the image URL (the img src)?
The HTML content is not well-formed.
Assuming this is the proper HTML content:
<div id="fs-2" class="fs">
<div id="fsh-2" class="fsh">
<div id="fdh-2" class="fdh">Hit or ‘Mis’: Les Miserables Review<br> by *******</div>
<img src="http://www.mabearnews.com/wp-content/uploads/2012/12/les-miserables-2012-wallpapers-les-miserables-2012-movie-32697313-1280-800-600x375.jpg" id="fph-2" class="fph" />
</div>
</div>
You can get the image URL with this XPath query (the img is a direct child of the div with id "fsh-2", and attributes are selected with @):
//div[@id="fsh-2"]/img/@src
I am trying to use Nokogiri's CSS method to get some names from my HTML.
This is an example of the HTML:
<section class="container partner-customer padding-bottom--60">
<div>
<div>
<a id="technologies"></a>
<h4 class="center-align">The Team</h4>
</div>
</div>
<div class="consultant list-across wrap">
<div class="engineering">
<img class="" src="https://v0001.jpg" alt="Person 1"/>
<p>Person 1<br>Founder, Chairman & CTO</p>
</div>
<div class="engineering">
<img class="" src="https://v0002.png" alt="Person 2"/></a>
<p>Person 2<br>Founder, VP of Engineering</p>
</div>
<div class="product">
<img class="" src="https://v0003.jpg" alt="Person 3"/></a>
<p>Person 3<br>Product</p>
</div>
<div class="Human Resources & Admin">
<img class="" src="https://v0004.jpg" alt="Person 4"/></a>
<p>Person 4<br>People & Places</p>
</div>
<div class="alliances">
<img class="" src="https://v0005.jpg" alt="Person 5"/></a>
<p>Person 5<br>VP of Alliances</p>
</div>
What I have so far in my people.rake file is the following:
staff_site = Nokogiri::HTML(open("https://www.website.com/company/team-all"))
all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)
I am having a little trouble getting the value of each alt="" attribute (the name of the person), as the img elements are nested under a few divs.
Currently, using div.consultant, it gets all the names + the roles, i.e. Person 1Founder, Chairman; CTO, instead of just the person's name in alt=.
How could I get just the value of the alt attribute?
Your desired output isn't clear and the HTML is broken.
Start with this:
require 'nokogiri'
doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]
Calling text on the output of css isn't a good idea. css returns a NodeSet, and text against a NodeSet concatenates the text of every node it contains. The result is often mangled content that you then have to figure out how to pull apart again, which, in the end, makes for horrible code:
doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"
This behavior is documented in NodeSet#text:
Get the inner text of all contained Node objects
Instead, use text (AKA inner_text or content) against the individual nodes, resulting in the exact text for that node, that you can then join as you want:
Returns the content for this Node
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
I'm new to AngularJS and am trying to display data from a nested JSON object.
My HTML code is:
<a rel="extranal" data-val="<%rcds%>" ng-repeat="rcds in rcd" class="international" id="<%rcds.id%>">
<span><img ng-src="<% rcds.routes.subroutes %>"/> <% rcds.subroutes[0].xyz%></span>
<div class="departure-time"><% rcds.subroutes[0].abc %></div>
</a>
I want to display the subroutes data in the ng-repeat based on the value of legtype in the JSON. How can I do this?
If you want to show your object as JSON, all you need to write is {{rcds | json}} (or the same expression inside <% %>, the delimiters your template appears to use).
Otherwise, if you want to navigate your nested object, you should do something like:
<div ng-repeat="rcds in rcd">
  <div ng-repeat="route in rcds.routes">
    <!-- route element -->
    <div ng-repeat="depart in route.depart">
      <!-- depart element -->
      <div ng-repeat="subroute in route.subroutes">
        <!-- subroute element -->
      </div>
    </div>
  </div>
</div>
I have this code that is getting data from a cell in a table
<div class="p-s-header">TITLE</div>
<div class="p-content">
<div class="row">
<div class="col-md-12 col-xs-12 col-sm-12">
<span di="76" ></span><br />
<span di="77" ></span><br />
</div>
</div>
</div>
This is producing:
TITLE
(data in cell 76)
(DATA in cell 77)
Now, the data in cell 77 is a link that is too long for the space, so I want to show the words "Click here" (hyperlinked) instead of showing the link itself.
So I wanted to change code, so output looks like:
TITLE
(data in cell 76)
Click here
"Click here" should be built with the data in cell 77. I didn't write this code, but it seems the code to get the data from that cell is:
How would I build it?
I tried a few options, but nothing seems to work:
For example something like this:
<a href=<span di="77" ></span>Link</a><br />
thanks for your help
Try this. I hope it encourages you to study further how HTML and JS work together (PHP is too advanced for now).
Change your code to this:
<div class="p-s-header">TITLE</div>
<div class="p-content">
<div class="row">
<div class="col-md-12 col-xs-12 col-sm-12">
<span id="cell76" ></span><br />
<span id="cell77" >http://www.google.com</span><br />
</div>
</div>
</div>
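The attribute di is not a standard HTML attribute, so this example uses id instead. Older HTML versions did not allow an id to start with a digit, and although HTML5 permits it, such ids are awkward to use in CSS selectors, hence cell76 and cell77.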
At the bottom of the code that you added write the following:
<script>
var link = document.getElementById('cell77').innerHTML;
document.write('<a href="' + link + '">Click here</a>');
</script>
Here is the jsfiddle
https://jsfiddle.net/cz1Ltw3x/1/
I am new to Go. I am using goquery to extract data from an HTML page.
But the problem is the data I am looking for is not bounded by any HTML tag. It is simple text after a <br> tag. How can I extract it?
Edit : Here is HTML code.
<div class="container">
<div class="row">
<div class="col-lg-8">
<p align="justify"><b>Name</b>Priyaka</p>
<p align="justify"><b>Surname</b>Patil</p>
<p align="justify"><b>Adress</b><br>India,Kolhapur</p>
<p align="justify"><b>Hobbies </b><br>Playing</p>
<p align="justify"><b>Eduction</b><br>12th</p>
<p align="justify"><b>School</b><br>New Highschool</p>
</div>
</div>
</div>
From this I want "Priyanka" and "12th".
The following is what you want:
doc.Find(".container").Find("[align=\"justify\"]").Each(func(_ int, s *goquery.Selection) {
	prefix := s.Find("b").Text()
	result := strings.TrimPrefix(s.Text(), prefix)
	println(result)
})
Add the strings package to the imports at the top of your file. If you need a complete code example, check here.
Try querying for the <br> element and getting its siblings:
http://godoc.org/github.com/PuerkitoBio/goquery#Selection.Siblings
I'm trying to scrape information from a website using Nokogiri and Curb, but I can't seem to find the right name or selector to locate what I want to scrape. I'm trying to scrape the API key, which is at the bottom of the HTML code as "xxxxxxx".
The HTML code is:
<body class="html not-front logged-in no-sidebars page-app page-app- page-app-8383900 page-app-keys i18n-en" data-twttr-rendered="true">
<div id="skip-link"></div>
<div id="page-wrapper">
<!--
Code for the global nav
-->
<nav id="globalnav" class="without-subnav"></nav>
<nav id="subnav"></nav>
<section id="hero" class="hero-short"></section>
<section id="gaz-content">
<div class="container">
::before
<div id="messages"></div>
<div id="gaz-content-wrap-outer" class="row">
::before
<div id="gaz-content-wrap-inner" class="span12">
<div class="row">
::before
<div class="article-wrap span12">
<article id="gaz-content-body" class="content">
<header></header>
<div class="header-action"></div>
<div class="tabs"></div>
lass="d-block d-block-system g-main">
<div class="app-details">
<h2>
Application Settings
</h2>
<div class="description"></div>
<div class="app-settings">
<div class="row">
::before
<span class="heading">
Consumer Key (API Key)
</span>
<span>
xxxxxxxxx
</span>
All I can seem to get is the "content" text.
My code looks like:
consumer = html.at("#gaz-content-body")['class']
puts consumer
I'm not sure what to type to select the class and/or span then the input text. All I can get is Nokogiri to put "content".
In this case we need the second span, the one that follows span class="heading" inside div class="app-settings". That description is a bit general, but not too much. I'm using search instead of at to retrieve the two spans and then take the second one:
# Gets the 2 span elements under <div class='app-settings'>.
res = html.search('#gaz-content-body .app-settings span')
# Use .text to get the contents of the 2nd element.
res[1].text.strip
# => "xxxxxxxx"
But you can also use at to target the same:
res = html.at("#gaz-content-body .app-settings span:nth-child(2)")
res.text.strip
# => "xxxxxxxx"