How to get Nokogiri to scrape text from span in Ruby

I'm trying to scrape information from a website using Nokogiri and Curb, but I can't work out the right selector for the element I want. I'm trying to scrape the API key, which appears at the bottom of the HTML code as "xxxxxxx".
The HTML code is:
<body class="html not-front logged-in no-sidebars page-app page-app- page-app-8383900 page-app-keys i18n-en" data-twttr-rendered="true">
<div id="skip-link"></div>
<div id="page-wrapper">
<!--
Code for the global nav
-->
<nav id="globalnav" class="without-subnav"></nav>
<nav id="subnav"></nav>
<section id="hero" class="hero-short"></section>
<section id="gaz-content">
<div class="container">
::before
<div id="messages"></div>
<div id="gaz-content-wrap-outer" class="row">
::before
<div id="gaz-content-wrap-inner" class="span12">
<div class="row">
::before
<div class="article-wrap span12">
<article id="gaz-content-body" class="content">
<header></header>
<div class="header-action"></div>
<div class="tabs"></div>
<div class="d-block d-block-system g-main">
<div class="app-details">
<h2>
Application Settings
</h2>
<div class="description"></div>
<div class="app-settings">
<div class="row">
::before
<span class="heading">
Consumer Key (API Key)
</span>
<span>
xxxxxxxxx
</span>
All I can seem to get is the "content" text.
My code looks like:
consumer = html.at("#gaz-content-body")['class']
puts consumer
I'm not sure what selector to use to reach the span and then its text. All I can get Nokogiri to print is "content".

In this case we need the second span, the one that follows span class="heading", inside the div class="app-settings". The selector is fairly general, but specific enough here. I'm using search instead of at to retrieve both spans and then take the second one:
# Gets the 2 span elements under <div class='app-settings'>.
res = html.search('#gaz-content-body .app-settings span')
# Use .text to get the contents of the 2nd element.
res[1].text.strip
# => "xxxxxxxx"
You can also use at to target the same element directly:
res = html.at("#gaz-content-body .app-settings span:nth-child(2)")
res.text.strip
# => "xxxxxxxx"

Related

Traversing the DOM with querySelector

I'm using the statement document.querySelector("[data-testid='people-menu'] div:nth-child(4)") in the console to give me the below HTML snippet:
<div>
<span class="jss1">
<div class="jss2">
<p class="jss3">Owner</p>
</div>
</span>
<div class="jss4">
<div class="5" title="User Title">
<p class="jss6">UT</p>
</div>
<div class="jss7">
<p class="jss82">User Title</p>
<span class="jss9">Project Manager</span>
</div>
</div>
</div>
I'd like to extend the statement in the console to extract the title "User Title" but can't figure out what combination of nth-child or nextSibling (or something else) to use. The closest I've gotten is:
document.querySelector("[data-testid='people-menu'] div:nth-child(4) span:nth-child(1)")
which gives me the span with class jss1.
I expected document.querySelector("[data-testid='people-menu'] div:nth-child(4) span:nth-child(1).nextSibling") to give me the div with class jss4, but it returns null.
I can't use class selectors because those are generated dynamically at build.
Why not just add [title] onto your querySelector?
document.querySelector("[data-testid='people-menu'] div:nth-child(4) [title]")
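You can then read whatever you need from that element, for example its title through the .title property or getAttribute('title'). This assumes title is a unique attribute within this section of the HTML. As for the null result: nextSibling is a DOM property, not selector syntax, so inside the selector string ".nextSibling" is parsed as a class selector and matches nothing.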

Hide main div in which sub div contains specific text

I'm trying to make my first Tampermonkey script.
Here's an example of an HTML page:
<div class="a">
<div class="b">
"Hello world"
</div>
<div class="n">
"Test"
</div>
</div>
<div class="a">
<div class="d">
<div class="e">
...
<div class="n">
"Hello world"
</div>
</div>
</div>
</div>
I found this topic very interesting, but I can't make it fit my requirements: Hiding div that contains specific string
I would like to hide a div class="a" ONLY if it contains a div class="n" that contains the text "Hello world".
Do I need to loop over all the div class="a" elements and look for a class="n" containing "Hello world"? I'd appreciate some help.
Hmmmm, I think it needs a few adjustments...
Here's my code:
$(window).load(function(){
$('._5g-l:contains("Publication suggérée")').closest('._5jmm _5pat _3lb4 m_95yeui-j _x72').hide();
});
The associated code from the web page:
<div data-fte="1" data-ftr="1" class="_5jmm _5pat _3lb4 m_95yeui-j _x72" id="hyperfeed_story_id_581d0a6f0b12a3832990101" data-testid="fbfeed_story" [...]>
<div class="_4-u2 mbm _5v3q _4-u8" id="u_jsonp_2_1f">
<div class="_3ccb _4-u8" [...]>
<div></div>
<div class="userContentWrapper _5pcr" role="article" aria-label="Actualité">
<div class="_1dwg _1w_m">
<div class="_5g-l"><span>Publication suggérée</span></div>
[...]
Also, I can't add this to the web page:
<script src="https://code.jquery.com/jquery-3.1.0.min.js"></script>
When I try to execute the command in the console:
$('._5g-l:contains("Publication suggérée")').closest('._5jmm _5pat _3lb4 m_95yeui-j _x72').hide();
(unknown) Uncaught Error: <![EX[["Tried to get element with id of \"%s\" but it is not present on the page.","._5g-l:contains(\"Publication suggérée\")"]]]>
at h (https://www.facebook.com/rsrc.php/v3/yZ/r/AveNRnydIl_.js:36:166)
at i (https://www.facebook.com/rsrc.php/v3/yZ/r/AveNRnydIl_.js:36:293)
at <anonymous>:1:1
I think this will help you. You want .a to be hidden when it contains an .n with the text "Hello world". I edited your HTML and added comments showing which div will be hidden. With your HTML, the first one will not be hidden because its .n contains "Test".
If you have any questions or find anything wrong with my answer, ask me. :)
UPDATE: The problem in your code is that the parent div has five classes (_5jmm, _5pat, _3lb4, m_95yeui-j, _x72), but in jQuery you wrote closest('._5jmm _5pat _3lb4 m_95yeui-j _x72').hide(). Separated by spaces, that string is a descendant selector rather than a list of classes, so it matches nothing. Passing just one of the classes to closest is enough:
$(window).on('load', function() {
$('._5g-l:contains("Publication suggérée")').closest('._5jmm').hide();
});
<script src="https://code.jquery.com/jquery-3.1.0.min.js"></script>
<div data-fte="1" data-ftr="1" class="_5jmm _5pat _3lb4 m_95yeui-j _x72" id="hyperfeed_story_id_581d0a6f0b12a3832990101" data-testid="fbfeed_story" [...]>
<div class="_4-u2 mbm _5v3q _4-u8" id="u_jsonp_2_1f">
<div class="_3ccb _4-u8" [...]>
<div></div>
<div class="userContentWrapper _5pcr" role="article" aria-label="Actualité">
<div class="_1dwg _1w_m">
<div class="_5g-l"><span>Publication suggérée</span></div>
</div>
</div>
</div>
</div>
</div>

Using Nokogiri's CSS method to get all elements within an alt tag

I am trying to use Nokogiri's CSS method to get some names from my HTML.
This is an example of the HTML:
<section class="container partner-customer padding-bottom--60">
<div>
<div>
<a id="technologies"></a>
<h4 class="center-align">The Team</h4>
</div>
</div>
<div class="consultant list-across wrap">
<div class="engineering">
<img class="" src="https://v0001.jpg" alt="Person 1"/>
<p>Person 1<br>Founder, Chairman & CTO</p>
</div>
<div class="engineering">
<img class="" src="https://v0002.png" alt="Person 2"/></a>
<p>Person 2<br>Founder, VP of Engineering</p>
</div>
<div class="product">
<img class="" src="https://v0003.jpg" alt="Person 3"/></a>
<p>Person 3<br>Product</p>
</div>
<div class="Human Resources & Admin">
<img class="" src="https://v0004.jpg" alt="Person 4"/></a>
<p>Person 4<br>People & Places</p>
</div>
<div class="alliances">
<img class="" src="https://v0005.jpg" alt="Person 5"/></a>
<p>Person 5<br>VP of Alliances</p>
</div>
What I have so far in my people.rake file is the following:
staff_site = Nokogiri::HTML(open("https://www.website.com/company/team-all"))
all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)
I am having a little trouble getting the value of each alt attribute (the name of the person), as the img is nested under a few divs.
Currently, using div.consultant, I get all the names plus the roles, e.g. Person 1Founder, Chairman & CTO, instead of just the person's name from alt.
How could I get just the value of alt?
Your desired output isn't clear and the HTML is broken.
Start with this:
require 'nokogiri'
doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]
Using text on the output of css isn't a good idea. css returns a NodeSet. text against a NodeSet results in all text being concatenated, which often results in mangled text content forcing you to figure out how to pull it apart again, which, in the end, is horrible code:
doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"
This behavior is documented in NodeSet#text:
Get the inner text of all contained Node objects
Instead, use text (AKA inner_text or content) against the individual nodes, resulting in the exact text for that node, that you can then join as you want:
Returns the content for this Node
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.

HTML element to contain id or name from ko.observable using foreach

Below I have a for-each loop using knockout.js.
<div data-bind="foreach:Stuff">
<div class="row">
<span data-bind="text: $data.name"></span>
</div>
</div>
I need to have the HTML Element with an id or name or something that reflects a unique value related to the $data.name value, as another method runs asynchronously, and needs to know which HTML element to update.
Ideally, it would look something like this, I guess:
<div data-bind="foreach:Stuff">
<div class="row">
<span id="data-bind='text: $data.name'" data-bind="text: $data.name"></span>
</div>
</div>
I have found a knockout syntax that applies values during runtime to specified attributes:
<div data-bind="foreach:Stuff">
<div class="row">
<span data-bind="attr: { id: $data.name}"></span>
</div>
</div>
Are you looking for this?
<div data-bind="foreach:Stuff">
<div class="row">
<span data-bind="text: $data.name, attr: { id: $data.name }"></span>
</div>
</div>
Here name is an observable, I believe, so whenever name changes in Stuff the text updates automatically. The attr: { id: ... } part simply gives the element a dynamic id using the available bindings.

Parse html page with mechanize to receive the appropriate array

I have the following html code on the page received by mechanize (agent.get):
<div class="b-resumehistorylist-views">
<!-- first date start-->
<div class="b-resumehistory-date">date1</div>
<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time1</div>
company1</div>
<!-- second date start -->
<div class="b-resumehistory-date">date2</div>
<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time2</div>
company2
</div>
<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time3</div>
company3</div>
<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time4</div>
company4</div>
<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time5</div>
company5</div>
<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time6</div>
company6</div>
<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time7</div>
company7</div>
...
</div>
I need to search inside the div with class="b-resumehistorylist-views" each date.
Then find all divs between two div-dates and link each item to this particular date.
The problem is that each item (div class="b-resumehistory-company") is not nested inside its date div (div class="b-resumehistory-date"); the two are siblings.
In the end I need to get the following array:
array = [ [date1, time1, company1, companylink1], [date2, time2, company2, companylink2], [date2, time3, company3, companylink3],[date2, time4, company4, companylink4] ]
I know that I must use method search with text() option, but I cannot find the solution.
My code right now can parse all companies information between div class=b-resumehistory-company, but I need to find right date.
It would be the same thing as before, just some of the class attributes have been changed:
doc = agent.get(someurl).parser
doc.css('.b-resumehistory-company').map{|x| [x.at('./preceding-sibling::div[@class="b-resumehistory-date"][1]').text , x.at('.b-resumehistory-time').text, x.at('a').text, x.at('a')[:href]]}