Is there a way to parse HTML that includes JavaScript in tags in Ruby?

I am working on a web scraping problem in Ruby. I have seen multiple questions and answers about this, but none of them deal with HTML that includes a JavaScript framework, and I cannot figure out how to handle it. I just want to select the HTML and return an array of objects. My script and the HTML are below. The values (name, currency, balance) sit in elements with similar HTML classes, so how can this be done?
content = document.css("div.acc-list").map do |parameters|
name = parameters.at_css("p.s3.bold.row.acDesc").text.strip, # argument?
currency = parameters.at_css(".row.ccy").text.strip, # argument?
balance = parameters.at_css(".row.acyOpeningBal").text.strip # argument?
Account.new name, currency, balance
end
pp content
These HTML fragments are nested inside several other elements, which I think is due to the framework. However, they are all inside a <div class="acc-list">...</div>, and I think I was right to select "div.acc-list" when building the "content" variable.
<!-- HTML for name -->
<td bindonce="" ng-repeat="col in gridOptions.columns" sg-bind-html-compile="col.cellTemplate" bo-class="col.className" bo-style="{width: col.remWidth }"
class="ng-scope icon-two-line-col" style="width: 17.3333rem;">
<div style="width: 17.333333333333332rem" class="first-cell cellText ng-scope">
<i bo-class="{'active':row.selected }" class="i-32 active icon i-circle-account"></i>
<div class="info-wrapper" style="">
<p class="s3 bold" bo-bind="row.acDesc">Name_value</p> # value
<a ui-sref="app.layout.ACCOUNTS.DETAILS.{ID}({id:'091601003439274'})" href="/Bank/accounts/details/BG37FINV91503006938102">
<span bo-bind="row.iban">BG37FINV91503006938102</span>
<i class="i-arrow-right-5x8"></i>
</a>
</div>
</div>
</td>
<!-- HTML for currency -->
<td bindonce="" ng-repeat="col in gridOptions.columns" sg-bind-html-compile="col.cellTemplate" bo-class="col.className" bo-style="{width: col.remWidth }"
class="ng-scope" style="width: 4.4rem;">
<div style="width: 4.4rem" class="text-center cellText ng-scope">
<span bo-bind="row.ccy">EUR</span> # value
</div>
</td>
<!-- HTML for balance -->
<td bindonce="" ng-repeat="col in gridOptions.columns" sg-bind-html-compile="col.cellTemplate" bo-class="col.className" bo-style="{width: col.remWidth }"
class="ng-scope" style="width: 8.73333rem;">
<div style="width: 8.733333333333333rem" class="text-right cellText ng-scope">
<span bo-bind="row.acyAvlBal | sgCurrency">1 523.08</span> # value
</div>
</td>

Using CSS:
require 'nokogiri'
document = Nokogiri::HTML(<<EOT)
<div class="acc-list">
<!-- HTML for name -->
<td>
<div class="first-cell cellText ng-scope">
<div class="info-wrapper">
<!-- # value -->
<p class="s3 bold">Name_value</p>
</div>
</div>
</td>
<!-- HTML for currency -->
<td>
<div class="text-center cellText ng-scope">
<!-- # value -->
<span>EUR</span>
</div>
</td>
<!-- HTML for balance -->
<td>
<div class="text-right cellText ng-scope">
<!-- # value -->
<span>1 523.08</span>
</div>
</td>
</div>
EOT
Now that the DOM is loaded:
content = document.css('div.acc-list').map do |div|
name = div.at("p.s3.bold").text.strip # => "Name_value"
currency = div.at("div.text-center > span").text.strip # => "EUR"
balance = div.at("div.text-right > span").text.strip # => "1 523.08"
[ name, currency, balance ]
end
# => [["Name_value", "EUR", "1 523.08"]]
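A side note on the script in the question, offered as a minimal sketch rather than a diagnosis of the real page: the trailing commas after the first two assignments make Ruby parse the three lines as a single statement, so name ends up holding an array. The string literals below are placeholders standing in for the at_css(...).text.strip calls, and the fixed_* names are hypothetical.

```ruby
# Placeholder strings stand in for the at_css(...).text.strip calls in the question.
# The trailing commas join these three lines into one statement.
name = "Name_value",
currency = "EUR",
balance = "1 523.08"

p name     # => ["Name_value", "EUR", "1 523.08"]
p currency # => "EUR"
p balance  # => "1 523.08"

# Without the trailing commas, each variable gets only its own value.
fixed_name = "Name_value"
fixed_currency = "EUR"
fixed_balance = "1 523.08"

p fixed_name # => "Name_value"
```

This is why the code above drops the commas before building each row.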
Your HTML sample has a lot of extraneous information that obscures the trees in this particular forest. I stripped it out because it wasn't useful. (And, when submitting a question you should automatically do that as part of simplifying the non-essential information so we can all focus on the actual problem.)
These CSS selectors only care about the node name, class and id; the other attributes can be ignored. Class selectors can be chained (as in p.s3.bold) if you need that granular access, but often you can get away with a more general class selector; it just depends on the HTML.
Most XML and HTML parsing is basically the same tactic: Find an outer placeholder, look inside it and iterate grabbing the information needed. I can't demonstrate that completely because your example only has the outer div, but you can probably imagineer the necessary code to handle an inner loop.
at_css is almost equivalent to at, and Nokogiri is smart enough 99.9% of the time to determine whether a selector is CSS or XPath, so I tend toward using at because my fingers are lazy.

Related

Select a node with its children based on its class, and turn it into an object

I want to find out how to scrape website data. This is a part of the html that I am interested in. I am using cheerio for finding the data I need.
<td class="col-item-shopdetail">
<div class="shoprate2 text-right hidden-xs">
<div class="currbox-amount">
<span class="item-searchvalue-curr">SGD</span>
<span class="item-searchvalue-rate text-black">42.0000</span>
</div>
<div class="item-inverserate">TWD 100 = SGD 4.2</div>
<div class="rateinfo">
<span class="item-timeframe">12 hours ago</span>
</div>
</div>
<div class="shopdetail text-left">
<div class="item-shop">Al-Aman Exchange</div>
<div class="item-shoplocation">
<span class="item-location1"><span class="icon icon-location3"></span>Bedok</span>
<span class="item-location2"><span class="icon iconfa-train"></span>Bedok </span>
</div>
</div>
</td>
I want to turn each element with the class "col-item-shopdetail" into an object and store all of them in an array for access.
If possible, it would be accessed like array.item-inverserate, or through a cheerio selector like
$('.col-item-shopdetail').children[0].children[0].children[1]
I have tried looping through the shop names and storing them in an array, then using a second loop to find the rates, and matching rates to names by array index. However, this did not work for some unknown reason: each run prints a different rate for the same name, and the index of a given name changes between runs.
This is close to what I want but it does not work:
how to filter cheerio objects in `each` with selector?
In other words, you want an array of objects representing the elements that have the class .col-item-shopdetail, and each of those objects should have a property corresponding to the .item-inverserate element it contains?
You need the map method:
my_array = $('.col-item-shopdetail').map(function(i, el) {
// Build an object having only one property being the .item-inverserate text content
return {
itemInverserate: $(el).find('.item-inverserate').text()
};
}).get();
// You can also directly target inverserate nodes
// which will exclude empty entries ('shopdetail' that have no 'inverserate')
// Loop over .item-inverserate elements found
// somewhere in a .col-item-shopdetail
// (beware, space matters)
my_array = $('.col-item-shopdetail .item-inverserate').map(function(i, el) {
// Build an object having only one property being the .item-inverserate text content
return {itemInverserate: $(el).text()};
// Note: if all you need is the inverserate value,
// why not skip the intermediate object?
// return $(el).text()
}).get();
Since Cheerio developers have built their API based on jQuery with most of the core methods, we can simply test snippets in the browser ...
my_array = $('.col-item-shopdetail').map(function(i, el) {
return {
itemInverserate: $(el).find('.item-inverserate').text()
};
}).get();
console.log(my_array[0].itemInverserate)
my_array_2 = $('.col-item-shopdetail .item-inverserate').map(function(i, el) {
// Build an object having only one property being the .item-inverserate text content
return {itemInverserate: $(el).text()};
}).get();
console.log(my_array_2[0].itemInverserate)
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<table><tr><td class="col-item-shopdetail">
<div class="shoprate2 text-right hidden-xs">
<div class="currbox-amount">
<span class="item-searchvalue-curr">SGD</span>
<span class="item-searchvalue-rate text-black">42.0000</span>
</div>
<div class="item-inverserate">TWD 100 = SGD 4.2</div>
<div class="rateinfo">
<span class="item-timeframe">12 hours ago</span>
</div>
</div>
<div class="shopdetail text-left">
<div class="item-shop">Al-Aman Exchange</div>
<div class="item-shoplocation">
<span class="item-location1"><span class="icon icon-location3"></span>Bedok</span>
<span class="item-location2"><span class="icon iconfa-train"></span>Bedok </span>
</div>
</div>
</td></tr>
</table>

Using Nokogiri's CSS method to get all elements within an alt tag

I am trying to use Nokogiri's CSS method to get some names from my HTML.
This is an example of the HTML:
<section class="container partner-customer padding-bottom--60">
<div>
<div>
<a id="technologies"></a>
<h4 class="center-align">The Team</h4>
</div>
</div>
<div class="consultant list-across wrap">
<div class="engineering">
<img class="" src="https://v0001.jpg" alt="Person 1"/>
<p>Person 1<br>Founder, Chairman & CTO</p>
</div>
<div class="engineering">
<img class="" src="https://v0002.png" alt="Person 2"/></a>
<p>Person 2<br>Founder, VP of Engineering</p>
</div>
<div class="product">
<img class="" src="https://v0003.jpg" alt="Person 3"/></a>
<p>Person 3<br>Product</p>
</div>
<div class="Human Resources & Admin">
<img class="" src="https://v0004.jpg" alt="Person 4"/></a>
<p>Person 4<br>People & Places</p>
</div>
<div class="alliances">
<img class="" src="https://v0005.jpg" alt="Person 5"/></a>
<p>Person 5<br>VP of Alliances</p>
</div>
What I have so far in my people.rake file is the following:
require 'open-uri'
staff_site = Nokogiri::HTML(URI.open("https://www.website.com/company/team-all"))
all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)
I am having a little trouble getting the value of each alt="" attribute (the name of the person), as the img is nested under a few divs.
Currently, using div.consultant, it gets all the names plus the roles, i.e. Person 1Founder, Chairman; CTO, instead of just the person's name in alt=.
How can I get just the value of alt?
Your desired output isn't clear and the HTML is broken.
Start with this:
require 'nokogiri'
doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]
Using text on the output of css isn't a good idea. css returns a NodeSet. text against a NodeSet results in all text being concatenated, which often results in mangled text content forcing you to figure out how to pull it apart again, which, in the end, is horrible code:
doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"
This behavior is documented in NodeSet#text:
"Get the inner text of all contained Node objects"
Instead, use text (AKA inner_text or content) against the individual nodes, resulting in the exact text for that node, that you can then join as you want:
"Returns the content for this Node"
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.

display data of json nested objects in html using angular js

I am new to AngularJS and am trying to display data from a nested JSON object like this:
[image: the nested JSON object]
My HTML code is:
<a rel="extranal" data-val="<%rcds%>" ng-repeat="rcds in rcd" class="international" id="<%rcds.id%>">
<span><img ng-src="<% rcds.routes.subroutes %>"/> <% rcds.subroutes[0].xyz%></span>
<div class="departure-time"><% rcds.subroutes[0].abc %></div>
</a>
I want to display the subroutes data in the ng-repeat based on the legtype value in the JSON. How can I do this?
If you want to show your object as JSON, all you need to write is {{rcds | json}}.
Otherwise, if you want to navigate your nested object, you should do something like:
<div ng-repeat="rcds in rcd">
<div ng-repeat="route in rcds.routes">
<!-- route element -->
<div ng-repeat="depart in route.depart">
<!-- depart element -->
<div ng-repeat="subroute in route.subroutes">
<!-- subroute element -->
</div>
</div>
</div>
</div>

HTML element to contain id or name from ko.observable using foreach

Below I have a for-each loop using knockout.js.
<div data-bind="foreach:Stuff">
<div class="row">
<span data-bind="text: $data.name"></span>
</div>
</div>
I need to have the HTML Element with an id or name or something that reflects a unique value related to the $data.name value, as another method runs asynchronously, and needs to know which HTML element to update.
Ideally, it would look something like this, I guess:
<div data-bind="foreach:Stuff">
<div class="row">
<span id="data-bind='text: $data.name'" data-bind="text: $data.name"></span>
</div>
</div>
I have found a knockout syntax that applies values during runtime to specified attributes:
<div data-bind="foreach:Stuff">
<div class="row">
<span data-bind="attr: { id: $data.name}"></span>
</div>
</div>
Are you looking for this?
<div data-bind="foreach:Stuff">
<div class="row">
<span data-bind="text: $data.name, attr: { id: $data.name }"></span>
</div>
</div>
Here I believe name is an observable: whenever name changes in Stuff, the text binding updates its value automatically, and attr: { id: ... } simply gives the element a dynamic id using the available bindings.

Meteor, properly rendering HTML within a cursor.forEach

I am coming over from this post
Currently I have this code which is adapted from the mentioned post.
Template.messages.rendered = ->
  this.autorun( (c) ->
    document.bottomCheck = false
    if null != chatDiv and chatDiv.scrollTop + chatDiv.offsetHeight >= chatDiv.scrollHeight
      document.bottomCheck = true
    $("#chat-box").empty()
    messageCursor = Messages.find({}, {sort: {time: 1}})
    messageCursor.forEach((message) ->
      makeMessage(message) # Uses jQuery to insert HTML into our page
    )
    Deps.afterFlush(() ->
      setScrollToBottom() if document.bottomCheck
    )
  )
This is working great to log each message every time new messages come in. The part I am confused about is how to populate my HTML with the proper data. I currently have the following HTML, which calls a template helper to render my messages.
<template name="messages" class="message-style">
<div id="chat-box">
{{#each getMessages}}
<div class="chat-message" id="chat-message-scroll" style="background-color: {{backgroundColor}}">
<div class="row">
<div class="col-md-2 message-name">
<div class="chat-message-name">{{name}}</div>
</div>
<div class="col-md-8 message-contents">
<!-- <br> -->
<div class="chat-message-contents" style="color:{{textColor}}">{{{convertMsg message}}}</div>
</div>
<div class="col-md-2 message-timestamp">
<span class="chat-message-timestamp">
{{#if notSystemMsg this.type}}
{{#if isBookmarked this._id}}
<i class="fa fa-star" id='chat-full-bookmark'></i>
{{else}}
<i class="fa fa-star-o" id='chat-empty-bookmark'></i>
{{/if}}
{{/if}}
{{convertToLocalTime time}}
</span>
</div>
</div>
</div>
{{/each}}
</div>
</template>
I would just like to know how I can adapt what I currently have to take advantage of the new way I am getting my messages data. Do I just create and insert DOM elements programmatically? My issue with that is that it doesn't seem like the Meteor way of doing things, because I wouldn't be using Blaze or Spacebars.
Any help on this issue would be very much appreciated.
EDIT: Old getMessages helper
Template.messages.getMessages = () ->
... # Random Logic
allMessages = Messages.find({}, {sort: {time: -1 }}).fetch()
... # Messing with allMessages before returning to the helper
It would be best if your helper returned the Messages cursor itself rather than an array, which means not calling .fetch() on it. If you need to modify the documents before rendering HTML you can use the transform function as described here.