I'm running into problems scraping across multiple pages with Nokogiri. I need to narrow down the results of what I am searching for based on the qualifying hrefs first. So here is a script that gets all of the hrefs I'm interested in obtaining. However, I'm having trouble parsing out the titles of the articles so that I can link to them. What I want is to filter down to the links I'm interested in and, whenever I find one, also grab the title/text describing the article, as in
<a href.......>Text Linked to</a>
so that I end up with a hash like {:source => ".....", :url => ".....", :title => "....."}. Here is the script I have so far; it narrows down the links I am interested in putting into the hash.
require 'nokogiri'
require 'open-uri'
page = "http://www.huffingtonpost.com/politics/"
doc = Nokogiri::HTML(open(page))
links = doc.css('a')
hrefs = links.map {|link| link.attribute('href').to_s}.uniq.sort.delete_if{|href| href.empty?}
hrefs.each do |h|
  if h.reverse[0, 9] != "stnemmoc#"                        # i.e. the href does not end in "#comments"
    if (h.reverse[0, 7] == "scitilo") && (h.length > 65)   # i.e. it ends in "...olitics" and is a long article URL
      puts h
    end
  end
end
If someone could help, and maybe explain how I can first find the hrefs I want (by filtering the URLs, as I do here) and then parse out the text for each one, that would be really nice. Also, is it recommended in Rails to put these Nokogiri scripts in controllers and send the results into the database from there? I appreciate it.
Thanks
I'm not sure I understand your question completely, but I'm going to interpret it as "How do I extract links and access their attributes?"
Simply amend your selector:
links = doc.css('a[href]')
This will give you all a elements that have an href. You can then iterate over these and access their attributes.
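For example, if I've read your goal right, something along these lines would combine your href filtering with the link text and build the {:source, :url, :title} hashes you describe. The filter here is only an approximation of your reverse-string checks, so tweak it to taste:
require 'nokogiri'
require 'open-uri'

page = "http://www.huffingtonpost.com/politics/"
doc = Nokogiri::HTML(open(page))

articles = doc.css('a[href]').map do |link|
  href = link['href'].to_s
  next if href.empty? || href.end_with?("#comments")          # drop the "#comments" anchors
  next unless href.include?("politics") && href.length > 65   # keep only long politics article links

  { :source => page, :url => href, :title => link.text.strip } # link.text is the "Text Linked to" part
end.compact.uniq

articles.each { |article| puts article.inspect }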
I am working on some website automation. Currently, I am unable to access a nested HTML document with Splinter. Here's a sample website that demonstrates what I am dealing with: https://www.w3schools.com/html/tryit.asp?filename=tryhtml_elem_select
I am trying to get into the select element and choose the "saab" option. I am stuck on how to enter the second HTML document. I've read the documentation and found nothing. I'm hoping there is a way to do this with Python.
Any thoughts?
Before Solution:
from splinter import Browser
exe = {"executable_path": "chromedriver.exe"}
browser = Browser("chrome",**exe, headless=False)
url = "https://www.w3schools.com/html/tryit.asp?filename=tryhtml_elem_select"
browser.visit(url)
# This is where I'm stuck. I cannot find a way to access the second (nested) html doc
innerframe = browser.find_by_name("iframeResult").first
innerframe.find_by_name("cars")[0]
Solution:
from splinter import Browser
exe = {"executable_path": "chromedriver.exe"}
browser = Browser("chrome",**exe, headless=False)
url = "https://www.w3schools.com/html/tryit.asp?filename=tryhtml_elem_select"
browser.visit(url)
with browser.get_iframe("iframeResult") as iframe:
    cars = iframe.find_by_name("cars")
    cars.select("saab")
I figured out that these are called iframes. Once I learned the terminology, it wasn't too hard to figure out how to interact with it. Searching for "nested html documents" was not returning the results I needed to find the solution.
I hope this helps someone out in the future!
On the web page there are a few articles. I need to get links to all articles.
I use Selenium and Powershell.
I do a search with:
FindElementByXPath("//*[contains(@class, 'without')]").getattribute("href")
but only get a link to the first article.
How to get links to all the articles?
All of the article links look like this:
<a class="without" href="http://articlelink.html"><h2>article</h2></a>
I don't know anything about PowerShell, but using Java with Selenium you can do this with code like the one below.
I know it's not an exact answer for your setup, but the code below should give you a hint of how to approach it in your language of choice.
List<WebElement> links = driver.findElements(By.className("without")); // get every element whose class name is "without"
System.out.println(links.size()); // total number of links on the page

for (int i = 0; i < links.size(); i++) {
    System.out.println(links.get(i).getAttribute("href")); // print each link's href, one by one
    links.get(i).click();  // click the link if you want to
    Thread.sleep(2500);    // wait for 2.5 seconds
}
Hope my above answer will help you.
I am pretty new to R and Selenium, so hopefully I can express my question clearly.
I want to scrape some data off a website (.aspx), and I need to type a chemical code to pull up the information on the next page (using RSelenium to input text and click the search button). So far I have been able to build a short script that gets me through the first step, i.e. it pulls up the correct page I wanted. But I have had a lot of trouble finding a good way to scrape the data (the chemical information in the table) off this website, mainly because the site does not assign a new URL for each chemical; it returns the same .aspx address for anything I search. I plan to overcome this and then build a loop so I can scrape more information automatically. Does anyone have thoughts on how I should get the data off the page after clicking the element? I need the chemical information table on the second page.
Thanks heaps in advance!
Here is the code I have written so far; the next step I need is to scrape the table out of the next page!
library("RSelenium")
checkForServer()
startServer()
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("http://limitvalue.ifa.dguv.de/")
mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox <- mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox$sendKeysToElement(list("64-19-7"))
wxbutton <- mybrowser$findElement(using = 'css selector', "#Butsearch")
wxbutton$clickElement()
First of all, your tool choice is wrong.
Secondly, in your case the request flow is:
1. POST to the "permanent" URL
2. 302 redirect to a new URL, which is http://limitvalue.ifa.dguv.de/WebForm_ueliste2.aspx in your case
3. GET the new URL
Thirdly, what's the ultimate output you are after?
It really depends on how much data you are after. If it is only a handful of values, doing it manually may be simpler.
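To make those three steps concrete, here is a rough sketch using Net::HTTP and Nokogiri (shown in Ruby only because most of this page is Ruby; the same flow works in R with httr and rvest). The form-field names "Tbox_cas" and "Butsearch" are guesses taken from the element IDs in your RSelenium code, and the hidden ASP.NET fields are simply copied back from the search page:
require 'net/http'
require 'nokogiri'
require 'uri'

base = URI("http://limitvalue.ifa.dguv.de/")

# 1. GET the search page and collect the hidden ASP.NET form fields (__VIEWSTATE etc.)
search_page = Nokogiri::HTML(Net::HTTP.get(base))
hidden = {}
search_page.css("input[type=hidden]").each { |i| hidden[i["name"]] = i["value"] }

# 2. POST the search form; field names and values are assumptions based on your selectors
response = Net::HTTP.post_form(base, hidden.merge("Tbox_cas" => "64-19-7", "Butsearch" => "Search"))

# 3. Follow the 302 redirect and parse the result table from the new URL
result_url = URI.join(base.to_s, response["location"].to_s)
result = Nokogiri::HTML(Net::HTTP.get(result_url))
result.css("table tr").each do |row|
  puts row.css("td").map { |td| td.text.strip }.join(" | ")
end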
I'm trying to get a CSS selector working in this rake task.
namespace :task do
  task test: :environment do
    ticketmaster_url = "http://www.ticketmaster.co.uk/derren-brown-miracle-glasgow-04-07-2016/event/370050789149169E?artistid=1408737&majorcatid=10002&minorcatid=53&tpab=-1"
    doc = Nokogiri::HTML(open(ticketmaster_url))

    # #psec-p label
    doc.css("#psec-p").each do |price|
      puts price.at_css("#psec-p")
      byebug
    end
  end
end
However, I'm getting this back:
#<Nokogiri::XML::Element:0x3fd226469e60 name="fieldset" attributes=[#<Nokogiri::XML::Attr:0x3fd2281c953c name="class" value="group-price widget-group">, #<Nokogiri::XML::Attr:0x3fd2281c9528 name="id" value="psec-p">] children=[#<Nokogiri::XML::Text:0x3fd2281c8d44 "\n ">, #<Nokogiri::XML::Element:0x3fd2281c8c7c name="legend" attributes=[#<Nokogiri::XML::Attr:0x3fd2281c8c18 name="id" value="psec-p-legend">] children=[#<Nokogiri::XML::Text:0x3fd2281c8614 "Price:">]>, #<Nokogiri::XML::Text:0x3fd2281c8448 "\n ">]>
I'm guessing I selected the wrong element, as I have chosen psec-p.
Could someone let me know where I'm going wrong?
I've been following Railscast 190.
The prices on http://www.ticketmaster.co.uk are applied to the HTML dynamically, via Javascript. This is partially done to hinder scraping efforts. You really cannot use Nokogiri to scrape this type of content from this domain, as Nokogiri processes raw HTML/XML, and does not execute Javascript in the process. Other tools exist to do this, but those would require an entirely different approach.
For learning purposes, you should choose a less dynamic site. For instance, http://www.wallacesuk.com has a nice, parseable site. You could easily learn basic web scraping techniques with a site that presents information inline with the page, such as this.
Scraping from http://ticketmaster.co.uk would require advanced scraping techniques, well beyond what Railscast 190 is demonstrating.
This:
doc.css("#psec-p").each do |price|
puts price.at_css("#psec-p")
byebug
end
can be better written using:
puts doc.at('#psec-p')
#psec-p is an ID, which can only occur once in a page, so at or at_css will find that one occurrence.
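So, on a page where the content really is in the static HTML, you could pull the pieces you need straight out of that one element, for example (based on the fieldset shown in your output above, and bearing in mind that the actual ticket prices here are added by JavaScript, so Nokogiri only ever sees the static markup):
price = doc.at('#psec-p')        # the single fieldset with id="psec-p"
puts price.at('legend').text     # => "Price:" (the legend inside that fieldset)
puts price.text.strip            # all of the text nested inside the fieldset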
I need to scrape some URLs from some retailer product pages, but the specific URLs I need to get aren't in the html part of the page. The html looks like this for each of the items on which one would click to get to the page with the URL I need to grab:
<div id="name" class="hand bold" onclick="AVON.productcontrol.Go(45714);">ADVANCE TECHNIQUES Color Protection Conditioner Bonus Size</div>
I wrote the following to get URLs from the page, but since the actual URLs I need don’t seem to be stored in the page, it doesn’t get what I need:
import urllib
import lxml.html
from lxml.cssselect import CSSSelector

def getUrls(URL):
    """input: product page url
    output: list of urls to products
    """
    connection = urllib.urlopen(URL)                  # Python 2; use urllib.request.urlopen on Python 3
    dom = lxml.html.fromstring(connection.read())
    selAnchor = CSSSelector('a')                      # select every anchor element on the page
    foundElements = selAnchor(dom)
    urlList = [e.get('href') for e in foundElements]  # collect each anchor's href
    return urlList
Is there a way to get the link that the function after ‘onclick’ (I guess AVON.productcontrol.Go(#);) takes you to? I don’t fully understand html, and while I’ve read a bit about onclick, I can’t figure out how the function after 'onclick' works.
In order to find the URL that you are taken to on click, you need to find the JavaScript source code of the 'Go' function and read and understand it. It's buried somewhere within a <script> tag or in some JavaScript .js file that is referenced directly or indirectly by the HTML page. Happy digging!
Or: you automate the interaction with the web page with a tool like Selenium (http://docs.seleniumhq.org/) and just check where it takes you if you click.
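For example, here is a minimal sketch of that second approach using the selenium-webdriver gem (shown in Ruby, since most of this page is Ruby; the Python selenium bindings expose equivalent calls). The URL is a placeholder for your retailer's listing page, and the CSS selector simply assumes the onclick pattern from the div you posted:
require 'selenium-webdriver'

listing_url = "http://example.com/product-listing"   # placeholder: your retailer's category page
driver = Selenium::WebDriver.for :chrome
driver.get listing_url

# click the first div whose onclick calls AVON.productcontrol.Go(...)
item = driver.find_element(css: "div[onclick*='productcontrol.Go']")
item.click

# wait for the navigation the click triggers, then see where it took you
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.current_url != listing_url }
puts driver.current_url

driver.quit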