I guess it shows my kookiness here, but how do I get just the HTML presentation of a website? For example, I am trying to retrieve from a Wix site the HTML structure (what is actually being viewed by the user on screen), but instead I am getting lots of scripts that exist on the site. I am doing a small code test for scraping. Much appreciated.
Alright, here we go. Sorry for the delay.
I used Selenium to load the page; that way I could make sure to capture all the markup even if it's loaded by AJAX. Make sure to grab the standalone library, that threw me for a loop.
Once the HTML is retrieved I pass it to Jsoup, which I use to iterate through the document and remove all the text.
Here's the example code:
// Selenium to grab the HTML
// I chose to use this to get anything that may be loaded by AJAX
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// Jsoup for parsing the HTML
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class Example {
    public static void main(String[] args) {
        // Create a new instance of the Firefox driver.
        // Notice that the remainder of the code relies on the WebDriver interface,
        // not the implementation.
        WebDriver driver = new FirefoxDriver();

        // And now use this to visit Stack Overflow
        driver.get("http://stackoverflow.com/");

        // Get the fully rendered page source
        String html = driver.getPageSource();

        // Parse with the XML parser so the markup is left as-is
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());

        // Walk every element and strip its text nodes, leaving only the markup
        for (Element el : doc.select("*")) {
            if (!el.ownText().isEmpty()) {
                for (TextNode node : el.textNodes()) {
                    node.remove();
                }
            }
        }

        System.out.println(doc);
        driver.quit();
    }
}
Not sure if you wanted to get rid of the attributes as well; currently they are left in place. However, it's easy enough to modify the code so that some or all of the attributes are removed too.
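For example, a rough sketch of that modification, reusing the same loop (this uses Jsoup's Attributes API; newer Jsoup versions also have a clearAttributes() helper on Node, if yours includes it):

// Strip every attribute as well; asList() gives a copy of the attribute list,
// so it is safe to remove attributes from the element while iterating.
for (Element el : doc.select("*")) {
    for (org.jsoup.nodes.Attribute attr : el.attributes().asList()) {
        el.removeAttr(attr.getKey());
    }
}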
If you just require the content from the page, you can append ?_escaped_fragment_= to every URL to get the static content.
_escaped_fragment_ is a standard approach from the AJAX crawling scheme, used for crawling pages that are dynamic in nature or are generated/rendered on the client side.
Wix-based websites support _escaped_fragment_.
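Since it's only a query-string suffix, it also combines with a Jsoup-only approach; a minimal sketch (the Wix URL below is a made-up placeholder, and not every site still honours this scheme):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class EscapedFragmentExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Wix page; ?_escaped_fragment_= asks the site for the
        // pre-rendered static HTML instead of the script-heavy shell.
        String url = "http://example.wixsite.com/mysite?_escaped_fragment_=";
        Document doc = Jsoup.connect(url).get();
        System.out.println(doc.outerHtml());
    }
}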
I am trying to fetch live gold prices from a website using this code.
import 'package:http/http.dart' as http;
import 'package:html/dom.dart' as dom;
import 'package:html/parser.dart' as parser;
getWebData() async {
  final response = await http.get(Uri.parse(
      "https://www.mcxindia.com/en/market-data/get-quote/FUTCOM/GOLD/05FEB2021"));
  dom.Document document = parser.parse(response.body);
  print(document);
  var element = document.getElementById("litPrice");
}
I know how to fetch text when it's between an <a> tag or a <p> tag, but in this case it's just inside a <span> tag, which looks like
<span id="litPrice">47526</span>. I am unable to figure out how to get this number.
Packages used: html and http from pub.dev.
The method is the same as for the <a> and <p> tags: the text between the tags is available via element.innerHtml.
I think the issue you're running into is that the tag you're selecting isn't present in the document returned from that URL. In the browser, that element is inserted into the document via JavaScript after the document is loaded.
There are a couple of ways you could deal with this:
Use an API - If the service you're trying to access provides a REST API, then you can see if the data you're trying to access is available via that route. This is by far the better way, if it's available.
Use a headless browser - This is a really heavy-weight solution, and likely won't work for your use case, but I'll put it here as one of the possibilities anyway. In Dart, you can use the puppeteer library to interact with a headless Chrome browser. This way, you can load the page, wait for the scripts to run, then scrape the DOM - you should then find the element you're looking for.
I am trying to make a bot that can play Cookie Clicker. I have successfully opened the website using the webbrowser module. When I use the developer tools to see the HTML I can see the information I want to obtain, such as how much money I have, how expensive items are, etc. But when I try to get that information using requests and BeautifulSoup it instead gets the HTML of a new window. How can I make it so that I get the HTML of the already opened tab?
import webbrowser
webbrowser.open('https://orteil.dashnet.org/cookieclicker/')

from bs4 import BeautifulSoup
import requests

def scrape():
    html = requests.get('https://orteil.dashnet.org/cookieclicker/')
    print(html)

scrape()
You can try to do this (note that find_element_by_xpath is Selenium's API, not requests': the game's state is built by JavaScript in the browser, so html here would need to be a Selenium WebDriver instance driving the open page rather than a requests response):
body_element = html.find_element_by_xpath("//body")
body_content = body_element.get_attribute("innerHTML")
print(body_content)
I am trying to parse the sidebar TOC (Table of Contents) of a documentation site.
Jsoup
I have tried Jsoup. I cannot get the TOC elements because the HTML content in this tag is not part of the initial HTML but is set by JavaScript after the page is loaded.
You can see my previous question here: JSoup cannot parse child elements after depth 2
The suggested solution there was to manually examine, from the browser Dev Tools, what connections the page makes and find the final version of the content. Parsing the sidebar TOC of a documentation site is just one component of my Java program, so I cannot do this manually.
JavaFX WebView (not Android WebView)
I have tried JavaFX WebView because I need a browser that executes JavaScript code and fills in the TOC elements.
WebView browser = new WebView();
WebEngine webEngine = browser.getEngine();
webEngine.load("https://learn.microsoft.com/en-us/ef/ef6/");
But I don't know how I can retrieve the HTML of the loaded website and transfer this data to a Jsoup Document.
Any advice appreciated.
WebView browser = new WebView();
WebEngine webEngine = browser.getEngine();
String url = "https://learn.microsoft.com/en-us/ef/ef6/";
webEngine.load(url);

// Once the page has finished loading (e.g. in a listener on the load worker's
// state), get the w3c document from the webEngine
org.w3c.dom.Document w3cDocument = webEngine.getDocument();

// Use jsoup's helper methods to convert it to a string
String html = new org.jsoup.helper.W3CDom().asString(w3cDocument);

// Create a jsoup document by parsing the html (the url is used as the base URI)
Document doc = Jsoup.parse(html, url);
I can't promise this is the best way as I've not used Jsoup before and I'm not an expert on the XML API.
The org.jsoup.Jsoup class has a method for parsing HTML in String form: Jsoup.parse(String). This means we need to get the HTML from the WebView as a String. The WebEngine class has a document property that holds an org.w3c.dom.Document. This Document is the HTML content of the currently showing web page. We just need to convert this Document into a String, which we can do with a Transformer.
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.jsoup.Jsoup;

public class Utils {

    private static Transformer transformer;

    // not thread safe
    public static org.jsoup.nodes.Document convert(org.w3c.dom.Document doc)
            throws TransformerException {
        if (transformer == null) {
            transformer = TransformerFactory.newDefaultInstance().newTransformer();
        }
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return Jsoup.parse(writer.toString());
    }
}
You would call this every time the document property changes. I did some "tests" by browsing Google and printing the org.jsoup.nodes.Document to the console and everything seems to be working.
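A sketch of wiring that up, assuming the Utils class above and the webEngine from the earlier snippet:

// Re-convert whenever the WebEngine swaps in a new DOM document.
webEngine.documentProperty().addListener((observable, oldDoc, newDoc) -> {
    if (newDoc != null) {
        try {
            org.jsoup.nodes.Document jsoupDoc = Utils.convert(newDoc);
            System.out.println(jsoupDoc.title());
        } catch (javax.xml.transform.TransformerException e) {
            e.printStackTrace();
        }
    }
});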
There is a caveat, though; as far as I understand it the document property does not change when there are changes within the same web page (the Document itself may be updated, however). I'm not a web person, so pardon me if I don't make sense here, but I believe that this includes things like a frame changing its content. There may be a way around this by interfacing with the page's JavaScript using WebEngine.executeScript(String), but I don't know how.
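One hedged way to approach that executeScript route: asking the page for its own outerHTML gives back a plain String that reflects whatever JavaScript has done to the DOM at that moment, and it can go straight into Jsoup.parse:

// Run once the page has loaded, e.g. after the load worker's state
// (webEngine.getLoadWorker().stateProperty()) reaches SUCCEEDED.
String renderedHtml = (String) webEngine.executeScript(
        "document.documentElement.outerHTML");
org.jsoup.nodes.Document doc = org.jsoup.Jsoup.parse(renderedHtml);
System.out.println(doc.title());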
I am trying to find the donation button on the website of The University of British Columbia.
The donation button is located in the page footer, within the div with class "span7".
However, when scraped, the HTML yielded the div with nothing inside it.
My program works perfectly when the div is passed in directly as the source:
from bs4 import BeautifulSoup as bs
import re
site = '''<div class="span7" id="ubc7-footer-menu"><div class="row-fluid"><div class="span6"><h3>About UBC</h3><div>Contact UBC</div><div>About the University</div><div>News</div><div>Events</div><div>Careers</div><div>Make a Gift</div><div>Search UBC.ca</div></div><div class="span6"><h3>UBC Campuses</h3><div>Vancouver Campus</div><div>Okanagan Campus</div><h4>UBC Sites</h4><div>Robson Square</div><div>Centre for Digital Media</div><div>Faculty of Medicine Across BC</div><div>Asia Pacific Regional Office</div></div></div></'''
html = bs(site, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
#returns proper donation URL
However, using the live site does not work:
from bs4 import BeautifulSoup as bs
import requests
import re
site = requests.get('https://www.ubc.ca/')
html = bs(site.content, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
#returns none
Is there something wrong with my parser? Is it some sort of anti-scraping maneuver? Am I doomed?
I cannot seem to find the 'Donate' button on the URL that you provided, but there is nothing inherently wrong with your parser; it's just that the GET request that you send only gives you the HTML initially returned in the response, rather than waiting for the page to fully render.
It appears that parts of the page are filled in by JavaScript. You can use Splash, which is used to render JavaScript-based pages. You can run Splash in Docker quite easily, and just make HTTP requests to the Splash container, which will return HTML that looks just like the webpage as rendered in a web browser.
Although this sounds overly complicated, it is actually quite simple to set up since you don't need to modify the Docker image at all, and you need no previous knowledge of Docker to get it to work. It requires just a single line from the command line to start a local Splash server:
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
You then just modify any existing requests you have in your Python code to route to Splash instead:
i.e. http://example.com/ becomes
http://localhost:8050/render.html?url=http://example.com/
Okay, so what I am trying to do is a picture of the day type situation.
I want to pull the top image from /r/Earthporn, have it display on a webpage (I will link back to the source), and pretty much that's it.
I thought using Jsoup to parse might be helpful, and now I've hit a wall.
I need to find a way to parse out the HTML from the URL source I give it, and then use that variable to create an img tag in my own HTML outside of the script tag.
Relevant code:
<script>
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
Document doc = Jsoup.connect("http://www.reddit.com/r/EarthPorn/top/?sort=top&t=day").get();
Element link = doc.select("div.thing id-t3_22ds3k odd link > a[href]");
String linkHref = link.attr("href");
str="<img src="+linkHref+"/>";
}
</script>
After this it's all your usual HTML. I just want to be able to display the link that has been parsed out (here seen as linkHref) in the body of my HTML.
Not sure what I think I'm doing there with that str variable, but I figured I would leave it in in case I'm onto something... which I highly doubt.
I'm new to this Jsoup parsing world, since the only other parsing I've done is with AS3, and that was an XML sheet.
Any help would be greatly appreciated! Thanks in advance!