Parsing HTML with jSoup, then placing that variable within own HTML - html

Okay, so what I am trying to do is a picture-of-the-day type situation.
I want to pull the top image from /r/EarthPorn, have it display on a webpage (I will link back to the source), and that's pretty much it.
I thought using jSoup to parse the page might be helpful, and now I've hit a wall.
I need to find a way to parse the HTML from the URL I give it, and then use the resulting variable to create an img tag in my own HTML, outside of the script tag.
Relevant code:
<script>
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
Document doc = Jsoup.connect("http://www.reddit.com/r/EarthPorn/top/?sort=top&t=day").get();
Element link = doc.select("div.thing.id-t3_22ds3k.odd.link > a[href]").first();
String linkHref = link.attr("href");
String str = "<img src=\"" + linkHref + "\"/>";
</script>
After this it's all your usual HTML. I just want to be able to display the link that has been parsed out (seen here as linkHref) in the body of my HTML.
I'm not sure what I think I'm doing with that str variable, but I figured I would leave it in in case I'm onto something... which I highly doubt.
I'm new to this jSoup parsing world; the only other parsing I've done is with AS3, and that was an XML sheet.
Any help would be greatly appreciated! Thanks in advance!
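One thing worth noting up front: jSoup is a Java library, so it cannot run inside a browser <script> tag. It has to run on the server (or as a standalone program) that produces the HTML the browser receives. Below is a minimal sketch of that idea; the class name, the output file, and the CSS selector for the top post are my own assumptions (reddit's listing markup isn't documented and changes), and reddit tends to reject requests without a real User-Agent, so one is set explicitly.

// Rough sketch only: run this server-side (or as a scheduled job), not in a <script> tag.
// The selector "div.thing a[href]" and the file name are assumptions, not reddit's documented markup.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PictureOfTheDay {
    public static void main(String[] args) throws IOException {
        // Reddit usually rejects the default Java user agent, so set one explicitly.
        Document doc = Jsoup.connect("http://www.reddit.com/r/EarthPorn/top/?sort=top&t=day")
                .userAgent("Mozilla/5.0 (picture-of-the-day demo)")
                .get();

        // Take the first post's outbound link from the listing (guessed selector).
        Element link = doc.select("div.thing a[href]").first();
        String linkHref = link.attr("href");

        // Build the HTML the browser will actually be served, img tag and all.
        String page = "<html><body>"
                + "<img src=\"" + linkHref + "\"/>"
                + "<p><a href=\"" + linkHref + "\">source</a></p>"
                + "</body></html>";

        Files.write(Paths.get("potd.html"), page.getBytes("UTF-8"));
    }
}

From there the webpage simply serves potd.html (or the same string is written into a servlet/JSP response); the point is that the img tag is built before the HTML ever reaches the browser.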

Related

LXML xpath does not detect an element in a "dirty" HTML file, but after indenting and cleaning it, it succeeds

Every sort of help will be extremely appreciated. I am building a parser for a website. I am trying to detect an element using the lxml package; the element has a pretty simple relative xpath: '//div[@id="productDescription"]'. When I manually go to the web page, do 'view page source', and copy the HTML string into a local HTML file, everything works perfectly. However, if I download the file automatically:
headers = {"user-Agent": "MY SCRAPER USER-AGENT", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT": "1","Connection": "close", "Upgrade-Insecure-Requests": "1"}
product_HTML_bytes = requests.get(product_link, headers=headers, proxies={'http': "***:***"}).content
product_HTML_str = product_HTML_bytes.decode()
main_data = html.fromstring(product_HTML_str)
product_description_tags = main_data.xpath('//div[#id="productDescription"]')
...
I get nothing (and the data does exist in the file). I had also tried scraping a sample of pages using the same requests.get with the same headers and so on, saving the files locally, and then cleaning the extra spaces and indenting the documents manually using this HTML formatter: https://www.freeformatter.com/html-formatter.html, and then boom, it works again. However, I couldn't put my finger on what exactly changes in the files; I was pretty sure extra spaces and indented tabs should not make a difference.
What am I missing here?
Thanks in Advance
Edit:
URL: https://www.amazon.com/Samsung-MicroSDXC-Adapter-MB-ME128GA-AM/dp/B06XWZWYVP
Because pasting the HTML here is impossible (the files exceed the length limit), I uploaded them to the web.
The not working HTML: https://easyupload.io/231pdd
The indented, clean, and formatted HTML page: https://easyupload.io/a9oiyh
For some strange reason, it seems that the lxml library mangles the text output of requests.get() when that output is passed through the lxml.html.fromstring() method. I have no idea why.
The target data is still there, no doubt:
from bs4 import BeautifulSoup as bs

soup = bs(product_HTML_str, 'lxml')  # note that the lxml parser is used here!
for elem in soup.select_one('#productDescription p'):
    print(elem.strip())
Output:
Simply the right card. With stunning speed and reliability, the...
etc.
I personally much prefer xpath in lxml to the find() and CSS selector methods used by BeautifulSoup, but this time BeautifulSoup wins...

Using requests in python 3.x to search for a keyword

I have a script that is supposed to make a query on the following website: https://searchwww.sec.gov/EDGARFSClient/.
import requests
keyword = "something"
body = {"query":keyword}
page = requests.post('https://searchwww.sec.gov/EDGARFSClient/', data=body)
print(page.content)
This just returns the webpage's HTML code as it is, without the keyword search. Any ideas on what I am doing wrong? Also, is there a way to filter out only the links that the search returns?
The way I wanted to do it was to go through the HTML code and isolate all strings that look like this:
https://example-link.com
I think my main issue is that I need to pull up "advanced search" before I search for my keyword. That seems to be messing things up for me. I am not entirely sure, as I've never done this before. Any help would be much appreciated.
I'm not sure how you got the "query" tag, but the correct tag for searches on this website is "search_text".
from bs4 import BeautifulSoup
import requests

keyword = "something"
body = {"search_text": keyword}

page = requests.post('https://searchwww.sec.gov/EDGARFSClient/', data=body)
soup = BeautifulSoup(page.content, features='lxml')

for a in soup.find_all('a', href=True):
    if a.has_attr('class') and 'filing' in a['class']:
        print(a['href'])
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');

splinter nested <html> documents

I am working on some website automation. Currently, I am unable to access a nested HTML document with Splinter. Here's a sample website that will help demonstrate what I am dealing with: https://www.w3schools.com/html/tryit.asp?filename=tryhtml_elem_select
I am trying to get into the select element and choose the "saab" option. I am stuck on how to enter the second HTML document. I've read the documentation and found nothing. I'm hoping there is a way with Python.
Any thoughts?
Before Solution:
from splinter import Browser
exe = {"executable_path": "chromedriver.exe"}
browser = Browser("chrome",**exe, headless=False)
url = "https://www.w3schools.com/html/tryit.asp?filename=tryhtml_elem_select"
browser.visit(url)
# This is where I'm stuck. I cannot find a way to access the second (nested) html doc
innerframe = browser.find_by_name("iframeResult").first
innerframe.find_by_name("cars")[0]
Solution:
from splinter import Browser
exe = {"executable_path": "chromedriver.exe"}
browser = Browser("chrome",**exe, headless=False)
url = "https://www.w3schools.com/html/tryit.asp?filename=tryhtml_elem_select"
browser.visit(url)
with browser.get_iframe("iframeResult") as iframe:
    cars = iframe.find_by_name("cars")
    cars.select("saab")
I figured out that these are called iframes. Once I learned the terminology, it wasn't too hard to figure out how to interact with it. "Nested html documents" was not returning the results I needed to find the solution.
I hope this helps someone out in the future!

Opening html file from hard drive and doing xpath search on it

I have an html file on my HD that I want to do an xpath search on like you do when scraping a website.
I have used the following code to scrape from websites:
from lxml import html
import requests
response = requests.get('http://www.website.com/')
if response.status_code == 200:
    pagehtml = html.fromstring(response.text)
    for elt in pagehtml.xpath('//div[@class="content"]/ul/li/a'):
        print("**", '"', elt.text_content(), '"', "****", elt.attrib['href'])
Now this works well when getting something from a website, but how do I go about it when the HTML file is on my HD? I have tried about 10 things, and at the moment my code looks like this:
with open(r'website.html', 'rb') as infile:
    data = infile.read()

for elt in data.xpath('//h3/a'):
    print("**", '"', elt.text_content(), '"', "****", elt.attrib['href'])
I keep getting different errors, sometimes mentioning '_io.BufferedReader', but I just can't get the code right.
Any suggestions? Regards
You could use the following code:
from lxml import html

pagehtml = html.parse('index.html')
for elt in pagehtml.xpath('//a'):
    print("**", '"', elt.text_content(), '"', "****", elt.attrib['href'])
This makes sure that the decoding of the file data is handled automatically.

how to just get the HTML output structure of a site

I guess it shows my kookiness here, but how do I just get the HTML presentation of a website? For example, I am trying to retrieve the HTML structure from a Wix site (what is actually being viewed by a user on the screen), but instead I am getting lots of scripts that exist on the site. I am doing a small code test for scraping. Much appreciated.
Alright, here we go. Sorry for the delay.
I used Selenium to load the page; that way I could make sure to capture all the markup, even markup loaded by Ajax. Make sure to grab the standalone library; that threw me for a loop.
Once the HTML is retrieved, I pass it to jsoup, which I use to iterate through the document and remove all the text.
Here's the example code:
// selenium to grab the html
// i chose to use this to get anything that may be loaded by ajax
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

// jsoup for parsing the html
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Example {
    public static void main(String[] args) {
        // Create a new instance of the Firefox driver
        // Notice that the remainder of the code relies on the interface,
        // not the implementation.
        WebDriver driver = new FirefoxDriver();

        // And now use this to visit stackoverflow
        driver.get("http://stackoverflow.com/");

        // Get the page source
        String html = driver.getPageSource();

        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        for (Element el : doc.select("*")) {
            if (!el.ownText().isEmpty()) {
                for (TextNode node : el.textNodes()) {
                    node.remove();
                }
            }
        }

        System.out.println(doc);
        driver.quit();
    }
}
Not sure if you wanted to get rid of the attributes as well; currently they are left. However, it's easy enough to modify the code so that some or all of the attributes are removed too.
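For instance, a rough sketch of stripping every attribute could be dropped into the same class as a helper (the method name is mine, and it additionally needs org.jsoup.nodes.Attribute, java.util.ArrayList, and java.util.List imported); tighten the inner loop if you only want some attributes removed:

// Sketch: remove every attribute from every element in the parsed document.
static void stripAttributes(Document doc) {
    for (Element el : doc.select("*")) {
        // Copy the keys first so the attribute list isn't modified while iterating over it.
        List<String> keys = new ArrayList<>();
        for (Attribute attr : el.attributes()) {
            keys.add(attr.getKey());
        }
        for (String key : keys) {
            el.removeAttr(key);
        }
    }
}

Calling stripAttributes(doc) just before the System.out.println(doc) line would print only the bare tag structure.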
If you just require the content from the page, you can append ?_escaped_fragment_= to every URL to get the static content.
_escaped_fragment_ is a standard approach from the Ajax crawling scheme, used for crawling pages that are dynamic in nature or are generated/rendered client-side.
Wix-based websites support _escaped_fragment_.
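As a sketch of that approach with the same jsoup setup as above (the URL is just a placeholder, and whether a given site still serves a static snapshot for this parameter is worth verifying, since the Ajax crawling scheme is an older convention):

// Sketch: fetch the static snapshot by appending the _escaped_fragment_ parameter.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class EscapedFragmentExample {
    public static void main(String[] args) throws IOException {
        String url = "http://www.example-wix-site.com/some-page";  // placeholder URL
        Document doc = Jsoup.connect(url + "?_escaped_fragment_=").get();
        System.out.println(doc);
    }
}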