Getting the RSS link from the <head> in ActionScript 2

I'm having problems getting the RSS link that tells the browser where the RSS feed for a site is. The link is found in the <head> tag of the HTML; here is an example of what it looks like:
<link rel="alternate" type="application/rss+xml" title="CNN - Top Stories [RSS]" href="http://rss.cnn.com/rss/cnn_topstories.rss" />
My original approach was to treat the site like an XML file and walk through the tags, but most sites have an arbitrary number of <meta> tags that are missing the closing />, so the <link> tag I'm looking for becomes a child of a random <meta> tag.
Now I'm thinking of just treating the site like a string and searching for the <link> tag in it, but this causes problems since the <link> tag can have its attributes in any order. Of course I can work around this, but I would prefer something a bit neater than looking for type="application/rss+xml" and then scanning to the left and right of it for the first href I find.

HTML parsing is hard! Even if you find a solution that works for one site, it will likely break on another. If you can find a library to help you, your life will be a lot easier.
If you cannot find an HTML parser for ActionScript 2, maybe you could set up a server-side script to do it for you? Like:
myXML.load("http://yourserver.com/cgi-bin/findrss?url=foo.com");
and then have it return the URL as XML.
If you try this approach, I recommend the python library Beautiful Soup. I've used it before and, in my opinion, it's amazing. It will work on any website you give it, no matter how horrible the markup is.
It would look something like this:
#!/usr/bin/python
import cgi
import cgitb; cgitb.enable()  # Optional; for debugging only
import urllib2
from BeautifulSoup import BeautifulSoup

def getRssFromUrl(url):
    # Fetch the page; on failure, report an <error> element and return nothing.
    try:
        response = urllib2.urlopen(url)
    except Exception:
        print "<error>error getting url</error>"
        return []
    html = response.read()
    # BeautifulSoup copes with the unclosed <meta> tags that break strict XML parsing.
    soup = BeautifulSoup(html)
    rssFeeds = soup.findAll('link', attrs={"type": "application/rss+xml"})
    return rssFeeds

print "Content-type: text/xml\n\n"

form = cgi.FieldStorage()
if "url" in form:
    url = form["url"].value
else:
    url = ""

print "<xml>"
rssFeeds = getRssFromUrl(url)
for feed in rssFeeds:
    print ("<url>%s</url>" % feed["href"])
print "</xml>"

Related

Can't find HTML tag when using Beautiful Soup

I'm trying to get more familiar with web scraping. I came across this website, https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/, which gives an intro to web scraping using Beautiful Soup. Following the demonstration, I tried to scrape the value and name of the S&P stock index with the code they provided, but that wasn't working. I think some things have changed; for example, the tag for the price is no longer under h1 as the author wrote on the website. When I inspect the web page to view the HTML code, I can see all the tags used. I figured out that some of the HTML code isn't being scraped from the Bloomberg website. I printed what the web scraper is collecting onto the console.
The code:
import urllib2
from bs4 import BeautifulSoup
quote_page = "http://www.bloomberg.com/quote/SPX:IND"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
print (soup)
name_box = soup.find("h1", attrs={"class": "price"})
name = name_box.text.strip() #get 'Nonetype object has no attribute text' here
print(name)
I was having trouble displaying what the code prints here on Stack Overflow, but basically some of the tags are not there. I'm wondering why this is and how to actually scrape the website. When I inspect the website, I can find the tag I am looking for, which is:
<span class="priceText__1853e8a5">2,912.43</span>
But using the code I have, I can't seem to get this tag.
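One way to confirm what is going on (a minimal sketch, not specific to Bloomberg's markup; requests is used here instead of urllib2 for brevity) is to search the raw HTML for the class name you saw in the browser inspector. If it never appears in what the server returned, the element is injected by JavaScript after the page loads, and no parser will find it in a plain HTTP fetch:
import requests
from bs4 import BeautifulSoup

quote_page = "http://www.bloomberg.com/quote/SPX:IND"
page = requests.get(quote_page)

# If the class seen in the inspector is absent from the served HTML,
# the element is added client-side by JavaScript after page load.
print("priceText" in page.text)

soup = BeautifulSoup(page.text, "html.parser")
name_box = soup.find("h1", attrs={"class": "price"})
if name_box is None:
    print("tag not in the served HTML; a JS-capable tool such as Selenium is needed")
else:
    print(name_box.text.strip())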

lxml xpath does not find an element in a "dirty" HTML file, but after indenting and cleaning the file it succeeds

Every sort of help will be extremely appreciated. I am building a parser for a web site. I am trying to detect an element using the lxml package; the element has a pretty simple relative xpath: '//div[@id="productDescription"]'. When I manually go to the web page, do 'view page source', and copy the HTML string to a local HTML file, everything works perfectly. However, if I download the file automatically:
import requests
from lxml import html

headers = {"user-Agent": "MY SCRAPER USER-AGENT", "Accept-Encoding": "gzip, deflate", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
product_HTML_bytes = requests.get(product_link, headers=headers, proxies={'http': "***:***"}).content
product_HTML_str = product_HTML_bytes.decode()
main_data = html.fromstring(product_HTML_str)
product_description_tags = main_data.xpath('//div[@id="productDescription"]')
...
I get nothing (even though the data does exist in the file). I had also tried first scraping a sample of pages using the same requests.get call with the same headers and so on, saving the files locally, and then cleaning the extra spaces and indenting the documents manually using this HTML formatter: https://www.freeformatter.com/html-formatter.html, and then boom, it works again. I couldn't put my finger on what exactly changes in the files, but I was pretty sure extra spaces and indented tabs should not make a difference.
What am I missing here?
Thanks in advance.
Edit:
URL: https://www.amazon.com/Samsung-MicroSDXC-Adapter-MB-ME128GA-AM/dp/B06XWZWYVP
Because pasting them here is impossible (the files exceed the length limit), I uploaded them to the web.
The not working HTML: https://easyupload.io/231pdd
The indented, clean, and formatted HTML page: https://easyupload.io/a9oiyh
For some strange reason, it seems that the lxml library mangles the output of requests.get() when it is filtered through the lxml.html.fromstring() method. I have no idea why.
The target data is still there, no doubt:
from bs4 import BeautifulSoup as bs

soup = bs(product_HTML_str, 'lxml')  # note that the lxml parser is used here!
for elem in soup.select_one('#productDescription p'):
    print(elem.strip())
Output:
Simply the right card. With stunning speed and reliability, the...
etc.
I personally much prefer xpath in lxml to the find() and CSS-selector methods used by BeautifulSoup, but this time BeautifulSoup wins...
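If you'd rather keep xpath, one possible middle ground (a sketch assuming BeautifulSoup is installed alongside lxml) is lxml's soupparser module, which lets BeautifulSoup do the lenient parsing but hands back an ordinary lxml tree:
from lxml.html import soupparser

# BeautifulSoup copes with the messy markup; the resulting tree supports xpath as usual.
root = soupparser.fromstring(product_HTML_str)
for div in root.xpath('//div[@id="productDescription"]'):
    print(div.text_content().strip())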

Using requests in Python 3.x to search for a keyword

I have a script that is supposed to make a query on the following website. https://searchwww.sec.gov/EDGARFSClient/.
import requests
keyword = "something"
body = {"query":keyword}
page = requests.post('https://searchwww.sec.gov/EDGARFSClient/', data=body)
print(page.content)
This just returns the web page's HTML code as-is, without the keyword search. Any ideas on what I am doing wrong? Also, is there a way to filter out only the links that the search returns?
The way I wanted to do it was to go through the HTML code and isolate all strings that look like this:
https://example-link.com
I think my main issue is that I need to pull up "advanced search" before I search for my keyword. That seems to be messing things up for me. I am not entirely sure, as I've never done this before. Any help would be much appreciated.
I'm not sure how you got the "query" tag, but the correct tag for searches on this website is "search_text".
from bs4 import BeautifulSoup
import requests

keyword = "something"
body = {"search_text": keyword}
page = requests.post('https://searchwww.sec.gov/EDGARFSClient/', data=body)
soup = BeautifulSoup(page.content, features='lxml')
for a in soup.find_all('a', href=True):
    if a.has_attr('class') and 'filing' in a['class']:
        print(a['href'])
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
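Since the hrefs come back wrapped in a javascript:opennew(...) call rather than as plain URLs, a small regex can pull out the link itself; the href value below is a made-up example for illustration:
import re

# Hypothetical href of the shape the search results return.
href = "javascript:opennew('http://www.sec.gov/Archives/edgar/data/123456/example.htm','example','')"
match = re.search(r"opennew\('([^']+)'", href)
if match:
    print(match.group(1))  # prints the bare document URL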

Opening html file from hard drive and doing xpath search on it

I have an HTML file on my hard drive that I want to run an xpath search on, like you would when scraping a website.
I have used the following code to scrape from websites:
from lxml import html
import requests

response = requests.get('http://www.website.com/')
if response.status_code == 200:
    pagehtml = html.fromstring(response.text)
    for elt in pagehtml.xpath('//div[@class="content"]/ul/li/a'):
        print("**", '"', elt.text_content(), '"', "****", elt.attrib['href'])
Now this works well when getting something from a website, but how do I go about it when the HTML file is on my hard drive? I have tried about ten things, and at the moment my code looks like this:
with open(r'website.html', 'rb') as infile:
    data = infile.read()
for elt in data.xpath('//h3/a'):
    print("**", '"', elt.text_content(), '"', "****", elt.attrib['href'])
I keep getting different errors, sometimes '_io.BufferedReader' errors, but I just can't get the code right.
Any suggestions? Regards
You could use the following code:
from lxml import html

pagehtml = html.parse('index.html')
for elt in pagehtml.xpath('//a'):
    print("**", '"', elt.text_content(), '"', "****", elt.attrib['href'])
This makes sure that the decoding of the file data is handled automatically.
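Alternatively, if you want to keep the open()-based approach from the question, read the file first and hand its contents to html.fromstring() instead of calling xpath on the raw bytes (a sketch assuming the same website.html file):
from lxml import html

with open('website.html', 'rb') as infile:
    tree = html.fromstring(infile.read())  # lxml accepts bytes and sniffs the encoding

for elt in tree.xpath('//h3/a'):
    print("**", '"', elt.text_content(), '"', "****", elt.attrib.get('href'))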

Parsing HTML with jSoup, then placing that variable within own HTML

Okay, so what I am trying to do is a picture of the day type situation.
I want to pull the top image from /r/Earthporn, have it display on a webpage (I will link back to the source), and pretty much thats it.
I thought using jSoup to parse it might be helpful, and now I've hit a wall.
I need to find a way to parse the HTML out of the URL source I give it, and then use the resulting variable to create an img tag in my own HTML outside of the script tag.
Relevant code:
<script>
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://www.reddit.com/r/EarthPorn/top/?sort=top&t=day").get();
Element link = doc.select("div.thing id-t3_22ds3k odd link > a[href]");
String linkHref = link.attr("href");
str = "<img src=" + linkHref + "/>";
}
</script>
After this it's all your usual HTML. I just want to be able to display the link that has been parsed out (here seen as linkHref) in the body of my HTML.
Not sure what I think I'm doing with that str variable, but I figured I would leave it in in case I'm onto something... which I highly doubt.
I'm new to this jSoup parsing world; the only other parsing I've done was with AS3, and that was an XML sheet.
Any help would be greatly appreciated! Thanks in advance!