Using BeautifulSoup for HTML scraping

So I'm trying to make a program that tells the user how far away Voyager 1 is from the Earth. NASA has this info on their website here: http://voyager.jpl.nasa.gov/where/index.html...
I can't seem to manage to get the information within the div. Here's the div: <div id="voy1_km">Distance goes here</div>
My current program is as follows:
import requests
from bs4 import BeautifulSoup

url = "http://voyager.jpl.nasa.gov/where/index.html"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
test = soup.find('div', {'id': 'voy1_km'})
print(test)
So, long story short: how do I get the div's contents?

As you can see on the page itself, the distance keeps changing; it is driven by JavaScript. You could just read the JavaScript code, so you don't even need to scrape the page to get the distance... (I hate websites using JavaScript as much as you do. :))
If you really want to get the number off their website, you can use Selenium:
# pip install selenium
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://voyager.jpl.nasa.gov/where/index.html")
time.sleep(5)  # give the page's JavaScript time to fill in the numbers
elem = driver.find_element_by_class_name("tr_dark")
print(elem.text)
driver.close()
Here is the output:
Distance from Earth
19,964,147,071 KM
133.45208042 AU
Of course, please refer to the terms and conditions of their website regarding the extent to which you may scrape it and distribute the data.

The bigger question is why even bother scraping it. If you dive a bit deeper into the JavaScript file, you can repeat its calculation in a very simple manner:
import time

# Two reference points taken from the page's JavaScript:
# Unix timestamps one day apart, and Voyager 1's distance (km) at each.
epoch_0 = 1445270400
epoch_1 = 1445356800
dist_0_v1 = 19963672758.0152
dist_1_v1 = 19966727483.2612

# Linearly interpolate (or extrapolate) the distance at the current time.
current_time = time.time()
current_dist_km_v1 = (((current_time - epoch_0) / (epoch_1 - epoch_0))
                      * (dist_1_v1 - dist_0_v1)) + dist_0_v1
print("{:,.0f} KM".format(current_dist_km_v1))
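For reuse, the same interpolation can be wrapped in a small helper function. This is just a sketch using the reference values hard-coded above; `interpolate_distance` is my own name, not part of NASA's script:

```python
def interpolate_distance(t, t0, t1, d0, d1):
    """Linearly interpolate a distance (km) at time t between two
    reference (timestamp, distance) samples taken from the page's JS."""
    return d0 + (t - t0) / (t1 - t0) * (d1 - d0)

# Halfway between the two reference epochs, the distance is the midpoint.
mid = interpolate_distance(1445313600, 1445270400, 1445356800,
                           19963672758.0152, 19966727483.2612)
print("{:,.0f} KM".format(mid))  # -> 19,965,200,121 KM
```

Passing `time.time()` as `t` reproduces the snippet above exactly.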


How to scrape only texts from specific HTML elements?

I have a problem with selecting the appropriate items from the list.
For example, I want to omit the "1." and then the first "5" (as in the example).
Additionally, I would like to add a condition that the letter "W" should be changed to "WIN".
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep

driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)
page = driver.page_source
soup = BS2(page, 'html.parser')
content = soup.find('div', {'class': 'ui-table__body'})
content_list = content.find_all('span', {"table__cell table__cell--value"})
res = []
for i in content:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        res.append(line)
print(res)
Results:
['1.Ajax550016:315?WWWWW']
What I need:
Ajax;5;5;0;16;3;W;W;W;W;W
I would recommend selecting your elements more specifically:
for e in soup.select('.ui-table__row'):
Iterate the ResultSet and decompose() the unwanted tag:
e.select_one('.wld--tbd').decompose()
Extract the texts with stripped_strings and join() them into your expected string:
data.append(';'.join(e.stripped_strings))
Example
Also making some replacements based on a dict, just to demonstrate how this would work, not knowing what R or P should become.
...
soup = BS2(page,'html.parser')
data = []
for e in soup.select('.ui-table__row'):
e.select_one('.wld--tbd').decompose()
e.select_one('.tableCellRank').decompose()
e.select_one('.table__cell--points').decompose()
e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
pattern = {'W':'WIN','R':'RRR','P':'PPP'}
data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
To get only the result for Ajax:
data = []
for e in soup.select('.ui-table__row:-soup-contains("Ajax")'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W': 'WIN', 'R': 'RRR', 'P': 'PPP'}
    data.append(';'.join([pattern.get(i, i) for i in e.stripped_strings]))
data
Output
This is based on current, live data, so it may differ from the question's example.
['Ajax;6;6;0;0;21;3;WIN;WIN;WIN;WIN;WIN']
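The decompose()/stripped_strings pattern can be tried offline against a tiny made-up snippet. The markup below is invented for illustration and is not Flashscore's real HTML:

```python
from bs4 import BeautifulSoup

# Invented markup mimicking one table row; not the real Flashscore HTML.
html = """
<div class="ui-table__row">
  <span class="tableCellRank">1.</span>
  <span class="team">Ajax</span>
  <span class="table__cell--score">16:3</span>
  <span class="form">W</span><span class="form">W</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
row = soup.select_one('.ui-table__row')
row.select_one('.tableCellRank').decompose()       # drop the "1."
score = row.select_one('.table__cell--score')
score.string = ';'.join(score.text.split(':'))     # "16:3" -> "16;3"
pattern = {'W': 'WIN'}
result = ';'.join(pattern.get(s, s) for s in row.stripped_strings)
print(result)  # -> Ajax;16;3;WIN;WIN
```

The same three steps as above: remove the rank tag, rewrite the score in place, then map and join the remaining stripped strings.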
You had the right start by using bs4 to find the table div, but then you gave up and just tried to use re to extract from the text. As you can see, that's not going to work. Here is a simple way to hack out what you want: I keep grabbing divs from the table div you found, then grab the text of the next eight divs after finding "Ajax". Then I do some dirty string manipulation, because the "WWWWW" is all in the same top-level div.
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
#driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
driver.implicitly_wait(10)
page = driver.page_source
soup = BS2(page, 'html.parser')
content = soup.find('div', {'class': 'ui-table__body'})
content_list = content.find_all('span', {"table__cell table__cell--value"})
res = []
found = 0
for i in content.find('div'):
    line = i.text.split()[0]
    if re.search('Ajax', line):
        found = 8
    if found:
        found -= 1
        res.append(line)
# change field 5 into separate values and skip field 6
res = res[:4] + res[5].split(':') + res[7:]
# break the last field into separate values and drop the first '?'
res = res[:-1] + [i for i in res[-1]][1:]
print(";".join(res))
returns
Ajax;5;5;0;16;3;W;W;W;W;W
This works, but it is very brittle and will break as soon as the website changes its content, so you should add a lot of error checking. I also replaced the sleep with an implicit wait call, and added webdriver_manager, which lets me use Selenium with Chrome.
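The list surgery at the end can be checked in isolation, with the eight scraped fields hard-coded. The values here are reconstructed from the '1.Ajax550016:315?WWWWW' string shown in the question:

```python
# Eight fields as grabbed from the row (reconstructed from the question's output).
res = ['Ajax', '5', '5', '0', '0', '16:3', '15', '?WWWWW']

# Keep the name and first three numbers, split "16:3" into two fields, skip field 6.
res = res[:4] + res[5].split(':') + res[7:]
# Break the trailing form string into single letters, dropping the leading '?'.
res = res[:-1] + list(res[-1])[1:]
print(';'.join(res))  # -> Ajax;5;5;0;16;3;W;W;W;W;W
```

Hard-coding the input like this makes the slicing easy to unit-test before wiring it back up to the scraper.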

How can I download link from YahooFinance in BeautifulSoup?

Currently I'm trying to automatically scrape/download Yahoo Finance historical data. I plan to download the data using the download link provided on the website.
My code lists all the available links and works from there; the problem is that the exact link doesn't appear in the results. Here is my code (partial):
def scrape_page(url, header):
    page = requests.get(url, headers=header)
    if page.status_code == 200:
        soup = bs.BeautifulSoup(page.content, 'html.parser')
        return soup
    return None

if __name__ == '__main__':
    symbol = 'GOOGL'
    dt_start = datetime.today() - timedelta(days=(365*5+1))
    dt_end = datetime.today()
    start = format_date(dt_start)
    end = format_date(dt_end)
    sub = subdomain(symbol, start, end)
    header = header_function(sub)
    base_url = 'https://finance.yahoo.com'
    url = base_url + sub
    soup = scrape_page(url, header)
    result = soup.find_all('a')
    for a in result:
        print('URL :', a['href'])
UPDATE 10/9/2020 :
I managed to find the span which is the parent of the link with this code:
spans = soup.find_all('span',{"class":"Fl(end) Pos(r) T(-6px)"})
However, when I print it out, it does not show the link; here is the output:
>>> spans
[<span class="Fl(end) Pos(r) T(-6px)" data-reactid="31"></span>]
To download the historical data in CSV format from Yahoo Finance, you can use this example:
import requests
from datetime import datetime

# Yahoo's CSV download endpoint; period1/period2 are Unix timestamps.
csv_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={from_}&period2={to_}&interval=1d&events=history'

quote = 'GOOGL'
from_ = str(datetime.timestamp(datetime(2019, 9, 27, 0, 0))).split('.')[0]
to_ = str(datetime.timestamp(datetime(2020, 9, 27, 23, 59))).split('.')[0]

print(requests.get(csv_link.format(quote=quote, from_=from_, to_=to_)).text)
Prints:
Date,Open,High,Low,Close,Adj Close,Volume
2019-09-27,1242.829956,1244.989990,1215.199951,1225.949951,1225.949951,1706100
2019-09-30,1220.599976,1227.410034,1213.420044,1221.140015,1221.140015,1223500
2019-10-01,1222.489990,1232.859985,1205.550049,1206.000000,1206.000000,1225200
2019-10-02,1196.500000,1198.760010,1172.630005,1177.920044,1177.920044,1651500
2019-10-03,1183.339966,1191.000000,1163.140015,1189.430054,1189.430054,1418400
2019-10-04,1194.290039,1212.459961,1190.969971,1210.959961,1210.959961,1214100
2019-10-07,1207.000000,1218.910034,1204.359985,1208.250000,1208.250000,852000
2019-10-08,1198.770020,1206.869995,1189.479980,1190.130005,1190.130005,1004300
2019-10-09,1201.329956,1208.459961,1198.119995,1202.400024,1202.400024,797400
2019-10-10,1198.599976,1215.619995,1197.859985,1209.469971,1209.469971,642100
2019-10-11,1224.030029,1228.750000,1213.640015,1215.709961,1215.709961,1116500
2019-10-14,1213.890015,1225.880005,1211.880005,1217.770020,1217.770020,664800
2019-10-15,1221.500000,1247.130005,1220.920044,1242.239990,1242.239990,1379200
2019-10-16,1241.810059,1254.189941,1238.530029,1243.000000,1243.000000,1149300
2019-10-17,1251.400024,1263.750000,1249.869995,1252.800049,1252.800049,1047900
2019-10-18,1254.689941,1258.109985,1240.140015,1244.410034,1244.410034,1581200
2019-10-21,1248.699951,1253.510010,1239.989990,1244.280029,1244.280029,904700
2019-10-22,1244.479980,1248.729980,1239.849976,1241.199951,1241.199951,1143100
2019-10-23,1240.209961,1258.040039,1240.209961,1257.630005,1257.630005,1064100
2019-10-24,1259.109985,1262.900024,1252.349976,1259.109985,1259.109985,1011200
...and so on.
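The period1/period2 values are just Unix timestamps, so they can be built with the standard library alone. A minimal sketch, pinned to UTC for reproducibility (the snippet above uses local time):

```python
from datetime import datetime, timezone

def to_unix(dt):
    """Integer Unix timestamp for period1/period2, interpreting the
    naive datetime as UTC so the result is machine-independent."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

start = to_unix(datetime(2019, 9, 27))
end = to_unix(datetime(2020, 9, 27, 23, 59))
print(start, end)  # start is 1569542400
```

These integers drop straight into the `period1=`/`period2=` slots of the CSV URL above.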
I figured it out. That link is generated by JavaScript, and the requests.get() method won't work on dynamic content. I switched to Selenium to download that link.

Can't seem to extract text from element using BS4

I am trying to extract the name on this web page: https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29
The element I am trying to grab it from is:
<h1 class="hover_item_name" id="largeiteminfo_item_name" style="color:
rgb(210, 210, 210);">AK-47 | Redline</h1>
I am able to search for the ID "largeiteminfo_item_name" using Selenium and retrieve the text that way, but when I duplicate this with bs4 I can't seem to find the text.
I've tried searching the class "item_desc_description" but no text could be found there either. What am I doing wrong?
a = soup.find("h1", {"id": "largeiteminfo_item_name"})
a.get_text()
a = soup.find('div', {'class': 'item_desc_description'})
a.get_text()
I expected "AK-47 | Redline" but received '' for the first try and '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n' for the second try.
The data you are trying to extract is not present in the HTML page; I guess it might be generated with JavaScript (just guessing).
However I managed to find the info in the div "market_listing_nav".
from bs4 import BeautifulSoup as bs4
import requests
lnk = "https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29"
res = requests.get(lnk)
soup = bs4(res.text, features="html.parser")
elem = soup.find("div", {"class" : "market_listing_nav"})
print(elem.get_text())
This will output the following
Counter-Strike: Global Offensive
>
AK-47 | Redline (Field-Tested)
Have a look at the web page source for a tag with better formatting, or just clean up the one generated by my code.
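Cleaning up the whitespace-heavy get_text() output can be done with plain string operations. A sketch where the sample text only mimics what that div returns:

```python
# Sample of the whitespace-padded text get_text() tends to return;
# the item name follows the '>' breadcrumb separator.
raw = "\n\n  Counter-Strike: Global Offensive\n  >\n  AK-47 | Redline (Field-Tested)\n\n"

parts = raw.split('>')
name = parts[-1].strip()
print(name)  # -> AK-47 | Redline (Field-Tested)
```

If the breadcrumb ever gains more '>' levels, `parts[-1]` still picks the last segment, which is the item name.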

Can't seem to scrape the website "Forbes" properly

I'm trying to scrape the links and titles of the articles on the front page of the website https://www.forbes.com/.
I'm not proficient in HTML, but I've been following some Beautiful Soup tutorials and have been getting by with the knowledge I'm picking up along the way.
Here is what I have so far:
source = urllib.request.urlopen('https://www.forbes.com').read()
soup = bs.BeautifulSoup(source, 'lxml')  # Tried 'html.parser' as well
##print(soup.findAll('div',{'class':"c-entry-box--compact c-entry-box--compact--article"}))
for url in soup.findAll('a', {'class': "exit_trigger_set"}):
    print(url.get('href'))
Inspecting the site's HTML, I seem to have the class and the 'a' (not sure what you call 'a' in this case) correct.
However, instead of getting all the links of the articles on the front page, I'm only getting this one:
https://www.amazon.com/Intelligent-REIT-Investor-Wealth-Investment/dp/1119252717
Not sure what I'm doing wrong.
Thank you.
EDIT:
This seems to find some of the top stories, but I don't know how to pull out just the links:
for i in soup.findAll('h4', {'class': "editable editable-hed"}):
    print(i)
Here's how I would do it:
import urllib2
from bs4 import BeautifulSoup
import pandas as pd

source = urllib2.urlopen('https://www.forbes.com')
soup = BeautifulSoup(source, 'lxml')

lst = []
for i in soup.findAll('h4', {'class': "editable editable-hed"}):
    title = i.text
    link = i.find('a')['href'][2:]
    title = title.replace('\t', '')
    title = title.replace('\n', '')
    title = title.strip()
    lst.append({'title': title, 'link': link})

df = pd.DataFrame.from_dict(lst)
And you get 15 articles and their links.
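The title cleanup and the link slicing can be sanity-checked on their own. The sample headline and protocol-relative URL below are invented, not taken from Forbes:

```python
# Invented sample values resembling a scraped headline and a
# protocol-relative link ("//host/path") from an href attribute.
title = "\n\tThe Best Places To Retire\n"
link = "//www.forbes.com/best-places-to-retire/"

title = title.replace('\t', '').replace('\n', '').strip()
link = link[2:]  # drop the leading '//' of the protocol-relative URL

print(title)  # -> The Best Places To Retire
print(link)   # -> www.forbes.com/best-places-to-retire/
```

This is exactly the normalization the loop above applies to each h4 before appending to the list.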

How do I rerender HTML PyQt4

I have managed to use suggested code to render HTML from a webpage and then parse, find, and use the text as wanted. I'm using PyQt4. However, the webpage I am interested in is updated frequently, and I want to re-render the page and check the updated HTML for new info.
I thus have a loop in my Python script so that I sort of start all over again. However, this makes the program crash. I have searched the net and found out that this is to be expected, but I have not found any suggestion on how to do it correctly. It must be simple, I guess?
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

r = Render(url)
html = r.frame.toHtml()
So, when I hit r = Render(url) the second time, it crashes. So I am looking for something like r = Rerender(url).
As you might guess, I am not much of a programmer, and I usually get by by stealing code I barely understand. But this is the first time I can't find an answer, so I thought I should ask a question myself.
I hope my question is clear enough and that someone has the answer.
Simple demo (adapt to taste):
import sys, signal
from PyQt4 import QtCore, QtGui, QtWebKit

class WebPage(QtWebKit.QWebPage):
    def __init__(self, url):
        super(WebPage, self).__init__()
        self.url = url
        self.mainFrame().loadFinished.connect(self.handleLoadFinished)
        self.refresh()

    def refresh(self):
        self.mainFrame().load(QtCore.QUrl(self.url))

    def handleLoadFinished(self):
        print('Loaded:', self.mainFrame().url().toString())
        # do stuff with html ...
        print('Reloading in 2 seconds...\n')
        QtCore.QTimer.singleShot(2000, self.refresh)

if __name__ == '__main__':
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    app = QtGui.QApplication(sys.argv)
    webpage = WebPage('http://en.wikipedia.org/')
    print('Press Ctrl+C to quit\n')
    sys.exit(app.exec_())