How to scrape only texts from specific HTML elements? - html

I have a problem with selecting the appropriate items from the list.
For example - I want to omit "1." then the first "5" (as in the example)
Additionally, I would like to write a condition that the letter "W" should be changed to "WIN".
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep
driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)
page = driver.page_source
soup = BS2(page,'html.parser')
content = soup.find('div',{'class':'ui-table__body'})
content_list = content.find_all('span',{"table__cell table__cell--value"})
res = []
for i in content:
line = i.text.split()[0]
if re.search('Ajax', line):
res.append(line)
print(res)
results
['1.Ajax550016:315?WWWWW']
I need
Ajax;5;5;0;16;3;W;W;W;W;W

I would recommend to select your elements more specific:
for e in soup.select('.ui-table__row'):
Iterate the ResultSet and decompose() unwanted tag:
e.select_one('.wld--tbd').decompose()
Extract texts with stripped_strings and join() them to your expected string:
data.append(';'.join(e.stripped_strings))
Example
Also making some replacements, based on dict just to demonstrate how this would work, not knowing R or P.
...
soup = BS2(page,'html.parser')
data = []
for e in soup.select('.ui-table__row'):
e.select_one('.wld--tbd').decompose()
e.select_one('.tableCellRank').decompose()
e.select_one('.table__cell--points').decompose()
e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
pattern = {'W':'WIN','R':'RRR','P':'PPP'}
data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
To only get result for Ajax:
data = []
for e in soup.select('.ui-table__row:-soup-contains("Ajax")'):
e.select_one('.wld--tbd').decompose()
e.select_one('.tableCellRank').decompose()
e.select_one('.table__cell--points').decompose()
e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
pattern = {'W':'WIN','R':'RRR','P':'PPP'}
data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
Output
Based on actually data it may differ from questions example.
['Ajax;6;6;0;0;21;3;WIN;WIN;WIN;WIN;WIN']

you had the right start by using bs4 to find the table div, but then you gave up and just tried to use re to extract from the text. as you can see that's not going to work. Here is a simple way to hack and get what you want. I keep grabinn divs from the table div you find, and the grab the text of the next eight divs after finding Ajax. then I do some dirty string manipulation thing because the WWWWW is all in the same toplevel div.
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
#driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
driver.implicitly_wait(10)
page = driver.page_source
soup = BS2(page,'html.parser')
content = soup.find('div',{'class':'ui-table__body'})
content_list = content.find_all('span',{"table__cell table__cell--value"})
res = []
found = 0
for i in content.find('div'):
line = i.text.split()[0]
if re.search('Ajax', line):
found = 8
if found:
found -= 1
res.append(line)
# change field 5 into separate values and skip field 6
res = res[:4] +res[5].split(':') + res[7:]
# break the last field into separate values and drop the first '?'
res = res[:-1] + [ i for i in res[-1]][1:]
print(";".join(res))
returns
Ajax;5;5;0;16;3;W;W;W;W;W
This works, but it is very brittle, and will break as soon as the website changes their content. you should put in a lot of error checking. I also replaced the sleep with a wait call, and added chromedrivermamager, which allows me to use selenium with chrome.

Related

How can I download link from YahooFinance in BeautifulSoup?

currently I'm trying to automatically scrape/download yahoo finance historical data. I plan to download the data using the download link provided in the website.
My code is to list all the available link and work it from there, the problem is that the exact link doesn't appear in the result. Here is my code(partial):
def scrape_page(url, header):
page = requests.get(url, headers=header)
if page.status_code == 200:
soup = bs.BeautifulSoup(page.content, 'html.parser')
return soup
return null
if __name__ == '__main__':
symbol = 'GOOGL'
dt_start = datetime.today() - timedelta(days=(365*5+1))
dt_end = datetime.today()
start = format_date(dt_start)
end = format_date(dt_end)
sub = subdomain(symbol, start, end)
header = header_function(sub)
base_url = 'https://finance.yahoo.com'
url = base_url + sub
soup = scrape_page(url, header)
result = soup.find_all('a')
for a in result:
print('URL :',a['href'])
UPDATE 10/9/2020 :
I managed to find the span which is the parent for the link with this code
spans = soup.find_all('span',{"class":"Fl(end) Pos(r) T(-6px)"})
However, when I print it out, it does not show the link, here is the output:
>>> spans
[<span class="Fl(end) Pos(r) T(-6px)" data-reactid="31"></span>]
To download the historical data in CSV format from Yahoo Finance, you can use this example:
import requests
from datetime import datetime
csv_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={from_}&period2={to_}&interval=1d&events=history'
quote = 'GOOGL'
from_ = str(datetime.timestamp(datetime(2019,9,27,0,0))).split('.')[0]
to_ = str(datetime.timestamp(datetime(2020,9,27,23,59))).split('.')[0]
print(requests.get(csv_link.format(quote=quote, from_=from_, to_=to_)).text)
Prints:
Date,Open,High,Low,Close,Adj Close,Volume
2019-09-27,1242.829956,1244.989990,1215.199951,1225.949951,1225.949951,1706100
2019-09-30,1220.599976,1227.410034,1213.420044,1221.140015,1221.140015,1223500
2019-10-01,1222.489990,1232.859985,1205.550049,1206.000000,1206.000000,1225200
2019-10-02,1196.500000,1198.760010,1172.630005,1177.920044,1177.920044,1651500
2019-10-03,1183.339966,1191.000000,1163.140015,1189.430054,1189.430054,1418400
2019-10-04,1194.290039,1212.459961,1190.969971,1210.959961,1210.959961,1214100
2019-10-07,1207.000000,1218.910034,1204.359985,1208.250000,1208.250000,852000
2019-10-08,1198.770020,1206.869995,1189.479980,1190.130005,1190.130005,1004300
2019-10-09,1201.329956,1208.459961,1198.119995,1202.400024,1202.400024,797400
2019-10-10,1198.599976,1215.619995,1197.859985,1209.469971,1209.469971,642100
2019-10-11,1224.030029,1228.750000,1213.640015,1215.709961,1215.709961,1116500
2019-10-14,1213.890015,1225.880005,1211.880005,1217.770020,1217.770020,664800
2019-10-15,1221.500000,1247.130005,1220.920044,1242.239990,1242.239990,1379200
2019-10-16,1241.810059,1254.189941,1238.530029,1243.000000,1243.000000,1149300
2019-10-17,1251.400024,1263.750000,1249.869995,1252.800049,1252.800049,1047900
2019-10-18,1254.689941,1258.109985,1240.140015,1244.410034,1244.410034,1581200
2019-10-21,1248.699951,1253.510010,1239.989990,1244.280029,1244.280029,904700
2019-10-22,1244.479980,1248.729980,1239.849976,1241.199951,1241.199951,1143100
2019-10-23,1240.209961,1258.040039,1240.209961,1257.630005,1257.630005,1064100
2019-10-24,1259.109985,1262.900024,1252.349976,1259.109985,1259.109985,1011200
...and so on.
I figured it out. That link is generated by javascript and requests.get() method won't work on dynamic content. I switched to selenium to download that link.

Scrape table with no ids or classes using only standard libraries?

I want to scrape two pieces of data from a website:
https://www.moneymetals.com/precious-metals-charts/gold-price
Specifically I want the "Gold Price per Ounce" and the "Spot Change" percent two columns to the right of it.
Using only Python standard libraries, is this possible? A lot of tutorials use the HTML element id to scrape effectively but inspecting the source for this page, it's just a table. Specifically I want the second and fourth <td> which appear on the page.
It's possible to do it with standard python libraries; ugly, but possible:
import urllib
from html.parser import HTMLParser
URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'
page = urllib.request.Request(URL)
result = urllib.request.urlopen(page)
resulttext = result.read()
class MyHTMLParser(HTMLParser):
gold = []
def handle_data(self, data):
self.gold.append(data)
parser = MyHTMLParser()
parser.feed(str(resulttext))
for i in parser.gold:
if 'Gold Price per Ounce' in i:
target= parser.gold.index(i) #get the index location of the heading
print(parser.gold[target+2]) #your target items are 2, 5 and 9 positions down in the list
print(parser.gold[target+5].replace('\\n',''))
print(parser.gold[target+9].replace('\\n',''))
Output (as of the time the url was loaded):
$1,566.70
8.65
0.55%

Can't seem to extract text from element using BS4

I am trying to extract the name on this web page: https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29
the element i am trying to grab it from is
<h1 class="hover_item_name" id="largeiteminfo_item_name" style="color:
rgb(210, 210, 210);">AK-47 | Redline</h1>
I am able to search for the ID "largeiteminfo_item_name" using selenium and retrieve the text that way but when i duplicate this with bs4 I can't seem to find the text.
Ive tried searching class "item_desc_description" but no text could be found there either. What am I doing wrong?
a = soup.find("h1", {"id": "largeiteminfo_item_name"})
a.get_text()
a = soup.find('div', {'class': 'item_desc_description'})
a.get_text()
I expected "AK-47 | Redline" but received '' for the first try and '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n' for the second try.
The data you are trying to extract is not present in the HTML page, I guess it might be generated aside with JavaScript (just guessing).
However I managed to find the info in the div "market_listing_nav".
from bs4 import BeautifulSoup as bs4
import requests
lnk = "https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29"
res = requests.get(lnk)
soup = bs4(res.text, features="html.parser")
elem = soup.find("div", {"class" : "market_listing_nav"})
print(elem.get_text())
This will output the following
Counter-Strike: Global Offensive
>
AK-47 | Redline (Field-Tested)
Have a look at the web page source for tag with better formatting or just clean up the on generated by my code.

Python3.5 BeautifulSoup4 get text from 'p' in div

I am trying to pull all the text from the div class 'caselawcontent searchable-content'. This code just prints the HTML without the text from the web page. What am I missing to get the text?
The following link is in the 'finteredcasesdoc.text' file:
http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html
import requests
from bs4 import BeautifulSoup
with open('filteredcasesdoc.txt', 'r') as openfile1:
for line in openfile1:
rulingpage = requests.get(line).text
soup = BeautifulSoup(rulingpage, 'html.parser')
doctext = soup.find('div', class_='caselawcontent searchable-content')
print (doctext)
from bs4 import BeautifulSoup
import requests
url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
I've added a much more reliable .find method ( key : value)
whole_section = soup.find('div',{'class':'caselawcontent searchable-content'})
the_title = whole_section.center.h2
#e.g. Missouri Court of Appeals,Southern District,Division Two.
second_title = whole_section.center.h3.p
#e.g. STATE of Missouri, Plaintiff-Appellant v....
number_text = whole_section.center.h3.next_sibling.next_sibling
#e.g.
the_date = number_text.next_sibling.next_sibling
#authors
authors = whole_section.center.next_sibling
para = whole_section.findAll('p')[1:]
#Because we don't want the paragraph h3.p.
# we could aslso do findAll('p',recursive=False) doesnt pickup children
Basically, I've dissected this whole tree
as for the Paragraphs (e.g. Main text, the var para), you'll have to loop
print(authors)
# and you can add .text (e.g. print(authors.text) to get the text without the tag.
# or a simple function that returns only the text
def rettext(something):
return something.text
#Usage: print(rettext(authorts))
Try printing doctext.text. This will get rid of all the HTML tags for you.
from bs4 import BeautifulSoup
cases = []
with open('filteredcasesdoc.txt', 'r') as openfile1:
for url in openfile1:
# GET the HTML page as a string, with HTML tags
rulingpage = requests.get(url).text
soup = BeautifulSoup(rulingpage, 'html.parser')
# find the part of the HTML page we want, as an HTML element
doctext = soup.find('div', class_='caselawcontent searchable-content')
print(doctext.text) # now we have the inner HTML as a string
cases.append(doctext.text) # do something useful with this !

Using BeautifulSoup for html scraping

So i'm trying to make a program that tells the user how far away voyager 1 is from the Earth, NASA has this info on their website here http://voyager.jpl.nasa.gov/where/index.html...
I can't seem to manage to get the information within the div, here's the div: <div id="voy1_km">Distance goes here</div>
my current program is as follows : `
import requests
from BeautifulSoup import BeautifulSoup
url = "http://voyager.jpl.nasa.gov/where/index.html"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
test = soup.find('div', {'id' : 'voy1_km'})
print test
So long story short, How do I get the div contents?
as you can see from the webpage itself, the distance keep changing which is actually driven by a Javascript. You can maybe just read the javascrip code so you don't even need to scrape to get the distance... (I hate websites using Javascript as much as you:) )
If you really want to get the number off their website. You can use Selenium.
# pip install selenium
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get("http://voyager.jpl.nasa.gov/where/index.html")
time.sleep(5)
elem = driver.find_element_by_class_name("tr_dark")
print elem.text
driver.close()
Here is the output:
Distance from Earth
19,964,147,071 KM
133.45208042 AU
Of course, please refer to the terms&conditions of their website regarding to what level you can scrape their website and distribute the data.
The bigger question is why even bother scraping it. If you dive a bit deeper into the Javascript file, you can repeat its calculation in a very simple manner:
import time
epoch_0 = 1445270400
epoch_1 = 1445356800
dist_0_v1 = 19963672758.0152
dist_1_v1 = 19966727483.2612
current_time = time.time()
current_dist_km_v1 = ( ( ( current_time - epoch_0 ) / ( epoch_1 - epoch_0 ) ) * ( dist_1_v1 - dist_0_v1 ) ) + dist_0_v1
print("{:,.0f} KM".format(current_dist_km_v1))