Table data returning empty values after web scraping

Table data returning empty values after web scraping - html

I tried to web scrape the table data from a binary signals website. The data updates after some time and I wanted to get the data as it updates. The problem is, when I scrape the code it returns empty values. The table has a table tag.
I'm not sure if it uses something else other than html because it updates without reloading. I had to use a browser user agent to get passed the security.
When I run it returns correct data but I have noticed signal id increments by 1
<table class="ui stripe hover dt-center table" id="isosignal-table" style="width:100%"><thead><tr><th></th><th class="no-sort">Current Price</th><th class="no-sort">Direction</th><th class="no-sort">Asset</th><th class="no-sort">Strike Price</th><th class="no-sort">Expiry Time</th></tr></thead><tbody><tr :class="[ signal.direction.toLowerCase() == 'call' ? 'call' : 'put' ]" :id="'signal-' + signal.id" :key="signal.id" ref="signals" v-for="signal in signals"><td style="display: none;" v-text="signal.id"></td><td v-text="signal.current_price"></td><td v-html="showDirection(signal.direction)"></td><td v-text="signal.asset"></td><td v-text="signal.strike_price"></td><td v-text="parseTime(signal.expiry)"></td></tr></tbody></table>
table = soup.table
print(table)
But when I run the whole code it returns this:
[]
['', '', '', '', '', '']
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "https://signals.investingstockonline.com/free-binary-signal-page"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)
data = page.read()
soup = BeautifulSoup(data, 'html.parser')
table = soup.table
table_rows = table.find_all('tr')
for tr in table_rows:
td = tr.find_all('td')
row = [i.text for i in td]
if len(row) < 1:
pass
print(row)
I thought it would display the whole table but it just displayed empty strings. What could be the problem?

In the HTML you've provided, there is no text content in the elements, so you're getting that correctly. When you look at the live website, text content that appears in the table was inserted dynamically by JS fetching information from a server via ajax. In other words, if you perform a request, you'll get the skeleton (HTML) but no meat (live data).
You can use something like Selenium to extract this information as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get("https://signals.investingstockonline.com/free-binary-signal-page")
for tr in driver.find_elements_by_tag_name("tr"):
for td in tr.find_elements_by_tag_name("td"):
print(td.get_attribute("innerText"))
Output (truncated):
EURJPY
126.044
22:00:00
1.50318
EURCAD
1.50332
22:00:00
1.12595
EURUSD
1.12604
22:00:00
0.86732
EURGBP
0.86743
22:00:00
1.29825
GBPUSD
1.29841
22:00:00
145.320

Related

Empty data table- using Beautifulsoup

I asked a question earlier regarding an error I was getting. It helped by adding the correct class name for the table on this webpage. However, the code runs but when i print(data), I get an empty response.
url = "https://coinmarketcap.com/currencies/bitcoin/historical-data/"
content = requests.get(url).content
soup = BeautifulSoup(content,'html.parser')
table = soup.find('table', {'class': 'cmc-table'})
data = [[td.text.strip() for td in tr.findChildren('td')]
for tr in table.findChildren('tr')]
print(data)
output:
C:\Users\Ejer\anaconda3\envs\pythonProject\python.exe C:/Users/Ejer/PycharmProjects/pythonProject/stock_analysis.py
[[], ['No Data']]
Process finished with exit code 0

What happens?
Website deals with dynamic content and table is empty in the moment you try to grab the html.
Solution
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
browser = webdriver.Chrome('C:\Program Files\ChromeDriver\chromedriver.exe')
browser.get('https://coinmarketcap.com/currencies/bitcoin/historical-data/')
sleep(2)
soup = BeautifulSoup(browser.page_source,'html.parser')
browser.close()
table = soup.find('table', {'class': 'cmc-table'})
[[td.text.strip() for td in tr.findChildren('td')] for tr in table.findChildren('tr')][1:]
Output
[['Jan 09, 2021',
'$40,788.64',
'$41,436.35',
'$38,980.88',
'$40,254.55',
'$61,984,162,837',
'$748,563,483,043'],
['Jan 08, 2021',
'$39,381.77',
'$41,946.74',
'$36,838.64',
'$40,797.61',
'$88,107,519,480',
'$758,625,941,267'],
[...]]

Getting html table data other than text content (get "title" tag data)

One table entry within a table row on an html table I am trying to scrape looks like so:
<td class="top100nation" title="PAK">
<img src="/images/flag/flags_pak.jpg" alt="PAK"></td>
The web page to which this belongs is the following: http://www.relianceiccrankings.com/datespecific/odi/?stattype=bowling&day=01&month=01&year=2014. The entire column to which this belongs in the table has similar table data (i.e. it's a column of images).
I am using lxml in a python script. (Open to using BeautifulSoup instead, if I have to for some reason.) For every other column in the table, I can extract the data I want on the given row by using 'data = entry.text_content()'. Obviously, this doesn't work for this column of images. But I don't want the image data in any case. What I want to get from this table data is the 'PAK' bit - that is, I want the name of the nation. I think this is extremely simple but unfortunately I am a simpleton who doesn't understand the library he is using.
Thanks in advance
Edit: Full script, as per request
import requests
import lxml.html as lh
import csv
with open('firstPageCricinfo','w') as file:
writer = csv.writer(file)
page = requests.get(url)
doc = lh.fromstring(page.content)
#rows of the table
tr_elements = doc.xpath('//tr')
data_array = [[] for _ in range(len(tr_elements))]
del tr_elements[0]
for t in tr_elements[0]:
name=t.text_content()
if name == "":
continue
print(name)
data_array[0].append(name)
#printing out first row of table, to check correctness
print(data_array[0])
for j in range(1,len(tr_elements)):
T=tr_elements[j]
i=0
for t in T.iterchildren():
#column is not at issue
if i != 3:
data=t.text_content()
#image-based column
else:
#what do I do here???
data = t.
data_array[j].append(data)
i+=1
#printing last row to check correctness
print(data_array[len(tr_elements)-1])
with open('list1','w') as file:
writer = csv.writer(file)
for i in range(0,len(tr_elements)):
writer.writerow(data_array[i])`

Along with lxml library you'll either need to use requests or some other library to get the website content.
Without seeing the code you have so far, I can offer a BeautifulSoup solution:
url = 'http://www.relianceiccrankings.com/datespecific/odi/?stattype=bowling&day=01&month=01&year=2014'
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get(url).text, 'lxml')
r = soup.find_all('td', {'class': 'top100cbr'})
for td in r:
print(td.text.split('v')[1].split(',')[0].strip())
outputs about 522 items:
South Africa
India
Sri Lanka
...
Canada
New Zealand
Australia
England

Scrape table from webpage when in <div> format - using Beautiful Soup

So I'm aiming to scrape 2 tables (in different formats) from a website - https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate after using the search bar to iterate this over a list of license codes. I haven't included the loop fully yet but I added it at the top for completeness.
My issue is that because the two tables I want, Product Data and Certificate Data are in 2 different formats, so I have to scrape them separately. As the Product data is in the normal "tr" format on the webpage, this bit is easy and I've managed to extract a CSV file of this. The harder bit is extracting Certificate Data, as it is in "div" form.
I've managed to print the Certificate Data as a list of text, using the class function, however I need to have it in a tabular form saved in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to a CSV but If you have any suggestions, it would be much appreciated, thank you!! Also any other general tips to improve my code would be great too, as I am new to web-scraping.
#namelist = open('example.csv', newline='', delimiter = 'example')
#for name in namelist:
#include all of the below
driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")
url = "https://info.fsc.org/certificate.php"
driver.get(url)
search_bar = driver.find_element_by_xpath('//*[#id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)
new_url = driver.current_url
r = requests.get(new_url)
soup = BeautifulSoup(r.content,'lxml')
table = soup.find_all('table')[0]
df, = pd.read_html(str(table))
certificate = soup.find(class_= 'certificatecl').text
##certificate1 = pd.read_html(str(certificate))
driver.quit()
df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)
#print(df[0].to_json(orient='records'))
print certificate
Output:
Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0
What I want but over hundreds/thousands of license codes (I just manually created this one sample in Excel):
Desired output
EDIT
So whilst this is now working for Certificate Data, I also want to scrape the Product Data and output that into another .csv file. However currently it is only printing 5 copies of the product data for the final license code which is not what I want.
New Code:
df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]
def get_data_by_code(code):
data = [
('code', code),
('submit', 'Search'),
]
response = requests.post('https://info.fsc.org/certificate.php', data=data)
soup = BeautifulSoup(response.content, 'lxml')
status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
return [code, status, first_issue_date, last_issue_date, expiry_date, standard]
# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']
df3=pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
writer = csv.writer(f)
for code in codes:
print('Getting code# {}'.format(code))
writer.writerow((get_data_by_code(code)))
table = soup.find_all('table')[0]
df1, = pd.read_html(str(table))
df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')

Here's all you need.
No chromedriver. No pandas. Forget about it in context of scraping.
import requests
import csv
from bs4 import BeautifulSoup
# This is all what you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.
#Function to parse single data page based on single input code.
def get_data_by_code(code):
# Parameters to build POST-request.
# "type" and "submit" params are static. "code" is your desired code to scrape.
data = [
('type', 'certificate'),
('code', code),
('submit', 'Search'),
]
# POST-request to gain page data.
response = requests.post('https://info.fsc.org/certificate.php', data=data)
# "soup" object to parse html data.
soup = BeautifulSoup(response.content, 'lxml')
# "status" variable. Contains first's found [LABEL tag, with text="Status"] following sibling DIV text. Which is status.
status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
# Same for issue dates... etc.
first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
# Returning found data as list of values.
return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]
# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']
with open(OUTPUT_FILE_NAME, 'w') as f:
writer = csv.writer(f)
for code in codes:
print('Getting code# {}'.format(code))
#Writing list of values to file as single row.
writer.writerow((get_data_by_code(code)))
Everything is really straightforward here. I'd suggest you spend some time in Chrome dev tools "network" tab to have a better understanding of request forging, which is a must for scraping tasks.
In general, you don't need to run chrome to click the "search" button, you need to forge request generated by this click. Same for any form and ajax.

well... you should sharpen your skills (:
df3=pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
writer = csv.writer(f)
for code in codes:
print('Getting code# {}'.format(code))
writer.writerow((get_data_by_code(code)))
### HERE'S THE PROBLEM:
# "soup" variable is declared inside of "get_data_by_code" function.
# So you can't use it in outer context.
table = soup.find_all('table')[0] # <--- you should move this line to
#definition of "get_data_by_code" function and return it's value somehow...
df1, = pd.read_html(str(table))
df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')
As per example you can return dictionary of values from "get_data_by_code" function:
def get_data_by_code(code):
...
table = soup.find_all('table')[0]
return dict(row=row, table=table)

Format URL query "example [{u'1438923522000': 98}]"

`import json
import urllib2
response = urllib2.urlopen('http://www.energyhive.com/mobile_proxy/getCurrentVa$
content = response.read()
for x in json.loads(content):
if x["cid"] == "PWER":
print(x["data"])
`
Hi all, I have some code that I require part of the code sent to a txt file, example [{u'1438923522000': 98}], after running code, I just need the txt after : save as txt, or better sql.

If x["data"] == [{u'1438923522000': 98}] then x["data"][0] == {u'1438923522000': 98}. If you can guarantee that the dict will only have one key (as your example does) then the expression you are looking for is something like
next(x["data"][0].values())
Since you appear to be using Python 3, dict.values() is a generator, so calling next() on it gives you the first value without needing to know what the associated key is.

#!/usr/bin/env python
import urllib2
import json
api_key = 'VtxgIC2UnhfUmXe_pBksov7-lguAQMZD'
url = 'http://www.energyhive.com/mobile_proxy/getCurrentValuesSummary?token='+api_key
response = urllib2.urlopen(url)
content = response.read()
for x in json.loads(content):
if x["cid"] == "PWER":
print (x["data"])
for y in json.loads(content):
if y["cid"] == "PWER_GAC":
print(y["data"])
when i load this code i get
[{u'1439087809000': 36}]
[{u'1439087809000': 0}]
i would like to delete everything apart from the results
36
0
updated api to run code

Issue in scraping data from a html page using beautiful soup

I am scraping some data from a website and I am able to do so using the below referred code:
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice
page = urllib2.urlopen('http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('O2_2012-12-21.csv', 'wb') as csvfile:
spamwriter = csv.writer(csvfile, delimiter=',')
spamwriter.writerow(["Date","Month","Day of Week","OEM","Device Name","Price"])
oems = soup.findAll('span', {"class": "wwFix_h2"},text=True)
items = soup.findAll('div',{"class":"title"})
prices = soup.findAll('span', {"class": "handset"})
for oem, item, price in zip(oems, items, prices):
textcontent = u' '.join(islice(item.stripped_strings, 1, 2, 1))
if textcontent:
spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(oem.string).encode('utf8').strip(),textcontent,unicode(price.string).encode('utf8').strip()])
Now, issue is 2 of the all the price values I am scraping have different html structure then rest of the values. My output csv is showing "None" value for those because of this. Normal html structure for price on webpage is
<span class="handset">
FREE to £79.99</span>
For those 2 values structure is
<span class="handset">
<span class="delivery_amber">Up to 7 days delivery</span>
<br>"FREE on all tariffs"</span>
Out which I am getting right now displays None for the second html structure instead of Free on all tariffs, also price value Free on all tariffs is mentioned under double quotes in second structure while it is outside any quotes in first structure
Please help me solve this issue, Pardon my ignorance as I am new to programming.

Just detect those 2 items with an additional if statement:
if price.string is None:
price_text = u' '.join(price.stripped_strings).replace('"', '').encode('utf8')
else:
price_text = unicode(price.string).strip().encode('utf8')
then use price_text for your CSV file. Note that I removed the " quotes with a simple replace call.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Table data returning empty values after web scraping - html

Related

Empty data table- using Beautifulsoup

Getting html table data other than text content (get "title" tag data)

Scrape table from webpage when in <div> format - using Beautiful Soup

Format URL query "example [{u'1438923522000': 98}]"

Issue in scraping data from a html page using beautiful soup

Categories

Resources