Trouble pulling HTML table data with pandas

I am attempting to use pandas to scrape a very simple html table from a website. My code works fine on other sites, but not on the site of interest. It gives me "ValueError: No text parsed from document". Are there any common situations that prevent this?
I've also tried using requests, but the text property shows up empty as well.
urllogin = 'http://website.html'
values = {'user': 'id',
          'password': 'pass'}
r = requests.post(urllogin, data=values)
url = 'http://website/table'
tables = pd.read_html(url)
EDIT
This returns nothing as well.
soup = BeautifulSoup(urlopen(url), "html.parser")
print(soup.prettify())
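One likely explanation (an assumption, since the site itself is private): the page requires the login cookies, but `pd.read_html` fetches the URL in a fresh, logged-out request. A sketch of the usual fix is to keep one `requests.Session` and feed its response text to `read_html`:

```python
from io import StringIO

import pandas as pd

# A sketch of the fix (the Session part is shown in comments because it needs
# the real site): keep one Session so the login cookies persist, then pass the
# page *text* to read_html rather than the URL.
#
#     import requests
#     with requests.Session() as s:
#         s.post(urllogin, data=values)   # log in; cookies stay on the session
#         page = s.get(url)               # fetch the table page while logged in
#         tables = pd.read_html(StringIO(page.text))
#
# read_html itself is happy with any HTML text:
html = "<table><tr><th>team</th><th>pts</th></tr><tr><td>BOS</td><td>102</td></tr></table>"
tables = pd.read_html(StringIO(html))
print(tables[0])
```

If `tables` is still empty after logging in this way, the table is probably built by JavaScript after page load, which `requests` never executes; that also explains the empty BeautifulSoup output.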

Related

Selenium, using find_element but end up with half the website

I finished the linked tutorial and tried to modify it to get something else from a different website. I am trying to get the margin table for HHI, but the website is coded in such a strange way that I am quite confused.
I found the child element of the parent that has the text, with the XPath //a[@name="HHI"]. Its parent is <font size="2"></font> and contains the text I want, but there are many tags named exactly <font size="2"></font>, so I can't just use the XPath //font[@size="2"].
Attempting to use the full XPath prints out half of the website's content.
the full xpath:
/html/body/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[3]/td/pre/font/table/tbody/tr/td[2]/pre/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font
Is there anyway to select that particular font tag and print the text?
website:
https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm
Tutorial
https://www.youtube.com/watch?v=PXMJ6FS7llk&t=8740s&ab_channel=freeCodeCamp.org
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import pandas as pd
# prepare it to automate
from datetime import datetime
import os
import sys
import csv

application_path = os.path.dirname(sys.executable)  # export the result next to the executable
now = datetime.now()  # used to stamp the export name with a date
month_day_year = now.strftime("%m%d%Y")  # MMDDYYYY

website = "https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm"
path = "C:/Users/User/PycharmProjects/Automate with Python – Full Course for Beginners/venv/Scripts/chromedriver.exe"

# headless mode
options = Options()
options.headless = True

service = Service(executable_path=path)
driver = webdriver.Chrome(service=service, options=options)
driver.get(website)

containers = driver.find_element(by="xpath", value='')  # or find_elements
hhi = containers.text  # if using find_elements, use containers[0].text
print(hhi)
Update:
Thank you to Conal Tuohy; I learned a few new tricks in XPath. The website is written in such a strange way that even with an XPath that locates the exact font tag, the result still prints all the text in every following tag.
I tried to make a list of the different products with .split("Back to Top"), then slice out the first item and use .split("\n"). I will keep splitting the lists within the list until they fit neatly into a DataFrame with strike prices as the index and maturity dates as the columns.
Probably not the most efficient way, but it works for now.
product = "HHI"
containers = driver.find_element(by="xpath", value=f'//font[a/@name="{product}"]')
hhi = containers.text.split("Back to Top")
# print(hhi)
hhi1 = hhi[0].split("\n")
df = pd.DataFrame(hhi1)
# print(df)
df.to_csv(f"{product}_{month_day_year}.csv")
You're right that the HTML is just awful! But if you're after the text of the table, it seems to me you ought to select the text node that follows the b element that follows the a[@name="HHI"]; something like this:
//a[@name="HHI"]/following-sibling::b/following-sibling::text()[1]
EDIT
Of course that XPath won't work in Selenium, because it identifies a text node rather than an element. So your best bet is to return the font element that directly contains the //a[@name="HHI"], which will include some cruft (the "Back to Top" link, etc.) but which will at least contain the tabular data you want:
//a[@name="HHI"]/parent::font
i.e. "the parent font element of the a element whose name attribute equals HHI"
or equivalently:
//font[a/@name="HHI"]
i.e. "the font element which has, among its child a elements, one whose name attribute equals HHI"
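Both expressions can be checked outside Selenium, e.g. with lxml. The snippet below runs them against a made-up miniature of the page (the real markup is huge, so the structure here is assumed from the question):

```python
from lxml import html

# A toy document shaped like the HKEX page (structure assumed from the question):
doc = html.fromstring(
    "<pre><font size='2'>"
    "<a name='HHI'></a><b>HHI margin table</b>\n25000  9999\n"
    "</font></pre>"
)

# "the font element which has, among its child a elements, one whose
# name attribute equals HHI":
font = doc.xpath("//font[a/@name='HHI']")[0]
print(font.text_content())

# The text-node version from the first answer also works here, since lxml
# (unlike Selenium) is happy to return text nodes:
rows = doc.xpath("//a[@name='HHI']/following-sibling::b/following-sibling::text()[1]")
print(rows)
```

In Selenium you would pass the element-returning expression to driver.find_element(by="xpath", value=...), exactly as in the updated code above.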

Accessing html Tables with rvest

So I want to scrape some NBA data. The following is what I have so far, and it is perfectly functional:
install.packages('rvest')
library(rvest)
url = "https://www.basketball-reference.com/boxscores/201710180BOS.html"
webpage = read_html(url)
table = html_nodes(webpage, 'table')
data = html_table(table)
away = data[[1]]
home = data[[3]]
colnames(away) = away[1,] #set appropriate column names
colnames(home) = home[1,]
away = away[away$MP != "MP",] #remove rows that are just column names
home = home[home$MP != "MP",]
The problem is that these tables don't include the team names, which is important. To get this information, I was thinking I would scrape the four factors table on the page; however, rvest doesn't seem to recognize it as a table. The div that contains the four factors table is:
<div class="overthrow table_container" id="div_four_factors">
And the table is:
<table class="suppress_all sortable stats_table now_sortable" id="four_factors" data-cols-to-freeze="1"><thead><tr class="over_header thead">
This made me think that I could access the table with something along the lines of
table = html_nodes(webpage, '#div_four_factors')
but this doesn't seem to work, as I am getting just an empty list. How can I access the four factors table?
I am by no means an HTML expert, but it appears that the table you are interested in is commented out in the source code, and the comment is overridden at some point before the page is rendered.
If we assume that the home team is always listed second, we can just use positional arguments and scrape another table on the page:
table = html_nodes(webpage, '#bottom_nav_container')
teams <- html_text(table[1]) %>%
  stringr::str_split("Schedule\n")
away$team <- trimws(teams[[1]][1])
home$team <- trimws(teams[[1]][2])
Obviously not the cleanest solution, but such is life in the world of web scraping.
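As for the four factors table itself: since it ships inside an HTML comment, one general trick is to find the comment node and re-parse its text as a fresh document. The sketch below shows the idea in Python with BeautifulSoup on a made-up miniature of the page (in rvest the analogue would be selecting comment() nodes via XPath and re-reading their text):

```python
from bs4 import BeautifulSoup, Comment

# A miniature of the basketball-reference pattern: the table lives inside an
# HTML comment, so ordinary node selection returns nothing.
page = """
<div class="overthrow table_container" id="div_four_factors">
<!-- <table id="four_factors"><tr><th>Team</th></tr><tr><td>CLE</td></tr></table> -->
</div>
"""
soup = BeautifulSoup(page, "html.parser")
div = soup.find(id="div_four_factors")

# Find the comment node inside the div, then parse its text as its own document.
comment = div.find(string=lambda text: isinstance(text, Comment))
table = BeautifulSoup(comment, "html.parser").find("table")
print(table.find("td").text)
```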

Select XPath to scrape data with scrapy

I have looked at some of the XPath threads on here and read two XPath guides, but I am having trouble writing working code to scrape a webpage full of restaurants, phone numbers, and addresses with scrapy.
Here is a picture of the site's source code with the data I want highlighted.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select("//b")
    items = []
    for title in titles:
        item = KosherScrapeItem()
        item["BusinessName"] = title.select('//td[@class="line-content"]//html/body/table/tbody/tr[()').extract()
This is what I get from using "Copy XPath" in Chrome:
/html/body/table/tbody/tr[485]/td[2]/text()
Can someone help me with my XPath expression?
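Without the page itself, here is a hedged sketch of the usual fix: use a short relative expression rather than the absolute /html/body/... path that Chrome copies, which breaks as soon as the markup shifts. The fragment below is invented (only the "line-content" class comes from the posted XPath); lxml is used for illustration, but scrapy's response.xpath(...) takes the same expressions:

```python
from lxml import html

# Invented fragment; only the "line-content" class comes from the question.
fragment = """
<table><tr>
  <td class="line-content"><b>Kosher Deli</b> 555-0100, 12 Main St</td>
</tr></table>
"""
doc = html.fromstring(fragment)

# Relative expressions: every td of that class, then its bold business name
# and the text node that follows it (phone number and address).
names = doc.xpath('//td[@class="line-content"]/b/text()')
details = doc.xpath('//td[@class="line-content"]/b/following-sibling::text()[1]')
print(names, details)
```

Inside the spider's loop, the expression should also start with `.` (e.g. `title.select('.//...')`) so it is evaluated relative to each matched element instead of the whole document.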

Issue with Apache POI when converting .docx to a JSON document format

I am currently parsing a 26-page .docx with images, tables, italics, and underlines.
Using Apache POI, I created an XWPFDocument with a list of XWPFParagraphs. When I iterate through the paragraphs, I am not able to get the styles (italics, underlines, bold) for individual runs when a single paragraph contains different styles.
I have tried using XWPFParagraph.getRuns() and then XWPFRun.getFontFamily(), but I get null. However, I do get data at the paragraph level when I call XWPFParagraph.getStyle().
Please let me know if you have encountered similar issues.
I hope this code can help you; you can get some of the styling from the CTRPr object.
CTRPr rPr = run.getCTR().getRPr();
if (rPr != null) {
    CTFonts rFonts = rPr.getRFonts();
    if (rFonts != null) {
        String eastAsia = rFonts.getEastAsia();
        String hAnsi = rFonts.getHAnsi();
        Enum hAnsiTheme = rFonts.getHAnsiTheme();
    }
}

Combining HTML and Tkinter Text Input

I'm looking for help constructing a body of text that can be placed in an HTML document from text the user types into an Entry widget. I have figured out how to make the browser open a new window when the button is clicked and display the HTML string. The part I am stuck on is grabbing the user input from the wbEntry variable and combining it with the HTML string written out as 'message'. I was looking at lambdas to use as a command for wbbutton, but I'm not sure that's the right direction for a solution.
from tkinter import *
import webbrowser

def wbbrowser():
    f = open('index.html', 'w')
    message = "<html><head></head><body><p>This is a test</p></body></html>"
    f.write(message)
    f.close()
    webbrowser.open_new_tab('index.html')

wbGui = Tk()
source = StringVar()
wbGui.geometry('450x450+500+300')
wbGui.title('Web Browser')
wblabel = Label(wbGui, text='Type Your Text Below').pack()
wbbutton = Button(wbGui, text="Open Browser", command=wbbrowser).pack()
wbEntry = Entry(wbGui, textvariable=source).pack()
I am using Python 3.5 and Tkinter on Windows 7. The code above does not work on my Mac OS X machine, as that would require a different setup for my wbbrowser function. Any help would be appreciated.
Since you are associating a StringVar with the entry widget, all you need to do is fetch the value from the variable before inserting it into the message.
def wbbrowser():
    ...
    text = source.get()
    message = "<html><head></head><body><p>%s</p></body></html>" % text
    ...
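One small refinement on the answer above (a sketch; build_page is a made-up helper name): factor the message building into its own function and escape the user's text, so characters like < or & typed into the Entry can't break the markup:

```python
import html

def build_page(text):
    # Escape the user's text so characters like < > & typed into the
    # Entry widget can't break the generated markup.
    return "<html><head></head><body><p>%s</p></body></html>" % html.escape(text)

# Inside wbbrowser() this becomes: message = build_page(source.get())
print(build_page("this & that"))
```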