Selenium: using find_element but ending up with half the website's HTML

I finished the linked tutorial and tried to modify it to get something else from a different website. I am trying to get the margin table for HHI, but the website is coded in such a strange way that I am quite confused.
I found the child element that has the text I want with the XPath //a[@name="HHI"]. Its parent is <font size="2"></font> and contains the text I want, but there are a lot of tags named exactly <font size="2"></font>, so I can't just use the XPath //font[@size="2"].
Attempting to use the full XPath prints out half of the website's content.
The full XPath:
/html/body/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[3]/td/pre/font/table/tbody/tr/td[2]/pre/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font
Is there any way to select that particular font tag and print its text?
Website:
https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm
Tutorial:
https://www.youtube.com/watch?v=PXMJ6FS7llk&t=8740s&ab_channel=freeCodeCamp.org
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import pandas as pd
# prepare it to automate
from datetime import datetime
import os
import sys
import csv

application_path = os.path.dirname(sys.executable)  # export the result to the same folder as the executable
now = datetime.now()  # for modifying the export name with a date
month_day_year = now.strftime("%m%d%Y")  # MMDDYYYY

website = "https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm"
path = "C:/Users/User/PycharmProjects/Automate with Python – Full Course for Beginners/venv/Scripts/chromedriver.exe"

# headless mode
options = Options()
options.headless = True

service = Service(executable_path=path)
driver = webdriver.Chrome(service=service, options=options)
driver.get(website)

containers = driver.find_element(by="xpath", value='')  # the full XPath above goes in value; or use find_elements
hhi = containers.text  # if using find_elements: containers[0].text
print(hhi)
Update:
Thanks to Conal Tuohy, I learned a few new tricks in XPath. The website is written in such a strange way that even with the XPath that locates the exact font tag, the result still prints all the text in every following tag.
I made a list of the different products with .split("Back to Top"), then sliced out the first item and used .split("\n"). I will keep .split()-ting the lists within the list until the data fits neatly into a dataframe with strike prices as the index and maturity dates as the columns (a sketch of that step follows the code below).
Probably not the most efficient way, but it works for now.
product = "HHI"
containers = driver.find_element(by="xpath", value=f'//font[a/#name="{product}"]')
hhi = containers.text.split("Back to Top")
# print(hhi)
hhi1 = hhi[0].split("\n")
df = pd.DataFrame(hhi1)
# print(df)
df.to_csv(f"{product}_{month_day_year}.csv")
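For that splitting step, here is a minimal sketch of one way to reshape the rows, assuming each data row is whitespace-separated with the strike price in the first column (the actual table layout is an assumption, not confirmed from the page):

rows = [line.split() for line in hhi1 if line.strip()]  # drop blank lines, split rows on whitespace
df = pd.DataFrame(rows)
df = df.set_index(0)  # assumption: column 0 holds the strike prices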

You're right that the HTML is just awful! But if you're after the text of the table, it seems to me you ought to select the text node that follows the b element that follows the a[@name="HHI"]; something like this:
//a[#name="HHI"]/following-sibling::b/following-sibling::text()[1]
EDIT
Of course that XPath won't work in Selenium, because it identifies a text node rather than an element. So your best result is to return the font element that directly contains the //a[@name="HHI"], which will include some cruft (the Back to Top link, etc.) but which will at least contain the tabular data you want:
//a[#name="HHI"]/parent::font
i.e. "the parent font element of the a element whose name attribute equals HHI"
or equivalently:
//font[a/@name="HHI"]
i.e. "the font element which has, among its child a elements, one whose name attribute equals HHI"

Related

HTML Dec Code image in Tkinter label — either text or image is doubled

I'd like to add a picture to some of my tkinter labels, and I found a page with many of them (there are, of course, many similar pages), including some that I want.
But I'm seeing some strange behavior with this.
The code
import tkinter as tk
from tkinter import ttk
import html
root = tk.Tk()
root.geometry("200x100")
s = html.unescape('&#127937;') # chequered flag
text = "some text"
label_text = "{}{}".format(text, s)
my_label = ttk.Label(root, text=label_text)
my_label.pack()
t = chr(9917)
another = "football ball"
another_text = "{}{}".format(t, another)
another_label = ttk.Label(root, text=another_text)
another_label.pack()
root.mainloop()
produces the following window:
On the other hand, if I replace label_text = "{}{}".format(text, s) with label_text = "{}{}".format(s, text), the flag appears twice instead (once before "some text" and once after).
Apparently this only happens with the characters I get from HTML entities.
For example, with the second label, I have the expected behavior.
Is there something I'm doing wrong here, or should I just avoid these images in tkinter?
I wouldn't avoid them, yet I wouldn't advise them either. Because tkinter probably expects regular images, it is probably not well suited to emojis. My recommendation is to use regular images instead of emojis.
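A minimal sketch of that regular-image approach, assuming a local PNG file named flag.png (the filename is hypothetical; tk.PhotoImage reads PNG and GIF in Tk 8.6+):

import tkinter as tk
from tkinter import ttk

root = tk.Tk()
flag = tk.PhotoImage(file="flag.png")  # hypothetical local image file
# compound="right" draws the image after the text, like the original "{}{}".format(text, s)
my_label = ttk.Label(root, text="some text", image=flag, compound="right")
my_label.pack()
root.mainloop()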

Selenium, Python 3, simple scraping text from Erowid LSD experiences?

Based on an answer here about a similar thing, I tried to scrape the text of Erowid trip experiences. The URL has a bunch of trip links. I want to click each link and then print the 'report-text-surround' element, which is the trip text.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.erowid.org/experiences/exp.cgi?S1=2&S2=-3&C1=9&Str=')

# I tried to get the hrefs by XPath, knowing that each trip link starts with 'exp.php?ID'.
view_links = driver.find_elements_by_xpath("//*[contains(text(), 'exp.php?ID')]")
for index, view in enumerate(view_links):
    html = view.get_attribute('innerHTML')
    href = html.split('"')[1]
    view_links[index] = href

# And then visit each href and get the data.
for href in view_links:
    driver.get(href)
    # I know this is the element containing the trip text.
    trip_text = driver.find_elements_by_class_name('report-text-surround')
    for trip in trip_text:
        print(trip.text.encode('utf-8'))
So you are pretty close but there are just 2 small mistakes.
trip_text = driver.find_elements_by_class_name('report-text-surround')
for trip in trip_text:
    print(trip.text.encode('utf-8'))
Your driver.find_elements_by_class_name should not be plural, as there is only one such element on the page. The page has a lot of elements, but only one with the class 'report-text-surround'. This means you're going to get all the text at once; you could change this, but you'd have to go through the child elements or get the elements separately.
You can change that entire section to this:
text = driver.find_element_by_class_name('report-text-surround').text.encode('utf-8')
print(text)
That will give you all of the text in the entire article. An easy way to split it up afterwards is to split the text on \n\n, as in the sketch below.
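A minimal sketch of that splitting step (decoding back to str first, since the text was encoded to UTF-8 bytes above):

paragraphs = text.decode('utf-8').split('\n\n')  # one entry per blank-line-separated block
for i, paragraph in enumerate(paragraphs):
    print(i, paragraph[:80])  # preview the first 80 characters of each block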

Check if a page contains a certain text

How can I find a text, or rather make sure it exists, on an HTML page, regardless of where it's located, what HTML tags surround it, and its case? I just know the text, and I want to make sure the page contains it and that it is visible.
and the text is visible
This part is the crucial one: in order to determine an element's visibility reliably, you need the page rendered. Let's automate a real browser with Selenium, get the element containing the desired text, and check whether the element is displayed. Example in Python:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

desired_text = "Desired text"

driver = webdriver.Chrome()
driver.get("url")

try:
    is_displayed = driver.find_element_by_xpath("//*[contains(., '%s')]" % desired_text).is_displayed()
    print("Element found")
    if is_displayed:
        print("Element is visible")
    else:
        print("Element is not visible")
except NoSuchElementException:
    print("Element not found")

Combining HTML and Tkinter Text Input

I'm looking for a way to construct a body of text that can be placed in an HTML document from text that users type into an Entry. I have figured out how to make the browser open a new window when the button is clicked and display the HTML string. However, I am stuck on grabbing the user input from the wbEntry variable and combining it with the HTML string output by message. I was looking at lambdas to use as the command for wbbutton, but I am not sure that's the direction to look for a solution.
from tkinter import *
import webbrowser

def wbbrowser():
    f = open('index.html', 'w')
    message = "<html><head></head><body><p>This is a test</p></body></html>"
    f.write(message)
    f.close()
    webbrowser.open_new_tab('index.html')

wbGui = Tk()
source = StringVar()

wbGui.geometry('450x450+500+300')
wbGui.title('Web Browser')

wblabel = Label(wbGui, text='Type Your Text Below').pack()
wbbutton = Button(wbGui, text="Open Browser", command=wbbrowser).pack()
wbEntry = Entry(wbGui, textvariable=source).pack()
I am using Python 3.5 and Tkinter on Windows 7. The code above does not work on my Mac OS X, as that would require a different setup for my wbbrowser function. Any help would be appreciated.
Since you are associating a StringVar with the entry widget, all you need to do is fetch the value from the variable before inserting it into the message.
def wbbrowser():
    ...
    text = source.get()
    message = "<html><head></head><body><p>%s</p></body></html>" % text
    ...
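Putting it together, a minimal sketch of the full function under that approach (the widget setup from the question is unchanged):

def wbbrowser():
    text = source.get()  # fetch whatever the user typed into the Entry via its StringVar
    message = "<html><head></head><body><p>%s</p></body></html>" % text
    with open('index.html', 'w') as f:
        f.write(message)
    webbrowser.open_new_tab('index.html')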

Scrapy: list all links and info contained in the same page of a website

I have the following basic mini spider that I use to get all links from a website.
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class SampleItem(Item):
    link = Field()

class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
I was wondering whether it would be possible to have this same spider also scrape some HTML (like the snippet below) from these same links, and to list the link and the info in a CSV in two separate columns?
<span class="price">50,00 €</span>
Yes, of course that's possible. First of all you need to use a feed export. This can be set in settings.py with the options:
FEED_FORMAT = 'csv'
FEED_URI = 'file:///absolute/path/to/the/output.csv'
Then you will have to adjust your items to allow more elements. Currently, you only use the link. You will want to add a price field.
class SampleItem(Item):
    link = Field()
    price = Field()
One sidenote: usually we define items in the items.py file, because generally multiple spiders should scrape the same type of item from several pages. You would then import them into your spider using from scrapername.items import SampleItem (see the sketch below). An example application for this would be a price scraper which scrapes both Amazon and some smaller shops.
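A minimal sketch of that convention, assuming a project named scrapername (the project name here is hypothetical):

# scrapername/items.py - item definitions shared by all spiders in the project
from scrapy.item import Field, Item

class SampleItem(Item):
    link = Field()
    price = Field()

# in the spider module:
# from scrapername.items import SampleItem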
Finally, you will have to adjust the parse_page method of your spider. Currently you only save the URL into your item; you want to find the price and save it too. Finding numbers or text on a page is a key element of scraping. For this purpose we have selectors. Scrapy supports XPath, CSS and regular-expression selectors. The first two are especially useful because they can be nested. Regular expressions are generally used once you have found the correct HTML element but there is too much information within it.
A problem you might encounter is that a page might have multiple .price elements. Have you made sure there is only one? Otherwise the selector will give you all of them, and you might have to refine your selector using more of the surrounding tags.
So, let's assume there is only this one .price element and construct our selector. We use a CSS selector here because it's more intuitive in this case. You can call the selectors directly on the response using its css and xpath methods. Both of them always return elements, on which you can use css() and xpath() again. To get the textual representation you need to call extract() on them. This might be annoying at the beginning, but nesting selectors is very convenient. Note that the selectors give you the full HTML element including the tag. To get only the text content, you need to make this explicit: for CSS selectors via ::text, for XPath via /text().
def parse_page(self, response):
    item = SampleItem()
    item['link'] = response.url
    try:
        item['price'] = response.css('.price::text')[0].extract()
    except IndexError:
        # do whatever is best if the price cannot be found
        item['price'] = None
    return item