Check if a page contains a certain text - html

How can I check that a given text exists on an HTML page, regardless of where it is located, which HTML tags surround it, and its case? I only know the text, and I want to make sure the page contains it and that the text is visible.

and the text is visible
This part is crucial: to determine an element's visibility reliably, you need the page rendered. Let's automate a real browser with Selenium, get the element containing the desired text, and check whether that element is displayed. Example in Python:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
desired_text = "Desired text"
driver = webdriver.Chrome()
driver.get("url")
try:
    # Match the element whose own text node contains the text; contains(., ...)
    # would match <html> first, because every ancestor also contains the text.
    element = driver.find_element(By.XPATH, "//*[text()[contains(., '%s')]]" % desired_text)
    print("Element found")
    if element.is_displayed():
        print("Element is visible")
    else:
        print("Element is not visible")
except NoSuchElementException:
    print("Element not found")
finally:
    driver.quit()
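The question also asks for a match regardless of case. XPath 1.0, which browsers implement, has no lower-case function, so a common workaround is translate(). The sketch below only builds the XPath string; the helper names xpath_literal and text_xpath are hypothetical, and the case folding covers ASCII letters only.

```python
import string


def xpath_literal(s):
    """Quote s as an XPath 1.0 string literal (concat() if it mixes quote types)."""
    if "'" not in s:
        return "'%s'" % s
    if '"' not in s:
        return '"%s"' % s
    parts = s.split("'")
    return "concat(" + ", \"'\", ".join("'%s'" % p for p in parts) + ")"


def text_xpath(text):
    """Build an XPath matching elements whose own text contains `text`, ignoring ASCII case."""
    upper = string.ascii_uppercase
    lower = string.ascii_lowercase
    needle = xpath_literal(text.lower())
    return "//*[text()[contains(translate(., '%s', '%s'), %s)]]" % (upper, lower, needle)
```

The resulting string can be passed to find_element(By.XPATH, ...) in place of the case-sensitive expression.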

Related

Docs Inserted Image always before all text

I'm writing a simple Apps Script that puts images and text into a Google Doc in two columns. For whatever reason, no matter how I try it, the images always end up above the text in the Doc, even though they should be interleaved with it (inline):
//Replace QR Code
let qrText = editLocalBody.findText("{{qrCode}}");
let setImagePlace = qrText.getElement().asText().replaceText("{{qrCode}}", "");
let qrCodeImage = setImagePlace.getParent().asParagraph().insertInlineImage(0, qrCodeBlob);
From what I've seen, this should insert an image wherever the text was previously located, but when it runs, the image is always in the wrong spot, somehow above the text it was supposed to be in!
//Edit - To Show The Progression Of What Is Supposed To Happen And What Actually Happens:
I'm making QR code badges for a proprietary system that is tightly integrated with Google, so I'm using Apps Script to take an entry from a Google Form containing a number of badges (with relevant data) and autofill a Google Doc accordingly.
// Loop Start
I fill my template with a line of text containing keywords that I can select and replace later, plus a keyword it can use to insert another such line (this part works).
I first find the QR code keyword (findText("{{qrCode}}")) and replace it (.replaceText) with nothing ("").
I then get the parent of the element from the step above, which is a block of text, as a paragraph, and insert an inline image at child index 0 (insertInlineImage(0, qrCodeBlob)). I think that block is all the text in the Doc, and that this is where the issue lies: the image goes above the text because everything is one 'paragraph' rather than multiple 'bodies' of text. If I could separate them, I think it would work.
I've debugged this script quite a bit, so I know it's that final line, the one inserting the image, that fails: it sees all the text as 'one'.
// I want this (In Descending Order, each on its own full line):
Image
Text
Image
Text
//What It Gives Me (In Descending Order, each on its own full line):
Image
Image
Text
Text

Selenium, using find_element but end up with half the website

I finished the linked tutorial and tried to modify it to get something else from a different website. I am trying to get the margin table of HHI, but the website is coded in such a strange way that I am quite confused.
I can find a child of the element that holds the text with xpath://a[@name="HHI"]. Its parent is <font size="2"></font> and contains the text I want, but there are a lot of tags that look exactly like <font size="2"></font>, so I can't just use xpath://font[@size="2"].
Attempting to use the full XPath prints out half of the website's content.
the full xpath:
/html/body/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[3]/td/pre/font/table/tbody/tr/td[2]/pre/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font
Is there any way to select that particular font tag and print its text?
website:
https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm
Tutorial
https://www.youtube.com/watch?v=PXMJ6FS7llk&t=8740s&ab_channel=freeCodeCamp.org
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import pandas as pd
# prepare it to automate
from datetime import datetime
import os
import sys
import csv
application_path = os.path.dirname(sys.executable) # export the result to the same folder as the executable
now = datetime.now() # used to add a date to the export name
month_day_year = now.strftime("%m%d%Y") # MMDDYYYY
website = "https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm"
path = "C:/Users/User/PycharmProjects/Automate with Python – Full Course for Beginners/venv/Scripts/chromedriver.exe"
# headless mode (the options.headless attribute was removed in recent Selenium 4 releases)
options = Options()
options.add_argument("--headless")
service = Service(executable_path=path)
driver = webdriver.Chrome(service=service, options=options)
driver.get(website)
containers = driver.find_element(by="xpath", value='') # or find_elements
hhi = containers.text # if using find_elements, = containers[0].text
print(hhi)
Update:
Thank you to Conal Tuohy, I learned a few new tricks in XPath. The website is written in such a strange way that even with an XPath that locates the exact font tag, the result still contains the text of every following tag.
I make a list of the different products with .split("Back to Top"), take the first item, and use .split("\n") on it. I will keep splitting the lists within the list until the data fits neatly into a dataframe with strike prices as the index and maturity dates as the columns.
Probably not the most efficient way, but it works for now.
product = "HHI"
containers = driver.find_element(by="xpath", value=f'//font[a/@name="{product}"]')
hhi = containers.text.split("Back to Top")
# print(hhi)
hhi1 = hhi[0].split("\n")
df = pd.DataFrame(hhi1)
# print(df)
df.to_csv(f"{product}_{month_day_year}.csv")
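The repeated splitting described in the update could be sketched as a small helper that cuts the text at the "Back to Top" marker and splits each remaining line into cells. The function name table_rows and the sample text are hypothetical, and the sketch assumes the cells are separated by whitespace, which may not hold for the real page.

```python
def table_rows(block_text, end_marker="Back to Top"):
    """Split scraped text into rows of whitespace-separated cells.

    Everything from the first end_marker onward is discarded.
    """
    body = block_text.split(end_marker, 1)[0]
    return [line.split() for line in body.splitlines() if line.strip()]


# Hypothetical sample, not real data from the page.
sample = "MONTH STRIKE RATE\nJAN 5000 10\nFEB 5200 12\nBack to Top\nNEXT PRODUCT"
rows = table_rows(sample)
```

If the first row really is a header, something like pd.DataFrame(rows[1:], columns=rows[0]) would turn the result into a dataframe.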
You're right that HTML is just awful! But if you're after the text of the table, it seems to me you ought to select the text node that follows the B element that follows the a[@name="HHI"]; something like this:
//a[@name="HHI"]/following-sibling::b/following-sibling::text()[1]
EDIT
Of course that XPath won't work in Selenium because it identifies a text node rather than an element. So your best result is to return the font element that directly contains the //a[@name="HHI"], which will include some cruft (the Back to Top link, etc) but which will at least contain the tabular data you want:
//a[@name="HHI"]/parent::font
i.e. "the parent font element of the a element whose name attribute equals HHI"
or equivalently:
//font[a/@name="HHI"]
i.e. "the font element which has, among its child a elements, one whose name attribute equals HHI"
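To illustrate the structure these expressions rely on, here is a minimal sketch using Python's standard xml.etree.ElementTree on a small, well-formed sample. The sample markup is hypothetical (the real page is unlikely to parse as XML), and ElementTree supports only a subset of XPath, so the parent step is written as "..". The text node that follows the b element is exposed as that element's tail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page structure.
SAMPLE = (
    '<root>'
    '<font size="2"><a name="HSI"/><b>HSI</b>\nother row\n</font>'
    '<font size="2"><a name="HHI"/><b>HHI</b>\nrow 1\nrow 2\n'
    '<a href="#top">Back to Top</a></font>'
    '</root>'
)

root = ET.fromstring(SAMPLE)

# Parent of the anchor named HHI, i.e. the enclosing font element.
font = root.find(".//a[@name='HHI']/..")

# The text node directly after <b> is stored as b.tail in ElementTree.
table_text = font.find("b").tail.strip()
```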

HTML Dec Code image in Tkinter label — either text or image is doubled

I'd like to add a picture to some of my tkinter labels, and I found a page with many of them (there are, of course, many similar pages), including some that I want.
But I'm having a strange behavior with this.
The code
import tkinter as tk
from tkinter import ttk
import html
root = tk.Tk()
root.geometry("200x100")
s = html.unescape('&#127937') # chequered flag
text = "some text"
label_text = "{}{}".format(text, s)
my_label = ttk.Label(root, text=label_text)
my_label.pack()
t = chr(9917)
another = "football ball"
another_text = "{}{}".format(t, another)
another_label = ttk.Label(root, text=another_text)
another_label.pack()
root.mainloop()
produces the following window:
On the other hand, if I replace label_text = "{}{}".format(text, s) with label_text = "{}{}".format(s, text) the flag appears twice instead (once before "some text" and another after).
Apparently this only happens with html images.
For example, with the second label, I have the expected behavior.
Is there something I'm doing wrong here, or should I just avoid these images in tkinter?
I wouldn't avoid them, but I wouldn't recommend them either. tkinter probably expects regular images and is probably not used to emojis. My recommendation is to use regular images instead of emojis.
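One difference between the two characters can be checked directly: the chequered flag (code point 127937) lies outside the Basic Multilingual Plane, while the football (9917) lies inside it. Some Tk builds are known to handle characters above U+FFFF poorly; whether that explains the doubling seen here is an assumption. The helper name outside_bmp is hypothetical.

```python
import html


def outside_bmp(s):
    """Return the characters of s whose code point is above U+FFFF."""
    return [ch for ch in s if ord(ch) > 0xFFFF]


flag = html.unescape('&#127937;')  # chequered flag, U+1F3C1
ball = chr(9917)                   # football, U+26BD

flag_hits = outside_bmp("some text" + flag)
ball_hits = outside_bmp(ball + "football ball")
```

A label text for which outside_bmp returns a non-empty list is a candidate for being replaced by a regular image.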

Selenium, Python 3, simple scraping text from Erowid LSD experiences?

Based off of an answer on here about a similar thing, I tried to scrape the text of Erowid trip experiences. The URL has a bunch of trip links. I want to click each link and then print the 'report-text-surround' element, which is the trip text.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.erowid.org/experiences/exp.cgi?S1=2&S2=-3&C1=9&Str=')
# I tried to get hrefs by xpath, knowing that each trip link starts with 'exp.php?ID'.
view_links = driver.find_elements_by_xpath("""//*[contains(text(), 'exp.php?ID')]""")
for index, view in enumerate(view_links):
    html = view.get_attribute('innerHTML')
    href = html.split('"')[1]
    view_links[index] = href
# And then visit each href and get the data
for href in view_links:
    driver.get(href)
    # I know this is the element containing the trip text.
    trip_text = driver.find_elements_by_class_name('report-text-surround')
    for trip in trip_text:
        print(trip.text.encode('utf-8'))
So you are pretty close but there are just 2 small mistakes.
trip_text = driver.find_elements_by_class_name('report-text-surround')
for trip in trip_text:
    print(trip.text.encode('utf-8'))
Your driver.find_elements_by_class_name call should use the singular form, as there is only one matching element on the page. It has a lot of child elements, but only one element has the class 'report-text-surround'. This means you're going to get all the text at once; you could change this, but you'd have to go through the child elements or get the elements separately.
You can change that entire section to this:
text = driver.find_element_by_class_name('report-text-surround').text.encode('utf-8')
print(text)
That will give you all of the text in the entire article. An easy way to split this up afterwards would be to split the text on \n\n.
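The split on \n\n could look like the sketch below. It assumes the text is still a str (that is, before .encode('utf-8') is applied) and that paragraphs really are separated by blank lines; the function name split_paragraphs is hypothetical.

```python
def split_paragraphs(report_text):
    """Split text on blank lines and drop empty or whitespace-only parts."""
    return [part.strip() for part in report_text.split("\n\n") if part.strip()]


# Hypothetical sample standing in for the element's .text value.
sample = "First.\n\nSecond line one\nline two.\n\n\n\nThird."
paragraphs = split_paragraphs(sample)
```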

Combining HTML and Tkinter Text Input

I'm looking for a way to build a body of text that is inserted into an HTML document from text the user types into an Entry. I have figured out how to open the browser in a new tab when the button is clicked and display the HTML string. Where I'm stuck is getting the user input from the wbEntry widget into the HTML string held in 'message'. I was looking at using a lambda as the command for wbbutton, but I'm not sure that's the right direction.
from tkinter import *
import webbrowser
def wbbrowser():
    f = open('index.html','w')
    message = "<html><head></head><body><p>This is a test</p></body></html>"
    f.write(message)
    f.close()
    webbrowser.open_new_tab('index.html')
wbGui = Tk()
source = StringVar()
wbGui.geometry('450x450+500+300')
wbGui.title('Web Browser')
wblabel = Label(wbGui,text='Type Your Text Below').pack()
wbbutton = Button(wbGui,text="Open Browser",command = wbbrowser).pack()
wbEntry = Entry(wbGui,textvariable=source).pack()
wbGui.mainloop()
I am using Python 3.5 and Tkinter on Windows 7. The code above does not work for me on Mac OS X, as that would require a different setup for my wbbrowser function. Any help would be appreciated.
Since you are associating a StringVar with the entry widget, all you need to do is fetch the value from the variable before inserting it into the message.
def wbbrowser():
    ...
    text = source.get()
    message = "<html><head></head><body><p>%s</p></body></html>" % text
    ...
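Because the entry text is inserted straight into markup, characters such as < or & typed by the user would be interpreted as HTML. A minimal sketch of one way to handle that, using the standard library's html.escape, is shown below; the function name build_message is hypothetical and not part of the original code.

```python
import html


def build_message(user_text):
    """Return a minimal HTML page with user_text escaped inside a <p> element."""
    safe_text = html.escape(user_text)
    return "<html><head></head><body><p>%s</p></body></html>" % safe_text
```

Inside wbbrowser, this could be called as message = build_message(source.get()).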