I'm trying to restaurant names from Tripadvisor with Python 3 & lxml. The text i'm trying to retrieve is in the following element and is named 'Al Fresco's in this case.
<a target="_blank" href="/Restaurant_Review-g293925-d8327527-Reviews-
Al_Fresco_s-Ho_Chi_Minh_City.html" class="property_title"
onclick="ta.restaurant_list_tracking.clickDetailTitle('/Restaurant_Review-
g293925-d8327527-Reviews-Al_Fresco_s-
Ho_Chi_Minh_City.html','tags_category_tag_restaurants','8327527','1','0');">
Al Fresco's
</a>
The Xpath reference to this element:
//*[#id="eatery_8327527"]/div[2]/div[1]/div[1]/a
I use the following simple code to retrieve the text in this element:
from lxml import html
import requests
page = requests.get('https://www.tripadvisor.nl/Restaurants-g293925-
Ho_Chi_Minh_City.html')
tree = html.fromstring(page.content)
#This will create a list of Names:
Name = tree.xpath('//*[#id="eatery_8327527"]/div[2]/div[1]/div[1]/a/text()')
print ('Name: ', Name)
This returns me an empty array: Name: []
How do I get the text I want?
Without having a look at the actual page your Xpath is probably too strict. Try something like this:
//a[contains(#href,"Restaurant_Review")]/text()
If that yields too many results try adding the parent in front.
Hope that helps.
UPDATE:
After having a look at the actual page, this i probably what you are looking for:
//a[contains(#class,"property_title")]/text()
Related
I finished the linked tutorial and tried to modify it to get somethings else from a different website. I am trying to get the margin table of HHI but the website is coded in a strange way that I am quite confused.
I find the child element of the parent that have the text with xpath://a[#name="HHI"], its parent is <font size="2"></font> and contains the text I wanted but there is a lot of tags named exactly <font size="2"></font> so I can't just use xpath://font[#size="2"].
Attempt to use the full xpath would print out half of the website content.
the full xpath:
/html/body/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[3]/td/pre/font/table/tbody/tr/td[2]/pre/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font
Is there anyway to select that particular font tag and print the text?
website:
https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm
Tutorial
https://www.youtube.com/watch?v=PXMJ6FS7llk&t=8740s&ab_channel=freeCodeCamp.org
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import pandas as pd
# prepare it to automate
from datetime import datetime
import os
import sys
import csv
application_path = os.path.dirname(sys.executable) # export the result to the same file as the executable
now = datetime.now() # for modify the export name with a date
month_day_year = now.strftime("%m%d%Y") # MMDDYYYY
website = "https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm"
path = "C:/Users/User/PycharmProjects/Automate with Python – Full Course for Beginners/venv/Scripts/chromedriver.exe"
# headless-mode
options = Options()
options.headless = True
service = Service(executable_path=path)
driver = webdriver.Chrome(service=service, options=options)
driver.get(website)
containers = driver.find_element(by="xpath", value='') # or find_elements
hhi = containers.text # if using find_elements, = containers[0].text
print(hhi)
Update:
Thank you to Conal Tuohy, I learn a few new tricks in Xpath. The website is written in a strange way that even with the Xpath that locate the exact font tag, the result would still print all text in every following tags.
I tried to make a list of different products by .split("Back to Top") then slice out the first item and use .split("\n"). I will .split() the lists within list until it can neatly fit into a dataframe with strike prices as index and maturity date as column.
Probably not the most efficient way but it works for now.
product = "HHI"
containers = driver.find_element(by="xpath", value=f'//font[a/#name="{product}"]')
hhi = containers.text.split("Back to Top")
# print(hhi)
hhi1 = hhi[0].split("\n")
df = pd.DataFrame(hhi1)
# print(df)
df.to_csv(f"{product}_{month_day_year}.csv")
You're right that HTML is just awful! But if you're after the text of the table, it seems to me you ought to select the text node that follows the B element that follows the a[#name="HHI"]; something like this:
//a[#name="HHI"]/following-sibling::b/following-sibling::text()[1]
EDIT
Of course that XPath won't work in Selenium because it identifies a text node rather than an element. So your best result is to return the font element that directly contains the //a[#name="HHI"], which will include some cruft (the Back to Top link, etc) but which will at least contain the tabular data you want:
//a[#name="HHI"]/parent::font
i.e. "the parent font element of the a element whose name attribute equals HHI"
or equivalently:
//font[a/#name="HHI"]
i.e. "the font element which has, among its child a elements, one whose name attribute equals HHI"
I have the following url: https://www.amazon.com/#twotabsearchtextbox. When I click on this link I would like it to go to Amazon, select the search bar, and input a value. My question is: how do I modify the URL to insert a value into the search bar?
Example Input
URL: https://www.amazon.com/#twotabsearchtextbox
Value: Avengers Movie
Expected Output:
I know you can select elements. I've also seen this done before somewhere. However, I cannot find a resource that explains the identifiers used. I saw you could highlight it here but not insert.
Perhaps there's something I'm missing, but I'd populate the search link.
<a href="https://www.amazon.com/s?k=YOUR+SEARCH+VALUE">
Like previously mentioned, you can even populate the value to whatever you'd like using some javascript. For example:
You have an anchor with an id containing what you want to search.
<a id="star+wars">My Link</a>
We can use javascript to populate the search URL with your search term, then add it to the anchor using setAttribute
let yourLink = document.getElementById("star+wars");
let searchID = yourLink.id;
yourLink.setAttribute("href", "https://www.amazon.com/s?k=" + searchID);
JSFiddle: https://jsfiddle.net/L5afroh1/
I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This patter recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list and then iterate down to the next instance of this code, scrape those two text snippets and store them as another list, etc. I'm building a list of list in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩)and then the longer string, which has a lot more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
Sounds like you could use attribute = value css selector with $ ends with operator
If there can only be one occurrence per page
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This is assuming those typeid=19 or typeid=15 only occur at the end of the strings of interest. The "," between the two in the selector is to allow for matching on either.
You could additionally handle possibility of not being present as follows:
from bs4 import BeautifulSoup
html ='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(html,'html.parser')
gender = soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text if soup.select_one("[href$='typeid=19'], [href$='typeid=15']") is not None else 'Not found'
print(gender)
Multiple values:
genders = [item.text for item in soup.select_one("[href$='typeid=19'], [href$='typeid=15']")]
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
OutPut:
[女孩]
I am trying to find a way to return the index of a HTML child tag based on its xpath.
For instance, on the right rail of a page, I have three elements:
//*[#id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[1]/h4
//*[#id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[2]/h4
//*[#id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[3]/h4
Assume that I've found the first element, and I want to return the number inside the tag div, which is 1. How can I do it?
I referred to this previous post (How to count HTML child tag in Selenium WebDriver using Java) but still cannot figure it out.
You can get the number using regex:
var regExp = /div\[([^)]+)\]/;
var matches = regExp.exec("//[#id=\"ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880\"]/div[2]/h4");
console.log(matches[1]); \\ returns 2
You can select preceeding sibling in xpath to get all the reports before your current one like this:
//h4[contains(text(),'hello1')]/preceding-sibling::h4
Now you only have to count how many you found plus the current and you have your index.
Another option would be to select all the reports at once and loop over them checking for their content. They always come in the same order they are in the dom.
for java it could look like this:
List<WebElement> reports = driver.findElements(By.xpath("//*[#id='ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880']/div/h4")
for(WebElement element : reports){
if(element.getText().contains("report1"){
return reports.indexOf(element) + 1;
}
}
Otherwise you will have to parse the xpath by yourself to extract the value (see LG3527118's answer for this).
date =soup.find_all("td", {"id": "utime"})
print date
[<td class="mstat-date" colspan="3" id="utime">23.11.2015 17:00</td>]
this is what i want
[23.11.2015 17:00]
print(soup.date.string)
AttributeError: 'NoneType' object has no attribute 'string'
Little help please, thank you.
The method find_all will always return a list. In a list, you must index, before you can access methods on the elements contained within. So when you have
date =soup.find_all("td", {"id": "utime"})
You can access the text from the first tag, by typing:
date[0].text
If there are more items in the list, you can use a list comprehension, like this:
[ _.text for _ in date ]
That will give you a list of dates, if the HTML you're scraping had more than one such date-like tags.
Your beautifulsoup instance will not have associated with it a date attribute/property, unless in some very specific conditions. Use your first function to access all the dates, so don't try this approach (bad: soup.date).