Correct xpath for table data Scrapy - html

I'm trying to scrape data from a table with the following html:
Sorry for posting it as an image; when I try to paste the code it does not display correctly. I am only interested in the text associated with the highlighted classes.
I have tried to work down the tree using, for example, response.xpath('//table/tbody/td').extract(), which returns nothing. I have also tried accessing the classes directly with, for example, response.xpath('//div/div/div/div/div/div/table/tbody/tr/td[class="pricePweek"]').extract(), but again this returns nothing. Is it the line breaks that are causing the problem here?
I haven't had this issue when using Scrapy before, but I have not tried to scrape from a table structure like this.

Your issue is that you are using the browser to validate your XPath expressions and then using them in Scrapy, which may not give you a true picture. Consider the HTML page below:
<html>
<body>
<table>
<tr>
<td class="name">Tarun</td>
</tr>
</table>
</body>
</html>
If you save the HTML in a file, open it in a browser, and inspect it, you will see a tbody element that the browser has added. It is not present in the source code, which is what Scrapy sees, so your XPath should not contain tbody. The following should work:
price = response.xpath('//td[@class="pricePweek"]').extract()
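If you want to confirm what Scrapy actually receives (as opposed to the DOM the browser shows you), a quick sketch using scrapy shell, where the URL is just a placeholder for your page:
scrapy shell "https://example.com/your-page"
>>> response.xpath('//td[@class="pricePweek"]/text()').extract()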

I am not certain what kind of output you prefer. Assuming that your expected output is one item per row of the data table, here is some sample code (you may need to remove the IPython console prompts):
In [10]: for tr in response.xpath('//table/tbody/tr'):
    ...:     item = dict()
    ...:     item['title'] = tr.xpath('./td[@class="title"]/text()').extract_first().strip()
    ...:     item['description'] = ','.join(x.strip() for x in tr.xpath('./td[@class="description"]//text()').extract())
    ...:     item['pricePweek'] = tr.xpath('./td[@class="pricePweek"]//text()').extract_first().strip()
    ...:     item['weeks'] = tr.xpath('./td[@class="weeks"]/text()').extract_first().strip()
    ...:     item['bookFees'] = tr.xpath('./td[@class="bookFees"]/text()').extract_first().strip()
    ...:     item['total'] = tr.xpath('./td[@class="total"]/text()').extract_first().strip()
    ...:     item['sDate'] = tr.xpath('./td[@class="sDate"]/text()').extract_first().strip()
    ...:     item['bookLink'] = tr.xpath('./td[@class="bookLink"]/a/@href').extract_first().strip()
    ...:     print(item)
And here is the printed output:
{'title': 'En-Suite (Ground Floor)', 'description': '10.5sqm,3/4 bed,En-suite Bathroom (WC, Basin and Bath),Use of ground floor communal kitchen', 'pricePweek': '£163.00', 'weeks': '50', 'bookFees': '£250.00', 'total': '£8,150.00', 'sDate': '23 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=2917&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=5386&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=3dd0f1b377330cfbad6327b728678cbd'}
{'title': 'En-Suite (Ground Floor)', 'description': '10.5sqm,3/4 bed,En-suite Bathroom (WC, Basin and Bath),Use of ground floor communal kitchen', 'pricePweek': '£163.00', 'weeks': '49', 'bookFees': '£250.00', 'total': '£7,987.00', 'sDate': '30 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=2917&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=6075&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=db85ff90cacb487ee98942d955141b09'}
{'title': 'Large Studio (Courtyard)', 'description': '22-23m,2,3/4 bed,Generous studio with same features as "Standard" but slightly larger,Dual Occupancy is available for an additional 20% of the advertised rate per week', 'pricePweek': '£223.00', 'weeks': '51', 'bookFees': '£250.00', 'total': '£11,373.00', 'sDate': '16 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=718&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=5652&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=e959ccd71b62be9211eb1dd3ad5b362c'}
{'title': 'Large Studio (Courtyard)', 'description': '22-23m,2,3/4 bed,Generous studio with same features as "Standard" but slightly larger,Dual Occupancy is available for an additional 20% of the advertised rate per week', 'pricePweek': '£223.00', 'weeks': '49', 'bookFees': '£250.00', 'total': '£10,927.00', 'sDate': '30 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=718&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=6075&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=5f798c129cfe56dead110ed5d80efa75'}
Note that since some cells contain other elements, you need to handle them accordingly. For example, the description cell contains an unordered list, so here I join its items with a comma separator.
Hope this is helpful.
Thanks
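As a follow-up, if you later fold this loop into a Scrapy spider, a minimal sketch might look like the following. The spider name, allowed URL and shown fields are placeholders, and the row selector drops tbody and keeps only rows that contain a pricePweek cell:

import scrapy

class RoomsSpider(scrapy.Spider):
    name = 'rooms'  # placeholder spider name
    start_urls = ['https://example.com/rooms']  # replace with the real page URL

    def parse(self, response):
        # One item per table row that has a pricePweek cell
        for tr in response.xpath('//table//tr[td[@class="pricePweek"]]'):
            yield {
                'title': tr.xpath('./td[@class="title"]/text()').extract_first(default='').strip(),
                'pricePweek': tr.xpath('./td[@class="pricePweek"]//text()').extract_first(default='').strip(),
                'weeks': tr.xpath('./td[@class="weeks"]/text()').extract_first(default='').strip(),
                'bookLink': tr.xpath('./td[@class="bookLink"]/a/@href').extract_first(),
            }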

Related

Dash - Can't get Markdown to appear above Graphs

I am in the finishing stages of my first real Plotly Dash dashboard. I have run into a problem where I can't add dcc.Markdown to any dcc.Graph elements.
It works fine with the dash.DataTable, as shown in the image below.
I am using Python v3.10, Dash v 2.6.2 & Plotly 5.10.
I tried to then use the same methodology to add markdown to the chart next to it, but this throws an error
TypeError: The dash_bootstrap_components.Col component (version 1.2.1) with the ID "Graph(id='sun_burst1', figure={}, style={'height': '45vh'})" detected a Component for a prop other than children.
Prop id has value Graph(id='sun_burst1', figure={}, style={'height': '45vh'})
The DataTable is inside a Row and Col. The code is as follows. I haven't closed it below as it runs on for quite some time
dbc.Row(
    # Dash Data Table
    [dbc.Col(
        [dcc.Markdown('### Top Risks ###'),
         dash_table.DataTable(
             id='table1',
             columns=[
                 {'name': 'Risk ID', 'id': 'risk_id', 'type': 'text', 'editable': False},
Here is my erroneous code. Am I barking up the wrong tree? Does Markdown even work with dcc.Graph?
dbc.Col(
    [dcc.Markdown('### Risk Breakdown ###'),
     dcc.Graph(id='sun_burst1', figure={}, style={'height': '45vh'}),
     width=4, lg={'size': 5, "offset": 0, 'order': 'second'}
    ]),
I really am quite stumped.
My page is made up of 2 rows, top row with 3 columns of width 4, Bottom Row 2 x 6 columns.
The answer was simple in the end. I had not taken into consideration where the 'children' ended.
I had the following which did not work. Note the terminating square bracket after the complete definition.
dbc.Col(
    [dcc.Markdown('#### Breakdown of Risk by Risk Type - 2021 ####'),
     dcc.Graph(id='sun_burst1', figure={}, style={'height': '40vh'})],
    width=4, lg={'size': 5, "offset": 0, 'order': 'second'}
    ]),
The solution is as follows: enclose the children in square brackets and leave the other column definitions outside.
Solution:
dbc.Col(
    [dcc.Markdown('#### Breakdown of Risk by Risk Type - 2021 ####'),
     dcc.Graph(id='sun_burst1', figure={}, style={'height': '40vh'})],
    width=4, lg={'size': 5, "offset": 0, 'order': 'second'}
),
With this in place, I can now mark up my charts as I wish.
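For reference, here is a minimal stand-alone sketch of that row with both columns closed correctly. The single 'Risk ID' column, the empty table data, and the first column's width are placeholders; the rest mirrors the snippets above:

import dash
from dash import dcc, dash_table
import dash_bootstrap_components as dbc

app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])

app.layout = dbc.Row(
    [dbc.Col(
        [dcc.Markdown('### Top Risks ###'),
         dash_table.DataTable(
             id='table1',
             columns=[{'name': 'Risk ID', 'id': 'risk_id', 'type': 'text', 'editable': False}],
             data=[])],           # placeholder data
        width=6),                 # placeholder width
     dbc.Col(
        [dcc.Markdown('#### Breakdown of Risk by Risk Type - 2021 ####'),
         dcc.Graph(id='sun_burst1', figure={}, style={'height': '40vh'})],
        width=4, lg={'size': 5, 'offset': 0, 'order': 'second'})])

if __name__ == '__main__':
    app.run_server(debug=True)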
A big thank you to jinnynor at the Plotly community for provoking my thoughts.

Extract academic publication information from IDEAS

I want to extract the list of publications from a specific IDEAS page. I want to retrieve the name of each paper, its authors, and the year. However, I am a bit stuck in doing so. Inspecting the page, all the information is inside the div class="tab-pane fade show active" [...]; each h3 holds the year of publication, while inside each li class="list-group-item downfree" [...] we can find each paper with its authors (as shown in this image). In the end, what I want to obtain is a dataframe containing three columns: title, author, and year.
Nonetheless, while I am able to retrieve each paper's name, I get confused when I also try to add the year and author(s). What I have written so far is the following short code:
from requests import get
url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
containers = soup.findAll("div", {'class': 'tab-pane fade show active'})
title_list = []
year_list = []
for container in containers:
    year = container.findAll('h3')
    year_list.append(int(year[0].text))
    title_containers = container.findAll("li", {'class': 'list-group-item downfree'})
    title = title_containers[0].a.text
    title_list.append(title)
What I get are two lists of only one element each, because containers has a size of 1. As for how to retrieve the author names, I have no idea; I have tried several ways without success. I think I have to split the titles using 'by' as a separator.
I hope someone can help me or redirect me to another discussion that covers a similar situation. Thank you in advance, and apologies for my (probably) silly question; I am still a beginner in web scraping with BeautifulSoup.
You can get the desired information like this:
from requests import get
import pprint
from bs4 import BeautifulSoup
url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
container = soup.select_one("#content")
title_list = []
author_list = []
year_list = [int(h.text) for h in container.find_all('h3')]
for panel in container.select("div.panel-body"):
    title_list.append([x.text for x in panel.find_all('a')])
    author_list.append([x.next_sibling.strip() for x in panel.find_all('i')])
result = list(zip(year_list, title_list, author_list))
pp = pprint.PrettyPrinter(indent=4, width=250)
pp.pprint(result)
outputs:
[ ( 2020,
['The Role Of Public Procurement As Innovation Lever: Evidence From Italian Manufacturing Firms', 'A voyage in the role of territory: are territories capable of instilling their peculiarities in local production systems'],
['Francesco Crespi & Serenella Caravella', 'Cristina Vaquero-Piñeiro']),
( 2019,
[ 'Probability Forecasts and Prediction Markets',
'R&D Financing And Growth',
'Mission-Oriented Innovation Policies: A Theoretical And Empirical Assessment For The Us Economy',
'Public Investment Fiscal Multipliers: An Empirical Assessment For European Countries',
'Consumption Smoothing Channels Within And Between Households',
'A critical analysis of the secular stagnation theory',
'Further evidence of the relationship between social transfers and income inequality in OECD countries',
'Capital accumulation and corporate portfolio choice between liquidity holdings and financialisation'],
[ 'Julia Mortera & A. Philip Dawid',
'Luca Spinesi & Mario Tirelli',
'Matteo Deleidi & Mariana Mazzucato',
'Enrico Sergio Levrero & Matteo Deleidi & Francesca Iafrate',
'Simone Tedeschi & Luigi Ventura & Pierfederico Asdrubal',
'Stefano Di Bucchianico',
"Giorgio D'Agostino & Luca Pieroni & Margherita Scarlato",
'Giovanni Scarano']),
( 2018, ...
I got the years using a list comprehension. I got the titles and authors by appending a list to title_list and author_list for the required elements in each div element with the class panel-body, again using list comprehensions, and by using next_sibling on each i element to get the authors. Then I zipped the three lists and cast the result to a list. Finally, I pretty-printed the result.
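If you then want the flat three-column dataframe described in the question (one row per paper), here is a small sketch building on the result list above; it assumes pandas is installed and that the titles and authors in each panel line up one-to-one:

import pandas as pd

rows = []
for year, titles, authors in result:
    # Pair each title with its author string within the same year's panel
    for title, author in zip(titles, authors):
        rows.append({'title': title, 'author': author, 'year': year})

df = pd.DataFrame(rows, columns=['title', 'author', 'year'])
print(df.head())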

Scrape table with no ids or classes using only standard libraries?

I want to scrape two pieces of data from a website:
https://www.moneymetals.com/precious-metals-charts/gold-price
Specifically I want the "Gold Price per Ounce" and the "Spot Change" percent two columns to the right of it.
Using only Python standard libraries, is this possible? A lot of tutorials use the HTML element id to scrape effectively but inspecting the source for this page, it's just a table. Specifically I want the second and fourth <td> which appear on the page.
It's possible to do it with standard Python libraries; it's ugly, but possible:
import urllib.request
from html.parser import HTMLParser

URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'
page = urllib.request.Request(URL)
result = urllib.request.urlopen(page)
resulttext = result.read()

class MyHTMLParser(HTMLParser):
    gold = []
    def handle_data(self, data):
        self.gold.append(data)

parser = MyHTMLParser()
parser.feed(str(resulttext))

for i in parser.gold:
    if 'Gold Price per Ounce' in i:
        target = parser.gold.index(i)  # get the index location of the heading
        print(parser.gold[target+2])   # your target items are 2, 5 and 9 positions down in the list
        print(parser.gold[target+5].replace('\\n', ''))
        print(parser.gold[target+9].replace('\\n', ''))
Output (as of the time the url was loaded):
$1,566.70
8.65
0.55%
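A slightly less position-dependent variant, still standard library only, is to collect just the text of every <td> cell and search near the 'Gold Price per Ounce' label. This is only a sketch: the +1 and +3 offsets are assumptions about the current table layout and may need adjusting if the page changes.

import urllib.request
from html.parser import HTMLParser

URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'

class CellParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        # Keep only non-empty text that sits inside a table cell
        if self.in_td and data.strip():
            self.cells.append(data.strip())

html = urllib.request.urlopen(URL).read().decode('utf-8')
parser = CellParser()
parser.feed(html)

# Locate the label cell, then read the neighbouring cells relative to it
idx = next(i for i, cell in enumerate(parser.cells) if 'Gold Price per Ounce' in cell)
print(parser.cells[idx + 1])  # assumed position: gold price per ounce
print(parser.cells[idx + 3])  # assumed position: spot change percent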

PDFlib - "leading" option of create_textflow

I'm trying to figure out how to add line spacing without adding spacing above the very first line of textflow.
This code:
$text = 'For more information about the Giant Wing Paper Plane see ' .
'our Web site <underline=true>www.kraxi-systems.com' .
'the Giant Wing in a thunderstorm as soon as possible.';
$optlist = 'fontname=Helvetica fontsize=12 encoding=unicode leading=400%';
$tf = $p->create_textflow($text, $optlist);
$result = $p->fit_textflow($tf, 28.346, 28.346, 400, 700, 'fitmethod=nofit');
$p->delete_textflow($tf);
results in output that looks as expected; all is good.
Next, I'm increasing the leading option to 400% as:
$optlist = 'fontname=Helvetica fontsize=12 encoding=unicode leading=400%';
And that adds the extra spacing above the very first line of the textflow as well.
Question:
How do I keep first paragraph line at the original position and only increase line spacing AFTER it?
Check out the "firstlinedist" option. The default is leading, but you might set this to "ascender" or "capheight" or any other value.
Please see the PDFlib 9.2 API Reference, chapter 5.2, table 5.12 for more details.
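For illustration, one possible way to apply it, keeping the PHP code from the question: firstlinedist is a fitting option of fit_textflow(), and ascender is just one of the values mentioned above.

$result = $p->fit_textflow($tf, 28.346, 28.346, 400, 700, 'fitmethod=nofit firstlinedist=ascender');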

python color entire pandas dataframe rows based on column values

I have a script that downloads a .csv, does some manipulation, and then emails pandas DataFrames in a nice HTML format using df.to_html.
I would like to enhance these tables by highlighting, or coloring, different rows based on their text value in a specific column.
I tried using the pandas Styler, which appears to work; however, I cannot convert that to HTML using to_html. I get "AttributeError: 'str' object has no attribute 'to_html'".
Is there another way to do this?
As an example, let's say my DF looks like the following and I want to highlight all rows for each manufacturer, i.e. use three different colors for Ford, Chevy, and Dodge:
Year Color Manufacturer
2011 Red Ford
2010 Yellow Ford
2000 Blue Chevy
1983 Orange Dodge
I noticed I can pass formatters into to_html, but it appears they cannot do what I am trying to accomplish with coloring. I would like to be able to do something like:
def colorred():
    return ['background-color: red']

def color_row(value):
    if value == "Ford":
        result = colorred()
    return result

df1.to_html("test.html", escape=False, formatters={"Manufacturer": color_row})
Surprised this has never been answered, as looking back at it I do not believe this is even possible with to_html formatters. After revisiting this several times, I have found a very nice solution that I am happy with. I have not seen anything close to this online, so I hope it helps someone else.
import pandas as pd

d = {'Year': [2011, 2010, 2000, 1983],
     'Color': ['Red', 'Yellow', 'Blue', 'Orange'],
     'Manufacturer': ['Ford', 'Ford', 'Chevy', 'Dodge']}
df = pd.DataFrame(d)
print(df)

def color_rows(s):
    df = s.copy()
    # Key:Value dictionary of Manufacturer:Color
    color_map = {}
    # Unique column values
    manufacturers = df['Manufacturer'].unique()
    colors_to_use = ['background-color: #ABB2B9', 'background-color: #EDBB99',
                     'background-color: #ABEBC6', 'background-color: #AED6F1']
    # Loop over our column values and associate one color to each
    for manufacturer in manufacturers:
        color_map[manufacturer] = colors_to_use[0]
        colors_to_use.pop(0)
    for index, row in df.iterrows():
        if row['Manufacturer'] in manufacturers:
            manufacturer = row['Manufacturer']
            # Get the color to use based on this row's Manufacturer value
            my_color = color_map[manufacturer]
            # Update the row using loc
            df.loc[index, :] = my_color
        else:
            df.loc[index, :] = 'background-color: '
    return df

df.style.apply(color_rows, axis=None)
Output:
Pandas row coloring
Since I do not have the cred to embed images, here is how I email it: I convert it to HTML with the following.
styled = df.style.apply(color_rows, axis=None).set_table_styles(
    [{'selector': '.row_heading',
      'props': [('display', 'none')]},
     {'selector': '.blank.level0',
      'props': [('display', 'none')]}])
html = styled.render()
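One small caveat: on newer pandas releases Styler.render() is deprecated in favor of Styler.to_html(), so the last line would become (a sketch assuming pandas 1.3 or later):

html = styled.to_html()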