Scrapy output multiple item elements with xpath to single csv? - csv

I am attempting to scrape items from a page containing various HTML elements and a series of nested tables.
I have some code that successfully scrapes table X where class="ClassA" and outputs the table elements into a series of item fields, such as company address, phone number, website address, etc.
I would like to add some extra fields to this output, but the other values to be scraped aren't located within the same table, and some aren't located in a table at all, e.g. an <h1> tag in another part of the page.
How can I add these other values to my output using XPath filters, and have them appear in the same array / output structure? I noticed that if I scrape extra values from another table (even when that table has the exact same class name and ID), the values end up on different lines in the CSV, so the CSV structure is not kept intact.
I'm sure there must be a way for items to remain unified in the CSV output even if they are scraped from slightly different areas of a page. Hopefully it's just a simple fix...
----- HTML EXAMPLE PAGE BEING SCRAPED -----
<html>
<head></head>
<body>
<!-- huge amount of other HTML and tables NOT to be scraped -->
<h2>HEADING TO BE SCRAPED - Company Name</h2>
<p>Company Description</p>
<table cellspacing="0" class="contenttable company-details">
<tr>
<th>Item Code</th>
<td>IT123</td>
</tr>
<tr>
<th>Listing Date</th>
<td>12 September, 2011</td>
</tr>
<tr>
<th>Internet Address</th>
<td class="altrow">http://www.website.com/</td>
</tr>
<tr>
<th>Office Address</th>
<td>123 Example Street</td>
</tr>
<tr>
<th>Office Telephone</th>
<td>(01) 1234 5678</td>
</tr>
</table>
<table cellspacing="0" class="contenttable" id="staff">
<tr><th>Management Names</th></tr>
<tr>
<td>
Mr John Citizen (CEO)<br/>Mrs Mary Doe (Director)<br/>Dr J. Watson (Manager)<br/>
</td>
</tr>
</table>
<table cellspacing="0" class="contenttable company-details">
<tr>
<th>Contact Person</th>
<td>
Mr John Citizen<br/>
</td>
</tr>
<tr>
<th class=principal>Company Mission</th>
<td>ACME Corp is a retail sales company.</td>
</tr>
</table>
</body>
</html>
---- SCRAPY CODE EXAMPLE ----
from scrapy.spider import Spider
from scrapy.selector import Selector
from my.items import MyItem


class MySpider(Spider):
    name = "my"
    allowed_domains = ["website.com"]
    start_urls = ["http://www.website.com/ABC"]

    def parse(self, response):
        sel = Selector(response)
        # Matches every table with this exact class attribute
        sites = sel.xpath('//table[@class="contenttable company-details"]')
        for site in sites:
            # A new item is created for each matching table
            item = MyItem()
            item['Company_name'] = site.xpath('.//h1//text()').extract()
            item['Item_Code'] = site.xpath('.//th[text()="Item Code"]/following-sibling::td//text()').extract()
            item['Listing_Date'] = site.xpath('.//th[text()="Listing Date"]/following-sibling::td//text()').extract()
            item['Website_URL'] = site.xpath('.//th[text()="Internet Address"]/following-sibling::td//text()').extract()
            item['Office_Address'] = site.xpath('.//th[text()="Office Address"]/following-sibling::td//text()').extract()
            item['Office_Phone'] = site.xpath('.//th[text()="Office Telephone"]/following-sibling::td//text()').extract()
            # Note: this XPath is absolute (no leading dot)
            item['Company_Mission'] = site.xpath('//th[text()="Company Mission"]/following-sibling::td//text()').extract()
            yield item
Outputting to CSV
scrapy crawl my -o items.csv -t csv
With the example code above, the Company_Mission field appears on a different CSV line from the other fields (I'm guessing because it's in a different table), even though its table has the same class name and ID. Additionally, I'm unsure how to scrape the <h1> field, since it falls outside the table structure matched by my current sites XPath filter.
I could expand the sites XPath filter to include more content, but won't that be less efficient and defeat the point of filtering altogether?
Here's an example of the debug log, where you can see that Company_Mission is processed twice for some reason and is empty the first time, which must be why it ends up on a new line in the CSV. But why?
{'Item_Code': [u'ABC'],
'Listing_Date': [u'1 January, 2000'],
'Office_Address': [u'Level 1, Some Street, SYDNEY, NSW, AUSTRALIA, 2000'],
'Office_Fax': [u'(02) 1234 5678'],
'Office_Phone': [u'(02) 1234 5678'],
'Company_Mission': [],
'Website_URL': [u'http://www.company.com']}
2014-02-06 16:32:13+1000 [my] DEBUG: Scraped from <200 http://www.website.com/Code=ABC>
{'Item_Code': [],
'Listing_Date': [],
'Office_Address': [],
'Office_Fax': [],
'Office_Phone': [],
'Company_Mission': [u'The comapany is involved in retail, food and beverage, wholesale services.'],
'Website_URL': []}
The other thing I am completely baffled about is why the fields are written to the CSV in a completely different order from both the HTML page and the order I have defined in the spider's config file. Does Scrapy run completely asynchronously, returning items in whatever order it pleases?

I understand you want to scrape one item from this page, but //table[@class="contenttable company-details"] matches two table elements in your HTML content, so the for site in sites: loop runs twice and creates two items.
For each table, relative XPath expressions such as .//th[text()="Item Code"] are applied within the current table. Absolute XPath expressions, such as //th[text()="Company Mission"], look for elements starting from the root of your HTML document.
Your sample output shows a non-empty "Company_Mission" only once, while you say it appears twice. Because you're using an absolute XPath expression for it, it should indeed have appeared in both items, so I'm not sure the output matches the spider code currently in the question.
So, first iteration of the loop,
<table cellspacing="0" class="contenttable company-details">
<tr>
<th>Item Code</th>
<td>IT123</td>
</tr>
<tr>
<th>Listing Date</th>
<td>12 September, 2011</td>
</tr>
<tr>
<th>Internet Address</th>
<td class="altrow">http://www.website.com/</td>
</tr>
<tr>
<th>Office Address</th>
<td>123 Example Street</td>
</tr>
<tr>
<th>Office Telephone</th>
<td>(01) 1234 5678</td>
</tr>
</table>
in which you can scrape:
Item Code
Listing Date
Internet Address --> Website URL
Office Address
Office Telephone
and because you're using an absolute XPath expression, //th[text()="Company Mission"]/following-sibling::td//text() will look anywhere in the document, not only in this first <table cellspacing="0" class="contenttable company-details">.
These extracted fields go into an item of their own.
Then comes the 2nd table matching your XPath for sites:
<table cellspacing="0" class="contenttable company-details">
<tr>
<th>Contact Person</th>
<td>
Mr John Citizen<br/>
</td>
</tr>
<tr>
<th class=principal>Company Mission</th>
<td>ACME Corp is a retail sales company.</td>
</tr>
</table>
for which a new MyItem() is instantiated. Here no XPath expression matches except the absolute XPath for "Company Mission", so at the end of this loop iteration you've got an item containing only "Company Mission".
If you're sure you expect exactly one item from this page, you can use longer XPaths such as //table[@class="contenttable company-details"]//th[text()="Item Code"]/following-sibling::td//text() for each field you want, so that each one matches in either the first or the second table,
and use a single MyItem() instance.
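The single-item idea can be sketched without Scrapy. The following is only an illustration that assumes a tidied, well-formed copy of the sample page and uses the standard library's ElementTree instead of Scrapy selectors; scrape_single_item and the field keys are hypothetical names. It visits both matching tables but keeps writing into the same item:

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample page (assumption: real pages
# are rarely this clean and would need an HTML-tolerant parser).
PAGE = """
<html><body>
<h2>ACME Corp</h2>
<table class="contenttable company-details">
  <tr><th>Item Code</th><td>IT123</td></tr>
  <tr><th>Office Telephone</th><td>(01) 1234 5678</td></tr>
</table>
<table class="contenttable" id="staff">
  <tr><th>Management Names</th></tr>
</table>
<table class="contenttable company-details">
  <tr><th>Company Mission</th><td>ACME Corp is a retail sales company.</td></tr>
</table>
</body></html>
"""


def scrape_single_item(page):
    """Collect fields from every matching table into ONE dictionary."""
    root = ET.fromstring(page)
    # The heading lives outside the tables, so read it from the page root.
    item = {"Company_name": root.findtext(".//h2")}
    for table in root.iterfind(".//table[@class='contenttable company-details']"):
        for row in table.iterfind("tr"):
            header, cell = row.find("th"), row.find("td")
            if header is not None and cell is not None:
                item[header.text.strip()] = "".join(cell.itertext()).strip()
    return item


item = scrape_single_item(PAGE)
print(item)
```

In a spider, the equivalent would be to create one MyItem() before any per-table work and yield it once at the end of parse.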
Also, you can try CSS selectors that would be shorter to read and write and easier to maintain:
"Company_name" <-- sel.css('h2::text')
"Item_Code" <-- sel.css('table.company-details th:contains("Item Code") + td::text')
"Listing_Date" <-- sel.css('table.company-details th:contains("Listing Date") + td::text')
etc.
Note that :contains() is available in Scrapy via cssselect underneath, but it is non-standard (it was removed from the CSS specs, but is handy). The ::text pseudo-element selector is also non-standard: it is a Scrapy extension, and is also handy.

"Guessing because it's in a different table": wrong guess. There is no correlation between tables and items; it does not matter where the data comes from, as long as you set it on the item's fields.
That means you can take Company_name and Company_Mission from wherever you want.
Having said that, check what //th[text()="Company Mission"] returns and how many times it appears on the page. While the other fields' XPaths are relative (they start with a .), this one is absolute (it starts with //), so it may return a list of matches rather than just one.
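On the column-order part of the question, which neither answer covers: an item behaves like a dictionary, so the order in which fields were assigned is not necessarily the order in which they are exported. A minimal sketch of pinning the order, assuming you write the CSV yourself with the standard library's csv module (FIELDS and item_to_csv are hypothetical names, and the item is a plain dict standing in for MyItem). Scrapy also has a FEED_EXPORT_FIELDS setting for fixing the column order of feed exports; check the documentation for your version:

```python
import csv
import io

# Desired column order (hypothetical; mirrors the fields in the question).
FIELDS = ["Company_name", "Item_Code", "Listing_Date", "Website_URL",
          "Office_Address", "Office_Phone", "Company_Mission"]


def item_to_csv(item, fields=FIELDS):
    """Write one item as a header line plus a single CSV row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    # .extract() returns lists, so join each list into one cell;
    # fields that were never scraped become empty cells.
    writer.writerow({name: " ".join(item.get(name, [])) for name in fields})
    return buffer.getvalue()


item = {
    "Company_Mission": ["ACME Corp is a retail sales company."],
    "Item_Code": ["IT123"],
    "Company_name": ["ACME Corp"],
}
output = item_to_csv(item)
print(output)
```

The columns follow FIELDS regardless of the order in which the item's keys were set.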

Related

laravel dompdf not rendering complex html with rowspan and colspan correctly

I have a very complex dynamic table that I need to output to pdf in laravel 5.6. The project I inherited had Dompdf installed and is already rendering all other content. Therefore, I use it as well for compatibility.
My issue is I have a table to render consisting of 13 columns and undefined number of rows, where intermittently a column may span 13 columns for a heading or a row may span several rows at any given time or a colspan within the rowspan that spans 11 columns from the 3rd row. No html is hardcoded except the <table>, <thead>, <th> and <tbody> tags. The html within the tbody tag is dynamically generated depending on the array data.
Everything looks great in the browser, and when I view() the PDF blade and print it with Ctrl+P it creates a nice PDF, although for some reason rowspan cells that span onto the next page do not carry over their markup and content. As soon as I try to stream() the PDF, the table becomes warped and looks like a toppled building built by Picasso.
Here are links to the PDFs; the one printed with Ctrl+P lost its colour because I removed the names.
File to view pdf printed with ctrl + p
Pdf streamed with Dompdf
Image of viewing pdf in browser
Image of pdf when streaming via Dompdf:
Html sample rendered in browser:
<tr style="background-color: #5b8969;">
<td rowspan="2" style="background-color: #F8C293; color: black;">Spray 4</td>
<td>Pollinate</td>
<td>7-10 days later</td>
<td>BENOMYL WP 25KG </td>
<td>benomyl 500g/kg</td>
<td> </td>
<td>1000</td>
<td>2.00</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Full bloom</td>
<td>Black Spot</td>
<td>WETCIT DUO 20L </td>
<td>borax 10g/orange oil 50g/l</td>
<td> </td>
<td>1000</td>
<td>25.00</td>
<td>100.0000</td>
<td>120.0000L</td>
<td>2500.0000</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="13" style="background-color: #9fb5d3;" class="h3 font-weight-bold">ANOTHER ONE</td>
</tr>
<tr>
<td rowspan="7" style="background-color: #F8C293; color: black;">Spray 7</td>
<td>20 cm</td>
<td>African Armyworm</td>
<td>CERATO 250 EC 5L </td>
<td>pyraclostrobin 250g/l</td>
<td> </td>
<td>1000</td>
<td>2.00</td>
<td>10.0000</td>
<td></td>
<td>20.0000</td>
<td></td>
<td></td>
</tr>
Can someone please help and give me a clue on how to output such a complex table with Dompdf? As I would really want to keep on using only one PDF rendering library in this project.
Otherwise I am open to suggestions to use another pdf library that can handle rowspan that span pages and this complex layout?
Update
Based on a comment by Don't panic, which he subsequently deleted (he suggested validating the HTML and filling empty td tags with &nbsp;).
I rewrote the HTML as a template in my pdf.blade.php view. Now I only output the values in a loop in my view. Firstly, it becomes easier to maintain and lets me skip the validation he suggested. I also filled every empty <td> tag with a hardcoded &nbsp;. This makes it easier to see why certain rows end where they should and others do not. The result is sadly still the same: a warped table. But it does seem to be a rowspan issue, not a colspan one. The rowspan rows stack one after another, so maybe a td or tr is missing.
Solved rowspan stacking issue
After two weeks of testing, the only problem was that certain rows' opening tags were not being output, which led to rows not knowing where to begin. Now the only problem left is rowspan across pages.
Update on update
So I have really tried everything I can to get Dompdf to do what it is supposed to do, which is render PDFs. I have read a bit more and found that this library has a long-standing issue of not being able to render a rowspan across pages. Therefore, it's on to the next rendering library, wkhtmltopdf, or I could logically divide rowspans to stop at the end of a page and start again on the new page. I will have to check my watch on this one.
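The "divide rowspans at the end of a page" idea could look roughly like this. It is a sketch only, written in Python rather than PHP, and it assumes every row has the same height so that a fixed number of rows fits on each page; split_rowspans, groups and rows_per_page are hypothetical names, and full-width heading rows are ignored:

```python
def split_rowspans(groups, rows_per_page):
    """Split (label, row_count) groups so no rowspan crosses a page boundary.

    Returns a list of (label, rowspan) chunks in output order.
    """
    if rows_per_page < 1:
        raise ValueError("rows_per_page must be at least 1")
    chunks = []
    used = 0  # rows already placed on the current page
    for label, count in groups:
        remaining = count
        while remaining > 0:
            free = rows_per_page - used
            take = min(remaining, free)
            chunks.append((label, take))
            remaining -= take
            # Wrap to 0 when the page is exactly full.
            used = (used + take) % rows_per_page
    return chunks


chunks = split_rowspans([("Spray 4", 2), ("Spray 7", 7)], rows_per_page=5)
print(chunks)
```

Each chunk would then be rendered as its own rowspan cell, with the label repeated at the top of the new page.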

Excel VBA web scraping from table

I am trying to extract some info from the table below into Excel using VBA, without any success. The values I need do not seem to have any element ID, tag name or class name assigned to them. I'm after the Fuel Usage value (89218) and the time value in the same row (01:15). Can anyone point me in the right direction on how to scrape values from a table, or how to extract data from a specific TR or TD?
HTML source of the table:
<h3>Airbus A300-600-PW4158 Fuel Planner</h3>
<p>London to Chicago EGKK-KORD (3441 NM)<br /></p>
<h2>Total Fuel: 101901 POUNDS</h2>
<table width="100%" border=1>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:left;">Fuel</td>
<th style="text-align:left;">Time</th>
</tr>
<tr>
<td>Fuel Usage</td>
<td>89218</td>
<td>08:47</td>
</tr>
<tr>
<td>Reserve Fuel</td>
<td>12682</td>
<td>01:15</td>
</tr>
<tr>
<td>Fuel on Board</td>
<td>101901</td>
<td>10:02</td>
</tr>
</table>
much appreciated.
CSS Selectors:
Without seeing more of the HTML, you can use the following CSS selectors for the snippet shown:
tr td:nth-child(2)
tr td:nth-child(3)
These selectors return nodeLists of every td that is the second or third child of a tr.
You can access individual items from a nodeList by index.
VBA:
The syntax in VBA overall will be something like:
.document.querySelectorAll("tr td:nth-child(2)")(0).innerText
or possibly
.document.querySelectorAll("tr td:nth-child(2)").Item(0).innerText
The 0 is hypothetical. You would need to inspect your full HTML to ascertain the correct index to use.
The .document can be obtained by using IE to navigate to the page, for example, or its innerHTML can be populated from the .responseText of a request.
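To show what those two selectors pick out, here is a small sketch in Python rather than VBA. It assumes a tidied, well-formed copy of the table and uses the standard library's ElementTree, where tr/td[2] plays the role of tr td:nth-child(2) because every row holds only td cells; by_label is a hypothetical helper mapping:

```python
import xml.etree.ElementTree as ET

# Tidied copy of the table from the question (attributes dropped).
TABLE = """<table>
<tr><th></th><th>Fuel</th><th>Time</th></tr>
<tr><td>Fuel Usage</td><td>89218</td><td>08:47</td></tr>
<tr><td>Reserve Fuel</td><td>12682</td><td>01:15</td></tr>
<tr><td>Fuel on Board</td><td>101901</td><td>10:02</td></tr>
</table>"""

table = ET.fromstring(TABLE)

# Second and third cell of every data row, in document order.
fuel_cells = [td.text for td in table.findall("tr/td[2]")]
time_cells = [td.text for td in table.findall("tr/td[3]")]

# Looking rows up by their label avoids guessing an index.
by_label = {}
for row in table.findall("tr"):
    cells = row.findall("td")
    if len(cells) == 3:
        by_label[cells[0].text] = (cells[1].text, cells[2].text)

print(fuel_cells, time_cells, by_label["Fuel Usage"])
```

The header row contributes nothing here because it contains th cells only, which is also why index 0 of each list is the Fuel Usage row.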

How to match nearest tag backward with XPath

I have a HTML like this:
html =<<EOS
<table><!-- outer table -->
<tr><td>
<table><!-- inner table 1 -->
<tr><td>Foo</td></tr>
</table>
<table><!-- inner table 2 -->
<tr><td>Bar</td></tr>
</table>
</td></tr>
</table>
EOS
I want to get a changing value Bar from a static value Foo.
With this code I can get the value.
require 'nokogiri'
doc = Nokogiri::HTML(html)
doc.xpath("//table[tr/td[text()='Foo']]/following-sibling::table//td").text
And I wanted to rewrite like this:
doc.xpath("//table[//td[text()='Foo']]/following-sibling::table//td").text
But this code doesn't work, because //table[//td[text()='Foo']] matches the outer table, not the inner table.
Is there an expression for a nearest backward match in XPath, like this?
//table[(nearest match expression)td[text()='Foo']]
Yes, //table[//td[text()='Foo']] gives the outer table as the first result (not the only result) , but //table[//td[text()='Foo']]/following-sibling::table//td still retrieves <td>Bar</td>.
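The problematic part of //table[//td[text()='Foo']] is the // in front of td. A path that starts with // is evaluated from the document root, so it selects td elements at any depth, including those inside nested tables, rather than only the cells of the table being tested: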
<table>
<tr>
<td>This is selected</td>
<td>
<table>
<tr>
<td>This is also selected</td>
</tr>
</table>
</td>
</tr>
</table>
You should use // only sparingly. I would use the expression
//table[tr/td = 'Foo']/following-sibling::table[1]/tr/td
EDIT: As suggested by Phrogz, in Nokogiri, instead of [1] in the expression above, you can use at_xpath as in
doc.at_xpath("//table[tr/td = 'Foo']/following-sibling::table/tr/td").text
to only get the first result node that was found. That is, if you actually intend to only find one node and if the wanted node is the first one in document order.
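For comparison, here is the same lookup spelled out step by step. This is a sketch in Python rather than Ruby, using the standard library's ElementTree, which has no following-sibling axis, so the sibling walk is done by hand; value_after is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

HTML = """<table>
<tr><td>
<table><tr><td>Foo</td></tr></table>
<table><tr><td>Bar</td></tr></table>
</td></tr>
</table>"""


def value_after(root, label):
    """Mimic //table[tr/td = label]/following-sibling::table[1]/tr/td."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag != "table":
                continue
            # tr/td only looks at the table's own rows and cells,
            # never at cells of tables nested further down.
            if any(td.text == label for td in child.findall("tr/td")):
                for sibling in children[index + 1:]:
                    if sibling.tag == "table":
                        return sibling.findtext("tr/td")
    return None


root = ET.fromstring(HTML)
print(value_after(root, "Foo"))
```

Because the test uses the child path tr/td, the outer table is never mistaken for the one that holds Foo.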

MediaWiki table (infobox) goes to bottom-right of the page

I'm trying to add an infobox to a page on my MediaWiki site.
I have this code in my common.css:
.infobox {
border:1px solid #aaaaaa;
background-color:#f9f9f9;
padding:5px;
font-size: 95%;
border-collapse: collapse;
margin-left: 20px;
}
and I have this code in Template:Infobox
<table class="infobox" align="right" bgcolor="#E1E1E1" style="width:20em; font-size:90%; text-align:left; border: 1px green solid;">
<caption style="text-align:center; font-size:140%;"><i><b>{{{name}}}</b></i></caption>
<tr>
<td colspan="2" style="text-align:center;" bgcolor="#E1E1E1">{{{image}}}</td>
</tr>
<tr>
<td colspan="2" bgcolor="#E1E1E1" style="text-align:center;">{{{imagecaption}}}</td>
<tr>
<th>Author</th>
<td>{{{author}}}</td>
</tr>
<tr>
<th>Publisher</th>
<td>{{{publisher}}}</td>
</tr>
<tr>
<th>Publication date</th>
<td>{{{publication}}}</td>
</tr>
<tr>
<th>Illustrator</th>
<td>{{{illustrator}}}</td>
</tr>
<tr>
<th>Genre</th>
<td>{{{genre}}}</td>
</tr>
<tr>
<th>Pages</th>
<td>{{{pages}}}</td>
</tr>
<tr>
<th>ISBN</th>
<td>{{{isbn}}}</td>
</tr>
And lastly this is the code that I inserted into my MediaWiki page:
{{Infobox
| name = The Hitchhiker's Guide to the Galaxy
| image = [[Image:Hhgttg.jpg|150px]]
| image_caption = Movie Poster
| author = Douglas Adams
| country = United Kingdom
| language = English
| series = The Hitchhiker's Guide to the Galaxy
| genre = Science Fiction
| publisher = Pan Books
| release_date = 1979
| media_type = Paperback and hardcover
| pages = 180
| isbn = ISBN 0-330-25864-8
| followed_by = The Restaurant at the End of the Universe
}}
The problem I'm having is that the infobox aligns itself at the bottom-right on my MediaWiki page. I would much rather make it appear on the top-right, like in this page on Wikipedia: http://en.wikipedia.org/wiki/Bill_Gates
What can I add to my code to make this possible?
Wow, that's a weird one, but I think I've managed to figure out the reason:
You forgot to close the <table> inside the template.
Because of that, when the template is expanded at the beginning of the page, all the rest of the page content will end up being placed inside the unclosed HTML table. But, since you did remember to close the <td> and <tr> tags, it gets included outside them.
After parsing a page, MediaWiki runs HTML Tidy on it. When Tidy sees your unclosed table, it does two things to it:
It adds the missing </table> tag to the end of the page.
It sees that there's some content inside the table, but outside the table cells, where no content is supposed to be. It pulls that content out of the table, and places it before the table instead.
The end result is that everything on the page following the unclosed table ends up getting moved up to precede it in the final, tidied HTML. Weird, indeed.
Arguably, HTML Tidy should be smarter about this situation: if it sees an unclosed table with non-table markup following the last table cell, it should close the table there instead of assuming that the following content is also part of the table. It might be worth reporting this as a bug / feature request for Tidy. That said, there are plenty of other ways in which unclosed or otherwise malformed HTML markup in a template could mess up your page, so fixing this one specific case might not make much difference in the grand scheme of things.
And before you ask, yes, the ability to have unmatched HTML in a MediaWiki template is considered a feature, since it lets you do things like:
{{begin quote}}
This is some text wrapped in a fancy quote box by the surrounding templates.
{{end quote}}
Anyway, the fix to this particular problem is simple: just add the missing </table> to your template.
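If you want to catch this kind of slip before saving a template, a rough tag-balance check can help. The following is only a sketch using Python's standard html.parser; TagBalance and unclosed_tags are hypothetical names, and void elements written without a slash (such as <br>) would be reported as unclosed:

```python
from html.parser import HTMLParser


class TagBalance(HTMLParser):
    """Count opening tags minus closing tags for each tag name."""

    def __init__(self):
        super().__init__()
        self.open_counts = {}

    def handle_starttag(self, tag, attrs):
        self.open_counts[tag] = self.open_counts.get(tag, 0) + 1

    def handle_endtag(self, tag):
        self.open_counts[tag] = self.open_counts.get(tag, 0) - 1


def unclosed_tags(markup):
    """Return {tag: count} for tags opened more often than closed."""
    parser = TagBalance()
    parser.feed(markup)
    parser.close()
    return {tag: n for tag, n in parser.open_counts.items() if n > 0}


# Shortened version of the template from the question: no </table>.
TEMPLATE = '<table class="infobox"><tr><th>Author</th><td>{{{author}}}</td></tr>'
result = unclosed_tags(TEMPLATE)
print(result)
```

Templates that leave tags open on purpose, like the begin/end quote pair above, would of course be flagged too.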

Only parsing outer element

I am writing a scraper with Nokogiri, and I want to scrape a large HTML file.
Currently, I am scraping a large table; here is a small fragment:
<table id="rptBidTypes__ctl0_dgResults">
<tr>
<td align="left">S24327</td>
<td>
Airfield Lighting
<div>
<div>
<table cellpadding="5px" border="2" cellspacing="1px" width="100%" bgcolor=
"black">
<tr>
<td bgcolor="white">Abstract:<br />
This project is for the purchase and delivery, of various airfield
lighting, for a period of 36 months, with two optional 1 year renewals,
in accordance with the specifications, terms and conditions specified in
the solicitation.</td>
</tr>
</table>
</div>
</div>
</td>
</tr>
</table>
And here is the Ruby code I am using to scrape:
document = doc.search("table#rptBidTypes__ctl0_dgResults tr")
document[1..-1].each do |v|
  cells = v.search 'td'
  if cells.inner_html.length > 0
    data = {
      number: cells[0].text,
    }
  end
  ScraperWiki::save_sqlite(['number'], data)
end
Unfortunately this isn't working for me. I only want to extract S24327, but I am getting the content of every table cell. How do I only extract the content of the first td?
Keep in mind that under this table, there are many table rows following the same format.
In CSS, table tr means tr anywhere underneath the table, including nested tables. But table > tr means the tr must be a direct child of the table.
Also, it appears you only want the cell values, so you don't need to iterate. This will give you all such cells (the first in each row):
doc.search("table#rptBidTypes__ctl0_dgResults > tr > td[1]").map(&:text)
The content of the first td would be:
doc.at("table#rptBidTypes__ctl0_dgResults td").text
The problem is that your search is matching two different things: the <tr> tag nested directly within the table with id rptBidTypes__ctl0_dgResults, and the <tr> tag within the table nested inside that parent table. When you loop through document[1..-1] you're actually selecting the second <tr> tag rather than the first one.
To select just the direct child <tr> tag, use:
document = doc.search("table#rptBidTypes__ctl0_dgResults > tr")
Then you can get the text for the <td> tag with:
document.css('td')[0].text #=> "S24327"
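The descendant-versus-child difference is easy to see in a small experiment. This sketch is in Python rather than Ruby and assumes a tidied, shortened copy of the fragment parsed with the standard library's ElementTree, where ".//tr" corresponds to the CSS "table tr" and "tr" to "table > tr":

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the fragment from the question.
TABLE = """<table id="rptBidTypes__ctl0_dgResults">
<tr>
<td align="left">S24327</td>
<td>Airfield Lighting
<div><div>
<table><tr><td bgcolor="white">Abstract: airfield lighting purchase</td></tr></table>
</div></div>
</td>
</tr>
</table>"""

table = ET.fromstring(TABLE)

# Descendant search: also finds the row of the nested table.
all_rows = table.findall(".//tr")
# Child search: only rows directly under the outer table.
direct_rows = table.findall("tr")

# First cell of each direct row.
first_cells = [row.find("td").text for row in direct_rows]
print(len(all_rows), len(direct_rows), first_cells)
```

Note that an HTML parser such as Nokogiri's may restructure the tree (for example by inserting tbody), so the direct-child selector can need adjusting on real pages.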