While there are many questions like that, none of them describes my problem:
I have this list:
<ul>
<li>Burger</li>
<li>Fries</li>
<li>Coke</li>
</ul>
The list gets its data from a database that also includes the prices.
Now I need a list that also can show me the price in another column, like:
1. Burger | 6.99$
2. Fries | 2.99$
3. Coke | 1.99$
But all questions I find are about multiple columns if the list is too long.
Is there a way to reach my goal?
Lists aren't designed for that. You could implement some hacky way to make a multi-column list, but it is simpler to use a table:
<table>
<tr>
<th>Item</th>
<th>Price</th>
</tr>
<tr>
<td>Burger</td>
<td>$6.99</td>
</tr>
<tr>
<td>Fries</td>
<td>$2.99</td>
</tr>
<tr>
<td>Coke</td>
<td>$1.99</td>
</tr>
</table>
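Since the question says the data comes from a database, the rows would normally be generated rather than typed by hand. Here is a minimal sketch in Python, assuming the query returns (name, price) pairs; the function name render_price_table and the sample data are made up for illustration and are not part of the original question:

```python
from html import escape


def render_price_table(rows):
    """Return an HTML table with one row per (name, price) pair."""
    lines = ["<table>", "<tr>", "<th>Item</th>", "<th>Price</th>", "</tr>"]
    for name, price in rows:
        lines += [
            "<tr>",
            f"<td>{escape(name)}</td>",   # escape text coming from the database
            f"<td>${price:.2f}</td>",     # always show two decimal places
            "</tr>",
        ]
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical rows, standing in for the result of a database query.
menu = [("Burger", 6.99), ("Fries", 2.99), ("Coke", 1.99)]
print(render_price_table(menu))
```

Whatever server-side language you actually use, the idea is the same: one <tr> per record, one <td> per field.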
I am attempting to scrape items from a page containing various HTML elements and a series of nested tables.
I have some working code that successfully scrapes from table X where class="ClassA" and outputs the table elements as a series of items, such as company address, phone number, website address, etc.
I would like to add some extra items to this output. However, the other items to be scraped aren't located within the same table, and some aren't located in a table at all, e.g. the < H1 > tag in another part of the page.
How can I add other items to my output, using an XPath filter, and have them appear in the same array / output structure? I noticed that if I scrape extra items from another table (even when the table has the exact same class name and ID), those items end up on different lines in the CSV output, which breaks the CSV structure :(
I'm sure there must be a way for items to remain unified in a CSV output, even if they are scraped from slightly different areas of a page. Hopefully it's just a simple fix...
----- HTML EXAMPLE PAGE BEING SCRAPED -----
<html>
<head></head>
<body>
< // huge amount of other HTML and tables NOT to be scraped >
<h2>HEADING TO BE SCRAPED - Company Name</h2>
<p>Company Description</p>
<table cellspacing="0" class="contenttable company-details">
<tr>
<th>Item Code</th>
<td>IT123</td>
</tr>
<tr>
<th>Listing Date</th>
<td>12 September, 2011</td>
</tr>
<tr>
<th>Internet Address</th>
<td class="altrow">http://www.website.com/</td>
</tr>
<tr>
<th>Office Address</th>
<td>123 Example Street</td>
</tr>
<tr>
<th>Office Telephone</th>
<td>(01) 1234 5678</td>
</tr>
</table>
<table cellspacing="0" class="contenttable" id="staff">
<tr><th>Management Names</th></tr>
<tr>
<td>
Mr John Citizen (CEO)<br/>Mrs Mary Doe (Director)<br/>Dr J. Watson (Manager)<br/>
</td>
</tr>
</table>
<table cellspacing="0" class="contenttable company-details">
<tr>
<th>Contact Person</th>
<td>
Mr John Citizen<br/>
</td>
</tr>
<tr>
<th class=principal>Company Mission</th>
<td>ACME Corp is a retail sales company.</td>
</tr>
</table>
</body>
</html>
---- SCRAPY CODE EXAMPLE ----
from scrapy.spider import Spider
from scrapy.selector import Selector
from my.items import MyItem


class MySpider(Spider):
    name = "my"
    allowed_domains = ["website.com"]
    start_urls = ["http://www.website.com/ABC"]

    def parse(self, response):
        sel = Selector(response)
        # Every table whose class attribute is exactly this string
        sites = sel.xpath('//table[@class="contenttable company-details"]')
        for site in sites:
            # One item is created (and yielded) per matching table
            item = MyItem()
            item['Company_name'] = site.xpath('.//h1//text()').extract()
            item['Item_Code'] = site.xpath('.//th[text()="Item Code"]/following-sibling::td//text()').extract()
            item['Listing_Date'] = site.xpath('.//th[text()="Listing Date"]/following-sibling::td//text()').extract()
            item['Website_URL'] = site.xpath('.//th[text()="Internet Address"]/following-sibling::td//text()').extract()
            item['Office_Address'] = site.xpath('.//th[text()="Office Address"]/following-sibling::td//text()').extract()
            item['Office_Phone'] = site.xpath('.//th[text()="Office Telephone"]/following-sibling::td//text()').extract()
            # Note: this expression is absolute (no leading dot)
            item['Company_Mission'] = site.xpath('//th[text()="Company Mission"]/following-sibling::td//text()').extract()
            yield item
Outputting to CSV
scrapy crawl my -o items.csv -t csv
With the example code above, the [company mission] item appears on a different line in the CSV from the other items (I'm guessing because it's in a different table), even though it has the same class name and ID. Additionally, I'm unsure how to scrape the < H1 > field, since it falls outside the table structure for my current XPath sites filter.
I could expand the sites XPath filter to include more content, but won't that be less efficient and defeat the point of filtering altogether?
Here's an example of the debug log, where you can see the Company Mission is being processed twice for some reason, and the first loop is empty, which must be why it is outputting onto a new line in the CSV, but why ??
{'Item_Code': [u'ABC'],
'Listing_Date': [u'1 January, 2000'],
'Office_Address': [u'Level 1, Some Street, SYDNEY, NSW, AUSTRALIA, 2000'],
'Office_Fax': [u'(02) 1234 5678'],
'Office_Phone': [u'(02) 1234 5678'],
'Company_Mission': [],
'Website_URL': [u'http://www.company.com']}
2014-02-06 16:32:13+1000 [my] DEBUG: Scraped from <200 http://www.website.com/Code=ABC>
{'Item_Code': [],
'Listing_Date': [],
'Office_Address': [],
'Office_Fax': [],
'Office_Phone': [],
'Company_Mission': [u'The comapany is involved in retail, food and beverage, wholesale services.'],
'Website_URL': []}
The other thing I am completely baffled about is why the items are spat out in the CSV in a completely different order to the items on the HTML page and the order I have defined in the spiders config file. Does scrapy run completely asynchronously returning items in whatever order it pleases ?
I understand you want to scrape 1 item for this page, but //table[@class="contenttable company-details"] matches 2 table elements in your HTML content, so the for site in sites: loop will run twice, creating 2 items.
And for each table, XPath expressions will be applied within the current table if they are relative -- .//th[text()="Item Code"]. Absolute XPath expressions, such as //th[text()="Company Mission"], will look for elements from the root element of your HTML document.
Your sample output shows the "Company_Mission" only once while you say it appears twice. And because you're using an absolute XPath expression for it, it should indeed have appeared twice. I'm not sure the output matches the spider code currently in the question.
So, first iteration of the loop,
<table cellspacing="0" class="contenttable company-details">
<tr>
<th>Item Code</th>
<td>IT123</td>
</tr>
<tr>
<th>Listing Date</th>
<td>12 September, 2011</td>
</tr>
<tr>
<th>Internet Address</th>
<td class="altrow">http://www.website.com/</td>
</tr>
<tr>
<th>Office Address</th>
<td>123 Example Street</td>
</tr>
<tr>
<th>Office Telephone</th>
<td>(01) 1234 5678</td>
</tr>
</table>
in which you can scrape:
Item Code
Listing Date
Internet Address --> Website URL
Office Address
Office Telephone
and because you're using an absolute XPath expression, //th[text()="Company Mission"]/following-sibling::td//text() will look anywhere in the document, not only in this first <table cellspacing="0" class="contenttable company-details">
These extracted fields go into an item of their own.
Then comes the 2nd table matching your XPath for sites:
<table cellspacing="0" class="contenttable company-details">
<tr>
<th>Contact Person</th>
<td>
Mr John Citizen<br/>
</td>
</tr>
<tr>
<th class=principal>Company Mission</th>
<td>ACME Corp is a retail sales company.</td>
</tr>
</table>
for which a new MyItem() is instantiated, and here, no XPath expressions match except the absolute XPath for "Company Mission", so at the end of the loop iteration, you've got an item with only "Company Mission".
If you're sure you expect one and only one item from this page, you can use longer XPaths like //table[@class="contenttable company-details"]//th[text()="Item Code"]/following-sibling::td//text() for each field you want, so that each one matches whether it sits in the 1st or the 2nd table,
and use only 1 MyItem() instance.
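To illustrate the "one item per page" idea without needing a Scrapy project, here is a rough sketch using only Python's standard library. It is a stand-in, not Scrapy code: the PAGE string is a simplified, well-formed version of the HTML in the question, and scrape_single_item is a name made up for this example. The point is only that a single item is created before the loop and filled from every matching table, plus the heading outside the tables:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page in the question.
PAGE = """
<html><body>
  <h2>ACME Corp</h2>
  <table class="contenttable company-details">
    <tr><th>Item Code</th><td>IT123</td></tr>
    <tr><th>Office Telephone</th><td>(01) 1234 5678</td></tr>
  </table>
  <table class="contenttable" id="staff">
    <tr><th>Management Names</th></tr>
  </table>
  <table class="contenttable company-details">
    <tr><th>Company Mission</th><td>ACME Corp is a retail sales company.</td></tr>
  </table>
</body></html>
"""


def scrape_single_item(page):
    """Collect fields from all matching tables into ONE dict."""
    root = ET.fromstring(page)
    # The heading lives outside the tables, so query it from the root.
    item = {"Company_name": root.findtext(".//h2")}
    # Both "company-details" tables match; they feed the same item.
    for table in root.findall('.//table[@class="contenttable company-details"]'):
        for row in table.findall("tr"):
            header, cell = row.find("th"), row.find("td")
            if header is not None and cell is not None:
                item[header.text.strip()] = "".join(cell.itertext()).strip()
    return item


item = scrape_single_item(PAGE)
print(item)
```

In the real spider, the equivalent change is to build the item once in parse() and yield it once, instead of creating and yielding it inside the for site in sites: loop.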
Also, you can try CSS selectors that would be shorter to read and write and easier to maintain:
"Company_name" <-- sel.css('h2::text')
"Item_Code" <-- sel.css('table.company-details th:contains("Item Code") + td::text')
"Listing_Date" <-- sel.css('table.company-details th:contains("Listing Date") + td::text')
etc.
Note that :contains() is available in Scrapy via cssselect underneath, but it's not standard (it was removed from the CSS specs, but is handy). The ::text pseudo-element selector is also non-standard; it is a Scrapy extension, and is also handy.
"guessing because it's in a different table": wrong guess. There is no correlation between tables and items. In fact, it does not matter where the data comes from, as long as you set it on the item's fields.
That means you can take Company_name and Company_Mission from wherever you want.
Having said that, check what //th[text()="Company Mission"] returns and how many times it appears on the page. While the other items' XPaths are relative (they start with a .), this one is absolute (it starts with //), so it may scrape a list of items and not just one.
Say I have the given table:
       +------+------+------+
       | Col1 | Col2 | Col3 |
+------+------+------+------+
| Row1 | D1.1 | D1.2 | D1.3 |
+------+------+------+------+
| Row2 | D2.1 | D2.2 | D2.3 |
+------+------+------+------+
| Row3 | D3.1 | D3.2 | D3.3 |
+------+------+------+------+
And I want to represent it in HTML5. The tricky thing is that tables like this must be semantically meaningful, yet the top-left cell carries no meaning; it is just a spacer to line up the more important column headers. What's the best way to do this? My first idea is to do it like this:
<table>
<thead>
<tr>
<th></th>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
</thead>
<tbody>
<tr>
<th>Row1</th>
<td>D1.1</td>
<td>D1.2</td>
<td>D1.3</td>
</tr>
<tr>
<th>Row2</th>
<td>D2.1</td>
<td>D2.2</td>
<td>D2.3</td>
</tr>
<tr>
<th>Row3</th>
<td>D3.1</td>
<td>D3.2</td>
<td>D3.3</td>
</tr>
</tbody>
</table>
Though, putting <th></th> in there feels just wrong, like using <p> </p> for spacing. Is there a better way to do this?
It's completely acceptable to have an empty <th> element, speaking in terms of either validity or semantics. Nothing in the spec forbids it; in fact, it contains at least one example that makes use of an empty <th> for this very purpose:
The following shows how one might mark up the gross margin table on page 46 of Apple, Inc's 10-K filing for fiscal year 2008:
<table>
<thead>
<tr>
<th>
<th>2008
<th>2007
<th>2006
<tbody>
<tr>
<th>Net sales
<td>$ 32,479
<td>$ 24,006
<td>$ 19,315
<!-- snip -->
</table>
For a discussion about semantics and empty table elements I would like to refer to this question on StackOverflow
The styling of "empty" cells (such as background or borders) can sometimes depend on the absence or presence of content, which is why people often put a &nbsp; inside. There is a dedicated CSS property, empty-cells, for styling empty cells; you can read about it on MDN.
table {
empty-cells: hide;
}
Here you can find another article with some nice background information on this topic.
Any better way of using empty <th></th>:
Exact code:
<tr>
<th></th>
<th colspan="6"></th>
</tr>
I have a table where elements can have child elements with the very same attributes, like:
ITEM      ATTRIBUTE 1   ATTRIBUTE 2
item      value         value
  sub     value         value
  sub     value         value
item      value         value
From this I've created a markup like this:
<table>
<thead>
<tr>
<th>ITEM</th>
<th>ATTRIBUTE 1</th>
<th>ATTRIBUTE 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>item</td>
<td>value</td>
<td>value</td>
</tr>
<tr>
<td colspan=3>
<table>
<tbody>
<tr>
<td>sub</td>
<td>value</td>
<td>value</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>item</td>
<td>value</td>
<td>value</td>
</tr>
</tbody>
</table>
My questions are now:
Is this the best semantic solution?
Is another approach better suited? If so, which is the recommended way?
Is the table header in charge of both tables, or do I have to create a new one (maybe with visibility: hidden) for the nested table?
Is this the best semantic solution?
Not really. While the act of nesting an element A within another element B can be used to indicate that A is a child of B, that isn't what you're doing here: you're nesting the table within a completely different row, so there's no implication of a parent-child relationship between A and B.
By creating a cell that spans all the columns in the table and then building another table inside that with the same number of columns, you're also effectively saying "these are some other columns, that don't relate to the ones in the outer table".
You can see the implied (lack of) relationship between the columns by adding a border to the cells in your example above:
Obviously you can fix that with CSS, but the unstyled rendering of a piece of HTML is often a good guide to its semantics.
Is another approach better suited? If so, which is the recommended way?
There's no standard way to represent hierarchical relationships between rows of a table in HTML. Cribbing from an answer I gave to a similar question, though, you can do it with extra classes, ids and data- attributes:
<table>
<thead>
<tr>
<th>ITEM</th>
<th>ATTRIBUTE 1</th>
<th>ATTRIBUTE 2</th>
</tr>
</thead>
<tbody>
<tr id=100>
<td>item</td>
<td>value</td>
<td>value</td>
</tr>
<tr id=110 data-parent=100 class=level-1>
<td>sub</td>
<td>value</td>
<td>value</td>
</tr>
<tr id=200>
<td>item</td>
<td>value</td>
<td>value</td>
</tr>
</tbody>
</table>
The parent-child relationship won't be visible in an unstyled rendering (there's no other way you could make it so without adding extra content, as far as I can see), but there are enough hooks to add the CSS required:
.level-1 > td:first-child {
padding-left: 1em;
}
... which results in this:
With a little JavaScript, you could also use the id and data-parent attributes to set things up so that, e.g., hovering over a row causes its parent to be highlighted.
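If the rows come from hierarchical data on the server, the flat markup with id, data-parent and level classes can be generated recursively. This is only a sketch under assumptions: the TREE structure (dicts with "id", "cells" and optional "children") and the function name flatten_rows are invented for this example, and the attribute names simply follow the markup above:

```python
from html import escape


def flatten_rows(nodes, parent_id=None, level=0):
    """Yield one <tr> string per node, depth first, parents before children."""
    for node in nodes:
        attrs = [f'id="{node["id"]}"']
        if parent_id is not None:
            # Child rows point back at their parent and get a level class.
            attrs.append(f'data-parent="{parent_id}"')
            attrs.append(f'class="level-{level}"')
        cells = "".join(f"<td>{escape(str(value))}</td>" for value in node["cells"])
        yield f"<tr {' '.join(attrs)}>{cells}</tr>"
        yield from flatten_rows(node.get("children", []), node["id"], level + 1)


# Hypothetical data mirroring the example table.
TREE = [
    {"id": "100", "cells": ["item", "value", "value"],
     "children": [{"id": "110", "cells": ["sub", "value", "value"]}]},
    {"id": "200", "cells": ["item", "value", "value"]},
]

rows = list(flatten_rows(TREE))
print("\n".join(rows))
```

Deeper nesting simply produces level-2, level-3 and so on, each of which would need its own padding rule in the CSS.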
Is the table header in charge for both tables, or do I have to create a new one?
In your proposed solution, creating a single cell that spans all columns and then building another table inside it means that there's no implied relationship between the header cells and those of your "child" row. Obviously my suggested solution above doesn't have that problem.
This is W3C's recommendation:
At the current time, those who want to ensure consistent support across Assistive
Technologies for tables where the headers are not in the first row/column may want
to use the technique for complex tables H43: Using id and headers attributes to
associate data cells with header cells in data tables. For simple tables that have
headers in the first column or row we recommend the use of the th and td elements.
You can look at this post: Best way to construct a semantic html table
Hope that helps you get your answer.
A full discussion of semantics would take more time than simply answering your question.
For the whole picture, though, this link should help you. That page contains all the information you may be interested in. Interestingly, unlike the usual 'declarative' specs the W3C writes, it takes a 'suggestive' tone on the question in this context. You may wish to read it right from the start.
I think putting the children in a separate table is the wrong way to go. Nested tables are not like nested lists; they don't carry that same semantic hierarchy. It seems everything should be within the same table if it all lists the same information.
For example, if your table had the headers
REGION POPULATION AREA
then you could have item1 = Earth, item2 = France, item3 = Paris... and it wouldn't really matter if France were a child of Earth or if Paris were a child of France; you'd still be better off keeping it all in one table and not trying to do a parent/child relationship other than in CSS styling.
If your table is really not comprehensible without someone knowing that parent/child relationship, could you give an example of the table data so I can better understand how to structure it?
I am using Grails and I am currently faced with this problem.
This is the result of my html table
And this is my code from the gsp page
<tr>
<th>Device ID</th>
<th>Device Type</th>
<th>Status</th>
<th>Customer</th>
</tr>
<tr>
<g:each in = "${reqid}">
<td>${it.device_id}</td>
</g:each>
<g:each in ="${custname}">
<td>${it.type}</td>
<td>${it.system_status}</td>
<td>${it.username}</td>
</g:each>
</tr>
So the problem is, how do I format the table such that the "LHCT271 , 2 , Thomasyeo " will be shifted down accordingly? I've tried to add the <tr> tags here and there but it doesn't work.. any help please?
I think your problem is not in the view, but in the controller (or maybe even the domain). You must have some way of knowing that reqid and custname are related if they are in the same table. You must use that to construct an object that can be easily used in a g:each.
You are looking for a way to mix columns and rows, and still get a nice table. I'm afraid that is not possible.
Edit
(Sorry, I just saw the last comment.)
You cannot mix two items in a g:each.
Furthermore, if the two things are not related, you probably should not put them in the same table. There will be no way for you, or for Grails, to know how to properly organize the information.
Do you want to display the first reqid against the first custname (three fields), the second against the second, and so on? And are those collections of the same length?
In such case you could try the following:
<tr>
<th>Device ID</th>
<th>Device Type</th>
<th>Status</th>
<th>Customer</th>
</tr>
<g:each var="req" in="${reqid}" status="i">
<tr>
<td>${req.device_id}</td>
<td>${custname[i].type}</td>
<td>${custname[i].system_status}</td>
<td>${custname[i].username}</td>
</tr>
</g:each>
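The same pairing can also be done before the data reaches the view, which is what the earlier suggestion about the controller amounts to. A language-neutral sketch in Python (not Grails code; pair_rows and the sample values are hypothetical) that merges two parallel lists into one list of row objects and refuses lists of different length:

```python
def pair_rows(device_ids, customers):
    """Merge two parallel lists into one list of row dicts, matched by position."""
    if len(device_ids) != len(customers):
        raise ValueError("both lists must have the same length")
    return [
        {"device_id": device_id, **customer}
        for device_id, customer in zip(device_ids, customers)
    ]


# Hypothetical sample data with the same field names as the GSP page.
device_ids = ["DEV-1", "DEV-2"]
customers = [
    {"type": "TypeA", "system_status": 1, "username": "alice"},
    {"type": "TypeB", "system_status": 2, "username": "bob"},
]

table_rows = pair_rows(device_ids, customers)
for row in table_rows:
    print(row["device_id"], row["type"], row["system_status"], row["username"])
```

With one combined collection, the view needs just a single loop that emits one <tr> per element.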
I have a Lexicon model, and I want users to be able to create dynamic features for every lexicon.
I also have a complicated search interface that lets users search on every single feature (including the dynamic ones) belonging to the Lexicon model.
I could have used a serialized text field to save all the dynamic information if it were not needed for searching.
Since I want to let users search on all fields, I have created a DynamicField model to hold all dynamically created features.
But imagine I have 1,000,000,000 lexicons: if someone creates a dynamic feature for every lexicon, this will result in 1,000,000,000 rows in the DynamicField model.
So the SQL search will become quite inefficient once a lot of dynamic features have been created.
Is there a better solution for this situation?
Which way should I take?
Search for a better DB design for dynamic fields
Try tuning MySQL (add cache fields, add indexes ...) with the current DB design
Another idea might be to use MongoDB and MongoMapper, Thinking Sphinx or Solr. Here is a Railscast on how to use Mongo: http://railscasts.com/episodes/194-mongodb-and-mongomapper
I think the best way to do this is to use a name/value pairing instead of dynamic fields. Let me explain using the EAV design pattern.
So instead of having something like this:
Table: MedicalRecords
<table>
<tr>
<th>Temperature in degrees Fahrenheit</th>
<th>Presence of Cough</th>
<th>Type of Cough</th>
<th>Heart Rate in beats per minute</th>
<th>Column X</th>
<th>Column X + 1</th>
<th>... Column N</th>
</tr>
<tr>
<td>102</td>
<td>True</td>
<td>With phlegm, yellowish, streaks of blood</td>
<td>98</td>
<td>????</td>
<td>????</td>
<td>????</td>
</tr>
</table>
You would design your table like this:
Table: MedicalRecords
<table>
<tr>
<th>Name</th>
<th>Value</th>
</tr>
<tr>
<td>Temperature in degrees Fahrenheit</td>
<td>102</td>
</tr>
<tr>
<td>Presence of Cough</td>
<td>True</td>
</tr>
<tr>
<td>Type of Cough</td>
<td>With phlegm, yellowish, streaks of blood</td>
</tr>
<tr>
<td>Heart Rate in beats per minute</td>
<td>98</td>
</tr>
<tr>
<td>Column X</td>
<td>????</td>
</tr>
<tr>
<td>Column X + 1</td>
<td>????</td>
</tr>
<tr>
<td>... Column N</td>
<td>????</td>
</tr>
</table>
(I tried to get the table tags to work but couldn't. Try copying my code into an HTML file to get the idea.)
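To make the name/value idea concrete for the Lexicon case, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table names (lexicons, dynamic_fields), the column names, the index and the sample data are all assumptions made for illustration; the question does not show the actual schema. The composite index on (name, value) is what keeps a search on a given feature from scanning every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicons (
    id   INTEGER PRIMARY KEY,
    word TEXT NOT NULL
);
-- One row per (lexicon, feature name): the name/value pairing.
CREATE TABLE dynamic_fields (
    lexicon_id INTEGER NOT NULL REFERENCES lexicons(id),
    name       TEXT NOT NULL,
    value      TEXT NOT NULL,
    PRIMARY KEY (lexicon_id, name)
);
-- Lets "feature X has value Y" lookups use an index.
CREATE INDEX idx_dynamic_fields_name_value ON dynamic_fields (name, value);
""")

conn.executemany(
    "INSERT INTO lexicons (id, word) VALUES (?, ?)",
    [(1, "apple"), (2, "run")],
)
conn.executemany(
    "INSERT INTO dynamic_fields (lexicon_id, name, value) VALUES (?, ?, ?)",
    [(1, "part_of_speech", "noun"),
     (1, "origin", "Old English"),
     (2, "part_of_speech", "verb")],
)


def find_by_feature(connection, name, value):
    """Return the words whose dynamic feature `name` equals `value`."""
    cursor = connection.execute(
        "SELECT l.word FROM lexicons l "
        "JOIN dynamic_fields d ON d.lexicon_id = l.id "
        "WHERE d.name = ? AND d.value = ? ORDER BY l.word",
        (name, value),
    )
    return [word for (word,) in cursor]


print(find_by_feature(conn, "part_of_speech", "noun"))
```

Whether this scales to the row counts mentioned in the question depends on the database and the query mix, so it is worth measuring before committing to the design.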