CSS Selectors for Scrapy Web Scraping


I'm currently trying to scrape all the malls listed on the website
https://web.archive.org/web/20151112172204/http://www.simon.com/mall
using Python and Scrapy. I can't figure out how to extract the text "Anchorage 5th Avenue Mall".
<div class="st-country-padding">
<h4><a class="no-underline" href="/web/20151112172204/http://www.simon.com/search/alaska%2b(ak)" title="View Malls In Alaska">Alaska</a></h4>
<div>
Anchorage:
Anchorage 5th Avenue Mall
</div>
</div>
I've tried a number of different approaches, including
response.css("a::attr(title)").extract()
But that doesn't give me what I'm looking for.
Note that Anchorage is just the first mall, so I can't target it by name: there are 200 or so different malls.

::attr(title) gives you the value of the title attribute. What you want is the text, so you need to use ::text instead.
Also, there doesn't appear to be a good way to identify the a element you want since it doesn't have anything that distinguishes it from the others, so a bit of pathing is necessary. Let me know if this works for you:
response.css(".st-country-padding > div > a:last-of-type::text").extract()
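To make the distinction concrete: the title attribute and the link text are two separate pieces of the same element. The sketch below uses Python's standard-library html.parser rather than Scrapy, so it illustrates the attribute-versus-text difference and not Scrapy's selector API. It runs against the h4 link from the question; the class name AnchorCollector is hypothetical.

```python
from html.parser import HTMLParser

# The <h4> link copied from the question's markup.
SNIPPET = (
    '<h4><a class="no-underline" '
    'href="/web/20151112172204/http://www.simon.com/search/alaska%2b(ak)" '
    'title="View Malls In Alaska">Alaska</a></h4>'
)


class AnchorCollector(HTMLParser):
    """Collect the title attribute and the text content of every <a>."""

    def __init__(self):
        super().__init__()
        self.titles = []   # roughly what a::attr(title) would return
        self.texts = []    # roughly what a::text would return
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Only keep non-blank text that sits inside an <a> element.
        if self._in_anchor and data.strip():
            self.texts.append(data.strip())


collector = AnchorCollector()
collector.feed(SNIPPET)
titles = collector.titles
texts = collector.texts
print(titles, texts)
```

The attribute list holds the tooltip string, while the text list holds what the user actually sees in the link, which is the part the question is after.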

Related

How can I tell if I nested microdata correctly within my code?

I am new to microdata, and have to try and put together an assignment.
It requires "At least one itemtype should be embedded (or nested) in another itemtype: the value for at least one itemprop should itself be another itemtype with its own set of properties."
The code I came up with is this:
<div itemscope itemtype="https://schema.org/Person">
<div style="white-space: pre-wrap;">
<span itemprop="description">
Since I am still in my <span itemprop="knowsAbout">library and information science program</span>, I do not have as many finished projects as I wish to showcase here.
</span>
I have been performing coursework on reference and information services, information organization (including metadata), and an introduction to technologies that are used in the library sciences field, such as database management.
<div itemscope itemtype="https://schema.org/Action">
I have done a few <span itemprop="result">library science projects</span> over the last year that have been recently completed. Actually, this<span itemprop="result"> early version of my website</span> you are on is one of them! I programmed this
myself over the course of several days using a combination of HTML and CSS to create this website experience.</div>
<div itemscope itemtype="https://schema.org/CreativeWork">
Another project I have worked on is a <a itemprop="exampleOfWork" href="link_insert_here" target="_blank">LibGuide</a> (a library research guide) for LGBTQ+ characters in Comic Books and Graphic Novels. I have also created several learning aids.
First is an example of a <a itemprop="exampleOfWork" href="handout.pdf" target="_blank">handout</a> I created discussing some basic information on virtual machines. This is an example of a <a itemprop="exampleOfWork" href="link_insert_here" target="_blank">video tutorial</a> I created on how to sign up for a Local Public Library eCard.
</div>
</div>
</div>
I'm not sure I nested everything properly. When I run the page through various structured data testing tools, they pick up my microdata, and it seems right, but I can't tell for certain.
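One way to check the nesting by hand: in microdata, an item becomes the value of a property only when the element carrying itemscope also carries itemprop. An itemscope element that merely sits inside another item, without itemprop, is read as a separate item rather than a nested one. The rough sketch below uses Python's html.parser on a shortened version of the markup above. The subjectOf property on the CreativeWork is an illustrative choice and not taken from the question, the class name ItemscopeAudit is hypothetical, and the parser assumes every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Shortened, hypothetical variant of the markup in the question.
SAMPLE = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="description">About me</span>
  <div itemscope itemtype="https://schema.org/Action">
    <span itemprop="result">library science projects</span>
  </div>
  <div itemprop="subjectOf" itemscope itemtype="https://schema.org/CreativeWork">
    <span itemprop="name">LibGuide</span>
  </div>
</div>
"""

VOID = {"br", "img", "meta", "link", "input", "hr"}


class ItemscopeAudit(HTMLParser):
    """For each itemscope element, record its itemtype, how many items
    enclose it, and the itemprop (if any) that links it to its parent."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one bool per open element: does it start an item?
        self.report = []   # (itemtype, enclosing item count, itemprop or None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_item = "itemscope" in attrs
        if is_item:
            depth = sum(self.stack)
            self.report.append(
                (attrs.get("itemtype"), depth, attrs.get("itemprop"))
            )
        if tag not in VOID:
            self.stack.append(is_item)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()


audit = ItemscopeAudit()
audit.feed(SAMPLE)
report = audit.report

# Items that sit inside another item but are not attached to it via itemprop.
unlinked = [itemtype for itemtype, depth, prop in report
            if depth > 0 and prop is None]
print(unlinked)
```

Any itemtype that shows up in the unlinked list is inside another item in the markup but is not one of its properties, so it does not count as nested in the sense the assignment describes.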

How do you use xpath to find an element with two specific descendants?

I have an unordered list of list items containing elements for labels and values that are dynamically generated. I am trying to validate that the list contains a specific label with a specific value.
I am attempting to write an XPath that will allow me to find the parent element that contains the defined label and value with Protractor's element(by.xpath). Given a list, I need to be able to find any single li by the combination of two descendants with specific attributes. For example, a li element that contains any descendant with class=label and text=Color AND any descendant with text=Blue.
<ul>
<li>
<span class='label'> Car </span>
<p> Ford </p>
</li>
<li>
<span class='label'> Color </span>
<p> <span>My favorite color is</span> : <webl>Blue</webl></p>
</li>
<li>
<span class='label'> Name </span>
<p> Meri </p>
</li>
<li>
<span class='label'> Pet </span>
<p> Cats <span>make the best pets</span> </p>
</li>
</ul>
I have tried several variations on the following pattern:
//li[.//*[@class="label" | contains(text(), 'Color')] | .//*[contains(text(), 'Blue')]
This is the closest I think I have come and it's coming back as not a valid xpath. I've been looking at references, cheatsheets, and SO questions for several hours now and I am no closer to understanding what I am doing wrong. Eventually I will need to replace the text with variables, but right now I just need to get my head around this.
a list item that contains, at any depth,
any tag with a class of 'label' and text of x
AND
any tag with text y
Can anyone tell me what I am doing wrong? Am I just making it too complex?

Your XPath is invalid because:
The |, or union, operator returns the union of its two operands,
which must be node-sets.
You used | inside a single predicate to combine boolean conditions, which calls for the and operator instead. The XPath below should meet your requirement:
//*[@class="label" and contains(text(),'Color')]//ancestor::li//*[contains(text(), 'Blue')]

As per the HTML you have shared, to locate the <li> element that contains a descendant with class='label' and text=Color AND any descendant with text=Blue, you can use the following XPath-based locator strategy:
//li[./span[@class='label' and contains(., 'Color')]][.//webl[contains(., 'Blue')]]
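The logic behind that expression is two independent conditions tested against the same li. As a rough illustration, the sketch below checks the same two conditions in plain Python with the standard library's xml.etree.ElementTree. ElementTree's XPath subset has no contains(), so the conditions are written as Python loops instead of as an XPath string. The markup is the list from the question with a closing </ul> added so that it parses, and the helper name find_items is hypothetical.

```python
import xml.etree.ElementTree as ET

# The list from the question, with a closing </ul> added.
MARKUP = """
<ul>
  <li><span class='label'> Car </span><p> Ford </p></li>
  <li><span class='label'> Color </span>
      <p> <span>My favorite color is</span> : <webl>Blue</webl></p></li>
  <li><span class='label'> Name </span><p> Meri </p></li>
  <li><span class='label'> Pet </span>
      <p> Cats <span>make the best pets</span> </p></li>
</ul>
"""


def find_items(root, label, value):
    """Return each <li> that has, at any depth, an element with
    class='label' whose text contains `label` AND an element whose
    text contains `value`."""
    matches = []
    for li in root.iter("li"):
        descendants = [el for el in li.iter() if el is not li]
        has_label = any(
            el.get("class") == "label" and label in (el.text or "")
            for el in descendants
        )
        has_value = any(value in (el.text or "") for el in descendants)
        if has_label and has_value:
            matches.append(li)
    return matches


root = ET.fromstring(MARKUP)
items = find_items(root, "Color", "Blue")
print(len(items))
```

Because label and value are ordinary function arguments here, swapping the literals for variables later is straightforward.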

How to hCard company information?

I'm trying to mark up some content semantically. The content is company information, which may have multiple addresses, multiple phone numbers, and multiple email addresses.
The hCard generators that I see seem to expect a person's details (e.g. first name, last name, etc.).
Is there a way to mark up just company details? If so, how?
Also, is hCard the correct format to use?

you can use multiples of most microformats' properties, as long as you heed the parent element(s). in your case, as long as all the repeated properties are children of .vcard and are not themselves .vcard elements, all is good. i threw this together from two of the examples on http://microformats.org. here you go:
<div id="contact" class="vcard">
<h2>Contact Me Yo!</h2>
<h3 class="fn">Jane Doe</h3>
<p>You can contact me via email to
<a class="email" href="mailto:jane@example.com">jane@example.com</a>,
or reach me at the following address:</p>
<div class="adr">
<span class="type">home</span> address:
<div class="street-address">123 Main Street</div>
<span class="locality">Any Town</span>, <span class="region">CA</span>,
<span class="postal-code">91921-1234</span>
</div>
<div class="adr">
<span class="type">work</span> address:
<div class="street-address">789 Main Street</div>
<span class="locality">Any Town</span>, <span class="region">CA</span>,
<span class="postal-code">91921-1234</span>
</div>
</div>
references:
http://microformats.org/wiki/hcard-faq#Can_you_have_multiple_value_elements
http://microformats.org/wiki/hcard-faq#How_do_I_markup_multiple_addresses
is hcard the correct format to use?
100% absolutely...microformats are part of the html5 spec, they are the most widely used semantic web technology, they fit your exact needs, and they are (currently) indexed by the major search engines. microformats add levels to your document that most refuse to believe, but all you have to do is follow instructions, and you've got a pre-baked api in your markup.
that said, google/bing/yahoo!/yandex (? the russian search engine), have all openly endorsed schema.org, and while they support microformats (have for years), you'd be a fool to think they won't give their method(s) incentive(s) to be used. i'm not aware of any that are entirely microformats vs. schema.org yet, but i'm sure they are on the way. at the moment, imo, it's more about tying everything into g+ for google, so everything else is taking a backseat. which only speaks to my point(s)...
clearly i am biased, but that's about as clear and dry as i can be. i actually have the same mental debate for each and every client that puts me in the position to run wild with their markup...i have yet to break down and start using schema, however, i am quite prepared for them to ping me randomly, should google magically stop harvesting microformats.

To add company information, simply add an org class at the same level as the fn of your hCard.
Here is an example:
<div class="vcard">
<a class="url fn org" href="http://compa.ny">Company Name</a>
</div>
Or you can try it with Microdata/Schema.org, which is better supported by the major search engine providers: http://schema.org/Organization

Marking up a search result list with HTML5 semantics

Making a search result list (like in Google) is not very hard, if you just need something that works. Now, however, I want to do it with perfection, using the benefits of HTML5 semantics. The goal is to define the de facto way of marking up a search result list that potentially could be used by any future search engine.
For each hit, I want to
order them by increasing number
display a clickable title
show a short summary
display additional data like categories, publishing date and file size
My first idea is something like this:
<ol>
<li>
<article>
<header>
<h1>
<a href="url-to-the-page.html">
The Title of the Page
</a>
</h1>
</header>
<p>A short summary of the page</p>
<footer>
<dl>
<dt>Categories</dt>
<dd>
<nav>
<ul>
<li>First category</li>
<li>Second category</li>
</ul>
</nav>
</dd>
<dt>File size</dt>
<dd>2 kB</dd>
<dt>Published</dt>
<dd>
<time datetime="2010-07-15T13:15:05-02:00" pubdate>Today</time>
</dd>
</dl>
</footer>
</article>
</li>
<li>
...
</li>
...
</ol>
I am not really happy about the <article/> within the <li/>. First, the search result hit is not an article by itself, but just a very short summary of one. Second, I am not even sure you are allowed to put an article within a list.
Maybe the <details/> and <summary/> tags are more suitable than <article/>, but I don't know if I can add a <footer/> inside that?
All suggestions and opinions are welcome! I really want every single detail to be perfect.

1) I think you should stick with the article element, as
[t]he article element represents a
self-contained composition in a
document, page, application, or site
and that is intended to be
independently distributable or
reusable [source]
You merely have a list of separate documents, so I think this is fully appropriate. The same is true for the front page of a blog, containing several posts with titles and outlines, each in a separate article element. Besides, if you intend to quote a few sentences of the articles (instead of providing summaries), you could even use blockquote elements, like in the example of a forum post showing the original posts a user is replying to.
2) If you're wondering if it's allowed to include article elements inside a li element, just feed it to the validator. As you can see, it is permitted to do so. Moreover, as the Working Draft says:
Contexts in which this element may be
used:
Where flow content is expected.
3) I wouldn't use nav elements for those categories, as those links are not part of the main navigation of the page:
only sections that consist of major navigation blocks are appropriate for the nav element. In particular, it is common for footers to have a short list of links to various pages of a site, such as the terms of service, the home page, and a copyright page. The footer element alone is sufficient for such cases, without a nav element. [source]
4) Do not use the details and/or summary elements, as those are used as part of interactive elements and are not intended for plain documents.
UPDATE: Regarding if it's a good idea to use an (un)ordered list to present search results:
The ul element represents a list of
items, where the order of the items is
not important — that is, where
changing the order would not
materially change the meaning of the
document. [source]
As a list of search results actually is a list, I think this is the appropriate element to use; however, as it seems to me that the order is important (I expect the best matching result to be on top of the list), I think that you should use an ordered list (ol) instead:
The ol element represents a list of
items, where the items have been
intentionally ordered, such that
changing the order would change the
meaning of the document. [source]
Using CSS you can simply hide the numbers.
EDIT: Whoops, I just realized you already use an ol (due to my fatigue, I thought you used a ul). I'll leave my ‘update’ as is; after all, it might be useful to someone.

I'd mark it up this way (without using any RDFa/microdata vocabularies or microformats; so only using what the plain HTML5 spec gives):
<ol start="1">
<li id="1">
<article>
<h1>The Title of the Page</h1>
<p>A short summary of the page</p>
<footer>
<dl>
<dt>Categories</dt>
<dd>First category</dd>
<dd>Second category</dd>
<dt>File size</dt>
<dd>2 <abbr title="kilobyte">kB</abbr></dd>
<dt>Published</dt>
<dd><time datetime="2010-07-15T13:15:05-02:00">Today</time></dd>
</dl>
</footer>
</article>
</li>
<li id="2">
<article>
…
</article>
</li>
</ol>
start attribute for ol
If the search engine uses pagination, you should give the start attribute to the ol, so that each li reflects the correct ranking position.
id for each li
Each li should get an id attribute, so that you can link to it. The value should be the rank/position.
One could think that the id should be given to the article instead, but I think this would be wrong: the rank/order could change over time. You are not referring to a specific result but to a result position.
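As a small illustration of the two points above, here is a sketch of how a paginated result list might compute the start value and the per-item id from the page number. The function name render_results and its parameters are hypothetical, and the markup is cut down to the title only.

```python
from html import escape


def render_results(titles, page, page_size):
    """Render one page of results as an <ol> whose start attribute and
    li ids reflect the overall rank, not the position on the page."""
    start = (page - 1) * page_size + 1   # rank of the first hit on this page
    lines = [f'<ol start="{start}">']
    for rank, title in enumerate(titles, start=start):
        lines.append(
            f'  <li id="{rank}"><article><h1>{escape(title)}</h1></article></li>'
        )
    lines.append("</ol>")
    return "\n".join(lines)


# Page 2 with 10 hits per page starts at rank 11.
markup = render_results(["Eleventh hit", "Twelfth hit"], page=2, page_size=10)
print(markup)
```

With this arrangement a link to #12 always points at the twelfth position, whichever document currently holds it.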
Remove the header
It is not needed if it contains only the heading (h1).
Add rel="external" to the link
The link to each search result is an external link (leading to a different website), so it should get the rel value external.
Remove nav
The category links are not navigation in scope of the article. So remove the nav.
Each category in a dd
You used:
<dt>Categories</dt>
<dd>
<ul>
<li>First category</li>
<li>Second category</li>
</ul>
</dd>
Instead, you should list each category in its own dd and remove the ul:
<dt>Categories</dt>
<dd>First category</dd>
<dd>Second category</dd>
abbr for file size
The unit in "2 kB" should be marked-up with abbr:
2 <abbr title="kilobyte">kB</abbr>
Remove pubdate attribute
It's not in the spec anymore.
Other things that could be done
give hreflang attribute to the link if the linked result has a different language than the search engine
give lang attribute to the link description and the summary if it is in a different language than the search engine
summary: use blockquote (with cite attribute) instead of p, if the search engine does not create a summary itself but uses the meta-description or a snippet from the page.
title/link description: use q (with cite attribute) if the link description is exactly the title from the linked webpage

Aiming for a 'perfect' HTML5 template is futile because the spec itself is far from perfect, with most of the prescribed use-cases for the new 'semantic' elements obscure at best. As long as your document is structured in a logical fashion, you won't have any problems with search engines (most of the new tags don't have the slightest impact). Indeed, following the HTML5 spec to the letter - for example, using <h1> tags within each new sectioning element - may make your site less accessible (to screen readers, for example). Don't strive for 'perfect' or close-to, because it doesn't exist - HTML5 is not thought-out well enough for that. Just concentrate on keeping your markup logical and uncluttered.

I found a good resource for HTML5 is HTML5Doctor. Check the article archive for practical implementations of the new tags. Not a complete reference mind you, but nice enough to ease into it :)
As shown by the Footer element page, sections can contain footers :)

Semantic HTML for messages

I'm making a small web-chat utility and am looking for advice on which elements to use for messages.
Here's what I'm thinking of using at the moment:
<p id="message-1">
<span class="timestamp" id="2009-03-10T12:04:01+00:00">
12:04
</span>
<cite class="admin">
Ross
</cite>
Lorem ipsum dolor sit amet.
</p>
I'd take advantage of CSS here to add brackets around the timestamp, icons for the cited user etc. I figured it would be silly (and incorrect) to use a blockquote for each message, although I consider the cite correct as it's referring to the user that posted the message.
I know this isn't a) an exact science and b) entirely essential but I'd prefer to use meaningful elements rather than spans throughout. Are there any other elements I should consider? Any microformats?

HTML isn't very semantic in a customizable way. Nevertheless your format should be understandable in any browser (with proper CSS, as you have pointed out).
What I see in the code example above is very similar to XML. It might be cumbersome and overkill for your needs, but I'd like to point out that you can use XML with XSLT as a substitute for (X)HTML. This way you can get your tags as semantic as possible, and don't need to compromise with the limitations of the HTML tags.
w3schools has an article about the topic. I could swear that I saw a webpage on sun.com that was done in XML, but I can't find it anymore.
That said, if you don't intend this to be interpreted or parsed by third-party software, I'd advise against this method and would stick with the proven HTML.

Seems reasonable to me, except that the ‘id’ is invalid. NAME tokens can't start with a number or contain ‘+’.
Plus if two people spoke at once you'd have non-unique IDs. Perhaps that data should go in another attribute, such as ‘title’ (so you can hover to see the exact timestamp).
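The naming rule can be checked mechanically. A minimal sketch, assuming HTML 4's definition of ID and NAME tokens (a letter, followed by any number of letters, digits, hyphens, underscores, colons, and periods); the helper name is_valid_html4_id is hypothetical. HTML5 relaxes this rule to any non-empty value without whitespace, so the check matters mostly when validating against HTML 4 or XHTML 1.

```python
import re

# HTML 4 ID/NAME token: a letter, then letters, digits, "-", "_", ":" or ".".
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")


def is_valid_html4_id(value):
    """Return True if `value` is a valid HTML 4 ID token."""
    return HTML4_ID.fullmatch(value) is not None


print(is_valid_html4_id("message-1"))
print(is_valid_html4_id("2009-03-10T12:04:01+00:00"))
```

The timestamp fails on two counts: it starts with a digit and it contains a plus sign.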

If you're going for semantic HTML, you'll probably want to know that HTML5 doesn't consider your use of the <cite> element correct anymore.
A person's name is not the title of a work — even if people call that person a piece of work — and the element must therefore not be used to mark up people's names.

<ol>
<li class="message" value="1">
<span class="timestamp" id="2009-03-10T12:04:01+00:00">
12:04
</span>
<cite class="admin">
<address class="email">
<a href="mailto:ross@email.com">
Ross
</a>
</address>
</cite>
Lorem ipsum dolor sit amet.
</li>
</ol>
I would try something like the above. Notice I have placed everything in an Ordered list, as comments can be construed in the linear manner fitting an ordered list. Also, I have embedded, inside your Cite tag, an Address tag with an Anchor element. The unfortunately named Address element is actually meant to convey contact information for an Author, so you would probably want to link to the author's email address there.

What you suggested is already very good. If you want to take it a step further and be able to allow tons of different presentation options with the same markup (at the expense of heavier html) you may want to do something like:
<dl class="message" id="message-1">
<dt class="datetime">Datetime</dt>
<dd class="datetime">
<span class="day">Wed</span>
<span class="dayOfMonth">11</span>
<span class="month">Mar</span>
<span class="year">2009</span>
<span class="hourMin">17:34</span>
<span class="sec">33</span>
</dd>
<dt class="author">Author</dt>
<dd class="author">Ross</dd>
<dt class="message">Message</dt>
<dd class="message">Lorem ipsum dolor sit amet</dd>
</dl>

Since you mention microformats in the question, you are no doubt already familiar with the microformats wiki. It contains a good number of examples for different situations.
Another possibility would be to borrow parts of SIOC, which among other things is an ontology for forums - pretty similar to chat.
By re-using existing formats, you can take advantage of plugins and tools like Operator and maybe get more out of your app for free.

I'd use XML with XSLT to transform (style) the data.
It makes sense semantically here, but you also have the conversations in a suitable format for archiving (i.e. XML) - I assume you will have some sort of log or 'history'.

As @bobince said, the id="2009-03-10T12:04:01+00:00" is invalid.
You should change this:
<span class="timestamp" id="2009-03-10T12:04:01+00:00">
12:04
</span>
To this:
<time class="timestamp" datetime="2009-03-10T12:04:01+00:00">
12:04
</time>
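If the chat messages are rendered server-side, both the machine-readable value and the visible label can come from the same timestamp. A small sketch in Python, assuming the message time is held as a timezone-aware datetime; the function name render_timestamp is hypothetical.

```python
from datetime import datetime, timezone


def render_timestamp(moment):
    """Build a <time> element: ISO 8601 in the datetime attribute,
    a short HH:MM label as the visible text."""
    machine = moment.isoformat()          # e.g. 2009-03-10T12:04:01+00:00
    human = moment.strftime("%H:%M")      # e.g. 12:04
    return f'<time class="timestamp" datetime="{machine}">{human}</time>'


sent_at = datetime(2009, 3, 10, 12, 4, 1, tzinfo=timezone.utc)
element = render_timestamp(sent_at)
print(element)
```

This keeps the exact time out of the id attribute, so two messages sent in the same second no longer collide.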
You can get more information about the time tag at HTML5 Doctor and W3C:
The time tag on HTML5 offers a new element for unambiguously encoding dates and times for machines while still displaying them in a human-readable way.
The time element represents either a time on a 24 hour clock, or a precise date in the proleptic Gregorian calendar, optionally with a time and a time-zone offset.
...
I agree with the ordered list (ol) solution posted by @Robotsu, except for the time tag I just posted and the incorrect address inside the cite tag!
