I'm currently practicing web scraping using the NYT Best Sellers website. I want to get the title of the #1 book on the list and found the HTML element:
<div class="book-body">
<p class="freshness">12 weeks on the list</p>
<h3 class="title" itemprop="name">CRAZY RICH ASIANS</h3>
<p class="author" itemprop="author">by Kevin Kwan</p>
<p itemprop="description" class="description">A New Yorker gets a surprise when she spends the summer with her boyfriend in Singapore.</p>
</div>
I'm using the following code to grab the specific text:
doc.css(".title").text
However, it returns the titles of every book on the list. How would I go about getting just the specific book title, "CRAZY RICH ASIANS"?
If you look at the return value of doc.css(".title"), you will see that it is a collection of all the titles, as Nokogiri::XML::Element objects.
CSS, to my knowledge, does not have a selector for targeting the first element of a given class (someone may certainly correct me if I am wrong), but getting just the first element from a Nokogiri::XML::NodeSet is still very simple, as it acts like an Array in many cases. For example:
doc.css(".title")[0].text
You could also use xpath to select just the first one (since XPath does support index based selection) like so:
doc.xpath("(//h3[@class='title'])[1]").text
Please Note:
Ruby indexes start at 0 as in the first example;
XPath indexes start at 1 as in the second example.
What's the difference between a <seg> in XML and <span> in HTML? Here are two passages from Bibles, one from the English Bible in Christodouloupoulos' and Steedman's massively parallel Bible corpus,
<?xml version="1.0" ?>
<cesDoc version="4">
…
<text>
<body id="Bible" lang="en">
<div id="b.GEN" type="book">
<div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">
In the beginning God created the heaven and the earth.
</seg>
<seg id="b.GEN.1.2" type="verse">
And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
</seg>
…
and the other from the NIV English Bible at Bible Gateway, which is where they got most of their texts from:
<p class="chapter-1">
<span id="en-NIV-27932" class="text Rom-1-1">
<span class="chapternum">1 </span>
Paul, a servant of Christ Jesus, called to be an apostle and set apart for the gospel of God—
</span>
<span id="en-NIV-27933" class="text Rom-1-2">
<sup class="versenum">2 </sup>the gospel he promised beforehand through his prophets in the Holy Scriptures
</span>
…
In the HTML, it seems a <span> can replace a <seg>, except that the HTML has added verse numbers in <span>. Oh, and the chapters are in <div>. So it's not one-to-one.
Of course, I realize that HTML and XML are different, and this is only one juxtaposition; I'm sure there are others out there. But I'm going to need to be able to display XML as HTML, and I don't want to anger the doctype gods. So, conceptually, how is <seg> different from <span> in purpose, meaning and usage?
Update: @jim-garrison says I'm going to need to read the schema to understand the XML, but I'm a neophyte at that, too. In particular, I did find some official-looking documentation for <seg> by TEI that makes me think its use is a little more than arbitrary, but I have no idea how to interpret this documentation. Should it give us a more specific answer than what Jim has already written?
The difference between XML and HTML generally is that the list of tags that can be present in XML is defined by a DTD or XML Schema, and tags represent document semantics and not presentation. So tags can be named anything. In HTML the set of tags is generally predefined, as if there was a pre-existing HTML DTD or schema, but HTML is not XML and doesn't follow all the rules of XML. While HTML was in some sense derived from the same parent as XML (SGML), and the two are superficially very similar, they are most definitely NOT the same thing.
The answer to your specific question is that the writers of the XML chose to use a tag named <seg> ("segment"?) to represent generalized strings of text, with attributes providing additional semantic information. For more details you'll need to find the DTD or XML schema that governs the content of the XML and read the documentation that goes with it.
But I'm going to need to be able to display XML as HTML, and I don't want to anger the doctype gods. So, conceptually, how is <seg> different from <span> in purpose, meaning and usage?
This is where you will use XSLT to transform the input XML into valid HTML. To figure out how to do that transformation, you will need to know the full semantics of all the tags that can appear (again, go to the documentation for the DTD/schema) and decide on a visual representation for the data. There's no one answer to "how should a <seg> be transformed?" That's up to your requirements regarding presentation. One possible transformation converts <seg> tags to <span>, but that may depend on the value of certain attributes (type="verse" vs. some other type). It might even differ depending on the output medium (desktop vs. tablet vs. phone vs. watch vs. ...?).
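As a rough illustration of the <seg>-to-<span> mapping described above, here is a minimal sketch. It is not XSLT: it uses Python's standard xml.etree.ElementTree module to do the same renaming in plain code. It assumes (my choice, not something the corpus documentation prescribes) that the type attribute should become an HTML class. The sample input is a trimmed copy of the excerpt from the question, and seg_to_span is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the corpus excerpt from the question.
SAMPLE_XML = """<body id="Bible" lang="en">
  <div id="b.GEN" type="book">
    <div id="b.GEN.1" type="chapter">
      <seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
      <seg id="b.GEN.1.2" type="verse">And the earth was without form, and void; and darkness was upon the face of the deep.</seg>
    </div>
  </div>
</body>"""


def seg_to_span(xml_text):
    """Rename every <seg> to <span> and move its type into a class attribute.

    Assumption: type="verse" maps to class="verse". Other elements and
    attributes are left untouched, so the result is not yet valid HTML.
    """
    root = ET.fromstring(xml_text)
    for seg in root.iter("seg"):
        seg.tag = "span"
        seg_type = seg.attrib.pop("type", None)
        if seg_type is not None:
            seg.set("class", seg_type)
    return root


html_root = seg_to_span(SAMPLE_XML)
spans = html_root.findall(".//span")
print(ET.tostring(html_root, encoding="unicode"))
```

A real conversion would also have to decide what to do with the <div type="..."> wrappers and with any other seg types the schema allows, which is exactly where the schema documentation comes in.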
Once you convert from XML to HTML you have left the realm of the Doctype gods and they have no interest in what you do :-) There's a whole different set of deities such as CSS-Cthulhu, Javascript-Janai'ngo (look it up), et al who will take great pleasure making your life miserable.
I'm working on a small app that requires me to parse an html site on the web.
My problem is as follows :
The parsing routine works fine for some of the information, BUT I have been searching for hours for a way to get some information that refuses to appear.
Here is the partial structure of the code I want to parse:
<body>
<header>
<nav>
<div.....>
<aside......>
<main>
<div .....>
<a ......>
<a ......>
</div>
.
.
.
<div id="general">
<h2> ........</h2>
<p>
<span class="label">text</span>
"text 2 to be parsed"
<br>
<span class="label">other text</span>
"text 3 to be parsed"
<br>
This is just an example of the structure; to be precise, the URL is http://www.ourairports.com/airports/EBBR/pilot-info.html
OK, it seems that the HTML code is not appearing in the preview, so in the source code of the page above, when you see [div id="general"], below it you have a [p] followed by [span class="label"]some text[/span], and just below that you have text between brackets. This happens on several lines, and I need to catch that information.
I've tried //body/div/main/div[@id='general']/p as the XPath query string, but the result is 1 node, and it is empty;
also div[@id='general'], but the result is no node found;
with div[@id='general']/p/span, the result is no node found;
with //div/p/span[@class='label'], the results are the titles between the <span> and </span> tags, but I'm looking to retrieve the text between quotes just after them, and I cannot figure out how to succeed. I think I've tried all combinations (many others than those explained above), but no luck. Is there a special path to get to this text?
Thanks for your advice.
By the way, this is my very first post on stackoverflow.com and my first language is French, so I apologize in advance for any rule not followed or for my bad English.
Enjoy your day, evening, ... night on the keyboard.
Alain
Your first expression, //body/div/main/div[@id='general']/p, is expected to return a single node, the <p>, and it works exactly that way on the referenced website, as you observed. The expression reaches down to that node but no deeper, where the text is nested. You should still get the text, though, just wrapped in HTML with the tags around it: a good XPath selector API, used properly, should return the HTML node that was matched, including the <p> tag itself.
If all you see in the end is just the text nodes try the following:
Think of the text among the <span>s as html nodes, text() nodes.
//div[@id='general']/p/text()
This will match the "text to be parsed".
A node() will match any node (even text among tags), while a * matches only element nodes, never text() nodes.
For any number of steps, use the double slash:
//div[@id='general']/p//text()
Now you match every text node under the <p> tag, regardless of the nesting level. And since text nodes are by definition leaf nodes (cannot contain other nodes), this guarantees that you will not match members of the same path down the tree more than once.
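To see that the values really are separate text nodes sitting between the <span> elements, here is a small sketch in Python. It assumes a simplified, well-formed copy of the structure from the question (the real page is HTML and would need an HTML parser). It uses the standard xml.etree.ElementTree module, which supports only a subset of XPath and has no text() step; instead, it exposes the text that follows an element as that element's tail, which is a different route to the same nodes.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page structure in the question.
# "heading" is a placeholder for the elided <h2> content.
SNIPPET = """<main>
  <div id="general">
    <h2>heading</h2>
    <p>
      <span class="label">text</span>
      text 2 to be parsed
      <br/>
      <span class="label">other text</span>
      text 3 to be parsed
      <br/>
    </p>
  </div>
</main>"""

root = ET.fromstring(SNIPPET)

# ElementTree's limited XPath is enough to reach the <p> by the div's id.
paragraph = root.find(".//div[@id='general']/p")

pairs = []
for span in paragraph.findall("span[@class='label']"):
    # The text node right after </span> is stored as the span's tail.
    value = (span.tail or "").strip()
    pairs.append((span.text, value))

print(pairs)
```

This prints [('text', 'text 2 to be parsed'), ('other text', 'text 3 to be parsed')], pairing each label with the text node that follows it. With a library that implements full XPath, the text() expressions above would select those same nodes directly.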
Some comments on your expressions:
//body is superfluous; there is only one body, and HTML defines exactly where it is.
Nodes identified by @id should not need to be preceded by selectors for their parents; start with //div[@id='something unique'].
Learn more about XPath. An API that properly returns selected "nodes" and not just concatenated text can play an important role in the understanding of how the expressions work in practice.
How To (Semantically) Mark Up A (Theatre) Script / Play in HTML5?
For obvious reasons, it's hard to search for "play" and "script" without a search engine thinking you mean "play a sound" and "JavaScript".
How can I mark up a script (as in the document one would give to actors in a play) such that it is semantically correct, and easy to style?
For example, let's take the start of Hamlet
Hamlet
ACT I
SCENE I Elsinore. A platform before the castle.
[FRANCISCO at his post. Enter to him BERNARDO]
BERNARDO Who's there?
FRANCISCO Nay, answer me: stand, and unfold yourself.
Fairly obviously, I think, one should start with
<h1 id="title">Hamlet</h1>
<h2 id="act-1">Act 1</h2>
<h3 id="scene-1">Scene 1</h3>
But, then I get stuck.
I've tried looking at MicroData, but Schema.org's CreativeWork[0] really doesn't contain much that would be useful in the case of a work of fiction.
Is it enough just to say
<p class="stage-direction">FRANCISCO at his post. Enter to him BERNARDO</p>
<p id="1"><span class="character bernardo">BERNARDO</span>Who's there?</p>
<p id="2"><span class="character francisco">FRANCISCO</span>Nay, answer me: stand, and unfold yourself.</p>
Or is there a better / more sensible way of doing things?
[0]http://schema.org/CreativeWork
It seems that the idea of precisely specifying markup for dialogue has been abandoned, and the W3C now simply offers some guidelines which pretty much equate to your idea of using paragraphs and spans.
Note that the dl element, which older sources - including the spec - had formerly recommended, should now definitely not be used: "The dl element is inappropriate for marking up dialogue".
But of course all this might change next week, or month, or year…
Does this provide any inspiration? caesar in xml
I need a regular expression in Notepad++ to find a search query that is NOT inside an anchor (hyperlink) tag in an HTML file. That is, it should find any given search query (a word or a couple of words, like "test" or "Ask a question") that is not linked, and ignore the occurrences that are linked.
Note that the links may not be direct: the tags may not sit directly before and after the query, and a link may span more than one line.
Example:
<p>any text here, something else..</p>
<p>more
test
to find through other test. With much
<a href="http://www.site.com/folder/filename45.html">
<font color="#800000">Ask a question</font></a> more test</p>
<p>and test to Ask a question here.</p>
There is no perfect solution with regular expressions. It would be better to do this with a programming language and a DOM parser.
Here is about the best you can get:
test(?!((?!<a\W).)*</a)
It uses two negative lookaheads to match test only if there is no </a before the next opening <a. Make sure to check the ". matches newline" option and to update to Notepad++ 6.
This will start to fail if you have <a or </a in comments or within attribute strings, not to mention invalid HTML.
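The same pattern can be tried outside Notepad++. The sketch below uses Python's re module, where the re.DOTALL flag plays the role of the ". matches newline" option. find_unlinked is a hypothetical helper name, and the sample text is the example from the question. The caveats above about comments, attribute strings and invalid HTML apply here just the same.

```python
import re

# Example markup from the question.
SAMPLE = """<p>any text here, something else..</p>
<p>more
test
to find through other test. With much
<a href="http://www.site.com/folder/filename45.html">
<font color="#800000">Ask a question</font></a> more test</p>
<p>and test to Ask a question here.</p>"""


def find_unlinked(query, html):
    """Return start offsets of `query` occurrences not followed by a closing
    </a before the next opening <a (i.e. occurrences that look unlinked)."""
    pattern = re.compile(
        re.escape(query) + r"(?!((?!<a\W).)*</a)",
        re.DOTALL,  # let "." cross line breaks, like ". matches newline"
    )
    return [match.start() for match in pattern.finditer(html)]


linked_query_hits = find_unlinked("Ask a question", SAMPLE)
plain_query_hits = find_unlinked("test", SAMPLE)
print(len(linked_query_hits), len(plain_query_hits))
```

For this sample, "Ask a question" is found once (the occurrence in the last paragraph; the one inside the link is skipped), and all four occurrences of "test" are found, since none of them is linked.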
I've been tasked with getting all the SMS updates from this page and putting them into a JSON feed using Yahoo Pipes. I'm not entirely sure how I would get each update, as they are not individual elements, but just a collection of title, etc. Any shared wisdom would be much appreciated!
<h1 id="blogtitle">SMS Update</h1>
<div class="blogposttime blogdetail">Left at 2nd January 2010 at 01:12</div>
<div class="blogcategories blogdetail">Recieved by SMS (Location: Pokhara - Nepal)</div>
<p class="blogpostmessage">
RACE DAY! We took the extra day off to pimp the rick some more, including a huge Australian flag. Quiet night at a pub with 6 other teams. Time for brekkie and then we're off to the rickshaw grounds for 8:30 for 10am start.
</p>
That seems a fairly easy job for a DOM/XML parser.
Since the blocks are not enclosed in XML tags you could look for elements that are present in each block, for example the <h1 id="blogtitle">SMS Update</h1> defines the start of a new block.
Use your DOM parser to look for all the elements with id blogtitle. At this point you can use a DOM function to reference the nextSibling of the blogtitle element. All you need is the 3 siblings after the blogtitle element.
With a little work you can easily use this logic to build your JSON object.
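The sibling-walking logic above can be sketched as follows. This is only an illustration in Python with the standard xml.dom.minidom module, not something Yahoo Pipes itself runs. It assumes the markup has been wrapped in a single root element and is well-formed (real pages usually need an HTML parser), and the JSON field names (title, time, category, message) are my own hypothetical choices.

```python
import json
from xml.dom import minidom

# The block from the question, wrapped in a root <div> so it parses as XML.
SAMPLE = """<div>
<h1 id="blogtitle">SMS Update</h1>
<div class="blogposttime blogdetail">Left at 2nd January 2010 at 01:12</div>
<div class="blogcategories blogdetail">Recieved by SMS (Location: Pokhara - Nepal)</div>
<p class="blogpostmessage">
RACE DAY! We took the extra day off to pimp the rick some more, including a huge Australian flag. Quiet night at a pub with 6 other teams. Time for brekkie and then we're off to the rickshaw grounds for 8:30 for 10am start.
</p>
</div>"""


def next_elements(node, count):
    """Return up to `count` element siblings after `node`, skipping
    whitespace text nodes and stopping at the next <h1>."""
    found = []
    sibling = node.nextSibling
    while sibling is not None and len(found) < count:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            if sibling.tagName == "h1":
                break
            found.append(sibling)
        sibling = sibling.nextSibling
    return found


def element_text(element):
    """Join the text nodes directly inside `element` and trim whitespace."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def extract_updates(markup):
    """Build one dict per blogtitle heading from its three following siblings."""
    document = minidom.parseString(markup)
    updates = []
    for heading in document.getElementsByTagName("h1"):
        if heading.getAttribute("id") != "blogtitle":
            continue
        siblings = next_elements(heading, 3)
        if len(siblings) < 3:
            continue  # incomplete block; skip it
        posted, category, message = siblings
        updates.append({
            "title": element_text(heading),
            "time": element_text(posted),
            "category": element_text(category),
            "message": element_text(message),
        })
    return updates


updates = extract_updates(SAMPLE)
print(json.dumps(updates, indent=2))
```

Filtering on the tag name plus the id attribute, rather than calling getElementById, is deliberate: the same id is repeated for every update on the page, so it cannot be treated as unique.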