How to extract text as well as hyperlink text in Scrapy?

I want to extract text from the following HTML:
<li>
<a test="test" href="abc.html" id="11">Click Here</a>
"for further reference"
</li>
I'm trying with the following extract command:
response.css("article div#section-2 li::text").extract()
But it only gives the "for further reference" line.
The expected output is "Click Here for further reference" as one string.
How can I do this?
How should I modify it to handle the following patterns as well:
Text Hyperlink Text
Hyperlink Text
Text Hyperlink

There are at least a couple of ways to do that:
Let's first build a test selector that mimics your response:
>>> response = scrapy.Selector(text="""<li>
... <a test="test" href="abc.html" id="11">Click Here</a>
... "for further reference"
... </li>""")
First option, with a minor change to your CSS selector: look at all text descendants, not only text children (notice the space between li and the ::text pseudo-element):
# this is your CSS select,
# which only gives direct children text of your selected LI
>>> response.css("li::text").extract()
[u'\n ', u'\n "for further reference"\n']
# notice the extra space
# here
# |
# v
>>> response.css("li ::text").extract()
[u'\n ', u'Click Here', u'\n "for further reference"\n']
# using Python's join() to concatenate and build the full sentence
>>> ''.join(response.css("li ::text").extract())
u'\n Click Here\n "for further reference"\n'
Another option is to chain your .css() call with XPath 1.0 string() or normalize-space() inside a subsequent .xpath() call:
>>> response.css("li").xpath('string()').extract()
[u'\n Click Here\n "for further reference"\n']
>>> response.css("li").xpath('normalize-space()').extract()
[u'Click Here "for further reference"']
# calling `.extract_first()` gives you a string directly, not a list of 1 string
>>> response.css("li").xpath('normalize-space()').extract_first()
u'Click Here "for further reference"'
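The same descendant-text idea can be reproduced outside Scrapy with only the standard library; here's a minimal sketch using xml.etree.ElementTree, where itertext() plays the role of "li ::text" (plain Python, not the Scrapy API):

```python
import xml.etree.ElementTree as ET

# the same snippet the selectors above were built from
li = ET.fromstring(
    '<li><a test="test" href="abc.html" id="11">Click Here</a>'
    ' "for further reference"</li>')

# itertext() yields every descendant text node, like "li ::text"
parts = list(li.itertext())
print(parts)      # ['Click Here', ' "for further reference"']

# join and collapse whitespace, much like normalize-space()
sentence = ' '.join(' '.join(parts).split())
print(sentence)   # Click Here "for further reference"
```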

If you use XPath, the selectors would be:
response.xpath('//article/div[@id="section-2"]/li/a/text()').extract()  # gives the text of the hyperlink >> "Click Here"
response.xpath('//article/div[@id="section-2"]/li/a/@href').extract()  # gives the link of the hyperlink >> "abc.html"
response.xpath('//article/div[@id="section-2"]/li/text()').extract()  # gives the text of the li >> "for further reference"
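This per-part extraction (link text, href, trailing text) has a direct standard-library analogue; a small illustrative sketch with xml.etree.ElementTree (find() and get() here are ElementTree calls, not Scrapy ones):

```python
import xml.etree.ElementTree as ET

li = ET.fromstring(
    '<li><a test="test" href="abc.html" id="11">Click Here</a>'
    ' "for further reference"</li>')

a = li.find('a')
link_text = a.text          # text of the hyperlink
link_href = a.get('href')   # value of the href attribute
trailing = a.tail.strip()   # text that follows the link inside the li

print(link_text)   # Click Here
print(link_href)   # abc.html
print(trailing)    # "for further reference"
```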

Related

How do I get all the text under a tag with Nokogiri?

In this example I am trying to get the text from within the <td> tag of a table. First, the HTML code:
<table>
<tbody>
<tr>
<td>Single line of text</td>
</tr>
<tr>
<td>Text here<p>First line</p><p>Second line</p></td>
</tr>
</tbody>
</table>
Then the Ruby code:
require 'nokogiri'
require 'pp'
html = File.open('test.html').read
doc = Nokogiri::HTML(html)
rows = doc.xpath('//table[1]/tbody/tr')
data = rows.collect do |row|
  row.at_xpath('td[1]/text()').to_s
end
pp data
And the result I get is:
["Single line of text", "Text here"]
How can I get all of the text in the second <td> tag?
There are two changes you will need to make to get all the text nodes. First at_xpath will only ever return a single node, so to get multiple nodes you’ll need to use xpath.
Second, to get all descendant nodes, not just child nodes, use // instead of /.
Combining these, the line of code would be:
row.xpath('td[1]//text()').to_s
This will concatenate all the text nodes together, giving the result:
["Single line of text", "Text hereFirst lineSecond line"]
which may not be what you want. Rather than just call to_s on the resulting nodeset you will need to process to fit your needs.
How about this?
pp doc.search("//tr[2]//td//text()").map { |item| item.text }
As matt says, you can get all descendants using //.
You can also index the second tr if you want that one specifically. Just leave out the indexing to get all the trs.
And you can filter the resulting text objects to get only those that have a td upstream.
Finally, map over each Nokogiri object, plucking out the text into the final array, which looks like this:
["Text here", "First line", "Second line"]
You want the text method of Nokogiri::XML::Node if you want to get all the text for any element:
p doc.xpath('//table[1]/tbody/tr').map{ |tr| tr.text.strip }
#=> ["Single line of text", "Text hereFirst lineSecond line"]
(The strip method just gets rid of leading and trailing whitespace.)
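For comparison, the same distinction between child and descendant text nodes exists in Python's xml.etree.ElementTree: .text stops at the first child element, while itertext() walks the whole subtree. A stdlib sketch of the td[1]/text() vs. td[1]//text() difference (illustrative Python, not Nokogiri):

```python
import xml.etree.ElementTree as ET

table = ET.fromstring(
    '<table><tbody>'
    '<tr><td>Single line of text</td></tr>'
    '<tr><td>Text here<p>First line</p><p>Second line</p></td></tr>'
    '</tbody></table>')

# .text alone stops at the first child element, like td[1]/text()
shallow = [row.find('td').text for row in table.iter('tr')]
print(shallow)  # ['Single line of text', 'Text here']

# itertext() gathers all descendant text, like td[1]//text()
deep = [''.join(row.find('td').itertext()) for row in table.iter('tr')]
print(deep)     # ['Single line of text', 'Text hereFirst lineSecond line']
```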

How to get html tag text using XMLSlurper in Groovy

I am trying to modify HTML in Groovy, which I parsed using XmlSlurper. The problem is that I need to edit the text of a certain tag which contains both text and child tags. The HTML looks like this:
<ul><li>Text to modify<span>more text</span></li></ul>
In groovy i am trying this code:
def ulDOM = new XmlSlurper().parseText(ul);
def elements = ulDOM.li.findAll{
    it.text().equals("text i am looking for");
}
The problem is that I get an empty array in 'elements', because it.text() returns the text of the 'it' node together with the text nodes of the whole DOM subtree, in this case "Text to modifymore text". Note that the contains() method is not enough for my solution.
My question is: how do I get the exact text of a certain tag, and not the text of the whole DOM subtree?
.text() evaluates the children and appends their text, so it will always include the merged string.
Could you consider localText()? It is not exactly what you expect, as it returns an array of strings.
import org.testng.Assert

ul = '''<ul>
<li>Text to modify<span>more text</span>
</li>
</ul>'''
def ulDOM = new XmlSlurper().parseText(ul);
def elements = ulDOM.li.findAll{
    String[] text = it.localText();
    text[0].equals("Text to modify");
}
Assert.assertTrue(elements.size() == 1)
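The localText() distinction has a close analogue in Python's ElementTree, where .text holds only the text before the first child element; a stdlib sketch for comparison (Python, not Groovy):

```python
import xml.etree.ElementTree as ET

li = ET.fromstring('<li>Text to modify<span>more text</span></li>')

direct = li.text                 # like localText()[0]: no descendant text
merged = ''.join(li.itertext())  # like text(): the whole subtree merged

print(direct)   # Text to modify
print(merged)   # Text to modifymore text
```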

How to parse HTML tags as raw text using ElementTree

I have a file that has HTML within XML tags and I want that HTML as raw text, rather than have it be parsed as children of the XML tag. Here's an example:
import xml.etree.ElementTree as ET
root = ET.fromstring("<root><text><p>This is some text that I want to read</p></text></root>")
If I try:
root.find('text').text
it returns no output, but root.find('text/p').text will return the paragraph text without the tags. I want everything within the text tag as raw text, but I can't figure out how to get this.
Your solution is reasonable. An element object acts as a list of its children; the .text attribute of an element holds only the things (usually text) that are not part of other (nested) elements.
There is one thing to improve in your code, though. In Python, string concatenation is an expensive operation. It is better to build a list of substrings and join them later, like this:
output_lst = []
for child in root.find('text'):
    output_lst.append(ET.tostring(child, encoding="unicode"))
output_text = ''.join(output_lst)
The list can also be built using a Python list comprehension, so the code becomes:
output_lst = [ET.tostring(child, encoding="unicode") for child in root.find('text')]
output_text = ''.join(output_lst)
.join() can consume any iterable that produces strings, so the list need not be constructed in advance. Instead, a generator expression (which is what appears inside the [] of the list comprehension) can be used:
output_text = ''.join(ET.tostring(child, encoding="unicode") for child in root.find('text'))
The one-liner can be formatted to more lines to make it more readable:
output_text = ''.join(ET.tostring(child, encoding="unicode")
                      for child in root.find('text'))
I was able to get what I wanted by appending all child elements of my text tag to a string using ET.tostring:
output_text = ""
for child in root.find('text'):
    output_text += ET.tostring(child, encoding="unicode")
>>> output_text
'<p>This is some text that I want to read</p>'
The above solutions will miss the initial part of your HTML if the content begins with text, e.g.:
<root><text>This is <i>some text</i> that I want to read</text></root>
You can do this instead:
node = root.find('text')
output_list = [node.text] if node.text else []
output_list += [ET.tostring(child, encoding="unicode") for child in node]
output_text = ''.join(output_list)
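Putting that fix together as a complete runnable snippet (same names as above, using the sample input from this answer):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<root><text>This is <i>some text</i> that I want to read</text></root>')

node = root.find('text')
# keep the leading text that precedes the first child element
output_list = [node.text] if node.text else []
# ET.tostring serializes each child *including* its tail text
output_list += [ET.tostring(child, encoding="unicode") for child in node]
output_text = ''.join(output_list)

print(output_text)  # This is <i>some text</i> that I want to read
```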

XPath - not match element with certain ancestor

I would like to count the text of all descendant elements which do not have a link as an ancestor.
//*[string-length(normalize-space(//*[not(ancestor::a)])) > 10]
Which if tested on this structure would return [Get This Text]
<b>
ignore
<a>ignore</a>
Get This Text
</b>
It's not really clear what you mean by "count the text" but the following expression returns all elements that don't have a link as ancestor and whose normalized string value is longer than 10 characters:
//*[not(ancestor::a) and string-length(normalize-space()) > 10]
Since you want the expression to return the string 'Get this text', maybe you want select text nodes, not elements:
//text()[not(ancestor::a) and string-length(normalize-space()) > 10]
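ElementTree has no full XPath 1.0 engine, so the expression above won't run there as-is, but the same filter can be sketched in plain Python. A rough stdlib equivalent (the helper name is made up for illustration):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<b>ignore<a>ignore</a>Get This Text</b>')

def text_nodes_outside_links(el, under_a=False):
    """Yield text nodes that have no <a> ancestor."""
    here = under_a or el.tag == 'a'
    if el.text and not here:
        yield el.text
    for child in el:
        yield from text_nodes_outside_links(child, here)
        # a child's tail belongs to the parent, so test the parent's flag
        if child.tail and not here:
            yield child.tail

# keep only text whose normalized length exceeds 10, as in the XPath
long_texts = [t for t in text_nodes_outside_links(root)
              if len(t.strip()) > 10]
print(long_texts)  # ['Get This Text']
```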

Is Pandoc capable of injecting arbitrary HTML attributes to any elements?

So code blocks can define HTML attributes using the fenced_code_blocks extension:
~~~~ {#mycode .haskell .numberLines startFrom="100"}
qsort [] = []
qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++
               qsort (filter (>= x) xs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Is it possible to use the above syntax, in some way, for regular text blocks? For example, I'd like to convert the following Markdown text:
# My header
~~~ {.text}
This is regular text. This is regular text.
~~~
~~~ {.quote}
> This is the first level of quoting.
>
> > This is nested blockquote.
>
> Back to the first level.
~~~
~~~ {data-id=test-123}
+ Red
+ Green
+ Blue
~~~
into something like this:
<h1 id="my-header">My header</h1>
<p class="text">This is regular text. This is regular text.</p>
<blockquote class="quote">
<p>This is the first level of quoting.</p>
<blockquote>
<p>This is nested blockquote.</p>
</blockquote>
<p>Back to the first level.</p>
</blockquote>
<ul data-id="test-123">
<li>Red</li>
<li>Green</li>
<li>Blue</li>
</ul>
If there is no such support in Pandoc itself, would it be possible to create a custom writer in Lua that does so?
Edit: Looking at the sample.lua custom writer, does anyone know what the "attributes table" is on line 35? And how does one pass these attributes to specific Pandoc elements? Also, the functionality I'm looking for above is very similar to the header_attributes extension, except it would work for all elements, not just headers.
Pandoc's filters let you operate on Pandoc's internal representation of the document. It's possible to have a chain of filters that do different transformations. I'll share two illustrative examples of filters that should help.
Markdown Code Blocks
Code blocks in Pandoc are usually meant to embed source code listings from programming languages, but here we're trying to extract the body and interpret it as markdown. Rather than using classes from your input document like text and quote, let's use a generic as-markdown class. Pandoc will generate the appropriate tags automatically.
# My header
~~~ {.as-markdown}
This is regular text. This is regular text.
~~~
~~~ {.as-markdown}
> This is the first level of quoting.
>
> > This is nested blockquote.
>
> Back to the first level.
~~~
~~~ {.as-markdown data-id=test-123}
+ Red
+ Green
+ Blue
~~~
~~~ haskell
main :: IO ()
~~~
To ensure code blocks without the as-markdown class are interpreted as usual, I included a haskell code block. Here's the filter implementation:
#!/usr/bin/env runhaskell
import Text.Pandoc.Definition (Pandoc(..), Block(..), Format(..))
import Text.Pandoc.Error (handleError)
import Text.Pandoc.JSON (toJSONFilter)
import Text.Pandoc.Options (def)
import Text.Pandoc.Readers.Markdown (readMarkdown)
asMarkdown :: String -> [Block]
asMarkdown contents =
  case handleError $ readMarkdown def contents of
    Pandoc _ blocks -> blocks

-- | Unwrap each CodeBlock with the "as-markdown" class, interpreting
-- its contents as Markdown.
markdownCodeBlock :: Maybe Format -> Block -> IO [Block]
markdownCodeBlock _ cb@(CodeBlock (_id, classes, _namevals) contents) =
  if "as-markdown" `elem` classes then
    return $ asMarkdown contents
  else
    return [cb]
markdownCodeBlock _ x = return [x]
main :: IO ()
main = toJSONFilter markdownCodeBlock
Running pandoc --filter markdown-code-block.hs index.md produces:
<h1 id="my-header">My header</h1>
<p>This is regular text. This is regular text.</p>
<blockquote>
<p>This is the first level of quoting.</p>
<blockquote>
<p>This is nested blockquote.</p>
</blockquote>
<p>Back to the first level.</p>
</blockquote>
<ul>
<li>Red</li>
<li>Green</li>
<li>Blue</li>
</ul>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">main ::</span> <span class="dt">IO</span> ()</code></pre></div>
Almost there! The only part that's not quite right is the HTML attributes.
Custom HTML Attributes from Code Block Metadata
The following filter should help you get started. It converts code blocks with the web-script class to an HTML <script> tag when the target format is html or html5.
#!/usr/bin/env runhaskell
import Text.Pandoc.Builder
import Text.Pandoc.JSON
webFormats :: [String]
webFormats =
  [ "html"
  , "html5"
  ]

script :: String -> Block
script src = Para $ toList $ rawInline "html"
  ("<script type='application/javascript'>" <> src <> "</script>")

injectScript :: Maybe Format -> Block -> IO Block
injectScript (Just (Format format)) cb@(CodeBlock (_id, classes, _namevals) contents) =
  if "web-script" `elem` classes then
    if format `elem` webFormats then
      return $ script contents
    else
      return Null
  else
    return cb
injectScript _ x = return x

main :: IO ()
main = toJSONFilter injectScript
The data-id=test-123 in your last block would come through in _namevals, a list of key-value pairs of type [(String, String)]. All you'd need to do is refactor script to support arbitrary tags and key-value pairs for HTML attributes, and specify what HTML to generate based on those inputs. To see the native representation of the input document, run pandoc -t native index.md.
[Header 1 ("my-header",[],[]) [Str "My",Space,Str "header"]
,CodeBlock ("",["as-markdown"],[]) "This is regular text. This is regular text."
,CodeBlock ("",["as-markdown"],[]) "> This is the first level of quoting.\n>\n> > This is nested blockquote.\n>\n> Back to the first level."
,CodeBlock ("",["as-markdown"],[("data-id","test-123")]) "+ Red\n+ Green\n+ Blue"
,Para [Str "To",Space,Str "ensure",Space,Str "regular",Space,Str "code",Space,Str "blocks",Space,Str "work",Space,Str "as",Space,Str "usual."]
,CodeBlock ("",["haskell"],[]) "main :: IO ()"]
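As a side note, the Attr triple shown in the native output above (id, classes, key-value pairs) has the same shape in pandoc's JSON serialization, so the attribute-to-HTML plumbing can be sketched in any language. A minimal Python illustration over a hand-written AST fragment (the node layout follows pandoc's JSON format; this is not a complete filter):

```python
# a CodeBlock node as pandoc's JSON AST represents it:
# "c" holds [Attr, contents], and Attr is [id, classes, key-value pairs]
code_block = {
    "t": "CodeBlock",
    "c": [["", ["as-markdown"], [["data-id", "test-123"]]],
          "+ Red\n+ Green\n+ Blue"],
}

def attrs_to_html(namevals):
    # render [["data-id", "test-123"]] as ' data-id="test-123"'
    return ''.join(f' {k}="{v}"' for k, v in namevals)

_id, classes, namevals = code_block["c"][0]
opening_tag = f'<ul{attrs_to_html(namevals)}>'
print(opening_tag)  # <ul data-id="test-123">
```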
If you'd like to play around with either of these examples, they're both in my pandoc-experiments repository.
This is very doable in kramdown, which will convert the following input
# My header
This is regular text. This is regular text.
{: .text}
> This is the first level of quoting.
>
> > This is nested blockquote.
>
> Back to the first level.
{: .quote}
+ Red
+ Green
+ Blue
{: data-id="test-123"}
to
<h1 id="my-header">My header</h1>
<p class="text">This is regular text. This is regular text.</p>
<blockquote class="quote">
<p>This is the first level of quoting.</p>
<blockquote>
<p>This is nested blockquote.</p>
</blockquote>
<p>Back to the first level.</p>
</blockquote>
<ul data-id="test-123">
<li>Red</li>
<li>Green</li>
<li>Blue</li>
</ul>
See the attribute list definition section of the syntax for details.