On scraping a site, I have an HTML like this:
<div class="classA classB classC">
<div class="classD classE">
<h1 class="classF classD">Text I don't want</h1>
<ul>....</ul> <!-- containing more text in nested children, don't want -->
</div>
Text I want to grab.
<br>
More text I want to grab
</div>
Here, how can I select only the text I want to grab, i.e. ["Text I want to grab", "More text I want to grab"], and avoid selecting "Text I don't want"? I am trying to select it with a CSS selector like this:
text = response.css('.classA:not(.classD) *::text').getall()
Does anyone know what to do in this case? I am not familiar with XPath, but please do suggest a solution using it if you have one.
You are close to your goal. You want to exclude <h1 class="classF classD">Text I don't want</h1> using :not, which is correct, but you first have to select the portion of HTML that contains your desired output, i.e. <div class="classA classB classC">, and then exclude what you don't want. So the CSS expression should be:
response.css('div.classA.classB.classC:not(.classF)::text').getall()
OR
' '.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Proven by scrapy shell:
In [1]: from scrapy.selector import Selector
In [2]: %paste
html='''
<div class="classA classB classC">
<div class="classD classE">
<h1 class="classF classD">Text I don't want</h1>
<ul>....</ul> <!-- containing more text in nested children, don't want -->
</div>
Text I want to grab.
<br>
More text I want to grab
</div>
'''
## -- End pasted text --
In [3]: resp=Selector(text=html)
In [4]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).strip()
Out[4]: 'Text I want to grab.\n \n More text I want to grab'
In [5]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).replace('\n','').strip()
Out[5]: 'Text I want to grab. More text I want to grab'
In [6]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).strip().replace('\n','').strip()
Out[6]: 'Text I want to grab. More text I want to grab'
In [7]: [x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()]
Out[7]: ['', 'Text I want to grab.', 'More text I want to grab']
In [8]: ''.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[8]: 'Text I want to grab.More text I want to grab'
In [9]: ''.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[9]: 'Text I want to grab.More text I want to grab'
In [10]: ' '.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[10]: ' Text I want to grab. More text I want to grab'
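Since you mentioned not being familiar with XPath: an XPath alternative is to take only the text nodes that are direct children of the outer div, which skips everything inside the nested div automatically. A minimal sketch with lxml (the same expression string works in Scrapy as response.xpath(...)):

```python
from lxml import html

doc = html.fromstring('''
<div class="classA classB classC">
<div class="classD classE">
<h1 class="classF classD">Text I don't want</h1>
<ul>....</ul>
</div>
Text I want to grab.
<br>
More text I want to grab
</div>
''')

# /text() (not //text()) matches only direct text-node children,
# so nothing inside <div class="classD classE"> is returned
texts = [t.strip()
         for t in doc.xpath('//div[contains(@class, "classA")]/text()')
         if t.strip()]
print(texts)  # ['Text I want to grab.', 'More text I want to grab']
```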
I'm trying to obtain the nested unnamed div inside:
<div class="pod three columns closable content-sizes clearfix">
The nested unnamed div is also the first div inside the div above (see image).
I have tried the following:
for div in soup.findAll('div', attrs={'class': 'pod three columns closable content-sizes clearfix'}):
    print(div.text)
The length of
soup.findAll('div', attrs={'class': 'pod three columns closable content-sizes clearfix'})
is just one, despite this div having many nested divs, so the for-loop runs only once and prints everything.
I need all the text inside only the first nested div (see image):
Project...
Reference Number...
Other text
Try:
from bs4 import BeautifulSoup
html_doc = """
<div class="pod three columns closable content-sizes clearfix">
<div>
<b>Project: ...</b>
<br>
<br>
<b>Reference Number: ...</b>
<br>
<br>
<b>Other text ...</b>
<br>
<br>
</div>
<div>
Other text I don't want
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(
soup.select_one("div.pod.three.columns > div").get_text(
strip=True, separator="\n"
)
)
Prints:
Project: ...
Reference Number: ...
Other text ...
Or without CSS selector:
print(
soup.find(
"div",
attrs={"class": "pod three columns closable content-sizes clearfix"},
)
.find("div")
.get_text(strip=True, separator="\n")
)
Try this:
result = soup.find('div', class_="pod three columns closable content-sizes clearfix").find("div")
print(result.text)
Output:
Project: .............
Reference Number: ....
Other text ...........
I have text which is in the format below (keeping the tags and removing the text for clarity):
<h2>...</h2>
<p>...</p>
...
<p>...</p>
<h2>...</h2>
<ul>...</ul>
<li> .. </li>
...
<h2>...</h2>
<li> ..</li>
I am trying to use Scrapy to separate/group the text based on the headers. So, as a first step, I need to get 3 groups of data from the above.
from scrapy import Selector
sentence = "above text in the format"
sel = Selector(text = sentence)
# item = sel.xpath("//h2//text()")
item = sel.xpath("//h2/following-sibling::li/ul/p//text()").extract()
I am getting an empty list. Any help appreciated.
I have this solution, made with scrapy
import scrapy
from lxml import etree, html

class TagsSpider(scrapy.Spider):
    name = 'tags'
    start_urls = [
        'https://support.litmos.com/hc/en-us/articles/227739047-Sample-HTML-Header-Code'
    ]

    def parse(self, response):
        for header in response.xpath('//header'):
            with open('test.html', 'a+') as file:
                file.write(
                    etree.tostring(
                        html.fromstring(header.extract()),
                        encoding='unicode',
                        pretty_print=True,
                    )
                )
With this I get the headers and all the content inside them.
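If the goal is simply a group of text per header, a sibling walk is often enough. Here is a sketch with lxml, using stand-in content since the real markup isn't shown (Scrapy selectors can do the same with a following-sibling:: XPath):

```python
from lxml import html

doc = html.fromstring("""
<div>
  <h2>First</h2><p>a</p><p>b</p>
  <h2>Second</h2><ul><li>c</li></ul>
  <h2>Third</h2><p>d</p>
</div>
""")

groups = {}
for h2 in doc.xpath('//h2'):
    texts = []
    # collect following siblings until the next <h2> starts a new group
    for sib in h2.itersiblings():
        if sib.tag == 'h2':
            break
        texts.append(sib.text_content().strip())
    groups[h2.text_content().strip()] = texts
print(groups)  # {'First': ['a', 'b'], 'Second': ['c'], 'Third': ['d']}
```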
I want to pull the text inside the span tags, but when I try to use .text or get_text() I get errors (either after printing spans or in the for loop). What am I missing? For now I have it set to do this only for the first div of class col, just to test whether it works, but I will want it to work for the second as well.
Thanks
My Code -
premier_soup1 = player_soup.find('div', {'class': 'row-table details -bp30'})
premier_soup_tr = premier_soup1.find_all('div', {'class': 'col'})
for x in premier_soup_tr[0]:
    spans = x.find('span')
    print(spans)
Output
-1
<span itemprop="name">Alisson Ramses Becker</span>
-1
<span itemprop="birthDate">02/10/1992</span>
-1
<span itemprop="nationality"> Brazil</span>
-1
>>>
The HTML
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span> </strong></p>
<p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
<p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p>
<p>Squad: 13</p><p>Position: Goal Keeper</p>
</div>
If you just want the text in the spans you can search specifically for the spans:
soup = BeautifulSoup(html, 'html.parser')
spans = soup.find_all('span')
for span in spans:
    print(span.text)
If you want to find the spans with the specific divs, then you can do:
divs = soup.find_all('div', {'class': 'col'})
for div in divs:
    spans = div.find_all('span')
    for span in spans:
        print(span.text)
If you just want all of the values after the colons, you can search for the paragraph tags:
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', {'class': 'col'})
for div in divs:
    ps = div.find_all('p')
    for p in ps:
        print(p.text.split(":")[1].strip())
Kyle's answer is good, but to avoid printing the same value multiple times, as you said happened, you need to change the logic a little: first parse and add all matches you find to a list, and THEN loop through the list of matches and print them.
Another thing that you may have to consider is this problem:
<div class="col">
    <div class="col">
        <span/>
    </div>
</div>
In the HTML example above you can see how the span could be added twice with the way matches are found in the answer suggested by Kyle. By using a list instead of printing right away, you can skip any match that is identical to an existing record. It's all about making sure you create logic that will only find the matches you need. How you do that is often dependent on how the HTML is formatted, but it's also important to be creative!
Good luck.
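A minimal sketch of that collect-then-print idea with BeautifulSoup, using the nested div.col case from above (the markup here is hypothetical):

```python
from bs4 import BeautifulSoup

html = """
<div class="col">
  <div class="col">
    <span itemprop="name">Alisson Ramses Becker</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
matches = []
for div in soup.find_all("div", class_="col"):
    for span in div.find_all("span"):
        text = span.get_text(strip=True)
        # record each value once instead of printing inside the loop
        if text not in matches:
            matches.append(text)
print(matches)  # the nested span is reported once, not twice
```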
I want to get some specific information from this HTML code:
<div class="main">
<div class="a"><div><a>linkname1</a></div></div> <!-- I DON'T want the text of this 'a' tag -->
<div class="b">xxx</div>
<div class="c">xxx</div>
<div class="a"><div><a>linkname2</a></div></div> <!-- I want the text of this 'a' tag -->
<div class="a"><div><a>linkname3</a></div></div> <!-- I want the text of this 'a' tag -->
<div class="a"><div><a>linkname4</a></div></div> <!-- I want the text of this 'a' tag -->
<div class="a"><div><a>linkname5</a></div></div> <!-- I want the text of this 'a' tag -->
<div class="d"></div>
<div class="c">xxx</div>
<div class="a"><div><a>linkname6</a></div></div> <!-- I DON'T want the text of this 'a' tag -->
<div class="a"><div><a>linkname7</a></div></div> <!-- I DON'T want the text of this 'a' tag -->
<div class="a"><div><a>linkname8</a></div></div> <!-- I DON'T want the text of this 'a' tag -->
<div class="d"></div>
<div class="c">xxx</div>
<div class="a"><div><a>linkname9</a></div></div> <!-- I DON'T want the text of this 'a' tag -->
<div class="a"><div><a>linkname10</a></div></div> <!-- I DON'T want the text of this 'a' tag -->
</div>
I want to get, in an array, the list of the links' text in the 'second' block of 'a'-class divs (between the first div with class 'c' and the second div with class 'c'). How can I do that with an XPath selector? Is it possible? I can't figure out how.
With my example, the expected result is :
linkname2
linkname3
linkname4
linkname5
Thank you :)
Your question is a set question, like the one explained in this SO answer: How to perform set operations in XPath 1.0.
So, applied to your specific situation, you should use an intersection like this:
(: intersection :)
$set1[count(. | $set2) = count($set2)]
set1 should be the following set after div[@class='c'], and
set2 should be the preceding set before div[@class='d'].
Now, putting both together according to the above formula with
set1 = "div[@class='c'][1]/following-sibling::*" and
set2 = "div[@class='d'][1]/preceding-sibling::*"
the XPath expression could look like this:
div[@class='c'][1]/following-sibling::*[count(. | current()/div[@class='d'][1]/preceding-sibling::*) = count(current()/div[@class='d'][1]/preceding-sibling::*)]
Output:
linkname2
linkname3
linkname4
linkname5
You can try this expression:
/div/div[position() > 3 and position() < 8]/div/a/text()
I found one possible solution :)
//following::div[@class='a' and count(preceding::div[@class='c']) = 1]/div/a/text()
I am new to HTML scraping and R, so this is a tricky problem for me. I have an HTML structure specified as below (body part only), with separate sections, each containing some number of paragraphs. What I want is to pick out all paragraphs in section 1 into one object, and all paragraphs in section 2 into another object.
My current code looks like this:
docx <- read_html("Page.html")
sections = html_nodes(docx, xpath="//div[@class='sections']/*")
This gives me an xml_nodes object, a list of 2, that contains the paragraphs. My problem then is that I cannot use xpathApply on a nodeset because it throws an error. But I want to pick out all the paragraphs like this:
subsparagraphs1 = html_nodes(sections[[1]], xpath="//p")
but it then picks out all paragraphs from the WHOLE html page, not just the first section.
I tried to be more specific:
subsections = html_nodes(sections[[1]], xpath="./div/div/p")
then it picks out nothing, or this:
subsections = html_nodes(sections[[1]], xpath="/p[@class='pwrapper']")
which also results in nothing. Can anyone help me get around this problem?
best, Mia
This is the HTML structure I have, where I want Text 1, Text 2 and Text 3 saved in one object and Texts 4, 5 and 6 saved in another.
<div class = "content">
<div class = "title"> ... </div>
<div class = "sections">
<div> ... >/div
<div class = "sectionHeader">
<div>
<p class = "pwrapper"> Text 1 </p>
<p class = "pwrapper"> Text 2 </p>
<p class = "pwrapper"> Text 3 </p>
</div>
<div> ... </div>
<div> ... </div>
<div> ... >/div
<div class = "sectionHeader">
<div>
<p class = "pwrapper"> Text 4 </p>
<p class = "pwrapper"> Text 5 </p>
<p class = "pwrapper"> Text 6 </p>
</div>
<div> ... </div>
<div> ... </div>
</div>
</div>
Even though your input HTML contains syntax errors, I will presume that the sectionHeader elements are siblings (on the same level, under the same sections parent).
In that case, your XPaths will be:
//div[@class='sections']//div[@class='sectionHeader'][1]//p[@class='pwrapper']/text()
//div[@class='sections']//div[@class='sectionHeader'][2]//p[@class='pwrapper']/text()
All that varies is the index into the //div[@class='sectionHeader'] sequence (1 and 2; note that XPath indexing starts at 1, not 0).
Please let me know if the structure of the input XML is different than what I observed/assumed.
P.S.: You may simplify the XPaths by removing the first path portion: //div[@class='sections'].
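The two XPaths can be sanity-checked outside R as well, here with Python's lxml (rvest's html_nodes(x, xpath = ...) accepts the same strings; the markup below is a cleaned-up version of the structure above):

```python
from lxml import html

doc = html.fromstring("""
<div class="content">
  <div class="sections">
    <div class="sectionHeader">
      <div>
        <p class="pwrapper">Text 1</p>
        <p class="pwrapper">Text 2</p>
        <p class="pwrapper">Text 3</p>
      </div>
    </div>
    <div class="sectionHeader">
      <div>
        <p class="pwrapper">Text 4</p>
        <p class="pwrapper">Text 5</p>
        <p class="pwrapper">Text 6</p>
      </div>
    </div>
  </div>
</div>
""")

# the [1]/[2] index picks the first/second sectionHeader sibling
first = doc.xpath("//div[@class='sections']//div[@class='sectionHeader'][1]"
                  "//p[@class='pwrapper']/text()")
second = doc.xpath("//div[@class='sections']//div[@class='sectionHeader'][2]"
                   "//p[@class='pwrapper']/text()")
print(first)   # ['Text 1', 'Text 2', 'Text 3']
print(second)  # ['Text 4', 'Text 5', 'Text 6']
```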