Parsing "Further reading" with selenium, python - html

I need to parse the text of the "Further reading" section of a Wikipedia article.
My code opens Google, submits a search query (for example 'Bill Gates'), and finds the URL of the Wikipedia page. Now I need to parse the text under "Further reading", but I do not know how.
Here is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

URL = "https://www.google.com/"
adress = input()  # search query, for example: Bill Gates

def main():
    driver = webdriver.Chrome()
    driver.get(URL)
    element = driver.find_element_by_name("q")
    element.send_keys(adress, Keys.ARROW_DOWN)
    element.send_keys(Keys.ENTER)
    elems = driver.find_elements_by_css_selector(".r [href]")
    link = [elem.get_attribute('href') for elem in elems]
    url = link[0]  # link to the Wikipedia page

if __name__ == "__main__":
    main()
And here is the HTML:
<h2>
<span class="mw-headline" id="Further_reading">Further reading</span>
</h2>
<ul>
<li>...</li>
<li>...</li>
<li>...</li>
<li>...</li>
...
</ul>
<h3>
<span class="mw-headline" id="Primary_sources">Primary sources</span>
</h3>
<ul>
<li>...</li>
<li>...</li>
<li>...</li>
...
</ul>
URL: https://en.wikipedia.org/wiki/Bill_Gates

On this page, the "Further reading" content sits between two h2 tags. To collect the text, find the ul elements between those two h2 elements. This is the code that worked for me:
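# Open the page:
driver.get('https://en.wikipedia.org/wiki/Bill_Gates')
# Find the first ul between the two headings and get its text.
# XPath attribute tests are written with @, as in @id.
# Newer Selenium 4 releases removed find_element_by_xpath; there, use driver.find_element(By.XPATH, ...).
further_read = driver.find_element_by_xpath("//ul[preceding-sibling::h2[./span[@id='Further_reading']] and following-sibling::h2[./span[@id='External_links']]]").text
print(further_read)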
I hope this helps, good luck.
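The same "collect the lists between two headings" idea can be sketched without a browser. This is only an illustration under stated assumptions: it swaps the XPath approach for a plain loop over sibling elements using the standard library's xml.etree.ElementTree, and SAMPLE and items_between_headings are hypothetical names with simplified, well-formed markup, not the real Wikipedia page. Unlike the single-element XPath lookup above, it gathers every list up to the next h2, so the "Primary sources" items are included.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the article markup (must be well-formed).
SAMPLE = """
<div>
  <h2><span id="Further_reading">Further reading</span></h2>
  <ul><li>Book A</li><li>Book B</li></ul>
  <h3><span id="Primary_sources">Primary sources</span></h3>
  <ul><li>Memoir C</li></ul>
  <h2><span id="External_links">External links</span></h2>
  <ul><li>Official site</li></ul>
</div>
"""

def items_between_headings(root, start_id):
    """Collect <li> text from each <ul> that follows the <h2> whose span
    has id=start_id, stopping at the next <h2>."""
    items = []
    collecting = False
    for child in root:
        if child.tag == "h2":
            if collecting:
                break  # reached the next section heading
            span = child.find("span")
            collecting = span is not None and span.get("id") == start_id
        elif collecting and child.tag == "ul":
            items.extend("".join(li.itertext()).strip() for li in child.findall("li"))
    return items

further_reading = items_between_headings(ET.fromstring(SAMPLE), "Further_reading")
print(further_reading)
```

With Selenium, the equivalent would be to fetch all matching ul elements with find_elements and join their text, rather than reading only the first match.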

Related

Python Selenium unable to find text of iterative span classes

I am unable to print the text of the span elements.
Expected Output
Hello-World
Foo-Bar
Given HTML Snippet:
<div class="information-container">
<ul>
<li class="info-item">
<span class="info-text">
Hello-World
</span>
</li>
</ul>
<ul>
<li class="info-item">
<span class="info-text">
Foo-Bar
</span>
</li>
</ul>
</div>
My parse Code (Method 1):
page = self.browser.find_element_by_class_name("information-container")
for elem in page.find_elements_by_xpath('.//span[@class = "info-text"]'):
    print("E>", elem.text)
    # Dump all attributes of the element via JavaScript
    attrs = self.browser.execute_script('var items = {}; for (index = 0; index < arguments[0].attributes.length; ++index) { items[arguments[0].attributes[index].name] = arguments[0].attributes[index].value }; return items;', elem)
    pprint(attrs)
Output (Method 1):
E>
{'class': 'info-text'}
E>
{'class': 'info-text'}
My parse Code (Method 2):
page = self.browser.find_element_by_class_name("information-container")
li_objs = page.find_elements_by_class_name('info-text')
for o in li_objs:
    print("text:", o.text)
Output (Method 2):
text:
text:
The text of every span tag on a page can be printed as follows:
from selenium import webdriver

driver = webdriver.Firefox(executable_path='path_to_driver')
driver.get('path_to_site')
elements = driver.find_elements_by_tag_name('span')
for element in elements:
    text = element.text
    print(text)
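One possible reason for the empty strings in the question is that Selenium's .text returns only text that is actually rendered, so spans that are hidden or not yet displayed come back empty. Whether that is the cause here is an assumption. A minimal sketch of a fallback that reads the DOM textContent attribute instead is shown below; element_text and FakeElement are hypothetical names, and FakeElement only imitates the two WebElement members used so the snippet runs without a browser.

```python
def element_text(elem):
    """Return the element's visible text, falling back to its textContent."""
    text = elem.text
    if not text.strip():
        # textContent also covers text that is hidden or not rendered
        text = elem.get_attribute("textContent") or ""
    return text.strip()


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement, for demonstration only."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


hidden_span = FakeElement("", "\n  Hello-World\n")
print(element_text(hidden_span))
```

With a real driver, the same helper would be called as element_text(elem) inside the loop over the found span elements.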

grouping text based on tags in HTML text

I have text in the following format (tags kept, text removed for clarity):
<h2>...</h2>
<p>...</p>
. .
. .
<p>...</p>
<h2>...</h2>
<ul>...</ul>
<li> .. </li>
...
<h2>...</h2>
<li> ..</li>
I am trying to use Scrapy to group the text by header. As a first step, I need to get three groups of data from the above.
from scrapy import Selector
sentence = "above text in the format"
sel = Selector(text = sentence)
# item = sel.xpath("//h2//text()")
item = sel.xpath("//h2/following-sibling::li/ul/p//text()").extract()
I am getting an empty array. Any help appreciated.
I have this solution, made with Scrapy:
import scrapy
from lxml import etree, html


class TagsSpider(scrapy.Spider):
    name = 'tags'
    start_urls = [
        'https://support.litmos.com/hc/en-us/articles/227739047-Sample-HTML-Header-Code'
    ]

    def parse(self, response):
        # Append each <header> element, pretty-printed, to test.html
        for header in response.xpath('//header'):
            with open('test.html', 'a+') as file:
                file.write(
                    etree.tostring(
                        html.fromstring(header.extract()),
                        encoding='unicode',
                        pretty_print=True,
                    )
                )
With this I get the headers and all the content inside them.
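For the grouping step the question asks about (one group per h2), a rough sketch is given below. It deliberately swaps Scrapy for the standard library's html.parser so it runs on its own; HeaderGrouper and SAMPLE are hypothetical names, and the sample markup is an invented stand-in for the elided text. Each text chunk is attached to the most recent h2.

```python
from html.parser import HTMLParser


class HeaderGrouper(HTMLParser):
    """Group text chunks under the most recent <h2> heading."""

    def __init__(self):
        super().__init__()
        self.groups = []  # list of {"header": str, "content": [str, ...]}
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_header = True
            self.groups.append({"header": "", "content": []})

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.groups:
            return  # skip whitespace and anything before the first <h2>
        if self._in_header:
            self.groups[-1]["header"] += text
        else:
            self.groups[-1]["content"].append(text)


# Hypothetical sample mirroring the structure in the question.
SAMPLE = (
    "<h2>Intro</h2><p>First.</p><p>Second.</p>"
    "<h2>Items</h2><ul><li>One</li><li>Two</li></ul>"
    "<h2>More</h2><li>Three</li>"
)

grouper = HeaderGrouper()
grouper.feed(SAMPLE)
grouper.close()
groups = grouper.groups
for group in groups:
    print(group["header"], group["content"])
```

The same bookkeeping (start a new group at each h2, append everything else to the current group) could be applied while iterating over Scrapy selectors.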

removing elements from html using BeautifulSoup and Python 3

I'm scraping data from the web and trying to remove all elements that have tag 'div' and class 'notes module' like this html below:
<div class="notes module" role="complementary">
<h3 class="heading">Notes:</h3>
<ul class="associations">
<li>
Translation into Русский available:
Два-два-один Браво Бейкер by <a rel="author" href="/users/dzenka/pseuds/dzenka">dzenka</a>, <a rel="author" href="/users/La_Ardilla/pseuds/La_Ardilla">La_Ardilla</a>
</li>
</ul>
<blockquote class="userstuff">
<p>
<i>Warnings: numerous references to and glancing depictions of combat, injury, murder, and mutilation of the dead; deaths of minor and major original characters. Numerous explicit depictions of sex between two men.</i>
</p>
</blockquote>
<p class="jump">(See the end of the work for other works inspired by this one.)</p>
</div>
source is here: view-source:http://archiveofourown.org/works/180121?view_full_work=true
I'm struggling to even find and print the elements I want to delete. So far I have:
import urllib.request, urllib.parse, urllib.error
from lxml import html
from bs4 import BeautifulSoup
url = 'http://archiveofourown.org/works/180121?view_full_work=true'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
removals = soup.find_all('div', {'id':'notes module'})
for match in removals:
    match.decompose()
but removals returns an empty list. Can you help me select the entire div element that I've shown above so that I can select and remove all such elements from the html?
Thank you.
The div you are trying to find has class="notes module", yet your code looks for divs with id="notes module".
Change this line:
removals = soup.find_all('div', {'id':'notes module'})
To this:
removals = soup.find_all('div', {'class':'notes module'})
Try this. It removes every div inside the elements with class='wrapper' on that page.
import requests
from bs4 import BeautifulSoup
html = requests.get('http://archiveofourown.org/works/180121?view_full_work=true')
soup = BeautifulSoup(html.text, 'lxml')
for item in soup.select(".wrapper"):
    # Detach every div found inside this wrapper
    for elem in item("div"):
        elem.extract()
    print(item)

BeautifulSoup - Adding attributes on Resultset

Here's my html structure to scrape:
<div class='schedule-lists'>
<ul>
<li>...</li>
<ul>
<li>...</li>
<ul class='showtime-lists'>
<li>...</li>
<li><a auditype="N" cinema="0100" href="javascript:void(0);" >12:45</a></li>
<li>...</li> -- (same structured as above)
<li>...</li> -- (same structured as above)
<li>...</li> -- (same structured as above)
<li>...</li> -- (same structured as above)
Here's my code:
from requests import get
from bs4 import BeautifulSoup
response = get('www.example.com')
response_html = BeautifulSoup(response.text, 'html.parser')
containers = response_html.find_all('ul', class_='showtime-lists')
#print(containers)
[<ul class="showtime-lists">
<li><a auditype="N" cinema="0100" href="javascript:void(0);" >12:45</a></li>
How can I add attributes to the tags in my ResultSet containers? For example, adding movietitle="Logan" so that it becomes:
<li><a movietitle="Logan" auditype="N" cinema="0100" href="javascript:void(0);" >12:45</a></li>
My best attempt used the .append method, but it can't be done that way because the ResultSet acts like a dictionary.
You can try this:
...
links = response_html.find_all('a')
# Each Tag accepts attribute assignment like a dictionary
for tag in links:
    tag['movietitle'] = 'Logan'
print(links)

Of the same tags, I want to extract only the tags I want

I am studying web crawling using Python 3.
<ul class='report_thum_list img'>
<li>...</li>
<li>...</li>
<li>...</li>
<li>...</li>
<li>...</li>
From this, I just want to pull out the li tags.
So I wrote this:
ulTag = soup.findAll('ul', class_='report_thum_list img')
liTag = ulTag[0].findAll('li')
# print(len(liTag))
I expected twenty (there are 20 posts per page),
but more than 100 came out,
because there are other li tags nested inside each li tag.
I do not want to extract the li tags inside the div tags.
How can I pull out just the 20 li tags?
This is my code.
url = 'https://www.posri.re.kr/ko/board/thumbnail/list/63?page='+ str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'lxml')
ulTag = soup.find('ul', class_='report_thum_list img')
# liTag = ulTag.findAll('li')
liTag = ulTag.findChildren('li')
print(len(liTag))
liTag = soup.select('ul.report_thum_list > li')
Use a CSS selector with the child combinator (>), which matches only the li tags that are direct children of the ul.
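The difference between "every descendant li" and "direct child li only" can be illustrated on its own. The sketch below is an assumption-laden stand-in: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, and SAMPLE is invented markup with two posts rather than twenty. In BeautifulSoup itself, find_all('li', recursive=False) on the ul is another way to restrict the search to direct children.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two top-level posts, each with a nested list inside a div.
SAMPLE = """
<ul class="report_thum_list img">
  <li>Post 1<div><ul><li>nested a</li><li>nested b</li></ul></div></li>
  <li>Post 2<div><ul><li>nested c</li></ul></div></li>
</ul>
"""

ul = ET.fromstring(SAMPLE)
all_li = list(ul.iter("li"))   # every descendant <li>, nested ones included
direct_li = ul.findall("li")   # only <li> elements that are direct children
print(len(all_li), len(direct_li))
```

The first count includes the nested items, which is why the original findAll('li') call returned far more than one tag per post.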