BeautifulSoup - Adding attributes on Resultset - html

Here's my html structure to scrape:
<div class='schedule-lists'>
<ul>
<li>...</li>
<ul>
<li>...</li>
<ul class='showtime-lists'>
<li>...</li>
<li><a auditype="N" cinema="0100" href="javascript:void(0);" >12:45</a></li>
<li>...</li> -- (same structure as above)
<li>...</li> -- (same structure as above)
<li>...</li> -- (same structure as above)
<li>...</li> -- (same structure as above)
Here's my code:
from requests import get
from bs4 import BeautifulSoup
response = get('http://www.example.com')  # requests needs an explicit scheme
response_html = BeautifulSoup(response.text, 'html.parser')
containers = response_html.find_all('ul', class_='showtime-lists')
#print(containers)
[<ul class="showtime-lists">
<li><a auditype="N" cinema="0100" href="javascript:void(0);" >12:45</a></li>
How can I add attributes to the tags in my ResultSet containers? For example, adding movietitle="Logan" so that it becomes:
<li><a movietitle="Logan" auditype="N" cinema="0100" href="javascript:void(0);" >12:45</a></li>
My best attempt used the .append method, but it can't be done that way because the ResultSet acts like a dictionary.

You can try this:
...
links = response_html.find_all('a')
for tag in links:
    # Each Tag supports dict-style assignment for its attributes
    tag['movietitle'] = 'Logan'
print(links)
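As a self-contained sketch of the same idea: a ResultSet is a list of Tag objects, so the attribute is set on each tag, not on the ResultSet itself. The HTML below is a shortened copy of the question's markup, and the second showtime (15:30) is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup; the 15:30 entry is hypothetical.
html = """
<ul class="showtime-lists">
<li><a auditype="N" cinema="0100" href="javascript:void(0);">12:45</a></li>
<li><a auditype="N" cinema="0100" href="javascript:void(0);">15:30</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
containers = soup.find_all("ul", class_="showtime-lists")

# Walk every <a> inside every matched <ul> and set the attribute per tag.
for container in containers:
    for link in container.find_all("a"):
        link["movietitle"] = "Logan"

links = soup.find_all("a")
print(links)
```

The tags are modified in place, so `soup` and `containers` both reflect the new attribute afterwards.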

Related

Convert dataframe into a nested html file with R

I am trying to convert a CSV file (in this example, the tibble tree) into a nested HTML file like the one below. So far I did it by expressing the CSV file in Markdown and then using pandoc.
What is the best way to do it with R? Is there a suitable package to use? Is it also possible in R to transform the HTML result by inserting class attributes and span elements into certain HTML elements?
library(tidyverse)
tree <- tibble::tribble(
~level1,~level2,~level3,~level4,
"Beverages","Water","","",
"Beverages","Coffee","","",
"Beverages","Tea","Black tea","",
"Beverages","Tea","White tea","",
"Beverages","Tea","Green tea","Sencha",
"Beverages","Tea","Green tea","Gyokuro",
"Beverages","Tea","Green tea","Matcha",
"Beverages","Tea","Green tea","Pi Lo Chun"
)
Created on 2021-04-23 by the reprex package (v1.0.0)
This is the nested html file that I want to obtain.
<ul>
<li>
<p>Beverages</p>
<ul>
<li>
<p>Water</p>
</li>
<li>
<p>Coffee</p>
</li>
<li>
<p>Tea</p>
<ul>
<li>
<p>Black Tea</p>
</li>
<li>
<p>White Tea</p>
</li>
<li>
<p>Green Tea</p>
<ul>
<li>Sencha</li>
<li>Gyokuro</li>
<li>Matcha</li>
<li>Pi Lo Chun</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
dat <- tibble::tribble(
~level1,~level2,~level3,~level4,
"Beverages","Water","","",
"Beverages","Coffee","","",
"Beverages","Tea","Black tea","",
"Beverages","Tea","White tea","",
"Beverages","Tea","Green tea","Sencha",
"Beverages","Tea","Green tea","Gyokuro",
"Beverages","Tea","Green tea","Matcha",
"Beverages","Tea","Green tea","Pi Lo Chun"
)
paths <- data.frame(pathString = apply(dat, 1, paste0, collapse = "/"))
library(data.tree)
tree <- as.Node(paths)
LL <- as.list(tree)
L <- LL[-1]
library(htmltools)
f <- function(node, nodeName){
if(all(lengths(node) == 0) && length(names(node))){
tagList(
tags$p(nodeName),
do.call(tags$ul, unname(lapply(names(node), tags$li)))
)
}else{
if(length(names(node))){
tags$li(
tags$p(nodeName),
do.call(tags$ul, mapply(f, node, names(node), SIMPLIFY = FALSE, USE.NAMES = FALSE))
)
}else{
tags$li(
tags$p(nodeName)
)
}
}
}
lis <- mapply(f, L, names(L), SIMPLIFY = FALSE, USE.NAMES = FALSE)
ul <- do.call(tags$ul, lis)
html <- as.character(tagList(tags$p(LL$name), ul))
> cat(html)
<p>Beverages</p>
<ul>
<li>
<p>Water</p>
</li>
<li>
<p>Coffee</p>
</li>
<li>
<p>Tea</p>
<ul>
<li>
<p>Black tea</p>
</li>
<li>
<p>White tea</p>
</li>
<p>Green tea</p>
<ul>
<li>Sencha</li>
<li>Gyokuro</li>
<li>Matcha</li>
<li>Pi Lo Chun</li>
</ul>
</ul>
</li>
</ul>

how to write css selector for scrapy?

I have the following web page:
<div id="childcategorylist" class="link-list-container links__listed" data-reactid="7">
<div data-reactid="8">
<strong data-reactid="9">Categories</strong>
</div>
<div data-reactid="10">
<ul id="categoryLink" aria-label="shop by category" data-reactid="11">
<li data-reactid="12">
Contact Lenses
</li>
<li data-reactid="14">
Beauty
</li>
<li data-reactid="16">
Personal Care
</li>
I want a CSS selector for the href attributes of the links under each li tag, i.e. for Contact Lenses, Beauty and Personal Care. How do I write it?
I tried the following selector:
#childcategorylist li
which gives me the following output:
['<li class="titleitem" data-reactid="16"><strong data-reactid="17">Categories</strong></li>']
Please help!
I am not an expert in Scrapy, but HTML elements usually expose a .text attribute.
If not, you might use a regular expression to extract the text between > and <, like:
import re
txt = someArraycontainingStrings[0]
# [^<]+ also matches labels that contain spaces, such as "Contact Lenses"
x = re.search(r">([^<]+)</", txt)
That may give you the result you need; the captured text is in x.group(1).

Parsing "Further reading" with selenium, python

I need to parse text from Further reading in wikipedia.
My code opens Google, enters a search query (for example 'Bill Gates'), and finds the URL of the Wikipedia page. Now I need to parse the text from Further reading, but I do not know how.
Here is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
URL = "https://www.google.com/"
adress = input() #input request, example: Bill Gates
def main():
    driver = webdriver.Chrome()
    driver.get(URL)
    element = driver.find_element_by_name("q")
    element.send_keys(adress, Keys.ARROW_DOWN)
    element.send_keys(Keys.ENTER)
    elems = driver.find_elements_by_css_selector(".r [href]")
    link = [elem.get_attribute('href') for elem in elems]
    url = link[0] #wikipedia's page's link
if __name__ == "__main__":
    main()
And here's the HTML code:
<h2>
<span class="mw-headline" id="Further_reading">Further reading</span>
</h2>
<ul>
<li>...</li>
<li>...</li>
<li>...</li>
<li>...</li>
...
</ul>
<h3>
<span class="mw-headline" id="Primary_sources">Primary sources</span>
</h3>
<ul>
<li>...</li>
<li>...</li>
<li>...</li>
...
</ul>
url - https://en.wikipedia.org/wiki/Bill_Gates
This page has Further Reading text between 2 h2 tags. To collect the text, just find ul elements between h2s. This is the code that worked for me:
# Open the page:
driver.get('https://en.wikipedia.org/wiki/Bill_Gates')
# Search for element, get text:
further_read = driver.find_element_by_xpath("//ul[preceding-sibling::h2[./span[@id='Further_reading']] and following-sibling::h2[./span[@id='External_links']]]").text
print(further_read)
I hope this helps, good luck.
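The same "collect the ul elements between two h2 headings" idea can be sketched without a browser, using BeautifulSoup on static HTML. This is only an illustration: the list items are placeholders, and the External_links heading is assumed from the XPath above.

```python
from bs4 import BeautifulSoup

# Placeholder markup modelled on the question's snippet; item texts are made up.
html = """
<h2><span class="mw-headline" id="Further_reading">Further reading</span></h2>
<ul><li>Book A</li><li>Book B</li></ul>
<h3><span class="mw-headline" id="Primary_sources">Primary sources</span></h3>
<ul><li>Source C</li></ul>
<h2><span class="mw-headline" id="External_links">External links</span></h2>
<ul><li>Link D</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Start at the h2 that wraps the "Further reading" span.
heading = soup.find("span", id="Further_reading").find_parent("h2")

items = []
for sibling in heading.find_next_siblings():
    if sibling.name == "h2":  # the next section starts here, so stop
        break
    if sibling.name == "ul":  # keep list items, skip h3 subheadings
        items.extend(li.get_text(strip=True) for li in sibling.find_all("li"))

print(items)
```

With Selenium, driver.page_source could be passed to BeautifulSoup in place of the html string.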

CSS children selector (not being able to select all children)

This is the image of what I'm trying to scrape using beautiful soup. But whenever I use the code shown below, I only get access to the first child. I am never able to get access to all the children. Can someone help me with this?
item = soup.select("ul.items > li")
print(len(item))
The problem can be fixed in 2 steps:
1. Use select_one on soup to get the ul.
2. Use find_all on the ul to fetch all the li items.
Working solution:
# File name: soup-demo.py
inputHTML = """
<ul class="items">
<li class="class1">item 1</li>
<li class="class1">item 3</li>
<li class="class1">item 3</li>
</ul>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(inputHTML, 'html.parser')
# select_one takes a CSS selector, so the class goes into the selector string
itemList = soup.select_one("ul.items")
items = itemList.find_all("li")
print("Found", len(items), "items")
for item in items:
    print(item)
Output:
$ python3 soup-demo.py
Found 3 items
<li class="class1">item 1</li>
<li class="class1">item 3</li>
<li class="class1">item 3</li>
Your library version may be the problem. This works for me:
from bs4 import BeautifulSoup
html = '''
<ul class="items">
<li>1</li>
<li>2</li>
</ul>
'''
soup = BeautifulSoup(html,features="lxml")
item = soup.select('ul.items>li')
print (len(item))
Here is another solution, using the simplified_scrapy library:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<ul class="items">
<li>1</li>
<li>2</li>
</ul>
'''
doc = SimplifiedDoc(html)
item = doc.selects('ul.items>li')
print(len(item))

Of the same tags, I want to extract only the tags I want

I am studying web crawling using Python 3.
<ul class='report_thum_list img'>
<li>...</li>
<li>...</li>
<li>...</li>
<li>...</li>
<li>...</li>
From this, I just want to pull out the top-level li tags.
So I wrote this:
ulTag = soup.findAll('ul', class_='report_thum_list img')
liTag = ulTag[0].findAll('li')
# print(len(liTag))
I expected 20 (there are 20 posts per page),
but more than 100 came out,
because there are other li tags nested inside each li tag.
I do not want to extract the li tags inside the div tags.
How can I pull out just the 20 li tags?
This is my code:
import requests
from bs4 import BeautifulSoup
# page is the page number, defined elsewhere
url = 'https://www.posri.re.kr/ko/board/thumbnail/list/63?page='+ str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'lxml')
ulTag = soup.find('ul', class_='report_thum_list img')
# liTag = ulTag.findAll('li')
liTag = ulTag.findChildren('li')  # still recursive, so nested li tags are counted too
print(len(liTag))
liTag = soup.select('ul.report_thum_list > li')
Use a CSS selector. It is easy to use, and the > combinator matches only the direct li children of the ul.
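A minimal sketch of the difference, using made-up HTML with the same shape as the page (li tags nested inside each post's div). It also shows find_all with recursive=False, which is another way to restrict the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two posts, each containing a nested list inside a div.
html = """
<ul class="report_thum_list img">
  <li>post 1<div><ul><li>nested a</li><li>nested b</li></ul></div></li>
  <li>post 2<div><ul><li>nested c</li></ul></div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
ul_tag = soup.find("ul", class_="report_thum_list img")

# Default search is recursive, so nested li tags are included.
all_li = ul_tag.find_all("li")

# recursive=False limits the search to direct children of the ul.
direct_li = ul_tag.find_all("li", recursive=False)

# The child combinator (>) in a CSS selector does the same job.
selected_li = soup.select("ul.report_thum_list > li")

print(len(all_li), len(direct_li), len(selected_li))
```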