With BeautifulSoup, I am trying to build a list of lists that keeps an empty list for every div without a value and the extracted values in the order they appear, using the example HTML below:
[<div class="Stats">
</div>
<div class="Stats">
</div>
<div class="Stats">
</div>
<div class="Stats">
</div>
<div class="Stats">
</div>
<div class="Stats">
<div class="Stats__x">
<!--
-->C<!--
--></div>
</div>
<div class="Stats">
</div>
<div class="Stats">
</div>
<div class="Stats">
</div>]
My current code attempts are getting...
[['C']]
The result I would like to get is...
[[], [], [], [], [], ['C'], [], [], []]
I have tried many approaches, such as creating a list of empty lists sized by the number of divs (stats = soup.find_all("div", {"class": "Stats"}), then x = len(stats)), and then using for loops to append an element if it exists and leave the empty list in place if it doesn't.
hList = []
for each in stats:
    for each2 in each.find_all("div", {"class": "Stats__x"}):
        hList.append(each2.text.split())
I probably need to perform some type of index assignment but I can't figure it out.
Thanks.
First I find every div with class="Stats", and inside each one I look for a single div with class="Stats__x". If find() returns None, I append [] instead.
data = '''<div class="Stats"></div>
<div class="Stats"></div>
<div class="Stats"></div>
<div class="Stats"></div>
<div class="Stats"></div>
<div class="Stats">
<div class="Stats__x">
<!--
-->C<!--
--></div>
</div>
<div class="Stats"></div>
<div class="Stats"></div>
<div class="Stats"></div>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')
result = []
for div in soup.find_all("div", {"class": "Stats"}):
    # find() returns None when the div has no Stats__x child
    item = div.find("div", {"class": "Stats__x"})
    if item:
        result.append(item.text.split())
        #result.append([item.text.strip()])
    else:
        result.append([])
print(result)
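The same idea can be written more compactly. This is only a sketch: the helper name stats_rows and the shortened three-div sample are made up for illustration, and it uses stripped_strings instead of text.split(), which also skips the HTML comments and surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened sample: only the middle div has a Stats__x child.
SAMPLE = (
    '<div class="Stats"></div>'
    '<div class="Stats"><div class="Stats__x"><!--\n-->C<!--\n--></div></div>'
    '<div class="Stats"></div>'
)


def stats_rows(html):
    """Return one list per outer Stats div, empty when it has no Stats__x child."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for outer in soup.find_all("div", class_="Stats"):
        inner = outer.find("div", class_="Stats__x")
        # stripped_strings yields only non-empty text, not comments
        rows.append(list(inner.stripped_strings) if inner is not None else [])
    return rows


result = stats_rows(SAMPLE)
print(result)
```

Because every outer div contributes exactly one entry, the positions in the result line up with the positions of the divs in the page.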
I have a div that looks like this:
<div id="data" class="grid grid-cols-2">
</div>
and I have a function that appends to the data div:
function loadStaticBar(data) {
  let pl_name = `bar-${data.title.replace(/\s+/g, '-').toLowerCase()}`;
  $('#data').append(`
    <div class="flex flex-col" id="${pl_name}-wrapper">
      <div class="static barchart" id="${pl_name}-plot">
      </div>
    </div>
  `);
}
The data argument passed to loadStaticBar(data) is an object of key-value pairs holding the chart details:
{id: 453, title: 'Bar Chart Example', select: 'bar-form', xtitle: 'Values', ytitle: 'Date', …}
Now, I'm trying to get all the IDs with the class static. I have tried:
$('#data').find('.static')
And I get S.fn.init [prevObject: S.fn.init(1)], which contains a lot of information. How can I get just the IDs of the divs that have the static class, like this?
ids = ['line-plot', 'bar-plot']
The answer to the updated question could be:
function loadStaticBar(data) {
  let pl_name = `bar-${data.title.replace(/\s+/g, '-').toLowerCase()}`;
  $('#data').append(`
    <div class="flex flex-col" id="${pl_name}-wrapper">
      <div class="static barchart" id="${pl_name}-plot">
      </div>
    </div>
  `);
}
const data = {id: 453, title: 'Bar Chart Example', select: 'bar-form', xtitle: 'Values', ytitle: 'Date'};
$(function () {
  loadStaticBar(data); // create the divs first !!!
  const ids = $("#data .static").get().map(el => el.id);
  console.log(ids);
});
<script src="https://code.jquery.com/jquery-3.6.1.min.js"></script>
<div id="data" class="grid grid-cols-2">
</div>
As you want to receive a "proper" array instead of a jQuery object, it makes sense to call .get() on the selection first and then .map() the resulting DOM elements with the standard Array method.
Incidentally, you can solve the originally posted question also without jQuery:
document.addEventListener("DOMContentLoaded", function () {
  const ids = [...document.querySelectorAll("#data .static")].map(el => el.id);
  console.log(ids);
});
<div id="data" class="grid grid-cols-2">
<div class="flex flex-col" id="line-wrapper">
<div class="static linechart" id="line-plot">
</div>
</div>
<div class="flex flex-col" id="bar-wrapper">
<div class="static barchart" id="bar-plot">
</div>
</div>
</div>
I have a website in the following format:
<html lang="en">
<head>
#anything
</head>
<body>
<div id="div1">
<div id="div2">
<div class="class1">
#something
</div>
<div class="class2">
#something
</div>
<div class="class3">
<div class="sub-class1">
<div id="statHolder">
<div class="Class 1 of 15">
"Name"
<b>Bob</b>
</div>
<div class="Class 2 of 15">
"Age"
<b>24</b>
</div>
# Here are 15 of these kinds
</div>
</div>
</div>
</div>
</div>
</body>
</html>
I want to retrieve all the content in those 15 classes. How do I do that?
Edit:
My Current Approach:
import requests
from bs4 import BeautifulSoup
url = 'my-url-here'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
name_box = soup.findAll('div', {"id": "div1"}) #I dont know what to do after this
Expected Output:
Name: Bob
Age: 24
#All 15 entries like this
I am using BeautifulSoup4 for this.
Is there any direct way to get all the contents in <div id="stats">?
Based on the HTML above, you can try it this way:
import requests
from bs4 import BeautifulSoup

result = {}
url = 'my-url-here'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
stats = soup.find('div', {'id': 'statHolder'})
for data in stats.find_all('div'):
    # assumes each div holds exactly two whitespace-separated tokens
    key, value = data.text.split()
    result[key.replace('"', '')] = value
print(result)
# Prints:
# {'Name': 'Bob', 'Age': '24'}
for key, value in result.items():
    print(f'{key}: {value}')
# Prints:
# Name: Bob
# Age: 24
This finds the div with the id of statHolder.
Then, we find all divs inside that div and split each one's text on whitespace (using split): the first token is the key and the second is the value. We also remove the double quotes from the key using replace.
Then, we add the key-value pair to our result dictionary.
Iterating through this, you can get the desired output as shown.
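One caveat: unpacking data.text.split() into two names only works when the label and the value are single words. The variant below is a sketch that reads the value from the b tag instead, so multi-word values survive. The sample HTML, including the value "Bob Smith", is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; "Bob Smith" is invented to show a multi-word value.
html = '''
<div id="statHolder">
  <div class="Class 1 of 15">"Name" <b>Bob Smith</b></div>
  <div class="Class 2 of 15">"Age" <b>24</b></div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
holder = soup.find("div", id="statHolder")

result = {}
for div in holder.find_all("div", recursive=False):
    value_tag = div.find("b")
    if value_tag is None:
        continue  # skip divs that do not follow the label/value pattern
    # The first non-empty string in the div is the quoted label.
    label = next(div.stripped_strings, "")
    result[label.strip('"')] = value_tag.get_text(strip=True)

for key, value in result.items():
    print(f"{key}: {value}")
```

Using recursive=False keeps the loop on the direct children of statHolder, which matters if the real page nests further divs inside each entry.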
If you do it according to the actual html of the webpage the following will give you the stats as a dictionary. It takes each element with class pSt as the key and then moves to the following strong tag to get the associated value.
from bs4 import BeautifulSoup as bs
#html is response.content assuming not dynamic
soup = bs(html, 'html.parser')
stats = {i.text:i.strong.text for i in soup.select('.pSt')}
For the HTML you have shown, you can use stripped_strings to get the first string in each div as the key:
from bs4 import BeautifulSoup as bs
html = '''
<html lang="en">
<head>
#anything
</head>
<body>
<div id="div1">
<div id="div2">
<div class="class1">
#something
</div>
<div class="class2">
#something
</div>
<div class="class3">
<div class="sub-class1">
<div id="statHolder">
<div class="Class 1 of 15">
"Name"
<b>Bob</b>
</div>
<div class="Class 2 of 15">
"Age"
<b>24</b>
</div>
# Here are 15 of these kinds
</div>
</div>
</div>
</div>
</div>
</body>
</html>
'''
soup = bs(html, 'html.parser')
stats = {[s for s in i.stripped_strings][0]:i.b.text for i in soup.select('#statHolder [class^=Class]')}
print(stats)
I want to extract HTML between two HTML tags with identical id
html = '''<div id="note">
<div id="seccion">
<a name="title">Title of the seccion 1</a>
</div>
<div id="content">
<div id="col1">xxx</div>
<div id="col2">xxx</div>
</div>
<div id="content">
<div id="col1">xxx</div>
<div id="col2">xxx</div>
</div>
<div id="seccion">
<a name="title">Title of the seccion 2</a>
</div>
<div id="block">
<div id="col1">xxx</div>
<div id="col2">xxx</div>
</div>
<div id="block">
<div id="col1">xxx</div>
<div id="col2">xxx</div>
</div>
<div id="seccion">
<a name="title">Title of the seccion 3</a>
</div>
<div id="block">
<div id="col1">xxx</div>
<div id="col2">xxx</div>
</div>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
seccion = soup.find_all("div", {"id": "seccion"})
for item in seccion:
    print([a.text for a in item.find_all("a", {"name": "title"})])
Unfortunately, the sections are not each wrapped in their own div whose children I could pull out.
I also don't know in advance how many blocks each section contains.
I am not sure whether it is possible to extract the HTML between two divs when their ids are identical.
You can separate sections by using .find_all() with parameter recursive=False and then check if the <div> contains id="seccion" attribute.
For example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
sections = []
for div in soup.select_one('div#note').find_all('div', recursive=False):
    if div.get('id') == 'seccion':
        # a heading div starts a new section
        sections.append([div])
    else:
        # any other div belongs to the most recent section
        sections[-1].append(div)
for section in sections:
    for div in section:
        print(div.get_text(strip=True, separator='\n'))
    print('-' * 80)
Prints the three sections separated:
Title of the seccion 1
xxx
xxx
xxx
xxx
--------------------------------------------------------------------------------
Title of the seccion 2
xxx
xxx
xxx
xxx
--------------------------------------------------------------------------------
Title of the seccion 3
xxx
xxx
--------------------------------------------------------------------------------
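If you would rather look sections up by title, the same single pass can fill a dictionary. This is a sketch under a few assumptions: the sample HTML is a shortened, made-up version of the one in the question, titles are assumed to be unique, and divs that appear before the first heading are simply ignored.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened sample with the same id layout as the question.
html = '''<div id="note">
<div id="seccion"><a name="title">Title 1</a></div>
<div id="content">a</div>
<div id="content">b</div>
<div id="seccion"><a name="title">Title 2</a></div>
<div id="block">c</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

sections = {}
current = None  # title of the section currently being filled
for div in soup.find(id="note").find_all("div", recursive=False):
    if div.get("id") == "seccion":
        current = div.get_text(strip=True)
        sections[current] = []
    elif current is not None:
        # keep the Tag itself (not just its text) if you need the inner HTML
        sections[current].append(div.get_text(strip=True))

print(sections)
```

The current is not None check avoids the IndexError that sections[-1] would raise if the first child were not a heading.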
One option is to use Selenium.
Download the driver for Google Chrome (ChromeDriver).
To get an element's XPath, right-click the element in the browser's developer tools, choose 'Copy', and then select 'Copy XPath' or 'Copy full XPath'.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('Path_to_chromedriver.exe'), options=options)
driver.get('url')  # webpage URL
text = driver.find_element("xpath", "Element_xpath").text  # the attribute is lowercase .text
driver.quit()  # close Chrome and end the driver session
I want to create a list of items with subitems. The items and the subitems are listed in arrays like this:
items=[firstItem,secondItem,thirdItem,...]
and
subitems=[{"name":"firstSubitem1","name":"secondSubitem1",...},{"name":"firstSubItem2","name":"secondSubItem2"},...]
So I've set up this html code:
<body>
<div ng-app="" ng-init="items=[firstItem,secondItem,thirdItem,...],subitems=[{"name":"firstSubitem1","name":"secondSubitem1",...},{"name":"firstSubItem2","name":"secondSubItem2"},...]">
<div ng-repeat="itemNames in items" id="item{{$index}}">{{itemNames}}</div>
</div>
</body>
But that only outputs the list of items, and I want a list of items (divs) that contain their subitems (paragraphs). So the output that I want to create is this:
<body>
<div (...) id="item0">
firstItem
<p>firstSubitem1</p>
<p>secondSubitem1</p>
<p>...</p>
</div>
<div (...) id="item1">
secondItem
<p>firstSubitem2</p>
<p>...</p>
</div>
<div (...) id="item2">
...
</div>
</body>
How can I produce that output?
You need to restructure your data as follows:
items = [{
  title: firstItem,
  subitems: [firstSubitem1, secondSubitem1, ...]
}, {
  title: secondItem,
  subitems: [firstSubitem2, secondSubitem2, ...]
}, ...]
And the HTML becomes:
<div ng-repeat="item in items" id="item{{$index}}">{{item.title}}
  <p ng-repeat="subitem in item.subitems">{{subitem}}</p>
</div>
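The reshaping itself does not depend on Angular. As a rough illustration of going from two parallel arrays to one nested structure, here is a sketch in Python; the string values are placeholders standing in for the question's firstItem, firstSubitem1 and so on, and it assumes the subitems for each item are already a plain list in matching order.

```python
# Placeholder data standing in for the question's parallel arrays.
items = ["firstItem", "secondItem"]
subitems = [
    ["firstSubitem1", "secondSubitem1"],
    ["firstSubitem2"],
]

# Pair each title with the subitems at the same index.
nested = [
    {"title": title, "subitems": list(subs)}
    for title, subs in zip(items, subitems)
]

for index, item in enumerate(nested):
    print(f"item{index}: {item['title']} -> {item['subitems']}")
```

With the data in this shape, the outer repeat walks the items and the inner repeat walks item.subitems, so no index bookkeeping is needed in the template.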
If you insist on your data structure, use $index from the outer loop:
<body>
  <div ng-app="" ng-init="items=[firstItem,secondItem,thirdItem,...],subitems=[{firstSubitem1,secondSubitem1,...},{firstSubItem2,secondSubItem2},...]">
    <div ng-repeat="itemNames in items" id="item{{$index}}">
      {{itemNames}}
      <p data-ng-repeat="(key, value) in subitems[$index]"> {{ value }}</p>
    </div>
  </div>
</body>
I have a use case that I had solved until now with ng-repeat-start. It currently looks like this:
Controller
$scope.providers = [
'Site Hosting',
'Best Hosting',
'Fantastic Hosting'
];
$scope.providerColumns = {
price: {
text: 'Price',
exposed: true
},
site: {
text: 'Website',
exposed: true
},
phone: {
text: 'Phone Number',
exposed: false
}
};
Template
<div class='table-header-row'>
<div class='first-header-cell' ng-repeat='obj in firstCellsHeaders'>
{{obj.text}}
</div>
<div class='middle-header-cell' ng-repeat-start='prov in providers' ng-if='providerColumns.price.exposed'>
{{prov}} {{providerColumns.price.text}}
</div>
<div class='middle-header-cell' ng-if='providerColumns.site.exposed'>
{{prov}} {{providerColumns.site.text}}
</div>
<div class='middle-header-cell' ng-repeat-end ng-if='providerColumns.phone.exposed'>
{{prov}} {{providerColumns.phone.text}}
</div>
<div class='last-header-cell' ng-repeat='obj in lastCellsHeaders'>
{{obj.text}}
</div>
</div>
<div class='table-row'>
<div class='first-cells' ng-repeat='obj in firstCells'>
{{obj.text}}
</div>
<div class='middle-cells' ng-repeat-start='prov in providers' ng-if='providerColumns.price.exposed'>
{{data[prov].price}}
</div>
<div class='middle-cells' ng-if='providerColumns.site.exposed'>
{{data[prov].site}}
</div>
<div class='middle-cells' ng-repeat-end ng-if='providerColumns.phone.exposed'>
{{data[prov].phone}}
</div>
<div class='last-cells' ng-repeat='obj in lastCells'>
{{obj.texLine2}}
</div>
</div>
It's essentially a table built out of divs, in order to make it easier to reorder the columns after Angular finishes rendering.
Now I've realized the object represented in ng-repeat-start is more dynamic than I previously thought, and I want to be able to have a squared ng-repeat, i.e. nested repeat without nested elements, much like the oft-applied solution of using ng-repeat on a tbody element, and then an additional ng-repeat on the tr elements within it, which essentially is a flattened nesting effect. I need a solution for horizontal elements, preferably divs. I can't have them tied together in any way, however, so no nesting of divs or anything like that. Ideally it would look something like this:
<div ng-repeat='prov in providers' ng-repeat='col in providerColumns' ng-if='col.exposed'>
{{data[prov][col]}}
</div>
Any ideas?
Instead of going with:
<div ng-repeat='obj1 in parent' ng-repeat='obj2 in obj1'>
{{content[obj1][obj2]}}
</div>
Why not use a nested ngRepeat, like this:
<div ng-repeat='obj1 in parent'>
<div ng-repeat='obj2 in obj1'>
{{content[obj1][obj2]}}
</div>
</div>
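If nested elements really are not an option, another way to think about the "squared" repeat is to flatten the two collections into one list before rendering, so that a single repeat can walk it. This is not what the answer above proposes; it is only a sketch of the idea, written in Python with the provider data from the question, and the names exposed and header_cells are made up.

```python
from itertools import product

providers = ["Site Hosting", "Best Hosting", "Fantastic Hosting"]
provider_columns = {
    "price": {"text": "Price", "exposed": True},
    "site": {"text": "Website", "exposed": True},
    "phone": {"text": "Phone Number", "exposed": False},
}

# Keep only the columns that should be shown, in definition order.
exposed = [key for key, col in provider_columns.items() if col["exposed"]]

# One flat entry per (provider, column) pair: providers vary slowest.
header_cells = [
    f"{prov} {provider_columns[key]['text']}"
    for prov, key in product(providers, exposed)
]

print(header_cells)
```

The same cross product, built in the controller, would give one flat array of sibling cells with no wrapper element around each provider.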