This is an image of what I'm trying to scrape using Beautiful Soup. But whenever I use the code shown below, I only get the first child, never all of the children. Can someone help me with this?
item = soup.select("ul.items > li")
print(len(item))
The problem can be fixed in two steps:
1. Use select_one on soup to get the ul.
2. Use find_all on the ul to fetch all the li items.
Working solution:
# File name: soup-demo.py
inputHTML = """
<ul class="items">
<li class="class1">item 1</li>
<li class="class1">item 3</li>
<li class="class1">item 3</li>
</ul>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(inputHTML, 'html.parser')
# select_one takes a CSS selector; class_ is a keyword for find()/find_all()
itemList = soup.select_one("ul.items")
items = itemList.find_all("li")
print("Found", len(items), "items")
for item in items:
    print(item)
Output:
$ python3 soup-demo.py
Found 3 items
<li class="class1">item 1</li>
<li class="class1">item 3</li>
<li class="class1">item 3</li>
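One caveat, shown as a minimal sketch: find_all("li") also returns li elements nested deeper inside the ul, whereas the question's selector ul.items > li matches direct children only. The nested markup below is made up for illustration; recursive=False is the find_all option that restricts the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one nested list, for illustration only
html = """
<ul class="items">
  <li>item 1
    <ul><li>nested item</li></ul>
  </li>
  <li>item 2</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
ul = soup.select_one("ul.items")

all_items = ul.find_all("li")                      # every descendant li
direct_items = ul.find_all("li", recursive=False)  # direct children only
child_items = soup.select("ul.items > li")         # same set via the CSS child combinator

print(len(all_items), len(direct_items), len(child_items))
```

If the lists are never nested, all three calls return the same elements.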
Your selector looks correct, so maybe the version you have installed is the problem. This works for me:
from bs4 import BeautifulSoup
html = '''
<ul class="items">
<li>1</li>
<li>2</li>
</ul>
'''
soup = BeautifulSoup(html,features="lxml")
item = soup.select('ul.items>li')
print (len(item))
There's another solution here
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<ul class="items">
<li>1</li>
<li>2</li>
</ul>
'''
doc = SimplifiedDoc(html)
item = doc.selects('ul.items>li')
print(len(item))
More examples are here
Hi, I am trying to write a regexp that extracts only the li tags inside ul tags (not ol).
Text:
<ul><li>some text</li></ul>
<ol><li>some text</li></ol>
Extracted:
<ul>**<li>**some text</li></ul>
<ol><li>some text</li></ol>
Could you help me?
Solution 1
Regex solution
/(?<=<ul>\s*(?:<li>.*?<\/li>\s*)*)<li>.*?<\/li>/gi
Demo
If you work in a team and someone else may read your code, I advise you to use Solution 2. It is simpler and easier to understand when reading the code.
Solution 2
Do it in 2 steps:
Delete all <ol>...</ol> nodes;
Take all <li>...</li> nodes.
*I assume your html is valid and you have no <li> outside <ul> or <ol>.
Code example in JavaScript:
let html = `
<ul>
<li>take this node 1</li>
<li>take this node 2</li>
</ul>
<ol>
<li>exclude this node</li>
<li>exclude this node</li>
</ol>
<ul>
<li>take this node 3</li>
<li>take this node 4</li>
</ul>
<ol>
<li>exclude this node</li>
<li>exclude this node</li>
</ol>
`;
let htmlWithoutOl = html.replace(/<ol>.*?<\/ol>/gis, '');
let matches = htmlWithoutOl.matchAll(/<li>.*?<\/li>/gis);
for (const match of matches) {
  console.log(match[0]);
}
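For comparison, the same two steps can be sketched in Python with the standard re module. This is a port of Solution 2 under the same assumption (valid HTML, no li outside ul or ol); the sample markup is shortened and the variable names are illustrative.

```python
import re

html = """
<ul>
<li>take this node 1</li>
<li>take this node 2</li>
</ul>
<ol>
<li>exclude this node</li>
</ol>
<ul>
<li>take this node 3</li>
</ul>
"""

flags = re.IGNORECASE | re.DOTALL  # DOTALL lets "." span newlines, like the JS "s" flag

# Step 1: delete every <ol>...</ol> block.
html_without_ol = re.sub(r"<ol>.*?</ol>", "", html, flags=flags)

# Step 2: collect the remaining <li>...</li> nodes.
matches = re.findall(r"<li>.*?</li>", html_without_ol, flags=flags)

for match in matches:
    print(match)
```

Like the JavaScript version, this matches only bare <ol> and <li> tags without attributes.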
Here's my html structure to scrape:
<div class='schedule-lists'>
<ul>
<li>...</li>
<ul>
<li>...</li>
<ul class='showtime-lists'>
<li>...</li>
<li><a auditype="N" cinema="0100" href="javascript:void(0);" >12:45</a></li>
<li>...</li> -- (same structured as above)
<li>...</li> -- (same structured as above)
<li>...</li> -- (same structured as above)
<li>...</li> -- (same structured as above)
Here's my code:
from requests import get
from bs4 import BeautifulSoup
response = get('http://www.example.com')  # requests needs the URL scheme
response_html = BeautifulSoup(response.text, 'html.parser')
containers = response_html.find_all('ul', class_='showtime-lists')
#print(containers)
[<ul class="showtime-lists">
<li><a auditype="N" cinema="0100" href="javascript:void(0);" >12:45</a></li>
How can I add attributes to the tags in my ResultSet containers? For example, adding movietitle="Logan" so it becomes:
<li><a movietitle="Logan" auditype="N" cinema="0100" href="javascript:void(0);" >12:45</a></li>
My best attempt was the .append method, but it can't be done that way because the ResultSet acts like a dictionary.
You can try this:
...
a = response_html.find_all('a')
for tag in a:
    # a Tag's attributes can be set like dictionary keys
    tag['movietitle'] = 'Logan'
print(a)
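A self-contained sketch of the same idea, limited to the links inside the showtime lists. The HTML string is a stand-in for the real page, and the second showtime (15:00) is invented for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for the scraped page; the 15:00 entry is made up
html = """
<ul class="showtime-lists">
<li><a auditype="N" cinema="0100" href="javascript:void(0);">12:45</a></li>
<li><a auditype="N" cinema="0100" href="javascript:void(0);">15:00</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
containers = soup.find_all("ul", class_="showtime-lists")

# A ResultSet is iterated like a list; each Tag's attributes are set like dict keys.
for container in containers:
    for link in container.find_all("a"):
        link["movietitle"] = "Logan"

links = soup.find_all("a")
print(links[0])
```

Because the tags are modified in place, printing containers afterwards shows the new attribute as well.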
I'm trying to create a recursive list using Thymeleaf. I'm using a simple Java object to model a node, which has two fields: a description and an ArrayList of child nodes. I'm using the following HTML/Thymeleaf to process the structure, but it isn't recursively iterating through to the next level down.
My Java code looks as follows:
public class Node {
    public String description;
    public ArrayList<Node> children;
}
My Thymeleaf/HTML code is as follows:
<html>
...
<body>
<div th:fragment="fragment_node" th:remove="tag">
<ul th:if="${not #lists.isEmpty(node.children)}" >
<li th:each="child : ${node.children}"
th:text="${child.description}"
th:with="node = ${child}"
th:include="this::fragment_node">List Item</li>
</ul>
</div>
</body>
</html>
If my data structure looks as follows:
Main node 1
    Child node 1
    Child node 2
Main node 2
    Child node 3
    Child node 4
I'd expect to get:
<ul>
<li>Main Node 1</li>
<li>
<ul>
<li>Child node 1</li>
<li>Child node 2</li>
</ul>
</li>
<li>Main Node 2</li>
<li>
<ul>
<li>Child node 3</li>
<li>Child node 4</li>
</ul>
</li>
</ul>
However, I only get:
<ul>
<li>Main Node 1</li>
<li>Main Node 2</li>
</ul>
Can anyone spot why this may not be working?
The cause of the problem:
You are using th:text to put the description into the <li>, and th:include to pull the fragment into that same <li> tag.
Both attributes write the body of the tag, so the content brought in by th:include ends up replaced by the output of th:text.
Direct solution to your source code
.....
<li th:each="child : ${node.children}" th:inline="text" th:with="node = ${child}">
[[${child.description}]]
<ul th:replace="this::fragment_node">List Item</ul>
</li>
.....
Even though the above will work as you want, I personally see some design issues in your Thymeleaf page.
Better solution using fragment parameters
...
<ul th:fragment="fragment_node(node)" th:unless="${#lists.isEmpty(node.children)}" >
<li th:each="child : ${node.children}" th:inline="text">
[[${child.description}]]
<ul th:replace="this::fragment_node(${child})"></ul>
</li>
</ul>
...
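The recursion that the parameterised fragment performs can be sketched outside Thymeleaf. The Python below is only an illustration of the logic, not Thymeleaf code: the Node class mirrors the Java one from the question, and render is a hypothetical helper that plays the role of fragment_node(node).

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class Node:
    description: str
    children: List["Node"] = field(default_factory=list)


def render(node: Node) -> str:
    """Return a <ul> for node's children, or nothing if there are none."""
    if not node.children:  # corresponds to th:unless="${#lists.isEmpty(node.children)}"
        return ""
    items = "".join(
        # description first, then the nested list rendered by the recursive call
        f"<li>{escape(child.description)}{render(child)}</li>"
        for child in node.children
    )
    return f"<ul>{items}</ul>"


root = Node("root", [
    Node("Main node 1", [Node("Child node 1"), Node("Child node 2")]),
    Node("Main node 2", [Node("Child node 3"), Node("Child node 4")]),
])
output = render(root)
print(output)
```

As in the fragment, each child list is nested inside the same li as its parent's description, rather than in a separate li as in the expected output of the question.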
The asp.net repeater is inserting an extra <li></li> into every <ItemTemplate>. It does the same thing with tables. It inserts an extra <tr></tr>. And same with divs. Basically any element that I put in the ItemTemplate comes out with a duplicate.
Here is my code.
This is the part of the aspx.vb file that builds litData:
If item.ItemType = ListItemType.Item Then
Dim drv As System.Data.DataRowView = DirectCast((e.Item.DataItem), System.Data.DataRowView)
Dim strLinkValue As String = drv.Row("ReturnVal").ToString()
Dim Literal1 As Literal = DirectCast(item.FindControl("litData"), Literal)
Literal1.Text = "<a href=" & strQString.ToLower().Replace("/default.aspx", "") & strLinkValue & "/default.aspx>Hello World" + strLinkValue + "</a>"
End If
And this is in the .aspx file:
<asp:Repeater ID="rptData" runat="server" OnItemDataBound="rptData_OnItemDataBound" EnableViewState="false">
<HeaderTemplate>
<ul>
</HeaderTemplate>
<ItemTemplate>
<li><asp:Literal ID="litData" runat="server" EnableViewState="false"></asp:Literal></li>
</ItemTemplate>
<FooterTemplate>
</ul>
</FooterTemplate>
</asp:Repeater>
I am expecting it to render...
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
What is actually rendering...
<ul>
<li>Item 1</li>
<li></li>
<li>Item 2</li>
<li></li>
<li>Item 3</li>
<li></li>
</ul>
As per the comments, you are binding data to the repeater inside the ItemDataBound handler. That is not the right way to do it. Move your data binding to Page_Load:
protected void Page_Load(object sender, EventArgs e)
{
// other code
if (!IsPostBack)
{
rptData.DataSource = objDS;
rptData.DataBind();
}
}
And you might not need the ItemDataBound handler anymore.
the HTML:
<ul id="nav">
<li id="listItem">a list item</li>
<li id="link01">list item with ID</li>
<li id="link02">another link with ID</li>
<li class="lastItem">Contact</li>
<li class="lastItem">the Very Last List Item</li>
</ul>
the JavaScript:
alert($$('.lastItem').getFirst('li').get('text'));
console returns this error:
TypeError: $$(...).getFirst(...).get is not a function
um...whut? what did i miss? if i take out the getFirst(), it works, but returns, of course, both <li> text contents... which i don't want. just want the first...
halp.
WR!
You are trying to call getFirst on an Elements array ($$ returns an Elements array!). The getFirst() method exists only on a single MooTools DOM element, and it returns that element's first child.
What you are looking for is this:
alert($$('.lastItem')[0].get('text'));