Extracting Multiple Child Elements from a Parent using Cheerio - cheerio

I'm trying to use Cheerio to scrape data and ultimately convert the resultant HTML to Markdown.
While not core to this question, to convert to Markdown, all I need is some valid HTML. Specifically, for this case, a div with one or more <ul> tags.
I mention this so it's clear that I'm not using the resultant HTML to directly render, but I need it in a form that I can use to convert to Markdown.
Using the simplified example below and given a known class name of "things", there are two <ul> tags in the parent div.
Note that the ul tags do not have a class or id in the code I'm scraping.
<div class="things"> // <= want
<h5 class="heading">Things</h5> // <= don't want
<ul> // <= want with children
<li class="sub-heading">Fruits</li>
<li class="fruit-item">Apple</li>
<li class="fruit-item">Pear</li>
</ul>
<ul> // <= want with children
<li class="sub-heading">Veg</li>
<li class="veg-item">Carrot</li>
<li class="veg-item">Spinach</li>
</ul>
</div>
I want every ul, with its list items, in a surrounding div.
The following returns HTML without a surrounding div and with stuff I don't want (e.g. <h5 class="heading">Things</h5>):
const stuffIWant = $(".things").html();
The following returns HTML without a surrounding div, and only the contents of one of the <ul> tags, not the ul itself:
const stuffIWant = $(".things ul").html();
I know that this is because .html() returns the inner HTML of only the first matched element, so I'm just getting the list items from the first ul.
This is my problem, and it is where I'm confusing myself.
I've also tried various forms of filter, map, and each, but I can't, for the life of me, get multiple <ul> tags returned in an enclosing div.
I'm thinking maybe I need to iterate through the "things" div using each or map and append the elements I want to a new div (somehow?), but that seems more complicated than it should be, so I'm asking here.
Any advice toward helping me wrap my head around this would be much appreciated.
Thanks.

While this post wasn't clarified completely, it seems there are two ways to interpret it. One possibility is that you want all of the <li>s for each of your <ul>s in a series of arrays:
const $ = cheerio.load(html);
const result = [...$(".things ul")].map(e =>
[...$(e).find("li")].map(e => $(e).text())
);
console.log(result);
Which gives
[
[ 'Fruits', 'Apple', 'Pear' ],
[ 'Veg', 'Carrot', 'Spinach' ],
]
Now, if the <div class="things"> wrapper is repeated and you want to distinguish each of these groups, you can modify the above code as follows:
const cheerio = require("cheerio"); // 1.0.0-rc.12
const html = `
<div class="things">
<h5 class="heading">Things</h5>
<ul>
<li class="sub-heading">Fruits</li>
<li class="fruit-item">Apple</li>
<li class="fruit-item">Pear</li>
</ul>
<ul>
<li class="sub-heading">Veg</li>
<li class="veg-item">Carrot</li>
<li class="veg-item">Spinach</li>
</ul>
</div>
<div class="things">
<h5 class="heading">Things 2</h5>
<ul>
<li class="sub-heading">Foo</li>
<li class="fruit-item">Bar</li>
<li class="fruit-item">Baz</li>
</ul>
</div>
`;
const $ = cheerio.load(html);
const result = [...$(".things")].map(e =>
[...$(e).find("ul")].map(e =>
[...$(e).find("li")].map(e => $(e).text())
)
);
console.log(JSON.stringify(result, null, 2));
This gives:
[
[
[
"Fruits",
"Apple",
"Pear"
],
[
"Veg",
"Carrot",
"Spinach"
]
],
[
[
"Foo",
"Bar",
"Baz"
]
]
]
In other words, there's an extra layer:
- .things
- ul
- li
as opposed to the top code, which flattens .things:
- .things ul
- li
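If the goal is instead what the question literally asks for (every <ul>, with its list items, inside one new <div>), one option is to serialize each <ul> separately and wrap the pieces. This is a minimal sketch, not the only way to do it: wrapInDiv is a hypothetical helper name, and the fragments are assumed to be outer-HTML strings, one per <ul>.

```javascript
// Hypothetical helper: joins already-serialized HTML fragments
// and wraps them in a single enclosing <div>.
function wrapInDiv(fragments) {
  return `<div>${fragments.join("")}</div>`;
}

// Stand-ins for the outer HTML of each <ul> (assumed input).
const fragments = [
  "<ul><li>Fruits</li><li>Apple</li></ul>",
  "<ul><li>Veg</li><li>Carrot</li></ul>",
];

const wrapped = wrapInDiv(fragments);
console.log(wrapped);
```

With Cheerio, the fragments could come from $(".things ul").toArray().map(el => $.html(el)), because $.html(el) serializes the element itself rather than only its contents. The <h5> is skipped simply because it is never selected.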

Related

how to write css selector for scrapy?

I have the following web page:
<div id="childcategorylist" class="link-list-container links__listed" data-reactid="7">
<div data-reactid="8">
<strong data-reactid="9">Categories</strong>
</div>
<div data-reactid="10">
<ul id="categoryLink" aria-label="shop by category" data-reactid="11">
<li data-reactid="12">
Contact Lenses
</li>
<li data-reactid="14">
Beauty
</li>
<li data-reactid="16">
Personal Care
</li>
I want a CSS selector for the href attributes of the links under each li tag, i.e. for Contact Lenses, Beauty, and Personal Care. How do I write it?
I am currently writing it as follows:
#childcategorylist li
which gives me the following output:
['<li class="titleitem" data-reactid="16"><strong data-reactid="17">Categories</strong></li>']
Please help!
I am not an expert in Scrapy, but HTML elements should usually have a .text attribute.
If not, you might want to use a regular expression to extract the text between > and <, like this:
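import re
txt = someArraycontainingStrings[0]
# [^<]+ also matches spaces and newlines, e.g. "Contact Lenses"
x = re.search(r">([^<]+)</", txt)
Maybe that gives you proper results; the captured text is then available as x.group(1).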

How to render ul li value dynamically from a sample json in angular 6

I have a sample JSON, and I need to create a sidebar using ul and li tags whose values come from that JSON. My JSON structure looks like this:
{"filter":{"Category1":{"value":["one","two","three"]},"Category2":{"value":["four","five","six"]}}}
I have already done this in AngularJS here: http://plnkr.co/edit/D8M1U81tVz3UuzjWathk?p=preview , but it does not work in Angular 6. Can anyone please help me? I am new to Angular. Here is the code:
app.html
<ul *ngFor="(x, y) of items.filter">
<li class="parent"><b>{{x}}</b></li>
<li class="child">
<ul>
<li *ngFor="p of y.value">{{p}}</li>
</ul>
</li>
</ul>
app.ts
export class Heroes {
let items = {"filter":{"Category1":{"value":["one","two","three"]},"Category2":{"value":["four","five","six"]}}};
}
I suggest working with iterable objects when you use *ngFor in your component's HTML.
So, here is my solution:
<ul *ngFor="let parent of (items.filter | keyvalue)">
<li class="parent">{{ parent.key }}</li>
<ul *ngFor="let child of (parent.value.value | keyvalue)">
<li class="child">{{ child.value }}</li>
</ul>
</ul>
First, I used Angular's keyvalue pipe (https://angular.io/api/common/KeyValuePipe). With it, you can iterate over your JSON object as it is, without changing it.
Here is a working example: https://stackblitz.com/edit/angular-gqaaet
You need to change the json format.
items = {"filter":
[{
"name":"Category1",
"value":["one","two","three"]
},
{
"name":"Category2",
"value":["four","five","six"]
}
]};
HTML
<ul *ngFor="let item of items.filter">
<li class="parent"><b>{{item.name}}</b></li>
<li class="child">
<ul>
<li *ngFor="let p of item.value">{{p}}</li>
</ul>
</li>
</ul>
The above code works. *ngFor relies on the iteration protocols, so it needs an iterable such as an array. The JSON you posted is a map (filter) of maps (one per category), each holding a value entry, and plain objects like these are not iterable.
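The reshaping described in this answer does not have to be done by hand. The following is a minimal sketch in plain JavaScript, assuming the object shape from the question; toFilterArray is a hypothetical helper name, not part of Angular.

```javascript
// Original shape from the question: an object keyed by category name.
const items = {
  filter: {
    Category1: { value: ["one", "two", "three"] },
    Category2: { value: ["four", "five", "six"] },
  },
};

// Hypothetical helper: turns { name: { value } } pairs into
// an array of { name, value } objects that *ngFor can iterate.
function toFilterArray(filter) {
  return Object.entries(filter).map(([name, { value }]) => ({ name, value }));
}

const converted = { filter: toFilterArray(items.filter) };
console.log(JSON.stringify(converted, null, 2));
```

Object.entries keeps the insertion order of string keys, so the categories come out in the same order as in the source object.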

selenium, xpath: How to select a node within node?

I have a webpage that has a structure like this:
<div class="l_post j_l_post l_post_bright "...>
...
<div class="j_lzl_c_b_a core_reply_content">
<li class="lzl_single_post j_lzl_s_p first_no_border" ...>
<div class="lzl_cnt">
content
</div>
</li>
<li class="lzl_single_post j_lzl_s_p first_no_border" ...>
...
</li>
</div>
</div>
<div class="l_post j_l_post l_post_bright "...>
...(contain content, same as above)
</div>
...
Currently I can select all the content in one step like this:
for i in driver.find_elements_by_xpath('//*[@class="lzl_cnt"]'):
    print(i.text)
But as you can see, the webpage consists of repetitive blocks (<div class="l_post j_l_post l_post_bright "...>...</div>) that contain the contents I need. I therefore want to get those contents separately, along with other information that differs between the blocks. Moreover, I want the contents within each <li class="lzl_single_post"...> to be kept separate, so that they are easier to process later. I tried this:
items = []
# get each block
for sel in driver.find_elements_by_xpath('//div[@class="l_post j_l_post l_post_bright "]'):
    name = sel.find_element_by_css_selector('.d_name').text
    try:
        content = sel.find_element_by_css_selector('.j_d_post_content').text
    except:
        content = ''
    try:
        reply = []
        # get each post within the specific block
        for i in sel.find_elements_by_xpath('//*[@class="lzl_cnt"]'):
            reply.append(i.text)
    except:
        reply = []
    items.append({'name': name, 'content': content, 'reply': reply})
But the result shows that I am getting all the replies on the webpage every time the outer for loop runs, instead of the set of replies for each individual block that I wanted.
Any suggestions?
Just add . (the context node) to the start of the XPath:
sel.find_elements_by_xpath('.//*[@class="lzl_cnt"]')
Note that //*[@class="lzl_cnt"] means all nodes in the DOM with the "lzl_cnt" class name, while .//*[@class="lzl_cnt"] means all nodes with the "lzl_cnt" class name that are descendants of sel.
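The difference between the document-wide and the context-relative search can be illustrated without a browser. The sketch below is only a toy model in plain JavaScript, not Selenium or a real XPath engine: makeNode and findDescendants are hypothetical helpers, and the tree stands in for two post blocks with their replies.

```javascript
// Toy DOM: each node has a class name, some text, and children.
const makeNode = (cls, text = "", children = []) => ({ cls, text, children });

const block1 = makeNode("l_post", "", [
  makeNode("lzl_cnt", "reply A1"),
  makeNode("lzl_cnt", "reply A2"),
]);
const block2 = makeNode("l_post", "", [makeNode("lzl_cnt", "reply B1")]);
const root = makeNode("root", "", [block1, block2]);

// Hypothetical helper: collects every descendant of `start` with class `cls`.
function findDescendants(start, cls) {
  const hits = [];
  for (const child of start.children) {
    if (child.cls === cls) hits.push(child);
    hits.push(...findDescendants(child, cls));
  }
  return hits;
}

// Like '//*[@class="lzl_cnt"]': the search starts at the document root,
// no matter which block the loop is currently visiting.
const absolute = [block1, block2].map(() =>
  findDescendants(root, "lzl_cnt").map((n) => n.text)
);

// Like './/*[@class="lzl_cnt"]': the search starts at the current block.
const relative = [block1, block2].map((block) =>
  findDescendants(block, "lzl_cnt").map((n) => n.text)
);

console.log(absolute); // every block sees all three replies
console.log(relative); // each block sees only its own replies
```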

ng repeat does not return variable from JSON file

I have the following html code that belongs to a template in AngularJS framework:
<ul class="sb-parentlist">
<h1 class={{headerClass}}> Popular</h1><div data-ng-repeat="data in data track by $index">
<li>
<span class="sb-text-title" href="#" ng-click="openFaq = ! openFaq"><b>{{data[$index].faq.frage |translate}}</b></span>
<span ng-show="openFaq" class="sb-text">
<br>
{{data[$index].faq.antwort |translate}}
</span>
</li>
</div>
</ul>
The correct number of "li" elements is rendered in my browser, but the bound values are not filled in as they should be: the entries appear blank.
Here is the JSON entry:
{
"faq":
{"frage":"HB_START_FAQ_Q",
"antwort":"HB_START_FAQ_A"}
,
"screencast":"HB_START_SCREENCAST"
},
{
"faq":
{"frage":"HB_START_FAQ_Q_1",
"antwort":"HB_START_FAQ_A_1"}
,
"screencast":"HB_START_SCREENCAST_1"
},
{
"faq":
{"frage":"HB_START_FAQ_Q_2",
"antwort":"HB_START_FAQ_A_2"}
,
"screencast":"HB_START_SCREENCAST_2"
},
{
"faq":
{"frage":"HB_START_FAQ_Q_3",
"antwort":"HB_START_FAQ_A_3"}
,
"screencast":"HB_START_SCREENCAST_3"
}
I want to get the nested items. Any ideas?
Because data is ambiguous between the collection name and the item being iterated over, change your ngRepeat syntax:
data-ng-repeat="item in data track by $index"
Then use item.faq.frage and item.faq.antwort. Since item is already the current element, you don't need to select by $index.
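The blank entries can be reproduced outside AngularJS. The following plain JavaScript sketch is only an analogy for how the loop variable hides the collection of the same name; the sample values are made up, and the optional chaining stands in for AngularJS expressions silently rendering undefined as blank.

```javascript
// Made-up sample data with the same nesting as the JSON in the question.
const data = [
  { faq: { frage: "Q0", antwort: "A0" } },
  { faq: { frage: "Q1", antwort: "A1" } },
];

// Mimics `data in data`: inside the callback, `data` is the current item,
// so data[$index] looks up a numeric key on a single object and finds nothing.
const shadowed = data.map((data, $index) => data[$index]?.faq?.frage);

// Mimics `item in data`: the item is used directly, with no indexing.
const fixed = data.map((item) => item.faq.frage);

console.log(shadowed); // both entries are undefined
console.log(fixed);    // the actual question keys
```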

Play Framework: Html Template content not escaping properly

I'm trying to populate a list in an HTML template using a preexisting module:
@{Nav.list.map( l =>
l.id match {
case "Art" => { <li id="art"><span>Articles</span></li> }
case "Due" => { <li id="toggle"><span>Links</span>
<div id="drawer">
<div id="drawerContent" style="display:none;">
<ul>
<li><span>link title 2</span></li>
<li><span>link title 3</span></li>
<li><span>link title 4</span></li>
</ul>
</div>
</div>
</li> }
case _ => { <li id="@l.id"><span>@l.title</span></li> }
} )}
The @ isn't functioning as an escape character in the final case and instead just gets output literally as @l.id, etc. I originally did this with nested if/else statements and very verbose brackets; that worked but wasn't very nice on the eyes. I think the formatter is having problems with nested Scala constructs, but I'm not sure.
I tried using for instead of map, and I tried enclosing and escaping the match construct. Both compile, but the issue remains.
I think the issue is that inside @{} you are already in Scala code, so @ is not treated as an escape there. In the last case you can use Scala XML expressions instead:
case _ => <li id={l.id}><a href={l.href} title={l.title}><span>{l.title}</span></a></li>
Alternatively, I think you can do:
@Nav.list.map( l => ... )