Creating a table of contents in HTML

Is it possible to create an ordered list like the following?
I would like this for a table of contents I'm creating.
1. Intro
2. Section1
2.1 SubSection1
2.2 SubSection2
3. Section2
.....
I have the following, but each subsection restarts from 1.
<ol>
<li>
</li>
<li>
</li>
<ol>
<li>
</li>
<li>
</li>
</ol>
</ol>
Thanks

This can indeed be done with pure CSS:
ol {
    counter-reset: item;
}
li {
    display: block;
}
li:before {
    content: counters(item, ".") " ";
    counter-increment: item;
}
Same example as a fiddle.
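To see what counters(item, ".") ends up printing, here is a minimal Python sketch of the same numbering rule. It only illustrates the counter logic and is not something the browser runs; the nested list `toc` and the helper `number_items` are made-up names for this example.

```python
def number_items(items, prefix=()):
    """Yield (label, title) pairs the way counters(item, ".") labels nested <li>s.

    `items` is a hypothetical nested structure: a list of (title, children) tuples.
    """
    for index, (title, children) in enumerate(items, start=1):
        # counter-increment: the counter of the current level goes up by one
        path = prefix + (index,)
        # counters(item, "."): join the counters of all enclosing levels
        yield ".".join(map(str, path)), title
        # a nested <ol> starts a fresh counter one level deeper (counter-reset)
        yield from number_items(children, path)


toc = [
    ("Intro", []),
    ("Section1", [("SubSection1", []), ("SubSection2", [])]),
    ("Section2", []),
]

for label, title in number_items(toc):
    print(label, title)
```

Each nested list gets its own counter, and the label is simply the chain of counters from the outermost list down to the current item.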

There's quite a number of jQuery plugins to generate a table of contents.
Look at this one for starters
Another one here, with ordered lists

Have you seen this post:
Number nested ordered lists in HTML
I don't think it can be done without using JS.

This code leads to the desired output for me:
<ol>
    <li>
        foo
    </li>
    <li>
        bar
        <ol>
            <li>
                baz
            </li>
            <li>
                qux
            </li>
        </ol>
    </li>
    <li>
        alpha
        <ol>
            <li>
                beta
            </li>
            <li>
                gamma
            </li>
        </ol>
    </li>
</ol>
CSS:
ol {
    counter-reset: item;
}
li {
    display: block;
}
li::before {
    content: counters(item, ".") ". ";
    counter-increment: item;
}
Fiddle: http://jsfiddle.net/Lepetere/evm8wyj5/1/

I was not happy with the existing solutions, so I created one with Python 3 and BeautifulSoup.
The function takes the HTML source as a string and looks for header tags (e.g. h1). It then creates an id for each header and a corresponding TOC entry that links to it.
import uuid
from bs4 import BeautifulSoup
def generate_toc(html_out):
    """Create a table of contents based on the header tags.
    The header tags are used to create and link the TOC.
    The TOC is placed at the top of the HTML output.
    Args:
        html_out (str): A string containing the HTML source.
    Returns:
        str: The HTML source with the TOC inserted.
    """
    # the parser
    soup = BeautifulSoup(html_out, 'html.parser')
    # create and place the div element containing the toc
    toc_container = soup.new_tag('div', id='toc_container')
    first_body_child = soup.body.find_all(recursive=False)[0]
    first_body_child.insert_before(toc_container)
    # toc headline
    t = soup.new_tag('p', attrs={'class': 'toc_title'})
    t.string = 'Inhalt'  # German for "Contents"
    toc_container.append(t)
    def _sub_create_anchor(h_tag):
        """Create a toc entry based on a header tag.
        The result is a li-tag containing an a-tag.
        """
        # create the anchor (as a string, so it can be used as an attribute value)
        anchor = str(uuid.uuid4())
        h_tag.attrs['id'] = anchor  # anchor to headline
        # toc entry for that anchor
        a = soup.new_tag('a', href=f'#{anchor}')
        # get_text() also works when the header contains nested tags
        a.string = h_tag.get_text(strip=True)
        # add to toc
        li = soup.new_tag('li')
        li.append(a)
        return li
    # main ul-tag for the first level of the toc
    ul_tag = soup.new_tag('ul', attrs={'class': 'toc_list'})
    toc_container.append(ul_tag)
    # helper variables
    curr_level = 1
    ul_parents = [ul_tag]
    # header tags to look for
    h_tags_to_find = [f'h{i}' for i in range(1, 7)]  # 'h1' - 'h6'
    for header in soup.find_all(h_tags_to_find):
        next_level = int(header.name[1:])
        if curr_level < next_level:  # going downstairs
            # create sub ul-tag
            sub_ul_tag = soup.new_tag('ul', attrs={'class': 'toc_list'})
            # connect it with parent ul-tag
            ul_parents[-1].append(sub_ul_tag)
            # remember the sub-ul-tag
            ul_parents.append(sub_ul_tag)
        elif curr_level > next_level:  # going upstairs
            # go back one ul-tag per level, but never drop the main ul-tag
            for _ in range(curr_level - next_level):
                if len(ul_parents) > 1:
                    ul_parents.pop()
        curr_level = next_level
        # toc-entry as li-a-tag
        li_tag = _sub_create_anchor(header)
        # add to last ul-tag
        ul_parents[-1].append(li_tag)
    return soup.prettify(formatter='html5')
This may not be elegant for every use case. I use it to put TOCs at the top of HTML reports generated by data science routines (e.g. pandas).
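For comparison, here is a rough standard-library sketch of the same idea: collect the header tags, then nest them with a stack of open levels. It is not the function above. `HeadingCollector` and `build_outline` are hypothetical names, the result is a nested Python list rather than HTML, and it assumes headers may jump by more than one level at a time.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) for every h1-h6 element, in document order."""

    HEADINGS = {f"h{i}" for i in range(1, 7)}

    def __init__(self):
        super().__init__()
        self.headings = []   # finished (level, text) pairs
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments of the open heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.headings.append((self._level, "".join(self._parts).strip()))
            self._level = None


def build_outline(headings):
    """Nest (level, text) pairs into [text, children] nodes using a level stack."""
    root = []
    stack = [(0, root)]  # (level, children list) of the currently open ancestors
    for level, text in headings:
        # close every open entry that is not shallower than this heading
        while stack[-1][0] >= level:
            stack.pop()
        node = [text, []]
        stack[-1][1].append(node)
        stack.append((level, node[1]))
    return root


collector = HeadingCollector()
collector.feed("<h1>A</h1><p>text</p><h2>B</h2><h3>C</h3><h1>D</h1>")
outline = build_outline(collector.headings)
print(outline)
```

Storing the level next to each open list means a jump from h3 straight back to h1 closes both intermediate lists in one go.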

<h5>Table of contents:</h5>
<ol>
<li>Types of home fruit drying machines</li>
<li>What fruits can you dry with an electric fruit drying machine?</li>
<li>Zagores Machine Electric Fruit Dryer</li>
<li>Buy Electric Fruit Dryer</li>
<li>Gas fruit drying machine</li>
<li>Electric fruit drying machine price</li>
<li>FAQ</li>
</ol>

Related

Iterate a block of HTML, regardless of element type, with Nokogiri?

I'm trying to iterate a block of HTML with Nokogiri, regardless of what the element type is.
For example, given this variable html, passed through Nokogiri:
require 'nokogiri'
html = "<p>Some text</p><ol><li>List item 1</li><li>List item 2</li></ol><p>Last bit of text</p>"
parsed_html = Nokogiri::HTML(html)
I know I can iterate over each <p> by doing:
parsed_html.css("p").each do |p|
puts p
end
But again that only grabs all <p> tags and not the <ol> and its children.
I also know I can grab the <ol> by doing:
parsed_html.css("p, ol").each do |p|
puts p
end
But how can I iterate over all the elements regardless of explicitly stating which ones I want to iterate over?
For example, given another html block:
html = "<p>text 1</p><ol><li>item 1</li><li>item 2</li></ol><ul><li>item 1</li></ul><h2>header</h2>"
How can I return something like:
<p>text 1</p>
<ol><li>item 1</li><li>item 2</li></ol>
<ul><li>item 1</li></ul>
<h2>header</h2>
Thanks in advance.
Use the CSS child selector:
parsed_html.css('body > *')
This selects only direct children of the element(s).
irb(main):015:0> parsed_html = Nokogiri::HTML(html)
irb(main):016:0> parsed_html.css('body > *')
=> [#<Nokogiri::XML::Element:0x3c00 name="p" children=[#<Nokogiri::XML::Text:0x3bec "text 1">]>, #<Nokogiri::XML::Element:0x3c64 name="ol" children=[#<Nokogiri::XML::Element:0x3c28 name="li" children=[#<Nokogiri::XML::Text:0x3c14 "item 1">]>, #<Nokogiri::XML::Element:0x3c50 name="li" children=[#<Nokogiri::XML::Text:0x3c3c "item 2">]>]>, #<Nokogiri::XML::Element:0x3ca0 name="ul" children=[#<Nokogiri::XML::Element:0x3c8c name="li" children=[#<Nokogiri::XML::Text:0x3c78 "item 1">]>]>, #<Nokogiri::XML::Element:0x3cc8 name="h2" children=[#<Nokogiri::XML::Text:0x3cb4 "header">]>]
irb(main):017:0> parsed_html.css('body > *').map {|e| e.name }
=> ["p", "ol", "ul", "h2"]
This works since Nokogiri will create a skeleton when you use Nokogiri::HTML:
irb(main):018:0> parsed_html.to_s
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>text 1</p>\n<ol>\n<li>item 1</li>\n<li>item 2</li>\n</ol>\n<ul><li>item 1</li></ul>\n<h2>header</h2>\n</body></html>\n"
You can also just use Nokogiri::HTML.fragment instead of HTML():
frag = Nokogiri::HTML.fragment(html)
frag.children.map(&:to_html).join("\n")
Just answering the questions you wrote:
how can I iterate over all the elements
CSS accepts wildcards, so you can just:
Nokogiri::HTML(html).css("*").map(&:name)
# => ["html", "body", "p", "ol", "li", "li", "p"]
given "this html" how do I return "something like"
html = "<p>text 1</p><ol><li>item 1</li><li>item 2</li></ol><ul><li>item 1</li></ul><h2>header</h2>"
puts Nokogiri::HTML(html).css('body').inner_html
# <p>text 1</p>
# <ol>
# <li>item 1</li>
# <li>item 2</li>
# </ol>
# <ul><li>item 1</li></ul>
# <h2>header</h2>
I want to be able to iterate over all the first level child elements (p, ol, ul, h2)
Nokogiri::HTML(html).css('body').children.map(&:name)
# => ["p", "ol", "ul", "h2"]

how to write css selector for scrapy?

I have the following web page:
<div id="childcategorylist" class="link-list-container links__listed" data-reactid="7">
<div data-reactid="8">
<strong data-reactid="9">Categories</strong>
</div>
<div data-reactid="10">
<ul id="categoryLink" aria-label="shop by category" data-reactid="11">
<li data-reactid="12">
Contact Lenses
</li>
<li data-reactid="14">
Beauty
</li>
<li data-reactid="16">
Personal Care
</li>
I want a CSS selector for the links (the href attributes) under the li tags, i.e. for Contact Lenses, Beauty and Personal Care. How do I write it?
I am currently using this selector:
#childcategorylist li
which gives me the following output:
['<li class="titleitem" data-reactid="16"><strong data-reactid="17">Categories</strong></li>']
Please help!
I am not an expert in Scrapy, but HTML elements usually have a .text attribute.
If not, you might want to use a regexp to extract the text between > and <, like:
import re
txt = someArraycontainingStrings[0]
# [^<]+ also matches labels that contain spaces, e.g. "Contact Lenses"
x = re.search(r">([^<]+)</", txt)
Maybe that gives you proper results.
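If you do go the regexp route, a pattern restricted to letters cannot match labels that contain spaces, such as Contact Lenses. Below is a small sketch that pulls both the href and the link text out of each li. The anchor tags and href values are assumptions, since the links are missing from the HTML shown in the question, and `extract_link` is a made-up helper. A regexp like this is brittle; within Scrapy, a selector along the lines of `#categoryLink li a::attr(href)` is usually the safer choice.

```python
import re

# Hypothetical markup: the <a> tags and href values are assumptions.
items = [
    '<li data-reactid="12"><a href="/contact-lenses">Contact Lenses</a></li>',
    '<li data-reactid="14"><a href="/beauty">Beauty</a></li>',
    '<li data-reactid="16"><a href="/personal-care">Personal Care</a></li>',
]

# Group 1: href value, group 2: link text of an <a> element without nested tags
LINK = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>([^<]*)</a>')


def extract_link(fragment):
    """Return (href, text) for the first link in `fragment`, or None."""
    match = LINK.search(fragment)
    if match is None:
        return None
    href, text = match.groups()
    return href, text.strip()


links = [extract_link(item) for item in items]
for link in links:
    print(link)
```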

Of the same tags, I want to extract only the tags I want

I am learning web crawling with Python 3.
<ul class='report_thum_list img'>
<li>...</li>
<li>...</li>
<li>...</li>
<li>...</li>
<li>...</li>
From this, I just want to pull out the li tags.
So I wrote this:
ulTag = soup.findAll('ul', class_='report_thum_list img')
liTag = ulTag[0].findAll('li')
# print(len(liTag))
I expected 20 (there are 20 posts per page), but over 100 came out,
because there are more li tags nested inside those li tags.
I do not want to extract the li tags inside the div tags.
How can I pull out just the 20 li tags?
This is my code.
import requests
from bs4 import BeautifulSoup
# `page` is the page number, set elsewhere in the script
url = 'https://www.posri.re.kr/ko/board/thumbnail/list/63?page=' + str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'lxml')
ulTag = soup.find('ul', class_='report_thum_list img')
# liTag = ulTag.findAll('li')
liTag = ulTag.findChildren('li')
print(len(liTag))
Use a CSS selector, it's very easy to use. The > combinator matches only the li tags that are direct children of the ul:
liTag = soup.select('ul.report_thum_list > li')

Customize ordered lists increments

How can I display a custom ordered list? Is it possible to get the output below?
Tour 1: Hello
Tour 2: Whats up ?
Tour 3: Bye
Tour 4: Test Tour
You can use CSS counters and content to prepend a word to an increment. Demo
HTML
<ol>
<li>Hello</li>
<li>Whats Up</li>
<li>Bye</li>
<li>How Are You</li>
</ol>
CSS
ol {
    /* hide the default "1." markers so only the custom label shows */
    list-style: none;
    counter-reset: tour;
}
li:before {
    counter-increment: tour;
    content: "Tour " counter(tour) ": ";
}
Output
Tour 1: Hello
Tour 2: Whats Up
Tour 3: Bye
Tour 4: How Are You
Explanation
counter-reset on the <ol> creates a counter named tour and resets it to 0
Every <li> increments tour by 1 with counter-increment
The content of the :before pseudo-element is set to "Tour " + counter value + ": "
You can use a pseudo-element to achieve this effect, but I'm not sure about the colon:
<ol class="tour">
    <li>First thing's first</li>
    <li>Second's the best</li>
    <li>Why not third? Because I thought it was the best.</li>
    <li>How about fourth?</li>
</ol>
And the CSS (margins would need to be tweaked to your liking - although you could probably use positioning to achieve the same thing):
ol.tour li:before {
    content: "Tour";
    margin-left: -60px;
    margin-right: 30px;
}
ol.tour {
    margin-left: 40px;
}
Example Fiddle: http://jsfiddle.net/n4s8fo2q/

Need help in forming regular expression in perl

I need some suggestions for parsing HTML content. I need to extract the id of each <a> tag inside a div and store it in a specific variable. I have tried to write a regular expression for this, but it gets the ids of the <a> tags in all divs. I need to store only the ids of the <a> tags that are inside a specific div.
The HTML content is
<div class="m_categories" id="part_one">
<ul>
<li>-
aaa
</li>
<li>-
bbb
</li>
.
.
.
</div>
<div class="m_categories hidden" id="part_two">
<ul>
<li>-
ccc
</li>
<li>-
ddd
</li>
<li>-
eee
</li>
.
.
</div>
Need some suggestion, Thanks in advance
update:
the regex i have used
if($content=~m/sel_cat " id="([^<]*?)"/is){}
while($content=~m/sel_cat " id="([^<]*?)"/igs){}
You should really look into HTML::Parser rather than trying to use a regex to extract bits of HTML.
One way to use it to extract the id attribute from each div tag would be:
use strict;
use warnings;
use HTML::Parser;
# This parser only looks at opening tags
sub start_handler {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq 'div') {      # is it a div element?
        if ($attr->{ id }) {      # does div have an id?
            print "div id found: ", $attr->{ id }, "\n";
        }
    }
}
my $html = &read_html_somehow() or die $!;
my $p = HTML::Parser->new(api_version => 3);
# the argspec lists the values passed to the handler, in order
$p->handler( start => \&start_handler, 'self, tagname, attr, attrseq, text' );
$p->parse($html);
$p->eof;    # signal the end of the document
This is a lot more robust and flexible than a regex-based approach.
There are so many great HTML parsers around. I kind of like the Mojo suite, which allows me to use CSS selectors to get a part of the DOM:
use Mojo::DOM;
use feature 'say';
my $dom = Mojo::DOM->new($html_content);
say $_->all_text for $dom->find('a.sel_cat')->each;
# Shorter form, which may only work on older Mojolicious releases:
# say for $dom->find('a.sel_cat')->all_text;
Output:
aaa
bbb
ccc
ddd
eee
Or for the IDs:
say $_->attr('id') for $dom->find('a.sel_cat')->each;
# Shorter form, which may only work on older Mojolicious releases:
# say for $dom->find('a.sel_cat')->attr('id');
Output:
sel_cat_10018
sel_cat_10007
sel_cat_10016
sel_cat_10011
sel_cat_10025
If you only want those ids in the part_two div, use the selector #part_two a.sel_cat.
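For readers who are not tied to Perl, the same scoping idea (only look at links while inside the wanted div) can be sketched with Python's standard library. This is an illustration under assumptions: the <a> markup is reconstructed from the regex in the question and the ids in the output above, the pairing of ids with link texts is assumed from their order, and `ScopedIdCollector` is a made-up name.

```python
from html.parser import HTMLParser


class ScopedIdCollector(HTMLParser):
    """Collect the id of every <a class="sel_cat"> inside one <div id=...>."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.ids = []
        self._depth = 0  # <div> nesting depth within the target div; 0 = outside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._depth:                        # nested div inside the target
                self._depth += 1
            elif attrs.get("id") == self.div_id:   # entering the target div
                self._depth = 1
        elif tag == "a" and self._depth:
            classes = (attrs.get("class") or "").split()
            if "sel_cat" in classes and attrs.get("id"):
                self.ids.append(attrs["id"])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


# Assumed markup, see the note above.
html_content = """
<div class="m_categories" id="part_one">
  <ul>
    <li>- <a class="sel_cat " id="sel_cat_10018" href="#">aaa</a></li>
    <li>- <a class="sel_cat " id="sel_cat_10007" href="#">bbb</a></li>
  </ul>
</div>
<div class="m_categories hidden" id="part_two">
  <ul>
    <li>- <a class="sel_cat " id="sel_cat_10016" href="#">ccc</a></li>
    <li>- <a class="sel_cat " id="sel_cat_10011" href="#">ddd</a></li>
    <li>- <a class="sel_cat " id="sel_cat_10025" href="#">eee</a></li>
  </ul>
</div>
"""

collector = ScopedIdCollector("part_two")
collector.feed(html_content)
print(collector.ids)
```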