Is there a Markdown flavor or extension that lets you structure a document using nested blocks instead of heading levels?
My problem with the latter approach is that if you split your Markdown into several files (and then use pandoc or anything else to combine them), you have to keep track of your heading levels across multiple files. And if one day you decide to turn a sub-sub-section into a sub-section, you need to change all the headings inside that section.
For now, I have a plugin that allows me to parse a Markdown file like this (changing the rules for code blocks and headings):
# Title
A paragraph
    # Sub-title
    A paragraph in a sub-section
Into this:
<section>
  <h1>Title</h1>
  <p>A paragraph</p>
  <section>
    <h2>Sub-title</h2>
    <p>A paragraph in a sub-section</p>
  </section>
</section>
Is there a problem with this indentation-based approach? I suppose I am not the first one to try to use a nested structure, but I can't find useful resources online.
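To make the idea concrete, here is a minimal Python sketch of such a converter. It is not the plugin mentioned above: the function name `nested_markdown_to_html`, the 4-space indent width, and the rule that only lines starting with `# ` are headings are all assumptions made for illustration.

```python
import html

def nested_markdown_to_html(text, indent_width=4):
    """Turn indentation-nested '# Heading' / paragraph lines into nested <section>s.

    Assumption: the heading level is derived from the indentation
    (indent_width spaces per level), not from the number of '#' characters.
    """
    out = []
    depth = 0  # number of currently open <section> elements
    for raw in text.splitlines():
        if not raw.strip():
            continue
        level = (len(raw) - len(raw.lstrip(" "))) // indent_width + 1
        content = raw.strip()
        if content.startswith("# "):
            # Close sections at the same or a deeper level before opening a new one.
            while depth >= level:
                depth -= 1
                out.append("  " * depth + "</section>")
            out.append("  " * depth + "<section>")
            depth += 1
            title = html.escape(content[2:])
            out.append("  " * depth + f"<h{level}>{title}</h{level}>")
        else:
            # Paragraphs are attached to whichever section is currently open.
            out.append("  " * depth + f"<p>{html.escape(content)}</p>")
    while depth > 0:
        depth -= 1
        out.append("  " * depth + "</section>")
    return "\n".join(out)

doc = "# Title\nA paragraph\n    # Sub-title\n    A paragraph in a sub-section\n"
print(nested_markdown_to_html(doc))
```

With this scheme, moving a section up or down one level only means re-indenting it; the heading markers inside it stay untouched.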
Related
I searched the web for the purpose of the HTML <var> tag and didn't find a good explanation, or let's say I'm not satisfied yet. I can read what they say about it, but I don't understand the purpose. I tried two different lines of code and both give me the same result, so now I need to know what exactly <var> is and why we should use it rather than a single style.
<var>y</var> = <var>m</var><var>x</var> + <var>b</var>
<p style='font-style:italic'>y = mx + b</p>
Reference to name only one: https://html.com/tags/var/
Funny because I read the explanation but I still don't see what is the use of <var> other than just making the text italic!
Here is how W3Schools defines HTML:
HTML stands for Hyper Text Markup Language
HTML is the standard markup language for creating Web pages
HTML describes the structure of a Web page
HTML consists of a series of elements
HTML elements tell the browser how to display the content
HTML elements label pieces of content such as "this is a heading", "this is a paragraph", "this is a link", etc.
The way I see it, even though <var> and <i> produce the same output in the browser, they mean different things, especially if you are "reading" pages without opening a browser, like search engines do.
Note that this is not particular to the example you mentioned. Look at the example on <b> and <strong> (https://www.w3schools.com/html/html_formatting.asp). They also have the same output but mean different things.
Semantics.
<p> tags are generic paragraph elements, typically used for text.
<var> elements represent the name of a variable in a mathematical expression or a programming context.
If you italicize a paragraph it may resemble the default styling of the <var> element, but that's where the similarities end. Also, they're different to screen readers.
Here's an example using both elements and you can see that semantically, it's a paragraph of text that contains references to variables in a mathematical sense:
<p>The volume of a box is <var>l</var> × <var>w</var> × <var>h</var>, where <var>l</var> represents the length, <var>w</var> the width and <var>h</var> the height of the box.</p>
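One way to see the semantic difference is from the point of view of a program that reads the markup instead of rendering it. The following Python sketch uses only the standard library; the class name `VarCollector` and the helper `variables_in` are made up for this illustration.

```python
from html.parser import HTMLParser

class VarCollector(HTMLParser):
    """Collect the text content of every <var> element."""

    def __init__(self):
        super().__init__()
        self.in_var = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "var":
            self.in_var = True

    def handle_endtag(self, tag):
        if tag == "var":
            self.in_var = False

    def handle_data(self, data):
        if self.in_var:
            self.names.append(data.strip())

def variables_in(html_source):
    """Return the sorted set of variable names marked up with <var>."""
    collector = VarCollector()
    collector.feed(html_source)
    return sorted(set(collector.names))

print(variables_in("<var>y</var> = <var>m</var><var>x</var> + <var>b</var>"))
print(variables_in("<p style='font-style:italic'>y = mx + b</p>"))
```

The first call can list the variables; the second cannot, because an italic paragraph carries no information about which characters are variables.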
Background:
I am working on a simple web scraper for learning purposes. I am trying to scrape the main headings (<h2>) and the sub-headings (<h3>) from the Wikipedia page about the Ruby programming language. I can access each of these individually, but I would like to write my code in a way that any Wikipedia article could be substituted in.
Main question:
I am looking for a way to list all the <h3> elements that lie between the <h2> elements on the page. Is there a way to do that directly via Nokogiri, or will it involve using some Ruby as a workaround?
Basically, I want to be able to list each main heading with its accompanying sub-headings, but I cannot see a way to group them, as Wikipedia does not group them in its HTML.
Thank you for your time.
-M
I would use Nokogiri's CSS selectors. The Bastard's Book of Ruby has a good primer on that. http://ruby.bastardsbook.com/chapters/html-parsing/
In your case, you'd want to use the following:
page.css('h2:not([id]) > span.mw-headline, h3:not([id]) > span.mw-headline')
Based on what I see in the dev tools console for Wikipedia pages, the main headings and subheadings do not have ID attributes, which is why I use the :not([id]) pseudo-selector. It'll look for all h2 and h3 elements that do not have IDs. Each nested span with the heading title has the .mw-headline class.
If you only want the h3 elements (each section's sub-heading), you can just have:
page.css('h3:not([id]) > span.mw-headline')
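The selectors above return a flat list, so the grouping still has to be done by walking the headings in document order. As a language-neutral illustration, here is a minimal sketch in Python using only the standard library rather than Nokogiri; the class name `HeadingOutline` and the sample headings are made up for the example.

```python
from html.parser import HTMLParser

class HeadingOutline(HTMLParser):
    """Group every <h3> title under the most recent <h2> title."""

    def __init__(self):
        super().__init__()
        self.outline = []     # list of [h2_title, [h3_titles]]
        self._current = None  # "h2" or "h3" while inside such a heading
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # e.g. the closing </span> nested inside the heading
        title = "".join(self._buffer).strip()
        if tag == "h2":
            self.outline.append([title, []])
        elif self.outline:  # ignore any h3 that precedes the first h2
            self.outline[-1][1].append(title)
        self._current = None

# Hypothetical sample markup, shaped like the headings described above.
sample = """
<h2><span class="mw-headline">History</span></h2>
<h3><span class="mw-headline">Early concept</span></h3>
<h3><span class="mw-headline">First publication</span></h3>
<h2><span class="mw-headline">Syntax</span></h2>
"""
parser = HeadingOutline()
parser.feed(sample)
print(parser.outline)
# [['History', ['Early concept', 'First publication']], ['Syntax', []]]
```

The same walk can be written in Ruby by iterating over the combined h2/h3 node list and starting a new group whenever an h2 is seen.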
related questions before:
HTML XPath: Extracting text mixed in with multiple tags?
HTML XPath: Selectively avoiding tags when extracting text
//sorry for my poor English
I'm a beginner at writing web crawlers. I'm trying to extract the main content from web pages (in Chinese) with XPath (though I have learned that there are both traditional and machine-learning algorithms for extracting main content), and I'm very new to writing XPath rules.
I'm faced with a web page that contains text mixed into complex tags. I summarize it as follows, where letters (e.g. A, A2) stand for text only and '...' stands for more tags, possibly nested, without text. I want to get "AA2BB2CDEFGHIJKLMNOP".
...
<div id="artibody" class="art_context">
<div align="center">...</div>
<div align="center"><font>A</font>A2</div>
<div align="left"><br><br><strong>B</strong>B2</div>
<div align="left">
<p>C<a>D</a>E</p>
<p>F<a>G</a>H<a>I</a>J</p>K
</div>
<div align="center">...</div>
<div align="center"><font>L</font></div>
<p>M</p><!--M contains only text luckily-->
<p>N</p>
<p>O</p>
<p>P<span>...</span><div class="shareBox">...</div>
</p>
<span id="arctTailMark"></span>
<script>
var page_navigation = document.getElementById('page_navigation');
...
</script>
<div style="padding:10px 0 30px 0">...</div>
</div>
Thanks to the previous questions, I wrote this rule:
'string(//div[@class=\"art_context\"])'
It gives me all the content I want as plain text without tags, but the JS code in <script> is extracted as well. I tried the following, but it did not help; the JS code is still in the result.
'string(//div[@class=\"art_context\" and not(self::script)])'
The following one returns only "\r\n".
'//div[@class=\"art_context\" and not(self::script)]/text()'
Here are my questions:
1. How do I write an XPath rule that meets my need: extracting the content of div[@id="artibody"] except the code in <script>?
2. Is the rule for question 1 simple and robust? I may meet more pages with a div[@id="artibody"] whose descendant nodes are quite different.
3. Any further suggestions for my task? I am extracting content from one website, but the main content lies in <div> elements with different ids, classes, and descendant node structures. I run the spider on my laptop (Intel corei5 3225, 8G RAM), and machine learning algorithms may slow the crawl significantly. At the same time, writing many XPath rules is tedious.
I'd appreciate any suggestions on this question (and on my English).
To get all descendant text nodes except the script contents, you can use this:
//div[@class="art_context"]//*[not(self::script)]/text()
In natural language: “Get all text nodes from descendants of all div[@class="art_context"] elements that are not script elements”.
The // after div[@class="art_context"] is needed to select descendants, not just children.
In comparison, the //div[@class="art_context" and not(self::script)]/text() expression in the question says “Get all text-node children of all div[@class="art_context"] elements that are not also script elements.”
So the and not(self::script) part of the expression in the question is redundant: a div is never a script element, so the expression just selects //div[@class="art_context"] anyway, and then the /text() part selects only the text-node direct children of that div, which are just line breaks.
Also, if instead of using XPath to just get the set of text nodes you want the result as a single string, you can use the functions string-join(…) and normalize-space(…) (note that string-join requires XPath 2.0 or later):
normalize-space(string-join(//div[@class="art_context"]//*[not(self::script)]/text(), ""))
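If an XPath 2.0 engine is not available, the same "all text except script contents" idea can be approximated in plain code. The sketch below swaps XPath for Python's standard `html.parser`; the names `TextWithoutScripts` and `visible_text` are made up for this example, and it processes the whole input rather than only the art_context div.

```python
from html.parser import HTMLParser

class TextWithoutScripts(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def visible_text(html_source):
    """Join all kept text nodes and collapse whitespace runs to single spaces."""
    parser = TextWithoutScripts()
    parser.feed(html_source)
    parser.close()
    return " ".join("".join(parser.chunks).split())

sample = (
    '<div class="art_context"><div><font>A</font>A2</div>'
    "<p>C<a>D</a>E</p><script>var x = 1;</script></div>"
)
print(visible_text(sample))  # AA2CDE
```

HTML comments are not text nodes for this parser, so they are left out as well.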
If you had to identify the main text of a page (e.g. on a blog page, to identify the post's content), what would you do? What do you think is the simplest way to do it?
Get the page content with cURL
Maybe use a DOM parser to identify the elements of the page
That's a pretty hard task, but I would start by counting spaces inside DOM elements. A telltale sign of human-readable content is spaces and periods. Most articles seem to encapsulate the content in paragraph tags, so you could look at all p tags with at least n spaces and at least one punctuation mark.
You could also use the number of grouped paragraph tags inside an element. So if a div has N paragraph children, it could very well be the content you want to extract.
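The first heuristic can be sketched in a few lines of Python with the standard library. The names `ParagraphCollector` and `likely_content`, the threshold `min_spaces=5`, and the punctuation set are arbitrary choices for illustration, not tuned values.

```python
import re
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def _flush(self):
        if self._in_p:
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes an open one
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

def likely_content(html_source, min_spaces=5):
    """Keep paragraphs with enough spaces and at least one sentence mark."""
    collector = ParagraphCollector()
    collector.feed(html_source)
    collector.close()
    return [p for p in collector.paragraphs
            if p.count(" ") >= min_spaces and re.search(r"[.!?]", p)]

sample = ("<p>Home</p>"
          "<p>This is a real sentence with enough words in it.</p>"
          "<p>Read more</p>")
print(likely_content(sample))
```

Short navigation labels such as "Home" or "Read more" fall below the space threshold and are dropped.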
There are some frameworks that can achieve this; one of them is http://code.google.com/p/boilerpipe/, which uses some statistics.
Some features that can detect html block with main content:
p, div tags
amount of text inside/outside
amount of links inside/outside (i.e. remove menus)
some CSS class names and ids (frequently those blocks have classes or ids like main, main_block, content, etc.)
relation between title and text inside content
You might consider:
Boilerpipe: "The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings."
Ruby Readability: "Ruby Readability is a tool for extracting the primary readable content of a webpage. It is a Ruby port of arc90's readability project."
The Readability API: "If you'd like access to the Readability parser directly, the Content API is available upon request. Contact us if you're interested."
It seems like the best answer is "it depends". As in, it depends on how the site in question is marked up.
If the author uses "common" tags, you could look for a container element ID'd as "content" or "main."
If the author is using HTML5, you should in theory be able to query for the <article> element, if it's a page with only one "story" to tell.
Recently I faced the same problem. I developed a news article scraper and I had to detect the main textual content of the article pages. Many news sites display lots of other textual content beside the "main article" (e.g. 'read next', 'you might be interested in'). My first approach was to collect all text between <p> tags. But this didn't work, because some news sites also used <p> for other elements like navigation, 'read more', etc. Some time ago I stumbled on the Boilerpipe library.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
That sounded like the perfect solution for my problem, but it wasn't. It failed at many news sites, because it was often not able to parse the whole text of the news article. I don't know why, but think that the boilerpipe algorithm can't deal with badly written html. So in many cases it just returned an empty string and not the main content of the news article.
After this bad experience I tried to develop my own "article text extractor" algorithm. The main idea was to split the html into different depths, for example:
<html>
  <!-- depth: 1 -->
  <nav>
    <!-- depth: 2 -->
    <ul>
      <!-- depth: 3 -->
      <li>Site<!-- depth: 4 --></li>
      <li>Site<!-- depth: 4 --></li>
    </ul>
  </nav>
  <div id='text'>
    <!-- depth: 2 -->
    <p>That's the main content...<!-- depth: 3 --></p>
    <p>main content, bla, bla bla ... <!-- depth: 3 --></p>
    <p>bla bla bla interesting bla bla! <!-- depth: 3 --></p>
    <p>whatever, bla... <!-- depth: 3 --></p>
  </div>
</html>
As you can see, to filter out the surplus "clutter" with this algorithm, things like navigation elements, "you may like" sections, etc. must be at a different depth than the main content. Or in other words: the surplus "clutter" must be described with more (or fewer) HTML tags than the main textual content.
Calculate the depth of every html element.
Find the depth with the highest amount of textual content.
Select all textual content with this depth
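The three steps above can be sketched as follows. This is not the author's Ruby script, just a minimal Python illustration using the standard library; the names `DepthText` and `main_text` are made up, and the sketch assumes reasonably well-formed HTML (unclosed tags such as a bare <p> or <li> would skew the depth count).

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag and must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

class DepthText(HTMLParser):
    """Step 1: record every text node together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.skip = 0  # > 0 while inside <script>/<style>
        self.text_by_depth = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.depth += 1
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip:
            self.text_by_depth[self.depth].append(text)

def main_text(html_source):
    parser = DepthText()
    parser.feed(html_source)
    parser.close()
    if not parser.text_by_depth:
        return ""
    # Step 2: find the depth holding the largest amount of text.
    best = max(parser.text_by_depth,
               key=lambda d: sum(len(t) for t in parser.text_by_depth[d]))
    # Step 3: select all text found at that depth.
    return " ".join(parser.text_by_depth[best])

sample = ("<html><nav><ul><li>Site</li><li>Site</li></ul></nav>"
          "<div id='text'><p>Main content here.</p>"
          "<p>More main content.</p></div></html>")
print(main_text(sample))  # Main content here. More main content.
```

In the sample, the navigation text sits at depth 4 and the paragraphs at depth 3, so only the paragraphs are returned.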
To prove this concept I wrote a Ruby script, which works well with most news sites. In addition to the Ruby script, I also developed the textracto.com API, which you can use for free.
Greetings,
David
It depends very much on the page. Do you know anything about the page's structure beforehand? If you are in luck, it might provide an RSS feed that you could use or it might be marked up with some of the new HTML5 tags like <article>, <section> etc. (which carry more semantic power than pre-HTML5 tags).
I've ported the original boilerpipe Java code into a pure Ruby implementation, Ruby Boilerpipe. There is also a JRuby version wrapping the original Java code, Jruby Boilerpipe.
I'm trying to figure out how to reference another area of a page with Markdown. I can get it working if I add a
<div id="mylink" />
and for the link do:
[My link](#mylink)
But my guess is that there's some other way to do an in-page link in Markdown that doesn't involve the straight up div tag.
Any ideas?
See this answer.
In summary make a destination with
<a name="sometext"></a>
inserted anywhere in your Markdown markup, for example in a header:
## heading<a name="headin"></a>
and link to it using the markdown linkage:
[This is the link text](#headin)
or
[some text](#sometext)
Don't use <div> -- this will mess up the layout for many renderers.
(I have changed id= to name= above. See this answer for the tedious explanation.)
I guess this depends on what you're using to generate HTML from your Markdown. I noticed that Jekyll (used by github.io pages by default) automatically adds the id="" attribute to headings in the HTML it generates.
For example, if your Markdown is
My header
---------
The resulting html will look like this:
<h2 id="my-header">My header</h2>
So you can link to it simply by [My link](#my-header)
With the PHP version of Markdown, you can also link headers to fragment identifiers within the page using a syntax like either of the following, as documented here
Header 1 {#header1}
========
## Header 2 ## {#header2}
and then
[Link back to header 1](#header1)
[Link back to header 2](#header2)
Unfortunately this syntax is currently only supported for headers, but at least it could be useful for building a table of contents.
The destination anchor for a link in an HTML page may be any element with an id attribute. See Links on the W3C site. Here's a quote from the relevant section:
Destination anchors in HTML documents may be specified either by the A element (naming it with the name attribute), or by any other element (naming with the id attribute).
Markdown treats HTML as HTML (see Inline HTML), so you can create your fragment identifiers from any element you like. If, for example, you want to link to a paragraph, just wrap the paragraph in a paragraph tag, and include an id:
<p id="mylink">Lorem ipsum dolor sit amet...</p>
Then use your standard Markdown [My link](#mylink) to create a link to fragment anchor. This will help to keep your HTML clean, as there's no need for extra markup.
For anyone using Visual Studio Team Foundation Server (TFS) 2015: it really does not like embedded <a> or <div> elements, at least in headers. It doesn't like emoji in headers either:
### 🔧 Configuration 🔧
Lorem ipsum problem fixem.
Gets translated to:
<h3 id="-configuration-">🔧 Configuration 🔧</h3>
<p>Lorem ipsum problem fixem.</p>
And so links should either use that id (which breaks this and other preview extensions in Visual Studio), or remove the emoji:
Here's [how to setup](#-configuration-) //🔧 Configuration 🔧
Here's [how to setup](#configuration) //Configuration
Where the latter version works both online in TFS and in the markdown preview of Visual Studio.
In Pandoc Markdown you can set anchors on arbitrary spans inside a paragraph using syntax [span]{#anchor}, e.g.:
Paragraph, containing [arbitrary text]{#mylink}.
And then reference it as usual: [My link](#mylink).
If you want to reference a whole paragraph then the most straightforward way is to add an empty span right in the beginning of the paragraph:
[]{#mylink}
Paragraph text.