The simple_format helper in Rails takes text input and converts newline characters to p or br tags, which is the exact opposite of what I'm trying to accomplish.
How would I go about taking a snippet of HTML that looks like:
<p>Lorem</p>
<p>Ipsum.
<br />
Lorem ipsum.
<br />
Lorem ipsum.
</p>
<p>
Lorem ipsum.
</p>
And convert it to something that looks like:
Lorem\n\nIpsum.\nLorem ipsum.\nLorem ipsum.\n\nLorem ipsum.
new_html = html
  .gsub("\n", '')                   # remove existing newlines
  .gsub('</p>', "</p>\n\n")         # add a blank line after each paragraph
  .gsub('<br />', "<br />\n")       # add a newline per break tag
# strip_tags is a Rails view helper, so this needs a Rails environment
ActionController::Base.helpers
  .strip_tags(new_html)             # remove all HTML tags
  .strip                            # drop the trailing blank line
To complement the previous answer: you may also want to add .gsub('<br/>', "<br/>\n") in case the break tags are written without the space, and .gsub('</div>', "</div>\n") for divs. Just make sure you replace every tag that should break the text.
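The two answers above can be combined into a single helper. The following is only a rough sketch in plain Ruby: html_to_text is a hypothetical name, and the last regex is a crude stand-in for Rails' strip_tags, which is the safer choice on real-world input.

```ruby
# Hypothetical helper (not part of Rails): turns simple HTML into plain text.
def html_to_text(html)
  html
    .gsub(/\s*\n\s*/, '')                 # drop source newlines and the whitespace around them
    .gsub(%r{</(?:p|div)\s*>}i, "\n\n")   # a closing block tag ends a paragraph
    .gsub(%r{<br\s*/?>}i, "\n")           # <br>, <br/> and <br /> each become one newline
    .gsub(/<[^>]+>/, '')                  # crude tag removal; assumes no ">" inside attributes
    .strip                                # drop the trailing blank line
end

puts html_to_text("<p>Lorem</p>\n<p>Ipsum.\n<br />\nLorem ipsum.\n</p>")
```

Matching the break tag with a pattern instead of a literal string avoids having to add one gsub per spelling of the tag.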
Related
I have a requirement to remove <br> tags enclosed in <p> tags whenever they are not preceded by text or not followed by text. Let me give a complete example.
Tags marked with an asterisk (*) should be matched; the others should be left untouched.
<div>
<p>
<br/>*
<span>Text1</span>
<br/>
<i>Text2
</i>
</p>
<p>
<b>
<i>
<br/>*
</i>
</b>
<span>Text3</span>
<br/>
<br/>
Text4
<i>
<br/>*
</i>
</p>
<p>
<span>Text4</span>
<br/>*
</p>
</div>
Put simply, I need to normalize the text formatting of some Word documents in which the editors used line breaks as if they were paragraphs. Line breaks are meant to break text, not to add spacing between lines; that is the paragraph's job.
So all I need is to keep the <br/> tags that are surrounded by text and match the rest so they can be deleted.
Thanks!
You could use two queries, one for leading breaks and one for trailing breaks:
//p/descendant-or-self::*/*[1]/self::br[not(preceding-sibling::node()/normalize-space()!='')]
//p/descendant-or-self::*/*[last()]/self::br[not(following-sibling::node()/normalize-space()!='')]
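If no XPath engine is at hand, the same idea (drop a break when only tags and whitespace separate it from the start or end of its paragraph) can be approximated with regular expressions. This is a rough sketch in plain Ruby, not a translation of the queries above: strip_edge_breaks, BR and FILLER are hypothetical names, and it assumes well-formed markup with no ">" inside attribute values.

```ruby
BR = %r{<br\s*/?>}i
# A run of whitespace and non-<br> tags, with no text in between.
FILLER = /(?:\s|<(?!br\b)[^>]*>)*/i

# Hypothetical helper: removes <br> tags at the start or end of each <p>.
def strip_edge_breaks(html)
  loop do
    cleaned = html
      .gsub(/(<p\b[^>]*>#{FILLER})#{BR}/i) { $1 }   # break with no text before it
      .gsub(/#{BR}(#{FILLER}<\/p>)/i) { $1 }        # break with no text after it
    return cleaned if cleaned == html               # stop once nothing changes
    html = cleaned
  end
end

puts strip_edge_breaks('<p><br/><span>Text1</span><br/><i>Text2</i></p>')
```

The loop repeats so that several consecutive leading or trailing breaks are all removed; breaks with text on both sides are left alone.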
I have the following Markdown code for one of my GitHub pages.
<p align="justify">
Text Text Text.
</p>
<p align="justify">
**Text** Text Text.
`Text` Text.
</p>
While the first paragraph tag works fine, as there is no Markdown inside it, the text inside the second paragraph tag is not formatted as it should be (no bold, no code formatting).
Why is that? And how can I use Markdown inside a paragraph?
The Markdown rules plainly state:
Note that Markdown formatting syntax is not processed within block-level HTML tags. E.g., you can’t use Markdown-style *emphasis* inside an HTML block.
That said, GitHub Pages uses Kramdown to parse Markdown, and Kramdown has a slightly different behavior which gives you more flexibility. In fact, Kramdown's documentation states:
If an HTML tag has an attribute markdown="1", then the default mechanism for parsing syntax in this tag is used.
In other words, do this:
<p align="justify" markdown="1">
**Text** Text Text.
`Text` Text.
</p>
And you will get the following output:
<p align="justify">
<strong>Text</strong> Text Text.
<code>Text</code> Text.</p>
Kramdown is smart enough to recognize that you are inside a <p> tag, and does not wrap the individual lines in new <p> tags, which would be invalid HTML. If you actually want each line to be a separate paragraph, then you should use a <div> to wrap everything. Like this:
<div align="justify" markdown="1">
**Text** Text Text.
`Text` Text.
</div>
Which results in this output:
<div align="justify">
<p><strong>Text</strong> Text Text.</p>
<p><code>Text</code> Text.</p>
</div>
For completeness, it should be noted that GitHub READMEs and Gists do not use the same Markdown parser. Instead they use an extended CommonMark parser, which handles Markdown in raw HTML differently than the two ways described above. In CommonMark, whether the content of a raw HTML block is parsed as Markdown or not depends on whether the content is wrapped by blank lines. In that case, the proper way would be to do this:
<div align="justify">

**Text** Text Text.
`Text` Text.

</div>
However, as GitHub will strip out the align attribute, there isn't any point in doing that on pages hosted on github.com (such as READMEs). There is also the problem that CommonMark is not smart enough to detect that the wrapping raw HTML tag is a <p> tag, and wraps each line in another <p>, resulting in invalid HTML. Therefore, you must use a <div> in that case.
While the use-blank-lines method of telling the parser to parse the contents as Markdown is a more elegant solution than markdown="1", it is only supported by CommonMark parsers, which Kramdown is not. Therefore, as long as GitHub Pages uses Kramdown, you need to follow Kramdown's rules.
I am used to LaTeX, and in HTML I don’t know which element can replace LaTeX’s \subparagraph{} command. <br /> isn’t a good fit because it is the equivalent of a blank line in LaTeX. I could create a special “subparagraph” class, but first I want to know whether HTML already has a similar element.
LaTeX’s \subparagraph{} command is something between a paragraph and HTML’s <br /> element. OverAPI didn’t tell me more :/
Does anyone have an idea, please?
You could use a div element as the paragraph substitute and p elements as subparagraphs, with an additional class for styling. This could represent your LaTeX document structure.
\paragraph{Introfoo}
Introduction lorem lorem
\subparagraph*{}
Foobar lorem impsum ugh
\subparagraph*{}
Foobar lorem impsum ugh
would translate to the following (the h3 tag is just a suggestion; the level depends on the structure around it):
<div class="paragraph">
<h3 class="paragraph">Introfoo</h3>
<p class="paragraph">
Introduction lorem lorem
</p>
<p class="subparagraph">
Foobar lorem impsum ugh 1
</p>
<p class="subparagraph">
Foobar lorem impsum ugh 2
</p>
</div>
LaTeX's \paragraph is a heading element, the next-to-smallest one, so I'd map it to <h5>, leaving <h6> for subparagraphs, and use CSS to give them display:inline-block (run-in headers) and other styling as desired. This leaves h1-h4 for title and/or chapter, section, subsection, and subsubsection.
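The mapping suggested in that last answer can be written down as a small lookup table. This is only an illustrative sketch in plain Ruby: HEADING_FOR and latex_heading_to_html are hypothetical names, and the class attribute is an assumption borrowed from the earlier answer, not something LaTeX or HTML prescribes.

```ruby
# LaTeX sectioning command => HTML heading level, following the answer above.
HEADING_FOR = {
  'chapter'       => 'h1',
  'section'       => 'h2',
  'subsection'    => 'h3',
  'subsubsection' => 'h4',
  'paragraph'     => 'h5',
  'subparagraph'  => 'h6'
}.freeze

# Hypothetical helper: rewrites sectioning commands (starred or not) as headings.
def latex_heading_to_html(line)
  line.gsub(/\\(#{HEADING_FOR.keys.join('|')})\*?\{([^}]*)\}/) do
    tag = HEADING_FOR[$1]
    %(<#{tag} class="#{$1}">#{$2}</#{tag}>)
  end
end

puts latex_heading_to_html('\paragraph{Introfoo}')
```

An untitled \subparagraph*{} would produce an empty heading with this sketch, so in that case a styled p element, as in the first answer, is probably the better fit.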
I am trying to create a pre-formatted block of text using the pre element, where there are sometimes a few blank lines in between the content. The problem is that occasionally the text breaks onto a separate line after either a forward slash (/) or a colon (:).
An example is as follows:
<pre>Lorem ipsum dolor sitat: http://wwww.site.com/foo/bar</pre>
Displays as:
Lorem ipsum dolor sitat: http://wwww.site.com/foo
/bar
Does anyone know how to resolve this?
I'm with @Kyle. Without seeing the page itself, I think the text is too long for the container you have it in, so maybe make the text smaller or widen the container and see if that helps.
I tried:
<html>
<pre>Lorem ipsum dolor sitat: http://wwww.site.com/foo/bar</pre>
</html>
and didn't get any line breaks.
I have always used either a <br /> or a <div/> tag when something more advanced was necessary.
Is use of the <p/> tag still encouraged?
Modern HTML semantics are:
Use <p></p> to contain a paragraph of text in a document.
Use <br /> to indicate a line break inside a paragraph (i.e. a new line without the paragraph block margins or padding).
Use <div></div> to contain a piece of application UI that happens to have block layout.
Don't use <div /> or <p /> on their own. Those tags are meant to contain content. They appear to work as paragraph breaks only because, when the browser sees them, it "helpfully" closes the current block tag before opening the empty one.
A <p> tag wraps around something, unlike an <input/> tag, which is a singular item. Therefore, there isn't a reason to use a <p/> tag.
I've been told that im using <br /> when i should use <p /> instead. – maxp 49 secs ago
If you need to use <p> tags, I suggest wrapping the entire paragraph inside a <p> tag, which will give you a line break at the end of the paragraph. But I don't suggest just substituting something like <p/> for <br/>.
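For markup that already uses <p/> as a separator, the wrapping suggested above can be done mechanically. This is a minimal sketch in plain Ruby; wrap_paragraphs is a hypothetical name, and it assumes the input contains no properly opened <p> elements of its own.

```ruby
# Hypothetical helper: treats each self-closed <p/> as a paragraph separator
# and wraps every resulting chunk in a real <p>...</p> pair.
def wrap_paragraphs(text)
  text
    .split(%r{<p\s*/>}i)                 # split on <p/> or <p />
    .map(&:strip)                        # trim whitespace around each chunk
    .reject(&:empty?)                    # skip empty chunks, so no empty <p></p>
    .map { |chunk| "<p>#{chunk}</p>" }
    .join("\n")
end

puts wrap_paragraphs('First line.<br/>Still first.<p/>Second.')
```

Line breaks inside a chunk are left as <br/>, which matches the distinction drawn here between breaks and paragraphs.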
<p> tags are for paragraphs and signifying the end of a paragraph. <br/> tags are for line breaks. If you need a new line then use a <br/> tag. If you need a new paragraph, then use a <p> tag.
Paragraph is a paragraph, and break is a break.
A <p> is like a regular Return in Microsoft Office Word.
A <br> is like a soft return, Shift + Return in Office Word.
The first one applies all paragraph settings/styles, and the second one merely breaks a line of text.
Yes, <p> elements are encouraged and won't get deprecated any time soon.
A <p> signifies a paragraph. It should be used only to wrap a paragraph of text.
It is more appropriate to use the <p> tag for this as opposed to <div>, because this is semantically correct and expected for things such as screen readers, etc.
Using <p /> has never been encouraged:
From the XHTML HTML Compatibility Guidelines:
C.3. Element Minimization and Empty Element Content
Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) do not use the minimized form (e.g. use <p> </p> and not <p />).
From the HTML 4.01 Specification:
We discourage authors from using empty P elements. User agents should ignore empty P elements.
While they are syntactically correct, empty p elements serve no real purpose and should be avoided.
The HTML DTD does not prohibit you from using an empty <p> (a <p> element may contain PCDATA including the empty string), but it doesn't make much sense to have an empty paragraph.
Use it for what? All tags have their own little purpose in life, but no tag should be used for everything. Find out what you are trying to make, and then decide on what tag fits that idea best:
If it is a paragraph of text, or at least a few lines, then wrap it in <p></p>
If you need a line break between two lines of text, then use <br />
If you need to wrap many other elements in one element, then use the <div></div> tags.
The <p> tag defines a paragraph. There's no reason for an empty paragraph.
For any practical purpose, you don’t need to add the </p> to your markup. But if there is a strict XHTML adherence requirement, then you would probably need to close all your markup tags, including <p>. Some XHTML validators would report an unclosed tag as an error.