Turn HTML text into Markdown manually (JavaScript / Node.js)

I'm a bit stuck. I have scraped a website and would now like to convert it into markdown. My html looks like this:
Some text more text, and more text. Some text more text, and more text.
Once in a while <span class="bold">something is bold</span>.
Then some more text. And <span class="bold">more bold stuff</span>.
There are HTML-to-Markdown modules available; however, they would only work if the text <b>looked like this</b>.
How could I go through the HTML and, every time I find a span that is supposed to make something bold, turn that piece of HTML into bold Markdown, that is, make it **look like this**?

Try this one https://github.com/domchristie/to-markdown, an HTML to Markdown converter written in JavaScript.
It can be extended by passing in an array of converters to the options object:
toMarkdown(stringOfHTML, { converters: [converter1, converter2, …] });
In your case, the converter can be
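{
  filter: function (node) {
    // only convert spans that carry the "bold" class, not every span
    return node.nodeName === 'SPAN' && node.getAttribute('class') === 'bold';
  },
  replacement: function (content) {
    return '**' + content + '**';
  }
}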
Refer to its readme for more details.

Notepad++ is an open-source editor that supports regex search and replace.
You know how to use an editor to find and replace strings. In an editor like Notepad++ you can search for string patterns, replace parts of each match, and keep the rest. In your case, you want to find strings that are framed by HTML markup. A 'Find what' pattern such as <span class="bold">([^<]*)</span> does that: the special notation ([^<]*) means "save zero or more characters other than '<' for use in the replacement string". The 'Replace with' box then uses what was saved (as \1) in the expression **\1**, which gives you the text you want. All that remains is to click 'Replace All'.
To do this you need to install Notepad++ and learn some basic Perl-style regex. Press Ctrl+H to open the Replace dialog, and make sure the search mode is set to 'Regular expression'. Of course, if you get it wrong there's always Ctrl+Z.
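Since the question asks about JavaScript / Node.js, the same find-and-replace can be sketched there too. This is only a minimal sketch: it assumes the bold spans look exactly like <span class="bold">...</span> and contain no nested tags, and the function name boldSpansToMarkdown is made up for illustration.

```javascript
// Hypothetical helper: turns <span class="bold">text</span> into **text**.
// Assumes the span has exactly this attribute and contains no nested tags.
function boldSpansToMarkdown(html) {
  // ([^<]*) captures everything up to the next '<'; $1 re-inserts it.
  return html.replace(/<span class="bold">([^<]*)<\/span>/g, '**$1**');
}

const sample = 'Once in a while <span class="bold">something is bold</span>.';
console.log(boldSpansToMarkdown(sample));
// prints: Once in a while **something is bold**.
```

For anything beyond this simple shape (nested tags, extra attributes), a real HTML parser or a converter library is likely the safer route.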

Related

How to convert CSS unicode string to text character?

I'm using an icon set that has to be used through an element like this:
<i class="icons-recycle"></i>
This generates an element with the following CSS:
.icons-recycle::before {
content: "\e67f";
}
What I need is to copy/paste the Unicode glyph that's generated via "\e67f" so I can use it in Photoshop to do some designs with that icon (I already have the .ttf file installed).
For example, "\u00C6" gives me Æ, according to this online converter.
However, I am unable to find what that character is, and I cannot select it on the HTML page! How can I convert this "\e67f" to a text character to paste into Photoshop?
MDN says the content property, when used like this, holds a Unicode escape sequence.
The easiest way I found was to convert the content string to an HTML character entity:
Instead of \e67f, type <div>&#xe67f;</div> (replace the \ with &#x and add a trailing ;) and the browser will display the individual glyph.
You can then copy and paste it into Photoshop; it should display as expected if you have the proper font installed.
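If you would rather get the character from a script (for example in the browser console or Node.js), the conversion can be sketched as below. This is an illustrative sketch, not part of the original answer; the helper name cssEscapeToChar is hypothetical, and it assumes the escape is a single backslash followed only by hexadecimal digits.

```javascript
// Hypothetical helper: converts a CSS escape such as "\e67f" to the character itself.
// Assumes the input is one backslash followed by hexadecimal digits only.
function cssEscapeToChar(cssEscape) {
  // Strip the leading backslash and read the rest as a hexadecimal code point.
  const hex = cssEscape.replace(/^\\/, '');
  return String.fromCodePoint(parseInt(hex, 16));
}

// In JavaScript source the backslash itself has to be escaped.
const glyph = cssEscapeToChar('\\e67f');
console.log(glyph.codePointAt(0).toString(16)); // prints: e67f
console.log(cssEscapeToChar('\\00C6'));          // prints: Æ
```

The glyph for \e67f sits in the Private Use Area, so it only looks like the icon where the icon font is applied.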

Why do some strings contain "&nbsp;" and some " ", when my input is the same (" ")?

My problem occurs when I try to use some data/strings in a p element.
I start off with data like this:
data: function() {
return {
reportText: {
text1: "This is some subject text",
text2: "This is the conclusion",
}
}
}
I use this data as follows in my (vue-)html:
<p> {{ reportText.text1 }} </p>
<p> {{ reportText.text2 }} </p>
In my browser, when I inspect my elements I get to see the following results:
<p>This is some subject text</p>
<p>This is the conclusion</p>
As you can see, there is suddenly a difference: one p element uses &nbsp; and the other uses ordinary spaces, even though I started off with both strings using only ordinary spaces. I know &nbsp; and an ordinary space technically represent the same thing, but the problem with the &nbsp; string is that it gets treated as a string with one large word instead of multiple separate words. This breaks my layout, and I can't solve it with CSS properties (word-wrap etc.).
Other things I have tried:
Tried sanitizing the strings with .replace("&nbsp;", " "), but that doesn't do anything. I assume this is because the two are basically the same, so there is nothing to really replace. That is also why I have to use code blocks on Stack Overflow to make the distinction between &nbsp; and an ordinary space.
Logged the data from Vue to see if there is any noticeable difference, but I can't see any. If I log the data/reportText I again only see strings with ordinary spaces.
So I have the following questions:
Why does this happen? I can't find any logical explanation for why it sometimes uses &nbsp; and sometimes uses ordinary spaces. It seems random, but I am sure I am missing something.
Are there any other things I could try to follow the path my string takes, so I can see where the transformation from an ordinary space to &nbsp; happens?
Per the comments, the solution devised ended up being a simple unicode character replacement targeting the \u00A0 unicode code point (i.e. replacing unicode non-breaking spaces with ordinary spaces):
str.replace(/\u00A0/g, ' ')
Explanation:
JavaScript typically allows the use of Unicode characters in two ways: you can input the rendered character directly, or you can use a Unicode code point (i.e. in the case of JavaScript, a hexadecimal code prefixed with \u like \u00A0). It has no concept of an HTML entity (i.e. a character sequence between a & and ; like &nbsp;).
The inspector tool for some browsers, however, utilizes the HTML concept of the HTML entity and will often display unicode characters using their corresponding HTML entities where applicable. If you check the same source code in Chrome's inspector vs. Firefox's inspector (as of writing this answer, anyway), you will see that Chrome uses HTML entities while Firefox uses the rendered character result. While it's a handy feature to be able to see non-printable unicode characters in the inspector, Chrome's use of HTML entities is only a convenience feature, not a reflection of the actual contents of your source code.
With that in mind, we can infer that your source code contains unicode characters in their fully rendered form. Regardless of the form of your unicode character, the fix is identical: you need to target these unicode space characters explicitly and replace them with ordinary spaces.
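To follow where the non-breaking spaces come from, it can help to log strings with every invisible or non-ASCII character spelled out as an escape. The following is a minimal sketch along those lines; the helper name showHiddenChars is hypothetical and the sample string is made up for illustration.

```javascript
// Hypothetical helper: shows every character outside printable ASCII as a \uXXXX escape,
// so a non-breaking space (U+00A0) becomes visible in console output.
function showHiddenChars(str) {
  return str.replace(/[^\x20-\x7E]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Made-up sample containing two non-breaking spaces.
const suspicious = 'This\u00A0is some subject\u00A0text';
console.log(showHiddenChars(suspicious));
// prints: This\u00A0is some subject\u00A0text

// The replacement from the answer turns them back into ordinary spaces.
console.log(suspicious.replace(/\u00A0/g, ' ') === 'This is some subject text');
// prints: true
```

Logging the output of such a helper at each step (raw data, after any processing, just before rendering) should show where the U+00A0 characters first appear.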

Word html format: insert a custom TOC via field code

I am generating Word docs from html. Basically, I build a file with html and save it as a .doc. Then I open it in Word and apply a template. All good so far.
I would like to automatically generate a custom TOC via the HTML, i.e. when I am building the document. I need to insert a field code to do that, in the same way I do to add page numbering via the HTML, e.g.:
<span style="mso-field-code: PAGE " class="page-field"></span>
If I save my HTML doc as docx and apply a template, I can make a TOC based on the styles in the way one would normally create a TOC in Word. I customised the TOC so the Title style is the top level, followed by H1, H2, then H3. If I then toggle the field code on the TOC, the field code looks like this:
{ TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1" }
Now, I can add HTML like this to insert the TOC:
<div style="mso-field-code: TOC " class="toc-field">TOC goes HERE</div>
When I do that, if I right click the text "TOC goes HERE" I get the option to "Update field" and if I do that a TOC is generated using the default H1,H2,H3 tags.
But, what I can't work out is how to include the
\t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
part so my custom style sequence is applied. I have tried all sorts of combinations and it seems that adding anything after TOC causes Word to not make a field code.
Does anyone have any suggestions?
Update:
Based on the essential help from @slightlysnarky below, I thought I would summarise the outcome here because the information I needed was in a Microsoft .chm file that was taken down many years ago. If you read the following extract from that help manual and compare it to the solution below, you will see how this all works.
Word marks and stores information for simple fields by means of the Span element with the mso-field-code style. The mso-field-code value represents the string value of the field code. Formatting in the original field code might be lost when saving as HTML if only the string value of the code is necessary for its calculation.
Word has a different way of storing field information to HTML for more complex fields, such as ones that have formatted text or long values. Word marks these fields with <!--[if supportFields]> so the data is not displayed in the browser. Word uses the Span element with the mso-element: field-begin, mso-element: field-separator, and mso-element: field-end attributes to contain the three respective parts of the field code: the field start, the separator between field code and field results, and the field end. Whenever possible, Word will save the field to HTML in the method that uses the least file space.
So, basically, add tags as shown below to your HTML at the point you wish the TOC to appear.
:-)
Word recognises a "complex field format" in HTML, along the same lines as it does in the Office Open XML format. So you can use
<span style='mso-element:field-begin'></span>TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
<span style='mso-element:field-separator'></span>This text will show but the user will need to update the field
<span style='mso-element:field-end'></span>
This construct is outlined in a Microsoft document called "Microsoft Office HTML and XML Reference". It's a Windows .exe that unpacks to a .chm Help file. You can get it here
The info. on encoding fields is in Getting Started with Microsoft Office 2000 HTML and XML->Microsoft Word->Fields
There may be a later version but that's the only one I could find.
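If the HTML is built programmatically, the three-span construct above can be produced by a small helper. The sketch below is only an illustration in JavaScript (the question does not say which language builds the HTML); the function name wordFieldHtml and the placeholder text are hypothetical.

```javascript
// Hypothetical helper: wraps a Word field code in the field-begin / field-separator /
// field-end spans shown in the answer above.
function wordFieldHtml(fieldCode, placeholderText) {
  return [
    "<span style='mso-element:field-begin'></span>" + fieldCode,
    "<span style='mso-element:field-separator'></span>" + placeholderText,
    "<span style='mso-element:field-end'></span>",
  ].join('\n');
}

// Note the doubled backslash: the output must contain a literal \t switch,
// not a tab character.
const tocField = wordFieldHtml(
  'TOC \\t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"',
  'Right-click and choose Update field to build the TOC'
);
console.log(tocField);
```

The same helper could in principle emit other fields, but only the TOC case is covered by the answer above.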

ruby tags for Sphinx/rst

I create HTML documents from reST-formatted text with the help of Sphinx. I need to display some Japanese words with furigana (small characters above the words).
I'd like to produce HTML that displays the furigana using the <ruby> tag.
I can't figure out how to get this result. I tried to:
insert raw HTML code with the .. raw:: html directive but it breaks my line into several paragraphs.
use the :superscript: directive but the text in furigana is written beside the text, not above.
use the :role: directive to create a link between the text and a CSS class of my own. But the :role: directive can only be applied to a segment of text, not to TWO segments as required by the furiganas (=text + text above it).
Any ideas?
As far as I know, there's no simple way to get the expected result.
For a specific project, I chose not to generate the furigana with Sphinx but to modify the .html files afterwards. See the add_ons/add_furiganas.py script and the result here. Yes, it's a quick-and-dirty trick :(
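The post-processing idea can be sketched as follows. This is not the add_furiganas.py script mentioned above (which is not shown here), and it is written in JavaScript rather than Python; the {base|reading} marker syntax and the function name addFurigana are assumptions made purely for illustration.

```javascript
// Hypothetical post-processing step: rewrite {base|reading} markers left in the
// generated HTML into <ruby> elements. The marker syntax is an assumption.
function addFurigana(html) {
  return html.replace(
    /\{([^|{}]+)\|([^|{}]+)\}/g,
    '<ruby>$1<rt>$2</rt></ruby>'
  );
}

console.log(addFurigana('<p>{漢字|かんじ}を読む</p>'));
// prints: <p><ruby>漢字<rt>かんじ</rt></ruby>を読む</p>
```

Because the markers are plain text, they pass through Sphinx untouched and do not break the line into separate paragraphs the way a raw HTML directive does.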

How to modify how TinyMCE format text

TinyMCE puts color formatting into a span tag.
Now, whenever the user changes the color of some text, I need to add one extra character
(for those who wonder why I need this, read this: Inserting HTML tag in the middle of Arabic word breaks word connection (cursive)).
This is how TinyMCE normally formats text:
<p><span style="color: #ff6600;">forma</span>tings</p>
This is how I need it to be:
<p>X<span style="color: #ff6600;">forma</span>tings</p>
So before every span I need to add one extra character.
I searched through the TinyMCE source but couldn't find where it assembles this.
I totally understand your need for a word joiner.
Depending on the browser you might be able to insert this character using a css-pseudo element - in this case before: http://www.w3schools.com/cssref/sel_before.asp
Your tinymce content css (use the tinymce init setting content_css) should contain the following:
body span:before {
  content: '\2060'; /* use '\00b6' to get something visible for testing */
}
UPDATE: Approach 2:
You can do this check to enter your word joiners:
var ed = tinymce.get('content') || tinymce.editors[0];
var span = $(ed.getBody()).find('span:not(.has_word_joiner)').each(function(index) {
ed.selection.select(this);
ed.execCommand('mceInsertContent', false, '\u2060<span class="has_word_joiner">'+this.innerHTML+'</span>'); // you might want to copy the former span's attributes too, but that is a minor issue
});
You might need to call this from your own plugin on specific events.
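A third option is to work on the serialized HTML string rather than the live DOM, for example when the content is saved. The sketch below only shows the string manipulation and makes no assumption about TinyMCE's API; the function name addWordJoiners is hypothetical, and it assumes span tags in the string are well formed.

```javascript
// Hypothetical helper: inserts a word joiner (U+2060) in front of every <span ...>
// in an HTML string, skipping spans that already have one.
function addWordJoiners(html) {
  // (?<!\u2060) is a lookbehind that makes the replacement safe to run repeatedly.
  return html.replace(/(?<!\u2060)<span\b/g, '\u2060<span');
}

const before = '<p><span style="color: #ff6600;">forma</span>tings</p>';
const after = addWordJoiners(before);
console.log(after === '<p>\u2060<span style="color: #ff6600;">forma</span>tings</p>'); // prints: true
console.log(addWordJoiners(after) === after); // prints: true (no duplicate joiners)
```

Unlike the selection-based loop above, this keeps the original span attributes untouched, at the cost of operating on markup with a regular expression.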