Trim string to length ignoring HTML

This problem is a challenging one. Our application allows users to post news on the homepage. That news is input via a rich text editor which allows HTML. On the homepage we want to only display a truncated summary of the news item.
For example, here is the full text we are displaying, including HTML
In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.
We want to trim the news item to 250 characters, but exclude HTML.
The method we are using for trimming currently includes the HTML, and this results in some news posts that are HTML heavy getting truncated considerably.
For instance, if the above example included tons of HTML, it could potentially look like this:
In an attempt to make a bit more space in the office, kitchen, I've pulled...
This is not what we want.
Does anyone have a way of tokenizing HTML tags in order to maintain position in the string, perform a length check and/or trim on the string, and restore the HTML inside the string at its old location?

Start at the first character of the post, stepping over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter gets to 250 is where you actually want to cut off.
Note that you will still have to deal with one remaining problem: an HTML tag that is opened but not closed before the cutoff.
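The counting approach above can be sketched in a few lines of JavaScript (a minimal sketch, assuming every '<' has a matching '>'; the function name is mine):

```javascript
// Returns the index at which to cut so that `limit` visible (non-tag)
// characters survive. Characters between '<' and '>' are not counted.
function findCutIndex(html, limit) {
  let count = 0;
  let inTag = false;
  for (let i = 0; i < html.length; i++) {
    const c = html[i];
    if (c === '<') { inTag = true; continue; }  // stop counting inside a tag
    if (c === '>') { inTag = false; continue; } // resume counting after it
    if (!inTag) {
      count++;
      if (count === limit) return i + 1; // cut just after this character
    }
  }
  return html.length; // shorter than the limit: keep everything
}
```

For example, `findCutIndex('<b>hello</b> world', 5)` counts only `hello` and returns the position right after the `o`, so slicing there keeps the `<b>` tag intact.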

Following the two-state finite-state-machine suggestion, I've developed a simple HTML parser for this purpose in Java:
http://pastebin.com/jCRqiwNH
and here a test case:
http://pastebin.com/37gCS4tV
And here the Java code:
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

    private int cutPoint = -1;
    private String htmlString = "";
    final List<String> tags = new LinkedList<String>();
    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");
    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint) {
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut() {
        // reset
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;
        String tag = "";

        if (cutPoint < 0) {
            return htmlString;
        }
        if (null != htmlString) {
            if (cutPoint == 0) {
                return "";
            }
            for (int i = 0; i < htmlString.length(); i++) {
                String strC = htmlString.substring(i, i + 1);
                if (strC.equals("<")) {
                    // new tag or tag closure; reset the previous tag
                    tagSb = new StringBuilder("");
                    tag = "";
                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++) {
                        String tagC = htmlString.substring(k, k + 1);
                        tagSb.append(tagC);
                        if (tagC.equals(">")) {
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")) {
                                // closure
                                if (!isToSkip(tag)) {
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove(tags.size() - 1);
                                }
                            } else {
                                // new tag
                                sb.append(tagSb.toString());
                                if (!isToSkip(tag)) {
                                    tags.add(tag);
                                }
                            }
                            i = k;
                            break;
                        }
                    }
                } else {
                    sb.append(strC);
                    charCount++;
                }
                // cut check
                if (charCount >= cutPoint) {
                    // close previously opened tags
                    Collections.reverse(tags);
                    for (String t : tags) {
                        sb.append("</").append(t).append(">");
                    }
                    break;
                }
            }
            return sb.toString();
        } else {
            return null;
        }
    }

    private boolean isToSkip(String tag) {
        if (tag.startsWith("/")) {
            tag = tag.substring(1);
        }
        for (String tagToSkip : tagsToSkip) {
            if (tagToSkip.equals(tag)) {
                return true;
            }
        }
        return false;
    }

    private String getTag(String tagString) {
        if (tagString.contains(" ")) {
            // tag with attributes
            return tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(" "));
        } else {
            // simple tag
            return tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
        }
    }
}

You can try the trim-html npm package.
It cuts the text off inside HTML tags, preserves the original HTML structure, removes HTML tags after the limit is reached, and closes any tags that were left open.

If I understand the problem correctly, you want to keep the HTML formatting, but you want to not count it as part of the length of the string you are keeping.
You can accomplish this with code that implements a simple finite state machine.
2 states: InTag, OutOfTag
InTag:
- Goes to OutOfTag when a > character is encountered
- Stays in InTag when any other character is encountered
OutOfTag:
- Goes to InTag when a < character is encountered
- Stays in OutOfTag when any other character is encountered
Your starting state will be OutOfTag.
You implement a finite state machine by processing one character at a time. The processing of each character brings you to a new state.
As you run your text through the finite state machine, you also want to keep an output buffer and a length-so-far variable (so you know when to stop).
Increment your length variable each time you are in the OutOfTag state and process another character. You can optionally skip incrementing for whitespace characters.
You end the algorithm when you have no more characters left or you have reached the desired length.
In your output buffer, include the characters you encounter up until that length.
Keep a stack of unclosed tags. When you reach the length, add an end tag for each element in the stack. You can tell when you encounter a tag by keeping a current_tag variable: it is started when you enter the InTag state, and it is ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the InTag state). If you have a start tag, push it onto the stack; if you have an end tag, pop it from the stack.
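A compact JavaScript sketch of the state machine and tag stack described above (my own sketch, assuming well-formed tags, no '>' inside attribute values, and no HTML entities; self-closing tags like <br/> are never pushed onto the stack):

```javascript
function trimHtml(html, limit) {
  let out = '', count = 0, i = 0;
  const stack = []; // unclosed tag names
  while (i < html.length && count < limit) {
    if (html[i] === '<') {                       // InTag state
      const end = html.indexOf('>', i);
      const tag = html.slice(i, end + 1);
      const name = tag.replace(/[<>/]/g, '').split(/\s/)[0];
      if (tag[1] === '/') stack.pop();           // end tag: pop
      else if (!tag.endsWith('/>')) stack.push(name); // start tag: push
      out += tag;
      i = end + 1;
    } else {                                     // OutOfTag state
      out += html[i++];
      count++;                                   // only visible chars count
    }
  }
  while (stack.length) out += '</' + stack.pop() + '>'; // close open tags
  return out;
}
```

For example, `trimHtml('<div><b>hello world</b></div>', 5)` yields `<div><b>hello</b></div>`: five visible characters are kept and the open tags are closed in reverse order.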

Here's the implementation that I came up with, in C#:
public static string TrimToLength(string input, int length)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;
    if (input.Length <= length)
        return input;

    bool inTag = false;
    int targetLength = 0;
    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];
        if (c == '>')
        {
            inTag = false;
            continue;
        }
        if (c == '<')
        {
            inTag = true;
            continue;
        }
        if (inTag || char.IsWhiteSpace(c))
        {
            continue;
        }
        targetLength++;
        if (targetLength == length)
        {
            return ConvertToXhtml(input.Substring(0, i + 1));
        }
    }
    return input;
}
And a few unit tests I used via TDD:
[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
    Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
    Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
    string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
        "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
        "<br/>" +
        "In an attempt to make a bit more space in the office, kitchen, I";
    Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
    string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
        "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
        "<br/>" +
        "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
        "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
        "</div>";
    string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
        "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
        "<br/>" +
        "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";
    Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
    string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
        "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
        "<br/>" +
        "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
        "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";
    string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
        "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
        "<br/>" +
        "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";
    Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}

I'm aware this is quite a bit after the posted date, but I had a similar issue and this is how I ended up solving it. My concern would be the speed of regex versus iterating through an array.
Also note that this doesn't fix whitespace that appears both before and after an HTML tag.
private string HtmlTrimmer(string input, int len)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;
    if (input.Length <= len)
        return input;

    // this is necessary because regex "^" applies to the start of the string,
    // not where you tell it to start from
    string inputCopy;
    string tag;
    string result = "";
    int strLen = 0;
    int strMarker = 0;
    int inputLength = input.Length;
    Stack stack = new Stack(10);

    Regex text = new Regex("^[^<&]+");
    Regex singleUseTag = new Regex("^<[^>]*?/>");
    Regex specChar = new Regex("^&[^;]*?;");
    Regex htmlTag = new Regex("^<.*?>");

    while (strLen < len)
    {
        inputCopy = input.Substring(strMarker);
        // If the marker is at the end of the string OR
        // the sum of the remaining characters and those analyzed is less than the max length
        if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
            break;
        // Match regular text
        result += text.Match(inputCopy, 0, len - strLen);
        strLen += result.Length - strMarker;
        strMarker = result.Length;
        inputCopy = input.Substring(strMarker);
        if (singleUseTag.IsMatch(inputCopy))
            result += singleUseTag.Match(inputCopy);
        else if (specChar.IsMatch(inputCopy))
        {
            // think of the entity as 1 character instead of 5
            result += specChar.Match(inputCopy);
            ++strLen;
        }
        else if (htmlTag.IsMatch(inputCopy))
        {
            tag = htmlTag.Match(inputCopy).ToString();
            // This only works if this is valid markup...
            if (tag[1] == '/')  // closing tag
                stack.Pop();
            else                // not a closing tag
                stack.Push(tag);
            result += tag;
        }
        else // bad syntax
            result += input[strMarker];
        strMarker = result.Length;
    }
    while (stack.Count > 0)
    {
        tag = stack.Pop().ToString();
        result += tag.Insert(1, "/");
    }
    if (strLen == len)
        result += "...";
    return result;
}

Wouldn't the fastest way be to use jQuery's text() method?
For example:
<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>
var text = $('ul').text();
Would give the value OneTwoThree in the text variable. This would allow you to get the actual length of the text without the HTML included.
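Outside a browser (or without jQuery), a rough approximation of `$(el).text()` is to strip tags with a regex. This is my own sketch and it assumes no '<' or '>' appears inside attribute values or text nodes; for untrusted markup, prefer a real parser (or `textContent` on a DOM node):

```javascript
// Approximate length of the visible text by removing anything that
// looks like a tag before measuring.
function textLength(html) {
  return html.replace(/<[^>]*>/g, '').length;
}
```

For the list above, `textLength('<ul><li>One</li><li>Two</li><li>Three</li></ul>')` measures `OneTwoThree`, i.e. 11 characters.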

Related

xkcd api, how to read explanations?

xkcd comics has a JSON API for reading metadata about one specific comic/strip.
E.g., to get JSON data I can use:
https://xkcd.com/json.html
https://xkcd.com/2173/info.0.json
But it does not contain the explanation of the xkcd. That can be found on another page:
https://www.explainxkcd.com/wiki/index.php?title=2173:_Trained_a_Neural_Net&oldid=176507
How can I get the explanation via an API as well? Is it possible? (I don't want to use curl to scrape the entire HTML page.)
If by explanation, you mean the text that appears when you hover over the comic, that is called the alt text. It is available in the JSON that is returned:
{
  "month": "7",
  "num": 2173,
  "link": "",
  "year": "2019",
  "news": "",
  "safe_title": "Trained a Neural Net",
  "transcript": "",
  "alt": "It also works for anything you teach someone else to do. \"Oh yeah, I trained a pair of neural nets, Emily and Kevin, to respond to support tickets.\"",
  "img": "https://imgs.xkcd.com/comics/trained_a_neural_net.png",
  "title": "Trained a Neural Net",
  "day": "8"
}
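Reading the alt text out of that response is just ordinary JSON handling. A minimal sketch (the JSON here is abbreviated from the response above and inlined as a string rather than fetched):

```javascript
// Abbreviated copy of the info.0.json payload shown above.
const comicJson = '{"num": 2173, "safe_title": "Trained a Neural Net", ' +
  '"alt": "It also works for anything you teach someone else to do."}';

const comic = JSON.parse(comicJson);
const altText = comic.alt; // the hover text of the comic
```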
If you meant the explanation from explainxkcd.com, then that is a different API. It uses the MediaWiki platform (the same one used by Wikipedia). You can find the API documentation on their website, including an example of how to parse text.
The final result is this URL: https://www.explainxkcd.com/wiki/api.php?action=parse&page=2172:_Lunar_Cycles&prop=wikitext&sectiontitle=Explanation&format=json
Example output:
{
"parse": {
"title": "2172: Lunar Cycles",
"pageid": 22099,
"wikitext": {
"*": "{{comic\n| number = 2172\n| date = July 5, 2019\n| title = Lunar Cycles\n| image = lunar_cycles.png\n| titletext = The Antikythera mechanism had a whole set of gears specifically to track the cyclic popularity of skinny jeans and low-rise waists.\n}}\n\n==Explanation==\n{{incomplete|Created by a MOONBOT. Joke cycle explanations need to be expanded and title text needs to be explained. Do NOT delete this tag too soon.}}\n\nThis comic shows a mixture of real, scientific lunar cycles and cycles that are comedic or fictional in nature.\n\n*'''Nodal precession:''' The Moon's orbital plane is tilted slightly compared to the Earth's orbital plane around the sun (the {{w|ecliptic}}). This tilt is why we don't constantly see eclipses; most of the time, the Moon's orbital plane is tilted higher or lower than the Sun, so they generally don't cross each other. The two points at which these planes ''do'' cross are called {{w|lunar nodes}}. {{w|Nodal precession}} is the gradual rotation of these nodes over time, which for the Moon follows an 18.6 year cycle.\n\n*'''Apsidal precession:''' All orbits have two points where the orbiting body is either closest to, or furthest away from, the thing they are orbiting. These points are called {{w|apsides}}, and the imaginary line between them is called the ''line of apsides''. {{w|Apsidal precession}} is the gradual rotation of this line over time, which occurs in cycles of around 8.9 years for the Moon.\n\n*'''Phase:''' {{w|Lunar phase}} describes the change in shape of the sunlit side of the Moon as viewed from the Earth's surface, which is caused by the changing angle between Moon and Sun as the Moon revolves around the Earth. The cycle of lunar phases takes 29.5 days, a figure referred to as the ''synodic month''.\n\n*'''Distance:''' Because the Moon's orbit around the Earth is elliptical, its distance from the Earth varies slightly over the course of an orbit. 
This means that the moon's distance also follows a cycle which is the same as the length of one lunar orbit: approximately 27.5 days. This figure is referred to as the ''anomalistic month''. Note that the synodic month is (perhaps counterintuituvely) two days ''longer'' than the sidereal month - or to put it another way, it takes 2 more days for the Moon's phases to cycle than it does for the Moon to go around the Earth. This is due to the fact that the Earth is also moving ''around'' the Sun while the phases are going on, which means that the Moon has to spend 2 extra days \"catching up\" to the point at which the lunar phase cycle can restart.\n\n*'''Earth-Moon relative size''': This is a joke cycle; the Earth and Moon do not physically change size, nor does the Moon ever become larger than the Earth. This may be playing on the idea that the Moon often ''appears'' to change size to viewers on Earth, due to various factors; most commonly, this is due to the {{w|Moon illusion}}, which tricks the brain into perceiving the Moon as much larger than it really is. There are also so-called {{w|supermoon}}s, which occur when the full moon coincides with the Moon's closest approach to Earth; these actually do increase the Moon's apparent size, although by a relatively insignificant amount.\n\n*'''Lunar shape:''' Again, this is a joke cycle; the Moon does not actually change shape. A shape intermediate between circle and square is known as a {{w|squircle}}, a subclass of the {{w|superellipse}}.\n\n*'''Lunar mood:''' The moon does not have a mood, although humans can have moods that fluctuate over time, sometimes with a regularity akin to a cycle. Ironically, the section of the graph that shows a good (i.e. 
happy) mood has the graph line curving up then down like the mouth of a frown, and for the bad (unhappy) mood it curves down and then up, as in the mouth of a smile.\n\n* The final diagram shows many different cycles superimposed on each other, highlighting areas where several cycles are coinciding. This is likely satirizing the media trend of overhyping astronomical coincidences and giving them grand-sounding names:\n:*The light gray \"phase \u00d7 distance\" plot does not correspond to the product of periods given for phase and distance, which [https://i.imgur.com/0i0mcPn.png look like this] instead.\n:*A [[wikipedia:harvest moon|harvest moon]] is the traditional name for the full moon closest to the autumnal equinox, but there is nothing astronomically significant about it.\n:*A [[wikipedia:Supermoon|supermoon]] is a full or new moon when the Moon is closest to the Earth, resulting in a slightly larger-than-usual apparent size. A full supermoon is roughly 14% larger in diameter than when the Moon is furthest away. See also [[1394: Superm*n]].\n:*A [[wikipedia:blue moon|blue moon]] is the extra full moon in years with 13 full moons, which happens once every two or three years (hence the phrase \"once in a blue moon\"). Blue moons don't look any different from regular full moons.\n:*{{w|Astrology}} is a pseudoscience which claims that the positions of the celestial bodies can be used to predict human affairs. The chart jokingly suggests that astrology actually ''does'' work, but only within a very specific two-week timeframe.\n:*The [[wikipedia:Golden Age of Television|Golden Age of Television]] is said to have occurred in the 1940s and 50s, and the 2000s.\n:*There are no occurrences of '''dire moon''' or '''pork moon''' in the Google Books N-Gram viewer, which includes many works from the 1800s through 2008. 
A [[wikipedia:blood moon|blood moon]] refers to the moon during a lunar eclipse.\n:*While the popularity of '''skinny jeans''' ([[wikipedia:Slim-fit pants|slim-fit pants]]) does change over time, the idea that this is connected to a lunar cycle is also a joke.\n:*Finally, while the idea of a '''total eclipse of the sea''' seems absurd, [https://www.deepseanews.com/2017/08/what-happens-in-the-sea-during-a-solar-eclipse/ an eclipse was famously used to explain the migration of maritime animals]:\n:::''biologists were beginning to unravel the mystery of this \u2018false bottom\u2019\u2013a layer in the ocean that looks the the sea floor on the sounder but isn\u2019t\u2013which covered much of the ocean. This false bottom rises in up at night and sinks down during the day. This rising and falling is in fact caused by the largest migration of animal on Earth\u2013everything from fish, shrimp and jellyfish, moving hundreds of meters in unison up and down each day.... the moon moved into its place in front of the sun, daylight rapidly faded, and the scientists solved the migration mystery: the deep layer of animals began to rise. Bioluminescent creatures started to shine, and nocturnal creatures started a frantic upward thrust. As the world grew darker, they swam upward nearly 80 meters. But this frantic migration didn\u2019t last long. As the moon receded and the sun revealed itself, the massive animal layer did an about-face, scrambling back into the safety of the darkness.''\n:: (Backus, Clark, and Wing (1965) [https://sci-hub.tw/10.1038/205989a0 \"Behaviour of certain marine organisms during the solar eclipse of July 20, 1963\"] ''Nature'' '''4975:'''989-91.)\nThe '''{{w|Antikythera_mechanism|Antikythera mechanism}}''' mentioned in the title text is an ancient Greek machine, rediscovered in 1901, designed to calculate astronomical positions. 
The title text jokes that there is a set of gears on said mechanism that is used to predict the popularity of \"skinny jeans\" and \"low-rise waists.\" Since it was likely created in the 1st or 2nd century B.C., it is impossible for the creators to have had any knowledge of skinny jeans or low-rise waists - both are modern-day clothing fashions.\n\n==Transcript==\n{{incomplete transcript|Do NOT delete this tag too soon.}}\n\n:Understanding lunar cycles\n\n:Nodal precession\n:[A diagram showing a broad cosine-like wave with wavelength labelled as 18.6 years. To the right are two diagrams showing an orbital cycle moving in and out of plane.]\n\n:Apsidal precession\n:[A diagram similar to the one above but with a slightly shorter wavelength, labelled as 8.9 years. To the right are two diagrams showing an elliptical orbit around a planet and the same orbit rotated.]\n\n:Phase\n:[A diagram similar to those above with a shorter wavelength, labelled as 29.5 days. To the right is a diagram showing four phases of the moon: New, Waxing crescent, Waxinf gibbos, Full.]\n\n:Distance\n:[A diagram similar to those above with a shorter wavelength, labelled as 27.5 days. To the right is a diagram showing the distance of the moon from the Earth over time, with distances marked by arrows.]\n\n:Earth-Moon relative size\n:[A wave with long wavelength with an arrow pointing to the minimum labelled 'Earth bigger' and an arrow pointing to the maximum labelled 'Moon bigger'. To the right are two diagrams of the moon and Earth, one showing the Earth bigger than the Moon and the other showing the Moon bigger than the Earth.]\n\n:Lunar shape\n:[A wave with long wavelength with an arrow pointing to the minimum labelled 'Circle' and an arrow pointing to the maximum labelled 'Square'. 
To the right is a diagram showing a circle, a circle transforming into a square with outward arrows at each corner and a square transforming into a circle with inward arrows.]\n\n:Lunar mood\n:[A wave with long wavelength with an arrow pointing to the minimum labelled 'Bad' and an arrow pointing to the maximum labelled 'Good'. To the right are four emojis: :), :|, :(, :|]\n\n:[A superimposed graph of all the above waves. Different points on the graph are labelled: Harvest moon, Supermoon, Blue moon, Skinny Jeans popular, Super blood moon, Golden age of TV, Dire moon, Pork moon, Two week window in which astrology works, Total eclipse of the sea.]\n\n\n\n{{comic discussion}}"
}
}
}
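The request URL above can be assembled programmatically for any page title. A small sketch (the helper name `explainUrl` is mine, not part of the MediaWiki API):

```javascript
// Build the explainxkcd parse-API URL for a given wiki page title.
function explainUrl(pageTitle) {
  const base = 'https://www.explainxkcd.com/wiki/api.php';
  const params = new URLSearchParams({
    action: 'parse',
    page: pageTitle,
    prop: 'wikitext',
    sectiontitle: 'Explanation',
    format: 'json'
  });
  return base + '?' + params.toString();
}
```

`URLSearchParams` takes care of percent-encoding the colon and space in titles like `2172: Lunar Cycles`.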
I was wondering this myself today and, no surprise, someone has done most of the hard work for us in the form of the xkcd explainer Chrome extension. Specifically, the repository that houses the extension's code has a parser.js file (i.e., the hard work that has been done for us) and a main.js file built with the browser in mind, but whose logic can easily be adapted to other environments (e.g., Node.js) while still serving the primary goal: getting the explanation of an xkcd comic in an easy-to-use form.
The code snippet below contains my reworking/merging of the parser.js and main.js files linked to above. It can be copied and pasted right into the browser console to see the effect:
The example is for comic 74, given by loadExplain(74) in the code below. Simply change num in loadExplain(num) to the number of the comic whose explanation HTML you want, or use loadExplain() for the most recent comic.
// parser.js
let comicid = 0;
let refNum = 0;
let refs = [];

function wikiparse(wikitext, num){
    comicid = num;
    let lines = wikitext.split(/\r?\n/);
    let html = "";
    let bulletLevel = 0; //level of bullet points
    let quotes = 0; //previous line was quote
    let tablerow = false; //true if currently in table row
    for(let i = 0; i < lines.length; i++){
        let line = lines[i];
        if(line !== ""){
            line = convertLine(line); //perform simple inline parsing
            if(line[0] === "*"){ //bullet points
                let bulletNum = line.match(/^\*+/)[0].length; //number of * in front of string
                line = "<li>" + line.replace(/^\*+ */, "") + "</li>";
                if(bulletLevel < bulletNum){ //start of new level of bulleting
                    line = "<ul>" + line;
                    bulletLevel++;
                }
                else if(bulletLevel > bulletNum){ //end of level
                    line = "</ul>" + line;
                    bulletLevel--;
                }
            }
            else if(bulletLevel > 0){ //end of bulleting
                line = "</ul><p>" + line + "</p>";
                bulletLevel--;
            }
            else if(line[0] === ":"){ //quotes
                line = "<dd>" + line.substring(1) + "</dd>";
                if(!quotes){ //start of quote
                    line = "<dl>" + line;
                    quotes = 1;
                }
            }
            else if(quotes){ //end of quote
                line = "</dl><p>" + line + "</p>";
                quotes = 0;
            }
            else if(line[0] === '{' && line[1] === '|'){ //tables
                line = "<table " + line.substring(2) + ">";
            }
            else if(line[0] === '|' && line[1] == '-'){ //start of table row
                line = "";
                tablerow = true;
            }
            else if(line[0] === '|' && line[1] === '}'){ //end of table
                line = "</table>"; //no rows?
                tablerow = false;
            }
            else if(line[0] === '!'){ //table heading
                line = "<th>" + line.substring(1).replace(/!!/g, "</th><th>") + "</th>";
                if(tablerow){
                    line = "<tr>" + line + "</tr>";
                    tablerow = false;
                }
            }
            else if(line[0] === '|'){ //table cell
                line = "<td>" + line.substring(1).replace(/\|\|/g, "</td><td>") + "</td>";
                if(tablerow){
                    line = "<tr>" + line + "</tr>";
                    tablerow = false;
                }
            }
            else line = "<p>" + line + "</p>"; //regular text
            html += line;
        }
    }
    if(refNum > 0) {
        let refFormatted = "<div class='references'><ol>";
        for(let i = 0; i < refs.length; i++) {
            refFormatted += "<li id='note-" + i + "'><a href='#ref-" + i + "'>↑</a><span>" + refs[i] + "</span></li>";
        }
        refFormatted += "</ol></div>";
        html += refFormatted;
    }
    return html;
}
function convertLine(line){ //replace simple inline wiki markup
    //headings and subheadings
    //format: ==<text>== -> <h2>, ===<text>=== -> <h3>, etc.
    if(line[0] === '=' && line[line.length - 1] === '='){
        let headingLeft = line.match(/^=+/)[0].length; //number of '='s on the left
        let headingRight = line.match(/=+$/)[0].length; //number of '='s on the right
        let headingNum = Math.min(headingLeft, headingRight);
        if(headingNum >= 1 && headingNum <= 6){
            line = "<h" + headingNum + ">" + line.substring(headingNum, line.length - headingNum) + "</h" + headingNum + ">";
        }
    }
    //link to another xkcd comic
    //format: [[<id>: <title>]] or [[<id>: <title>|<id>]]
    line = line.replace(/\[\[([0-9]+): [^\]]+(|\1)?\]\]/g, convertComicLink);
    //link to within explain page
    //format: [[#<heading>|<display>]]
    line = line.replace(/\[\[#[^\]]+\]\]/g, convertHeadingLink);
    //internal links
    //format: [[<target>]] or [[<target>|<display>]]
    line = line.replace(/\[\[[^\]]+\]\]/g, convertInternalLink);
    //citation needed
    //format: {{Citation needed}}
    line = line.replace(/{{Citation needed}}/g, convertCitationLink);
    //what if links
    //format: {{what if|<id>|<title>}}
    line = line.replace(/{{what if(\|[^\|]+){1,2}}}/g, convertWhatIfLink);
    //wikipedia links
    //format: {{w|<target>}} or {{w|<target>|<display>}} (or W)
    line = line.replace(/{{[wW](\|[^}]+){1,2}}}/g, convertWikiLink);
    //tvtropes links
    //format: {{tvtropes|<target>|<display>}}
    line = line.replace(/{{tvtropes(\|[^}]+){2}}}/g, convertTropesLink);
    //other external links
    //format: [http://<url>] or [http://<url> <display>] (includes https)
    line = line.replace(/\[((http|https):)?\/\/([^\]])+]/g, convertOtherLink);
    //references
    line = line.replace(/<ref>.+<\/ref>/g, convertRefLink);
    //bold
    //format: '''<text>'''
    line = line.replace(/'''(?:(?!''').)+'''/g, convertBold);
    //italics
    //format: ''<text>'' or ''<text>
    line = line.replace(/''[^('')\n]+''/g, convertItalics)
               .replace(/''.+/g, convertItalics);
    return line;
}
function convertComicLink(link){
    let firstSep = link.indexOf(":");
    let secondSep = link.indexOf("|");
    let id = link.substring(2, firstSep);
    let display = "";
    if(secondSep === -1) {
        let title = link.substring(firstSep + 2, link.length - 2);
        display = id + ": " + title;
    }
    else {
        display = link.substring(secondSep + 1, link.length - 2);
    }
    //anchor markup was stripped from the pasted snippet; assumed target: the explain xkcd page for the comic
    return '<a href="https://www.explainxkcd.com/wiki/index.php/' + id + '">' + display + '</a>';
}

function convertHeadingLink(link){
    let target = link.substring(3, link.length - 2);
    let display = "";
    let separator = target.indexOf("|");
    if(separator === -1){
        display = target;
    }
    else{
        display = target.substring(separator + 1);
        target = target.substring(0, separator);
    }
    //anchor markup was stripped from the pasted snippet; assumed target: an in-page anchor
    return '<a href="#' + target + '">' + display + '</a>';
}

function convertInternalLink(link){
    let target = link.substring(2, link.length - 2);
    let display = "";
    let separator = target.indexOf("|");
    if(separator === -1){
        display = target;
    }
    else{
        display = target.substring(separator + 1);
        target = target.substring(0, separator);
    }
    //anchor markup was stripped from the pasted snippet; assumed target: the explain xkcd wiki page
    return '<a href="https://www.explainxkcd.com/wiki/index.php/' + target + '">' + display + '</a>';
}

function convertCitationLink(){
    return '<sup>[<i>citation needed</i>]</sup>';
}

function convertWhatIfLink(link){
    let firstSep = link.indexOf("|") + 1;
    let secondSep = link.indexOf("|", firstSep);
    let id = link.substring(firstSep, secondSep);
    let title = link.substring(secondSep + 1, link.length - 2);
    return '<a rel="nofollow" href="http://what-if.xkcd.com/' + id + '">' + title + '</a>';
}

function convertWikiLink(link){
    let target = link.substring(4, link.length - 2);
    let display = "";
    let separator = target.indexOf("|");
    if(separator === -1){
        display = target;
    }
    else{
        display = target.substring(separator + 1);
        target = target.substring(0, separator);
    }
    //anchor markup was stripped from the pasted snippet; assumed target: English Wikipedia
    return '<a href="https://en.wikipedia.org/wiki/' + target + '">' + display + '</a>';
}

function convertTropesLink(link){
    let firstSep = link.indexOf("|") + 1;
    let secondSep = link.indexOf("|", firstSep);
    let target = link.substring(firstSep, secondSep);
    let display = link.substring(secondSep + 1, link.length - 2);
    return '<a rel="nofollow" class="external text" href="http://tvtropes.org/pmwiki/pmwiki.php/Main/' + target + '">' +
        '<span style="background: #eef;" title="Warning: TV Tropes. See comic 609.">' + display + '</span>' +
        '</a>';
}

function convertOtherLink(link){
    let separator = link.indexOf(" ");
    let target = "";
    let display = "";
    if(separator === -1){
        target = link.substring(1, link.length - 1);
        display = "[X]";
    }
    else{
        target = link.substring(1, separator);
        display = link.substring(separator + 1, link.length - 1);
    }
    return '<a rel="nofollow" href="' + encodeURI(target) + '">' + display + '</a>';
}

function convertRefLink(link) {
    let display = link.substring(5, link.length - 6);
    refNum++;
    refs.push(display);
    return "<sup id='ref-" + (refNum - 1) + "'><a href='#note-" + (refNum - 1) + "'>[" + refNum + "]</a></sup>";
}

function convertBold(text){
    return "<b>" + text.substring(3, text.length - 3) + "</b>";
}

function convertItalics(text){
    if(text.substr(-2) === "''") {
        return "<i>" + text.substring(2, text.length - 2) + "</i>";
    }
    return "<i>" + text.substring(2) + "</i>";
}
// main.js
async function getJSON(url, callback){
const response = await fetch(url);
const responseJSON = await response.json();
callback(responseJSON);
}
async function loadExplain(comic = ''){
if (comic === '') {
let latestComic = await fetch("https://explainxkcd.com/wiki/api.php?action=expandtemplates&format=json&origin=*&text={{LATESTCOMIC}}");
let latestComicJSON = await latestComic.json();
comic = +latestComicJSON.expandtemplates['*'];
}
getJSON("https://explainxkcd.com/wiki/api.php?action=query&prop=revisions&rvprop=content&format=json&origin=*&redirects=1&titles=" + comic, function(obj){
let pages = obj.query.pages;
let page = pages[Object.keys(pages)[0]].revisions[0]["*"];
let start = page.indexOf("{{incomplete|");
if(start === -1){ //no {{incomplete}} tag: find the Explanation heading
start = page.indexOf("== Explanation ==") + 18;
if(start === -1 + 18){
start = page.indexOf("==Explanation==") + 16;
}
if(page[start] == "\n") start++;
}
else{ //{{incomplete}} tag present: explanation starts after its line
start = page.indexOf("\n", start) + 1;
}
let end = page.indexOf("==Transcript==") - 1;
if(end === -1 - 1){
end = page.indexOf("== Transcript ==") - 1;
}
let rawExplain = page.substring(start, end);
let explanation = wikiparse(rawExplain, comic);
let readMore = '<p><b>Read more at the explain xkcd wiki.</b></p>';
console.log(explanation, readMore);
});
}
// get parsed explanation HTML
// loadExplain(); // for most recent comic
loadExplain(74); // for custom comic (1 - 2296 as of this answer posting)
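As a standalone sanity check for one of the converters above, here is convertOtherLink exercised on a sample token; the `[url display]` input format is inferred from the substring arithmetic, so treat the sample inputs as an assumption:

```javascript
function convertOtherLink(link) {
  let separator = link.indexOf(" ");
  let target = "";
  let display = "";
  if (separator === -1) {
    // bare link like [http://example.com]: use a placeholder label
    target = link.substring(1, link.length - 1);
    display = "[X]";
  } else {
    // labelled link like [http://example.com Example]
    target = link.substring(1, separator);
    display = link.substring(separator + 1, link.length - 1);
  }
  return '<a rel="nofollow" href="' + encodeURI(target) + '">' + display + '</a>';
}

console.log(convertOtherLink("[http://example.com Example]"));
// → <a rel="nofollow" href="http://example.com">Example</a>
```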

Trouble Adding Array Output to a Dynamically Generated HTML String in GAS (Google Apps Script)

I am trying to automate my business's blog. I want to create a dynamic HTML string to use as a WordPress blog post description. I am pulling text data from email bodies in my Gmail account to use as information, and I parse each email body using the first function below.
I have everything working properly except for the for loop (in the second code block) that creates the description of the post. I have searched for hours and tried dozens of different techniques, but I can't figure it out for the life of me.
Here is how I am reading the text values into an array:
function getMatches(string, regex, index) {
index || (index = 1); // default to the first capturing group
var matches = [];
var match;
while (match = regex.exec(string)) {
matches.push(match[index]);
}
return matches;
}
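As a quick illustration, with a made-up body and pattern standing in for the real email text and regexes, the exec loop collects one capture per match. Note that the pattern must carry the g flag; otherwise regex.exec keeps returning the first match and the while loop never terminates:

```javascript
function getMatches(string, regex, index) {
  index || (index = 1); // default to the first capturing group
  var matches = [];
  var match;
  while (match = regex.exec(string)) {
    matches.push(match[index]);
  }
  return matches;
}

// hypothetical email body and pattern, just to show the shape of the result
var body = "Price: 10 USD\nPrice: 25 USD";
var prices = getMatches(body, /Price: (\d+) USD/g, 1);
console.log(prices); // → [ '10', '25' ]
```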
This is how I am trying to dynamically output the text arrays to create a basic HTML blogpost description (which I pass to xmlrpc to post):
var1 = getMatches(string, regex expression, 1);
var2 = getMatches(string, regex expression, 1);
var3 = getMatches(string, regex expression, 1);
var4 = getMatches(string, regex expression, 1);
var fulldesc = "<center>";
var text = "";
for (var k=0; k<var1.length; k++) {
text = "<u><b>Var 1:</u></b> " + var1[k] + ", <u><b>Var 2:</u></b> " + var2[k] + ", <u><b>Var 3:</u></b> " + var3[k] + ", <u><b>Var 4:</u></b> " + var4[k] + ", <br><br>";
fulldesc += text;
}
fulldesc += "</center>";
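For what it's worth, the loop itself handles multi-element arrays correctly; a standalone version with hypothetical two-element data produces both rows, which points the failure somewhere other than this block:

```javascript
// hypothetical stand-ins for the getMatches() results
var var1 = ["a1", "a2"];
var var2 = ["b1", "b2"];
var var3 = ["c1", "c2"];
var var4 = ["d1", "d2"];

var fulldesc = "<center>";
var text = "";
for (var k = 0; k < var1.length; k++) {
  text = "<u><b>Var 1:</u></b> " + var1[k] + ", <u><b>Var 2:</u></b> " + var2[k] +
         ", <u><b>Var 3:</u></b> " + var3[k] + ", <u><b>Var 4:</u></b> " + var4[k] + ", <br><br>";
  fulldesc += text;
}
fulldesc += "</center>";

console.log(fulldesc); // both rows are present in the output
```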
Lastly here is the blog post description code (using GAS XMLRPC library):
var fullBlog = "<b><u>Headline:</u> " + sub + "</b><br><br>" + fulldesc + "<br><br>General Description: " + desc;
var blogPost = {
post_type: 'post',
post_status: 'publish', // Set to draft or publish
title: 'Ticker: ' + sub, //sub is from gmail subject and works fine
categories: cat, //cat is defined elsewhere and works fine
date_created_gmt: pubdate2, //defined elsewhere (not working but thats another topic)
mt_allow_comments: 'closed',
description: fullBlog
};
request.addParam(blogPost);
If there's only one value in the var1/2/3/4 arrays, all works as it should. But with any more than one value I get no output at all from the fulldesc variable. All the other text variables work as they should, and the blog still gets posted (just minus some very important information). I'm pretty sure the problem lies in my for loop, which appends the HTML description to the text variable.
Any suggestions would be greatly appreciated; I'm burned out trying to get the answer! I am a self-taught programmer (just from reading this forum), so please go easy on me if I missed something stupid :)
Figured it out: it wasn't the HTML/text loop at all. My blog post title had to be a variable or text, but not both.
Not working:
title: 'Ticker: ' + sub, //sub is from gmail subject and works fine
Working:
var test = 'Ticker: ' + sub;
//
title:test,

Partial replace in Google Docs of only what matches, preserving formatting

Let's assume that we have this first paragraph in our Google document:
Wo1rd word so2me word he3re last.
We need to search and replace some parts of the text, but the change must show up in the revision history as if we had edited only those parts, and we must not lose our formatting (bold, italic, color, etc.).
What I have and have understood so far: capturing groups don't work in replaceText(), as described in the documentation. We can use pure JS replace(), but that only works on strings, and our Google document is an array of objects, not strings. So after a lot of tries I stopped at the code attached later in this message.
What I can't beat: how can I replace only part of what I've found? Capturing groups are a very powerful and suitable instrument, but I can't use them for the replacement. Either they don't work, or I can only replace the whole paragraph, which is unacceptable: the revision history would then show a full-paragraph replacement and the paragraph would lose its formatting. What if the search target appears in each and every paragraph, but only one letter must be changed? We would see a full-document replacement in the history, and it would be hard to find what really changed.
My first idea was to compare the string that replace() gives me with the contents of the paragraph, symbol after symbol, and replace what differs, but I understand that this only works if we are sure that exactly one letter changed. What if the replacement deletes or adds some words; how can the two be kept in sync? That would be a much bigger problem.
None of the topics I've found and read three times over helped or moved me off the dead point.
So, are there any ideas on how to beat this problem?
function RegExp_test() {
var docParagraphs = DocumentApp.getActiveDocument().getBody().getParagraphs();
var i = 0, text0, text1, test1, re, rt, count;
// equivalent of .asText() ???
text0 = docParagraphs[i].editAsText(); // obj
// equivalent of .editAsText().getText(), .asText().getText()
text1 = docParagraphs[i].getText(); // str
if (text1 !== '') {
re = new RegExp(/(?:([Ww]o)\d(rd))|(?:([Ss]o)\d(me))|(?:([Hh]e)\d(re))/g); // v1
// re = new RegExp(/(?:([Ww]o)\d(rd))/); // v2
count = (text1.match(re) || []).length; // re v1: 7, re v2: 3
if (count) {
test1 = text1.match(re); // v1: ["Wo1rd", "Wo", "rd", , , , , ]
// for (var j = 0; j < count; j++) {
// test1 = text1.match(re)[j];
// }
text0.replaceText("(?:([Ww]o)\\d(rd))", '\1-A-\2'); // GAS func
// #1: \1, \2 etc - didn't work: " -A- word so2me word he3re last."
test1 = text0.getText();
// js func, text2 OK: "Wo1rd word so-B-me word he3re last.", just in memory now
text1 = text1.replace(/(?:([Ss]o)\d(me))/, '$1-B-$2'); // working with str, not obj
// rt OK: "Wo1rd word so-B-me word he-C-re last."
rt = text1.replace(/(?:([Hh]e)\d(re))/, '$1-C-$2');
// #2: we used capturing groups ok, but replaced whole line and lost all formatting
text0.replaceText(".*", rt);
test1 = text0.getText();
}
}
Logger.log('Test finished')
}
Found a solution. It's primitive enough, but it can be a base for a more complex procedure that can fix all occurrences of capture groups, detect them, mix them, etc. If someone wants to improve it, you are welcome!
function replaceTextCG(text0, re, to) {
var res, pos_f, pos_l;
var matches = text0.getText().match(re);
var count = (matches || []).length;
to = to.replace(/(\$\d+)/g, ',$1,').replace(/^,/, '').replace(/,$/, '').split(",");
for (var i = 0; i < count; i++) {
res = re.exec(text0.getText())
for (var j = 1; j < res.length - 1; j++) {
pos_f = res.index + res[j].length;
pos_l = re.lastIndex - res[j + 1].length - 1;
text0.deleteText(pos_f, pos_l);
text0.insertText(pos_f, to[1]);
}
}
return count;
}
function RegExp_test() {
var docParagraphs = DocumentApp.getActiveDocument().getBody().getParagraphs();
var i = 0, text0, count;
// equivalent of .asText() ???
text0 = docParagraphs[i].editAsText(); // obj
if (text0.getText() !== '') {
count = replaceTextCG(text0, /(?:([Ww]o)\d(rd))/g, '$1A$2');
count = replaceTextCG(text0, /(?:([Ss]o)\d(me))/g, '$1B$2');
count = replaceTextCG(text0, /(?:([Hh]e)\d(re))/g, '$1C$2');
}
Logger.log('Test finished')
}
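Outside of Docs, the position arithmetic in replaceTextCG can be exercised with a minimal mock of the Text element. The mock below is mine, not part of the DocumentApp API; it implements just the three methods the function touches, with deleteText taking an inclusive end offset as in Apps Script:

```javascript
// minimal stand-in for a DocumentApp Text element, just enough for a test
function MockText(s) {
  this.s = s;
}
MockText.prototype.getText = function () { return this.s; };
MockText.prototype.deleteText = function (from, to) { // inclusive range, like Apps Script
  this.s = this.s.slice(0, from) + this.s.slice(to + 1);
};
MockText.prototype.insertText = function (at, t) {
  this.s = this.s.slice(0, at) + t + this.s.slice(at);
};

function replaceTextCG(text0, re, to) {
  var res, pos_f, pos_l;
  var matches = text0.getText().match(re);
  var count = (matches || []).length;
  // split '$1A$2' into ["$1", "A", "$2"]
  to = to.replace(/(\$\d+)/g, ',$1,').replace(/^,/, '').replace(/,$/, '').split(",");
  for (var i = 0; i < count; i++) {
    res = re.exec(text0.getText());
    for (var j = 1; j < res.length - 1; j++) {
      // replace only the span between group j and group j+1
      pos_f = res.index + res[j].length;
      pos_l = re.lastIndex - res[j + 1].length - 1;
      text0.deleteText(pos_f, pos_l);
      text0.insertText(pos_f, to[1]);
    }
  }
  return count;
}

var t = new MockText("Wo1rd word so2me word he3re last.");
replaceTextCG(t, /(?:([Ww]o)\d(rd))/g, '$1A$2');
console.log(t.getText()); // → "WoArd word so2me word he3re last."
```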

How do I detect and remove "\n" from a given string in ActionScript?

I have the following code,
public static function clearDelimeters(formattedString:String):String
{
return formattedString.split("\n").join("").split("\t").join("");
}
The tab characters, i.e. "\t", are removed, but the newline characters "\n" are not removed from formattedString.
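For what it's worth, the split/join chain does remove literal \n characters, which is a strong hint that the input actually uses \r\n (or lone \r) line endings. In plain JavaScript, which shares this String API with AS3:

```javascript
// same technique as the AS3 helper
function clearDelimiters(formattedString) {
  return formattedString.split("\n").join("").split("\t").join("");
}

console.log(clearDelimiters("a\nb\tc")); // → "abc"   (plain LF is removed)
console.log(clearDelimiters("a\r\nb")); // → "a\rb"  (the CR survives)
```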
I even tried
public static function clearDelimeters(formattedString:String):String
{
var formattedStringChar:String = "";
var originalString:String = "";
var j:int = 0;
while((formattedStringChar = formattedString.charAt(j)) != "")
{
if(formattedStringChar == "\t" || formattedStringChar == "\n")
{
j++;
}
else
{
originalString = originalString + formattedString;
}
j++;
}
return originalString;
}
This also didn't work.
What I'm hoping for is the reason why the newline delimiters are not removed, and some way to remove them.
Thank you in anticipation.
There are a few forms the line-end marker can take: CRLF, CR, LF, or LFCR. Possibly your string contains CRLF for line endings instead of only LF (\n), and so, with all the LFs removed, some text editors will still treat the CRs as line-end characters.
Try this instead:
//this function requires AS3
public static function clearDelimeters(formattedString:String):String {
return formattedString.replace(/[\u000d\u000a\u0008\u0020]+/g,"");
}
Note that \t is for tab, it's not space. Or if you're working with HTML, <br> and <br/> are used to make line breaks in HTML but they are not line-end characters.
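AS3's regular expressions follow ECMAScript, so the same pattern can be checked in plain JavaScript. Note that the character class also strips ordinary spaces (\u0020); add \u0009 to it if tabs should be removed as well:

```javascript
// removes CR (\u000d), LF (\u000a), backspace (\u0008) and space (\u0020)
function clearDelimiters(formattedString) {
  return formattedString.replace(/[\u000d\u000a\u0008\u0020]+/g, "");
}

console.log(clearDelimiters("line1\r\nline2 \rline3")); // → "line1line2line3"
```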
The regexp answer is correct, but I always prefer the more readable version (I don't know how it compares performance-wise, though):
result = string.split("\r\n").join("");
or do the \n and \r splits separately. CRLF ('\r\n', 0x0D0A) is the standard line ending on Windows; check Wikipedia for why the two characters are joined together.
http://en.wikipedia.org/wiki/Newline#Representations
Are you sure it isn't a
<br>
or
</br>?
or
\r
// try this. it works for me!!! Wink-;^D
function removeNewLinesFrom(This){
nl='' + newline;
removed=''
for(i=0;i<=(This.length-1);i++){
if(This.charAt(i)!=nl){removed+=This.charAt(i)}
}
return(removed)
}
// Simplify the name of the function
rnlf=removeNewLinesFrom
// Write an example
example='hello '+newline+'world'
// prompt the example
trace('prompt='+rnlf(example))

iTextSharp HTML to PDF preserving spaces

I am using the FreeTextBox.dll to get user input, and storing that information in HTML format in the database. A sample of the user's input is below:
                                                                     133 Peachtree St NE                                                                     Atlanta,  GA 30303                                                                     404-652-7777                                                                      Cindy Cooley                                                                     www.somecompany.com                                                                     Product Stewardship Mgr                                                                     9/9/2011Deidre's Company123 Test StAtlanta, GA 30303Test test.  
I want the HTMLWorker to preserve the white space the user enters, but it strips it out. Is there a way to preserve the user's white space? Below is an example of how I am creating my PDF document.
Public Shared Sub CreatePreviewPDF(ByVal vsHTML As String, ByVal vsFileName As String)
Dim output As New MemoryStream()
Dim oDocument As New Document(PageSize.LETTER)
Dim writer As PdfWriter = PdfWriter.GetInstance(oDocument, output)
Dim oFont As New Font(Font.FontFamily.TIMES_ROMAN, 8, Font.NORMAL, BaseColor.BLACK)
Using output
Using writer
Using oDocument
oDocument.Open()
Using sr As New StringReader(vsHTML)
Using worker As New html.simpleparser.HTMLWorker(oDocument)
worker.StartDocument()
worker.SetInsidePRE(True)
worker.Parse(sr)
worker.EndDocument()
worker.Close()
oDocument.Close()
End Using
End Using
HttpContext.Current.Response.ContentType = "application/pdf"
HttpContext.Current.Response.AddHeader("Content-Disposition", String.Format("attachment;filename={0}.pdf", vsFileName))
HttpContext.Current.Response.BinaryWrite(output.ToArray())
HttpContext.Current.Response.End()
End Using
End Using
output.Close()
End Using
End Sub
There's a glitch in iText and iTextSharp, but you can fix it pretty easily if you don't mind downloading the source and recompiling it. You need to make a change to two files; any changes I've made are commented inline in the code. Line numbers are based on the 5.1.2.0 code, rev 240.
The first is in iTextSharp.text.html.HtmlUtilities.cs. Look for the function EliminateWhiteSpace at line 249 and change it to:
public static String EliminateWhiteSpace(String content) {
// multiple spaces are reduced to one,
// newlines are treated as spaces,
// tabs, carriage returns are ignored.
StringBuilder buf = new StringBuilder();
int len = content.Length;
char character;
bool newline = false;
bool space = false;//Detect whether we have written at least one space already
for (int i = 0; i < len; i++) {
switch (character = content[i]) {
case ' ':
if (!newline && !space) {//If we are not at a new line AND ALSO did not just append a space
buf.Append(character);
space = true; //flag that we just wrote a space
}
break;
case '\n':
if (i > 0) {
newline = true;
buf.Append(' ');
}
break;
case '\r':
break;
case '\t':
break;
default:
newline = false;
space = false; //reset flag
buf.Append(character);
break;
}
}
return buf.ToString();
}
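To see what the patched collapser actually does, here is a line-for-line JavaScript port of the loop (the port is mine, for illustration only):

```javascript
function eliminateWhiteSpace(content) {
  // multiple spaces reduce to one, a newline becomes a single space,
  // tabs and carriage returns are dropped
  var buf = "";
  var newline = false;
  var space = false; // did we just write a space?
  for (var i = 0; i < content.length; i++) {
    var character = content[i];
    switch (character) {
      case ' ':
        if (!newline && !space) { // not after a newline, and no space just written
          buf += character;
          space = true;
        }
        break;
      case '\n':
        if (i > 0) {
          newline = true;
          buf += ' ';
        }
        break;
      case '\r':
      case '\t':
        break; // ignored entirely
      default:
        newline = false;
        space = false;
        buf += character;
        break;
    }
  }
  return buf;
}

console.log(eliminateWhiteSpace("a  b\n  c")); // → "a b c"
```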
The second change is in iTextSharp.text.xml.simpleparser.SimpleXMLParser.cs. In the function Go at line 185 change line 248 to:
if (html /*&& nowhite*/) {//removed the nowhite check from here because that should be handled by the HTML parser later, not the XML parser
Thanks for the help everyone. I was able to find a small workaround by doing the following:
vsHTML.Replace(" ", "&nbsp;").Replace(Chr(9), "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;").Replace(Chr(160), "&nbsp;").Replace(vbCrLf, "<br />")
That is: replacing each white space with &nbsp;, Chr(9) with 5 &nbsp; entities, and Chr(160) with &nbsp;.
I would recommend using wkhtmltopdf instead of iText. wkhtmltopdf outputs the HTML exactly as rendered by WebKit (Google Chrome, Safari) instead of iText's conversion; it is just a binary that you can call. That being said, I would check the HTML to make sure the paragraphs and/or line breaks in the user input are still there; they might be stripped out before the conversion.