Replace HTML escape sequence with its single character equivalent in C

My program loads some news articles from the web, so I then have an array of HTML documents representing these articles. I need to parse them and show only the relevant content on the screen. That includes converting all HTML escape sequences into readable symbols, so I need a function similar to unescape in JavaScript.
I know there are libraries in C to parse html.
But is there some easy way to convert HTML escape sequences like &amp; or &#33; to just & and !?

This is something that you typically wouldn't use C for. I would have used Python. Here are two questions that could be a good start:
What's the easiest way to escape HTML in Python?
How do you call Python code from C code?
But apart from that, the solution is to write a proper parser. There are lots of resources out there on that topic, but basically you could do something like this:
parseFile()
    while not EOF
        ch = readNextCharacter()
        if ch == '\'
            readNextCharacter()
        elseif ch == '&'
            readEscapeSequence()
        else
            output += ch

readEscapeSequence()
    seq = ""
    ch = readNextCharacter()
    while ch != ';'
        seq += ch
        ch = readNextCharacter()
    replace = lookupEscape(seq)
    output += replace
Note that this is only pseudo code to get you started
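As a rough C illustration of that pseudocode (a minimal sketch only; the entity table, the stream-based signatures, and the fallback behaviour are my assumptions, not part of the original answer):

#include <stdio.h>
#include <string.h>

/* Hypothetical entity table: names (without '&' and ';') and replacements. */
static const struct { const char *name; const char *repl; } entities[] = {
    { "amp",  "&"  },
    { "lt",   "<"  },
    { "gt",   ">"  },
    { "quot", "\"" },
};

/* lookupEscape: find the replacement for an entity name, or NULL if unknown. */
const char *lookupEscape(const char *seq)
{
    for (size_t i = 0; i < sizeof entities / sizeof entities[0]; i++)
        if (strcmp(entities[i].name, seq) == 0)
            return entities[i].repl;
    return NULL;
}

/* readEscapeSequence: the '&' has already been consumed from fp; read up to
   the ';', look the name up, and append the replacement to out. */
void readEscapeSequence(FILE *fp, char *out)
{
    char seq[16];
    size_t n = 0;
    int ch;

    while ((ch = fgetc(fp)) != EOF && ch != ';' && n < sizeof seq - 1)
        seq[n++] = (char)ch;
    seq[n] = '\0';

    const char *repl = lookupEscape(seq);
    strcat(out, repl ? repl : "?");   /* '?' marks an unrecognised entity */
}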

Just wrote and tested a version that does this (crudely). Didn't take long.
You'll want something like this:
typedef struct {
    int gotLen; // save myriad calls to strlen()
    char *got;
    char *want;
} trx_t;

trx_t lut[] = {
    { 5, "&amp;",    "&" },
    { 5, "&#33;",    "!" },
    { 8, "&dagger;", "*" },
};
const int nLut = sizeof lut/sizeof lut[0];
And then a loop with two pointers that copies characters within the same buf, sniffing for the '&' that triggers a search of the replacement table. If found, copy the replacement string to the destination and advance the source pointer to skip past the HTML token. If not found, then the LUT may need additional tokens.
Here's a beginning...
void replace( char *buf ) {
    char *pd = buf, *ps = buf;
    while( *ps )
        if( *ps != '&' )
            *pd++ = *ps++;
        else {
            // EDIT: Credit @Craig Estey
            if( ps[1] == '#' ) {
                if( ps[2] == 'x' || ps[2] == 'X' ) {
                    /* decode hex value and save as char(s) */
                } else {
                    /* decode decimal value and save as char(s) */
                }
                /* advance pointers and continue */
            }
            for( int i = 0; i < nLut; i++ ) {
                /* not giving it all away */
                /* handle "found" and "not found" in LUT */
            }
        }
    *pd = '\0';
}
This was the test program
int main() {
    char str[] = "The fox &amp; hound&dagger; went for a walk&#33; &amp; chat.";
    puts( str );
    replace( str );
    puts( str );
    return 0;
}
and this was the output
The fox &amp; hound&dagger; went for a walk&#33; &amp; chat.
The fox & hound* went for a walk! & chat.
The "project" is to write the interesting bit of the code. It's not difficult.
Caveat: this only works when each substitution is no longer than the token it replaces. Otherwise you need two buffers.
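For what it's worth, here is one hedged guess at how the withheld piece might look, as a self-contained sketch: the function name is mine, the table is restated so the sketch compiles on its own, numeric references are ignored, and this is not the answer's actual code.

#include <string.h>

typedef struct {
    int gotLen;   // length of the HTML token
    char *got;    // HTML escape sequence
    char *want;   // replacement text
} trx_t;

static trx_t lut[] = {
    { 5, "&amp;",    "&" },
    { 5, "&#33;",    "!" },
    { 8, "&dagger;", "*" },
};
static const int nLut = sizeof lut / sizeof lut[0];

// In-place replacement; safe because every replacement is
// no longer than the token it replaces.
void replaceSketch( char *buf ) {
    char *pd = buf, *ps = buf;
    while( *ps ) {
        if( *ps != '&' ) {
            *pd++ = *ps++;
            continue;
        }
        int i;
        for( i = 0; i < nLut; i++ ) {
            if( strncmp( ps, lut[i].got, lut[i].gotLen ) == 0 ) {
                size_t wantLen = strlen( lut[i].want );
                memcpy( pd, lut[i].want, wantLen ); // copy replacement
                pd += wantLen;
                ps += lut[i].gotLen;                // skip past the HTML token
                break;
            }
        }
        if( i == nLut )     // unknown token: copy the '&' through as-is
            *pd++ = *ps++;
    }
    *pd = '\0';
}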

Partially replace only what matches in docs and preserve formatting

Let's assume that we have this first paragraph in our Google document:
Wo1rd word so2me word he3re last.
We need to search and replace some parts of the text, but the edit history must show that only those parts changed, and we must not lose our formatting (bold, italic, color, etc.).
What I have understood so far: capturing groups don't work in replaceText(), as described in the documentation. We can use plain JavaScript replace(), but only on strings, and our Google document is an array of objects, not strings. After a lot of attempts I stopped at the code attached later in this message.
What I can't beat: how can I replace only part of what I've found? Capturing groups are a powerful and suitable instrument, but I can't use them for the replacement. Either they don't work, or I can only replace the whole paragraph, which is unacceptable: the edit history would show a full paragraph replacement and the paragraph would lose its formatting. And what if the text we are searching for appears in every paragraph but only one letter must change? We would see a full document replacement in the history, and it would be hard to find what really changed.
My first idea was to take the string that replace() gives me, compare it with the paragraph contents symbol by symbol, and replace whatever differs. But I understand that this only works if we are sure just one letter changed. What if the replacement deletes or adds some words, how can that be synced? That would be a much bigger problem.
All the topics I've found (and read three times over) haven't helped or moved me off this dead point.
So, are there any ideas on how to beat this problem?
function RegExp_test() {
  var docParagraphs = DocumentApp.getActiveDocument().getBody().getParagraphs();
  var i = 0, text0, text1, test1, re, rt, count;
  // equivalent of .asText() ???
  text0 = docParagraphs[i].editAsText(); // obj
  // equivalent of .editAsText().getText(), .asText().getText()
  text1 = docParagraphs[i].getText(); // str
  if (text1 !== '') {
    re = new RegExp(/(?:([Ww]o)\d(rd))|(?:([Ss]o)\d(me))|(?:([Hh]e)\d(re))/g); // v1
    // re = new RegExp(/(?:([Ww]o)\d(rd))/); // v2
    count = (text1.match(re) || []).length; // re v1: 7, re v2: 3
    if (count) {
      test1 = text1.match(re); // v1: ["Wo1rd", "Wo", "rd", , , , , ]
      // for (var j = 0; j < count; j++) {
      //   test1 = text1.match(re)[j];
      // }
      text0.replaceText("(?:([Ww]o)\\d(rd))", '\1-A-\2'); // GAS func
      // #1: \1, \2 etc - didn't work: " -A- word so2me word he3re last."
      test1 = text0.getText();
      // js func, text2 OK: "Wo1rd word so-B-me word he3re last.", just in memory now
      text1 = text1.replace(/(?:([Ss]o)\d(me))/, '$1-B-$2'); // working with str, not obj
      // rt OK: "Wo1rd word so-B-me word he-C-re last."
      rt = text1.replace(/(?:([Hh]e)\d(re))/, '$1-C-$2');
      // #2: we used capturing groups ok, but replaced whole line and lost all formatting
      text0.replaceText(".*", rt);
      test1 = text0.getText();
    }
  }
  Logger.log('Test finished')
}
Found a solution. It's primitive enough, but it can be the base for a more complex procedure that fixes all occurrences of capture groups, detects them, mixes them, etc. If someone wants to improve it, you are welcome!
function replaceTextCG(text0, re, to) {
  var res, pos_f, pos_l;
  var matches = text0.getText().match(re);
  var count = (matches || []).length;
  // Split the replacement string into [group, literal, group, ...] pieces,
  // e.g. '$1A$2' -> ['$1', 'A', '$2']
  to = to.replace(/(\$\d+)/g, ',$1,').replace(/^,/, '').replace(/,$/, '').split(",");
  for (var i = 0; i < count; i++) {
    res = re.exec(text0.getText());
    for (var j = 1; j < res.length - 1; j++) {
      // Delete only the text between capture groups j and j+1,
      // then insert the literal replacement piece at that position
      pos_f = res.index + res[j].length;
      pos_l = re.lastIndex - res[j + 1].length - 1;
      text0.deleteText(pos_f, pos_l);
      text0.insertText(pos_f, to[1]);
    }
  }
  return count;
}
function RegExp_test() {
  var docParagraphs = DocumentApp.getActiveDocument().getBody().getParagraphs();
  var i = 0, text0, count;
  // equivalent of .asText() ???
  text0 = docParagraphs[i].editAsText(); // obj
  if (text0.getText() !== '') {
    count = replaceTextCG(text0, /(?:([Ww]o)\d(rd))/g, '$1A$2');
    count = replaceTextCG(text0, /(?:([Ss]o)\d(me))/g, '$1B$2');
    count = replaceTextCG(text0, /(?:([Hh]e)\d(re))/g, '$1C$2');
  }
  Logger.log('Test finished')
}

How to convert large UTF-8 encoded char* string to CStringW (UTF-16)?

I have a problem with converting a UTF-8 encoded string to a UTF-16 encoded CStringW.
Here is my source code:
CStringW ConvertUTF8ToUTF16( __in const CHAR * pszTextUTF8 )
{
    _wsetlocale( LC_ALL, L"Korean" );

    if ( (pszTextUTF8 == NULL) || (*pszTextUTF8 == '\0') )
    {
        return L"";
    }

    const size_t cchUTF8Max = INT_MAX - 1;
    size_t cchUTF8;
    HRESULT hr = ::StringCbLengthA( pszTextUTF8, cchUTF8Max, &cchUTF8 );
    if ( FAILED( hr ) )
    {
        AtlThrow( hr );
    }

    ++cchUTF8;
    int cbUTF8 = static_cast<int>( cchUTF8 );

    int cchUTF16 = ::MultiByteToWideChar(
        CP_UTF8,
        MB_ERR_INVALID_CHARS,
        pszTextUTF8,
        -1,
        NULL,
        0
    );

    CString strUTF16;
    strUTF16.GetBufferSetLength(cbUTF8);

    WCHAR * pszUTF16 = new WCHAR[cchUTF16];

    int result = ::MultiByteToWideChar(
        CP_UTF8,
        0,
        pszTextUTF8,
        cbUTF8,
        pszUTF16,
        cchUTF16
    );
    ATLASSERT( result != 0 );
    if ( result == 0 )
    {
        AtlThrowLastWin32();
    }

    strUTF16.Format(_T("%s"), pszUTF16);
    return strUTF16;
}
pszTextUTF8 is the content of an .htm file, encoded in UTF-8.
When the file is smaller than about 500 KB, this code works well.
But when converting a larger file (for example, a 648 KB .htm file that I have),
pszUTF16 contains the full content of the file, but strUTF16 does not (only about half of it).
I don't think the file reading is at fault.
Inside strUTF16, m_pszData does contain all of the content; how do I get at it?
strUTF16.GetBuffer() doesn't work.
The code in the question is chock-full of bugs, somewhere on the order of one bug per 1-2 lines of code.
Here is a short summary:
_wsetlocale( LC_ALL, L"Korean" );
Changing a global setting in a conversion function is unexpected and will break code that calls it. It's not even necessary; you aren't using the locale for the encoding conversion.
HRESULT hr = ::StringCbLengthA( pszTextUTF8, cchUTF8Max, &cchUTF8 );
This passes the wrong cchUTF8Max value (according to the documentation), and it counts the number of bytes (as opposed to the number of characters, i.e. code units). Besides all that, you do not even need to know the count, as you never use it (well, you do, but that is just another bug).
int cbUTF8 = static_cast<int>( cchUTF8 );
While that fixes the Hungarian prefix (count of bytes as opposed to count of characters), it won't save you from later using the value for something that requires an entirely unrelated one.
strUTF16.GetBufferSetLength(cbUTF8);
This resizes the string object that should eventually hold the UTF-16 encoded characters. But it doesn't use the correct number of characters (the previous call to MultiByteToWideChar would have provided that value), but rather chooses a completely unrelated value: The number of bytes in the UTF-8 encoded source string.
And it doesn't stop there: that line of code also throws away the returned pointer to the internal buffer that was ready to be written to. The missing ReleaseBuffer call is then only a natural consequence of not reading the documentation.
WCHAR * pszUTF16 = new WCHAR[cchUTF16];
While not a bug in itself, this needlessly allocates another buffer (this time passing the correct size). You already allocated a buffer in the previous call to GetBufferSetLength (albeit with the wrong size). Just use that; that's what the member function is for.
strUTF16.Format(_T("%s"), pszUTF16);
That is probably the most common anti-pattern associated with the printf family of functions: a convoluted way to write CopyChars (or Append).
Now that that's cleared up, here is the correct way to write that function (or at least one way to do it):
CStringW ConvertUTF8ToUTF16( __in const CHAR * pszTextUTF8 ) {
    // Allocate return value immediately, so that (N)RVO can be applied
    CStringW strUTF16;
    if ( (pszTextUTF8 == NULL) || (*pszTextUTF8 == '\0') ) {
        return strUTF16;
    }

    // Calculate the required destination buffer size
    int cchUTF16 = ::MultiByteToWideChar( CP_UTF8,
                                          MB_ERR_INVALID_CHARS,
                                          pszTextUTF8,
                                          -1,
                                          nullptr,
                                          0 );
    // Perform error checking
    if ( cchUTF16 == 0 ) {
        throw std::runtime_error( "MultiByteToWideChar failed." );
    }

    // Resize the output string size and use the pointer to the internal buffer
    wchar_t* const pszUTF16 = strUTF16.GetBufferSetLength( cchUTF16 );

    // Perform conversion (return value ignored, since we just checked for success)
    ::MultiByteToWideChar( CP_UTF8,
                           MB_ERR_INVALID_CHARS, // Use identical flags
                           pszTextUTF8,
                           -1,
                           pszUTF16,
                           cchUTF16 );

    // Perform required cleanup
    strUTF16.ReleaseBuffer();

    // Return converted string
    return strUTF16;
}
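For comparison, here is the same size-then-convert MultiByteToWideChar pattern in plain C, without ATL/CString. This is only a hedged sketch: the function name Utf8ToUtf16 and the choice to return a malloc'd buffer the caller must free are my assumptions, not part of the answer above.

#include <windows.h>
#include <stdlib.h>

// Convert a NUL-terminated UTF-8 string to a newly allocated UTF-16 string.
// Returns NULL on failure; the caller frees the result with free().
wchar_t *Utf8ToUtf16( const char *utf8 )
{
    if ( utf8 == NULL )
        return NULL;

    // First call: ask for the required size in wide characters (incl. the NUL).
    int cchUTF16 = MultiByteToWideChar( CP_UTF8, MB_ERR_INVALID_CHARS,
                                        utf8, -1, NULL, 0 );
    if ( cchUTF16 == 0 )
        return NULL;

    wchar_t *utf16 = malloc( cchUTF16 * sizeof *utf16 );
    if ( utf16 == NULL )
        return NULL;

    // Second call: perform the actual conversion into the buffer.
    if ( MultiByteToWideChar( CP_UTF8, MB_ERR_INVALID_CHARS,
                              utf8, -1, utf16, cchUTF16 ) == 0 ) {
        free( utf16 );
        return NULL;
    }
    return utf16;
}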

How do I detect and remove "\n" from a given string in ActionScript?

I have the following code,
public static function clearDelimeters(formattedString:String):String
{
    return formattedString.split("\n").join("").split("\t").join("");
}
The spaces, i.e. "\t", are removed, but the newlines ("\n") are not removed from the formattedString.
I even tried
public static function clearDelimeters(formattedString:String):String
{
    var formattedStringChar:String = "";
    var originalString:String = "";
    var j:int = 0;
    while ((formattedStringChar = formattedString.charAt(j)) != "")
    {
        if (formattedStringChar == "\t" || formattedStringChar == "\n")
        {
            // skip the delimiter character
        }
        else
        {
            originalString = originalString + formattedStringChar;
        }
        j++;
    }
    return originalString;
}
This also didn't work.
The help I'm expecting is the reason why the newline delimiters are not removed, and some way to remove them.
Thank you in anticipation.
There are a few forms a line ending may take: CRLF, CR, LF, LFCR. Possibly your string contains CRLF line endings rather than only LF (\n), so even with all the LFs removed, some text editors will still treat the remaining CRs as line-end characters.
Try this instead:
// this function requires AS3
public static function clearDelimeters(formattedString:String):String {
    return formattedString.replace(/[\u000d\u000a\u0009\u0020]+/g, "");
}
Note that \t is a tab, not a space. Also, if you're working with HTML, <br> and <br/> are used to make line breaks, but they are not line-end characters.
The regexp answer is correct, but I always like the more readable version of it (I don't know how it compares in performance, though):
result = string.split("\r\n").join("");
or do the \n and \r splits separately. CR+LF ('\r\n', 0x0D0A) is the standard line-ending pair on Windows; check Wikipedia to see why those two characters are joined together:
http://en.wikipedia.org/wiki/Newline#Representations
Are you sure it isn't a
<br>
or
</br>?
or
\r?
// try this. it works for me!
function removeNewLinesFrom(This) {
    var nl = '' + newline;
    var removed = '';
    for (var i = 0; i <= (This.length - 1); i++) {
        if (This.charAt(i) != nl) {
            removed += This.charAt(i);
        }
    }
    return removed;
}

// Simplify the name of the function
var rnlf = removeNewLinesFrom;

// Write an example
var example = 'hello ' + newline + 'world';

// Trace the example
trace('prompt=' + rnlf(example));

Unicode, VBScript and HTML

I have the following radio box:
<input type="radio" value="&#39321;">香</input>
As you can see, the value is unicode. It represents the following Chinese character: 香
So far so good.
I have a VBScript that reads the value of that particular radio button and saves it into a variable. When I display the content with a message box, the Chinese Character appears. Additionally I have a variable called uniVal where I assign the unicode of the Chinese character directly:
radioVal = <read value of radio button>
MsgBox radioVal ' yields Chinese character

uniVal = "&#39321;"
MsgBox uniVal   ' yields unicode representation
Is there a possibility to read the radio button value in such a way that the Unicode escape string is preserved and NOT interpreted as the Chinese character?
For sure, I could try to recreate the numeric reference from the character, but the methods I found in VBScript do not work correctly because of VBScript's implicit UTF-16 handling (instead of UTF-8). So the following method does not work correctly for all characters:
Function StringToUnicode(str)
    result = ""
    For x = 1 To Len(str)
        result = result & "&#" & AscW(Mid(str, x, 1)) & ";"
    Next
    StringToUnicode = result
End Function
Cheers
Chris
I got a solution:
In JavaScript I can write a function that actually works:
function convert(value) {
    var tstr = value;
    var bstr = '';
    for (var i = 0; i < tstr.length; i++) {
        if (tstr.charCodeAt(i) > 127) {
            bstr += '&#' + tstr.charCodeAt(i) + ';';
        } else {
            bstr += tstr.charAt(i);
        }
    }
    return bstr;
}
I call this function from my VBScript... :)
Here is a VBScript function that will always return a positive value for the Unicode code point of a given character:-
Function PositiveUnicode(s)
    Dim val : val = AscW(s)
    If (val And &h8000) <> 0 Then
        PositiveUnicode = (val And &h7FFF) + &h8000&
    Else
        PositiveUnicode = CLng(val)
    End If
End Function
This will save you loading two script engines to achieve a simple operation.
"not working correctly due to VBScripts implicit UTF-16 setting (instead of UTF-8)."
This issue has nothing to do with UTF-8. It is purely a result of AscW's use of a signed integer type.
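To make the signed-integer effect concrete, here is a small C aside (illustrative only, not VBScript): a code point at or above 0x8000 stored in a signed 16-bit type reads back negative, and masking it back to 16 bits recovers the positive value, which is what PositiveUnicode does.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    // U+9999 (39321) does not fit in a signed 16-bit integer...
    int16_t signed16 = (int16_t)0x9999;        // reads back as -26215, like AscW
    // ...but masking back to 16 bits recovers the positive code point.
    long codePoint = (long)(uint16_t)signed16; // 39321 again

    printf("signed: %d, recovered: %ld\n", (int)signed16, codePoint);
    return 0;
}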
As to why you have to recreate the &#xxxxx; encodings that you sent: this is a result of how HTML (and XML) work. The numeric character reference is a convenience that the specification does not require to remain intact; since the character encoding of the document is quite capable of representing that character, the DOM is at liberty to convert it.

Trim string to length ignoring HTML

This problem is a challenging one. Our application allows users to post news on the homepage. That news is input via a rich text editor which allows HTML. On the homepage we want to only display a truncated summary of the news item.
For example, here is the full text we are displaying, including HTML
In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.
We want to trim the news item to 250 characters, but exclude the HTML from that count.
The method we are currently using for trimming counts the HTML as well, and this results in some HTML-heavy news posts getting truncated considerably.
For instance, if the above example included tons of HTML, it could potentially look like this:
In an attempt to make a bit more space in the office, kitchen, I've pulled...
This is not what we want.
Does anyone have a way of tokenizing HTML tags in order to maintain position in the string, perform a length check and/or trim on the string, and restore the HTML inside the string at its old location?
Start at the first character of the post, stepping over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter gets to 250 is where you actually want to cut off.
Take note that this will have another problem that you'll have to deal with when an HTML tag is opened but not closed before the cutoff.
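A minimal C sketch of that counting idea (illustrative only; the function name cut_position is mine, it ignores entities and the unclosed-tag caveat above, and it only finds the cut position):

#include <stddef.h>

// Return the index in html at which `limit` visible (non-tag) characters
// have been consumed, or the string length if the text is shorter.
size_t cut_position(const char *html, size_t limit)
{
    size_t count = 0;
    int in_tag = 0;
    size_t i;

    for (i = 0; html[i] != '\0'; i++) {
        if (html[i] == '<') {
            in_tag = 1;            // stop counting inside a tag
        } else if (html[i] == '>') {
            in_tag = 0;
        } else if (!in_tag) {
            if (++count == limit)
                return i + 1;      // cut just after this character
        }
    }
    return i;                      // the whole string fits
}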
Following the 2-state finite state machine suggestion, I've just developed a simple HTML parser for this purpose, in Java:
http://pastebin.com/jCRqiwNH
and here is a test case:
http://pastebin.com/37gCS4tV
And here is the Java code:
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
public class HtmlShortener {
private static final String TAGS_TO_SKIP = "br,hr,img,link";
private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
private static final int STATUS_READY = 0;
private int cutPoint = -1;
private String htmlString = "";
final List<String> tags = new LinkedList<String>();
StringBuilder sb = new StringBuilder("");
StringBuilder tagSb = new StringBuilder("");
int charCount = 0;
int status = STATUS_READY;
public HtmlShortener(String htmlString, int cutPoint){
this.cutPoint = cutPoint;
this.htmlString = htmlString;
}
public String cut(){
// reset
tags.clear();
sb = new StringBuilder("");
tagSb = new StringBuilder("");
charCount = 0;
status = STATUS_READY;
String tag = "";
if (cutPoint < 0){
return htmlString;
}
if (null != htmlString){
if (cutPoint == 0){
return "";
}
for (int i = 0; i < htmlString.length(); i++){
String strC = htmlString.substring(i, i+1);
if (strC.equals("<")){
// new tag or tag closure
// previous tag reset
tagSb = new StringBuilder("");
tag = "";
// find tag type and name
for (int k = i; k < htmlString.length(); k++){
String tagC = htmlString.substring(k, k+1);
tagSb.append(tagC);
if (tagC.equals(">")){
tag = getTag(tagSb.toString());
if (tag.startsWith("/")){
// closure
if (!isToSkip(tag)){
sb.append("</").append(tags.get(tags.size() - 1)).append(">");
tags.remove((tags.size() - 1));
}
} else {
// new tag
sb.append(tagSb.toString());
if (!isToSkip(tag)){
tags.add(tag);
}
}
i = k;
break;
}
}
} else {
sb.append(strC);
charCount++;
}
// cut check
if (charCount >= cutPoint){
// close previously open tags
Collections.reverse(tags);
for (String t : tags){
sb.append("</").append(t).append(">");
}
break;
}
}
return sb.toString();
} else {
return null;
}
}
private boolean isToSkip(String tag) {
if (tag.startsWith("/")){
tag = tag.substring(1, tag.length());
}
for (String tagToSkip : tagsToSkip){
if (tagToSkip.equals(tag)){
return true;
}
}
return false;
}
private String getTag(String tagString) {
if (tagString.contains(" ")){
// tag with attributes
return tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(" "));
} else {
// simple tag
return tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
}
}
}
You can try the following npm package:
trim-html
It cuts off the text inside HTML tags once the limit is reached, preserves the original HTML structure, removes HTML tags after the limit, and closes any opened tags.
If I understand the problem correctly, you want to keep the HTML formatting, but you don't want it to count toward the length of the string you are keeping.
You can accomplish this with code that implements a simple finite state machine.
2 states: InTag, OutOfTag
InTag:
- Goes to OutOfTag if a > character is encountered
- Goes to itself if any other character is encountered
OutOfTag:
- Goes to InTag if a < character is encountered
- Goes to itself if any other character is encountered
Your starting state will be OutOfTag.
You implement a finite state machine by processing one character at a time. The processing of each character brings you to a new state.
As you run your text through the finite state machine, you also want to keep an output buffer and a length-so-far variable (so you know when to stop).
Increment your length variable each time you are in the OutOfTag state and process another character. You can optionally skip incrementing it for whitespace characters.
You end the algorithm when you run out of characters or you reach the desired length.
In your output buffer, include the characters you encounter up until that length.
Keep a stack of unclosed tags. When you reach the length, add an end tag for each element on the stack. As you run through the algorithm you can track the current tag by keeping a current_tag variable: it is started when you enter the InTag state and ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the InTag state). If you get a start tag, push it onto the stack; if you get an end tag, pop it from the stack.
Here's the implementation that I came up with, in C#:
public static string TrimToLength(string input, int length)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;

    if (input.Length <= length)
        return input;

    bool inTag = false;
    int targetLength = 0;

    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];

        if (c == '>')
        {
            inTag = false;
            continue;
        }
        if (c == '<')
        {
            inTag = true;
            continue;
        }
        if (inTag || char.IsWhiteSpace(c))
        {
            continue;
        }

        targetLength++;
        if (targetLength == length)
        {
            return ConvertToXhtml(input.Substring(0, i + 1));
        }
    }
    return input;
}
And a few unit tests I used via TDD:
[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}
[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}
[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I";
Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}
[Test]
public void Html_TrimWellFormedHtml()
{
string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
"In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
"</div>";
string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";
Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}
[Test]
public void Html_TrimMalformedHtml()
{
string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
"In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";
string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";
Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}
I'm aware this is quite a bit after the posted date, but I had a similar issue and this is how I ended up solving it. My concern would be the speed of regex versus iterating through an array.
Also, if you have a space before an HTML tag and one after it, this doesn't fix that.
private string HtmlTrimmer(string input, int len)
{
if (string.IsNullOrEmpty(input))
return string.Empty;
if (input.Length <= len)
return input;
// this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
string inputCopy;
string tag;
string result = "";
int strLen = 0;
int strMarker = 0;
int inputLength = input.Length;
Stack stack = new Stack(10);
Regex text = new Regex("^[^<&]+");
Regex singleUseTag = new Regex("^<[^>]*?/>");
Regex specChar = new Regex("^&[^;]*?;");
Regex htmlTag = new Regex("^<.*?>");
while (strLen < len)
{
inputCopy = input.Substring(strMarker);
//If the marker is at the end of the string OR
//the sum of the remaining characters and those analyzed is less then the maxlength
if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
break;
//Match regular text
result += text.Match(inputCopy,0,len-strLen);
strLen += result.Length - strMarker;
strMarker = result.Length;
inputCopy = input.Substring(strMarker);
if (singleUseTag.IsMatch(inputCopy))
result += singleUseTag.Match(inputCopy);
else if (specChar.IsMatch(inputCopy))
{
//think of as 1 character instead of 5
result += specChar.Match(inputCopy);
++strLen;
}
else if (htmlTag.IsMatch(inputCopy))
{
tag = htmlTag.Match(inputCopy).ToString();
//This only works if this is valid Markup...
if(tag[1]=='/') //Closing tag
stack.Pop();
else //not a closing tag
stack.Push(tag);
result += tag;
}
else //Bad syntax
result += input[strMarker];
strMarker = result.Length;
}
while (stack.Count > 0)
{
tag = stack.Pop().ToString();
result += tag.Insert(1, "/");
}
if (strLen == len)
result += "...";
return result;
}
Wouldn't the fastest way be to use jQuery's text() method?
For example:
<ul>
  <li>One</li>
  <li>Two</li>
  <li>Three</li>
</ul>
var text = $('ul').text();
Would give the value OneTwoThree in the text variable. This would allow you to get the actual length of the text without the HTML included.