Find and Replace and a WYSIWYG Editor - html

My problem is as follows:
I have a column: ProductName. Now, the text entered here is entered from tinyMCE so it has all kinds of tags. The user wants to be able to do a Find-And-Replace on all products, and it has to support coloring.
For example - let's say this is a portion of a ProductName:
other text.. <strong>text text <font color="#ff6600">colortext®</font></strong> ..other text
Now, the user wants to replace this:
<font color="#ff6600">colortext®</font>
The original name has the <strong> tag in it so it appears bold. So the user makes it bold, and now the text he is searching for is:
<strong><font color="#ff6600">colortext®</font></strong>
Obviously I'm not going to find it. Plus there's the matter of spaces: in one place it has a space, in another it doesn't.
Is there a way to overcome this?

Strip the HTML tags from the search text and do a plain text search first. Then, part by part (i.e., text node by text node), take the element path of the search text's parts, and compare these with their counterparts in the found text. If the paths for all parts match, you're done.
Edit: By path, I meant something similar to XPath, or the path notion of the TinyMCE editor. Example: the plain text part of the search text is "colortext®". The path of this text node in the search text is <strong>/<font color="#ff6600">. Search for the same plain text in the text body (trivial), and take its path, which is also <strong>/<font color="#ff6600">. (Compare this with the path of "other text..", which is /, and of "text text", which is <strong>.) The two paths are the same, so this is a real match. If you have a DOM tree representation, determining the path shouldn't be difficult.
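The path comparison described above could be sketched in Python roughly as follows. This is only an illustration built on the standard library's html.parser; the names TextPathCollector, text_paths and find_matches are made up for this example, and it assumes the search text contains exactly one text node.

```python
from html.parser import HTMLParser


class TextPathCollector(HTMLParser):
    """Collect (text, path) pairs, where path is the tuple of open tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (text, path) for every non-blank text node

    def handle_starttag(self, tag, attrs):
        # Sort attributes so their order does not affect the comparison.
        self.stack.append((tag, tuple(sorted(attrs))))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates sloppy markup).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.nodes.append((data.strip(), tuple(self.stack)))


def text_paths(html):
    parser = TextPathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def find_matches(body_html, search_html):
    """Return body text nodes whose text and path both match the search node."""
    needle_text, needle_path = text_paths(search_html)[0]
    return [(text, path) for text, path in text_paths(body_html)
            if needle_text in text and path == needle_path]
```

With the ProductName fragment from the question, searching for the <strong><font color="#ff6600"> version of "colortext®" gives one match, while searching for the bare <font color="#ff6600"> version gives none, because the paths differ.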

You're asking for several related, but discrete, abilities:
Search and Replace content
Search and Replace formatting
Search and Replace similar (ie, ignore trivial differences in whitespace)
You should take this in steps. Otherwise it becomes overwhelming, and a single search algorithm won't be able to do all three without intense effort and difficult-to-maintain code.
First, look at the similar problem. Make a search that ignores spaces and case. You might want to get into Lucene or another search engine technology if you also need to deal with "bowl" vs "bowls" and "intelligent" vs "smart" - though I expect this is beyond your current needs.
Once you have that working, it becomes one layer in your stack of searches.
Second, look at the formatting search. This is typically done using tokens or tags, which you already have in the form of HTML. However, you have to be able to deal with things out of sequence, so <b><i>text</i></b> needs to be caught in a search for <i><b>text</b></i>, as does the malformed representation where tags aren't nested properly, such as <b><i>text</b></i>.
One method is to pre-parse the string and apply the formatting styles to each character. So you'd have a t that's bold and italic, an e that's bold and italic, etc. To make this easier and faster, use a hash to represent the style combination: read the first character and figure out what style it has (keep track of this by turning styles on and off as you find tags). If that style combination already exists in the hash, assign its hash number to the letter. If it doesn't, get a new hash number and assign that.
Now you can compare the letter and its style hash against your search and get format and content matches. Stack that on top of your similar match and you have what you need.
-Adam
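A minimal Python sketch of the per-character style idea, assuming the standard library's html.parser is acceptable. The names StyledChars, styled and contains are invented for this illustration, and the sketch only looks at tag names (attributes such as a font color are ignored):

```python
from html.parser import HTMLParser


class StyledChars(HTMLParser):
    """Flatten HTML into (character, style_id) pairs."""

    def __init__(self, style_ids):
        super().__init__(convert_charrefs=True)
        self.style_ids = style_ids   # shared dict: frozenset of tags -> small int
        self.active = []             # tags currently switched on
        self.chars = []

    def handle_starttag(self, tag, attrs):
        self.active.append(tag)

    def handle_endtag(self, tag):
        # Switch the style off wherever it sits, so misnested tags still work.
        if tag in self.active:
            self.active.remove(tag)

    def handle_data(self, data):
        key = frozenset(self.active)
        # Reuse the number for a known style combination, else hand out a new one.
        style = self.style_ids.setdefault(key, len(self.style_ids))
        self.chars.extend((ch, style) for ch in data)


def styled(html, style_ids):
    parser = StyledChars(style_ids)
    parser.feed(html)
    parser.close()
    return parser.chars


def contains(body, needle):
    """True if the styled needle occurs as a contiguous run in the styled body."""
    n = len(needle)
    return any(body[i:i + n] == needle for i in range(len(body) - n + 1))
```

Because the style key is a set of tag names, <b><i>text</i></b>, <i><b>text</b></i> and the misnested <b><i>text</b></i> all produce the same pairs, as long as the same style_ids dict is passed for the body and the search text.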

If it's valid XML, an XSLT would be trivial for this kind of exercise.
Use an identity template, and then add an XPath to find the specific node you want:
<xsl:template match="//strong/font">
<xsl:copy>
<!-- Insert the replacement text here -->
</xsl:copy>
</xsl:template>
When working with XML, this would be a maintainable, extensible solution.

I'm not sure I understood everything you said, but regular expressions seem like a good way to overcome the problem you're describing.

Related

Markup-less way to title and link abbreviations/acronyms to glossary entries

Background: I'm writing a DocBook 5 document (and including in it some already-written text) with the intention of generating HTML from it. I would like to get the semantic markup correct from the beginning so I don't need to re-do it later, but the standard way does not seem to generate what I'm looking for, so I'm not sure if I should deviate from it or not, depending on what is possible with XSL.
Current setup: My glossary only has abbreviated items. It consists of glossentrys each containing a full-spelling glossterm and some non-zero amount of acronyms and/or abbrevs. I suppose it could all just be glossterms instead. It doesn't matter to me. Suppose for example I have this:
<glossentry xml:id="ff"><glossterm>Firefox</glossterm>
<acronym>FF</acronym>
<glossdef>
<para>The web browser made by Mozilla.</para>
</glossdef>
</glossentry>
Ideal
Suppose wherever I want to refer to Firefox, I put FF in the text. Ideally, without any additional markup, wherever "FF" (case sensitive) appeared as a plain whole word in a paragraph (or title, but not, for example, code or programlisting or inside a URL inside an attribute...) in my DocBook file, it would come out in HTML as the text "FF" but marked up as a link to the glossary entry, but not with the standard link CSS, and furthermore with a title attribute having the value Firefox. That way a reader can hover to get the acronym/abbreviation spelled out for them, and if that is insufficient, they can click to be taken to a fuller definition. Meanwhile I would style it black and underlined, so that they know this feature is there, but it doesn't distract one's attention like a normal link does, especially with how often it occurs in the text.
Main question: is such replacement of plaintext, markup-less terms even possible in XSL (without creating something like the Scunthorpe problem)? If so, can it do this for every acronym or abbrev found in the glossary, automatically?
I could not figure out how to do this directly, but that is still my goal. Meanwhile I've tried other things:
Approach 1
Set up a keyboard macro so I can type ff and have that be transformed while I'm typing into <xref linkend="ff"/>.
Pros:
links to the glossary
spells out the abbreviation
Cons:
spells out the abbreviation (it would be nice to keep it short to read, not just short to type)...workaround: make the acronym into a glossterm and put it first in the glossentry (loss of semantics, but maybe that's OK here?)
links to the glossary (I would like it styled differently)...solution: CSS for a.xref
even with the above two worked around, the title comes out as FFFirefox instead of just Firefox (and with others that have more than one synonym, the mashing-together continues)...solution: put an alternate xml:id on your preferred acronym/abbrev, and then make the links in it refer to that id in their endterm attribute (as well as the linkend referring to the first id)
I have to remember to use the keyboard macro rather than just typing and letting the system do the work; any imported text then has to have text replacements done on it for each glossary entry
Approach 2
Using <xsl:param name="glossterm.auto.link" select="1"/> and a keyboard macro to change FF into <glossterm>FF</glossterm>.
Pros:
links to the glossary
Cons:
spells out the abbreviation (it would be nice to keep it short to read, not just short to type)...workaround: make the acronym into a glossterm and put it first in the glossentry (loss of semantics, but maybe that's OK here?)
links to the glossary (I would like it styled differently)...solution: CSS for .glossterm
even with the above two worked around, no title attribute is given
I have to remember to use the keyboard macro rather than just typing and letting the system do the work; any imported text then has to have text replacements done on it for each glossary entry
Approach 1 so far seems better after the workarounds, but is there a way to achieve the ideal I outlined?

What's the rationale for not allowing multiline placeholders in HTML5?

I'm creating a very simple form that has a text area. The text area takes in a formatted block of names separated by newlines. To make the application slightly more useable, it would be nice if I could include a placeholder example that had multiple lines of text. Unfortunately, that doesn't seem to be possible with the HTML5 specification. Does anybody know why?
The placeholder attribute is like <blockquote> to me. It has a specific niche.
The placeholder attribute is mainly used in one-line form fields, not text areas.
How often do you use a carriage return in a one-line form field? Never.
The placeholder attribute represents a short hint (a word or short phrase) intended to aid the user with data entry. A hint could be a sample value or a brief description of the expected format. The attribute, if specified, must have a value that contains no U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR) characters.
Since HTML5 is still fresh and new, and continues to be optimized and tweaked in various browsers, who knows what crazy things would happen across browsers if the placeholder attribute didn't have such strict guidelines?
The web seems to be moving in a direction that helps designers and developers type less code and make fewer mistakes.
I've seen a few posts (by people like Paul Irish and Jeffrey Way) talking about omitting things like closing tags, and many standard elements have been modified in HTML5 to be shorter/easier (e.g. <!doctype html>). Also, attributes that used to be required to make a webpage function well can now be thrown out altogether. The web is getting simpler, and more complex at the same time.
All in all though, if you want something to fix the dilemma (that you are seemingly suffering from, by the tone of your question), then just use the title attribute instead. Refer to the selected answer in the question located at the following link:
Can you have multiline HTML5 placeholder text in a <textarea>?

Find spaces in anchor links

We've got a large amount of static HTML that has links like, e.g.:
Link
However some of them contain spaces in the anchor e.g.
Link
Any ideas on what kind of regular expression I'd need to use to find the spaces after the # and replace them with a - or _?
Update: Just need to find them using TextMate, hence no need for a HTML parsing lib.
This regex should do it:
#[a-zA-Z]+\s+[a-zA-Z\s]+
Three Caveats.
First, if you are afraid that the page text itself (and not just the links) might contain information like "#hashtag more words", then you could make the regex more restrictive, like this:
#[a-zA-Z]+\s+[a-zA-Z\s]+\">
Second, if you have hash tags that contain characters beyond A-Z, then just add them in between the second set of brackets. So, if you have '-' as well, you would modify to:
#[a-zA-Z]+\s+[a-zA-Z\s-]+\">
Finally, this assumes that all the links you are trying to match start with a letter/word and are followed by a space, so, in the current form, it would not match "Anchor-tags-galore", but would match "Anchor tags galore."
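If the replacement ever needs to be scripted rather than done by hand in TextMate, a rough Python sketch might look like the following. The pattern is a different, somewhat stricter one than the regexes above: it assumes double-quoted href attributes and only touches the part after the #. The names HREF_WITH_SPACED_FRAGMENT and fix_fragments, and the page.html file name in the usage note, are made up for the example.

```python
import re

# href="...#fragment" where the fragment contains at least one whitespace character.
HREF_WITH_SPACED_FRAGMENT = re.compile(r'(href="[^"#]*#)([^"]*\s[^"]*)(")')


def fix_fragments(html, replacement="-"):
    """Replace each run of whitespace inside a link fragment with `replacement`."""
    def repl(match):
        fragment = re.sub(r"\s+", replacement, match.group(2))
        return match.group(1) + fragment + match.group(3)

    return HREF_WITH_SPACED_FRAGMENT.sub(repl, html)
```

For instance, fix_fragments('<a href="page.html#Anchor tags galore">Link</a>') should give '<a href="page.html#Anchor-tags-galore">Link</a>', while page text such as "#hashtag more words" outside an href is left alone.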
Have you considered using an HTML parsing library like BeautifulSoup? It would make finding all the hrefs much easier!
Here, this regex matches the hash and all the words and spaces in between:
#(\w+\s)+\w+
When you have some time, you should download "The Regex Coach", which is an awesome tool to develop your own regexes. You get instant feedback and you learn very fast. Plus it comes at no cost!

How do you handle translation of text with markup?

I'm developing multi-language support for our web app. We're using Django's helpers around the gettext library. Everything has been surprisingly easy, except for the question of how to handle sentences that include significant HTML markup. Here's a simple example:
Please log in to continue.
Here are the approaches I can think of:
1. Change the link to include the whole sentence. Regardless of whether the change is a good idea in this case, the problem with this solution is that UI becomes dependent on the needs of i18n when the two are ideally independent.
2. Mark the whole string above for translation (formatting included). The translation strings would then also include the HTML directly. The problem with this is that changing the HTML formatting requires changing all the translations.
3. Tightly couple multiple translations, then use string interpolation to combine them. For the example, the phrase "Please %s to continue" and "log in" could be marked separately for translation, then combined. The "log in" is localized, then wrapped in the HREF, then inserted into the translated phrase, which keeps the %s in translation to mark where the link should go. This approach complicates the code and breaks the independence of translation strings.
Are there any other options? How have others solved this problem?
Solution 2 is what you want. Send them the whole sentence, with the HTML markup embedded.
Reasons:
The predominant translation tool, Trados, can preserve the markup from inadvertent corruption by a translator.
Trados can also auto-translate text that it has seen before, even if the content of the tags has changed (but the number of tags and their position in the sentence are the same). At the very least, the translator will give you a good discount.
Styling is locale-specific. In some cases, bold will be inappropriate in Chinese or Japanese, and italics are less commonly used in East Asian languages, for example. The translator should have the freedom to either keep or remove the styles.
Word order is language-specific. If you were to segment the above sentence into fragments, it might work for English and French, but in Chinese or Japanese the word order would not be correct when you concatenate. For this reason, it is best i18n practice to externalize entire sentences, not sentence fragments.
2, with a potential twist.
You certainly could localize the whole string, like:
loginLink=Please log in to continue
However, depending on your tooling and your localization group, they might prefer for you to do something like:
// tokens in this string add html links
loginLink=Please {0}log in{1} to continue
That would be my preferred method. You could use a different substitution pattern if you have localization tooling that ignores certain characters. E.g.
loginLink=Please %startlink%log in%endlink% to continue
Then perform the substitution in your jsp, servlet, or equivalent for whatever language you're using ...
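As a rough illustration of the {0}/{1} token approach in Python (rather than a JSP or servlet), the substitution could look like this. The MESSAGES dict and the login_prompt function are hypothetical stand-ins for a real message catalog and view code; the Norwegian string is only there to show that the tokens can move freely within the sentence.

```python
from html import escape

# Hypothetical message catalog; in a real project these strings would come
# from the translation files, with {0} and {1} marking the link boundaries.
MESSAGES = {
    "en": "Please {0}log in{1} to continue",
    "no": "{0}Logg inn{1} for å fortsette",
}


def login_prompt(locale, url):
    """Insert the link markup into the translated sentence."""
    start = '<a href="{}">'.format(escape(url, quote=True))
    return MESSAGES[locale].format(start, "</a>")
```

The translator sees the whole sentence and can place the tokens wherever the target language needs them, while the URL and the exact link markup stay in code.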
Disclaimer: I am not experienced in internationalization of software myself.
1. I don't think this would be good in any case - just introduces too much coupling …
2. As long as you keep formatting sparse in the parts which need to be translated, this could be okay. Giving translators the possibility to give special words importance (by either making them a link or probably using <strong /> emphasis) sounds like a good idea. However, those translations with (X)HTML possibly cannot be used anywhere else easily.
3. This sounds like unnecessary work to me …
If it were me, I think I would go with the second approach, but I would put the URI into a formatting parameter, so that this can be changed without having to change all those translations.
Please log in to continue.
You should keep in mind that you may need to teach your translators a basic knowledge of (X)HTML if you go with this approach, so that they do not screw up your markup and so that they know what to expect from that text they write. Anyhow, this additional knowledge might lead to a better semantic markup, because, as mentioned above, texts could be translated and annotated with (X)HTML to reflect local writing style.
Whatever you do, keep the whole sentence as one string. You need to understand the whole sentence to translate it correctly.
Not all words should be translated in all languages: e.g. in Norwegian one doesn't use "please" (we can say "vær så snill", literally "be so kind", but when used as a command it sounds too forceful), so the correct Norwegian would be:
"Logg inn for å fortsette" lit.: "Log in to continue" or
"Fortsett ved å logge inn" lit.: "Continue by to log in" etc.
You must allow completely changing the order, e.g. in a fictional demo language:
"Für kontinuer Loggen bitte ins" (if it was real) lit.: "To continue log please in"
Some languages may even have one single word for (most of) this sentence too...
I'd recommend solution 1, or possibly "Please %{startlink}log in%{endlink} to continue". This way the translator can make the whole sentence a link if that's more natural, and it can be completely restructured.
Interesting question, I'll be having this problem very soon. I think I'll go for 2, without any kind of tricky stuff. HTML markup is simple, URLs won't move anytime soon, and if anything is changed a new entry will be created in django.po, so we get a chance to review the translation (e.g. a script should check for empty translations after makemessages).
So, in template :
{% load i18n %}
{% trans 'hello world' %}
... then, after python manage.py makemessages I get in my django.po
#: templates/out.html:3
msgid "hello world"
msgstr ""
I change it to my needs
#: templates/out.html:3
msgid "hello world"
msgstr "bonjour monde"
... and in the simple yet frequent cases I'll encounter, it won't be worth any further trouble. The other solutions here seem quite smart, but I don't think the solution to markup problems is more markup. Plus, I want to avoid too much confusing stuff inside templates.
Your templates should be quite stable after a while, I guess, but I don't know what other trouble you expect. If the content changes over and over, perhaps that content's place is not inside the template but inside a model.
Edit: I just checked it out in the documentation, if you ever need variables inside a translation, there is blocktrans.
Makes no sense, how would you translate "log in"?
I don't think many translators have experience with HTML (the regular non-HTML-aware translators would be cheaper)
I would go with option 3, or use "Please %slog in%s to continue" and replace the %s with parts of the link.

Regex to match attributes in HTML?

I have a txt file which actually is a html source of some webpage.
Inside that txt file there are various strings preceded by a "title=" tag.
e.g.
<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>
I am interested in getting the text Connectivity Framework extracted and written to a separate file.
There are many such tags, each having a different text after the title, as in title='some text here which I need to extract'
I want to extract all such instances of the text from the html source/txt file and write them to a separate txt file. The text can contain lower case letters, upper case letters and numbers only. The length of each text string (in characters) will vary.
I am using PowerGrep for Windows. PowerGrep allows me to search a text file with regular expression input.
I tried using the search as
title='[a-zA-Z0-9]
It shows the correct matches, but it matches only the first character of each string and writes only that first character to the second txt file, not the whole string.
I want the whole string to be matched and written to the second file.
What is the correct regular expression or way to do what I want to do, using PowerGrep?
-AD.
I'm just not sure how many times the question of regular expression parsing of HTML files has to be asked (and answered with the correct solution of "use a DOM parser"). It comes up every day.
The difficulties are:
In HTML attributes can have single-quotes, double-quotes or even no quotes;
Similar strings can appear in the HTML document itself;
You have to handle correct escaping; and
Malformed HTML (decent parsers are extremely robust to common errors).
So if you cater for all this (and it gets to be a pretty complicated yet still imperfect regex), it's still not 100%.
HTML parsers exist for a reason. Use them.
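If scripting is an option alongside PowerGrep, a small sketch with Python's standard html.parser shows what "use a parser" can look like for this case. The names TitleCollector and extract_titles are invented for the example:

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the value of every title attribute, whatever the quoting style."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed
        # and character references such as &amp; already decoded.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


def extract_titles(html):
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles
```

The parser deals with single quotes, double quotes, unquoted values and escaping, so the caller only has to write the collected strings to the output file.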
I'm not familiar with PowerGrep, however, your regex is incomplete. Try this:
title='[a-zA-Z0-9 ]*'
or better yet:
title='([^']*)'
The other answers all give correct changes to the regex, so I'll explain what the issue was with your original.
The square brackets indicate a character class - meaning that the regex will match any character within those brackets. However, like everything else, it will only match it once by default. Just as the regex "s" would match only the first character in "ssss", the regex "[a-zA-Z0-9]" will match only the first character in "Connectivity Framework".
By adding repetition, one can get that character class to match repeatedly. The easiest way to do this is by adding an asterisk after it (which will match 0 or more occurrences). Thus the regex "[a-zA-Z0-9]*" will match as many characters in a row until it hits a character that is not in that character class (in your case, the space character, since you didn't include that in your brackets).
Describing a syntax accurately with a regex can get pretty complex, though: what if someone put a non-alphanumeric character such as an ampersand within the attribute? You could try to capture all input between the quotes by making the character set "anything except a quote character", so "'[^']*'" would usually do the right thing. Often you need to bear in mind escaping as well (e.g. with a string 'Mary\'s lamb' you do actually want to capture the apostrophe in the middle, so a simple "everything but apostrophes" character set won't cut it), though thankfully this is not an issue with XML/HTML according to the specs.
Still, if there is an existing library available that will do the extraction for you, this is likely to be faster and more correct than rolling your own, so I would lean towards that if possible.
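The difference between the three patterns discussed above can be checked quickly in Python's re module (PowerGrep's regex flavor may differ in details, so treat this as an illustration only). The variable names are made up for the example:

```python
import re

line = "<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>"

# Without a quantifier, the character class matches exactly one character.
one_char = re.search(r"title='[a-zA-Z0-9]", line).group(0)

# With *, the class repeats but stops at the space, which is not in the class.
until_space = re.search(r"title='[a-zA-Z0-9]*", line).group(0)

# "Anything except a quote" captures the whole attribute value.
whole_value = re.search(r"title='([^']*)'", line).group(1)

print(one_char)      # title='C
print(until_space)   # title='Connectivity
print(whole_value)   # Connectivity Framework
```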
I would use this regular expression to get the title attribute values
<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)
Note that this regex matches the attribute value expression with quotes. So you have to remove them if needed.
Here's the regex you need
title='([a-zA-Z0-9 ]+)'
but if you're going to be doing a lot more stuff like this, using a parser might make it much more robust and useful.
Try this instead:
title=\'[a-zA-Z0-9 ]*\'