I'm developing an application using GWT (first-timer) and am now at the stage where I want to establish a central structure to provide actual text-based content to my views.
Even though it's obviously possible to define those text-values inline (using UiBinder or calling the appropriate methods on the corresponding objects), I'd be much more comfortable storing them in a central place as is possible using GWT's Constants. Indeed, my application will only be available in one language (for now), so all-the-way i18n may seem overkill, but I'm assuming that those facilities might be best-suited for what I require, seeing how they, too, must have been designed with providing all (constant) text content in mind.
However, my application features several passages of text that are somewhat longer and more complex than your average label text, meaning they could span several lines and might require basic text formatting. I have come up with several ideas on how to fix those issues, but I'm far from satisfied.
First problem: Lengthy string values.
import com.google.gwt.i18n.client.Constants;
public interface AppConstants extends Constants {
@Constants.DefaultStringValue("User Administration")
String userAdministrationTitle();
// ...
}
The sample above contains a very simple string value, defined in the manner that static string internationalization dictates (as far as I know). To add support for another language, say, German, one would provide a .properties file containing the translation:
userAdministrationTitle = Benutzeradministration
Now, one could easily abuse this pattern to a point and never provide a DefaultStringValue, leaving an empty string instead. Then, one could create a .properties file for the default language and add text like one would with a translation. Even then, however, it is (to my knowledge) not possible to apply line-breaks for long values simply to keep the file somewhat well-formatted, like this:
aVeryLongText = This is a really long text that describes some features of the
application in enough detail to allow the user to act on a basis
of information rather than guesswork.
Second problem: Formatting parts of the text.
Since the values are plain strings, there isn't much room for formatting there. Instinctively, I would do the same thing as I would if I were writing the text straight into the regular HTML document and add HTML-tags like <strong> or <em>.
Further down the road, at the point where the strings are read and applied to the widget that's going to display them, there is a problem though: setting the value using a method like setText(String) causes that string to be escaped and the HTML-tags to be printed alongside the rest of the text rather than to be interpreted as formatting instructions. So no luck.
A way to solve this would be to dissect the string provided by the i18n file and isolate any HTML-tags, then bake the mess together again using a SafeHtmlBuilder and use that to set the value of the widget, which would indeed result in formatted text being displayed. That sounds like quite an overcomplication though, so I don't much like that idea.
So what am I looking for now, dear user who actually read this all the way through (thanks!)? I'm looking for solutions that don't require hacks like the ones described above and provide the functionality that I'm looking for. Alternatively, I welcome any guidance if I'm on the wrong path entirely (GWT first-timer, as I mentioned one eternity ago :-) ). Or basically anything that's on topic and might help find a solution. An acceptable solution, for example, would be a system like the string value files used in Android development (which allows for HTML-styling the texts but obviously requires the containing UI elements to accept that).
Fortunately, there is a standard solution that you can use. First, you need to create a ClientBundle:
public interface HelpResources extends ClientBundle {
    public static final HelpResources INSTANCE = GWT.create(HelpResources.class);

    @Source("account.html")
    public ExternalTextResource account();

    @Source("organization.html")
    public ExternalTextResource organization();
}
You need to put this bundle into its own package. Then you add HTML files to the same package - one for each language:
account.html
account_es.html
organization.html
organization_es.html
Now, when you need to use it, you do:
private HelpResources help = GWT.create(HelpResources.class);
...
try {
    help.account().getText(new ResourceCallback<TextResource>() {
        @Override
        public void onError(ResourceException e) {
            // show error message
        }
        @Override
        public void onSuccess(TextResource r) {
            String text = r.getText();
            // Pass this text to HTML widget
        }
    });
} catch (ResourceException e) {
    e.printStackTrace();
}
You need to use HTML widget to display this text if it contains HTML tags.
If you're using UiBinder, i18n support is built-in. Otherwise, use Messages and Constants and use the value with setHTML rather than setText.
For long lines, you should be able to use multiline values in properties files by ending lines with a backslash.
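The backslash continuation can be tried out with plain java.util.Properties, outside of GWT. This is only a sketch: the class and method names are made up for illustration, and it assumes that GWT's own .properties parsing follows the same continuation convention as the standard Java loader.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class MultilineProperties {
    // Parses properties-file text and returns the value stored under the given key.
    static String valueOf(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // A trailing backslash continues the value on the next line;
        // leading whitespace on the continuation line is discarded.
        String text = String.join("\n",
            "aVeryLongText = This is a really long text that describes some features of the \\",
            "    application in enough detail to allow the user to act on a basis \\",
            "    of information rather than guesswork.");
        System.out.println(valueOf(text, "aVeryLongText"));
    }
}
```

Note the space before each backslash: whitespace at the end of a continued line is kept, while the indentation of the following line is dropped, so the words stay separated by exactly one space.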
Related
The web application I am working on uses resource strings for localization. The issue I am having is with styling certain parts of these strings. Let's say I want to display this string:
user1234 created a new document.
So in the resource file it would be localized like so:
{username} created a new document.
The issue is I also need <b></b> tags around {username}. I can't put these tags in the html file because I need it to apply just to the username, not to the whole localized string. So unless I split up the string into two localized strings (which I should definitely not do, because other languages do not necessarily have the same sentence structure), I have to put these html tags in the localized string itself:
<b>{username}</b> created a new document.
Even if we disregard best practices for a moment (of which I have read briefly) and go with this, this solution isn't working for me. I believe this is because the application is using Polymer (this seems to work with Angular). So if we stick by the following two requirements:
Use Polymer
Have the whole string together as one resource string
then there doesn't seem to be a way to style certain parts of the string. Does anyone know a solution?
I got it to work by setting the resource string to the inner HTML of the element which contains the string. So let's say the div containing the text has id="textElem", in the Javascript I set the inner HTML like so:
this.$.textElem.innerHTML = this.localize('user_created_document', 'username', this.username)
I suppose I should have specified in the question that my previous attempts of setting the string were just (a) simply binding the string to the property of an object and referencing that in the HTML, and (b) localizing the string directly in the HTML, neither of which worked.
As the title says, I have a special requirement: I need to extract text line by line, or block by block between BT and ET.
Below is the PDF content. I tried the PDFTextStripper class, but it does not give me what I want.
Does anyone have a solution to this problem?
I want to parse this: [ (=\324Z\016) ] TJ.......
this is my pdf: https://dl.dropboxusercontent.com/u/63353043/docu.pdf
Below is my code:
List<Object> tokens = pages.get(0).getContents().getStream().getStreamTokens();
tokens.forEach(s->{
    if(s instanceof COSString){
        System.out.print(s.toString());
    }
});
But I get this:
COSString{&Ð}COSString{O6}COSString{&³}COSString{p»}COSString{6±}COSString{˛¨}COSString{+^}COSString{+·}COSString{˚©}COSString{9}COSString{O©}COSString{en}COSString{˛¨}COSString{Fœ}COSString{0ł}COSString{Q¯}COSString{#”}COSString{˛¨}COSString{+^}COSString{(Ï}COSString{˚©}COSString{9}COSString{O©}COSString{en}COSString{Zo}COSString{#°}COSString{˜}COSString{p»}COSString{#Š}COSString{5×}COSString{,
}COSString{:É}COSString{(Ù}COSString{4ÿ}COSString{ä}COSString{_Á}COSString{˛¨}COSString{:É}COSString{p»}COSString{O©}COSString{en}COSString{#p}COSString{/F}COSString{O©}COSString{en}COSString{F,}COSString{_N}COSString{!}COSString{9»}COSString{]˘}COSString{!¢}COSString{˜.}COSString{p»}COSString{#°}COSString{˜}COSString{#p}COSString{<:}COSString{Zo}COSString{1¸}COSString{ä}COSString{˚~}COSString{F³}COSString{!Ø}COSString{]Š}COSString{2}COSString{6±}COSString{˛¨}COSString{gî}COSString{+·}COSString{9á}COSString{XS}COSString{hP}COSString{h[}COSString{˜º}COSString{˚.}COSString{p»}COSString{5d}COSString{5×}COSString{]˘}COSString{_ö}COSString{#c}COSString{2˚}COSString{]˜}COSString{+·}COSString{9á}COSString{p»}COSString{#b}COSString{˚.}COSString{eÚ}COSString{;
}COSString{!5}COSString{:É}COSString{XS}COSString{hP}COSString{h[}COSString{˜º}COSString{p»}COSString{B!}COSString{&Ø}COSString{,}COSString{/F}COSString{^r}COSString{˛²}COSString{2&}COSString{˜.}COSString{N}COSString{*ø}COSString{˜.}COSString{&¢}COSString{+B}COSString{+·}COSString{9á}COSString{ä}COSString{ZX}COSString{˚}COSString{˛µ}COSString{6«}COSString{0ł}COSString{!Ø}COSString{p»}COSString{O©}COSString{en}COSString{F,}COSString{Q±}COSString{"}}COSString{XS}COSString{&Ð}COSString{_N}COSString{!}COSString{9»}COSString{˛²}COSString{F,}COSString{(Ï}COSString{O©}COSString{en}COSString{F³}COSString{/?}COSString{˛¨}COSString{=}COSString{˚4}COSString{9}COSString{p»}COSString{;}COSString{5×}COSString{_Á}COSString{˜³}COSString{#É}COSString{0·}COSString{F,}COSString{Q±}COSString{"}}COSString{p»}COSString{1û}COSString{"}}COSString{˚.}COSString{O©}COSString{en}COSString{p»}COSString{#°}COSString{˜}COSString{Lê}COSString{5d}COSString{Zo}COSString{1¸}COSString{"G}COSString{˚.}COSString{ä}COSString{&Ð}COSString{Kå}COSString{Yª}COSString{#°}COSString{#´}COSString{5ê}COSString{p»}COSString{(Ï}COSString{O©}COSString{en}COSString{ZR}COSString{pÉ}COSString{p„}COSString{˜}COSString{G“}COSString{_û}COSString{%v}COSString{pÎ}COSString{1û}COSString{"}}COSString{1¹}COSString{F,}COSString{˛µ}COSString{5×}COSString{˜}COSString{2˚}COSString{]˜}COSString{+·}COSString{9á}COSString{p»}COSString{O´}COSString{5×}COSString{#b}COSString{˚.}COSString{+·}COSString{9á}COSString{p»}COSString{˜}COSString{]y}COSString{/0}COSString{}COSString{2§}COSString{˚.}COSString{˛¨}COSString{8E}COSString{N}COSString{*ø}COSString{22}COSString{++}COSString{&¢}COSString{+B}COSString{)%}COSString{ä}COSString{&Ð}COSString{!Í}COSString{˚b}COSString{f¨}COSString{Y)}COSString{.}COSString{"Q}COSString{5ê}COSString{p»}COSString{)*}COSString{7D}COSString{˛¨}COSString{˜³}COSString{˚b}COSString{P¥}COSString{&Ð}COSString{!Í}COSString{˚b}COSString{˛µ}COSString{G“}COSString{0m}COSString{F,}COSString{0m}COSString{<i}COSString{˛³}CO
SString{p»}COSString{;fi}COSString{˛µ}COSString{BÞ}COSString{\}COSString{F,}COSString{BO}COSString{Bˆ}COSString{Q™}COSString{-Ž}COSString{F,}COSString{!Ñ}COSString{Fr}COSString{p»}COSString{$™}COSString{/½}COSString{BO}COSString{Bˆ}COSString{F,}COSString{#™}COSString{5×}COSString{˛¨}COSString{nƒ}COSString{nƒ}COSString{p»}COSString{˚}COSString{5×}COSString{f‰}COSString{P¥}COSString{#Š}COSString{\\}COSString{F,}COSString{p°}COSString{1¹}COSString{<:}COSString{6±}COSString{C®}COSString{DÙ}COSString{˛µ}COSString{Q¯}COSString{_Á}COSString{9Ë}COSString{F,}COSString{˚b}COSString{#°}COSString{˜}COSString{p»}COSString{_Á}COSString{9Ë}COSString{F,}COSString{˚b}COSString{˚}COSString{<:}COSString{6±}COSString{C®}COSString{DÙ}COSString{˛µ}COSString{Cˆ}COSString{/?}COSString{1¸}COSString{"G}COSString{p°}COSString{p“}COSString{/4}COSString{˜.}COSString{p»}COSString{+·}COSString{O©}COSString{en}COSString{F,}COSString{˚3}COSString{9}COSString{7D}COSString{#Þ}COSString{DÙ}COSString{;}COSString{T}COSString{T`}COSString{5“}COSString{˛²}COSString{p»}COSString{]2}COSString{ }COSString{]2}COSString{(Ï}COSString{p°}COSString{:}COSString{6«}COSString{Må}COSString{&Ð}COSString{˛µ}COSString{M;}COSString{0·}COSString{e;}COSString{O«}COSString{aw}COSString{˚b}COSString{F,}COSString{FÇ}COSString{ZH}
(If this wasn't so long, it would have been more appropriate as a comment to the question instead of as an answer. But comments are too limited.)
The OP in his question shows that he essentially wants to parse the content stream of e.g. a page and extract the strings drawn in a legible form. He attempts this by simply taking the tokens in the content stream and looking at the COSString instances in there:
List<Object> tokens = pages.get(0).getContents().getStream().getStreamTokens();
tokens.forEach(s->{
if (s instanceof COSString) {
System.out.print(s.toString());
}
});
Unfortunately the output looks like a mess.
Why do the string values in the content stream look so messy?
The reason for this is that those COSString instances represent the PDF string objects as they are, and that there is no single encoding of PDF string objects in content streams, not even a limitation to a few standardized ones.
The encoding of a string completely depends on the definition of the font currently active when the string drawing instruction in question is executed.
Fonts in PDFs can be defined to use either some standard encoding or a custom one, and it is very common, in particular in the case of embedded font subsets, to use custom encodings with a mapping constructed similar to this:
the code 1 to the glyph of this font which is first used on the page,
the code 2 to the second glyph of this font on the page which is not identical to the first,
the code 3 to the third glyph of this font on the page not identical to either of the first two,
etc...
Obviously there is no good way to guess the meaning of string bytes for such encodings.
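To make this concrete, here is a small stand-alone sketch of such a first-use encoding. It is not PDFBox code and not how any particular PDF producer works; the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SubsetEncodingDemo {
    // Assigns code 1 to the first distinct character used, 2 to the second, and so on,
    // and returns the codes that would be written for the given text.
    static List<Integer> encode(String text, Map<Character, Integer> codes) {
        List<Integer> out = new ArrayList<>();
        for (char c : text.toCharArray()) {
            Integer code = codes.get(c);
            if (code == null) {
                code = codes.size() + 1;
                codes.put(c, code);
            }
            out.add(code);
        }
        return out;
    }

    public static void main(String[] args) {
        // Each call uses a fresh map, like two font subsets built independently.
        System.out.println(encode("hello", new LinkedHashMap<>())); // [1, 2, 3, 3, 4]
        System.out.println(encode("world", new LinkedHashMap<>())); // [1, 2, 3, 4, 5]
    }
}
```

Code 1 stands for "h" in the first subset and for "w" in the second, so the raw codes are meaningless without the map that belongs to the font.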
Thus, when parsing the content stream, you have to keep track of the current font and look up the meaning of each byte (or multi byte sequence!) of a COSString in the definition of that current font in the resource dictionary of the current page.
How to map those messy bytes using the current font definition?
The encoding of a PDF font might have to be determined in different ways.
There may be a ToUnicode map in the font definition which shows you which bytes to map to which character.
Otherwise the encoding may be a standard encoding like MacRomanEncoding or WinAnsiEncoding in which case one has to bring along the mapping table oneself (they are printed in the PDF specification).
Otherwise the encoding might be based on such a standard encoding but deviations are given by a mapping from codes to names of glyphs. If certain standard names are used, the character can be derived from that name. These names are listed in another document.
Otherwise some CIDSystemInfo entry may point to yet another standard Registry and Ordering from which to derive a mapping table specified in other documents.
Otherwise the font program itself may include usable mappings to Unicode.
Otherwise ???
Any pitfalls to evade?
The current font is a PDF graphics state attribute. Thus, one does not only have to remember the most recently set font but also consider the effects of operations changing the whole graphics state, in particular the save-graphics-state and restore-graphics-state operations, which push the current graphics state onto a stack or pop it from there.
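The effect of save and restore on the current font can be illustrated with a toy interpreter. This is a sketch under simplifying assumptions: the token format ("q", "Q", "Tf:<name>", "Tj") is made up for the example, only the font is tracked, and real content streams carry operands and many more operators.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class FontStateTracker {
    // Walks a simplified operator sequence and returns the font active at each "Tj".
    // Tokens: "q" (save state), "Q" (restore state), "Tf:<name>" (set font), "Tj" (show text).
    static List<String> fontsAtTextShows(List<String> ops) {
        Deque<String> saved = new ArrayDeque<>();
        String current = "none"; // no font selected yet
        List<String> result = new ArrayList<>();
        for (String op : ops) {
            if (op.equals("q")) {
                saved.push(current);
            } else if (op.equals("Q")) {
                if (!saved.isEmpty()) {
                    current = saved.pop();
                }
            } else if (op.startsWith("Tf:")) {
                current = op.substring(3);
            } else if (op.equals("Tj")) {
                result.add(current);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ops = Arrays.asList("Tf:F1", "q", "Tf:F2", "Tj", "Q", "Tj");
        System.out.println(fontsAtTextShows(ops)); // [F2, F1]
    }
}
```

The second text is drawn with F1 even though F2 was the font set most recently, because the restore brought back the state that was saved while F1 was active.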
How can PDF libraries help you?
PDF libraries which support you in text extraction can do so by doing all the heavy lifting for you,
parsing the content stream,
keeping track of the graphics state,
determining the encoding of the current font,
translating any drawn PDF strings using that encoding,
and forwarding only the resulting characters and some extra data (current position on the page, text drawing orientation, font and font size, colors, and other effects) to you.
In the case of PDFBox 2.0.x, this is what the PDFTextStreamEngine class does for you: you only have to override its processTextPosition method, to which PDFBox forwards that enriched character information in the TextPosition parameter. As you also want to know the starts and ends of the BT ... ET text object envelopes, you also have to override beginText and endText.
The class PDFTextStripper is based on that class and collects and sorts those pieces of character information to build a string containing the page text, which it eventually returns.
In PDFBox 1.8.x there was a very similar PDFTextStripper class but the base methods were not that properly separated in a base class, everything was somewhat more intermingled and it was harder to implement one's own extraction ways.
(In other PDF libraries there are similar constructs, sometimes event based like in PDFBox, sometimes as a collected sequence of TextPosition-like objects.)
Any widget that has a setHTML method can open a hole in the security system. But suppose we validate the string and accept only a limited set of HTML tags such as <b>, <i>, ..., and then pass this string to setHTML.
Then my question is: is it still safe if we do that?
For example, we check the String text to make sure it contains only a limited set of HTML tags: <b>, </b>, <i>, </i>, ... If the text contains other tags, we won't let users input it. Then we use:
html1.setHTML(text); instead of html1.setHTML(SafeHtmlUtils.fromString(text))
Also, I don't know why html1.setHTML(SafeHtmlUtils.fromString(text)) does not produce formatted text; it just shows plain text when I run it in Eclipse. For example,
html1.setHTML(SafeHtmlUtils.fromString("<b>text</b>"))
displays the plain text <b>text</b> instead of the word "text" in bold.
You want to sanitize the HTML, not escape it. The fromString method is meant to escape the string: if a user types a < b but forgets the space, then adds >c, you don't want the c to be bold and the b to be missing entirely. Escaping is done to render the string exactly as given, on the assumption that it is text.
On the complete other end of the spectrum, you can use fromTrustedString which tells GWT that you absolutely trust the source of the data, and that you will allow it to do anything. This typically should not be done for any data that comes from the user.
Somewhere between those two we have sanitization, the process where you take a string that is meant to be HTML and ensure it is safe, rather than either treating it like text or trusting it implicitly. This is hard to do well: any tag that has a style attribute could potentially attack you (this is why GWT has SafeStyle alongside SafeHtml), any tag that has a uri, url or href could be used to attack (hence SafeUri), and any attribute that the browser treats as a callback, such as onclick or the like, can be used to run JavaScript. The HtmlSanitizer type is meant to be able to do this.
There is a built-in implementation of this, as of at least GWT 2.4 - SimpleHtmlSanitizer. This class whitelists certain html tags, including your <b> and <i> tags, as well as a few others. Attributes are completely removed, as there are too many cases where they might not be safe. As the class name suggests, this is just a simple approach to this problem - a more complex and in-depth approach might be more true to the original code, but this also comes with the risk of allowing unsafe HTML content.
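The difference between escaping and whitelist-based sanitizing can be sketched in plain Java. This is not GWT's implementation of SimpleHtmlSanitizer; the class name, the tag list and the escape-then-restore strategy are assumptions made for the sake of a short example.

```java
import java.util.Arrays;
import java.util.List;

public class WhitelistSanitizerSketch {
    // Tags that may pass through, in their bare form only (no attributes).
    private static final List<String> ALLOWED = Arrays.asList("b", "i", "em", "strong");

    // Escapes everything, so the input is rendered as literal text.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    // Escapes everything first, then restores only the bare whitelisted tags.
    static String sanitize(String s) {
        String out = escape(s);
        for (String tag : ALLOWED) {
            out = out.replace("&lt;" + tag + "&gt;", "<" + tag + ">")
                     .replace("&lt;/" + tag + "&gt;", "</" + tag + ">");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(escape("<b>text</b>"));
        System.out.println(sanitize("<b>hi</b><script>x</script>"));
    }
}
```

Unlike the behaviour described above, this sketch leaves a tag that carries attributes fully escaped instead of stripping the attributes, and it makes no attempt to balance opening and closing tags. In GWT itself, the corresponding call should be something like html1.setHTML(SimpleHtmlSanitizer.sanitizeHtml(text)), assuming the static sanitizeHtml method of that class.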
Situation: I have a group of strings that represent Named Entities (NEs) that were extracted from something that used to be an HTML doc. I also have the original HTML doc, the stripped-of-all-markup plain text that was fed to the NER engine, and the offset/length of the strings in the stripped file.
I need to annotate the original HTML doc with highlighted instances of the NEs. To do that I need to do the following:
Find the start / end points of the NE strings in the HTML doc. Something that resulted in a DOM Range Object would probably be ideal.
Given that Range object, apply a styling (probably using something like <span class="ne-person" data-ne="123">...</span>) to the range. This is tricky because there is no guarantee that the range won't include multiple DOM elements (<a>, <strong>, etc.) and the span needs to start/stop correctly within each containing element so I don't end up with totally bogus HTML.
Any solutions (full or partial) are welcome. The back-end is mostly Python/Django, and the front-end is using jQuery. We would rather do this on the back-end, but I'm open to anything.
(I was a bit iffy on how to tag this question, so feel free to re-tag it.)
Use a range utility method plus an annotation library such as one of the following:
artisan.js
annotator.js
vie.js
The free software Rangy JavaScript library is your friend. Regarding your two tasks:
Find the start / end points of the […] strings in the HTML doc. You can use Range#findText() from the TextRange extension. It indeed results in a DOM Level 2 Range compatible object [source].
Given that Range object, apply a styling […] to the range. This can be handled with the Rangy Highlighter module. If necessary, it will use multiple DOM elements for the highlighting to keep up a DOM tree structure.
Discussion: Rangy is a cross-browser implementation of the DOM Level 2 range utility methods proposed by @Paul Sweatte. Using an annotation library would be a further extension on range library functionality; for example, Rangy will be the basis of Annotator 2.0 [source]. It's just not required in your case, since you only want to render highlights, not allow users to add them.
My colleague is extremely 'hot' on properly formatted and indented html being delivered to the client browser. This is so that the page source is easily readable by a human.
Firstly, if I have a partial view that is used in a number of different areas in my site, should the rendering engine be automatically formatting the indentations for me (ala setting the Formatting property on an XmlTextWriter)?
Secondly, my colleague has created a number of HtmlHelper extension methods for writing to the response. These all require a CurrentIndent parameter to be passed to them. This smells wrong to me.
Can anyone help with this?
This sounds difficult to maintain. If someone removed an outer element from the HTML, would anyone bother to update the CurrentIndent values in the code? These days most developers usually view their HTML through Firebug anyway, which formats the markup automatically with indentation.
If you really want to post-process HTML through a formatting filter then try a .NET port of HTML Tidy.
Browsers absolutely don't care how beautiful the HTML indentation is. What's more, deeply nested (and thus heavily indented) HTML adds a slight overhead to the page (in terms of bytes to download). Granted, you can always compress the response, and well-indented HTML is nicer to support.
Even if for some crazy reason it HAS TO be indented "properly", it shouldn't be done the way your colleague suggests.
An HttpModule attached to ReleaseRequestState event of the HttpApplication object should do the trick. And of course, you're going to need to come up with a filter that handles this indenting.
using System;
using System.Web;

public class IndentingModule : IHttpModule {
    public void Dispose() {
    }

    public void Init(HttpApplication context) {
        context.ReleaseRequestState +=
            new EventHandler(context_ReleaseRequestState);
    }

    void context_ReleaseRequestState(object sender, EventArgs e) {
        HttpApplication app = (HttpApplication)sender;
        // IndentingFilter is the Stream-based filter you have to write yourself.
        app.Response.Filter = new IndentingFilter(app.Response.Filter);
    }
}
Rather than waste time implementing a proper indenting solution which would affect all HTTP requests (thus adding CPU and bandwidth overhead), just suggest to your colleague that he use an HTML beautifier. That way the one person that cares about it is the one person that pays the cost of it.
This Firefox plugin is an HTML validator that also includes a beautification function. See the documentation here.