JEditorPane print dropping spaces - html

I have a JEditorPane containing HTML like this:
use the <strong>File</strong> menu, <strong>Open File</strong> to run the conversion
In the interactive window it looks like what I expect: "use the File menu, Open File to run the conversion"
However, when I print it using JEditorPane.print I get: "use theFile menu, Open File to run the conversion"
i.e. It is dropping the space between 'the' and 'File'.
This happens apparently at random throughout the HTML file, on the 'span', 'strong', 'a', and 'em' tags that I have tried. About half of such tags in the HTML drop a space. As in the example, it will happen on one such tag in a line and not the next one, or vice versa, or both. I've tried putting the space inside the 'strong' or 'a', or both inside and outside, and it doesn't make any difference. The space only gets dropped at the start of a tag, not the end.
It happens on two physical printers and on PDF creation, so I don't think it's printer dependent. I have also tried multiple JEditorPane printing methods, and they all behave the same. Different scaling makes no difference either.
Using &nbsp; does keep the space, but I have many hundreds of such tags, and it's going to be a real pain to insert that everywhere. I'm hoping to find whatever is causing this and turn it off.
Thanks

I guess this happens because of incorrect fractional measuring.
Try calling editorPaneInstance.getDocument().putProperty("i18n", Boolean.TRUE);
A similar problem is explained here and here

Related

Strange symbol shows up on website (L SEP)?

I noticed on my website, http://www.cscc.org.sg/, there's this odd symbol that shows up.
It says L SEP. In the HTML code, it displays the same thing.
Can someone show me how to remove it?
That character is U+2028 (HTML entity &#8232;), the Unicode LINE SEPARATOR, which is a kind of newline character. It's not actually supposed to be displayed. I'm guessing that either your server-side scripts failed to translate it into a new line or you are using a font that displays it.
But since we know the HTML and Unicode values for the character, we can add a few lines of jQuery that should get rid of it. In the code below, I'm just replacing it with a plain space. Just add this:
$(document).ready(function() {
  // Replace every U+2028 LINE SEPARATOR with a plain space
  $("body").children().each(function() {
    $(this).html($(this).html().replace(/\u2028/g, " "));
  });
});
This should work, though please note that I have not tested it, and it may not work, as none of my browsers will display the character.
If it doesn't, you can always try pasting your text block into http://www.nousphere.net/cleanspecial.php, which will remove any special characters.
Some fonts render LS as L SEP. Such a glyph is designed for unformatted presentations of the character, such as when viewing the raw characters of a file in a binary editor. In a formatted presentation, actual line spacing should be displayed instead of the glyph.
The problem is that neither the web server nor the web browser is interpreting the LS as a newline. The web server could detect the LS and replace it with <br>. Such a feature would fit well with a web server that dynamically generates HTML anyway, but would add overhead and complexity to a web server that serves file contents without modification.
If a LS makes its way to the web browser, the web browser doesn't interpret it as formatting. Page formatting is based only on HTML tags. For example, LF and CR just affect formatting of the HTML source code, not the web page's formatting (except in <pre> sections). The browser could in principle interpret LS and PS (paragraph separator) as <br> and <p>, but the HTML standard doesn't tell browsers to do that. (It seems to me like it would be a good addition.)
To replace the raw LS character with the line separation that the content creator likely intended, you'll need to replace the LS characters with HTML markup such as <br>.
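As a rough illustration of that replacement, here is a minimal sketch in JavaScript. The function name separatorsToBreaks is hypothetical, and the sketch assumes you can run it over the text before it is written into the page (for example in a build step or a server-side script).

```javascript
// Hypothetical helper: turn each U+2028 LINE SEPARATOR into an HTML <br>.
// Assumes the input is text that will later be emitted as HTML.
function separatorsToBreaks(text) {
  return text.replace(/\u2028/g, "<br>");
}

console.log(separatorsToBreaks("first line\u2028second line"));
```

Text without any LS characters passes through unchanged.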
This is the solution for the 'strange symbol' issue.
$(document).ready(function () {
  // One pass over the whole body is enough; looping over the children
  // only repeated the same replacement once per child.
  document.body.innerHTML = document.body.innerHTML.replace(/\u2028/g, ' ');
});
The jQuery/JS solutions here work to remove the character, but they broke my Revolution Slider. I ended up doing a search and replace for the character on the wp_posts table with the Better Search Replace plugin: https://wordpress.org/plugins/better-search-replace/
When you copy paste the character from a page to the plugin box, it is invisible, but it does work. Before doing DB replaces, always have a database (or full) backup ready! And be sure to uncheck the bottom checkbox to not do a dry run with the plugin.
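If a database replace is not an option, a gentler client-side variant may help. Reassigning innerHTML rebuilds every element in the body, which is a plausible reason for scripts such as sliders to break. The sketch below edits only text nodes and leaves elements in place. The names stripLineSeparators and cleanTextNodes are hypothetical, and this has not been tested against Revolution Slider.

```javascript
// Hypothetical helper: replace U+2028 with a plain space in a string.
function stripLineSeparators(text) {
  return text.replace(/\u2028/g, " ");
}

// Walk the text nodes under `root` and clean them in place, so that
// elements (and any event handlers attached to them) are not recreated.
function cleanTextNodes(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  let node;
  while ((node = walker.nextNode())) {
    if (node.nodeValue.includes("\u2028")) {
      node.nodeValue = stripLineSeparators(node.nodeValue);
    }
  }
}

// Only run in a browser; the guard keeps the file loadable elsewhere.
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", function () {
    cleanTextNodes(document.body);
  });
}
```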

Regex match and delete everything before string (opening html tag)

I'm using Dreamweaver and Notepad++ and have searched high and low but nothing seems to work from what I've found.
I've got a whole stack of HTML pages and I need to remove from all of them everything above, but not including, the first tag in the document. Specifically, everything before the string "<h1" (no quotes). I've tried various examples in Notepad++ and it finds the first h1 tag but doesn't replace everything before it.
Assuming you want to lose everything in your file before the "<h1" text, specify ".*<[hH]1" as the search pattern and "<h1" as the replacement, and check the box marked ". matches newline". Works for me.
You can do this from the Command Line or a text editor that allows you to search-replace multiple files. However, are you sure the content is the same in every html file?
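For a scripted version, here is a minimal JavaScript sketch of the same idea. It assumes each file's contents are already loaded into a string; the function name stripBeforeFirstH1 is hypothetical. Unlike the greedy ".*" pattern, which runs up to the last "<h1" when a file contains several, this uses a lazy match anchored at the start so it stops at the first one.

```javascript
// Hypothetical helper: drop everything before the first "<h1" (any case).
// [\s\S] matches newlines too, *? is lazy, and the lookahead keeps "<h1".
function stripBeforeFirstH1(html) {
  return html.replace(/^[\s\S]*?(?=<h1)/i, "");
}

const page = "<html>\n<body>\n<p>intro</p>\n<h1 class=\"a\">Title</h1>\n<h1>Second</h1>";
console.log(stripBeforeFirstH1(page));
```

A file with no h1 tag is returned unchanged, because the anchored pattern cannot match.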

What is this INSANE space character??? (google chrome)

This is driving me absolutely, !&&%&$ insane... it defies everything that I can think of.
THIS character right here... " "
In between these quotes... open Google Chrome and inspect. You will see it's a &nbsp;... normal, right? Now right-click and actually view the source of this Stack Overflow page. It's a regular space... (also, the character I copied was an actual space).
I could understand if it's some kind of rich text editor or something, but in the raw html source is a regular space, so what gives?
Here's just with hitting the space key (which works fine)... " ".
You can even copy it and paste it everywhere and wreak havoc and make Chrome put &nbsp; everywhere, even though what's copied to your clipboard is just a SPACE.
I have these stupid characters show up everywhere randomly in my website and I have no idea where they come from, or WHY Google is converting a SPACE into a &nbsp;.
I have tried inspecting the actual character code and it's a regular space from all things I can find...
Every single method I try shows it as a NORMAL space... so what gives?
If I use Ruby and do " ".ord I get 32. If I do it with the broken space I also get 32.
Please help me, I'm losing my mind.
Edit: you can prove this... view source on this page and you will see two empty " " like normal. Now look in the console and only one of them will be a &nbsp;, yet the raw source is identical.
Image for people not using chrome (this is looking at this very post via chrome dev tools):
Here's the HTML of the same text you see when you view source... no nbsp to be found.
When I view this page's source in Internet Explorer, or download it directly from the server and view it in a text editor, the first space character in question is formatted like this in the actual HTML:
<p>THIS character right here... <code>"&nbsp;"</code></p>
Notice the &nbsp; entity. That is Unicode codepoint U+00A0 NO-BREAK SPACE. Chrome is just being nice and re-formatting it as &nbsp; when inspecting the HTML. But make no mistake, it is a real non-breaking space, not Unicode codepoint U+0020 SPACE like you are expecting. U+00A0 is visually displayed the same as U+0020, but they are semantically different characters.
The second space character in question is formatted like this in the actual HTML:
<p>Here's just with hitting the space key (which works fine)... <code>" "</code>.</p>
So it is Unicode codepoint U+0020 and not U+00A0. Viewing the raw hex data of this page confirms that:
It turns out the two seemingly identical whitespace characters are not the same character.
Behold:
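// The last array entry and copiedSpace are U+00A0 (non-breaking space),
// written as escapes so they cannot be confused with a regular space.
var characters = ["a", "b", "c", "d", "\u00A0"];
var typedSpace = " ";
var copiedSpace = "\u00A0";
alert("Typed: " + characters.indexOf(typedSpace)); // -1
alert("Copied: " + characters.indexOf(copiedSpace)); // 4
alert(typedSpace === copiedSpace); // false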
JSFiddle
typedSpace.charCodeAt(0) returns 32, the classic space, whereas copiedSpace.charCodeAt(0) returns 160, the &#160; (AKA &nbsp;) character.
The difference between the two is that a whole bunch of &nbsp; characters repeated one after another will hold their ground and create additional space between them, whereas a whole bunch of repeated regular space characters will squish together into one space.
For instance:
A&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;B results in: A       B
A       B results in: A B
To replace the &nbsp; character with a regular space character in a string, try this:
.replace(new RegExp(String.fromCharCode(160),"g")," ");
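To make the character codes visible while debugging, something like the following sketch may help. The helper names describeSpaces and normalizeSpaces are hypothetical; the second one is just the replace call above wrapped in a function.

```javascript
// Hypothetical helper: list the position and character code of every
// regular space (32) and non-breaking space (160) in a string.
function describeSpaces(text) {
  const found = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code === 32 || code === 160) {
      found.push({ index: i, code: code });
    }
  }
  return found;
}

// Hypothetical helper: replace every U+00A0 with a regular space.
function normalizeSpaces(text) {
  return text.replace(new RegExp(String.fromCharCode(160), "g"), " ");
}

console.log(describeSpaces("a \u00A0b"));
console.log(normalizeSpaces("John\u00A0Brown") === "John Brown");
```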
To the people in the future like myself that had to debug this from a high level all the way down to the character codes, I salute you.
Don't get yer knickers in a knot. It's one of those special html characters that we old-school love because we was tort rite.
For many of us, we were taught that a sentence started with a capital letter and ended with a full-stop. But the next sentence is separated from this by TWO spaces.
Good-ol'-HTML doesn't like space(s). If you enter a string of words with 5 spaces between them (using an unintelligent editor like MS Notepad), then html shows it with single spaces.
SO, to get it looking like we old-farts like, we end a sentence with '.&NbSp; Next' This puts two spaces after the full-stop, and looks like '.  Next' rather than '. Next'.
Next point is that the real space (32) works as a linebreak, so that's good.
EXCEPT for we old-farts, who HATE to see our name split across a linebreak. That annoys us NO-END.
But, of course, that's where &NbSp; comes in handy again. If you enter 'John&NbSp;Brown', then the html thinks that's a single word, and it displays it just rite for we oldies.
How do these &NbSp; thingies get there? Well, good old Word (and I suspect many intelligent editors) see two spaces and output them as a non-breaking space followed by a normal space.
And when in Word, you can insert a non-breaking space between John and Brown by the key sequence alt-ctrl-space (sorry, you apple-users)
Lesson-over (with the exception that the term &NbSp; needs to be all lowercase - THIS viewer was even converting it)
It is a non-breaking space. &nbsp; is the entity used to represent a non-breaking space. It is essentially a standard space, the primary difference being that a browser should not break (or wrap) a line of text at the point that it occupies.
Most likely the character is being inserted by your HTML Editor. Could you give a more specific example in context?
This is not actually an answer to the question but instead a tool that can be used to detect this special white space in the html of the pages of a website so we can proceed to locate and remove it.
What the tool basically does is:
Fetches the content of a URL
Looks for occurrences of chr(194).chr(160) (the UTF-8 bytes of U+00A0) in the HTML contents
Replaces and highlights the occurrences with something more visible
This way you can actually know where the spaces are and edit your page properly to remove them.
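The detection step can be sketched in a few lines of JavaScript. This is not the tool's own code: the function name highlightNbsp is hypothetical, the "{NBSP}" marker is borrowed from the example URL's hstring parameter, and the HTML is assumed to be already fetched and decoded into a string. Only raw U+00A0 characters are counted, not the literal text "&nbsp;".

```javascript
// Hypothetical helper: count raw non-breaking spaces in an HTML string
// and return a copy with each one replaced by a visible marker.
function highlightNbsp(html, marker = "{NBSP}") {
  const matches = html.match(/\u00A0/g);
  return {
    count: matches ? matches.length : 0,
    highlighted: html.replace(/\u00A0/g, marker),
  };
}

// U+00A0 encodes to the two UTF-8 bytes 194 and 160.
const utf8Bytes = Array.from(new TextEncoder().encode("\u00A0"));
console.log(utf8Bytes);

console.log(highlightNbsp("<p>right here... \"\u00A0\"</p>"));
```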
The online version of the tool can be found here:
http://tools.heavydots.com/nbsp-space-char-detect/
A working example can be seen with the URL of this question, which contains one occurrence:
http://tools.heavydots.com/nbsp-space-char-detect/?url=http%3A%2F%2Fstackoverflow.com%2Fquestions%2F26962323%2Fwhat-is-this-insane-space-character-google-chrome&highlight=1&hstring=%7BNBSP%7D
There's a Github repo available if someone wants the code to run it locally:
https://github.com/HeavyDots/nbsp-space-char-detect
Hope someone finds it useful, for any feedback there's a comments section on the tool's page.
Updated 5th of January 2017
At our company blog we just wrote a funny post about this annoying white space. You're invited to drop by and read it! :-)
http://heavydots.com/blog/when-the-white-space-became-a-beast
As the previous answers have mentioned, it's a non-breaking space (nbsp). On Macs, this character gets inserted when you accidentally press Alt + Space (most of the time, this happens when entering code that requires Alt for special characters, e.g. [ on a German keyboard layout).
To remap this key combination to a plain ol' SPACE character, you can change your default keybinding as suggested on Apple SE
For whitespace, press "Alt+0160", which also produces a &nbsp; character.

Chrome adds non breaking space in text copied from PDF and pasted to TinyMCE

I'm afraid this is highly specific, so please bear with me and read carefully.
The problem:
Open a PDF file, select and copy some text that contains line breaks and paste it into a TinyMCE textarea in the Google Chrome browser. Then delete any line break and insert a space at the same point: the space that is added is non-breaking even though I used a regular "space bar" key stroke in TinyMCE.
How do I know there is a non-breaking space?
You can click the "show invisible characters" button on the first row of my TinyMCE implementation (see link below). Remember that with TinyMCE you must turn that option off and on again every time you modify the text to see the changes.
The non-breaking spaces will appear in orange, normal spaces appear normally.
What I have found so far:
If I delete the character that comes after the line break and then type that character again, I can insert a normal space. The problem seems to be attached to that character.
If I delete the character occurring before the line break, the problem persists, i.e. when I delete the space and type a new space it is still a non-breaking space.
Also when I save the text to the MySQL database, and read it again in TinyMCE, the problem still occurs, which reinforces my impression that the "hidden" character is attached to the letter following the line break (there is no saving on the test page of course).
Replicating it
You could of course try it yourself, but here is my testbed for you: http://www.roseback.com/test/tinymce4.html
I have tested it with many PDF files that we receive from graphic designers, from many products and eras. These PDFs are the files that are used for printing and there is no problem with those files for that use.
I uploaded a sample file here: http://www.roseback.com/test/languedoc.pdf. Test with the first paragraph starting with "Ce film exceptionnel".
However I have also tested random PDF files from the web and replicated the problem every time. So if you try with your own files and can't replicate, that might be interesting.
Environment:
Web page: the page is in HTML5, in UTF-8.
On the original page, the page is served via PHP and the textarea content comes from a MySQL 5.1 DB. The DB connection is set to UTF-8 in PHP, the content of the table and of the text field is in utf8_unicode_ci
On the test page there is no content and no saving, so no DB is involved.
Browser: Chrome. Does not happen in Firefox or Opera (not tested elsewhere)
TinyMCE: version 3 and version 4 (both standard version, not jQuery)
OS: on Windows 7 Pro 64 bit and also on Windows XP Pro 32 bit
I would appreciate any feedback, even simple confirmation / replication of the problem.
Hmm, I think what you observe has to do with the fact that TinyMCE inserts non-breaking spaces instead of regular spaces. TinyMCE needs to do this to prevent the browser from showing several consecutively entered spaces as one single space (which is the default browser behaviour).
You can verify this by inserting more than one space and then have a look at the non-visible characters.
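To illustrate why an editor would do this, here is a small sketch of one way to keep a run of spaces visible in HTML. It is not TinyMCE's actual code; the function name preserveSpaceRuns and the alternating scheme are assumptions made for the example.

```javascript
// Hypothetical helper: within each run of two or more regular spaces,
// alternate non-breaking and regular spaces so the browser cannot
// collapse the run into one. Single spaces are left untouched.
function preserveSpaceRuns(text) {
  return text.replace(/ {2,}/g, function (run) {
    let out = "";
    for (let i = 0; i < run.length; i++) {
      out += i % 2 === 0 ? "\u00A0" : " ";
    }
    return out;
  });
}

console.log(JSON.stringify(preserveSpaceRuns("a  b")));
```

With a scheme like this, single spaces between words stay regular, so it would not by itself explain a non-breaking space appearing after a single key stroke.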

<p> tag getting added to inline macro

I am using Umbraco 4.7. I have created a Razor macro to insert a telephone number in my rich text editor. Whenever I add the macro in the RTE, <p> tags are added around the macro automatically. I tried removing the extra <p> tags by editing the HTML, but as soon as I click on save, the <p> tags are added again. I tried installing this package, but it didn't solve my problem. I have tried setting <TidyEditorContent>False</TidyEditorContent> and checking the forced_root_block : 'p' setting, but neither of these solved the problem.
Any pointers to solve this issue?
This is a common issue, and the RTE causes a number of headaches for maintaining the integrity of your web page. The whole "should we include paragraph tags or not?" question is a difficult one, as it is fine to remove them when only inserting a single paragraph of text, but what if the content editor decides to add more, and you are stripping out the first and last P tags?
Bearing in mind that the CSS for a site will always need to support whatever you choose, the best option is to edit the configuration file to make TinyMCE omit the P tags. To do this, you need to edit the /config/tinyMceConfig.config file. This has two interesting sections at the bottom. <validElements> contains allowed HTML tags, and <invalidElements> of course contains the opposite.
If you look in the valid elements list of comma separated values, you should see a value #p[id|style|dir|class|align]. Taking note of EXACTLY how this is formatted, you should be able to move it into the invalid elements section. Put it after the default font tag, with a comma preceding it of course.
Restart IIS and try entering your content in the RTE. When you publish, then view the output, you should see that the string in the RTE has had the paragraph tags stripped.
One final option is to strip the paragraph tags from the output. There are many new ways of doing this, but for your Razor version I would use something like:
@Html.Raw(umbraco.library.RemoveFirstParagraphTag(value.ToString()))
This idea is covered in another StackOverflow article.
Good luck, and please let us know if you have any success.
Use this jQuery to remove empty paragraphs:
$('p').each(function() {
  var $this = $(this);
  // Remove the paragraph if it holds only whitespace or &nbsp; entities
  if ($this.html().replace(/\s|&nbsp;/g, '').length == 0)
    $this.remove();
});
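The emptiness test in that snippet can be pulled out into a plain function so it can be checked without a page. This is only a sketch, and the name isBlankParagraph is hypothetical. Note that in JavaScript \s already matches a raw U+00A0 character, so the extra alternative only matters for the literal text "&nbsp;".

```javascript
// Hypothetical helper: true when a paragraph's inner HTML contains
// nothing but whitespace and/or &nbsp; entities.
function isBlankParagraph(innerHtml) {
  return innerHtml.replace(/\s|&nbsp;/g, "").length === 0;
}

console.log(isBlankParagraph("&nbsp; \n"));
console.log(isBlankParagraph("020 1234 5678"));
```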