How to convert text to speech on a web page? - html

I am making a web page that displays fragments of text from news sites (CNN, BBC, etc.) but I also want it to be read to people who can't see. How can I program the HTML page to read the text for them? Any ideas?
Thanks, Boda Cydo.

People who can't see will already be using either a screen reader (which will read the text to them), braille display or similar.
You just need to focus on making the text accessible and let their software handle "displaying" it to them.

The best way to make your website readable to people who cannot see is to use semantic HTML and follow standards. HTML readers can't magically infer your meaning if you don't. For example:
Use H1-H6 to designate the correct levels of titles in your site
Use P tags for body content
Use UL lists for navigation and A tags only for things that are really links
Use CSS for style - If an image is just used for style, put it in a background image instead
Only use tables for data that really is tabular.
If you have any content images, use IMG and provide ALT text
Use LABEL tags appropriately for forms
Use title attributes where appropriate
Most importantly - try turning off CSS in your browser. Does your web page still make sense to you? If so, you are probably on the right path.

No, you need to use Flash or a Java Applet to do this. There is nothing native in a browser for text-to-speech. Most people with these needs already have software that does this for them.

look to http://en.wikipedia.org/wiki/JAWS_%28screen_reader%29 . As far as i know it have integration with browsers

As Diodeus noted, if they have a need for text to speech then they will already have software to read for them. Just make the text available.
If you actually want to go through with implementing it yourself (though I wouldn't recommend it) then you can try to use the Google Translate API as described here. It looks like Google has taken down that text-to-speech site for now, but I assume (since it's Google after all) that they'll eventually release it. You may want to also look at the Android TTS library here.

Related

Is Google microformats supposed to be visible on the web page?

I was trying to add microformats as following to my webpage:
<div itemscope itemtype="http://schema.org/Product">
<span itemprop="brand">Company Name</span>
<span itemprop="name">Product Name</span>
<span itemprop="description">Product Description</span>
Product #: <span itemprop="sku">12345</span>
</div>
I thought this microformat will only show up in a google search result page. But after adding it, those information became visible on my webpage, and not in a good shape.
Is there something wrong? Or should I use display:none to make it invisible on my webpage?
Microformats are meant to add machine readable meaning to existing content on the page. They're not invisible meta data, they augment content that's already there. So, yes, it'll show up. You can hide or style it via any of the usual ways in which you hide or style content.
You are using Microdata, not Microformats.
Microdata is a syntax to include structured data within HTML5. Ideally you would use your existing content (i.e., add the needed attributes like itemprop etc. to your already existing markup), and only if that’s not possible, the hidden elements meta and link (which are allowed in the body if used for Microdata).
If you don’t want to use your existing markup and the visible content, you could use an alternative syntax: JSON-LD. This gets included as a data block (using the script element), which is not visible by default.
Don't try to use hide or style on your content, it will have a bad impact on your site. You might get penalized for cloaking if you practice it on all of your pages.
If you are trying to mark/let the bots know about some more info that is not on your page you can try using either the Data Highlighter for simple things in you Search Engine Console (Webmaster Tools) or for more complicated stuff you can try using JSON-LD coding on you pages.
Microformats are HTML. Used to publish a standard API that is consumed and used by search engines, browsers, and other web sites. Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. Microformats are a way to enable "smart scraping" of web pages, so that you can create tools and scripts that losslessly extract machine-readable information from cleanly-formatted, human-readable HTML. Structured Data is the name given to content which is marked up in a specific way, using MicroFormatting, to explain what that content is all about.
It is always recommended to show the Microdata information and not to hide it. You can probably try to give a good shape. It would show up in the Google and Bing result pages as well but you need to wait a little for that. There is nothing wrong with the Microformats applied by you. The thing is SEO need some more patience.

Why HTML with disabled CSS matters?

When creating a website why should you care for HTML with no style?
Is there any device which will render HTML only (no CSS or JavaScript)?
Do you usually care how your website will display without CSS?
Why is it important?
There are several cases in which websites may be used without styling. As mentioned in the comments, screen readers (such as those used by visually-impaired people) read only content, not styling.
Perhaps more importantly, many search engine spiders (think: Google) read your site without styling. When you view your site without CSS, you will gain a better understanding of how search engines view your content.
And if you are lucky, or your content is particularly geeky, you may get the occasional guru who browses your site via Lynx.
There seem to be a few misconceptions here. First of all and most importantly, screenreaders do take into account CSS and JavaScript. Why? Simply because unlike in the past they are not running on their own, but rather work as addons for existing browsers or include the render engines inline in their own systems.
Does that mean you don't need to concern yourself with screenreaders at all? Sadly that's not the case either. For example, if you add display:table to an element just because you want to vertically align something some screenreaders will actually treat it like a real table (which makes no practical sense). The good part is though that pages are read top-to-bottom, header and menu first (if found) and that adding display:none through javascript to an element will hide the element from the screen reader as well. Now, the following is going to sound really harsh, but except if you're making a real high profile website I wouldn't advice you to concern yourself with this too much. On one hand screenreaders are becoming better and better (try for example the one that's included on your android device if you have a recent version of android) and on the other hand blind people are used to websites being a 'bit' messy. Now, that doesn't mean you should start using flash or otherwise crazy stuff, but it does mean that if you just write a proper website, make your menu a list, make your divisions divs and not tables etc. you should in general be fine. And if you are making a high profile website then you should check out WAI-ARIA.
Now, getting to the search engine part, that's not true either for the big search engines at least. Google does take styling into account. Not all the styling that's unimportant for Google, but it actually will realize which stuff is hidden and analyzes the javascript whether hidden content will be shown (as part of it's anti-SEO work), it will search for links in your javascript and probably lots more I am not aware of. Bing does this to a large extend as well, though for example duckduckgo does not do this too much/at all. Either way, once again, the notion that Google sees your site like lynx does was true in the far past, but by now is invalid.
And if you check your serverlogs you will see that nobody accessed your site through lynx. That's just the reality of life nowadays. In the past (again) people would occasionally use lynx if they only had access to a console, but nowadays it's far easier to pull your phone from your pocket which runs a full web browser.
First part of the answer : 'text based browsers'
Text-based browser list
Alynx
ELinks (active version of Links)
Emacs/W3
Line Mode Browser
Links
Lynx
Net-Tamer
w3m
WebbIE
Second part : 'search engines'
List of semantic search engines
List of search engines
Third part : 'web accessibility' where software helps people with disabilities get access to the web.
It's important to note that for the third part, accessibility, it is
sometimes a legal requirement. For example, in the UK it is illegal to
have a website that is not accessible to blind people. There are
similar requirements for US government services. – slebetman
It's also an applicable law in canada
See this list of tools from w3 for a Complete List of Web Accessibility Evaluation Tools
CSS isn’t an on/off thing. Although CSS may be completely disabled, it is much more common that some of your CSS settings get ignored or overridden. Here is an incomplete list of cases (see my CSS Caveats for some additional details):
Speech-based browsers generally ignore most of CSS, largely because most of CSS is directed towards visual rendering.
So do “text-only browsers” (more accurately, character cell browsers, which render in plain text only using a monospace font but may be able to use colors and bolding).
Search engines generally don’t care about CSS at all.
CSS support varies. The more advanced CSS features you use, the more probable it is that many browsers don’t implement them.
CSS support may be disabled by the user, completely or partially.
User style sheets may interfere with your CSS code or even override them.
Browser settings e.g. on minimum font size may make some of you CSS settings ineffective.
Browsers have bugs. The more complicated CSS techniques you use, the more probable it is that you trigger a bug in some browsers.
An external CSS file (the recommended way of using CSS) may get lost, e.g. a browser may need to wait (perhaps in vain) from a server, or an archiving system may archive an HTML file but fail to archive the CSS file.
Styling may get lost in transfer, e.g. when copying content from a web page to MS Word or Excel or Notepad or email.

Embed sandboxed HTML on a webpage + SEO

I want to embed some HTML on my website... I would like that:
SEO: that content can be crawled and indexed
Integration: it renders nicely (does not break my DOM trees for instance, or does not inherit my styles)
Security: it remains safe for our user (javascript disabled)
Flexibility: the HTML can be completely free (don't want any BBCode or MarkDown or even TinyMCE, it's our users that are writing the HTML code...)
I saw that I might be able to use the IFrame for that, but I am not sure it is a very good solution concerning my SEO constraint.
Any answer would be greatly appreciated!!! Thanks.
For your requirements (rendering and security, primarily), IFRAME seems to be your only option, especially when we consider no rules are specified for the HTML content except the JS removal. Even some CSS + 'a' tag can bring a serious security risk, like overlaying outgoing links on your standard interface.
For the SEO part, you can use SEO maps to show the search engines the relation between the content and the container, also use html tags like link to make connection.
To make sure the user's html is safe then you should use HTMLPurifer. In terms of the rest of the question, you should split this up into multiple questions.

Is there a way to make search bots ignore certain text? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 months ago.
Improve this question
I have my blog (you can see it if you want, from my profile), and it's fresh, as well as google robots parsing results are.
The results were alarming to me. Apparently the most common 2 words on my site are "rss" and "feed", because I use text for links like "Comments RSS", "Post Feed", etc. These 2 words will be present in every post, while other words will be more rare.
Is there a way to make these links disappear from Google's parsing? I don't want technical links getting indexed. I only want content, titles, descriptions to get indexed. I am looking for something other than replacing this text with images.
I found some old discussions on Google, back from 2007 (I think in 3 years many things could have changed, hopefully this too)
This question is not about robots.txt and how to make Google ignore pages. It is about making it ignore small parts of the page, or transforming the parts in such a way that it will be seen by humans and invisible to robots.
There is a simple way to tell google to not index parts of your documents, that is using googleon and googleoff:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
In this example, the second paragraph will not be indexed by Google. Notice the “index” parameter, which may be set to any of the following:
index — content surrounded by “googleoff: index” will not be indexed
by Google
anchor — anchor text for any links within a “googleoff: anchor” area
will not be associated with the target page
snippet — content surrounded by “googleoff: snippet” will not be used
to create snippets for search results
all — content surrounded by “googleoff: all” are treated with all
source
Google ignores HTML tags which have data-nosnippet:
<p>
This text can be included in a snippet
<span data-nosnippet>and this part would not be shown</span>.
</p>
Source: Special tags that Google understands - Inline directives
I work on a site with top-3 google ranking for thousands of school names in the US, and we do a lot of work to protect our SEO. There are 3 main things you could do (which are all probably a waste of time, keep reading):
Move the stuff you want to downplay to the bottom of your HTML and use CSS and/or to place it where you want readers to see it. This won't hide it from crawlers, but they'll value it lower.
Replace those links with images (you say you don't want to do that, but don't explain why not)
Serve a different page to crawlers, with those links stripped. There's nothing black hat about this, as long as the content is fundamentally the same as a browser sees. Search engines will ding you if you serve up a page that's significantly different from what users see, but if you stripped RSS links from the version of the page crawlers index, you would not have a problem.
That said, crawlers are smart, and you're not the only site filled with permalink and rss links. They care about context, and look for terms and phrases in your headings and body text. They know how to determine that your blog is about technology and not RSS. I highly doubt those links have any negative effect on your SEO. What problem are you actually trying to solve?
If you want to build SEO, figure out what value you provide to readers and write about that. Say interesting things that will lead others to link to your blog, and crawlers will understand that you're an information source that people value. Think more about what your readers see and understand, and less about what you think a crawler sees.
Firstly think about the issue. If Google think "RSS" is the main keyword that may suggest the rest of your content is a bit shallow and needs expanding. Perhaps this should be the focus of your attention.If the rest of your content is rich I wouldn't worry about the issue as a search engine should know what the page is about from title and headings. Just make sure RSS etc is not in a heading or bold or strong tag.
Secondly as you rightly mention, you probably don't want use images as they are not assessable to screen readers without alt text and if they have alt text or supporting text then you add the keyword back in. However aria live may help you get around this issue, but I'm not an expert on accessibility.
Options:
Use JavaScript to write that bit of content (maybe ajax it in after load). Search engines like Google can execute JavaScript but I would guess it wont value any JS written content very highly.
Re-word the content or remove duplicates of it, one prominent RSS feed link may be better than several smaller ones dotted around the page.
Use the css content attribute with pseudo :before or :after to add your content. I'm not sure if bots will index words in content attributes in CSS and know that contents value in relation to each page but it seems unlikely. Putting words like RSS in the CSS basically says it's a style thing not an HTML thing, therefore even if engines to index it they wont add much/any value to it. For example, the HTML and CSS could be:
.add-text:after { content:'View my RSS feed'; }
Note the above will not work in older versions of IE, so you may need some IE version comments if you care about that.
"googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own internal website).
They are not supported by Google's web-search at all. So please refrain from doing that and I think that should not be marked as a correct answer as this might create ambiguity.
Now, to get Google to exclude part of a page, you will need to place that content in a separate file, such as excluded.html, and use an iframe to display that content in the host page.
The iframe tag grabs content from another file and inserts it into the host page. I think there is no other available method so far.
The only control that you have over the indexing robots, is the robots.txt file. See this documentation, linked by Google on their page explaining the usage of the file.
You basically can prohibit certain links and URL's but not necessarily keywords.
Other than black-hat server-side methods, there is nothing you can do. You may want to look at why you have those words so often and remove some of them from the site.
It used to be that you could use JS to "hide" things from googlebot, but you can't now that it parses JS. ( http://www.webmasterworld.com/google/4159807.htm )
Google crawler are smart but someone that program them are smartest. Human always sees what is sensible in the page, they will spend time on blog that have some nice content and most rare and unique.
It is all about common sense, how people visit your blog and how much time they spend. Google measure the search result in the same way. Your page ranking also increase as daily visits increase and site content get better and update every day.
This page has "Answer" words repeated multiple times. It doesn't mean that it will not get indexed. It is how much useful is to every one.
I hope it will give you some idea
you have to manually detect the "Google Bot" from request's user agent and feed them little different content than you normally serve to your user.

Apart from <script> tags, what should I strip to make sure user-entered HTML is safe?

I have an app that reprocesses HTML in order to do nice typography. Now, I want to put it up on the web to let users type in their text. So here's the question: I'm pretty sure that I want to remove the SCRIPT tag, plus closing tags like </form>. But what else should I remove to make it totally safe?
Oh good lord you're screwed.
Take a look at this
Basically, there are so many things you want to strip out. Plus, there's stuff that's valid, but could be used in malicious ways. What if the user wants to set their font size smaller on a footnote? Do you care if that get applied to your entire page? How about setting colors? Now all the words on your page are white on a white background.
I would look into the requirements phase again.
Is a markdown-like alternative possible?
Can you restrict access to the final content, reducing risk of exposure? (meaning, can you set it up so the user only screws themselves, and can't harm other people?)
You should take the white-list rather than the black-list approach: Decide which features are desired, rather than try to block any unwanted feature.
Make a list of desired typographic features that match your application. Note that there is probably no one-size-fits-all list: It depends both on the nature of the site (programming questions? teenagers' blog?) and the nature of the text box (are you leaving a comment or writing an article?). You can take a look at some good and useful text boxes in open source CMSs.
Now you have to chose between your own markup language and HTML. I would chose a markup language. The pros are better security, the cons are incapability to add unexpected internet contents, like youtube videos. A good idea to prevent users' rage is adding an "HTML to my-site" feature that translates the corresponding HTML tags to your markup language, and delete all other tags.
The pros for HTML are consistency with standards, extendability to new contents types and simplicity. The big con is code injection security issues. Should you pick HTML tags, try to adopt some working system for filtering HTML (I think Drupal is doing quite a good job in this case).
Instead of blacklisting some tags, it's always safer to whitelist. See what stackoverflow does: What HTML tags are allowed on Stack Overflow?
There are just too many ways to embed scripts in the markup. javascript: URLs (encoded of course)? CSS behaviors? I don't think you want to go there.
There are plenty of ways that code could be sneaked in - especially watch for situations like <img src="http://nasty/exploit/here.php"> that can feed a <script> tag to your clients, I've seen <script> blocked on sites before, but the tag got right through, which resulted in 30-40 passwords stolen.
<iframe>
<style>
<form>
<object>
<embed>
<bgsound>
Is what I can think of. But to be sure, use a whitelist instead - things like <a>, <img>† that are (mostly) harmless.
† Just make sure that any javascript:... / on*=... are filtered out too... as you can see, it can get quite complicated.
I disagree with person-b. You're forgetting about javascript attributes, like this:
<img src="xyz.jpg" onload="javascript:alert('evil');"/>
Attackers will always be more creative than you when it comes to this. Definitely go with the whitelist approach.
MediaWiki is more permissive than this site; yes, it accepts setting colors (even white on white), margins, indents and absolute positioning (including those that would put the text completely out of screen), null, clippings and "display;none", font sizes (even if they are ridiculously small or excessively large) and font-names (even if this is a legacy non-Unicode Symbol font name that will not render text successfully), as opposed to this site which strips out almost everything.
But MediaWiki successifully strips out the dangerous active scripts from CSS (i.e. the behaviors, the onEvent handlers, the active filters or javascript link targets) without filtering completely the style attribute, and bans a few other active elements like object, embed, bgsound.
Both sits are banning marquees as well (not standard HTML, and needlessly distracting).
But MediaWiki sites are patrolled by lots of users and there are policy rules to ban those users that are abusing repeatedly.
It offers support for animated iamges, and provides support for active extensions, such as to render TeX maths expressions, or other active extensions that have been approved (like timeline), or to create or customize a few forms.