Scan site for images and alt attributes [closed] - html

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
We'd like to run a scan on our site that returns a report with the following:
each image tag found and a visual representation of that image on the report
the alt attribute for that image (also identify if an alt attribute isn't found)
Is there a simple tool that does this? We're attempting to check for alt attributes, and make sure the alt attributes accurately describe the image they represent. That's why the visual representation in the report is important.

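Try the Python package Beautiful Soup. It parses all of your HTML for you in one simple statement. Try this code (Python 3, with the beautifulsoup4 package installed):
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url = "https://example.com/"  # placeholder: the page you want to scan
    websitehtml = urlopen(url).read()
    soup = BeautifulSoup(websitehtml, "html.parser")
    matches = soup.find_all("img")
    for row in matches:
        # .get() returns None for a missing attribute instead of raising KeyError
        print(row.get("src"))
        print(row.get("alt"))
From here use row.get("src") to set the src of an image and print out alt next to it.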
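As a rough sketch of that last step, the snippet below builds an HTML table with each image rendered next to its alt text, and flags images whose alt attribute is missing. To stay self-contained it uses the standard library's html.parser rather than Beautiful Soup; the names ImgCollector and build_report, the table layout, and the example.com URLs are all made up for illustration.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgCollector(HTMLParser):
    """Collect (src, alt) for each <img>; alt is None when the attribute is absent."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            # A bare `alt` (no value) counts as present but empty.
            alt = (attr_map["alt"] or "") if "alt" in attr_map else None
            self.images.append((attr_map.get("src"), alt))


def build_report(page_html, base_url):
    """Return an HTML table showing each image next to its alt text."""
    collector = ImgCollector()
    collector.feed(page_html)
    rows = []
    for src, alt in collector.images:
        # Resolve relative image paths so the report can display them.
        full_src = urljoin(base_url, src or "")
        if alt is None:
            alt_cell = "<b>MISSING alt attribute</b>"
        elif alt == "":
            alt_cell = "<i>(empty alt)</i>"
        else:
            alt_cell = escape(alt)
        rows.append(
            '<tr><td><img src="%s" width="150"></td><td>%s</td></tr>'
            % (escape(full_src, quote=True), alt_cell)
        )
    return "<table border='1'>\n%s\n</table>" % "\n".join(rows)
```

Writing the returned string to a file such as report.html and opening it in a browser should give the visual side-by-side check the question asks for.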

Accessify.com has a plethora of accessibility testing tools as bookmarklets (or "favelets"). One of them does what I think you are looking for. Look on that page for "Alt attributes - show all". Drag that link to your bookmarks and then use it on the page you want to test.
Also, the Web Accessibility Toolbar (available for Internet Explorer and Opera) has a "List Images" option under "Images" that will do the same thing - list images and the code associated with each.
As for checking whole sites, there are free accessibility checkers available that should have a feature like this, like aDesigner.

http://sourceforge.net/projects/simplehtmldom/
I'd use something like that, very good and easy to use!

This answer on SO has some pointers on using Selenium to check your site for images with alt text present.

It sounds like you want something that works like what e.g. Jeremy provided, i.e. just a long list with each image and its alt attribute. The problem is that this will not give you enough context to write a useful alt attribute, because the alt attribute should not (in general) "accurately describe [...] the image they represent", but rather describe what the image is intended to represent on the current page. It is difficult to give a short description of how to write useful alt texts. The Wikipedia article on alt attributes itself kind of sucks in its current state, but the references are useful. There are, of course, many other SO questions related to this.
There might be some prewritten tool that does what you requested, if e.g. all pages are reachable from the start page it would be possible to just crawl the entire web site and generate the list. But if it's only possible to reach some pages by e.g. searching, some site-specific tool is probably needed.
Either way, let's assume that we have such a tool available. Even then, its use is fairly limited. Even if you can get a list of all images on the web site, with its associated alt text, you still have to visit all pages, one page at a time, and probably use some web developer extension in some browser (there are such tools provided in other answers, I think) that displays all alt texts on the page; and, then, fix the alt text, after you found out what the image is actually used for on the relevant page.
So, this tool you are requesting would only be useful for finding the pages with possible incorrect usage of the alt attribute (i.e., any page with an image on it). (But depending on the site under consideration, even this might be of some help of course.) You still need to open the web page the image is actually used on (or, if you prefer, read the HTML code for the page) to find out what the correct/better alt text would be.
So, at most you get a list of pages with images on that you have to inspect. But this will still miss some important cases, e.g. cases where the CSS background-image property is used to display a button (instead of an img image), that should have an alt text.
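For the case where every page is reachable from the start page, the crawl could be sketched roughly as below. It keeps the page URL together with each image, since the alt text has to be judged in the context of the page it appears on. The function name crawl, the caller-supplied fetch function, and the max_pages limit are assumptions made for this illustration; like any img-based scan it will not see CSS background images.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkAndImageParser(HTMLParser):
    """Collect link targets and (src, alt) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.links.append(attr_map["href"])
        elif tag == "img":
            # alt is None when the attribute is absent (or has no value).
            self.images.append((attr_map.get("src"), attr_map.get("alt")))


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of same-host pages reachable from start_url.

    fetch is a caller-supplied function mapping a URL to its HTML text
    (or None on failure). Returns {page_url: [(img_src, alt), ...]}.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    found = {}
    while queue and len(found) < max_pages:
        page_url = queue.popleft()
        page_html = fetch(page_url)
        if page_html is None:
            continue
        parser = LinkAndImageParser()
        parser.feed(page_html)
        found[page_url] = parser.images
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            target = urldefrag(urljoin(page_url, href)).url
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return found
```

Passing fetch in as a parameter keeps the sketch independent of any particular HTTP library; a real run would wrap something like urllib.request.urlopen and decode the response.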

You can use a powerful Java API: jsoup
Documentation for building selectors: selectors syntax
Training: online lab
For your case:
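    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    Document doc = Jsoup.connect("https://stackoverflow.com/").get();
    System.out.println(doc.title());
    // Select only the images that carry an alt attribute
    Elements imgWithAltAttr = doc.select("img[alt]");
    for (Element img : imgWithAltAttr) {
        // printf (not println) is needed for a format string
        System.out.printf("%s%n\t%s%n", img.attr("alt"), img.absUrl("src"));
    }
We use jsoup in our accessibility project: https://github.com/Tanaguru/Tanaguru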

Related

Is there any meaning behind so many tags in html? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 months ago.
So I am now learning html, and I was just wondering why tags such as cite even exist. When I open a website as a user, I still see the text as italic when the code is written as cite.
I found that the tags are useful when it comes to screen readers, so basically for users that have problems with their vision.
Are there any more reasons for these tags? Thank you so much in advance!
Tags are small snippets of HTML coding that tell engines how to properly “read” your content. In fact, you can vastly improve search engine visibility by adding SEO tags in HTML.
When a search engine’s crawler comes across your content, it takes a look at the HTML tags of the site. This information helps engines like Google determine what your content is about and how to categorize the material.
Some of them also improve how visitors view your content in those search engines. And this is in addition to how social media uses content tags to show your articles.
In the end, it’s HTML tags for SEO that will affect how your website performs on the Internet. Without these tags, you’re far less likely to really connect with an audience.
About the cite tag: The <cite> tag defines the title of a creative work (e.g. a book, a poem, a song, a movie, a painting, a sculpture, etc.). Note: A person's name is not the title of a work. The text in the <cite> element usually renders in italic.
Regarding the cite tag, according to MDN:
The HTML <cite> element is used to describe a reference to a cited creative work, and must include the title of that work. The reference may be in an abbreviated form according to context-appropriate conventions related to citation metadata.
This lets you manage all the CSS applied to citations in one place, were that to be your use case (if you happened to have a lot of them on a site). The italics you have observed are part of that CSS, or rather the default CSS applied by the browser.
In the broader spectrum
Oftentimes you will run into tags that are no longer in use today. There are different industry standards for different time periods.
All of the tags exist because there was a reason for web browsers to have a specific way of reading a piece of content.
For example, centering a div used to be an almost legendary task that was achievable using multiple methods, all of which had different advantages and disadvantages. However, nowadays it's customary to use flexbox.
Bottom line: it's a way for web browsers and search engines to read and interpret the content you're providing.
Tags such as and are used for text decoration and nothing else; you can also change text fonts and styles by using CSS.

Is using the visually hidden technique better than img alt text? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
I'm curious whether the CSS visually-hidden technique, most commonly used on font icons, is preferable to the alt attribute for providing alternative image text. The argument against the alt attribute is that a screen reader announces "Graphic" any time it sees an <img>, which is less natural. For example:
    <p>ABC <img src="right-arrow.png" alt="converts to"> XYZ</p>
Reads as "ABC graphic converts to XYZ"
    <p>
      ABC
      <span class="visually-hidden">converts to</span>
      <img src="right-arrow.png" alt="" aria-hidden="true">
      XYZ
    </p>
Reads as "ABC converts to XYZ"
I can agree it's less natural when a screen reader reads "Graphic" every time we focus on an image. On the other hand, a straightforward statement of the type of content may be extremely important for people with impairments. Navigating a site is hard enough, so if we can narrow the field of interpretation, it's advisable to do so.
What I mean is that surfing the Internet with a screen reader is not a perfect experience. But for the moment we have to stick with it, and by making some things more schematic, we actually make them a little clearer.
We can also look at the WCAG docs on this issue, where there are a few advised techniques to choose from.
Situation A: If a short description can serve the same purpose and present the same information as the non-text content:
G94: Providing short text alternative for non-text content that serves the same purpose and presents the same information as the non-text content using one of the following techniques:
Short text alternative techniques for Situation A:
ARIA6: Using aria-label to provide labels for objects
ARIA10: Using aria-labelledby to provide a text alternative for non-text content
G196: Using a text alternative on one item within a group of images that describes all items in the group
H2: Combining adjacent image and text links for the same resource
H35: Providing text alternatives on applet elements
H37: Using alt attributes on img elements
H53: Using the body of the object element
H86: Providing text alternatives for ASCII art, emoticons, and leetspeak
So basically we could also choose from the ARIA attributes (and sometimes we can, but only if alt is not enough), BUT there is one more strong argument for using alt attributes: SEO.
Using alt text on your images can make for a better user experience, but it may also help earn you both explicit and implicit SEO benefits. Along with implementing image title and file naming best practices, including alt text may also contribute to image SEO.
While search engine image recognition technology has vastly improved over the years, search crawlers still can't "see" the images on a website page like we can, so it's not wise to leave the interpretation solely in their hands. If they don't understand, or get it wrong, it's possible you could either rank for unintended keywords or miss out on ranking altogether.
quote from here
Both techniques are counterproductive and useless.
In your example :
ABC → XYZ
This means nothing to me. I'm not blind and I have no screen reader, and I still don't understand what this curious arrow is.
Of course having an img tag (with an alt) may help me to understand the meaning of this arrow by hovering with a mouse. But, how can I guess that this arrow is in fact an img tag and that I have to hover with my mouse? How can I do without using a mouse?
Let's try another method :
ABC converts to XYZ
We don't use img tag, we don't use visually hidden text. Everybody with or without screenreader will understand.
There are a lot more people targeted by accessibility than blind people and you should always try to satisfy everybody.

schema.org microdata - image required for Article

Every example I find for https://schema.org/Article gives at least one error in the Google Structured Data Testing Tool
Supposedly headline is now required; that's fine, just change itemprop="name" to itemprop="name headline"
But it also gives the error:
A value for the image field is required.
Required by:
Articles Rich Snippets
why is an image required for Articles?
why isn't this documented anywhere?
what should I give as an image?
I've heard people say the image should be an image of the article (literally interpreting the documentation: "An image of the item.") - Like an actual screenshot of the article. Isn't that useless? Is that even correct? Do I have to give an image? Will the same blank image or logo be ok for every article? Or would duplicate images get penalised due to not being relevant?
Schema.org (the vocabulary) doesn’t require any properties.
Google’s Structured Data Testing Tool is not a Schema.org validator. The tool primarily tests if your use of Schema.org conforms to Google’s own requirements for showing one of their Rich Snippets (or similar features/products).
So if Google’s tool says that a "value for the image field is required", this does not mean that you are technically required to provide it, or that something bad (like a ranking penalty) happens if you don’t provide it. It just means that Google won’t display a Rich Snippet for this document on their SERPs.
Google typically documents the properties they require for their features on https://developers.google.com/structured-data/ (example: their Article Rich Snippet).
Discussing which images are okay in Google’s eyes is off-topic on Stack Overflow (you might get help on Webmasters SE).
From Schema.org’s perspective: the image property is, as you note, defined to have an "image of the item" as value. For Product, it could be a photograph of the product, for Person a portrait etc. For an Article, the equivalent would be a photo/screenshot of the article itself; however, that’s often not really useful. So I think it’s common practice to specify an image that represents this article somehow: practically speaking, an image that could be shown as teaser image for your article. But you should not use the image property just because an image is part of the article (so don’t use it for all images, and don’t use it if you have no image that would/should represent the article); for such cases, you could use the hasPart property instead.
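As an illustration of specifying one representative image, the sketch below builds the structured data for an Article with a headline and an image. Note that it emits JSON-LD, which is a different syntax from the microdata used in the question; the function name article_json_ld and the example values are placeholders, and whether Google accepts a given image is a separate matter.

```python
import json


def article_json_ld(headline, image_url):
    """Return a JSON-LD string for a schema.org Article with one teaser image."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        # One image that represents the article, not every image in it.
        "image": image_url,
    }
    return json.dumps(data, indent=2)


print(article_json_ld("Example headline", "https://example.com/teaser.jpg"))
```

The resulting string would go inside a script element with type="application/ld+json" in the page.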

SEO title vs alt vs text [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 6 years ago.
Does the title attribute in a link do the job of the real text in the link for SEO?
i.e
Web Design
is it the same as:
click here
when trying to get a good page rank for keywords like "web design"? is it like alt attribute in an image tag? or is it useless in SEO?
is it the same as:
click here
what's the difference between all the above?
Thank you in advance!
Alt is not a valid attribute for <a> elements.
Use alt to describe images
Use title to describe where the link is going.
The textvalue (click here) is the most important part
The title attribute gets more and more ignored.
Google looks far more at the link text than at the title attribute.
For Google the title attribute is like a meta tag, which is not important compared to content.
Image alt attributes are, however, still very important (especially for image search).
The main purpose of those attributes is to provide usability for your users, not to feed information to search engines.
The title attribute doesn't have the same value as link text for SEO.
between
Web Design
and
click here
stick with the first option. But it is duplicate data, and adds no real value in this case.
The main purpose of title is to give a tooltip about the linked page. Putting the linked page's title there is the correct use (think of the user first).
The alt attribute allows non-textual content to be represented as text. Consider the examples from WHATWG: http://www.whatwg.org/specs/web-apps/current-work/multipage/embedded-content-1.html#alt
EDIT
1
<span>2</span>
3
...
27
The title attribute should be used to provide ADDITIONAL information for an element such as a link. If your title attribute duplicates the actual link text then it will have no SEO benefit (there are arguments that the duplication could have a slight negative effect too). If, however, you can provide additional, meaningful information on the link, such as further details about the content linked (especially if it links to a filetype that Google wouldn't be able to access/index), then it's definitely worth having.
Even as the tooltip in the browser, having a tooltip with the same text as the link text makes no sense, so as a rule of thumb only use it when you have something additional to add, not duplicate.
HTH
The text in the title attribute is not seen by crawlers. It won't cause keyword stuffing and it won't replace the anchor text for a given URL. It will, however, provide additional info if this is needed.
Use it to help your visitors not your SEO efforts.
alt is only valid for images — it's alternate text that serves for screen readers and people with images turned off to understand what an image represents.
title applies to most (if not all) elements, and can be used to provide tooltips for more information about parts of your pages.
I don't think either attribute plays any major roles in SEO. As Joe Hopfgartner says, the actual text of your links is much more significant in terms of semantics, which is why using "click here" as link text is discouraged these days.
Use this pseudo-code:
Text
For instance, this:
Example
renders like this:
Example

Is there a way to make search bots ignore certain text? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 months ago.
I have a blog (you can see it from my profile if you want). It's new, and so are the results from Google's robots parsing it.
The results were alarming to me. Apparently the most common 2 words on my site are "rss" and "feed", because I use text for links like "Comments RSS", "Post Feed", etc. These 2 words will be present in every post, while other words will be more rare.
Is there a way to make these links disappear from Google's parsing? I don't want technical links getting indexed. I only want content, titles, descriptions to get indexed. I am looking for something other than replacing this text with images.
I found some old discussions on Google, back from 2007 (I think in 3 years many things could have changed, hopefully this too)
This question is not about robots.txt and how to make Google ignore pages. It is about making it ignore small parts of the page, or transforming the parts in such a way that it will be seen by humans and invisible to robots.
There is a simple way to tell google to not index parts of your documents, that is using googleon and googleoff:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
In this example, the second paragraph will not be indexed by Google. Notice the “index” parameter, which may be set to any of the following:
index: content surrounded by “googleoff: index” will not be indexed by Google
anchor: anchor text for any links within a “googleoff: anchor” area will not be associated with the target page
snippet: content surrounded by “googleoff: snippet” will not be used to create snippets for search results
all: content surrounded by “googleoff: all” are treated with all
source
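Google will not use the content of HTML elements that carry the data-nosnippet attribute when building search result snippets (the content is still crawled and indexed):
    <p>
      This text can be included in a snippet
      <span data-nosnippet>and this part would not be shown</span>.
    </p>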
Source: Special tags that Google understands - Inline directives
I work on a site with top-3 google ranking for thousands of school names in the US, and we do a lot of work to protect our SEO. There are 3 main things you could do (which are all probably a waste of time, keep reading):
Move the stuff you want to downplay to the bottom of your HTML and use CSS and/or to place it where you want readers to see it. This won't hide it from crawlers, but they'll value it lower.
Replace those links with images (you say you don't want to do that, but don't explain why not)
Serve a different page to crawlers, with those links stripped. There's nothing black hat about this, as long as the content is fundamentally the same as a browser sees. Search engines will ding you if you serve up a page that's significantly different from what users see, but if you stripped RSS links from the version of the page crawlers index, you would not have a problem.
That said, crawlers are smart, and you're not the only site filled with permalink and rss links. They care about context, and look for terms and phrases in your headings and body text. They know how to determine that your blog is about technology and not RSS. I highly doubt those links have any negative effect on your SEO. What problem are you actually trying to solve?
If you want to build SEO, figure out what value you provide to readers and write about that. Say interesting things that will lead others to link to your blog, and crawlers will understand that you're an information source that people value. Think more about what your readers see and understand, and less about what you think a crawler sees.
Firstly, think about the issue. If Google thinks "RSS" is the main keyword, that may suggest the rest of your content is a bit shallow and needs expanding. Perhaps this should be the focus of your attention. If the rest of your content is rich, I wouldn't worry about the issue, as a search engine should know what the page is about from the title and headings. Just make sure RSS etc. is not in a heading or a bold or strong tag.
Secondly, as you rightly mention, you probably don't want to use images, as they are not accessible to screen readers without alt text, and if they have alt text or supporting text then you add the keyword back in. However, aria live may help you get around this issue, but I'm not an expert on accessibility.
Options:
Use JavaScript to write that bit of content (maybe ajax it in after load). Search engines like Google can execute JavaScript, but I would guess they won't value any JS-written content very highly.
Re-word the content or remove duplicates of it; one prominent RSS feed link may be better than several smaller ones dotted around the page.
Use the CSS content property with the :before or :after pseudo-elements to add your content. I'm not sure if bots will index words in CSS content properties and know that content's value in relation to each page, but it seems unlikely. Putting words like RSS in the CSS basically says it's a style thing, not an HTML thing, so even if engines do index it they won't add much/any value to it. For example, the HTML and CSS could be:
.add-text:after { content:'View my RSS feed'; }
Note the above will not work in older versions of IE, so you may need some IE version comments if you care about that.
"googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own internal website).
They are not supported by Google's web-search at all. So please refrain from doing that and I think that should not be marked as a correct answer as this might create ambiguity.
Now, to get Google to exclude part of a page, you will need to place that content in a separate file, such as excluded.html, and use an iframe to display that content in the host page.
The iframe tag grabs content from another file and inserts it into the host page. I think there is no other available method so far.
The only control that you have over the indexing robots, is the robots.txt file. See this documentation, linked by Google on their page explaining the usage of the file.
You can basically prohibit certain links and URLs, but not necessarily keywords.
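To illustrate that robots.txt works on URL paths rather than on words within a page, here is a small sketch using Python's standard urllib.robotparser. The rule set, the /feeds/ path and the example.com URLs are invented for the illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules given as a single string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: keep crawlers out of everything under /feeds/.
rules = "User-agent: *\nDisallow: /feeds/\n"

print(is_allowed(rules, "Googlebot", "https://example.com/feeds/comments.xml"))
print(is_allowed(rules, "Googlebot", "https://example.com/2010/05/some-post.html"))
```

The feed URL is blocked as a whole, while the post page stays crawlable together with any "RSS" link text it contains.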
Other than black-hat server-side methods, there is nothing you can do. You may want to look at why you have those words so often and remove some of them from the site.
It used to be that you could use JS to "hide" things from googlebot, but you can't now that it parses JS. ( http://www.webmasterworld.com/google/4159807.htm )
Google's crawlers are smart, but the people who program them are smarter. Humans see what is sensible on a page; they will spend time on a blog that has good content, ideally rare and unique.
It is all about common sense: how people visit your blog and how much time they spend there. Google measures search results in the same way. Your page ranking also increases as daily visits increase and as the site content gets better and is updated regularly.
This page has the word "Answer" repeated multiple times. That doesn't mean it will not get indexed. What matters is how useful it is to everyone.
I hope this gives you some idea.
You have to manually detect Googlebot from the request's user agent and feed it slightly different content than you normally serve to your users.
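A minimal sketch of that user-agent check might look like this. The function name is_googlebot is made up here, and the check is only a substring match: the User-Agent header can be spoofed, so a real deployment would also want to verify the requester (Google documents a reverse DNS lookup for this). As another answer points out, the page served to crawlers should stay fundamentally the same as what users see.

```python
def is_googlebot(user_agent):
    """Return True if the User-Agent header claims to be Googlebot.

    A claim only: the header is client-supplied and can be faked.
    """
    return "googlebot" in (user_agent or "").lower()


# Example header values for illustration.
crawler_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0"

print(is_googlebot(crawler_ua))
print(is_googlebot(browser_ua))
```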