schema.org microdata - image required for Article - html

Every example I find for https://schema.org/Article gives at least one error in the Google Structured Data Testing Tool
Supposedly headline is now required; that's fine, just change itemprop="name" to itemprop="name headline"
But it also gives the error:
A value for the image field is required.
Required by:
Articles Rich Snippets
Why is an image required for Articles?
Why isn't this documented anywhere?
What should I give as an image?
I've heard people say the image should be an image of the article (literally interpreting the documentation: "An image of the item.") - Like an actual screenshot of the article. Isn't that useless? Is that even correct? Do I have to give an image? Will the same blank image or logo be ok for every article? Or would duplicate images get penalised due to not being relevant?

Schema.org (the vocabulary) doesn’t require any properties.
Google’s Structured Data Testing Tool is not a Schema.org validator. The tool primarily tests if your use of Schema.org conforms to Google’s own requirements for showing one of their Rich Snippets (or similar features/products).
So if Google’s tool says that a "value for the image field is required", this does not mean that you are technically required to provide it, or that something bad (like a ranking penalty) happens if you don’t provide it. It just means that Google won’t display a Rich Snippet for this document on their SERPs.
Google typically documents the properties they require for their features on https://developers.google.com/structured-data/ (example: their Article Rich Snippet).
Discussing which images are okay in Google’s eyes is off-topic on Stack Overflow (you might get help on Webmasters SE).
From Schema.org’s perspective: the image property is, as you note, defined to have an "image of the item" as value. For Product, it could be a photograph of the product, for Person a portrait etc. For an Article, the equivalent would be a photo/screenshot of the article itself; however, that’s often not really useful. So I think it’s common practice to specify an image that represents this article somehow: practically speaking, an image that could be shown as teaser image for your article. But you should not use the image property just because an image is part of the article (so don’t use it for all images, and don’t use it if you have no image that would/should represent the article); for such cases, you could use the hasPart property instead.
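To see which of the properties Google's tool complained about are absent from a page, a minimal sketch along these lines might help. It uses only Python's standard library. `missing_props` and `GOOGLE_EXPECTED` are hypothetical names, and the set of expected properties is an assumption taken from the errors quoted in the question, not an official list.

```python
from html.parser import HTMLParser

# Assumption: the properties Google's tool asked for in the question above.
GOOGLE_EXPECTED = {"headline", "image"}


class ItempropCollector(HTMLParser):
    """Collects every itemprop name found in a document (nesting is ignored)."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("itemprop")
        if value:
            # itemprop may hold several space-separated names, e.g. "name headline"
            self.found.update(value.split())


def missing_props(html, expected=GOOGLE_EXPECTED):
    """Return the expected property names that never appear as an itemprop."""
    collector = ItempropCollector()
    collector.feed(html)
    return sorted(set(expected) - collector.found)


sample = """
<article itemscope itemtype="https://schema.org/Article">
  <h1 itemprop="name headline">Example title</h1>
  <p itemprop="articleBody">Example body text.</p>
</article>
"""
print(missing_props(sample))  # ['image']
```

This only tells you what Google's Rich Snippet check would likely flag; as far as Schema.org itself is concerned, the markup is fine either way.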

Related

HTML title attribute instead of the tag content

Considering accessibility, is it good to use:
<a href="about.html" title="About"></a>
without any content, instead of:
<a href="about.html">About</a>
Update: Here's a live demo, supposing that an appropriate font is available. I actually use RichStyle font.
a[href="about.html"]:before {
/* 1F6C8 🛈 CIRCLED INFORMATION SOURCE = information */
content: "\1F6C8";
}
Here is some good information regarding use of the title attribute with the anchor tag.
w3.org - supplementing link text with the title attribute
The objective of this technique is to demonstrate how to use a title attribute on an anchor element to provide additional text describing a link. The title attribute is used to provide additional information to help clarify or further describe the purpose of a link. If the supplementary information provided through the title attribute is something the user should know before following the link, such as a warning, then it should be provided in the link text rather than in the title attribute.
Because of the extensive user agent limitations in supporting access to the title attribute, authors should use caution in applying this technique. For this reason, it is preferred that the author use technique C7: Using CSS to hide a portion of the link text (CSS) or H30: Providing link text that describes the purpose of a link for anchor elements.
"Good" is a not very well defined term.
Is it accessible to screen readers - yes. According to the WAI-ARIA name calculation algorithm, title will be used to calculate the name. It is step H and is referred to as the tooltip.
http://www.w3.org/TR/accname-aam-1.1/#mapping_additional_nd_te
However, that is not the whole picture, because there also needs to be a visible name for the link that is accessible by sighted keyboard-only users. Title attributes are only displayed in HTML when you mouse over an item.
Therefore, this technique will only be accessible if there is some other visible indication of what the link is and this visible indicator adheres to all the other accessibility requirements.
WCAG says in its normative section (2.4.4): "The purpose of each link can be determined from the link text alone" (http://www.w3.org/TR/WCAG20/#navigation-mechanisms).
WCAG also says that you can "supplement" the link text with the title attribute, so it excludes the title attribute from the definition of "link text".
Conclusion: you must provide a link text like in your second example:
<a href="about.html">About</a>
Although this is not explicitly defined in this normative section of the WCAG, the "link text" is defined in the techniques as being the content of the inner text of a link including images alternative.
See the following two complementary techniques for more information:
http://www.w3.org/TR/2015/NOTE-WCAG20-TECHS-20150226/H30.html
http://www.w3.org/TR/WCAG20-TECHS/H33.html
EDIT: one comment pointed out that my answer lacked examples.
Please note that WCAG 2.0 and WAI-ARIA are two complementary but distinct guidelines, and that you can provide additional information to be exposed to the accessibility API if necessary. But in no case should you consider that exposing information to the accessibility API is sufficient for those not using assistive technologies.
So the following example is wrong, as aria-label can't be accessed within a user agent without the use of assistive technologies
The following is also wrong:
<a href="about.html" title="About">
<img width="100" height="100" src="about.png" alt="" /></a>
Although your link exposes an "accessible name" to the Accessibility API (WAI-ARIA), it does not provide a "link text" as specified by the WCAG normative guideline (and apparently means that a significant image is used as decorative).
So if your link only contains an image, this image should be present in the HTML code and be given a proper text alternative with the alt attribute.
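To find links that have no link text in the sense described above (inner text or image alt text, with title and aria-label not counting), a rough check could look like this. It is a sketch using Python's standard library; `LinkTextChecker` and `links_without_text` are hypothetical names, and it is no substitute for a real accessibility audit.

```python
from html.parser import HTMLParser


class LinkTextChecker(HTMLParser):
    """Records, for each <a href>, its link text: inner text plus img alt text."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, link_text)
        self._href = None    # href of the link currently open, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._href, self._parts = attrs["href"], []
        elif tag == "img" and self._href is not None:
            # An image's alt text counts as link text; title does not.
            self._parts.append(attrs.get("alt") or "")

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so "  \n " counts as empty.
            text = " ".join(" ".join(self._parts).split())
            self.links.append((self._href, text))
            self._href = None


def links_without_text(html):
    """Return the href of every link whose link text is empty."""
    checker = LinkTextChecker()
    checker.feed(html)
    return [href for href, text in checker.links if not text]
```

A link such as `<a href="about.html" title="About"></a>` would be reported, while `<a href="about.html">About</a>` would not.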
EDIT 2: You can read the following blog post to illustrate the problem with the sole use of the title attribute https://silktide.com/i-thought-title-text-improved-accessibility-i-was-wrong/
It's neither bad nor good. If you want to make a button with an image and no text using the a tag, you will do something like.
Just make sure you style your a element so users will be able to click on it.
For SEO purposes it is good: you can target keywords with the title attribute. So it is good to use. It gives the user an idea of what will happen when they click the link. I use it often.

Can longdesc be used without meaningful alt text?

I'm attempting to write good fallback text content for a webcomic. Naturally, there is a huge amount of actual text locked in the image, and plenty of descriptions/actions/expressions that could also be described. Having longdesc="#transcript" seems like the perfect use case, and comes with benefits for searching and automatic translation.
But what do I do with the alt? I've checked the official specs, and dug around in WebAIM and similar sites, but I've never seen a use case for having longdesc supplant alt. This makes sense for the usual applications (overview a chart in the alt text, link to a full breakdown elsewhere), but it seems like any alternative text I could offer for a comic would be redundant and miss out on the rich markup provided by the long description.
Here are some possibilities:
<img alt="" longdesc="#transcript" />
<img alt="[transcript text stripped of HTML and made attribute-safe]" longdesc="#transcript" />
<img alt="[Summary of comic contents... Which can get iffy, like this: 'Garfield talks about being fat. Punchline: he's fat.']" longdesc="#transcript" />
<img alt="[apologize profusely to screen reader users]" longdesc="#transcript" />
None of these seem ideal for various reasons, whether that be repeated content, no longdesc support, or me annoying Assistive Technology users. Without a sound declaration from folks who have thought about and dealt with this stuff way more than I have, I'm at a loss.
The alt attribute is required for accessibility, and it is even formally required by the HTML 4.01 specification. The use of the longdesc attribute, even if it were implemented, would not make the alt attribute unnecessary. The description of img in HTML 4.01 shows the following example:
<IMG src="sitemap.gif"
alt="HP Labs Site Map"
longdesc="sitemap.html">
It adds: “The alt attribute provides a short description of the image. This should be sufficient to allow users to decide whether they want to follow the link given by the longdesc attribute to the longer description, here "sitemap.html".” (Whether this would be adequate even if longdesc were supported is a different matter.)
Due to lack of support, longdesc has remained useless. In order to refer to a long description, you need to use a normal link near the image. This lets anyone (even people who can see the image but may need some explanation) access the description.
I would think about it from the point of view of someone who needs alt-text, i.e. someone who cannot see the image.
It appears that the transcript is on the same page (from longdesc="#transcript"), so the 'link' that longdesc provides would take you to another section of the same page? Perhaps the transcript is further down the page?
In which case the key information for that user is what it is (very briefly), and where to find the transcript.
I would suggest something like:
<img alt="Comic frames, full description below." longdesc="#transcript" />
Although longdesc is getting a little more support these days and an update to the HTML5 spec is proposed, you can't rely on it yet.
Therefore if the transcript is not immediately apparent, I'd also include a link nearby to take you to it.
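The two suggestions above (a short alt text, plus an ordinary link to the transcript) can be checked mechanically. The following is a minimal sketch under those assumptions; `LongdescAudit` and `audit_longdesc` are hypothetical names, and it treats "a link nearby" loosely as "any link on the page with the same target".

```python
from html.parser import HTMLParser


class LongdescAudit(HTMLParser):
    """Collects images that use longdesc and the targets of ordinary links."""

    def __init__(self):
        super().__init__()
        self.images = []           # (alt, longdesc) for each img with longdesc
        self.link_targets = set()  # href values of <a> elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("longdesc"):
            self.images.append((attrs.get("alt"), attrs["longdesc"]))
        elif tag == "a" and attrs.get("href"):
            self.link_targets.add(attrs["href"])


def audit_longdesc(html):
    """Return a list of human-readable problems for images that use longdesc."""
    audit = LongdescAudit()
    audit.feed(html)
    problems = []
    for alt, target in audit.images:
        if not (alt or "").strip():
            problems.append(f"{target}: alt text is missing or empty")
        if target not in audit.link_targets:
            problems.append(f"{target}: no ordinary link to the description")
    return problems
```

An empty list means every image with longdesc also has some alt text and a normal link that users without longdesc support can follow.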

How to create a 508 compliant org chart?

I've been researching this and haven't found much in terms of standard solutions for creating a 508-compliant, accessible org chart. We have images that represent organizational structure. It seems like the options would be to create an external file to link to that attempts to represent the relationships in the chart (although I'm not sure if there's a commonly accepted way to do this via text for a hierarchical tree), or maybe create an imagemap that doesn't actually link to anything externally but just exists for the labels. That seems much more of a hack. I also just thought of another potential representation: another HTML file (linked) that is basically just your standard list, which can represent unlimited hierarchical complexity. Some labeled items are outside the general hierarchy (so groupings of various types within the hierarchy, etc.). Just wondering if anyone else had run into this, or just seen examples of how others have approached it?
Section 508 says, regarding web-based intranet and internet information and applications, which is probably what matters here: “(a) A text equivalent for every non-text element shall be provided (e.g., via "alt", "longdesc", or in element content).” Any solution that fulfills the requirement is 508 compliant. Note that this is a legal and formal matter; it does not imply that the content is really accessible.
So you can, for example, write a textual description of the organization (equivalent in content to the image) into the alt attribute. There is no defined upper limit on its length. Alternatively, you can use the longdesc attribute to link to a page containing an equivalent description, which may use all the expressive power of HTML, e.g. nested lists, or a table (which has accessibility requirements of course). Software support to longdesc is limited, if not anecdotal, but Section 508 explicitly mentions this possibility. Most sensibly, you can write a textual description, using HTML markup as needed, either in the page content (in which case you can use alt="") or on a separate page that you link to.
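As an illustration of the nested-list option, here is a small sketch that turns a hierarchy into nested `<ul>`/`<li>` markup. The organization shown is invented, and `render_org_list` is a hypothetical helper, not part of any library.

```python
from html import escape


def render_org_list(node, indent=0):
    """Render a (label, children) tuple as nested <ul>/<li> markup."""
    label, children = node
    pad = "  " * indent
    if not children:
        return f"{pad}<li>{escape(label)}</li>"
    lines = [f"{pad}<li>{escape(label)}", f"{pad}  <ul>"]
    for child in children:
        lines.append(render_org_list(child, indent + 2))
    lines += [f"{pad}  </ul>", f"{pad}</li>"]
    return "\n".join(lines)


# Hypothetical organization, invented for illustration.
org = ("Director", [
    ("Head of Engineering", [("Developer", []), ("Tester", [])]),
    ("Head of Sales", []),
])

print("<ul>\n" + render_org_list(org, 1) + "\n</ul>")
```

The resulting list can go in the page itself next to the image (with alt="" on the image) or on a separate page that you link to. Items that sit outside the main hierarchy would still need a separate description.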
For a more specific answer, I think you need to ask a more specific question – like one with a real image representing an org chart.
I'm working toward a deadline that led me to this question more than five years after it was asked. Even now, if somebody hands me a visually presented org chart with no accessible fallback, Jukka's answer offers the best solution I can think of.
But what if we are part of the creation process (which is always the ideal), able to influence accessibility from the start? With well-structured semantic HTML, is it possible that no fallback will be needed? That's what I've gone in search of now, and here are a couple of resources that may be useful to someone in similar need. Both of these are licensed open source, which in both cases (using the MIT License) simply requires keeping the original copyright and license notice in the source code.
Here's a CSS solution proposed by Erin Sullivan.
And here's another that uses the Treeflex CSS library.
I always try to keep content separate from presentation, and CSS offers the possibility for continually customizing, refining and improving the presentation. I expect to use one of these in my current project, and I hope this research benefits others who are committed to better accessibility.

Scan site for images and alt attributes [closed]

We'd like to run a scan on our site that returns a report with the following:
each image tag found and a visual representation of that image on the report
the alt attribute for that image (also identify if an alt attribute isn't found)
Is there a simple tool that does this? We're attempting to check for alt attributes, and make sure the alt attributes accurately describe the image they represent. That's why the visual representation in the report is important.
Try the Python package Beautiful Soup. It will parse all of your HTML for you in a really simple statement. Try this code:
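from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://example.com/"  # placeholder: the page you want to scan
websitehtml = urlopen(url).read()
soup = BeautifulSoup(websitehtml, "html.parser")
for row in soup.find_all("img"):
    # .get() returns None instead of raising KeyError when an attribute is missing
    print(row.get("src"))
    print(row.get("alt"))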
From here use row['src'] to set the src of an image and print out alt next to it.
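The last step (showing each image next to its alt text) could be sketched as below. Note that this version swaps Beautiful Soup for the standard library's `html.parser` so it has no dependencies; `ImageCollector` and `build_report` are hypothetical names, and the report layout is only one possible choice.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects (src, alt) for every img; alt is None when the attribute is absent."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            self.images.append((attrs.get("src") or "", attrs.get("alt")))


def build_report(html, base_url):
    """Return an HTML table with one row per image: the image, then its alt text."""
    collector = ImageCollector()
    collector.feed(html)
    rows = []
    for src, alt in collector.images:
        if alt is None:
            alt_cell = "<strong>alt attribute missing</strong>"
        else:
            alt_cell = escape(alt) or "<em>empty alt</em>"
        # Resolve relative image URLs against the page they came from.
        full_src = escape(urljoin(base_url, src), quote=True)
        rows.append(
            f'<tr><td><img src="{full_src}" alt="" width="150"></td>'
            f"<td>{alt_cell}</td></tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Writing the returned string to a file and opening it in a browser gives a side-by-side view, which makes it easier to judge whether each alt text fits its image.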
Accessify.com has a plethora of accessibility testing tools as bookmarklets (or "favelets"). One of them does what I think you are looking for. Look on that page for "Alt attributes - show all". Drag that link to your bookmarks and then use it on the page you want to test.
Also, the Web Accessibility Toolbar (available for Internet Explorer and Opera) has a "List Images" option under "Images" that will do the same thing: list images and the code associated with each.
As for checking whole sites, there are free accessibility checkers available that should have a feature like this, like aDesigner.
http://sourceforge.net/projects/simplehtmldom/
I'd use something like that, very good and easy to use!
This answer on SO has some pointers on using Selenium to check your site for images with alt text present.
It sounds like you want something that works like what e.g. Jeremy provided, i.e. just some long list with each image and its alt attribute. The problem is that this will not provide you with enough context to provide a useful alt attribute, because the alt attribute should not (in general) "accurately describe [...] the image they represent", but rather describe what the image is intended to represent on the current page. It is difficult to provide a short description on how to write useful alt texts. The Wikipedia article on alt attributes itself kind of sucks in its current state, but the references are useful. There are, of course, many other SO questions related to this.
There might be some prewritten tool that does what you requested, if e.g. all pages are reachable from the start page it would be possible to just crawl the entire web site and generate the list. But if it's only possible to reach some pages by e.g. searching, some site-specific tool is probably needed.
Either way, let's assume that we have such a tool available. Even then, its use is fairly limited. Even if you can get a list of all images on the web site, with its associated alt text, you still have to visit all pages, one page at a time, and probably use some web developer extension in some browser (there are such tools provided in other answers, I think) that displays all alt texts on the page; and, then, fix the alt text, after you found out what the image is actually used for on the relevant page.
So, this tool you are requesting would only be useful for finding the pages with possible incorrect usage of the alt attribute (i.e., any page with an image on it). (But depending on the site under consideration, even this might be of some help of course.) You still need to open the web page the image is actually used on (or, if you prefer, read the HTML code for the page) to find out what the correct/better alt text would be.
So, at most you get a list of pages with images on that you have to inspect. But this will still miss some important cases, e.g. cases where the CSS background-image property is used to display a button (instead of an img image), that should have an alt text.
You can use a powerful Java API: jsoup
Documentation for building selectors: selectors syntax
Training : online lab
For your case:
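import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch the page and print its title (connect(...).get() throws IOException).
Document doc = Jsoup.connect("https://stackoverflow.com/").get();
System.out.println(doc.title());
// Select only <img> elements that carry an alt attribute.
Elements imgWithAltAttr = doc.select("img[alt]");
for (Element img : imgWithAltAttr) {
    // Print the alt text, then the absolute image URL on an indented line.
    System.out.printf("%s%n\t%s%n", img.attr("alt"), img.absUrl("src"));
}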
We use Jsoup in our accessibility project: https://github.com/Tanaguru/Tanaguru

How does Google use HTML tags to enhance the search engine?

I know that Google’s search algorithm is mainly based on PageRank. However, it also analyzes the structure of the document (H1, H2, title and other HTML tags) to enhance the search results.
What is the name of this technique "using the document structure to enhance the search results"?
And are there any academic papers to help me study this area?
The fact that Google is taking the HTML structure into account is well covered in SEO articles however I could not find it in the academic papers.
I think it's called "Semantic Markup"
[...] semantic markup is markup that is descriptive enough to allow us and the machines we program to recognize it and make decisions about it. In other words, markup means something when we can identify it and do useful things with it. In this way, semantic markup becomes more than merely descriptive. It becomes a brilliant mechanism that allows both humans and machines to “understand” the same information. http://www.digital-web.com/articles/writing_semantic_markup/
A more practical article here
http://robertnyman.com/2007/10/29/explaining-semantic-mark-up/
SEO has become almost a religion to some people where they obsess about minutiae. Frankly, I'm not convinced that all this effort is justified.
My advice? Ignore what so-called pundits say and just follow Google's guidelines.
You might be looking for an academic answer but honestly, this isn't an academic question beyond the very basics of how Web indexing works. The reality of a modern page indexing and ranking algorithm is far more complex.
You may want to look at one of the earlier works on search engines. Note the authors' names. You may also want to read Google Patent application 20050071741.
These general principles aside, Google's search algorithm is constantly tweaked based on actual and desired results. The exact workings are a closely guarded secret just to make it harder for people to game the system. Much of the "advice" or descriptions on how Google's search algorithm works is pure supposition.
So, apart from having a title and having well-formed and valid HTML, I don't think you're going to find what you're looking for.
Google very deliberately doesn't give away too much information about its search algorithm, so it's unlikely you will find a definitive answer or academic paper that confirms this. If you're interested from an SEO point of view, just write your pages so they are good for humans and the robots will like them too.
To make a page good for humans, you SHOULD use tags such as h1, h2 and so on to create a hierarchical page outline... a bit like this...
h1 "Contact Us"
...h2 "Contact Details"
......h3 "Telephone Numbers"
......h3 "Email Addresses"
...h2 "How To Find Us"
......h3 "By Car"
......h3 "By Train"
The difficulty with your question is that if you put something in your h1 tag hoping that it would increase your position in Google, but it didn't match up with other content on your page, you could look like you are spamming. Similarly, if your page is made up of too many headings and not enough actual content, you could look like you are spamming. It's not as simple as adding an h1 and an h2 tag and watching your ranking go up! That's why you need to write websites for humans, not robots.
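If you want to check that a page's headings actually form a hierarchy like the "Contact Us" outline above, a small sketch such as the following can list the headings and flag jumps in level (for example an h1 followed directly by an h3). `HeadingOutline` and `skipped_levels` are hypothetical names, and this says nothing about how any search engine treats the result.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingOutline(HTMLParser):
    """Collects (level, text) for each heading, in document order."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None   # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._level, self._text = int(tag[1]), []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def skipped_levels(html):
    """Return headings that are more than one level deeper than the previous heading."""
    parser = HeadingOutline()
    parser.feed(html)
    problems, previous = [], 0
    for level, text in parser.outline:
        if level > previous + 1:
            problems.append((level, text))
        previous = level
    return problems
```

An empty result means the outline never skips a level on the way down; moving back up (h3 to h2) is fine and is not reported.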
I have found this paper:
A New Study on Using HTML Structures to Improve Retrieval
However, it is an old paper (1999);
I'm still looking for more recent papers.
Check out
http://jcmc.indiana.edu/vol12/issue3/pan.html
http://www.springerlink.com/content/l22811484243r261/
Some time spent on scholar.google.com might help you find what you are looking for
You can also try searching the 'Computer Science' section of arXiv: http://arxiv.org for "search engine" and the various terms that others have suggested.
It contains many academic papers, all freely available... hopefully some of them will be relevant to your research. (Of course the caveat of validating any paper's content applies.)
Like cletus said follow the google guidelines.
I did a few tests and came to the conclusion that the title, image alt attributes and heading tags are the most important. Also worth mentioning is Google AdSense: I had the feeling that if you implement it, the rank of your site increases.
I believe what you are interested in is called structural fingerprinting, and it is often used to determine the similarity of two structures. In Google's case, that would mean applying a weight to different tags and feeding the result to a secret algorithm that (probably) uses the frequencies of the different elements in the fingerprint. This is deeply rooted in information theory. If you are looking for academic papers on information theory, I would start with "A Mathematical Theory of Communication" by Claude Shannon.
I would also suggest looking at Microformats and RDF. Both are used to enhance searching. These are mostly search engine agnostic, but there are some specific things as well. For Google-specific guidelines for HTML content, read this link.
In short; very carefully. In long:
Quote from The Anatomy of a Large-Scale Hypertextual Web Search Engine:
[...] This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font. [...]
It goes on:
[...] Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic and companies which deliberately manipulating search engines for profit become a serious problem. This problem that has not been addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. [...]
The Challenges in a web search engine addresses these issues in a more modern fashion:
[...] Web pages in HTML fall into the middle of this continuum of structure in documents, being neither close to free text nor to well-structured data. Instead HTML markup provides limited structural information, typically used to control layout but providing clues about semantic information. Layout information in HTML may seem of limited utility, especially compared to information contained in languages like XML that can be used to tag content, but in fact it is a particularly valuable source of meta-data in unreliable corpora such as the web. The value in layout information stems from the fact that it is visible to the user [...]
And adds:
[...] HTML tags can be analyzed for what semantic information can be inferred. In addition to the header tags mentioned above, there are tags that control the font face (bold, italic), size, and color. These can be analyzed to determine which words in the document the author thinks are particularly important. One advantage of HTML, or any markup language that maps very closely to how the content is displayed, is that there is less opportunity for abuse: it is difficult to use HTML markup in a way that encourages search engines to think the marked text is important, while to users it appears unimportant. For instance, the fixed meaning of the <H1> tag means that any text in an H1 context will appear prominently on the rendered web page, so it is safe for search engines to weigh this text highly. However, the reliability of HTML markup is decreased by Cascading Style Sheets which separate the names of tags from their representation. There has been research in extracting information from what structure HTML does possess. For instance, [Chakrabarti et al., 2001; Chakrabarti, 2001] created a DOM tree of an HTML page and used this information to increase the accuracy of topic distillation, a link-based analysis technique.
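To make the idea in the quote concrete, here is a toy sketch that scores each word by the most prominent tag it appears in. The weights are invented purely for illustration and the names (`TAG_WEIGHTS`, `score_terms`) are hypothetical; this is not how Google or any other engine is known to work.

```python
from collections import Counter
from html.parser import HTMLParser

# Invented weights, for illustration only; real engines do not publish theirs.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0, "b": 1.5, "strong": 1.5}


class WeightedTerms(HTMLParser):
    """Adds each word's weight (from its enclosing tags) to a running score."""

    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self._open = []  # currently open tags that carry a weight

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Remove the most recently opened matching tag.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        # Text outside any weighted tag counts once.
        weight = max((TAG_WEIGHTS[t] for t in self._open), default=1.0)
        for raw in data.lower().split():
            word = raw.strip(".,;:!?\"'()")
            if word:
                self.scores[word] += weight


def score_terms(html):
    """Return a Counter mapping each word to its tag-weighted score."""
    parser = WeightedTerms()
    parser.feed(html)
    return parser.scores
```

With these made-up weights, a word that appears once in an h1 and once in body text scores 5.0, while the same word appearing twice in body text scores 2.0. As the quote notes, CSS can change how any tag is displayed, which is one reason such a scheme cannot be trusted on its own.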
There are a number of issues a modern search engine needs to combat, for example web spam and blackhat SEO schemes.
Combating webspam with trustrank
Webspam taxonomy
Detecting spam web pages through content analysis
But even in a perfect world, e.g. after eliminating the bad apples from the index, the web is still an utter mess because no-one has identical structures. There are maps, games, video, photos (flickr) and lots and lots of user generated content. In other words, the web is still very unpredictable.
Resources
Hypertext and the web:
Extracting knowledge from the World Wide Web
Rich media and web 2.0
Thresher: automating the unwrapping of semantic content from the World Wide Web
Information retrieval
Webspam papers
Combating webspam with trustrank
Webspam taxonomy
Detecting spam web pages through content analysis
To keep it painfully simple. Make your information architecture logical. If the most important elements for user comprehension are highlighted with headings and grouped logically, then the document is easier to interpret using information processing algorithms. Magically, it will also be easier for users to interpret. Remember the search engine algorithms were written by people trying to interpret language.
The Basic Process is:
Write well structured HTML - using header tags to indicate the most critical elements on the page. Use logical tags based on the structure of your information. Lists for lists, headers for major topics.
Supply relevant alt attributes and names for any visual elements, and then use simple CSS to arrange these elements.
If the site works well for users and contains relevant information, you don't risk becoming a black listed spammer, and search engine algorithms will favor your page.
I really enjoyed the book Transcending CSS
for a clean explanation of properly structured HTML.
I suggest trying Google scholar as one of your avenues when looking for academic articles
semantic search
I found it interesting that, with neither meta keywords nor a description provided, in a scenario like this:
<p>Some introduction</p>
<h1>headline 1</h1>
<p>text for section one</p>
Always the "text for section one" is shown on the search result page.
There is also a new canonical link element (rel="canonical") from Google that can now be used.