Adding ids to HTML tags for QA automation - html

I have a query In our application we have lots of HTML tags. During development many tags were not given any id because of no requirement.Now the QA team wants to automate the test cases using QTP. In most of the cases this tool doesn't recognizes because it does not find ids for most of the HTML tags.Now we are asked to add ids to all the HTML tags.
I want to know if there will be any effect adding id attribute to these tags. Even positive impact are welcome

I do not think there will be any either positive or negative effect : maybe the size of the HTML page will increase a bit, but probably not that much.
Still, are you sure you need to put "id" attributes on every HTML tag of your pages ? Wouldn't only a few of those be enough ? Like on form fields, on links, on error-messages ; and that's probably about it ?
One thing you must take care, though, is that "id", as in "identifers", must be unique ; which implies it might be good, before starting adding them, to define some kind of "id-policy", to say, for instance, that "ids for elements of that kind should be named that way".
And, for your next projects : have developpers add those when theyr're developping ;-)
(And following the policy, of course)
Now that I'm thinking about it : a positive effect might be that it'll be easier to write Javascript code interacting with your HTML document -- but that'll be true for next projects or evolutions for this one, when those id are already present in the HTML at the time developpers put the JS code in place...

Since there are no QTP related answers yet.
GUI recognition in QTP is object-oriented. In order to identify an object QTP needs a unique combination of object's properties, and checking them better to be as fast as possible - that is why HTML ID would be ideal.
Now, where it is especially critical - for objects that do not have other unique identifiers. The most typical example - html tables. Their contents is dynamic, their number on the page may vary. By adding HTML ID you allow recognition mechanism get straight to the right table.
Objects with other unique properties can be recognized well without HTML ID. For example, if you have a single "submit" link on the page QTP will successfully recognize it by inner text.
So the context-specific answer: don't start adding ids to every single tag. Ask automation guys to prepare a list of objects they have problem with. And add ids to those objects.
PS. It also depends on automation programming skills. There are descriptive programming and dynamic recognition methods. They allow retrieving the right objects even without ids provided.

As Albert said, QTP doesn't rely solely on elements' id, in fact due to the fact that many web applications generate different ids for each session, (as far as I remember) the id property isn't part of the default description for most web test objects.
QTP is pretty good at recognizing most simple web controls and if you're facing problems it may be the case that a Web Extensibility project will help you bridge the gap between the semantics of your web application and the raw HTML it is created in. If a complex control is recognized by QTP as a WebElement (which is actually the div that contains the span that drives the code) you will understandably have object recognition problems since there are many divs on the page but probably many less complex controls.

If you are talking about side-effects - NO. Adding ids won't cause any problems (apart from taking up some extra bytes of course)
If you really have the need to add ids, go ahead and add them.

http://www.w3.org/TR/html4/struct/links.html#anchors-with-id says: The id and name attributes share the same name space. This means that they cannot both define an anchor with the same name in the same document. It is permissible to use both attributes to specify an element's unique identifier for the following elements: A, APPLET, FORM, FRAME, IFRAME, IMG, and MAP. When both attributes are used on a single element, their values must be identical.

Related

Storing site data in columns or rows

This is a question of how to perform the best practice of storing data from a webpage. Like texts/image-urls/links etc.
I have an CMS were you can create web pages. Here you can edit texts/upload images. In the future it would also be nice to "add new elements", add links to a-tags etc.
I need to have a robust and flexible solution that also have good performance. In both getting/recieving this data.
Lets consider I have 1000 pages with each around 25 elements on each page that can be updated and stored in the database.
Alternative 1)
Create a table and 1 column for each element on these pages for example columns like:
title_1, title_2,image_1,image_2.
Here we have a set of columns that we can update, these we can use on the web page.
Alternative 2)
Create 1 table with the columns (id, namespace, page_id, data)
And for each element on the page I add the namespace in association with the page_id to make the data output unique. In the data I can add any kind of information; text, links etc.
What do you suggest as a good solution for this issue? I'm ofcourse also open for other alternatives.
Thanks!
I would recommend option two, with the addition of a column identifying the element id/or type, if indeed the element id is somehow comparable. That is to say, if anchor text (say) is always stored as element id = 4, then you might want an element id = 4 so that you could compare anchor texts across multiple documents.
If, on the other hand (and this is the scenario I imagine is more likely), you may have 1-25 elements on a page and each of them could be different (eg document one has three anchor texts and four images, document two has one anchor text and no images, etc) it would make sense to add an element_type_id table that stores a bit of information about the element types. This is assuming that you ever have any interest in comparing (say) images across multiple documents, or anchor texts across multiple documents, etc.
Another thing to consider: if you are likely to see the same element over and over again, it actually makes more sense to effectively parameterize those elements by way of a lookup table. So basically store each (say) unique anchor text in one table and reference its id in your actual data table.
If I may add one additional thing: SO may not be the best place for the particular question you are asking. I'm not totally sure of that and maybe I'm wrong... but I would poke around the Stack Exchange network and see if other forums more closely deal with the type of question you asking. In the very least, I'd observe that your question is fairly vague and the goal of achieving a "robust and flexible solution that also {has} good performance. In both getting/recieving this data." is not likely to be accomplished simply by asking for advice on SO. There is a LOT that goes into data architecture, and certainly many of the details I would consider important in designing this myself are not present in your questions. And if you're not sure what those details are, I am not sure if SO is really the best place to set about learning them. I think https://softwareengineering.stackexchange.com/ may be a better fit for this question.
Just my opinion, and I could be wrong. Either way, I would consider learning a bit about database normal forms (http://www.bkent.net/Doc/simple5.htm or Google it) as well as do a little research on the types of design considerations that go into building a database (an old but still good SO article on that is here: What are the most important considerations when designing a database?)

Class vs. ID - Readability [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
What is the preferred method when dealing with choosing a class vs. an ID?
For instance, you can have a bunch of elements that might be styled identically and could all use the same class. However, for readability purposes, it's sometimes nice to have a unique ID for each element instead.
Obviously you don't want to go ridiculously overboard where every element has an ID. However, where do you guys draw the line and does using all IDs where you could be using classes slow things down noticeably? If so... when?
How to stop obliterating semantic HTML.
Most people learn HTML from looking at source code and of HTML and tinkering with it, learning how <tag>foo</tag> looks and running along with it. They don't really gain a deep understand of it, but they go on to do things that require a deep understanding, the side effect is the problem you and thousands of others have every day -- they're doing things and they don't know fully how these tools work, because it looks so simple on the surface and the powerful uses are are "hidden" in the funny manual that nobody feels the need to read. Everything is plainly explained and been written down for a long time.
What IDs are for (directly from the HTML4 spec, with my notes)
The id attribute assigns a unique identifier to an element (it only happens ONCE, never TWICE or more, I'm tired of seeing people come on this site and dropping in their code with the same ID in twenty elements)
The id attribute has several roles in HTML:
As a style sheet selector. (This means, you can use it to describe CSS styles)
As a target anchor for hypertext links.(When you can jump to a section of a page)
As a means to reference a particular element from a script.(document.getElementById("whatever"))
As the name of a declared OBJECT element.
For general purpose processing by user agents (e.g. for identifying fields when extracting data from HTML pages into a database, translating HTML documents into other formats, etc.).
What Classes are for (directly from the HTML4 spec, with my notes)
The class attribute [...] assigns one or more class names to an element (this one gets to be re-used to your heart's content) ; the element may be said to belong to these classes. A class name may be shared by several element instances. The class attribute has several roles in HTML:
As a style sheet selector (when an author wishes to assign style information to a set of elements).
For general purpose processing by user agents. (Basically, it's just another part of an element)
What? I don't get it.
IDs: It's the fingerprint of something, there's only one, you only use each fingerprint once in the entire document. You only use it when you need to give something an ID. You probably don't want to have hundreds of these, or even tens of these. You rarely if ever need to start making these. The specific uses are for target anchors, improving selector speed in rare edge-cases. Generally you never describe your CSS based on IDs, you might have some edge-cases such as #HEADER .body h1, which may be different from your #BODY, I'd still advise against making them IDs for no real reason.
Classes: Nothing to do with unique fingerprints or linking to sections of a page, classes don't uniquely identify something. Classes describe a group of things that belong together or should behave the same way. If you're part of the class called coffee you should exhibit classes as one might expect from coffee, if you're a class of cellphone, then look like a cellphone (don't provide coffee).
But how the heck am I supposed to access the 4th cell in the 6th column of some table, or group of divs or that 20th list item?
This is where people who don't know what HTML is throw their hands up in the air and decide to assign IDs to all the elements. This is a total side-effect of nobody properly explaining to you how HTML works. That's a nice way of saying you didn't RFTM or ask questions early on (user1066982 in this case, did, which is amazing and makes me happy, I'm writing this to point other people to in the future who fail at HTML).
You need to start learning right now. Stop pretending you understand this stuff.
HTML is not a string of text such as <foo><bar>baz</bar>blah<ding/></foo>, sure that's how you write HTML but if that's what you believe it is you do not understand HTML in the browser.
HTML is a document that is structured like XML. HTML documents have a model, that means they aren't flat text. The text-representation of that document is a way your browser can take flat text and turn it into a tree structure. Trees are like arrays, except they aren't just flat elements in an array one-after-another, but rather they nest so one element may point to several other elements.
This below isn't a diagram (stolen from the w3c's spec on the Document Object Model) of how to write HTML text, this is a diagram of how your browser stores it in memory:
Since it's in memory like that, it doesn't mean "Oh crap! I have no way to access the first TD in the second TR of the table body in the table!", it means you simply and plainly explain to your code that there is a child element inside of the table.
JavaScript provides a full DOM API that allows you to access every single node in that DOM tree.
PHP provides a full DOM API that allows you to access every single node in that DOM tree.
C++ has a full DOM API that allows you to access every single node in that DOM tree.
ASP provides a full DOM API that allows you to access every single node in that DOM tree.
EVERYTHING that touches the DOM provides a full DOM API that allows you to access every single node in that DOM tree, with the exception of sub-standard software that throws regular expressions around in a futile attempt at parsing HTML.
Use the API for the DOM to access those nodes based on semantic HTML. Semantic HTML means you have a structure to your HTML that makes sense. Paragraphs go in <p> tags, headings go into heading tags, and so on.
You never, under any circumstances, what-so-ever need to reproduce the DOM API through hacking in values with ID tags because you didn't know you could just say getAllEmentsByTagName("td")[4] to get the fourth element.
If you can grab getAllEmentsByName("td")[4] you don't need to do <td id="id4"> and then later getElementById("id4") because you didn't want learn just one other API call. I dread the day I ever have to maintain a pile of code left behind by someone who felt the need to stick an ID into every element "just to be sure", especially when I need to go back and insert a new element between the fifth and sixth element in a table of thousands (can you imagine replacing EVERY id? Especially when this feature was accounted for over 10 years ago?! Insanity!)
Tl;dr
HTML isn't actually just a pile of text with one way to access it
rtfm, stop pretending you understand it because you can do a handful of things, you're holding yourself back.
Don't shove IDs everywhere, only use them where absolutely required.
Use classes to describe things, not identify things.
?????
Profit.
However, for readability purposes, it's sometimes nice to have a unique ID for each element instead.
This makes absolutely no sense to me. What makes an ID more readable than a class? There's no point assigning unique identifiers to each of a group of related elements if there's no benefit in having identities.
For what it's worth, realize that a single element can have both classes and an ID. If your elements need to be uniquely identified somehow, give them IDs. If multiple elements should be styled identically and are all similar in purpose anyway, use classes. If your elements fit both criteria, give them both attributes, and use each attribute accordingly.
IDs should not be used for styling. Use classes instead. IDs have a very high specificity, and are difficult to override (leading to more IDs, and longer selector chains). Also, IDs are used for JavaScript DOM selection, so if you're using the same IDs in your CSS that you're using in your JavaScript, you've tied the styles to the scripts, and that's bad separation of concerns.
IDs are for JavaScript. Classes are for CSS.
Note: JavaScript and specificity are not the only reasons. Others include fragment identifiers and code reuse. As I say in the comments, there are several smart people who advise against IDs (start there and follow the links)
I use IDs for elements that have clear responsibility, Classes for element that have same presentations, for example:
HTML:
<div id='sport-news'>
<article class='news'>...</article>
<article class='news'>...</article>
</div>
CSS:
.news { /* global styles */ }
[id=sport-news] .news { /* specific styles */ }
JavaScript:
var sportNews = document.getElementById('sport-news') // faster
, news = sportNews.childNodes;
For me, [id] .class is more readable than .parent-class .child-class.
When designing, I will use both id's and classes. For specific items I will use id only. But if you need to apply same styles for different items, use classes. You cannot use same id for different items because id is specific to one item only.

Why is it a bad thing to have multiple HTML elements with the same id attribute?

Why is it bad practice to have more than one HTML element with the same id attribute on the same page? I am looking for a way to explain this to someone who is not very familiar with HTML.
I know that the HTML spec requires ids to be unique but that doesn't sound like a convincing reason. Why should I care what someone wrote in some document?
The main reason I can think of is that multiple elements with the same id can cause strange and undefined behavior with Javascript functions such as document.getElementById. I also know that it can cause unexpected behavior with fragment identifiers in URLs. Can anyone think of any other reasons that would make sense to HTML newbies?
Based on your question you already know what w3c has to say about this:
The id attribute specifies a unique id for an HTML element (the id
attribute value must be unique within the HTML document).
The id attribute can be used to point to a style in a style sheet.
The id attribute can also be used by a JavaScript (via the HTML DOM)
to make changes to the HTML element with the specific id.
The point with an id is that it must be unique. It is used to identify an element (or an anything: if two students had the same student id schools would come apart at the seems). It's not like a human name, which needn't be unique. If two elements in an array had the same index, or if two different real numbers were equal... the universe would just fall apart. It's part of the definition of identity.
You should probably use class for what you are trying to do, I think (ps: what are you trying to do?).
Hope this helps!
Why should I care what someone wrote in some document?
You should care because if you are writing HTML, it will be rendered in a browser which was written by someone who did care. W3C created the spec and Google, Mozilla, Microsoft etc... are following it so it is in your interest to follow it as well.
Besides the obvious reason (they are supposed to be unique), you should care because having multiple elements with the same id can break your application.
Let's say you have this markup:
<p id="my_id">One</p>
<p id="my_id">Two</p>
CSS is forgiving, this will color both elements red:
#my_id { color:red; }
..but with JavaScript, this will only style the first one:
document.getElementById('my_id').style.color = 'red';
This is just a simple example. When you're doing anything with JavaScript that relies on ids being unique, your whole application can fall apart. There are questions posted here every day where this is actually happening - something crucial is broken because the developer used duplicate id attributes.
Because if you have multiple HTML elements with the same ID, it is no longer an IDentifier, is it?
Why can't two people have the same social security number?
You basicaly responded to the question. I think that as long as an elemenet can no longer be uniquely identified by the id, than any function that resides on this functionality will break. You can still choose to search elements in an xpath style using the id like you would use a class, but it's cumbersome, error prone and will give you headaches later.
The main reason I can think of is that multiple elements with the same id can cause strange and undefined behavior with Javascript functions such as document.getElementById.
... and XPath expressions, crawlers, scrapers, etc. that rely on ids, but yes, that's exactly it. If they're not convinced, then too bad for them; it will bite them in the end, whether they know it or not (when their website gets visited poorly).
Why should a social security number be unique, or a license plate number? For the same reason any other identifier should be unique. So that it identifies exactly one thing, and you can find that one thing if you have the id.
The main reason I can think of is that multiple elements with the same
id can cause strange and undefined behavior with Javascript functions
such as document.getElementById.
This is exactly the problem. "Undefined behavior" means that one user's browser will behave one way (perhaps get only the first element), another will behave another way (perhaps get only the last element), and another will behave yet another way (perhaps get an array of all elements). The whole idea of programming is to give the computer (that is, the user's browser) exact instructions concerning what you want it to do. When you use ambiguous instructions like non-unique ID attributes, then you get unpredictable results, which is not what a programmer wants.
Why should I care what someone wrote in some document?
W3C specs are not merely "some document"; they are the rules that, if you follow in your coding, you can reasonably expect any browser to obey. Of course, W3C standards are rarely followed exactly by all browsers, but they are the best set of commonly accepted ground rules that exist.
The short answer is that in HTML/JavaScript DOM API you have the getElementById function which returns one element, not a collection. So if you have more than one element with the same id, it would not know which one to pick.
But the question isn't that dumb actually, because there are reasons to want one id that might refer to more than one element in the HTML. For example, a user might make a selection of text and wants to annotate it. You want to show this with a
<span class="Annotation" id="A01">Bla bla bla</span>
If the user selected text that spans multiple paragraphs, then the needs to be broken up into fragments, but all fragments of that selection should be addressable by the same "id".
Note that in the past you could put
<a name="..."/>
elements in your HTML and you could find them with getElementsByName. So this is similar. But unfortunately the HTML specifications have started to deprecate this, which is a bad idea because it leaves an important use case without a simple solution.
Of course with XPath you can do anything use any attribute or even text node as an id. Apparently the XPointer spec allows you to make reference to elements by any XPath expression and use that in URL fragment references as in
http://my.host.com/document.html#xpointer(id('A01'))
or its short version
http://my.host.com/document.html#A01
or, other equivalent XPath expressions:
http://my.host.com/document.html#xpointer(/*/descendant-or-self::*[#id = 'A01'])
and so, one could refer to name attributes
http://my.host.com/document.html#xpointer(/*/descendant-or-self::*[#name = 'A01'])
or whatever you name your attributes
http://my.host.com/document.html#xpointer(/*/descendant-or-self::*[#annotation-id = 'A01'])
Hope this helps.

Does using custom data attributes produce browser compatibility issues?

I have to choose between custom data tags or ids. I would like to choose custom data tags, but I want to be sure that they do not cause browser compatibility issues for the most widely used browsers today.
I'm using jQuery 1.6 and my particular scenario involves a situation where I need to reference a commentId for several actions.
<div data-comment-id="comment-1" id="comment-1">
<a class="foo"></a>
</div>
It's easier to extract data tags in jQueryin: $('foo').data('commentId');
Extract a substring from the id seems a bit complicated and could break for one reason or another: <a id="comment-1"
Are there any sweeping merits or fatal flaws for either approach?
I would advise in favor of data attributes for the following reasons:
ids need to be unique document-wide. Thus they are limited in the semantics they can carry
you can have multiple data-attributes per element
and probably less relevant in your case:
changing ids might break idrefs
However, I'm not sure whether I understand your specs completely as extracting the element id in jQuery is as trivial as getting the data attribute: $('.foo').attr('id');.
You might be interested in Caniuse.com, a browser compatibility site for web technologies.
If XHTML is an issue to you, you might also be interested in how to use custom data attributes in XHTML: see here for a discussion on SO and here for an XHTML-compatible approach using namespaces.
this guy says data attibutes work on IE6.

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an html document? As an example, think of your standard news/blog/magazine-style website, containing navigation (with submenu's possibly), ads, comments, and the prize - our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: Ideally, the method would work with well-formed markup, and terrible markup. Whether somebody uses paragraph tags to make paragraphs, or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (with submenu's possibly), ads, comments, and the prize - our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you come up with redundant contents (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach of doing it seems pretty promising because it would be fairly simple to do, but still have good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similiar HTML nodes in between all pages on the same website.
This could probably be further improved by simpling using a scoring system to keep track of DOM nodes that were previously identified to contain unique contents, so that these nodes are prioritized for other pages.
Sometimes there's a CSS Media section defined as 'Print.' It's intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say consider each structural element like a div is a document) and gather some properties of it and convert it to a vector. (As other people suggested this could be number of words, number of links, number of images more the better.)
First start with a large set of documents (100-1000) that you already choose which part is the main part. Then use this set to train your SVM.
And for each new document you just need to convert it to vector and pass it to SVM.
This vector model actually quite useful in text classification, and you do not need to use an SVM necessarily. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the dom tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try to use my second option above, but give it's result a lower "rating" if the browser would enter quirks mode normally. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
I would probably start with Title and anything else in a Head tag, then filter down through heading tags in order (ie h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title would have an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers etc. usually contains seperate words, but not sentences ending containing commas and ending in period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation, and take everything in between. Headers are a special case since they usually dont have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most of the news/blogs websites are using a blogging platform.
So i would create a set of rules by which i would search for content.
By example two of the most popular blogging platforms are wordpress and Google Blogspot.
Wordpress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by css classes fails you could turn to the other solutions, identifying the biggest chunk of text and so on.
As Readability is not available anymore:
If you're only interested in the outcome, you use Readability's successor Mercury, a web service.
If you're interested in some code how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which does also pretty good job.
Or if Kotlin is more your language, then you can take a look at Readability4J, a port of above's Readability.js.