Dearest StackOverflow homies,
I'm playing with HTML that was output by EverNote and need to parse the following:
Note Title
Note anchor (hyperlink identities of the notes themselves)
Note Creation Date
Note Content, and
Intra-notebook hyperlinks (the links within the content of a note to another note's anchor)
Following examples by Duncan Temple Lang, author of the [r] XML package, and an SO answer by @jdharrison, I have been able to parse the Note Title, Note anchor, and Note Creation Dates with relative ease. For those who may be interested, the commands to do so are
require("XML")
rawHTML <- paste(readLines("EverNotebook.html"), collapse="\n") #Yes... this is noob code
doc = htmlTreeParse(rawHTML, useInternalNodes=TRUE)
#Get Note Titles
html.titles<-xpathApply(doc, "//h1", xmlValue)
#Get Note Title Anchors
html.tAnchors<-xpathApply(doc, "//a[@name]", xmlGetAttr, "name")
#Get Note Creation Date
html.Dates<-xpathApply(doc, "//table[@bgcolor]/tr/td/i", xmlValue)
Here's a fiddle of an example HTML EverNote export.
I'm stuck on parsing 1. Note Contents and 2. Intra-notebook hyperlinks.
Taking a closer look at the code, it is apparent that the solution for the first part is to return every upper-most* div that does NOT include a table with attribute bgcolor="#D4DDE5". How is this accomplished?
Duncan says that it is possible to use XPath to parse XML according to NOT conditions:
"It allows us to express things such as "find me all nodes named a" or "find me all nodes named a that have no attribute named b" or "nodes a that have an attribute b equal to 'bob'" or "find me all nodes a which have c as an ancestor node""
However he does not go on to describe how the XML package can parse exclusions... so I'm stuck there.
Addressing the second part, consider the format of anchors to other notes in the same notebook:
<a href="#13178">
The goal with these is to procure their number and yet this is difficult because they are solely distinguished from www links by the # prefix. Information on how to parse for these particular anchors via partial matching of their value (in this case #) is sparse - maybe even requiring grep(). How can one use the XML package to parse for these special hrefs? I describe both problems here since it's possible a solution to the first part may aid the second... but perhaps I'm wrong. Any advice?
UPDATE 1
By upper-most div I mean outer-most div. The contents of every note in an EverNote HTML export are within the DOM's outer-most divs. Thus the interest is to return every outer-most div that does NOT include a table with attribute bgcolor="#D4DDE5".
"....to return every upper-most div that does NOT include a table with attribute bgcolor="#D4DDE5." How is this accomplished?"
One possible way, ignoring 'upper-most' since I don't know exactly how you would define it:
//div[not(table[@bgcolor='#D4DDE5'])]
The above XPath reads: select all <div> elements that do not have a child <table> element whose bgcolor attribute equals #D4DDE5.
I'm not sure about what you mean by "parse" in the 2nd part of the question. If you simply want to get all of those links having special href, you can partially match the href attribute using starts-with() or contains() :
//a[starts-with(@href, '#')]
//a[contains(@href, '#')]
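For the "procure their number" part of the question, here is a rough sketch of the same prefix test applied to href values once they have been extracted. It is written in JavaScript purely for illustration; the `hrefs` array and the `noteAnchorIds` helper are made-up names, not part of the XML package. It also shows why starts-with() is the safer of the two tests: contains() would also match ordinary web links that merely carry a fragment.

```javascript
// Hypothetical href values as they might appear in an exported notebook.
const hrefs = [
  "#13178",
  "http://www.example.com/",
  "#204",
  "https://example.org/page#top", // contains '#', but does not start with it
];

// Same test as starts-with(@href, '#'): keep hrefs that begin with '#',
// then drop the prefix to get the note number as a string.
function noteAnchorIds(values) {
  return values
    .filter((href) => href.startsWith("#"))
    .map((href) => href.slice(1));
}

const ids = noteAnchorIds(hrefs);
console.log(ids);
```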
UPDATE :
Taking "outer-most" div into consideration :
//div[not(table[@bgcolor='#D4DDE5']) and not(ancestor::div)]
Side note: I don't know exactly how XPath not() is defined, but if it works like negation in general (and the OP confirmed in a comment below that this worked), you can apply one of De Morgan's laws:
"not (A or B)" is the same as "(not A) and (not B)".
so that the updated XPath can be slightly simplified to :
//div[not(table[@bgcolor='#D4DDE5'] or ancestor::div)]
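If you want to convince yourself of the De Morgan step, a quick truth-table check does it. This is only a sketch in JavaScript using plain booleans (the `deMorganHolds` name is made up); it says nothing about XPath itself beyond assuming that not(), and, and or behave as ordinary boolean operators.

```javascript
// Check "not (A or B)" against "(not A) and (not B)" for every
// combination of truth values.
function deMorganHolds() {
  for (const a of [true, false]) {
    for (const b of [true, false]) {
      if (!(a || b) !== (!a && !b)) {
        return false;
      }
    }
  }
  return true;
}

const holds = deMorganHolds();
console.log(holds);
```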
I am validating my store at http://validator.w3.org/ and I have some errors.
It looks like the products in the feature slider and in the new products listing on the homepage have identical IDs.
Here is an example of the error:
Line 601, Column 93: ID "product-price-16" already defined
<span class="regular-price" id="product-price-16">
An "id" is a unique identifier. Each time this attribute is used in a document it must have a different value. If you are using this attribute as a hook for style sheets it may be more appropriate to use classes (which group elements) than id (which are used to identify exactly one element).
Line 288, Column 93: ID "product-price-16" first defined here
<span class="regular-price" id="product-price-16">
Is it possible to define the IDs of products in the feature slider with a prefix?
Thanks!
Product price display is updated by JavaScript using these IDs, and causing a JavaScript error can result in broken JavaScript form handling, namely, add to cart. The template which is rendering these is used in a number of places. You will want to set a different template for the price in the feature slider rather than edit the catalog/product/price.phtml template.
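To see which IDs are affected before touching any templates, a rough check over the rendered page source can help. This is a sketch only: `findDuplicateIds` and `sampleHtml` are made-up names, and the regular expression assumes double-quoted id attributes as in the validator output above. It is not a real HTML parser.

```javascript
// Count every id="..." value in a chunk of markup and return the
// values that occur more than once.
function findDuplicateIds(html) {
  const counts = new Map();
  for (const match of html.matchAll(/\sid="([^"]*)"/g)) {
    counts.set(match[1], (counts.get(match[1]) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([id]) => id);
}

// Hypothetical fragment resembling the markup from the error report.
const sampleHtml = [
  '<span class="regular-price" id="product-price-16">',
  '<span class="regular-price" id="product-price-17">',
  '<span class="regular-price" id="product-price-16">',
].join("\n");

const duplicates = findDuplicateIds(sampleHtml);
console.log(duplicates);
```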
What's the point of the name attribute on an HTML form? As far as I can tell, you can't read the form name on submission or do anything else with it. Does it serve a purpose?
In short, and probably oversimplifying a bit: It is used instead of id for browsers that don't understand document.getElementById.
These days it serves no real purpose. It is a legacy from the early days of the browser wars before the use of name to describe how to send control values when a form is submitted and id to identify an element within the page was settled.
From the specification:
The name attribute represents the form's name within the forms collection.
Once you assign a name to an element, you can refer to that element via document.name_of_element throughout your code. It doesn't help when you've got multiple elements with the same name, but it does allow shortcuts like:
<form name="myform" ...>
document.myform.submit();
instead of
document.getElementsByName('myform')[0].submit();
Here's what MDN has to say about it:
name
The name of the form. In HTML 4, its use is deprecated (id should be used instead). It must be unique among the forms in a document and not just an empty string in HTML 5.
(from <form>, Attributes, name)
I find it slightly confusing that it specifies the name must be a unique, non-empty string in HTML 5 when it was deprecated in HTML 4. (I'd guess that requirement only applies if the name attribute is specified at all.) But I think it's safe to say that any purpose it once served has been superseded by the id attribute.
You can use the name attribute as an "extra information" attribute - similarly as with a hidden input - but this keeps the extra information tied into the form, which makes it just a little simpler to read/access.
The name attribute is not completely redundant vis-à-vis id. As mentioned above, it is useful with <form>s, but less well known is that it can also be used with any HTMLCollection, such as the children property of any DOM element.
An HTMLCollection, in addition to being an array-like object, has named properties corresponding to any named members (or to the first occurrence when a name is not unique). This is useful for retrieving specific named nodes.
For example, in the following example HTML:
<div id='person1'>
<span name='firstname'>John</span>
<span name='lastname'>Doe</span>
<span name='middlename'></span>
</div>
<div id='person2'>
<span name='firstname'>Jane</span>
<span name='lastname'>Doe</span>
<span name='middlename'></span>
</div>
by naming each child, one can quickly and efficiently retrieve a named element, such as lastname, as such:
document.getElementById('person1').children.namedItem('lastname')
...and if there is no risk of 'length' being the name of a member element, (being that length is a reserved property of HTMLCollection), a more terse notation may be used instead:
document.getElementById('person1').children.lastname
DOM Living Standard 2019 March 29
An HTMLCollection object is a collection of elements...
The namedItem(key) method, when invoked, must run these steps:
If key is the empty string, return null.
Return the first element in the collection for which at least one of the following is true:
it has an ID which is key;
it is in the HTML namespace and has a name attribute whose value is key;
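The quoted steps can be sketched outside the browser with plain objects standing in for elements. Everything here is an assumption for illustration: the `namedItem` function, the `HTML_NS` constant, and the object shape (`id`, `name`, `namespace`) are made up and are not the DOM's own implementation.

```javascript
const HTML_NS = "http://www.w3.org/1999/xhtml";

// Plain-object stand-ins for the children of <div id='person1'>.
const children = [
  { tag: "span", id: "", name: "firstname", namespace: HTML_NS, text: "John" },
  { tag: "span", id: "", name: "lastname", namespace: HTML_NS, text: "Doe" },
  { tag: "span", id: "", name: "middlename", namespace: HTML_NS, text: "" },
];

// Follows the quoted steps: an empty key gives null; otherwise return the
// first element whose ID is key, or which is in the HTML namespace and
// has a name attribute equal to key. Null when nothing matches.
function namedItem(collection, key) {
  if (key === "") {
    return null;
  }
  const found = collection.find(
    (el) => el.id === key || (el.namespace === HTML_NS && el.name === key)
  );
  return found === undefined ? null : found;
}

const lastName = namedItem(children, "lastname");
console.log(lastName.text);
```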
When using the HTML <input> tag, what is the difference between the use of the name and id attributes especially that I found that they are sometimes named the same?
In HTML4.01:
Name Attribute
Valid only on <a>, <form>, <iframe>, <img>, <map>, <input>, <select>, <textarea>
Name does not have to be unique, and can be used to group elements together such as radio buttons & checkboxes
Can not be referenced in URL, although as JavaScript and PHP can see the URL there are workarounds
Is referenced in JavaScript with getElementsByName()
Shares the same namespace as the id attribute
Must begin with a letter
According to specifications is case sensitive, but most modern browsers don't seem to follow this
Used on form elements to submit information. Only input tags with a name attribute are submitted to the server
Id Attribute
Valid on any element except <base>, <html>, <head>, <meta>, <param>, <script>, <style>, <title>
Each Id should be unique in the page as rendered in the browser, which may or may not be all in the same file
Can be used as anchor reference in URL
Is referenced in CSS or URL with # sign
Is referenced in JavaScript with getElementById(), and in jQuery with $("#<id>")
Shares same name space as name attribute
Must contain at least one character
Must begin with a letter
Must not contain anything other than letters, numbers, underscores (_), dashes (-), colons (:), or periods (.)
Is case sensitive according to the specification
In (X)HTML5, everything is the same, except:
Name Attribute
Not valid on <form> any more
XHTML says it must be all lowercase, but most browsers don't follow that
Id Attribute
Valid on any element
XHTML says it must be all lowercase, but most browsers don't follow that
This question was written when HTML4.01 was the norm, and many browsers and features were different from today.
The name attribute is used for posting to e.g. a web server. The id is primarily used for CSS (and JavaScript). Suppose you have this setup:
<input id="message_id" name="message_name" type="text" />
In order to get the value with PHP when posting your form, it will use the name attribute, like this:
$_POST["message_name"];
The id is used for styling, as said before, for when you want to use specific CSS content.
#message_id
{
background-color: #cccccc;
}
Of course, you can use the same denomination for your id and name attribute. These two will not interfere with each other.
Also, name can be used for more items, like when you are using radio buttons. Name is then used to group your radio buttons, so you can only select one of those options.
<input id="button_1" type="radio" name="option" />
<input id="button_2" type="radio" name="option" />
And in this very specific case, I can further say how id is used, because you will probably want a label with your radio button. Label has a for attribute, which uses the id of your input to link this label to your input (when you click the label, the button is checked). An example can be found below
<input id="button_1" type="radio" name="option" /><label for="button_1">Text for button 1</label>
<input id="button_2" type="radio" name="option" /><label for="button_2">Text for button 2</label>
IDs must be unique
...within page DOM element tree so each control is individually accessible by its id on the client side (within browser page) by
JavaScript scripts loaded in the page
CSS styles defined on the page
Having non-unique IDs on your page will still render your page, but it certainly won't be valid. Browsers are quite forgiving when parsing invalid HTML, but don't do that just because it seems to work.
Names are quite often unique but can be shared
...within page DOM between several controls of the same type (think of radio buttons) so when data gets POSTed to server only a particular value gets sent. So when you have several radio buttons on your page, only the selected one's value gets posted back to server even though there are several related radio button controls with the same name.
Addendum to sending data to server: When data gets sent to server (usually by means of HTTP POST request) all data gets sent as name-value pairs where name is the name of the input HTML control and value is its value as entered/selected by the user. This is always true for non-Ajax requests. In Ajax requests name-value pairs can be independent of HTML input controls on the page, because developers can send whatever they want to the server. Quite often values are also read from input controls, but I'm just trying to say that this is not necessarily the case.
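As a rough illustration of those name-value pairs, the sketch below builds a URL-encoded body from plain objects standing in for form controls. The `controls` array and the `buildFormBody` helper are made-up names, and the rules are a simplification of what browsers actually do: only controls with a name are included, and radio buttons or checkboxes only when checked.

```javascript
// Hypothetical descriptors standing in for the inputs on a page.
const controls = [
  { type: "text", name: "message_name", value: "hello" },
  { type: "text", name: "", value: "ignored" }, // no name: never sent
  { type: "radio", name: "option", value: "a", checked: false },
  { type: "radio", name: "option", value: "b", checked: true },
];

// Collect name=value pairs the way a simple form post would.
function buildFormBody(items) {
  const params = new URLSearchParams();
  for (const control of items) {
    if (!control.name) continue; // unnamed controls are skipped
    const checkable = control.type === "radio" || control.type === "checkbox";
    if (checkable && !control.checked) continue; // unchecked ones are skipped
    params.append(control.name, control.value);
  }
  return params.toString();
}

const body = buildFormBody(controls);
console.log(body);
```

Note that the two radio buttons share one name, yet only the checked one contributes a pair.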
When names can be duplicated
It may sometimes be beneficial that names are shared between controls of any form input type. But when? You didn't state what your server platform may be, but if you used something like ASP.NET MVC you get the benefit of automatic data validation (client and server) and also binding sent data to strong types. That means that those names have to match type property names.
Now suppose you have this scenario:
you have a view with a list of items of the same type
user usually works with one item at a time, so they will only enter data with one item alone and send it to server
So your view's model (since it displays a list) is of type IEnumerable<SomeType>, but your server side only accepts one single item of type SomeType.
How about name sharing then?
Each item is wrapped within its own FORM element and the input elements within it have the same names, so when data gets to the server (from any element) it gets correctly bound to the strong type expected by the controller action.
This particular scenario can be seen on my Creative stories mini-site. You won't understand the language, but you can check out those multiple forms and shared names. Never mind that IDs are also duplicated (which is a rule violation) but that could be solved. It just doesn't matter in this case.
name identifies form fields*, so it can be shared by controls that represent multiple possible values for such a field (radio buttons, checkboxes). Names are submitted as the keys for form values.
id identifies DOM elements, so that they can be targeted by CSS or JavaScript.
* Names are also used to identify local anchors, but this is deprecated and 'id' is the preferred way to do so nowadays.
name is the name that is used when the value is passed (in the URL or in the posted data). id is used to uniquely identify the element for CSS styling and JavaScript.
The id can be used as an anchor too. In the old days, <a name was used for that, but you should use the id for anchors too. name is only to post form data.
name is used for form submission in the DOM (Document Object Model).
ID is used for a unique name of HTML controls in the DOM, especially for JavaScript and CSS.
The name defines what the name of the attribute will be as soon as the form is submitted. So if you want to read this attribute later you will find it under the "name" in the POST or GET request.
Whereas the id is used to address a field or element in JavaScript or CSS.
The id is used to uniquely identify an element in JavaScript or CSS.
The name is used in form submission. When you submit a form only the fields with a name will be submitted.
The name attribute on an input is used by its parent HTML <form>s to include that element as a member of the HTTP form in a POST request or the query string in a GET request.
The id should be unique as it should be used by JavaScript to select the element in the DOM for manipulation and used in CSS selectors.
I hope you can find the following brief example helpful:
<!DOCTYPE html>
<html>
<head>
<script>
function checkGender(){
if(document.getElementById('male').checked) {
alert("Selected gender: "+document.getElementById('male').value)
}else if(document.getElementById('female').checked) {
alert("Selected gender: "+document.getElementById('female').value)
}
else{
alert("Please choose your gender")
}
}
</script>
</head>
<body>
<h1>Select your gender:</h1>
<form>
<input type="radio" id="male" name="gender" value="male">Male<br>
<input type="radio" id="female" name="gender" value="female">Female<br>
<button type="button" onclick="checkGender()">Check gender</button>
</form>
</body>
</html>
In the code, note that both 'name' attributes are the same, which makes 'male' and 'female' mutually exclusive options, while the 'id's differ so that each input can be told apart.
Adding some actual references to W3C documentation that authoritatively explain the role of the 'name' attribute on form elements. (For what it's worth, I arrived here while exploring exactly how Stripe.js works to implement safe interaction with the payment gateway Stripe. In particular, what causes a form input element to get submitted back to the server, or prevents it from being submitted?)
The following W3C documentation is relevant:
HTML 4: https://www.w3.org/TR/html401/interact/forms.html#control-name Section 17.2 Controls
HTML 5: https://www.w3.org/TR/html5/forms.html#form-submission-0 and
https://www.w3.org/TR/html5/forms.html#constructing-the-form-data-set Section 4.10.22.4 Constructing the form data set.
As explained therein, an input element will be submitted by the browser if and only if it has a valid 'name' attribute.
As others have noted, the 'id' attribute uniquely identifies DOM elements, but is not involved in normal form submission. (Though 'id' or other attributes can of course be used by JavaScript to obtain form values, which JavaScript could then use for Ajax submissions and so on.)
One oddity concerns previous answerers' and commenters' remarks about id values and name values being in the same namespace. So far as I can tell from the specifications, this applied to some deprecated uses of the name attribute (not on form elements). For example https://www.w3.org/TR/html5/obsolete.html:
"Authors should not specify the name attribute on a elements. If the attribute is present, its value must not be the empty string and must neither be equal to the value of any of the IDs in the element's home subtree other than the element's own ID, if any, nor be equal to the value of any of the other name attributes on a elements in the element's home subtree. If this attribute is present and the element has an ID, then the attribute's value must be equal to the element's ID. In earlier versions of the language, this attribute was intended as a way to specify possible targets for fragment identifiers in URLs. The id attribute should be used instead."
Clearly, in this special case, there's some overlap between id and name values for 'a' tags. But this seems to be a peculiarity of processing for fragment ids, not due to general sharing of namespace of ids and names.
An interesting case of using the same name: input elements of type checkbox like this:
<input id="fruit-1" type="checkbox" value="apple" name="myfruit[]">
<input id="fruit-2" type="checkbox" value="orange" name="myfruit[]">
At least if the response is processed by PHP, if you check both boxes, your POST data will show:
$_POST['myfruit'][0] == 'apple' && $_POST['myfruit'][1] == 'orange'
I don't know whether that sort of array construction happens with other server-side languages, or whether the value of the name attribute is normally treated as a plain string and it's a quirk of PHP that a 0-based array gets built from the order of the data in the POST request, which is just:
myfruit[] apple
myfruit[] orange
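For comparison, here is a small sketch of how a parser that does not give the brackets any special meaning sees the same data. It uses JavaScript's URLSearchParams; the `body` string is a hand-written stand-in for the URL-encoded request above, not captured traffic.

```javascript
// URL-encoded form body: "[" and "]" are sent as %5B and %5D.
const body = "myfruit%5B%5D=apple&myfruit%5B%5D=orange";

const params = new URLSearchParams(body);

// The key is literally the string "myfruit[]"; asking for every value
// stored under that key gives them back in the order they were sent.
const fruits = params.getAll("myfruit[]");
console.log(fruits);
```

So the repeated name arrives as two ordinary pairs, and turning them into an array is up to whatever reads the request.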
Can't do that kind of trick with ids. A couple of answers in What are valid values for the id attribute in HTML? appear to quote the spec for HTML 4 (though they don't give a citation):
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
followed by any number of letters, digits ([0-9]), hyphens ("-"),
underscores ("_"), colons (":"), and periods (".").
So the characters [ and ] are not valid in either ids or names in HTML4 (they would be okay in HTML5). But as with so many things html, just because it's not valid doesn't mean it won't work or isn't extremely useful.
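The quoted HTML 4 rule translates fairly directly into a regular expression. This is just a sketch of that one sentence (the `HTML4_NAME_TOKEN` and `isValidHtml4Token` names are made up); it does not model what HTML5 allows.

```javascript
// A letter, followed by any number of letters, digits, hyphens,
// underscores, colons, and periods.
const HTML4_NAME_TOKEN = /^[A-Za-z][A-Za-z0-9\-_:.]*$/;

function isValidHtml4Token(value) {
  return HTML4_NAME_TOKEN.test(value);
}

const results = {
  "fruit-1": isValidHtml4Token("fruit-1"),
  "myfruit[]": isValidHtml4Token("myfruit[]"),
  "1fruit": isValidHtml4Token("1fruit"),
};
console.log(results);
```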
If you are using JavaScript/CSS, the 'id' of a control is the most direct way to apply any CSS/JavaScript stuff to it.
A plain #id selector or getElementById() call won't find a control that only has a name; you would need an attribute selector or getElementsByName() instead. As an example, if you use a JavaScript calendar attached to a textbox, you would normally use the id of the text control to assign it the JavaScript calendar.