How do I get all links of a Wikipedia page, not including the ones in the navbox at the bottom?

I want to get all the links on the Wikipedia page Charles Maurice de Talleyrand-Périgord, but without all the links within the navboxes at the bottom of the page. I've been using the links prop (e.g. https://en.wikipedia.org/w/api.php?origin=*&action=query&prop=links&titles=Charles%20Maurice%20de%20Talleyrand-Périgord&pllimit=max), but it includes the navboxes, and I haven't found another property that pertains to my concern. Is there a way to exclude them?

Related

What does "reference" mean when using figure?

It seems the only real difference between the figure and aside elements is that a figure is "referenced" by the main flow, whereas an aside is not.
What exactly does reference mean in this context?
For example, suppose I have a figure containing an image of a mountain. Does it mean I need to point to that exact figure of the mountain, or does it mean I could just be talking about a mountain? How specific does a "reference" need to be for its proper use? Also, if it needs to be referenced by the main flow, does that mean it needs to be something the entire page is about and not just an article of it?
<figure>
<img src="mountainReiner.jpg">
<figcaption><p>mount reiner</p></figcaption>
</figure>
<p> paragraph talking about mountain reiner</p>

2017-07-18 update
The HTML spec was updated to clarify what’s meant by referenced with respect to HTML elements.
The word referenced in the figure element section and also in other sections is now a link to https://html.spec.whatwg.org/multipage/dom.html#referenced which reads:
Elements can be referenced (referred to) in some way, either explicitly or implicitly. One way that an element in the DOM can be explicitly referenced is by giving an id attribute to the element, and then creating a hyperlink with that id attribute's value as the fragment for the hyperlink's href attribute value. Hyperlinks are not necessary for a reference, however; any manner of referring to the element in question will suffice.
And there’s a detailed example following that text.
2017-07-13 original answer
What exactly does reference mean in this context?
It comes from the fact that in print publishing, figures are illustrations that are titled and numbered so you can easily cite them (reference them) by number+title in any part of a publication.
For example, see how-to-use-figures style guides such as the following:
MLA Tables, Figures, and Examples — https://owl.english.purdue.edu/owl/resource/747/14/
Each illustration must include a label, a number, a caption and/or source information.
APA Tables and Figures — https://owl.english.purdue.edu/owl/resource/560/20/
For figures, make sure to include the figure number and a title with a legend and caption.
MLA Citation Guide (8th Edition): Images, Charts, Graphs, Maps & Tables — https://columbiacollege-ca.libguides.com/mla/images
Each figure should be assigned a figure number, starting with number 1 for the first figure used in the assignment.
In Web documents, you can just put an id attribute on the figcaption elements that provide the title for each figure, and then use a-element hyperlinks to reference those titles, so numbers for figures aren’t strictly necessary. But many Web documents use numbered figures anyway.
Does it mean I need to point out that exact figure of the mountain or does it mean I could be talking about a mountain.
It means, give figures a figcaption with a title+id that lets you refer to it elsewhere in the doc.
How specific does a "reference" need to be for its proper use?
It can (should) be exact—a hyperlink pointing to a unique id.
Even if the figure is unnumbered and has the same title as another figure in the same document, if it has a unique id attribute, you can refer to it specifically/exactly by hyperlinking to that id.
Also if it needs to be referenced by the main flow does that mean it needs to be something the entire page is about and not just an article of it?
No, the HTML spec at least imposes no special requirements like that.
For the record here, the relevant language from the HTML spec is this:
The figure element represents some flow content, optionally with a caption, that is self-contained (like a complete sentence) and is typically referenced as a single unit from the main flow of the document.
https://html.spec.whatwg.org/multipage/grouping-content.html#the-figure-element

Reference means the subject matter, or what your web page is about. If your web page is about hiking (the referent), the figure of the mountain is there to reference the theme of hiking.
Another version is:
Aside is information that is not related directly to the main content, and generally does not assist directly in the main content (such as car parts on the hiking page).
Figure may be used most widely in documents to illuminate the main themes of the document or to assist in proving the main theme.

How to tell google this text is part of another article

After every article on my website there are previews of other articles. They are random previews.
The problem is that the previews are really big: each has a headline, a subheadline and 6 rows of text. Sometimes Google thinks they are part of my article.
Is there any way to tell Google that this div contains text from another article?
preview example:

By using the appropriate semantic markup that HTML5 offers, user agents (like Google) would, in principle, be able to understand this; but that, of course, doesn’t necessarily mean that they (currently) support (all of) this.
The teasers should be outside of the main element.
Signal: It’s not part of this page’s main content.
The teasers should be in an aside element.
Signal: It’s only "tangentially related" to the page’s content.
Each teaser should be in its own article element.
Signal: It’s a self-contained item of content.
Each teaser’s link (to the full article) should get the bookmark link type.
Signal: The permalink URL of the teaser/article is not the same as the current page’s URL.
(One could also consider using the blockquote element for the parts taken over literally, i.e., in cases where the teaser doesn’t contain (slightly) different content, like a summary. But it depends on your understanding of your content, if you really quote here.)
However, that doesn’t stop Google from showing parts of the teasers in their SERPs (if their algorithms deem it useful, get confused, or whatever). Without using some "hacks" (e.g., with JS or an iframe), it’s neither possible nor intended to hide parts of the page from Google Search and their SERPs.

Wrap the preview article div in
<!--googleoff: all-->
<!--googleon: all-->
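That tells the crawler not to index that part of your page. Note that these tags are documented for the Google Search Appliance; Google's public web search does not act on them.
You can customize the tag to your preference: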
index — content surrounded by “googleoff: index” will not be indexed by Google
anchor — anchor text for any links within a “googleoff: anchor” area will not be associated with the target page
snippet — content surrounded by “googleoff: snippet” will not be used to create snippets for search results
all — content surrounded by “googleoff: all” are treated with all attributes: index, anchor, and snippet
(Source)

Rel canonical without a primary URL

Background: We have a situation where the customer can select where on a website a piece of content is published. On a municipality website, an article describing a playground could be published in both the “For families” section and the “Parks” section. On a government site with instructions for companies, divided into sections by company type, instructions that are identical for all companies will be published in every company-type section. There is often no definite primary place that is more right than the others.
The CMS renders top, bottom and side content relevant to the part of the site where you are, so only the content part is identical between locations.
Questions:
Do I need rel canonical for URLs inside the same site, or is it only for external links?
If I need them, can I somehow specify that they are all “primary”, or did I already do that by not having the canonical tag at all?
Do search engines generally show pages that have the canonical tag?

If you want to merge internal pages, then yes, a canonical is required for those pages.
When a canonical is set, Google will prefer to display the target URL.
No, they display the page that is linked to in the canonical.

How should I use html5 elements with modal sections of my web page?

I'm pretty inexperienced as far as html goes and even less so with html5.
I have a question regarding modal popups - page sections that are interacted with using javascript/ajax, but not necessarily displayed on the page all the time. These are not generally in the main html flow - I might for instance place all my modal code at the end of the page for maintainability. The question is - should I be declaring these chunks of the page using html section tags, or something else?
To shed more light on the situation I'm describing, I have an application page. This contains a number of sections (I'm not referring to html5 here). The first section is modal on entering the page - it's a "click to continue if you agree" section. The next 5 chunks belong to a stepped application form - each step is displayed one at a time using a multiview control. Then another modal - a UI block, followed by a final decision section.

Since they are modal, and appear out of the flow, it is probably most suitable to use a div for them. If you do want to use a semantic block, then which you use will depend on what the content is, and how it relates to the rest of the page. The following articles should help you make that decision:
http://html5doctor.com/the-section-element/
http://html5doctor.com/the-article-element/
http://html5doctor.com/avoiding-common-html5-mistakes/ (particularly the first section of that article - "Don’t use section as a wrapper for styling")
Edit: Have added that 3rd link, since I now have enough rep to do so :-) yay!

The question is - should I be declaring these chunks of the page as sections, or something else
One of the big advantages of HTML5 is that it's semantically readable. If you feel that your modal popups are better described by something like an article tag, then use an article. Use the tag you feel most accurately describes your functionality.
For example, let's say I have a sample page like so:
<html>
<head></head>
<body>
<article>

</article>
</body>
</html>
I would expect the content of that article tag to fit this definition:
The article element represents a component of a page that consists of a self-contained composition in a document, page, application, or site and that is intended to be independently distributable or reusable, e.g. in syndication. This could be a forum post, a magazine or newspaper article, a blog entry, a user-submitted comment, an interactive widget or gadget, or any other independent item of content.
W3C Specification. The Article Element.
Note: In this context, an article is designed to represent flow content. Given that your aim is not to write flow content (as you correctly put) this is not a good example. This is very clear from the definition I've provided.
Similarly, if I replaced article with section, I would expect it to fit this definition:
Examples of sections would be chapters, the various tabbed pages in a tabbed dialog box, or the numbered sections of a thesis. A Web site's home page could be split into sections for an introduction, news items, and contact information.
W3C Specification. The Section Element
If I were you, I would have a look through the spec and think about the following questions:
What does my content actually mean to the user?
How will my content appear to other programmers?
Does the use of this content give me a hint at the correct semantics?

It depends what you have in your modal.
You could have a login form, subscribe stuff, advertisements, articles, or a frame of another page, so it would only make sense to use <section> if the modal is actually an interesting section of the page. For example, if you have an article and want to display the author info in a modal box, then I would say that it would be acceptable to use <section>.
So overall, if it is part of the content then it sounds OK to use that; if it is not, you should use a <div>.
I would also say that no one has the answer for this as it is purely opinionated, and quite frankly doesn't matter.

There is also another way to incorporate modals. As they are dependent on JavaScript, you could also load the popup contents via AJAX without having them in the document flow. A recent project I worked on first renders links to a normal and complete HTML page for popup contents (e.g. contact forms). If JS is enabled, a parameter is added to the links to load only the main content, without header, menu and sidebars, via AJAX.
As the modal content does not really belong to your site content (if it does, it shouldn't be a popup but within the document's main content), it shouldn't get marked up with a section, main or article tag. Instead, use a div to render the popups, or use an iframe if that is admissible for your project.

It doesn't really matter what tag is used for a modal, as long as it's appropriate for the purpose (don't use a <fieldset>, for example). Usually we see a <div> representing a modal.
You can use the role attribute for semantic information about the purpose of an element. In this case role="dialog" would be appropriate. You can find more info on the role attribute in HTML5 here.
Also note ARIA attributes: they enhance accessibility. For example, aria-hidden="true" specifies that the element is hidden from assistive technology. Screen readers use this to skip the content.

How would you pick the best image from a webpage in a crawler?

Suppose you were given any random webpage on the internet and had only the HTML source. What method would you use to pick the image that best describes that webpage? Assume that there are no meta tags or hints.
Facebook does something similar when you post a link, but they give you a choice of n images to choose from; they don't actually pick one unless the page has the meta tags on it.

Try to analyze the structure of the page. Most web pages have, roughly, a header, a content area and a footer. The content area is most likely to contain images related to the subject of the page, so that's what you're looking for.
Find the content area
Most content areas are div elements with an ID or class named content, so that's always a good first guess. There may be alternative descriptors of the content element, so you'll need to do some research to find common patterns.
The content area will also contain multiple h1 or h2 headings in most cases, so that's another indicator to look for.
Find the header and footer
Another approach is to identify the header and footer. Headers usually contain a hint to the logo of the site, such as an image, CSS class name or link to the root of the site. Footers are most likely to contain things like copyright statements.
You can also find the header and footer by analyzing the links on the page. Most internal links will be in the header and footer, while the content has relatively more outgoing links, if any.
Once you have the header and footer, the content is usually in between :)
Find an image
Once you've identified the content area, the first image is usually your best pick. You should, however, ignore images with a small width and/or height, as these will likely be decorative images.
You could also double-check the images against any included CSS files, to make sure you're not picking an image that's related to the design of the page.
Fall back to an educated guess
If you cannot reliably guess the content area of the page, just use the biggest image on the page, as egrunin suggested. Again, you can check this image against the CSS files, to rule out any design-related images.
In the fall-back case, you could log the URL and review those pages to improve your image detection algorithms.
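As a rough illustration of the steps above, here is a minimal Python sketch using only the standard library. It assumes reasonably well-formed HTML, treats any element whose id or class is exactly "content" as the content area, and relies on the width/height attributes of img tags. The names ContentImageFinder, pick_image and the MIN_SIDE threshold are made up for this example, and the CSS cross-check is left out.

```python
from html.parser import HTMLParser

MIN_SIDE = 100  # assumed cut-off below which an image is treated as decorative

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def _to_int(value):
    """Parse a width/height attribute; return None if missing or not a number."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


class ContentImageFinder(HTMLParser):
    """Records every <img> and whether it sits inside a 'content' element."""

    def __init__(self):
        super().__init__()
        self.depth = 0             # nesting depth of currently open elements
        self.content_depth = None  # depth at which the content element opened
        self.images = []           # tuples: (src, width, height, in_content)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img":
            self.images.append((a.get("src"), _to_int(a.get("width")),
                                _to_int(a.get("height")),
                                self.content_depth is not None))
            return
        if tag in VOID:
            return
        self.depth += 1
        names = (a.get("id") or "").split() + (a.get("class") or "").split()
        if self.content_depth is None and "content" in names:
            self.content_depth = self.depth

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.content_depth == self.depth:
            self.content_depth = None  # the content element just closed
        self.depth -= 1


def pick_image(html):
    finder = ContentImageFinder()
    finder.feed(html)

    def big_enough(img):
        _, w, h, _ = img
        # Unknown sizes are kept; only images known to be small are dropped.
        return (w is None or w >= MIN_SIDE) and (h is None or h >= MIN_SIDE)

    candidates = [img for img in finder.images if big_enough(img)]
    # First choice: the first large-enough image inside the content area.
    for src, _, _, in_content in candidates:
        if in_content:
            return src
    # Fall back to the biggest image on the page by declared area.
    sized = [img for img in candidates if img[1] and img[2]]
    if sized:
        return max(sized, key=lambda img: img[1] * img[2])[0]
    return candidates[0][0] if candidates else None
```

Real pages often leave tags unclosed or omit the size attributes, so a production crawler would probably need a more forgiving parser and would have to fetch the images to measure them.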

This is best-guess stuff, but:
ignoring anything hosted in another domain will eliminate most ads
once you've grabbed the images, you can get their size; the biggest is probably the one to use.
images that are inside an <a> that points to the root of the domain are probably logos. Example: the SO logo on this page is inside such a link.
Edited to add:
It's true that large sites use auxiliary servers for their images. But you could probably make up a couple of simple parsing rules that will get 80% of cases, picking out g-ecx.images-amazon.com and static.ak.fbcdn.net as non-ad servers.
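These rules could be sketched in Python roughly as follows. This is only an illustration: it assumes the crawler has already collected each image's src, its pixel size, and the href of any enclosing link into a dict (the keys src, width, height and link_href are invented for this example), and the allow-list holds just the two hosts mentioned above.

```python
from urllib.parse import urljoin, urlparse

# Off-domain hosts treated as image servers rather than ad servers
# (the two examples named above; a real list would be longer).
KNOWN_IMAGE_HOSTS = {"g-ecx.images-amazon.com", "static.ak.fbcdn.net"}


def best_image(page_url, images):
    """Return the URL of the largest image that survives the filters, or None.

    images: list of dicts with 'src', 'width', 'height' and, optionally,
    'link_href' (the href of an enclosing <a>, if there is one).
    """
    page_host = urlparse(page_url).hostname
    root = urljoin(page_url, "/")  # root of the page's own domain
    kept = []
    for img in images:
        src = urljoin(page_url, img["src"])  # resolve relative URLs
        host = urlparse(src).hostname
        if host != page_host and host not in KNOWN_IMAGE_HOSTS:
            continue  # hosted elsewhere: probably an ad
        href = img.get("link_href")
        if href is not None and urljoin(page_url, href) == root:
            continue  # linked to the site root: probably a logo
        kept.append((img["width"] * img["height"], src))
    # Tuples compare by area first, so max() picks the biggest image.
    return max(kept)[1] if kept else None
```

For example, given a logo linked to "/", an off-domain banner, a small thumbnail and a large photo on one of the allow-listed hosts, the function returns the photo's URL.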

If you find an og:image meta property, you can use that quite safely, since it is part of the Open Graph specification, which is used to provide images for Facebook links.
Example of format:
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:image" content="http://ia.media-imdb.com/rock.jpg"/>
...
</head>
...
</html>
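A crawler could read that property with something like the following Python sketch, which uses only the standard library's html.parser. The names OgImageParser and find_og_image are made up for this example; it returns the first og:image it finds, or None if the page has none.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also arrive here by default.
        if tag != "meta" or self.og_image is not None:
            return
        a = dict(attrs)
        if a.get("property") == "og:image" and a.get("content"):
            self.og_image = a["content"]


def find_og_image(html):
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image
```

If this returns None, the crawler can fall back to the other heuristics discussed on this page.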

Well, I would try to look for divs/spans/h1s with something like class or id = "logo" or "top". Almost every page has its logo at the top of the page. Just look at the Stack Overflow logo :)
I do it this way in my crawler and it works fine :)
