Algorithm to develop an article extractor - html

I have undertaken a project to extract the main content from any webpage. For example, if I input the URL of a news article, it should return only the article itself. The first step is fetching the source of the given URL, and there are many ways to do that. Once I have the HTML, I keep only the part inside the <body> tag, because the article must be somewhere inside the body.
After this, I go through each div element and check how much text it contains. At the end, I select the div with the most text inside it.
Another approach I am considering: for each <p> element, I look at its parent. At the end, I select the div that has the most <p> elements as direct children. To understand it better, see this tree: [figure: tree of an HTML document]
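A minimal sketch of the second heuristic (the div with the most direct <p> children), in Java. To stay dependency-free it uses the JDK's XML parser, so it assumes well-formed XHTML; real pages are rarely that clean and would need a lenient HTML parser instead. The class name, method name, and sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ParagraphCountExtractor {

    /** Returns the div with the most direct p children, or null if no div has any. */
    static Element densestDiv(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            // Assumption: the input is well-formed; anything else is rejected.
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
        NodeList divs = doc.getElementsByTagName("div");
        Element best = null;
        int bestCount = 0;
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            int count = 0;
            // Only direct children count, so outer wrappers do not inherit nested paragraphs.
            for (Node child = div.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child.getNodeType() == Node.ELEMENT_NODE && "p".equals(child.getNodeName())) {
                    count++;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                best = div;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"nav\"><p>Home</p></div>"
                + "<div id=\"wrapper\"><div id=\"story\">"
                + "<p>First.</p><p>Second.</p><p>Third.</p>"
                + "</div></div>"
                + "<div id=\"footer\"><p>Contact</p></div>"
                + "</body></html>";
        System.out.println(densestDiv(page).getAttribute("id")); // prints "story"
    }
}
```

Note that the wrapper div scores zero here because its paragraphs are grandchildren, which is exactly the behaviour the "direct children" rule is meant to give.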
Now, I know these methods are basic, which is why I am asking this question. I would like to hear the community's suggestions. What approaches do you all use?

I like the idea of implementing your own news crawler.
A few suggestions:
Check the source (right-click > 'Inspect' in Chrome) of some popular sites (e.g. The New York Times) and look for the common HTML element names, ids, or classes they use to identify the different blocks of the page, for instance divs with a 'story' or 'story-body' id.
I would go with the word count, but also use a dictionary of common phrases that are likely to appear in a news article.
I would search for the block between 'header' and 'footer', excluding comment sections and advertisements (again, by checking the id or class values).
Start crawling from the main page; it will probably link to the sub-pages or articles. Once you have a reference (e.g. a headline or article name), it will help you find your way around the sub-page itself.
In any case, I suggest working with the Java jsoup library; it will make your life easier. Use it with its jQuery-like selectors.
Good luck.

Related

URL to an unnamed part of a web page

I'd like to refer to a specific part of a web page that I am not the author of and that is not tagged with the NAME attribute. The part I have in mind could be specified, e.g., as the location where a certain word appears, which could be reached manually via a FIND operation. I imagine something like
http://somesite.com#search-for:foo-bar
Is there some feature in HTML allowing for this?
No.
You can only link to elements with an id and to <a> elements with a name.

JSoup Select Tag Recursive Search

I recently tried to work with JSoup to parse HTML documents. I went through the JSoup tutorial and found that the select method might be what I am looking for.
What I am trying to accomplish is to find all elements in an HTML document that have a certain class. To test that, I tried it with the Amazon web page (idea: find all deals with certain offers).
So I inspected the web page to see which classes and ids are being used, and then tried to integrate this into a small code snippet. In this example I found the following element:
<span id="dealTitle" class="a-size-base a-color-link dealTitleTwoLine restVisible singleCellTitle autoHeight">PROCAVE Matratzen-Brücke aus Schaumstoff 25 x 200 cm für ...</span>
This element is embedded in other elements and exists multiple times (for each deal of course). So here is my code to read the deal elements:
Document doc = Jsoup.connect("https://www.amazon.de/gp/angebote/ref=gbph_ftr_s-8_cd61_page_1?gb_f_LD=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CUPCOMING,dealTypes:LIGHTNING_DEAL,page:1,sortOrder:BY_SCORE,dealsPerPage:8&pf_rd_p=425ddcb8-bed4-4e85-ac0f-c1a79d14cd61&pf_rd_s=slot-8&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_r=BTHRY008J9N3N5CCMNEN&gb_f_second=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,dealTypes:COUPON_DEAL,page:8,sortOrder:BY_SCORE,dealsPerPage:8").timeout(0).get();
Elements deals = doc.select("span.a-size-base.a-color-link.dealTitleTwoLine.restVisible.singleCellTitle.autoHeight");
for (Element deal : deals) {
    if (deal.text().contains("ItemMatch")) {
        System.out.println("Found deal: " + deal.text());
    }
}
Unfortunately, I can't get the element I am looking for: deals always has a size of 0. I tried modifying my select to use only some of the classes, adding the id attribute, and so on. Nevertheless, I do not get the elements (in this case they are nested inside some others). If I try an element that is above this one in the DOM hierarchy (e.g. the div with class "a-section a-spacing-none slotContainer"), it is found.
Do I actually need to specify the whole DOM hierarchy (by using ">" in my select expressions)? I expected to be able to define a selector and have JSoup traverse and search the whole DOM tree.
No, you do not have to specify the full DOM hierarchy. Your test should work if the elements are really part of the DOM. I suspect they are not part of the DOM as loaded by JSoup. The reason might be that the inner DOM nodes are filled in by JavaScript through AJAX. JSoup does not run JavaScript, so dynamically loaded parts of the DOM are not accessible. To achieve what you want, you can either inspect and analyze the AJAX calls directly, or move to another solution such as Selenium WebDriver, which drives a real browser with a working JavaScript engine.

Adding dynamic content to a webpage: using a table, divs, or lists?

When adding dynamic content to a webpage, is it better to add this by just adding a new row to a table, by adding a new div element, or adding a list element?
For example, suppose I want a posting to be added to a page when a user submits it, somewhat like an eBay listing or a reddit post. When a user submits an "offer", the offer is shown on the page under all previous offers. Which element would make the most sense to hold this information?
It depends on the information. It doesn't matter if you are adding the content dynamically or writing static HTML - it's the same question.
You can get the answer by probing the type of the data:
for list data (ordered or unordered) - use a list
for tabular data - use a table
for other data - consider div (or other tags like section, footer, article etc.)

Extracting a value from a <ul> with specific text using HtmlAgilityPack

I'm trying to extract a link from http://www.raws.dri.edu/cgi-bin/rawLIST.pl?idIAN1+id
This page contains an unordered list, and I want to get the link for Daily Summary.
So far I've tried an XPath string of "//ul/li/a" with the .SelectNodes() method. Doing so returns only the first item in the list, which is the one I want right now. In the future, though, I may want the link to a different page, so what I really need is a way to specify which link to retrieve.
If you use //ul/li/a, you should get all the <a> links, not one.
If you want to extract the links that contain some text (e.g. Time Series Graph), you can do:
//ul/li/a[contains(text(), 'Time Series Graph')]
Similarly, if you're looking for some specific text in the href attribute:
//ul/li/a[contains(@href, 'Time Series Graph')]
By the way, I see you have asked many questions pointing to the same website, etc. My suggestion is: Learn a little bit of XPath, the basics, and read a tutorial about how HtmlAgilityPack works (pretty simple once you understand the basics of XPath), and then start working on that scraper.
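To experiment with expressions like these outside HtmlAgilityPack, here is a rough Java sketch using the JDK's built-in javax.xml.xpath. The list markup and link targets are invented stand-ins (the real page may look different), and the JDK parser requires well-formed markup. Note that XPath addresses attributes with @, as in @href.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class LinkByTextDemo {

    // Hypothetical stand-in for the station page; the real markup may differ.
    static final String PAGE =
            "<html><body><ul>"
            + "<li><a href=\"daily.html\">Daily Summary</a></li>"
            + "<li><a href=\"graph.html\">Time Series Graph</a></li>"
            + "<li><a href=\"monthly.html\">Monthly Summary</a></li>"
            + "</ul></body></html>";

    /** Evaluates an XPath expression against PAGE and returns the href of every matched element. */
    static List<String> hrefs(String xpathExpression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    xpathExpression,
                    new InputSource(new StringReader(PAGE)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(((Element) nodes.item(i)).getAttribute("href"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath or markup: " + xpathExpression, e);
        }
    }

    public static void main(String[] args) {
        // All links in the list.
        System.out.println(hrefs("//ul/li/a"));
        // Only the link whose text contains a given phrase.
        System.out.println(hrefs("//ul/li/a[contains(text(), 'Daily Summary')]"));
        // Only the link whose href attribute contains a given substring.
        System.out.println(hrefs("//ul/li/a[contains(@href, 'graph')]"));
    }
}
```

The first call returns all three links, which matches the point above that //ul/li/a selects every <a> in the list; the predicates narrow that down to one.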

Headings created inside of a template

I have a number of templates that create headings based on a formula. I am wondering if there is any way to create an "edit" link that will take you directly to that section. The way it currently works, the edit link takes you to editing the template itself. Could I possibly create a customized link that would keep you on the page and take you to the right part?
Here is some sample code to help clear things up...
Template:Head:
==={{{1}}}===
This is a heading titled "{{{1}}}"
Test Page:
=Section 1=
{{head|1.1}}
{{head|1.2}}
{{head|1.3}}
=Section 2=
{{head|2.1}}
{{head|2.2}}
{{head|2.3}}
At the moment, if I want to edit the information for template "2.3", I have to edit all of section 2. (Note that for this example, that isn't a big deal. For the actual templates I am working with on my site, the templates have dozens of parameters and there are sometimes 10 or more in a section.)
Bottom line, is there a way to create a custom edit link inside of the {{head}} template that would take you directly to editing the template's call on the page "Test Page"? Hope that makes sense.
Edit: Is there perhaps a way to make use of "anchor" tags? Can anchors be passed in to the URL?
To restate your problem, when you transclude a section heading the header isn't treated as being part of the destination page, so the edit link takes you back to the source. So you need a separate container for the template in order to edit it individually, and a complete section is the smallest editable container.
The only way I can think of doing this is using subpages (or virtual subpages if you don't have that enabled in this namespace; it doesn't change anything). So instead of placing {{head|1.1}} on MyPage, put it on MyPage/Subpage1 and then transclude that into MyPage in the usual way ({{:MyPage/Subpage1}}).
{{head}} can then include a custom edit link to the template input by using HTML heading tags (<h2> is equal to ==, etc.) to suppress the standard edit link and then use one of these templates (probably {{ed right}}) to create a custom edit link pointing to MyPage/Subpage1.
The way to create anchors in MediaWiki, by the way, is to use a <span id="name"/> tag, but that doesn't create a container that can be edited (or at least, not that I've been able to work out through URL tinkering).
I'm pretty sure there's no way to do that. As far as MediaWiki's section editing feature is concerned, the only thing that begins a new section is a line of the form:
=== Some text here ===
with the number of = signs determining the level of the heading. There's no way to get MediaWiki to let you edit any segment of the document that doesn't begin and end with such a line (or the beginning or end of the page).
Well, OK, I'm sure you technically could do it with an extension, in the sense that you can do anything with a MediaWiki extension. All you'd need to do is provide some way (e.g. a special parameter in an edit URL) for the user to indicate "I want to edit this template", then extract the template from the wikitext, present it to the user for editing, and write the result back into the page text over the original.
The tricky part will be extracting the template from the page source. (Finding and replacing templates on a page is a fairly common task for MediaWiki bot writers, so you might want to look for ideas there.) Whatever method you end up using for that, there will probably be edge cases where you need to give up and tell the user "Sorry, but I can't figure out how that template is transcluded here."
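To give a feel for the extraction step, here is a rough Java sketch that locates the n-th call to a named template by counting {{ and }} pairs, then splices a replacement over it. It is not MediaWiki code: the class and method names are hypothetical, template names are compared case-insensitively as a simplification, and it knowingly ignores {{{parameter}}} syntax, <nowiki> blocks, and comments, which are the kind of edge cases where a real implementation would have to give up.

```java
final class TemplateCallFinder {

    /**
     * Returns {start, end} (end exclusive) of the n-th (0-based) call to the
     * named template, or null if there is no such call or braces are unbalanced.
     */
    static int[] findCall(String wikitext, String templateName, int n) {
        int seen = 0;
        for (int start = wikitext.indexOf("{{"); start >= 0;
                start = wikitext.indexOf("{{", start + 2)) {
            int end = closingIndex(wikitext, start);
            if (end < 0) {
                return null; // unbalanced braces: give up rather than guess
            }
            String body = wikitext.substring(start + 2, end - 2);
            int bar = body.indexOf('|');
            String name = (bar < 0 ? body : body.substring(0, bar)).trim();
            // Simplification: the whole name is compared case-insensitively.
            if (name.equalsIgnoreCase(templateName) && seen++ == n) {
                return new int[] {start, end};
            }
        }
        return null;
    }

    /** Index just past the "}}" that balances the "{{" at start, or -1 if none. */
    static int closingIndex(String text, int start) {
        int depth = 0;
        int i = start;
        while (i + 1 < text.length()) {
            if (text.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (text.startsWith("}}", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    /** Writes the replacement over the n-th call to the named template. */
    static String replaceCall(String wikitext, String templateName, int n, String replacement) {
        int[] span = findCall(wikitext, templateName, n);
        if (span == null) {
            throw new IllegalArgumentException("Cannot locate call " + n + " of " + templateName);
        }
        return wikitext.substring(0, span[0]) + replacement + wikitext.substring(span[1]);
    }

    public static void main(String[] args) {
        String page = "=Section 1=\n{{head|1.1}}\n{{head|1.2}}\n=Section 2=\n{{head|2.1}}\n{{head|2.3}}\n";
        int[] span = findCall(page, "head", 3);
        System.out.println(page.substring(span[0], span[1])); // prints "{{head|2.3}}"
        System.out.println(replaceCall(page, "head", 3, "{{head|2.3|note=edited}}"));
    }
}
```

Addressing calls by position ("the fourth {{head}} on the page") is only one possible design; it breaks if the page is edited between building the link and following it, so a real extension would need some check for that.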