Java regex to extract td content from mavenrepository page - html

I need to extract license hyperlinks from a Maven repository page (the end goal is to find the copyright information associated with each Maven dependency). I am scraping the license URLs from this page: https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-annotations/2.5.0
I want to get all hrefs in the table just below the Licenses heading. In this case it is http://www.apache.org/licenses/LICENSE-2.0.txt. Other pages may list more than one license link, and I want to capture them all in a list of strings. Kindly help me with a regex to do that. Alternatively, if anybody knows another approach, such as a REST API that returns the licenses for a given artifact identifier and version from mavenrepository, that would be fantastic. The relevant portion of the HTML follows.
<div class="version-section">
<h2>Licenses</h2>
<table class="grid" width="100%">
<thead>
<tr>
<th style="width: 16em;">License</th>
<th>URL</th>
</tr>
</thead>
<tbody>
<tr>
<td>The Apache Software License, Version 2.0</td>
<td>
http://www.apache.org/licenses/LICENSE-2.0.txt
</td>
</tr>
</tbody>
</table>
</div>
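One possible approach, sketched under assumptions rather than offered as a definitive answer: first isolate the markup between the Licenses heading and the next closing table tag, then collect every absolute http(s) URL inside that slice. The class name LicenseLinks and both patterns are illustrative. The sketch assumes the heading is written exactly as <h2>Licenses</h2> and that each license URL appears either as cell text or inside an href attribute. Regex on HTML is fragile, so a page layout change would break it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LicenseLinks {
    // Captures everything between <h2>Licenses</h2> and the first </table> after it.
    private static final Pattern LICENSE_TABLE = Pattern.compile(
            "<h2>\\s*Licenses\\s*</h2>(.*?)</table>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // An absolute http(s) URL, ending at whitespace, a quote, or an angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Returns the distinct URLs found in the Licenses table, in document order.
    static List<String> extract(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher table = LICENSE_TABLE.matcher(html);
        if (table.find()) {
            Matcher url = URL.matcher(table.group(1));
            while (url.find()) {
                urls.add(url.group());
            }
        }
        return new ArrayList<>(urls);
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<div class=\"version-section\">",
                "<h2>Licenses</h2>",
                "<table class=\"grid\" width=\"100%\">",
                "<tbody><tr>",
                "<td>The Apache Software License, Version 2.0</td>",
                "<td>",
                "http://www.apache.org/licenses/LICENSE-2.0.txt",
                "</td>",
                "</tr></tbody>",
                "</table>",
                "</div>");
        System.out.println(extract(html));
    }
}
```

The LinkedHashSet removes duplicates when the same URL shows up both in an href attribute and as the link text, while keeping the order in which the links appear.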

Related

How Browser Engine works?

Here is my test HTML, written following the Google HTML/CSS style guide.
<table>
<thead>
<tr>
<th>Date
<th>Country
<tbody>
<tr>
<td>24/07/2018
<td>Myanmar
<tr>
<td>31/06/2018
<td>France
</table>
The following is how the browser interprets it.
<table>
<thead>
<tr>
<th>Date
</th>
<th>Country
</th>
</tr>
</thead>
<tbody>
<tr>
<td>24/07/2018
</td>
<td>Myanmar
</td>
</tr>
<tr>
<td>31/06/2018
</td>
<td>France
</td>
</tr>
</tbody>
</table>
Here are my questions.
1. How does the browser detect a missing closing tag, and how does it interpret it?
2. Is it preferable to omit closing tags? If so, when should I do that?
3. If it is not preferable, why not?
4. Will it affect styling or adding interactivity to semantic HTML?
Answers:
1. This is beyond the scope of SO, but it works much like a compiler detecting that something was opened and never closed. Try writing a program that checks whether a bracket sequence is valid; it would probably be similar.
2. Omitting closing tags may break your page, on top of making it horribly unmaintainable. Always use closing tags.
3. See 2.
4. See 2.
Browsers can sometimes guess what you meant (or rather, they happen to parse it in a way that produces what you meant), but they might also be wrong. Consider this:
<div>Hello<span>world
Is it this:
<div>Hello</div><span>world</span>
or this:
<div>Hello<span>world</span></div>
Both are valid, and the browser has no idea which you meant. If you really want to get into how browsers are supposed to parse HTML, see this great link by @modu and @kalido, which may let you work out how a browser should parse the line above:
http://w3c.github.io/html/syntax.html#tokenization
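For the table markup in the question specifically, the HTML standard lists the end tags of td, th, tr, thead and tbody as optional, so the parser closes those elements by rule when the next cell, row, or section starts. The following is a toy sketch of that idea, not the browser's actual algorithm: the class name ImpliedEndTags and the small rule table are illustrative assumptions, and real parsers handle many more cases.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImpliedEndTags {
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Illustrative rule table: a start tag (key) closes open elements
    // until one of the listed ancestors is on top of the stack.
    private static final Map<String, Set<String>> CLOSES_UP_TO = Map.of(
            "td", Set.of("tr"),
            "th", Set.of("tr"),
            "tr", Set.of("thead", "tbody", "tfoot", "table"),
            "thead", Set.of("table"),
            "tbody", Set.of("table"),
            "tfoot", Set.of("table"));

    static String normalize(String html) {
        Deque<String> open = new ArrayDeque<>(); // stack of open elements
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start()); // copy text before this tag
            last = m.end();
            String name = m.group(2).toLowerCase();
            if (m.group(1).isEmpty()) {
                // Start tag: emit the end tags it implies, then open it.
                Set<String> stopAt = CLOSES_UP_TO.get(name);
                while (stopAt != null && !open.isEmpty()
                        && !stopAt.contains(open.peek())) {
                    out.append("</").append(open.pop()).append(">");
                }
                open.push(name);
                out.append(m.group());
            } else if (open.contains(name)) {
                // End tag: close everything still open inside that element.
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append(">");
                }
                open.pop();
                out.append(m.group());
            }
            // An end tag with no matching open element is dropped.
        }
        out.append(html.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize(
                "<table><thead><tr><th>Date<th>Country"
                + "<tbody><tr><td>24/07/2018<td>Myanmar</table>"));
    }
}
```

The stack of open elements is the same device as in a bracket checker; the difference is that a mismatch triggers a defined repair instead of an error.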

Crystal report handling HTML table

As far as I know, Crystal Reports is quite limited in the HTML tags it can interpret. I have also seen the list of supported tags. However, my data contains tr and td tags and I cannot avoid using them.
Is there anything I can do between the HTML and Crystal Reports so that the table can be rendered?
Sample Data
<p>Your tenses are a bit confused here.
For next piece of writing, write everything in past tense. Set all your
story with “Last Sunday” to get used to recounting in past
tense first.</p>
<table border="1" cellpadding="1" cellspacing="1" style="width:500px">
<tbody>
<tr>
<td>look</td>
<td>looked</td>
</tr>
<tr>
<td>cook</td>
<td>cooked</td>
</tr>
<tr>
<td>bless</td>
<td>blessed</td>
</tr>
</tbody>
</table>
<p> </p>
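One option to consider is preprocessing the HTML before it reaches the report, so that each table row becomes a single line of text. The sketch below is only an illustration under assumptions: it assumes that <br> is among the tags your Crystal Reports version interprets, the class name TableFlattener is made up, and the " - " cell separator is an arbitrary choice. The tabular layout is lost; only the row content survives.

```java
class TableFlattener {
    // Rewrites table markup so that each row becomes one line ending in <br>.
    static String flatten(String html) {
        return html
                // A boundary between two cells becomes a visible separator.
                .replaceAll("(?is)</t[dh]>\\s*<t[dh]\\b[^>]*>", " - ")
                // The end of a row becomes a line break.
                .replaceAll("(?is)</tr>\\s*", "<br>")
                // Any remaining table tags are dropped.
                .replaceAll("(?is)</?(table|thead|tbody|tfoot|tr|td|th)\\b[^>]*>\\s*", "");
    }

    public static void main(String[] args) {
        String html = "<table border=\"1\"><tbody>"
                + "<tr><td>look</td><td>looked</td></tr>"
                + "<tr><td>cook</td><td>cooked</td></tr>"
                + "</tbody></table>";
        System.out.println(flatten(html));
    }
}
```

Other markup, such as the <p> elements in the sample data, passes through unchanged.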

Trying to add a hyperlink to a web based table generator

Sorry, but I really have no experience in web development. I am trying to make a website for my boys' soccer team and have been working with the GoDaddy website builder. I built a table with my player roster using a web-based table generator, tablesgenerator.com. Now I am trying to make each player's name a hyperlink to their own page so I can customize it for each player. Please help, because I have tried so many things. Here is some of the code from the table to show what I am working with.
<th class="tg-n19i">NUMBER</th>
<th class="tg-3wsf">NAME</th>
<th class="tg-3wsf">POSITION</th>
<th class="tg-3wsf">GRADE</th>
</tr>
<tr>
<td class="tg-43qd">5</td>
<td class="tg-43qd">Osvaldo Araujo</td>
<td class="tg-43qd">Mid/Def</td>
<td class="tg-43qd">12</td>
As you can see, the name is in the second cell of the row, and I just need to find a way to link it. Any help would be much appreciated. Thank you.
Assuming you already have a page for each of the players you could make their name a link like this:
<th class="tg-n19i">NUMBER</th>
<th class="tg-3wsf">NAME</th>
<th class="tg-3wsf">POSITION</th>
<th class="tg-3wsf">GRADE</th>
</tr>
<tr>
<td class="tg-43qd">5</td>
<td class="tg-43qd"><a href="page url">Osvaldo Araujo</a></td>
<td class="tg-43qd">Mid/Def</td>
<td class="tg-43qd">12</td>
You would just need to change the "page url" bit for each team member's respective page.

R-Advanced Web Scraping-bypassing aspNetHidden using xmlTreeParse()

This question takes a bit of time to introduce, bear with me. It will be fun to solve if you can get there. This scrape would be replicated over thousands of pages on this website using a loop.
I'm trying to scrape the website http://www.digikey.com/product-detail/en/207314-1/A25077-ND/ to capture the data in the table with Digi-Key Part Number, Quantity Available, etc., including the right-hand side with Price Break, Unit Price, and Extended Price.
Using the R function readHTMLTable() doesn't work and only returns NULL values. The reason (I believe) is that the website has hidden its content using the tag "aspNetHidden" in the HTML code.
For this reason I also found difficulty using htmlTreeParse() and xmlTreeParse() with the whole section parented by not appearing in the results.
Using the R function scrape() from the scrapeR package
require(scrapeR)
URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
does return the full html code including the lines of interest:
<th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber">
<meta itemprop="productID" content="sku:A25077-ND">A25077-ND</td>
<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>
However, I haven't been able to select the nodes out of this block of code with the error being returned:
no applicable method for 'xpathApply' applied to an object of class "list"
I've received that error using different functions such as:
xpathSApply(URL,'//*[@id="pricing"]/tbody/tr[2]')
getNodeSet(URL,"//html[@class='rd-product-details-page']")
I'm not the most familiar with xpath but have been identifying the xpath using inspect element on the webpage and copy xpath.
Any help you can give on this would be much appreciated!
You've not read the help for scrape have you? It returns a list, you need to get parts of that list (if parse=TRUE) and so on.
Also I think that web page is doing some heavy heavy browser detection. If I try and wget the page from the command line I get an error page, the scrape function gets something usable (but seems different to you) and Chrome gets the full junk with all the encoded stuff. Yuck. Here's what works for me:
> URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
> tables = xpathSApply(URL[[1]],'//table')
> tables[[2]]
<table class="product-details" border="1" cellspacing="1" cellpadding="2">
<tr class="product-details-top"/>
<tr class="product-details-bottom">
<td class="pricing-description" colspan="3" align="right">All prices are in US dollars.</td>
</tr>
<tr>
<th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber"><meta itemprop="productID" content="sku:A25077-ND"/>A25077-ND</td>
<td class="catalog-pricing" rowspan="6" align="center" valign="top">
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
<tr>
<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>
Adjust to your use case. Here I'm getting all the tables and showing the second one, which has the info you want. Some of it is in the pricing table, which you can get directly with:
> pricing = xpathSApply(URL[[1]],'//table[@id="pricing"]')[[1]]
> pricing
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
<tr>
<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>
</tr>
and so on.

How do I grab info from source code from other site?

Is there a way to pull content out of another site's source code directly into your own site?
For example, let's say that a site contains the following source code:
<table ...>
<tr>
<td class=...>...</div></td>
</tr>
<tr>
<td class=....><div align="... class=...>"Interesting string that keeps changing"</div></td>
</tr>
</table>
And we want that "Interesting string that keeps changing" to appear on our website as well.
You could use PHP:
$html = file_get_contents('url to website');
or, if you want a JavaScript solution, use a hidden <iframe> and then just grab its innerHTML.