HTML text extraction in J2ME

I want to extract text from a web page in J2ME. I have used String operations, but I am not getting the result. Is my code correct?
The String from a web page:
<td align="left" valign="middle" class="celebrity-details-description-txt" >
<p style="text-align: justify;">
Hero Gopichand's new movie under the direction of Chandrasekhar Yeleti is progressing at brisk pace in Ladakh. Recently, the unit has shot an extensive action scene on Gopichand, Taapsee and others under Buzkashi sport backdrop. alt="Buzkashi sport, gopichand buzkashi, gopichand new movie, buzkashi afghanisthan sport
</p>
<p style="text-align: justify;"> </p> <p style="text-align: justify;"> </p>
</td>
Here is my code:
int tdIndex = readUrl.indexOf("<td align=\"left\" valign=\"middle\" class=\"celebrity-details-description-txt\">");
tdIndex = readUrl.indexOf(">", tdIndex);
int endtdIndex = readUrl.indexOf("</td>", tdIndex);
String content = readUrl.substring(tdIndex + 1, endtdIndex);

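It looks like the td tag in the readUrl String has extra whitespace (between middle and class, and in the sample above also before the closing >), so the exact-match indexOf does not find it and returns -1. My suggestion is to change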
int tdIndex = readUrl.indexOf("<td align=\"left\" valign=\"middle\" class=\"celebrity-details-description-txt\">");
to just
int tdIndex = readUrl.indexOf("class=\"celebrity-details-description-txt\"");
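The same idea can be sketched outside J2ME. The following is a minimal Python illustration (not the asker's Java code): it searches only for the class attribute, then for the end of the opening tag, then for the closing </td>, and checks each index for -1 before slicing. The helper name extract_cell_text and the shortened sample string are assumptions made for this example.

```python
def extract_cell_text(page, class_name):
    """Return the text between the tag carrying class_name and the next </td>, or None."""
    marker = 'class="' + class_name + '"'
    start = page.find(marker)          # tolerant of spacing elsewhere in the tag
    if start == -1:
        return None
    tag_end = page.find(">", start)    # end of the opening tag
    if tag_end == -1:
        return None
    close = page.find("</td>", tag_end)
    if close == -1:
        return None
    return page[tag_end + 1:close]


# Shortened sample with irregular spacing inside the <td> tag.
sample = (
    '<td align="left"  valign="middle" class="celebrity-details-description-txt" >'
    '<p style="text-align: justify;">Hero Gopichand\'s new movie</p>'
    '</td>'
)
content = extract_cell_text(sample, "celebrity-details-description-txt")
```

Checking for -1 matters in the Java version as well: substring with a negative index throws instead of returning an empty result.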

Using a pandas DataFrame inside HTML

I'm trying to insert values from a pandas DataFrame into this HTML code.
Instead of User in the Dear User row,
I want to use a specific value from the last row of my DataFrame.
Likewise, instead of Users daily amount after the Reason: row, I want a specific reason from the last row of the DataFrame.
Sample Code:
msg.attach(MIMEText(
'''
<html>
<body>
<p><img src="cid:1" width="900" height="100"></p>
<b><h1 style="text-align: left;">Pay Attention!</h1></b>
<h3 style="text-align: left;">Dear User,</h3>
<p>There has been an anomaly in the following measurement: </p>
<b><p>Users daily amount </p></b>
<h3>Reasons: </h3>
<p> - Increase in the amount of Information Request sent</p>
<p> - Unknown new rule firing </p>
<h3>Last 2 Weeks Activity</h3>
<p><img src="cid:0"></p>
</body>
</html>
''',
'html', 'utf-8'))
Hope I'm clear, Thanks a lot!
It is not entirely clear what you want to achieve, but as far as I understand, you want to pass a value from a pandas DataFrame into your HTML code. In this case, you can simply concatenate the strings.
user = str(df.iloc[-1, 0])  # last row, first column; str() so it can be concatenated
html = '''
<html>
<body>
<p><img src="cid:1" width="900" height="100"></p>
<b><h1 style="text-align: left;">Pay Attention!</h1></b>
<h3 style="text-align: left;">Dear ''' + user + ''',</h3>
<p>There has been an anomaly in the following measurement: </p>
<b><p>Users daily amount </p></b>
<h3>Reasons: </h3>
<p> - Increase in the amount of Information Request sent</p>
<p> - Unknown new rule firing </p>
<h3>Last 2 Weeks Activity</h3>
<p><img src="cid:0"></p>
</body>
</html>
'''
msg.attach(MIMEText(html, 'html'))
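When more than one value has to go into the message, a template with named placeholders may be easier to maintain than concatenation. The sketch below uses string.Template from the standard library and escapes the values with html.escape. The function name build_html, the shortened template, and the column names user and reason in the comment are assumptions for illustration, not part of the original code.

```python
from html import escape
from string import Template

# Shortened version of the e-mail body with two named placeholders.
BODY = Template("""\
<html>
<body>
<h3 style="text-align: left;">Dear $user,</h3>
<p>There has been an anomaly in the following measurement: </p>
<h3>Reasons: </h3>
<p> - $reason</p>
</body>
</html>
""")


def build_html(user, reason):
    # str() accepts non-string cell values; escape() keeps them from breaking the markup.
    return BODY.substitute(user=escape(str(user)), reason=escape(str(reason)))


# With a DataFrame this might be called as (column names are hypothetical):
#   last = df.iloc[-1]
#   html = build_html(last["user"], last["reason"])
html = build_html("Alice & Bob", "Unknown new rule firing")
```

The resulting string can be passed to MIMEText(html, 'html') exactly as in the answer above.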

XPath issues selecting <span> elements nested in <td>

I'm trying to extract text from a lot of XHTML documents with a program that uses XPath queries to map the text into a structured table. The XHTML document looks like this:
<td class="td-3 c12" valign="top">
<p class="pa-4">
<span class="ca-5">text I would like to select </span>
</p>
</td>
<td class="td-3 c13" valign="top">
<p class="pa-2">
<span class="ca-0">some more text I want to select </span>
</p>
<p class="pa-2">
<span class="ca-0">
<br>
</br>
</span>
</p>
<p class="pa-2">
<span class="ca-5">text and values I don't want to select.</span>
</p>
<p class="pa-2">
<span class="ca-5"> also text and values I don't want to </span>
</p>
</td>
I'm able to select the spans by their class and retrieve the text/values; however, they're not unique enough and I need to filter by the td classes. For example, I want only the text from the span with class ca-0 that is a child of the td with class td-3 c13,
which would be <span class="ca-0">some more text I want to select </span>
I've tried all these combinations:
//xhtml:td[@class="td-3 c13"]/xhtml:span[@class = "ca-0"]
//xhtml:span[@class = "ca-0"] //ancestor::xhtml:td[@class= "td-3 c13"]
//xhtml:td[@class="td-3 c6"]//xhtml:span[@class = "ca-0"]
I'm not sure how much your sample XML reflects your actual XML, but strictly based on your sample (and disregarding the namespace issues you will probably face), the following XPath expression:
//td[contains(@class,"td-3")]/p[1]/span/text()
selects
text I would like to select
some more text I want to select
According to the doc, and to support namespaces, you should write something like this (fn:...):
//*:td[fn:contains(@class,"td-3")]/*:p[1]/*:span
Or with a namespace binding:
node.xpath("//xhtml:td[fn:contains(@class,'td-3')]/xhtml:p[1]/xhtml:span", {"xhtml":"http://example.com/ns"})
This expression should work too (it selects the first span of the first p of each td element):
//*:td/*:p[1]/*:span[1]
Side notes:
Your XPath expressions could be fixed. The span is not a child of the td but a descendant, so we use //. We wrap the expression in () to keep the first result only.
(//xhtml:td[@class="td-3 c13"]//xhtml:span[@class = "ca-0"])[1]
(//xhtml:td[@class="td-3 c6"]//xhtml:span[@class = "ca-0"])[1]
Replace // with a predicate []:
(//xhtml:span[@class = "ca-0"][ancestor::xhtml:td[@class= "td-3 c13"]])[1]
Test your XPath with: https://docs.marklogic.com/cts.validIndexPath
The solution is
//td[(@class ="td-3") and (@class = "c13")]/p/span
For some reason it sees
<td class="td-3 c13">
as separate classes, e.g.
<td class = "td-3" and class = "c13"
so you need to treat them as such.
Thanks to @E.Wiest and @JackFleeting for validating and pointing me in the right direction.
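For comparison, here is a minimal sketch with Python's xml.etree.ElementTree, which is an assumption of this example (the thread uses a different XPath engine and its behaviour may differ). It shows two ways to filter on the td classes: an exact match on the whole class attribute, and a token match that splits the attribute into individual class names. The XHTML namespace URI and the trimmed document are also assumptions.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
NS = {"xhtml": XHTML}

# Trimmed, well-formed version of the sample document.
DOC = """\
<html xmlns="http://www.w3.org/1999/xhtml"><body><table><tr>
<td class="td-3 c12" valign="top">
<p class="pa-4"><span class="ca-5">text I would like to select </span></p>
</td>
<td class="td-3 c13" valign="top">
<p class="pa-2"><span class="ca-0">some more text I want to select </span></p>
<p class="pa-2"><span class="ca-5">text and values I don't want to select.</span></p>
</td>
</tr></table></body></html>
"""

root = ET.fromstring(DOC)

# Exact match on the whole attribute value; // reaches spans at any depth below the td.
exact = root.findall(".//xhtml:td[@class='td-3 c13']//xhtml:span[@class='ca-0']", NS)

# Token match: split the class attribute and test each class name separately.
by_token = [
    span
    for td in root.iter("{%s}td" % XHTML)
    if {"td-3", "c13"} <= set(td.get("class", "").split())
    for span in td.iter("{%s}span" % XHTML)
    if "ca-0" in span.get("class", "").split()
]

first_text = exact[0].text.strip()
```

The token match keeps working if the order of the class names changes or more classes are added, which the exact string comparison does not.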

ASP.NET - Image not rendered at runtime

Please help me with a weird issue I am facing in my web application. In the scenario below, an image (company logo) is not rendered at runtime.
I have designed a very simple page named default2.aspx. The intention of this page is to work as a maintenance page in case of application downtime.
The logic is implemented with a SiteIsActive switch key in web.config, where "N" represents downtime.
In Global.asax I have implemented the "Application_BeginRequest" method to check the SiteIsActive flag and redirect to default2.aspx. The redirect works fine for both SiteIsActive values, but the image (company logo) is not rendered in default2.aspx at runtime. However, when I browse default2.aspx directly by setting it as the startup page, the logo is displayed.
Note: This scenario is being tried from within Visual Studio 2015 and has not yet been deployed.
Any suggestions would be appreciated as to why the image is not rendered when the switch is set (SiteIsActive = N) and the request is redirected to default2.aspx.
Code for default2.aspx below:
<%@ Page Language="C#" AutoEventWireup="true" CodeFile="Default2.aspx.cs" Inherits="Default2" %>
<!DOCTYPE html>
<head runat="server">
<link href="~/Image/BrightSide.css" rel="stylesheet" type="text/css"/>
<title>AAA</title>
</head>
<body id="bodyindex">
<div id="wrap" style="width: 1000px;">
<div id="header">
<table style="background-color: rgb(255, 255, 255);" width="100%">
<tbody>
<tr>
<td width="15%">
<img src="~Image/AAA_Image.jpg" /></td>
<td width="32%">
<h1 style="font: bolder 3.1em 'Trebuchet MS',Arial,Sans-serif;">AAAy</h1>
</td>
<td align="right" width="53%"> </td>
</tr>
</tbody>
</table>
<br>
<br>
</div>
<div id="content-wrap">
<h1 align="center">The AAA website is currently unavailable due to scheduled maintenance.<br />
<br />
We are sorry for the inconvenience. Please re-visit the site.
<!-- <div class="caption" style="height:50px;"><div class="content"></div></div> -->
</h1>
<br>
<br>
<br>
<div style="clear: both;">
<!-- wrap lower portion-->
<div style="width: 75%; float: left;">
<h1> </h1>
<p> </p>
</div>
<br>
<!-- content-wrap ends here -->
</div>
<!-- footer starts here -->
<div id="footer">
<div class="footer-left">
<p class="align-left">
© 2017 <strong>AAA</strong>
</p>
</div>
<div class="footer-right">
<p class="align-right"> </p>
</div>
</div>
<!-- footer ends here -->
<!-- wrap ends here -->
</div>
</div>
</body>
</html>
Code defined in Application_BeginRequest.
The logic I implemented is between the two comments "// Logic for maintainence page". Please also note that I have included some logic for IP address exclusion: for the listed IPs the regular website is shown, and for all others the maintenance page is displayed:
public void Application_BeginRequest(object sender, EventArgs e)
{
// Check the application validity
LicenseHelper.CheckValidity();
// Enable request debugging
// Application start events
FirstRequestInitialization(sender, e);
CMSRequest.BeforeBeginRequest(sender, e);
// Check if Database installation needed
if (InstallerFunctions.InstallRedirect())
{
return;
}
// Enable debugging
SetInitialDebug();
CMSRequest.AfterBeginRequest(sender, e);
// Logic for maintainence page
HttpContext context = HttpContext.Current;
string maintenancePage = System.Configuration.ConfigurationManager.AppSettings["Maintenance_RedirectTo"];
string siteIsActive = System.Configuration.ConfigurationManager.AppSettings["SiteIsActive"];
if (siteIsActive != null && siteIsActive.Equals("N", StringComparison.CurrentCultureIgnoreCase))
{
if (!context.Request.Path.Equals("/" + maintenancePage))
{
string exceptionIP = string.Empty;
string ips = System.Configuration.ConfigurationManager.AppSettings["Maintenance_Redirect_Exceptions"];
string ipAddress2 = Request.ServerVariables["HTTP_X_FORWARDED_FOR"];
if (ipAddress2 == null || ipAddress2.ToLower() == "unknown")
ipAddress2 = Request.ServerVariables["REMOTE_ADDR"];
//ipAddress2 = "127.0.0.1";
if (ips != null)
{
System.Collections.Generic.List<string> ipAddressList = new System.Collections.Generic.List<string>(
ips.Split(new char[] { '|' }, StringSplitOptions.RemoveEmptyEntries));
if (ipAddressList != null)
{
exceptionIP = ipAddressList.Find(delegate (string ipAddress)
{
return ipAddress.Equals(ipAddress2, StringComparison.CurrentCultureIgnoreCase);
});
}
}
if (exceptionIP == null || exceptionIP.Length <= 0)
{
this.Response.Redirect(maintenancePage);
}
}
}
// Logic for maintainence page
}
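To reason about which requests the check above redirects, the decision can be modelled as a small pure function. This is only a Python sketch of the logic as written in the handler, not ASP.NET code; the function and parameter names are made up for illustration. Whether the handler actually runs for static files such as images depends on the hosting configuration, which the post does not show.

```python
def should_redirect(path, site_is_active, maintenance_page, client_ip, exceptions):
    """Model of the check: True when the request would be sent to the maintenance page."""
    if site_is_active is None or site_is_active.upper() != "N":
        return False  # site is active, nothing to do
    if path == "/" + maintenance_page:
        return False  # the maintenance page itself is never redirected
    # Pipe-separated list of exempt IPs, compared case-insensitively as in the handler.
    exempt = [ip.lower() for ip in (exceptions or "").split("|") if ip]
    return client_ip.lower() not in exempt


# Under this model, a request for the logo is treated like any other path.
logo = should_redirect("/Image/AAA_Image.jpg", "N", "default2.aspx", "10.0.0.5", "127.0.0.1")
page = should_redirect("/default2.aspx", "N", "default2.aspx", "10.0.0.5", "127.0.0.1")
```

In this model only the exact maintenance page path is exempt, so any other path requested by a non-exempt IP, including an image or stylesheet path, yields a redirect.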

HTML Agility get text from paragraph tags in a div

I'm trying to get the text of the paragraph tags in a div using HtmlAgilityPack 2.28 in a Windows Phone 8.1 app.
The structure of the div is:
<div id="55">
<p> </p>
<p><span class="dropcap">W
</span><span class="zw-portion"><strong>ith the start of festive season in India</strong>, we
will also witness the f<strong>irst London Derby</strong> of the season
between the newly London rivals <strong>Chelsea and Arsenal</strong>. It will be a great chance
for Arsene Wenger to get rid of his <strong>1000</strong></span>
<strong><span class="zw-portion">th</span><span class="zw-portion"> managed </span>
<span class="zw-portion">6-0 </spa>
<span class="zw-portion">massacre</span></strong>
<span class="zw-portion"> in March,</span>
<span class="zw-portion"> </span>
<span class="zw-portion">while the Special One will be eager to continue his winning rampage
</span>
<span class="zw-portion"> </span>
<span class="zw- portion">over his “<strong>Specialist in Failure</strong>” counterpart. Although
both clubs can boast of being unbeaten this season and both clubs can take this opportunity
</span>
<span class="zw-portion"> to bring down their rival</span><span class="zw-portion">.</span></p>
<p> </p>
<p><iframe width="640" height="360" src="https://www.youtube.com/embed/zFBN8M1pCxo?
feature=oembed" frameborder="0" allowfullscreen=""></iframe></p>
<p class="zw-paragraph" data-textformat="
{"type":"text","td":"none"}"></p>
<p class="zw-paragraph" data-textformat=
{"type":"text","td":"none"}">
<span class="zw-portion">The rivalry between Chelsea and Arsenal was not as a primary London
Derby, until Chelsea rose to top of Premier League in 2000’s, when they consistently competed
against each other. The rivalry between the two clubs rose higher as compared to their
traditional rivals. Both the clubs rivalry are now not only limited to their pitch but has also
been to the fans. In 2009 survey by Football Fans Census, Arsenal fans named Chelsea as the
<strong>most disliked club</strong> </span>
<span class="zw-portion"> ahead of their traditional rivals <strong>Manchest</strong></span>
<strong> <span class="zw-portion">er United and Tottenham Hotspur</span></strong>
<span class="zw-portion">. However the report of the other camp doesn’t differ much as Chelsea
fans ranks Arsenal as their <strong>second most-disliked club</strong></span>
<strong><span class="zw-portion">.
</span></strong></p>
</div>
I want to extract only the text contained within the paragraph elements within the div.
I have written the following code so far, where feedurl contains the address of the page from which data is to be extracted (the correct address is extracted). After that I try to get a reference to the div using its id (which is always equal to 55).
var feedurl = GetValue("feedurl");
string htmlPage = "asdsad";
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(feedurl);
htmldoc.OptionUseIdAttribute=true;
HtmlNode div = htmldoc.GetElementbyId("55");
if (div != null)
{
htmlPage += "done";
}
_content = htmlPage;
return _content;
htmldoc.GetElementbyId("55") returns null.
I've read that I should use htmldoc.DocumentNode.SelectNodes([arguments]), but there is no SelectNodes method available to me, and I'm lost on how to proceed. Please help.
The HtmlAgilityPack version for WP 8.1 doesn't support SelectNodes() because that method requires an XPath implementation, which is unfortunately missing in the .NET version for WP 8.1.
The solution is to use HtmlAgilityPack's LINQ API instead of XPath. For example, to get the <div> element whose id attribute equals 55:
HtmlNode div55 = htmldoc.DocumentNode
.Descendants("div")
.FirstOrDefault(o => o.GetAttributeValue("id", "")
== "55");
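The same traversal-and-filter idea, without XPath, can be sketched in Python with the standard library's html.parser. This is only an illustration of the approach (find the div by its id attribute, then collect the text inside each paragraph), not HtmlAgilityPack code; the class name ParagraphText and the shortened sample markup are assumptions.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of every <p> inside the <div> with the given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.div_depth = 0      # > 0 while inside the target div (counts nested divs)
        self.in_p = False
        self.paragraphs = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1
            elif dict(attrs).get("id") == self.div_id:
                self.div_depth = 1
        elif tag == "p" and self.div_depth:
            self.in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "p" and self.in_p:
            self.in_p = False
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:                                     # skip empty paragraphs
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self.in_p:
            self._buf.append(data)


sample = (
    '<div id="55"><p> </p>'
    '<p><span class="dropcap">W</span><span><strong>ith the start</strong>, we</span></p>'
    '</div><div id="56"><p>other</p></div>'
)
parser = ParagraphText("55")
parser.feed(sample)
paragraphs = parser.paragraphs
```

Text from nested span and strong elements is concatenated into its enclosing paragraph, and paragraphs outside the target div are ignored.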

Extract Specific Text from Html Page using htmlagilitypack

Most of my issue has been solved, but I have one small problem.
This is the HTML:
<tr>
<td class="ttl">
</td>
<td class="nfo">- MP4/H.263/H.264/WMV player<br />
- MP3/WAV/еAAC+/WMA player<br />
- Photo editor<br />
- Organizer<br />
- Voice command/dial<br />
- Flash Lite 3.0<br />
- T9</td>
</tr>
Currently I am using this code provided by a Stack Overflow user:
var text1 = htmlDoc.DocumentNode.SelectNodes("//td[@class='nfo']")[1].InnerHtml;
textBox1.Text = text1;
Now the problem is that it returns all the text including the <br> tags. How can I remove the <br> tags and put a comma between the items? It should look like this:
MP4/H.263/H.264/WMV player,- MP3/WAV/еAAC+/WMA player,- Photo editor,- Organizer,- Voice command/dial,- Flash Lite 3.0,- T9
Also, how do I get this?
<div id="ttl" class="brand">
<h1>Nokia C5-03</h1>
<p><img src="http://img.gsmarena.com/vv/logos/lg_nokia.gif" alt="Nokia" /></p>
</div>
I am trying this:
var text41 = htmlDoc.DocumentNode.SelectNodes("//div id[@class='brand']")[0].InnerText;
but I get an invalid token error. I only want C5-03, without the Nokia text.
You can simply use string.Replace("<br />", "") to remove the <br /> tags.
Better yet, use InnerText instead of InnerHtml, so no HTML comes through:
var text1 = htmlDoc.DocumentNode.SelectNodes("//td[@class='nfo']")[1].InnerText;
If you really want to replace all <br /> tags with a comma, you will indeed need to use Replace on the InnerHtml value. Strings are immutable, so assign the result:
text1 = text1.Replace("<br />", ",");
To select the value in the <h1> tag, you could use:
var text42 = htmlDoc.DocumentNode.SelectNodes("//div[@id='ttl']/h1")[0].InnerText;
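The two string steps can be sketched independently of HtmlAgilityPack. The Python snippet below is an illustration under stated assumptions: it splits the cell's inner HTML on any <br> variant and joins the pieces with commas, and it drops a leading brand name from the heading text. The helper names br_to_commas and strip_brand are invented for this example, and the brand is assumed to be known (for instance from the img alt attribute).

```python
import re


def br_to_commas(inner_html):
    """Split on <br>, <br/> or <br /> (any case) and join the trimmed pieces with commas."""
    parts = re.split(r"<br\s*/?>", inner_html, flags=re.IGNORECASE)
    return ",".join(p.strip() for p in parts if p.strip())


def strip_brand(title, brand):
    """Remove a leading brand name from a heading such as 'Nokia C5-03'."""
    if title.startswith(brand):
        return title[len(brand):].strip()
    return title.strip()


cell = "- MP4/H.263/H.264/WMV player<br />\n- Photo editor<br />\n- T9"
features = br_to_commas(cell)
model = strip_brand("Nokia C5-03", "Nokia")
```

Splitting with a regular expression covers the different spellings of the tag, which a single Replace("<br />", ",") call would miss.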