Extract the thread head and thread reply from a forum - html

I want to extract only the thread title and the users' replies from a forum page. With the code below, supplying a URL returns all of the page text. I only want the thread heading, which is in the title tag, and the user replies, which are inside the content div tags. How can I extract just these, and how can I write the result to a txt file?
package extract;
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
public class TestJsoup
{
public void SimpleParse()
{
try
{
Document doc = Jsoup.connect("url").get();
doc.body().wrap("<div></div>");
doc.body().wrap("<pre></pre>");
String text = doc.text();
// Converting nbsp entities
text = text.replaceAll("\u00A0", " ");
System.out.print(text);
}
catch (IOException e)
{
e.printStackTrace();
}
}
public static void main(String args[])
{
TestJsoup tjs = new TestJsoup();
tjs.SimpleParse();
}
}

Why do you wrap the body element in a div and a pre tag?
The title-Element can be selected like this:
Document doc = Jsoup.connect("url").get();
Element titleElement = doc.select("title").first();
String titleText = titleElement.text();
// Or shorter ...
String titleText = doc.select("title").first().text();
Div-Tags:
// Document 'doc' as above
Elements divTags = doc.select("div");
for( Element element : divTags )
{
// Do something there ... eg. print each element
System.out.println(element);
// Or get the Text of it
String text = element.text();
}
Here's an overview of the whole Jsoup Selector API; it will help you find any kind of element you need.
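The question also asks how to write the result to a text file. Here is a minimal sketch using only the Java standard library, assuming the title and reply strings have already been extracted with the selectors above. The class name ThreadTextWriter, the method writeThread and the file name thread.txt are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThreadTextWriter {

    // Writes the title, then a blank line, then each reply as its own entry.
    // A reply that itself contains line breaks will span several lines.
    public static void writeThread(Path target, String title, List<String> replies)
            throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(title);
        lines.add("");
        lines.addAll(replies);
        // Creates the file, or overwrites it if it already exists.
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values (assumption): in practice these would be the
        // strings returned by titleElement.text() and element.text().
        writeThread(Paths.get("thread.txt"), "Example thread title",
                Arrays.asList("First reply", "Second reply"));
    }
}
```

Collecting the text into a list first keeps the file handling in one place, so the jsoup loop only has to call element.text() and add the result to the list.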

In the end I used different code and collected the data from these specific tags:
Elements content = doc.getElementsByTag("blockquote");
Elements k=doc.select("[postcontent restore]");
content.select("blockquote").remove();
content.select("br").remove();
content.select("div").remove();
content.select("a").remove();
content.select("b").remove();


How to parse over different elements in a HTML document (in Dart/Flutter) and keep the order intact

I have a large HTML document containing important information of different types in sequence.
I'm parsing it in Dart/Flutter, and obtaining the raw information is fine.
My problem is that parsing for elements of the different types/names (images, text, headings etc.) loses the order in which the elements appear relative to each other in the document.
E.g. a heading, then an image, then some text, then another image, then some text.
I really need the equivalent to this: html.getElementsByTagName('title' or 'p' or 'whatever-else-I-need'). Then I can process in the loop and output my model in a properly sequenced list.
Parsing sequence-critical information of different element tags / data types must be a common occurrence. Much appreciated.
I'm not an expert on package:html (nor with HTML and CSS in general), but I think that you can use Document.querySelectorAll with an appropriate selector string:
import 'package:html/parser.dart' as html;
void main() {
var htmlStr = r'''
<html>
<head>
<title>My title</title>
</head>
<body>
<p>Lorem ipsum</p>
<img src="foo.png">
</body>
</html>
''';
var document = html.parse(htmlStr);
var elements = document.querySelectorAll('title,p,img');
elements.forEach(print);
// Prints:
// <html title>
// <html p>
// <html img>
}
If a selector doesn't do what you want, you could write a function that walks the tree:
import 'package:html/dom.dart' as dom;
/// Walks [document] and invokes [elementCallback] on each element using a preorder
/// traversal.
///
/// [elementCallback] should return true to continue walking the tree, false to
/// abort.
void walk(dom.Document document, bool Function(dom.Element) elementCallback) {
var stack = <dom.Element>[];
stack.addAll(document.children.reversed);
while (stack.isNotEmpty) {
var element = stack.removeLast();
if (!elementCallback(element)) {
break;
}
stack.addAll(element.children.reversed);
}
}
and then you could run walk with an appropriate callback that conditionally adds each Element to some List, e.g.:
var elements = <dom.Element>[];
var wantedTags = {'title', 'p', 'img'};
walk(document, (element) {
if (wantedTags.contains(element.localName)) {
elements.add(element);
}
return true;
});

Can use of HtmlAgilityPack be modified to only extract main part of HTML document?

I have some .NET code that ingests HTML files and extracts text from them. I am using HtmlAgilityPack to do the extraction. Previously I wanted to extract most of the text that was there, so it worked fine. Now requirements have changed and I need to extract text only from the main body of the document. So suppose I scraped HTML from a news webpage. I just want the text of the article, not the ads, the titles of other (albeit related) articles, headers/footers etc.
Is it possible to modify my calls to HtmlAgilityPack to extract only the main text? Or is there an alternative way to do the extraction?
Here's the current block of code that gets text from HTML:
using System.IO; // StringWriter, TextWriter
using HtmlAgilityPack;
public string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringWriter sw = new StringWriter();
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
public void ConvertTo(HtmlNode node, TextWriter outText)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
break;
// get text
html = ((HtmlTextNode) node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
break;
// check the text is meaningful and not a bunch of whitespaces
if (html.Trim().Length > 0)
{
outText.Write(HtmlEntity.DeEntitize(html));
}
break;
case HtmlNodeType.Element:
switch (node.Name)
{
case "p":
// treat paragraphs as crlf
outText.Write("\r\n");
break;
}
if (node.HasChildNodes)
{
ConvertContentTo(node, outText);
}
break;
}
}
private void ConvertContentTo(HtmlNode node, TextWriter outText)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText);
}
}
So, ideally, I want HtmlAgilityPack to determine which parts of the input HTML constitute the "main" text block and extract text from only those elements. I do not know what the structure of the input HTML will be, but I do know that it will vary a lot (before, it was much more static).

Compile CSS into HTML as Inline Style in Grails?

I want to generate GSP templates for HTML emails. To support more mail clients, it is recommended to use inline CSS in the style attributes of the HTML elements.
Here is a discussion on that topic: "Compile" CSS into HTML as inline styles
Is there a Grails plugin where I can specify certain GSP files for which the CSS should be compiled as inline?
If there is no plugin, how can I specify GSP files for which the CSS should be compiled inline?
Here is an example. I have the following GSP templates for my html mails that I send with the Grails mail plugin.
/mail/signup_mail.gsp
/mail/welcome.gsp
/mail/newsletter.gsp
Each GSP file includes a style.css file. This should be compiled inline.
We do this with a free method on the Mailchimp API. You can also use Premailer.
http://apidocs.mailchimp.com/api/1.2/inlinecss.func.php
http://premailer.dialect.ca/
You can fit the following Java code in your grails application.
import java.io.IOException;
import java.util.StringTokenizer;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class AutomaticCssInliner {
public static void main(String[] args) throws IOException {
final String style = "style";
final String html = "<html>" + "<body> <style>"
+ "body{background:#FFC} \n p{background:red}"
+ "body, p{font-weight:bold} </style>"
+ "<p>...</p> </body> </html>";
// Document doc = Jsoup.connect("http://mypage.com/inlineme.php").get();
Document doc = Jsoup.parse(html);
Elements els = doc.select(style);// to get all the style elements
for (Element e : els) {
String styleRules = e.getAllElements().get(0).data().replaceAll(
"\n", "").trim(), delims = "{}";
StringTokenizer st = new StringTokenizer(styleRules, delims);
while (st.countTokens() > 1) {
String selector = st.nextToken(), properties = st.nextToken();
Elements selectedElements = doc.select(selector);
for (Element selElem : selectedElements) {
String oldProperties = selElem.attr(style);
selElem.attr(style,
oldProperties.length() > 0 ? concatenateProperties(
oldProperties, properties) : properties);
}
}
e.remove();
}
System.out.println(doc);// now we have the result html without the
// styles tags, and the inline css in each
// element
}
private static String concatenateProperties(String oldProp, String newProp) {
oldProp = oldProp.trim();
if (!newProp.endsWith(";"))
newProp += ";";
return newProp + oldProp; // The existing (old) properties should take precedence.
}
}

How to add a <body> element to a manually generated Document?

I'm attempting to use JSoup to generate HTML from nothing i.e. not parsing a file, but rather generating HTML output in order to display the data in an object. I'm brand new to JSoup and have been looking for some examples of how to use it to generate HTML but haven't found much useful content for this specific task so I've been tinkering, but with minimal success. Here's some [non-working] code:
package jsouptest;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JSoupTest {
public static void main(String[] args) {
Document doc = new Document("");
Element headline = doc.body().appendElement("h1").text("Some text");
Element pTag = doc.body().appendElement("p").text("some text ...");
Element span = pTag.prependElement("span").text("MoarTxt");
}
}
This line:
Element headline = doc.body().appendElement("h1").text("Some text");
Throws a NullPointerException. Through some trial and error, I believe that I've determined the problem is that doc.body() isn't defined anywhere. I assumed (apparently, incorrectly) that a newly instantiated Document would come with an empty body. That doesn't seem to be the case, however. I can't figure out if I need to instantiate a new body element. I've read through the javadoc for the Document class but don't see any kind of factory or setter methods that would generate the body element for me.
Recommendations for resources beyond the JSoup API Javadoc that might be helpful are welcome as well.
To append a <body> element to a newly created document, in its simplest form, use:
doc.appendElement("body");
Here's your full code:
public static void main(String[] args) {
Document doc = new Document("");
doc.appendElement("body");
Element headline = doc.body().appendElement("h1").text("Some text");
Element pTag = doc.body().appendElement("p").text("some text ...");
Element span = pTag.prependElement("span").text("MoarTxt");
System.out.println(doc);
}
Output:
<body>
<h1>Some text</h1>
<p><span>MoarTxt</span>some text ...</p>
</body>
As for documentation, I believe you are already in the right place: their official site is the best resource. I'd also take a look at their cookbook.

Escape HTML tags in XAML code

How can I escape HTML tags in XAML code?
For example, I want to show <b>text</b> literally in XAML content that is put into a RichTextBox as follows:
private void button1_Click(object sender, RoutedEventArgs e)
{
string mystring = "<b>test</b>";
MyRTB.Blocks.Add(Convert(@"<Bold>" + mystring + "</Bold>"));
}
static public Paragraph Convert(string text)
{
String formattedText = ParaHead + text + ParaTail;
Paragraph p = (Paragraph)XamlReader.Load(formattedText);
return p;
}
I tried multiple combinations of {} and so on, but none of them work, and I would rather not use hex escapes if I can avoid it.
Thanks in advance
You just need to XML-escape it by replacing < with &lt;.
The built-in SecurityElement.Escape or WebUtility.HtmlEncode functions will do that for you.
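To make the effect of the escaping concrete, here is a minimal sketch written in Java rather than .NET. It is not the implementation of SecurityElement.Escape or WebUtility.HtmlEncode; the class XmlEscapeSketch and the method escapeXml are hypothetical names, and the code only replaces the five characters that XML predefines entities for.

```java
public class XmlEscapeSketch {

    // Replaces the five XML special characters with their predefined entities.
    public static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String myString = "<b>test</b>";
        // Only the user-supplied part is escaped; the wrapping Bold tags stay markup.
        // Prints: <Bold>&lt;b&gt;test&lt;/b&gt;</Bold>
        System.out.println("<Bold>" + escapeXml(myString) + "</Bold>");
    }
}
```

The point of the design is that escaping is applied to the data before it is concatenated into the markup, so the XAML parser sees the tags of the inner string as plain text.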