Jsoup filter out only some tags from html to text - html

Can any jsoup expert suggest how to convert HTML to text while filtering out only some tags? I've tried calling text() on the Document, but that strips every tag/element. My aim is to strip only certain specified tags.
For example, given HTML like:
<div>hello<p>world</div>,<table><tr><td>xxx</td></tr>
I want to get:
<div>hello<p>world</div>,xxx
where only the table-related tags have been filtered out.

I can't test this right now but I think you want to write a recursive function that steps through the tree and prints each node based on a condition. The following is an example of what it might look like but I expect that you will have to modify it to suit your needs more precisely.
Document doc = Jsoup.parse(page_text);
recursive_print(doc.head());
recursive_print(doc.body());
...
// Tag names whose elements should not be printed
private static Set<String> ignore = new HashSet<String>(){{
add("table");
...
}};
public static void recursive_print(Element el){
// Compare the tag name (not the class attribute) against the ignore set
if(!ignore.contains(el.tagName()))
System.out.println(el.html()); // html() is the inner HTML, descendants included
for(Element child : el.children())
recursive_print(child);
}

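You can use a Whitelist to achieve this. For example:
Whitelist whiteList = new Whitelist();
whiteList.addTags("div", "p", "td");
String cleaned = Jsoup.clean(html, whiteList); // html: your input string
Every tag that is not on the list is removed when the list is passed to Jsoup.clean.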

How to make new document with JXA?

How do I make a new document and then close it? I need this to work around Apple's buggy automation. This is what I tried:
var app = Application('Keynote')
var doc = app.make(new document) // How to write this correctly?
doc.close({saving: 'no'})
AppleScript and JavaScript syntax are completely different. You have to think more in terms of JavaScript.
For example, JXA doesn't understand make(new).
You have to create an instance from the class name (note the uppercase spelling) and then call make().
Also, the var keywords and the trailing semicolons are not required.
keynote = Application('Keynote')
keynote.activate()
newDocument = keynote.Document().make()
Within the parentheses of Document() you can pass parameters, similar to AppleScript’s with properties, for example:
newDocument = keynote.Document({
documentTheme: keynote.themes["Gradient"],
width:1920,
height:1080
})
AppleScript’s multiple-word properties, such as document theme, are written as one camelCased word.
To close the frontmost document, write:
keynote.documents[0].close()

Allow using some html tags in MVC 4

How can I allow clients to use HTML tags in MVC 4?
I would like to save records to the database and, when they are rendered in the view, allow only some HTML tags (<b>, <i>, <img>); all other tags must be displayed as text.
My Controller:
[ValidateInput(false)]
[HttpPost]
public ActionResult Rep(String a)
{
var dbreader = new DataBaseReader();
var text = Request["report_text"];
dbreader.SendReport(text, uid, secret).ToString();
...
}
My View:
@{
var dbreader = new DataBaseReader();
var reports = dbreader.GetReports();
foreach (var report in reports)
{
<div class="report_content">@Html.Raw(report.content)</div>
...
}
}
Replace every < character with its HTML entity:
tags = tags.Replace("<", "&lt;");
Now restore only the allowed tags:
tags = tags
.Replace("&lt;b>", "<b>")
.Replace("&lt;/b>", "</b>")
.Replace("&lt;i>", "<i>")
.Replace("&lt;/i>", "</i>")
.Replace("&lt;img ", "<img ");
Then render it to the page using @Html.Raw(tags).
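The escape-then-restore idea above can be sketched outside ASP.NET as well. The following is a minimal Java illustration, not the answerer's code: the class name AllowListEscaper and the method escapeExceptAllowed are hypothetical, and the allow list is limited to attribute-free b and i tags. Unlike the snippet above, it also escapes & and >, and it deliberately leaves img out, because restoring a tag that carries attributes would need attribute checks as well.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of ASP.NET MVC or any library.
final class AllowListEscaper {
    // Exact tag strings that are restored after escaping (attribute-free tags only).
    private static final List<String> ALLOWED = List.of("<b>", "</b>", "<i>", "</i>");

    static String escapeExceptAllowed(String input) {
        // Step 1: escape everything, so no markup survives by default.
        // '&' goes first so that entities already present in the input stay inert.
        String out = input
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        // Step 2: turn only the allowed tags back into real markup.
        for (String tag : ALLOWED) {
            String escaped = tag.replace("<", "&lt;").replace(">", "&gt;");
            out = out.replace(escaped, tag);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(escapeExceptAllowed("<b>bold</b> <script>alert(1)</script>"));
    }
}
```

With this sketch the b tags come back as markup while the script tags stay escaped as text.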
If you want a property of your view model object to accept HTML text, use AllowHtmlAttribute:
[AllowHtml]
public string UserComment{ get; set; }
and before binding to the view
model.UserComment=model.UserComment.Replace("<othertagstart/end>",""); //hard
Turn off validation for report_text only (1) and write a custom HTML encoder (2):
Step 1:
Request.Unvalidated().Form["report_text"]
More info here. You don't need to turn off validation for the entire controller action.
Step 2:
Write a custom HTML encoder that converts all tags except b, i and img to entities (e.g. <script> -> &lt;script&gt;), since you are customizing the default behaviour of request validation and HTML tag filtering. Also consider safeguarding yourself against SQL injection attacks by checking the SQL parameters passed to stored procedures, functions, etc.
You may want to check out BBCode (see BBCode on Wikipedia). This way you have some control over what is allowed and what is not, and you can prevent illegal usage.
This would work like this:
A user submits something like 'the meeting will now be on [b]monday![/b]'
Before saving it to your database you remove all real html tags ('< ... >') to avoid the use of illegal tags or code injection, but leave the pseudo tags as they are.
When viewed you convert only the allowed pseudo html tags into real html
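The steps above can be sketched roughly as follows. This is only an illustration in Java under stated assumptions: BbCodeRenderer and render are hypothetical names, only [b] and [i] are supported, and real HTML is escaped rather than removed (a swapped-in simplification of the "remove all real html tags" step, since escaping keeps the user's text visible).

```java
import java.util.regex.Pattern;

// Hypothetical renderer for illustration; supports only [b] and [i].
final class BbCodeRenderer {
    // Matches [b]...[/b] or [i]...[/i]; the backreference \1 makes the
    // closing pseudo tag match the opening one.
    private static final Pattern PAIR =
            Pattern.compile("\\[(b|i)\\](.*?)\\[/\\1\\]", Pattern.DOTALL);

    static String render(String input) {
        // Neutralise any real HTML first (assumption: escape instead of strip).
        String out = input
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        // Convert the allowed pseudo tags, repeating until nothing changes
        // so that pairs nested inside other pairs are converted too.
        String previous;
        do {
            previous = out;
            out = PAIR.matcher(out).replaceAll("<$1>$2</$1>");
        } while (!out.equals(previous));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(render("the meeting will now be on [b]monday![/b]"));
    }
}
```

Unmatched pseudo tags are simply left as literal text, which keeps the sketch from ever emitting an unclosed element.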
I found a solution to my problem (html is the already HTML-encoded text at this point):
html = Regex.Replace(html, "&lt;b&gt;(.*?)&lt;/b&gt;", "<b>$1</b>");
html = Regex.Replace(html, "&lt;i&gt;(.*?)&lt;/i&gt;", "<i>$1</i>");
html = Regex.Replace(html, "&lt;img(?:.*?)src=&quot;(.*?)&quot;(?:.*?)/&gt;", "<img src=\"$1\"/>");

Forced to use template?

This code from Dart worries me:
bool get isTemplate => tagName == 'TEMPLATE' || _isAttributeTemplate;
void _ensureTemplate() {
if (!isTemplate) {
throw new UnsupportedError('$this is not a template.');
}
...
Does this mean that the only way I can modify my document is to make it html5?
What if I want to modify an html4 document and set innerHtml in a div, how do I achieve this?
I am assuming you are asking about the code in dart:html Element?
The method you are referring to is only called by the library itself, and only in methods where isTemplate has to be true, for example this one. If you follow this link, you can also read what other fields/methods work like this.
innerHtml is a field in every subclass of Element which supports it, for example DivElement
Example:
DivElement myDiv1 = new DivElement();
myDiv1.innerHtml = "<p>I am a DIV!</p>";
query("#some_div_id").innerHtml = "<p>Hey, me too!</p>";

JQuery selectors - using html snippets as "context" in filter and find

A quick question about using context with Jquery selectors:
I'm trying to grab the text from a div element that has id="time". Can an HTML snippet be used as the context, as in the following?
// An AJAX request here returns a HTML snippet "response":
var myTime = $("#time", response).text();
The reason I'm doing this is that I want the time variable from within the html held in response, but don't want the overhead of loading all of the html into the DOM first. (it's a large amount of html).
From the comments, I understand that the response is <span id="time">blah blah</span>, which means the time element is itself the root element. That is why the child lookup is not working.
var response = '<span id="time">blah blah</span>';
var myTime = $(response).text(); // Or $(response).filter("#time").text();
alert(myTime)
Demo: Fiddle
This method uses filter() rather than find(), the difference being:
filter() – searches through the passed element set itself
find() – searches only through the descendants of the passed elements.
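To make the distinction concrete, here is a small model of the two lookups in Java. It is not jQuery code: the Node class and both methods are hypothetical stand-ins, and matching is reduced to comparing ids.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of filter() vs find(); not jQuery's implementation.
final class FilterVsFind {
    // Stand-in for a DOM element: an id plus child nodes.
    static final class Node {
        final String id;
        final List<Node> children = new ArrayList<>();

        Node(String id, Node... kids) {
            this.id = id;
            this.children.addAll(List.of(kids));
        }
    }

    // filter: keep only members of the set itself whose id matches.
    static List<Node> filter(List<Node> set, String id) {
        List<Node> hits = new ArrayList<>();
        for (Node n : set) {
            if (n.id.equals(id)) hits.add(n);
        }
        return hits;
    }

    // find: look only at descendants of the set's members, never the members themselves.
    static List<Node> find(List<Node> set, String id) {
        List<Node> hits = new ArrayList<>();
        for (Node n : set) collect(n, id, hits);
        return hits;
    }

    private static void collect(Node parent, String id, List<Node> hits) {
        for (Node child : parent.children) {
            if (child.id.equals(id)) hits.add(child);
            collect(child, id, hits);
        }
    }
}
```

When the matching node is the root of the set, filter returns it and find returns nothing; once the same node is wrapped in a parent, the results swap.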
Did you try it?
$("#time", "<div><span id=time></span></div>")[0].id //returns 'time'
From the jQuery source code:
// HANDLE: $(expr, context)
// (which is just equivalent to: $(context).find(expr)
} else {
return this.constructor( context ).find( selector );
}
so valid selectors should work in the context parameter. Personally, I prefer using find to begin with because it keeps all the selectors in the same order instead of $("second > third", "first");

Remove attributes using HtmlAgilityPack

I'm trying to create a code snippet to remove all style attributes regardless of tag using HtmlAgilityPack.
Here's my code:
var elements = htmlDoc.DocumentNode.SelectNodes("//*");
if (elements!=null)
{
foreach (var element in elements)
{
element.Attributes.Remove("style");
}
}
However, the change doesn't seem to stick. If I look at the element object immediately after Remove("style"), I can see that the style attribute has been removed, but it still appears in the DocumentNode object. :/
I'm feeling a bit stupid, but this seems off to me. Has anyone done this using HtmlAgilityPack? Thanks!
Update
I changed my code to the following, and it works properly:
public static void RemoveStyleAttributes(this HtmlDocument html)
{
var elementsWithStyleAttribute = html.DocumentNode.SelectNodes("//@style");
if (elementsWithStyleAttribute!=null)
{
foreach (var element in elementsWithStyleAttribute)
{
element.Attributes["style"].Remove();
}
}
}
Your code snippet seems to be correct: it does remove the attributes. The thing is, DocumentNode.InnerHtml (I assume this is the property you monitored) is a complex property; it may only be refreshed under certain circumstances, so you shouldn't use it to get the document as a string. Use the HtmlDocument.Save method instead:
string result = null;
using (StringWriter writer = new StringWriter())
{
htmlDoc.Save(writer);
result = writer.ToString();
}
Now the result variable holds the string representation of your document.
One more thing: your code can be improved by changing your XPath expression to "//*[@style]", which selects only the elements that have a style attribute.
Here is a very simple solution
VB.net
element.Attributes.Remove(element.Attributes("style"))
c#
element.Attributes.Remove(element.Attributes["style"])