How to add a <body> element to a manually generated Document? - html

I'm trying to use JSoup to generate HTML from scratch, i.e. not parsing a file but generating HTML output to display the data in an object. I'm brand new to JSoup and have looked for examples of generating HTML with it, but haven't found much that is useful for this specific task, so I've been tinkering with minimal success. Here's some [non-working] code:
package jsouptest;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JSoupTest {
public static void main(String[] args) {
Document doc = new Document("");
Element headline = doc.body().appendElement("h1").text("Some text");
Element pTag = doc.body().appendElement("p").text("some text ...");
Element span = pTag.prependElement("span").text("MoarTxt");
}
}
This line:
Element headline = doc.body().appendElement("h1").text("Some text");
Throws a NullPointerException. Through some trial and error, I believe that I've determined the problem is that doc.body() isn't defined anywhere. I assumed (apparently, incorrectly) that a newly instantiated Document would come with an empty body. That doesn't seem to be the case, however. I can't figure out if I need to instantiate a new body element. I've read through the javadoc for the Document class but don't see any kind of factory or setter methods that would generate the body element for me.
Recommendations for resources beyond the JSoup API Javadoc that might be helpful are welcome as well.

To append a <body> element to a newly created document, in its simplest form, use:
doc.appendElement("body");
Here's your full code:
public static void main(String[] args) {
Document doc = new Document("");
doc.appendElement("body");
Element headline = doc.body().appendElement("h1").text("Some text");
Element pTag = doc.body().appendElement("p").text("some text ...");
Element span = pTag.prependElement("span").text("MoarTxt");
System.out.println(doc);
}
Output:
<body>
<h1>Some text</h1>
<p><span>MoarTxt</span>some text ...</p>
</body>
As for documentation, I believe you are already there; their official site is the best place. I'd also take a look at their cookbook.
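Another option is to start from a document shell instead of appending the body by hand. The sketch below assumes a jsoup version that provides the static Document.createShell(String baseUri) factory, which returns a document that already contains html, head and body elements. The class name ShellExample and the build() helper are illustrative only.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ShellExample {
    // Builds the same content as the code above, starting from a shell
    // that already has <html>, <head> and <body>.
    static Document build() {
        Document doc = Document.createShell("");
        Element body = doc.body(); // not null for a shell document
        body.appendElement("h1").text("Some text");
        Element pTag = body.appendElement("p").text("some text ...");
        pTag.prependElement("span").text("MoarTxt");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

With a shell, the printed markup also includes the html and head elements, which is usually what you want when the output is meant to be a complete page.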

Related

How to get all of the headlines from a google news search using Jsoup

public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect("https://www.google.com/search?q=tesla&oq=tesla&aqs=chrome.0.69i59l3j0l3.494j0j9&sourceid=chrome&ie=UTF-8#q=tesla&tbm=nws").userAgent("Mozilla").get();
Elements links = doc.select("div[class=_cnc]");
for (Element link : links) {
Elements titles = link.select("h3.r_U6c");
String title = titles.text();
System.out.println(title);
System.out.println("Headline: " + link.text());
System.out.println("Link: " + link.attr("data-href"));
}
}
Here is the HTML layout. I want to extract the titles for each of the links. I am just not sure how to format the CSS selector portions of my code. I tried to look through some old threads but couldn't get anything to work. I am only looking for the text of the headlines, not the actual links. The print link statements were just for some testing that I couldn't get running.
Thanks guys
Picture of HTML
The page you're trying to fetch loads its content with JavaScript, and Jsoup doesn't execute JavaScript.
Use a tool such as Selenium or ui4j instead.
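To separate selector problems from the JavaScript problem, it can help to try the selectors against static markup first. The sketch below is only an illustration: the SAMPLE markup is hypothetical and simply reuses the class names exactly as written in the question, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorCheck {
    // Hypothetical static markup reusing the class names from the question.
    static final String SAMPLE =
          "<div class=\"_cnc\"><h3 class=\"r_U6c\">First headline</h3></div>"
        + "<div class=\"_cnc\"><h3 class=\"r_U6c\">Second headline</h3></div>";

    // Collects the text of every matching h3 inside a matching div.
    static List<String> headlines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        for (Element h3 : doc.select("div._cnc h3.r_U6c")) {
            titles.add(h3.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Static markup: the selector finds both headlines.
        System.out.println(headlines(SAMPLE));
        // Markup whose content would only be injected later by a script:
        // nothing matches, because Jsoup sees only what the server sent.
        System.out.println(headlines("<div id=\"app\"></div>"));
    }
}
```

If the selectors work on static markup but return nothing for the fetched page, the content is most likely added by a script after loading, which is the case the answer describes.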

Parsing html page content without using selector

I am going to parse some web pages using a Java program. For this purpose I wrote a small piece of code that parses page content using XPath as the selector. To parse different sites, you need to find the appropriate XPath for each site. The problem is that this requires an operator to find the right XPath for you (for example, using the FirePath Firefox add-on). Suppose you don't know in advance which pages you will parse, or the number of sites grows too large for an operator to find the right XPath for each one. In that case you need a way to parse pages without using any selector (the same scenario exists for CSS selectors), or there should be a way to find the XPath automatically. What methods exist for parsing web pages this way?
Here is the small piece of code I wrote for this purpose; please feel free to extend it when presenting your solutions.
public void downloadHTML(String url) throws IOException {
CleanerProperties props = new CleanerProperties();
// set some properties to non-default values
props.setTranslateSpecialEntities(true);
props.setTransResCharsToNCR(true);
props.setOmitComments(true);
// do parsing
TagNode tagNode = new HtmlCleaner(props).clean(
new URL(url)
);
// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
tagNode, "c:\\TEMP\\clean.xml", "utf-8"
);
}
public static void testJavaxXpath(String pattern)
throws ParserConfigurationException, SAXException, IOException,
FileNotFoundException, XPathExpressionException {
DocumentBuilder b = DocumentBuilderFactory.newInstance()
.newDocumentBuilder();
org.w3c.dom.Document doc = b.parse(new FileInputStream(
"c:\\TEMP\\clean.xml"));
// Evaluate XPath against Document itself
javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xPath.evaluate(pattern,
doc.getDocumentElement(), XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
Element e = (Element) nodes.item(i);
System.out.println(e.getFirstChild().getTextContent());
}
}

How to set codes in html page?

I am new to HTML and I'm making an HTML page in which I want to display some code, but it is not shown the way it was written in Notepad. Do I have to write out each and every line, or is there another solution? Suppose this is the code:
public void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.liste);
this.requestWindowFeature(Window.FEATURE_NO_TITLE);
// Setup the list view
final ListView prestListView = (ListView) findViewById(R.id.list);
final prestationAdapterEco prestationAdapterEco = new prestationAdapterEco(this, R.layout.prestation);
prestListView.setAdapter(prestationAdapterEco);
// Populate the list, through the adapter
for(final prestationEco entry : getPrestations()) {
prestationAdapterEco.add(entry);
}
prestListView.setClickable(true);
prestListView.setOnItemClickListener(new AdapterView.OnItemClickListener() {
@Override
public void onItemClick(AdapterView<?> arg0, View arg1, int position, long arg3) {
Object o = prestListView.getItemAtPosition(position);
String str=(String)o;//As you are using Default String Adapter
Toast.makeText(getApplicationContext(),str,Toast.LENGTH_SHORT).show();
}
});
}
and I want to show it on the HTML page exactly like this. Please help. Thanks.
<pre><code> code... </code></pre>
This seems to be the best method they've found here: <code> vs <pre> vs <samp> for inline and block code snippets
It also happens to be the recommended way to show a sample of computer code on W3.org.
I think you're looking for code/syntax highlighting in HTML pages. Unfortunately, HTML tags alone provide no highlighting feature.
In HTML you can use:
<code>Your code goes here..</code>
For more reference: http://www.w3schools.com/tags/tag_code.asp and the list of global attributes code tag supports: http://www.w3schools.com/tags/ref_standardattributes.asp
However, you can use a JavaScript library that enables syntax highlighting in HTML.
You can check this thread: "Syntax highlighting code with Javascript" for more info.

code tag and pre css in html not functioning properly

In HTML I am using the code tag together with the CSS shown below:
<style type="text/css">
code { white-space: pre; }
</style>
<code>
public static ArrayList<File> getFiles(File[] files){
ArrayList<File> _files = new ArrayList<File>();
for (int i=0; i<files.length; i++)
if (files[i].isDirectory())
_files.addAll(getFiles(new File(files[i].toString()).listFiles()));
else
_files.add(files[i]);
return _files;
}
public static File[] getAllFiles(File[] files) {
ArrayList<File> fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
</code>
When I use the code tag as shown above, part of the code is missing when the page is viewed. The rendered output is shown below:
public static ArrayList getFiles(File[] files){
ArrayList _files = new ArrayList();
for (int i=0; i fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
In the first method part of the code is missing, and the second method does not appear at all. What is the problem and how do I fix it?
You have these <File> sequences inside your <code> tag, and the browser treats them as markup. You need to convert the angle brackets to the &lt; and &gt; HTML entities.
Demo
<code>
public static ArrayList&lt;File&gt; getFiles(File[] files){
ArrayList&lt;File&gt; _files = new ArrayList&lt;File&gt;();
for (int i=0; i&lt;files.length; i++)
if (files[i].isDirectory())
_files.addAll(getFiles(new File(files[i].toString()).listFiles()));
else
_files.add(files[i]);
return _files;
}
public static File[] getAllFiles(File[] files) {
ArrayList&lt;File&gt; fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
</code>
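Escaping by hand gets tedious for longer listings. If the page is generated from Java, the conversion can be done in code. The following is a minimal sketch, not a full HTML encoder: it handles only the three characters that matter inside element content, and the class and method names are hypothetical.

```java
public class HtmlEscape {
    // Replaces &, < and > with their HTML entities so the browser
    // shows them as text instead of parsing them as markup.
    static String escapeHtml(String source) {
        StringBuilder out = new StringBuilder(source.length() + 16);
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "ArrayList<File> fs = getFiles(files);";
        // prints: <code>ArrayList&lt;File&gt; fs = getFiles(files);</code>
        System.out.println("<code>" + escapeHtml(line) + "</code>");
    }
}
```

The ampersand is escaped as well, so text that already contains something like &lt; is displayed literally rather than being decoded by the browser.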
As already identified by Mr. Alien, you have characters being interpreted as markup inside your <code> block.
As an alternative to escaping lots of characters, providing your code does not include the string </script, you can exploit the parsing and (non)execution behaviour of the <script> element like this:
<code>
<script type="text/x-code">
public static ArrayList<File> getFiles(File[] files){
ArrayList<File> _files = new ArrayList<File>();
for (int i=0; i<files.length; i++)
if (files[i].isDirectory())
_files.addAll(getFiles(new File(files[i].toString()).listFiles()));
else
_files.add(files[i]);
return _files;
}
public static File[] getAllFiles(File[] files) {
ArrayList<File> fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
</script>
</code>
with this CSS:
script[type=text\/x-code] {
display: block;
white-space: pre;
line-height: 20px;
margin-top: -20px;
}
See JSfiddle: http://jsfiddle.net/fZuPm/3/
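The precondition for this technique (the listing must not contain the string </script) can be checked before the page is generated. A minimal sketch, assuming the page is assembled in Java; the class and method names are hypothetical, and the check is deliberately conservative because it rejects any case-insensitive occurrence of the substring.

```java
import java.util.Locale;

public class ScriptWrapGuard {
    // True when the listing can be placed inside the script wrapper
    // without ending the element early.
    static boolean safeForScriptWrapper(String code) {
        return !code.toLowerCase(Locale.ROOT).contains("</script");
    }

    // Wraps the listing in the markup shown above, or refuses to.
    static String wrap(String code) {
        if (!safeForScriptWrapper(code)) {
            throw new IllegalArgumentException(
                "listing contains </script; escape it with entities instead");
        }
        return "<code>\n<script type=\"text/x-code\">\n"
             + code
             + "\n</script>\n</code>";
    }

    public static void main(String[] args) {
        System.out.println(wrap("ArrayList<File> fs = getFiles(files);"));
    }
}
```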
Update: In the comments, RoToRa raises some interesting points about the "correctness" of this approach, and I thank RoToRa for them.
Using a type attribute to stop the contents of a script tag from being executed as JavaScript is a well understood technique, and although the list of type names that cause script to be executed varies from browser to browser, finding one that won't cause execution is not hard.
More interesting is the question of the semantics. It is my view that the semantics of the script element are essentially inert, like a div or span element, while RoToRa's view is that it affects the semantics of the content. Looking at the specs, it is not easy to resolve. HTML 4.01 says very little about the semantics of the script element, concentrating solely on its functionality.
The HTML5 spec is not much better, but it does say "The element does not represent content for the user.". I don't know what to make of that. Saying what an element doesn't do is not very helpful. If it implies that its contents are semantically "hidden" from the user, such that its contents are not semantically part of the contents of the containing code element, then this technique should not be used.
If, however, it means that no new semantics are introduced by the script element, then there doesn't appear to be any problem.
I can't find any evidence of a script element being semantically required to contain script, as RoToRa suggests, and while it might be considered common-sense to infer that, that's not how HTML semantics works.
In many ways, this approach is really about trying to find a way to do validly what the XMP element does in browsers anyway, but is not valid. XMP was very nearly made valid in HTML5 but just missed out. The editor described it as a tough call. Using the script element like this meets that requirement, but it seems nevertheless to be controversial. If you are uncomfortable with whatever semantics you feel are being applied in this approach, I would suggest that you don't use it.

Extract the thread head and thread reply from a forum

I want to extract only the users' views and replies and the thread title from a forum. With this code, when you supply a URL, it returns everything. I want only the thread heading, which is defined in the title tag, and the user reply, which sits inside the content div tag. Please help me extract these, and explain how to write the result to a txt file.
package extract;
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
public class TestJsoup
{
public void SimpleParse()
{
try
{
Document doc = Jsoup.connect("url").get();
doc.body().wrap("<div></div>");
doc.body().wrap("<pre></pre>");
String text = doc.text();
// Converting nbsp entities
text = text.replaceAll("\u00A0", " ");
System.out.print(text);
}
catch (IOException e)
{
e.printStackTrace();
}
}
public static void main(String args[])
{
TestJsoup tjs = new TestJsoup();
tjs.SimpleParse();
}
}
Why do you wrap the body element in a div and a pre tag?
The title element can be selected like this:
Document doc = Jsoup.connect("url").get();
Element titleElement = doc.select("title").first();
String titleText = titleElement.text();
// Or shorter ...
String titleText = doc.select("title").first().text();
Div-Tags:
// Document 'doc' as above
Elements divTags = doc.select("div");
for( Element element : divTags )
{
// Do something there ... eg. print each element
System.out.println(element);
// Or get the Text of it
String text = element.text();
}
Here's an overview about the whole Jsoup Selector API, this will help you finding any kind of element you need.
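The question also asks how to write the result to a txt file. Once the title and reply texts are available as strings, the standard library is enough. This is a sketch under assumptions: the class name, the method name and the file name thread.txt are made up for illustration, and the sample strings stand in for the values you would get from titleElement.text() and element.text().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThreadWriter {
    // Writes the title on the first line, then a blank line,
    // then one reply per line, encoded as UTF-8.
    static void writeThread(Path target, String title, List<String> replies)
            throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(title);
        lines.add("");
        lines.addAll(replies);
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values; in the real program they come from Jsoup.
        writeThread(Paths.get("thread.txt"), "Example thread title",
                Arrays.asList("First reply", "Second reply"));
    }
}
```

Files.write creates the file if it does not exist and overwrites it otherwise, so each run replaces the previous output.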
Well, I used different code and collected the data from these specific tags.
Elements content = doc.getElementsByTag("blockquote");
Elements k=doc.select("[postcontent restore]");
content.select("blockquote").remove();
content.select("br").remove();
content.select("div").remove();
content.select("a").remove();
content.select("b").remove();