Parsing ONLY plain text from HTML using Kanna Swift - html

I am using Kanna Swift for HTML parsing.
For example:
How can I parse ONLY the highlighted English Text in this situation?
To be prone to something, usually something bad, means to have
a tendency to be affected by it or to do it.
<div class="caption hide_cn">
<a class="anchor" name="prone_1"></a>
<span class="num">1</span>
<span class="st" title="能被表示程度的副词或介词词组修饰的形容词">ADJ-GRADED </span>
<span class="tips_box">
<span class="lbl type-syntax">
<span class="span"> [</span>
verb-link <span class="hi rend-sc">ADJ</span>
</span>
<span class="lbl type-syntax">
<span class="span">, </span>
<span class="hi rend-sc">ADJ</span>
to-infinitive
<span class="span">]</span>
</span>
</span>
<span class="def_cn cn_before">
<span class="chinese-text">有(不好的)倾向的;易于</span>
…
<span class="chinese-text">的;很可能</span>
…
<span class="chinese-text">的</span>
</span>
To be <b>prone to</b> something, usually something bad, means to have a tendency to be affected by it or to do it.
<span class="def_cn cn_after">
<span class="chinese-text">有(不好的)倾向的;易于</span>
…
<span class="chinese-text">的;很可能</span>
…
<span class="chinese-text">的</span>
</span>
</div>
If I use:
doc.css("div[class='caption hide_cn']")
I get all the messy markup around the sentence I want.
Maybe I am wrong, but I could not find enough documentation about the usage.
For example, I learned "span[class='xxx xxx']" from Stack Overflow rather than from the documentation on the GitHub page.
Is there something like "[class != 'xxx xxx']" or "!= span"?

After some tweaks, I found a workaround, in case someone needs it later.
We can use the removeChild method to remove all the other sections.
// Search for nodes by CSS
for whole in doc.css("div[class='caption hide_cn']") {
    // Selectors for the child spans that are not part of the English sentence.
    let unwanted = [
        "span[class='num']",
        "span[class='st']",
        "span[class='tips_box']",
        "span[class='def_cn cn_before']",
        "span[class='def_cn cn_after']"
    ]
    // Search inside the current div (not the whole document) and remove each match.
    for selector in unwanted {
        if let node = whole.css(selector).first {
            whole.removeChild(node)
        }
    }
    print(whole.text ?? "")
}
It's a pity I could not find this in the documentation. I guess those packages/libs are powerful enough to do almost anything you want. You just need to tweak a little bit.

Related

How to replace <span style="text-decoration: underline;"> </span> with <u></u> tag for a dynamic string in angular 6

I have an input box where the user enters a string and applies formatting (font name, bold, italic, underline, font size, etc.) to the selected text. When the user underlines a word or sentence, the source for that text looks like <span style="text-decoration: underline;"></span>.
If the user changes the font size or font name, <span style="font-family"...> or <span style="font-size"...> tags are added.
I have to replace each <span style="text-decoration: underline;"> tag and its corresponding </span> with <u> and </u>. Replacing the opening tag is easy: I can find it and replace it. Replacing its </span> is difficult when there is more than one </span>, because I need to work out which of them belongs to the underline span.
How can I do this in Angular 6? Any help would be appreciated.
In JavaScript, you can do this kind of replacement with a regular expression:
var x = '<span style="text-decoration: underline;">Test</span>';
// Replace each underline span with a <u> element, keeping its inner content.
function replace_span_to_u(input) {
  return input.replace(/<span style="text-decoration: underline;">(.*?)<\/span>/g, function (matched, content) {
    // "content" is the first capture group: the text between the tags.
    return '<u>' + content + '</u>';
  });
}
replace_span_to_u(x); // '<u>Test</u>'
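The lazy pattern above pairs each underline span with the first </span> that follows it, so it picks the wrong closing tag when another span is nested inside. A minimal sketch of one way to handle nesting is to walk the span tags in order and keep a stack of which ones were underline spans. It assumes well-formed markup and that the underline span's opening tag is exactly <span style="text-decoration: underline;">. The function name replaceUnderlineSpans is made up for illustration.

```javascript
// Exact opening tag that should become <u> (assumption: no other styles on it).
const UNDERLINE_OPEN = '<span style="text-decoration: underline;">';

function replaceUnderlineSpans(input) {
  const spanTag = /<span\b[^>]*>|<\/span>/g;
  const stack = []; // one entry per open span: true if it was an underline span

  return input.replace(spanTag, (tag) => {
    if (tag[1] !== "/") {
      // Opening tag: remember whether it is the underline span.
      const isUnderline = tag === UNDERLINE_OPEN;
      stack.push(isUnderline);
      return isUnderline ? "<u>" : tag;
    }
    // Closing tag: it closes whichever span was opened most recently.
    const closesUnderline = stack.pop();
    return closesUnderline ? "</u>" : tag;
  });
}

const nested =
  '<span style="text-decoration: underline;">a <span style="font-size: 12px;">b</span> c</span>';
console.log(replaceUnderlineSpans(nested));
// <u>a <span style="font-size: 12px;">b</span> c</u>
```

In an Angular component, the same function could be called on the string before it is bound to the template. For markup that may not be well formed, a DOM-based approach is the safer choice.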

Use regular expressions to add new class to element using search/replace

I want to add a NewClass value to the class attribute and modify the text of the span using find/replace functionality with a pair of regular expressions.
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
I am trying to get the following result after search/replace:
<span class='customer NewClass' id='phone$1'>Organization</span>
I am also curious to know whether a single find/replace operation can be used for both tasks.
Regex can do this, but be aware that using regex to change HTML can have a lot of edge cases that you may not have accounted for.
This regex101 example shows those three <span> elements changed to add NewClass and the contents to be changed to Organization.
Other technologies, however, would be safer. jQuery, for example, could replace them regardless of the order of the attributes:
// "$" is a selector meta-character in jQuery, so it must be escaped with two backslashes.
$("span#phone\\$1").addClass("NewClass");
$("span#phone\\$1").text("Organization");
So just be careful with it, and you should be fine.
EDIT
According to comments on the OP, you want to only change the span containing ID phone$1, so the regex101 link has been updated to reflect this.
EDIT 2
Permalink was too long to fit into a comment, so adding the permalink here. Click on the "Content" tab at the bottom to see the replacement.
You can use a regex like this:
'.*?' id='phone\$1'>.*?<
With substitution string:
'customer NewClass' id='phone\$1'>Organization<
Working demo
PHP code
$re = "/'.*?' id='phone\\$1'>.*?</";
$str = "<div>\n <span class='customer' id='phone\$0'>Home</span>\n<br/>\n <span class='customer' id='phone\$1'>Business</span>\n<br/>\n <span class='customer' id='phone\$2'>Mobile</span>\n</div>";
$subst = "'customer NewClass' id='phone\$1'>Organization<";
$result = preg_replace($re, $subst, $str);
Result
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer NewClass' id='phone$1'>Organization</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
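On the question of doing both changes in one operation: a single replacement can capture the existing class list and rewrite the element text in the same pass. Below is a minimal JavaScript sketch of that idea. It assumes the attribute order and single quotes shown in the question and that the span contains only text. The helper name updateSpan is hypothetical.

```javascript
// Add a class and replace the text of the span with the given id, in one pass.
function updateSpan(html, id, extraClass, newText) {
  // Escape regex meta-characters in the id ("$" in "phone$1" is one of them).
  const escapedId = id.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    "<span class='([^']*)' id='" + escapedId + "'>[^<]*</span>",
    "g"
  );
  // A callback is used so that "$1" inside the id is not read as a backreference.
  return html.replace(
    pattern,
    (match, classes) =>
      `<span class='${classes} ${extraClass}' id='${id}'>${newText}</span>`
  );
}

const source =
  "<span class='customer' id='phone$0'>Home</span>" +
  "<span class='customer' id='phone$1'>Business</span>";
console.log(updateSpan(source, "phone$1", "NewClass", "Organization"));
// <span class='customer' id='phone$0'>Home</span><span class='customer NewClass' id='phone$1'>Organization</span>
```

The same caveat applies as for any regex over HTML: if the attribute order or quoting changes, the pattern silently stops matching.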
Since your tags include preg_match and preg_replace, I think you are using PHP.
Regex is generally not a good tool for manipulating HTML or XML. See the SO post "RegEx match open tags except XHTML self-contained tags".
In PHP, you can use DOMDocument and DOMXPath with the //span[@id="phone$1"] XPath (get all span tags whose id attribute value equals phone$1):
$html =<<<DATA
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
DATA;
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$sps = $xp->query('//span[@id="phone$1"]');
foreach ($sps as $sp) {
$sp->setAttribute('class', $sp->getAttribute('class') . ' NewClass');
$sp->nodeValue = 'Organization';
}
echo $dom->saveHTML();
See IDEONE demo
Result:
<div>
<span class="customer" id="phone$0">Home</span>
<br>
<span class="customer NewClass" id="phone$1">Organization</span>
<br>
<span class="customer" id="phone$2">Mobile</span>
</div>

Scraping with Nokogiri::HTML - Can't get text from XPATH

I'm trying to scrape html with Nokogiri.
This is the html source:
<span id="J_WlAreaInfo" class="wl-areacon">
<span id="J-From">山东济南</span>
至
<span id="J-To">
<span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
全国
<s></s>
</span>
</span>
</span>
I need to get the following text: 山东济南
Checked shortest XPATH with firebug:
//*[@id="J-From"]
Here is my ruby code:
doc = Nokogiri::HTML(open("http://foo.html"), "UTF-8")
area = doc.xpath('//*[@id="J-From"]')
puts area.text
However, it returns nothing.
What am I doing wrong?
However, it returns nothing. What am I doing wrong?
xpath() returns an array containing the matches (it's actually called a NodeSet):
require 'nokogiri'
html = %q{
<span id="J_WlAreaInfo" class="wl-areacon">
<span id="J-From">山东济南</span>
至
<span id="J-To">
<span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
全国
<s></s>
</span>
</span>
</span>
}
doc = Nokogiri::HTML(html)
target_tags = doc.xpath('//*[@id="J-From"]')
target_tags.each do |target_tag|
puts target_tag.text
end
--output:--
山东济南
Edit: You can actually call text() on the Array, but it will return the concatenated results of the text for each match in the array--which is not something I've ever found useful--but because there is only one match you should have gotten the result 山东济南. There is nothing in your post that indicates why you didn't get that result.
If you only want a single result from your xpath, i.e. the first match, then you can use at_xpath():
target_tag = doc.at_xpath('//*[@id="J-From"]')
puts target_tag.text
--output:--
山东济南

goquery- Extract text from one html tag and add it to the next tag

Yeah, sorry that the title explains nothing. I'll need to use an example.
This is a continuation of another question I posted which solved one problem but not all of them. I've put most of the background info from that question into this one. Also, I've only been looking into Go for about 5 days (and I only started learning code a couple months ago), so I'm 90% sure that I'm close to figuring out what I want and that the problem is that I've got some silly syntax mistakes.
Situation
I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). Here's what it looks like:
<html>
<body>
<h1>
<span class="text">Go </span>
</h1>
<p>
<span class="text">totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<h1>
<span class="text">debugger </span>
</h1>
<p>
<span class="text">should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle </span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
Objective
I'd like to:
Extract the content of <h1..."text".
Insert (and concatenate) this extracted content into the content of <p..."text".
Only do this for the <p> tag that immediately follows the <h1> tag.
Do this for all of the <h1> tags on the page.
Once again, an example explains ^this better. This is what I want it to look like:
<html>
<body>
<p>
<span class="text">Go totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<p>
<span class="text">debugger should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle</span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
Solution Attempts
Because distinguishing further the <h1> tags from the <p> tags would provide more parsing options, I've figured out how to change the class attributes of the <h1> tags to this:
<html>
<body>
<h1>
<span class="title">Go </span>
</h1>
<p>
<span class="text">totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<h1>
<span class="title">debugger </span>
</h1>
<p>
<span class="text">should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle </span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
with this code:
html_code := strings.NewReader(`
code_example_above
`)
doc, _ := goquery.NewDocumentFromReader(html_code)
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
s.SetAttr("class", "title")
class, _ := s.Attr("class")
if class == "title" {
fmt.Println(class, s.Text())
}
})
I know that I can select the <p..."text" following the <h1..."title" with either doc.Find("h1+p") or s.Next() inside the doc.Find("h1").Each function:
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
s.SetAttr("class", "title")
class, _ := s.Attr("class")
if class == "title" {
fmt.Println(class, s.Text())
fmt.Println(s.Next().Text())
}
})
I can't figure out how to insert the text from <h1..."title" to <p..."text". I've tried using quite a few variations of s.After(), s.Before(), and s.Append(), e.g., like this:
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
s.SetAttr("class", "title")
class, _ := s.Attr("class")
if class == "title" {
s.After(s.Text())
fmt.Println(s.Next().Text())
}
})
but I can't figure out how to do exactly what I want.
If I use s.After(s.Next().Text()) instead, I get this error output:
panic: expected identifier, found 5 instead
goroutine 1 [running]:
code.google.com/p/cascadia.MustCompile(0xc2082f09a0, 0x62, 0x62)
/home/*/go/src/code.google.com/p/cascadia/selector.go:59 +0x77
github.com/PuerkitoBio/goquery.(*Selection).After(0xc2082ea630, 0xc2082f09a0, 0x62, 0x5)
/home/*/go/src/github.com/PuerkitoBio/goquery/manipulation.go:18 +0x32
main.func·001(0x0, 0xc2082ea630)
/home/*/go/test2.go:78 +0x106
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc2082ea600, 0x7cb678, 0x2)
/home/*/go/src/github.com/PuerkitoBio/goquery/iteration.go:7 +0x173
main.ExampleScrape()
/home/*/go/test2.go:82 +0x213
main.main()
/home/*/go/test2.go:175 +0x1b
goroutine 9 [runnable]:
net/http.(*persistConn).readLoop(0xc208047ef0)
/usr/lib/go/src/net/http/transport.go:928 +0x9ce
created by net/http.(*Transport).dialConn
/usr/lib/go/src/net/http/transport.go:660 +0xc9f
goroutine 17 [syscall, locked to thread]:
runtime.goexit()
/usr/lib/go/src/runtime/asm_amd64.s:2232 +0x1
goroutine 10 [select]:
net/http.(*persistConn).writeLoop(0xc208047ef0)
/usr/lib/go/src/net/http/transport.go:945 +0x41d
created by net/http.(*Transport).dialConn
/usr/lib/go/src/net/http/transport.go:661 +0xcbc
exit status 2
(The lines of my script don't match the lines of the examples above, but "line 72" of my script contains the code s.After(s.Next().Text()). I don't know what exactly panic: expected identifier, found 5 instead means.)
Summary
In summary, my problem is that I can't quite wrap my head around how to use goquery to add text to a tag.
I think I'm close. Would any gopher Jedis be able and willing to help this padawan?
Something like this code does the job. It finds all <h1> nodes, then looks inside each one for a <span> with class text. It then takes the element that follows the <h1>; if that element is a <p> whose first child is a <span> with class text, it replaces that <span> with a new <span> holding the combined text and removes the <h1>.
I wonder if it's possible to create nodes using goquery without writing html...
package main
import (
"fmt"
"strings"
"github.com/PuerkitoBio/goquery"
)
var htmlCode string = `<html>
...
<html>`
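func main() {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlCode))
	if err != nil {
		panic(err)
	}
	doc.Find("h1").Each(func(i int, h1 *goquery.Selection) {
		h1.Find("span").Each(func(j int, s *goquery.Selection) {
			if !s.HasClass("text") {
				return
			}
			// Next() returns an empty selection rather than nil,
			// so test the selection itself instead of comparing to nil.
			p := h1.Next()
			if !p.Is("p") {
				return
			}
			// HasClass is false on an empty selection, so no nil check is needed.
			ps := p.Children().First()
			if ps.HasClass("text") {
				// Build the merged span: heading text followed by the paragraph text.
				ps.ReplaceWithHtml(
					fmt.Sprintf("<span class=\"text\">%s%s</span>", s.Text(), ps.Text()))
				h1.Remove()
			}
		})
	})
	htmlResult, _ := doc.Html()
	fmt.Println(htmlResult)
}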

Format dom as expression using css

I'm creating a site that uses tags and needs to perform basic tag algebra with operators not, and, or. I have a dom element that describes the expression but can't display the expression using css.
Consider the following expression:
([Green] or ((not [Blue]) and ([Red] or (not [Yellow]))))
Which is represented in the dom as:
<span class="tag-expression">
<span class="tag-or">
<span class="tag" value="green">Green</span>
<span class="tag-and">
<span class="tag-not">
<span class="tag" value="blue">Blue</span>
</span>
<span class="tag-or">
<span class="tag" value="red">Red</span>
<span class="tag-not">
<span class="tag" value="yellow">Yellow</span>
</span>
</span>
</span>
</span>
</span>
I've managed to include the parentheses using CSS :before and :after together with the content property (jsFiddle demo), but I have had no luck showing the operators ¬, &, |. I've been toying with including a <span class="operator"/> with a background image, but I was wondering if there is another way to do this using the :before and :after selectors.
Any ideas?
Here you go, it works with what you provided me in the example, you should test it out in more complex expressions to make sure it is correct.
I added some complex CSS selectors at the end of your CSS script for showing your operators:
.tag-expression .tag-or > span:nth-child(2):before {
content: ' | (';
}
.tag-expression .tag-and > span:first-child:after {
content: ' ) & ';
}
.tag-expression .tag-not:before {
content: ' ( ¬ ';
}
You can check this out in this fiddle. Let me know if that solves your problem.