How do I remove text between two patterns in bash [duplicate] - html

How do I remove the text between two patterns in a line of a file? I have a list of lines; here I show only two for simplicity.
<sup id="Gen.2.23" class="v0_2_23">23</sup>Anke Adam pulo:</span></p><p class="q2"><span class="v0_2_23">“La ke non nerrepi-heihei pen arrepi-lo lapen ne-ok pen a-ok-lo;</span></p><p class="q2"><span class="v0_2_23">bangpi aphan ‘Arloso’ pusi hangpo,</span></p><p class="q2"><span class="v0_2_23">pima bangpi ke Pinso pensi enlo.”</span></p>
<sup id="Gen.2.24" class="v0_2_24">24</sup>Anke Adam pulo:</span></p><p class="q2"><span class="v0_2_24">“La ke non nerrepi-heihei pen arrepi-lo lapen ne-ok pen a-ok-lo;</span></p><p class="q2"><span class="v0_2_24">bangpi aphan ‘Arloso’ pusi hangpo,</span></p><p class="q2"><span class="v0_2_24">pima bangpi ke Pinso pensi enlo.”</span></p>
I want to remove the text from </span></p><p class="q2"> up to the next ">
The output I need is shown below:
<sup id="Gen.2.23" class="v0_2_23">23</sup>Anke Adam pulo: “La ke non nerrepi-heihei pen arrepi-lo lapen ne-ok pen a-ok-lo;bangpi aphan ‘Arloso’ pusi hangpo, pima bangpi ke Pinso pensi enlo.”</span></p>
<sup id="Gen.2.24" class="v0_2_24">24</sup>Anke Adam pulo: “La ke non nerrepi-heihei pen arrepi-lo lapen ne-ok pen a-ok-lo;bangpi aphan ‘Arloso’ pusi hangpo, pima bangpi ke Pinso pensi enlo.”</span></p>
When I used sed 's/<\/span><\/p><p class="q2">*.*">//g', it removed everything from the first </span> through the last ">

It looks like you are looking for a non-greedy match; otherwise the .*"> will match as much as possible on the line. The syntax for non-greedy matching is generally *?, although I don't believe it is supported by sed. So, for your case, you could do something like:
perl -pe 's;</span></p><p class="q2">.*?">;;g' input.html
But, as @melpomene suggests, regexps aren't a good choice for HTML parsing.
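For comparison, the greedy-versus-lazy difference can be sketched in Python, whose re module supports the *? quantifier. This is only an illustration: the shortened sample line and the names GREEDY and LAZY are made up for the example, not taken from the question's file.

```python
import re

# Shortened, hypothetical sample in the shape of the lines from the question.
line = (
    'Anke Adam pulo:</span></p><p class="q2"><span class="v0_2_23">“La ke'
    '</span></p><p class="q2"><span class="v0_2_23">bangpi</span></p>'
)

# Greedy: .* runs on to the LAST "> on the line, swallowing the text in between.
GREEDY = re.compile(r'</span></p><p class="q2">.*">')
# Lazy: .*? stops at the FIRST "> after each match start.
LAZY = re.compile(r'</span></p><p class="q2">.*?">')

greedy_result = GREEDY.sub("", line)
lazy_result = LAZY.sub("", line)

print(greedy_result)
print(lazy_result)
```

The greedy pattern loses the middle of the verse, while the lazy one only strips the markup between the pieces.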

It looks like this yields what you want:
sed 's/<\/span><\/p><p class="q2"><span class="v0_2_23">//g' file
To avoid escaping you can use a different separator like:
sed 's|</span></p><p class="q2"><span class="v0_2_23">||g' file
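Note that the commands above hard-code class="v0_2_23", so they would leave the Gen.2.24 line untouched. One way around this, sketched here in Python under the assumption that the class value never contains a double quote, is a negated character class [^"]* in place of the fixed value. The names SEPARATOR and join_verse are hypothetical.

```python
import re

# [^"]* matches any class value up to the closing quote, so the same
# pattern works for v0_2_23, v0_2_24, and so on.
SEPARATOR = re.compile(r'</span></p><p class="q2"><span class="[^"]*">')

def join_verse(line: str) -> str:
    """Remove every paragraph/span boundary inside one verse line."""
    return SEPARATOR.sub("", line)

sample = 'pulo:</span></p><p class="q2"><span class="v0_2_24">“La ke</span></p>'
print(join_verse(sample))
```

A negated character class is also available in plain sed, which makes this approach usable there even without lazy quantifiers.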

Related

Vue v-html not render \r \n \t

I have data like this :
data: {
content:"<p><span style=\"font-size:16px\">Berikut adalah beberapa pemberontakan yang pernah terjadi di daerah.</span></p>\r\n\r\n<p><span style=\"font-size:16px\"><strong>1. Pemberontakan Angkatan Perang Ratu Adil (APRA) </strong></span></p>\r\n\r\n<ul>\r\n\t<li><span style=\"font-size:16px\">di Bandung, pada 23 Januari 1950.</span></li>"
}
From the data I want to display it using Vue js.
This is my Vue js code:
<div class="row px-3" v-html="data.content"></div>
And if the above code is executed then the result is like this :
As you can see, \r, \n and \t don't seem to be rendered by Vue.js.
How can I get \r, \n and \t to be rendered by Vue.js and displayed as below?
\r, \n, and \t are escape sequences in the JavaScript string; once the string is evaluated they become ordinary whitespace characters (carriage return, newline, tab), and HTML collapses whitespace when rendering, so they have no visible effect. You need to replace them with HTML that does what you want it to do. For new lines, the <br> tag could be used, but traditionally people handle line breaks by wrapping their sections in paragraphs (<p>) or divs (<div>). For tabs, you'll need to look up how indentation is handled in HTML and CSS, as there is a lot more to say about it than I can explain in a short answer here.
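The replacement step described above could look roughly like the following. It is a minimal sketch in Python rather than in the asker's Vue code; the function name whitespace_to_html and the choice of &emsp; for a tab are assumptions made for illustration.

```python
def whitespace_to_html(text: str) -> str:
    """Turn CR/LF and tab characters into visible HTML equivalents."""
    # Normalise Windows (\r\n) and old Mac (\r) line endings to \n first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One <br> per line break.
    text = text.replace("\n", "<br>")
    # Assumption: an em-space entity is an acceptable stand-in for a tab.
    return text.replace("\t", "&emsp;")

print(whitespace_to_html("first line\r\nsecond\tline"))
```

In the asker's data the line breaks sit between block-level tags such as <p> and <ul>, so inserting <br> there may add more vertical space than intended.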
I don't have the complete code, but after first read :
Try
v-html="content"
and
data(){
return {
content: "<p><span style=\"font-size:16px\">Berikut adalah beberapa pemberontakan yang pernah terjadi di daerah.</span></p>\r\n\r\n<p><span style=\"font-size:16px\"><strong>1. Pemberontakan Angkatan Perang Ratu Adil (APRA) </strong></span></p>\r\n\r\n<ul>\r\n\t<li><span style=\"font-size:16px\">di Bandung, pada 23 Januari 1950.</span></li>"
}
}

How to replace HTML tags throughout file with string based on tag ID

I have a bunch of HTML files with empty <a> tags tied to unique page IDs. I'd like to replace each one with the same link but including the text of the ID visibly in the line (i.e. displayed within the <p>...</p> tags within which the <a> tags occur) and changing the class of the tag so I can format it in the CSS.
So what I have currently is tags like this occurring throughout text:
<a id="page_123" class="garbage1"></a>some text<a id="page_124" class="garbage2"></a>
And I want to replace it with:
<a id="page_123" class="pagenum"> 123 </a>some text<a id="page_124" class="pagenum"> 124 </a>
So that the resulting display is:
some text 123 some text 124 some text.
The class of each of these tags is not always the same but I want to change all of them to pagenum. The tag id is always of the form page_####, anywhere from 1 to 4 digits.
I'm way over my head on this. I've gotten as far as constructing a horrifying regex (I know, I know) to pick the pattern out of the files, which seems to work when I test via cat file1.html | grep -o "\Wa\sid..page.\d*.\sclass..\w*....a."—that returns every instance of the pattern and nothing else. I'm totally stuck trying to go from there to making the replacement happen.
First, work around sed's greedy matching by inserting a newline after each </a>, so that every tag ends its own line.
sed -r 's#</a>#&\n#g' file1.html
Next, determine how strict your match should be:
sed -r 's#</a>#&\n#g' file1.html |
sed -r 's#(<a id="page_)([^"]*)(" class=")([^"]*).*#\1\2\3pagenum">\2</a>#'
If Perl is an option, try:
perl -0777 -pe 's/(<a\s+id="page_(\d+)"\s+class=").+?">/$1pagenum"> $2 /g' file.html
The -0777 option tells perl to slurp whole file to allow included line breaks.
The regex .+? is used as a non-greedy match.
You can use:
sed -r 's/(id="page_([0-9]+)" +class=)"[^"]*">/\1"pagenum"> \2 /g' file1.html
This is finding the following expression:
(id="page_([0-9]+)" +class=)"[^"]*">
replacing it with:
\1"pagenum"> \2
See a demo here.
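The same capture-and-rewrite idea, sketched in Python for readers who want to test it outside sed. It assumes the attributes always appear as id then class, separated by one space, as in the question's sample; PAGE_ANCHOR and mark_page_numbers are hypothetical names.

```python
import re

# Group 1: everything up to the class value, group 2: the page digits,
# group 3: the closing quote and angle bracket.
PAGE_ANCHOR = re.compile(r'(<a id="page_(\d{1,4})" class=")[^"]*(">)')

def mark_page_numbers(html: str) -> str:
    """Set class to pagenum and put the page number inside the tag."""
    return PAGE_ANCHOR.sub(r'\g<1>pagenum\g<3> \g<2> ', html)

sample = '<a id="page_123" class="garbage1"></a>some text<a id="page_124" class="garbage2"></a>'
print(mark_page_numbers(sample))
```

Anchors whose attributes are ordered or spaced differently are left unchanged, which is one reason the parser-based answers are more robust.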
Please don't parse HTML with regex, but use a parser like xidel or xmlstarlet instead.
Assuming 'input.htm':
<html>
<body>
<p><a id="page_123" class="garbage1"></a>some text<a id="page_124" class="garbage2"></a></p>
<p><a id="page_125" class="garbage1"></a>some text<a id="page_126" class="garbage2"></a></p>
<p><a id="page_127" class="garbage1"></a>some text<a id="page_128" class="garbage2"></a></p>
</body>
</html>
With xmlstarlet:
$ xmlstarlet ed -O -P \
-u '//a' -x 'concat(" ",substring(@id,6)," ")' \
input.htm
<html>
<body>
<p><a id="page_123" class="garbage1"> 123 </a>some text<a id="page_124" class="garbage2"> 124 </a></p>
<p><a id="page_125" class="garbage1"> 125 </a>some text<a id="page_126" class="garbage2"> 126 </a></p>
<p><a id="page_127" class="garbage1"> 127 </a>some text<a id="page_128" class="garbage2"> 128 </a></p>
</body>
</html>
With xidel:
$ xidel -s --input-format=xml input.htm -e '
x:replace-nodes(
//a,
function($x){
element a {$x/@*,x" {substring($x/@id,6)} "}
}
)
' --printed-node-format=xml
<html>
<body>
<p><a id="page_123" class="garbage1"> 123 </a>some text<a id="page_124" class="garbage2"> 124 </a></p>
<p><a id="page_125" class="garbage1"> 125 </a>some text<a id="page_126" class="garbage2"> 126 </a></p>
<p><a id="page_127" class="garbage1"> 127 </a>some text<a id="page_128" class="garbage2"> 128 </a></p>
</body>
</html>
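A comparable parser-based approach can be sketched with Python's standard library. It assumes, like the xmlstarlet example, that the input is well-formed XML; real-world HTML that is not well-formed would need an HTML parser instead. Unlike the outputs above, this sketch also changes the class to pagenum, as the question asks. SAMPLE and add_page_numbers are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample in the shape of 'input.htm'.
SAMPLE = """<html>
<body>
<p><a id="page_123" class="garbage1"></a>some text<a id="page_124" class="garbage2"></a></p>
</body>
</html>"""

def add_page_numbers(markup: str) -> str:
    """Give every <a id="page_N"> the class pagenum and the text ' N '."""
    root = ET.fromstring(markup)
    for anchor in root.iter("a"):
        page_id = anchor.get("id", "")
        if page_id.startswith("page_"):
            anchor.set("class", "pagenum")
            # Everything after the "page_" prefix is the page number.
            anchor.text = f" {page_id[len('page_'):]} "
    return ET.tostring(root, encoding="unicode")

result = add_page_numbers(SAMPLE)
print(result)
```

Because the tree is edited rather than the text, the surrounding "some text" is preserved as the tail of each anchor without any pattern having to account for it.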

CSS place element at position of another (NOT parent)

I have an XML file which consists mostly of text. However, there are some elements in it containing additional information. Let's call these additional information bits "element". Some paragraphs contain none of them, some contain several. Sometimes they even come right after each other. However, in different paragraphs they are always at different positions.
Here's mock up:
<paragraph>
Qui <element>20</element> corti. num sit <element>10</element><element>5</element> igitu pugis quium. quem er Epiendis nessictilluptiudicaribus? qui ipsarent scit verspitomnesse con eiudicitinec tam ret pari Graeperi diurum eo <element>50</element> nebituratam num aerminxeato nilibus. nostereffer est modulceribus, ficantendus anonea Chraectatur, quemodumquae ut pet sum re vivatotertentu vitra cortem nonemod hunturunclia dolum poraectiatiamas rein eximplatorefut egra vartere
</paragraph>
Now I want to transform this XML file into HTML with XSLT. The problem is: the "additional elements" should not appear in the text itself, but in a separate column. Like this:
As you can see, the numbers (bearing my "additional information") appear right at the level where they occur in the text: "20", "10" and "5" are on the first line, because in the XML source data they are referenced after the words "Qui" and "sit", which both end up on the first line of the output text. "50" is right at the level of "eo nebituratam".
How can I achieve this behaviour with CSS?
It is rather easy to put an anchor element in the HTML at this very same position:
eo <a id="some_id"/> nebituratam
Let's say that the "50" is in a span-element
<span style="...">50</span>
However, what do I put in the CSS here? Is there a way to state in CSS to place this span right at the level of the anchor with the id "some_id"?
Of course, the anchor can in no way be the parent of the span element.
Honestly, I'm quite sure that my problem is unsolvable without JavaScript or something similar, but I'd like to avoid scripts wherever I can.
Try This
p{
padding-left:50px;
position:relative;
}
p span{
position:absolute;
left:10px;
top:45%;
font-size:16px;
font-weight:bold;
}
<p>Qui corti. num sit igitu pugis quium. quem er Epiendis nessictilluptiudicaribus? qui ipsarent scit verspitomnesse
con eiudicitinec tam ret pari Graeperi diurum eo <span>50</span> nebituratam num aerminxeato nilibus. nostereffer est modulceribus,
ficantendus anonea Chraectatur, quemodumquae ut pet sum re vivatotertentu vitra cortem nonemod hunturunclia dolum
poraectiatiamas rein eximplatorefut egra vartere</p>

Unwanted breakline behavior - general explanation

I have run into a cross-browser problem: a line in a narrow column is broken too early, even though there is space left where the last word would easily fit.
At first I thought something was wrong with my stylesheet, but it looks the same in a simple fiddle I created (no PHP tags, no line breaks, etc.):
I am sorry that the text is in Czech, but I hope that is OK for demo purposes.
It shows the same bug in Firefox, IE and Chrome on Windows 7 and Windows 8, and even on an iPad.
Link to fiddle: http://jsfiddle.net/Grows/q9wqeu14/1/
Demo:
HTML:
<div class="column">
<p>Jsme tým zkušených profesionálů, který Vám pomůže s kompletním IT řešením. Spravujeme IT techniku jak menším firmám do deseti uživatelů, tak i velkým společnostem se stovkami stanic a desítkami serveů. Náklady na externí správu sítě jsou zcela individuální a závisí na rozsahu sítě (počet serverů, stanic, aktivních prvků apod.), dohodnuté frekvenci návštěv a garantované době servisních zásahů. U menších firem se tato částka obvykle pohybuje v jednotkách tisíců korun měsíčně, takže se určitě vyplatí více, než zaměstnávat vlastního správce sítě.</p>
</div>
<div class="column">
<ul>
<li>Individuální přístup a vstřícnou péči o uživatele výpočetní techniky</li>
<li>Pravidelnou údržbu výpočetní techniky - minimalizují se její výpadky</li>
<li>Garanci servisního zásahu - minimalizuje ztráty způsobené výpadkem</li>
<li>Řízení IT procesů - provozujeme systém HELPDESK pro hlášení servisních požadavků, telefonickou linku HOT-LINE a automatický monitorovací systém NAGIOS, který nepřetržitě monitoruje chod Vašich klíčových zařízení</li>
<li>Poradenství a konzultační služby</li>
</ul>
</div>
<div class="column">
<ul>
<li>Finanční úspora - IT specialistu využíváte jen tehdy, je-li to potřeba. Ušetříte na mzdových nákladech, odborných školeních apod.</li>
<li>Flexibilita - služba je smluvně garantovaná, nemusíte řešit nemoci, dovolené, zástupy apod.</li>
<li>Profesionalita - pracujeme v týmu, máme zkušenosti, kvalitní technické zázemí a podporu našich dodavatelů. Jsme schopni minimalizovat rizika výpadku sítě či je zkrátit na minimum.</li>
<li>Přenesení zodpovědnosti za bezproblémový chod Vaší sítě na dodavatele</li>
</ul>
</div>
CSS:
body {
font-size: 14px; font-family: 'Arial'; text-align: left;
}
.column {
width: 214px; border: 1px black solid;
}
li {
list-style-type: disc; list-style-position: outside;
}
What both my client and I see is the odd break between the 3rd and 4th lines, and it also occurs further down in the text.
I searched for similar questions here and on Google, but without success.
Is this standard browser behavior, or is something wrong?
I really don't want to use manual line-break controls like br, wbr, nbsp, etc.
Thanks a lot!
Cheers, Martin
---- UPDATED ----
Thanks for the solutions so far, guys.
There is no extra white space of any kind, it's just pure text, so I can't remove any.
Also, it must stay in three divs.
I guess it's some odd behavior of the Czech language in browsers, but I haven't seen anything like this before.
Maybe nothing can be done about it, and that could be an answer too :)
---- SOLVED ---
Emmanuel was right.
There was something odd about some of the space characters. When I deleted them and typed the spaces again, the problem disappeared. Thank you so much! If someone could explain this to me I would be very happy, because in the source code there weren't any visible "white-space"-like tags...
See Remove non breaking space from <h4>. In your editor, search for non-breaking spaces if you know how to do that, turn on a mode which displays them, or do a search-and-replace of non-breaking spaces with regular old spaces.
Remove the second and third divs, like this:
<div class="column">
<p>Jsme tým zkušených profesionálů, který Vám pomůže s kompletním IT řešením. Spravujeme IT techniku jak menším firmám do deseti uživatelů, tak i velkým společnostem se stovkami stanic a desítkami serveů. Náklady na externí správu sítě jsou zcela individuální a závisí na rozsahu sítě (počet serverů, stanic, aktivních prvků apod.), dohodnuté frekvenci návštěv a garantované době servisních zásahů. U menších firem se tato částka obvykle pohybuje v jednotkách tisíců korun měsíčně, takže se určitě vyplatí více, než zaměstnávat vlastního správce sítě.</p>
<ul>
<li>Individuální přístup a vstřícnou péči o uživatele výpočetní techniky</li>
<li>Pravidelnou údržbu výpočetní techniky - minimalizují se její výpadky</li>
<li>Garanci servisního zásahu - minimalizuje ztráty způsobené výpadkem</li>
<li>Řízení IT procesů - provozujeme systém HELPDESK pro hlášení servisních požadavků, telefonickou linku HOT-LINE a automatický monitorovací systém NAGIOS, který nepřetržitě monitoruje chod Vašich klíčových zařízení</li>
<li>Poradenství a konzultační služby</li>
</ul>
<ul>
<li>Finanční úspora - IT specialistu využíváte jen tehdy, je-li to potřeba. Ušetříte na mzdových nákladech, odborných školeních apod.</li>
<li>Flexibilita - služba je smluvně garantovaná, nemusíte řešit nemoci, dovolené, zástupy apod.</li>
<li>Profesionalita - pracujeme v týmu, máme zkušenosti, kvalitní technické zázemí a podporu našich dodavatelů. Jsme schopni minimalizovat rizika výpadku sítě či je zkrátit na minimum.</li>
<li>Přenesení zodpovědnosti za bezproblémový chod Vaší sítě na dodavatele</li>
</ul>
</div>
You just need to add the word-break: break-all; style to .column.
If you look specifically at:
řešením. Spravujeme
in a HEX editor, the "space" between the '.' and 'S' is actually 2 bytes:
C2 A0
C2 A0 is the UTF-8 encoding of a non-breaking space (U+00A0), whereas a "normal" breaking space is hex 20.
Since this is a non-breaking space, the browser doesn't consider it a valid point at which to wrap the line.
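If no hex editor is at hand, the same check can be done with a few lines of Python. This is a small sketch; the helper names find_nbsp and replace_nbsp are made up, and the sample string only imitates the spot mentioned above.

```python
NBSP = "\u00a0"  # non-breaking space, encoded as C2 A0 in UTF-8

def find_nbsp(text: str) -> list[int]:
    """Return the character positions of all non-breaking spaces."""
    return [i for i, ch in enumerate(text) if ch == NBSP]

def replace_nbsp(text: str) -> str:
    """Swap every non-breaking space for a regular space (hex 20)."""
    return text.replace(NBSP, " ")

# Imitation of the problem spot: an NBSP between the '.' and the 'S'.
sample = "řešením." + NBSP + "Spravujeme IT techniku"

print(NBSP.encode("utf-8").hex(" "))   # the two bytes seen in the hex editor
print(find_nbsp(sample))
print(replace_nbsp(sample))
```

Running replace_nbsp over the page source before publishing gives the browser back its normal wrapping points.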

How can I extract information from an HTML file using Perl regular expressions?

I have two files, one XML and one HTML, and need to extract data from them based on certain patterns.
My XML file is pretty well formatted and I can use readline to read a line and search data between tags.
if($line =~ /\<tag1\>$varvalue\<\/tag1\>/)
However, my HTML file has some of the worst markup I have seen; the file looks like this:
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
<div class="address">
<i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
</div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011" title="Dream House" onmouseover="mB(event, 771204354);" >**Dream House**</a>
<span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>
<div class="times">
**1:00 PM,**
</div>
From this file I need to pick out the data shown in bold.
How can I use Perl regular expressions to extract this data from the file?
RegEx match open tags except XHTML self-contained tags
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Using regular expressions to parse HTML: why not?
When you are done reading those come back :)
Edit : and to actually solve your problem take a look at this module :
http://perlmeme.org/tutorials/html_parser.html
A sample that parses an HTML file:
#!/usr/local/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
# Build a parse tree from the HTML file.
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('C:\Users\Stefanos\workspace\HTML_Parser_Test\test.html');
# Collect every <div> element in the document.
my @divs = $tree->find('div');
# Free the tree once you are done with it.
$tree->delete;
In this example I just used your tags as the main body of an .html file. The divs are stored in the @divs array. Since ** is not an element, I have no idea which text you want to find, so I can't help you further.
P.S. I had never used this module before and put this together in 5 minutes, so it is not hard to parse the HTML file and find whatever you want.
Regex to match a specific tag and store its contents in $1:
if ($subject =~ m!<tagname[^>]*>(.*?)</tagname>!s) {
# Successful match
}
You will soon run into the limitations of this approach when you have nested elements, though.
Replace tagname with the actual tag, e.g. in your case i, a, span or div. For div, however, the match ends at the first closing </div>, so with nested divs you will not get what you want.
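The nested-element limitation is easy to reproduce. Here is a small sketch in Python using the same non-greedy pattern; first_tag_content is a hypothetical helper, and the sample strings are simplified from the question's markup.

```python
import re

def first_tag_content(tagname, text):
    """Return the text between the first <tagname ...> and the next </tagname>."""
    name = re.escape(tagname)
    # re.DOTALL plays the role of Perl's /s flag: '.' also matches newlines.
    pattern = re.compile(rf"<{name}\b[^>]*>(.*?)</{name}>", re.DOTALL)
    match = pattern.search(text)
    return match.group(1) if match else None

address = '<div class="address">\n<i>3323 South Hoover Street</i>\n</div>'
nested = "<div><div>inner</div>tail</div>"

print(first_tag_content("i", address))   # works for a flat element
print(first_tag_content("div", nested))  # stops at the first </div>
```

For the nested case the capture ends at the inner closing tag, so the result is a fragment with an unclosed <div>, which is the failure mode described above.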
Parsing XML and HTML using regular expressions is a fool's errand. There are many simple to use Perl modules for parsing HTML. Here is something using HTML::TokeParser::Simple. I've omitted the code to associate movies and showtimes with theaters (because I have no intention of building an appropriate input file):
#!/usr/bin/env perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);
my @theaters;
while (my $div = $parser->get_tag('div')) {
my $class = $div->get_attr('class');
next unless defined($class) and $class eq 'theater';
my %record;
$record{theater} = $parser->get_text('/a');
$record{address} = $parser->get_text('/i');
s{(?:^\s+)|(?:\s+\z)}{} for values %record;
push @theaters, \%record;
}
use YAML;
print Dump \@theaters;
__DATA__
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
<div class="address">
<i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
</div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011" title="Dream House" onmouseover="mB(event, 771204354);" >**Dream House**</a>
<span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>
<div class="times">
**1:00 PM,**
</div>
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**Some other theater*</a></h2>
<div class="address">
<i>**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**</i>
</div>
</div>
Output:
[sinan@macardy]:~/tmp> ./tt.pl
---
- address: '**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**'
theater: '**University Village 3**'
- address: '**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**'
theater: '**Some other theater*'
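For readers without Perl, a similar token-based extraction can be sketched with Python's standard html.parser module. This is an illustration rather than a port of the script above: the class name TheaterParser and the trimmed SAMPLE input are assumptions, and the ** markers from the question are left out of the sample.

```python
from html.parser import HTMLParser

class TheaterParser(HTMLParser):
    """Collect the <a> and <i> text found inside each div.theater."""

    def __init__(self):
        super().__init__()
        self.theaters = []
        self._record = None  # record for the theater div being read
        self._depth = 0      # div nesting depth inside that theater div
        self._field = None   # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._record is None and attrs.get("class") == "theater":
                self._record = {"theater": "", "address": ""}
                self._depth = 1
            elif self._record is not None:
                self._depth += 1
        elif self._record is not None and tag == "a":
            self._field = "theater"
        elif self._record is not None and tag == "i":
            self._field = "address"

    def handle_endtag(self, tag):
        if tag in ("a", "i"):
            self._field = None
        elif tag == "div" and self._record is not None:
            self._depth -= 1
            if self._depth == 0:  # the theater div itself just closed
                self.theaters.append(
                    {key: value.strip() for key, value in self._record.items()}
                )
                self._record = None

    def handle_data(self, data):
        if self._record is not None and self._field:
            self._record[self._field] += data

SAMPLE = """<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >University Village 3</a></h2>
<div class="address">
<i>3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321</i>
</div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011" title="Dream House">Dream House</a>
</div>"""

parser = TheaterParser()
parser.feed(SAMPLE)
theaters = parser.theaters
print(theaters)
```

Tracking the div depth is what lets the parser tell the nested address div apart from the end of the theater block, something the single-regex approach cannot do.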