Sorry if this is irrelevant :-)
I need to write something in my HTML code to convert digits of the form 0123456789 to ۰۱۲۳۴۵۶۷۸۹ (Persian digits, U+06F0..U+06F9).
The number of visitors is generated by the blog service, and I want to convert its digits to Persian.
Counter:
تعداد بازدیدکنندگان : <BlogSky:Weblog Counter /> نفر
The Persian parts of the code above mean 'Number of visitors' and 'persons' (from left to right), but the digits are displayed in Latin form (0123...).
Is it possible to write something like a function in HTML? I want it to be a global one that can be used across weblogs.
Note: I don't know anything about web programming languages, and I'm not sure what language the code above is written in (HTML?).
HTML only describes the structure of the document. You'll have to use JavaScript, a client-side language that lets you do what you need in this case, i.e. manipulate the DOM tree.
Here's an example of code that replaces 0...9 with ۰...۹ in a given string:
myString.replace(/\d/g, function(matches) {
return String.fromCharCode("\u06F0".charCodeAt(0) + (matches[0] * 1));
})
So basically what you need to do now is fetch the text from the document and replace it with the same text modified by the code above:
//HTML
<p id="digits">0123456789</p>
//JavaScript:
var text = document.getElementById("digits").firstChild;
text.nodeValue = text.nodeValue.replace(/\d/g, function(matches) {
return String.fromCharCode("\u06F0".charCodeAt(0) + (matches[0] * 1));
});
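The same replacement can also be wrapped in a reusable function, which is closer to the "global function" the question asks for. This is only a sketch: the name toPersianDigits is made up for illustration, and it assumes the text contains plain ASCII digits 0-9.

```javascript
// Hypothetical helper; the name is an assumption, not part of any blog service.
function toPersianDigits(input) {
    // U+06F0 is the Persian digit zero; the ten digits are contiguous,
    // so adding the numeric value of the ASCII digit gives the right code point.
    return String(input).replace(/[0-9]/g, function (digit) {
        return String.fromCharCode(0x06F0 + Number(digit));
    });
}
```

In a page you would pass it the nodeValue of the text node that holds the counter and assign the result back, as in the snippet above.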
Related
I need a regex matching every pair of <p>...<br> and <p CLASS='extmsg'>...<br> to distinguish the parts of a chat conversation, which I receive as a string in the following format:
<p CLASS='extmsg'>16:30:24 ~ customer@home.com: hello<br>
<p>16:30:14 ~ consultant@company.com: hello to you<br>
<p CLASS='extmsg'>16:30:03 ~ sam.i.am@greeneggs.ham: how are you<br>
<p>03/06/2018 16:29:55 ~ bok.kier@ccc.pl: im fine<br>
I need it for a parsing method.
Don't parse HTML with regex, use a proper XML/HTML parser.
theory :
According to compiler theory, HTML can't be parsed with regexes based on a finite state machine. Because of HTML's hierarchical construction, you need a pushdown automaton and have to manipulate an LALR grammar using a tool like YACC.
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint
xmlstarlet
saxon-lint (my own project)
Check: Using regular expressions with HTML tags
Example :
xmllint --html --xpath '//p[@CLASS="extmsg"]/text()' file
Regexes are not suitable for this, as per Giles Quenot's answer. Using a proper parser is a much better way to do this. If you do receive messages in the format shown:
One message per line
Every message starts with "<p"
Every message ends with "<br>"
an easier idea might be string-matching the start of the line instead. I don't know what language you're using, but an example in JavaScript might be:
var inputString = ""; // From wherever you get your data
var lines = inputString.split("\n");
for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    // Customer messages are the ones marked with the extmsg class
    if (line.indexOf("<p CLASS='extmsg'>") == 0) {
        console.log("Customer just said: " + line);
    } else {
        console.log("Representative just said: " + line);
    }
}
You can trim the <p> and <br> tags out too, as you already know how long they are.
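Trimming the tags could look roughly like this. It is a sketch that assumes exactly the two line shapes shown in the question; parseChatLine and the field names of the returned object are made up for illustration.

```javascript
// Hypothetical helper: strips the known prefix and the trailing <br>.
function parseChatLine(line) {
    var externalPrefix = "<p CLASS='extmsg'>";
    var internalPrefix = "<p>";
    var suffix = "<br>";

    var isExternal = line.indexOf(externalPrefix) === 0;
    var prefix = isExternal ? externalPrefix : internalPrefix;
    if (line.indexOf(prefix) !== 0) {
        return null; // not a message line in the expected format
    }

    // Cut at the last <br>; if it is missing, keep the rest of the line.
    var end = line.lastIndexOf(suffix);
    var text = line.slice(prefix.length, end === -1 ? line.length : end);
    return { external: isExternal, text: text };
}
```

Each returned object says whether the line came from the customer (external) and carries the message text without the surrounding markup.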
NOTE This will break if the format of the data changes (e.g. a designer gets into the CSS file and starts using BEM notation, changing extmsg to message--external, and adding message--internal to the rep's messages). As it would if you used a regex or a parser. The best way to deal with this would be to get whoever is supplying the data to make you a proper API for this info.
I'm not at all familiar with Perl, but have some understanding of HTML. I'm trying to adapt code from an online program, which processes text entered by the user to calculate and output a few important numbers, so that I can do the same for a large number of text files in a local directory. The problem is that I don't understand how or why the site's code splits the input text by looking for & and =, as the input text never contains these characters, and neither do my files. Here's some of the code from the online program:
if ($ENV{'REQUEST_METHOD'} ne "POST") {
&error('Error','Use Standard Input by METHOD=POST');
}
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
if ($buffer eq '') {
&error('Error','Can not execute directly','Check the usage');
}
$ref = $ENV{'HTTP_REFERER'};
@pairs = split(/&/,$buffer);
foreach $pair (@pairs) {
($name,$value) = split(/=/,$pair);
if ($name eq "ATOMS") { $atoms = $value; }
It then uses these "pairs" to appropriately calculate the required numbers. The input from the user is simply a textarea named "ATOMS", and the form action is the cgi script:
<form method=POST action="/path/to/the/cgi/file.cgi">
<textarea name="ATOMS" rows=20 cols=80></textarea>
</form>
I've left out the less important details of both the HTML and the Perl code. So far all I've been able to do is get the content of all files in a given directory as text, but when I feed this into the script that calculates the values from the textarea text (in place of the variable $buffer), it doesn't work. I suspect this is due to the split calls, which cannot find the & and = symbols. How does the code get these symbols from the online form, and how can I do the equivalent for my local files? Let me know if any additional information is needed, and thanks in advance!
The encoding scheme forms use (by default) to POST data over HTTP consists of key=value pairs (hence the =) which are separated by & characters.
The latter doesn't much matter for your form since it has only one control in it.
This is described pretty succinctly in the HTML 4 specification and in more detail in the HTML 5 specification.
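To make the encoding concrete, here is a sketch in JavaScript of what a request body looks like and how it is taken apart. The sample body and the UNITS field are invented for illustration (the real form only has ATOMS); decodeForm is a made-up name.

```javascript
// Hypothetical request body: spaces become "+", line breaks become %0D%0A.
var body = "ATOMS=C+0.0+0.0%0D%0AH+1.0+0.0&UNITS=angstrom";

// Manual decoding, mirroring the split on & and = in the Perl script.
function decodeForm(encoded) {
    var fields = {};
    encoded.split("&").forEach(function (pair) {
        var parts = pair.split("=");
        // "+" stands for a space; %XX sequences are percent-encoded bytes.
        var name = decodeURIComponent(parts[0].replace(/\+/g, " "));
        var value = decodeURIComponent((parts[1] || "").replace(/\+/g, " "));
        fields[name] = value;
    });
    return fields;
}

var fields = decodeForm(body);
console.log(JSON.stringify(fields.ATOMS));
```

The & and = characters therefore come from the browser's encoding of the form, not from the text typed into the textarea, which is why text read straight from a file never contains them.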
If you aren't dealing with data from a form, you should remove all the form decoding code.
Not sure where you got that code from, but it's prehistoric (from the last millennium).
You should be able to replace it all with:
use CGI ':cgi';
my $atoms = param('ATOMS');
I would like to make a sort of hash key out of a text (in my case HTML) that would match, or compare closely to, the hash of other similar text.
Example of matching texts:
"2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_2 + ..
...
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_3 + ..
So far I've thought of removing numbers and tags, but that would still leave the random words.
Is there anything out there that does this?
I have root access to the server, so I can add any UDF that is necessary, and if needed I can do the processing in C or other languages.
The ideal would be a function like generateSimilarHash(text) and another function compareSimilarHashes(hash1,hash2) that would return the percentage of matching text.
A function like compare(text1,text2) would not work in my case, as I have many pages to compare (~20 million at the moment).
Any advice is welcome!
UPDATE:
I'm referring to a hash function as it is described on Wikipedia:
A hash function is any algorithm or subroutine that maps large data
sets of variable length to smaller data sets of a fixed length.
The fixed-length part is not necessary in my case.
It sounds like you need to use a program like diff.
If you are just trying to compare text, a hash is not the way to go, because slight differences in input cause total and complete differences in output (which is why hashes are used to store passwords and secure text). Character difference programs are pretty complicated; unless you are really interested in how they work and are trying to write your own, I would just use a solution like the one shown here, which uses sdiff to get a percentage.
Percentage value with GNU Diff
You could use some sort of Levenshtein distance algorithm. This works for small pieces of text, but I'm fairly sure that something similar can be applied to large chunks of text.
Ref: http://en.m.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
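For reference, a minimal sketch of the classic dynamic-programming Levenshtein distance in JavaScript, with a percentage on top. The function names are made up, and turning the distance into a percentage by dividing by the longer length is just one possible convention, not something the question prescribes.

```javascript
// Edit distance using two rows of the dynamic-programming table.
function levenshtein(a, b) {
    var previous = [];
    for (var j = 0; j <= b.length; j++) previous[j] = j;
    for (var i = 1; i <= a.length; i++) {
        var current = [i];
        for (j = 1; j <= b.length; j++) {
            var cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
            current[j] = Math.min(
                previous[j] + 1,        // deletion
                current[j - 1] + 1,     // insertion
                previous[j - 1] + cost  // substitution
            );
        }
        previous = current;
    }
    return previous[b.length];
}

// Assumed convention: 100% for identical strings, 0% when everything differs.
function similarityPercent(a, b) {
    var longest = Math.max(a.length, b.length);
    if (longest === 0) return 100;
    return 100 * (1 - levenshtein(a, b) / longest);
}
```

The running time grows with the product of the two lengths, so on inputs of 100k characters this only becomes practical once the text has been reduced to something much shorter.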
I've found out that tag order in webpages can create a very distinctive pattern that remains the same even if portions of text / css / script change. So I've made a string generated by the tag order (ex: html head meta title body div table tr td span bold... => "hhmtbdtttsb...") and then I just do exact matches between these strings. I can even apply the Levenshtein distance algorithm and get accurate results.
If I didn't have html, I would have used the punctuation/end-lines for splitting, or something similar.
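The tag-order idea could be sketched like this in JavaScript. It is only an illustration: tagSignature is a made-up name, opening tags are picked out with a simple pattern rather than a real HTML parser, and only the first letter of each tag name is kept, as in the example above.

```javascript
// Builds a signature from the first letter of every opening tag, in order.
function tagSignature(html) {
    var signature = "";
    // Matches opening tags such as <div class="x"> but not closing tags or comments.
    var tagPattern = /<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;
    var match;
    while ((match = tagPattern.exec(html)) !== null) {
        signature += match[1].charAt(0).toLowerCase();
    }
    return signature;
}

var pageA = "<html><head><title>Page 1</title></head><body><div>foo</div></body></html>";
var pageB = "<html><head><title>Page 2</title></head><body><div>bar baz</div></body></html>";
console.log(tagSignature(pageA) === tagSignature(pageB)); // true
```

Two pages that differ only in their text produce the same signature, so an exact match (or an ordinary hash of the signature) is enough to group them.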
I need to clean up some text for html that used ALLCAPS instead of italics. So I'd like to take something that looks like this:
Here is an artificial EXAMPLE of a piece of TEXT that
uses allcaps as a way of EMPHASIZING words.
And convert it into this:
Here is an artificial <em>example</em> of a piece of <em>text</em> that
uses allcaps as a way of <em>emphasizing</em> words.
I'm tagging this with regex and notepad++, but (as you can probably tell) I don't know the first thing about how to use them.
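There is no way to do this with the Notepad++ regex engine.
You can run a script that does the job, in Perl for example:
perl -pi.back -e "s#\b([A-Z]{2,})\b#'<em>'.lc($1).'</em>'#eg" yourfile.html
The original yourfile.html will be backed up as yourfile.html.back. (The double quotes suit the Windows command prompt; in a Unix shell, $1 would need protecting from the shell.)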
As far as I know the regex engine of Notepad++ is not advanced enough to do this.
I would advise using a programming language to accomplish this. In PHP, for example, you could do this:
echo preg_replace_callback('/\b[A-Z]{2,}\b/', function ($m) { return '<em>' . strtolower($m[0]) . '</em>'; }, $s);
Be sure to exclude the legitimate initial capital letter of an ordinary word in the regex (the {2,} quantifier does this by requiring at least two capitals).
AFAIK you cannot change casing in the Find\Replace mechanism of Notepad++.
If all you need is the <em> tag insertion you can do the following:
In the Find box type (\s+)([A-Z]+)(\s+), and in the Replace box type \1<em>\2</em>\3.
You can try some of the TextFX tools maybe in the TextFX Characters sub-menu.
Here is how to do this using JavaScript's string replace method:
var capfix = function (x) {
    // Wrap one all-caps word in <em> and lower-case it.
    var emout = function (y) {
        return "<em>" + y.toLowerCase() + "</em>";
    };
    // Two or more capitals, so "I" and sentence-initial letters are left alone.
    return x.replace(/\b[A-Z]{2,}\b/g, emout);
};
To execute just call:
capfix(yourData);
This assumes that "yourData" is just a variable that represents your data as a string. If you wanted to use a web tool then "yourData" could represent the value from some input control, as in the following:
var yourData = document.getElementById("myinput").value;
alert(capfix(yourData));
To make that work just put an id attribute on your web tool input such as:
<textarea id="myinput"></textarea>
I have an HTML file with one <pre>...</pre> tag. What regex is needed to match all content within the <pre> element?
QString pattern = "<pre>(.*)</pre>";
QRegExp rx(pattern);
rx.setCaseSensitivity(cs);
int pos = 0;
QStringList list;
while ((pos = rx.indexIn(clipBoardData, pos)) != -1) {
list << rx.cap(1);
pos += rx.matchedLength();
}
list.count() is always 0
HTML is not a regular language, you do not use regular expressions to parse it.
Instead, use QXmlSimpleReader to load the XML, then QXmlQuery to find the PRE node and then extract its contents.
DO NOT PARSE HTML USING Regular Expressions!
Instead, use a real HTML parser, such as this one
I did it using substrings:
int begin = clipBoardData.indexOf("<pre");
int end = clipBoardData.indexOf("</body>");
QString result = clipBoardData.mid(begin, end - begin);
The result includes the <pre> tags, but I found that's even better ;)
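For comparison, the same substring idea sketched in JavaScript rather than Qt. Unlike the snippet above it stops at the closing </pre> instead of </body>, and it returns null when either tag is missing; extractPre is a made-up name.

```javascript
// Returns the first <pre ...>...</pre> block including its tags, or null.
function extractPre(html) {
    // Locate the opening tag case-insensitively, with or without attributes.
    var open = /<pre\b/i.exec(html);
    if (open === null) return null;

    // Search for the closing tag starting from the opening tag's position.
    var closePattern = /<\/pre>/gi;
    closePattern.lastIndex = open.index;
    var close = closePattern.exec(html);
    if (close === null) return null;

    return html.slice(open.index, close.index + close[0].length);
}
```

Checking for the missing-tag case matters in the Qt version too: indexOf returns -1 there as well, and mid() would then be called with offsets that do not describe the block.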
I have to agree with the others. Drupal 6.x and older use regex to do a lot of work on the HTML data, and it quickly breaks if you create pages of 64Kb or more. So using a DOM or just indexOf() as you've done is a better, much faster solution.
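Now, for those interested in knowing more about regex: QRegExp's syntax is modeled on Perl's, but it does not support the lazy operator on an individual quantifier (.*?). Instead, call setMinimal(true), which makes every quantifier in the pattern non-greedy. (QRegularExpression, introduced in Qt 5, is Perl-compatible and accepts .*? directly.) Your regex would stay:
<pre>.*</pre>
and with minimal matching each match is one <pre> block in your code (if you have only one, minimal matching is not required). Note that no delimiters at the start and end of the regular expression are required here.
QRegExp re("<pre>.*</pre>", Qt::CaseInsensitive);
re.setMinimal(true); // QRegExp has no ".*?"; this makes every quantifier non-greedy
QStringList list;
for (int pos = 0; (pos = re.indexIn(html_input, pos)) != -1; pos += re.matchedLength())
    list << re.cap(0); // cap(0) is the whole match, tags included
Now list should have one <pre> block or more.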
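The difference between greedy and lazy matching is easiest to see on input with two <pre> blocks. A small sketch in JavaScript (not Qt), where the lazy form *? is supported; the sample string is made up, and [\s\S] is used so that line breaks inside a block would also match.

```javascript
var html = "<pre>first</pre><p>between</p><pre>second</pre>";

// Greedy: runs from the first <pre> to the last </pre>.
var greedy = html.match(/<pre>[\s\S]*<\/pre>/gi);

// Lazy: stops at the nearest </pre>, so each block is matched separately.
var lazy = html.match(/<pre>[\s\S]*?<\/pre>/gi);

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
```

With a single <pre> block both forms give the same result, which is why the lazy form only matters once there is more than one block.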