How do web browsers implement a text search? - html

I would like to build an application that contains an HTML web control and supports searching for and highlighting multiple phrases in the HTML area, the way web browsers do today.
The main idea is to make the search ignore the HTML tags and consider only the real text (the inner HTML).
I know that the solution will involve editing the DOM, but I'm not sure how this could be implemented.
I searched a lot on the net, and I also read Zavael's post, but unfortunately I couldn't find a suitable answer. Any answer, example or direction will be appreciated.

If you are referring to an inline search of HTML content within the page: firstly I would say that it probably isn't a good idea. You shouldn't supplant the browser's native functionality. Let users do things the way they are used to. Keep the experience consistent across different websites.
However, if you need this for some niche purpose I don't think it would be that hard.
jQuery could achieve it. Just a very rough start:
//As you type in the search box...
$("#search_box").change(function() {
//...you search the searchable area (maybe use some regex here)
$("#search_area").html();
});
EDIT: Questioner asked if I can elaborate on the code.
//As you type in the search box...
$("#search_box").change(function() {
//This part of the code is called when the contents of the search box changes
//This is the contents of the searchable area:
$("#search_area").html();
//This is the contents of the search box:
$(this).val();
//So you could perform a function here, like replacing the occurrences of the
//search text with a span (you could use CSS to highlight the span with a
//background colour):
var html_contents = $("#search_area").html();
var search_term = $(this).val();
$("#search_area").html(html_contents.replace(search_term, '<span class="highlight">'+search_term+'</span>'));
});
Please note that in the above example, the JavaScript 'replace' function will only replace the first occurrence. You need to use a regular expression with the global flag as the first parameter to replace all of them. But I don't want to write all the code for you ;)
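For what it's worth, here is a minimal sketch of the "replace all" step that also leaves the markup alone. The helper names (escapeRegExp, highlightAll) are made up for this example, and splitting on /<[^>]*>/ is only a rough approximation of "outside a tag": it will trip over a ">" inside an attribute value, comments, or script blocks, so treat it as a starting point rather than a finished solution.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so the user's search text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every case-insensitive occurrence of `term` in a highlight span,
// touching only the text between tags (hypothetical helper).
function highlightAll(html, term) {
  if (!term) return html;
  const pattern = new RegExp(escapeRegExp(term), "gi");
  return html
    .split(/(<[^>]*>)/) // the capturing group keeps the tags at odd indexes
    .map((part, i) =>
      i % 2 === 1
        ? part // a tag: leave untouched
        : part.replace(pattern, (m) => '<span class="highlight">' + m + "</span>")
    )
    .join("");
}
```

Inside the change handler it could be called as $("#search_area").html(highlightAll($("#search_area").html(), $(this).val())). Note that running it repeatedly on already highlighted content would nest spans, so you would probably want to keep a copy of the original HTML and highlight from that each time.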

Related

MediaWiki: How to update a link status programmatically

My extension renders additional links on a page (that is, it adds some <a href='...'>...</a> to the page text in the HtmlPageLinkRendererEnd hook).
See small arrows in https://withoutvowels.org/wiki/Tanakh:Genesis_1:1 for an example. The arrows are automatically added by my extension (sorry, at the time of writing this the source code is not yet released).
The problem is that red/blue ("new") status is not updated for links which I add.
Please explain how to make MediaWiki update the color of my links as appropriate, together with regular [[...]] MediaWiki links.
My current workaround is to run php maintenance/update.php, which is a very bad workaround. How can I do this better?
Normally you'd use LinkRenderer to create the links and LinkBatch to make the page existence check efficient (you don't want a separate SQL query for each link). You can't really do that in HtmlPageLinkRendererEnd since you only learn about the links one by one.
The way the parser deals with this is that it replaces links with a placeholder and collects them in a list; after parsing is mostly done, it looks them all up at once and then swaps the placeholders for the rendered links. You can probably hook into something that happens between the two (e.g. ParserAfterParse), get the list of links from the parser and use them to build a list of your own links.
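To make the placeholder idea concrete, here is a rough, language-neutral sketch written in JavaScript. It is not MediaWiki code: collectLinks, renderLinks, the placeholder format and the existingPages set are all invented for illustration, and the Set lookup merely stands in for the single batched existence query that LinkBatch would perform.

```javascript
// Pass 1: replace each [[Target]] with a placeholder and remember the target.
function collectLinks(text) {
  const targets = [];
  const withPlaceholders = text.replace(/\[\[([^\]]+)\]\]/g, (_, target) => {
    targets.push(target);
    return "\u0000LINK" + (targets.length - 1) + "\u0000";
  });
  return { withPlaceholders, targets };
}

// Pass 2: check all targets in one go, then swap placeholders for links.
function renderLinks(text, existingPages) {
  const { withPlaceholders, targets } = collectLinks(text);
  // Stand-in for one batched lookup instead of one query per link.
  const exists = targets.map((t) => existingPages.has(t));
  return withPlaceholders.replace(/\u0000LINK(\d+)\u0000/g, (_, i) => {
    const cls = exists[i] ? "" : ' class="new"'; // "new" marks a missing page
    return "<a" + cls + ' href="/wiki/' + encodeURIComponent(targets[i]) + '">' + targets[i] + "</a>";
  });
}
```

The point of the two passes is that the existence check happens once, after every link is known, which is exactly what a per-link hook cannot do on its own.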
With the valuable help of the Wikitech-l mailing list, I found a solution.
The solution is to use the ParserAfterTidy hook.
public static function onParserAfterTidy( &$parser, &$text ) {
# ...
$parserOutput = $parser->getOutput();
foreach($parserOutput->getLinks() as ...) {
# ...
$parserOutput->addLink( Title::newFromDBkey(...) );
}
}

How do I prevent unnecessary resource-load when I create new HTML elements?

Update:
In the end, I guess I was asking a stupid question. jQuery creates real DOM elements, so the images will be requested anyway. I think it's better to implement the feature with .html(xxx) rather than creating anything with $() beforehand.
This is quite tricky and I had never realized it before, but today I realized it's very important to a web project.
Say I have two images created dynamically:
var $img1 = $('<img>');
$img1.attr('src', 'http://domain.com/1.png');
var $img2 = $('<img>');
$img2.attr('src', 'http://domain.com/2.png');
Right after the browser runs the code above, the two images are requested. That wastes traffic on both the client and the server side.
Is it possible for me to control when the resource request is sent?
I would prefer NOT to do it by assigning src later, because in my case that would be much more complicated: the HTML contains a lot of other stuff, not just some img tags. For example, is it possible to tell the browser "please wait until the img tag is added to the DOM tree"?
Append the images to DOM after the page load like this:
$(document).ready(function() {
// You could use whatever jQuery selector here you like to
// determine where to append the new elements.
// For this example, I am just appending to the end of the body.
$('body').append($('<img src="http://domain.com/1.png">'));
$('body').append($('<img src="http://domain.com/2.png">'));
});

How do I determine the current page's document type in Umbraco?

I have what I feel is a very simple question about Umbraco, but one that has as of yet no apparent answer.
I have a razor template, standard stuff, with @ displaying variables and some inline C# code.
At one point in the template I use:
@Umbraco.RenderMacro("myCustomMacro");
no problems there, everything works as expected.
Now, this macro is inserted on every page (it's in the master template) but I have a page property that allows the content authors to turn it on and off via a check box in the page properties, again so far so good everything works perfectly.
However I now find that for a certain "document type" this component MUST be displayed, so I've been trying to find a way to perform that check.
Now in my mind, this should be as simple as doing something like this:
@{
if(CurrentPage.documentType == "someDocTypeAliasHere")
{
//Render the macro
}
else
{
// Render the macro only if the tick box is checked
}
}
as I say, this is (or I believe it should be anyway) a very simple operation, but one that so far does not seem to have a result.
What Have I tried so far?
Well, apart from reading every page on our-umbraco that mentions anything to do with razor and the @CurrentPage variable, I've been through the razor properties cheat sheet and tried what appear to be the most common properties, including (in no specific order):
@CurrentPage.NodeTypeAlias
@CurrentPage.NodeType
@CurrentPage.ContentType
@CurrentPage.DocumentType
and various letter case combinations of those, plus some others that looked like they might fit the bill.
Consistently the properties either don't exist or are empty so have no useable information in them to help determine the result.
So now after a couple of days of going round in circles, and not getting anywhere I find myself here..
(Please note: this is not a search the XSLT question, or iterate a child collection or anything like that, so any requests to post XSLT, Macros, Page templates or anything like that will be refused, all I need to do is find a way to determine the Document Type of the current page being rendered.)
Cheers
Shawty
PS: Forgot to mention, I'm using
umbraco v 4.11.8 (Assembly version: 1.0.4869.17899)
Just in case anyone asks.
In Umbraco 7 use currentPageNode.DocumentTypeAlias
In Umbraco 7.1 I use: @if (CurrentPage.DocumentTypeAlias == "NewsItem")
I think you do actually need to create a node each time you are on the page to access the page's properties, such as NodeTypeAlias. Try this; I have the same kind of functionality on my site, http://rdmonline.co.uk/, in the side menu, which shows different menu links depending on the page/section.
@{
var currentPageID = Model.Id;
var currentPageNode = Library.NodeById(currentPageID);
if (currentPageNode.NodeTypeAlias == "someDocTypeAliasHere")
{
//Render the macro
}
else
{
// Render the macro only if the tick box is checked
}
}
Let me know if this works for you.
This is a bit unrelated to this post, but searching Google brought me here, so I thought I'd share in case anyone else is dealing with this issue. In Umbraco 7, to get all content in the site for a specific type:
var articles = CurrentPage.AncestorOrSelf(1).Descendants()
.Where("DocumentTypeAlias == \"BlogPost\"").OrderBy("CreateDate desc");
If your razor view inherits from Umbraco.Web.Mvc.UmbracoViewPage, you could also use UmbracoHelper:
@if (UmbracoHelper.AssignedContentItem.DocumentTypeAlias.Equals("NewsItem")) { ... }
Querying for a specific DocumentType is also easy:
UmbracoHelper.AssignedContentItem.Descendants("NewsItem")
This code will recursively return the list of IPublishedContent nodes.
If you wish to use this list with your specific DocumentType information, these items would have to be mapped to the specific type. Other than that, IPublishedContent gives you the basic information for the nodes.
I later saw that you are using an older version of Umbraco. :)
This implementation is only for v7.

Pulling out some text from a giant HTML file using Nokogiri/xpath

I am scraping a website and am trying to pull out certain elements from the HTML. The sites I am scraping have script tags with a bunch of info in them; however, there is only one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
With some stuff above and below it. Now, this is different for each page source except for obviously the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line, and cutting out just the URL? I would need to use regular expressions I feel as the URLs are dynamic.
The "gsub" method does something similar to what I want to search for, with its ability to use /regex/. But, I am not wanting to replace anything, I just want to find that URL in the source code using a /regex/ and copy it.
According to your comments, I guess this is what you're looking for:
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/
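A slightly tighter variant, as a sketch: instead of matching everything from "http" to the end of the line, anchor on the 'image' key and capture only what sits between the quotes. The function name extractImageUrl is made up, and the pattern assumes the key and value are single-quoted exactly as in the sample line.

```javascript
// Pull the URL out of a line such as 'image':'http://…', (hypothetical helper).
// Returns the URL string, or null when no such line is present.
function extractImageUrl(source) {
  const match = /'image'\s*:\s*'(https?:\/\/[^']+)'/.exec(source);
  return match ? match[1] : null;
}

const sample = "'title':'x',\n'image':'http://ut5.example.com/t/231/3_b_643435.jpg',\n'id':1";
console.log(extractImageUrl(sample));
```

The same pattern should carry over to Ruby more or less unchanged; there, indexing a string with a regex and a capture number (source[/.../, 1]) returns the captured group without replacing anything.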

Match multiple terms within <body> tags

I want to match any occurrence of a search term (or list of search terms) within the <body> tags of a document. My current solution uses preg (within a Joomla plugin):
$pattern = '/matchthisterm/i';
$article->text = preg_replace($pattern,"<span class=\"highlight\">\\0</span>",$article->text);
But this replaces everything within the HTML of the document, so I need to match the <body> tags first. Is this even the best way to achieve this?
EDIT:
OK, I've used simplehtmldom, but just need some help getting to the correct term. So far I've got:
$pattern = '/(matchthisterm)/i';
$html = str_get_html($buffer);
$es = $html->find('text');
foreach ($es as $term) {
//Match to the terms within the text nodes
if (preg_match($pattern, $term->plaintext)) {
$term->outertext = '<span class="highlight">' . $term->outertext . '</span>';
}
}
This makes the entire node text bold, am I ok to use the preg_replace in here?
SOLUTION:
//Get the HTML and look at the text nodes
$html = str_get_html($buffer);
$es = $html->find('text');
foreach ($es as $term) {
//Match to the terms within the text nodes
$term->outertext = str_ireplace('matchthis', '<span class="highlight">matchthis</span>', $term->outertext);
}
No, processing [X][HT]ML with regex is largely disastrous. In the simplest case for your example, this input:
<a href="…matchthisterm/bar">bof</a>
gives quite thoroughly broken output:
<a href="…<span class="highlight">matchthisterm</span>/bar">bof</a>
The proper way to do it would be to use a proper HTML/XML parser (for example DOMDocument.loadHTML or simplehtmldom), then scan and replace the contents of each text node separately. Finally re-save the HTML back to a string.
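The failure mode is easy to reproduce. Below is a small sketch in JavaScript that mirrors the preg_replace call from the question; naiveHighlight and the href value are made up for the demonstration.

```javascript
// Naive approach: run the pattern over the raw HTML string, tags and all.
// `term` is assumed to contain no regex metacharacters.
function naiveHighlight(html, term) {
  const pattern = new RegExp(term, "gi");
  return html.replace(pattern, '<span class="highlight">$&</span>');
}

// The term appears both in an attribute value and in the link text.
const input = '<a href="/foo/matchthisterm/bar">matchthisterm</a>';
console.log(naiveHighlight(input, "matchthisterm"));
```

The match inside the href gets wrapped as well, so the span ends up in the middle of the attribute value and the link is no longer valid markup. Replacing per text node avoids this because attribute values are never visited.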
An alternative for search term highlighting is to do it in JavaScript. Since the browser has already parsed the HTML to a DOM, that saves you a processing step. See eg. this question for an example.
I agree processing HTML with regex is not a good solution.
I just read the argument about why regex can't parse HTML here: RegEx match open tags except XHTML self-contained tags
I quite agree with the whole thing, but the problem is MUCH simpler here: we just need to know whether we are inside some HTML tag or not. We don't have to parse an HTML structure, interpret a tree, or deal with mismatched tags or other errors. We just know that an HTML tag is something between < and >. I believe regex is a very good, well-suited and consistent tool here.
The fact that we're dealing with HTML doesn't by itself mean we shouldn't use regex. We need to focus on the real problem here, which I believe doesn't really involve processing HTML. We only need to know whether we're inside a tag or not. I hope I won't get too many downvotes for this, but I fully stand by my position.
I'm pointing you to a previous post (where you put a link to this topic) that I made earlier today: Highlight text, except html tags
Along the same lines, and assuming we know all we need to, you're using preg_replace() where a simpler function like str_ireplace() would be sufficient. If you just need to replace a word (or a set of words) inside a string and deal with case insensitivity, don't use regex. Keep it simple. (I'm assuming you didn't deliberately simplify the replacement you're trying to make in order to explain your problem here.)
I haven't used preg but I've done pattern matching in perl, java and actionscript before. If this is anything similar you have to escape special characters. For example "\<span class.... I found a website that talks about using preg, in case you haven't come across this site, that can be found here