HTML Search Function - html

I need a search function for an intranet. Can I do this purely in HTML? If so, could you point me in the right direction? I tried googling it, but the results were all about searching via search engines.
Cheers
EDIT: I only need it to search for text on one page; it's not a whole website, just one page.

You will need some server-side scripting in order to provide the directory listings.
I recommend using PHP's glob function recursively, but there might be a better option.
Edit:
For one page, using JavaScript, you could get the contents of all of the elements, and use regex or indexOf to determine if the string exists within the text, and if so, where.
If you use indexOf, note that it only returns the index of the first occurrence of the string, so you will need to repeat the search until you have gathered all occurrences.
You can pass the start parameter to skip the part of the text already searched, so that each new search begins after the last occurrence found.
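The repeated search described above can be sketched as follows. This is a minimal illustration in Python, where str.find plays the role of JavaScript's indexOf (both take an optional start position and return -1 when nothing is found). The helper name find_all is made up for this example.

```python
def find_all(text, needle):
    """Return the start index of every non-overlapping occurrence of needle."""
    positions = []
    if not needle:
        # An empty search string would match everywhere; treat it as no match.
        return positions
    start = 0
    while True:
        # Like indexOf(needle, start): search only from `start` onwards.
        idx = text.find(needle, start)
        if idx == -1:
            break
        positions.append(idx)
        # Resume the search just past the occurrence that was found.
        start = idx + len(needle)
    return positions


hits = find_all("the cat sat on the mat", "at")
print(hits)  # [5, 9, 20]
```

To also report overlapping matches, advance start by 1 instead of len(needle).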

Related

Regular expression to pick a row in an html table containing desired text

Sorry, but uhrm, I'd like to use regexp (actually I'd use something else but I want to do the task within a Matlab function) to pick a single row containing desired keywords within an html table.
I am calling the Matlab function regexpi (the case-insensitive version of regexp), which is akin to PHP regex from what I can tell.
Ok, here's a snippet from such an html table to parse:
<tr><td>blu</td><td>value</td></tr><tr><td>findme</td><td>value</td></tr><tr><td>ble</td><td>value</td></tr>
The desired row to pick contains the word "findme".
(added:) Content of other cells and tags in the table could be anything (here "bla" is a dummy value)- the important part is the presence of "findme" and that a single line (not more) is caught (or all lines containing "findme" but such behaviour is not expected). Any paired name/value table in a wikipedia page is a good example.
I tinkered with https://regex101.com/ using whatever I could dig up at the Matlab documentation (forward/backward looking, combinations of :,> and ?), but have failed to identify a pattern that will pick just the right row (or all those that contain the keyword "findme"). The following pattern for instance will pick the text but not the entire row: <tr[^>]*>[^>]*.*?(findme).*?<\/td .
Pattern <tr[^>]*>(.*?findme.*?)<\/tr[^>]*> picks the row but is too greedy and picks preceding rows.
Note that the original task I had set out was to capture entire tables and then parse these, but the Matlab regexp-powered function I found for the task had trouble with nested tables (or I had trouble implementing it for the task).
The question is how to return a row containing desired keywords from an html table, programmatically, within a matlab function (without calling an external program)? Bonus question is how to solve the nested table issue, but maybe that's another question.
I suggest you split up the string with strsplit and use contains for the filtering, which is a lot more readable and maintainable than a regex pattern:
% Note the trailing space after '<a' so the pieces join as '<a href=...'.
htmlString = ['<tr><td>blu</td><td>value</td></tr><tr><td><a ',...
'href="bla">findme</a></td><td>value</td></tr><tr><td><a ',...
'href="ble">ble</a></td><td>value</td></tr>'];
keyword = 'findme';
% Splitting on '<tr>' removes the tag, so it is added back afterwards.
splitStrings = strsplit(htmlString,'<tr>');
desiredRow = ['<tr>' splitStrings{contains(splitStrings,keyword)}]
The output is:
<tr><td><a href="bla">findme</a></td><td>value</td></tr>
Alternatively you may also combine extractBetween and contains:
allRows = extractBetween(htmlString,'<tr>','</tr>');
desiredRow = ['<tr>' allRows{contains(allRows,keyword)} '</tr>']
If you must use regex:
regexp(htmlString,['<tr><td>[^>]+>' keyword '.*?<\/tr>'],'match')
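For the "too greedy" pattern from the question, one possible regex-only fix is to forbid the match from crossing a closing </tr>. The sketch below shows the idea in Python rather than Matlab, using a negative lookahead on every consumed character. The variable names are made up, and whether the pattern carries over to regexpi unchanged is an assumption that has not been tested here. It does not handle nested tables.

```python
import re

html = ('<tr><td>blu</td><td>value</td></tr>'
        '<tr><td>findme</td><td>value</td></tr>'
        '<tr><td>ble</td><td>value</td></tr>')

# (?:(?!</tr>).)*? consumes one character at a time, but only while the
# text ahead is not "</tr>", so a match can never span two rows.
row_pattern = re.compile(
    r'<tr[^>]*>(?:(?!</tr>).)*?findme(?:(?!</tr>).)*?</tr>',
    re.IGNORECASE | re.DOTALL,
)

rows = row_pattern.findall(html)
print(rows)  # ['<tr><td>findme</td><td>value</td></tr>']
```

A match attempt that starts at the first row fails as soon as it would have to step over that row's </tr>, so only the row that really contains the keyword is returned.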
Try this
%<td>(.*?)%sg
https://regex101.com/r/0Xq0mO/1

Wikipedia api fulltext search to return articles with title, snippet and image

I've been looking for a way to query the wikipedia api based on a search string for a list of articles with the following properties:
Title
Snippet/Description
One or more images related to the article.
I also have to make the query using jsonp.
I've tried using the list=search parameter
http://en.wikipedia.org/w/api.php?action=query&list=search&prop=images&format=json&srsearch=test&srnamespace=0&srprop=snippet&srlimit=10&imlimit=1
But it seems to ignore prop=images. I've also tried variations using prop=imageinfo and prop=pageimages, but they all give me the same result as using list=search alone.
I've also tried action=opensearch
http://en.wikipedia.org/w/api.php?action=opensearch&search=test&limit=10&format=xml
This gives me exactly what I want when I set format=xml, but it returns a simple array of page titles when using format=json and therefore fails because of the jsonp requirement.
Is there another approach to doing this? I'd really like to solve this in a single request rather than make the first search request and then a second request for the images using titles=x|y|z.
As Bergi suggested, using generators is the way to go here. Specifically what I would do:
use list=search as a generator, to get the list of articles
use prop=pageimages to get a representative image for each article
use prop=extracts to get a description for each article
The whole query could look like this:
http://en.wikipedia.org/w/api.php?format=json&action=query&generator=search&gsrnamespace=0&gsrsearch=test&gsrlimit=10&prop=pageimages|extracts&pilimit=max&exintro&explaintext&exsentences=1&exlimit=max
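As a rough sketch, the same query can be assembled from a parameter dictionary, which makes it easier to see which parameters belong to the generator (the gsr prefix) and which to the properties. The code below is Python and only builds the URL; it sends no request. The function name build_search_url is made up for this example, and the optional callback parameter is included on the assumption that it is what the JSONP requirement needs.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_search_url(term, limit=10, callback=None):
    """Build the generator=search query from the answer above."""
    params = {
        "format": "json",
        "action": "query",
        # list=search used as a generator; its parameters get a "g" prefix.
        "generator": "search",
        "gsrnamespace": 0,
        "gsrsearch": term,
        "gsrlimit": limit,
        # Properties fetched for every page the generator yields.
        "prop": "pageimages|extracts",
        "pilimit": "max",
        # Flag parameters: sent with an empty value.
        "exintro": "",
        "explaintext": "",
        "exsentences": 1,
        "exlimit": "max",
    }
    if callback:
        # Assumption: name of the JSONP callback function, if one is needed.
        params["callback"] = callback
    return API_ENDPOINT + "?" + urlencode(params)


url = build_search_url("test")
print(url)
```

urlencode percent-encodes the | in the prop value as %7C, which is equivalent in a query string.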
I've tried using the list=search parameter, but it seems to ignore the prop=images
If you want to retrieve any properties, you need to specify a list of pages for which you want to get these; e.g. by using the titles=, pageids=, or revids= parameters. You didn't send any, so you did not get a result for the prop=images.
If you did use api.php?action=query&list=search&srsearch=test&prop=images&titles=test you would have gotten the search results for test and the images of the Test page.
You can however also use the collection that the list query generates for your property query, using the list module as a generator. The query would look like
api.php?action=query&generator=search&gsrsearch=test&gsrnamespace=0&gsrprop=snippet&prop=images. Unfortunately, it does not yield the attributes that the list contained, but only used the pageids for a basic property query.
Using two queries is probably the way to go. By the way, I'd recommend using the pageimages property; it will likely give you the best results.

Return certain part of string mysql

I am fetching results from a database table which contains the text of multiple pages.
These pages have links in their content.
I am trying to get all the links from the pages in a table, but I am also getting the unwanted text.
For example, this could be the content of a certain part of a page:
line 1: This is the link for lalalaal <a href="page5.html"> click</a>
line 2 if you want to go to page lalalala2 click
Now I only want the part starting at <a href and ending at </a> in the result record. If there is more than one anchor tag in the text, each anchor tag should be treated as a separate record.
the returned result should be like
ID value
1 ' click '
2 ' click '
I have tried the following queries :
Select * from [Database.tablename] where value between <a href and </a>;
Select * from [Database.tablename] locate '(<a href, Value)>0' and locate (</a>, value)>0;
but none of the 2 queries are giving me the wanted result...
This sort of text extraction is probably best addressed using regular expressions.
MySQL has some support (see here), but it is only useful for identifying which rows contain an <a></a> pair. Even knowing that there is at least one link inside a record doesn't help you extract the (possibly many) links and treat them as separate records.
To successfully extract those links, at least according to my knowledge, you need a tool better suited to work with regular expressions. Most languages (Perl, PHP, Python, Java, etc.) support them, some natively, some using available libraries. You can select only records containing links (using REGEXP), and extract every link via code.
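A minimal sketch of the "extract every link via code" step, written in Python. It assumes the rows have already been fetched from MySQL as (id, value) pairs; the sample rows and the function name extract_links are invented for illustration. The regular expression is a simple approximation and will not cope with every HTML edge case.

```python
import re

# Matches an opening <a ...> tag up to the nearest closing </a>.
# \b keeps tags such as <abbr> from matching.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)


def extract_links(rows):
    """Turn (id, text) rows into one (id, anchor) record per link found."""
    records = []
    for row_id, value in rows:
        for match in ANCHOR_RE.finditer(value):
            records.append((row_id, match.group(0)))
    return records


# Hypothetical rows standing in for the query result.
rows = [
    (1, 'This is the link for page 5 <a href="page5.html">click</a>'),
    (2, 'Go <a href="a.html">here</a> or <a href="b.html">there</a>.'),
    (3, "No links in this one."),
]

for record in extract_links(rows):
    print(record)
```

Row 2 yields two records and row 3 yields none, which is the "one record per anchor tag" behaviour the question asks for.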
Another way of handling this would be performing the query on MySQL, exporting the results to a text file, and working on its contents with shell scripting (for instance, using sed under UNIX/Linux).
If you need it to be implemented using only MySQL, then my best guess is trying with a stored procedure (to be able to work on the results record-by-record.) I still cannot think of an implementation of such SP that guarantees detecting and successfully extracting every possible link inside a record as one record per link.

Find column values that are a start string of given string.

I have a database table that contains URLs in a column. I want to show certain data depending on what page the user is on, defaulting to a 'parent' page if not a direct match. How can I find the columns where the value is part of the submitted URL?
Eg. I have www.example.com/foo/bar/baz/here.html; I would expect to see (after sorting on length of column value):
www.example.com/foo/bar/baz/here.html
www.example.com/foo/bar/baz
www.example.com/foo/bar
www.example.com/foo
www.example.com
if all those URLs are in the table of course.
Is there a built in function or would I need to create a procedure? Googling kept getting me to LIKE and REGEXP, which is not what I need. I figured that a single query would be much more efficient than chopping the URL and making multiple queries (the URLs could potentially contain many path components).
Simply turn the LIKE operator around, so that the stored value is the pattern and the submitted URL is the string being tested:
SELECT * FROM urls WHERE "www.example.com/foo/bar/baz/here.html" LIKE CONCAT(url, "%");
http://sqlfiddle.com/#!2/ef6ee/1
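To try the reversed LIKE without a MySQL server, here is a small sketch using Python's built-in sqlite3 module. SQLite is a stand-in here, so string concatenation is written with || instead of CONCAT; the table contents and the function name matching_prefixes are made up for the example.

```python
import sqlite3


def matching_prefixes(conn, page_url):
    """Return stored URLs that are a prefix of page_url, longest first."""
    rows = conn.execute(
        # The column supplies the pattern; the submitted URL is the subject.
        "SELECT url FROM urls WHERE ? LIKE url || '%' "
        "ORDER BY LENGTH(url) DESC",
        (page_url,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT)")
conn.executemany(
    "INSERT INTO urls VALUES (?)",
    [
        ("www.example.com",),
        ("www.example.com/foo",),
        ("www.example.com/foo/bar",),
        ("www.example.com/other",),
    ],
)

matches = matching_prefixes(conn, "www.example.com/foo/bar/baz/here.html")
print(matches)
```

Two caveats apply to the MySQL version as well: any % or _ inside a stored URL acts as a wildcard, and the match is a plain string prefix, so www.example.com/foo would also match a page under www.example.com/foobar.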

separating values in a URL, not with an &

Each parameter in a URL can have multiple values. How can I separate them? Here's an example:
http://www.example.com/search?queries=cars,phones
So I want to search for 2 different things: cars and phones (this is just a contrived example). The problem is the separator, a comma. A user could enter a comma in the search form as part of their query and then this would get screwed up. I could have 2 separate URL parameters:
http://www.example.com/login?name1=harry&name2=bob
There's no real problem there, in fact I think this is how URLs were designed to handle this situation. But I can't use it in my particular situation. Requires a separate long post to say why... I need to simply separate the values.
My question is basically, is there a URL encodable character or value that can't possibly be entered in a form (textarea or input) which I can use as a separator? Like a null character? Or a non-visible character?
UPDATE: thank you all for your very quick responses. I should've listed the same parameter name example too, but order matters in my case so that wasn't an option either. We solved this by using a %00 URL encoded character (UTF-8 \u0000) as a value separator.
The standard approach to this is to use the same key name twice.
http://www.example.com/search?queries=cars&queries=phones
Most form libraries will allow you to access it as an array automatically. (If you are using PHP (and making use of $_POST/GET and not reinventing the wheel) you will need to change the name to queries[].)
You can give them each the same parameter name.
http://www.example.com/search?query=cars&query=phones
The average server side HTTP API is able to obtain them as an array. As per your question history, you're using JSP/Servlet, so you can use HttpServletRequest#getParameterValues() for this.
String[] queries = request.getParameterValues("query");
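For comparison, the same repeated-key idea can be sketched with Python's standard library, which is roughly what getParameterValues does on the servlet side. The parameter name query is taken from the example URL above; the values are invented. Values keep the order in which they appear in the query string, and commas inside a value are percent-encoded automatically.

```python
from urllib.parse import urlencode, parse_qs

# doseq=True emits one key=value pair per list element.
query_string = urlencode({"query": ["cars, trucks", "phones"]}, doseq=True)
print(query_string)  # query=cars%2C+trucks&query=phones

# parse_qs collects repeated keys into a list, in order of appearance.
values = parse_qs(query_string)["query"]
print(values)  # ['cars, trucks', 'phones']
```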
Just URL-encode the user input so that their commas become %2C.
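A sketch of how this could work in practice, in Python. The important detail is that the split on the separator has to happen on the raw, still-encoded parameter value; if the framework decodes the value first, user commas and separator commas become indistinguishable. The function names join_values and split_values are made up for this example.

```python
from urllib.parse import quote, unquote


def join_values(values):
    """Percent-encode each value, then join with a literal comma."""
    # safe="" makes sure commas (and slashes) inside a value are encoded.
    return ",".join(quote(value, safe="") for value in values)


def split_values(raw):
    """Split the raw (still encoded) value on commas, then decode each part."""
    return [unquote(part) for part in raw.split(",")]


encoded = join_values(["cars, trucks", "phones"])
print(encoded)                # cars%2C%20trucks,phones
print(split_values(encoded))  # ['cars, trucks', 'phones']
```

Note that an empty raw string comes back as a single empty value, so an empty list needs separate handling.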
Come up with your own separator that is unlikely to get entered in a query. Two underscores '__' for example.
Why not just do something like "||"? Anyone who types that into a search area probably fell asleep on their keyboard :} Then just explode it on the backend.
The easiest thing to do would be to use a custom separator like [!!ValSep!!].