Regular expression to pick a row in an html table containing desired text

I'd like to use a regexp (I'd actually prefer something else, but I want to do the task within a Matlab function) to pick a single row containing desired keywords from an html table.
I am calling the Matlab function regexpi (the case-insensitive version of regexp), which is akin to PHP regex from what I can tell.
Here's a snippet from such an html table to parse:
<tr><td>blu</td><td>value</td></tr><tr><td>findme</td><td>value</td></tr><tr><td>ble</td><td>value</td></tr>
The desired row to pick contains the word "findme".
(added:) The content of other cells and tags in the table could be anything (here "bla" is a dummy value); the important part is the presence of "findme" and that a single row (not more) is caught (or all rows containing "findme", but such behaviour is not expected). Any paired name/value table in a wikipedia page is a good example.
I tinkered with https://regex101.com/ using whatever I could dig up in the Matlab documentation (lookahead/lookbehind, combinations of :, > and ?), but have failed to identify a pattern that will pick just the right row (or all those that contain the keyword "findme"). The following pattern, for instance, will pick the text but not the entire row: <tr[^>]*>[^>]*.*?(findme).*?<\/td
The pattern <tr[^>]*>(.*?findme.*?)<\/tr[^>]*> picks the row, but the match starts at the first <tr>, so it also picks up the preceding rows.
Note that the original task I had set out was to capture entire tables and then parse these, but the Matlab regexp-powered function I found for the task had trouble with nested tables (or I had trouble implementing it for the task).
The question is how to return a row containing desired keywords from an html table, programmatically, within a matlab function (without calling an external program)? Bonus question is how to solve the nested table issue, but maybe that's another question.

I suggest you split up the string with strsplit and use contains for the filtering, which is a lot more readable and maintainable than a regex pattern:
htmlString = ['<tr><td>blu</td><td>value</td></tr><tr><td><a ',...
'href="bla">findme</a></td><td>value</td></tr><tr><td><a ',...
'href="ble">ble</a></td><td>value</td></tr>'];
keyword = 'findme';
splitStrings = strsplit(htmlString,'<tr>');
desiredRow = ['<tr>' splitStrings{contains(splitStrings,keyword)}]
The output is:
<tr><td><a href="bla">findme</a></td><td>value</td></tr>
Alternatively you may also combine extractBetween and contains:
allRows = extractBetween(htmlString,'<tr>','</tr>');
desiredRow = ['<tr>' allRows{contains(allRows,keyword)} '</tr>']
If you must use regex:
regexp(htmlString,['<tr><td>[^>]+>' keyword '.*?<\/tr>'],'match')
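The "too greedy" pattern from the question can also be repaired directly by forbidding the match from crossing a row boundary. Below is a minimal sketch of that idea in JavaScript rather than Matlab; the helper names findRows and escapeRegExp are made up for illustration. It relies on a negative lookahead, which the Matlab regexp documentation also lists, so the same pattern should carry over, but that is an assumption and has not been tested in Matlab. It does not handle nested tables.

```javascript
// Escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return every <tr>...</tr> row that contains the keyword.
// (?:(?!</tr>)[\s\S])*? consumes characters one at a time, but never
// steps over a closing </tr>, so a match cannot span two rows.
function findRows(html, keyword) {
  const notRowEnd = '(?:(?!</tr>)[\\s\\S])*?';
  const pattern = new RegExp(
    '<tr[^>]*>' + notRowEnd + escapeRegExp(keyword) + notRowEnd + '</tr>',
    'gi'
  );
  return html.match(pattern) || [];
}

const html =
  '<tr><td>blu</td><td>value</td></tr>' +
  '<tr><td><a href="bla">findme</a></td><td>value</td></tr>' +
  '<tr><td><a href="ble">ble</a></td><td>value</td></tr>';

console.log(findRows(html, 'findme'));
```

Because the lookahead is checked before every character, a row without the keyword simply fails to match and the search resumes at the next <tr>.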

Try this
%<td>(.*?)%sg
https://regex101.com/r/0Xq0mO/1

Indexeddb sorting with multiple indexes

I have a file object store with an index on name and library_id, like below:
let objectStore = db.createObjectStore('file', { keyPath: 'id' });
objectStore.createIndex('nameLibId', ['attributes.name', 'attributes.library_id'], { unique: false });
The object store contains files from multiple library ids. I'd like to apply the name sort to the files of one particular library id. I tried querying the index as shown below, but it returns empty data.
let self = this,
db = get(self, 'db'),
transaction = db.transaction(["file"], "readonly"),
objectStore = transaction.objectStore("file"),
index = objectStore.index('nameLibId'),
keyRange = IDBKeyRange.only('library_id'),
req = index.getAll(keyRange);
req.onsuccess = ((e)=>{
console.log(e.target.result); // returns empty array
});
I have attached a screenshot of the db model for reference.
The files named 24536475, abc, created, jhgf and lastmodified belong to the library id 123.
The files named Screen Shot..* belong to another library id, 234.
I need the files of the given library id only, sorted by name. Any help would be highly appreciated.
If your index is based on a properties array and you want to match something using IDBKeyRange.only, then your parameter to IDBKeyRange.only should also be an array. Right now you are comparing a basic string value against a properties array value, where of course nothing matches. In other words, you cannot query against a two-part array using only one part of it.
Furthermore, the parameter to IDBKeyRange.only isn't a property name, it is a value. You want to specify a value to match in the index's set of keypath values. For example, if your index was based exclusively on attributes.name, then you would want to specify a particular value within that index, such as "abc".
And so, taking into account the above two points, and given that your index is not a single value but is instead an array of two properties, you need to revise your parameter to IDBKeyRange.only to look for an array. Something like IDBKeyRange.only(['abc', 'yoktc....']);.
Now, this is further complicated by the fact that what you are doing in your code does not actually accomplish what you want. Ignoring the sort concern for a moment, you only want to use the id condition, and not the name, when matching rows of this index. So you might be tempted to try IDBKeyRange.only([undefined, 'asdf']). Unfortunately this will not work at all, because undefined is not a valid key (you will get a javascript error).
So, you must always query by both values, even though you only want to apply criteria to one of the values. The trick here is that you switch to using a different method than only. You use IDBKeyRange.bound(), and furthermore, you do a trick where you specify a criteria such as "smallest possible number is less than my number and my number is less than largest possible number", e.g. a condition that always is true. You use "smallest possible value" as your lower boundary, and "largest possible value" as your upper boundary.
Here is an example in your case. The smallest possible value of name is, I think, the empty string. For the largest possible value we need something that sorts after every name; tilde "~" sorts after all ASCII letters and digits, so let's use that. So, now we would rewrite the range parameter. Instead of using IDBKeyRange.only, we use IDBKeyRange.bound. It looks like the following (roughly):
var libId = ???; // the library id value to match
var smallestNameValue = '';
var largestNameValue = '~';
var lowerBound = [smallestNameValue, libId];
var upperBound = [largestNameValue, libId];
var range = IDBKeyRange.bound(lowerBound, upperBound);
Now, the second part, regarding sorting, and a major caveat of using indices that have multiple parts (not to be confused with the multiEntry index property, ugh). I myself get this backwards all the time, so I might even be wrong here and the above will work. The problem with the above is that once the first element of the key has decided the comparison, the second is ignored, because indexedDB's comparison function compares arrays element by element and stops at the first difference. Your query is going to match almost everything, because nearly every index row has a name between the two name bounds, whatever its library id. So the trick is to always query first by the important condition, that is, to pay attention to the order in which you specify your conditions. That means you need to switch the order of the properties you specified when creating the index, so that you can query first by libId and then by name.
Instead of createIndex('nameLibId',['attributes.name','attributes.library_id']); you want to do createIndex('nameLibId',['attributes.library_id', 'attributes.name']);. And this also means you need to swap your lower and upper bound queries, e.g. var lowerBound = [libId, smallestNameValue]; (and don't forget to switch the upper).
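To see why the property order matters without opening a database, here is a rough sketch that emulates the element-by-element comparison in plain JavaScript. cmpKeys and queryIndex are hypothetical helpers, not part of the IndexedDB API, and cmpKeys only handles arrays of strings. The library ids are assumed to be stored as strings, and the file named 'other' is made up for illustration.

```javascript
// Simplified stand-in for indexedDB.cmp: handles arrays of strings only.
// Arrays are compared element by element; the first difference decides.
function cmpKeys(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return Math.sign(a.length - b.length);
}

// Emulates index.getAll(IDBKeyRange.bound(lower, upper)) on an in-memory
// array: keep records whose key lies within the bounds, ordered by key.
function queryIndex(records, keyOf, lower, upper) {
  return records
    .map((record) => ({ key: keyOf(record), record }))
    .filter(({ key }) => cmpKeys(lower, key) <= 0 && cmpKeys(key, upper) <= 0)
    .sort((x, y) => cmpKeys(x.key, y.key))
    .map(({ record }) => record);
}

const files = [
  { id: 1, attributes: { name: 'jhgf', library_id: '123' } },
  { id: 2, attributes: { name: 'abc', library_id: '123' } },
  { id: 3, attributes: { name: 'other', library_id: '234' } }, // hypothetical
];

// Index on [name, library_id]: the name bounds admit every record.
const nameFirst = queryIndex(
  files,
  (f) => [f.attributes.name, f.attributes.library_id],
  ['', '123'],
  ['~', '123']
);

// Index on [library_id, name]: only library 123 survives, sorted by name.
const libFirst = queryIndex(
  files,
  (f) => [f.attributes.library_id, f.attributes.name],
  ['123', ''],
  ['123', '~']
);

console.log(nameFirst.map((f) => f.attributes.name));
console.log(libFirst.map((f) => f.attributes.name));
```

With the name first, the file from library 234 slips through because its name already falls between '' and '~'; with the library id first, it is excluded.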
As I mentioned in my answer on using compound indices, you can always use indexedDB.cmp to experiment. Right now, open up the console on this web page. In the console, type something like this:
indexedDB.cmp(['', '5'], ['~', '5']);
Take a look at the results.
Some final notes:
Tilde might be the wrong thing to use, sorry but I am not bothering to remember. You could also just try any valid sentinel value, where by sentinel I mean any value you know will always come after all your other valid values.
As I point out in my other answer, if either property is missing from an object, that object won't appear in the index and so won't match.
For cmp, -1 means left is less than right, 0 means left equals right, and 1 means left is greater than right.

Reshape the dataset into more relational format (Transpose SOME rows and assign them to a data subset)

I have a spreadsheet/csv:
Code:,101,Course Description:,"Introduction to Rocket Science",
Student Name,Lecture Hours,Labs Hours,Test Score,Status
John Galt,48,120,4.7,Passed
James Taggart,50,120,4.9,Passed
...
I need to reshape it to the following view:
Code:,Course Description:,Students,Lecture Hours,Labs Hours,Average Test Score,Teaching Staff
101,"Introduction to Rocket Science",John Galt,48,120,4.7,Passed
101,"Introduction to Rocket Science",James Taggart,50,120,4.9,Passed
...
Believe it or not, I cannot work out how to do this, even though it seems to be a very primitive transformation. Is there a silver bullet for it?
The original records (csv) have a somewhat json-like structure, so my first approach was to represent the original data as a vector and then transpose it (but in this case my resulting table looks like a sparse matrix: the rows I have transposed are blank in the rest of their values).
Another way I'm considering is to serialize it into jsons and then de-serialize it into a new spreadsheet (jsonize()); in this case, I'm having problems merging them properly.
In both ways I have it "half-working".
Can anyone suggest a simple and reliable algorithm for this?
Any language, RegEx, tools or code snippets are very much appreciated.
Assuming that the pattern you've described here is consistent throughout, there are quite a few different approaches you could take, I think, but in all cases you can basically use the fact that the 'Course' rows start with "Code:", which is never going to be a student name.
You can take advantage of this either by a regular expression find/replace, or within OpenRefine.
Example:
Open the file in a text editor that supports regular expressions in find/replace
Search for lines starting with 'Code:' and add additional commas to the start of the row to shift the course data columns to the right, e.g. search for: ^Code: replace with: ,,,,,Code:
If you now import the file into OpenRefine then you'll have a project with 10 columns (the 10th col is caused by the trailing comma at the end of the course data row)
You can now use Transpose (or just rename) on the right-most columns which contain the course data, while leaving the left-most columns which contain the student details
Isolate the rows that contain the phrase 'Student Name' in the first column and remove them (via a filter or facet)
Move the Course Code/Description columns to the beginning of the project, and use the 'Edit Cells->Fill Down' option on each column to get the values repeated on all the relevant lines
Finally rename the columns as you want, and remove any extraneous columns
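If a script is preferred over an editor plus OpenRefine, the same "remember the last course row and fill it down" idea can be sketched in a few lines of JavaScript. This is only an illustration under the assumption that the file looks exactly like the sample in the question: course rows start with "Code:", header rows start with "Student Name", and no student field contains a comma. The function name reshape and the output header names are made up.

```javascript
// Prefix every student row with the code and description of the most
// recent course row, and drop the per-course header rows.
function reshape(csvText) {
  // Captures the course code and the (possibly quoted) description.
  const courseLine =
    /^Code:,([^,]*),Course Description:,("(?:[^"]|"")*"|[^,]*)/;
  const out = [
    'Code,Course Description,Student Name,Lecture Hours,Labs Hours,Test Score,Status',
  ];
  let course = null;
  for (const line of csvText.split(/\r?\n/)) {
    if (line.trim() === '') continue;
    const m = courseLine.exec(line);
    if (m) {
      course = `${m[1]},${m[2]}`; // remember course data for the rows below
      continue;
    }
    if (line.startsWith('Student Name,')) continue; // repeated header row
    if (course === null) continue; // student row before any course row
    out.push(`${course},${line}`);
  }
  return out.join('\n');
}

const sample = [
  'Code:,101,Course Description:,"Introduction to Rocket Science",',
  'Student Name,Lecture Hours,Labs Hours,Test Score,Status',
  'John Galt,48,120,4.7,Passed',
  'James Taggart,50,120,4.9,Passed',
].join('\n');

console.log(reshape(sample));
```

The quoted description is copied through verbatim, so a comma inside it stays protected by its quotes.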

Return certain part of string mysql

I am fetching results from a database table which contains the text of multiple pages.
These pages have links in their content.
I am trying to get all the links from the pages in a table, but I am also getting the unwanted text.
For example, this could be the content of a certain part of a page:
line 1: This is the link for lalalaal <a href="page5.html"> click</a>
line 2 if you want to go to page lalalala2 click
Now I only want the area starting from the <a href and ending at </a> in the result record. If there is more than one anchor tag in the text, then each anchor tag should be treated as a separate record.
The returned result should look like:
ID value
1 ' click '
2 ' click '
I have tried the following queries :
Select * from [Database.tablename] where value between <a href and </a>;
Select * from [Database.tablename] locate '(<a href, Value)>0' and locate (</a>, value)>0;
but none of the 2 queries are giving me the wanted result...
This sort of text extraction is probably best addressed using regular expressions.
MySQL has some support (the REGEXP operator), but it is only useful for identifying which rows contain an <a></a> pair. Even knowing that there is at least one link inside a record doesn't help you extract the (possibly many) links and treat them as separate records.
To successfully extract those links, at least according to my knowledge, you need a tool better suited to work with regular expressions. Most languages (Perl, PHP, Python, Java, etc.) support them, some natively, some using available libraries. You can select only records containing links (using REGEXP), and extract every link via code.
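As an illustration of the "extract every link via code" step, here is a small JavaScript sketch. It assumes the rows have already been fetched into an array of objects with id and value fields; the rows themselves, the field names and the function extractLinks are all hypothetical. The regular expression is deliberately simple: it only recognises anchors that carry at least one attribute and does not cope with nested or malformed markup.

```javascript
// Turn each anchor found in each row into its own result record.
function extractLinks(rows) {
  // Non-greedy match from an opening <a ...> to the nearest </a>;
  // the s flag lets "." cross line breaks, i ignores tag case.
  const anchor = /<a\s[^>]*>.*?<\/a>/gis;
  const links = [];
  for (const row of rows) {
    for (const match of row.value.matchAll(anchor)) {
      links.push({ id: links.length + 1, sourceId: row.id, value: match[0] });
    }
  }
  return links;
}

// Hypothetical rows standing in for the query result.
const rows = [
  { id: 1, value: 'This is the link for lalalaal <a href="page5.html"> click</a>' },
  { id: 2, value: 'Go to <a href="page6.html">page 6</a> or <a href="page7.html">page 7</a>' },
  { id: 3, value: 'No links in this text' },
];

console.log(extractLinks(rows));
```

Each output record keeps the id of the row it came from, so the links can still be traced back to their page.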
Another way of handling this would be performing the query on MySQL, exporting the results to a text file, and working on its contents with shell scripting (for instance, using sed under UNIX/Linux).
If you need it to be implemented using only MySQL, then my best guess is trying with a stored procedure (to be able to work on the results record-by-record.) I still cannot think of an implementation of such SP that guarantees detecting and successfully extracting every possible link inside a record as one record per link.

Find column values that are a start string of given string.

I have a database table that contains URLs in a column. I want to show certain data depending on what page the user is on, defaulting to a 'parent' page if there is no direct match. How can I find the rows where the column value is the start of the submitted URL?
Eg. I have www.example.com/foo/bar/baz/here.html; I would expect to see (after sorting on length of column value):
www.example.com/foo/bar/baz/here.html
www.example.com/foo/bar/baz
www.example.com/foo/bar
www.example.com/foo
www.example.com
if all those URLs are in the table of course.
Is there a built in function or would I need to create a procedure? Googling kept getting me to LIKE and REGEXP, which is not what I need. I figured that a single query would be much more efficient than chopping the URL and making multiple queries (the URLs could potentially contain many path components).
Simply turn the LIKE operator around:
SELECT * FROM urls WHERE "www.example.com/foo/bar/baz/here.html" LIKE CONCAT(url, "%");
http://sqlfiddle.com/#!2/ef6ee/1
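For comparison, the same "stored value is a prefix of the submitted URL" test can be sketched in application code. This JavaScript version is only an illustration of the logic; the function name matchingPrefixes and the sample list are assumptions, and the query above remains the way to do it in the database.

```javascript
// Keep the stored URLs that the submitted URL starts with,
// longest (most specific) match first.
function matchingPrefixes(storedUrls, submittedUrl) {
  return storedUrls
    .filter((url) => submittedUrl.startsWith(url))
    .sort((a, b) => b.length - a.length);
}

// Hypothetical table contents.
const storedUrls = [
  'www.example.com',
  'www.example.com/other',
  'www.example.com/foo/bar',
  'www.example.com/foo/bar/baz/here.html',
];

console.log(
  matchingPrefixes(storedUrls, 'www.example.com/foo/bar/baz/here.html')
);
```

One difference worth knowing: startsWith compares literally, whereas LIKE treats any % or _ inside the stored url value as a wildcard.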

How to code Regular Expression with an IF ELSE function

I am trying to build a scraper to extract key metrics from a website. One of the metrics is to find the Model number of the products on the website. I am using Outwit as the base program but I'm now stuck when it comes to some exceptions in the sites source code.
Here is an example of the source code:
var zx_description = "Test Dress<br/><br/>Model: Nice01j<br/>
Where the information I am looking to extract is: Nice01j
The issue is that for some products the word Modell is spelled Model and also that the end of the actual model name/number does not always end with a row break but in some cases the code might look like this:
var zx_description = "Test Dress<br/><br/>Model: Nice01j";
I have managed to create the RegEx before the Modell number as below:
/var zx_description[\s\S]+?Modell:/
So now I'm looking to alter it so that it also takes into consideration that the spelling might be Model with just one "l".
The second part is to create a RegEx for capturing the info after the actual Model name, which should be something like:
IF: < br comes before "; then < br ELSE ";
Is this possible to state in a Regular Expression and if so how would I do that?
Based on your use of [\s\S] it looks to me like you need to run through a regular expression tutorial. For your question, specifically focus on optional items and capturing groups.
http://www.regular-expressions.info/tutorial.html
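To make the two hints concrete: an optional item handles the Model/Modell spelling, and an alternation after a lazy capturing group plays the role of the IF/ELSE. The sketch below uses JavaScript regex syntax; whether Outwit accepts exactly this syntax is an assumption, and the names modelPattern and extractModel are made up.

```javascript
// "Modell?" makes the second "l" optional.
// (.+?) lazily captures the model name and stops at whichever of the two
// terminators comes first: a "<br" tag or the closing '";'.
const modelPattern = /var zx_description[\s\S]+?Modell?:\s*(.+?)(?:<br|";)/;

function extractModel(source) {
  const match = modelPattern.exec(source);
  return match ? match[1] : null;
}

console.log(
  extractModel('var zx_description = "Test Dress<br/><br/>Model: Nice01j<br/>')
);
console.log(
  extractModel('var zx_description = "Test Dress<br/><br/>Modell: Nice01j";')
);
```

Because the capture is lazy, no explicit condition is needed: the regex engine tries both terminators at every position and stops at the earliest one.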