Return certain part of string mysql - mysql

I am fetching results from a database table which contains the text of multiple pages.
These pages have links in their content.
I am trying to get all the links from the pages in a table, but I am also getting the unwanted text.
For example, this could be the content of a certain part of a page:
line 1: This is the link for lalalaal </a href="page5.html"> click</a>
line 2 if you want to go to page lalalala2 click
Now I only want the area starting from the <a href and ending at </a> in the result record. if there are more than 1 anchor tags in the text, then each anchor tag should be treated as a record.
the returned result should be like
ID value
1 ' click '
2 ' click '
I have tried the following queries :
Select * from [Database.tablename] where value between <a href and </a>;
Select * from [Database.tablename] locate '(<a href, Value)>0' and locate (</a>, value)>0;
but none of the 2 queries are giving me the wanted result...

This sort of text extraction is probably best addressed using regular expressions.
MySQL has some support (see here), but it could only be useful to identify which rows do have an <a></a> pair. Even identifying that there is at least one link inside a record doesn't help you extracting the (possibly many) links and treating them as different records themselves.
To successfully extract those links, at least according to my knowledge, you need a tool better suited to work with regular expressions. Most languages (Perl, PHP, Python, Java, etc.) support them, some natively, some using available libraries. You can select only records containing links (using REGEXP), and extract every link via code.
Another way of handling this would be performing the query on MySQL, exporting the results to a text file, and working on its contents with shell scripting (for instance, using sed under UNIX/Linux).
If you need it to be implemented using only MySQL, then my best guess is trying with a stored procedure (to be able to work on the results record-by-record.) I still cannot think of an implementation of such SP that guarantees detecting and successfully extracting every possible link inside a record as one record per link.

Related

Upload Text File Containing Long List of Tags to WordPress

I have a text file with about 6700 tags that I would like to add to my Word Press site. Of course, it is not efficient to do this manually. Is it possible to automate the insertion of these tags?
I tried a few plugins like Smart Tag Insert, but these are ineffective and have low review scores. Additionally, I see that tags in my MyPHPAdmin panel are stored in the table wp_terms. I wanted to write an SQL script that does what I need. However, the table also stores a series of other values (like menu names). There is no way to identify rows in this table as tags and not something else (like the name of a menu). So, I am also confused about that too.
Thank you for your time and help!
You should loop through all the tags in the file and then use the wp_insert_term function to insert the tags into the database.
Documentation : https://developer.wordpress.org/reference/functions/wp_insert_term/
So the first argument is the tag term and the second one is post_tag.
Exemple:
wp_insert_term(
'tag_name', // the term
'post_tag', // the taxonomy
);

Regular expression to pick a row in an html table containing desired text

Sorry, but uhrm, I'd like to use regexp (actually I'd use something else but I want to do the task within a Matlab function) to pick a single row containing desired keywords within an html table.
I am using Matlab calling function regexpi (case-insensitive version of regexp), which is akin to PHP regex from what I can tell.
Ok, here's a snippet from such an html table to parse:
<tr><td>blu</td><td>value</td></tr><tr><td>findme</td><td>value</td></tr><tr><td>ble</td><td>value</td></tr>
The desired row to pick contains the word "findme".
(added:) Content of other cells and tags in the table could be anything (here "bla" is a dummy value)- the important part is the presence of "findme" and that a single line (not more) is caught (or all lines containing "findme" but such behaviour is not expected). Any paired name/value table in a wikipedia page is a good example.
I tinkered with https://regex101.com/ using whatever I could dig up at the Matlab documentation (forward/backward looking, combinations of :,> and ?), but have failed to identify a pattern that will pick just the right row (or all those that contain the keyword "findme"). The following pattern for instance will pick the text but not the entire row: <tr[^>]*>[^>]*.*?(findme).*?<\/td .
Pattern <tr[^>]*>(.*?findme.*?)<\/tr[^>]*> picks the row but is too greedy and picks preceding rows.
Note that the original task I had set out was to capture entire tables and then parse these, but the Matlab regexp-powered function I found for the task had trouble with nested tables (or I had trouble implementing it for the task).
The question is how to return a row containing desired keywords from an html table, programmatically, within a matlab function (without calling an external program)? Bonus question is how to solve the nested table issue, but maybe that's another question.
I suggest you split up the string with strsplit and use contains for the filtering, which is a lot more readable and maintainable than a regex pattern:
htmlString = ['<tr><td>blu</td><td>value</td></tr><tr><td><a',...
'href="bla">findme</a></td><td>value</td></tr><tr><td><a',...
'href="ble">ble</a></td><td>value</td></tr>'];
keyword = 'findme';
splitStrings = strsplit(htmlString,'<tr>');
desiredRow = ['<tr>' splitStrings{contains(splitStrings,keyword)}]
The output is:
<tr><td>findme</td><td>value</td></tr>
Alternatively you may also combine extractBetween and contains:
allRows = extractBetween(htmlString,'<tr>','</tr>');
desiredRow = ['<tr>' allRows{contains(allRows,keyword)} '</tr>']
If you must use regex:
regexp(htmlString,['<tr><td>[^>]+>' keyword '.*?<\/tr>'],'match')
Try this
%<td>(.*?)%sg
https://regex101.com/r/0Xq0mO/1

Best way to parse a big and intricated Json file with OpenRefine (or R)

I know how to parse json cells in Open refine, but this one is too tricky for me.
I've used an API to extract the calendar of 4730 AirBNB's rooms, identified by their IDs.
Here is an example of one Json file : https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day of the year from now until november 2017, i would like to extract the availability of this rooms (true or false) and its price at this day.
I can't figure out how to parse out these informations. I guess that it implies a series of nested forEach, but i can't find the right way to do this with Open Refine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionnaries that disrupts me.
Any help would be appreciate. If the operation is too difficult in Open Refine, a solution with R (or Python) would also be fine for me.
Rather than just creating your Project as text, and working with GREL to parse out...
The best way is just select the JSON record part that you want to work with using our visual importer wizard for JSON files and XML files (you can even use a URL pointing to a JSON file as in your example). (A video tutorial shows how here: https://www.youtube.com/watch?v=vUxdB-nl0Bw )
Select the JSON part that contains your records that you want to parse and work with (this can be any repeating part, just select one of them and OpenRefine will extract all the rest)
Limit the amount of data rows that you want to load in during creation, or leave default of all rows.
Click Create Project and now your in Rows mode. However if you think that Records mode might be better suited for context, just import the project again as JSON and then select the next outside area of the content, perhaps a larger array that contains a key field, etc. In the example, the key field would probably be the Date, and why I highlight the whole record for a given date. This way OpenRefine will have Keys for each record and Records mode lets you work with them better than Row mode.
Feel free to take this example and make it better and even more helpful for all , add it to our Wiki section on How to Use
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OR array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of days in the month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OR can't store OR arrays directly in a cell - it has to be a string.
Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OR
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days
Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from example URL above - this gives me 441 rows for the single ID - each contains the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty easy JSON in each cell - so you can extract availability using
value.parseJson().available
etc.

Find column values that are a start string of given string.

I have a database table that contains URLs in a column. I want to show certain data depending on what page the user is on, defaulting to a 'parent' page if not a direct match. How can I find the columns where the value is part of the submitted URL?
Eg. I have www.example.com/foo/bar/baz/here.html; I would expect to see (after sorting on length of column value):
www.example.com/foo/bar/baz/here.html
www.example.com/foo/bar/baz
www.example.com/foo/bar
www.example.com/foo
www.example.com
if all those URLs are in the table of course.
Is there a built in function or would I need to create a procedure? Googling kept getting me to LIKE and REGEXP, which is not what I need. I figured that a single query would be much more efficient than chopping the URL and making multiple queries (the URLs could potentially contain many path components).
Simple turn around the "Like" operator:
SELECT * FROM urls WHERE "www.example.com/foo/bar/baz/here.html" LIKE CONCAT(url, "%");
http://sqlfiddle.com/#!2/ef6ee/1

Mediawiki blank all pages per namespace. I want to blank all User_talk pages

I want to know if there is a way to blank all user_talk pages enmass. Not delete them, just blank them. I don't know how to write bots, so I'm really asking if there is an extension or pre written bot for this. Thank you
You could write a simple SQL to do this, just look into the page table, for my installation the namespace value for User talk: is 3, so I could just delete all pages with namespace=3.
Deleting the row from the database, will leave the page as blank (not created)
I suggest using AWB. You can easy have it build a list based on a names space and then use a simple ReGeX replace such as: Search: (.*)* Replace with: (empty space).