Regular expressions to match several characters Editpad - html

I have a WordPress export with older posts that are going to be imported into a different installation. The issue is that part of the content needs to be converted into a different type of content in the other installation.
The original code is like this:
000d3c4c]]></content:encoded>
And I need to make it into this
</content:encoded><wp:postmeta>
<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>
<wp:meta_value><![CDATA[000d2aa8.mp3]]></wp:meta_value>
</wp:postmeta>
I'm using EditPad Pro to search through the file. To avoid spending several hours making this change manually, I want to use the extensive search and replace feature in EditPad, but I have some issues trying to match this. I want an expression to be like this.
The first search and replace will work no matter what, because I can change
<a href="000
into
</content:encoded>
<wp:postmeta>
<wp:meta_key>
<![CDATA[mkd_post_audio_link_meta]]></wp:meta_key><wp:meta_value>><![CDATA[000d2aa8.mp3]]></wp:meta_value>
</wp:postmeta>
But I struggle with the next part.
How can I change everything after .mp3 to this
]]></wp:meta_value>
</wp:postmeta>
To make it brief: I want to replace all occurrences of
">000d3c4c</a>]]></content:encoded>
with
]]></wp:meta_value>
</wp:postmeta>
The 000d3c4c part is different for each occurrence of the mp3 link.

Match this
<a[^>]*>([^<]+)
and replace by this
</content:encoded>
<wp:postmeta>
<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>
<wp:meta_value><![CDATA[$1.mp3]]></wp:meta_value>
</wp:postmeta>
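
For comparison, the whole conversion can also be done in one pass outside the editor. The following is a minimal Python sketch, not the EditPad solution above. It assumes each link in the export looks like `<a href="NAME.mp3">TEXT</a>]]></content:encoded>` (that exact shape is an assumption, since the question only shows fragments), and it keeps the `]]>` that closes the content CDATA section. The function name is hypothetical.

```python
import re

# Assumed source shape: <a href="NAME.mp3">TEXT</a>]]></content:encoded>
# Group 1 captures the mp3 file name from the href attribute.
AUDIO_LINK = re.compile(
    r'<a href="([^"]+\.mp3)">[^<]*</a>\]\]></content:encoded>'
)

# \g<1> inserts the captured file name into the postmeta block.
POSTMETA = (
    "]]></content:encoded>\n"
    "<wp:postmeta>\n"
    "<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>\n"
    "<wp:meta_value><![CDATA[\\g<1>]]></wp:meta_value>\n"
    "</wp:postmeta>"
)


def convert_audio_links(xml_text: str) -> str:
    """Replace every trailing mp3 link with a wp:postmeta block."""
    return AUDIO_LINK.sub(POSTMETA, xml_text)


if __name__ == "__main__":
    sample = 'Some post <a href="000d2aa8.mp3">000d3c4c</a>]]></content:encoded>'
    print(convert_audio_links(sample))
```

Because the file name is taken from the `href` rather than from the link text, the differing link text (000d3c4c in the example) does not need to be matched literally.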

Related

Finding a specific link from a site

I'm trying to find a specific link in a web page using the Windows command line and tools. I think Xidel can do what I want to do.
In the page, the link is used like this:
file: 'http://link.link/index.txt'
Note: there's only one line like this. Now if I can set something like
file: '{%link}'
then I'll be able to extract the link. Also if I want to change the word index.txt to something like root.txt and then use aria2 to download the link as http://link.link/root.txt , what do I need to do?
(I don't have any experience with any of these tools or command-line scripts. I just wanted to make something that does this, and only this; some alternatives are already available, but I want to do it myself. I searched and have an idea of how to do it, but extracting the exact URL seems to be the hardest part, since I couldn't find anything that might help me in Xidel's docs.)
Xidel is meant to extract data from HTML/XML/JSON files, but it can also extract from CSV and TXT files if you know how to use the $raw variable and xidel/xquery functions, like extract(), tokenize() and replace().
Post the URL or the source (or part thereof) of the webpage and I'll see how I can help you.
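
To illustrate the extract-then-replace idea independently of Xidel, here is a minimal Python sketch. It assumes the page contains exactly one line of the form `file: '...'`, as described in the question; the function names are hypothetical, and the download step with aria2 is left out.

```python
import re


def extract_file_link(page_source: str):
    """Return the URL inside file: '...' or None if the line is absent."""
    match = re.search(r"file:\s*'([^']+)'", page_source)
    return match.group(1) if match else None


def swap_filename(url: str, new_name: str) -> str:
    """Replace the last path segment of the URL with new_name."""
    base, _, _ = url.rpartition("/")
    return f"{base}/{new_name}"


if __name__ == "__main__":
    page = "var player = {\n  file: 'http://link.link/index.txt'\n};"
    link = extract_file_link(page)
    print(link)
    print(swap_filename(link, "root.txt"))
```

The resulting URL could then be passed to a downloader on the command line.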

Regex match and delete everything before string (opening html tag)

I'm using Dreamweaver and Notepad++ and have searched high and low but nothing seems to work from what I've found.
I've got a whole stack of HTML pages and I need to remove from all of them everything above, but not including, the first tag in the document. Specifically, everything before the string "<h1" (no quotes). I've tried various examples in Notepad++ and it finds the first h1 tag but doesn't replace everything before it.
Assuming you want to lose everything in your file before the "<h1" text, specify ".*<[hH]1" as the search pattern and "<h1" as the replacement, and check the box marked ". matches newline". Works for me.
You can do this from the Command Line or a text editor that allows you to search-replace multiple files. However, are you sure the content is the same in every html file?
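
One caveat with the ".*<[hH]1" pattern: `.*` is greedy, so with ". matches newline" enabled it runs to the last "<h1" in the file, not the first. If a page can contain more than one h1, a lazy match anchored at the start of the text is safer. A minimal Python sketch of that variant, assuming the files can be processed outside the editor (the function name is hypothetical):

```python
import re

# \A anchors at the very start of the text; .*? is lazy, so it stops
# at the first "<h1" (any case). The lookahead keeps the tag itself.
BEFORE_FIRST_H1 = re.compile(r"\A.*?(?=<h1)", re.DOTALL | re.IGNORECASE)


def strip_before_first_h1(html: str) -> str:
    """Drop everything before the first <h1; leave the text alone if none."""
    return BEFORE_FIRST_H1.sub("", html, count=1)


if __name__ == "__main__":
    page = "<html>\n<body>\n<H1>First</H1>\n<p>text</p>\n<h1>Second</h1>"
    print(strip_before_first_h1(page))
```

A file without any h1 tag comes back unchanged, which matters when the pages are not all structured the same way.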

Regex: img src urls that don't have multiple paths

Going through another crazy website migration!
I have HTML img src urls that look like this
http://blog.example.com/imagename.jpg
Image formats can also be jpg, png, or gif
We need a regex that finds every url that has the domain then "/imagename.jpg" immediately after.
Very new to regex, what would the expression be?
Better Alternative for WordPress Migrations
If you are moving your website and want to replace all references to the old site with the new domain, I suggest you use David Coveney's Serialized Search & Replace DB v2.1.0. Run it on a new copy of the database, and always have a backup handy. Import the database on the destination server, then run the tool. You don't even have to upload the server files.
When I do this coming from a development server to live domain, I usually do two search & replaces:
One for URLs, very basic:
Search: mywebsite.devserver.com
Replace: my-new-website.com
And one for file paths:
Search: /vhosts/devserver.com/mywebsite
Replace: /vhosts/my-new-website.com/httpdocs
(Note: This is assuming the majority of the file path is the same for both servers. Your search & replace paths may need to be more accurate)
The reason you want serialized search and replace is that some data is stored in PHP-serialized format, and if you change the value with a text editor or in MySQL directly, it may not be able to unserialize afterwards.
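
To see why a plain text replace can break such data: PHP stores a serialized string as `s:<byte length>:"<value>";`, so changing the value without updating the length leaves an inconsistent record. The Python sketch below only imitates that string format for illustration; `php_serialize_str` is a hypothetical helper, not part of the tool mentioned above.

```python
def php_serialize_str(value: str) -> str:
    """Imitate PHP's serialized form of a single string value."""
    byte_length = len(value.encode("utf-8"))
    return f's:{byte_length}:"{value}";'


def declared_length(serialized: str) -> int:
    """Read the length prefix out of s:<len>:"...";"""
    return int(serialized.split(":", 2)[1])


if __name__ == "__main__":
    stored = php_serialize_str("http://mywebsite.devserver.com")

    # Naive text replace: the value shrinks but the length prefix stays.
    naive = stored.replace("mywebsite.devserver.com", "my-new-website.com")

    # Serialization-aware replace: rebuild the record from the new value.
    correct = php_serialize_str("http://my-new-website.com")

    print(naive)
    print(correct)
    print(declared_length(naive) == declared_length(correct))
```

The naive result still declares the old length, which is the mismatch that makes unserializing fail.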
Regex Answer
Select images hosted by blog.example.com with the following regex pattern:
((http|https)://blog\.example\.com/[^ \r\n]+\.(jpg|jpeg|png|gif))
Which basically searches for this: http(s)://blog.example.com/*.(jpg/png/etc)
Matches the URLs in the following examples:
http://blog.example.com/imagename.jpg
http://blog.example.com/favicon.png
http://blog.example.com/uploads/2013/05/kitten.gif
https://blog.example.com/ssl-secure.png
This is my favorite gif https://blog.example.com/some-hilarious-image.gif hahaha
DOES NOT match any of these:
blog.example.com/google.png
https://blog.google.com/google.png
our website is http://blog.example.com and has an image named /imagename.png
http://blog.example.com/
WHY it doesn't match those (by line):
Does not include http(s)://
Hosted by google
Paragraph text, where the URL is split into two parts
Not an image
$1 returns the full URL of the image.
I tested this on RegexTester.com. You can copy the pattern in the top field, and all of the examples in the box below. The red highlights are matches.
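
The same check can be scripted. The Python sketch below tests the pattern against some of the examples listed. It also includes a stricter variant, which is an addition and not part of the answer: since the question asks for images sitting immediately after the domain, `ROOT_ONLY` rejects URLs with extra path segments. The constant and function names are hypothetical.

```python
import re

# The answer's pattern, with non-capturing groups so findall returns
# whole URLs.
ANY_PATH = re.compile(
    r"https?://blog\.example\.com/[^ \r\n]+\.(?:jpg|jpeg|png|gif)"
)

# Stricter variant (an assumption about what the question wants):
# no further "/" is allowed between the domain and the file name.
ROOT_ONLY = re.compile(
    r"https?://blog\.example\.com/[^/\s]+\.(?:jpe?g|png|gif)"
)


def find_images(text: str, root_only: bool = False) -> list:
    """Return all image URLs on blog.example.com found in text."""
    pattern = ROOT_ONLY if root_only else ANY_PATH
    return pattern.findall(text)


if __name__ == "__main__":
    samples = [
        "http://blog.example.com/imagename.jpg",
        "http://blog.example.com/uploads/2013/05/kitten.gif",
        "https://blog.google.com/google.png",
        "http://blog.example.com/",
    ]
    for line in samples:
        print(line, find_images(line), find_images(line, root_only=True))
```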
Many good suggestions already. Why a WordPress site would hardcode the domain name into links is another question, but that's not our problem right now. If you need a regex, then try this:
(?<=<img).+(?<=src=["'])(.+(?:jpe?g|gif|png))
EXPLAINED:
(?<=<img).+(?<=src=["']) - be sure we're inside an <img> tag up to src attribute
(.+(?:jpe?g|gif|png)) capture everything up to required extension
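
Because both `.+` parts in that pattern are greedy, the capture can run past the URL when a later attribute on the same line also contains "jpg", "gif" or "png". A tighter variation, which is a sketch and not the pattern from the answer, stops at the closing quote of the src attribute. The names below are hypothetical.

```python
import re

# Variation on the idea above: stay inside one <img ...> tag and
# capture only the quoted src value ending in an image extension.
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\bsrc=["']([^"']+\.(?:jpe?g|gif|png))["']""",
    re.IGNORECASE,
)


def image_sources(html: str) -> list:
    """Return the src value of every <img> tag that points at an image."""
    return IMG_SRC.findall(html)


if __name__ == "__main__":
    html = (
        '<p><img alt="a png icon" src="http://blog.example.com/imagename.jpg">'
        '<img src="http://blog.example.com/favicon.png" class="x"></p>'
    )
    print(image_sources(html))
```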

Ordering the items in the command palette overlay in Sublime Text 2

With help from fraxel on this question: "Assigning multiple snippets to a single key binding", I was able to create a handy popup menu to organize all my snippets the way TextMate does.
Is it possible to tell ST2 what order I want them to appear in within the command palette overlay? Right now the order is seemingly random and I'd love to be able to set some order for them.
I can't tell how ST2 orders them.
More:
All of my snippets have file names that indicate their language:
e.g.
PHP_mySQLLoop.sublime-snippet
PHP_mySQLi.sublime-snippet
HTML_basePage.sublime-snippet
jQuery_ajax.sublime-snippet
The <description> I use is also standard:
JMR PHP, mySQL Loop
JMR PHP, mySQLi
JMR HTML, Base page
JMR jQuery, $ajax
Changing PHP_mySQLi.sublime-snippet to PHP_1_mySQLi.sublime-snippet or 1_PHP_mySQLi.sublime-snippet had no effect, nor did changing its description.
I do not have the <scope> set for any snippet at this point...maybe I should...?
(This is on OS X, not sure if that has anything to do with this...)

Pulling out some text from a giant HTML file using Nokogiri/xpath

I am scraping a website and am trying to pull out certain elements from the HTML. In the sites I am scraping, there are script tags with a bunch of info in them; however, there is only one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
With some stuff above and below it. Now, this is different for each page source except for obviously the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line, and cutting out just the URL? I would need to use regular expressions I feel as the URLs are dynamic.
The "gsub" method does something similar to what I want to search for, with its ability to use /regex/. But, I am not wanting to replace anything, I just want to find that URL in the source code using a /regex/ and copy it.
According to your comments, I guess this is what you're looking for:
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/
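
Note that `/http.+/` is greedy to the end of the line, so on the sample line it would also pick up the trailing `',`. A capture group bounded by the quotes avoids that. The sketch below uses Python rather than Ruby, purely to illustrate the pattern; it assumes the key is always written as `'image'` with single quotes, as in the question, and the function name is hypothetical.

```python
import re

# Capture whatever sits between the quotes after 'image':
IMAGE_LINE = re.compile(r"'image'\s*:\s*'([^']+)'")


def find_image_url(page_source: str):
    """Return the URL from the 'image':'...' line, or None if absent."""
    match = IMAGE_LINE.search(page_source)
    return match.group(1) if match else None


if __name__ == "__main__":
    source = (
        "<script>\n"
        "'title':'something',\n"
        "'image':'http://ut5.example.com/t/231/3_b_643435.jpg',\n"
        "'other':'stuff'\n"
        "</script>"
    )
    print(find_image_url(source))
```

The same pattern should carry over to a Ruby regex, where matching (rather than gsub) returns the captured URL without replacing anything.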