Regex: img src urls that don't have multiple paths - html

Going through another crazy website migration!
I have HTML img src URLs that look like this:
http://blog.example.com/imagename.jpg
The image format can be jpg, png, or gif.
We need a regex that finds every URL that has the domain followed immediately by "/imagename.jpg".
I'm very new to regex. What would the expression be?

Better Alternative for WordPress Migrations
If you are moving your website and want to replace all references to the old site with the new domain, I suggest you use David Coveney's Serialized Search & Replace DB v2.1.0. Run it on a new copy of the database, and always have a backup handy. Import the database on the destination server, then run the tool. You don't even have to upload the server files.
When I do this coming from a development server to live domain, I usually do two search & replaces:
One for URLs, very basic:
Search: mywebsite.devserver.com
Replace: my-new-website.com
And one for file paths:
Search: /vhosts/devserver.com/mywebsite
Replace: /vhosts/my-new-website.com/httpdocs
(Note: This is assuming the majority of the file path is the same for both servers. Your search & replace paths may need to be more accurate)
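The reason you want serialized search and replace is that some data is stored in PHP-serialized format, where every string carries its byte length (for example s:5:"hello";). If you change the value with a text editor or in MySQL directly and the new text has a different length, the stored length no longer matches and PHP may not be able to unserialize the data afterwards.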
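To make the length problem concrete, here is a minimal Python sketch (Python rather than PHP, purely for illustration). The helper names php_serialize_str and declared_length_ok are made up for this example, and it only handles a single flat string, not nested arrays or objects:

```python
import re

def php_serialize_str(value: str) -> str:
    """Serialize one string the way PHP does: s:<byte length>:"<value>";"""
    return 's:%d:"%s";' % (len(value.encode("utf-8")), value)

def declared_length_ok(serialized: str) -> bool:
    """Check that the declared byte length matches the actual payload."""
    match = re.fullmatch(r's:(\d+):"(.*)";', serialized, re.DOTALL)
    if match is None:
        return False
    return int(match.group(1)) == len(match.group(2).encode("utf-8"))

original = php_serialize_str("http://mywebsite.devserver.com/logo.png")

# A plain text replace changes the payload but leaves the old length in place.
naive = original.replace("mywebsite.devserver.com", "my-new-website.com")

print(original, declared_length_ok(original))
print(naive, declared_length_ok(naive))
```

The second line printed keeps the old length prefix even though the URL got shorter, which is exactly the mismatch a serialization-aware tool repairs.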
Regex Answer
Select images hosted by blog.example.com with the following regex pattern:
((http|https)://blog\.example\.com/[^ \r\n]+\.(jpg|jpeg|png|gif))
Which basically searches for this: http(s)://blog.example.com/*.(jpg/png/etc)
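Matches the URLs in the following examples (except the first one, http://example.com/imagename.jpg, which is not matched because it lacks the blog. subdomain):
http://example.com/imagename.jpg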
http://blog.example.com/imagename.jpg
http://blog.example.com/favicon.png
http://blog.example.com/uploads/2013/05/kitten.gif
https://blog.example.com/ssl-secure.png
This is my favorite gif https://blog.example.com/some-hilarious-image.gif hahaha
DOES NOT match any of these:
blog.example.com/google.png
https://blog.google.com/google.png
our website is http://blog.example.com and has an image named /imagename.png
http://blog.example.com/
WHY it doesn't match those (by line):
Does not include http(s)://
Hosted by google
Paragraph text, where the URL is split into two parts
Not an image
$1 returns the full URL of the image.
I tested this on RegexTester.com. You can copy the pattern in the top field, and all of the examples in the box below. The red highlights are matches.
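Note that the pattern above also matches images in subfolders (the uploads/2013/05/kitten.gif example). If only images sitting directly under the domain are wanted, as the question asks, one option is to forbid any further slash after the host. A minimal Python sketch, assuming the HTML is available as a string; the name ROOT_IMAGE and the sample markup are made up for illustration:

```python
import re

# Exactly one path segment after the host: [^/\s"'<>]+ cannot cross a "/".
ROOT_IMAGE = re.compile(
    r"""https?://blog\.example\.com/[^/\s"'<>]+\.(?:jpe?g|png|gif)\b""",
    re.IGNORECASE,
)

sample = '''
<img src="http://blog.example.com/imagename.jpg">
<img src="https://blog.example.com/favicon.png">
<img src="http://blog.example.com/uploads/2013/05/kitten.gif">
<img src="https://blog.google.com/google.png">
'''

# No capturing groups are used, so findall returns the full matched URLs.
root_images = ROOT_IMAGE.findall(sample)
print(root_images)
```

Only the first two URLs are returned; the subfolder image and the image on the other host are skipped.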

There are many good suggestions already, and I wonder why a WordPress site would hardcode the domain name into links, but that's not our problem right now. If you need a regex, try this:
(?<=<img).+(?<=src=["'])(.+(?:jpe?g|gif|png))
EXPLAINED:
(?<=<img).+(?<=src=["']) - be sure we're inside an <img> tag up to src attribute
(.+(?:jpe?g|gif|png)) capture everything up to required extension
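One caveat: both .+ parts are greedy, so when several <img> tags share a line the match can run past the first image. If a script is an option, letting an HTML parser find the src attributes avoids that. The following is only a sketch in Python using the standard library; ImgSrcCollector and root_level_images are hypothetical names, and the host default is taken from the question:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def root_level_images(html, host="blog.example.com"):
    """Return img URLs on `host` whose file sits directly under the domain."""
    parser = ImgSrcCollector()
    parser.feed(html)
    result = []
    for src in parser.sources:
        parts = urlsplit(src)
        # A path with a single "/" has no subfolders, e.g. "/imagename.jpg".
        if (parts.hostname == host
                and parts.path.count("/") == 1
                and parts.path.lower().endswith(IMAGE_EXTENSIONS)):
            result.append(src)
    return result

html = ('<p><img class="a" src="http://blog.example.com/imagename.jpg">'
        '<img src="http://blog.example.com/uploads/2013/05/kitten.gif"></p>')
found = root_level_images(html)
print(found)
```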

Seeking Guidance

I'm new with this and not certain what platform to use to achieve my desired outcome (i.e. php, javascript, etc.), but I'm a fast learner.
I add videos to my YouTube channel daily. After this I update two separate webpages where I manually embed the newest video URL.
Question:
I would like to automate this work process. What is the best approach (i.e. CSS, Javascript, PHP, etc.) that I can use to "get" the most current YouTube video URL and embed it into my webpage(s) automatically?
I hope I explained this properly. Let me know if you need any additional information. Thanks in advance for any guidance you can offer!!!
(1) Get link of latest video on your Channel:
You can request a channel's feed from YouTube using
https://www.youtube.com/feeds/videos.xml?channel_id=XXXXX
where XXXXX is the channel's ID (as shown in the browser's address bar).
The first entry in the XML document is the latest video.
Use the JavaScript Fetch API to load the XML, or else have a JS function call a PHP script that reads this XML document and passes it back.
Once it has loaded, you'll have a String (text) copy of the feed in a variable. The idea is to do by code what you currently do by hand in a text editor: extract the newest URL, find and replace the old URL in your page, and save the edited text as a new HTML file (overwriting the old one using PHP).
With JavaScript, either use its String functions to extract the URL or follow a tutorial about parsing XML to extract the data.
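As a rough illustration of the parsing step, here is a sketch in Python (not the JavaScript/PHP the answer suggests). It assumes the feed is an Atom document in which each entry has a link element with rel="alternate"; the sample feed and video IDs are invented, and latest_video_url is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Namespace of Atom feeds; needed because the feed's elements are namespaced.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def latest_video_url(feed_xml: str):
    """Return the link of the first <entry> in an Atom feed, or None."""
    root = ET.fromstring(feed_xml)
    entry = root.find("atom:entry", ATOM)  # first entry = newest video
    if entry is None:
        return None
    link = entry.find("atom:link[@rel='alternate']", ATOM)
    return link.get("href") if link is not None else None

# Simplified stand-in for the real feed; the video IDs are invented.
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example channel</title>
  <entry>
    <title>Newest video</title>
    <link rel="alternate" href="https://www.youtube.com/watch?v=NEWEST00001"/>
  </entry>
  <entry>
    <title>Older video</title>
    <link rel="alternate" href="https://www.youtube.com/watch?v=OLDER000002"/>
  </entry>
</feed>"""

latest = latest_video_url(sample_feed)
print(latest)
```

In real use the string passed to latest_video_url would be the body downloaded from the feed URL above.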
(2) To update the webpages: (use PHP)
Option 1 is to load the old page and use PHP String functions to replace the old link with the latest one. Then write the edited text back to the file (overwriting the older HTML file).
Option 2 is to have a "template" document already stored as a String in your code. Then simply replace (or add, if needed) the URL of the new video, and have PHP save the String as an HTML file, overwriting the old .html.
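Option 2 could look roughly like this. The sketch is in Python instead of PHP, the page markup is a made-up minimal template, and the function names are hypothetical; it assumes the video link has the usual watch?v=... form:

```python
from pathlib import Path
from string import Template
from urllib.parse import urlsplit, parse_qs

# Hypothetical page template; $video_id is the only placeholder.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
  <body>
    <iframe width="560" height="315"
            src="https://www.youtube.com/embed/$video_id"
            allowfullscreen></iframe>
  </body>
</html>
""")

def video_id_from_watch_url(url: str) -> str:
    """Pull the v= parameter out of a watch URL."""
    return parse_qs(urlsplit(url).query)["v"][0]

def render_page(watch_url: str) -> str:
    """Fill the template with the ID of the given video."""
    return PAGE_TEMPLATE.substitute(video_id=video_id_from_watch_url(watch_url))

def save_page(path: str, html: str) -> None:
    """Overwrite the old HTML file with the freshly rendered page."""
    Path(path).write_text(html, encoding="utf-8")

page_html = render_page("https://www.youtube.com/watch?v=NEWEST00001")
print(page_html)
```

Calling save_page("latest-video.html", page_html) (the file name is only an example) would then replace the old page on disk.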
Use this service, I think it will be the easiest way: https://latestyoutu.be/
You can find your channel ID under Settings, Advanced, Account Information (https://support.google.com/youtube/answer/3250431); paste it into the site. This is probably the most hassle-free way of doing what you seem to want.

What is the difference between these URL syntax?

I was sent a hyperlink to a Tableau Public link by a client. When I tried opening it, I got a 404 exception. I wrote back to the client but was told by the same that the link was working fine. I visited his profile page and was able to open the presentation there, but the URL that ended up working was slightly different than the one behind the original, non-functioning link.
Here's the anonymized URL behind the original link
https://public.tableau.com/profile/[client_name]%23!/vizhome/Project-AirportDelay/FlightPerformancesinUSA?publish=yes
And here's the URL via the profile page:
https://public.tableau.com/profile/[client_name]#!/vizhome/Project-AirportDelay/FlightPerformancesinUSA
The only differences I see are ?publish=yes and %23!. I tried appending the former, ?publish=yes, to the working URL, and it was still functional. So I suspect that it has to do with the other difference, %23! vs. #!. Could the first work for him because he is opening it from his computer, where he is likely logged in to Tableau Public? What's the difference between these syntaxes? Any ideas about why the original hyperlink might not be functional?
For obvious privacy reasons, I can't provide the whole URL.
?publish=yes looks like the basic URL pattern for passing filters (an ordinary query parameter), and %23 is the URL-encoded representation of #.
The first # after the authority component starts the fragment component. If the # should be part of the path component or the query component, it has to be percent-encoded as %23.
As # is a reserved character, these URIs aren’t equivalent:
http://example.com/foo#bar
http://example.com/foo%23bar
There are countless ways in which a URI reference can become erroneous. The culprit is often a piece of software, like a word processor, where someone pastes the correct URI and the software incorrectly percent-encodes it (maybe assuming that the user didn't paste the real/correct URI).
Copy-pasting the URI from the browser address bar into a plain text document should always work correctly.
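The difference is easy to see by splitting both example URIs into their components. A small sketch using Python's standard urllib.parse (the variable names are arbitrary):

```python
from urllib.parse import urlsplit, unquote

working = urlsplit("http://example.com/foo#bar")
broken = urlsplit("http://example.com/foo%23bar")

# A literal "#" ends the path and starts the fragment.
print(repr(working.path), repr(working.fragment))

# "%23" stays inside the path, so there is no fragment at all.
print(repr(broken.path), repr(broken.fragment))

# Decoding "%23" gives "#", but by then it is path data, not a delimiter.
print(unquote("%23"))
```

For the Tableau link this means the server is asked for a profile path that literally contains #!, which would explain the 404.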

Regular expressions to match several characters Editpad

I have a wordpress export with older posts which are going to be imported into a different installation. But I have an issue where there is a part of the content that needs to be converted into a different type of content in the other installation.
The original code is like this:
000d3c4c]]></content:encoded>
And I need to make it into this
</content:encoded><wp:postmeta>
<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>
<wp:meta_value><![CDATA[000d2aa8.mp3]]></wp:meta_value>
</wp:postmeta>
I'm using EditPad Pro to search through the file, and to avoid spending several hours making this change manually, I want to use the extensive search and replace feature in EditPad, but I have some issues trying to match this. I want an expression to be like this.
The first search and replace will work no matter what, because I can change
<a href="000
into
</content:encoded>
<wp:postmeta>
<wp:meta_key>
<![CDATA[mkd_post_audio_link_meta]]></wp:meta_key><wp:meta_value><![CDATA[000d2aa8.mp3]]></wp:meta_value>
</wp:postmeta>
But I struggle with the next part.
How can I change everything after .mp3 to this
]]></wp:meta_value>
</wp:postmeta>
To make it brief: I want to replace all occurrences of
">000d3c4c</a>]]></content:encoded>
with
]]></wp:meta_value>
</wp:postmeta>
The 000d3c4c part is different for each occurrence of the mp3 link.
Match this
<a[^>]*>([^<]+)
and replace it with this
</content:encoded>
<wp:postmeta>
<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>
<wp:meta_value><![CDATA[$1.mp3]]></wp:meta_value>
</wp:postmeta>
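The same capture-and-reuse idea can be tried outside EditPad. The Python sketch below is a variation, not the exact search above: it assumes the export contains links of the form <a href="000d3c4c.mp3">000d3c4c</a> directly before ]]></content:encoded>, and it also consumes the closing </a> and the CDATA end so that nothing is left dangling. PATTERN, REPLACEMENT and the sample line are made up for illustration:

```python
import re

# Group 1 captures the link text (e.g. 000d3c4c) that differs per post.
PATTERN = re.compile(r'<a[^>]*>([^<]+)</a>\]\]></content:encoded>')

# \1 re-inserts the captured text, like $1 in the EditPad replacement.
REPLACEMENT = (
    "]]></content:encoded>\n"
    "<wp:postmeta>\n"
    "<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>\n"
    r"<wp:meta_value><![CDATA[\1.mp3]]></wp:meta_value>" "\n"
    "</wp:postmeta>"
)

sample = ('<content:encoded><![CDATA[Some text '
          '<a href="000d3c4c.mp3">000d3c4c</a>]]></content:encoded>')

converted = PATTERN.sub(REPLACEMENT, sample)
print(converted)
```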

Finding a specific link from a site

I'm trying to find a specific link from a web page using windows command line and tools. I think Xidel can do what I want to do.
In the page, the link is used like this:
file: 'http://link.link/index.txt'
Note: there's only one line like this. Now if I can set something like
file: '{%link}'
then I'll be able to extract the link. Also if I want to change the word index.txt to something like root.txt and then use aria2 to download the link as http://link.link/root.txt , what do I need to do?
(I don't have any experience with any of these tools or command-line scripts. I just want to make something that does this and only this; some alternatives are already available, but I want to do it myself. I did search for it and have an idea of how to do it, but extracting the exact URL seems to be the hardest part, since I couldn't find anything that might help me in Xidel's docs.)
Xidel is meant to extract data from HTML/XML/JSON files, but it can also extract from CSV and TXT files if you know how to use the $raw variable and Xidel/XQuery functions like extract(), tokenize() and replace().
Post the URL or the source (or part thereof) of the webpage and I'll see how I can help you.
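Until the page source is known, the general idea (extract the quoted link, then swap the file name) can be sketched as follows. This is plain Python rather than Xidel, the surrounding page text is invented, and extract_file_url and swap_filename are hypothetical helpers:

```python
import re

def extract_file_url(page_source: str):
    """Return the URL inside file: '...' or None if the line is missing."""
    match = re.search(r"file:\s*'([^']+)'", page_source)
    return match.group(1) if match else None

def swap_filename(url: str, new_name: str) -> str:
    """Replace the last path segment of the URL with new_name."""
    base, _, _old_name = url.rpartition("/")
    return f"{base}/{new_name}"

# Invented page fragment containing the one line described in the question.
page = "player.setup({\n    file: 'http://link.link/index.txt'\n});"

url = extract_file_url(page)
download_url = swap_filename(url, "root.txt")
print(download_url)
```

The printed URL could then be handed to aria2c as its download argument.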

Pulling out some text from a giant HTML file using Nokogiri/xpath

I am scraping a website and am trying to pull out certain elements from the HTML. The pages I am scraping contain script tags with a bunch of info in them; however, there is only one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
With some stuff above and below it. Now, this is different for each page source except for obviously the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line, and cutting out just the URL? I would need to use regular expressions I feel as the URLs are dynamic.
The "gsub" method does something similar to what I want to search for, with its ability to use /regex/. But, I am not wanting to replace anything, I just want to find that URL in the source code using a /regex/ and copy it.
According to your comments, this is what you're looking for, I guess:
var regex = /http[^']+/; // stop at the closing quote so the trailing ', is not captured
Example http://jsfiddle.net/Km9ZB/
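A slightly stricter variant anchors on the 'image' key and captures only what sits between the quotes, so nothing has to be replaced or trimmed afterwards. The sketch below is in Python; the surrounding script content is invented, and in Ruby the same pattern could be used with String#match and the first capture group:

```python
import re

# Capture everything between the quotes that follow the 'image' key.
IMAGE_LINE = re.compile(r"""'image'\s*:\s*'([^']+)'""")

# Invented page source around the line quoted in the question.
page_source = """
<script>
  var item = {
    'title':'Something',
    'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
    'price':'10'
  };
</script>
"""

match = IMAGE_LINE.search(page_source)
image_url = match.group(1) if match else None
print(image_url)
```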