How to extract particular links from html source code with regex

I have an HTML page full of links, but they are inside a pre tag, like below:
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 640px;
height: 130px;
text-align: left;
overflow: auto">
http://test.com/files/tivist.r00
http://test.com/files/tivist.r01
http://test.com/files/fdfd.rar
http://test.com/files/gfgf.rar.html
http://test.com/files/trtr.zip
</pre>
</div><br />
The page is full of links like those.
Is there any way to get only those links from the whole page?
I am using Notepad++, so a regex that extracts just those links would be ideal.

You can use the following regex to find all of them in the document:
http://[^\s]*
You could tighten it so that it also stops at quotes and angle brackets, for example:
http://[^\s"><]*
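The same pattern can be tried outside an editor. The following is a minimal Python sketch, not part of the original answer: it first isolates each pre block and then applies the tightened pattern inside it. The helper name links_in_pre and the sample string are made up for illustration, and a real HTML parser would be more robust than a regex for finding the pre blocks.

```python
import re

# Matches each <pre ...>...</pre> block; DOTALL lets "." span line breaks.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>(.*?)</pre>", re.DOTALL | re.IGNORECASE)
# The tightened pattern from the answer above.
LINK = re.compile(r'http://[^\s"><]*')


def links_in_pre(html):
    """Return every http:// link found inside <pre> blocks, in document order."""
    links = []
    for block in PRE_BLOCK.findall(html):
        links.extend(LINK.findall(block))
    return links


# Hypothetical sample: one link outside the pre block, two inside it.
sample = """<a href="http://test.com/elsewhere">not wanted</a>
<pre class="alt2" dir="ltr" style="
margin: 0px;
overflow: auto">
http://test.com/files/tivist.r00
http://test.com/files/gfgf.rar.html
</pre>
</div><br />"""

print(links_in_pre(sample))
```

The link in the href attribute is skipped because it sits outside the pre block.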

Be sure to turn the line-by-line option off. Notepad++ has a very limited and poorly documented regex engine, so try downloading the EditPad Pro trial edition.
(?<=\<pre.+?)http:\/\/.+?($|\s)(?=.+?\<\/pre\>)
This should only match links that are inside a pre tag. Note that it relies on a variable-length lookbehind, which many regex engines do not support.

Related

HTML <a href>link text</a> is different from browser "tab title"

I have the following code
<div class="content">
<p>
Arbeitszeugnisse.pdf
</p>
</div>
where the content class looks like this
<head>
<style type="text/css">
.content {
width: 35%;
float: left;
padding: 0px;
border: 0px solid #8511ae;
margin-top: 0%;
margin-bottom: 0%;
margin-left: 2%;
margin-right: 3%;
background-color: #faf9d8;
}
</style>
</head>
When clicking on the link, the tab title used to be Arbeitszeugnisse.pdf.
After adding a page to Arbeitszeugnisse.pdf with PDF Arranger 1.4.2 under Ubuntu 20.04.4 and replacing the existing file with the new one, the tab title now reads
Layout 1 - Arbeitszeugnisse.pdf
Similarly I have
Bildungsweg.pdf
Bildungsweg.pdf is also made up of multiple files that were concatenated with PDF Arranger. When clicking on Bildungsweg.pdf, the tab title reads
G0-034-F1-20190701130851 - Bildungsweg.pdf
Is there a way to get clean tab titles with no extra text?

You can run the exiftool command-line tool after every PDF edit. For example:
exiftool -Title="" Bildungsweg.pdf
Leave the -Title="" flag empty, without a string. That way, web browsers will display the filename instead of the title metadata embedded in the PDF.
Take a look at this Gist about anonymising PDFs.
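If several PDFs need the same treatment, the exiftool call can be scripted. This is only a sketch under stated assumptions: it assumes exiftool is installed and on PATH, and the function names clear_title_command and clear_titles are made up for illustration. Passing the arguments as a list avoids the shell, so the empty value is written as -Title= with nothing after the equals sign.

```python
import shutil
import subprocess


def clear_title_command(pdf_path):
    """Build the exiftool call from the answer above; an empty -Title= clears the tag."""
    return ["exiftool", "-Title=", pdf_path]


def clear_titles(pdf_paths):
    """Run exiftool on each file and return the paths that were processed."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool was not found on PATH")
    done = []
    for path in pdf_paths:
        # check=True raises CalledProcessError if exiftool reports a failure.
        subprocess.run(clear_title_command(path), check=True)
        done.append(path)
    return done


# Only show the command here; call clear_titles([...]) to actually run it.
print(" ".join(clear_title_command("Bildungsweg.pdf")))
```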

You could try setting window.status, but note that it sets the status bar text rather than the tab title, and most current browsers ignore it. Since this is an external file, I'd look at the metadata properties of the generated file instead.
My understanding of how the PDF tool prepares files is that this can be problematic, but I think this is where you should begin your bug hunt.
In case you embed the file, here is the JavaScript:
<script type="text/javascript">
window.status = 'hello, world';
</script>

How do you make "infoboxes" in mediawiki?

I'm making a wiki using Mediawiki. I've seen a right side bar of each page on other wikis.
Like this: http://minecraft.gamepedia.com/Diamond_Ore
The right sidebar has information about whatever the wiki page describes.
I want to know whether it's possible to add that to each page, and how.
I've made an Example page: [Link deleted]
That's my wiki and I want to know how I could add a sidebar to the page.

To make simple yet elegant and flexible infoboxes, first check that the ParserFunctions extension is installed. Then create a template called "Infobox" (or whatever) with the following combination of HTML and wikitext:
<div class="infobox">
<div class="infobox-title">{{{title|{{PAGENAME}}}}}</div>{{#if:{{{image|}}}|
<div class="infobox-image">[[File:{{PAGENAME:{{{image}}}}}|300px]]</div>}}
<table class="infobox-table">{{#if:{{{param1|}}}|<tr>
<th>Parameter 1</th>
<td>{{{param1}}}</td>
</tr>}}{{#if:{{{param2|}}}|<tr>
<th>Parameter 2</th>
<td>{{{param2}}}</td>
</tr>}}{{#if:{{{param3|}}}|<tr>
<th>Parameter 3</th>
<td>{{{param3}}}</td>
</tr>}}{{#if:{{{param4|}}}|<tr>
<th>Parameter 4</th>
<td>{{{param4}}}</td>
</tr>}}{{#if:{{{param5|}}}|<tr>
<th>Parameter 5</th>
<td>{{{param5}}}</td>
</tr>}}</table>
</div>
Replace "param1", "param2", etc. with the parameters that you actually want for your infobox, such as "name", "birth-date", etc. If you need more parameters, just duplicate (with copy-paste) one of the existing parameters and modify it.
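Instead of copy-pasting rows by hand, the template text can also be generated. The following Python sketch is not part of the original answer: the function name infobox_template and the (name, label) pairs are assumptions for illustration. It produces the same structure as the template above, with one optional row per parameter.

```python
def infobox_template(params):
    """Build the wikitext for the infobox template from (name, label) pairs."""
    parts = [
        '<div class="infobox">\n',
        '<div class="infobox-title">{{{title|{{PAGENAME}}}}}</div>{{#if:{{{image|}}}|\n',
        '<div class="infobox-image">[[File:{{PAGENAME:{{{image}}}}}|300px]]</div>}}\n',
        '<table class="infobox-table">',
    ]
    for name, label in params:
        # One optional row per parameter; #if hides the row when the value is empty.
        parts.append(
            "{{#if:{{{" + name + "|}}}|<tr>\n"
            "<th>" + label + "</th>\n"
            "<td>{{{" + name + "}}}</td>\n"
            "</tr>}}"
        )
    parts.append("</table>\n</div>")
    return "".join(parts)


# Hypothetical parameters, matching the examples named in the text.
print(infobox_template([("name", "Name"), ("birth-date", "Born")]))
```

Plain string concatenation is used here because the many literal braces in wikitext would all need escaping in a format string.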
Then go to MediaWiki:Common.css and add some styling (if you don't have the necessary permissions to edit MediaWiki:Common.css, you'll have to add this CSS as inline styling to the HTML in the template, or better yet, use the TemplateStyles extension):
.infobox {
background: #eee;
border: 1px solid #aaa;
float: right;
margin: 0 0 1em 1em;
padding: 1em;
width: 400px;
}
.infobox-title {
font-size: 2em;
text-align: center;
}
.infobox-image {
text-align: center;
}
.infobox-table th {
text-align: right;
vertical-align: top;
width: 120px;
}
.infobox-table td {
vertical-align: top;
}
Finally, go to the wiki pages that require the infobox and copy the following wikitext, replacing "Infobox" with the name you've given to your template, and "param1", "param2" etc. with the names you've given to your parameters:
{{Infobox
| title =
| image =
| param1 =
| param2 =
| param3 =
| param4 =
| param5 =
}}
You may safely leave empty or delete any parameters you don't use, as they are all optional. If "title" is not provided, the infobox will default to the page name, which is usually what you want.
I've had to develop infoboxes like these for countless clients, and I slowly arrived at this solution as the best balance of complexity and flexibility. Hope it helps someone!
PS for any advanced users: I recommend using HTML rather than wikitext for defining the main table because this way we avoid conflicts with the pipes of the #ifs. In Wikipedia this conflict is avoided by using a template called {{!}} that inserts a pipe, but this results in unreadable wikitext.

Infoboxes are just tables with a right side float and some additional formatting.
{| style="float:right;border:1px solid black"
| My fantastic infobox
|-
| More info
|}
For best practice, you should include your infobox formatting in a class in your wiki's CSS, and define an infobox template instead of creating separate tables on every page.

I'm not sure how recent @Sophivorus's post is; there is no date, so I assume it's from 2018. Sorry if not.
I was searching for a tutorial that would explain the nasty infoboxes and how to create them, with no luck. The wiki documentation itself was no help. After 2.5 evenings I finally gave up and found Sophivorus' post. It gave me a hint that creating an infobox in HTML with simple CSS might be a better and easier idea. I figured that the wiki must convert its awful infoboxes to HTML anyway, and I wasn't wrong! :) I now have a nice infobox without learning the wiki's Scribunto/Lua/wikitext monster and without much knowledge of HTML and CSS.
I would like to share the solution, step by step, for dummies like me.
1. Go to a wiki page with an infobox you like (in my case the Formica fusca page).
2. Select the whole content of the infobox and right-click on its title at the top of the infobox.
3. Choose to inspect the element (the naming might differ between browsers; I use Opera).
4. Look to the right, where a new panel has opened.
5. Move the cursor over the entries in that panel and look for the <table class="... marker. Note that the whole infobox is highlighted when you mouse over the <table class="... entry.
6. Right-click on it and go to Copy, then select Copy outerHTML.
7. Create a Template:Examplenamehere page on your MediaWiki, edit it and paste the copied HTML code there. Edit it as you wish.
8. As Sophivorus said, go to MediaWiki:Common.css and add some styling. You might use the code he/she mentioned.
9. Styling. In the HTML, add classes to the elements you wish to style through the CSS page (from step 8). You will need at least these: 1) a class for the whole infobox (in this case <table class="infobox" bla bla bla>); 2) a class for titles/headers, infobox-subtitle or whatever (<th class="infobox-subtitle" bla bla>); 3) a class for sub-titles, if you want any (<td class="infobox-subtitle" bla bla>).
10. Copy the CSS from here, or write your own CSS using the classes mentioned above:
.infobox {
background-color: #ffff00;
border: 2px solid #008600;
float: right;
margin: 0 0 1em 1em;
padding: 1em;
width: 400px;
}
.infobox-title {
border: 1px solid #000000;
font-size: 1.5em;
text-align: center;
background-color: #ff0000;
}
.infobox-subtitle {
font-size: 1em;
text-align: center;
background-color: #ffff00;
}
.infobox-image {
text-align: center;
background-color: #ffff00;
}
In the HTML, paste {{{somenamehere}}} wherever you want the user to supply information. Then, on the pages that should show the infobox, invoke your Examplenamehere template and fill in those parameters, one line per parameter:
{{Examplenamehere
| somenamehere1 =
| somenamehere2 =
}}
To show the page name as the header by default, put {{{nazwa|{{PAGENAME}}}}} in place of the title at the top of the table.
If you wish to check or copy my template, go here: http://wiki.mrowki.ovh/index.php?title=Szablon:Opis and inspect the page

Modifying the iframe for FCKeditor

I'm trying to figure out which file holds the styling for FCKeditor's WYSIWYG editor.
This is a sample WYSIWYG iframe:
<iframe id="edit-field-location-0-value___Frame" height="100%" frameborder="0" width="100%" scrolling="no" src="/sites/all/modules/fckeditor/fckeditor/editor/fckeditor.html?InstanceName=edit-field-location-0-value&Toolbar=Default" style="margin: 0px; padding: 0px; border: 0px none; background-color: transparent; background-image: none; width: 100%; height: 100%;">
I want to modify the height, but I just can't seem to find the file, or any documentation on this. Does anyone have any information?

Hah, someone is still using the old module/editor (why?) o.O :-)
Considering that the editor is already dead, as is the FCKeditor module for Drupal, I suggest taking the easiest path: altering the FCKeditor module.
Find the following line in fckeditor.module:
// sensible default for small toolbars
$height = intval($element['#rows']) * 14 + 140;
and try setting height to any other value. I do not have this module installed, but that would be my first guess.
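To get a feel for what that line produces before editing it, the arithmetic can be reproduced in a few lines of Python. This is only an illustration of the quoted formula: the function and parameter names are made up, and reading 14 as a per-row height and 140 as a fixed allowance is an assumption, not something the module documents here.

```python
def editor_height(rows, per_row=14, fixed=140):
    """Mirror of the quoted PHP line: intval(rows) * 14 + 140."""
    # int() stands in for PHP's intval(); the two differ for non-numeric strings.
    return int(rows) * per_row + fixed


for rows in (5, 10, 20):
    print(rows, "rows ->", editor_height(rows))
```

Changing either constant in the PHP line shifts the resulting height in the same way.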

Embed PDF file into webpage from online source, which requires download of the file with "Content-Disposition:" tag in HTTP-headers

I need to display the PDF files from a third party on a webpage. I have links to the documents as they appear on the source pages. Unfortunately, none of the links are actual links to the documents, but rather GET requests with certain parameters, or other indirect references like so:
http://cdm.unfccc.int/UserManagement/FileStorage/SNM7EQ2RUD4IA0JLO3HCZ8BTK1VX5P
If the website does not force a download with a Content-Disposition: attachment; header in the response, as is the case for the link above, then I can easily achieve the necessary display with:
<object width="90%" height="600" type="application/pdf"
data="http://cdm.unfccc.int/UserManagement/FileStorage/SNM7EQ2RUD4IA0JLO3HCZ8BTK1VX5P"
id="pdf_content">
<p>Can't seem to display the document. Try <a href="http://cdm.unfccc.int/UserManagement/FileStorage/SNM7EQ2RUD4IA0JLO3HCZ8BTK1VX5P">
downloading</a> it.</p>
<embed type="application/pdf" src="http://cdm.unfccc.int/UserManagement/FileStorage/SNM7EQ2RUD4IA0JLO3HCZ8BTK1VX5P"
width="90%" height="600" />
</object>
This "stands" and "falls" very gracefully in the majority of browsers. Using <object> and <embed> together works for me and, as far as I've tested, does not affect the problem I describe below (tell me if I'm wrong).
The problem begins when the website does force the download with the above-mentioned header. For instance, the document at the following link:
http://mer.markit.com/br-reg/PublicReport.action?getDocumentById=true&document_id=103000000000681
would not be displayed through the HTML structure I showed above. It falls gracefully and the link for downloading works just fine, but I need to view it!
I've been banging my head on the wall for 3 days now, can't figure it out.
Maybe there is a way to catch the headers of the request somehow and ignore them, or maybe force the "viewability" into the GET request.
For general information, this is a part of Ruby on Rails application, so the solution should be coming from along those lines. I'm not giving any ROR code in here, because it doesn't seem to be a source of concerns.
Any straight-forward solution would be prayed upon, while any others - heavily appreciated.
The alternative solutions I thought of, and why I discarded them:
Download all those files to local storage in advance and just serve them from there.
The necessary storage capacity would be around 1 TB and growing, so keeping it on the server would be expensive for the small commercial SaaS that this is.
Cache those documents around the time they might be needed. For instance, when someone opens a project page, a background process downloads the related PDFs, so if the user clicks a document link he is served the copy that was just downloaded to local storage. The cache could be kept for a few hours or days in case the user returns.
This might be viable, but with a significant user base it would have the same problem as the one above. Also, at this moment I would not know how to go about implementing this kind of algorithm (I'm very much a beginner, you see).
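The caching idea could look roughly like the following. It is a Python sketch of the concept only (the application itself is Ruby on Rails), and every name in it, including PdfCache, max_age_seconds and fake_fetch, is hypothetical. The download function is passed in by the caller, so the demo below runs without touching the network.

```python
import hashlib
import os
import tempfile
import time


class PdfCache:
    """Keep fetched documents on disk for a limited time (all names are illustrative)."""

    def __init__(self, directory, max_age_seconds, fetch):
        self.directory = directory
        self.max_age_seconds = max_age_seconds
        self.fetch = fetch  # callable: url -> bytes, supplied by the caller
        os.makedirs(directory, exist_ok=True)

    def _path_for(self, url):
        # Hash the URL so query strings never end up in file names.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".pdf")

    def get(self, url):
        """Return the local path for url, downloading only if missing or stale."""
        path = self._path_for(url)
        fresh = (
            os.path.exists(path)
            and time.time() - os.path.getmtime(path) < self.max_age_seconds
        )
        if not fresh:
            with open(path, "wb") as handle:
                handle.write(self.fetch(url))
        return path


# Demo with a stand-in fetcher so nothing is actually downloaded.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"%PDF-1.4 placeholder"


cache = PdfCache(tempfile.mkdtemp(), max_age_seconds=3600, fetch=fake_fetch)
first = cache.get("http://example.invalid/doc?id=1")
second = cache.get("http://example.invalid/doc?id=1")
print(first == second, len(calls))
```

Because files expire after max_age_seconds, the storage needed depends on how many documents are viewed within that window, not on the size of the whole archive. A periodic job that deletes expired files would still be needed to reclaim the space.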

You may want to look into using http://pdfobject.com or maybe just adapt some of its code as it seems to be able to do what you want. I whipped up a proof-of-concept:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Embedding a PDF using PDFObject: Simple example with basic CSS</title>
<!-- This example created for PDFObject.com by Philip Hutchison (www.pipwerks.com) -->
<style type="text/css">
body
{
font: small Arial, Helvetica, sans-serif;
color: #454545;
background: #F8F8F8;
margin: 0px;
padding: 2em;
}
h1
{
font-weight: normal;
font-size: x-large;
}
a:link, a:visited
{
color: #3333FF;
text-decoration: none;
border-bottom: 1px dotted #3333FF;
}
a:hover, a:visited:hover
{
color: #FF3366;
text-decoration: none;
border-bottom: 1px solid #FF3366;
}
#pdf
{
width: 500px;
height: 600px;
margin: 2em auto;
border: 10px solid #6699FF;
}
#pdf p
{
padding: 1em;
}
#pdf object
{
display: block;
border: solid 1px #666;
}
</style>
<script type="text/javascript" src="pdfobject.js"></script>
<script type="text/javascript">
window.onload = function () {
var success =
new PDFObject({ url: "http://mer.markit.com/br-reg/PublicReport.action?getDocumentById=true&document_id=103000000000681" }).embed("pdf");
};
</script>
</head>
<body>
<h1>
Embedding a PDF using PDFObject: Simple example with basic CSS</h1>
<p>
This example uses one line of JavaScript wrapped in a <em>window.onload</em> statement,
with a little CSS added to control the styling of the embedded element.
</p>
<div id="pdf">
It appears you don't have Adobe Reader or PDF support in this web browser. <a href="http://mer.markit.com/br-reg/PublicReport.action?getDocumentById=true&document_id=103000000000681">
Click here to download the PDF </a>
</div>
</body>
</html>

Why does TinyMCE insert &nbsp; when copy-pasting HTML?

I'm using the latest version of TinyMCE, and when I copy and paste some HTML content from Wikipedia, it inserts lots of &nbsp; entities that are not present in the source.
Example, I select the following string from wikipedia:
trained professionals and paraprofessionals coming
From this page: http://en.wikipedia.org/wiki/Health_care
And It has the following source code:
trained professionals and paraprofessionals coming
Note: as we can see, there are no no-break spaces (&nbsp;).
Then when I paste it to the tinymce it produces the following html:
<h3 style="background-image: none; margin: 0px 0px 0.3em; overflow: hidden; padding-top: 0.5em; padding-bottom: 0.17em; border-bottom-style: none; font-size: 17px; font-family: sans-serif; line-height: 19.200000762939453px;"><span style="font-size: 13px; font-weight: normal;">trained </span><a style="text-decoration: none; color: #0b0080; background-image: none; font-size: 13px; font-weight: normal;" title="Professional" href="http://en.wikipedia.org/wiki/Professional">professionals</a><span style="font-size: 13px; font-weight: normal;"> and </span><a style="text-decoration: none; color: #0b0080; background-image: none; font-size: 13px; font-weight: normal;" title="Paraprofessional" href="http://en.wikipedia.org/wiki/Paraprofessional">paraprofessionals</a><span style="font-size: 13px; font-weight: normal;"> coming</span></h3>
Or, as a plain text it would look like this:
trained professionals and paraprofessionals coming together
Which actually breaks my layout because it all goes in one line (as one word).
Any ideas why it does it and how to prevent it?

Whenever you copy content from a website, the styling of the text is copied along with it. So paste the copied content into Notepad first, then copy it again from there and paste it into TinyMCE.
(Notepad gives you the plain content without any inline styles.)
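The Notepad round trip can also be approximated in code when the pasted HTML is available on the server side. This is a rough Python sketch, not a TinyMCE feature: the names TextOnly and to_plain_text are made up, and the sample string with &nbsp; entities is an assumption about what the pasted markup contains. It drops all tags and inline styles and turns no-break spaces back into ordinary spaces.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text nodes, dropping tags and their inline styles."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def to_plain_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Turn no-break spaces (U+00A0) back into ordinary, breakable spaces.
    return text.replace("\u00a0", " ")


# Hypothetical pasted markup with no-break spaces between the words.
pasted = (
    '<h3 style="font-size: 17px;"><span style="font-size: 13px;">trained&nbsp;</span>'
    '<a href="http://en.wikipedia.org/wiki/Professional">professionals</a>'
    '<span>&nbsp;and&nbsp;</span>paraprofessionals</h3>'
)
print(to_plain_text(pasted))
```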

First copy the content from wherever it is (Wikipedia, Google, etc.) and paste all of it into a Notepad file. That strips out the links and the extra spaces. Then copy the content from Notepad and paste it into the TinyMCE editor. This is the better way to handle this kind of content.

When copying content from a web page, use View Source in a browser, copy the relevant part of the source, and insert it in "raw mode" (source mode, HTML mode, whatever it is called; I presume TinyMCE has such a mode, and if not, get a better tool). To make this easier in Firefox, you can select an area, right-click, and choose the option to view the source of the selection. (This might need an add-on such as DOM Inspector; I'm not sure.)
It's possible that TinyMCE converts spaces to something else even in "raw" mode. I have seen such things happen in a CMS (spuriously changing normal spaces to no-break spaces), with no explanation found, and I hope I won't need to use such a CMS ever again.
