Pulling out some text from a giant HTML file using Nokogiri/xpath - html

I am scraping a website and am trying to pull out certain elements from the HTML. In the sites I am scraping, there are script tags with a bunch of info in them however, there is one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
With some stuff above and below it. Now, this is different for each page source except for obviously the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line, and cutting out just the URL? I would need to use regular expressions I feel as the URLs are dynamic.
The "gsub" method does something similar to what I want to search for, with its ability to use /regex/. But, I am not wanting to replace anything, I just want to find that URL in the source code using a /regex/ and copy it.

According to you comments, this is what you're looking for I guess
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/

Related

How to find for the wikipedia links in the infobox templates and other templates, using sql dumps

I want to extract the pages mentioned in the infobox and templates of pages.
E.g. From this page:
https://en.wikipedia.org/wiki/DNA
I want to extract all of the links in the infobox, like: "Genetics", "Introduction to Genetics" etc.
I want to do it, by using the sql dumps, possibly avoiding to parse the xml of whole pages, and I don't want to do it with APIs.
I could not find a way.
While Pagelinks does include also the links of infoboxes, I cannot find a way to exclude them.
I thought Templatelinks may have that info, but it is not: I could not find the pageids of the corresponding links in infoboxes.
Where is this information stored?
Or which kind of tables should I look at?
I consulted previous questions:
where can I find the infobox templates used in wiki?
and Mediawiki reference:
https://www.mediawiki.org/wiki/Manual:Templatelinks_table#Schema_summary
but could not find a solution.
That is a sidebar rather than an infobox: https://en.wikipedia.org/wiki/Template:Genetics_sidebar
I don't think there's a way of doing it other than parsing the content of the template to extract the links or using the API: e.g. https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Template:Genetics%20sidebar&pllimit=100&plnamespace=0
Something like this should also work but it's not returning any results for me:
SELECT * from pagelinks
where pl_title = 'Genetics_sidebar'
and pl_namespace = 0
and pl_from_namespace = 10
https://quarry.wmcloud.org/query/71442

Dynamically produce html based on templates

I am trying to automate a workflow for automatically creating HTML newsletters based on information stored in a spreadsheet.
Currently, I am using a newsletter drag and drop tool, in which several pre-programmed blocks are available (e.g. full column block, 2 column block etc). When creating a newsletter, I drag and drop a block and fill in my content (e.g. uploading an image, inserting a url). This is all well and good, however, since I have to create the same newsletter in 10 different languages, this process is quiet time consuming and prone to human error. While all newsletters are the same in terms of layout, the images and urls differ.
To solve this issue, I would like to get rid of the drag and drop process, and instead automate the workflow in some other way.
One idea that I have already tried, but that doesn't seem like the perfect option to me, is to dynamically create the needed HTMLs in Excel. Basically, the idea is to take the existing block template structure, and put it into Excel with some formulas.
I could then copy and paste the links to the images (in a simple format, such as EN1.jpg, ES1.jpg, etc.), as well as to urls (url.com, url.es).
This is some example block:
<img alt="" align="center" width="700" style="max-width:700px;" class="resetWidth" border="0" src="IMAGE" />
My final expected result is something like this:
I define the layout in a very quick manner (e.g. writing fullcolumn, half column, fullcolumn). The corresponding code is taken from the template. I then provide the attributes (image url, link url) in the form of a list or so. The end result should then be 10 html files that I simply have to upload to the newsletter software.
I would appreciate it very much if anyone had any ideas on this.
Another option for translating the page is to do something like this https://www.w3schools.com/howto/howto_google_translate.asp
it adds a selection for languages to translate into.
As for automating the images, you could set up folders for each langauge and reuse the name of images based on where you want them so they would be placed in the correct location.
All you'll have to do it replace the images with the same file names and swap the default language on the Google Translator.
So something like this that the html will stay the same with regards to the image names
For the link variables you may be able to write some JS or another language to take advantage of the
<html lang="">
and based on which lang is set, insert a set of links to the file.

Finding a specific link from a site

I'm trying to find a specific link from a web page using windows command line and tools. I think Xidel can do what I want to do.
In the page, the link is used like this:
file: 'http://link.link/index.txt'
Note: there's only one line like this. Now if I can set something like
file: '{%link}'
then I'll be able to extract the link. Also if I want to change the word index.txt to something like root.txt and then use aria2 to download the link as http://link.link/root.txt , what do I need to do?
(I don't have any experience with any of these tools/command like scripts, I just wanted to make something that does this (some alternatives are already available but I want to do it myself) and this only. So I did search for it and have an idea on how can I do it but extrating the exact url seems to be the hardest part since I couldn't find anything that might help me in xidel's docs)
Xidel is meant to extract data from HTML/XML/json files, but it can also extract from CSV's and TXT if you know how to use the $raw variable and xidel/xquery functions, like extract(), tokenize() and replace().
Post the URL or the source (or part thereof) of the webpage and I'll see how I can help you.

How to best transfer a document to a SAPUI5 framwork?

I'd like to achieve the following and I'm looking for ideas. I have a document and I want to represent/transform this content in/to a nice SAPUI5 framework. My idea is the following: a split app with having the paragraph titles in the master view (plus a search function on top) and the respective content in the detail view.
I'd like to know from you if
a) you might want to share your ideas and hints on alternatives.
b) this can be achieved within one single file (i.e. all the code for the split app and document content in one html) and maybe using pure html code (xml also feasible) - against the background of easily handing a large amount of text available in html.
c) if you happen to have/know a reusable template.
Thanks in advance!
An interesting question. I went through a similar exercise once, re-presenting my site with UI5.
To your questions:
(a) I would think that the approach you suggest is a good one
(b) You can indeed include all the app in a single file, I do that often by using script templates, even with XML Views. You can see some examples in my sapui5bin repository, in particular in the SinglePageExamples folder. Have a look at this html file for example: https://github.com/qmacro/sapui5bin/blob/master/SinglePageExamples/SAP-Inside-Track-Sheffield-2014/end.html
What I would suggest is, rather than intermingle the document content and the app & view definitions, maintain the content of your document separately, for example, in XML or JSON, and use a client side model to load it in and bind the parts to the right places.

Headings created inside of a template

I have a number of templates that create headings based on a formula. I am wondering if there is anyway to create an "edit" link that will take you directly to that section? The way that it currently works, the edit link takes you to editing the template itself. Could I possibly create a customized link that would keep you on the page and take you to right part?
Here is some sample code to help clear things up...
Template:Head:
==={{{1}}}===
This is a heading titled "{{{1}}}"
Test Page:
=Section 1=
{{head|1.1}}
{{head|1.2}}
{{head|1.3}}
=Section 2=
{{head|2.1}}
{{head|2.2}}
{{head|2.3}}
At the moment, if I want to edit the information for template "2.3", I have to edit all of section 2. (Note that for this example, that isn't a big deal. For the actual templates I am working with on my site, the templates have dozens of parameters and there are sometimes 10 or more in a section.)
Bottom line, is there way to create a custom edit link inside of the {{head}} template that would take you directly to editing the templates call on the page "Test Page"? Hope that makes sense.
Edit: Is there perhaps a way to make use of "anchor" tags? Can anchors be passed in to the URL?
To restate your problem, when you transclude a section heading the header isn't treated as being part of the destination page, so the edit link takes you back to the source. So you need a separate container for the template in order to edit it individually, and a complete section is the smallest editable container.
The only way I can think of doing this is using subpages (or virtual subpages if you don't have that ennabled in this namespace, doesn't change anything). So instead of placing {{head|1.1}} on MyPage, put it on MyPage/Subpage1 and then transclude that into MyPage in the usual way ({{:MyPage/Subpage1}}).
{{head}} can then include a custom edit link to the template input by using HTML heading tags (<h2> is equal to ==, etc.) to suppress the standard edit link and then use one of these templates (probably {{ed right}}) to create a custom edit link pointing to MyPage/Subpage1.
The way to create anchors in Mediawiki, by the way, is to use a <span id="name"/> tag, but that doesn't create a container that can be edited (or at least, not that I've been able to work out through URL tinkering).
I'm pretty sure there's no way to do that. As far as MediaWiki's section editing feature is concerned, the only thing that begins a new section is a line of the form:
=== Some text here ===
with the number of = signs determining the level of the heading. There's no way to get MediaWiki to let you edit any segment of the document that doesn't begin and end with such a line (or the beginning or end of the page).
Well, OK, I'm sure you technically could do it with an extension, in the sense that you can do anything with a MediaWiki extension. All you'd need to do is provide some way (e.g. a special parameter in an edit URL) for to user to indicate "I want to edit this template", then extract the template from the wikitext, present it to the user for editing, and write the result back into the page text over the original.
The tricky part will be extracting the template from the page source. (Finding and replacing templates on a page is a fairly common task for MediaWiki bot writers, so you might want to look for ideas there.) Whatever method you end up using for that, there will probably be edge cases where you need to give up and tell the user "Sorry, but I can't figure out how that template is transcluded here."