How can I convert wikitext markup containing double-curly-bracket template functions into plaintext or HTML? - mediawiki

I am creating a customized wiki markup parser/interpreter. One big task, however, is interpreting template functions like this:
{{convert|500|ft|m|0}}
which is converted like so:
500 feet (152 m)
I'd like to avoid having to manually code interpretations of these functions, and would rather use a method where I can query with a string:
akiva@akiva-ThinkPad-X230:~$ wiki-to-text "{{convert|3|to(-)|6|ft|abbr=on}}"
and get a return of:
"3 to 6 ft (0.91–1.83 m)"
Is there a tool to do this? Offline is by far the most ideal solution, but I could live with having to query a server.

You could query the MediaWiki API to get parsed text from wikitext. E.g., to parse the template Template:Done from the English Wikipedia you could use: https://en.wikipedia.org/w/api.php?action=parse&text={{Template:done}}&title=Test (see the online docs for parse). You do, however, need a MediaWiki instance that provides the template you want to parse and that works in exactly the same way. If you install a web server locally, you can install your own MediaWiki instance and parse wikitext locally, too.
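For illustration, a minimal Python sketch of such a call against the documented action=parse endpoint (the wikitext string is the asker's example; turning the returned HTML into plaintext is left as a final step):

import requests

# Sketch: expand wikitext through the MediaWiki parse API.
# action=parse with contentmodel=wikitext treats the text as raw
# wikitext and returns rendered HTML under parse.text["*"].
API_URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "parse",
    "text": "{{convert|3|to(-)|6|ft|abbr=on}}",
    "contentmodel": "wikitext",
    "prop": "text",
    "format": "json",
}
resp = requests.get(API_URL, params=params)
resp.raise_for_status()
print(resp.json()["parse"]["text"]["*"])  # HTML; strip tags for plaintext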
Btw.: there's the Parsoid project, too, which implements a Node.js-based wikitext->html->wikitext parser. However, IIRC it still needs to query the API of the wiki to expand templates.

Related

Need a method for deleting/ignoring Solr index records with a particular string in a field

This is a bit hard to explain, so bear with me. We have a website that uses a built-in Solr product to index or remove content when it is added/updated/deleted. Standard web content is specifically tagged as published or private, so it is easy to exclude private content from our custom search engine. However, binary files (DOCs, PDFs, etc.) do not have a public/private workflow state. The only way we can determine if a file is private is that, for some reason, the CMS doubles up the FullURL string, so the URL will have two instances of "http" in it. Not sure why that happens, but it's a good thing b/c it's the only way to tell if a file is published or private.
Because the Solr install that's packaged with the CMS is so wonky, and b/c we have numerous other sites in other CMSes, we have a "catalog" Solr install in AWS that aggregates content from our various web properties using a data import handler. So what I'm looking for is a way, using the DIH data-config.xml file, to exclude any index records that have "http" in the URL string twice. I'm currently using a filter query (fq) field in the tag to filter out certain records, but I don't know how to write an fq to do what I'm suggesting above, or if that's even possible. My hunch is that I'd need a function query, but that's a level of Solr knowledge I haven't yet achieved. If anyone has any advice or knows how to write a function query that would exclude a url field with two instances of "http" in the string, I'd appreciate it!
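One illustrative direction, offered as an assumption rather than an answer from the thread: Solr's standard query parser supports regular-expression queries on untokenized string fields, so a negative filter query might express "contains http twice" without a function query. A rough Python sketch, where the field name fullurl, the core name, and the host are all hypothetical:

import requests

# Hypothetical: exclude records whose URL matches ".*http.*http.*".
# Assumes "fullurl" is a non-tokenized string field; the same fq
# string could be tried in the DIH entity's fq attribute at import time.
params = {
    "q": "*:*",
    "fq": "-fullurl:/.*http.*http.*/",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/catalog/select", params=params)
print(resp.json()["response"]["numFound"])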

Get GitLab "my projects" tab as JSON or XML?

Is it possible to get serialized output of some sort from the /dashboard/projects screen in GitLab?
(I want to track differences and alert myself when someone assigns me a new project. One option is of course to build a script that iterates through the HTML pages, but if there's a way to get all projects at once -- preferably in a machine-friendly format -- that's even better.)
I think this kind of alert is not usually strictly needed, because the assignment workflow is usually about issue/MR assignment (which typically ends up as an email in your inbox). Anyway:
You should take a look at the GitLab API or, even better, use an existing project like Python GitLab.
It is a Python client implementation of the GitLab API and also has a handy gitlab command-line tool that can give you the required data in a human/machine-readable format.
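A minimal sketch with the python-gitlab client mentioned above (the server URL and token are placeholders; membership=True roughly corresponds to the /dashboard/projects listing):

import gitlab

# Connect to a GitLab instance; URL and token are placeholders.
gl = gitlab.Gitlab("https://gitlab.example.com", private_token="YOUR_TOKEN")

# List projects you are a member of, following pagination
# (all=True; newer python-gitlab versions spell this get_all=True).
projects = gl.projects.list(membership=True, all=True)
for project in projects:
    print(project.id, project.path_with_namespace)

Dumping that list to a file on a schedule and diffing successive runs would give the "someone assigned me a new project" alert the question asks about.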

Difference between name.html.erb vs name.erb

What is the difference between name.html.erb vs name.erb?
In particular, are there any reasons why name.erb could be bad to use?
I understand that name.html.erb is the convention - this indicates an HTML template and the ERB engine. But I can't find any information on whether there are reasons to prefer name.html.erb over name.erb.
My new workplace asks me to use name.erb, so I want to know: might there be any problems with this?
In short, no, there won't be any problems. ERB files simply output text. In many cases the file extension is ignored by the reading app, since the reading app reads/interprets the contained text and its syntax. As @taglia suggests, the file extensions are mostly a 'hint' for you and may also be used by the OS to select a default app to open the file with. See here for a more thorough explanation: Output Type for an ERB File
Rails convention dictates that template files include the extension of the output type and that the file name end with the .erb extension. As you mentioned, name.html.erb indicates an HTML template, and the ERB engine allows any instance variables from your controller's index action to be passed into the template and used. Similarly, name.js.erb indicates a JavaScript template. See here under 'Conventions for Template Files': An Introduction to ERB Templating
ERB is just a templating language, it is not limited to HTML (you could have name.txt.erb, or name.js.erb). Removing html from the name is just going to make your life more difficult (assuming it works), because you won't be able to know what file you are dealing with unless you open it.

Extract HTML Tables from Multiple webpages Using R

Hi, I have done thorough research and have gotten this far. All I am trying to do is extract an HTML table spanning many webpages.
I have to query sec.gov's database, and it then returns a table with the appropriate number of results (the size and number of pages vary with every query). For example:
Link: http://www.sec.gov/cgi-bin/srch-edgar
Inputs to be given:
Enter a Search string box: form-type=(8-k) AND filing-date=20140523
Start: 2014
End: 2014
How can I do this totally in R without even opening the browser?
I am sharing what I have done so far.
I tried many packages, and the closest I came was with the RCurl package. But for the getURL function I opened the browser, ran the query in the browser, and pasted the resulting URL into getURL. It returned a very long character string, which contains the URLs that can be looped over to produce the output I want. All this information is in the "center" tag of the output.
Now I do not know how to extract those URLs from the middle of that character string.
Also, this is not what I wanted. I wanted to run a web query directly from R and get the varied HTML table outputs directly into R. Is this possible at all?
Thanks
Meena
Yes, it is possible. You will want to use a combination of the RCurl and XML packages. You will need to programmatically generate the query parameters in the URL (based on the HTML form) and then use getURL() or getURLContent(). Sometimes, the server will expect an HTTP POST, so there is postForm().
To parse the result, look up the XPath language, which the XML package supports with getNodeSet(). I think there is also a function in the XML package for parsing an HTML table into a data.frame.
You might want to invest in this book.
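Purely as an illustration of the same query-then-parse flow, here is a Python sketch (not from the answer above, which uses R). The srch-edgar parameter names text, first, and last are assumptions read off the search form, so verify them against the live page:

import requests
from lxml import html

# Hypothetical sketch: submit the EDGAR query directly, then pull
# result links out with XPath (the role getNodeSet() plays in R's
# XML package). Parameter names are assumptions.
params = {
    "text": "form-type=8-K AND filing-date=20140523",
    "first": "2014",
    "last": "2014",
}
headers = {"User-Agent": "research script (your-email@example.com)"}  # SEC asks for a descriptive UA
resp = requests.get("https://www.sec.gov/cgi-bin/srch-edgar", params=params, headers=headers)
resp.raise_for_status()

tree = html.fromstring(resp.content)
links = tree.xpath("//table//a/@href")  # every link in the results table
for link in links:
    print(link)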

How to search for a word in an HTML file without any Java coding?

I'm doing a project in Java that creates a user manual for software (HTML files that are linked together, like Windows "Help and Support Centre"). Once a user manual is created, I have only HTML files remaining. Now I want to find the HTML files that contain a specified keyword (like a search engine). How can I do this without Java code?
grep, find, python script, or open any file with a text editor and try edit->search
(on Windows, use the built-in search in files)
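Along the lines of the Python-script suggestion, a minimal sketch (the directory and keyword are placeholders):

import os

# Grep-like sketch: walk a directory of generated HTML files and
# report every file containing the keyword, case-insensitively.
def search_html(root_dir, keyword):
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith((".html", ".htm")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    if keyword.lower() in f.read().lower():
                        matches.append(path)
    return matches

print(search_html("manual", "keyword"))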
If all of your other code is written in Java, then it would be sensible (without knowing your use case) to use Java for searching as well. You might of course use command-line programs such as grep or find, or the built-in search functionality of a web browser, but if the search should be part of a Java application anyway, why not go for Java and e.g. Lucene?
If this 'help' is going to be online, then you can embed Google search in it (limiting the search results to a specified site:). Alternatively, if you're hosting the pages yourself, you can use htdig for indexing the pages.
However, if it's going to be offline, you'll be better off generating a static index page with links to topics. To create a more help-system-like user experience, you can hide the contents of the index in invisible HTML DIV tags and add JavaScript that takes the searched phrase as input and unhides the matched words with their links.
Maybe I'm missing something, but have you looked at JavaHelp? It has indexing and searching built in, and can be used online or offline.