Get the number of pages in a MediaWiki (Wikipedia) namespace

There is the AllPages API, which allows one to enumerate all pages in a MediaWiki namespace. But how can one get just a count of the pages in a namespace? I am interested in the count primarily so that I can show a progress bar for the AllPages call, which can take quite some time.
There is Siteinfo, which has some statistics, but it is not at the same granularity as namespaces.

I don't think there's a proper API for it. There's the {{PAGESINNAMESPACE}} magic word (disabled by default), which can be used with the parse API. Or you can use the search API with nothing but a namespace filter - how well that works might depend on what search backend you are using.
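For what it's worth, here is a minimal sketch of the parse-API route in Python with requests. The endpoint URL is a placeholder, and this only works on wikis where the {{PAGESINNAMESPACE}} magic word has been enabled (it is disabled on Wikipedia itself):

    import re
    import requests

    API = "https://wiki.example.org/w/api.php"  # placeholder wiki endpoint

    # Ask the parser to expand {{PAGESINNAMESPACE:0}} (namespace 0 = main).
    resp = requests.get(API, params={
        "action": "parse",
        "contentmodel": "wikitext",
        "text": "{{PAGESINNAMESPACE:0}}",
        "format": "json",
    }).json()

    html = resp["parse"]["text"]["*"]     # rendered HTML of the expansion
    count = int(re.sub(r"\D", "", html))  # crude: assumes the count is the only number in the output
    print(count)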

Related

Azure ARM Template (JSON) Self-Reference

I'm creating some default "drag and drop" templates for our developers, and one section is the required tags. Most of the tags reference a variable: nice and easy. But one needs to reference the resource itself, and I cannot figure out a way to do it. Does anyone have any suggestions?
The tag itself is called "Context", and its value should be the "type" of the resource it is in, e.g. "Microsoft.Web/serverfarms". This is desired to aid with billing. Obviously I could either create a different template per resource type (not ideal, considering the number of different resources) or rely on the devs to update the field manually (not ideal either, as relying on them to add the tags manually hasn't worked so far in a lot of cases), but I am trying to automate it.
Extrapolating from the [variables('< variablename >')] function I did try [resources('type')] but Azure complained that "resources is not a valid selection". I thought it might have complained that it couldn't tell which resource to look at, but it didn't get that far. Internet searches have not turned up anything useful so far.
I can't find a way to do this cleanly either (I hope someone corrects me, though! This is a topic for us too). The reference and resourceId functions look promising, but both are unavailable inside the resources block, would require some parsing, and also require the API version, which you would probably also need to vary by resource, so you're just back where you started. ARM won't even let you use a variable for the resource type property (probably a good thing), so that option is out too.
As such, you'll either have to live with your team having to replace that chunk of text manually or pursue some alternative.
The simplest thing that comes to mind would be to write a script in a language that understands JSON. The script reads the template, adds the tag to each resource, then saves the template again.
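For example, a minimal sketch in Python; nothing Azure-specific is needed, since the template is plain JSON (the file name is a placeholder, and nested child resources would need a recursive walk):

    import json

    # Load the ARM template, stamp every top-level resource with a "Context"
    # tag equal to its own resource type, then write the template back out.
    with open("template.json") as f:        # placeholder path
        template = json.load(f)

    for resource in template.get("resources", []):
        tags = resource.setdefault("tags", {})
        tags["Context"] = resource["type"]  # e.g. "Microsoft.Web/serverfarms"

    with open("template.json", "w") as f:
        json.dump(template, f, indent=2)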
A similar approach would be to do it after the resources are deployed, by writing a script that loops through all resources and makes sure they have the tag. You can use automation to schedule this on a regular basis if you're concerned about it being missed. If you're deploying the templates using a script, you could add it to that script too.
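A read-only version of that audit might look like this in Python, assuming the azure-identity and azure-mgmt-resource packages (actually fixing the tags would additionally need an API version per resource type, which is the same wrinkle as above):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient

    SUBSCRIPTION_ID = "<subscription-id>"  # placeholder

    client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    # Report every resource whose "Context" tag is missing or stale.
    for res in client.resources.list():
        tags = res.tags or {}
        if tags.get("Context") != res.type:
            print(f"{res.id}: expected Context={res.type!r}, got {tags.get('Context')!r}")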
There are some things you could probably do with nested templates, but you probably wouldn't be making anyone's life easier or the process more reliable.
This could potentially be achieved through some PowerShell, specifically around Resource and Resource Group. You would need to run Get-AzResource at either the subscription level or just the resource group level, then pull the ResourceType field from the returned object and use a Set-AzResource command, passing in the ResourceId from above and the new tag mapped to the returned ResourceType field.

Managing strings and constants in a web app

We have a pretty large React-Redux based web app. In the app, and specifically in the UI, we have a lot of strings and constants (URLs, the name of the app, button labels, etc.). What's a recommended way of managing those strings and constants, considering the following requirements:
We have a lot of on-premises installations, and we want to be able to easily change things like the system name / link URL / button names.
We want to be able to easily go over the language in the UI and modify it.
We want to be able to localize the app in multiple languages.
The obvious method is to have the strings scattered all over and utilize find-and-replace, but we are wondering if there is a better way to centralize string management.
You could route your scripts through a configuration file so that you have all of these in one place to modify them. This is common in most CMS systems, e.g. WordPress, OpenCart.
Google the term i18n and you should be able to find a heap of information on internationalisation.
Here is a simple example of such a class, which might make it easier to understand how this might work out for you in your project.
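This is a minimal sketch of the pattern in Python, purely for illustration (the file layout and key names are assumptions; in a React app the same idea would live in JavaScript, typically behind an i18n library):

    import json

    class Strings:
        """Central lookup for all user-facing strings, one JSON file per locale."""

        def __init__(self, locale="en", path_template="locales/{}.json"):
            # e.g. locales/en.json: {"app.name": "My App", "button.save": "Save"}
            with open(path_template.format(locale)) as f:
                self._strings = json.load(f)

        def get(self, key):
            # Fall back to the key itself so a missing translation is visible.
            return self._strings.get(key, key)

    strings = Strings(locale="en")
    print(strings.get("button.save"))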

What techniques are available for programmatically transforming HTML/DOM in an iOS Application?

I'm processing a variety of RSS feeds, which contain summaries as well as the content of the target page URLs, and I am trying to use a uniform transformation method.
XSLT was the first thing that occurred to me to try, as it would accomplish what I want, in a standard way, without a lot of fuss aside from adding new XSLT stylesheets to accommodate uniquely formatted sites and feed content.
Problem: XSLT libraries are considered "private" in iOS, and even linking statically against your own copy will get you rejected by the App Store analysis tools.
I've looked into the possibility of injecting the stylesheet and data into a UIWebView that isn't displayed, but this seems like a really roundabout and hackish way to get at the system's underlying XSLT processor in an "approved" fashion.
What alternative techniques/libraries exist which would let me do this in a standard fashion, i.e. without rolling my own?
I'm not sure I fully understand your requirements, but one possibility would be to use libxml (which is allowed in iOS) to parse the XML and, if necessary, manipulate the DOM. If you really need to do XML transformations this is going to be more effort than XSLT, but if you just need to extract data from the XML, that can be done fairly easily with XPath queries.
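To illustrate the shape of that extraction, here is a rough sketch through lxml, the Python binding to the same libxml2 library (the feed markup is made up; on iOS the equivalent call is libxml2's xmlXPathEvalExpression through the C API):

    from lxml import etree  # lxml wraps the same libxml2 mentioned above

    xml = b"""<rss><channel>
      <item><title>First post</title><link>http://example.com/1</link></item>
      <item><title>Second post</title><link>http://example.com/2</link></item>
    </channel></rss>"""

    doc = etree.fromstring(xml)
    # One XPath query per field of interest, then plain traversal.
    for item in doc.xpath("//item"):
        print(item.findtext("title"), "->", item.findtext("link"))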
That said, I have read several people claiming they got XSLT working on iOS and had their apps approved in the app store. In particular, I've seen this stackoverflow answer claimed as a working solution by multiple people. And if that fails, another answer suggested building the libxslt library yourself with renamed symbols to bypass the app store checks. I would only suggest that as a last resort though.
You'll probably want to look into Hpple for something powerful but lightweight/native. See the tutorial on getting started here: http://www.raywenderlich.com/14172/how-to-parse-html-on-ios. Good luck!
I'm also going to recommend TFHpple, but I'm going to elaborate on the solution. I've explored an app that navigates a 3rd-party website/data source (well, I'm the 3rd party; they're the source, but that's semantics), and there are some pitfalls. The biggest pitfall is obvious: if the data source's DOM changes, you need to change your app and re-release. A creative way around this would be to publish/expose a global copy of the DOM on a public server, so that the end user doesn't have to update their app any time the data source changes (as long as the change isn't radical).
For instance, if your expected DOM search in TFHpple is @"//figure[@class='figure']/a" and then a week from now the resource you're looking for in your data source is altered to @"//figure1[@class='figure1']/a", you just opened yourself up to an App Store release... UNLESS... you publish the expected DOM searches on a web server you control, in a data dictionary that your app can consume and serve out to the various DOM search elements within your app. The only problem I foresee here is that if the data source adds or removes a data element you want to consume, you either have to release a build or handle the removal ahead of time (respectively).
Lastly, if the data source's DOM isn't well-formed or consistent, you may be beating your head against a wall more often than not.

best practices for writing to a file from multiple methods

I have a class that contains a bunch of methods for checking data I scrape every week (for things like well-formedness and other errors in gathering the data). Each of these methods performs a test, and then prints out a summary of the test.
I want to print out the output from these tests to a file, but I'm not sure what the best way to do it is. For example...
Should the class hold an instance variable for the file, with each method opening/appending to/closing the file? (A problem is that methods sometimes call other methods, so this seems kinda messy?)
Should each method get passed the file as a parameter? (Seems messy as well.)
Should each method return a string, with a "central" method that calls all the other tests and outputs all these strings to a file?
I'm not really familiar with using logger libraries -- would that be a solution?
My particular context
I have a scraper that pulls data from various websites and stores them in a database. Websites change all the time, so I'm writing a "scrape checker" program that checks my scrapes for various things, like:
number of empty results
length of results
weird characters in results
and so on
So I have methods like:
check_num_empty_results
check_weird_characters
check_scrape (calls a bunch of other checks)
check_scrape_pair (sometimes I want to check pairs of scrapes together, e.g., to match results against each other, so this is different from checking each one in isolation)
etc.
I want my "scrape checker" program to print out a file that summarizes all the checks.
Separation of concerns: write code that focuses on the scraping activity and returns the value(s) scraped. Then use aspect-oriented programming for logging, which can simplify the problem greatly, as the aspect holds the reference to the file or logging API.
Ultimately, it depends on what language you're using.
The first solution makes the most sense if your language permits it. For each instance of the logging class, have a field for the file object that you're reading from/writing to. This is basically equivalent to passing the file object as a parameter to every method.
That said, most mature languages have modules that will do a lot of this work for you; off the top of my head, sh/awk, Perl, and Python all come to mind as being suited to this task (though if you want to, you could use Java or something else).
Seems like a logging framework would be a perfect solution for this. If you are using Java or .NET, log4j and log4net are pretty much the de facto standards for that.
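If you go that route in Python, a minimal sketch with the standard logging module (class and method names are invented to match the question): configure one logger with a file handler up front, and every check method, however deeply nested the calls, just writes through it:

    import logging

    # One logger, configured once; every check method writes through it,
    # so nothing needs to pass a file handle around.
    logger = logging.getLogger("scrape_checker")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.FileHandler("scrape_report.log"))

    class ScrapeChecker:
        def check_num_empty_results(self, results):
            empty = sum(1 for r in results if not r)
            logger.info("empty results: %d of %d", empty, len(results))

        def check_scrape(self, results):
            # Composite checks just call the others; no file plumbing needed.
            self.check_num_empty_results(results)

    ScrapeChecker().check_scrape(["a", "", "b"])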

How can I extract addresses and phone number from HTML?

Is there a library that specializes in parsing such data?
You could use something like Google Maps. Geocode the address and, if successful, Google's API will return an XML representation of the address with all of the elements separated (and corrected or completed).
EDIT:
I'm being voted down and not sure why. Parsing addresses can be a little difficult. Here's an example of using Google to do this:
http://blog.nerdburn.com/entries/code/how-to-parse-google-maps-returned-address-data-a-simple-jquery-plugin
I'm not saying this is the only way or necessarily the best way. Just a way to parse addresses on a web site.
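As a rough sketch of that approach in Python (this uses the JSON flavor of the Google Geocoding endpoint and assumes you have a valid API key; the sample address is arbitrary):

    import requests

    API_KEY = "YOUR_KEY"  # assumes a valid Google Maps API key

    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": "1600 Amphitheatre Pkwy, Mountain View CA", "key": API_KEY},
    ).json()

    if resp["status"] == "OK":
        result = resp["results"][0]
        print(result["formatted_address"])
        # The pieces (street number, route, locality, ...) come pre-separated:
        for part in result["address_components"]:
            print(part["types"][0], "=", part["long_name"])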
There are 2 parts to this: extract the complete address from the page, and parse that address into something you can use (store the various parts in a DB for example).
For the first part you will need a heuristic, most likely country-dependent: for US addresses, [A-Z][A-Z],?\s*\d\d\d\d\d should give you the end of an address, provided the 2 letters turn out to be a state. Finding the beginning of the string is left as an exercise.
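A quick sketch of that heuristic in Python (the state whitelist here is abbreviated, and the pattern is deliberately naive):

    import re

    US_STATES = {"AL", "CA", "NY", "TX"}  # abbreviated; a real list has all the codes

    def address_end(text):
        # Two capitals (a state?), optional comma, then a 5-digit ZIP.
        for m in re.finditer(r"([A-Z][A-Z]),?\s*(\d{5})", text):
            if m.group(1) in US_STATES:
                return m.end()  # index just past the ZIP code
        return None

    print(address_end("Visit us at 1 Main St, Springfield, CA 94105 today."))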
The second part can be done either through a call to Google maps, or as usual in Perl, using a CPAN module: Lingua::EN::AddressParse (test it on your data to see if it works well enough for you).
In any case this is a difficult task, and you will most likely never get it 100% right, so plan for manually checking the addresses before using them.
You don't need regular expressions (yet) or a general parser like pyparsing (at all). Look at something like Beautiful Soup, which will parse even bad HTML into something like a tree of tags. From there, you can look at the source of the page and find out which tags to drill down through to get to the data. Then, from Beautiful Soup's tree, you can search for those nodes using CSS selectors (in recent versions) and directly loop over the tags you're interested in, getting to the actual data easily. From there, you can parse the data out using a quick regex or something. This will be more flexible and more future-proof, and also possibly less head-exploding, than trying to do it in pure regular expressions.
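A minimal sketch with Beautiful Soup 4 (the markup, class names, and phone pattern are all made up for illustration):

    import re
    from bs4 import BeautifulSoup

    html = """
    <div class="contact">
      <p class="address">1 Main St, Springfield, CA 94105</p>
      <p>Call us: 555-867-5309</p>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Drill down to the tags identified from the page source.
    for p in soup.select("div.contact p"):
        text = p.get_text(strip=True)
        phone = re.search(r"\b\d{3}-\d{3}-\d{4}\b", text)
        if phone:
            print("phone:", phone.group())
        elif "address" in p.get("class", []):
            print("address:", text)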