I am trying to write a web crawler program in Python. In this case, can I use regex to get the expected string, or should I use one of Python's XML/HTML parsing packages to get the string?
I can see that most examples use regex. I would like to know the reason behind that.
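For a rough, self-contained illustration of the two approaches (a minimal sketch against a made-up snippet of HTML; real pages need more care):
import re
from html.parser import HTMLParser

html = '<div><h1 class="title">Hello</h1></div>'  # made-up sample markup

# Approach 1: regex - quick to write, but brittle if the markup changes
match = re.search(r'<h1[^>]*>(.*?)</h1>', html)
print(match.group(1))

# Approach 2: a real HTML parser - more robust against attribute/whitespace changes
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.text = ""
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
    def handle_data(self, data):
        if self.in_h1:
            self.text += data

parser = TitleParser()
parser.feed(html)
print(parser.text)
Regex tends to be popular because it is quick for one-off extractions; a parser copes better once the markup is nested or slightly irregular.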
Yet another English Wiktionary parsing question.
Overall, I am prepared to parse the wikitext format, so the standard API works for me.
The trouble, though, is that I want to use the English Wiktionary API to obtain the declension tables. For some odd reason, the tables are referenced by template codes: sometimes they appear in the output, but in most cases they are missing. For example, a call for a Russian word like http://en.wiktionary.org/w/api.php?format=xml&action=query&titles=крот&rvprop=content&prop=revisions&redirects=1 yields:
====Declension====
{{ru-noun-table|b|a=an}}
How do I convert it into a full declension table?
I played with a bunch of parameters from here: https://www.mediawiki.org/wiki/API:Query - no result.
One workaround I found is to use the new Wiktionary RESTful API, like this: https://en.wiktionary.org/api/rest_v1/page/html/крот (reference: https://en.wiktionary.org/api/rest_v1/#/). But it only returns HTML, which is more difficult to parse!
Is that the best that can be done?
Is there a special call to the declension tables perhaps? I mean, if it gets generated, there's got to be a way.
The table is generated by a Wiktionary module, namely Module:ru-noun, which is a Lua script. It works like a regular MediaWiki template call: the script is invoked with the parameters (b, a=an) and has access to the page name (крот).
See "Wikinflection: Massive semi-supervised generation of multilingual inflectional corpus from Wiktionary" for the rationale behind this, and then the resulting Dictionary builder project.
I am trying to use Node.js to implement data scraping. I used axios to GET the HTML and cheerio to extract the data.
However, I found that the HTML comes back with only the layout and no data. I guess the website loads the layout first, then makes AJAX calls to query the data and render it.
So, does anyone know how to GET the full HTML with the data? Any library or tools?
Thanks.
I would suggest you use the selenium library together with the bs4 (BeautifulSoup) library in Python, if you have some experience with Python; a short sketch follows below.
For Node:
https://www.npmjs.com/package/selenium-webdriver
I have written a scraper in Python using both libraries.
The scraper is for LinkedIn profiles: it takes a name from an Excel file, searches, and if data is available adds it to another Excel file.
https://github.com/harsh4870/Scraper_LinkedIn
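In Python, the combination looks roughly like this (a minimal sketch; the URL is a placeholder and a local Firefox/geckodriver setup is assumed):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("http://example.com")  # placeholder URL
html = driver.page_source         # HTML after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)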
For Node (with selenium-webdriver), the code goes roughly like this, inside an async function:
const { Builder } = require('selenium-webdriver');
const driver = await new Builder().forBrowser('firefox').build();
await driver.get('http://example.com');
const html = await driver.getPageSource();  // HTML after the page has rendered
await driver.quit();
This is something I have tried very hard to get working. I tried a bunch of options, including the one found here: Extracting gettext strings from Javascript and HTML files (templates). No go.
This is the sample HTML:
<h1 data-bind="text: _loc('translate this')"></h1>
The command I have tried (with the PHP and Glade languages):
xgettext -LPHP --force-po -o E:\Samples\poEdit\translated.po --from-code=utf-8 -k_loc E:\Samples\poEdit\html\samplePO.html
Glade seems to look only inside tags and completely skips the keyword. Has anyone solved this problem?
We eventually ended up writing a small .NET application to parse the HTML and create a JSON representation, and then ran xgettext with the language set to Python to create the .po file from the JavaScript.
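A sketch of the same idea in Python (an assumed approach, not the poster's actual tool): pull the _loc('...') calls out of the HTML attributes and write them into a stub file that xgettext can scan with -LPython -k_loc. File names are placeholders.
import re
from html.parser import HTMLParser

class LocExtractor(HTMLParser):
    """Collect the arguments of _loc('...') calls found in attribute values."""
    def __init__(self):
        super().__init__()
        self.strings = []
    def handle_starttag(self, tag, attrs):
        for _name, value in attrs:
            if value:
                self.strings += re.findall(r"_loc\('([^']*)'\)", value)

extractor = LocExtractor()
with open("samplePO.html", encoding="utf-8") as f:
    extractor.feed(f.read())

# Write a stub that xgettext understands: xgettext -LPython -k_loc stub.py -o translated.po
with open("stub.py", "w", encoding="utf-8") as out:
    for s in extractor.strings:
        out.write("_loc('%s')\n" % s)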
On an Ubuntu platform, I installed the nice little Perl package
libtext-mediawikiformat-perl - Convert Mediawiki markup into other text formats
which is available on CPAN. I'm not familiar with Perl and have no idea how to go about using this library to write a Perl script that would convert a MediaWiki file to an HTML file, e.g. I'd just like to have a script I can run such as
./my_convert_script input.wiki > output.html
(perhaps also specifying the base url, etc), but have no idea where to start. Any suggestions?
I believe @amon is correct that the Perl library I referenced in the question is not the right tool for the task I proposed.
I ended up using the MediaWiki API with action="parse" to convert to HTML using the MediaWiki engine, which turned out to be much more reliable than any of the alternative parsers proposed on the list that I tried. (I then used pandoc to convert my HTML to markdown.) The MediaWiki API handles extraction of categories and other metadata too, and I just had to append the base URL to internal image and page links.
Given the page title and base URL, I ended up writing this as an R function.
wiki_parse <- function(page, baseurl, format = "json", ...){
  require(httr)
  action <- "parse"
  # builds <baseurl>/api.php?format=json&action=parse&page=<page>
  addr <- paste(baseurl, "/api.php?format=", format, "&action=", action, "&page=", page, sep = "")
  config <- c(add_headers("User-Agent" = "rwiki"), ...)
  out <- GET(addr, config = config)
  parsed_content(out)  # newer httr versions: content(out) replaces parsed_content()
}
The Perl library Text::MediawikiFormat isn't really intended for stand-alone use but rather as a formatting engine inside a larger application.
The documentation at CPAN does actually show how to use this library, and notes that other modules might provide better support for one-off conversions.
You could try this (untested) one-liner
perl -MText::MediawikiFormat -e'$/=undef; print Text::MediawikiFormat::format(<>)' input.wiki >output.html
although that defies the whole point (and customization abilities) of this module.
I am sure that someone has already come up with a better way to convert single MediaWiki files, so here is a list of alternative MediaWiki processors on the mediawiki site. This SO question could also be of help.
Other markup languages, such as Markdown, provide better support for single-file conversions. Markdown is especially well suited for technical documents and mirrors email conventions. (Also, it is used on this site.)
The libfoo-bar-perl packages in the Ubuntu repositories are precompiled Perl modules. Usually, these would be installed via cpan or cpanm. While some of these libraries do include scripts, most don't, and aren't meant as stand-alone applications.
I would like to URL encode a URL in a C# SSIS script task. I have tried using System.Web.HttpServerUtility, but IntelliSense doesn't seem to be aware of "Server". Here's an example of the code that's raising an error:
Import System.Web
...
...
...
Server.HtmlEncode(TestVariable);
I have worked around this issue by writing a function that finds and replaces characters in a string to mimic HTML encoding, but I honestly abhor that solution. I would really just like to find out what I need to do differently to use what's baked into .NET instead of reinventing the HTML encoder wheel.
You need to add a reference to the System.Web assembly in your script task project. Once it is referenced, System.Web.HttpUtility.UrlEncode is available for URL encoding (HttpServerUtility is the type behind the Server object inside ASP.NET, which is why "Server" isn't available in a script task).