How can I extract addresses and phone number from HTML? - html

Is there a library that specializes in parsing such data?

You could use something like Google Maps. Geocode the address and, if successful, Google's API will return an XML representation of the address with all of the elements separated (and corrected or completed).
EDIT:
I'm being voted down and not sure why. Parsing addresses can be a little difficult. Here's an example of using Google to do this:
http://blog.nerdburn.com/entries/code/how-to-parse-google-maps-returned-address-data-a-simple-jquery-plugin
I'm not saying this is the only way or necessarily the best way. Just a way to parse addresses on a web site.

There are 2 parts to this: extract the complete address from the page, and parse that address into something you can use (store the various parts in a DB for example).
For the first part you will need a heuristic, most likely country-dependant: for US addresses [A-Z][A-Z],?\s*\d\d\d\d\d should give you the end of an address, provided the 2 letters turn out to be a state. Finding the beginning of the string is left as an exercise.
The second part can be done either through a call to Google maps, or as usual in Perl, using a CPAN module: Lingua::EN::AddressParse (test it on your data to see if it works well enough for you).
In any case this is a difficult task, and you will most likely never get it 100% right, so plan for manually checking the addresses before using them.

You don't need regular expressions (yet) or a general parser like pyparsing (at all). Look at something like Beautiful Soup, which will parse even bad HTML into something like a tree of tags. From there, you can look at the source of the page, and find out what tags to drill down through to get to the data. Then, from Beautiful Soup's tree, you can search for these nodes using XPath (in recent versions), and directly loop over the tags you're interested in, getting to the actual data easily. From there, you can parse the data using a quick regex or something. This will be more flexible and more future proof, and also possibly less head-exploding, than just trying to do it in pure regular expressions.

Related

Acquiring table info with YQL XPath query on wikia

Objective
Scrap HTML table from warframe wikia.
Background
I am trying to get the information of a table in warframe, the Mods List table. To achieve this objective I read the HTML-parser on Node.js topic and concluded that using YQL was my best option.
Code
By using Google Chrome Dev Tools, and two chrome extensions called CSS and XPath checker and XPath Helper, I was able to pin point the exact location of the table I am looking for with the following XPath query:
//*[#id="mw-content-text"]/div[33]/div/div[1]/table/tbody
Now, Chrome says this is the correct path, and the plugins I am using suggest it as well.
Problem
The problem is that when I use YQL, the result in Json is something utterly and completely different from the talbe I am expecting. In fact, it returns a different table together with misc data.
I am baffled to why this is happening. The wikia is a simple HTML page with little to no dynamic information whatsoever, so I really can't understand why I am getting erroneous results.
What could the problem be?
Unfortunately, YQL does not work properly with pages that are loaded over time, as is the case wit the wikia.
So, even then the XPath is correct, when Yahoo makes the first (and only) request, it receives an incomplete HTML, and never completes it.
To fix the issue, I decided instead to locally parse the HTML in my nodejs server using the npm-request and npm-cheerio packages.
The first package downloads the full page HTML, and the second parses it for the information I am looking for.
An effective solution that instead of relying on a third party tool, transfers all the work to my server.
Hope this helps someone, in the future !

How to turn off phone carrier HTML optimizations?

I made an app which provides a schedule for the pupils at my school. It gets its data from the school's online schedule service. Due to the lack for a real API, I reverse-engineerd the website: Now, the app parses it with string operations basically.
And here's the problem: The string searches do not match on certain mobile carriers' networks because they're stripping away the spaces and other foo. Is there an universal way to turn that off?
No, this is up to the carrier and even if there was a way to disable it, it would be non-standard and not worth addressing.
Additionally, you should not use string operations but a real HTML parser, like JSoup is for Java (there is a .NET port too, NSoup). If you look at the examples, it is relatively easy to use and will protect your application from space normalizations and any other change in the markup irrelevant to your application.
For data stored in inline JavaScript, you could first extract the right node from the document and then use a regex to trim the relevant parts. Or you could also use a regex on the HTML document as a whole, but remember that you can't really parse HTML using regexes.
Adopting another strategy, request pages over HTTPs rather than HTTP (if the server supports TLS/SSL) so that they can't be manipulated by the carrier.

Rails, HTML to JSON?

Given a static HTML page, is there an automated way to generate json?
For a large website that contains a lot of static HTML I am wanting to generate json for RSS feeds and search functionality and am looking for a way to convert HTML to json.
I could obviously write json templates for every page and every language but that would be a unmaintainable. That would double an 800page website to 1600 pages and that is not an option.
One approach I thought of could be to write a bot that would loop through the routes to index the pages and save data to a database which would give me all the choices I could wish for, for searching such as solr, elastic search, thinking sphinx etc...
I could use capybarra to aid me in this by visiting each path and extracting text to save to a database in a rake task as a background job but not sure how that would work in a production environment and it seems that such a common requirement might have already been achieved but for the life of me I can't find one.
I would be far happier (I think) if I could find a way to convert HTML text content to JSON
Any ideas? Has this already been done? are there any gems that might help? or is there built in functionality that I have not thought of, maybe a way to get html into a hash that could then be converted into json? whatever the approach it needs to be automated. I'm just stuck for the best approach.
Basically html looks a lot like xml, but with strong tag meanings, so you could use xml to json conversion, if it all ends up getting tree of html tags embedded in each other.
And so your question becomes this question Except you might get problems with single tags, without closing one. So you might get all of these and put a closing bracket after each one before trying to get it as hash from xml. Oh, early answer. Btw in general for parsing text data you should look at regular expressions.
I chose to go with a nokogiri solution in the end and wrote a parser to meet my needs

Jenkins API tree filter using regex?

I'm using the tree parameter to filter JSON data coming back from the API and that works great. My issue is I need to fetch some specific data from an array with a bunch of stuff I don't care about. I'm wondering if there is a way, using the tree command, to filter using a regex or contains string?
For example, to give me back all fileNames that start with MyProject:
http://myapi.com?tree=fileName=MyProject*
Regular expressions are great for regular grammars.
Trees tend to follow context free grammars. You might do a lot better with a language that can support context aware operations, like XPath. Yes, a few very simple items might work without the extra features of XPath; however, once you do step on a use case that is beyond what is possible with regular grammars (they only support a small subset of what can be searched), it is literally impossible to accomplish the search with the tool in hand.
If you care to see how regular grammars tend to come with limitations, study the pumping lemma, and then think deeply about it's implications. A quick brush-up on parsing theory might also be useful. You are up against mathematics, including the parts of mathematics which include logical operations. It's not a matter of being a difficult problem to solve, it has been proven that regular expressions cannot match context free grammars.
If you are just more interested in getting the job done quickly. I suggest you start off by reading up on XPath and try to leverage one of the already available tools, or at least use it as a guide in your tree matching efforts.
I found that using not using the JSON and instead switching to XML you can filter with XPATH. An example on finding all job urls with name starting with "Test" is this:
https://{jenkins_instance_url}/view/All/api/xml?tree=jobs[name,url]&xpath=/*/job[starts-with(name,'Test')]/url&wrapper=jobs

Parsing language for both binary and character files

The problem:
You have some data and your program needs specified input. For example strings which are numbers. You are searching for a way to transform the original data in a format you need.
And the problem is: The source can be anything. It can be XML, property lists, binary which
contains the needed data deeply embedded in binary junk. And your output format may vary
also: It can be number strings, float, doubles....
You don't want to program. You want routines which gives you commands capable to transform the data in a form you wish. Surely it contains regular expressions, but it is very good designed and it offers capabilities which are sometimes much more easier and more powerful.
ADDITION:
Many users have this problem and hope that their programs can convert, read and write data which is given by other sources. If it can't, they are doomed or use programs like business
intelligence. That is NOT the problem.
I am talking of a tool for a developer who knows what is he doing, but who is also dissatisfied to write every time routines in a regular language. A professional data manipulation tool, something like a hex editor, regex, vi, grep, parser melted together
accessible by routines or a REPL.
If you have the spec of the data format, you can access and transform the data at once. No need to debug or plan meticulously how to program the transformation. I am searching for a solution because I don't believe the problem is new.
It allows:
joining/grouping/merging of results
inserting/deleting/finding/replacing
write macros which allows to execute a command chain repeatedly
meta-grouping (lists->tables->n-dimensional tables)
Example (No, I am not looking for a solution to this, it is just an example):
You want to read xml strings embedded in a binary file with variable length records. Your
tool reads the record length and deletes the junk surrounding your text. Now it splits open
the xml and extracts the strings. Being Indian number glyphs and containing decimal commas instead of decimal points, your tool transforms it into ASCII and replaces commas with points. Now the results must be stored into matrices of variable length....etc. etc.
I am searching for a good language / language-design and if possible, an implementation.
Which design do you like or even, if it does not fulfill the conditions, wouldn't you want to miss ?
EDIT: The question is if a solution for the problem exists and if yes, which implementations are available. You DO NOT implement your own sorting algorithm if Quicksort, Mergesort and Heapsort is available. You DO NOT invent your own text parsing
method if you have regular expressions. You DO NOT invent your own 3D language for graphics if OpenGL/Direct3D is available. There are existing solutions or at least papers describing the problem and giving suggestions. And there are people who may have worked and experienced such problems and who can give ideas and suggestions. The idea that this problem is totally new and I should work out and implement it myself without background
knowledge seems for me, I must admit, totally off the mark.
UPDATE:
Unfortunately I had less time than anticipated to delve in the subject because our development team is currently in a hot phase. But I have contacted the author of TextTransformer and he kindly answered my questions.
I have investigated TextTransformer (http://www.texttransformer.de) in the meantime and so far I can see it offers a complete and efficient solution if you are going to parse character data.
For anyone who will give it a try to implement a good parsing language, the smallest set of operators to directly transform any input data to any output data if (!) they were powerful enough seems to be:
Insert/Remove: Self-explaining
Group/Ungroup: Split the input data into a set of tokens and organize them into groups
and supergroups (datastructures, lists, tables etc.)
Transform
Substituition: Change the content of the tokens (special operation: replace)
Transposition: Change the order of tokens (swap,merge etc.)
Have you investigated TextTransformer?
I have no experience with this, but it sounds pretty good and the author makes quite competent posts in the comp.compilers newsgroup.
You still have to some programming work.
For a programmer, I would suggest:
Perl against a SQL backend.
For a non-programmer, what it sounds like you're looking for is some sort of business intelligence suite.
This suggestion may broaden the scope of your search too much... but here it is:
You could either reuse, as-is, or otherwise get "inspiration" from the [open source] code of the SnapLogic framework.
Edit (answering the comment on SnapLogic documentation etc.)
I agree, the SnapLogic documentation leaves a bit to be desired, in particular for people in your situation, i.e. when just trying to quickly get an overview of what SnapLogic can do, and if it would generally meet their needs, without investing much time and learn the system in earnest.
Also, I realize that the scope and typical uses of of SnapLogic differ, somewhat, from the requirements expressed in the question, and I should have taken the time to better articulate the possible connection.
So here goes...
A salient and powerful feature of SnapLogic is its ability to [virtually] codelessly create "pipelines" i.e. processes made from pre-built components;
Components addressing the most common needs of Data Integration tasks at-large are supplied with the SnapLogic framework. For example, there are components to
read in and/or write to files in CSV or XML or fixed length format
connect to various SQL backends (for either input, output or both)
transform/format [readily parsed] data fields
sort records
join records for lookup and general "denormalized" record building (akin to SQL joins but applicable to any input [of reasonnable size])
merge sources
Filter records within a source (to select and, at a later step, work on say only records with attribute "State" equal to "NY")
see this list of available components for more details
A relatively weak area of functionality of SnapLogic (for the described purpose of the OP) is with regards to parsing. Standard components will only read generic file formats (XML, RSS, CSV, Fixed Len, DBMSes...) therefore structured (or semi-structured?) files such as the one described in the question, with mixed binary and text and such are unlikely to ever be a standard component.
You'd therefore need to write your own parsing logic, in Python or Java, respecting the SnapLogic API of course so the module can later "play nice" with the other ones.
BTW, the task of parsing the files described could be done in one of two ways, with a "monolithic" reader component (i.e. one which takes in the whole file and produces an array of readily parsed records), or with a multi-component approach, whereby an input component reads in and parse the file at "record" level (or line level or block level whatever this may be), and other standard or custom SnapLogic components are used to create a pipeline which effectively expresses the logic of parsing a record (or block or...) into its individual fields/attributes.
The second approach is of course more modular and may be applicable if the goal is to process many different files format, whereby each new format requires piecing together components with no or little coding. Whatever the approach used for the input / parsing of the file(s), the SnapLogic framework remains available to create pipelines to then process the parsed input in various fashion.
My understanding of the question therefore prompted me to suggest SnapLogic as a possible framework for the problem at hand, because I understood the gap in feature concerning the "codeless" parsing of odd-formatted files, but also saw some commonality of features with regards to creating various processing pipelines.
I also edged my suggestion, with an expression like "inspire onself from", because of the possible feature gap, but also because of the relative lack of maturity of the SnapLogic offering and its apparent commercial/open-source ambivalence.
(Note: this statement is neither a critique of the technical maturity/value of the framework per-se, nor a critique of business-oriented use of open-source, but rather a warning that business/commercial pressures may shape the offering in various direction)
To summarize:
Depending on the specific details of the vision expressed in the question, SnapLogic may be worthy of consideration, provided one understands that "some-assembly-required" will apply, in particular in the area of file parsing, and that the specific shape and nature of the product may evolve (but then again it is open source so one can freeze it or bend it as needed).
A more generic remark is that SnapLogic is based on Python which is a very swell language for coding various connectors, convertion logic etc.
In reply to Paul Nathan you mentioned writing throwaway code as something rather unpleasant. I don't see why it should be so. After all, all of our code will be thrown away and replaced eventually, no matter how perfect we wrote it. So my opinion is that writing throwaway code is pretty much ok, if you don't spend too much time writing it.
So, it seems that there are two approaches to solving your solution: either a) find some specific tool intended for the purpose (parse data, perform some basic operations on it and storing it in some specific structure) or b) use some general purpose language with lots of libraries and code it yourself.
I don't think that approach a) is viable because sooner or later you'll bump into an obstacle not covered by the tool and you'll spend your time and nerves hacking the tool, or mailing the authors and waiting for them to implement what you need. I might as well be wrong, so please if you find a perfect tool, drop here a link (I myself am doing lots of data processing in my day job and I can't swear that I couldn't do it more efficiently).
Approach b) may at first seem "unpleasant", but given a nice high-level expressive language with bunch of useful libraries (regexps, XML manipulation, creating parsers...) it shouldn't be too hard, and may be gradually turned into a DSL for the very purpose. Beside Perl which was already mentioned, Python and Ruby sound like good candidates for these languages (I bet some Lisp derivatives too, but I have no experience there).
You might find AntlrWorks useful if you go so far as defining formal grammars for what you're parsing.