Retrieve EBIT from XBRL documents

It appears that EBIT information is not very uniform across different XBRL documents.
Cross-comparing data with other sources, such as Yahoo, I have seen some XBRL documents use the fact us-gaap:OperatingIncomeLoss to store it when using US GAAP, or ifrs-full:ProfitLossBeforeTax when using IFRS.
However, sometimes it looks like they also use us-gaap:IncomeLossFromContinuingOperationsBeforeIncomeTaxesMinorityInterestAndIncomeLossFromEquityMethodInvestments or us-gaap:IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest.
And sometimes many of those are actually filled with different values, so there is no way to know which one is correct.
Is there a more reliable way to retrieve EBIT data?

The use of different concepts to mean the same thing is a general problem, which is beyond the XBRL specification and usually addressed by each reporting authority.
This has been investigated notably by Charles Hoffman, CPA, who suggests using his report frames. Charlie explained how to build "standardized" US GAAP balance sheets, income statements, and cash flow statements with a general mapping to the original reports, in order to compare companies in the EDGAR repository.
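As a pragmatic workaround while such mappings mature, one approach is to scan the instance document for a short list of candidate concepts in a preferred order and take the first one that is reported. The sketch below does this with Python's standard-library XML parser; the concept list, the taxonomy-year namespace, and the file name are assumptions for illustration, and picking the correct context/period is left out.

```python
# Minimal sketch: try candidate EBIT-like concepts in a preferred order.
import xml.etree.ElementTree as ET

# NOTE: the us-gaap namespace year must match the filing's taxonomy;
# 2023 is only an example.
US_GAAP = "{http://fasb.org/us-gaap/2023}"
CANDIDATE_TAGS = [
    US_GAAP + "OperatingIncomeLoss",
    US_GAAP + "IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest",
    US_GAAP + "IncomeLossFromContinuingOperationsBeforeIncomeTaxesMinorityInterestAndIncomeLossFromEquityMethodInvestments",
]

def find_ebit_like_fact(instance_path: str):
    """Return the first candidate concept that has reported facts."""
    root = ET.parse(instance_path).getroot()
    for tag in CANDIDATE_TAGS:
        facts = root.findall(".//" + tag)
        if facts:
            # Several facts may exist (different contexts/periods); the caller
            # still has to pick the right context, which this sketch ignores.
            return tag, [fact.text for fact in facts]
    return None, []

# "filing_instance.xml" is an assumed sample file name.
tag, values = find_ebit_like_fact("filing_instance.xml")
print(tag, values)
```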

Related

Namespace for (DDD) entities cutting across domains

I have a couple of business-related domains like Purchase, Marketing and Economy. Having the models arranged into a namespace* for each domain would be nice, but there are some entities cutting across domains, like an Item. How to organize those cross-cutting objects?
* = As in C#/Java/Python namespaces.
Since you have the concept of a Bounded Context, you should not share domain models between the namespaces. Actually, you should have one Item for each namespace that requires it, and each of those Items should have its own fields as required by the context it belongs to.
As Eric Evans said, it is not a big deal to replicate data if that is what it takes to never share the same domain model between contexts; share data, not the model.
Determining whether you have the correct design will require some experience with the domain so you should check with your domain expert.
You may very well require a Shared Kernel for classes that are cross-cutting. You'll have to be careful that you do not abuse the shared kernel by placing too many generic / logical classes in there.
To add to what #rafaels88 has answered, you may need to create a BC-specific domain construct where some logical entity exists. For instance, a User in the Identity & Access Control BC would be an Author in one BC but perhaps a Supervisor in another.
You could also duplicate an aggregate root (AR) in one BC as a value object (VO) in another. A Customer in the CRM BC may be the system of record for a customer and, therefore, contain a whole lot more information. In the Order BC, however, a Customer VO may only contain an Id, Name, and perhaps Address (for example).
So you will need to evaluate what type of object you have before deciding where to place it.
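To make the "one object per context" idea concrete, here is a minimal Python sketch, assuming each bounded context lives in its own package; the class names and fields are illustrative, not taken from the answers above.

```python
from dataclasses import dataclass

# crm/customer.py -- the CRM context owns the full Customer aggregate root.
# (In a real codebase both classes would simply be called Customer, each in
# its own package/namespace.)
@dataclass
class CrmCustomer:
    customer_id: str
    name: str
    email: str
    billing_address: str
    loyalty_points: int = 0

    def award_points(self, points: int) -> None:
        """Behaviour that only makes sense inside the CRM context."""
        self.loyalty_points += points


# orders/customer.py -- the Order context keeps only a slim value object,
# replicated from CRM data, with just the fields ordering actually needs.
@dataclass(frozen=True)
class OrderCustomer:
    customer_id: str
    name: str
    shipping_address: str
```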

Interesting NLP/machine-learning style project -- analyzing privacy policies

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify core characteristics of them. For example: do they collect the user's location? Do they share or sell data with third parties? And so on.
I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:
First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies have the same line, "We will take your location.", that line could be a cue giving 100% confidence that the policy collects the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic. For example, the presence of the word "location" might increase the likelihood that the user's location is stored by 25%.
The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using? Any input is welcome. This is my first time doing a project which touches on artificial intelligence, specifically machine learning and NLP.
This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents; typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
The popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction.
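For a concrete picture of that setup, here is a minimal scikit-learn sketch (a hedged illustration, not the answer's own code; the example policies and label names are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy training data: each policy gets a *set* of labels (multilabel).
policies = [
    "We collect your location to improve our services.",
    "We never share your personal data with third parties.",
    "Your location may be shared with advertising partners.",
]
labels = [
    {"collects_location"},
    set(),
    {"collects_location", "shares_with_third_parties"},
]

# Turn the label sets into a binary indicator matrix.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

# Tf-idf weighted word/bigram features + one linear SVM per label (OvR).
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LinearSVC()),
)
clf.fit(policies, y)

pred = clf.predict(["We may sell your data to partners."])
print(binarizer.inverse_transform(pred))
```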
A very interesting problem indeed!
On a higher level, what you want is summarization: a document has to be reduced to a few key phrases. This is far from being a solved problem. A simple approach would be to search for keywords as opposed to key phrases. You can try something like LDA for topic modelling to find what each document is about. You can then search for topics which are present in all documents; I suspect what will come up is stuff to do with licenses, location, copyright, etc. MALLET has an easy-to-use implementation of LDA.
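MALLET is a Java toolkit; if you would rather stay in Python, scikit-learn ships an LDA implementation as well. A toy sketch (documents and topic count invented for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "we collect location data and share it with partners",
    "your data is never sold to third parties",
    "we store your location to personalise content",
]

# Bag-of-words counts are the usual input to LDA.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top terms per topic to see what each topic is "about".
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {topic_idx}: {top}")
```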
I would approach this as a machine learning problem where you are trying to classify things in multiple ways, i.e. wants location, wants SSN, etc.
You'll need to enumerate the characteristics you want to use (location, SSN), and then for each document say whether that document uses that info or not. Choose your features, train your data, and then classify and test.
I think simple features like words and n-grams would probably get you pretty far, and a dictionary of words related to stuff like SSN or location would finish it nicely.
Use the machine learning algorithm of your choice; naive Bayes is very easy to implement and use and would work OK as a first stab at the problem.
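A rough sketch of combining n-gram features with such a hand-built keyword dictionary, using scikit-learn's FeatureUnion (the keyword list, texts, and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, make_pipeline

# Generic word/bigram counts learned from the training texts.
ngram_features = CountVectorizer(ngram_range=(1, 2))
# A fixed vocabulary acts as the hand-built dictionary of give-away terms.
keyword_features = CountVectorizer(
    vocabulary=["location", "gps", "ssn", "social security"],
    ngram_range=(1, 2),
)

features = FeatureUnion([("ngrams", ngram_features), ("keywords", keyword_features)])

texts = [
    "we collect your location",
    "we ask for your ssn",
    "we only store your email",
]
wants_location = [1, 0, 0]  # toy labels for one characteristic

clf = make_pipeline(features, MultinomialNB())
clf.fit(texts, wants_location)
print(clf.predict(["we track your gps position"]))
```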

How to analyze information from the comments of users on my site?

Can anybody suggest a way to process and analyze the data from the comments users post on an article on my website?
I want to process the comments as follows:
Example: an article on computerization may get the following comments:
I love computerization as it makes the work easier.
Computerization is spreading unemployment as 1 computer can work better than 4 people.
How I would process this information: I take the comments and try to recognize some predefined (and extensible) keywords in them.
Assuming that you are trying to extract some useful information from the comments, you could apply some machine learning to the comments to classify or categorize the data contained within, the sentiments, etc.
There are a number of different types of learning you can do on the text; however, I personally recommend using support vector machines or a naive Bayes classifier to be able to categorize and analyze the comments. You could also possibly use clustering, but there needs to be an element of natural language processing in the solution you choose. There are a number of different libraries that you can use to implement the code for either, e.g. svmlight, javaml, etc. I have personally used javaml and it is a good library.
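As a starting point before any classifier, the keyword-recognition step described in the question could look like the sketch below (the categories and keywords are made up; the matches could later become features for an SVM or naive Bayes model as suggested):

```python
import re

# Hand-maintained, extensible keyword map (illustrative categories/terms).
KEYWORDS = {
    "positive": {"love", "easier", "great"},
    "negative": {"unemployment", "hate", "worse"},
}

def tag_comment(comment: str) -> dict:
    """Return the predefined keywords found in the comment, per category."""
    words = set(re.findall(r"[a-z]+", comment.lower()))
    return {category: sorted(words & terms) for category, terms in KEYWORDS.items()}

comments = [
    "I love computerization as it makes the work easier.",
    "Computerization is spreading unemployment as 1 computer can work better than 4 people.",
]
for comment in comments:
    print(tag_comment(comment))
```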

Coding a domain specific text generator

A friend of mine is in the real estate business and, after being shown the art of writing copy for real estate ads, I realized that it is very formulaic, especially when advertising online, as there are predefined fields you fill in.
Naturally, I thought about creating a generator that pretty much automates writing the ads. I don't expect it to generate outstanding or even very good copy, just that it can put together words and sentences like a human would.
I have a skeleton/template that defines an ad and I've also put together a set of phrases and words that can be randomly selected, but I am interested in the more general aspects of coding such a generator. Any suggestions, tips, or literature that would help me understand this little project better?
Using metadata about the listing would be one way.
Say for a given house, you have these attributes:
(type: bungalow, sq. feet: <= 1400): you could use the phrase "cozy cottage".
Bedrooms: obvious; same thing with bathrooms. Assume using words like large, medium, etc.
Garage spots: if > 2, then "can park many vehicles", etc.
You could go even further with this: given the lat/lon for the address, there are web services with which you can find the number of parks nearby, the crime rate in the neighborhood, etc.
Rick
I'd say there are three basic approaches you could take to a problem like this, depending on how flexible you want the system to be and on how much work you want to put into it. The simplest is to treat it as a report generation problem, along the lines of Rick's suggestion. That's probably the way I'd go to produce a first draft of a listing. The results would be pure boilerplate, but each listing could be quickly punched up by the copywriter.
If you wanted to get fancy, though, you could come at it as a natural language generation problem. You'd start with some kind of a knowledge representation describing the meaning of the listing and a set of rules (finite state transducers, say) for mapping meanings to linguistic forms. There's a sizable academic literature on that kind of stuff, though it's kind of out of fashion these days. Places to start might be Blackburn & Bos's book or the NLTK suite (especially some of the projects in the contrib package).
The third way of doing it would be to treat it as a translation problem, essentially "translating" database entries into ad copy. You'd start with a large collection of listings and the corresponding human-written ads and construct a statistical model of the relationship between the two. Moses/Giza++ is a general purpose tool for building and applying such models.
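For the first, report-generation style approach, a toy sketch might look like this (the thresholds, phrases, and listing fields are invented; a real generator would draw them from the listing metadata Rick describes):

```python
import random

# Canned phrases keyed by a coarse size category.
PHRASES = {
    "cozy": ["cozy cottage", "charming starter home"],
    "spacious": ["spacious family home", "generous living space"],
}

def describe(listing: dict) -> str:
    """Render a boilerplate ad from listing metadata."""
    size_key = "cozy" if listing["sq_feet"] <= 1400 else "spacious"
    opener = random.choice(PHRASES[size_key])
    parking = ("Room to park several vehicles."
               if listing["garage_spots"] > 2 else "")
    return (f"This {opener} offers {listing['bedrooms']} bedrooms "
            f"and {listing['bathrooms']} bathrooms. {parking}").strip()

print(describe({"sq_feet": 1250, "bedrooms": 2, "bathrooms": 1, "garage_spots": 1}))
print(describe({"sq_feet": 2600, "bedrooms": 4, "bathrooms": 3, "garage_spots": 3}))
```

Each generated draft would then be punched up by the copywriter, as suggested above.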

Using Semantic MediaWiki for tabular data

Am I completely off-track to think about using Semantic MediaWiki to store (and organise, report on, etc.) 'tabular' data such as financial transactions or weather readings that would usually live in a spreadsheet or database?
It seems that one would need a separate, tiny, page for each tuple; but then, that's by design and perhaps it's perfectly okay.
I ask, simply because SMW seems like such a quick and easy way to get a collaborative data repository up and running.
Semantic MediaWiki is better suited to keeping track of factual or encyclopedic data, where you can have pages about everything you need to know about a certain topic.
For tabular or numerical data such as measurements, financial transactions, or sensor data, you would indeed need to create little pages about each data point, which is not practical in many cases.
However, there are extensions to MediaWiki that allow you to integrate external data sources (in MySQL databases or CSV files somewhere) with MediaWiki pages. This can give you the best of both worlds: dynamic access to and queries of tabular data, and semantic annotations of the pages around them.
Take a look at:
http://www.mediawiki.org/wiki/Extension:External_Data
No, I don't think it's such a bad idea.
Using SemanticForms you could enter lots of little data pages quickly and easily (for example, an invoice might require additional pages for each line item, but they could all be entered from one form using the "multiple" feature of the "for template" form tag). So although I've never tried logging weather data in SMW, I think it would be pretty easy. I don't see what the problem would be with storing data across so many pages; it's easy enough to combine it in whatever formats you require.
Give it a go and let us know how it goes!
You can use either the Semantic Internal Objects extension (SIO) or SMW's built-in subobjects (the former works well with the already mentioned External Data extension) to store multiple semantic objects (which could be the rows of your spreadsheet) in one page.
However, unless you are really looking for a collaborative tool with semantic capabilities, I doubt SMW is the best suited piece of software for your task.
Edit (November 2015): since SMW version 1.9, there is nothing that SIO can do that the built-in subobjects can't, so I would recommend the latter.
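If you do go the subobject route, you can read the rows back out programmatically through the "ask" module that Semantic MediaWiki adds to the MediaWiki API. The sketch below is a hedged illustration only; the wiki URL, category, and property names are assumptions, and the exact response layout may vary by SMW version.

```python
import requests

# Assumed wiki endpoint and query; adjust to your own installation.
API = "https://wiki.example.org/w/api.php"
params = {
    "action": "ask",
    "query": "[[Category:Weather reading]]|?Temperature|?Recorded on|limit=50",
    "format": "json",
}

response = requests.get(API, params=params, timeout=30)
data = response.json()

# SMW's ask API returns one entry per matching page/subobject,
# with the requested properties under "printouts".
for page, result in data["query"]["results"].items():
    print(page, result.get("printouts"))
```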