How to extract an abstract of entities and the relations between entities? - rapidminer

I want to use RapidMiner or GATE to extract an abstract about entities (characters), or just their main characteristics, and the relations between entities in a story. Do you have an idea or a sample I can modify for that purpose?
I tried RapidMiner extensions such as Aylien and Rosette, but the Extract Entities operator asks for an attribute parameter, and I couldn't work out what its value should be or where to get it. How do I then go on to find the relations between entities?

When using the Extract Entities operator from the Aylien extension for RapidMiner, the input attribute parameter should be the attribute (column in the example set) that contains the text examples you want to analyze.
For more inspiration take a look at the text mining section of the RapidMiner Community.


Word2Vec - How can I store and retrieve extra information regarding each instance of corpus?

I need to combine Word2Vec with my CNN model. To this end, I need to persist a flag (a binary one is enough) for each sentence as my corpus has two types (a.k.a. target classes) of sentences. So, I need to retrieve this flag of each vector after creation. How can I store and retrieve this information inside the input sentences of Word2Vec as I need both of them in order to train my deep neural network?
p.s. I'm using Gensim implementation of Word2Vec.
p.s. My corpus has 6,925 sentences, and Word2Vec produces 5,260 vectors.
Edit: More detail regarding my corpus (as requested):
The structure of the corpus is as follows:
sentences (label: positive) -- A Python list
    Feature-A: String
    Feature-B: String
    Feature-C: String
sentences (label: negative) -- A Python list
    Feature-A: String
    Feature-B: String
    Feature-C: String
Then all the sentences were given as the input to Word2Vec.
from gensim.models import Word2Vec
word2vec = Word2Vec(all_sentences, min_count=1)
I'll feed my CNN with the extracted features (which is the vocabulary in this case) and the targets of sentences. So, I need these labels of the sentences as well.
Because the Word2Vec model doesn't retain any representation of the individual training texts, this is entirely a matter for you in your own Python code.
That doesn't seem like very much data. (It's rather tiny for typical Word2Vec purposes to have just a 5,260-word final vocabulary.)
Unless each text (aka 'sentence') is very long, you could even just use a Python dict where each key is the full string of a sentence, and the value is your flag.
But if, as is likely, your source data has some other unique identifier per text – like a unique database key, or even a line/row number in the canonical representation – you should use that identifier as a key instead.
In fact, if there's a canonical source ordering of your 6,925 texts, you could just have a list flags with 6,925 elements, in order, where each element is your flag. When you need to know the status of a text from position n, you just look at flags[n].
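The parallel-list idea can be sketched as follows. This is only an illustration under assumptions: the toy sentences, the 1/0 flag values, and the helper name flag_for are hypothetical, and the ordering (all positives first, then all negatives) is one possible canonical ordering, not necessarily the asker's.

```python
# Hypothetical toy corpus: each sentence is a list of tokens,
# which is the shape Word2Vec expects as input.
positive_sentences = [["good", "fast", "service"], ["great", "value"]]
negative_sentences = [["slow", "service"], ["poor", "value"]]

# Fix one canonical ordering and build the flags in exactly the same order.
all_sentences = positive_sentences + negative_sentences
flags = [1] * len(positive_sentences) + [0] * len(negative_sentences)


def flag_for(position):
    """Return the class flag of the sentence stored at this position."""
    return flags[position]


# Pair every sentence with its flag, e.g. to build (features, target) samples later.
labelled_sentences = list(zip(all_sentences, flags))
```

Because Word2Vec only ever sees all_sentences, the flags list stays entirely in your own code and can be joined back to the sentences by position when preparing the CNN's training targets.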
(To make more specific suggestions, you'd need to add more details about the original source of the data, and exactly when/why you'd need to be checking this extra property later.)

TTL file format - I have no idea what this is

I have a file which has a structure, but I don't know what format it is, nor how to parse it. The file extension is ttl, but I have never encountered this before.
Some lines from the file look like this:
<http://data.europa.eu/esco/label/790ff9ed-c43b-435c-b6b3-6a4a6e8e8326>
a skosxl:Label ;
skosxl:literalForm "gérer des opérations d’allègement"@fr .
<http://data.europa.eu/esco/label/98570af6-b237-4cdd-b555-98fe3de26ef8>
a skosxl:Label ;
esco:hasLabelRole <http://data.europa.eu/esco/label-role/neutral> , <http://data.europa.eu/esco/label-role/male> , <http://data.europa.eu/esco/label-role/female> ;
skosxl:literalForm "particleboard machine technician"@en .
<http://data.europa.eu/esco/label/aaac5531-fc8d-40d5-bfb8-fc9ba741ac21>
a skosxl:Label ;
esco:hasLabelRole "http://data.europa.eu/esco/label-role/female" , "http://data.europa.eu/esco/label-role/standard-female" ;
skosxl:literalForm "pracovnice denní péče o děti"@cs .
And it goes on like this for 400 more MB. Additional attributes are added for some, but not all, nodes.
It reminds me of some form of XML, but I don't have much experience working with different formats. It also looks like something that can be modeled as a graph.
Do you have any idea what data format it is, and how I could parse it in python?
Yes, @Phil is correct: that is Turtle syntax for storing RDF data.
I would suggest you import it into an RDF store of some sort rather than try to parse 400MB+ yourself. You can use GraphDB, Blazegraph, Virtuoso, and the list goes on. A search for RDF stores should give many other options.
Then you can use SPARQL (which is to RDF stores what SQL is to relational databases) to query the store from Python using RDFLib. Here is an example from RDFLib.
That looks like Turtle, a data description language for the semantic web.
The esco: and skosxl: prefixes refer to two different vocabularies defined for sharing data (there should not be much problem finding them with a search engine, assuming the data is in the semantic web). esco:hasLabelRole and skosxl:Label are terms from those vocabularies, and skosxl:literalForm could be thought of as the value in an XML tag.
They represent ontologies in a data structure:
Subject : 10
Predicate : Name
Object : John
As for Python: read the data as a file and use the subjects as the keys of a dictionary, or put the values in a database; it's unclear what you want to do with the data.
Semantic data is open, incomplete, and can have an unusual, complex structure. The example above is very simple; the primer linked above may help.
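The "subject as dictionary key" idea might look like the sketch below. It assumes the triples have already been parsed into (subject, predicate, object) tuples; the second triple and the function name group_by_subject are hypothetical additions for illustration.

```python
from collections import defaultdict

# Triples as plain tuples; the first one mirrors the example above.
triples = [
    ("10", "Name", "John"),
    ("10", "Age", "42"),  # hypothetical extra triple
]


def group_by_subject(triples):
    """Build {subject: {predicate: [objects, ...]}} from an iterable of triples."""
    by_subject = defaultdict(dict)
    for subject, predicate, obj in triples:
        # A predicate may occur several times per subject, so keep a list.
        by_subject[subject].setdefault(predicate, []).append(obj)
    return dict(by_subject)


nodes = group_by_subject(triples)
```

Keeping a list per predicate matters for data like the sample in the question, where one node can carry several values for the same property.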

Freemarker: find specific object in array of arrays

I have a complex many-to-many relationship defined. The cross-reference table is an entity, so I have Contact with a One-To-Many to ContactList, and List with a One-To-Many to ContactList. ContactList contains listID, contactID, and a few Booleans. The relationships seem to work well, and on the backend I can get a list of contacts on a review list using the Spring-Data-Jpa findByContactListsIn(Set).
However, I am trying to build a list of contacts in Freemarker, and show whether they were in the current list.
Before I made an Entity out of ContactList, I had a standard Many-To-Many relationship between them, and I was able to do something like this in my .ftl:
<#if list.contacts?seq_contains(contact)>
But I needed to add some data to ContactList specifically, so I needed it to be more complicated. How can I do something similar now? I tried:
<#if list.contactLists?seq_contains(contact)>
But of course that always returns false, because it is comparing two different entity types. Is there a way to find if a contact is in one of the contactList objects?
I suppose I could do some back-end trickery, but I am looking for a front-end solution to this.
Don't use ?seq_contains for finding generic objects at all. It doesn't call Object.equals; instead it works like the == operator of the template language, which only allows comparing strings, numbers, booleans and dates/times, and otherwise gives you an error. Unfortunately it won't fail in your case, because POJOs are also strings (and their string value is what toString() returns). This is an unfortunate legacy of the stock ObjectWrapper (scheduled to be fixed in FM3), not a quirk of the template language. Ideally you would get an error there. Instead, it now silently compares the return values of the toString() calls.
Your data-model should already contain what the template should actually display. FTL is not a programming language, so if you try to extract that from the data-model in the template, it will be a pain. That said, "the data-model contains the data" can also mean that some objects in the data-model have methods that extract the data you need. As a last resort, you can add objects that just contain helper methods.
Update: Returning to ?seq_contains, if you need the Java semantics and list is a Java Collection, you can just use the Java API: list?api.contains(contact).

Is it possible to escape &amp; &apos; present in database?

The data retrieved from the database contains &amp; or &apos;. How do I unescape these and show them as & or ' without using the gsub method?
If you can't stop the data from being inserted like that, then there is code here to create a function in MySQL that you can use in your query in order to return the decoded data.
Or from within Ruby, not using a replace strategy, take a look at how-do-i-encode-decode-html-entities-in-ruby.
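For comparison, the same "decode rather than search-and-replace" approach can be sketched in Python with the standard library's html.unescape. The function name decode_entities and the sample string are made up for this illustration; the Ruby answer linked above remains the relevant one for a Ruby code base.

```python
import html


def decode_entities(text):
    """Decode HTML character references such as &amp; and &apos;."""
    return html.unescape(text)


decoded = decode_entities("Tom &amp; Jerry&apos;s")
```

The point is the same in either language: a general-purpose entity decoder handles every named and numeric reference, whereas a hand-written replace only covers the cases you thought of.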
First of all, an escape sequence is found in string analysis only, not in HTML or XML, where you talk of masking. You can escape a string for reasons of concatenation, for example. HTML entities are specific entities that are substituted in to mask a special character. It is absolutely wrong to save strings that still contain HTML entities in a DB table. The masked string has to be unmasked first, after you "reget" it from the POST :). Otherwise you are trying to save HTML entities in a special table, e.g. for programming reasons. A text file should do better - try dBase 2 - or simply google the web for a page with an entity listing.
The second point is that XML is thought of as a personally defined markup language, meant to make your own code easier to read (in general). That is why any non-standard tags within that specification have to be defined by you. (It was strange to read about regular entities as "XML entities", as in the case of "&apos(;)", explained on this entity page: http://www.madore.org/~david/computers/unicode/htmlent.html)
Standard XML tags (not entities) mainly matter for finalizing your HTML code so that it fits better with the programming languages used later on, but in my opinion the ones mentioned are still HTML entities!
This can and should be performed at the view level, i.e., the front-end, since it's an HTML entity.
Assuming you use jQuery, you can do this to make &apos; appear as ' in the HTML.
$('<div/>').html('&apos;').text()
You can find the respective entity values in the link above.

Which word do you use to describe a JSON-like object?

I have a naming issue.
If I read an object x from some JSON I can call my variable xJson (or some variation). However, sometimes the data could have come from a number of different sources, among which JSON is not special (e.g. XMLRPC, programmatically constructed from Maps, Lists & primitives, etc.).
In this situation, what is a good name for the variable? The best I have come up with so far is something like 'DynamicData', which is OK in some situations, but is a bit long and probably not very clear to people unfamiliar with the convention.
SerializedData?
A hierarchical collection of hashes and lists of data is often referred to as a document no matter what serialization format is used. Another useful description might be payload or body in the sense of a message body for transmission or a value string written to a key/value store.
I tend to call the object hierarchy a "doc" myself, and the serialized format a "document." Thus a RequestDocument is parsed into a RequestDoc, and upon further identification it might become an OrderDoc, or a CustomerUpdateDoc, etc. An InvoiceDoc might become known generically as a ResponseDoc eventually serialized to a ResponseDocument.
The longer form is awkward, but such serialized strings are typically short-lived and localized in the code anyway.
If your data is the model, name it after the model it's representing; i.e., name it after the purpose of the contents, not the format of the contents. So if it's a list of customer information, name it "customers", or "customerModel", or something like that.
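A small sketch of that naming advice, in Python for brevity: the function name parse_customers, the "customers" key, and the sample payload are all hypothetical, and the same reasoning applies whatever language or source format is in play.

```python
import json


def parse_customers(payload):
    """Return the customer records from a serialized payload.

    The result is named for what it holds (customers), not for the
    format it happened to arrive in (JSON in this sketch).
    """
    return json.loads(payload)["customers"]


customers = parse_customers('{"customers": [{"id": 1, "name": "Ada"}]}')
```

If the same records later arrive via XML-RPC or are built from maps and lists in code, only the parsing function changes; every caller keeps working with a variable called customers.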
If you don't know what the contents are, the name isn't important, unless you want to differentiate the format. "responseData", "jsonResponse", etc...
And "DynamicData" isn't a long name, unless there is absolutely nothing descriptive to be said about the data. "data" might be just fine.