Combining split cells in OpenRefine - JSON

I've been trying to build an Excel sheet of all the papers published by staff and students of my university. I used the Scopus API to retrieve information such as author, title and publication date, and it worked perfectly.
Since the retrieved data was a JSON file, I had to convert it to Excel, so I chose OpenRefine. When I converted the file, it created multiple rows whenever a paper had more than one author.
For example:
Sample Scopus
And my JSON response looks like this:
{
  "abstracts-retrieval-response": {
    "coredata": {
      "citedby-count": 0,
      "prism:volume": "430-431",
      "prism:pageRange": "240-246",
      "prism:coverDate": "2018-03-01",
      "dc:title": "Solving the 3-COL problem by using tissue P systems without environment and proteins on cells",
      "prism:aggregationType": "Journal",
      "prism:doi": "10.1016/j.ins.2017.11.022",
      "prism:publicationName": "Information Sciences"
    },
    "authors": {
      "author": [
        {
          "ce:given-name": "Daniel",
          "preferred-name": {
            "ce:given-name": "Daniel",
            "ce:initials": "D.",
            "ce:surname": "Díaz-Pernil",
            "ce:indexed-name": "Díaz-Pernil D."
          },
          "#seq": 1,
          "ce:initials": "D.",
          "#_fa": true,
          "affiliation": {
            "#id": 60033284,
            "#href": "http://api.elsevier.com/content/affiliation/affiliation_id/60033284"
          },
          "ce:surname": "Díaz-Pernil",
          "#auid": 16645195100,
          "author-url": "http://api.elsevier.com/content/author/author_id/16645195100",
          "ce:indexed-name": "Diaz-Pernil D."
        },
        {
          "ce:given-name": "Hepzibah A.",
          "preferred-name": {
            "ce:given-name": "Hepzibah A.",
            "ce:initials": "H.A.",
            "ce:surname": "Christinal",
            "ce:indexed-name": "Christinal H."
          },
          "#seq": 2,
          "ce:initials": "H.A.",
          "#_fa": true,
          "affiliation": {
            "#id": 60100082,
            "#href": "http://api.elsevier.com/content/affiliation/affiliation_id/60100082"
          },
          "ce:surname": "Christinal",
          "#auid": 57197875639,
          "author-url": "http://api.elsevier.com/content/author/author_id/57197875639",
          "ce:indexed-name": "Christinal H.A."
        },
        {
          "ce:given-name": "Miguel A.",
          "preferred-name": {
            "ce:given-name": "Miguel A.",
            "ce:initials": "M.A.",
            "ce:surname": "Gutiérrez-Naranjo",
            "ce:indexed-name": "Gutiérrez-Naranjo M."
          },
          "#seq": 3,
          "ce:initials": "M.A.",
          "#_fa": true,
          "affiliation": {
            "#id": 60033284,
            "#href": "http://api.elsevier.com/content/affiliation/affiliation_id/60033284"
          },
          "ce:surname": "Gutiérrez-Naranjo",
          "#auid": 6506630834,
          "author-url": "http://api.elsevier.com/content/author/author_id/6506630834",
          "ce:indexed-name": "Gutierrez-Naranjo M.A."
        }
      ]
    }
  }
}
So how do I combine all the authors into a single cell for each title?

After importing the JSON into OpenRefine, you need to organise the project into records. See http://kb.refinepro.com/2012/03/difference-between-record-and-row.html for an explanation of the difference between 'rows' and 'records' in OpenRefine.
To get the project into records, move a column containing information that appears only once per record (e.g. the title column, which may be labelled something like "_ - abstracts-retrieval-response - coredata - dc:title" based on the JSON you've pasted here) to the start of the project. See http://kb.refinepro.com/2012/06/create-records-in-google-refine.html for more information on creating records in OpenRefine.
Once you have done this, switch to the 'records' view (click the 'records' link towards the top left of the data table) and then do as @Ettore-Rizza mentions in his comment: pick the column containing the names you want to use (e.g. the "_ - abstracts-retrieval-response - authors - author - _ - ce:indexed-name" column) and use the Edit cells -> Join multi-valued cells option from the drop-down menu at the top of the column.
Because each author of the article is described in the JSON with multiple fields (various name forms plus a URL), you'll need to either remove the other columns containing author information, or merge their multiple values into a single field by applying the 'Join multi-valued cells' option to each affected column (unless you need to retain this information, it is much easier to remove the unwanted columns).
Once this is done, and assuming no other fields have repeated data in the record, you should have a single row per title.
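For reference, the transformation that "Join multi-valued cells" performs can also be sketched outside OpenRefine. A plain JavaScript sketch (not OpenRefine itself; the helper name and separator are illustrative), assuming a parsed response shaped like the JSON in the question:

```javascript
// Collect every author's ce:indexed-name from the Scopus response and
// join them into the single cell value the records-mode join would produce.
function authorsCell(response, separator) {
  const authors = response["abstracts-retrieval-response"].authors.author;
  return authors.map(a => a["ce:indexed-name"]).join(separator);
}

// Minimal example shaped like the JSON in the question:
const response = {
  "abstracts-retrieval-response": {
    coredata: { "dc:title": "Solving the 3-COL problem ..." },
    authors: {
      author: [
        { "ce:indexed-name": "Diaz-Pernil D." },
        { "ce:indexed-name": "Christinal H.A." },
        { "ce:indexed-name": "Gutierrez-Naranjo M.A." }
      ]
    }
  }
};

console.log(authorsCell(response, "; "));
// Diaz-Pernil D.; Christinal H.A.; Gutierrez-Naranjo M.A.
```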

Related

JSON structure for a ticket reservation system with Firebase?

I'm currently developing a ticket reservation system where the user should be able to choose an event and see a list of sections; a section contains a list of seats, each either free or occupied/booked by another user. I've chosen Firebase as my backend, but I have very little experience with databases and none with JSON. How would I go about structuring a system like this?
This is what I got so far:
{
"events" : {
"e2017" : {
"name" : "event 2017",
"date" : "1490567256550"
}
},
"eventSections" : {
"e2017" : {
"e2017-A" : {
"isFull" : false,
"totalSeats": 40,
"bookedSeats": 20
}
}
},
"sectionSeats" : {
"e2017-A" : {
"A1": {
"isBooked" : true,
"bookedBy" : "userId"
},
"A2": {
"isBooked" : false,
"bookedBy" : false
}
}
}
}
The following structure is an example for the very specific use case in your question. You gave the following criteria:
1) user should be able to choose an event
2) see a list of sections
3) a section contains a list of seats
4) (seats are) either free or occupied/booked by another user
and here is a structure that meets those criteria:
events
  event_0
    sections
      section_0
        seats
          seat_0: "free"
          seat_1: "occupied"
      section_1
        seats
          seat_0: "occupied"
          seat_1: "occupied"
  event_1
    sections
      section_0
        seats
          seat_0: "free"
          seat_1: "occupied"
      section_1
        seats
          seat_0: "occupied"
          seat_1: "occupied"
While this is an answer, it's not the only one. For example: what if you wanted to know which events had free seats? You really can't query this structure for that data (unless you filter in code).
However, if you present a list of events to a user where they can click on the event to get more data, this structure would work fine as it could be used to present sections and seats available.
If you have a different specific use case, update your question with more data.
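As noted above, "which events have free seats" has to be filtered in code under this structure. A minimal sketch in plain JavaScript (the snapshot shape mirrors the structure sketched above; function and variable names are illustrative):

```javascript
// Hypothetical snapshot shaped like the events tree above.
const events = {
  event_0: {
    sections: {
      section_0: { seats: { seat_0: "free", seat_1: "occupied" } },
      section_1: { seats: { seat_0: "occupied", seat_1: "occupied" } }
    }
  },
  event_1: {
    sections: {
      section_0: { seats: { seat_0: "occupied", seat_1: "occupied" } }
    }
  }
};

// Client-side filter: keep event ids where any seat is "free".
function eventsWithFreeSeats(events) {
  return Object.keys(events).filter(eventId =>
    Object.values(events[eventId].sections).some(section =>
      Object.values(section.seats).includes("free")
    )
  );
}

console.log(eventsWithFreeSeats(events)); // [ 'event_0' ]
```

This kind of scan is fine for small datasets, but it is exactly the work a queryable structure (or a denormalised "hasFreeSeats" flag) would save you.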

Best way to handle data list of REST web service with foreign key(one to many)

I am going to implement a REST-based CRUD model in my app. I want to display a list of product data with edit and delete links.
Product
id, title, unit_id, product_type_id, currency_id,price
Q1: What should the JSON response look like?
Two formats come to mind for structuring the data in the JSON response of a REST GET call:
[
  {
    "id": 1,
    "title": "T-Shirt",
    "unit_id": 20,
    "unit_title": "abc",
    "product_type_id": 30,
    "product_type_title": "xyz",
    "currency_id": 10,
    "currency_name": "USD",
    "min_price": 20
  },
  {...}
]
and the other one is:
[
  {
    "id": 1,
    "title": "T-Shirt",
    "unit": {
      "id": 20,
      "title": "abc"
    },
    "product_type": {
      "id": 30,
      "title": "xyz"
    },
    "currency_id": {
      "id": 10,
      "name": "USD"
    },
    "min_price": 20
  },
  {...}
]
What is the better and more standard way to handle the above scenario?
Furthermore, suppose I have 10 more properties in the product table which are never displayed on the list page, but are needed when the user edits a specific item.
Q2: Should I load all the data at once when displaying the product list and pass it to the edit component,
or
load only the needed properties of the product table, pass the id to the product edit component, and make a new REST GET call with the id to fetch the remaining properties?
I am using React + Redux for my front end.
Typically, you would create additional API methods for consumers to retrieve the values that populate the lists of currency, product_type and unit when editing in a UI.
I wouldn't return more data than necessary for an individual Product object.
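For the nested shape in the second format, the server usually just maps flat rows into grouped objects. A sketch in plain JavaScript (the row fields come from the question; the function name, and using `currency` rather than `currency_id` as the nested key, are my own choices):

```javascript
// Map a flat product row (first format) into the nested second format,
// grouping each foreign key with its display field.
function toNested(row) {
  return {
    id: row.id,
    title: row.title,
    unit: { id: row.unit_id, title: row.unit_title },
    product_type: { id: row.product_type_id, title: row.product_type_title },
    currency: { id: row.currency_id, name: row.currency_name },
    min_price: row.min_price
  };
}

const flat = {
  id: 1, title: "T-Shirt",
  unit_id: 20, unit_title: "abc",
  product_type_id: 30, product_type_title: "xyz",
  currency_id: 10, currency_name: "USD",
  min_price: 20
};

console.log(toNested(flat).unit); // { id: 20, title: 'abc' }
```

The nested form is generally easier for clients to consume, since related fields travel together instead of being reassembled from parallel `*_id`/`*_title` pairs.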

Using JSON-based Database for unordered data

I am working on a simple app for Android. I am having some trouble using the Firebase database since it uses JSON objects and I am used to relational databases.
My data will consists of two users that share a value. In relational databases this would be represented in a table like this:
uname1 | uname2 | shared_value
in which the usernames are the keys. If I wanted all the values user Bob shares with other users, I could do a simple union query that would return the rows where:
uname1 == Bob or uname2 == Bob
However, in JSON databases, there seems to be a tree-like hierarchy in the data, which is complicated since I would not be able to search for users at the top level. I am looking for help in how to do this or how to structure my database for best efficiency if my most common search will be one similar to the one above.
In case this is not enough information, I will elaborate: My database would be structured like this:
{
  "Bob": {
    "Alice": {
      "shared_value": 2
    }
  },
  "Cece": {
    "Bob": {
      "shared_value": 4
    }
  }
}
As you can see from the example, Bob is included in two relationships, but looking into Bob's node doesn't show that information. (The relationship is commutative, so who comes "first" cannot be predicted.)
The most intuitive fix would be to duplicate all the data: for example, when we add Bob->Alice->2, also add Alice->Bob->2. In my experience with relational databases, duplication can be a big problem, which is why I haven't done this already. It also seems like an inefficient fix.
Is there a reason why you don't invert this? How about a collection like:
{ "_id": 2, "usernames":[ "Bob", "Alice"]}
{ "_id": 4, "usernames":[ "Bob", "Cece"]}
If you need all the values for "Bob", then index on "usernames".
EDIT:
If you need the two usernames to be a unique key, then do something like this:
{ "_id": {"uname1":"Bob", "uname2":"Alice"}, "value": 2 }
But this would still permit the creation of:
{ "_id": {"uname1":"Alice", "uname2":"Bob"}, "value": 78 }
(This issue is also present in your as-is relational model, btw. How do you handle it there?)
In general, I think implementing an array by creating multiple columns named "attr1", "attr2", "attr3", etc., and then having to search them all for a possible value, is an artifact of relational table modeling, which does not support array values. If you are converting to document-oriented storage, these really should be an embedded list of values; use the document paradigm and model them as such, instead of just reimplementing your table rows as documents.
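The contrast can be shown in a few lines of plain JavaScript (field names here are illustrative): searching the numbered-column form means probing each column, while the embedded list is a single membership test.

```javascript
// Relational-style row: an "array" faked with numbered columns.
const row = { id: 2, uname1: "Bob", uname2: "Alice" };

// Document-style: the same data as an embedded list.
const doc = { _id: 2, usernames: ["Bob", "Alice"] };

// Searching the row means checking every numbered column...
const inRow = ["uname1", "uname2"].some(col => row[col] === "Bob");

// ...while the document form is one membership test (and one index
// on "usernames" covers it in MongoDB).
const inDoc = doc.usernames.includes("Bob");

console.log(inRow, inDoc); // true true
```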
You can still keep the old structure:
[
{ username: 'Bob', username2: 'Alice', value: 2 },
{ username: 'Cece', username2: 'Bob', value: 4 },
]
You may want to create indexes on 'username' and 'username2' for performance. And then just do the same union.
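In MongoDB that union is a single `$or` query, e.g. `db.shares.find({$or: [{username: "Bob"}, {username2: "Bob"}]})` (the collection name is illustrative). The equivalent filter in plain JavaScript, for illustration:

```javascript
const shares = [
  { username: "Bob", username2: "Alice", value: 2 },
  { username: "Cece", username2: "Bob", value: 4 },
  { username: "Alice", username2: "Cece", value: 7 }
];

// Equivalent of: db.shares.find({ $or: [{ username: name }, { username2: name }] })
function sharesFor(docs, name) {
  return docs.filter(d => d.username === name || d.username2 === name);
}

console.log(sharesFor(shares, "Bob").map(d => d.value)); // [ 2, 4 ]
```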
To create a tree-like structure, the best way is to create an "ancestors" array that stores all the ancestors of a particular entry. That way you can query for either ancestors or descendants and all documents that are related to a particular value in the tree. Using your example, you would be able to search for all descendants of Bob's, or any of his ancestors (and related documents).
The answer above suggests:
{ "_id": {"uname1":"Bob", "uname2":"Alice"}, "value": 2 }
That is correct, but you don't get to see the relationship between Bob and Cece with this design. My suggestion, following MongoDB's tree-structure patterns, is to store ancestor keys in an ancestors array:
{ "_id": {"uname1":"Bob", "uname2":"Alice"}, "value": 2 , "ancestors": [{uname: "Cece"}]}
With this design you still get duplicates, which is something that you do not want. I would design it like this:
{"username": "Bob", "ancestors": [{"username": "Cece", "shared_value": 4}]}
{"username": "Alice", "ancestors": [{"username": "Bob", "shared_value": 2}, {"username": "Cece"}]}

How to enter multiple table data in MongoDB using JSON

I am trying to learn MongoDB. Suppose there are two tables and they are related, for example like this:
1st table has
First name- Fred, last name- Zhang, age- 20, id- s1234
2nd table has
id- s1234, course- COSC2406, semester- 1
id- s1234, course- COSC1127, semester- 1
id- s1234, course- COSC2110, semester- 1
How do I insert this data into MongoDB? I wrote it like this, but I'm not sure whether it is correct:
db.users.insert({
given_name: 'Fred',
family_name: 'Zhang',
Age: 20,
student_number: 's1234',
Course: ['COSC2406', 'COSC1127', 'COSC2110'],
Semester: 1
});
Thank you in advance
This would be fine, assuming that what you want to model has the "student_number" and the "Semester" as what is basically a unique identifier for the entries. But there is a way to do this without accumulating the array contents in code.
You can make use of the upsert functionality in the .update() method, with the help of a few other operators in the statement.
I am going to assume you are doing this inside a loop of sorts, so everything on the right-hand side is actually a variable:
db.users.update(
{
"student_number": student_number,
"Semester": semester
},
{
"$setOnInsert": {
"given_name": given_name,
"family_name": family_name,
"Age": age
},
"$addToSet": { "courses": course }
},
{ "upsert": true }
)
What this does in an "upsert" operation is first look for a document in your collection that matches the query criteria given - in this case a "student_number" with the current "Semester" value.
When that match is found, the document is merely "updated". So what is being done here is using the $addToSet operator to "update" only unique values into the "courses" array element. It seems to make sense for courses to be unique, but if that is not your case you can simply use the $push operator instead. Either way, that is the operation you want to happen every time, whether the document was "matched" or not.
In the case where no "matching" document is found, a new document will then be inserted into the collection. This is where the $setOnInsert operator comes in.
So the point of that section is that it will only be called when a new document is created as there is no need to update those fields with the same information every time. In addition to this, the fields you specified in the query criteria have explicit values, so the behavior of the "upsert" is to automatically create those fields with those values in the newly created document.
After a new document is created, then the next "upsert" statement that uses the same criteria will of course only "update" the now existing document, and as such only your new course information would be added.
Overall working like this allows you to "pre-join" the two tables from your source with an appropriate query. Then you are just looping the results without needing to write code for trying to group the correct entries together and simply letting MongoDB do the accumulation work for you.
Of course you can always just write the code to do this yourself and it would result in fewer "trips" to the database in order to insert your already accumulated records if that would suit your needs.
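A sketch of that do-it-yourself accumulation in plain JavaScript (the row shape is a hypothetical pre-join of the two tables; field names follow the insert statement in the question):

```javascript
// Joined rows as they might come from the two source tables.
const rows = [
  { student_number: "s1234", given_name: "Fred", family_name: "Zhang",
    age: 20, semester: 1, course: "COSC2406" },
  { student_number: "s1234", given_name: "Fred", family_name: "Zhang",
    age: 20, semester: 1, course: "COSC1127" },
  { student_number: "s1234", given_name: "Fred", family_name: "Zhang",
    age: 20, semester: 1, course: "COSC2110" }
];

// Group by student/semester and build one document per group,
// mirroring client-side what the upsert does on the server.
function accumulate(rows) {
  const docs = {};
  for (const r of rows) {
    const key = r.student_number + ":" + r.semester;
    if (!docs[key]) {
      docs[key] = {
        student_number: r.student_number,
        Semester: r.semester,
        given_name: r.given_name,
        family_name: r.family_name,
        Age: r.age,
        courses: []
      };
    }
    // Like $addToSet: only add a course if it isn't already present.
    if (!docs[key].courses.includes(r.course)) docs[key].courses.push(r.course);
  }
  return Object.values(docs); // ready for a single insert per document
}

console.log(accumulate(rows)[0].courses); // [ 'COSC2406', 'COSC1127', 'COSC2110' ]
```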
As a final note, though it does require some additional complexity, you can get better performance out of the operation by using the newly introduced "batch updates" functionality. For this, your MongoDB server version will need to be 2.6 or higher. It is one way of keeping the logic simple while making fewer actual "over the wire" writes to the database.
You can either have two separate collections - one with student details and the other with courses - and link them by "id".
Or you can have a single document with the courses as an inner array of documents, as below:
{
"FirstName": "Fred",
"LastName": "Zhang",
"age": 20,
"id": "s1234",
"Courses": [
{
"courseId": "COSC2406",
"semester": 1
},
{
"courseId": "COSC1127",
"semester": 1
},
{
"courseId": "COSC2110",
"semester": 1
},
{
"courseId": "COSC2110",
"semester": 2
}
]
}

Is mongodb (or other nosql dbs) the best solution for the following scenario?

Considering the following data structures, which system (RDBMS or NoSQL) would be better for QUERYING the data once stored? The fields within the metadata field are user-defined and will differ from user to user. Possible values are strings, numbers, "dates" or even arrays.
var file1 = {
  id: 123, name: "mypicture", owner: 1,
  metadata: {
    people: ["Ben", "Tom"],
    created: "2013/01/01",
    license: "free",
    rating: 4
    ...
  },
  tags: ["tag1", "tag2", "tag3", "tag4"]
}
var file2 = {
  id: 155, name: "otherpicture", owner: 1,
  metadata: {
    people: ["Tom", "Carla"],
    created: "2013/02/02",
    license: "free",
    rating: 4
    ...
  },
  tags: ["tag4", "tag5"]
}
var file1OtherUser = {
  id: 345, name: "mydocument", owner: 2,
  metadata: {
    autors: ["Mike"],
    published: "2013/02/02",
    …
  },
  tags: ["othertag"]
}
Our users should have the ability to search/filter their files:
User 1: Show all files where "Tom" is in "people" array
User 1: Show all files "created" between 2013/01/01 and 2013/02/01
User 1: Show all files having "license" "free" and "rating" greater 2
User 2: Show all files "published" in "2012" and tagged with "important"
...
Results should be filterable in the way OS X smart folders work. The individual metadata fields are defined before files are uploaded/stored, but they may also change afterwards; e.g. User 1 may rename the metadata field "people" to "cast".
As @WiredPrairie said, the fields within the metadata field look variable, perhaps dependent on what the user enters, which is supported by:
User 1 may rename the metadata field "people" to "cast".
MongoDB cannot create variable indexes whereby you just say that every new field in metadata gets added to a compound index; however, you could use a key-value structure like so:
var file1 = {
  id: 123, name: "mypicture", owner: 1,
  metadata: [
    {k: "people", v: ["Ben", "Tom"]},
    {k: "created", v: "2013/01/01"}
  ],
  tags: ["tag1", "tag2", "tag3", "tag4"]
}
That is one method of doing this, allowing you to index dynamically on both k and v within the metadata field. You would then query it like so:
db.col.find({metadata: {$elemMatch: {k: "people", v: ["Ben"]}}})
However this does introduce another problem: $elemMatch works on top-level elements, not nested ones. Imagine you wanted to find all files where "Ben" was one of the people; you can't use $elemMatch here, so you would have to do:
db.col.find({"metadata.k": "people", "metadata.v": "Ben"})
The immediate problem with this query lies in how MongoDB matches it. When it queries the metadata field it says: where some element's "k" equals "people" and some element's "v" equals "Ben".
Since these two conditions can be satisfied by different elements of the multi-value field, you can run into the problem where, even though "Ben" is not in the people list, he exists in another field of the metadata and you pick out the wrong documents; i.e. this query would match:
var file1 = {
  id: 123, name: "mypicture", owner: 1,
  metadata: [
    {k: "people", v: ["Tom"]},
    {k: "created", v: "2013/01/01"},
    {k: "person", v: "Ben"}
  ],
  tags: ["tag1", "tag2", "tag3", "tag4"]
}
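The false positive can be illustrated in plain JavaScript by approximating the two matching semantics (the helper names are mine; this is a simplified model, not MongoDB's actual matcher):

```javascript
const file = {
  id: 123, name: "mypicture", owner: 1,
  metadata: [
    { k: "people", v: ["Tom"] },
    { k: "created", v: "2013/01/01" },
    { k: "person", v: "Ben" }
  ]
};

// Approximation of {"metadata.k": "people", "metadata.v": "Ben"}:
// each condition may be satisfied by a *different* array element.
function dotNotationMatch(doc) {
  const someK = doc.metadata.some(m => m.k === "people");
  const someV = doc.metadata.some(m =>
    m.v === "Ben" || (Array.isArray(m.v) && m.v.includes("Ben")));
  return someK && someV;
}

// Approximation of $elemMatch: both conditions on the *same* element.
function elemMatch(doc) {
  return doc.metadata.some(m =>
    m.k === "people" &&
    (m.v === "Ben" || (Array.isArray(m.v) && m.v.includes("Ben"))));
}

console.log(dotNotationMatch(file)); // true  (false positive)
console.log(elemMatch(file));        // false (Ben is not in "people")
```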
The only real way to solve this is to factor the dynamic fields off into another collection, where you don't have this problem.
This creates a new problem, though: you can no longer fetch a full file in a single round trip, nor can you aggregate the file row and its user-defined fields in one go. So, all in all, you lose a number of abilities by doing this.
That being said you can still perform quite a few queries, i.e.:
User 1: Show all files where "Tom" is in "people" array
User 1: Show all files "created" between 2013/01/01 and 2013/02/01
User 1: Show all files having "license" "free" and "rating" greater 2
User 2: Show all files "published" in "2012" and tagged with "important"
All of those would still be possible with this schema.
As for which is better - RDBMS or NoSQL - it is difficult to say here; I would say both could be quite good at querying this structure, if done right.