How to persist this JSON in a database?

My JSON is as follows:
[
{"group":{"key":"Chinese","title":"Chinese","shortTitle":"Chinese","recipesCount":0,"description":"Chinese cuisine is any of several styles originating from regions of China, some of which have become increasingly popular in other parts of the world – from Asia to the Americas, Australia, Western Europe and Southern Africa. The history of Chinese cuisine stretches back for many centuries and produced both change from period to period and variety in what could be called traditional Chinese food, leading Chinese to pride themselves on eating a wide range of foods. Major traditions include Anhui, Cantonese, Fujian, Hunan, Jiangsu, Shandong, Szechuan, and Zhejiang cuisines. ","rank":"","backgroundImage":"images/Chinese/chinese_group_detail.png", "headerImage":"images/Chinese/chinese_group_header.png"},
"key":1000,
"title":"Abalone Egg Custard",
"shortTitle" : "Abalone Egg Custard",
"serves":4,
"perServing":"65kcal / 2.2g fat",
"favorite":false,
"rating": 3 ,
"directions":["Step 1.","Step 2.","Step 3.","Step 4.","Step 5."],
"backgroundImage":"images/Chinese/AbaloneEggCustard.jpg",
"healthytips":["Tip 1","Tip 2","Tip 3"],
"nutritions":["Calories 65kcal","Total Fat 2.2g","Carbs 4g","Cholesterol 58mg","Sodium 311mg","Dietary Fibre 0.3g"],
"ingredients":["1 head Napa or bok choy cabbage","1/4 cup sugar","1/4 teaspoon salt","3 tablespoons white vinegar","3 green onions","1 (3-ounce) package ramen noodles with seasoning pack","1 (6-ounce) package slivered almonds","1 tablespoon sesame seeds","1/2 cup vegetable oil"]}
]
How should I persist this in a database? At the end of the day I have to read it back from the database and be able to parse it using WebAPI.

Persist it as a CLOB data type in your database in the likely event that the length is going to exceed the limits of a varchar.

There are so many potential answers here -- you'll need to provide many more details to get a specific answer.
Database
What database are you using? Is it relational, object, or NoSQL? From a NoSQL perspective, saving it as a lump is likely fine. From an RDBMS perspective (like SQL Server), you map all the fields down to a series of rows in a set of related tables. If you're using a relational database, just jamming an unparsed, unvalidated lump of JSON text into the database is the wrong way to go. Why bother using a database that provides DRI (declarative referential integrity) at all?
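As a rough illustration of the relational approach, here is a minimal sketch using Python's built-in sqlite3 module. The table names (recipe_group, recipe, ingredient), the column names, and the helper save_recipes are assumptions made up for this example, and only a handful of the fields from the question are mapped.

```python
import json
import sqlite3

# Hypothetical schema: one row per group, one per recipe, one per ingredient.
SCHEMA = """
CREATE TABLE recipe_group (
    id    TEXT PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE recipe (
    id       INTEGER PRIMARY KEY,
    group_id TEXT NOT NULL REFERENCES recipe_group(id),
    title    TEXT NOT NULL,
    serves   INTEGER
);
CREATE TABLE ingredient (
    recipe_id   INTEGER NOT NULL REFERENCES recipe(id),
    position    INTEGER NOT NULL,
    description TEXT NOT NULL,
    PRIMARY KEY (recipe_id, position)
);
"""

def save_recipes(conn, raw_json):
    """Parse the JSON array and spread each recipe over the related tables."""
    for item in json.loads(raw_json):
        group = item["group"]
        # The same group appears on many recipes, so insert it only once.
        conn.execute(
            "INSERT OR IGNORE INTO recipe_group (id, title) VALUES (?, ?)",
            (group["key"], group["title"]),
        )
        conn.execute(
            "INSERT INTO recipe (id, group_id, title, serves) VALUES (?, ?, ?, ?)",
            (item["key"], group["key"], item["title"], item["serves"]),
        )
        conn.executemany(
            "INSERT INTO ingredient (recipe_id, position, description) VALUES (?, ?, ?)",
            [(item["key"], i, text) for i, text in enumerate(item["ingredients"])],
        )
    conn.commit()

# A cut-down version of the JSON from the question.
sample = json.dumps([{
    "group": {"key": "Chinese", "title": "Chinese"},
    "key": 1000,
    "title": "Abalone Egg Custard",
    "serves": 4,
    "ingredients": ["1/4 cup sugar", "1/4 teaspoon salt"],
}])

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # have SQLite enforce the references
conn.executescript(SCHEMA)
save_recipes(conn, sample)
rows = conn.execute(
    "SELECT description FROM ingredient WHERE recipe_id = ? ORDER BY position",
    (1000,),
).fetchall()
```

The remaining arrays (directions, healthytips, nutritions) would presumably get child tables of the same shape as ingredient.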
Data Manipulation Layer
Implicit in your question is which data access technology you'll use: it could be LINQ to SQL, straight ADO, a micro ORM like Dapper, Massive, or PetaPoco, or a full-blown ORM like Entity Framework or NHibernate.
Have you picked one of these or are you looking for guidance on selecting one?
Parsing in WebAPI
Converting from JSON to an object or an object to JSON is easy in WebAPI. For JSON specifically, the JSON.Net formatter is hanging around. You can get started by looking here, here, here, and here.
Conceptually, however, it sounds like you're missing part of the magic of WebAPI. With WebAPI you return your object in its native state (or IQueryable if you want OData support). After your function call finishes, the formatters take over and serialize it into the proper shape based on the client request. This process is called content negotiation. The idea is that your methods are format-agnostic and the framework serializes the data into the transport format your client wants (XML, JSON, whatever).
The reverse is true too, where the framework deserializes the format provided by the client into a native object.

Related

Understanding openaddresses data format

I have downloaded us-west geolocation data (postal addresses) from openaddresses.io. Some of the addresses in the datasets are not complete, i.e., some of them don't have info like zip_code. Is there a way to retrieve it, or is the data incomplete?
I have tried searching the other files hoping to find any related info. The complete dataset doesn't contain any info related to it. The City of Mesa, AZ has multiple zip codes, so it is hard to assign one to an address. Is there any way to address this problem?
This is what the data looks like (City of Mesa, AZ):
LON,LAT,NUMBER,STREET,UNIT,CITY,DISTRICT,REGION,POSTCODE,ID,HASH
-111.8747353,33.456605,790,N DOBSON RD,,SRPMIC,,,,,dc0c53196298eb8d
-111.8886227,33.4295194,2630,W RIO SALADO PKWY,,MESA,,,,,c38b700309e1e9ce
-111.8867018,33.4290795,2401,E RIO SALADO PKWY,,TEMPE,,,,,9b912eb2b1300a27
-111.8832045,33.4232903,700,S EVERGREEN RD,,TEMPE,,,,,3435b99ab3f4f828
-111.8761202,33.4296416,2100,W RIO SALADO PKWY,,MESA,,,,,b74349c833f7ee18
-111.8775844,33.4347782,1102,N RIVERVIEW,,MESA,,,,,17d0cf1542c66083
Short answer: the data is incomplete.
The data in OpenAddresses.io is only as complete as the datasource it pulls from. OpenAddresses is just an aggregation of publicly available datasets. There's no real consistency between government agencies that make their data available. As a result, other sections of the OpenAddresses dataset might have city names or zip codes, but there's often something missing.
If you're looking to fill in the missing data, take a look at how projects like Pelias use multiple data sources to augment missing data.
Personally, I always end up going back to OpenStreetMap (OSM). One could argue that OpenAddresses is better quality because it comes from official sources and doesn't try to fill in data using approximations, but the large gaps of missing data make it far less useful, at least on its own.

When creating a file format, is it better to create multiple formats or have several optional sections?

Let's say I am the technical lead for a software company specializing in writing applications for chefs that help organize recipes. Starting out, we are developing one app for bakers who make cakes and another for sushi chefs. One of the requirements is to create a standard file format for importing and exporting recipes. (This file format will go on to be an industry standard, with other companies using it to interface with our products.) We are faced with two options: make a standard recipe format (let's say .recipe) which uses common properties where applicable and optional properties where they differ, or make independent formats for each application (let us say .sushi and .cake).
Imagine the file format would look something like this for sushi:
{
"name":"Big California",
"type":"sushi",
"spiciness": 0,
"ingredients": [
{
"name":"rice",
"amount": 20.0,
"units": "ounces"
},
{
...
}
]
}
and imagine the file format would look something like this for cakes:
{
"name":"Wedding Cake",
"type":"cake",
"layers": 3,
"ingredients": [
{
"name":"flour",
"amount": 40.0,
"units": "ounces"
},
{
...
}
]
}
Notice the file formats are very similar, with only the spiciness and layers properties differing between them. Undoubtedly, as the applications grow in complexity and sophistication, many more specialized properties will be added. There will also be more applications added to the suite for other types of chefs. With this context,
Is it wiser to have each application read/write .recipe files that adhere to a somewhat standardized interface, or is it wiser to remove all interdependence and have each application read/write their respective .sushi and .cake file types?
This kind of thing can be a very, very deep thing to get right. I think a lot of it depends on what you want to be able to do with recipes beyond simply displaying them for use by a chef.
For example, does one want to normalise data across the whole system? That is, when a recipe specifies "flour", what do you want to say about that flour, and how standardised do you want that to be? Imagine a chef is preparing an entire menu, and wants to know how much "plain white high gluten zero additives flour" is used by all the recipes in that menu. They might want to know this so they know how much to buy. There's actually quite a lot you can say about just flour that means simply having "flour" as a data item in a recipe may not go far enough.
The "modern" way of going about these things is to simply have plain text fields, and rely on some kind of flexible search to make equivalency associations between fields like "flour, white, plain, strong" and "high gluten white flour". That's what Google does...
The "proper" way to do it is to come up with a rigid schema that allows "flour" to be fully specified. It's going to be hard to come up with a file / database format schema that can exhaustively and unambiguously describe every single possible aspect of "flour". And if you take it too far, then you have the problem of making associations between two different records of "flour" that, for all reasonable purposes are identical, but differ in some minor aspect. Suppose you had a field for particle size; the search would have to be clever enough to realise that there's no real difference in flours that differ by, for example, 0.5 micrometer in average particle size.
We've discussed the possible extent of the definition of a "flour". One also has to consider the method by which the ingredient is prepared. That adds a whole new dimension of difficulty. And then one would have to describe all the conceivable kitchen utensils too. One can see the attractions of free text...
With that in mind, I would aim to have a single file format (.recipe), but not to break down the data too much. I would forget about trying to categorise each and every ingredient down to the last possible level of detail. Instead, for each ingredient I'd have a free text description, then perhaps a well structured quantity field (e.g. a number and a unit, 1 and cup), and finally a piece of free text describing the ingredient preparation (e.g. sieved). Then I'd have something that describes a preparation step, referencing the ingredients; that would have some free text fields and structured fields ("sieve the", <ingredient reference>, "into a bowl"). The file will contain a list of these. You might also have a list of required utensils, and a general description field too. You'll be wanting to add structured fields for recipe metadata (e.g. 'cake', or 'sushi', or serves 2).
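To make that a little more concrete, here is a small sketch of what such a single .recipe structure might look like in Python. All of the class and field names (Ingredient, Recipe, extras, and so on) are invented for illustration, and putting type-specific properties such as layers or spiciness into an optional extras dictionary is just one possible way to handle them.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Ingredient:
    description: str       # free text, e.g. "high gluten white flour"
    amount: float          # structured quantity: a number...
    units: str             # ...and a unit
    preparation: str = ""  # free text, e.g. "sieved"

@dataclass
class Recipe:
    name: str
    type: str                                    # metadata, e.g. "cake" or "sushi"
    ingredients: list = field(default_factory=list)
    steps: list = field(default_factory=list)    # free-text preparation steps
    extras: dict = field(default_factory=dict)   # optional, type-specific properties

def dump_recipe(recipe):
    """Serialize a Recipe (and its nested ingredients) to JSON text."""
    return json.dumps(asdict(recipe), indent=2)

def load_recipe(text):
    """Rebuild a Recipe from JSON text produced by dump_recipe."""
    data = json.loads(text)
    data["ingredients"] = [Ingredient(**item) for item in data["ingredients"]]
    return Recipe(**data)

cake = Recipe(
    name="Wedding Cake",
    type="cake",
    ingredients=[Ingredient("flour", 40.0, "ounces", "sieved")],
    steps=["Sieve the flour into a bowl."],
    extras={"layers": 3},
)
round_tripped = load_recipe(dump_recipe(cake))
```

An application that does not know about layers can still read the file and simply ignore that entry in extras.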
Or something like that. Having some structure allows some additional functionality to be implemented (e.g. tidy layout of the recipe on a screen). Just having a single free-text field for the whole thing means that it'd be difficult to add, say, an ingredient ordering feature - who is to say what lines in the text represent ingredients?
Having separate file formats would involve coming up with a large number of schemas. It would be even more unmanageable.

Association Rules for Text File

I am a student using RapidMiner, and I am doing a project using Yummly's What's Cooking dataset (https://www.kaggle.com/c/whats-cooking/data). The dataset has 20 different cuisine types (e.g. Italian, Chinese, Indian, etc.).
Our goal is to develop a data mining model that identifies the cuisine type of future dishes by analyzing the ingredient list of the dish. We are using association rules to do so. However, I keep getting "no rules found" and have no idea why. I am thinking this has something to do with my attributes being formatted as text and not using the nominal to binominal operator, but am not sure how to fix it.
Currently my process looks like....
data -> select attributes -> FP growth -> create association rules
Can you help?
According to the documentation for the FP-Growth operator, all the attributes in the example set need to be binominal.
I'll admit I haven't looked at the data directly because I didn't want to register an account on Kaggle, so I'm not sure exactly how it's formatted, but you would probably want to set the type of cuisine as a label and then have each of the remaining attributes represent an ingredient that is included in one or more of the recipes. Each dish would have a 1 in the column if the ingredient is used and a 0 if it's not used. (Depending on the original format of the data, since you mentioned it's text, you may want to check out the text processing extension, which can create an example set like what I just described.) Then, if you convert the 0s and 1s to binominal, you should be able to use FP-Growth.
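The reshaping described above can be sketched outside RapidMiner as well. The snippet below is a minimal Python illustration, assuming each dish is a record with a cuisine and a list of ingredients; the sample dishes and the function name to_binary_table are made up, and the real dataset may be laid out differently.

```python
def to_binary_table(dishes):
    """Return (ingredient columns, rows), with one row per dish and a
    True/False value for every ingredient seen anywhere in the data."""
    columns = sorted({ing for dish in dishes for ing in dish["ingredients"]})
    rows = []
    for dish in dishes:
        present = set(dish["ingredients"])
        row = {"cuisine": dish["cuisine"]}  # the label
        row.update({col: col in present for col in columns})
        rows.append(row)
    return columns, rows

# Made-up sample records, not taken from the actual dataset.
dishes = [
    {"cuisine": "chinese", "ingredients": ["soy sauce", "rice"]},
    {"cuisine": "italian", "ingredients": ["tomato", "rice"]},
]
columns, rows = to_binary_table(dishes)
```

Each row now has the cuisine label plus one two-valued attribute per ingredient, which is the shape the answer suggests feeding into FP-Growth.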

What data format is this?

I was checking a share trading site's AJAX response, and below is what showed up in the Firebug Response tab of the XHR section. Can anyone explain what format this is and how it is parsed?
<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<566v=Tata Steel Ltdv=TATSTE>
<3199v=Ashram Online.com Ltdv=ASHONL>
<4866v=Kreon Finnancial Services Ltdv=KREFIN>
<552v=Tata Chemicals Ltdv=TATCHE>
<554v=Tata Power Company Ltdv=TATPOW>
<2986v=Tata Metaliks Ltdv=TATMET>
<300v=Tata Sponge Iron Ltdv=TATSPO>
<121v=Tata Coffee Ltdv=TATCOF>
<2295v=Tata Communications Ltdv=TATCOM>
<0v=Time In Milli-Secondsv=0>
I think what we are dealing with here is some proprietary format, likely an Eldritch SGML Horror of some sort.
Banking in general has all sorts of Eldritch horrors running about.
On a related note, this is very much not XML.
Edit:
A quick analysis* indicates that this is a format consisting of a series of statements bracketed by <>, with the parts of the statements separated by = or v=. = seems to indicate a parameter to a control statement, indicated by a two-letter code (<ST=tat>), while v= seems to indicate an assignment or coupling of some kind (short for "value"?), or perhaps just a field separator.
<ST appears to be short for "search term"; <TB appears to be short for "(source) table". The meaning of <SI eludes me. It is possible that <TB terminates the metadata section, but it's equally possible that the metadata section has a fixed number of terms.
As nothing refers to the number of fields in each statement in the data section, and they are all of the same length (3 fields), it is likely that the number of fields is fixed, but it might derive from the value of <TB, or even <SI, in some way.
What is abundantly clear, however, is that this data is not intended for consumption by applications other than the one that supplies it.
*Caveat: Without a much larger sample it's impossible to tell if this analysis is valid.
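Under the same caveat, the guessed structure can be turned into a small parser. The following Python sketch treats every <...> statement containing v= as a data record and every other statement containing = as metadata; the function name parse_response and that classification rule are assumptions based only on the sample above, not on any documentation of the format.

```python
import re

# One statement is everything between a "<" and the next ">".
STATEMENT = re.compile(r"<([^<>]*)>")

def parse_response(text):
    """Split a response into (metadata dict, list of records)."""
    meta, records = {}, []
    for body in STATEMENT.findall(text):
        if "v=" in body:
            # Data statement: fields separated by "v=".
            records.append(body.split("v="))
        elif "=" in body:
            # Control statement: two-letter code, "=", parameter.
            code, _, value = body.partition("=")
            meta[code] = value
    return meta, records

sample = """<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<566v=Tata Steel Ltdv=TATSTE>
<0v=Time In Milli-Secondsv=0>"""

meta, records = parse_response(sample)
```

A company name that itself contained "v=" would be split incorrectly, which is one more reason to treat this as a guess rather than a specification.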
It is not a commonly used "web format".
It is probably a proprietary format used by that site and will be parsed by their custom JavaScript.

How to search for a person's name in a text? (heuristic)

I have a huge list of person's full names that I must search in a huge text.
Only part of the name may appear in the text, and it may be misspelled, mistyped or abbreviated. The text has no tokens, so I don't know where a person's name starts in the text. And I don't know whether the name will appear in the text at all.
Example:
I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:
...The candidate Barack Obama was elected the president of the United States... (incomplete)
...The candidate Barack Hussein was elected the president of the United States... (incomplete)
...The candidate Barack H. O. was elected the president of the United States... (abbreviated)
...The candidate Barack ObaNa was elected the president of the United States... (misspelled)
...The candidate Barack OVama was elected the president of the United States... (mistyped, B is next to V)
...The candidate John McCain lost the election... (no occurrences of Obama's name)
Certainly there isn't a deterministic solution for it, but...
What is a good heuristic for this kind of search?
If you had to, how would you do it?
You said it's about 200 pages.
Divide it into 200 one-page PDFs.
Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.
Split everything on spaces, removing special characters (commas, periods, etc.). Then use something like Soundex to handle misspellings. Or you could go with something like Lucene if you need to search a lot of documents.
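A minimal Python sketch of that idea follows. The soundex function is a hand-written version of the classic American Soundex code, and find_name_parts is a hypothetical helper for this example; neither comes from a library.

```python
import re

# Soundex digit for each consonant; vowels, h, w and y have no digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return the four-character Soundex code of a word ("" if no letters)."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    result = word[0].upper()
    previous = CODES.get(word[0], "")
    for letter in word[1:]:
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        if letter not in "hw":  # h and w do not separate equal codes
            previous = code
    return (result + "000")[:4]

def find_name_parts(full_name, text):
    """Return the words in text that sound like any part of full_name."""
    wanted = {soundex(part) for part in full_name.split()}
    words = re.findall(r"[A-Za-z]+", text)  # drops commas, periods, etc.
    return [word for word in words if soundex(word) in wanted]

hits = find_name_parts(
    "Barack Hussein Obama",
    "The candidate Barack OVama was elected the president of the United States",
)
misses = find_name_parts(
    "Barack Hussein Obama",
    "The candidate John McCain lost the election",
)
```

This catches the misspelled and mistyped examples from the question because m/n and b/v share a Soundex digit, but it does not handle abbreviations such as "H. O.", and on a large text it will likely produce false positives that need a second filtering pass.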
What you want is a Natural Language Processing library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns, then it will be easy; if there are a decent number of other proper nouns mixed in, then it will be more difficult. If you are writing in Java, look at OpenNLP, or for C#, SharpNLP. After extracting all the proper nouns you could probably use WordNet to remove most non-name proper nouns. You may be able to use WordNet to identify subparts of names like "John" and then search the neighboring tokens to suck up other parts of the name. You will have problems with something like "John Smith Industries". You will have to look at your underlying data to see if there are features that you can take advantage of to help narrow the problem.
Using an NLP solution is the only really robust technique I have seen for similar problems. You may still have issues since 200 pages is actually fairly small. Ideally you would have more text and be able to use more statistical techniques to help disambiguate between names and non-names.
At first blush I'm going for an indexing server: Lucene, FAST or Microsoft Indexing Server.
I would use C# and LINQ. I'd tokenize all the words on space and then use LINQ to sort the text (and possibly use the Distinct() function) to isolate all the text that I'm interested in. When manipulating the text I'd keep track of the indexes (which you can do with LINQ) so that I could relocate the text in the original document - if that's a requirement.
The best way I can think of would be to define grammars in Python NLTK. However, it can get quite complicated for what you want.
I'd personally go for regular expressions while generating a list of permutations with some programming.
Both SQL Server and Oracle have built-in SOUNDEX Functions.
Additionally there is a built-in function for SQL Server called DIFFERENCE, that can be used.
Plain old regular expression scripting will do the job.
Use Ruby, it's quite fast. Read lines and match words.
cheers