converting flattened json from list into dataframe in python - json

I have a nested json file which I managed to flatten, but as a result I got a list which looks like this:
[{'people_gender': 'Female',
'people_age_group': 'Young adult',
'people_distance': 91,
'time': 0.33},
{'people_gender': 'Male',
'people_age_group': 'Adult',
'people_distance': 88,
'time': 0.66}]
These are only two first instances of the list but there is of course no point of copying the whole list. Now I would like to convert it into the dataframe so the 'people_gender', 'people_age_group', 'people_distance' and 'time' are columns and in the rows are the results for respective time moments.
I simply tried:
df = pd.DataFrame(np.array(file))
but this just gives me the data frame with one column and in rows there are every entries for the given time moments and I don't know how to tackle it from there.

You can use json_normalize() to get this json in

Related

How to omit the header in when use spark to read csv.file?

I am trying to use Spark to read a csv file in jupyter notebook. So far I have
spark = SparkSession.builder.master("local[4]").getOrCreate()
reviews_df = spark.read.option("header","true").csv("small.csv")
reviews_df.collect()
This is how the reviews_df looks like:
[Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5'),
Row(reviewerID=u'A2YB0B3QOHEFR', asin=u'B000JJSRNY', overall=u'5'),
Row(reviewerID=u'AAI0092FR8V1W', asin=u'B0060MYKYY', overall=u'5'),
Row(reviewerID=u'A2TAPSNKK9AFSQ', asin=u'6303187218', overall=u'5'),
Row(reviewerID=u'A316JR2TQLQT5F', asin=u'6305364206', overall=u'5')...]
But each row of the data frame contains the column names, how can I reformat the data, so that it can become:
[(u'A1YKOIHKQHB58W', u'B0001VL0K2', u'5'),
(u'A2YB0B3QOHEFR', u'B000JJSRNY', u'5')....]
Dataframe always returns Row objects, thats why when you issue collect() on dataframe, it shows -
Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5')
to get what you want, you can do -
reviews_df.rdd.map(lambda row : (row.reviewerID,row.asin,row.overall)).collect()
this will return you tuple of values of rows

Removing characters from column in pandas data frame

My goal is to (1) import Twitter JSON, (2) extract data of interest, (3) create pandas data frame for the variables of interest. Here is my code:
import json
import pandas as pd
tweets = []
for line in open('00.json'):
try:
tweet = json.loads(line)
tweets.append(tweet)
except:
continue
# Tweets often have missing data, therefore use -if- when extracting "keys"
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]
# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df=pd.DataFrame({'Ids':pd.Index(ids),
'Text':pd.Index(text),
'Lang':pd.Index(lang),
'Geo':pd.Index(geo),
'Place':pd.Index(place)})
# Create a data frame satisfying conditions:
df2 = df[(df['Lang']==('en')) & (df['Geo'].dropna())]
So far, everything seems to be working fine.
Now, the extracted values for Geo result in the following example:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
To get rid of everything except the coordinates inside the squared brackets I tried using:
df2.Geo.str.replace("[({':]", "") ### results in NaN
# and also this:
df2['Geo'] = df2['Geo'].map(lambda x: x.lstrip('{'coordinates': [').rstrip('], 'type': 'Point'')) ### results in syntax error
Please advise on the correct way to obtain coordinates values only.
The following line from your question indicates that this is an issue with understanding the underlying data type of the returned object.
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
You are returning a Python dictionary here -- not a string! If you want to return just the values of the coordinates, you should just use the 'coordinates' key to return those values, e.g.
df2.loc[1921,'Geo']['coordinates']
[39.11890951, -84.48903638]
The returned object in this case will be a Python list object containing the two coordinate values. If you want just one of the values, you can slice the list, e.g.
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
This workflow is much easier to deal with than casting the dictionary to a string, parsing the string, and recapturing the coordinate values as you are trying to do.
So let's say you want to create a new column called "geo_coord0" which contains all of the coordinates in the first position (as shown above). You could use a something like the following:
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
This uses a Python list comprehension to iterate over all entries in the df2['Geo'] column and for each entry it uses the same syntax we used above to return the first coordinate value. It then assigns these values to a new column in df2.
See the Python documentation on data structures for more details on the data structures discussed above.

JSON as input in a mapreduce

I have a JSON file contains fields such as machine_id, category, and ... Category contains states of machines such as "alarm", "failure". I simply like to see how many times each machine_id has been reported using rmr2.
For example, if I have the following:
machine_id, state
48, alarm
39, failure
48, utilization
I like to see this result:
48,2
39,1
What I did:
I wrote a simple mapreduce to read the value of JSON file and used it as an input in the second mapreduce. Code is:
mp = function(k,v){
machine_id=v$machine_id
keyval(machine_id,1) }
rd = function(k,v) keyval(k,length(v))
mapreduce(input = mapreduce(input='\user\cloudera\sample.json', input.format="json" , map=function(k,v) keyval(k,v)) , map=mp, reduce = rd)
Unfortunately, it returns only the last two values of JSON file. It seems that it doesn't read entire of the value of the JSON file. I would appreciate any help.

CSV Parser through angularJS

I am building a CSV file parser through node and Angular . so basically a user upload a csv file , on my server side which is node the csv file is traversed and parsed using node-csv
. This works fine and it returns me an array of object based on csv file given as input , Now on angular end I need to display two table one is csv file data itself and another is cross tabulation analysis. I am facing problem while rendering data, so for a table like
I am getting parse responce as
For cross tabulation we need data in a tabular form as
I have a object array which I need to manipulate in best possible way so as to make easily render on html page . I am not getting a way how to do calculation on data I get so as to store cross tabulation result .Any idea on how should I approach .
data json is :
[{"Sample #":"1","Gender":"Female","Handedness;":"Right-handed;"},{"Sample #":"2","Gender":"Male","Handedness;":"Left-handed;"},{"Sample #":"3","Gender":"Female","Handedness;":"Right-handed;"},{"Sample #":"4","Gender":"Male","Handedness;":"Right-handed;"},{"Sample #":"5","Gender":"Male","Handedness;":"Left-handed;"},{"Sample #":"6","Gender":"Male","Handedness;":"Right-handed;"},{"Sample #":"7","Gender":"Female","Handedness;":"Right-handed;"},{"Sample #":"8","Gender":"Female","Handedness;":"Left-handed;"},{"Sample #":"9","Gender":"Male","Handedness;":"Right-handed;"},{"Sample #":";"}
There are many ways you can do this and since you have not been very specific on the usage, I will go with the simplest one.
Assuming you have an object structure such as this:
[
{gender: 'female', handdness: 'lefthanded', id: 1},
{gender: 'male', handdness: 'lefthanded', id: 2},
{gender: 'female', handdness: 'righthanded', id: 3},
{gender: 'female', handdness: 'lefthanded', id: 4},
{gender: 'female', handdness: 'righthanded', id: 5}
]
and in your controller you have exposed this with something like:
$scope.members = [the above array of objects];
and you want to display the total of female members of this object, you could filter this in your html
{{(members | filter:{gender:'female'}).length}}
Now, if you are going to make this a table it will obviously make some ugly and unreadable html so especially if you are going to repeat using this, it would be a good case for making a directive and repeat it anywhere, with the prerequisite of providing a scope object named tabData (or whatever you wish) in your parent scope
.directive('tabbed', function () {
return {
restrict: 'E',
template: '<table><tr><td>{{(tabData | filter:{gender:"female"}).length}}</td></tr><td>{{(tabData | filter:{handedness:"lefthanded"}).length}}</td></table>'
}
});
You would use this in your html like so:
<tabbed></tabbed>
And there are ofcourse many ways to improve this as you wish.
This is more of a general data structure/JS question than Angular related.
Functional helpers from Lo-dash come in very handy here:
_(data) // Create a chainable object from the data to execute functions with
.groupBy('Gender') // Group the data by its `Gender` attribute
// map these groups, using `mapValues` so the named `Gender` keys persist
.mapValues(function(gender) {
// Create named count objects for all handednesses
var counts = _.countBy(gender, 'Handedness');
// Calculate the total of all handednesses by summing
// all the values of this named object
counts.Total = _(counts)
.values()
.reduce(function(sum, num) { return sum + num });
// Return this named count object -- this is what each gender will map to
return counts;
}).value(); // get the value of the chain
No need to worry about for-loops or anything of the sort, and this code also works without any changes for more than two genders (even for more than two handednesses - think of the aliens and the ambidextrous). If you aren't sure exactly what's happening, it should be easy enough to pick apart the single steps and their result values of this code example.
Calculating the total row for all genders will work in a similar manner.

Python csv writer, numpy array to csv

I have Python dict containing 4 key value pairs. Each value is a numpy arrays. Now I want to print the whole dict to a csv, forcing to write one numpy array per row.
with open(os.path.join("csv", title), 'w', newline='') as f:
w = csv.DictWriter(f, list(data.keys()))
w.writeheader()
w.writerow(data)
Is what I have used yet. But some of my arrays get written to several rows instead of a single line.
Here an example of input data:
{'DE': array([[ 38574. , 38538.1904, 39511.6190, 42521.1428,
50586. , 46282.5238, 42714.4761, 40612.0476],
[ 42798.4666, 42112.5333, 42277.8666, 42886.1333,
50224.3333, 48148.8 , 44272.6666, 41210.2 ]])}
I expect the output so that, each line of my array is written on one line. Instead I get a file containing "\n" after a certain amount of digits. how can i force to write the whole array in one row?
DE has a multidimensional array as its value, Inter has an empty list as its value, you end up with two columns one with Inter as the header with an empty list in its column and a second column DE with the array in its column which is exactly what the code should be doing.
If you want to alter each array length try setting numpy.set_printoptions:
numpy.set_printoptions(linewidth=1000)