How can we do feature selection on json data? - json

I have a large dataset in JSON format from which I want to extract the important attributes that capture the most variance. I want to use these attributes to build a search engine on the dataset, with the attributes serving as the hash key.
In short, the question is how to do feature selection on JSON data.

You could read the data into a pandas DataFrame with the pandas.read_json() function and use that DataFrame to gain insight into your data. For example:
import pandas
data = pandas.read_json(json_file)
data.head() # Displays the top five rows
data.info() # Displays column names, dtypes and non-null counts
Or you can use matplotlib on this DataFrame to plot a histogram for each numerical attribute:
import matplotlib.pyplot as plt
data.hist(bins=50, figsize=(20, 15))
plt.show()
If you are interested in correlations between attributes, you can use the pandas.plotting.scatter_matrix() function.
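For instance, a minimal sketch, assuming data is the DataFrame from above and attr_a, attr_b, attr_c are hypothetical placeholders for numerical columns in your dataset:
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
# Substitute a few numerical attributes from your own data here
attributes = ["attr_a", "attr_b", "attr_c"]
scatter_matrix(data[attributes], figsize=(12, 8))
plt.show()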
You still have to manually pick the attributes that fit your task best; these tools simply help you understand the data and gain insight into it.


How can I read and write column descriptions and typeclasses in Foundry transforms?

I want to read the column descriptions and typeclasses from my upstream datasets, then I want to simply pass them through to my downstream datasets.
How can I do this in Python Transforms?
If you upgrade your repository to at least version 1.206.0, you'll have access to a new feature in the Transforms Python API: reading and writing of column descriptions and typeclasses.
The column_typeclasses property gives back a structured Dict<str, List<Dict<str, str>>>; for example, a column named tags might have a column_typeclasses value of {'tags': [{"name": "my_name", "kind": "my_kind"}]}. A typeclass always consists of two components, a name and a kind, which appear in every dictionary of the list shown above. These are the only two keys you can pass in each dict, and the corresponding value for each key must be a str. The column_descriptions property similarly exposes the per-column description text.
Full documentation is in the works for this feature, so stay tuned.
from transforms.api import transform, Input, Output

@transform(
    my_output=Output("ri.foundry.main.dataset.my-output-dataset"),
    my_input=Input("ri.foundry.main.dataset.my-input-dataset"),
)
def my_compute_function(my_input, my_output):
    # Take a small sample of the input rows
    recent = my_input.dataframe().limit(10)
    # Read the column metadata from the upstream dataset
    existing_typeclasses = my_input.column_typeclasses
    existing_descriptions = my_input.column_descriptions
    # Pass the metadata through unchanged to the downstream dataset
    my_output.write_dataframe(
        recent,
        column_descriptions=existing_descriptions,
        column_typeclasses=existing_typeclasses,
    )

Is there a way to get column names of a dataframe in pyspark without reading the whole dataset?

I have huge datasets in my HDFS environment, say 500+ datasets, all of them around 100M+ rows. I want to get only the column names of each dataset without reading the whole dataset, because that would take too long. My data are JSON formatted and I'm reading them using the classic Spark JSON reader: spark.read.json('path'). So what's the best way to get the column names without wasting time and memory?
Thanks...
From the official doc:
"If the schema parameter is not specified, this function goes through the input once to determine the input schema."
Therefore, with spark.read.json you cannot get the column names from only the first line; the reader scans the whole input.
Still, you can do an extra step first: extract a single line, build a dict from it, and take the column names from its keys.
One approach is the following:
1. Read the data with the spark.read.text('path') method, so every line is a single string.
2. Limit the number of rows to 1 with the limit(1) method, since we only need one record to derive the column names.
3. Convert the result to an RDD and collect it as a list with the collect() method.
4. Convert the first collected row from a string to a Python dict (since the data is JSON formatted).
5. The keys of that dict are exactly what we are looking for: the column names, as a Python list.
This code worked for me:
from ast import literal_eval

# Grab one raw line, parse it as a Python literal, and read its keys
literal_eval(spark.read.text('path').limit(1)
             .rdd.flatMap(lambda x: x)
             .collect()[0]).keys()
The reason this is faster is likely that PySpark does not have to infer the field structure of the whole dataset when you read it in text format (every line is treated as one big string), which is lighter and more efficient for this specific case.
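Note that literal_eval only accepts JSON that happens to be valid Python syntax; it will fail on the literals true, false and null. If your data contains those, json.loads is the safer parser. A minimal sketch of the same trick with it:
import json
# Collect one raw line and parse it as JSON rather than as a Python literal
first_line = spark.read.text('path').limit(1).collect()[0][0]
column_names = list(json.loads(first_line).keys())
print(column_names)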

Export a pandas dataframe to a sortable table in HTML

Is there a way to export a pandas dataframe into an HTML file and incorporate some additional code that makes the output sortable by column?
I have been using Dash DataTable to give the user the option to sort the results, but I was wondering if there is another way that does not require a running server, so the user can just load the HTML page and sort the results.
So far I have been able to make semi-interactive plots based on this SO post, but I would also like to add sortable tables to the HTML, and after searching online it is not clear to me what the best way to do it is (I'm still a newbie with HTML).
For the exporting part, use the pandas.DataFrame.to_html() method; for the sorting you need a bit of JavaScript on top of the generated table.
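For example, here is a minimal sketch that wraps the to_html() output in a page loading the DataTables library from a CDN; the library choice and CDN URLs are my assumptions, and any client-side table-sorting script works the same way:
import pandas as pd

df = pd.DataFrame({"name": ["b", "a"], "value": [2, 1]})
# table_id lets the script below find the generated <table>
table_html = df.to_html(table_id="results", index=False)
page = f"""<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="https://cdn.datatables.net/1.13.6/css/jquery.dataTables.min.css">
<script src="https://code.jquery.com/jquery-3.7.0.min.js"></script>
<script src="https://cdn.datatables.net/1.13.6/js/jquery.dataTables.min.js"></script>
</head>
<body>
{table_html}
<script>$(document).ready(function() {{ $('#results').DataTable(); }});</script>
</body>
</html>"""
with open("df.html", "w") as f:
    f.write(page)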
I used panel for this purpose (version 0.14.2). It creates sortable HTML tables by default.
import panel as pn
pn.extension('tabulator')
# df denotes your existing pandas DataFrame
# df.datetime = df.datetime.astype(str)  # optional: see the note below
table = pn.widgets.Tabulator(df)
table.save("df.html")
If you have columns that are datetimes or timedeltas, it may be best to cast them to string first for a more sensible representation.
If you want a timedelta to be sortable, it is best to convert it to a sensible integer.
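For example, a small sketch, assuming hypothetical columns named datetime and duration:
# ISO-format datetime strings still sort chronologically as text
df["datetime"] = df["datetime"].astype(str)
# Whole seconds give a plain integer column that sorts correctly
df["duration"] = df["duration"].dt.total_seconds().astype(int)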

Python: loop through a variable in URLs

What I want to do here is change a user ID within a URL, for every URL, and then get the output from each one.
What I did so far:
import requests

url = "https://api.abc.com/users/12345?api_key=5632lkjgdlg&_format=_show"
data = requests.get(url).json()
print(data['user'])
(I print data['user'] because it gives all the information about a focal user in JSON format.)
My question: I want to change the user ID (12345 in this example URL) to another number (any random number) and then get the output from every URL I build. For example, change it to 5211 and get the result, then change it to 959444 and get the result, and so on. I think I need a loop that iterates by changing only the number inside the URL, but I kept failing because of the difficulty of splitting the original URL and changing only the user ID part. Could anyone help me out?
Thank you so much in advance.
Follow-up question:
Thank you for your previous answer! I built on it and got the iteration working, but ran into another issue. I can now loop through and fetch each user's information in JSON format. The output gave me single quotes (rather than double quotes) and a u' prefix in front of every key, but I managed to clean that up into neat, valid JSON.
My plan is to convert each JSON record into a row of one CSV file, stacking everything I scrape into a single file. For example, the first JSON record, for user1, becomes row 1, with the JSON keys as column names and the corresponding values as the values of those columns; the second record I scrape goes into the same CSV file as row 2, and so on.
import pandas as pd
from pandas import json_normalize  # lives in pandas.io.json in older pandas versions

df = json_normalize(data['user'])  # json_normalize already returns a DataFrame
print(df)
df.to_csv('C:/Users/todd/Downloads/eg.csv')
So, I found that json_normalize flattens the nested brackets, which is useful on real-world data, and I used a pandas DataFrame to present it as a table. Here I have two questions:
1. How do I stack each scraped JSON record, one per row, into a single CSV file? (A way to do this without pandas would also be appreciated.)
2. As far as I know, a pandas DataFrame expects every row to have the same columns, but each JSON record I scrape has either 10 or 20 columns, depending on whether it contains nested brackets. How do I stack all the rows into one CSV file in that case?
Comments or questions will be greatly appreciated.
You can split the URL into two parts up front and join them back together each time you generate a random number:
import random

url1 = "https://api.abc.com/users/"
url2 = "?api_key=5632lkjgdlg&_format=_show"
for i in range(4):
    # Change the range here to control which IDs can be generated
    num = random.randint(1000, 10000)
    url = url1 + str(num) + url2
    print(url)
OUTPUT
https://api.abc.com/users/2079?api_key=5632lkjgdlg&_format=_show
https://api.abc.com/users/2472?api_key=5632lkjgdlg&_format=_show
and so on...
But if you want to split at that exact place without knowing what the URL looks like beforehand, you can use a regex, since you know for sure that a ? follows the number.
import re

url = "https://api.abc.com/users/12345?api_key=5632lkjgdlg&_format=_show"
matches = re.split(r'\d+(?=\?)', url)  # split on the run of digits just before the '?'
print(matches)
# ['https://api.abc.com/users/', '?api_key=5632lkjgdlg&_format=_show']
Now just set
url1 = matches[0]
url2 = matches[1]
And use the for loop.
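As for the follow-up about stacking records with differing columns: pd.concat aligns columns by name and fills missing cells with NaN, so 10-column and 20-column records can share one table. Here is a sketch combining the pieces above into a single loop that writes one CSV; the URL, API key and the 'user' key are taken from the question, and the rest is standard pandas behavior rather than the original answer's code:
import random
import requests
import pandas as pd
from pandas import json_normalize

url1 = "https://api.abc.com/users/"
url2 = "?api_key=5632lkjgdlg&_format=_show"
frames = []
for i in range(4):
    user_id = random.randint(1000, 10000)
    data = requests.get(url1 + str(user_id) + url2).json()
    # json_normalize flattens nested brackets into dotted column names
    frames.append(json_normalize(data['user']))
# concat aligns columns by name and fills the gaps with NaN
result = pd.concat(frames, ignore_index=True)
result.to_csv('eg.csv', index=False)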

How to pre-process category info in deep learning (Keras) training input data?

I have category information in the training input data, and I am wondering what the best way to normalize it is.
The category information is things like "city", "gender", etc.
I'd like to use Keras to handle the process.
Scikit-learn has a preprocessing module with functions to normalize or scale your data.
This video gives an example of how to preprocess data that will be used for training a model with Keras; the preprocessing there is done with the module mentioned above.
As shown in the video, with scikit-learn's MinMaxScaler class you can specify the range that you want your data transformed into, and then fit your data to that range using the MinMaxScaler.fit_transform() function.
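A minimal sketch of that scaling step (the numbers are made up; note that MinMaxScaler applies to numerical features, so categories such as "city" or "gender" are usually one-hot encoded instead of scaled):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up numerical training data, e.g. ages between 13 and 100
raw_samples = np.array([[13.0], [42.0], [100.0]])
# Squash everything into the range [0, 1] before feeding it to Keras
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_samples = scaler.fit_transform(raw_samples)
print(scaled_samples)  # values now lie between 0 and 1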