What I want to do is change the user id inside a URL, repeat that for several ids, and get the output from each URL.
What I did so far:
import urllib
import requests
import json
url="https://api.abc.com/users/12345?api_key=5632lkjgdlg&_format=_show"
data=requests.get(url).json()
print (data['user'])
(I use 'user' inside print because that key holds all the information about the user in question, in JSON format.)
My question: I want to change the user id (12345 in this example URL) to another number (any random number) and get the output from every URL built that way. For example, change it to 5211 and get the result, then change it to 959444 and get the result, and so on. I think I need a loop that iterates while changing only the number in the URL, but I kept failing because I could not split the original URL and replace only the user id. Could anyone help me out?
Thank you so much in advance.
=====================The Next Following Question is Stated Below================
Thank you for the previous answer! I built on it and got it working, but ran into another issue. I can now iterate over the ids and fetch each user's information as JSON. The printed output had single quotes (rather than double quotes) and a u' prefix in front of every key, but I managed to solve that and ended up with clean, valid JSON.
My plan is to convert each JSON result to CSV, stacking everything I scrape into a single CSV file. For example, the JSON for user1 becomes row 1, with the JSON keys as column names and the corresponding values as the cell values. The JSON for the second user goes into the same CSV file as row 2, and so on.
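import pandas as pd  # needed for pd.DataFrame below
from pandas.io.json import json_normalize  # newer pandas versions expose this as pd.json_normalize
eg_data=[data['user']]
df=pd.DataFrame.from_dict(json_normalize(data['user']))
print (df)
df.to_csv('C:/Users/todd/Downloads/eg.csv')
print (df)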
I found that json_normalize flattens the nested brackets, which is useful for real-world data, and I used a pandas DataFrame to turn the result into a table. I have two questions:
1. How do I stack each JSON result I scrape, one per row, in a single CSV file? (A way to do this without pandas would also be appreciated.)
2. As far as I know, a pandas DataFrame needs every row to have the same number of columns. In my case each JSON result has either 10 or 20 columns, depending on whether it contains nested brackets. How do I stack all of these rows into one CSV file?
Comments or questions will be greatly appreciated.
You can split the URL into two parts up front and join them around each newly generated random number:
import random
url1="https://api.abc.com/users/"
url2="?api_key=5632lkjgdlg&_format=_show"
for i in range(4):
    num=random.randint(1000,10000) #you can change the range here for generating a random number
    url=url1+str(num)+url2
    print(url)
OUTPUT
https://api.abc.com/users/2079?api_key=5632lkjgdlg&_format=_show
https://api.abc.com/users/2472?api_key=5632lkjgdlg&_format=_show
and so on...
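If the ids are known in advance (such as 5211 and 959444 from the question) rather than random, the same two-part split can be driven by a list. This is only a minimal sketch: `build_urls` and `user_ids` are hypothetical names, and the actual request (`requests.get(url).json()`, as in the question) is left as a comment so the snippet runs without network access.

```python
URL_PREFIX = "https://api.abc.com/users/"
URL_SUFFIX = "?api_key=5632lkjgdlg&_format=_show"


def build_urls(user_ids):
    """Return one URL per user id by joining prefix + id + suffix."""
    return [URL_PREFIX + str(uid) + URL_SUFFIX for uid in user_ids]


user_ids = [5211, 959444]  # hypothetical ids taken from the question text
for url in build_urls(user_ids):
    # data = requests.get(url).json() would go here
    print(url)
```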
If you want to split at that exact place without knowing the URL beforehand, you can use a regex, since you know for sure that a ? follows the number.
import re
url="https://api.abc.com/users/12345?api_key=5632lkjgdlg&_format=_show"
matches=re.split(r'\d+(?=\?)',url) # raw string so the backslashes reach the regex engine unchanged
['https://api.abc.com/users/', '?api_key=5632lkjgdlg&_format=_show']
Now just set
url1=matches[0]
url2=matches[1]
And use the for loop.
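For the follow-up question (stacking each user's JSON as one row of a single CSV file, even when some records have more columns than others), one pandas-free option is the standard library's `csv.DictWriter`. The sketch below is an illustration under assumptions, not a tested solution for the real API: `flatten`, `write_rows` and the sample records are made up, nested dicts are flattened with a `.` separator, and the header is the union of all keys, so missing values are written as empty cells.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def write_rows(records, out_file):
    """Write one CSV row per record; the header is the union of all keys."""
    rows = [flatten(r) for r in records]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in the columns a given record does not have
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample records standing in for data['user'] from each request
users = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "address": {"city": "x"}},
]
buffer = io.StringIO()
write_rows(users, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as `out_file`. Collecting all records first is what allows the header to cover every column that appears.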
Related
I have a procedure where I make multiple calls to an API and, using a token within the JSON that is returned, pass that back to a function to call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758mb. The JSON files are all formatted the same way and have the same "schema" or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe so essentially I'll have one big data frame with all 88 JSON files read into.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and I'm guessing that when a value actually appears in that field it changes to a boolean. The unionByName then fails because of the different data types.
What is the best way for me to resolve this? I thought about reading using "extend" to merge all the JSON files into one big file however a 758mb JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty DataFrame with that schema, so that you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *
my_schema = StructType([
    StructField('file_name',StringType(),True),
    StructField('id',LongType(),True),
    StructField('dataset_name',StringType(),True),
    StructField('snapshotdate',TimestampType(),True)
])
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output = output.unionByName(df_json, allowMissingColumns=True)  # unionByName returns a new DataFrame, so keep the result
I'm not sure this is what you are looking for. I hope it helps
I have huge datasets in my HDFS environment, say 500+ datasets, each with around 100M+ rows. I want to get only the column names of each dataset without reading the whole dataset, because that would take too long. My data are JSON formatted and I'm reading them with the classic Spark JSON reader: spark.read.json('path'). What's the best way to get the column names without wasting time and memory?
Thanks...
From the official docs:
If the schema parameter is not specified, this function goes through the input once to determine the input schema.
Therefore, you cannot get the column names with only the first line.
Still, you can do an extra step first, that will extract one line and create a dataframe from it, then extract the column names.
One answer could be the following:
1. Read the data using the spark.read.text('path') method.
2. Limit the number of rows to 1 with limit(1), since we only need one record to get the column names.
3. Convert the table to an RDD and collect it as a list with collect().
4. Convert the first collected row from a unicode string to a Python dict (since I'm working with JSON formatted data).
5. The keys of that dict are exactly what we are looking for (the column names as a Python list).
This code worked for me:
from ast import literal_eval
literal_eval(spark.read.text('path').limit(1)
.rdd.flatMap(lambda x: x)
.collect()[0]).keys()
The reason it is faster might be that PySpark does not have to parse the whole dataset and infer every field's structure when you read it as text (each line is read as one plain string), so it is lighter and more efficient for this specific case.
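One caveat with the snippet above: `ast.literal_eval` only accepts Python literals, so it raises an error on a line containing the JSON literals `true`, `false` or `null`. A hedged alternative is to parse the collected string with `json.loads` instead. The sketch below is plain Python with no Spark involved: `column_names` is a hypothetical helper and `sample` is a made-up line standing in for the string returned by `collect()[0]`.

```python
import json


def column_names(first_line):
    """Parse one JSON-encoded line and return its top-level keys."""
    return list(json.loads(first_line).keys())


# Made-up sample line; note the JSON literals true and null
sample = '{"id": 1, "active": true, "comment": null}'
print(column_names(sample))
```

Either way, only the keys present in the first record are found; fields that appear only in later records will be missed.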
So basically I'm at a wall with an assignment and it's beginning to really frustrate me. I have a CSV file and my goal is to count the number of times a string occurs. Column 1 holds a string and column 2 holds an integer associated with it. I ultimately need the result as a dictionary. Where I am stuck is how to do this without importing any libraries; I am only allowed to iterate through the file using for loops. Would my best bet be to read each line, turn it into a string, and count how many times that string occurs? Any insight would be appreciated.
If you don't want to use any library (and assuming you are using Python), you can use a dict comprehension, like this:
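with open("data.csv") as file:
    # Split each line on the comma: column 1 becomes the key, column 2 the value
    rows = (line.strip().split(",") for line in file)
    csv_as_dict = {row[0]: row[1] for row in rows}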
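Since the question asks for a count built with for loops only, here is a minimal sketch of that reading of the assignment. It assumes the goal is to count how often each column-1 string occurs, that the file has no header row, and that fields contain no quoted commas; `count_strings` and the sample rows are made up for illustration.

```python
def count_strings(lines):
    """Count how often each column-1 string appears, using only a for loop."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name = line.split(",")[0]
        if name in counts:
            counts[name] += 1
        else:
            counts[name] = 1
    return counts


# Made-up sample rows standing in for the lines of data.csv
sample_lines = ["apple,3\n", "pear,1\n", "apple,7\n"]
print(count_strings(sample_lines))
```

An open file can be passed directly, for example `with open("data.csv") as file: counts = count_strings(file)`. If the assignment instead wants the integers in column 2 summed per string, the `+= 1` would become `+= int(...)` on the second field.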
Note: The question is possibly a duplicate of Creating a dictionary from a csv file?.
I'm working on some Python code for my local billiard hall and I'm running into problems with JSON encoding. When I dump my data into a file, I get all the data on a single line. However, I want the data written to the file in a layout of my own choosing (I had to use a picture to get the point across): [image: My custom JSON format]. I've looked up questions on custom JSONEncoders, but they all seem to deal with datatypes that aren't JSON serializable. I never found a solution for my specific need, which is having everything laid out the way I want. Basically, I want each list element on a separate row but all of the items of a dict on the same row. Do I need to write my own custom encoder or is there some other approach I need to take? Thanks!
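One possible approach that avoids a custom encoder is to serialize each list element separately with `json.dumps` and join the pieces by hand. This is only a sketch under assumptions: the exact target layout is in the image and is not known here, so the code assumes a top-level list of dicts; `dumps_one_item_per_line` and the `players` data are hypothetical.

```python
import json


def dumps_one_item_per_line(items):
    """Serialize a list so each element sits on its own line."""
    rows = ["  " + json.dumps(item) for item in items]
    return "[\n" + ",\n".join(rows) + "\n]"


# Made-up sample data; the real structure is in the image
players = [
    {"name": "Ann", "wins": 3},
    {"name": "Bob", "wins": 1},
]
text = dumps_one_item_per_line(players)
print(text)
```

The result is still valid JSON, so `json.loads(text)` reads it back unchanged. Lists nested inside the dicts stay on the dict's line with this approach.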
I'm a newcomer to numpy, and am having a hard time reading CSVs into a numpy array with genfromtxt.
I found a CSV file on the web that I'm using as an example. It's a mixture of floats and strings. It's here: http://pastebin.com/fMdRjRMv
I'm using numpy via pylab (initializing on a Ubuntu system via: ipython -pylab). numpy.version.version is 1.3.0.
Here's what I do:
Example #1:
data = genfromtxt("fMdRjRMv.txt", delimiter=',', dtype=None)
data.shape
(374, 15)
data[10,10] ## Take a look at an example element
'30'
type(data[10,10])
<type 'numpy.string_'>
There are no errant quotation marks in the CSV file, so I've no idea why it should think that the number is a string. Does anyone know why this is the case?
Example #2 (skipping the first row):
data = genfromtxt("fMdRjRMv.txt", delimiter=',', dtype=None, skiprows=1)
data.shape
(373,)
Does anyone know why it would read all of this into a 1-dimensional array?
Thanks so much!
In your example #1, the problem is that all the values in a single column must share the same datatype. Since the first line of your data file has the column names, this means that the datatype of every column is string.
You have the right idea in example #2 of skipping the first row. Note however that 1.3.0 is a rather old version (I have 1.6.1). In newer versions skiprows is deprecated and you should use skip_header instead.
The reason that the shape of the array is (373,) is that it is a structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html), which is what numpy uses to represent inhomogeneous data. So data[10] gives you an entire row of your table. You can also access the data columns by name, for example data['f10']. You can find the names of the columns in data.dtype.names. It is also possible to use the original column names that are defined in the first line of your data file:
data = genfromtxt("fMdRjRMv.txt", dtype=None, delimiter=',', names=True)
Then you can access a column by name, for example data['Age'].