Loading huge JSON document

I have a large JSON file, about 60 GB, and I am trying to view its contents using the json package in Python. I only want to look at a few records (say 5), but the code below loads the entire file and the system runs out of memory.
Is there a way to view just a few records rather than loading the entire file?
This is what I tried:
import json
with open("large-file.json", "r") as f:
    data = json.load(f)
user_to_repos = {}
for record in data:
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
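One way to peek at a handful of records without loading the whole document is incremental parsing, for example with the ijson library. A minimal sketch, assuming the top level of the file is a JSON array (the prefix "item" refers to the array's elements):
import ijson
with open("large-file.json", "rb") as f:
    # Stream array elements one at a time instead of loading the whole file
    for i, record in enumerate(ijson.items(f, "item")):
        print(record)
        if i == 4:  # stop after the first 5 records
            break
If the file is actually newline-delimited JSON, reading it line by line and calling json.loads on each line would work just as well.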

Related

Plotting specific column of GIS data using geopandas

I am trying to recreate a map of NY state geographical data using geopandas and the data available here: https://pubs.usgs.gov/of/2005/1325/#NY. I can plot the map but cannot figure out how to make use of the other files to plot their columns.
Any help would be appreciated.
What exactly are you trying to do?
Here's a quick setup you can use to download the Shapefiles from the site and have access to a GeoDataFrame:
import geopandas as gpd
from io import BytesIO
import requests as r
# Link to file
shp_link = 'https://pubs.usgs.gov/of/2005/1325/data/NYgeol_dd.zip'
# Downloading the file into memory
my_req = r.get(shp_link)
# Creating a file stream for GeoPandas
my_zip = BytesIO(my_req.content)
# Loading the data into a GeoDataFrame
my_geodata = gpd.read_file(my_zip)
# Printing all of the column names
for this_col in my_geodata.columns:
    print(this_col)
Now you can access the columns in my_geodata using square brackets. For example, if I want to access the data stored in the column called "SOURCE", I can just use my_geodata["SOURCE"].
Now it's just a matter of figuring out what exactly you want to do with that data.
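For instance, a quick way to visualize one of those columns (a sketch, reusing the "SOURCE" column from the example above) is GeoDataFrame.plot with the column argument:
import matplotlib.pyplot as plt
# Color the map by the values in the "SOURCE" column
my_geodata.plot(column="SOURCE", legend=True)
plt.show()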

pyspark: reduce size of JSON variable

I'm trying to analyse a JSON file that contains data from the Twitter API. The file is 2 GB, so it takes ages to load or to run any analysis on it.
So in pyspark I load it up:
df = sqlContext.read.json('/data/statuses.log.2014-12-30.gz')
This takes about 20 minutes, as does any further analysis, so I want to look at just a small section of the dataset so that I can test my scripts quickly and easily. I tried
df = df.head(1000)
but this seems to alter the dataset somehow, so when I try
print(df.groupby('lang').count().sort(desc('count')).show())
I get the error
AttributeError: 'list' object has no attribute 'groupby'
Is there any way I can reduce the size of the data so I can play around with it without having to wait ages each time?
Solved it:
df = df.limit(1000)
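A short sketch of why this works (assuming the same sqlContext setup as above): head(1000) collects the rows to the driver as a plain Python list, so DataFrame methods disappear, while limit(1000) stays a DataFrame and keeps the full API:
from pyspark.sql.functions import desc
rows = df.head(1000)     # list of Row objects on the driver; no .groupby()
sample = df.limit(1000)  # still a (lazily evaluated) DataFrame
sample.groupby('lang').count().sort(desc('count')).show()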

PySpark loading from URL

I want to load CSV files from a URL in PySpark; is it even possible to do so?
I keep the files on GitHub.
Thanks!
There is no native way to do this in pyspark (see here).
However, if you have a function that takes as input a URL and outputs the csv:
def read_from_URL(url):
    # your logic here
    return data
You can use spark to parallelize this operation:
URL_list = ['http://github.com/file/location/file1.csv', ...]
data = sc.parallelize(URL_list).map(read_from_URL)
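As a rough sketch of what read_from_URL could look like (assuming the files are reachable as raw CSV over HTTP, e.g. via a raw.githubusercontent.com URL; the path below is a placeholder), pandas can fetch and parse in one call:
import pandas as pd
def read_from_URL(url):
    # pandas accepts an http(s) URL directly and returns a DataFrame
    return pd.read_csv(url)
# Placeholder path; point this at the raw file, not the GitHub HTML page
URL_list = ['https://raw.githubusercontent.com/<user>/<repo>/main/file1.csv']
data = sc.parallelize(URL_list).map(read_from_URL)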

running nested jobs in spark

I am using PySpark. I have a list of gzipped JSON files on S3 which I have to access, transform, and then export as parquet to S3. Each JSON file contains around 100k lines, so parallelizing within a file won't make much sense (but I am open to it); however, there are around 5k files, which I do have to parallelize over. My approach is: pass the JSON file list to the script -> run parallelize on the list -> run map (this is where I am getting blocked). How do I access and transform each JSON file, create a DataFrame out of the transformed JSON, and dump it as parquet into S3?
To read json in a distributed fashion, you will need to parallelize your keys as you mention. To do this while reading from s3, you'll need to use boto3. Below is a skeleton sketch of how to do so. You'll likely need to modify distributedJsonRead to fit your use case.
import boto3
import json
from pyspark.sql import Row
def distributedJsonRead(s3Key):
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    return Row(**contents)
pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.map(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
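From there, assuming the objects share a consistent schema, the RDD of Rows can be turned into a DataFrame and written out as one merged parquet data set (the output path below is a placeholder):
df = sqlContext.createDataFrame(dataRdd)
# Placeholder output location
df.write.parquet('s3a://bucketName/output/merged_parquet')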
Edit: to address the 1:1 mapping of input files to output files
Later on, it's likely that having a merged parquet data set would be easier to work with. But if this is the way you need to do it, you could try something like this
for k in keyList:
    rawtext = sqlContext.read.json(k)  # or whichever method you need to use to read in the data
    outpath = k[:-4] + 'parquet'       # assumes keys end in 'json'
    rawtext.write.parquet(outpath)
I don't think you will be able to parallelize these operations if you want a 1:1 mapping of json to parquet files. Spark's read/write functionality is designed to be called from the driver, and needs access to sc and sqlContext. This is another reason why having one parquet directory is likely the way to go.

Grails - Rendered Json file too huge for client side operations

I have a Grails controller action that renders JSON as follows, to be consumed by d3 on my front end (.gsp file):
import groovy.sql.Sql
import grails.converters.JSON
def dataSource
def salesjson = {
    def sql = new Sql(dataSource)
    def rows = sql.rows("select date_hour, mv, device, department, browser, platform, total_revenue as metric, total_revenue_ly as metric_ly from composite")
    sql.close()
    render rows as JSON
}
I use this JSON to render my crossfiltered dc charts on the front end. The problem is that queries such as the one above return a large JSON payload, and my client stops working and hangs (100 MB plus on the client side, and still loading!).
I can't think of an alternative to this approach that would reduce the payload size (maybe rendering as a CSV string? Would that help much? If so, how do I go about it? I currently have about 600,000 rows in my JSON).
What other options do I have?
I'd suggest using something like MessagePack to create a smaller, more binary-like representation of your JSON. There are a few other options out there, but I think this one is probably the most JVM / JavaScript friendly.
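As a rough illustration of the size difference (a Python sketch using the msgpack package rather than a JVM binding, with a made-up record shape loosely mirroring the query's columns), you can compare the two encodings of the same rows:
import json
import msgpack  # pip install msgpack
# Hypothetical rows for illustration only
rows = [{"date_hour": "2015-01-01 10", "device": "mobile", "metric": 1234.5}] * 1000
as_json = json.dumps(rows).encode("utf-8")
as_msgpack = msgpack.packb(rows)
print(len(as_json), len(as_msgpack))  # the MessagePack payload is noticeably smaller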