Count number of users per window using PySpark - json

I'm using Kafka to stream a JSON file, sending each line as a message. One of the keys is the user's email.
Then I use PySpark to count the number of unique users per window, using their email to identify them. The command
def print_users_count(count):
    print 'The number of unique users is:', count

print_users_count((lambda message: message['email']).distinct().count())
gives me the error below. How can I fix this?
AttributeError                            Traceback (most recent call last)
<ipython-input-19-311ba744b41f> in <module>()
      2     print 'The number of unique users is:', count
      3
----> 4 print_users_count((lambda message: message['email']).distinct().count())

AttributeError: 'function' object has no attribute 'distinct'
Here is my PySpark code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

try:
    sc.stop()
except:
    pass

sc = SparkContext(appName="KafkaStreaming")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)

# Define the PySpark consumer.
kafkaStream = KafkaUtils.createStream(ssc, bootstrap_servers, 'spark-streaming2', {topicName: 1})
# Parse the incoming data as JSON.
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
# Count the number of messages per batch.
parsed.count().map(lambda x: 'Messages in this batch: %s' % x).pprint()

You're not applying the lambda function to anything, so what would message reference? Right now the lambda function is just that: a function. It is not being applied to any data, so it is not returning any data; that is why you're getting AttributeError: 'function' object has no attribute 'distinct'. You need to apply it to the DStream that the email key is in (parsed, in your code).
See the PySpark docs for pyspark.sql.functions.countDistinct(col, *cols) and pyspark.sql.functions.approx_count_distinct(col, rsd=None). If you work with DataFrames, these are a simpler route to a unique count.
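As a minimal, untested sketch of the DStream route, using the parsed stream from your code: map out the email field, then run the RDD actions distinct() and count() on each batch via foreachRDD (add window() first if you want windows larger than one batch):
def print_users_count(count):
    print 'The number of unique users is:', count

# Apply the lambda to the DStream so it actually receives messages.
emails = parsed.map(lambda message: message['email'])

# distinct() and count() are RDD methods, so call them on each batch's RDD.
emails.foreachRDD(lambda rdd: print_users_count(rdd.distinct().count()))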

Related

In Palantir Foundry, can I find which CSV file is causing schema errors in a dataset?

I'm seeing errors like the following when building downstream of some datasets containing CSV files:
Caused by: java.lang.IllegalStateException: Header specifies 185 column types but line split into 174: "SUSPECT STRING","123...
or
Caused by: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: Exception parsing 'SUSPECT STRING' into a IntegerType$ for column "COLOUR_ID": Unable to deserialize value using com.palantir.spark.parsers.text.converters.IntegerConverter. The value being deserialized was: SUSPECT STRING
Looking at the errors it seems to me like some of my CSV files have the wrong schema. How can I find which ones?
One technique you could use would be to:
create a transform that reads the CSV files in as if they were unstructured text files, then
filter the resulting DataFrame down to just the suspect rows, as identified by the extracts contained in the error message
Below is an example of such a transform:
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    parsed_dfs = []
    for file_name in paths:
        parsed_df = (
            spark_session.read.text(file_name)
            .filter(F.col("value").contains(F.lit("SUSPECT STRING")))
            .withColumn("_filename", F.lit(file_name))
        )
        parsed_dfs += [parsed_df]
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    output_dataset=Output("my_output"),
    input_dataset=Input("my_input"),
)
def compute(ctx, input_dataset, output_dataset):
    session = ctx.spark_session
    input_filesystem = input_dataset.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls()]
    output_df = read_files(session, files)
    output_dataset.write_dataframe(output_df)
This would then output the rows of interest along with the paths to the files they're in.

Iterating through describe_instances() to print key & value boto3

I am currently working on a python script to print pieces of information on running EC2 instances on AWS using Boto3. I am trying to print the InstanceID, InstanceType, and PublicIp. I looked through Boto3's documentation and example scripts so this is what I am using:
import boto3

ec2client = boto3.client('ec2')
response = ec2client.describe_instances()

for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        instance_id = instance["InstanceId"]
        instance_type = instance["InstanceType"]
        instance_ip = instance["NetworkInterfaces"][0]["Association"]
        print(instance)
        print(instance_id)
        print(instance_type)
        print(instance_ip)
When I run this, "instance" prints one large block of JSON, plus my instance ID and type. But I have been getting an error since adding NetworkInterfaces.
instance_ip = instance["NetworkInterfaces"][0]["Association"]
returns:
Traceback (most recent call last):
  File "/Users/me/AWS/describeInstances.py", line 12, in <module>
    instance_ip = instance["NetworkInterfaces"][0]["Association"]
KeyError: 'Association'
What am I doing wrong while trying to print the PublicIp?
The full response syntax, including the structure of NetworkInterfaces, can be found in the boto3 docs (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Client.describe_instances).
Association may not always be present. Also, an instance may have more than one interface. So your working loop could be:
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        instance_id = instance["InstanceId"]
        instance_type = instance["InstanceType"]
        # print(instance)
        print(instance_id, instance_type)
        for network_interface in instance["NetworkInterfaces"]:
            instance_ip = network_interface.get("Association", "no-association")
            print(' -', instance_ip)
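If you want just the public IP rather than the whole Association block, chaining dict.get with defaults avoids the KeyError the same way (a sketch of the inner loop; PublicIp is the key used in the describe_instances response):
for network_interface in instance["NetworkInterfaces"]:
    # Association is absent when the interface has no public IP, so fall
    # back to an empty dict before looking up PublicIp.
    public_ip = network_interface.get("Association", {}).get("PublicIp", "no-public-ip")
    print(' -', public_ip)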

Iterate through a JSON file using Python

I am trying to loop through a simple JSON file (see link in the code below) and to calculate the sum of all the integers in the file.
When iterating through the file I receive the following error:
TypeError: string indices must be integers
Could you please help.
Code below:
import urllib.request, urllib.parse, urllib.error
import json

total = 0

# url = input('Enter URL: ')
url = 'http://py4e-data.dr-chuck.net/comments_42.json'
uh = urllib.request.urlopen(url)
data = uh.read().decode()
print('Retrieved', len(data), 'characters')
print(data)

info = json.loads(data)
print('User count:', len(info))  # it displays "User count: 2" why?

for item in info:
    num = item["comments"][0]["count"]
    total = total + num
print(total)
The JSON file starts with a note. Your for-loop iterates over the keys of a dictionary, so the first item is 'note', whose value is a string, and a string can only be subscripted with an integer, hence the error message. (That is also why you see "User count: 2": the top-level dictionary has exactly two keys, 'note' and 'comments'.)
You probably want to loop over info["comments"], which is the list of dictionaries containing 'name' and 'count':
for item in info["comments"]:
num=item["count"]
total=total+num
print (total)
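Equivalently, since the loop only accumulates a sum, a generator expression over the same parsed info object collapses it to one line:
# Sum the 'count' of every comment in a single expression.
total = sum(item["count"] for item in info["comments"])
print(total)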

Unable to print output of JSON code into a .csv file

I'm getting the following errors when trying to decode this data; the second error came up after trying to compensate for the Unicode error:
Error 1:
write.writerows(subjects)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 160: ordinal not in range(128)
Error 2:
with open("data.csv", encode="utf-8", "w",) as writeFile:
SyntaxError: non-keyword arg after keyword arg
Code
import requests
import json
import csv
from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))

subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])

with open("data.csv", encode="utf-8", "w",) as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
Using requests, and with the correction to the open call (as below), I have no problem running this. I think your first error is a consequence of the second one being incorrect.
I am on Python 3 and can run yours with my fix to the open line and with
r = urllib.request.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
I personally would use requests:
import requests
import csv

data = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1').json()

subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])

with open("data.csv", encoding="utf-8", mode="w") as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
For your second error: looking at the documentation for the open function, you need to use the correct argument names, and name the mode argument when it is not matched positionally.
with open("data.csv", encoding="utf-8", mode="w") as writeFile:

python loop over a list's length, writing specific data to csv file

I am collecting data from a set of URLs using curl requests and converting it into JSON. These data are represented in Python as lists or dictionaries.
EDIT:
Next, I want to loop my script over the instance_name value of each dictionary inside a list (a list of dictionaries) for the full length of that list. I wish to run my other curl requests for each 'instance' and then write that information to a .csv named 'instance_name'.csv.
The .csv is populated from 2 different curl requests, excluding the one I want to loop everything over. The .csv is created and named by 'instance_name', but its actual content comes from the other curl requests.
Information of the list I want to loop over:
>>> instances = [i['instance_name'] for i in i_data]
>>> print(instances)
[u'Instance0', u'Instance1', .... u'Instance16']
>>> type(i_data)
<type 'list'>
>>> len(i_data)
17
>>> print(i_data[0])
{u'instance_name': u'Instance1', u'attribute2': {u'attribute2_1': u'yes', u'attribute2_2': u'no', u'attribute2_3': u'bye', u'attribute2_4': u'hello', u'attribute2_5': 500}, u'attribute3': u'abcd', u'attribute4': u'wxyz'}
>>>
How can I start this loop? eg:
i = 0
for i in len(i_data[i]):
    with open('{}.csv'.format(i['instance_name']), 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
Trying to test:
>>> i = 0
>>> for i in len(i_data[i]):
...     print('Hello')
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not iterable
>>>
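The TypeError happens because len(i_data) is an int, and ints are not iterable. A minimal sketch of the intended loop, assuming the i_data list shown above and the fieldnames list quoted further below, would iterate over the list itself:
import csv

fieldnames = ['Id', 'domain_name', 'website', 'Usage', 'Limit']

# Iterate over the dictionaries directly rather than over len(i_data).
for instance in i_data:
    with open('{}.csv'.format(instance['instance_name']), 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()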
Secondly, how can I match certain fieldnames in the csv with certain keys from the lists or dictionaries?
For a key called 'Id' I want to place its value into the .csv file under the Id column.
fieldnames = ['Id', 'domain_name', 'website', 'Usage', 'Limit']
Should my fieldnames be the same as the keys, so that the values know where to go? Or how exactly can I do this?
I am getting this error right now:
Traceback (most recent call last):
  File "./usage_2.py", line 42, in <module>
    writer.writerow(t_data['domain_name'])
  File "/usr/local/lib/python2.7/csv.py", line 152, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/usr/local/lib/python2.7/csv.py", line 148, in _dict_to_list
    + ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: u'p', u'e', u'r', u'f', u'e', u'c', u't', u'u', u's', u'g', u'r', u'e', u'u', u'p', u'a', u'h', u'r', u'.', u'o', u'n', u'm', u'i', u'c', u'r', u'o', u's', u'o', u'f', u't', u'.', u'c', u'o', u'm'
I think that's because it's trying to write the JSON data with the u' prefixes, and I also don't know whether the data is going where it's supposed to.
One example of how I write the data:
writer.writerow(t_data['domain_name'])
The entry looks like:
>>> print(t_data['domain_name'])
abc.123.com
>>>
And this 't_data', pulled from another curl request, is represented as a dictionary when I check:
>>> type(t_data)
<type 'dict'>
>>>
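For the ValueError: DictWriter.writerow expects a mapping from fieldnames to values; handing it a bare string makes the csv module iterate over the string character by character, which is exactly the u'p', u'e', u'r', ... soup in the traceback above. A minimal sketch, assuming t_data holds keys matching the fieldnames (output.csv is a placeholder name):
import csv

fieldnames = ['Id', 'domain_name', 'website', 'Usage', 'Limit']

with open('output.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # Build a dict keyed by the fieldnames; writerow then puts each value
    # in its matching column. Keys missing from t_data are left blank.
    writer.writerow({key: t_data.get(key, '') for key in fieldnames})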