Why do I keep getting this error every time I try an action on my RDD, and how can I fix it?
/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
I've tried to figure out which is the last RDD I can run an action on, and it's ratingsPerUser, which indicates the problem is in the flatMap.
What I'm trying to do: I take a CSV with (userID,movieID,rating) rows and I want to create the unique combinations of movieIDs per userID, together with their ratings. Note that different users can generate the same movieID pair. For example, for this CSV:
1,2000,5
1,2001,2
1,2002,3
2,2000,4
2,2001,1
2,2004,5
I want this RDD:
key (2000,2001), value (5,2,1)
key (2000,2002), value (5,3,1)
key (2001,2002), value (2,3,1)
key (2000,2001), value (4,1,1)
key (2000,2004), value (4,5,1)
key (2001,2004), value (1,5,1)
# First Map function - gets a line and returns key (userID), value (movieID, rating)
def parseLine(line):
    fields = line.split(",")
    userID = int(fields[0])
    movieID = int(fields[1])
    rating = int(fields[2])
    return userID, (movieID, rating)

# Function to create unique movie pairs with their ratings
# all pairs start with the lowest movieID
# returns key (movieID-i, movieID-j) & value (rating-i, rating-j, 1)
# the 1 in the value is added in order to count the number of ratings in the reduce
def createPairs(userRatings):
    pairs = []
    for i1 in range(len(userRatings[1]) - 1):
        for i2 in range(i1 + 1, len(userRatings[1])):
            if userRatings[1][i1][0] < userRatings[1][i2][0]:
                pairs.append(((userRatings[1][i1][0], userRatings[1][i2][0]),
                              (userRatings[1][i1][1], userRatings[1][i2][1], 1)))
            else:
                pairs.append(((userRatings[1][i2][0], userRatings[1][i1][0]),
                              (userRatings[1][i2][1], userRatings[1][i1][1], 1)))
    return pairs
# Create an RDD from the ratings file
lines = sc.textFile("/FileStore/tables/dvmlbdnj1487603982330/ratings.csv")
# Map lines to key (userID), value (movieID, rating)
movieRatings = lines.map(parseLine)
# Group all ratings by the same user under one key
# (UserID1,(movie1,rating1)),(UserID1,(movie2,rating2)) --> UserID1,[(movie1,rating1),(movie2,rating2)]
ratingsPerUser = movieRatings.groupByKey()
# Apply the createPairs function
# We use flatMap, since each user has a different number of ratings --> a different number of pairs
pairsOfMovies = ratingsPerUser.flatMap(createPairs)
The problem is in the function passed to flatMap, not in flatMap itself.
groupByKey returns the grouped values as an iterable, not a list:
it cannot be traversed multiple times;
it cannot be indexed.
Convert the values to a list first:
ratingsPerUser.mapValues(list).flatMap(createPairs)
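For completeness, a minimal sketch of the corrected pipeline under that fix (same parseLine, lines and movieRatings as above; itertools.combinations is used here only as a compact alternative to the question's nested index loops):

from itertools import combinations

def createPairs(userRatings):
    # userRatings is (userID, [(movieID, rating), ...]) once mapValues(list) has run
    _, ratings = userRatings
    # sorting by movieID guarantees every pair starts with the lower ID
    return [((m1, m2), (r1, r2, 1))
            for (m1, r1), (m2, r2) in combinations(sorted(ratings), 2)]

pairsOfMovies = (movieRatings.groupByKey()
                             .mapValues(list)   # materialize the grouped values
                             .flatMap(createPairs))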
I'm building a desktop app with PyQt5 to connect to a MySQL database, load data from it, insert data into it, and update it. What I came up with for updating the database and inserting data works, but I feel there should be a much faster way to do it in terms of computation speed. If anyone could help that would be really helpful. What I have as of now for updating the database is this:
def log_change(self, item):
    self.changed_items.append([item.row(), item.column()])
    # I connect this function to the itemChanged signal to log any cells which have been changed

def update_db(self):
    # Creating an empty list to remove the duplicated cells from the initial list
    self.changed_items_load = []
    [self.changed_items_load.append(x) for x in self.changed_items if x not in self.changed_items_load]
    # loop through a copy of the changed_items_load list and remove cells with no values in them
    for db_wa in self.changed_items_load[:]:
        if self.tableWidget.item(db_wa[0], db_wa[1]).text() == "":
            self.changed_items_load.remove(db_wa)
    try:
        mycursor = mydb.cursor()
        # loop through the list and update the database cell by cell
        for ecr in self.changed_items_load:
            # table widget column name matches db table column name
            # self.col_names is a list of the tableWidget columns
            command = "update table1 set `{col_name}` = %s where id=%s;"
            data = (str(self.tableWidget.item(ecr[0], ecr[1]).text()),
                    int(self.tableWidget.item(ecr[0], 0).text()))
            mycursor.execute(command.format(col_name=self.col_names[ecr[1]]), data)
        mydb.commit()
        mycursor.close()
    except OperationalError:
        Msgbox = QMessageBox()
        Msgbox.setText("Error! Connection to database lost!")
        Msgbox.exec()
    except NameError:
        Msgbox = QMessageBox()
        Msgbox.setText("Error! Connect to database!")
        Msgbox.exec()
For inserting data and new rows into the db I was able to find some info online, but I have been unable to insert multiple rows at once, or to insert a varying number of columns for each row (e.g. insert only 2 columns in row 1, and then 3 columns in row 2... something like that).
def insert_db(self):
    # creating a list of each column
    self.a = [self.tableWidget.item(row, 1).text() for row in range(self.tableWidget.rowCount()) if self.tableWidget.item(row, 1) != None]
    self.b = [self.tableWidget.item(row, 2).text() for row in range(self.tableWidget.rowCount()) if self.tableWidget.item(row, 2) != None]
    self.c = [self.tableWidget.item(row, 3).text() for row in range(self.tableWidget.rowCount()) if self.tableWidget.item(row, 3) != None]
    self.d = [self.tableWidget.item(row, 4).text() for row in range(self.tableWidget.rowCount()) if self.tableWidget.item(row, 4) != None]
    try:
        mycursor = mydb.cursor()
        mycursor.execute("INSERT INTO table1(Name, Date, Quantity, Comments) VALUES ('%s', '%s', '%s', '%s')" % (''.join(self.a),
                                                                                                                 ''.join(self.b),
                                                                                                                 ''.join(self.c),
                                                                                                                 ''.join(self.d)))
        mydb.commit()
        mycursor.close()
    except OperationalError:
        Msgbox = QMessageBox()
        Msgbox.setText("Error! Connection to database lost!")
        Msgbox.exec()
    except NameError:
        Msgbox = QMessageBox()
        Msgbox.setText("Error! Connect to database!")
        Msgbox.exec()
Help would be appreciated. Thanks.
Like if I want to insert only 2 columns at row 1, and then 3 columns at row 2
No. A given database table has a specific number of columns. That is an integral part of the definition of a "table".
INSERT adds new rows to a table. It is possible to construct a single SQL statement that inserts multiple rows "all at once".
UPDATE modifies one or more rows of a table. The rows are indicated by some condition specified in the Update statement.
Constructing SQL with %s is risky -- it gets in trouble if there are quotes in the string being inserted.
(I hope these comments help you get to the next stage of understanding databases.)
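Building on those two points, here is a rough sketch of a parameterized, multi-row insert using cursor.executemany; it reuses the mydb connection, tableWidget and table1 columns from the question, and the row-gathering logic is simplified and only illustrative:

def insert_db(self):
    rows = []
    for row in range(self.tableWidget.rowCount()):
        # collect the four data cells of this row; empty/missing cells become None (SQL NULL)
        cells = [self.tableWidget.item(row, col) for col in range(1, 5)]
        values = [c.text() if c is not None and c.text() != "" else None for c in cells]
        if any(v is not None for v in values):
            rows.append(tuple(values))
    if not rows:
        return
    try:
        mycursor = mydb.cursor()
        # one parameterized statement executed for all rows;
        # the driver does the quoting, so quotes inside the data are safe
        mycursor.executemany(
            "INSERT INTO table1 (Name, Date, Quantity, Comments) VALUES (%s, %s, %s, %s)",
            rows)
        mydb.commit()
        mycursor.close()
    except OperationalError:
        Msgbox = QMessageBox()
        Msgbox.setText("Error! Connection to database lost!")
        Msgbox.exec()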
High Level: I am trying to assign identical keys with unique values to their own respective variables. The number of identical keys can be 1 or many. I have a list that is shown here:
[{'client_id_new': 'client key 1'}, {'client_id_new': 'client key 2'}]
This line allows me to retrieve the number of identical keys: newClientIds = sum('client_id_new' in d for d in parsed)
I subtract 1 from this value so it can be in line with an index beginning at 0.
My question: What is the best way to assign the unique values to variables that will iterate throughout the entirety of the dictionary?
Adding to the question:
I would like to access the dictionary values in the list and assign them to a single variable that contains all of the values from the list, separated by commas.
So for example, a for loop that spits out an end result looking like this:
print(allValues)
allValues (a string variable) would look something like this:
"client key 1", "client key 2"
If there were more values above, like:
[{'client_id_new': 'client key 1'}, {'client_id_new': 'client key 2'}, {'client_id_new': 'client key 344'}, {'client_id_new': 'client key 327'}]
Then the output of the 'allValues' variable would look like this:
"client key 1", "client key 2", "client key 344", "client key 327"
My attempt looks something like this, but I'm not sure if this is the best way to do it.
count = 0
while (count <= newClientIds):
    newString = parsed[count] + newString
    count += 1
print(newString)
while (count <= newClientIds):
    if (count == 0):
        newString = parsed[count]['client_id_new']
    else:
        newString = parsed[count]['client_id_new'] + ", " + newString
    count += 1
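For what it's worth, a short sketch of one idiomatic way to build that string with str.join instead of a counter loop (assuming the list is named parsed, as above):

parsed = [{'client_id_new': 'client key 1'}, {'client_id_new': 'client key 2'},
          {'client_id_new': 'client key 344'}, {'client_id_new': 'client key 327'}]

# wrap each value in double quotes and join them with ", "
allValues = ", ".join('"{}"'.format(d['client_id_new']) for d in parsed)
print(allValues)   # "client key 1", "client key 2", "client key 344", "client key 327"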
Using a Greenplum 5.* database, which is based on Postgres 8.4.
I am using the row_to_json and array_to_json functions to create JSON output, but this ends up producing keys with null values in the JSON. Recent Postgres versions have the json_strip_nulls function to remove keys with null values.
I need to import the generated JSON files into MongoDB, but mongoimport doesn't have an option to ignore null keys in the JSON either.
One way I tried is to create the JSON file with the nulls and then use sed to remove the null fields from the file:
sed -i 's/\(\(,*\)"[a-z_]*[0-9]*":null\(,*\)\)*/\3/g' output.json
But I am looking for a way to do it in the database itself, as that will be faster. Any suggestions on how to implement a json_strip_nulls equivalent in Greenplum without affecting query performance?
I've had the same issue in GP 5.17 on pg8.3, and have had success with this regex to remove the null-valued key pairs. I use this in the initial insert to a json column, but you could adapt it however you need:
select
    col5,
    col6,
    regexp_replace(regexp_replace(
        (SELECT row_to_json(j) FROM
            (SELECT col1, col2, col3, col4) AS j
        )::text,
        '(?!{|,)("[^"]+":null[,]*)', '', 'g'), '(,})$', '}')::json
    AS nvp_json
from foo
Working from the inside out: the result of the row_to_json constructor is first cast to text, then the inner regexp_replace removes any "name":null, values, the outer regexp_replace trims any hanging comma from the end, and finally the whole thing is cast back to json.
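As a quick sanity check, a sketch of the two regexp_replace calls applied to a made-up literal (not from the original answer):

SELECT regexp_replace(
           regexp_replace('{"a":1,"b":null,"c":null}',
                          '(?!{|,)("[^"]+":null[,]*)', '', 'g'),
           '(,})$', '}');
-- expected result: {"a":1}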
I solved this problem using a PL/Python function. This generic function can be used to remove null- and empty-valued keys from any JSON.
CREATE OR REPLACE FUNCTION json_strip_null(json_with_nulls json)
RETURNS text
AS $$
import json

def clean_empty(d):
    if not isinstance(d, (dict, list)):
        return d
    if isinstance(d, list):
        return [v for v in (clean_empty(v) for v in d) if v not in (None, '')]
    return {k: v for k, v in ((k, clean_empty(v)) for k, v in d.items()) if v not in (None, '')}

json_to_dict = json.loads(json_with_nulls)
json_without_nulls = clean_empty(json_to_dict)
return json.dumps(json_without_nulls, separators=(',', ':'))
$$ LANGUAGE plpythonu;
This function can be used as:
SELECT json_strip_null(row_to_json(t))
FROM table t;
You can use COALESCE to replace the nulls with an empty string or another value.
https://www.postgresql.org/docs/8.3/functions-conditional.html
The COALESCE function returns the first of its arguments that is not null. Null is returned only if all arguments are null. It is often used to substitute a default value for null values when data is retrieved for display, for example:
SELECT COALESCE(description, short_description, '(none)') ...
This returns description if it is not null, otherwise short_description if it is not null, otherwise (none).
...
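For the OP's row_to_json case, a minimal sketch of how COALESCE could be applied inside the subquery (foo and the col* names are placeholders, and the nullable columns are assumed to be text); note that this substitutes an empty string for NULL rather than dropping the key:

SELECT row_to_json(j)
FROM (
    SELECT col1,
           COALESCE(col2, '') AS col2,
           COALESCE(col3, '') AS col3
    FROM foo
) AS j;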
I have a User object and I am attempting to do 2 different queries as part of a script that needs to run nightly. Given the schema below I would like to:
Get all the Users with a non nil end_date
Get all the Users with an end_date that is prior to today (i.e. one that has passed)
Users Schema:
# == Schema Information
#
# Table name: users
#
# id :integer not null, primary key
# name :string(100) default("")
# end_date :datetime
I've been trying to use User.where('end_date != NULL') and other things, but I cannot seem to get the syntax right.
Your methods inside the User model should be as below:
def self.users_with_end_date_not_null
  self.where.not(end_date: nil)
  # below Rails 4, use:
  # self.where("end_date IS NOT NULL")
end

def self.past_users(n)
  self.where(end_date: n.day.ago)
end
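For the second requirement (an end_date that is prior to today), a comparison against the current date is probably closer to the intent than matching one exact timestamp. A rough sketch, not part of the original answer (the method name is made up):

def self.past_end_date_users
  # all users whose end_date has already passed
  where("end_date < ?", Date.current)
end

# in the nightly script:
# User.past_end_date_users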
When I run
rails g model StripeCustomer user_id:integer customer_id:integer
annotate
I got
# == Schema Information
# Table name: stripe_customers
# id :integer(4) not null, primary key
# user_id :integer(4)
# customer_id :integer(4)
# created_at :datetime
# updated_at :datetime
Does this mean I can hold only up to 9,999 records? (I am quite surprised at how small the default size for keys is.) How do I change the default IDs to be 7 digits in the existing tables?
Thank you.
While the mysql client's DESCRIBE command does show the display width (see the docs), the schema information in the OP's question is very probably generated by the annotate_models gem's get_schema_info method, which uses the limit attribute of each column. And the limit attribute is the number of bytes for :binary and :integer columns (see the docs).
The method reads (see how the last line adds the limit):
def get_schema_info(klass, header, options = {})
  info = "# #{header}\n#\n"
  info << "# Table name: #{klass.table_name}\n#\n"
  max_size = klass.column_names.collect{|name| name.size}.max + 1
  klass.columns.each do |col|
    attrs = []
    attrs << "default(#{quote(col.default)})" unless col.default.nil?
    attrs << "not null" unless col.null
    attrs << "primary key" if col.name == klass.primary_key
    col_type = col.type.to_s
    if col_type == "decimal"
      col_type << "(#{col.precision}, #{col.scale})"
    else
      col_type << "(#{col.limit})" if col.limit
    end
    #...
  end
Rails actually means 4 bytes here, i.e. the standard MySQL INTEGER type (see the docs).
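So integer(4) is the ordinary 4-byte MySQL INT, which holds values up to roughly 2.1 billion; there is no 9,999-record ceiling, and 7-digit IDs already fit. If you ever do need wider columns, a rough sketch of a migration for the question's table (class name is made up; limit is in bytes, so 8 maps to MySQL BIGINT):

class WidenStripeCustomerIds < ActiveRecord::Migration
  def up
    change_column :stripe_customers, :user_id,     :integer, limit: 8
    change_column :stripe_customers, :customer_id, :integer, limit: 8
  end

  def down
    change_column :stripe_customers, :user_id,     :integer, limit: 4
    change_column :stripe_customers, :customer_id, :integer, limit: 4
  end
end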