python pandas_dedupe loads and clusters file slowly - duplicates

I'm using the pandas-dedupe library to sift through a 51,540 record dataframe with 43 columns.
It's taking about an hour to load the file and then cluster it after providing the active learning inputs.
I've tried changing the sample_size, but overall I must be missing something, since this is quite a small dataset comparatively to what I'm seeing other users throw at it on these forums.
Is it too many columns specified for dedupe? I checked other parameters to include, but didn't come up with anything I'd further include.
if __name__ == "__main__":
df = pd.read_csv(R"filepath.txt", sep="\t", encoding="ISO-8859-1")
dedupe = pdd.dedupe_dataframe(
df,
['fname', 'lname', 'company', 'email'],
sample_size=0.05
)
dedupe.to_csv(R"filepath.txt")

Related

How can I process large files in Code Repositories?

I have a data feed that gives a large .txt file (50-75GB) every day. The file contains several different schemas within it, where each row corresponds to one schema. I would like to split this into partitioned datasets for each schema, how can I do this efficiently?
The largest problem you need to solve is the iteration speed to recover your schemas, which can be challenging for a file at this scale.
Your best tactic here will be to get an example 'notional' file with each of the schemas you want to recover as a line within it, and to add this as a file within your repository. When you add this file into your repo (alongside your transformation logic), you will then be able to push it into a dataframe, much as you would with the raw files in your dataset, for quick testing iteration.
First, make sure you specify txt files as a part of your package contents, this way your tests will discover them (this is covered in documentation under Read a file from a Python repository):
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, In your python repository edit setup.py:
setup(
name=os.environ['PKG_NAME'],
# ...
package_data={
'': ['*.txt']
}
)
I am using a txt file with the following contents:
my_column, my_other_column
some_string,some_other_string
some_thing,some_other_thing,some_final_thing
This text file is at the following path in my repository: transforms-python/src/myproject/datasets/raw.txt
Once you have configured the text file to be shipped with your logic, and after you have included the file itself in your repository, you can then include the following code. This code has a couple of important functions:
It keeps raw file parsing logic completely separate from the stage of reading the file into a Spark DataFrame. This is so that the way this DataFrame is constructed can be left to the test infrastructure, or to the run time, depending on where you are running.
This keeping of the logic separate lets you ensure the actual row-by-row parsing you want to do is its own testable function, instead of having it live purely within your my_compute_function
This code uses the Spark-native spark_session.read.text method, which will be orders of magnitude faster than row-by-row parsing of a raw txt file. This will ensure the parallelized DataFrame is what you operate on, not a single file, line by line, inside your executors (or worse, your driver).
from transforms.api import transform, Input, Output
from pkg_resources import resource_filename
def raw_parsing_logic(raw_df):
return raw_df
#transform(
my_output=Output("/txt_tests/parsed_files"),
my_input=Input("/txt_tests/dataset_of_files"),
)
def my_compute_function(my_input, my_output, ctx):
all_files_df = None
for file_status in my_input.filesystem().ls('**/**'):
raw_df = ctx.spark_session.read.text(my_input.filesystem().hadoop_path + "/" + file_status.path)
parsed_df = raw_parsing_logic(raw_df)
all_files_df = parsed_df if all_files_df is None else all_files_df.unionByName(parsed_df)
my_output.write_dataframe(all_files_df)
def test_my_compute_function(spark_session):
file_path = resource_filename(__name__, "raw.txt")
raw_df = raw_parsing_logic(
spark_session.read.text(file_path)
)
assert raw_df.count() > 0
raw_columns_set = set(raw_df.columns)
expected_columns_set = {"value"}
assert len(raw_columns_set.intersection(expected_columns_set)) == 1
Once you have this code up and running, your test_my_compute_function method will be very fast to iterate on, so that you can perfect your schema recovery logic. This will make it substantially easier to get your dataset building at the very end, but without any of the overhead of a full build.

R: jsonlite's stream_out function producing incomplete/truncated JSON file

I'm trying to load a really big JSON file into R. Since the file is too big to fit into memory on my machine, I found that using the jsonlite package's stream_in/stream_out functions is really helpful. With these functions, I can subset the data first in chunks without loading it, write the subset data to a new, smaller JSON file, and then load that file as a data.frame. However, this intermediary JSON file is getting truncated (if that's the right term) while being written with stream_out. I will now attempt to explain with further detail.
What I'm attempting:
I have written my code like this (following an example from documentation):
con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df){
df <- df[which(df$Var > 0), ]
stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
myData <- stream_in(file(tmp))
As you can see, I open a connection to a temporary file, read my original JSON file with stream_in and have the handler function subset each chunk of data and write it to the connection.
The problem
This procedure runs without any problems, until I try to read it in myData <- stream_in(file(tmp)), upon which I receive an error. Manually opening the new, temporary JSON file reveals that the bottom-most line is always incomplete. Something like the following:
{"Var1":"some data","Var2":3,"Var3":"some othe
I then have to manually remove that last line after which the file loads without issue.
Solutions I've tried
I've tried reading the documentation thoroughly and looking at the stream_out function, and I can't figure out what may be causing this issue. The only slight clue I have is that the stream_out function automatically closes the connection upon completion, so maybe it's closing the connection while some other component is still writing?
I inserted a print function to print the tail() end of the data.frame at every chunk inside the handler function to rule out problems with the intermediary data.frame. The data.frame is produced flawlessly at every interval, and I can see that the final two or three rows of the data.frame are getting truncated while being written to file (i.e., they're not being written). Notice that it's the very end of the entire data.frame (after stream_out has rbinded everything) that is getting chopped.
I've tried playing around with the pagesize arguments, including trying very large numbers, no number, and Inf. Nothing has worked.
I can't use jsonlite's other functions like fromJSON because the original JSON file is too large to read without streaming and it is actually in minified(?)/ndjson format.
System info
I'm running R 3.3.3 x64 on Windows 7 x64. 6 GB of RAM, AMD Athlon II 4-Core 2.6 Ghz.
Treatment
I can still deal with this issue by manually opening the JSON files and correcting them, but it's leading to some data loss and it's not allowing my script to be automated, which is an inconvenience as I have to run it repeatedly throughout my project.
I really appreciate any help with this; thank you.
I believe this does what you want, it is not necessary to do the extra stream_out/stream_in.
myData <- new.env()
stream_in(file("MOCK_DATA.json"), handler = function(df){
idx <- as.character(length(myData) + 1)
myData[[idx]] <- df[which(df$id %% 2 == 0), ] ## change back to your filter
}, pagesize = 200) ## change back to 1000
myData <- myData %>% as.list() %>% bind_rows()
(I created some mock data in Mockaroo: generated 1000 lines, hence the small pagesize, to check if everything worked with more than one chunk. The filter I used was even IDs because I was lazy to create a Var column.)

SQLAlchemy automap: Best practices for performance

I'm building a python app around an existing (mysql) database and am using automap to infer tables and relationships:
base = automap_base()
self.engine = create_engine(
'mysql://%s:%s#%s/%s?charset=utf8mb4' % (
config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
), echo=False
)
# reflect the tables
base.prepare(self.engine, reflect=True)
self.TableName = base.classes.table_name
Using this I can do things like session.query(TableName) etc...
However, I'm worried about performance, because every time the app runs it will do the whole inference again.
Is this a legitimate concern?
If so, is there a possibility to 'cache' the output of Automap?
I think that "reflecting" the structure of your database is not the way to go. Unless your app tries to "infer" things from the structure, like static code analysis would for source files, then it is unnecessary. The other reason for reflecting it at run-time would be the reduced time to begin "using" the database using SQLAlchemy. However:
Another option would be to use something like SQLACodegen (https://pypi.python.org/pypi/sqlacodegen):
It will "reflect" your database once and create a 99.5% accurate set of declarative SQLAlchemy models for you to work with. However, this does require that you keep the model subsequently in-sync with the structure of the database. I would assume that this is not a big concern seeing as the tables you're already-working with are stable-enough such that run-time reflection of their structure does not impact your program much.
Generating the declarative models is essentially a "cache" of the reflection. It's just that SQLACodegen saved it into a very readable set of classes + fields instead of data in-memory. Even with a changing structure, and my own "changes" to the generated declarative models, I still use SQLACodegen later-on in a project whenever I make database changes. It means that my models are generally consistent amongst one-another and that I don't have things such as typos and data-mismatches due to copy-pasting.
Performance can be a legitimate concern. If the database schema is not changing, it can be time consuming to reflect the database every time a script is run. This is more of an issue during development, not starting up a long running application. It's also a significant time saver if your database is on a remote server (again, particularly during development).
I use code that is similar to the answer here (as noted by #ACV). The general plan is to perform the reflection the first time, then pickle the metadata object. The next time the script is run, it will look for the pickle file and use that. The file can be anywhere, but I place mine in ~/.sqlalchemy_cache. This is an example based on your code.
import os
from sqlalchemy.ext.declarative import declarative_base
self.engine = create_engine(
'mysql://%s:%s#%s/%s?charset=utf8mb4' % (
config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
), echo=False
)
metadata_pickle_filename = "mydb_metadata"
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
try:
with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
cached_metadata = pickle.load(file=cache_file)
except IOError:
# cache file not found - no problem, reflect as usual
pass
if cached_metadata:
base = declarative_base(bind=self.engine, metadata=cached_metadata)
else:
base = automap_base()
base.prepare(self.engine, reflect=True) # reflect the tables
# save the metadata for future runs
try:
if not os.path.exists(cache_path):
os.makedirs(cache_path)
# make sure to open in binary mode - we're writing bytes, not str
with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
pickle.dump(Base.metadata, cache_file)
except:
# couldn't write the file for some reason
pass
self.TableName = base.classes.table_name
For anyone using declarative table class definitions, assuming a Base object defined as e.g.
Base = declarative_base(bind=engine)
metadata_pickle_filename = "ModelClasses_trilliandb_trillian.pickle"
# ------------------------------------------
# Load the cached metadata if it's available
# ------------------------------------------
# NOTE: delete the cached file if the database schema changes!!
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
try:
with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
cached_metadata = pickle.load(file=cache_file)
except IOError:
# cache file not found - no problem
pass
# ------------------------------------------
# define all tables
#
class MyTable(Base):
if cached_metadata:
__table__ = cached_metadata.tables['my_schema.my_table']
else:
__tablename__ = 'my_table'
__table_args__ = {'autoload':True, 'schema':'my_schema'}
...
# ----------------------------------------
# If no cached metadata was found, save it
# ----------------------------------------
if cached_metadata is None:
# cache the metadata for future loading
# - MUST DELETE IF THE DATABASE SCHEMA HAS CHANGED
try:
if not os.path.exists(cache_path):
os.makedirs(cache_path)
# make sure to open in binary mode - we're writing bytes, not str
with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
pickle.dump(Base.metadata, cache_file)
except:
# couldn't write the file for some reason
pass
Important Note!! If the database schema changes, you must delete the cached file to force the code to autoload and create a new cache. If you don't, the changes will be be reflected in the code. It's an easy thing to forget.
The answer to your first question is largely subjective. You are adding database queries to fetch the reflection metadata to the application load time. Whether or not that overhead is significant depends on your project requirements.
For reference, I have an internal tool at work that uses a reflection pattern because the the load-time is acceptable for our team. That might not be the case if it were an externally-facing product. My hunch is that for most applications the reflection overhead will not dominate the total application load time.
If you decide it is significant for your purposes, this question has an interesting answer where the user pickles the database metadata in order to locally cache it.
Adding to this, what #Demitri answered is close to correct but (at least in sqlalchemy 1.4.29), the example will fail on the last line self.TableName = base.classes.table_name when generating from the cached file. In this case declarative_base() has no attribute classes.
To fix is as simple as altering:
if cached_metadata:
base = declarative_base(bind=self.engine, metadata=cached_metadata)
else:
base = automap_base()
base.prepare(self.engine, reflect=True) # reflect the tables
to
if cached_metadata:
base = automap_base(declarative_base(bind=self.engine, metadata=cached_metadata))
base.prepare()
else:
base = automap_base()
base.prepare(self.engine, reflect=True) # reflect the tables
This will create your automap object with the proper attributes.

Django Query Natural Sort

Let's say I have this Django model:
class Question(models.Model):
question_code = models.CharField(max_length=10)
and I have 15k questions in the database.
I want to sort it by question_code, which is alphanumeric. This is quite a classical problem and has been talked about in:
http://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
Does Python have a built in function for string natural sort?
I tried the code in the 2nd link (which is copied below, changed a bit), and notice it takes up to 3 seconds to sort the data. To make sure about the function's performance, I write a test which creates a list of 100k random alphanumeric string. It takes only 0.76s to sort that list. So what's happening?
This is what I think. The function needs to get the question_code of each question for comparing, thus calling this function to sort 15k values means requesting mysql 15k separate times. And this is the reason why it takes so long. Any idea? And any solution to natural sort for Django in general? Thanks a lot!
def natural_sort(l, ascending, key=lambda s:s):
def get_alphanum_key_func(key):
convert = lambda text: int(text) if text.isdigit() else text
return lambda s: [convert(c) for c in re.split('([0-9]+)', key(s))]
sort_key = get_alphanum_key_func(key)
return sorted(l, key=sort_key, reverse=ascending)
As far as I'm aware there isn't a generic Django solution to this. You can reduce your memory usage and limit your db queries by building an id/question_code lookup structure
from natsort import natsorted
question_code_lookup = Question.objects.values('id','question_code')
ordered_question_codes = natsorted(question_code_lookup, key=lambda i: i['question_code'])
Assuming you want to page the results you can then slice up ordered_question_codes, perform another query to retrieve all the questions you need order them according to their position in that slice
#get the first 20 questions
ordered_question_codes = ordered_question_codes[:20]
question_ids = [q['id'] for q in ordered_question_codes]
questions = Question.objects.filter(id__in=question_ids)
#put them back into question code order
id_to_pos = dict(zip((question_ids), range(len(question_ids))))
questions = sorted(questions, key = lambda x: id_to_pos[x.id])
If the lookup structure still uses too much memory, or takes too long to sort, then you'll have to come up with something more advanced. This certainly wouldn't scale well to a huge dataset

Trouble getting a dynamodb scan to work with boto

I'm using boto to access a dynamodb table. Everything was going well until I tried to perform a scan operation.
I've tried a couple of syntaxes I've found after repeated searches of The Internet, but no luck:
def scanAssets(self, asset):
results = self.table.scan({('asset', 'EQ', asset)})
-or-
results = self.table.scan(scan_filter={'asset':boto.dynamodb.condition.EQ(asset)})
The attribute I'm scanning for is called 'asset', and asset is a string.
The odd thing is the table.scan call always ends up going through this function:
def dynamize_scan_filter(self, scan_filter):
"""
Convert a layer2 scan_filter parameter into the
structure required by Layer1.
"""
d = None
if scan_filter:
d = {}
for attr_name in scan_filter:
condition = scan_filter[attr_name]
d[attr_name] = condition.to_dict()
return d
I'm not a python expert, but I don't see how this would work. I.e. what kind of structure would scan_filter have to be to get through this code?
Again, maybe I'm just calling it wrong. Any suggestions?
OK, looks like I had an import problem. Simply using:
import boto
and specifying boto.dynamodb.condition doesn't cut it. I had to add:
import dynamodb.condition
to get the condition type to get picked up. My now working code is:
results = self.table.scan(scan_filter={'asset': dynamodb.condition.EQ(asset)})
Not that I completely understand why, but it's working for me now. :-)
Or you can do this
exclusive_start_key = None
while True:
result_set = self.table.scan(
asset__eq=asset, # The scan filter is explicitly given here
max_page_size=100, # Number of entries per page
limit=100,
# You can divide the table by n segments so that processing can be done parallelly and quickly.
total_segments=number_of_segments,
segment=segment, # Specify which segment you want to process
exclusive_start_key=exclusive_start_key # To start for last key seen
)
dynamodb_items = map(lambda item: item, result_set)
# Do something with your item, add it to a list for later processing when you come out of the while loop
exclusive_start_key = result_set._last_key_seen
if not exclusive_start_key:
break
This is applicable for any field.
segmentation: suppose you have above script in test.py
you can run parallelly like
python test.py --segment=0 --total_segments=4
python test.py --segment=1 --total_segments=4
python test.py --segment=2 --total_segments=4
python test.py --segment=3 --total_segments=4
in different screens