I am trying to read some data into a Pandas DataFrame and am running into problems because of the volume of the data.
Specs of PC:
RAM 32 GB
Intel Core i7, 4 GHz
Setup:
Data is in MySQL DB, 9 columns (7 int, 1 date, 1 DateTime). DB is on the local machine, so no internet bandwidth issues.
22 million rows of data.
I tried to read directly from the MySQL server, but it never finishes.

import sqlalchemy
import pandas as pd

engine = sqlalchemy.create_engine('mysql+pymysql://root:<password>@localhost:3306/database')
search_df = pd.read_sql_table('search', engine)
I checked SO and got the impression that, instead of using the connector, it would be better to parse a CSV, so I exported the table to CSV.
CSV file size: 1.5 GB
My code:
dtype = {
    'search_id': int,
    'job_count_total': int,
    'job_count_done': int,
    'city_id_start': int,
    'city_id_end': int,
    'date_start': str,
    'datetime_create': str,
    'agent_id': int,
    'ride_segment_found_cnt': int
}

search_df = pd.read_csv('search.csv', sep=',', dtype=dtype)
I tried both engines, c and python, different chunk sizes, low_memory set to True and False, with and without specified dtypes, but I still get a MemoryError.
I tried everything mentioned in the question above (which was marked as the original; mine was closed as a duplicate), but nothing changed.
I spotted only two differences:
If I parse without chunks, I get the MemoryError during parsing.
When I parse in chunks, it happens during concatenation into one DataFrame.
Also, chunking by 5,000,000 rows gives the error during parsing; smaller chunks fail during concatenation.
Here is an error message on concatenation:
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
Basically, the problem was memory.
I played a bit with the chunk size and added some filtering, which I previously did later in the code, to each chunk.
That allowed me to fit the DataFrame into memory.
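For reference, a minimal sketch of that chunked-read-plus-filter approach (the chunk size and the filter condition on job_count_done are illustrative; the real filter was whatever I applied later in my own code):

import pandas as pd

# dtype is the mapping defined above
chunks = []
for chunk in pd.read_csv('search.csv', sep=',', dtype=dtype, chunksize=1_000_000):
    # filter each chunk as early as possible so only the needed rows stay in memory
    chunk = chunk[chunk['job_count_done'] > 0]   # illustrative condition
    chunks.append(chunk)
search_df = pd.concat(chunks, ignore_index=True)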
Related
As far as I know, Spark executors handle many tasks at the same time to process data in parallel. Here comes the question: when connecting to external data storage, say MySQL, how many tasks are there to finish this job? In other words, are multiple tasks created at the same time, each reading all the data, or is the data read by only one task and distributed to the cluster in some other way? And how about writing data to MySQL, how many connections are there?
Here is a piece of code that reads or writes data from/to MySQL:
def jdbc(sqlContext: SQLContext, url: String, driver: String, dbtable: String, user: String, password: String, numPartitions: Int): DataFrame = {
  sqlContext.read.format("jdbc").options(Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> s"(SELECT * FROM $dbtable) $dbtable",
    "user" -> user,
    "password" -> password,
    "numPartitions" -> numPartitions.toString
  )).load
}

def mysqlToDF(sparkSession: SparkSession, jdbc: JdbcInfo, table: String): DataFrame = {
  val dF1 = sparkSession.sqlContext.read.format("jdbc")
    .option("url", jdbc.jdbcUrl)
    .option("user", jdbc.user)
    .option("password", jdbc.passwd)
    .option("driver", jdbc.jdbcDriver)
    .option("dbtable", table)
    .load()
  // dF1.show(3)
  dF1.createOrReplaceTempView(s"${table}")
  dF1
}
Here is a good article which answers your question:
https://freecontent.manning.com/what-happens-behind-the-scenes-with-spark/
In simple words: the workers split the reading task into several parts, and each worker reads only a part of your input data. The number of tasks depends on your resources and your data volume. Writing follows the same principle: Spark writes the data to a distributed storage system, such as HDFS, and in HDFS the data is stored in a distributed way: each worker writes its data to some storage node in HDFS.
By default, data from a JDBC source is loaded by a single thread, so you will have one task processed by one executor; that is the case you can expect with your second function, mysqlToDF.
In the first function, jdbc, you are closer to a parallel read, but some parameters are still needed; numPartitions alone is not enough. Spark needs an integer/date column and lower/upper bounds to be able to read in parallel (it will execute numPartitions queries for partial results). See the sketch after the documentation excerpt below.
Spark JDBC documentation
In this documentation you will find:
partitionColumn, lowerBound, upperBound (none): These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. This option applies only to reading.
numPartitions (none): The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. (read/write)
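For illustration, a parallel read using these options might look like the following PySpark sketch (the connection details, partition column, and bounds are placeholders; the bounds would normally come from a MIN/MAX query on that column):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-jdbc-read").getOrCreate()

# lowerBound/upperBound only define the partition stride, they do not filter rows
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/database")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", "search")
      .option("user", "root")
      .option("password", "password")
      .option("partitionColumn", "search_id")   # numeric/date/timestamp column
      .option("lowerBound", "1")
      .option("upperBound", "22000000")
      .option("numPartitions", "8")             # up to 8 concurrent JDBC connections
      .load())

print(df.rdd.getNumPartitions())  # 8 partitions, each read by its own task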
Regarding writes:
How about writing data to MySQL, how many connections are there?
As stated in the documentation, it also depends on numPartitions: if the number of partitions when writing is higher than numPartitions, Spark will figure it out and call coalesce. Remember that coalesce may generate skew, so sometimes it may be better to repartition explicitly with repartition(numPartitions) to distribute the data evenly before the write.
If you don't set numPartitions, the number of parallel connections on write may be as high as the number of active tasks at a given moment, so be aware that with too much parallelism and no upper bound you may choke the source server.
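On the write side, a hedged sketch (reusing the df and the placeholder connection details from the read example above) that repartitions explicitly instead of relying on coalesce could look like this:

# distribute data evenly across writers; numPartitions caps concurrent JDBC connections
(df.repartition(8)
   .write.format("jdbc")
   .option("url", "jdbc:mysql://localhost:3306/database")
   .option("driver", "com.mysql.cj.jdbc.Driver")
   .option("dbtable", "search_copy")   # hypothetical target table
   .option("user", "root")
   .option("password", "password")
   .option("numPartitions", "8")
   .mode("append")
   .save())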
The Azure Table Service documentation states that entities (rows) must have at most 255 properties, which I understand to mean these tables can have at most 255 columns, which seems highly restrictive.
Two questions: first, do the same limits apply to Cosmos DB Table Storage? I can't seem to find any documentation that says one way or another, though the language of "entities" is still used. And second--if the same limit applies in Cosmos DB--is there any useful way around this limit for storage and querying, along the lines of JSON in SQL Server?
EDIT: Here is some example code that attempts to write an entity with 260 properties to Cosmos DB Table Storage, and the error that is thrown. Account names, keys, and such are redacted.
# Libraries
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity
import csv
import os
# Connect
## Table Storage
"""
access_key = 'access_key'
table_service = TableService(account_name='account_name', account_key= access_key)
"""
## Cosmos DB Table Storage
connection_string = "connection_string"
table_service = TableService(connection_string=connection_string)
# Create Table
if not table_service.exists('testTable'):
    table_service.create_table('testTable')
length = 260
letters = [chr(i) for i in range(ord('a'), ord('z') + 1)]
keys = [a + b + c for a in letters for b in letters for c in letters][:length]
values = ['0' * (8 - len(str(i))) + str(i) for i in range(length)]
entity = dict(zip(keys, values))
entity['PartitionKey'] = 'TestKey'
entity['RowKey'] = '1'
table_service.insert_entity('testTable', entity)
This raises "ValueError: The entity contains more properties than allowed."
first, do the same limits apply to Cosmos DB Table Storage?
Based on the Azure Table storage limits, as you said, the max number of properties in a table entity is 255. However, I just found the statement below in the Azure Cosmos DB limits.
Azure Cosmos DB is a global scale database in which throughput and storage can be scaled to handle whatever your application requires. If you have any questions about the scale Azure Cosmos DB provides, please send email to askcosmosdb@microsoft.com.
According to my test (I tried to add 260 properties to an entity), the Azure Cosmos DB Table API accepts entities with more than 255 properties.
If you want an official reply, you could send an email to the address above.
is there any useful way around this limit for storage and querying,
along the lines of JSON in SQL Server?
If you want to store and query data in JSON format, I suggest using the Cosmos DB SQL API; it is versatile and flexible. You could refer to the doc.
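For illustration, here is a minimal sketch with the Python SDK for the SQL API (azure-cosmos); the endpoint, key, and database/container names are placeholders, and the point is that an item is just an arbitrary JSON document rather than a flat entity capped at 255 properties:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists("testdb")
container = database.create_container_if_not_exists(
    id="items", partition_key=PartitionKey(path="/pk"))

# 260 fields nested under a single property of one JSON document
container.create_item(body={
    "id": "1",
    "pk": "TestKey",
    "payload": {"field{}".format(i): i for i in range(260)}
})

# query it back with SQL-like syntax
items = list(container.query_items(
    query="SELECT c.id, c.payload FROM c WHERE c.pk = 'TestKey'",
    enable_cross_partition_query=True))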
Besides, if your data is stored in a SQL Server database now, you could use the Migration Tool to import it into Cosmos DB, or you could use Azure Data Factory for more custom transfers.
Hope it helps you.
Since this pops up pretty high in Google searches: as of now, it's 255 (minus 2 if you encrypt).
I just did a quick test using pytest:

from azure.cosmosdb.table import TableService

# table_service and my_test_table_name are set up elsewhere in the test;
# get_dummy_dict_entry_with_many_col builds a dict entity with the given number of properties
field_number = 250
entity = get_dummy_dict_entry_with_many_col(field_number)
for x in range(field_number, 1000):
    print("Adding entity with {} elements.".format(len(entity)))
    table_service.insert_entity(my_test_table_name, entity)
    field_number += 1
    entity["Field_nb_{}".format(field_number)] = field_number
    entity["RowKey"] += str(field_number)
and got an exception in "def _validate_entity(entity, encrypt=None):"
# Two properties are added during encryption. Validate sufficient space
max_properties = 255
if encrypt:
    max_properties = max_properties - 2

# Validate there are not more than 255 properties including Timestamp
if (len(entity) > max_properties) or (len(entity) == max_properties and 'Timestamp' not in entity):
>       raise ValueError(_ERROR_TOO_MANY_PROPERTIES)
E       ValueError: The entity contains more properties than allowed.
I'm trying to load a really big JSON file into R. Since the file is too big to fit into memory on my machine, I found that using the jsonlite package's stream_in/stream_out functions is really helpful. With these functions, I can subset the data first in chunks without loading it, write the subset data to a new, smaller JSON file, and then load that file as a data.frame. However, this intermediary JSON file is getting truncated (if that's the right term) while being written with stream_out. I will now attempt to explain with further detail.
What I'm attempting:
I have written my code like this (following an example from documentation):
con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df){
  df <- df[which(df$Var > 0), ]
  stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
myData <- stream_in(file(tmp))
As you can see, I open a connection to a temporary file, read my original JSON file with stream_in and have the handler function subset each chunk of data and write it to the connection.
The problem
This procedure runs without any problems, until I try to read it in myData <- stream_in(file(tmp)), upon which I receive an error. Manually opening the new, temporary JSON file reveals that the bottom-most line is always incomplete. Something like the following:
{"Var1":"some data","Var2":3,"Var3":"some othe
I then have to manually remove that last line after which the file loads without issue.
Solutions I've tried
I've tried reading the documentation thoroughly and looking at the stream_out function, and I can't figure out what may be causing this issue. The only slight clue I have is that the stream_out function automatically closes the connection upon completion, so maybe it's closing the connection while some other component is still writing?
I inserted a print function to print the tail() end of the data.frame at every chunk inside the handler function to rule out problems with the intermediary data.frame. The data.frame is produced flawlessly at every interval, and I can see that the final two or three rows of the data.frame are getting truncated while being written to file (i.e., they're not being written). Notice that it's the very end of the entire data.frame (after stream_out has rbinded everything) that is getting chopped.
I've tried playing around with the pagesize arguments, including trying very large numbers, no number, and Inf. Nothing has worked.
I can't use jsonlite's other functions like fromJSON because the original JSON file is too large to read without streaming and it is actually in minified(?)/ndjson format.
System info
I'm running R 3.3.3 x64 on Windows 7 x64, with 6 GB of RAM and an AMD Athlon II 4-core 2.6 GHz.
Workaround
I can still deal with this issue by manually opening the JSON files and correcting them, but it's leading to some data loss and it's not allowing my script to be automated, which is an inconvenience as I have to run it repeatedly throughout my project.
I really appreciate any help with this; thank you.
I believe this does what you want; it is not necessary to do the extra stream_out/stream_in.
library(jsonlite)
library(dplyr)  # for %>% and bind_rows()

myData <- new.env()
stream_in(file("MOCK_DATA.json"), handler = function(df){
  idx <- as.character(length(myData) + 1)
  myData[[idx]] <- df[which(df$id %% 2 == 0), ] ## change back to your filter
}, pagesize = 200) ## change back to 1000
myData <- myData %>% as.list() %>% bind_rows()
(I created some mock data in Mockaroo: generated 1000 lines, hence the small pagesize, to check if everything worked with more than one chunk. The filter I used was even IDs because I was lazy to create a Var column.)
We use Redis database as:
key -> (file1, file2, file3)
The value is always a list of three compressed files.
Not all the keys have the 3 files, e.g.:
key2 -> (file4, file5)
Files are compressed using zlib.
Each file is between 50 and 120 KB compressed.
I want to store the "complete" keys (the ones with 3 files in the list) in a database.
Until now I was storing the data in a MySQL table :
key_id : INTEGER , PRIMARY KEY
first : BLOB
second : BLOB
third : BLOB
This works pretty well, except that inserts are slow (the MySQL server is doing other things at the same time).
I will query the data very rarely, but I want to be able to get the entries one by one easily.
Redis is a database, and I know I can dump it to a file (RDB files), so I think it is redundant to use another DBMS. But the Redis database is memory limited, so I cannot just wait until the production of the values (files) is finished and then dump everything to an RDB file.
I would like to create smaller RDB files that contain only the "complete" keys.
E.g., at time 1 Redis contains the following:
key3 -> (a, b, c)
key14 -> (e, f)
key1 -> (g, h, i)
then if I decide to dump, the file 1.rdb should contain only:
key3 and key1
If the dump is successful, I will delete the dumped keys (key3 and key1)
and Redis should then have:
key14 -> (e, f)
Now at time 5 Redis contains:
key5 -> (i, j , k)
key14 -> (d, e, f)
key6 -> (l, m)
So if I save to 2.rdb, the file should contain only:
key5, key14
and then the above keys should be deleted from Redis.
Is that possible? I am using Python, if it matters.
Do you have another idea for this task?
Another DBMS, storing directly to the filesystem, etc.
P.S. I forgot to mention that in total there would be around 15,000,000 keys, so there would be 15,000,000 * 3 files.
Also, I use Linux with an ext3 filesystem.
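For context, this is roughly what I have in mind, sketched with redis-py (the output directory, writing each file straight to disk, and decompressing are placeholders rather than requirements):

import os
import zlib
import redis

r = redis.Redis(host='localhost', port=6379, db=0)
OUT_DIR = 'complete_keys'              # hypothetical output directory
os.makedirs(OUT_DIR, exist_ok=True)

for key in r.scan_iter('*'):           # iterate keys without blocking the server
    if r.llen(key) != 3:               # only the "complete" keys with 3 files
        continue
    for i, blob in enumerate(r.lrange(key, 0, -1)):
        path = os.path.join(OUT_DIR, '{}_{}'.format(key.decode(), i))
        with open(path, 'wb') as fh:
            fh.write(zlib.decompress(blob))   # or write blob as-is to keep it compressed
    r.delete(key)                      # free the memory of the exported key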
I'd try to keep everything in one system. Do you really need Redis to manage the keysets? A database table that gets accessed frequently should stay mostly cached in memory.
I set up my Rails application twice: one working with MongoDB (Mongoid as the mapper) and the other with MySQL and ActiveRecord. Then I wrote a rake task which inserts some test data into both databases (100,000 entries).
I measured how long each database takes with the Ruby Benchmark module. I did some testing with 100 and 10,000 entries, where MongoDB was always faster than MySQL (taking about 1/3 of the time). The weird thing is that it takes about 3 times longer in MongoDB to insert the 100,000 entries than with MySQL. I have no idea why MongoDB behaves this way! The only thing I know is that the CPU time is much lower than the total time. Is it possible that MongoDB starts some sort of garbage collection while it's inserting the data? At the beginning it's fast, but the more data MongoDB inserts, the slower it gets... any idea on this?
To get some idea of the read performance of the two databases, I thought about measuring the time from when the database receives a search query until it returns the result. As I need precise measurements, I don't want to include the time Rails spends processing my query on its way from the controller to the database.
How do I do the measurement directly at the database and not in the Rails controller? Is there any gem / tool which would help me?
Thanks in advance!
EDIT: Updated my question according to my current situation
If your base goal is to measure database performance at the DB level, I would recommend getting familiar with the benchRun method in MongoDB.
To do the type of thing you want to do, you can start with the example on the linked page; here is a variant with explanations:
// skipped dropping the table and reinitializing as I'm assuming you have your test dataset
// your database is called test and collection is foo in this code
ops = [
    // this sets up an array of operations benchRun will run
    {
        // possible operations include find (added in 2.1), findOne, update, insert, delete, etc.
        op : "find" ,
        // your db.collection
        ns : "test.foo" ,
        // different operations have different query options - this matches based on _id
        // using a random value between 0 and 100 each time
        query : { _id : { "#RAND_INT" : [ 0 , 100 ] } }
    }
]

for ( x = 1; x <= 128; x *= 2 ) {
    // actual call to benchRun, each time using a different number of threads
    res = benchRun( { parallel : x ,  // number of threads to run in parallel
                      seconds : 5 ,   // duration of run; can be fractional seconds
                      ops : ops       // array of operations to run (see above)
                    } )
    // res is a json object returned; easiest way to see everything in it:
    printjson( res )
    print( "threads: " + x + "\t queries/sec: " + res.query )
}
If you put this in a file called testing.js you can run it from mongo shell like this:
> load("testing.js")
{
"note" : "values per second",
"errCount" : NumberLong(0),
"trapped" : "error: not implemented",
"queryLatencyAverageMs" : 69.3567923734754,
"insert" : 0,
"query" : 12839.4,
"update" : 0,
"delete" : 0,
"getmore" : 0,
"command" : 128.4
}
threads: 1 queries/sec: 12839.4
and so on.
I found the reason why MongoDB is getting slower while inserting many documents.
Many to many relations are not recommended for over 10,000 documents when using MRI due to the garbage collector taking over 90% of the run time when calling #build or #create. This is due to the large array appending occuring in these operations.
http://mongoid.org/performance.html
Now I would like to know how to measure the query performance of each database. My main concerns are the measurement of the query time and the throughput. This measurement should be made directly at the database, so that nothing can distort the result.
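One way to keep Rails completely out of the measurement (a sketch, not a full benchmark; database names, the collection/table, and the queries are placeholders) is to time the raw drivers directly, for example from a small Python script:

import time
import pymysql
from pymongo import MongoClient

N = 1000  # repetitions per query

# MongoDB: time a find() issued directly through the driver
coll = MongoClient('localhost', 27017)['benchmark_db']['users']
start = time.perf_counter()
for _ in range(N):
    list(coll.find({'age': {'$gt': 30}}).limit(100))             # placeholder query
mongo_secs = time.perf_counter() - start

# MySQL: time the equivalent query through the driver
conn = pymysql.connect(host='localhost', user='root',
                       password='secret', database='benchmark_db')
cur = conn.cursor()
start = time.perf_counter()
for _ in range(N):
    cur.execute('SELECT * FROM users WHERE age > 30 LIMIT 100')  # placeholder query
    cur.fetchall()
mysql_secs = time.perf_counter() - start

print('MongoDB: {:.1f} queries/sec'.format(N / mongo_secs))
print('MySQL:   {:.1f} queries/sec'.format(N / mysql_secs))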