The Azure Table Service documentation states that entities (rows) must have at most 255 properties, which I understand to mean these tables can have at most 255 columns, which seems highly restrictive.
Two questions: first, do the same limits apply to Cosmos DB Table Storage? I can't seem to find any documentation that says one way or another, though the language of "entities" is still used. And second--if the same limit applies in Cosmos DB--is there any useful way around this limit for storage and querying, along the lines of JSON in SQL Server?
EDIT: here is some example code that attempts to write an entity with 260 properties to Cosmos DB Table Storage, along with the error that is thrown. Account names, keys, and the like are redacted.
# Libraries
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity
import csv
import os
# Connect
## Table Storage
"""
access_key = 'access_key'
table_service = TableService(account_name='account_name', account_key= access_key)
"""
## Cosmos DB Table Storage
connection_string = "connection_string"
table_service = TableService(connection_string=connection_string)
# Create Table
if not table_service.exists('testTable'):
    table_service.create_table('testTable')
length = 260
letters = [chr(i) for i in range(ord('a'), ord('z') + 1)]
keys = [a + b + c for a in letters for b in letters for c in letters][:length]
values = ['0' * (8 - len(str(i))) + str(i) for i in range(length)]
entity = dict(zip(keys, values))
entity['PartitionKey'] = 'TestKey'
entity['RowKey'] = '1'
table_service.insert_entity('testTable', entity)
This raises "ValueError: The entity contains more properties than allowed."
first, do the same limits apply to Cosmos DB Table Storage?
Based on the Azure Table storage limits, as you said, the max number of properties in a table entity is 255. However, I just found the statement below in the Azure Cosmos DB limits.
Azure Cosmos DB is a global scale database in which throughput and
storage can be scaled to handle whatever your application requires. If
you have any questions about the scale Azure Cosmos DB provides,
please send email to askcosmosdb@microsoft.com.
According to my test (I tried to add 260 properties to an entity), the Azure Cosmos DB Table API accepts entities with more than 255 properties.
If you want an official reply, you could send an email to the address above.
is there any useful way around this limit for storage and querying,
along the lines of JSON in SQL Server?
If you want to store and query data in JSON format, I suggest using the Cosmos DB SQL API. It is versatile and flexible. You could refer to the documentation.
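For illustration, here is a minimal sketch assuming the azure-cosmos Python SDK for the SQL (Core) API, where each record is stored as a JSON document and the Table API property validation does not apply. The endpoint, key, database, container, and column names are placeholders:

from azure.cosmos import CosmosClient, PartitionKey

# Endpoint, key, and names below are placeholders.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="account_key")
database = client.create_database_if_not_exists("mydb")
container = database.create_container_if_not_exists(
    id="wideRows", partition_key=PartitionKey(path="/pk"))

doc = {"id": "1", "pk": "TestKey"}
doc.update({"col{}".format(i): "{:08d}".format(i) for i in range(260)})  # 260+ fields are fine here
container.upsert_item(doc)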
Besides, if your data is currently stored in a SQL Server database, you could use the Migration Tool to import it into Cosmos DB, or use Azure Data Factory for more custom transmission.
Hope it helps you.
Since this pops up pretty high on Google searches: as of now, it's 255 (minus 2 if you encrypt).
I just did a quick test using pytest:
from azure.cosmosdb.table import TableService

field_number = 250
entity = get_dummy_dict_entry_with_many_col(field_number)
for x in range(field_number, 1000):
    print("Adding entity with {} elements.".format(len(entity)))
    table_service.insert_entity(my_test_table_name, entity)
    field_number += 1
    entity["Field_nb_{}".format(field_number)] = field_number
    entity["RowKey"] += str(field_number)
and got an exception in "def _validate_entity(entity, encrypt=None):"
    # Two properties are added during encryption. Validate sufficient space
    max_properties = 255
    if encrypt:
        max_properties = max_properties - 2
    # Validate there are not more than 255 properties including Timestamp
    if (len(entity) > max_properties) or (len(entity) == max_properties and 'Timestamp' not in entity):
>       raise ValueError(_ERROR_TOO_MANY_PROPERTIES)
E       ValueError: The entity contains more properties than allowed.
As far as I know, Spark executors handle many tasks at the same time to guarantee that data is processed in parallel. Here comes the question: when connecting to external data storage, say MySQL, how many tasks are there to finish this job? In other words, are multiple tasks created at the same time, each reading all the data, or is the data read by only one task and distributed to the cluster in some other way? How about writing data to MySQL, how many connections are there?
Here is some code to read or write data from/to MySQL:
def jdbc(sqlContext: SQLContext, url: String, driver: String, dbtable: String,
         user: String, password: String, numPartitions: Int): DataFrame = {
  sqlContext.read.format("jdbc").options(Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> s"(SELECT * FROM $dbtable) $dbtable",
    "user" -> user,
    "password" -> password,
    "numPartitions" -> numPartitions.toString
  )).load
}

def mysqlToDF(sparkSession: SparkSession, jdbc: JdbcInfo, table: String): DataFrame = {
  var dF1 = sparkSession.sqlContext.read.format("jdbc")
    .option("url", jdbc.jdbcUrl)
    .option("user", jdbc.user)
    .option("password", jdbc.passwd)
    .option("driver", jdbc.jdbcDriver)
    .option("dbtable", table)
    .load()
  // dF1.show(3)
  dF1.createOrReplaceTempView(s"${table}")
  dF1
}
Here is a good article which answers your question:
https://freecontent.manning.com/what-happens-behind-the-scenes-with-spark/
In simple words: the workers split the reading task into several parts, and each worker only reads a part of your input data. The number of tasks depends on your resources and your data volume. Writing works on the same principle: Spark writes the data to a distributed storage system, such as HDFS, where the data is stored in a distributed way: each worker writes its data to some storage node in HDFS.
By default, data from a JDBC source is loaded by one thread, so you will have one task processed by one executor. That is the case you can expect in your second function, mysqlToDF.
In the first function, jdbc, you are closer to a parallel read, but some parameters are still needed; numPartitions alone is not enough. Spark needs some integer/date column and lower/upper bounds to be able to read in parallel (it will execute x queries for partial results).
Spark JDBC documentation
In this documentation you will find:
partitionColumn, lowerBound, upperBound (none): These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. This option applies only to reading.

numPartitions (none): The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. read/write
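To make the parallel read concrete, here is a minimal PySpark sketch (the same options exist in the Scala API used in your code); the URL, credentials, table, column, and bounds are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")   # must be a numeric, date, or timestamp column
      .option("lowerBound", "1")         # bounds only define the partition stride
      .option("upperBound", "1000000")
      .option("numPartitions", "8")      # 8 parallel tasks / JDBC connections
      .load())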
Regarding write:
How about writing data to mysql,how many connections are there?
As stated in the documentation, it also depends on numPartitions: if the number of partitions when writing is higher than numPartitions, Spark will figure it out and call coalesce. Remember that coalesce may generate skew, so sometimes it may be better to repartition explicitly with repartition(numPartitions) to distribute data equally before the write.
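As a hedged sketch of that suggestion (URL, table, and credentials are placeholders), an explicit repartition before a JDBC write could look like this in PySpark:

(df.repartition(8)                     # each of the 8 partitions writes over its own connection
   .write
   .mode("append")
   .jdbc(url="jdbc:mysql://dbhost:3306/mydb",
         table="my_table",
         properties={"user": "user",
                     "password": "password",
                     "driver": "com.mysql.jdbc.Driver"}))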
If you don't set numPartitions, the number of parallel connections on write may be the same as the number of active tasks at a given moment, so be aware that with too high parallelism and no upper bound you may choke the source server.
I have a Django application. Sometimes in production I get an error when uploading data that one of the values is too long. It would be very helpful for debugging if I could see which value was the one that went over the limit. Can I configure this somehow? I'm using MySQL.
It would also be nice if I could enable/disable this on a per-model or column basis so that I don't leak user data to error logs.
When creating model instances from outside sources, one must take care to validate the input or have other guarantees that this data cannot violate constraints.
When you don't call at least full_clean() on the model but call save() directly, you bypass Django's validators and only get alerted to the problem by the database driver, at which point it's harder to obtain diagnostics:
import json

from django.core.exceptions import ValidationError
from django.db import models


class JsonImportManager(models.Manager):
    def import_json(self, json_string: str) -> int:  # 'import' is a reserved word, so use a different name
        data_list = json.loads(json_string)  # list of objects => list of dicts
        failed = 0
        for data in data_list:
            obj = self.model(**data)
            try:
                obj.full_clean()
            except ValidationError as e:
                print(e.message_dict)  # or use a better formatting function
                failed += 1
            else:
                obj.save()
        return failed
This is of course very simple, but it's a good boilerplate to get started with.
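For illustration, here is a hypothetical model and call site (the model and field are made up) showing how the manager above would be used; full_clean() raises a ValidationError whose message_dict names the offending field:

from django.db import models

class Measurement(models.Model):
    name = models.CharField(max_length=50)  # hypothetical length-limited column

    objects = JsonImportManager()

# message_dict would name the field, e.g. {'name': ['Ensure this value has at most 50 characters ...']}
failed = Measurement.objects.import_json('[{"name": "' + 'x' * 60 + '"}]')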
I'm building a python app around an existing (mysql) database and am using automap to infer tables and relationships:
base = automap_base()

self.engine = create_engine(
    'mysql://%s:%s@%s/%s?charset=utf8mb4' % (
        config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
    ), echo=False
)

# reflect the tables
base.prepare(self.engine, reflect=True)

self.TableName = base.classes.table_name
Using this I can do things like session.query(TableName) etc...
However, I'm worried about performance, because every time the app runs it will do the whole inference again.
Is this a legitimate concern?
If so, is there a possibility to 'cache' the output of Automap?
I think that "reflecting" the structure of your database is not the way to go. Unless your app tries to "infer" things from the structure, like static code analysis would for source files, it is unnecessary. The other reason for reflecting at run-time would be to reduce the time needed to start "using" the database with SQLAlchemy. However:
Another option would be to use something like SQLACodegen (https://pypi.python.org/pypi/sqlacodegen):
It will "reflect" your database once and create a 99.5% accurate set of declarative SQLAlchemy models for you to work with. However, this does require that you keep the model subsequently in-sync with the structure of the database. I would assume that this is not a big concern seeing as the tables you're already-working with are stable-enough such that run-time reflection of their structure does not impact your program much.
Generating the declarative models is essentially a "cache" of the reflection. It's just that SQLACodegen saves it as a very readable set of classes and fields instead of data in memory. Even with a changing structure, and my own changes to the generated declarative models, I still use SQLACodegen later on in a project whenever I make database changes. It means that my models are generally consistent with one another and that I don't have things such as typos and data mismatches due to copy-pasting.
Performance can be a legitimate concern. If the database schema is not changing, it can be time consuming to reflect the database every time a script is run. This is more of an issue during development, not starting up a long running application. It's also a significant time saver if your database is on a remote server (again, particularly during development).
I use code that is similar to the answer here (as noted by @ACV). The general plan is to perform the reflection the first time, then pickle the metadata object. The next time the script is run, it will look for the pickle file and use that. The file can be anywhere, but I place mine in ~/.sqlalchemy_cache. This is an example based on your code.
import os
import pickle

from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.ext.declarative import declarative_base

self.engine = create_engine(
    'mysql://%s:%s@%s/%s?charset=utf8mb4' % (
        config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
    ), echo=False
)

metadata_pickle_filename = "mydb_metadata"
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
    try:
        with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
            cached_metadata = pickle.load(file=cache_file)
    except IOError:
        # cache file not found - no problem, reflect as usual
        pass

if cached_metadata:
    base = declarative_base(bind=self.engine, metadata=cached_metadata)
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables

    # save the metadata for future runs
    try:
        if not os.path.exists(cache_path):
            os.makedirs(cache_path)
        # make sure to open in binary mode - we're writing bytes, not str
        with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
            pickle.dump(base.metadata, cache_file)
    except:
        # couldn't write the file for some reason
        pass

self.TableName = base.classes.table_name
For anyone using declarative table class definitions, assuming a Base object defined as e.g.
Base = declarative_base(bind=engine)
import os
import pickle

metadata_pickle_filename = "ModelClasses_trilliandb_trillian.pickle"

# ------------------------------------------
# Load the cached metadata if it's available
# ------------------------------------------
# NOTE: delete the cached file if the database schema changes!!
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
    try:
        with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
            cached_metadata = pickle.load(file=cache_file)
    except IOError:
        # cache file not found - no problem
        pass

# ------------------------------------------
# define all tables
#
class MyTable(Base):
    if cached_metadata:
        __table__ = cached_metadata.tables['my_schema.my_table']
    else:
        __tablename__ = 'my_table'
        __table_args__ = {'autoload': True, 'schema': 'my_schema'}

    ...

# ----------------------------------------
# If no cached metadata was found, save it
# ----------------------------------------
if cached_metadata is None:
    # cache the metadata for future loading
    # - MUST DELETE IF THE DATABASE SCHEMA HAS CHANGED
    try:
        if not os.path.exists(cache_path):
            os.makedirs(cache_path)
        # make sure to open in binary mode - we're writing bytes, not str
        with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
            pickle.dump(Base.metadata, cache_file)
    except:
        # couldn't write the file for some reason
        pass
Important note!! If the database schema changes, you must delete the cached file to force the code to autoload and create a new cache. If you don't, the changes will not be reflected in the code. It's an easy thing to forget.
The answer to your first question is largely subjective. You are adding database queries to fetch the reflection metadata to the application load time. Whether or not that overhead is significant depends on your project requirements.
For reference, I have an internal tool at work that uses a reflection pattern because the load time is acceptable for our team. That might not be the case if it were an externally-facing product. My hunch is that for most applications the reflection overhead will not dominate the total application load time.
If you decide it is significant for your purposes, this question has an interesting answer where the user pickles the database metadata in order to locally cache it.
Adding to this: what @Demitri answered is close to correct, but (at least in SQLAlchemy 1.4.29) the example will fail on the last line, self.TableName = base.classes.table_name, when loading from the cached file. In this case declarative_base() has no attribute classes.
The fix is as simple as altering:
if cached_metadata:
    base = declarative_base(bind=self.engine, metadata=cached_metadata)
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables
to
if cached_metadata:
    base = automap_base(declarative_base(bind=self.engine, metadata=cached_metadata))
    base.prepare()
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables
This will create your automap object with the proper attributes.
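For completeness, a hypothetical query against the prepared base (whether it came from the cache or from reflection) could look like the sketch below; base and self.engine refer to the objects created in the snippets above, and table_name is the table from the question:

from sqlalchemy.orm import Session

TableName = base.classes.table_name  # mapped class, as in the question

with Session(self.engine) as session:
    first_rows = session.query(TableName).limit(10).all()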
The setting in which I am working can be described as follows:
Database and what I want to extract from it
The data required to run the analysis is stored in a single de-normalized (more than 100 columns) Oracle table. Financial reporting data is published to the table every day, and it is range-partitioned on the reporting date (one partition per day). Here's the structure of the query I intend to run:
SELECT col1,
col2,
col3
FROM table
WHERE date BETWEEN start_date AND end_date
Strategy to load data with Dask
I am using sqlalchemy with the cx_Oracle driver to access the database. The strategy I am following to load data in parallel with Dask is:
from dask import bag as db

def read_rows(from_date, to_date, engine):
    engine.dispose()
    query = """
            -- Query Text --
            """.format(from_date, to_date)
    with engine.connect() as conn:
        ret = conn.execute(query).fetchall()
    return ret

engine = create_engine(...)  # initialise sqlalchemy engine
add_engine_pidguard(engine)  # adding pidguard to engine

date_ranges = [...]  # list of (start_date, end_date)-tuples

data_db = (db.from_sequence(date_ranges)
             .map(lambda x: read_rows(from_date=x[0], to_date=x[1], engine=engine))
             .concat())

# ---- further process data ----
...
add_engine_pidguard is taken from the SQLAlchemy documentation: How do I use engines / connections / sessions with Python multiprocessing, or os.fork()?
Questions
Is the current way of running blocked queries fine - or is there a cleaner way of achieving this in sqlalchemy?
Since the queries operate in a multiprocessing environment, is the approach of managing the engines fine the way it is implemented?
Currently I am executing a "raw query", would it be beneficial from a performance point of view to define the table in a declarative_base (with respective column types) and use session.query on the required columns from within read_rows?
I would be apt to try code along the lines of
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

...

con = engine.connect()
df = dd.from_delayed([
    delayed(pd.read_sql_query)(QUERY, con, params=params)
    for params in date_ranges
])
In this case, I make just one connection; cx_Oracle connections are, as I understand it, able to be used by multiple threads. The data is loaded using dask.dataframe, without doing anything yet to switch away from the threaded scheduler. Database IO and many pandas operations release the GIL, so the threaded scheduler is a good candidate here.
This will let us jump right to having a dataframe, which is nice for many operations on structured data.
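As a hypothetical follow-up (assuming col3 from the example query is numeric), nothing is actually read until you compute something:

df = df.persist()                    # run the delayed read_sql_query calls
print(df.head())                     # small sample from the first partition
total = df["col3"].sum().compute()   # example aggregation across all partitions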
Currently I am executing a "raw query", would it be beneficial from a performance point of view to define the table in a declarative_base (with respective column types) and use session.query on the required columns from within read_rows?
This is not especially likely to improve performance, as I understand things.
I am new to CouchDB. I need to get 60 or more JSON files in a minute from a server.
I have to upload these JSON files to CouchDB individually as soon as I receive them.
I installed CouchDB on my Linux machine.
I hope some one can help me with my requirement.
If possible can someone help me with pseudo code.
My Idea:
Is to write a python script for uploading all JSON files to CouchDB.
Each JSON file must become its own document, and the data present in the JSON must be inserted into CouchDB as-is (the specified format with values in a file).
Note:
These JSON files are transactional; one file is generated every second, so I need to read each file and upload it into CouchDB in the same format, and on a successful upload archive the file to a different folder on the local system.
Python program to parse the JSON and insert into CouchDB:
import sys
import glob
import errno, time, os
import couchdb, simplejson
import json
from pprint import pprint

couch = couchdb.Server()  # Assuming localhost:5984
#couch.resource.credentials = (USERNAME, PASSWORD)
# If your CouchDB server is running elsewhere, set it up like this:
couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']

path = 'C:/Users/Desktop/CouchDB_Python/Json_files/*.json'
#dirPath = 'C:/Users/VijayKumar/Desktop/CouchDB_Python'
files = glob.glob(path)
for file1 in files:
    #dirs = os.listdir( dirPath )
    file2 = glob.glob(file1)
    for name in file2:  # 'file' is a builtin type, 'name' is a less-ambiguous variable name.
        try:
            with open(name) as f:  # No need to specify 'r': this is the default.
                #sys.stdout.write(f.read())
                json_data = f
                data = json.load(json_data)
                db.save(data)
                pprint(data)
                json_data.close()
            #time.sleep(2)
        except IOError as exc:
            if exc.errno != errno.EISDIR:  # Do not fail if a directory is found, just ignore it.
                raise  # Propagate other kinds of IOError.
I would use the CouchDB bulk API, even though you have specified that you need to send the documents to the db one by one. For example, implementing a simple queue that gets flushed every 5-10 seconds via a bulk doc call will greatly increase the performance of your application.
There is obviously a quirk in that: you need to know the IDs of the docs that you want to get from the DB. But for the PUTs it is perfect. (That is not entirely true; you can get ranges of docs using a bulk operation if the IDs you are using for your docs can be sorted nicely.)
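A minimal sketch of that queue idea, assuming the couchdb-python client from your code (its Database.update() issues a single _bulk_docs request); the server URL and database name are placeholders:

import couchdb

couch = couchdb.Server('http://localhost:5984/')  # placeholder server
db = couch['mydb']                                # placeholder database name

queue = []

def enqueue(doc):
    """Called as each JSON file arrives."""
    queue.append(doc)

def flush():
    """Call this every 5-10 seconds, e.g. from a timer or scheduler."""
    if queue:
        db.update(queue)   # one bulk request instead of one save() per doc
        del queue[:]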
From my experience working with CouchDB, I have a hunch that you are dealing with transactional documents in order to compile them into some sort of sum result and act on that data accordingly (maybe creating the next transactional doc in the series). For that you can rely on CouchDB by using 'reduce' functions on the views you create. It takes a little practice to get a reduce function working properly, and it is highly dependent on what you actually want to achieve and what data you are prepared to emit from the view, so I can't really provide more detail on that.
So in the end the app logic would go something like that:
get _design/someDesign/_view/yourReducedView
calculate new transaction
add transaction to queue
onTimeout
send all in transaction queue
If I got the first part wrong, about why you are using transactional docs, all that would really change is the part of my app logic where you get those transactional docs.
Also, before writing your own 'reduce' function, have a look at the built-in ones (they are a lot faster than anything outside of the db engine can do).
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
EDIT:
Since you are starting out, I strongly recommend having a look at the CouchDB Definitive Guide.
NOTE FOR LATER:
Here is one hidden stone (well, maybe not so much a hidden stone as something that is not obvious for a newcomer to look out for). When you write a reduce function, make sure that it does not produce too much output for a query without boundaries. This will extremely slow down the entire view, even when you provide reduce=false when getting stuff from it.
So you need to get JSON documents from a server and send them to CouchDB as you receive them. A Python script would work fine. Here is some pseudo-code:
loop (until no more docs)
get new JSON doc from server
send JSON doc to CouchDB
end loop
In Python, you could use requests to send the documents to CouchDB and probably to get the documents from the server as well (if it is using an HTTP API).
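A minimal sketch with requests, assuming CouchDB on localhost, a database named mydb, and placeholder credentials; a POST to /mydb creates a new document with a server-assigned id:

import requests

def send_to_couchdb(doc: dict) -> str:
    """POST one JSON document to CouchDB and return the server-assigned id."""
    resp = requests.post(
        "http://localhost:5984/mydb",     # placeholder database URL
        json=doc,
        auth=("admin", "password"),       # placeholder credentials
    )
    resp.raise_for_status()
    return resp.json()["id"]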
You might want to check out the pycouchdb module for Python 3. I've used it myself to upload lots of JSON objects into a CouchDB instance. My project does pretty much the same as you describe, so you can take a look at my project Pyro on GitHub for details.
My class looks like that:
import sys

import pycouchdb


class MyCouch:
    """ COMMUNICATES WITH COUCHDB SERVER """

    def __init__(self, server, port, user, password, database):
        # ESTABLISHING CONNECTION
        self.server = pycouchdb.Server("http://" + user + ":" + password + "@" + server + ":" + port + "/")
        self.db = self.server.database(database)

    def check_doc_rev(self, doc_id):
        # CHECKS REVISION OF SUPPLIED DOCUMENT
        try:
            rev = self.db.get(doc_id)
            return rev["_rev"]
        except Exception as inst:
            return -1

    def update(self, all_computers):
        # UPDATES DATABASE WITH JSON STRING
        try:
            result = self.db.save_bulk(all_computers, transaction=False)
            sys.stdout.write(" Updating database")
            sys.stdout.flush()
            return result
        except Exception as ex:
            sys.stdout.write("Updating database")
            sys.stdout.write("Exception: ")
            print(ex)
            sys.stdout.flush()
            return None
Let me know in case of any questions - I will be more than glad to help if you will find some of my code usable.