I am using Azure Databricks and am trying to read .mdb files as part of an ETL program. After doing some research, the only JDBC connector I've found for the MS Access (.mdb) format is UCanAccess. I've followed some Azure tutorials on how to connect to a JDBC data source, and the connection at first appears successful, but there are some strange behaviors that don't make any sense.
For one, I cannot actually query the resulting DataFrame because of various data type errors. This happens for every table in the .mdb file.
connectionProperties = {
    "driver": "net.ucanaccess.jdbc.UcanaccessDriver"
}
url = "jdbc:ucanaccess:///dbfs/mnt/pre-processed/aeaton#legacydirectional.com/DD/DAILIES/5-1-19/MD190062.MDB"
df = spark.read.jdbc(url=url, table="tblbhaitems", properties=connectionProperties)
The result here is a data frame being returned
(data frame returned)
Now, trying to actually get data from the data frame, I get the following error:
df.select("*").show()
error: "org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.139.64.6, executor 0): net.ucanaccess.jdbc.UcanaccessSQLException: UCAExc:::4.0.4 incompatible data type in conversion: from SQL type CHARACTER to java.lang.Integer, value: Item No"
Looking at this error, I decided to try querying a specific string column to at least test the other data types. When I perform
df.select("`Job no ID`").show()
I get the column name repeated for every row value of that table:
+---------+
|Job no ID|
+---------+
|Job no ID|
|Job no ID|
|Job no ID|
|Job no ID|
|Job no ID|
+---------+
I'm totally at a loss as to why it connects and sees the columns but doesn't actually fetch any data. Unfortunately, .mdb files aren't very common, so I feel my options for parsing the data are likely limited here.
I was facing a similar issue to the one you mention when using Spark with the UCanAccess JDBC driver. Spark's default JDBC dialect quotes column names with double quotes, and Access SQL (via UCanAccess) appears to treat a double-quoted name as a string literal, which is why you get the column name back as every row value and a CHARACTER-to-Integer conversion error on numeric columns.
In Spark we can create and register a custom JDBC dialect for the UCanAccess JDBC driver that quotes identifiers with square brackets instead, like the following:
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

case object MSAccessJdbcDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:ucanaccess")
  // Quote identifiers Access-style, with square brackets, instead of double quotes
  override def quoteIdentifier(colName: String): String = s"[$colName]"
}

JdbcDialects.registerDialect(MSAccessJdbcDialect)
I have a Django application. Sometimes in production I get an error when uploading data that one of the values is too long. It would be very helpful for debugging if I could see which value was the one that went over the limit. Can I configure this somehow? I'm using MySQL.
It would also be nice if I could enable/disable this on a per-model or column basis so that I don't leak user data to error logs.
When creating model instances from outside sources, one must take care to validate the input or have other guarantees that this data cannot violate constraints.
If you don't call at least full_clean() on the model and instead call save() directly, you bypass Django's validators and will only be alerted to the problem by the database driver, at which point it's much harder to obtain diagnostics:
import json

from django.core.exceptions import ValidationError
from django.db import models


class JsonImportManager(models.Manager):
    # Note: the method can't be called "import" since that's a reserved keyword in Python.
    def import_json(self, json_string: str) -> int:
        data_list = json.loads(json_string)  # list of objects => list of dicts
        failed = 0
        for data in data_list:
            obj = self.model(**data)
            try:
                obj.full_clean()
            except ValidationError as e:
                print(e.message_dict)  # or use a better formatting function
                failed += 1
            else:
                obj.save()
        return failed
This is of course very simple, but it's a good boilerplate to get started with.
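As a minimal usage sketch (the model, field names, and max_length below are hypothetical, not from the question), the manager could be attached to a model like this:

# Hypothetical model using the manager above; names and lengths are illustrative only.
class Customer(models.Model):
    name = models.CharField(max_length=50)  # a value longer than 50 chars raises a ValidationError
    email = models.EmailField()

    objects = JsonImportManager()

# failed = Customer.objects.import_json('[{"name": "...", "email": "a@b.com"}]')
# e.message_dict printed above would look something like:
# {"name": ["Ensure this value has at most 50 characters (it has 72)."]}

Because full_clean() raises a ValidationError keyed by field name, the log tells you which field violated which constraint, without echoing the offending value itself, instead of the opaque driver-level error.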
I am trying to read some information into a Pandas DataFrame and am facing a problem due to the volume of the data.
Specs of PC:
RAM 32 GB
Intel Core i7, 4 GHz
Setup:
Data is in MySQL DB, 9 columns (7 int, 1 date, 1 DateTime). DB is on the local machine, so no internet bandwidth issues.
22 million rows of data.
Tried to read directly from MySQL server - it never ends.
engine = sqlalchemy.create_engine('mysql+pymysql://root:@localhost:3306/database')
search_df = pd.read_sql_table('search', engine)
I checked on SO and got the impression that instead of using the connector, it would be better to parse a CSV, so I exported the table to CSV.
CSV file size - 1.5GB
My code
dtype = {
    'search_id': int,
    'job_count_total': int,
    'job_count_done': int,
    'city_id_start': int,
    'city_id_end': int,
    'date_start': str,
    'datetime_create': str,
    'agent_id': int,
    'ride_segment_found_cnt': int
}
search_df = pd.read_csv('search.csv', sep=',', dtype=dtype)
I tried both engines, C and Python, different chunk sizes, low_memory set to True and False, with and without specified dtypes, but I still get a MemoryError.
I tried everything mentioned in the question above (which was marked as the original, with mine as a duplicate), but nothing changes.
I spotted only two differences:
If I parse without chunks, I get the MemoryError during parsing.
When I parse in chunks, it happens on concatenation into one DataFrame.
Also, chunking by 5,000,000 rows gives the error during parsing; smaller chunks give it on concatenation.
Here is an error message on concatenation:
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
Basically, the problem was with memory.
I played a bit with the chunk size and, in addition, applied some filtering to each chunk that I previously had later in the code, roughly as sketched below.
That allowed me to fit the DataFrame into memory.
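A minimal sketch of that approach, reusing the dtype mapping from the question; the chunk size and the filter condition are purely illustrative and depend on what the later code actually needed:

import pandas as pd

dtype = {
    'search_id': int,
    'job_count_total': int,
    'job_count_done': int,
    'city_id_start': int,
    'city_id_end': int,
    'date_start': str,
    'datetime_create': str,
    'agent_id': int,
    'ride_segment_found_cnt': int
}

filtered_chunks = []
# Read the CSV in manageable pieces instead of all at once.
for chunk in pd.read_csv('search.csv', sep=',', dtype=dtype, chunksize=1_000_000):
    # Apply the filtering that previously happened later in the pipeline,
    # so each piece shrinks before it is kept around (the condition is illustrative).
    chunk = chunk[chunk['ride_segment_found_cnt'] > 0]
    filtered_chunks.append(chunk)

search_df = pd.concat(filtered_chunks, ignore_index=True)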
The Azure Table Service documentation states that entities (rows) must have at most 255 properties, which I understand to mean these tables can have at most 255 columns, which seems highly restrictive.
Two questions: first, do the same limits apply to Cosmos DB Table Storage? I can't seem to find any documentation that says one way or the other, though the language of "entities" is still used. And second, if the same limit applies in Cosmos DB, is there any useful way around it for storage and querying, along the lines of JSON in SQL Server?
EDIT: here is some example code that attempts to write entities with 260 properties to Cosmos DB Table Storage, and the error that is thrown. Account names, keys, and such are redacted.
# Libraries
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity
import csv
import os
# Connect
## Table Storage
"""
access_key = 'access_key'
table_service = TableService(account_name='account_name', account_key= access_key)
"""
## Cosmos DB Table Storage
connection_string = "connection_string"
table_service = TableService(connection_string=connection_string)
# Create Table
if not table_service.exists('testTable'):
    table_service.create_table('testTable')
length = 260
letters = [chr(i) for i in range(ord('a'), ord('z') + 1)]
keys = [a + b + c for a in letters for b in letters for c in letters][:length]
values = ['0' * (8 - len(str(i))) + str(i) for i in range(length)]
entity = dict(zip(keys, values))
entity['PartitionKey'] = 'TestKey'
entity['RowKey'] = '1'
table_service.insert_entity('testTable', entity)
This raises "ValueError: The entity contains more properties than allowed."
first, do the same limits apply to Cosmos DB Table Storage?
Based on the Azure Table storage limits, as you said, the max number of properties in a table entity is 255. However, I just found the statement below in the Azure Cosmos DB limits documentation:
Azure Cosmos DB is a global scale database in which throughput and storage can be scaled to handle whatever your application requires. If you have any questions about the scale Azure Cosmos DB provides, please send email to askcosmosdb@microsoft.com.
According to my test (I tried to add 260 properties to an entity), the Azure Cosmos DB Table API accepts entities whose property count exceeds 255.
If you want an official reply, you could send an email to the address above.
is there any useful way around this limit for storage and querying,
along the lines of JSON in SQL Server?
If you want to store and query JSON-formatted data, I suggest using the Cosmos DB SQL API. It is versatile and flexible. You could refer to the doc.
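For illustration, here is a minimal sketch using the azure-cosmos Python SDK; the endpoint, key, and database/container names are placeholders, not values from the question. In the SQL API an item is an arbitrary JSON document limited by size (2 MB) rather than by a property count, so a few hundred properties is not an issue:

from azure.cosmos import CosmosClient, PartitionKey

# Placeholders -- use your own account endpoint and key.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account_key>")
database = client.create_database_if_not_exists("testdb")
container = database.create_container_if_not_exists(
    id="testContainer",
    partition_key=PartitionKey(path="/pk")
)

# Build a document with 300 properties plus the required id and partition key.
doc = {"prop{:03d}".format(i): i for i in range(300)}
doc.update({"id": "1", "pk": "TestKey"})
container.upsert_item(doc)

# Query it back with SQL-like syntax.
items = container.query_items(
    query="SELECT c.id, c.prop001 FROM c WHERE c.pk = 'TestKey'",
    enable_cross_partition_query=True
)
for item in items:
    print(item)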
Besides, if your data is currently stored in a SQL Server database, you could use the Data Migration Tool to import it into Cosmos DB, or use Azure Data Factory for more customized transfers.
Hope it helps you.
Since this pops up pretty high in Google searches: as of now, it's 255 (minus 2 if you use encryption).
I just did a quick test using pytest:
from azure.cosmosdb.table import TableService

field_number = 250
entity = get_dummy_dict_entry_with_many_col(field_number)  # test helper defined elsewhere
for x in range(field_number, 1000):
    print("Adding entity with {} elements.".format(len(entity)))
    table_service.insert_entity(my_test_table_name, entity)
    field_number += 1
    entity["Field_nb_{}".format(field_number)] = field_number
    entity["RowKey"] += str(field_number)
and got an exception in "def _validate_entity(entity, encrypt=None):"
    # Two properties are added during encryption. Validate sufficient space
    max_properties = 255
    if encrypt:
        max_properties = max_properties - 2

    # Validate there are not more than 255 properties including Timestamp
    if (len(entity) > max_properties) or (len(entity) == max_properties and 'Timestamp' not in entity):
>       raise ValueError(_ERROR_TOO_MANY_PROPERTIES)
E       ValueError: The entity contains more properties than allowed.
This post is marked for deletion, as the issue was with the IDE not creating the proper jar, hence the issues with the code interaction.
I have a small Flink application that reads from a Kafka topic and needs to check whether the input from the topic (x) exists in a column of a MySQL database before processing it (not ideal, but it's the current requirement).
When I run the application through the IDE (IntelliJ), it works.
However, when I submit the job to the Flink server, it fails to open a connection based on the driver.
Error from Flink Server
// ERROR
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
// ---------------------
// small summary of MAIN
// ---------------------
Get Data from Source (x)
source.map(x => {
  // open connection (Fails to open)
  // check if data exist in db
})
// -------------------------------------
// open connection function (Scala Code)
// -------------------------------------
def openConnection() : Boolean = {
  try {
    // - set driver
    Class.forName("com.mysql.jdbc.Driver")
    // - make the connection
    connection = DriverManager.getConnection(url, user, pswd)
    // - set status controller
    connection_open = true
  }
  catch {
    // - catch error
    case e: Throwable => e.printStackTrace
    // - set status controller
    connection_open = false
  }
  // return result
  return connection_open
}
Question
1) What's the correct way to interface with a MySQL database from a Flink application?
2) At a later stage I will also have to do a similar interaction with MongoDB. What's the correct way to interact with MongoDB from Flink?
Unbelievably, IntelliJ does not update dependencies on the Rebuild command.
In IntelliJ, you have to delete and re-create your artifact definition for all dependencies to be added; Build, Clean, Rebuild, and Delete do not update its settings.
I deleted and recreated the artifact file, and it works.
Apologies for the unnecessary inconvenience (as you can imagine, my frustration), but it's a word of caution for those developing in IntelliJ: manually delete and recreate your artifacts.
Solution:
(File -> Project Structure -> Artifacts -> (-) delete previous one -> (+) create new one -> Select Main Class)
I encountered an error while doing a full-import in Solr 6.6.0.
I am getting the exception below.
This happens when I set
batchSize="-1" in my db-config.xml.
If I change this value to, say, batchSize="100", then the import runs without any error.
But the recommended value for this is "-1".
Any suggestion why Solr is throwing this exception?
By the way, the data I am trying to import is not huge - just 250 documents.
Stack trace:
org.apache.solr.handler.dataimport.DataImportHandlerException: java.sql.SQLException: Operation not allowed after ResultSet closed
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:464)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:377)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:133)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:516)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:415)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:474)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:457)
at java.lang.Thread.run(Thread.java:745)
By the way, I am getting one more warning:
Could not read DIH properties from /configs/state/dataimport.properties :class org.apache.zookeeper.KeeperException$NoNodeException
This happens when the config directory is not writable.
How can we make the config directory writable in SolrCloud mode?
I am using ZooKeeper as the watchdog. Can we go ahead and change the permissions of the config files that are in ZooKeeper?
Your help is greatly appreciated.
Using batchSize="-1" is only recommended if you have problems running without it. Its behaviour is up to the JDBC driver, but the cause of people assuming it's recommended is this sentence from the old wiki:
DataImportHandler is designed to stream row one-by-one. It passes a fetch size value (default: 500) to Statement#setFetchSize which some drivers do not honor. For MySQL, add batchSize property to dataSource configuration with value -1. This will pass Integer.MIN_VALUE to the driver as the fetch size and keep it from going out of memory for large tables.
Unless you're actually seeing issues with the default value, leave the setting alone and assume your JDBC driver does the correct thing (which it might not do with -1 as the value).
The reason dataimport.properties has to be writable is that Solr writes the time of the last import to that file, so that you can perform delta updates by referencing the time of the last update in your SQL statement.
You'll have to make the directory writable by the client (Solr) if you want to use this feature. My guess would be that you can ignore the warning if you're not using delta imports.