Assign unique ID's in parallell [duplicate] - csv

I am having a JDBC connection with Apache Spark and PostgreSQL and I want to insert some data into my database. When I use append mode I need to specify id for each DataFrame.Row. Is there any way for Spark to create primary keys?

Scala:
If all you need is unique numbers you can use zipWithUniqueId and recreate DataFrame. First some imports and dummy data:
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val df = sc.parallelize(Seq(
("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")
Extract schema for further usage:
val schema = df.schema
Add id field:
val rows = df.rdd.zipWithUniqueId.map{
case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
Create DataFrame:
val dfWithPK = sqlContext.createDataFrame(
rows, StructType(StructField("id", LongType, false) +: schema.fields))
The same thing in Python:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, LongType
row = Row("foo", "bar")
row_with_index = Row(*["id"] + df.columns)
df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()
def make_row(columns):
def _make_row(row, uid):
row_dict = row.asDict()
return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
return _make_row
f = make_row(df.columns)
df_with_pk = (df.rdd
.zipWithUniqueId()
.map(lambda x: f(*x))
.toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
If you prefer consecutive number your can replace zipWithUniqueId with zipWithIndex but it is a little bit more expensive.
Directly with DataFrame API:
(universal Scala, Python, Java, R with pretty much the same syntax)
Previously I've missed monotonicallyIncreasingId function which should work just fine as long as you don't require consecutive numbers:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
df.withColumn("id", monotonicallyIncreasingId).show()
// +---+----+-----------+
// |foo| bar| id|
// +---+----+-----------+
// | a|-1.0|17179869184|
// | b|-2.0|42949672960|
// | c|-3.0|60129542144|
// +---+----+-----------+
While useful monotonicallyIncreasingId is non-deterministic. Not only ids may be different from execution to execution but without additional tricks cannot be used to identify rows when subsequent operations contain filters.
Note:
It is also possible to use rowNumber window function:
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
w = Window().orderBy()
df.withColumn("id", rowNumber().over(w)).show()
Unfortunately:
WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
So unless you have a natural way to partition your data and ensure uniqueness is not particularly useful at this moment.

from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("id", monotonically_increasing_id()).show()
Note that the 2nd argument of df.withColumn is monotonically_increasing_id() not monotonically_increasing_id .

I found the following solution to be relatively straightforward for the case where zipWithIndex() is the desired behavior, i.e. for those desirng consecutive integers.
In this case, we're using pyspark and relying on dictionary comprehension to map the original row object to a new dictionary which fits a new schema including the unique index.
# read the initial dataframe without index
dfNoIndex = sqlContext.read.parquet(dataframePath)
# Need to zip together with a unique integer
# First create a new schema with uuid field appended
newSchema = StructType([StructField("uuid", IntegerType(), False)]
+ dfNoIndex.schema.fields)
# zip with the index, map it to a dictionary which includes new field
df = dfNoIndex.rdd.zipWithIndex()\
.map(lambda (row, id): {k:v
for k, v
in row.asDict().items() + [("uuid", id)]})\
.toDF(newSchema)

For anyone else who doesn't require integer types, concatenating the values of several columns whose combinations are unique across the data can be a simple alternative. You have to handle nulls since concat/concat_ws won't do that for you. You can also hash the output if the concatenated values are long:
import pyspark.sql.functions as sf
unique_id_sub_cols = ["a", "b", "c"]
df = df.withColumn(
"UniqueId",
sf.md5(
sf.concat_ws(
"-",
*[
sf.when(sf.col(sub_col).isNull(), sf.lit("Missing")).otherwise(
sf.col(sub_col)
)
for sub_col in unique_id_sub_cols
]
)
),
)

Related

Dropping duplicates in a pyarrow table?

Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source so it is possible to have duplicates for a single ID for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times)
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
dataset = ds.dataset(history, filesystem, partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy dataset so I can make use of the partition_filename_cb - that way I can have one file per date_id. Our visualization tool connects to these files directly
# container/dataset/date_id=20210127/20210127.parquet
pq.write_to_dataset(final, filesystem, partition_cols=["date_id"], use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionalities, though this one isn't here yet. My approach now would be:
def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
unique_values = pc.unique(table[column_name])
unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
mask = np.full((len(table)), False)
mask[unique_indices] = True
return table.filter(mask=mask)
//end edit
I saw your question because I had a similar one, and I solved it for my work (due to IP issues I can't post the whole code but I'll try to answer as well as I can. I've never done this before)
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np
array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
# do stuff
I used the 'value_counts' to find the unique values and how many of them there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the value was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more then 1 I selected either the first or last one by using numpy boolean arrays as a mask again.
After iterating it was just as simple as pa.concat_tables(tables)
warning: this is a slow process. If you need something quick&dirty, try the "Unique" option (also in the same link I provided).
edit/extra:: you can make it a bit faster/less memory intensive by keeping up a numpy array of boolean masks while iterating over the dictionary. then in the end you return a "table.filter(mask=boolean_mask)".
I don't know how to calculate the speed though...
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) ->pa.Table:
column_array = table.column(col_name)
mask_x = np.full((table.shape[0]), False)
_, mask_indices = np.unique(np.array(column_array), return_index=True)
mask_x[mask_indices] = True
return table.filter(mask=mask_x)
The following gives a good performance. About 2mins for a table with half billion rows. The reason I don't do combine_chunks(): there is a bug, arrow seems can not combine chunk arrays if there size are too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
a = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
c = np.array([np.arange(x) for x in a])
a = ([0]+a)[:-1]
c = pa.chunked_array(c+np.cumsum(a))
tb3= tb3.set_column(tb3.shape[1], 'index', c)
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found duckdb can give better performance on group by. Change the last 2 lines above into the following will give 2X speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))

In Python, how to concisely replace nested values in json data?

This is an extension to In Python, how to concisely get nested values in json data?
I have data loaded from JSON and am trying to replace arbitrary nested values using a list as input, where the list corresponds to the names of successive children. I want a function replace_value(data,lookup,value) that replaces the value in the data by treating each entry in lookup as a nested child.
Here is the structure of what I'm trying to do:
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
def replace_value(data,lookup,value):
DEFINITION
lookup = ['alldata','TimeSeries','rates']
replace_value(json_data,lookup,[2,3])
# The following should return [2,3]
print(json_data['alldata']['TimeSeries']['rates'])
I was able to make a start with get_value(), but am stumped about how to do replacement. I'm not fixed to this code structure, but want to be able to programatically replace a value in the data given the list of successive children and the value to replace.
Note: it is possible that lookup can be of length 1
Follow the lookups until we're second from the end, then assign the value to the last lookup in the current object
def get_value(data,lookup): # Or whatever definition you like
res = data
for item in lookup:
res = res[item]
return res
def replace_value(data, lookup, value):
obj = get_value(data, lookup[:-1])
obj[lookup[-1]] = value
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
lookup = ['alldata','TimeSeries','rates']
replace_value(json_data,lookup,[2,3])
print(json_data['alldata']['TimeSeries']['rates']) # [2, 3]
If you're worried about the list copy lookup[:-1], you can replace it with an iterator slice:
from itertools import islice
def replace_value(data, lookup, value):
it = iter(lookup)
slice = islice(it, len(lookup)-1)
obj = get_value(data, slice)
final = next(it)
obj[final] = value
You can obtain the parent to the final sub-dict first, so that you can reference it to alter the value of that sub-dict under the final key:
def replace_value(data, lookup, replacement):
*parents, key = lookup
for parent in parents:
data = data[parent]
data[key] = replacement
so that:
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
lookup = ['alldata','TimeSeries','rates']
replace_value(json_data,lookup,[2,3])
print(json_data['alldata']['TimeSeries']['rates'])
outputs:
[2, 3]
Once you have get_value
get_value(json_data, lookup[:-1])[lookup[-1]] = value

Django / PostgresQL jsonb (JSONField) - convert select and update into one query

Versions: Django 1.10 and Postgres 9.6
I'm trying to modify a nested JSONField's key in place without a roundtrip to Python. Reason is to avoid race conditions and multiple queries overwriting the same field with different update.
I tried to chain the methods in the hope that Django would make a single query but it's being logged as two:
Original field value (demo only, real data is more complex):
from exampleapp.models import AdhocTask
record = AdhocTask.objects.get(id=1)
print(record.log)
> {'demo_key': 'original'}
Query:
from django.db.models import F
from django.db.models.expressions import RawSQL
(AdhocTask.objects.filter(id=25)
.annotate(temp=RawSQL(
# `jsonb_set` gets current json value of `log` field,
# take a the nominated key ("demo key" in this example)
# and replaces the value with the json provided ("new value")
# Raw sql is wrapped in triple quotes to avoid escaping each quote
"""jsonb_set(log, '{"demo_key"}','"new value"', false)""",[]))
# Finally, get the temp field and overwrite the original JSONField
.update(log=F('temp’))
)
Query history (shows this as two separate queries):
from django.db import connection
print(connection.queries)
> {'sql': 'SELECT "exampleapp_adhoctask"."id", "exampleapp_adhoctask"."description", "exampleapp_adhoctask"."log" FROM "exampleapp_adhoctask" WHERE "exampleapp_adhoctask"."id" = 1', 'time': '0.001'},
> {'sql': 'UPDATE "exampleapp_adhoctask" SET "log" = (jsonb_set(log, \'{"demo_key"}\',\'"new value"\', false)) WHERE "exampleapp_adhoctask"."id" = 1', 'time': '0.001'}]
It would be much nicer without RawSQL.
Here's how to do it:
from django.db.models.expressions import Func
class ReplaceValue(Func):
function = 'jsonb_set'
template = "%(function)s(%(expressions)s, '{\"%(keyname)s\"}','\"%(new_value)s\"', %(create_missing)s)"
arity = 1
def __init__(
self, expression: str, keyname: str, new_value: str,
create_missing: bool=False, **extra,
):
super().__init__(
expression,
keyname=keyname,
new_value=new_value,
create_missing='true' if create_missing else 'false',
**extra,
)
AdhocTask.objects.filter(id=25) \
.update(log=ReplaceValue(
'log',
keyname='demo_key',
new_value='another value',
create_missing=False,
)
ReplaceValue.template is the same as your raw SQL statement, just parametrized.
(jsonb_set(log, \'{"demo_key"}\',\'"another value"\', false)) from your query is now jsonb_set("exampleapp.adhoctask"."log", \'{"demo_key"}\',\'"another value"\', false). The parentheses are gone (you can get them back by adding it to the template) and log is referenced in a different way.
Anyone interested in more details regarding jsonb_set should have a look at table 9-45 in postgres' documentation: https://www.postgresql.org/docs/9.6/static/functions-json.html#FUNCTIONS-JSON-PROCESSING-TABLE
Rubber duck debugging at its best - in writing the question, I've realised the solution. Leaving the answer here in hope of helping someone in future:
Looking at the queries, I realised that the RawSQL was actually being deferred until query two, so all I was doing was storing the RawSQL as a subquery for later execution.
Solution:
Skip the annotate step altogether and use the RawSQL expression straight into the .update() call. Allows you to dynamically update PostgresQL jsonb sub-keys on the database server without overwriting the whole field:
(AdhocTask.objects.filter(id=25)
.update(log=RawSQL(
"""jsonb_set(log, '{"demo_key"}','"another value"', false)""",[])
)
)
> 1 # Success
print(connection.queries)
> {'sql': 'UPDATE "exampleapp_adhoctask" SET "log" = (jsonb_set(log, \'{"demo_key"}\',\'"another value"\', false)) WHERE "exampleapp_adhoctask"."id" = 1', 'time': '0.001'}]
print(AdhocTask.objects.get(id=1).log)
> {'demo_key': 'another value'}

Python-Sqlalchemy Binary Column Type HEX() and UNHEX()

I'm attempting to learn Sqlalchemy and utilize an ORM. One of my columns stores file hashes as binary. In SQL, the select would simply be
SELECT type, column FROM table WHERE hash = UNHEX('somehash')
How do I achieve a select like this (ideally with an insert example, too) using my ORM? I've begun reading about column overrides, but I'm confused/not certain that that's really what I'm after.
eg
res = session.query.filter(Model.hash == __something__? )
Thoughts?
Only for select's and insert's
Well, for select you could use:
>>> from sqlalchemy import func
>>> session = (...)
>>> (...)
>>> engine = create_engine('sqlite:///:memory:', echo=True)
>>> q = session.query(Model.id).filter(Model.some == func.HEX('asd'))
>>> print q.statement.compile(bind=engine)
SELECT model.id
FROM model
WHERE model.some = HEX(?)
For insert:
>>> from sqlalchemy import func
>>> session = (...)
>>> (...)
>>> engine = create_engine('sqlite:///:memory:', echo=True)
>>> m = new Model(hash=func.HEX('asd'))
>>> session.add(m)
>>> session.commit()
INSERT INTO model (hash) VALUES (HEX(%s))
A better approach: Custom column that converts data by using sql functions
But, I think the best for you is a custom column on sqlalchemy using any process_bind_param, process_result_value, bind_expression and column_expression see this example.
Check this code below, it create a custom column that I think fit your needs:
from sqlalchemy.types import VARCHAR
from sqlalchemy import func
class HashColumn(VARCHAR):
def bind_expression(self, bindvalue):
# convert the bind's type from String to HEX encoded
return func.HEX(bindvalue)
def column_expression(self, col):
# convert select value from HEX encoded to String
return func.UNHEX(col)
You could model your a table like:
from sqlalchemy import Column, types
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class Model(Base):
__tablename__ = "model"
id = Column(types.Integer, primary_key=True)
col = Column(HashColumn(20))
def __repr__(self):
return "Model(col=%r)" % self.col
Some usage:
>>> (...)
>>> session = create_session(...)
>>> (...)
>>> model = Model(col='Iuri Diniz')
>>> session.add(model)
>>> session.commit()
this issues this query:
INSERT INTO model (col) VALUES (HEX(?)); -- ('Iuri Diniz',)
More usage:
>>> session.query(Model).first()
Model(col='Iuri Diniz')
this issues this query:
SELECT
model.id AS model_id, UNHEX(model.col) AS model_col
FROM model
LIMIT ? ; -- (1,)
A bit more:
>>> session.query(Model).filter(Model.col == "Iuri Diniz").first()
Model(col='Iuri Diniz')
this issues this query:
SELECT
model.id AS model_id, UNHEX(model.col) AS model_col
FROM model
WHERE model.col = HEX(?)
LIMIT ? ; -- ('Iuri Diniz', 1)
Extra: Custom column that converts data by using python types
Maybe you want to use some beautiful custom type and want to convert it between python and the database.
In the following example I convert UUID's between python and the database (the code is based on this link):
import uuid
from sqlalchemy.types import TypeDecorator, VARCHAR
class UUID4(TypeDecorator):
"""Portable UUID implementation
>>> str(UUID4())
'VARCHAR(36)'
"""
impl = VARCHAR(36)
def process_bind_param(self, value, dialect):
if value is None:
return value
else:
if not isinstance(value, uuid.UUID):
return str(uuid.UUID(value))
else:
# hexstring
return str(value)
def process_result_value(self, value, dialect):
if value is None:
return value
else:
return uuid.UUID(value)
I wasn't able to get #iuridiniz's Custom column solution to work because of the following error:
sqlalchemy.exc.StatementError: (builtins.TypeError) encoding without a string argument
For an expression like:
m = Model(col='FFFF')
session.add(m)
session.commit()
I solved it by overriding process_bind_param, which processes the parameter
before passing it to bind_expression for interpolation into your query language.
from sqlalchemy.types import VARCHAR
from sqlalchemy import func
class HashColumn(VARCHAR):
def process_bind_param(self, value, dialect):
# encode value as a binary
if value:
return bytes(value, 'utf-8')
def bind_expression(self, bindvalue):
# convert the bind's type from String to HEX encoded
return func.HEX(bindvalue)
def column_expression(self, col):
# convert select value from HEX encoded to String
return func.UNHEX(col)
And then defining the table is the same:
from sqlalchemy import Column, types
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class Model(Base):
__tablename__ = "model"
id = Column(types.Integer, primary_key=True)
col = Column(HashColumn(20))
def __repr__(self):
return "Model(col=%r)" % self.col
I really like iuridiniz approach A better approach: Custom column that converts data by using sql functions, but I had some trouble making it work when using BINARY and VARBINARY to store hex strings in MySQL 5.7. I tried different things, but SQLAlchemy kept complaining about the encoding, and/or the use of func.HEX and func.UNHEX in contexts where they couldn't be used. Using python3 and SQLAlchemy 1.2.8, I managed to make it work extending the base class and replacing its processors, so that sqlalchemy does not require a function from the database to bind the data and compute the result, but rather it is done within python, as follows:
import codecs
from sqlalchemy.types import VARBINARY
class VarBinaryHex(VARBINARY):
"""Extend VARBINARY to handle hex strings."""
impl = VARBINARY
def bind_processor(self, dialect):
"""Return a processor that decodes hex values."""
def process(value):
return codecs.decode(value, 'hex')
return process
def result_processor(self, dialect, coltype):
"""Return a processor that encodes hex values."""
def process(value):
return codecs.encode(value, 'hex')
return process
def adapt(self, impltype):
"""Produce an adapted form of this type, given an impl class."""
return VarBinaryHex()
The idea is to replace HEX and UNHEX, which require DBMS intervention, with python functions that do just the same, encode and decode an hex string just like HEX and UNHEX do. If you directly connect to the database, you can use HEX and UNHEX, but from SQLAlchemy, codecs.enconde and codecs.decode functions make the work for you.
I bet that, if anybody were interested, writting the appropriate processors, one could even manage the hex values as integers from the python perspective, allowing to store integers that are greater the BIGINT.
Some considerations:
BINARY could be used instead of VARBINARY if the length of the hex string is known.
Depending on what you are going to do, it might worth to un-/capitalise the string on the constructor of class that is going to use this type of column, so that you work with a consistent capitalization, right at the moment of the object initialization. i.e., 'aa' != 'AA' but 0xaa == 0xAA.
As said before, you could consider a processor that converts db binary hex values to prython integer.
When using VARBINARY, be careful because 'aa' != '00aa'
If you use BINARY, lets say that your column is col = Column(BinaryHex(length=4)), take into account that any value that you provide with less than length bytes will be completed with zeros. I mean, if you do
obj.col = 'aabb' and commit it, when you later retrieve this, from the dataase, what you will get is obj.col == 'aabb0000', which is something quite different.

Coercion in SQLAlchemy from Column annotations

Good day everyone,
I have a file of strings corresponding to the fields of my SQLAlchemy object. Some fields are floats, some are ints, and some are strings.
I'd like to be able to coerce my string into the proper type by interrogating the column definition. Is this possible?
For instance:
class MyClass(Base):
...
my_field = Column(Float)
It feels like one should be able to say something like MyClass.my_field.column.type and either ask the type to coerce the string directly or write some conditions and int(x), float(x) as needed.
I wondered whether this would happen automatically if all the values were strings, but I received Oracle errors because the type was incorrect.
Currently I naively coerce -- if it's float()able, that's my value, else it's a string, and I trust that integral floats will become integers upon inserting because they are represented exactly. But the runtime value is wrong (e.g. 1.0 vs 1) and it just seems sloppy.
Thanks for your input!
SQLAlchemy 0.7.4
You can iterate over columns of the mapped Table:
for col in MyClass.__table__.columns:
print col, repr(col.type)
... so you can check the type of each field by its name like this:
def get_col_type(cls_, fld_):
for col in cls_.__table__.columns:
if col.name == fld_:
return col.type # this contains the instance of SA type
assert Float == type(get_col_type(MyClass, 'my_field'))
I would cache the results though if your file is large in order to save the for-loop on every row imported from the file.
Type coercion for sqlalchemy prior to committing to some database.
How can I verify Column data types in the SQLAlchemy ORM?
from sqlalchemy import (
Column,
Integer,
String,
DateTime,
)
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import event
import datetime
Base = declarative_base()
type_coercion = {
Integer: int,
String: str,
DateTime: datetime.datetime,
}
# this event is called whenever an attribute
# on a class is instrumented
#event.listens_for(Base, 'attribute_instrument')
def configure_listener(class_, key, inst):
if not hasattr(inst.property, 'columns'):
return
# this event is called whenever a "set"
# occurs on that instrumented attribute
#event.listens_for(inst, "set", retval=True)
def set_(instance, value, oldvalue, initiator):
desired_type = type_coercion.get(inst.property.columns[0].type.__class__)
coerced_value = desired_type(value)
return coerced_value
class MyObject(Base):
__tablename__ = 'mytable'
id = Column(Integer, primary_key=True)
svalue = Column(String)
ivalue = Column(Integer)
dvalue = Column(DateTime)
x = MyObject(svalue=50)
assert isinstance(x.svalue, str)
I'm not sure if I'm reading this question correctly, but I would do something like:
class MyClass(Base):
some_float = Column(Float)
some_string = Column(String)
some_int = Column(Int)
...
def __init__(self, some_float, some_string, some_int, ...):
if isinstance(some_float, float):
self.some_float = somefloat
else:
try:
self.some_float = float(somefloat)
except:
# do something intelligent
if isinstance(some_string, string):
...
And I would repeat the checking process for each column. I would trust anything to do it "automatically". I also expect your file of strings to be well structured, otherwise something more complicated would have to be done.
Assuming your file is a CSV (I'm not good with file reads in python, so read this as pseudocode):
while not EOF:
thisline = readline('thisfile.csv', separator=',') # this line is an ordered list of strings
thisthing = MyClass(some_float=thisline[0], some_string=thisline[1]...)
DBSession.add(thisthing)