SSIS Data Flow: duplicated rule problem after lookup

I have a data flow where I need to get a column value from 'SQL tableA' and perform a lookup against 'SQL tableB' using this column value. If the lookup finds a match between the two tables, I need to get the value of another column from 'SQL tableA' and put this value in 'SQL tableC' (the table that will be persisted). If the lookup fails, this column value will be NULL.
My problem: after this behavior, the rest of my flow is the same, so I have two duplicated, identical flows below the lookup. This is terrible for readability and maintenance.
What can I do to resolve this situation with as little performance loss as possible?
The data model is legacy, so changing it is impossible.
Best Regards,
Luis

The way I see it, there are really three options:
Use UNION ALL and possibly sacrifice performance for modularity. There may in fact be no performance issue; you should test and see.
If possible, implement all of this in a stored procedure. You can implement code reuse there and it will quite possibly run much faster.
Build a custom transformation component that implements those last three steps.
This option appeals to all programmers but may have the worst performance and in my opinion will just cause issues down the track. If you're writing reams of C# code inside SSIS then you'll eventually reach a point where it's easier to just build a standalone app.
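If you do go down the custom/Script Component route, the lookup-or-NULL rule itself is only a few lines of C#. A rough sketch, assuming hypothetical column names (LookupMatch, ValueFromA, ValueForC) that would come from the generated Input0Buffer metadata in a real package:
// Hypothetical SSIS Script Component (synchronous transformation) method.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    if (Row.LookupMatch)
    {
        // Lookup found a match: carry the tableA value through to the tableC column.
        Row.ValueForC = Row.ValueFromA;
    }
    else
    {
        // No match: persist NULL.
        Row.ValueForC_IsNull = true;
    }
}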
It would be much easier to answer if you explained
What you're really doing
slowly changing dimension?
data cleansing?
adding reference data?
spamming?
What are those three activities?
sending an email?
calling a web service?
calling some other API?
What your constraints are
Is all of this data on one server and can you create stored procs and tables?

Table name changing to avoid SQL injection attack

I understand the basic process of SQL injection attack. My question is related to SQL injection prevention. I was told that one way to prevent such an attack is by frequently changing the table name! Is that possible?
If so, can someone provide me a link to read more about it? I couldn't find an explanation of it on the web.
No. That makes no sense. You'd either have to change every line of code that references the table, or you'd have to leave in place something like a view with the old table name that acts exactly like the old table. No reasonable person would do that. Plus, it's not like there are a ton of reasonable names for tables, so you'd be doing crazy things like saying table A stores customer data, AA stores employer data, and AAA is the intersection between customers and employers.
SQL injection is almost comically simple to prevent. Use prepared statements with bind variables. Don't dynamically build SQL statements. Done. Of course, in reality it's a bit more complex: you also have to make sure the new developer doesn't violate this dictum, either because they don't know any better or because they can hack something out a bit faster with a bit of string concatenation. But the basic approach is very simple.
Pffft. What? Frequently changing a table name?
That's bogus advice, as far as "preventing SQL Injection".
The only prevention for SQL Injection vulnerabilities is to write code that isn't vulnerable. And in the vast majority of cases, that is very easy to do.
Changing table names doesn't do anything to close a SQL Injection vulnerability. It might make a successful attack vector less repeatable, requiring an attacker to make some adjustments, but it does nothing to prevent SQL Injection.
As a starting point for research on SQL Injection, I recommend OWASP (the Open Web Application Security Project).
Start here: https://www.owasp.org/index.php/SQL_Injection
If you run across "changing a table name" as a mitigation, let me know. I've never run across that as a prevention or mitigation for SQL Injection vulnerability.
Here are things you can do to prevent SQL injection:
Use an ORM that encapsulates your SQL calls and provides a friendly layer to your database records. Most of these are very good at writing high quality queries and protecting you from injection bugs simply because of how you use them.
Use prepared statements with placeholder values whenever possible. Write queries like this:
INSERT INTO table_name (name, age) VALUES (:name, :age)
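In C# with ADO.NET the same idea looks roughly like this (a minimal sketch; connectionString, name and age are assumed to exist in scope, and table_name is the hypothetical table from above):
using System.Data.SqlClient;

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "INSERT INTO table_name (name, age) VALUES (@name, @age)", connection))
{
    // The values travel as typed parameters and are never concatenated into the SQL text.
    command.Parameters.AddWithValue("@name", name);
    command.Parameters.AddWithValue("@age", age);
    connection.Open();
    command.ExecuteNonQuery();
}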
Be very careful to properly escape any and all values that are inserted into SQL through any other method. This is always a risky thing to do, so any code you do write like this should make the escaping blindingly obvious, so that a quick code review can verify it's working properly. Never hide escaping behind abstractions or methods with cute names like scrub or clean. Those methods might be subtly broken and you'd never notice.
Be absolutely certain that any table name parameters, if dynamic, are tested against a whitelist of known-good values. For example, if you can create records of more than one type, or put data into more than one table, ensure that the parameter supplied is valid.
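A minimal sketch of such a whitelist check in C# (the table names here are hypothetical):
using System;
using System.Collections.Generic;

static class TableWhitelist
{
    // Only table names from this fixed, known-good set are ever spliced into SQL text.
    static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Customers", "Employers" };

    public static string Validate(string requestedTable)
    {
        if (!Allowed.Contains(requestedTable))
            throw new ArgumentException("Unknown table: " + requestedTable);
        return requestedTable;
    }
}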
Trust nothing supplied by the user. Presume every single bit of data is tainted and hostile unless you've taken the trouble to clean it up. This goes doubly for anything that's in your database if you got your database from some other source, like inheriting a historical project. Paranoia is not unfounded, it's expected.
Write your code such that deleting a line does not introduce a security problem. That means never doing this:
$value = $db->escaped($value);
$db->query("INSERT INTO table (value) VALUES ('$value')");
You're one line away from failure here. If you must do this, write it like so:
$value_escaped = $db->escaped($value);
$db->query("INSERT INTO table (value) VALUES ('$value_escaped')");
That way deleting the line that does the escaping does not immediately cause an injection bug. The default here is to fail safely.
Make every effort to block direct access to your database server by aggressively firewalling it and restricting access to those that actually need access. In practice this means blocking port 3306 and using SSH for any external connections. If you can, eliminate SSH and use a secured VPN to connect to it.
Never generate errors which spew out stack traces that often contain information highly useful to attackers. For example, an error that includes a table name, a script path, or a server identifier is providing way too much information. Have these for development, and ensure these messages are suppressed on production servers.
Randomly changing table names is utterly pointless and will make your code a total nightmare. It will be very hard to keep all your code in sync with whatever random name the table is assuming at any particular moment. It will also make backing up and restoring your data almost impossible without some kind of decoder utility.
Anyone who recommends doing this is proposing a pointless and naïve solution to an already solved problem.
Suggesting that randomly changing the table names fixes anything demonstrates a profound lack of understanding of the form SQL injection bugs take. Knowing the table name is a nice thing for an attacker to have, since it makes their life easier, but many attacks need no knowledge of it. A common attack is to force a login as an administrator by injecting additional clauses into the WHERE condition; the table name is irrelevant.

update target table given DateCreated and DateUpdated columns in source table

What is the most efficient way of updating a target table given the fact that the source table contains a DateTimeCreated and DateTimeUpdated column?
I would like to keep the source and target in sync, avoiding a truncate. I am looking for a best practice pattern in this situation.
I'll avoid a best practice answer but give enough detail to make an appropriate choice. There are two main methods with which you might update a table in SSIS, avoiding a TRUNCATE - LOAD:
1) Use an OLE DB Command
This method is good if:
you have a reliable DateTimeUpdated column,
there are not many rows to update,
there are not a lot of columns to update,
there are not many added columns in the data flow (e.g. derived column transforms),
and the update statement is fairly straightforward.
This method performs poorly with many rows because it performs a row-by-row update. Relying on an audit date column can be a great way to reduce the number of rows to update, but it can also cause problems if rows are updated in the source system and the audit column is not changed. I recommend only trusting it if it is maintained by a trigger or you can be certain that no human can perform updates on the table.
Additionally, this component falls short when there are a lot of columns to map or a lot of transforms going on in the data flow. For example, if you are converting all string columns from Unicode to non-Unicode, you may have many additional columns in the mix that will make mapping and maintenance a pain. The mapping tool in this component is good for about 10 columns; it starts to get confusing very quickly after that, especially because you are mapping to numbered parameters rather than column names.
Lastly, if you are doing anything complex in the update statement, it is better suited to SQL code than to the component's editor, which has no IntelliSense and is generally painful to use.
2) Stage the data and perform the update in an Execute SQL Task after the data flow
This method is good for all the reasons that the OLEDB command is bad for, but has some disadvantages as well. There is more code to maintain:
a couple of T-SQL tasks,
a proc,
and a staging table.
This also means that it takes more time to set up. However, it does perform very well and the code is far easier to read and understand. Ongoing maintenance is simpler as well.
Please see my notes from this other question that I happened to answer today on the same subject: SSIS Compare tables content and update another

Advice required - using entity framework with normalised data

I've recently gone through the process of revamping my database, normalising a lot of entities. Obviously I now have a few more tables than I had. A lot of the data I use on the website is read-only, so this is simple to denormalise using a view; however, there are entities that could benefit from denormalised retrieval but still need to be updated.
Here's an example.
A User may be a Member
A Member may have a Profile
A Member may have an Account
In addition I have 3 further lookup tables.
In total there are 3 tables for User and 4 tables for Member.
Ideally, I can create 2 views from the above tables.
However, User needs to be updated, as do the entities belonging to Member. Additionally there are 6 separate tables associated with Users/Members, e.g. FavouriteCategories, that also need to be retrieved and updated from time to time.
I'm struggling to come up with the best, most efficient way of doing this.
I could simply not use views and bring all the entities and lookups into the model, but I would be reliant on EF to produce the retrieval queries. The stuff I've read suggests that EF is not the best at dealing with joined data.
I could add both the view and tables, using the tables for updates only. This seems sloppy due to the duplication, complication of the model, as well as underutilising the EF model functionality.
Maybe I could use the read-only view for data retrieval and create stored procs. I believe that the process of using EF with stored procs is a bit of a hack, so I'd probably keep the stored procs distinct from EF and simply pass params and call the SP via traditional methods. This again seems like a bit of a halfway house.
I'm not that experienced with .NET or EF, so I would appreciate some solid advice on either the methods I've referred to above or any better technique to achieve this. I don't want to go hacking the edmx file at this stage because... well, it's just wrong.
I have a few entities that would benefit from the right solution. The User example is amongst the simplest, so there's a lot to gain from the right approach.
Help and advice would be very much appreciated.
Do you want to use EF? If yes, use either the first approach (not using views at all and allowing EF to handle everything) or the last approach (using views and mapping stored procedures for insert, update and delete operations).
Combining mapped views for reading and mapped tables for modifications is possible as well, but it is mostly the first solution (allowing EF to handle everything) with additional views for some query optimization.
You will not find cleaner approaches. The mentioned approaches are valid solutions for your problem. The only question is whether you want to write the SQL yourself (views and stored procedures) or let EF do that.
The worst approach is using EF for querying and manually calling stored procedures for updating, but in some cases it can also be useful.

Alternatives to LINQ To SQL on high loaded pages

To begin with, I LOVE LINQ TO SQL. It's so much easier to use than direct querying.
But there's one big problem: it doesn't work well for heavily loaded requests. I have some actions in my ASP.NET MVC project that are called hundreds of times every minute.
I used to have LINQ to SQL there, but since the number of requests is gigantic, LINQ to SQL almost always returned "Row not found or changed" or "X of X updates failed". And it's understandable. For instance, I have to increase some value by one with every request.
var stat = DB.Stats.First();
stat.Visits++;
// ....
DB.SubmitChanges();
But while ASP.NET was working on those //... instructions, the stat.Visits value stored in the table got changed.
I found a solution: I created a stored procedure
UPDATE Stats SET Visits=Visits+1
It works well.
Unfortunately now I'm getting more and more moments like that. And it sucks to create stored procedures for all cases.
So my question is, how to solve this problem? Are there any alternatives that can work here?
I hear that Stack Overflow works with LINQ to SQL. And it's more loaded than my site.
This isn't exactly a problem with Linq to SQL, per se, it's an expected result with optimistic concurrency, which Linq to SQL uses by default.
Optimistic concurrency means that when you update a record, you check the current version in the database against the copy that was originally retrieved before making any offline updates; if they don't match, report a concurrency violation ("row not found or changed").
There's a more detailed explanation of this here. There's also a fairly sizable guide on handling concurrency errors. Typically the solution involves simply catching ChangeConflictException and picking a resolution, such as:
try
{
    // Make changes
    db.SubmitChanges();
}
catch (ChangeConflictException)
{
    foreach (var conflict in db.ChangeConflicts)
    {
        conflict.Resolve(RefreshMode.KeepCurrentValues);
    }
}
The above version will overwrite whatever is in the database with the current values, regardless of what other changes were made. For other possibilities, see the RefreshMode enumeration.
Your other option is to disable optimistic concurrency entirely for fields that you expect might be updated. You do this by setting the UpdateCheck option to UpdateCheck.Never. This has to be done at the field level; you can't do it at the entity level or globally at the context level.
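If you use the O/R designer, this corresponds to the Update Check property on the column; in hand-written attribute mapping it looks roughly like this (a sketch only; the Stat entity and Id column are assumptions modelled on the Stats table in the question):
using System.Data.Linq.Mapping;

[Table(Name = "Stats")]
public class Stat
{
    [Column(IsPrimaryKey = true)]
    public int Id { get; set; }

    // Opt this column out of the optimistic concurrency check so concurrent
    // writers no longer trigger "Row not found or changed".
    [Column(UpdateCheck = UpdateCheck.Never)]
    public int Visits { get; set; }
}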
Maybe I should also mention that you haven't picked a very good design for the specific problem you're trying to solve. Incrementing a "counter" by repeatedly updating a single column of a single row is not a very good/appropriate use of a relational database. What you should be doing is actually maintaining a history table - such as Visits - and if you really need to denormalize the count, implement that with a trigger in the database itself. Trying to implement a site counter at the application level without any data to back it up is just asking for trouble.
Use your application to put actual data in your database, and let the database handle aggregates - that's one of the things databases are good at.
Use a producer/consumer or message queue model for updates that don't absolutely have to happen immediately, particularly status updates. Instead of trying to update the database immediately, keep a queue of updates that the ASP.NET threads can push to, and then have a writer process/thread that writes the queue to the database. Since only one thread is writing, there will be much less contention on the relevant tables/rows.
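A minimal in-process sketch of that idea (the VisitQueue type and SaveVisitToDatabase call are hypothetical; a real deployment might use an external queue instead):
using System;
using System.Collections.Concurrent;
using System.Threading;

static class VisitQueue
{
    // ASP.NET request threads only enqueue; they never touch the database directly.
    static readonly BlockingCollection<DateTime> PendingVisits =
        new BlockingCollection<DateTime>();

    public static void RecordVisit()
    {
        PendingVisits.Add(DateTime.UtcNow);
    }

    // A single background writer drains the queue, so only one thread
    // ever updates the Stats table and contention largely disappears.
    public static void StartWriter()
    {
        var writer = new Thread(() =>
        {
            foreach (DateTime visit in PendingVisits.GetConsumingEnumerable())
            {
                SaveVisitToDatabase(visit);   // hypothetical data-access call
            }
        });
        writer.IsBackground = true;
        writer.Start();
    }

    static void SaveVisitToDatabase(DateTime visitedAt)
    {
        // Placeholder: LINQ to SQL, ADO.NET, or a stored procedure call goes here.
    }
}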
For reads, use caching. For high volume sites even caching data for a few seconds can make a difference.
Firstly, you could call DB.SubmitChanges() right after stats.Visits++, and that would greatly reduce the problem.
However, that still is not going to save you from the concurrency violation (that is, simultaneously modifying a piece of data by two concurrent processes). To fight that, you may use the standard mechanism of transactions. With LINQ-to-SQL, you use transactions by instantiating a TransactionScope class, thusly:
using( TransactionScope t = new TransactionScope() )
{
    var stats = DB.Stats.First();
    stats.Visits++;
    DB.SubmitChanges();
}
Update: as Aaronaught correctly pointed out, TransactionScope is not going to help here, actually. Sorry. But read on.
Be careful, though, not to make the body of a transaction too long, as it will block other concurrent processes, and thus, significantly reduce your overall performance.
And that brings me to the next point: your very design is probably flawed.
The core principle in dealing with highly shared data is to design your application in such way that the operations on that data are quick, simple, and semantically clear, and they must be performed one after another, not simultaneously.
The one operation that you're describing - counting visits - is pretty clear and simple, so it should be no problem once you add the transaction. I must add, however, that while this will be clear, type-safe and otherwise "good", the solution with a stored procedure is actually a much preferred one. This is exactly the way database applications were designed in ye olden days. Think about it: why would you need to fetch the counter all the way from the database to your application (potentially over the network!) if there is no business logic involved in processing it? The database server may increment it just as well, without even sending anything back to the application.
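For completeness, issuing that single-statement increment from the application is one round trip and fetches nothing back (a sketch; connectionString is assumed, and the UPDATE is the one from the question):
using System.Data.SqlClient;

// Let SQL Server do the increment atomically; nothing is read, modified and written back.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("UPDATE Stats SET Visits = Visits + 1", connection))
{
    connection.Open();
    command.ExecuteNonQuery();
}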
Now, as for the other operations that are hidden behind // ..., it seems (from your description) that they're somewhat heavy and long. I can't tell for sure, because I don't see what's there, but if that's the case, you probably want to separate them into smaller and quicker ones, or otherwise rethink your design. I really can't tell anything else with this little information.

Storing JSON in an MS SQL database?

I'm developing a form generator, and wondering if it would be bad mojo to store JSON in an SQL database?
I want to keep my database & tables simple, so I was going to have
`pKey, formTitle, formJSON`
on a table, and then store
{["firstName":{"required":"true","type":"text"},"lastName":{"required":"true","type":"text"}}
in formJSON.
Any input is appreciated.
I use JSON extensively in my CMS (which hosts about 110 sites) and I find the speed of accessing the data to be very fast. I was surprised that there wasn't more speed degradation. Every object in the CMS (Page, Layout, List, Topic, etc.) has an NVARCHAR(MAX) column called JSONConfiguration. My ORM tool knows to look for that column and reconstitute it as an object if needed. Or, depending on the situation, I will just pass it to the client for jQuery or Ext JS to process.
As for readability / maintainability of my code, you might say it's improved because I now have classes that represent a lot of the JSON objects stored in the DB.
I used JSON.net for all serialization / deserialization. https://www.newtonsoft.com/json
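A sketch of that serialize/deserialize round trip with JSON.NET (the FieldDefinition class and FormJsonMapper helper are hypothetical, modelled on the formJSON example in the question):
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical shape that the NVARCHAR(MAX) JSON column is reconstituted into.
public class FieldDefinition
{
    public string Required { get; set; }
    public string Type { get; set; }
}

public static class FormJsonMapper
{
    // Serialize before writing the row's formJSON column...
    public static string ToJson(Dictionary<string, FieldDefinition> fields)
    {
        return JsonConvert.SerializeObject(fields);
    }

    // ...and deserialize after reading it back.
    public static Dictionary<string, FieldDefinition> FromJson(string formJson)
    {
        return JsonConvert.DeserializeObject<Dictionary<string, FieldDefinition>>(formJson);
    }
}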
I also use a single query to return meta-JSON with the actual data. As in the case of Ext JS, I have queries that return both the structure of the Ext JS object as well as the data the object will need. This cuts out one post back / SQL round trip.
I was also surprised at how fast the code was to parse a list of JSON objects and map them into a DataTable object that I then handed to a GridView.
The only downside I've seen to using JSON is indexing. If you have a property of the JSON you need to search, then you have to store it as a separate column.
There are JSON DBs out there that might serve your needs better: CouchDB, MongoDB, and Cassandra.
A brilliant way to make an object database out of SQL Server. I do this for all config objects and everything else that doesn't need any specific querying. Extending your object is easy: just create a new property in your class and initialise it with a default value. Don't need a property any more? Just delete it from the class. Easy rollout, easy upgrade. Not suitable for all objects, but as long as you extract any property you need to index on into its own column, keep using it. A very modern way of using SQL Server.
It will be slower than having the form defined in code, but one extra query shouldn't cause you much harm. (Just don't let 1 extra query become 10 extra queries!)
Edit: If you are selecting the row by formTitle instead of pKey (I would, because then your code will be more readable), put an index on formTitle.
We have used a modified version of XML for exactly the purpose you describe for seven or eight years and it works great. Our customers' form needs are so diverse that we could never keep up with a table/column approach. We are too far down the XML road to change very easily, but I think JSON would work as well and maybe even better.
Reporting is no problem with a couple of good parsing functions and I would defy anyone to find a significant difference in performance between our reporting/analytics and a table/column solution to this need.
I wouldn't recommend it.
If you ever want to do any reporting or query based on these values in the future it's going to make your life a lot harder than having a few extra tables/columns.
Why are you avoiding making new tables? I say if your application requires them, go ahead and add them in... Also, if someone has to go through your code/db later, it's probably going to be harder for them to figure out what you had going on (depending on what kind of documentation you have).
You should be able to use SisoDb for this. http://sisodb.com
I think it is not an optimal idea to store object data in a string in SQL. You have to do the transformation outside of SQL in order to parse it. That presents a performance issue and you lose the leverage of SQL's native data parsing capability. A better way would be to store the data in an XML datatype column in SQL. This way, you kill two birds with one stone: you don't have to create a shipload of tables and you still get all the native querying benefits of SQL.
XML in SQL Server 2005? Better than JSON in Varchar?