On a Symfony exploration project, I have a model using Doctrine's NestedSet behaviour. Since the data is prepared in a flat file, I wrote a conversion utility to generate the corresponding YAML. It turns out that processing a NestedSet YAML file of around 100 records (max_depth=4) consumes over 40MB of PHP memory, which is not available to me.
Is there a work-around to this problem?
I'm thinking of 2 possible solutions.
Write an equivalent PHP script to populate objects & save them
Insert data via SQL statements, the challenge being to compute the left & right nodes
What do Symfonians suggest?
I suggest you insert the data in several passes, for instance one pass per level, starting at level 0.
Option 2 was better.
I wrote a simple macro in Excel to compute the lft and rgt values required for the pre-order tree traversal, following the logic described here: http://www.sitepoint.com/hierarchical-data-database-2/
The same Excel utility then converts the values into SQL statements that can be dumped to a file and loaded directly.
Going through the fixtures/object route exceeded the allowed memory limit.
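For readers who would rather script this than use Excel, here is a minimal Python sketch of the same pre-order numbering. It is not the macro from the answer. It assumes the flat file can be reduced to (node_id, parent_id) pairs with integer ids and a single root, and the column names lft, rgt and level as well as the table name are assumptions to adjust to your schema.

```python
from collections import defaultdict


def nested_set_bounds(rows):
    """Compute nested-set values from (node_id, parent_id) pairs.

    parent_id is None for the root. Returns {node_id: (lft, rgt, level)}.
    """
    children = defaultdict(list)
    for node_id, parent_id in rows:
        children[parent_id].append(node_id)

    bounds = {}
    counter = 1

    def visit(node_id, level):
        nonlocal counter
        lft = counter          # number the node on the way down
        counter += 1
        for child in children[node_id]:
            visit(child, level + 1)
        bounds[node_id] = (lft, counter, level)  # and again on the way up
        counter += 1

    for root in children[None]:
        visit(root, 0)
    return bounds


def to_insert_sql(table, bounds):
    """Render one INSERT per node (hypothetical column names, integer ids)."""
    return [
        f"INSERT INTO {table} (id, lft, rgt, level) "
        f"VALUES ({node_id}, {lft}, {rgt}, {level});"
        for node_id, (lft, rgt, level) in sorted(bounds.items())
    ]


example = nested_set_bounds([(1, None), (2, 1), (3, 1), (4, 2)])
statements = to_insert_sql("category", example)
```

Because the statements are plain SQL, loading them avoids building any model objects in PHP, which is where the memory went in the fixtures route.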
I am implementing a SSIS package and currently trying to do the following.
Truncate the destination table
Fetch the data by executing the stored procedure and insert it into the destination table.
I have created an Execute SQL task to address step 1 and a data flow with an OLE DB source and OLE DB destination to address step 2. It has been working successfully so far, but it isn't working for one of my stored procedures that uses temp tables.
When I edit the oledb source and click the preview button, I get the error no column returned
I know that SSIS has an issue with generating columns when executing stored procedures that depend on temp tables. I have converted the stored proc to use table variables, and it is now able to return columns in SSIS when I do a preview. The only downside is that the stored procedure takes much longer to execute: 1 hour 15 mins, compared to 15 mins when using temp tables.
I did see a suggestion to use SET FMTONLY before executing the stored procedure as an alternate solution to changing to temp table variables but that didn't seem to work as I am getting syntax or permission denied error.
Could somebody tell me a solution to my problem which does not compromise on the performance.
Sounds like you've already read all the approaches to using Temp tables in SSIS, including the IF 1=0... trick? If you haven't seen that one yet, google it.
You say that using Table Variables causes your stored procedure to take about 5 times longer than using Temp Tables. The most likely reason for that is that you are indexing your temp tables but not your table variables. If you didn't know that table variables can be indexed, they can. You might try that.
Finally, a solution that you haven't mentioned is that you can replace your temporary table with a real table that gets truncated when you're done using it.
Short comment:
Try EXEC WITH RESULT SETS and specify the metadata yourself for a proc with temp tables; or use the Script Component as a source and specify the Output columns yourself.
Long comment:
Technically speaking, it is the driver/database you are using in SSIS that would decide the behavior when working with temp tables.
Metadata is an important factor when using SSIS's pipeline components. By metadata, I mean the names of the columns, their data types etc that a pipeline component uses. When designing a data flow, someone/something should provide this metadata to the components that require it.
In most cases, SSIS automatically retrieves the metadata. Components that do not connect to an external data source, like Conditional Split etc, get their metadata from the other components they are connected to. For the pipeline components that connect to an external data source (like OLE DB source, OLE DB destination, Lookup etc.), SSIS provides a mechanism to get this metadata without human involvement. This mechanism involves the driver connecting to the database and retrieving the metadata of the output. If the driver/database is capable of returning the metadata, then that metadata is used. If the driver/database is incapable, then you get the errors you are seeing. The rest of my comments are based on the assumption that you are using a SQL Server database in your question.
When working with a SQL Server database in SSIS, we typically use the native client drivers provided by Microsoft. These drivers try to get the metadata without actually executing the SQL statement (actual execution can have side effects and might take anywhere from seconds to hours, and you don't want side effects or long waits at package design time). So to get the metadata, the driver relies on the metadata of the actual objects used in the SQL command. If the command uses a physical table or view, SQL Server already has the metadata available and can supply it to the driver. If it is a temp table, SQL Server does not have the metadata until it can create the temp table. With the SET FMTONLY option, you can arrange for the temp tables to be created while avoiding any heavy processing or side effects, and thus retrieve metadata without penalties. From 2012 onwards, these native client drivers rely on newer functionality to retrieve metadata than the earlier drivers did: the driver uses the sp_describe_first_result_set proc. So whether you can get metadata or not is determined by what sp_describe_first_result_set can do.
So while SSIS can automatically get the metadata (because of the driver/database), it does not automatically get the metadata in some cases (again because of the driver/database). In cases involving the second scenario, some other process (typically a human) can help the driver infer metadata or provide the metadata to the component directly.
To help the driver, in the case of SQL Server 2012 and after, you can use the WITH RESULT SETS clause to specify the output metadata. When this clause is present, the driver uses it and doesn't try to query the metadata from system objects, which avoids the error you would otherwise get. If you are using the drivers that came with SQL Server 2008, you can use SET FMTONLY. This option is at the driver/database level.
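To make the shape of that clause concrete, here is a small Python helper that renders the statement you would paste into the OLE DB source as its SQL command. It is only a sketch: the procedure name and the column list are hypothetical, and you must supply the real names and types your procedure returns.

```python
def exec_with_result_sets(proc_name, columns):
    """Build a T-SQL EXEC statement that declares the result-set shape.

    columns is a list of (column_name, sql_type) pairs describing the
    single result set returned by the procedure.
    """
    column_list = ", ".join(f"[{name}] {sql_type}" for name, sql_type in columns)
    return f"EXEC {proc_name} WITH RESULT SETS (({column_list}));"


# Hypothetical procedure and columns, for illustration only.
statement = exec_with_result_sets(
    "dbo.usp_LoadOrders",
    [("OrderId", "INT"), ("CustomerName", "NVARCHAR(100)")],
)
print(statement)
```

The declared columns have to match what the procedure actually returns in number and compatible type, otherwise the statement fails at run time.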
Another option could be to use a Script Component as the Source and in the Output columns, you can specify the columns/metadata. SSIS would not try to retrieve metadata from the datasource in this case, but would rely on the definitions you provided in the Output section of the Script Component.
As you can see, both options involve a human (or some other process) specifying the metadata instead of SSIS trying to retrieve the metadata in an automated fashion. I would prefer the first option if working with SQL Server and the second option if working with databases like MySql.
I have got around 35 tables whose data need to be migrated from SQL Server to MySQL. I am using SSIS for this project and I have set up a control flow (using Load Multiple Tables) with a Script Task and a Foreach Loop Container that iterates through all the tables in my database. What I now need to do is convert the data type for some of the columns, in some of the tables, to 'Unicode String [DT_WSTR]' before I dump them in my destination tables. Is this something that can be done through SSIS? If so, any pointers or a set of instructions would be great.
Thanks,
Pratik Gandhi
Yes, this is a standard out-of-the-box task for SSIS.
Add a Data Flow Task.
Add a Data conversion component to the task
Add your source and destination servers
Map your columns, converting datatypes where required.
As always, MSDN provides further help.
I am using Pentaho Data Integration Software.
I am currently running a Pentaho job as an ETL. I extract data from multiple places and load it into a single database table. The schema is exactly the same for all of the places I extract from. So, other than the database connection and a single 'variable' that stores where the data came from, the transformation in Pentaho is exactly the same for each one. So I have a job that runs each of these transformations.
The problem comes when I want to make a change: I need to change 6 transformations every time. What I want to do is somehow set something like a variable in Pentaho that tells it to run a single transformation 6 times, with different database connections and perhaps a single variable.
Is this possible?
Thanks in advance.
If I have understood your question correctly, you need to run the same transformation in a loop using a single KTR file (assuming there is only one database type).
PDI provides a step called "Copy Rows to Result", where you can store the credentials of your databases in multiple rows; for every run of the job, it will use a different connection and run the transformation multiple times (6 in your case).
Note: I have assumed that you are having only one database type e.g. : mySQL but with different credentials.
Hope this helps :) I would be happy to provide you sample code in case you need it.
Well, why don't you use a job that will pass the host/user/password as variables? That way your whole data flow will be generic.
Hope this answer will lead you into the right direction!
I received over 100GB of data with 67million records from one of the retailers. My objective is to do some market-basket analysis and CLV. This data is a direct sql dump from one of the tables with 70 columns. I'm trying to find a way to extract information from this data as managing itself in a small laptop/desktop setup is becoming time consuming. I considered the following options
Parse the data and convert it to CSV format. The file size might come down to around 35-40GB, as more than half of the information in each record is column names. However, I may still have to use a db, as I can't use R or Excel with 67 million records.
Migrate the data to a MySQL db. Unfortunately I don't have the schema for the table, so I'm trying to recreate the schema by looking at the data. I may have to replace to_date() in the data dump with str_to_date() to match the MySQL format.
Is there a better way to handle this? All that I need to do is extract the data from the sql dump by running some queries. Hadoop etc. are options, but I don't have the infrastructure to set up a cluster. I'm considering MySQL as I have storage space and some memory to spare.
Suppose I go in the MySQL path, how would I import the data? I'm considering one of the following
Use sed and replace to_date() with appropriate str_to_date() inline. Note that, I need to do this for a 100GB file. Then import the data using mysql CLI.
Write python/perl script that will read the file, convert the data and write to mysql directly.
What would be faster? Thank you for your help.
In my opinion writing a script will be faster, because you are going to skip the SED part.
I think that you need to setup a server on a separate PC, and run the script from your laptop.
Also, use tail to quickly grab a small part from the end of this large file, so that you can test your script on that part before you run it on the full 100GB file.
I decided to go with the MySQL path. I created the schema looking at the data (had to increase a few of the column size as there were unexpected variations in the data) and wrote a python script using MySQLdb module. Import completed in 4hr 40mins on my 2011 MacBook Pro with 8154 failures out of 67 million records. Those failures were mostly data issues. Both client and server are running on my MBP.
@kpopovbg, yes, writing a script was faster. Thank you.
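The conversion step of such a script could look roughly like the sketch below. It is not the script used above. It assumes the dump contains Oracle-style to_date('value','format') calls with both arguments as simple quoted literals, and it maps only a handful of common format elements; anything else in the format string is passed through unchanged and would need handling.

```python
import re

# Oracle format elements mapped to MySQL STR_TO_DATE specifiers
# (a common subset only; extend as the data requires).
ORACLE_TO_MYSQL = {
    "YYYY": "%Y", "MM": "%m", "DD": "%d",
    "HH24": "%H", "MI": "%i", "SS": "%s",
}
# Longest tokens first so that HH24 is matched as a whole.
TOKEN_RE = re.compile("|".join(sorted(ORACLE_TO_MYSQL, key=len, reverse=True)))
TO_DATE_RE = re.compile(
    r"to_date\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)", re.IGNORECASE
)


def convert_format(oracle_format):
    """Translate an Oracle date format string into MySQL specifiers."""
    return TOKEN_RE.sub(lambda m: ORACLE_TO_MYSQL[m.group(0)], oracle_format.upper())


def convert_line(line):
    """Rewrite every to_date(...) call on one line as str_to_date(...)."""
    return TO_DATE_RE.sub(
        lambda m: "str_to_date('{}','{}')".format(
            m.group(1), convert_format(m.group(2))
        ),
        line,
    )


def convert_file(src_path, dst_path):
    """Stream the dump line by line so the 100GB file never sits in memory."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(convert_line(line))


sample = "INSERT INTO t VALUES (1, to_date('2011-01-05 13:45:00','YYYY-MM-DD HH24:MI:SS'));"
converted = convert_line(sample)
```

The same convert_line function can be applied just before each statement is sent to MySQL, which skips writing an intermediate file.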
Suppose I have a CSV file with 1M email addresses. I need to iterate through the file and add each entry, for example:
with open(file) as csv:
    for item in csv:
        Email.objects.create(email=item)
This seems like it would be very slow going through the django ORM like this to create 1M objects and insert them into the db. Is there a better way than this, or should I go away from django for this task and do it directly with the db?
You can also try the new bulk_create.
Besides bulk_create, you could put all inserts into one transaction as long as your DB backend supports it:
from django.db.transaction import commit_on_success
# with commit_on_success(), open(file) as csv: # in Python2.7
with commit_on_success():
    for item in csv:
        Email.objects.create(email=item)
Also note that bulk_create treats items with the same values as the same item, so
Email.objects.bulk_create([Email(email=item), Email(email=item)])
actually creates one row instead of two.
Because it needs more SQL round trips, the transaction solution is still slower than the bulk_create one, but you don't have to create all one million Email() instances in memory (passing a generator to bulk_create does not seem to work here).
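One way to get most of the bulk_create speed without holding a million instances at once is to feed it fixed-size batches. The sketch below is framework-independent so that it runs on its own: make_object and save_batch are hypothetical callables that you would replace with the model constructor and Email.objects.bulk_create, and the batch size of 1000 is an arbitrary choice.

```python
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most `size` items without materialising the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_in_batches(lines, make_object, save_batch, size=1000):
    """Build objects lazily and hand them to save_batch one chunk at a time.

    Returns the number of objects passed to save_batch.
    """
    total = 0
    objects = (make_object(line.strip()) for line in lines if line.strip())
    for batch in batches(objects, size):
        save_batch(batch)
        total += len(batch)
    return total


# Stand-ins for the Django pieces so the example runs anywhere.
saved = []
count = load_in_batches(
    ["a@example.com\n", "b@example.com\n", "\n", "c@example.com\n"],
    make_object=lambda email: email,
    save_batch=saved.append,
    size=2,
)
```

In a Django project the call would look something like load_in_batches(open_file, lambda e: Email(email=e), Email.objects.bulk_create), so only one batch of instances exists at any time.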
Furthermore, you could do it in SQL-level directly
This is something you should drop to DB-API to accomplish, since you bypass creating all the model objects.
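As a rough illustration of the DB-API route, the sketch below uses the standard library's sqlite3 module so that it is self-contained. The table and column names are assumptions, and other drivers use a different placeholder style (for example %s instead of ?), so treat it as a pattern to adapt.

```python
import sqlite3


def insert_emails(connection, lines, table="email"):
    """Insert one row per non-empty line inside a single transaction."""
    rows = ((line.strip(),) for line in lines if line.strip())
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            f"INSERT INTO {table} (email) VALUES (?)", rows
        )


# In-memory database with a hypothetical table, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email (id INTEGER PRIMARY KEY, email TEXT)")
insert_emails(conn, ["a@example.com\n", "b@example.com\n", "\n"])
row_count = conn.execute("SELECT COUNT(*) FROM email").fetchone()[0]
```

Because executemany consumes the generator row by row, the file is streamed and no model instances are created.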
IMHO, I don't see a very big problem with speed if it's only a one-time insert (1M records won't take you hours). If you'll be using the Django API to access those objects in the future, then you should probably avoid resorting to SQL-level inserts and do it through Django's methods, as suggested by livar (if using Django 1.4).
You might want to look into the Django DSE package, which is apparently an efficient bulk insert/update library.