I am building a Magento eCommerce website and importing customer profiles from the old one. I am using the CSV importer, and the process is far slower than I expected: almost 4 seconds per customer. So far it has been running for over 6 hours and only 30k customers have been imported. The CSV file is chunked into several smaller files of roughly 10 MB each.
For now, I am using an Amazon Web Services EC2 instance (micro) as the development server. It has 1 vCPU (2.5 GHz) and 1 GiB of memory, but I don't think that is the issue. I increased the PHP memory limit to 1 GB.
I've read an article saying that these speed issues when importing products are very common because of Magento's EAV database schema and its heavy PHP API modules [Speeding up Magento Imports]. It says Magento issues about 450 MySQL queries to import a single product. I have also seen a workaround using [Magmi], which bypasses Magento's API and inserts data directly into the MySQL tables. However, as far as I know it only imports products and categories, not customers, and I don't know whether products and customers use the same import mechanism.
I disabled cache management and set index management to 'Manual Update', although customer profiles don't really touch those processes.
Do you have any suggestions for speeding up this CSV import?
[Follow-up]
I have found one source of the problem: Amazon EC2 T2 instances. They use CPU credits to control maximum CPU usage, and for micro instances the baseline CPU performance is limited to 10% of capacity. I had used up all of my CPU credits, so the server would not let me use the full CPU.
I just checked the import performance again, and it is now importing approximately 2 records per second. I think this is still too slow, so I will keep waiting for other solutions. Perhaps optimizations?
Forget about Magmi, forget about Dataflow. Do it the best-practice Magento way...
Use this one:
https://github.com/avstudnitz/AvS_FastSimpleImport
You can use any array to import product and customer entities.
If you need to update existing products via CSV import, include only the columns you want to update, plus the SKU. Yes, SKU is a required column; beyond that, include only the columns for the product attributes you want to update.
It imports products very fast!
I'm currently doing some testing for an upcoming data migration project and came across KingswaySoft, which seemed like it would be ideal for this purpose.
However, I'm currently testing an import of 225,000 contact records into a new sandbox Dynamics 365 instance, and it is on course to take somewhere between 10 and 13 hours.
Is this typical of the speeds I should expect or am I doing something silly?
I am only setting some out-of-the-box fields such as first name, last name, date of birth, and address data.
I have a staging contact SQL database holding the 225k records to be uploaded.
I have the CRM Destination Component set up to use multi-threading, with a batch size of 250 and up to 16 threads.
I have tested using both Create and Upsert, and both are very slow.
Am I doing something wrong? I would have expected it to be much quicker.
When it comes to loading data into Dynamics 365 Online, the most important factor affecting performance is network latency, so you should try to place the data migration solution as close as possible to the Dynamics 365 Online server. With the right configuration, you should be able to achieve something like 1 to 2 million records per hour; at 225,000 records in 10 to 13 hours you are getting only around 20,000 records per hour, so something must be wrong. Many other things can affect data load performance, but start with network latency first. We have some other tips shared at https://www.kingswaysoft.com/products/ssis-integration-toolkit-for-microsoft-dynamics-365/help-manual/crm/advanced-topics#MaximizedPerformance, which you should check out.
I have a Spring Boot server application. Clients of this server ask for statistics about different things all the time. These statistics can be shared among clients and do not need to be real-time.
It's good enough if these statistics are refreshed every 15-30 mins.
Also, computing these statistics requires reading the whole database.
So, I'd like to cache these computed statistics and update them now and then.
What is your suggestion, what tool or pattern should I use?
I have the following ideas so far:
using memcached
upgrading to MySQL 5.7, which has a JSON store, and storing the data there
Please keep in mind that my server's hardware is not very powerful: 512 MB RAM and 1 CPU (the cheapest option on DigitalOcean).
Thank you in advance!
Edit 1:
These statistics are composed of quite simple data structures: int-to-int maps, lists, etc., and they do NOT fit well into a relational database.
Edit 2:
The whole dataset is only a few megabytes. The crucial point is that creating this data requires a lot of database reads, and a lot of clients are asking for it.
I also want to keep my server application stateless. I think it's important to mention.
A simple solution to the problem is to save the data to a file in JSON format, and that's it.
Additionally, this file can live on a RAM disk partition, so reads will be blazing fast.
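For illustration, here is a minimal sketch of that approach in Spring Boot, assuming Jackson is on the classpath and scheduling is enabled; StatisticsService, computeStatistics() and the /dev/shm path are placeholders for your own setup:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.util.Map;

    // Hypothetical service that performs the expensive, whole-database computation.
    interface StatisticsService {
        Map<String, Object> computeStatistics();
    }

    // Recomputes the statistics on a schedule and keeps the result in a JSON file
    // (here on /dev/shm, which is a RAM disk on most Linux systems).
    @Component
    public class StatisticsCache {

        private static final File CACHE_FILE = new File("/dev/shm/stats.json");

        private final ObjectMapper mapper = new ObjectMapper();
        private final StatisticsService statisticsService;

        public StatisticsCache(StatisticsService statisticsService) {
            this.statisticsService = statisticsService;
        }

        // Refresh every 15 minutes; the heavy database reads happen here, not per request.
        // Requires @EnableScheduling on the Spring Boot application class.
        @Scheduled(fixedRate = 15 * 60 * 1000)
        public void refresh() throws IOException {
            Map<String, Object> stats = statisticsService.computeStatistics();
            mapper.writeValue(CACHE_FILE, stats);
        }

        // Controllers simply return the precomputed snapshot; no database work per request.
        public byte[] currentSnapshot() throws IOException {
            return Files.readAllBytes(CACHE_FILE.toPath());
        }
    }

A controller can then return currentSnapshot() directly, so no client request ever touches the database, and the application itself stays stateless.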
I'm developing an app in which I'll need to collect five years of daily data from a MySQL server (so roughly 1,825 rows from a table with about 6 or 7 columns).
To handle this data, after retrieving it I can either store it in a local SQLite database or just keep it in memory.
I admit that, so far, the only advantage I could find for storing it in a local database, instead of just using what's already loaded, is having the data available the next time the user opens the app.
But I think I might not be taking into account all important factors.
Which factors should I take into account to decide between storing data in a local database or keep it in memory?
Best regards,
Nicolas Reichert
With respect, you're overthinking this. You're talking about a small amount of data: 2K rows is nothing for a MySQL server.
Therefore, I suggest you keep your app simple. When you need those rows in your app fetch them from MySQL. If you run the app again tomorrow, run the query again and fetch them again.
Are the rows the result of some complex query? To keep things simple you might consider creating a VIEW from the query. On the other hand, you can just as easily keep the query in your app.
Are the rows the result of a time-consuming query? In that case you could create a table in MySQL to hold your historical data. That way you'd only have to do the time-consuming query on your newer data.
At any rate, adding some alternative storage tech to your app (be it RAM or be it a local sqlite instance) isn't worth the trouble IMHO. Keep It Simple™.
If you're going to store the data locally, you have to figure out how to make it persistent. sqlite does that. It's not clear to me how RAM would do that unless you dump it to the file system.
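To make that concrete, here is a minimal sketch of the "just fetch it again" approach, assuming the app is written in Java and talks to MySQL over JDBC (Connector/J on the classpath); the daily_metrics table, its columns, and the connection details are made up for the example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class DailyDataLoader {

        // Roughly 1,825 rows (five years of daily data) is a trivial result set for MySQL,
        // so fetching it on every app start keeps the client free of local storage logic.
        public static List<double[]> loadDailyData(String url, String user, String password)
                throws SQLException {
            List<double[]> rows = new ArrayList<>();
            String sql = "SELECT value_a, value_b, value_c FROM daily_metrics ORDER BY metric_date";
            try (Connection conn = DriverManager.getConnection(url, user, password);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    rows.add(new double[] { rs.getDouble(1), rs.getDouble(2), rs.getDouble(3) });
                }
            }
            return rows;
        }

        public static void main(String[] args) throws SQLException {
            List<double[]> data = loadDailyData(
                    "jdbc:mysql://db.example.com:3306/appdb", "app_user", "secret");
            System.out.println("Loaded " + data.size() + " rows");
        }
    }

If startup latency ever becomes a problem, the in-memory list can simply be kept for the lifetime of the session; persistence only matters if you need the data offline.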
We use Cassandra as the primary data store for our application, which collects a very large amount of data and requires a large amount of storage and very fast write throughput.
We plan to extract this data on a periodic basis and load into a relational database (like mySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third party ETL tools like Informatica are not an option for us.
So far my web searches have turned up only Hadoop with Pig or Hive as an option. However, being very new to this field, I am not sure how well they would scale, or how much load they would put on the Cassandra cluster itself while running. Are there other options?
You should take a look at Sqoop; it has an integration with Cassandra, as shown here.
It also scales easily. You need a Hadoop cluster to get Sqoop working, and the way it works is basically:
Slice your dataset into different partitions.
Run a Map/Reduce job where each mapper will be responsible for transferring 1 slice.
So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep increasing your cluster the throughput will keep increasing. It's all a matter of what resources you have.
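To make the slicing idea concrete, here is a rough Java sketch of the same pattern outside of Sqoop: the dataset is split into slices and each worker transfers one slice into MySQL using JDBC batch inserts. fetchSlice() is a placeholder for whatever Cassandra read you would actually use (a Sqoop mapper does the equivalent for you), and the connection details and table are invented for the example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SlicedExport {

        // Placeholder: read one slice (e.g. one token range) out of Cassandra.
        // In a real Sqoop job, each mapper does the equivalent of this for its slice.
        static List<String[]> fetchSlice(int sliceIndex, int sliceCount) {
            return Collections.emptyList();
        }

        public static void main(String[] args) throws InterruptedException {
            int sliceCount = 16; // more slices and more workers = more throughput, like adding mappers
            ExecutorService pool = Executors.newFixedThreadPool(sliceCount);

            for (int i = 0; i < sliceCount; i++) {
                final int slice = i;
                pool.execute(() -> {
                    try (Connection conn = DriverManager.getConnection(
                                 "jdbc:mysql://mysql.example.com:3306/warehouse", "etl", "secret");
                         PreparedStatement ps = conn.prepareStatement(
                                 "INSERT INTO events (id, payload) VALUES (?, ?)")) {
                        for (String[] row : fetchSlice(slice, sliceCount)) {
                            ps.setString(1, row[0]);
                            ps.setString(2, row[1]);
                            ps.addBatch(); // batch writes instead of one round trip per row
                        }
                        ps.executeBatch();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

Raising sliceCount (and the cluster capacity behind it) is what raises throughput, which is exactly the property you get for free from Sqoop's mappers.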
As for the load on the Cassandra cluster, I am not certain, since I have not used the Cassandra connector with Sqoop personally, but if you want to extract data you will need to put some load on the cluster anyway. You could, for example, run the export once a day at the time when traffic is lowest, so that if your Cassandra availability drops, the impact is minimal.
I'm also thinking that if this is related to your other question, you might want to consider exporting to Hive instead of MySQL; Sqoop works for that too, because it can export to Hive directly. Once the data is in Hive, you can use the same cluster Sqoop ran on for your analytics jobs.
There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different approaches, such as the COPY command and CQL queries, and all of them time out regardless of how I change the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a WHERE clause, which is a big restriction for me. This may be one of the main reasons not to use Cassandra, at least for me.
The MySQL performance of running Magento in this situation under a single MySQL installation is giving me a headache. I wonder if it is feasible to set up an individual MySQL instance for each website so that catalog updates can occur concurrently across all websites.
It can certainly be made to work within a cluster, provided you queue your updates and plan ahead for it. But it won't be cheap, and I'd guess you'll need a MySQL instance for every 30 to 50 websites. It's worth looking into MySQL sharding for the heavily used tables, and into ways to run all of this in RAM, to dramatically cut the resources needed.
And for a task like this, you have to be a living-and-breathing InnoDB person.