Replace missing value with cell above in either Perl or MySQL? - mysql

I'm importing a csv file of contacts and where one parent has many children it leaves the duplicated values blank. I need to make sure that they are populated when they reach the database however.
Is there a way that I can implement the following when I'm importing a .csv file into Perl and then exporting into MySQL?
if (value is null)
value = value above.
Thanks!

Why don't you place the individual values you read from the CSV file into an array (e.g. #FIELD_DATA). Then when you encounter an empty field while iterating over a row (e.g. for column 4) you can write
unless (length($CSV_FIELD[4])) {
$CSV_FIELD[4] = $FIELD_DATA[4]
}

Not with an import statement afaik. You could, however, make use of triggers (http://dev.mysql.com/doc/refman/5.0/en/triggers.html). Keep in mind though, that this will seriously impact the performance of the import statement.
Also: if they are duplicate values you should have a critical look at your database model or your setup overall.

Related

how to read multiple csv files with different schema in pyspark?

I have different csv files kept in sub folders in a given folder and some of them have one format and some of them have another format in the column names.
april_df = spark.read.option("header", True).option("inferSchema", True).csv('/mnt/range/2018_04_28_00_11_11/')
Above command only refers to one format and ignores other format. Is there any quick way in the parameter like mergeschema for parquet?
format of some files is like:
id ,f_facing ,l_facing ,r_facing ,remark
other is
id, f_f, l_f ,r_f ,remark
but there could be chances in the future that some columns are missing etc so need a robust way to handle this.
It is not. Either the column should be filled with null in the pipeline or you will have to specify the schema before you import the file. But if you have an understanding of what columns might be missing in the future, you could possibly create a scenario where based on length of the df.columns, you specify the schema, although it seems tedious.

Need help creating schema for loading CSV into BigQuery

I am trying to load some CSV files into BigQuery from Google Cloud Storage and wrestling with schema generation. There is an auto-generate option but it is poorly documented. The problem is that if I choose to let BigQuery generate the schema, it does a decent job of guessing data types, but only sometimes does it recognizes the first row of the data as a header row, and sometimes it does not (treats the 1st row as data and generates column names like string_field_N). The first rows of my data are always header rows. Some of the tables have many columns (over 30), and I do not want to mess around with schema syntax because BigQuery always bombs with an uninformative error message when something (I have no idea what) is wrong with the schema.
So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?
I would recommend doing 2 things here:
Preprocess your file and store the final layout of the file sans the first row i.e. the header row
BQ load accepts an additional parameter in form of a JSON schema file, use this to explicitly define the table schema and pass this file as a parameter. This allows you the flexibility to alter schema at any point in time, if required
Allowing BQ to autodetect schema is not advised.
Schema auto detection in BigQuery should be able to detect the first row of your CSV file as column names in most cases. One of the cases for which column name detection fails is when you have similar data types all over your CSV file. For instance, BigQuery schema auto detect would not be able to detect header names for the following file since every field is a String.
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
The "Header rows to skip" option in the UI would not help fixing this shortcoming of schema auto detection in BigQuery.
If you are following the GCP documentation for Loading CSV Data from Google Cloud Storage you have the option to skip n number of rows:
(Optional) An integer indicating the number of header rows in the source data.
The option is called "Header rows to skip" in the Web UI, but it's also available as a CLI flag (--skip_leading_rows) and as BigQuery API property (skipLeadingRows)
Yes you can modify the existing schema (aka DDL) using bq show..
bq show --schema --format=prettyjson project_id:dataset.table > myschema.json
Note that this will result in you creating a new BQ table all together.
I have way to schema for loading csv into bigquery. You just enough edit value column, for example :
weight|total|summary
2|4|just string
2.3|89.5|just string
if use schema generator by bigquery, field weight and total will define as INT64, but when insert second rows so error or failed. So, you just enough edit first rows like this
weight|total|summary
'2'|'4'|just string
2.3|89.5|just string
You must set field weight & total as STRING, and if you want to aggregate you just use convert type data in bigquery.
cheers
If 'column name' type and 'datatype' are the same for all over the csv file, then BigQuery misunderstood that 'column name' as data. And add a self generated name for the column. I couldn't find any technical way to solve this. So I took another approach. 
If the data is not sensitive, then add another column with the 'column name' in string type. And all of the values in the column in number type. Ex. Column name 'Test' and all values are 0. Upload the file to the BigQuery and use this query to drop the column name.
ALTER TABLE <table name> DROP COLUMN <Test>
Change and according to your Table.

Import csv in Rapidminer is not loading data properly

Importing csv in Rapidminer is not loading data properly in the attributes/ columns and returns errors.
I have set the parameter values correctly in the 'Data Import Wizard'.
Column Separation is set to comma and when I check the "Use Quotes" parameter I see that there are too many "?" appear in the columns even though there is data in the actual csv file.
And when I do not check the “Use Quotes” option then I notice that the content of the columns are distributed across different columns, i.e., data does not appear in the correct column. It also gives error for the date column.
How to resolve this? Any suggestions please? I saw a lot of Rapidminer videos and read about it but did not help.
I am trying to import twitter conversations data which I exported from a 3rd party SaaS tool which extracts Twitter data for us.
Could someone help me soon please? Thanks, Geeta
It's virtually impossible to debug this without seeing the data.
The use quotes option requires that each field is surrounded by double quotes. Do not use this if your data does not contain these because the input process will import everything into the first field.
When you use comma as the delimiter, the observed behaviour is likely to be because there are additional commas contained in the data. This seems likely if the data is based on Twitter. This confuses the import because it is just looking for commas.
Generally, if you can get the input data changed, try to get it produced using a delimiter that cannot appear in the raw text data. Good examples would be | or tab. If you can get quotes around the fields, this will help because it allows delimiter characters to appear in the field.
Dates formats can be handled using the data format parameter but my advice is to import the date field as a polynominal and then convert it later to date using the Nominal to Date operator. This gives more control especially when the input data is not clean.

CSV LOAD and updating existing nodes / creating new ones

I might be on the wrong track so I could use some helpful input. I receive data from other systems by CSV files which I can import into my DB with CSV LOAD. So far so good.
I stucked when I need to reload the CSV again to follow up updates. I cannot delet the former data as I might have additional user input already attached so I would need a query that imports the CSV data, makes a match and when it finds the node it will just use SET to override the existing properties. Saying that I am unsure how to catch the cases where there is no node in the DB (new record) and we need to create a node.
LOAD CSV FROM "file:xxx.csv" AS csvLine
MATCH (c:Customer {code:"ABC"})
SET c.name = name: csvLine[0]
***OPTIONAL MATCH // Here I am unsure how to express when the node is not found***
MERGE (c:Customer { name: csvLine[0], code: csvLine[1]})
So ideally Cypher would check if the node is there and make an UPDATE by SET the new property coming with the CSV or - if the node cannot be found - creates a new one with the CSV data.
And - as a sidenote: How would I find nodes that are not in the CSV file but in the DB in order to mark them as obsolete? (This might not be able in the import but maybe someone has an idea how to solve this in order to keep the DB clean of deleted records - which can only be detected by a comparison with the latest CSV import - happy for every idea).
Any idea or hint how to write the query for updaten the graph while importing?
You need to use MERGEs ON MATCH and/or ON CREATE handlers, see http://neo4j.com/docs/stable/query-merge.html#_use_on_create_and_on_match. I assume the customer code in the second column is the identifier - so the name in column one might change on updates:
LOAD CSV FROM "file:xxx.csv" AS csvLine
MERGE (c:Customer {code:csvLine[1]})
ON CREATE SET c.name = csvLine[0]
ON MATCH SET c.name = csvLine[0]

Unable to import 3.4GB csv into redshift because values contains free-text with commas

And so we found a 3.6GB csv that we have uploaded onto S3 and now want to import into Redshift, then do the querying and analysis from iPython.
Problem 1:
This comma delimited file contains values free text that also contains commas and this is interfering with the delimiting so can’t upload to Redshift.
When we tried opening the sample dataset in Excel, Excel surprisingly puts them into columns correctly.
Problem 2:
A column that is supposed to contain integers have some records containing alphabets to indicate some other scenario.
So, the only way to get the import through is to declare this column as varchar. But then we can do calculations later on.
Problem 3:
The datetime data type requires the date time value to be in the format YYYY-MM-DD HH:MM:SS, but the csv doesn’t contain the SS and the database is rejecting the import.
We can’t manipulate the data on a local machine because it is too big, and we can’t upload onto the cloud for computing because it is not in the correct format.
The last resort would be to scale the instance running iPython all the way up so that we can read the big csv directly from S3, but this approach doesn’t make sense as a long-term solution.
Your suggestions?
Train: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train.csv (3.4GB)
Train Sample: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train-sample.csv (133MB)
Try having different delimiter or use escape characters.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_preparing_data.html
For second issue, if you want to extract only numbers from the column after loading into char use regexp_replace or other functions.
For third issue, you can as well load it into VARCHAR field and then use substring cast(left(column_name, 10)||' '||right(column_name, 6)||':00' as timestamp)
to load it into final table from staging table
For the first issue, you need to find out a way to differentiate between the two types of commas - the delimiter and the text commas. Once you have done that, replace the delimiters with a different delimiter and use the same as delimiter in the copy command for Redshift.
For the second issue, you need to first figure out if this column needs to be present for numerical aggregations once loaded. If yes, you need to get this data cleaned up before loading. If no, you can directly load this as char/ varchar field. All your queries will still work but you will not be able to do any aggregations (sum/ avg and the likes) on this field.
For problem 3, you can use Text(date, "yyyy-mm-dd hh:mm:ss") function in excel to do a mass replace for this field.
Let me know if this works out.