DT_TEXT concatenating rows on Flat File Import - ssis

I have a project that imports a TSV file with a field set as text stream (DT_TEXT).
When I have invalid rows that get redirected, the DT_TEXT fields from my invalid rows get appended to the first valid row that follows them.
Here's my test data:
Tab-delimited input file: ("tsv IN")
CatID Descrip
y "desc1"
z "desc2"
3 "desc3"
CatID is set as an integer (DT_I8)
Descrip is set as a text stream (DT_TEXT)
Here's my basic Data Flow Task:
(I apologize, I can't post images until my rep is above 10 :-/ )
So my two invalid rows get redirected, and my third row goes to the success output.
But here is my "Success" output:
"CatID","Descrip"
"3","desc1desc2desc3"
Is this a bug when using DT_TEXT fields? I am fairly new to SSIS, so maybe I misunderstand the use of text streams. I chose to use DT_TEXT as I was having truncation issues with DT_STR.
If it's helpful, my tsv Fail output is below:
Flat File Source Error Output Column,ErrorCode,ErrorColumn
x "desc1"
,-1071607676,10
y "desc2"
,-1071607676,10
Thanks in advance.

You should really try to avoid using the DT_TEXT, DT_NTEXT or DT_IMAGE data types for SSIS fields, as they can severely impact data flow performance. The problem is that these types come through not as a CLOB (Character Large OBject) but as a BLOB (Binary Large OBject).
For reference see:
CLOB: http://en.wikipedia.org/wiki/Character_large_object
BLOB: http://en.wikipedia.org/wiki/BLOB
Difference: Help me understand the difference between CLOBs and BLOBs in Oracle
With DT_TEXT you cannot simply pull out the characters as you would from a character array. The type is represented as an array of bytes and can store any kind of data, which in your case is not needed and is what causes your fields to be concatenated. (I recreated the problem in my environment.)
My suggestion would be to stick with DT_STR for your description column and give it a large OutputColumnWidth. Make it large enough that no truncation occurs when reading from your source file, and test it out.

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON that is returned, pass that back to a function to call the API again and get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least they should. I have tried reading each JSON file into a dataframe after it has been downloaded, and then unioning that dataframe onto a master dataframe, so that essentially I end up with one big dataframe containing all 88 JSON files.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it gets inferred as a boolean. The unionByName then fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, but reading a 758 MB JSON file would be a huge undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
Something like:
from pyspark.sql.types import *

# Define the schema you expect across all files
my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

# Start from an empty dataframe that already has the desired schema
output = spark.createDataFrame([], my_schema)

# Read each JSON file and union it on by column name; note that
# unionByName returns a new dataframe, so capture the result
df_json = spark.read.json("...your JSON file...")
output = output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is exactly what you are looking for, but I hope it helps.
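Since the question also asks about explicitly setting the schema at read time: that approach skips inference entirely, so the string-to-boolean flip on file 66 cannot happen. A minimal sketch, reusing the hypothetical my_schema defined above and assuming all 88 downloaded files sit in one folder:
# With an explicit schema there is no inference step, so every file is
# forced into the same column types and no unionByName is needed.
df_all = (spark.read
               .schema(my_schema)
               .json("/path/to/downloaded/files/*.json"))   # hypothetical location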

Is there a way to get columns names of dataframe in pyspark without reading the whole dataset?

I have huge datasets in my HDFS environment, say 500+ datasets, and all of them are around 100M+ rows. I want to get only the column names of each dataset without reading the whole dataset, because that would take too long. My data are JSON formatted and I'm reading them using the classic Spark JSON reader: spark.read.json('path'). So what's the best way to get the column names without wasting time and memory?
Thanks...
From the official doc:
If the schema parameter is not specified, this function goes through the input once to determine the input schema.
Therefore, you cannot get the column names from spark.read.json by reading only the first line.
Still, you can do an extra step first that extracts just one line, and get the column names from it.
One answer could be the following:
Read the data with the spark.read.text('path') method
Limit the number of rows to 1 with limit(1), since we only need a single record to inspect
Convert the dataframe to an RDD and collect it as a list with collect()
Convert the first collected row from a unicode string to a Python dict (since the data are JSON formatted)
The keys of that dict are exactly what we are looking for (the column names, as a Python list)
This code worked for me:
from ast import literal_eval

# Read the file as plain text, take a single line, collect it to the driver,
# parse it as a dict and keep only the keys (i.e. the column names)
literal_eval(spark.read.text('path').limit(1)
             .rdd.flatMap(lambda x: x)
             .collect()[0]).keys()
The reason it works faster might be that PySpark won't load the whole dataset and infer all the field structures if you read it in text format (each line is read as one big string), which is lighter and more efficient for this specific case.
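A small caveat: literal_eval only understands Python literal syntax, so a line containing JSON values such as true, false or null will make it raise. A variant using json.loads, under the same assumption that each line of the file is one complete JSON object, would be:
import json

# Same idea as above, but parse the line with json.loads so JSON
# literals such as true, false and null are handled correctly
first_line = (spark.read.text('path')
                   .limit(1)
                   .rdd.map(lambda row: row.value)
                   .collect()[0])
column_names = list(json.loads(first_line).keys())
print(column_names)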

Python Loop through variable in URLs

What I want to do here is change a user ID within a URL, for every URL, and then get the output from each URL.
What I did so far:
import urllib
import requests
import json
url="https://api.abc.com/users/12345?api_key=5632lkjgdlg&_format=_show"
data=requests.get(url).json()
print (data['user'])
(I index with 'user' inside the print because it gives all the information about the focal user in JSON format)
My question is that I want to change the user ID (which is 12345 in this URL example) to another number (any random number) and then get the output for every URL I build. For example, change it to 5211 and get the result, then change it to 959444 and get the result, and so on. I think I need a loop that iterates by changing only the number within the URL, but I kept failing because I had difficulty splitting the original URL and changing only the user ID inside it. Could anyone help me out?
Thank you so much in advance.
===================== Follow-up question =====================
Thank you for your previous answer! I built my code further based on it, but ran into another issue. I can iterate through and fetch each user's information in JSON format. The output gave me single quotes (rather than double quotes) and a weird u' notation in front of every key, but I was able to solve that. Anyway, I cleaned it up into perfectly neat JSON.
My plan is to convert each JSON record into CSV, but stack everything I scrape into one CSV file. For example, the JSON for user1 will become row 1, with the JSON keys as column names and the corresponding values as the values for those columns. The second JSON I scrape goes into the same CSV file as row 2, and so on.
import pandas as pd
from pandas.io.json import json_normalize

# Flatten the nested JSON for one user into a single-row dataframe
df = pd.DataFrame.from_dict(json_normalize(data['user']))
print(df)

# Write that one row out to CSV
df.to_csv('C:/Users/todd/Downloads/eg.csv')
So, I found that json_normalize flattens the nested brackets, which makes it useful in a real-world example. I also tried using a pandas dataframe to turn it into a table. Here I have two questions:
1. How do I stack each JSON record that I scrape, one per row, into one CSV file? (If there's a way to do this without a pandas dataframe, that would also be appreciated.)
2. As I understand it, a pandas dataframe won't give you an output unless every row has the same number of columns. In my case, each JSON record I've scraped has either 10 or 20 columns, depending on whether it has nested brackets or not. How do I stack all the rows and write them to one CSV file?
Comments or questions will be greatly appreciated.
You can split the URL into two parts initially and join them together each time you generate a random number:
import random

url1 = "https://api.abc.com/users/"
url2 = "?api_key=5632lkjgdlg&_format=_show"

for i in range(4):
    num = random.randint(1000, 10000)  # you can change the range of the random number here
    url = url1 + str(num) + url2
    print(url)
OUTPUT
https://api.abc.com/users/2079?api_key=5632lkjgdlg&_format=_show
https://api.abc.com/users/2472?api_key=5632lkjgdlg&_format=_show
and so on...
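To actually fetch each user's record rather than just print the URL, the same loop can call the API directly; this is only a sketch, using the hypothetical endpoint from the question:
import random
import requests

url1 = "https://api.abc.com/users/"
url2 = "?api_key=5632lkjgdlg&_format=_show"

results = []
for i in range(4):
    num = random.randint(1000, 10000)
    url = url1 + str(num) + url2
    data = requests.get(url).json()   # same request/parse call as in the question
    results.append(data['user'])      # keep each user's record for later processing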
But if you wanted to split the URL at that exact place without knowing how it looks beforehand, you can use a regex, since you know for sure that a ? follows the number.
import re

url = "https://api.abc.com/users/12345?api_key=5632lkjgdlg&_format=_show"
matches = re.split(r'\d+(?=\?)', url)   # split on the digits that are followed by a '?'
print(matches)
# ['https://api.abc.com/users/', '?api_key=5632lkjgdlg&_format=_show']
Now just set
url1=matches[0]
url2=matches[1]
And use the for loop.
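For the follow-up about stacking every scraped record into one CSV, here is only a sketch of one common approach, assuming you have collected each data['user'] dict into a Python list (called results here, a made-up name): pandas.concat aligns rows with different columns and fills the missing ones with NaN, so the 10-column and 20-column records can share one file.
import pandas as pd
from pandas.io.json import json_normalize

# 'results' is assumed to be the list of data['user'] dicts collected in the loop
frames = [json_normalize(record) for record in results]       # one single-row frame per user
combined = pd.concat(frames, ignore_index=True, sort=False)   # missing columns become NaN
combined.to_csv('C:/Users/todd/Downloads/all_users.csv', index=False)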

Spark - load numbers from a CSV file with non-US number format

I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("delimiter", ";")
    .(other options...)
    .load(...)
    .write()
    .parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but as a semicolon-delimited file, and the numbers are formatted in German notation, i.e. a comma is used as the decimal separator.
For example, what in US would be 123.01 in this file would be stored as 123,01
Is there a way to force reading the numbers in a different locale, or some other workaround that would allow me to convert this file without first converting the CSV to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0): the parser enforces US formatting rather than, say, relying on the locale set for the JVM, or allowing this to be configured somehow.
I thought of using a UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (I couldn't really find a good example of using UDT...)
Any suggestions on a way of achieving this directly, i.e. at the parsing step, or will I be forced to do an intermediate conversion and only then convert to Parquet?
For anybody else who might be looking for an answer, the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
    .format("com.databricks.spark.csv")
    .schema(stringOnlySchema)
    .option("delimiter", ";")
    .(other options...)
    .load(...)
    .javaRDD()
    .map(this::conversionFunction);

sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with the fields converted to numerical values as appropriate (or, in fact, it could perform any conversion). Rows in Java can be created with RowFactory.create(newFields).
I'd be happy to hear any other suggestions how to approach this but for now this works. :)
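For readers on PySpark rather than the Java API, a rough sketch of the same workaround (read everything as strings, replace the decimal comma with a dot, then cast); the column name "amount" and the paths are made up for illustration:
from pyspark.sql import functions as F

# Read the semicolon-delimited file with all columns as strings first
raw = (spark.read
            .option("delimiter", ";")
            .option("header", "true")
            .csv("/path/to/input.csv"))

# Replace the German decimal comma with a dot, then cast to a numeric type
converted = raw.withColumn(
    "amount",
    F.regexp_replace(F.col("amount"), ",", ".").cast("double")
)

converted.write.parquet("/path/to/output.parquet")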

genfromtxt dtype=None returns wrong shape

I'm a newcomer to numpy, and am having a hard time reading CSVs into a numpy array with genfromtxt.
I found a CSV file on the web that I'm using as an example. It's a mixture of floats and strings. It's here: http://pastebin.com/fMdRjRMv
I'm using numpy via pylab (initialized on an Ubuntu system via ipython -pylab). numpy.version.version is 1.3.0.
Here's what I do:
Example #1:
data = genfromtxt("fMdRjRMv.txt", delimiter=',', dtype=None)
data.shape
(374, 15)
data[10,10] ## Take a look at an example element
'30'
type(data[10,10])
<type 'numpy.string_'>
There are no errant quotation marks in the CSV file, so I've no idea why it should think that the number is a string. Does anyone know why this is the case?
Example #2 (skipping the first row):
data = genfromtxt("fMdRjRMv.txt", delimiter=',', dtype=None, skiprows=1)
data.shape
(373,)
Does anyone know why it would not read all of this into a 1-dimensional array?
Thanks so much!
In your example #1, the problem is that all the values in a single column must share the same datatype. Since the first line of your data file has the column names, this means that the datatype of every column is string.
You have the right idea in example #2 of skipping the first row. Note however that 1.3.0 is a rather old version (I have 1.6.1). In newer versions skiprows is deprecated and you should use skip_header instead.
The reason that the shape of the array is (373,) is that it is a structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html), which is what numpy uses to represent inhomogeneous data. So data[10] gives you an entire row of your table. You can also access the data columns by name, for example data['f10']. You can find the names of the columns in data.dtype.names. It is also possible to use the original column names that are defined in the first line of your data file:
data = genfromtxt("fMdRjRMv.txt", dtype=None, delimiter=',', names=True)
then you can access a column like data['Age'].
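To pull the pieces together, a short sketch of how the structured-array access patterns look (the 'Age' column name and the file name are the ones used above):
import numpy as np

# names=True takes the field names from the first line of the file and
# returns a 1-D structured array instead of a 2-D array of strings
data = np.genfromtxt("fMdRjRMv.txt", delimiter=',', dtype=None, names=True)

print(data.shape)        # (373,) -- one element per data row
print(data.dtype.names)  # the column names read from the header line
print(data['Age'][:5])   # the first five values of one named column
print(data[0])           # an entire row of the table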