Why does random forest regression return a very bad result? - regression

I'm trying to use RandomForestRegressor() in scikit-learn to model some data. After processing my raw data, the data I fed to RandomForestRegressor() is as follows.
The following is only a small part of my data; in fact, there are around 6000 rows.
Note that the first column is the DatetimeIndex of my DataFrame 'final_data', which contains all the data. In addition, the values in column 4 were strings; I converted them to numbers with a map function.
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
S_dataset1 = final_data[(final_data.index >= pd.to_datetime('20160403')) &
                        (final_data.index <= pd.to_datetime('20161002'))]
S_dataset2 = final_data[(final_data.index >= pd.to_datetime('20170403')) &
                        (final_data.index <= pd.to_datetime('20170901'))]
W_dataset = final_data[(final_data.index >= pd.to_datetime('20161002')) &
                       (final_data.index <= pd.to_datetime('20170403'))]
S_dataset = pd.concat([S_dataset1, S_dataset2])

A = W_dataset.iloc[:, :8]
B = W_dataset.loc[:, 'col20']
W_data = pd.concat([A, B], axis=1)
X = W_data.iloc[:, :].values
y = W_dataset['col9'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

forest = RandomForestRegressor(n_estimators=1000, criterion='mse',
                               random_state=1, n_jobs=-1)
forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred),
                                       r2_score(y_test, y_test_pred)))
Here is my code for predicting col9. I separated final_data into two seasons, which may make the prediction more accurate. However, the result is very bad: the R^2 score on the training set is around 0.9, but on the test set it is only around 0.25. I really don't know why I get such a bad result. Could someone tell me where I went wrong and how I can improve my model? Many thanks!!!

I think the problem is that I didn't consider the effect of the datetime on the prediction. After converting the DatetimeIndex to numerical values and feeding them into my model, I got quite a good result: the R^2 score is around 0.95-0.98.
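For reference, a minimal sketch of that fix (the 'timestamp' column name and the integer conversion below are illustrative, not from the original post):
W_data = W_data.copy()
W_data['timestamp'] = W_data.index.astype('int64')  # DatetimeIndex -> nanoseconds since epoch
X = W_data.values
y = W_dataset['col9'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)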

Related

Trashword Identification with Sklearn for NLTK

For an NLTK project I want to identify, in bulk, whether a new word is likely a "trash" word or a meaningful word. It fits the architecture to do this in an early phase, so I can proceed with "true" words later - hence this negative approach of identifying the non-meaningful words.
For this I have a training set of words labeled word/trash.
I work with single characters for the labeling and get the error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
The code I use is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
talk_data = pd.read_excel('nn_data_stack.xlsx', sheet_name='label2')
talk_data['word_char_list'] = [[*x] for x in talk_data['word'].astype(str)]
talk_data['word_char'] = [','.join(map(str, l)) for l in talk_data['word_char_list']]
talk_data['word_char'].replace(',',' ', regex=True, inplace=True)
z = talk_data['word_char']
y = talk_data['class']
z_train, z_test,y_train, y_test = train_test_split(z,y,test_size = 0.2)
cv = CountVectorizer()
features = cv.fit_transform(z_train)
Example of the data set I have for training:

word          class
drawing       word
to be         word
龚莹author    trash
ï½°c          trash
Do I need to use an alternative to CountVectorizer?
I think I need to move to character embeddings to get a proper input - but how?
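Not part of the original post, but for context: CountVectorizer's default token_pattern only keeps tokens of two or more word characters, so single letters are discarded, which is what produces the empty-vocabulary error above. A hedged sketch of one workaround, keeping the space-joined character strings built earlier:
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep single-character tokens
features = cv.fit_transform(z_train)                # alternatively: CountVectorizer(analyzer='char')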

Dropping duplicates in a pyarrow table?

Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source, so it is possible to have duplicates of a single ID, one for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times).
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
dataset = ds.dataset(history, filesystem, partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy dataset so I can make use of the partition_filename_cb - that way I can have one file per date_id. Our visualization tool connects to these files directly
# container/dataset/date_id=20210127/20210127.parquet
pq.write_to_dataset(final, filesystem, partition_cols=["date_id"], use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionality, though this one isn't there yet. My approach now would be:
def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full((len(table)), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
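A small usage sketch for the helper above (the table name and the 'ID' column are assumptions based on the rest of the post); note that it keeps the first occurrence of each value, since pc.index returns the first matching index:
deduped = drop_duplicates(table, column_name="ID")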
//end edit
I saw your question because I had a similar one, and I solved it for my work (due to IP issues I can't post the whole code, but I'll try to answer as well as I can; I've never done this before).
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np

array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
    ...  # do stuff with each unique value and its count
I used 'value_counts' to find the unique values and how many of each there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more than 1 I selected either the first or last one by using numpy boolean arrays as a mask again.
After iterating, it was as simple as pa.concat_tables(tables).
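Pieced together, the loop described above might look roughly like this (a sketch based on the description, not the original code; it keeps the last occurrence of each value):
np_array = np.array(array)          # array = table.column(column_name), as above
tables = []
for key, count in dicts.items():
    mask = np_array == key          # boolean mask for every row with this value
    if count > 1:
        # keep only the last matching row (use [0] instead of [-1] to keep the first)
        last_idx = np.where(mask)[0][-1]
        mask = np.zeros(len(np_array), dtype=bool)
        mask[last_idx] = True
    tables.append(table.filter(pa.array(mask)))
deduped = pa.concat_tables(tables)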
warning: this is a slow process. If you need something quick&dirty, try the "Unique" option (also in the same link I provided).
edit/extra: you can make it a bit faster/less memory-intensive by keeping a numpy boolean mask up to date while iterating over the dictionary; then at the end you return table.filter(mask=boolean_mask).
I don't know how to calculate the speed though...
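That single-mask idea might look roughly like this (again a sketch, not the original code): mark one index per value while iterating, then filter once at the end.
np_array = np.array(array)
boolean_mask = np.zeros(len(np_array), dtype=bool)
for key in dicts:
    boolean_mask[np.where(np_array == key)[0][-1]] = True  # keep the last occurrence
result = table.filter(mask=pa.array(boolean_mask))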
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    column_array = table.column(col_name)
    mask_x = np.full((table.shape[0]), False)
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
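For example (with the tb3 table and 'ID' column used in the snippets below), this keeps the first occurrence of each value, since np.unique returns the index of the first match:
deduped = drop_duplicates(tb3, 'ID')  # illustrative call, not from the original answer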
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't do combine_chunks(): there is a bug, arrow seemingly cannot combine chunked arrays if their size is too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
# Build a global row index across all chunks of the table.
a = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
c = np.array([np.arange(x) for x in a])
a = ([0] + a)[:-1]
c = pa.chunked_array(c + np.cumsum(a))
tb3 = tb3.set_column(tb3.shape[1], 'index', c)
# Keep the smallest row index per ID, then filter the table down to those rows.
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found that duckdb gives better performance on the group by. Changing the last 2 lines above into the following gives a 2x speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))

GridSearchCV without Cross Validation CV = 1

I have a special dataset, and this dataset could be trained with a 1% error. I need to do hyperparameter tuning for MLPRegressor without splitting off a train set, i.e. cv = 1. Is this possible with GridSearchCV?
One of the options for the cv parameter is:
An iterable yielding (train, test) splits as arrays of indices.
So, if you have an X input matrix, a y target vector, an mlp model, and a params grid, you can do just one train-test split.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.2)
clf = GridSearchCV(mlp, params, cv=[(train_idx, test_idx)])
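To make that snippet runnable end to end, here is a hedged completion; the regressor settings and parameter grid below are placeholders, not from the question, and X, y are assumed to be defined already:
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(max_iter=500)                    # placeholder estimator
params = {'hidden_layer_sizes': [(50,), (100,)],    # placeholder grid
          'alpha': [1e-4, 1e-3]}
clf = GridSearchCV(mlp, params, cv=[(train_idx, test_idx)])
clf.fit(X, y)
print(clf.best_params_, clf.best_score_)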
But keep in mind that using a single split for a hyper-parameter sweep is bad practice. Do not take many search steps with such a grid search.

Get empty prediction with Facebook Prophet

Following the basic steps to create a Prophet model and forecast:
m = Prophet(daily_seasonality=True)
m.fit(data)
forecast = m.make_future_dataframe(periods=2)
forecast.tail().T
the result is as follows (no yhat value?)
The data passed in to fit the model has two columns (date and value).
Not sure what I have missed here.
I managed to get it to work by creating a new dataframe:
df_p = pd.DataFrame({'ds': d.index, 'y': d.values})
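For completeness, a minimal end-to-end sketch (assuming d is the original Series indexed by date): make_future_dataframe() only produces the future 'ds' dates; the yhat values come from predict().
# from prophet import Prophet  (fbprophet in older versions)
df_p = pd.DataFrame({'ds': d.index, 'y': d.values})
m = Prophet(daily_seasonality=True)
m.fit(df_p)
future = m.make_future_dataframe(periods=2)   # only contains the 'ds' column
forecast = m.predict(future)                  # adds yhat, yhat_lower, yhat_upper
print(forecast[['ds', 'yhat']].tail())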

Optimizing django db queries

I am trying to optimize my db queries (MySQL) in a Django app.
This is the situation:
I need to retrieve some data about sales and stock for some products on a monthly basis. This is the function:
def get_magazzino_month(year, month):
    from magazzino.models import ddt_in_item, omaggi_item, inventario_item
    from corrispettivi.models import corrispettivi_item, corrispettivi
    from fatture.models import fatture_item, fatture, fatture_laboratori_item
    from prodotti.models import prodotti

    qt = 0
    val = 0
    products = prodotti.objects.all()
    invents = inventario_item.objects.all().filter(id_inventario__data__year=year-1)
    fatture_lab = fatture_laboratori_item.objects.all().order_by("-id_fattura__data")
    for product in products:
        inv_instance = filter_for_product(invents, product)
        if inv_instance:
            qt += inv_instance[0].quantita
        lab_instance = fatture_lab.filter(id_prodotti=product).first()
        prezzo_prodotto = (lab_instance.costo_acquisto / lab_instance.quantita
                           - (lab_instance.costo_acquisto / lab_instance.quantita) * lab_instance.sconto / 100
                           ) if lab_instance else product.costo_acquisto
    return val, qt
The problem is where I need to filter the data down to just the product I need. It seems that the .filter option makes Django re-query the database, even though all of the data is already there. I tried writing a function to filter it myself, but although the number of queries goes down, the loading time increases dramatically.
This is the function to filter:
def filter_for_product(array, product):
    result = []
    for instance in array:
        if instance.id_prodotti.id == product.id:
            result.append(instance)
    return result
Has anyone ever dealt with this kind of problem?
You can use prefetch_related() to return a queryset of related objects and Prefetch() to further control the operation.
from django.db.models import Prefetch

products = prodotti.objects.prefetch_related(
    Prefetch(
        'product_set',
        queryset=inventario_item.objects.filter(id_inventario__data__year=year-1),
        to_attr='invent'
    )
)
Then you can access each product's invent like products[0].invent
Using select_related() will help optimize your queries
A good example of what select_related() does and how to use it is available at simpleisbetterthancomplex.
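As a sketch of how that could apply here (not from the original answers, and assuming id_prodotti is a ForeignKey to prodotti), select_related() pulls the related product in the same query, so iterating over the rows no longer triggers one extra query per related object:
fatture_lab = (fatture_laboratori_item.objects
               .select_related('id_prodotti')
               .order_by('-id_fattura__data'))
invents = (inventario_item.objects
           .select_related('id_prodotti')
           .filter(id_inventario__data__year=year - 1))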