Regression forecasting and predicting - Practical Machine Learning Tutorial with Python p.5

In this video, make sure you define the X's like so. I flipped the last two lines by mistake:

X = np.array(df.drop(['label'], axis=1))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]

To forecast out, we need some data. We decided that we're forecasting out 10% of the data, so we will want to, or at least *can*, generate forecasts for each of the final 10% of the dataset. So when can we do this? When would we identify that data? We could do it now, but consider that the data we're trying to forecast is not scaled like the training data was. Okay, so then what? Do we just run preprocessing.scale() against the last 10%? The scale method scales based on all of the known data that is fed into it. Ideally, you would scale the training, testing, AND forecast/predicting data all together. Is this always possible or reasonable? No. If you can do it, however, you should. In our case, right now, we can. Our data is small enough and the processing time is low enough, so we'll preprocess and scale the data all at once.

In many cases, you won't be able to do this. Imagine you were using gigabytes of data to train a classifier. It may take days to train, and you wouldn't want to repeat that every...single...time you wanted to make a prediction. Thus, you may need to either NOT scale anything, or scale the data separately. As usual, you will want to test both options and see which is best in your specific case.

With that in mind, let's handle all of the rows from the definition of X onward.

https://pythonprogramming.net/forecasting-predicting-machine-learning-tutorial/
https://twitter.com/sentdex
https://www.facebook.com/pythonprogramming.net/
https://plus.google.com/+sentdex
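For the "scale the data separately" case mentioned in the description, one common option is to learn the scaling parameters from the known rows and reuse them on the rows you want to forecast. This is not what the video does (the video scales everything at once). It is only a minimal sketch on made-up data, and names such as X_all, X_known, mean and std are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy data: 10 rows, 2 features. The last 3 rows play the role of X_lately.
rng = np.random.default_rng(0)
X_all = rng.normal(loc=50.0, scale=10.0, size=(10, 2))
forecast_out = 3

X_known = X_all[:-forecast_out]   # rows available at training time
X_lately = X_all[-forecast_out:]  # rows we want to forecast from

# Learn the scaling parameters from the known rows only.
mean = X_known.mean(axis=0)
std = X_known.std(axis=0)  # population std (ddof=0)

# Apply the same parameters to both sets so they live on the same scale.
X_known_scaled = (X_known - mean) / std
X_lately_scaled = (X_lately - mean) / std

print(X_known_scaled.mean(axis=0), X_known_scaled.std(axis=0))
```

In scikit-learn, the same idea is usually expressed with sklearn.preprocessing.StandardScaler: call fit on the training rows and transform on any later rows. Whether this beats scaling everything together is something to test on your own data.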

Comments

Hi, I made a minor change to the code to compare the forecast data against the real data:

I created this guy:

X_real = np.array(df['Adj. Close'][-forecast_out:])

then changed the for to:

for i in range(len(forecast_set)):
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += 86400
    df.loc[next_date] = [X_real[i]] + [np.nan for _ in range(1, len(df.columns)-1)] + [forecast_set[i]]

Just to plot both forecast and real. The shape is pretty good, almost identical, but there is a bias going on: the forecast is always a little bit above the real data, by 6. What might be the problem?
Solution to time stamping problem:
https://github.com/Abhilash04/Python_CSE/blob/master/Stock_Price_Prediction.py
I am getting an error while running the first part of the code:
ValueError: Found input variables with inconsistent numbers of samples: [3118, 3150]
Can someone help me out?
I do not understand what linear regression has to do with "machine learning". To me, regression is only a statistical approach. How is it connected to machine learning?
Sorry if this is a stupid question, but can you please explain again why you did X = X[:-forecast_out]?
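A small toy example may help with the slicing question, and with the mismatched-length errors mentioned in this thread. It assumes the label column is built with a negative shift, as in the tutorial. The frame and its values below are made up for illustration. The last forecast_out rows have no label, so they are set aside as X_lately and removed from X, and the same rows are dropped before y is built.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the tutorial's df.
df = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
forecast_out = 2

# Label each row with the price forecast_out rows ahead; the last rows get NaN.
df["label"] = df["Adj. Close"].shift(-forecast_out)

X = np.array(df.drop(["label"], axis=1))
X_lately = X[-forecast_out:]  # rows with no known label: the ones to forecast
X = X[:-forecast_out]         # rows that do have a label: usable for training

df.dropna(inplace=True)       # drop the unlabeled rows before building y
y = np.array(df["label"])

print(len(X), len(y), len(X_lately))
```

If X is trimmed but y is built before the dropna call (or the other way around), the two arrays differ in length by forecast_out rows, which would be consistent with the "inconsistent numbers of samples" error.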
last_unix = last_date.timestamp()
AttributeError: 'Timestamp' object has no attribute 'timestamp'

It is not working. Can you please help?
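A likely cause of this error is the Python version: datetime.timestamp() was only added in Python 3.3, so on Python 2.7 the call fails. One possible workaround that keeps the tutorial's Unix-seconds loop is time.mktime with timetuple(). The sketch below uses a plain datetime as a hypothetical stand-in for the last index date. A pandas Timestamp should behave the same way since it subclasses datetime, but treat that as an assumption to check in your environment.

```python
import datetime
import time

# Hypothetical stand-in for the last date in the DataFrame index.
last_date = datetime.datetime(2017, 1, 30)

# Works on Python 2 and 3: build the Unix time from the local-time tuple.
last_unix = time.mktime(last_date.timetuple())

one_day = 86400  # seconds in a day
next_unix = last_unix + one_day
next_date = datetime.datetime.fromtimestamp(next_unix)

print(next_date)
```

Both mktime and fromtimestamp work in local time, so the round trip lands back on the same calendar date, apart from days with a daylight-saving change.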
My plotted forecast seems to show that minus 32 days has been used for the 'real data' instead of plus 32 days. I'd love to continue with the tutorial, but I'm hung up on this!
Why are there two identical statements?
y = np.array(df['label'])
In print(forecast_set, accuracy, forecast_out), the accuracy is misleading. It is not the accuracy of the forecast; it is the value from accuracy = clf.score(X_test, y_test).
Hi, I don't know why, but after adding the X and X_lately lines, X and y no longer have equal length. What could be the problem? Can anyone please help me out? When I remove those lines and run the code again, the lengths are equal!
I got this error:
Traceback (most recent call last):
  File "C:/Python27/reg2.py", line 41, in <module>
    last_unix = last_date.timestamp()
AttributeError: 'Timestamp' object has no attribute 'timestamp'
+sentdex: I don't understand why, on running this code:
clf=svm.SVR()

clf.fit(X_train,y_train)

accuracy=clf.score(X_test,y_test)
print(accuracy) # -0.0113307828442

the accuracy is negative!!

X = np.array(df_1.drop(['label'],1)) #to select all cols except label
y = np.array(df_1['label'])

# I commented this part out since it changes the number of rows in X, so the samples in X no longer match the labels in y
#X = preprocessing.scale(X)#don't use high frequency training
#X=X[:-forecast_out+1]
#y = np.array(df_1['label'])
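On the negative "accuracy": for scikit-learn regressors, score returns the coefficient of determination (R squared), not a percentage. It is 1.0 for a perfect fit, 0.0 for a model that always predicts the mean of the test labels, and negative for a model that does worse than that. The sketch below computes it by hand on made-up numbers. The r_squared helper is hypothetical and only there to show the arithmetic.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])    # predictions match exactly
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # predictions run the wrong way

print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```

A slightly negative score like the one above suggests the model is doing about as well as predicting the mean. SVR with default settings tends to be sensitive to feature scale, so leaving preprocessing.scale commented out may be part of the cause here, though that is a guess to verify.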
Seems a lot of people are experiencing an issue where the plot does not display any future dates. I may or may not have fixed this issue. The part where we shift forecast_col by -forecast_out is missing the 'periods' argument for the shift function.

Try this: df['label'] = df[forecast_col].shift(periods=forecast_out)

And let me know if I'm on to something. When I print the tail of the df, I am seeing 32 days into the future (i.e. running the code today produces forecast values up until March 3rd of 2017).

Edit: Never mind, it appears that all this is doing is taking the previous 32 Adj. Close values and pasting them into the future 32 days.
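The sign of the shift is what decides whether a label comes from the future or the past. A tiny made-up frame shows the difference. The column names label_future and label_past are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy prices.
df = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0]})
forecast_out = 2

# Negative shift: each row's label is the price forecast_out rows LATER.
df["label_future"] = df["Adj. Close"].shift(-forecast_out)

# Positive shift: each row's label is the price forecast_out rows EARLIER.
df["label_past"] = df["Adj. Close"].shift(forecast_out)

print(df)
```

With the negative shift, the last forecast_out rows have no label, and those are the rows the tutorial forecasts from. With the positive shift, the labels are just old prices, which matches what the edit above describes.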
Hey Harrison! I felt this is easier (Python 2.7), and it avoids the problems people are having with timestamps:

last_date = df.index.max()
next_unix = last_date + datetime.timedelta(days=1)

for i in forecast_set:
    next_date = next_unix
    # .loc is used here because the .ix indexer was removed in later pandas versions
    df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)] + [i]
    next_unix += datetime.timedelta(days=1)
Thanks for the videos, they help a lot. I'm wondering if there is a mistake in this video: when you plot the Forecast values, they are plotted as if they were a continuation of the "Adj Close" values, while in fact they should be plotted "forecast_out" days ahead. I'm saying this because "Adj Close" represents the values today, while "Label" represents the future values. Am I wrong?
Hello, thanks for the awesome tutorials. I'm confused with df.loc method. The documentation says that ".loc will raise a KeyError when the items are not found.", but we don't get any KeyError, although the dataframe would not have the new dates we are indexing, right? So, what am I missing? Thanks!
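On the .loc question: the KeyError applies when you read a label that does not exist. Assigning to a missing label is different; pandas appends a new row (the docs call this "setting with enlargement"). A minimal sketch on a made-up frame, where the column names mirror the tutorial and the values are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with a date index.
df = pd.DataFrame(
    {"Adj. Close": [10.0, 11.0], "Forecast": [np.nan, np.nan]},
    index=pd.to_datetime(["2017-01-01", "2017-01-02"]),
)
new_date = pd.Timestamp("2017-01-03")

# Reading a missing label raises KeyError...
try:
    df.loc[new_date]
    read_failed = False
except KeyError:
    read_failed = True

# ...but assigning to a missing label appends a new row.
df.loc[new_date] = [np.nan, 12.5]

print(read_failed, len(df))
```

That is why the forecast loop can write to dates that are not yet in the index without raising an error.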
I've found you can access dataframe dates pretty easily by using the df.ix[] syntax and using timedelta to generate the new dates.
By the way, the day you published the video, the stock price was 771, then it grew to 787 and then fell to 705 in the next couple of weeks. So anyone using this model and buying a put option would have made a good profit.
Hi, many thanks for the videos. I've done the exercise and somehow Quandl gave me prices from 2004. Any hints about this? Best!

Additional Information:

Visibility: 41903

Duration: 14m 28s

Rating: 267