Regression Features and Labels – Practical Machine Learning Tutorial with Python p.3


Alright, hello everybody, and welcome to the third machine learning and second regression tutorial video. Where we left off, I was asking whether the adjusted close column would be a feature or a label. The answer is really a feature, and possibly none of the above. It could be a label if we hadn’t already decided that we are using the high minus low percent or the percent change. For example, you could use adjusted close as a label if, say, at the beginning of the day you were trying to predict what the close might be that day. But with the features that we have chosen, you really wouldn’t even know their values: you wouldn’t know the high minus low percent and you wouldn’t know the percent change until the close had already occurred. So if you trained a classifier to predict this value, that would be an incredibly biased classifier. You have to start thinking about these things: is this even possible in the real world? You can find yourself doing things that seem like a great idea at the time but turn out not to be possible at all.
So, in our case, adjusted close will either be a feature or none of the above, in the sense that what we will actually do is take something like the last 10 values of the adjusted close and use that as a feature. That is most representative of what happens when we dig in and write the algorithm ourselves: you would take maybe the last 10 values and try to predict the future value. Anyway, more on that later.
In the last tutorial we did features, and in this tutorial we are gonna define a label. Since I just got done telling you that this is not gonna be a label, what actually would be the label? Well, it would be the price at some point in the future. And the only price column we have anymore is adjusted close.
But what we wanna do is actually get the adjusted close in the future, maybe the next day, maybe the next 5 days, something like that. So we need to bring in some new information to get that value from the future. Let’s go ahead and close this out and begin working on that. First of all, we are not gonna print the head anymore. We are gonna say forecast_col is just gonna be equal to 'Adj. Close'. I’ll explain why in a second, but basically it’s just a variable, and later on you could change it to a different forecast column. You might not be working with stock prices; there are, of course, other things you can do regression on with machine learning. So in the future you’ll be able to use very similar code. You’ll obviously change the code leading up to this point, but then you just change forecast_col to whatever you want. I’ll show you why that’s useful when we get to the code.
Now, just in case there is missing data, we’re gonna say df.fillna. fillna just fills any NA values. NA stands for not available, and in pandas it’s actually gonna be NaN in most cases, which is not a number. We are gonna fill NA with a specific value, -99999, and we’re gonna say inplace=True. With machine learning, you can’t work with NaN data. You have to replace the NaN data with something, or you can get rid of that entire column, but you don’t want to get rid of data. In machine learning in the real world, you will find that you are missing a lot of data. You are lacking maybe one column, but you have got the other columns, and you don’t wanna sacrifice data if you don’t have to. So you can do this, and it will be treated as an outlier in your dataset.
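A minimal sketch of the fill step just described. The small dataframe here is made up for illustration and only stands in for the stock data; the column names are borrowed from the tutorial, and `missing_after` is a hypothetical helper variable, not part of the original code.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tutorial's stock data (values are invented).
df = pd.DataFrame({
    "Adj. Close": [50.0, np.nan, 52.5, 53.0],
    "Adj. Volume": [1000.0, 1200.0, np.nan, 900.0],
})

# The column we intend to forecast later; change this one name to reuse the code.
forecast_col = "Adj. Close"

# Replace every missing value with a sentinel far outside the normal range.
df.fillna(value=-99999, inplace=True)

# Count any missing values that remain (there should be none).
missing_after = int(df.isna().sum().sum())
print(df)
print("missing values left:", missing_after)
```

Whether a sentinel like this is really treated as a harmless outlier depends on the algorithm, which is the point the video makes about working through each one by hand.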
And again, this is just one more reason why going through and doing the algorithm by hand will help you understand so much better what kind of effect that is gonna have on the algorithm. So you’ll be thankful that we go through it, and you’ll learn with each algorithm what doing something like that will do. Anyway, that’s the best choice in my opinion, rather than getting rid of data.
Now we are gonna forecast out. This is a regression algorithm, and generally you use regression to forecast out. You don’t have to, but generally that’s what you are doing. So I am gonna define forecast_out as being equal to the int value of math.ceil of 0.1 times the length of the df. So, first of all, what are we doing there? (We also need to import math.) math.ceil will take any number and round it up to the ceiling. Let’s say 0.1 times the length of the dataframe was gonna come out as a decimal, like 0.2. math.ceil will round that up to 1. So math.ceil rounds everything up to the nearest whole number, and then we are making it an integer value, because I think math.ceil returns a float and we don’t really want it to be a float.
Anyway, this will be the number of days out. Basically, we are gonna try to predict out 10 percent of the dataframe, and you’ll see when we actually go and do this that it’s not like you only get the one point 10 percent out; you can get tomorrow’s price, the next day’s price, and so on. You’re just using data from that many days ago to predict today. So feel free to change that. Maybe you want 0.01; maybe you just want to predict tomorrow’s price or something. You can play around with that if you want. We are basically just making stuff up as we go.
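The arithmetic behind forecast_out can be sketched as follows. The function name `forecast_horizon` and the row counts are assumptions made for illustration; the tutorial itself computes the value inline from `len(df)`.

```python
import math


def forecast_horizon(n_rows, fraction):
    """Rows to shift by: a fraction of the dataset, rounded up to a whole number."""
    return int(math.ceil(fraction * n_rows))


# A hypothetical dataframe of 3000 rows, forecasting 10 percent and 1 percent out.
ten_percent = forecast_horizon(3000, 0.1)
one_percent = forecast_horizon(3000, 0.01)

# A tiny dataset where the product is fractional (0.2) and gets rounded up.
tiny = forecast_horizon(2, 0.1)

print(ten_percent, one_percent, tiny)
```

In Python 3, `math.ceil` already returns an int, so the `int()` call is a harmless safeguard; in Python 2 it returned a float, which is where the cast mattered.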
So if you wanna change that, by all means change it. Let’s go to the top and import math before we forget. Okay, so now we’ve got our features, and we need that label. Now that we have forecast_out, we can create it. We’re gonna say df['label'] will be equal to df[forecast_col], so that’s why we used forecast_col: if you decide to change something later on, you’ll be able to just change this variable rather than all the feature variables. So it’ll be equal to df[forecast_col], and then we are gonna do .shift(-forecast_out). That’s why we needed it to be an int, because we are basically shifting the column.
What we’ve done is shift the column negatively, so the column gets shifted up, almost like in a spreadsheet. This way, the label column for each row will be the adjusted close price 10 days into the future. Okay? So that’s our label, and our features are the attributes that, in our mind, may cause the adjusted close price in 10 days, or 10 percent, to change. Actually, this will be much greater than 10 days, because we said 10 percent of the dataframe and didn’t even specify a timeframe. We can tinker with this number later; it’s really not that important.
With regression, you aren’t gonna get rich on just this algorithm, I promise you. But you’ll find it’s actually not a bad model of stock price, and as you add more useful features, it can get pretty good. Anyway, now we have our label column, so let’s go ahead and print df.head() again. This just prints the first 5 rows of the dataframe. Again, there may be things we are doing with pandas where you are like, “What’s going on?”
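To see what the negative shift does, here is a small sketch on invented prices. The dates, the values, and a `forecast_out` of 2 are assumptions chosen so the result is easy to read; the tutorial derives `forecast_out` from the length of the real dataframe.

```python
import pandas as pd

# Five made-up daily prices with a date index, standing in for 'Adj. Close'.
df = pd.DataFrame(
    {"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

forecast_col = "Adj. Close"
forecast_out = 2  # look two rows ahead in this toy example

# Shift the column up by forecast_out rows: each row's label becomes the
# price that occurs forecast_out rows later.
df["label"] = df[forecast_col].shift(-forecast_out)

print(df)
```

The first row's label is the price from two rows later, and the last two rows have no label because no later price exists yet. Nothing is predicted at this stage; the shift only pairs each row's features with a known future price so that a model can later be trained on those pairs.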
Ask, and I can point you to a tutorial, because I’ve done tutorials on everything that I am gonna be doing. Okay, so each of these columns is a feature, and then we finally have a label column, which is the price some time into the future for our data.
Now, let’s do a df.tail(), and also let’s do a df.dropna(inplace=True). Those are some awful high numbers for 10 percent out. Interesting. So I guess prices changed that much over that shift. Let’s try a smaller shift. Fascinating that that would be 10 percent out. That’s a little better. Maybe we’ll use that 0.01. Let’s use that one, because the other ones were just so huge. Let’s go back to head and look at that number. If you are not following, I am just comparing the forecast price to the adjusted close price. Of course, when the stock first opens, this is actually a significant percent change, from 50 to 66, but the stock had just come out, and of course Google does very well over time. Anyway, I think I’ll go with 0.01 for now, or we’ll mess with both whenever we go to predict some stuff.
Anyway, that’s it for this one. We have done features, we’ve got our label, and now we’re actually ready to train, test, predict, and run this algorithm on some realistic data, so stay tuned for that. If you have questions, comments, concerns, whatever, up to this point, feel free to leave them below. Otherwise, as always, thanks for watching, and until next time.
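A short sketch of why the dropna call is still needed even though fillna ran earlier: the shift itself creates new missing labels at the bottom of the dataframe. The prices, the `forecast_out` of 2, and the `rows_before` and `rows_after` names are assumptions for illustration.

```python
import pandas as pd

# Six made-up prices with no missing values to begin with.
df = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

forecast_out = 2
df["label"] = df["Adj. Close"].shift(-forecast_out)

# The last forecast_out rows now have NaN labels, since their future price
# is not in the data.
print(df.tail())

rows_before = len(df)
df.dropna(inplace=True)  # drop the rows that have no label
rows_after = len(df)

print(rows_before, rows_after)
```

Because the fill happened before the shift, the sentinel value never reaches these rows, so dropping them removes only the rows whose future price is unknown.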

100 thoughts on “Regression Features and Labels – Practical Machine Learning Tutorial with Python p.3”

  1. df.fillna(-99999, inplace(TRUE)) is not working. Can someone help me with some other way to replace values?

  2. So I just changed forecast_out to 100. It can only predict values till 2018-03-27. It cannot go past that. We are just shifting the values for prediction. We already have the values. We just shift it and say that its a prediction. This makes 0 sense and my head hurts right now. Why are we predicting the values which we already have???????????

  3. guys, im getting this error when i run the following code:

    code: data["label"] = data[forecast_col].shift(-forecast_out)
    error: "TypeError: 'Adj. Close'."
    After checking my code thoroughly, i am stuck at this point.

    If you would like to impart me some solutions, that would be truly appreciatable!

  4. Great video as always. Quick question: did anyone fully understand why 'Adj. Close' cannot be used a label?

  5. i get the folliwing error.

    Traceback (most recent call last):

    File "test.py", line 7, in <module>
    df= df[['ADj. Open','ADj. High','ADj. Low','ADj. Close','ADj. volume',]]
    File "/usr/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
    File "/usr/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
    File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1327, in _convert_to_indexer
    .format(mask=objarr[mask]))
    KeyError: "['ADj. Open' 'ADj. High' 'ADj. Low' 'ADj. Close' 'ADj. volume'] not in index"
    drunkenpanda Documents

    Did the dataset change ?

  6. Hi sentdex, I think you've done an awesome job! I have a quick question though:
    Since you already did a fillna why do you do a dropna at the end? isn't it redundant since all na values would already be filled with a value of -99999 by the fillna?

  7. when i print the head of the dataframe, my values are abit different, even for columns that we didn't tamper with, like 'Adj. close' . did any of you guys get the same problem?

  8. Hi I think these videos are very good. Well done.
    One small question. When you're forecasting, are you forecasting 0.01 of the dataset into the future (e.g. the predicted value 30 days from now), or are you predicting from 1 day ahead up until you reach 0.01 of the dataset ahead (e.g. from tomorrow until 30 days ahead)? Any advice would be much appreciated. Thanks, Shane

  9. At 4:40 you were looking at another screen and didn't explain what follows very well. Specifically, at 5:47, is it predicting out 10% of the dataframe?

  10. Got the theory part but lost track from math.ceil 🙁
    What is that .01/.1 and how does it help in getting values of the next10 days in the dataset??

  11. Please correct me if Im wrong ,ive tried to summarize the whole video.

    We first found out all meaningful columns .
    Finding adj_close value is senseless because we already know PCT_change (which is found using adj_close)
    Adj_close is set as forecast_col because we are finding the value of adj_close from the future.

    forecast_out specifies the number of days we have to go in future to find adj_close.
    forecast_out = int(math.ceil(0.01*len(df)))
    let len(df) be 1000 i.e we have data for 1000 days
    forecast_out = math.ceil(10) = 10
    therefore we will be finding the adj_close value of 15th september considering today is 5th september.

    df['label'] = df[forecast_col].shift(-forecast_out)
    this statement is actually creating a new column as label assigning it adj_close value of the future(15th september ) and shifting it 10 rows above , so the 5th septemeber row will be having label(adj_close ) of 15th september.

    I reckon we are not predicting any value , we are just finding the value existing in the database from the future row and shifting it according to forecast_out.
    thats why we are geting Nan values when printing the tail.(used dropna).
    consider last date is 25th september than so we will be getting some label values till 15th september but after that till 25th september we will get Nan.

  12. 3.5 python was installed in that pip vesrion was 7 +
    once it updated its 18+
    when i am trying to install Quandal it shwoing an error
    Collecting Quandal
    Could not find a version that satisfies the requirement Quandal (from versions: )
    No matching distribution found for Quandal
    please guide me

  13. The way I see it, in the Adj. Close column we have a value for each day. Let the value for a certain day, for simplicity let's say Monday, be 50. Now, if we would shift the Adj. Close column up by one, this would mean that the label value for the Sunday before this Monday would actually be 50. In other words, if you shift the Adj. Close column up by 1, the label column would miss a value for the last day that is available in the dataset, as there is no day after that so no available adjusted close value. Is this true, because then I finally understand it 😉

  14. Hello! I read your blog and watch your video and you are mentioning that 'One popular option is to replace missing data with -99,999. With many machine learning classifiers, this will just be recognized and treated as an outlier feature. ' Can you please help in order to understand that? Maybe a resource?

  15. There are a lot of flaws in your videos regarding the choice of examples (time series is not a standard example for linear regression, high – close instead of high – low), way of teaching (saying wrong things… please cut the video and say it right again) and taking certain things for granted (e.g. import math). I suggest to re-view your videos after you finished them and take the perspective of a student, if you can.

  16. Hey guys, not sure if this helps anyone, but here's a link to an IPython notebook for this part of the series. I've made sure that the code works (Python 3). If anything gets broken in the future, feel free to let me know.
    https://github.com/GoPlayOutside/SentdexML/blob/master/PML%20Tutorial%20w%20Python%20pt3.ipynb

  17. New to dataframes in Python: why is the column "Date" treated differently from the other columns in the data frame? I.e. it's not in the list of columns that you subset from df, but it still shows up when you print the new df. Why didn't you have to do df = df[['Date', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]? Are the entries under "Date" considered row names rather than actual data values?

  18. Hi, You are awsome and extra-ordinary… Never seen anyone spending theoritical in such amazing practical ways.
    Blessed to learn from you. Thank you.
    Just wanted to ask in what order i should try all of your course while learning Data-Science. Also wanted to learn finance and ML also.

  19. Ok, so I am very interested in this, but I have to be honest. Even tho everything you explained in the video worked for me and everything was fine. I have no idea what tf is going on… Well anyway, nice video, I think

  20. 3:48 did he meant entire row? if not then wouldnt it be better to delete a row(a record) instead of an entire column(feature)

  21. Hey Sentdex! I am getting a NaN under my label for 2018 but the 2004 prices are printing normally. Looks like the shift could even use adjusting now.

  22. So, are you basically making a copy of the Adj. close column in the labels column and pulling it up by 5/10 values?

  23. When you were shifting, you were basically doing the train test split and dropped all the rows which you were to use for testing. Am i right?

  24. I followed exaclty what you did however im having problem with the dataset used.. it seem im getting a error when importing quandl and nothing actually happens when i run but i keep getting an error saying i have already called it more then fifty times a day.. What do I do in this case

  25. For the line df['label'] = df[forecast_col].shift(-forecast_out), I think it should be '+forecast_out' instead of '-forecast_out'. I believe the idea was to predict present in terms of past data, the initial 1% of the row should have NaN as they don't have sufficent 1% data to do the prediction on.

  26. please help me over this error
    as i am trying to import cross_validation from sklearn
    i am getting following error:
    "cannot import name 'cross_validation' from 'sklearn' (C:\ProgramData\Anaconda3\lib\site-packages\sklearn\__init__.py)"

  27. I have successfully installed the quandl but not able to get the data sets and manual search results not obtained

  28. Thank you alot .These videos are perfect.I am applying what I've learnt from you. And it is going great.You cannot guess how much I needed these videos.

  29. Well, it is showing the error "'Series' objects are mutable, thus they cannot be hashed" on the lines for df[HL_PCT] and df[PCT_change]. What should I do now?

  30. Why you always handle the nan values with outliers, why dont you take the mean of the column and use it .please explain thanks.

  31. great stuff, would be even better if you had explained what each column states for in more details 🙂

  32. I'm a little unclear on how the dataframe shift creates a projection into the future. I didn't actually do the data analysis tutorials, in case it's in there, can someone give me a link?

  33. I'm not sure if you mention it at any point, but August 19, 2004 was the day that Google went public. I have to imagine that the data from the very first day of an IPO is skewed.

  34. Im confused why we call this prediction when all we do is shifting (removing the data)? So we shorten the tail of the data. Why do we call it forecast?

  35. Can anybody tell why do we need to shift the values? When we do a shift, the lower values are NaN, so why do we do this step exactly?

  36. @sentdex Pinning the helpful and interesting comments would help a lot if ever one is possibly got mixed up somewhere.

    I kinda gave up on the series when I missed the point/relevance of why we are doing,

    forecast_out = int(math.ceil(0.01*len(df)))

    df['label'] = df[forecast_col].shift(-forecast_out)

    until i had to look through all the comments and someone made it clear, that it's just to make a 'training set' so that we could train our algorithm on it.

    Great tutorial series, hoping I'd not give up in the middle. 🙂

  37. We have, let's say, 3000 rows in a dataframe. So using 'forecast_out=int(math.ceil(0.01*len(df)))' and then shift method on df will shift the label column up by 30 days..I got it.. But my question is since we have not given our algorithm any way of doing calculations or predictions, how did it predict the values in the label column for the last 30 days, as seen in df.tail()…
    I mean I was expecting blank cells or 'NaN' or '??' for the last 30 days because all we are doing is shifting… So how did our computers straight-away predicted it? Did we train our model somewhere and I missed it???
    Thank you in advance!! Love this series

  38. Thank you for this wonder full tutorial. I have watched your 80% tutorial. your all tutorial is very helpful me. When I was used sklearn 0.20.1
    from sklearn import cross_validate is not work i replace this code
    through
    from sklearn.model_selection import cross_validate
    Thanks again!

  39. there are no NA values in this dataset as of 2019 ( you can check it yourself by using df.isna().sum() command ). and I dont think math.ceil ever returns a float. after all ceil or float's whole purpose is to round the number to a nearest integer.

  40. Love your videos , however, when trying to follow along while coding it is nearly impossible. I watch a lot of tutorials and for some reason I can never keep up with your coding and talking. Love the topics! however I wish you taught like most other courses around.

    I still get a lot out of your videos and I'm a fan but I have to write the code and then replay parts to understand it unlike others.

  41. I think beginners like me should watch first (Data Analysis w/ Python 3 and Pandas series) to understand what is going on here.

  42. Great series, if you don't understand the basics here then you need to go back and do the intro Python and Data Analysis tutorials first.

    To conserve Quandl calls you can do this (Python 3):

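    import os
    import pandas as pd
    import quandl

    if not os.path.exists('data.csv'):
        # First run: download once and cache it, keeping the date index.
        df = quandl.get('WIKI/GOOGL')
        df.to_csv('data.csv')
    else:
        # Later runs: load the cached copy instead of calling Quandl again.
        df = pd.read_csv('data.csv', index_col=0, parse_dates=True)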

  43. Bro, I have a question when u calculate HL_PCT, u use Adj. Close, but I think that we should use Adj. Low in the difference, that means, (df['Adj. High'] – df['Adj. Low']) / df['Adj. Low'] * 100.0. Correct me pls if I am wrong

  44. Hi , I would like to tell you I really liked the pytorch tutorials and really want the machine learning tutorials to be renewed with new updates of library like a new machine learning series

  45. Still cant understand why use "df.fillna(value=-99999, inplace=True)" if there are no "NaN" values to this point.

  46. I didn't get that part we shift the list rows!! why we should do this? and then we use Adj. Close too? please help. thank you!

  47. Man, you are assuming people watching this tutorial are at the same level of knowledge as you are. Your going straight to the technical stuff without laying ground to understanding fundamental concepts. This is not good tutorial for someone who's getting started with ML. You should not say it is beginner's tutorial, I just think maybe you should rename your tutorial, but you know this is your channel and you can do whatever you want to do… just thinking loud!

  48. In Finance, it is "easy" to forecast Stock Prices since it s AR (auto regressive) processes. Hence, the best prediction of a stock price at time t+1 is its price at time t.. so it ise useless to measure any accuracy on the prices.
    The point is to forecast the pct change, which is the "real" random variable following a normal process. The accurracy should be measured on this forecasted pct change…

  49. df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']] can you explain what is the meaning of this line with reference to basics of python? Is df a list? and if it is so then how this syntax is justified?

  50. I used the 'WIKI/GOOGL' arguement to get the dataset and it worked just fine but when I looked for it in the quandl website I could not find it. Seems like WIKI is no longer even the data publisher there. Seems odd. Don't know what's going on there.

  51. So at first I was wondering why my label showed 1094.00 instead of his 765.2 but then I realized that my data is updated to a data of 2018-01-30 at his is at 2016-02-19. This is at df.tail so i was wondering if that will make too much of a change
