Scikit Learn Machine Learning Tutorial for investing with Python p. 5

In this video, we build on the previous machine learning with scikit-learn tutorial and pull out the specific data point that we're interested in using as a feature. Sample code: http://pythonprogramming.net http://seaofbtc.com http://sentdex.com http://hkinsley.com https://twitter.com/sentdex Bitcoin donations: 1GV7srgR4NJx4vrk7avCmmVQQrqmv87ty6

Comments

To all the people who are getting "Index out of range": the error comes from the HTML markup differing between the html files, so you need to handle several cases when splitting the markup. I have handled the cases I found, as below. The code might not be optimised, but it covers all of those cases.

import os
import time
from datetime import datetime

def Keystats(gather="Total Debt/Equity (mrq):"):
    # `path` is the data folder variable defined earlier in the tutorial script
    statspath = os.path.join(path, '_KeyStats')
    stock_list = [x[0] for x in os.walk(statspath)]
    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        if len(each_file) > 0:
            for lfile in each_file:
                date_stamp = datetime.strptime(lfile, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                filePath = os.path.join(each_dir, lfile)
                with open(filePath, 'r') as f:
                    fileContent = f.read()
                # The label cell is closed by </td> in most files and by </th> in others
                fileContentNew = fileContent.split(gather + '</td>')
                if len(fileContentNew) == 1:
                    fileContentNew = fileContentNew[0].split(gather + '</th>')
                if len(fileContentNew) == 1:
                    # Label not found in either form: report the file and skip it
                    print(filePath)
                    continue
                fileContentNew = fileContentNew[1]
                # Some files put a newline between the label cell and the value cell
                if fileContentNew.startswith('\n'):
                    fileContentNew = fileContentNew.split('\n<td class="yfnc_tabledata1">')[1].split('</td>')[0]
                else:
                    fileContentNew = fileContentNew.split('<td class="yfnc_tabledata1">')[1].split('</td>')[0]
You are really for this topic, thanks for your videos.
Can someone explain to me why he used a backslash for the ticker rather than a forward slash (since we are using forward slashes for directory traversal)?
I didn't get the gather part. Can you please explain it to me?
+sentdex, thank you for your tutorials. They have been great introductions to machine learning. As many people have pointed out, Yahoo has changed their website (some items we used to scrape from the HTML are no longer there). I was wondering whether, with these changes, grabbing the visible text on the website would be the better approach for data gathering?
On Mac I'm getting this error:
value=source.split(gather+':</td><td class="yfnc=tabledata1">')[1].split('</td>')[0]
IndexError: list index out of range
I think the Yahoo Finance HTML just changed. Now when I look for Total Debt/Equity (mrq) I get this:
"Total Debt\u002FEquity (LTM)","SCREENER_FIELD_totalequity.lasttwelvemonths":"Total Equity (LTM)","SCREENER_FIELD_totalrevenues.lasttwelvemonths":"Total Revenues (LTM)","SCREENER_FIELD_totalrevenues1yrgrowth.lasttwelvemonths":"Total Revenues, 1 Yr. Growth % (LTM)","SCREENER_FIELD_totalsharesoutstanding":"Total Shares Outstanding","SCREENER_FIELD_totalsharesoutstandingonfilingdate.lasttwelvemonths":"Total Shares Outstanding on Filing Date (LTM)","SCREENER_FIELD_unleveredfreecashflow.lasttwelvemonths":"Unlevered Free Cash Flow (LTM)","SCREENER_FILTER_EMPTY_TEXT":"Enter criteria and click 'Find Stocks' to see the matching stocks","SCREENER_MATCH_RESULTS":"{start}-{end} of {total} Matching Stocks","SCREENER_NEW_TITLE":"New Untitled

What can we do in this case?
Great set of lectures!
I had an issue, hopefully you can assist me with it.

While parsing the local files, the code picks up the files from srcl (in KeyStats) and proceeds further instead of starting from a (the first file) for no apparent reason. Can't seem to figure out the reason why. I've tried using the same code as the one published on your website, same thing happens.
Hi (and thanks for all these really nice tutorials), it seems that there is something iffy with the aapl ticker for the file named 20060203134959.html (and others as well). Using source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0]
results in a "list index out of range" error. I did ctrl+U on it, and it seems that the line is cut off after </td>. I did a hack to circumvent, which is:
try:
    value = source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0]
except Exception as e:
    print(str(e))
    value = float('nan')
but it is not a very good hack since the value should be 0.
so what do you think about the stock market now?
for Mac OSX, you'll have to use:
ticker = each_dir.split("/")[-1]
I think that it's better to use python re (RegEx) than a series of splits
For Mac users, ValueError: time data '.DS_Store' does not match format '%Y%m%d%H%M%S.html' is due to Mac OS automatically creating .DS_Store files for each folder. They are hidden but the python script includes them. If you run into this error, all you need to do is delete the .DS_Store file. Search "recursively remove .DS_Store files" for instructions.
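As an alternative to deleting the .DS_Store files, the loop can skip anything that is not a snapshot page. The following is a minimal sketch, not the tutorial's code: `snapshot_files` is a hypothetical helper name, and it assumes every file of interest ends in `.html`, as the timestamped files mentioned in this thread do.

```python
import os
import tempfile

def snapshot_files(directory):
    """Return only the .html snapshot files in `directory`, sorted by name.

    Hidden files such as .DS_Store are left out, so strptime never sees them.
    """
    return sorted(
        name for name in os.listdir(directory)
        if name.endswith(".html") and not name.startswith(".")
    )

# Demo with a throwaway folder that mimics one ticker directory on a Mac.
demo_dir = tempfile.mkdtemp()
for name in ("20060203134959.html", ".DS_Store"):
    open(os.path.join(demo_dir, name), "w").close()

found = snapshot_files(demo_dir)
print(found)
```

In the tutorial loop this would replace the bare `os.listdir(each_dir)` call, so nothing has to be deleted from disk.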
Still shows out of range
ticker = each_dir.split("/Users/xxx/Desktop/Coding/Python/MachineLearningStockData/intraQuarter/_KeyStats/")[1]
IndexError: list index out of range
Tip for people who are using Mac operating systems...
For the "ticker = each_dir.split" snippet of code, what worked for me was writing out my entire directory path up to the ticker folder name. So it kind of looked like this:

ticker = each_dir.split('/Users/UserName/Desktop/intraQuarter/_KeyStats/')[1]

Hopefully this will help some folks that might be stuck on Mac computers :)
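A portable way to get the ticker, sketched here as an assumption rather than the video's method, is to let `os.path` take the last folder name instead of splitting on a hard-coded separator or on a full directory path. This probably also explains the backslash question above: on Windows, `os.walk` joins folder names with a backslash, so the video splits on it, while on a Mac the separator is a forward slash. `ticker_from_dir` is a hypothetical helper name.

```python
import os

def ticker_from_dir(each_dir):
    """Return the last path component, which is the ticker folder name."""
    # normpath drops a trailing separator so basename does not come back empty
    return os.path.basename(os.path.normpath(each_dir))

# Build an example path with whatever separator the current OS uses.
example = os.path.join("intraQuarter", "_KeyStats", "aapl")
print(ticker_from_dir(example))
```

Because the separator comes from the operating system, the same line should work unchanged on Windows, Mac and Linux for paths produced by `os.walk` on that machine.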
@sentdex I'm liking the tutorials so far! Great job FYI.
Hi +sentdex. I'm a total beginner to Python and the most I've ever coded is HTML. So, I'm a super newbie. I'm running the script but not seeing any output when I print, nor do I see any errors. I'm a bit confused as to what's going wrong :S
I am not sure your code works on Macs. I watched 24 episodes and then started over again, this time executing your code lesson by lesson. I am stopped cold in this lesson as I keep getting an error with the date_stamp when it gets to aapl. I have cut and pasted your code and get the same error.
error reads - ValueError: time data '.DS_Store' does not match format '%Y%m%d%H%M%S.html'
Any ideas?
What is "ticker" for?
there is a much better way for parsing the data using 'requests' and 'BeautifulSoup' python libraries. They are super easy to learn and use. cheers
Hello, Harrison.
Thanks for the tutorial. I found that my code cannot get all of the 'Total Debt/Equity' values.
After getting some of the values, it starts to throw an error [IndexError: list index out of range].

I checked the HTML source code. The standard markup we are searching for should be:
<tr><td class="yfnc_tablehead1" width="75%">Total Debt/Equity (mrq):</td><td class="yfnc_tabledata1">0</td></tr>

But there are some exceptions in the source code:

intraQuarter/_KeyStats/aapl/20060207091730.html:
<td class="yfnc_tablehead1" width="75%">Total Debt/Equity (mrq):</td>
<td class="yfnc_tabledata1">0</td>

intraQuarter/_KeyStats/aee/20090221005651.html:
<tr><th scope="row" width="75%">Total Debt/Equity (mrq):</th><td class="yfnc_tabledata1">N/A</td></tr>

Would you show us some Beautiful Soup skills to get around it? Thank you.
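The three markup variants quoted in this comment can be covered by a single pattern. This is only a sketch, and it uses the standard-library `re` module in place of Beautiful Soup so that it runs without extra installs. `extract_value` is a hypothetical helper name, and the pattern assumes the value cell always carries the `yfnc_tabledata1` class, as it does in the three examples.

```python
import re

def extract_value(source, gather="Total Debt/Equity (mrq)"):
    """Return the text of the value cell that follows the `gather` label, or None."""
    pattern = (
        re.escape(gather)
        # the label cell closes with </td> or </th>, optionally followed by whitespace
        + r':</t[dh]>\s*<td class="yfnc_tabledata1">(.*?)</td>'
    )
    match = re.search(pattern, source, re.DOTALL)
    return match.group(1) if match else None

# The three variants quoted in the comment.
samples = [
    '<tr><td class="yfnc_tablehead1" width="75%">Total Debt/Equity (mrq):</td>'
    '<td class="yfnc_tabledata1">0</td></tr>',
    '<td class="yfnc_tablehead1" width="75%">Total Debt/Equity (mrq):</td>\n'
    '<td class="yfnc_tabledata1">0</td>',
    '<tr><th scope="row" width="75%">Total Debt/Equity (mrq):</th>'
    '<td class="yfnc_tabledata1">N/A</td></tr>',
]

values = [extract_value(s) for s in samples]
print(values)
```

Returning None when the label is missing lets the caller decide whether to skip the file or record a placeholder, instead of hitting an IndexError.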

Additional Information:

Visibility: 16691

Duration: 11m 39s

Rating: 96