Getting started with Python and R for Data Science


Hi, welcome to this Data Science Dojo beginner tutorial on getting started with
Python and R for data science. In this tutorial we’ll take you through some
common Python and R packages and libraries used for machine learning and data
analysis, and go through a simple linear regression model. We’ll also help you
set up Python and R on your Windows, Mac, or Linux machine, run your code
locally, and push your code to a GitHub repository. So let’s get started with
installing Python and R.

To install Python on a Windows machine, we first need to check if our machine is 64-bit
or 32-bit, as this will determine the appropriate Python program to install. To
do this, search for “about your PC” and you’ll see if your machine is 64-bit or
32-bit. In my case, it’s 64-bit. Next, in your web browser, type
“python.org/downloads/windows” and scroll down to the version of Python you wish
to download. In my case, I’ll choose the latest version with the 64-bit
executable installer. You can go with the default installation, or you can do a
custom installation to include optional features such as “pip”, or you can
specify your path directly under C: so it’s easier to locate your Python program
later on. Then just click install. Once Python has installed on your computer,
you’ll need to add Python to your path to be able to run Python scripts in a
directory or folder. Download Git for Windows to set your path and run the
Python command. The commands in this program are basically the same as when
using the terminal on Mac or Linux. Alternatively, for Windows, you can use the
default command prompt by searching “CMD”. You can also set your local path by
searching “environment variables” and setting your path there.

Here’s an example of a Python script saved in my “Documents/project 1” folder.
Using a text editor of my choice, such as Notepad++, to write my Python code, I
saved my file as a “.py” file. Then I open my terminal, which is in
“C:/Program Files/Git/git-cmd”. I navigate to “Documents/project 1” and I set my
local Python path. We’ll set this up permanently using a “.bashrc” file with the
path to my Python program directly under C:. Now I simply type “py” followed by
the name of the file and its extension. If using Python 2.7, just type “python”
followed by the name of the file and its extension. If we hit enter to run this,
it produces the output of my code, which has predicted heights using a linear
regression model.

The final part of this Python Windows setup is installing pip to be able to easily
install Python packages and libraries. Pip might not have come with your
installation if you didn’t customize it, or it might not be included in an older
version of Python. To get pip, type “bootstrap.pypa.io/get-pip.py” in your web
browser, right click to save the file in your Python program folder, and then
run the command “python get-pip.py”. In my case, my Python program is under C:.

Moving on to installing R for Windows,
simply type in your browser “cran.r-project.org/bin/windows/base” and select the
32-bit or 64-bit version. Once it has downloaded, press “OK” and click “Next”
through all the steps. Once R has installed on your computer, you can simply
open the program on your desktop and start typing R commands or code. I
recommend you download RStudio, as it makes the process of editing and debugging
your code easier. Otherwise, you’re welcome to use the R command line. To save
an R file, click on “File”, “file history”, and this will save your code so you
can run it later. If you wish to set your path or working directory, simply type
“setwd” followed by the path to where you would like to store your R files
locally. You might need to use double backslashes on Windows, because R treats a
single backslash as an escape character, so the doubled form is what gets read
as a separator in the path.

Now, let’s install Python on a Mac. Go to the Mac terminal in “Finder”,
“Applications”, “Utilities”. Now we’re going to install our
command line utilities, Xcode, as this will help with the installation. So type
“xcode-select --install”, then click “Install” and “Agree”. Now we’re going to
use Homebrew to install Python. So type “/usr/bin/ruby”, and we’re going to use
curl and type the URL to Homebrew on GitHub. Press return and enter your
password if need be. Next, add the path: we’ll create a “.bashrc” file to
permanently add the path. If you get an error message stating “cannot write to
path”, try the “sudo chown” command accompanying this video.
All commands can be copied and pasted, as they accompany this video. Next we’ll
install Python, so just type “brew install python”, or “brew install python3” if
you’re using Python 3. We’ll also add this to our path, so we’ll create another
“.bashrc” entry.

To check if pip is installed as part of your Python program, simply type
“which pip” and it’ll show you the location where pip is installed. If you want
to check the version, just type “pip -V” and it’ll show you which version of pip
you’ve installed. As mentioned, pip is useful for easily installing Python
packages and libraries.

Moving on to R: to install this on a Mac after installing Homebrew, simply type
“brew tap homebrew/science” and then type “brew install r”. To open the R
command line, simply type
“r” and enter.

Now let’s install Python and R on Linux. I’m using Ubuntu. Later versions of
Ubuntu might already have Python installed, but I’ll take you through the
process anyway. So open your terminal. Now we’re going to type
“sudo apt-get install python3.6” (or “python2.7”). Next we’re going to type
“sudo apt-get install python-setuptools”. Lastly, install pip to easily install
Python libraries and packages by typing “sudo easy_install pip”. To install R on
Linux, simply type “sudo apt-get -y install r-base”. Now type uppercase “R” and
enter to open the R command line.

Now that we’ve got the setup and installation part of this
tutorial out of the way, we can move on to more fun stuff. Let’s have a quick
play with some data to get you familiar with some key data analysis and linear
regression concepts, as well as basic scripting for this. I’m going to go
through an example of a simple linear regression in Python and R using
simulated data on people’s height in centimeters and their weight in kilograms.

The model is based on a formula, which can be produced using Python and R
functions, that gives a predicted outcome or estimated y-value for a given
x-value, using a certain constant and slope. Here is what’s called the
“regression line”. I like to think of it as a line of predicted values along the
x-axis: for a given x-value, the line predicts the y-value to fall about here in
height. The actual values are slightly above and below the line, but the model
is generalized enough to take into account where most cases would probably fall.

The formula gives a constant value, which we add to a given x-value multiplied
by a given coefficient or slope. In other words, predicted y equals the constant
plus the slope times x. The constant means that when x is at 0, y is at this
value, and the slope means that for every one-unit increase in x, y increases by
this number of units. So we can use this formula to plug in any new x-value of a
person’s weight to predict their height or y-value. Of course, there are many
other factors, not only weight, that could influence a person’s height, so
we’re just looking at a very simple model to get started with.

To implement linear regression in Python,
we first need to install a few commonly used packages. We’ll open our terminal
and pip install “sklearn” for modeling (the package is published on PyPI under
the name “scikit-learn”). If using Python 2.7, just type “python -m pip install”
followed by the package name. Now we’re going to pip install pandas for data
importing. We’ll also install matplotlib for plotting. The last package we need
to install is “scipy”.

Next, go to your text editor and save a new Python file in “Documents/project 1”
or a folder of your choice. I’ll just call my file “LM model” and save it as a
Python file. Also, don’t forget to cd into this folder in the terminal so you
can run your script later.

Now we’re going to import these packages at the beginning of the script, so at
the top of the file we’ll type “from sklearn import linear_model”. That’s our
linear regression tool. We’re also going to import DataFrame from pandas, import
pandas as “pd”, and import matplotlib’s pyplot module as “plt”. Now we need to
read in our data, which you can download as part of this tutorial and save in
your current folder.
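As a rough sketch, the imports just described might sit at the top of the script as follows. It assumes scikit-learn, pandas and matplotlib are already installed, and it assumes that using matplotlib as “plt” refers to the conventional pyplot import.

```python
# Imports for the linear regression script, placed at the top of the file.
# Assumes scikit-learn, pandas and matplotlib have been installed with pip.
from sklearn import linear_model   # provides linear_model.LinearRegression
from pandas import DataFrame       # used later to build the X and Y variables
import pandas as pd                # pd.read_table reads the data file
import matplotlib.pyplot as plt    # assumption: "plt" is the pyplot module
```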
We’ll use the pandas “read_table” function for this. We’ll put our data in a
variable, which we’ll just call “input_data”, and we’ll give the “read_table”
function the data file’s name and extension in our folder. It’s comma separated,
as it’s a CSV file, and we have headers, which start at line 0, and we’ll give
our x and y headers specific names. This automatically infers the data types for
each column too.

Before applying a linear regression model, let’s plot the data using
matplotlib to see if the data naturally follows a linear pattern and a normal
distribution, as linear regression is not appropriate or useful for datasets
that don’t follow these assumptions. We’ll use a scatter plot, and we’re just
plotting weight versus height, so weight is on our x-axis and height is on our
y-axis. We’ll need to show this graph so it can render on our screen. Now save
and run the script. As we can see, the data is linear and follows a normal
distribution, making linear regression appropriate to use on these data.

Now we’ll define our X predictor
variable, weight, and our Y outcome variable, height. We’ll use “pd” for pandas
and the DataFrame function, with weight as our predictor and height as our
outcome variable. Now we’ll fit a model to the data using the fit function and
use this to predict height for a given weight. So we’re using a linear
regression model, and we’ll fit the model to the data.

We can now compare the first, say, six predicted values from the predict
function with the actual height values to see if they’re on par. First we’re
going to get all the predicted values, using our predictor variable to predict
the outcome. We’ll print some subheadings to differentiate the list of predicted
values from the actual ones, then have a look at the first six predictions (the
slice from 0 to 6) and compare them with the first six actual values. All right,
we’ll save and run the script. A quick eyeball of the first few predictions against the
actual values shows the model was not far off the mark, which is good. However,
to properly assess a model, we can use measures such as R squared, which is the
percentage of explained variance. So we’ll go back to our script and use the
score function to get the R squared, and we want to print this, obviously. Now
we’re just going to comment out the lines above, as we no longer want to view
these, and we’ll save and run our script again. As we can see, a high R squared
shows the model explained most or nearly all of the variance, which is good.
However, relying solely on R squared is probably not good enough when assessing
and measuring our model’s predictions. Sometimes it can be misleading to look
at the R squared, but the course will go through other measures you can use.

To perform the same analysis in R, we’ll first install a commonly used R package,
ggplot2, which is used for effectively visualizing and analyzing data. I’ll
select a CRAN mirror that’s close to me. We need to load ggplot2 whenever we
want to use it. We’ll read in our data using the “read.table” function and put
it in a variable. We’ll give “read.table” our file in our current working
directory; it’s comma separated and we do have headers, and we’ll just use the
default header names, x and y. This automatically infers data types too. We’ll
also attach our data frame so we can refer to column headers or variable names
without having to refer to the name of our data each time, making this more
convenient.

Now we’ll plot the data to see its normal distribution, but we can also use
ggplot2 to plot the regression line, or the line of best fit. So we’ll plot our
x and y, which is weight and height, and in the smooth function we’ll specify a
linear model. As we could see before, the actual heights are close to the
predictions of the line. Implementing a simple linear regression in R is quite
easy using the “lm” function. Now, to see the first few predictions of height,
we’ll use the predict function. We first need to get all of the predictions,
and we’re just going to print the first few, so the first six, to have a quick
look and compare with our actual values. As seen before, for the first few
cases, the predictions are pretty close. To print the R squared, or percentage
of explained variance, for assessing the model, we’ll use “summary”. As seen
before, it explains nearly all the variance, but it’s a good idea to also look
at errors or other measures for this. Finally, now that we’re finished, we’ll
detach our data.

In the last part of this tutorial we’ll
push our code to a GitHub repository so you can share your code publicly, or
store it privately if you wish. You can create a GitHub account for free. You
can also follow Data Science Dojo to clone or access a copy of the code provided
as part of the course material. Once you have created an account, add a new
repository via the GitHub website without initializing it. The instructions to
push your code to GitHub are on the website, but I’ll take you through the
process anyway.

First, open your terminal and cd into your current project directory. You’ll
need to configure your user name and user email. We’ll initialize our project
directory as our git repository. Then we’ll add all files in our project
folder. We’re not pushing it live yet; this is just selecting the files. Commit
your files with a message to track this first submission, should you wish to
publish updates later on. I’m just going to say “first go at implementing simple
linear regression”. As you can see, all the files in the “project 1” folder are
there. Now we’re going to give the URL of our main repository, so go to the main
page of your GitHub repo, copy the URL, and paste it into the terminal when
adding a remote repo. Finally, we’re going to push our code to the repo on
GitHub’s master branch. Now, if you have a look at your GitHub repo, you can see
all your files are there. All the work we have done in this tutorial is here.
Alternatively, after initializing your GitHub repo via the site, you can simply
drag and drop your project folder onto the main page of your repo.

Now that you’ve gone through the basics,
you should feel ready to dive into the course and gain a deeper and wider
understanding of data science. You know how to set up Python and R on your
machine, how to do basic scripting for reading and visualizing data, how to
apply a model and assess it, and now you can share your hacks and projects on
GitHub. The data used in this tutorial, the coded examples, the commands, the
URLs to programs, and so on all accompany this video.

My name is Rebecca Merrett. Feel free to reach out to me by commenting on this
video. I’m more than happy to help you get ready before you start your course.
Thanks for watching, and happy analyzing.
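The regression formula described in this tutorial (a constant plus a slope multiplied by the x-value) can be sketched in plain Python, without any packages. The weights and heights below are made-up illustration values, not the tutorial’s data, and the helper names fit_line and predict_height are hypothetical.

```python
# Minimal sketch of simple linear regression: y = constant + slope * x.
# The numbers below are invented for illustration only.

def fit_line(xs, ys):
    """Return (constant, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # Constant (intercept) is the predicted y when x is 0.
    constant = mean_y - slope * mean_x
    return constant, slope


def predict_height(weight, constant, slope):
    """Plug a new x-value (weight in kg) into the formula to estimate height in cm."""
    return constant + slope * weight


weights = [50.0, 60.0, 70.0, 80.0, 90.0]       # x: weight in kilograms (made up)
heights = [155.0, 162.0, 171.0, 178.0, 184.0]  # y: height in centimeters (made up)

constant, slope = fit_line(weights, heights)
print("constant:", round(constant, 2))
print("slope:", round(slope, 2))
print("predicted height at 75 kg:", round(predict_height(75.0, constant, slope), 2))
```

With these numbers, each extra kilogram of weight raises the predicted height by the slope, and the constant is the predicted height at a weight of zero.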

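The Python walkthrough in this tutorial (read the data, define the predictor and outcome, fit, predict, and score) might be pulled together roughly as follows. This is a sketch under assumptions: the data is a small made-up sample held in memory rather than the tutorial’s downloadable file, and the column and variable names are illustrative. The scatter plot step is left out so the script runs without a display.

```python
import io

import pandas as pd
from pandas import DataFrame
from sklearn import linear_model

# Made-up stand-in for the tutorial's CSV file: a header row,
# then weight (kg) and height (cm) on each line.
csv_text = """x,y
50,155
60,162
70,171
80,178
90,184
100,193
"""

# Read comma-separated data; the header is on line 0 and the
# columns are given specific names.
input_data = pd.read_table(io.StringIO(csv_text), sep=",", header=0,
                           names=["weight", "height"])

# Predictor (X) is weight and outcome (Y) is height, each as a DataFrame
# so scikit-learn receives two-dimensional input.
X = DataFrame(input_data["weight"])
Y = DataFrame(input_data["height"])

# Fit a linear regression model to the data.
model = linear_model.LinearRegression()
model.fit(X, Y)

# Get all predictions, then compare the first six with the actual values.
predictions = model.predict(X)
print("Predicted heights:")
print(predictions[0:6])
print("Actual heights:")
print(Y[0:6])

# R squared: the proportion of variance in height explained by the model.
r_squared = model.score(X, Y)
print("R squared:", r_squared)
```

To run this against a real file, the in-memory text would be swapped for the path to the downloaded CSV.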
5 thoughts on “Getting started with Python and R for Data Science”

  1. Thanks Rebecca………….great pace, English is flawless, and I got the height.csv data file from your github.com account………………:) …………..😋🐟😍……………..bye

  2. – Installing Python on Windows: 1:09

    – Installing R on Windows: 4:16

    – Installing Python on Mac: 5:39

    – Installing R on Mac: 8:10

    – Installing Python on Linux: 8:41

    – Installing R on Linux: 9:48

    – Simple linear regression model explanation: 10:13

    – Simple linear regression model in Python: 11:59

    – Simple linear regression model in R: 21:01

    – Pushing code to Github Repository: 25:26

  3. Absolutely useful and wonderful. Thanks a lot for sharing this kind of information. You've got a subscriber!
