Stats – What is “R squared” and what is the “R” in “R squared”

Coefficient of determination(决定系数), which is often called R Squared is a commonly used parameter to determine how well data points fit a curve.

set.seed(100)
x <- 1:100
y <- 2 * x + rnorm(length(x), 0, sd <- 10)
summary(m1 <- lm(y ~ x))

plot(y ~ x, type=”o”, main=”y = 2 * x, white noise sd=10″)
abline(m1, col=”red”)

The summary of the linear model m1 looks like below
Call: lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-22.6284 -6.8113 -0.2093 6.1186 26.1505

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.37592 2.06135 0.667 0.506
x 1.97333 0.03544 55.684 <2e-16 ***
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.23 on 98 degrees of freedom
Multiple R-squared: 0.9694, Adjusted R-squared: 0.9691
F-statistic: 3101 on 1 and 98 DF, p-value: < 2.2e-16

As you can see here R-squared is 0.9694 is very close to 1 which indicates that our linear model fit the data pretty well or vice versa.
you can easily calculate the R-squared yourself by doing this(Wikipedia defination):
> 1 - sum(m1$residuals^2) / sum((y-mean(y))^2) [1] 0.9693628

Again, if you change the standard deviation from 10 to 1000, which means that the white noise actually beat the existing rule.. to be simple, there is not that much linear regression during that data range. You can still build a linear model on it but the R squared turned out to be really really small.

The R squared here is 0.0003613 , which indicates our linear model is really a bad model to fit the datasets.

The model says the slope is -0.6669 which is complete nonsense…

In conlusion, R-squared gives a pretty good idea of how the data points line up with your model.

Then the question is where does the name R – squared come from?

The R is actually Pearson product moment correlation coefficient (Pearson’s R), whose definition is the covariance divide by the product of their own standard deviation.

r <- cov(x, y, method=”pearson”) / (sd(x)*sd(y))

> r^2

[1] 0.9693628

Which lines up perfectly with the R-squared in our linear model above(sd=5).

(References: Pearson Correlation Coefficient, Coefficient of Determination.)

R in command line – Read Pythonic Data into R

To read the data that cleaned up by Python. There are a few options. You can pump Python result into some kind of popular datatypes (JSON), or You can do some ETL work on either side to make the output the Python could be easily read by R. However, there is already an R package called RPython that seamlessly fill in this gap.
Take a look at document of this package, you can find there are only 4 functions in the content page but that makes my life so much easier.

library(‘rPython’)
str_py_list <- "[1,'a', 'B']"
str(python.get(str_py_list))
List of 3
$ : num 1
$ : chr "a"
$ : chr "B"
str_py_tuple <- "(1,'a', 'B')"
str(python.get(str_py_tuple))
List of 3
$ : num 1
$ : chr "a"
$ : chr "B"
str_py_dict <- "{1:2, 'a':'A', 'B': 1+1}"
str(python.get(str_py_dict))
List of 3
$ a: chr "A"
$ 1: num 2
$ B: num 2

As you can see, the python.get will read in the string and parse python objects into the List in R. and then you can use data.frame to change it into dataframe type.

2. python.assign(PyObject,RObject) will read the R object and translate into an Python object. Combining with python.exec("python command"), you can read the R data into Python.

> data(iris)
> df python.assign(‘py_iris’, df)
> python.exec(“print len(py_iris)”)
5
> python.exec(“print py_iris.keys()”)
[u’Petal.Length’, u’Sepal.Length’, u’Petal.Width’, u’Sepal.Width’, u’Species’]

3. python.load() will run a script of python code

$cat datafireball.py
import urllib2, sys
sys.path.append(‘/Library/Python/2.7/site-packages/beautifulsoup4-4.2.1-py2.7.egg’)
from bs4 import BeautifulSoup
stream = urllib2.urlopen(‘https://datafireball.com/’)
soup = BeautifulSoup(stream)
print soup.find(‘div’, {‘class’:’site-description’}).text.encode(‘utf-8’)

Above is a very basic Python script which uses urllib2 library make http request to datafireball.com and then uses BeautifulSoup package to parse the html returned. In the end, it will print the description title of datafireball.com to the screen.
Note, if you know how to catch the title into R, please leave a comment, but this is what happens in R.

python.load(‘/tmp/datafireball.py’)
a journey of a data guy

R in command line – Install R on Redhat6

To distribute R code, first of all, you need to install R on every single node across your cluster. You can find all kinds of information of how to install R on debian distribution etc. Here I will tell you how to properly install R on Redhat 6.

Redhat is using RPM(Redhat Package Manager) and for a Redhat 6.4 (Amazon Web Service), when you are trying to do sudo yum install R, the computer is not smart enough to figure out what you are trying to install.

Actually, they do have the R package for Redhat 6 but it is in the EPEL(Extra Package for Enterprise Linux). You can find all the packages available for Redhat 6 here(clearly, you can see R-core, R-dev.. in that list).

What you need to do is:

su -c 'rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm'

Then you can install R in one go:

su -c 'yum install -y R R-core R-core-devel R-devel'

Note, the -y flag answers yes to all the questions asking you ‘Is that OK with you [yes/no]’ 🙂

R in command line – stdin/stdout

In most cases, people use R inside some kind of IDE or interactive mode, like the R command line or Rstudio. From there, you can import all kinds of dataset using read.csv…etc. However, sometimes, I found it extremely helpful to write your R code in Rscript and read the data from the standard input and write the result back to the standard output. In this way, you can seamlessly pipe your R script with Bash, Python all together in one line.

Also, Hadoop Streaming makes it super easy to combine the power of all kinds of different languages together and fully utilize the power of cluster computing. This series of posts, I will introduce how to use stdin/stdout in R, how to parse the string into arguments, how to organize the out and how to apply all of this into Hadoop Streaming.

Here I will post a few tips for R users to get started working with stdin and stdout:
1.

#!/usr/bin/Rscript

Very First of all, your code need to start this line of code, this “shebang” will tell the machine which interpreter it will use to run the code, not /user/bin/R not anything else, the Rscript!
2.

input<-file('stdin', 'r')

As mentioned in the help page for function file:
Use “stdin” to refer to the C-level ‘standard input’ of the process (which need not be connected to anything in a console or embedded version of R, and is not in RGui on Windows).
Then we’ve successfully created the connection.
3.

row <- readLines(input, n=1)

The ‘r’ flag is actually very important here, otherwise, you can only read the first line..
And as default, the connection is closed when we first created the input connection. From the help page of readLines: If the connection is open it is read from its current position. If it is not open, it is opened in “rt” mode for the duration of the call and then closed again. Usually, data stored in the flat file comes with a format that each line is a record. So n=1 tells R to read 1 row at a time.
4.

while(length(row)>0) {
    # do something with your row (record)
}

To make sure every row got processed. You just need to check the length of the line that you read and put that into a while loop as a check flag.
5.

write(result, "")

write(x, file=”data”…), the file could be a connection, but we want our result be written to the standard output, in this case, we can just use an empty string/stdout() to make it happen.