@suribes
Forked from swayson/lsa_hack.r
Created May 4, 2016 21:52
Analyze Text Similarity with R: Latent Semantic Analysis and Multidimensional Scaling
# script from http://goo.gl/YbQyAQ
# install required packages (only needed once); ggplot2 is loaded below as well
install.packages(c("tm", "lsa", "ggplot2", "scatterplot3d"))
# load required libraries
library(tm)
library(ggplot2)
library(lsa)
library(scatterplot3d)
# 1. Prepare mock data
text <- c("transporting food by cars will cause global warming. so we should go local.",
"we should try to convince our parents to stop using cars because it will cause global warming.",
"some food, such as mongo, requires a warm weather to grow. so they have to be transported to canada.",
"a typical electronic circuit can be built with a battery, a bulb, and a switch.",
"electricity flows from batteries to the bulb, just like water flows through a tube.",
"batteries have chemical energe in it. then electrons flow through a bulb to light it up.",
"birds can fly because they have feather and they are light.", "why some birds like pigeon can fly while some others like chicken cannot?",
"feather is important for birds' fly. if feather on a bird's wings is removed, this bird cannot fly.")
view <- factor(rep(c("view 1", "view 2", "view 3"), each = 3))
view
df <- data.frame(text, view, stringsAsFactors = FALSE)
df
# prepare corpus
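corpus <- VCorpus(VectorSource(df$text))
# wrap base functions in content_transformer() so documents keep their tm class
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# optional stemming, left disabled because it raised an error in the original
# script; stemDocument needs the SnowballC package installed (see ?stemDocument)
# corpus <- tm_map(corpus, stemDocument, language = "english")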
corpus
#------------------------------------------------------------------------------
# 2. MDS with raw term-document matrix
td.mat <- as.matrix(TermDocumentMatrix(corpus))
td.mat
# compute distance matrix between documents (dist() compares rows, hence t())
dist.mat <- dist(t(td.mat))
dist.mat # check distance matrix
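# Aside (illustrative sketch, not part of the original script): dist() computes
# Euclidean distances between ROWS by default, which is why the term-document
# matrix is transposed so that each row is one document. The toy matrix and the
# toy.* names below are made-up assumptions for illustration only.
toy.td <- matrix(c(1, 0, 1,
                   1, 0, 0,
                   0, 2, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("cars", "food", "bulb"), c("d1", "d2", "d3")))
toy.dist <- dist(t(toy.td)) # one distance per pair of documents
toy.dist
# d1 and d3 differ only in the "food" count, so they end up closest together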
# MDS
fit <- cmdscale(dist.mat, eig = TRUE, k = 2)
# keep the grouping and labels in the plotting data instead of using df$ inside aes()
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2],
                     view = df$view, label = row.names(df))
ggplot(points, aes(x = x, y = y)) +
  geom_point(aes(color = view)) +
  geom_text(aes(y = y - 0.2, label = label))
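# Aside (illustrative sketch, not part of the original script): with eig = TRUE,
# cmdscale() also returns the eigenvalues and a two-element GOF vector, which
# gives a rough idea of how much of the distance structure k dimensions retain.
# The built-in eurodist distances stand in for dist.mat so the snippet runs on
# its own; the toy.fit name is an assumption.
toy.fit <- cmdscale(eurodist, eig = TRUE, k = 2)
toy.fit$GOF        # share of eigenvalue mass captured by the first 2 dimensions
head(toy.fit$eig)  # leading eigenvalues, sorted in decreasing order
# the same check on the script's own result would be fit$GOF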
#------------------------------------------------------------------------------
# 3. MDS with LSA
td.mat.lsa <- lw_bintf(td.mat) * gw_idf(td.mat) # weighting
lsaSpace <- lsa(td.mat.lsa) # create LSA space
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace))) # compute distance matrix
dist.mat.lsa # check distance matrix
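# Aside (illustrative sketch, not part of the original script): LSA rests on a
# truncated singular value decomposition. The base-R lines below rebuild a
# rank-2 approximation of a made-up term-document matrix, which is assumed to
# mirror in spirit what as.textmatrix(lsaSpace) returns; the toy.* names and
# the choice of 2 dimensions are assumptions.
toy.m <- matrix(c(1, 1, 0,
                  1, 0, 0,
                  0, 1, 1,
                  0, 0, 1),
                nrow = 4, byrow = TRUE) # 4 terms x 3 documents
toy.svd <- svd(toy.m)
toy.k <- 2 # number of latent dimensions to keep
toy.lowrank <- toy.svd$u[, 1:toy.k] %*% diag(toy.svd$d[1:toy.k]) %*% t(toy.svd$v[, 1:toy.k])
round(toy.lowrank, 2)
# the approximation error equals the first singular value that was dropped
norm(toy.m - toy.lowrank, "F")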
# MDS
fit <- cmdscale(dist.mat.lsa, eig = TRUE, k = 2)
# keep the grouping and labels in the plotting data instead of using df$ inside aes()
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2],
                     view = df$view, label = row.names(df))
ggplot(points, aes(x = x, y = y)) +
  geom_point(aes(color = view)) +
  geom_text(aes(y = y - 0.2, label = label))
#------------------------------------------------------------------------------
# 4. 3D plot of the LSA-based MDS solution
fit <- cmdscale(dist.mat.lsa, eig = TRUE, k = 3)
colors <- rep(c("blue", "green", "red"), each = 3) # one color per view
scatterplot3d(fit$points[, 1], fit$points[, 2], fit$points[, 3],
              color = colors, pch = 16, main = "Semantic Space Scaled to 3D",
              xlab = "x", ylab = "y", zlab = "z", type = "h")