textfeatures

Extracts features from a text vector.

textfeatures(text, sentiment = TRUE, word_dims = NULL,
  normalize = TRUE, newdata = NULL, verbose = TRUE)

Arguments

text	Input data. Should be a character vector or a data frame with the character variable of interest named "text". If a data frame is supplied, the first variable matching "id|*_id", if found, is assumed to be an ID variable.
sentiment	Logical indicating whether to return sentiment analysis features (for example, the variables `sent_afinn` and `sent_bing`). Defaults to TRUE. Setting this to FALSE skips the sentiment step and speeds things up a bit.
word_dims	Integer indicating the desired number of word2vec dimension estimates. When NULL, the default, this function picks a reasonable number of dimensions (ranging from 2 to 200) based on the size of the input. To disable word2vec estimates, set this to 0 or FALSE.
normalize	Logical indicating whether to normalize features (mean-centered, with sd = 1). Defaults to TRUE.
newdata	If a textfeatures_model is supplied to text, use this argument to supply the new data to which the textfeatures_model should be applied.
verbose	A single logical for printing logging messages as work progresses.
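
As a rough illustration of what `normalize = TRUE` means, the sketch below standardizes a made-up vector of URL counts with base R's `scale()`. The counts are an assumption chosen to mirror the five example tweets in the Examples section (one URL in the first tweet, none in the others). The package's internal implementation may apply additional transformations, and, judging by the constant `n_mentions` column in the example output, it appears to return 0 for zero-variance columns, where `scale()` would return NaN.

```r
# Hypothetical raw counts: one URL in the first text, none in the other four.
n_urls_raw <- c(1, 0, 0, 0, 0)

# scale() subtracts the mean and divides by the sample standard deviation.
n_urls_z <- as.numeric(scale(n_urls_raw))

round(n_urls_z, 3)
#> [1]  1.789 -0.447 -0.447 -0.447 -0.447

# The standardized values have mean 0 and standard deviation 1.
c(mean = round(mean(n_urls_z), 10), sd = sd(n_urls_z))
```

The rounded values match the `n_urls` column printed for `trump_tweets` in the Examples section.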

Value

A tibble data frame with extracted features as columns.

Examples


## the text of five of Trump's most retweeted tweets
trump_tweets <- c(
  "#FraudNewsCNN #FNN https://t.co/WYUnHjjUjg",
  "TODAY WE MAKE AMERICA GREAT AGAIN!",
  paste("Why would Kim Jong-un insult me by calling me \"old,\" when I would",
    "NEVER call him \"short and fat?\" Oh well, I try so hard to be his",
    "friend - and maybe someday that will happen!"),
  paste("Such a beautiful and important evening! The forgotten man and woman",
    "will never be forgotten again. We will all come together as never before"),
  paste("North Korean Leader Kim Jong Un just stated that the \"Nuclear",
    "Button is on his desk at all times.\" Will someone from his depleted and",
    "food starved regime please inform him that I too have a Nuclear Button,",
    "but it is a much bigger &amp; more powerful one than his, and my Button",
    "works!")
)

## get the text features of a character vector
textfeatures(trump_tweets)
#> ↪ Counting features in text...
#> ↪ Sentiment analysis...
#> ↪ Parts of speech...
#> ↪ Word dimensions started
#> ↪ Normalizing data
#> ✔ Job's done!
#> # A tibble: 5 x 37
#>   n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars
#>    <dbl>     <dbl>      <dbl>         <dbl>      <dbl>         <dbl>   <dbl>
#> 1  1.79      1.79       1.79          1.79           0             0  -1.57 
#> 2 -0.447    -0.447     -0.447        -0.447          0             0  -0.411
#> 3 -0.447    -0.447     -0.447        -0.447          0             0   0.608
#> 4 -0.447    -0.447     -0.447        -0.447          0             0   0.515
#> 5 -0.447    -0.447     -0.447        -0.447          0             0   0.857
#> # … with 30 more variables: n_uq_chars <dbl>, n_commas <dbl>, n_digits <dbl>,
#> #   n_exclaims <dbl>, n_extraspaces <dbl>, n_lowers <dbl>, n_lowersp <dbl>,
#> #   n_periods <dbl>, n_words <dbl>, n_uq_words <dbl>, n_caps <dbl>,
#> #   n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>,
#> #   sent_afinn <dbl>, sent_bing <dbl>, sent_syuzhet <dbl>, sent_vader <dbl>,
#> #   n_polite <dbl>, n_first_person <dbl>, n_first_personp <dbl>,
#> #   n_second_person <dbl>, n_second_personp <dbl>, n_third_person <dbl>,
#> #   n_tobe <dbl>, n_prepositions <dbl>, w1 <dbl>, w2 <dbl>, w3 <dbl>

## data frame with a character vector named "text"
df <- data.frame(
  id = c(1, 2, 3),
  text = c("this is A!\t sEntence https://github.com about #rstats @github",
    "and another sentence here",
    "The following list:\n- one\n- two\n- three\nOkay!?!"),
  stringsAsFactors = FALSE
)

## get text features of a data frame with "text" variable
textfeatures(df)
#> ↪ Counting features in text...
#> ↪ Sentiment analysis...
#> ↪ Parts of speech...
#> ↪ Word dimensions started
#> ↪ Normalizing data
#> ✔ Job's done!
#> # A tibble: 3 x 36
#>   n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars
#>    <dbl>     <dbl>      <dbl>         <dbl>      <dbl>         <dbl>   <dbl>
#> 1  1.15      1.15       1.15          1.15       1.15          1.15    0.243
#> 2 -0.577    -0.577     -0.577        -0.577     -0.577        -0.577  -1.10 
#> 3 -0.577    -0.577     -0.577        -0.577     -0.577        -0.577   0.856
#> # … with 29 more variables: n_uq_chars <dbl>, n_commas <dbl>, n_digits <dbl>,
#> #   n_exclaims <dbl>, n_extraspaces <dbl>, n_lowers <dbl>, n_lowersp <dbl>,
#> #   n_periods <dbl>, n_words <dbl>, n_uq_words <dbl>, n_caps <dbl>,
#> #   n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>,
#> #   sent_afinn <dbl>, sent_bing <dbl>, sent_syuzhet <dbl>, sent_vader <dbl>,
#> #   n_polite <dbl>, n_first_person <dbl>, n_first_personp <dbl>,
#> #   n_second_person <dbl>, n_second_personp <dbl>, n_third_person <dbl>,
#> #   n_tobe <dbl>, n_prepositions <dbl>, w1 <dbl>, w2 <dbl>
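
The ID-variable rule described under Arguments can be approximated in base R. This is only a sketch: the regular expression is an assumption about what the "id|*_id" pattern means (a column named `id`, or one whose name ends in `_id`), not the package's actual code, and `find_id_var` and the `tweets` data frame are hypothetical names used for illustration.

```r
# Hypothetical helper: return the name of the first column called "id" or
# ending in "_id", or NULL when no such column exists.
find_id_var <- function(data) {
  hits <- grep("^id$|_id$", names(data), value = TRUE)
  if (length(hits) == 0) NULL else hits[1]
}

# Made-up data frame with two candidate ID columns and a "text" column.
tweets <- data.frame(
  status_id = c("a1", "b2"),
  user_id = c("u1", "u2"),
  text = c("first text", "second text"),
  stringsAsFactors = FALSE
)

find_id_var(tweets)
#> [1] "status_id"
find_id_var(data.frame(text = "no id column here"))
#> NULL
```

Under this reading, only the first match is used, so `user_id` would be treated as an ordinary column.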
