Extracts features from a text vector.

textfeatures(x, sentiment = TRUE, word_dims = NULL, threads = 1,
  normalize = TRUE, export = FALSE)

Arguments

x

Input data. Should be a character vector or a data frame with the character variable of interest named "text". If x is a data frame, the first "id|*_id" variable, if found, is assumed to be an ID variable.

sentiment

Logical, indicating whether to return sentiment analysis features, the variables sent_afinn and sent_bing. Defaults to TRUE. Setting this to FALSE will speed things up a bit.

word_dims

Integer indicating the desired number of word2vec dimension estimates. When NULL, the default, this function picks a reasonable number of dimensions (ranging from 2 to 200) based on the size of the input. To disable word2vec estimates, set this to 0 or FALSE.

threads

Integer, specifying the number of threads to use when generating word2vec estimates. Defaults to 1. Ignored if word_dims = 0.

normalize

Logical indicating whether to normalize (mean center, sd = 1) features. Defaults to TRUE.
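As a rough illustration of what "mean center, sd = 1" means, the sketch below standardizes a small hypothetical feature table with base R's scale(). The data frame and its values are made up for the example, and this is not necessarily how textfeatures() computes normalized features internally (for instance, it may transform counts or handle constant columns differently).

```r
# Hypothetical raw feature counts for three texts (illustrative values only)
raw <- data.frame(n_chars = c(10, 20, 30), n_words = c(2, 4, 9))

# Subtract each column's mean and divide by its sample standard deviation
normalized <- as.data.frame(scale(raw))

# After scaling, every column has mean 0 and standard deviation 1
col_means <- colMeans(normalized)
col_sds <- apply(normalized, 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Note that scale() returns NaN for a column with zero variance, so constant features would need separate handling in a sketch like this.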

export

Logical indicating whether to store sufficient information for exporting the feature extraction process: the means, the standard deviations, and the word2vec reference object, which can then be used to process new data. Defaults to FALSE.
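The idea of reusing stored means and standard deviations on new data can be sketched in base R as follows. The names train, new_data, centers and scales are hypothetical, and the sketch does not show how textfeatures() stores or applies this information; it only illustrates why the original statistics have to be kept.

```r
# Hypothetical features from the original ("training") texts
train <- data.frame(n_chars = c(10, 20, 30), n_words = c(2, 4, 9))

# Statistics that would need to be stored alongside the extracted features
centers <- colMeans(train)
scales <- apply(train, 2, sd)

# Hypothetical features from new texts, scaled with the STORED statistics
# rather than with their own means and standard deviations
new_data <- data.frame(n_chars = c(20, 40), n_words = c(5, 5))
new_scaled <- as.data.frame(scale(new_data, center = centers, scale = scales))
print(new_scaled)
```

Scaling new data with its own statistics would put it on a different scale from the original features, which is why the stored values are applied instead.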

Value

A tibble data frame with extracted features as columns.

Examples

## the text of five of Trump's most retweeted tweets
trump_tweets <- c(
  "#FraudNewsCNN #FNN https://t.co/WYUnHjjUjg",
  "TODAY WE MAKE AMERICA GREAT AGAIN!",
  paste("Why would Kim Jong-un insult me by calling me \"old,\" when I would",
    "NEVER call him \"short and fat?\" Oh well, I try so hard to be his",
    "friend - and maybe someday that will happen!"),
  paste("Such a beautiful and important evening! The forgotten man and woman",
    "will never be forgotten again. We will all come together as never before"),
  paste("North Korean Leader Kim Jong Un just stated that the \"Nuclear",
    "Button is on his desk at all times.\" Will someone from his depleted and",
    "food starved regime please inform him that I too have a Nuclear Button,",
    "but it is a much bigger &amp; more powerful one than his, and my Button",
    "works!")
)

## get the text features of a character vector
textfeatures(trump_tweets)
#> INFO [2018-11-28 17:11:02] iter 10 loglikelihood = -11.817
#> INFO [2018-11-28 17:11:02] iter 20 loglikelihood = -12.426
#> INFO [2018-11-28 17:11:02] early stopping at 20 iteration
#> # A tibble: 5 x 31
#>   id    n_urls n_hashtags n_mentions n_chars n_commas n_digits n_exclaims
#>   <chr>  <dbl>      <dbl>      <dbl>   <dbl>    <dbl>    <dbl>      <dbl>
#> 1 1      1.79       1.79           0  -1.68    -0.730        0     -1.79
#> 2 2     -0.447     -0.447          0  -0.141   -0.730        0      0.447
#> 3 3     -0.447     -0.447          0   0.559    1.10         0      0.447
#> 4 4     -0.447     -0.447          0   0.478   -0.730        0      0.447
#> 5 5     -0.447     -0.447          0   0.784    1.10         0      0.447
#> # ... with 23 more variables: n_extraspaces <dbl>, n_lowers <dbl>,
#> #   n_lowersp <dbl>, n_periods <dbl>, n_words <dbl>, n_caps <dbl>,
#> #   n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>,
#> #   sent_afinn <dbl>, sent_bing <dbl>, n_polite <dbl>, n_first_person <dbl>,
#> #   n_first_personp <dbl>, n_second_person <dbl>, n_second_personp <dbl>,
#> #   n_third_person <dbl>, n_tobe <dbl>, n_prepositions <dbl>, V3 <dbl>,
#> #   w1 <dbl>, w2 <dbl>

## data frame with a character vector named "text"
df <- data.frame(
  id = c(1, 2, 3),
  text = c("this is A!\t sEntence https://github.com about #rstats @github",
    "and another sentence here",
    "The following list:\n- one\n- two\n- three\nOkay!?!"),
  stringsAsFactors = FALSE
)

## get text features of a data frame with "text" variable
textfeatures(df)
#> Warning: dtm has 0 rows. Empty iterator?
#> INFO [2018-11-28 17:11:02] iter 10 loglikelihood = 0.000
#> INFO [2018-11-28 17:11:02] early stopping at 10 iteration
#> # A tibble: 3 x 30
#>      id n_urls n_hashtags n_mentions n_chars n_commas n_digits n_exclaims
#>   <dbl>  <dbl>      <dbl>      <dbl>   <dbl>    <dbl>    <dbl>      <dbl>
#> 1     1  1.15       1.15       1.15   -0.792        0        0      0.173
#> 2     2 -0.577     -0.577     -0.577  -0.332        0        0     -1.08
#> 3     3 -0.577     -0.577     -0.577   1.12         0        0      0.902
#> # ... with 22 more variables: n_extraspaces <dbl>, n_lowers <dbl>,
#> #   n_lowersp <dbl>, n_periods <dbl>, n_words <dbl>, n_caps <dbl>,
#> #   n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>,
#> #   sent_afinn <dbl>, sent_bing <dbl>, n_polite <dbl>, n_first_person <dbl>,
#> #   n_first_personp <dbl>, n_second_person <dbl>, n_second_personp <dbl>,
#> #   n_third_person <dbl>, n_tobe <dbl>, n_prepositions <dbl>, V2 <dbl>,
#> #   w1 <dbl>