Recent Releases of quanteda
quanteda - CRAN v4.3.0
Changes and additions
- Added `corpus_chunk()` for chunking texts into smaller documents.
- Significantly reduced the memory usage of the `c()` operation on large `tokens` and `tokens_xptr` objects.
- Further improvements to the verbose messages for corpus, tokens, dfm and fcm objects.
- `tokens_ngrams()` now includes a new argument `apply_if`, functioning similarly to this argument in `tokens_compound()` and `tokens_lookup()` (#2390).
- Replaced `remove_unigram` with `match_pattern` in `object2id()` to control the matching of single-word patterns or multi-word patterns.
- `data_corpus_inaugural` now updated for Trump 2025.
Scientific Software - Peer-reviewed
- R
Published by kbenoit 9 months ago
quanteda - CRAN v4.2.0
quanteda 4.2.0
Changes and additions
- Made the `c()` operation on `tokens` and `tokens_xptr` objects significantly faster.
- New, and more consistent verbose messages for tokens and dfm objects.
- Preserve the default `concatenator` of tokens objects in `tokens_compound()` (#2432).
Bug fixes and stability enhancements
- Fix a bug in `dfm_lookup()` that leads to wrong feature names when `exclusive = TRUE` (#2424).
Published by kbenoit about 1 year ago
quanteda - CRAN v4.0.2
Minor fixes:
- A failing test caused by C++ code related to `fcm()` and how tokens objects are re-indexed.
- An undeclared package ‘quanteda.textstats’ in Rd xrefs.
Published by kbenoit almost 2 years ago
quanteda - CRAN v4.0.1
Fixed:
A failing test caused by the ever-shifting behaviour of Matrix and the devel R on r-devel-linux-x86_64-debian-clang and r-devel-linux-x86_64-debian-gcc.
An undeclared package ‘quanteda.textstats’ in Rd xrefs.
An installation failure on r-devel-linux-x86_64-fedora-gcc due to searching for TBB in all the wrong places.
Published by kbenoit almost 2 years ago
quanteda - CRAN v4.0
quanteda 4.0.0
Changes and additions
- Introduces the `tokens_xptr` objects that extend the `tokens` objects with external pointers for greater efficiency. Once `tokens` objects are converted to `tokens_xptr` objects using `as.tokens_xptr()`, `tokens_*.tokens_xptr()` methods are called automatically.
- Improved C++ functions to allow users to change the number of threads for parallel computing in a more flexible manner using `quanteda_options()`. The value of `threads` can be changed in the middle of an analysis pipeline.
- Makes `"word4"` the default (word) tokeniser, with improved efficiency, language handling, and customisation options.
- Replaced all occurrences of the magrittr `%>%` pipe with the R pipe `|>` introduced in R 4.1, although the `%>%` pipe is still re-exported and therefore available to all users of quanteda without loading any additional packages.
- Added `min_ntoken` and `max_ntoken` to `tokens_subset()` and `dfm_subset()` to extract documents based on their number of tokens easily. This is equivalent to selecting documents using `ntoken()`.
- Added a new argument `apply_if` that allows a tokens-based operation to apply only to documents that meet a logical condition. This argument has been added to `tokens_select()`, `tokens_compound()`, `tokens_replace()`, `tokens_split()`, and `tokens_lookup()`. This is similar to applying `purrr::map_if()` to a tokens object, but is implemented within the function so that it can be performed efficiently in C++.
- Added new arguments `append_key`, `separator` and `concatenator` to `tokens_lookup()`. These allow tokens matched by dictionary values to be retained with their keys appended to them, separated by `separator`. The addition of the `concatenator` argument allows additional control at the lookup stage for tokens that will be concatenated from having matched multi-word dictionary values. (#2324)
- Added a new argument `remove_padding` to `ntoken()` and `ntype()` that allows for not counting padding that might have been left over from `tokens_remove(x, padding = TRUE)`. This changes the previous number of types from `ntype()` when pads exist, by counting pads by default. (#2336)
- Removed dependency on RcppParallel to improve the stability of the C++ code. This change requires users of Linux-like operating systems to install the Intel TBB library manually to enable parallel computing.
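A brief sketch of how several of the additions above might be used, assuming quanteda 4.0 or later is installed. The example texts and the object names (`toks`, `xtoks`, `toks_long`, `toks_cond`) are illustrative and not taken from the release notes.

```r
library(quanteda)

txt <- c(d1 = "The quick brown fox jumps over the lazy dog.",
         d2 = "Short text.")
toks <- tokens(txt, remove_punct = TRUE)

# Convert to the external-pointer class; tokens_*() calls then dispatch
# to the tokens_xptr methods automatically.
xtoks <- as.tokens_xptr(toks)
xtoks <- tokens_remove(xtoks, "the", padding = TRUE)
toks2 <- as.tokens(xtoks)  # convert back to a regular tokens object

# The number of threads can be changed in the middle of a pipeline
# (set to 1 here so the example does not depend on TBB being available).
quanteda_options(threads = 1)

# Keep only documents with at least three tokens.
toks_long <- tokens_subset(toks, min_ntoken = 3)

# Apply a removal only to documents meeting a logical condition.
toks_cond <- tokens_select(toks, "the", selection = "remove",
                           apply_if = unname(ntoken(toks) > 3))

# Token counts with and without the pads left by padding = TRUE.
ntoken(toks2)
ntoken(toks2, remove_padding = TRUE)
```

Because `tokens_xptr` objects hold an external pointer, operations on them can modify the underlying data in place, which is why the result is converted back with `as.tokens()` before further inspection.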
Removals
- `bootstrap_dfm()` was removed for character and corpus objects. The correct way to bootstrap sentences is now to tokenize them as sentences and then bootstrap them from the dfm. This is consistent with requiring the user to tokenise objects prior to forming dfms or other "downstream" objects.
- `dfm()` no longer works on character or corpus objects, only on tokens or other dfm objects. This was deprecated in v3 and removed in v4.
- Very old arguments to `dfm()` that were not visible but worked with warnings (such as `stem = TRUE`) are removed.
- Deprecated or renamed arguments formerly passed in `tokens()` that formerly mapped to the v3 arguments with a warning are removed.
- Methods for readtext objects are removed, since these are data.frame objects that are straightforward to convert into a `corpus` object.
- `topfeatures()` no longer works on an fcm object. (#2141)
Deprecations
Some on-the-fly calculations applied to character or corpus objects that require a temporary tokenisation are now deprecated. This includes:
- `nsentence()` -- use `lengths(tokens(x, what = "sentence"))` instead;
- `ntype()` -- use `ntype(tokens(x))` instead; and
- `ntoken()` -- use `ntoken(tokens(x))` instead.
- `char_ngrams()` -- use `tokens_ngrams(tokens(x))` instead.
`corpus.kwic()` is deprecated, with the suggestion to form a corpus using `tokens_select(x, window = ...)` instead.
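The replacements listed above can be sketched as follows. This is a minimal illustration assuming quanteda 4.x; the example texts and object names are made up for the purpose.

```r
library(quanteda)

txt <- c(d1 = "One sentence here. And a second one.",
         d2 = "Just one sentence.")

# Instead of nsentence(txt): tokenise into sentences and count them.
n_sent <- lengths(tokens(txt, what = "sentence"))

# Instead of ntoken(txt) and ntype(txt): tokenise first.
toks <- tokens(txt)
n_tok <- ntoken(toks)
n_typ <- ntype(toks)

# Instead of char_ngrams(): form n-grams from a tokens object.
toks_bigram <- tokens_ngrams(toks, n = 2)
```

Tokenising once and reusing `toks` also avoids the repeated temporary tokenisation that the deprecated shortcuts performed on every call.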
Published by kbenoit almost 2 years ago
quanteda - CRAN v3.3.0
Changes and additions
- Implements a `"word4"` tokeniser that is based on new RBBI (RuleBasedBreakIterator) rules, implemented in a new .yml file that can be edited and changed by users, but whose defaults represent a significant improvement in pattern handling for words, sentences, and other forms of patterns. These rules are customised from the ICU rules for breaks, with the standard and customised rules found now in the `breakrules/` system folder, so that they could, in principle, be modified by the user.
Other minor changes:
- changes how elapsed time is recorded, by creating a global environment to record these in (aaa.R)
- improves several of the R-coded patterns that apply to `"word2"`:
  - the hashtag pattern (`pattern_hashtag`)
  - the separator pattern (by adding `\\p{M}`).
  - the URL pattern
- creates a new `tokens_restore()`, implemented in C++, to replace the older `preserve_special()` that rejoined splits created by the default stringi tokeniser machinery.
- makes some technical improvements to internal tokenisation functions, such as moving the ellipsis to the end of the function, to allow more modularity in developing future tokenisers.
Bug fixes and stability enhancements
- `dfm_group()` now works correctly with an empty dfm (#2225).
- `convert(x, to = "stm")` is no longer vulnerable to large numbers of removed features as in #2189.
Published by kbenoit almost 3 years ago
quanteda - CRAN v3.2.4
Fixes test failures caused by recent changes to Matrix package behaviours.
Published by kbenoit about 3 years ago
quanteda - CRAN v3.2.3
Bug fixes and stability enhancements
- Matrix package calls updated for compatibility with Matrix 1.4.2. (#2182)
- Changes to C++ code for `fcm()` to prevent some (chance) errors downstream in LSX. (#2181)
Published by kbenoit over 3 years ago
quanteda - CRAN v3.2.2
Bug fixes and stability enhancements
- `fcm()` computes the marginal frequency of upper-case tokens correctly (#2176).
- `tokens_chunk()` keeps all the docids, including those of empty documents, in the original object.
- `tokens_select()` recycles values when the length of `startpos` or `endpos` is less than `ndoc(x)`.
- `tokens_lookup()` and `dfm_lookup()` can apply very large dictionaries (more than 100,000 keys).
Published by kbenoit over 3 years ago
quanteda - CRAN v3.2.0
Bug fixes and stability enhancements
- `dfm()` returns a dfm with the identical column order even if `tokens_compound()` or `tokens_ngrams()` is used upstream (#2100).
- `dfm_group()` with NA values in a grouping variable now drops those, similar to the behaviour of `tokens_group()` and `corpus_group()` (#2134).
Changes and additions
- `char_wordstem()` now has a new argument `check_whitespace`, which will not throw an error when lower-casing text containing a whitespace character.
- `dfm_remove()` now has a new argument `padding = FALSE` that, when `TRUE`, collects counts of the removed features in the first column. This produces results consistent with what is compiled as a dfm built from tokens where some have been removed with `padding = TRUE` (#2152).
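A small sketch of the `padding` behaviour described above, assuming quanteda 3.2 or later. The toy text and object names are illustrative only.

```r
library(quanteda)

toks <- tokens("a b b c c c")
dfmat <- dfm(toks)

# Remove the feature "c" but keep its counts in a pad column ("").
dfmat_pad <- dfm_remove(dfmat, "c", padding = TRUE)

# For comparison: remove from the tokens with padding, then build the dfm.
dfmat_tokpad <- dfm(tokens_remove(toks, "c", padding = TRUE))

featnames(dfmat_pad)
featnames(dfmat_tokpad)
```

Keeping the pad column preserves the total token count per document, which matters when computing relative frequencies after feature removal.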
Published by kbenoit about 4 years ago
quanteda - CRAN v3.1.0
Bug fixes and stability enhancements
- Improved and more consistent handling of empty corpus, tokens and dfm objects, to address #2110.
- `rbind.dfm()` now preserves docvars (#2109).
- Document name for Biden's 2021 Inaugural Address in `data_corpus_inaugural` is now consistent with all other documents.
- Fix #2127 that caused subsetting to change document names.
Changes and additions
- `phrase()` now has a `separator` argument (#2124)
Deprecations
- `phrase()` methods for tokens, collocations, and lists are deprecated in favour of `as.phrase()`. (#2129)
Published by kbenoit over 4 years ago
quanteda - CRAN v3.0.0
Summary
quanteda 3.0 is a major release that improves functionality, completes the modularisation of the package begun in v2.0, further improves function consistency by removing previously deprecated functions, and enhances workflow stability and consistency by deprecating some shortcut steps built into some functions.
Changes and additions
- Modularisation: We have now separated the `textplot_*()` functions from the main package into a separate package quanteda.textplots, and the `textstat_*()` functions from the main package into a separate package quanteda.textstats. This completes the modularisation begun in v2 with the move of the `textmodel_*()` functions to the separate package quanteda.textmodels. quanteda now consists of core functions for textual data processing and management.
- The package dependency structure is now greatly reduced, by eliminating some unnecessary package dependencies, through modularisation, and by addressing complex downstream dependencies in packages such as stopwords. v3 should serve as a more lightweight and more consistent platform for other text analysis packages to build on.
- We have added non-standard evaluation for `by` and `groups` arguments to access object docvars:
  - The `*_sample()` functions' argument `by`, and `groups` in the `*_group()` functions, now take unquoted document variable (docvar) names directly, similar to the way the `subset` argument works in the `*_subset()` functions.
  - Quoted docvar names no longer work, as these will be evaluated literally.
  - The `by = "document"` formerly sampled from `docid(x)`, but this functionality is now removed. Instead, use `by = docid(x)` to replicate this functionality.
  - For `groups`, the default is now `docid(x)`, which is now documented more completely. See `?groups` and `?docid`.
- `dfm()` has a new argument, `remove_padding`, for removing the "pads" left behind after removing tokens with `padding = TRUE`. (For other extensive changes to `dfm()`, see "Deprecated" below.)
- `tokens_group()`, formerly internal-only, is now exported.
- `corpus_sample()`, `dfm_sample()`, and `tokens_sample()` now work consistently (#2023).
- The `kwic()` return object structure has been redefined, and built with an option to use a new function `index()` that returns token spans following a pattern search. (#2045 and #2065)
- The punctuation regular expression and that for matching social media usernames have now been redefined so that the valid Twitter username `@_` is now counted as a "tag" rather than as "punctuation". (#2049)
- The data object `data_corpus_inaugural` has been updated to include the Biden 2021 inaugural address.
- A new system of validators for input types now provides better argument type and value checking, with more consistent error messages for invalid types or values.
- Upon startup, we now message the console with the Unicode and ICU version information. Because we removed our redefinition of `View()` (see below), the former conflict warning is now gone.
- `as.character.corpus()` now has a `use.names = TRUE` argument, similar to `as.character.tokens()` (but with a different default value).
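The non-standard evaluation of `by` and `groups` described in the list above might look like this in practice. This is a sketch assuming quanteda 3.0 or later; the docvar name `party` and the toy corpus are hypothetical.

```r
library(quanteda)

corp <- corpus(c("a b c", "a b", "c d", "d e f"),
               docvars = data.frame(party = c("X", "X", "Y", "Y")))
dfmat <- dfm(tokens(corp))

# Unquoted docvar names are evaluated within the object's docvars.
dfmat_grp <- dfm_group(dfmat, groups = party)
corp_grp <- corpus_group(corp, groups = party)

# Sample one document from within each level of the docvar.
set.seed(1)
corp_smp <- corpus_sample(corp, size = 1, by = party)

docnames(dfmat_grp)
```

Passing the quoted string `"party"` instead would be evaluated literally rather than looked up as a docvar, as noted in the list.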
Deprecations
The main potentially breaking changes in version 3 relate to the deprecation or elimination of shortcut steps that allowed functions that required tokens inputs to skip the tokens creation step. We did this to require users to take more direct control of tokenization options, or to substitute the alternative tokeniser of their choice (and then coerce the result to tokens via `as.tokens()`). This also allows our function behaviour to be more consistent, with each function performing a single task, rather than combining functions (such as tokenisation and constructing a matrix).
The most common example involves constructing a dfm directly from a character
or corpus object. Formerly, this would construct a tokens object internally
before creating the dfm, and allowed passing arguments to `tokens()` via `...`.
This is now deprecated, although still functional with a warning.
We strongly encourage either creating a tokens object first, or piping the
tokens return to `dfm()` using `%>%`.
We have also deprecated direct character or corpus inputs to `kwic()`, since this also requires a tokenised input.
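The recommended workflow can be sketched as below: tokenise explicitly, then pass the tokens object on. This is an illustrative example assuming a current quanteda installation; the texts and object names are made up.

```r
library(quanteda)

corp <- corpus(c(d1 = "Running runs in the family.",
                 d2 = "The family ran."))

# Tokenise explicitly, choosing the tokenisation options yourself.
toks <- tokens(corp, remove_punct = TRUE)

# Build the dfm from the tokens object rather than from the corpus.
dfmat <- dfm(toks)

# Former dfm() shortcut arguments become separate, single-purpose steps.
dfmat_stem <- dfm_wordstem(dfmat)                   # was stem = TRUE
dfmat_nostop <- dfm_remove(dfmat, c("in", "the"))   # was remove = ...

# kwic() likewise takes the tokens object.
kw <- kwic(toks, pattern = "family", window = 2)
```

The same steps can be chained with a pipe, for example `tokens(corp) |> dfm()` on R 4.1 or later.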
The full listing of deprecations is:
- `dfm.character()` and `dfm.corpus()` are deprecated. Users should create a tokens object first, and input that to `dfm()`.
- `dfm()`: As of version 3, only tokens objects are supported as inputs to `dfm()`. Calling `dfm()` for character or corpus objects is still functional, but issues a warning. Convenience passing of arguments to `tokens()` via `...` for `dfm()` is also deprecated, but undocumented, and functions only with a warning. Users should now create a tokens object (using `tokens()`) from character or corpus inputs before calling `dfm()`.
- `kwic()`: As of version 3, only tokens objects are supported as inputs to `kwic()`. Calling `kwic()` for character or corpus objects is still functional, but issues a warning. Passing arguments to `tokens()` via `...` in `kwic()` is now disabled. Users should now create a tokens object (using `tokens()`) from character or corpus inputs before calling `kwic()`.
- Shortcut arguments to `dfm()` are now deprecated. These are still active, with a warning, although they are no longer documented. These are:
  - `stem` -- use `tokens_wordstem()` or `dfm_wordstem()` instead.
  - `select`, `remove` -- use `tokens_select()` / `dfm_select()` or `tokens_remove()` / `dfm_remove()` instead.
  - `dictionary`, `thesaurus` -- use `tokens_lookup()` or `dfm_lookup()` instead.
  - `valuetype`, `case_insensitive` -- these are disabled; for the deprecated arguments that take these qualifiers, they are fixed to the defaults `"glob"` and `TRUE`.
  - `groups` -- use `tokens_group()` or `dfm_group()` instead.
- `texts()` and `texts<-` are deprecated.
  - Use `as.character.corpus()` to turn a corpus into a simple named character vector.
  - Use `corpus_group()` instead of `texts(x, groups = ...)` to aggregate texts by a grouping variable.
  - Use `[<-` instead of `texts()<-` for replacing texts in a corpus object.
Removals
See note above under "Changes" about the `textplot_*()` and `textstat_*()` functions.
- The following functions have been removed:
  - all methods for defunct `corpuszip` objects.
  - `View()` functions
  - `as.wfm()` and `as.DocumentTermMatrix()` (the same functionality is available via `convert()`)
  - `metadoc()` and `metacorpus()`
  - `corpus_trimsentences()` (replaced by `corpus_trim()`)
  - all of the `tortl` functions
  - all legacy functions related to the ancient "corpuszip" corpus variant.
- `dfm` objects can no longer be used as a `pattern` in `dfm_select()` (formerly deprecated).
- `dfm_sample()`:
  - no longer has a `margin` argument. Instead, `dfm_sample()` now samples only on documents, the same as `corpus_sample()` and `tokens_sample()`; and
  - no longer works with `by = "document"` -- use `by = docid(x)` instead.
- `dictionary_edit()`, `char_edit()`, and `list_edit()` are removed.
- `dfm_weight()` - formerly deprecated `"scheme"` options are now removed.
- `tokens()` - formerly deprecated options `remove_hyphens` and `remove_twitter` are now removed. (Use `split_hyphens` instead, and the default tokenizer always now preserves Twitter and other social media tags.)
- Special versions of `head()` and `tail()` for corpus, dfm, and fcm objects are now removed, since the base methods work fine for these objects. The main consequence was the removal of the `nf` option from the methods for dfm and fcm objects, which limited the number of features. This can be accomplished using the index operator `[` instead, or for printing, by specifying `print(x, max_nfeat = 6L)` (for instance).
Bug fixes and stability enhancements
- Fixed a bug causing `topfeatures(x, group = something)` to fail with weighted dfms (#2032).
- `kwic()` is more stable and does not crash when a vector is supplied as the `window` argument (#2008).
- Allow use of multi-threading with more than two threads by fixing `quanteda_options()`.
- Mentions of the now-removed `ngrams` option in `dfm(x, ...)` have now been removed from the dfm documentation. (#1990)
- Handling for some early-cycle v2 dfm objects is improved, to ensure that they are updated to the latest object format. (#2097)
Published by kbenoit almost 5 years ago
quanteda - CRAN v2.1.2
Changes
- `textstat_keyness()` performance is now improved through implementation in (multi-threaded) C++.
Bug fixes and stability enhancements
- Fixes breaking tests and examples on Solaris platform as well as other changes introduced by changes to the stringi package.
Published by kbenoit over 5 years ago
quanteda - CRAN v2.1.1
Bug fixes and stability enhancements
- `corpus_reshape()` now allows reshaping back to documents even when segmented texts were of zero length. (#1978)
- Special handling applied for Solaris to some issues breaking on that build, relating to the caching in `summary.corpus()` / `textstat_summary()`.
Published by kbenoit over 5 years ago
quanteda - CRAN v2.1.0
Changes
- Added `block_size` to `quanteda_options()` to control the number of documents in blocked tokenization.
- Fixed `print.dictionary2()` to control the printing of nested levels with `max_nkey` (#1967)
- Added `textstat_summary()` to provide detailed information about dfm, tokens and corpus objects. It will replace `summary()` in future versions.
- Fixed a performance issue causing slowdowns in tokenizing (using the default `what = "word"`) corpora with large numbers of documents that contain social media tags and URLs that needed to be preserved (such as a large corpus of Tweets).
- Updated the (default) "word" tokenizer to preserve hashtags and usernames better with non-ASCII text, and made these patterns user-configurable in `quanteda_options()`. The following are now preserved: "#政治" as well as Weibo-style hashtags such as "#英国首相#".
- `convert(x, to = "data.frame")` now outputs the first column as "doc_id" rather than "document" since "document" is a commonly occurring term in many texts. (#1918)
- Added new methods `char_select()`, `char_keep()`, and `char_remove()` for easy manipulation of character vectors.
- Added `dictionary_edit()` for easy, interactive editing of dictionaries, plus the functions `char_edit()` and `list_edit()` for editing character and list of character objects.
- Added a method to `textplot_wordcloud()` that plots objects from `textstat_keyness()`, to visualize keywords either by comparison or for the target category only.
- Improved the performance of `kwic()` (#1840).
- Added new `logsmooth` scheme to `dfm_weight()`.
- Added new `textstat_summary()` method, which returns summary information about the tokens/types/features etc. in an object. It also caches summary information so that this can be retrieved on subsequent calls, rather than re-computed.
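Two of the changes in the list above, the `char_*()` selection helpers and the renamed first column from `convert()`, can be illustrated with a short sketch. The inputs are made up and a current quanteda installation is assumed.

```r
library(quanteda)

x <- c("apple", "banana", "cherry")

# Glob-style matching is the default valuetype.
kept <- char_keep(x, "b*")
removed <- char_remove(x, "b*")

# The document identifier column of the converted data.frame is "doc_id".
dfmat <- dfm(tokens(c(d1 = "a b", d2 = "b c")))
df <- convert(dfmat, to = "data.frame")
names(df)
```

Using "doc_id" avoids a clash when the word "document" itself occurs as a feature in the dfm.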
Bug fixes and stability enhancements
- Stopped returning `NA` for non-existent features when `n > nfeat(x)` in `textstat_frequency(x, n)`. (#1929)
- Fixed a problem in `dfm_lookup()` and `tokens_lookup()` in which an error was caused when no dictionary key returned a single match (#1946).
- Fixed a bug that caused a `textstat_simil/dist` object converted to a data.frame to drop its `document2` labels (#1939).
- Fixed a bug causing `dfm_match()` to fail on a dfm that included "pads" (`""`). (#1960)
- Updated the `data_dfm_lbgexample` object using more modern dfm internals.
- Updates `textstat_readability()`, `textstat_lexdiv()`, and `nscrabble()` so that empty texts are not dropped in the result. (#1976)
Published by kbenoit over 5 years ago
quanteda - CRAN v2.0.1
Changes
- Moved `data_corpus_irishbudget2010` and `data_corpus_dailnoconf1991` to the quanteda.textmodels package.
- Em dashes and double dashes between words, whether surrounded by a space or not, are now converted to " - " to distinguish them from infix hyphens. (#1889)
- Verbose output for dfm and tokens creation is now corrected and more consistent. (#1894)
Bug fixes and stability enhancements
- Number removal is now both improved and fixed (#1909).
- Fixed an issue causing CRAN errors in pre-v4, related to the new default of `stringsAsFactors = FALSE` for data.frame objects.
- An error in the print method for dfm objects is now fixed (#1897)
- Fixed a bug in `tokens_replace()` when the pattern was not matched (#1895)
- Fixed the names of dimensions not exchanging when a dfm was transposed (#1903)
Published by kbenoit almost 6 years ago
quanteda - CRAN v2.0.0
quanteda 2.0 introduces some major changes, detailed here.
What's new in v2.0
New corpus object structure.
The internals of the corpus object have been redesigned, and now are based around a character vector with meta- and system-data in attributes. These are all updated to work with the existing extractor and replacement functions. If you were using these before, then you should not even notice the change. Docvars are now handled separately from the texts, in the same way that docvars are handled for tokens objects.
New metadata handling.
Corpus-level metadata is now inserted in a user metadata list via `meta()` and `meta<-()`. `metacorpus()` is kept as a synonym for `meta()`, for backwards compatibility. Additional system-level corpus information is also recorded, but automatically when an object is created.
Document-level metadata is deprecated, and now all document-level information is simply a "docvar". For backward compatibility, `metadoc()` is kept and will insert document variables (docvars) with the name prefixed by an underscore.
Corpus objects now store default summary statistics for efficiency. When these are present, `summary.corpus()` retrieves them rather than computing them on the fly.
New index operators for core objects. The main change here is to redefine the `$` operator for corpus, tokens, and dfm objects (all objects that retain docvars) to allow this operator to access single docvars by name. Some other index operators have been redefined as well, such as `[.corpus` returning a slice of a corpus, and `[[.corpus` returning the texts from a corpus.
See the full details at https://github.com/quanteda/quanteda/wiki/indexingcoreobjects.
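A minimal sketch of the metadata and index operators described above, assuming a current quanteda installation. The docvar names `year` and `lang` and the metadata field `source` are hypothetical.

```r
library(quanteda)

corp <- corpus(c(d1 = "First text.", d2 = "Second text."),
               docvars = data.frame(year = c(2001, 2002)))

# Access and assign single docvars by name with `$`.
yrs <- corp$year
corp$lang <- "en"

# `[` returns a slice of the corpus, still a corpus object.
corp_first <- corp[1]

# User-level corpus metadata via meta() and its replacement form.
meta(corp, "source") <- "example"
meta(corp)
```

The same `$` accessor is described as working for tokens and dfm objects, since these also retain docvars.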
`*_subset()` functions. The `subset` argument now must be logical, and the `select` argument has been removed. (This is part of `base::subset()` but has never made sense, either in quanteda or base.)
Return format from `textstat_simil()` and `textstat_dist()`. Now defaults to a sparse matrix from the Matrix package, but coercion methods are provided for `as.data.frame()`, to make these functions return a data.frame just like the other textstat functions. Additional coercion methods are provided for `as.dist()`, `as.simil()`, and `as.matrix()`.
Settings functions (and related slots and object attributes) are gone. These are now replaced by a new `meta(x, type = "object")` that records object-specific meta-data, including settings such as the `n` for tokens (to record the `ngrams`).
All included data objects are upgraded to the new formats. This includes the three corpus objects, the single dfm data object, and the LSD 2015 dictionary object.
New print methods for core objects (corpus, tokens, dfm, dictionary) now exist, each with new global options to control the number of documents shown, as well as the length of a text snippet (corpus), the tokens (tokens), dfm cells (dfm), or keys and values (dictionary). Similar to the extended printing options for dfm objects, printing of corpus objects now allows for brief summaries of the texts to be printed, and for the number of documents and the length of the previews to be controlled by new global options.
All textmodels and related functions have been moved to a new package quanteda.textmodels. This makes them easier to maintain and update, and keeps the size of the core package down.
quanteda v2 implements major changes to the `tokens()` constructor. These are designed to simplify the code and its maintenance in quanteda, to allow users to work with other (external) tokenizers, and to improve consistency across the tokens processing options. Changes include:
- A new method `tokens.list(x, ...)` constructs a `tokens` object from named list of characters, allowing users to tokenize texts using some other function (or package) such as `tokenize_words()`, `tokenize_sentences()`, or `tokenize_tweets()` from the **tokenizers** package, or the list returned by `spacyr::spacy_tokenize()`. This allows users to use their choice of tokenizer, as long as it returns a named list of characters. With `tokens.list()`, all tokens processing (`remove_*`) options can be applied, or the list can be converted directly to a `tokens` object without processing using `as.tokens.list()`.
- All tokens options are now _intervention_ options, to split or remove things that by default are not split or removed. All `remove_*` options to `tokens()` now remove them from tokens objects by calling `tokens.tokens()`, after constructing the object. "Pre-processing" is now actually post-processing using `tokens_*()` methods internally, after a conservative tokenization on token boundaries. This both improves performance and improves consistency in handling special characters (e.g. Twitter characters) across different tokenizer engines. (#1503, #1446, #1801)
Note that `tokens.tokens()` will remove what is found, but cannot "undo" a removal -- for instance it cannot replace missing punctuation characters if these have already been removed.
- The option `remove_hyphens` is removed and deprecated, but replaced by `split_hyphens`. This preserves infix (internal) hyphens rather than splitting them. This behaviour is implemented in both the `what = "word"` and `what = "word2"` tokenizer options. This option is `FALSE` by default.
- The option `remove_twitter` has been removed. The new `what = "word"` is a smarter tokenizer that preserves social media tags, URLs, and email-addresses. "Tags" are defined as valid social media hashtags and usernames (using Twitter rules for validity) rather than removing the `#` and `@` punctuation characters, even if `remove_punct = TRUE`.
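The list-based workflow and the hyphen handling described above can be sketched as follows. The list `lis` stands in for the output of an external tokenizer and is made up for illustration; the sketch goes through `as.tokens()` and then `tokens()` on the resulting tokens object, and assumes a current quanteda installation.

```r
library(quanteda)

# A named list of character vectors, as an external tokenizer might return.
lis <- list(d1 = c("A", "self-aware", "#hashtag", "!"),
            d2 = c("Hello", "@user"))

# Convert directly to a tokens object without further processing.
toks_raw <- as.tokens(lis)

# Apply remove_* options afterwards, as post-processing on the tokens.
toks_proc <- tokens(toks_raw, remove_punct = TRUE)

# Infix hyphens are preserved unless split_hyphens = TRUE.
toks_keep <- tokens("a self-aware machine")
toks_split <- tokens("a self-aware machine", split_hyphens = TRUE)
```

Note that the hashtag and username survive `remove_punct = TRUE`, in line with the treatment of social media tags described in the last item.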
New features
- Changed the default value of the `size` argument in `dfm_sample()` to the number of features, not the number of documents. (#1643)
- Fixes a few CRAN-related issues (compiler warnings on Solaris and encoding warnings on r-devel-linux-x86_64-debian-clang.)
- Added `startpos` and `endpos` arguments to `tokens_select()`, for selecting on token positions relative to the start or end of the tokens in each document. (#1475)
- Added a `convert()` method for corpus objects, to convert them into data.frame or json formats.
- Added a `spacy_tokenize()` method for corpus objects, to provide direct access via the spacyr package.
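A short sketch of positional selection with `startpos` and `endpos`, assuming a current quanteda installation. Negative positions are taken to count back from the last token of each document; the toy text is illustrative.

```r
library(quanteda)

toks <- tokens("a b c d e")

# Match everything ("*"), but only between the second token and the
# second-to-last token of each document.
toks_mid <- tokens_select(toks, "*", startpos = 2, endpos = -2)

as.list(toks_mid)
```

Because the positions are relative to each document, the same call trims the first and last token from documents of any length.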
Behaviour changes
- Added a `force = TRUE` option and error checking for the situations of applying `dfm_weight()` or `dfm_group()` to a dfm that has already been weighted. (#1545) The function `textstat_frequency()` now allows passing this argument to `dfm_group()` via `...`. (#1646)
- `textstat_frequency()` now has a new argument for resolving ties when ranking term frequencies, defaulting to the "min" method. (#1634)
- New docvars accessor and replacement functions are available for corpus, tokens, and dfm objects via `$`. (See Index Operators for Core Objects above.)
- `textstat_entropy()` now produces a data.frame that is more consistent with other `textstat` methods. (#1690)
Bug fixes and stability enhancements
- docnames now enforced to be character (formerly, could be numeric for some objects).
- docnames are now enforced to be strictly unique for all object classes.
- Grouping operations in `tokens_group()` and `dfm_group()` are more robust to using multiple grouping variables, and preserve these correctly as docvars in the new dfm. (#1809)
- Some fixes to documented `...` objects in two functions that were previously causing CRAN check failures on the release of 1.5.2.
Other improvements
- All of the (three) included corpus objects have been cleaned up and augmented with improved meta-data and docvars. The inaugural speech corpus, for instance, now includes the President's political party affiliation.
Published by kbenoit almost 6 years ago
quanteda - CRAN v1.5.2
Last 1.x.x release before major changes in v2.
New features
- Added Yule's I to `textstat_lexdiv()`.
- Added forward compatibility for newer (v2) corpus class objects.
- Added a new function `featfreq()` to compute the overall feature frequencies from a dfm.
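As a quick illustration of `featfreq()`, here is a sketch on a toy dfm (the texts are made up; a current quanteda installation is assumed):

```r
library(quanteda)

dfmat <- dfm(tokens(c(d1 = "a b b", d2 = "b c")))

# Overall frequency of each feature, summed across all documents.
freq <- featfreq(dfmat)
freq
```

The result is a named numeric vector with one element per feature.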
Bug fixes
- Fixed a bug in `tokens_lookup()` when `exclusive = FALSE` and the tokens object has paddings. (#1743)
- Fixed a bug in `tokens_replace()` (#1765).
Published by kbenoit about 6 years ago
quanteda - CRAN v1.5.1
New features
- Added `omit_empty` as an argument to `convert()`, to allow the user to control whether empty documents are excluded from converted dfm objects for certain formats. (#1660)
Bug fixes and stability enhancements
- Fixed a bug that affects the new `textstat_dist()` and `textstat_simil()` (#1730)
- Fixed a bug in how `textstat_dist()` and `textstat_simil()` class symmetric matrices.
Published by kbenoit over 6 years ago
quanteda - CRAN v1.5.0
New features
- Add `flatten` and `levels` arguments to `as.list.dictionary2()` to enable more flexible conversion of dictionary objects. (#1661)
- In `corpus_sample()`, the `size` now works with the `by` argument, to control the size of units sampled from each group.
- Improvements to `textstat_dist()` and `textstat_simil()`, see below.
- Long tokens are not discarded automatically in the call to `tokens()`. (#1713)
Behaviour changes
- `textstat_dist()` and `textstat_simil()` now return sparse symmetric matrix objects using classes from the Matrix package. This replaces the former structure based on the `dist` class. Computation of these classes is now also based on the fast implementation in the proxyC package. When computing similarities, the new `min_simil` argument allows a user to ignore certain values below a specified similarity threshold. A new coercion method `as.data.frame.textstat_simildist()` now exists for converting these returns into a data.frame of pairwise comparisons. Existing methods such as `as.matrix()`, `as.dist()`, and `as.list()` work as they did before.
- We have removed the "faith", "chi-squared", and "kullback" methods from `textstat_dist()` and `textstat_simil()` because these were either not symmetric or not invariant to document or feature ordering. Finally, the `selection` argument has been deprecated in favour of a new `y` argument.
- `textstat_readability()` now defaults to `measure = "Flesch"` if no measure is supplied. This makes it consistent with `textstat_lexdiv()` that also takes a default measure ("TTR") if none is supplied. (#1715)
- The default values for `max_nchar` and `min_nchar` in `tokens_select()` are now NULL, meaning they are not applied if the user does not supply values. Fixes #1713.
Bug fixes and stability enhancements
- `kwic.corpus()` and `kwic.tokens()` behaviour now aligned, meaning that dictionaries are correctly faceted by key instead of by value. (#1684)
- Improved formatting of `tokens()` verbose output. (#1683)
- Subsetting and printing of subsetted kwic objects is more robust. (#1665)
- The "Bormuth" and "DRP" measures are now fixed for `textstat_readability()`. (#1701)
Published by kbenoit over 6 years ago
quanteda - CRAN v1.4.3
Bug fixes and stability enhancements
- Changed the default value of the size argument in dfm_sample() to the number of features, not the number of documents. (#1643)
- Fixes a few CRAN-related issues (compiler warnings on Solaris and encoding warnings on r-devel-linux-x86_64-debian-clang).
Behaviour changes
- Added a force = TRUE option and error checking for the situations of applying dfm_weight() or dfm_group() to a dfm that has already been weighted. (#1545) The function textstat_frequency() now allows passing this argument to dfm_group() via .... (#1646)
- textstat_frequency() now has a new argument for resolving ties when ranking term frequencies, defaulting to the "min" method. (#1634)
Published by kbenoit almost 7 years ago
quanteda - CRAN v1.4.1
quanteda 1.4.1
Bug fixes and stability enhancements
- Fixed an issue with special handling of whitespace variants that caused a test to fail when running on an Ubuntu 18.10 system with libicu-dev version 63.1 (#1604).
- Fixed the operation of docvars<-.corpus() in a way that solves #1603 (reassignment of docvar names).
Published by kbenoit almost 7 years ago
quanteda - CRAN v1.4.0
Bug fixes and stability enhancements
- Fixed bug in dfm_compress() and dfm_group() that changed or deleted docvars attributes of dfm objects (#1506).
- Fixed a bug in textplot_xray() that caused incorrect facet labels when a pattern contained multiple list elements or values (#1514).
- kwic() now correctly returns the pattern associated with each match as the "keywords" attribute, for all pattern types (#1515).
- Implemented some improvements in efficiency and computation of unusual edge cases for textstat_simil() and textstat_dist().
New features
- textstat_lexdiv() now works on tokens objects, not just dfm objects. New methods of lexical diversity now include MATTR (the Moving-Average Type-Token Ratio, Covington & McFall 2010) and MSTTR (Mean Segmental Type-Token Ratio).
- New function tokens_split() allows splitting single tokens into multiple tokens based on a pattern match. (#1500)
- New function tokens_chunk() allows splitting tokens into new documents of equally-sized "chunks". (#1520)
- New function textstat_entropy() now computes entropy for a dfm across feature or document margins.
- The documentation for textstat_readability() is vastly improved, now detailing all formulas and providing full references.
- New function dfm_match() allows a user to specify the features in a dfm according to a fixed vector of feature names, including those of another dfm. Replaces dfm_select(x, pattern) where pattern was a dfm.
- A new argument vertex_labelsize added to textplot_network() to allow more precise control of label sizes, either globally or individually.
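The following is a minimal sketch of how three of the new functions above might be used together, written against a recent quanteda release rather than v1.4.0 specifically. The toy tokens and object names are assumptions made for illustration; tokens are built with as.tokens() so the example does not depend on tokenizer defaults.

```r
library(quanteda)

# tokens_split(): split single tokens into several on a separator
toks_hyph <- as.tokens(list(d1 = c("well-known", "example")))
toks_split <- tokens_split(toks_hyph, separator = "-")

# tokens_chunk(): cut a document into new documents of (at most) `size` tokens
toks_long <- as.tokens(list(d1 = c("one", "two", "three", "four", "five")))
toks_chunks <- tokens_chunk(toks_long, size = 2)
ndoc(toks_chunks)  # five tokens in chunks of two give three documents

# dfm_match(): make one dfm's features conform to those of another
dfmat_a <- dfm(as.tokens(list(d1 = c("a", "b", "c"))))
dfmat_b <- dfm(as.tokens(list(d2 = c("b", "c", "d"))))
dfmat_matched <- dfm_match(dfmat_a, features = featnames(dfmat_b))
featnames(dfmat_matched)
```

After dfm_match(), features absent from the target vector are dropped and missing ones are padded with zero counts, which is useful when a fitted model expects a fixed feature set.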
Behaviour changes
- tokens.tokens(x, remove_hyphens = TRUE) where x was generated with remove_hyphens = FALSE now behaves similarly to how the same tokens would be handled had this option been called on character input as tokens.character(x, remove_hyphens = TRUE). (#1498)
Published by kbenoit about 7 years ago
quanteda - CRAN v1.3.14
quanteda v.1.3.14
Bug fixes and stability enhancements
- Improved the robustness of textstat_keyness() (#1482).
- Improved the accuracy of sparsity reporting for the print method of a dfm (#1473).
New Features
- Added the following measures to textstat_lexdiv(): Yule's K, Simpson's D, and Herdan's Vm.
Published by kbenoit over 7 years ago
quanteda - CRAN v1.3.13
Bug fixes and stability enhancements
- Fixed a bug causing incorrect counting in fcm(x, ordered = TRUE). (#1413) Also set the condition that window can be of size 1 (formerly the limit was 2 or greater).
- Fixed deprecation warnings from adding a dfm as docvars, and this now imports the feature names as docvar names automatically. (related to #1417)
- Fixed behaviour from tokens(x, what = "fasterword", remove_separators = TRUE) so that it correctly splits words separated by \n and \t characters. (#1420)
- Add error checking for functions taking dfm inputs in case a dfm has empty features (#1419).
- For textstat_readability(), fixed a bug in Dale-Chall-based measures and in the Spache word list measure. These were caused by an incorrect lookup mechanism but also by limited implementation of the wordlists. The new wordlists include all of the variations called for in the original measures, but using fast fixed matching. (#1410)
- Fixed problems with basic dfm operations (rowMeans(), rowSums(), colMeans(), colSums()) caused by not having access to the Matrix package methods. (#1428)
- Fixed problem in textplot_scale1d() when the input is a predicted wordscores object with se.fit = TRUE (#1440).
- Improved the stability of textplot_network(). (#1460)
New Features
- Added new argument intermediate to textstat_readability(x, measure, intermediate = FALSE), which if TRUE returns intermediate quantities used in the computation of readability statistics. Useful for verification or direct use of the intermediate quantities.
- Added a new separator argument to kwic() to allow a user to define which characters will be added between tokens returned from a keywords in context search. (#1449)
- Reimplemented textstat_dist() and textstat_simil() in C++ for enhanced performance. (#1210)
- Added a tokens_sample() function (#1478).
Behaviour changes
- Removed the Hamming distance method from textstat_dist() (#1443), based on the reasoning in #1442.
- Removed the "chisquared" and "chisquared2" distance measures from textstat_simil(). (#1442)
Published by kbenoit over 7 years ago
quanteda - (not accepted by CRAN 😞) v1.3.10
Prepared for and submitted to CRAN. This is the version current at the publication of the JOSS article about quanteda.
Published by kbenoit over 7 years ago
quanteda - CRAN v1.3.0
New Features
- Added to = "tripletlist" output type for convert(), to convert a dfm into a simple triplet list. (#1321)
- Added tokens_tortl() and char_tortl() to add markers for right-to-left language tokens and character objects. (#1322)
Behaviour changes
- Improved corpus.kwic() by adding new arguments split_context and extract_keyword.
- dfm_remove(x, selection = anydfm) is now equivalent to dfm_remove(x, selection = featnames(anydfm)). (#1320)
- Improved consistency of predict.textmodel_nb() returns, and added type = argument. (#1329)
Bug fixes
- Fixed a bug in textmodel_affinity() that caused failure when the input dfm had been compiled with tolower = FALSE. (#1338)
- Fixed a bug affecting tokens_lookup() and dfm_lookup() when nomatch is used. (#1347)
- Fixed a problem whereby NA texts created a "document" (or tokens) containing "NA" (#1372).
Published by kbenoit over 7 years ago
quanteda - CRAN v1.2.0
New Features
- Added an nsentence() method for spacyr parsed objects. (#1289)
Bug fixes and stability enhancements
- Fix bug in nsyllable() that incorrectly handled cased words, and returned wrong names with use.names = TRUE. (#1282)
- Fix the overwriting of summary.character() caused by previous import of the network package namespace. (#1285)
- dfm_smooth() now correctly sets the smooth value in the dfm (#1274). Arithmetic operations on dfm objects are now much more consistent and do not drop attributes of the dfm, as sometimes happened with earlier versions.
Behaviour changes
- tokens_toupper() and tokens_tolower() no longer remove unused token types. Solves #1278.
- dfm_trim() now takes more options, and these are implemented more consistently. min_termfreq and max_termfreq have replaced min_count and max_count, and these can be modified using a termfreq_type argument. (Similar options are implemented for docfreq_type.) Solves #1253, #1254.
- textstat_simil() and textstat_dist() now take valid dfm indexes for the relevant margin for the selection argument. Previously, this could also be a direct vector or matrix for comparison, but this is no longer allowed. Solves #1266.
- Improved performance for dfm_group() (#1295).
Published by kbenoit almost 8 years ago
quanteda - CRAN v1.1.1
Changed the default number of threads to 2.
Published by kbenoit almost 8 years ago
quanteda - CRAN v1.1.0
New Features
- Added as.dfm() methods for tm DocumentTermMatrix and TermDocumentMatrix objects. (#1222)
- predict.textmodel_wordscores() now includes an include_reftexts argument to exclude training texts from the predicted model object (#1229). The default behaviour is include_reftexts = TRUE, producing the same behaviour as existed before the introduction of this argument. This allows rescaling based on the reference documents (since rescaling requires prediction on the reference documents) but provides an easy way to exclude the reference documents from the predicted quantities.
- textplot_wordcloud() now uses code entirely internal to quanteda, instead of using the wordcloud package.
Bug fixes and stability enhancements
- Eliminated unnecessary dependency on the digest package.
- Updated the vignette title to be less generic.
- Improved the robustness of dfm_trim() and dfm_weight() for previously weighted dfm objects and when supplied thresholds are proportions instead of counts. (#1237)
- Fixed a problem in summary.corpus(x, n = 101) when ndoc(x) > 100 (#1242).
- Fixed a problem in predict.textmodel_wordscores(x, rescaling = "mv") that always reset the reference values for rescaling to the first and second documents (#1251).
- Issues in the color generation and labels for textplot_keyness() are now resolved (#1233, #1233).
Performance improvements
- textmodel methods are now exported, to facilitate extension packages for other textmodel methods (e.g. wordshoal).
Behaviour changes
- Changed the default in textmodel_wordfish() to sparse = FALSE, in response to #1216.
- dfm_group() now preserves docvars that are constant for the group aggregation (#1228).
Published by kbenoit almost 8 years ago
quanteda - CRAN v1.0.0
New Features
- Added vertex_labelfont to textplot_network().
- Added textmodel_lsa() for Latent Semantic Analysis models.
- Added textmodel_affinity() for the Perry and Benoit (2017) class affinity scaling model.
- Added Chinese stopwords.
- Added a pkgdown vignette for applications in the Chinese language.
- Added textplot_network() function.
- The stopwords() function and the associated internal data object data_char_stopwords have been removed from quanteda, and replaced by equivalent functionality in the stopwords package.
- Added tokens_subset(), now consistent with other *_subset() functions (#1149).
Bug fixes and stability enhancements
- Performance has been improved for fcm() and for textmodel_wordfish().
- dfm() now correctly passes through all ... arguments to tokens(). (#1121)
- All dfm_*() functions now work correctly with empty dfm objects. (#1133)
- Fixed a bug in dfm_weight() for named weight vectors (#1150).
- Fixed a bug preventing textplot_influence() from working (#1116).
Behaviour Changes
- The convenience wrappers to convert() are simplified and no longer exported. To convert a dfm, convert() is now the only official function.
- nfeat() replaces nfeature(), which is now deprecated. (#1134)
- textmodel_wordshoal() has been removed, and relocated to a new package (wordshoal).
- The generic wrapper function textmodel(), which used to be a gateway to specific textmodel_*() functions, has been removed.
- (Most of) the textmodel_*() functions have been reimplemented to make their behaviour consistent with the lm/glm() families of models, including especially how the predict, summary, and coef methods work (#1007, #108).
- The GitHub home for the repository has been moved to https://github.com/quanteda/quanteda.
Published by kbenoit about 8 years ago
quanteda - CRAN v0.99.22
New Features
- tokens_select() has a new window argument, permitting selection within an asymmetric window around the pattern of selection. (#521)
- tokens_replace() now allows token types to be substituted directly and quickly.
- Added a spacy_parse method for corpus objects. Also restored quanteda methods for spacyr spacy_parsed objects.
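To make the asymmetric window concrete, here is a small sketch using a recent quanteda release. The single-letter tokens are made up for illustration; window = c(1, 2) is read as one token before and two tokens after each match.

```r
library(quanteda)

# toy tokens object, invented for illustration
toks <- as.tokens(list(d1 = c("a", "b", "c", "d", "e", "f", "g")))

# keep "d" together with one token before and two tokens after it
toks_win <- tokens_select(toks, pattern = "d", window = c(1, 2))
print(toks_win)

# tokens_replace(): substitute token types one-for-one
toks_rep <- tokens_replace(toks, pattern = c("a", "b"),
                           replacement = c("x", "y"))
print(toks_rep)
```

The pattern and replacement vectors passed to tokens_replace() are matched by position, so they need to be the same length.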
Bug fixes and stability enhancements
- Improved documentation for textmodel_nb() (#1010), and made output quantities from the fitted NB model regular matrix objects instead of Matrix classes.
Behaviour Changes
- All of the deprecated functions are now removed. (#991)
- tokens_group() is now significantly faster.
- The deprecated "list of characters" tokenize() function and all methods associated with the tokenizedTexts object types have been removed.
- Added convenience functions for keeping tokens or features: tokens_keep(), dfm_keep(), and fcm_keep(). (#1037)
- textmodel_NB() has been replaced by textmodel_nb().
Published by kbenoit over 8 years ago
quanteda - CRAN v0.99.12
Changes since v0.99.9
New Features
- Added methods for changing the docnames of tokens and dfm objects (#987).
Bug fixes and stability enhancements
- The computation of tfidf has been more thoroughly described in the documentation for this function (#997).
- Now depends on R >= 3.4.0, to avoid showing errors in r-oldrelease builds.
Published by kbenoit over 8 years ago
quanteda - CRAN v.99.9
Changes since v0.99
New Features
- Added magrittr pipe support (#927). %>% can now be used with quanteda without needing to attach magrittr (or, as many users apparently believe, the entire tidyverse).
- corpus_segment() now behaves more logically and flexibly, and is clearly differentiated from corpus_reshape() in terms of its functionality. Its documentation is also vastly improved. (#908)
- Added data_dictionary_LSD2015, the Lexicoder Sentiment 2015 dictionary (#963).
- Significant improvements to the performance of tokens_lookup() and dfm_lookup() (#960).
- New functions head.corpus(), tail.corpus() provide fast subsetting of the first or last documents in a corpus. (#952)
Bug fixes and stability enhancements
- Fixed a problem when applying purrr::map() to dfm() (#928).
- Added documentation for regex2fixed() and associated functions.
- Fixed a bug in textstat_collocations.tokens() caused by "documents" containing only "" as tokens. (#940)
- Fixed a bug caused by cbind.dfm() when features shared a name starting with quanteda_options("base_featname") (#946).
- Improved dictionary handling and creation, which now correctly handles nested LIWC 2015 categories. (#941)
- Number of threads now set correctly by quanteda_options(). (#966)
Behaviour changes
- summary.corpus() now generates a special data.frame, which has its own print method, rather than requiring verbose = FALSE to suppress output (#926).
- textstat_collocations() is now multi-threaded.
- head.dfm(), tail.dfm() now behave consistently with base R methods for matrix, with the added argument nfeature. Previously, these methods printed the subset and invisibly returned it. Now, they simply return the subset. (#952)
Published by kbenoit over 8 years ago
quanteda - CRAN v0.99
New features
- Improvements and consolidation of methods for detecting multi-word expressions, now active only through textstat_collocations(), which computes only the lambda method for now, but does so accurately and efficiently. (#753, #803). This function is still under development and likely to change further.
- Added new quanteda_options that affect the maximum documents and features displayed by the dfm print method (#756).
- ngram formation is now significantly faster, including with skips (skipgrams).
- Improvements to topfeatures():
  - now accepts a groups argument that can be used to generate lists of top (or bottom) features in a group of texts, including by document (#336).
  - new argument scheme that takes the default of (frequency) "count" but also a new "docfreq" value (#408).
- New wrapper phrase() converts whitespace-separated multi-word patterns into a list of patterns. This affects the feature/pattern matching in tokens/dfm_select/remove, tokens_compound, tokens/dfm_lookup, and kwic. phrase() and the associated changes also make the behaviour of using character vectors, lists of characters, dictionaries, and collocation objects for pattern matches far more consistent. (See #820, #787, #740, #837, #836, #838)
- corpus.Corpus() for creating a corpus from a tm Corpus now works with more complex objects that include document-level variables, such as data from the manifestoR package (#849).
- New plot function textplot_keyness() plots term "keyness", the association of words with contrasting classes as measured by textstat_keyness().
- Added corpus constructor for corpus objects (#690).
- Added dictionary constructor for dictionary objects (#690).
- Added a tokens constructor for tokens objects (#690), including updates to tokens() that improve the consistency and efficiency of the tokenization.
- Added new quanteda_options(): language_stemmer and language_stopwords, now used as the defaults in *_wordstem functions and stopwords(), respectively. Also uses this option in dfm() when stem = TRUE, rather than hard-wiring in the "english" stemmer (#386).
- Added a new function textstat_frequency() to compile feature frequencies, possibly by groups. (#825)
- Added nomatch option to tokens_lookup() and dfm_lookup(), to provide tokens or feature counts for categories not matched to any dictionary key. (#496)
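A brief sketch of phrase() and the nomatch option, written against a recent quanteda release, may help show how multi-word patterns are matched. The example sentence, the dictionary keys, and the "other" label are all invented for illustration.

```r
library(quanteda)

toks <- tokens("The United States and the European Union signed a treaty")

# phrase() marks whitespace-separated patterns as token sequences
kw <- kwic(toks, pattern = phrase("united states"))
print(kw)

# the same wrapper drives compounding of multi-word expressions
toks_comp <- tokens_compound(toks,
                             pattern = phrase(c("united states", "european union")))
print(toks_comp)

# dictionary values containing spaces are matched as sequences;
# nomatch supplies a label for tokens matching no dictionary key
dict <- dictionary(list(country = "united states", org = "european union"))
toks_look <- tokens_lookup(toks, dictionary = dict, nomatch = "other")
print(toks_look)
```

Without phrase(), a character pattern such as "united states" is compared against single tokens and therefore matches nothing, which is the inconsistency the wrapper was introduced to remove.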
Behaviour changes
- The functions sequences() and collocations() have been removed and replaced by textstat_collocations().
- (Finally) we added "will" to the list of English stopwords (#818).
- dfm objects with one or both dimensions having zero length, and empty kwic objects now display more appropriately in their print methods (per #811).
- Pattern matches are now implemented more consistently across functions. In functions such as *_select, *_remove, tokens_compound, features has been replaced by pattern, and in kwic, keywords has been replaced by pattern. These all behave consistently with respect to pattern, which now has a unified single help page and parameter description. (#839) See also above new features related to phrase().
- We have improved the performance of the C++ routines that handle many of the tokens_* functions using hashed tokens, making some of them 10x faster (#853).
- Upgrades to the dfm_group() function now allow "empty" documents to be created using the fill = TRUE option, for making documents conform to a selection (similar to how dfm_select() works for features, when supplied a dfm as the pattern argument). The groups argument now behaves consistently across the functions where it is used. (#854)
- dictionary() now requires its main argument to be a list, not a series of elements that can be used to build a list.
- Some changes to the behaviour of tokens() have improved the behaviour of remove_hyphens = FALSE, which now behaves more correctly regardless of the setting of remove_punct (#887).
- Improved cbind.dfm() function allows cbinding vectors, matrixes, and (recyclable) scalars to dfm objects.
Bug fixes and stability enhancements
- For the underlying methods behind textstat_collocations(), we corrected the word matching, and lambda and z calculation methods, which were slightly incorrect before. We also removed the chi2, G2, and pmi statistics, because these were incorrectly calculated for size > 2.
- LIWC-formatted dictionary import is now robust to term assignment to missing categories.
- textmodel_NB(x, y, distribution = "Bernoulli") was previously inactive even when this option was set. It has now been fully implemented and tested (#776, #780).
- Separators including rare spacing characters are now handled more robustly by the remove_separators argument in tokens(). See #796.
- Improved memory usage when computing ntoken() and ntype(). (#795)
- Improvements to quanteda_options(): it no longer throws an error when quanteda functions are called directly without attaching the package. In addition, quanteda options can now be set in .Rprofile and will not be overwritten when the options initialization takes place when attaching the package.
- Fixed a bug in textstat_readability() that wrongly computed the number of words with fewer than 3 syllables in a text; this affected the FOG.NRI and the Linsear.Write measures only.
- Fixed mistakes in the computation of two docfreq schemes: "logave" and "inverseprob".
- Fixed a bug in the handling of multi-thread options where the settings using quanteda_options() did not actually set the number of threads. In addition, we fixed a bug causing threading to be turned off on macOS (due to a check for a gcc version that is not used for compiling the macOS binaries), which prevented multi-threading from being used at all on that platform.
- Fixed a bug causing failure when functions that use quanteda_options() are called without the namespace or package being attached or loaded (#864).
- Fixed a bug in overloading the View method that caused all named objects in the RStudio/Source pane to be named "x". (#893)
Published by kbenoit over 8 years ago
quanteda - CRAN v0.9.9.65
Changes since v0.9.9-50
New features
- Corpus construction using corpus() now works for a tm::SimpleCorpus object. (#680)
- Added corpus_trim() and char_trim() functions for selecting documents or subsets of documents based on sentence, paragraph, or document lengths.
- Conversion of a dfm to an stm object now passes docvars through in the $meta of the return object.
- New dfm_group(x, groups = ) command, a convenience wrapper around dfm.dfm(x, groups = ) (#725).
- Methods for extending quanteda functions to readtext objects updated to match CRAN release of readtext package.
- Corpus constructor methods for data.frame objects now conform to the "text interchange format" for corpus data.frames, automatically recognizing doc_id and text fields, which also provides interoperability with the readtext package. Corpus construction methods are now more explicitly tailored to input object classes.
Bug fixes and stability enhancements
- dfm_lookup() behaves more robustly on different platforms, especially for keys whose values match no features (#704).
- textstat_simil() and textstat_dist() no longer take the n argument, as this was not sorting features in correct order.
- Fixed failure of tokens(x, what = "character") when x included Twitter characters @ and # (#637).
- Fixed bug #707 where ntype.dfm() produced an incorrect result.
- Fixed bug #706 affecting textstat_readability() and textstat_lexdiv() for single-document returns when drop = TRUE.
- Improved the robustness of corpus_reshape().
- print, head, and tail methods for dfm are more robust (#684).
- Fixed bug in convert(x, to = "stm") caused by zero-count documents and zero-count features in a dfm (#699, #700, #701). This also removes docvar rows from $meta when this is passed through the dfm, for zero-count documents.
- Corrected broken handling of nested Yoshikoder dictionaries in dictionary(). (#722)
- dfm_compress now preserves a dfm's docvars if collapsing only on the features margin, which means that dfm_tolower() and dfm_toupper() no longer remove the docvars.
- fcm_compress() now retains the fcm class, and generates an error when an asymmetric compression is attempted (#728).
- textstat_collocations() now returns the collocations as character, not as a factor (#736).
- Fixed a bug in dfm_lookup(x, exclusive = FALSE) wherein an empty dfm was returned when there was no match (#116).
- Argument passing through dfm() to tokens() is now robust, and preserves variables defined in the calling environment (#721).
- Fixed issues related to dictionaries failing when applying str(), names(), or other indexing operations, which started happening on Linux and Windows platforms following the CRAN move to 3.4.0. (#744)
- Dictionary import using the LIWC format is more robust to improperly formatted input files (#685).
- Weights applied using dfm_weight() now print friendlier error messages when the weight vector contains features not found in the dfm. This improvement was prompted by a use case raised in a Stack Overflow question.
Published by kbenoit over 8 years ago
quanteda - CRAN v0.9.9.50
Published by kbenoit almost 9 years ago
quanteda - CRAN v0.9.9-22
Minor fixes in C++ release to comply with CRAN checks on lesser-used platforms.
Published by kbenoit almost 9 years ago
quanteda - CRAN v0.9.9-24
New since v0.9.9-17
Fixes incompatibilities on older compiler platforms.
Published by kbenoit about 9 years ago
quanteda - CRAN v0.9.9-17
Bug fixes and minor feature additions.
Changes since v0.9.9-3
Bug fixes
- Fixed a bug causing dfm and tokens to break on > 10,000 documents. (#438)
- Fixed a bug in tokens(x, what = "character", removeSeparators = TRUE) that returned an empty string.
- Fixed a bug in corpus.VCorpus if the VCorpus contains a single document. (#445)
- Fixed a bug in dfm_compress in which the function failed on documents that contained zero feature counts. (#467)
- Fixed a bug in textmodel_NB that caused the class priors Pc to be refactored alphabetically instead of in the order of assignment (#471), also affecting predicted classes (#476).
New features
- New textstat function textstat_keyness() discovers words that occur at differential rates between partitions of a dfm (using chi-squared, Fisher's exact test, and the G^2 likelihood ratio test to measure the strength of associations).
- Added 2017-Trump to the inaugural corpus datasets (data_corpus_inaugural and data_char_inaugural).
- Improved the groups argument in texts() (and in dfm() that uses this function), which will now coerce to a factor rather than requiring one.
- Added a dfm constructor from dfm objects, with the option of collapsing by groups.
- Added new arguments to sequences(): ordered and max_length, the latter to prevent memory leaks from extremely long sequences.
- dictionary() now accepts YAML as an input file format.
- dfm_lookup and tokens_lookup now accept a levels argument to determine which level of a hierarchical dictionary should be applied.
- Added min_nchar and max_nchar arguments to dfm_select.
- dictionary() can now be called on the argument of a list() without explicitly wrapping it in list().
- fcm now works directly on a dfm object when context = "documents".
Published by kbenoit about 9 years ago
quanteda - CRAN release v0.9.9-3
Major new update published on CRAN on 2016-01-10. This is a pre-v1.0 release that implements major API changes while still retaining nearly all of the old functions, but hidden and deprecated. See NEWS.md.
Published by kbenoit about 9 years ago
quanteda - CRAN release v0.9.8.5
Added
- CITATION file
Bug Fixes
- (0.9.8.5) Fixed an incompatibility in sequences.cpp with Solaris x86 (#257)
- (0.9.8.4) Fix bug in verbose output of dfm that causes misreporting of number of features (#250)
- (0.9.8.4) Fix a bug in selectFeatures.dfm() that ignored case_insensitive = TRUE settings (#251); correct the documentation for this function.
Published by kbenoit over 9 years ago
quanteda - CRAN release 0.9.8.3
Bug fixes applied to 0.9.8
- Fix a bug in tf(x, scheme = "propmax") that returned a wrong computation; correct the documentation for this function.
- Fixed a bug in textfile() causing all texts to have the same name, for types using the "textField" argument (a single file containing multiple documents).
Published by kbenoit over 9 years ago