Train a Similarity-Based Probability Model with Anonymized Training Data
Source: R/model_training.R
This function requires the mvtnorm package.
Usage
train_similarity_based_reasoning(
  anonymized_data,
  num_allowed_codes = 1291,
  coding_index_w_codes,
  coding_index_without_codes = NULL,
  preprocessing = list(stopwords = NULL, stemming = NULL, strPreprocessing = TRUE,
    removePunct = FALSE),
  dist_type = c("wordwise", "substring", "fulltext"),
  dist_control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = c(max = 3, use = 1),
  simulation_control = list(n.draws = 250, check_normality = FALSE)
)
Arguments
- anonymized_data
surveyCountsSubstringSimilarity or surveyCountsWordwiseSimilarity
- num_allowed_codes
the number of allowed codes in the target classification. There are 1286 categories in the KldB 2010 plus 5 special codes in both anonymized training data sets, so the default value is 1291.
- coding_index_w_codes
a data.table with columns
- bezMale
a character vector containing masculine job titles from the coding index.
- bezFemale
a character vector containing feminine job titles from the coding index.
- Code
a character vector with the associated classification codes.
- coding_index_without_codes
(not used, but determined automatically) Any words from anonymized_data$dictString that are not found within coding_index_w_codes belong in this character vector.
- preprocessing
a list with elements
- stopwords
a character vector; use tm::stopwords("de") for German stopwords. Only used if dist_type = "wordwise".
- stemming
NULL for no stemming, "de" for stemming with the German Porter stemmer. Do not use unless the job titles in coding_index_w_codes were stemmed.
- strPreprocessing
TRUE if preprocess_string shall be used.
- removePunct
TRUE if removePunctuation shall be used.
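For illustration, a preprocessing specification for German free-text answers might look as follows (a minimal sketch, assuming the tm package is installed for the stopword list):

# sketch of a preprocessing list for German verbal answers
preprocessing <- list(
  stopwords = tm::stopwords("de"),  # German stopwords, only relevant for dist_type = "wordwise"
  stemming = NULL,                  # no stemming, because the coding index is not stemmed either
  strPreprocessing = TRUE,          # apply preprocess_string
  removePunct = FALSE               # do not apply removePunctuation
)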
- dist_type
How to calculate the similarity between entries from both coding indices and verbal answers from the survey? Three options are currently supported. Since the stringdist function is used extensively, the procedure could easily be extended to other distance metrics.
- dist_type = "fulltext"
Uses the stringdist function directly after preprocessing to calculate distances (the simplest approach, but the least useful).
- dist_type = "substring"
An entry from the coding index and a verbal answer are similar if the entry from the coding index is a substring of the verbal answer.
- dist_type = "wordwise"
After preprocessing, split the verbal answer into words. Then calculate, for each word separately, the similarity with entries from the coding index, using stringdist. Not the complete verbal answer but only the words (0 or more) that have the highest similarity are then used to determine similarity with entries from the coding index.
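To illustrate the "wordwise" option, the following sketch (not the internal implementation; answer and entry are made-up examples) compares each word of a verbal answer separately with a single coding index entry:

library(stringdist)
answer <- "leitender kfz mechatroniker"  # hypothetical verbal answer after preprocessing
entry  <- "mechatroniker"                # hypothetical coding index entry
words  <- strsplit(answer, " ")[[1]]
# distance of every single word to the coding index entry
sapply(words, stringdist, b = entry,
       method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1))
# only the word(s) with the smallest distance determine the similarity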
- dist_control
If dist_type = "fulltext" or dist_type = "wordwise", the entries from this list are passed to stringdist. Currently only two entries are supported (method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1) is recommended), but the functionality could easily be extended.
- threshold
A numeric vector with two elements. If dist_type = "fulltext" or dist_type = "wordwise", the threshold determines up to which distance a verbal answer and an entry from the coding index count as similar. Only the second number is actually used; the first number merely speeds up the similarity calculations and should be identical to or larger than the second number. See the sketch after this argument list.
- simulation_control
a list with two components:
- n.draws
Number of draws from the posterior distribution used to determine posterior predictive probabilities. The larger this number, the more precise the results will be.
- check_normality
We would like the hyperprior distribution to be normal. Set check_normality to TRUE to run some diagnostics on this.
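The interplay of dist_control and threshold can be sketched as follows (again not the internal implementation; the two strings are made up). A verbal answer and a coding index entry count as similar if their string distance does not exceed the second element of threshold:

library(stringdist)
threshold <- c(max = 3, use = 1)
d <- stringdist("verkaufer", "verkäufer",
                method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1))
d <= threshold[["use"]]  # TRUE: one substitution, distance 1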
Value
a list with components
- prediction.datasets$modelProb
Contains all entries from the coding index. dist = "official" if the entry stems from coding_index_w_codes and dist = "selfcreated" if the entry stems from coding_index_without_codes. string.prob is used for weighting purposes (model averaging) if a new verbal answer is similar to multiple strings. unobserved.mean.theta gives a probability (usually very low) for any category that was not observed in the training data together with this string.
- prediction.datasets$categoryProb
mean.theta is the probability for code given that an incoming verbal answer is similar to string. Only available if this code was observed at least once together with this string (use unobserved.mean.theta otherwise).
- num_allowed_codes
Number of categories in the classification.
- preprocessing
The input parameter stored to replicate preprocessing with incoming data.
- dist_type
The input parameter stored to replicate distance calculations with incoming data.
- dist_control
The input parameter stored to replicate distance calculations with incoming data.
- threshold
The input parameter stored to replicate distance calculations with incoming data.
- simulation_control
The input parameters controlling the Monte Carlo simulation.
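Assuming the returned list is stored in an object named model (a hypothetical name), the components described above could be inspected like this:

head(model$prediction.datasets$modelProb)     # entries with dist, string.prob, unobserved.mean.theta
head(model$prediction.datasets$categoryProb)  # string, code, mean.theta
model$num_allowed_codes                       # 1291 with the default settings
model$threshold                               # stored to replicate distance calculations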
References
Schierholz, Malte (2019): New methods for job and occupation classification. Dissertation, Mannheim. https://madoc.bib.uni-mannheim.de/50617/, pp. 206-208 and p. 268, pp. 308-320
https://github.com/malsch/occupationCoding (the function trainSimilarityBasedReasoning2 is implemented there)
See also
pretrained_models, which were created using this function.
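Examples

A minimal sketch of a training run (not run by default, since training can take a while). The coding index is a made-up toy example with placeholder codes, and the anonymized training data set surveyCountsWordwiseSimilarity is assumed to be available in the session:

library(data.table)
# toy coding index with placeholder codes (not real KldB 2010 codes)
toy_index <- data.table(
  bezMale   = c("Koch", "Verkäufer"),
  bezFemale = c("Köchin", "Verkäuferin"),
  Code      = c("11111", "22222")
)

model <- train_similarity_based_reasoning(
  anonymized_data = surveyCountsWordwiseSimilarity,
  num_allowed_codes = 1291,
  coding_index_w_codes = toy_index,
  preprocessing = list(stopwords = tm::stopwords("de"), stemming = NULL,
                       strPreprocessing = TRUE, removePunct = FALSE),
  dist_type = "wordwise",
  dist_control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = c(max = 3, use = 1),
  simulation_control = list(n.draws = 250, check_normality = FALSE)
)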