I was playing with R’s var() function the other day.
Not sure whether it simply returns population variance by dividing over n or with Bessel’s correction by dividing over (n-1), I quickly entered some test data.
> rate <- c(9.34, 8.50, 7.62, 6.93, 6.60) > var(rate) [1] 1.27272 > sum((rate - mean(rate))^2)/length(rate) [1] 1.018176 > sum((rate - mean(rate))^2)/(length(rate) -1) [1] 1.27272
So looks like it indeed applies Bessel’s correction, which is great – then I was curious to see how var() is implemented.
I know I can normally see a bit of code by simply print function’s name so I start by:
> var function (x, y = NULL, na.rm = FALSE, use) { if (missing(use)) use <- if (na.rm) "na.or.complete" else "everything" na.method <- pmatch(use, c("all.obs", "complete.obs", "pairwise.complete.obs", "everything", "na.or.complete")) if (is.na(na.method)) stop("invalid 'use' argument") if (is.data.frame(x)) x <- as.matrix(x) else stopifnot(is.atomic(x)) if (is.data.frame(y)) y <- as.matrix(y) else stopifnot(is.atomic(y)) .Call(C_cov, x, y, na.method, FALSE) } <bytecode: 0x1e38b30> <environment: namespace:stats>
so far so good with some basic validations and argument parsing until I ran to last line where it goes on to call C_cov.
Tried to print C_cov but no luck this time:
> C_cov Error: object 'C_cov' not found
Not much experience with how this gets implemented I begin my exploration with function getAnywhere() and it return something seemingly confusing:
> getAnywhere(C_cov) 4 differing objects matching ‘C_cov’ were found in the following places namespace:stats Use [] to view one of them
Following the instruction and tried to display all 4:
> getAnywhere(C_cov)[1:4] $name [1] "cov" $address <pointer: 0xe87af0> attr(,"class") [1] "RegisteredNativeSymbol" $dll DLL name: stats Filename: /usr/lib/R/library/stats/libs/stats.so Dynamic lookup: FALSE $numParameters [1] 4
So it seems to suggest its real name is “cov” and is a symbol in dll /usr/lib/R/library/stats/libs/stats.so.
Of course it is not a real dll in strict sense probably the author who created this comes from window background, for us we would probably call it dso to be accurate.
So I dived into stats’s source code and I want to understand how the name “C_cov” connects with “cov” or should I expect to see “C_cov” in the source of stats package?
A quick grep shows there is no definition but only some similar reference:
$package/R-3.1.1/src/library/stats> $grep -nr "C_cov" * R/cor.R:150: .Call(C_cov, x, y, na.method, method == "kendall") R/cor.R:154: .Call(C_cov, Rank(na.omit(x)), NULL, na.method, R/cor.R:163: .Call(C_cov, Rank(dropNA(x, nas)), Rank(dropNA(y, nas)), R/cor.R:169: .Call(C_cov, x, y, na.method, method == "kendall") R/cor.R:184: .Call(C_cov, x, y, na.method, FALSE)
And there are cov() as a c function at the same time:
package/R-3.1.1/src/library/stats/src/cov.c 50 SEXP cov(SEXP x, SEXP y, SEXP na_method, SEXP kendall) 51 { 52 return corcov(x, y, na_method, kendall, FALSE); 53 }
A grep against “C_” shows me this is probably a convention for C/Fortran symbols which gets registered in the namespace file:
package/R-3.1.1/src/library/stats/src/NAMESPACE # Refer to all C/Fortran routines by their name prefixed by C_ useDynLib(stats, .registration = TRUE, .fixes = "C_") ...
And NAMESPACE file is normally loaded and parsed by base system’s loadNamespace():
170 loadNamespace <- function (package, lib.loc = NULL, 171 keep.source = getOption("keep.source.pkgs"), 172 partial = FALSE, versionCheck = NULL) 173 { ... 339 if (! packageHasNamespace(package, package.lib)) { 340 hasNoNamespaceError <- ... 360 else parseNamespaceFile(package, package.lib, mustExist = FALSE) ... 1060 ## NB this needs a decorated name, foo_ver, if appropriate 1061 packageHasNamespace <- function(package, package.lib) { 1062 namespaceFilePath <- function(package, package.lib) 1063 file.path(package.lib, package, "NAMESPACE") 1064 file.exists(namespaceFilePath(package, package.lib)) 1065 }
And by calling parse() on the NAMESPACE file we turned all the contents into expression objects and then we call the corresponding useDynLib:
1194 useDynLib = { 1195 1196 ## This attempts to process as much of the 1197 ## information as possible when NAMESPACE is parsed 1198 ## rather than when it is loaded and creates 1199 ## NativeRoutineMap objects to handle the mapping 1200 ## of symbols to R variable names. 1201 1202 ## The name is the second element after useDynLib 1203 dyl <- as.character(e[2L]) ... 1244 ## Deal with any prefix/suffix pair. 1245 fixes <- c("", "") 1246 idx <- match(".fixes", names(symNames)) 1247 if(!is.na(idx)) { 1248 ## Take .fixes and treat it as a call, 1249 ## e.g. c("pre", "post") or a regular name 1250 ## as the prefix. 1251 if(symNames[idx] != "") { 1252 e <- parse(text = symNames[idx], srcfile = NULL)[[1L]] 1253 if(is.call(e)) 1254 val <- eval(e) 1255 else 1256 val <- as.character(e) 1257 if(length(val)) 1258 fixes[seq_along(val)] <- val 1259 } 1260 symNames <- symNames[-idx] 1261 } 1262 1263 ## Deal with a .registration entry. It must be 1264 ## .registration = value and value will be coerced 1265 ## to a logical. 1266 useRegistration <- FALSE 1267 idx <- match(".registration", names(symNames)) 1268 if(!is.na(idx)) { 1269 useRegistration <- as.logical(symNames[idx]) 1270 symNames <- symNames[-idx] 1271 } 1272 1273 ## Now merge into the NativeRoutineMap. 1274 nativeRoutines[[ dyl ]] <<- 1275 if(dyl %in% names(nativeRoutines)) 1276 mergeNativeRoutineMaps(nativeRoutines[[ dyl ]], 1277 useRegistration, 1278 symNames, fixes) 1279 else 1280 nativeRoutineMap(useRegistration, symNames, 1281 fixes) 1282 } 1283 },
And most importantly it checks .fix in our useDynlib() directive and actually allows us to put both prefix and suffix.
In our case we provide it as a single string rather than a list so suffix defaults to “”.
The critical part is when we actually call nativeRoutineMap to apply the fixes, and we can guess it must have applied to all symbols fetched from the dso, which led us to the code:
280 assignNativeRoutines <- function(dll, lib, env, nativeRoutines) { 281 if(length(nativeRoutines) == 0L) return(NULL) 282 283 if(nativeRoutines$useRegistration) { 284 ## Use the registration information to register ALL the symbols 285 fixes <- nativeRoutines$registrationFixes 286 routines <- getDLLRegisteredRoutines.DLLInfo(dll, addNames = FALSE) 287 lapply(routines, 288 function(type) { 289 lapply(type, 290 function(sym) { 291 varName <- paste0(fixes[1L], sym$name, fixes[2L]) 292 if(exists(varName, envir = env)) 293 warning(gettextf("failed to assign RegisteredNativeSymbol for %s to %s since %s is already defined in the %s namespace", 294 sym$name, varName, varName, sQuote(package)), 295 domain = NA) 296 else 297 assign(varName, sym, envir = env) 298 }) 299 }) 300 301 }
so the line paste0() is what we have been looking for and the sym$name is the raw name, in this case,”cov”.
So the routines are returned from function call getDLLRegisteredRoutines.DLLInfo which is pretty straight forward and end up fetching all 4 classes of symbols:
1269 SEXP attribute_hidden 1270 R_getRegisteredRoutines(SEXP dll) 1271 { 1272 DllInfo *info; 1273 SEXP ans, snames; 1274 int i; 1275 const char * const names[] = {".C", ".Call", ".Fortran", ".External"}; 1276 1277 if(TYPEOF(dll) != EXTPTRSXP && 1278 R_ExternalPtrTag(dll) != install("DLLInfo")) 1279 error(_("R_getRegisteredRoutines() expects a DllInfo reference")); 1280 1281 info = (DllInfo *) R_ExternalPtrAddr(dll); 1282 if(!info) error(_("NULL value passed for DllInfo")); 1283 1284 1285 PROTECT(ans = allocVector(VECSXP, 4)); 1286 1287 SET_VECTOR_ELT(ans, 0, R_getRoutineSymbols(R_C_SYM, info)); 1288 SET_VECTOR_ELT(ans, 1, R_getRoutineSymbols(R_CALL_SYM, info)); 1289 SET_VECTOR_ELT(ans, 2, R_getRoutineSymbols(R_FORTRAN_SYM, info)); 1290 SET_VECTOR_ELT(ans, 3, R_getRoutineSymbols(R_EXTERNAL_SYM, info)); 1291 1292 PROTECT(snames = allocVector(STRSXP, 4)); 1293 for(i = 0; i < 4; i++) 1294 SET_STRING_ELT(snames, i, mkChar(names[i])); 1295 setAttrib(ans, R_NamesSymbol, snames); 1296 UNPROTECT(2); 1297 return(ans); 1298 }
Also one may notice the critical line that links the varName (which is now C_ prefixed) to the structure sym which actually establishes the link between “C_cov” and the info in the correct env.
So what environment is that, tracing backward a little and we see:
386 ns <- makeNamespace(package, version = version, lib = package.lib) ... 418 env <- asNamespace(ns) 419 ## save the package name in the environment 420 assign(".packageName", package, envir = env)
So env is the nothing but the namespace!
So let’s list out all the namespaces and try on “stats”:
> loadedNamespaces() [1] "base" "datasets" "graphics" "grDevices" "methods" "stats" [7] "utils" > ns <- asNamespace("stats") > class(ns) [1] "environment" > grep("cov", ls(ns), value=TRUE) [1] "C_cov" "cov" "cov2cor" "covratio" [5] "cov.wt" "vcov" "vcov.Arima" "vcov.glm" [9] "vcov.lm" "vcov.mlm" "vcov.nls" "vcov.summary.glm" [13] "vcov.summary.lm"
Indeed we are able to see the “C_cov” symbol and we can also show its class:
> class(ns$C_cov) [1] "CallRoutine" "NativeSymbolInfo"
So when we print it it shows exactly the same info we saw earlier:
> ns$C_cov $name [1] "cov" $address <pointer: 0xe87af0> attr(,"class") [1] "RegisteredNativeSymbol" $dll DLL name: stats Filename: /usr/lib/R/library/stats/libs/stats.so Dynamic lookup: FALSE $numParameters [1] 4 attr(,"class") [1] "CallRoutine" "NativeSymbolInfo"
Which, not surprisingly is supported by the print method of R class “NativeSymbolInfo”.
And we can even verify by directly apply “nm” on the dso:
nm -C /usr/lib/R/library/stats/libs/stats.so|grep -i cov 00000361c0 t corcov 0000039cc0 t cov
Of course based on how you build R, you might simply get “no symbols” which means symbols has been stripped but code is still in that dso.
Lastly as the nice picture from Advanced R – environmentsshows:
Your package “stats” env would be on the search list but the function would have the defining env as the namespace env we just saw.
Let’s do some verification to see if that’s indeed the case:
> search() [1] ".GlobalEnv" "package:stats" "package:graphics" [4] "package:grDevices" "package:utils" "package:datasets" [7] "package:methods" "Autoloads" "package:base" > e <- as.environment("package:stats") > e <environment: package:stats> attr(,"name") [1] "package:stats" attr(,"path") [1] "/usr/lib/R/library/stats" > environment(e$var) <environment: namespace:stats> > identical(environment(e$var), ns) [1] TRUE
So indeed it links correctly in this way.
Finally a trick to show the symbol once we know it’s in stat package:
> stats::C_cov Error: 'C_cov' is not an exported object from 'namespace:stats' > stats:::C_cov $name [1] "cov" $address <pointer: 0xe87af0> attr(,"class") [1] "RegisteredNativeSymbol" $dll DLL name: stats Filename: /usr/lib/R/library/stats/libs/stats.so Dynamic lookup: FALSE $numParameters [1] 4 attr(,"class") [1] "CallRoutine" "NativeSymbolInfo"
And the special form “:::” does nothing but call the get() to recursively find the symbol:
> `:::` function (pkg, name) { pkg <- as.character(substitute(pkg)) name <- as.character(substitute(name)) get(name, envir = asNamespace(pkg), inherits = FALSE) } <bytecode: 0x15d97f0> <environment: namespace:base> > `:::`("stats", "C_cov") $name [1] "cov" $address <pointer: 0xe87af0> attr(,"class") [1] "RegisteredNativeSymbol" $dll DLL name: stats Filename: /usr/lib/R/library/stats/libs/stats.so Dynamic lookup: FALSE $numParameters [1] 4 attr(,"class") [1] "CallRoutine" "NativeSymbolInfo"
Thank you very much, this post probably saved me a few hours of work