last edit: 10/7/2019
SQL databases are very commonly used to store very large genomic data resources. Tools such as DBI and dbplyr provide convenient interfaces for R users to inspect and manipulate the data. They represent SQL tables in tidy formats and support lazy, quick aggregation operations (e.g., *_join, union, etc.) for tables from the same resource. Cross-database aggregation is also supported when opted in (using copy=TRUE), but it becomes very expensive because a whole table is internally copied into the other connection. Use of advanced functions often requires specialized SQL knowledge, which is a challenge for common R users. Interoperability with existing bioinformatics tools is also suboptimal, e.g., with the SummarizedExperiment container for sequencing or genotyping experiments on which many modern bioinformatics pipelines are based.
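For context, here is a minimal dbplyr sketch of such a cross-database join; the file names, table names, and key column are hypothetical, and copy = TRUE copies one whole lazy table into the other connection before the join can run:
library(DBI)
library(RSQLite)
library(dplyr)
## two independent SQLite connections (hypothetical files)
con1 <- dbConnect(RSQLite::SQLite(), "annotations.db")
con2 <- dbConnect(RSQLite::SQLite(), "genotypes.db")
anno <- tbl(con1, "gene_anno")   ## lazy tbl in database 1
geno <- tbl(con2, "sample_geno") ## lazy tbl in database 2
## joining across connections requires copy = TRUE, which first
## copies all of 'geno' into 'con1' -- expensive for large tables
res <- left_join(anno, geno, by = "gene_id", copy = TRUE)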
The SQLDataFrame package was developed with the familiar DataFrame-like paradigm to lazily represent very large datasets from different SQL databases, such as SQLite and MySQL. The DataFrame-like interface offers common R users easy data manipulations such as square-bracket subsetting, rbinding, etc. For modern R users, it also recognizes the tidy data-analysis and dplyr grammar by supporting %>%, select, filter, mutate, etc. More importantly, database-type-specific strategies were implemented in SQLDataFrame to handle cross-database operations efficiently without incurring internally expensive processes (especially for databases with write permission). Some previously difficult data operations, such as cross-database ID matching and conversion, or variant-annotation extraction, become quick and easy in R. The scalability and interoperability of SQLDataFrame are expected to significantly promote the handling of very large genomic data resources and facilitate the overall bioinformatics analysis.
Currently SQLDataFrame supports the DBI backends of SQLite, MySQL, and Google BigQuery, which are among the most commonly used SQL databases. In the future, or upon feature request, we will extend the package so that users can choose other database backends for the SQLDataFrame representation.
Here is a list of commonly used backends (bolded ones are already supported!):
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SQLDataFrame")
The development version is also available to download from Github.
BiocManager::install("Liubuntu/SQLDataFrame")
library(SQLDataFrame)
library(DBI)
SQLDataFrame class
SQLDataFrame constructor
There are two ways to construct a SQLDataFrame object:
1. Provide the arguments conn, dbtable, and dbkey. conn is a valid DBIConnection from SQLite or MySQL; dbtable specifies the name of the database table that is going to be represented as a SQLDataFrame object (if only one table is available in the specified database, this argument can be left blank); dbkey specifies the column name(s) in the table that uniquely identify all the data observations (rows).
2. Provide dbtable and dbkey as specified above, plus credentials to build a valid DBIConnection. For SQLite, the credential argument is dbname. For MySQL, the credential arguments are host, user, and password. In addition to the credentials, users must provide the type argument to specify the SQL database type. Supported types are "SQLite" and "MySQL"; if not specified, "SQLite" is used by default. Supported database tables can be on-disk or remote on the web or cloud.
dbfile <- system.file("extdata/test.db", package = "SQLDataFrame")
conn <- DBI::dbConnect(DBI::dbDriver("SQLite"), dbname = dbfile)
obj <- SQLDataFrame(conn = conn, dbtable = "state",
dbkey = "state")
Construction from database credentials:
obj1 <- SQLDataFrame(dbname = dbfile, type = "SQLite",
dbtable = "state", dbkey = "state")
all.equal(obj, obj1)
#> [1] TRUE
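For a MySQL server, the credentials-based form would look similar. A sketch with placeholder credentials (not run; depending on the server setup, additional arguments such as a database name may also be required):
objm <- SQLDataFrame(host = "localhost", user = "username",
                     password = "password", type = "MySQL",
                     dbtable = "state", dbkey = "state")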
Note that after reading the database table into a SQLDataFrame, the key column(s) are kept as fixed columns shown on the left-hand side, with | separating the key column(s) from the other columns. ncol, colnames, and the corresponding column subsetting apply only to the non-key columns.
obj
#> SQLDataFrame with 50 rows and 4 columns
#> state | division region population size
#> <character> | <character> <character> <numeric> <character>
#> Alabama | East South Central South 3615 medium
#> Alaska | Pacific West 365 small
#> Arizona | Mountain West 2280 medium
#> Arkansas | West South Central South 2110 medium
#> California | Pacific West 21198 large
#> ... . ... ... ... ...
#> Virginia | South Atlantic South 4981 medium
#> Washington | Pacific West 3559 medium
#> West Virginia | South Atlantic South 1799 medium
#> Wisconsin | East North Central North Central 4589 medium
#> Wyoming | Mountain West 376 small
dim(obj)
#> [1] 50 4
colnames(obj)
#> [1] "division" "region" "population" "size"
To keep the SQLDataFrame object as light and compact as possible, it contains only 5 slots: tblData, dbkey, dbnrows, dbconcatKey, and indexes. Metadata information can be returned through these 5 slots using slot accessors or other utility functions.
slotNames(obj)
#> [1] "dbkey" "dbnrows" "tblData" "indexes"
#> [5] "dbconcatKey" "elementType" "elementMetadata" "metadata"
dbtable(obj)
#> [1] "state"
dbkey(obj)
#> [1] "state"
connSQLDataFrame(obj)
#> <SQLiteConnection>
#> Path: /tmp/RtmpRM3VSz/Rinst26bffd2872677e/SQLDataFrame/extdata/test.db
#> Extensions: TRUE
In addition, many useful common methods are defined on SQLDataFrame objects to make them more DataFrame-like. For example, dimnames() returns the row/column names of the data as an unnamed list, with the first element being the rownames (always NULL) and the second being the colnames (also available via the colnames() method). dim() returns the dimensions of the database table, which enables nrow()/ncol() to extract a specific dimension. length() is also defined and works the same as ncol().
Note that rownames(SQLDataFrame) is always NULL, as rownames are not supported in SQLDataFrame. However, ROWNAMES(obj) was implemented to support [ subsetting with characters.
dim(obj)
#> [1] 50 4
dimnames(obj)
#> [[1]]
#> NULL
#>
#> [[2]]
#> [1] "division" "region" "population" "size"
length(obj)
#> [1] 4
ROWNAMES(obj)
#> [1] "Alabama" "Alaska" "Arizona" "Arkansas"
#> [5] "California" "Colorado" "Connecticut" "Delaware"
#> [9] "Florida" "Georgia" "Hawaii" "Idaho"
#> [13] "Illinois" "Indiana" "Iowa" "Kansas"
#> [17] "Kentucky" "Louisiana" "Maine" "Maryland"
#> [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
#> [25] "Missouri" "Montana" "Nebraska" "Nevada"
#> [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
#> [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
#> [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
#> [41] "South Dakota" "Tennessee" "Texas" "Utah"
#> [45] "Vermont" "Virginia" "Washington" "West Virginia"
#> [49] "Wisconsin" "Wyoming"
NOTE that the dbtable() accessor only works for a SQLDataFrame object whose lazy tbl (carried in the tblData slot) corresponds to a single database table. If the SQLDataFrame was generated from rbind, union, or *_join, call saveSQLDataFrame() to save the lazy tbl to disk, and dbtable() will then be activated.
dbtable(obj)
#> [1] "state"
aa <- rbind(obj[1:5, ], obj[6:10, ])
aa
#> SQLDataFrame with 10 rows and 4 columns
#> state | division region population size
#> <character> | <character> <character> <numeric> <character>
#> Alabama | East South Central South 3615 medium
#> Alaska | Pacific West 365 small
#> Arizona | Mountain West 2280 medium
#> Arkansas | West South Central South 2110 medium
#> California | Pacific West 21198 large
#> Colorado | Mountain West 2541 medium
#> Connecticut | New England Northeast 3100 medium
#> Delaware | South Atlantic South 579 small
#> Florida | South Atlantic South 8277 large
#> Georgia | South Atlantic South 4931 medium
dbtable(aa) ## message
#> Warning in dbtable(aa): ## not available for SQLDataFrame with lazy queries of 'union', 'join', or 'rbind'.
#> ## call 'saveSQLDataFrame()' to save the data as database table and call 'dbtable()' again!
bb <- saveSQLDataFrame(aa, dbname = tempfile(fileext=".db"),
dbtable = "aa")
#> ## A new database table is saved!
#> ## Source: table<aa> [10 X 4]
#> ## Database: sqlite 3.39.4 [/tmp/RtmpiaiuTE/file26c3f93d6dbafe.db]
#> ## Use the following command to reload into R:
#> ## sdf <- SQLDataFrame(
#> ## dbname = '/tmp/RtmpiaiuTE/file26c3f93d6dbafe.db',
#> ## type = 'SQLite',
#> ## dbtable = 'aa',
#> ## dbkey = 'state')
connSQLDataFrame(bb)
#> <SQLiteConnection>
#> Path: /tmp/RtmpiaiuTE/file26c3f93d6dbafe.db
#> Extensions: TRUE
dbtable(bb)
#> [1] "aa"
We can also construct a SQLDataFrame object directly from a file name. The makeSQLDataFrame function takes a character file name of a common text file (.csv, .txt, etc.), writes it into a database table, and opens it as a SQLDataFrame object. Users can provide values for the dbname and dbtable arguments. If NULL, dbname defaults to a temporary database file, and dbtable defaults to the basename(filename) without extension.
NOTE that the input file must have one or multiple columns that uniquely identify each observation (row), to be used as the dbkey() for the SQLDataFrame. The file must also be rectangular, i.e., rownames are not accepted; however, users can save rownames as a separate column.
mtc <- tibble::rownames_to_column(mtcars)[,1:6]
filename <- file.path(tempdir(), "mtc.csv")
write.csv(mtc, file= filename, row.names = FALSE)
aa <- makeSQLDataFrame(filename, dbkey = "rowname", sep = ",",
overwrite = TRUE)
#> ## A new database table is saved!
#> ## Source: table<mtc> [32 X 5]
#> ## Database: sqlite 3.39.4 [/tmp/RtmpiaiuTE/file26c3f91009ab7b.db]
#> ## Use the following command to reload into R:
#> ## sdf <- SQLDataFrame(
#> ## dbname = '/tmp/RtmpiaiuTE/file26c3f91009ab7b.db',
#> ## type = 'SQLite',
#> ## dbtable = 'mtc',
#> ## dbkey = 'rowname')
#>
aa
#> SQLDataFrame with 32 rows and 5 columns
#> rowname | mpg cyl disp hp drat
#> <character> | <numeric> <integer> <numeric> <integer> <numeric>
#> "Mazda RX4" | 21.0 6 160 110 3.90
#> "Mazda RX4 Wag" | 21.0 6 160 110 3.90
#> "Datsun 710" | 22.8 4 108 93 3.85
#> "Hornet 4 Drive" | 21.4 6 258 110 3.08
#> "Hornet Sportabout" | 18.7 8 360 175 3.15
#> ... . ... ... ... ... ...
#> "Lotus Europa" | 30.4 4 95.1 113 3.77
#> "Ford Pantera L" | 15.8 8 351.0 264 4.22
#> "Ferrari Dino" | 19.7 6 145.0 175 3.62
#> "Maserati Bora" | 15.0 8 301.0 335 3.54
#> "Volvo 142E" | 21.4 4 121.0 109 4.11
connSQLDataFrame(aa)
#> <SQLiteConnection>
#> Path: /tmp/RtmpiaiuTE/file26c3f91009ab7b.db
#> Extensions: TRUE
dbtable(aa)
#> [1] "mtc"
With all the methods ([ subsetting, rbind, *_join, etc.) provided in the next section, the SQLDataFrame always works as a lazy representation until users explicitly call the saveSQLDataFrame function for realization. saveSQLDataFrame writes the lazy tbl carried in the tblData slot into an on-disk database table, and re-opens the SQLDataFrame object from the new path.
It is also recommended that users call saveSQLDataFrame frequently to avoid accumulating too many lazy layers, which slows down data processing.
connSQLDataFrame(obj)
#> <SQLiteConnection>
#> Path: /tmp/RtmpRM3VSz/Rinst26bffd2872677e/SQLDataFrame/extdata/test.db
#> Extensions: TRUE
dbtable(obj)
#> [1] "state"
obj1 <- saveSQLDataFrame(obj, dbname = tempfile(fileext = ".db"),
dbtable = "obj_copy")
#> ## A new database table is saved!
#> ## Source: table<obj_copy> [50 X 4]
#> ## Database: sqlite 3.39.4 [/tmp/RtmpiaiuTE/file26c3f9369ab83.db]
#> ## Use the following command to reload into R:
#> ## sdf <- SQLDataFrame(
#> ## dbname = '/tmp/RtmpiaiuTE/file26c3f9369ab83.db',
#> ## type = 'SQLite',
#> ## dbtable = 'obj_copy',
#> ## dbkey = 'state')
connSQLDataFrame(obj1)
#> <SQLiteConnection>
#> Path: /tmp/RtmpiaiuTE/file26c3f9369ab83.db
#> Extensions: TRUE
dbtable(obj1)
#> [1] "obj_copy"
[[ subsetting
[[,SQLDataFrame behaves similarly to [[,DataFrame and returns a realized vector of values from a single column. $,SQLDataFrame is also defined to conveniently extract column values.
head(obj[[1]])
#> [1] "East South Central" "Pacific" "Mountain"
#> [4] "West South Central" "Pacific" "Mountain"
head(obj[["region"]])
#> [1] "South" "West" "West" "South" "West" "West"
head(obj$size)
#> [1] "medium" "small" "medium" "medium" "large" "medium"
We can also get the key column values using character extraction.
head(obj[["state"]])
#> [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
#> [6] "Colorado"
[ subsetting
SQLDataFrame instances can be subsetted similarly to DataFrame, following the usual R conventions, with numeric, character, or logical vectors; logical vectors are recycled to the appropriate length.
NOTE: use drop=FALSE explicitly for single-column subsetting if you want a SQLDataFrame object returned; otherwise, the default drop=TRUE always returns the realized values of that column.
obj[1:3, 1:2]
#> SQLDataFrame with 3 rows and 2 columns
#> state | division region
#> <character> | <character> <character>
#> Alabama | East South Central South
#> Alaska | Pacific West
#> Arizona | Mountain West
obj[c(TRUE, FALSE), c(TRUE, FALSE), drop=FALSE]
#> SQLDataFrame with 25 rows and 2 columns
#> state | division population
#> <character> | <character> <numeric>
#> Alabama | East South Central 3615
#> Arizona | Mountain 2280
#> California | Pacific 21198
#> Connecticut | New England 3100
#> Florida | South Atlantic 8277
#> ... . ... ...
#> South Dakota | West North Central 746
#> Texas | West South Central 12237
#> Vermont | New England 472
#> Washington | Pacific 3559
#> Wisconsin | East North Central 4589
obj[1:3, "population", drop=FALSE]
#> SQLDataFrame with 3 rows and 1 column
#> state | population
#> <character> | <numeric>
#> Alabama | 3615
#> Alaska | 365
#> Arizona | 2280
obj[, "population"] ## realized column value
#> [1] 3615 365 2280 2110 21198 2541 3100 579 8277 4931 868 813
#> [13] 11197 5313 2861 2280 9111 3806 1058 4122 5814 9111 3921 2341
#> [25] 4767 746 1544 590 812 7333 1799 18076 5441 637 10735 2715
#> [37] 2284 11860 931 2816 746 4173 12237 1203 472 4981 3559 1799
#> [49] 4589 376
Subsetting with a character vector works for SQLDataFrame objects. With composite keys, users need to concatenate the key values with : for row subsetting (see the vignette on the internal implementation for more details).
rnms <- ROWNAMES(obj)
obj[c("Alabama", "Colorado"), ]
#> SQLDataFrame with 2 rows and 4 columns
#> state | division region population size
#> <character> | <character> <character> <numeric> <character>
#> Alabama | East South Central South 3615 medium
#> Colorado | Mountain West 2541 medium
obj1 <- SQLDataFrame(conn = conn, dbtable = "state",
dbkey = c("region", "population"))
rnms <- ROWNAMES(obj1)
obj1[c("South:3615.0", "West:365.0"), ]
#> SQLDataFrame with 2 rows and 3 columns
#> region population | division state size
#> <character> <numeric> | <character> <character> <character>
#> South 3615 | East South Central Alabama medium
#> West 365 | Pacific Alaska small
List-style subsetting is also allowed to extract certain columns from the SQLDataFrame object, which returns a SQLDataFrame by default.
obj[1]
#> SQLDataFrame with 50 rows and 1 column
#> state | division
#> <character> | <character>
#> Alabama | East South Central
#> Alaska | Pacific
#> Arizona | Mountain
#> Arkansas | West South Central
#> California | Pacific
#> ... . ...
#> Virginia | South Atlantic
#> Washington | Pacific
#> West Virginia | South Atlantic
#> Wisconsin | East North Central
#> Wyoming | Mountain
obj["region"]
#> SQLDataFrame with 50 rows and 1 column
#> state | region
#> <character> | <character>
#> Alabama | South
#> Alaska | West
#> Arizona | West
#> Arkansas | South
#> California | West
#> ... . ...
#> Virginia | South
#> Washington | West
#> West Virginia | South
#> Wisconsin | North Central
#> Wyoming | West
We have also enabled the S3 methods filter and mutate from the dplyr package, so that users can conveniently filter data observations and add new columns.
obj1 %>% filter(division == "South Atlantic" & size == "medium")
#> SQLDataFrame with 5 rows and 3 columns
#> region population | division state size
#> <character> <numeric> | <character> <character> <character>
#> South 4931 | South Atlantic Georgia medium
#> South 4122 | South Atlantic Maryland medium
#> South 2816 | South Atlantic South Carolina medium
#> South 4981 | South Atlantic Virginia medium
#> South 1799 | South Atlantic West Virginia medium
obj1 %>% mutate(p1 = population/10, s1 = size)
#> SQLDataFrame with 50 rows and 5 columns
#> region population | division state size
#> <character> <numeric> | <character> <character> <character>
#> South 3615 | East South Central Alabama medium
#> West 365 | Pacific Alaska small
#> West 2280 | Mountain Arizona medium
#> South 2110 | West South Central Arkansas medium
#> West 21198 | Pacific California large
#> ... ... . ... ... ...
#> South 4981 | South Atlantic Virginia medium
#> West 3559 | Pacific Washington medium
#> South 1799 | South Atlantic West Virginia medium
#> North Central 4589 | East North Central Wisconsin medium
#> West 376 | Mountain Wyoming small
#> p1 s1
#> <numeric> <character>
#> 361.5 medium
#> 36.5 small
#> 228.0 medium
#> 211.0 medium
#> 2119.8 large
#> ... ...
#> 498.1 medium
#> 355.9 medium
#> 179.9 medium
#> 458.9 medium
#> 37.6 small
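The select verb from dplyr, mentioned in the overview above, can be used in a similar way to keep only certain non-key columns; a brief sketch (output omitted):
obj1 %>% select(division, size)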
To be consistent with DataFrame, union and rbind methods were implemented for SQLDataFrame, where union returns the SQLDataFrame sorted by dbkey(obj), and rbind keeps the original order of the input objects.
dbfile1 <- system.file("extdata/test.db", package = "SQLDataFrame")
con1 <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile1)
dbfile2 <- system.file("extdata/test1.db", package = "SQLDataFrame")
con2 <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile2)
ss1 <- SQLDataFrame(conn = con1, dbtable = "state",
dbkey = c("state"))
ss2 <- SQLDataFrame(conn = con2, dbtable = "state1",
dbkey = c("state"))
ss11 <- ss1[sample(5), ]
ss21 <- ss2[sample(10, 5), ]
obj1 <- union(ss11, ss21)
obj1 ## reordered by the "dbkey()"
obj2 <- rbind(ss11, ss21)
obj2 ## keeping the original order by updating the row index
#> SQLDataFrame with 10 rows and 4 columns
#> state | division region population size
#> <character> | <character> <character> <numeric> <character>
#> Arizona | Mountain West 2280 medium
#> Alaska | Pacific West 365 small
#> California | Pacific West 21198 large
#> Alabama | East South Central South 3615 medium
#> Arkansas | West South Central South 2110 medium
#> Connecticut | New England Northeast 3100 medium
#> Arizona | Mountain West 2280 medium
#> Georgia | South Atlantic South 4931 medium
#> Colorado | Mountain West 2541 medium
#> Florida | South Atlantic South 8277 large
The *_join family of methods was implemented for SQLDataFrame objects, including left_join, inner_join, semi_join, and anti_join, which provides the capability of merging database tables from different sources.
ss12 <- ss1[1:10, 1:2]
ss22 <- ss2[6:15, 3:4]
left_join(ss12, ss22)
#> Joining, by = "state"
#> SQLDataFrame with 10 rows and 4 columns
#> state | division region population size
#> <character> | <character> <character> <numeric> <character>
#> Alabama | East South Central South <NA> <NA>
#> Alaska | Pacific West <NA> <NA>
#> Arizona | Mountain West <NA> <NA>
#> Arkansas | West South Central South <NA> <NA>
#> California | Pacific West <NA> <NA>
#> Colorado | Mountain West 2541 medium
#> Connecticut | New England Northeast 3100 medium
#> Delaware | South Atlantic South 579 small
#> Florida | South Atlantic South 8277 large
#> Georgia | South Atlantic South 4931 medium
inner_join(ss12, ss22)
#> Joining, by = "state"
#> SQLDataFrame with 5 rows and 4 columns
#> state | division region population size
#> <character> | <character> <character> <numeric> <character>
#> Colorado | Mountain West 2541 medium
#> Connecticut | New England Northeast 3100 medium
#> Delaware | South Atlantic South 579 small
#> Florida | South Atlantic South 8277 large
#> Georgia | South Atlantic South 4931 medium
semi_join(ss12, ss22)
#> Joining, by = "state"
#> SQLDataFrame with 5 rows and 2 columns
#> state | division region
#> <character> | <character> <character>
#> Colorado | Mountain West
#> Connecticut | New England Northeast
#> Delaware | South Atlantic South
#> Florida | South Atlantic South
#> Georgia | South Atlantic South
anti_join(ss12, ss22)
#> Joining, by = "state"
#> SQLDataFrame with 5 rows and 2 columns
#> state | division region
#> <character> | <character> <character>
#> Alabama | East South Central South
#> Alaska | Pacific West
#> Arizona | Mountain West
#> Arkansas | West South Central South
#> California | Pacific West
SQLDataFrame now supports MySQL database tables through RMySQL, for local MySQL servers or remote ones on the web or cloud. SQLDataFrame construction, the *_join functions, union, rbind, and saving are all supported. Aggregation operations are supported within the same MySQL database or across MySQL databases. For details, please see the function documentation.
Here I'll show a simple use case with MySQL tables from the UCSC genome browser (genome-mysql.soe.ucsc.edu).
library(RMySQL)
ensbConn <- dbConnect(dbDriver("MySQL"),
host="genome-mysql.soe.ucsc.edu",
user = "genome",
dbname = "xenTro9")
enssdf <- SQLDataFrame(conn = ensbConn,
dbtable = "xenoRefGene",
dbkey = c("name", "txStart"))
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 0 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 4 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 5 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 6 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 7 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 8 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 1 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 2 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 5 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 6 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 7 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 8 imported as
#> numeric
enssdf1 <- enssdf[1:20, 1:2]
enssdf2 <- enssdf[11:30,3:4]
res <- left_join(enssdf1, enssdf2)
#> Joining, by = c("name", "txStart")
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 1 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 2 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 5 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 1 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 2 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 5 imported as
#> numeric
SQLDataFrame has also added support for Google BigQuery tables. Construction and queries using [ and filter are supported!
Authentication and authorization are needed when using bigrquery; see the bigrquery documentation for more details.
Also note that the support for BigQuery tables implements a specialized strategy for efficient data representation: dbkey() is assigned as SurrogateKey by default, and the dbkey argument is ignored during construction.
library(bigrquery)
bigrquery::bq_auth() ## use this to authorize bigrquery in the
## browser.
bqConn <- DBI::dbConnect(dbDriver("bigquery"),
project = "bigquery-public-data",
dataset = "human_variant_annotation",
billing = "") ## if not previous provided
## authorization, must specify a
## project name that was already
## linked with Google Cloud with
## billing info.
sdf <- SQLDataFrame(conn = bqConn, dbtable = "ncbi_clinvar_hg38_20180701")
sdf[1:5, 1:5]
sdf %>% select(GENEINFO)
sdf %>% filter(GENEINFO == "PYGL:5836")
sdf %>% filter(reference_name == "21")
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] RMySQL_0.10.24 DBI_1.1.3 SQLDataFrame_1.12.0
#> [4] S4Vectors_0.36.0 BiocGenerics_0.44.0 dbplyr_2.2.1
#> [7] dplyr_1.0.10
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.9 pillar_1.8.1 bslib_0.4.0 compiler_4.2.1
#> [5] jquerylib_0.1.4 tools_4.2.1 bit_4.0.4 digest_0.6.30
#> [9] memoise_2.0.1 jsonlite_1.8.3 evaluate_0.17 RSQLite_2.2.18
#> [13] lifecycle_1.0.3 tibble_3.1.8 pkgconfig_2.0.3 rlang_1.0.6
#> [17] cli_3.4.1 yaml_2.3.6 xfun_0.34 fastmap_1.1.0
#> [21] withr_2.5.0 stringr_1.4.1 knitr_1.40 generics_0.1.3
#> [25] vctrs_0.5.0 sass_0.4.2 bit64_4.0.5 tidyselect_1.2.0
#> [29] glue_1.6.2 R6_2.5.1 fansi_1.0.3 rmarkdown_2.17
#> [33] purrr_0.3.5 blob_1.2.3 magrittr_2.0.3 htmltools_0.5.3
#> [37] assertthat_0.2.1 utf8_1.2.2 stringi_1.7.8 lazyeval_0.2.2
#> [41] cachem_1.0.6