data.set memisc 0.99.26.3

Data Set Objects

Description

"data.set" objects are collections of "item" objects, with semantics similar to those of data frames. They are distinguished from data frames so that coercion by as.data.frame leads to a data frame that contains only vectors and factors. Nevertheless, most methods for data frames are inherited by data sets, except for the method for the within generic function. For the within method for data sets, see the Details section.

Thus data preparation using data sets retains all information about item annotations, labels, missing values, etc., while the (mostly automatic) conversion of data sets into data frames makes the data amenable to R’s statistical functions.

dsView is a function that displays data sets in much the same way as View displays data frames. (View works with data sets as well, but converts them into data frames first.)

Usage

data.set(...,row.names = NULL, check.rows = FALSE, check.names = TRUE,
    stringsAsFactors = FALSE, document = NULL)
as.data.set(x, row.names=NULL, ...)
## S4 method for signature 'list'
as.data.set(x,row.names=NULL,...)
is.data.set(x)
## S4 method for signature 'data.set'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
## S4 method for signature 'data.set'
within(data, expr, ...)

dsView(x)

## S4 method for signature 'data.set'
head(x,n=20,...)
## S4 method for signature 'data.set'
tail(x,n=20,...)

Arguments

...

for the data.set function, several vectors or items; for within, further arguments, which are ignored.

row.names, check.rows, check.names, stringsAsFactors, optional

arguments as in data.frame or as.data.frame, respectively.

document

NULL or an optional character vector that contains documentation of the data.

x

for is.data.set(x), any object; for as.data.frame(x,...) and dsView(x) a “data.set” object.

data

a data set, that is, an object of class “data.set”.

expr

an expression, or several expressions enclosed in curly braces.

n

integer; the number of rows to be shown by head or tail

Value

data.set and the within method for data sets return a “data.set” object, is.data.set returns a logical value, and as.data.frame returns a data frame.

Details

The as.data.frame method for data sets is just a copy of the method for list. Consequently, all items in the data set are coerced in accordance with their measurement setting; see as.vector,item-method and measurement.
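The coercion described above can be illustrated with a small sketch. The data set `ds` and its values are made up for illustration; the sketch assumes that memisc is installed and relies only on the behaviour shown in the Examples section, where a nominal item becomes a factor with declared missing values turned into NA, and a ratio item becomes a numeric vector.

```r
library(memisc)

# A tiny, hypothetical data set with one nominal and one ratio item
ds <- data.set(
  vote   = c(1, 2, 9, 1),
  income = c(1500, 2300, 1800, 2100)
)
ds <- within(ds, {
  measurement(vote) <- "nominal"
  measurement(income) <- "ratio"
  labels(vote) <- c(Conservatives = 1, Labour = 2, "Answer refused" = 9)
  missing.values(vote) <- 9
})

# Coercion follows the measurement setting of each item
df <- as.data.frame(ds)

is.data.set(ds)        # the original object is still a data set
is.data.frame(df)      # the result is an ordinary data frame
is.factor(df$vote)     # nominal item -> factor
is.numeric(df$income)  # ratio item -> numeric vector
is.na(df$vote)         # the declared missing value (code 9) becomes NA
```

The annotations, labels, and missing-value declarations stay in `ds`; only the converted copy `df` loses them.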

The within method for data sets has the same effect as the within method for data frames, apart from two differences: results of the computations that have the appropriate length are coerced into items, and results that do not have the appropriate length are automatically dropped.
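A minimal sketch of these two differences, assuming memisc is installed. The names `ds`, `y`, and `too_short` are hypothetical and chosen only for this illustration.

```r
library(memisc)

ds <- data.set(x = 1:4)

# A warning about 'too_short' is expected here, as in the Examples section
ds <- within(ds, {
  y <- c(10, 20, 30, 40)  # one value per observation: kept
  too_short <- 1:2        # wrong length: dropped
})

names(ds)

# The kept result is stored as an item rather than as a plain vector
y_item <- ds[[match("y", names(ds))]]
methods::is(y_item, "item")
```

Because wrong-length results are dropped rather than recycled, temporary helper variables created inside within do not have to be removed by hand, provided their length differs from the number of observations.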

Currently only one method for the generic function as.data.set is defined: a method for “importer” objects.

Examples

Data <- data.set(
         vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
         region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
         income = exp(rnorm(300,sd=.7))*2000
         )
Data <- within(Data,{
 description(vote) <- "Vote intention"
 description(region) <- "Region of residence"
 description(income) <- "Household income"
 wording(vote) <- "If a general election would take place next Tuesday,
                   the candidate of which party would you vote for?"
 wording(income) <- "All things taken into account, how much do all
                   household members earn in sum?"
 foreach(x=c(vote,region),{
   measurement(x) <- "nominal"
   })
 measurement(income) <- "ratio"
 labels(vote) <- c(
                   Conservatives         =  1,
                   Labour                =  2,
                   "Liberal Democrats"   =  3,
                   "Don't know"          =  8,
                   "Answer refused"      =  9,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 labels(region) <- c(
                   England               =  1,
                   Scotland              =  2,
                   Wales                 =  3,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 foreach(x=c(vote,region,income),{
   annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
   })
 missing.values(vote) <- c(8,9,97,99)
 missing.values(region) <- c(97,99)

 # These two variables do not appear in the
 # resulting data set, since they have the wrong length.
 junk1 <- 1:5
 junk2 <- matrix(5,4,4)

})
Warning in within(Data, { :
  Variables 'junk1','junk2' have wrong length, removing them.
# Since data sets may be huge, only a
# part of them is 'show'n
Data
Data set with 300 observations and 3 variables

                   vote               region    income
 1 *Not asked in survey              England 2364.9365
 2               Labour             Scotland 1488.7954
 3      *Answer refused              England 1217.8677
 4               Labour              England 1778.9219
 5    Liberal Democrats             Scotland 1568.2350
 6          *Don't know             Scotland 2428.9049
 7      *Answer refused *Not asked in survey 2093.2721
 8    Liberal Democrats                Wales 3380.5580
 9          *Don't know *Not asked in survey 1347.5785
10      *Answer refused             Scotland 1686.1971
11        Conservatives *Not asked in survey 5538.3095
12      *Not applicable              England 2526.3227
13               Labour              England  882.3118
14    Liberal Democrats              England 3044.6825
15      *Answer refused             Scotland 2095.6941
16        Conservatives              England  921.6308
17               Labour              England 1024.6084
18    Liberal Democrats                Wales 2336.0998
19               Labour                Wales 2528.7734
20               Labour             Scotland 1896.8119
21          *Don't know              England 1918.5024
22 *Not asked in survey *Not asked in survey 1536.7924
23        Conservatives              England 4405.8723
24               Labour                Wales 3772.3101
25    Liberal Democrats              England  576.5559
(25 of 300 observations shown)
# If we insist on seeing all, we can use 'print' instead
print(Data)

str(Data)
Data set with 300 obs. of 3 variables:
 $ vote  : Nmnl. item w/ 7 labels for 1,2,3,... + ms.v.  num 99 2 9 2 3 8 9 3 8 9 ...
 $ region: Nmnl. item w/ 5 labels for 1,2,3,... + ms.v.  num 1 2 1 1 2 2 99 3 99 2 ...
 $ income: Rto. item  num  2365 1489 1218 1779 1568 ...
summary(Data)
                  vote                     region        income
Conservatives       :42   England             :140   Min.   :  321.5
Labour              :45   Scotland            : 72   1st Qu.: 1263.1
Liberal Democrats   :45   Wales               : 37   Median : 1972.5
*Don't know         :47   *Not asked in survey: 51   Mean   : 2556.0
*Answer refused     :47                              3rd Qu.: 2950.7
*Not applicable     :43                              Max.   :20562.5
*Not asked in survey:31
# If we want to 'View' a data set we can use 'dsView'
dsView(Data)
# Works also, but changes the data set into a data frame first:
View(Data)

Data[[1]]
Item 'Vote intention' (measurement: nominal, type: double, length = 300)

[1:300] *Not asked in survey Labour *Answer refused Labour Liberal Democrats
  *Don't know ...
Data[1,]
Data set with 1 observations and 3 variables

                  vote  region   income
1 *Not asked in survey England 2364.936
head(as.data.frame(Data))
               vote   region   income
1              <NA>  England 2364.936
2            Labour Scotland 1488.795
3              <NA>  England 1217.868
4            Labour  England 1778.922
5 Liberal Democrats Scotland 1568.235
6              <NA> Scotland 2428.905
EnglandData <- subset(Data,region == "England")
EnglandData
Data set with 140 observations and 3 variables

                   vote  region    income
 1 *Not asked in survey England 2364.9365
 2      *Answer refused England 1217.8677
 3               Labour England 1778.9219
 4      *Not applicable England 2526.3227
 5               Labour England  882.3118
 6    Liberal Democrats England 3044.6825
 7        Conservatives England  921.6308
 8               Labour England 1024.6084
 9          *Don't know England 1918.5024
10        Conservatives England 4405.8723
11    Liberal Democrats England  576.5559
12    Liberal Democrats England  776.6125
13      *Not applicable England 2361.1979
14          *Don't know England 1901.7334
15      *Not applicable England 1123.0880
16          *Don't know England 1134.5998
17 *Not asked in survey England 2153.2450
18          *Don't know England 1105.9471
19      *Not applicable England 2423.6032
20        Conservatives England  957.6335
21 *Not asked in survey England 3875.7992
22      *Answer refused England 2751.8808
23               Labour England 2542.8886
24        Conservatives England 4762.5405
25      *Not applicable England 1503.1837
(25 of 140 observations shown)
xtabs(~vote+region,data=Data)
                   region
vote                England Scotland Wales
  Conservatives          14       12     4
  Labour                 21       14     6
  Liberal Democrats      19       10     8
xtabs(~vote+region,data=within(Data, vote <- include.missings(vote)))
                      region
vote                   England Scotland Wales
  Conservatives             14       12     4
  Labour                    21       14     6
  Liberal Democrats         19       10     8
  *Don't know               30        6     4
  *Answer refused           17       15     5
  *Not applicable           22       10     6
  *Not asked in survey      17        5     4