Popularity of Data Analysis Software

Robert A. Muenchen wrote a useful book, R for SAS and SPSS Users. He also maintains a regularly updated blog entry in which he presents various ways of measuring the popularity or market share of data analysis software such as R, SAS, Stata, and SPSS. I think it is quite informative reading. Figure 1 is especially striking!

State Data Centers

As a Population Research Librarian, I sometimes field questions about census data for individual states. Rather than wading through the Census Bureau page to find detailed data and projections on the state and local level, I often go directly to a State Data Center site. Every state has one and they are part of the Census Bureau’s State Data Center Network. Basically, these Centers pull out the relevant data for their state and make it easily accessible. I find the staff to be very professional and willing to answer questions by phone, at least in New Jersey. So…if you live in New Jersey and would like to know how much more densely populated your town was in 2010 compared to 2000, go to the New Jersey State Data Center and scroll down to the “Population Density by Municipality: 2000 & 2010” table. Have fun exploring!

Data Life Cycle

I have been collecting diagrams depicting the concept of Data Life Cycle (or Data Lifecycle) for some time. Merriam-Webster’s online dictionary (2012) defines “life cycle” (the third meaning) as: “a series of stages through which something passes during its lifetime.”

Data have a life cycle as well, according to the Data Sharing for Demographic Research (DSDR) (2005) booklet, Guide to Social Science Data Preparation and Archiving – Best Practice Throughout the Data Life Cycle, 3rd ed. Its diagram illustrates “the key considerations germane to archiving at each step in the data creation process” (DSDR 2005: vii). This was where I first learned about the concept.

Later on, I encountered an earlier Data Life Cycle diagram in a 2002 publication from the Consultative Committee for Space Data Systems (CCSDS 2002: 4-1). It nicely depicts the functional entities and their relationships.

Thomas’ (2005) was the first Data Life Cycle diagram I saw that featured any sort of feedback loop. This diagram is from the DDI 3.0 structural reform group report that came out in 2005.

Speaking of feedback, Humphrey’s (2006) diagram is not strictly about the Data Life Cycle, but about (empirical) knowledge creation, seen from the “data” angle.

Of course, if a project is collecting lots of data, then the data management process repeats itself, forming a natural circle, as in this figure depicting the database management tasks of the Survey of Health, Ageing and Retirement in Europe (SHARE).

Or, if you take a more abstract viewpoint, then the Data Life Cycle becomes a complicated human endeavor, as the Interagency Working Group on Digital Data (IWGDD) (2009) expresses in this visual model.

If we focus on a specific setting where data are collected and “consumed,” then one of the appendices of the Vanderbilt University Medical Center (VUMC 2005) Informatics Strategic Plan has an interesting diagram that features four(!) circles.

I am not the first one to collect such diagrams. Here is an International Association for Social Science Information Services & Technology (IASSIST) blog entry on the Digital Life Cycle, published in 2006.

If you are interested in knowing how researchers actually interact with information, the Research Information Network (RIN) and the British Library published a report annex in 2009, titled Patterns of Information Use and Exchange. Here is one of the case studies. Arrows in the picture indicate relationships, as in “A may lead to B.” Different colors represent different types of activities, adapted from Humphrey (2006), mentioned above.

Thus, data are created. Do they live forever? No. Data do die. Michener et al. (1997) depict the normal degradation of data and metadata over time.

Representing DataONE, Michener (2011) makes the Data Life Cycle come alive either by adding the stakeholders’ needs or by associating each phase with appropriate available tools.

The latest Data Life Cycle diagram I have come across so far brings us back to the beginning. In its fifth iteration, ICPSR’s (2012) booklet features a Data Life Cycle diagram that is severely bent to form an almost-circle, although it is not quite closed. The original seven steps have now morphed into six phases.


Let’s Keep the Virtuous Circle Going

One way to understand the concept of the “Data Life Cycle” is to realize that there is a virtuous circle between data and research findings: new data beget new findings, and the new findings prompt new data collection, all the while deepening our understanding and enriching our knowledge.

This was the image that came to my mind when I read emails from the Mexican Migration Project (MMP) and the Latin American Migration Project (LAMP) asking users to post the publications based on their data. Have you done research using either project data? Then please help the projects continue to collect and share important migration data by adding your publication to the list. Visit the project publications page (MMP, LAMP) today, and let’s keep the good circle going!

Note: The manager of the MMP project informed me that MMP has an informative Frequently Asked Questions (FAQ) page available for the data users. Cool!

Firm Data User Support for the Fragile Families Study

For nearly 15 years now, researchers at Princeton and Columbia Universities have been collecting data from 20 cities on the lifestyles, health, and wellbeing of unmarried parents and their children. This ongoing project, known as the Fragile Families and Child Wellbeing Study, began by interviewing parents when their children were born and has continued with follow-up interviews at their children’s first, third, fifth, and ninth birthdays. Researchers are now preparing for the 15-year wave, which will examine adolescent wellbeing, behavior, and peer influence, and once again follow up with parents.

As the Fragile Families study continues moving forward, new findings are constantly emerging from the data. The Future of Children published a volume on Fragile Families summarizing many of these findings. In addition, hundreds of publications, including journal articles, books and book chapters, working papers, and research briefs have been made available for easy access on the Fragile Families publications website. Topics include family structure, employment and earnings, incarceration, child care, mental health and stress, parenting, relationship quality, race/ethnicity and nativity, religion, and much more. Keep your eyes peeled as these publications reveal the newest findings from the 9-year wave.

Researchers and Fragile Families staff members at Princeton seek to maximize the use of the rich data that have come out of the Study by making data files available for public use. Novice and experienced data users can email the FFDATA team (ffdata@princeton.edu) with questions about the Study and receive help with downloading and using the various files. They can also inquire about the three-day Fragile Families data users’ workshop that will be held in July at Columbia University.

Note: Chang would like to thank the FFDATA team members for the interesting and enlightening conversation on how much they do to support their data users. The study recently released the 9-year wave public-use data through the OPR Data Archive.

Demography Volume 48

All four issues of Demography volume 48 (2011) are now published. As a data person, I am interested in the data the authors used for their research. Below is my attempt at summarizing what I found by reading the data sections of all the articles, except for five items that did not directly rely on analyzing data (e.g., a correction, an acknowledgement, and the index).

Some of the data/project initials and short names are more familiar to the typical data user while others are less so. Click on the name to learn more about them. I am planning to follow up with all the data websites mentioned here for updates and new releases.

This list was generated by running a simple XQuery, which read three XML input files (articles.xml, articlesData.xml, and data.xml) and generated an HTML DIV element. The data names are linked to the most closely related website, and the authors are linked to the article itself via its DOI. The files are available for downloading here (a zip file).
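
For readers who would rather not use XQuery, the same kind of join can be sketched in Python. This is only an illustration of the idea, not the query I actually ran: the element and attribute names, the inline sample records, and the example.org URLs and DOIs below are all made-up assumptions, since the real layout of the three files is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for articles.xml, articlesData.xml, and data.xml.
# Every element name, attribute, DOI, and URL here is invented for illustration.
ARTICLES_XML = """<articles>
  <article id="a1" doi="10.0000/example-1"><authors>Author A</authors></article>
  <article id="a2" doi="10.0000/example-2"><authors>Author B et al.</authors></article>
</articles>"""

ARTICLES_DATA_XML = """<links>
  <link article="a1" data="d1"/>
  <link article="a2" data="d1"/>
  <link article="a2" data="d2"/>
</links>"""

DATA_XML = """<datasets>
  <dataset id="d1" url="https://example.org/survey-x"><name>Survey X</name></dataset>
  <dataset id="d2" url="https://example.org/census-y"><name>Census Y</name></dataset>
</datasets>"""


def build_div(articles_xml, links_xml, data_xml):
    """Join the three inputs and return an HTML DIV as a string."""
    # Index articles by id so each link can be resolved quickly.
    articles = {a.get("id"): a for a in ET.fromstring(articles_xml).iter("article")}
    links = [(l.get("data"), l.get("article"))
             for l in ET.fromstring(links_xml).iter("link")]
    # List data sources alphabetically by name, ignoring case.
    datasets = sorted(ET.fromstring(data_xml).iter("dataset"),
                      key=lambda d: d.findtext("name").lower())

    div = ET.Element("div")
    for ds in datasets:
        p = ET.SubElement(div, "p")
        # The data name links to the most closely related website.
        name_link = ET.SubElement(p, "a", href=ds.get("url"))
        name_link.text = ds.findtext("name")
        name_link.tail = " "
        cited = [articles[art_id] for data_id, art_id in links
                 if data_id == ds.get("id")]
        for i, art in enumerate(cited):
            # The authors link to the article itself via its DOI.
            author_link = ET.SubElement(p, "a", href="https://doi.org/" + art.get("doi"))
            author_link.text = art.findtext("authors")
            author_link.tail = ", " if i < len(cited) - 1 else ""
    return ET.tostring(div, encoding="unicode")


html_div = build_div(ARTICLES_XML, ARTICLES_DATA_XML, DATA_XML)
print(html_div)
```

Keeping the article-to-data links in their own file means an article that uses several data sources, or a data source used by several articles, needs no duplicated records.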

Feel free to let me know if you find any errors or broken links. Thanks!

  • ACS: Parrado
  • Add Health: Scharoun-Lee et al., Kusunoki and Upchurch
  • BRFSS: Bratter and Gorman
  • CAPS: Magruder
  • CAS: Swaroop and Krysan
  • CE: Zagheni
  • CLHLS: Wen and Gu
  • CPS: DeLeire et al., Parrado
  • CPS Report Card: Swaroop and Krysan
  • Chilean Birth Certificates: Torche
  • China Statistical Yearbook: Goodkind
  • Chinese Census: Ebenstein, Li et al.
  • Chitwan Valley Family Study: Bohra-Mishra and Massey
  • DHS: Bocquier et al., Case and Paxson, Magruder
  • DTR: van den Berg et al., Behrman et al.
  • Demographic Yearbook: Zheng et al.
  • ELSA: Ploubidis and Grundy, Delavande and Rohwedder
  • ENADID: Rendall et al. (1049-1058)
  • ENOE: Rendall et al. (1049-1058)
  • Ethiopian Field Experiment Data: Desai and Tarozzi
  • FAOSTAT: Lam
  • Fragile Families: Geller et al., Corman et al.
  • GGS: Perelli-Harris and Gerber
  • GSS: Wolfinger
  • HMD: Shkolnikov et al.
  • HRS: Delavande and Rohwedder
  • IFLS: Kuhn et al.
  • INSEC: Bohra-Mishra and Massey
  • IPUMS-I: Lam
  • KIDS: Cancian et al.
  • Kenyan RHC Data: Luke et al.
  • Kyrgyzstan Data: Guillot et al.
  • LPR: Behrman et al.
  • Mexican Census: Halpern-Manners
  • Mozambican Survey: Agadjanian et al.
  • NCDB: South et al.
  • NELS:88: Stange
  • NFHS: Gaudin
  • NHANES: Johnston and Lee
  • NHIS: Fuller
  • NIS: Xie and Gough
  • NLSY79: Barber and East, Dariotis et al., Brand and Davis, Vespa and Painter
  • NNCS: Swaroop and Krysan
  • NSFG: Parrado, Axinn et al., Magruder, Guzzo and Hayford
  • NUJLSOA: Takagi and Silverstein
  • NVSS: DeLeire et al.
  • OECD.Stat Extracts: Shkolnikov et al.
  • PETS: Stange
  • PSID: South et al., Grieger and Danziger
  • PUMS: Elo et al., Thomas
  • Registry Database by Statistics Norway: Kalil et al.
  • SABE: Maurer
  • SATP: Bohra-Mishra and Massey
  • SHARE: Delavande and Rohwedder
  • SIPP: Rendall et al. (481-506)
  • SSD: Zorlu and Mulder
  • Simulation: Ceballos, Diaz et al.
  • U.S. Census: McDaniel et al., Swaroop and Krysan
  • UN Data: Lam, Espenshade et al.
  • US Supreme Court Data: Stolzenberg
  • Virginia 30k: Boardman et al.
  • WDI: Zheng et al.
  • WHO Data and Statistics: Shkolnikov et al., Espenshade et al.
  • WHO Mortality Database: Rostron and Wilmoth
  • WHS: Pampel and Denney
  • WIID2: Shkolnikov et al.
  • World Bank Data: Shkolnikov et al., Lam
  • World Population Prospects: Alkema et al.

2010 Census Data vs. American Community Survey

Until the year 2010, the U.S. conducted a decennial census consisting of a short form, completed by everyone, and a longer, more extensive form, completed by certain households, so that information about many variables was available for almost all geographies for the decennial year. As a result, we had a lot of information for the decennial year and very little for the years in between. With the advent of the American Community Survey (ACS), that has changed.

What is the ACS? It is an ongoing nationwide survey that replaces the long form. It does not actually “count” the population, but it does give information about the same variables that were available from the decennial census, averaged into 1-year, 3-year, or 5-year estimates (periods of time vs. a point in time). So, this means we have more information about the years between censuses, but less detail about the decennial year itself.

It will take some time to adjust to this new way of looking at census data and it helps to keep these important tips in mind:

  • Given the differences between the ACS and the decennial census, comparing data from the two sources is not recommended. The only data that can be compared are the short form data from 2010 and the short form data from previous decennial censuses.
  • ACS data can be compared to ACS data. Best practice is to compare only 1-year estimates with other 1-year estimates, 3-year estimates with other 3-year estimates, and 5-year estimates with other 5-year estimates, and the time periods should not overlap. For example, comparing data from 2005-2007 with 2006-2008 is not recommended, but it is OK to compare 2005-2007 with 2008-2010.
  • Due to the nature of survey data and the sample sizes, data for the smallest geographies may only be available for the 5-year estimates.
  • Label your ACS data correctly: “2005-2007 ACS data” vs. “2005,” “2006,” or “2007.”
  • Most important: pay attention to “Sampling Errors,” especially to “Margin of Error,” which is presented with the data.
  • Need help? Contact a librarian. jdonatie@princeton.edu
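
Two of the tips above lend themselves to a quick check in code, so here is a small Python sketch. It rests on assumptions I should state plainly: the function names are mine, the estimates in the usage lines are invented, and the significance test follows the Census Bureau's general guidance as I understand it (published ACS margins of error are at the 90 percent confidence level, so the standard error is the margin of error divided by 1.645). Please check the documentation that accompanies your own tables before relying on it.

```python
import math

# Assumption: ACS margins of error are published at the 90 percent
# confidence level, which corresponds to a z-value of 1.645.
Z_90 = 1.645


def periods_comparable(period_a, period_b):
    """Return True if two ACS periods have the same length and do not overlap.

    Each period is a (start_year, end_year) pair, inclusive,
    e.g. (2005, 2007) for a 3-year estimate.
    """
    same_length = (period_a[1] - period_a[0]) == (period_b[1] - period_b[0])
    no_overlap = period_a[1] < period_b[0] or period_b[1] < period_a[0]
    return same_length and no_overlap


def significantly_different(est1, moe1, est2, moe2):
    """Test whether two estimates differ at the 90 percent confidence level."""
    se1 = moe1 / Z_90  # convert each margin of error to a standard error
    se2 = moe2 / Z_90
    z = (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return abs(z) > Z_90


# The two examples from the tip above.
print(periods_comparable((2005, 2007), (2006, 2008)))  # overlapping periods
print(periods_comparable((2005, 2007), (2008, 2010)))  # non-overlapping periods

# Invented numbers: 1,000 (+/-100) in one period vs. 1,300 (+/-100) in another.
print(significantly_different(1000, 100, 1300, 100))
```

The second function is the reason the margin of error matters so much: two estimates that look different on the page may not be distinguishable once their sampling errors are taken into account.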


Data Analysis Training at Firestone Library

If you or your students need a primer or refresher on data analysis, Oscar Torres-Reyna, one of the data consultants at Firestone Library, offers free training sessions on Friday afternoons.  Registration is requested. For general information about Getting Started in Data Analysis, Oscar has a great web page.

Some of Oscar’s class offerings include:

  • Exploring data and descriptive statistics (Stata)
  • Exploring data and descriptive statistics (R)
  • Introduction to linear regression (Stata)
  • Introduction to panel data analysis (Stata)
  • Introduction to linear regression (R)
  • Introduction to panel data analysis (R)


On the evening of October the 3rd, the ten thousandth user* registered to access the OPR Data Archive.

This is not as momentous as the world’s population reaching 7 billion people, but it is a moment to celebrate, nonetheless. The user registration system went on-line in late 2003 when we had 76 users for the year. Since 2006, however, the user list has grown by about 1,400 every year.

With a new powerful database engine and a completely re-written web application taking advantage of the latest server technology, the archive is capable of serving the current and future users reliably and rapidly.

In case you missed them, here are the most recent new and updated data releases:

  • Survey of Unemployed Workers in New Jersey (NJUI)
  • The Mexican Migration Project (MMP) — 134 communities
  • The Fragile Families (FF) Year 9 Follow-up (Wave 5)

* User here means a distinct email address. There may be people registered with multiple email addresses.

More about Stata Manuals

If you are looking for Stata manuals in the Princeton Libraries, the best way to find them is to go to the Main Library Catalog, put “Stata” (no quotes) in the “Search For” box, click on “Subject Heading” in the “Search By” box and then click the “Search” button. You will see two relevant headings:

  • Stata (choose the one with the most hits) or
  • Stata-Handbooks, Manuals, etc.

Once you click on those links, you will see a list of what is available. Some are located in Stokes Library, some are in Firestone.

There is a heavy demand for Stata books, so the one you are seeking may be charged out. If so, you may submit a “Recall” notice requesting that it be returned within two weeks. (See the “Recall” button at the top of the Library Catalog screen.) Alternatively, you can check if one of our Borrow Direct partner libraries has a copy available and get it from them. (See the “Borrow Direct” link at the top of the Library Catalog screen.) If you need any assistance with placing these requests, please feel free to contact Joann (jdonatie@princeton.edu) or another librarian (piaprlib@princeton.edu) and we’ll be happy to help you.

Also, if you have a suggestion for a Stata manual you would like Stokes Library to purchase, please send Joann an e-mail (jdonatie@princeton.edu).




