IT 385
Managing Big Data
Professor Tim Eagle
Module 3:
Data Assets
Data Assets
● ETL
● Data Formats
● Sources of Data
  ○ Internal vs External
  ○ Log Data
  ○ Public vs. Private
  ○ Free vs. Paid
  ○ Refreshing Data
● Data Providers
● Data Scraping
  ○ Old Way - Scraping from HTML
  ○ New Way - APIs, XML, CAML, etc.
● Data Formatting
  ○ Strategies - ETL vs ELT
  ○ Tools
Extract, Transform, and Load
● Extract is the process of reading data from a database. In this stage, the data is collected, often from multiple and different types of sources.
● Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
● Load is the process of writing the data into the target database.
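To make the three stages concrete, here is a minimal Python sketch of an ETL pass; the file name, lookup table, and target schema are all hypothetical, not tied to any particular tool.

```python
import csv
import sqlite3

# Minimal ETL sketch; file, table, and field names are hypothetical.

# Extract: read rows from a source (here, a CSV export).
with open("sales_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: apply a rule (round amounts) and a lookup table (region codes).
region_lookup = {"NE": "Northeast", "SW": "Southwest"}
for row in rows:
    row["amount_usd"] = round(float(row["amount"]), 2)
    row["region"] = region_lookup.get(row["region_code"], "Unknown")

# Load: write the transformed rows into the target database.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (amount_usd REAL, region TEXT)")
conn.executemany(
    "INSERT INTO sales (amount_usd, region) VALUES (?, ?)",
    [(r["amount_usd"], r["region"]) for r in rows],
)
conn.commit()
conn.close()
```

Production pipelines add error handling, logging, and incremental loads, but the three stages stay the same.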
Data Formats
● Excel
● Delimited Files - Most common for larger sets
  ○ Can be easy, but older sets are often complex. (FECA example)
● JSON - JavaScript Object Notation - Becoming more popular
● XML - Extensible Markup Language - Most common for newer data sets
● Proprietary Formats
  ○ SAS Datasets
  ○ Database Backups/dumps
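A short sketch of what reading each of the open formats above looks like in Python, using only the standard library; the file names and field names are invented for illustration.

```python
import csv
import json
import xml.etree.ElementTree as ET

# File and field names are invented; each format needs a different reader.

# Delimited file: one record per line, fields split on a delimiter.
with open("records.csv", newline="") as f:
    csv_rows = list(csv.DictReader(f))

# JSON: nested objects and arrays map directly onto dicts and lists.
with open("records.json") as f:
    json_data = json.load(f)

# XML: a tree of tagged elements that you walk to pull out fields.
tree = ET.parse("records.xml")
xml_rows = [
    {"id": rec.get("id"), "name": rec.findtext("name")}
    for rec in tree.getroot().iter("record")
]
```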
Sources of Data
Internal vs External
Internal
● Free
● Varied depending on business
● Should be able to paint a picture of your business
● Can be quite messy
External
● Not Free
● Plenty of places to look
● Pay for cleaner data
Sources of Data
Log Data
● Depending on the project you are working on, log data might be useful
● Sources:
  ○ System Logs
  ○ Network Logs
  ○ Application Logs
● Often need more formatting than other data sources
● Often very noisy
● Log aggregation tools bring it all together and might handle formatting
  ○ Splunk
  ○ Graylog
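As an illustration of why log data "often needs more formatting," here is a sketch that parses a syslog-style line into fields with a regular expression. The pattern fits this example layout only; real log formats vary by system, which is part of what aggregation tools handle for you.

```python
import re

# The pattern below fits this syslog-style example only; real log layouts
# vary by system, which is why log data usually needs extra formatting work.
LOG_PATTERN = re.compile(
    r"(?P<ts>\w{3} +\d+ [\d:]+) (?P<host>\S+) (?P<proc>[\w\-/]+)(\[\d+\])?: (?P<msg>.*)"
)

line = "Mar  4 09:21:07 web01 sshd[4210]: Failed password for invalid user admin"
match = LOG_PATTERN.match(line)
if match:
    record = match.groupdict()  # structured fields instead of a raw string
    print(record["host"], "->", record["msg"])
```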
Sources of Data
Public vs. Private
Public Data Sets:
● NWS Weather
● Data.gov
● SSA DMF
● https://cloud.google.com/public-datasets/
● https://aws.amazon.com/opendata/public-datasets/
● https://datasetsearch.research.google.com/
● https://github.com/awesomedata/awesome-public-datasets
Private Data Sets:
● Typically not free
● Very business-specific
● Update frequently
● Very specific licensing terms
● https://www.fdbhealth.com/solutions/medknowledge/medknowledge-drug-pricing
● Dun & Bradstreet
● https://risk.lexisnexis.com/products/accurint-for-healthcare
Sources of Data
Free vs. Paid
Free
● Generally will require more cleaning
● Updated less often
● Some sets are free for partial sets of data
● No support for data issues
● You get what you pay for...
Paid
● Cleanliness of data varies
● Pay for more updates
● Full sets of data, or built for your specific needs
● Support often provided
● Can get costly (cost-benefit analysis needed)
● Some public data costs as well
Sources of Data
Refreshing Data
● How often?
● Refresh Policy (see the sketch below)
  ○ Replace Existing Data
  ○ Update Existing Data
  ○ Append Data
  ○ Update and Add New Data
● Storage Space limitations
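A minimal sketch of how the refresh policies above differ in practice, using SQLite (3.24+ for the upsert syntax); the table and values are hypothetical.

```python
import sqlite3

# Hypothetical "prices" table; requires SQLite 3.24+ for ON CONFLICT upserts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (sku TEXT PRIMARY KEY, price REAL)")
full_feed = [("A100", 9.99), ("B200", 4.50)]

# Replace existing data: wipe the table and reload the full set.
conn.execute("DELETE FROM prices")
conn.executemany("INSERT INTO prices VALUES (?, ?)", full_feed)

# Update and add new data ("upsert"): modify matching rows, insert the rest.
# (Append-only history would instead need a load timestamp in the key.)
conn.executemany(
    "INSERT INTO prices VALUES (?, ?) "
    "ON CONFLICT(sku) DO UPDATE SET price = excluded.price",
    [("A100", 10.49), ("C300", 2.25)],
)
conn.commit()
```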
Data Providers
Data Providers - Companies/websites that aggregate various datasets, then provide the data under either a paid or open license. Can be very market-specific. Often have a single API to access all data sets.
● https://www.ecoinvent.org/home.html
● https://intrinio.com/
● https://www.programmableweb.com/category/all/apis
● Data.gov
Also, there is a new push for Data as a Service, where we don't download data sets; we just query against a service provider.
Data Scraping
Old Way
● Since the internet used to be fairly static, to get data we would scrape it from web pages using code.
● You'd have scripts run through pages, look for certain spots or words, and capture them in another file.
● Often memory- and resource-intensive.
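A sketch of the old way, assuming the third-party requests and beautifulsoup4 packages. The URL and CSS selectors are hypothetical, which is exactly the weakness of this approach: the script breaks whenever the page layout changes.

```python
import requests                    # third-party: pip install requests
from bs4 import BeautifulSoup      # third-party: pip install beautifulsoup4

# Fetch a page and hunt for "certain spots" in the HTML, capturing them
# in another file. The URL and CSS selectors are hypothetical.
html = requests.get("https://example.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

with open("prices.csv", "w") as out:
    for item in soup.select("div.product"):
        name = item.select_one("h2").get_text(strip=True)
        price = item.select_one("span.price").get_text(strip=True)
        out.write(f"{name},{price}\n")
```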
Data Scraping
New Way
● Grab RSS/XML feeds from pages
● Use APIs to access a site's data (see the sketch below)
● Use tools to scrape data from social media or webpages
  ○ Data Scraper - Chrome plugin
  ○ WebHarvy
  ○ Import.io
● Buy pre-scraped and formatted data
● Use a hybrid of the new and old ways to see the whole picture
  ○ https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python-and-selectorlib/
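For contrast, a sketch of the new way: calling a documented JSON API with requests. This assumes data.gov's CKAN catalog still exposes its package_search endpoint; the search term is arbitrary.

```python
import requests  # third-party: pip install requests

# Query data.gov's CKAN catalog API for datasets matching a search term.
resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "weather", "rows": 5},
    timeout=10,
)
resp.raise_for_status()
for dataset in resp.json()["result"]["results"]:
    print(dataset["title"])
```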
Data Formatting
ETL vs. ELT
https://www.xplenty.com/blog/etl-vs-elt/
Data Formatting
ETL
● A continuous, ongoing process with a well-defined workflow: ETL first extracts data from homogeneous or heterogeneous data sources. Next, it deposits the data into a staging area. Then the data is cleansed, enriched, transformed, and stored in the data warehouse.
● Used to require detailed planning, supervision, and coding by data engineers and developers: The old-school methods of hand-coding ETL transformations in data warehousing took an enormous amount of time. Even after designing the process, it took time for the data to go through each stage when updating the data warehouse with new information.
● Modern ETL solutions are easier and faster: Modern ETL, especially for cloud-based data warehouses and cloud-based SaaS platforms, happens a lot faster.
Data Formatting
ELT
● Ingest anything and everything as the data becomes available: ELT paired with a data lake lets you ingest an ever-expanding pool of raw data immediately, as it becomes available. There's no requirement to transform the data into a special format before saving it in the data lake.
● Transforms only the data you need: ELT transforms only the data required for a particular analysis. Although it can slow down the process of analyzing the data, it offers more flexibility, because you can transform the data in different ways on the fly to produce different types of metrics, forecasts, and reports. Conversely, with ETL, the entire ETL pipeline, and the structure of the data in the OLAP warehouse, may require modification if the previously decided structure doesn't allow for a new type of analysis.
● ELT is less reliable than ETL: It's important to note that the tools and systems of ELT are still evolving, so they're not as reliable as ETL paired with an OLAP database. Although it takes more effort to set up, ETL provides more accurate insights when dealing with massive pools of data. Also, developers who know how to use ELT technology are more difficult to find than ETL developers.
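A toy illustration of the ELT idea: land the raw records untouched, then transform at query time only the fields a given analysis needs. This uses SQLite's JSON1 functions as a stand-in for a real data lake and warehouse; table and field names are hypothetical.

```python
import json
import sqlite3

# Load raw records as-is into a staging table, then transform on read.
# Requires a SQLite build with the JSON1 functions; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")

events = [{"user": "a", "ms": 1200}, {"user": "b", "ms": 950}]
conn.executemany(
    "INSERT INTO raw_events VALUES (?)", [(json.dumps(e),) for e in events]
)

# Transform at query time: extract only the field this report needs.
avg_ms = conn.execute(
    "SELECT AVG(json_extract(payload, '$.ms')) FROM raw_events"
).fetchone()[0]
print(f"average latency: {avg_ms} ms")
```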
Data Formatting
Tools
● Excel - It can do a ton of great stuff; however, it SH*** the bed with larger data sources
● Scripting - Perl or Python - Work great with flat files, XML, and JSON, but not with other formats
● SQL - Can do much of the formatting in SQL and create new tables
● SAS / R - Same as scripting: very powerful, but a learning curve
● ETL-Specific Tools
  ○ Informatica
  ○ Microsoft SSIS
  ○ Oracle Data Integrator
  ○ IBM InfoSphere DataStage
  ○ Apache Airflow, Kafka, NiFi
  ○ Talend Open Studio
Data Formatting
Tips
● Dates/Times - convert to the same timezone (see the sketch below)
● Units - convert to the same units when possible
● Standardize addresses
● Classify/tag certain data
● Timestamp data
● If you are going to map, geocode early - it can be costly.
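A small sketch of the first tip: normalizing mixed-timezone timestamps to UTC with Python's zoneinfo (3.9+). The stamps are invented, and both turn out to be the same instant once converted.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Two invented local timestamps that are actually the same instant.
stamps = [
    ("2023-06-01 09:30", "America/New_York"),
    ("2023-06-01 15:30", "Europe/Paris"),
]
for text, zone in stamps:
    local = datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=ZoneInfo(zone))
    print(text, zone, "->", local.astimezone(timezone.utc).isoformat())
```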
In Class Assignment
Find the best data source to help solve these problems.
● When should our nationwide business start to stock snow shovels?
● Where is the best location to buy real estate for a car dealership?
● Which third baseman plays better in day games?
● How many people have larger incomes and fewer children in a specific geographic area?
● What data sources would have me (your professor) in them?
Post answers to the discussion board, as replies to each question.
The End
● Quiz 1 posted, due next week
● Assignment 1 posted, due in 2 weeks
IT 385
Managing Big Data
Professor Tim Eagle
Module 4:
Data Quality
Data Quality
● Garbage In, Garbage Out
● Definition of Data Quality
● The Continuum of Data Quality
● Other Problems with Data Quality
● Creating Better Data Quality
  ○ Data Cleansing
  ○ Master Data Management
  ○ Data Deduplication
  ○ Data Interpretation
● The Need for Domain Experts
Garbage In, Garbage Out
Definition of Data Quality
Validity - Data measure what they are supposed to measure.
Reliability - Everyone defines, measures, and collects data the same way, all the time.
Completeness - Data include all of the values needed to calculate indicators. No variables are missing.
Precision - Data have sufficient detail. Units of measurement are very clear.
Timeliness - Data are up to date. Information is available on time.
Integrity - Data are true. The values are safe from deliberate bias and have not been changed for political or personal reasons.
Validity
● Data measure what they are supposed to measure.
When it goes wrong:
AI and facial recognition.
https://www.wmm.com/sponsored-project/codedbias/
Reliability
Everyone defines, measures, and collects data the same way, all the time.
When it goes wrong:
Even the discovery of the Americas was a result of bad data. Christopher Columbus made a few significant miscalculations when charting the distance between Europe and Asia. First, he favored values given by the Persian geographer Alfraganus over the more accurate calculations of the Greek geographer Eratosthenes. Second, Columbus assumed Alfraganus was referring to Roman miles in his calculations when, in reality, he was referring to Arabic miles.
Completeness
Data include all of the values needed to calculate indicators. No variables are missing.
When it goes wrong:
The 2016 United States Presidential election was also mired in bad data. National polling data used to predict state-by-state Electoral College votes led to the prediction of a Hillary Clinton landslide, a forecast that led many American voters to stay home on Election Day.
Also, The Big Short.
Precision
● Data have sufficient detail. Units of measurement are very clear.
When it goes wrong:
In 1999, NASA took a $125 million hit when it lost the Mars Climate Orbiter. It turns out that the engineering team responsible for developing the Orbiter used English units of measurement while NASA used the metric system. The inconsistent data made for a rather costly and disastrous mistake.
Timeliness
Data are up to date. Information is available on time.
When it goes wrong:
The turning point of the Civil War: Gettysburg. Lee, general of the Confederate army, had old intelligence and didn't know the accurate count of the opposing troops.
Integrity
Data are true. The values are safe from deliberate bias and have not been changed for political or personal reasons.
When it goes wrong:
The Enron scandal in 2001 was largely a result of bad data. Enron was once the sixth-largest company in the world. A host of fraudulent data provided to Enron's shareholders resulted in Enron's meteoric rise and subsequent crash. An ethical external auditing firm could have prevented this fraud from occurring.
Or the anti-vaccination movement.
The Data Quality Continuum
• Data and information are not static; they flow through a data collection and usage process:
– Data gathering
– Data delivery
– Data storage
– Data integration
– Data retrieval
– Data mining/analysis
Data Gathering
• How does the data enter the system?
• Sources of problems:
– Manual entry
– No uniform standards for content and formats
– Parallel data entry (duplicates)
– Approximations, surrogates – SW/HW constraints
– Measurement errors.
Solutions
• Potential Solutions:
– Preemptive:
• Process architecture (build in integrity checks)
• Process management (reward accurate data entry, data sharing,
data stewards)
– Retrospective:
• Cleaning focus (duplicate removal, merge/purge, name & address
matching, field value standardization)
• Diagnostic focus (automated detection of glitches).
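As a sketch of what "build in integrity checks" can mean at the code level, here is a hypothetical validator run at data entry; the field names, ranges, and allowed codes are invented for illustration.

```python
# Hypothetical validator run at the point of data entry; field names,
# ranges, and allowed codes are invented for illustration.
def validate(record: dict) -> list[str]:
    errors = []
    if not record.get("patient_id", "").strip():
        errors.append("patient_id is required")
    age = record.get("age")
    if age is None or not (0 <= age <= 120):
        errors.append(f"age out of range: {age!r}")
    if record.get("state") not in {"NY", "NJ", "CT"}:  # uniform content standard
        errors.append(f"unknown state code: {record.get('state')!r}")
    return errors

print(validate({"patient_id": "P-1", "age": 34, "state": "NY"}))  # []
print(validate({"patient_id": "", "age": 430, "state": "N.Y."}))  # three errors
```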
Data Delivery
• Destroying or mutilating information by inappropriate pre-processing
– Inappropriate aggregation
– Nulls converted to default values
• Loss of data:
– Buffer overflows
– Transmission problems
– No checks
Solutions
• Build reliable transmission protocols
– Use a relay server
• Verification
– Checksums, verification parser
– Do the uploaded files fit an expected pattern?
• Relationships
– Are there dependencies between data streams and processing steps?
• Interface agreements
– Data quality commitment from the data stream supplier.
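A sketch of the checksum idea: recompute a digest over the received file and compare it to the one the supplier published. The file name is hypothetical and the expected digest is deliberately left as a placeholder.

```python
import hashlib

# Recompute a digest over the received file and compare it to the digest
# the supplier published. File name and digest are placeholders.
def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):  # stream large files
            h.update(chunk)
    return h.hexdigest()

expected = "..."  # digest published alongside the feed
if sha256_of("feed_2023-06-01.csv") != expected:
    raise ValueError("file corrupted or truncated in transit")
```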
Data Storage
• You get a data set. What do you do with it?
• Problems in physical storage
– Can be an issue, but terabytes are cheap.
• Problems in logical storage (ER → relations)
– Poor metadata.
• Data feeds are often derived from application programs or legacy
data sources. What does it mean?
– Inappropriate data models.
• Missing timestamps, incorrect normalization, etc.
– Ad-hoc modifications.
• Structure the data to fit the GUI.
– Hardware / software constraints.
• Data transmission via Excel spreadsheets, Y2K
Solutions
• Metadata
– Document and publish data specifications.
• Planning
– Assume that everything bad will happen.
– Can be very difficult.
• Data exploration
– Use data browsing and data mining tools to examine the data.
• Does it meet the specifications you assumed?
• Has something changed?
Data Integration
• Combine data sets (acquisitions, across departments).
• Common sources of problems:
– Heterogeneous data: no common key, different field formats
• Approximate matching
– Different definitions
• What is a customer: an account, an individual, a family, …
– Time synchronization
• Does the data relate to the same time periods? Are the time windows compatible?
– Legacy data
• IMS, spreadsheets, ad-hoc structures
– Sociological factors
• Reluctance to share – loss of power.
Solutions
• Commercial Tools
– Significant body of research in data integration
– Many tools for address matching, schema mapping are available.
• Data browsing and exploration
– Many hidden problems and meanings: must extract metadata.
– View before and after results: did the integration go the way you thought?
Data Retrieval
• Exported data sets are often a view of the actual data. Problems occur because:
– Source data not properly understood.
– Need for derived data not understood.
– Just plain mistakes.
• Inner join vs. outer join
• Understanding NULL values
• Computational constraints
– E.g., too expensive to give a full history, so we'll supply a snapshot.
• Incompatibility
– EBCDIC?
Data Mining and Analysis
• What are you doing with all this data anyway?
• Problems in the analysis.
– Scale and performance
– Confidence bounds?
– Black boxes and dart boards
• “fire your Statisticians”
– Attachment to models
– Insufficient domain expertise
– Casual empiricism
Solutions
• Data exploration
– Determine which models and techniques are appropriate, find data bugs,
develop domain expertise.
• Continuous analysis
– Are the results stable? How do they change?
• Accountability
– Make the analysis part of the feedback loop.
Other problems in DQ - Missing Data
• Missing data - values, attributes, entire records, entire sections
• Missing values and defaults are indistinguishable
• Truncation/censoring - not aware, mechanisms not known
• Problem: Misleading results, bias.
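A tiny pandas sketch (third-party package) of the "missing values and defaults are indistinguishable" trap: if 0 secretly means "unknown," summary statistics are biased until missingness is made explicit. The column and values are invented.

```python
import pandas as pd  # third-party: pip install pandas

# If a billed amount of 0 secretly means "unknown", summaries are biased.
df = pd.DataFrame({"billed": [120.0, 0.0, 0.0, 85.0]})
print(df["billed"].mean())  # 51.25 -- pulled down if the zeros mean "missing"

# Make missingness explicit so aggregates skip it instead of counting it.
df.loc[df["billed"] == 0.0, "billed"] = float("nan")
print(df["billed"].mean())  # 102.5 -- mean of the known values only
```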
Data Glitches
• Systemic changes to data which are external to the recorded process.
– Changes in data layout / data types
• Integer becomes string, fields swap positions, etc.
– Changes in scale / format
• Dollars vs. euros
– Temporary reversion to defaults
• Failure of a processing step
– Missing and default values
• Application programs do not handle NULL values well …
– Gaps in time series
• Especially when records represent incremental changes.
Departmental Silos
● Everyone sees their job, department, or business as the most important thing.
● Often departments or other groups will have their own data quality standards for their specific mission.
● Data quality suffers when you have to look at data across the business or between companies.
● Example: Federal IDs for companies and businesses
  ○ DUNS vs. TaxID vs. NPI vs. SSN vs. Department ID vs. Universal ID
Then why is every DB dirty?
• Consistency constraints are often not used
– Cost of enforcing the constraint
• E.g., foreign key constraints, triggers.
– Loss of flexibility
– Constraints not understood
• E.g., large, complex databases with rapidly changing requirements
– DBA does not know / does not care.
• Garbage in
– Merged, federated, web-scraped DBs.
• Undetectable problems
– Incorrect values, missing data
• Metadata not maintained
• Database is too complex to understand
Improving Data Quality
Data Cleansing
● Not just about the data itself; also about standardizing business log data & metrics
● Create universal identifiers across your business; look at best practices.
● Convert dates to the same timezone and format.
● Standardize naming conventions in metadata.
Cleansing Methods
● Histograms
● Conversion Tables (see the sketch below)
  ○ Example - USA, U.S., U.S.A., US, United States
● Tools
● Algorithms
● Manually
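A minimal sketch of a conversion table, using the country-name example from the slide; the UNMAPPED flag is just one possible convention for surfacing values the table misses.

```python
# Conversion table built from the slide's example; the UNMAPPED flag is
# one possible convention for surfacing values the table misses.
COUNTRY_TABLE = {
    "USA": "United States",
    "U.S.": "United States",
    "U.S.A.": "United States",
    "US": "United States",
    "United States": "United States",
}

def standardize_country(raw: str) -> str:
    return COUNTRY_TABLE.get(raw.strip(), f"UNMAPPED:{raw.strip()}")

print(standardize_country(" U.S.A. "))  # United States
print(standardize_country("Amerika"))   # UNMAPPED:Amerika
```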
Master Data Management
MDM Continued
Data Deduplication
● Data deduplication: A process that examines new data blocks using hashing, compares them to existing data blocks, and skips redundant blocks when data is transferred to the target.
● Data reduction: A process that tracks block changes, usually using some kind of log or journal, and then transfers only new blocks to the backup target.
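A toy sketch of hash-based deduplication as described above: fixed-size blocks, a digest per block, and redundant blocks skipped. Real systems use variable-size chunking and persistent indexes; the block size and in-memory structures here are illustrative only.

```python
import hashlib

# Fixed-size blocks, one digest per block, redundant blocks skipped.
# Real systems use variable-size chunking and persistent indexes.
BLOCK_SIZE = 4096
seen: set[str] = set()           # digests already stored on the target
transferred: list[bytes] = []    # stands in for the actual transfer

def backup(data: bytes) -> None:
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i : i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in seen:   # skip blocks the target already has
            seen.add(digest)
            transferred.append(block)

backup(b"A" * 8192 + b"B" * 4096)       # three blocks, only two unique
print(len(transferred), "blocks sent")  # 2
```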
Data Interpretation
● Data interpretation refers to the implementation of processes through which data is reviewed for the purpose of arriving at an informed conclusion. The interpretation of data assigns a meaning to the information analyzed and determines its significance and implications.
  ○ Qualitative Interpretation - Observations, Documents, Interviews
  ○ Quantitative Interpretation - Mean, Standard Deviation, Frequency Distribution
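For the quantitative measures just named, a quick standard-library sketch on an invented sample:

```python
from collections import Counter
from statistics import mean, stdev

scores = [72, 85, 85, 90, 68, 85, 77]  # invented sample
print("mean:", round(mean(scores), 1))
print("standard deviation:", round(stdev(scores), 1))
print("frequency distribution:", Counter(scores).most_common())
```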
Data Interpretation Problems
● Correlation mistaken for Causation
● Confirmation Bias
● Irrelevant Data
Domain Expertise
• Data quality gurus: "We found these peculiar records in your database after running sophisticated algorithms!"
• Domain experts: "Oh, those apples - we put them in the same baskets as oranges because there are too few apples to bother. Not a big deal. We knew that already."
Why Domain Expertise?
• DE is important for understanding the data, the problem, and interpreting the results
• "The counter resets to 0 if the number of calls exceeds N."
• "The missing values are represented by 0, but the default billed amount is 0 too."
• Insufficient DE is a primary cause of poor DQ – data are unusable
• DE should be documented as metadata
Where is the Domain Expertise?
• Usually in people’s heads – seldom documented
• Fragmented across organizations
– Often experts don’t agree. Force consensus.
• Lost during personnel and project transitions
• If undocumented, deteriorates and becomes fuzzy over time
The End
● Readings: ebook Ch. 3 and Ch. 5
● Homework 1 is due next week
ALSPAC DATA MANAGEMENT PLAN, 2019-2024
0. Proposal name
The Avon Longitudinal Study of Parents and Children (ALSPAC). Core Program Support
2019-2024.
1. Description of the data
1.1 Type of study
ALSPAC is a multi-generation, geographically based cohort study following 14,541 mothers recruited
during pregnancy (in 1990-1992) and their partners (G0), offspring (G1) and grandchildren (G2).
1.2 Types of data
Quantitative data
• Data from numerous self-completed paper-based/online questionnaires.
• Data from clinic-based assessments: physiological, cognitive and anthropometric measures,
structured interview data and computer based questionnaire data.
• Genetic, metabolomic, proteomic, epigenetic, biochemical and environmental exposure data
obtained from analysis of biological samples.
• Data derived from images collected as part of clinical assessments (including MRIs, liver scans, DXA, pQCT, retinal scans, and 3D face and body shape).
• Data obtained through linkage to administrative records including maternity and birth records, child health records, cancer/death registrations through ONS, primary and secondary health care records, and education and criminal records.
• Data obtained from social media including Twitter, Facebook and Instagram.
Qualitative data from sub studies
• Small sub studies involving direct interview of participants or focus groups; audio/transcript files being
generated.
Bio resource
• Biological samples collection including DNA, lymphoblastic cell lines (LCLs), blood, saliva, urine, hair,
tissue such as placenta and umbilical cord, teeth and nail clippings.
1.3 Format and scale of the data
Type and scale of data
• 14,541 mothers originally enrolled, producing 14,676 fetuses, with 13,988 G1 children still alive at one year of age. Additional waves of enrollment resulted in a further 913 G1 children joining the study since 2000.
• At the time of writing (Feb 2019), > 900 G2 children (including in utero) have enrolled and
provided data.
• Self-completion questionnaires are electronically captured (scanned or digitally collected); to
date there have been 48 questionnaires completed by mothers (n=6000-13500), 18 by partners
(n=3000-9500), and 34 by G1 (n=5000-8000). Currently, 14 different questionnaires are used to
collect data about G2 children.
• Data from clinical examinations are electronically captured (scanned paperwork or digitally
collected). There have been 4 maternal sweeps (n=3500-4700), 1 father sweep (n=2000) and 10
G1 sweeps (with a minimum of 4000 attendees at each sweep).
• Electronically captured (scanned from paper or digitally collected) enrolment and consent (for e.g.
record linkage, obtaining and using biological samples) forms.
• Image data e.g. face shape, DXA, MRI, liver scan, 3D body scan.
• Administrative data (e.g. educational records in the National Pupil Database, linkage to the NHS
central register) The scale of linkage to individual administrative data sources depends on the
completeness of these sources, the relevant participant consents and other permissions alongside
technical considerations relating to the systems where these data are held.
• Data obtained from biological samples including whole genome sequence data (n=2000), genome
wide association (GWAS) data (n~22,000), metabolomics data (from ~20,000 samples), genome
wide epigenetics data (n~5000) plus small scale bespoke biochemical, cellular and genetic
analysis.
• Bio resource: over 30,000 DNA samples, 15,000 LCLs and over 1 million sample aliquots (blood, urine, saliva, tissue, hair, nails).
Data formats
• ALSPAC stores raw numerical data in a number of formats depending on the source of the data, including MS Access, SQL Server databases, MS Excel compatible files (.csv and .xls format), REDCap (MySQL), SPSS save and portable formats and Stata data files. Text data are stored as flat text files, MS Access databases and .csv files.
• Molecular data are stored in flat file formats including .csv and JSON, allowing for more complex file structures. Some formats are program specific (e.g. PLINK for SNP data). Some data are stored in MySQL databases (e.g. methylation data) but some molecular data are unsuitable for databases due to increasingly long index lengths; working copies of these 'Big Data' and archived data are stored on University of Bristol (UoB) storage systems (e.g. the ACRC Research Data Storage Facility). Raw (laboratory) data (e.g. Illumina IDAT format genotyping/methylation files) will also be redundantly archived on UoB storage systems, ensuring future availability.
• Data made available for research use through the ALSPAC resource is stored internally in multiple formats. Data is curated using a statistical package and stored in both SPSS and Stata formats. The curated data is imported into a Data Warehouse (Opal/MongoDB) and is stored in a highly flexible structure. Custom datasets can be exported in any of the common statistical formats, including SPSS, Stata, SAS, R and csv (Excel).
• Multimedia data such as DXA scans, recorded participant interviews and face shape images are stored using uncompressed or lossless compression formats where possible.
• Where applicable, data formats may be migrated as new technologies become available and are proved robust enough to ensure digital continuity and continued availability of data.
2. Data collection / generation
2.1 Methodologies for data collection / generation
New data will be generated by:
• Questionnaires (online or paper-based) completed by all cohort groups.
• Clinic based assessments on all cohort groups.
• Further biological sample collection from all cohort groups
• Linkage to administrative records of G0, G1 and G2.
• Interviews from qualitative studies.
• Image files such as DXA scans on G0 and G1.
• Updated contact information as provided by study participants.
• Molecular laboratory analysis of existing and new samples and integration with public molecular data.
• Further biological sample collection (blood, urine, saliva, hair, placenta, umbilical cord, breast
milk, meconium, stools, DNA and cell lines) from all cohort groups.
• Social media, such as Twitter.
2.2 Data quality and standards
• Each data item to be assessed by logical and range checks built into electronic data collection systems, with ambiguous values assessed by an operator.
• All assessment scales used in either questionnaires or clinic-based assessment to have been validated externally with a known reference paper.
• A small sample (~3%) of clinic participants to be re-invited to the clinic to validate earlier measures and test for any possible fieldworker bias or equipment calibration issues.
• Interview data to be collected and validated in real time on encrypted laptops with data routinely transferred to the central repository.
• Clinical assessment data to be collected by trained fieldworkers according to clear protocols; data to be analysed regularly by research staff to ensure data quality standards are being met; regular audits of clinic processes will be performed.
• Repeat molecular analysis of a subset of samples, QC using control probes and analysis for batch effects (built into the ALSPAC LIMS [laboratory information management system]).
• The laboratory managing the bio resource has obtained the ISO9001 quality standard. Sample data are stored in ALSPAC LIMS.
3. Data management, documentation and curation
3.1 Managing, storing and curating data
• Data from all sources will be cleaned and prepared for analysis by the in-house statistics, bioinformatics and data management teams following established standard operating procedures (SOPs).
• Each data item is referenced and stored using a universal indexing and naming convention.
• Research data items are stored separately from administrative data (including subject identifiers) and linked through anonymised files accessible only to specified members of the data management team.
• Instrument-specific and bespoke data formats are archived 'as is' to ensure the integrity of the original source material. As with other data sources, original copies of the data are never altered in any way.
• All research data collected are catalogued, maintained and archived on University of Bristol infrastructure which is scalable, secure and backed up routinely.
• The Bio resource is licensed by the Human Tissue Authority (license number 12512) and samples are stored in secure freezers and cryostores. These facilities are linked to an emergency generator to provide back-up power and are covered by a 24hr alarm system to alert staff to freezer failures out of normal working hours. The cell line collection is backed up at an external second site.
3.2 Metadata standards and data documentation
Metadata are collected as an integral process to (i) catalogue and index the data in a searchable
manner, (ii) define the assessment tools (validated measures, key reference publication, modifications
etc), (iii) describe the data collection process on an individual basis (age at completion, administration
and reminder process) (iv) catalogue laboratory information as captured through LIMS and (v) assign a
geographical reference point (at a non-disclosive level) to assist spatial analysis.
• Research data are not made available until it has been fully documented and published on the
ALSPAC website (http://www.bristol.ac.uk/alspac/researchers/our-data/).
• Metadata is provided through CLOSER Discovery (https://discovery.closer.ac.uk/) and is
compatible with the Data Documentation Initiative (DDI) Life-cycle 3.2.
• Study protocols, assessment tools, data derivation methods and coding schema are provided as
part of the research data documentation available as downloadable content from the ALSPAC
website (http://www.bristol.ac.uk/alspac/researchers/our-data/).
3.3 Data preservation strategy and standards
• It is envisaged that ALSPAC will continue to operate as a resource for current and future generations indefinitely. In line with this expectation, all data are anticipated to be maintained indefinitely.
• ALSPAC maintains an archive of data which is available to researchers on request and maintained on secure servers which are backed up on a regular basis. The University infrastructure consists of real-time mirroring of data across two geographically separate data centers (Bristol and Slough) as well as off-site tape backups on a nightly basis.
• In order to ensure the longevity and availability of the resource, ALSPAC reviews data on a regular basis and will apply digital continuity methods where applicable to migrate data formats at risk of obsolescence to newer formats to ensure that information contained within the files remains complete and usable.
• Where data are disposed of (for example, data that have been secured elsewhere on an obsolete hard disk) this will be done securely and in line with University IT information security policies.
• Primary source material (e.g. questionnaires, clinic data sheets and consent forms) will be preserved as electronic (scanned) copies where practicable.
• The UoB library holds an extensive administrative archive (Special Collections reference DM2616) of catalogued paperwork up to 2005 (study grant applications, protocols, ethical approvals, participant information materials, keying and coding specifications, documentation on each data collection measure and their provenance, and the file building syntax). https://www.bristol.ac.uk/library/special-collections/strengths/alspac/
4. Data security and confidentiality of potentially disclosive information
4.1 Formal information/data security standards
• ALSPAC gained ISO 27001 certification in 2012 and is compliant with that standard.
4.2 Main risks to data security
ALSPAC avoids potentially disclosive subsets of data at all costs. For example:
• Complete dates (of birth, Q completion, clinic attendance) are not released to researchers; instead, ages are derived.
• Cell counts are no smaller than n=5; this applies to all publications that are reviewed by the ALSPAC Executive prior to journal submission.
• Free text (with its own unique ID) is coded separately from any other data; any identifying information is screened out before being passed to a researcher.
• Address data is dealt with separately from any other data (with its own unique ID); relevant address-level data are obtained, aggregated as appropriate and matched back to the main dataset.
• All external researchers receive a dataset with an ID attached which is unique to them and their project.
• Anonymised frequency tables and summary statistics are freely available as part of the data dictionary.
5. Data sharing and access
Identify any data repository (-ies) that are, or will be, entrusted with storing, curating and/or
sharing data from your study, where they exist for particular disciplinary domains or data types.
Information on repositories is available here.
5.1 Suitability for sharing
Yes. ALSPAC data is used in a wide and varying number of research consortia and cross-cohort
collaborations.
5.2 Discovery by potential users of the research data
• The ALSPAC access policy details how data can be accessed by researchers (http://www.bristol.ac.uk/alspac/researchers/data-access/).
• The cohort is advertised through a wide variety of sources including the MRC Gateway to Research and the Maelstrom Catalogue.
• ALSPAC is the most commonly searched study in CLOSER Discovery.
• The ALSPAC web site describes the cohort and available data (http://www.bristol.ac.uk/alspac/researchers/our-data/), and hosts data documentation and catalogues. The data dictionary is fully searchable by keyword.
• A bespoke variable search tool is available (http://variables.alspac.bris.ac.uk). This enables a quick search of the data and the facility to download a list of variables. Within the next twelve months this will be replaced with the Mica web portal (https://www.obiba.org/pages/products/mica), allowing for advanced searching of variables and creation of a variable list for submission as a data access request.
5.3 Governance of access
ALSPAC is committed to providing access to ALSPAC data to the widest possible research community.
Currently, ALSPAC data are made available to researchers on a supported basis rather than via an
unrestricted, open resource. Bespoke datasets of requested variables are provided to collaborators by a
data preparation and statistics team upon completion of a Data Access Agreement
(http://www.bristol.ac.uk/alspac/researchers/access/).
Briefly, available data are described on the ALSPAC website. Researchers wishing to access these data
submit a proposal to the ALSPAC Executive committee. Approval for access is given if the data
requested are available and their release does not (i) risk disclosure of participant identity; (ii) violate any
ethico-legal or other stipulations that apply to ALSPAC; or (iii) run the risk of harming the study as a
whole or any participants in it. Crucially, data are viewed as a non-finite resource and proposals for
access are therefore not subject to formal scientific review. Once an application has been awarded,
specified variables are then provided to the investigator by a ‘data buddy’, who supports the user with
data descriptors and additional variables as required.
Requests for data acquired via linkage to routine health and administrative records are subject to access
constraints determined both by ALSPAC and the original data owner. These constraints can involve
seeking additional project clearances with the original data owner, the statistical modification of the data
to control for disclosure risks and ethical approval (for requests involving health data). All linkage data
are filtered for participant consent at the time of release and this can impact on the resulting sample
size. Access conditions are subject to changes applied by the data owner which are outside ALSPAC's
control. All access requirements must be adhered to at all stages of the research cycle. Researchers
accessing these data do so under a legally binding contract.
Researchers will access linked health records via the UK Secure eResearch Platform (UKSeRP),
Swansea University. This will be managed by the ALSPAC data linkage team who control access
permissions and researcher outputs. The system will allow researchers to remotely access their
approved project data in a secure and auditable environment.
5.4 The study team's exclusive use of the data
Where a researcher (member of the ALSPAC team or an external collaborator) has secured funding for
the collection and analysis of new data, they are entitled to apply for a period of exclusive access for a
period of up to 6 months from the point at which a cleaned dataset is made available to them. If this is
approved then during the exclusive access period the ALSPAC Executive will still consider requests for
access to the restricted data, but permission must be sought from the researcher who funded the data
collection to release the data or to explore the potential for collaborative analysis. If the funding
researcher declines, the restricted data will not be available to others until after the period of exclusive
access. After the embargo period all ALSPAC data are freely available to external researchers (within the
usual constraints related to scientific legitimacy and disclosure risk). ALSPAC do not “police” overlap of
projects or data requests. Details of approved projects and their data can be viewed by researchers on
the ALSPAC website.
5.5 Restrictions or delays to sharing, with planned actions to limit such restrictions
The ALSPAC policy on data sharing is partly determined by the terms of the consent given by the
participants to the collection of particular data items. Broadly, we work with consent agreements that
allow the widest possible sharing of ALSPAC information within the scientific community, balanced
against the need to recognise participant concerns that may influence their decisions about giving or
withholding consent at the time of data collection. In the majority of cases, the anonymity of participants
is maintained by providing linked data that do not include actual or potential personal identifiers (such as
date of birth, name and address) and by minimizing potentially disclosive information (such as low cell
counts). The point at which data become sufficiently detailed to the extent that anonymity cannot be
preserved is sometimes unclear and subject to challenge by different parties. Where such situations
arise, ALSPAC will take appropriate action to identify any risk to participant anonymity and where
necessary take steps to alter the data or introduce additional stages to the research process to reduce
these risks.
Depending on the proportion of the cohort consenting to different types of data collection, it is likely that
ALSPAC will hold particular sets of data that were collected without specific individual consent. However,
subsequent use of that data must preserve individual anonymity and thus not introduce any risk of
inadvertent disclosure. To ensure this, ALSPAC may modify the data provided to a researcher in order to
control for potential disclosure. ALSPAC are actively engaged with technological data sharing solutions
that allow analysis across linked datasets but do not allow the analyst to have sight of either linking
identifiers or the detail of individual level information needed for deductive disclosure, e.g. UKSeRP.
5.6 Regulation of responsibilities of users
The full ALSPAC data access policy is available online and provides information on data sharing for
prospective researchers. In brief, researchers wishing to use the ALSPAC resource complete an online
proposal form (https://proposals.epi.bristol.ac.uk/) describing the proposed research. The proposal should
have clearly stated aims and hypotheses and describe the relevant exposure, outcome and confounders
that are being requested. A Principal Investigator with an approved project is required to sign a Data
Access Agreement (signed at an Institution level) and all researchers within that project must complete a
confidentiality agreement before data is released. This emphasizes the confidential nature of the data and
informs the researcher that they must not share their dataset nor attempt to match their dataset with any
other ALSPAC data.
Requests to access biological samples are handled using the same procedures. However, the majority of
samples represent a finite resource, so proposals are assessed to ensure analysis will make good use of
these samples. Attempts are made to combine analyses where possible, but we reserve the right to turn
down proposals which would use up a large proportion of finite samples. Samples are issued under the
terms of a material transfer agreement (for researchers outside the University of Bristol) or a material
service level agreement (for UoB researchers outside the Bristol Medical School (PHS)). Samples and
indeed any other data are provided on condition that all further data obtained from them is returned to
ALSPAC to become part of the ALSPAC resource and made available to other researchers. Where
requests may use up a finite resource or risk stock, the request is referred to the ALSPAC Independent
Scientific Advisory Board.
6. Responsibilities
The ALSPAC Executive Committee takes ultimate responsibility for all aspects of data management. Under
the 2019-2024 strategic award the data team will be headed by the Executive lead for Data. They will be
supported by a Senior Data Manager (SDM) who manages the data pipeline and a Technical Lead (TL)
who manages the ALSPAC systems. The SDM and TL will line manage a data team who will be
responsible for managing ALSPAC systems that support the collection, curation and storage of data in a
systematic, secure, confidential and accessible manner. The Executive Lead has responsibility for the final
sign-off of new data released for research use and its accompanying metadata. The SDM will be supported
by a small team of data preparation assistants who will prepare and document the research data and a
team of research associates who will provide continual quality assurance of new clinic data as it is
collected. Dedicated time will be provided for data security; ensuring that we continue to comply and gain
re-certification of ISO27001.
7. Relevant institutional, departmental or study policies on data sharing and data security

Policy                              | URL or Reference
Data Management Policy & Procedures | University policy: http://www.bristol.ac.uk/research/environment/governance/research-datapolicy/
Data Security Policy                | ISO27001 (reference)
Data Sharing Policy                 | ALSPAC access policy: http://www.bristol.ac.uk/alspac/researchers/access/
Institutional Information Policy    | http://www.bristol.ac.uk/infosec/policies/
8. Author of this Data Management Plan (Name) and, if different to that of the Principal
Investigator, their telephone & email contact details
The ALSPAC executive
Email: alspac-exec@bristol.ac.uk Tel: +44 (117) 331 0010