IT 385 Allegany College of Maryland Data Quality Assessment Paper


Description

Scenario:

Your team is researching web traffic on Marymount’s network (dorms and academic/administrative buildings). You will be using web traffic data, firewall data, and data from two external data sources of your choosing. (You can use your best guess for the size of the university’s datasets.) Please write a Data Quality Assessment that includes the information below for your data sets. Student privacy considerations must be included.

Data Quality Assessment

  1. Description of Data
    1. Type of Research
    2. Types of Data
    3. Format and Scale of Data
  2. Data Collection / Generation
    1. Methodologies for data collection/generation
    2. Data Quality and Standards
  3. Data Management, Documentation, and Curation
    1. Managing/storing and curating data
    2. Metadata standards and data documentation
    3. Data preservation strategy and standards
  4. Data Security and confidentiality
    1. Format information / data security standards
    2. Main risks to data security
  5. Data Sharing and Access
    1. Suitability for sharing
    2. Discovery by potential users of the research data
    3. Governance of Access
    4. The study team’s exclusive use of the data
    5. Regulation of the responsibilities of users
  6. Relevant institutional, departmental or study policies on data sharing and data security
    1. Include the policy name and a URL/reference to it, as well as any laws that may apply.

Deliverable:

Please provide the following in a PDF or Word Document(s):

  1. Research Question/Problem you will use.
  2. Data Quality Assessment Report, including 2 chosen external data sources used to match against web traffic.

Unformatted Attachment Preview

IT 385 Managing Big Data - Professor Tim Eagle
Module 3: Data Assets

Data Assets
● ETL
● Data Formats
● Sources of Data
  ○ Internal vs External
  ○ Log Data
  ○ Public vs. Private
  ○ Free vs. Paid
  ○ Refreshing Data
● Data Providers
● Data Scraping
  ○ Old Way - Scraping from HTML
  ○ New Way - APIs, XML, CAML, etc.
● Data Formatting
  ○ Strategies - ETL vs ELT
  ○ Tools

Extract, Transform, and Load
● Extract is the process of reading data from a database. In this stage, the data is collected, often from multiple and different types of sources.
● Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
● Load is the process of writing the data into the target database.

Data Formats
● Excel
● Delimited Files - Most common for larger sets
  ○ Can be easy, but older sets are often complex (FECA example)
● JSON - JavaScript Object Notation - Becoming more popular
● XML - Extensible Markup Language - Most common for newer data sets
● Proprietary Formats
  ○ SAS datasets
  ○ Database backups/dumps

Sources of Data - Internal vs External
Internal
● Free
● Varied depending on business
● Should be able to paint a picture of your business
● Can be quite messy
External
● Not free
● Plenty of places to look
● Pay for cleaner data

Sources of Data - Log Data
● Depending on the project you are working with, log data might be useful
● Sources:
  ○ System logs
  ○ Network logs
  ○ Application logs
● Often needs more formatting than other data sources
● Often very noisy
● Log aggregation tools bring it all together and might handle formatting
  ○ Splunk
  ○ Graylog

Sources of Data - Public vs. Private
Public Data Sets
● Examples: NWS Weather, Data.gov, SSA DMF
● https://cloud.google.com/public-datasets/
● https://aws.amazon.com/opendata/public-datasets/
● https://datasetsearch.research.google.com/
● https://github.com/awesomedata/awesome-public-datasets
Private Data Sets
● Typically not free
● Very business-specific
● Update frequently
● Very specific licensing terms
● https://www.fdbhealth.com/solutions/medknowledge/medknowledge-drug-pricing
● Dun & Bradstreet
● https://risk.lexisnexis.com/products/accurint-for-healthcare

Sources of Data - Free vs. Paid
Free
● Generally will require more cleaning
● Updated less often
● Some sets are free only for partial sets of data
● No support for data issues
● You get what you paid for...
Paid
● Cleanliness of data varies
● Pay for more updates
● Full sets of data, or built for your specific needs
● Support often provided
● Can get costly (cost-benefit analysis needed)
● Some public data costs as well

Sources of Data - Refreshing Data
● How often?
● Refresh policy
  ○ Replace existing data
  ○ Update existing data
  ○ Append data
  ○ Update and add new data
● Storage space limitations

Data Providers
Data providers are companies/websites that aggregate various datasets, then provide data under either a paid license or an open license. They can be very market-specific and often have a single API to access all data sets.
● https://www.ecoinvent.org/home.html
● https://intrinio.com/
● https://www.programmableweb.com/category/all/apis
● Data.gov
Also, there is a new push for Data as a Service, where we don’t download data sets; we just query against a service provider.

Data Scraping - Old Way
● Since the internet used to be fairly static, to get data we would scrape it from web pages using code.
● You’d have scripts run through pages, look for certain spots or words, and capture the data in another file.
● Often memory- and resource-intensive.
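The extract/transform/load stages described above can be sketched in a few lines of Python. This is a minimal illustration, not the course’s required approach; the CSV field names and the SQLite table are invented for the example.

```python
import csv
import io
import sqlite3

# Illustrative source data; in practice this would come from a file or an API.
RAW_CSV = """user,bytes,when
alice,1024,2024-01-05
bob,not_a_number,2024-01-06
carol,2048,2024-01-07
"""

def extract(text):
    # Extract: read rows from a delimited source.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: apply rules (here, coerce types and drop malformed rows).
    clean = []
    for r in rows:
        try:
            clean.append((r["user"], int(r["bytes"]), r["when"]))
        except ValueError:
            continue  # a real pipeline would log or quarantine this row
    return clean

def load(rows, conn):
    # Load: write the transformed rows into the target database.
    conn.execute("CREATE TABLE IF NOT EXISTS traffic (user TEXT, bytes INTEGER, day TEXT)")
    conn.executemany("INSERT INTO traffic VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT COUNT(*), SUM(bytes) FROM traffic").fetchone())  # (2, 3072)
```

Note how the malformed row is rejected during the transform stage rather than after loading, which keeps the target database clean by construction.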
Data Scraping - New Way
● Grab RSS/XML feeds from pages
● Use APIs to access a site’s data
● Use tools to scrape data from social media or webpages
  ○ Data Scraper - Chrome plugin
  ○ WebHarvy
  ○ Import.io
● Buy pre-scraped and formatted data
● Use a hybrid of the new and old ways to see the whole picture
  ○ https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python-and-selectorlib/

Data Formatting - ETL vs. ELT
https://www.xplenty.com/blog/etl-vs-elt/

ETL
● A continuous, ongoing process with a well-defined workflow: ETL first extracts data from homogeneous or heterogeneous data sources. Next, it deposits the data into a staging area. Then the data is cleansed, enriched, transformed, and stored in the data warehouse.
● Used to require detailed planning, supervision, and coding by data engineers and developers: the old-school methods of hand-coding ETL transformations in data warehousing took an enormous amount of time. Even after designing the process, it took time for the data to go through each stage when updating the data warehouse with new information.
● Modern ETL solutions are easier and faster: modern ETL, especially for cloud-based data warehouses and cloud-based SaaS platforms, happens a lot faster.

ELT
● Ingest anything and everything as the data becomes available: ELT paired with a data lake lets you ingest an ever-expanding pool of raw data immediately, as it becomes available. There is no requirement to transform the data into a special format before saving it in the data lake.
● Transforms only the data you need: ELT transforms only the data required for a particular analysis. Although this can slow down the process of analyzing the data, it offers more flexibility, because you can transform the data in different ways on the fly to produce different types of metrics, forecasts and reports. Conversely, with ETL, the entire ETL pipeline (and the structure of the data in the OLAP warehouse) may require modification if the previously decided structure doesn’t allow for a new type of analysis.
● ELT is less reliable than ETL: the tools and systems of ELT are still evolving, so they are not as reliable as ETL paired with an OLAP database. Although it takes more effort to set up, ETL provides more accurate insights when dealing with massive pools of data. Also, developers who know ELT technology are more difficult to find than ETL developers.

Data Formatting - Tools
● Excel - Can do a ton of great stuff; however, it falls apart with larger data sources
● Scripting - Perl or Python - Work great with flat files, XML, and JSON, but not with others
● SQL - Can do much of the formatting in SQL and create new tables
● SAS / R - Same as scripting; very powerful, but with a learning curve
● ETL-specific tools
  ○ Informatica
  ○ Microsoft SSIS
  ○ Oracle Data Integrator
  ○ IBM InfoSphere DataStage
  ○ Apache Airflow, Kafka, NiFi
  ○ Talend Open Studio

Data Formatting Tips
● Dates/times - convert to the same timezone
● Units - convert to the same units when possible
● Standardize addresses
● Classify/tag certain data
● Timestamp data
● If you are going to map, geocode early - it can be costly

In-Class Assignment
Find the best data source to help solve these problems:
● When should our nationwide business start to stock snow shovels?
● Where is the best location to buy real estate for a car dealership?
● Which third baseman plays better in day games?
● How many people have larger incomes and fewer children in a specific geographic area?
● What data sources would have me (your professor) in them?
Post answers to the discussion board, as replies to each question.

The End
● Quiz 1 posted, due next week
● Assignment 1 posted, due in 2 weeks
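The first formatting tip above, converting all dates/times to the same timezone, can be sketched with the Python standard library. The sample timestamps and timezone names are illustrative assumptions, not data from the course.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical log timestamps recorded in different local timezones.
records = [
    ("2024-03-01 09:30", "America/New_York"),
    ("2024-03-01 14:30", "Europe/London"),
]

def to_utc(stamp, tz_name):
    # Attach the source timezone to the naive timestamp, then convert to UTC.
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC"))

normalized = [to_utc(stamp, tz) for stamp, tz in records]
print([d.isoformat() for d in normalized])
```

After normalization, the two records (which look five hours apart on paper) turn out to be the same instant, which is exactly the kind of mismatch this tip guards against.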
IT 385 Managing Big Data - Professor Tim Eagle
Module 4: Data Quality

Data Quality
● Garbage In, Garbage Out
● Definition of Data Quality
● The Continuum of Data Quality
● Other Problems with Data Quality
● Creating Better Data Quality
  ○ Data Cleansing
  ○ Master Data Management
  ○ Data Deduplication
  ○ Data Interpretation
● The Need for Domain Experts

Garbage In, Garbage Out - Definition of Data Quality
● Validity - Data measure what they are supposed to measure.
● Reliability - Everyone defines, measures, and collects data the same way, all the time.
● Completeness - Data include all of the values needed to calculate indicators. No variables are missing.
● Precision - Data have sufficient detail. Units of measurement are very clear.
● Timeliness - Data are up to date. Information is available on time.
● Integrity - Data are true. The values are safe from deliberate bias and have not been changed for political or personal reasons.

Validity
Data measure what they are supposed to measure.
When it goes wrong: AI and facial recognition. https://www.wmm.com/sponsored-project/codedbias/?fbclid=IwAR1xSLJToeXvMbNePKnlTndCFjnO3485Iv0AMf5wcZC-Tb1UuIJSiT3ivnQ

Reliability
Everyone defines, measures, and collects data the same way - all the time.
When it goes wrong: Even the discovery of the Americas was a result of bad data. Christopher Columbus made a few significant miscalculations when charting the distance between Europe and Asia. First, he favored values given by the Persian geographer Alfraganus over the more accurate calculations of the Greek geographer Eratosthenes. Second, Columbus assumed Alfraganus was referring to Roman miles when, in reality, he was referring to Arabic miles.

Completeness
Data include all of the values needed to calculate indicators. No variables are missing.
When it goes wrong: The 2016 United States Presidential election was mired in bad data. National polling data used to predict state-by-state Electoral College votes led to the prediction of a Hillary Clinton landslide, a forecast that led many American voters to stay home on Election Day. Also, The Big Short.

Precision
Data have sufficient detail. Units of measurement are very clear.
When it goes wrong: In 1999, NASA took a $125 million hit when it lost the Mars Orbiter. It turns out that the engineering team responsible for developing the Orbiter used English units of measurement while NASA used the metric system. The data was inconsistent, making it a costly and disastrous mistake.

Timeliness
Data are up to date. Information is available on time.
When it goes wrong: The turning point of the Civil War, Gettysburg. Lee, general of the Confederate army, had old intel and didn’t know the accurate count of troops.

Integrity
Data are true. The values are safe from deliberate bias and have not been changed for political or personal reasons.
When it goes wrong: The Enron scandal in 2001 was largely a result of bad data. Enron was once the sixth-largest company in the world. A host of fraudulent data provided to Enron’s shareholders resulted in Enron’s meteoric rise and subsequent crash. An ethical external auditing firm could have prevented this fraud from occurring. Or the anti-vaccination movement.

The Data Quality Continuum
• Data and information are not static; they flow in a data collection and usage process:
  – Data gathering
  – Data delivery
  – Data storage
  – Data integration
  – Data retrieval
  – Data mining/analysis

Data Gathering
• How does the data enter the system?
• Sources of problems:
  – Manual entry
  – No uniform standards for content and formats
  – Parallel data entry (duplicates)
  – Approximations, surrogates
  – SW/HW constraints
  – Measurement errors

Solutions
• Preemptive:
  – Process architecture (build in integrity checks)
  – Process management (reward accurate data entry, data sharing, data stewards)
• Retrospective:
  – Cleaning focus (duplicate removal, merge/purge, name & address matching, field value standardization)
  – Diagnostic focus (automated detection of glitches)

Data Delivery
• Destroying or mutilating information by inappropriate pre-processing
  – Inappropriate aggregation
  – Nulls converted to default values
• Loss of data:
  – Buffer overflows
  – Transmission problems
  – No checks

Solutions
• Build reliable transmission protocols
  – Use a relay server
• Verification
  – Checksums, verification parser
  – Do the uploaded files fit an expected pattern?
• Relationships
  – Are there dependencies between data streams and processing steps?
• Interface agreements
  – Data quality commitment from the data stream supplier

Data Storage
• You get a data set. What do you do with it?
• Problems in physical storage
  – Can be an issue, but terabytes are cheap.
• Problems in logical storage (ER → relations)
  – Poor metadata: data feeds are often derived from application programs or legacy data sources. What does it mean?
  – Inappropriate data models: missing timestamps, incorrect normalization, etc.
  – Ad-hoc modifications: structuring the data to fit the GUI.
  – Hardware/software constraints: data transmission via Excel spreadsheets, Y2K.

Solutions
• Metadata
  – Document and publish data specifications.
• Planning
  – Assume that everything bad will happen.
  – Can be very difficult.
• Data exploration
  – Use data browsing and data mining tools to examine the data.
  – Does it meet the specifications you assumed? Has something changed?

Data Integration
• Combine data sets (acquisitions, across departments).
• Common sources of problems:
  – Heterogeneous data: no common key, different field formats; approximate matching
  – Different definitions: what is a customer - an account, an individual, a family, ...?
  – Time synchronization: does the data relate to the same time periods? Are the time windows compatible?
  – Legacy data: IMS, spreadsheets, ad-hoc structures
  – Sociological factors: reluctance to share - loss of power.

Solutions
• Commercial tools
  – Significant body of research in data integration
  – Many tools for address matching and schema mapping are available.
• Data browsing and exploration
  – Many hidden problems and meanings: must extract metadata.
  – View before-and-after results: did the integration go the way you thought?

Data Retrieval
• Exported data sets are often a view of the actual data. Problems occur because:
  – Source data not properly understood.
  – Need for derived data not understood.
  – Just plain mistakes: inner join vs. outer join; understanding NULL values.
• Computational constraints
  – E.g., too expensive to give a full history, we’ll supply a snapshot.
• Incompatibility
  – EBCDIC?

Data Mining and Analysis
• What are you doing with all this data anyway?
• Problems in the analysis:
  – Scale and performance
  – Confidence bounds?
  – Black boxes and dart boards (“fire your statisticians”)
  – Attachment to models
  – Insufficient domain expertise
  – Casual empiricism

Solutions
• Data exploration
  – Determine which models and techniques are appropriate, find data bugs, develop domain expertise.
• Continuous analysis
  – Are the results stable? How do they change?
• Accountability
  – Make the analysis part of the feedback loop.

Other Problems in DQ - Missing Data
• Missing data: values, attributes, entire records, entire sections
• Missing values and defaults are indistinguishable
• Truncation/censoring: not aware, mechanisms not known
• Problem: misleading results, bias.

Data Glitches
• Systemic changes to data which are external to the recorded process:
  – Changes in data layout / data types (integer becomes string, fields swap positions, etc.)
  – Changes in scale / format (dollars vs. euros)
  – Temporary reversion to defaults (failure of a processing step)
  – Missing and default values (application programs do not handle NULL values well...)
  – Gaps in time series (especially when records represent incremental changes)

Departmental Silos
● Everyone sees their job, department, or business as the most important thing.
● Often departments or other groups will have their own data quality standards for their specific mission.
● Data quality suffers when you have to look at data across the business or between companies.
● Example: Federal ID for companies and businesses
  ○ DUNS vs. TaxID vs. NPI vs. SSN vs. Department ID vs. Universal ID

Then Why Is Every DB Dirty?
• Consistency constraints are often not used
  – Cost of enforcing the constraint (e.g., foreign key constraints, triggers)
  – Loss of flexibility
  – Constraints not understood (e.g., large, complex databases with rapidly changing requirements)
  – DBA does not know / does not care.
• Garbage in
  – Merged, federated, web-scraped DBs.
• Undetectable problems
  – Incorrect values, missing data
• Metadata not maintained
• Database is too complex to understand

Improving Data Quality - Data Cleansing
● Not just about the data itself; also about standardizing business log data and metrics
● Create universal identifiers across your business; look at best practices
● Convert dates to the same timezone and format
● Standardize naming conventions in metadata

Cleansing Methods
● Histograms
● Conversion tables
  ○ Example - USA, U.S., U.S.A., US, United States
● Tools
● Algorithms
● Manually

Master Data Management / MDM Continued

Data Deduplication
● Data deduplication: a process that examines new data blocks using hashing, compares them to existing data blocks, and skips redundant blocks when data is transferred to the target.
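The hashing-based deduplication just described can be sketched in a few lines of Python. This is a simplified illustration of the idea (fingerprint each block, skip blocks already seen), not a real backup tool; the block contents are invented.

```python
import hashlib

def dedupe_blocks(blocks, seen=None):
    """Transfer only blocks whose hash has not been seen before."""
    seen = set() if seen is None else seen
    transferred = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()  # fingerprint the block
        if digest in seen:
            continue  # redundant block: skip the transfer
        seen.add(digest)
        transferred.append(block)
    return transferred

# Three blocks, one of which repeats earlier content.
blocks = [b"header", b"payload", b"header"]
print(len(dedupe_blocks(blocks)))  # 2: only the unique blocks are transferred
```

Passing the `seen` set between runs is what lets the target skip blocks it already holds from previous transfers.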
● Data reduction: a process that tracks block changes, usually using some kind of log or journal, and then transfers only new blocks to the backup target.

Data Interpretation
● Data interpretation refers to the implementation of processes through which data is reviewed for the purpose of arriving at an informed conclusion. The interpretation of data assigns a meaning to the information analyzed and determines its significance and implications.
  ○ Qualitative interpretation - Observations, documents, interviews
  ○ Quantitative interpretation - Mean, standard deviation, frequency distribution

Data Interpretation Problems
● Correlation mistaken for causation
● Confirmation bias
● Irrelevant data

Domain Expertise
• Data quality gurus: “We found these peculiar records in your database after running sophisticated algorithms!”
• Domain experts: “Oh, those apples - we put them in the same baskets as oranges because there are too few apples to bother. Not a big deal. We knew that already.”

Why Domain Expertise?
• DE is important for understanding the data, the problem, and interpreting the results
  – “The counter resets to 0 if the number of calls exceeds N.”
  – “The missing values are represented by 0, but the default billed amount is 0 too.”
• Insufficient DE is a primary cause of poor DQ - data are unusable
• DE should be documented as metadata

Where Is the Domain Expertise?
• Usually in people’s heads - seldom documented
• Fragmented across organizations
  – Often experts don’t agree. Force consensus.
• Lost during personnel and project transitions
• If undocumented, it deteriorates and becomes fuzzy over time

The End
Readings: ebook Ch. 3 and Ch. 5
● Homework 1 is due next week

ALSPAC DATA MANAGEMENT PLAN, 2019-2024

0. Proposal name
The Avon Longitudinal Study of Parents and Children (ALSPAC). Core Program Support 2019-2024.

1. Description of the data

1.1 Type of study
ALSPAC is a multi-generation, geographically based cohort study following 14,541 mothers recruited during pregnancy (in 1990-1992) and their partners (G0), offspring (G1) and grandchildren (G2).

1.2 Types of data
Quantitative data
• Data from numerous self-completed paper-based/online questionnaires.
• Data from clinic-based assessments: physiological, cognitive and anthropometric measures, structured interview data and computer-based questionnaire data.
• Genetic, metabolomic, proteomic, epigenetic, biochemical and environmental exposure data obtained from analysis of biological samples.
• Data derived from images collected as part of clinical assessments (including MRIs, liver scans, DXA, pQCT, retinal scans, 3D face and body shape).
• Data obtained through linkage to administrative records, including maternity and birth records, child health records, cancer/death registrations through ONS, primary and secondary health care records, and education and criminal records.
• Data obtained from social media, including Twitter, Facebook and Instagram.
Qualitative data from sub-studies
• Small sub-studies involving direct interviews of participants or focus groups; audio/transcript files being generated.
Bioresource
• Biological sample collection including DNA, lymphoblastoid cell lines (LCLs), blood, saliva, urine, hair, tissue such as placenta and umbilical cord, teeth, and nail clippings.

1.3 Format and scale of the data
Type and scale of data
• 14,541 mothers originally enrolled, producing 14,676 fetuses, with 13,988 G1 children still alive at one year of age. Additional waves of enrollment have resulted in a further 913 G1 children joining the study since 2000.
• At the time of writing (Feb 2019), >900 G2 children (including in utero) have enrolled and provided data.
• Self-completion questionnaires are electronically captured (scanned or digitally collected); to date there have been 48 questionnaires completed by mothers (n=6000-13500), 18 by partners (n=3000-9500), and 34 by G1 (n=5000-8000). Currently, 14 different questionnaires are used to collect data about G2 children.
• Data from clinical examinations are electronically captured (scanned paperwork or digitally collected). There have been 4 maternal sweeps (n=3500-4700), 1 father sweep (n=2000) and 10 G1 sweeps (with a minimum of 4000 attendees at each sweep).
• Electronically captured (scanned from paper or digitally collected) enrolment and consent forms (e.g. for record linkage, obtaining and using biological samples).
• Image data, e.g. face shape, DXA, MRI, liver scan, 3D body scan.
• Administrative data (e.g. educational records in the National Pupil Database, linkage to the NHS Central Register). The scale of linkage to individual administrative data sources depends on the completeness of these sources and the relevant participant consents and other permissions, alongside technical considerations relating to the systems where these data are held.
• Data obtained from biological samples, including whole genome sequence data (n=2000), genome-wide association (GWAS) data (n~22,000), metabolomics data (from ~20,000 samples), genome-wide epigenetics data (n~5000), plus small-scale bespoke biochemical, cellular and genetic analysis.
• Bioresource: over 30,000 DNA samples, 15,000 LCLs and over 1 million sample aliquots (blood, urine, saliva, tissue, hair, nails).

MRC Template for a Data Management Plan, v01-1, 10 March 2017

Data formats
• ALSPAC stores raw numerical data in a number of formats depending on the source of the data, including MS Access, SQL Server databases, MS Excel-compatible files (.csv and .xls format), REDCap (MySQL), SPSS save and portable formats, and Stata data files. Text data are stored as flat text files, MS Access databases and .csv files.
• Molecular data are stored in flat file formats including .csv and JSON, allowing for more complex file structures. Some formats are program-specific (e.g. PLINK for SNP data). Some data are stored in MySQL databases (e.g. methylation data), but some molecular data are unsuitable for databases due to increasingly long index lengths; working copies of these ‘Big Data’ and archived data are stored on University of Bristol (UoB) storage systems (e.g. the ACRC Research Data Storage Facility). Raw (laboratory) data (e.g. Illumina IDAT format genotyping/methylation files) will also be redundantly archived on UoB storage systems, ensuring future availability.
• Data made available for research use through the ALSPAC resource are stored internally in multiple formats. Data are curated using a statistical package and stored in both SPSS and Stata formats. The curated data are imported into a Data Warehouse (Opal/MongoDB) and stored in a highly flexible structure. Custom datasets can be exported in any of the common statistical formats, including SPSS, Stata, SAS, R and csv (Excel).
• Multimedia data such as DXA scans, recorded participant interviews and face shape images are stored using uncompressed or lossless compression formats where possible.
• Where applicable, data formats may be migrated as new technologies become available and are proved robust enough to ensure digital continuity and continued availability of data.

2. Data collection / generation

2.1 Methodologies for data collection / generation
New data will be generated by:
• Questionnaires (online or paper-based) completed by all cohort groups.
• Clinic-based assessments on all cohort groups.
• Further biological sample collection from all cohort groups.
• Linkage to administrative records of G0, G1 and G2.
• Interviews from qualitative studies.
• Image files such as DXA scans on G0 and G1.
• Updated contact information as provided by study participants.
• Molecular laboratory analysis of existing and new samples, and integration with public molecular data.
• Further biological sample collection (blood, urine, saliva, hair, placenta, umbilical cord, breast milk, meconium, stools, DNA and cell lines) from all cohort groups.
• Social media, such as Twitter.

2.2 Data quality and standards
• Each data item is to be assessed by logical and range checks built into electronic data collection systems, with ambiguous values assessed by an operator.
• All assessment scales used in either questionnaires or clinic-based assessments are to have been validated externally, with a known reference paper.
• A small sample (~3%) of clinic participants is to be re-invited to the clinic to validate earlier measures and test for any possible fieldworker bias or equipment calibration issues.
• Interview data are to be collected and validated in real time on encrypted laptops, with data routinely transferred to the central repository.
• Clinical assessment data are to be collected by trained fieldworkers according to clear protocols; data are to be analysed regularly by research staff to ensure data quality standards are being met; regular audits of clinic processes will be performed.
• Repeat molecular analysis of a subset of samples, QC using control probes, and analysis for batch effects (built into the ALSPAC LIMS [laboratory information management system]).
• The laboratory managing the bioresource has obtained the ISO 9001 quality standard. Sample data are stored in the ALSPAC LIMS.

3. Data management, documentation and curation

3.1 Managing, storing and curating data
• Data from all sources will be cleaned and prepared for analysis by the in-house statistics, bioinformatics and data management teams, following established standard operating procedures (SOPs).
• Each data item is referenced and stored using a universal indexing and naming convention.
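The "logical and range checks" described in section 2.2 can be sketched as a small validation pass. The field names and limits below are illustrative assumptions, not ALSPAC's actual rules; the idea is simply that out-of-range values are flagged for an operator rather than loaded silently.

```python
# Hypothetical range checks of the kind built into electronic data
# collection systems; field names and limits are invented for illustration.
RULES = {
    "age_years": lambda v: 0 <= v <= 110,
    "height_cm": lambda v: 40 <= v <= 230,
    "weight_kg": lambda v: 1 <= v <= 300,
}

def check_record(record):
    """Return the fields that fail their checks, for operator review."""
    flagged = []
    for field, is_ok in RULES.items():
        value = record.get(field)
        if value is None or not is_ok(value):
            flagged.append(field)
    return flagged

record = {"age_years": 34, "height_cm": 501, "weight_kg": 70}  # height mistyped
print(check_record(record))  # ['height_cm']
```

A record that passes every rule returns an empty list; anything else goes to a review queue instead of the clean dataset.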
• Research data items are stored separately from administrative data (including subject identifiers) and linked through anonymised files accessible only to specified members of the data management team.
• Instrument-specific and bespoke data formats are archived ‘as is’ to ensure the integrity of the original source material. As with other data sources, original copies of the data are never altered in any way.
• All research data collected are catalogued, maintained and archived on University of Bristol infrastructure, which is scalable, secure and backed up routinely.
• The bioresource is licensed by the Human Tissue Authority (license number 12512) and samples are stored in secure freezers and cryostores. These facilities are linked to an emergency generator to provide back-up power and are covered by a 24hr alarm system to alert staff to freezer failures outside normal working hours. The cell line collection is backed up at an external second site.

3.2 Metadata standards and data documentation
Metadata are collected as an integral process to (i) catalogue and index the data in a searchable manner, (ii) define the assessment tools (validated measures, key reference publication, modifications, etc.), (iii) describe the data collection process on an individual basis (age at completion, administration and reminder process), (iv) catalogue laboratory information as captured through LIMS, and (v) assign a geographical reference point (at a non-disclosive level) to assist spatial analysis.
• Research data are not made available until they have been fully documented and published on the ALSPAC website (http://www.bristol.ac.uk/alspac/researchers/our-data/).
• Metadata are provided through CLOSER Discovery (https://discovery.closer.ac.uk/) and are compatible with the Data Documentation Initiative (DDI) Life-cycle 3.2.
• Study protocols, assessment tools, data derivation methods and coding schema are provided as part of the research data documentation, available as downloadable content from the ALSPAC website (http://www.bristol.ac.uk/alspac/researchers/our-data/).

3.3 Data preservation strategy and standards
• It is envisaged that ALSPAC will continue to operate as a resource for current and future generations indefinitely. In line with this expectation, all data are anticipated to be maintained indefinitely.
• ALSPAC maintains an archive of data which is available to researchers on request and maintained on secure servers which are backed up on a regular basis. The University infrastructure consists of real-time mirroring of data across two geographically separate data centers (Bristol and Slough) as well as off-site tape backups on a nightly basis.
• In order to ensure the longevity and availability of the resource, ALSPAC reviews data on a regular basis and will apply digital continuity methods where applicable to migrate data formats at risk of obsolescence to newer formats, ensuring that the information contained within the files remains complete and usable.
• Where data are disposed of (for example, data that have been secured elsewhere on an obsolete hard disk), this will be done securely and in line with University IT information security policies.
• Primary source material (e.g. questionnaires, clinic data sheets and consent forms) will be preserved as electronic (scanned) copies where practicable.
• The UoB library holds an extensive administrative archive (Special Collections reference DM2616) of catalogued paperwork up to 2005 (study grant applications, protocols, ethical approvals, participant information materials, keying and coding specifications, documentation on each data collection measure and their provenance, and the file-building syntax). https://www.bristol.ac.uk/library/special-collections/strengths/alspac/

4. Data security and confidentiality of potentially disclosive information

4.1 Formal information/data security standards
• ALSPAC gained ISO 27001 certification in 2012 and is compliant with that standard.

4.2 Main risks to data security
ALSPAC avoids potentially disclosive subsets of data at all costs. For example:
• Complete dates (of birth, questionnaire completion, clinic attendance) are not released to researchers; instead, ages are derived.
• Cell counts are no smaller than n=5; this applies to all publications, which are reviewed by the ALSPAC Executive prior to journal submission.
• Free text (with its own unique ID) is coded separately from any other data; any identifying information is screened out before being passed to a researcher.
• Address data are dealt with separately from any other data (with their own unique ID); relevant address-level data are obtained, aggregated as appropriate, and matched back to the main dataset.
• All external researchers receive a dataset with an ID attached which is unique to them and their project.
• Anonymised frequency tables and summary statistics are freely available as part of the data dictionary.

5. Data sharing and access
(MRC template guidance: identify any data repository(-ies) that are, or will be, entrusted with storing, curating and/or sharing data from your study, where they exist for particular disciplinary domains or data types.)

5.1 Suitability for sharing
Yes. ALSPAC data are used in a wide and varying number of research consortia and cross-cohort collaborations.

5.2 Discovery by potential users of the research data
• The ALSPAC access policy details how data can be accessed by researchers (http://www.bristol.ac.uk/alspac/researchers/data-access/).
The cohort is advertised through a wide variety of sources including the MRC gateway to Research and the Maelstrom Catalogue, ALSPAC is the most commonly searched study in CLOSER Discovery. The ALSPAC web site describes the cohort and available data (http://www.bristol.ac.uk/alspac/researchers/our-data/), and hosts data documentation and catalogues. The data dictionary is fully searchable by keyword. A bespoke variable search tool is available (http://variables.alspac.bris.ac.uk). This enables a quick search of the data and facility to download a list of variables. Within the next twelve months this will be replaced with the Mica web portal (https://www.obiba.org/pages/products/mica) allowing for advanced searching of variables and creation of a variable list for submission as a data access request Governance of access ALSPAC is committed to providing access to ALSPAC data to the widest possible research community. Currently, ALSPAC data are made available to researchers on a supported basis rather than via an unrestricted, open resource. Bespoke datasets of requested variables are provided to collaborators by a data preparation and statistics team upon completion of a Data Access Agreement (http://www.bristol.ac.uk/alspac/researchers/access/) Briefly, available data are described on the ALSPAC website. Researchers wishing to access these data submit a proposal to the ALSPAC Executive committee. Approval for access is given if the data requested are available and their release does not (i) risk disclosure of participant identity; (ii) violate any ethico-legal or other stipulations that apply to ALSPAC; or (iii) run the risk of harming the study as a whole or any participants in it. Crucially, data are viewed as a non-finite resource and proposals for access are therefore not subject to formal scientific review. 
Once an application has been awarded, specified variables are then provided to the investigator by a ‘data buddy’, who supports the user with data descriptors and additional variables as required. Requests for data acquired via linkage to routine health and administrative records are subject to access constraints determined both by ALSPAC and the original data owner. These constraints can involve seeking additional project clearances with the original data owner, the statistical modification of the data to control for disclosure risks and ethical approval (for requests involving health data). All linkage data are filtered for participant consent at the time of release and this can impact on the resulting sample size. Access conditions are subject to changes applied by the data owner which are outside ALSPACs control. All access requirements must be adhered to at all stages of the research cycle. Researchers accessing these data do so under a legally binding contract. Researchers will access linked health records via the UK Secure eResearch Platform (UKSeRP), Swansea University. This will be managed by the ALSPAC data linkage team who control access MRC Template for a Data Management Plan, v01-1, 10 March 2017 4 permissions and researcher outputs. The system will allow researchers to remotely access their approved project data in a secure and auditable environment. 5.4 The study team’s exclusive use of the data Where a researcher (member of the ALSPAC team or an external collaborator) has secured funding for the collection and analysis of new data, they are entitled to apply for a period of exclusive access for a period of up to 6 months from the point at which a cleaned dataset is made available to them. 
If this is approved then during the exclusive access period the ALSPAC Executive will still consider requests for access to the restricted data, but permission must be sought from the researcher who funded the data collection to release the data or to explore the potential for collaborative analysis. If the funding researcher declines, the restricted data will not be available to others until after the period of exclusive access. After the embargo period all ALSPAC data are freely available to external researchers (within the usual constraints related to scientific legitimacy and disclosure risk). ALSPAC do not “police” overlap of projects or data requests. Details of approved projects and their data can be viewed by researchers on the ALSPAC website. 5.5 Restrictions or delays to sharing, with planned actions to limit such restrictions The ALSPAC policy on data sharing is partly determined by the terms of the consent given by the participants to the collection of particular data items. Broadly, we work with consent agreements that allow the widest possible sharing of ALSPAC information within the scientific community, balanced against the need to recognise participant concerns that may influence their decisions about giving or withholding consent at the time of data collection. In the majority of cases, the anonymity of participants is maintained by providing linked data that do not include actual or potential personal identifiers (such as date of birth, name and address) and by minimizing potentially disclosive information (such as low cell counts). The point at which data become sufficiently detailed to the extent that anonymity cannot be preserved is sometimes unclear and subject to challenge by different parties. Where such situations arise, ALSPAC will take appropriate action to identify any risk to participant anonymity and where necessary take steps to alter the data or introduce additional stages to the research process to reduce these risks. 
Depending on the proportion of the cohort consenting to different types of data collection, it is likely that ALSPAC will hold particular sets of data that were collected without specific individual consent. However, subsequent use of that data must preserve individual anonymity and thus not introduce any risk of inadvertent disclosure. To ensure this, ALSPAC may modify the data provided to a researcher in order to control for potential disclosure. ALSPAC are actively engaged with technological data sharing solutions that allow analysis across linked datasets but do not allow the analyst to have sight of either linking identifiers or the detail of individual level information needed for deductive disclosure, e.g. UKSeRP. 5.6 Regulation of responsibilities of users The full ALSPAC data access policy is available online and provides information on data sharing for prospective researchers. In brief, researchers wishing to use the ALSPAC resource complete an online proposal form (https://proposals.epi.bristol.ac.uk/) describing the proposed research. The proposal should have clearly stated aims and hypotheses and describe the relevant exposure, outcome and confounders that are being requested. A Principal Investigator with an approved project is required to sign a Data Access Agreement (signed at an Institution level) and all researchers within that project must complete a confidentiality agreement before data is released. This emphasizes the confidential nature of the data and informs the researcher that they must not share their dataset nor attempt to match their dataset with any other ALSPAC data. Requests to access biological samples are handled using the same procedures. However, the majority of samples represent a finite resource, so proposals are assessed to ensure analysis will make good use of these samples. 
Attempts are made to combine analyses where possible, but we reserve the right to turn down proposals which would use up a large proportion of finite samples. Samples are issued under the terms of a material transfer agreement (for researchers outside the University of Bristol) or a material service level agreement (for UoB researchers outside the Bristol Medical School (PHS)). Samples and indeed any other data are provided on condition that all further data obtained from them is returned to ALSPAC to become part of the ALSPAC resource and made available to other researchers. Where requests may use up a finite resource or risk stock, the request is referred to the ALSPAC Independent Scientific Advisory Board. 6. Responsibilities The ALSPAC Executive Committee takes ultimate responsibility for all aspects of data management. Under the 2019-2024 strategic award the data team will be headed by the Executive lead for Data. They will be MRC Template for a Data Management Plan, v01-1, 10 March 2017 5 supported by a Senior Data Manager (SDM) who manages the data pipeline and a Technical Lead (TL) who manages the ALSPAC systems. The SDM and TL will line manage a data team who will be responsible for managing ALSPAC systems that support the collection, curation and storage of data in a systematic, secure, confidential and accessible manner. The Executive Lead has responsibility for the final sign-off of new data released for research use and its’ accompanying metadata. The SDM will be supported by a small team of data preparation assistants who will prepare and document the research data and a team of research associates who will provide continual quality assurance of new clinic data as it is collected. Dedicated time will be provided for data security; ensuring that we continue to comply and gain re-certification of ISO27001. 7. 
Relevant institutional, departmental or study policies on data sharing and data security Policy Data Management Policy & Procedures URL or Reference University policy: http://www.bristol.ac.uk/research/environment/governance/research-datapolicy/ Data Security Policy ISO27001 (reference) Data Sharing Policy ALSPAC access policy available here: http://www.bristol.ac.uk/alspac/researchers/access/ Institutional Information Policy http://www.bristol.ac.uk/infosec/policies/ Other: Other 8. Author of this Data Management Plan (Name) and, if different to that of the Principal Investigator, their telephone & email contact details The ALSPAC executive Email: alspac-exec@bristol.ac.uk Tel: +44 (117) 331 0010 MRC Template for a Data Management Plan, v01-1, 10 March 2017 6
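The two disclosure controls described in section 4.2 of the plan above — releasing derived ages instead of complete dates, and suppressing cell counts below n=5 — can be sketched in a few lines. This is an illustrative sketch only, not ALSPAC's actual tooling; the reference date and table values are invented for the example.

```python
# Illustrative sketch (not ALSPAC's code) of two disclosure controls:
# deriving an age instead of releasing a full date of birth, and
# suppressing table cells with fewer than 5 cases before release.
from datetime import date

MIN_CELL = 5  # smallest publishable cell count

def derive_age(dob, at=date(2024, 1, 1)):
    """Release an age in years rather than the complete date of birth."""
    return at.year - dob.year - ((at.month, at.day) < (dob.month, dob.day))

def suppress_small_cells(table):
    """Replace counts below MIN_CELL with None before release."""
    return {k: (v if v >= MIN_CELL else None) for k, v in table.items()}

print(derive_age(date(1991, 6, 15)))                      # age, not DOB
print(suppress_small_cells({"A": 120, "B": 3, "C": 17}))  # "B" suppressed
```

The same pattern applies to the web-traffic scenario in the assessment below: student-level counts per dorm or building should be suppressed or aggregated before any figures leave the study team.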

Explanation & Answer

Please check this one and let me know any necessary enhancements we can make

Running Head: DATA QUALITY ASSESSMENT

Data quality assessment
Name
Course:
Institution


Introduction
Data quality assessment (DQA) is the process of scientifically and statistically evaluating data
to determine whether they meet the quality required for a project or business process and are of
the right type and quantity to support their intended use. It can be considered a set of guidelines
and techniques used to describe data in a given application context and to apply processes that
assess and improve the quality of those data.
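As a concrete illustration of the checks such an assessment applies, the sketch below computes two simple quality metrics — completeness and duplicate count — over a batch of records. The field names ("src_ip", "url", "bytes") and sample values are assumptions for illustration, not actual Marymount traffic data.

```python
# Minimal data-quality check over web-traffic records held as dicts.
# Field names and sample values are illustrative assumptions.
from collections import Counter

def quality_report(records, required_fields):
    """Return total, completeness ratio, and duplicate count for a batch."""
    # Completeness: share of records with every required field present.
    complete = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    # Duplicates: records that appear more than once, counted as extras.
    seen = Counter(tuple(sorted(r.items())) for r in records)
    return {
        "total": len(records),
        "completeness": complete / len(records) if records else 0.0,
        "duplicates": sum(n - 1 for n in seen.values() if n > 1),
    }

records = [
    {"src_ip": "10.0.0.5", "url": "example.edu", "bytes": 1200},
    {"src_ip": "10.0.0.5", "url": "example.edu", "bytes": 1200},  # duplicate
    {"src_ip": "10.0.0.9", "url": "", "bytes": 40},               # missing URL
]
print(quality_report(records, ["src_ip", "url", "bytes"]))
```

In a real assessment these metrics would be tracked per source (firewall, switches, external datasets) so that quality problems can be traced back to the system that produced them.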
1. Description of Data
1.1 Type of study
The study is a quantitative study of web traffic across the network architecture of Marymount
University, covering both dorms and academic/administrative buildings.
1.2 Types of data
The data will be obtained from firewall logs and web traffic within the institution's computer
network. For comparison, data will also be obtained from two external sources, preferably from
similar universities with comparable network structures.
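To show how the internal traffic could be matched against an external source, the sketch below joins two aggregated summaries on domain name. The domains, counts, and the idea of joining on domain are assumptions for illustration; the actual matching key would depend on what the chosen external datasets publish.

```python
# Hypothetical sketch of matching internal web-traffic counts against an
# external source's summary. Domain names and counts are invented.
internal = {"example.edu": 5200, "news.example.com": 1800}
external = {"example.edu": 4900, "video.example.net": 900}

# Inner join on domain: keep only domains present in both sources,
# pairing each with its count from either side for comparison.
matched = {
    d: {"internal": internal[d], "external": external[d]}
    for d in internal.keys() & external.keys()
}
print(matched)
```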
1.3 Format and scale of the data
Most data will be digital data collected from about 12 network switches that host firewalls;
data will also be read from some 20 routers. This is expected to generate about a million
records within the time frame of the study.
The data will be raw numerical data in a number of formats depending on the source, including
MS Access and SQL Server databases, MS Excel-compatible files (.csv and .xls), REDCap
(MySQL), SPSS save and portable formats, and Stata data files. Text data are stored as flat
text files, MS Access databases and .csv files.
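Since the raw data arrive in several delimited formats, a single loader can normalise them into one record list before quality checks run. The sketch below uses only the standard library; the two inline "exports" (comma-separated firewall rows, tab-separated router rows) are made up for illustration, and in practice the database and SPSS/Stata sources would first be exported to .csv so the same loader applies.

```python
# Sketch of normalising delimited exports into one record list using only
# the standard library. The sample exports below are invented; real
# database/SPSS/Stata sources would first be dumped to .csv.
import csv
import io

def load_delimited(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

# Two hypothetical exports: one comma-separated, one tab-separated.
firewall_csv = "timestamp,action\n2024-01-01T10:00,ALLOW\n"
router_tsv = "timestamp\taction\n2024-01-01T10:05\tDENY\n"

records = load_delimited(firewall_csv) + load_delimited(router_tsv, "\t")
print(records)  # each row becomes a dict: {"timestamp": ..., "action": ...}
```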

DATA QUALITY ASSESSMENT

3

2.0 Data Collection/Generation
2.1 Methodologies for data collection / generation
Data collection is the process of gathering and me...
