Analyzing and Visualizing data

Anonymous
Asked: Feb 3rd, 2021

Question Description

Outline for Assignment 2

Using the Documenting Research Guide and the Assignment 2 Instructions, develop your outline.

When you begin working with the data, use the document Assignment 2: Finding the right data to get started.

Submit the outline as an MS Word document. Use APA 7 standards for all citations and references in the outline. Ensure that the document includes your name. Do not include your student identification number. Include a cover page with your name; you may use the Student Paper Template's cover page.

Both the Documenting Research Guide and the Student Paper Template can be found in the Useful Information folder in the content area.

Unformatted Attachment Preview

Research Assignment 2: Finding the right data

This document is provided to help you work through this robust research question. It offers insight into how to address these fields by working with three of the fields, along with some tips for working with the data in R. Work through the fields one piece at a time, shown here in bold.

What are the most influential features when predicting whether a survey respondent has used the SO job board or is aware of the board but has never used it when considering respondents who reported residing in the country …(your countries listed here); reported their age as somewhere between 18 and 65 years old; and that indicated that they were either not at all, somewhat, or very confident in their manager; reported an undergraduate major in either an engineering field, information systems, or web design, or statistics; in addition to the responses these respondents reported regarding employment; how often the respondent contributes to open source; and whether or not they code for a hobby; when the respondent indicated that the number of years they have been coding is somewhere within one to 49 years using the data from SO (2019)?

This research question requires several filters to obtain the secondary data sample that is necessary for this research. Consider the field that represents whether a survey respondent has used the SO job board or is aware of the board, but has never used it. When you identified the variable that represents this field, what unique values can be found in the field?

    > unique(df$SOJobs)
    [1] "No, I didn't know that Stack Overflow had a job board"
    [2] "No, I knew that Stack Overflow had a job board but have never used or visited it"
    [3] "Yes"
    [4] NA

The question has limited the scope of the data sample to respondents that use it and those that are aware of it, but have never used it. We don’t have enough information! What is the survey question associated with these survey answers?
Have you ever used or visited Stack Overflow Jobs?

Can you determine what unique answers or values you need to keep now?

Respondents that use it: how would they answer this survey question? Yes
Respondents that don’t use it, but know about the job board? No, I knew…

When you have lengthy character fields, it can get cumbersome to isolate the strings. If even one letter is out of place, your filter will not perform as expected. Using the entire data frame of 88,883 observations, the first chunk returns zero observations; the second chunk returns 75,532 observations.

    df <- filter(df,    # ‘and’ statement
                 SOJobs == "Yes",
                 SOJobs == "No, I knew that Stack Overflow had a job board but have never used or visited it")

    df <- filter(df,    # ‘or’ statement
                 SOJobs == "Yes" |
                 SOJobs == "No, I knew that Stack Overflow had a job board but have never used or visited it")

While the second option works, you may find that method slow to type and prone to typing errors. How about this approach?

    df <- filter(df, str_detect(SOJobs, "knew") | SOJobs == "Yes")

The function str_detect(), or string detect, searches a character string for the given contents. To keep it simple, don’t use the first word or last word of the string. Want to know why or more? Email me or use help in RStudio.

1/27/20 Assignment 2 Finding the right data.docx Page |1

Let’s look at another field from the research question: reported an undergraduate major in either an engineering field, information systems, or web design, or statistics. What unique values exist for this field?

    > unique(df$UndergradMajor)
     [1] <NA>
     [2] Web development or web design
     [3] Computer science, computer engineering, or software engineering
     [4] Mathematics or statistics
     [5] Another engineering discipline (ex. civil, electrical, mechanical)
     [6] Information systems, information technology, or system administration
     [7] A business discipline (ex. accounting, finance, marketing)
     [8] A natural science (ex. biology, chemistry, physics)
     [9] A social science (ex. anthropology, psychology, political science)
    [10] A humanities discipline (ex. literature, history, philosophy)
    [11] Fine arts or performing arts (ex. graphic design, music, studio art)
    [12] A health science (ex. nursing, pharmacy, radiology)
    [13] I never declared a major

Do you need the survey question? Maybe not. You do need to capture the questions in the sample section. Here I’ve highlighted the words coinciding with the research question and truncated the list.

    > unique(df$UndergradMajor)
    [1] <NA>
    [2] Web development or web design
    [3] Computer science, computer engineering, or software engineering
    [4] Mathematics or statistics
    [5] Another engineering discipline (ex. civil, electrical, mechanical)
    [6] Information systems, information technology, or system administration
    [7] A business discipline (ex. accounting, finance, marketing)

What’s the best method to approach this? Think about how you could use the string detect function here.

    df <- df %>%
      filter(str_detect(UndergradMajor, "engineering") |
             str_detect(UndergradMajor, "information") |
             str_detect(UndergradMajor, "statistics") |
             str_detect(UndergradMajor, "web")) %>%
      droplevels()    # droplevels() is necessary when dropping factor levels

When you pick the words to detect, consider all the unique values. How can you validate that the filter worked as you intended? Try using the function table() before and after your filter. The function table() will alphabetize your unique field names, so they won’t appear in the same order as they did with unique(). Check the remaining fields and how often each of them occurs in the data.

    > table(df$UndergradMajor)    # before the filter
    A business discipline (ex. accounting, finance, marketing)            1841
    A health science (ex. nursing, pharmacy, radiology)                    323
    A humanities discipline (ex. literature, history, philosophy)         1571
    A natural science (ex. biology, chemistry, physics)                   3232
    A social science (ex. anthropology, psychology, political science)    1352
    Another engineering discipline (ex. civil, electrical, mechanical)    6222
    Computer science, computer engineering, or software engineering        ...

    > table(df$UndergradMajor)    # after the filter
    Another engineering discipline (ex. civil, electrical, mechanical)       6222
    Computer science, computer engineering, or software engineering         47214
    Information systems, information technology, or system administration    5253
    Mathematics or statistics                                                2975
    Web development or web design                                            3422

When you think it’s a field with numeric values, but it isn’t.

From the research question: when the respondent indicated that the number of years they have been coding is somewhere within one to 49 years

Pay attention to what these fields contain. If the data type is a character or factor field, you cannot use numeric values to filter. Look at what happens here when working with the entire data set.

    > nrow(filter(df, YearsCode >= 1, YearsCode <= 49))
    [1] 59127
    > nrow(filter(df, YearsCode != "More than 50 years", YearsCode != "Less than 1 year"))
    [1] 86445

Technically they filter for the same information. Is either one correct? Nope. What unique values exist?

    > unique(df$YearsCode)
     [1] "4"                  NA                   "3"                  "16"
     [5] "13"                 "6"                  "8"                  "12"
     [9] "2"                  "5"                  "17"                 "10"
    [13] "14"                 "35"                 "7"                  "Less than 1 year"
    [17] "30"                 "9"                  "26"                 "40"
    [21] "19"                 "15"                 "20"                 "28"
    [25] "25"                 "1"                  "22"                 "11"
    [29] "33"                 "50"                 "41"                 "18"
    [33] "34"                 "24"                 "23"                 "42"
    [37] "27"                 "21"                 "36"                 "32"
    [41] "39"                 "38"                 "31"                 "37"
    [45] "More than 50 years" "29"                 "44"                 "45"
    [49] "48"                 "46"                 "43"                 "47"
    [53] "49"

Do you see the quotation marks? R is interpreting every value here as strings or words, not numbers. Which function call is correct?

    df <- filter(df, YearsCode != "Less than 1 year" |
                     YearsCode != "More than 50 years")    # using an ‘or’ statement

    df <- filter(df, YearsCode != "Less than 1 year",
                     YearsCode != "More than 50 years")    # using an ‘and’ statement

The first returns 87,938.
The second returns 86,445. What’s happening here? Because these filters use !=, or does not equal, you have to consider how that impacts whether you use a comma or the vertical pipe |. For this example, you can see the difference when you use unique(). This shows the number of unique values.

    > length(unique(df$YearsCode))    # unfiltered original data
    [1] 53
    > length(unique(df$YearsCode))    # filtered with the ‘or’ statement above
    [1] 52
    > length(unique(df$YearsCode))    # filtered with the ‘and’ statement above
    [1] 50

An odd set of outcomes, right? How did it go from 53 to 52 and from 53 to 50? There were two labels filtered out in both function calls, yet neither removed two unique values. Both the ‘or’ and ‘and’ statements removed NA. The ‘or’ statement did not remove either string: every non-missing value is unequal to at least one of the two labels, so the condition is true for all of them and only NA is dropped (53 to 52). The ‘and’ statement removes NA and both labels (53 to 50). The ‘and’ statement is needed here. The field still has to be converted to a numeric type, then filtered for the range in the research question.

What range of data is available for this field that represents the number of years the respondent has been programming, after the ‘and’ statement?

    > range(df$YearsCode %>% as.numeric())    # after the ‘and’ statement, the field is 1:50
    [1]  1 50
    # the field needs to be permanently changed to numeric
    # the field will need to be filtered again
    > df$YearsCode <- as.numeric(df$YearsCode)
    > df <- filter(df, YearsCode >= 1, YearsCode <= 49)

There is one more step you need to take before you’re done with this field. You changed it, now validate those last changes. How do you know it’s correct?

Beyond the three variables shown in this document so far, if you run into trouble trying to model this data in your analysis, there are two things you can look for. The first: is your data a regular data frame? Not a tbl_df or tibble? Convert it to a data frame with df <- as.data.frame(df). The second: did you validate the changes you made? What does a summary of your data look like?
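Returning to the YearsCode field, the difference between the ‘or’ and ‘and’ filters, the numeric conversion, and one possible validation can be sketched on a small made-up vector. This is only an illustration under assumptions: the data frame toy and its values are hypothetical stand-ins for the survey column, not the survey data, and the checks at the end are one way to validate, not the required way.

```r
library(dplyr)

# Hypothetical stand-in for the survey column; these values are made up.
toy <- data.frame(
  YearsCode = c("4", "Less than 1 year", "10", "More than 50 years", NA, "50", "49"),
  stringsAsFactors = FALSE
)

# '|' with != : every non-missing value differs from at least one of the
# two labels, so only the NA row is dropped.
or_rows <- filter(toy, YearsCode != "Less than 1 year" |
                       YearsCode != "More than 50 years")

# ',' (and) with != : NA and both labels are dropped.
and_rows <- filter(toy, YearsCode != "Less than 1 year",
                        YearsCode != "More than 50 years")

# Convert the remaining strings to numbers, then apply the 1 to 49 range
# from the research question.
and_rows$YearsCode <- as.numeric(and_rows$YearsCode)
final <- filter(and_rows, YearsCode >= 1, YearsCode <= 49)

# Validate the change: the type is numeric and the range is inside 1:49.
print(nrow(or_rows))            # 6
print(nrow(and_rows))           # 4
print(is.numeric(final$YearsCode))
print(range(final$YearsCode))   # 4 49
```

On the real data the same checks (class, range, and a count of rows before and after) would be run against df rather than toy.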
How many levels do your factors have? Is it clean? If the function summary() returns factor levels with a count of 0, your data is not clean! If you were filtering for Australia, Russia, and the Netherlands, along with only full-time and part-time workers, and the use of summary() returned the following:

                                               Employment                  Country
     Employed full-time                                  :2252   Australia         :845
     Employed part-time                                  : 108   Russian Federation:764
     Independent contractor, freelancer, or self-employed:   0   Netherlands       :751
     Not employed, and not looking for work              :   0   Afghanistan       :  0
     Not employed, but looking for work                  :   0   Albania           :  0
     Retired                                             :   0   Algeria           :  0
                                                                 (Other)           :  0

Your data is not clean! If you attempt to train a random forest model on unclean data, as shown, it could take hours to process and may finish with an error. Clean the data first. Empty levels? Try this:

    > df <- droplevels(df)    # no more empty factor levels!
    # after dropping the empty factor levels, summary returns this
    > summary(df)
                 Employment                  Country
     Employed full-time:2252   Australia         :845
     Employed part-time: 108   Netherlands       :751
                               Russian Federation:764

The final caveat: reading the confusion matrices will be a lot easier if you change the labels of the outcome variable before training your model.

    > levels(df$SOJobs)    # filtered SOJobs for analysis has two levels
    [1] "No, I knew that Stack Overflow had a job board but have never used or visited it"
    [2] "Yes"
    > levels(df$SOJobs) <- c("No","Yes")    # change labels; order of labels matters

Research Assignment 2

The Outline for Research Assignment 2 and Research Assignment 2 will use this document. Use the Documenting Research Guide to understand how to use the information in this document for either of these submissions. Ask questions, if needed!

Problem: Employers’ external job postings need to be posted to the one job board that targets their model candidate and only receive applicants that are perfect for the role.
In reality, jobs are typically posted in numerous places, and both suitable and unsuitable candidates apply for the role. Using specific candidate characteristics and a specific job board, considering what may or may not influence the use of a specific job board will lead to better targeting of candidates, reducing redundant job postings, and decreasing the number of unfit candidates.

Question 1: What are the most influential features when predicting whether a survey respondent has used the SO job board or is aware of the board but has never used it when considering respondents who reported residing in the country of Spain, Australia, or Brazil; reported their age as somewhere between 18 and 65 years old; and that indicated that they were either not at all, somewhat, or very confident in their manager; reported an undergraduate major in either an engineering field, information systems, or web design, or statistics; in addition to the responses these respondents reported regarding employment; how often the respondent contributes to open source; and whether or not they code for a hobby; when the respondent indicated that the number of years they have been coding is somewhere within one to 49 years using the data from SO (2019)?

Question 2: You are responsible for developing a second research question. This question must meet the criteria from Unit 1 Part 1. Additionally, it must relate to the problem statement. It does not have to use the same subset of data as the other research question. The analysis method must be an analysis method demonstrated in one of the lectures. When completing the outline, make sure to include both the given question and the well-developed, sound research question you have developed.

Data:
• The data and data dictionaries are online.
  o Note: The raw data in your program must be in the original form. Do not modify the data outside of the programming. Use the data dictionary to understand the data.
  o The data and data dictionary are downloaded together. When you visit this site, ensure you select the 2019 survey and that you cite and reference the source in your work.
    ▪ Stack Overflow. (2019). Stack overflow annual developer survey [Data set and code book]. https://insights.stackoverflow.com/survey/
• Create a subset of data to represent the sample of secondary data in this analysis, based on the research questions.

11/4/20 Assignment 2 Sp Au Br.docx Page |1

Data Cleaning:
• Do not remove missing values during cleaning.
• When changing an object or part of an object, validate that the change occurred as expected.
• The steps that are taken in cleaning are not discussed in the research paper.

Analyze:
• When analyzing the given research question, you must use a random forest model.
  o You must attempt to improve the model performance by one of the methods covered in Unit 5.
  o The research question you write must make use of a method of analysis demonstrated in the lectures from this course.
  o The use of Accuracy is not suitable in and of itself to determine the validity and reliability of the model.
• The sub-stages of Analyze are necessary at least two times: profile, prepare, and apply. This method is for programming, not documenting research.
• Ensure you establish that the model is valid and reliable before discussing the influential indicators.

Results section and discussion section:
• Ensure that assertions and assessments in the results and discussion sections are derived from the analysis in R.
• Do not speculate. Use evidence. When documenting the results, consider the generalizability.
• Explain what was done to improve model performance in words: not programming functions, variable names, or argument names. Assume the reader cannot see the programming code or raw data, but needs to understand what you did to improve the performance.

Future recommendations:
• Include recommendations for future analysis, based on the research in R.
• Explore the insights you can gain from this model and provide your interpretations when documenting your research.

Bonus challenge: Compare the influential indicators in predicting the outcome depending on the country by creating separate models for each country. Describe whether or not there were distinct differences in the contribution of the different predictors. Do not speculate when discussing the findings.

Tip: An additional research question that meets the five criteria from the first lecture will bring this additional analysis into the focus of the research. The challenge does not replace the original research requirements for this assignment. If you were to complete the challenge, there would be three research questions.

Required files to submit:
1) Research paper in APA 7 format; MS Word document file type
2) R Script; final version with file type .r

Important Information:
• You will receive an email confirming the submission. Should you receive that email, your submission is received.
  o An error is derived from the use of SafeAssign.
  o SafeAssign does not recognize r file types. The warning does not impact the submission.
• The research paper will be written in a professional writing style, following APA 7 student paper format; use the student paper template.
  o The document shall be 3-5 pages and at least 1000 words. The page count does not include the cover page, tables, figures, or the reference page.
  o Ensure that every reference in the reference list is also cited in the text.
  o Do not forget to cite and reference the source of the data.
• It is ill-advised to modify the problem statement and research question provided.
• If the research problem or research questions are modified, the requirements of the analysis will not change, nor will the objective outlined in the original research question.
• There are several different versions of this assignment.
  If the submitted work is in line with a different version than assigned, the submitted work is a demonstration of academic dishonesty. Do not share the work with peers. Do not accept work that you did not do.
• Take a look at the rubric to get the best grade possible.

...
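The SOJobs steps from the Finding the right data notes (filter with str_detect(), drop empty factor levels, relabel the outcome) can be sketched together on a toy data frame. This is a minimal sketch under assumptions, not the assignment's solution: the data frame toy, the object kept, and the five rows are hypothetical and made up for illustration; only the functions filter(), str_detect(), droplevels(), levels(), and table() come from the notes.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the survey; the rows are made up.
toy <- data.frame(
  SOJobs = factor(c(
    "Yes",
    "No, I knew that Stack Overflow had a job board but have never used or visited it",
    "No, I didn't know that Stack Overflow had a job board",
    NA,
    "Yes"
  ))
)

# Keep respondents who used the job board or knew about it.
# Rows where SOJobs is NA evaluate to NA and are dropped by filter().
kept <- filter(toy, str_detect(SOJobs, "knew") | SOJobs == "Yes")

# The filtered-out answer is still attached as an empty factor level.
levels_before <- nlevels(kept$SOJobs)   # 3
print(table(kept$SOJobs))

# Drop the empty level, then shorten the labels.
# Order matters: levels are alphabetical, so "No, I knew..." comes before "Yes".
kept <- droplevels(kept)
levels_after <- nlevels(kept$SOJobs)    # 2
levels(kept$SOJobs) <- c("No", "Yes")

# Validate: two levels remain and the counts match the rows that were kept.
print(table(kept$SOJobs))
```

Printing table() before and after each change is the same validation habit the notes recommend; on the real data the counts should be compared against the unfiltered table.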