In 150 words, agree or disagree with each question.
Q1. As the World Wide Web (WWW) grew in the late 1900s and early 2000s, search engines and indexes were created to help locate information. In the early years, search results were returned by humans; however, as the web grew from dozens to millions of pages, automation was needed. Hadoop was created to solve this problem.
According to Stedman (n.d.), Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications in scalable clusters of computer servers. In simpler terms, this software manages large amounts of data via ‘nodes’ across a network of computers, which makes it possible to increase both data storage capacity and processing speed. The software offers advantages as well as challenges.
Some of the advantages include the ability to store and process large quantities of data, fault tolerance, computing power, flexibility, low cost, and scalability. Fault tolerance is accomplished through the many ‘nodes’: if a node goes down, its work is redirected to another one. Computing power comes from the large number of nodes used to process and store data. Hadoop’s flexibility is due to its ability to process structured, semi-structured, and unstructured data. Scalability refers to the ease of growing the system by simply adding nodes. Hadoop also comes with some challenges.
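The fault-tolerance idea described above (work moves to another node when one goes down) can be illustrated with a toy sketch. This is not Hadoop's API: the node names and the run_with_failover helper are hypothetical and exist only to show the redirection logic.

```python
# Toy illustration only: this is NOT Hadoop's API. The node names and the
# run_with_failover helper are hypothetical.

def run_with_failover(task, nodes, is_up):
    """Run the task on the first node in the list that is currently up."""
    for node in nodes:
        if is_up(node):
            return node, task(node)
    raise RuntimeError("no healthy node available")


nodes = ["node-a", "node-b", "node-c"]
down = {"node-a"}  # pretend the first node has failed

chosen, result = run_with_failover(
    task=lambda node: f"job finished on {node}",
    nodes=nodes,
    is_up=lambda node: node not in down,
)
print(chosen)
print(result)
```

In this sketch, scaling out amounts to appending more names to the nodes list, which mirrors the point about growing the system by adding nodes.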
These include a wide knowledge gap, security, data management, and a poor fit for some kinds of problems. There is a wide talent gap when it comes to Java professionals. Although new security tools are emerging, the software still faces fragmented data security issues. Additionally, data quality and standardization are lacking since Hadoop does not have easy-to-use, full-featured tools for data management, cleansing, and governance (Hadoop: What is it and Why it Matters, n.d.). Lastly, the software is not efficient for iterative and interactive analytic tasks.
Another software product used in conjunction with Hadoop is HBase. HBase “is a column-oriented non-relational database management system that runs on top of Hadoop Distributed File System (HDFS)” (What is Hbase?, n.d.). It provides a fault-tolerant way of storing data sets and is designed for real-time data processing and/or read/write access to large amounts of data. It also supports writing applications in Apache Avro, REST, and Thrift.
A shortfall of this software is that it does not support a structured query language like SQL. Additionally, HBase is not a relational data store. It is, however, designed to scale linearly.
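To make the column-oriented, non-relational idea more concrete, here is a toy in-memory sketch. It is not the HBase client API; the put and get helpers and the sample rows are hypothetical. It assumes a layout of row key, then a "family:qualifier" column name, then a value, and shows that data is looked up by row key rather than through a SQL query.

```python
from collections import defaultdict

# Toy in-memory model only; this is NOT the HBase client API.
# Assumed layout for illustration: row key -> "family:qualifier" -> value.
table = defaultdict(dict)


def put(row_key, column, value):
    """Store one cell under a row key and column name."""
    table[row_key][column] = value


def get(row_key, column=None):
    """Return one cell, or the whole row as a dict when no column is given."""
    row = table.get(row_key, {})
    return dict(row) if column is None else row.get(column)


put("user1", "info:name", "Ana")
put("user1", "info:city", "Austin")
put("user2", "info:name", "Ben")  # user2 has no city: rows can be sparse

print(get("user1", "info:city"))
print(get("user2"))
```

Each row holds only the columns that were actually written, so there is no fixed schema to join or query against, which is one way to picture why SQL does not apply directly.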
Hadoop: What is it and Why it Matters. (n.d.). Retrieved from SAS: https://www.sas.com/en_us/insights/big-data/hadoop.html
Stedman, C. (n.d.). What is Data Management and Why is it Important? Retrieved from Search Data Management: https://searchdatamanagement.techtarget.com/definition/Hadoop
What is Hbase? (n.d.). Retrieved from IBM: https://www.ibm.com/topics/hbase
Q2. Research into whether someone who drinks alcohol has a greater probability of getting into legal trouble for Driving Under the Influence (DUI) would be an interesting study to use logistic regression on. The study data could be the average number of people who make alcohol purchases (grocery store purchase + dinner drink + social drink) compared to the number of people who frequent these three types of establishments, set against the average number of daily DUIs over a ten-year sample period (ten years of DUI data, averaged, / 365). This study would be interesting from a social standpoint: people who drink more than socially are often told that “it’s only a matter of time” before they end up in legal trouble. However, that stereotype may be repeated more often than the facts support.
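The average-daily-DUI figure described above is simple arithmetic, sketched here with made-up numbers. The yearly counts are hypothetical placeholders, not real DUI data.

```python
from statistics import mean

# Hypothetical yearly DUI counts for a ten-year sample (made-up numbers).
yearly_duis = [1200, 1150, 1100, 1180, 1050, 990, 940, 900, 870, 820]

avg_per_year = mean(yearly_duis)   # ten years of DUI data, averaged
avg_per_day = avg_per_year / 365   # average daily DUIs

print(avg_per_year)
print(round(avg_per_day, 2))
```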
I think this study would be both appropriate and interesting because, with the inclusion of data, my first hypothesis is that the more people drink, the more likely they are to get a DUI; however, this may not be the case. The data could be skewed because people who drink alcohol only at dinner could still receive a DUI while fitting only into a “social” category. Furthermore, the data would also be hard to trust because, over the last ten years, the emergence of ride share services (Uber, Lyft, local services, etc.) has probably also caused a drastic decrease in drunk driving. This could be an interesting aspect to study in a separate regression, perhaps with its own hypothesis: “Did ride share services affect DUI rates in cities (metropolitan and rural)?”
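As a rough sketch of how the first hypothesis could be tested, the example below fits a one-variable logistic regression with plain gradient ascent. Everything here is an assumption for illustration: the data is made up, "drinks per week" is a hypothetical predictor, and a real study would more likely use a statistics package and many more observations.

```python
import math

# Made-up illustration data, not real study results.
# drinks = drinks per week (hypothetical), had_dui = 1 if the person had a DUI.
drinks = [0, 1, 2, 3, 4, 5, 6, 7]
had_dui = [0, 0, 0, 1, 0, 1, 1, 1]


def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit(xs, ys, lr=0.1, steps=5000):
    """Fit P(DUI) = sigmoid(b + w * x) by gradient ascent on the log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [y - sigmoid(b + w * x) for x, y in zip(xs, ys)]
        w += lr * sum(e * x for e, x in zip(errors, xs)) / n
        b += lr * sum(errors) / n
    return w, b


w, b = fit(drinks, had_dui)


def predict(x):
    """Estimated probability of a DUI for a given number of drinks per week."""
    return sigmoid(b + w * x)


print(f"slope w = {w:.3f}, intercept b = {b:.3f}")
print(f"P(DUI | 0 drinks) = {predict(0):.3f}")
print(f"P(DUI | 7 drinks) = {predict(7):.3f}")
```

A positive slope w would be consistent with the hypothesis that more drinking goes with a higher DUI probability. The ride share question could be explored in the same way by adding a second predictor, for example a 0/1 flag for whether ride share services were available, though that extension is not shown here.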