Department of Computer Engineering (2019 course)

Course Name & Code: Data Science and Big Data Analytics Laboratory (310256)

Group A: Data Science

1. Data Wrangling I

2. Data Wrangling II

3. Descriptive Statistics - Measures of Central Tendency and Variability

4. Data Analytics I

5. Data Analytics II

6. Data Analytics III

7. Text Analytics

8. Data Visualization I

9. Data Visualization II

10. Data Visualization III

Group B: Big Data Analytics – JAVA/SCALA

1. Write code in Java for a simple Word Count application that counts the number of occurrences of each word in a given input set, using the Hadoop MapReduce framework on a local standalone setup.

2. Design a distributed application using MapReduce that processes a system log file.

3. Locate a weather dataset (e.g., sample_weather.txt) and write a program that reads the text input files and finds the average temperature, dew point and wind speed.

4. Write a simple program in Scala using the Apache Spark framework.
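The map, shuffle and reduce phases behind items 1 and 3 can be sketched in a few lines. This is only an in-memory illustration in plain Python, not the Hadoop API and not the Java solution the assignment asks for. The helper name run_mapreduce, the sample lines, and the weather record layout (date, temperature, dew point and wind speed separated by spaces) are all assumptions made for the example; the real sample_weather.txt may use a different layout.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key) and reduce phases in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    return {key: reducer(key, values) for key, values in groups.items()}


# Item 1: word count. The mapper emits (word, 1); the reducer sums the ones.
def wc_mapper(line):
    for word in line.lower().split():
        yield word, 1


def wc_reducer(word, counts):
    return sum(counts)


# Item 3: averages. Assumes hypothetical lines of the form
# "date temperature dew_point wind_speed" (not the real file layout).
def weather_mapper(line):
    _date, temp, dew, wind = line.split()
    yield "temperature", float(temp)
    yield "dew_point", float(dew)
    yield "wind_speed", float(wind)


def avg_reducer(key, values):
    return sum(values) / len(values)


lines = ["the quick brown fox", "the lazy dog", "The fox"]
word_counts = run_mapreduce(lines, wc_mapper, wc_reducer)

weather = ["20200101 10.0 4.0 3.0", "20200102 14.0 6.0 5.0"]
averages = run_mapreduce(weather, weather_mapper, avg_reducer)

print(word_counts)
print(averages)
```

In a Hadoop job the same roles would be played by Mapper and Reducer classes, with the framework doing the grouping between them.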

Group C: Mini Projects/Case Study – PYTHON/R

1. Write a case study on Global Innovation Network and Analysis (GINA). The components of the analytic plan are: 1. Discovery (business problem framed), 2. Data, 3. Model planning (analytic technique), and 4. Results and key findings.

2. Use the following dataset to classify tweets as positive or negative: https://www.kaggle.com/ruchi798/data-science-tweets

4. Use the covid_vaccine_statewise.csv dataset (https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_vaccine_statewise.csv) and perform the following analytics on it:

a. Describe the dataset.

b. Number of persons vaccinated with the first dose, state-wise, in India.

c. Number of persons vaccinated with the second dose, state-wise, in India.

d. Number of males vaccinated.

e. Number of females vaccinated.

5. Write a case study on data-driven processing for Digital Marketing OR Health care systems using the Hadoop Ecosystem components listed below. (Mandatory)

● HDFS: Hadoop Distributed File System

● YARN: Yet Another Resource Negotiator

● MapReduce: Programming-based data processing

● Spark: In-Memory data processing

● PIG, HIVE: Query-based processing of data services

● HBase: NoSQL Database (Provides real-time reads and writes)

● Mahout, Spark MLlib: Machine Learning algorithm libraries (provide analytical tools)

● Solr, Lucene: Searching and Indexing
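For the state-wise vaccination analytics in Group C, the aggregation steps can be sketched with Python's standard csv module. The column names and the three sample rows below are hypothetical stand-ins, not the real contents of covid_vaccine_statewise.csv, so they would need adjusting to the actual header. The sketch also assumes each row holds counts that can be summed; if the real file reports cumulative totals per date, the latest row per state should be used instead of a sum.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real file's column names may differ.
SAMPLE = """State,First Dose,Second Dose,Male,Female
Maharashtra,100,40,60,40
Kerala,50,30,20,30
Maharashtra,20,10,15,5
"""


def summarize(csv_text):
    """Return row count, per-state dose totals and male/female totals."""
    first = defaultdict(int)
    second = defaultdict(int)
    males = 0
    females = 0
    rows = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows += 1
        first[row["State"]] += int(row["First Dose"])
        second[row["State"]] += int(row["Second Dose"])
        males += int(row["Male"])
        females += int(row["Female"])
    return {
        "rows": rows,
        "first_dose": dict(first),
        "second_dose": dict(second),
        "males": males,
        "females": females,
    }


summary = summarize(SAMPLE)
print(summary)
```

To run this on the downloaded file, the text passed to summarize would come from reading the CSV from disk, and a library such as pandas could replace the manual loops.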
