Notes/ Cell Answers to 2.2

Systems (OS's), pathlib is a solution to this problem.

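  • What are commands you use in terminal to access files? On a Mac, you use cd to move into a directory and ls to list the files in it.
  • What are the commands you use in Windows terminal to access files? In the Windows Command Prompt, you use cd to change directories, dir to list files, and type to print a file's contents. Commands such as cat, nano, and gedit are Unix-style tools (macOS/Linux) rather than Command Prompt commands.
  • What are some of the major differences? The major difference is how paths are written: Windows separates folders with backslashes and starts from a drive letter such as C:\, while macOS and Linux use forward slashes starting from /. Some command names also differ (dir versus ls).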
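
As a rough sketch of how pathlib smooths over these differences, the cell below builds the same relative path in both styles. The folder and file names match the image cell later in these notes; PurePosixPath and PureWindowsPath are used here only so the printed result is the same on any machine.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# The same path pieces, passed to each path class
parts = ("images", "smileyface.jpg")

mac_style = PurePosixPath(*parts)        # macOS/Linux style: forward slashes
windows_style = PureWindowsPath(*parts)  # Windows style: backslashes

print(mac_style)      # images/smileyface.jpg
print(windows_style)  # images\smileyface.jpg

# In everyday code, Path picks the right style for the current OS,
# and the / operator joins pieces without typing any separator by hand
image_path = Path("images") / "smileyface.jpg"
print(image_path.name)    # smileyface.jpg
print(image_path.suffix)  # .jpg
```

Writing paths this way means the same notebook cell should work whether it runs on Windows or on a Mac.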

Provide what you observed, struggled with, or learned while playing with this code.

  • Why is path a big deal when working with images? The path is crucial because a program cannot find or display an image without a correct path, and a missing or wrong path slows you down when building a project or program.
  • How does the meta data source and label relate to Unit 5 topics? According to College Board, Unit 5 deals with data mutations, storage, and the general understanding of data. Metadata relates to Unit 5 because it describes the data it is attached to, which ties into understanding different types of data.
  • Look up IPython, describe why this is interesting in Jupyter Notebooks for both Pandas and Images? IPython is interesting in Jupyter Notebooks because it can display images and data directly inside the notebook, without opening them in another program.

  • How is Base64 similar or different to Binary and Hexadecimal? Base64 encodes data using 64 characters. It is similar to binary and hexadecimal because all three represent data with characters from a fixed set. They differ in the size of that set: binary uses 2 characters, hexadecimal uses 16, and Base64 uses 64.

  • Translate first 3 letters of your name to Base64. Kri = S3Jp. Base64 is reversible, so decoding S3Jp gives back Kri and no data is lost in the process.
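
To check the comparison between the three bases, here is a small sketch that writes the same three letters ("Kri") in binary, hexadecimal, and Base64 using only Python's standard library. The variable names are just for illustration.

```python
import base64

text = "Kri"
raw = text.encode("ascii")  # the three letters as three bytes

binary = " ".join(f"{byte:08b}" for byte in raw)  # base 2: 8 digits per byte
hexadecimal = raw.hex()                           # base 16: 2 digits per byte
encoded = base64.b64encode(raw).decode("ascii")   # base 64: 4 characters per 3 bytes

print(binary)       # 01001011 01110010 01101001
print(hexadecimal)  # 4b7269
print(encoded)      # S3Jp

# Decoding returns the exact original bytes, so nothing is lost
decoded = base64.b64decode(encoded)
print(decoded == raw)  # True
```

The bigger the character set, the fewer characters are needed: 24 binary digits, 6 hex digits, or 4 Base64 characters for the same 3 bytes.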

Lossless image

  • This is an example of lossless data because compressing the image does not discard any pixels, so the original image can be restored exactly.
  • This is lossy because the image mixes many colors, so compressing it throws away some of that detail and the original cannot be fully restored.
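
A minimal sketch of the difference, assuming made-up pixel values instead of a real image. zlib stands in for a lossless algorithm, and rounding each value down to a multiple of 16 stands in for the detail a lossy format throws away; real formats such as PNG and JPEG are more involved than this.

```python
import zlib

# Made-up 8-bit grayscale values standing in for rows of an image
pixels = bytes([12, 13, 12, 200, 201, 199, 12, 13] * 100)

# Lossless: zlib shrinks the bytes and restores them exactly
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)
print(len(compressed) < len(pixels))  # True: the file got smaller
print(restored == pixels)             # True: every value came back

# Lossy (simplified): round each value down to a multiple of 16 first
quantized = bytes((p // 16) * 16 for p in pixels)
lossy_restored = zlib.decompress(zlib.compress(quantized))
print(lossy_restored == pixels)       # False: the fine detail is gone for good
```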

2.2 MCQ REFLECTION!

Reflection

  • I was able to score a 3/3 and I felt that I had a good understanding of data and its different kinds, such as lossy and lossless, as discussed earlier in class this week.

(1) Which of the following is an advantage of a lossless compression algorithm over a lossy compression algorithm?

Possible Answers:

(A) A lossless compression algorithm can guarantee that compressed information is kept secure, while a lossy compression algorithm cannot.

(B) A lossless compression algorithm can guarantee reconstruction of original data, while a lossy compression algorithm cannot.

(C) A lossless compression algorithm typically allows for faster transmission speeds than does a lossy compression algorithm.

(D) A lossless compression algorithm typically provides a greater reduction in the number of bits stored or transmitted than does a lossy compression algorithm.

Correct answer: B

(2) A user wants to save a data file on an online storage site. The user wants to reduce the size of the file, if possible, and wants to be able to completely restore the file to its original version. Which of the following actions best supports the user’s needs?

Possible Answers:

(A) Compressing the file using a lossless compression algorithm before uploading it

(B) Compressing the file using a lossy compression algorithm before uploading it

(C) Compressing the file using both lossy and lossless compression algorithms before uploading it

(D) Uploading the original file without using any compression algorithm

Correct answer: A

(3) A programmer is developing software for a social media platform. The programmer is planning to use compression when users send attachments to other users. Which of the following is a true statement about the use of compression?

Possible Answers:

(A) Lossless compression of video files will generally save more space than lossy compression of video files.

(B) Lossless compression of an image file will generally result in a file that is equal in size to the original file.

(C) Lossy compression of an image file generally provides a greater reduction in transmission time than lossless compression does.

(D) Sound clips compressed with lossy compression for storage on the platform can be restored to their original quality when they are played.

Correct answer: C

import numpy as np
from PIL import Image

# Load the image
image = Image.open('images/smileyface.jpg')


img_array = np.asarray(image)

# Unpack each 8-bit channel value into bits: shape (H, W, 3) becomes (H, W, 24)
binary_pixels = np.unpackbits(img_array, axis=-1)
# Join each pixel's 24 bits into a 6-digit hex string such as 'ff0000'
hex_pixels = np.apply_along_axis(lambda x: hex(int(''.join(map(str, x)), 2))[2:].zfill(6), -1, binary_pixels)

red_img = np.copy(img_array)
# Zero the green and blue channels so only the red channel remains
red_img[:, :, 1] = 0
red_img[:, :, 2] = 0

red_image = Image.fromarray(red_img)

# Save under a new file name (chosen here for illustration) so the original is not overwritten
red_image.save('images/smileyface_red.jpg')

resized_image = red_image.resize((red_image.width // 10, red_image.height // 10))

print(image.info)

resized_image.show()
{'jfif': 257, 'jfif_version': (1, 1), 'jfif_unit': 0, 'jfif_density': (1, 1)}

2.3 Hacks

Notes

  • Pandas is a Python library for working with datasets; it is imported with import pandas as pd
  • Datasets can also be found on Kaggle and downloaded to explore the specifics
  • A CSV file can store the data within a dataset
  • The DataFrame Selection Max and Min function helps to select data based on the maximum or minimum value of a given feature
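
One way the max and min selection from these notes might look, sketched on a tiny DataFrame. The column names come from the housing data used later in these notes, but the DataFrame is built by hand here so the cell runs without the CSV file.

```python
import pandas as pd

# A few hand-typed sample rows using two of the housing column names
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})

# Keep only the rows where total_rooms equals the column maximum / minimum
most_rooms = df[df["total_rooms"] == df["total_rooms"].max()]
fewest_rooms = df[df["total_rooms"] == df["total_rooms"].min()]

print(most_rooms)
print(fewest_rooms)
```

Comparing against max() or min() keeps every row that ties for the extreme value, not just the first one.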

AP PREP

(1) A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.

The researcher has access to a database with the following information about each student.

Last name

First name

Grade level (9, 10, 11, or 12)

Grade point average (on a 0.0 to 4.0 scale)

The researcher also has access to another database with the following information about each student.

First name

Last name

Number of absences from school

Number of late arrivals to school

Upon compiling the data, the researcher identifies a problem due to the fact that neither data source uses a unique ID number for each student. Which of the following best describes the problem caused by the lack of unique ID numbers?

(A) Students who have the same name may be confused with each other.

(B) Students who have the same grade point average may be confused with each other.

(C) Students who have the same grade level may be confused with each other.

(D) Students who have the same number of absences may be confused with each other.

Correct Answer: A
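
To see why answer A is the problem, here is a small sketch with made-up records where two different students share a name. Matching the two sources on name alone cannot tell them apart.

```python
# Made-up records; two different students are both named Alex Kim
gpa_records = [
    {"first": "Alex", "last": "Kim", "grade": 9, "gpa": 3.9},
    {"first": "Alex", "last": "Kim", "grade": 12, "gpa": 2.1},
]
absence_records = [
    {"first": "Alex", "last": "Kim", "absences": 1},
    {"first": "Alex", "last": "Kim", "absences": 14},
]

# Joining on name pairs every GPA row with every absence row of that name
combined = [
    {**g, "absences": a["absences"]}
    for g in gpa_records
    for a in absence_records
    if (g["first"], g["last"]) == (a["first"], a["last"])
]

print(len(combined))  # 4 rows for only 2 students
```

With a unique ID in both databases, each GPA row would match exactly one absence row.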

(2) A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?

(A) A computer program cannot combine data from different files.

(B) Different counties may organize data in different ways.

(C) The number of counties is too large for the program to process.

(D) The total number of rows of data is too large for the program to process.

Correct Answer: B

(3) A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?

Select two answers.

(A) Users might attempt to use the Web site to search for multiple cities.

(B) Users might enter abbreviations for the names of cities.

(C) Users might misspell the name of the city.

(D) Users might be slow at typing a city name in the text field.

Correct Answers: B and C
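
A rough sketch of how a site could cope with both challenges, assuming a hypothetical list of known cities and a hand-made abbreviation table. difflib is in Python's standard library; the 0.8 cutoff is just a guess at a reasonable threshold.

```python
import difflib

# Hypothetical data: cities the site knows about and a small abbreviation table
KNOWN_CITIES = ["san diego", "los angeles", "san francisco", "new york"]
ABBREVIATIONS = {"sd": "san diego", "la": "los angeles", "sf": "san francisco", "nyc": "new york"}

def normalize_city(user_input):
    """Return the best-matching known city, or None if nothing is close."""
    cleaned = user_input.strip().lower()
    # Challenge B: abbreviations are looked up directly
    if cleaned in ABBREVIATIONS:
        return ABBREVIATIONS[cleaned]
    # Challenge C: misspellings are matched to the closest known name
    matches = difflib.get_close_matches(cleaned, KNOWN_CITIES, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(normalize_city("LA"))         # los angeles
print(normalize_city("San Deigo"))  # san diego
print(normalize_city("Atlantis"))   # None
```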

(4) A database of information about shows at a concert venue contains the following information.

Name of artist performing at the show

Date of show

Total dollar amount of all tickets sold

Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?

(A) Average ticket price

(B) Length of the show in minutes

(C) Start time of the show

(D) Total dollar amount of food and drinks sold during the show

Correct Answer: A
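
The reasoning behind answer A is that attendance can be estimated as total ticket sales divided by average ticket price. A quick sketch with made-up shows for one month:

```python
# Made-up shows: (artist, total ticket sales in dollars, average ticket price)
shows = [
    ("Artist A", 50_000, 100),
    ("Artist B", 45_000, 50),
    ("Artist A", 30_000, 100),
]

attendance = {}
for artist, total_sales, avg_price in shows:
    # tickets sold = total dollars / average dollars per ticket
    attendance[artist] = attendance.get(artist, 0) + total_sales / avg_price

top_artist = max(attendance, key=attendance.get)
print(attendance)  # {'Artist A': 800.0, 'Artist B': 900.0}
print(top_artist)  # Artist B
```

Artist A earned more money in this made-up data, but Artist B had more people attend, which is why the dollar total alone is not enough.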

(5) A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?

(A) The average number of hours per day that the car is in use

(B) The car’s average speed on a particular day

(C) The distance the car traveled on a particular day

(D) The number of bicycles the car passed on a particular day

Correct Answer: D

(6) A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions.

On average, how long does homework take you each night (in minutes)?

On average, how long do you study for each test (in minutes)?

Do you enjoy the subject material of this class (yes or no)?

Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?

I. Do students who enjoy the subject material tend to spend more time on homework each night than the other students do?

II. Do students who spend more time on homework each night tend to spend less time studying for tests than the other students do?

III. Do students who spend more time studying for tests tend to earn higher grades in the class than the other students do?

(A) I only

(B) III only

(C) I and II

(D) I and III

Correct Answer: C

import pandas as pd

# Set the file path of the CSV file
file_path = "./files/housing.csv" # Replace with the actual path to the file on your system

# Load the CSV file into a pandas DataFrame
housing_df = pd.read_csv(file_path)

# Print the first 5 rows of the DataFrame
print(housing_df.head())
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  
import pandas as pd


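# Load the CSV into a DataFrame (named housing_df because it holds the data, not a path)
housing_df = pd.read_csv("./files/housing.csv")

# Average of the total_rooms column
mean_total_rooms = housing_df['total_rooms'].mean()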


print(f"The mean of the 'total_rooms' column is: {mean_total_rooms}")
The mean of the 'total_rooms' column is: 2635.7630813953488
import pandas as pd


df = pd.read_csv("./files/housing.csv")


# mode() returns a Series because a column can have more than one mode
mode_total_rooms = df['total_rooms'].mode()


print(f"The mode of the 'total_rooms' column is: {mode_total_rooms}")
The mode of the 'total_rooms' column is: 0    1527.0
Name: total_rooms, dtype: float64
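
The mode prints with an index and a dtype line because pandas returns it as a Series rather than a single number. A small sketch with made-up values (not the housing data) shows why, and one way to pull out plain numbers:

```python
import pandas as pd

# Made-up values; 3.0 and 5.0 both appear twice, so there are two modes
rooms = pd.Series([3.0, 5.0, 3.0, 5.0, 8.0], name="total_rooms")

modes = rooms.mode()   # a sorted Series with one entry per mode
print(modes.tolist())  # [3.0, 5.0]
print(modes.iloc[0])   # 3.0, the smallest of the modes

# mean() always gives a single number, so it prints cleanly
print(rooms.mean())
```

Using .iloc[0] or .tolist() on the result gives a cleaner value to put inside an f-string.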