Data Statistics

By blacklad. Tags: Python, Pandas

1 Introduction

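Statistics is the discipline of extracting useful information from data. It covers collecting, organizing, describing, analyzing, and interpreting data. Statistical methods let us uncover patterns in the data we have and make predictions and inferences about data we have not yet seen.

Pandas is a powerful data analysis library for Python that provides a rich set of functions and methods for processing and analyzing data. The examples in this article all use the small DataFrame below; note that the `None` in the `score` column is stored as a missing value (`NaN`).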

import pandas as pd

# Create a DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice'],
    'age': [24, 27, 22, 32, 24],
    'score': [85.5, 88.0, 92.5, 70.0, None]
}
df = pd.DataFrame(data)

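2 Descriptive Statistics

`describe` returns a summary table covering every numeric column. Each row is one statistic: count (of non-missing values), mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum.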

print(df.describe())

             age     score
count   5.000000   4.00000
mean   25.800000  84.00000
std     3.898718   9.77241
min    22.000000  70.00000
25%    24.000000  81.62500
50%    24.000000  86.75000
75%    27.000000  89.12500
max    32.000000  92.50000
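
By default `describe` only reports numeric columns, so `name` is missing from the table above. As a minimal sketch (rebuilding the same DataFrame so the block runs on its own), the `include` and `percentiles` parameters can widen or adjust the summary:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Alice"],
    "age": [24, 27, 22, 32, 24],
    "score": [85.5, 88.0, 92.5, 70.0, None],
})

# include="all" also summarizes non-numeric columns,
# which get count / unique / top / freq rows instead of mean / std.
full_summary = df.describe(include="all")
print(full_summary)

# percentiles selects which percentile rows are reported.
tail_summary = df.describe(percentiles=[0.1, 0.9])
print(tail_summary)
```

Cells that do not apply to a column (for example the mean of `name`) are filled with `NaN`.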

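3 Mathematical Statistics

This section computes individual summary statistics for a single column.

Sum

The sum adds up all values in the column. Pandas skips missing values (`NaN`) by default, so only the four recorded scores are added.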

sum_score = df['score'].sum()
print(f"Sum: {sum_score}")

Sum: 336.0
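
If a missing value should invalidate the total rather than be silently ignored, the `skipna` parameter controls this. A small sketch, using a standalone Series with the same scores:

```python
import math
import pandas as pd

scores = pd.Series([85.5, 88.0, 92.5, 70.0, None], name="score")

default_sum = scores.sum()             # NaN is skipped
strict_sum = scores.sum(skipna=False)  # NaN propagates to the result

print(f"Default sum: {default_sum}")
print(f"Strict sum is NaN: {math.isnan(strict_sum)}")
```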

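Count

The count is the number of non-missing values. The `score` column has five rows, but one is `NaN`, so the result is 4.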

count_score = df['score'].count()
print(f"Count: {count_score}")

Count: 4
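
To make the difference between rows and recorded values explicit, the count can be compared with the length of the column and the number of missing entries. A short sketch on a standalone Series with the same scores:

```python
import pandas as pd

scores = pd.Series([85.5, 88.0, 92.5, 70.0, None], name="score")

total_rows = len(scores)        # every row, including NaN
non_missing = scores.count()    # rows with an actual value
missing = scores.isna().sum()   # rows that are NaN

print(f"Rows: {total_rows}, non-missing: {non_missing}, missing: {missing}")
```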

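Mean

The mean is the arithmetic average: the sum of the non-missing values divided by their count (here 336.0 / 4).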

mean_score = df['score'].mean()
print(f"Mean: {mean_score}")

Mean: 84.0

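Median

The median is the middle value once the data is sorted in ascending order. If the number of values is even, it is the average of the two middle values.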

median_score = df['score'].median()
print(f"Median: {median_score}")

Median: 86.75
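
The definition can be checked by hand. The sketch below sorts the non-missing scores and averages the two middle values, since four values remain after dropping `NaN`:

```python
import pandas as pd

scores = pd.Series([85.5, 88.0, 92.5, 70.0, None], name="score")

# Sort the recorded values: [70.0, 85.5, 88.0, 92.5]
values = sorted(scores.dropna())
n = len(values)
mid = n // 2

if n % 2 == 0:
    # Even count: average the two middle values
    manual_median = (values[mid - 1] + values[mid]) / 2
else:
    # Odd count: take the single middle value
    manual_median = values[mid]

print(f"Manual median: {manual_median}")
print(f"Pandas median: {scores.median()}")
```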

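Variance and Standard Deviation

Variance measures the average squared deviation of the values from their mean. The standard deviation is the square root of the variance and expresses the spread of the data in its original units. Note that pandas `var()` and `std()` compute the sample versions by default (`ddof=1`), dividing by n - 1 rather than n, which is why the variance below is 286.5 / 3 = 95.5.

# Variance (sample variance, ddof=1)
variance_score = df['score'].var()
# Standard deviation (sample standard deviation, ddof=1)
std_deviation_score = df['score'].std()
print(f"Variance: {variance_score}")
print(f"Standard Deviation: {std_deviation_score}")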

Variance: 95.5
Standard Deviation: 9.772410142846033
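
The effect of the divisor is easiest to see by computing both versions by hand. A minimal sketch on a standalone Series with the same scores, comparing the manual results with `var()` and `var(ddof=0)`:

```python
import pandas as pd

scores = pd.Series([85.5, 88.0, 92.5, 70.0, None], name="score")

n = scores.count()  # number of non-missing values
squared_dev = ((scores - scores.mean()) ** 2).sum()  # NaN is skipped

sample_var = squared_dev / (n - 1)  # matches scores.var()
population_var = squared_dev / n    # matches scores.var(ddof=0)

print(f"Sample variance:     {sample_var} (pandas: {scores.var()})")
print(f"Population variance: {population_var} (pandas: {scores.var(ddof=0)})")
```

Use `ddof=0` when the data is the entire population rather than a sample drawn from it.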

Minimum and Maximum

`min` and `max` return the smallest and largest non-missing values in the column.

min_score = df['score'].min()
max_score = df['score'].max()
print(f"Min: {min_score}")
print(f"Max: {max_score}")

Min: 70.0
Max: 92.5
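
Often the question is not only what the extreme value is but which row it belongs to. As a sketch (rebuilding the same DataFrame), `idxmin` and `idxmax` return the index label of the extreme value, which can then be passed to `loc`:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Alice"],
    "age": [24, 27, 22, 32, 24],
    "score": [85.5, 88.0, 92.5, 70.0, None],
})

# idxmin / idxmax skip NaN and return the row label of the extreme value
lowest = df.loc[df["score"].idxmin()]
highest = df.loc[df["score"].idxmax()]

print(f"Lowest score:  {lowest['name']} ({lowest['score']})")
print(f"Highest score: {highest['name']} ({highest['score']})")
```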

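4 Data Sampling

Sampling selects a subset of a population for analysis. Common methods include simple random sampling, systematic sampling, and stratified sampling. `DataFrame.sample` performs simple random sampling, so the rows shown in the outputs below will differ from run to run.

Randomly sample 3 rows: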

sample = df.sample(n=3)
print(sample)

      name  age  score
4    Alice   24    NaN
2  Charlie   22   92.5
3    David   32   70.0
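
Because the sample is random, repeated runs return different rows. When a result needs to be reproducible, a seed can be passed through the `random_state` parameter. A minimal sketch (the seed value 42 is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Alice"],
    "age": [24, 27, 22, 32, 24],
    "score": [85.5, 88.0, 92.5, 70.0, None],
})

# The same seed always selects the same rows
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

print(first)
print(f"Identical samples: {first.equals(second)}")
```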

Randomly sample 50% of the rows (with 5 rows, this yields 2 rows):

sample = df.sample(frac=0.5)
print(sample)

      name  age  score
3    David   32   70.0
2  Charlie   22   92.5

Randomly sample columns instead of rows by setting `axis=1`:

sample = df.sample(n=2, axis=1)
print(sample)

   score     name
0   85.5    Alice
1   88.0      Bob
2   92.5  Charlie
3   70.0    David
4    NaN    Alice
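
The other two methods named at the start of this section, systematic and stratified sampling, can also be sketched with pandas. The `group` column below is a hypothetical addition for illustration only and is not part of the original data; the step size and seed are likewise arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Alice"],
    "age": [24, 27, 22, 32, 24],
    "score": [85.5, 88.0, 92.5, 70.0, None],
    "group": ["A", "A", "B", "B", "B"],  # hypothetical strata
})

# Systematic sampling: take every 2nd row, starting from the first
systematic = df.iloc[::2]
print(systematic)

# Stratified sampling: draw one random row from each group
stratified = df.groupby("group").sample(n=1, random_state=0)
print(stratified)
```

Stratified sampling guarantees that every group is represented, which simple random sampling on a small table does not.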

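5 Covariance

Covariance measures how two variables vary together. A positive covariance means the two tend to increase or decrease together; a negative covariance means one tends to decrease as the other increases. The covariance matrix lists the covariance of every pair of variables, with each variable's own variance on the diagonal.

# numeric_only=True excludes the non-numeric 'name' column,
# which would otherwise raise an error in pandas 2.0 and later
print(df.cov(numeric_only=True))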

        age  score
age    15.2  -39.0
score -39.0   95.5
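
The off-diagonal value can be reproduced by hand. The sketch below drops the row where `score` is missing, multiplies the deviations of `age` and `score` from their means, and divides by n - 1. This mirrors the pairwise handling of missing values that `cov` applies: the row without a score is excluded for the `age`/`score` pair, while the `age` variance of 15.2 still uses all five rows.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Alice"],
    "age": [24, 27, 22, 32, 24],
    "score": [85.5, 88.0, 92.5, 70.0, None],
})

# Keep only rows where both values are present
pairs = df[["age", "score"]].dropna()

age_dev = pairs["age"] - pairs["age"].mean()
score_dev = pairs["score"] - pairs["score"].mean()

# Sample covariance: sum of products of deviations over (n - 1)
manual_cov = (age_dev * score_dev).sum() / (len(pairs) - 1)
builtin_cov = df["age"].cov(df["score"])

print(f"Manual covariance:  {manual_cov}")
print(f"Pandas covariance:  {builtin_cov}")
```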