On this tutorial, we carry out a sophisticated single-cell RNA-seq evaluation workflow utilizing Scanpy on the PBMC-3k benchmark dataset. We begin by loading the dataset, inspecting its construction, and making use of high quality management checks to judge gene counts, whole counts, mitochondrial content material, and ribosomal gene indicators. We then filter low-quality cells and genes, detect potential doublets with Scrublet, normalize the info, apply log transformation, and determine extremely variable genes for downstream evaluation. Additionally, we rating cell-cycle phases, regress out undesirable technical variation, scale the info, and scale back dimensionality utilizing PCA, UMAP, and t-SNE. We additionally cluster cells with the Leiden algorithm, determine marker genes, annotate cell populations utilizing canonical PBMC markers, discover trajectory construction with PAGA and diffusion pseudotime, calculate a customized interferon-response rating, and eventually save the absolutely analyzed AnnData object for future use.
Copy CodeCopied!pip set up -q scanpy leidenalg python-igraph scrublet
import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(“ignore”)
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor=”white”, figsize=(5, 5))
sc.logging.print_header()
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
print(adata)
adata.var[“mt”] = adata.var_names.str.startswith(“MT-“)
adata.var[“ribo”] = adata.var_names.str.startswith((“RPS”, “RPL”))
sc.pp.calculate_qc_metrics(
adata, qc_vars=[“mt”, “ribo”], percent_top=None, log1p=False, inplace=True
)
sc.pl.violin(
adata,
[“n_genes_by_counts”, “total_counts”, “pct_counts_mt”],
jitter=0.4, multi_panel=True,
)
sc.pl.scatter(adata, x=”total_counts”, y=”pct_counts_mt”)
sc.pl.scatter(adata, x=”total_counts”, y=”n_genes_by_counts”)
We set up the required single-cell evaluation libraries and import Scanpy, NumPy, Pandas, Matplotlib, and warning controls. We load the PBMC-3k benchmark dataset, make gene names distinctive, and examine the AnnData object construction. We then calculate high quality management metrics for mitochondrial and ribosomal genes and visualize count-level high quality patterns utilizing violin and scatter plots.
Copy CodeCopiedsc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.n_genes_by_counts < 2500, :].copy()
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()
sc.pp.scrublet(adata)
print(“Predicted doublets:”, int(adata.obs[“predicted_doublet”].sum()))
adata = adata[~adata.obs[“predicted_doublet”], :].copy()
adata.layers[“counts”] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata.uncooked = adata
adata = adata[:, adata.var.highly_variable].copy()
We filter out low-quality cells and infrequently detected genes to enhance the reliability of the dataset. We use Scrublet by way of Scanpy to determine predicted doublets and take away them earlier than deeper evaluation. We then protect uncooked counts, normalize expression values, apply log transformation, choose extremely variable genes, and preserve solely probably the most informative options.
Copy CodeCopieds_genes = [“MCM5″,”PCNA”,”TYMS”,”FEN1″,”MCM2″,”MCM4″,”RRM1″,”UNG”,”GINS2″,
“MCM6″,”CDCA7″,”DTL”,”PRIM1″,”UHRF1″,”HELLS”,”RFC2″,”NASP”,
“RAD51AP1″,”GMNN”,”WDR76″,”SLBP”,”CCNE2″,”UBR7″,”POLD3″,”MSH2″,
“ATAD2″,”RAD51″,”RRM2″,”CDC45″,”CDC6″,”EXO1″,”TIPIN”,”DSCC1″,
“BLM”,”CASP8AP2″,”USP1″,”CLSPN”,”POLA1″,”CHAF1B”,”E2F8″]
g2m_genes = [“HMGB2″,”CDK1″,”NUSAP1″,”UBE2C”,”BIRC5″,”TPX2″,”TOP2A”,”NDC80″,
“CKS2″,”NUF2″,”CKS1B”,”MKI67″,”TMPO”,”CENPF”,”TACC3″,”SMC4″,
“CCNB2″,”CKAP2L”,”CKAP2″,”AURKB”,”BUB1″,”KIF11″,”ANP32E”,
“TUBB4B”,”GTSE1″,”KIF20B”,”HJURP”,”CDCA3″,”CDC20″,”TTK”,
“CDC25C”,”KIF2C”,”RANGAP1″,”NCAPD2″,”DLGAP5″,”CDCA2″,”CDCA8″,
“ECT2″,”KIF23″,”HMMR”,”AURKA”,”PSRC1″,”ANLN”,”LBR”,”CKAP5″,
“CENPE”,”NEK2″,”G2E3″,”CBX5″,”CENPA”]
s_genes = [g for g in s_genes if g in adata.var_names]
g2m_genes = [g for g in g2m_genes if g in adata.var_names]
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
sc.pp.regress_out(adata, [“total_counts”, “pct_counts_mt”])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver=”arpack”)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.tsne(adata, n_pcs=40)
We outline S-phase and G2/M-phase marker genes and retain solely these current within the dataset. We rating every cell for cell-cycle section, regress out undesirable variation from whole counts and mitochondrial proportion, and scale the info for downstream modeling. We then run PCA, examine defined variance, assemble the neighborhood graph, and generate UMAP and t-SNE embeddings.
Copy CodeCopiedsc.tl.leiden(adata, decision=0.5, taste=”igraph”, n_iterations=2, directed=False)
sc.pl.umap(adata, coloration=”leiden”, legend_loc=”on information”, title=”Leiden clusters”)
sc.pl.tsne(adata, coloration=”leiden”, legend_loc=”on information”, title=”t-SNE clusters”)
sc.tl.rank_genes_groups(adata, “leiden”, methodology=”wilcoxon”)
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
outcome = adata.uns[“rank_genes_groups”]
teams = outcome[“names”].dtype.names
top_df = pd.DataFrame({g: outcome[“names”][g][:10] for g in teams})
print(“nTop 10 markers per cluster:n”, top_df)
marker_genes = {
“B-cell”: [“CD79A”, “MS4A1”],
“CD8 T-cell”: [“CD8A”, “CD8B”],
“CD4 T-cell”: [“IL7R”, “CD4”],
“NK”: [“GNLY”, “NKG7”],
“CD14 Monocyte”: [“CD14”, “LYZ”],
“FCGR3A Monocyte”: [“FCGR3A”, “MS4A7”],
“Dendritic”: [“FCER1A”, “CST3”],
“Megakaryocyte”: [“PPBP”],
}
sc.pl.dotplot(adata, marker_genes, groupby=”leiden”, standard_scale=”var”)
sc.pl.stacked_violin(adata, marker_genes, groupby=”leiden”, swap_axes=True)
We apply Leiden clustering to group cells based mostly on the neighborhood graph and visualize the clusters on UMAP and t-SNE plots. We carry out differential expression evaluation utilizing the Wilcoxon check to determine the highest marker genes for every cluster. We then use canonical PBMC marker genes to assist cell-type annotation by way of dot plots and stacked violin plots.
Copy CodeCopiedsc.tl.paga(adata, teams=”leiden”)
sc.pl.paga(adata, coloration=”leiden”, threshold=0.1)
sc.tl.umap(adata, init_pos=”paga”)
sc.pl.umap(adata, coloration=”leiden”, legend_loc=”on information”)
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep=”X_diffmap”)
adata.uns[“iroot”] = np.flatnonzero(adata.obs[“leiden”] == adata.obs[“leiden”].cat.classes[0])[0]
sc.tl.dpt(adata)
sc.pl.umap(adata, coloration=[“leiden”, “dpt_pseudotime”], legend_loc=”on information”)
ifn_genes = [“ISG15”, “IFI6”, “IFIT1”, “IFIT3”, “MX1”, “OAS1”, “STAT1”, “IRF7″]
ifn_genes = [g for g in ifn_genes if g in adata.raw.var_names]
sc.tl.score_genes(adata, gene_list=ifn_genes, score_name=”IFN_score”)
sc.pl.umap(adata, coloration=”IFN_score”, cmap=”viridis”)
adata.write(“pbmc3k_analyzed.h5ad”)
print(“n Evaluation full — saved to pbmc3k_analyzed.h5ad”)
print(adata)
We run PAGA to mannequin connectivity between Leiden clusters and reinitialize UMAP utilizing the PAGA graph to acquire a clearer trajectory construction. We compute diffusion maps and diffusion pseudotime to discover potential development patterns throughout cell states. We additionally calculate an interferon-response gene-set rating, visualize it on UMAP, and save the ultimate analyzed object as an .h5ad file.
In conclusion, we constructed an end-to-end Scanpy pipeline for single-cell RNA-seq evaluation, reworking uncooked PBMC information into interpretable organic insights. We cleaned and preprocessed the dataset, eliminated noisy cells and doublets, chosen informative genes, and generated significant embeddings to visualise mobile construction. We then used Leiden clustering and differential expression evaluation to find marker genes and join clusters to identified immune cell varieties. By including PAGA, diffusion pseudotime, and customized gene-set scoring, we prolonged the workflow past primary clustering and confirmed how Scanpy helps deeper organic interpretation. Eventually, we had a saved .h5ad object that accommodates the processed information, annotations, scores, clusters, and visible evaluation outcomes, prepared for downstream exploration or reporting.
Try the Full Codes with Pocket book right here. Additionally, be at liberty to observe us on Twitter and don’t overlook to affix our 150k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you’ll be able to be a part of us on telegram as effectively.
Must accomplice with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so on.? Join with us
The publish The right way to Construct a Single-Cell RNA-seq Evaluation Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery appeared first on MarkTechPost.

